My Knowledge Base
Turning AI conversations into durable, interconnected knowledge
The Evaporation Problem
Every meaningful AI conversation generates knowledge that evaporates. You spend ninety minutes with Claude working through a recursive data structure design, and afterward that knowledge exists only inside a vendor-specific chat interface — unsearchable, unstructured, disconnected from every other conversation you have ever had. No export pathway, no atomization, no cross-conversation search, no intelligence extraction. This is not a new problem but an intensified form of an old one.[1] Polanyi observed that we know more than we can tell — tacit knowledge resists codification. AI conversations produce a hybrid form: the dialogue itself is explicit text, but the understanding it generates often remains tacit, trapped in the user's memory and the chat log's inaccessible architecture. Vannevar Bush anticipated this crisis of intellectual record-keeping in 1945, proposing a "memex" — a device for storing, linking, and retrieving the trails of one's intellectual work.[2] My Knowledge Base is the memex for the AI conversation era: a system that ingests from nine sources, decomposes conversations into atomic knowledge units, indexes them across three search modalities, and extracts intelligence that the original conversations only implied.
Multi-Source Export
The first subsystem solves the ingestion problem: AI platforms do not offer robust export APIs, so the system uses browser automation (Playwright) to extract conversations from Claude, ChatGPT, and Gemini, supplemented by direct file system access for local documents, Google Docs API integration, and Apple Notes extraction via AppleScript bridging.[3] Engelbart's foundational vision was not artificial intelligence but intelligence augmentation — tools that amplify human capability by organizing the artifacts of thought. The export engine is precisely this: it recovers intellectual artifacts that would otherwise be locked inside proprietary interfaces. Each source adapter normalizes conversations into a common intermediate representation — a sequence of turns with metadata (timestamp, model, token count, source platform) — so that downstream subsystems operate on a uniform data structure regardless of origin. The system currently supports nine sources, with the adapter interface designed for extensibility: adding a new source requires implementing a single TypeScript interface with three methods (authenticate, listConversations, extractConversation).[4] Nelson's dream of universal, interconnected documents — hypertext as he originally conceived it — motivates the design: knowledge should not be imprisoned in the application that created it.
```typescript
// Core contracts for ingestion and atomization. Each source adapter
// normalizes platform-specific exports into a RawConversation.
interface SourceAdapter {
  readonly sourceId: SourcePlatform;
  authenticate(): Promise<AuthSession>;
  listConversations(since?: Date): Promise<ConversationMeta[]>;
  extractConversation(id: string): Promise<RawConversation>;
}

interface RawConversation {
  id: string;
  source: SourcePlatform;
  title: string;
  turns: ConversationTurn[];
  metadata: {
    createdAt: Date;
    model: string;
    tokenCount: number;
    tags: string[];
  };
}

interface AtomizationResult {
  atoms: KnowledgeAtom[];
  strategy: AtomizationStrategy;
  parentConversation: string;
  confidence: number; // atomizer's confidence that the boundaries are correct
}

type AtomizationStrategy =
  | 'topic-boundary'
  | 'question-answer'
  | 'code-explanation'
  | 'decision-rationale'
  | 'concept-definition';
```

Five-Strategy Atomizer
Raw conversations are not knowledge — they are transcripts of a knowledge-generating process, filled with false starts, clarifications, and tangential exploration. The atomizer decomposes these transcripts into the smallest self-contained knowledge units using five distinct strategies, each tuned for a different conversational pattern.[5] Sowa's framework for knowledge representation emphasizes that the granularity of representation determines what can be retrieved and reasoned about — too coarse and connections are lost, too fine and context evaporates. The atomizer navigates this tension by applying the appropriate strategy based on conversational structure: topic-boundary detection splits at natural subject transitions, question-answer extraction pairs explicit queries with their resolutions, code-explanation pairing links implementation to rationale, decision-rationale extraction captures the reasoning behind choices, and concept-definition isolation identifies when a new term or framework is being established. Each strategy produces atoms at a different granularity level, and the system indexes all of them — a query can match at the concept level, the decision level, or the code level depending on what the user is searching for.[6] Minsky's thesis that intelligence emerges from the interaction of many simple agents maps onto the atomizer architecture: no single strategy captures all knowledge, but their collective output covers the full spectrum of what a conversation contains. The table below summarizes the five strategies; a sketch of the topic-boundary detector follows it.
| Strategy | Trigger Pattern | Granularity | Typical Atom Size | Example Output |
|---|---|---|---|---|
| Topic Boundary | Subject shift detected via embedding distance | Section-level | 200-500 tokens | Complete discussion of a single architectural decision |
| Question-Answer | Explicit question followed by substantive response | Exchange-level | 100-300 tokens | Q: "How does BFS handle cycles?" A: [explanation] |
| Code-Explanation | Code block adjacent to natural language description | Snippet-level | 150-400 tokens | TypeScript function + rationale for design choices |
| Decision-Rationale | Comparison of alternatives with explicit selection | Decision-level | 200-600 tokens | "Chose SQLite over Postgres because..." with tradeoff analysis |
| Concept-Definition | New term introduced with explanation or formal definition | Term-level | 50-200 tokens | "Atomization: decomposing conversations into..." |
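
As an illustration of the first strategy, here is a minimal sketch of topic-boundary detection: split a conversation wherever the embedding similarity between consecutive turns drops below a threshold. The `embed` parameter, the local `Turn` type, and the threshold value are assumptions for this sketch, not the system's actual API.

```typescript
// Minimal sketch of the topic-boundary strategy: start a new candidate atom
// wherever embedding similarity between consecutive turns drops below a
// threshold. `embed` and the threshold are illustrative assumptions.
type Turn = { text: string }; // minimal stand-in for ConversationTurn

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function splitAtTopicBoundaries(
  turns: Turn[],
  embed: (text: string) => Promise<number[]>,
  threshold = 0.75 // below this similarity, treat as a subject shift
): Promise<Turn[][]> {
  if (turns.length === 0) return [];
  const embeddings = await Promise.all(turns.map((t) => embed(t.text)));
  const segments: Turn[][] = [[turns[0]]];
  for (let i = 1; i < turns.length; i++) {
    if (cosineSimilarity(embeddings[i - 1], embeddings[i]) < threshold) {
      segments.push([]); // boundary detected: open a new candidate atom
    }
    segments[segments.length - 1].push(turns[i]);
  }
  return segments;
}
```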
Triple-Modal Search
Search is the system's primary interface — the mechanism by which stored knowledge becomes accessible knowledge. The index operates in three modes simultaneously: SQLite FTS5 for exact keyword matching with BM25 ranking, ChromaDB for semantic vector search using sentence embeddings, and Reciprocal Rank Fusion (RRF) to merge results from both modalities into a single ranked list.[7] Croft's analysis of information retrieval demonstrates that keyword and semantic search fail in complementary ways: keyword search misses synonyms and paraphrases, while vector search can lose precision on specific technical terms. The triple-modal architecture exploits this complementarity. When a user searches for "BFS cycle detection," FTS5 surfaces atoms containing those exact terms while ChromaDB retrieves semantically related atoms about graph traversal, depth-first alternatives, and visited-set implementations. RRF then interleaves these result sets using reciprocal rank weighting — an atom that ranks highly in both modalities scores higher than one that dominates a single modality.[8] Manning's treatment of evaluation metrics — precision, recall, mean reciprocal rank — informed the decision to use RRF specifically: it requires no training data, no weight tuning, and produces stable rankings even when the underlying modalities have different score distributions.
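
To make the flow concrete, here is an illustrative query path, assuming better-sqlite3 for the FTS5 index and the chromadb JS client for vectors (client options simplified). The table and column names (atoms_fts, atom_id) and the getAtomMetadata helper are hypothetical; the SearchResult shape and reciprocalRankFusion function appear in the code block further below.

```typescript
// Illustrative triple-modal query path; table/column names are assumptions.
import Database from 'better-sqlite3';
import { ChromaClient } from 'chromadb';

const db = new Database('knowledge.db');
const chroma = new ChromaClient();

async function tripleModalSearch(query: string): Promise<SearchResult[]> {
  // Modality 1: exact keyword match, ranked by FTS5's built-in BM25 (`rank`).
  const rows = db
    .prepare(
      `SELECT atom_id AS atomId, snippet(atoms_fts, 1, '', '', '…', 12) AS snippet
       FROM atoms_fts WHERE atoms_fts MATCH ? ORDER BY rank LIMIT 20`
    )
    .all(query) as { atomId: string; snippet: string }[];
  const ftsResults: SearchResult[] = rows.map((r) => ({
    atomId: r.atomId,
    score: 0, // RRF uses rank position, not raw BM25 scores
    source: 'fts5' as const,
    snippet: r.snippet,
    metadata: getAtomMetadata(r.atomId),
  }));

  // Modality 2: semantic neighbors from the sentence-embedding index.
  const collection = await chroma.getCollection({ name: 'atoms' });
  const semantic = await collection.query({ queryTexts: [query], nResults: 20 });
  const vectorResults: SearchResult[] = (semantic.ids[0] ?? []).map((id, i) => ({
    atomId: String(id),
    score: 0,
    source: 'chromadb' as const,
    snippet: semantic.documents[0]?.[i] ?? '',
    metadata: getAtomMetadata(String(id)),
  }));

  // Modality 3: merge both ranked lists with Reciprocal Rank Fusion.
  return reciprocalRankFusion(ftsResults, vectorResults);
}
```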
Intelligence Layer
Search retrieves what you already know to look for. The intelligence layer surfaces what you did not know you knew. Three LLM-powered processes run over the indexed atoms: insight extraction identifies patterns and principles that span multiple conversations, smart tagging generates a controlled vocabulary that normalizes terminology across sources, and relationship detection discovers connections between atoms that were never explicitly linked in the original conversations.[9] Berners-Lee's linked data principles — use URIs to name things, use HTTP URIs so people can look them up, provide useful information, include links to other URIs — translate directly into the knowledge graph's architecture: every atom has a stable identifier, every relationship is typed and traversable, and the graph is exportable via vis.js for visual exploration. The system supports three LLM backends interchangeably — Anthropic, OpenAI, and Ollama — meaning the intelligence layer can operate fully locally with zero API keys when privacy or cost constraints require it.[10] Meadows's systems thinking framework reveals why the intelligence layer matters: knowledge is not a stock to be accumulated but a flow to be maintained. The intelligence layer converts static atoms into a dynamic system where new connections emerge as the graph grows, feedback loops between search behavior and tagging quality tighten over time, and the system's utility increases superlinearly with the volume of ingested material.
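
To show how backend interchangeability works in practice, here is a minimal sketch of the provider abstraction; the method names, option shapes, and the local Atom type are illustrative assumptions, not the system's actual interface.

```typescript
// Minimal sketch of the provider abstraction: one interface, three
// interchangeable backends. Names and shapes are illustrative.
interface LLMProvider {
  readonly name: 'anthropic' | 'openai' | 'ollama';
  complete(prompt: string, options?: { maxTokens?: number }): Promise<string>;
}

type Atom = { id: string; text: string }; // minimal stand-in for KnowledgeAtom

// Intelligence processes depend only on the interface, so a hosted backend
// can be swapped for a local Ollama instance with no changes here.
async function extractInsights(
  provider: LLMProvider,
  atoms: Atom[]
): Promise<string[]> {
  const prompt =
    'Identify recurring patterns or principles across these notes:\n' +
    atoms.map((a) => a.text).join('\n---\n');
  const response = await provider.complete(prompt, { maxTokens: 1024 });
  return response.split('\n').filter((line) => line.trim().length > 0);
}
```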
```typescript
// Fused search output and the Reciprocal Rank Fusion merge. findSnippet and
// getAtomMetadata are lookup helpers defined elsewhere in the pipeline.
interface SearchResult {
  atomId: string;
  score: number;
  source: 'fts5' | 'chromadb' | 'fused';
  snippet: string;
  metadata: AtomMetadata;
}

function reciprocalRankFusion(
  ftsResults: SearchResult[],
  vectorResults: SearchResult[],
  k: number = 60 // standard RRF damping constant
): SearchResult[] {
  const scores = new Map<string, number>();
  // Each modality contributes 1 / (k + rank) with 1-indexed ranks; atoms
  // appearing in both lists accumulate both contributions.
  ftsResults.forEach((result, rank) => {
    const current = scores.get(result.atomId) ?? 0;
    scores.set(result.atomId, current + 1 / (k + rank + 1));
  });
  vectorResults.forEach((result, rank) => {
    const current = scores.get(result.atomId) ?? 0;
    scores.set(result.atomId, current + 1 / (k + rank + 1));
  });
  return Array.from(scores.entries())
    .sort(([, a], [, b]) => b - a) // highest fused score first
    .map(([atomId, score]) => ({
      atomId,
      score,
      source: 'fused' as const,
      snippet: findSnippet(atomId, ftsResults, vectorResults),
      metadata: getAtomMetadata(atomId),
    }));
}
```

Knowledge Graph and Downstream Integration
The knowledge graph is the system's connective tissue — a directed graph where nodes are atoms and edges are typed relationships (supports, contradicts, extends, exemplifies, depends-on). BFS traversal from any atom reveals its neighborhood of related knowledge, and the graph is exportable to vis.js for interactive visual exploration in the browser.[9] The graph is not a static index but a living structure that grows with every ingestion cycle. When the intelligence layer detects a new relationship between atoms from different conversations — perhaps a design pattern discussed with Claude in January connects to an implementation strategy explored with Gemini in March — it adds an edge, and the graph's topology shifts. This is where the system transcends individual conversation recovery and becomes genuine epistemological infrastructure: it reveals the structure of your thinking across time, platforms, and contexts. The knowledge base feeds downstream organs in the eight-organ system: extracted patterns seed generative art experiments in ORGAN-II, search architecture decisions inform ORGAN-III product specifications, and intelligence outputs become raw material for ORGAN-V public essays.[10] This cross-organ feeding is not metaphorical — the export pipeline produces structured JSON that other systems consume programmatically, closing the loop between knowledge capture and creative production.
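
A minimal sketch of the BFS neighborhood traversal follows, assuming a flat edge list; the document specifies the five edge types but not the storage layout, so the GraphEdge shape is illustrative.

```typescript
// Minimal sketch of BFS neighborhood traversal over typed, directed edges.
type RelationType =
  | 'supports' | 'contradicts' | 'extends' | 'exemplifies' | 'depends-on';

interface GraphEdge {
  from: string; // atom id
  to: string;
  type: RelationType;
}

function neighborhood(
  start: string,
  edges: GraphEdge[],
  maxDepth: number
): Map<string, number> {
  // Build an adjacency list over the directed edges.
  const adj = new Map<string, string[]>();
  for (const e of edges) {
    const list = adj.get(e.from) ?? [];
    list.push(e.to);
    adj.set(e.from, list);
  }
  const depth = new Map<string, number>([[start, 0]]);
  const queue: string[] = [start];
  while (queue.length > 0) {
    const node = queue.shift()!;
    const d = depth.get(node)!;
    if (d >= maxDepth) continue;
    for (const next of adj.get(node) ?? []) {
      if (!depth.has(next)) {
        depth.set(next, d + 1);
        queue.push(next);
      }
    }
  }
  return depth; // atom id -> hop distance from the start atom
}
```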
Architecture and Testing
The system is implemented in TypeScript with strict mode enabled, backed by SQLite for structured storage and full-text search, ChromaDB for vector embeddings, and a knowledge graph with BFS traversal and vis.js export. The test suite exceeds 200 tests covering the full pipeline: source adapter mocking, atomization strategy correctness, search ranking quality (measured by mean reciprocal rank against hand-labeled relevance judgments), intelligence layer output validation, and graph traversal properties.[7] Croft's emphasis on evaluation-driven development — measuring retrieval quality against ground truth before and after each system change — shaped the testing methodology; a sketch of the MRR measurement follows the table below. The architecture supports three LLM backends (Anthropic, OpenAI, Ollama) interchangeably through a provider abstraction layer, which means the entire system can operate fully locally with Ollama, requiring zero API keys and sending no data to external services. This is not merely a deployment convenience but an architectural commitment: knowledge infrastructure should not depend on the continued availability or pricing of any single vendor.[8]
| Layer | Technology | Responsibility | Test Coverage |
|---|---|---|---|
| Export | Playwright + platform APIs | Ingest from 9 sources into common format | Adapter contract tests |
| Atomization | TypeScript + LLM-assisted splitting | Decompose conversations into atomic units | Strategy correctness + boundary detection |
| Search | SQLite FTS5 + ChromaDB + RRF | Triple-modal retrieval with rank fusion | MRR against labeled relevance judgments |
| Intelligence | Anthropic / OpenAI / Ollama | Insight extraction, tagging, relationship detection | Output schema validation + consistency checks |
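
Referenced above: a minimal sketch of the MRR measurement, assuming a simple judgment format and the SearchResult shape from the earlier code block; the actual test harness is not shown here.

```typescript
// Minimal sketch of the MRR check: for each labeled query, find the rank of
// the first relevant atom in the fused results. Judgment format is an
// assumption for this sketch.
interface RelevanceJudgment {
  query: string;
  relevantAtomIds: Set<string>;
}

function meanReciprocalRank(
  judgments: RelevanceJudgment[],
  search: (query: string) => SearchResult[]
): number {
  let total = 0;
  for (const j of judgments) {
    const results = search(j.query);
    const rank = results.findIndex((r) => j.relevantAtomIds.has(r.atomId));
    total += rank === -1 ? 0 : 1 / (rank + 1); // reciprocal rank, 0 if no hit
  }
  return total / judgments.length;
}
```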
Tradeoffs and Design Decisions
The system makes several deliberate tradeoffs. Browser automation for export is fragile — platform UI changes can break extraction — but it is the only path when APIs do not exist, and the adapter interface isolates this fragility from the rest of the system. The five atomization strategies were chosen empirically by analyzing 500 conversations and identifying the most common structural patterns; a sixth strategy (narrative-arc detection for long-form discussions) was prototyped but deferred because its precision was insufficient to justify the complexity.[5] Sowa's principle that representation fidelity must be balanced against computational tractability guided this decision. The choice to implement RRF rather than a learned fusion model reflects the system's scale: with hundreds to low thousands of atoms, the training data for a supervised ranker would be insufficient, and RRF's parameter-free design eliminates a tuning burden that would not pay dividends at this volume.[3] Engelbart's framework reminds us that augmentation tools must reduce cognitive overhead, not merely shift it — a system that requires constant tuning to maintain quality fails this criterion. The knowledge base is designed to improve passively as it ingests more material, without demanding ongoing maintenance from its operator.
References
- Polanyi, Michael. The Tacit Dimension. University of Chicago Press, 1966.
- Bush, Vannevar. As We May Think. The Atlantic Monthly, 1945.
- Engelbart, Douglas C. Augmenting Human Intellect: A Conceptual Framework. Stanford Research Institute, 1962.
- Nelson, Ted. Computer Lib/Dream Machines. Self-published, 1974.
- Sowa, John F. Knowledge Representation: Logical, Philosophical, and Computational Foundations. Brooks/Cole, 2000.
- Minsky, Marvin. The Society of Mind. Simon & Schuster, 1986.
- Croft, W. Bruce, Donald Metzler, and Trevor Strohman. Search Engines: Information Retrieval in Practice. Addison-Wesley, 2010.
- Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
- Berners-Lee, Tim. Linked Data. W3C Design Issues, 2006.
- Meadows, Donella H. Thinking in Systems: A Primer. Chelsea Green Publishing, 2008.