My Knowledge Base
Turning AI conversations into durable, interconnected knowledge
The Evaporation Problem
Every meaningful AI conversation generates knowledge that evaporates. You spend ninety minutes with Claude working through a recursive data structure design, and afterward that knowledge exists only inside a vendor-specific chat interface — unsearchable, unstructured, disconnected from every other conversation you have ever had. No export pathway, no atomization, no cross-conversation search, no intelligence extraction. This is not a new problem but an intensified form of an old one.[1] Polanyi observed that we know more than we can tell — tacit knowledge resists codification. AI conversations produce a hybrid form: the dialogue itself is explicit text, but the understanding it generates often remains tacit, trapped in the user's memory and the chat log's inaccessible architecture. Vannevar Bush anticipated this crisis of intellectual record-keeping in 1945, proposing a "memex" — a device for storing, linking, and retrieving the trails of one's intellectual work.[2] My Knowledge Base is the memex for the AI conversation era: a system that ingests from nine sources, decomposes conversations into atomic knowledge units, indexes them across three search modalities, and extracts intelligence that the original conversations only implied.
Multi-Source Export
The first subsystem solves the ingestion problem: AI platforms do not offer robust export APIs, so the system uses browser automation (Playwright) to extract conversations from Claude, ChatGPT, and Gemini, supplemented by direct file system access for local documents, Google Docs API integration, and Apple Notes extraction via AppleScript bridging.[3] Engelbart's foundational vision was not artificial intelligence but intelligence augmentation — tools that amplify human capability by organizing the artifacts of thought. The export engine is precisely this: it recovers intellectual artifacts that would otherwise be locked inside proprietary interfaces. Each source adapter normalizes conversations into a common intermediate representation — a sequence of turns with metadata (timestamp, model, token count, source platform) — so that downstream subsystems operate on a uniform data structure regardless of origin. The system currently supports nine sources, with the adapter interface designed for extensibility: adding a new source requires implementing a single TypeScript interface with three methods (authenticate, listConversations, extractConversation).[4] Nelson's dream of universal, interconnected documents — hypertext as he originally conceived it — motivates the design: knowledge should not be imprisoned in the application that created it.
```typescript
// Core contracts for ingestion and atomization. Each source adapter
// normalizes platform-specific exports into a RawConversation.
interface SourceAdapter {
  readonly sourceId: SourcePlatform;
  authenticate(): Promise<AuthSession>;
  listConversations(since?: Date): Promise<ConversationMeta[]>;
  extractConversation(id: string): Promise<RawConversation>;
}

interface RawConversation {
  id: string;
  source: SourcePlatform;
  title: string;
  turns: ConversationTurn[];
  metadata: {
    createdAt: Date;
    model: string;
    tokenCount: number;
    tags: string[];
  };
}

interface AtomizationResult {
  atoms: KnowledgeAtom[];
  strategy: AtomizationStrategy;
  parentConversation: string;
  confidence: number; // atomizer's confidence that the boundaries are correct
}

type AtomizationStrategy =
  | 'topic-boundary'
  | 'question-answer'
  | 'code-explanation'
  | 'decision-rationale'
  | 'concept-definition';
```

Five-Strategy Atomizer
Raw conversations are not knowledge — they are transcripts of a knowledge-generating process, filled with false starts, clarifications, and tangential exploration. The atomizer decomposes these transcripts into the smallest self-contained knowledge units using five distinct strategies, each tuned for a different conversational pattern.[5] Sowa's framework for knowledge representation emphasizes that the granularity of representation determines what can be retrieved and reasoned about — too coarse and connections are lost, too fine and context evaporates. The atomizer navigates this tension by applying the appropriate strategy based on conversational structure: topic-boundary detection splits at natural subject transitions, question-answer extraction pairs explicit queries with their resolutions, code-explanation pairing links implementation to rationale, decision-rationale extraction captures the reasoning behind choices, and concept-definition isolation identifies when a new term or framework is being established. Each strategy produces atoms at a different granularity level, and the system indexes all of them — a query can match at the concept level, the decision level, or the code level depending on what the user is searching for.[6] Minsky's thesis that intelligence emerges from the interaction of many simple agents maps onto the atomizer architecture: no single strategy captures all knowledge, but their collective output covers the full spectrum of what a conversation contains. The table below summarizes the five strategies; a sketch of the topic-boundary detector follows it.
| Strategy | Trigger Pattern | Granularity | Typical Atom Size | Example Output |
|---|---|---|---|---|
| Topic Boundary | Subject shift detected via embedding distance | Section-level | 200-500 tokens | Complete discussion of a single architectural decision |
| Question-Answer | Explicit question followed by substantive response | Exchange-level | 100-300 tokens | Q: "How does BFS handle cycles?" A: [explanation] |
| Code-Explanation | Code block adjacent to natural language description | Snippet-level | 150-400 tokens | TypeScript function + rationale for design choices |
| Decision-Rationale | Comparison of alternatives with explicit selection | Decision-level | 200-600 tokens | "Chose SQLite over Postgres because..." with tradeoff analysis |
| Concept-Definition | New term introduced with explanation or formal definition | Term-level | 50-200 tokens | "Atomization: decomposing conversations into..." |
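
As an illustration of the first strategy, here is a minimal sketch of topic-boundary detection: split a conversation wherever the embedding similarity between consecutive turns drops below a threshold. The `embed` parameter, the local `Turn` type, and the threshold value are assumptions for this sketch, not the system's actual API.

```typescript
// Minimal sketch of the topic-boundary strategy: start a new candidate atom
// wherever embedding similarity between consecutive turns drops below a
// threshold. `embed` and the threshold are illustrative assumptions.
type Turn = { text: string }; // minimal stand-in for ConversationTurn

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function splitAtTopicBoundaries(
  turns: Turn[],
  embed: (text: string) => Promise<number[]>,
  threshold = 0.75 // below this similarity, treat as a subject shift
): Promise<Turn[][]> {
  if (turns.length === 0) return [];
  const embeddings = await Promise.all(turns.map((t) => embed(t.text)));
  const segments: Turn[][] = [[turns[0]]];
  for (let i = 1; i < turns.length; i++) {
    if (cosineSimilarity(embeddings[i - 1], embeddings[i]) < threshold) {
      segments.push([]); // boundary detected: open a new candidate atom
    }
    segments[segments.length - 1].push(turns[i]);
  }
  return segments;
}
```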
Triple-Modal Search
Search is the system's primary interface — the mechanism by which stored knowledge becomes accessible knowledge. The index operates in three modes simultaneously: SQLite FTS5 for exact keyword matching with BM25 ranking, ChromaDB for semantic vector search using sentence embeddings, and Reciprocal Rank Fusion (RRF) to merge results from both modalities into a single ranked list.[7] Croft's analysis of information retrieval demonstrates that keyword and semantic search fail in complementary ways: keyword search misses synonyms and paraphrases, while vector search can lose precision on specific technical terms. The triple-modal architecture exploits this complementarity. When a user searches for "BFS cycle detection," FTS5 surfaces atoms containing those exact terms while ChromaDB retrieves semantically related atoms about graph traversal, depth-first alternatives, and visited-set implementations. RRF then interleaves these result sets using reciprocal rank weighting — an atom that ranks highly in both modalities scores higher than one that dominates a single modality.[8] Manning's treatment of evaluation metrics — precision, recall, mean reciprocal rank — informed the decision to use RRF specifically: it requires no training data, no weight tuning, and produces stable rankings even when the underlying modalities have different score distributions.
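
To make the flow concrete, here is an illustrative query path, assuming better-sqlite3 for the FTS5 index and the chromadb JS client for vectors (client options simplified). The table and column names (atoms_fts, atom_id) and the getAtomMetadata helper are hypothetical; the SearchResult shape and reciprocalRankFusion function appear in the code block further below.

```typescript
// Illustrative triple-modal query path; table/column names are assumptions.
import Database from 'better-sqlite3';
import { ChromaClient } from 'chromadb';

const db = new Database('knowledge.db');
const chroma = new ChromaClient();

async function tripleModalSearch(query: string): Promise<SearchResult[]> {
  // Modality 1: exact keyword match, ranked by FTS5's built-in BM25 (`rank`).
  const rows = db
    .prepare(
      `SELECT atom_id AS atomId, snippet(atoms_fts, 1, '', '', '…', 12) AS snippet
       FROM atoms_fts WHERE atoms_fts MATCH ? ORDER BY rank LIMIT 20`
    )
    .all(query) as { atomId: string; snippet: string }[];
  const ftsResults: SearchResult[] = rows.map((r) => ({
    atomId: r.atomId,
    score: 0, // RRF uses rank position, not raw BM25 scores
    source: 'fts5' as const,
    snippet: r.snippet,
    metadata: getAtomMetadata(r.atomId),
  }));

  // Modality 2: semantic neighbors from the sentence-embedding index.
  const collection = await chroma.getCollection({ name: 'atoms' });
  const semantic = await collection.query({ queryTexts: [query], nResults: 20 });
  const vectorResults: SearchResult[] = (semantic.ids[0] ?? []).map((id, i) => ({
    atomId: String(id),
    score: 0,
    source: 'chromadb' as const,
    snippet: semantic.documents[0]?.[i] ?? '',
    metadata: getAtomMetadata(String(id)),
  }));

  // Modality 3: merge both ranked lists with Reciprocal Rank Fusion.
  return reciprocalRankFusion(ftsResults, vectorResults);
}
```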
Intelligence Layer
Search retrieves what you already know to look for. The intelligence layer surfaces what you did not know you knew. Three LLM-powered processes run over the indexed atoms: insight extraction identifies patterns and principles that span multiple conversations, smart tagging generates a controlled vocabulary that normalizes terminology across sources, and relationship detection discovers connections between atoms that were never explicitly linked in the original conversations.[9] Berners-Lee's linked data principles — use URIs to name things, use HTTP URIs so people can look them up, provide useful information, include links to other URIs — translate directly into the knowledge graph's architecture: every atom has a stable identifier, every relationship is typed and traversable, and the graph is exportable via vis.js for visual exploration. The system supports three LLM backends interchangeably — Anthropic, OpenAI, and Ollama — meaning the intelligence layer can operate fully locally with zero API keys when privacy or cost constraints require it.[10] Meadows's systems thinking framework reveals why the intelligence layer matters: knowledge is not a stock to be accumulated but a flow to be maintained. The intelligence layer converts static atoms into a dynamic system where new connections emerge as the graph grows, feedback loops between search behavior and tagging quality tighten over time, and the system's utility increases superlinearly with the volume of ingested material.
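
To show how backend interchangeability works in practice, here is a minimal sketch of the provider abstraction; the method names, option shapes, and the local Atom type are illustrative assumptions, not the system's actual interface.

```typescript
// Minimal sketch of the provider abstraction: one interface, three
// interchangeable backends. Names and shapes are illustrative.
interface LLMProvider {
  readonly name: 'anthropic' | 'openai' | 'ollama';
  complete(prompt: string, options?: { maxTokens?: number }): Promise<string>;
}

type Atom = { id: string; text: string }; // minimal stand-in for KnowledgeAtom

// Intelligence processes depend only on the interface, so a hosted backend
// can be swapped for a local Ollama instance with no changes here.
async function extractInsights(
  provider: LLMProvider,
  atoms: Atom[]
): Promise<string[]> {
  const prompt =
    'Identify recurring patterns or principles across these notes:\n' +
    atoms.map((a) => a.text).join('\n---\n');
  const response = await provider.complete(prompt, { maxTokens: 1024 });
  return response.split('\n').filter((line) => line.trim().length > 0);
}
```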
```typescript
// Fused search output and the Reciprocal Rank Fusion merge. findSnippet and
// getAtomMetadata are lookup helpers defined elsewhere in the pipeline.
interface SearchResult {
  atomId: string;
  score: number;
  source: 'fts5' | 'chromadb' | 'fused';
  snippet: string;
  metadata: AtomMetadata;
}

function reciprocalRankFusion(
  ftsResults: SearchResult[],
  vectorResults: SearchResult[],
  k: number = 60 // standard RRF damping constant
): SearchResult[] {
  const scores = new Map<string, number>();
  // Each modality contributes 1 / (k + rank) with 1-indexed ranks; atoms
  // appearing in both lists accumulate both contributions.
  ftsResults.forEach((result, rank) => {
    const current = scores.get(result.atomId) ?? 0;
    scores.set(result.atomId, current + 1 / (k + rank + 1));
  });
  vectorResults.forEach((result, rank) => {
    const current = scores.get(result.atomId) ?? 0;
    scores.set(result.atomId, current + 1 / (k + rank + 1));
  });
  return Array.from(scores.entries())
    .sort(([, a], [, b]) => b - a) // highest fused score first
    .map(([atomId, score]) => ({
      atomId,
      score,
      source: 'fused' as const,
      snippet: findSnippet(atomId, ftsResults, vectorResults),
      metadata: getAtomMetadata(atomId),
    }));
}
```

Knowledge Graph and Downstream Integration
The knowledge graph is the system's connective tissue — a directed graph where nodes are atoms and edges are typed relationships (supports, contradicts, extends, exemplifies, depends-on). BFS traversal from any atom reveals its neighborhood of related knowledge, and the graph is exportable to vis.js for interactive visual exploration in the browser.[9] The graph is not a static index but a living structure that grows with every ingestion cycle. When the intelligence layer detects a new relationship between atoms from different conversations — perhaps a design pattern discussed with Claude in January connects to an implementation strategy explored with Gemini in March — it adds an edge, and the graph's topology shifts. This is where the system transcends individual conversation recovery and becomes genuine epistemological infrastructure: it reveals the structure of your thinking across time, platforms, and contexts. The knowledge base feeds downstream organs in the eight-organ system: extracted patterns seed generative art experiments in ORGAN-II, search architecture decisions inform ORGAN-III product specifications, and intelligence outputs become raw material for ORGAN-V public essays.[10] This cross-organ feeding is not metaphorical — the export pipeline produces structured JSON that other systems consume programmatically, closing the loop between knowledge capture and creative production.
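
A minimal sketch of the BFS neighborhood traversal follows, assuming a flat edge list; the document specifies the five edge types but not the storage layout, so the GraphEdge shape is illustrative.

```typescript
// Minimal sketch of BFS neighborhood traversal over typed, directed edges.
type RelationType =
  | 'supports' | 'contradicts' | 'extends' | 'exemplifies' | 'depends-on';

interface GraphEdge {
  from: string; // atom id
  to: string;
  type: RelationType;
}

function neighborhood(
  start: string,
  edges: GraphEdge[],
  maxDepth: number
): Map<string, number> {
  // Build an adjacency list over the directed edges.
  const adj = new Map<string, string[]>();
  for (const e of edges) {
    const list = adj.get(e.from) ?? [];
    list.push(e.to);
    adj.set(e.from, list);
  }
  const depth = new Map<string, number>([[start, 0]]);
  const queue: string[] = [start];
  while (queue.length > 0) {
    const node = queue.shift()!;
    const d = depth.get(node)!;
    if (d >= maxDepth) continue;
    for (const next of adj.get(node) ?? []) {
      if (!depth.has(next)) {
        depth.set(next, d + 1);
        queue.push(next);
      }
    }
  }
  return depth; // atom id -> hop distance from the start atom
}
```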
Architecture and Testing
The system is implemented in TypeScript with strict mode enabled, backed by SQLite for structured storage and full-text search, ChromaDB for vector embeddings, and a knowledge graph with BFS traversal and vis.js export. The test suite exceeds 200 tests covering the full pipeline: source adapter mocking, atomization strategy correctness, search ranking quality (measured by mean reciprocal rank against hand-labeled relevance judgments), intelligence layer output validation, and graph traversal properties.[7] Croft's emphasis on evaluation-driven development — measuring retrieval quality against ground truth before and after each system change — shaped the testing methodology; a sketch of the MRR measurement follows the table below. The architecture supports three LLM backends (Anthropic, OpenAI, Ollama) interchangeably through a provider abstraction layer, which means the entire system can operate fully locally with Ollama, requiring zero API keys and sending no data to external services. This is not merely a deployment convenience but an architectural commitment: knowledge infrastructure should not depend on the continued availability or pricing of any single vendor.[8]
| Layer | Technology | Responsibility | Test Coverage |
|---|---|---|---|
| Export | Playwright + platform APIs | Ingest from 9 sources into common format | Adapter contract tests |
| Atomization | TypeScript + LLM-assisted splitting | Decompose conversations into atomic units | Strategy correctness + boundary detection |
| Search | SQLite FTS5 + ChromaDB + RRF | Triple-modal retrieval with rank fusion | MRR against labeled relevance judgments |
| Intelligence | Anthropic / OpenAI / Ollama | Insight extraction, tagging, relationship detection | Output schema validation + consistency checks |
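
Referenced above: a minimal sketch of the MRR measurement, assuming a simple judgment format and the SearchResult shape from the earlier code block; the actual test harness is not shown here.

```typescript
// Minimal sketch of the MRR check: for each labeled query, find the rank of
// the first relevant atom in the fused results. Judgment format is an
// assumption for this sketch.
interface RelevanceJudgment {
  query: string;
  relevantAtomIds: Set<string>;
}

function meanReciprocalRank(
  judgments: RelevanceJudgment[],
  search: (query: string) => SearchResult[]
): number {
  let total = 0;
  for (const j of judgments) {
    const results = search(j.query);
    const rank = results.findIndex((r) => j.relevantAtomIds.has(r.atomId));
    total += rank === -1 ? 0 : 1 / (rank + 1); // reciprocal rank, 0 if no hit
  }
  return total / judgments.length;
}
```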
Tradeoffs and Design Decisions
The system makes several deliberate tradeoffs. Browser automation for export is fragile — platform UI changes can break extraction — but it is the only path when APIs do not exist, and the adapter interface isolates this fragility from the rest of the system. The five atomization strategies were chosen empirically by analyzing 500 conversations and identifying the most common structural patterns; a sixth strategy (narrative-arc detection for long-form discussions) was prototyped but deferred because its precision was insufficient to justify the complexity.[5] Sowa's principle that representation fidelity must be balanced against computational tractability guided this decision. The choice to implement RRF rather than a learned fusion model reflects the system's scale: with hundreds to low thousands of atoms, the training data for a supervised ranker would be insufficient, and RRF's parameter-free design eliminates a tuning burden that would not pay dividends at this volume.[3] Engelbart's framework reminds us that augmentation tools must reduce cognitive overhead, not merely shift it — a system that requires constant tuning to maintain quality fails this criterion. The knowledge base is designed to improve passively as it ingests more material, without demanding ongoing maintenance from its operator.
References
- Polanyi, Michael. The Tacit Dimension. University of Chicago Press, 1966.
- Bush, Vannevar. As We May Think. The Atlantic Monthly, 1945.
- Engelbart, Douglas C. Augmenting Human Intellect: A Conceptual Framework. Stanford Research Institute, 1962.
- Nelson, Ted. Computer Lib/Dream Machines. Self-published, 1974.
- Sowa, John F. Knowledge Representation: Logical, Philosophical, and Computational Foundations. Brooks/Cole, 2000.
- Minsky, Marvin. The Society of Mind. Simon & Schuster, 1986.
- Croft, W. Bruce, Donald Metzler, and Trevor Strohman. Search Engines: Information Retrieval in Practice. Addison-Wesley, 2010.
- Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
- Berners-Lee, Tim. Linked Data. W3C Design Issues, 2006.
- Meadows, Donella H. Thinking in Systems: A Primer. Chelsea Green Publishing, 2008.