
LingFrame — Linguistic Atomization Framework

Computational rhetoric across 46 works and 15 languages

The Problem: Rhetoric Without Instruments

For twenty-four centuries, the study of rhetoric has operated with essentially the same analytical toolkit: close reading, taxonomic classification, and argumentative intuition.[1] Aristotle's tripartite framework — ethos, pathos, logos — remains the dominant analytical lens, not because it is complete but because no systematic alternative has emerged that preserves humanistic categories while enabling computational precision. The digital humanities have produced powerful tools for text analysis, but they overwhelmingly target semantic content: what a text means (topic modeling, sentiment analysis, named entity recognition) rather than what a text does — how its syntactic structures produce rhythmic effects, how its figurative language distributes across argumentative architecture, how its rhetorical moves sequence to produce persuasion.[2] Burke's dramatistic framework — his insistence that language is fundamentally a mode of action, not merely representation — provides the theoretical mandate for this project. The Linguistic Atomization Framework treats texts not as containers of meaning but as machines that produce effects, and it decomposes those machines into their smallest operational parts.

graph TD
    RT[Raw Text] --> M[Morpheme Layer]
    M --> W[Word Layer]
    W --> CL[Clause Layer]
    CL --> S[Sentence Layer]
    S --> P[Paragraph Layer]
    P --> RM[Rhetorical Move Layer]
    RM --> AR[Argument Structure]
    AR --> WA[Whole-Work Architecture]
    M -.->|annotations| AM[Analysis Modules]
    W -.->|annotations| AM
    CL -.->|annotations| AM
    S -.->|annotations| AM
    P -.->|annotations| AM
    RM -.->|annotations| AM
    AM --> O[Integrated Output]
    style RT fill:#f9f,stroke:#333
    style AM fill:#9ff,stroke:#333
    style O fill:#ff9,stroke:#333
The atomization pipeline: raw text is hierarchically decomposed from surface form through morphological, syntactic, and rhetorical layers, with each level carrying annotations from applicable analysis modules

Corpus and Scale

The framework operates on a curated corpus of 46 canonical works spanning 15+ languages and 12 distinct literary-rhetorical traditions. This is not a convenience sample — it is a deliberate attempt to test every analytical claim against genuine linguistic diversity.[3] Curtius demonstrated that the rhetorical traditions of Europe form a continuous chain from antiquity through the Middle Ages to modernity, with topoi (commonplaces) serving as the connective tissue. The corpus extends this genealogy beyond Europe: Sanskrit rhetoric (the alamkara tradition), classical Arabic (balagha), Chinese parallel prose (pianwen), and Japanese zuihitsu all have independent theoretical frameworks for analyzing how texts produce effects. By including works from each tradition alongside its theoretical apparatus, the framework can test whether analytical categories developed for Greek oratory generalize to Heian-period Japanese prose — and where they do not, the failures are as informative as the successes.[4] Auerbach's method of anchoring broad historical claims in microscopic textual analysis — his famous comparison of Homer and Genesis in the opening chapter — is the direct methodological ancestor of this framework's approach: every macro-level claim about rhetorical pattern must be grounded in atomized, verifiable textual evidence.

| Tradition | Language(s) | Representative Work | Period |
|---|---|---|---|
| Greek Classical | Ancient Greek | Aristotle, Rhetoric | 4th c. BCE |
| Roman Oratory | Latin | Cicero, De Oratore | 1st c. BCE |
| Sanskrit Poetics | Sanskrit | Bharata, Natyashastra | 2nd c. BCE |
| Arabic Rhetoric | Classical Arabic | Al-Jurjani, Dala'il al-I'jaz | 11th c. CE |
| Medieval European | Latin, Old French | Dante, De Vulgari Eloquentia | 14th c. CE |
| Chinese Parallel Prose | Classical Chinese | Liu Xie, Wenxin Diaolong | 5th c. CE |
| Japanese Zuihitsu | Classical Japanese | Sei Shonagon, The Pillow Book | 11th c. CE |
| Renaissance Humanism | Italian, Latin | Erasmus, De Copia | 16th c. CE |
| Enlightenment | English, French | Blair, Lectures on Rhetoric | 18th c. CE |
| Russian Formalism | Russian | Shklovsky, Art as Device | 20th c. CE |
| Structuralism | French | Genette, Narrative Discourse | 20th c. CE |
| Latin American Boom | Spanish | Borges, Ficciones | 20th c. CE |
Figure 1. Representative corpus diversity — 12 traditions, 15+ languages, spanning from 8th century BCE oral poetry to 20th century experimental fiction

Six Analysis Modules

The framework applies six configurable analysis modules, each operating at the granularity levels it supports within the atomization hierarchy — from morpheme to whole-work architecture. The modules are: Figurative Language (tropes, schemes, and their distribution patterns), Rhythmic Structure (prosodic analysis, clause-length variation, periodic vs. loose sentence construction), Argumentative Topology (enthymeme detection, topos mapping, warrant analysis), Narrative Mechanics (focalization, temporality, voice), Lexical Stratification (register analysis, etymological layering, code-switching patterns), and Pragmatic Force (speech act classification, implicature, illocutionary sequencing).[5] Genette's taxonomy of narrative functions — order, duration, frequency, mood, voice — structures the Narrative Mechanics module, providing a formal vocabulary for phenomena that close readers intuit but rarely formalize. Each module can operate at multiple granularity levels: the Figurative Language module can identify a metaphor within a single clause or trace the distribution of metaphorical clusters across an entire work's argumentative architecture.[6] Jakobson's six functions of language — referential, emotive, conative, phatic, metalingual, poetic — inform the Pragmatic Force module, ensuring that the framework captures not just what a text says but what it does to its reader at each level of structure.

lingframe/core/decompose.py
from dataclasses import dataclass, field
from collections.abc import Sequence
from enum import Enum
from typing import Protocol

class GranularityLevel(Enum):
    MORPHEME = "morpheme"
    WORD = "word"
    CLAUSE = "clause"
    SENTENCE = "sentence"
    PARAGRAPH = "paragraph"
    RHETORICAL_MOVE = "rhetorical_move"
    ARGUMENT = "argument"
    WHOLE_WORK = "whole_work"

class AnalysisModule(Protocol):
    """Each module implements this protocol at every granularity level."""
    name: str
    def analyze(self, unit: "LinguisticUnit") -> "ModuleResult": ...
    def supported_levels(self) -> set[GranularityLevel]: ...

@dataclass
class LinguisticUnit:
    text: str
    level: GranularityLevel
    children: list["LinguisticUnit"] = field(default_factory=list)
    annotations: dict[str, "ModuleResult"] = field(default_factory=dict)
    source_work: str = ""
    position: tuple[int, int] = (0, 0)  # start, end offsets

    def decompose(self, target: GranularityLevel) -> list["LinguisticUnit"]:
        """Recursively decompose to target granularity."""
        if self.level == target:
            return [self]
        return [
            sub_unit
            for child in self.children
            for sub_unit in child.decompose(target)
        ]

    def annotate(self, modules: Sequence[AnalysisModule]) -> None:
        """Apply all applicable modules, then recurse to children."""
        for module in modules:
            if self.level in module.supported_levels():
                self.annotations[module.name] = module.analyze(self)
        for child in self.children:
            child.annotate(modules)
Hierarchical decomposition engine — each text is atomized into nested layers, with analysis modules attached at every granularity level
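A minimal usage sketch of the engine above. The ClauseLength module and the sample sentence are invented for illustration, and the class definitions abbreviate those in decompose.py:

```python
# Minimal usage sketch. ClauseLength is a hypothetical module invented for
# illustration; the dataclasses abbreviate those in decompose.py above.
from dataclasses import dataclass, field
from enum import Enum

class GranularityLevel(Enum):
    CLAUSE = "clause"
    SENTENCE = "sentence"

@dataclass
class ModuleResult:
    summary: str

@dataclass
class LinguisticUnit:
    text: str
    level: GranularityLevel
    children: list["LinguisticUnit"] = field(default_factory=list)
    annotations: dict[str, ModuleResult] = field(default_factory=dict)

    def decompose(self, target: GranularityLevel) -> list["LinguisticUnit"]:
        if self.level == target:
            return [self]
        return [u for c in self.children for u in c.decompose(target)]

    def annotate(self, modules) -> None:
        for m in modules:
            if self.level in m.supported_levels():
                self.annotations[m.name] = m.analyze(self)
        for c in self.children:
            c.annotate(modules)

class ClauseLength:
    """Hypothetical module: annotates each clause with its length in words."""
    name = "clause_length"

    def supported_levels(self) -> set[GranularityLevel]:
        return {GranularityLevel.CLAUSE}

    def analyze(self, unit: LinguisticUnit) -> ModuleResult:
        return ModuleResult(summary=f"{len(unit.text.split())} words")

sentence = LinguisticUnit(
    text="The orator rises, and the crowd falls silent.",
    level=GranularityLevel.SENTENCE,
    children=[
        LinguisticUnit("The orator rises", GranularityLevel.CLAUSE),
        LinguisticUnit("and the crowd falls silent", GranularityLevel.CLAUSE),
    ],
)
sentence.annotate([ClauseLength()])
for clause in sentence.decompose(GranularityLevel.CLAUSE):
    print(clause.text, "->", clause.annotations["clause_length"].summary)
```

Note that the sentence-level unit receives no annotation here, because ClauseLength declares only clause support — this is exactly the `supported_levels` check inside `annotate`.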

Visualization and Output

Raw analysis data is only as useful as its representation. The framework generates two classes of output: interactive HTML visualizations for exploratory analysis and structured data exports (JSON, CSV, TEI-XML) for downstream computational work. The HTML views allow a reader to navigate the atomization hierarchy — clicking a rhetorical move zooms into its constituent sentences, then clauses, then morphemes — with annotations from each analysis module overlaid as color-coded layers. Heat maps show where figurative density clusters; timeline views trace argumentative structure; side-by-side comparisons align parallel passages across translations.[7] Moretti's distant reading methodology — the deliberate refusal to close-read in favor of quantitative pattern detection across large corpora — informs the aggregate views, where individual works dissolve into tradition-level patterns: How does metaphor density in Greek oratory compare to Arabic balagha? Does periodic sentence structure correlate with argumentative complexity across languages? The structured exports feed these questions into statistical analysis, while the interactive views keep the individual text visible and navigable.[8] Manovich's cultural analytics framework — treating cultural artifacts as data while preserving their individuality — provides the design philosophy: every visualization maintains a path from aggregate pattern back to specific textual evidence.
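A sketch of what the JSON branch of the structured export might look like. The Unit stand-in and to_json helper are hypothetical, not the framework's actual exporter; the point is that each node serializes its level, text, annotations, and children, so the hierarchy survives the round trip into downstream tools:

```python
# Hypothetical JSON export sketch: not the framework's actual exporter,
# just the general shape of a hierarchy-preserving serialization.
import json
from dataclasses import dataclass, field

@dataclass
class Unit:  # minimal stand-in for LinguisticUnit
    level: str
    text: str
    annotations: dict[str, str] = field(default_factory=dict)
    children: list["Unit"] = field(default_factory=list)

def to_json(unit: Unit) -> str:
    """Serialize a unit tree to nested JSON, preserving the hierarchy."""
    def as_dict(u: Unit) -> dict:
        return {
            "level": u.level,
            "text": u.text,
            "annotations": u.annotations,
            "children": [as_dict(c) for c in u.children],
        }
    return json.dumps(as_dict(unit), ensure_ascii=False, indent=2)

tree = Unit(
    "sentence", "Ars longa, vita brevis.",
    annotations={"figurative_language": "antithesis"},
    children=[Unit("clause", "Ars longa"), Unit("clause", "vita brevis")],
)
print(to_json(tree))
```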

graph TD
    T[Raw Text Input] --> DE[Decomposition Engine]
    DE --> H[Atomized Hierarchy]
    H --> FL[Figurative Language]
    H --> RS[Rhythmic Structure]
    H --> AT[Argumentative Topology]
    H --> NM[Narrative Mechanics]
    H --> LS[Lexical Stratification]
    H --> PF[Pragmatic Force]
    FL --> INT[Integration Layer]
    RS --> INT
    AT --> INT
    NM --> INT
    LS --> INT
    PF --> INT
    INT --> IV[Interactive HTML Views]
    INT --> SD[Structured Data Export]
    INT --> CR[Cross-Work Comparison]
    IV --> EX[Exploratory Analysis]
    SD --> CA[Computational Analysis]
    CR --> TP[Tradition-Level Patterns]
    style T fill:#f9f,stroke:#333
    style INT fill:#9ff,stroke:#333
    style EX fill:#ff9,stroke:#333
    style CA fill:#ff9,stroke:#333
    style TP fill:#ff9,stroke:#333
Module composition architecture — raw text enters the decomposition engine, then all six analysis modules process the atomized hierarchy in parallel, producing an integrated analysis report with both interactive and structured outputs

Cross-Tradition Analysis

The most theoretically significant capability of the framework is cross-tradition comparison. Because every work is decomposed using the same hierarchical structure and annotated by the same six modules, it becomes possible to ask questions that have never been systematically addressable: Does the distribution of figurative language in Cicero's periodic oratory resemble the parallel structures of Liu Xie's pianwen prose? Are the argumentative topologies of the Natyashastra commensurable with Aristotle's enthymematic reasoning? These are not idle exercises in comparative rhetoric — they test whether the analytical categories we inherited from the Greco-Roman tradition are genuinely universal or culturally specific artifacts.[3] Curtius traced the survival of classical topoi through medieval Latin into the modern European vernaculars, but his method could not extend beyond the Latin-Christian tradition. The Linguistic Atomization Framework operationalizes a version of his genealogical method that crosses civilizational boundaries, using formal decomposition rather than philological intuition to detect structural homologies.[4] Auerbach demonstrated that a single passage, analyzed with sufficient care, can reveal an entire civilization's relationship to reality. The framework preserves this depth while extending its reach: not one passage from one tradition but every passage from twelve.
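Mechanically, aggregate questions of this kind reduce to grouping per-unit annotations by tradition. A toy sketch of such a query — the counts below are invented for illustration, not corpus results:

```python
# Toy cross-tradition aggregation: metaphor density per tradition.
# The counts below are invented for illustration, not corpus results.
from collections import defaultdict

def metaphor_density(records: list[tuple[str, int, int]]) -> dict[str, float]:
    """records: (tradition, clause_count, metaphor_count) triples, one per work."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for tradition, clauses, metaphors in records:
        totals[tradition][0] += clauses
        totals[tradition][1] += metaphors
    return {t: m / c for t, (c, m) in totals.items()}

sample = [
    ("greek_oratory", 120, 18),   # per-work tallies (invented)
    ("greek_oratory", 80, 10),
    ("arabic_balagha", 100, 25),
]
for tradition, density in metaphor_density(sample).items():
    print(f"{tradition}: {density:.3f} metaphors per clause")
```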

| Module | Morpheme | Word | Clause | Sentence | Paragraph | Rhet. Move | Whole Work |
|---|---|---|---|---|---|---|---|
| Figurative Language | | * | * | * | * | * | * |
| Rhythmic Structure | | * | * | * | * | * | * |
| Argumentative Topology | | | * | * | * | * | * |
| Narrative Mechanics | | | * | * | * | * | * |
| Lexical Stratification | * | * | * | * | * | | |
| Pragmatic Force | | | * | * | * | * | * |
Figure 2. Analysis module coverage — each module operates at specific granularity levels, ensuring comprehensive annotation from morpheme to whole-work architecture

Connection to ORGAN-II: From Analysis to Generation

The Linguistic Atomization Framework exists within ORGAN-I (Theoria) — the theoretical foundation of the eight-organ system. Its purpose is not merely scholarly: it directly enables the generative work of ORGAN-II (Poiesis). You cannot generate compelling text — whether procedural poetry, data-driven narrative, or interactive fiction — without understanding how compelling text is constructed at every level of granularity.[9] Reas and Fry's Processing project demonstrated that creative coding requires a deep understanding of the formal principles underlying visual art — color theory, composition, gestalt perception — before generative algorithms can produce aesthetically meaningful output. The Linguistic Atomization Framework provides the equivalent foundation for text: a formal inventory of rhetorical devices, rhythmic patterns, argumentative structures, and narrative mechanics that generative systems can draw upon as compositional primitives.[10] Galanter's definition of generative art — a practice where the artist creates a system that in turn creates the artwork — maps directly onto the relationship between ORGAN-I and ORGAN-II. The atomization framework is the knowledge base; the generative systems are the creative agents that query it. Without atomization, generation is blind pattern-matching. With it, generation becomes informed composition.

graph LR
    subgraph "ORGAN-I: Theoria"
        AF[Atomization Framework] --> RK[Rhetorical Knowledge Base]
        RK --> FI[Formal Inventory]
        FI --> CP[Compositional Primitives]
    end
    subgraph "ORGAN-II: Poiesis"
        CP --> GE[Generative Engines]
        GE --> PP[Procedural Poetry]
        GE --> DN[Data-Driven Narrative]
        GE --> IF[Interactive Fiction]
        GE --> VA[Visual-Audio Synthesis]
    end
    AF -.->|cross-tradition patterns| GE
    PP -.->|feedback| AF
    DN -.->|feedback| AF
The ORGAN-I to ORGAN-II pipeline — atomized rhetorical knowledge feeds generative systems, enabling composition that is structurally informed rather than statistically derived

Testing Rhetoric Computationally

The test suite contains 142 tests organized across three categories: decomposition correctness (does the hierarchical structure preserve source text fidelity?), module agreement (do independent modules produce consistent annotations on shared units?), and cross-tradition validation (do analytical categories produce meaningful results outside their tradition of origin?).[5] Genette's own methodology — rigorously testing narratological categories against texts that should resist them — provides the testing philosophy. The most informative tests are the ones that fail: when the Argumentative Topology module, designed primarily around Aristotelian enthymeme structure, encounters a passage from the Natyashastra that uses a fundamentally different reasoning framework, that failure illuminates the limits of the analytical category rather than a bug in the code.[6] Jakobson's structural method — isolating the poetic function by contrasting it with the other five functions of language — informs the module agreement tests: if the Figurative Language and Pragmatic Force modules both annotate the same clause, their annotations should be complementary (describing different aspects of the same phenomenon) rather than contradictory.
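The first category, decomposition correctness, amounts to a round-trip property: reassembling the atomized leaves must reproduce the source text exactly. A minimal sketch of that property, using a deliberately naive period-based splitter (the real suite's decomposition and fixtures differ):

```python
# Round-trip fidelity sketch for the decomposition-correctness category.
# split_sentences is a deliberately naive splitter, used only to
# illustrate the property under test.
def split_sentences(text: str) -> list[str]:
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def test_roundtrip_fidelity() -> None:
    source = "Rhetoric is the counterpart of dialectic. It is a faculty."
    parts = split_sentences(source)
    # The leaves, rejoined, must reproduce the source exactly.
    assert " ".join(parts) == source

test_roundtrip_fidelity()
```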

By the Numbers

- 46 canonical works
- 15+ languages
- 6 analysis modules
- 142 tests
- 12 literary traditions
- Python 3.10+
Figure 3. LingFrame system metrics — a computational rhetoric platform processing 46 canonical works through six analysis modules across 12 literary traditions

References

  1. Aristotle. Rhetoric. Oxford University Press (trans. Kennedy, 2007), 350 BCE.
  2. Burke, Kenneth. A Rhetoric of Motives. University of California Press, 1950.
  3. Curtius, Ernst Robert. European Literature and the Latin Middle Ages. Princeton University Press (trans. Trask), 1948.
  4. Auerbach, Erich. Mimesis: The Representation of Reality in Western Literature. Princeton University Press (trans. Trask), 1946.
  5. Genette, Gerard. Narrative Discourse: An Essay in Method. Cornell University Press (trans. Lewin), 1972.
  6. Jakobson, Roman. Linguistics and Poetics. MIT Press (in Style in Language, ed. Sebeok), 1960.
  7. Moretti, Franco. Distant Reading. Verso Books, 2013.
  8. Manovich, Lev. Cultural Analytics. MIT Press, 2020.
  9. Reas, Casey and Ben Fry. Processing: A Programming Handbook for Visual Designers. MIT Press, 2007.
  10. Galanter, Philip. What is Generative Art? Complexity Theory as a Context for Art Theory. International Conference on Generative Art, 2003.