Language and Thought

Our research investigates the relationship between language and thought.

- Yifan Zhang, graduate student at Tsinghua University

Foundation: Categorial Grammars

A significant approach within formal semantics utilizes type-logical frameworks closely related to category theory. Joachim Lambek's categorial grammars, particularly the Lambek Calculus (LC), provide a mathematically precise way to model the syntax-semantics interface by assigning algebraic types to words and defining rules for their composition.

In LC, syntactic categories (types) start from basic types like \( n \) (noun) and \( s \) (sentence). Complex types are built using binary connectives: \( A / B \) (expects \( B \) to the right to form \( A \)) and \( B \backslash A \) (expects \( B \) to the left to form \( A \)). Grammaticality corresponds to provability in a sequent calculus using rules like Forward Application (\(/E\)) and Backward Application (\(\backslash E\)). For example, "John walks" (\( n, n \backslash s \)) reduces to \( s \): \[ \frac{ n \vdash n \quad s \vdash s }{ n, n \backslash s \vdash s } (\backslash E) \]
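
The application rules can be mechanized directly. Below is a minimal Python sketch (the class and function names are ours and purely illustrative, not part of any standard parser) that encodes basic and slashed types and checks the "John walks" reduction:

```python
from dataclasses import dataclass
from typing import Union

# Syntactic types of the Lambek calculus: basic types and the two slashes.
@dataclass(frozen=True)
class Basic:
    name: str                      # e.g. "n" or "s"

@dataclass(frozen=True)
class Right:                       # A / B : expects a B on the right, yields A
    result: "Type"
    arg: "Type"

@dataclass(frozen=True)
class Left:                        # B \ A : expects a B on the left, yields A
    arg: "Type"
    result: "Type"

Type = Union[Basic, Right, Left]

def forward_app(x: Type, y: Type):
    """Forward application (/E): A/B, B  =>  A."""
    return x.result if isinstance(x, Right) and x.arg == y else None

def backward_app(x: Type, y: Type):
    """Backward application (\\E): B, B\\A  =>  A."""
    return y.result if isinstance(y, Left) and y.arg == x else None

n, s = Basic("n"), Basic("s")
lexicon = {"John": n, "walks": Left(n, s)}               # walks : n \ s

# "John walks": n, n\s  reduces to  s.
print(backward_app(lexicon["John"], lexicon["walks"]))   # Basic(name='s')
```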

This type-driven compositionality provides a rigorous foundation. Our work extends these ideas using the broader framework of Markov Categories to analyze probabilistic models.

A Markov Categorical Framework for Language Modeling

Modeling AR Generation as Composed Kernels

We propose analyzing the core auto-regressive (AR) generation step, mapping a context sequence \( \mathbf{w}_{<t} \) to a next-token probability distribution \( P_\Params(\cdot | \mathbf{w}_{<t}) \), using the theory of Markov Categories (MCs). Specifically, we use the category \( \cat{Stoch} \), whose objects are standard Borel spaces (like finite sets \( \VV \), Euclidean spaces \( \RR^d \), sequence spaces \( \VV^* \)) and whose morphisms are Markov kernels \( k: X \to Y \), representing conditional probability maps.

A Markov kernel \( k: X \times \SigmaAlg{Y} \to [0, 1] \) maps an input \( x \in X \) to a probability measure \( k(x, \cdot) \) on \( Y \). Composition \( h \comp k \) follows the Chapman-Kolmogorov equation: \( (h \comp k)(x, C) = \int_Y h(y, C) \, k(x, \dd y) \).
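
On finite spaces a Markov kernel is just a row-stochastic matrix, and Chapman-Kolmogorov composition is matrix multiplication. A minimal numpy sketch with arbitrary toy matrices:

```python
import numpy as np

# Finite Markov kernels as row-stochastic matrices:
# k[x, y] = k(x, {y}),  h[y, z] = h(y, {z}).
k = np.array([[0.9, 0.1],          # kernel k : X -> Y,  |X| = 2, |Y| = 2
              [0.2, 0.8]])
h = np.array([[0.7, 0.3],          # kernel h : Y -> Z,  |Z| = 2
              [0.1, 0.9]])

# Chapman-Kolmogorov: (h ∘ k)(x, {z}) = Σ_y h(y, {z}) k(x, {y}),
# i.e. ordinary multiplication of the two stochastic matrices.
hk = k @ h
print(hk)                  # composite kernel X -> Z
print(hk.sum(axis=1))      # rows still sum to 1
```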

The AR generation process is modeled as a composition of kernels in \( \cat{Stoch} \): \[ k_{\mathrm{gen}, \theta} = \KernelLMHead \comp \KernelBB \comp \KernelEmb : \ContextSeqSpaceMeas \to \VocabSpaceMeas \] where the component kernels, illustrated by the sketch after this list, are:

  • \( \KernelEmb: \ContextSeqSpaceMeas \to \ContextRepSpaceMeas \) is the (deterministic) Embedding Layer kernel, mapping the token sequence \( \mathbf{w}_{<t} \in \VV^* \) to an initial representation sequence \( E_{<t} \), possibly including positional encodings. \( (\KernelEmb)_{\mathbf{w}_{<t}}(\cdot) = \deltaDirac{\EmbLayer(\mathbf{w}_{<t})} \).
  • \( \KernelBB: \ContextRepSpaceMeas \to \RepSpaceMeas \) is the (deterministic) Backbone kernel (e.g., Transformer layers with RoPE), processing \( E_{<t} \) to produce the final hidden state \( h_t \in \RepSpace \cong \RR^{d_{model}} \). \( (\KernelBB)_{E_{<t}}(\cdot) = \deltaDirac{\Backbone(E_{<t})} \).
  • \( \KernelLMHead: \RepSpaceMeas \to \VocabSpaceMeas \) is the (stochastic) LM Head kernel, mapping \( h_t \) to the output probability distribution over the vocabulary \( \VV \) via a linear layer and softmax: \( \KernelLMHead(h, \{w\}) = [\softmax(\LMHead(h))]_w \).
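
As a concrete (and deliberately toy) illustration of this factorization, the sketch below uses a random embedding table for \( \KernelEmb \), mean-pooling followed by a nonlinearity as a stand-in for \( \KernelBB \), and a linear-plus-softmax \( \KernelLMHead \); all names, shapes, and functions are illustrative assumptions, not an actual Transformer implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_model = 5, 8                     # toy vocabulary size and hidden width

E = rng.normal(size=(V, d_model))     # embedding table (k_emb is deterministic)
W_bb = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
W_out = rng.normal(size=(d_model, V)) / np.sqrt(d_model)

def k_emb(context):                   # k_emb : V* -> (R^d)*   (Dirac kernel)
    return E[np.array(context)]

def k_bb(E_ctx):                      # k_bb : (R^d)* -> R^d    (Dirac kernel)
    # stand-in "backbone": mean-pool the context and apply a nonlinearity
    return np.tanh(E_ctx.mean(axis=0) @ W_bb)

def k_head(h):                        # k_head : R^d -> Dist(V) (stochastic kernel)
    logits = h @ W_out
    p = np.exp(logits - logits.max())
    return p / p.sum()

# k_gen = k_head ∘ k_bb ∘ k_emb : context w_{<t}  ->  P(. | w_{<t})
context = [0, 3, 1]
p_next = k_head(k_bb(k_emb(context)))
print(p_next, p_next.sum())           # a distribution over the vocabulary
```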

This compositional structure allows us to precisely track the transformation of information from context to prediction within a unified probabilistic framework. \( \cat{Stoch} \) is equipped with operations for copying (\( \copyop_X: X \to X \tens X \)) and discarding (\( \delop_X: X \to \unit \)) information, forming a rich algebraic structure for reasoning about probabilistic systems.
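
On finite spaces the copy and discard maps are themselves ordinary kernels; a brief sketch (the set size and the state \( p \) are arbitrary):

```python
import numpy as np

nX = 3                                           # |X| = 3

# copy_X : X -> X ⊗ X  sends x to the Dirac measure on the pair (x, x).
copy_X = np.zeros((nX, nX * nX))
for x in range(nX):
    copy_X[x, x * nX + x] = 1.0                  # pair (x, x) flattened to index x*|X| + x

# del_X : X -> 1  sends every x to the unique point of the unit object.
del_X = np.ones((nX, 1))

p = np.array([0.2, 0.5, 0.3])                    # a state p : 1 -> X
print(p @ copy_X)                                # the "diagonal" joint state on X ⊗ X
print(p @ del_X)                                 # discarding yields the trivial state [1.]
```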

Divergence Enrichment and Information Measures

Quantifying Stochasticity and Dependence

The power of the MC framework is enhanced by enriching \( \cat{Stoch} \) with a statistical divergence \( D: \Prob(X) \times \Prob(X) \to [0, \infty] \), such as Kullback-Leibler (\( \KLDiv \)), Total Variation (\( \TVDist \)), Rényi (\( \RenyiDiv{\alpha} \)), or general \( f \)-divergences (\( \FD \)). Divergences quantify the dissimilarity between probability distributions.

A crucial property is the Data Processing Inequality (DPI): for any kernel \( k: X \to Y \) and probability measures (states) \( p, q \) on \( X \), \[ D_Y(k \comp p \| k \comp q) \le D_X(p \| q) \] Processing data through a channel \( k \) cannot increase the distinguishability between input distributions.
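
A quick numerical check of the DPI for the KL and total variation divergences on finite spaces (the states \( p, q \) and the channel \( k \) are arbitrary toy values):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for finite distributions."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def tv(p, q):
    """Total variation distance."""
    return 0.5 * float(np.abs(p - q).sum())

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.4])

k = np.array([[0.6, 0.4],            # a channel k : X -> Y as a row-stochastic matrix
              [0.5, 0.5],
              [0.1, 0.9]])

# Pushing both states through k cannot increase their distinguishability.
print(kl(p, q), ">=", kl(p @ k, q @ k))
print(tv(p, q), ">=", tv(p @ k, q @ k))
```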

Based on a divergence \( D \) satisfying the DPI, Perrone defined intrinsic categorical information measures:

  • Categorical Entropy of a kernel \( k: X \to Y \), measuring its intrinsic stochasticity: \[ \CategoricalEntropy_D(k) \defeq D_{Y \tens Y} \left( \copyop_Y \comp k \quad \| \quad (k \tens k) \comp \copyop_X \right) \] This compares copying the output \( y \) (state \( \deltaDirac{(y,y)} \)) with generating two independent outputs \( (y_1, y_2) \) from the same input \( x \). It is zero if \( k \) is deterministic.
  • Categorical Mutual Information of a joint state \( p: \unit \to X \tens Y \), measuring statistical dependence: \[ \Info_D(p) \defeq D_{X \tens Y} \left( p \quad \| \quad p_X \tens p_Y \right) \] where \( p_X \) and \( p_Y \) are the marginal states. This measures how far the joint state \( p \) is from the product of its marginals (independence). When \( D = \KLDiv \), this recovers Shannon Mutual Information.

These definitions automatically inherit the DPI: for a Markov chain \( X \to Y \to Z \), \( \Info_D(X; Y) \ge \Info_D(X; Z) \).
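
On finite spaces both quantities reduce to familiar computations. The sketch below takes \( D = \KLDiv \) and, for the kernel case, evaluates the divergence pointwise in the input and averages under a reference state (one common convention, assumed here for concreteness); it recovers the conditional Shannon entropy and Shannon mutual information:

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# --- Categorical entropy of a kernel k : X -> Y, with D = KL ---------------
# Pointwise in x, comparing copy_Y ∘ k with (k ⊗ k) ∘ copy_X compares the
# diagonal joint diag(k(x,.)) with the product k(x,.) ⊗ k(x,.); the KL
# divergence between them equals the Shannon entropy of k(x,.).  Averaging
# under a reference state p_X gives the conditional entropy H(Y | X).
k = np.array([[0.9, 0.1],
              [0.5, 0.5]])
p_X = np.array([0.4, 0.6])

def entropy_of_row(row):
    diag = np.diag(row).ravel()              # copy_Y ∘ k at x : all mass on pairs (y, y)
    prod = np.outer(row, row).ravel()        # (k ⊗ k) ∘ copy_X at x : product measure
    return kl(diag, prod)

cat_entropy = sum(p * entropy_of_row(row) for p, row in zip(p_X, k))
shannon_cond = -sum(p * np.sum(row * np.log(row)) for p, row in zip(p_X, k))
print(cat_entropy, shannon_cond)             # the two values agree

# --- Categorical mutual information of a joint state p on X ⊗ Y ------------
p_XY = p_X[:, None] * k                      # joint state of input and output
p_Y = p_XY.sum(axis=0)
mi = kl(p_XY.ravel(), np.outer(p_X, p_Y).ravel())
print(mi)                                    # Shannon mutual information I(X; Y)
```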

Proposed Analytical Metrics

Probing Information Flow in AR Models

Using the \( (\cat{Stoch}, D) \) framework, we define metrics to analyze the AR generation step \( k_{\mathrm{gen}, \theta} \):

  1. Representation Divergence: Quantifies how well the hidden state \( H_t \) distinguishes context properties \( S \). Given conditional hidden state distributions \( p_{H_t|s} = (\KernelBB \comp \KernelEmb) \comp p_{W_{<t}|s} \), we measure: \[ \text{RepDiv}_D(s_1 \| s_2) \defeq D_{\RepSpace}(p_{H_t|s_1} \| p_{H_t|s_2}) \] High divergence implies \( H_t \) effectively encodes property \( S \). Estimation is challenging in high-dimensional \( \RepSpace \).
  2. Categorical Mutual Information: Measures statistical dependencies:
    • State-Prediction (\( H_t; W_t \)): How much information \( H_t \) carries about the next token \( W_t \). Calculated from the joint state \( p_{H_t, W_t} = (\id_{\RepSpace} \tens \KernelLMHead) \comp \copyop_{\RepSpace} \comp p_{H_t} \): \[ \Info_D(H_t ; W_t) \defeq \Info_D(p_{H_t, W_t}) \]
    • Temporal State (\( H_t; H_{t+1} \)): Coherence between consecutive hidden states, indicating information propagation. Calculated from \( p_{H_t, H_{t+1}} \) using a step kernel \( k_{\text{step}}: H_t \to H_{t+1} \): \[ \Info_D(H_t ; H_{t+1}) \defeq \Info_D(p_{H_t, H_{t+1}}) \]
    Estimation involves high-dimensional MI estimators.
  3. LM Head Categorical Entropy: Assesses the intrinsic uncertainty of the final prediction kernel \( \KernelLMHead \): \[ \CategoricalEntropy_D(\KernelLMHead) = D_{\VV \tens \VV} \left( \copyop_\VV \comp \KernelLMHead \quad \| \quad (\KernelLMHead \tens \KernelLMHead) \comp \copyop_\RepSpace \right) \] High entropy implies high uncertainty about \( W_t \) even given \( H_t \). It can be related to the average conditional Shannon entropy \( \Expect_{h \sim p_{H_t}} [\ShannonEntropy(\KernelLMHead(h, \cdot))] \), and it is comparatively easy to estimate because the output space \( \VV \) is finite (see the sketch after this list).
  4. Information Flow Bounds (DPI): Establishes limits on information transfer. For a context property \( S \), the Markov chain \( S \to H_t \to W_t \) and the DPI yield: \[ \Info_D(S ; H_t) \ge \Info_D(S ; W_t) \] Information about \( S \) in the prediction \( W_t \) cannot exceed that in the representation \( H_t \). The gap quantifies information present in \( H_t \) but not used for the immediate prediction.
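
Of these metrics, the LM Head Categorical Entropy is the most direct to estimate because the output space is finite. The sketch below uses the average-conditional-Shannon-entropy form from item 3 with \( D = \KLDiv \); the randomly initialized head and sampled hidden states are illustrative stand-ins for a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_model, n_samples = 50, 16, 200         # toy sizes

W_out = rng.normal(size=(d_model, V)) / np.sqrt(d_model)
H = rng.normal(size=(n_samples, d_model))   # stand-in for hidden states h_t ~ p_{H_t}
                                            # (in practice these come from the model)

def k_head(h):
    """Stochastic LM head kernel: h -> softmax of a linear map, over the vocabulary."""
    logits = h @ W_out
    logits -= logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

def shannon_entropy(p):
    return -np.sum(p * np.log(p + 1e-12), axis=-1)

# Metric 3 (D = KL): the categorical entropy of k_head, averaged over the
# empirical state p_{H_t}, is the mean conditional Shannon entropy
# E_{h ~ p_{H_t}} [ H(k_head(h, .)) ].
P = k_head(H)                               # (n_samples, V) predictive distributions
print("LM head categorical entropy (avg conditional Shannon entropy):",
      shannon_entropy(P).mean())
```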

These metrics provide principled ways to quantify information compression, representation quality, and predictive uncertainty within AR models.