Language and Thought

Our research investigates the mechanisms underlying language and thought.

- Yifan Zhang, graduate student at Tsinghua University

Foundation: Categorical Grammars

A significant approach within formal semantics builds on categorial grammars, type-logical frameworks closely connected to category theory. Joachim Lambek's Lambek Calculus (LC) in particular provides a mathematically precise way to model the syntax-semantics interface by assigning algebraic types to words and defining rules for their composition.

In LC, syntactic categories (types) start from basic types like \( n \) (noun) and \( s \) (sentence). Complex types are built using binary connectives: \( A / B \) (expects \( B \) to the right to form \( A \)) and \( B \backslash A \) (expects \( B \) to the left to form \( A \)). Grammaticality corresponds to provability in a sequent calculus using rules like Forward Application (\(/E\)) and Backward Application (\(\backslash E\)). For example, "John walks" (\( n, n \backslash s \)) reduces to \( s \): \[ \frac{ n \vdash n \quad s \vdash s }{ n, n \backslash s \vdash s } (\backslash E) \]
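
To make the type-driven reduction concrete, here is a minimal, hypothetical Python sketch (not part of the formal development; all class and function names are illustrative) that encodes the basic and slash types and checks the "John walks" reduction \( n, n \backslash s \vdash s \) via backward application:

```python
# Minimal, hypothetical sketch of Lambek-style type application.
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class Basic:
    name: str                 # e.g. "n" (noun) or "s" (sentence)

@dataclass(frozen=True)
class Slash:                  # A / B: expects a B on the right to form an A
    result: "Type"
    arg: "Type"

@dataclass(frozen=True)
class Backslash:              # B \ A: expects a B on the left to form an A
    arg: "Type"
    result: "Type"

Type = Union[Basic, Slash, Backslash]

def forward_apply(left: Type, right: Type) -> Optional[Type]:
    """Forward application (/E): A/B, B => A."""
    return left.result if isinstance(left, Slash) and left.arg == right else None

def backward_apply(left: Type, right: Type) -> Optional[Type]:
    """Backward application (\\E): B, B\\A => A."""
    return right.result if isinstance(right, Backslash) and right.arg == left else None

# "John walks": n, n\s reduces to s
n, s = Basic("n"), Basic("s")
john, walks = n, Backslash(arg=n, result=s)
assert backward_apply(john, walks) == s
```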

This type-driven compositionality provides a rigorous foundation. Our work extends these ideas using the broader framework of Markov Categories to analyze probabilistic models.

A Markov Categorical Framework for Language Modeling

Modeling AR Generation as Composed Kernels

We propose analyzing the core auto-regressive (AR) generation step, mapping a context sequence \( \mathbf{w}_{<t} \) to a next-token probability distribution \( P_\Params(\cdot | \mathbf{w}_{<t}) \), using the theory of Markov Categories (MCs). Specifically, we use the category \( \cat{Stoch} \), whose objects are standard Borel spaces (like finite sets \( \VV \), Euclidean spaces \( \RR^d \), sequence spaces \( \VV^* \)) and whose morphisms are Markov kernels \( k: X \to Y \), representing conditional probability maps.

A Markov kernel \( k: X \times \SigmaAlg{Y} \to [0, 1] \) maps an input \( x \in X \) to a probability measure \( k(x, \cdot) \) on \( Y \). Composition \( h \comp k \) follows the Chapman-Kolmogorov equation: \( (h \comp k)(x, C) = \int_Y h(y, C) \, k(x, \dd y) \).
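
As a quick illustration, here is a toy sketch under the simplifying assumption of finite spaces, where kernels are row-stochastic matrices and the Chapman-Kolmogorov integral reduces to a matrix product:

```python
# Toy sketch: finite-space Markov kernels as row-stochastic matrices.
import numpy as np

def compose(k: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Chapman-Kolmogorov: (h ∘ k)(x, C) = sum_y h(y, C) k(x, {y})."""
    assert np.allclose(k.sum(axis=1), 1.0) and np.allclose(h.sum(axis=1), 1.0)
    return k @ h                      # rows of k @ h again sum to 1

# k: X -> Y with |X| = 2, |Y| = 3; h: Y -> Z with |Z| = 2
k = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3]])
h = np.array([[0.9, 0.1],
              [0.4, 0.6],
              [0.2, 0.8]])
hk = compose(k, h)                    # a kernel X -> Z
assert np.allclose(hk.sum(axis=1), 1.0)
```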

The AR generation process is modeled as a composition of kernels in \( \cat{Stoch} \): \[ k_{\mathrm{gen}, \theta} = \KernelLMHead \comp \KernelBB \comp \KernelEmb : \ContextSeqSpaceMeas \to \VocabSpaceMeas \] where the component kernels are as follows (a toy numerical sketch appears after the list):

  • \( \KernelEmb: \ContextSeqSpaceMeas \to \ContextRepSpaceMeas \) is the (deterministic) Embedding Layer kernel, mapping the token sequence \( \mathbf{w}_{<t} \in \VV^* \) to an initial representation sequence \( E_{<t} \), possibly including positional encodings. \( (\KernelEmb)_{\mathbf{w}_{<t}}(\cdot) = \deltaDirac{\EmbLayer(\mathbf{w}_{<t})} \).
  • \( \KernelBB: \ContextRepSpaceMeas \to \RepSpaceMeas \) is the (deterministic) Backbone kernel (e.g., Transformer layers with RoPE), processing \( E_{<t} \) to produce the final hidden state \( h_t \in \RepSpace \cong \RR^{d_{\mathrm{model}}} \). \( (\KernelBB)_{E_{<t}}(\cdot) = \deltaDirac{\Backbone(E_{<t})} \).
  • \( \KernelLMHead: \RepSpaceMeas \to \VocabSpaceMeas \) is the (stochastic) LM Head kernel, mapping \( h_t \) to the output probability distribution over the vocabulary \( \VV \) via a linear layer and softmax: \( \KernelLMHead(h, \{w\}) = [\softmax(\LMHead(h))]_w \).

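The following toy sketch mirrors the composite \( k_{\mathrm{gen}, \theta} = \KernelLMHead \comp \KernelBB \comp \KernelEmb \) as a chain of Python functions. All names, dimensions, and the mean-pooling "backbone" are hypothetical stand-ins, not the actual architecture:

```python
# Toy sketch of the composed generation kernel k_gen = k_head ∘ k_backbone ∘ k_emb.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 16, 8

E = rng.normal(size=(vocab_size, d_model))        # toy embedding table
W_out = rng.normal(size=(d_model, vocab_size))    # toy LM head weights

def k_emb(tokens: list[int]) -> np.ndarray:
    """Deterministic kernel: token ids -> representation sequence (a Dirac measure)."""
    return E[np.array(tokens)]                    # shape (t, d_model)

def k_backbone(reps: np.ndarray) -> np.ndarray:
    """Deterministic kernel: representation sequence -> final hidden state h_t.
    A real backbone would be a Transformer; mean pooling is only a placeholder."""
    return reps.mean(axis=0)                      # shape (d_model,)

def k_head(h: np.ndarray) -> np.ndarray:
    """Stochastic kernel: hidden state -> next-token distribution via softmax."""
    logits = h @ W_out
    z = np.exp(logits - logits.max())
    return z / z.sum()                            # shape (vocab_size,)

def k_gen(tokens: list[int]) -> np.ndarray:
    """Composite kernel V* -> Dist(V): k_head ∘ k_backbone ∘ k_emb."""
    return k_head(k_backbone(k_emb(tokens)))

p = k_gen([3, 1, 4, 1, 5])
assert np.isclose(p.sum(), 1.0)
```
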
This compositional structure allows us to precisely track the transformation of information from context to prediction within a unified probabilistic framework. \( \cat{Stoch} \) is equipped with operations for copying (\( \copyop_X: X \to X \tens X \)) and discarding (\( \delop_X: X \to \unit \)) information; this comonoid structure is exactly what makes it a Markov category and provides a rich algebraic setting for reasoning about probabilistic systems.
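
For intuition, here is a hypothetical finite-space sketch of the copy and discard maps as row-stochastic matrices, checking the comonoid law that copying and then discarding one copy recovers the identity:

```python
# Toy sketch of copy and discard in a finite-space model of Stoch.
import numpy as np

def copy_kernel(n: int) -> np.ndarray:
    """copy_X : X -> X (x) X as an n x n^2 row-stochastic matrix,
    placing all mass on the diagonal pair (x, x)."""
    K = np.zeros((n, n * n))
    for x in range(n):
        K[x, x * n + x] = 1.0
    return K

def discard_kernel(n: int) -> np.ndarray:
    """del_X : X -> 1 as an n x 1 matrix: every input maps to the single point."""
    return np.ones((n, 1))

# Comonoid law: (id_X (x) del_X) ∘ copy_X = id_X
n = 3
marginalize_second = np.kron(np.eye(n), discard_kernel(n))   # X (x) X -> X (x) 1 ≅ X
assert np.allclose(copy_kernel(n) @ marginalize_second, np.eye(n))
```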