We propose analyzing the core auto-regressive (AR) generation step, which maps a context sequence \( \mathbf{w}_{<t} \) to a next-token probability distribution \( P_\theta(\cdot \mid \mathbf{w}_{<t}) \), using the theory of Markov Categories (MCs). Specifically, we work in the category \( \cat{Stoch} \), whose objects are standard Borel spaces (such as finite sets \( \VV \), Euclidean spaces \( \RR^d \), and sequence spaces \( \VV^* \)) and whose morphisms are Markov kernels \( k: X \to Y \), representing conditional probability maps.
A Markov kernel \( k: X \times \SigmaAlg{Y} \to [0, 1] \) maps each input \( x \in X \) to a probability measure \( k(x, \cdot) \) on \( Y \). Composition of kernels \( k: X \to Y \) and \( h: Y \to Z \) follows the Chapman-Kolmogorov equation: \( (h \comp k)(x, C) = \int_Y h(y, C) \, k(x, \dd y) \).
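For concreteness, over finite spaces a kernel \( k: X \to Y \) is simply a row-stochastic matrix with entries \( k(x, \{y\}) \), and Chapman-Kolmogorov composition becomes matrix multiplication. The following Python sketch (illustrative only; the `compose` helper is our own, not part of any library) makes this explicit:

```python
# Finite-state sketch: a Markov kernel k: X -> Y is a row-stochastic
# matrix K[x, y] = k(x, {y}); the Chapman-Kolmogorov composition
# (h ∘ k)(x, {z}) = sum_y h(y, {z}) k(x, {y}) is the matrix product K @ H.
import numpy as np

def compose(k: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Compose finite Markov kernels k: X -> Y (|X| x |Y|) and h: Y -> Z (|Y| x |Z|)."""
    assert np.allclose(k.sum(axis=1), 1.0) and np.allclose(h.sum(axis=1), 1.0)
    return k @ h

# Example with |X| = 2, |Y| = 3, |Z| = 2.
k = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.2, 0.7]])
h = np.array([[1.0, 0.0],
              [0.3, 0.7],
              [0.0, 1.0]])
hk = compose(k, h)
assert np.allclose(hk.sum(axis=1), 1.0)  # the composite is again a Markov kernel
```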
The AR generation process is modeled as a composition of kernels in \( \cat{Stoch} \):
\[
k_{\mathrm{gen}, \theta} = \KernelLMHead \comp \KernelBB \comp \KernelEmb : \ContextSeqSpaceMeas \to \VocabSpaceMeas
\]
where:
- \( \KernelEmb: \ContextSeqSpaceMeas \to \ContextRepSpaceMeas \) is the (deterministic) Embedding Layer kernel, mapping the token sequence \( \mathbf{w}_{<t} \in \VV^* \) to an initial representation sequence \( E_{<t} \), possibly including positional encodings: \( \KernelEmb(\mathbf{w}_{<t}, \cdot) = \deltaDirac{\EmbLayer(\mathbf{w}_{<t})} \).
- \( \KernelBB: \ContextRepSpaceMeas \to \RepSpaceMeas \) is the (deterministic) Backbone kernel (e.g., Transformer layers with RoPE), processing \( E_{<t} \) to produce the final hidden state \( h_t \in \RepSpace \cong \RR^{d_{\mathrm{model}}} \): \( \KernelBB(E_{<t}, \cdot) = \deltaDirac{\Backbone(E_{<t})} \).
- \( \KernelLMHead: \RepSpaceMeas \to \VocabSpaceMeas \) is the (stochastic) LM Head kernel, mapping \( h_t \) to the output probability distribution over the vocabulary \( \VV \) via a linear layer and softmax: \( \KernelLMHead(h, \{w\}) = [\softmax(\LMHead(h))]_w \).
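Because \( \KernelEmb \) and \( \KernelBB \) are Dirac kernels, the composite reduces to a deterministic forward pass followed by the stochastic head: \( k_{\mathrm{gen}, \theta}(\mathbf{w}_{<t}, \{w\}) = [\softmax(\LMHead(\Backbone(\EmbLayer(\mathbf{w}_{<t}))))]_w \). A toy Python sketch of this factorization follows; the embedding table, backbone, and LM head are hypothetical stand-ins (mean-pooling replaces the Transformer purely for brevity), not the model analyzed here:

```python
# Toy sketch of k_gen = k_lm_head ∘ k_bb ∘ k_emb. The first two kernels are
# deterministic (Dirac), so we represent each by the point its delta sits on;
# only the LM head returns a genuine distribution over the vocabulary.
import numpy as np

rng = np.random.default_rng(0)
V, d_model = 8, 4                      # toy vocabulary size and hidden width
E = rng.normal(size=(V, d_model))      # embedding table (hypothetical)
W_out = rng.normal(size=(d_model, V))  # LM-head projection (hypothetical)

def k_emb(w_prefix):
    """Deterministic embedding kernel: token ids -> representation sequence E_<t."""
    return E[np.array(w_prefix)]

def k_bb(e_prefix):
    """Deterministic backbone stand-in: sequence -> final hidden state h_t.
    (A Transformer would go here; mean-pooling is a placeholder.)"""
    return e_prefix.mean(axis=0)

def k_lm_head(h):
    """Stochastic LM-head kernel: hidden state -> probability measure on the vocabulary."""
    logits = h @ W_out
    z = np.exp(logits - logits.max())  # numerically stable softmax
    return z / z.sum()

def k_gen(w_prefix):
    """Composite generation kernel k_gen = k_lm_head ∘ k_bb ∘ k_emb."""
    return k_lm_head(k_bb(k_emb(w_prefix)))

p = k_gen([3, 1, 4])
assert np.isclose(p.sum(), 1.0)        # a probability measure on the vocabulary
next_token = rng.choice(V, p=p)        # one AR sampling step
```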
This compositional structure lets us track precisely how information is transformed from context to prediction within a unified probabilistic framework. Moreover, \( \cat{Stoch} \) is equipped with operations for copying (\( \copyop_X: X \to X \tens X \)) and discarding (\( \delop_X: X \to \unit \)) information, giving a rich algebraic structure for reasoning about probabilistic systems.
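In the finite case these structure maps are themselves simple kernels: \( \copyop_X \) sends \( x \) to the point mass at \( (x, x) \), and \( \delop_X \) is the unique kernel into the one-point space. A brief sketch (the flattened index encoding of \( X \tens X \) is our own choice, made for illustration):

```python
# Finite-state sketch of the copy/discard structure of a Markov category.
# Pairs (i, j) in X ⊗ X are flattened to the index i*n + j.
import numpy as np

def copy_kernel(n: int) -> np.ndarray:
    """copy_X: X -> X ⊗ X as an n x n^2 matrix; row x is the point mass at (x, x)."""
    c = np.zeros((n, n * n))
    for x in range(n):
        c[x, x * n + x] = 1.0
    return c

def discard_kernel(n: int) -> np.ndarray:
    """del_X: X -> 1 as an n x 1 matrix of ones (the unique kernel to the unit)."""
    return np.ones((n, 1))

n = 3
assert np.allclose(copy_kernel(n).sum(axis=1), 1.0)
assert np.allclose(discard_kernel(n).sum(axis=1), 1.0)
# Discarding one copy recovers the identity: (id ⊗ del) ∘ copy = id.
marginal = copy_kernel(n) @ np.kron(np.eye(n), discard_kernel(n))
assert np.allclose(marginal, np.eye(n))
```

The final assertion checks one of the comonoid laws that make this algebraic structure useful: copying and then discarding one copy is the identity kernel.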