{"id":1731,"date":"2025-09-02T15:25:23","date_gmt":"2025-09-02T06:25:23","guid":{"rendered":"https:\/\/www.aicritique.org\/us\/?p=1731"},"modified":"2025-09-02T15:33:22","modified_gmt":"2025-09-02T06:33:22","slug":"grokking-in-large-language-models-concepts-models-and-applications","status":"publish","type":"post","link":"https:\/\/www.aicritique.org\/us\/2025\/09\/02\/grokking-in-large-language-models-concepts-models-and-applications\/","title":{"rendered":"Grokking in Large Language Models: Concepts, Models, and Applications"},"content":{"rendered":"\n<h1 class=\"wp-block-heading\">Basic Concepts and Historical Background<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Definition of Grokking:<\/strong> <em>Grokking<\/em> refers to a surprising phenomenon of <strong>delayed generalization<\/strong> in neural network training. A model will <strong>perfectly fit the training data (near-100% training accuracy)<\/strong> yet remain at chance-level on the test set for an extended period. Then, <strong>after many more training iterations, the test performance suddenly jumps to near-perfect<\/strong> (sometimes almost overnight) even though training loss had long converged<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=Figure%201%3A%20Left,validation%20accuracy%20increases\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a><a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>. This was first clearly documented by Power et al. (2022) on small algorithmic tasks (e.g. modular arithmetic). 
They showed that with small synthetic datasets (like learning modular addition or multiplication tables), neural networks can <strong>memorize the training set quickly but only \u201cgrok\u201d the underlying pattern much later<\/strong>, exhibiting an abrupt transition from overfitting to generalization<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>. Crucially, this jump occurs <em>well past the point of overfitting<\/em> \u2013 e.g. thousands of optimization steps after training accuracy is 100%<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=Figure%201%3A%20Left,validation%20accuracy%20increases\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a><a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=We%20show%20that%2C%20long%20after,is%20shown%20in%20Figure%201\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>. Grokking is thus characterized by <strong>extreme training time disparity<\/strong>: training loss reaches near-zero early, but validation loss\/accuracy only improves dramatically after a long plateau<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=Figure%201%3A%20Left,validation%20accuracy%20increases\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a><a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=We%20show%20that%2C%20long%20after,is%20shown%20in%20Figure%201\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Initial Discovery (Power et al., 2022):<\/strong> In their seminal work, Power et al. trained small transformers on binary operation tables (e.g. modular addition or division). 
They observed that for sufficiently small datasets, <strong>validation accuracy remained at random chance even after training accuracy was perfect, then spiked to 100% after extended training<\/strong><a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=Figure%201%3A%20Left,validation%20accuracy%20increases\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a><a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>. They coined this late generalization <em>\u201cgrokking.\u201d<\/em> They also found that <strong>smaller datasets require dramatically more training steps to grok<\/strong> (sometimes a 1% decrease in data required ~50% more steps to generalize)<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=dataset%20size%20is%20decreased\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>. Notably, <strong>regularization (especially weight decay)<\/strong> was found to <em>accelerate grokking<\/em><a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=We%20compare%20various%20optimization%20details,on%20the%20tasks%20we%20study\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a> \u2013 with weight decay, the time to generalize was reduced<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=We%20compare%20various%20optimization%20details,on%20the%20tasks%20we%20study\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>. Power et al.\u2019s results established grokking as a real phenomenon and suggested that these toy tasks are a <strong>\u201cfertile ground\u201d for studying generalization beyond memorization<\/strong><a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=in%20great%20detail,of%20the%20finite%20training%20dataset\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>. 
The discovery raised fundamental questions: Why can a network eventually generalize perfectly <em>without new data<\/em> or changes in training loss? What evolves internally during the long plateau? These questions spurred numerous follow-up studies.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Key Characteristics:<\/strong> To summarize, the hallmark features of grokking are: (1) <strong>Delayed generalization:<\/strong> a long period of overfitting (training performance \u226b test performance) followed by a sudden test performance jump<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>. (2) <strong>Sharp phase transition:<\/strong> the improvement in validation accuracy is rapid once it begins (resembling a phase change rather than gradual improvement)<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=Figure%201%3A%20Left,validation%20accuracy%20increases\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a><a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=We%20show%20that%2C%20long%20after,is%20shown%20in%20Figure%201\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>. (3) <strong>Small data regime:<\/strong> grokking is easiest to observe on relatively small or algorithmic datasets where the network can eventually <em>learn<\/em> the structure instead of just brute-force memorizing<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=The%20generalization%20of%20overparameterized%20neural,testbeds%20for%20theories%20of%20generalization\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a><a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=dataset%20size%20is%20decreased\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>. 
(4) <strong>Dependence on hyperparameters:<\/strong> factors like weight decay, learning rate, and model capacity influence whether and when grokking occurs<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=We%20compare%20various%20optimization%20details,on%20the%20tasks%20we%20study\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a><a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=Deep%20learning%20practitioners%20are%20used,than%20are%20required%20for%20training\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>. For example, weight decay tends to push the model toward a simpler, generalizing solution rather than letting it stay with a memorizing one<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=We%20compare%20various%20optimization%20details,on%20the%20tasks%20we%20study\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>. Grokking was initially documented in small transformers on tasks like modular arithmetic<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=Figure%201%3A%20Left,validation%20accuracy%20increases\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>, but subsequent work (discussed below) has found analogous behavior in other architectures and even non-neural models, suggesting that it reflects a general property of learning dynamics.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Theoretical Explanations and Mechanistic Models<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">Researchers have proposed several theoretical frameworks to explain <em>why<\/em> grokking happens. These include views based on <strong>training dynamics regimes<\/strong>, <strong>representation learning \u201cphases,\u201d<\/strong> <strong>mechanistic circuit formation,<\/strong> and <strong>statistical distribution shifts<\/strong>. 
We outline the major explanations:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>(a) Lazy-to-Rich Training Dynamics (Kumar et al., 2024):<\/strong> Kumar and colleagues interpret grokking as a consequence of a neural network transitioning from a \u201clazy\u201d learning regime to a \u201crich\u201d feature-learning regime<a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20test%20loss%20of%20such,x%29%24%20are\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a><a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20network%20to%20generalize%20eventually%2C,teacher%20networks\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a>. In the <em>lazy regime<\/em>, the network\u2019s parameters change only minimally (akin to a kernel or linear model), so the network initially fits the training data using its <em>fixed initial features<\/em> (essentially performing something like kernel regression)<a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20test%20loss%20of%20such,x%29%24%20are\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a><a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20test%20loss%20of%20such,are%20the%20rate%20of%20feature\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a>. This can drive training loss down (memorizing the training set) but doesn\u2019t generalize if the initial random features aren\u2019t aligned with the true pattern<a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20test%20loss%20of%20such,x%29%24%20are\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a><a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=network%20output,settings%2C%20like%20on%20MNIST%2C%20one\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a>. 
After this, a <em>late-phase transition<\/em> happens: the network eventually leaves the lazy regime and begins genuine <strong>feature learning<\/strong> (the \u201crich\u201d regime)<a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20test%20loss%20of%20such,x%29%24%20are\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a><a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20network%20to%20generalize%20eventually%2C,teacher%20networks\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a>. At this point it discovers new representations that capture the underlying structure, leading to a sudden generalization jump. Kumar et al. demonstrated this mechanism in a simple two-layer network on a polynomial task that groks even <em>without<\/em> regularization<a href=\"https:\/\/arxiv.org\/abs\/2310.06110#:~:text=,time%20feature\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/abs\/2310.06110#:~:text=cannot%20be%20explained%20by%20existing,1%29%20the%20top\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. They identified conditions for grokking: (1) <strong>Misaligned initial kernel:<\/strong> the top eigenfunctions of the Neural Tangent Kernel are not aligned with the target function, so initial learning struggles to generalize<a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20test%20loss%20of%20such,x%29%24%20are\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a><a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=network%20output,settings%2C%20like%20on%20MNIST%2C%20one\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a>. 
(2) <strong>Intermediate dataset size:<\/strong> enough data that generalization is eventually possible, but not so much that test performance tracks training performance from the start<a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=network%20output,settings%2C%20like%20on%20MNIST%2C%20one\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a><a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20network%20to%20generalize%20eventually%2C,settings%2C%20like%20on%20MNIST%2C%20one\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a>. (3) <strong>Small initial step size (or network scaling):<\/strong> ensuring training starts in the lazy\/linear regime rather than immediately learning features<a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20network%20to%20generalize%20eventually%2C,teacher%20networks\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a>. Under these conditions, the network first fits training data like a linear model (lazy phase), then later <em>switches to learning new features<\/em> that solve the task (rich phase), which coincides with grokking<a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20test%20loss%20of%20such,x%29%24%20are\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a><a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20network%20to%20generalize%20eventually%2C,teacher%20networks\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a>. This theory connects grokking to classical ideas of <strong>neural tangent kernels vs. feature learning<\/strong>. 
Indeed, they provided evidence in settings like MNIST and simple transformers that delaying feature learning can induce grokking<a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20network%20to%20generalize%20eventually%2C,teacher%20networks\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a><a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20lazy%20regime%20so%20does,teacher%20networks\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a>. In short, <strong>grokking = the moment the network stops being a mere \u201clazy\u201d interpolator and actually learns the true features<\/strong>, causing test accuracy to leap<a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20test%20loss%20of%20such,x%29%24%20are\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a><a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20network%20to%20generalize%20eventually%2C,teacher%20networks\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>(b) Effective Theory and the \u201cGoldilocks Zone\u201d (Liu et al., 2022):<\/strong> Another perspective comes from Liu et al., who developed an <strong>effective theory of representation learning<\/strong> to map out different \u201cphases\u201d of training<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=We%20aim%20to%20understand%20grokking%2C,We%20observe%20empirically\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a><a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=the%20presence%20of%20four%20learning,phases%3A%20comprehension%2C%20grokking%2C%20memorization%2C%20and\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. 
They ran extensive experiments on algorithmic tasks and identified four regimes: <strong>confusion<\/strong>, <strong>memorization<\/strong>, <strong>grokking<\/strong>, and <strong>comprehension<\/strong><a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=size%20can%20be%20predicted%20by,phases%3A%20comprehension%2C%20grokking%2C%20memorization%2C%20and\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. <em>Confusion<\/em> means the model fails even to memorize the training set (training loss stays high). <em>Memorization<\/em> means it fits the training data but does not generalize. <em>Comprehension<\/em> means it generalizes quickly, with no long delay. <em>Grokking<\/em> is the in-between case, in which generalization arrives only after a long delay<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=size%20can%20be%20predicted%20by,phases%3A%20comprehension%2C%20grokking%2C%20memorization%2C%20and\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a><a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=A3%20Grokking%20is%20a%20phase,phase%20diagrams%20in%20Figure%206\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. 
Crucially, they found <strong>representation learning only happens in a \u201cGoldilocks zone\u201d between pure memorization and pure confusion<\/strong><a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=the%20presence%20of%20four%20learning,phases%3A%20comprehension%2C%20grokking%2C%20memorization%2C%20and\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a><a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=confusion,closer%20to%20the%20memorization%20phase\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. In this zone (which includes the comprehension and grokking phases), the model learns structured internal representations that enable generalization<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=learning%20performance%20across%20hyperparameters,phases%3A%20comprehension%2C%20grokking%2C%20memorization%2C%20and\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a><a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=We%20find%20on%20transformers%20the,drive%20discovery%20of%20more%20efficient\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. When hyperparameters fall outside this zone, the model either merely memorizes the training set or fails to fit it at all (confusion). 
But just-right settings (not too much data or too high learning rate) force the model to \u201cstruggle\u201d and in doing so, discover structure<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=learning%20phases%20in%20Figure%206,although%20an%20extremely%20slow%20decoder\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a><a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=from%20these%20different%20tasks%3A%20,is%20the\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. They observed that in transformers, the grokking phase sits <em>closer to memorization<\/em> on this spectrum, hence generalization is delayed<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=confusion,closer%20to%20the%20memorization%20phase\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. Intuitively, grokking is an \u201cundesirable\u201d phase caused by slightly improper hyperparameters that slow representation learning<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=decoder,is%20the\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. By charting phase diagrams of training dynamics (varying e.g. learning rate for embeddings vs. 
decoder, or weight decay), they showed one can see regions corresponding to each phase<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=%2890,Both%20comprehension%20and%20grokking%20are\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a><a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=match%20at%20L795%20from%20these,is%20the\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. Notably, <strong>grokking appears sandwiched between the fast-generalizing comprehension phase and the pure memorization phase<\/strong><a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=match%20at%20L795%20from%20these,is%20the\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. If one tunes hyperparameters toward comprehension (e.g. a somewhat faster or more flexible representation learning), the model can \u201cde-delay\u201d generalization and avoid the grokking plateau<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=A3%20Grokking%20is%20a%20phase,phase%20diagrams%20in%20Figure%206\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. This framing suggests grokking is not a mysterious anomaly but a point on a continuum of learning behaviors, explainable by the <em>competition between learning representations vs. overfitting weights<\/em>. 
They even likened the effect to <em>\u201cintelligence from starvation\u201d<\/em> in evolution \u2013 i.e., <strong>resource limitations (small data, weight decay) force the network to find a more efficient solution<\/strong> once pure memorization fails<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=Goldilocks%20phase%20is%20reminiscent%20of,of%20the%20origin%20of%20grokking\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. Overall, Liu et al. provided an intuitive phase diagram: grokking happens when a network eventually exits a near-memorization phase to find structure, and careful tuning can eliminate the delay<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=A3%20Grokking%20is%20a%20phase,phase%20diagrams%20in%20Figure%206\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a><a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=decoder,is%20the\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>(c) Phase Transitions &amp; Mechanistic Interpretability (Nanda et al., 2023):<\/strong> Some researchers approached grokking by <em>reverse-engineering the network\u2019s internal mechanism<\/em> on a grokking task. Nanda et al. performed a detailed <strong>mechanistic interpretability analysis of a small transformer trained on modular addition<\/strong>, to see <em>what changes inside the model during grokking<\/em><a href=\"https:\/\/arxiv.org\/abs\/2301.05217#:~:text=phenomenon%20of%20,by%20the%20later%20removal%20of\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. 
They found that the model actually learns a sensible algorithm: it represents numbers in a Fourier basis and performs addition via rotations on the circle (a known algorithm for modular addition)<a href=\"https:\/\/arxiv.org\/abs\/2301.05217#:~:text=phenomenon%20of%20,from%20the%20gradual%20amplification%20of\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Importantly, they identified three continuous \u201cphases\u201d in training: <strong>(1) Memorization phase:<\/strong> the network first stores some training mappings (e.g. lookup table behavior) to drive training loss down. <strong>(2) Circuit formation phase:<\/strong> the network gradually <strong>builds up the correct algorithmic circuit<\/strong> (sine\/cosine representations for numbers, etc.) while still mostly memorizing<a href=\"https:\/\/arxiv.org\/abs\/2301.05217#:~:text=phenomenon%20of%20,from%20the%20gradual%20amplification%20of\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. <strong>(3) Cleanup phase:<\/strong> the network <em>removes or downweights the memorization-based components<\/em> and relies on the general algorithm, which suddenly boosts test accuracy<a href=\"https:\/\/arxiv.org\/abs\/2301.05217#:~:text=study%20the%20dynamics%20of%20training,later%20removal%20of%20memorizing%20components\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. In this view, grokking is not actually a \u201cmagic\u201d sudden insight but the result of a <strong>gradual improvement in the learned representations\/circuits that isn\u2019t reflected in test performance until a tipping point<\/strong><a href=\"https:\/\/arxiv.org\/abs\/2301.05217#:~:text=study%20the%20dynamics%20of%20training,later%20removal%20of%20memorizing%20components\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. 
Their progress measures showed that the \u201cgeneral\u201d circuit\u2019s strength grows continuously and the \u201cmemorization\u201d circuits decay, and the test accuracy jump happens when the general circuit\u2019s signal dominates<a href=\"https:\/\/arxiv.org\/abs\/2301.05217#:~:text=study%20the%20dynamics%20of%20training,later%20removal%20of%20memorizing%20components\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Thus, <em>what looks like a phase transition from the outside is underpinned by continuous changes internally<\/em>. They conclude that <strong>grokking = the emergent result of two competing solution circuits<\/strong> \u2013 one brute-force memorizing, one generalizing \u2013 where the generalizing circuit eventually wins out<a href=\"https:\/\/arxiv.org\/abs\/2301.05217#:~:text=study%20the%20dynamics%20of%20training,later%20removal%20of%20memorizing%20components\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. This aligns with earlier informal conjectures (e.g. by Millidge 2022, Shah 2021) that grokking may involve the network <em>first finding a shortcut (memorization), then later discovering the true pattern<\/em> and switching over<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=,by%23DEEP_LEARNING_%2C%202021\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. Mechanistic studies by Nanda et al. and others also highlight the role of <strong>weight decay<\/strong>: weight decay preferentially suppresses complex memorizing solutions, effectively encouraging the network to find the more structured algorithm sooner<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=We%20compare%20various%20optimization%20details,on%20the%20tasks%20we%20study\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>. 
In summary, a mechanistic view treats grokking as a <strong>competitive phase transition<\/strong> between different \u201ccircuits\u201d in the model \u2013 when the simplistic memorization falls away and the elegant general solution crystallizes, we see a sudden performance jump<a href=\"https:\/\/arxiv.org\/abs\/2301.05217#:~:text=study%20the%20dynamics%20of%20training,later%20removal%20of%20memorizing%20components\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>(d) Statistical and Distribution-Shift Explanations (Carvalho et al., 2025):<\/strong> A recent line of work frames grokking as a <em>statistical phenomenon related to distribution shift<\/em>. Carvalho et al. argue that a key factor behind grokking is a <strong>mismatch between the training distribution and the test distribution<\/strong><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=test%20set%20loss%20decreases%20sharply,the%20phenomenon%2C%20demonstrating%20that%20while\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=in%20deep%20learning%20networks,convenient%20mechanism%20for%20achieving%20the\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. In their view, small or biased training sets create <em>implicit distribution shifts<\/em> that the model must eventually recognize. They formalize \u201clate generalization\u201d and demonstrate grokking on carefully constructed synthetic datasets that manipulate class\/subclass distributions<a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=in%20deep%20learning%20networks,convenient%20mechanism%20for%20achieving%20the\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=sampling%2C%20and%20the%20other%20investigates,sparse%20data%2C%20we%20demonstrate%20that\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. 
For example, they create a dataset where each class has two subclasses. If one subclass is sparsely sampled in training, the model initially overfits (focusing on the dominant subclass), but much later it <strong>learns to leverage relationships between subclasses, allowing generalization to the sparse subclass<\/strong><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=We%20posit%20that%20data%20sparsity,sparsity%2C%20enabling%20late%20generalization%20by\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=categories%2C%20we%20systematically%20reproduce%20the,parameter%20tuning.%20Our\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. This manifests as grokking. By controlling sampling imbalance, they could induce or prevent grokking at will (e.g. <em>removing<\/em> a subclass entirely prevented late generalization, while <em>weakly sampling<\/em> it caused grokking)<a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=We%20posit%20that%20data%20sparsity,sparsity%2C%20enabling%20late%20generalization%20by\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=sampling%2C%20and%20the%20other%20investigates,sparse%20data%2C%20we%20demonstrate%20that\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. 
They conclude that <strong>data sparsity alone isn\u2019t the direct cause<\/strong> \u2013 rather, <em>data sparsity causes an implicit shift that the model only overcomes after learning higher-level relations<\/em><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=sampling%2C%20and%20the%20other%20investigates,sparse%20data%2C%20we%20demonstrate%20that\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=categories%2C%20we%20systematically%20reproduce%20the,parameter%20tuning.%20Our\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Interestingly, they show grokking can occur even with <em>dense data and minimal hyper-parameter tuning<\/em>, contrary to early beliefs that you needed extremely tiny data and heavy regularization<a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=Instead%2C%20small,criteria%20in%20future%20training%20processes\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. They also extended experiments to a real-world scenario (inducing distribution shift on MNIST digits via clustering distortions) and observed delayed generalization there as well<a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=In%20addition%20to%20the%20synthetic,set%20but%20with%20a%20different\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=distribution%20in%20their%20representations,our%20findings%20beyond%20synthetic%20data\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. 
This statistical view connects grokking with phenomena like <em>domain adaptation<\/em>: the training set is not fully representative, and only after memorizing does the model find a more general decision boundary that works for the broader distribution<a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=test%20set%20loss%20decreases%20sharply,the%20phenomenon%2C%20demonstrating%20that%20while\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=categories%2C%20we%20systematically%20reproduce%20the,parameter%20tuning.%20Our\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. One practical outcome they stress is the need for <strong>better early-stopping or progress metrics<\/strong> \u2013 since traditional validation loss might be flat during the grokking plateau, one might prematurely stop training<a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=test%20set%20loss%20decreases%20sharply,the%20phenomenon%2C%20demonstrating%20that%20while\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=primarily%20arises%20from%20high%20regularization,criteria%20in%20future%20training%20processes\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Their work \u201cpaves the way for developing better stopping criteria\u201d by understanding how distribution structure cues late generalization<a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=Instead%2C%20small,criteria%20in%20future%20training%20processes\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. In essence, the Carvalho et al. 
explanation is: <strong>grokking happens when the training data is just sufficient to eventually infer a general rule, but not initially sufficient to show that generalization on the test distribution is possible<\/strong> \u2013 the model initially fits idiosyncrasies, then slowly figures out the underlying pattern connecting train and test distributions<a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=test%20set%20loss%20decreases%20sharply,the%20phenomenon%2C%20demonstrating%20that%20while\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=We%20posit%20that%20data%20sparsity,sparsity%2C%20enabling%20late%20generalization%20by\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Each of these frameworks sheds light on grokking from different angles (dynamics, representations, circuits, data distribution). They aren\u2019t mutually exclusive; indeed, there\u2019s a sense in which <strong>\u201ccircuits competition\u201d<\/strong>, <strong>\u201clazy-to-rich transition\u201d<\/strong>, and <strong>\u201cdistribution shift recognition\u201d<\/strong> might be describing the same core process at different levels. The next framework attempts to unify some of these ideas.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>(e) Unified \u201cCircuits Competition\u201d Framework (Huang et al., 2024):<\/strong> Huang et al. 
propose a unifying view connecting grokking with <strong>double descent<\/strong> (the puzzling re-improvement of test error when model size grows) and <strong>emergent abilities in LLMs<\/strong><a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=framework%20that%20provides%20a%20unified,This\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=four%20distinct%20training%20dynamics%2C%20each,abilities%20in%20Large%20Language%20Models\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. They build on the idea of <strong>memorization vs. generalization circuits competing inside the model<\/strong><a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=framework%20that%20provides%20a%20unified,This\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=four%20distinct%20training%20dynamics%2C%20each,abilities%20in%20Large%20Language%20Models\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. This perspective was initially used for grokking (as discussed by Nanda et al.), but Huang et al. extend it across different scales of model size and data size<a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=framework%20that%20provides%20a%20unified,This\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=initially%20employed%20to%20explain%20grokking%2C,This\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. 
Their framework outlines <em>four training dynamics regimes<\/em> depending on model capacity and data: for instance, (1) small model + little data = can\u2019t even memorize (underfitting), (2) large model + little data = memorization (overfitting), (3) intermediate model\/data = grokking (late generalization), (4) large model + ample data = immediate generalization (comprehension)<a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=initially%20employed%20to%20explain%20grokking%2C,This\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=four%20distinct%20training%20dynamics%2C%20each,abilities%20in%20Large%20Language%20Models\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. This mirrors the phases discussed by Liu et al., but explicitly ties them to model size and the classic double-descent curve. Using this framework, they reinterpret double descent: as model size increases, initially a model memorizes (test error high), but beyond a critical size it can implement generalizing circuits, causing test error to drop again (second descent)<a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=framework%20that%20provides%20a%20unified,This\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=four%20distinct%20training%20dynamics%2C%20each,abilities%20in%20Large%20Language%20Models\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. 
They made <strong>two predictions about when double descent occurs<\/strong> based on this circuits competition, and verified them experimentally<a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=initially%20employed%20to%20explain%20grokking%2C,This\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=four%20distinct%20training%20dynamics%2C%20each,abilities%20in%20Large%20Language%20Models\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a> (for example, predicting the model size at which memorization gives way to generalization given a fixed data size). Furthermore, they extended the idea to <strong>multi-task learning and emergent abilities<\/strong>. By treating an <em>\u201calgorithmic\u201d task as an emergent ability<\/em> (e.g. a task that only larger models learn, analogous to tasks that only \u201cemerge\u201d in very large LLMs), they show that the same framework predicts when a task will suddenly be learned (emerge) as data or model scale increases<a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=detailed%20analysis%20of%20the%20double,abilities%20in%20Large%20Language%20Models\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. In other words, an <em>emergent ability<\/em> can be seen as grokking in the context of many tasks: smaller models effectively memorize training tasks without discovering the algorithm for the \u201cemergent\u201d task, until a certain scale where the circuit for that task can form and win out<a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=detailed%20analysis%20of%20the%20double,abilities%20in%20Large%20Language%20Models\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. This offers a novel theoretical lens on why large language models suddenly acquire new skills at scale \u2013 it\u2019s the outcome of internal competition between specialized circuits, analogous to grokking dynamics. 
Overall, Huang et al.\u2019s contribution is a <strong>conceptual synthesis<\/strong>: grokking, double descent, and emergent capabilities are all manifestations of <em>the same underlying dynamic<\/em> of memorization vs. generalization circuits, playing out over different axes (time, model size, or task complexity)<a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=framework%20that%20provides%20a%20unified,This\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=four%20distinct%20training%20dynamics%2C%20each,abilities%20in%20Large%20Language%20Models\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. This unified view helps tie the grokking literature into the broader context of generalization phenomena in deep learning.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Applicability and Generalization Across Model Types<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">Initially, grokking was studied in small transformers on synthetic tasks. A natural question is: does grokking occur in other models, or is it unique to deep neural networks (and to those specific tasks)? Recent research indicates <strong>grokking is more general<\/strong> \u2013 it appears in various model families and deeper networks, though sometimes in modified forms.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Grokking in Non-Neural Models (Miller et al., 2023):<\/strong> Strikingly, Miller et al. 
showed that <strong>grokking is <em>not<\/em> exclusive to neural nets<\/strong> \u2013 they observed analogous delayed generalization in models like <strong>Gaussian Processes (GPs)<\/strong>, simple <strong>linear regression<\/strong>, and <strong>Bayesian neural networks<\/strong><a href=\"https:\/\/arxiv.org\/abs\/2310.17247#:~:text=,is%20not%20restricted%20to%20settings\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/abs\/2310.17247#:~:text=validation%20set%20long%20after%20the,guided%20by%20complexity%20and%20error\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. In their study, even a <em>Gaussian process classifier<\/em> trained on a small algorithmic dataset displayed a grokking-like jump in test performance after more data was effectively considered<a href=\"https:\/\/arxiv.org\/abs\/2310.17247#:~:text=,is%20not%20restricted%20to%20settings\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Likewise, linear regression on a certain structured task showed a form of late generalization<a href=\"https:\/\/arxiv.org\/abs\/2310.17247#:~:text=,is%20not%20restricted%20to%20settings\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. This is surprising because these models don\u2019t undergo iterative \u201cfeature learning\u201d in the neural sense. Miller et al. argue the common factor is that these learning methods can be implicitly guided by a notion of <strong>solution complexity vs. error trade-off<\/strong><a href=\"https:\/\/arxiv.org\/abs\/2310.17247#:~:text=but%20occurs%20in%20other%20settings,guided%20by%20complexity%20and%20error\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. 
For example, in GP regression with a certain kernel, the function that fits the training data initially might be very wiggly (memorizing), but as the GP posterior updates (or as hyperparameters favor smoother functions), a simpler generalizing function can eventually dominate \u2013 analogous to a late shift to a low-complexity solution<a href=\"https:\/\/arxiv.org\/abs\/2310.17247#:~:text=but%20occurs%20in%20other%20settings,guided%20by%20complexity%20and%20error\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. They even devised a way to <em>induce grokking behavior<\/em> in algorithmic tasks by adding spurious input dimensions: extra random features that the model can memorize initially, forcing a delay until it learns to ignore them and focus on real features<a href=\"https:\/\/arxiv.org\/abs\/2310.17247#:~:text=but%20occurs%20in%20other%20settings,guided%20by%20complexity%20and%20error\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. The key takeaway is that <strong>any learning system that balances model complexity and fit could exhibit grokking<\/strong><a href=\"https:\/\/arxiv.org\/abs\/2310.17247#:~:text=but%20occurs%20in%20other%20settings,guided%20by%20complexity%20and%20error\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Grokking is \u201cnot restricted to settings considered in current theoretical and empirical studies\u201d \u2013 it may arise <em>\u201cin any model where solution search is guided by complexity and error\u201d<\/em><a href=\"https:\/\/arxiv.org\/abs\/2310.17247#:~:text=but%20occurs%20in%20other%20settings,guided%20by%20complexity%20and%20error\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. In simpler terms, if a learning algorithm can first find a more complex solution that fits the data and only later shift to a simpler, generalizing solution, it can grok. 
This finding broadens the scope: delayed generalization is <em>not<\/em> a quirk of backpropagation or transformers, but a potential phenomenon in general learning dynamics, including Bayesian paradigms. It invites theoretical analysis of, say, double descent in kernel methods and how that might relate to grokking. (Indeed, double descent in linear models could be seen as a cousin of grokking, where varying model complexity yields non-monotonic generalization.)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Deep Neural Networks and Multi-Stage Generalization (Fan et al., 2024):<\/strong> Earlier grokking studies mostly used shallow networks (e.g. a 1-layer transformer). Fan et al. asked: what happens in <em>deeper<\/em> neural networks? They trained deeper MLPs (up to 12 layers) on algorithmic tasks and found that <strong>deeper networks not only grok, but can exhibit multiple grokking <em>stages<\/em><\/strong><a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=research%20primarily%20focus%20on%20shallow,Additionally%2C%20we\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=observe%20an%20intriguing%20multi,believe%20our%20work%20is%20the\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Specifically, a 12-layer network sometimes showed <em>two<\/em> distinct jumps in test accuracy: an initial delayed jump, then after further training, a second surge in performance<a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=research%20primarily%20focus%20on%20shallow,Additionally%2C%20we\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. This <strong>\u201cmulti-stage generalization\u201d<\/strong> was rarely seen in shallow models<a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=observe%20an%20intriguing%20multi,believe%20our%20work%20is%20the\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. 
It suggests that deeper models might learn complex tasks in a hierarchical fashion \u2013 e.g. first grok some simpler aspect, then later grok a finer aspect. Correspondingly, Fan et al. measured the internal <strong>feature rank<\/strong> (roughly, the dimensionality of the learned representations) over training<a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=is%20scarcely%20seen%20on%20shallow,feature%20rank%20and%20generalization%20performance\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. They observed that as grokking occurs, <strong>the feature rank of intermediate layers <em>drops<\/em><\/strong>, indicating the network\u2019s representations become more compressed and structured<a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=is%20scarcely%20seen%20on%20shallow,feature%20rank%20and%20generalization%20performance\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Intriguingly, the <strong>feature rank trajectory often showed a double-descent shape<\/strong>: it would decrease, then increase, then decrease again, aligning with the two surges in accuracy<a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=is%20scarcely%20seen%20on%20shallow,feature%20rank%20and%20generalization%20performance\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. In other words, when test accuracy made a second leap, it coincided with another drop in feature rank complexity. These findings hint that <strong>internal representation compression is a signature of generalization<\/strong> \u2013 the network throws away redundant\/memorized information and distills a more low-dimensional concept, which yields better generalization<a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=is%20scarcely%20seen%20on%20shallow,feature%20rank%20and%20generalization%20performance\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. 
They even suggest that <em>feature rank might predict grokking better than weight norm or other metrics<\/em><a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=from%20overfitting%20to%20the%20generalization,believe%20our%20work%20is%20the\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Practically, one could monitor feature rank during training as an unsupervised indicator of an impending generalization jump<a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=from%20overfitting%20to%20the%20generalization,believe%20our%20work%20is%20the\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Fan et al.\u2019s work expands grokking research to <em>deeper architectures<\/em>, showing that depth can increase the propensity to grok (they found deep nets were <em>more susceptible<\/em> to grokking than shallow ones)<a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=research%20primarily%20focus%20on%20shallow,Additionally%2C%20we\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. It also reinforces connections between grokking and <strong>double descent<\/strong>: the double descent in feature complexity mirrors a kind of double descent in performance<a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=observe%20an%20intriguing%20multi,believe%20our%20work%20is%20the\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. In summary, deep networks can grok in potentially multiple phases, and studying their layerwise representations (like rank) can reveal how generalization emerges internally across training stages<a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=observe%20an%20intriguing%20multi,believe%20our%20work%20is%20the\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Unified Perspectives: Grokking, Double Descent, Emergence (Huang et al., 2024):<\/strong> As mentioned earlier, Huang et al. 
provide a framework that unifies these phenomena by focusing on circuits competition across scales<a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=framework%20that%20provides%20a%20unified,This\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=four%20distinct%20training%20dynamics%2C%20each,abilities%20in%20Large%20Language%20Models\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. In the context of different model types: <em>double descent<\/em> has been observed in linear models and random forests, and <em>emergent abilities<\/em> are discussed for LLMs \u2013 by showing these can all be seen through the lens of late-forming generalist circuits overtaking memorizing ones, they argue grokking is a widespread dynamic. Huang et al. delineated how increasing model capacity or data can move a model between the four regimes (no-fit, memorize, grok, comprehend)<a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=initially%20employed%20to%20explain%20grokking%2C,This\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. This suggests, for example, that a sufficiently large model might not grok (it jumps straight to generalization, the comprehension phase) \u2013 which might explain why grokking is harder to notice in very large-scale settings unless carefully analyzed. But in intermediate regimes (including many practical scenarios), we might expect some grokking-like behaviors.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Other Model-Agnostic Findings:<\/strong> Additional works have explored simplified theoretical models of grokking. For instance, <em>Levi et al. (2023)<\/em> analyzed a <strong>linear estimator that groks<\/strong> \u2013 they constructed a solvable setup (linear regression with particular features) where the solution exhibits a delayed generalization effect, giving a fully analytical handle on grokking dynamics. <em>Lyu et al. 
(2023)<\/em> proved in a theoretical setting that a \u201cdichotomy of early vs. late implicit bias\u201d in gradient descent could <em>provably<\/em> lead to grokking-like behavior (early training minimizes training error in one way, later dynamics of gradient descent shift the solution towards a different minimum with better generalization)<a href=\"https:\/\/gwern.net\/doc\/ai\/scaling\/emergence\/grokking\/index#:~:text=Zhu%20et%20al%202024%20,27\" target=\"_blank\" rel=\"noreferrer noopener\">gwern.net<\/a>. These works support the notion that grokking can emerge from general properties of optimization in high-dimensional systems, not just quirky neural network tricks.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The upshot is that <strong>grokking generalizes beyond its initial context<\/strong>. It has been replicated in kernel methods, probabilistic models, and deeper nets, and connected to phenomena like double descent and phase transitions. This broadens confidence that grokking reveals something fundamental about learning: there can be <em>multiple qualitatively different regimes during training<\/em>, and the <strong>final generalization behavior may be decided by a late-time dynamical shift<\/strong> rather than evident early on.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Applications in LLM Pretraining<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">A critical question is whether grokking occurs in <em>large-scale, real-world training<\/em>, such as the pretraining of large language models (LLMs). For a long time, grokking was mostly a toy-setting curiosity. 
Recent studies in 2025 provide evidence that <strong>grokking does manifest during LLM pretraining \u2013 albeit in a more complex, asynchronous way \u2013 and that we can detect it via the model\u2019s internal dynamics<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Grokking in Mixture-of-Experts LLM (Li et al., 2025):<\/strong> Li and colleagues conducted the first study of grokking <em>in the context of a full-scale LLM pretraining run<\/em>. They analyzed checkpoints from the training of <strong>OLMoE<\/strong>, a 7-billion-parameter Mixture-of-Experts transformer (so, a large model with multiple experts per layer)<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=Grokking%2C%20i,specific\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=one%20or%20two%20highly,specific%20knowledge%20retrieval%20tasks\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Crucially, they <strong>did not have a traditional held-out test set to watch accuracy during pretraining<\/strong> (since language model pretraining is self-supervised), so they crafted a methodology: they computed the model\u2019s loss on its training data over time and also periodically evaluated the <em>emergence of capabilities<\/em> on various downstream tasks (math reasoning, code generation, factual QA) using intermediate checkpoints<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=one%20or%20two%20highly,specific%20knowledge%20retrieval%20tasks\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=language%20model%20%28LLM%29%2C%20i.e.%2C%20OLMoE%C2%A0,specific%20knowledge%20retrieval%20tasks\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. 
They indeed found that <strong>grokking-like delayed generalization happens in LLM pretraining<\/strong><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=Our%20study%2C%20for%20the%20first,conversion%2C%20providing%20a%20mechanistic%20explanation\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. However, unlike the toy tasks where <em>all<\/em> data is grokked at once, in LLMs <strong>different domains or skill areas grokked at different times<\/strong><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=Our%20study%2C%20for%20the%20first,conversion%2C%20providing%20a%20mechanistic%20explanation\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=pretraining%20of%20practical%2C%20large,develop%20two%20novel%20metrics%20to\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. For example, the model\u2019s performance on math word problems might remain low until very late in training (a \u201clater grokking\u201d capability), whereas its performance on commonsense QA might improve earlier. They called this <em>\u201clocal grokking\u201d<\/em> \u2013 each subset of the training data (or each domain\/task) has its own delayed generalization point<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=parameters,sufficient%20data%20have%20been%20memorized\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Early in pretraining, the model\u2019s generalization (when evaluated on downstream tasks) was <em>unstable<\/em>, improving on some tasks and then dropping, which they attribute to these asynchronous grokking events across domains<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=parameters,sufficient%20data%20have%20been%20memorized\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. 
Once <em>sufficient data had been seen and memorized in a domain, that domain\u2019s test performance started improving steadily<\/em><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=parameters,sufficient%20data%20have%20been%20memorized\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Notably, <strong>more difficult data (or tasks) grokked later and had longer delays<\/strong>, which aligns with intuition \u2013 complex patterns take longer for the model to discover even after fitting the easier parts<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=parameters,sufficient%20data%20have%20been%20memorized\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=time%20and%20lasting%20steps%20vary,often%20takes%20longer%20to%20generalize\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Routing Dynamics as Generalization Indicators:<\/strong> Because evaluating an LLM on test tasks in the middle of pretraining is expensive and confounded (since the model isn\u2019t instruction-tuned yet), Li et al. proposed to monitor <strong>internal model metrics<\/strong> instead. In a Mixture-of-Experts (MoE) model, a <em>routing network<\/em> directs each input to certain expert sub-networks at each layer. Li et al. tracked the <strong>expert choice patterns (pathways) for training samples<\/strong> throughout training<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=%E2%80%9Cemergence%20of%20generalization%E2%80%9D%20by%20investigating,predict%20the%20generalization%20improvement%20on\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=states,conversion%2C%20providing%20a%20mechanistic%20explanation\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. 
They discovered an intriguing mechanistic change: during grokking, <strong>the expert pathways for different samples go from being random and instance-specific to becoming more <em>structured and shared<\/em> among samples<\/strong><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=%E2%80%9Cemergence%20of%20generalization%E2%80%9D%20by%20investigating,predict%20the%20generalization%20improvement%20on\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=states,conversion%2C%20providing%20a%20mechanistic%20explanation\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. In other words, early in training each data point might activate a unique sequence of experts (suggesting rote memorization of individual quirks), but later in training, <strong>the model converges on more uniform pathways that generalize across examples<\/strong> (suggesting it found common patterns)<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=%E2%80%9Cemergence%20of%20generalization%E2%80%9D%20by%20investigating,predict%20the%20generalization%20improvement%20on\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=states,conversion%2C%20providing%20a%20mechanistic%20explanation\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Additionally, they defined a \u201cpathway complexity\u201d measure (essentially, how complicated a single sample\u2019s expert route is). 
They observed that <strong>even though training loss had plateaued, the pathway complexity of samples kept <em>decreasing<\/em><\/strong> as training continued<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=%E2%80%9Cemergence%20of%20generalization%E2%80%9D%20by%20investigating,predict%20the%20generalization%20improvement%20on\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=choices%20across%20layers,conversion%2C%20providing%20a%20mechanistic%20explanation\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. This suggests the model was finding <em>simpler internal explanations<\/em> for each sample (using fewer or more consistent experts) without any change in loss, which the authors read as a sign of memorization turning into generalization internally. These changes in routing behavior strongly correlated with actual downstream performance gains<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=of%20the%20delayed%20generalization,and%20improve%20the%20generalization%20bound\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=We%20demonstrate%20their%20capabilities%20to,monitor%20the%20generalization%20performance%20without\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Based on this, the authors proposed two metrics: (1) <strong>Pathway distance between samples<\/strong> \u2013 measuring whether inputs start to share similar expert routes, and (2) <strong>Pathway consistency for a sample<\/strong> \u2013 measuring whether a single input\u2019s route becomes more stable\/simple layer-to-layer<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=of%20the%20delayed%20generalization,and%20improve%20the%20generalization%20bound\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. 
Both metrics showed a marked shift exactly when generalization (as measured by downstream tasks) improved<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=of%20the%20delayed%20generalization,and%20improve%20the%20generalization%20bound\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=We%20demonstrate%20their%20capabilities%20to,and%20improve%20the%20generalization%20bound\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Impressively, these metrics can be computed <em>without any test data<\/em>: they rely only on the model\u2019s internal choices on training data. This offers a potentially powerful tool: <strong>monitoring generalization in large-scale training without needing a validation set<\/strong><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=of%20the%20delayed%20generalization,and%20improve%20the%20generalization%20bound\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=dependent%20on%20training%20data,and%20improve%20the%20generalization%20bound\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. In practical terms, one could decide when a pretraining run has effectively \u201cgrokked\u201d its data and is ready, by looking at the trends of these routing metrics \u2013 useful for early stopping or dynamic scheduling of training<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=of%20the%20delayed%20generalization,and%20improve%20the%20generalization%20bound\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=We%20demonstrate%20their%20capabilities%20to,and%20improve%20the%20generalization%20bound\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. 
The authors also provided theoretical grounding for why more structured pathways imply better generalization: in a one-layer MoE, they prove that if the routing function clusters inputs (i.e., pathways are shared) the model\u2019s effective complexity is lower and yields a tighter generalization bound<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=dependent%20on%20training%20data,and%20improve%20the%20generalization%20bound\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=finetuning%20and%20test,and%20improve%20the%20generalization%20bound\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Mechanistic Interpretability in LLMs:<\/strong> While the routing analysis is one form of mechanistic insight, there are also efforts to directly interpret <em>what large models are learning<\/em> during grokking. For example, one could attempt to identify emerging neurons or circuits corresponding to new abilities that activate late in training. The study <em>\u201cGrokked Transformers are Implicit Reasoners\u201d<\/em> (Wang et al., 2024) examines whether after grokking, transformers effectively perform multi-step reasoning without explicit chain-of-thought \u2013 suggesting grokking might coincide with the network internalizing implicit algorithms. They found that small transformers trained to grok a reasoning task ended up using their feedforward layers to carry out multi-step logical inferences implicitly (hence \u201cimplicit reasoners\u201d). This again underscores that <strong>when a model groks, it often has discovered an interpretable algorithm or structure internally<\/strong> (like a reasoning procedure or a Fourier transform, as earlier cases showed). 
Such mechanistic studies on larger models are just beginning, but they promise to connect <em>emergent behaviors<\/em> in LLMs to the grokking framework.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In summary, <strong>grokking does occur in large-scale LLM training<\/strong>, but it\u2019s more nuanced: not all tasks grok at once (some skills emerge earlier or later than others), and we need clever metrics to catch it since we can\u2019t rely on simple train\/test loss curves in one-pass training<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=parameters,sufficient%20data%20have%20been%20memorized\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=time%20and%20lasting%20steps%20vary,often%20takes%20longer%20to%20generalize\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. The MoE study provides encouraging evidence that even in a 7B model trained on a diverse corpus, one can see telltale signs of grokking in the model\u2019s <strong>routing patterns and representation complexity<\/strong><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=%E2%80%9Cemergence%20of%20generalization%E2%80%9D%20by%20investigating,predict%20the%20generalization%20improvement%20on\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=choices%20across%20layers,crucial%20practical%20value%20to%20model\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. This bridges the gap from toy problems to real-world foundation models, implying that the lessons learned about grokking (e.g. 
importance of continued training past apparent convergence, internal competition of circuits) are relevant for understanding how LLMs acquire capabilities over the course of training.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Current Challenges and Future Research Directions<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">Despite significant progress in understanding grokking, several challenges and open questions remain:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Limitations of Current Studies:<\/strong> Thus far, many grokking investigations have been on <strong>toy tasks or small models<\/strong>. Algorithmic operations (modular arithmetic, group theory tasks) have been the prototypical setting because they cleanly demonstrate delayed generalization<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=The%20generalization%20of%20overparameterized%20neural,testbeds%20for%20theories%20of%20generalization\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a><a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=Figure%201%3A%20Left,validation%20accuracy%20increases\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>. A concern is how well conclusions transfer to more complex, noisy tasks or datasets. For instance, natural data may not exhibit as stark a plateau or jump; instead, partial or domain-specific grokking (as seen in LLMs) might be more common<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=parameters,sufficient%20data%20have%20been%20memorized\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=time%20and%20lasting%20steps%20vary,often%20takes%20longer%20to%20generalize\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Additionally, most mechanistic interpretability success (e.g. 
fully reverse-engineering the Fourier addition circuit<a href=\"https:\/\/arxiv.org\/abs\/2301.05217#:~:text=phenomenon%20of%20,from%20the%20gradual%20amplification%20of\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>) has been on very small networks. Scaling those methods to interpret a grokking event in a billion-parameter model is non-trivial. <strong>Future work<\/strong> needs to test grokking in a wider array of tasks \u2013 e.g., does a vision model ever grok a pattern in image data? If not, why not (is it data size, or architecture)? There\u2019s early evidence of grokking in MNIST with distribution shifts<a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=In%20addition%20to%20the%20synthetic,set%20but%20with%20a%20different\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=distribution%20in%20their%20representations,our%20findings%20beyond%20synthetic%20data\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>, but more real-world cases would bolster the universality of the phenomenon.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Detecting and Leveraging Grokking in Large-Scale Training:<\/strong> One practical challenge is <strong>observability<\/strong>. In giant models trained on massive data, a small jump in aggregate validation loss might be hard to notice or attribute to a grokking-like dynamic. As Li et al. (2025) noted, generalization gains might be asynchronous and spread over training<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=parameters,sufficient%20data%20have%20been%20memorized\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=grokking%20for%20most%20data%C2%A0,Due%20to%20the%20local%20grokking\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. We might need to develop <em>new metrics or probes to detect grokking<\/em> in such settings. 
The pathway complexity metrics discussed above are a promising start<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=of%20the%20delayed%20generalization,and%20improve%20the%20generalization%20bound\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=We%20demonstrate%20their%20capabilities%20to,and%20improve%20the%20generalization%20bound\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Another idea is using <em>training dynamics modeling<\/em> (e.g., <strong>\u201cpredicting grokking long before it happens\u201d<\/strong> \u2013 Notsawo et al., 2023 used loss landscape analysis to anticipate a coming grokking event<a href=\"https:\/\/gwern.net\/doc\/ai\/scaling\/emergence\/grokking\/index#:~:text=,Learning%20Local%20Rules%20With%20Gradient\" target=\"_blank\" rel=\"noreferrer noopener\">gwern.net<\/a>). If we can predict that a model will grok given enough time, we can manage training accordingly. This ties into <strong>training efficiency and early stopping<\/strong>: A big risk today is stopping training too early when the validation loss plateaus; grokking teaches us that <em>apparent convergence may mask potential future gains<\/em>. But we also don\u2019t want to waste compute if no grokking is forthcoming. 
So a future direction is developing <strong>early-warning signals for grokking<\/strong> \u2013 e.g., monitoring internal feature rank (per Fan et al.<a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=is%20scarcely%20seen%20on%20shallow,feature%20rank%20and%20generalization%20performance\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>) or pathway metrics (per Li et al.<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=of%20the%20delayed%20generalization,and%20improve%20the%20generalization%20bound\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>) to know that the model is in a \u201cmemorization plateau but actively reorganizing internally,\u201d as opposed to truly stuck. Carvalho et al. explicitly mention using their insights on distribution shifts to inform better stopping criteria<a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=Instead%2C%20small,criteria%20in%20future%20training%20processes\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. For example, if one detects that the model has not yet learned relationships between subclasses (via some probe), one might decide to continue training longer or adjust training to facilitate that.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Scaling Mechanistic Interpretability:<\/strong> One fascinating direction is applying <strong>mechanistic interpretability at scale<\/strong> to grokking. The small-scale studies literally found the circuit (e.g., discrete Fourier transform) the model used<a href=\"https:\/\/arxiv.org\/abs\/2301.05217#:~:text=phenomenon%20of%20,from%20the%20gradual%20amplification%20of\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. In a larger model, say an LLM, can we identify a subnetwork or set of neurons that implemented a new capability at the moment it grokked that capability? If so, we could potentially <strong>see an emergent chain-of-thought or algorithm form<\/strong>. 
This could connect to research on <em>phase changes in model behavior<\/em> \u2013 e.g., if an LLM suddenly learns to do multi-step reasoning, is there an internal circuit that \u201csnaps\u201d into place? Grokking provides a controlled way to study such phase changes. <em>Phase transition<\/em> analyses (such as computing order parameters that mark when a network\u2019s representation changes qualitatively) could borrow more from physics, building on the phase diagrams of Liu et al.<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=%2890,Both%20comprehension%20and%20grokking%20are\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a><a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=Universality%20of%20phase%20diagrams%20We,b\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Understanding the Role of Regularization and Optimization:<\/strong> Many works noted that weight decay (or implicit regularization) was important for grokking<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=We%20compare%20various%20optimization%20details,on%20the%20tasks%20we%20study\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>. Why exactly? Does it simply slow down memorization enough for feature learning to catch up (as Power et al. intuited<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=We%20compare%20various%20optimization%20details,on%20the%20tasks%20we%20study\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>)? Or does it actively favor low-complexity circuits, tipping the competition? Similarly, <em>optimizer choices<\/em> might matter: AdamW and SGD might traverse the loss landscape differently during the plateau. 
Thilak et al. (2022) studied the \u201cSlingshot Mechanism\u201d in adaptive optimizers and its relation to grokking, suggesting that certain optimizer behaviors (like overshooting and retracing in loss) can facilitate escaping a memorization minimum<a href=\"https:\/\/gwern.net\/doc\/ai\/scaling\/emergence\/grokking\/index#:~:text=al%202024%20%20,20\" target=\"_blank\" rel=\"noreferrer noopener\">gwern.net<\/a>. Future work can explore how different training algorithms influence the lazy-to-rich transition or circuit formation. This might inform <em>best practices<\/em> \u2013 e.g., if we want a model to grok a solution, should we use a smaller learning rate initially (to encourage lazy training) and then increase it?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Grokkability of Tasks and Models:<\/strong> It remains an open question which tasks are <em>grokkable<\/em>. Clearly, tasks that have an underlying exact structure (group theory, arithmetic) can exhibit grokking. Tasks that require pure memorization (e.g. a random mapping) would never grok because there is no structure to find. Most real tasks lie in between \u2013 they have patterns plus idiosyncrasies. One could define a measure of a task\u2019s \u201clearnability gap\u201d: how much better could a model potentially do if it discovered an optimal representation vs. just memorizing? Perhaps tasks with a large gap are likely to produce grokking if the model size\/data regime is right. There is also the question of <em>model architecture<\/em>: do some architectures lend themselves to grokking more? (Transformers vs RNNs vs CNNs, etc.) 
The evidence so far (transformers, MLPs, even Gaussian processes) suggests the phenomenon is broad, but perhaps recurrent models might behave differently due to how they process data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Practical Implications \u2013 Training Strategy:<\/strong> If grokking can be achieved, could we <em>intentionally leverage it<\/em> to train models more efficiently? For example, one might deliberately train on a smaller subset of data until grokking occurs (to force the model to find a general solution under constrained data), then fine-tune on more data. This might yield a better-generalizing model than training on all data from scratch (where the model might memorize more). This is speculative, but it relates to <strong>curriculum learning<\/strong>: grokking induced on small data might act like a curriculum that teaches the model an underlying concept, which then helps on bigger data. On the flip side, grokking also implies <em>wasted time<\/em> in training (the long plateau). If we understand it well, we could try to shorten that plateau (e.g. via hyperparameter tuning or auxiliary losses that encourage the general solution sooner). Work like <strong>\u201cGrokfast: Accelerated Grokking by Amplifying Slow Gradients\u201d<\/strong> (Lee et al., 2024) explicitly looked at speeding up grokking by amplifying the slow-varying component of the gradients over training steps, which strengthens the learning signal of the true pattern. Continuing such research can make grokking less of an oddity and more of a tool.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Theoretical Questions:<\/strong> The convergence properties of grokking are not fully understood. Why does generalization often <em>snap almost vertically<\/em>? Is there an underlying bifurcation in the gradient-flow dynamics? 
Some have drawn analogies to <strong>phase transitions<\/strong> in physics, where an order parameter changes rapidly once a threshold is passed<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=Advancing%20mathemat%02ics%20by%20guiding%20human,Optimal%20regularization%20can\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a><a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=1798%E2%80%931828%2C%202013.%20,supervised%20learning\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. Connecting formal learning theory to grokking is challenging, but one could imagine analyzing a simplified model of grokking as a dynamical system with multiple attractors (a memorization attractor and a generalization attractor). Recent work by <em>\u017dunkovi\u010d &amp; Ilievski (2022)<\/em> indeed studied \u201cgrokking phase transitions\u201d in learning local rules, noting parallels to physical systems<a href=\"https:\/\/gwern.net\/doc\/ai\/scaling\/emergence\/grokking\/index#:~:text=Nanda%20et%20al%202023%20,39\" target=\"_blank\" rel=\"noreferrer noopener\">gwern.net<\/a>. Bridging these perspectives could yield a more rigorous definition of <em>when<\/em> grokking occurs (perhaps in terms of a threshold on data size relative to model complexity, as hinted by Liu et al. 
with a critical dataset fraction<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=Figure%204%3A%20,phase%20transition%20of%20RQI%20around\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>, or a threshold on the alignment of the model\u2019s eigenfunctions with the target as per Kumar et al.<a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=network%20output,settings%2C%20like%20on%20MNIST%2C%20one\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a>).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Emergent Abilities and Grokking:<\/strong> As Huang et al. argue, emergent abilities in very large models might essentially be grokking happening along the scale axis<a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=detailed%20analysis%20of%20the%20double,abilities%20in%20Large%20Language%20Models\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. This raises an exciting prospect: by studying grokking in controlled settings, can we predict what abilities will emerge in frontier models and at what point? For example, if we treat a certain complex task as a \u201cheld-out capability,\u201d can we estimate how much data or model size is needed before that task\u2019s solution \u201cclicks\u201d into place (is grokked)? Research by Zhu et al. (2024) on \u201cCritical data size of language models from a grokking perspective\u201d touches on this \u2013 finding the minimum data required for an LLM to grok linguistic phenomena<a href=\"https:\/\/gwern.net\/doc\/ai\/scaling\/emergence\/grokking\/index#:~:text=Dohmatob%20et%20al%202024%20,An%20Empirical%20Exploration%20With%20Model\" target=\"_blank\" rel=\"noreferrer noopener\">gwern.net<\/a>. 
This kind of research could guide dataset design: if some ability hasn\u2019t emerged, maybe more data or a different training regimen is needed to induce a grokking event.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Safety and Alignment Considerations:<\/strong> Grokking implies that models can harbor latent capabilities that only <em>activate<\/em> after extensive training. For AI alignment, this is a double-edged sword: on one hand, it means a model might unexpectedly become capable (which could be risky if the capability is misaligned); on the other hand, monitoring for grokking-like shifts (via interpretability tools) might alert us to sudden capability gains. Research in mechanistic interpretability born out of alignment (like the work of Nanda et al.) is likely to continue leveraging grokking as a testbed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In conclusion, grokking has graduated from a curious phenomenon on toy data to a concept that links several open questions in deep learning: generalization dynamics, double descent, emergence, and more. <strong>Current challenges<\/strong> revolve around scaling our understanding and detection of grokking to realistic settings and harnessing it for positive ends (improving training, predicting emergent behaviors). <strong>Future research<\/strong> will likely focus on unifying theoretical models, developing new diagnostics for ongoing training, and applying these ideas to ever larger models to see just how ubiquitous delayed generalization is. 
Grokking has essentially opened a new window into the <em>time dimension of learning<\/em>: it reminds us that <em>when<\/em> a model learns can be as fascinating as <em>whether<\/em> it learns at all.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Summary of Key Papers on Grokking<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The table below summarizes major papers discussed, including their core contributions and methodologies:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Paper (Authors, Year)<\/strong><\/th><th><strong>Core Findings<\/strong><\/th><th><strong>Methodologies<\/strong><\/th><th><strong>Contributions \/ Open Issues<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>Power et al., 2022<\/strong> \u2013 <em>\u201cGrokking: Generalization Beyond Overfitting on Small Algorithmic Datasets\u201d<\/em><a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=in%20great%20detail,of%20the%20finite%20training%20dataset\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a><a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a><\/td><td>Discovered <strong>grokking<\/strong> (delayed generalization) in small neural networks on algorithmic tasks. Validation accuracy jumped from random to 100% long after training accuracy was 100%. Smaller datasets cause longer delays; weight decay accelerates generalization<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a><a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=We%20compare%20various%20optimization%20details,on%20the%20tasks%20we%20study\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>.<\/td><td>Empirical study on small transformers (modular arithmetic operations). 
Monitored train\/test curves for various dataset sizes and hyperparameters<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=Figure%201%3A%20Left,validation%20accuracy%20increases\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a><a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=dataset%20size%20is%20decreased\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>.<\/td><td>Introduced the term <em>\u201cgrokking\u201d<\/em> and its key traits. Highlighted the role of data size and regularization<a href=\"https:\/\/ar5iv.labs.arxiv.org\/html\/2201.02177#:~:text=We%20compare%20various%20optimization%20details,on%20the%20tasks%20we%20study\" target=\"_blank\" rel=\"noreferrer noopener\">ar5iv.labs.arxiv.org<\/a>. Provided an open-source testbed for studying generalization beyond memorization. Open issues: initially lacked a clear explanation of mechanism (spurred follow-ups).<\/td><\/tr><tr><td><strong>Liu et al., 2022<\/strong> \u2013 <em>\u201cTowards Understanding Grokking: An Effective Theory of Representation Learning\u201d<\/em><a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=We%20aim%20to%20understand%20grokking%2C,We%20observe%20empirically\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a><a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=the%20presence%20of%20four%20learning,phases%3A%20comprehension%2C%20grokking%2C%20memorization%2C%20and\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a><\/td><td>Proposed a <strong>phase diagram of learning<\/strong> with four phases: confusion, memorization, grokking (delayed gen), comprehension (immediate gen). 
Showed representation learning occurs only in a \u201c<strong>Goldilocks zone<\/strong>\u201d between memorization and confusion<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=confusion,closer%20to%20the%20memorization%20phase\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. Grokking phase is nearer memorization, causing delay<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=confusion,closer%20to%20the%20memorization%20phase\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. Hyperparameters determine phase; proper tuning can eliminate grokking (move to comprehension)<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=A3%20Grokking%20is%20a%20phase,phase%20diagrams%20in%20Figure%206\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a><a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=decoder,is%20the\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>.<\/td><td>Both theoretical (effective theory) and empirical: developed an analytic toy model predicting a phase transition in representation quality vs. 
data fraction<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=Figure%204%3A%20,phase%20transition%20of%20RQI%20around\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>; ran grid searches to produce phase diagrams for transformer models on tasks (addition, permutation groups)<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=%2890,Both%20comprehension%20and%20grokking%20are\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a><a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=Universality%20of%20phase%20diagrams%20We,b\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>.<\/td><td>Gave an intuitive <strong>\u201ccomprehension\u2013grokking\u2013memorization\u201d framework<\/strong><a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=match%20at%20L795%20from%20these,is%20the\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. Introduced physics-inspired analysis (phase transitions, \u201cintelligence from starvation\u201d analogy)<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=Goldilocks%20phase%20is%20reminiscent%20of,of%20the%20origin%20of%20grokking\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. 
Contribution: explained grokking as hyperparameter mis-tuning and provided a path to <em>\u201cde-delay\u201d<\/em> generalization<a href=\"https:\/\/papers.neurips.cc\/paper_files\/paper\/2022\/file\/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf#:~:text=A3%20Grokking%20is%20a%20phase,phase%20diagrams%20in%20Figure%206\" target=\"_blank\" rel=\"noreferrer noopener\">papers.neurips.cc<\/a>. Open issues: applicability of phase diagrams to complex tasks; defining Goldilocks zone quantitatively.<\/td><\/tr><tr><td><strong>Kumar et al., 2024<\/strong> \u2013 <em>\u201cGrokking as the Transition from Lazy to Rich Training Dynamics\u201d<\/em><a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20test%20loss%20of%20such,x%29%24%20are\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a><a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20network%20to%20generalize%20eventually%2C,teacher%20networks\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a><\/td><td>Explained grokking via a <strong>two-regime dynamic<\/strong>: initially <em>lazy training<\/em> (network acts nearly linear\/NTK, fits training data without feature change), later transitions to <em>rich feature learning<\/em>, yielding generalization<a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20test%20loss%20of%20such,x%29%24%20are\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a><a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20network%20to%20generalize%20eventually%2C,teacher%20networks\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a>. 
Key determinants: misalignment of initial kernel and target, dataset size in an intermediate range, and small initial learning rate to enforce lazy start<a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=network%20output,settings%2C%20like%20on%20MNIST%2C%20one\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a><a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20network%20to%20generalize%20eventually%2C,teacher%20networks\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a>. Showed this transition causes test loss to plummet late.<\/td><td>Theoretical analysis on a polynomial regression task with a 2-layer ReLU network; derived sufficient statistics for test loss<a href=\"https:\/\/arxiv.org\/abs\/2310.06110#:~:text=,time%20feature\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/abs\/2310.06110#:~:text=cannot%20be%20explained%20by%20existing,1%29%20the%20top\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Empirical demonstrations on simple tasks and extensions to MNIST and small transformers<a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20network%20to%20generalize%20eventually%2C,teacher%20networks\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a><a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20lazy%20regime%20so%20does,teacher%20networks\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a>.<\/td><td>Provided a <strong>clear mechanistic story<\/strong> for delayed generalization in terms of kernel vs. feature-learning regimes<a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20test%20loss%20of%20such,x%29%24%20are\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a>. Bridged grokking with classical NTK theory. Contribution: identified controllable factors (feature learning rate, etc.) 
to induce or prevent grokking<a href=\"https:\/\/openreview.net\/forum?id=vt5mnLVIVo#:~:text=the%20network%20to%20generalize%20eventually%2C,teacher%20networks\" target=\"_blank\" rel=\"noreferrer noopener\">openreview.net<\/a>. Open question: how to measure \u201clazy vs rich\u201d in large-scale nets in real-time; linking to implicit bias in GD.<\/td><\/tr><tr><td><strong>Nanda et al., 2023<\/strong> \u2013 <em>\u201cProgress Measures for Grokking via Mechanistic Interpretability\u201d<\/em><a href=\"https:\/\/arxiv.org\/abs\/2301.05217#:~:text=study%20the%20dynamics%20of%20training,later%20removal%20of%20memorizing%20components\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><\/td><td>Conducted full <strong>reverse-engineering of a grokking model\u2019s algorithm<\/strong>. Found model learned modular addition via Fourier transforms<a href=\"https:\/\/arxiv.org\/abs\/2301.05217#:~:text=phenomenon%20of%20,from%20the%20gradual%20amplification%20of\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Defined three training phases: <strong>memorization, circuit formation, cleanup<\/strong>, and showed grokking is the result of a <strong>gradual increase of the algorithmic circuit\u2019s strength and removal of memorizing components<\/strong><a href=\"https:\/\/arxiv.org\/abs\/2301.05217#:~:text=study%20the%20dynamics%20of%20training,later%20removal%20of%20memorizing%20components\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a> (not a truly instantaneous jump). Developed continuous progress measures that split training into phases.<\/td><td>Mechanistic interpretability on a small transformer (mod&nbsp;97 addition): traced neuron values, discovered Fourier basis in embeddings<a href=\"https:\/\/arxiv.org\/abs\/2301.05217#:~:text=phenomenon%20of%20,from%20the%20gradual%20amplification%20of\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>; performed ablations and \u201ccircuit tests\u201d (e.g. 
intervening in Fourier-space) to confirm the learned algorithm<a href=\"https:\/\/arxiv.org\/abs\/2301.05217#:~:text=phenomenon%20of%20,from%20the%20gradual%20amplification%20of\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Tracked metrics like circuit strength over thousands of training steps.<\/td><td><strong>Demonstrated that grokking has interpretable internal dynamics<\/strong> (not magic). Introduced the idea of competing circuits (memorization vs. generalization) and provided evidence with clean metrics<a href=\"https:\/\/arxiv.org\/abs\/2301.05217#:~:text=study%20the%20dynamics%20of%20training,later%20removal%20of%20memorizing%20components\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Contribution: showed a path to quantify emergence (progress measures), inspiring others to find such measures in larger models. Open issues: scaling this approach beyond toy settings; identifying progress measures in high-dimensional models.<\/td><\/tr><tr><td><strong>Carvalho et al., 2025<\/strong> \u2013 <em>\u201cGrokking Explained: A Statistical Phenomenon\u201d<\/em><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=test%20set%20loss%20decreases%20sharply,the%20phenomenon%2C%20demonstrating%20that%20while\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=in%20deep%20learning%20networks,convenient%20mechanism%20for%20achieving%20the\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><\/td><td>Argues grokking arises from a <strong>distribution shift between train and test<\/strong>. 
Showed that imbalanced sampling of classes and subclasses can <em>systematically<\/em> produce grokking<a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=in%20deep%20learning%20networks,convenient%20mechanism%20for%20achieving%20the\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=sampling%2C%20and%20the%20other%20investigates,sparse%20data%2C%20we%20demonstrate%20that\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a> \u2013 the model first overfits to frequent substructures, then later leverages relationships to handle rare ones, which shows up as delayed generalization. Demonstrated grokking with <em>dense data<\/em> and minimal regularization if a latent shift exists<a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=Instead%2C%20small,criteria%20in%20future%20training%20processes\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Validated on synthetic datasets (equidistant and equivariant subclass structures) and even induced a grokking-like effect on MNIST via clustered distortions<a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=In%20addition%20to%20the%20synthetic,set%20but%20with%20a%20different\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=distribution%20in%20their%20representations,our%20findings%20beyond%20synthetic%20data\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>.<\/td><td>Statistical analysis and dataset design: created synthetic classification tasks with controllable subclass sampling to induce or remove distribution shifts<a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=We%20posit%20that%20data%20sparsity,sparsity%2C%20enabling%20late%20generalization%20by\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Monitored training dynamics and final accuracy under different sampling regimes. 
Also ran an experiment on a real dataset (MNIST) by clustering digit styles to simulate domain shift<a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=In%20addition%20to%20the%20synthetic,set%20but%20with%20a%20different\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=distribution%20in%20their%20representations,our%20findings%20beyond%20synthetic%20data\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>.<\/td><td>Brought a <strong>data-centric view<\/strong>: highlighted that <em>small data is a proxy for distribution gaps<\/em>, not the sole cause of grokking<a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=sampling%2C%20and%20the%20other%20investigates,sparse%20data%2C%20we%20demonstrate%20that\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=categories%2C%20we%20systematically%20reproduce%20the,parameter%20tuning.%20Our\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Contribution: showed one can trigger or prevent grokking by tweaking data composition, implying potential control over late generalization. Suggests using these insights for <strong>better early stopping<\/strong> criteria, e.g. detecting when a model might still grok by examining its performance on data subsets<a href=\"https:\/\/arxiv.org\/html\/2502.01774v1#:~:text=Instead%2C%20small,criteria%20in%20future%20training%20processes\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Open issues: how well does this generalize to other forms of distribution shift? 
Can we quantify \u201chow much shift causes how much delay\u201d?<\/td><\/tr><tr><td><strong>Miller et al., 2023<\/strong> \u2013 <em>\u201cGrokking Beyond Neural Networks: An Empirical Exploration with Model Complexity\u201d<\/em><a href=\"https:\/\/arxiv.org\/abs\/2310.17247#:~:text=,is%20not%20restricted%20to%20settings\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/abs\/2310.17247#:~:text=but%20occurs%20in%20other%20settings,guided%20by%20complexity%20and%20error\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><\/td><td>Discovered that grokking-like delayed generalization occurs in <strong>non-neural models<\/strong> too: observed in Gaussian Process classifiers, GP regression, linear regression, and Bayesian neural nets<a href=\"https:\/\/arxiv.org\/abs\/2310.17247#:~:text=,is%20not%20restricted%20to%20settings\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Concluded that <em>any learning system where solutions trade off complexity and error could grok<\/em>. Also showed adding extraneous \u201cdecoy\u201d features to the input can induce grokking by encouraging an initial memorizing solution which is later abandoned<a href=\"https:\/\/arxiv.org\/abs\/2310.17247#:~:text=but%20occurs%20in%20other%20settings,guided%20by%20complexity%20and%20error\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>.<\/td><td>Empirical experiments mirroring neural grokking but with other models: e.g. training a GP on a small algorithmic dataset and tracking when its posterior starts to generalize; analytical discussion of linear regression under certain feature setups. 
Interpreted the results from the perspective of a search guided by complexity and error.<\/td><td><strong>Generalized the scope of grokking<\/strong> beyond deep learning<a href=\"https:\/\/arxiv.org\/abs\/2310.17247#:~:text=validation%20set%20long%20after%20the,guided%20by%20complexity%20and%20error\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. This suggests grokking is about <strong>solution selection dynamics<\/strong>, not just a quirk of SGD. Important contribution: implies theories of grokking should also apply to kernel methods and even analytic learners \u2013 a direction for future theoretical work. Open issues: can we formally prove grokking in, say, Gaussian processes or linear models? What does this mean for using grokking to select inductive biases?<\/td><\/tr><tr><td><strong>Fan et al., 2024<\/strong> \u2013 <em>\u201cDeep Grokking: Would Deep Neural Networks Generalize Better?\u201d<\/em><a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=observe%20an%20intriguing%20multi,believe%20our%20work%20is%20the\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><\/td><td>Showed that <strong>deeper networks (12-layer MLPs)<\/strong> not only grok but can have <strong>multiple generalization surges<\/strong> (\u201cmulti-stage grokking\u201d)<a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=observe%20an%20intriguing%20multi,believe%20our%20work%20is%20the\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Noticed a secondary jump in test accuracy in deep nets (rarely seen in shallow nets) and correlated this with <strong>feature rank dynamics<\/strong>: internal feature rank drops at each generalization jump<a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=is%20scarcely%20seen%20on%20shallow,feature%20rank%20and%20generalization%20performance\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. 
Identified a double-descent pattern in feature rank (complexity) corresponding to the grokking stages<a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=is%20scarcely%20seen%20on%20shallow,feature%20rank%20and%20generalization%20performance\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Suggests internal representation compression is an indicator of grokking progress.<\/td><td>Experimental study varying network depth on modular tasks. Measured layer-wise feature rank (via SVD or PCA of activations) throughout training<a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=is%20scarcely%20seen%20on%20shallow,feature%20rank%20and%20generalization%20performance\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Compared training trajectories of deep vs. shallow models, noting differences in test accuracy curves and complexity measures.<\/td><td>Extended grokking analysis to <strong>deep architectures<\/strong>, emphasizing that depth increases the propensity for delayed yet eventual generalization<a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=research%20primarily%20focus%20on%20shallow,Additionally%2C%20we\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Contribution: proposed <strong>feature rank as a proxy<\/strong> for generalization readiness<a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=is%20scarcely%20seen%20on%20shallow,feature%20rank%20and%20generalization%20performance\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a> \u2013 a potential tool for monitoring training. Also connected grokking with the phenomenon of <strong>double descent<\/strong> in a new way (via feature complexity)<a href=\"https:\/\/arxiv.org\/abs\/2405.19454#:~:text=is%20scarcely%20seen%20on%20shallow,feature%20rank%20and%20generalization%20performance\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Open question: can feature rank metrics be used in practice to decide training schedules? 
Why do deep nets have multiple grokking phases \u2013 is it hierarchical learning of sub-concepts?<\/td><\/tr><tr><td><strong>Huang et al., 2024<\/strong> \u2013 <em>\u201cUnified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition\u201d<\/em><a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=framework%20that%20provides%20a%20unified,This\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=four%20distinct%20training%20dynamics%2C%20each,abilities%20in%20Large%20Language%20Models\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><\/td><td>Presented a <strong>unifying framework<\/strong> where a model\u2019s training behavior is governed by competition between <strong>memorization circuits vs. generalization circuits<\/strong><a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=framework%20that%20provides%20a%20unified,This\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. 
Used this to explain three phenomena: grokking (time-based competition where the generalization circuit wins late), double descent (model-size-based competition \u2013 test error spikes when memorization circuits dominate at medium model sizes, then falls as generalization circuits dominate in larger models)<a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=framework%20that%20provides%20a%20unified,This\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=four%20distinct%20training%20dynamics%2C%20each,abilities%20in%20Large%20Language%20Models\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>, and emergent abilities in multi-task LLMs (task-wise competition \u2013 a new ability \u201cemerges\u201d when model\/data scale allows a generalist solution for that task to overcome trivial solutions)<a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=detailed%20analysis%20of%20the%20double,abilities%20in%20Large%20Language%20Models\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Mapped out four regimes of training dynamics (depending on model size and training data quantity): confusion, memorization, grokking, comprehension<a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=initially%20employed%20to%20explain%20grokking%2C,This\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Made testable predictions about double descent thresholds, confirmed by experiments.<\/td><td>Theoretical framework building on prior grokking interpretation, extended to larger-scale phenomena. 
Provided conceptual arguments and some empirical validation on algorithmic tasks with varying model sizes and multi-task setups<a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=initially%20employed%20to%20explain%20grokking%2C,This\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=detailed%20analysis%20of%20the%20double,abilities%20in%20Large%20Language%20Models\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. For emergent abilities, framed algorithmic tasks in a multi-task environment to show how a new task\u2019s performance stays low then jumps as the model scales.<\/td><td><strong>Synthesis contribution:<\/strong> Connected grokking to other deep learning mysteries under one lens<a href=\"https:\/\/arxiv.org\/abs\/2402.15175#:~:text=framework%20that%20provides%20a%20unified,This\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Emphasized the universality of the <em>\u201ctwo-circuits\u201d competition<\/em> as a driver of non-linear generalization effects. Provides a mental model for researchers: e.g., if you see double descent, think of it as \u201cgrokking across model sizes.\u201d It also suggests a practical insight: to avoid poor generalization, ensure conditions where generalization circuits dominate early (more data or regularization to suppress pure memorization). Open issues: how to identify these \u201ccircuits\u201d in real networks; extending the unified framework to continuous spectra of solutions (not just a binary memorization-vs.-generalization split).<\/td><\/tr><tr><td><strong>Li et al., 2025<\/strong> \u2013 <em>\u201cWhere to find Grokking in LLM Pretraining? 
Monitor Memorization-to-Generalization without Test\u201d<\/em><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=Our%20study%2C%20for%20the%20first,conversion%2C%20providing%20a%20mechanistic%20explanation\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=%E2%80%9Cemergence%20of%20generalization%E2%80%9D%20by%20investigating,predict%20the%20generalization%20improvement%20on\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><\/td><td>Provided the first evidence that <strong>grokking happens during large-scale LLM pretraining<\/strong>, though <em>asynchronously across domains<\/em>. Different skill areas in a 7B MoE model \u201cgrok\u201d (show late generalization) at different times<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=parameters,sufficient%20data%20have%20been%20memorized\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. Importantly, introduced <strong>internal routing metrics<\/strong> to detect grokking: as training continues, <strong>Mixture-of-Expert routing patterns become more shared and simpler<\/strong>, indicating a shift from memorizing each example separately to generalizing across examples<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=%E2%80%9Cemergence%20of%20generalization%E2%80%9D%20by%20investigating,predict%20the%20generalization%20improvement%20on\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=choices%20across%20layers,conversion%2C%20providing%20a%20mechanistic%20explanation\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. 
Developed metrics (pathway distance between samples, pathway consistency for single sample) that predict downstream test improvements<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=of%20the%20delayed%20generalization,and%20improve%20the%20generalization%20bound\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a><a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=We%20demonstrate%20their%20capabilities%20to,and%20improve%20the%20generalization%20bound\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>. These metrics allow monitoring generalization <em>without a test set<\/em>. Also grounded findings with a theoretical result linking structured pathways to improved generalization bounds<a href=\"https:\/\/arxiv.org\/html\/2506.21551v1#:~:text=dependent%20on%20training%20data,and%20improve%20the%20generalization%20bound\" target=\"_blank\" rel=\"noreferrer noopener\">arxiv.org<\/a>.<\/td><\/tr><\/tbody><\/table><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>Basic Concepts and Historical Background Definition of Grokking: Grokking refers to a surprising phenomenon of delayed generalization in neural network training. 
A model will perfectly fit the training data (near-100% training accuracy) yet remain at chance-level on the test set&hellip;<\/p>\n","protected":false},"author":4,"featured_media":1732,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[23,3],"tags":[],"class_list":["post-1731","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-academic","category-llm"],"_links":{"self":[{"href":"https:\/\/www.aicritique.org\/us\/wp-json\/wp\/v2\/posts\/1731","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.aicritique.org\/us\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aicritique.org\/us\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aicritique.org\/us\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aicritique.org\/us\/wp-json\/wp\/v2\/comments?post=1731"}],"version-history":[{"count":2,"href":"https:\/\/www.aicritique.org\/us\/wp-json\/wp\/v2\/posts\/1731\/revisions"}],"predecessor-version":[{"id":1734,"href":"https:\/\/www.aicritique.org\/us\/wp-json\/wp\/v2\/posts\/1731\/revisions\/1734"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aicritique.org\/us\/wp-json\/wp\/v2\/media\/1732"}],"wp:attachment":[{"href":"https:\/\/www.aicritique.org\/us\/wp-json\/wp\/v2\/media?parent=1731"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aicritique.org\/us\/wp-json\/wp\/v2\/categories?post=1731"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aicritique.org\/us\/wp-json\/wp\/v2\/tags?post=1731"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}