What does it do?
Predicts the practical viability of any transformer LLM
before you spend GPU/$. Answers questions like "will this model work at L=32K?" or
"should I train custom or use API?" using deterministic Python formulas (TAF: Thermodynamic Attention Framework).
How to use: 7 modes
Profile: paste model id → all recipes at once = TAF Card. Best starting point.
Compare: 2-3 models side-by-side on same recipe. Best when choosing between candidates.
Inspect config: paste raw config.json → tool parses + runs full Profile. For private models, in-development configs, or models not yet on HF Hub.
Ask plain English: free-form question, in-browser LLM picks the recipe. Best for casual exploration.
Recipe + form: manual selection, full parameter control. Best when you want exact control.
Diagnose CLI: generate Python command to measure γ on your local machine (transformers + numpy). Fast ≈5 min CPU; full ≈20-60 min GPU. Output JSON re-uploadable via Inspect.
Phase diagram: scatter plot of 23 panel models on (log θ, γ) plane. Hagedorn line γ=1 separates Phase A from Phase B. Click a dot to load that model into Recipe form.
The 8 recipes available
X-1 Custom training vs API: compares cost of training your own model vs paying for API access.
Try: "Should I train an 8B custom model or use GPT-4o for 50M tokens/month?"
Answer types: YES (custom) / NO (API) with break-even months.
X-2 Long Context Viability: predicts if a model serves a target context length reliably.
Try: "Will Meta-Llama-3-8B handle 32000 tokens for retrieval?"
Chains: γ_Padé → decomposition → d_horizon → NIAH ceiling → hallucination → KV memory.
Verdict: YES / DEGRADED / NO with mitigation if needed.
X-3 Budget pre-flight: given $ budget, what model is feasible to train?
Try: "I have $5000, what model can I train?"
Answer: GO / TINY-MODEL / MEMORY-LIMITED with concrete N (params) and D (tokens).
X-5 Hardware selection: which GPU should I use to serve at target throughput?
Try: "Cheapest hardware to serve Llama-3-8B at 10M tokens/day"
Answer: best GPU + $/Mtok + capacity vs target.
X-19 KV Compression decision: should I use soft decay, hard cutoff, or literature methods?
Try: "How to compress KV cache for Qwen2.5-7B at 32K?"
Answer: USE SOFT DECAY / USE D_f CUTOFF / USE LITERATURE METHODS / USE HARD T_train.
v0.4 (session 29 findings)
What's new in v0.4 (session 29 findings, 2026-04-28): three diagnostic recipes derived from cross-model panel analysis (n=22 LLMs).
X-21 Imprint Purity Diagnostic: predicts γ on RANDOM tokens via ν = −1/(2π); how clean is the model's RoPE prediction?
Try: "How clean is the RoPE prediction on Llama-3-8B?"
Answer: predicted γ_random + purity diagnostic (CLEAN / OVER-IMPRINTED / UNDER-IMPRINTED).
Learned-imprint slope ν = −1/(2π): RoPE rotation period 2π drives a positional bias on weights, proportional to log(N_params). Even random tokens show this scaling. ν is DERIVED, not fitted (empirical err 0.3%).
X-22 Compute-Context Invariant: does γ × log(N²·D) lie in panel band 51.2 ± 16.8? Detects scaling/training anomalies.
Try: "Does Mistral-7B fit the compute-context invariant?"
Answer: K = γ·log(N²·D), z-score, IN-BAND or OUTLIER.
Chinchilla-attention invariant K: γ × log(N²·D) ≈ 51.2 ± 16.8 (CV=0.329). Connects compute scaling and attention exponent into a single dimensionless number.
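The X-22 check can be sketched in a few lines. This is a minimal illustration, not the tool's code: the natural logarithm, the |z| ≤ 1 cutoff for IN-BAND, the function name, and the example inputs are all assumptions.

```python
import math

# Panel band quoted in the text: K = 51.2 +/- 16.8.
K_MEAN = 51.2
K_STD = 16.8

def compute_context_invariant(gamma, n_params, n_tokens, z_max=1.0):
    """Return (K, z, verdict) for K = gamma * log(N^2 * D).

    Assumptions: natural log, and |z| <= z_max counts as IN-BAND.
    """
    k = gamma * math.log(n_params ** 2 * n_tokens)
    z = (k - K_MEAN) / K_STD
    verdict = "IN-BAND" if abs(z) <= z_max else "OUTLIER"
    return k, z, verdict

# Hypothetical inputs: gamma=0.8, a 7B-parameter model, 2T training tokens.
k, z, verdict = compute_context_invariant(gamma=0.8, n_params=7e9, n_tokens=2e12)
print(f"K = {k:.1f}, z = {z:+.2f}, {verdict}")
```

A tighter or looser z_max changes only the verdict, not K or z.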
X-23 IH-Phase Detector: pre- or post-induction-head? Cheap probe via sign(γ_text − γ_random).
Try: "Is Qwen2.5-7B post-induction-head?"
Answer: CONFIRMED PRE-IH / CONFIRMED POST-IH / ANOMALY (with size-vs-Δγ consistency check).
Δγ as IH probe: sign(γ_text − γ_random) > 0 ⟺ post-induction-head. Cheaper than running an in-context-learning benchmark.
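A minimal sketch of the probe follows. The sign test is the one stated above. The consistency check is an assumed reading: it compares the probe against the ~400M-parameter induction-head threshold listed under "Common parameters explained". The function name is hypothetical.

```python
def ih_phase(gamma_text, gamma_random, n_params, size_threshold=400e6):
    """Classify a model as pre- or post-induction-head from the sign of delta-gamma.

    Assumption: the 'size-vs-delta-gamma consistency check' means the sign probe
    should agree with whether n_params is above the ~400M emergence threshold.
    """
    delta = gamma_text - gamma_random
    probe_says_post = delta > 0
    size_says_post = n_params >= size_threshold
    if probe_says_post != size_says_post:
        return delta, "ANOMALY"
    return delta, ("CONFIRMED POST-IH" if probe_says_post else "CONFIRMED PRE-IH")

# Hypothetical measurements for a 7B model.
print(ih_phase(gamma_text=0.62, gamma_random=0.48, n_params=7e9))
```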
γ-cluster on famous constants (intriguing, n=4): CodeLlama-13b γ=0.382 ≈ 1−1/φ (golden conjugate, err 0.0003); pythia-1.4b γ=0.705 ≈ 1/√2; Llama-2-7b γ=0.287 ≈ 1−1/√2; Mistral-Nemo γ=0.428 ≈ log_10(e). Caveat: could be coincidence.
v0.4: New diagnostics (session 31)
Four new diagnostic functions derived in session 31 (2026-04-30) from cross-of-crosses formula games + Socratic interrogation. Available in taf_browser.py §33.
Architectural Concentration: γ_text ≈ γ_Padé − 0.012·n_kv. Cross-panel correlational law (R²=0.30). Caveat: not per-model predictor.
PDI (Padé Deviation Index): PDI = d_horizon_obs/T_eval. Traffic light: green (≈1), orange (>>1), yellow (<<1), red (Phase B negative).
4-bit Shift Predictor: MHA: R²(bf16)<0.9 → γ rises; R²>0.99 → γ drops. GQA: precision-robust regardless.
Critical Exponents Bundle: ν_c, β_c, η_c (=γ−1, CORRECTED), α_C, γ_susc with AM-GM minimum at γ=1−1/√2≈0.293.
v0.5: Machine-verified consistency (session 32)
Sage Groebner basis + Lean Mathlib4 dual-tool verification of 15 algebraic identities of TAF critical exponents. First transformer-attention framework with formal machine-proof backing.
Algebraic Consistency Check: Given measured γ, verifies 12 D-SAGE identities (D-SAGE-1: 2η²+η·γ_χ+1=0, β·Ο=−1, α+Ο=2, etc.). All passing = framework intact. Failures indicate bf16 outliers / quantization artifacts.
D-SAGE-1 (core): Quadratic identity 2η² + η·γ_χ + 1 = 0 (Sage Groebner-discovered, Lean-verified). Replaces incorrect 'triple closure' claim. Refutes paper 1's η=2γ algebraically.
Paper 1 erratum (η correction): Paper 1 originally claimed η = 2γ. Sage Groebner + Lean Mathlib4 proved this fails (residual (-4γ³+5γ+1)/(1-γ) > 0 ∀γ ∈ Phase A). Correct value: η = γ−1, satisfying D-SAGE-1.
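The erratum can be spot-checked numerically without Sage or Lean. The sketch below makes two assumptions: the second symbol in D-SAGE-1 is read as the susceptibility exponent (named gamma_chi here), and Phase A is taken to be 0 < γ < 1. It solves D-SAGE-1 for gamma_chi with η = γ−1, then substitutes η = 2γ and compares the leftover with the residual quoted above.

```python
def gamma_chi(gamma):
    """Solve 2*eta^2 + eta*gamma_chi + 1 = 0 for gamma_chi, with eta = gamma - 1."""
    eta = gamma - 1.0
    return -(2.0 * eta ** 2 + 1.0) / eta

def dsage1_residual(eta, g_chi):
    """Left-hand side of D-SAGE-1; zero when the identity holds."""
    return 2.0 * eta ** 2 + eta * g_chi + 1.0

for gamma in (0.1, 0.3, 0.5, 0.7, 0.9):          # sample points, assumed to lie in Phase A
    g_chi = gamma_chi(gamma)
    ok = dsage1_residual(gamma - 1.0, g_chi)      # eta = gamma - 1 (corrected value)
    bad = dsage1_residual(2.0 * gamma, g_chi)     # eta = 2*gamma (Paper 1 claim)
    quoted = (-4 * gamma ** 3 + 5 * gamma + 1) / (1 - gamma)
    print(f"gamma={gamma:.1f}  corrected={ok:+.1e}  paper1={bad:.4f}  quoted={quoted:.4f}")
```

Under this reading gamma_chi = 2(1−γ) + 1/(1−γ), whose AM-GM minimum sits at γ = 1−1/√2, matching the Critical Exponents Bundle entry.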
Reproducibility: All 15 theorems are machine-proved in Lean Mathlib4 (1973 jobs, build success). Sage script: analysis/sage_recursive_sweep_2026-04-30.sage. Lean code: lean_taf/taf/Taf/Identities.lean.
v0.6: γ predicted-vs-observed + Cardy ΔH + Lean badges
v0.6 (2026-05-06): three new diagnostics live in the TAF Card under Diagnostics. All run in your browser; γ_observed comes from the Diagnose CLI on real weights.
TAF Card layout (new in v0.6)
After clicking Generate full profile the card shows: a hero strip on top (architecture class + meta + 3 pills: aggregate verdict ✓/⚠/✗, γ headline, Anti-Ising if Phase A) and four expandable sections: Recipes (open by default, verdict per dimension), Diagnostics (key numbers, γ predicted vs observed, what-if explorer), Verification (Sage+Lean algebraic consistency, falsification F1-F23), Provenance & share (calibration audit + JSON download / share link / registry submit). Click any header to expand. Every variable has an inline tooltip.
γ predicted vs observed
Enter the empirically-measured γ from your model and the tool computes η = θ_eff_obs / θ_eff_Padé and classifies into one of 5 regimes:
- Normal (η ∈ [0.85, 1.15]): model uses its full nominal context. Use case: validate a new release before adopting it.
- Fraud (η < 0.01): nominal θ inflated; model behaves as if θ ≪ advertised. Use case: detect YaRN/marketing inflation (CodeLlama / Mistral-Nemo pattern).
- Compressed (η < 0.5): context compressed; model attends shorter than nominal θ. Use case: spot RLHF/instruction-tuning compression (LLaMA-2 pattern).
- Over-Padé (η > 1.5): model attends farther than Padé predicts. Use case: identify Lerch-corrected regime or undertrained early checkpoints (pythia-1b pattern).
- SWA random-corpus (γ_obs > 1.05 with random_corpus=Yes): sliding-window attention signature. Use case: confirm Mistral / Gemma SWA on random tokens.
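The five regimes above can be expressed as a small classifier. The thresholds are the ones listed. Everything else is an assumption made for this sketch: the order of the checks (SWA first, then Fraud before Compressed, since both are "η below a bound"), the "unclassified" label for η values that no listed regime covers, and the function name.

```python
def classify_regime(eta, gamma_obs=None, random_corpus=False):
    """Map eta = theta_eff_obs / theta_eff_Pade to one of the listed regimes."""
    if random_corpus and gamma_obs is not None and gamma_obs > 1.05:
        return "swa-random-corpus"
    if eta < 0.01:
        return "fraud"
    if eta < 0.5:
        return "compressed"
    if 0.85 <= eta <= 1.15:
        return "normal"
    if eta > 1.5:
        return "over-pade"
    # Gaps such as 0.5 <= eta < 0.85 are not named in the list above.
    return "unclassified"

for eta in (0.005, 0.3, 1.0, 2.0, 0.7):
    print(eta, classify_regime(eta))
```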
Validity gate (v0.8.9+)
When η falls outside [0.85, 1.15] OR the regime is not normal, the panel shows a warning banner explaining that the closed-form prediction may not apply. Trust the empirical γ in those cases. See docs/LIMITATIONS.md for the full regime-of-validity discussion (closed-form γ assumes natural attention without explicit regularization; ν = -1/(2π) assumes i.i.d. tokens).
Cardy ΔH diagnostic
ΔH_Cardy = log(θ_eff_obs / θ_nominal). Entropy shift between observed effective θ and nominal θ. Strong negative = compression entropy; near zero = nominal match. Complements η for borderline cases.
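As a worked example of the formula, assuming the natural logarithm (the text does not name the base) and hypothetical θ values:

```python
import math

def cardy_delta_h(theta_eff_obs, theta_nominal):
    """Delta-H_Cardy = log(theta_eff_obs / theta_nominal); natural log assumed."""
    return math.log(theta_eff_obs / theta_nominal)

# Hypothetical: a model advertised with theta = 1e6 that behaves like theta = 1e4.
print(round(cardy_delta_h(1e4, 1e6), 3))   # strongly negative: compression
print(cardy_delta_h(5e5, 5e5))             # zero: nominal match
```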
Lean + Mathlib verification badges
TAF identities (Anti-Ising, D-SAGE-1 quadratic, Padé z-substitution, etc.) are formally machine-proven in Lean Mathlib4. Source: github.com/karlesmarin/lean-taf. Anyone can clone + lake build to re-verify. The Anti-Ising pill in the hero strip is one such badge.
Variable glossary (also embedded in TAF Card)
Every variable in the TAF Card has an inline tooltip. The complete list: γ, γ_Padé, γ_decomposed, γ_observed, θ, θ_eff_obs, θ_eff_Padé, η, ΔH_Cardy, Ο, d_horizon, L_NIAH, KV memory, regime. Hover any tooltip icon for the definition + paper section.
Adding new models (3 ways)
- Preset list: 11 popular models curated. Just select from dropdown.
- HF Hub fetch: paste any model id (e.g. Qwen/Qwen2.5-32B-Instruct), click Fetch. Browser downloads config.json directly from HuggingFace and fills the form. Works for any public model.
- Manual: fill the form fields directly with values from the model card.
v0.7: Anti-bullshit pack (4 new modes)
v0.7 (2026-05-06): four new modes that solve concrete pain points reported by the HuggingFace community. Each one runs in your browser with no inference: pure metadata + math.
Context Unmasker
Detects when max_position_embeddings is misleading. Mistral-7B-v0.1 declares 32k but attends within ~4-8k via SWA. Paste an HF model id → 1-second verdict (HONEST / INFLATED / SEVERELY INFLATED / YARN-EXTENDED). Catches SWA, RoPE-scaling (YaRN/linear/dynamic NTK), small-d_head + GQA. Use case: before paying GPU for 32k context, verify the model actually attends that far.
Chat-template Sniffer
Detects which chat-template family a model uses (Llama-3 / ChatML / Mistral / Gemma / Phi-3 / Alpaca / DeepSeek / custom / none) and gives you the exact CLI flag for lm-evaluation-harness, vLLM, and transformers. Solves issue #1841 in lm-eval-harness: forgetting --apply_chat_template silently halves multi-turn accuracy. Use case: before reporting a benchmark score, confirm you applied the template correctly.
Arena-Elo CI Reconstructor
Chatbot Arena strips confidence intervals from its public leaderboard, so a 5-Elo gap can be statistically meaningless. Paste raw pairwise vote data (model_a, model_b, winner) → Bradley-Terry MLE + 200-iteration bootstrap → ranked Elos with 95% CIs and a "statistical ties" panel listing pairs whose CIs overlap. Try the Load sample button. Use case: before declaring "model A beats model B", verify their CIs don't overlap.
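The pipeline described above (Bradley-Terry fit, bootstrap, CI-overlap ties) can be sketched in plain Python. This is an illustration under stated assumptions, not the tool's implementation: it fits the model with the classic minorize-maximize iteration, ignores tied votes, floors strengths so a winless model stays finite, and uses a 1000 + 400·log10 Elo scale. All function names are hypothetical.

```python
import math
import random
from collections import defaultdict

def bradley_terry(votes, iters=200, floor=1e-6):
    """Fit Bradley-Terry strengths by MM iteration; return Elo-like ratings.

    votes: list of (model_a, model_b, winner), winner equal to model_a or model_b.
    """
    models = sorted({m for a, b, _ in votes for m in (a, b)})
    wins, games = defaultdict(float), defaultdict(float)
    for a, b, winner in votes:
        wins[winner] += 1.0
        games[(a, b)] += 1.0
        games[(b, a)] += 1.0
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(games[(i, j)] / (p[i] + p[j]) for j in models if j != i)
            # Floor (an assumption) keeps a winless model away from log(0).
            new_p[i] = max(wins[i] / denom, floor) if denom > 0 else p[i]
        log_mean = sum(math.log(v) for v in new_p.values()) / len(new_p)
        p = {m: v / math.exp(log_mean) for m, v in new_p.items()}
    return {m: 1000.0 + 400.0 * math.log10(p[m]) for m in models}

def bootstrap_ci(votes, n_boot=200, seed=0):
    """95% percentile intervals from resampling the votes with replacement."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(n_boot):
        resample = [rng.choice(votes) for _ in votes]
        for m, elo in bradley_terry(resample, iters=50).items():
            samples[m].append(elo)
    ci = {}
    for m, vals in samples.items():
        vals.sort()
        ci[m] = (vals[int(0.025 * (len(vals) - 1))], vals[int(0.975 * (len(vals) - 1))])
    return ci

def statistical_ties(ci):
    """Pairs of models whose confidence intervals overlap."""
    names = sorted(ci)
    return [(a, b) for k, a in enumerate(names) for b in names[k + 1:]
            if ci[a][0] <= ci[b][1] and ci[b][0] <= ci[a][1]]

# Hypothetical votes: A beats B 60-40, B beats C 90-10.
votes = ([("A", "B", "A")] * 60 + [("A", "B", "B")] * 40
         + [("B", "C", "B")] * 90 + [("B", "C", "C")] * 10)
print(bradley_terry(votes))
print(statistical_ties(bootstrap_ci(votes)))
```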
Contamination Prior
Bayesian-ish prior on whether a benchmark score is contaminated. Enter your model's training cutoff date → tool rates 20+ popular benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MMLU-Pro, GPQA, AIME, MATH-500, BBH, MUSR…) by P(contamination) based on time gap, corpus inclusion, and known leak history. Open LLM Leaderboard v1 was killed in 2024 after MMLU/HellaSwag scores became contaminated. Use case: decide which scores to trust when comparing two models.
Quant-regime Classifier
Predicts γ-shift and ΔPPL for any (model × quant scheme: NF4, AWQ, GPTQ, GGUF Q4_K_M / Q5_K_M / Q8_0, int8, FP8, …). Architecture-aware: small d_head + aggressive GQA → more sensitive; calibrated schemes (AWQ) absorb shift better than uncalibrated (NF4). Recommends safer alternatives if a cliff is detected. Use case: before quantizing, predict whether your specific architecture × scheme combo will keep PPL acceptable, with a concrete switch-to suggestion otherwise.
Cross-framework Drift Bound
Same model, different scores on different setups. Tool predicts the maximum drift admissible from numerical noise alone (dtype, framework, batch). If the observed gap exceeds it → real bug, typically chat-template mismatch (lm-eval-harness issue #1841) or KV-cache layout. Try the "Load sample" button for the canonical chat-template bug. Use case: before reporting a regression or claiming reproducibility, verify whether the gap between two evals is bigger than what numerical noise can explain.
NIAH → Reasoning Gap
RULER paper (NVIDIA 2024) shows that long-context models often pass NIAH (needle retrieval) but fail multi-hop reasoning at the same context. Tool predicts both pass rates from architecture (γ_Padé + d_horizon + arch pressure: small d_head, GQA, SWA), reports the gap, and finds your model's "safe reasoning context" where reasoning stays ≥65%. Sweep mode shows the curve across 1k/4k/16k/64k/T_train. Use case: before deploying at the claimed context, find out whether the model will actually reason there or just retrieve.
Benchmark Saturation Detector
MMLU is saturated (top 88-94%), AIME 2025 saturated within months of release, HumanEval near-saturated. Pick any benchmark and the tool returns top-3 frontier scores, spread, mean, and a verdict (saturated / near-saturated / discriminative) plus a recommended replacement (e.g. MMLU → MMLU-Pro / GPQA / HLE). Live fetch from DemandSphere AI Frontier Tracker (CC BY-NC 4.0) when reachable; baked 2026-05-05 snapshot when not. Use case: before you cite '92% on MMLU' or design an eval, check whether the benchmark still discriminates anything.
JSON CoT-aware Linter
Constrained-decoding engines (llguidance, Outlines, SGLang grammars) emit JSON properties in the order your schema declares them. If you write { answer, reasoning } the model commits to answer first and CoT collapses into post-hoc justification. Paste any schema (or example response) → the linter classifies each field as reasoning, answer, or other, flags the ordering, and emits a reordered fix you can copy back. Use case: 'My CoT prompt works in plaintext but degrades under JSON mode' → run linter, find the inverted order, fix.
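The ordering check described above is easy to prototype. In the sketch below the keyword lists, the function names, and the choice to place "other" fields between reasoning and answer are illustrative assumptions. It relies on Python's json module preserving key order.

```python
import json

# Hypothetical keyword hints; a real linter would likely use a richer classifier.
REASONING_HINTS = ("reason", "thought", "think", "analysis", "rationale", "explanation")
ANSWER_HINTS = ("answer", "result", "final", "verdict", "label")

def classify_field(name):
    low = name.lower()
    if any(h in low for h in REASONING_HINTS):
        return "reasoning"
    if any(h in low for h in ANSWER_HINTS):
        return "answer"
    return "other"

def lint_schema(schema_text):
    """Flag schemas where an answer field is declared before a reasoning field."""
    schema = json.loads(schema_text)
    props = schema.get("properties", schema)   # accept a JSON Schema or an example response
    names = list(props)                        # declaration order is preserved
    kinds = [classify_field(n) for n in names]
    first_answer = kinds.index("answer") if "answer" in kinds else None
    inverted = first_answer is not None and "reasoning" in kinds[first_answer + 1:]
    reordered = ([n for n, k in zip(names, kinds) if k == "reasoning"]
                 + [n for n, k in zip(names, kinds) if k == "other"]
                 + [n for n, k in zip(names, kinds) if k == "answer"])
    return {"kinds": dict(zip(names, kinds)), "inverted": inverted,
            "suggested_order": reordered if inverted else names}

bad = '{"properties": {"answer": {"type": "string"}, "reasoning": {"type": "string"}}}'
print(lint_schema(bad))
```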
PEFT Anti-Pattern Checker
PEFT's get_peft_model(base, config) creates a FRESH adapter; it does not load saved weights from a path. Users who paste tutorial code and try to resume from a checkpoint silently throw away their training. peft #2115 has the canonical bug report. This linter scans your training script for the pattern + 3 related issues (QLoRA ordering, target_modules/arch mismatch, lora_alpha ratio) and reports findings with line numbers and suggested fixes. Use case: before you launch a 10-hour LoRA fine-tune, paste your script → catch the silent bugs in 200ms.
Prompt-Cache Diff Predictor
Provider prompt caches each have different rules: Anthropic's cache_control breaks at the first token diff in the marked prefix; OpenAI auto-caches prefixes ≥1024 tokens; Gemini context caches require ≥32K tokens. A misplaced edit silently 10x's your bill: the API never warns you, and the cost only shows up on the next invoice. Paste old + new prompt, the predictor finds the longest common prefix, estimates tokens with three tokenizer profiles (English / code / CJK), and shows per-provider hit ratio + $ delta vs no-cache for Claude Opus/Sonnet/Haiku, GPT-5/mini, and Gemini 2.5 Pro. Use case: 'I tweaked the system prompt and the bill jumped, what broke?' → paste both prompts, see exactly which provider stopped caching.
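The core of the predictor is a longest-common-prefix computation. The sketch below assumes a flat 4-characters-per-token estimate (a rough English rule of thumb, not one of the tool's tokenizer profiles) and applies only the 1024-token OpenAI threshold quoted above. Function names are hypothetical.

```python
def common_prefix_len(old, new):
    """Number of leading characters shared by both prompts."""
    n = 0
    for a, b in zip(old, new):
        if a != b:
            break
        n += 1
    return n

def cache_report(old, new, chars_per_token=4.0, openai_min_tokens=1024):
    """Estimate how much of the new prompt can still be served from a prefix cache."""
    prefix_tokens = int(common_prefix_len(old, new) / chars_per_token)
    total_tokens = max(1, int(len(new) / chars_per_token))
    return {
        "shared_prefix_tokens": prefix_tokens,
        "hit_ratio": prefix_tokens / total_tokens,
        "openai_auto_cache_eligible": prefix_tokens >= openai_min_tokens,
    }

system = "You are a helpful assistant. " * 300       # hypothetical long system prompt
print(cache_report(system + "Question A", system + "Question B"))   # edit at the end
print(cache_report("v1 " + system, "v2 " + system))                 # edit at the start
```

Editing the first characters of a prompt drops the shared prefix to almost nothing, which is the "misplaced edit" case described above.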
Speculative-Decode Compatibility
Speculative decoding only works if target and draft share the exact same vocabulary. Mismatched vocabs cause every draft token to be rejected, so you pay BOTH compute costs and get worse throughput than baseline. Worse, the system still emits correct output (just slower), so the bug is invisible in unit tests. vLLM #4570 / #16757 / #20409 / #12488 all surface variants. This tool fetches `tokenizer.json` from HF Hub for both model ids, compares tokenizer type, vocab size, full token→id map, special tokens, and added tokens, then estimates a speedup band based on param ratio and typical α=0.5/0.7/0.85 acceptance rates. Use case: before you launch a vLLM cluster with spec-dec enabled, verify the pair is actually compatible.
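A speedup band of this kind can be estimated with the standard expression from the speculative-decoding literature (Leviathan et al., 2023): expected tokens per target step divided by the relative cost of that step. Whether the tool uses exactly this formula is an assumption, as are the draft length k, the use of the parameter ratio as the cost ratio, and the function names.

```python
def expected_speedup(alpha, k=4, cost_ratio=0.1):
    """Expected wall-clock speedup of speculative decoding.

    alpha: per-token acceptance rate (0 <= alpha < 1)
    k: draft tokens proposed per step (assumed)
    cost_ratio: draft cost / target cost per token (assumed ~ param ratio)
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    tokens_per_step = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    return tokens_per_step / (k * cost_ratio + 1.0)

def vocab_compatible(vocab_a, vocab_b):
    """True only if both token -> id maps are identical."""
    return vocab_a == vocab_b

for alpha in (0.5, 0.7, 0.85):               # the typical acceptance rates quoted above
    print(alpha, round(expected_speedup(alpha), 2))
```

With alpha near 0, as happens when vocabularies mismatch, the estimate falls below 1: slower than running the target alone.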
Multilingual Tokenizer Tax
Tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ in Chinese on a Latin-trained tokenizer (Llama, Phi). Both cost-per-request AND effective context degrade silently. This tool loads HuggingFace transformers.js in your browser (~750 KB CDN) and tokenizes pasted text against 6 preset vendor tokenizers (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude approx). Use case: 'My multilingual support added 30% to the bill, which language costs the most?' → paste real production text, see exact per-tokenizer breakdown.
LongScore
Every long-ctx LLM claims 128K but degrades long before that. The 100-LongBench paper (ACL 2025, arXiv:2505.19293) noticed that raw long-ctx scores are dominated by base ability. They propose LongScore: LC_l = (S_l − Base) / Base with Base = mean(S_short), then average over long lengths. This tafagent mode embeds LongScore-ready data: RULER aggregate per-context (n=33 models, 4K-128K) + HELMET aggregate at 128K (n=60 models, 7 task categories). Lookup is exact-match by HF model id. Use case: 'I want to use Llama-3.1-70B-Instruct for 100K-token doc summarization, how much accuracy do I actually lose?' → paste id, see -10% LongScore (moderate degradation, mostly the 128K cliff).
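The LongScore formula is short enough to show directly. The scores below are made up for illustration; which context lengths count as "short" and "long" is left to the caller.

```python
def long_score(short_scores, long_scores):
    """LongScore: mean over long lengths of LC_l = (S_l - Base) / Base.

    short_scores: list of scores at short contexts (Base is their mean)
    long_scores: dict mapping context length -> score
    """
    base = sum(short_scores) / len(short_scores)
    per_length = {length: (s - base) / base for length, s in long_scores.items()}
    return sum(per_length.values()) / len(per_length), per_length

# Hypothetical model: 90 at short contexts, 81 at 64K, 72 at 128K.
score, per_length = long_score([90.0, 90.0], {65536: 81.0, 131072: 72.0})
print(f"LongScore = {score:+.0%}", per_length)
```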
Solutions Hub
tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. Use case: 'I have problem X: does tafagent solve it, and if not, who does?'
v0.9: YaRN / RoPE Context-Extension Planner
v0.9 (2026-05-23): the most-asked HuggingFace question, "how do I set rope_scaling to extend context, and will it actually work?", answered with a copy-paste config snippet AND a TAF quality verdict. Browser-only, no inference.
YaRN / RoPE Context-Extension Planner
The dozen GGUF/VRAM calculators on HF (NyxKrage, oobabooga, DavidAU, …) all answer the same question: does context length L fit in my GPU? None answer the harder one: does L fit AND still work? Enter a model id (or its θ + trained context) and a target length L. The planner computes the extension factor, emits the exact rope_scaling block for transformers ≥4.43 (yarn / linear / dynamic / llama3, with paper-default β ramps), then runs TAF's γ_Padé / d_horizon math: γ with no extension (the problem), γ after the chosen method (the fix), the effective attention horizon, and a verdict: HEALTHY / USABLE-WITH-CARE / NEEDS-FINETUNE / DEGRADES. It flags the θ_eff≈θ·factor estimate and the >4× fine-tune requirement honestly. Use case: 'I want Mistral-7B (θ=10k, 8k trained) at 32k' → see γ collapse from naive use, YaRN partially recover it, and get the exact config to paste. Or 'Qwen2.5 at 128k' → discover its θ=1e6 already covers it, no aggressive scaling needed.
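The config-emitting half of the planner can be sketched as below. The key names (rope_type, factor, original_max_position_embeddings, beta_fast, beta_slow) follow the transformers rope_scaling convention, and beta_fast=32 / beta_slow=1 are the YaRN defaults. The function itself, the θ_eff estimate, and the omission of the llama3 method are assumptions of this sketch; the γ_Padé / d_horizon verdict is not reproduced.

```python
import json

def plan_rope_scaling(theta, trained_ctx, target_ctx, method="yarn"):
    """Build a rope_scaling block for extending trained_ctx to target_ctx.

    Supports 'yarn', 'linear' and 'dynamic' only (llama3 needs extra keys).
    """
    factor = target_ctx / trained_ctx
    if factor <= 1.0:
        return {"factor": factor, "rope_scaling": None, "needs_finetune": False}
    block = {"rope_type": method, "factor": float(factor)}
    if method == "yarn":
        block.update({"original_max_position_embeddings": trained_ctx,
                      "beta_fast": 32, "beta_slow": 1})
    return {
        "factor": factor,
        "rope_scaling": block,
        "theta_eff_estimate": theta * factor,   # the theta_eff ~ theta * factor estimate
        "needs_finetune": factor > 4.0,         # the >4x rule quoted above
    }

plan = plan_rope_scaling(theta=10000.0, trained_ctx=8192, target_ctx=32768)
print(json.dumps(plan["rope_scaling"], indent=2))
```

The printed block is what would go under "rope_scaling" in config.json. A target at or below the trained context returns no block at all.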
GGUF Validity Bridge
The dozen GGUF/VRAM calculators (NyxKrage, oobabooga, …) read a .gguf header to tell you if a quant fits in your GPU. This reads the same header (via HTTP Range, so no multi-GB download) and answers the question they skip: does it fit AND still work? Paste a GGUF repo, pick a quant file; the bridge pulls rope_theta, context_length, the quant scheme (from general.file_type or the filename), and head geometry, then runs TAF's γ_Padé / d_horizon plus the architecture-aware quant-regime γ-shift. Output: effective attention horizon at the trained context, how far the quant erodes γ (and ΔPPL) for this model, and a verdict: HEALTHY / USABLE-WITH-CARE / DEGRADES. Use case: 'unsloth/Qwen3.5-9B-GGUF Q4_K_M fits 8GB, but is it brain-dead past 30K?' → see the horizon and the Q4 γ-penalty before you download 6 GB.
Launch-Flag Generator
The VRAM calculators tell you whether a model fits; they don't hand you the command. This does. Pick a model (fetches geometry from HF config.json), a quant, a GPU and a target context; it computes the VRAM breakdown (weights + KV cache + scratch), how many layers to offload (-ngl), and emits the copy-paste llama-server and Ollama commands with -c context, -fa flash-attention, KV-cache type, and --no-mmap (the Blackwell OOM fix: force all weights into physical VRAM). Plus the TAF reality check no calculator gives: if you're allocating KV for a context past the model's d_horizon, it warns you that memory is wasted: the attention won't reach there. Use case: 'What -ngl for Llama-70B-Q4 on my 4090?' → 39 of 80 layers, exact command, and a note if your context is past the usable horizon.
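The VRAM breakdown behind -ngl can be approximated as follows. This sketch assumes an fp16 KV cache by default, a fixed 1 GiB scratch reserve, and that each offloaded layer carries an equal share of weights and KV cache. Real runtimes differ in the details, so treat the layer count as a starting point. Function names are hypothetical.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context, bytes_per_elem=2):
    """KV cache size: one K and one V tensor per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem

def gpu_layers(weights_bytes, kv_bytes, n_layers, vram_bytes, scratch_bytes=GIB):
    """How many layers fit on the GPU (a candidate value for -ngl).

    Assumes weights and KV cache are spread evenly across layers.
    """
    per_layer = (weights_bytes + kv_bytes) / n_layers
    free = vram_bytes - scratch_bytes
    return max(0, min(n_layers, int(free // per_layer)))

# Hypothetical 8B-class geometry: 32 layers, 8 KV heads, head_dim 128, 8K context.
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context=8192)
print(kv / GIB, "GiB of KV cache")
print(gpu_layers(weights_bytes=int(4.5 * GIB), kv_bytes=kv, n_layers=32, vram_bytes=24 * GIB))
```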
The audit chain
Every result shows the full Computation Chain: each formula step with its inputs,
output, and interpretation. Click any step to expand. Cited section numbers (§26.1, §19.1, etc.) refer
to the underlying paper for derivation.
The plain-English answer
After the deterministic chain runs, an in-browser LLM (Qwen2.5-0.5B, ~350MB cached after first load)
synthesizes a plain-English summary. The numbers above are always correct (deterministic Python);
the synthesis is LLM-generated, so verify against the chain if in doubt.
Common parameters explained
- θ (rope_theta): RoPE base frequency. Higher = more long-range capacity. Typical: 10000 (early), 500000 (Llama-3), 1000000 (Qwen2.5).
- T_train: max context the model was trained on. From max_position_embeddings.
- T_eval: your target inference context length. The key knob.
- n_kv_heads < n_attention_heads: model uses GQA (Grouped Query Attention). Reduces KV memory but pushes γ toward Hagedorn.
- has_SWA: model uses Sliding Window Attention (Mistral, gemma-2).
- n_params: total parameter count. Threshold ~400M for induction-head emergence.
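Most of these parameters can be read straight from a Hugging Face config.json. The sketch below uses the usual Llama/Mistral-style key names. The 10000 default for a missing rope_theta, the handling of use_sliding_window, and the function name are assumptions; T_eval and n_params are not in config.json and must come from the user.

```python
import json

def read_taf_inputs(config_text):
    """Extract the TAF form parameters from the text of a config.json."""
    cfg = json.loads(config_text)
    n_heads = cfg.get("num_attention_heads")
    n_kv = cfg.get("num_key_value_heads", n_heads)   # absent key means plain MHA
    return {
        "theta": cfg.get("rope_theta", 10000.0),
        "T_train": cfg.get("max_position_embeddings"),
        "n_attention_heads": n_heads,
        "n_kv_heads": n_kv,
        "uses_gqa": n_heads is not None and n_kv is not None and n_kv < n_heads,
        # Some configs carry a sliding_window value but disable it explicitly.
        "has_SWA": cfg.get("sliding_window") is not None
                   and cfg.get("use_sliding_window", True),
    }

# Hypothetical Llama-3-style config fragment.
example = json.dumps({"rope_theta": 500000.0, "max_position_embeddings": 8192,
                      "num_attention_heads": 32, "num_key_value_heads": 8})
print(read_taf_inputs(example))
```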
v0.9: Architecture-aware + reality-check
Three additions: a memory-architecture classifier, a predicted-vs-measured comparator with confidence, and a confidence score on predictions. All browser-only, no inference.
Memory Reality Check: detects the architecture (full-attention / SSM / RWKV / linear / TTT / hybrid) from config.json and tells you what its "context length" actually means and how it fails (exact-recall loss, sink, fixed-state compression). No inference.
Try: state-spaces/mamba-2.8b-hf → State-space: fixed-size lossy state → exact needle-recall fails; test with single-needle NIAH.
Prediction vs Reality: compares TAF's closed-form predictions against MEASURED values (the shipped dataset or your Diagnose-CLI JSON), with a confidence score, and lets you contribute your measurement back to the public dataset, server-less.
Try: paste a Diagnose-CLI JSON → table of γ predicted vs measured (Δ, within-tolerance ✓), then "Contribute" to PR it to the public dataset.
Confidence score: every X-2 viability verdict + Memory Reality + Prediction-vs-Reality carries a 0-100% confidence with a ✓/✗ factor checklist (γ measured vs closed-form, validated regime, benchmark available, calibration). Predictions are never shown as absolute truth.
What to look for in verdicts
- YES / GO: proceed with confidence; numbers support the choice.
- DEGRADED / TINY-MODEL: works but with caveats; read the action.
- NO / MEMORY-LIMITED: don't proceed as-is; mitigation provided.
Privacy
Everything runs in your browser. No telemetry, no analytics, no data sent anywhere. Even the LLM model
runs locally via WebGPU/WebAssembly. Your model_ids and questions never leave this page.
Source & paper
Source code: github.com/karlesmarin/tafagent
Paper: Marin 2026, Predicting How Transformers Attend (Zenodo; arXiv forthcoming)
Dataset: taf-attention-decay, 58 γ-measurements across 32 models (CC-BY-4.0)