Bookmarks Menu
Mozilla Firefox
- Get Help
- Customize Firefox
- Get Involved
- About Us
Bookmarks Toolbar
- arxiv-sanity
- Modal Mapping / Learn Vimscript the Hard Way
foundations
- Applied Linear Algebra Lecture Series – LessWrong
- Six (and a half) intuitions for SVD – LessWrong
- Some ML-Related Math I Now Understand Better – LessWrong
- Streamlit
- Linear Algebra for Programmers | Build a foundation for Machine Learning, AI and other areas
- Maximum likelihood estimation for the beer example model - YouTube
- Why is a likelihood not a probability distribution? - YouTube
- QuantFinance/notebooks/Module 2.3.1 - Statistical Significance.ipynb at master · PythonCharmers/QuantFinance · GitHub
- [1604.06737] Entity Embeddings of Categorical Variables
- Mode Collapse and the Norm One Principle – LessWrong
- Basic Facts about Language Model Internals – LessWrong
- Residual stream norms grow exponentially over the forward pass – LessWrong
- [1706.02515] Self-Normalizing Neural Networks
- [2303.01610] Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers
- Shtetl-Optimized » Blog Archive » The First Law of Complexodynamics
- Taking the parameters which seem to matter and rotating them until they don't – LessWrong
- A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features'
- [2403.20284v1] LayerNorm: A key component in parameter-efficient fine-tuning
- Superposition, Memorization, and Double Descent
- Softmax Linear Units
- Toy Models of Superposition
- [1805.07091] Tropical Geometry of Deep Neural Networks
- Singular Learning Theory - LessWrong
- [2008.02217] Hopfield Networks is All You Need
- arxiv-sanity
- Ch 9: Convolutional Networks - YouTube
- 2003.04887v2.pdf
- [1603.07285] A guide to convolution arithmetic for deep learning
- Jensen–Shannon Divergence - Notes on AI
- Kaiji Opening 01 HD - Mirai wa Bokura no Te no Naka - YouTube
- GitHub - sshuttle/sshuttle: Transparent proxy server that works as a poor man's VPN. Forwards over ssh. Doesn't require admin. Works with Linux and MacOS. Supports DNS tunneling.
- I Wrote a Faster Sorting Algorithm | Probably Dance
- Deconvolution and Checkerboard Artifacts
- fast.ai – AdamW and Super-convergence is now the fastest way to train neural nets
- https://www.youtube.com/watch?feature=shared&t=1542&v=0_BBRNYInx8
- 1902.05522v2.pdf
advanced transformer
- A Walkthrough of A Mathematical Framework for Transformer Circuits - YouTube
- [1711.03953] Breaking the Softmax Bottleneck: A High-Rank RNN Language Model
- Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions - ACL Anthology
- [2404.07647] Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck
- Basic Facts about Language Model Internals – LessWrong
- LLM.int8() and Emergent Features – Tim Dettmers
- Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic – LessWrong
- [2203.11355] Origami in N dimensions: How feed-forward networks manufacture linear separability
- [1805.06576] Mad Max: Affine Spline Insights into Deep Learning
- [1609.09106] HyperNetworks
- [1607.06450] Layer Normalization
- Estimating efficiency improvements in LLM pre-training – LessWrong
- [2302.01834] Coinductive guide to inductive transformer heads
- Re-Examining LayerNorm – LessWrong
- Interpreting Neural Networks through the Polytope Lens – LessWrong
- Branch Specialization
- Activation Atlas
- What are the tradeoffs of increasing the depth of a neural network? - Quora
- Intro to Optimization in Deep Learning: Busting the Myth About Batch Normalization
- Keeping up with AGI
- GradIEEEnt half decent
- deep learning - Why do ResNets avoid the vanishing gradient problem? - Artificial Intelligence Stack Exchange
- 1711.00165v3.pdf
- A Visual Exploration of Gaussian Processes
- [1806.07572] Neural Tangent Kernel: Convergence and Generalization in Neural Networks
- [2012.00152] Every Model Learned by Gradient Descent Is Approximately a Kernel Machine
- [2212.13881] Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features
- [2111.00034] Neural Networks as Kernel Learners: The Silent Alignment Effect
- OyedotunAl IsmaeilAouada_NeurocomputingJournal.pdf
- Extending Context is Hard...but not Impossible† - Kaio
- 1906.00904v2.pdf
- Tropical Geometry of Deep Neural Networks - 1805.07091v1.pdf
- 2211.12312v1.pdf
- Keeping NN Simple by Minimizing the Description Length of the Weights - colt93.pdf
- Work dumber not smarter – LessWrong
- 1603.05027v3.pdf
- Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice - NIPS-2017-resurrecting-the-sigmoid-in-deep-learning-through-dynamical-isometry-theory-and-practice-Paper.pdf
- Lecture 4: Signal Propagation and Dynamical Isometry in Deep Neural Networks (English) - YouTube
- [2210.01245] Random orthogonal additive filters: a solution to the vanishing/exploding gradient of deep neural networks
- 2402.03332v1.pdf
counterintuitive
- The Bitter Lesson
- The Scaling Hypothesis · Gwern.net
- Deep Double Descent: Where Bigger Models and More Data Hurt
- Single Headed Attention RNN: Stop Thinking With Your Head
- Stanford CS25: V1 I Transformer Circuits, Induction Heads, In-Context Learning - YouTube
- The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
- Uncovering Mesa-Optimization Algorithms in Transformers
- On Emergent Abilities, Scaling Architectures and Large Language Models – Yi Tay
- Are Emergent Abilities of Large Language Models a Mirage?
- Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models
- Language Models Represent Space and Time
- Invitation to Imitation
- Energy Based Models
- GitHub - PythonCharmers/QuantFinance: Top training materials in quantitative finance
- othello_world/mechanistic_interpretability/tl_initial_exploration.py at master · likenneth/othello_world · GitHub
- [2404.05729] Finding Visual Task Vectors
- [2404.05405] Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
- [2106.03216] On Memorization in Probabilistic Deep Generative Models
- [2404.06773] Adapting LLaMA Decoder to Vision Transformer
- The British-American coup that ended Australian independence | John Pilger | The Guardian
- megamined/deepseek-math-7b-rl-8.0bpw-exl2 at main
- [2404.12358] From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function
- 2207.00729.pdf
- 2305.12766.pdf
- 2305.12907.pdf
- 2404.07544.pdf
- 2306.04637.pdf
- Why is in-context learning lower quality than fine-tuning? And…what if it wasn't? · Hazy Research
- On Average, You're Using the Wrong Average: Geometric & Harmonic Means in Data Analysis | by Daniel McNichol | Towards Data Science
- the Stolarsky mean - YouTube
- Riemannian manifold - Wikipedia
- Fréchet mean - Wikipedia
- Root mean square - Wikipedia
- Mean - Wikipedia
- Lesson 8 - Practical Deep Learning for Coders 2022 - YouTube
- [2404.03626v1] Training LLMs over Neurally Compressed Text
- [2404.14408] SpaceByte: Towards Deleting Tokenization from Large Language Modeling
- GitHub - neelnanda-io/TransformerLens: A library for mechanistic interpretability of GPT-style language models
- main_notes.pdf
- 6.2 Logistic Regression and the Cross Entropy Cost
- Detexify LaTeX handwritten symbol recognition
TRANSFORMERS TRANSFORMERS TRANSFORMERS
- Welcome To Colab - Colab
- Let's build GPT: from scratch, in code, spelled out. - YouTube
- The spelled-out intro to neural networks and backpropagation: building micrograd - YouTube
- gpt-dev.ipynb - Colab
- GitHub - fastai/fastbook: The fastai book, published as Jupyter Notebooks
- 10_nlp.ipynb - Colab
- 12_nlp_dive.ipynb - Colab
- 13_convolutions.ipynb - Colab
- Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic – LessWrong
- A Mathematical Framework for Transformer Circuits
- Transformer Circuits Thread
- What is a Transformer? (Transformer Walkthrough Part 1/2) - YouTube
- Implementing GPT-2 From Scratch (Transformer Walkthrough Part 2/2) - YouTube
- Interpreting Neural Networks through the Polytope Lens – LessWrong
- Some ML-Related Math I Now Understand Better – LessWrong
- Re-Examining LayerNorm – LessWrong
- Mechanistic Interpretability Quickstart Guide – Neel Nanda
- Concrete Steps to Get Started in Transformer Mechanistic Interpretability – Neel Nanda
- main_notes.pdf
- 6.2 Logistic Regression and the Cross Entropy Cost
- Stanford CS25: V2 I Introduction to Transformers w/ Andrej Karpathy - YouTube
- But what is a GPT? Visual intro to transformers | Chapter 5, Deep Learning - YouTube
- Discord | Your Place to Talk and Hang Out
- NumPy quickstart – NumPy v1.26 Manual
- Exploring Low Rank Training of Deep Neural Networks - 2209.13569.pdf
- Increasing Learning Efficiency of Self-Attention Networks through Direct Position Interactions, Learnable Temperature, and Convoluted Attention - 2020.coling-main.324.pdf
- GitHub - EGjoni/DRUGS: Stop messing around with finicky sampling parameters and just use DRµGS!
TRANSFORMERS2
- Let's build GPT: from scratch, in code, spelled out. - YouTube
- 14_resnet.ipynb - Colab
- 15_arch_details.ipynb - Colab
- 16_accel_sgd.ipynb - Colab
- 18_CAM.ipynb - Colab
- 10_nlp.ipynb - Colab
- 12_nlp_dive.ipynb - Colab
- Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic – LessWrong
- A Mathematical Framework for Transformer Circuits
- What is a Transformer? (Transformer Walkthrough Part 1/2) - YouTube
- Implementing GPT-2 From Scratch (Transformer Walkthrough Part 2/2) - YouTube
- Einops tutorial, part 1: basics - Einops
- Concrete Steps to Get Started in Transformer Mechanistic Interpretability – Neel Nanda
- Streamlit
- [1.1] Transformer from Scratch (exercises).ipynb - Colab
- main_notes.pdf
- 6.2 Logistic Regression and the Cross Entropy Cost
- pyvene/tutorials/advanced_tutorials/DAS_Main_Introduction.ipynb at main · stanfordnlp/pyvene · GitHub
- 5613__what_learning_algorithm_is_in.pdf
- MATS Program
- ANT-intro - Dropbox
- 2112.04426v3.pdf
- The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time.
- Kullback-Leibler Divergence Explained – Count Bayesie
- 2012.14913v2.pdf
- [2404.15758] Let's Think Dot by Dot: Hidden Computation in Transformer Language Models
- Neural Networks – PyTorch Tutorials 2.3.0+cu121 documentation
- Transformer Inference Arithmetic | kipply's blog
- GitHub - konstmish/prodigy: The Prodigy optimizer and its variants for training neural networks.
- A basic introduction to NumPy's einsum โ ajcr โ Haphazard investigations
- [2404.12224v1] Length Generalization of Causal Transformers without Position Encoding
- You should (usually) log transform your positive data | Statistical Modeling, Causal Inference, and Social Science
- Simple Linear Regression
- 2404.17625v1.pdf
- Xournal++ - Xournal++
- Differential Manifolds (Pure and Applied Mathematics, Volume 138): Kosinski, Antoni A.: 9780124218505: Amazon.com: Books
- kipply's blog
- 2311.01460v1.pdf
- 2203.00555v1.pdf
- NeurIPS-2023-vanillanet-the-power-of-minimalism-in-deep-learning-Paper-Conference.pdf
- [2302.10322] Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
- [2402.16788] Why Transformers Need Adam: A Hessian Perspective
- [2305.02790] BranchNorm: Robustly Scaling Extremely Deep Transformers
- 2203.03466v2.pdf
- A Comprehensive Mechanistic Interpretability Explainer & Glossary – Neel Nanda
- Transformer Inference Arithmetic | kipply's blog
- Untitled13.ipynb - Colab
- Linear Transformers Are Secretly Fast Weight Memory Systems (Machine Learning Paper Explained) - YouTube
- Deep Networks Are Kernel Machines (Paper Explained) - YouTube
- Predictive Coding Approximates Backprop along Arbitrary Computation Graphs (Paper Explained) - YouTube
- Yann LeCun - Self-Supervised Learning: The Dark Matter of Intelligence (FAIR Blog Post Explained) - YouTube
- ⚡Lightning AI - Pricing
- Intrinsic dimension of data representations in deep neural networks - NeurIPS-2019-intrinsic-dimension-of-data-representations-in-deep-neural-networks-Paper.pdf
- Welcome to Spinning Up in Deep RL! – Spinning Up documentation
- [Prob&Stats] 3. Expected Value, Variance, and Standard Deviation | by jun94 | jun-devpBlog | Medium
- Why Momentum Really Works
- DL-Tutorial-NIPS2015.pptx - DL-Tutorial-NIPS2015.pdf
- machine learning - Why do we have to divide by 2 in the ML squared error cost function? - Data Science Stack Exchange
- Writing Custom Datasets, DataLoaders and Transforms – PyTorch Tutorials 2.3.0+cu121 documentation
- 2304.14802v1.pdf
- Bayes theorem, the geometry of changing beliefs - YouTube
- Bayes Theorem & Confusion Matrix. Bayes' theorem is crucial for… | by Max Kleiner | Medium
- Ben Lambert - YouTube
- A Student's Guide to Bayesian Statistics - YouTube
llms
- MiniCPM: Unveiling the Potential of End-side Large Language Models
- Stop Shuffling
- [2402.01878] LiPO: Listwise Preference Optimization through Learning-to-Rank
- [2402.13753] LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
- OLMO
- Yi Pretraining
- Question · Issue #6 · pratyushasharma/laser · GitHub
- main : add Self-Extend support by ggerganov · Pull Request #4815 · ggerganov/llama.cpp · GitHub
- Quiet-STaR
- AstraMindAI/AstraQuasar-4B · Hugging Face
- Deepseek Pretraining
- [2312.04455] Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use
- [2305.10429] DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
- [2402.03620] Self-Discover: Large Language Models Self-Compose Reasoning Structures
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases - 2402.14905.pdf
- NEFTune
- A Review of Public Japanese Training Sets · AUGMXNT/shisa Wiki · GitHub
- Apple Researchers Unveil DeepPCR: A Novel Machine Learning Algorithm that Parallelizes Typically Sequential Operations in Order to Speed Up Inference and Training of Neural Networks : LocalLLaMA
- [2311.04254] Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation
- Sparsetral Paper
- Sparsetral Unsloth
- Layer removal
- [2403.11901] Larimar: Large Language Models with Episodic Memory Control
- [2311.10770] Exponentially Faster Language Modelling
- Tool Generalization
- [2401.07004] Extending LLMs' Context Window with 100 Samples
bestmedia
- Antimemetics Division Hub - SCP Foundation
- Starting Over: Chapters 1 to 23 (by Sugaru Miaki) - vgperson's Novel Translations
- About the Reckless Girl who kept Challenging a Reborn Man like Me | Inlitify OA
- Animal Farm
- Episode 1 | The Ember Knight
- Lelouch Checkmate's Schneizel - Code Geass [HD] - YouTube
- Code Geass: The Evil Paradox - YouTube
- How the US Government took control of the Internet | The Hacker Crackdown - YouTube
- DEF CON 22 - Don't Fuck it Up! - YouTube
- When Cybercriminals with Good OpSec Attack - YouTube
- Paul Le Roux - Wikipedia
- Surprisingly Turing-Complete · Gwern.net
- Death Note: L, Anonymity & Eluding Entropy · Gwern.net
- proceedings.pdf
- qmailsec.dvi - qmailsec-20071101.pdf
- The Melancholy of Subculture Society · Gwern.net
- Laws of Tech: Commoditize Your Complement · Gwern.net
- Timing Technology: Lessons From The Media Lab · Gwern.net
- The Monkey On Wall Street - Watch BEFORE You Invest - YouTube
- Don't Talk to the Police - YouTube
- The Rules for Rulers - YouTube
- How to be a Pirate Quartermaster. - YouTube
- Apple's $3500 Nightmare - YouTube
- Why Have Modular Smartphones Failed? - YouTube
- Why we hate engineers - YouTube
- What if Japan became Christian in 1637? - YouTube
- postpostmodernism - YouTube
- How To Criticize Computer Scientists
- This Blog is Systematic: If you're so smart, how come you're not SBF? The Kelly criterion and choice of expectations and utility function when bet sizing
- The Microprice: Estimating the fair price, given the state of the order book. - YouTube
- The Financial Mathematics of Market Liquidity: From Optimal Execution to ... - Olivier Gueant - Google Books
- Usability for Nerds - Wikibooks, open books for an open world
- Pluralistic: Tiktok's enshittification (21 Jan 2023) – Pluralistic: Daily links from Cory Doctorow
- The Only Unbreakable Law - YouTube
- George Hotz | Researching | documenting the AMD 7900XTX so we can understand why it crashes | RDNA 3 - YouTube
- William F. Baker: "On the Harmony of Theory and Practice in the Design of Tall Buildings" - YouTube
- Growtopia | Official Website
- Hi Score Girl - Wikipedia
- novena by cecile richard
- Shattered Pixel Dungeon - Shattered Pixel
- Nebulous.io - Apps on Google Play
- The Blockheads | Exploration, creation and survival sandbox game for iOS and Android
- Rikei ga Koi ni Ochita no de Shoumei shitemita. - MyAnimeList.net
startup
- Avoid blundering: 80% of a winning strategy
- comma ai | Three Stories - The Past, Future, and Present | George Hotz Announcing comma 3X COMMA_CON - YouTube
- Sam Altman - How to Succeed with a Startup - YouTube
- Mark Zuckerberg at Startup School 2013 - YouTube
- Lecture 3 - Before the Startup (Paul Graham) - YouTube
- Lean Game Server
- 31 nooby C++ habits you need to ditch - YouTube
- Stanford CS 25 | Transformers United
- Apart Sprints
- Representation Engineering: A Top-Down Approach to AI Transparency
- 2311.05908.pdf
- Three interpretations of matrix products
- Smart animate layers between frames – Figma Learn - Help Center
- GitHub - jacobhilton/deep_learning_curriculum: Language model alignment-focused deep learning curriculum
- A Comprehensive Mechanistic Interpretability Explainer & Glossary - Dynalist
- Mechanistic Interpretability Quickstart Guide – Neel Nanda
- Concrete Steps to Get Started in Transformer Mechanistic Interpretability – Neel Nanda
- ThursdAI - The top AI news from the past week | Alex Volkov | Substack
- Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU - 2403.06504.pdf
- Mathematics for Machine Learning - mml-book.pdf
- GitHub - center-for-humans-and-machines/transformer-heads: Toolkit for attaching, training, saving and loading of new heads for transformer models
- Chess-GPT's Internal World Model | Adam Karvonen
- GitHub - adamkarvonen/chess_gpt_eval: A repo to evaluate various LLM's chess playing abilities.
- Manipulating Chess-GPT's World Model | Adam Karvonen
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- 2309.08600.pdf
- [2309.05858] Uncovering mesa-optimization algorithms in Transformers
- 2311.03658.pdf
- Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes - 2402.05406.pdf
- 2309.16318.pdf
- bringing down the cost of pretraining llama to $200 - Pastebin.com
- RLbook2020.pdf
- There is a Narrative Trick in this Story | Inlitify OA
rej
- WTF Happened In 1971?
- [2403.03507v1] GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
- 2307.05695.pdf
- ReLoRA - what we've learned ยท Issue #320 ยท allenai/OLMo ยท GitHub
- Financial Engineering Playground: Signal Processing, Robust Estimation, Kalman, Optimization - YouTube
- Transformers Learn In-Context by Gradient Descent - 2212.07677.pdf
- 7195_one_step_of_gradient_descent_i.pdf
- In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization - 2402.14951.pdf
- 6735_in_context_convergence_of_tran.pdf
- [2310.05249] In-Context Convergence of Transformers
- [2306.09927] Trained Transformers Learn Linear Models In-Context
- MATHAI_29_paper.pdf
- Stanford Seminar - Generalized Reversible Computing and the Unconventional Computing Landscape - YouTube
- Perspectives on diffusion – Sander Dieleman
- Permission is hereby granted | Suno
- 2404.03626.pdf
- 1906.01820.pdf
- kleene56representation.pdf
- 2404.02060.pdf
- DiJiang: Efficient Large Language Models through Compact Kernelization - 2403.19928.pdf
- AI Mathematical Olympiad - Progress Prize 1 | Kaggle
- [2311.04897] Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
- [2403.16952] Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
- 2403.08763.pdf
- Can language models explore in context?
- In-Context Convergence of Transformers
- Privileged Bases in the Transformer Residual Stream
ml stash
- https://www.youtube.com/watch?v=NfnWJUyUJYU&feature=youtu.be
- https://www.youtube.com/playlist?list=PL6EF60E1027E1A10B
- [2109.12927] Faking Brownian motion with continuous Markov martingales
- [2302.01834v1] Coinductive guide to inductive transformer heads
- hywu/Camelidae-8x7B · Hugging Face
- hywu/Qwen2idae-16x14B-v1.0 · Hugging Face
- GitHub - mathllm/MathCoder: Family of LLMs for mathematical reasoning.
- GitHub - hendrycks/math: The MATH Dataset (NeurIPS 2021)
- nvidia/OpenMath-Mistral-7B-v0.1-hf · Hugging Face
- Reverse Training to Nurse the Reversal Curse
- Common 7b models already have strong math
- CoT without prompting
- Contrastive Decoding
- ReFT
- [2404.02060] Long-context LLMs Struggle with Long In-context Learning
- Mixture of Depths
- SSC-CoT
- Formalized CoT with theorem proving
- Horspool's algorithm at DuckDuckGo