Bookmarks Menu

Bookmarks Toolbar

arxiv-sanity
Modal Mapping / Learn Vimscript the Hard Way

foundations

Applied Linear Algebra Lecture Series — LessWrong
Six (and a half) intuitions for SVD — LessWrong
Some ML-Related Math I Now Understand Better — LessWrong
Streamlit
Linear Algebra for Programmers | Build a foundation for Machine Learning, AI and other areas
Maximum likelihood estimation for the beer example model - YouTube
Why is a likelihood not a probability distribution? - YouTube
QuantFinance/notebooks/Module 2.3.1 - Statistical Significance.ipynb at master · PythonCharmers/QuantFinance · GitHub
[1604.06737] Entity Embeddings of Categorical Variables
Mode Collapse and the Norm One Principle — LessWrong
Basic Facts about Language Model Internals — LessWrong
Residual stream norms grow exponentially over the forward pass — LessWrong
[1706.02515] Self-Normalizing Neural Networks
[2303.01610] Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers
Shtetl-Optimized » Blog Archive » The First Law of Complexodynamics
Taking the parameters which seem to matter and rotating them until they don't — LessWrong
A Discussion of 'Adversarial Examples Are Not Bugs, They Are Features'
[2403.20284v1] LayerNorm: A key component in parameter-efficient fine-tuning
Superposition, Memorization, and Double Descent
Softmax Linear Units
Toy Models of Superposition
[1805.07091] Tropical Geometry of Deep Neural Networks
Singular Learning Theory - LessWrong
[2008.02217] Hopfield Networks is All You Need
arxiv-sanity
Ch 9: Convolutional Networks - YouTube
2003.04887v2.pdf
[1603.07285] A guide to convolution arithmetic for deep learning
Jensen–Shannon Divergence - Notes on AI
Kaiji Opening 01 HD - Mirai wa Bokura no Te no Naka - YouTube
GitHub - sshuttle/sshuttle: Transparent proxy server that works as a poor man's VPN. Forwards over ssh. Doesn't require admin. Works with Linux and MacOS. Supports DNS tunneling.
I Wrote a Faster Sorting Algorithm | Probably Dance
Deconvolution and Checkerboard Artifacts
fast.ai – AdamW and Super-convergence is now the fastest way to train neural nets
https://www.youtube.com/watch?feature=shared&t=1542&v=0_BBRNYInx8
1902.05522v2.pdf

advanced transformer

A Walkthrough of A Mathematical Framework for Transformer Circuits - YouTube
[1711.03953] Breaking the Softmax Bottleneck: A High-Rank RNN Language Model
Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions - ACL Anthology
[2404.07647] Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck
Basic Facts about Language Model Internals — LessWrong
LLM.int8() and Emergent Features — Tim Dettmers
Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic — LessWrong
[2203.11355] Origami in N dimensions: How feed-forward networks manufacture linear separability
[1805.06576] Mad Max: Affine Spline Insights into Deep Learning
[1609.09106] HyperNetworks
[1607.06450] Layer Normalization
Estimating efficiency improvements in LLM pre-training — LessWrong
[2302.01834] Coinductive guide to inductive transformer heads
Re-Examining LayerNorm — LessWrong
Interpreting Neural Networks through the Polytope Lens — LessWrong
Branch Specialization
Activation Atlas
What are the tradeoffs of increasing the depth of a neural network? - Quora
Intro to Optimization in Deep Learning: Busting the Myth About Batch Normalization
Keeping up with AGI
GradIEEEnt half decent
deep learning - Why do ResNets avoid the vanishing gradient problem? - Artificial Intelligence Stack Exchange
1711.00165v3.pdf
A Visual Exploration of Gaussian Processes
[1806.07572] Neural Tangent Kernel: Convergence and Generalization in Neural Networks
[2012.00152] Every Model Learned by Gradient Descent Is Approximately a Kernel Machine
[2212.13881] Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features
[2111.00034] Neural Networks as Kernel Learners: The Silent Alignment Effect
OyedotunAl IsmaeilAouada_NeurocomputingJournal.pdf
Extending Context is Hard...but not Impossible† - Kaio
1906.00904v2.pdf
Tropical Geometry of Deep Neural Networks - 1805.07091v1.pdf
2211.12312v1.pdf
Keeping NN Simple by Minimizing the Description Length of the Weights - colt93.pdf
Work dumber not smarter — LessWrong
1603.05027v3.pdf
Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice - NIPS-2017-resurrecting-the-sigmoid-in-deep-learning-through-dynamical-isometry-theory-and-practice-Paper.pdf
Lecture 4: Signal Propagation and Dynamical Isometry in Deep Neural Networks (English) - YouTube
[2210.01245] Random orthogonal additive filters: a solution to the vanishing/exploding gradient of deep neural networks
2402.03332v1.pdf

counterintuitive

The Bitter Lesson
The Scaling Hypothesis ยท Gwern.net
Deep Double Descent: Where Bigger Models and More Data Hurt
Single Headed Attention RNN: Stop Thinking With Your Head
Stanford CS25: V1 I Transformer Circuits, Induction Heads, In-Context Learning - YouTube
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
Uncovering Mesa-Optimization Algorithms in Transformers
On Emergent Abilities, Scaling Architectures and Large Language Models — Yi Tay
Are Emergent Abilities of Large Language Models a Mirage?
Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models
Language Models Represent Space and Time
Invitation to Imitation
Energy Based Models
GitHub - PythonCharmers/QuantFinance: Top training materials in quantitative finance
othello_world/mechanistic_interpretability/tl_initial_exploration.py at master · likenneth/othello_world · GitHub
[2404.05729] Finding Visual Task Vectors
[2404.05405] Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
[2106.03216] On Memorization in Probabilistic Deep Generative Models
[2404.06773] Adapting LLaMA Decoder to Vision Transformer
The British-American coup that ended Australian independence | John Pilger | The Guardian
megamined/deepseek-math-7b-rl-8.0bpw-exl2 at main
[2404.12358] From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function
2207.00729.pdf
2305.12766.pdf
2305.12907.pdf
2404.07544.pdf
2306.04637.pdf
Why is in-context learning lower quality than fine-tuning? And…what if it wasn't? · Hazy Research
On Average, You're Using the Wrong Average: Geometric & Harmonic Means in Data Analysis | by Daniel McNichol | Towards Data Science
the Stolarsky mean - YouTube
Riemannian manifold - Wikipedia
Fréchet mean - Wikipedia
Root mean square - Wikipedia
Mean - Wikipedia
Lesson 8 - Practical Deep Learning for Coders 2022 - YouTube
[2404.03626v1] Training LLMs over Neurally Compressed Text
[2404.14408] SpaceByte: Towards Deleting Tokenization from Large Language Modeling
GitHub - neelnanda-io/TransformerLens: A library for mechanistic interpretability of GPT-style language models
main_notes.pdf
6.2 Logistic Regression and the Cross Entropy Cost
Detexify LaTeX handwritten symbol recognition

TRANSFORMERS TRANSFORMERS TRANSFORMERS

Welcome To Colab - Colab
Let's build GPT: from scratch, in code, spelled out. - YouTube
The spelled-out intro to neural networks and backpropagation: building micrograd - YouTube
gpt-dev.ipynb - Colab
GitHub - fastai/fastbook: The fastai book, published as Jupyter Notebooks
10_nlp.ipynb - Colab
12_nlp_dive.ipynb - Colab
13_convolutions.ipynb - Colab
Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic — LessWrong
A Mathematical Framework for Transformer Circuits
Transformer Circuits Thread
What is a Transformer? (Transformer Walkthrough Part 1/2) - YouTube
Implementing GPT-2 From Scratch (Transformer Walkthrough Part 2/2) - YouTube
Interpreting Neural Networks through the Polytope Lens — LessWrong
Some ML-Related Math I Now Understand Better — LessWrong
Re-Examining LayerNorm — LessWrong
Mechanistic Interpretability Quickstart Guide — Neel Nanda
Concrete Steps to Get Started in Transformer Mechanistic Interpretability — Neel Nanda
main_notes.pdf
6.2 Logistic Regression and the Cross Entropy Cost
Stanford CS25: V2 I Introduction to Transformers w/ Andrej Karpathy - YouTube
But what is a GPT? Visual intro to transformers | Chapter 5, Deep Learning - YouTube
Discord | Your Place to Talk and Hang Out

NumPy quickstart — NumPy v1.26 Manual
Exploring Low Rank Training of Deep Neural Networks - 2209.13569.pdf
Increasing Learning Efficiency of Self-Attention Networks through Direct Position Interactions, Learnable Temperature, and Convoluted Attention - 2020.coling-main.324.pdf
GitHub - EGjoni/DRUGS: Stop messing around with finicky sampling parameters and just use DRµGS!

TRANSFORMERS2

Let's build GPT: from scratch, in code, spelled out. - YouTube
14_resnet.ipynb - Colab
15_arch_details.ipynb - Colab
16_accel_sgd.ipynb - Colab
18_CAM.ipynb - Colab
10_nlp.ipynb - Colab
12_nlp_dive.ipynb - Colab
Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic — LessWrong
A Mathematical Framework for Transformer Circuits
What is a Transformer? (Transformer Walkthrough Part 1/2) - YouTube
Implementing GPT-2 From Scratch (Transformer Walkthrough Part 2/2) - YouTube
Einops tutorial, part 1: basics - Einops
Concrete Steps to Get Started in Transformer Mechanistic Interpretability — Neel Nanda
Streamlit
[1.1] Transformer from Scratch (exercises).ipynb - Colab
main_notes.pdf
6.2 Logistic Regression and the Cross Entropy Cost

pyvene/tutorials/advanced_tutorials/DAS_Main_Introduction.ipynb at main · stanfordnlp/pyvene · GitHub
5613__what_learning_algorithm_is_in.pdf
MATS Program
ANT-intro - Dropbox
2112.04426v3.pdf
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time.
Kullback-Leibler Divergence Explained — Count Bayesie
2012.14913v2.pdf
[2404.15758] Let's Think Dot by Dot: Hidden Computation in Transformer Language Models
Neural Networks — PyTorch Tutorials 2.3.0+cu121 documentation
Transformer Inference Arithmetic | kipply's blog
GitHub - konstmish/prodigy: The Prodigy optimizer and its variants for training neural networks.
A basic introduction to NumPy's einsum – ajcr – Haphazard investigations
[2404.12224v1] Length Generalization of Causal Transformers without Position Encoding
You should (usually) log transform your positive data | Statistical Modeling, Causal Inference, and Social Science
Simple Linear Regression
2404.17625v1.pdf
Xournal++ - Xournal++
Differential Manifolds (Pure and Applied Mathematics, Volume 138): Kosinski, Antoni A.: 9780124218505: Amazon.com: Books
kipply's blog
2311.01460v1.pdf
2203.00555v1.pdf
NeurIPS-2023-vanillanet-the-power-of-minimalism-in-deep-learning-Paper-Conference.pdf
[2302.10322] Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
[2402.16788] Why Transformers Need Adam: A Hessian Perspective
[2305.02790] BranchNorm: Robustly Scaling Extremely Deep Transformers
2203.03466v2.pdf
A Comprehensive Mechanistic Interpretability Explainer & Glossary — Neel Nanda
Transformer Inference Arithmetic | kipply's blog
Untitled13.ipynb - Colab
Linear Transformers Are Secretly Fast Weight Memory Systems (Machine Learning Paper Explained) - YouTube
Deep Networks Are Kernel Machines (Paper Explained) - YouTube
Predictive Coding Approximates Backprop along Arbitrary Computation Graphs (Paper Explained) - YouTube
Yann LeCun - Self-Supervised Learning: The Dark Matter of Intelligence (FAIR Blog Post Explained) - YouTube
⚡Lightning AI - Pricing
Intrinsic dimension of data representations in deep neural networks - NeurIPS-2019-intrinsic-dimension-of-data-representations-in-deep-neural-networks-Paper.pdf
Welcome to Spinning Up in Deep RL! — Spinning Up documentation
[Prob&Stats] 3. Expected Value, Variance, and Standard Deviation | by jun94 | jun-devpBlog | Medium
Why Momentum Really Works
DL-Tutorial-NIPS2015.pptx - DL-Tutorial-NIPS2015.pdf
machine learning - Why do we have to divide by 2 in the ML squared error cost function? - Data Science Stack Exchange
Writing Custom Datasets, DataLoaders and Transforms โ€” PyTorch Tutorials 2.3.0+cu121 documentation
2304.14802v1.pdf
Bayes theorem, the geometry of changing beliefs - YouTube
Bayes Theorem & Confusion Matrix. Bayes' theorem is crucial for… | by Max Kleiner | Medium
Ben Lambert - YouTube
A Student's Guide to Bayesian Statistics - YouTube

llms

MiniCPM: Unveiling the Potential of End-side Large Language Models
Stop Shuffling
[2402.01878] LiPO: Listwise Preference Optimization through Learning-to-Rank
[2402.13753] LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
OLMO
Yi Pretraining
Question · Issue #6 · pratyushasharma/laser · GitHub
main : add Self-Extend support by ggerganov · Pull Request #4815 · ggerganov/llama.cpp · GitHub
Quiet-STaR
AstraMindAI/AstraQuasar-4B · Hugging Face
Deepseek Pretraining
[2312.04455] Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use
[2305.10429] DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
[2402.03620] Self-Discover: Large Language Models Self-Compose Reasoning Structures
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases - 2402.14905.pdf
NEFTune
A Review of Public Japanese Training Sets · AUGMXNT/shisa Wiki · GitHub
Apple Researchers Unveil DeepPCR: A Novel Machine Learning Algorithm that Parallelizes Typically Sequential Operations in Order to Speed Up Inference and Training of Neural Networks : LocalLLaMA
[2311.04254] Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation
Sparsetral Paper
Sparsetral Unsloth
Layer removal
[2403.11901] Larimar: Large Language Models with Episodic Memory Control
[2311.10770] Exponentially Faster Language Modelling
Tool Generalization
[2401.07004] Extending LLMs' Context Window with 100 Samples

bestmedia

Antimemetics Division Hub - SCP Foundation
Starting Over: Chapters 1 to 23 (by Sugaru Miaki) - vgperson's Novel Translations
About the Reckless Girl who kept Challenging a Reborn Man like Me | Inlitify OA
Animal Farm
Episode 1 | The Ember Knight
Lelouch Checkmate's Schneizel - Code Geass [HD] - YouTube
Code Geass: The Evil Paradox - YouTube
How the US Government took control of the Internet | The Hacker Crackdown - YouTube
DEF CON 22 - Don't Fuck it Up! - YouTube
When Cybercriminals with Good OpSec Attack - YouTube
Paul Le Roux - Wikipedia
Surprisingly Turing-Complete · Gwern.net
Death Note: L, Anonymity & Eluding Entropy · Gwern.net
proceedings.pdf
qmailsec.dvi - qmailsec-20071101.pdf
The Melancholy of Subculture Society · Gwern.net
Laws of Tech: Commoditize Your Complement · Gwern.net
Timing Technology: Lessons From The Media Lab · Gwern.net
The Monkey On Wall Street - Watch BEFORE You Invest - YouTube
Don't Talk to the Police - YouTube
The Rules for Rulers - YouTube
How to be a Pirate Quartermaster. 📈 💎 📈 - YouTube
Apple's $3500 Nightmare - YouTube
Why Have Modular Smartphones Failed? - YouTube
Why we hate engineers - YouTube
What if Japan became Christian in 1637? - YouTube
postpostmodernism - YouTube
How To Criticize Computer Scientists
This Blog is Systematic: If you're so smart, how come you're not SBF? The Kelly criterion and choice of expectations and utility function when bet sizing
The Microprice: Estimating the fair price, given the state of the order book. - YouTube
The Financial Mathematics of Market Liquidity: From Optimal Execution to ... - Olivier Gueant - Google Books
Usability for Nerds - Wikibooks, open books for an open world
Pluralistic: Tiktok's enshittification (21 Jan 2023) – Pluralistic: Daily links from Cory Doctorow
The Only Unbreakable Law - YouTube
George Hotz | Researching | documenting the AMD 7900XTX so we can understand why it crashes | RDNA 3 - YouTube
William F. Baker: "On the Harmony of Theory and Practice in the Design of Tall Buildings" - YouTube
Growtopia | Official Website
Hi Score Girl - Wikipedia
novena by cecile richard
Shattered Pixel Dungeon - Shattered Pixel
Nebulous.io - Apps on Google Play
The Blockheads | Exploration, creation and survival sandbox game for iOS and Android
Rikei ga Koi ni Ochita no de Shoumei shitemita. - MyAnimeList.net

startup

Avoid blundering: 80% of a winning strategy
comma ai | Three Stories - The Past, Future, and Present | George Hotz Announcing comma 3X COMMA_CON - YouTube
Sam Altman - How to Succeed with a Startup - YouTube
Mark Zuckerberg at Startup School 2013 - YouTube
Lecture 3 - Before the Startup (Paul Graham) - YouTube

Lean Game Server
31 nooby C++ habits you need to ditch - YouTube
Stanford CS 25 | Transformers United
Apart Sprints
Representation Engineering: A Top-Down Approach to AI Transparency
2311.05908.pdf
Three interpretations of matrix products
Smart animate layers between frames โ€“ Figma Learn - Help Center
GitHub - jacobhilton/deep_learning_curriculum: Language model alignment-focused deep learning curriculum
A Comprehensive Mechanistic Interpretability Explainer & Glossary - Dynalist
Mechanistic Interpretability Quickstart Guide — Neel Nanda
Concrete Steps to Get Started in Transformer Mechanistic Interpretability — Neel Nanda
ThursdAI - The top AI news from the past week | Alex Volkov | Substack
Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU - 2403.06504.pdf
Mathematics for Machine Learning - mml-book.pdf
GitHub - center-for-humans-and-machines/transformer-heads: Toolkit for attaching, training, saving and loading of new heads for transformer models
Chess-GPT's Internal World Model | Adam Karvonen
GitHub - adamkarvonen/chess_gpt_eval: A repo to evaluate various LLM's chess playing abilities.
Manipulating Chess-GPT's World Model | Adam Karvonen
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
2309.08600.pdf
[2309.05858] Uncovering mesa-optimization algorithms in Transformers
2311.03658.pdf
Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes - 2402.05406.pdf
2309.16318.pdf
bringing down the cost of pretraining llama to $200 - Pastebin.com
RLbook2020.pdf
There is a Narrative Trick in this Story | Inlitify OA

rej

WTF Happened In 1971?
[2403.03507v1] GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
2307.05695.pdf

ReLoRA - what we've learned · Issue #320 · allenai/OLMo · GitHub
Financial Engineering Playground: Signal Processing, Robust Estimation, Kalman, Optimization - YouTube
Transformers Learn In-Context by Gradient Descent - 2212.07677.pdf
7195_one_step_of_gradient_descent_i.pdf
In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization - 2402.14951.pdf
6735_in_context_convergence_of_tran.pdf
[2310.05249] In-Context Convergence of Transformers
[2306.09927] Trained Transformers Learn Linear Models In-Context
MATHAI_29_paper.pdf
Stanford Seminar - Generalized Reversible Computing and the Unconventional Computing Landscape - YouTube
Perspectives on diffusion – Sander Dieleman
Permission is hereby granted | Suno
2404.03626.pdf
1906.01820.pdf
kleene56representation.pdf
2404.02060.pdf
DiJiang: Efficient Large Language Models through Compact Kernelization - 2403.19928.pdf
AI Mathematical Olympiad - Progress Prize 1 | Kaggle
[2311.04897] Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
[2403.16952] Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
2403.08763.pdf
Can language models explore in context?
In-Context Convergence of Transformers
Privileged Bases in the Transformer Residual Stream

ml stash

https://www.youtube.com/watch?v=NfnWJUyUJYU&feature=youtu.be
https://www.youtube.com/playlist?list=PL6EF60E1027E1A10B
[2109.12927] Faking Brownian motion with continuous Markov martingales
[2302.01834v1] Coinductive guide to inductive transformer heads
hywu/Camelidae-8x7B · Hugging Face
hywu/Qwen2idae-16x14B-v1.0 · Hugging Face
GitHub - mathllm/MathCoder: Family of LLMs for mathematical reasoning.
GitHub - hendrycks/math: The MATH Dataset (NeurIPS 2021)
nvidia/OpenMath-Mistral-7B-v0.1-hf ยท Hugging Face
Reverse Training to Nurse the Reversal Curse
Common 7b models already have strong math
CoT without prompting
Contrastive Decoding
ReFT
[2404.02060] Long-context LLMs Struggle with Long In-context Learning
Mixture of Depths
SSC-CoT
Formalized CoT with theorem proving
Horspool's algorithm at DuckDuckGo