how we go from a base model to an instruction-tuned model...
playing the training game: predicting the next token, cross-entropy loss, and more...
combining mha, layernorm, and a feed-forward network to get the engine running...
why one attention head isn't enough, combining multiple heads, and making a panel of experts...
what k, q, v mean geometrically, the "why" behind the formula, and tackling tricky questions...
setting the stage for transformers, how and why positional encodings work, and what's next?...
how computers understand text, types of tokenization, bpe, and other related topics.
addressing the limitations of rnns, how lstms "remember" across long-range dependencies, and what's next?
word2vec, rnns, their use cases and limitations, and what's next?
a brief primer on n-gram models, their applications, and limitations