Summary (Chapter 1, Understanding Large Language Models)

LLMs have transformed the field of natural language processing, which previously mostly relied on explicit rule-based systems and simpler statistical methods. The advent of LLMs introduced new deep learning-driven approaches that led to advancements in understanding, generating, and translating human language.

Modern LLMs are trained in two main steps:
– First, they are pretrained on a large corpus of unlabeled text by using the prediction of the next word in a sentence as a label.
– Then, they are fine-tuned on a smaller, labeled target dataset to follow instructions or perform classification tasks.
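A minimal sketch (not from the book) of how next-word prediction turns unlabeled text into training pairs; the token IDs below are made up for illustration:

```python
# Next-word prediction needs no human labels: the targets are simply the
# input token sequence shifted one position to the left.
token_ids = [5, 12, 7, 29, 3]   # illustrative token IDs for a short sentence

inputs = token_ids[:-1]         # [5, 12, 7, 29]
targets = token_ids[1:]         # [12, 7, 29, 3]

for x, y in zip(inputs, targets):
    print(f"input: {x} -> target (next word): {y}")
```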

LLMs are based on the transformer architecture. The key idea of the transformer architecture is an attention mechanism that gives the LLM selective access to the whole input sequence when generating the output one word at a time.
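A rough sketch of that idea, assuming PyTorch; it uses the raw embeddings as queries, keys, and values, whereas real transformer layers add trainable projection matrices, multiple heads, and a causal mask so each position only sees earlier tokens:

```python
import torch

torch.manual_seed(123)
inputs = torch.rand(4, 8)   # 4 tokens, each an 8-dimensional embedding

# Attention scores: every token is compared against every other token,
# giving the model selective access to the whole input sequence.
scores = inputs @ inputs.T                                    # shape (4, 4)
weights = torch.softmax(scores / inputs.shape[-1] ** 0.5, dim=-1)

# Context vectors: each row is a weighted mix of all token embeddings.
context = weights @ inputs                                    # shape (4, 8)

print(weights[0])      # how strongly token 0 attends to each of the 4 tokens
print(context.shape)   # torch.Size([4, 8])
```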

The original transformer architecture consists of an encoder for parsing text and a decoder for generating text.

LLMs for generating text and following instructions, such as GPT-3 and ChatGPT, only implement decoder modules, simplifying the architecture.

Large datasets consisting of billions of words are essential for pretraining LLMs.

While the general pretraining task for GPT-like models is to predict the next word in a sentence, these LLMs exhibit emergent properties, such as capabilities to classify, translate, or summarize texts.

Once an LLM is pretrained, the resulting foundation model can be fine-tuned more efficiently for various downstream tasks.

LLMs fine-tuned on custom datasets can outperform general LLMs on specific tasks.

The tokenizer used for GPT models does not need any of these special tokens (such as padding or unknown-word tokens); it only uses an <|endoftext|> token for simplicity.

Note: why doesn't GPT use any other special tokens?
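A quick check of this, assuming the tiktoken package (which provides the GPT-2/GPT-3 BPE tokenizer) is installed; <|endoftext|> maps to the single token ID 50256:

```python
import tiktoken

tok = tiktoken.get_encoding("gpt2")

text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
ids = tok.encode(text, allowed_special={"<|endoftext|>"})

print(ids)              # the <|endoftext|> marker shows up as the single ID 50256
print(tok.decode(ids))  # decoding reproduces the original text, including the marker
```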

Moreover, the tokenizer used for GPT models also doesn't use an <|unk|> token for out-of-vocabulary words. Instead, GPT models use a byte pair encoding tokenizer, which breaks words down into subword units, which we will discuss next.

BPE: byte pair encoding
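A small sketch of that subword behavior, again assuming tiktoken's GPT-2 encoding; a made-up word is split into known pieces instead of being mapped to <|unk|>:

```python
import tiktoken

tok = tiktoken.get_encoding("gpt2")

ids = tok.encode("Akwirw ier")           # a nonsense, out-of-vocabulary string
print(ids)                               # several IDs, one per subword piece
print([tok.decode([i]) for i in ids])    # each piece is a known subword or character; no <|unk|> needed
```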

For readers familiar with one-hot encoding, the embedding layer approach described here is essentially just a more efficient way of implementing one-hot encoding followed by matrix multiplication in a fully connected layer.
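A minimal check of that equivalence in PyTorch (layer sizes are illustrative):

```python
import torch

torch.manual_seed(123)
vocab_size, embed_dim = 6, 3
embedding = torch.nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([2, 3, 5, 1])

# Embedding layer: a direct row lookup into the weight matrix.
lookup = embedding(token_ids)

# Equivalent but less efficient: one-hot vectors multiplied by the same weights.
onehot = torch.nn.functional.one_hot(token_ids, num_classes=vocab_size).float()
matmul = onehot @ embedding.weight

print(torch.allclose(lookup, matmul))   # True
```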

To make the LLM aware of token positions, we can use two broad categories of position-aware embeddings: relative positional embeddings and absolute positional embeddings.

Instead of focusing on the absolute position of a token, the emphasis of relative positional embeddings is on the relative position or distance between tokens. This means the model learns the relationships in terms of "how far apart" rather than "at which exact position." The advantage here is that the model can generalize better to sequences of varying lengths, even if it hasn't seen such lengths during training.

Note: is the ability to support longer contexts at inference time due to the use of this relative positional embeddings technique?

Summary (Chapter 2, Working with Text Data)

LLMs require textual data to be converted into numerical vectors, known as embeddings, since they can't process raw text. Embeddings transform discrete data (like words or images) into continuous vector spaces, making them compatible with neural network operations.

As the first step, raw text is broken into tokens, which can be words or characters. Then, the tokens are converted into integer representations, termed token IDs.
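A minimal sketch of both steps (a simplified word-level tokenizer, not the book's exact class; the regular expression is one reasonable choice):

```python
import re

raw_text = "Hello, world. Is this-- a test?"

# Step 1: split raw text into tokens (words and punctuation characters).
tokens = [t.strip() for t in re.split(r'([,.:;?_!"()\']|--|\s)', raw_text) if t.strip()]

# Step 2: build a vocabulary and map each token to an integer token ID.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]

print(tokens)      # ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
print(token_ids)
```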

Special tokens, such as <|unk|> and <|endoftext|>, can be added to enhance the model's understanding and handle various contexts, such as unknown words or marking the boundary between unrelated texts.

The byte pair encoding (BPE) tokenizer used for LLMs like GPT-2 and GPT-3 can efficiently handle unknown words by breaking them down into subword units or individual characters.

We use a sliding window approach on tokenized data to generate input–target pairs for LLM training.
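A minimal sketch of the sliding-window idea, assuming tiktoken for the tokenization; max_length and stride are illustrative parameter names and values:

```python
import tiktoken

tok = tiktoken.get_encoding("gpt2")
token_ids = tok.encode("In the heart of the city stood the old library, a relic from a bygone era.")

max_length, stride = 4, 4   # window size and step size

# Slide a fixed-size window over the token IDs; each target window is the
# corresponding input window shifted one position to the right.
for i in range(0, len(token_ids) - max_length, stride):
    x = token_ids[i : i + max_length]
    y = token_ids[i + 1 : i + max_length + 1]
    print(x, "->", y)
```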

Embedding layers in PyTorch function as a lookup operation, retrieving vectors corresponding to token IDs. The resulting embedding vectors provide continuous representations of tokens, which is crucial for training deep learning models like LLMs.

While token embeddings provide consistent vector representations for each token, they lack a sense of the token's position in a sequence. To rectify this, two main types of positional embeddings exist: absolute and relative. OpenAI's GPT models utilize absolute positional embeddings, which are added to the token embedding vectors and are optimized during the model training.
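A minimal sketch of absolute positional embeddings being added to token embeddings (the vocabulary size matches GPT-2, but the other sizes and the token IDs are illustrative):

```python
import torch

vocab_size, context_length, embed_dim = 50257, 4, 256

token_embedding = torch.nn.Embedding(vocab_size, embed_dim)
pos_embedding = torch.nn.Embedding(context_length, embed_dim)   # one trainable vector per position

token_ids = torch.tensor([[40, 367, 2885, 1464]])               # batch of one sequence with 4 token IDs

tok_embeds = token_embedding(token_ids)                          # shape (1, 4, 256)
pos_embeds = pos_embedding(torch.arange(context_length))         # shape (4, 256)

# The positional vectors are broadcast-added to every sequence in the batch
# and are optimized jointly with the rest of the model during training.
input_embeds = tok_embeds + pos_embeds
print(input_embeds.shape)   # torch.Size([1, 4, 256])
```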