Practical guide to Transformers

Title: “Hugging Face Transformers: the basics. Practical coding guides SE1E1. NLP Models (BERT/RoBERTa)”
Transcript: “Hello everyone, welcome to my first episode in my practical guide series. Today we're going to be looking at the Hugging Face Transformers library, going through some of the basics. In future episodes we'll move on to more advanced topics: taking some of the language models from the Transformers library, retraining them, and applying them to our own downstream tasks, so stay tuned for that. In today's episode, as I said, we're just going to go through some of the basics, and the basics include: what is the Hugging Face Transformers library, what can it do, and how to navigate their website and their documentation. When I first found the Transformers library and was trying to use it for some tasks, I thought there was a bit of a lack of a guide focusing on how to actually implement these models and how to navigate the documentation, so that's what we're going to try to do today. We're going to get some of these Hugging Face language models from their website, or from the library, and apply them to some very simple examples to begin with; in future episodes we'll train our own models and apply them to our own tasks. So let's jump in.

Let's start on their website; I'm just going to go to huggingface.co. What is Hugging Face? Hugging Face is a company, and their most successful product is this thing called the Transformers library. You can have a look at the GitHub repo, huggingface/transformers; it has a whole lot of stars and forks and so on. So what can you actually do with this library? Their main purpose is to offer a platform for accessing these large language models, which are all based on the Transformer architecture underneath, which is why it's called the Transformers library. So let's have a little look at some of the models they offer.

These are their most popular ones at the top: bert-base-cased, distilroberta, roberta-base, distilbert-base and so on. What does all this mean? Back in 2017, I think it was 2017, the “Attention Is All You Need” paper came out, which was the first paper to apply the Transformer architecture to a language model. A year later, I think it was 2018 but I'd have to double check, the BERT paper came out. What does BERT stand for? BERT is Bidirectional Encoder Representations from Transformers. It's a large language model trained on a masked language modeling task, on gigabytes and gigabytes of text data. The model comes in a couple of flavors: a base and a large version. The base has a smaller set of weights and parameters, the large has a larger set, so the large is going to be more effective, but you have the training and inference overhead of processing all those weights and parameters. For BERT there's also an uncased and a cased version, depending on whether you're going to be using capital letters or not, so you'll want to pick what's right for your task and for the compute power you have. There are also the distilled versions of these models, which are basically the same model but with a much smaller set of weights and parameters, so faster inference and faster training; you suffer a little bit of performance of course, but it's not that much, so you can weigh that up for your use case. And there are the RoBERTa models, which stands for the Robustly Optimized BERT Approach, released I think in 2019 by Facebook. That's basically a BERT model trained in a slightly different way: trained for a much longer time on much more text, with a few other special techniques applied during training to massively improve the performance over the base BERT model.

RoBERTa, I think, is always cased, so you just have the base model and the large model. Let's have a little look at the bert-base model, the most popular model they offer, and jump in. Over here you can read the model description, and you can read the paper if you like; it's based off this paper and the link will take you to arXiv. It was released in 2018, as I thought. Let's have a little look at their hosted inference API. As I said, it's trained on a masked language modeling task, so you mask out a token (be careful here: it's not a word you're masking out, it's a token) and you try to predict what the missing token is. Let's just see what this thing says for “Paris is the [MASK] of France”: it puts “capital”, with a very high probability. By the way, the output of this model isn't just five tokens with a probability each; it actually produces a probability distribution over your whole vocabulary, and it just happens that “capital” has the largest probability here, by quite a long way. But let's put in another example where you want back a token that's part of the previous word. Let's say “I strolled along the river [MASK].”, where “riverbank” is actually one word. Here we see “##bank”, and the “##” means that it's part of the previous token: “I strolled along the riverbank”, one word, and that's assigned quite a high probability. But you can see the other options here: “I strolled along the river bank” (two words), “the river path”, “the river Thames”; these are all perfectly decent, and any of them might be the real masked token. I think that's the important thing to understand about the masked language modeling task: there isn't always one correct answer, and the masked token could often be many things. That's why, when you're training BERT, it has to see lots and lots of examples, but at the end you get this model that has a sort of inherent understanding of the language.

So that's an example of a masked task, but what about a sentiment classification task? Let's go to text classification here under Tasks, and we'll go to the DistilBERT base uncased model fine-tuned on, I think it's the Stanford Sentiment Treebank, English. Instead of producing a probability distribution over the vocabulary, here, since it's binary classification between positive and negative, you'd want something like a single neuron at the end of your neural network that's one for positive and zero for negative, or you might have two at the end, something like that. Here it's put “I like you. I love you.” and assigned a very high probability to positive, while “I hate you.” gets a very high probability for negative. In the underlying code they presumably use a sigmoid function or something like that, so the score gets squashed toward either one or zero. So this is all good: you can use these models straight out of the box if you like, and it's fun to play around with the hosted inference API, but how do you actually use these models in practice? You have two options. You can use the Hugging Face library itself: they've got some useful pipelines, for sentiment analysis for example, which we'll go through, and they've got a Trainer class to train your own models. The alternative is to implement the models yourself with an existing ML framework such as PyTorch or TensorFlow, and I think they've started to support some other frameworks as well. So let's go up to Resources here and down to the Transformers documentation. This is a really, really useful section for when you're actually using the library and trying to implement models.

Down here on the left-hand side they've got some guides on using Transformers, some advanced guides, and if you keep scrolling down you'll find the model section. The previous page I showed you had lots of models; there was that sentiment classification model, for example, fine-tuned on the Stanford Treebank dataset, and that's a specific model. Here it's a slight tweak: it's actually the architectures that are supported, so the BERT architecture, or if I go down I'll find the RoBERTa architecture; where is it... RoBERTa, down here. You can go into these; I'm just going to go back up to the BERT model to keep it simple for now. Once you click one of these you'll see a list down here: you've got BertModel, you've got BertForSequenceClassification, and you'll also see these duplicates, like TFBertForSequenceClassification. That's talking about how the model is implemented: TF is for TensorFlow, while the standard model is done in PyTorch. We're going to be focusing on PyTorch in this series, and we're also going to be looking at some of the built-in Transformers tooling as well; we'll have a look at their Trainer in some later episodes. So let's go to BertForSequenceClassification. Here you'll find the parameters the model needs, you'll find out what you get back, and you'll see some little example code showing how to use it, among other things. One other really useful thing: from BertForSequenceClassification you can go to the source code itself and have a little look, and that can often be a good way to figure out what's actually happening when you're using the model, because it can sometimes be a bit black-boxy and you're not really sure what's going on. You can also use the source to see what sort of BERT model names they have in this class: you've got the large model here, you've got the cased, you've got the uncased, you've got a multilingual version, a Chinese version. So it can be pretty useful to have a look at this stuff.

Now let's actually jump into some coding and, instead of just looking at the documentation or using their online inference API, have a look at how to get your own models into a runtime and use them. I'm just going to connect this runtime on Google Colab; if you don't know what Google Colab is, then google it and figure it out. It's really useful: it's just an online coding environment, but one thing that's great when you're using the Transformers library, if you're going to be retraining any models, is that you can get access to a GPU online for free. Let's run through the little example I've made here. The first thing we want to do is install the library, and I've put this %%capture at the top, which just swallows the installation output you normally get; I don't like that, so I'm getting rid of it. Now let's have a little look at their built-in sentiment analysis pipeline: from transformers import pipeline, and we just make a classifier for sentiment analysis. I've made two example sentences here, a positive one and a negative one, “I love dogs” and “I really hate dogs”, so we know which is which. Also note that it says no model was supplied, so it defaulted to DistilBERT base uncased fine-tuned on the Stanford Sentiment Treebank, which is actually the model we looked at earlier. So you can already see that maybe I could swap that out and use a different one, trained on a different task or a different sentiment analysis dataset, for example.

Now that we've downloaded this model and have it in our runtime, I'm just going to put in our positive and negative sentences and we'll have a little look. The first one got labeled positive with a very high score, and the second one negative with a very high score, exactly what we'd expect. So that's all well and good: we've used their pipeline straight out of the box and basically done what they have on the inference API, but in our own runtime. Now let's have a look at getting one of the raw implementations of the model, so we can do some more interesting things with it. Down here, from transformers we import the AutoTokenizer and the AutoModelForSequenceClassification. That's one thing that's quite useful about the Hugging Face library: they've got these Auto classes, AutoTokenizer and AutoModel, which basically infer the best model and the best tokenizer to use based off the model name. You see here I've actually used the same model as before, passed in as a string, and I get the model back. I'm going to run that, and I'm also going to run the AutoTokenizer to get a tokenizer, which will be based off the DistilBERT uncased model. Instead of the AutoTokenizer we could use the DistilBertTokenizer directly, and that's absolutely fine; it's the same thing, because the Auto class is just choosing it for us automatically. But if we were to use the BertTokenizer instead (I'll run that again with the BertTokenizer here), what's it going to say? It's going to say the tokenizer class you chose, the tokenizer class you load from the checkpoint, is not the same type as the class this function is called from, and it may result”
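The Colab steps quoted in the transcript can be condensed into a short script. This is a minimal sketch, assuming `transformers` (plus a PyTorch backend) has been installed with `pip install transformers`; the checkpoint name is the default the narrator says the pipeline falls back to, and the `classify` helper is my own wrapper, not part of the library.

```python
# The default checkpoint the sentiment-analysis pipeline reports using
# when no model is supplied (as noted in the transcript).
DEFAULT_SST2_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"

def classify(sentences, model_name=DEFAULT_SST2_MODEL):
    """Label each sentence POSITIVE/NEGATIVE with the built-in pipeline."""
    # Imported lazily so the file can be read/loaded without the library.
    from transformers import pipeline
    classifier = pipeline("sentiment-analysis", model=model_name)
    return classifier(sentences)

# Example call, using the two sentences from the episode (downloads the
# model on first use); each result is a dict like
# {"label": "POSITIVE", "score": 0.99...}:
#   classify(["I love dogs.", "I really hate dogs."])
```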

– The episode is part of a practical guide series on Hugging Face Transformers.
– The focus of the episode is on the basics of the Hugging Face Transformers library.
– The Transformers library provides access to large language models based on the Transformers architecture.
– The library offers models like BERT, RoBERTa, and others.
– BERT stands for “Bidirectional Encoder Representations from Transformers” (released in 2018) and is pretrained with a masked language modeling objective on gigabytes of text data.
– BERT models come in base and large versions, as well as cased and uncased variants.
– Distilled versions (e.g. DistilBERT) have far fewer weights and parameters, giving faster inference and training at a small cost in accuracy.
– RoBERTa (“Robustly Optimized BERT Approach”, released by Facebook around 2019) uses the BERT architecture but is trained for longer on much more text, improving performance.
– The library allows users to interact with models through a hosted inference API.
– Examples of using the models for language tasks, such as masked token prediction and sentiment classification, are demonstrated.
– Users have two options for implementing the models: using the Hugging Face library or implementing them with existing ML frameworks like PyTorch or TensorFlow.
– The Transformers documentation provides guides, examples, and information about the supported architectures.
– Code examples are shown for using the sentiment analysis pipeline and accessing models and tokenizers using the Hugging Face library.
– Google Colab is recommended for running the code, as it provides free GPU access.
– The episode concludes with running code examples for sentiment analysis using pre-trained models and tokenizers.
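The “##bank” example in the masked-token demo shows WordPiece continuation pieces. As a minimal, self-contained illustration (the real logic lives in the tokenizer itself; `join_wordpieces` is a hypothetical helper written for this note), sub-tokens can be re-joined like this:

```python
def join_wordpieces(tokens):
    """Merge WordPiece sub-tokens: a piece starting with '##' continues the previous one."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # glue the continuation onto the previous piece
        else:
            words.append(tok)
    return " ".join(words)

# "##bank" attaches to "river", so "riverbank" comes back as one word:
print(join_wordpieces(["i", "strolled", "along", "the", "river", "##bank"]))
# -> i strolled along the riverbank
```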
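The transcript guesses that the classification head “squishes” its output toward one or zero; two-class checkpoints typically emit two logits that are normalized with a softmax. A plain-Python sketch of that post-processing step (the NEGATIVE/POSITIVE label order follows the SST-2 convention; `pick_label` is an illustrative helper, not a library function):

```python
import math

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}  # label order used by SST-2 checkpoints

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_label(logits):
    """Return (label, probability) for a two-logit sentiment head."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

print(pick_label([-3.0, 4.0]))  # strongly positive logits
```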
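The Auto classes shown at the end of the episode can be used roughly as follows. A sketch assuming `transformers` and a PyTorch backend are installed; `load_sentiment_model` is my own wrapper around the documented `from_pretrained` calls, and the checkpoint is the SST-2 model from the video.

```python
SST2_CHECKPOINT = "distilbert-base-uncased-finetuned-sst-2-english"

def load_sentiment_model(name=SST2_CHECKPOINT):
    """Fetch a (tokenizer, model) pair for a sequence-classification checkpoint.

    The Auto* classes read the checkpoint's config and instantiate the
    matching concrete classes (a DistilBERT tokenizer and
    DistilBertForSequenceClassification here) without you naming them.
    Loading a mismatched class yourself, e.g. BertTokenizer on a DistilBERT
    checkpoint, only triggers the warning quoted at the end of the transcript.
    """
    # Imported lazily so the sketch can be read without the library installed.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)
    return tokenizer, model
```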

Timeline:
– 0:00-0:30: Introduction to the episode and the practical guide series.
– 0:30-2:50: Overview of Hugging Face Transformers library and its purpose.
– 2:50-5:40: Explanation of BERT and RoBERTa models and their variations.
– 5:40-9:15: Demonstration of using models for masked token prediction and sentiment classification.
– 9:15-12:25: Options for implementing models using the Hugging Face library or existing ML frameworks.
– 12:25-17:45: Exploration of Transformers documentation, including guides, architectures, and model code.
– 17:45-22:15: Code examples for sentiment analysis using the Hugging Face library’s pipelines and model/tokenizer classes.
– 22:15-End: Conclusion and code examples for using pre-trained models and tokenizers in a custom runtime environment.