
Quick Primer on Hugging Face


Bhaskar S 04/27/2024


Overview

There seems to be a great deal of interest, as well as misinformation, amongst folks around what Hugging Face really is. The intent of this article is to clarify that and introduce the Hugging Face platform.

Hugging Face is a popular, open-source, democratized ecosystem for the AI/ML community to discover, experiment with, and share models, datasets, libraries, and applications. Think of it as what GitHub is to developers, but for the AI/ML community.

In addition, Hugging Face provides an open-source Python library called transformers, which offers access to a plethora of pre-trained models that can be used for building, training, and deploying applications for natural language processing, audio/image/video processing, etc.

Finally, Hugging Face provides a way for users to host their AI/ML applications via Spaces, to either showcase projects or work collaboratively with others on the platform.

In this primer, we will keep our focus on the commonly used language processing tasks as well as model tuning using the Hugging Face platform.


Installation and Setup

The installation and setup will be on an Ubuntu 22.04 LTS based Linux desktop.

Ensure that the Python 3.x programming language as well as the Jupyter Notebook package are installed and set up on the Linux desktop.

To install the necessary Python packages for this primer, execute the following command:

$ pip install torch datasets transformers

Note that the transformers package downloads pre-trained models from the Hugging Face models hub into a local cache directory as specified by the environment variable TRANSFORMERS_CACHE.

The default cache directory is at ~/.cache/huggingface/hub.

To run a transformers model in offline mode (after the model has been downloaded and cached), set the environment variable TRANSFORMERS_OFFLINE to a value of 1.
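
For example, to force offline mode for a session, set the variable in the shell before launching the Jupyter Notebook:

$ export TRANSFORMERS_OFFLINE=1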

This completes all the installation and setup for the Hugging Face hands-on demonstrations.


Hands-on with Hugging Face

The Pipeline is a base class that provides a simple API abstraction over a complex sequence of operations and makes it easy to use the different models for the various AI/ML tasks, such as sentiment analysis (implemented by the TextClassificationPipeline class), text summarization (implemented by the SummarizationPipeline class), text translation (implemented by the TranslationPipeline class), and so on.

The pipeline() function is a wrapper that creates an instance of the Pipeline class and performs several tasks, such as pre-processing (tokenization, embedding), model execution, and post-processing (labels, scores).
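
To make this abstraction concrete, the following is a minimal sketch of what pipeline('sentiment-analysis') roughly does under the hood, assuming the default DistilBERT checkpoint (the AutoTokenizer and AutoModelForSequenceClassification classes used here are covered later in this primer):


import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'distilbert/distilbert-base-uncased-finetuned-sst-2-english'

# Pre-processing: tokenize the raw text into numerical tensors
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer('The book was great', return_tensors='pt')

# Model execution: forward pass through the pre-trained model
model = AutoModelForSequenceClassification.from_pretrained(model_name)
outputs = model(**inputs)

# Post-processing: convert the raw logits to a label and a score
probs = F.softmax(outputs.logits, dim=-1)
label_id = probs.argmax(dim=-1).item()
{'label': model.config.id2label[label_id], 'score': probs.max().item()}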

Now, let us explore some of the commonly used natural language processing tasks in the following sections.


Sentiment Analysis


Sentiment analysis is the process of analyzing text to determine if the emotional tone of the message is positive, negative, or neutral.

To perform the pre-defined task of sentiment analysis on a given text using the transformers model, execute the following code snippet:


from transformers import pipeline

sa_model = pipeline('sentiment-analysis')

text = 'The restaurant was meh, but expensive'

sa_model(text)

Executing the above Python code generates the following typical output:


Output.1

[{'label': 'POSITIVE', 'score': 0.9943925738334656}]

Note that the default pre-trained model used by Hugging Face for the Sentiment Analysis task is distilbert/distilbert-base-uncased-finetuned-sst-2-english.

To perform sentiment analysis on a list of texts using the default pre-trained transformers model, execute the following code snippet:


from transformers import pipeline

sa_model2 = pipeline('sentiment-analysis')

text_list = ['The book on Leadership was very good',
             'The movie was not that great']

sa_model2(text_list)

Executing the above Python code generates the following typical output:


Output.2

[{'label': 'POSITIVE', 'score': 0.9998422861099243},
 {'label': 'NEGATIVE', 'score': 0.9997527003288269}]

As indicated earlier, Hugging Face is a hub for various pre-trained models shared by other firms and individuals. One can explore the various pre-trained Models shared by others based on the task at hand, such as sentiment analysis.

One of the popular pre-trained models is ProsusAI/finbert. Let us use this model to re-test our list of texts.

To perform sentiment analysis on a list of texts using the model ProsusAI/finbert, execute the following code snippet:


from transformers import pipeline

sa_model3 = pipeline('sentiment-analysis', model='ProsusAI/finbert')

text_list = ['The book on Leadership was very good',
             'The movie was not that great']

sa_model3(text_list)

Executing the above Python code generates the following typical output:


Output.3

[{'label': 'neutral', 'score': 0.6887921690940857},
 {'label': 'neutral', 'score': 0.8795873522758484}]

Text Generation


Text Generation is the process of predicting the most probable next word given an initial sequence of words, appending the predicted word to the sequence, and repeating the process until a constraint (such as a maximum length) is met.
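
As an illustration of this predict-append loop, the following is a minimal sketch of greedy decoding using the gpt2 model directly (the text-generation pipeline performs a more sophisticated version of this internally):


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

gen_tokenizer = AutoTokenizer.from_pretrained('openai-community/gpt2')
gen_model = AutoModelForCausalLM.from_pretrained('openai-community/gpt2')

input_ids = gen_tokenizer('Found a good book on Leadership and it was', return_tensors='pt').input_ids

# Repeatedly pick the most probable next token and append it to the sequence
for _ in range(20):
    with torch.no_grad():
        logits = gen_model(input_ids).logits
    next_token_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_token_id], dim=-1)

gen_tokenizer.decode(input_ids[0])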

To perform the pre-defined task of text generation on a given text using the transformers model, execute the following code snippet:


from transformers import pipeline

tg_model = pipeline('text-generation')

prompt = 'Found a good book on Leadership and it was'

tg_model(prompt, max_length=60)

Executing the above Python code generates the following typical output:


Output.4

[{'generated_text': 'Found a good book on Leadership and it was really helpful."\n\nHe said a few of his friends made some mistakes in their careers, including dropping the "great job" question.\n\n"I can\'t remember the last time I used that word, but it was definitely on the front page'}]

Note that the default pre-trained model used by Hugging Face for the Text Generation task is openai-community/gpt2.

Now, let us try the same text generation task using the model microsoft/phi-2 by executing the following code snippet:


from transformers import pipeline

tg_model2 = pipeline('text-generation', model='microsoft/phi-2')

prompt = 'Found a good book on Leadership and it was'

tg_model2(prompt, max_length=60)

Executing the above Python code generates the following typical output:


Output.5

[{'generated_text': 'Found a good book on Leadership and it was a great read. I am going to share it with my team.\nI am a big fan of the book "The 7 Habits of Highly Effective People" by Stephen Covey. It is a great read and I highly recommend it.\nI'}]


!!! AWARENESS !!!

The model microsoft/phi-2 will take some time to download and execute !!!

Text Summarization


Text Summarization is the process of condensing a lengthy text document into a more compact version that still retains the most important information and meaning from the original text.

To perform the pre-defined task of text summarization on a given text using the transformers model, execute the following code snippet:


from transformers import pipeline

sm_model = pipeline('summarization')

text2 = "New York City comprises 5 boroughs sitting where the Hudson River meets the Atlantic Ocean. At its core is Manhattan, a densely
         populated borough that's among the world's major commercial, financial and cultural centers. Its iconic sites include skyscrapers
         such as the Empire State Building and sprawling Central Park. Broadway theater is staged in neon-lit Times Square"

sm_model(text2)

Executing the above Python code generates the following typical output:


Output.6

[{'summary_text': "New York City comprises 5 boroughs sitting where the Hudson River meets the Atlantic Ocean . At its core is Manhattan, a densely populated borough that's among the world's major commercial, financial and cultural centers . Its iconic sites include skyscrapers such as the Empire State Building and sprawling Central"}]

Note that the default pre-trained model used by Hugging Face for the Text Summarization task is sshleifer/distilbart-cnn-12-6.
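
Note that the length of the generated summary can be controlled by passing the min_length and max_length parameters to the summarization pipeline. For example:


sm_model(text2, min_length=25, max_length=50)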


Zero-Shot Classification


Zero-Shot Classification is the process of classifying a given text into one of a set of specified categories, without the model having been explicitly trained on any of those categories.

To perform the pre-defined task of zero-shot classification on a given text using the transformers model, execute the following code snippet:


from transformers import pipeline

zs_model = pipeline('zero-shot-classification')

text3 = 'Spring Boot is a popular Java Framework and favored by developers'

zs_model(text3, candidate_labels=['Books', 'Movies', 'Technology'])

Executing the above Python code generates the following typical output:


Output.7

{'sequence': 'Spring Boot is a popular Java Framework and favored by developers',
 'labels': ['Technology', 'Books', 'Movies'],
 'scores': [0.8282957673072815, 0.10491892695426941, 0.06678532809019089]}

Note that the default pre-trained model used by Hugging Face for the Zero-Shot Classification task is facebook/bart-large-mnli.
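
Under the hood, zero-shot classification is typically framed as a natural language inference (NLI) problem - the given text is treated as the premise and each candidate label is turned into a hypothesis such as 'This example is about Technology.'. The following is a minimal sketch of that idea for a single candidate label, assuming the facebook/bart-large-mnli checkpoint:


import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

nli_model_name = 'facebook/bart-large-mnli'
nli_tokenizer = AutoTokenizer.from_pretrained(nli_model_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_model_name)

text3 = 'Spring Boot is a popular Java Framework and favored by developers'

# Premise-hypothesis pair for the candidate label 'Technology'
inputs = nli_tokenizer(text3, 'This example is about Technology.', return_tensors='pt')
logits = nli_model(**inputs).logits

# The model outputs logits in the order [contradiction, neutral, entailment];
# drop 'neutral' and softmax over [contradiction, entailment] to score the label
entail_vs_contra = logits[:, [0, 2]]
F.softmax(entail_vs_contra, dim=-1)[:, 1].item()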


Fine Tuning a Model


The more data a model is trained on, the better the model performs. Rather than starting from scratch, one could load a pre-trained model, further train it on a custom domain data set (an approach referred to as Transfer Learning), and then use the trained model for text processing tasks in the custom domain. This process is often referred to as Fine Tuning a model.

For this demonstration, we will use a custom restaurant reviews data set. The reviews data set is split into two tab-separated (TSV) files - one for training and one for evaluation.

To load the custom restaurant reviews data set using the Hugging Face utility function load_dataset() from the datasets module, execute the following code snippet:


from datasets import load_dataset

dataset = load_dataset('csv', data_files={'train': './data/reviews-train.tsv', 'eval': './data/reviews-eval.tsv'}, delimiter='\t')

dataset

Executing the above Python code generates the following typical output:


Output.8

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 750
    })
    eval: Dataset({
        features: ['text', 'label'],
        num_rows: 251
    })
})
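
To take a peek at a sample row from the training split, execute the following code snippet:


dataset['train'][0]

Each row is a dictionary with a 'text' key holding the review and a 'label' key holding the sentiment class.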

Note that the reviews are text values of variable length. However, Deep Learning models work only with numerical values. In order to convert the variable-length texts into numbers, one needs a mechanism to split each text into words and symbols (tokens) and map them to numbers. This is where the Hugging Face tokenizer class AutoTokenizer comes in handy.

To initialize an instance of the AutoTokenizer class from the transformers module, execute the following code snippet:


from transformers import AutoTokenizer

default_model = 'distilbert/distilbert-base-uncased-finetuned-sst-2-english'

tokenizer = AutoTokenizer.from_pretrained(default_model)

To tokenize the given text, execute the following code snippet:


text = 'The restaurant was meh, but expensive'

tokens = tokenizer.tokenize(text)
tokens

Executing the above Python code generates the following typical output:


Output.9

['the', 'restaurant', 'was', 'me', '##h', ',', 'but', 'expensive']

To convert tokens to numbers, execute the following code snippet:


token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids

Executing the above Python code generates the following typical output:


Output.10

[1996, 4825, 2001, 2033, 2232, 1010, 2021, 6450]

Each of the reviews consists of a variable number of words. For any transformer model to learn from the collection of reviews, the following challenges need to be addressed:

- All the text sequences in a batch must be padded to a common length, since the model expects fixed-size tensors
- Text sequences longer than the maximum length supported by the model must be truncated
- The model must be informed (via an attention mask) which tokens are real and which are padding, so that the padding tokens are ignored

To convert a given text into numerical tokens up to a maximum length satisfying the above constraints, execute the following code snippet:


token_ids_mask = tokenizer(text, padding=True, truncation=True, max_length=10, return_tensors='pt')
token_ids_mask

Executing the above Python code generates the following typical output:

Output.11

{'input_ids': tensor([[ 101, 1996, 4825, 2001, 2033, 2232, 1010, 2021, 6450,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

To convert each of the reviews in the data set to its equivalent numerical representation, execute the following code snippet:


def tokenize_reviews(reviews):
  return tokenizer(reviews['text'], padding='max_length', truncation=True)

dataset = dataset.map(tokenize_reviews, batched=True)
dataset

Executing the above Python code generates the following typical output:

Output.12

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 750
    })
    eval: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 251
    })
})

Note that our restaurant reviews only have TWO classes - Negative (with a label of '0') and Positive (with a label of '1').
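
To verify that only these two labels are present in the training split, one could execute the following code snippet:


sorted(dataset['train'].unique('label'))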

To predict the sentiment of a sample review text using the default pre-trained model, execute the following code snippet:


import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification

sa_classifier = AutoModelForSequenceClassification.from_pretrained(default_model, num_labels=2)

# Tokenize the sample review text from the earlier section
id_mask = tokenizer(text, padding=True, truncation=True, return_tensors='pt')

outputs = sa_classifier(input_ids=id_mask['input_ids'], attention_mask=id_mask['attention_mask'])

probs = F.softmax(outputs.logits, dim=-1)
probs

Executing the above Python code generates the following typical output:

Output.13

tensor([[0.0056, 0.9944]], grad_fn=<SoftmaxBackward0>)

The Output.13 above seems to indicate a Positive sentiment, which is really not true !

This is an indication that we need to Fine Tune the default pre-trained model using our custom restaurant reviews data set.

Before we Fine Tune the default pre-trained model, we need to train the tokenizer on the tokens from our custom restaurant reviews data set.

To train our tokenizer on the tokens from the custom restaurant reviews data set, execute the following code snippet:


batch_sz = 16

def reviews_text_iterator(tag):
  for i in range(0, len(dataset[tag]), batch_sz):
      yield dataset[tag]['text'][i : i + batch_sz]

tuned_tokenizer = tokenizer.train_new_from_iterator(reviews_text_iterator('train'), len(tokenizer))

To train our default pre-trained model on the custom restaurant reviews data set, execute the following code snippet:


from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

batch_sz = 16
num_train_epochs = 5
model_dir = './model'

logging_steps = len(dataset['train']) // batch_sz
num_steps = int((len(dataset['train']) / batch_sz) * num_train_epochs)
warmup_steps = int(0.2 * num_steps)
save_steps = 500

# Fresh instance of the default pre-trained model to be fine tuned
tuned_sa_classifier = AutoModelForSequenceClassification.from_pretrained(default_model, num_labels=2)

training_args = TrainingArguments(
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=batch_sz,
    per_device_eval_batch_size=batch_sz,
    load_best_model_at_end=True,
    evaluation_strategy='steps',
    save_strategy='steps',
    learning_rate=2e-5,
    logging_strategy='steps',
    warmup_steps=warmup_steps,
    save_steps=save_steps,
    output_dir=model_dir
)

trainer = Trainer(model=tuned_sa_classifier, tokenizer=tuned_tokenizer, args=training_args,
                  train_dataset=dataset['train'], eval_dataset=dataset['eval'])
trainer.train()

The following are some details about the various training parameters:

- num_train_epochs: the total number of training epochs (complete passes over the training data set)
- per_device_train_batch_size / per_device_eval_batch_size: the batch size per device (GPU/CPU) for training and evaluation respectively
- load_best_model_at_end: whether to load the best model found during the training at the end of the training
- evaluation_strategy / save_strategy / logging_strategy: when to evaluate, save a checkpoint, and log metrics ('steps' means after every set number of training steps)
- learning_rate: the initial learning rate for the optimizer
- warmup_steps: the number of steps over which the learning rate is linearly ramped up from 0 to the specified learning rate
- save_steps: the number of training steps between two checkpoint saves
- output_dir: the directory where the model checkpoints will be written


!!! ATTENTION !!!

If your desktop has a decent GPU, the training will take about 2 minutes. Else, it will run for about 25 minutes on a CPU !!!
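
Once the training completes, one could optionally evaluate the fine-tuned model on the held-out eval split (reporting metrics such as the evaluation loss) by executing the following code snippet:


trainer.evaluate()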

Once again, to predict the sentiment of the sample review text, this time using the fine-tuned model, execute the following code snippet:


tuned_sa_classifier.to('cpu')

tuned_outputs = tuned_sa_classifier(input_ids=id_mask['input_ids'], attention_mask=id_mask['attention_mask'])

tuned_probs = F.softmax(tuned_outputs.logits, dim=-1)
tuned_probs

Executing the above Python code generates the following typical output:

Output.14

tensor([[0.9838, 0.0162]], grad_fn=<SoftmaxBackward0>)

The Output.14 above indicates a Negative sentiment, which seems to be the reality !


Large Language Model


A Large Language Model (or LLM for short) is a very large deep neural network (with billions of parameters) that is pre-trained on vast amounts of data from the Internet, giving it general-purpose language generation capabilities as well as the ability to perform other language processing tasks such as summarizing text, translating languages, completing sentences, etc.

To perform text generation on a given prompt using the recently released microsoft/Phi-3-mini-4k-instruct LLM model, execute the following code snippet:


from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

default_llm_model = 'microsoft/Phi-3-mini-4k-instruct'

llm_tokenizer = AutoTokenizer.from_pretrained(default_llm_model, trust_remote_code=True)
llm_model = AutoModelForCausalLM.from_pretrained(default_llm_model, torch_dtype='auto', trust_remote_code=True)

prompts = [
    {'role': 'user', 'content': 'Describe what Hugging Face is all about in a sentence'}
]

pipe = pipeline('text-generation', model=llm_model, tokenizer=llm_tokenizer)

generation_args = {
  'max_new_tokens': 150,
  'return_full_text': False,
  'temperature': 0.0,
  'do_sample': False
}

llm_output = pipe(prompts, **generation_args)
llm_output

Executing the above Python code generates the following typical output:


Output.15

[{'generated_text': "Hugging Face is a leading platform for natural language processing (NLP) that provides pre-trained models and tools for building and sharing machine learning models.\n\nHere's a more detailed description in a sentence:\n\nHugging Face is an innovative company that offers a comprehensive suite of open-source machine learning models and libraries, primarily focused on natural language understanding and generation, facilitating collaboration and accessibility in the AI research and development community."}]


!!! AWARENESS !!!

The LLM model microsoft/Phi-3-mini-4k-instruct will take a few minutes to download and execute !!!

This concludes the demonstrations on using the Hugging Face platform for the various text processing tasks !


References

Hugging Face

Restaurant Reviews - Training Data

Restaurant Reviews - Evaluation Data



© PolarSPARC