Fine-tune MPT-7B on Amazon SageMaker

Learn how to prepare a dataset and create a training job to fine-tune MPT-7B on Amazon SageMaker.

New large language models (LLMs) are announced every week, each trying to beat its predecessor and take over the evaluation leaderboards. One of the latest is MPT-7B, released by MosaicML. Unlike many comparable models, this 7-billion-parameter model is open-source and licensed for commercial use (Apache 2.0 license) 🚀.

Foundation models like MPT-7B are pre-trained on datasets with trillions of tokens (100 tokens ~ 75 words) crawled from the web and, when prompted well, they can produce impressive outputs. However, smart prompt engineering might not be enough to make an LLM work for a real-world use case. In that case, fine-tuning the foundation model on a domain-specific dataset is required.

LLMs have billions of parameters, so fine-tuning them is challenging. The good news is that fine-tuning is much cheaper and faster than pre-training the foundation model, given that 1) domain-specific datasets are “small” and 2) fine-tuning requires only a few passes over the training data.

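This article covers three steps: installing the dependencies and setting the S3 paths, building a fine-tuning dataset, and creating a SageMaker training job.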

Install dependencies and set S3 paths

Let’s start by installing the SageMaker Python SDK and a few other packages. This SDK makes it possible to train and deploy machine learning models on AWS with a few lines of Python code. The code below is available in the sagemaker_finetuning.ipynb notebook on GitHub. Run the notebook in SageMaker Studio, on a SageMaker notebook instance, or on your laptop after authenticating to an AWS account.

!pip install "sagemaker==2.162.0" s3path boto3 --quiet

from sagemaker.huggingface import HuggingFace
from sagemaker.inputs import TrainingInput
from sagemaker import s3_utils
import sagemaker
import boto3
import json

The next step is to define the paths where the data will be saved in S3 and to create a SageMaker session.

# Define S3 paths

bucket = "<YOUR-S3-BUCKET>"
training_data_path = f"s3://{bucket}/toy_data/train/data.jsonl"
test_data_path = f"s3://{bucket}/toy_data/test/data.jsonl"
output_path = f"s3://{bucket}/outputs"
code_location = f"s3://{bucket}/code"

# Create SageMaker session

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

Build a fine-tuning dataset

We will create a dummy dataset to demonstrate how to fine-tune MPT-7B. Since training a model of this size on a complete dataset is slow and costly, it is a good idea to first test and debug the training job on a small dataset, and only then scale training to the complete dataset. Each example in the dataset is a dictionary with a prompt and a response:

{
"prompt": "What is a Pastel de Nata?",
"response": "A Pastel de Nata is a Portuguese egg custard tart pastry, optionally dusted with cinnamon."
}

The prompt is the input given to the model (e.g., a question). The response is the output that the model is trained to predict (e.g., the answer to the question in the prompt). The raw prompt is often preprocessed to fit a prompt template that helps the model generate better outputs. Note that the model is trained for causal language modelling, so you can think of it as a “document completer”. It is a good idea to design the prompt template so that the model thinks it is completing a document. Andrej Karpathy explains this mechanism well in his talk State of GPT.
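To make the “document completer” idea concrete, here is a minimal sketch of how a prompt and its response can be viewed as one document. The `SIMPLE_TEMPLATE` string and the `build_document` helper are assumptions made for illustration. They are not part of LLM Foundry, which handles tokenization and concatenation in its own dataloader.

```python
# Illustrative only: a simplified template, not necessarily the one used for training.
SIMPLE_TEMPLATE = "### Question:\n{question}\n\n### Response:\n"


def build_document(question: str, response: str = "") -> str:
    """Return the text a causal LM sees: the formatted prompt followed by the response.

    With an empty response, the result is what you would feed the model at
    inference time, leaving it to "complete the document".
    """
    return SIMPLE_TEMPLATE.format(question=question) + response


# During fine-tuning, the document contains both the prompt and the response.
training_text = build_document(
    "What is a Pastel de Nata?",
    "A Pastel de Nata is a Portuguese egg custard tart pastry.",
)

# At inference time, the document stops right after the response header.
inference_text = build_document("What are Poffertjes?")

print(training_text)
print("---")
print(inference_text)
```

Ending the template with a response header gives the model a clear cue that the next tokens should be the answer.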

prompt_template = """Write a response that appropriately answers the question below.

### Question:

{question}

### Response:

"""

dataset = [
{"prompt": "What is a Pastel de Nata?",
"response": "A Pastel de Nata is a Portuguese egg custard tart pastry, optionally dusted with cinnamon."},
{"prompt": "Which museums are famous in Amsterdam?",
"response": "Amsterdam is home to various world-famous museums, and no trip to the city is complete without stopping by the Rijksmuseum, Van Gogh Museum, or Stedelijk Museum."},
{"prompt": "Where is the European Parliament?",
"response": "Strasbourg is the official seat of the European Parliament."},
{"prompt": "How is the weather in The Netherlands?",
"response": "The Netherlands is a country that boasts a typical maritime climate with mild summers and cold winters."},
{"prompt": "What are Poffertjes?",
"response": "Poffertjes are a traditional Dutch batter treat. Resembling small, fluffy pancakes, they are made with yeast and buckwheat flour."},
]

# Format prompt based on template

for example in dataset:
    example["prompt"] = prompt_template.format(question=example["prompt"])

training_data, test_data = dataset[0:4], dataset[4:]

print(f"Size of training data: {len(training_data)}\nSize of test data: {len(test_data)}")

def write_jsonlines_to_s3(data, s3_path):
    """Writes list of dictionaries as a JSON lines file to S3"""

    json_string = ""
    for d in data:
        json_string += json.dumps(d) + "\n"

    s3_client   = boto3.client("s3")

    bucket, key = s3_utils.parse_s3_url(s3_path)
    s3_client.put_object(
        Body   = json_string,
        Bucket = bucket,
        Key    = key,
    )

write_jsonlines_to_s3(training_data, training_data_path)
write_jsonlines_to_s3(test_data, test_data_path)
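Before uploading, it can be worth checking the JSON Lines payload locally. The sketch below is not part of the original notebook: `to_jsonlines` and `validate_jsonlines` are hypothetical helpers that serialize a few records, parse them back, and verify that each record has the `prompt` and `response` keys used in this article.

```python
import json


def to_jsonlines(records):
    """Serialize a list of dictionaries as a JSON Lines string (one object per line)."""
    return "".join(json.dumps(record) + "\n" for record in records)


def validate_jsonlines(text, required_keys=("prompt", "response")):
    """Parse a JSON Lines string and check that every record has the required keys."""
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    for index, record in enumerate(records):
        missing = [key for key in required_keys if key not in record]
        if missing:
            raise ValueError(f"Record {index} is missing keys: {missing}")
    return records


sample = [
    {"prompt": "What are Poffertjes?", "response": "A traditional Dutch batter treat."},
    {"prompt": "Where is the European Parliament?", "response": "Strasbourg."},
]

payload = to_jsonlines(sample)        # what would be written to S3
parsed = validate_jsonlines(payload)  # round trip to catch malformed lines early
print(f"{len(parsed)} valid records")
```

A malformed file is much cheaper to catch here than after a GPU instance has been provisioned.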

SageMaker Training job

With the datasets available in S3, we will now create a training job in Amazon SageMaker. For that, we have to create an entry point script, modify the configuration file that specifies the training settings, and define a HuggingFace estimator. We will reuse the training script from LLM Foundry and the CLI launcher from the Composer library, which sets up the distributed training environment. Both packages are maintained by MosaicML, the company behind MPT-7B. The working folder should be structured as follows:

└── fine-tune-mpt-7b-sagemaker/
    ├── launcher.sh
    ├── finetuning_config.yaml
    └── sagemaker_finetuning.ipynb

We will now dive deep into each of these files.

max_seq_len: 512
global_seed: 17

...

# Dataloaders
train_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_dir: /opt/ml/input/data/train/
  ...

eval_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_dir: /opt/ml/input/data/test/

...
max_duration: 3ep
eval_interval: 1ep
...
global_train_batch_size: 128

...

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false

# Checkpoint to local filesystem or remote object store
save_folder: /tmp/checkpoints
dist_timeout: 2000

max_seq_len sets the maximum number of tokens in the input (remember that 100 tokens ~ 75 words). The training and test data are loaded with the 🤗 Datasets library from the /opt/ml/input/data/{train, test} directories inside the container associated with the training job. Check out the SageMaker Training Storage Folders documentation to understand how the container directories are structured. max_duration specifies the number of epochs for fine-tuning; two to three epochs is typically a good choice. eval_interval indicates how often the model is evaluated on the test set.
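As a rough aid for choosing `max_seq_len`, the “100 tokens ~ 75 words” rule of thumb can be turned into a quick estimate. This is only a sketch: `approx_words` and `approx_tokens` are hypothetical helpers, and the actual token count depends on the model's tokenizer.

```python
import math

WORDS_PER_TOKEN = 75 / 100  # rule of thumb from this article, not an exact ratio


def approx_words(num_tokens: int) -> int:
    """Estimate how many English words fit in a given token budget."""
    return math.floor(num_tokens * WORDS_PER_TOKEN)


def approx_tokens(text: str) -> int:
    """Estimate the token count of a text from its whitespace-separated words."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


max_seq_len = 512
word_budget = approx_words(max_seq_len)
question_tokens = approx_tokens("What is a Pastel de Nata?")

print(f"A budget of {max_seq_len} tokens holds roughly {word_budget} words")
print(f"The example question uses roughly {question_tokens} tokens")
```

If the formatted prompt plus response in your data regularly comes close to the budget, consider a larger `max_seq_len` and measure with the real tokenizer.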

The distributed training strategy is Fully Sharded Data Parallel (FSDP), which enables efficient training of large models like MPT-7B. Unlike the traditional data parallel strategy, which keeps a full copy of the model on each GPU, FSDP shards model parameters, optimizer states, and gradients across data parallel workers. If you want to learn more about FSDP, check out this insightful PyTorch intro post. FSDP is integrated in Composer, the distributed training library used by LLM Foundry.
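A back-of-the-envelope calculation shows why sharding matters. The sketch below assumes 12 bytes of training state per parameter (2 for bf16 weights, 2 for bf16 gradients, 8 for two fp32 Adam moments) and ignores activations and framework overhead. Both the byte count and the `training_state_gb` helper are illustrative assumptions, not figures measured on this job.

```python
def training_state_gb(num_params: int, bytes_per_param: int = 12,
                      num_gpus: int = 1, shard: bool = False) -> float:
    """Approximate per-GPU memory (GB) for weights, gradients, and optimizer states.

    bytes_per_param=12 is an assumption: 2 (bf16 weights) + 2 (bf16 gradients)
    + 8 (two fp32 Adam moments). Activations are not included.
    """
    total_bytes = num_params * bytes_per_param
    per_gpu_bytes = total_bytes / num_gpus if shard else total_bytes
    return per_gpu_bytes / 1e9


NUM_PARAMS = 7_000_000_000  # "7-billion-parameter model"
NUM_GPUS = 8                # assumed GPU count for a single multi-GPU instance

replicated = training_state_gb(NUM_PARAMS, num_gpus=NUM_GPUS, shard=False)
sharded = training_state_gb(NUM_PARAMS, num_gpus=NUM_GPUS, shard=True)

print(f"Data parallel (full copy per GPU): ~{replicated:.1f} GB per GPU")
print(f"FSDP FULL_SHARD across {NUM_GPUS} GPUs: ~{sharded:.1f} GB per GPU")
```

Under these assumptions, a full copy of the training state would not fit on a single 24 GB GPU, while the sharded state leaves headroom for activations.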

save_folder determines where the model checkpoint (.pt file) is saved. We set it to the temporary folder /tmp/checkpoints.

# Clone llm-foundry package from MosaicML

# This is where the training script is hosted

git clone https://github.com/mosaicml/llm-foundry.git
cd llm-foundry

# Install required packages

pip install -e ".[gpu]"
pip install git+https://github.com/mosaicml/composer.git@dev

# Run training script with fine-tuning configuration

composer scripts/train/train.py /opt/ml/code/finetuning_config.yaml

# Convert Composer checkpoint to HuggingFace model format

python scripts/inference/convert_composer_to_hf.py \
    --composer_path /tmp/checkpoints/latest-rank0.pt \
    --hf_output_path /opt/ml/model/hf_fine_tuned_model \
    --output_precision bf16

# Print content of the model artifact directory

ls /opt/ml/model/

# Define container image for the training job

training_image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04-v1.1"

# Define metrics to send to CloudWatch

metrics = [
    # On training set
    {"Name": "train:LanguageCrossEntropy",
     "Regex": r"Train metrics/train/LanguageCrossEntropy: ([+-]?((\d+\.?\d*)|(\.\d+)))"},
    {"Name": "train:LanguagePerplexity",
     "Regex": r"Train metrics/train/LanguagePerplexity: ([+-]?((\d+\.?\d*)|(\.\d+)))"},
    # On test set
    {"Name": "test:LanguageCrossEntropy",
     "Regex": r"Eval metrics/eval/LanguageCrossEntropy: ([+-]?((\d+\.?\d*)|(\.\d+)))"},
    {"Name": "test:LanguagePerplexity",
     "Regex": r"Eval metrics/eval/LanguagePerplexity: ([+-]?((\d+\.?\d*)|(\.\d+)))"},
]

estimator_args = {
"image_uri": training_image_uri, # Training container image
"entry_point": "launcher.sh", # Launcher bash script
"source_dir": ".", # Directory with launcher script and configuration file
"instance_type": "ml.g5.48xlarge", # Instance type
"instance_count": 1, # Number of training instances
"base_job_name": "fine-tune-mpt-7b", # Prefix of the training job name
"role": role, # IAM role
"volume_size": 300, # Size of the EBS volume attached to the instance (GB)
"py_version": "py310", # Python version
"metric_definitions": metrics, # Metrics to track
"output_path": output_path, # S3 location where the model artifact will be uploaded
"code_location": code_location, # S3 location where the source code will be saved
"disable_profiler": True, # Do not create profiler instance
"keep_alive_period_in_seconds": 240, # Enable Warm Pools while experimenting
}

huggingface_estimator = HuggingFace(**estimator_args)

⚠️ Make sure to request the respective quotas for SageMaker Training, along with the Warm Pools quota if you use that feature. If you plan to run many jobs in SageMaker, take a look at SageMaker Savings Plans.
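Metric regexes are easy to get wrong, and a mistake only shows up as an empty CloudWatch chart. A metric definition of the shape used in this article can be checked locally with Python's `re` module before launching the job. The sample log line below is hypothetical and written to match the pattern; the exact format of the Composer logs may differ.

```python
import re

# Matches signed integers and decimals such as 3, -0.5, 1.2345 or .75
FLOAT_PATTERN = r"([+-]?((\d+\.?\d*)|(\.\d+)))"
metric_regex = "Train metrics/train/LanguageCrossEntropy: " + FLOAT_PATTERN

# Hypothetical log line for illustration; real training logs may look different.
sample_log = "Train metrics/train/LanguageCrossEntropy: 1.2345"

match = re.search(metric_regex, sample_log)
# The first capture group holds the numeric value of the metric.
value = float(match.group(1)) if match else None
print(value)
```

Running the same check against a few lines copied from a real job's logs is a quick way to confirm that the metrics will be picked up.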

huggingface_estimator.fit({
"train": TrainingInput(
s3_data=training_data_path,
content_type="application/jsonlines"),
"test": TrainingInput(
s3_data=test_data_path,
content_type="application/jsonlines"),
}, wait=True)

The training time depends on the size of your dataset. With our dummy dataset, training takes roughly 20 minutes to complete. Once the model is trained and converted to 🤗 HuggingFace format, SageMaker uploads the model tarball (model.tar.gz) to the S3 output_path. In practice, I found that the upload step takes rather long (>1h), possibly because of the size of the model artifacts to compress (~25GB).
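To locate the tarball afterwards, note that SageMaker stores it under the job name inside the output path. The sketch below builds that URI; `model_artifact_uri`, the bucket, and the job name are hypothetical placeholders. After `fit` completes, the estimator's `model_data` attribute should hold the same location.

```python
def model_artifact_uri(output_path: str, job_name: str) -> str:
    """Build the S3 URI where SageMaker places model.tar.gz for a training job."""
    return f"{output_path.rstrip('/')}/{job_name}/output/model.tar.gz"


# Placeholder values for illustration only.
example_output_path = "s3://my-example-bucket/outputs"
example_job_name = "fine-tune-mpt-7b-2023-06-01-12-00-00-000"

artifact_uri = model_artifact_uri(example_output_path, example_job_name)
print(artifact_uri)
```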

Summary

In this article, I showed how you can prepare a dataset and create a training job in SageMaker to fine-tune MPT-7B for your use case. The implementation leverages the training script from LLM Foundry and uses the Composer library’s distributed training launcher. Once you have fine-tuned your model and want to deploy it, I recommend checking out the blog posts by Philipp Schmid; they include plenty of examples of how to deploy LLMs in SageMaker. Have fun with your fine-tuned MPT-7B model! 🎉

All the code used in this article is available on my GitHub:

GitHub – jpcpereira/sagemaker-fine-tune-mpt-7b
