Learn how to prepare a dataset and create a training job to fine-tune MPT-7B on Amazon SageMaker.
New large language models (LLMs) are being announced every week, each trying to beat its predecessor and take over the evaluation leaderboards. One of the latest models is MPT-7B, which was released by MosaicML. Unlike other models of its kind, this 7-billion-parameter model is open-source and licensed for commercial use (Apache 2.0 license) 🚀.
Foundation models like MPT-7B are pre-trained on datasets with trillions of tokens (100 tokens ~ 75 words) crawled from the web and, when prompted well, they can produce impressive outputs. However, to truly unlock the value of large language models in real-world applications, smart prompt-engineering might not be enough to make them work for your use case and, therefore, fine-tuning a foundation model on a domain-specific dataset is required.
LLMs have billions of parameters and, consequently, fine-tuning such large models is challenging. The good news is that fine-tuning is much cheaper and faster than pre-training the foundation model, given that 1) domain-specific datasets are “small” and 2) fine-tuning requires only a few passes over the training data.
Here is what we will learn in this article: how to prepare a fine-tuning dataset, and how to create a SageMaker training job that fine-tunes MPT-7B on it.
Let’s start by installing the SageMaker Python SDK and a few other packages. This SDK makes it possible to train and deploy machine learning models on AWS with a few lines of Python code. The code below is available in the sagemaker_finetuning.ipynb notebook on GitHub. Run the notebook in SageMaker Studio, on a SageMaker notebook instance, or on your laptop after authenticating to an AWS account.
!pip install "sagemaker==2.162.0" s3path boto3 --quiet
from sagemaker.huggingface import HuggingFace
from sagemaker.inputs import TrainingInput
from sagemaker import s3_utils
import sagemaker
import boto3
import json

The next step is to define the S3 paths where the data will be saved and to create a SageMaker session.
# Define S3 paths
bucket = "<YOUR-S3-BUCKET>"
training_data_path = f"s3://{bucket}/toy_data/train/data.jsonl"
test_data_path = f"s3://{bucket}/toy_data/test/data.jsonl"
output_path = f"s3://{bucket}/outputs"
code_location = f"s3://{bucket}/code"
# Create SageMaker session
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

We will create a dummy dataset to demonstrate how to fine-tune MPT-7B. Since training a model of this size on a complete dataset takes long and is costly, it is a good idea to first test and debug the training job on a small dataset, and only then scale training to the complete dataset. Each example is a dictionary with a prompt and a response:
{
    "prompt": "What is a Pastel de Nata?",
    "response": "A Pastel de Nata is a Portuguese egg custard tart pastry, optionally dusted with cinnamon."
}

The prompt is the input given to the model (e.g., a question). The response is the output that the model is trained to predict (e.g., the answer to the question in the prompt). The raw prompt is often preprocessed to fit into a prompt template that helps the model generate better outputs. Note that the model is trained for causal language modelling, so you can think of it as a “document completer”. It is a good idea to design the prompt template in such a way that the model thinks it is completing a document. Andrej Karpathy explains this mechanism well in his talk State of GPT.
prompt_template = """Write a response that appropriately answers the question below.
### Question:
{question}
### Response:
"""
dataset = [
{"prompt": "What is a Pastel de Nata?",
"response": "A Pastel de Nata is a Portuguese egg custard tart pastry, optionally dusted with cinnamon."},
{"prompt": "Which museums are famous in Amsterdam?",
"response": "Amsterdam is home to various world-famous museums, and no trip to the city is complete without stopping by the Rijksmuseum, Van Gogh Museum, or Stedelijk Museum."},
{"prompt": "Where is the European Parliament?",
"response": "Strasbourg is the official seat of the European Parliament."},
{"prompt": "How is the weather in The Netherlands?",
"response": "The Netherlands is a country that boasts a typical maritime climate with mild summers and cold winters."},
{"prompt": "What are Poffertjes?",
"response": "Poffertjes are a traditional Dutch batter treat. Resembling small, fluffy pancakes, they are made with yeast and buckwheat flour."},
]
# Format prompt based on template
for example in dataset:
    example["prompt"] = prompt_template.format(question=example["prompt"])
training_data, test_data = dataset[0:4], dataset[4:]
print(f"Size of training data: {len(training_data)}\nSize of test data: {len(test_data)}")

We then upload both datasets to S3 in JSON Lines format using a small utility function:

def write_jsonlines_to_s3(data, s3_path):
    """Writes a list of dictionaries as a JSON Lines file to S3"""
    json_string = ""
    for d in data:
        json_string += json.dumps(d) + "\n"
    s3_client = boto3.client("s3")
    bucket, key = s3_utils.parse_s3_url(s3_path)
    s3_client.put_object(
        Body=json_string,
        Bucket=bucket,
        Key=key,
    )
write_jsonlines_to_s3(training_data, training_data_path)
write_jsonlines_to_s3(test_data, test_data_path)

With the datasets available in S3, we will now create a training job in Amazon SageMaker. For that, we have to create an entry point script, modify the configuration file that specifies the training settings, and define a HuggingFace estimator. We will (re-)use the training script from LLM Foundry and the Composer library’s CLI launcher, which sets up the distributed training environment. Both of these packages are maintained by MosaicML, the company behind MPT-7B. The working folder should be structured like:
└── fine-tune-mpt-7b-sagemaker/
    ├── launcher.sh
    ├── finetuning_config.yaml
    └── sagemaker_finetuning.ipynb

We will now dive deeper into each of these files.
finetuning_config.yaml –– The template provided in the LLM Foundry repository is a good starting point, specifically the mpt-7b-dolly-sft.yaml file. However, depending on your dataset size and training instance, you might have to adjust some of these configurations, such as the batch size. I have modified the file to fine-tune the model in SageMaker (check finetuning_config.yaml). The parameters that you should pay attention to are the following:

max_seq_len: 512
global_seed: 17
...
# Dataloaders
train_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_dir: /opt/ml/input/data/train/
  ...

eval_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_dir: /opt/ml/input/data/test/
  ...
max_duration: 3ep
eval_interval: 1ep
...
global_train_batch_size: 128
...
# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false
# Checkpoint to local filesystem or remote object store
save_folder: /tmp/checkpoints
dist_timeout: 2000

max_seq_len indicates the maximum number of input tokens (remember that 100 tokens ~ 75 words). The training and test data will be loaded using the 🤗 Datasets library from the /opt/ml/input/data/{train, test} directory inside the container associated with the training job. Check out the SageMaker Training Storage Folders documentation to understand how the container directories are structured. max_duration specifies the number of epochs for fine-tuning; two to three epochs is typically a good choice. eval_interval indicates how often the model will be evaluated on the test set.
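Using the rule of thumb from above (100 tokens ~ 75 words), a rough heuristic can flag examples that risk exceeding max_seq_len before you pay for a training run. This is only an approximation, not the model's actual tokenizer:

```python
def approx_token_count(text: str) -> int:
    # Rule of thumb from the article: 100 tokens ~ 75 words,
    # i.e. roughly 4/3 tokens per word. NOT a substitute for
    # the model's real tokenizer.
    return round(len(text.split()) * 100 / 75)

max_seq_len = 512  # must match finetuning_config.yaml
prompt = "What is a Pastel de Nata?"

print(approx_token_count(prompt))                 # -> 8
print(approx_token_count(prompt) <= max_seq_len)  # -> True
```

For a production dataset you would run the real tokenizer over all examples, but this check is enough to catch obviously oversized inputs early.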
The distributed training strategy is Fully Sharded Data Parallel (FSDP), which enables efficient training of large models like MPT-7B. Unlike the traditional data parallel strategy, which keeps a copy of the model in each GPU, FSDP shards model parameters, optimizer states, and gradients across data parallel workers. If you want to learn more about FSDP, check this insightful PyTorch intro post. FSDP is integrated in Composer, the distributed training library used by LLM Foundry.
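One practical consequence: FSDP is still data-parallel, so with global_train_batch_size: 128 and the 8 GPUs of a single g5.48xlarge, each worker processes 16 examples per optimizer step. A back-of-the-envelope check (this mirrors the arithmetic only, not Composer's exact microbatching logic):

```python
global_train_batch_size = 128  # from finetuning_config.yaml
num_instances = 1              # one ml.g5.48xlarge
gpus_per_instance = 8          # 8x NVIDIA A10G

# Each GPU is one data-parallel worker under FSDP
num_workers = num_instances * gpus_per_instance
per_device_batch_size = global_train_batch_size // num_workers

print(per_device_batch_size)  # -> 16
```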
save_folder determines where the model checkpoint (.pt file) is saved. We set it to the temporary folder /tmp/checkpoints.
launcher.sh –– A bash script is used as the entry point. It clones the LLM Foundry repository, installs the requirements, and, more importantly, runs the training script using the Composer library’s distributed launcher. Note that, typically, training jobs in SageMaker run the training script with a command like python train.py. However, it is also possible to pass a bash script as the entry point, which provides more flexibility in our scenario. Finally, we convert the model checkpoint saved to /tmp/checkpoints to the HuggingFace model format and save the final artifacts into /opt/ml/model/. SageMaker compresses all files in this directory into a tarball model.tar.gz and uploads it to S3. The tarball is useful for inference.

# Clone llm-foundry package from MosaicML
# This is where the training script is hosted
git clone https://github.com/mosaicml/llm-foundry.git
cd llm-foundry
# Install required packages
pip install -e ".[gpu]"
pip install git+https://github.com/mosaicml/composer.git@dev
# Run training script with fine-tuning configuration
composer scripts/train/train.py /opt/ml/code/finetuning_config.yaml
# Convert Composer checkpoint to HuggingFace model format
python scripts/inference/convert_composer_to_hf.py \
  --composer_path /tmp/checkpoints/latest-rank0.pt \
  --hf_output_path /opt/ml/model/hf_fine_tuned_model \
  --output_precision bf16
# Print content of the model artifact directory
ls /opt/ml/model/

HuggingFace estimator –– Finally, we define the training job via a HuggingFace estimator. We will use an ml.g5.48xlarge instance, which has 8x NVIDIA A10G GPUs. The p4d.24xlarge is also a good choice; even though it is more expensive, it is equipped with 8x NVIDIA A100 GPUs. We also indicate the metrics to track on the training and test sets (cross entropy and perplexity). The values of these metrics are captured via regex expressions and sent to Amazon CloudWatch.

# Define container image for the training job
training_image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-training:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04-v1.1"
# Define metrics to send to CloudWatch
metrics = [
    # On training set
    {"Name": "train:LanguageCrossEntropy",
     "Regex": "Train metrics/train/LanguageCrossEntropy: ([+-]?((\\d+\\.?\\d*)|(\\.\\d+)))"},
    {"Name": "train:LanguagePerplexity",
     "Regex": "Train metrics/train/LanguagePerplexity: ([+-]?((\\d+\\.?\\d*)|(\\.\\d+)))"},
    # On test set
    {"Name": "test:LanguageCrossEntropy",
     "Regex": "Eval metrics/eval/LanguageCrossEntropy: ([+-]?((\\d+\\.?\\d*)|(\\.\\d+)))"},
    {"Name": "test:LanguagePerplexity",
     "Regex": "Eval metrics/eval/LanguagePerplexity: ([+-]?((\\d+\\.?\\d*)|(\\.\\d+)))"},
]
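The metric regexes can be tested locally before launching a job. The log line below is a hypothetical example of the format Composer prints during training; the pattern is the same one used in the metric definitions above:

```python
import re

# Same pattern as in the "train:LanguageCrossEntropy" metric definition
pattern = r"Train metrics/train/LanguageCrossEntropy: ([+-]?((\d+\.?\d*)|(\.\d+)))"

# Hypothetical log line in the format Composer emits during training
log_line = "Train metrics/train/LanguageCrossEntropy: 2.3456"

match = re.search(pattern, log_line)
print(match.group(1))  # -> 2.3456
```

If the regex does not match the actual training logs, the metric simply never shows up in CloudWatch, so a quick local check like this can save a debugging round-trip.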
estimator_args = {
    "image_uri": training_image_uri,      # Training container image
    "entry_point": "launcher.sh",         # Launcher bash script
    "source_dir": ".",                    # Directory with launcher script and configuration file
    "instance_type": "ml.g5.48xlarge",    # Instance type
    "instance_count": 1,                  # Number of training instances
    "base_job_name": "fine-tune-mpt-7b",  # Prefix of the training job name
    "role": role,                         # IAM role
    "volume_size": 300,                   # Size of the EBS volume attached to the instance (GB)
    "py_version": "py310",                # Python version
    "metric_definitions": metrics,        # Metrics to track
    "output_path": output_path,           # S3 location where the model artifact will be uploaded
    "code_location": code_location,       # S3 location where the source code will be saved
    "disable_profiler": True,             # Do not create profiler instance
    "keep_alive_period_in_seconds": 240,  # Enable Warm Pools while experimenting
}
huggingface_estimator = HuggingFace(**estimator_args)

⚠️ Make sure to request the respective quotas for SageMaker Training, along with the Warm Pools quota in case you are making use of this cool feature. If you plan to run many jobs in SageMaker, take a look at SageMaker Savings Plans.

We launch the training job by calling fit with the S3 locations of the training and test sets:
huggingface_estimator.fit({
    "train": TrainingInput(
        s3_data=training_data_path,
        content_type="application/jsonlines"),
    "test": TrainingInput(
        s3_data=test_data_path,
        content_type="application/jsonlines"),
}, wait=True)

The training time will depend on the size of your dataset. With our dummy dataset, training takes roughly 20 minutes to complete. Once the model is trained and converted to the 🤗 HuggingFace format, SageMaker uploads the model tarball (model.tar.gz) to the S3 output_path. I found that in practice the uploading step takes rather long (>1h), which might be due to the size of the model artifacts to compress (~25GB).
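Once the job finishes, you can download model.tar.gz from output_path and unpack it locally. The sketch below demonstrates the unpacking step on a locally created stand-in archive (no AWS access needed); the directory name hf_fine_tuned_model matches the --hf_output_path used in the launcher script:

```python
import json
import tarfile
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())

# Stand-in for the artifact SageMaker builds from /opt/ml/model/:
# a directory with the converted HuggingFace model files
model_dir = workdir / "hf_fine_tuned_model"
model_dir.mkdir()
(model_dir / "config.json").write_text(json.dumps({"model_type": "mpt"}))

# SageMaker compresses /opt/ml/model/ into model.tar.gz; we do the same here
tarball = workdir / "model.tar.gz"
with tarfile.open(tarball, "w:gz") as tar:
    tar.add(model_dir, arcname="hf_fine_tuned_model")

# Unpack, as you would after downloading model.tar.gz from S3
extract_dir = workdir / "extracted"
with tarfile.open(tarball, "r:gz") as tar:
    tar.extractall(extract_dir)

config = json.loads(
    (extract_dir / "hf_fine_tuned_model" / "config.json").read_text()
)
print(config["model_type"])  # -> mpt
```

From there, the extracted directory can be passed to the usual HuggingFace loading utilities for local inference or further deployment.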
In this article, I showed how you can prepare a dataset and create a training job in SageMaker to fine-tune MPT-7B for your use case. The implementation leverages the training script from LLM Foundry and uses the Composer library’s distributed training launcher. Once you have fine-tuned your model and want to deploy it, I recommend checking out the blog posts by Philipp Schmid; there are plenty of examples of how to deploy LLMs in SageMaker. Have fun with your fine-tuned MPT-7B model! 🎉
All the code used in this article is available on my Github:
GitHub – jpcpereira/sagemaker-fine-tune-mpt-7b