Migrating to Mosaic Composer: A Step-by-Step Guide

In the third installment of our Mosaic Composer series, we cover the steps we took to migrate our custom LLM codebase, written in base PyTorch, to a full integration with both Mosaic Composer and MCLI, ultimately using that setup to train a custom LLM transformer architecture. If you missed our general Composer overview or our deep dive into its features, be sure to take a look.

MCLI Setup

After installing MCLI, you’ll first need to make sure you have access to compute. You can check this by running `mcli get clusters`; if you are connected correctly, you should see some clusters listed along with their configurations. If you don’t, you can reach out to someone over at MosaicML to get some configured.

The next step in getting MCLI set up is to configure your environment variables and integrations. Let’s first go through the git integration (we’re assuming a Windows local environment here, but similar steps hold for other operating systems).

Let’s go through the GitHub steps for completeness. First, you need to generate an SSH key:
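If you don’t already have one, something along these lines works from Git Bash or an OpenSSH-enabled PowerShell (the email is just a label for the key):

```bash
# Generate a new ed25519 key pair; accept the default location (~/.ssh/id_ed25519)
ssh-keygen -t ed25519 -C "you@example.com"
```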

Then go into your GitHub account, open your profile settings, and navigate to SSH and GPG keys. Use a text editor to copy the contents of the public key you just created (it should be named something like id_ed25519.pub in your .ssh folder) and paste it into a new SSH key on GitHub.

Now you have a key you can use with MCLI, but we still need to register it with MCLI so that the instance your job is pushed to can communicate with your GitHub account.

MCLI lets you set up generic secrets for whatever you need, but it also has a few purpose-built integrations, such as with GitHub. For git secret creation you can use a command like:
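Ours followed roughly this shape; the exact flags (in particular `--name`) should be checked against `mcli create secret --help` for your MCLI version, as this is a sketch rather than a copy of our command:

```bash
# "git-ssh" is the secret type; --name sets the secret's name (here also "git-ssh")
mcli create secret git-ssh ~/.ssh/id_ed25519 --name git-ssh
```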

Make sure to replace the filepath to the SSH key with your own file. Note that the first “git-ssh” tells MCLI the type of secret you are creating, and the second is the name of the secret. Naming the secret matters; we ran into several errors from originally leaving it unspecified. If you make a typo and need to redo a secret, you can delete it with `mcli delete secret secret-name` and then recreate it.

The other secret we had to set up was an integration with our AWS account (other cloud providers are supported as well). The cloud provider integration supplies the instance running our job with the MDS data shards, as well as a location for storing objects like logs and model checkpoints.

You will need your AWS credentials stored locally already (if you have the AWS CLI set up and working, this is already done). You can set up the MCLI integration with:
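The command we used looked roughly like this; it points MCLI at the standard AWS credential files, and the flag names should be verified against the MCLI docs for your version:

```bash
# Create an S3 secret from the local AWS config and credentials files
mcli create secret s3 --config-file ~/.aws/config --credentials-file ~/.aws/credentials
```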

Again, ensure those credential files exist before attempting this.

And that should be it for the MCLI secrets setup! You can set up other integrations as needed; we will go through environment variables in a later section.

Data Loader Migration

First, let’s go through how our data was previously processed in our PyTorch-based codebase.

Our codebase was built on plain numpy memmap arrays rather than PyTorch DataLoaders. These numpy binary files were pre-tokenized datasets (The Pile, for example):

Figure 8: Code snippet of numpy memmap usage
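A minimal sketch of this kind of memmap-based sampling (the filename, dtype, and batch settings here are placeholders, not our exact values):

```python
import numpy as np
import torch

block_size = 1024   # context length of the LLM (placeholder)
batch_size = 12     # placeholder

# The pre-tokenized dataset is a flat array of token ids on disk;
# np.memmap lets us index into it without loading the whole file into RAM.
data = np.memmap('train.bin', dtype=np.uint16, mode='r')

def get_batch():
    # Sample random starting positions, slice out context windows,
    # and offset the targets by one token for next-token prediction.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y
```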

As an aside, encoding large datasets with the Hugging Face API can be tricky due to the large storage requirements of the Hugging Face cache, in particular the Apache Arrow pointers, which can push the storage for a loaded Hugging Face dataset up to 10x the size of the data itself. One can avoid this by tokenizing the data in streaming mode (at the cost of speed), but I digress.

Running locally, Composer can work with either the standard PyTorch DataLoader or the Mosaic StreamingDataset class. Let’s first go through integrating with the PyTorch DataLoader directly.

Because our data was preprocessed using the Hugging Face API and pre-tokenized into binary files, we first had to create a wrapper class around those files so that PyTorch DataLoaders could consume them. This meant creating a new class inheriting from the PyTorch Dataset class with the required methods:
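Ours looked roughly like the following sketch; the class name is made up, but `data_file` and `block_size` match the attributes referenced below:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class PreTokenizedDataset(Dataset):
    """Wraps a pre-tokenized numpy binary file so PyTorch DataLoaders can consume it."""

    def __init__(self, data_file: str, block_size: int):
        self.data_file = data_file    # path to the original numpy binary file
        self.block_size = block_size  # context length of the LLM
        self.data = np.memmap(data_file, dtype=np.uint16, mode='r')

    def __len__(self) -> int:
        # Number of full context windows we can draw from the token stream
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # Input is a block of tokens; target is the same block shifted by one token
        x = torch.from_numpy(self.data[idx:idx + self.block_size].astype(np.int64))
        y = torch.from_numpy(self.data[idx + 1:idx + 1 + self.block_size].astype(np.int64))
        return x, y
```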

The `__len__` and `__getitem__` methods are required by PyTorch. Next, in our training loop, we simply instantiate the DataLoader objects and pass our new wrapper class into them:
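Something along these lines, with placeholder batch size and worker count:

```python
from torch.utils.data import DataLoader

train_dataset = PreTokenizedDataset(data_file='train.bin', block_size=1024)
train_dataloader = DataLoader(train_dataset, batch_size=12, shuffle=True, num_workers=4)
```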

Note that the `data_file` attribute is the location of our original numpy binary file.

So, all of that will run locally with Composer. One benefit of using the Mosaic stack, however, is being able to pull in the Mosaic Data Shard format (discussed in the previous section). To take advantage of MDS and ultimately to allow integration with remotely running MCLI jobs, we need to convert our binary file to MDS format and push it to our cloud location. To do that, we set up this simple function:
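A sketch of that function is below. The column names (`x`, `y`), the `pkl` encoding, the output directory, and the sample count are illustrative choices here, not the only way to write the shards:

```python
import numpy as np
from streaming import MDSWriter

def convert_to_mds(data_file: str, block_size: int, out_dir: str = 'mds_data',
                   num_samples: int = 100_000):
    """Convert a pre-tokenized numpy binary file into MDS shards."""
    data = np.memmap(data_file, dtype=np.uint16, mode='r')
    columns = {'x': 'pkl', 'y': 'pkl'}  # one input block and one target block per sample

    with MDSWriter(out=out_dir, columns=columns, compression='zstd') as writer:
        for _ in range(num_samples):
            # Same sampling scheme as our original utility: random index,
            # with the target offset by one token for next-token prediction.
            i = np.random.randint(0, len(data) - block_size)
            x = data[i:i + block_size].astype(np.int64)
            y = data[i + 1:i + 1 + block_size].astype(np.int64)
            writer.write({'x': x, 'y': y})
```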

Here, `block_size` is the context length of the LLM being trained, and `data_file` is the location of the original pre-tokenized numpy binary file.

Note that `x` and `y` are sampled in the same manner as in our original utility function: sample indices from the source dataset and offset the output target by one token, in line with how we want to train our autoregressive sequence model (LLM).

After running this function on the binary file, we see a local output directory (next to the script) that looks like this:
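Roughly speaking (the exact shard names and count depend on your sample count, size limit, and compression settings):

```bash
ls mds_data/
# index.json  shard.00000.mds.zstd  shard.00001.mds.zstd  ...
```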

You should see both the JSON metadata file and the actual data shards.

Next, we want to upload these shards to a location the MCLI job will be able to access, namely AWS S3 storage. We accomplished that with this function:
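A sketch of the upload function, using `boto3` with placeholder bucket and prefix names:

```python
import os
import boto3

def upload_mds_to_s3(local_dir: str, bucket: str, prefix: str = 'mds_data'):
    """Push every file in the local MDS directory up to S3.

    Credentials come from the standard AWS config/credentials files.
    """
    s3 = boto3.client('s3')
    for fname in os.listdir(local_dir):
        local_path = os.path.join(local_dir, fname)
        s3.upload_file(local_path, bucket, f'{prefix}/{fname}')
```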

Nothing revolutionary here; the function simply loops through the files in our MDS directory and pushes them to the specified S3 bucket. 

Almost there!

Now, lastly, in our train loop, we load the shards like this:
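Roughly like the following, with placeholder paths and batch size:

```python
from streaming import StreamingDataset
from torch.utils.data import DataLoader

# remote points at the S3 prefix holding our shards; local is a scratch cache
# on whatever instance runs the job. Both paths are placeholders.
train_dataset = StreamingDataset(remote='s3://your-s3-name/mds_data',
                                 local='/tmp/mds_cache',
                                 shuffle=True,
                                 batch_size=12)
train_dataloader = DataLoader(train_dataset, batch_size=12)
```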

These datasets can be passed directly into the PyTorch DataLoader, as we did with the previous wrapper class, and finally into the Trainer class. The value passed into the `remote` attribute is simply the path to our S3 bucket (something like 's3://your-s3-name/mds_data'), and `local` is simply a cache location the streaming library can use to store files/shards as they are read into the model (local to whatever instance is running the code, mind you).

Whew! That was probably the heftiest part of this migration. Now let’s go through how we set up our model and training loops.

Model Setup

Our original model class was quite extensive, housing multiple attention mechanisms, different block MLP configurations, and many thousands of lines of code worth of other transformer-related weirdness. So rather than rewrite our classes, we opted for the wrapper approach. This ended up being quite simple: since our existing class already implemented the forward method, our wrapper was as simple as:
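A sketch of the wrapper is below; the class name and the exact loss call are illustrative (our real loss lives in our own codebase), but the required structure is what matters:

```python
import torch.nn.functional as F
from composer.models import ComposerModel

class ComposerLLMWrapper(ComposerModel):
    """Thin Composer wrapper around our existing base-PyTorch transformer."""

    def __init__(self, model):
        super().__init__()
        self.model = model  # the original model class, forward() already implemented

    def forward(self, batch):
        # 'x' matches the column name we used when writing the MDS shards
        return self.model(batch['x'])

    def loss(self, outputs, batch):
        # Next-token cross entropy against the one-token-offset targets
        return F.cross_entropy(outputs.view(-1, outputs.size(-1)), batch['y'].view(-1))
```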

A few notes here:

  • The wrapper must inherit from the ComposerModel class (note the `super().__init__()`)
  • It must implement the `forward` and `loss` methods
  • Additionally, note the `batch['x']` indexing; this is an artifact of the way we encoded our data shards

That’s it for the modeling class. You can also build a complete model from scratch in the Mosaic framework; if you go that route, the documentation has some helpful quickstart guides.

Training Setup

This is likely the simplest part of the setup. Our previous training loop was quite extensive, including custom logging setup, checkpointing, and other crucial model training tasks. With the Trainer class, though, we can set up our training loop in just a few lines of code:
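A sketch of the Trainer setup; the duration, device, and variable names are placeholders, and `base_model`, `optimizer`, and `train_dataloader` refer to the objects set up in the earlier steps:

```python
from composer import Trainer

composer_model = ComposerLLMWrapper(base_model)  # wrap the previously instantiated base model

trainer = Trainer(
    model=composer_model,
    train_dataloader=train_dataloader,  # the StreamingDataset-backed DataLoader from above
    optimizers=optimizer,               # the optimizer already attached to our base model
    max_duration='1ep',                 # placeholder duration
    device='gpu',
)
trainer.fit()
```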

You can see here our DataLoader objects being passed into the Trainer class, and the model wrapper we put together being used to wrap our previously instantiated base model. We had to make no changes to our optimizer, which is an attribute of our base model and is simply passed into the Trainer class as well (the `optimizers` argument in the sketch above). You can also set up checkpointing and other parameters in this class; the Composer documentation lists the complete set of Trainer arguments.

YAML Building and Final Submission

Okay, last step. We need to put together a simple YAML file to tell the remote compute instance how to set up and process our job. There are a few things to note about setting up this YAML file:

  • Environment variables: previously, we had our environment variables set up using a locally stored .env file; this method is fine for running Composer locally, but when pushing the job via MCLI we had to move our environment variables either into MCLI secrets (discussed above) or set them in the YAML file. When you set environment variables in the YAML file, your code can read them normally (via something like `os.getenv("FOO")`)
  • The command can vary in its contents, but it should start with the `composer` instruction. Previously, our code ran in multi-node setups with a PyTorch `torchrun` command, where we specified multi-GPU setups using `nproc_per_node` and managed communications between servers by specifying the number of nodes, IP addresses, etc. With Composer we just request a number of GPUs, so the command can be simplified and split across areas of the YAML file
  • When cloning the git repo, we need to be sure to specify what branch we want to clone and run commands on

Here is a picture of our YAML file, expertly highlighted with our graphics editing software (MS Paint):
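In sketch form, the file has roughly this shape. The field names (`compute`, `integrations`, `env_variables`, and friends) reflect the general MCLI run schema and should be checked against the current MCLI docs; the image, repo, and values are placeholders:

```yaml
name: llm-composer-train                  # run name (placeholder)
image: mosaicml/pytorch:latest            # base Docker image (placeholder tag)
compute:
  gpus: 8                                 # just request a GPU count; Composer handles the rest
integrations:
  - integration_type: git_repo
    git_repo: your-org/your-llm-repo      # placeholder repo
    git_branch: composer-migration        # be explicit about which branch to clone
env_variables:
  WANDB_PROJECT: llm-training             # example env var; read in code via os.getenv
command: |
  cd your-llm-repo
  pip install -r requirements.txt
  composer train.py
```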

That’s it! After that, we can submit our job using MCLI and get a nice progress bar indicating training completeness. 

Conclusion

Thanks for bearing with me through that long tour of Mosaic Composer and how we migrated our PyTorch LLM codebase into Composer’s framework. In the grand scheme of migrations it doesn’t take much effort, and it offers a hefty set of benefits as a result, ranging from automated compute right-sizing, to simplified training loops, to easy integration of custom algorithmic changes via callbacks and events, and more. We recommend taking a look at the Mosaic documentation for a more comprehensive view of all the ways Composer can help speed up your workflow.

Also, did we mention it’s free to use? So feel free to do your own comparison against other popular frameworks, such as Lightning, and check out the integrations with DeepSpeed, FSDP, and other popular compute-efficient ways to shard a model. If you have any questions, feel free to reach out to our APD Labs team (which includes me) or the helpful folks over at Databricks/MosaicML. Until next time!

Author: Aaron McClendon, Head of AI
