Training and Scaling
====================

This page provides detailed information on training speechlm2 models, including setup requirements, running experiments at scale, debugging, and parallelism strategies.

Running Experiments
-------------------

The speechlm2 collection includes several scripts to facilitate running experiments, especially on SLURM-based clusters.

SLURM Job Submission
^^^^^^^^^^^^^^^^^^^^

For training on SLURM clusters, use the following workflow:

.. code-block:: bash

    # Submit 8 consecutive jobs with random seeds
    scripts/speechlm2/auto_launcher_with_seed.sh -n8 s2s_tinyllama_repro.sub

The ``auto_launcher_with_seed.sh`` script (a minimal sketch follows the list below):

1. Generates a random seed for each submitted job
2. Leverages ``shard_seed="randomized"`` in Lhotse to ensure each data parallel rank is seeded differently
3. Ensures each tensor parallel rank is seeded identically
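
The launcher itself is a short shell script. The following is only an illustrative sketch of how such a launcher could work; the option parsing, job chaining, and exact behavior are assumptions rather than the actual contents of ``auto_launcher_with_seed.sh``:

.. code-block:: bash

    #!/bin/bash
    # Illustrative sketch of a seed-randomizing SLURM launcher (not the actual script).
    # Usage: ./launcher.sh -n<num_jobs> <sbatch_script>
    set -euo pipefail

    NUM_JOBS=1
    while getopts "n:" opt; do
      case $opt in
        n) NUM_JOBS=$OPTARG ;;
        *) echo "Usage: $0 -n<num_jobs> <sbatch_script>"; exit 1 ;;
      esac
    done
    shift $((OPTIND - 1))
    SUB_SCRIPT=$1

    PREV_JOB_ID=""
    for i in $(seq 1 "$NUM_JOBS"); do
      SEED=$RANDOM  # fresh random seed base, passed to the sbatch script as $1
      if [ -z "$PREV_JOB_ID" ]; then
        PREV_JOB_ID=$(sbatch --parsable "$SUB_SCRIPT" "$SEED")
      else
        # Chain on the previous job so the submissions run consecutively.
        PREV_JOB_ID=$(sbatch --parsable --dependency=afterany:${PREV_JOB_ID} "$SUB_SCRIPT" "$SEED")
      fi
      echo "Submitted job ${PREV_JOB_ID} with seed ${SEED}"
    done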

SLURM Submission Script
^^^^^^^^^^^^^^^^^^^^^^^

Example ``s2s_tinyllama_repro.sub`` script:

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name=s2s_training
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=8
    #SBATCH --gres=gpu:8
    #SBATCH --time=24:00:00
    #SBATCH --exclusive
    #SBATCH --output=s2s_tinyllama_repro_%j.out
    
    # Check that the global random seed base is provided
    if [ -z "$1" ]; then
      echo "Usage: $0 <global_random_seed_base>"
      exit 1
    fi
    SEED=${1}

    EXP_NAME="s2s_training"
    RESULTS_DIR="results/${EXP_NAME}"
    
    srun --ntasks=${SLURM_NTASKS} --ntasks-per-node=${SLURM_NTASKS_PER_NODE} \
      python -u examples/speechlm2/s2s_duplex_train.py \
      --config-path=/path/to/config/dir \
      --config-name=s2s_training.yaml \
      exp_manager.name=${EXP_NAME} \
      exp_manager.wandb_logger_kwargs.name=${EXP_NAME} \
      trainer.num_nodes=$SLURM_JOB_NUM_NODES \
      exp_manager.explicit_log_dir=${RESULTS_DIR} \
      data.train_ds.seed=$SEED \
      data.validation_ds.seed=$SEED 


Configuration Files
^^^^^^^^^^^^^^^^^^^

The main configuration file (``s2s_training.yaml``) contains all model, training, and data parameters. See :doc:`configs` for more details. To keep experiments versioned and configurations easy to inspect, copy and modify this file rather than overriding individual options in the SLURM script.
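
For example, assuming the base configuration sits at ``/path/to/config/dir/s2s_training.yaml`` (as in the SLURM script above; the paths below are illustrative), a copy-and-modify workflow could look like this:

.. code-block:: bash

    # Copy the example config into a version-controlled experiment directory.
    mkdir -p experiments/my_s2s_run
    cp /path/to/config/dir/s2s_training.yaml experiments/my_s2s_run/s2s_training.yaml

    # Edit experiments/my_s2s_run/s2s_training.yaml, then point the training
    # script at the copy instead of overriding options on the command line.
    python examples/speechlm2/s2s_duplex_train.py \
      --config-path=$(pwd)/experiments/my_s2s_run \
      --config-name=s2s_training.yaml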

Debugging
---------

Running Locally with torchrun
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For local debugging and profiling, use ``torchrun``:

.. code-block:: bash

    # Run with 4 GPUs locally
    torchrun --nproc_per_node=4 examples/speechlm2/s2s_duplex_train.py \
      --config-path=/path/to/config/dir \
      --config-name=s2s_training.yaml

Scaling Strategies
------------------

The speechlm2 collection includes support for model parallelism to scale training to large models across multiple GPUs.

Model Parallel Strategies
^^^^^^^^^^^^^^^^^^^^^^^^^

The collection supports multiple parallelism strategies:

1. **Fully Sharded Data Parallel (FSDP2)**: Distributes model parameters across GPUs
2. **Tensor Parallelism (TP)**: Splits individual tensors across GPUs
3. **Sequence Parallelism (SP)**: Splits sequence processing across GPUs
4. **2D Parallelism**: Combination of FSDP2 with TP/SP

Configuration
^^^^^^^^^^^^^

To configure parallelism, modify the ``trainer.strategy`` section in your YAML config:

.. code-block:: yaml

    trainer:
      strategy:
        _target_: nemo.core.ModelParallelStrategy
        find_unused_parameters: False
        data_parallel: 1   # World size for data parallelism (FSDP2)
        tensor_parallel: 8  # World size for tensor parallelism
      devices: 8
      num_nodes: 1
      accelerator: gpu
      precision: bf16-true

The model's ``configure_model`` method automatically sets up the appropriate parallelization based on this configuration.

FSDP2 Configuration
^^^^^^^^^^^^^^^^^^^

For Fully Sharded Data Parallel training:

1. Set ``data_parallel`` to the number of GPUs you want to use for data parallelism
2. Set ``tensor_parallel`` to 1 (disabled)

FSDP2 shards the model parameters across GPUs, all-gathers them for the forward and backward passes, and reshards (frees) them once computation is done. This makes it possible to train larger models within limited GPU memory.
See `PyTorch FSDP2 <https://pytorch.org/docs/stable/distributed.fsdp.fully_shard.html>`_ for more details.
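
As a sketch, a pure FSDP2 setup on a single 8-GPU node could look like the following (based on the configuration example above; adjust the device counts to your hardware):

.. code-block:: yaml

    trainer:
      strategy:
        _target_: nemo.core.ModelParallelStrategy
        find_unused_parameters: False
        data_parallel: 8    # shard parameters across all 8 GPUs (FSDP2)
        tensor_parallel: 1  # tensor parallelism disabled
      devices: 8
      num_nodes: 1
      accelerator: gpu
      precision: bf16-true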

Tensor Parallelism Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For Tensor Parallelism:

1. Set ``tensor_parallel`` to the number of GPUs you want to use for tensor parallelism
2. Set ``data_parallel`` to 1 (or higher for 2D parallelism)

The ``parallelize_module`` function applies a parallelization plan to specific model components, such as splitting attention heads or embedding dimensions across GPUs.
See `PyTorch TP <https://pytorch.org/docs/stable/distributed.tensor.parallel.html>`_ for more details.
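
For 2D parallelism, both degrees are greater than 1 and their product should equal the total number of GPUs. As a sketch, a single 8-GPU node split into 2 data parallel groups of 4 tensor parallel ranks each could be configured as:

.. code-block:: yaml

    trainer:
      strategy:
        _target_: nemo.core.ModelParallelStrategy
        find_unused_parameters: False
        data_parallel: 2    # FSDP2 across 2 groups of GPUs
        tensor_parallel: 4  # each group splits tensors across 4 GPUs
      devices: 8
      num_nodes: 1
      accelerator: gpu
      precision: bf16-true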

Implementation Details
----------------------

The core implementation of model parallelism is in the ``configure_model`` method of the model classes. Key aspects include (a simplified sketch follows the list):

1. **Module Sharding**: Calling ``fully_shard`` on modules to distribute parameters across data parallel ranks
2. **Parallelization Plans**: Creating and applying plans that specify how different layers should be parallelized
3. **Model-Specific Adaptations**: Handling architectural differences between different LLMs
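
The exact code depends on the underlying LLM, so the following is only an illustrative sketch of how such a method can combine the PyTorch ``fully_shard`` and ``parallelize_module`` APIs; the module names and the parallelization plan are assumptions, not the collection's actual implementation:

.. code-block:: python

    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.fsdp import fully_shard
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        parallelize_module,
    )


    def configure_model_sketch(model: nn.Module, dp_size: int, tp_size: int) -> nn.Module:
        # 2D device mesh: outer dim for data parallelism (FSDP2), inner dim for tensor parallelism.
        mesh = init_device_mesh(
            "cuda", (dp_size, tp_size), mesh_dim_names=("data_parallel", "tensor_parallel")
        )

        if tp_size > 1:
            # Parallelization plan: split attention/MLP projections across TP ranks.
            # The module names are hypothetical and depend on the LLM architecture.
            plan = {
                "self_attn.q_proj": ColwiseParallel(),
                "self_attn.k_proj": ColwiseParallel(),
                "self_attn.v_proj": ColwiseParallel(),
                "self_attn.o_proj": RowwiseParallel(),
                "mlp.up_proj": ColwiseParallel(),
                "mlp.down_proj": RowwiseParallel(),
            }
            for layer in model.layers:  # assumes a decoder exposing a .layers ModuleList
                parallelize_module(layer, mesh["tensor_parallel"], plan)

        if dp_size > 1:
            # FSDP2: shard each transformer layer, then the full model, across DP ranks.
            for layer in model.layers:
                fully_shard(layer, mesh=mesh["data_parallel"])
            fully_shard(model, mesh=mesh["data_parallel"])

        return model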

Advanced Usage
--------------

Script Customization
^^^^^^^^^^^^^^^^^^^^

When customizing the training scripts, keep these points in mind (an example of command-line overrides follows the list):

1. **Path Overrides**: Override paths in the YAML configuration files with your own, as needed
2. **W&B Keys**: Update Weights & Biases API keys in configuration files
3. **Batch Size Tuning**: Adjust batch size based on your GPU memory and model size
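
For example, these customizations can live in your copied YAML or be passed as command-line overrides in the same style as the SLURM script above. The keys below are illustrative; check your configuration for the actual names:

.. code-block:: bash

    # The W&B API key can alternatively be supplied via the environment.
    export WANDB_API_KEY=<your_api_key>

    # Override paths and batch size on the command line (the batch size and
    # model path keys are hypothetical; match them to your YAML).
    python examples/speechlm2/s2s_duplex_train.py \
      --config-path=/path/to/config/dir \
      --config-name=s2s_training.yaml \
      exp_manager.explicit_log_dir=results/my_s2s_run \
      data.train_ds.batch_size=4 \
      model.pretrained_llm=/path/to/TinyLlama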
