lamp.data.distributed
Type members
Value members
Concrete methods
Drives the distributed training loop.
Must be called on the root rank. If nranks > 1, followDistributedTraining must be called on the remaining ranks.
The batch streams across all ranks must:
- not contain empty batches
- have the same number of batches.
Models across all ranks must have the same shape.
Communication is done by two independent communication channels:
- tensor data is sent via NCCL, so NCCL's requirements for network setup apply (i.e. a single private network if distributed). This method sets up and tears down the NCCL communication clique.
- control messages and the initial rendezvous use an implementation of DistributedCommunicationRoot and DistributedCommunicationNonRoot. This is a very low traffic channel: one message before each epoch. An Akka implementation is provided, suitable for both distributed and single-process multi-GPU settings. A within-process cats-effect implementation is also provided for single-process multi-GPU settings.
For single-process multi-GPU settings lamp provides two mutually exclusive alternatives (a usage sketch follows this list):
- the training loop in IOLoops (see the dataParallelModels argument of IOLoops.epochs)
- the distributed training loop in this package (this method on the root rank together with followDistributedTraining on the other ranks)
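As an illustration of the second alternative in a single process, the sketch below runs the root loop and the follower loops concurrently on one JVM with cats-effect. rootLoop and followerLoop are hypothetical placeholders standing in for calls to driveDistributedTraining and followDistributedTraining; the real lamp signatures (models, batch streams, communication objects) are not shown.

```scala
import cats.effect.{IO, IOApp}
import cats.syntax.all._

// Sketch only: rootLoop and followerLoop are placeholders for
// driveDistributedTraining (rank 0) and followDistributedTraining (ranks > 0).
object SingleProcessMultiGpuSketch extends IOApp.Simple {
  val nranks = 4

  def rootLoop: IO[Unit] = IO.unit                 // would call driveDistributedTraining
  def followerLoop(rank: Int): IO[Unit] = IO.unit  // would call followDistributedTraining

  // All ranks run concurrently in the same process, one GPU per rank;
  // NCCL exchanges tensor data, the control channel signals epoch boundaries.
  def run: IO[Unit] =
    (0 until nranks).toList.parTraverse_ { rank =>
      if (rank == 0) rootLoop else followerLoop(rank)
    }
}
```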
Drives multiple epochs to find the minimum of smoothed validation loss
This method does not explicitly train a model; it assumes a side-effecting, effectful function which steps an optimizer through a whole epoch's worth of batches.
- Value parameters:
- checkpointState
Function to checkpoint the state managed in this loop.
- epochs
Maximum number of epochs to run
- initState
Initial state of the validation loss tracking
- learningRateScheduleInitState
Initial state of the learning rate schedule
- returnMinValidationLossModel
In which epochs to calculate the validation loss
- saveModel
A side effect to save the current optimizer and model states
- trainEpoch
An effectful function which steps the optimizer over a complete epoch and returns the training loss
- validationEpoch
An effectful function which steps through in forward mode a complete epoch and returns the validation loss
- validationFrequency
How often (by epoch count) to calculate the validation loss
- validationLossExponentialSmoothingFactor
Smoothing factor for the exponential smoothing of the validation loss; must be <= 1.0
- Returns:
The final loop state
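To make the contract of these parameters concrete, here is a minimal, self-contained sketch of such an epoch driver in cats-effect IO. It is not lamp's implementation: the state type, checkpointing, and learning rate schedule handling are simplified, and all names are illustrative.

```scala
import cats.effect.IO

// Illustrative loop state: current epoch, smoothed validation loss, best value so far.
final case class LoopState(
    epoch: Int,
    smoothedValidation: Option[Double],
    minValidation: Option[Double]
)

def epochsSketch(
    epochs: Int,
    validationFrequency: Int,
    smoothingFactor: Double,            // validationLossExponentialSmoothingFactor, <= 1.0
    trainEpoch: Int => IO[Double],      // steps the optimizer over one epoch, returns training loss
    validationEpoch: Int => IO[Double], // forward pass over the validation set, returns validation loss
    saveModel: IO[Unit]                 // side effect saving optimizer and model state
): IO[LoopState] = {
  def step(state: LoopState): IO[LoopState] =
    if (state.epoch >= epochs) IO.pure(state)
    else
      for {
        _ <- trainEpoch(state.epoch)
        next <-
          if (state.epoch % validationFrequency == 0)
            validationEpoch(state.epoch).flatMap { v =>
              // exponential smoothing of the validation loss
              val smoothed = state.smoothedValidation match {
                case None       => v
                case Some(prev) => smoothingFactor * v + (1d - smoothingFactor) * prev
              }
              val isMin = state.minValidation.forall(smoothed < _)
              val save = if (isMin) saveModel else IO.unit
              save.as(
                LoopState(
                  epoch = state.epoch + 1,
                  smoothedValidation = Some(smoothed),
                  minValidation = if (isMin) Some(smoothed) else state.minValidation
                )
              )
            }
          else IO.pure(state.copy(epoch = state.epoch + 1))
        result <- step(next)
      } yield result

  step(LoopState(0, None, None))
}
```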
Follows a distributed training loop. See the documentation of driveDistributedTraining.
Data parallel training loop driving multiple devices from a single process
modelsWithDataStreams is a sequence of models together with the training and validation streams allocated to each device. The streams must have the same length and must not contain empty batches. Models must have the same shape. Once the returned suspended side effect completes, the trained model is in the first element of modelsWithDataStreams.
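The requirement that every device's stream has the same length can be met by dropping the remainder when assigning batches to devices. The helper below is only a sketch of that idea in plain Scala; lamp's batch stream types are not used here.

```scala
// Split batch indices so each device receives the same number of batches,
// discarding the remainder. Indices and device counts are illustrative.
def splitBatchesEvenly(batchIndices: Vector[Int], nDevices: Int): Vector[Vector[Int]] = {
  val perDevice = batchIndices.size / nDevices
  val usable = batchIndices.take(perDevice * nDevices)
  (0 until nDevices).toVector.map(d => usable.slice(d * perDevice, (d + 1) * perDevice))
}
```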
Drives one epoch in the clique. All batch streams in all members of the clique MUST have the same number of batches, otherwise this will never terminate.