# Training

Training behavior is controlled by the `training`, `losses`, `loss_scaling`, `data.sampler`, and `data.validation` sections of the YAML config. The bundled templates, such as `nf2/cartesian/sharp_cea.yaml`, are the best starting point for complete files.

```{toctree}
:maxdepth: 1
:caption: Training Details

configuration
```

## Loss Setup

Losses are listed under `losses`. Every explicit loss needs a `type`, a stable `name`, a `weight`, and usually one or more dataset ids.

```yaml
losses:
  - type: boundary
    name: boundary
    weight: 1.0
    datasets: [boundary]
  - type: force_free
    name: force_free
    weight: 1.0e-3
    datasets: [random]
```

Use `datasets` to point a loss at boundary, sampler, or validation dataset ids. NF2 v0.4 uses `weight`; the legacy `lambda` key is rejected.

Common Cartesian losses include `boundary`, `force_free`, and `potential`. Multi-height LOS/transverse/azimuth configs often use `boundary_los_trv_azi` plus a `height` loss on the elevated boundary. Spherical configs usually combine `boundary`, `force_free`, `potential`, and sometimes `energy_gradient`.

## Loss Schedules

A loss `weight` can be a number or a schedule. Supported schedule types are `exponential`, `linear`, and `step`. If `type` is omitted, NF2 uses an exponential schedule.

```yaml
losses:
  - type: force_free
    name: force_free
    weight:
      type: exponential
      start: 1.0e-4
      end: 1.0e-2
      iterations: 50000
    datasets: [random]
  - type: potential
    name: potential
    weight:
      type: step
      steps: 5000
      start: 1.0e-4
      end: 0.0
    datasets: [random]
```

Use schedules when one objective should enter gradually or disappear after a warm-up. A common Cartesian pattern is to turn off the potential loss after the model has learned the initial large-scale structure.

## Height Scaling

Loss scaling changes how strongly selected losses contribute across height or radius. Cartesian examples use `b_height` scaling for volume losses:

```yaml
loss_scaling:
  - type: b_height
    name: b_height
    loss_ids: [force_free, potential]
```

Spherical examples use radial scaling:

```yaml
loss_scaling:
  - type: radial
    name: radial
    base_radius: 1.0
    max_radius: 1.3
    loss_ids: [force_free, potential, energy_gradient]
```

For multi-height data, set `height_mapping` on the elevated boundary and add a matching height transform:

```yaml
data:
  boundaries:
    - id: chromosphere
      type: los_trv_azi
      height_mapping: { z: 2.0, z_min: 0.0, z_max: 20.0 }
transforms:
  - type: height
    height_range: [0, 20]
    datasets: [chromosphere]
```

## Batch Sizes

Training memory is mostly controlled by dataset batch sizes. Start by reducing the largest sampler or boundary batches:

```yaml
data:
  sampler:
    type: height
    batch_size: 8192
  potential_boundary:
    type: potential
    strides: 4
    batch_size: 4096
  validation_batch_size: 4096
```

For spherical `random_radial_grouped` samplers, `batch_size` must be divisible by `n_lat_lon_sample`.

```yaml
data:
  samplers:
    - id: random
      type: random_radial_grouped
      batch_size: 8192
      n_lat_lon_sample: 64
```

Validation can use a smaller `batch_size` than training. This is useful when callbacks or metrics run out of memory even though training batches fit.

## Loader And Series Cadence

NF2 defaults to 4 PyTorch DataLoader workers. On shared filesystems or series runs with frequent DataLoader reloads, lowering validation workers often reduces transition overhead:

```yaml
data:
  num_workers: 4
  validation_num_workers: 0
  prefetch_factor: 2
```

For series runs, `data.num_workers` also controls the multiprocessing pool used to preload per-step data modules. Set `data.data_module_workers` only when that preload pool should differ from the PyTorch DataLoader worker count.

Series configs advance to a new dataset every epoch by default. The example series configs validate every 10th dataset while still saving one `.nf2` result per dataset:

```yaml
training:
  reload_dataloaders_every_n_epochs: 1
  check_val_every_n_epoch: 10
```

If preloading every series step uses too much memory, set `data.preload_data_modules: false` to load only the active step.

## Validation Resolution

Large active regions can make Cartesian validation cubes expensive. Reduce validation resolution by increasing `ds_per_pixel` on `cube` or `slices`, or by lowering the global validation density.

```yaml
data:
  validation_pixel_per_ds: 64
  validation:
    - id: cube
      type: cube
      ds_per_pixel: 0.03125
      batch_size: 4096
    - id: slices
      type: slices
      n_slices: 6
      batch_size: 4096
```

For spherical validation, reduce `sphere.resolution`, `spherical_slices.longitude_resolution`, or `spherical_slices.n_slices`.

```yaml
data:
  validation:
    - id: sphere
      type: sphere
      resolution: 128
      batch_size: 1024
    - id: slices
      type: spherical_slices
      longitude_resolution: 128
      n_slices: 5
```

## Out Of Memory Errors

When training fails with CUDA out-of-memory errors, reduce memory in this order:

1. Lower `data.sampler.batch_size` or spherical sampler `batch_size`.
2. Lower boundary dataset `batch_size`, especially high-resolution full-disk maps.
3. Reduce validation `batch_size` and validation resolution.
4. Increase `potential_boundary.strides` for Cartesian runs.
5. Reduce `model.network.hidden_dim` only after data and validation batches have been tuned.

If the error happens only during export or `nf2-metrics`, pass a smaller evaluation batch size to the command:

```bash
nf2-metrics ./runs/case/extrapolation_result.nf2 --batch_size 2048 --Mm_per_pixel 0.72 --height_range 0 80
```