Training

Training behavior is controlled by the training, losses, loss_scaling, data.sampler, and data.validation sections of the YAML config. The bundled templates, such as nf2/cartesian/sharp_cea.yaml, are the best starting point for complete files.

Loss Setup

Losses are listed under losses. Every explicit loss needs a type, a stable name, a weight, and usually one or more dataset ids.

losses:
  - type: boundary
    name: boundary
    weight: 1.0
    datasets: [boundary]
  - type: force_free
    name: force_free
    weight: 1.0e-3
    datasets: [random]

Use datasets to point a loss at boundary, sampler, or validation dataset ids. NF2 v0.4 uses weight; the legacy lambda key is rejected.
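
The required keys above can be checked before a run starts. The following is a minimal sketch, not NF2's own validator: the function name check_losses and the message wording are hypothetical, and it assumes the losses section has already been loaded into a list of plain dictionaries.

```python
# Hypothetical pre-flight check for a loaded `losses` list; not part of NF2.
REQUIRED_KEYS = ("type", "name", "weight")


def check_losses(losses):
    """Return a list of human-readable problems found in a `losses` list."""
    problems = []
    for index, loss in enumerate(losses):
        label = loss.get("name", f"losses[{index}]")
        # The docs state that the legacy `lambda` key is rejected in v0.4.
        if "lambda" in loss:
            problems.append(f"{label}: legacy 'lambda' key; use 'weight'")
        for key in REQUIRED_KEYS:
            if key not in loss:
                problems.append(f"{label}: missing '{key}'")
    return problems


config_losses = [
    {"type": "boundary", "name": "boundary", "weight": 1.0, "datasets": ["boundary"]},
    {"type": "force_free", "name": "force_free", "lambda": 1.0e-3, "datasets": ["random"]},
]
for problem in check_losses(config_losses):
    print(problem)
```

The second entry is flagged twice: once for the legacy key and once for the missing weight.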

Common Cartesian losses include boundary, force_free, and potential. Multi-height LOS/transverse/azimuth configs often use boundary_los_trv_azi plus a height loss on the elevated boundary. The older boundary_azi loss remains available for split transverse/azimuth experiments; prefer boundary_los_trv_azi for new ambiguity-aware LOS/transverse/azimuth runs. Spherical configs usually combine boundary, force_free, potential, and sometimes energy_gradient.

For boundary losses, set weights to raw per-component multipliers in dataset component order. For spherical Br/Btheta/Bphi maps, weights: [1.0, 0.1, 0.1] keeps Br at full strength and gives Btheta and Bphi one tenth of the Br penalty.
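
To illustrate what per-component multipliers do, here is a small sketch. It assumes a mean squared error per component, which may differ from the exact boundary loss NF2 computes; the function name weighted_component_loss and the sample values are hypothetical.

```python
def weighted_component_loss(predicted, observed, weights):
    """Mean squared error per component, scaled by per-component multipliers.

    Assumption: a plain squared-error penalty; NF2's boundary loss may differ.
    """
    n_points = len(predicted)
    total = 0.0
    for component, weight in enumerate(weights):
        squared_error = sum(
            (p[component] - o[component]) ** 2 for p, o in zip(predicted, observed)
        )
        total += weight * squared_error / n_points
    return total


# Two sample points with (Br, Btheta, Bphi) components; values are made up.
predicted = [(1.0, 0.0, 0.0), (0.0, 1.0, 1.0)]
observed = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]

uniform = weighted_component_loss(predicted, observed, [1.0, 1.0, 1.0])
damped = weighted_component_loss(predicted, observed, [1.0, 0.1, 0.1])
print(uniform, damped)
```

With weights of [1.0, 0.1, 0.1], the same Btheta and Bphi errors contribute one tenth as much to the total as they do with uniform weights.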

Loss Schedules

A loss weight can be a number or a schedule. Supported schedule types are exponential, linear, and step. If type is omitted, NF2 uses an exponential schedule.

losses:
  - type: force_free
    name: force_free
    weight:
      type: exponential
      start: 1.0e-4
      end: 1.0e-2
      iterations: 50000
    datasets: [random]
  - type: potential
    name: potential
    weight:
      type: step
      steps: 5000
      start: 1.0e-4
      end: 0.0
    datasets: [random]

Use schedules when one objective should enter gradually or disappear after a warm-up. A common Cartesian pattern is to turn off the potential loss after the model has learned the initial large-scale structure.
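
The sketch below shows one plausible way the three schedule types map an iteration to a weight. It is an illustration under assumptions, not NF2's implementation: the exponential case uses geometric interpolation between start and end, the step case switches at exactly `steps` iterations, and a linear schedule is assumed to use the same `iterations` key as the exponential one.

```python
def scheduled_weight(weight, iteration):
    """Evaluate a constant or scheduled loss weight at a training iteration."""
    if not isinstance(weight, dict):
        return float(weight)  # plain numeric weight
    kind = weight.get("type", "exponential")  # documented default when type is omitted
    start, end = weight["start"], weight["end"]
    if kind == "step":
        # Assumption: the weight jumps from start to end once `steps` is reached.
        return start if iteration < weight["steps"] else end
    # Fraction of the schedule completed, clamped to [0, 1].
    progress = min(max(iteration / weight["iterations"], 0.0), 1.0)
    if kind == "linear":
        return start + (end - start) * progress
    if kind == "exponential":
        # Geometric interpolation; only defined for start > 0 and end > 0.
        return start * (end / start) ** progress
    raise ValueError(f"unsupported schedule type: {kind}")


force_free_weight = {"type": "exponential", "start": 1.0e-4, "end": 1.0e-2, "iterations": 50000}
potential_weight = {"type": "step", "steps": 5000, "start": 1.0e-4, "end": 0.0}

for iteration in (0, 4999, 5000, 25000, 50000):
    print(
        iteration,
        scheduled_weight(force_free_weight, iteration),
        scheduled_weight(potential_weight, iteration),
    )
```

Under these assumptions the force_free weight grows by a constant factor per iteration, while the potential weight drops to zero once the warm-up ends.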

Height Scaling

Loss scaling changes how strongly selected losses contribute across height or radius. Cartesian examples use b_height scaling for volume losses:

loss_scaling:
  - type: b_height
    name: b_height
    loss_ids: [force_free, potential]

Spherical examples use radial scaling:

loss_scaling:
  - type: radial
    name: radial
    base_radius: 1.0
    max_radius: 1.3
    loss_ids: [force_free, potential, energy_gradient]

For multi-height data, set height_mapping on the elevated boundary and add a matching height transform:

data:
  boundaries:
    - id: chromosphere
      type: los_trv_azi
      height_mapping: { z: 2.0, z_min: 0.0, z_max: 20.0 }
transforms:
  - type: height
    height_range: [0, 20]
    datasets: [chromosphere]

Batch Sizes

Training memory is mostly controlled by dataset batch sizes. Start by reducing the largest sampler or boundary batches:

data:
  sampler:
    type: height
    batch_size: 8192
  potential_boundary:
    type: potential
    strides: 4
    batch_size: 4096
  validation_batch_size: 4096

For spherical random_radial_grouped samplers, batch_size must be divisible by n_lat_lon_sample. In the example below, 8192 / 64 = 128 with no remainder:

data:
  samplers:
    - id: random
      type: random_radial_grouped
      batch_size: 8192
      n_lat_lon_sample: 64
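
When shrinking a grouped sampler batch to save memory, the divisibility rule is easy to break. The helper below is a hypothetical convenience, not an NF2 function: it rounds a requested batch size down to the nearest multiple of n_lat_lon_sample.

```python
def largest_valid_batch_size(requested, n_lat_lon_sample):
    """Round `requested` down to the nearest positive multiple of n_lat_lon_sample."""
    if n_lat_lon_sample <= 0:
        raise ValueError("n_lat_lon_sample must be positive")
    if requested < n_lat_lon_sample:
        raise ValueError("requested batch size is smaller than n_lat_lon_sample")
    return (requested // n_lat_lon_sample) * n_lat_lon_sample


print(largest_valid_batch_size(8192, 64))  # already divisible, unchanged
print(largest_valid_batch_size(6000, 64))  # rounded down to a multiple of 64
```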

Validation can use a smaller batch_size than training. This is useful when callbacks or metrics run out of memory even though training batches fit.

Loader And Series Cadence

NF2 defaults to 4 PyTorch DataLoader workers. On shared filesystems, or in series runs that reload DataLoaders frequently, lowering validation_num_workers often reduces transition overhead:

data:
  num_workers: 4
  validation_num_workers: 0
  prefetch_factor: 2

For series runs, data.num_workers also controls the multiprocessing pool used to preload per-step data modules. Set data.data_module_workers only when that preload pool should differ from the PyTorch DataLoader worker count.

Series configs advance to a new dataset every epoch by default. The example series configs validate every 10th dataset while still saving one .nf2 result per dataset:

training:
  reload_dataloaders_every_n_epochs: 1
  check_val_every_n_epoch: 10

If preloading every series step uses too much memory, set data.preload_data_modules: false to load only the active step.
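
To see which series steps end up being validated, the cadence can be sketched as follows. This assumes that the training keys are forwarded to a PyTorch Lightning Trainer, that each epoch corresponds to one dataset, and that epochs are counted from zero so validation runs when (epoch + 1) is a multiple of check_val_every_n_epoch. The function name validated_steps is hypothetical.

```python
def validated_steps(n_steps, check_val_every_n_epoch=10):
    """Zero-based series steps (epochs) after which validation would run."""
    return [
        epoch
        for epoch in range(n_steps)
        if (epoch + 1) % check_val_every_n_epoch == 0
    ]


# A hypothetical series of 35 datasets, validated every 10th dataset.
print(validated_steps(35))
```

Under these assumptions a 35-step series is validated three times, on the 10th, 20th, and 30th datasets, while every step still produces its own .nf2 result.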

Validation Resolution

Large active regions can make Cartesian validation cubes expensive. Reduce validation resolution by increasing ds_per_pixel on the cube or slices datasets, or by lowering the global validation density (validation_pixel_per_ds in the example below).

data:
  validation_pixel_per_ds: 64
  validation:
    - id: cube
      type: cube
      ds_per_pixel: 0.03125
      batch_size: 4096
    - id: slices
      type: slices
      n_slices: 6
      batch_size: 4096
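
A rough estimate of how ds_per_pixel affects validation cost can be sketched as below. This assumes the cube is sampled on a regular grid with one point every ds_per_pixel along each axis and that the domain extent is expressed in the same ds units; both the extent values and the function names are hypothetical.

```python
import math


def cube_sample_count(extent_ds, ds_per_pixel):
    """Grid points in a cube whose side lengths are given in ds units."""
    counts = [max(1, round(length / ds_per_pixel)) for length in extent_ds]
    return math.prod(counts)


def validation_batches(extent_ds, ds_per_pixel, batch_size):
    """Number of batches needed to evaluate the whole cube once."""
    return math.ceil(cube_sample_count(extent_ds, ds_per_pixel) / batch_size)


extent = (4.0, 2.0, 1.0)  # hypothetical domain size in ds units
for spacing in (0.03125, 0.0625):
    points = cube_sample_count(extent, spacing)
    print(spacing, points, validation_batches(extent, spacing, 4096))
```

Because the cube is three-dimensional, doubling ds_per_pixel cuts the number of sample points, and therefore validation batches, by roughly a factor of eight under these assumptions.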

For spherical validation, reduce sphere.resolution, spherical_slices.longitude_resolution, or spherical_slices.n_slices.

data:
  validation:
    - id: sphere
      type: sphere
      resolution: 128
      batch_size: 1024
    - id: slices
      type: spherical_slices
      longitude_resolution: 128
      n_slices: 5

Out Of Memory Errors

When training fails with CUDA out-of-memory errors, reduce memory in this order:

1. Lower data.sampler.batch_size or the spherical sampler batch_size.
2. Lower the boundary dataset batch_size, especially for high-resolution full-disk maps.
3. Reduce the validation batch_size and validation resolution.
4. Increase potential_boundary.strides for Cartesian runs.
5. Reduce model.network.hidden_dim only after data and validation batches have been tuned.

If the error happens only during export or nf2-metrics, pass a smaller evaluation batch size to the command:

nf2-metrics ./runs/case/extrapolation_result.nf2 --batch_size 2048 --Mm_per_pixel 0.72 --height_range 0 80