Troubleshooting

This page lists common problems encountered when using the heterodyne package and their recommended solutions.

NLSQ Convergence Failure

Symptom: result.success is False; the solver reports “Maximum number of function evaluations reached” or “Cost function not decreasing.”

Possible causes and remedies:

Poor initial guess – Use multi-start optimisation (n_starts=20 or more) to explore the parameter space from diverse starting points.
Multi-modal landscape – Switch to CMA-ES for global search, then refine with NLSQ. See CMA-ES Global Optimisation.
Tight bounds – Widen parameter bounds in the configuration. Check whether any parameter sits at a bound at the solution; result.validate() may flag this, but compare the fitted values against the bounds directly as well.
Ill-conditioned Jacobian – Check the condition number of the Jacobian. If it is extremely large, consider fixing one or more weakly constrained parameters (e.g., D_offset_ref, f2).
Insufficient function evaluations – Increase max_nfev in NLSQConfig.
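
The condition-number check above can be sketched with plain NumPy. This is a minimal illustration, not the package's own diagnostic: the helper name jacobian_diagnostics, the 0.3 weight threshold, and the synthetic Jacobian are all assumptions, and how you obtain the Jacobian from a heterodyne fit result is not covered here.

```python
import numpy as np

def jacobian_diagnostics(jac, names, weight_threshold=0.3):
    """Return the condition number of `jac` and the parameters that
    carry most of the weight in its weakest singular direction.

    Hypothetical helper: `jac` is any (n_residuals, n_params) array.
    """
    jac = np.asarray(jac, dtype=float)
    # Scale columns to unit norm so parameter units do not dominate.
    norms = np.linalg.norm(jac, axis=0)
    scaled = jac / np.where(norms > 0, norms, 1.0)
    # Singular values come back in descending order.
    _, s, vt = np.linalg.svd(scaled, full_matrices=False)
    cond = s[0] / s[-1] if s[-1] > 0 else np.inf
    # The last right-singular vector is the least constrained direction.
    involved = [n for n, w in zip(names, np.abs(vt[-1])) if w > weight_threshold]
    return float(cond), involved

# Synthetic Jacobian (assumption): the third column nearly duplicates
# the first, so those two parameters are almost degenerate.
rng = np.random.default_rng(0)
a = rng.normal(size=(200, 1))
b = rng.normal(size=(200, 1))
jac = np.hstack([a, b, a + 1e-6 * rng.normal(size=(200, 1))])

cond, involved = jacobian_diagnostics(jac, ["D0", "v0", "D_offset_ref"])
print(f"condition number ~ {cond:.1e}; weakly constrained: {involved}")
```

Parameters that appear together in the weakest direction are candidates for fixing one of them and re-fitting.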

CMC Divergent Transitions

Symptom: NumPyro reports divergent transitions during NUTS sampling; result.convergence_passed is False.

Remedies:

Increase target acceptance probability – Set target_accept_prob=0.95 in CMCConfig. This reduces the step size, improving sampling in regions of high curvature.
Check priors – Overly wide priors can send the sampler into unphysical regions. Reduce nlsq_prior_width_factor from 5.0 to 3.0.
Tighten bounds – Ensure parameter bounds exclude regions where the model is undefined or numerically unstable.
Increase warmup – Raise num_warmup. More warmup iterations allow the sampler to adapt its step size and mass matrix more thoroughly.

Memory Errors

Symptom: MemoryError or the process is killed by the OOM killer.

Remedies:

Switch to the chunked or sequential strategy for NLSQ:

```
config = NLSQConfig(strategy="chunked", chunk_size=128)
```

Trim frame range – Load only the frames you need:

```
data = loader.load(frame_start=0, frame_end=500)
```

Increase CMC shards – More shards mean less memory per shard:
```
cmc_config = CMCConfig(num_shards=16)
```
Check for memory leaks – If memory grows across multiple fits, ensure you are not accumulating JAX arrays in a loop without releasing references.
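
To see why trimming the frame range helps so much, a rough size estimate is useful. The sketch below assumes the correlation data is stored as a dense n_frames × n_frames matrix per angle in 64-bit floats; that layout, the helper name c2_memory_gib, and the frame counts are illustrative assumptions rather than facts about the package's internals.

```python
def c2_memory_gib(n_frames, n_angles=1, bytes_per_value=8):
    """Approximate size in GiB of dense two-time correlation data.

    Assumes one (n_frames x n_frames) matrix per angle (hypothetical layout).
    """
    return n_angles * n_frames * n_frames * bytes_per_value / 2**30

# Hypothetical dataset: 3 angles, 4000 frames, trimmed to 500 frames.
full = c2_memory_gib(4000, n_angles=3)
trimmed = c2_memory_gib(500, n_angles=3)
print(f"full: {full:.3f} GiB, trimmed: {trimmed:.4f} GiB, "
      f"ratio: {full / trimmed:.0f}x")
```

Because the size grows with the square of the frame count, cutting the frames by a factor of 8 cuts the data by a factor of 64. Intermediate arrays created during fitting add to this, so treat the number as a lower bound.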

JAX Compilation Slow

Symptom: The first NLSQ call takes minutes before any fitting begins.

Causes:

Large array shapes – JIT compilation time scales with the complexity of the computation graph. For very large \(C_2\) matrices, use the chunked or sequential strategy to avoid compiling a single monolithic kernel.
Inconsistent shapes – JAX recompiles whenever input shapes change. Ensure all angles use the same number of frames, or pad to a common size.
Thread contention – Verify OMP_NUM_THREADS is set appropriately. Over-subscription can slow compilation.
XLA flags not set – Run heterodyne-config-xla to configure optimal compiler flags.
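
Padding to a common size, as suggested for inconsistent shapes, might look like the following. This is a sketch under assumptions: the helper pad_to_common_shape is hypothetical, the per-angle matrices are taken to be square, and whether the fitting routines accept a mask is not confirmed here, so padded entries may need to be excluded from the cost in some other way.

```python
import numpy as np

def pad_to_common_shape(matrices, fill_value=0.0):
    """Pad square matrices to the largest size; return stacked data and masks.

    The mask is True where the entry holds real data, False where padded.
    """
    n_max = max(m.shape[0] for m in matrices)
    padded, masks = [], []
    for m in matrices:
        extra = n_max - m.shape[0]
        # Pad only at the end of each axis so original indices are preserved.
        padded.append(
            np.pad(m, ((0, extra), (0, extra)), constant_values=fill_value)
        )
        mask = np.zeros((n_max, n_max), dtype=bool)
        mask[: m.shape[0], : m.shape[1]] = True
        masks.append(mask)
    return np.stack(padded), np.stack(masks)

# Three angles with different frame counts (illustrative sizes).
c2_per_angle = [np.ones((4, 4)), np.ones((6, 6)), np.ones((5, 5))]
data, mask = pad_to_common_shape(c2_per_angle)
print(data.shape, int(mask.sum()))
```

With every angle sharing one shape, a jitted function is traced once instead of once per distinct shape.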

Parameter at Bounds

Symptom: A fitted parameter sits exactly at its lower or upper bound. result.validate() may not flag this directly, but the uncertainties reported for that parameter will be unreliable.

Remedies:

Widen bounds – If the physical range permits, move the affected bound outward.
Check initial values – A poor starting point near a bound can trap the optimiser.
Fix the parameter – If the data cannot constrain a parameter, fix it to a physically motivated value and re-fit.
Inspect the residuals – Parameter-at-bound may indicate a model mismatch rather than a bound problem.
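
A direct check for parameters sitting on a bound is easy to write by hand. The sketch below is independent of the package: the helper parameters_at_bounds, the dictionary layout, the tolerance, and the example values are all hypothetical.

```python
import math

def parameters_at_bounds(values, bounds, rel_tol=1e-6):
    """Return {name: "lower" | "upper"} for parameters sitting on a bound.

    `values` maps name -> fitted value; `bounds` maps name -> (lower, upper).
    """
    flagged = {}
    for name, value in values.items():
        lower, upper = bounds[name]
        width = upper - lower
        # Scale the tolerance with the bound width; fall back to an
        # absolute tolerance when a bound is infinite.
        tol = rel_tol * width if math.isfinite(width) else rel_tol
        if value - lower <= tol:
            flagged[name] = "lower"
        elif upper - value <= tol:
            flagged[name] = "upper"
    return flagged

# Illustrative fit result: D0 is pinned at its upper bound, f0 at its lower.
fitted = {"D0": 1.0e4, "v0": 0.37, "f0": 0.0}
bounds = {"D0": (1.0, 1.0e4), "v0": (0.0, 10.0), "f0": (0.0, 1.0)}
flagged = parameters_at_bounds(fitted, bounds)
print(flagged)
```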

Poor R-hat After CMC

Symptom: \(\hat{R} > 1.1\) for one or more parameters.

Remedies:

Run longer – Increase num_warmup and num_samples.
Increase chains – More chains provide better \(\hat{R}\) estimates: num_chains=6 or num_chains=8.
Check for bimodality – Use plot_shard_comparison(shard_results) to see if shards converge to different modes.
Improve warm-start – A better NLSQ solution as the CMC warm-start helps chains explore the correct region faster.
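
For intuition about what \(\hat{R}\) measures, here is a minimal NumPy sketch of the split-chain Gelman-Rubin statistic for a single parameter. It is written for illustration only and is not the diagnostic the package or NumPyro computes internally; the function name split_rhat and the synthetic chains are assumptions.

```python
import numpy as np

def split_rhat(samples):
    """Split R-hat for one parameter; `samples` has shape (n_chains, n_draws)."""
    samples = np.asarray(samples, dtype=float)
    n_draws = samples.shape[1]
    half = n_draws // 2
    # Treat each half-chain as its own chain so drift within a chain shows up.
    halves = np.concatenate(
        [samples[:, :half], samples[:, half : 2 * half]], axis=0
    )
    chain_means = halves.mean(axis=1)
    within = halves.var(axis=1, ddof=1).mean()       # W: mean within-chain variance
    between_over_n = chain_means.var(ddof=1)         # B / n: variance of chain means
    var_plus = (half - 1) / half * within + between_over_n
    return float(np.sqrt(var_plus / within))

rng = np.random.default_rng(1)
mixed = rng.normal(size=(4, 1000))                   # chains agree
# Two chains stuck in a second mode, as in the bimodal case above.
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])

rhat_mixed = split_rhat(mixed)
rhat_stuck = split_rhat(stuck)
print(f"well mixed: {rhat_mixed:.3f}, two modes: {rhat_stuck:.3f}")
```

When chains settle in different modes, the between-chain term dominates and \(\hat{R}\) rises well above 1.1; running longer does not help in that case, which is why checking for bimodality is listed separately.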

NaN or Inf in Results

Symptom: Fitted parameters contain NaN or Inf.

Causes:

NaN in input data – Check np.any(np.isnan(c2_data)); np.all(np.isfinite(c2_data)) also catches Inf. The loader’s validation should catch this, but preprocessed data may slip through.
Numerical overflow – Very large D0 or v0 values combined with long time spans can cause overflow in the exponential. Tighten bounds.
Division by zero – If the fraction function reaches exactly 0 or 1, some terms may become degenerate. Check f0 and f3 bounds.
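
A quick pre-fit check of the input data can be sketched as follows. The helper summarize_nonfinite and the example array are hypothetical; only the NumPy calls are standard.

```python
import numpy as np

def summarize_nonfinite(c2_data):
    """Count NaN and Inf entries and report the fraction of finite values."""
    arr = np.asarray(c2_data, dtype=float)
    return {
        "nan": int(np.isnan(arr).sum()),
        "inf": int(np.isinf(arr).sum()),
        "finite_fraction": float(np.isfinite(arr).mean()),
    }

# Illustrative data: two angles, 4 x 4 frames, with one NaN and one Inf.
c2_data = np.ones((2, 4, 4))
c2_data[0, 1, 2] = np.nan
c2_data[1, 0, 0] = np.inf

report = summarize_nonfinite(c2_data)
print(report)
```

If the report shows any non-finite entries, fix or mask them before fitting; otherwise they can propagate into every fitted parameter.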