EMOD3D checkpointing

To resume EMOD3D runs that ran too long, or were cancelled part way through, we need to enable checkpointing.

Checkpointing with EMOD3D requires that the variable 'enable_restart' is set to 1 and that the variable 'restart_itinc' is set to the number of steps between checkpoints.

At each multiple of this number of steps, the calculations completed so far are saved to the LF/Restart folder.

To resume a checkpointed run, the variable 'read_restart' must be set to 1 and the LF/Restart folder must contain the data from the previous run.
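As a rough illustration of the settings described above, the following Python sketch builds the checkpoint-related parameters and checks that restart data is present before resuming. Only 'enable_restart', 'restart_itinc', 'read_restart' and the LF/Restart folder come from this page. The function names, the 'name=value' text layout, and the choice to keep checkpointing enabled while resuming are assumptions made for the example.

```python
from pathlib import Path


def checkpoint_params(restart_itinc, resume=False):
    """Return the checkpoint-related EMOD3D variables as a dict.

    restart_itinc: number of steps between checkpoints.
    resume: if True, also request that restart data is read.
    """
    if restart_itinc < 1:
        raise ValueError("restart_itinc must be at least 1")
    params = {"enable_restart": 1, "restart_itinc": int(restart_itinc)}
    if resume:
        # Assumption: checkpointing stays enabled on a resumed run.
        params["read_restart"] = 1
    return params


def format_params(params):
    """Render parameters as 'name=value' lines (assumed layout)."""
    return "\n".join(f"{name}={value}" for name, value in params.items())


def restart_data_present(lf_dir):
    """True if <lf_dir>/Restart exists and contains at least one entry."""
    restart_dir = Path(lf_dir) / "Restart"
    return restart_dir.is_dir() and any(restart_dir.iterdir())
```

A resuming job would typically call restart_data_present("LF") first and fall back to a fresh run if it returns False.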

Implementation

We wish to checkpoint approximately every 10 minutes.

To do this, we use the estimated wall clock time of the run to estimate the number of steps calculated in 10 minutes, and set 'restart_itinc' to this value.
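The calculation above can be sketched as follows. This is a minimal example, not the actual implementation: the function name and its arguments are hypothetical, and it assumes that the total number of steps and the estimated wall clock time are both known before the run starts and that steps take roughly equal time.

```python
def estimate_restart_itinc(total_steps, est_wall_clock_s, interval_s=600.0):
    """Estimate the number of steps between checkpoints.

    total_steps: total number of time steps in the run.
    est_wall_clock_s: estimated wall clock time of the whole run, in seconds.
    interval_s: desired time between checkpoints (default 10 minutes).
    """
    if total_steps < 1 or est_wall_clock_s <= 0 or interval_s <= 0:
        raise ValueError("all inputs must be positive")
    # Assumes every step takes about the same time.
    steps_per_second = total_steps / est_wall_clock_s
    itinc = int(round(steps_per_second * interval_s))
    # Keep the result within 1..total_steps.
    return max(1, min(itinc, total_steps))


# Example: 20000 steps estimated to take one hour.
print(estimate_restart_itinc(20000, 3600))  # 3333
```

Because the wall clock time is itself an estimate, the actual interval between checkpoints will only be approximately 10 minutes.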

Impact on performance

An EMOD3D run with checkpointing roughly every 40 seconds was compared with a run without checkpointing.

Checkpointing increased the run time from 5:53 to 5:54 (minutes:seconds), so the impact is considered insignificant.
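For reference, the relative overhead implied by these two timings can be worked out as below. The helper name is hypothetical. Since the times are given to the nearest second, a one second difference is at the limit of the measurement resolution.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


baseline_s = to_seconds("5:53")      # run without checkpointing
checkpointed_s = to_seconds("5:54")  # run with checkpointing

# Relative increase in run time caused by checkpointing, in percent.
overhead_pct = 100.0 * (checkpointed_s - baseline_s) / baseline_s
print(f"{overhead_pct:.2f}%")  # 0.28%
```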

