Add log_train_loss_on_step toggle to EasySyntax#886
Conversation
Adds an opt-in `log_train_loss_on_step` constructor argument that, when enabled, logs the per-batch training loss under `train_loss_step` in addition to the epoch-aggregated `train_loss`. Default is False so existing behavior is unchanged.
|
I will look at this. |
| scheduler_class: Optional[type] = None, | ||
| scheduler_kwargs: Optional[Dict] = None, | ||
| scheduler_config: Optional[Dict] = None, | ||
| log_train_loss_on_step: bool = False, |
There was a problem hiding this comment.
The variable name could be renamed to also_log_train_loss_per_step. This would immediately clarify, that it is an additional option for logging the per-batch loss under a different key.
| log_train_loss_on_step: bool = False, | |
| also_log_train_loss_per_step: bool = False, |
It could be also useful to add a Docstring explaining the arguments in __init__(), but especially for also_log_train_loss_per_step.
"""
Args:
also_log_train_loss_per_step:
If `True`, logs an additional per-batch metric (`train_loss_step`)
alongside the existing per-epoch metric (`train_loss`). This can
be useful for debugging training instabilities or monitoring
convergence within long epochs.
"""| self._scheduler_class = scheduler_class | ||
| self._scheduler_kwargs = scheduler_kwargs or dict() | ||
| self._scheduler_config = scheduler_config or dict() | ||
| self._log_train_loss_on_step = log_train_loss_on_step |
There was a problem hiding this comment.
| self._log_train_loss_on_step = log_train_loss_on_step | |
| self._also_log_train_loss_per_step = also_log_train_loss_per_step |
| if self._log_train_loss_on_step: | ||
| self.log( | ||
| "train_loss_step", | ||
| loss, | ||
| batch_size=batch_size, | ||
| prog_bar=False, | ||
| on_epoch=False, | ||
| on_step=True, | ||
| sync_dist=True, |
There was a problem hiding this comment.
It might be computationally expensive, if sync_dist=True.
There would be syncing across GPUs on every batch, which quickly adds up for high batch number. It should be maybe clarified in the Docstring at the top, that the training might be slowed down. The default of this option could also be set to sync_dist=False.
| if self._log_train_loss_on_step: | |
| self.log( | |
| "train_loss_step", | |
| loss, | |
| batch_size=batch_size, | |
| prog_bar=False, | |
| on_epoch=False, | |
| on_step=True, | |
| sync_dist=True, | |
| if self._also_log_train_loss_on_step: | |
| self.log( | |
| "train_loss_step", | |
| loss, | |
| batch_size=batch_size, | |
| prog_bar=False, | |
| on_epoch=False, | |
| on_step=True, | |
| sync_dist=True, |
There was a problem hiding this comment.
Good point! Let's set sync_dist to false as the default
Co-authored-by: Christian Locatelli <97306084+christianlocatelli@users.noreply.github.com>
christianlocatelli
left a comment
There was a problem hiding this comment.
I left some comments about optional naming and doc improvements.
The previous commit ("Apply suggestions from code review") was created via
GitHub's batch-suggestion apply, which mangled the indentation and left a
name mismatch, so the module no longer imported:
- under-indented `also_log_train_loss_per_step` parameter and attribute
- top-level `if self._also_log_train_loss_on_step:` referencing an attribute
that is never set (`_on_step` vs `_per_step`)
Re-apply the reviewer's intent cleanly: rename to `also_log_train_loss_per_step`,
log the per-step metric with `sync_dist=False`, and document all `__init__`
arguments.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Hey @christianlocatelli, I implemented your suggestions. Sorry for the commit name, that was claude 😅 |
christianlocatelli
left a comment
There was a problem hiding this comment.
This looks good to me, thanks for editing the code 👍
|
@Aske-Rosted tagging you here for completeness (and a potential approval 😅 ) |
Summary
log_train_loss_on_stepflag toEasySyntaxthat logs a per-steptrain_loss_stepmetric in addition to the existing epoch-aggregatedtrain_loss.False, so existing logging behavior is unchanged.addresses #882