Add log_train_loss_on_step toggle to EasySyntax #886

Open
sevmag wants to merge 3 commits into graphnet-team:main from sevmag:feature/log-train-loss-on-step

@sevmag (Collaborator) commented May 4, 2026

Summary

  • Adds an opt-in log_train_loss_on_step flag to EasySyntax that logs a per-step train_loss_step metric in addition to the existing epoch-aggregated train_loss.
  • Default is False, so existing logging behavior is unchanged.

addresses #882

Adds an opt-in `log_train_loss_on_step` constructor argument that, when
enabled, logs the per-batch training loss under `train_loss_step` in
addition to the epoch-aggregated `train_loss`. Default is False so
existing behavior is unchanged.
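The opt-in behavior described above can be sketched without any training framework. The following is a minimal illustration, not the actual `EasySyntax` implementation: `LoggingSketch` and its recording `log` method are hypothetical stand-ins, and the keyword arguments on the existing `train_loss` call are assumptions.

```python
from typing import Any, Dict, List, Tuple


class LoggingSketch:
    """Hypothetical stand-in for a training module; NOT the EasySyntax class."""

    def __init__(self, log_train_loss_on_step: bool = False) -> None:
        self._log_train_loss_on_step = log_train_loss_on_step
        # Records (name, value, kwargs) instead of forwarding to a real logger.
        self.logged: List[Tuple[str, float, Dict[str, Any]]] = []

    def log(self, name: str, value: float, **kwargs: Any) -> None:
        """Recorder that only mimics the shape of a `self.log(...)` call."""
        self.logged.append((name, value, kwargs))

    def training_step(self, loss: float, batch_size: int) -> float:
        # Existing metric, always logged (kwargs here are assumed).
        self.log(
            "train_loss", loss, batch_size=batch_size,
            on_epoch=True, on_step=False,
        )
        # Additional per-step metric, logged only when the flag is enabled.
        if self._log_train_loss_on_step:
            self.log(
                "train_loss_step", loss, batch_size=batch_size,
                on_epoch=False, on_step=True,
            )
        return loss


default = LoggingSketch()
default.training_step(0.5, batch_size=32)
opt_in = LoggingSketch(log_train_loss_on_step=True)
opt_in.training_step(0.5, batch_size=32)

default_keys = [name for name, _, _ in default.logged]
opt_in_keys = [name for name, _, _ in opt_in.logged]
```

With the flag left at its default, only `train_loss` is recorded; enabling it adds `train_loss_step` alongside it.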

@christianlocatelli (Collaborator) commented

I will look at this.

@christianlocatelli christianlocatelli self-requested a review May 12, 2026 18:33
Comment thread src/graphnet/models/easy_model.py Outdated
scheduler_class: Optional[type] = None,
scheduler_kwargs: Optional[Dict] = None,
scheduler_config: Optional[Dict] = None,
log_train_loss_on_step: bool = False,

@christianlocatelli (Collaborator) commented May 26, 2026

The variable could be renamed to also_log_train_loss_per_step. This would immediately clarify that it is an additional option for logging the per-batch loss under a different key.

Suggested change
- log_train_loss_on_step: bool = False,
+ also_log_train_loss_per_step: bool = False,

It could also be useful to add a docstring explaining the arguments of __init__(), especially also_log_train_loss_per_step.

    """
    Args:
        also_log_train_loss_per_step:
            If `True`, logs an additional per-batch metric (`train_loss_step`)
            alongside the existing per-epoch metric (`train_loss`). This can
            be useful for debugging training instabilities or monitoring
            convergence within long epochs.
    """

Comment thread src/graphnet/models/easy_model.py Outdated
self._scheduler_class = scheduler_class
self._scheduler_kwargs = scheduler_kwargs or dict()
self._scheduler_config = scheduler_config or dict()
self._log_train_loss_on_step = log_train_loss_on_step

@christianlocatelli (Collaborator) commented May 26, 2026

Suggested change
- self._log_train_loss_on_step = log_train_loss_on_step
+ self._also_log_train_loss_per_step = also_log_train_loss_per_step

Comment thread src/graphnet/models/easy_model.py Outdated
Comment on lines +258 to +266
if self._log_train_loss_on_step:
    self.log(
        "train_loss_step",
        loss,
        batch_size=batch_size,
        prog_bar=False,
        on_epoch=False,
        on_step=True,
        sync_dist=True,

@christianlocatelli (Collaborator) commented May 26, 2026

Using sync_dist=True here could be computationally expensive.
The loss would be synchronized across GPUs on every batch, which quickly adds up when there are many batches. The docstring at the top should perhaps note that training might be slowed down. Alternatively, this option could default to sync_dist=False.

Suggested change
- if self._log_train_loss_on_step:
-     self.log(
-         "train_loss_step",
-         loss,
-         batch_size=batch_size,
-         prog_bar=False,
-         on_epoch=False,
-         on_step=True,
-         sync_dist=True,
+ if self._also_log_train_loss_on_step:
+     self.log(
+         "train_loss_step",
+         loss,
+         batch_size=batch_size,
+         prog_bar=False,
+         on_epoch=False,
+         on_step=True,
+         sync_dist=True,
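To get a rough feel for the synchronization concern raised in this comment, the number of cross-device reductions can be estimated with simple arithmetic. This is a back-of-envelope sketch only: it assumes one reduction per logged batch when `sync_dist=True` and none otherwise, and the helper name and example numbers are hypothetical. The real cost per reduction depends on the backend, interconnect, and number of devices.

```python
import math


def per_step_sync_calls(
    num_samples: int, batch_size: int, num_epochs: int, sync_dist: bool
) -> int:
    """Estimate how many per-step reductions a training run would trigger.

    Assumption: exactly one reduction per logged batch when `sync_dist`
    is enabled, and the last partial batch of each epoch is kept.
    """
    if not sync_dist:
        return 0
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs


# Hypothetical run: 1,000,000 events, batch size 128, 10 epochs.
with_sync = per_step_sync_calls(1_000_000, 128, 10, sync_dist=True)
without_sync = per_step_sync_calls(1_000_000, 128, 10, sync_dist=False)
```

Under these assumptions the example run would perform 78,130 extra reductions with `sync_dist=True`, compared with a single reduction per epoch for a metric that is only aggregated at epoch end.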

@sevmag (Collaborator, Author) replied

Good point! Let's make sync_dist=False the default.

Co-authored-by: Christian Locatelli <97306084+christianlocatelli@users.noreply.github.com>

@christianlocatelli (Collaborator) left a comment

I left some comments about optional naming and doc improvements.

The previous commit ("Apply suggestions from code review") was created via
GitHub's batch-suggestion apply, which mangled the indentation and left a
name mismatch, so the module no longer imported:
- under-indented `also_log_train_loss_per_step` parameter and attribute
- top-level `if self._also_log_train_loss_on_step:` referencing an attribute
  that is never set (`_on_step` vs `_per_step`)

Re-apply the reviewer's intent cleanly: rename to `also_log_train_loss_per_step`,
log the per-step metric with `sync_dist=False`, and document all `__init__`
arguments.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@sevmag (Collaborator, Author) commented May 27, 2026

Hey @christianlocatelli, I implemented your suggestions. Sorry for the commit name, that was claude 😅

@sevmag sevmag requested a review from christianlocatelli May 27, 2026 01:59

@christianlocatelli (Collaborator) left a comment

This looks good to me, thanks for editing the code 👍

@sevmag (Collaborator, Author) commented May 27, 2026

@Aske-Rosted tagging you here for completeness (and a potential approval 😅 )
