YaRN by rrutmann · Pull Request #445 · Modalities/modalities

rrutmann · 2026-05-11T14:04:11Z

What does this PR do?

This PR adds YaRN support to rotary position embeddings in the GPT-2 attention path.

General Changes

Implemented YaRN parameterization in rotary embeddings in gpt2_model.py
Added/updated YaRN configuration in config_lorem_ipsum_long_fsdp2_yarn.yaml
Refactored and strengthened rotary tests in test_rotary_qkv_transform.py
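For readers unfamiliar with YaRN, the following is a minimal sketch of how the rotary inverse frequencies are commonly blended under YaRN. It is not the code from this PR. The function name `yarn_inv_freq` and the parameters `factor` and `original_max_pos` are assumptions for illustration; `beta_fast` and `beta_slow` use the defaults visible in the reviewed diff. Plain Python lists are used instead of torch tensors so the sketch stays self-contained.

```python
import math


def yarn_inv_freq(
    dim: int,
    base: float,
    factor: float,
    original_max_pos: int,
    beta_fast: float = 32.0,
    beta_slow: float = 1.0,
) -> list[float]:
    """Sketch of YaRN inverse frequencies (hypothetical helper, not the PR's code)."""

    def correction_dim(num_rotations: float) -> float:
        # Dimension index at which a rotary pair completes `num_rotations`
        # full rotations over the original context length.
        return dim * math.log(original_max_pos / (num_rotations * 2 * math.pi)) / (2 * math.log(base))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim - 1)
    if low == high:
        high += 0.001  # avoid division by zero in the ramp

    inv_freq = []
    for i in range(dim // 2):
        extrapolated = base ** (-2 * i / dim)  # unscaled RoPE frequency
        interpolated = extrapolated / factor   # position-interpolated frequency
        # Linear ramp: 0 keeps the original frequency, 1 uses the interpolated one.
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        inv_freq.append(interpolated * ramp + extrapolated * (1.0 - ramp))
    return inv_freq


freqs = yarn_inv_freq(dim=64, base=10000.0, factor=4.0, original_max_pos=2048)
```

High-frequency pairs (small index) keep their original frequency, low-frequency pairs (large index) are divided by `factor`, and the pairs in between are mixed linearly.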

Breaking Changes

..

Checklist before submitting final PR

My PR is minimal and addresses one issue in isolation
I have merged the latest version of the target branch into this feature branch
I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
I have run a sample config for model training
I have checked that all tests run through (python tests/tests.py)
I have updated the internal changelog (CHANGELOG_DEV.md)

Co-authored-by: Copilot <copilot@github.com>

therealdavidos

looks good from the math perspective!

therealdavidos · 2026-05-18T15:32:25Z

is this change related to the yarn PR?

BlueCrescent · 2026-06-01T06:28:26Z


        self.reset_parameters()

+    def _compute_yarn_parameters(self, device: torch.device | None) -> tuple[torch.Tensor, float]:


Please place private methods below the public interface of the class.

BlueCrescent · 2026-06-01T08:37:26Z

            seq_length_dim: Annotated[int, Field(strict=True)]
            base_freq: Annotated[int, Field(strict=True, ge=10000)]
+            max_position_embeddings: Optional[Annotated[int, Field(strict=True, ge=1)]] = None
+            rope_scaling: Optional[dict[str, object]] = None


Does this play nicely with our config setup? Would it be possible to have something like "rope_scaling: YarnSettings | DefaultSettingsIfExists | SomeFutureRopeScalingSettings | None = None" with the Settings being BaseModels themselves?
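One way the suggestion above could look is sketched below, assuming pydantic v2. The class names `YarnSettings` and `RotaryConfigSketch`, and the fields `rope_type`, `factor` and `original_max_position_embeddings`, are hypothetical and not taken from the PR; only `base_freq`, `max_position_embeddings`, `rope_scaling`, `beta_fast` and `beta_slow` appear in the diff.

```python
from typing import Annotated, Literal, Optional

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class YarnSettings(BaseModel):
    """Hypothetical typed replacement for the untyped `rope_scaling` dict."""

    model_config = ConfigDict(extra="forbid")  # reject misspelled keys

    rope_type: Literal["yarn"] = "yarn"
    factor: Annotated[float, Field(strict=True, ge=1.0)]
    original_max_position_embeddings: Annotated[int, Field(strict=True, ge=1)]
    beta_fast: Annotated[float, Field(strict=True, gt=0.0)] = 32.0
    beta_slow: Annotated[float, Field(strict=True, gt=0.0)] = 1.0


class RotaryConfigSketch(BaseModel):
    """Reduced stand-in for the real config class, for illustration only."""

    base_freq: Annotated[int, Field(strict=True, ge=10000)]
    max_position_embeddings: Optional[Annotated[int, Field(strict=True, ge=1)]] = None
    # Further settings models could be added here as a union later on.
    rope_scaling: Optional[YarnSettings] = None


cfg = RotaryConfigSketch(
    base_freq=10000,
    rope_scaling={"factor": 4.0, "original_max_position_embeddings": 2048},
)

try:
    # A string where a float is expected is rejected instead of being dropped.
    RotaryConfigSketch(
        base_freq=10000,
        rope_scaling={"factor": 4.0, "original_max_position_embeddings": 2048, "beta_fast": "32"},
    )
    rejected = False
except ValidationError:
    rejected = True
```

With strict fields, wrongly typed values fail at config load time, which would also address the concern about silently dropped parameters.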

BlueCrescent · 2026-06-01T08:46:07Z

+        beta_fast_raw = self.rope_scaling.get("beta_fast")
+        beta_slow_raw = self.rope_scaling.get("beta_slow")
+        beta_fast = float(beta_fast_raw) if isinstance(beta_fast_raw, (int, float)) else 32.0
+        beta_slow = float(beta_slow_raw) if isinstance(beta_slow_raw, (int, float)) else 1.0


I'm a bit worried that if these parameters arrive as strings or torch types for some reason, they will be silently dropped here and replaced by the defaults.
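A minimal sketch of a stricter alternative, assuming the dict-based `rope_scaling` is kept: fall back to the default only when the key is absent, and raise on any non-numeric value. The helper name `read_float` is hypothetical and not part of the PR.

```python
def read_float(settings: dict[str, object], key: str, default: float) -> float:
    """Return settings[key] as a float; use `default` only if the key is missing or None."""
    value = settings.get(key)
    if value is None:
        return default
    # bool is a subclass of int, so it is excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"rope_scaling[{key!r}] must be a number, got {type(value).__name__}")
    return float(value)


rope_scaling = {"beta_fast": 16}
beta_fast = read_float(rope_scaling, "beta_fast", 32.0)  # taken from the dict
beta_slow = read_float(rope_scaling, "beta_slow", 1.0)   # key absent, default used
```

A string such as `"32"` then raises a `TypeError` instead of quietly becoming `32.0` via the default.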

BlueCrescent · 2026-06-01T08:57:40Z

+            return 0.1 * mscale * math.log(scale) + 1.0
+
+        if attention_factor is None:
+            if isinstance(mscale, (int, float)) and isinstance(mscale_all_dim, (int, float)):


I'm a bit worried that if these parameters arrive as strings or torch types for some reason, they will be silently ignored here.
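The same concern could be handled for the attention factor by validating first and computing afterwards. The sketch below is an assumption modelled on the commonly used YaRN formulation; only the `0.1 * mscale * math.log(scale) + 1.0` expression and the parameter names `mscale`, `mscale_all_dim` and `attention_factor` are visible in the diff. The function names and the `scale <= 1` branch are illustrative.

```python
import math
from typing import Optional


def get_mscale(scale: float, mscale: float = 1.0) -> float:
    # No context extension means no attention rescaling.
    if scale <= 1.0:
        return 1.0
    return 0.1 * mscale * math.log(scale) + 1.0


def compute_attention_factor(
    factor: float,
    mscale: Optional[float] = None,
    mscale_all_dim: Optional[float] = None,
) -> float:
    """Hypothetical helper: raise on wrongly typed inputs instead of skipping them."""
    for name, value in (("mscale", mscale), ("mscale_all_dim", mscale_all_dim)):
        if value is None:
            continue
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"{name} must be a number, got {type(value).__name__}")
    if mscale is not None and mscale_all_dim is not None:
        return get_mscale(factor, mscale) / get_mscale(factor, mscale_all_dim)
    return get_mscale(factor)


default_factor = compute_attention_factor(4.0)
ratio_factor = compute_attention_factor(4.0, mscale=1.0, mscale_all_dim=1.0)
```

Here a missing value selects the default formula, while a present but wrongly typed value is reported to the caller.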

rrutmann and others added 5 commits May 11, 2026 12:49

feat: Implement context extension with yarn

6524b24

Co-authored-by: Copilot <copilot@github.com>

test: Add test for yarn

f87eabb

Co-authored-by: Copilot <copilot@github.com>

docs: Add type annotations

779e7c1

Co-authored-by: Copilot <copilot@github.com>

docs: Add docstrings

2126b0b

Co-authored-by: Copilot <copilot@github.com>

fix: Write to unique filenames

309d147

Co-authored-by: Copilot <copilot@github.com>

rrutmann requested a review from le1nux May 12, 2026 12:28

rrutmann self-assigned this May 12, 2026

le1nux requested a review from BlueCrescent May 13, 2026 10:06

rrutmann requested review from therealdavidos and removed request for BlueCrescent May 20, 2026 08:32

chore: Apply black formatter

a06d6b4

therealdavidos reviewed Jun 1, 2026

Comment thread tests/fsdp2_parallelization/test_tensor_parallelism.py

BlueCrescent reviewed Jun 1, 2026

rrutmann wants to merge 6 commits into main from yarn_hf


3 participants

