fix(sync): modify dirty_state and refactor the sync/syncfs system calls by oeasy1412 · Pull Request #1934 · DragonOS-Community/DragonOS

oeasy1412 · 2026-05-27T08:17:01Z

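1. Implement ext4 dirty inode tracking and deferred metadata flushing

Add an Ext4FileSystem::dirty_inodes list. After write_at completes a write, the inode is added to the dirty list (mark_inode_dirty), and flush_dirty_inodes flushes the whole list in one pass.
Weak<LockedExt4Inode> is used so that the list does not extend inode lifetimes.
Linux reference: Linux 6.6 uses the per-backing-device bdi_writeback.b_dirty list, the inode->i_lock spinlock, and the I_DIRTY state flags in i_state. __mark_inode_dirty() computes was_dirty = i_state & I_DIRTY to tell whether the inode is already on the dirty list and to avoid enqueueing it twice.
DragonOS implementation: InodeDirtyState: u32 defined with bitflags!, with bit positions aligned to Linux's I_DIRTY_*.
DragonOS simplifies this to a single Mutex plus bitflags; the lockless fast path is not implemented yet.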

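2. Add a VFS `write_inode` callback

Add a write_inode() method (default no-op) to the IndexNode trait, corresponding to Linux's super_operations.write_inode.
LockedExt4Inode implements write_inode by calling the existing flush_metadata to write back size/mtime.
MountFSInode forwards the call to the inner inode. Inodes without on-disk metadata (procfs, sysfs, pipes, sockets, and so on) use the default implementation and do not block.
Known simplification: Linux's write_inode takes a struct writeback_control *wbc parameter (carrying the WB_SYNC_NONE/WB_SYNC_ALL mode). DragonOS does not implement this parameter yet. This has no effect today, but it will be needed once periodic asynchronous writeback is implemented.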

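3. Add a VFS `sync_fs` callback

Add a sync_fs(wait: bool) method (default no-op) to the FileSystem trait, corresponding to Linux's super_operations.sync_fs.
Ext4FileSystem implements sync_fs by calling flush_dirty_inodes to flush dirty inode metadata to disk.
MountFS forwards the call to the inner file system.
Current limitation: the wait parameter is unused. In Linux ext4, wait controls whether jbd2_log_wait_commit waits for the journal commit and whether a blkdev_issue_flush barrier is issued. DragonOS ext4 has no journal, so sync_fs only runs flush_dirty_inodes and wait does not change its behavior.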

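4. Rewrite the `sync(2)` system call

Aligns with the flow of Linux's ksys_sync():

Step 1: flush_dirty_pages() wakes dirty-page writeback (corresponds to wakeup_flusher_threads(WB_REASON_SYNC))
Step 2: call sync_inodes_of_mount for each superblock (corresponds to iterate_supers(sync_inodes_one_sb, NULL) → sync_inodes_sb)
Step 3: call sync_fs(false) for each superblock (corresponds to iterate_supers(sync_fs_one_sb, &nowait), submitting metadata without waiting)
Step 4: call sync_fs(true) for each superblock (corresponds to iterate_supers(sync_fs_one_sb, &wait), waiting for metadata to reach disk)

Every step discards errors and the call always returns 0, matching Linux sync(2) semantics.
Known gap: after step 4, Linux also runs sync_bdevs(false) + sync_bdevs(true) to flush block device caches. DragonOS has no block device barrier layer yet, so this step is not implemented.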

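5. Rewrite the `syncfs(2)` system call

Aligns with the Linux SYNCFS(2) → sync_filesystem() flow. syncfs calls MountFS::sync_filesystem() directly, and that method orchestrates the complete synchronization sequence internally:

sync_inodes_of_mount (corresponds to writeback_inodes_sb(sb, WB_REASON_SYNC), starting dirty page writeback)
sync_fs(false) (corresponds to sync_fs(sb, 0), non-waiting mode)
sync_inodes_of_mount (corresponds to sync_inodes_sb(sb), synchronously waiting for all inode writeback to complete)
sync_fs(true) (corresponds to sync_fs(sb, 1), waiting mode)

sync_filesystem encapsulates the read-only check (it returns Ok immediately when is_readonly is true). syncfs no longer orchestrates the steps by hand; it delegates to MountFS::sync_filesystem() and maps errors to the return value.
For non-VFS fds (pipes, sockets, and so on), a failed downcast_arc::<MountFSInode> returns EBADF directly, aligning with Linux behavior (the pseudo-filesystems behind these fds are read-only, so sync_filesystem checks sb_rdonly and returns 0 immediately).
O_PATH fds return EBADF, aligning with the FMODE_PATH mask filtering in Linux's fdget().
Known gaps:

Linux runs sync_blockdev_nowait(sb->s_bdev) / sync_blockdev(sb->s_bdev) inside sync_filesystem. DragonOS has no block device barrier layer yet.
Linux's sync_filesystem holds down_read(&sb->s_umount) to prevent umount during sync. DragonOS does not implement this lock protection yet.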

5.5 sync_filesystem during umount

MountFS::do_umount_and_prepare_remount()

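6. PageCacheManager::sync triggers write_inode

After the dirty pages have been written, PageCacheManager::sync() calls inode.write_inode() to write back metadata, corresponding to Linux's __writeback_single_inode, which calls write_inode() after do_writepages().
Together with sync_fs → flush_dirty_inodes this gives two layers of protection: both the page cache sync path and the sync_fs path attempt metadata writeback, and flush_metadata stays idempotent by checking the SIZE_DIRTY/MTIME_DIRTY bits of InodeDirtyState.
Known difference: Linux guards the call with dirty & ~I_DIRTY_PAGES, so write_inode runs only when inode metadata is dirty (I_DIRTY_SYNC/I_DIRTY_DATASYNC) and dirty pages alone do not trigger it. DragonOS calls write_inode unconditionally and relies on the !size_dirty && !mtime_dirty early return inside flush_metadata for idempotency. The result is functionally equivalent but costs an extra function call.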

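7. Periodic sync_fs in the page_reclaim_thread

After flush_dirty_pages, the background reclaim thread calls sync_fs(true) on every non-read-only mount, covering the metadata writeback that the per-page writeback path misses.
This corresponds to Linux's __writeback_single_inode calling write_inode after do_writepages, where dirty pages and metadata are handled in the same traversal. DragonOS's flush_dirty_pages() does not trigger write_inode, so the dirty metadata in dirty_inodes is flushed here through sync_fs.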

Signed-off-by: aLinChe <1129332011@qq.com>

oeasy1412 · 2026-05-27T11:18:26Z

1. Implementing ext4 Dirty Inode Tracking and Deferred Metadata Flushing

Added an Ext4FileSystem::dirty_inodes list. After a write_at operation completes, the inode is added to the dirty list (mark_inode_dirty), and a unified flush is performed by flush_dirty_inodes.
Weak<LockedExt4Inode> is used to avoid extending the inode's lifetime.
Linux Reference: Linux 6.6 uses a per-backing-device bdi_writeback.b_dirty linked list + the inode->i_lock spinlock + the I_DIRTY state flag in i_state. __mark_inode_dirty() checks was_dirty = i_state & I_DIRTY to determine if the inode is already in the dirty list, preventing duplicate enqueueing.
DragonOS Implementation: Uses InodeDirtyState: u32 defined with bitflags!, with bit positions aligned to Linux's I_DIRTY_*.
DragonOS simplifies this to a single Mutex + bitflags; the lockless fast path is not yet implemented.
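
The tracking scheme described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the PR's code: the `Inode` and `Fs` types, the flag values, and the "flush" that simply clears the state are all assumptions standing in for `LockedExt4Inode`, `Ext4FileSystem`, `InodeDirtyState`, and the real disk write.

```rust
use std::sync::{Arc, Mutex, Weak};

// Hypothetical flag values; the PR states the real bits are aligned with Linux's I_DIRTY_*.
const SIZE_DIRTY: u32 = 1 << 0;
const MTIME_DIRTY: u32 = 1 << 1;
const ON_DIRTY_LIST: u32 = 1 << 2;

struct Inode {
    dirty_state: Mutex<u32>,
}

struct Fs {
    // Weak references so the list does not keep inodes alive.
    dirty_inodes: Mutex<Vec<Weak<Inode>>>,
}

impl Fs {
    /// Sets `bits` on the inode and enqueues it at most once.
    /// Returns true if this call put the inode on the list.
    fn mark_inode_dirty(&self, inode: &Arc<Inode>, bits: u32) -> bool {
        let mut state = inode.dirty_state.lock().unwrap();
        *state |= bits;
        if (*state & ON_DIRTY_LIST) != 0 {
            return false; // already queued, avoid a duplicate entry
        }
        *state |= ON_DIRTY_LIST;
        drop(state);
        self.dirty_inodes.lock().unwrap().push(Arc::downgrade(inode));
        true
    }

    /// Takes the whole list and "flushes" every inode that is still alive.
    /// Returns how many inodes were flushed.
    fn flush_dirty_inodes(&self) -> usize {
        let queued = std::mem::take(&mut *self.dirty_inodes.lock().unwrap());
        let mut flushed = 0;
        for weak in queued {
            if let Some(inode) = weak.upgrade() {
                // Stand-in for writing size/mtime to disk.
                *inode.dirty_state.lock().unwrap() = 0;
                flushed += 1;
            }
        }
        flushed
    }
}
```

Because the sketch clears the state while holding the inode lock, it sidesteps the ordering questions a real flush has to answer when disk I/O happens outside the lock.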

2. Adding a VFS `write_inode` Callback

A write_inode() method (default no-op) is added to the IndexNode trait, corresponding to Linux's super_operations.write_inode.
LockedExt4Inode implements write_inode, calling the existing flush_metadata to write back size/mtime.
MountFSInode forwards the call to the inner inode. Inodes without on-disk metadata, such as procfs, sysfs, pipes, and sockets, use the default implementation and do not block.
Known Simplification: Linux's write_inode accepts a struct writeback_control *wbc parameter (containing WB_SYNC_NONE/WB_SYNC_ALL modes). DragonOS does not yet implement this parameter. There is no impact at present; it will need to be added when periodic asynchronous writeback is implemented later.

3. Adding a VFS `sync_fs` Callback

A sync_fs(wait: bool) method (default no-op) is added to the FileSystem trait, corresponding to Linux's super_operations.sync_fs.
Ext4FileSystem implements sync_fs, calling flush_dirty_inodes to flush dirty inode metadata to disk.
MountFS forwards the call to the inner file system.
Current Limitation: The wait parameter is currently unused. In Linux ext4, wait controls whether jbd2_log_wait_commit waits for journal commit and whether a blkdev_issue_flush barrier is sent. DragonOS ext4 has no journaling mechanism; sync_fs only executes flush_dirty_inodes, and wait does not affect behavior.
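
A rough sketch of the two callbacks from sections 2 and 3 as default trait methods. The trait names follow the description, but the signatures, the `SystemError` placeholder, and the `ProcInode`, `DiskInode`, `Wrapper`, and `PseudoFs` types are assumptions made for illustration only.

```rust
use std::cell::Cell;

// Hypothetical placeholder error type for this sketch.
#[derive(Debug, PartialEq)]
struct SystemError;

trait IndexNode {
    /// Default no-op: inodes without on-disk metadata inherit this.
    fn write_inode(&self) -> Result<(), SystemError> {
        Ok(())
    }
}

trait FileSystem {
    /// Default no-op; `wait` is accepted but unused here, as in the description.
    fn sync_fs(&self, _wait: bool) -> Result<(), SystemError> {
        Ok(())
    }
}

// A procfs-like inode keeps the default implementation.
struct ProcInode;
impl IndexNode for ProcInode {}

// A disk-backed inode overrides write_inode; the counter stands in for flush_metadata.
struct DiskInode {
    flushes: Cell<u32>,
}
impl IndexNode for DiskInode {
    fn write_inode(&self) -> Result<(), SystemError> {
        self.flushes.set(self.flushes.get() + 1);
        Ok(())
    }
}

// A MountFSInode-like wrapper that forwards to the inner inode.
struct Wrapper<T: IndexNode>(T);
impl<T: IndexNode> IndexNode for Wrapper<T> {
    fn write_inode(&self) -> Result<(), SystemError> {
        self.0.write_inode()
    }
}

// A pseudo file system that relies on the default sync_fs.
struct PseudoFs;
impl FileSystem for PseudoFs {}
```

Keeping the default as a no-op means only file systems with real on-disk metadata need to implement anything.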

4. Rewriting the `sync(2)` System Call

Aligns with the Linux ksys_sync() flow:

Step 1: flush_dirty_pages() wakes the flusher to write back dirty pages (corresponds to wakeup_flusher_threads(WB_REASON_SYNC))
Step 2: Calls sync_inodes_of_mount for each superblock (corresponds to iterate_supers(sync_inodes_one_sb, NULL) → sync_inodes_sb)
Step 3: Calls sync_fs(false) for each superblock (corresponds to iterate_supers(sync_fs_one_sb, &nowait), submitting metadata without waiting)
Step 4: Calls sync_fs(true) for each superblock (corresponds to iterate_supers(sync_fs_one_sb, &wait), waiting for metadata to hit disk)

All errors during the steps are discarded and 0 is always returned, matching the semantics of Linux sync(2).
Known Missing Features: In Linux, sync_bdevs(false) and sync_bdevs(true) are also executed after step 4 to flush block device caches. DragonOS does not yet have a block device barrier layer, so this step is not implemented.
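
The four steps above, including the "discard errors, always return 0" rule, can be sketched as below. The `Sb` type, the shared log, and the `sync_all` function are hypothetical; the log exists only to make the call order observable.

```rust
use std::cell::RefCell;

// Hypothetical superblock stand-in that records every call it receives.
struct Sb<'a> {
    name: &'static str,
    fail: bool,
    log: &'a RefCell<Vec<String>>,
}

impl Sb<'_> {
    fn sync_inodes_of_mount(&self) -> Result<(), i32> {
        self.log.borrow_mut().push(format!("{}:inodes", self.name));
        if self.fail { Err(5) } else { Ok(()) }
    }

    fn sync_fs(&self, wait: bool) -> Result<(), i32> {
        self.log.borrow_mut().push(format!("{}:sync_fs({})", self.name, wait));
        Ok(())
    }
}

/// Sketch of the sync(2) flow: each pass visits every superblock before
/// the next pass starts, and errors are deliberately dropped.
fn sync_all(sbs: &[Sb<'_>], log: &RefCell<Vec<String>>) -> i32 {
    // Step 1: stand-in for flush_dirty_pages().
    log.borrow_mut().push("flush_dirty_pages".to_string());
    // Step 2: write back inodes of every superblock.
    for sb in sbs {
        let _ = sb.sync_inodes_of_mount();
    }
    // Step 3: submit metadata without waiting.
    for sb in sbs {
        let _ = sb.sync_fs(false);
    }
    // Step 4: wait for metadata to reach disk.
    for sb in sbs {
        let _ = sb.sync_fs(true);
    }
    0
}
```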

5. Rewriting the `syncfs(2)` System Call

Aligns with the Linux SYNCFS(2) → sync_filesystem() flow. syncfs directly calls the MountFS::sync_filesystem() method, which internally orchestrates the complete synchronization sequence:

sync_inodes_of_mount (corresponds to writeback_inodes_sb(sb, WB_REASON_SYNC), starting dirty page writeback)
sync_fs(false) (corresponds to sync_fs(sb, 0), non-waiting mode)
sync_inodes_of_mount (corresponds to sync_inodes_sb(sb), synchronously waiting for all inode writeback to complete)
sync_fs(true) (corresponds to sync_fs(sb, 1), waiting mode)

The sync_filesystem method encapsulates the read-only check (it returns Ok immediately when is_readonly is true). syncfs no longer orchestrates the steps manually; it delegates to MountFS::sync_filesystem() and maps errors to return values.
For non-VFS file descriptors (pipes, sockets, etc.), if downcast_arc::<MountFSInode> fails, it returns EBADF directly, aligning with Linux behavior (the pseudo-filesystems for these fds are read-only, so sync_filesystem checks sb_rdonly and returns 0 immediately).
O_PATH fds return EBADF, aligning with the FMODE_PATH mask filtering in Linux's fdget().
Known Missing Features:

Inside Linux's sync_filesystem, sync_blockdev_nowait(sb->s_bdev) / sync_blockdev(sb->s_bdev) are executed. DragonOS does not yet have a block device barrier layer.
Linux's sync_filesystem holds down_read(&sb->s_umount) to prevent umount during sync. DragonOS does not yet implement this lock protection.
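
For comparison with sync(2), here is a small sketch of the per-filesystem sequence with the read-only early return. The `Mount` struct, its fields, and the choice to stop at the first error are assumptions; the description only says that errors are mapped to the return value.

```rust
use std::cell::RefCell;

const EIO: i32 = 5; // illustrative errno value

// Hypothetical mount stand-in; `log` records the order of calls.
struct Mount {
    readonly: bool,
    fail_on_wait: bool,
    log: RefCell<Vec<&'static str>>,
}

impl Mount {
    fn sync_inodes_of_mount(&self) -> Result<(), i32> {
        self.log.borrow_mut().push("sync_inodes_of_mount");
        Ok(())
    }

    fn sync_fs(&self, wait: bool) -> Result<(), i32> {
        self.log.borrow_mut().push(if wait { "sync_fs(true)" } else { "sync_fs(false)" });
        if wait && self.fail_on_wait { Err(EIO) } else { Ok(()) }
    }

    /// Sketch of the sequence: nothing to do on a read-only mount,
    /// otherwise start writeback, sync, wait for writeback, sync again.
    fn sync_filesystem(&self) -> Result<(), i32> {
        if self.readonly {
            return Ok(());
        }
        self.sync_inodes_of_mount()?;
        self.sync_fs(false)?;
        self.sync_inodes_of_mount()?;
        self.sync_fs(true)
    }
}
```

Unlike the sync(2) sketch, the result is returned to the caller, which is what lets syncfs(2) report a failure.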

5.5 sync_filesystem during umount

MountFS::do_umount_and_prepare_remount()

6. PageCacheManager::sync triggers write_inode

PageCacheManager::sync() calls inode.write_inode() to write back metadata after dirty pages have been written out. This corresponds to the semantics in Linux's __writeback_single_inode where write_inode() is called after do_writepages().
Together with sync_fs → flush_dirty_inodes, this forms a two-layer protection: both the page cache sync path and the sync_fs path attempt metadata writeback. flush_metadata achieves idempotency internally through checks on the SIZE_DIRTY/MTIME_DIRTY bits of InodeDirtyState.
Known Difference: Linux guards this with the condition dirty & ~I_DIRTY_PAGES, calling write_inode only when inode metadata is dirty (I_DIRTY_SYNC/I_DIRTY_DATASYNC); dirty pages alone do not trigger it. DragonOS calls write_inode unconditionally, relying on the early-exit check !size_dirty && !mtime_dirty inside flush_metadata to achieve idempotency. This is functionally equivalent but incurs extra function-call overhead.
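
The difference between the two guards can be shown with a short sketch. The constants are intended to mirror Linux's I_DIRTY_* layout but should be treated as illustrative, and the `Meta` struct with its `disk_writes` counter is a hypothetical stand-in for the inode state that flush_metadata inspects.

```rust
// Illustrative flag values for this sketch.
const I_DIRTY_SYNC: u32 = 1 << 0;
const I_DIRTY_DATASYNC: u32 = 1 << 1;
const I_DIRTY_PAGES: u32 = 1 << 2;

/// Linux-style guard: call write_inode only if something other than pages is dirty.
fn should_write_inode(dirty: u32) -> bool {
    (dirty & !I_DIRTY_PAGES) != 0
}

// Hypothetical metadata state; `disk_writes` counts simulated disk writes.
struct Meta {
    size_dirty: bool,
    mtime_dirty: bool,
    disk_writes: u32,
}

impl Meta {
    /// DragonOS-style approach as described: always called, returns early when clean.
    fn flush_metadata(&mut self) {
        if !self.size_dirty && !self.mtime_dirty {
            return; // nothing dirty, so repeated calls are harmless
        }
        self.disk_writes += 1;
        self.size_dirty = false;
        self.mtime_dirty = false;
    }
}
```

Both approaches end with the same number of disk writes; the early-return version just pays for the call before finding out there is nothing to do.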

7. Periodic sync_fs in the page_reclaim_thread

After flush_dirty_pages, the background reclaim thread calls sync_fs(true) for each non-read-only mount, supplementing the metadata writeback not covered by the per-page writeback path.
This corresponds to the semantics in Linux where __writeback_single_inode calls write_inode after do_writepages: dirty pages and metadata are completed in the same traversal. DragonOS's flush_dirty_pages() does not trigger write_inode, so dirty metadata in dirty_inodes is flushed here via sync_fs.

fslongjin · 2026-05-27T11:46:07Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2494473dfa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-27T11:52:55Z

+                    inode
+                        .0
+                        .lock()
+                        .dirty_state
+                        .remove(InodeDirtyState::ON_DIRTY_LIST);


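Re-queue inodes that are dirtied concurrently

When flush_dirty_inodes() has already taken the inode off dirty_inodes but has not yet cleared ON_DIRTY_LIST, another buffered write returns early from mark_inode_dirty() because the bit is still set, and does not re-queue the inode. flush_metadata() keeps SIZE_DIRTY/MTIME_DIRTY when it finds that cached_file_size/cached_mtime have changed, but this code still removes ON_DIRTY_LIST unconditionally, leaving the inode dirty yet absent from the dirty list. Because flush_dirty_pages() in the background page_reclaim_thread does not call write_inode, metadata is written back only when sync_fs() walks the dirty list, so the size/mtime from that concurrent write may stay off disk for a long time.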


fslongjin · 2026-05-27T11:56:26Z

Thanks for working on this. I think the overall direction is valuable: adding VFS-level write_inode() / sync_fs() hooks and making syncfs() target the fd's filesystem are both useful steps toward Linux-compatible writeback semantics.

However, I do not think this implementation is safe to merge yet. I found several semantic and architectural issues that should be addressed first.

Dirty inode re-dirty race can lose metadata writeback

In Ext4FileSystem::flush_dirty_inodes(), the dirty list is first removed with mem::take(), then each inode is flushed, and on success ON_DIRTY_LIST is cleared unconditionally.

If the same inode is written again while flush_metadata() is in progress, mark_inode_dirty() will observe ON_DIRTY_LIST and return without re-queueing the inode. After the old flush succeeds, flush_dirty_inodes() clears ON_DIRTY_LIST. At that point the inode may still have SIZE_DIRTY / MTIME_DIRTY set from the concurrent write, but it is no longer present in dirty_inodes, so later sync_fs() calls may never see it.

Linux avoids this class of bug with the inode writeback state machine: dirty bits are taken/cleared under i_lock, writeback is marked with I_SYNC, and requeue_inode() handles inodes dirtied again during writeback. DragonOS does not need to clone the full Linux machinery immediately, but it does need an equivalent invariant: after a flush completes, if the inode is still dirty or was dirtied during the flush, it must remain queued or be re-queued.
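
One way to express the invariant above is sketched here: the queue-membership bit is kept separate from the content bits, and an inode that is dirty again when its flush finishes goes back on the queue. This is an assumption-laden illustration, not the fix that was eventually merged; `Inode`, `Fs`, `QUEUED`, and the `during_flush` hook (which simulates a concurrent writer) are all hypothetical.

```rust
use std::sync::{Arc, Mutex};

const SIZE_DIRTY: u32 = 1 << 0; // content bit (illustrative value)
const QUEUED: u32 = 1 << 8; // lifecycle bit, kept apart from content bits

struct Inode {
    state: Mutex<u32>,
}

struct Fs {
    // Strong references: queued inodes are pinned by the queue in this sketch.
    queue: Mutex<Vec<Arc<Inode>>>,
}

impl Fs {
    fn mark_dirty(&self, inode: &Arc<Inode>, bits: u32) {
        let mut s = inode.state.lock().unwrap();
        *s |= bits;
        if (*s & QUEUED) == 0 {
            *s |= QUEUED;
            drop(s);
            self.queue.lock().unwrap().push(inode.clone());
        }
    }

    /// `during_flush` runs while no lock is held, like a writer racing with the flush.
    fn flush(&self, during_flush: impl Fn(&Arc<Inode>)) {
        let batch = std::mem::take(&mut *self.queue.lock().unwrap());
        for inode in batch {
            // Take the content bits under the lock but leave QUEUED set.
            let snapshot = {
                let mut s = inode.state.lock().unwrap();
                let snap = *s & !QUEUED;
                *s &= QUEUED;
                snap
            };
            let _ = snapshot; // stand-in for writing the snapshot to disk
            during_flush(&inode);
            let mut s = inode.state.lock().unwrap();
            if (*s & !QUEUED) != 0 {
                // Dirtied during the flush: keep QUEUED and put it back.
                drop(s);
                self.queue.lock().unwrap().push(inode.clone());
            } else {
                *s &= !QUEUED;
            }
        }
    }
}
```

With this shape, a write that lands during the flush sees QUEUED still set and skips the enqueue, and the flush itself notices the new content bits and re-queues the inode.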

PageCacheManager::sync() currently drops write_inode() errors

The new call to inode.write_inode() only logs failures and still returns Ok(()). This means sync_inodes_of_mount() and syncfs() cannot observe inode metadata writeback failures through this path.

Linux propagates writeback failures through the writeback/errseq path, and syncfs() reports superblock writeback errors via s_wb_err. DragonOS may not have the full superblock-level errseq infrastructure yet, but newly introduced synchronous metadata flush paths should not silently convert metadata writeback failure into success.

sync(2) and the page reclaim thread are scoped to the current mount namespace

sync() currently iterates ProcessManager::current_mntns().mount_list(). Linux ksys_sync() uses iterate_supers() and syncs all mounted superblocks in the system, not only the filesystems visible from the caller's mount namespace.

The same issue is more problematic in the page reclaim thread: writeback should not depend on an arbitrary current process mount namespace. Architecturally, this suggests DragonOS needs a global mounted-superblock/filesystem registry for system-wide sync and background writeback, while syncfs(fd) should remain fd/superblock-scoped.

umount() sync is performed after detach and ignores failures

MountFS::umount() calls do_umount() first, then calls self.sync_filesystem() only after the mount has been detached, and ignores the result. That ordering is fragile: writeback should happen before teardown/detach, under appropriate lifetime protection, and failures should not be silently discarded unless there is an explicit Linux-compatible reason to do so.

syncfs() is still missing superblock-level writeback error semantics

The TODO around errseq_check_and_advance(&sb->s_wb_err, &file->f_sb_err) is a real semantic gap, not just an optimization. Linux syncfs(fd) reports asynchronous writeback errors for the whole superblock since the file was opened. DragonOS currently has per-page-cache writeback error state, but that cannot represent errors from other inodes on the same filesystem.
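
The "sample at open, check and advance at syncfs" idea can be sketched as follows. This is a deliberately simplified model using a mutex-protected counter; Linux's errseq_t is a more compact encoding, and the `SuperblockErr` type and its method names are hypothetical.

```rust
use std::sync::Mutex;

// Hypothetical superblock-wide error state: (sequence number, last errno).
struct SuperblockErr {
    inner: Mutex<(u64, i32)>,
}

impl SuperblockErr {
    fn new() -> Self {
        Self { inner: Mutex::new((0, 0)) }
    }

    /// Called when writeback of any inode on this superblock fails.
    fn record(&self, errno: i32) {
        let mut g = self.inner.lock().unwrap();
        g.0 += 1;
        g.1 = errno;
    }

    /// Called at open time; the file remembers the returned value.
    fn sample(&self) -> u64 {
        self.inner.lock().unwrap().0
    }

    /// Called by syncfs: reports an error recorded since `since`,
    /// then advances `since` so the same error is reported once per file.
    fn check_and_advance(&self, since: &mut u64) -> Result<(), i32> {
        let g = self.inner.lock().unwrap();
        if g.0 != *since {
            *since = g.0;
            Err(g.1)
        } else {
            Ok(())
        }
    }
}
```

Because the state lives on the superblock rather than on one page cache, a file opened on inode A can still observe a writeback failure that happened on inode B of the same filesystem.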

Suggested direction:

Keep the VFS write_inode() / sync_fs() abstraction, but separate dirty content bits from queue/writeback lifecycle bits.
Add an inode dirty/writeback invariant equivalent to: if an inode is dirty after writeback, or is dirtied during writeback, it remains queued for future writeback.
Introduce a global mounted-superblock/filesystem registry for sync(2) and background writeback instead of using the current mount namespace.
Add superblock-level writeback error tracking before considering syncfs() Linux-compatible.
Move umount() writeback before detach/teardown and decide explicitly which errors should be returned or recorded.

I verified that this branch builds locally with make kernel, so the concerns above are about Linux semantics, dirty-state lifecycle, and writeback architecture rather than compilation.

Signed-off-by: longjin <longjin@dragonos.org>

fslongjin · 2026-05-27T12:59:58Z

Resolved the sync/writeback review issues in commit 350b46d9.

What changed:

Reworked ext4 dirty inode tracking so dirty inodes are pinned by the dirty list and redirty/writeback races requeue instead of losing metadata.
Added superblock-level writeback error state and per-file syncfs(2) error sampling/advance semantics.
Switched sync(2) and page reclaim to the global mounted-superblock registry instead of the current process mount namespace.
Centralized umount_lock handling in MountFS helpers so sync(2), syncfs(2), page reclaim, propagation, and umount(2) follow the Linux-style s_umount lifetime invariant.
Moved umount(2) sync before detach/namespace teardown and preserved the mount on sync failure.
Propagated page-cache and metadata writeback errors into superblock errseq for later syncfs(2) observation.
Added normal/syncfs_semantics dunitest coverage for Linux-compatible syncfs(2) fd semantics and a concurrent sync()/umount() lifetime smoke test.

Validation:

git diff --check
make kernel
make fmt
Host Linux: user/apps/tests/dunitest/bin/normal/syncfs_semantics_test passed 8/8
DragonOS guest after regenerating the disk image with make run-nographic: /opt/tests/dunitest/bin/normal/syncfs_semantics_test passed 8/8

…n sync path

The sync/syncfs refactoring introduced in DragonOS-Community#1934 caused the test-x86 CI to hang indefinitely. This patch fixes three critical issues:

1. Self-deadlock on umount_lock RwSem: MountFS::umount() acquired umount_write, then through propagate_umount → umount_at_peer called child.sync_filesystem() which attempted umount_read on the SAME RwSem (shared via Arc::clone in deepcopy). Since RwSem is non-reentrant, writer + reader on same thread = permanent sleep.
   Fix: Move sync_filesystem() BEFORE acquiring umount_write (aligning with Linux's generic_shutdown_super pattern), and remove redundant sync from umount_at_peer since all propagated peers share the same superblock.

2. Writeback page leak: writeback_entry() could return Err after prepare_writeback_entry set page state to Writeback, without calling finish_writeback_entry. The page would be stuck in Writeback state forever, causing any future wait_writeback_entry to sleep indefinitely.
   Fix: Add RAII WritebackGuard that ensures finish_writeback_entry is always called on early exit paths.

3. Over-aggressive page reclaim: The page reclaim thread was calling sync_fs_with_umount_read(true) for all mounts every 500ms, competing with user I/O on io_guard and holding umount_read which delays umount. Metadata sync is not the reclaimer's responsibility.
   Fix: Remove sync_fs from page reclaim thread; increase writeback interval from 500ms to 5s (matching Linux dirty_writeback_centisecs).

Additionally:
- Add has_dirty_pages() short-circuit to sync_inodes_of_mount() to skip clean page caches without expensive inode upgrade + Arc::ptr_eq.
- Add SharedMountPropagationUmountNoDeadlock dunitest that exercises the exact deadlock scenario (MS_SHARED + concurrent sync/umount).

Signed-off-by: longjin <longjin@DragonOS.org>
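
The RAII guard mentioned in fix 2 could look roughly like the sketch below. Only the name WritebackGuard comes from the commit message; the `Page` and `PageState` types, the guard's fields, and the choice to return a failed page to `Dirty` are assumptions for illustration.

```rust
use std::cell::Cell;

#[derive(Debug, PartialEq, Clone, Copy)]
enum PageState {
    Dirty,
    Writeback,
    Clean,
}

// Hypothetical page stand-in.
struct Page {
    state: Cell<PageState>,
}

/// Marks the page as under writeback on creation and always leaves the
/// Writeback state on drop, including on early-return error paths.
struct WritebackGuard<'a> {
    page: &'a Page,
    succeeded: bool,
}

impl<'a> WritebackGuard<'a> {
    fn begin(page: &'a Page) -> Self {
        page.state.set(PageState::Writeback);
        Self { page, succeeded: false }
    }

    fn mark_succeeded(&mut self) {
        self.succeeded = true;
    }
}

impl Drop for WritebackGuard<'_> {
    fn drop(&mut self) {
        // Assumption: a failed write returns the page to Dirty so it can be retried.
        let next = if self.succeeded { PageState::Clean } else { PageState::Dirty };
        self.page.state.set(next);
    }
}

/// `io` stands in for the actual disk write; `?` may return early,
/// and the guard still runs its cleanup.
fn writeback_entry(page: &Page, io: impl FnOnce() -> Result<(), i32>) -> Result<(), i32> {
    let mut guard = WritebackGuard::begin(page);
    io()?;
    guard.mark_succeeded();
    Ok(())
}
```

Putting the cleanup in Drop means a new early-return path added later cannot reintroduce the stuck-in-Writeback state.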

oeasy1412 added 3 commits May 27, 2026 16:03

feat(inode): sync dirty inode

bcdaf39

Signed-off-by: aLinChe <1129332011@qq.com>

feat(dirty_state)

540fcc3

Signed-off-by: aLinChe <1129332011@qq.com>

fix

2494473

Signed-off-by: aLinChe <1129332011@qq.com>

github-actions Bot added the "Bug fix" label (A bug is fixed in this pull request) May 27, 2026

oeasy1412 requested a review from fslongjin May 27, 2026 11:18

chatgpt-codex-connector Bot reviewed May 27, 2026


fix(vfs): align sync writeback lifetime semantics

350b46d

Signed-off-by: longjin <longjin@dragonos.org>

github-actions Bot added the "test" label (Unitest/User space test) May 27, 2026

fslongjin merged commit 074e673 into DragonOS-Community:master May 27, 2026
27 checks passed

fslongjin mentioned this pull request May 28, 2026

fix(vfs): resolve umount_lock self-deadlock and writeback page leak in sync path #1936

Open

fslongjin merged 4 commits into DragonOS-Community:master from oeasy1412:fix-sync
