fix(sync): 修改dirty_state并重构sync syncfs系统调用#1934
Conversation
Signed-off-by: aLinChe <1129332011@qq.com>
Signed-off-by: aLinChe <1129332011@qq.com>
1. Implementing ext4 Dirty Inode Tracking and Deferred Metadata FlushingAdded an 2. Adding a VFS
|
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2494473dfa
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| inode | ||
| .0 | ||
| .lock() | ||
| .dirty_state | ||
| .remove(InodeDirtyState::ON_DIRTY_LIST); |
There was a problem hiding this comment.
当 flush_dirty_inodes() 已经从 dirty_inodes 取走该 inode 但尚未清除 ON_DIRTY_LIST 时,另一个 buffered write 会在 mark_inode_dirty() 中因为该位仍然存在而直接返回、不再重新入队;flush_metadata() 在发现 cached_file_size/cached_mtime 已变化时会保留 SIZE_DIRTY/MTIME_DIRTY,但这里仍无条件移除 ON_DIRTY_LIST,使 inode 处于脏状态却不在 dirty 列表中。之后后台 page_reclaim_thread 的 flush_dirty_pages() 不调用 write_inode,元数据只依赖 sync_fs() 遍历 dirty 列表,因此这个并发写入的 size/mtime 可能长期不会落盘。
Useful? React with 👍 / 👎.
|
Thanks for working on this. I think the overall direction is valuable: adding VFS-level However, I do not think this implementation is safe to merge yet. I found several semantic and architectural issues that should be addressed first.
Suggested direction:
I verified that this branch builds locally with |
Signed-off-by: longjin <longjin@dragonos.org>
|
Resolved the sync/writeback review issues in commit What changed:
Validation:
|
…n sync path The sync/syncfs refactoring introduced in DragonOS-Community#1934 caused the test-x86 CI to hang indefinitely. This patch fixes three critical issues: 1. Self-deadlock on umount_lock RwSem: MountFS::umount() acquired umount_write, then through propagate_umount → umount_at_peer called child.sync_filesystem() which attempted umount_read on the SAME RwSem (shared via Arc::clone in deepcopy). Since RwSem is non-reentrant, writer + reader on same thread = permanent sleep. Fix: Move sync_filesystem() BEFORE acquiring umount_write (aligning with Linux's generic_shutdown_super pattern), and remove redundant sync from umount_at_peer since all propagated peers share the same superblock. 2. Writeback page leak: writeback_entry() could return Err after prepare_writeback_entry set page state to Writeback, without calling finish_writeback_entry. The page would be stuck in Writeback state forever, causing any future wait_writeback_entry to sleep indefinitely. Fix: Add RAII WritebackGuard that ensures finish_writeback_entry is always called on early exit paths. 3. Over-aggressive page reclaim: The page reclaim thread was calling sync_fs_with_umount_read(true) for all mounts every 500ms, competing with user I/O on io_guard and holding umount_read which delays umount. Metadata sync is not the reclaimer's responsibility. Fix: Remove sync_fs from page reclaim thread; increase writeback interval from 500ms to 5s (matching Linux dirty_writeback_centisecs). Additionally: - Add has_dirty_pages() short-circuit to sync_inodes_of_mount() to skip clean page caches without expensive inode upgrade + Arc::ptr_eq. - Add SharedMountPropagationUmountNoDeadlock dunitest that exercises the exact deadlock scenario (MS_SHARED + concurrent sync/umount). Signed-off-by: longjin <longjin@DragonOS.org>
…n sync path The sync/syncfs refactoring introduced in DragonOS-Community#1934 caused the test-x86 CI to hang indefinitely. This patch fixes three critical issues: 1. Self-deadlock on umount_lock RwSem: MountFS::umount() acquired umount_write, then through propagate_umount → umount_at_peer called child.sync_filesystem() which attempted umount_read on the SAME RwSem (shared via Arc::clone in deepcopy). Since RwSem is non-reentrant, writer + reader on same thread = permanent sleep. Fix: Move sync_filesystem() BEFORE acquiring umount_write (aligning with Linux's generic_shutdown_super pattern), and remove redundant sync from umount_at_peer since all propagated peers share the same superblock. 2. Writeback page leak: writeback_entry() could return Err after prepare_writeback_entry set page state to Writeback, without calling finish_writeback_entry. The page would be stuck in Writeback state forever, causing any future wait_writeback_entry to sleep indefinitely. Fix: Add RAII WritebackGuard that ensures finish_writeback_entry is always called on early exit paths. 3. Over-aggressive page reclaim: The page reclaim thread was calling sync_fs_with_umount_read(true) for all mounts every 500ms, competing with user I/O on io_guard and holding umount_read which delays umount. Metadata sync is not the reclaimer's responsibility. Fix: Remove sync_fs from page reclaim thread; increase writeback interval from 500ms to 5s (matching Linux dirty_writeback_centisecs). Additionally: - Add has_dirty_pages() short-circuit to sync_inodes_of_mount() to skip clean page caches without expensive inode upgrade + Arc::ptr_eq. - Add SharedMountPropagationUmountNoDeadlock dunitest that exercises the exact deadlock scenario (MS_SHARED + concurrent sync/umount). Signed-off-by: longjin <longjin@DragonOS.org>
…n sync path The sync/syncfs refactoring introduced in DragonOS-Community#1934 caused the test-x86 CI to hang indefinitely. This patch fixes three critical issues: 1. Self-deadlock on umount_lock RwSem: MountFS::umount() acquired umount_write, then through propagate_umount → umount_at_peer called child.sync_filesystem() which attempted umount_read on the SAME RwSem (shared via Arc::clone in deepcopy). Since RwSem is non-reentrant, writer + reader on same thread = permanent sleep. Fix: Move sync_filesystem() BEFORE acquiring umount_write (aligning with Linux's generic_shutdown_super pattern), and remove redundant sync from umount_at_peer since all propagated peers share the same superblock. 2. Writeback page leak: writeback_entry() could return Err after prepare_writeback_entry set page state to Writeback, without calling finish_writeback_entry. The page would be stuck in Writeback state forever, causing any future wait_writeback_entry to sleep indefinitely. Fix: Add RAII WritebackGuard that ensures finish_writeback_entry is always called on early exit paths. 3. Over-aggressive page reclaim: The page reclaim thread was calling sync_fs_with_umount_read(true) for all mounts every 500ms, competing with user I/O on io_guard and holding umount_read which delays umount. Metadata sync is not the reclaimer's responsibility. Fix: Remove sync_fs from page reclaim thread; increase writeback interval from 500ms to 5s (matching Linux dirty_writeback_centisecs). Additionally: - Add has_dirty_pages() short-circuit to sync_inodes_of_mount() to skip clean page caches without expensive inode upgrade + Arc::ptr_eq. - Add SharedMountPropagationUmountNoDeadlock dunitest that exercises the exact deadlock scenario (MS_SHARED + concurrent sync/umount). Signed-off-by: longjin <longjin@DragonOS.org>
1. 实现 ext4 脏 inode 追踪与延迟元数据刷盘
新增
Ext4FileSystem::dirty_inodes列表,在write_at完成写操作后将 inode 加入脏列表(mark_inode_dirty),由flush_dirty_inodes统一刷盘。使用
Weak<LockedExt4Inode>避免延长 inode 生命周期。Linux 参考:Linux 6.6 使用 per-backing-device 的
bdi_writeback.b_dirty链表 +inode->i_lock自旋锁 +i_state中的I_DIRTY状态标志位。__mark_inode_dirty()通过was_dirty = i_state & I_DIRTY判断是否已在脏列表,避免重复入队。DragonOS 实现:使用
bitflags!定义的InodeDirtyState: u32,bit 位置对齐 LinuxI_DIRTY_*DragonOS 简化为单一 Mutex + bitflags,无锁快速路径暂未实现
2. 新增 VFS
write_inode回调在
IndexNodetrait 中新增write_inode()方法(默认 no-op),对应 Linuxsuper_operations.write_inode。LockedExt4Inode实现write_inode,调用已有的flush_metadata回写 size/mtime。MountFSInode透传到内部 inode。procfs/sysfs/pipe/socket 等无磁盘元数据的 inode 使用默认实现,不阻塞。已知简化:Linux
write_inode接受struct writeback_control *wbc参数(含WB_SYNC_NONE/WB_SYNC_ALL模式),DragonOS 暂未实现此参数。当前无影响,后续实现周期性异步回写时需补充。3. 新增 VFS
sync_fs回调在
FileSystemtrait 中新增sync_fs(wait: bool)方法(默认 no-op),对应 Linuxsuper_operations.sync_fs。Ext4FileSystem实现sync_fs,调用flush_dirty_inodes将脏 inode 元数据刷盘。MountFS透传到内部文件系统。当前限制:
wait参数暂未使用。Linux ext4 中wait控制是否jbd2_log_wait_commit等待日志提交以及是否发送blkdev_issue_flushbarrier。DragonOS ext4 无日志机制,sync_fs仅执行flush_dirty_inodes,wait不影响行为。4. 重写
sync(2)系统调用对齐 Linux
ksys_sync()的流程:flush_dirty_pages()— 唤醒回写脏页(对应wakeup_flusher_threads(WB_REASON_SYNC))sync_inodes_of_mount(对应iterate_supers(sync_inodes_one_sb, NULL)→sync_inodes_sb)sync_fs(false)(对应iterate_supers(sync_fs_one_sb, &nowait),提交元数据但不等待)sync_fs(true)(对应iterate_supers(sync_fs_one_sb, &wait),等待元数据落盘)所有步骤丢弃错误,始终返回 0,符合 Linux sync(2) 语义。
已知缺失:Linux 在步骤 4 之后还执行
sync_bdevs(false)+sync_bdevs(true)刷写块设备缓存。DragonOS 尚无块设备 barrier 层,此步骤暂未实现。5. 重写
syncfs(2)系统调用对齐 Linux
SYNCFS(2)→sync_filesystem()的流程。syncfs直接调用MountFS::sync_filesystem()方法,该方法内部编排完整的同步序列:sync_inodes_of_mount(对应writeback_inodes_sb(sb, WB_REASON_SYNC),启动脏页写回)sync_fs(false)(对应sync_fs(sb, 0),非等待模式)sync_inodes_of_mount(对应sync_inodes_sb(sb),同步等待所有 inode 写回完成)sync_fs(true)(对应sync_fs(sb, 1),等待模式)sync_filesystem方法封装了只读检查(is_readonly直接返回 Ok),syncfs 不再手动编排步骤,而是委托给MountFS::sync_filesystem()并映射错误到返回值。非 VFS fd(pipe/socket 等)的
downcast_arc::<MountFSInode>失败时直接返回 EBADF,对齐 Linux 行为(这些 fd 所属伪文件系统为只读,sync_filesystem检查sb_rdonly后直接返回 0)。O_PATH fd 返回 EBADF,对齐 Linux
fdget()的FMODE_PATH掩码过滤。已知缺失:
sync_blockdev_nowait(sb->s_bdev)/sync_blockdev(sb->s_bdev)。DragonOS 尚无块设备 barrier 层。down_read(&sb->s_umount)防止 sync 期间 umount,DragonOS 暂未实现此锁保护。5.5 umount 时 sync_filesystem
MountFS::do_umount_and_prepare_remount()6. PageCacheManager::sync 触发 write_inode
PageCacheManager::sync()在脏页写完后调用inode.write_inode()回写元数据,对应 Linux__writeback_single_inode中do_writepages()之后调用write_inode()的语义。配合
sync_fs → flush_dirty_inodes形成两层防护:page cache sync 路径和 sync_fs 路径都会尝试元数据回写,flush_metadata内部通过InodeDirtyState的SIZE_DIRTY/MTIME_DIRTY位检查实现幂等。已知差异:Linux 通过
dirty & ~I_DIRTY_PAGES条件守卫,仅当 inode 元数据脏(I_DIRTY_SYNC/I_DIRTY_DATASYNC)时才调用write_inode,纯脏页不触发。DragonOS 无条件调用write_inode,依赖flush_metadata内部!size_dirty && !mtime_dirty早退检查实现幂等,功能等价但存在额外的函数调用开销。7. page_reclaim_thread 周期性 sync_fs
后台回收线程在
flush_dirty_pages后对每个非只读 mount 调用sync_fs(true),补充逐页 writeback 路径未覆盖的元数据回写。对应 Linux 中
__writeback_single_inode在do_writepages后调用write_inode的语义:脏页和元数据在同一次遍历中完成。DragonOS 的flush_dirty_pages()不触发write_inode,此处通过sync_fs刷回dirty_inodes中的脏元数据。