Performance optimizations #4360

Open
WonderMr wants to merge 9 commits into flipperdevices:dev from WonderMr:dev

@WonderMr WonderMr commented Mar 17, 2026

What's new

10 safe performance optimizations for STM32WB55 (Cortex-M4 @ 64MHz) + 1 Bugfix

Compiler & build:

  1. -Og → -Os for release builds — enables most -O2 passes: function inlining, dead code elimination, loop optimization, aggressive register allocation, tail call optimization, and common subexpression elimination (site_scons/firmwareopts.scons)

Memory management (correctness + perf):
2. Disable heap memset on free() in release — configHEAP_CLEAR_MEMORY_ON_FREE is now conditional on FURI_DEBUG. Saves ~500+ memset calls/sec during active GUI/protocol work (FreeRTOSConfig.h)
3. Fix calloc() to explicitly zero memory — with heap-clear disabled in release, calloc() now does its own memset(0) to guarantee zero-initialized returns (memmgr.c)
4. Fix realloc() to copy min(old_size, new_size) bytes — original copied size (new) bytes, reading past old allocation when growing. Added memmgr_heap_get_block_size() that reads usable size from Heap_4 BlockLink_t header. Also added heapVALIDATE_BLOCK_POINTER to the new public memmgr_heap_get_block_size() for parity with vPortFree().
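The realloc() fix in item 4 can be sketched on a host machine with a toy allocator that stores the usable size in a header in front of each block. This is only an illustration of the copy-min(old, new) idea, not the firmware's Heap_4 code; every `sketch_` name and the `SketchHeader` layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical block header: the usable size sits just before the payload.
 * The union keeps the payload aligned for any object type. */
typedef union {
    size_t usable_size;
    max_align_t alignment;
} SketchHeader;

static void* sketch_alloc(size_t size) {
    SketchHeader* header = malloc(sizeof(SketchHeader) + size);
    assert(header != NULL);
    header->usable_size = size;
    return header + 1; /* payload starts right after the header */
}

/* Read the usable size back from the header in front of the payload. */
static size_t sketch_block_size(const void* payload) {
    const SketchHeader* header = (const SketchHeader*)payload - 1;
    return header->usable_size;
}

static void sketch_free(void* payload) {
    if(payload != NULL) {
        free((SketchHeader*)payload - 1);
    }
}

static void* sketch_realloc(void* ptr, size_t size) {
    if(size == 0) {
        sketch_free(ptr);
        return NULL;
    }

    void* fresh = sketch_alloc(size);
    if(ptr != NULL) {
        /* Copy only what both blocks can hold: never read past the old
         * block when growing, never write past the new one when shrinking. */
        size_t old_size = sketch_block_size(ptr);
        size_t copy_size = old_size < size ? old_size : size;
        memcpy(fresh, ptr, copy_size);
        sketch_free(ptr);
    }
    return fresh;
}
```

Growing a 4-byte block to 64 bytes copies 4 bytes; shrinking it to 2 bytes copies 2.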

Branch prediction:
5. __builtin_expect hints on furi_check/assert/break — crash code moves to end of function, hot path becomes fall-through (0 pipeline penalty on Cortex-M4 3-stage pipeline). Affects ~2300+ call sites across the firmware (check.h)
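A minimal sketch of the kind of hint item 5 describes, assuming a GCC or Clang toolchain. The macro and function names below (`sketch_check`, `sketch_crash`, `SKETCH_UNLIKELY`) are hypothetical stand-ins and are not the macros in check.h.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical crash handler; marked noreturn so the compiler can treat
 * the failure branch as cold. */
static void sketch_crash(const char* message) __attribute__((noreturn));

static void sketch_crash(const char* message) {
    fprintf(stderr, "check failed: %s\n", message);
    abort();
}

/* Tell the compiler that the condition is expected to be false. */
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)

#define sketch_check(expr)             \
    do {                               \
        if(SKETCH_UNLIKELY(!(expr))) { \
            sketch_crash(#expr);       \
        }                              \
    } while(0)

static int sketch_divide(int numerator, int denominator) {
    /* The failing branch is hinted as unlikely, so the compiler is free
     * to lay out the division as the fall-through path. */
    sketch_check(denominator != 0);
    return numerator / denominator;
}

static void sketch_demo_check(void) {
    assert(sketch_divide(10, 2) == 5);
}
```

The hint only influences code layout; the check itself behaves the same with or without it.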

DMA:
6. SPI TX via DMA with RX drain — furi_hal_spi_bus_tx() now delegates to DMA when scheduler is running, freeing the CPU during display updates (~1KB/frame @ 20fps) and radio TX. TX-only path sets up RX DMA channel draining into a dummy byte to prevent OVR accumulation. Polling fallback preserved for pre-scheduler context (furi_hal_spi.c)
Also fixes a pre-existing race in the cleanup path: LL_DMA_DisableIT_TC was issued after furi_semaphore_release, allowing a late ISR to crash furi_check on a double-release. Now disables TC IRQ and clears TC flag before releasing the semaphore

Hot function inlining:
7. __attribute__((flatten)) on furi_get_tick(): forces inlining of FreeRTOS wrappers at call sites (kernel.c)
8. __attribute__((flatten)) on hot thread functions: applied to furi_thread_get_current_id(), furi_thread_get_current(), furi_thread_flags_get() (thread.c)
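For readers unfamiliar with the attribute used in items 7 and 8: on GCC and Clang, `flatten` asks the compiler to inline, where possible, the calls made inside the annotated function. It can only do so for callees whose bodies are visible in the same translation unit (or through LTO). The sketch below uses hypothetical `sketch_` wrappers in place of the real FreeRTOS calls.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tick source standing in for the RTOS tick counter. */
static volatile uint32_t sketch_tick_counter = 0;

/* Two thin wrapper layers, similar in shape to a port layer plus an
 * RTOS API function. */
static uint32_t sketch_port_tick(void) {
    return sketch_tick_counter;
}

static uint32_t sketch_rtos_get_tick(void) {
    return sketch_port_tick();
}

/* flatten: calls made in this function body are inlined if possible,
 * so the wrapper chain collapses into a single load. */
__attribute__((flatten)) static uint32_t sketch_get_tick(void) {
    return sketch_rtos_get_tick();
}

static void sketch_demo_tick(void) {
    sketch_tick_counter = 1234;
    assert(sketch_get_tick() == 1234);
}
```

The observable behavior is unchanged; only the generated code differs.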

String formatting:
9. In-place vprintf for furi_string_cat_vprintf() — formats directly into destination buffer at current offset, growing only if needed. Eliminates temporary FuriString allocation (malloc + format + memcpy + free) per call (string.c)
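The in-place approach from item 9 can be sketched with a small growable string on a host machine. This is an illustration under stated assumptions, not the code in string.c: `SketchString` and all `sketch_` functions are hypothetical, and the growth policy (grow to the exact size needed) is chosen only for brevity.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string: capacity counts the NUL terminator. */
typedef struct {
    char* data;
    size_t length;
    size_t capacity;
} SketchString;

static void sketch_string_init(SketchString* s) {
    s->capacity = 16;
    s->data = malloc(s->capacity);
    assert(s->data != NULL);
    s->data[0] = '\0';
    s->length = 0;
}

static void sketch_string_free(SketchString* s) {
    free(s->data);
    s->data = NULL;
    s->length = 0;
    s->capacity = 0;
}

static int sketch_string_cat_vprintf(SketchString* s, const char* format, va_list args) {
    /* Keep a copy: args must not be reused after the first vsnprintf. */
    va_list retry_args;
    va_copy(retry_args, args);

    /* First attempt: format straight into the free tail of the buffer. */
    size_t available = s->capacity - s->length;
    int written = vsnprintf(s->data + s->length, available, format, args);

    /* Output plus NUL did not fit: grow once and format again. */
    if(written >= 0 && (size_t)written + 1 > available) {
        size_t new_capacity = s->length + (size_t)written + 1;
        char* grown = realloc(s->data, new_capacity);
        assert(grown != NULL);
        s->data = grown;
        s->capacity = new_capacity;
        written = vsnprintf(s->data + s->length, s->capacity - s->length, format, retry_args);
    }
    va_end(retry_args);

    if(written < 0) {
        s->data[s->length] = '\0'; /* restore the old contents on error */
        return written;
    }
    s->length += (size_t)written;
    return written;
}

static int sketch_string_cat_printf(SketchString* s, const char* format, ...) {
    va_list args;
    va_start(args, format);
    int written = sketch_string_cat_vprintf(s, format, args);
    va_end(args);
    return written;
}
```

No temporary string is allocated: when the output fits, the call costs a single vsnprintf; when it does not, it costs one realloc and a second vsnprintf.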

Power:
10. Reduce configEXPECTED_IDLE_TIME_BEFORE_SLEEP from 4 to 2 ticks — allows FreeRTOS tickless idle to enter STOP mode more aggressively (2ms threshold instead of 4ms). Reduces average power consumption (FreeRTOSConfig.h)
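Items 2 and 10 both change FreeRTOSConfig.h. The fragment below is a sketch of how the two settings might look, assuming FURI_DEBUG is a macro defined only in debug builds; the actual lines and the surrounding file contents are not shown in this thread.

```c
/* Sketch only: assumes FURI_DEBUG is defined for debug builds and
 * left undefined for release builds. */
#ifdef FURI_DEBUG
#define configHEAP_CLEAR_MEMORY_ON_FREE 1 /* debug: wipe blocks on free() */
#else
#define configHEAP_CLEAR_MEMORY_ON_FREE 0 /* release: skip the wipe */
#endif

/* Minimum expected idle time, in ticks, before tickless idle is entered
 * (previously 4 according to the description above). */
#define configEXPECTED_IDLE_TIME_BEFORE_SLEEP 2
```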

Bugfix:
11. Fix DMA timeout race in furi_hal_spi_bus_trx_dma() - on timeout the cleanup released spi_dma_completed while LL_DMA_DisableIT_TC was issued after. A late or pending DMA completion ISR would then call furi_semaphore_release() on an already-full binary semaphore and crash furi_check. Disabled the TC IRQ and cleared the pending TC flag before releasing the semaphore (spi_dma_isr gates its release on LL_DMA_IsEnabledIT_TC). Applied to both the TX-only path touched by #6 (b51e744) and the symmetric pre-existing TRX/RX path (f48d096) (furi_hal_spi.c).
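The ordering problem in item 11 can be modeled on a host machine without any hardware. The sketch below is a simplified model, not the code in furi_hal_spi.c: the semaphore records an overflow instead of crashing, the "ISR" is an ordinary function gated on an enable flag, and all `Sketch`/`sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Model of a binary semaphore: releasing it while full is the error
 * that would trip a check in the firmware. */
typedef struct {
    int count;     /* 0 = empty, 1 = full */
    bool overflow; /* set when released while already full */
} SketchBinarySemaphore;

static void sketch_semaphore_release(SketchBinarySemaphore* semaphore) {
    if(semaphore->count == 1) {
        semaphore->overflow = true;
        return;
    }
    semaphore->count = 1;
}

typedef struct {
    bool tc_irq_enabled;  /* transfer-complete interrupt enable */
    bool tc_flag_pending; /* transfer-complete flag latched by hardware */
    SketchBinarySemaphore completed;
} SketchDma;

/* The ISR releases the semaphore only while the TC interrupt is enabled. */
static void sketch_dma_isr(SketchDma* dma) {
    if(dma->tc_irq_enabled && dma->tc_flag_pending) {
        dma->tc_flag_pending = false;
        sketch_semaphore_release(&dma->completed);
    }
}

/* Old order: release first, disable the interrupt afterwards. */
static bool sketch_timeout_cleanup_old(SketchDma* dma) {
    sketch_semaphore_release(&dma->completed);
    sketch_dma_isr(dma); /* a late completion lands in this window */
    dma->tc_irq_enabled = false;
    return dma->completed.overflow;
}

/* New order: disable the interrupt and clear the flag, then release. */
static bool sketch_timeout_cleanup_new(SketchDma* dma) {
    dma->tc_irq_enabled = false;
    dma->tc_flag_pending = false;
    sketch_semaphore_release(&dma->completed);
    sketch_dma_isr(dma); /* a late completion is now ignored */
    return dma->completed.overflow;
}

static void sketch_demo_race(void) {
    SketchDma old_order = {true, true, {0, false}};
    SketchDma new_order = {true, true, {0, false}};
    assert(sketch_timeout_cleanup_old(&old_order));  /* double release */
    assert(!sketch_timeout_cleanup_new(&new_order)); /* single release */
}
```

In the model, the old order ends with a double release and the new order with exactly one.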

Verification

  • Build: ./fbt COMPACT=1 DEBUG=0 — compiles without warnings/errors
  • Build: ./fbt DEBUG=1 — debug build still compiles and links
  • Boot device, navigate Settings → About — FW version shown correctly
  • Open SubGHz, NFC, IR apps — UI responsive, no hangs
  • Verify formatted strings render correctly in UI (menus, popups, dialogs)
  • SPI peripherals working: display renders, Sub-GHz TX/RX functional
  • Leave device idle 30s — confirm no increased battery drain or wake-up issues
  • Stress test: rapid app switching, menu scrolling — no crashes or visual artifacts
  • Verify calloc-dependent code (e.g. allocating zero-initialized buffers) works correctly in release build

Checklist (For Reviewer)

  • PR has description of feature/bug or link to Confluence/Jira task
  • Description contains actions to verify feature/bugfix
  • I've built this code, uploaded it to the device and verified feature/bugfix

AlZh-Mex and others added 2 commits March 17, 2026 10:32
…MHz)

Ported from WonderMr/unleashed-firmware feat/opus-optimised and
feat/cortex-m4-micro-optimizations branches.

1. Compiler: -Og → -Os for release builds (firmwareopts.scons)
   Enables -O2-level passes: inlining, dead code elimination, loop
   optimization, register allocation, tail call optimization, and CSE.

2. Disable heap memset on free() in release (FreeRTOSConfig.h)
   configHEAP_CLEAR_MEMORY_ON_FREE now conditional on FURI_DEBUG.
   Saves ~500+ memset calls/sec during active GUI/protocol work.

3. Fix calloc() to explicitly zero memory (memmgr.c)
   With optimization #2 disabling heap-clear in release, calloc()
   must memset(0) explicitly to guarantee zero-initialized returns.

4. Fix realloc() to copy min(old_size, new_size) bytes (memmgr.c,
   memmgr_heap.c/h, api_symbols.csv)
   Added memmgr_heap_get_block_size() to read usable size from
   Heap_4 BlockLink_t header. Also added NULL-guard on pvPortMalloc
   result to preserve original allocation on OOM.

5. Branch prediction hints on furi_check/assert/break (check.h)
   Added __builtin_expect(!(__e), 0) to all assertion macros. Crash
   code moves to end of function, hot path becomes fall-through.
   Affects ~2300+ call sites across the firmware.

6. SPI TX via DMA with RX drain (furi_hal_spi.c)
   furi_hal_spi_bus_tx() now delegates to DMA when scheduler is
   running, freeing CPU during display updates and radio TX. RX DMA
   channel drains into dummy byte to prevent OVR accumulation.

7. __attribute__((flatten)) on furi_get_tick() (kernel.c)
   Forces inlining of FreeRTOS wrappers at call sites, eliminating
   function call overhead on this very hot path.

8. __attribute__((flatten)) on hot thread functions (thread.c)
   Applied to furi_thread_get_current_id(), furi_thread_get_current(),
   and furi_thread_flags_get().

9. In-place vprintf for furi_string_cat_vprintf() (string.c)
   Formats directly into destination buffer at current offset instead
   of allocating a temporary FuriString. Eliminates malloc+format+
   memcpy+free per call.

10. Reduce configEXPECTED_IDLE_TIME_BEFORE_SLEEP 4 → 2 (FreeRTOSConfig.h)
    Allows FreeRTOS tickless idle to enter STOP mode more aggressively
    (2ms threshold instead of 4ms). Reduces average power consumption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
perf: 10 safe performance optimizations for STM32WB55 (Cortex-M4 @ 64MHz)
@WillyJL
Contributor

WillyJL commented May 4, 2026

> 3. Fix calloc() to explicitly zero memory: with heap-clear disabled in release, calloc() now does its own memset(0) to guarantee zero-initialized returns (memmgr.c)

this is already done in pvPortMalloc(). and before you remove that from pvPortMalloc(), keep in mind a lot of flipper code (even and especially in official firmware) already relies on this behavior from malloc().

other changes sound interesting 👀

pvPortMalloc() in furi/core/memmgr_heap.c already memsets the returned
buffer to zero (xToWipe = xWantedSize, line 467) regardless of
configHEAP_CLEAR_MEMORY_ON_FREE. Calling memset() again in calloc()
was redundant.

Reported by @WillyJL in flipperdevices#4360.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 4, 2026 16:31
@WonderMr
Author

WonderMr commented May 4, 2026

> this is already done in pvPortMalloc().

Agreed. Removed.


Copilot AI left a comment


Pull request overview

This PR targets firmware-level performance and power improvements on STM32WB55 by optimizing compiler settings, heap behavior, hot-path branching, SPI transfers, and some frequently called core helpers.

Changes:

  • Adjust release build optimization level and tweak FreeRTOS idle-sleep threshold.
  • Reduce heap/free overhead and fix realloc() copy sizing by reading heap block usable size (exported via API).
  • Improve hot paths via __builtin_expect, function flattening, SPI TX DMA usage, and in-place vprintf string concatenation.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| targets/f7/inc/FreeRTOSConfig.h | Makes heap clear-on-free conditional on FURI_DEBUG; reduces idle time before tickless sleep. |
| targets/f7/furi_hal/furi_hal_spi.c | Uses DMA for TX when scheduler is running; adds RX-drain DMA in TX-only mode to avoid OVR. |
| targets/f7/api_symbols.csv | Bumps API version and exports memmgr_heap_get_block_size. |
| site_scons/firmwareopts.scons | Switches release CCFLAGS from -Og to -Os. |
| furi/core/thread.c | Applies __attribute__((flatten)) to hot thread helper functions. |
| furi/core/string.c | Implements in-place furi_string_cat_vprintf() using vsnprintf into the destination buffer. |
| furi/core/memmgr.c | Fixes realloc() to copy min(old_size, new_size) and preserve allocation on OOM. |
| furi/core/memmgr_heap.h | Declares the new memmgr_heap_get_block_size() API. |
| furi/core/memmgr_heap.c | Implements memmgr_heap_get_block_size() by reading Heap_4 block headers. |
| furi/core/kernel.c | Applies __attribute__((flatten)) to furi_get_tick(). |
| furi/core/check.h | Adds __builtin_expect to check/assert/break macros to bias the hot path. |


Comment thread targets/f7/furi_hal/furi_hal_spi.c
Comment thread furi/core/memmgr_heap.c
Comment thread furi/core/memmgr.c Outdated
WonderMr and others added 2 commits May 5, 2026 00:08
furi_hal_spi.c (TX-only DMA path):
On timeout the cleanup unconditionally released spi_dma_completed
while LL_DMA_DisableIT_TC was issued *after*. A late or pending DMA
completion ISR would then call furi_semaphore_release() on an already
full binary semaphore and crash furi_check. Disable TC IRQ and clear
the pending TC flag before releasing the semaphore so the ISR cannot
double-release.

memmgr_heap.c (memmgr_heap_get_block_size):
Add heapVALIDATE_BLOCK_POINTER(pxLink) to match vPortFree(). Without
it a caller passing an invalid pointer through this public API would
read out of bounds before the configASSERT fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The TRX/RX branch (else of furi_hal_spi_bus_trx_dma) had the same
race as the TX-only path fixed in b51e744: on timeout the cleanup
released spi_dma_completed before calling LL_DMA_DisableIT_TC,
so a late or pending DMA completion ISR would call furi_semaphore_release()
on an already-full binary semaphore and crash furi_check.

Pre-existing bug, not introduced by this PR — fixed for symmetry with
the TX-only path now that the pattern is documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.



Comment thread furi/core/string.c
Comment thread furi/core/memmgr.c Outdated
Comment thread furi/core/string.c
furi/core/string.c (furi_string_cat_vprintf):
The retry condition used >=, which fired one extra vsnprintf when the
formatted output fit exactly into the reserved capacity (NUL byte
included). vsnprintf only truncates when the formatted length plus the
NUL byte exceeds the buffer size; change the predicate to match.
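The boundary case can be checked on a host machine. The helper below is a hypothetical illustration (not the code in string.c) of the truncation rule for snprintf/vsnprintf: the return value is the length the full output would have, excluding the NUL, so output is truncated exactly when that length plus one exceeds the buffer size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: format text into a buffer of buffer_size bytes
 * and report whether the output had to be truncated. */
static bool sketch_was_truncated(size_t buffer_size, const char* text) {
    char buffer[32];
    assert(buffer_size > 0 && buffer_size <= sizeof(buffer));

    int written = snprintf(buffer, buffer_size, "%s", text);
    assert(written >= 0);

    /* Truncated only if the text plus its NUL byte does not fit. */
    return (size_t)written + 1 > buffer_size;
}
```

With the 5-character string "hello", a 6-byte buffer is an exact fit and needs no retry, while a 5-byte buffer truncates.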

furi/core/memmgr.c (realloc):
Drop the unreachable NULL-guard around the copy/free. pvPortMalloc()
calls furi_check(pvReturn, ...) on OOM (memmgr_heap.c:466) and crashes
before returning, so p cannot be NULL after the call. The guard was
dead code; the "preserve allocation on OOM" behavior advertised in
the original commit message never actually triggered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

furi/core/memmgr.c:35

  • The PR description mentions adding a NULL-guard to preserve the original allocation on OOM in realloc(), but the implementation here still unconditionally uses the result of pvPortMalloc(size) (which furi_checks on failure) and will not preserve ptr on OOM. Please either update the PR description to match the current behavior, or adjust the implementation if standard realloc-on-OOM semantics are intended.
void* realloc(void* ptr, size_t size) {
    if(size == 0) {
        vPortFree(ptr);
        return NULL;
    }

    void* p = pvPortMalloc(size);
    if(ptr != NULL) {
        size_t old_size = memmgr_heap_get_block_size(ptr);
        size_t copy_size = old_size < size ? old_size : size;
        memcpy(p, ptr, copy_size);
        vPortFree(ptr);
    }

    return p;
}


Comment thread targets/f7/furi_hal/furi_hal_spi.c Outdated
The DMA path only reads from tx_buffer; nothing inside writes through
the pointer. Mirrors the signature of furi_hal_spi_bus_trx() which
already takes const uint8_t* tx_buffer.

Drops the (uint8_t*) cast that furi_hal_spi_bus_tx() needed to call
furi_hal_spi_bus_trx_dma() with its own const uint8_t* buffer
parameter, and turns the (uint8_t*)&dma_dummy_u32 cast (the dummy
buffer is itself const uint32_t) into a properly const-preserving
(const uint8_t*) cast.

api_symbols.csv updated to match. Existing in-tree callers
(furi_hal_sd.c) pass non-const pointers and continue to compile
without changes; out-of-tree callers passing const pointers no
longer need to drop qualifiers.

Reported by Copilot review on flipperdevices#4360.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.



Comment thread targets/f7/furi_hal/furi_hal_spi.c Outdated
Comment thread targets/f7/inc/FreeRTOSConfig.h
Comment thread furi/core/memmgr_heap.h Outdated
Comment thread targets/f7/furi_hal/furi_hal_spi.c
WonderMr and others added 2 commits May 5, 2026 10:02
furi_hal_spi.c (TX-only and TRX/RX DMA paths, setup and cleanup):
The TX channel TC flag (TC7) is set on transfer completion but its
interrupt is not enabled or handled, so the flag was left latched.
Cleared TC7 alongside the existing TC6 (RX) clear so the SPI DMA
state is clean before/after each transfer, matching the pattern
used by other DMA users in the codebase. Wrapped both clears in a
single combined #if to keep the existing channel-mismatch guard.

FreeRTOSConfig.h:
Added a brief comment next to configHEAP_CLEAR_MEMORY_ON_FREE
documenting the rationale for disabling wipe-on-free in release:
pvPortMalloc() already zeros every allocated buffer (memmgr_heap.c
xToWipe), so the next allocation cannot see stale data. The narrow
exposure window between free() and the next reuse is acceptable
under Flipper's threat model; code holding secrets is expected to
zero its buffers explicitly before free().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The function only reads the block header before pv (xBlockSize) and
does not modify either the block header or the pointed-to allocation.
Switched the public API to const void* to match intent and to let
callers pass const pointers without dropping qualifiers.

Drop-in compatible: existing in-tree caller (memmgr.c realloc) passes
a non-const void*, which converts implicitly.

Reported by Copilot review on flipperdevices#4360.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@WonderMr WonderMr requested a review from Copilot May 5, 2026 06:24

Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated no new comments.


