Performance optimizations #4360
…MHz)

Ported from WonderMr/unleashed-firmware feat/opus-optimised and feat/cortex-m4-micro-optimizations branches.

1. Compiler: -Og → -Os for release builds (firmwareopts.scons)
   Enables -O2-level passes: inlining, dead code elimination, loop optimization, register allocation, tail call optimization, and CSE.

2. Disable heap memset on free() in release (FreeRTOSConfig.h)
   configHEAP_CLEAR_MEMORY_ON_FREE now conditional on FURI_DEBUG. Saves ~500+ memset calls/sec during active GUI/protocol work.

3. Fix calloc() to explicitly zero memory (memmgr.c)
   With optimization #2 disabling heap-clear in release, calloc() must memset(0) explicitly to guarantee zero-initialized returns.

4. Fix realloc() to copy min(old_size, new_size) bytes (memmgr.c, memmgr_heap.c/h, api_symbols.csv)
   Added memmgr_heap_get_block_size() to read usable size from Heap_4 BlockLink_t header. Also added NULL-guard on pvPortMalloc result to preserve original allocation on OOM.

5. Branch prediction hints on furi_check/assert/break (check.h)
   Added __builtin_expect(!(__e), 0) to all assertion macros. Crash code moves to end of function, hot path becomes fall-through. Affects ~2300+ call sites across the firmware.

6. SPI TX via DMA with RX drain (furi_hal_spi.c)
   furi_hal_spi_bus_tx() now delegates to DMA when scheduler is running, freeing CPU during display updates and radio TX. RX DMA channel drains into dummy byte to prevent OVR accumulation.

7. __attribute__((flatten)) on furi_get_tick() (kernel.c)
   Forces inlining of FreeRTOS wrappers at call sites, eliminating function call overhead on this very hot path.

8. __attribute__((flatten)) on hot thread functions (thread.c)
   Applied to furi_thread_get_current_id(), furi_thread_get_current(), and furi_thread_flags_get().

9. In-place vprintf for furi_string_cat_vprintf() (string.c)
   Formats directly into destination buffer at current offset instead of allocating a temporary FuriString. Eliminates malloc + format + memcpy + free per call.

10. Reduce configEXPECTED_IDLE_TIME_BEFORE_SLEEP 4 → 2 (FreeRTOSConfig.h)
    Allows FreeRTOS tickless idle to enter STOP mode more aggressively (2ms threshold instead of 4ms). Reduces average power consumption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
perf: 10 safe performance optimizations for STM32WB55 (Cortex-M4 @ 64MHz)
this is already done in pvPortMalloc(). and before you remove that from pvPortMalloc(), keep in mind a lot of flipper code (even and especially in official firmware) already relies on this behavior from malloc(). other changes sound interesting 👀
pvPortMalloc() in furi/core/memmgr_heap.c already memsets the returned buffer to zero (xToWipe = xWantedSize, line 467) regardless of configHEAP_CLEAR_MEMORY_ON_FREE. Calling memset() again in calloc() was redundant.

Reported by @WillyJL in flipperdevices#4360.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Agreed. Removed.
Pull request overview
This PR targets firmware-level performance and power improvements on STM32WB55 by optimizing compiler settings, heap behavior, hot-path branching, SPI transfers, and some frequently called core helpers.
Changes:
- Adjust release build optimization level and tweak FreeRTOS idle-sleep threshold.
- Reduce heap/free overhead and fix `realloc()` copy sizing by reading heap block usable size (exported via API).
- Improve hot paths via `__builtin_expect`, function flattening, SPI TX DMA usage, and in-place `vprintf` string concatenation.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| targets/f7/inc/FreeRTOSConfig.h | Makes heap clear-on-free conditional on FURI_DEBUG; reduces idle time before tickless sleep. |
| targets/f7/furi_hal/furi_hal_spi.c | Uses DMA for TX when scheduler is running; adds RX-drain DMA in TX-only mode to avoid OVR. |
| targets/f7/api_symbols.csv | Bumps API version and exports memmgr_heap_get_block_size. |
| site_scons/firmwareopts.scons | Switches release CCFLAGS from -Og to -Os. |
| furi/core/thread.c | Applies `__attribute__((flatten))` to hot thread helper functions. |
| furi/core/string.c | Implements in-place furi_string_cat_vprintf() using vsnprintf into the destination buffer. |
| furi/core/memmgr.c | Fixes realloc() to copy min(old_size, new_size) and preserve allocation on OOM. |
| furi/core/memmgr_heap.h | Declares the new memmgr_heap_get_block_size() API. |
| furi/core/memmgr_heap.c | Implements memmgr_heap_get_block_size() by reading Heap_4 block headers. |
| furi/core/kernel.c | Applies `__attribute__((flatten))` to furi_get_tick(). |
| furi/core/check.h | Adds `__builtin_expect` to check/assert/break macros to bias the hot path. |
furi_hal_spi.c (TX-only DMA path): On timeout, the cleanup unconditionally released spi_dma_completed first and only afterwards called LL_DMA_DisableIT_TC. A late or pending DMA completion ISR would then call furi_semaphore_release() on an already-full binary semaphore and crash furi_check. Disable TC IRQ and clear the pending TC flag before releasing the semaphore so the ISR cannot double-release.

memmgr_heap.c (memmgr_heap_get_block_size): Add heapVALIDATE_BLOCK_POINTER(pxLink) to match vPortFree(). Without it, a caller passing an invalid pointer through this public API would read out of bounds before the configASSERT fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The TRX/RX branch (else of furi_hal_spi_bus_trx_dma) had the same race as the TX-only path fixed in b51e744: on timeout the cleanup released spi_dma_completed before calling LL_DMA_DisableIT_TC, so a late or pending DMA completion ISR would call furi_semaphore_release() on an already-full binary semaphore and crash furi_check. This is a pre-existing bug, not introduced by this PR; it is fixed for symmetry with the TX-only path now that the pattern is documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
furi/core/string.c (furi_string_cat_vprintf): The retry condition used >=, which triggered one extra vsnprintf call when the formatted output fit exactly into the reserved capacity (NUL byte included). vsnprintf only truncates when the formatted length plus the NUL byte exceeds the buffer size (size + 1 > buffer); change the predicate to match.

furi/core/memmgr.c (realloc): Drop the unreachable NULL-guard around the copy/free. pvPortMalloc() calls furi_check(pvReturn, ...) on OOM (memmgr_heap.c:466) and crashes before returning, so p cannot be NULL after the call. The guard was dead code; the "preserve allocation on OOM" behavior advertised in the original commit message never actually triggered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
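The retry predicate described above can be sketched on a toy growable string. This is not the firmware's `furi_string_cat_vprintf()`: `ToyString` and its helpers are hypothetical, and the `format_calls` counter exists only to make the "exact fit needs one call" property observable. The sketch relies on the standard behavior that `vsnprintf` returns the length it wanted to write, excluding the NUL byte.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical growable string standing in for FuriString. */
typedef struct {
    char* data;            /* NUL-terminated contents */
    size_t len;            /* length without the NUL byte */
    size_t cap;            /* allocated bytes, NUL byte included */
    unsigned format_calls; /* instrumentation: vsnprintf invocations */
} ToyString;

static void toy_string_init(ToyString* s, size_t cap) {
    assert(cap > 0);
    s->data = malloc(cap);
    assert(s->data != NULL);
    s->data[0] = '\0';
    s->len = 0;
    s->cap = cap;
    s->format_calls = 0;
}

static void toy_string_free(ToyString* s) {
    free(s->data);
    s->data = NULL;
}

/* Append formatted text in place; reformat only if it was truncated. */
static int toy_string_cat_vprintf(ToyString* s, const char* fmt, va_list args) {
    size_t avail = s->cap - s->len; /* always >= 1 (room for NUL) */
    va_list copy;
    va_copy(copy, args);
    int n = vsnprintf(s->data + s->len, avail, fmt, copy);
    va_end(copy);
    s->format_calls++;

    /* Truncated only if text plus NUL does not fit: n + 1 > avail. */
    if(n >= 0 && (size_t)n + 1 > avail) {
        size_t new_cap = s->len + (size_t)n + 1;
        char* grown = realloc(s->data, new_cap);
        assert(grown != NULL);
        s->data = grown;
        s->cap = new_cap;
        va_copy(copy, args);
        n = vsnprintf(s->data + s->len, s->cap - s->len, fmt, copy);
        va_end(copy);
        s->format_calls++;
    }
    if(n < 0) {
        s->data[s->len] = '\0'; /* formatting error: keep old contents */
        return n;
    }
    s->len += (size_t)n;
    return n;
}

static int toy_string_cat_printf(ToyString* s, const char* fmt, ...) {
    va_list args;
    va_start(args, fmt);
    int n = toy_string_cat_vprintf(s, fmt, args);
    va_end(args);
    return n;
}
```

With a capacity of 6, appending "hello" fits exactly (5 characters plus NUL) and takes a single `vsnprintf` call; a `>=` predicate would have grown the buffer and formatted a second time.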
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
furi/core/memmgr.c:35
- The PR description mentions adding a NULL-guard to preserve the original allocation on OOM in `realloc()`, but the implementation here still unconditionally uses the result of `pvPortMalloc(size)` (which `furi_check`s on failure) and will not preserve `ptr` on OOM. Please either update the PR description to match the current behavior, or adjust the implementation if standard `realloc`-on-OOM semantics are intended.
```c
void* realloc(void* ptr, size_t size) {
    if(size == 0) {
        vPortFree(ptr);
        return NULL;
    }
    /* pvPortMalloc() furi_checks on OOM, so p is never NULL here */
    void* p = pvPortMalloc(size);
    if(ptr != NULL) {
        /* Copy only what both the old and the new block can hold */
        size_t old_size = memmgr_heap_get_block_size(ptr);
        size_t copy_size = old_size < size ? old_size : size;
        memcpy(p, ptr, copy_size);
        vPortFree(ptr);
    }
    return p;
}
```
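To show why the copy must be bounded by the old block's size, here is a self-contained host-side sketch of the same pattern. It is an illustration under assumptions, not the Heap_4 code: `ToyBlockHeader` is a hypothetical header standing in for `BlockLink_t`, and `toy_malloc()` simply aborts on OOM and zeroes the buffer to mirror the behavior described in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-block header; the union keeps user data aligned. */
typedef union {
    size_t usable_size;
    max_align_t align;
} ToyBlockHeader;

/* Allocates a zeroed block and records its usable size in front of it. */
static void* toy_malloc(size_t size) {
    ToyBlockHeader* header = malloc(sizeof(ToyBlockHeader) + size);
    if(header == NULL) abort(); /* stand-in for furi_check on OOM */
    header->usable_size = size;
    void* user = header + 1;
    memset(user, 0, size);
    return user;
}

static void toy_free(void* ptr) {
    if(ptr != NULL) free((ToyBlockHeader*)ptr - 1);
}

/* Read-only lookup, so the parameter can be const-qualified. */
static size_t toy_block_size(const void* ptr) {
    assert(ptr != NULL);
    return ((const ToyBlockHeader*)ptr - 1)->usable_size;
}

static void* toy_realloc(void* ptr, size_t size) {
    if(size == 0) {
        toy_free(ptr);
        return NULL;
    }
    void* p = toy_malloc(size);
    if(ptr != NULL) {
        size_t old_size = toy_block_size(ptr);
        /* Growing: copy old_size bytes. Shrinking: copy size bytes. */
        size_t copy_size = old_size < size ? old_size : size;
        memcpy(p, ptr, copy_size);
        toy_free(ptr);
    }
    return p;
}
```

Copying `size` bytes unconditionally, as the original code reportedly did, would read past the end of the old block whenever the allocation grows.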
The DMA path only reads from tx_buffer; nothing inside writes through the pointer. Mirrors the signature of furi_hal_spi_bus_trx() which already takes const uint8_t* tx_buffer. Drops the (uint8_t*) cast that furi_hal_spi_bus_tx() needed to call furi_hal_spi_bus_trx_dma() with its own const uint8_t* buffer parameter, and turns the (uint8_t*)&dma_dummy_u32 cast (the dummy buffer is itself const uint32_t) into a properly const-preserving (const uint8_t*) cast. api_symbols.csv updated to match. Existing in-tree callers (furi_hal_sd.c) pass non-const pointers and continue to compile without changes; out-of-tree callers passing const pointers no longer need to drop qualifiers. Reported by Copilot review on flipperdevices#4360. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
furi_hal_spi.c (TX-only and TRX/RX DMA paths, setup and cleanup): The TX channel TC flag (TC7) is set on transfer completion but its interrupt is not enabled or handled, so the flag was left latched. Cleared TC7 alongside the existing TC6 (RX) clear so the SPI DMA state is clean before/after each transfer, matching the pattern used by other DMA users in the codebase. Wrapped both clears in a single combined #if to keep the existing channel-mismatch guard.

FreeRTOSConfig.h: Added a brief comment next to configHEAP_CLEAR_MEMORY_ON_FREE documenting the rationale for disabling wipe-on-free in release: pvPortMalloc() already zeros every allocated buffer (memmgr_heap.c xToWipe), so the next allocation cannot see stale data. The narrow exposure window between free() and the next reuse is acceptable under Flipper's threat model; code holding secrets is expected to zero its buffers explicitly before free().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
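The FreeRTOSConfig.h rationale above could look roughly like the fragment below. The exact preprocessor form used in the PR is an assumption (the source only says the setting is conditional on FURI_DEBUG), and `toy_secure_wipe()` is a hypothetical helper illustrating the "zero secrets explicitly before free()" expectation, not an existing firmware function.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed shape of the config switch: wipe-on-free only in debug builds. */
#ifdef FURI_DEBUG
#define configHEAP_CLEAR_MEMORY_ON_FREE 1
#else
/* Release: allocation already zeroes buffers, so wipe-on-free is skipped. */
#define configHEAP_CLEAR_MEMORY_ON_FREE 0
#endif

/* Hypothetical helper: zero a buffer holding secrets before freeing it.
 * Writing through a volatile pointer keeps the compiler from dropping
 * the stores as dead code. */
static void toy_secure_wipe(void* buf, size_t len) {
    assert(buf != NULL || len == 0);
    volatile unsigned char* p = buf;
    while(len--) {
        *p++ = 0;
    }
}
```

A caller holding key material would call `toy_secure_wipe(key, key_len)` immediately before `free(key)`, so the data is gone regardless of the heap configuration.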
The function only reads the block header before pv (xBlockSize) and does not modify either the block header or the pointed-to allocation. Switched the public API to const void* to match intent and to let callers pass const pointers without dropping qualifiers. Drop-in compatible: existing in-tree caller (memmgr.c realloc) passes a non-const void*, which converts implicitly. Reported by Copilot review on flipperdevices#4360. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated no new comments.
What's new
10 safe performance optimizations for STM32WB55 (Cortex-M4 @ 64MHz) + 1 Bugfix
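Compiler & build:
1. -Og → -Os for release builds - enables -O2-level passes: inlining, dead code elimination, loop optimization, register allocation, tail call optimization, and CSE (firmwareopts.scons)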
Memory management (correctness + perf):
2. Disable heap memset on free() in release — configHEAP_CLEAR_MEMORY_ON_FREE is now conditional on FURI_DEBUG. Saves ~500+ memset calls/sec during active GUI/protocol work (FreeRTOSConfig.h)
3. Fix calloc() to explicitly zero memory - with heap-clear disabled in release, calloc() now does its own memset(0) to guarantee zero-initialized returns (memmgr.c)
4. Fix realloc() to copy min(old_size, new_size) bytes - the original copied size (new) bytes, reading past the old allocation when growing. Added memmgr_heap_get_block_size(), which reads the usable size from the Heap_4 BlockLink_t header. Also added heapVALIDATE_BLOCK_POINTER to the new public memmgr_heap_get_block_size() for parity with vPortFree().
Branch prediction:
5. __builtin_expect hints on furi_check/assert/break — crash code moves to end of function, hot path becomes fall-through (0 pipeline penalty on Cortex-M4 3-stage pipeline). Affects ~2300+ call sites across the firmware (check.h)
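A minimal sketch of the hinted-check pattern from item 5, assuming a GCC or Clang toolchain (`__builtin_expect` is a compiler builtin, not standard C). The `toy_check` macro and `toy_on_check_failed()` handler are hypothetical; the real macros in check.h crash the firmware, whereas this handler only counts failures so the example stays testable on a host.

```c
#include <assert.h>
#include <stddef.h>

static unsigned toy_check_failures = 0;

/* Hypothetical cold-path handler; the firmware's handler never returns. */
static void toy_on_check_failed(const char* expr) {
    assert(expr != NULL);
    toy_check_failures++;
}

/* Tell the compiler the failure branch is unlikely, so the passing case
 * can be laid out as straight-line fall-through code. */
#define toy_check(expr)                     \
    do {                                    \
        if(__builtin_expect(!(expr), 0)) {  \
            toy_on_check_failed(#expr);     \
        }                                   \
    } while(0)

static int toy_checked_div(int a, int b) {
    toy_check(b != 0);
    return b != 0 ? a / b : 0;
}
```

The hint changes only code layout, never the result: `toy_checked_div(6, 3)` returns 2 without touching the failure path, and a zero divisor still reaches the handler.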
DMA:
6. SPI TX via DMA with RX drain — furi_hal_spi_bus_tx() now delegates to DMA when scheduler is running, freeing the CPU during display updates (~1KB/frame @ 20fps) and radio TX. TX-only path sets up RX DMA channel draining into a dummy byte to prevent OVR accumulation. Polling fallback preserved for pre-scheduler context (furi_hal_spi.c)
Also fixes a pre-existing race in the cleanup path: LL_DMA_DisableIT_TC was issued after furi_semaphore_release, allowing a late ISR to crash furi_check on a double-release. The cleanup now disables the TC IRQ and clears the TC flag before releasing the semaphore.
Hot function inlining:
7. `__attribute__((flatten))` on furi_get_tick() - forces inlining of FreeRTOS wrappers at call sites (kernel.c)
8. `__attribute__((flatten))` on hot thread functions - applied to furi_thread_get_current_id(), furi_thread_get_current(), furi_thread_flags_get() (thread.c)
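As a rough illustration of items 7 and 8, the sketch below applies the attribute to a function that sits on top of two wrapper layers. All `toy_*` names are hypothetical stand-ins for the FreeRTOS wrappers. Per the GCC documentation, `flatten` asks the compiler to inline, where possible, every call made inside the annotated function; this requires the callee bodies to be visible (same translation unit or LTO) and assumes a GCC or Clang toolchain.

```c
#include <assert.h> /* used only by the usage example at the bottom */
#include <stdint.h>

static volatile uint32_t toy_tick_counter = 0;

/* Two thin wrapper layers, mimicking a port layer under a kernel API. */
static uint32_t toy_port_read_counter(void) {
    return toy_tick_counter;
}

static uint32_t toy_kernel_get_tick_count(void) {
    return toy_port_read_counter();
}

/* flatten: calls made inside this function are inlined into it, so a
 * caller pays for one function call instead of three nested ones. */
__attribute__((flatten)) uint32_t toy_get_tick(void) {
    return toy_kernel_get_tick_count();
}

static void toy_tick_advance(uint32_t ticks) {
    toy_tick_counter += ticks;
}

/* Usage: toy_tick_advance(42); assert(toy_get_tick() == 42); */
```

The attribute affects code generation only; the returned value is the same with or without it, which is why this kind of change is easy to verify functionally but needs profiling to confirm the speedup.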
String formatting:
9. In-place vprintf for furi_string_cat_vprintf() — formats directly into destination buffer at current offset, growing only if needed. Eliminates temporary FuriString allocation (malloc + format + memcpy + free) per call (string.c)
Power:
10. Reduce configEXPECTED_IDLE_TIME_BEFORE_SLEEP from 4 to 2 ticks — allows FreeRTOS tickless idle to enter STOP mode more aggressively (2ms threshold instead of 4ms). Reduces average power consumption (FreeRTOSConfig.h)
Bugfix:
11. Fix DMA timeout race in furi_hal_spi_bus_trx_dma() - on timeout the cleanup released `spi_dma_completed` while `LL_DMA_DisableIT_TC` was issued after. A late or pending DMA completion ISR would then call `furi_semaphore_release()` on an already-full binary semaphore and crash `furi_check`. Disabled the TC IRQ and cleared the pending TC flag before releasing the semaphore (`spi_dma_isr` gates its release on `LL_DMA_IsEnabledIT_TC`). Applied to both the TX-only path touched by #6 (b51e744) and the symmetric pre-existing TRX/RX path (f48d096) (furi_hal_spi.c).

Verification
Checklist (For Reviewer)