
Add direct HuggingFace safetensors loader for Gemma 4 (E2B/E4B) #919

Open
ssfdre38 wants to merge 8 commits into google:main from ssfdre38:safetensors-gemma4-loader

Conversation

@ssfdre38

Summary

Adds a new code path to load Gemma 4 weights directly from HuggingFace *.safetensors files, bypassing the BlobStore conversion step. This avoids any potential weight precision loss from format conversion and lets you use freshly downloaded HF checkpoints without a separate conversion tool.

Changes

New files

  • io/safetensors.h / io/safetensors.cc: SafetensorsIndex class. Scans a directory for *.safetensors shards, parses the 8-byte LE header + JSON, and provides ReadTensor() via seek-based I/O. Handles both single-file and sharded (model.safetensors.index.json) checkpoints.
  • gemma/load_safetensors.cc: WeightsPtrs::LoadFromSafetensors(). Maps HF tensor names to gemma.cpp MatPtr fields. Handles Q/K/V concat, gate/up-proj concat, o_proj direct load, and per-layer token embedding transpose ([V, L*D] → [L*D, V]). Calls Fixup() at the end.
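
For reviewers unfamiliar with the container format, here is a minimal sketch of the header step described above. It is not the code from io/safetensors.cc; DecodeHeaderLen, SplitHeader, ParsedHeader, and AbsoluteOffset are hypothetical names, and the file is assumed to be fully in memory for simplicity. It relies only on the published safetensors layout: an 8-byte little-endian length N, then N bytes of JSON, then the tensor data, with each tensor's data_offsets relative to the start of that data region.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes the 8-byte little-endian header length that starts a .safetensors
// file. Decoding byte by byte keeps the result independent of host endianness.
inline uint64_t DecodeHeaderLen(const uint8_t* bytes) {
  uint64_t n = 0;
  for (int i = 0; i < 8; ++i) {
    n |= static_cast<uint64_t>(bytes[i]) << (8 * i);
  }
  return n;
}

// Hypothetical result type: the raw JSON header text plus the absolute file
// offset at which tensor data begins.
struct ParsedHeader {
  std::string json;
  uint64_t data_start = 0;
};

// Splits an in-memory file image into header and data start. Returns false
// if the file is too short or the declared header length overruns the file.
inline bool SplitHeader(const std::vector<uint8_t>& file, ParsedHeader* out) {
  assert(out != nullptr);
  if (file.size() < 8) return false;
  const uint64_t len = DecodeHeaderLen(file.data());
  if (len > file.size() - 8) return false;
  out->json.assign(reinterpret_cast<const char*>(file.data()) + 8,
                   static_cast<size_t>(len));
  out->data_start = 8 + len;
  return true;
}

// data_offsets in the JSON header are relative to data_start, so a seek-based
// reader adds the two to get an absolute file position.
inline uint64_t AbsoluteOffset(const ParsedHeader& h, uint64_t rel_begin) {
  return h.data_start + rel_begin;
}
```

A seek-based reader would presumably parse the JSON (the PR links nlohmann_json for that) and then seek to AbsoluteOffset(header, begin) for each tensor rather than loading the whole file.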

Modified files

  • gemma/weights.h — adds LoadFromSafetensors() public declaration
  • gemma/model_store.h / model_store.cc — adds ModelStore(const ModelConfig&, const Path& tokenizer_path) constructor for the safetensors path (reads tokenizer directly from file, leaves scales_ empty)
  • gemma/gemma.h / gemma.cc — adds Gemma(ModelConfig, tokenizer, safetensors_dir, InferenceArgs, ThreadingContext) constructor; changes BlobReader reader_ to unique_ptr<BlobReader> to allow null when not using BlobStore
  • gemma/gemma_args.h — adds --safetensors (directory path) and --model_spec (e.g. gemma4-e4b-bf16-it) flags to LoaderArgs
  • gemma/run.cc — wires new flags into Run() with a conditional branch (uses unique_ptr<Gemma> to avoid copy/move)
  • CMakeLists.txt — adds new source files; links nlohmann_json to libgemma

Usage

./gemma --safetensors /path/to/gemma-4-e4b-it \
        --model_spec   gemma4-e4b-bf16-it \
        --tokenizer    /path/to/tokenizer.model \
        --prompt       "Hello!"

The --model_spec specifier uses the existing ModelConfig(std::string) format, {model-prefix}-{type}-{wrapping}, e.g. gemma4-e2b-bf16-it or gemma4-e4b-bf16-it.

Tested

  • Builds cleanly (MSVC + ninja) with the existing CMake setup
  • --help shows both new flags with descriptions
  • E4B multimodal HF checkpoint (model.language_model.* prefix, 2130 tensors, 42 layers): loads fully, prompt processing begins
  • Both single-shard and multi-shard (*.index.json) layouts supported by SafetensorsIndex

ssfdre38 and others added 8 commits May 21, 2026 10:23
Adds initial support for Gemma 4 in gemma.cpp:

- configs.h: Add GEMMA4_E2B/E4B to Model enum, IsVLM(), per_layer_embd_dim
  field to ModelConfig, fix KVCacheCols() for variable per-layer qkv_dim
- configs.cc: Add ConfigGemma4_E2B() and ConfigGemma4_E4B() with full
  per-layer config building (BuildGemma4LayerConfigs helper)
  - E2B: 35 layers, model_dim=1536, TTTTF SWA pattern, mixed FFN (6144/12288)
  - E4B: 42 layers, model_dim=2560, TTTTTF SWA pattern, uniform FFN (10240)
  - Both: qkv_dim=256 for SWA layers, qkv_dim=512 for full-attention layers
  - SWA window=512 tokens, final_cap=30.0, vocab=262144
- tensor_info.cc: Register per_layer_token_embd.weight tensor for Gemma 4
- weights.h: Add per_layer_input_embedding MatPtr to WeightsPtrs

Architecture notes:
- Gemma 4 has physically distinct SWA and full-attention layers with
  different head dimensions (256 vs 512), requiring per-layer LayerConfig
- per_layer_token_embd enables per-layer embedding injection, shape
  [num_layers * per_layer_embd_dim, vocab_size]
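
The per-layer layout above can be sketched as follows. This is only an illustration, not BuildGemma4LayerConfigs: LayerSketch, BuildLayerSketch, and CountFull are hypothetical names, and it assumes that "T" marks an SWA layer, "F" marks a full-attention layer, and the pattern repeats starting at layer 0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, reduced stand-in for a per-layer config entry.
struct LayerSketch {
  bool full_attention;
  size_t qkv_dim;
};

// Builds the repeating pattern: every `period`-th layer is full attention
// (qkv_dim 512), all others are SWA (qkv_dim 256).
// Assumed periods: 5 for TTTTF (E2B), 6 for TTTTTF (E4B).
inline std::vector<LayerSketch> BuildLayerSketch(size_t num_layers,
                                                 size_t period) {
  assert(period > 0);
  std::vector<LayerSketch> layers;
  layers.reserve(num_layers);
  for (size_t i = 0; i < num_layers; ++i) {
    const bool full = (i % period) == period - 1;
    layers.push_back({full, full ? size_t{512} : size_t{256}});
  }
  return layers;
}

// Counts the full-attention layers in a sketch.
inline size_t CountFull(const std::vector<LayerSketch>& layers) {
  size_t n = 0;
  for (const LayerSketch& l : layers) n += l.full_attention ? 1 : 0;
  return n;
}
```

Under these assumptions, BuildLayerSketch(35, 5) and BuildLayerSketch(42, 6) each contain 7 full-attention layers, which matches the 28+7 and 35+7 split that the config tests in this PR check.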

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…r layer)

Gemma 4 uses two different attention head dimensions depending on layer type:
- SWA layers: qkv_dim=256
- Full-attention layers: qkv_dim=512

The previous code indexed the KV cache as layer_idx * cache_layer_size,
which assumes all layers have the same qkv_dim. For Gemma 4 this is wrong:
layers 0-3 use 512 bytes/head, layer 4 uses 1024 bytes/head, etc., so the
cumulative offset does not equal index × current_size.

Fix: add KVCacheLayerOffset() to ModelConfig that sums CacheLayerSize() for
all preceding layers. For existing uniform models this produces the same
result as before. Update DotSoftmaxWeightedSum() and ComputeQKV() in
attention.cc to use the new method.
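
A minimal sketch of the offset computation, assuming the per-layer cache sizes are available as a plain vector (KVCacheLayerOffsetSketch and layer_sizes are hypothetical; the real method lives on ModelConfig and calls CacheLayerSize()):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the start of layer `layer_idx` in a cache where layers are stored
// back to back and may differ in size: the sum of all preceding layer sizes.
inline size_t KVCacheLayerOffsetSketch(const std::vector<size_t>& layer_sizes,
                                       size_t layer_idx) {
  assert(layer_idx <= layer_sizes.size());
  size_t offset = 0;
  for (size_t i = 0; i < layer_idx; ++i) {
    offset += layer_sizes[i];
  }
  return offset;
}
```

With sizes {512, 512, 512, 512, 1024, 512}, layer 5 starts at 3072, whereas layer_idx * size would give 2560. The loop is linear in the layer index; if this turns out to matter in the attention hot path, a precomputed prefix-sum table would be one option.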

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The vendored sentencepiece_processor.h uses uint32_t without including
<cstdint>, which fails to compile with newer g++ versions (MinGW-w64 on
Windows). Add the missing include to unblock the full gemma target build.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Verifies E2B and E4B model configs against values observed in GGUF metadata:
- Layer counts (35/42), model dims (1536/2560), vocab (262144)
- SWA/full-attention layer distribution (28+7 / 35+7)
- Per-layer qkv_dim (256 for SWA, 512 for full-att)
- Non-uniform FFN dims for E2B (6144 layers 0-14, 12288 layers 15-34)
- KV cache layout correctness via KVCacheLayerOffset()
- Serialize/deserialize round-trip

All tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implements zero-conversion BF16 weight loading from HuggingFace safetensors
directories, bypassing BlobStore to preserve exact weight precision.

New files:
- io/safetensors.h/.cc: SafetensorsIndex class that scans sharded *.safetensors
  files, parses the 8-byte LE header + JSON, builds a unified tensor index with
  random-access reads via File::Read()
- gemma/load_safetensors.cc: WeightsPtrs::LoadFromSafetensors() maps HF tensor
  names to gemma.cpp MatPtrs; handles Q+K+V concat, gate+up concat, o_proj
  direct copy, per_layer_embd transpose [L,V,D]->[L*D,V], and calls Fixup()

Modified files:
- gemma/weights.h: adds public LoadFromSafetensors() declaration
- gemma/model_store.h/.cc: adds ModelStore(ModelConfig&, Path&) ctor for
  BlobStore-free construction (reads tokenizer from file)
- gemma/gemma.h: changes BlobReader reader_ to unique_ptr<BlobReader>;
  adds Gemma(ModelConfig, tokenizer_path, safetensors_dir, ...) constructor
- gemma/gemma.cc: fixes reader_ -> *reader_ refs; adds safetensors constructor
- CMakeLists.txt: adds io/safetensors.cc, gemma/load_safetensors.cc to SOURCES;
  links nlohmann_json::nlohmann_json to libgemma

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When --safetensors <dir> and --model_spec <specifier> are both given,
constructs Gemma via the new safetensors constructor instead of the
BlobStore path. Uses unique_ptr<Gemma> to avoid copy/move issues.
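
The branch can be pictured roughly like this. EngineSketch and MakeEngine are hypothetical stand-ins, not the actual Gemma class or run.cc code; the point is only that a type which cannot be copied or moved can still be chosen at runtime when it is constructed in place behind a unique_ptr.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for an engine type that is neither copyable nor
// movable (declaring the copy operations deleted also suppresses the
// implicit move operations).
class EngineSketch {
 public:
  enum class Source { kBlobStore, kSafetensors };
  explicit EngineSketch(Source source) : source_(source) {}
  EngineSketch(const EngineSketch&) = delete;
  EngineSketch& operator=(const EngineSketch&) = delete;
  Source source() const { return source_; }

 private:
  Source source_;
};

// Picks the safetensors path only when both flags are set, otherwise falls
// back to the BlobStore path. The object is heap-constructed in place, so no
// copy or move of EngineSketch is ever required.
inline std::unique_ptr<EngineSketch> MakeEngine(
    const std::string& safetensors_dir, const std::string& model_spec) {
  if (!safetensors_dir.empty() && !model_spec.empty()) {
    return std::make_unique<EngineSketch>(EngineSketch::Source::kSafetensors);
  }
  return std::make_unique<EngineSketch>(EngineSketch::Source::kBlobStore);
}
```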

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Public Gemma 4 checkpoints (e4b/e2b) on HuggingFace wrap the language
model under model.language_model.*, not model.* directly.  Also the
per-layer token embedding is named embed_tokens_per_layer.weight with
shape [V, L*D] (not [L,V,D]), requiring a simpler matrix transpose.

- LN() prefix: model.layers.N. -> model.language_model.layers.N.
- Global tensors: model.embed_tokens.weight -> model.language_model.*
- LoadPerLayerEmbd: new name + correct [V, L*D] -> [L*D, V] transpose
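
The [V, L*D] -> [L*D, V] step is a plain 2-D transpose. A minimal sketch, assuming row-major storage and treating BF16 values as opaque 16-bit words (TransposeSketch is a hypothetical name, not the code in LoadPerLayerEmbd):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Transposes a row-major [rows, cols] matrix into a row-major [cols, rows]
// matrix. For the per-layer embedding, rows = V and cols = L*D.
inline std::vector<uint16_t> TransposeSketch(const std::vector<uint16_t>& src,
                                             size_t rows, size_t cols) {
  assert(src.size() == rows * cols);
  std::vector<uint16_t> dst(src.size());
  for (size_t r = 0; r < rows; ++r) {
    for (size_t c = 0; c < cols; ++c) {
      dst[c * rows + r] = src[r * cols + c];
    }
  }
  return dst;
}
```

For a tensor of this size (V = 262144 rows), a blocked or multi-threaded transpose would likely be faster than this naive loop; the sketch only pins down the index mapping.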

Tested: 2130 tensors indexed, 42 layers loaded, prompt processing
begins (CPU-only inference is slow for 4B BF16 model).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@google-cla

google-cla Bot commented May 21, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@ssfdre38
Author

I'm already covered by the CLA.
