How to use from the
Use from the
llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="mudler/Step-3.7-Flash-APEX-GGUF",
	filename="",
)
llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Step-3.7-Flash — APEX GGUF quants

APEX (Adaptive Precision for EXpert models) quantizations of stepfun-ai/Step-3.7-Flash, a ~198B-parameter vision-language MoE with ~11B active per token (288 routed experts + 1 shared expert, top-8 routing; 3 dense layers + 42 MoE layers; hidden 4096).

What is APEX?

APEX assigns per-tensor precision along a layer-wise gradient rather than using one quant type for the whole model:

  • Edges high, middle low. The first and last MoE blocks are far more sensitive to quantization noise than the middle of the stack, so APEX keeps the edges at a higher precision (e.g. Q5_K / Q6_K) and progressively drops the middle toward the profile's base type.
  • Dense layers kept at the edge precision. Step-3.7's three leading dense FFN layers (0–2) are not gated and carry every token, so they are quantized at the same precision as the edge MoE layers rather than being lumped in with the deep experts.
  • Shared expert protected. The shared expert sees every token in every MoE layer, so its weights (ffn_*_shexp) are kept at a high precision (Q6_K or Q8_0 depending on profile) across the whole stack, while the 288 routed experts (ffn_*_exps) follow the layer-wise gradient.
  • Attention kept above experts. Attention weights (Q/K/V/O) are held above the expert precision, since they are reused at every token regardless of routing.

The I- variants additionally use an importance matrix (imatrix) computed on a calibration mix — quantization noise is preferentially placed in directions the calibration data shows are least active.

Available files

File Base Profile Size Notes
Step-3.7-Flash-APEX-I-Quality.gguf Q6_K quality + imatrix 123 GB highest fidelity
Step-3.7-Flash-APEX-I-Balanced.gguf Q5_K balanced + imatrix 141 GB recommended
Step-3.7-Flash-APEX-I-Compact.gguf Q4_K compact + imatrix 90 GB best size/quality tradeoff
Step-3.7-Flash-APEX-I-Mini.gguf Q3_K mini + imatrix 73 GB smallest with imatrix
Step-3.7-Flash-APEX-Quality.gguf Q6_K quality 123 GB no-imatrix
Step-3.7-Flash-APEX-Balanced.gguf Q5_K balanced 141 GB no-imatrix
Step-3.7-Flash-APEX-Compact.gguf Q4_K compact 90 GB no-imatrix
mmproj-step3.7-flash-f16.gguf F16 vision tower 4.0 GB mirrored from StepFun, pair with any of the above for VLM use
imatrix.dat 466 MB importance matrix (BF16-derived)

The included imatrix.dat is the same matrix used to produce the I- files above. It is published so you can apply it yourself to any other quantization of the same BF16 weights without re-running calibration.

The mmproj is the vision tower in F16, mirrored as-is from stepfun-ai/Step-3.7-Flash-GGUF. Pair it with any of the language-tower files above (--mmproj mmproj-step3.7-flash-f16.gguf in llama.cpp) to run image+text inference.

How these were built

  • Source weights: stepfun-ai/Step-3.7-Flash-GGUF (BF16) — the upstream BF16 GGUF (9 shards).
  • Importance matrix: computed on the BF16 GGUF via llama-imatrix on a calibration mix and published here as imatrix.dat.
  • Quantization: per-tensor precision targets emitted by APEX, applied via llama-quantize --tensor-type-file (with --imatrix imatrix.dat for the I- variants).

Because the imatrix is keyed by tensor name, it is portable across any GGUF that comes from the same converted BF16 weights — no GPU needed to reproduce these quants from the published BF16 GGUF.

Architecture notes

Step-3.7-Flash's HF config (Step3p7ForConditionalGeneration) is mapped to the existing STEP35 arch at GGUF level: the convert script registers the Step-3.7 architecture and tokenizer hash, while runtime/quantize uses the standard step35 compute graph.

  • Layers: 45 (3 dense FFN + 42 MoE)
  • Hidden: 4096
  • Experts: 288 routed (top-8) + 1 shared
  • Total params: ~198B
  • Active params per token: ~11B
  • Modality: vision-language (mmproj included)

License

Inherits the upstream Apache-2.0 license from stepfun-ai/Step-3.7-Flash.

Downloads last month
-
GGUF
Model size
197B params
Architecture
step35
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for mudler/Step-3.7-Flash-APEX-GGUF

Quantized
(8)
this model

Collection including mudler/Step-3.7-Flash-APEX-GGUF