Model Export

This guide explains how to export Hugging Face Transformer models to RKNN format for deployment on Rockchip NPUs.

Export Workflow

The export process consists of three main steps:

  1. ONNX Export: Convert Hugging Face model to ONNX using Optimum

  2. RKNN Build: Load ONNX and build RKNN model with quantization/optimization

  3. Configuration: Save model configuration for runtime loading
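
Internally, the RKNN build stage is driven by the rknn-toolkit2 Python API. The following is a minimal sketch of roughly what steps 2-3 look like, assuming an ONNX file produced by step 1; it is illustrative only and not the exact rk-transformers implementation:

from rknn.api import RKNN

rknn = RKNN()
# Step 2: configure and build the RKNN model from the ONNX export
rknn.config(target_platform="rk3588", optimization_level=3)
rknn.load_onnx(model="model.onnx")
rknn.build(do_quantization=False)  # float16 build; pass a calibration dataset to quantize
# Step 3: write the deployable artifact
rknn.export_rknn("model.rknn")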

Command-Line Export

Basic Export

rk-transformers-cli export \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --platform rk3588 \
  --optimization-level 3

Key Parameters

Required Arguments

  • -m, --model: Path to an ONNX model file or a Hugging Face model ID.

  • output: Path to the directory or file where the generated RKNN model is saved. Defaults to the parent directory of the model file or the Hugging Face model directory.

Optional Arguments

  • -bs, --batch-size: Batch size for input shapes (default: 1). Example: batch_size=1 → [1, seq_len].

  • -msl, --max-seq-length: Max sequence length for input shapes. Auto-detected from model config if not specified (fallback: 512).

  • --task-kwargs: Task-specific keyword arguments for ONNX export as comma-separated key=value pairs. Example: num_choices=4.

  • --model-inputs: Comma-separated list of model input names (e.g., input_ids,attention_mask). Auto-detected based on the model's type_vocab_size.

  • --platform: Target platform. Choices: rk3588, rk3576, rk3568, rk3566, rk3562 (default: rk3588).
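
For example, to override the auto-detected shapes and input names (the values shown are illustrative):

rk-transformers-cli export \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --platform rk3588 \
  --batch-size 1 \
  --max-seq-length 256 \
  --model-inputs input_ids,attention_mask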

Optimization Arguments

  • -o, --optimization-level: RKNN optimization level (0-3). Default: 0.

  • -fa, --flash-attention: Enable Flash Attention optimization.

  • --compress-weight: Compress model weights to reduce RKNN model size.

  • --single-core-mode: Enable single NPU core mode (only applicable for rk3588). Reduces model size.

  • --enable-custom-kernels: Enable custom kernels (e.g., CumSum) for operations not supported by RKNN.
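
For example, to combine aggressive optimization with weight compression (an illustrative invocation using the flags documented above):

rk-transformers-cli export \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --platform rk3588 \
  --optimization-level 3 \
  --flash-attention \
  --compress-weight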

Quantization Arguments

  • -q, --quantize: Enable quantization. Otherwise, the model will be exported as float16.

  • -dt, --dtype: Quantization data type. Options: w8a8 (default), w8a16, w16a16i, w16a16i_dfp, w4a16.

  • -a, --algorithm: Quantization algorithm. Options: normal (default), mmse, kl_divergence, gdq.

  • -qm, --quantized-method: Quantization method. Options: layer, channel (default).

  • --auto-hybrid-cos-thresh: Cosine distance threshold for automatic hybrid quantization (default: 0.98).

  • --auto-hybrid-euc-thresh: Euclidean distance threshold for automatic hybrid quantization (default: None).

Dataset Arguments

  • -d, --dataset: Hugging Face dataset name for quantization (e.g., sentence-transformers/natural-questions).

  • -dsb, --dataset-subset: Subset name for the dataset.

  • -dsz, --dataset-size: Number of samples to use for quantization (default: 128).

  • -dsp, --dataset-split: Comma-separated list of dataset splits to use. Auto-detected if not specified.

  • -dc, --dataset-columns: Comma-separated list of dataset columns to use for calibration.
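
Quantization and dataset arguments are typically used together, since quantization requires calibration data. An illustrative invocation (the column names shown are assumptions for this dataset):

rk-transformers-cli export \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --platform rk3588 \
  --quantize \
  --dtype w8a8 \
  --algorithm mmse \
  --dataset sentence-transformers/natural-questions \
  --dataset-size 128 \
  --dataset-columns query,answer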

Optimum Arguments

  • --opset: ONNX opset version (default: 19). Recommended: 18+.

  • --task: ONNX task type for export (default: auto).
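
For example, to pin the ONNX opset and task rather than relying on auto-detection (feature-extraction is the usual task for sentence-transformers models):

rk-transformers-cli export \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --platform rk3588 \
  --opset 18 \
  --task feature-extraction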

Hugging Face Hub Arguments

  • --push-to-hub: Push the exported model to the Hugging Face Hub.

  • --model-id: The repository ID to push to on the Hugging Face Hub.

  • --token: The token to use to push to the Model Hub.

  • --private-repo: Whether the created repository should be private.

  • --create-pr: Whether to create a Pull Request instead of pushing directly to the main branch.

Optimization Levels

RKNN supports 4 optimization levels (0-3):

Level  Description                             Speed    Model Size
O0     No optimization                         Slowest  Largest
O1     Basic optimizations                     Fast     Medium
O2     Moderate optimizations                  Faster   Smaller
O3     Aggressive optimizations (recommended)  Fastest  Smallest

Programmatic Export

For more control, use the Python API. You can configure the export process using RKNNConfig, OptimizationConfig, and QuantizationConfig.

from rktransformers import (
    OptimizationConfig,
    QuantizationConfig,
    RKNNConfig,
)
from rktransformers.exporters.rknn.convert import export_rknn

config = RKNNConfig(
    model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",
    output_path="./my-exported-model",
    target_platform="rk3588",
    batch_size=1,
    max_seq_length=128,
    # w8a8 quantization calibrated on 100 samples from wikitext
    quantization=QuantizationConfig(
        do_quantization=True,
        quantized_dtype="w8a8",
        dataset_name="wikitext",
        dataset_size=100,
    ),
    # Aggressive graph optimization with Flash Attention enabled
    optimization=OptimizationConfig(
        optimization_level=3,
        enable_flash_attention=True,
    ),
)

export_rknn(config)

See RKNNConfig for all available options.

Push to Hugging Face Hub

Share your exported model on the Hugging Face Hub:

rk-transformers-cli export \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --platform rk3588 \
  --push-to-hub \
  --model-id my-username/my-model-rk3588

This will:

  1. Export the model to RKNN format

  2. Generate a model card with usage examples

  3. Push to the specified repository on Hugging Face Hub

Troubleshooting

Unsupported Operators

If your model uses operators not supported by RKNN, refer to the Operator Support documentation for the list of supported operators and available workarounds.

Conversion Failures

If export fails:

  1. Check RKNN toolkit version compatibility

  2. Verify model architecture is supported

  3. Try different optimization levels

  4. Try different opset versions

  5. Try different quantization settings

  6. Disable flash attention

  7. Use different batch size or sequence length

  8. Submit a clear and descriptive issue on GitHub with the error message and model details
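
If a particular configuration keeps failing, retrying with conservative settings often isolates the culprit. A minimal sketch using the programmatic API shown earlier (the specific values are illustrative):

from rktransformers import (
    OptimizationConfig,
    QuantizationConfig,
    RKNNConfig,
)
from rktransformers.exporters.rknn.convert import export_rknn

# Conservative fallback: no quantization, no Flash Attention,
# no graph optimization, short sequences.
config = RKNNConfig(
    model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",
    output_path="./fallback-export",
    target_platform="rk3588",
    max_seq_length=64,
    quantization=QuantizationConfig(do_quantization=False),
    optimization=OptimizationConfig(
        optimization_level=0,
        enable_flash_attention=False,
    ),
)
export_rknn(config)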

Memory Issues

For large models or long sequences, you may encounter memory issues. Refer to the Memory Constraints documentation for details and optimization strategies.