Configuration
Configuration classes for RKNN model export and runtime.
RKNN Configuration
- class rktransformers.configuration.RKNNConfig[source]
Bases: object
Configures the conversion of ONNX models to RKNN format for Rockchip NPUs.
- Parameters:
target_platform –
Target Rockchip platform. Options:
"rk3588"
"rk3576"
"rk3568"
"rk3566"
"rk3562"
Auto-detected: Not auto-detected; defaults to "rk3588".
quantization – Quantization configuration (see QuantizationConfig).
optimization – Optimization configuration (see OptimizationConfig).
model_input_names – Names of model inputs (e.g., ["input_ids", "attention_mask", "token_type_ids"]). Auto-detected: Optimum automatically determines the required inputs during ONNX export based on the model's architecture.
BERT models that use segment embeddings: Will include token_type_ids automatically
RoBERTa, sentence transformers: Will exclude token_type_ids automatically
Note: This parameter is primarily used for RKNN conversion. The ONNX export process inspects the exported model to determine actual inputs.
type_vocab_size – Token type vocabulary size (informational, from model’s config.json). Auto-detected: Read from model’s config.json.
batch_size – Batch size for input shapes during ONNX export and RKNN conversion (e.g., [batch_size, max_seq_length]). This controls the shape of inputs, not which inputs are included. Example: batch_size=1 creates inputs shaped [1, 128], batch_size=4 creates [4, 128].
max_seq_length – Maximum sequence length for input shapes during ONNX export and RKNN conversion. Auto-detected: Read from the model's config.json (max_position_embeddings); falls back to 512 if not found. Note: very large sequence lengths can cause the RKNN export to segfault.
float_dtype –
Floating-point data type for non-quantized operations. Options:
"float16"
mean_values – Mean values for input normalization (for image models).
std_values – Standard deviation values for input normalization (for image models).
custom_string – Custom configuration string passed to RKNN toolkit.
inputs_yuv_fmt – YUV format for inputs (for image models).
single_core_mode – Run model on single NPU core instead of multi-core. Only applicable for rk3588. Reduces model size.
dynamic_input – Dynamic input shapes configuration. Experimental feature with sparse support on Rockchip NPUs. Format: [[[batch_size, seq_length], [batch_size, seq_length], ...], [[batch_size, seq_length], [batch_size, seq_length], ...], ...], i.e., one inner list of per-input shapes for each shape candidate, e.g. [[[1,128],[1,128],...],[[1,256],[1,256],...],...]. A concrete example is shown after this parameter list.
op_target –
Specify the target device for specific operations. Useful for offloading operations that the NPU does not support to the CPU. Format: {'op_id': 'cpu', 'op_id3': 'cpu'}. Default is None. Example (a fuller sketch appears after this parameter list):
unsupported_used = [(node.op_type, node.name) for node in model.graph.node if node.op_type in unsupported]
op_target = {n: "cpu" for _, n in unsupported_used}
model_name_or_path – Path to input ONNX model file or Hugging Face model ID.
output_path – Path for output RKNN model file or directory. Optional. Defaults to the model’s parent directory (for local files) or current directory (for Hub models).
push_to_hub – Upload the exported model to HuggingFace Hub.
hub_model_id – HuggingFace Hub repository ID (required if push_to_hub=True). Should include username/namespace (e.g., “username/model-name”). If no namespace is provided (e.g., “model-name”), the username will be auto-detected from the token via whoami() API.
hub_token – HuggingFace Hub authentication token.
hub_private_repo – Create a private repository on HuggingFace Hub.
opset – ONNX opset version. Minimum: 14 (required for SDPA). Maximum: 19 (maximum supported by RKNN). Default: 19.
task –
Task type for export (default: “auto”).
'auto': Uses optimum to detect the task based on the model architecture.
Can also be used to export models that optimum supports but the rk-transformers runtime does not; in that case, the user is responsible for writing inference code with the rknn-toolkit-lite2 library or for subclassing rktransformers.RKModel.
ForSequenceClassification -> sequence-classification
ForMaskedLM -> fill-mask
ForQuestionAnswering -> question-answering
ForTokenClassification -> token-classification
ForMultipleChoice -> multiple-choice
ForFeatureExtraction -> feature-extraction
Fallback: feature-extraction for plain XYZModel architectures (no task-specific head)
task_kwargs – Task-specific keyword arguments for ONNX export (dict[str, Any]). Example: For multiple-choice tasks, use {“num_choices”: 4}. These kwargs are passed directly to optimum’s main_export function.
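As referenced in the dynamic_input description above, a minimal sketch of the expected nesting, assuming a model with three inputs (input_ids, attention_mask, token_type_ids) and two candidate sequence lengths; the shapes are illustrative, not required values:

from rktransformers.configuration import RKNNConfig

# One inner list per shape candidate; one [batch_size, seq_length] entry per model input.
dynamic_input = [
    [[1, 128], [1, 128], [1, 128]],  # all three inputs at sequence length 128
    [[1, 256], [1, 256], [1, 256]],  # all three inputs at sequence length 256
]
config = RKNNConfig(target_platform="rk3588", dynamic_input=dynamic_input)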
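As referenced in the op_target description above, a self-contained sketch of building the mapping from an exported ONNX graph; the model path and the set of unsupported op types are assumptions chosen for illustration:

import onnx
from rktransformers.configuration import RKNNConfig

model = onnx.load("model.onnx")      # assumed path to the exported ONNX model
unsupported = {"Einsum", "NonZero"}  # hypothetical op types the NPU cannot run

# Collect the names of graph nodes whose op type is unsupported ...
unsupported_used = [(node.op_type, node.name) for node in model.graph.node if node.op_type in unsupported]
# ... and pin each of those nodes to the CPU.
op_target = {name: "cpu" for _, name in unsupported_used}

config = RKNNConfig(target_platform="rk3588", op_target=op_target)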
- quantization: QuantizationConfig
- optimization: OptimizationConfig
- task: Literal['auto', 'feature-extraction', 'fill-mask', 'sequence-classification', 'question-answering', 'token-classification', 'multiple-choice'] = 'auto'
- to_dict()[source]
Convert the config to a dictionary for RKNN.config(). Includes only the parameters accepted by RKNN.config(); passing other keys would cause RKNN to raise errors.
- to_export_dict()[source]
Convert the complete config to a dictionary for export/persistence in config.json (under the "rknn" key). Includes ALL configuration parameters for reproducibility.
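A minimal sketch of how the two methods are typically consumed; the RKNN class and its config() method come from Rockchip's rknn-toolkit2 package, and the exact keys produced depend on the fields set on the config:

from rknn.api import RKNN
from rktransformers.configuration import RKNNConfig

cfg = RKNNConfig(target_platform="rk3588", batch_size=1, max_seq_length=128)

# to_dict() returns only RKNN.config()-compatible keys, so it can be unpacked directly.
rknn = RKNN()
rknn.config(**cfg.to_dict())

# to_export_dict() returns every field, intended for the "rknn" key of config.json.
export_dict = cfg.to_export_dict()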
- __init__(target_platform='rk3588', quantization=<factory>, optimization=<factory>, model_input_names=None, type_vocab_size=None, batch_size=1, max_seq_length=512, task_kwargs=None, float_dtype='float16', mean_values=None, std_values=None, custom_string=None, inputs_yuv_fmt=None, single_core_mode=False, dynamic_input=None, op_target=None, model_name_or_path=None, output_path=None, push_to_hub=False, hub_model_id=None, hub_token=None, hub_private_repo=False, hub_create_pr=False, opset=19, task='auto')
- Parameters:
target_platform (Literal['simulator', 'rk3588', 'rk3576', 'rk3568', 'rk3566', 'rk3562'])
quantization (QuantizationConfig)
optimization (OptimizationConfig)
type_vocab_size (int | None)
batch_size (int)
max_seq_length (int | None)
float_dtype (str)
custom_string (str | None)
single_core_mode (bool)
model_name_or_path (str | None)
output_path (str | None)
push_to_hub (bool)
hub_model_id (str | None)
hub_token (str | None)
hub_private_repo (bool)
hub_create_pr (bool)
opset (Literal[14, 15, 16, 17, 18, 19] | None)
task (Literal['auto', 'feature-extraction', 'fill-mask', 'sequence-classification', 'question-answering', 'token-classification', 'multiple-choice'])
- Return type:
None
Quantization Configuration
- class rktransformers.configuration.QuantizationConfig[source]
Bases: object
Configuration for RKNN quantization.
Quantization reduces model size and improves inference speed by converting weights and activations to lower precision (e.g., int8 instead of float32).
- Parameters:
do_quantization – Enable quantization during build. Requires a calibration dataset.
dataset_name – HuggingFace dataset name for quantization calibration (e.g., ‘wikitext’). Auto-detected: Not auto-detected, must be provided if do_quantization=True.
dataset_subset – Subset name for the dataset (e.g., ‘ax’ for ‘nyu-mll/glue’).
dataset_size – Number of samples to use from the dataset for calibration. Recommendation: 100-500 samples is usually sufficient.
dataset_split – Dataset splits to use (e.g., [“train”, “validation”]). Auto-detected: Uses [“train”, “validation”, “test”] if not specified.
dataset_columns – List of dataset columns to use for calibration (e.g., [“question”, “context”]). If not specified, falls back to auto-detection.
quantized_dtype –
Quantization data type. Options:
"w8a8" (8-bit weights and activations)
"w8a16" (8-bit weights, 16-bit activations)
"w16a16i" (16-bit integer weights and activations)
"w16a16i_dfp" (16-bit weights and activations, dynamic fixed point)
"w4a16" (4-bit weights, 16-bit activations)
Recommendation: "w8a8" for best performance, "w16a16i" for better accuracy.
quantized_algorithm –
Quantization calibration algorithm. Options:
"normal" (normal quantization)
"mmse" (minimum mean square error)
"kl_divergence" (Kullback-Leibler divergence)
"gdq" (gradient descent quantization)
Recommendation: "normal" is fastest; "kl_divergence" may provide better accuracy.
quantized_method –
Quantization granularity. Options:
"channel" (per-channel)
"layer" (per-layer)
"group{SIZE}" (group quantization), where SIZE is a multiple of 32 between 32 and 256
Recommendation: "channel" provides better accuracy.
quantized_hybrid_level – Hybrid quantization level (0-3). Higher values keep more layers in float for better accuracy but larger size.
quant_img_RGB2BGR – Convert RGB to BGR during quantization (for image models only).
auto_hybrid_cos_thresh – Cosine distance threshold for automatic hybrid quantization. Default: 0.98. Used when auto_hybrid is enabled in build.
auto_hybrid_euc_thresh – Euclidean distance threshold for automatic hybrid quantization. Default: None. Used when auto_hybrid is enabled in build.
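A hedged example of a calibration setup following the recommendations above; the dataset name, size, and split are placeholders rather than required values:

from rktransformers.configuration import QuantizationConfig

quantization = QuantizationConfig(
    do_quantization=True,
    dataset_name="wikitext",   # any HuggingFace dataset suitable for calibration
    dataset_size=200,          # 100-500 samples is usually sufficient
    dataset_split=["train"],
    quantized_dtype="w8a8",        # fastest; "w16a16i" trades speed for accuracy
    quantized_algorithm="normal",
    quantized_method="channel",
)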
- quantized_method: Literal['layer', 'channel', 'group32', 'group64', 'group96', 'group128', 'group160', 'group192', 'group224', 'group256'] = 'channel'
- __init__(do_quantization=False, dataset_name=None, dataset_subset=None, dataset_size=128, dataset_split=None, dataset_columns=None, quantized_dtype='w8a8', quantized_algorithm='normal', quantized_method='channel', quantized_hybrid_level=0, quant_img_RGB2BGR=False, auto_hybrid_cos_thresh=0.98, auto_hybrid_euc_thresh=None)
- Parameters:
do_quantization (bool)
dataset_name (str | None)
dataset_subset (str | None)
dataset_size (int)
quantized_dtype (Literal['w8a8', 'w8a16', 'w16a16i', 'w16a16i_dfp', 'w4a16'])
quantized_algorithm (Literal['normal', 'mmse', 'kl_divergence', 'gdq'])
quantized_method (Literal['layer', 'channel', 'group32', 'group64', 'group96', 'group128', 'group160', 'group192', 'group224', 'group256'])
quantized_hybrid_level (int)
quant_img_RGB2BGR (bool)
auto_hybrid_cos_thresh (float)
auto_hybrid_euc_thresh (float | None)
- Return type:
None
Optimization Configuration
- class rktransformers.configuration.OptimizationConfig[source]
Bases: object
Configuration for RKNN graph optimization.
These optimizations transform the model graph for better performance on NPU.
- Parameters:
optimization_level –
Graph optimization level (0-3).
0: No optimization
1: Basic optimization
2: Moderate optimization
3: Aggressive optimization (recommended)
Recommendation: Use 3 for best performance.
enable_flash_attention – Enable Flash Attention optimization for transformer models. Significantly improves attention layer performance. Recommendation: Enable for transformer/BERT models.
remove_weight – Remove weights from the model (for weight-sharing scenarios).
compress_weight – Compress model weights to reduce model size.
remove_reshape – Remove redundant reshape operations.
sparse_infer – Enable sparse inference optimization.
model_pruning – Enable model pruning to remove redundant connections.
- __init__(optimization_level=0, enable_flash_attention=False, remove_weight=False, compress_weight=False, remove_reshape=False, sparse_infer=False, model_pruning=False)
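A short sketch combining OptimizationConfig with the classes above; the values follow the recommendations in this section, and the nesting mirrors the quantization and optimization fields of RKNNConfig:

from rktransformers.configuration import OptimizationConfig, QuantizationConfig, RKNNConfig

optimization = OptimizationConfig(
    optimization_level=3,          # aggressive graph optimization (recommended)
    enable_flash_attention=True,   # helps attention layers in transformer/BERT models
)

config = RKNNConfig(
    target_platform="rk3588",
    quantization=QuantizationConfig(),  # defaults: quantization disabled
    optimization=optimization,
)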