Configuration
Configuration classes for RKNN model export and runtime.
RKNN Configuration
- class rktransformers.configuration.RKNNConfig[source]
Bases: object
Configures the conversion of ONNX models to RKNN format for Rockchip NPUs.
- Parameters:
target_platform –
Target Rockchip platform. Options:
"rk3588"
"rk3576"
"rk3568"
"rk3566"
"rk3562"
Auto-detected: Not auto-detected; defaults to "rk3588".
quantization – Quantization configuration (see QuantizationConfig).
optimization – Optimization configuration (see OptimizationConfig).
model_input_names – Names of model inputs (e.g., ["input_ids", "attention_mask", "token_type_ids"]). Auto-detected: Optimum automatically determines the required inputs during ONNX export based on the model's architecture.
BERT models that use segment embeddings: Will include token_type_ids automatically
RoBERTa, sentence transformers: Will exclude token_type_ids automatically
Note: This parameter is primarily used for RKNN conversion. The ONNX export process inspects the exported model to determine actual inputs.
type_vocab_size – Token type vocabulary size (informational, from model’s config.json). Auto-detected: Read from model’s config.json.
batch_size – Batch size for input shapes during ONNX export and RKNN conversion (e.g., [batch_size, max_seq_length]). This controls the shape of inputs, not which inputs are included. Example: batch_size=1 creates inputs shaped [1, 128], batch_size=4 creates [4, 128].
max_seq_length – Maximum sequence length for input shapes during ONNX export and RKNN conversion. Auto-detected: Read from the model's config.json (max_position_embeddings); falls back to 512 if not found. Note: very large sequence lengths can cause the RKNN export to segfault.
float_dtype –
Floating-point data type for non-quantized operations. Options:
"float16"
mean_values – Mean values for input normalization (for image models).
std_values – Standard deviation values for input normalization (for image models).
custom_string – Custom configuration string passed to RKNN toolkit.
inputs_yuv_fmt – YUV format for inputs (for image models).
single_core_mode – Run model on single NPU core instead of multi-core. Only applicable for rk3588. Reduces model size.
dynamic_input – Dynamic input shapes configuration. Experimental feature with sparse support on Rockchip NPUs. Format: [[[batch_size, seq_length], [batch_size, seq_length], ...], [[batch_size, seq_length], [batch_size, seq_length], ...], ...], i.e., one inner list of per-input shapes for each shape candidate, e.g. [[[1,128],[1,128],...],[[1,256],[1,256],...],...]. A concrete example is shown after this parameter list.
op_target –
Specify the target device for specific operations. Useful for offloading operations that the NPU does not support to the CPU. Format: {'op_id': 'cpu', 'op_id3': 'cpu'}. Default is None. Example (a fuller sketch appears after this parameter list):
unsupported_used = [(node.op_type, node.name) for node in model.graph.node if node.op_type in unsupported]
op_target = {n: "cpu" for _, n in unsupported_used}
model_name_or_path – Path to input ONNX model file or Hugging Face model ID.
output_path – Path for output RKNN model file or directory. Optional. Defaults to the model’s parent directory (for local files) or current directory (for Hub models).
push_to_hub – Upload the exported model to HuggingFace Hub.
hub_model_id – HuggingFace Hub repository ID (required if push_to_hub=True). Should include username/namespace (e.g., “username/model-name”). If no namespace is provided (e.g., “model-name”), the username will be auto-detected from the token via whoami() API.
hub_token – HuggingFace Hub authentication token.
hub_private_repo – Create a private repository on HuggingFace Hub.
opset – ONNX opset version. Minimum: 14 (required for SDPA). Maximum: 19 (maximum supported by RKNN). Default: 19.
task –
Task type for export (default: “auto”).
'auto': Uses optimum to detect the task based on the model architecture.
Can also be used to export models that optimum supports but the rk-transformers runtime does not; in that case, the user is responsible for writing inference code with the rknn-toolkit-lite2 library or for subclassing rktransformers.RKModel.
ForSequenceClassification -> sequence-classification
ForMaskedLM -> fill-mask
ForQuestionAnswering -> question-answering
ForTokenClassification -> token-classification
ForMultipleChoice -> multiple-choice
ForFeatureExtraction -> feature-extraction
Fallback: feature-extraction for plain XYZModel architectures (no task-specific head)
task_kwargs – Task-specific keyword arguments for ONNX export (dict[str, Any]). Example: For multiple-choice tasks, use {“num_choices”: 4}. These kwargs are passed directly to optimum’s main_export function.
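As referenced in the dynamic_input description above, a minimal sketch of the expected nesting, assuming a model with three inputs (input_ids, attention_mask, token_type_ids) and two candidate sequence lengths; the shapes are illustrative, not required values:

from rktransformers.configuration import RKNNConfig

# One inner list per shape candidate; one [batch_size, seq_length] entry per model input.
dynamic_input = [
    [[1, 128], [1, 128], [1, 128]],  # all three inputs at sequence length 128
    [[1, 256], [1, 256], [1, 256]],  # all three inputs at sequence length 256
]
config = RKNNConfig(target_platform="rk3588", dynamic_input=dynamic_input)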
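As referenced in the op_target description above, a self-contained sketch of building the mapping from an exported ONNX graph; the model path and the set of unsupported op types are assumptions chosen for illustration:

import onnx
from rktransformers.configuration import RKNNConfig

model = onnx.load("model.onnx")      # assumed path to the exported ONNX model
unsupported = {"Einsum", "NonZero"}  # hypothetical op types the NPU cannot run

# Collect the names of graph nodes whose op type is unsupported ...
unsupported_used = [(node.op_type, node.name) for node in model.graph.node if node.op_type in unsupported]
# ... and pin each of those nodes to the CPU.
op_target = {name: "cpu" for _, name in unsupported_used}

config = RKNNConfig(target_platform="rk3588", op_target=op_target)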
- quantization: QuantizationConfig
- optimization: OptimizationConfig
- task: Literal['auto', 'feature-extraction', 'fill-mask', 'sequence-classification', 'question-answering', 'token-classification', 'multiple-choice'] = 'auto'
- to_dict()[source]
Convert the config to a dictionary for RKNN.config(). Includes only the parameters accepted by RKNN.config(); passing other keys would cause RKNN to raise errors.
- to_export_dict()[source]
Convert the complete config to a dictionary for export/persistence in config.json (under the "rknn" key). Includes ALL configuration parameters for reproducibility.
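A minimal sketch of how the two methods are typically consumed; the RKNN class and its config() method come from Rockchip's rknn-toolkit2 package, and the exact keys produced depend on the fields set on the config:

from rknn.api import RKNN
from rktransformers.configuration import RKNNConfig

cfg = RKNNConfig(target_platform="rk3588", batch_size=1, max_seq_length=128)

# to_dict() returns only RKNN.config()-compatible keys, so it can be unpacked directly.
rknn = RKNN()
rknn.config(**cfg.to_dict())

# to_export_dict() returns every field, intended for the "rknn" key of config.json.
export_dict = cfg.to_export_dict()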
- __init__(target_platform='rk3588', quantization=<factory>, optimization=<factory>, model_input_names=None, type_vocab_size=None, batch_size=1, max_seq_length=512, task_kwargs=None, float_dtype='float16', mean_values=None, std_values=None, custom_string=None, inputs_yuv_fmt=None, single_core_mode=False, dynamic_input=None, op_target=None, model_name_or_path=None, output_path=None, push_to_hub=False, hub_model_id=None, hub_token=None, hub_private_repo=False, hub_create_pr=False, opset=19, task='auto')
- Parameters:
target_platform (Literal['simulator', 'rk3588', 'rk3576', 'rk3568', 'rk3566', 'rk3562'])
quantization (QuantizationConfig)
optimization (OptimizationConfig)
type_vocab_size (int | None)
batch_size (int)
max_seq_length (int | None)
float_dtype (str)
custom_string (str | None)
single_core_mode (bool)
model_name_or_path (str | None)
output_path (str | None)
push_to_hub (bool)
hub_model_id (str | None)
hub_token (str | None)
hub_private_repo (bool)
hub_create_pr (bool)
opset (Literal[14, 15, 16, 17, 18, 19] | None)
task (Literal['auto', 'feature-extraction', 'fill-mask', 'sequence-classification', 'question-answering', 'token-classification', 'multiple-choice'])
- Return type:
None
Quantization Configuration
- class rktransformers.configuration.QuantizationConfig[source]
Bases: object
Configuration for RKNN quantization.
Quantization reduces model size and improves inference speed by converting weights and activations to lower precision (e.g., int8 instead of float32).
- Parameters:
do_quantization – Enable quantization during build. Requires a calibration dataset.
dataset_name – HuggingFace dataset name for quantization calibration (e.g., ‘wikitext’). Auto-detected: Not auto-detected, must be provided if do_quantization=True.
dataset_subset – Subset name for the dataset (e.g., ‘ax’ for ‘nyu-mll/glue’).
dataset_size – Number of samples to use from the dataset for calibration. Recommendation: 100-500 samples is usually sufficient.
dataset_split – Dataset splits to use (e.g., [“train”, “validation”]). Auto-detected: Uses [“train”, “validation”, “test”] if not specified.
dataset_columns – List of dataset columns to use for calibration (e.g., [“question”, “context”]). If not specified, falls back to auto-detection.
quantized_dtype –
Quantization data type. Options:
"w8a8" (8-bit weights and activations)
"w8a16" (8-bit weights, 16-bit activations)
"w16a16i" (16-bit integer weights and activations)
"w16a16i_dfp" (16-bit weights and activations, dynamic fixed point)
"w4a16" (4-bit weights, 16-bit activations)
Recommendation: "w8a8" for best performance, "w16a16i" for better accuracy.
quantized_algorithm –
Quantization calibration algorithm. Options:
"normal" (normal quantization)
"mmse" (minimum mean square error)
"kl_divergence" (Kullback-Leibler divergence)
"gdq" (gradient descent quantization)
Recommendation: "normal" is fastest; "kl_divergence" may provide better accuracy.
quantized_method –
Quantization granularity. Options:
"channel" (per-channel)
"layer" (per-layer)
"group{SIZE}" (group quantization), where SIZE is a multiple of 32 between 32 and 256
Recommendation: "channel" provides better accuracy.
quantized_hybrid_level – Hybrid quantization level (0-3). Higher values keep more layers in float for better accuracy but larger size.
quant_img_RGB2BGR – Convert RGB to BGR during quantization (for image models only).
auto_hybrid_cos_thresh – Cosine distance threshold for automatic hybrid quantization. Default: 0.98. Used when auto_hybrid is enabled in build.
auto_hybrid_euc_thresh – Euclidean distance threshold for automatic hybrid quantization. Default: None. Used when auto_hybrid is enabled in build.
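A hedged example of a calibration setup following the recommendations above; the dataset name, size, and split are placeholders rather than required values:

from rktransformers.configuration import QuantizationConfig

quantization = QuantizationConfig(
    do_quantization=True,
    dataset_name="wikitext",   # any HuggingFace dataset suitable for calibration
    dataset_size=200,          # 100-500 samples is usually sufficient
    dataset_split=["train"],
    quantized_dtype="w8a8",        # fastest; "w16a16i" trades speed for accuracy
    quantized_algorithm="normal",
    quantized_method="channel",
)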
- quantized_method: Literal['layer', 'channel', 'group32', 'group64', 'group96', 'group128', 'group160', 'group192', 'group224', 'group256'] = 'channel'
- __init__(do_quantization=False, dataset_name=None, dataset_subset=None, dataset_size=128, dataset_split=None, dataset_columns=None, quantized_dtype='w8a8', quantized_algorithm='normal', quantized_method='channel', quantized_hybrid_level=0, quant_img_RGB2BGR=False, auto_hybrid_cos_thresh=0.98, auto_hybrid_euc_thresh=None)
- Parameters:
do_quantization (bool)
dataset_name (str | None)
dataset_subset (str | None)
dataset_size (int)
quantized_dtype (Literal['w8a8', 'w8a16', 'w16a16i', 'w16a16i_dfp', 'w4a16'])
quantized_algorithm (Literal['normal', 'mmse', 'kl_divergence', 'gdq'])
quantized_method (Literal['layer', 'channel', 'group32', 'group64', 'group96', 'group128', 'group160', 'group192', 'group224', 'group256'])
quantized_hybrid_level (int)
quant_img_RGB2BGR (bool)
auto_hybrid_cos_thresh (float)
auto_hybrid_euc_thresh (float | None)
- Return type:
None
Optimization Configuration
- class rktransformers.configuration.OptimizationConfig[source]
Bases: object
Configuration for RKNN graph optimization.
These optimizations transform the model graph for better performance on NPU.
- Parameters:
optimization_level –
Graph optimization level (0-3).
0: No optimization
1: Basic optimization
2: Moderate optimization
3: Aggressive optimization (recommended)
Recommendation: Use 3 for best performance.
enable_flash_attention – Enable Flash Attention optimization for transformer models. Significantly improves attention layer performance. Recommendation: Enable for transformer/BERT models.
remove_weight – Remove weights from the model (for weight-sharing scenarios).
compress_weight – Compress model weights to reduce model size.
remove_reshape – Remove redundant reshape operations.
sparse_infer – Enable sparse inference optimization.
model_pruning – Enable model pruning to remove redundant connections.
- __init__(optimization_level=0, enable_flash_attention=False, remove_weight=False, compress_weight=False, remove_reshape=False, sparse_infer=False, model_pruning=False)
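A short sketch combining OptimizationConfig with the classes above; the values follow the recommendations in this section, and the nesting mirrors the quantization and optimization fields of RKNNConfig:

from rktransformers.configuration import OptimizationConfig, QuantizationConfig, RKNNConfig

optimization = OptimizationConfig(
    optimization_level=3,          # aggressive graph optimization (recommended)
    enable_flash_attention=True,   # helps attention layers in transformer/BERT models
)

config = RKNNConfig(
    target_platform="rk3588",
    quantization=QuantizationConfig(),  # defaults: quantization disabled
    optimization=optimization,
)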