Configuration

Configuration classes for RKNN model export and runtime.

RKNN Configuration

class rktransformers.configuration.RKNNConfig[source]

Bases: object

This configures the conversion of ONNX models to RKNN format for Rockchip NPUs.

Parameters:
  • target_platform

    Target Rockchip platform. Options:

    • "simulator"

    • "rk3588"

    • "rk3576"

    • "rk3568"

    • "rk3566"

    • "rk3562"

    Auto-detected: Not auto-detected; defaults to "rk3588".

  • quantization – Quantization configuration (see QuantizationConfig).

  • optimization – Optimization configuration (see OptimizationConfig).

  • model_input_names

    Names of model inputs (e.g., ["input_ids", "attention_mask", "token_type_ids"]). Auto-detected: Optimum automatically determines required inputs during ONNX export based on the model’s architecture.

    • BERT models that use segment embeddings: Will include token_type_ids automatically

    • RoBERTa, sentence transformers: Will exclude token_type_ids automatically

    Note: This parameter is primarily used for RKNN conversion. The ONNX export process inspects the exported model to determine actual inputs.

  • type_vocab_size – Token type vocabulary size (informational). Auto-detected: Read from the model’s config.json.

  • batch_size – Batch size for input shapes during ONNX export and RKNN conversion (e.g., [batch_size, max_seq_length]). This controls the shape of inputs, not which inputs are included. Example: batch_size=1 creates inputs shaped [1, 128], batch_size=4 creates [4, 128].

  • max_seq_length – Maximum sequence length for input shapes during ONNX export and RKNN conversion. Auto-detected: Read from the model’s config.json (max_position_embeddings). Falls back to 512 if not found in the config. Note: a large sequence length can cause the RKNN export to crash with a segmentation fault.

  • float_dtype

    Floating point data type for non-quantized operations. Options:

    • "float16"

  • mean_values – Mean values for input normalization (for image models).

  • std_values – Standard deviation values for input normalization (for image models).

  • custom_string – Custom configuration string passed to RKNN toolkit.

  • inputs_yuv_fmt – YUV format for inputs (for image models).

  • single_core_mode – Run the model on a single NPU core instead of multiple cores. Only applicable to rk3588. Reduces model size.

  • dynamic_input – Dynamic input shapes configuration. Experimental feature with limited support on Rockchip NPUs. Each inner list gives the shape of every model input for one shape candidate. Format: [[[batch_size,seq_length],[batch_size,seq_length],…],[[batch_size,seq_length],[batch_size,seq_length],…],…]. Example: [[[1,128],[1,128],…],[[1,256],[1,256],…],…].

  • op_target

    Specify the target device for specific operations. Useful for offloading operations that the NPU does not support to the CPU. Format: {'op_id': 'cpu', 'op_id3': 'cpu'}. Default is None. Example:

    import onnx

    model = onnx.load("model.onnx")  # illustrative path to the exported ONNX model
    unsupported = {"Erf"}  # hypothetical set of op types the NPU cannot run
    unsupported_used = [(node.op_type, node.name) for node in model.graph.node if node.op_type in unsupported]
    op_target = {n: "cpu" for _, n in unsupported_used}
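
    The resulting mapping can then be passed as RKNNConfig(op_target=op_target).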
    

  • model_name_or_path – Path to input ONNX model file or Hugging Face model ID.

  • output_path – Path for output RKNN model file or directory. Optional. Defaults to the model’s parent directory (for local files) or current directory (for Hub models).

  • push_to_hub – Upload the exported model to HuggingFace Hub.

  • hub_model_id – HuggingFace Hub repository ID (required if push_to_hub=True). Should include username/namespace (e.g., "username/model-name"). If no namespace is provided (e.g., "model-name"), the username will be auto-detected from the token via the whoami() API.

  • hub_token – HuggingFace Hub authentication token.

  • hub_private_repo – Create a private repository on HuggingFace Hub.

  • opset – ONNX opset version. Minimum: 14 (required for SDPA); maximum: 19 (the highest supported by RKNN). Default: 19.

  • task

    Task type for export (default: "auto").

    • "auto": Uses optimum to detect the task based on the model architecture.

      • Can also be used to export models that optimum supports but the rk-transformers runtime does not; in that case, the user is responsible for implementing inference code with the rknn-toolkit-lite2 library or by subclassing rktransformers.RKModel.

    • ForSequenceClassification -> sequence-classification

    • ForMaskedLM -> fill-mask

    • ForQuestionAnswering -> question-answering

    • ForTokenClassification -> token-classification

    • ForMultipleChoice -> multiple-choice

    • ForFeatureExtraction -> feature-extraction

    • Fallback: feature-extraction if the architecture is a bare XYZModel (no task-specific head)

  • task_kwargs – Task-specific keyword arguments for ONNX export (dict[str, Any]). Example: For multiple-choice tasks, use {"num_choices": 4}. These kwargs are passed directly to optimum’s main_export function.

target_platform: Literal['simulator', 'rk3588', 'rk3576', 'rk3568', 'rk3566', 'rk3562'] = 'rk3588'
quantization: QuantizationConfig
optimization: OptimizationConfig
model_input_names: list[str] | None = None
type_vocab_size: int | None = None
batch_size: int = 1
max_seq_length: int | None = 512
task_kwargs: dict[str, Any] | None = None
float_dtype: str = 'float16'
mean_values: list[list[float]] | None = None
std_values: list[list[float]] | None = None
custom_string: str | None = None
inputs_yuv_fmt: list[str] | None = None
single_core_mode: bool = False
dynamic_input: list[list[list[int]]] | None = None
op_target: dict[str, str] | None = None
model_name_or_path: str | None = None
output_path: str | None = None
push_to_hub: bool = False
hub_model_id: str | None = None
hub_token: str | None = None
hub_private_repo: bool = False
hub_create_pr: bool = False
opset: Literal[14, 15, 16, 17, 18, 19] | None = 19
task: Literal['auto', 'feature-extraction', 'fill-mask', 'sequence-classification', 'question-answering', 'token-classification', 'multiple-choice'] = 'auto'
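
Example: a minimal sketch of constructing an RKNNConfig for export. The model ID and shape values below are illustrative assumptions, not library defaults:

    from rktransformers.configuration import OptimizationConfig, RKNNConfig

    config = RKNNConfig(
        model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",  # illustrative Hub model ID
        target_platform="rk3588",
        batch_size=1,        # inputs shaped [1, max_seq_length]
        max_seq_length=128,  # keep modest; large values can crash the RKNN export
        optimization=OptimizationConfig(optimization_level=3, enable_flash_attention=True),
    )
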
to_dict()[source]

Convert config to dictionary for RKNN.config(). Includes only the parameters relevant to RKNN.config(); passing other parameters would cause RKNN to raise errors.

Return type:

dict[str, Any]
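
Example: a minimal sketch of the intended flow, assuming rknn-toolkit2 is installed (RKNN and RKNN.config() are its standard entry points):

    from rknn.api import RKNN

    from rktransformers.configuration import RKNNConfig

    config = RKNNConfig(target_platform="rk3588")
    rknn = RKNN()
    rknn.config(**config.to_dict())  # only keys accepted by RKNN.config() are present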

to_export_dict()[source]

Convert complete config to dictionary for export/persistence in config.json (under the "rknn" key). This includes ALL configuration parameters for reproducibility.

Return type:

dict[str, Any]

classmethod from_dict(data)[source]

Load configuration from a dictionary.

Parameters:

data (dict[str, Any])

Return type:

RKNNConfig
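
Example: a minimal sketch of the persistence round trip, writing the complete config under the "rknn" key of a config.json and restoring it with from_dict():

    import json

    from rktransformers.configuration import RKNNConfig

    config = RKNNConfig(target_platform="rk3576", max_seq_length=256)

    # Persist all parameters for reproducibility.
    with open("config.json", "w") as f:
        json.dump({"rknn": config.to_export_dict()}, f, indent=2)

    # Restore the configuration later.
    with open("config.json") as f:
        restored = RKNNConfig.from_dict(json.load(f)["rknn"])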

__init__(target_platform='rk3588', quantization=<factory>, optimization=<factory>, model_input_names=None, type_vocab_size=None, batch_size=1, max_seq_length=512, task_kwargs=None, float_dtype='float16', mean_values=None, std_values=None, custom_string=None, inputs_yuv_fmt=None, single_core_mode=False, dynamic_input=None, op_target=None, model_name_or_path=None, output_path=None, push_to_hub=False, hub_model_id=None, hub_token=None, hub_private_repo=False, hub_create_pr=False, opset=19, task='auto')
Parameters:
  • target_platform (Literal['simulator', 'rk3588', 'rk3576', 'rk3568', 'rk3566', 'rk3562'])

  • quantization (QuantizationConfig)

  • optimization (OptimizationConfig)

  • model_input_names (list[str] | None)

  • type_vocab_size (int | None)

  • batch_size (int)

  • max_seq_length (int | None)

  • task_kwargs (dict[str, Any] | None)

  • float_dtype (str)

  • mean_values (list[list[float]] | None)

  • std_values (list[list[float]] | None)

  • custom_string (str | None)

  • inputs_yuv_fmt (list[str] | None)

  • single_core_mode (bool)

  • dynamic_input (list[list[list[int]]] | None)

  • op_target (dict[str, str] | None)

  • model_name_or_path (str | None)

  • output_path (str | None)

  • push_to_hub (bool)

  • hub_model_id (str | None)

  • hub_token (str | None)

  • hub_private_repo (bool)

  • hub_create_pr (bool)

  • opset (Literal[14, 15, 16, 17, 18, 19] | None)

  • task (Literal['auto', 'feature-extraction', 'fill-mask', 'sequence-classification', 'question-answering', 'token-classification', 'multiple-choice'])

Return type:

None

Quantization Configuration

class rktransformers.configuration.QuantizationConfig[source]

Bases: object

Configuration for RKNN quantization.

Quantization reduces model size and improves inference speed by converting weights and activations to lower precision (e.g., int8 instead of float32).

Parameters:
  • do_quantization – Enable quantization during build. Requires a calibration dataset.

  • dataset_name – HuggingFace dataset name for quantization calibration (e.g., "wikitext"). Auto-detected: Not auto-detected; must be provided if do_quantization=True.

  • dataset_subset – Subset name for the dataset (e.g., "ax" for "nyu-mll/glue").

  • dataset_size – Number of samples to use from the dataset for calibration. Recommendation: 100-500 samples are usually sufficient.

  • dataset_split – Dataset splits to use (e.g., ["train", "validation"]). Auto-detected: Uses ["train", "validation", "test"] if not specified.

  • dataset_columns – List of dataset columns to use for calibration (e.g., ["question", "context"]). If not specified, falls back to auto-detection.

  • quantized_dtype

    Quantization data type. Options:

    • "w8a8" (8-bit weights and activations)

    • "w8a16" (8-bit weights, 16-bit activations)

    • "w16a16i" (16-bit integer weights and activations)

    • "w16a16i_dfp" (16-bit weights and activations, dynamic fixed-point)

    • "w4a16" (4-bit weights, 16-bit activations)

    Recommendation: "w8a8" for best performance, "w16a16i" for better accuracy.

  • quantized_algorithm

    Quantization calibration algorithm. Options:

    • "normal" (normal quantization)

    • "mmse" (minimum mean square error)

    • "kl_divergence" (Kullback-Leibler divergence)

    • "gdq" (gradient descent quantization)

    Recommendation: "normal" is fastest; "kl_divergence" may provide better accuracy.

  • quantized_method

    Quantization granularity. Options:

    • "channel" (per-channel)

    • "layer" (per-layer)

    • "group{SIZE}" (group quantization), where SIZE is a multiple of 32 between 32 and 256

    Recommendation: "channel" provides better accuracy.

  • quantized_hybrid_level – Hybrid quantization level (0-3). Higher values keep more layers in float for better accuracy but larger size.

  • quant_img_RGB2BGR – Convert RGB to BGR during quantization (for image models only).

  • auto_hybrid_cos_thresh – Cosine distance threshold for automatic hybrid quantization. Default: 0.98. Used when auto_hybrid is enabled in build.

  • auto_hybrid_euc_thresh – Euclidean distance threshold for automatic hybrid quantization. Default: None. Used when auto_hybrid is enabled in build.

do_quantization: bool = False
dataset_name: str | None = None
dataset_subset: str | None = None
dataset_size: int = 128
dataset_split: list[str] | None = None
dataset_columns: list[str] | None = None
quantized_dtype: Literal['w8a8', 'w8a16', 'w16a16i', 'w16a16i_dfp', 'w4a16'] = 'w8a8'
quantized_algorithm: Literal['normal', 'mmse', 'kl_divergence', 'gdq'] = 'normal'
quantized_method: Literal['layer', 'channel', 'group32', 'group64', 'group96', 'group128', 'group160', 'group192', 'group224', 'group256'] = 'channel'
quantized_hybrid_level: int = 0
quant_img_RGB2BGR: bool = False
auto_hybrid_cos_thresh: float = 0.98
auto_hybrid_euc_thresh: float | None = None
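
Example: a minimal sketch of a calibration setup following the recommendations above (the dataset name and sample count are illustrative):

    from rktransformers.configuration import QuantizationConfig

    quantization = QuantizationConfig(
        do_quantization=True,
        dataset_name="wikitext",     # calibration dataset; required when do_quantization=True
        dataset_size=256,            # 100-500 samples are usually sufficient
        quantized_dtype="w8a8",      # best performance
        quantized_method="channel",  # per-channel, better accuracy
    )

The resulting object can be passed as RKNNConfig(quantization=quantization).
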
to_dict()[source]

Convert to dictionary for serialization.

Return type:

dict[str, Any]

__init__(do_quantization=False, dataset_name=None, dataset_subset=None, dataset_size=128, dataset_split=None, dataset_columns=None, quantized_dtype='w8a8', quantized_algorithm='normal', quantized_method='channel', quantized_hybrid_level=0, quant_img_RGB2BGR=False, auto_hybrid_cos_thresh=0.98, auto_hybrid_euc_thresh=None)
Parameters:
  • do_quantization (bool)

  • dataset_name (str | None)

  • dataset_subset (str | None)

  • dataset_size (int)

  • dataset_split (list[str] | None)

  • dataset_columns (list[str] | None)

  • quantized_dtype (Literal['w8a8', 'w8a16', 'w16a16i', 'w16a16i_dfp', 'w4a16'])

  • quantized_algorithm (Literal['normal', 'mmse', 'kl_divergence', 'gdq'])

  • quantized_method (Literal['layer', 'channel', 'group32', 'group64', 'group96', 'group128', 'group160', 'group192', 'group224', 'group256'])

  • quantized_hybrid_level (int)

  • quant_img_RGB2BGR (bool)

  • auto_hybrid_cos_thresh (float)

  • auto_hybrid_euc_thresh (float | None)

Return type:

None

Optimization Configuration

class rktransformers.configuration.OptimizationConfig[source]

Bases: object

Configuration for RKNN graph optimization.

These optimizations transform the model graph for better performance on the NPU.

Parameters:
  • optimization_level

    Graph optimization level (0-3).

    • 0: No optimization

    • 1: Basic optimization

    • 2: Moderate optimization

    • 3: Aggressive optimization (recommended)

    Recommendation: Use 3 for best performance.

  • enable_flash_attention – Enable Flash Attention optimization for transformer models. Significantly improves attention layer performance. Recommendation: Enable for transformer/BERT models.

  • remove_weight – Remove weights from the model (for weight-sharing scenarios).

  • compress_weight – Compress model weights to reduce model size.

  • remove_reshape – Remove redundant reshape operations.

  • sparse_infer – Enable sparse inference optimization.

  • model_pruning – Enable model pruning to remove redundant connections.

optimization_level: Literal[0, 1, 2, 3] = 0
enable_flash_attention: bool = False
remove_weight: bool = False
compress_weight: bool = False
remove_reshape: bool = False
sparse_infer: bool = False
model_pruning: bool = False
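
Example: a minimal sketch enabling the settings recommended above for transformer models:

    from rktransformers.configuration import OptimizationConfig

    optimization = OptimizationConfig(
        optimization_level=3,         # aggressive optimization (recommended)
        enable_flash_attention=True,  # recommended for transformer/BERT models
    )

The resulting object can be passed as RKNNConfig(optimization=optimization).
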
to_dict()[source]

Convert to dictionary for serialization.

Return type:

dict[str, Any]

__init__(optimization_level=0, enable_flash_attention=False, remove_weight=False, compress_weight=False, remove_reshape=False, sparse_infer=False, model_pruning=False)
Parameters:
  • optimization_level (Literal[0, 1, 2, 3])

  • enable_flash_attention (bool)

  • remove_weight (bool)

  • compress_weight (bool)

  • remove_reshape (bool)

  • sparse_infer (bool)

  • model_pruning (bool)

Return type:

None