RKNN Limitations
Understanding RKNN’s limitations is crucial for successful and efficient deployment.
Dynamic Inputs & Static Shapes
RKNN support for dynamic inputs is currently experimental and not fully functional. As a result, models exported via the RK-Transformers CLI use static input shapes defined at export time. Advanced users can enable dynamic inputs through the programmatic export API.
Performance Impact
The NPU allocates memory based on the static shape. If you export with max_seq_length=512 but only infer on 10 tokens, the NPU still processes the full 512-token padding, leading to inefficient inference.
Example:
from transformers import AutoTokenizer
from rktransformers import RKModelForFeatureExtraction

# Model exported with max_seq_length=512, batch_size=1
model = RKModelForFeatureExtraction.from_pretrained(
    "rk-transformers/all-MiniLM-L6-v2"
)
tokenizer = AutoTokenizer.from_pretrained("rk-transformers/all-MiniLM-L6-v2")

# Short input (10 tokens)
inputs = tokenizer("short text", return_tensors="np")

# Automatically padded to 512 tokens
# NPU processes all 512 tokens, wasting computation
Input Padding
RK-Transformers automatically pads inputs so they match the static shape used at export time:
input_ids: padded from the actual length (e.g. [1, 10]) to the export length (e.g. [1, 512]) using tokenizer.pad_token_id
attention_mask: padded with zeros to the export length
token_type_ids: padded to the export length using tokenizer.pad_token_type_id
This guarantees correct static tensor shapes for the NPU, but may increase computation when the exported sequence length is much larger than typical inputs.
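For illustration only, the sketch below reproduces this padding by hand with numpy; RK-Transformers performs it automatically, and its internal implementation may differ. The tokenizer name and the 512-token export length are assumptions for the example.
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
export_seq_len = 512  # the static shape compiled at export time (assumed here)

enc = tokenizer("short text", return_tensors="np")
pad_width = export_seq_len - enc["input_ids"].shape[1]

# Pad each tensor on the sequence axis to the exported static shape
input_ids = np.pad(enc["input_ids"], ((0, 0), (0, pad_width)),
                   constant_values=tokenizer.pad_token_id)
attention_mask = np.pad(enc["attention_mask"], ((0, 0), (0, pad_width)),
                        constant_values=0)
token_type_ids = np.pad(enc["token_type_ids"], ((0, 0), (0, pad_width)),
                        constant_values=tokenizer.pad_token_type_id)

print(input_ids.shape)  # (1, 512)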
Warning
RK-Transformers only performs padding, not truncation. You must ensure that your input batch size and sequence length are less than or equal to the model’s compiled input shapes. Inputs exceeding these dimensions will result in a runtime error.
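To stay within the compiled shape, you can truncate at tokenization time using the standard transformers truncation options; a minimal sketch, assuming the tokenizer defined above and an exported max_seq_length of 512:
# Truncate yourself so inputs never exceed the compiled sequence length
inputs = tokenizer(
    "a very long text ...",
    truncation=True,
    max_length=512,  # must not exceed the exported max_seq_length
    return_tensors="np",
)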
Recommendations
Export multiple model variants for different sequence lengths:
# Short sequences (faster)
rk-transformers-cli export --model bert-base-uncased --max-seq-length 128
# Medium sequences
rk-transformers-cli export --model bert-base-uncased --max-seq-length 256
# Long sequences (slower)
rk-transformers-cli export --model bert-base-uncased --max-seq-length 512
Export multiple batch sizes if workload varies:
# Single inference
rk-transformers-cli export --model bert-base-uncased --batch-size 1
# Batch inference
rk-transformers-cli export --model bert-base-uncased --batch-size 4
Choose optimal sequence length based on your data:
# Analyze your dataset to find optimal max_seq_length
import math
import numpy as np
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
texts = ["your", "dataset", "texts"]
# Batch tokenize and obtain lengths if supported
try:
    enc = tokenizer(
        texts,
        truncation=False,
        padding=False,
        return_length=True,  # Fast tokenizer support
    )
    lengths = np.array(enc["length"], dtype=int)
except Exception:
    # Fallback for non-fast tokenizers
    lengths = np.array(
        [len(tokenizer.encode(text, truncation=False)) for text in texts],
        dtype=int,
    )
if lengths.size == 0:
    raise ValueError("No texts to analyze for percentile calculation")
target_percentile = 0.95
k = max(0, min(len(lengths) - 1, math.ceil(len(lengths) * target_percentile) - 1))
target_length = int(np.partition(lengths, k)[k])
print(f"Mean length: {lengths.mean():.2f}")
print(f"{int(target_percentile * 100)}th percentile: {target_length}")
rk-transformers-cli export --max-seq-length <target_length>
Quantization Support
While the tool supports various quantization data types, many are experimental.
Supported Datatypes
Datatype |
Status |
Notes |
|---|---|---|
|
Supported |
No quantization, larger model size |
|
Recommended |
Widely supported and tested. 8-bit weights and activations |
|
Experimental |
May fail on certain models, operators, or SoC platforms |
|
Experimental |
May fail on certain models, operators, or SoC platforms |
|
Experimental |
May fail on certain models, operators, or SoC platforms |
|
Experimental |
May fail on certain models, operators, or SoC platforms |
Recommendations
Always use w8a8 for production: It’s the most stable and widely supported
Test thoroughly before deploying other datatypes
Fall back to float16 if quantization fails
Example:
rk-transformers-cli export \
--model bert-base-uncased \
--platform rk3588 \
--quantize \
--dtype w8a8 \
--dataset sentence-transformers/natural-questions \
--dataset-split train \
--dataset-columns answer \
--dataset-size 128 \
--max-seq-length 128 \
--batch-size 1
Operator Support
RKNN currently supports a subset of ONNX operators.
Unsupported Operators
If your model uses unsupported operators, export may fail with errors like:
E RKNN: [<time-stamp>] Unsupport NPU op: <operator-name>
E RKNN: [<time-stamp>] Unsupport CPU op: <operator-name>
Solutions
Easy Methods (limited success):
Change ONNX opset version:
rk-transformers-cli export --model bert-base-uncased --opset 19
# Try different versions: 14, 15, 16, 17, 18, 19
Run operators on CPU (requires custom configuration):
Modify export code to specify CPU fallback for specific operators.
from rktransformers import RKNNConfig

rknn_config = RKNNConfig(
    op_target={
        "op_id": "cpu",  # replace "op_id" with the ID of the unsupported operator
    }
)
Difficult Methods:
Modify ONNX graph: Replace unsupported ops with supported alternatives
Register custom operators: Use rknn.register_custom_op() in export code. Currently requires source code modification.
Checking Operator Support
Before exporting, you can check if your model’s operators are supported:
Export model to ONNX:
optimum-cli export onnx --model bert-base-uncased onnx_output/
Inspect ONNX model:
import onnx
model = onnx.load("onnx_output/model.onnx")
# List all operators
ops = set()
for node in model.graph.node:
    ops.add(node.op_type)
print("Operators used:", sorted(ops))
Compare with RKNN supported operators
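If you keep a list of operators supported on your target (taken from the RKNN documentation), a simple set difference flags potential problems before export. The SUPPORTED_OPS set below is a hypothetical placeholder, not the real list; it continues from the inspection snippet above:
# Hypothetical check: fill SUPPORTED_OPS from the RKNN operator support documentation
SUPPORTED_OPS = {"MatMul", "Add", "Softmax", "LayerNormalization"}  # placeholder subset

unsupported = ops - SUPPORTED_OPS
if unsupported:
    print("Potentially unsupported operators:", sorted(unsupported))
else:
    print("All operators appear in the supported list")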
Dtype Limitations
Input Tensor Types
RKNN NPUs only support specific input tensor dtypes:
int8, uint8, int16, float16, float32
Not Supported: int64 (commonly used for input_ids in transformers)
Impact
Transformer models typically use int64 for input IDs, which RKNN does not support, so RK-Transformers automatically converts them to int16. This causes:
Type conversion overhead (minor performance impact)
Potential precision loss if vocabulary size > 32,767 (rare)
# Internal conversion (automatic)
# input_ids: torch.int64 -> np.int16 (for RKNN inference)
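A rough numpy equivalent of that conversion, shown only to illustrate the dtype change and where precision could be lost; the library handles this for you:
import numpy as np

input_ids_int64 = np.array([[101, 2023, 2003, 102]], dtype=np.int64)

# int16 can only represent token IDs up to 32,767
assert input_ids_int64.max() <= np.iinfo(np.int16).max
input_ids_int16 = input_ids_int64.astype(np.int16)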
Model Weight Types
Float16 models: Weights stored as float16
Quantized models (w8a8): Weights stored as int8
Memory Constraints
NPU Memory Limits
Rockchip NPUs have limited addressable memory (4GB). Large models or long sequences may exceed available memory.
If you encounter memory errors such as segmentation faults or allocation failures during export or inference, try the following:
Reduce max_seq_length
Reduce batch_size
Use quantization (w8a8)
Modify the model architecture (if possible)
Platform Compatibility
Supported Platforms
| Platform | NPU Cores | TOPS | Notes |
|---|---|---|---|
| RK3588 | 3 cores | 6 TOPS | Fully tested, best performance |
| RK3576 | 2 cores | 6 TOPS | Supported by RKNN 2.3.2 |
| RK3568 | 1 core | 1 TOPS | Supported by RKNN 2.3.2 |
| RK3566 | 1 core | 1 TOPS | Supported by RKNN 2.3.2 |
| RK3562 | 1 core | 1 TOPS | Supported by RKNN 2.3.2 |
Export Requirements
Platform: Linux (x86_64 or arm64)
Python: 3.10-3.12
RKNN Toolkit: 2.3.2
Inference Requirements
Platform: Rockchip device with RKNPU2
OS: Linux (Ubuntu, Debian, Armbian, etc.)
RKNN Runtime: 2.3.2 (must match toolkit version)
Version Compatibility
Warning
RKNN toolkit version must match RKNN runtime version. A model exported with toolkit 2.3.2 requires runtime 2.3.2.
Known Issues
Very Long Sequences: Sequences > 4096 tokens may cause memory issues
Workaround: Reduce max_seq_length or use chunking (see the sketch after this list)
Cross-Attention Models: Limited support for encoder-decoder and decoder models. Support for additional model architectures is planned.
Workaround: Use encoder-only models when possible
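One way to apply the chunking workaround is to split an over-long token sequence into windows no longer than the compiled shape and run each window separately. A minimal sketch, assuming a 512-token export and simple non-overlapping windows; overlap, per-chunk special tokens, and output pooling are left to you:
from transformers import AutoTokenizer

max_seq_length = 512  # must match the exported static shape
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

token_ids = tokenizer("a very long document ...", truncation=False)["input_ids"]

# Split into non-overlapping windows that each fit the compiled shape
# (special-token handling per chunk is simplified here)
chunks = [token_ids[i:i + max_seq_length]
          for i in range(0, len(token_ids), max_seq_length)]

# Run each chunk through the model separately and combine the outputs afterwards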
Getting Help
If you encounter issues:
Check the GitHub Issues
Run diagnostics: rk-transformers-cli env
Review RKNN documentation
Open a new issue with full error output and environment details