RKNN Limitations
Understanding RKNN’s limitations is crucial for successful and efficient deployment.
Dynamic Inputs & Static Shapes
RKNN support for dynamic inputs is currently experimental and not fully functional. As a result, models exported via the RK-Transformers CLI use static input shapes defined at export time. Advanced users can enable dynamic inputs through the programmatic export API.
Performance Impact
The NPU allocates memory based on the static shape. If you export with max_seq_length=512 but only infer on 10 tokens, the NPU still processes the full 512-token padding, leading to inefficient inference.
Example:
from transformers import AutoTokenizer
from rktransformers import RKModelForFeatureExtraction

# Model exported with max_seq_length=512, batch_size=1
model = RKModelForFeatureExtraction.from_pretrained(
    "rk-transformers/all-MiniLM-L6-v2"
)
tokenizer = AutoTokenizer.from_pretrained("rk-transformers/all-MiniLM-L6-v2")

# Short input (10 tokens)
inputs = tokenizer("short text", return_tensors="np")

# Automatically padded to 512 tokens
# NPU processes all 512 tokens, wasting computation
Input Padding
RK-Transformers automatically pads inputs so they match the static shape used at export time:
input_ids: padded from the actual length (e.g. [1, 10]) to the export length (e.g. [1, 512]) using tokenizer.pad_token_id
attention_mask: padded with zeros to the export length
token_type_ids: padded to the export length using tokenizer.pad_token_type_id
This guarantees correct static tensor shapes for the NPU, but may increase computation when the exported sequence length is much larger than typical inputs.
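For illustration only, the sketch below reproduces this padding by hand with numpy; RK-Transformers performs it automatically, and its internal implementation may differ. The tokenizer name and the 512-token export length are assumptions for the example.
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
export_seq_len = 512  # the static shape compiled at export time (assumed here)

enc = tokenizer("short text", return_tensors="np")
pad_width = export_seq_len - enc["input_ids"].shape[1]

# Pad each tensor on the sequence axis to the exported static shape
input_ids = np.pad(enc["input_ids"], ((0, 0), (0, pad_width)),
                   constant_values=tokenizer.pad_token_id)
attention_mask = np.pad(enc["attention_mask"], ((0, 0), (0, pad_width)),
                        constant_values=0)
token_type_ids = np.pad(enc["token_type_ids"], ((0, 0), (0, pad_width)),
                        constant_values=tokenizer.pad_token_type_id)

print(input_ids.shape)  # (1, 512)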
Warning
RK-Transformers only performs padding, not truncation. You must ensure that your input batch size and sequence length are less than or equal to the model’s compiled input shapes. Inputs exceeding these dimensions will result in a runtime error.
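To stay within the compiled shape, you can truncate at tokenization time using the standard transformers truncation options; a minimal sketch, assuming the tokenizer defined above and an exported max_seq_length of 512:
# Truncate yourself so inputs never exceed the compiled sequence length
inputs = tokenizer(
    "a very long text ...",
    truncation=True,
    max_length=512,  # must not exceed the exported max_seq_length
    return_tensors="np",
)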
Recommendations
Export multiple model variants for different sequence lengths:
# Short sequences (faster)
rk-transformers-cli export --model bert-base-uncased --max-seq-length 128
# Medium sequences
rk-transformers-cli export --model bert-base-uncased --max-seq-length 256
# Long sequences (slower)
rk-transformers-cli export --model bert-base-uncased --max-seq-length 512
Export multiple batch sizes if workload varies:
# Single inference
rk-transformers-cli export --model bert-base-uncased --batch-size 1
# Batch inference
rk-transformers-cli export --model bert-base-uncased --batch-size 4
Choose optimal sequence length based on your data:
# Analyze your dataset to find optimal max_seq_length
import math
import numpy as np
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
texts = ["your", "dataset", "texts"]
# Batch tokenize and obtain lengths if supported
try:
    enc = tokenizer(
        texts,
        truncation=False,
        padding=False,
        return_length=True,  # Fast tokenizer support
    )
    lengths = np.array(enc["length"], dtype=int)
except Exception:
    # Fallback for non-fast tokenizers
    lengths = np.array(
        [len(tokenizer.encode(text, truncation=False)) for text in texts],
        dtype=int,
    )
if lengths.size == 0:
    raise ValueError("No texts to analyze for percentile calculation")
target_percentile = 0.95
k = max(0, min(len(lengths) - 1, math.ceil(len(lengths) * target_percentile) - 1))
target_length = int(np.partition(lengths, k)[k])
print(f"Mean length: {lengths.mean():.2f}")
print(f"{int(target_percentile * 100)}th percentile: {target_length}")
rk-transformers-cli export --max-seq-length <target_length>
Quantization Support
While the tool supports various quantization data types, many are experimental.
Supported Datatypes
Datatype |
Status |
Notes |
|---|---|---|
|
Supported |
No quantization, larger model size |
|
Recommended |
Widely supported and tested. 8-bit weights and activations |
|
Experimental |
May fail on certain models, operators, or SoC platforms |
|
Experimental |
May fail on certain models, operators, or SoC platforms |
|
Experimental |
May fail on certain models, operators, or SoC platforms |
|
Experimental |
May fail on certain models, operators, or SoC platforms |
Recommendations
Always use w8a8 for production: It’s the most stable and widely supported
Test thoroughly before deploying other datatypes
Fall back to float16 if quantization fails
Example:
rk-transformers-cli export \
--model bert-base-uncased \
--platform rk3588 \
--quantize \
--dtype w8a8 \
--dataset sentence-transformers/natural-questions \
--dataset-split train \
--dataset-columns answer \
--dataset-size 128 \
--max-seq-length 128 \
--batch-size 1
Operator Support
RKNN currently supports a subset of ONNX operators.
Unsupported Operators
If your model uses unsupported operators, export may fail with errors like:
E RKNN: [<time-stamp>] Unsupport NPU op: <operator-name>
E RKNN: [<time-stamp>] Unsupport CPU op: <operator-name>
Solutions
Easy Methods (limited success):
Change ONNX opset version:
rk-transformers-cli export --model bert-base-uncased --opset 19
# Try different versions: 14, 15, 16, 17, 18, 19
Run operators on CPU (requires custom configuration):
Modify export code to specify CPU fallback for specific operators.
from rktransformers import RKNNConfig

rknn_config = RKNNConfig(
    op_target={
        "op_id": "cpu",  # replace "op_id" with the ID of the unsupported operator
    }
)
Difficult Methods:
Modify ONNX graph: Replace unsupported ops with supported alternatives
Register custom operators: Use rknn.register_custom_op() in export code. Currently requires source code modification.
Checking Operator Support
Before exporting, you can check if your model’s operators are supported:
Export model to ONNX:
optimum-cli export onnx --model bert-base-uncased onnx_output/
Inspect ONNX model:
import onnx
model = onnx.load("onnx_output/model.onnx")
# List all operators
ops = set()
for node in model.graph.node:
    ops.add(node.op_type)
print("Operators used:", sorted(ops))
Compare with RKNN supported operators
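If you keep a list of operators supported on your target (taken from the RKNN documentation), a simple set difference flags potential problems before export. The SUPPORTED_OPS set below is a hypothetical placeholder, not the real list; it continues from the inspection snippet above:
# Hypothetical check: fill SUPPORTED_OPS from the RKNN operator support documentation
SUPPORTED_OPS = {"MatMul", "Add", "Softmax", "LayerNormalization"}  # placeholder subset

unsupported = ops - SUPPORTED_OPS
if unsupported:
    print("Potentially unsupported operators:", sorted(unsupported))
else:
    print("All operators appear in the supported list")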
Dtype Limitations
Input Tensor Types
RKNN NPUs only support specific input tensor dtypes:
int8, uint8, int16, float16, float32
Not Supported: int64 (commonly used for input_ids in transformers)
Impact
Transformer models typically use int64 for input IDs, which RKNN does not support, so RK-Transformers automatically converts them to int16. This causes:
Type conversion overhead (minor performance impact)
Potential precision loss if vocabulary size > 32,767 (rare)
# Internal conversion (automatic)
# input_ids: torch.int64 -> np.int16 (for RKNN inference)
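A rough numpy equivalent of that conversion, shown only to illustrate the dtype change and where precision could be lost; the library handles this for you:
import numpy as np

input_ids_int64 = np.array([[101, 2023, 2003, 102]], dtype=np.int64)

# int16 can only represent token IDs up to 32,767
assert input_ids_int64.max() <= np.iinfo(np.int16).max
input_ids_int16 = input_ids_int64.astype(np.int16)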
Model Weight Types
Float16 models: Weights stored as float16
Quantized models (w8a8): Weights stored as int8
Memory Constraints
NPU Memory Limits
Rockchip NPUs have limited addressable memory (4GB). Large models or long sequences may exceed available memory.
If you encounter memory errors such as segmentation faults or allocation failures during export or inference, try the following:
Reduce max_seq_length
Reduce batch_size
Use quantization (w8a8)
Modify the model architecture (if possible)
Platform Compatibility
Supported Platforms
| Platform | NPU Cores | TOPS | Notes |
|---|---|---|---|
| RK3588 | 3 cores | 6 TOPS | Fully tested, best performance |
| RK3576 | 2 cores | 6 TOPS | Supported by RKNN 2.3.2 |
| RK3568 | 1 core | 1 TOPS | Supported by RKNN 2.3.2 |
| RK3566 | 1 core | 1 TOPS | Supported by RKNN 2.3.2 |
| RK3562 | 1 core | 1 TOPS | Supported by RKNN 2.3.2 |
Export Requirements
Platform: Linux (x86_64 or arm64)
Python: 3.10-3.12
RKNN Toolkit: 2.3.2
Inference Requirements
Platform: Rockchip device with RKNPU2
OS: Linux (Ubuntu, Debian, Armbian, etc.)
RKNN Runtime: 2.3.2 (must match toolkit version)
Version Compatibility
Warning
RKNN toolkit version must match RKNN runtime version. A model exported with toolkit 2.3.2 requires runtime 2.3.2.
Known Issues
Very Long Sequences: Sequences > 4096 tokens may cause memory issues
Workaround: Reduce max_seq_length or use chunking (see the sketch after this list)
Cross-Attention Models: Limited support for encoder-decoder and decoder models. Support for additional model architectures is planned.
Workaround: Use encoder-only models when possible
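One way to apply the chunking workaround is to split an over-long token sequence into windows no longer than the compiled shape and run each window separately. A minimal sketch, assuming a 512-token export and simple non-overlapping windows; overlap, per-chunk special tokens, and output pooling are left to you:
from transformers import AutoTokenizer

max_seq_length = 512  # must match the exported static shape
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

token_ids = tokenizer("a very long document ...", truncation=False)["input_ids"]

# Split into non-overlapping windows that each fit the compiled shape
# (special-token handling per chunk is simplified here)
chunks = [token_ids[i:i + max_seq_length]
          for i in range(0, len(token_ids), max_seq_length)]

# Run each chunk through the model separately and combine the outputs afterwards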
Getting Help
If you encounter issues:
Check the GitHub Issues
Run diagnostics: rk-transformers-cli env
Review RKNN documentation
Open a new issue with full error output and environment details