Inference
This guide covers loading and running inference with RKNN models on Rockchip NPUs.
Loading Models
From Hugging Face Hub
Load pre-exported models directly from the Hugging Face Hub:
from rktransformers import RKModelForFeatureExtraction
model = RKModelForFeatureExtraction.from_pretrained(
"rk-transformers/all-MiniLM-L6-v2",
core_mask="auto"
)
From Local Path
Load from a local directory:
model = RKModelForFeatureExtraction.from_pretrained(
"./my-exported-model",
)
Selecting Specific Model File
If a repository contains multiple RKNN files (e.g., different quantization levels):
# Load INT8 quantized version
model = RKModelForFeatureExtraction.from_pretrained(
"rk-transformers/all-MiniLM-L6-v2",
file_name="rknn/model_w8a8.rknn"
)
# Load float16 version
model = RKModelForFeatureExtraction.from_pretrained(
"rk-transformers/all-MiniLM-L6-v2",
file_name="rknn/model.rknn"
)
Sentence Transformers
RK-Transformers provides drop-in replacements for sentence-transformers models.
See the Sentence Transformers integration guide for details on using:
RKSentenceTransformer: For bi-encoder models
RKCrossEncoder: For cross-encoder models
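As a quick illustration, here is a minimal sketch, assuming RKSentenceTransformer mirrors the sentence-transformers constructor and encode() API (see the integration guide for the confirmed interface):
from rktransformers import RKSentenceTransformer
# encode() is assumed to behave like sentence-transformers' encode()
model = RKSentenceTransformer("rk-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(["First sentence", "Second sentence"])
print(embeddings.shape)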
Running Inference
Basic Inference
from transformers import AutoTokenizer
from rktransformers import RKModelForFeatureExtraction
tokenizer = AutoTokenizer.from_pretrained("rk-transformers/all-MiniLM-L6-v2")
model = RKModelForFeatureExtraction.from_pretrained(
"rk-transformers/all-MiniLM-L6-v2"
)
# Tokenize input
inputs = tokenizer(
"Sample text for embedding",
padding="max_length",
truncation=True,
return_tensors="np",
)
# Run inference
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
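The model returns per-token hidden states. To derive a single sentence embedding, a common approach (not specific to RK-Transformers) is attention-mask-weighted mean pooling; a minimal NumPy sketch:
import numpy as np
# Weight each token by the attention mask, then average over the sequence
mask = inputs["attention_mask"][..., None]  # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(axis=1)
counts = np.clip(mask.sum(axis=1), 1, None)  # avoid division by zero
embeddings = summed / counts  # (batch, hidden_size)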
Batch Inference
Process multiple inputs at once:
sentences = [
"First sentence",
"Second sentence",
"Third sentence",
]
inputs = tokenizer(
sentences,
padding="max_length",
truncation=True,
return_tensors="np",
)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape) # (3, seq_len, hidden_size)
Note
The batch size must match the batch_size configured during export.
If you export with --batch-size 1, you can only process one input at a time.
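If the model was exported with --batch-size 1, a simple workaround is to loop over inputs one at a time; a minimal sketch using only the objects shown above:
# Process sentences individually when the export used --batch-size 1
embeddings = []
for sentence in sentences:
    inputs = tokenizer(
        sentence,
        padding="max_length",
        truncation=True,
        return_tensors="np",
    )
    outputs = model(**inputs)
    embeddings.append(outputs.last_hidden_state)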
Input Padding
See Limitations: Input Padding for details.
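In short, RKNN models run with fixed input shapes, so tokenized inputs are padded to the sequence length chosen at export time. A sketch, assuming the exported length is exposed as model.max_seq_length (see Shape Mismatches below):
# Pad to the fixed sequence length the model was exported with
inputs = tokenizer(
    "Sample text for embedding",
    padding="max_length",
    max_length=model.max_seq_length,
    truncation=True,
    return_tensors="np",
)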
Return Types
Choose between a Hugging Face ModelOutput (dictionary-style) object and a plain tuple:
# Dictionary output (default)
outputs = model(**inputs, return_dict=True)
hidden_state = outputs.last_hidden_state
# Tuple output
outputs = model(**inputs, return_dict=False)
hidden_state = outputs[0]
Supported Tasks
RK-Transformers supports various tasks through specific model classes. See the API Reference for full details.
RKModelForFeatureExtraction: For computing embeddings
RKModelForSequenceClassification: For text classification
RKModelForTokenClassification: For named entity recognition, etc.
RKModelForQuestionAnswering: For extractive QA
RKModelForMaskedLM: For masked language modeling
RKModelForMultipleChoice: For multiple choice tasks
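For example, classification follows the same tokenize-then-call pattern as feature extraction. A hedged sketch with a hypothetical repository id, assuming outputs expose .logits as in transformers:
import numpy as np
from transformers import AutoTokenizer
from rktransformers import RKModelForSequenceClassification
# "your-org/your-classifier" is a placeholder for an exported classification model
tokenizer = AutoTokenizer.from_pretrained("your-org/your-classifier")
model = RKModelForSequenceClassification.from_pretrained("your-org/your-classifier")
inputs = tokenizer("Great product!", padding="max_length", truncation=True, return_tensors="np")
outputs = model(**inputs)
predicted_class = int(np.argmax(outputs.logits, axis=-1)[0])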
Performance Optimization
Core Mask Selection
Choose the optimal core mask for your workload (see NPU Core Configuration):
# Auto mode (recommended)
model = RKModelForFeatureExtraction.from_pretrained(
"rk-transformers/all-MiniLM-L6-v2",
core_mask="auto"
)
# All cores for maximum performance
model = RKModelForFeatureExtraction.from_pretrained(
"rk-transformers/all-MiniLM-L6-v2",
core_mask="all"
)
Input Tensor Format
Use NumPy arrays for better performance (avoid conversion overhead):
# NumPy (recommended for RKNN)
inputs = tokenizer(text, return_tensors="np")
# PyTorch (works but adds conversion overhead)
inputs = tokenizer(text, return_tensors="pt")
Model Selection
Choose the right model variant:
Float16: Higher accuracy, larger size, slower inference
INT8 (w8a8): Slightly lower accuracy, smaller size, faster inference
# Load quantized model for speed
model = RKModelForFeatureExtraction.from_pretrained(
"rk-transformers/all-MiniLM-L6-v2",
file_name="rknn/model_w8a8.rknn" # INT8 quantized
)
Error Handling
Runtime Errors
If inference fails:
Verify the target platform matches your device
Check RKNN toolkit version compatibility
Ensure sufficient system memory
Try different core mask settings
Verify input shapes match the exported model
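As a starting point, the core mask suggestion can be scripted as a fallback; a sketch (the exact exception type depends on the runtime, so a broad except is used here):
try:
    model = RKModelForFeatureExtraction.from_pretrained(
        "rk-transformers/all-MiniLM-L6-v2",
        core_mask="all",
    )
except Exception:
    # Fall back to automatic core selection if the "all" mask fails
    model = RKModelForFeatureExtraction.from_pretrained(
        "rk-transformers/all-MiniLM-L6-v2",
        core_mask="auto",
    )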
Shape Mismatches
If you get shape mismatch errors, check:
Input batch size doesn't exceed the model's configured batch size
Sequence length doesn’t exceed max_seq_length
Input tensors have correct dimensions
# Check model configuration
print(f"Batch size: {model.batch_size}")
print(f"Max seq length: {model.max_seq_length}")
print(f"Input names: {model.input_names}")