# Quick Start

This guide will get you started with RK-Transformers in minutes.
## Export a Model to RKNN

### RK-Transformers CLI Help

Display the help message for the export command:

```bash
rk-transformers-cli export -h
```
### Basic Export (Float16)

Export a Sentence Transformer model from the Hugging Face Hub:

```bash
rk-transformers-cli export \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --platform rk3588 \
  --flash-attention \
  --optimization-level 3
```
### Export with Quantization (INT8)

Export with a custom calibration dataset for quantization:

```bash
rk-transformers-cli export \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --platform rk3588 \
  --flash-attention \
  --quantize \
  --dtype w8a8 \
  --dataset sentence-transformers/natural-questions \
  --dataset-split train \
  --dataset-columns answer \
  --dataset-size 128 \
  --max-seq-length 128
```
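A quick way to gauge quantization quality is to encode the same sentences with the float16 and w8a8 variants and compare. The sketch below reuses the Hub repo and `file_name` convention from the inference examples further down; that the default file is the float16 build is an assumption, so substitute the file names your export actually produced.

```python
import numpy as np
from rktransformers import RKSentenceTransformer

sentences = ["The cat sat on the mat.", "A quick brown fox."]

# Assumption: loading without file_name picks up the float16 build.
fp16 = RKSentenceTransformer("rk-transformers/all-MiniLM-L6-v2")
w8a8 = RKSentenceTransformer(
    "rk-transformers/all-MiniLM-L6-v2",
    model_kwargs={"file_name": "rknn/model_w8a8.rknn"},
)

a = fp16.encode(sentences)
b = w8a8.encode(sentences)

# Per-sentence cosine similarity; values near 1.0 mean little quantization drift.
cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
print(cos)
```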
### Export Local ONNX Model

```bash
rk-transformers-cli export \
  --model ./my-model/model.onnx \
  --platform rk3588 \
  --flash-attention \
  --batch-size 4
```
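RKNN graphs are typically compiled with fixed input shapes, so a model exported with `--batch-size 4` expects exactly four sequences per call, padded to the fixed sequence length (this is why the inference examples below tokenize with `padding="max_length"`). A minimal sketch of feeding such a model, assuming the export directory (path illustrative) loads via `RKModelForFeatureExtraction` and contains the tokenizer files:

```python
from transformers import AutoTokenizer
from rktransformers import RKModelForFeatureExtraction

# Paths are illustrative -- point these at your export directory.
tokenizer = AutoTokenizer.from_pretrained("./my-model-rknn")
model = RKModelForFeatureExtraction.from_pretrained("./my-model-rknn")

# The batch must match the size baked in at export time (--batch-size 4 here).
texts = ["first", "second", "third", "fourth"]
inputs = tokenizer(
    texts,
    padding="max_length",  # pad to the fixed sequence length
    truncation=True,
    return_tensors="np",
)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```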
### Programmatic Export

```python
from rktransformers.export import (
    OptimizationConfig,
    QuantizationConfig,
    RKNNConfig,
    export_rknn,
)

config = RKNNConfig(
    model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",
    output_path="./my-exported-model",
    target_platform="rk3588",
    batch_size=1,
    max_seq_length=128,
    quantization=QuantizationConfig(
        quantized_dtype="w8a8",
        dataset_name="wikitext",
        dataset_size=100,
    ),
    optimization=OptimizationConfig(optimization_level=3),
)

export_rknn(config)
```
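Once `export_rknn` returns, the artifact in `output_path` can be loaded for inference. A quick sanity check, assuming local directory paths are accepted the same way as Hub repo ids:

```python
from rktransformers import RKSentenceTransformer

# Load the model exported above from its local output_path.
model = RKSentenceTransformer("./my-exported-model")
print(model.encode(["hello world"]).shape)
```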
## Run Inference

### Using Sentence Transformers

#### SentenceTransformer
```python
from rktransformers import RKSentenceTransformer

model = RKSentenceTransformer(
    "rk-transformers/all-MiniLM-L6-v2",
    model_kwargs={
        "core_mask": "all",
    },
)

sentences = ["This is a test sentence", "Another example"]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384)
```

```python
# Load specific quantized model file
model = RKSentenceTransformer(
    "rk-transformers/all-MiniLM-L6-v2",
    model_kwargs={
        "file_name": "rknn/model_w8a8.rknn",
    },
)
```
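`encode` returns a NumPy array, so computing semantic similarity is plain vector math on the embeddings from the first snippet (NumPy only, no extra API assumptions):

```python
import numpy as np

# Cosine similarity between the two sentences encoded above.
a, b = embeddings[0], embeddings[1]
score = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(score)
```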
#### CrossEncoder

```python
from rktransformers import RKCrossEncoder

model = RKCrossEncoder(
    "rk-transformers/ms-marco-MiniLM-L12-v2",
    model_kwargs={"core_mask": "auto"},
)

# Score query-passage pairs
pairs = [
    ["How old are you?", "What is your age?"],
    ["Hello world", "Hi there!"],
    ["What is RKNN?", "This is a test."],
]
scores = model.predict(pairs)
print(scores)

# Rank a list of documents against a single query
query = "Hi there!"
documents = [
    "What is going on?",
    "I am 25 years old.",
    "This is a test.",
    "RKNN is a neural network toolkit.",
]
results = model.rank(query, documents)
print(results)
```
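A minimal sketch of consuming the ranking, assuming `rank()` mirrors sentence-transformers' `CrossEncoder.rank` and returns dicts with `corpus_id` and `score` keys (check your version's output if it differs):

```python
# Print documents from most to least relevant (result format assumed, see above).
for hit in results:
    print(f"{hit['score']:.3f}  {documents[hit['corpus_id']]}")
```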
### Using RK-Transformers API

See task-specific models and their usage in the API docs: Task-Specific Models.
```python
from transformers import AutoTokenizer
from rktransformers import RKModelForFeatureExtraction

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("rk-transformers/all-MiniLM-L6-v2")
model = RKModelForFeatureExtraction.from_pretrained(
    "rk-transformers/all-MiniLM-L6-v2",
    core_mask="auto",
)

# Tokenize and run inference
inputs = tokenizer(
    ["Sample text for embedding"],
    padding="max_length",
    truncation=True,
    return_tensors="np",
)
outputs = model(**inputs)
embeddings = outputs.last_hidden_state.mean(axis=1)  # Mean pooling
print(embeddings.shape)  # (1, 384)
```
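The simple mean above averages over every position, including padding tokens, which skews embeddings for short inputs under `padding="max_length"`. A mask-aware variant (plain NumPy, no extra API assumptions):

```python
import numpy as np

# Weight each token by its attention mask so padding does not dilute the mean.
mask = np.asarray(inputs["attention_mask"])[..., None]  # (batch, seq, 1)
summed = (outputs.last_hidden_state * mask).sum(axis=1)
counts = np.clip(mask.sum(axis=1), 1e-9, None)
embeddings = summed / counts
print(embeddings.shape)  # (1, 384)
```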
### Using Transformers Pipelines

```python
from transformers import pipeline
from rktransformers import RKModelForMaskedLM

# Load the RKNN model
model = RKModelForMaskedLM.from_pretrained(
    "rk-transformers/bert-base-uncased",
    file_name="rknn/model_w8a8.rknn",
)

# Create a fill-mask pipeline with the RKNN-accelerated model
fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer="rk-transformers/bert-base-uncased",
    framework="pt",  # required for RKNN
)

# Run inference
results = fill_mask("Paris is the [MASK] of France.")
print(results)
```
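Each result follows the standard transformers fill-mask format: a dict with `token_str`, `score`, and the filled-in `sequence`. For example:

```python
# Show the top predictions with their confidence scores.
for r in results:
    print(f"{r['score']:.3f}  {r['token_str']}")
```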
## Next Steps

- Read about Model Export for advanced export options
- Learn about NPU Core Configuration
- Explore the API Reference for detailed API documentation