NPU Core Configuration

Rockchip SoCs with multiple NPU cores support flexible core allocation through the core_mask parameter. Choosing a core mask that matches your workload and current system load can improve throughput, latency, or power consumption.

Note

core_mask is specified at inference time, not during export.

Available Core Masks

Value	Description	Use Case
`"auto"`	Automatic mode - selects idle cores dynamically	Recommended: Best for most scenarios
`"0"`	NPU Core 0 only	Fixed core assignment
`"1"`	NPU Core 1 only	Fixed core assignment
`"2"`	NPU Core 2 only	Fixed core assignment (RK3588 only)
`"0_1"`	NPU Core 0 and 1 simultaneously	Parallel execution across 2 cores for larger models
`"0_1_2"`	NPU Core 0, 1, and 2 simultaneously	Maximum parallelism (RK3588 only) for demanding models
`"all"`	All available NPU cores	Equivalent to `"0_1_2"` on RK3588, `"0_1"` on RK3576

Platform-Specific Notes

Platform	Available Cores	Recommended Default
RK3588	0, 1, 2 (3 cores)	`"auto"` or `"0_1_2"` for large models
RK3576	0, 1 (2 cores)	`"auto"` or `"0_1"` for large models
RK3566/RK3568	0 (1 core)	`"0"` (only option)

Warning

Attempting to use unavailable cores (e.g., "2" on RK3576) may result in a runtime error.
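
To catch an unavailable core before the model is loaded, the mask can be checked against the platform table. The following is a minimal sketch: the core counts are copied from the table above, and NPU_CORE_COUNT, KNOWN_MASKS, cores_in_mask, and validate_core_mask are hypothetical names for illustration, not part of rktransformers. It only checks core indices; it does not query the hardware.

```python
from __future__ import annotations

# Core counts copied from the platform table; the names below are
# hypothetical helpers, not part of rktransformers.
NPU_CORE_COUNT = {"rk3588": 3, "rk3576": 2, "rk3566": 1, "rk3568": 1}
KNOWN_MASKS = {"auto", "all", "0", "1", "2", "0_1", "0_1_2"}


def cores_in_mask(core_mask: str, platform: str) -> set[int]:
    """Return the core indices a mask may use on the given platform."""
    if core_mask not in KNOWN_MASKS:
        raise ValueError(f"unknown core_mask: {core_mask!r}")
    total = NPU_CORE_COUNT[platform.lower()]
    if core_mask in ("auto", "all"):
        # Both resolve to cores the platform actually has.
        return set(range(total))
    return {int(part) for part in core_mask.split("_")}


def validate_core_mask(core_mask: str, platform: str) -> str:
    """Return core_mask unchanged, or raise ValueError if it names a missing core."""
    total = NPU_CORE_COUNT[platform.lower()]
    missing = sorted(c for c in cores_in_mask(core_mask, platform) if c >= total)
    if missing:
        raise ValueError(
            f"core_mask {core_mask!r} needs core(s) {missing}, "
            f"but {platform} has only {total} NPU core(s)"
        )
    return core_mask
```

With this sketch, validate_core_mask("0_1", "rk3576") returns "0_1", while validate_core_mask("2", "rk3576") raises ValueError before any NPU call is made.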

Usage Examples

RK-Transformers API

from rktransformers import RKModelForFeatureExtraction

model = RKModelForFeatureExtraction.from_pretrained(
    "rk-transformers/all-MiniLM-L6-v2",
    core_mask="all"
)

Sentence Transformers Integration

from rktransformers import RKSentenceTransformer

model = RKSentenceTransformer(
    "rk-transformers/all-MiniLM-L6-v2",
    model_kwargs={
        "platform": "rk3588",
        "core_mask": "auto",
    },
)

CrossEncoder Integration

from rktransformers import RKCrossEncoder

model = RKCrossEncoder(
    "rk-transformers/ms-marco-MiniLM-L12-v2",
    model_kwargs={
        "platform": "rk3588",
        "core_mask": "auto",
    },
)

Performance Considerations

Single Core vs Multi-Core

Single Core ("0", "1", "2"):

Lower power consumption
Predictable latency
Good for lightweight models
Useful when cores are allocated to different tasks

Multi-Core ("0_1", "0_1_2", "all"):

Higher throughput
Better for large models
Potentially higher latency due to synchronization
Higher power consumption

Auto vs Manual Selection

Auto Mode ("auto"):

Pros:

RKNN runtime provides automatic load balancing
Adapts to system load
No manual tuning needed

Cons:

Less predictable core assignment
May not be optimal for all scenarios

Manual Mode (specific cores):

Pros:

Predictable behavior
Fine-grained control
Better for multi-model/multi-instance deployment

Cons:

Requires manual tuning
May not adapt to changing conditions

Best Practices

Start with "auto": Begin with automatic mode and measure performance
Benchmark different configurations: Test various core masks for your specific workload
Consider power constraints: Use fewer cores if power consumption is a concern
Monitor core utilization: Check which cores are busy before assigning cores manually. The ajokela/rktop tool can help monitor NPU core usage.
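
The benchmarking step can be sketched as a small timing harness. This is an illustration under assumptions, not an rktransformers utility: benchmark_core_masks is a hypothetical helper, and load_model is a caller-supplied factory that takes a core mask and returns a zero-argument callable running one inference.

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def benchmark_core_masks(
    load_model: Callable[[str], Callable[[], object]],
    core_masks: Iterable[str],
    warmup: int = 3,
    runs: int = 20,
) -> Dict[str, float]:
    """Return the median latency in milliseconds for each core mask."""
    results: Dict[str, float] = {}
    for mask in core_masks:
        # Hypothetical factory: builds a model for this mask and returns
        # a callable that performs a single inference.
        run_once = load_model(mask)
        for _ in range(warmup):
            run_once()  # warm-up runs are not timed
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            run_once()
            samples.append((time.perf_counter() - start) * 1000.0)
        results[mask] = statistics.median(samples)
    return results
```

In practice, load_model would construct the model with core_mask=mask (for example via from_pretrained, as in the usage examples) and return a closure that runs it on a fixed, representative input. The median is used here because it is less sensitive to occasional slow runs than the mean.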

Multi-Model Deployment

When running multiple models simultaneously, pin each model to its own core so that the models do not compete for the same NPU core:

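from rktransformers import (
    RKModelForFeatureExtraction,
    RKModelForMultipleChoice,
    RKModelForSequenceClassification,
)

# Model 1 on core 0
model1 = RKModelForFeatureExtraction.from_pretrained(
    "model1",
    core_mask="0"
)

# Model 2 on core 1
model2 = RKModelForSequenceClassification.from_pretrained(
    "model2",
    core_mask="1"
)

# Model 3 on core 2 (RK3588 only; other platforms have no core 2)
model3 = RKModelForMultipleChoice.from_pretrained(
    "model3",
    core_mask="2"
)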
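
When the number of models varies, the per-model masks can be generated instead of hard-coded. The sketch below uses round-robin assignment, which is an assumption made for illustration rather than a strategy the library prescribes; assign_core_masks is a hypothetical helper.

```python
from typing import Dict, Sequence


def assign_core_masks(model_ids: Sequence[str], num_cores: int) -> Dict[str, str]:
    """Map each model ID to a single-core mask, cycling through the cores."""
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    # Models beyond num_cores wrap around and share a core with an earlier model.
    return {
        model_id: str(index % num_cores)
        for index, model_id in enumerate(model_ids)
    }
```

With three models and num_cores=3 (RK3588), this reproduces the assignment above: "0", "1", and "2". With num_cores=2 (RK3576), the third model wraps around to core "0" and shares it with the first.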

Troubleshooting

Performance Issues

If performance is not as expected:

Try different core mask configurations
Ensure no other NPU-intensive tasks are running
Check CPU and memory usage (either may be the bottleneck)
Verify that the model is properly quantized for the target platform
Monitor NPU temperature (thermal throttling may occur)
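
One way to check the NPU temperature on Linux is the kernel's thermal sysfs interface, which reports each zone's temperature in millidegrees Celsius. The sketch below assumes the NPU sensor is exposed as a thermal zone whose type contains "npu"; that naming depends on the board's kernel and device tree, so treat it as an assumption. read_npu_temperatures is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict


def read_npu_temperatures(thermal_root: str = "/sys/class/thermal") -> Dict[str, float]:
    """Return {zone_type: degrees_celsius} for zones whose type mentions 'npu'."""
    temperatures: Dict[str, float] = {}
    for zone in sorted(Path(thermal_root).glob("thermal_zone*")):
        try:
            zone_type = (zone / "type").read_text().strip()
            millidegrees = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # skip zones that are unreadable or malformed
        # Assumption: the NPU zone's type contains "npu"; this varies by kernel.
        if "npu" in zone_type.lower():
            temperatures[zone_type] = millidegrees / 1000.0
    return temperatures
```

An empty result means no matching zone was found, not that the NPU is cool; in that case, list the zone types under the thermal directory to see what the kernel exposes.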