To enhance the AI inference capability of the Xuantie processor, T-Head proposes a matrix extension instruction set. The following walks through an AI inference example based on the T-Head open-source AI deployment package.
1 Docker
Pull the image hhb:2.1-matrix, which supports the matrix extension, from Docker Hub, start a container, and open an interactive terminal in the container:
docker pull hhb4tools/hhb:2.1-matrix
docker run -itd --name=your.hhb2.1-matrix -p 22 -v /your_mount_dir/:/mnt "hhb4tools/hhb:2.1-matrix"
docker exec -it your.hhb2.1-matrix /bin/bash
After entering the container, you can confirm the version with the hhb --version command:
root@c14249f8243c:/# hhb --version
HHB version: 2.1.x-matrix, build 20230131
For Docker installation, see the Docker Engine installation overview.
2 Model deployment
Take the deployment of MobileNet as an example. In /home/rvm_caffe_mv1_int8 there is already a complete Makefile; execute the make command to convert the model into the required sample program, which can be executed on RISC-V chips that support the matrix extension.
cd /home/rvm_caffe_mv1_int8
make
The key steps in the model deployment process are described below:
2.1 Model compilation
HHB is an offline AI model compilation and optimization tool. Executing the following command quantizes the original model, applies operator optimizations such as operator fusion, and generates a C code model that executes efficiently on the target chip.
hhb -C --calibrate-dataset ./cat.jpg --model-file ./mobilenetv1.prototxt ./mobilenetv1.caffemodel --data-scale 0.017 --data-mean '104 117 124' --output . --board rvm --quantization-scheme="int8_asym_w_sym" --pixel-format BGR --fuse-conv-relu --channel-quantization --target-layout NHWC
The model compilation options are described as follows:
- -C: specifies that the main command runs until C code is generated.
- --calibrate-dataset: specifies the calibration image used for quantization.
- --model-file: specifies the MobileNet model downloaded to the current directory. A Caffe model consists of two files; the order of the files after this option does not matter.
- --data-mean: specifies the per-channel mean subtracted from the input (illustrated in the sketch after this list).
- --data-scale: specifies the scale factor applied to the input after mean subtraction.
- --output: specifies the current directory as the path for the generated files.
- --board: specifies rvm as the target platform.
- --quantization-scheme: specifies the quantization scheme.
- --pixel-format: specifies the input image format required by the model. The default is RGB; set it to BGR if the model was trained on BGR images.
- --fuse-conv-relu: fuses ReLU into the preceding convolution layer.
- --channel-quantization: enables per-channel quantization of weights.
- --target-layout NHWC: specifies the tensor layout.
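As a concrete illustration of --data-mean and --data-scale: each input value is normalized as (x - mean[c]) * scale before quantization. Below is a minimal C sketch of that step, assuming the three mean values map to the B, G, R channels in that order (conventional for Caffe models) and a uint8 NHWC source image; the actual preprocessing lives in the generated process.c.

#include <stddef.h>

/* Normalize an HxWx3 BGR image in NHWC layout:
 * dst = (src - mean[c]) * scale, per channel. */
void preprocess_bgr_nhwc(const unsigned char *src, float *dst,
                         size_t height, size_t width)
{
    const float mean[3] = {104.0f, 117.0f, 124.0f}; /* --data-mean '104 117 124' */
    const float scale = 0.017f;                     /* --data-scale 0.017 */
    for (size_t i = 0; i < height * width; i++) {
        for (size_t c = 0; c < 3; c++) {            /* channel order: B, G, R */
            dst[i * 3 + c] = ((float)src[i * 3 + c] - mean[c]) * scale;
        }
    }
}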
After the command is executed, multiple files such as main.c and model.c will be generated in the current directory:
- data.0.tensor: the tensor obtained by decoding and preprocessing cat.jpg.
- data.0.bin: the binary form of data.0.tensor.
- main.c: the reference entry point of the sample program (a simplified sketch of its flow follows this list).
- model.c: a model structure file that describes the model structure.
- hhb.bm: the model file in HHB format.
- model.params: the weights converted to int8.
- io.c: helper functions for reading and writing files.
- io.h: declarations of the helper functions for reading and writing files.
- process.c: the image preprocessing functions.
- process.h: declarations of the image preprocessing functions.
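For orientation, the generated main.c drives the model through the SHL (CSI-NN2) session API. The sketch below illustrates that flow rather than reproducing the emitted code: read_file() and csinn_model_setup() are placeholders (HHB emits its own loader in io.c and its own graph-construction entry in model.c), while the csinn_* session calls are part of the library's public API.

#include <stdio.h>
#include <stdlib.h>
#include <csi_nn.h>

/* Placeholder for the graph-construction entry emitted in model.c. */
extern struct csinn_session *csinn_model_setup(char *params);

/* Minimal file loader; the generated io.c ships its own helpers. */
static char *read_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    char *buf = malloc(size);
    fread(buf, 1, size, f);
    fclose(f);
    return buf;
}

int main(int argc, char **argv)
{
    /* argv[1]: model.params, argv[2]: data.0.bin (preprocessed input) */
    char *params = read_file(argv[1]);
    float *input = (float *)read_file(argv[2]);

    struct csinn_session *sess = csinn_model_setup(params);

    struct csinn_tensor *in = csinn_alloc_tensor(NULL);
    in->data = input;
    csinn_update_input(0, in, sess);   /* bind input tensor 0 */
    csinn_session_run(sess);           /* run inference */

    struct csinn_tensor *out = csinn_alloc_tensor(NULL);
    csinn_get_output(0, out, sess);    /* fetch the output tensor */
    /* The sample program then sorts the output and prints the top-5 classes. */
    return 0;
}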
2.2 SHL library
SHL is a set of neural network library APIs for the Xuantie CPU platform and provides a series of optimized binary libraries. Through the matrix extension, SHL provides optimizations focused on convolution layers. In this example, the prebuilt inference library has already been placed in the /home/install_nn2 directory; the source code can also be downloaded and rebuilt with the following steps:
git clone -b matrix https://github.com/T-head-Semi/csi-nn2.git
cd csi-nn2
make nn2_rvm
make install
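After the build, you can confirm that the rvm-optimized static library linked later in this example was produced. The install path below is an assumption; check the repository's Makefile for the actual install prefix:

ls install_nn2/lib/libshl_rvm.a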
2.3 Executable program
After HHB completes code generation, execute the following compilation command to link the rvm high-performance library and generate the c_runtime program in the current directory:
riscv64-unknown-linux-gnu-gcc -O2 -g3 -march=rv64gcv_zfh_xtheadc -mabi=lp64d -I/home -I/home/install_nn2/include -I/home/decode/install/include -o c_runtime main.c model.c io.c process.c -L/home/install_nn2/lib -L/home/decode/install/lib/rv -ljpeg -lpng -lz -lstdc++ -lshl_rvm -lm -static -Wl,--gc-sections
The compilation options are described as follows:
- -O2 -g3: specifies the optimization option and debug-level.
- -march: specifies the architecture option for RISC-V matrix extension chip.
- -mabi: specifies the application binary interface (ABI) option for RISC-V matrix extension chip.
- -I: specifies the header file search paths used during compilation.
- main.c model.c io.c process.c: the source files to compile.
- -L: specifies the search paths of the libraries to link.
- -ljpeg: links to a JPEG decoding library.
- -lpng: links to a PNG decoding library.
- -lz: links to a zlib.
- -lstdc++: links to a standard C++ library.
- -lshl_rvm: links to an optimized version library of rvm in SHL.
- -lm: links to a standard math library.
- -static: specifies static linking.
- -Wl,--gc-sections: discards unused sections during linking.
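The complete Makefile in /home/rvm_caffe_mv1_int8 drives this step (together with model compilation); a minimal rule equivalent to the command above might look like this, using the paths from this example:

CC      = riscv64-unknown-linux-gnu-gcc
CFLAGS  = -O2 -g3 -march=rv64gcv_zfh_xtheadc -mabi=lp64d \
          -I/home -I/home/install_nn2/include -I/home/decode/install/include
LDFLAGS = -L/home/install_nn2/lib -L/home/decode/install/lib/rv \
          -ljpeg -lpng -lz -lstdc++ -lshl_rvm -lm -static -Wl,--gc-sections
SRCS    = main.c model.c io.c process.c

c_runtime: $(SRCS)
	$(CC) $(CFLAGS) -o $@ $(SRCS) $(LDFLAGS)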
The GCC toolchain version used in this example is V2.6.1; you can check it with the following command:
riscv64-unknown-linux-gnu-gcc -v
3 Simulation
After the compilation is complete, run the program with T-Head's QEMU; the top-5 results are printed to the terminal:
qemu-riscv64 -cpu rv64,x-v=true,vext_spec=v1.0,vlen=128,x-matrix=on,mlen=128 c_runtime model.params data.0.bin
The QEMU version used in this example is V6.0.94; you can check it with the following command:
qemu-riscv64 -version
4 Other
The RISC-V matrix extension also supports the fp16 data type. Modify the hhb compilation command as follows and keep the other steps unchanged to run inference with fp16:
hhb -C --calibrate-dataset ./cat.jpg --model-file ./mobilenetv1.prototxt ./mobilenetv1.caffemodel \
    --data-scale 0.017 --data-mean '104 117 124' --output . --board rvm --quantization-scheme="float16" \
    --pixel-format BGR --target-layout NHWC