Project Introduction

What is this project doing?

This project is an introduction to AI computer vision: it hand-writes the resnet50 neural network from scratch and optimizes its performance on an Intel CPU.

First, we get a feel for what computer vision means by practicing some classic traditional computer vision algorithms; then, taking the resnet50 neural network as an example, we systematically explain the basic algorithmic principles and background knowledge of an AI model.

Finally, with the code in this repository, we hand-write the resnet50 neural network from scratch, use it to recognize arbitrary pictures, and then optimize the model's performance.

  • The traditional computer vision part includes small hands-on projects: grayscale conversion, RGB handling, mean/Gaussian filtering, edge detection with the Canny operator, and image segmentation with the Otsu algorithm.
  • The AI part starts with a handwritten-digit-recognition project (MNIST) as an introduction to the training and inference process of an AI model.
  • The AI principles part explains and analyzes in detail the algorithms and background knowledge used in resnet50.
  • The practical part hand-writes the resnet50 model from scratch in Python and C++.
    • All core algorithms and network structures of resnet50 (including Conv2d, AvgPool, MaxPool, FC, Relu, and the residual structure) are handwritten, without borrowing any third-party libraries.
    • Because the algorithms and model structure are handwritten, there is plenty of freedom for performance optimization. Performance optimization is the last part of this project: several versions are iterated, improving the model's performance step by step to a fairly good level.
  • On top of the handwritten algorithms, the practical part not only ensures usable accuracy (that is, given any picture, after preprocessing, Top1/Top5 predict it correctly) but also focuses on performance optimization, which is introduced later.
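As a taste of the traditional-CV part, grayscale conversion is just a weighted sum of the three RGB channels. A minimal sketch (using the ITU-R BT.601 luma weights; the repository's 0_gray code may differ in details):

```python
import numpy as np

def rgb_to_gray(img):
    """Convert an HxWx3 uint8 RGB image to grayscale.

    Uses the ITU-R BT.601 luma weights (0.299, 0.587, 0.114);
    the weighted sum is truncated back to uint8.
    """
    weights = np.array([0.299, 0.587, 0.114])
    return (img.astype(np.float32) @ weights).astype(np.uint8)

# Tiny example: one row of pure red, green, and blue pixels
# maps to gray values of roughly 76, 149, and 29.
img = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255]]], dtype=np.uint8)
print(rgb_to_gray(img))
```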

Why do we need to write all the core algorithms by hand?

There are many tutorials online that teach you how to build a neural network, but they are almost always based on torch's nn module (or similar), calling nn.conv2d to perform the convolution.

For students who want to dig deeper into the algorithms, and for many beginners, even after building a network or running inference by following such a tutorial, the principles remain a mystery; the learning stays superficial and never feels solid. I felt this acutely when I first started learning years ago.

In fact, nn.conv2d hides the conv2d implementation: you cannot see how it works, it is hard to learn the implementation details, and harder still to optimize performance on top of it (even though the interface itself is already well optimized).

So this project came about.

Initially I just wanted to hand-write a simple resnet50 model by myself.

Later, some friends got in touch wanting to learn along with me, so I started writing articles systematically. The articles kept growing, so I turned them into a booklet, and writing it motivated me to keep maintaining and updating this project. To this day I am still updating the code, writing comments for it, and writing related articles.

So, as you can see, the code in this project is free for everyone to download and study, but the 100+ companion articles are paid. If you are interested, you can take a look here.

The code in this project was first written in April 2023 and turned into a booklet in November 2023, after many rounds of debugging. All of it was written by hand by me.

At present, all the code in the project runs end to end, and the accuracy is good. After several rounds of iteration, performance has reached a satisfying level.

What you can learn

Through this project, you can get a glimpse of the classic algorithms of traditional computer vision, understand the connections and differences between traditional computer vision and deep-learning-based computer vision, and gain an in-depth understanding of all the algorithms used in resnet50, their background principles, the design ideas behind resnet50, its network structure, and common neural-network optimization methods.

You can follow the code in the project to actually run a resnet50 neural network and complete inference on one or more pictures.

The new_version_with_notes directory of the project contains a commented version of the code, with detailed text explanations of the key points.

If you work through the project code and the companion articles and practice them thoroughly, I think getting started with AI vision is not hard; and as for the classic resnet50 model, even a novice can move on to real practice after working through it completely.

Articles involved in the project

This project comes with 100+ introductory companion articles covering background knowledge, principle analysis, and code practice; a lot of effort went into writing them.

Repository structure

  • 0_gray: grayscale-image code
  • 1_RGB: grayscale/RGB conversion code
  • 2_mean_blur: mean-filtering code
  • 3_gussian_blur: Gaussian-filtering code
  • 4_canny: the Canny algorithm, used for image edge detection
  • 5_dajin: the Otsu algorithm, used for image segmentation
  • 6_minst: a classic handwritten-digit-recognition (MNIST) AI model (neural network) that can be trained and run on a laptop CPU
  • Practice: the main directory for the resnet50 work — algorithm handwriting, model construction and related tasks. It is the main directory for hand-writing resnet50 from scratch in this project, and contains:
    • model directory: files related to the open-source model, including weight downloading, parameter parsing, etc.
    • pics directory: where the pictures used for model inference are stored
    • python directory: the resnet50 project handwritten in Python
    • cpp directory: the resnet50 project handwritten in C++

The python directory and the cpp directory are independent of each other.

The cpp directory contains six directories, 1st through 6th, which are the performance-optimization iterations. They are independent of one another; you can run the code in any of them on its own, and compare them to see the performance gains brought by each optimization step.

  • new_version_with_notes directory: a new version of this repository containing all of the code above, with the same directory structure. The difference is that comments have been added to the code and some details optimized. First-time readers are advised to use the code in new_version_with_notes directly.

How I implemented handwritten resnet50 from scratch

Implementation ideas

Model acquisition

Using torchvision, save the weights of every layer of resnet50 from the pre-trained model into the repository. These weight files are loaded later to take part in the convolution, fully-connected, and BN computations.

A digression: in the model deployment of real industrial projects, the weights of a neural network are likewise loaded into the GPU/CPU as independent data to take part in the computation.

The performance bottleneck of many real models is in the weight-loading part. Why? I see several reasons:

  • Limited chip memory. The network's weights cannot all be loaded at once, and loading in several passes causes redundant IO operations. The smaller the memory, the more serious this problem is.
  • Limited chip bandwidth. As model parameter counts keep growing, moving GB-scale weights is increasingly difficult, and in many cases IO and computation cannot be fully pipelined on chip. The more compute power is piled on, the more IO stands out as the bottleneck.

In the model directory, run the following script to save the parameters to model/resnet50_weight:

$ python3 resnet50_parser.py

Code

With the weights saved, implement the core functions Conv2d, BatchNorm, Relu, AvgPool, MaxPool, and FullyConnect (MatMul) in Python and C++.
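To illustrate, here is a minimal from-scratch Conv2d of the kind this step produces (a simplified sketch: single image, HWC layout, no padding; the repository's version handles the full resnet50 cases):

```python
import numpy as np

def conv2d(x, weight, bias=None, stride=1):
    """Naive 2D convolution with explicit loops.

    x:      (H, W, Cin) input
    weight: (Cout, KH, KW, Cin) filters
    Returns (Ho, Wo, Cout). No padding; the loops are deliberately explicit
    to show every multiply-accumulate.
    """
    h, w, cin = x.shape
    cout, kh, kw, _ = weight.shape
    ho, wo = (h - kh) // stride + 1, (w - kw) // stride + 1
    out = np.zeros((ho, wo, cout), dtype=np.float32)
    for co in range(cout):
        for i in range(ho):
            for j in range(wo):
                acc = 0.0
                for ki in range(kh):
                    for kj in range(kw):
                        for ci in range(cin):
                            acc += x[i * stride + ki, j * stride + kj, ci] \
                                   * weight[co, ki, kj, ci]
                out[i, j, co] = acc + (bias[co] if bias is not None else 0.0)
    return out
```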

Then assemble these operators according to the resnet50 network structure.

Inference

Once the code is implemented, the basic operators and parameters needed to run the model are in place. Next, read a local picture and run inference.

After reading the picture, inference starts and correctly identifies it as a cat. With that, the first-stage goal of this project (accuracy verification) is complete.
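For reference, Top-1/Top-5 are simply read off the network's final class scores. A small sketch (the score values are made up for illustration; 281 is the ImageNet "tabby cat" index):

```python
import numpy as np

def topk(logits, k=5):
    """Return the k class indices with the highest scores, best first."""
    return np.argsort(logits)[::-1][:k].tolist()

# Hypothetical scores over 1000 ImageNet classes. If the true class is 281,
# it being first means Top-1 is correct; appearing anywhere in the list
# means Top-5 is correct.
logits = np.zeros(1000)
logits[281] = 9.0
logits[285] = 7.5
print(topk(logits))  # 281 first, then 285
```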

Optimization

After the basic functions are implemented, performance optimization begins.

Performance optimization is a major focus in neural networks and will be explained in a separate chapter below.

Performance optimization

Python version

This part covers performance optimization of the Python version. First, let's look at how to use the Python code in this repository.

How to use the Python version

  1. The core resnet50 algorithms and the hand-built network are written in basic Python syntax; only a few very basic operations call the numpy library.
  2. Image loading calls the pillow library. Loading images is not part of hand-writing the resnet50 core algorithms from scratch, and I did not have time to reimplement it, so pillow is used directly.
  3. Install the dependencies, mainly those of the two libraries above (the Tsinghua mirror is faster inside China; choose as needed). In the python directory, execute:

Without the Tsinghua mirror:

$ pip3 install -r requirements.txt

With the Tsinghua mirror:

$ pip3 install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
  4. Inference
  • In the python directory, run the following command to perform inference. You can modify the image-loading logic in my_infer.py and substitute your own picture to see whether it is identified correctly.

$ python3 my_infer.py

Since the Python version calls almost no third-party libraries, convolution loops written in plain Python syntax are, predictably, terribly slow. In practice, inference on one picture with the Python version is very slow, mainly because of the sheer number of loops (kept deliberately, to show the algorithm's internals).

A little optimization of the python version

Use np.dot (an inner-product operation) to replace the multiply-accumulate loops of the convolution.
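Concretely, each output pixel's multiply-accumulate can be collapsed into one vectorized dot product between the flattened receptive field and the flattened filters. A sketch of the idea (not the repository's exact code):

```python
import numpy as np

def conv2d_dot(x, weight, stride=1):
    """Conv2d with the inner multiply-accumulate loops replaced by a dot product.

    x: (H, W, Cin), weight: (Cout, KH, KW, Cin). Produces the same result as
    the naive loop version, but each output position is one vectorized
    matrix-vector product handled by numpy.
    """
    h, w, cin = x.shape
    cout, kh, kw, _ = weight.shape
    ho, wo = (h - kh) // stride + 1, (w - kw) // stride + 1
    wmat = weight.reshape(cout, -1)          # (Cout, KH*KW*Cin)
    out = np.empty((ho, wo, cout), dtype=np.float32)
    for i in range(ho):
        for j in range(wo):
            patch = x[i * stride:i * stride + kh,
                      j * stride:j * stride + kw, :].reshape(-1)
            out[i, j, :] = wmat @ patch      # one dot product per position
    return out
```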

Without third-party libraries, many optimizations are out of reach in Python (for example, the instruction set and memory layout are hard to control), so the focus below is on optimizing the C++ version.

C++ version

This part is the performance optimization of the C++ version. Let’s first look at how to use the C++ code in this repository.

How to use the C++ version

The C++ code in this repository has gone through several optimization commits, each building on the previous one. The optimization record is easy to follow from the directory names in cpp.

  • The cpp/1st_origin directory stores the first version of the C++ code.
  • The cpp/2nd_avx2 directory stores the second version, which enables the avx2 instruction set and the -Ofast compilation option.
  • The cpp/3rd_preload directory stores the third version, which pre-loads the weights via a memory-pool-like mechanism, while still dynamically malloc'ing the input and output of each layer.
  • The cpp/4th_no_malloc directory stores the fourth version, which removes all dynamic memory allocation and greatly improves performance.
  • The cpp/5th_codegen directory stores the fifth version, which uses CodeGen and JIT compilation to generate the core computing logic.
  • The cpp/6th_mul_thread directory stores the sixth version, which uses multi-threading to optimize the convolution and greatly improves performance.

Compile

The files of each version are self-contained, with no cross-dependencies. To see what changed between two versions, use a source-diff tool.

The compilation process is the same for every version. If you only have a Windows environment and no Linux, see How to quickly install a linux system under windows in 10 minutes without a virtual machine, which lets you set up a Linux system quickly. Readers of the paid articles get more detailed installation instructions.

If you have a Linux environment and are familiar with Linux operations, please read directly below:

  • Compiling the C++ version depends on the opencv library, used to load images (its role is analogous to pillow in the Python version). On Linux, install it with:

$ sudo apt-get install libopencv-dev python3-opencv libopencv-contrib-dev

  • In the cpp directory, run compile.sh to compile:

$ bash ./compile.sh

After compilation, an executable named resnet appears in the current directory. Running it directly performs inference on the picture saved in the repository and prints the result.

$ ./resnet

Initial version one

The directory is cpp/1st_origin .

The first version makes no attempt at performance; it simply implements the functionality as conceived. As you can imagine, performance is terrible. This version's numbers:

Average Latency: 16923 ms

Average Throughput: 0.059 fps

Performance depends on your machine; try running it yourself and see what Latency is printed.

Optimized version two

The directory is cpp/2nd_avx2 .

Building on the first version, the second version uses the vector instruction set to parallelize and accelerate the multiply-accumulate loop in the convolution algorithm. The instruction set used is avx2. You can check whether your CPU supports avx2 with the following command.

$ cat /proc/cpuinfo

If avx2 appears in the output, the instruction set is supported.

Performance data for this version:

Average Latency: 4973 ms

Average Throughput: 0.201 fps

Optimized version three

The directory is cpp/3rd_preload .

The third version, building on the second, removes the dynamic malloc of weight parameters during inference: a memory-pool-like structure managed with std::map loads all weight parameters before inference begins. This optimization has practical significance in real model deployment.

Early loading of model parameters can minimize the IO pressure on the system and reduce latency.
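The C++ code manages this pool with std::map; as a language-neutral illustration, here is the same pattern sketched in Python (the file names and sizes are made up):

```python
import os
import tempfile

import numpy as np

def preload_weights(weight_dir):
    """Load every weight file into a dict up front (analogue of the C++
    std::map memory pool): one pass of IO before inference, zero IO after."""
    pool = {}
    for fname in os.listdir(weight_dir):
        if fname.endswith(".bin"):
            pool[fname[:-4]] = np.fromfile(
                os.path.join(weight_dir, fname), dtype=np.float32)
    return pool

# Demo with two fake weight files (hypothetical names and sizes).
d = tempfile.mkdtemp()
np.ones(8, dtype=np.float32).tofile(os.path.join(d, "conv1_weight.bin"))
np.zeros(4, dtype=np.float32).tofile(os.path.join(d, "fc_bias.bin"))
pool = preload_weights(d)
print(sorted(pool))  # ['conv1_weight', 'fc_bias']
```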

Performance data for this version:

Average Latency: 862 ms

Average Throughput: 1.159 fps

Optimized version four

The directory is cpp/4th_no_malloc .

The fourth version, building on the third, removes all dynamic memory allocation and string-related operations from the inference path.
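A Python/numpy analogue of the idea (the real optimization is in C++): allocate every intermediate buffer once up front, then reuse it in the hot loop so inference itself performs no allocation:

```python
import numpy as np

# Stand-in data for one layer's input and weights (shapes are illustrative).
a = np.random.rand(64, 64).astype(np.float32)
b = np.random.rand(64, 64).astype(np.float32)

# Allocated once, outside the loop -- the analogue of removing malloc
# from the per-inference path.
buf = np.empty((64, 64), dtype=np.float32)

for _ in range(10):              # stand-in for repeated per-layer inference
    np.matmul(a, b, out=buf)     # writes into buf; no new array is created
```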

Performance data for this version:

Average Latency: 742 ms

Average Throughput: 1.347 fps

Optimized version five

The directory is cpp/5th_codegen .

The fifth version, building on the fourth, uses CodeGen to generate the core computing logic and JIT compilation to compile it.

Performance data for this version:

Average Latency: 781 ms

Average Throughput: 1.281 fps

Optimized version six

The directory is cpp/6th_mul_thread .

The sixth version, building on the fifth, uses multi-threading to optimize the convolution calculation: the co (output-channel) dimension is split independently across threads, using all of the CPU's threads.
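The real implementation uses native C++ threads; here is a Python sketch of the same idea of splitting the co dimension across threads (the chunk size and shapes are illustrative):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def conv_channel_range(x, wmat, out, co_lo, co_hi, kh, kw):
    """Compute output channels [co_lo, co_hi) for every spatial position."""
    ho, wo, _ = out.shape
    for i in range(ho):
        for j in range(wo):
            patch = x[i:i + kh, j:j + kw, :].reshape(-1)
            out[i, j, co_lo:co_hi] = wmat[co_lo:co_hi] @ patch

# Split the 8 output channels into chunks of 2, one task per chunk; each
# thread writes a disjoint slice of `out`, so no locking is needed.
x = np.random.rand(6, 6, 3).astype(np.float32)
weight = np.random.rand(8, 3, 3, 3).astype(np.float32)  # (Cout, KH, KW, Cin)
wmat = weight.reshape(8, -1)
out = np.empty((4, 4, 8), dtype=np.float32)
with ThreadPoolExecutor(max_workers=4) as pool:
    for lo in range(0, 8, 2):
        pool.submit(conv_channel_range, x, wmat, out, lo, lo + 2, 3, 3)
```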

Performance data for this version:

Average Latency: 297 ms

Average Throughput: 3.363 fps

After six rounds of optimization, inference latency has dropped from 16923 ms to 297 ms, a nearly 60x improvement. Inferring one picture no longer feels sluggish, which is a good result.
