WO2020114311A1 - CPU-GPU heterogeneous SoC performance characterization method based on machine learning - Google Patents

CPU-GPU heterogeneous SoC performance characterization method based on machine learning

Info

Publication number
WO2020114311A1
Authority
WO
WIPO (PCT)
Prior art keywords
gpu
cpu
performance
data
machine learning
Prior art date
Application number
PCT/CN2019/121592
Other languages
French (fr)
Chinese (zh)
Inventor
喻之斌
林灵锋
伍浩文
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 filed Critical 中国科学院深圳先进技术研究院
Publication of WO2020114311A1 publication Critical patent/WO2020114311A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline or look ahead

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Disclosed is a CPU-GPU heterogeneous SoC performance characterization method based on machine learning, relating to the field of information technology. The method comprises the following steps: S1: collecting big performance data, including data on CPU hardware events and data on GPU hardware events; S2: processing the collected big performance data; S3: characterizing the performance of the CPU and GPU; S4: collecting and analyzing system energy consumption. On the one hand, the method enables a user to obtain the performance characteristics of an artificial intelligence program by monitoring hardware events on the CPU and GPU, thus providing guidance for optimizing the artificial intelligence program; on the other hand, the method provides guidance for the user, according to the performance characteristics of the artificial intelligence program, to specifically optimize the compiler or the computer microarchitecture for adaptation to the artificial intelligence program; finally, the method enables the user to monitor and analyze the CPU and GPU according to the described monitoring strategy and analysis method.

Description

Method for characterizing CPU and GPU heterogeneous SoC performance based on machine learning
Technical field
The present invention relates to the field of information technology, and in particular to a method for characterizing the performance of a CPU-GPU heterogeneous SoC based on machine learning.
Background art
The current mainstream processors are heterogeneous systems-on-chip (SoCs) containing CPU cores and GPU cores, and research on and application of artificial intelligence are developing rapidly. To understand the performance characteristics of such heterogeneous SoCs when running artificial intelligence programs, a performance characterization model for CPU-GPU heterogeneous systems is proposed.
Characterizing processor performance can help improve the server architecture design of data centers. In addition, analyzing the performance characteristics of a processor helps optimize compilers so that programs execute faster. Processor performance characteristics also provide an important reference for the analysis and optimization of many applications.
The current mainstream CPU performance analysis usually uses the Top-Down method proposed by Ahmad Yasin. This method builds a hierarchical top-down tree structure based on the perf tool in the Linux kernel. The weights of the tree nodes guide the user to focus on the factors that really have influence and to ignore the unimportant parts. The premise of this method is that the user selects the processor microarchitecture events of interest, yet the number of Intel processor microarchitecture events is large, ranging from 338 to 1423, so it is difficult for this method to analyze CPU performance characteristics comprehensively.
The current method for characterizing the performance of CPU-GPU heterogeneous SoC processors on artificial intelligence benchmarks was proposed by Mauricio Guignard et al. It characterizes the performance of artificial intelligence programs running on a heterogeneous SoC and determines the platform's performance bottlenecks, in order to identify the types of operations that consume the most time, to evaluate the similarity of deep learning models from their differing training and inference performance, and to understand parallel scalability. As a result, it is difficult to analyze in depth the performance characteristics of heterogeneous SoCs and the reasons behind them. In addition, this method provides no insight into energy consumption.
Summary of the invention
To solve the above problems in the background art, the present invention proposes a machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance. On the one hand, by monitoring hardware events on the CPU and GPU sides, users can obtain performance characteristics that reflect an artificial intelligence program, thereby providing guidance for optimizing the program; on the other hand, based on the performance characteristics of artificial intelligence programs, users can obtain specific guidance for optimizing compilers or computer microarchitectures to better suit artificial intelligence programs. Finally, users can monitor and analyze the CPU and GPU through the monitoring strategies and analysis methods used in this framework.
The technical solution of the present invention to the above problems is a machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance, which comprises the following steps:
S1: Collect big performance data; the big performance data includes CPU hardware event data and GPU hardware event data;
S2: Process the collected big performance data;
S3: Characterize the performance of the CPU and GPU;
S4: Collect and analyze system energy consumption.
Further, the above step S1 includes:
S101: Collect CPU hardware events according to the One Counter One Event (OCOE) mode;
S102: Use the perf tool to specify the event codes to be collected and the collection interval;
S103: Collect GPU hardware events according to the One Running One Event (OROE) mode;
S104: Use the nvprof tool to specify the event codes to be collected.
Further, the above step S2 includes:
S201: The CPU hardware event processing part first converts the raw format of the events collected at runtime into a form with multiple columns per sampling interval, then stitches the columns from different runs into a large data matrix, whose last column is the IPC.
S202: The GPU hardware event processing part first converts the kernel names into a standard format, then aggregates the values of the monitored events by kernel, stitches them into a large kernel data matrix, and finally appends the IPC as the last column.
Further, the above step S3 includes:
S301: Train a GBRT machine learning model with the large data matrix of the CPU part, rank the features, and obtain the 10 CPU hardware events with the greatest impact on the IPC;
S302: Train multiple GBRT machine learning models with the large kernel data matrix of the GPU part, in order of the time consumed, rank the features, and obtain the 10 GPU hardware events with the greatest impact on the IPC.
Further, the above step S4 includes:
S401: Use nvprof to measure the electrical energy consumed by each GPU;
S402: Use the UNIT-T UT230A/C-II power meter to measure the actual electricity consumed by the server.
Advantages of the invention:
The present invention is a machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance. It collects CPU hardware event and GPU hardware event information through the Linux kernel tool perf and the NVIDIA monitoring tool nvprof. The performance of the heterogeneous CPU-GPU SoC system is then analyzed and characterized by the performance data processing module, the performance characterization module, and the energy consumption collection and analysis module, providing more reliable and detailed suggestions for processor performance characterization. On the one hand, by monitoring hardware events on the CPU and GPU sides, users can obtain performance characteristics that reflect an artificial intelligence program, thereby providing guidance for optimizing the program; on the other hand, based on the performance characteristics of artificial intelligence programs, users can obtain specific guidance for optimizing compilers or computer microarchitectures to better suit artificial intelligence programs. Finally, users can monitor and analyze the CPU and GPU through the monitoring strategies and analysis methods used in this framework.
Brief description of the drawings
FIG. 1 is a flowchart of a machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance in an embodiment of the present invention;
FIG. 2 is a design diagram of a machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance in an embodiment of the present invention;
FIG. 3 is a schematic diagram of the matrix stitching method for CPU hardware event data in an embodiment of the present invention;
FIG. 4 is a schematic diagram of the matrix stitching method for GPU hardware event data in an embodiment of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Referring to FIG. 1, a machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance mainly includes four parts:
S1: Collect big performance data; the big performance data includes CPU hardware event data and GPU hardware event data;
S2: Process the collected big performance data. The processing covers both CPU data and GPU data: the CPU hardware event data from each monitoring run must be merged into a large data matrix, and the GPU hardware event data must be merged into a large data matrix by kernel function.
S3: Characterize the performance of the CPU and GPU, including modeling the CPU and GPU hardware event data separately and ranking the features; the ten most important features are selected as the basis for performance characterization.
S4: Collect and analyze system energy consumption, including monitoring the power consumption of the whole machine and of each GPU.
Referring to FIG. 2, the above step S1 includes:
S101: Collect CPU hardware events according to the One Counter One Event (OCOE) mode;
S102: Use the perf tool to specify the event codes to be collected and the collection interval;
S103: Collect GPU hardware events according to the One Running One Event (OROE) mode;
S104: Use the nvprof tool to specify the event codes to be collected.
Specifically, step S1 is performed on both the CPU side and the GPU side:
On the CPU side, the present invention uses the Linux kernel component perf. Perf is a monitoring tool that reads the performance counters through the Linux kernel. In the present invention, the artificial intelligence program is run on a server, and a program that watches the process name detects when the artificial intelligence program starts executing; once it starts, perf monitoring is started. Following the OCOE mode, perf monitoring specifies how many hardware events are monitored on each run of the program. The PMU of the Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processor used in the present invention provides 6 performance counters, so 6 hardware events are monitored at a time, including 2 resident events: instructions and cycles. The monitoring interval is 1000 milliseconds. Monitoring stops when the program finishes. To collect the values of all events, the program must be run multiple times.
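For illustration only, the following Python sketch shows how such an OCOE collection run could be scripted: it waits for the AI process to appear, then attaches perf stat with one batch of six events and a 1000 ms interval. The event list, the watched process name, the output path, and the use of the psutil library are assumptions made for the example, not part of the invention.

```python
import subprocess
import time

import psutil  # assumed helper for looking up the AI process; not part of the invention

# Placeholder event batch; real runs cycle through all events of interest,
# six at a time, with instructions and cycles always occupying two counters.
EVENT_BATCH = ["instructions", "cycles", "cache-misses", "branch-misses",
               "LLC-loads", "LLC-load-misses"]

def wait_for_process(name):
    """Poll until a process with the given name appears, then return its PID."""
    while True:
        for p in psutil.process_iter(["pid", "name"]):
            if p.info["name"] == name:
                return p.info["pid"]
        time.sleep(0.1)

def collect_batch(pid, events, out_file):
    """Attach perf stat to the running AI program and sample every 1000 ms."""
    cmd = ["perf", "stat",
           "-e", ",".join(events),
           "-I", "1000",        # 1000 ms sampling interval
           "-x", ",",           # CSV output, convenient for later matrix building
           "-p", str(pid),
           "-o", out_file]
    return subprocess.Popen(cmd)

if __name__ == "__main__":
    pid = wait_for_process("ai_benchmark")    # hypothetical AI process name
    monitor = collect_batch(pid, EVENT_BATCH, "perf_run01.csv")
    monitor.wait()                            # perf exits when the program finishes
```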
On the GPU side, the present invention uses the NVIDIA monitoring tool nvprof. Nvprof is a profiling tool dedicated to NVIDIA GPUs that can monitor CUDA, OpenACC or OpenMP applications. The present invention likewise runs the artificial intelligence program on the server, and the executable command to be run is passed as an argument to the nvprof tool. Because NVIDIA does not disclose the number of GPU performance counters, the present invention selects a subset of hardware events and monitors one event per program run. Specifying --print-gpu-trace on records the value of the event every time a kernel function is called. To collect the values of all hardware events, the program must be run multiple times.
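A comparable sketch for the GPU side, assuming one hardware event per run (OROE) and per-kernel tracing; the event names and the profiled command are illustrative placeholders.

```python
import subprocess

# Illustrative subset of GPU hardware events; the invention monitors one per run.
GPU_EVENTS = ["shared_load", "shared_store", "branch", "divergent_branch"]

def profile_event(event, run_id):
    """Run the AI program once under nvprof, recording the event per kernel launch."""
    cmd = ["nvprof",
           "--events", event,
           "--print-gpu-trace",                       # one record per kernel invocation
           "--csv",
           "--log-file", f"nvprof_{run_id}_{event}.csv",
           "python", "train.py"]                      # hypothetical benchmark command
    subprocess.run(cmd, check=True)

for i, event in enumerate(GPU_EVENTS):
    profile_event(event, i)
```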
The above step S2 includes:
S201: The CPU hardware event processing part first converts the raw format of the events collected at runtime into a form with multiple columns per sampling interval, then stitches the columns from different runs into a large data matrix, whose last column is the IPC.
S202: The GPU hardware event processing part first converts the kernel names into a standard format, then aggregates the values of the monitored events by kernel, stitches them into a large kernel data matrix, and finally appends the IPC as the last column.
Specifically, step S2 is performed on both the CPU side and the GPU side:
On the CPU side, the hardware events are organized into a large data matrix, shown as Mij in FIG. 3. The columns of the matrix are hardware events and the rows are the sampling intervals. First, the monitoring data produced by one run of the program is converted into a small data matrix, such as the small matrix mij in the upper left corner of FIG. 3; the columns of the small data matrix are the hardware events other than instructions and cycles monitored by perf during that run (E1, E2, E3, E4 in the small matrix in the upper left corner of FIG. 3), and the rows are the monitoring intervals. The last column is the IPC, which is computed from instructions and cycles. Then the non-IPC columns of all the small data matrices are stitched into a large data matrix; the stitching places the data produced by each monitoring run at the diagonal position of the large data matrix, as shown in FIG. 3, and the last column is the IPC, which is used as the label data during model training.
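A minimal numpy sketch of the diagonal stitching described above, assuming each run has already been parsed into an (intervals x events) block plus an IPC column; zero-filling the off-diagonal positions is an assumption, since FIG. 3 only specifies the diagonal placement.

```python
import numpy as np

def stitch_cpu_matrix(small_blocks, ipc_columns):
    """Place each run's event block on the diagonal of a large matrix,
    zero-fill the off-diagonal positions, and append IPC as the last column.

    small_blocks : list of (rows_i, cols_i) arrays, one per monitoring run
    ipc_columns  : list of (rows_i,) arrays with IPC = instructions / cycles
    """
    total_rows = sum(b.shape[0] for b in small_blocks)
    total_cols = sum(b.shape[1] for b in small_blocks)
    big = np.zeros((total_rows, total_cols + 1))    # +1 for the IPC label column

    r, c = 0, 0
    for block, ipc in zip(small_blocks, ipc_columns):
        rows, cols = block.shape
        big[r:r + rows, c:c + cols] = block         # diagonal placement
        big[r:r + rows, -1] = ipc                   # IPC as the training label
        r += rows
        c += cols
    return big

# Toy usage: two runs, each with 3 sampling intervals and 4 non-resident events.
run1, run2 = np.random.rand(3, 4), np.random.rand(3, 4)
ipc1, ipc2 = np.random.rand(3), np.random.rand(3)
M = stitch_cpu_matrix([run1, run2], [ipc1, ipc2])
print(M.shape)   # (6, 9)
```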
On the GPU side, the hardware events are organized into a large data matrix by kernel, shown as Mij in FIG. 4. Unlike the CPU side, the hardware event data produced by each run of the program is no longer stitched at diagonal positions but is stitched uniformly by row. Each row corresponds to the monitoring interval set by the nvprof tool. Each column is the hardware event monitored during one run of the program, and the last column is the IPC, which is used as the label data during model training.
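For the GPU side, a pandas sketch that normalizes the kernel names, collects each run's event column per kernel, aligns the invocations across runs, and appends the IPC; the column names and the normalization rule are assumptions.

```python
import re
import pandas as pd

def normalize_kernel(name):
    """Strip template arguments and parameter lists so the same kernel gets
    the same key across runs (assumed normalization rule)."""
    return re.sub(r"[<(].*", "", name).strip()

def build_kernel_matrices(run_frames, ipc_frame):
    """run_frames : list of DataFrames with columns ['kernel', '<event>'],
                    one per profiled run (one event per run, OROE);
                    each row is one kernel invocation in launch order.
       ipc_frame  : DataFrame with columns ['kernel', 'IPC'] per invocation.
       Returns {kernel: matrix} with events as columns, invocations as rows,
       and IPC appended as the last (label) column."""
    norm = lambda df: df.assign(kernel=df["kernel"].map(normalize_kernel))
    run_frames = [norm(df) for df in run_frames]
    ipc_frame = norm(ipc_frame)

    matrices = {}
    for kernel in ipc_frame["kernel"].unique():
        cols = []
        for df in run_frames:
            event_col = [c for c in df.columns if c != "kernel"][0]
            # invocations of this kernel, aligned positionally across runs
            cols.append(df.loc[df["kernel"] == kernel, event_col]
                          .reset_index(drop=True))
        ipc = ipc_frame.loc[ipc_frame["kernel"] == kernel, "IPC"] \
                       .reset_index(drop=True)
        matrices[kernel] = pd.concat(cols + [ipc], axis=1)
    return matrices
```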
Further, the above step S3 includes:
S301: Train a GBRT machine learning model with the large data matrix of the CPU part, rank the features, and obtain the 10 CPU hardware events with the greatest impact on the IPC;
S302: Train multiple GBRT machine learning models with the large kernel data matrix of the GPU part, in order of the time consumed, rank the features, and obtain the 10 GPU hardware events with the greatest impact on the IPC.
Specifically, step S3 covers CPU and GPU performance characterization:
On the CPU side, a Gradient Boosted Regression Tree (GBRT) machine learning model is trained on the CPU data matrix. The GBRT algorithm is a machine learning algorithm with high prediction accuracy and wide adaptability, suitable for a variety of data learning scenarios. The present invention uses the GBRT algorithm for two reasons: first, the algorithm has high prediction accuracy; second, the algorithm can learn the relative importance of the features (events), which helps to understand which factors (events) have a key impact on the prediction target (IPC). This advantage is particularly important for the event importance ranking in the present invention; therefore, the present invention uses the GBRT algorithm. The present invention uses the last column of the data matrix as the label of the training and test sets and the remaining columns as the data set. The data set and labels are split into a training set and a test set in an 8:2 ratio. The training set is used to train the GBRT algorithm, and the test set is used to verify the error rate of the model. Within the training set, the data is trained over multiple rounds using cross-validation to obtain an optimal model. After one training pass, the data of the 10 least important event features is removed, and the remaining event feature data is used as the data set to train the GBRT model again; this process is called "feature purification". The reason for this is that there are many CPU event features, ranging from 226 to 1423, so it is necessary to consider whether the model is overfitting. Feature purification is repeated until the GBRT model with the lowest error rate is obtained. The feature ranking of this model is taken as the final importance ranking of the CPU-side event features, and the top 10 most important events are used for performance characterization.
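A scikit-learn sketch of the CPU-side modeling loop described above, assuming the stitched matrix is available as a numpy array X (event columns) with labels y (IPC): an 8:2 split, a cross-validated GBRT fit, and repeated removal of the 10 least important features ("feature purification") while tracking the ranking of the best model. The hyperparameters are placeholders, not values specified by the invention.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error

def purify_and_rank(X, y, feature_names, drop_per_round=10):
    """Iteratively train GBRT models, dropping the 10 least important events
    each round, and keep the ranking of the model with the lowest test error."""
    best_err, best_ranking = np.inf, None
    X_cur, names = X.copy(), list(feature_names)

    while X_cur.shape[1] > drop_per_round:
        X_tr, X_te, y_tr, y_te = train_test_split(X_cur, y, test_size=0.2)
        model = GradientBoostingRegressor(n_estimators=200, max_depth=4)
        cv_scores = cross_val_score(model, X_tr, y_tr, cv=5)   # sanity check of the fit
        model.fit(X_tr, y_tr)
        err = mean_squared_error(y_te, model.predict(X_te))

        order = np.argsort(model.feature_importances_)[::-1]   # descending importance
        if err < best_err:
            best_err = err
            best_ranking = [names[i] for i in order]

        # feature purification: drop the 10 least important events and retrain
        keep = order[:-drop_per_round]
        X_cur = X_cur[:, keep]
        names = [names[i] for i in keep]

    return best_ranking[:10], best_err

# top10_events, test_error = purify_and_rank(X, y, event_names)
```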
On the GPU side, the GBRT algorithm is likewise trained with the GPU hardware event data. The data is split into a training set and a test set in the same 8:2 ratio as on the CPU side. Unlike the CPU part, the GPU data does not undergo "feature purification"; the reason is that the number of features on the GPU side is 35, which is relatively small, so the model is considered not to suffer from overfitting. After ranking the importance of the event features obtained from model training, the top 10 most important events are used for performance characterization.
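For step S302, one reading is that a separate GBRT is trained per kernel, visited from the most to the least time-consuming kernel; the sketch below follows that reading, reusing the per-kernel matrices built earlier, and combines the per-kernel rankings by time-weighted averaging, which is an assumed aggregation rule not specified in the patent.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def rank_gpu_events(kernel_matrices, kernel_durations, top_k=10):
    """kernel_matrices : {kernel: DataFrame of event columns + 'IPC'}
       kernel_durations: {kernel: total GPU time consumed by that kernel}
       Assumes every kernel matrix shares the same event columns."""
    any_matrix = next(iter(kernel_matrices.values()))
    events = [c for c in any_matrix.columns if c != "IPC"]
    totals = np.zeros(len(events))

    # visit kernels from most to least time-consuming, one GBRT per kernel
    for kernel in sorted(kernel_durations, key=kernel_durations.get, reverse=True):
        df = kernel_matrices[kernel]
        if len(df) < 5:                       # too few invocations to fit a model
            continue
        X, y = df[events].to_numpy(), df["IPC"].to_numpy()
        X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.2)
        model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
        model.fit(X_tr, y_tr)
        totals += model.feature_importances_ * kernel_durations[kernel]

    top = np.argsort(totals)[::-1][:top_k]
    return [events[i] for i in top]
```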
The monitoring data of the CPU part and of the GPU part are then combined, and the performance characteristics of the current artificial intelligence program are characterized according to the most important events. For example, in the image classification program, the most important event on the CPU side is Number of self-modifying-code machine clears detected, which represents the number of machine clears triggered by detected self-modifying code. Self-modifying code is code that changes its own instructions as it executes; it is usually used to reduce instruction path length and improve performance, or simply to reduce otherwise repetitive similar code and thereby simplify maintenance. The second most important event is Cycles stalled due to re-order buffer full, indicating that the instruction pipeline stalled because the re-order buffer was full. The most important event on the GPU side is Number of transactions for shared store accesses, which represents the number of transactions for shared memory accesses. In the Maxwell architecture, the maximum transaction size is 128 bytes; for a shared load instruction, any warp access larger than 128 bytes results in multiple transactions. This event also includes the extra transactions caused by shared memory bank conflicts. The second most important event is Number of branch instructions executed per warp on a multiprocessor, which represents the number of branch instructions executed per warp on a multiprocessor.
Further, the above step S4 includes:
S401: Use nvprof to measure the electrical energy consumed by each GPU;
S402: Use the UNIT-T UT230A/C-II power meter to measure the actual electricity consumed by the server.
Specifically, step S4 uses the nvprof tool to collect GPU energy consumption. The power data of each GPU can be obtained by enabling the nvprof system-profiling option. From the GPU running time, the electrical energy consumed by the GPU while running the program can be obtained.
Server energy consumption is collected with a power meter. The electrical energy consumed by the server while running the program is obtained by recording the voltage, the current, and the program running time; in the present invention, the UT230A/C-II power meter is used to record the electrical energy data. Finally, the proportion of power consumed by the GPUs is calculated. The present invention finds that the GPUs' share of the power consumption ranges from 27% to 44%, indicating that executing artificial intelligence programs consumes a large amount of electricity.
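A short sketch of the energy accounting, assuming the GPU power samples have been parsed from the nvprof system-profiling output and the voltage, current, and runtime have been read from the power meter; the sample format and the toy numbers are assumptions, not measurements from the patent.

```python
import numpy as np

def gpu_energy_joules(timestamps_s, power_watts):
    """Integrate sampled GPU power over time (trapezoidal rule)."""
    t = np.asarray(timestamps_s, dtype=float)
    p = np.asarray(power_watts, dtype=float)
    return float(np.sum(np.diff(t) * (p[1:] + p[:-1]) / 2.0))

def server_energy_joules(voltage_v, current_a, runtime_s):
    """Whole-server energy from the power-meter readings, assuming a roughly
    constant draw over the run (the meter can also report energy directly)."""
    return voltage_v * current_a * runtime_s

# Toy numbers only.
t = np.arange(0.0, 600.0, 1.0)               # a 10-minute run, sampled every second
p = np.full_like(t, 100.0)                   # ~100 W reported for one GPU
e_gpu = 2 * gpu_energy_joules(t, p)          # e.g. a server with 2 GPUs
e_server = server_energy_joules(220.0, 3.0, 600.0)
print(f"GPU share of energy: {e_gpu / e_server:.0%}")   # roughly 30%
```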
The above are only embodiments of the present invention and are not intended to limit the protection scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related fields, is likewise included in the protection scope of the present invention.

Claims (5)

  1. A machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance, characterized in that it comprises the following steps:
    S1: Collect big performance data; the big performance data includes CPU hardware event data and GPU hardware event data;
    S2: Process the collected big performance data;
    S3: Characterize the performance of the CPU and GPU;
    S4: Collect and analyze system energy consumption.
  2. The machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance according to claim 1, characterized in that step S1 includes:
    S101: Collect CPU hardware events according to the One Counter One Event mode;
    S102: Use the perf tool to specify the event codes to be collected and the collection interval;
    S103: Collect GPU hardware events according to the One Running One Event mode;
    S104: Use the nvprof tool to specify the event codes to be collected.
  3. The machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance according to claim 1, characterized in that step S2 includes:
    S201: The CPU hardware event processing part first converts the raw format of the events collected at runtime into a form with multiple columns per sampling interval, then stitches the columns from different runs into a large data matrix, whose last column is the IPC;
    S202: The GPU hardware event processing part first converts the kernel names into a standard format, then aggregates the values of the monitored events by kernel, stitches them into a large kernel data matrix, and finally appends the IPC as the last column.
  4. The machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance according to claim 1, characterized in that step S3 includes:
    S301: Train a GBRT machine learning model with the large data matrix of the CPU part, rank the features, and obtain the 10 CPU hardware events with the greatest impact on the IPC;
    S302: Train multiple GBRT machine learning models with the large kernel data matrix of the GPU part, in order of the time consumed, rank the features, and obtain the 10 GPU hardware events with the greatest impact on the IPC.
  5. The machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance according to any one of claims 1-3, characterized in that step S4 includes:
    S401: Use nvprof to measure the electrical energy consumed by each GPU;
    S402: Use the UNIT-T UT230A/C-II power meter to measure the actual electricity consumed by the server.
PCT/CN2019/121592 2018-12-07 2019-11-28 CPU-GPU heterogeneous SoC performance characterization method based on machine learning WO2020114311A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811495369.9 2018-12-07
CN201811495369.9A CN109871237B (en) 2018-12-07 2018-12-07 CPU and GPU heterogeneous SoC performance characterization method based on machine learning

Publications (1)

Publication Number Publication Date
WO2020114311A1 true WO2020114311A1 (en) 2020-06-11

Family

ID=66917046

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/121592 WO2020114311A1 (en) 2018-12-07 2019-11-28 Cpu-gpu heterogeneous soc performance characterization method based on machine learning

Country Status (2)

Country Link
CN (1) CN109871237B (en)
WO (1) WO2020114311A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871237B (en) * 2018-12-07 2021-04-09 中国科学院深圳先进技术研究院 CPU and GPU heterogeneous SoC performance characterization method based on machine learning
CN112784435B (en) * 2021-02-03 2023-05-23 浙江工业大学 GPU real-time power modeling method based on performance event counting and temperature

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880785A (en) * 2012-08-01 2013-01-16 北京大学 Method for estimating transmission energy consumption of source code grade data directed towards GPU program
CN107168859A (en) * 2017-05-09 2017-09-15 中国科学院计算技术研究所 Energy consumption analysis method for Android device
CN108733531A (en) * 2017-04-13 2018-11-02 南京维拓科技有限公司 GPU performance monitoring systems based on cloud computing
US20180341856A1 (en) * 2017-05-24 2018-11-29 International Business Machines Corporation Balancing memory consumption of multiple graphics processing units in deep learning
CN109871237A (en) * 2018-12-07 2019-06-11 中国科学院深圳先进技术研究院 A kind of CPU based on machine learning and GPU isomery SoC performance depicting method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8112250B2 (en) * 2008-11-03 2012-02-07 International Business Machines Corporation Processor power management
US20170017576A1 (en) * 2015-07-16 2017-01-19 Qualcomm Incorporated Self-adaptive Cache Architecture Based on Run-time Hardware Counters and Offline Profiling of Applications
CN106991030B (en) * 2017-03-01 2020-04-14 北京航空航天大学 Online learning-based system power consumption optimization lightweight method
CN107908536B (en) * 2017-11-17 2020-05-19 华中科技大学 Performance evaluation method and system for GPU application in CPU-GPU heterogeneous environment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880785A (en) * 2012-08-01 2013-01-16 北京大学 Method for estimating transmission energy consumption of source code grade data directed towards GPU program
CN108733531A (en) * 2017-04-13 2018-11-02 南京维拓科技有限公司 GPU performance monitoring systems based on cloud computing
CN107168859A (en) * 2017-05-09 2017-09-15 中国科学院计算技术研究所 Energy consumption analysis method for Android device
US20180341856A1 (en) * 2017-05-24 2018-11-29 International Business Machines Corporation Balancing memory consumption of multiple graphics processing units in deep learning
CN109871237A (en) * 2018-12-07 2019-06-11 中国科学院深圳先进技术研究院 A kind of CPU based on machine learning and GPU isomery SoC performance depicting method

Also Published As

Publication number Publication date
CN109871237B (en) 2021-04-09
CN109871237A (en) 2019-06-11

Similar Documents

Publication Publication Date Title
Zhang et al. Performance and power analysis of ATI GPU: A statistical approach
García-Martín et al. Estimation of energy consumption in machine learning
Ren et al. Sentinel: Efficient tensor migration and allocation on heterogeneous memory systems for deep learning
Li et al. Strategies for energy-efficient resource management of hybrid programming models
Chen et al. Statistical GPU power analysis using tree-based methods
McCraw et al. Power monitoring with PAPI for extreme scale architectures and dataflow-based programming models
Liu et al. Pinpointing data locality bottlenecks with low overhead
WO2020114311A1 (en) Cpu-gpu heterogeneous soc performance characterization method based on machine learning
Meng et al. Skope: A framework for modeling and exploring workload behavior
Ganapathi Predicting and optimizing system utilization and performance via statistical machine learning
Zhou et al. GPA: A GPU performance advisor based on instruction sampling
Rohou Tiptop: Hardware performance counters for the masses
Gao et al. Data motif-based proxy benchmarks for big data and AI workloads
Liu et al. Runtime concurrency control and operation scheduling for high performance neural network training
Wang et al. A statistic approach for power analysis of integrated GPU
CN105094949A (en) Method and system for simulation based on instruction calculation model and feedback compensation
Gottschall et al. Balancing Accuracy and Evaluation Overhead in Simulation Point Selection
Zhang et al. A performance prediction scheme for computation-intensive applications on cloud
Mammeri et al. Performance counters based power modeling of mobile GPUs using deep learning
EP4177747A1 (en) Machine learning based contention delay prediction in multicore architectures
Moore et al. User-defined events for hardware performance monitoring
Nilakantan et al. Platform-independent analysis of function-level communication in workloads
Heirman et al. Sniper: Simulation-based instruction-level statistics for optimizing software on future architectures
Hornich et al. Collecting and presenting reproducible intranode stencil performance: INSPECT
Cheema et al. Power and Performance Analysis of Deep Neural Networks for Energy-aware Heterogeneous Systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19892596

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.11.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19892596

Country of ref document: EP

Kind code of ref document: A1