WO2020114311A1 - CPU-GPU heterogeneous SoC performance characterization method based on machine learning - Google Patents

CPU-GPU heterogeneous SoC performance characterization method based on machine learning

Info

Publication number
WO2020114311A1
Authority
WO
WIPO (PCT)
Prior art keywords
gpu
cpu
performance
data
machine learning
Prior art date
Application number
PCT/CN2019/121592
Other languages
French (fr)
Chinese (zh)
Inventor
喻之斌
林灵锋
伍浩文
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 filed Critical 中国科学院深圳先进技术研究院
Publication of WO2020114311A1 publication Critical patent/WO2020114311A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline or look ahead

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Disclosed is a CPU-GPU heterogeneous SoC performance characterization method based on machine learning, relating to the field of information technology. The method comprises the following steps: S1: collecting big performance data, including data on CPU hardware events and data on GPU hardware events; S2: processing the collected big performance data; S3: characterizing the performance of the CPU and GPU; S4: collecting and analyzing system energy consumption. On the one hand, the method enables a user to obtain the performance characteristics of an artificial intelligence program by monitoring hardware events on the CPU and GPU, thus providing guidance for optimizing the artificial intelligence program; on the other hand, the method provides guidance for the user, according to the performance characteristics of the artificial intelligence program, to specifically optimize the compiler or the computer microarchitecture for adaptation to the artificial intelligence program; finally, the method enables the user to monitor and analyze the CPU and GPU according to the described monitoring strategy and analysis method.

Description

Method for characterizing CPU and GPU heterogeneous SoC performance based on machine learning
Technical field
The present invention relates to the field of information technology, and in particular to a method for characterizing the performance of a CPU-GPU heterogeneous SoC based on machine learning.
Background art
The current mainstream processors are heterogeneous systems-on-chip (SoCs) containing CPU cores and GPU cores, and research on and application of artificial intelligence are developing rapidly. To understand the performance characteristics of such heterogeneous SoCs when running artificial intelligence programs, a performance characterization model for CPU-GPU heterogeneous systems is proposed.
Characterizing processor performance can help improve the server architecture design of data centers. In addition, analyzing the performance characteristics of a processor helps optimize compilers so that programs execute faster. Processor performance characteristics also provide an important reference for the analysis and optimization of many applications.
The current mainstream CPU performance analysis usually uses the Top-Down method proposed by Ahmad Yasin. This method builds a hierarchical top-down tree structure based on the perf tool in the Linux kernel. The weights of the tree nodes guide the user to focus on the factors that really have influence and to ignore the unimportant parts. The premise of this method is that the user selects the processor microarchitecture events of interest, yet the number of Intel processor microarchitecture events is large, ranging from 338 to 1423, so it is difficult for this method to analyze CPU performance characteristics comprehensively.
The current method for characterizing the performance of CPU-GPU heterogeneous SoC processors on artificial intelligence benchmarks was proposed by Mauricio Guignard et al. It characterizes the performance of artificial intelligence programs running on a heterogeneous SoC and determines the platform's performance bottlenecks, in order to identify the types of operations that consume the most time, to evaluate the similarity of deep learning models from their differing training and inference performance, and to understand parallel scalability. As a result, it is difficult to analyze in depth the performance characteristics of heterogeneous SoCs and the reasons behind them. In addition, this method provides no insight into energy consumption.
Summary of the invention
To solve the above problems in the background art, the present invention proposes a machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance. On the one hand, by monitoring hardware events on the CPU and GPU sides, users can obtain performance characteristics that reflect an artificial intelligence program, thereby providing guidance for optimizing the program; on the other hand, based on the performance characteristics of artificial intelligence programs, users can obtain specific guidance for optimizing compilers or computer microarchitectures to better suit artificial intelligence programs. Finally, users can monitor and analyze the CPU and GPU through the monitoring strategies and analysis methods used in this framework.
The technical solution of the present invention to the above problems is a machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance, which comprises the following steps:
S1: Collect big performance data; the big performance data includes CPU hardware event data and GPU hardware event data;
S2: Process the collected big performance data;
S3: Characterize the performance of the CPU and GPU;
S4: Collect and analyze system energy consumption.
Further, the above step S1 includes:
S101: Collect CPU hardware events according to the One Counter One Event (OCOE) mode;
S102: Use the perf tool to specify the event codes to be collected and the collection interval;
S103: Collect GPU hardware events according to the One Running One Event (OROE) mode;
S104: Use the nvprof tool to specify the event codes to be collected.
Further, the above step S2 includes:
S201: The CPU hardware event processing part first converts the raw format of the events collected at runtime into a form with multiple columns per sampling interval, then stitches the columns from different runs into a large data matrix, whose last column is the IPC.
S202: The GPU hardware event processing part first converts the kernel names into a standard format, then aggregates the values of the monitored events by kernel, stitches them into a large kernel data matrix, and finally appends the IPC as the last column.
Further, the above step S3 includes:
S301: Train a GBRT machine learning model with the large data matrix of the CPU part, rank the features, and obtain the 10 CPU hardware events with the greatest impact on the IPC;
S302: Train multiple GBRT machine learning models with the large kernel data matrix of the GPU part, in order of the time consumed, rank the features, and obtain the 10 GPU hardware events with the greatest impact on the IPC.
Further, the above step S4 includes:
S401: Use nvprof to measure the electrical energy consumed by each GPU;
S402: Use the UNIT-T UT230A/C-II power meter to measure the actual electricity consumed by the server.
Advantages of the invention:
The present invention is a machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance. It collects CPU hardware event and GPU hardware event information through the Linux kernel tool perf and the NVIDIA monitoring tool nvprof. The performance of the heterogeneous CPU-GPU SoC system is then analyzed and characterized by the performance data processing module, the performance characterization module, and the energy consumption collection and analysis module, providing more reliable and detailed suggestions for processor performance characterization. On the one hand, by monitoring hardware events on the CPU and GPU sides, users can obtain performance characteristics that reflect an artificial intelligence program, thereby providing guidance for optimizing the program; on the other hand, based on the performance characteristics of artificial intelligence programs, users can obtain specific guidance for optimizing compilers or computer microarchitectures to better suit artificial intelligence programs. Finally, users can monitor and analyze the CPU and GPU through the monitoring strategies and analysis methods used in this framework.
Brief description of the drawings
FIG. 1 is a flowchart of a machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance in an embodiment of the present invention;
FIG. 2 is a design diagram of a machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance in an embodiment of the present invention;
FIG. 3 is a schematic diagram of the matrix stitching method for CPU hardware event data in an embodiment of the present invention;
FIG. 4 is a schematic diagram of the matrix stitching method for GPU hardware event data in an embodiment of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Referring to FIG. 1, a machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance mainly includes four parts:
S1: Collect big performance data; the big performance data includes CPU hardware event data and GPU hardware event data;
S2: Process the collected big performance data. The processing covers both CPU data and GPU data: the CPU hardware event data from each monitoring run must be merged into a large data matrix, and the GPU hardware event data must be merged into a large data matrix by kernel function.
S3: Characterize the performance of the CPU and GPU, including modeling the CPU and GPU hardware event data separately and ranking the features; the ten most important features are selected as the basis for performance characterization.
S4: Collect and analyze system energy consumption, including monitoring the power consumption of the whole machine and of each GPU.
Referring to FIG. 2, the above step S1 includes:
S101: Collect CPU hardware events according to the One Counter One Event (OCOE) mode;
S102: Use the perf tool to specify the event codes to be collected and the collection interval;
S103: Collect GPU hardware events according to the One Running One Event (OROE) mode;
S104: Use the nvprof tool to specify the event codes to be collected.
Specifically, step S1 is performed on both the CPU side and the GPU side:
On the CPU side, the present invention uses the Linux kernel component perf. Perf is a monitoring tool that reads the performance counters through the Linux kernel. In the present invention, the artificial intelligence program is run on a server, and a program that watches the process name detects when the artificial intelligence program starts executing; once it starts, perf monitoring is started. Following the OCOE mode, perf monitoring specifies how many hardware events are monitored on each run of the program. The PMU of the Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processor used in the present invention provides 6 performance counters, so 6 hardware events are monitored at a time, including 2 resident events: instructions and cycles. The monitoring interval is 1000 milliseconds. Monitoring stops when the program finishes. To collect the values of all events, the program must be run multiple times.
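For illustration only, the following Python sketch shows how such an OCOE collection run could be scripted: it waits for the AI process to appear, then attaches perf stat with one batch of six events and a 1000 ms interval. The event list, the watched process name, the output path, and the use of the psutil library are assumptions made for the example, not part of the invention.

```python
import subprocess
import time

import psutil  # assumed helper for looking up the AI process; not part of the invention

# Placeholder event batch; real runs cycle through all events of interest,
# six at a time, with instructions and cycles always occupying two counters.
EVENT_BATCH = ["instructions", "cycles", "cache-misses", "branch-misses",
               "LLC-loads", "LLC-load-misses"]

def wait_for_process(name):
    """Poll until a process with the given name appears, then return its PID."""
    while True:
        for p in psutil.process_iter(["pid", "name"]):
            if p.info["name"] == name:
                return p.info["pid"]
        time.sleep(0.1)

def collect_batch(pid, events, out_file):
    """Attach perf stat to the running AI program and sample every 1000 ms."""
    cmd = ["perf", "stat",
           "-e", ",".join(events),
           "-I", "1000",        # 1000 ms sampling interval
           "-x", ",",           # CSV output, convenient for later matrix building
           "-p", str(pid),
           "-o", out_file]
    return subprocess.Popen(cmd)

if __name__ == "__main__":
    pid = wait_for_process("ai_benchmark")    # hypothetical AI process name
    monitor = collect_batch(pid, EVENT_BATCH, "perf_run01.csv")
    monitor.wait()                            # perf exits when the program finishes
```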
On the GPU side, the present invention uses the NVIDIA monitoring tool nvprof. Nvprof is a profiling tool dedicated to NVIDIA GPUs that can monitor CUDA, OpenACC or OpenMP applications. The present invention likewise runs the artificial intelligence program on the server, and the executable command to be run is passed as an argument to the nvprof tool. Because NVIDIA does not disclose the number of GPU performance counters, the present invention selects a subset of hardware events and monitors one event per program run. Specifying --print-gpu-trace on records the value of the event every time a kernel function is called. To collect the values of all hardware events, the program must be run multiple times.
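A comparable sketch for the GPU side, assuming one hardware event per run (OROE) and per-kernel tracing; the event names and the profiled command are illustrative placeholders.

```python
import subprocess

# Illustrative subset of GPU hardware events; the invention monitors one per run.
GPU_EVENTS = ["shared_load", "shared_store", "branch", "divergent_branch"]

def profile_event(event, run_id):
    """Run the AI program once under nvprof, recording the event per kernel launch."""
    cmd = ["nvprof",
           "--events", event,
           "--print-gpu-trace",                       # one record per kernel invocation
           "--csv",
           "--log-file", f"nvprof_{run_id}_{event}.csv",
           "python", "train.py"]                      # hypothetical benchmark command
    subprocess.run(cmd, check=True)

for i, event in enumerate(GPU_EVENTS):
    profile_event(event, i)
```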
The above step S2 includes:
S201: The CPU hardware event processing part first converts the raw format of the events collected at runtime into a form with multiple columns per sampling interval, then stitches the columns from different runs into a large data matrix, whose last column is the IPC.
S202: The GPU hardware event processing part first converts the kernel names into a standard format, then aggregates the values of the monitored events by kernel, stitches them into a large kernel data matrix, and finally appends the IPC as the last column.
Specifically, step S2 is performed on both the CPU side and the GPU side:
On the CPU side, the hardware events are organized into a large data matrix, shown as Mij in FIG. 3. The columns of the matrix are hardware events and the rows are the sampling intervals. First, the monitoring data produced by one run of the program is converted into a small data matrix, such as the small matrix mij in the upper left corner of FIG. 3; the columns of the small data matrix are the hardware events other than instructions and cycles monitored by perf during that run (E1, E2, E3, E4 in the small matrix in the upper left corner of FIG. 3), and the rows are the monitoring intervals. The last column is the IPC, which is computed from instructions and cycles. Then the non-IPC columns of all the small data matrices are stitched into a large data matrix; the stitching places the data produced by each monitoring run at the diagonal position of the large data matrix, as shown in FIG. 3, and the last column is the IPC, which is used as the label data during model training.
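A minimal numpy sketch of the diagonal stitching described above, assuming each run has already been parsed into an (intervals x events) block plus an IPC column; zero-filling the off-diagonal positions is an assumption, since FIG. 3 only specifies the diagonal placement.

```python
import numpy as np

def stitch_cpu_matrix(small_blocks, ipc_columns):
    """Place each run's event block on the diagonal of a large matrix,
    zero-fill the off-diagonal positions, and append IPC as the last column.

    small_blocks : list of (rows_i, cols_i) arrays, one per monitoring run
    ipc_columns  : list of (rows_i,) arrays with IPC = instructions / cycles
    """
    total_rows = sum(b.shape[0] for b in small_blocks)
    total_cols = sum(b.shape[1] for b in small_blocks)
    big = np.zeros((total_rows, total_cols + 1))    # +1 for the IPC label column

    r, c = 0, 0
    for block, ipc in zip(small_blocks, ipc_columns):
        rows, cols = block.shape
        big[r:r + rows, c:c + cols] = block         # diagonal placement
        big[r:r + rows, -1] = ipc                   # IPC as the training label
        r += rows
        c += cols
    return big

# Toy usage: two runs, each with 3 sampling intervals and 4 non-resident events.
run1, run2 = np.random.rand(3, 4), np.random.rand(3, 4)
ipc1, ipc2 = np.random.rand(3), np.random.rand(3)
M = stitch_cpu_matrix([run1, run2], [ipc1, ipc2])
print(M.shape)   # (6, 9)
```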
On the GPU side, the hardware events are organized into a large data matrix by kernel, shown as Mij in FIG. 4. Unlike the CPU side, the hardware event data produced by each run of the program is no longer stitched at diagonal positions but is stitched uniformly by row. Each row corresponds to the monitoring interval set by the nvprof tool. Each column is the hardware event monitored during one run of the program, and the last column is the IPC, which is used as the label data during model training.
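For the GPU side, a pandas sketch that normalizes the kernel names, collects each run's event column per kernel, aligns the invocations across runs, and appends the IPC; the column names and the normalization rule are assumptions.

```python
import re
import pandas as pd

def normalize_kernel(name):
    """Strip template arguments and parameter lists so the same kernel gets
    the same key across runs (assumed normalization rule)."""
    return re.sub(r"[<(].*", "", name).strip()

def build_kernel_matrices(run_frames, ipc_frame):
    """run_frames : list of DataFrames with columns ['kernel', '<event>'],
                    one per profiled run (one event per run, OROE);
                    each row is one kernel invocation in launch order.
       ipc_frame  : DataFrame with columns ['kernel', 'IPC'] per invocation.
       Returns {kernel: matrix} with events as columns, invocations as rows,
       and IPC appended as the last (label) column."""
    norm = lambda df: df.assign(kernel=df["kernel"].map(normalize_kernel))
    run_frames = [norm(df) for df in run_frames]
    ipc_frame = norm(ipc_frame)

    matrices = {}
    for kernel in ipc_frame["kernel"].unique():
        cols = []
        for df in run_frames:
            event_col = [c for c in df.columns if c != "kernel"][0]
            # invocations of this kernel, aligned positionally across runs
            cols.append(df.loc[df["kernel"] == kernel, event_col]
                          .reset_index(drop=True))
        ipc = ipc_frame.loc[ipc_frame["kernel"] == kernel, "IPC"] \
                       .reset_index(drop=True)
        matrices[kernel] = pd.concat(cols + [ipc], axis=1)
    return matrices
```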
Further, the above step S3 includes:
S301: Train a GBRT machine learning model with the large data matrix of the CPU part, rank the features, and obtain the 10 CPU hardware events with the greatest impact on the IPC;
S302: Train multiple GBRT machine learning models with the large kernel data matrix of the GPU part, in order of the time consumed, rank the features, and obtain the 10 GPU hardware events with the greatest impact on the IPC.
Specifically, step S3 covers CPU and GPU performance characterization:
On the CPU side, a Gradient Boosted Regression Tree (GBRT) machine learning model is trained on the CPU data matrix. The GBRT algorithm is a machine learning algorithm with high prediction accuracy and wide adaptability, suitable for a variety of data learning scenarios. The present invention uses the GBRT algorithm for two reasons: first, the algorithm has high prediction accuracy; second, the algorithm can learn the relative importance of the features (events), which helps to understand which factors (events) have a key impact on the prediction target (IPC). This advantage is particularly important for the event importance ranking in the present invention; therefore, the present invention uses the GBRT algorithm. The present invention uses the last column of the data matrix as the label of the training and test sets and the remaining columns as the data set. The data set and labels are split into a training set and a test set in an 8:2 ratio. The training set is used to train the GBRT algorithm, and the test set is used to verify the error rate of the model. Within the training set, the data is trained over multiple rounds using cross-validation to obtain an optimal model. After one training pass, the data of the 10 least important event features is removed, and the remaining event feature data is used as the data set to train the GBRT model again; this process is called "feature purification". The reason for this is that there are many CPU event features, ranging from 226 to 1423, so it is necessary to consider whether the model is overfitting. Feature purification is repeated until the GBRT model with the lowest error rate is obtained. The feature ranking of this model is taken as the final importance ranking of the CPU-side event features, and the top 10 most important events are used for performance characterization.
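A scikit-learn sketch of the CPU-side modeling loop described above, assuming the stitched matrix is available as a numpy array X (event columns) with labels y (IPC): an 8:2 split, a cross-validated GBRT fit, and repeated removal of the 10 least important features ("feature purification") while tracking the ranking of the best model. The hyperparameters are placeholders, not values specified by the invention.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error

def purify_and_rank(X, y, feature_names, drop_per_round=10):
    """Iteratively train GBRT models, dropping the 10 least important events
    each round, and keep the ranking of the model with the lowest test error."""
    best_err, best_ranking = np.inf, None
    X_cur, names = X.copy(), list(feature_names)

    while X_cur.shape[1] > drop_per_round:
        X_tr, X_te, y_tr, y_te = train_test_split(X_cur, y, test_size=0.2)
        model = GradientBoostingRegressor(n_estimators=200, max_depth=4)
        cv_scores = cross_val_score(model, X_tr, y_tr, cv=5)   # sanity check of the fit
        model.fit(X_tr, y_tr)
        err = mean_squared_error(y_te, model.predict(X_te))

        order = np.argsort(model.feature_importances_)[::-1]   # descending importance
        if err < best_err:
            best_err = err
            best_ranking = [names[i] for i in order]

        # feature purification: drop the 10 least important events and retrain
        keep = order[:-drop_per_round]
        X_cur = X_cur[:, keep]
        names = [names[i] for i in keep]

    return best_ranking[:10], best_err

# top10_events, test_error = purify_and_rank(X, y, event_names)
```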
On the GPU side, the GBRT algorithm is likewise trained with the GPU hardware event data. The data is split into a training set and a test set in the same 8:2 ratio as on the CPU side. Unlike the CPU part, the GPU data does not undergo "feature purification"; the reason is that the number of features on the GPU side is 35, which is relatively small, so the model is considered not to suffer from overfitting. After ranking the importance of the event features obtained from model training, the top 10 most important events are used for performance characterization.
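For step S302, one reading is that a separate GBRT is trained per kernel, visited from the most to the least time-consuming kernel; the sketch below follows that reading, reusing the per-kernel matrices built earlier, and combines the per-kernel rankings by time-weighted averaging, which is an assumed aggregation rule not specified in the patent.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def rank_gpu_events(kernel_matrices, kernel_durations, top_k=10):
    """kernel_matrices : {kernel: DataFrame of event columns + 'IPC'}
       kernel_durations: {kernel: total GPU time consumed by that kernel}
       Assumes every kernel matrix shares the same event columns."""
    any_matrix = next(iter(kernel_matrices.values()))
    events = [c for c in any_matrix.columns if c != "IPC"]
    totals = np.zeros(len(events))

    # visit kernels from most to least time-consuming, one GBRT per kernel
    for kernel in sorted(kernel_durations, key=kernel_durations.get, reverse=True):
        df = kernel_matrices[kernel]
        if len(df) < 5:                       # too few invocations to fit a model
            continue
        X, y = df[events].to_numpy(), df["IPC"].to_numpy()
        X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.2)
        model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
        model.fit(X_tr, y_tr)
        totals += model.feature_importances_ * kernel_durations[kernel]

    top = np.argsort(totals)[::-1][:top_k]
    return [events[i] for i in top]
```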
The monitoring data of the CPU part and of the GPU part are then combined, and the performance characteristics of the current artificial intelligence program are characterized according to the most important events. For example, in the image classification program, the most important event on the CPU side is Number of self-modifying-code machine clears detected, which represents the number of machine clears triggered by detected self-modifying code. Self-modifying code is code that changes its own instructions as it executes; it is usually used to reduce instruction path length and improve performance, or simply to reduce otherwise repetitive similar code and thereby simplify maintenance. The second most important event is Cycles stalled due to re-order buffer full, indicating that the instruction pipeline stalled because the re-order buffer was full. The most important event on the GPU side is Number of transactions for shared store accesses, which represents the number of transactions for shared memory accesses. In the Maxwell architecture, the maximum transaction size is 128 bytes; for a shared load instruction, any warp access larger than 128 bytes results in multiple transactions. This event also includes the extra transactions caused by shared memory bank conflicts. The second most important event is Number of branch instructions executed per warp on a multiprocessor, which represents the number of branch instructions executed per warp on a multiprocessor.
Further, the above step S4 includes:
S401: Use nvprof to measure the electrical energy consumed by each GPU;
S402: Use the UNIT-T UT230A/C-II power meter to measure the actual electricity consumed by the server.
Specifically, step S4 uses the nvprof tool to collect GPU energy consumption. The power data of each GPU can be obtained by enabling the nvprof system-profiling option. From the GPU running time, the electrical energy consumed by the GPU while running the program can be obtained.
Server energy consumption is collected with a power meter. The electrical energy consumed by the server while running the program is obtained by recording the voltage, the current, and the program running time; in the present invention, the UT230A/C-II power meter is used to record the electrical energy data. Finally, the proportion of power consumed by the GPUs is calculated. The present invention finds that the GPUs' share of the power consumption ranges from 27% to 44%, indicating that executing artificial intelligence programs consumes a large amount of electricity.
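A short sketch of the energy accounting, assuming the GPU power samples have been parsed from the nvprof system-profiling output and the voltage, current, and runtime have been read from the power meter; the sample format and the toy numbers are assumptions, not measurements from the patent.

```python
import numpy as np

def gpu_energy_joules(timestamps_s, power_watts):
    """Integrate sampled GPU power over time (trapezoidal rule)."""
    t = np.asarray(timestamps_s, dtype=float)
    p = np.asarray(power_watts, dtype=float)
    return float(np.sum(np.diff(t) * (p[1:] + p[:-1]) / 2.0))

def server_energy_joules(voltage_v, current_a, runtime_s):
    """Whole-server energy from the power-meter readings, assuming a roughly
    constant draw over the run (the meter can also report energy directly)."""
    return voltage_v * current_a * runtime_s

# Toy numbers only.
t = np.arange(0.0, 600.0, 1.0)               # a 10-minute run, sampled every second
p = np.full_like(t, 100.0)                   # ~100 W reported for one GPU
e_gpu = 2 * gpu_energy_joules(t, p)          # e.g. a server with 2 GPUs
e_server = server_energy_joules(220.0, 3.0, 600.0)
print(f"GPU share of energy: {e_gpu / e_server:.0%}")   # roughly 30%
```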
The above are only embodiments of the present invention and are not intended to limit the protection scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related fields, is likewise included in the protection scope of the present invention.

Claims (5)

  1. A machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance, characterized in that it comprises the following steps:
    S1: Collect big performance data; the big performance data includes CPU hardware event data and GPU hardware event data;
    S2: Process the collected big performance data;
    S3: Characterize the performance of the CPU and GPU;
    S4: Collect and analyze system energy consumption.
  2. The machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance according to claim 1, characterized in that step S1 includes:
    S101: Collect CPU hardware events according to the One Counter One Event mode;
    S102: Use the perf tool to specify the event codes to be collected and the collection interval;
    S103: Collect GPU hardware events according to the One Running One Event mode;
    S104: Use the nvprof tool to specify the event codes to be collected.
  3. The machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance according to claim 1, characterized in that step S2 includes:
    S201: The CPU hardware event processing part first converts the raw format of the events collected at runtime into a form with multiple columns per sampling interval, then stitches the columns from different runs into a large data matrix, whose last column is the IPC;
    S202: The GPU hardware event processing part first converts the kernel names into a standard format, then aggregates the values of the monitored events by kernel, stitches them into a large kernel data matrix, and finally appends the IPC as the last column.
  4. The machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance according to claim 1, characterized in that step S3 includes:
    S301: Train a GBRT machine learning model with the large data matrix of the CPU part, rank the features, and obtain the 10 CPU hardware events with the greatest impact on the IPC;
    S302: Train multiple GBRT machine learning models with the large kernel data matrix of the GPU part, in order of the time consumed, rank the features, and obtain the 10 GPU hardware events with the greatest impact on the IPC.
  5. The machine-learning-based method for characterizing CPU-GPU heterogeneous SoC performance according to any one of claims 1-3, characterized in that step S4 includes:
    S401: Use nvprof to measure the electrical energy consumed by each GPU;
    S402: Use the UNIT-T UT230A/C-II power meter to measure the actual electricity consumed by the server.
PCT/CN2019/121592 2018-12-07 2019-11-28 CPU-GPU heterogeneous SoC performance characterization method based on machine learning WO2020114311A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811495369.9 2018-12-07
CN201811495369.9A CN109871237B (en) 2018-12-07 2018-12-07 CPU and GPU heterogeneous SoC performance characterization method based on machine learning

Publications (1)

Publication Number Publication Date
WO2020114311A1 true WO2020114311A1 (en) 2020-06-11

Family

ID=66917046

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/121592 WO2020114311A1 (en) 2018-12-07 2019-11-28 Cpu-gpu heterogeneous soc performance characterization method based on machine learning

Country Status (2)

Country Link
CN (1) CN109871237B (en)
WO (1) WO2020114311A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871237B (en) * 2018-12-07 2021-04-09 中国科学院深圳先进技术研究院 CPU and GPU heterogeneous SoC performance characterization method based on machine learning
CN112784435B (en) * 2021-02-03 2023-05-23 浙江工业大学 GPU real-time power modeling method based on performance event counting and temperature

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880785A (en) * 2012-08-01 2013-01-16 北京大学 Method for estimating transmission energy consumption of source code grade data directed towards GPU program
CN107168859A (en) * 2017-05-09 2017-09-15 中国科学院计算技术研究所 Energy consumption analysis method for Android device
CN108733531A (en) * 2017-04-13 2018-11-02 南京维拓科技有限公司 GPU performance monitoring systems based on cloud computing
US20180341856A1 (en) * 2017-05-24 2018-11-29 International Business Machines Corporation Balancing memory consumption of multiple graphics processing units in deep learning
CN109871237A (en) * 2018-12-07 2019-06-11 中国科学院深圳先进技术研究院 A kind of CPU based on machine learning and GPU isomery SoC performance depicting method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8112250B2 (en) * 2008-11-03 2012-02-07 International Business Machines Corporation Processor power management
US20170017576A1 (en) * 2015-07-16 2017-01-19 Qualcomm Incorporated Self-adaptive Cache Architecture Based on Run-time Hardware Counters and Offline Profiling of Applications
CN106991030B (en) * 2017-03-01 2020-04-14 北京航空航天大学 Online learning-based system power consumption optimization lightweight method
CN107908536B (en) * 2017-11-17 2020-05-19 华中科技大学 Performance evaluation method and system for GPU application in CPU-GPU heterogeneous environment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880785A (en) * 2012-08-01 2013-01-16 北京大学 Method for estimating transmission energy consumption of source code grade data directed towards GPU program
CN108733531A (en) * 2017-04-13 2018-11-02 南京维拓科技有限公司 GPU performance monitoring systems based on cloud computing
CN107168859A (en) * 2017-05-09 2017-09-15 中国科学院计算技术研究所 Energy consumption analysis method for Android device
US20180341856A1 (en) * 2017-05-24 2018-11-29 International Business Machines Corporation Balancing memory consumption of multiple graphics processing units in deep learning
CN109871237A (en) * 2018-12-07 2019-06-11 中国科学院深圳先进技术研究院 A kind of CPU based on machine learning and GPU isomery SoC performance depicting method

Also Published As

Publication number Publication date
CN109871237B (en) 2021-04-09
CN109871237A (en) 2019-06-11

Similar Documents

Publication Publication Date Title
Zhang et al. Performance and power analysis of ATI GPU: A statistical approach
García-Martín et al. Estimation of energy consumption in machine learning
Ren et al. Sentinel: Efficient tensor migration and allocation on heterogeneous memory systems for deep learning
Li et al. Strategies for energy-efficient resource management of hybrid programming models
Chen et al. Statistical GPU power analysis using tree-based methods
McCraw et al. Power monitoring with PAPI for extreme scale architectures and dataflow-based programming models
Liu et al. Pinpointing data locality bottlenecks with low overhead
WO2020114311A1 (en) Cpu-gpu heterogeneous soc performance characterization method based on machine learning
Meng et al. Skope: A framework for modeling and exploring workload behavior
Ganapathi Predicting and optimizing system utilization and performance via statistical machine learning
Zhou et al. GPA: A GPU performance advisor based on instruction sampling
Rohou Tiptop: Hardware performance counters for the masses
Gao et al. Data motif-based proxy benchmarks for big data and AI workloads
Liu et al. Runtime concurrency control and operation scheduling for high performance neural network training
Wang et al. A statistic approach for power analysis of integrated GPU
CN105094949A (en) Method and system for simulation based on instruction calculation model and feedback compensation
Gottschall et al. Balancing Accuracy and Evaluation Overhead in Simulation Point Selection
Zhang et al. A performance prediction scheme for computation-intensive applications on cloud
Mammeri et al. Performance counters based power modeling of mobile GPUs using deep learning
EP4177747A1 (en) Machine learning based contention delay prediction in multicore architectures
Moore et al. User-defined events for hardware performance monitoring
Nilakantan et al. Platform-independent analysis of function-level communication in workloads
Heirman et al. Sniper: Simulation-based instruction-level statistics for optimizing software on future architectures
Hornich et al. Collecting and presenting reproducible intranode stencil performance: INSPECT
Cheema et al. Power and Performance Analysis of Deep Neural Networks for Energy-aware Heterogeneous Systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19892596

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.11.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19892596

Country of ref document: EP

Kind code of ref document: A1