CN118312394A

CN118312394A - Evaluation method and device for operator optimization

Info

Publication number: CN118312394A
Application number: CN202410489536.8A
Authority: CN
Inventors: Name withheld on request
Original assignee: Shanghai Bi Ren Technology Co ltd
Current assignee: Shanghai Bi Ren Technology Co ltd
Priority date: 2024-04-22
Filing date: 2024-04-22
Publication date: 2024-07-09

Abstract

The disclosure relates to the technical field of artificial intelligence, and in particular to an evaluation method and device for operator optimization. The method comprises: setting a plurality of target frequencies required by a first processor to execute a target operator; determining the total execution duration consumed by the first processor to execute the target operator at each target frequency; and evaluating the target operator according to the determined total execution durations to obtain an evaluation result, wherein the evaluation result indicates the optimization direction of the target operator. The target frequency includes at least one of a first frequency related to data reading and writing by the first processor and a second frequency related to data calculation by the first processor. The method allows operator optimization to be evaluated simply, quickly, efficiently and accurately, improves the efficiency and speed of operator development and optimization, and accelerates the overall process.

Description

Evaluation method and device for operator optimization

Technical Field

The disclosure relates to the technical field of artificial intelligence, in particular to an operator optimization evaluation method and device.

Background

In optimizing operator performance on artificial intelligence chips, evaluating whether an operator (or algorithm) has been optimized to its theoretical performance limit is challenging for the following reasons. On the one hand, processors in artificial intelligence chips are increasingly complex, and the hardware involves data interaction among multiple computing modules and memory modules. On the other hand, for complex operators, the performance bottleneck may appear in different modules in different time windows. How to simply, quickly and accurately determine the bottleneck limiting an operator's execution on the processor, and thereby guide the direction of subsequent operator optimization, is a technical problem to be solved.

Disclosure of Invention

In view of this, the present disclosure proposes an operator optimized evaluation method and apparatus.

According to an aspect of the present disclosure, there is provided an operator-optimized evaluation method, the method including:

setting a plurality of target frequencies required by a first processor to execute a target operator;

determining the total execution duration consumed by the first processor to execute the target operator at each target frequency;

evaluating the target operator according to the determined total execution durations to obtain an evaluation result, wherein the evaluation result indicates the optimization direction of the target operator;

wherein the target frequency includes at least one of a first frequency related to data reading and writing by the first processor and a second frequency related to data calculation by the first processor.

In one possible implementation manner, according to the determined multiple total execution durations, evaluating the target operator to obtain an evaluation result includes:

evaluating the target operator according to the determined total execution durations and the bandwidth utilization rate and computing power utilization rate of the target operator under each total execution duration, to obtain the evaluation result.

In one possible implementation, the method further includes:

in a case where the target frequency needs to be selected from the first frequency and the second frequency, determining the first frequency or the second frequency as the target frequency according to the data read-write speed and the data calculation speed of the first processor.

In one possible implementation manner, determining the first frequency or the second frequency as a target frequency according to the data read-write speed and the data calculation speed of the first processor includes:

determining the first frequency as the target frequency in a case where it is determined, according to the data read-write speed and the data calculation speed, that the transmission duration of data reading and writing by the first processor is shorter than the calculation duration of data calculation by the first processor;

determining the second frequency as the target frequency in a case where it is determined, according to the data read-write speed and the data calculation speed, that the transmission duration of data reading and writing by the first processor is longer than the calculation duration of data calculation by the first processor.

In a possible implementation manner, the first frequency includes a read-write frequency of data read-write by a load/store unit in the first processor;

Or the first frequency includes the read-write frequency and a memory frequency of a memory accessed by the load/store unit, and in the case that the target frequency is the first frequency and the first frequency includes the read-write frequency and the memory frequency, at least one of the read-write frequency and the memory frequency of each target frequency is different.

In one possible implementation, in a case where the target frequency includes the first frequency, the plurality of total execution durations includes a plurality of second total execution durations; or alternatively

In the case where the target frequency includes the second frequency, the plurality of total execution durations includes a plurality of first total execution durations; or alternatively

In the case that the target frequency includes the first frequency and the second frequency, the plurality of total execution durations includes a plurality of first total execution durations and a plurality of second total execution durations;

The plurality of first total execution durations are the durations respectively consumed by the first processor to execute the target operator when the first frequency is unchanged and the second frequency differs; the plurality of second total execution durations are the durations respectively consumed by the first processor to execute the target operator when the second frequency is unchanged and the first frequency differs.

In one possible implementation manner, according to the determined multiple total execution durations, the bandwidth utilization rate and the computational power utilization rate of the target operator under each total execution duration, the evaluation is performed on the target operator to obtain an evaluation result, including:

Determining the influence importance of the transmission time length and the calculation time length on the total execution time length according to the determined multiple total execution time lengths;

Determining a current performance limiting bottleneck of the target operator executed in the first processor according to the influence importance of the transmission time length and the calculation time length and the bandwidth utilization rate and the calculation power utilization rate of the target operator under each total execution time length, wherein the current performance limiting bottleneck comprises a memory bottleneck or a calculation bottleneck;

and determining an optimization direction based on the current performance limiting bottleneck to form an evaluation result.

According to another aspect of the present disclosure, there is provided an operator-optimized evaluation apparatus, the apparatus including:

A frequency setting module for setting a plurality of target frequencies required by the first processor to execute the target operator;

the time length determining module is used for determining total execution time length consumed by the first processor for executing the target operators under the target frequencies respectively;

the optimization evaluation module is used for evaluating the target operator according to the determined total execution time lengths to obtain an evaluation result, and the evaluation result is used for indicating the optimization direction of the target operator;

In one possible implementation, the optimization evaluation module includes:

an evaluation sub-module for evaluating the target operator according to the determined total execution durations and the bandwidth utilization rate and computing power utilization rate of the target operator under each total execution duration, to obtain the evaluation result.

In one possible implementation, the apparatus further includes:

a frequency selection module for determining, in a case where the target frequency needs to be selected from the first frequency and the second frequency, the first frequency or the second frequency as the target frequency according to the data read-write speed and the data calculation speed of the first processor.

According to another aspect of the present disclosure, there is provided an electronic device including: a second processor; a memory for storing second processor-executable instructions; wherein the second processor is configured to implement the above-described method when executing the instructions stored by the memory.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a third processor, implement the above-described method.

According to another aspect of the present disclosure, there is provided a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a fourth processor of an electronic device, performs the above method.

The embodiment of the disclosure provides an operator optimization evaluation method and device, wherein a plurality of target frequencies required by a first processor for executing a target operator are preset; determining the total execution time spent by the first processor for executing the target operators under each target frequency respectively; evaluating the target operator according to the determined total execution time lengths to obtain an evaluation result, wherein the evaluation result is used for indicating the optimization direction of the target operator; the target frequency comprises at least one of a first frequency of data reading and writing by the first processor and a second frequency of data calculation by the first processor. The operator optimization evaluation method can realize simple, rapid, efficient and accurate operation of operator optimization evaluation, can improve operator development and optimization efficiency and speed, and can realize the acceleration of the whole process.

Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.

FIG. 1 illustrates a flow chart of an operator optimized evaluation method according to an embodiment of the present disclosure.

Fig. 2 shows a block diagram of a first processor according to an embodiment of the present disclosure.

FIG. 3 illustrates a timing diagram of a target operator under a memory bottleneck according to one embodiment of the disclosure.

FIG. 4 illustrates a frequency-performance graph of a target operator under a memory bottleneck according to an embodiment of the disclosure.

FIG. 5 illustrates a timing diagram of a target operator under a computational bottleneck according to an embodiment of the disclosure.

FIG. 6 illustrates a frequency-performance graph of a target operator under a computational bottleneck according to an embodiment of the disclosure.

FIG. 7 illustrates a block diagram of an operator optimized evaluation device according to an embodiment of the present disclosure.

Fig. 8 is a block diagram illustrating an apparatus 1900 for an electronic device or server, according to an example embodiment.

Detailed Description

Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.

When analyzing and evaluating operator performance, operator performance bottlenecks can generally be classified into two cases: memory bound (a memory bottleneck, also referred to as memory-limited or memory-access-limited) and compute bound (a computational bottleneck, also referred to as compute-limited). A memory bottleneck means that the operator's performance is limited mainly by memory access, that is, by insufficient memory access bandwidth. A computational bottleneck means that the operator's performance is limited mainly by data calculation, that is, by insufficient computing performance of the hardware. Once the operator's current limiting bottleneck (memory bottleneck or computational bottleneck) has been determined, the operator can be optimized further, improving its performance and the bandwidth utilization rate and computing power utilization rate it achieves on the processor. In the related art, the ways of determining an operator's current limiting bottleneck during optimization have the following problems:

The first type of scheme determines the operator's current limiting bottleneck directly from the bandwidth utilization rate and computing power utilization rate that the operator achieves on the processor in the artificial intelligence chip. Although this determination is very efficient, it is too simplistic and its accuracy is low.

The second type of scheme uses performance simulation to test the operator's execution on the processor in the artificial intelligence chip. Although its accuracy is high, it depends too heavily on specific hardware and suffers from problems such as long run time and a complex process.

In order to solve the above technical problems, an embodiment of the present disclosure provides an operator optimization evaluation method and apparatus, where in the method, a plurality of target frequencies required by a first processor to execute a target operator are preset; determining the total execution time spent by the first processor for executing the target operators under each target frequency respectively; according to the determined total execution time lengths, evaluating the target operator to obtain an evaluation result, wherein the evaluation result is used for indicating the optimization direction of the target operator; the target frequency comprises at least one of a first frequency of data reading and writing by the first processor and a second frequency of data calculation by the first processor. The operator optimization evaluation can be simply, quickly, efficiently and accurately performed.

As shown in fig. 1, the evaluation method for operator optimization provided by the embodiment of the present disclosure includes step S101 to step S103. The method may be applied to a server or an electronic device.

In step S101, a plurality of target frequencies required for the first processor to execute the target operator are set. Wherein the target frequency includes at least one of a first frequency related to data reading and writing by the first processor and a second frequency related to data calculation by the first processor. The target operator can be an operator for processing various types of user input data such as audio, video, images and texts, and can be applied to the fields of scientific calculation, machine learning, data analysis, artificial intelligence, financial modeling and the like.

In this embodiment, the first processor executing the target operator may be an artificial intelligence chip, such as a graphics processing unit (GPU) or a general-purpose graphics processing unit (GPGPU), which is not limited by this disclosure. As shown in fig. 2, the first processor may include at least one load/store unit (LSU) 11 and a plurality of computing units 12, where the load/store unit 11 is configured to access a memory 13 of the first processor for data reading and writing. That is, the load/store unit 11 reads data from the memory 13 and writes the results calculated by the computing units 12 into the memory 13. For simplicity, fig. 2 schematically shows only one load/store unit 11 and one computing unit 12 of the first processor; the remaining load/store units 11 and computing units 12 are not shown.

The first frequency may be a frequency that affects the speed at which the first processor reads and writes data. For example, the first frequency may include the read-write frequency at which the load/store unit 11 reads and writes data, and the data read-write frequency of the memory 13 of the first processor (which may be referred to as the memory frequency). While the first processor executes the target operator, the load/store unit 11 and the plurality of computing units 12 execute the relevant steps of the operator in parallel. Since data is read and stored at the same frequency in the first processor, the read-write frequency may refer either to the frequency at which each load/store unit 11 reads data or to the frequency at which it stores data. The second frequency may be a frequency that affects the speed at which the first processor performs data calculation; for example, the second frequency may include the frequency at which each computing unit 12 performs data calculation.

In one possible implementation, the method may further include: before step S101, in a case where it is determined that a target frequency needs to be selected from the first frequency and the second frequency, the first frequency or the second frequency is determined as the target frequency according to the data read-write speed and the data calculation speed of the first processor.

In this implementation, the target frequency may be selected as follows. The first frequency is determined as the target frequency in a case where it is determined, according to the data read-write speed and the data calculation speed, that the transmission duration of data reading and writing by the first processor is shorter than the calculation duration of data calculation by the first processor. The second frequency is determined as the target frequency in a case where it is determined, according to the data read-write speed and the data calculation speed, that the transmission duration of data reading and writing by the first processor is longer than the calculation duration of data calculation by the first processor.
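The selection rule above can be sketched in Python as follows. This is only an illustrative sketch: the function name `select_target_frequency`, the string labels, and the handling of equal durations (a case the text does not address) are assumptions.

```python
def select_target_frequency(transfer_time: float, compute_time: float) -> str:
    """Choose which frequency to vary, following the rule described above.

    transfer_time: estimated duration of data reading and writing.
    compute_time:  estimated duration of data calculation.
    """
    if transfer_time < compute_time:
        # Transfers are shorter than computation: vary the first
        # (read-write related) frequency.
        return "first"
    if transfer_time > compute_time:
        # Transfers are longer than computation: vary the second
        # (calculation related) frequency.
        return "second"
    # Equal durations are not covered by the text; varying both is an
    # assumption made only for this sketch.
    return "both"


# Hypothetical durations, in arbitrary time units.
choice = select_target_frequency(transfer_time=1.0, compute_time=0.3)
print(choice)
```

The returned label only says which frequency to vary; the bottleneck hypothesis it implies is tested in the later steps.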

In this embodiment, a plurality of target frequencies may be configured based on the computing power and bandwidth of the first processor. After each round or several rounds of optimization of the target operator, the plurality of target frequencies may be set based on the current read-write frequency f1₀, second frequency f2₀, and memory frequency f3₀ at which the first processor executes the target operator. For example, if the target frequency is the first frequency and the first frequency includes only the read-write frequency or only the memory frequency (here, whichever of the read-write frequency and the memory frequency has the larger influence on the transmission duration may be taken as the target frequency), a plurality of read-write frequencies different from f1₀ may be set based on the current read-write frequency f1₀, for example 0.9×f1₀, 0.8×f1₀, 0.7×f1₀, …, while the second frequency and the memory frequency of the first processor remain unchanged at the current f2₀ and f3₀. If the target frequency is the second frequency, a plurality of second frequencies different from f2₀ may be set based on the current second frequency f2₀, for example 0.9×f2₀, 0.8×f2₀, 0.7×f2₀, …, while the read-write frequency and the memory frequency of the first processor remain unchanged at the current f1₀ and f3₀. If the target frequency is the first frequency and the first frequency includes both the read-write frequency and the memory frequency, a plurality of sets of read-write frequency and memory frequency, each differing from the current f1₀ and f3₀ in at least one of the two, may be set, for example 0.9×f1₀ and 0.9×f3₀, 0.9×f1₀ and 0.8×f3₀, 0.8×f1₀ and 0.9×f3₀, 0.7×f1₀ and 0.8×f3₀, …, while the second frequency of the first processor remains unchanged at the current f2₀.
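As a rough illustration of how such sets of target frequencies could be generated from the current values, consider the following Python sketch. The helper `make_settings`, the dictionary keys `f1`, `f2`, `f3` (standing for the read-write, second and memory frequencies) and the baseline values in MHz are all hypothetical.

```python
from typing import Dict, List


def make_settings(baseline: Dict[str, float],
                  scale_sets: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Build one frequency setting per entry of scale_sets.

    Each entry maps a frequency name to a scale factor; frequencies that
    are not named in an entry keep their baseline value.
    """
    settings = []
    for scales in scale_sets:
        setting = dict(baseline)  # copy so the baseline stays untouched
        for name, scale in scales.items():
            setting[name] = baseline[name] * scale
        settings.append(setting)
    return settings


# Hypothetical current frequencies (MHz): read-write, second, memory.
baseline = {"f1": 1000.0, "f2": 1500.0, "f3": 1200.0}

# Target frequency is the second frequency: only f2 varies.
second_sweep = make_settings(baseline, [{"f2": s} for s in (0.9, 0.8, 0.7)])

# Target frequency is the first frequency with both components varied.
first_sweep = make_settings(baseline, [
    {"f1": 0.9, "f3": 0.9},
    {"f1": 0.9, "f3": 0.8},
    {"f1": 0.8, "f3": 0.9},
    {"f1": 0.7, "f3": 0.8},
])
```

How a setting is actually applied to the hardware is device specific and is not covered by this sketch.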

In step S102, a total execution duration consumed by the first processor to execute the target operators at each of the target frequencies is determined.

In this embodiment, in a case where the target frequency includes the second frequency, the plurality of total execution durations includes a plurality of first total execution durations. Or in the case that the target frequency includes the first frequency, the plurality of total execution durations includes a plurality of second total execution durations. Or in the case that the target frequency includes the first frequency and the second frequency, the plurality of total execution durations includes a plurality of first total execution durations and a plurality of second total execution durations. The first total execution duration is the duration consumed by the first processor to execute the target operator under the condition that the first frequency is unchanged and the second frequency is different; the plurality of second total execution time periods are second total execution time periods consumed by the first processor to execute the target operator respectively under the condition that the second frequency is unchanged and the first frequency is different. That is, the first frequencies corresponding to different first total execution durations are the same but the second frequencies are different, the first frequencies corresponding to different second total execution durations are different (the first frequency being different may refer to at least one of a read-write frequency and a memory frequency) but the second frequencies are the same.
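The grouping described above can be sketched as follows. The data layout (a list of setting/duration pairs), the keys `f1`, `f2`, `f3` for the read-write, second and memory frequencies, and the sample values are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# (frequency setting, measured total execution duration)
Run = Tuple[Dict[str, float], float]


def group_durations(runs: List[Run],
                    baseline: Dict[str, float]) -> Tuple[List[float], List[float]]:
    """Split measured durations into first and second total execution durations."""
    first_total: List[float] = []   # first frequency unchanged, second differs
    second_total: List[float] = []  # second frequency unchanged, first differs
    for setting, duration in runs:
        first_same = (setting["f1"] == baseline["f1"]
                      and setting["f3"] == baseline["f3"])
        second_same = setting["f2"] == baseline["f2"]
        if first_same and not second_same:
            first_total.append(duration)
        elif second_same and not first_same:
            second_total.append(duration)
    return first_total, second_total


# Hypothetical measurements, frequencies in MHz and durations in ms.
baseline = {"f1": 1000.0, "f2": 1500.0, "f3": 1200.0}
runs = [
    ({"f1": 1000.0, "f2": 1350.0, "f3": 1200.0}, 17.4),
    ({"f1": 1000.0, "f2": 1200.0, "f3": 1200.0}, 17.5),
    ({"f1": 900.0, "f2": 1500.0, "f3": 1200.0}, 19.1),
    ({"f1": 900.0, "f2": 1500.0, "f3": 960.0}, 21.0),
]
first_total, second_total = group_durations(runs, baseline)
```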

In step S103, the target operator is evaluated according to the determined total execution durations to obtain an evaluation result, where the evaluation result indicates the optimization direction of the target operator. From the determined total execution durations and related information, the current performance limiting bottleneck of the target operator executed in the first processor can be determined; the optimization direction is then determined based on that bottleneck, and the evaluation result is generated. The optimization direction corresponds to the current performance limiting bottleneck of the target operator executed in the first processor: if the bottleneck is a memory bottleneck, the optimization direction may be one that addresses the memory bottleneck; if it is a computational bottleneck, the optimization direction may be one that addresses the computational bottleneck.

In this embodiment, there are several different implementations of step S103, and in order to further illustrate the rationality of the different implementations of step S103, the following is a schematic description of the practical principles and basis of the operator-optimized evaluation method provided in the embodiments of the present disclosure with reference to fig. 2 to 6.

It is assumed that for a certain target operator, the first processor executing the target operator has the structure as shown in fig. 2.

In the first case, assume that the current performance limiting bottleneck of the target operator executed in the first processor is a memory bottleneck, and that the input data is divided into n data blocks that are processed separately. In theory, in the memory bottleneck scenario the transmission duration of each data block should be longer than its calculation duration, which gives the timing diagram shown in fig. 3. In fig. 3, li represents the duration for the first processor to load the i-th data block, ci represents the duration for the first processor to calculate the i-th result from the i-th data block, si represents the duration for the first processor to store the i-th result corresponding to the i-th data block, and i = 1, 2, …, n. The total execution duration of the target operator in the first processor is "l1+l2+…+ln+cn+sn". It can be seen that in the memory bottleneck scenario the main factor affecting the total execution duration is, in theory, the total loading duration "l1+l2+…+ln" of the data, so the following inference can be drawn: if the second frequency of the computing units 12 is reduced, the total execution duration of the target operator in the first processor should not change much.
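The timing expression above can be checked with a small pipeline model. The sketch below rests on an assumption that the text does not spell out: loading, calculation and storing form three stages, each handling one block at a time, and a block enters a stage as soon as the previous stage has finished it and the stage is free. The function name and the sample durations are hypothetical.

```python
from typing import Sequence


def pipeline_total_time(load: Sequence[float],
                        compute: Sequence[float],
                        store: Sequence[float]) -> float:
    """Finish time of a three-stage load/compute/store pipeline."""
    load_end = compute_end = store_end = 0.0
    for l_i, c_i, s_i in zip(load, compute, store):
        # Loads are issued back to back.
        load_end = load_end + l_i
        # Calculation waits for its data and for the computing unit.
        compute_end = max(load_end, compute_end) + c_i
        # Storing waits for the result and for the previous store.
        store_end = max(compute_end, store_end) + s_i
    return store_end


# Memory bottleneck example: each load (1.0) is longer than each
# calculation (0.3); n = 16 blocks, arbitrary time units.
n = 16
total = pipeline_total_time([1.0] * n, [0.3] * n, [1.0] * n)
closed_form = n * 1.0 + 0.3 + 1.0  # l1 + l2 + ... + ln + cn + sn
print(total, closed_form)
```

Under these assumptions the simulated finish time agrees with the closed-form expression, since the calculation of every block except the last is hidden behind the loading of the next block.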

To verify the above inference, assume that the input data is divided into 16 blocks (i.e., n=16), and that for a specific first processor executing a specific target operator the data loading duration is t1, the data calculation duration is t2, and the data storage duration is t3, i.e., li=t1, ci=t2, si=t3. Further let t1=t3 and t2=k1×t1. The total execution duration of the specific target operator in the specific first processor is then l1+l2+…+ln+cn+sn = n×t1+t2+t3. Assuming the frequency reduction ratio is q, that is, the second frequency becomes q times its original value, the total execution duration of the target operator after frequency reduction is n×t1+t2/q+t3, the difference between the two total execution durations is (1/q-1)×t2, and the performance falls to 1-((1/q-1)×t2)/(n×t1+t2+t3) of the original. From this expression, the curve of performance against frequency reduction ratio shown in fig. 4 can be obtained. It can be seen from fig. 4 that, for k1 (i.e., t2/t1) equal to 0.1, 0.2 and 0.3, even if the calculation frequency is reduced to 30% of the original, the performance remains at 95% or more of the original, which confirms the above inference.
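The expression above is easy to evaluate numerically. The following Python sketch normalizes t1 = t3 = 1 (a choice made here for convenience, since only the ratio k1 matters); the function name is hypothetical.

```python
def retained_performance_memory_bound(k1: float, q: float, n: int = 16) -> float:
    """Evaluate 1 - ((1/q - 1) * t2) / (n*t1 + t2 + t3) with t1 = t3 = 1."""
    t1 = t3 = 1.0
    t2 = k1 * t1                       # calculation duration per block
    baseline = n * t1 + t2 + t3        # total duration before frequency reduction
    extra = (1.0 / q - 1.0) * t2       # added duration after reducing to q
    return 1.0 - extra / baseline


# Sweep the reduction ratio for the three k1 values discussed above.
for k1 in (0.1, 0.2, 0.3):
    row = [round(retained_performance_memory_bound(k1, q), 4)
           for q in (0.9, 0.7, 0.5, 0.3)]
    print(k1, row)
```

For k1 = 0.3 and q = 0.3, the added duration is 0.7 against a baseline of 17.3, so the retained performance is about 0.96, in line with the 95% figure stated above.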

In other words, in the above step S103, if the target frequency is the second frequency, after setting the plurality of second frequencies, the obtained plurality of first total execution durations are taken as the total execution durations, and if the difference between the plurality of first total execution durations is within the first difference range, it is proved that the influence of the change of the second frequency on the total execution duration is not great, and it may be determined that the current performance limiting bottleneck of the execution of the target operator in the first processor is the memory bottleneck. Wherein the first difference range may be set based on a data read-write speed and a data calculation speed of the first processor executing the target operator.

In the second case, assume that the current performance limiting bottleneck of the target operator executed in the first processor is a computational bottleneck, and that the input data is divided into n data blocks that are processed separately. In the computational bottleneck scenario the transmission duration of each data block should be shorter than its calculation duration, which gives the timing diagram shown in fig. 5. In fig. 5, li represents the duration for the first processor to load the i-th data block, ci represents the duration for the first processor to calculate the i-th result from the i-th data block, si represents the duration for the first processor to store the i-th result corresponding to the i-th data block, and i = 1, 2, …, n. The total execution duration of the target operator in the first processor is "l1+c1+c2+…+cn+sn". It can be seen that in the computational bottleneck scenario the main factor affecting the total execution duration is, in theory, the total calculation duration "c1+c2+…+cn" of the data, so the following inference can be drawn: if the first frequency of the load/store unit 11 is reduced, the total execution duration of the target operator in the first processor should not change much.

To verify the above inference, assume that the input data is divided into 16 blocks (i.e., n=16), and that for a specific first processor executing a specific target operator the data loading duration is t1, the data calculation duration is t2, and the data storage duration is t3, i.e., li=t1, ci=t2, si=t3. Further let t1=t3 and t1=k2×t2. The total execution duration of the specific target operator in the specific first processor is then l1+c1+c2+…+cn+sn = t1+n×t2+t3. Assuming the frequency reduction ratio is p, that is, the first frequency becomes p times its original value, the total execution duration of the target operator after frequency reduction is t1/p+n×t2+t3/p, the difference between the two total execution durations is (1/p-1)×(t1+t3), and the performance falls to 1-((1/p-1)×(t1+t3))/(t1+n×t2+t3) of the original. From this expression, the curve of performance against frequency reduction ratio shown in fig. 6 can be obtained. It can be seen from fig. 6 that, for k2 (i.e., t1/t2) equal to 0.1, 0.2 and 0.3, even if the first frequency is reduced to 30% of the original, the performance remains at 90% or more of the original, which confirms the above inference.
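The computational bottleneck expression can be evaluated in the same way. In this sketch t2 is normalized to 1 and t1 = t3 = k2 (a convenience choice, since only the ratio matters); the function name is hypothetical.

```python
def retained_performance_compute_bound(k2: float, p: float, n: int = 16) -> float:
    """Evaluate 1 - ((1/p - 1) * (t1 + t3)) / (t1 + n*t2 + t3) with t2 = 1."""
    t2 = 1.0
    t1 = t3 = k2 * t2                     # load and store durations per block
    baseline = t1 + n * t2 + t3           # total duration before frequency reduction
    extra = (1.0 / p - 1.0) * (t1 + t3)   # added duration after reducing to p
    return 1.0 - extra / baseline


# Sweep the reduction ratio for the three k2 values discussed above.
for k2 in (0.1, 0.2, 0.3):
    row = [round(retained_performance_compute_bound(k2, p), 4)
           for p in (0.9, 0.7, 0.5, 0.3)]
    print(k2, row)
```

For k2 = 0.3 and p = 0.3, the added duration is 1.4 against a baseline of 16.6, so the retained performance is about 0.92, in line with the 90% figure stated above.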

In other words, in the above step S103, if the target frequency is the first frequency, the plurality of second total execution durations obtained after setting the plurality of first frequencies are taken as the total execution durations. If the difference between the plurality of second total execution durations is within the second difference range, the change of the first frequency has little influence on the total execution duration, and the current performance-limiting bottleneck of the target operator executed in the first processor is determined to be the calculation bottleneck. The second difference range may be set based on the data read-write speed and the data calculation speed of the first processor executing the target operator.
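One possible way to express this check is sketched below in Python. The function name and the representation of the second difference range as a maximum relative spread are assumptions made for illustration; the disclosure does not prescribe a particular form for the range.

```python
from typing import Sequence


def is_calculation_bottleneck(second_total_durations: Sequence[float],
                              max_relative_spread: float) -> bool:
    """True when changing the first frequency barely changes the total duration.

    `max_relative_spread` stands in for the second difference range
    (hypothetical representation): (longest - shortest) / shortest.
    """
    if len(second_total_durations) < 2:
        raise ValueError("durations for at least two first frequencies are needed")
    shortest = min(second_total_durations)
    longest = max(second_total_durations)
    return (longest - shortest) / shortest <= max_relative_spread


if __name__ == "__main__":
    # Durations measured at three first-frequency settings (made-up values).
    print(is_calculation_bottleneck([16.6, 17.2, 18.0], 0.10))
```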

In this embodiment, the implementation logic of the whole method may be as follows:

If the proportional relationship among t1, t2 and t3 of the target operator executed by the first processor is known, that is, if the data read-write speed and the data calculation speed of the first processor are known, one of the "first frequency" and the "second frequency" may first be determined as the target frequency. Doing so in effect assumes that the current performance-limiting bottleneck of the target operator is the bottleneck corresponding to the target frequency (that is, the calculation bottleneck if the target frequency is the first frequency, and the memory bottleneck if the target frequency is the second frequency).

If the difference between the obtained total execution durations falls within the corresponding difference range (the first difference range or the second difference range), it can be determined that, of the memory bottleneck and the calculation bottleneck currently affecting the performance of the target operator, the bottleneck corresponding to the target frequency has the greater influence importance, and the current performance-limiting bottleneck can be determined to be the bottleneck corresponding to the target frequency. Alternatively, which of the memory bottleneck and the calculation bottleneck serves as the current performance-limiting bottleneck may be evaluated comprehensively according to their influence importance together with the bandwidth utilization rate and the computing power utilization rate of the target operator under each total execution duration; the comprehensive evaluation strategy may be set adaptively according to the conditions of the first processor and the target operator.

If the difference between the obtained total execution durations is not within the corresponding difference range (the first difference range or the second difference range), or if it cannot be determined which of the memory bottleneck and the calculation bottleneck is more important, the remaining frequency that was not determined as the target frequency may be further determined as a new target frequency, and a further plurality of total execution durations obtained. The influence importance of the memory bottleneck and the calculation bottleneck on the performance of the target operator is then determined based on the previously obtained total execution durations and the newly obtained total execution durations, and whichever of the two has the greater influence importance is determined to be the current performance-limiting bottleneck. Alternatively, which of the memory bottleneck and the calculation bottleneck serves as the current performance-limiting bottleneck may be evaluated comprehensively according to their influence importance together with the bandwidth utilization rate and the computing power utilization rate of the target operator under each total execution duration; the comprehensive evaluation strategy may be set adaptively according to the conditions of the first processor and the target operator.
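The implementation logic above may be sketched, purely by way of example, as a two-step procedure in Python. The callable `measure`, the use of frequency scale factors, the single `tolerance` standing in for both difference ranges, and the use of relative spread as the measure of influence importance are all assumptions of this sketch rather than features stated in the disclosure.

```python
from typing import Callable, List, Sequence

# Hypothetical measurement hook: takes scale factors for the first
# (read/write) and second (calculation) frequency and returns the total
# execution duration of the target operator.
Measure = Callable[[float, float], float]


def relative_spread(durations: Sequence[float]) -> float:
    return (max(durations) - min(durations)) / min(durations)


def find_bottleneck(measure: Measure, scales: Sequence[float],
                    assume_calculation: bool, tolerance: float) -> str:
    """Return "calculation" or "memory" as the performance-limiting bottleneck."""

    def sweep_first() -> List[float]:      # vary only the first frequency
        return [measure(s, 1.0) for s in scales]

    def sweep_second() -> List[float]:     # vary only the second frequency
        return [measure(1.0, s) for s in scales]

    # Step 1: assuming a calculation bottleneck, the first frequency is the
    # target frequency; assuming a memory bottleneck, the second frequency is.
    initial = sweep_first() if assume_calculation else sweep_second()
    if relative_spread(initial) <= tolerance:
        return "calculation" if assume_calculation else "memory"

    # Step 2: inconclusive, so the remaining frequency becomes the new target
    # frequency and the two sensitivities are compared.
    other = sweep_second() if assume_calculation else sweep_first()
    rw_sensitivity = relative_spread(initial if assume_calculation else other)
    calc_sensitivity = relative_spread(other if assume_calculation else initial)
    return "memory" if rw_sensitivity > calc_sensitivity else "calculation"
```

In this sketch a bottleneck is confirmed in step 1 only when the durations barely react to the swept frequency; otherwise the bottleneck whose frequency the durations react to more strongly is reported.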

That is, there are several different implementations of step S103:

Mode one: when the plurality of total execution durations are a plurality of first total execution durations or a plurality of second total execution durations, the influence importance of the transmission duration and the calculation duration on the total execution duration can be determined directly from the plurality of first total execution durations or the plurality of second total execution durations. The current performance-limiting bottleneck of the target operator executed in the first processor is then determined according to the influence importance of the transmission duration and the calculation duration, that is, whichever of the memory bottleneck and the calculation bottleneck has the greater influence importance is taken as the current performance-limiting bottleneck. Finally, an optimization direction is determined based on the current performance-limiting bottleneck to form the evaluation result.

Mode one is suited to scenarios in which the plurality of first total execution durations or the plurality of second total execution durations suffices to determine which of the memory bottleneck and the calculation bottleneck is more important.

Mode two: when the plurality of total execution durations are a plurality of first total execution durations or a plurality of second total execution durations, the influence importance of the transmission duration and the calculation duration on the total execution duration can be determined directly from the plurality of first total execution durations or the plurality of second total execution durations. Which of the memory bottleneck and the calculation bottleneck serves as the current performance-limiting bottleneck of the target operator executed in the first processor is then evaluated comprehensively according to the influence importance of the transmission duration and the calculation duration together with the bandwidth utilization rate and the computing power utilization rate of the target operator under each total execution duration. Finally, an optimization direction is determined based on the current performance-limiting bottleneck to form the evaluation result.

Mode two is likewise suited to scenarios in which the plurality of first total execution durations or the plurality of second total execution durations suffices to determine which of the memory bottleneck and the calculation bottleneck is more important; compared with mode one, the evaluation result it determines is more accurate.

Mode three: when the plurality of total execution durations are a plurality of first total execution durations (or a plurality of second total execution durations), and it is determined from them that the influence importance of the transmission duration and that of the calculation duration on the total execution duration do not differ much, additional target frequencies need to be set to obtain a plurality of second total execution durations (or a plurality of first total execution durations). The influence importance of the memory bottleneck and the calculation bottleneck on the performance of the target operator is then determined based on the obtained plurality of first total execution durations and plurality of second total execution durations, and whichever of the memory bottleneck and the calculation bottleneck has the greater influence importance is determined to be the current performance-limiting bottleneck of the target operator executed in the first processor. Finally, an optimization direction is determined based on the current performance-limiting bottleneck to form the evaluation result.

Mode three is suited to scenarios in which the first total execution durations alone, or the second total execution durations alone, cannot determine which of the memory bottleneck and the calculation bottleneck is more important; in such scenarios, the first total execution durations and the second total execution durations are evaluated together to determine which of the two is more important.

Mode four: when the plurality of total execution durations are a plurality of first total execution durations (or a plurality of second total execution durations), and it is determined from them that the influence importance of the transmission duration and that of the calculation duration on the total execution duration do not differ much, additional target frequencies need to be set to obtain a plurality of second total execution durations (or a plurality of first total execution durations). The influence importance of the memory bottleneck and the calculation bottleneck on the performance of the target operator is then determined based on the obtained plurality of first total execution durations and plurality of second total execution durations. Which of the memory bottleneck and the calculation bottleneck serves as the current performance-limiting bottleneck of the target operator executed in the first processor is then evaluated comprehensively according to the influence importance of the transmission duration and the calculation duration together with the bandwidth utilization rate and the computing power utilization rate of the target operator under each total execution duration. Finally, an optimization direction is determined based on the current performance-limiting bottleneck to form the evaluation result.

Mode four is suited to scenarios in which the first total execution durations alone, or the second total execution durations alone, cannot determine which of the memory bottleneck and the calculation bottleneck is more important; in such scenarios, the first total execution durations and the second total execution durations are evaluated together to determine which of the two is more important. Compared with mode three, the evaluation result determined by mode four is more accurate.
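The relationship among the four modes can be summarised by two conditions: whether a single group of total execution durations is conclusive, and whether the bandwidth utilization rate and computing power utilization rate are taken into account. The following Python sketch encodes that summary; the function name and its two boolean parameters are illustrative and do not appear in the disclosure.

```python
def select_mode(single_sweep_conclusive: bool, use_utilization: bool) -> int:
    """Map the two distinguishing conditions onto mode one to mode four.

    single_sweep_conclusive: the first (or second) total execution durations
        alone show which bottleneck is more important.
    use_utilization: bandwidth and computing power utilization rates are
        included in a comprehensive evaluation.
    """
    if single_sweep_conclusive:
        return 2 if use_utilization else 1
    return 4 if use_utilization else 3


if __name__ == "__main__":
    print(select_mode(single_sweep_conclusive=False, use_utilization=True))
```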

In some embodiments, the evaluation result may further include a performance evaluation result for the target operator. The performance evaluation result may be, for example, superior, moderate, or inferior, and the performance evaluation results corresponding to different total execution durations, bandwidth utilization rates, and computing power utilization rates may be preset, which is not limited in the present disclosure.

Therefore, after an optimization of the target operator is completed, the optimization can be evaluated by means of the operator optimization evaluation method, and whether optimization of the target operator can stop is determined according to the evaluation result. When further optimization is determined to be needed, the target operator is optimized based on the optimization direction in the evaluation result. This improves the speed and efficiency of operator optimization and accelerates the whole process of operator development and optimization.

As shown in fig. 7, an embodiment of the present disclosure further provides an operator optimization evaluation apparatus for performing the operator optimization evaluation method described above, which includes a frequency setting module 71, a duration determining module 72, and an optimization evaluation module 73.

The frequency setting module 71 is configured to set a plurality of target frequencies required by the first processor to execute the target operator.

The duration determining module 72 is configured to determine the total execution duration consumed by the first processor to execute the target operator at each of the target frequencies.

The optimization evaluation module 73 is configured to evaluate the target operator according to the determined plurality of total execution durations to obtain an evaluation result, where the evaluation result is used to indicate an optimization direction of the target operator.
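The cooperation of the three modules may be illustrated with the following Python sketch. All class, method and parameter names are hypothetical; the sketch assumes that only the first frequency is varied, that `run_operator` is a caller-supplied hook returning a measured total execution duration, and that the difference range is expressed as a maximum relative spread.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class EvaluationResult:
    bottleneck: str                # "calculation" or "undetermined"
    optimization_direction: str


class FrequencySettingModule:
    """Counterpart of module 71: derives the target frequencies to apply."""

    def target_frequencies(self, base_hz: float, ratios: Sequence[float]) -> List[float]:
        return [base_hz * ratio for ratio in ratios]


class DurationDeterminingModule:
    """Counterpart of module 72: obtains one total duration per frequency."""

    def __init__(self, run_operator: Callable[[float], float]) -> None:
        # Hypothetical hook: applies the first frequency, runs the target
        # operator, and returns the total execution duration.
        self._run_operator = run_operator

    def total_durations(self, frequencies: Sequence[float]) -> List[float]:
        return [self._run_operator(f) for f in frequencies]


class OptimizationEvaluationModule:
    """Counterpart of module 73: turns the durations into an evaluation result."""

    def __init__(self, max_relative_spread: float) -> None:
        self._max_relative_spread = max_relative_spread

    def evaluate(self, durations: Sequence[float]) -> EvaluationResult:
        spread = (max(durations) - min(durations)) / min(durations)
        if spread <= self._max_relative_spread:
            return EvaluationResult("calculation", "reduce the calculation duration")
        # A single sweep is inconclusive in this simplified sketch.
        return EvaluationResult("undetermined", "sweep the second frequency as well")


def evaluate_operator(run_operator: Callable[[float], float], base_hz: float,
                      ratios: Sequence[float], max_relative_spread: float) -> EvaluationResult:
    frequencies = FrequencySettingModule().target_frequencies(base_hz, ratios)
    durations = DurationDeterminingModule(run_operator).total_durations(frequencies)
    return OptimizationEvaluationModule(max_relative_spread).evaluate(durations)
```

Keeping the measurement hook outside the modules means the same evaluation logic can be exercised with a simulated operator before being connected to real hardware.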

In one possible implementation, the optimization evaluation module 73 includes:

In one possible implementation, the apparatus further includes:

Determining the first frequency as a target frequency under the condition that the transmission time length of the first processor for data reading and writing is smaller than the calculation time length of the processor for data calculation according to the data reading and writing speed and the data calculation speed;

In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.

It should be noted that, although the operator optimization evaluation method and apparatus are described above by way of example embodiments, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, the user can flexibly set each step and each module according to personal preference and/or the actual application scenario, as long as the technical solution of the present disclosure is satisfied.

The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a third processor, implement the above-described method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.

The embodiment of the disclosure also provides an electronic device, which comprises: a second processor; a memory for storing second processor-executable instructions; wherein the second processor is configured to implement the above-described method when executing the instructions stored by the memory.

The disclosed embodiments also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a fourth processor of an electronic device, causes the processor in the electronic device to perform the above method.

It should be noted that the first processor, the second processor, the third processor, the fourth processor, and the fifth processor in the embodiments of the present disclosure may be the same processor or different processors, and may be set according to the requirements placed on the processors in different situations, which is not limited in the present disclosure.

Fig. 8 is a block diagram illustrating an apparatus 1900 for an electronic device or server, according to an example embodiment. That is, the apparatus 1900 may be provided as a server or terminal device. Referring to fig. 8, the apparatus 1900 includes a processing component 1922 that further includes one or more fifth processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by the processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.

The apparatus 1900 may further comprise a power component 1926 configured to perform power management of the apparatus 1900, a wired or wireless network interface 1950 configured to connect the apparatus 1900 to a network, and an input/output interface 1958 (I/O interface). The apparatus 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.

In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of apparatus 1900 to perform the above-described methods.

The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.

The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.

The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized with state information of the computer readable program instructions, and the electronic circuitry can execute the computer readable program instructions so as to implement aspects of the present disclosure.

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. An evaluation method for operator optimization, the method comprising:

2. The method according to claim 1, wherein evaluating the target operator according to the determined plurality of total execution durations to obtain an evaluation result comprises:

3. The method according to claim 1 or 2, characterized in that the method further comprises:

4. A method according to claim 3, wherein determining the first frequency or the second frequency as a target frequency based on the data read-write speed and the data calculation speed of the first processor comprises:

5. The method of claim 4, wherein the first frequency comprises a read-write frequency of data read-write by a load/store unit in the first processor;

6. A method according to claim 1 or 2, characterized in that,

In the case that the target frequency includes the first frequency, the plurality of total execution durations includes a plurality of second total execution durations; or alternatively

7. The method according to claim 5, wherein evaluating the target operator according to the determined total execution time lengths, the bandwidth utilization rate and the computational power utilization rate of the target operator for each of the total execution time lengths, to obtain an evaluation result, comprises:

8. An operator optimized evaluation device, the device comprising:

9. An electronic device, comprising:

a second processor;

a memory for storing second processor-executable instructions;

wherein the second processor is configured to implement the method of any one of claims 1 to 7 when executing the instructions stored by the memory.

10. A non-transitory computer readable storage medium having stored thereon computer program instructions, which when executed by a third processor, implement the method of any of claims 1 to 7.

11. A computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, characterized in that a fourth processor in an electronic device performs the method of any one of claims 1 to 7 when the computer readable code is run in the fourth processor.