CN114238182A - Processor, data processing method and device - Google Patents

Processor, data processing method and device

Info

Publication number
CN114238182A
Authority
CN
China
Prior art keywords
data
conflict
delay time
data access
module
Prior art date
Legal status
Granted
Application number
CN202111566629.9A
Other languages
Chinese (zh)
Other versions
CN114238182B (en)
Inventor
郭向飞 (Guo Xiangfei)
陈玉平 (Chen Yuping)
Current Assignee
Beijing Eswin Computing Technology Co Ltd
Original Assignee
Beijing Eswin Computing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Eswin Computing Technology Co Ltd
Priority to CN202111566629.9A
Publication of CN114238182A
Application granted
Publication of CN114238182B
Active (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/16 Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1605 Handling requests for interconnection or transfer for access to memory bus based on arbitration
    • G06F13/161 Handling requests for interconnection or transfer for access to memory bus based on arbitration with latency improvement
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811 Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0842 Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0844 Multiple simultaneous or quasi-simultaneous cache accessing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/36 Handling requests for interconnection or transfer for access to common bus or bus system
    • G06F13/368 Handling requests for interconnection or transfer for access to common bus or bus system with decentralised access control
    • G06F13/376 Handling requests for interconnection or transfer for access to common bus or bus system with decentralised access control using a contention resolving method, e.g. collision detection, collision avoidance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824 Operand accessing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiments of the present application provide a processor, a data processing method and a data processing device. The processor includes a data request module and a data response module. The data response module is configured to, when a data access conflict event occurs, determine the conflict type of the event and send the conflict type to the data request module. The data request module is configured to determine a delay time corresponding to the conflict type and to retransmit the data request corresponding to the data access conflict event according to that delay time. Because the delay time is associated with the conflict type, the retransmitted data request can be processed in time, the impact of data access conflicts on subsequent data requests in the instruction stream is reduced, and the IPC performance of the CPU is improved.

Description

Processor, data processing method and device
Technical Field
The present application relates to the field of data processing, and in particular, to a processor, a data processing method, and an apparatus.
Background
A superscalar CPU (central processing unit) architecture implements instruction-level parallelism within a single processor core. It achieves higher CPU throughput at the same clock frequency.
In a superscalar CPU, the instruction issue strategy applies dependency-detection methods and corresponding handling measures during instruction issue to determine the order in which instructions in the instruction queue are issued; the efficiency of this algorithm directly affects the performance of the superscalar processor.
A superscalar CPU can execute more than one instruction per clock cycle. When a data access conflict event occurs while the CPU is executing the current instruction, the prior art usually retransmits the data request of that instruction after a fixed delay so that the instruction can be executed again. However, instructions often have data dependencies on one another, so retransmitting data requests with a fixed delay lowers the processing efficiency of the subsequent instruction stream and degrades the IPC (instructions per cycle) performance of the CPU.
Disclosure of Invention
The embodiments of the present application provide a data processing method and device, an electronic device, and a computer-readable storage medium, which can improve the IPC performance of a CPU. The technical solution is as follows:
according to one aspect of the embodiments of the present application, there is provided a processor including a data request module and a data response module, wherein:
the data response module is configured to, when a data access conflict event occurs, determine the conflict type of the event and send the conflict type to the data request module; and
the data request module is configured to determine a delay time corresponding to the conflict type, and to retransmit the data request corresponding to the data access conflict event according to the delay time.
Optionally, the data request module is configured to look up the delay time corresponding to the conflict type in a preset delay data lookup table.
Optionally, the processor further includes:
a delay data lookup table construction module, configured to run at least one test program on the processor and, when a data access conflict event is detected while the test program runs, determine the conflict type and the target delay time corresponding to the event based on the running result of the test program;
and to construct the delay data lookup table based on the conflict type and the target delay time.
Optionally, the delay data lookup table construction module is configured to count the data access conflict events generated when each test program runs and the conflict type of each event; for each conflict type, examine the running data of each test program to obtain that program's initial delay time; and compute statistics over the initial delay times to calculate the target delay time corresponding to the conflict type.
Optionally, the delay data lookup table construction module is configured to calculate the average value or an extreme value (minimum or maximum) of the initial delay times, and to obtain the target delay time from the average value or the extreme value.
Optionally, the delay data lookup table construction module is configured to generate key-value pair data based on the conflict type and the target delay time, and to generate the delay data lookup table from the key-value pair data.
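The statistics and key-value steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation in this application: the sample format `(conflict_type, initial_delay)`, the type names `"cache"` and `"bank"`, the helper names `target_delay` and `build_delay_table`, and the choice to round the average up to a whole clock cycle are all hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean


def target_delay(initial_delays, mode="mean"):
    """Reduce the initial delays (clock cycles) measured for one conflict type."""
    if not initial_delays:
        raise ValueError("no initial delay samples")
    if mode == "mean":
        # Assumption: round up so the table holds a whole number of cycles.
        return math.ceil(mean(initial_delays))
    if mode == "max":
        return max(initial_delays)
    if mode == "min":
        return min(initial_delays)
    raise ValueError(f"unknown mode: {mode}")


def build_delay_table(samples, mode="mean"):
    """Group (conflict_type, initial_delay) samples by type -> {conflict_type: target_delay}."""
    grouped = defaultdict(list)
    for conflict_type, delay in samples:
        grouped[conflict_type].append(delay)
    # Key-value pairs: conflict type -> target delay time.
    return {ctype: target_delay(delays, mode) for ctype, delays in grouped.items()}


# Hypothetical initial delays gathered from several test programs.
samples = [("cache", 3), ("cache", 5), ("cache", 4), ("bank", 1), ("bank", 2)]
mean_table = build_delay_table(samples)             # {"cache": 4, "bank": 2}
max_table = build_delay_table(samples, mode="max")  # {"cache": 5, "bank": 2}
```

Taking the maximum is the conservative option (a retry is rarely issued before the conflict clears), while the average gives shorter waits at the cost of occasional repeated NACKs; the text leaves the choice open.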
According to another aspect of the embodiments of the present application, there is provided a data processing method applied to a processor, the method including:
when a data access conflict event occurs, determining the conflict type of the data access conflict event;
determining a delay time corresponding to the conflict type; and
retransmitting the data request corresponding to the data access conflict event according to the delay time.
Optionally, determining the delay time corresponding to the conflict type includes:
looking up the delay time corresponding to the conflict type in a preset delay data lookup table.
Optionally, the method further includes:
acquiring at least one test program;
running the test program on the processor and, when a data access conflict event is detected, determining the conflict type and the target delay time corresponding to the event based on the running result of the test program; and
constructing the delay data lookup table based on the conflict type and the target delay time.
Optionally, determining the conflict type and the target delay time corresponding to the data access conflict event includes:
counting the data access conflict events generated when each test program runs and the conflict type corresponding to each event;
for each conflict type, examining the running data of each test program to obtain the initial delay time of that test program; and
computing statistics over the initial delay times to calculate the target delay time corresponding to the conflict type.
Optionally, calculating the target delay time corresponding to the conflict type includes:
calculating the average value or an extreme value of the initial delay times, and obtaining the target delay time from the average value or the extreme value.
Optionally, constructing the delay data lookup table based on the conflict type and the target delay time includes:
generating key-value pair data based on the conflict type and the target delay time; and
generating the delay data lookup table from the key-value pair data.
According to another aspect of the embodiments of the present application, there is provided a data processing apparatus including:
a first determining module, configured to determine the conflict type of a data access conflict event when the data access conflict event occurs;
a second determining module, configured to determine a delay time corresponding to the conflict type; and
a retransmission module, configured to retransmit the data request corresponding to the data access conflict event according to the delay time.
Optionally, the second determining module is configured to:
look up the delay time corresponding to the conflict type in a preset delay data lookup table.
Optionally, the apparatus further includes a test module configured to:
acquire at least one test program;
run the test program on the processor and, when a data access conflict event is detected, determine the conflict type and the target delay time corresponding to the event based on the running result of the test program; and
construct the delay data lookup table based on the conflict type and the target delay time.
Optionally, the test module is configured to:
count the data access conflict events generated when each test program runs and the conflict type corresponding to each event;
for each conflict type, examine the running data of each test program to obtain the initial delay time of that test program; and
compute statistics over the initial delay times to calculate the target delay time corresponding to the conflict type.
Optionally, the test module is further configured to:
calculate the average value or an extreme value of the initial delay times, and obtain the target delay time from the average value or the extreme value.
Optionally, the test module is further configured to:
generate key-value pair data based on the conflict type and the target delay time; and
generate the delay data lookup table from the key-value pair data.
According to another aspect of the embodiments of the present application, there is provided an electronic device including a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the steps of the method according to the first aspect of the embodiments of the present application.
According to a further aspect of the embodiments of the present application, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method according to the first aspect of the embodiments of the present application.
According to yet another aspect of the embodiments of the present application, there is provided a computer program product comprising a computer program that, when executed by a processor, implements the steps of the method according to the first aspect of the embodiments of the present application.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
In the embodiments of the present application, when a data access conflict event occurs, its conflict type is determined and the delay time corresponding to that type is obtained; the data request corresponding to the event is then retransmitted according to the delay time, so that the request is served once the conflict no longer exists, which improves data processing efficiency. Because the delay time is associated with the conflict type, the retransmitted data request can be processed in time, the impact of data access conflicts on subsequent data requests in the instruction stream is reduced, and the IPC performance of the CPU is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a system architecture diagram of a processor according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application;
Fig. 3 is a flowchart of an exemplary data processing method according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a data processing electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below in conjunction with the drawings in the present application. It should be understood that the embodiments set forth below in connection with the drawings are exemplary descriptions for explaining technical solutions of the embodiments of the present application, and do not limit the technical solutions of the embodiments of the present application.
As used herein, the singular forms "a", "an", "said" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the terms "comprises" and/or "comprising", when used in connection with the embodiments of the present application, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof, as supported in the art. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates at least one of the items it joins; e.g., "A and/or B" may be implemented as "A", as "B", or as "A and B".
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
CPU pipelining decomposes the execution of an instruction into multiple steps and overlaps the steps of different instructions, so that several instructions are processed in parallel and programs run faster. Each step is handled by its own dedicated circuitry: when an instruction finishes one step it moves on to the next, and the circuitry of the step it just left begins processing the following instruction.
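The benefit of overlapping steps can be made concrete with the standard ideal-pipeline cycle count: with k stages of one cycle each and n instructions, a non-pipelined core needs n·k cycles, while a pipeline needs k + n - 1 (k cycles to fill, then one completion per cycle). The sketch below is illustrative only; the function names and the `stall_cycles` parameter are hypothetical, the latter standing in for bubbles such as those caused when dependent instructions wait on a retransmitted data request.

```python
def cycles_unpipelined(n_instructions: int, n_stages: int) -> int:
    # Each instruction passes through every stage before the next one starts.
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int, stall_cycles: int = 0) -> int:
    # The first instruction takes n_stages cycles to fill the pipeline; after that,
    # ideally one instruction completes per cycle. Stalls add bubbles on top.
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1 + stall_cycles


print(cycles_unpipelined(100, 5))  # 500
print(cycles_pipelined(100, 5))    # 104
```

Every stall cycle adds directly to the pipelined total, which is why the length of the wait before a retransmission matters for IPC.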
With the emergence of application scenarios such as big data processing, cloud computing and deep learning, the performance demanded of processor designs keeps rising, and high performance remains the central topic in the processor field. Most processors on the market today are superscalar processors. A typical data access conflict scenario is that, after a data request is sent, the unit that is to serve it returns a NACK; the usual mechanism is to retransmit the NACKed data request either once access traffic has dropped and the unit is idle, or after a fixed delay. In particular, when the instruction stream has data dependencies, subsequent instructions may depend on earlier ones, so such retransmission delays lower the processing efficiency of the instruction stream, slow communication between the hardware units in the processor, and reduce the processor's hardware utilization.
The application provides a data processing method, a data processing device, an electronic device and a computer-readable storage medium, which aim to solve the above technical problems in the prior art.
The technical solutions of the embodiments of the present application and the technical effects they produce are described below through several exemplary embodiments. Note that the following embodiments may refer to or be combined with each other, and descriptions of identical terms, similar features, similar implementation steps and the like are not repeated across embodiments.
An embodiment of the present application provides a data processing method applied to a processor. As shown in Fig. 1, the processor includes a data request module and a data response module. As shown in Fig. 2, the method includes the following steps.
S201, when a data access conflict event occurs, the data response module determines the conflict type of the event and sends the conflict type to the data request module.
Specifically, during the instruction execution stage inside the processor, the data response module monitors whether a data access conflict event occurs as instructions are executed. When one occurs, it determines the conflict type of the event. Optionally, the conflict types may include Cache conflicts and bank conflicts.
The data response module may detect a conflict by querying communication interaction data, which may include the positive feedback (ACK) and/or negative feedback (NACK) messages with which the data response module answers data requests. Specifically, when a NACK message is found by the query, it is determined that a data access conflict event has occurred.
The Cache is a small, fast memory located between the CPU and the main memory (DRAM, Dynamic Random Access Memory), and is generally built from SRAM (Static Random Access Memory). Its function is to raise the rate at which the CPU can read and write data. The CPU is much faster than main memory, so when it fetches data directly from memory it must wait for some time. The Cache holds part of the data that the CPU has just used or uses repeatedly; if the CPU needs that data again, it can read it directly from the Cache, which avoids repeated memory accesses, reduces CPU waiting time, and improves system efficiency. Caches are divided into the L1 Cache (primary cache) and the L2 Cache (secondary cache); the L1 Cache is integrated inside the CPU, while the L2 Cache is integrated either on the motherboard or on the CPU.
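The claim that the Cache reduces CPU waiting time is commonly quantified with the textbook average memory access time, AMAT = hit time + miss rate × miss penalty, applied level by level. The sketch below uses made-up latencies and miss rates purely for illustration; they are not measurements from this application, and the function name `amat` is hypothetical.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time (cycles) = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical two-level hierarchy: L1 -> L2 -> DRAM (all numbers assumed).
dram_latency = 100  # cycles for a main-memory access
# L2: 10-cycle hit, 20% local miss rate, misses go to DRAM -> 30 cycles on average.
l2_time = amat(hit_time=10, miss_rate=0.2, miss_penalty=dram_latency)
# L1: 1-cycle hit, 5% miss rate, misses go to L2 -> 2.5 cycles on average.
l1_time = amat(hit_time=1, miss_rate=0.05, miss_penalty=l2_time)

print(f"with Caches: {l1_time:.1f} cycles, without: {dram_latency} cycles")
```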
Shared memory allows communicating processes to share certain data structures or memory regions and to exchange or transfer data. To achieve higher memory bandwidth, shared memory is typically divided into multiple equally sized memory modules, called banks, that can be accessed simultaneously. Therefore, any read or write request made up of n addresses that fall in n distinct banks can be served simultaneously, so the overall bandwidth can reach n times that of a single bank, where n does not exceed the number of banks b, and b and n are positive integers.
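The bank rule above can be illustrated with a word-interleaved layout, where consecutive words map to consecutive banks. This sketch assumes 4-byte words and 32 banks (typical values, assumed here, not stated in the application) and, matching the bank-conflict description that follows, treats only *different* words in the same bank as conflicting; requests for the same word are assumed to be merged. The result is the number of serialized passes: 1 means every address was served at once.

```python
from collections import defaultdict

WORD_BYTES = 4   # assumed word size in bytes
NUM_BANKS = 32   # assumed number of banks (b)


def bank_of(addr: int) -> int:
    # Word-interleaved mapping: word i lives in bank i mod NUM_BANKS.
    return (addr // WORD_BYTES) % NUM_BANKS


def serialized_passes(addrs) -> int:
    """How many sequential passes are needed to serve all addresses (1 = no conflict)."""
    words_per_bank = defaultdict(set)
    for addr in addrs:
        words_per_bank[bank_of(addr)].add(addr // WORD_BYTES)
    # The busiest bank must serve its distinct words one after another.
    return max((len(words) for words in words_per_bank.values()), default=0)


print(serialized_passes([4 * i for i in range(32)]))    # unit stride: 1
print(serialized_passes([8 * i for i in range(32)]))    # 2-word stride: 2
print(serialized_passes([128 * i for i in range(32)]))  # 32-word stride: 32
```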
In the embodiments of the present application, when the CPU needs target data from the Cache, it sends a data request for the target data to the Cache; if the Cache does not find the target data for that request, it returns a NACK message to the CPU. From the NACK message it can be determined that a data access conflict event has occurred, and the conflict type of the event is determined to be a Cache conflict.
When different threads in a warp (thread bundle) on a GPU access different word addresses in the same bank of the shared memory, the shared memory returns a NACK (negative acknowledgement) message to the GPU. From the NACK message it can be determined that a data access conflict event has occurred, and the conflict type of the event is determined to be a bank conflict.
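In both examples, step S201 amounts to mapping a NACK onto a conflict type. A minimal sketch, assuming the feedback message carries an ACK/NACK flag and a field naming the unit that produced it; the `Feedback` structure, its `source` field, and the source names `"cache"` and `"shared_memory"` are hypothetical, since the application does not specify the message format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ConflictType(Enum):
    CACHE = "cache"  # the Cache could not serve the request (target data not found)
    BANK = "bank"    # threads hit different words in the same shared-memory bank


@dataclass
class Feedback:
    ack: bool    # True = ACK (positive feedback), False = NACK (negative feedback)
    source: str  # hypothetical field: which unit sent the feedback


def classify(fb: Feedback) -> Optional[ConflictType]:
    """Return None if there is no conflict, otherwise the conflict type to report."""
    if fb.ack:
        return None
    if fb.source == "cache":
        return ConflictType.CACHE
    if fb.source == "shared_memory":
        return ConflictType.BANK
    raise ValueError(f"unrecognised NACK source: {fb.source!r}")
```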
S202, the data request module determines the delay time corresponding to the conflict type according to the conflict type.
The delay time may be obtained by testing each conflict type with test programs.
Specifically, a correspondence between conflict types and delay times may be constructed in advance, and the delay time for a given conflict type determined from that correspondence. Because the delay time is associated with the conflict type, the data request retransmitted after that delay can be processed in time, and the impact of the data access conflict on subsequent data requests in the instruction stream is reduced.
In the Cache example above, where the CPU receives a NACK message from the Cache and the conflict type is determined to be a Cache conflict, looking up the Cache conflict type yields a delay time of 4 clock cycles.
S203, the data request module retransmits the data request corresponding to the data access conflict event according to the delay time.
Specifically, the data request module may retransmit the data request corresponding to the data access conflict event after waiting for the delay time.
In the same Cache example, after the Cache conflict is identified from the NACK message and the lookup yields a delay time of 4 clock cycles, the CPU waits 4 clock cycles and then sends the data request for the target data to the Cache again; the Cache can now respond to the request correctly and returns an ACK message to the CPU.
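Steps S201 to S203 and the 4-cycle Cache example can be sketched as a small cycle-level loop. This is an illustrative model under assumptions, not the hardware design: the `Response` type, the function names, the `max_retries` bound, and the 2-cycle bank entry in the table are hypothetical; only the 4-cycle delay for a Cache conflict comes from the example above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Response:
    ack: bool
    conflict_type: Optional[str] = None  # set by the data response module on a NACK


# Delay data lookup table in clock cycles. The Cache entry matches the example above;
# the bank entry is a hypothetical placeholder.
DELAY_TABLE: Dict[str, int] = {"cache": 4, "bank": 2}


def request_with_retry(send: Callable[[int], Response], start_cycle: int,
                       delay_table: Dict[str, int] = DELAY_TABLE,
                       max_retries: int = 8) -> Tuple[int, List[str]]:
    """Issue a data request; on a NACK, wait the type-specific delay and retransmit.

    `send(cycle)` stands in for the data response module. Returns the cycle at
    which the request was acknowledged, plus a short event trace.
    """
    cycle, trace = start_cycle, []
    for _ in range(max_retries + 1):
        resp = send(cycle)
        if resp.ack:
            trace.append(f"cycle {cycle}: ACK")
            return cycle, trace
        delay = delay_table[resp.conflict_type]  # S202: look up delay by conflict type
        trace.append(f"cycle {cycle}: NACK ({resp.conflict_type}), retry in {delay}")
        cycle += delay                           # S203: wait, then retransmit
    raise RuntimeError("request still conflicting after max_retries")


def toy_cache(cycle: int) -> Response:
    # Hypothetical Cache that cannot serve the request before cycle 4.
    return Response(ack=True) if cycle >= 4 else Response(ack=False, conflict_type="cache")


done, trace = request_with_retry(toy_cache, start_cycle=0)
for event in trace:
    print(event)
```

Looking the wait up per conflict type, rather than using one fixed delay, is what lets a short-lived conflict be retried quickly while a longer one does not produce a string of premature retries; that is the intended effect the embodiment describes.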
Thus, when a data access conflict event occurs, this embodiment determines its conflict type, obtains the delay time corresponding to that type, and retransmits the data request according to the delay time, so that the request is served once the conflict no longer exists, which improves data processing efficiency. Because the delay time is associated with the conflict type, the retransmitted data request can be processed in time, the impact of data access conflicts on subsequent data requests in the instruction stream is reduced, and the IPC performance of the CPU is improved.
This embodiment of the present application provides a possible implementation in which determining the delay time corresponding to the conflict type in step S202 includes looking up the delay time corresponding to the conflict type in a preset delay data lookup table.
The delay data lookup table is obtained by testing each conflict type with preset test programs; its construction is described in detail below.
In a possible implementation of the present application, the processor further includes a delay data lookup table construction module, which constructs the delay data lookup table through the following steps:
(1) at least one test program is obtained.
The test programs may be the benchmark source code of SPEC CPU 2017.
SPEC CPU 2017 is a suite of tools for testing the CPU subsystem. The purpose of the SPEC CPU 2017 benchmarks and their run rules is to promote fair and objective CPU benchmarking; the rules help ensure that published results are meaningful and reproducible.
SPEC (Standard Performance Evaluation Corporation) distributes the CPU 2017 benchmarks as source code, which testers may not modify except in a few very limited cases. SPEC CPU 2017 includes 43 benchmarks divided into 4 suites: SPECspeed 2017 Integer, with 10 benchmarks; SPECrate 2017 Integer, with 10 benchmarks; SPECspeed 2017 Floating Point, with 10 benchmarks; and SPECrate 2017 Floating Point, with 13 benchmarks.
(2) Run the test program on the processor and, when a data access conflict event is detected while it runs, determine the conflict type and target delay time corresponding to the event based on the running result of the test program.
In the embodiments of the present application, a test program can be run on the processor and the data access conflict events it generates detected. When a data access conflict event is found, the conflict type and delay time of the event are collected from the running result of the test program, and the target delay time for each conflict type is determined. The processor here may include a CPU and the data interaction nodes associated with it, such as the Cache and registers.
The embodiment of the present application provides a possible implementation manner, where the determining of the conflict type and the target delay time corresponding to the data access conflict event includes performing the following steps a, b, and c:
a. and counting data access conflict events generated when each test program is operated and conflict types corresponding to the data access conflict events.
Specifically, the delayed data comparison table building module may count the number of data access conflict events based on the running log of the test program, and determine a conflict type corresponding to each data access conflict event.
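Step a can be pictured with the following minimal Python sketch. The patent does not define a log format, so the `cycle=... NACK type=... delay=...` line format, the conflict-type labels and the helper names `parse_conflicts` and `count_conflicts` are all hypothetical, and the sample log lines are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical run-log line format (an assumption; the patent does not define one):
#   "cycle=<n> NACK type=<CONFLICT_TYPE> delay=<cycles>"
NACK_LINE = re.compile(r"NACK type=(?P<type>\w+) delay=(?P<delay>\d+)")


def parse_conflicts(log_lines):
    """Return (conflict_type, delay_cycles) for every NACK line in the log."""
    events = []
    for line in log_lines:
        match = NACK_LINE.search(line)
        if match:  # lines without a NACK are ordinary, conflict-free traffic
            events.append((match["type"], int(match["delay"])))
    return events


def count_conflicts(events):
    """Step a: number of data access conflict events per conflict type."""
    return Counter(conflict_type for conflict_type, _ in events)


# Made-up sample log, for illustration only
log = [
    "cycle=100 ACK addr=0x1000",
    "cycle=120 NACK type=BANK delay=4",
    "cycle=130 NACK type=CACHE delay=4",
    "cycle=240 NACK type=BANK delay=3",
]
events = parse_conflicts(log)
print(count_conflicts(events))  # Counter({'BANK': 2, 'CACHE': 1})
```

The per-event delays collected here are also the raw material for the initial delay times used in step b.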
In some embodiments, when the CPU needs to acquire target data from the Cache, the CPU sends a data request corresponding to the target data to the Cache, and when the Cache does not find the corresponding target data based on the data request, returns a NACK message to the CPU; at this time, it may be determined that a data access collision event occurs according to the NACK message, and determine that a collision type corresponding to the data access collision event is a Cache collision.
Optionally, a Cache is divided into multiple sets (groups), each set contains multiple Cachelines, and many memory locations may map to the same Cacheline. Write-back is one of the Cache write policies: when the CPU writes a new value into the L1 Cache, the value is not immediately propagated to lower-level storage; it is written to the lower-level Cache only when another data interaction node accesses it, which improves the CPU's execution efficiency.
Normally, a Cache miss blocks further Cache accesses until the missing value has been fetched from lower-level storage. A non-blocking Cache requires additional hardware to record information about outstanding misses, so that the Cache can keep serving requests while misses are pending. This hardware is called MSHRs (Miss Status Handling Registers); each entry stores the address of a pending miss, the target block in the Cache and the target register.
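As an illustration of how many memory addresses share one set, the sketch below splits a physical address into tag, set index and byte offset. The 64-byte Cacheline and 64-set geometry are assumptions chosen for the example, not parameters given in the text, and `split_address` is a hypothetical helper.

```python
# Assumed geometry (not specified in the patent): 64-byte Cachelines, 64 sets
LINE_BYTES = 64
NUM_SETS = 64
OFFSET_BITS = LINE_BYTES.bit_length() - 1  # log2(64) = 6
INDEX_BITS = NUM_SETS.bit_length() - 1     # log2(64) = 6


def split_address(addr):
    """Split a physical address into (tag, set index, byte offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


print(split_address(0x12345), split_address(0x13345))  # (18, 13, 5) (19, 13, 5)
```

Addresses 0x12345 and 0x13345 fall into the same set (index 13) but carry different tags, so they compete for the same group of Cachelines.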
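A minimal sketch of an MSHR file follows, assuming each entry holds exactly the three fields named above (miss address, target block, target register). The class and method names are hypothetical; an allocation that fails because every entry is busy corresponds to what the text calls an MSHR data conflict.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MshrEntry:
    miss_address: int     # address of the pending Cache miss
    target_block: int     # Cache block that will receive the fill
    target_register: str  # register waiting for the missing data


class MshrFile:
    def __init__(self, num_entries: int):
        self.num_entries = num_entries
        self.entries: List[MshrEntry] = []

    def allocate(self, miss_address: int, target_block: int,
                 target_register: str) -> Optional[MshrEntry]:
        """Record a new outstanding miss; return None when no MSHR is free."""
        if len(self.entries) >= self.num_entries:
            return None  # no idle MSHR: the request must be NACKed
        entry = MshrEntry(miss_address, target_block, target_register)
        self.entries.append(entry)
        return entry

    def complete(self, miss_address: int) -> None:
        """Free the entry once the fill returns from lower-level storage."""
        self.entries = [e for e in self.entries if e.miss_address != miss_address]


mshrs = MshrFile(num_entries=2)
mshrs.allocate(0x1000, target_block=3, target_register="x1")
mshrs.allocate(0x2000, target_block=5, target_register="x2")
print(mshrs.allocate(0x3000, target_block=7, target_register="x3"))  # None
```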
Cache conflicts may include the following:
Write-back data conflict: when the CPU sends a data request to the Cache, the Cache resolves the corresponding Cacheline from the request's access address; if this Cacheline is the same as the Cacheline holding write-back data in the Cache, a write-back data conflict occurs.
Probe data conflict: when the CPU sends a data request to the Cache, the Cache resolves the corresponding Cacheline from the request's access address; if this Cacheline is the same as the Cacheline targeted by a probe registered on the bus by a driver, a probe data conflict occurs.
MSHR data conflict: when the CPU sends a data request to the Cache and the request misses, but no idle MSHR is available to handle the miss, an MSHR data conflict occurs.
Tag data conflict: when the CPU sends a data request to the Cache, the Cache resolves from the request's access address the corresponding Cacheline and its physical-address index (index 1). If the index of the Cacheline of a pending MSHR request (index 2) equals index 1 but the tags of the two memory addresses differ, a tag data conflict occurs.
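The four Cache conflict cases above can be summarized as a classifier. This is a hypothetical sketch: Cachelines are identified by (tag, set index) pairs, the function arguments stand in for hardware state, and the order in which the checks are applied is an assumption, since the text does not specify a priority.

```python
from enum import Enum
from typing import Iterable, Optional, Tuple

Line = Tuple[int, int]  # (tag, set index) identifying a Cacheline


class CacheConflict(Enum):
    WRITEBACK = "write-back data conflict"
    PROBE = "probe data conflict"
    MSHR = "MSHR data conflict"
    TAG = "tag data conflict"


def classify_cache_conflict(
    line: Line,                       # Cacheline resolved from the request address
    hit: bool,                        # whether the request hits in the Cache
    writeback_lines: Iterable[Line],  # Cachelines holding write-back data
    probe_lines: Iterable[Line],      # Cachelines targeted by pending bus probes
    free_mshrs: int,                  # number of idle MSHR entries
    pending_misses: Iterable[Line],   # Cachelines of misses already held in MSHRs
) -> Optional[CacheConflict]:
    """Return the conflict type, or None if the request can be served (ACK).
    Checks follow the order listed in the text; real priority is an assumption."""
    if line in set(writeback_lines):
        return CacheConflict.WRITEBACK
    if line in set(probe_lines):
        return CacheConflict.PROBE
    if not hit and free_mshrs == 0:
        return CacheConflict.MSHR
    tag, index = line
    # Same set index as an in-flight miss but a different tag
    if any(p_index == index and p_tag != tag for p_tag, p_index in pending_misses):
        return CacheConflict.TAG
    return None
```

A `None` result means the Cache would answer with an ACK; any other value would be reported back with a NACK.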
In other embodiments, when different threads of a warp (thread bundle) on a GPU access different word addresses within the same bank of shared memory, the shared memory returns a NACK message to the GPU; from this NACK message it is determined that a data access conflict event has occurred and that its conflict type is a bank conflict.
b. For each conflict type, monitor the run data of each test program to obtain the initial delay times.
For example, when the processor is tested with the test program, it is found that 5 bank conflicts occur, and the initial delay times recorded for these bank conflicts include 4, 3 and 5 clock cycles.
c. Compute statistics over the initial delay times to obtain the target delay time for the conflict type.
Specifically, the delay data lookup table construction module computes the target delay time from the initial delay times, as detailed below.
In one possible implementation, calculating the target delay time of a conflict type comprises: computing the average or an extreme value of the initial delay times, and taking the result as the target delay time.
In some embodiments, the delay data lookup table construction module uses the average of the initial delay times as the target delay time; in other embodiments, it uses their maximum or minimum.
Continuing the bank-conflict example above, averaging the initial delay times gives a target delay time of 4 clock cycles for bank conflicts.
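The bank-conflict condition can be checked for one warp-wide access with the sketch below. It assumes 32 banks of 4-byte words (a common shared-memory layout, used here as an assumption) and treats several threads reading the same word as a broadcast rather than a conflict; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Iterable

NUM_BANKS = 32  # assumed number of shared-memory banks
WORD_BYTES = 4  # assumed bank width: one 4-byte word per bank per access


def bank_conflict_degree(byte_addresses: Iterable[int]) -> int:
    """Largest number of distinct words any single bank must serve for one
    warp-wide access: 1 means conflict-free, k > 1 means a k-way conflict."""
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // WORD_BYTES
        # Threads reading the same word share one broadcast, hence a set
        words_per_bank[word % NUM_BANKS].add(word)
    if not words_per_bank:
        return 0  # no active threads
    return max(len(words) for words in words_per_bank.values())


print(bank_conflict_degree(4 * t for t in range(32)))  # 1: consecutive words
print(bank_conflict_degree(8 * t for t in range(32)))  # 2: stride of two words
```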
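Step c with the three statistics named in the text (average, maximum, minimum) can be sketched as follows. The function name and the rule of rounding the average up to whole clock cycles are assumptions; the sample values are the 4, 3 and 5 clock cycles listed in the bank-conflict example.

```python
import math


def target_delay(initial_delays, policy="mean"):
    """Reduce the initial delays (clock cycles) measured for one conflict type
    to a single target delay: their average, maximum or minimum."""
    if not initial_delays:
        raise ValueError("no initial delay samples for this conflict type")
    if policy == "mean":
        # Rounding up to whole cycles is an assumption; the text does not say
        return math.ceil(sum(initial_delays) / len(initial_delays))
    if policy == "max":
        return max(initial_delays)
    if policy == "min":
        return min(initial_delays)
    raise ValueError(f"unknown policy: {policy!r}")


bank_delays = [4, 3, 5]
print(target_delay(bank_delays), target_delay(bank_delays, "max"),
      target_delay(bank_delays, "min"))  # 4 5 3
```

Choosing the maximum makes the retry conservative (fewer repeated NACKs), while the minimum retries sooner at the risk of hitting the same conflict again.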
(3) Build the delay data lookup table from the conflict types and target delay times.
In one possible implementation, building the delay data lookup table comprises: generating key-value pairs from the conflict types and target delay times, and generating the delay data lookup table from these key-value pairs.
Each conflict type maps to one target delay time. Specifically, the delay data lookup table construction module may size the storage space according to the number of key-value pairs and store the delay data lookup table in that space.
In this embodiment, the target delay time of a conflict type can be obtained by querying the delay data lookup table. Because each target delay time is set according to the measured characteristics of its conflict type, the success rate of data accesses can be improved, which in turn improves the processor's data processing efficiency and performance.
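A minimal sketch of building and querying the delay data lookup table as key-value pairs is shown below, assuming string conflict-type keys and the rounded-up average as the target delay. The `lookup_delay` fallback for conflict types never seen during profiling is a hypothetical addition, and the `CACHE` sample is chosen only to match the 4-cycle Cache-conflict example in this description.

```python
import math
from typing import Dict, List


def build_delay_table(samples: Dict[str, List[int]]) -> Dict[str, int]:
    """Build the delay data lookup table: one key-value pair
    (conflict type -> target delay in cycles) per profiled conflict type.
    The target delay is the average rounded up, one of the options in the text."""
    return {
        conflict_type: math.ceil(sum(delays) / len(delays))
        for conflict_type, delays in samples.items()
        if delays  # skip types that were never observed
    }


def lookup_delay(table: Dict[str, int], conflict_type: str, default: int = 1) -> int:
    """Query the table; the fallback for unprofiled types is a hypothetical choice."""
    return table.get(conflict_type, default)


table = build_delay_table({"BANK": [4, 3, 5], "CACHE": [4]})
print(table)                        # {'BANK': 4, 'CACHE': 4}
print(lookup_delay(table, "MSHR"))  # 1
```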
To illustrate the data processing method, an example is described in detail below with reference to Fig. 3. The method is applied to a processor comprising a data request module, a data response module and a delay data lookup table construction module, and includes the following steps:
S301: The delay data lookup table construction module obtains at least one test program.
The test program may be benchmark source code from SPEC CPU 2017.
S302: The test program is run on the processor; when a data access conflict event is detected, the conflict type and target delay time of the event are determined from the program's run results.
In this embodiment, the delay data lookup table construction module runs the test program on the processor, monitors the data access conflict events generated during the run, records the conflict type and delay time of each detected event from the run results, and determines a target delay time for each conflict type. The processor includes a CPU and its associated data interaction nodes, such as a Cache and registers.
S303: Build the delay data lookup table from the conflict types and target delay times.
Specifically, the delay data lookup table construction module generates key-value pairs from the conflict types and target delay times, and generates the delay data lookup table from these key-value pairs.
Each conflict type maps to one target delay time. The storage space for the key-value pairs may be sized according to their number, and the delay data lookup table is stored in that space.
S304: When a data access conflict event occurs, the data response module determines its conflict type.
Specifically, the data response module monitors the communication between the data sending node and the data receiving node and determines from it whether a data access conflict event has occurred; if so, it determines the conflict type of the event.
S305: The data request module queries the delay data lookup table to obtain the delay time for the conflict type.
The delay data lookup table is obtained in advance by testing each conflict type with preset test programs.
S306: Retransmit the data request corresponding to the data access conflict event according to the delay time.
Specifically, the data request module waits for the delay time and then retransmits the data request corresponding to the data access conflict event.
Taking a Cache conflict as an example: when the CPU needs target data from the Cache, it sends the Cache a data request for that data. If the Cache cannot return the target data for the request, it returns a NACK message to the CPU, from which it is determined that a data access conflict event has occurred and that its conflict type is a Cache conflict. Querying the table for Cache conflicts yields a delay time of 4 clock cycles, so the CPU waits 4 clock cycles and then resends the data request to the Cache; this time the Cache responds correctly and returns an ACK message to the CPU.
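Steps S304 to S306 can be simulated with the cycle-counting sketch below. The response encoding, the `max_attempts` bound (the text does not limit retries) and the simulated `busy_cache` responder are assumptions for illustration. With a 4-cycle table entry for Cache conflicts, it reproduces the Cache-conflict example, in which the resent request is acknowledged at cycle 4.

```python
from typing import Callable, Dict, Optional, Tuple

# A response is ("ACK", None) or ("NACK", conflict_type); hypothetical encoding
Response = Tuple[str, Optional[str]]


def request_with_retry(
    send: Callable[[int], Response],
    delay_table: Dict[str, int],
    max_attempts: int = 8,
    default_delay: int = 1,
) -> Tuple[bool, int]:
    """Send a request; on NACK, look up the conflict type's delay, wait that
    many cycles and retransmit. Returns (acknowledged, cycle of last attempt)."""
    cycle = 0
    for attempt in range(max_attempts):
        status, conflict_type = send(cycle)  # S304: responder reports the conflict
        if status == "ACK":
            return True, cycle
        if attempt + 1 < max_attempts:
            # S305: table lookup; S306: wait, then loop around to retransmit
            cycle += delay_table.get(conflict_type, default_delay)
    return False, cycle


def busy_cache(cycle: int) -> Response:
    # Simulated Cache that cannot serve the request before cycle 4
    return ("ACK", None) if cycle >= 4 else ("NACK", "CACHE")


print(request_with_retry(busy_cache, {"CACHE": 4}))  # (True, 4)
```

With a fixed one-cycle retry instead, the same simulated Cache would be queried and NACKed at cycles 1, 2 and 3 before succeeding, which is the kind of repeated conflict the type-specific delay is meant to avoid.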
In this embodiment, when a data access conflict event occurs, its conflict type is determined and the corresponding delay time is obtained, and the data request is retransmitted after that delay so that it is served once the conflict has cleared, which improves data processing efficiency. Because the delay time is tied to the conflict type, the retransmitted request can be processed promptly, which reduces the impact of data access conflicts on subsequent data requests in the instruction stream and improves the CPU's IPC (instructions per cycle).
An embodiment of the present application provides a data processing apparatus. As shown in Fig. 4, the data processing apparatus 40 may include a first determining module 401, a second determining module 402 and a retransmission module 403, where:
the first determining module 401 is configured to determine a conflict type of a data access conflict event when the data access conflict event occurs;
a second determining module 402, configured to determine a delay time corresponding to the conflict type;
a retransmission module 403, configured to retransmit the data request corresponding to the data access collision event according to the delay time.
In one possible implementation, the second determining module 402 is configured to:
query a preset delay data lookup table to obtain the delay time corresponding to the conflict type.
In one possible implementation, the apparatus further includes a test module configured to:
obtain at least one test program;
run the test program on the processor and, when a data access conflict event is detected, determine the conflict type and target delay time of the event based on the running result of the test program;
construct the delay data lookup table based on the conflict type and the target delay time.
In one possible implementation, the test module is configured to:
count the data access conflict events generated while each test program runs and the conflict type of each event;
for each conflict type, monitor the running data of each test program to obtain the initial delay times;
compute statistics over the initial delay times to calculate the target delay time corresponding to the conflict type.
In one possible implementation, the test module is further configured to:
calculate the average or an extreme value of the initial delay times and obtain the target delay time from it.
In one possible implementation, the test module is further configured to:
generate key-value pairs based on the conflict type and the target delay time;
generate the delay data lookup table based on the key-value pairs.
The apparatus of this embodiment can execute the method provided by the embodiments of the present application with a similar implementation principle: the actions performed by each module of the apparatus correspond to the steps of the method. For a detailed functional description of the modules, refer to the description of the corresponding method above; it is not repeated here.
The apparatus achieves the same benefits as the method: the data request is retransmitted after a delay tied to the conflict type, which reduces the impact of data access conflicts on subsequent data requests in the instruction stream and improves the CPU's IPC performance.
An embodiment of the present application provides an electronic device comprising a memory, a processor and a computer program stored in the memory. The processor executes the computer program to implement the steps of the data processing method described above and, compared with the related art, achieves the same benefits: when a data access conflict event occurs, its conflict type is determined, the corresponding delay time is obtained, and the data request is retransmitted after that delay, so that the retransmitted request can be processed promptly, the impact of data access conflicts on subsequent data requests in the instruction stream is reduced, and the CPU's IPC performance is improved.
In an optional embodiment, an electronic device is provided. As shown in Fig. 5, the electronic device 500 comprises a processor 501 and a memory 503, where the processor 501 is coupled to the memory 503, for example via a bus 502. Optionally, the electronic device 500 may further include a transceiver 504 for exchanging data with other electronic devices, such as sending and/or receiving data. In practice the number of transceivers 504 is not limited to one, and the structure of the electronic device 500 does not limit the embodiments of the present application.
The processor 501 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, capable of implementing or executing the various illustrative logical blocks, modules and circuits described in connection with this disclosure. The processor 501 may also be a combination that implements computing functions, for example one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 502 may include a path for transferring information between the above components. The bus 502 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus or the like, and may be divided into an address bus, a data bus, a control bus and so on. For ease of illustration, only one thick line is shown in Fig. 5, which does not mean that there is only one bus or one type of bus.
The memory 503 may be a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other computer-readable medium that can carry or store a computer program, without limitation.
The memory 503 stores the computer program for executing the embodiments of the present application, and its execution is controlled by the processor 501. The processor 501 executes the computer program stored in the memory 503 to implement the steps of the foregoing method embodiments.
Electronic devices include, but are not limited to, mobile terminals such as mobile phones, notebook computers and tablets (PADs), and fixed terminals such as digital TVs and desktop computers.
An embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps and corresponding content of the foregoing method embodiments.
An embodiment of the present application provides a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to:
determine the conflict type of a data access conflict event when the event occurs;
determine the delay time corresponding to the conflict type;
retransmit the data request corresponding to the data access conflict event according to the delay time.
The terms "first", "second", "third", "fourth", "1", "2" and the like in the description, claims and drawings (if any) of the present application are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that terms so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described.
It should be understood that although the flowcharts of the embodiments indicate the operation steps with arrows, the steps are not limited to the order shown by the arrows. Unless explicitly stated herein, in some implementation scenarios the steps in the flowcharts may be performed in other orders as required. In addition, depending on the actual implementation scenario, some or all of the steps in each flowchart may comprise multiple sub-steps or stages, which may be executed at the same time or at different times; where they are executed at different times, their order can be configured flexibly as required, which is not limited in the embodiments of the present application.
The foregoing is only an optional implementation for some implementation scenarios of the present application. It should be noted that, for those skilled in the art, other similar implementations based on the technical idea of the present application, without departing from that idea, also fall within the protection scope of the embodiments of the present application.

Claims (10)

1. A processor, comprising a data request module and a data response module, wherein:
the data response module is configured to, when a data access conflict event occurs, determine the conflict type of the data access conflict event and send the conflict type to the data request module;
the data request module is configured to determine the delay time corresponding to the conflict type according to the conflict type, and to retransmit the data request corresponding to the data access conflict event according to the delay time.
2. The processor according to claim 1, wherein the data request module is configured to query a preset delay data lookup table to obtain the delay time corresponding to the conflict type.
3. The processor of claim 2, further comprising:
a delay data lookup table construction module configured to run at least one test program in the processor; when a data access conflict event is detected while the test program runs, determine the conflict type and the target delay time of the data access conflict event based on the running result of the test program; and construct the delay data lookup table based on the conflict type and the target delay time.
4. The processor according to claim 3, wherein the delay data lookup table construction module is configured to count the data access conflict events generated when each test program is run and the conflict types of those events; monitor the running data of each test program for each conflict type to obtain the initial delay times of each test program; and compute statistics over the initial delay times to calculate the target delay time corresponding to the conflict type.
5. The processor according to claim 4, wherein the delay data lookup table construction module is configured to calculate an average value or an extreme value of the initial delay times and obtain the target delay time from the average value or the extreme value.
6. The processor according to claim 3, wherein the delay data lookup table construction module is configured to generate key-value pair data based on the conflict type and the target delay time, and to generate the delay data lookup table based on the key-value pair data.
7. A data processing method, applied to a processor, comprising the following steps:
determining a conflict type of a data access conflict event when the data access conflict event occurs;
determining a delay time corresponding to the conflict type;
and retransmitting the data request corresponding to the data access conflict event according to the delay time.
8. A data processing apparatus, comprising:
a first determining module, configured to determine the conflict type of a data access conflict event when the data access conflict event occurs;
a second determining module, configured to determine a delay time corresponding to the conflict type;
a retransmission module, configured to retransmit the data request corresponding to the data access conflict event according to the delay time.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to perform the steps of the method of claim 7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as claimed in claim 7.
CN202111566629.9A 2021-12-20 2021-12-20 Processor, data processing method and device Active CN114238182B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111566629.9A CN114238182B (en) 2021-12-20 2021-12-20 Processor, data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111566629.9A CN114238182B (en) 2021-12-20 2021-12-20 Processor, data processing method and device

Publications (2)

Publication Number Publication Date
CN114238182A true CN114238182A (en) 2022-03-25
CN114238182B CN114238182B (en) 2023-10-20

Family

ID=80759816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111566629.9A Active CN114238182B (en) 2021-12-20 2021-12-20 Processor, data processing method and device

Country Status (1)

Country Link
CN (1) CN114238182B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114928816A (en) * 2022-04-24 2022-08-19 深圳数马电子技术有限公司 Device connection method, system, terminal device, detection device and storage medium

Citations (9)

Publication number Priority date Publication date Assignee Title
US6449673B1 (en) * 1999-05-17 2002-09-10 Hewlett-Packard Company Snapshot and recall based mechanism to handle read after read conflict
US6467032B1 (en) * 1999-06-04 2002-10-15 International Business Machines Corporation Controlled reissue delay of memory requests to reduce shared memory address contention
US20110314338A1 (en) * 2010-06-18 2011-12-22 Microsoft Corporation Data collisions in concurrent programs
US20140075121A1 (en) * 2012-09-07 2014-03-13 International Business Machines Corporation Selective Delaying of Write Requests in Hardware Transactional Memory Systems
US20140337587A1 (en) * 2013-05-13 2014-11-13 Advanced Micro Devices, Inc. Method for memory consistency among heterogeneous computer components
CN106610816A (en) * 2016-12-29 2017-05-03 山东师范大学 Avoidance method for conflict between instruction sets in RISC-CPU and avoidance system thereof
US20180260231A1 (en) * 2017-03-13 2018-09-13 Board Of Supervisors Of Louisiana State University And Agricultural And Mechanical College Enhanced performance for graphical processing unit transactional memory
CN112217701A (en) * 2019-07-09 2021-01-12 杭州萤石软件有限公司 Bus collision avoidance method and device
CN112506700A (en) * 2020-11-30 2021-03-16 北京达佳互联信息技术有限公司 Conflict processing method and device, electronic equipment and storage medium

Non-Patent Citations (2)

Title
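Zhang Jizan; Gu Zhimin: "Bank conflict analysis and latency minimization for multi-core shared caches" (多核共享缓存bank冲突分析及其延迟最小化), Chinese Journal of Computers (计算机学报), no. 09, pages 169 - 185 *
Ou Yan; Feng Yujing; Li Wenming; Ye Xiaochun; Wang Da; Fan Dongrui: "Research on optimizing intra-instruction memory access conflicts for dataflow architectures" (面向数据流结构的指令内访存冲突优化研究), Journal of Computer Research and Development (计算机研究与发展), no. 12, pages 204 - 216 *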

Also Published As

Publication number Publication date
CN114238182B (en) 2023-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100176 Room 101, 1f, building 3, yard 18, Kechuang 10th Street, Beijing Economic and Technological Development Zone, Beijing

Applicant after: Beijing yisiwei Computing Technology Co.,Ltd.

Address before: 100176 Room 101, 1f, building 3, yard 18, Kechuang 10th Street, Beijing Economic and Technological Development Zone, Beijing

Applicant before: Beijing yisiwei Computing Technology Co.,Ltd.

GR01 Patent grant