WO2020090009A1

WO2020090009A1 - Arithmetic processing device and control method thereof

Info

Publication number: WO2020090009A1
Application number: PCT/JP2018/040345
Authority: WO
Inventors: 朋広永野
Original assignee: 富士通株式会社
Priority date: 2018-10-30
Filing date: 2018-10-30
Publication date: 2020-05-07
Also published as: JPWO2020090009A1; JP7036226B2

Abstract

The present invention comprises: a determination circuit (1221) that, for first to third chain groups to be respectively subjected to first to third arithmetic processes, sets, as a first determined chain group, the first chain group having a relationship in which a second arithmetic process is continuously performed after a first arithmetic process, sets, as a second determined chain group, a chain group which is obtained by calculating the third chain group in a certain manner with respect to the first chain group, and which has a relationship in which after the first arithmetic process, one or more third arithmetic processes are executed, and then the second arithmetic process is executed, and that determines whether the first or second determined chain group includes a second chain group to be subjected to the second arithmetic process; a generation circuit (1222) which generates an initialization instruction of an identifier of the second arithmetic process, when the first and second determined chain groups do not include the second chain group; and an acquisition circuit (322) which acquires an initialized identifier for the second arithmetic process when the initialization instruction is issued, and acquires an identifier that is continuous with the identifier of the first arithmetic process, for the second arithmetic process, when the initialization instruction is not issued.

Description

Arithmetic processing device and control method thereof

The present invention relates to an arithmetic processing device and its control method.

There is a multi-core arithmetic unit in which the main memory and Memory Access Controller (MAC) and the core register file are connected one-to-many.

In the multi-core architecture, each core is provided with a register file and an arithmetic execution unit. An instruction unit decodes instructions for data read / write (in other words, load / store) and arithmetic execution, and issues the instructions to each read / write unit.

JP 2001-175632 A; Japanese Patent Publication No. 2008-509493

The performance of a multi-core computing device depends on the throughput of memory data transfer. However, if processing of a subsequent multicast load instruction is made to wait until the memory read completion of the preceding multicast load instruction is received, memory data throughput may decrease.

In one aspect, the technology described in this specification aims to improve the throughput of memory data by reducing the waiting time between chains in a multi-core architecture.

In one aspect, the arithmetic processing device has multiple cores forming a plurality of chains and includes a determination circuit, a generation circuit, and an acquisition circuit. Among the plurality of chains, a first chain group is the target of a first arithmetic processing, a second chain group is the target of a second arithmetic processing, and a third chain group is the target of a third arithmetic processing. The determination circuit takes, as a first determined chain group, the first chain group in a relationship in which the second arithmetic processing is executed immediately after the first arithmetic processing. It takes, as a second determined chain group, the chain group obtained by calculating the third chain group with respect to the first chain group by a fixed method, in a relationship in which at least one third arithmetic processing is executed after the first arithmetic processing and the second arithmetic processing is executed after that. The determination circuit determines whether the first or second determined chain group includes the second chain group that is the target of the second arithmetic processing. The generation circuit generates an initialization instruction that initializes the identifier of the second arithmetic processing when neither the first nor the second determined chain group includes the second chain group. The acquisition circuit acquires an initialized identifier for the second arithmetic processing when the initialization instruction is issued, and acquires, for the second arithmetic processing, an identifier that continues from the identifier of the first arithmetic processing when the initialization instruction is not issued.

According to the disclosed arithmetic processing device, it is possible to improve the throughput of memory data by reducing the waiting time between chains in a multi-core architecture.

FIG. 1 is a block diagram schematically showing a configuration example of an accelerator in a related example.

FIG. 2 is a block diagram explaining a configuration example of the accelerator and a memory access process in the related example.

FIG. 3 is a block diagram illustrating a memory access process in the read / write unit of the accelerator shown in FIG. 2.

FIG. 4 is a flowchart illustrating a memory access process in the accelerator shown in FIG. 2.

FIG. 5 is a flowchart illustrating a memory access process in the accelerator shown in FIG. 2.

FIG. 6 is a flowchart illustrating a multicast REQID initialization process in the read / write unit of the accelerator shown in FIG. 2.

FIG. 7 is a flowchart illustrating a load instruction issuing process in the instruction unit of the accelerator shown in FIG. 2.

FIG. 8 is a flowchart illustrating a REQID acquisition process in the read / write unit of the accelerator shown in FIG. 2.

FIG. 9 is a table illustrating the usage status of REQIDs in a multicast load process in the related example shown in FIG. 2.

FIG. 10 is a block diagram schematically showing a hardware configuration example of an information processing device in an example of an embodiment.

FIG. 11 is a block diagram schematically showing a configuration example of the accelerator shown in FIG. 10.

FIG. 12 is a diagram showing the format of a request packet in the accelerator shown in FIG. 11.

FIG. 13 is a diagram showing the format of a completion packet in the accelerator shown in FIG. 11.

FIG. 14 is a block diagram illustrating a memory access process in the accelerator shown in FIG. 11.

FIG. 15 is a diagram showing a configuration example of the multicast REQID synchronization monitoring circuit of the accelerator shown in FIG. 11.

FIG. 16 is a diagram showing a configuration example of the multicast REQID initialization signal generation circuit of the accelerator shown in FIG. 11.

FIG. 17 is a diagram showing a configuration example of the multicast acquisition circuit of the accelerator shown in FIG. 11.

FIG. 18 is a flowchart illustrating a process of issuing a REQID initialization instruction in the instruction unit of the accelerator shown in FIG. 11.

FIG. 19 is a flowchart illustrating a REQID initialization process in the read / write unit of the accelerator shown in FIG. 11.

FIG. 20 is a flowchart illustrating a load instruction issuing process in the instruction unit of the accelerator shown in FIG. 11.

FIG. 21 is a flowchart illustrating a REQID acquisition process in the read / write unit of the accelerator shown in FIG. 11.

FIG. 22 is a table showing an example of the usage status of REQIDs in the multicast load process in the embodiment shown in FIG. 11.

An embodiment will be described below with reference to the drawings. However, the embodiments described below are merely examples, and there is no intention to exclude the application of various modifications and techniques not explicitly shown in the embodiments. The present embodiment can be variously modified and implemented without departing from the spirit thereof.

Also, each diagram is not intended to include only the constituent elements shown in the diagram, and may include other functions and the like.

Hereafter, in the figures, parts with the same reference numerals indicate similar parts.

[A] Related Example

FIG. 1 is a block diagram schematically showing a configuration example of the accelerator 600 in the related example.

The accelerator 600 processes an arithmetic instruction, and includes a MAC 6, a memory 7, and a plurality of (three in the illustrated example) cores 8 (“cores # 0 to # 2”).

The MAC 6 handles access to the memory 7 by each core 8.

The memory 7 may be used as a primary recording memory or a working memory.

Each core 8 performs loads from and stores to the memory 7 via the MAC 6. Each core 8 functions as an instruction unit 81, a read / write unit 82, and an operation execution unit 83, and holds a register file 84.

The register file 84 stores the data acquired from the memory 7. The operation execution unit 83 performs an operation using the data stored in the register file 84.

The instruction unit 81 decodes the load instruction sent from the software and instructs the read / write unit 82 to execute the load instruction.

The read / write unit 82 divides the load instruction into memory access units, attaches a request ID (which may be referred to as “REQID”) to each divided load instruction, and issues a memory read request to the MAC 6.

FIG. 2 is a block diagram explaining a configuration example of the accelerator 600 and its memory access process. Unlike FIG. 1, the instruction unit 81 and the read / write unit 82 are provided independently of each core. FIG. 3 is a block diagram illustrating a memory access process in the read / write unit 82 of the accelerator 600 shown in FIG. 2.

The memory access process will be described below with reference to FIGS. 2 and 3.

The instruction unit 81 decodes the instruction from the software (see symbol A1 in FIGS. 2 and 3).

The instruction unit 81 issues a multicast load instruction to the read / write units 82 of all target chains (see reference numeral A2 in FIGS. 2 and 3). At this time, the instruction unit 81 also notifies these read / write units 82 which chains are the target chains.

The decision circuit 822 in the instruction division circuit 821 of each read / write unit 82 determines that a multicast load instruction has been issued when it receives a load instruction having a plurality of target chains. Then, the instruction division circuit 821 divides the load instruction into 256-byte units (see symbol A3 in FIGS. 2 and 3).

The REQID management circuit 823 of the read / write unit 82 includes a unicast acquisition circuit 824, a multicast acquisition circuit 825, and an acquisition wait buffer 826. The multicast acquisition circuit 825 acquires a REQID and a data buffer area for each of the divided load instructions (see symbol A4 in FIGS. 2 and 3). Here, the multicast acquisition circuit 825 acquires REQID = 0 for the first divided request. When REQID = 0 is in use, the multicast acquisition circuit 825 waits until REQID = 0 is released. After that, the multicast acquisition circuit 825 acquires REQIDs in sequential order for the remaining divided requests.
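The division and numbering described above can be sketched as follows. This is a minimal Python model, not the circuit itself: the function names `split_load` and `assign_multicast_reqids` are hypothetical, the load is assumed to start on a 256-byte boundary, and the case where there are more divided requests than available REQIDs is not modeled.

```python
ACCESS_UNIT = 256  # bytes per divided request, as stated in the text


def split_load(address, size, unit=ACCESS_UNIT):
    """Divide one load into (address, length) pieces of at most `unit` bytes.

    Assumption: `address` is aligned to `unit`; alignment handling is not
    described in the text.
    """
    end = address + size
    return [(a, min(unit, end - a)) for a in range(address, end, unit)]


def assign_multicast_reqids(pieces):
    """Pair each divided request with a REQID: 0 for the first, then 1, 2, ..."""
    return [(reqid, piece) for reqid, piece in enumerate(pieces)]


pieces = split_load(0x1000, 1024)
for reqid, (addr, length) in assign_multicast_reqids(pieces):
    print(f"REQID={reqid} address={addr:#x} length={length}")
```

In this model every read / write unit starts from REQID 0 and counts up in the same way, so all target chains arrive at the same REQID for the same divided request.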

Each read / write unit 82 notifies the acquisition wait buffer 826 in the read / write unit 82 in charge of issuing the memory read request of the acquired REQID (see symbol A5 in FIGS. 2 and 3). As a result, the read / write unit 82 in charge of issuing the memory read request can recognize that the REQIDs acquired by the read / write units 82 are unified.

The memory request generation circuit 827 in the read / write unit 82 in charge of issuing the memory read request issues the memory read request to the reception buffer 611 in the port 61 of the MAC 6 after being notified of the acquired REQIDs by the read / write units 82 of all target chains (see symbol A6 in FIGS. 2 and 3). At this time, the target chains are designated by the bitmap in the dst field.

The read / write units 82 of all target chains receive the memory read completion from the transmission buffer 612 at the port 61 of the MAC 6 as a response to the memory read request, and store the accompanying memory read data in the data buffer 829 (see symbol A7 in FIGS. 2 and 3).

The register control request generation circuit 828 of each read / write unit 82 reads the memory read data from the data buffer 829, and transfers the read memory read data together with the write request to the register file 84 of the core 8 (see symbol A8 in FIGS. 2 and 3).

The processing indicated by the reference signs A4 to A8 is repeatedly executed for all the divided requests.

Each read / write unit 82 issues a completion notice corresponding to the multicast load instruction indicated by reference sign A2 to the instruction unit 81 (see reference sign A9 in FIGS. 2 and 3).

The instruction unit 81 receives completion notifications from the read / write units 82 of all target chains, and recognizes the completion of the instruction (see symbol A10 in FIGS. 2 and 3).

The memory access process in the accelerator 600 in the above-described related example will be described with reference to the flowcharts (steps S1 to S19) in FIGS. 4 and 5.

In FIG. 4, the instruction unit 81 decodes the instruction and issues the instruction to the read / write unit 82 of the target chain (step S1).

The read / write unit 82 of the target chain determines whether there are a plurality of target chains (step S2).

If there are not multiple target chains (see No route in step S2), unicast operation is performed.

On the other hand, when there are a plurality of target chains (see Yes route in step S2), the read / write unit 82 determines whether the REQID to be acquired is free by the multicast acquisition circuit 825 (step S3).

If the REQID to be acquired is not free (see No route in step S3), the process in step S3 is repeated.

On the other hand, if the REQID to be acquired is free (see Yes route in step S3), the read / write unit 82 notifies the read / write unit 82 in charge of the request that the REQID has been acquired (step S4).

The read / write unit 82 determines whether it is the read / write unit 82 in charge of the request (step S5).

If the read / write unit 82 is not in charge of the request (see No route in step S5), the process proceeds to step S9.

On the other hand, when the read / write unit 82 is in charge of the request (see Yes route in step S5), the read / write unit 82 determines whether REQID acquisition notifications have been issued from all target chains (step S6).

If there is a target chain for which the REQID acquisition notification has not been issued (see No route in step S6), the process in step S6 is repeatedly executed.

On the other hand, when REQID acquisition notifications have been issued from all target chains (see Yes route in step S6), the read / write unit 82 determines whether the reception buffer 611 at the port 61 of the MAC 6 has free space (step S7).

If there is no free space in the reception buffer 611 (see No route in step S7), the process in step S7 is repeated.

On the other hand, if the reception buffer 611 has a free space (see Yes route in step S7), the read / write unit 82 issues a multicast read request to the MAC 6 (step S8). Here, the destination of the multicast read is set to all target chains.

The process in the MAC 6 is shown in step S11 and subsequent steps in FIG. 5.

The read / write unit 82 determines whether all the divided read requests have been issued (step S9). If there is a read request that has not been issued (see No route in step S9), the process returns to step S3.

On the other hand, when all the divided read requests are issued (see Yes route in step S9), the read / write unit 82 issues a request corresponding to the subsequent instruction from the instruction unit 81 (step S10). Then, the process returns to step S2.

In FIG. 5, the MAC 6 receives the multicast read request (step S11).

The MAC 6 performs a memory read (step S12).

The MAC 6 issues a completion with read data added to the read / write units 82 specified in the dst field (step S13).

The read / write unit 82 receives the completion from the MAC 6 (step S14).

The read / write unit 82 issues a register write request with read data added to each core 8 (step S15).

The read / write unit 82 releases the REQID and the data buffer 829 (step S16).

The read / write unit 82 determines whether all the register write requests corresponding to the divided read requests have been issued (step S17).

If there is a register write request that has not been issued (see No route in step S17), the process returns to step S14.

On the other hand, when all the register write requests have been issued (see Yes route in step S17), the read / write unit 82 notifies the instruction unit 81 of the completion of the instruction in its own chain (step S18).

The instruction unit 81 determines whether the instruction completion notification has been received from the read / write units 82 of all target chains (step S19).

If there is a target chain that has not received the instruction completion notification (see No route in step S19), the process in step S19 is repeatedly executed.

On the other hand, when the instruction completion notification is received from the read / write units 82 of all the target chains (see Yes route in step S19), the instruction unit 81 recognizes the completion of the instruction and the memory access process ends.

Next, the initialization process of the multicast REQID in the read / write unit 82 of the accelerator 600 in the related example will be described according to the flowchart (steps S21 to S25) shown in FIG. 6.

The read / write unit 82 processes the Nth division of one instruction (step S21).

The read / write unit 82 determines whether N is 1 (step S22).

If N is 1 (see Yes route in step S22), the read / write unit 82 initializes the multicast acquisition circuit 825 (step S23). Then, the process proceeds to step S25.

On the other hand, if N is not 1 (see No route in step S22), the multicast acquisition circuit 825 is updated (step S24).

The read / write unit 82 determines the REQID to be issued in the multicast (step S25). Then, the initialization process of the multicast REQID ends.

Next, the issuing process of the load instruction in the instruction unit 81 of the accelerator 600 in the related example will be described according to the flowchart (steps S31 to S32) shown in FIG. 7.

The instruction unit 81 decodes the load instruction from the software (step S31).

The instruction unit 81 issues the load instruction and the target chain to the target read / write unit 82 (step S32). Then, the processing is taken over by the processing in the read / write unit 82 from step S41 in FIG. 8 onward, and the load instruction issuing process ends.

Next, the REQID acquisition process in the read / write unit 82 of the accelerator 600 in the related example will be described according to the flowchart (steps S41 to S47) shown in FIG. 8.

The read / write unit 82 receives the load instruction and the target chain according to the instruction from the instruction unit 81 shown in step S32 of FIG. 7 (step S41).

The read / write unit 82 determines whether there are a plurality of target chains (step S42).

If there are not multiple target chains (see No route in step S42), unicast operation is performed.

On the other hand, when there are a plurality of target chains (see Yes route in step S42), the read / write unit 82 determines whether the first division of the load instruction is being processed (step S43).

If the first division is being processed (see Yes route in step S43), the read / write unit 82 initializes the multicast acquisition circuit 825 (step S44). Then, the process proceeds to step S46.

On the other hand, when the first division is not being processed (see No route in step S43), the read / write unit 82 updates the multicast acquisition circuit 825 (step S45).

The read / write unit 82 determines whether the REQID to be acquired is free (step S46).

If the REQID to be acquired is not empty (see No route in step S46), the process in step S46 is repeatedly executed.

On the other hand, if the REQID to be acquired is free (see Yes route in step S46), the read / write unit 82 performs the waiting process for the acquired REQID (step S47). Then, the REQID acquisition process ends.
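Steps S43 to S46 can be modeled in a few lines. The sketch below is illustrative only: the class `MulticastAcquirer` and its method names are hypothetical, and the repetition of step S46 is represented by `try_acquire` returning `None` so that the caller retries.

```python
class MulticastAcquirer:
    """Toy model of the multicast acquisition circuit in the related example."""

    def __init__(self):
        self.wanted = 0      # REQID this unit will try to acquire next
        self.in_use = set()  # REQIDs acquired and not yet released

    def select(self, is_first_division):
        """S43 to S45: initialize to 0 on the first division, else advance by one."""
        self.wanted = 0 if is_first_division else self.wanted + 1
        return self.wanted

    def try_acquire(self):
        """S46: return the wanted REQID if it is free, otherwise None."""
        if self.wanted in self.in_use:
            return None
        self.in_use.add(self.wanted)
        return self.wanted

    def release(self, reqid):
        """Called when the completion for `reqid` has been processed."""
        self.in_use.discard(reqid)


unit = MulticastAcquirer()

# First multicast load: four divided requests take REQIDs 0, 1, 2, 3.
first = []
for n in range(4):
    unit.select(is_first_division=(n == 0))
    first.append(unit.try_acquire())

# Second multicast load: its first division wants REQID 0 again.
unit.select(is_first_division=True)
blocked = unit.try_acquire()   # REQID 0 is still in use
unit.release(0)                # completion of the first request arrives
granted = unit.try_acquire()   # now REQID 0 can be acquired
print(first, blocked, granted)
```

Because the first division of every multicast load resets the wanted REQID to 0, a second load in this model cannot proceed while REQID 0 from the previous load is still outstanding.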

FIG. 9 is a table illustrating the usage status of REQIDs in the multicast load process in the related example shown in FIG. 2.

The performance of a multi-core computing device depends on the throughput of memory data transfer. The related example described above assumes that multicast load instructions do not occur frequently. Therefore, when multicast load instructions are decoded consecutively by the instruction unit 81 and issued to the read / write units 82, each read / write unit 82 cannot execute the second multicast load instruction until the request ID = 0 used by the first multicast load instruction is released.

For example, assume that the multicast load instructions “1” to “4” shown in FIG. 9 are issued consecutively. The target chains of the multicast load instruction “1” are # 0 to # 7, and the target chains of the multicast load instruction “2” are # 0 to # 7. The target chains of the multicast load instruction “3” are # 0 to # 3, and the target chains of the multicast load instruction “4” are # 0 to # 7. The memory access size of each multicast load instruction is 1 kilobyte.

When each read / write unit 82 executes the multicast load instruction “1”, the multicast memory read to the MAC 6 is issued four times in total. Request IDs used at this time are 0, 1, 2, 3 in order.

Next, when the read / write unit 82 executes the multicast load instruction “2”, the multicast memory read to the MAC 6 is again issued four times in total. At this time, request IDs 4 to 15 are unused, but they cannot be used. The read / write unit 82 waits until request ID = 0 is released (in other words, until the completion of read 1-01 is received) and then uses request ID = 0. The same applies to request IDs 1 to 3.

For this reason, memory read requests cannot be issued consecutively for the multicast load instructions “1” and “2”. Similarly to the multicast load instructions “1” and “2”, the memory read request cannot be continuously issued for the multicast load instructions “3” and “4”.

Thus, in order to unify the request IDs among the read / write units 82, processing of the subsequent multicast load instruction is suspended until the memory read completion of the preceding multicast load instruction is received. As a result, memory data throughput may decrease.

[B] Example of Embodiment

[B-1] System Configuration Example

FIG. 10 is a block diagram schematically showing a hardware configuration example of the information processing device 1 in the example.

As shown in FIG. 10, the information processing device 1 has a Central Processing Unit (CPU) 10, an Input / Output (I / O) controller 11, an accelerator 12, a hard disk 13, an I / O device 14 and a memory 15.

The I / O controller 11 is connected to the accelerator 12, the hard disk 13, and the I / O device 14. Here, the I / O device 14 refers to an I / O device other than the accelerator 12 and the hard disk 13. The I / O controller 11 receives an instruction from the CPU 10 and controls the accelerator 12, the hard disk 13, and the I / O device 14. Then, the I / O controller 11 relays communication between the accelerator 12, the hard disk 13, the I / O device 14, and the CPU 10.

The CPU 10, which is an arithmetic processing unit, is connected to the I / O controller 11 and the memory 15 by a bus. Then, the CPU 10 can send and receive data to and from the memory 15. Further, the CPU 10 can send and receive data to and from the accelerator 12, the hard disk 13, and the I / O device 14 via the I / O controller 11.

The hard disk 13 stores various programs such as Operating System (OS) and various applications.

The CPU 10 operates the OS and various applications by reading the program from the hard disk 13, expanding it on the memory 15, and executing it. Examples of applications include applications that execute deep learning and the like.

Further, the CPU 10 causes the accelerator 12 to perform a specific process when executing the application. For example, the CPU 10 causes the accelerator 12 to perform arithmetic processing such as deep learning. Specifically, the software executed by the CPU 10 transmits an operation command to the accelerator 12 together with data used for the operation via the I / O controller 11.

FIG. 11 is a block diagram schematically showing a configuration example of the accelerator 12 shown in FIG. 10.

The accelerator 12 includes a MAC 121, a memory 122, a plurality of cores 123, an instruction unit 124, and a read / write unit 125.

The memory 122 may be used as a primary recording memory or a working memory.

The MAC 121 processes access to the memory 122 by each core 123. The MAC 121 includes a plurality of ports 1211 (denoted as “P # 0 to # 7” in FIG. 11).

A plurality of cores 123, one read / write unit 125, and one port 1211 are provided for each of the chains # 0 to # 7 indicated by the broken line frames in FIG. 11. In the illustrated example, seven cores 123 are provided in each chain. For example, chain # 0 has cores # 0-1 to # 0-N, chain # 1 has cores # 1-1 to # 1-N, and chain # 7 has cores # 7-1 to # 7-N.

Each core 123 executes loading and storing on the memory 122 via the read / write unit 125 and the MAC 121.

The instruction unit 124 decodes the load instruction sent from the software and instructs the read / write unit 125 to execute the load instruction.

The read / write unit 125 divides the load instruction into memory access units, puts a request ID (may be referred to as “REQID”) on each divided load instruction, and issues a memory read request to the MAC 121.

As shown in FIG. 11, when the number of mounted cores 123 is large, a configuration in which an access bus to the memory 122 is shared between the cores 123 and connected in a ring shape may be adopted from the viewpoint of circuit mounting and wiring.

Here, when the same memory data is loaded into the core groups of a plurality of chains and each read / write unit 125 issues a memory read request independently, the MAC 121 reads the same memory area as many times as there are requests. In this case, the MAC 121 repeats the same operation, and the processing of subsequent instructions is kept waiting during that time, which is inefficient. Therefore, a load instruction that designates a plurality of chains at once (which may be referred to as a “multicast load”) is supported. The MAC 121 and the read / write unit 125 support a memory read that specifies a plurality of chains (which may be referred to as a “multicast read request”).

FIG. 12 is a diagram showing a format of a request packet in the accelerator 12 shown in FIG. 11.

The multicast read request generated by the read / write unit 125 has the format shown in FIG. 12. The multicast read request contains opc indicating the type of request, dst indicating the transmission destination of the read data, REQID indicating the assigned ID, and Address indicating the read address of the data.

For example, opc, dst and REQID are sent in the first cycle. The Address is sent in the first cycle and the second cycle.

FIG. 13 is a diagram showing a format of a completion packet in the accelerator 12 shown in FIG. 11.

The memory read completion received by the read / write unit 125 has the format shown in FIG. 13. The area transmitted in the first cycle of the memory read completion stores a header containing opc indicating the request type, REQID indicating the ID of the multicast read request being responded to, Status indicating the response status, and a reserved (rsv) area.

In the dst field of the request shown in FIG. 12, the chains to which the completion is issued are specified by a bitmap. The opc field shown in FIGS. 12 and 13 identifies whether the packet is a request or a completion of a memory read or a memory write. The REQID field of the completion shown in FIG. 13 stores the same value as the REQID of the corresponding request shown in FIG. 12. The Address field shown in FIG. 12 stores the address value to be accessed.

The instruction unit 124 notifies each read / write unit 125 of the load instruction together with information about the target chains.

The representative read / write unit 125 specifies the target chain in the dst field with a bitmap when issuing a memory read request to the MAC 121. Upon receiving the memory read request, the MAC 121 acquires data from the memory 122 and issues a completion to all chains specified by the dst field.
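To make the dst bitmap concrete, the following Python sketch encodes a set of target chains into a bitmap and fans a completion out to every chain named in it. The bit assignment (bit n for chain #n), the `opc` and `status` values, and all class and function names are assumptions made for illustration; the text specifies only which fields exist, not their widths or encodings.

```python
from dataclasses import dataclass

OPC_READ_REQUEST = 0x1     # hypothetical opcode values
OPC_READ_COMPLETION = 0x2
STATUS_OK = 0              # hypothetical status value


@dataclass
class ReadRequest:
    opc: int
    dst: int       # bitmap of target chains
    reqid: int
    address: int


@dataclass
class Completion:
    opc: int
    reqid: int     # same value as the REQID of the request
    status: int
    chain: int     # delivery target; used by this model only, not a packet field


def chains_to_dst(chains):
    """Encode chain numbers as a bitmap (assumed: bit n stands for chain #n)."""
    dst = 0
    for chain in chains:
        dst |= 1 << chain
    return dst


def dst_to_chains(dst, num_chains=8):
    """Decode a dst bitmap back into a sorted list of chain numbers."""
    return [n for n in range(num_chains) if (dst >> n) & 1]


def complete(request):
    """Model the MAC issuing one completion per chain named in dst."""
    return [Completion(OPC_READ_COMPLETION, request.reqid, STATUS_OK, chain)
            for chain in dst_to_chains(request.dst)]


req = ReadRequest(OPC_READ_REQUEST, chains_to_dst([0, 1, 2, 3]), reqid=0, address=0x1000)
print(f"dst={req.dst:#010b}", [c.chain for c in complete(req)])
```

A single request therefore yields one memory read and as many completions as there are bits set in dst, each carrying the REQID of the request.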

FIG. 14 is a block diagram illustrating a memory access process in the accelerator 12 shown in FIG. 11.

The instruction unit 124 decodes an instruction from software. The instruction unit 124 issues a multicast load instruction to the read / write units 125 of all target chains. At this time, the instruction unit 124 also notifies these read / write units 125 which chains are the target chains. In addition, the instruction unit 124 propagates a REQID initialization signal, as an interface signal, to each read / write unit 125 together with the load instruction.

The instruction unit 124 includes a synchronization monitoring circuit 1221 and an initialization signal generation circuit 1222.

FIG. 15 is a diagram showing an example of the configuration of the multicast REQID synchronization monitoring circuit 1221 of the accelerator 12 shown in FIG. 11.

When the accelerator 12 includes the chains # 0 to # 7, as shown in FIG. 15, the synchronization monitoring circuit 1221 receives 28 patterns of (X, Y) combinations.

The output signal same_reqid_grp_XY also has 28 patterns and is stored in the group table 1220 shown in FIG. load_valid is a valid signal of a load instruction and an update instruction signal of this circuit.

In the synchronization monitoring circuit 1221, when both chain [X] and chain [Y] are target chains of a multicast load instruction, same_reqid_grp_XY is updated to 1. In the case of a multicast load instruction or a unicast load instruction that targets only one of chain [X] and chain [Y], same_reqid_grp_XY is updated to 0. If neither chain [X] nor chain [Y] is a target chain, the previous value is retained.
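The update rule can be written down compactly. The Python sketch below models the group table as a dictionary keyed by (X, Y) with X < Y. The function name `update_group_table` and the all-zero starting state are assumptions, since the reset value of the table is not stated.

```python
from itertools import combinations

NUM_CHAINS = 8
PAIRS = list(combinations(range(NUM_CHAINS), 2))  # every (X, Y) with X < Y


def update_group_table(table, targets):
    """Return the table after one valid load instruction (load_valid = 1).

    `targets` is the set of chains the load instruction is issued to.
    """
    updated = dict(table)
    for x, y in PAIRS:
        x_hit, y_hit = x in targets, y in targets
        if x_hit and y_hit:
            updated[(x, y)] = 1   # both chains received the same multicast load
        elif x_hit or y_hit:
            updated[(x, y)] = 0   # only one of the two chains received the load
        # neither chain targeted: the previous value is retained
    return updated


table = {pair: 0 for pair in PAIRS}                 # assumed initial state
table = update_group_table(table, set(range(8)))    # multicast load to chains #0 to #7
table = update_group_table(table, {0, 1, 2, 3})     # multicast load to chains #0 to #3
print(len(PAIRS), table[(0, 1)], table[(0, 4)], table[(4, 5)])
```

The pair count confirms the 28 combinations for eight chains, and the second load clears only the pairs that straddle the targeted and untargeted chains.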

When same_reqid_grp_XY = 1, it indicates that the load instructions last received by the read / write unit 125 of the chains [X] and [Y] from the instruction unit 124 are the same multicast load instruction.

In this case, the multicast acquisition circuits 322 (described later) in the read / write units 125 of chains [X] and [Y] have already been initialized by multicast_reqid_rst [7: 0] (described later) and have processed the same load instruction, so they have used the same number of REQIDs in sequence. Consequently, when these chains next process the same multicast load instruction, they can acquire the same REQID without initializing the multicast acquisition circuit 322. That is, same_reqid_grp_XY = 1 indicates that the multicast acquisition circuits 322 are synchronized.

On the other hand, when same_reqid_grp_XY = 0, it indicates that the load instructions last received by the read / write units 125 of chains [X] and [Y] are different multicast load instructions or unicast load instructions. That is, same_reqid_grp_XY = 0 means that the multicast acquisition circuits 322 are not synchronized.

In other words, the synchronization monitoring circuit 1221 is an example of a determination circuit that determines whether the first or second determined chain group of the plurality of chains includes the second chain group that is the target of the second arithmetic processing. The first determined chain group is the first chain group in a relationship in which the second arithmetic processing is executed immediately after the first arithmetic processing. The second determined chain group is obtained by calculating the third chain group with respect to the first chain group by a fixed method, in a relationship in which at least one third arithmetic processing is executed after the first arithmetic processing and the second arithmetic processing is executed after that. The first chain group is the target of the first arithmetic processing among the plurality of chains, the second chain group is the target of the second arithmetic processing among the plurality of chains, and the third chain group is the target of the third arithmetic processing among the plurality of chains.

Here, the fixed method is a process of removing the third chain group from the first chain group.
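The relationship between the chain groups can be illustrated with ordinary set operations. The following is a minimal sketch, not the circuit itself; the function names and the representation of chains as integers in Python sets are assumptions made for illustration only.

```python
def second_determined_group(first_group, third_group):
    """Apply the fixed method: remove the third chain group from the first."""
    return set(first_group) - set(third_group)


def needs_initialization(second_group, determined_groups):
    """Return True when no determined chain group contains every chain
    targeted by the second arithmetic processing."""
    return not any(set(second_group) <= set(g) for g in determined_groups)


# Hypothetical example: the first processing targets chains 0-7 and an
# intervening third processing targets chains 0-3.
remaining = second_determined_group(range(8), range(4))
```

In this hypothetical case `remaining` contains chains 4 to 7, so a second arithmetic processing that targets only those chains would not require initialization, whereas one that targets all eight chains would.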

FIG. 16 is a diagram showing a configuration example of the multicast REQID initialization signal generation circuit 1222 of the accelerator 12 shown in FIG.

The initialization signal generation circuit 1222 compares the same_reqid_grp_XY shown in FIG. 15 with the chain to which the multicast load is issued, and checks whether the read / write units 125 in all the target chains are synchronized with the REQID for multicast. That is, the initialization signal generation circuit 1222 checks whether the output of the synchronization monitoring circuit 1221 is same_reqid_grp_XY = 1.

If same_reqid_grp_XY = 0 for any of the target chains, the initialization signal generation circuit 1222 determines that the REQID for multicast is not synchronized and sets multicast_reqid_rst[Z] = 1 (where Z is a chain targeted by the multicast load). On the other hand, the initialization signal generation circuit 1222 sets multicast_reqid_rst[Z] = 0 if same_reqid_grp_XY = 1 for all target chains.

In other words, the initialization signal generation circuit 1222 is an example of a generation circuit that generates an initialization instruction for initializing the identifier of the second arithmetic processing when neither the first nor the second determined chain group includes the second chain group.
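The generation of the reset bits can be sketched in software as follows. This is only an illustrative model under stated assumptions: the number of chains is taken to be eight (suggested by the width of multicast_reqid_rst[7:0]), the same_reqid_grp_XY signals are held in a hypothetical two-dimensional list, and the function name is not taken from the embodiment.

```python
NUM_CHAINS = 8  # assumption: eight chains, matching multicast_reqid_rst[7:0]


def multicast_reqid_rst(same_reqid_grp, targets):
    """Return the per-chain reset bits for one multicast load.

    same_reqid_grp[x][y] == 1 means that chains x and y last received
    the same multicast load instruction.
    targets is the set of chain numbers addressed by the load.
    """
    targets = sorted(targets)
    # The targets are synchronized only if every pair of them is marked 1.
    synchronized = all(
        same_reqid_grp[x][y] == 1
        for i, x in enumerate(targets)
        for y in targets[i + 1:]
    )
    rst = [0] * NUM_CHAINS
    if not synchronized:
        for z in targets:
            rst[z] = 1  # request initialization in every target chain
    return rst
```

For example, if chains 0 to 3 are mutually synchronized and chains 4 to 7 are mutually synchronized but the two halves are not, a load targeting chains 0 to 3 yields all-zero reset bits, while a load targeting all eight chains sets every bit.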

As shown in FIG. 14, the decision circuit 311 in the instruction division circuit 31 of each read / write unit 125 determines that a multicast load instruction has been issued when it receives a load instruction having a plurality of target chains. Then, the instruction division circuit 31 divides the load instruction into units of 256 bytes.

In other words, the instruction division circuit 31 is an example of a division circuit that divides the instruction related to the second arithmetic processing into a plurality of instructions.

The REQID management circuit 32 of the read / write unit 125 includes a unicast acquisition circuit 321, a multicast acquisition circuit 322, and an acquisition wait buffer 323. The multicast acquisition circuit 322 acquires a REQID and a data buffer area for each of the divided load instructions. Here, the multicast acquisition circuit 322 acquires REQID = 0 for the first divided request. When REQID = 0 is in use, the multicast acquisition circuit 322 waits until REQID = 0 is released. After that, the multicast acquisition circuit 322 acquires the subsequent REQIDs as consecutive numbers.

FIG. 17 is a diagram showing a configuration example of the multicast acquisition circuit 322 of the accelerator 12 shown in FIG.

In the above-mentioned related example, when the multicast load instruction is divided into memory access units, initialization is executed for the first divided request.

In the present embodiment, the multicast acquisition circuit 322 executes initialization when (div_1st_memrd & multicast_reqid_rst) = 1 by using multicast_reqid_rst which is the REQID initialization signal distributed to each read / write unit 125.

In other words, the multicast acquisition circuit 322 is an example of an acquisition circuit that acquires an initialized identifier for the second arithmetic processing when the initialization instruction is issued, and acquires, for the second arithmetic processing, an identifier that is continuous with the identifier of the first arithmetic processing when the initialization instruction is not issued.

The multicast acquisition circuit 322 may acquire an identifier that is continuous with the identifier of the acquisition target immediately before the acquisition target when the acquisition target of the identifier is not the first division of the instruction.
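The initialization condition can be illustrated with a small software model. This is a sketch under assumptions, not the circuit of FIG. 17: the class name, the power-on value of zero, and the unbounded counter (which ignores the finite number of REQIDs and the wait for an in-use REQID) are all introduced here for illustration.

```python
class MulticastAcquisition:
    """Toy model of REQID selection in a multicast acquisition circuit."""

    def __init__(self):
        self.next_reqid = 0  # assumed power-on value

    def acquire(self, div_1st_memrd, multicast_reqid_rst):
        # Initialize only when (div_1st_memrd & multicast_reqid_rst) = 1,
        # i.e. on the first divided request of a load whose reset bit is set.
        if div_1st_memrd and multicast_reqid_rst:
            self.next_reqid = 0
        reqid = self.next_reqid
        self.next_reqid += 1  # later requests receive consecutive REQIDs
        return reqid


acq = MulticastAcquisition()
first_load = [acq.acquire(i == 0, True) for i in range(4)]
second_load = [acq.acquire(i == 0, False) for i in range(4)]
```

In this model the first load, issued with the reset bit set, receives REQIDs 0 to 3, and the second load, issued without it, continues with 4 to 7 instead of waiting for REQID 0 to be released.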

As shown in FIG. 14, the multicast acquisition circuit 322 of each read / write unit 125 notifies the acquired REQID to the acquisition waiting buffer 323 in the read / write unit 125 in charge of issuing a memory read request. As a result, the read / write unit 125 in charge of issuing the memory read request can recognize that the REQIDs acquired by the read / write units 125 are unified.

In other words, the multicast acquisition circuit 322 is an example of an acquisition circuit that notifies another read / write unit 125 among the plurality of read / write units 125 of the acquired identifier.

The memory request generation circuit 33 in the read / write unit 125 in charge of issuing the memory read request issues the memory read request to the reception buffer 21 in the port 1211 of the MAC 121 after being notified of the REQIDs acquired by the read / write units 125 of all target chains. At this time, the target chains are designated by the bitmap in the dst field.
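The gating of the memory read request can be sketched as follows. This is an illustrative model only: the bit order of the dst bitmap (bit n for chain n), the dictionary standing in for the acquisition wait buffer 323, and both function names are assumptions that do not appear in the embodiment.

```python
def dst_bitmap(target_chains):
    """Encode the target chains as a bitmap (assumed: bit n means chain n)."""
    bitmap = 0
    for chain in target_chains:
        bitmap |= 1 << chain
    return bitmap


def ready_to_issue(target_chains, notified_reqids):
    """Return True once every target chain has reported a REQID and all
    reported REQIDs agree.

    notified_reqids maps a chain number to the REQID it reported.
    """
    if not all(chain in notified_reqids for chain in target_chains):
        return False  # still waiting for at least one notification
    return len({notified_reqids[chain] for chain in target_chains}) == 1
```

With this encoding, a load targeting chains 0 to 3 would carry the bitmap 0b00001111, and the request would be held back until all four chains have reported the same REQID.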

The read / write units 125 of all target chains receive the memory read completion from the transmission buffer 22 in the port 1211 of the MAC 121 as a response to the memory read request, and store the accompanying memory read data in the data buffer 35.

The register control request generation circuit 34 of each read / write unit 125 reads the memory read data from the data buffer 35, and transfers the read memory read data together with the write request to the register file of the core 123.

Each read / write unit 125 issues a completion notice corresponding to the multicast load instruction to the instruction unit 124.

The instruction unit 124 receives the completion notification from the read / write units 125 of all target chains and recognizes the completion of the command.

[B-2] Operation Example

The process of issuing the REQID initialization instruction in the instruction unit 124 of the accelerator 12 shown in FIG. 11 will be described according to the flowchart (steps S51 to S55) shown in FIG.

The synchronization monitoring circuit 1221 updates the group table 1220 for REQID synchronization to the latest state (step S51).

The initialization signal generation circuit 1222 determines whether or not all target chains of the load instruction to be issued are included in the group table 1220 (step S52).

If all target chains are included (see Yes route in step S52), the initialization signal generation circuit 1222 de-asserts the REQID initialization instruction signal (step S53). Then, the process proceeds to step S55.

On the other hand, if there is a target chain that is not included (see No route in step S52), the initialization signal generation circuit 1222 asserts a REQID initialization instruction signal (step S54).

The initialization signal generation circuit 1222 notifies the read / write unit 125 of the REQID initialization instruction signal (step S55). The processing is then taken over by the processing of the read / write unit 125 in step S61 of FIG. 19, and the process of issuing the REQID initialization instruction ends.

Next, the initialization process of the multicast REQID in the read / write unit 125 of the accelerator 12 shown in FIG. 11 will be described according to the flowchart (steps S61 to S65) shown in FIG.

The instruction division circuit 31 receives the REQID initialization instruction signal in response to the transmission from the instruction unit 124 in step S55 of FIG. 18 (step S61).

The multicast acquisition circuit 322 determines whether the REQID initialization instruction signal is 1 (step S62).

If the REQID initialization instruction signal is 1 (see Yes route in step S62), the value of the multicast acquisition circuit 322 is initialized (step S63). Then, the process proceeds to step S65.

On the other hand, when the REQID initialization instruction signal is not 1 (see No route in step S62), the multicast acquisition circuit 322 updates or holds the value (step S64).

The decision circuit 311 decides the REQID to be issued in the multicast (step S65). Then, the initialization process of the multicast REQID ends.

Next, the issuing process of the load instruction in the instruction unit 124 of the accelerator 12 shown in FIG. 11 will be described according to the flowchart (steps S71 to S76) shown in FIG.

The instruction unit 124 decodes the load instruction from the software (step S71).

The initialization signal generation circuit 1222 determines whether or not all target chains of the load instruction to be issued are included in the group table 1220 (step S72).

If all target chains are included (see Yes route in step S72), the initialization signal generation circuit 1222 de-asserts the REQID initialization instruction signal (step S73). Then, the process proceeds to step S75.

On the other hand, if there is a target chain that is not included (see No route in step S72), the initialization signal generation circuit 1222 asserts a REQID initialization instruction signal (step S74).

The initialization signal generation circuit 1222 notifies the target read / write unit 125 of the load instruction and the target chain (step S75). Then, the processing is taken over by the processing in the read / write unit 125 from step S81 onward in FIG.

The synchronization monitoring circuit 1221 updates the group table 1220 for REQID synchronization based on the target chain for reference at the next load instruction (step S76).

Next, the REQID acquisition process in the read / write unit 125 of the accelerator 12 shown in FIG. 11 will be described according to the flowchart (steps S81 to S88) shown in FIG.

The instruction division circuit 31 receives the load instruction and the target chain together with the REQID initialization instruction signal in response to the notification from the instruction unit 124 shown in step S75 of FIG. 20 (step S81).

The decision circuit 311 determines whether there are a plurality of target chains (step S82).

If there are not multiple target chains (see No route in step S82), unicast operation is performed.

On the other hand, when there are a plurality of target chains (see Yes route in step S82), the multicast acquisition circuit 322 determines whether the first division of the load instruction is being processed (step S83).

If the first division is being processed (see Yes route in step S83), the multicast acquisition circuit 322 determines whether the REQID initialization instruction signal is 1 (step S84).

If the REQID initialization instruction signal is not 1 (see No route in step S84), the process proceeds to step S86.

On the other hand, when the REQID initialization instruction signal is 1 (see Yes route in step S84), the value of the multicast acquisition circuit 322 is initialized (step S85). Then, the process proceeds to step S87.

If the first division is not being processed in step S83 (see No route in step S83), the value of the multicast acquisition circuit 322 is updated (step S86).

The multicast acquisition circuit 322 determines whether the REQID to be acquired is free (step S87).

If the REQID to be acquired is not free (see No route in step S87), the process in step S87 is repeated.

On the other hand, if the REQID to be acquired is free (see Yes route in step S87), the multicast acquisition circuit 322 performs waiting processing for the acquired REQID (step S88). Then, the REQID acquisition process ends.
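The flow of steps S82 to S88 can be traced with the following sketch. It is an assumption-laden illustration rather than the embodiment: the `state` dictionary, the `is_free` callback used to poll whether a REQID is free, the choice of zero as the initialized value, and the interpretation of "update" in step S86 as an increment by one are all hypothetical.

```python
def acquire_reqid(state, target_chains, first_division, init_signal, is_free):
    """Trace steps S82 to S88 for one divided request.

    state["reqid"] holds the most recently selected multicast REQID.
    Returns the acquired REQID, or None when the unicast path applies.
    """
    if len(target_chains) <= 1:               # S82, No route
        return None                           # handled by the unicast path
    if first_division and init_signal == 1:   # S83 Yes and S84 Yes
        state["reqid"] = 0                    # S85: initialize (assumed 0)
    else:                                     # S83 No, or S84 No
        state["reqid"] += 1                   # S86: update (assumed +1)
    while not is_free(state["reqid"]):        # S87 repeats until free
        pass
    return state["reqid"]                     # S88
```

A caller would keep one `state` per read / write unit and pass a callback that reports whether the given REQID has been released. Note that the first division of a load takes the update path of step S86 whenever the initialization signal is 0.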

[B-3] Effect

FIG. 22 is a table illustrating the usage status of REQIDs in the multicast load processing in the embodiment shown in FIG.

In the above-mentioned embodiment, when the multicast load instruction is processed continuously, the REQID can be serially used without being initialized.

In the example shown in FIG. 22, four multicast load instructions “1” to “4” are issued consecutively. The multicast load instructions “1”, “2”, and “4” target chains #0 to #7, and the multicast load instruction “3” targets chains #0 to #3. The memory access size of each of the multicast load instructions “1” to “4” is 1 kilobyte.

Relative to the multicast load instruction “1”, the load target chains of the multicast load instruction “2” are included in the group table 1220 for REQID synchronization. Likewise, relative to the multicast load instruction “2”, the load target chains of the multicast load instruction “3” are included in the group table 1220 for REQID synchronization.

On the other hand, relative to the multicast load instruction “3”, the load target chains of the multicast load instruction “4” are not included in the group table 1220 for REQID synchronization, so the REQID initialization instruction signal is asserted and the REQID is initialized.

As a result, REQID initialization occurs less often than in the usage status of REQIDs in the multicast load processing of the related example shown in FIG. Consequently, there are fewer occasions to wait for an in-use REQID to be released, and memory read requests to the MAC 121 can be issued promptly, which prevents a reduction in the throughput of the memory data bus.
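The reduction can be checked by hand for the four instructions of FIG. 22 with the following sketch. It rests on assumptions that are not stated in the embodiment: each chain is modeled as remembering only the index of the last multicast load it received, no chain is synchronized before the first instruction (so the first instruction is counted as an initialization), and the related example is modeled as initializing once per multicast load.

```python
def count_initializations(loads):
    """Count REQID initializations for a sequence of multicast loads.

    loads is a list of target-chain sets in issue order.
    Returns (related_example_count, embodiment_count).
    """
    last_load = {}   # chain -> index of the last multicast load it received
    embodiment = 0
    for index, targets in enumerate(loads):
        seen = {last_load.get(chain) for chain in targets}
        # Synchronized only if all targets last received the same load.
        synchronized = len(seen) == 1 and None not in seen
        if not synchronized:
            embodiment += 1   # REQID initialization signal asserted
        for chain in targets:
            last_load[chain] = index
    related = len(loads)      # assumed: one initialization per load
    return related, embodiment


# Instructions "1", "2", "4" target chains #0-#7; "3" targets #0-#3.
fig22 = [set(range(8)), set(range(8)), set(range(4)), set(range(8))]
counts = count_initializations(fig22)
```

Under these assumptions `counts` is `(4, 2)`: the model initializes only for instructions "1" and "4", whereas initializing at every instruction would do so four times.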

According to the accelerator 12 in the example of the above-described embodiment, the following operational effects can be obtained, for example.

The synchronization monitoring circuit 1221 determines whether the first or second determined chain group among the plurality of chains includes the second chain group that is the target of the second arithmetic processing. The first determined chain group is the first chain group in a relationship in which the second arithmetic processing is executed immediately after the first arithmetic processing. The second determined chain group is the chain group obtained by applying a fixed method to the first chain group with respect to the third chain group, in a relationship in which, after the first arithmetic processing, one or more third arithmetic processes are executed and then the second arithmetic processing is executed. The first chain group is the target of the first arithmetic processing among the plurality of chains, the second chain group is the target of the second arithmetic processing among the plurality of chains, and the third chain group is the target of the third arithmetic processing among the plurality of chains. The initialization signal generation circuit 1222 generates an initialization instruction for initializing the identifier of the second arithmetic processing when neither the first nor the second determined chain group includes the second chain group. The multicast acquisition circuit 322 acquires an initialized identifier for the second arithmetic processing when the initialization instruction is issued, and acquires, for the second arithmetic processing, an identifier that is continuous with the identifier of the first arithmetic processing when the initialization instruction is not issued.

With this, in the multi-core architecture, it is possible to improve the throughput of memory data by reducing the number of REQID initializations and reducing the waiting time between chains.

The multicast acquisition circuit 322 notifies the acquired identifier to the other read / write unit 125 among the plurality of read / write units 125.

With this, the read / write unit 125 in charge of issuing the memory read request can recognize that the REQIDs acquired by the respective read / write units 125 are unified.

The instruction division circuit 31 divides the instruction related to the second arithmetic processing into a plurality of instructions. When the identifier acquisition target is not the first division of the instruction, the multicast acquisition circuit 322 acquires an identifier continuous with the acquisition target identifier immediately before the acquisition target.

With this, consecutive REQIDs can be acquired for a series of commands.

[C] Others

The disclosed technique is not limited to the above-described embodiment, and various modifications can be made without departing from the spirit of the present embodiment. Each configuration and each process of this embodiment can be selected or omitted as necessary, or may be combined as appropriate.

1: Information processing device
7, 15, 122: Memory
8, 123: Core
10: CPU
11: I / O controller
12, 600: Accelerator
13: Hard disk
14: I / O device
21, 611: Reception buffer
22, 612: Transmission buffer
31, 821: Instruction division circuit
32, 823: REQID management circuit
33, 827: Memory request generation circuit
34, 828: Register control request generation circuit
35, 829: Data buffer
61, 1211: Port
81, 124: Instruction unit
82, 125: Read / write unit
83: Operation execution unit
84: Register file
311, 822: Decision circuit
321, 824: Unicast acquisition circuit
322, 825: Multicast acquisition circuit
323, 826: Acquisition wait buffer
1220: Group table
1221: Synchronization monitoring circuit
1222: Initialization signal generation circuit

Claims

An arithmetic processing device having a multi-core forming a plurality of chains, the arithmetic processing device comprising:
a determination circuit that, for a first chain group that is a target of first arithmetic processing among the plurality of chains, a second chain group that is a target of second arithmetic processing among the plurality of chains, and a third chain group that is a target of third arithmetic processing among the plurality of chains, sets, as a first determined chain group, the first chain group in a relationship in which the second arithmetic processing is executed immediately after the first arithmetic processing, sets, as a second determined chain group, a chain group that is obtained by applying a certain method to the first chain group with respect to the third chain group and that is in a relationship in which, after the first arithmetic processing, at least one third arithmetic processing is executed and then the second arithmetic processing is executed, and determines whether the first or second determined chain group includes the second chain group that is the target of the second arithmetic processing;
a generation circuit that generates an initialization instruction for initializing an identifier of the second arithmetic processing when neither the first nor the second determined chain group includes the second chain group; and
an acquisition circuit that acquires an initialized identifier for the second arithmetic processing when the initialization instruction is issued, and acquires, for the second arithmetic processing, an identifier that is continuous with the identifier of the first arithmetic processing when the initialization instruction is not issued.
The arithmetic processing device according to claim 1, wherein the certain method is a process of removing the third chain group from the first chain group.
The arithmetic processing device according to claim 1, wherein the acquisition circuit is provided in each of a plurality of read / write units that relay, for each of the plurality of chains, read / write processing of data from the multi-core to a memory, and notifies the other read / write units of the plurality of read / write units of the acquired identifier.
The arithmetic processing device according to any one of claims 1 to 3, further comprising a division circuit that divides an instruction related to the second arithmetic processing into a plurality of instructions, wherein, when the acquisition target of the identifier is not the first division of the instruction, the acquisition circuit acquires an identifier that is continuous with the identifier of the immediately preceding acquisition target.
A method for controlling an arithmetic processing device having a multi-core forming a plurality of chains, the method comprising:
for a first chain group that is a target of first arithmetic processing among the plurality of chains, a second chain group that is a target of second arithmetic processing among the plurality of chains, and a third chain group that is a target of third arithmetic processing among the plurality of chains, setting, as a first determined chain group, the first chain group in a relationship in which the second arithmetic processing is executed immediately after the first arithmetic processing, and setting, as a second determined chain group, a chain group that is obtained by applying a certain method to the first chain group with respect to the third chain group and that is in a relationship in which, after the first arithmetic processing, at least one third arithmetic processing is executed and then the second arithmetic processing is executed;
determining whether the first or second determined chain group includes the second chain group that is the target of the second arithmetic processing;
generating an initialization instruction for initializing an identifier of the second arithmetic processing when neither the first nor the second determined chain group includes the second chain group; and
acquiring an initialized identifier for the second arithmetic processing when the initialization instruction is issued, and acquiring, for the second arithmetic processing, an identifier that is continuous with the identifier of the first arithmetic processing when the initialization instruction is not issued.
The method for controlling the arithmetic processing device according to claim 5, wherein the certain method is a process of removing the third chain group from the first chain group.
The method for controlling the arithmetic processing device according to claim 5, wherein the identifier is acquired by each of a plurality of read / write units that relay, for each of the plurality of chains, read / write processing of data from the multi-core to a memory, and the acquired identifier is notified to the other read / write units of the plurality of read / write units.
The method for controlling the arithmetic processing device according to any one of claims 5 to 7, further comprising dividing an instruction related to the second arithmetic processing into a plurality of instructions, wherein, when the acquisition target of the identifier is not the first division of the instruction, an identifier that is continuous with the identifier of the immediately preceding acquisition target is acquired for the second arithmetic processing.