WO2019180288A1

WO2019180288A1 - Method and device for parallel processing of program instructions and trace instructions

Info

Publication number: WO2019180288A1
Application number: PCT/ES2019/070176
Authority: WO
Inventors: Antonio Da Silva; Óscar RODRÍGUEZ POLO; Agustín MARTÍNEZ HELLÍN; Pablo Parra Espada; Sebastián SÁNCHEZ PRIETO
Original assignee: Universidad Politécnica de Madrid; Universidad De Alcalá
Priority date: 2018-03-20
Filing date: 2019-03-18
Publication date: 2019-09-26
Also published as: ES2697548A1; ES2697548B2

Abstract

The invention relates to a method and device for the synchronisation and parallel execution of trace instructions on a segmented (pipelined) RISC processor. The invention consists of a device whose internal structure, based on a segmented processor, eliminates the execution-time overhead introduced by the code instrumentation used to measure worst-case execution time. To this end, the device uses a specific instruction code for the instrumentation, which is interpreted as enabling the tracing of the preceding instruction and which makes it possible to identify unequivocally the time at which that instruction is executed. The proposed device executes each trace instruction in parallel, in a synchronised fashion, with the preceding instruction to be traced, and makes that execution conditional on the complete execution of the traced instruction, without being affected by bubbles.

Description

DESCRIPTION

A METHOD AND A PARALLEL PROCESSING DEVICE FOR PROGRAM INSTRUCTIONS AND TRACE INSTRUCTIONS

TECHNICAL FIELD

The invention falls, in general, within the Electronics, Information Technology and Telecommunications (ICT) sector, although it has specific application in the critical systems typical of the aerospace, defense and high-reliability sectors.

BACKGROUND OF THE INVENTION

Several inventions have been identified that propose solutions to facilitate instruction tracing but that differ from the present invention.

Document US 5996092 A, "System and method for tracing program execution within a processor before and after a triggering event", makes it possible to start and stop the instruction trace by means of a trace processor that works in parallel with the processor executing the program's own instructions. After detecting the start of the trace through a specific instruction, the trace processor stores in a shared memory information about the entire sequence of executed instructions until it detects the instruction that stops the trace. The present device has neither a shared memory nor a parallel trace processor. Its trace is based on code instrumentation: at the specific points of the code where a trace is required, trace instructions are added that uniquely identify the point to be traced, without resorting to the program counter. Each trace instruction is executed in parallel with the preceding instruction, which is the one to be traced, by replicating the necessary parts of the processor pipeline. The result of this execution is written to an output register, where analysis hardware captures it, so that the exact moment at which the traced instruction was executed is recorded. The present invention therefore takes a different approach, in which the selective instrumentation of instructions located anywhere in the code makes it possible to obtain the worst-case execution time of each of the system's functions. This type of instrumentation is used by commercial tools such as RapiTime in critical avionics systems (G. Bernat et al., "Identifying Opportunities for Worst-case Execution Time Reduction in an Avionics System," Ada User Journal, Volume 28, Number 3, 2007, pp. 189-194). However, when it is applied on commercially available processors, its main disadvantage is the runtime overhead it introduces, which the present invention eliminates.

With respect to US patent application 2017147472 (A1), "Systems and methods for a real time embedded trace", the main difference is that that system traces jump instructions autonomously. In the invention presented here, the instructions to be traced are selected by selective instrumentation techniques, which insert a trace instruction after each instruction to be traced; the invention then executes each trace instruction in parallel, in a synchronized manner, with the instruction it traces. As already described, this instrumentation technique is used in critical systems and makes it possible to trace blocks of code in order to evaluate their worst-case execution time, bearing in mind that the transition between some of these blocks, such as the one between an if/else block and the subsequent block, does not involve a jump in execution, so it would not be detected by the solution presented in US 2017147472 (A1).

As for US Patent 6513134 (B1), "System and method for tracing program execution within a superscalar processor", it improves on US 5996092 A by supporting superscalar processors operating at high frequencies, above 400 MHz. To do this, it encodes the information to be traced so as to reduce the space needed to store it in the trace buffer provided as output. As in the previous patent, instruction blocks are traced to analyze their execution, but it defines a more flexible way of triggering the trace and uses a trace encoding that saves both stored information and pins. This patent therefore does not avoid the system-wide overhead of code instrumentation techniques, as the claimed invention does; instead, it aims to optimize a tracing mechanism based not on instrumentation but on the detection of events that meet predefined conditions. It can therefore be concluded that existing instruction-trace systems allow specific trace trigger events to be programmed in order to collect limited trace information within a certain interval before and/or after each event. These methods suffer from a certain rigidity, in that the number of blocks that can be traced in each execution is always limited, and they do not adapt well to the code instrumentation techniques used to characterize worst-case execution time in critical systems, such as the one used by the aforementioned RapiTime tool. Since applying code instrumentation on current processors introduces runtime overhead, the presented invention, which is aimed at eliminating that overhead, provides a targeted improvement in this area.

DESCRIPTION OF THE INVENTION

In a first aspect of the invention, a device for the parallel processing of program instructions and trace instructions is disclosed. This device comprises:

• an instruction search stage, which in turn includes:

o a module for calculating the instruction address; and,

o an instruction search module with a double reading port;

• a duplicate decoding stage;

• a trace pipeline for processing trace instructions only;

• an output register for the trace;

• a data path that in turn comprises a set of multiplexers;

• a data path controller, which in turn comprises inputs and outputs that control the multiplexers, the loading of the registers associated with the different stages, and the output register for the trace;

where the data path controller is configured to determine, depending on its state and on the value of its inputs, the value of the outputs sent to the multiplexers of the data path, such that a trace instruction is executed in synchronization with the preceding instruction, this execution taking effect during the last stage of the trace pipeline. In one embodiment of the invention, the controller handles the following instruction sequences S1, S2 and S3:

S1: corresponds to Instruction-Trace pairs in which the instructions to be traced are always loaded into the "INSTRUCTION N" element, while the corresponding trace instructions are loaded into the "INSTRUCTION N + 1" element;

S2: corresponds to a sequence of instructions that are not traced, so that “INSTRUCTION N” and “INSTRUCTION N + 1” always load instructions (and not traces);

S3: corresponds to two Instruction-Trace pairs in which the trace instructions are loaded in successive cycles into the "INSTRUCTION N" element, each one cycle after the instruction it traces, while in those same cycles the "INSTRUCTION N + 1" element holds the next instruction to be executed.

The processing device executes sequence S1 over two clock cycles, T and T + 1, such that the instructions stored at addresses X + 1, X + 3 and X + 5 are the trace instructions of the instructions that precede them, located at addresses X, X + 2 and X + 4, respectively.

The processing device executes sequence S2, in which no trace instructions are loaded for two cycles: during the first cycle T it detects that neither of the two instructions loaded in the decoding stage is a trace instruction, so the signals "N_ES_TRAZA" and "N_1_ES_TRAZA" are both "0"; in cycle T + 1 the controller is in the state called "PENDING INSTR", in which the pending instruction located in the second decoding unit is sent to stage 3 of the processor instruction pipeline.

The processing device executes sequence S3: in cycle T the value of the "N_ES_TRAZA" signal is "1", while "N_1_ES_TRAZA" is "0", and the multiplexing signal "SEL_TR_P4" takes the value "2", enabling a route on which the trace instruction is synchronized with the execution of the instruction to be traced; in cycle T + 1, the trace instruction is located in stage 4 of the trace pipeline. Also in cycle T, the multiplexing signal "SEL_PIPE_TRAZA" takes the value 0, so that in cycle T + 1 stage 3 of the trace pipeline holds a zero.

In one embodiment of the invention, when the processing device detects a bubble in stage 3 of the instruction pipeline during a cycle "T", the controller sets the route that loads a "0" into stages 3 and 4 of the trace pipeline and routes the jump address "Z" to the input register of the search stage. The detection of the bubble in stage 3 corresponds to the "BURBUJA_P3" signal, and consequently the "BUBBLE" signal, being set to "1". The controller controls the route to stage 3 by assigning "0" to the "SEL_PIPE_TRAZA" signal, the route to stage 4 by assigning "0" to the "SEL_TR_P4" signal, and the loading of the input register of the search stage by activating the "LD_DIR" signal and routing the "Z" address to that register by assigning "0" to the "SEL_DIR" signal.

In one embodiment of the invention, when the processing device detects a bubble in stage 4 of the instruction pipeline during a cycle "T", the controller sets the route that loads a "0" into stages 3, 4 and 5 of the trace pipeline and routes the jump address "Z" to the input register of the search stage. The detection of the bubble in stage 4 corresponds to the "BURBUJA_P4" signal, and consequently the "BUBBLE" signal, being set to "1". The controller controls the route to stage 3 of the trace pipeline by assigning "0" to the "SEL_PIPE_TRAZA" signal, the route to stage 4 by assigning "0" to the "SEL_TR_P4" signal, the route to stage 5 by assigning "0" to the "SEL_TR_P5" signal, and the loading of the input register of the search stage by activating the "LD_DIR" signal and routing the "Z" address to that register by assigning "0" to the "SEL_DIR" signal.

A second aspect of the invention discloses a RISC (Reduced Instruction Set Computer) processor comprising a device for the parallel processing of program instructions and trace instructions according to any of the above embodiments of the first aspect of the invention.

In a third aspect of the invention, a method for the parallel processing of program instructions and trace instructions is disclosed which, when executed on a device for the parallel processing of program instructions and trace instructions as defined in any of the embodiments of the first aspect of the invention, processes an instruction and a trace instruction in parallel.

BRIEF DESCRIPTION OF THE FIGURES

To complement the description of the invention and to aid a better understanding of its characteristics, a set of drawings is attached as an integral part of this description, in which the following is illustrated by way of example and not of limitation:

Figure 1. Structure of the device for synchronization and parallel execution proposed in the invention.

Figure 2. Mealy machine of the data path controller.

Figure 3. Evolution of the data path for the instruction sequence S1.

Figure 4. Evolution of the data path for the instruction sequence S2.

Figure 5. Evolution of the data path for the instruction sequence S3.

Figure 6. Evolution of the data path when a bubble occurs in stage 3 of the pipeline.

Figure 7. Evolution of the data path when a bubble occurs in stage 4 of the pipeline.

Figure 8. Evolution of the data path after a bubble in the previous cycle.

Figure 9. State transition table of the data path controller.

Figure 10. Table of the data path controller outputs relating to the search stage.

Figure 11. Table of the data path controller outputs relating to the decoding stage.

Figure 12. Table of the data path controller outputs relating to stages 3, 4 and 5 of the trace instruction pipeline.

The elements of the parallel synchronization and execution device proposed in the invention are referenced in Figure 1. These elements are as follows:

100 Stage 1 instruction search

101 Module for selecting the address of the next instruction

102 Instruction search module with double reading port

103 Stage 2 duplicate decoding

104 Device data path controller

105 Inputs to the data path controller

106 Data Path Controller Outputs

107 "INSTRUCTION N" decoding module of stage 2

108 Decoding module "INSTRUCTION N + 1" of stage 2

109 Input selection multiplexer of stage 3 of the RISC processor instruction pipeline

110 Input selection multiplexer of stage 3 of the trace instruction pipeline

112 Pipeline of RISC processor instructions

113 Pipeline of trace instructions

114 Pipeline stage 3 of the RISC processor instructions

115 Pipeline stage 3 of the trace instructions

116 Input selection multiplexer of stage 4 of the trace instructions pipeline

117 Pipeline stage 4 of the RISC processor instructions

118 Pipeline stage 4 of the trace instructions

119 Input selection multiplexer of stage 5 of the trace instructions pipeline

120 Pipeline stage 5 of the RISC processor instructions

121 Pipeline stage 5 of the trace instructions

122 Input to the data path controller that monitors bubble detection

123 Output register for the trace information

124 WAIT: Input to the controller of the data path that monitors the wait in the search for instructions

125 BURBUJA_P3: Input signal to the controller that monitors the detection of a bubble in stage 3 of the RISC processor instruction pipeline

126 BURBUJA_P4: Input signal to the controller that monitors the detection of a bubble in stage 4 of the RISC processor instruction pipeline

127 LD_N: Output of the data path controller that controls the loading of the "INSTRUCTION N" decoding module of stage 2

128 LD_N_1: Output of the data path controller that controls the loading of the "INSTRUCTION N + 1" decoding module of stage 2

129 N_ES_TRAZA: Input signal to the controller that monitors whether the instruction decoded in the "INSTRUCTION N" element is a trace instruction

130 N_1_ES_TRAZA: Input signal to the controller that monitors whether the instruction decoded in the "INSTRUCTION N + 1" element is a trace instruction

131 SEL_PIPE_INSTR: Output of the data path controller that controls the input multiplexer of stage 3 of the RISC processor instruction pipeline

132 SEL_PIPE_TRAZA: Output of the data path controller that controls the input multiplexer of stage 3 of the trace instruction pipeline

133 SEL_TR_P4: Output of the data path controller that controls the input multiplexer of stage 4 of the trace instruction pipeline

134 SEL_TR_P5: Output of the data path controller that controls the input multiplexer of stage 5 of the trace instruction pipeline

135 TR_P5_ES_CERO: Signal that monitors whether stage 5 of the trace instruction pipeline stores a zero value

136 LD_TR_OUT: Signal that controls the storage in the output register of the trace information. It takes the complementary value to the TR_P5_ES_CERO signal, so the register is only loaded when the trace information is nonzero.

137 LD_DIR: Signal that controls the storage, in the input register of the search stage, of the address of the next instruction to be searched.

138 SEL_DIR: Output of the data path controller that controls the input multiplexer of the search-stage input register that stores the address of the next instruction to be searched.

139 Input register of the search stage, which stores the address of the next instruction to be searched.

140 Input multiplexer of the search-stage input register that stores the address of the next instruction to be searched.

DESCRIPTION OF A FORM OF EMBODIMENT OF THE INVENTION

The invention consists of a device with an internal processing structure that eliminates the runtime overhead introduced by the code instrumentation used to measure worst-case execution time through hybrid analysis. This analysis combines static analysis of the code with runtime measurements on the deployment platform. The static analysis determines which instructions need to be traced and, by means of instrumentation techniques, adds trace code after each instruction to be traced, so that the instant at which that instruction is executed can be captured using support hardware and a logic analyzer. The code added after the instruction to be traced uniquely identifies the moment at which that instruction is executed, but it introduces an overhead that this invention eliminates. The device detects the trace instructions and executes them in parallel, in a synchronized way, conditioned on the complete execution of the instruction that precedes each of them. In this way, the tracing process becomes non-intrusive with respect to execution time: because the traces are executed in parallel, their introduction modifies neither the sequence nor the timing of execution of the program under analysis.
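The instrumentation step described above can be sketched as a simple pass over a program. This is a minimal illustration under stated assumptions: the `Instr` record, the `TRACE` mnemonic and the sequential numbering of trace IDs are hypothetical and are not taken from the patent. It only shows that one trace instruction, carrying a unique nonzero identifier, is placed immediately after each instruction selected for tracing.

```python
from dataclasses import dataclass
from typing import List, Set

TRACE_OP = "TRACE"  # hypothetical mnemonic standing in for the dedicated trace opcode


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: a mnemonic plus an optional trace ID."""
    op: str
    trace_id: int = 0


def instrument(program: List[Instr], points: Set[int]) -> List[Instr]:
    """Insert one trace instruction right after each selected instruction.

    `points` holds the indices (in `program`) chosen by the static analysis.
    Trace IDs start at 1 because a zero value reaching the end of the trace
    pipeline is not written to the output register.
    """
    out: List[Instr] = []
    next_id = 1
    for index, instr in enumerate(program):
        out.append(instr)
        if index in points:
            out.append(Instr(TRACE_OP, next_id))
            next_id += 1
    return out


program = [Instr("add"), Instr("sub"), Instr("beq")]
instrumented = instrument(program, {0, 2})
print([(i.op, i.trace_id) for i in instrumented])
# [('add', 0), ('TRACE', 1), ('sub', 0), ('beq', 0), ('TRACE', 2)]
```

On a conventional processor each inserted trace instruction occupies its own issue slot; the device described here is intended to remove exactly that cost by executing each trace alongside the instruction it follows.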

The device proposed in the invention uses a specific instruction code for instrumentation, which its internal structure interprets as enabling the trace of the preceding instruction. The main elements of this device, shown in Figure 1, are: 1) an instruction search stage (100), which has an instruction address calculation module (101) and an instruction search module with a double reading port (102); 2) a duplicate decoding stage (103); 3) a dedicated pipeline for the trace instructions (113); 4) an output register for the trace (123); 5) the data path, consisting of a set of multiplexers (109, 110, 116 and 119); and 6) the device's data path controller (104), which determines, depending on its state and on the value of its inputs (105), the value of the outputs (106) that control the multiplexers (109, 110, 116 and 119), the loading of the registers associated with the different stages (107, 108, 114, 115, 117, 118, 120 and 121), and the output register (123). Both the inputs (105) and the outputs (106) are shown in Figure 1 next to the label assigned to each signal.

Thanks to the double-port search stage (100), the device can load two instructions simultaneously into the decoding stage (103), where they are decoded in parallel. The "WAIT" signal (124) of this stage models possible wait states in the search; it can be activated after a processor reset, or as a result of a jump, which causes bubbles to be injected into the processor instruction pipeline (112). These bubbles are monitored by the signals "BURBUJA_P3" (bubble injection in stage 3, 125), "BURBUJA_P4" (bubble injection in stage 4, 126), and their OR function, labeled "BUBBLE" (122). The "WAIT" signal (124) is deactivated to notify the decoding stage (103) that the instructions are available for loading.

The two instructions in the decoding stage always correspond to instructions stored in consecutive memory words. In Figure 1, the element labeled "INSTRUCTION N" (107) receives the first of the two instructions, while the element labeled "INSTRUCTION N + 1" (108) receives the second. N and N + 1 are not consecutive physical byte addresses; they denote two instructions stored in consecutive memory words, regardless of the processor's word size in bytes, which in the most general case of a 32-bit RISC processor would be 4.
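The relationship between N, N + 1 and byte addresses can be illustrated with a short sketch, assuming a word-indexed list standing in for instruction memory and 4-byte words as in the 32-bit case mentioned above. The function name `fetch_pair` and the alignment check are illustrative assumptions, not part of the described device.

```python
from typing import List, Tuple

WORD_SIZE_BYTES = 4  # 32-bit RISC case mentioned in the text


def fetch_pair(memory_words: List[str], byte_addr: int,
               word_size: int = WORD_SIZE_BYTES) -> Tuple[Tuple[int, str], Tuple[int, str]]:
    """Return ((addr, INSTRUCTION N), (addr, INSTRUCTION N + 1)).

    The two instructions occupy consecutive memory words, so their byte
    addresses differ by `word_size`, not by 1.
    """
    if byte_addr % word_size:
        raise ValueError("assumed: instruction addresses are word-aligned")
    index = byte_addr // word_size
    first = (byte_addr, memory_words[index])
    second = (byte_addr + word_size, memory_words[index + 1])
    return first, second


memory = ["i0", "i1", "i2", "i3"]
print(fetch_pair(memory, 8))  # ((8, 'i2'), (12, 'i3'))
```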

In the decoding stage (103), decoding each instruction determines whether it is a trace instruction or belongs to the rest of the instruction set, producing the signals labeled in Figure 1 as "N_ES_TRAZA" (129) and "N_1_ES_TRAZA" (130).

The data path controller (104) uses the values of these signals and of the "WAIT" (124) and "BUBBLE" (122) signals, together with its own state, to determine the route the instructions will follow toward the next stages. The controller configures the multiplexers (109, 110, 116 and 119) of the data path so that a trace instruction is executed in synchronization with the preceding instruction, this execution taking effect in the last stage (121) of the trace pipeline (113). In that stage it is checked whether the "TR_P5_ES_CERO" signal (135) is deactivated; if so, the "LD_TR_OUT" signal (136) is activated and the trace value is written to the output register (123).
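The behavior of the last trace stage can be sketched as a single clocked update. In this minimal Python model (an illustration, not the hardware description), `tr_p5_value` is the content of trace stage 5 (121) and `out_reg` the output register (123). "LD_TR_OUT" is the complement of "TR_P5_ES_CERO", so the zeros that the controller loads into the trace pipeline for untraced instructions and bubbles never reach the output register.

```python
from typing import Optional, Tuple


def trace_stage5_update(tr_p5_value: int,
                        out_reg: Optional[int]) -> Tuple[bool, Optional[int]]:
    """Model one clock edge of trace stage 5 (121) and the output register (123).

    Returns (LD_TR_OUT, new output register value).
    """
    tr_p5_es_cero = tr_p5_value == 0                 # TR_P5_ES_CERO (135)
    ld_tr_out = not tr_p5_es_cero                    # LD_TR_OUT (136) is its complement
    new_out = tr_p5_value if ld_tr_out else out_reg  # register keeps its value otherwise
    return ld_tr_out, new_out


# A nonzero trace value is captured; a zero leaves the register unchanged.
print(trace_stage5_update(7, None))  # (True, 7)
print(trace_stage5_update(0, 7))     # (False, 7)
```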

Figure 2 represents the Mealy machine of the data path controller of this device (200), which is formally specified in the tables of Figures 9, 10, 11 and 12.

Figures 3, 4 and 5 are, respectively, examples of how the controller, to effect synchronization, sets the route in the following possible instruction sequences S1, S2, and S3:

• Sequence S1 corresponds to Instruction-Trace pairs (301-302, 303-304 and 305-306) in which the instructions to be traced (301, 303 and 305) are always loaded into the "INSTRUCTION N" element (107), while the corresponding trace instructions (302, 304 and 306) are loaded into the "INSTRUCTION N + 1" element (108).

• Sequence S2 corresponds to a sequence of instructions (401, 402, 403 and 404) that are not traced, so that "INSTRUCTION N" (107) and "INSTRUCTION N + 1" (108) always load instructions and not traces.

• Sequence S3 corresponds to two Instruction-Trace pairs (502-503 and 504-505) in which the trace instructions (503 and 505) are loaded in successive cycles (500 and 510) into the "INSTRUCTION N" element (107), each one cycle after the instruction it traces (502 and 504), while in those same cycles the "INSTRUCTION N + 1" element (108) holds the next instruction to be executed (504 and 506).
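The three sequences differ only in which decoded slot carries a trace instruction, so the first decision of the controller can be sketched from the two decode flags alone. This is a simplified illustration, assuming that the instrumentation never places two trace instructions back to back (each trace follows the instruction it traces), which is why the case with both flags set is rejected here. The full decision also depends on the controller state, "WAIT" and "BUBBLE", as given in Figures 9 to 12; the function name is hypothetical.

```python
def classify_decode_pair(n_es_traza: bool, n_1_es_traza: bool) -> str:
    """Map the decode flags N_ES_TRAZA / N_1_ES_TRAZA to the sequence pattern."""
    if not n_es_traza and n_1_es_traza:
        return "S1"  # traced instruction in INSTRUCTION N, its trace in N + 1
    if not n_es_traza and not n_1_es_traza:
        return "S2"  # no trace: the two instructions are issued one per cycle
    if n_es_traza and not n_1_es_traza:
        return "S3"  # trace in N for the instruction already in stage 3
    # Assumed never to occur: two consecutive trace instructions.
    raise ValueError("both decoded instructions are traces; not covered by S1-S3")
```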

Figure 3 shows the operation of the device during two clock cycles, T (300) and T + 1 (307), in which the processor executes instruction sequence S1. In sequence S1, the instructions stored at addresses X + 1 (302), X + 3 (304) and X + 5 (306) are the trace instructions of the preceding instructions, located at addresses X (301), X + 2 (303) and X + 4 (305), respectively. The diagram shows how, in the two cycles T (300) and T + 1 (307), the trace instructions (302, 304, 306) added by the instrumentation are directed to the stages (115 and 118) of the trace instruction pipeline, while the instructions to be traced (301, 303 and 305) are directed to the stages (114 and 117) of the processor instruction pipeline. In this way, each traced instruction and its trace instruction are executed in synchronization, and the execution-time overhead of inserting trace instructions into a program is avoided, since they are executed in parallel.
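The lockstep behavior in sequence S1 can be checked with a small cycle-level sketch. It models only stages 3 to 5 of both pipelines as Python lists, uses the addresses X, X + 2 and X + 4 as instruction labels with hypothetical trace IDs 1 to 3, and ignores stalls and bubbles. Under those assumptions it shows each trace value reaching trace stage 5 (121) in the same cycle as its instruction reaches stage 5 (120).

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], int]


def run_s1(pairs: List[Pair], cycles: int) -> List[Pair]:
    """Simulate S1: each cycle, INSTRUCTION N enters instruction stage 3 and
    INSTRUCTION N + 1 (its trace) enters trace stage 3; both then advance together.

    Returns, per cycle, the (instruction, trace) pair sitting in stage 5.
    """
    instr_pipe: List[Optional[str]] = [None, None, None]  # stages 3, 4, 5 (114, 117, 120)
    trace_pipe: List[int] = [0, 0, 0]                      # stages 3, 4, 5 (115, 118, 121)
    stage5: List[Pair] = []
    for cycle in range(cycles):
        instr, trace = pairs[cycle] if cycle < len(pairs) else (None, 0)
        # Both pipelines shift on the same clock edge, keeping each pair aligned.
        instr_pipe = [instr] + instr_pipe[:2]
        trace_pipe = [trace] + trace_pipe[:2]
        stage5.append((instr_pipe[2], trace_pipe[2]))
    return stage5


s1 = [("X", 1), ("X+2", 2), ("X+4", 3)]
print(run_s1(s1, 5))
# [(None, 0), (None, 0), ('X', 1), ('X+2', 2), ('X+4', 3)]
```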

Figure 4 shows the operation of the device for sequence S2, in which no trace instructions are loaded for two cycles. In this case, during the first cycle T (400) it is detected that neither of the two instructions loaded in the decoding stage (403 and 404) is a trace instruction, so the signals "N_ES_TRAZA" (129) and "N_1_ES_TRAZA" (130) are both 0. As a result, the decoding stage (103) does not load two new instructions at the start of cycle T + 1 (408), since in cycle T (400) the signals "LD_N" (127) and "LD_N_1" (128) that control this load are both 0. In cycle T + 1 (408) the controller is in the state called "PENDING INSTR" (202), in which the pending instruction (404) located in the second decoding unit (108) is directed to stage 3 of the processor instruction pipeline (114). In both cycles, T (400) and T + 1 (408), the controller loads stage 3 of the trace instruction pipeline (115) with 0, so no trace is produced for these instructions.
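The two cycles of Figure 4 can be written as two transitions of a Mealy machine, whose outputs depend on both the current state and the inputs. The sketch below is deliberately partial: "PENDING_INSTR" corresponds to the "PENDING INSTR" state (202), while the name "NORMAL" for the other state, the descriptive `issue_to_P3` key (used instead of the actual "SEL_PIPE_INSTR" encoding) and the return to "NORMAL" with both loads enabled after the pending instruction is issued are assumptions. The complete transition and output tables are those of Figures 9 to 12.

```python
from typing import Dict, Tuple, Union

Outputs = Dict[str, Union[int, str]]


def s2_controller_step(state: str, n_es_traza: bool,
                       n_1_es_traza: bool) -> Tuple[str, Outputs]:
    """Partial Mealy step covering only the S2 case described for Figure 4."""
    if state == "NORMAL" and not n_es_traza and not n_1_es_traza:
        # Cycle T: issue INSTRUCTION N, freeze the decode registers.
        outputs: Outputs = {"issue_to_P3": "N", "LD_N": 0, "LD_N_1": 0, "trace_P3": 0}
        return "PENDING_INSTR", outputs
    if state == "PENDING_INSTR":
        # Cycle T + 1: issue the pending INSTRUCTION N + 1; reloading the
        # decode stage afterwards is an assumption of this sketch.
        outputs = {"issue_to_P3": "N+1", "LD_N": 1, "LD_N_1": 1, "trace_P3": 0}
        return "NORMAL", outputs
    raise NotImplementedError("other transitions are specified in Figures 9-12")


state, out_t = s2_controller_step("NORMAL", False, False)
state, out_t1 = s2_controller_step(state, False, False)
print(out_t["issue_to_P3"], out_t1["issue_to_P3"], state)  # N N+1 NORMAL
```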

Figure 5 shows the operation of the device for sequence S3, covering the case in which, in cycle T (500), the instruction to be traced (502) is in stage 3 of the instruction pipeline (114), while its trace instruction (503) is loaded in the "INSTRUCTION N" element (107) of the decoding stage (103). In that same cycle, the "INSTRUCTION N + 1" element (108) of the decoding stage (103) contains the next instruction to be executed (504). In this sequence, in cycle T (500) the value of the "N_ES_TRAZA" signal (129) is 1, while "N_1_ES_TRAZA" (130) is 0, and the multiplexing signal "SEL_TR_P4" (133) takes the value 2, which enables a route on which the trace instruction (503) is synchronized with the execution of the instruction to be traced (502). Synchronization takes effect in cycle T + 1 (510), when the trace instruction (503) is located in stage 4 of the trace pipeline (118). Also in cycle T (500), the multiplexing signal "SEL_PIPE_TRAZA" (132) takes the value 0, so that in cycle T + 1 (510) stage 3 of the trace pipeline (115) holds a zero (507).

Sequence S3 also enables the route to stage 3 of the instruction pipeline (114) for the next instruction (504), which is stored in the "INSTRUCTION N + 1" element (108) during cycle T (500). To enable this route, the "SEL_PIPE_INSTR" signal (131) takes the value 1 during cycle T (500).

In cycle T + 1 (510), the device repeats the data path configuration of cycle T (500), since the sequence again places a trace instruction (505) in the "INSTRUCTION N" element (107) of the decoding stage (103) and the next instruction to be executed (506) in the "INSTRUCTION N + 1" element (108).
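The effect of the S3 routing can be summarized as a mapping from the contents in cycle T to the contents in cycle T + 1. The sketch below uses the multiplexer values given in the text ("SEL_TR_P4" = 2, "SEL_PIPE_TRAZA" = 0, "SEL_PIPE_INSTR" = 1) but models their effect with plain Python values; the function and key names are hypothetical. It shows why the trace instruction bypasses trace stage 3: the instruction it traces is already in stage 3 and will be in stage 4 in the next cycle.

```python
from typing import Dict, Tuple

S3_SELECTS = {"SEL_TR_P4": 2, "SEL_PIPE_TRAZA": 0, "SEL_PIPE_INSTR": 1}


def s3_step(instr_p3: str, decode_n: str,
            decode_n_1: str) -> Tuple[Dict[str, int], Dict[str, object]]:
    """Return (multiplexer selects in cycle T, stage contents in cycle T + 1).

    instr_p3:   instruction to be traced, already in instruction stage 3 (114)
    decode_n:   its trace instruction, in INSTRUCTION N (107)
    decode_n_1: next instruction to execute, in INSTRUCTION N + 1 (108)
    """
    next_stages: Dict[str, object] = {
        "instr_P4": instr_p3,    # traced instruction advances to stage 4 (117)
        "trace_P4": decode_n,    # SEL_TR_P4 = 2: trace goes straight to trace stage 4 (118)
        "trace_P3": 0,           # SEL_PIPE_TRAZA = 0: zero in trace stage 3 (115)
        "instr_P3": decode_n_1,  # SEL_PIPE_INSTR = 1: next instruction enters stage 3 (114)
    }
    return dict(S3_SELECTS), next_stages


selects, t1 = s3_step("502", "503", "504")
print(t1["instr_P4"], t1["trace_P4"])  # 502 503
# Cycle T + 1 repeats the configuration with 505 in N and 506 in N + 1.
_, t2 = s3_step(str(t1["instr_P3"]), "505", "506")
print(t2["instr_P4"], t2["trace_P4"])  # 504 505
```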

Finally, Figures 6, 7 and 8 describe the operation of the device when bubbles are detected. Bubbles are inserted into the instruction pipeline of a RISC processor whenever the sequential execution of instructions is interrupted, as with conditional and unconditional jump instructions, or with function calls and returns. When an instruction interrupts the sequential order of execution, the processor must discard the execution of the instructions that follow it and start searching for the instruction whose address was determined by the execution of the instruction that broke the sequence. In Figures 6 and 7 this address is labeled "Z" (600). Figure 6 explains the data path for a bubble in stage 3 of the instruction pipeline (114), while Figure 7 corresponds to a bubble detected in stage 4 (117), which also causes a bubble in stage 3 (114). Figure 8 explains the evolution of the data path in the cycles following the detection of a bubble, until the instruction at the jump address (600) is supplied by the search stage (100).

Figure 6 shows how, during cycle T, a bubble is detected only in stage 3 of the instruction pipeline (114), and how the controller sets the route that loads a 0 into stages 3 (115) and 4 (118) of the trace pipeline (113), while the jump address "Z" (600) is routed to the input register of the search stage (139). The detection of the bubble in stage 3 corresponds to the "BURBUJA_P3" signal (125), and consequently the "BUBBLE" signal (122), being set to 1. The route to stage 3 (115) is controlled by assigning 0 to the "SEL_PIPE_TRAZA" signal (132), and the route to stage 4 (118) by assigning 0 to the "SEL_TR_P4" signal (133). The loading of the input register of the search stage (139) is controlled by activating the "LD_DIR" signal (137) and routing the "Z" address (600) to that register (139) by assigning 0 to the "SEL_DIR" signal (138).

Figure 7 shows how, during cycle T, a bubble is detected in stage 4 of the instruction pipeline (117), and how the controller sets the route that loads a 0 into stages 3 (115), 4 (118) and 5 (121) of the trace pipeline (113), while the jump address "Z" (600) is routed to the input register of the search stage (139). The detection of the bubble in stage 4 corresponds to the "BURBUJA_P4" signal (126), and consequently the "BUBBLE" signal (122), being set to 1. The route to stage 3 of the trace pipeline (115) is controlled by assigning 0 to the "SEL_PIPE_TRAZA" signal (132), the route to stage 4 (118) by assigning 0 to the "SEL_TR_P4" signal (133), and the route to stage 5 (121) by assigning 0 to the "SEL_TR_P5" signal (134). The loading of the input register of the search stage (139) is controlled by activating the "LD_DIR" signal (137) and routing the "Z" address (600) to that register (139) by assigning 0 to the "SEL_DIR" signal (138).
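The controller outputs for the two bubble cases of Figures 6 and 7 can be collected in one function. This sketch, assuming Python booleans for the input signals, returns only the outputs that the text fixes for each case; all other outputs, and the case without a bubble, are left to the tables of Figures 9 to 12 and are represented here by `None`. The function name is hypothetical.

```python
from typing import Dict, Optional


def bubble_outputs(burbuja_p3: bool, burbuja_p4: bool) -> Optional[Dict[str, int]]:
    """Outputs forced when a bubble is detected in stage 3 and/or stage 4."""
    bubble = burbuja_p3 or burbuja_p4  # "BUBBLE" (122) is the OR of both inputs
    if not bubble:
        return None  # no bubble: outputs depend on the sequence being executed
    outputs = {
        "SEL_PIPE_TRAZA": 0,  # load 0 into trace stage 3 (115)
        "SEL_TR_P4": 0,       # load 0 into trace stage 4 (118)
        "LD_DIR": 1,          # load the search-stage input register (139)...
        "SEL_DIR": 0,         # ...with the jump address Z (600)
    }
    if burbuja_p4:
        outputs["SEL_TR_P5"] = 0  # a bubble in stage 4 also clears trace stage 5 (121)
    return outputs
```

One plausible reading, consistent with the statement that trace execution is conditioned on the complete execution of the traced instruction, is that clearing these stages prevents a trace value from being recorded for an instruction that is discarded.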

Figure 8 represents the two-cycle wait (801 and 802) after either of the two bubbles described in Figures 6 and 7: in cycle T + 2 (802) the search stage (100) deactivates the "WAIT" signal (124), indicating that the instructions are available for loading into the decoding stage (103) in the next cycle, and the signals "LD_N" (127) and "LD_N_1" (128) are activated.
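The timing of Figure 8 can be expressed as a small calculation. The sketch below assumes that `wait_after_bubble[k]` is the value of "WAIT" in cycle T + 1 + k and, for this post-bubble scenario only, that "LD_N" and "LD_N_1" are asserted in the first cycle in which "WAIT" is deactivated, so that the pair fetched from address Z is in the decoding stage in the following cycle. The function name is hypothetical.

```python
from typing import List, Optional


def first_decode_cycle(wait_after_bubble: List[bool]) -> Optional[int]:
    """Return k such that the pair fetched from Z is in decode in cycle T + k.

    wait_after_bubble[0] is WAIT in cycle T + 1, wait_after_bubble[1] in T + 2, ...
    """
    for offset, wait in enumerate(wait_after_bubble, start=1):
        if not wait:
            # LD_N / LD_N_1 are asserted in cycle T + offset, so the load
            # takes effect at the start of cycle T + offset + 1.
            return offset + 1
    return None  # WAIT still active at the end of the observed window


# Figure 8 reading: WAIT active in T + 1 (801), deactivated in T + 2 (802).
print(first_decode_cycle([True, False]))  # 3, i.e. cycle T + 3
```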

The cases presented in Figures 3, 4, 5, 6, 7 and 8 describe how the device proposed in the invention behaves for the different instruction sequences and when bubbles appear. In all cases, the device effectively synchronizes the execution of the trace instructions with the traced instructions while avoiding runtime overhead, since the trace instructions are always executed in parallel with the instructions they trace.

The embodiment of the invention is based on the structural specification of Figure 1, operating according to the state transition diagram of Figure 2 and the tables defined in Figures 9, 10, 11 and 12. Figures 3, 4, 5 and 6 complete the details that facilitate implementation.

The preferred physical embodiment consists of a hardware/firmware implementation of the described functionality, based on a model description of a standard processor architecture to which the aforementioned modifications, which essentially affect the design of the pipeline, are made. These architecture description models allow the manufacturing details of the device to be generated, and the device may be implemented on a programmable device such as an FPGA (Field Programmable Gate Array) or on an ASIC (Application Specific Integrated Circuit).

Different embodiment options are possible. All of them start from the VHDL model of an "IP Core" of a segmented RISC processor, such as ARM or LEON, whose pipeline structure is modified to include the functionality described in this patent. The objective is to generate a new "IP Core" that can be manufactured on an FPGA or an ASIC.

Claims

1. A parallel processing device for program instructions and trace instructions, characterized in that it comprises:

• an instruction search stage (100), which in turn comprises:

  ◦ a module for calculating the address of the instruction (101); and,
  ◦ an instruction search module with a dual read port (102);

• a duplicate decoding stage (103);

• a trace pipeline (113) for processing trace instructions only;

• an output register (123) for the trace;

• a data path that in turn comprises a set of multiplexers (109, 110, 116, 119);

• a data path controller (104), which in turn comprises inputs (105) and outputs (106) that control the multiplexers (109, 110, 116, 119), the loading of the registers associated with the different stages (107, 108, 114, 115, 117, 118, 120, 121), and the output register (123) for the trace;

wherein the data path controller (104) is configured to determine, depending on the state of said controller (104) and the value of its inputs (105), the value of the outputs (106) sent to the multiplexers (109, 110, 116 and 119) of the data path, such that a trace instruction is executed in synchronization with the preceding instruction, said execution taking effect during the last stage (121) of the trace pipeline (113).

2. A parallel processing device for program instructions and trace instructions according to claim 1, characterized in that the controller comprises the following instruction sequences:

• S1: corresponds to Instruction-Trace pairs (301-302, 303-304 and 305-306) in which the instructions to be traced (301, 303 and 305) are always loaded into the "INSTRUCTION N" element (107), while the corresponding trace instructions (302, 304 and 306) are loaded into the "INSTRUCTION N + 1" element (115);

• S2: corresponds to a sequence of instructions (401, 402, 403 and 404) that are not traced, so that "INSTRUCTION N" (114) and "INSTRUCTION N + 1" (115) always load ordinary (non-trace) instructions;

• S3: corresponds to two Instruction-Trace pairs (502-503 and 504-505) in which the trace instructions (503 and 505) are loaded in successive cycles (500 and 510) into the "INSTRUCTION N" element (107), while the instructions to be traced (502 and 504) are loaded in those same cycles into the "INSTRUCTION N + 1" element (108).
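The three sequences can be told apart from the decode-stage flags. The Python sketch below takes the flag values for S2 and S3 from claims 4 and 5, and infers those for S1 from the position of the trace instruction in S1; two consecutive trace instructions, which none of S1 to S3 covers, are rejected. The names `Sequence` and `classify_pair` are hypothetical.

```python
from enum import Enum


class Sequence(Enum):
    S1 = "instruction to be traced in N, its trace instruction in N + 1"
    S2 = "no trace instruction in the decoded pair"
    S3 = "trace instruction in N, ordinary instruction in N + 1"


def classify_pair(n_es_traza: int, n_1_en_traza: int) -> Sequence:
    """Map N_ES_TRAZA (129) and N_1_EN_TRAZA (130) to the sequence being handled."""
    table = {
        (0, 1): Sequence.S1,  # inferred from the S1 description
        (0, 0): Sequence.S2,  # claim 4: both flags are 0
        (1, 0): Sequence.S3,  # claim 5: N_ES_TRAZA = 1, N_1_EN_TRAZA = 0
    }
    try:
        return table[(n_es_traza, n_1_en_traza)]
    except KeyError:
        raise ValueError("pair not covered by sequences S1 to S3") from None


print(classify_pair(1, 0))  # Sequence.S3
```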

3. A parallel processing device for program instructions and trace instructions according to claim 2, characterized in that the processing device executes sequence S1 over two clock cycles, T (300) and T + 1 (307), such that the instructions stored at addresses X + 1 (302), X + 3 (304) and X + 5 (306) are the trace instructions of the instructions that precede them, located, respectively, at addresses X (301), X + 2 (303) and X + 4 (305).

4. A parallel processing device for program instructions and trace instructions according to claim 2, characterized in that the processing device executes sequence S2, in which no trace instructions are loaded for two cycles, such that during the first cycle T (400) it is detected that the two instructions loaded in the decoding stage (403 and 404) are not trace instructions, and therefore the signals "N_ES_TRAZA" (129) and "N_1_EN_TRAZA" (130) are both "0"; and, in cycle T + 1 (408), the controller is in the state called "PENDING INSTR" (202), in which the pending instruction (404) located in the second decoding unit (108) is directed to stage 3 of the processor instruction pipeline (114).
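The two-cycle handling of sequence S2 can be sketched as a schedule, assuming that stage 3 of the instruction pipeline (114) accepts one instruction per cycle, so that the first instruction (403) proceeds in cycle T while the second (404) waits in the second decoding unit (108). Only the state name "PENDING INSTR" comes from the text; "ISSUE", used here for cycle T, and the function `schedule_s2` are hypothetical.

```python
from typing import List, Tuple


def schedule_s2(instr_n: str, instr_n_1: str, t: int) -> List[Tuple[int, str, str]]:
    """(cycle, controller state, instruction sent to instruction-pipeline stage 3) for an S2 pair."""
    return [
        (t, "ISSUE", instr_n),                # cycle T: first non-trace instruction proceeds
        (t + 1, "PENDING INSTR", instr_n_1),  # cycle T + 1: pending instruction leaves unit (108)
    ]


for cycle, state, instr in schedule_s2("403", "404", t=0):
    print(cycle, state, instr)
# 0 ISSUE 403
# 1 PENDING INSTR 404
```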

5. A parallel processing device for program instructions and trace instructions according to claim 2, characterized in that the processing device executes sequence S3, wherein: in cycle T (500) the value of the signal "N_ES_TRAZA" (129) is "1", while "N_1_EN_TRAZA" (130) is "0", and the value of the multiplexing signal "SEL_TR_P4" (133) is "2", so that a route is enabled in which the trace instruction (503) is synchronized with the execution of the instruction to be traced (502); in cycle T + 1 (510), the trace instruction (503) is located in stage 4 of the trace pipeline (118). In addition, in cycle T (500) the multiplexing signal "SEL_PIPE_TRAZA" (132) takes the value 0, so that in cycle T + 1 (510) a zero (507) is found in stage 3 of the trace pipeline (115).
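The routing in sequence S3 can be modelled as three trace-pipeline registers, each fed by a multiplexer. In this Python sketch only the selections "SEL_PIPE_TRAZA" = 0, "SEL_TR_P4" = 0 or 2 and "SEL_TR_P5" = 0 follow the text; the meaning given to the remaining selection value 1 (take the trace instruction from the decoding stage, or pass the previous stage along) and the class name `TracePipeline` are assumptions.

```python
from dataclasses import dataclass
from typing import Any

ZERO = 0  # value loaded when a trace stage must hold no trace instruction


@dataclass
class TracePipeline:
    stage3: Any = ZERO  # trace pipeline stage 3 (115)
    stage4: Any = ZERO  # trace pipeline stage 4 (118)
    stage5: Any = ZERO  # trace pipeline stage 5 (121)

    def clock(self, decoded_trace: Any, sel_pipe_traza: int, sel_tr_p4: int, sel_tr_p5: int) -> None:
        """Advance one clock edge; every register loads from the values held before the edge."""
        new3 = {0: ZERO, 1: decoded_trace}[sel_pipe_traza]              # 1: assumed input
        new4 = {0: ZERO, 1: self.stage3, 2: decoded_trace}[sel_tr_p4]  # 2: bypass stage 3
        new5 = {0: ZERO, 1: self.stage4}[sel_tr_p5]                    # 1: assumed pass-through
        self.stage3, self.stage4, self.stage5 = new3, new4, new5


# Cycle T (500) of sequence S3: trace instruction 503 goes straight to stage 4.
pipe = TracePipeline()
pipe.clock("503", sel_pipe_traza=0, sel_tr_p4=2, sel_tr_p5=1)
print(pipe.stage3, pipe.stage4)  # 0 503
```

Building each multiplexer as a dictionary makes an undefined selection value raise a `KeyError` instead of silently loading a stale value, and computing all three new values before assigning them mimics registers that load on the same clock edge.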

6. A parallel processing device for program instructions and trace instructions according to claim 1, characterized in that, during a cycle "T", the processing device detects a bubble in stage 3 of the instruction pipeline (114), such that the controller sets the route that loads a "0" into stages 3 (115) and 4 (118) of the trace pipeline (113) and a jump address "Z" (600) is routed to the input register of the search stage (139), the detection of the bubble in stage 3 corresponding to setting the signal "BUBBLE_P3" (125) and the signal "BUBBLE" (122) to "1".

7. A parallel processing device for program instructions and trace instructions according to claim 6, characterized in that the controller controls: the route to stage 3 (115) by assigning "0" to the signal "SEL_PIPE_TRAZA" (132); the route to stage 4 (118) by assigning "0" to the signal "SEL_TR_P4" (133); and the loading of the input register of the search stage (139) by activating the "LD_DIR" signal (137) and routing the "Z" address (600) to said register (139) by assigning "0" to the "SEL_DIR" signal (138).

8. A parallel processing device for program instructions and trace instructions according to claim 1, characterized in that, during a cycle "T", the processing device detects a bubble in stage 4 of the instruction pipeline (117), such that the controller sets the route that loads a "0" into stages 3 (115), 4 (118) and 5 (121) of the trace pipeline (113) and the jump address "Z" (600) is routed to the input register of the search stage (139), the detection of the bubble in stage 4 corresponding to setting the "BUBBLE_P4" signal (126) and the "BUBBLE" signal (122) to "1".

9. A parallel processing device for program instructions and trace instructions according to claim 8, characterized in that the controller controls: the route to stage 3 of the trace pipeline (115) by assigning "0" to the signal "SEL_PIPE_TRAZA" (132); the route to stage 4 (118) by assigning "0" to the signal "SEL_TR_P4" (133); the route to stage 5 (121) by assigning "0" to the signal "SEL_TR_P5" (134); and the loading of the input register of the search stage (139) by activating the "LD_DIR" signal (137) and routing the "Z" address (600) to said register (139) by assigning "0" to the "SEL_DIR" signal (138).

10. A RISC (Reduced Instruction Set Computer) processor, characterized in that it comprises a parallel processing device for program instructions and trace instructions according to any one of the preceding claims.

11. A parallel processing method for program instructions and trace instructions which, executed on a parallel processing device for program instructions and trace instructions as defined in any one of claims 1 to 9, processes a program instruction and a trace instruction in parallel.