RU2816092C1

RU2816092C1 - VLIW processor with improved performance under operand update delay

Info

Publication number: RU2816092C1
Application number: RU2023135294A
Authority: RU
Inventors: Фёдор Анатольевич Груздов; Мурад Искендер-оглы Нейман-заде
Original assignee: Акционерное общество "МЦСТ"
Filing date: 2023-12-26
Publication date: 2024-03-26

Abstract

FIELD: computer engineering.

SUBSTANCE: the invention relates to computer engineering, in particular to microprocessors that execute several instructions in parallel. The processor has a preparatory pipeline, a register file, first and second execution pipelines, an operand readiness control unit, an operand readiness register and a control unit. The preparatory pipeline is capable of extracting first and second personalized instructions from a wide instruction. The first and second execution pipelines are capable of executing, synchronously with each other, the first and second personalized instructions using first and second groups of operands, respectively. The operand readiness control unit is capable of monitoring the update of the first group of operands in the register file. The operand readiness register is capable of storing a compromising flag when the first group of operands has not been updated before the first personalized instruction is executed. When the compromising flag is present, the control unit is capable of recognizing the result of executing the first personalized instruction as unreliable.

EFFECT: higher processor performance with a simultaneous reduction in power consumption and heat generation.

9 cl, 9 dwg

Description

Technical field

[1] The invention relates to microprocessor technology, in particular to microprocessors that execute several instructions in parallel, and more specifically to microprocessors built on the Very Long Instruction Word architecture (hereinafter, VLIW processors).

Background of the invention

[2] A VLIW processor implements the well-known pipelined instruction processing (hereinafter also the pipeline process), in which each instruction passes sequentially through several processing stages, such as fetch, decode, execute, memory access and result write-back. In what follows, the set of hardware that forms part of the processor and is directly involved in the pipeline process is called the "pipeline". The memory referred to here is the processor's cache, which is hereinafter called simply "memory" unless stated otherwise.

[3] A classic pipeline processes several instructions simultaneously: at any given moment each instruction is at one of the processing stages, and each stage can process only one instruction at a time. A VLIW processor, however, has several functional units that allow it, from the execute stage onward, to process several instructions at once at each stage. In other words, the pipeline of a VLIW processor comprises one preparatory pipeline, which performs the fetch and decode stages, and several execution pipelines, which perform the execute, memory access and write-back stages.

[4] However, a single preparatory pipeline cannot deliver several instructions to the execute stage at once, because each of its stages can process only one instruction. In a VLIW processor this problem is solved at compile time by combining several instructions (hereinafter, personalized instructions) into one instruction (hereinafter, a wide instruction). Accordingly, a wide instruction is processed as a single instruction at the fetch and decode stages, and at the execute stage it is split into several personalized instructions, each of which is sent for execution to the functional unit assigned to it. One of the English names for this wide instruction is essentially what the abbreviation VLIW spells out.
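The split described in paragraph [4] can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the class names, the `decode` helper and the unit labels "ALC1" and "ALC2" are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonalizedInstruction:
    unit: str        # functional unit chosen by the compiler (hypothetical label)
    opcode: str
    operands: tuple


@dataclass(frozen=True)
class WideInstruction:
    slots: tuple     # personalized instructions packed together at compile time


def decode(wide):
    """Split one wide instruction into per-unit personalized instructions."""
    dispatch = {}
    for instr in wide.slots:
        if instr.unit in dispatch:
            raise ValueError(f"unit {instr.unit} assigned twice")
        dispatch[instr.unit] = instr
    return dispatch


# One wide instruction is fetched and decoded as a single item ...
wide = WideInstruction(slots=(
    PersonalizedInstruction("ALC1", "add", ("r1", "r2")),
    PersonalizedInstruction("ALC2", "mul", ("r3", "r4")),
))
# ... and only then fans out to the functional units.
dispatch = decode(wide)
```

The point of the sketch is that the fetch and decode side only ever sees one item, while the dictionary returned by `decode` feeds several units at once.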

[5] The advantage of a VLIW processor is most apparent when a program branches, that is, when one actual branch must be selected from several alternative instruction branches and control must be transferred to it. In this case, by placing one personalized instruction from each alternative branch into a single wide instruction and processing a sequence of such wide instructions, the VLIW processor processes all the alternative branches before the actual branch is selected. By the time the control transfer instruction is executed, the VLIW processor already has the actual instruction branch processed, regardless of which alternative branch is selected as the actual one, which significantly increases the speed of the VLIW processor.

[6] The configuration and advantages of the VLIW processor described above are known to a person skilled in the art and are disclosed, for example, in patent publication US2012151192A1, 14 June 2012. A characteristic feature of a VLIW processor is that all the personalized instructions making up a wide instruction are executed simultaneously; if any personalized instruction cannot be executed, for example because of a delay in reading operands from memory, execution of the entire wide instruction is suspended.
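The stall behaviour of the known processor can be expressed as a one-line model: the whole wide instruction waits as long as its slowest slot. The sketch below is an assumption-laden illustration; the function name, the unit labels and the delay values are hypothetical.

```python
def stall_cycles(operand_delays):
    """Cycles the entire wide instruction waits in the known VLIW design.

    operand_delays maps a functional unit label to the number of cycles
    until that unit's operands are available.
    """
    return max(operand_delays.values(), default=0)


# ALC1 waits on a slow memory read; ALC2 already has its operands.
delays = {"ALC1": 100, "ALC2": 0}
wait = stall_cycles(delays)
```

Even though the second slot is ready immediately, it is held back for the full delay of the first slot.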

[7] As a result, the execution of unconditionally necessary instructions often becomes dependent on the execution of speculative instructions for an unlikely alternative branch, which in the known VLIW processor significantly increases the average instruction processing time. This drawback of the known VLIW processor is illustrated in more detail below with reference to the figures; here we note that the longer processor run time also causes excess power consumption and increased heat generation, which requires additional cooling measures.

[8] The technical problem addressed by the invention is to find a solution that increases the performance of a VLIW processor, reduces its power consumption and reduces its heat generation.

Summary of the invention

[9] To solve this technical problem, the invention proposes a processor (hereinafter also a VLIW processor) containing a preparatory pipeline, a register file, first and second execution pipelines, an operand readiness control unit, an operand readiness register and a control unit. The preparatory pipeline is capable of extracting first and second personalized instructions from a wide instruction. The first and second execution pipelines are capable of executing, synchronously with each other, the first and second personalized instructions using first and second groups of operands, respectively. The operand readiness control unit is capable of monitoring the update of the first group of operands in the register file. The operand readiness register is capable of storing a compromising flag when the first group of operands has not been updated before the first personalized instruction is executed. When the compromising flag is present, the control unit is capable of recognizing the result of executing the first personalized instruction as unreliable.
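As a rough software analogy of paragraph [9], the sketch below executes both slots without waiting and attaches a flag to any result computed from operands that were not yet updated. It is not the patented circuit: `Result`, `execute_slot`, the register names and the `ready` map are hypothetical stand-ins for the execution pipelines, the operand readiness control unit and the operand readiness register.

```python
from dataclasses import dataclass


@dataclass
class Result:
    value: int
    compromised: bool   # plays the role of the compromising flag


def execute_slot(op, operand_names, register_file, ready):
    """Execute one personalized instruction without stalling.

    ready[name] is False while the update of that register is still pending,
    in which case the stale register contents are used and the result is flagged.
    """
    stale = any(not ready[name] for name in operand_names)
    value = op(*(register_file[name] for name in operand_names))
    return Result(value=value, compromised=stale)


register_file = {"r1": 0, "r2": 5, "r3": 2, "r4": 3}
ready = {"r1": False, "r2": True, "r3": True, "r4": True}  # r1 not yet updated

first = execute_slot(lambda a, b: a + b, ("r1", "r2"), register_file, ready)
second = execute_slot(lambda a, b: a * b, ("r3", "r4"), register_file, ready)
```

Both slots finish in the same step; only the consumer of `first` has to check the flag before trusting the value.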

[10] The technical result of the invention is a reduction in the time the VLIW processor spends executing a program, which increases the performance of the VLIW processor and reduces power consumption and heat generation, i.e. it solves the technical problem posed. Note that in this description "program execution" means execution of those personalized instructions of the program that lead from the initial personalized instruction to the final one. Because a program may contain several such paths, "program execution" does not imply that every personalized instruction in the program is executed.

[11] The causal link between the features of the invention and the technical result is that execution of the first personalized instruction, and with it of the entire wide instruction, is not suspended but continues even if the first group of operands has not been updated, while the result of the first personalized instruction is treated as unreliable. Because the result of the first personalized instruction may never be needed, executing the wide instruction immediately saves the time that would be spent waiting for the first group of operands to be updated, which can amount to 100 clock cycles or more.

[12] If execution of the first personalized instruction does become necessary, it is performed by running compensating code, which provides for reloading the first personalized instruction into the preparatory pipeline. In this case the first group of operands is updated considerably faster, because the repeated memory access that loads the operands is satisfied in less time, which ultimately gives the re-execution of the first personalized instruction a time advantage over the first attempt.
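One way to see why the replay in paragraph [12] is cheaper is a toy cache model in which the first access to an address is slow and a repeated access is fast. This is a simplified sketch, not the processor's memory system: the `Cache` class is hypothetical, the miss latency borrows the 100-cycle figure from paragraph [11], and the hit latency of 3 cycles is an assumed value chosen only for illustration.

```python
MISS_LATENCY = 100  # cycles; paragraph [11] mentions waits of 100 cycles or more
HIT_LATENCY = 3     # cycles; assumed value for illustration only


class Cache:
    def __init__(self):
        self.lines = set()

    def load(self, address):
        """Return the latency of this access and fill the line on a miss."""
        if address in self.lines:
            return HIT_LATENCY
        self.lines.add(address)
        return MISS_LATENCY


cache = Cache()
first_attempt = cache.load(0x40)  # original run: data not yet in the cache
replay = cache.load(0x40)         # compensating code: the line is now present
```

In this model the original run does not wait out `first_attempt` at all, because it proceeds with a flagged result; only the replay pays a latency, and that latency is the short one.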

[13] In a first particular case of the invention, the control unit is capable of triggering re-execution of the first personalized instruction by invoking the compensating code. As shown above, if the need to execute the first personalized instruction is confirmed, this embodiment yields a reliable result for the first personalized instruction.

[14] In a second particular case of the invention, the control unit is capable of not triggering re-execution of the first personalized instruction when the control unit has determined that the first personalized instruction is a speculative instruction for a branch that was not taken. This embodiment keeps the time saved by not waiting for the first group of operands to be updated, and also saves the computing resources that would otherwise be spent on triggering re-execution of the first personalized instruction.

[15] In a third particular case of the invention, the control unit uses the compromising flag for a control transfer or a logical operation. This embodiment makes it possible to skip the instructions that use the result of the first personalized instruction, and therefore would also produce unreliable results, and to move to another alternative branch (hereinafter simply a branch) where the results are not affected by the reliability problem.

[16] In a fourth particular case of the invention, the operand readiness register is capable of storing the compromising flag for the entire chain of instructions that follow the first personalized instruction and depend on it, and the control unit is capable of treating the results of all instructions in that chain as unreliable. This embodiment allows not only the current wide instruction but also subsequent wide instructions that contain personalized instructions dependent on the first personalized instruction to be executed without delay. Re-executing the chain takes less time than would have been spent waiting for the first group of operands to be updated.
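The chain behaviour in paragraph [16] resembles taint propagation over data dependencies. The sketch below is a hypothetical model, not the hardware mechanism: instructions are reduced to (destination, sources) pairs, and `propagate_flags` and the register names are invented for illustration.

```python
def propagate_flags(instructions, initially_compromised):
    """Return the registers whose contents are unreliable after the sequence.

    instructions: list of (dest, sources) tuples in program order.
    initially_compromised: registers that were not updated in time.
    """
    tainted = set(initially_compromised)
    for dest, sources in instructions:
        if any(src in tainted for src in sources):
            tainted.add(dest)       # result depends on an unreliable value
        else:
            tainted.discard(dest)   # overwritten with a reliable value
    return tainted


chain = [
    ("r5", ("r1", "r2")),  # uses r1, whose update is late
    ("r6", ("r5", "r3")),  # depends on r5, so it inherits the flag
    ("r7", ("r3", "r4")),  # independent, stays reliable
]
tainted = propagate_flags(chain, {"r1"})
```

Only the registers reachable from the late operand end up flagged, so independent work in the same wide instructions remains trustworthy.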

[17] In a fifth particular case of the invention, the first execution pipeline contains a result register that stores the result of the first personalized instruction, and the operand readiness register is included in the result register as one of its bits. In a further development of this case, an additional bit for storing the compromising flag is also provided in the registers of the first execution pipeline that hold incoming operands and in the operand registers of the register file and of the memory. This embodiment frees the execution pipeline for logical operations, described below, from the task of storing the compromising flag so that it can execute other instructions, which ultimately helps increase processor performance.
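The idea of carrying the flag as an extra bit of the register itself can be sketched with plain integer packing. The 64-bit data width, the helper names and the rule that a sum is flagged if either input is flagged are assumptions made for this illustration; the patent text does not specify them.

```python
WIDTH = 64                   # assumed data width of a register
FLAG_BIT = 1 << WIDTH        # one extra bit above the data bits
MASK = FLAG_BIT - 1


def pack(value, compromised):
    """Store a value together with its compromising flag in one word."""
    return (value & MASK) | (FLAG_BIT if compromised else 0)


def unpack(word):
    """Return (value, compromised) from a tagged word."""
    return word & MASK, bool(word & FLAG_BIT)


def add(a, b):
    """Add two tagged words; the result is flagged if either input is."""
    value_a, flag_a = unpack(a)
    value_b, flag_b = unpack(b)
    return pack(value_a + value_b, flag_a or flag_b)


clean = add(pack(2, False), pack(3, False))
flagged = add(pack(2, False), pack(3, True))
```

Because the flag travels with the data, no separate register has to be consulted to learn whether a value can be trusted.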

[18] In a sixth particular case of the invention, the processor contains an execution pipeline for logical operations, and the operand readiness register is one of the registers of that pipeline. This embodiment allows existing hardware to be used for storing the compromising flag, which simplifies the design of the proposed processor.

[19] In a seventh particular case of the invention, the operand readiness control unit is capable of monitoring the update of the second group of operands in the register file. The operand readiness register is then a first operand readiness register, and the compromising flag is a first compromising flag. The processor also contains a second operand readiness register, which is capable of storing a second compromising flag when the second group of operands has not been updated before the second personalized instruction is executed. When the second compromising flag is present, the control unit is capable of recognizing the result of executing the second personalized instruction as unreliable.

[20] This embodiment allows two personalized instructions, or even the entire wide instruction, to be executed with an unreliable result so that execution of the next wide instruction is not delayed. If the need to execute any of these personalized instructions is later confirmed, that instruction can be executed through the compensating code, with less time required to update its operands.

Brief description of the drawings

[21] The implementation of the invention is explained with reference to the figures:

Fig. 1 is a block diagram of a known VLIW processor;

Fig. 2 is a fragment of program code showing a sequence of personalized instructions for one execution pipeline, used to illustrate the technical problem;

Fig. 3 is a diagram of the pipeline process for the instructions in the code fragment of Fig. 2, for the known VLIW processor with no operand update delay;

Fig. 4 is a diagram of the pipeline process for the instructions in the code fragment of Fig. 2, for the known VLIW processor when an operand update delay occurs;

Fig. 5 is a fragment of program code showing a sequence of wide instructions for two execution pipelines, used to illustrate the technical problem;

Fig. 6 is a block diagram of a VLIW processor according to the first preferred embodiment of the invention;

Fig. 7 is a block diagram of a VLIW processor according to the second preferred embodiment of the invention;

Fig. 8 is a diagram of the pipeline process for the instructions in the code fragment of Fig. 2, for the VLIW processor of Fig. 6 when an operand update delay occurs;

Fig. 9 is a diagram of the pipeline process for the instructions in the code fragment of Fig. 2, for the VLIW processor of Fig. 7 when an operand update delay occurs.

[22] Note that the shapes and sizes of the individual elements shown in the figures are schematic and are chosen to illustrate as clearly as possible the mutual arrangement of the elements of the VLIW processor and their causal link to the technical result. In addition, to avoid overcomplicating the figures, some relationships between elements that are obvious to a person skilled in the art may be omitted. The figures also carry letter and word labels in English that are conventional in the art and help a person skilled in the art read the figures more quickly.

Implementation of the invention

[23] The implementation of the invention is shown using its best embodiments, which do not limit the scope of protection.

[24] Fig. 1 shows a block diagram of a known VLIW processor 300, while Figs. 6 and 7 show block diagrams of a first VLIW processor 100 and a second VLIW processor 200, built according to the first and second preferred embodiments of the invention, respectively. The block diagrams in Figs. 1, 6 and 7 largely repeat one another, so unless stated otherwise, the following description of the known VLIW processor 300 also applies to the first and second VLIW processors 100 and 200. Identical elements in Figs. 1, 6 and 7 are designated by the same reference numerals.

[25] The known VLIW processor 300 (Fig. 1) contains a first arithmetic-logic functional unit 1 (ALC, arithmetic-logical channel), a second arithmetic-logic functional unit 2, a predicate-logic functional unit 3 (PLC, predicate logical channel), an instruction memory 4 (IM, instruction memory), a register file 5 (RF, register file), a predicate file 6 (PF, predicate file), an operand readiness control unit 7 (SB, scoreboarding) and a data memory 8 (DM, data memory), which is capable of exchanging data with random access memory 9 (RAM, random access memory). The first arithmetic-logic functional unit 1, the second arithmetic-logic functional unit 2 and the predicate-logic functional unit 3 are hereinafter referred to simply as functional unit 1, 2 or 3.

[26] These elements of the VLIW processor 300 are the main pipeline components involved in the respective stages of the pipeline process, which are shown on the left side of FIG. 1: fetch (F), decode (D), execute (E), memory access (M) and write-back of the result (WB). Note that the presence of a single instruction memory 4 in the VLIW processor 300 is equivalent to the presence of a single preparatory pipeline, which at any given time can process only one instruction at each of stages F and D, that instruction being a wide instruction. Meanwhile, the presence of three functional blocks 1, 2 and 3 in the VLIW processor 300 is equivalent to the presence of three execution pipelines, which at any given time can process three instructions at once at each of stages E, M and WB, those instructions being the personalized instructions separated from the wide instruction at stage D.

[27] As for the VLIW processors 100 and 200, the described configuration of the VLIW processor 300 reflects only a particular case of their implementation, since each of the VLIW processors 100 and 200 may contain any number of arithmetic-logic and predicate-logic functional blocks. For example, each of the VLIW processors 100 and 200 may contain six arithmetic-logic functional blocks and three predicate-logic functional blocks, and can therefore process a wide instruction comprising nine personalized instructions.

[28] Further, the VLIW processor 300 comprises a control unit (not shown) that generates control signals and transmits them to the elements listed above. The VLIW processor 300 also includes many elements that perform trivial functions in the pipeline process and are obvious to a person skilled in the art, such as registers, program counters, data buses and the like. Some of these elements are shown in the figures and will be described in the course of the description.

[29] The instruction memory 4 is a section of cache memory that stores an array of wide instructions to be executed in the near future. In response to a signal from the program counter, the wide instruction to be executed next is fetched from this array. It should be noted that each wide instruction is compiled from a first, a second and a predicate personalized instruction outside the VLIW processor 300.

[30] Fetching an instruction at stage F means that the instruction memory 4 outputs an N-bit signal that specifies the register addresses as11, as12 for the source operands of the first personalized instruction, the register addresses as21, as22 for the source operands of the second personalized instruction, the register addresses aps11, aps12 for the source operands of the predicate personalized instruction, the register addresses ad1, ad2 and apd for the result operands of the first, second and predicate personalized instructions, respectively, and the operation codes opc1, opc2 and popc of the operations performed by the first, second and predicate personalized instructions.
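
The field layout described above can be pictured with a minimal Python sketch. The signal names (as11, ad1, opc1 and so on) come from the description; the classes, the `split` method and the example opcodes and register numbers are illustrative assumptions and say nothing about the actual bit encoding of the N-bit signal.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PersonalizedInstruction:
    opcode: str                 # opc1, opc2 or popc
    src_addrs: Tuple[int, int]  # addresses of the two source registers
    dst_addr: int               # address of the result register


@dataclass(frozen=True)
class WideInstruction:
    # Field names follow the signals named in the description; grouping
    # them into one object is an assumption made for illustration.
    opc1: str
    as11: int
    as12: int
    ad1: int
    opc2: str
    as21: int
    as22: int
    ad2: int
    popc: str
    aps11: int
    aps12: int
    apd: int

    def split(self):
        """Separate the wide instruction into its three personalized instructions."""
        first = PersonalizedInstruction(self.opc1, (self.as11, self.as12), self.ad1)
        second = PersonalizedInstruction(self.opc2, (self.as21, self.as22), self.ad2)
        predicate = PersonalizedInstruction(self.popc, (self.aps11, self.aps12), self.apd)
        return first, second, predicate


# Hypothetical example values.
wide = WideInstruction("add", 1, 2, 3, "or", 4, 5, 6, "and", 0, 1, 2)
first, second, predicate = wide.split()
print(first)
```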

[31] The register addresses as11, as12, as21, as22, ad1, ad2 are supplied from the instruction memory 4 via bus 41 to the register file 5, which is a set of registers capable of storing numeric data of integer type, floating-point type, etc. Meanwhile, the register addresses aps11, aps12, apd are supplied from the instruction memory 4 via bus 42 to the predicate file 6, which is a set of registers capable of storing Boolean values (1 or 0). In turn, the operation codes opc1, opc2 and popc are supplied from the instruction memory 4 via buses 43, 44, 45 to the first, second and predicate control devices 11, 21 and 31 (CD, CDP: control device, control device predicate), which are part of functional blocks 1, 2 and 3.

[32] The register file 5 sends the source operands src11 and src12, read from the registers at addresses as11, as12, to functional block 1 via buses 51, and sends the source operands src21 and src22, read from the registers at addresses as21, as22, to functional block 2 via buses 52. At the same time, the predicate file 6 sends the source operands srcp1 and srcp2, read from the registers at addresses aps11, aps12, to functional block 3 via buses 61.

[33] It should be noted that, in the context of the present description, the source operands src11 and src12, which are the source operands of the first personalized instruction, constitute the first group of operands. Similarly, the source operands src21 and src22, which are the operands of the second personalized instruction, constitute the second group of operands, and the source operands srcp1 and srcp2 constitute the predicate group of operands. The arrival of the first, second and predicate groups of operands, together with the operation codes opc1, opc2 and popc, at functional blocks 1, 2 and 3, respectively, completes stage D and, with it, the work of the preparatory pipeline in separating the first, second and predicate personalized instructions from the wide instruction.

[34] In addition to the control device 11 mentioned above, functional block 1 includes input registers 13, an arithmetic-logic unit 14 (ALU) and a result register 15. The source operands src11 and src12 pass through the input registers 13 to the ALU 14, which performs on them the operation corresponding to the code opc1, after which the result ALUres1 of executing the first personalized instruction is sent through the result register 15 via bus 16 to the data memory 8 or to the register file 5.

[35] Similarly, the source operands src21 and src22 pass through the input registers 23 to the ALU 24, which performs the operation corresponding to the code opc2. The result ALUres2 of executing the second personalized instruction is then sent through the result register 25 via bus 26 to the data memory 8 or to the register file 5. Finally, the source operands srcp1 and srcp2 pass through the input registers 33 to the predicate logical unit 34 (PLU), which performs the operation corresponding to the code popc. The result PLUres of executing the predicate personalized instruction is sent through the result register 35 via bus 36 to the data memory 8 or to the predicate file 6. This completes stage E of the pipeline process, in which the first, second and predicate personalized instructions are executed simultaneously by the first, second and third execution pipelines.

[36] As for the control devices 11, 21, 31, although FIGS. 1, 6 and 7 show them in close proximity to the ALUs 14, 24 and the PLU 34, this merely reflects their functional connection. In the physical design of the VLIW processors 100, 200 and 300, the control devices 11, 21, 31 are part of the control unit mentioned above and are not physically separated into individual modules. In the present description, all functions described for the control devices 11, 21, 31 are functions of the control unit.

[37] Further, the pipeline processing of some personalized instructions, such as ld (load, i.e. reading data from memory) or st (store, i.e. saving data to memory), involves access to the data memory 8, performed using buses 17, 27 and 37, which connect the data memory 8 to buses 16, 26 and 36, respectively. The data memory 8 is a section of cache memory that stores an array of data likely to be requested in the near future. Transferring data over buses 17, 27 and 37 to or from the data memory 8 is the essence of the action performed by the pipeline at stage M.

[38] The data memory 8 is also connected to the random access memory 9 by bus 81, and to the operand readiness control unit 7 (hereinafter, operand readiness unit 7) by bus 82. The functions of the operand readiness unit 7 are described in detail below; here it is noted that the operand readiness unit 7 is connected by buses 71, 72, 73 to the control devices 11, 21, 31, respectively, and by buses 46 and 74 to the instruction memory 4.

[39] At stage WB, data read from the data memory 8 or resulting from a mathematical operation performed by an ALU is transferred via buses 16, 26, 36 to be written to registers ad1, ad2 of the register file 5 and to register apd of the predicate file 6. Stage WB completes the entire cycle of the instruction-processing pipeline.
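
The path of one wide instruction through stages D, E and WB can be sketched as follows. This is a simplified model under stated assumptions, not the circuit of FIG. 1: stage M is omitted, the register file 5 and predicate file 6 are modelled as Python lists, and the opcode tables and example values are hypothetical.

```python
import operator

# Hypothetical opcode tables; the description names only the signals
# opc1, opc2 and popc, not a full instruction set.
ALU_OPS = {"add": operator.add, "sub": operator.sub, "or": operator.or_}
PLU_OPS = {"and": lambda a, b: a & b, "or": lambda a, b: a | b}


def run_wide_instruction(rf, pf, wide):
    """Process one wide instruction through stages D, E and WB.

    rf and pf model register file 5 and predicate file 6;
    wide is a dict keyed by the signal names used in the description.
    """
    # Stage D: read the first, second and predicate groups of operands.
    src11, src12 = rf[wide["as11"]], rf[wide["as12"]]
    src21, src22 = rf[wide["as21"]], rf[wide["as22"]]
    srcp1, srcp2 = pf[wide["aps11"]], pf[wide["aps12"]]

    # Stage E: the three functional blocks work on operands read in stage D,
    # so none of them sees a result produced by the same wide instruction.
    alu_res1 = ALU_OPS[wide["opc1"]](src11, src12)
    alu_res2 = ALU_OPS[wide["opc2"]](src21, src22)
    plu_res = PLU_OPS[wide["popc"]](srcp1, srcp2)

    # Stage WB: write the results to registers ad1, ad2 and apd.
    rf[wide["ad1"]] = alu_res1
    rf[wide["ad2"]] = alu_res2
    pf[wide["apd"]] = plu_res
    return alu_res1, alu_res2, plu_res


rf = [0, 5, 7, 0, 0b1100, 0b1010, 0]
pf = [1, 0, 0]
wide = {"opc1": "add", "as11": 1, "as12": 2, "ad1": 3,
        "opc2": "or", "as21": 4, "as22": 5, "ad2": 6,
        "popc": "or", "aps11": 0, "aps12": 1, "apd": 2}
results = run_wide_instruction(rf, pf, wide)
print(results)  # (12, 14, 1)
```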

[40] The technical problem that arises in the operation of the VLIW processor 300 and is solved in the VLIW processors 100 and 200 is described in more detail below with reference to FIGS. 2, 3, 4 and 5.

[41] FIG. 2 shows a fragment of program code written in the assembly language used in microprocessors. This fragment comprises a sequence of personalized instructions supplied to functional block 1 for execution after they have been separated from wide instructions. A comment on the content of each instruction is given to the right of the // symbol. Note that the nop instruction contains no operation to execute and is used, for example, to align two personalized instructions when they need to be placed in the same wide instruction.

[42] Note that register r1 in the ld instruction in FIG. 2 plays the role of the register with address ad1 in FIG. 1, while the same register r1 in the add instruction in FIG. 2 plays the role of the register with address as11 in FIG. 1. In other words, the operand to be written to register r1 (hereinafter, the target operand) is the result operand of the ld instruction and a source operand of the add instruction, which means that the add instruction cannot be executed until execution of the ld instruction has completed.
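
The dependency just described is a read-after-write relationship between ld and add. A minimal sketch of how such a dependency can be detected is given below; the `Instr` type and the register names other than r1 are placeholders, since FIG. 2 itself is not reproduced here.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dst: str                # result register (role of ad1)
    srcs: Tuple[str, ...]   # source registers (role of as11, as12)


def depends_on(consumer: Instr, producer: Instr) -> bool:
    """True if consumer reads the register that producer writes."""
    return producer.dst in consumer.srcs


# r0, r2 and r3 are hypothetical; only r1 is named in the description.
ld = Instr("ld", dst="r1", srcs=("r0",))
add = Instr("add", dst="r3", srcs=("r1", "r2"))

print(depends_on(add, ld))  # True: add must wait for the target operand in r1
```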

[43] FIG. 3 is a diagram of the pipeline process for the instructions shown in the program code fragment in FIG. 2. When the ld instruction passes stage M, the target operand must be read from the data memory 8 for subsequent writing to register r1. In the case shown in FIG. 3, the target operand is present in the data memory 8, so it is retrieved from the data memory 8 and transferred via bus 17 to stage WB in one clock cycle. The target operand is then forwarded via bus 16 to stage D of the add instruction, and the add instruction is processed without delay.

[44] Simultaneously with outputting the target operand, the data memory 8 transmits a hit signal via bus 82 to the operand readiness unit 7, indicating normal operation and that no action needs to be taken. The ld and add instructions respectively write the target operand to and read it from the register file 5 on the leading and trailing edges of a single clock pulse, which is shown schematically by an arrow in FIG. 3, so stage WB of the ld instruction and stage D of the add instruction can be carried out in the same clock cycle.

[45] Via bus 46, the operand readiness unit 7 receives from the instruction memory 4 information about which operands must be prepared for each of the personalized instructions executed simultaneously by functional blocks 1 and 2. In the case of FIG. 2, the operand readiness unit 7 receives via bus 46 from the instruction memory 4 the information that the add instruction requires the target operand, which must be retrieved from the data memory 8 as a result of executing the ld instruction. The data memory 8 reports the readiness of this operand by means of the hit signal.

[46] FIG. 4 is a diagram of the same pipeline process for the instruction sequence of FIG. 2, with the difference that the ld instruction fails to read the target operand and forward it to stage D of the add instruction, because the target operand is absent from the data memory 8. In this case, the data memory 8 requests the target operand from its lower-level sections or even from the random access memory 9 via bus 81, and at the same time sends a miss signal to the operand readiness unit 7 via bus 82. Since the add instruction cannot be executed in this case, the operand readiness unit 7 suspends the pipeline by sending a hold signal to the control devices 11, 21, 31 and to the instruction memory 4 via buses 71, 72, 73, 74, respectively.

[47] When the data memory 8 finally provides the target operand to be read by the ld instruction, the data memory 8 sends a hit signal to the operand readiness unit 7, and the operand readiness unit 7 deasserts the hold signal. As a result, the target operand is forwarded via bus 16 to stage D of the add instruction, and the pipeline resumes operation. As shown schematically in FIG. 4, the hold signal may be held for a long time, for example for 100 clock cycles or more.
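
The hold behaviour of the known VLIW processor 300 can be sketched as a small state machine. This is an illustrative model only: it assumes one signal sample per clock cycle, with `None` standing for cycles in which the data memory 8 is still fetching the operand, and the class and function names are hypothetical.

```python
class OperandReadinessUnit:
    """Toy model of operand readiness unit 7 in the known processor 300."""

    def __init__(self):
        self.hold = False

    def on_signal(self, signal):
        # A miss asserts hold; hold stays asserted until a hit arrives.
        if signal == "miss":
            self.hold = True
        elif signal == "hit":
            self.hold = False
        return self.hold


def count_stalled_cycles(signals):
    """Count the cycles during which the hold signal is asserted."""
    unit = OperandReadinessUnit()
    return sum(unit.on_signal(s) for s in signals)


hit_case = ["hit"]                            # FIG. 3: operand is in data memory 8
miss_case = ["miss"] + [None] * 99 + ["hit"]  # FIG. 4: operand arrives 100 cycles later

print(count_stalled_cycles(hit_case), count_stalled_cycles(miss_case))  # 0 100
```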

[48] FIG. 5 shows a sequence of wide instructions, in each of which the corresponding instruction of the sequence of FIG. 2 acts as the first personalized instruction, the instruction sequence of FIG. 2 itself forming branch A. Accordingly, when execution of the add instruction of branch A stops, as shown in FIG. 4, execution of the or instruction, which acts as the second personalized instruction and belongs to branch B, stops as well. In other words, blocking branch A causes branch B to be blocked.
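
The coupling between the two branches follows from the fact that a wide instruction is issued as a whole. The sketch below illustrates this under simplifying assumptions: one wide instruction is issued per cycle, a hold delays the whole wide instruction, and the slot contents other than ld, nop, add and or are placeholders rather than the actual content of FIG. 5.

```python
def issue_cycles(wide_instructions, stall_cycles_before):
    """Return the cycle at which each wide instruction is issued.

    stall_cycles_before maps an index in wide_instructions to the number of
    cycles the pipeline is held before that wide instruction may proceed.
    """
    cycle = 0
    issued = []
    for index, _ in enumerate(wide_instructions):
        cycle += stall_cycles_before.get(index, 0)
        issued.append(cycle)
        cycle += 1
    return issued


# Slot 1 carries branch A, slot 2 carries branch B (hypothetical layout).
program = [("ld", "nop"), ("nop", "nop"), ("add", "or")]

no_miss = issue_cycles(program, {})
with_miss = issue_cycles(program, {2: 100})  # add waits 100 cycles for r1

print(no_miss, with_miss)  # [0, 1, 2] [0, 1, 102]
```

In the second case the or instruction of branch B is delayed by the same 100 cycles as add, although it does not depend on the ld instruction.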

[49] Suppose that branch A is a branch that is unlikely to be taken and that is included in the sequence of wide instructions as a so-called speculative branch, executed "just in case". Blocking the frequently executed branch B because the operands of the rarely executed branch A are not ready is highly undesirable and significantly reduces the performance of the VLIW processor 300. Moreover, the small number of transitions actually made to branch A in itself reduces the probability that the target operand is present in the data memory 8, which increases both the likelihood and the duration of a pipeline stall. This situation, which illustrates the technical problem addressed by the invention, is resolved in the proposed VLIW processors 100 and 200.

[50] FIG. 6 is a block diagram of the VLIW processor 100 configured in accordance with the first preferred embodiment of the invention. Elements and relationships that are identical in the VLIW processor 100 and the VLIW processor 300 are not described again; the following describes the improvements of the VLIW processor 100 over the VLIW processor 300 that solve the technical problem by achieving the technical result stated above.

[51] In addition to or instead of buses 71, 72, 73, 74, the VLIW processor 100 is provided with buses 75 and 76, which connect the operand readiness unit 7 to the control devices 11 and 21, respectively. Furthermore, each register of the register file 5, each of registers 13, 15, 23, 25, and each register of the data memory 8 (hereinafter collectively, the data registers) is provided with an additional bit. In registers 13, 15, 23, 25 this additional bit is designated by reference numerals 131, 151, 231, 251, respectively.

[52] As before, by receiving signals via buses 46 and 82, the operand readiness unit 7 is able to monitor the updating of the first group of operands in the register file 5. Under normal operating conditions, such as those shown in FIG. 3, the operation of the VLIW processor 100 does not differ from the operation of the VLIW processor 300 described above.

[53] However, in the VLIW processor 100, when any operand of the first group has not been updated before execution of the first personalized instruction (which is equivalent to the entire first group of operands not having been updated), i.e. when a miss signal is received via bus 82, the operand readiness unit 7 transmits a cflag1 signal (compromising flag) via bus 75 instead of transmitting the hold signal via bus 71. In another implementation, the cflag1 signal is transmitted simultaneously with the hold signal but takes priority over it, i.e. cancels the hold signal.
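
The difference between the two processors at this point can be summarised in a short sketch. It models the second implementation mentioned above, in which hold is raised on a miss but cancelled by cflag1; the effective outcome is the same as in the first implementation, where hold is not transmitted at all. The function name and the `has_cflag_bus` parameter are hypothetical.

```python
def control_signals(memory_signal, has_cflag_bus):
    """Return the effective signals seen by control device 11.

    has_cflag_bus=False models the known processor 300 (bus 71 only);
    has_cflag_bus=True models processor 100 (bus 75 present).
    """
    miss = memory_signal == "miss"
    cflag1 = miss and has_cflag_bus
    hold_raised = miss
    # cflag1 has priority over hold and cancels it.
    hold_effective = hold_raised and not cflag1
    return {"hold": hold_effective, "cflag1": cflag1}


print(control_signals("miss", has_cflag_bus=False))  # pipeline is suspended
print(control_signals("miss", has_cflag_bus=True))   # pipeline continues, flag is set
```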

[54] This feature of the VLIW processor 100 also applies to buses 73 and 74. On receiving a miss signal via bus 82, the operand readiness unit 7 either does not transmit the hold signal via buses 73 and 74, or transmits, via newly introduced buses (not shown) to the instruction memory 4 and functional block 3, the cflag1 signal, which cancels the effect of the hold signal.

[55] In the cases of FIGS. 2 and 5, the operand readiness unit 7 transmits the cflag1 signal via bus 75 when the target operand, i.e. the operand to be written to register r1 as a result of executing the ld instruction, has not been written to register r1 before execution of the add instruction (as before, branch A is assumed to be executed by functional block 1). Nevertheless, as shown in FIG. 8, the control unit does not suspend the pipeline but continues executing the add instruction with the previous (not updated) operand stored in register r1, while at the same time placing a compromising flag, i.e. the bit value "1", in the additional bit 131 of the corresponding input register 13.

[56] The compromising flag is then written to the additional bit 151 of the result register 15, and subsequently to the additional bits of all other registers that store the result of the add instruction, namely the additional bits of the registers of the data memory 8 and of the register file 5. Accordingly, any of the additional bits 131, 151, etc. contains information about the readiness of the target operand, i.e. can serve as the operand readiness register for the add instruction.

[57] If the result of the add instruction is used by the next instruction of branch A, it is written from the register file 5 into the input register 13 as an input operand, and the compromising flag is written along with it into the additional bit 131 of that input register 13. Thus the results of all subsequent instructions of branch A that use operands marked with the compromising flag are themselves marked with the compromising flag.
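
The propagation of the compromising flag through dependent instructions can be illustrated with a minimal software sketch. It is not the hardware described here: the `Tagged` class, the `add` function and the `operands_stale` parameter are hypothetical names introduced only for illustration, with `operands_stale` standing in for an asserted cflag1 signal and the `cflag` field standing in for the additional bits 131 and 151.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    """A register value together with one additional bit for the compromising flag."""
    value: int
    cflag: bool = False


def add(a: Tagged, b: Tagged, operands_stale: bool = False) -> Tagged:
    """Model of an add that never stalls.

    The result is flagged if either input is already flagged, or if the
    operands were not updated in time (operands_stale models cflag1).
    """
    return Tagged(a.value + b.value, a.cflag or b.cflag or operands_stale)


r1 = Tagged(7)                          # old contents of r1; ld has not written back yet
r2 = Tagged(5)
r3 = add(r1, r2, operands_stale=True)   # executed with the stale r1, result is flagged
r4 = add(r3, Tagged(1))                 # dependent instruction inherits the flag
r5 = add(r2, Tagged(1))                 # independent instruction stays unflagged
```

In this sketch only values derived from the stale operand carry the flag, which mirrors the statement that the results of all subsequent instructions using flagged operands are flagged as well.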

[58] When the compromising flag is present in the additional bit 151, the control unit recognizes the result of the first personalized instruction as an unreliable result. Since all subsequent instructions of branch A are also marked with the compromising flag, the control unit likewise treats their results as unreliable. Thus, if a transition to branch A becomes necessary, all instructions of branch A, starting with the ld instruction, must be executed again.

[59] However, branch A does not always need to be executed; in fact it is needed rather rarely, because a delay in delivering the target operand is characteristic of rarely executed instructions. There is therefore a high probability that branch A will not be required, in which case the unreliability of its result does not affect the result of the program containing branches A and B. At the same time, since executing branch A did not delay the execution of branch B, the proposed VLIW processor 100 executes the program containing branches A and B in considerably less time than the known VLIW processor 300, which is obliged to obtain a reliable result for branch A.

[60] If the transition to branch A does take place, the ld instruction and all other instructions of branch A that depend on its result are re-executed by invoking compensation code, which is essentially identical to the main code of branch A. All aspects of using compensation code are known to a person skilled in the art; here we only note that the repeated access to the data memory 8 for the target operand, issued by the ld instruction, considerably speeds up delivery of the target operand by the data memory 8, since this is exactly the situation that the cache memory algorithm is designed for. Thus, in this case too, the proposed VLIW processor 100 is faster than the known VLIW processor 300.
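
The decision between discarding the speculative result and invoking compensation code can be sketched as follows. This is a simplified software analogy under stated assumptions, not the processor's control logic: the function names `branch_a` and `result_of_branch_a`, the `load` callback and the concrete values are all hypothetical.

```python
def branch_a(load, r2):
    """Main code of branch A: ld into r1, then add r1 + r2."""
    r1 = load()       # ld
    return r1 + r2    # add


def result_of_branch_a(taken, cflag, speculative, load, r2):
    """Return the value the program observes for branch A.

    taken       -- whether control actually transfers to branch A
    cflag       -- compromising flag attached to the speculative result
    speculative -- result computed earlier without stalling
    """
    if not taken:
        return None                 # the unreliable result is never consumed
    if cflag:
        return branch_a(load, r2)   # compensation code: same code, run again
    return speculative              # operands were ready, result is reliable


loads = []                          # records each access to data memory


def load():
    loads.append(1)
    return 40                       # the value that ld eventually delivers


stale_r1, r2 = 7, 5
speculative = stale_r1 + r2         # add executed with the old r1
```

When the branch is not taken, the sketch performs no further memory access at all, which corresponds to the common case in which the unreliable result costs nothing.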

[61] Note that although the add instruction was executed without waiting for the result of the ld instruction, the ld instruction itself still completes with a reliable result. Since the result of the ld instruction is stored in the register file 5 after the WB stage, re-executing the ld instruction is, strictly speaking, not mandatory, so the performance of the VLIW processor 100 on a transition to branch A can be even higher.

[62] Nevertheless, keeping the result of the ld instruction in one of the registers of the register file 5 for a long time, in anticipation of an unlikely return to branch A, obviously reduces the number of free registers in the register file 5. Because the total number of registers in the register file 5 is comparatively small, taking one register out of circulation can degrade the performance of the VLIW processor 100 by more than the saved result of the ld instruction gains. In the example considered here, in which the VLIW processor 100 executes the code of FIG. 5, re-executing the ld instruction is the preferred option.

[63] Since functional block 2 is completely identical to functional block 1, when the second group of operands is not ready before the second personalized instruction is executed, i.e. when the control device 21 receives the signal cflag2 via bus 72 from the operand readiness unit 7, functional block 2 performs the same actions as described for functional block 1. The control unit then recognizes the result of the second personalized instruction as unreliable.

[64] Note that a situation may arise in which neither the first nor the second group of operands has been updated. In this case the pipeline also continues to operate, and the control unit treats the results of both the first and the second personalized instructions as unreliable. Executing the first and second personalized instructions with knowingly unreliable results is justified by the time gained from faster delivery of the target operands on a repeated access to the data memory 8. In addition, the next wide instruction is processed immediately, provided that its personalized instructions do not depend on the current wide instruction.

[65] At the same time, implementing the invention in the VLIW processor 100 involves certain technical difficulties, because the registers used in the VLIW processor 100, being equipped with an additional bit, fall outside generally accepted standards and must be custom made. Furthermore, every data bus in the VLIW processor 100 must be provided with an additional line connecting the additional bits. On the whole, it must be acknowledged that these factors can substantially limit the field of application of this embodiment of the invention.

[66] However, these technical difficulties do not arise in the second preferred embodiment of the invention, namely the VLIW processor 200, whose block diagram is shown in FIG. 7. The data registers in the VLIW processor 200 are identical to those in the known VLIW processor 300; however, the VLIW processor 200 is provided with a bus 77 capable of transmitting the signals cflag1 and cflag2 to the predicate-logic functional block 3. For the situation of a delayed update of the target operand in the cases of FIG. 2 and FIG. 5, the operation of the VLIW processor 200 is illustrated in FIG. 9.

[67] If a wide instruction contains a personalized instruction for which a delay in preparing the source operands is considered possible (in the case of FIG. 5 this is the add instruction executed by functional block 1), the wide instruction also includes a predicate personalized instruction sbpred, whose execution is assigned to functional block 3. When the signal cflag1 arrives at functional block 3 via bus 77, the compromising flag is written to the input register 33 and, after the sbpred instruction passes the WB stage, is stored in the register of the predicate file 6 that is specified in the sbpred instruction. That register serves as the operand readiness register for the given personalized instruction and the branch containing it (in the case of FIG. 5, the add instruction and branch A).

[68] If a wide instruction contains, for example, two personalized instructions for which a delay in preparing the source operands is considered possible, the predicate personalized instruction sbpred included in that wide instruction designates two registers of the predicate file 6, each of which serves as the operand readiness register for its own personalized instruction.

[69] When the signals cflag1 and cflag2 arrive at functional block 3, the first of the registers of the predicate file 6 designated by the sbpred instruction stores the first compromising flag cflag1 for the first personalized instruction, and the second of the designated registers stores the second compromising flag cflag2 for the second personalized instruction. As noted above, functional block 3 may receive both signals cflag1 and cflag2 at once, or only one of them.
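
The bookkeeping performed by the sbpred instruction can be pictured with a short sketch. The encoding of sbpred is not given in this description, so the function signature, the dictionary model of the predicate file 6 and the register names `p4` and `p5` are assumptions made purely for illustration.

```python
def sbpred(predicate_file, dest_regs, cflags):
    """Store each incoming cflag signal in its designated predicate register.

    predicate_file -- dict modelling the predicate file (name -> bool)
    dest_regs      -- predicate registers designated by this sbpred
    cflags         -- incoming signals, e.g. (cflag1, cflag2), in the same order
    """
    for reg, flag in zip(dest_regs, cflags):
        predicate_file[reg] = bool(flag)
    return predicate_file


# Only cflag1 is asserted: the first instruction ran with stale operands,
# the second one had its operands ready.
pf = sbpred({}, ["p4", "p5"], (True, False))
```

Each designated register then acts as the operand readiness register of one personalized instruction, and later instructions can read it like any other predicate.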

[70] As in the VLIW processor 100 described above, in the VLIW processor 200 the signals cflag1 and cflag2 take priority over the hold signal, i.e. they cancel its effect. Preferably, the personalized instructions that must not block execution of the wide instruction even when their operands are not ready are determined in advance, at compile time. As a rule, these are the first instructions of a branch to which a transition is unlikely. The subsequent instructions of that branch obtain the compromising flag from the predicate file 6.
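
The priority of the cflag signals over the hold signal can be summarized in a small truth-table sketch. The function name `arbitrate` and the parameter `non_blocking`, which models the compile-time marking of an instruction, are hypothetical; the sketch only restates the rule given above and does not describe the actual circuit.

```python
def arbitrate(miss, non_blocking):
    """Return (hold, cflag) for one personalized instruction.

    miss         -- the operands were not updated in time
    non_blocking -- the compiler marked the instruction as one that must
                    not stall the wide instruction
    """
    cflag = miss and non_blocking    # continue and flag the result
    hold = miss and not cflag        # otherwise stall as a conventional design would
    return hold, cflag
```

For a marked instruction a miss yields `(False, True)`, so the pipeline keeps running; for an unmarked one it yields `(True, False)`, which is the conventional stall.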

[71] Storing the compromising flag in the predicate file 6 makes it possible to use it as an ordinary predicate characterizing the state of the corresponding personalized instruction or of the branch to which that instruction belongs. For example, the compromising flag stored in the predicate file 6, or more precisely in the register designated by the sbpred instruction to serve as the operand readiness register, can act as the condition for a transition to another branch or to compensation code, and also as an operand in its own right for a logical operation.

[72] In all other respects, the VLIW processor 200 has the same advantages as the VLIW processor 100 described above.

Claims

1. A processor comprising a preparatory pipeline, a register file, first and second execution pipelines, an operand readiness control unit, an operand readiness register and a control unit, wherein

the preparatory pipeline is capable of extracting first and second personalized instructions from a wide instruction,

the first and second execution pipelines are capable of executing, synchronously with each other, the first and second personalized instructions using first and second groups of operands, respectively,

the operand readiness control unit is capable of monitoring the update of the first group of operands in the register file,

the operand readiness register is capable of storing a compromising flag when the first group of operands has not been updated before execution of the first personalized instruction, and

the control unit is capable, in the presence of the compromising flag, of recognizing the result of executing the first personalized instruction as an unreliable result.

2. The processor according to claim 1, in which the control unit is capable of causing re-execution of the first personalized instruction by invoking compensation code.

3. The processor according to claim 1, in which the control unit is capable of not causing re-execution of the first personalized instruction when the control unit has determined that the first personalized instruction is a speculative instruction of a branch that is not executed.

4. The processor according to claim 1, in which the control unit is capable of using the compromising flag for a control transfer or for a logical operation.

5. The processor according to claim 1, in which the operand readiness register is capable of storing the compromising flag for the entire chain of instructions that follow and depend on the first personalized instruction, and the control unit is capable of recognizing as unreliable the results of executing all instructions of said chain.

6. The processor according to claim 1, in which the first execution pipeline contains a result register that stores the result of executing the first personalized instruction, and the operand readiness register is included in the result register as a separate bit.

7. The processor according to claim 6, in which the registers of the first execution pipeline, intended for the source operands, as well as the register file and cache memory registers, intended for the operands, are equipped with a separate bit intended for storing the compromising flag.

8. The processor according to claim 1, which contains an execution pipeline for logical operations, and the operand readiness register is one of the registers of the execution pipeline for logical operations.

9. The processor according to claim 1, in which the operand readiness control unit is capable of monitoring the update of the second group of operands in the register file, and

the operand readiness register is a first operand readiness register, and the compromising flag is a first compromising flag, wherein

the processor includes a second operand readiness register capable of storing a second compromising flag when the second group of operands has not been updated before execution of the second personalized instruction, and

the control unit is capable, in the presence of the second compromising flag, of recognizing the result of executing the second personalized instruction as an unreliable result.