RU2559767C2

RU2559767C2 - Method of providing fault-tolerance computer system based on task replication, self-reconfiguration and self-management of degradation

Info

Publication number: RU2559767C2
Application number: RU2013150724/08A
Authority: RU
Inventors: Сергей Николаевич Тихонов; Юрий Александрович Беликов; Владимир Юрьевич Бобров; Юрий Яковлевич Быков; Вячеслав Юрьевич Гришин; Петр Михайлович Еремеев; Олег Ервандович Мелконян; Антонина Иннокентьевна Садовникова; Владимир Григорьевич Сиренко
Original assignee: Открытое акционерное общество "Научно-исследовательский институт "Субмикрон"
Priority date: 2013-11-15
Filing date: 2013-11-15
Publication date: 2015-08-10
Also published as: RU2013150724A

Abstract

FIELD: physics, computer engineering.

SUBSTANCE: invention pertains to computer engineering. Disclosed is a method of providing fault tolerance of a computer system, based on task replication, self-reconfiguration and self-management of degradation, wherein the computer system includes four numbered computers BMi, where i=1,…,4 is the number of a computer, each connected to each of the other computers through its own transmitting interface device with an information broadcast channel, wherein each computer in the operating cycle of the system receives input information, calculates its copy of output information and transmits said copy to every other computer of the system, calculates the correct output information by majoritising its own and all received copies, outputs the majoritisation result to an external medium, compares all available copies of output information with the majoritisation result; each computer further includes first, second, third and fourth program fault counters, first, second, third and fourth program fault seconds counters, first, second, third and fourth program blocking counters, a synchronisation timer, an inter-computer communication (MMO) synchronisation timer, an interruption request generation timer (Tmrgzp), a Tmrgzp preset register, a Tmrgzp counter, a synchronisation timer preset register, a synchronisation timer counter, an MMO preset register, an MMO synchronisation counter and an RFx fixing register.

EFFECT: faster operation of the computer system.

6 cl, 12 dwg

Description

The invention relates to computer engineering and can be used to build highly reliable computing and control systems intended for controlling the onboard systems of a transport ship.

A method is known for localizing "friendly" and "hostile" faults in complexes of redundant computers (VMs) [1]. The method rests on regularities in how faults manifest themselves in the set of calculation results used by a deterministic mutual information consistency algorithm. For any VM in a complex of directly connected redundant VMs performing the same application tasks, the calculation result can be represented by a single number. The set of calculation results for each task, assembled in every VM through inter-machine exchange, forms an initial set (IN) that lets any VM monitor the state of the complex. In the absence of faults, the values of the IN elements agree or lie within the range of permissible deviations. When a fault manifests itself, the values of one or more elements differ from the rest. The purpose of localization is for all healthy VMs to determine the number of the faulty VM simultaneously and unambiguously.

The computational process is periodically interrupted at checkpoints (CT) to carry out the fault tolerance procedure. With existing approaches to localization, the fault tolerance procedure is carried out entirely at the same CT at which the fault manifestation was detected. Let N be the total number of VMs in the complex and k the number of faulty ones. "Friendly" faults are localized trivially when N ≥ 2k+1. One round of exchange is performed to form the IN; each healthy VM forms a vector of N elements, and the positions of the mismatched elements (RE) in it coincide with the numbers of the faulty machines [2]. Localization of all manifestation forms of "hostile" faults is feasible only when N ≥ 3k+1 and is a non-trivial task. The fault manifestation form (FPN) is the pattern of RE positions in the INs formed at one and the same CT by all healthy machines. The difficulty arises because the IN then has a more complex structure, in particular multidimensional and sparse; the RE positions in different healthy VMs do not coincide and are not uniquely related to the numbers of the faulty machines. The localization method [3] relies on the mutual information consistency (VIS) algorithm to convert, within one CT, all possible FPNs to the single form established for "friendly" faults. In this conversion the value of each vector element is obtained by majority choice among the matching values in groups of IN elements, so that FPNs in which the number of REs is less than half the number of elements in the group are masked, that is, they can accumulate and cause the complex to fail.

The method proposed in [3] forms, for any FPN, a set of admissible localization results in each healthy VM (possibly ambiguous in some machines) and then, over several CTs, removes the ambiguity in the number of the faulty machine by transformations over the INs and the non-matching localization results. The localization procedure implementing this method is multi-stage and finishes at different stages for different FPNs. It is asserted that, although the data received from a faulty VM in each round of inter-machine exchange is arbitrary, the set of REs in the INs formed by the healthy VMs is arranged in a regular pattern, which is what makes it possible to derive localization rules.
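As a quick arithmetic check of the two bounds quoted above, the following minimal sketch computes the largest number of faulty machines k that a complex of N machines can localize under each condition. The function names are illustrative and do not come from the cited works.

```python
def max_friendly_faults(n: int) -> int:
    """Largest k satisfying n >= 2k + 1 (bound quoted for "friendly" faults)."""
    return max((n - 1) // 2, 0)


def max_hostile_faults(n: int) -> int:
    """Largest k satisfying n >= 3k + 1 (bound quoted for "hostile" faults)."""
    return max((n - 1) // 3, 0)


if __name__ == "__main__":
    for n in (3, 4, 7):
        print(n, max_friendly_faults(n), max_hostile_faults(n))
```

For a four-machine complex both bounds give k = 1, whereas with three machines the "hostile" bound gives k = 0.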

This method has the following disadvantages. First, fault manifestations that occur during the exchange of calculation results are not accounted for in a coordinated way, which can delay localization indefinitely even when faults manifest themselves relatively often.

Second, there is no coordinated detection, accounting and identification of fault manifestations during the mutual information agreement of the fault detection indicators and of the localization data used to build auxiliary sets. This raises the probability of an undetected failure, and of a sudden failure of the whole system when another system element subsequently faults (since in this case a short fault may be registered where only a single fault is permissible), even when a healthy reserve is available.

Also known is a method of fault-tolerant information agreement with identification of detected faults in a four-machine computing system [4]. The system consists of four VMs, numbered 1 to 4, connected by four bus-architecture communication channels. Each VM is attached to each channel by a separate interface device (US); one of a VM's USs is a transmitter and its other three USs are receivers. Each channel thus has one transmitting US and three receiving USs. Each VM sends information to all the others by broadcast. Let VMi transmit some information to all the other machines VMj, VMk and VMl.

The fault-tolerant information agreement algorithm is as follows:

• in the first round, each VMi, i = 1, …, 4, sends its value Z_i to the other VMs and receives their values from them;

• in the second (third) round, each VMi sends the other VMs the information it received in the previous round and, once more, its own value Z_i² (Z_i³);

• in the fourth round, each VMi sends the other VMs all the information it received in the third round;

• each VMi determines the values of the elements of its interactive consistency vector, including the value C_i. For each j (j = 1, …, 4), majority voting over all values Z_j¹, … received by this VM in the first and second rounds gives C_j¹; majority voting over all values Z_j², … received in the second and third rounds gives C_j²; and majority voting over the values Z_j³, … received in the third and fourth rounds gives C_j³. The element C_j of the interactive consistency vector of VMi is formed by majority voting over C_j¹, C_j² and C_j³. If any of these votes is impossible because all three input values differ, the corresponding result is assigned NULL.
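The voting rule in the last step can be illustrated with a minimal sketch. It assumes that every vote takes exactly three inputs, as the NULL rule suggests, and it treats NULL (Python `None`) as an ordinary value in later votes; both points are assumptions, and the function names are hypothetical.

```python
from typing import Hashable, Optional


def majority3(a: Optional[Hashable], b: Optional[Hashable],
              c: Optional[Hashable]) -> Optional[Hashable]:
    """Return the value shared by at least two of the three inputs.

    Returns None (standing in for NULL) when all three inputs differ.
    """
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None


def agreed_element(c1, c2, c3):
    """Form C_j from the three intermediate results C_j1, C_j2, C_j3."""
    return majority3(c1, c2, c3)


if __name__ == "__main__":
    c1 = majority3(7, 7, 9)   # two of three copies agree
    c2 = majority3(7, 8, 9)   # all copies differ, so the vote yields NULL
    c3 = majority3(7, 7, 7)   # all copies agree
    print(c1, c2, c3, agreed_element(c1, c2, c3))
```

In this example one intermediate vote fails, yet the final element is still recovered because the other two intermediate results agree.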

Each VM detects and identifies manifested faults independently. Over all the agreement rounds, VMi receives a sequence of values ordered by the number of the round in which they arrived, by the numbers of the VMs along their paths, and by the value of k. Denote this sequence by S. For given j and k, each VMi compares the values Z_j^k, … in its S one after another with C_j and, at the first mismatch (which indicates that a fault has manifested itself), determines the suspected region of that fault. This analysis is carried out for every value of j and k. Intersecting all the suspected regions obtained and analysing them makes it possible to identify the manifested fault.

The disadvantage of this method is that it is difficult to implement in real-time systems.

The objective of the invention is to increase speed and to simplify the algorithms executed.

The essence of the claimed invention and the possibility of its implementation and industrial use are explained by the drawings and algorithms shown in FIG. 1 to 12, where:

• FIG. 1 is a block diagram of a fault-tolerant computing system with hardware and software implementation of the fault tolerance and dynamic reconfiguration functions;

• FIG. 2 is a state diagram of degradation and reconfiguration;

• FIG. 3 shows the algorithm for ensuring transient-fault and failure tolerance of the computing system, based on self-reconfiguration and self-management of degradation;

• FIG. 4 shows the algorithm for setting up the inter-machine exchange controller (KMMO);

• FIG. 5 shows the operating algorithm of the time reference system;

• FIG. 6 shows binding to an external time stamp;

• FIG. 7 shows the synchronization algorithm in the absence of a time stamp;

• FIG. 8 shows the calculation of the forecast lag in the absence of an external time stamp (MV);

• FIG. 9 shows the majority voting algorithm;

• FIG. 10 shows the operating algorithm of the MMO synchronization timer;

• FIG. 11 shows the algorithm for self-management of degradation;

• FIG. 12 shows the algorithm for self-reconfiguration.

The stated advantages of the claimed method over the prototype are achieved as follows. The method is applied to a fault-tolerant computing system with hardware and software implementation of the fault tolerance and dynamic reconfiguration functions. The system contains first 1, second 2, third 3 and fourth 4 computers (VMs) connected by first 5, second 6, third 7 and fourth 8 serial data buses; first 9, second 10, third 11 and fourth 12 secondary power supplies (VIP); first 13, second 14, third 15 and fourth 16 inter-machine exchange controllers (KMMO); and first 17, second 18, third 19 and fourth 20 configuration management controllers (KUK).

The first 21 groups of outputs of the KUKs are connected to the first groups of inputs of the zeroth 1, first 2, second 3 and third 4 VMs. The second 22 groups of inputs of the VMs are connected to the first group of inputs of the system. The second 23, third 24, fourth 25 and fifth 26 groups of inputs of the system are connected to the first groups of inputs of KMMO1 (13) and KUK1 (17), KMMO2 (14) and KUK2 (18), KMMO3 (15) and KUK3 (19), and KMMO4 (16) and KUK4 (20), respectively. The second 27 groups of inputs of these controllers are connected to the first groups of outputs of the VMs (1, 2, 3, 4). The local bidirectional bus 28 of the VMs is connected to the local bidirectional buses of the KMMOs (13, 14, 15, 16). The first 29 outputs of the KMMOs are connected to the first inputs of the KUKs (17, 18, 19, 20), and the second 30 groups of outputs of the KUKs are connected to the first groups of inputs of the VIPs (9, 10, 11, 12), respectively. The groups of outputs 31 of the VIPs are connected to the third groups of inputs of the VMs (1, 2, 3, 4).

The first 32 group of outputs of the first KMMO 13 is connected to the third groups of inputs of the second 14, third 15 and fourth 16 KMMOs and of the first 17, second 18, third 19 and fourth 20 KUKs. The first 33 group of outputs of the second KMMO 14 is connected to the fourth groups of inputs of the first 13, third 15 and fourth 16 KMMOs and of the first 17, second 18, third 19 and fourth 20 KUKs. The first 34 group of outputs of the third KMMO 15 is connected to the fifth groups of inputs of the first 13, second 14 and fourth 16 KMMOs and of the first 17, second 18, third 19 and fourth 20 KUKs. The first 35 group of outputs of the fourth KMMO 16 is connected to the sixth groups of inputs of the first 13, second 14 and third 15 KMMOs and of the first 17, second 18, third 19 and fourth 20 KUKs. The sixth 36 group of inputs of the system is connected to the second groups of inputs of the first 9, second 10, third 11 and fourth 12 VIPs, whose first inputs are connected to the seventh 37 group of inputs of the system; the eighth 38 group of inputs of the system is connected to the second inputs of the first 9, second 10, third 11 and fourth 12 VIPs. The second 39 groups of outputs of the KMMOs (13, 14, 15, 16) are connected to the fourth groups of inputs of the VMs (1, 2, 3, 4).

Each computer additionally includes first 40, second 41, third 42 and fourth 43 program fault counters (ERRi VMi); first 44, second 45, third 46 and fourth 47 program fault-seconds counters (SECi VMi); first 48, second 49, third 50 and fourth 51 program blocking counters (BLKi VMi); a synchronization timer (TmrSn) 52; an MMO (inter-machine exchange) synchronization timer (Tmr MMO) 53; an interrupt request generation timer (Tmrgzp) 54; the Tmrgzp preset register RPgzp 55; the counter RTgzp 56; the synchronization timer preset register RPsn 57; the synchronization timer counter RTsn 58; the MMO preset register RPmmo 59; the MMO synchronization counter RTmmo 60; the fixing register RFx 61; an inter-machine exchange controller (KMMO) 13, 14, 15, 16; and a configuration management controller (KUK) 17, 18, 19, 20.

Stages of bringing the system into operation are defined, consisting of system initialization, the initial power-on check, formation of the system technical state table (TTS), installation of the initial working configuration of the system 62, and functional operation that preserves the specified level of redundancy, together with mechanisms for ensuring transient-fault and failure tolerance of the computing system. These mechanisms consist of setting up the KMMO, agreeing the system time, performing synchronization in the absence of an external time stamp, operating the MMO synchronization timer, performing exchanges over the MKO 65, detecting failures and transient faults 65, 66 by majority voting over the machine's own copy and all received copies of the input information, performing self-management of degradation 67, and performing self-reconfiguration 68.

Setting up the inter-machine exchange controller depends on the configuration. In the single-machine configuration 69, the output of the machine's own VM is switched 73 to inputs "a", "b" and "c" of the majority voting element, and input "a" is allowed to take part in synchronization. In the two-machine configuration 70, the output of the own VM is switched 74 to input "a" of the majority voting element, the output of the other VM to input "b", and the output of the "better" of the two VMs to input "c", the better VM being the one for which the sum of the program fault counter (40, 41, 42, 43) and the program fault-seconds counter (44, 45, 46, 47) is minimal; inputs "a" and "b" are allowed to take part in synchronization. In the three-machine configuration 71, the output of the own VM is switched 75 to input "a", the output of the second VM to input "b" and the output of the third VM to input "c", and inputs "a", "b" and "c" are allowed to take part in synchronization. In the four-machine configuration 72, the output of the own VM is switched 76 to input "a", the output of the second VM to input "b" and the output of the third VM to input "c", and inputs "a", "b", "c" and "d" are allowed to take part in synchronization, where "d" is the output of the fourth VM.
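The four cases can be summarized in a short sketch. It is only an illustration of the selection rule described above: the names `VmStatus` and `configure_voter` are hypothetical, the counters are reduced to one ERR and one SEC value per VM, and the tie rule for the "better" VM (prefer the own VM) is an assumption, since the text does not state one.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class VmStatus:
    number: int  # VM number, 1..4
    err: int     # program fault counter (ERR) value
    sec: int     # program fault-seconds counter (SEC) value


def configure_voter(
    own: VmStatus, others: Sequence[VmStatus]
) -> Tuple[Tuple[int, int, int], Tuple[str, ...]]:
    """Return (VM numbers routed to inputs a, b, c; inputs allowed to synchronize)."""
    machines = 1 + len(others)
    if machines == 1:
        return (own.number, own.number, own.number), ("a",)
    if machines == 2:
        other = others[0]
        # "Better" VM: minimal ERR + SEC. On a tie min() keeps the first
        # candidate (the own VM); this tie rule is an assumption.
        best = min((own, other), key=lambda vm: vm.err + vm.sec)
        return (own.number, other.number, best.number), ("a", "b")
    if machines == 3:
        return (own.number, others[0].number, others[1].number), ("a", "b", "c")
    if machines == 4:
        # Input "d" (the fourth VM) takes part in synchronization only.
        return (own.number, others[0].number, others[1].number), ("a", "b", "c", "d")
    raise ValueError("the system has at most four VMs")


if __name__ == "__main__":
    own = VmStatus(number=1, err=3, sec=2)
    other = VmStatus(number=2, err=1, sec=1)
    print(configure_voter(own, [other]))
```

In the two-machine example VM 2 has the smaller ERR + SEC sum, so its output is routed to both inputs "b" and "c" and therefore decides the vote.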

The time reference system operates as follows. An actuation interval of 1,000,000 µs is written 77 to the preset register RPgzp 55 of the timer Tmrgzp; the interval is copied 78 to the counter RTgzp 56, which counts down 80. Tmrgzp 54 fires 81 when RTgzp 56 reaches zero: an interrupt request is generated 82, the interval from RPgzp 55 is copied 78 to RTgzp 56 again, and the process repeats. While RTgzp 56 is not equal to "0", "1" is subtracted from RTgzp 56 on each clock pulse. An actuation interval 83 equal to 0xFFFFFFFF is written to the synchronization timer preset register RPsn 57. On an external time stamp (MV) 84, the contents of RPsn 57 are copied 84 to the synchronization timer counter RTsn 58, which counts down 86, and the synchronization timer counter 58 is copied 84 to the fixing register RFx 61. When RTsn 58 reaches "0" 87, indicators that the synchronization timer has fired and that an external time stamp is present are formed 88, the contents of the synchronization timer counter 58 are copied to RFx 61, the preset register RPsn 57 is copied 84 to RTsn 58, and the process repeats 85. If there is no external time stamp MV 87 and RTsn 58 is not equal to "0", "1" continues to be subtracted from RTsn 58 on each clock pulse 85.

Synchronization in the absence of an external time stamp is performed when each BMi has an interrupt request (89) from the timer Tmrgzp. If the number of the current second is even (90), each BMi generates a synchronization byte and transmits (91) it to the two other BMs. Each BMi checks for synchronization bytes from the other BMs and, if they are present, latches the generalized synchronization flag (92), computes (93) the time from the timer firing to the generalized synchronization (T3Si), and outputs (94) the time value T3Si. If the generalized synchronization flag is present (95), the BMi computes the median (96) of the three time values T3S0, T3S1 and T3S2, performs the majority voting procedure (97) on the agreed time stamps, computes a correction (98) for the timer Tmrgzp, and adjusts (99) the state of the timer Tmrgzp by the computed correction.

The majority voting procedure is as follows. Each BMi forms a flag indicating the need to switch to degradation mode (100) if the contents of at least one fault counter ERRi exceed 8, appends (101) this flag to the value it submits for agreement, and sends the result to all other BMs. When the generalized synchronization flag is present (102), the BMi performs majority voting (103) on its own copy and the two received copies using the formula M = (A & B) | (A & C) | (B & C), where A is its own value, and then compares the copies B and C and its own value with the voting result (104, 105, 106). If the copies and the own value are equal to the voting result and M carries the degradation flag (107), the self-managed degradation procedure (112) is performed and the agreed value M is returned (108). If the copy from BM2 is not equal to M (105), the fault counter ERR BM2 (41) is incremented (110) by one; if the copy from BM3 is not equal to M (106), the fault counter ERR BM3 (42) is incremented (111) by one; if the own value from BM1 is not equal to M (104), the fault counter ERR BM1 (40) is incremented (109) by one. If M carries no degradation flag (107), the agreed value M is returned (108).

The MMO synchronization timer 53 operates as follows. If data are present on two inputs of the inter-computer communication (MMO) controller (113) and a clock pulse arrives (114), the MMO synchronization counter 60 is incremented (115) by 1. If data are present on three inputs of the MMO controller (116), the generalized synchronization flag is set (117). If data are not present on three inputs and the MMO synchronization counter 60 equals the MMO preset register 59 (118), the generalized synchronization flag is likewise set (117). If the MMO synchronization counter 60 does not equal the MMO preset register 59, control returns to waiting for a clock pulse (114) and incrementing (115) the counter 60 by 1.

The self-managed degradation procedure is as follows. Each BMi compiles a list (119) of failed BMs whose fault counter ERRi (40-43) has exceeded the limit of 8 and reaches agreement on the list (120) by running the majority voting procedure. If the agreed list is not empty (121), the BMi, in agreement with the others, selects from the list (122) the worst BM by the counters ERRi (40-43): the one whose fault counter has the maximum value or, if the values are equal, the one with the lowest number. It then prepares a new configuration: it sets the degradation mode, blocks the selected BMi in agreement with the others (123) and changes the mode to operational, switches on standby power, configures the majority voting of the configuration management controller, blocks all switched-off and faulty BMs, and designates a BMi as master on the MKO in accordance with the configuration. If the selected BMi has been blocked (124), 1 is added (125) to the blocking counter (48-51) of the blocked BMi, a blocking notification is prepared for the software (126), and degradation ends. If the selected BMi has not been blocked (124), a degradation notification is prepared for the software (127), and degradation ends. If the agreed list is empty (121), degradation ends.
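The selection rule in the degradation procedure (highest fault counter above the limit, lowest number on a tie) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `select_worst` and the dictionary layout are hypothetical, and the agreement step between computers is left out.

```python
from typing import Dict, Optional

LIMIT = 8  # fault-counter limit named in the text


def select_worst(err: Dict[int, int]) -> Optional[int]:
    """Return the number of the computer to block, or None if none has failed.

    err maps a computer number to its fault counter ERRi (hypothetical layout).
    """
    failed = [n for n, count in err.items() if count > LIMIT]
    if not failed:
        return None  # empty list: degradation ends without blocking anything
    # Highest counter wins; on a tie the lowest computer number is chosen.
    return max(failed, key=lambda n: (err[n], -n))


worst = select_worst({1: 3, 2: 12, 3: 12, 4: 0})
print(worst)
```

Computers 2 and 3 have equal counters here, so the lower number is selected. A counter equal to 8 does not qualify, because the text requires the limit to be exceeded.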

The self-reconfiguration procedure is as follows. If more than 10 s has been allocated for reconfiguration (128), all surplus BMs (standby and faulty BMs) are switched off (129). If less than 60 s has been allocated for reconfiguration (130), the procedure ends (134). If more has been allocated and the current configuration has three BMs (131), the procedure ends (134). If the current configuration has fewer than three BMs (131), the BM to be switched on is selected (132) by the minimum value of the counter BLKi (48-51). If a suitable BM is found (133), that is, one that is switched off and healthy, or switched off and has the minimum BLKi value, the BMi is switched on (135), the necessary information is transferred (136) to the RAM of the BMi, a synchronous and coordinated software restart with the rewrite flag is performed (137), and the procedure ends (134). If no suitable BM is found, the procedure ends (134).
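The choice of the computer to switch on can be sketched as follows. This is a minimal sketch, not the patented implementation: the `Machine` record and the function name are hypothetical, the 60 s threshold is treated as inclusive, and breaking ties between equal BLKi values by the lower number is an assumption the text does not state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Machine:
    number: int
    powered: bool
    blk: int  # blocking counter BLKi


def pick_machine_to_enable(machines: List[Machine], time_budget_s: float) -> Optional[int]:
    """Return the number of the computer to switch on, or None to end the procedure."""
    active = [m for m in machines if m.powered]
    if time_budget_s < 60 or len(active) >= 3:
        return None  # too little time, or the configuration already has three computers
    candidates = [m for m in machines if not m.powered]
    if not candidates:
        return None  # no suitable computer found
    # Minimum BLKi wins; the tie-break by number is an assumption.
    return min(candidates, key=lambda m: (m.blk, m.number)).number


machines = [Machine(1, True, 0), Machine(2, True, 1), Machine(3, False, 2), Machine(4, False, 1)]
choice = pick_machine_to_enable(machines, time_budget_s=120)
print(choice)
```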

The method of providing tolerance to transient faults and failures in computer systems, based on task replication, self-reconfiguration and self-management of degradation, is as follows.

The fault-tolerant computer system includes three BMs in the working configuration and one BM in cold standby. Fig. 1 shows the block diagram of the fault-tolerant computer system with a hardware and software implementation of the fault-tolerance and dynamic reconfiguration functions.

Each BMi of the working configuration solves the same target software task. Using the fault-tolerance software (POO), it sends its results to the other BMs of the working configuration, receives their copies of the results, and then determines the correct result from its own and the received copies. In a two-computer working configuration, the correct result is determined from two results. If that is not possible (the faulty BM cannot be identified from the results), then, depending on the time available, either the software task is solved again, or the system proceeds to intermediate testing of the channels, followed by a repeated solution if the test passes or by a transition to a single-channel configuration based on the healthy channel of the system. If the faulty BMi cannot be identified, the system proceeds to a safe shutdown. Fig. 2 shows the state diagram of degradation and reconfiguration.

Fig. 3 shows the algorithm for providing tolerance to transient faults and failures of the computer system, based on self-reconfiguration and self-management of degradation.

At the initial check stage, the operable elements of the system are determined, the initial working configuration of the system is selected and set, the technical state table (TTS) is formed (62), and the mechanisms that provide tolerance to transient faults and failures are initialized (62). If all four computers are operable (63), the fourth computer is switched off (64).

At the second stage, the tasks of the system's application software (PPO) are executed (65). All actions are accompanied by continuous, end-to-end functional diagnostics. When transient faults occur, the computing process is recovered (66). When a failure is identified, self-management of degradation (67) and self-reconfiguration (68) of the system are performed, which preserves the specified level of redundancy, and the computing process is recovered.

Fig. 4 shows the algorithm for configuring the inter-computer exchange controller (KMO).

The KMO is configured as follows:

• in the single-computer configuration (69), the inputs of the majority voting element are connected (73) as follows: "a" to the output of the own BM, "b" to the output of the own BM, "c" to the output of the own BM; input "a" is enabled to take part in synchronization;

• in the two-computer configuration (70), the inputs of the majority voting element are connected (74) as follows: "a" to the output of the own BM, "b" to the output of the other BM, "c" to the output of the "better" of the two BMs (so the voting result is the data from the "better" BM), namely the one for which the sum of the program fault counter (40, 41, 42, 43) and the program fault-seconds counter (44, 45, 46, 47) is minimal; inputs "a" and "b" are enabled to take part in synchronization;

• in the three-computer configuration (71), the inputs of the majority voting element are connected (75) as follows: "a" to the output of the own BM, "b" to the output of the second BM, "c" to the output of the third BM; inputs "a", "b" and "c" are enabled to take part in synchronization;

• in the four-computer configuration (72), the inputs of the majority voting element are connected (76) as follows: "a" to the output of the own BM, "b" to the output of the second BM, "c" to the output of the third BM; inputs "a", "b", "c" and "d" are enabled to take part in synchronization, where "d" is the output of the fourth BM.
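The four cases above amount to a small lookup from configuration size to voter wiring. The sketch below is illustrative only: `voter_inputs`, the `penalty` dictionary (standing for the sum of the fault counter and the fault-seconds counter of each computer) and the returned layout are all hypothetical names, and preferring the own computer when the two sums are equal is an assumption.

```python
from typing import Dict, List


def voter_inputs(own: int, others: List[int], penalty: Dict[int, int]) -> dict:
    """Return which computer feeds inputs a, b, c and which inputs take part in sync."""
    size = 1 + len(others)
    if size == 1:
        return {"inputs": (own, own, own), "sync": ("a",)}
    if size == 2:
        other = others[0]
        # Input c repeats the "better" computer, so the vote yields its data.
        best = min((own, other), key=lambda n: penalty[n])
        return {"inputs": (own, other, best), "sync": ("a", "b")}
    if size == 3:
        return {"inputs": (own, others[0], others[1]), "sync": ("a", "b", "c")}
    if size == 4:
        # Only three voter inputs exist; the fourth computer joins synchronization only.
        return {"inputs": (own, others[0], others[1]), "sync": ("a", "b", "c", "d")}
    raise ValueError("the system has at most four computers")


config = voter_inputs(own=1, others=[2], penalty={1: 5, 2: 3})
print(config)
```

In the two-computer example, computer 2 has the smaller sum, so it is wired to both "b" and "c" and outvotes computer 1 whenever the two disagree.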

Operation of the BMs that is tolerant to transient faults and failures rests on two principles: synchronized operation of all BMs in the working configuration, and consistency of the decisions taken by the different BMs, which in turn relies on consistency of the input data.

Synchronization of the software of the individual BMs in the system, and the necessary data agreement, are carried out by the common software (OPO) at key points in the computing process, called synchronization points.

In particular, the following are provided:

• consistent maintenance of system time in the different BMi;

• coordinated task start (at any given moment every BMi executes the same task);

• coordinated exchanges over the multiplex exchange channel (MKO);

• data agreement at the request of the software.

Two types of synchronization are provided:

• synchronization based on a single system time;

• synchronization in the absence of a time stamp (MV).

Fig. 5 shows the operating algorithm of the time reference system. Fig. 6 shows the binding to the external time stamp.

The timer Tmrgzp 54 works as follows: a trigger interval of 1,000,000 μs is written (77) by software to the preset register 55 (RPgzp), and the timer 54 is started in cyclic mode. The interval is copied (78) to the counter RTgzp 56, which counts down (80). Tmrgzp 54 fires (81) when the counter RTgzp 56 reaches zero, generating an interrupt request (82); the interval from RPgzp 55 is copied (78) into RTgzp 56 again, and the process repeats.
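The behaviour of such a cyclic down-counting timer can be modelled in software. The class below is a hypothetical illustration (the name `ReloadTimer` and its fields are not from the source) and uses a preset of 5 ticks instead of 1,000,000 to keep the run short.

```python
class ReloadTimer:
    """Cyclic down-counter that reloads from its preset register on reaching zero."""

    def __init__(self, preset: int) -> None:
        self.preset = preset    # models the preset register RPgzp
        self.counter = preset   # models the counter RTgzp, loaded from the preset
        self.interrupts = 0     # number of interrupt requests generated so far

    def tick(self) -> bool:
        """Apply one clock pulse; return True if the timer fired on this pulse."""
        self.counter -= 1
        if self.counter == 0:
            self.interrupts += 1         # interrupt request
            self.counter = self.preset   # reload, then the cycle repeats
            return True
        return False


timer = ReloadTimer(preset=5)
fired = [timer.tick() for _ in range(12)]
print(timer.interrupts, timer.counter)
```

Over 12 pulses the model fires on the 5th and 10th pulse and is two pulses into the third cycle.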

The synchronization timer 52 (TmrSn) works as follows: a trigger interval of 0xFFFFFFFF is written (83) to the synchronization timer preset register RPsn 57, and the timer runs in cyclic mode. On an external time stamp MV, the contents of the preset register RPsn 57 are copied (84) to the synchronization timer counter RTsn 58, which counts down (86), and the contents of the counter 58 are copied (84) to the latch register RFx 61. When the counter RTsn 58 reaches zero (87), flags are set (88) indicating that the synchronization timer 52 has fired and that an external time stamp is present; the contents of the counter 58 of the synchronization timer 52 are then copied (84) to the latch register RFx 61, which makes it possible to measure the interval between two MVs.

After the counter RTgzp 56 or the counter RTsn 58 overflows, the BM calculates:

• the contents of the synchronization timer 52, by subtracting the contents of the synchronization counter 58 from the contents of the TmrSn preset register 57 (TS = RPsn - RTsn);

• the contents of the timer Tmrgzp 54, by subtracting the contents of the counter RTgzp 56 from the contents of the timer preset register RPgzp 55 (TM = RPgzp - RTgzp);

• the forecast lag (D = TS - TM);

• the number of clock ticks per second (T1 = RPsn - RFx).

It then binds the MV forecast.
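The four quantities are plain register differences, which the following sketch reproduces. The `TimerSnapshot` record, the function name and the sample register values are made up for illustration; only the four formulas come from the text.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TimerSnapshot:
    rp_sn: int   # preset register RPsn
    rt_sn: int   # counter RTsn
    rf_x: int    # latch register RFx
    rp_gzp: int  # preset register RPgzp
    rt_gzp: int  # counter RTgzp


def forecast_values(s: TimerSnapshot) -> Dict[str, int]:
    """Evaluate TS, TM, D and T1 exactly as the formulas in the text define them."""
    ts = s.rp_sn - s.rt_sn    # elapsed ticks of the synchronization timer
    tm = s.rp_gzp - s.rt_gzp  # elapsed ticks of the timer Tmrgzp
    return {"TS": ts, "TM": tm, "D": ts - tm, "T1": s.rp_sn - s.rf_x}


snapshot = TimerSnapshot(
    rp_sn=0xFFFFFFFF,
    rt_sn=0xFFFFFFFF - 250,
    rf_x=0xFFFFFFFF - 1_000_020,
    rp_gzp=1_000_000,
    rt_gzp=1_000_000 - 200,
)
values = forecast_values(snapshot)
print(values)
```

With these sample values the synchronization timer has run 250 ticks and Tmrgzp 200 ticks, so the lag D is 50 ticks, and the latched value implies 1,000,020 ticks between two time stamps.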

The second type of synchronization is synchronization in the absence of an external time stamp (MV). Fig. 7 shows the synchronization algorithm in the absence of an external time stamp. Fig. 8 shows the calculation of the forecast lag in the absence of an external MV.

System time is synchronized periodically, every second. The seconds component of the time (TS) and the microseconds component (Tmks) are agreed. If the Tmks values in different BMi are found to differ, these values are averaged and the same Tmks value is set in all BMs.

The TS values of the different BMi are majority-voted. When the time is changed, the new value is agreed, and the agreed time value is then set in all BMs.

Synchronization in the absence of an external time stamp is performed when each BMi has an interrupt request (89) from the timer Tmrgzp 54. If the number of the current second is even (90), each BMi generates a synchronization byte (91) and transmits it to the two other BMs. Each BMi checks for synchronization bytes from the other BMs and, if they are present, latches the generalized synchronization flag (92), computes the time from the timer firing to the generalized synchronization (T3Si) (93), and then sends the time value T3Si (94) to all other BMs. If the generalized synchronization flag and the time values T3S1 and T3S2 are present (95), the BMi computes the median (96) of the three time values T3S0, T3S1 and T3S2, performs the majority voting procedure (97) on the agreed time stamps, computes a correction (98) for the timer Tmrgzp 54, and adjusts (99) the state of the timer Tmrgzp 54 by the computed correction.
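The median step can be illustrated briefly. The text does not give the formula for the correction, so the sketch assumes it is the signed difference between the agreed median and the computer's own delay; that rule, the function names and the sample delays are all hypothetical.

```python
import statistics
from typing import Sequence


def agreed_delay(delays: Sequence[int]) -> int:
    """Median of the three delay values T3S0, T3S1, T3S2."""
    if len(delays) != 3:
        raise ValueError("expected exactly three delay values")
    return statistics.median(delays)


def timer_correction(own_delay: int, agreed: int) -> int:
    # Assumption: the correction is the signed distance from the own value to the median.
    return agreed - own_delay


delays = [120, 100, 135]  # made-up T3S0, T3S1, T3S2 in timer ticks
agreed = agreed_delay(delays)
corrections = [timer_correction(d, agreed) for d in delays]
print(agreed, corrections)
```

Taking the median rather than the mean keeps a single wildly wrong delay from shifting the agreed value, which suits a system that must tolerate one faulty computer.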

Fig. 9 shows the algorithm of the majority voting procedure.

The majority voting procedure is as follows. Each BMi forms a flag indicating the need to switch to degradation mode (100) if the contents of at least one fault counter ERRi exceed 8, appends (101) this flag to the value it submits for agreement, and sends the result to all other BMs. After the exchange is complete (102) and the generalized synchronization flag is present, the BMi performs majority voting (103) on its own copy and the two received copies using the formula M = (A & B) | (A & C) | (B & C), where A is its own value, and then compares the copies B and C and its own value A with the voting result (105, 106, 104). If the copies and the own value are equal to the voting result and M carries the degradation flag (107), the self-managed degradation procedure (112) is performed and the agreed value M is returned (108). If the copy from BM2 is not equal to M (105), the fault counter ERR BM2 is incremented by one (110); if the copy from BM3 is not equal to M (106), the fault counter ERR BM3 is incremented by one (111); if the own value from BM1 is not equal to M (104), the fault counter ERR BM1 is incremented by one (109). If M carries no degradation flag (107), the agreed value M is returned (108).
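The voting formula and the counter updates translate directly into code. In this sketch the function names and the list used for the fault counters are illustrative; the formula itself is the one given in the text, and the degradation flag handling is omitted.

```python
from typing import List, Sequence


def majority(a: int, b: int, c: int) -> int:
    """Bitwise two-out-of-three vote: M = (A & B) | (A & C) | (B & C)."""
    return (a & b) | (a & c) | (b & c)


def vote_and_count(copies: Sequence[int], err: List[int]) -> int:
    """Vote on three copies and increment the fault counter of each dissenting source."""
    a, b, c = copies
    m = majority(a, b, c)
    for index, copy in enumerate(copies):
        if copy != m:
            err[index] += 1
    return m


err = [0, 0, 0]  # models ERR BM1, ERR BM2, ERR BM3
m = vote_and_count((0b1011, 0b1011, 0b0011), err)
print(bin(m), err)
```

Because the vote is taken bit by bit, three mutually different copies can produce a result that matches none of them; in that case all three counters are incremented.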

If the contents of the fault counter ERR BM1 40, ERR BM2 41 or ERR BM3 42 exceed eight, the corresponding BMi is blocked and the contents of the blocking counter BLK BM1 48, BLK BM2 49 or BLK BM3 50, respectively, are incremented by 1.

Fig. 10 shows the operating algorithm of the MMO synchronization timer.

The MMO synchronization timer 53 operates as follows. If data are present on two inputs of the inter-computer communication (MMO) controller (113) and a clock pulse arrives (114), the contents of the MMO synchronization counter 60 (RTmmo) are incremented (115) by 1. If data are present on three inputs of the MMO controller (116), the generalized synchronization flag is set (117). If data are not present on three inputs and the MMO synchronization counter 60 equals the MMO preset register 59 (118), the generalized synchronization flag is likewise set (117). If the MMO synchronization counter 60 does not equal the MMO preset register 59, control returns to waiting for a clock pulse (114) and incrementing the counter 60 by 1 (115).
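In effect the MMO synchronization timer is a bounded wait: it starts once two inputs have data and ends either when the third input arrives or when the counter reaches the preset. A minimal sketch of that logic follows; the function name, the per-pulse input list and the returned tuple are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def mmo_sync(inputs_ready: Iterable[int], preset: int) -> Optional[Tuple[int, str]]:
    """Return (counter value, reason) when the generalized sync flag is set, else None.

    inputs_ready holds, for each clock pulse, how many controller inputs have data.
    """
    counter = 0  # models the MMO synchronization counter
    for ready in inputs_ready:
        if ready < 2:
            continue  # fewer than two inputs: the timer has not started
        counter += 1
        if ready >= 3:
            return counter, "all_inputs"  # data on three inputs
        if counter == preset:
            return counter, "timeout"     # counter equals the preset register
    return None


print(mmo_sync([1, 2, 2, 3], preset=10))
print(mmo_sync([2, 2, 2, 2], preset=3))
```

The timeout branch is what lets two healthy computers proceed when the third never delivers its data.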

The time value is also agreed upon when the software requests to read or change the time.

When the time is read, the software receives an agreed (identical) time value in all BMs.

When the time is changed, the new value is first agreed upon, and the agreed time value is then set in all BMs.

Synchronous execution of tasks in all BMs is ensured by agreeing on the start vector and the identifier when a task is created, and by agreeing on the identifier at every start and every suspension of the task.

When a software request for an exchange over the MKO is processed, the information to be transmitted over the MKO is prepared identically in all BMs.

The actual transmission over the MKO is performed by only one BM, the master BM (ВМ_вдщ).

Before the transmission is started, the information prepared for transmission is agreed upon; after the exchange is completed in the master BM, the exchange results and the information received from the MKO are distributed from the master BM to the other BMs.

The software task that initiated the exchange must receive identical results in all BMs.

The master BM on the MKO is assigned when the OPO is initialized: the BM with the lowest number is selected as the master BM.

Software can change the number of the master BM by means of a dedicated function of the on-board operating system (BOS).

The mechanisms for recovering the computing process after transient faults and failures are intended to isolate faulty elements of the working configuration and the computing processes they execute, so that errors and corruption do not propagate to other elements and computing processes of the system; to restore interrupted computing processes; to reconfigure the computing resources in order to isolate failed elements and replace them with spares; and to check the operability of the connected spare elements, supplying them with the necessary information and bringing them into joint operation within the working configuration.

The algorithm for self-management of degradation is shown in FIG. 11.

The self-management of degradation procedure is as follows. Each BMi compiles a list 119 of failed BMs, that is, those whose failure counter ERRi has exceeded the limit of 8, and agrees on the list 120 by performing the majority-voting procedure. If the agreed list is not empty 121, the BMi selects from the list 122, in agreement with the other BMs, the worst BM, that is, the one whose failure counter ERRi (40-43) has the maximum value; if two BMs have equal counter values, the BM with the lowest number is selected. The BMi then prepares a new configuration by setting the degradation mode, blocks the selected BMi in agreement with the other BMs 123 and changes the mode to working, switches on standby power, configures the majority voting of the configuration control controller, blocks all switched-off and faulty BMs, and assigns a BMi as master on the MKO in accordance with the configuration. If the selected BMi has been blocked 124, 1 is added to the content of the blocking counter of the blocked BMi 125 and a blocking notification 126 is prepared for the software, which ends the degradation procedure. If the selected BM has not been blocked 124, a degradation notification 127 is prepared for the software, which ends the degradation procedure. If the agreed list is empty 121, the degradation procedure ends immediately.
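Steps 119 to 122 can be illustrated by the following minimal sketch. It assumes, purely for illustration, that each list of failed BMs is held as a bitmask and that the lists are agreed with the bitwise two-out-of-three formula given in claim 4; the function names and the dictionary of counters are hypothetical and do not come from the patent.

```python
from typing import Dict, Optional

ERR_LIMIT = 8  # limit on the failure counter ERRi stated in the description


def failed_mask(err: Dict[int, int]) -> int:
    """Step 119: bitmask of BMs whose failure counter exceeds the limit."""
    mask = 0
    for number, count in err.items():
        if count > ERR_LIMIT:
            mask |= 1 << number
    return mask


def majority(a: int, b: int, c: int) -> int:
    """Step 120: bitwise two-out-of-three vote, M = (A & B) | (A & C) | (B & C)."""
    return (a & b) | (a & c) | (b & c)


def select_worst(err: Dict[int, int], agreed_mask: int) -> Optional[int]:
    """Steps 121-122: pick the BM with the largest ERRi; ties go to the lowest number."""
    candidates = [n for n in err if (agreed_mask >> n) & 1]
    if not candidates:
        return None  # agreed list is empty: the procedure ends
    return min(candidates, key=lambda n: (-err[n], n))


if __name__ == "__main__":
    err = {0: 3, 1: 12, 2: 12, 3: 9}          # hypothetical counter values
    own = failed_mask(err)                     # this BM's list
    agreed = majority(own, own, 0b0110)        # one received copy omits BM 3
    print(bin(agreed), select_worst(err, agreed))
```

Because every BM applies the same deterministic rule to the same agreed list, all healthy BMs arrive at the same choice without further exchange.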

The self-reconfiguration algorithm is shown in FIG. 12.

The self-reconfiguration procedure is as follows. If more than 10 s has been allotted for reconfiguration (check 128), all superfluous BMs (spare and faulty BMs) are switched off 129. If less than 60 s has been allotted for reconfiguration (check 130), the procedure is ended 134. If more time has been allotted and the current configuration contains three BMs (check 131), the procedure is ended 134. If the current configuration contains fewer than three BMs 131, a BMi to be switched on is selected 132 by the minimum value of the counter BLKi (48-51). If a suitable BMi is found 133 (one that is switched off and healthy, or switched off and has the minimum value of the counter BLKi), the BMi is switched on 135, the necessary information is transferred 136 to the RAM of the BMi, a synchronous and agreed restart 137 of the software is performed with the overwrite flag, and the procedure is ended 134. If no suitable BMi is found, the procedure is ended 134.
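The decision logic of the procedure above can be sketched as follows. This is only an illustration under stated assumptions: the `BM` record, the function `plan_reconfiguration`, the preference for a healthy BM over a faulty one, and the tie-break by lowest number are introduced here and are not specified in the description. The description also does not say what happens at exactly 60 s; the sketch treats that case as sufficient time.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class BM:
    number: int    # number of the computer
    powered: bool  # True if currently switched on
    healthy: bool  # True if not marked faulty
    blk: int       # content of its blocking counter BLKi


def plan_reconfiguration(allotted_s: float, working: int,
                         candidates: List[BM]) -> Tuple[bool, Optional[int]]:
    """Return (switch off superfluous BMs?, number of the BM to switch on or None)."""
    switch_off = allotted_s > 10               # steps 128-129
    if allotted_s < 60 or working >= 3:        # steps 130-131: end the procedure (134)
        return switch_off, None
    off = [bm for bm in candidates if not bm.powered]
    if not off:                                # step 133: no suitable BM found
        return switch_off, None
    # Step 132: minimum BLKi; healthy-first and lowest-number rules are assumptions.
    best = min(off, key=lambda bm: (not bm.healthy, bm.blk, bm.number))
    return switch_off, best.number             # steps 135-137 would follow


if __name__ == "__main__":
    spares = [BM(3, False, True, 2), BM(4, False, True, 1)]
    print(plan_reconfiguration(120, 2, spares))
    print(plan_reconfiguration(30, 2, spares))
```

The sketch stops at the selection; switching the BM on, transferring information to its RAM and the agreed software restart are hardware and system actions outside its scope.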

Information sources

[1] Mamedli E.M., Samedov R.Ya., Sobolev N.A. A method for localizing "friendly" and "hostile" faults // Technical Diagnostics. 1992. No. 5. P. 126-138.

[2] Avizienis A. Fault tolerance: a property that ensures the continuous operability of digital systems // TIIER. 1978. Vol. 66. No. 10. P. 5-25.

[3] Lala J.H., Alger L.S., Gauthier R.J., Dzwonczyk M.J. A fault tolerant processor to meet rigorous failure requirements // Proc. 7th Dig. Avionics Syst. Conf., 1986. P. 555-562.

[4] Lobanov A.V. Fault-tolerant information agreement in a four-computer system with identification of detected faults // Avtomatika i Telemekhanika (AiT). 1992. No. 2. P. 171-180.

Claims

1. A method of ensuring the fault and failure tolerance of a computer system based on task replication, self-reconfiguration and self-management of degradation, wherein the computer system includes four numbered computers BMi, where i = 1-4 is the number of the computer (BM), each of which is connected to all the other computers through its own transmitting interface device (US) with an information broadcast channel, and wherein each computer, in the operating cycle of the system, receives input information, calculates its own copy of the output information and transmits this copy to every other computer of the system, calculates the correct output information by majority voting over its own and all received copies, outputs the voting result to the external environment, and compares all available copies of the output information with the voting result, characterized in that each computer additionally includes first, second, third and fourth software failure counters (ERRi BMi), first, second, third and fourth software failure seconds counters (SECi BMi), first, second, third and fourth software blocking counters (BLKi BMi), a synchronization timer (TmrSn), an inter-computer exchange (MMO) synchronization timer (Tmr MMO), an interrupt request generation timer (Tmrgzp), a preset register RPgzp of the interrupt request generation timer, an interrupt request generation counter RTgzp, a synchronization timer preset register RPsn, a synchronization timer counter RTsn, an MMO preset register RPmmo, an MMO synchronization counter RTmmo, a latch register RFx, an inter-computer exchange controller (CMMO) and a configuration control controller (CCC), and in that there are defined the stages of system start-up, consisting of initialization of the system, the initial power-on check, formation of the table of the technical state of the system (TTC), establishment of the initial working configuration of the system, and functional operation while maintaining the specified level of redundancy, and the mechanisms for providing the fault and failure tolerance of the computer system, consisting of configuring the CMMO, agreeing on the system time, performing synchronization in the absence of an external time stamp, operating the MMO synchronization timer, detecting faults and failures by majority voting over the computer's own and all received copies of the input data, performing self-management of degradation, and performing self-reconfiguration.

2. The method according to claim 1, characterized in that the system time is agreed upon as follows: a trigger interval equal to 1,000,000 μs is written to the preset register RPgzp of the interrupt request generation timer, the mode being cyclic; the trigger interval is copied into the interrupt request generation counter (RTgzp), which counts down; Tmrgzp triggers when the interrupt request generation counter reaches zero, generating an interrupt request, whereupon the interval from RPgzp is written to RTgzp and the process repeats; if the interrupt request generation counter is not equal to 0, 1 continues to be subtracted from RTgzp on each clock pulse; a trigger interval equal to 0xFFFFFFFF is written to the preset register RPsn of the synchronization timer, the mode being cyclic; on an external time stamp (MB), the content of the preset register RPsn is written to the synchronization timer counter RTsn, which counts down, and the content of the synchronization timer counter is written to the latch register RFx; when the synchronization timer counter RTsn reaches 0, the synchronization timer triggers, and when an external time stamp is present, the content of the synchronization timer counter is written to the latch register RFx, the content of the preset register RPsn is written to the counter RTsn, and the process repeats; and if there is no external time stamp MB and the counter RTsn is not equal to 0, 1 continues to be subtracted from the counter RTsn on each clock pulse.

3. The method according to claim 1, characterized in that synchronization in the absence of an external time stamp is performed as follows: when an interrupt request from the interrupt request generation timer is present in each BMi and the current second number is even, each BMi generates a synchronization byte and transmits it to the two other BMs; each BMi checks for the presence of synchronization bytes from the other BMs and, if they are present, latches the generalized synchronization flag and calculates the time from the timer to the generalized synchronization (T3Si); each BMi then outputs its time value (T3Si); if the generalized synchronization flag is present, the BMi calculates the median of the three time values T3S0, T3S1 and T3S2, performs the majority-voting procedure on the agreed stamps, then calculates the correction for the interrupt request generation timer and adjusts the state of the interrupt request generation timer by the calculated correction value.

4. The method according to claim 1, characterized in that the majority-voting procedure is as follows: each BMi forms a flag indicating the need to switch to degradation mode if the content of at least one failure counter ERRi exceeds 8, adds the formed flag to its value to be agreed, and outputs the result to all the other BMs; after the exchange is completed, if the generalized synchronization flag is present, the BMi calculates the correct output information by majority voting over its own and the two received copies using the formula M = (A & B) | (A & C) | (B & C), where A is its own value, and then compares the available copies B and C with the voting result; if the copies and its own value are equal to the voting result and the degradation flag is present in M, it performs the self-management of degradation procedure and returns the agreed value M; if the copy from BM1 is not equal to M, the content of failure counter ERR BM1 is incremented by one; if the copy from BM2 is not equal to M, the content of failure counter ERR BM2 is incremented by one; if its own value from BM0 is not equal to M, the content of failure counter ERR BM0 is incremented by one; and if the degradation flag is not present in M, it returns the agreed value M.

5. The method according to claim 1, characterized in that the MMO synchronization timer operates as follows: if data are present on two inputs of the inter-computer exchange (MMO) controller and a clock pulse arrives, the content of the MMO synchronization counter is incremented by 1; if data are present on three inputs of the MMO controller, the generalized synchronization flag is generated; if data are not present on three inputs of the MMO controller and the MMO synchronization counter is equal to the MMO preset register, the generalized synchronization flag is generated; and if the MMO synchronization counter is not equal to the MMO preset register, the timer returns to waiting for a clock pulse and incrementing the content of the MMO synchronization counter by 1.

6. The method according to claim 1, characterized in that the self-management of degradation procedure is as follows: each BMi compiles a list of failed BMs whose failure counter ERRi has exceeded the limit of 8 and agrees on the list by performing the majority-voting procedure; if the agreed list is not empty, it selects from the list, in agreement with the other BMs and according to the counters ERRi, the worst BMi, prepares a new configuration by setting the degradation mode, blocks the selected BMi in agreement with the other BMs and changes the mode to working, switches on standby power, configures the majority voting of the configuration control controller, blocks all switched-off and faulty BMs, and assigns a BMi as master on the MKO in accordance with the configuration; if the selected BMi has been blocked, it adds 1 to the content of the blocking counter of the blocked BMi and prepares a blocking notification for the software, which ends the degradation procedure; if the selected BMi has not been blocked, it prepares a degradation notification for the software, which ends the degradation procedure; and if the agreed list is empty, the degradation procedure ends.