RU2672786C1

RU2672786C1 - Method of software verification based on the natural semantics of source code identifiers in static analysis

Info

Publication number: RU2672786C1
Application number: RU2018100942A
Authority: RU
Inventors: Роман Евгеньевич Жидков
Original assignee: Роман Евгеньевич Жидков
Priority date: 2018-01-10
Filing date: 2018-01-10
Publication date: 2018-11-19

Abstract

FIELD: computer engineering. SUBSTANCE: the invention relates to a software verification method. In the method, the semantic rules of a syntax-directed definition that create abstract syntax tree nodes are supplemented, on the basis of the dimensional homogeneity of the physical equations implemented in the expressions of the program source code, with a field for operations on the natural semantics of identifiers. The natural semantics is the dimension of the physical quantity interpreted in the identifier and is described by a one-dimensional integer array stored in the symbol table, which is initialized with the natural semantics values of the identifiers after lexical and syntactic analysis. The field is filled in automatically depending on the operation on identifiers in the node. During a bottom-up traversal of the modified abstract syntax tree, the natural semantics values are calculated in the internal nodes and the correctness conditions of the source code expressions are checked against the natural semantics values of the identifiers in the symbol table. EFFECT: automation of software verification. 1 cl, 4 dwg

Description

The invention relates to computer science, in particular to methods for verifying software developed in high-level programming languages, and can be used in static analysis tools for computer programs in which source code identifiers represent physical quantities.

Static analysis, as one of the methods of verification, reduces the cost of software verification by detecting defects at earlier stages of development (compared with testing) and by precisely localizing faults in the program code. Static analysis tools implement the first stage of a compiler (analysis) with more elaborate semantic rules. The analysis is divided into phases, each of which converts one representation of the source program into another. The first phase is lexical analysis, which reads the input sequence of program characters, identifies the lexemes of the programming language, and forms tokens. A token is a structure consisting of a token name and a reference to the symbol table, which accumulates information about the lexemes detected during analysis. The second phase is syntactic analysis (parsing), which adds information to the symbol table and converts the token stream from the lexical analyzer into a tree structure in accordance with the grammar of the programming language. A typical representation of the source program after parsing is an abstract syntax tree (AST), in which parent nodes correspond to operations and child nodes to their arguments. The AST and the information in the symbol table are used in the third phase, semantic analysis, to check the source program for semantic consistency with the definition of the programming language and to search for defects. Static analysis can detect lexical, syntactic, and semantic defects caused by violations of the programming language specification or of the recommendations of industry standards.
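The first phase can be illustrated with a minimal sketch. It assumes a toy token set (identifiers, integer constants, single-character operators); the names `tokenize`, `TOKEN_RE`, and `symbol_table` are hypothetical and are not taken from the patent.

```python
import re

# Hypothetical token pattern: identifiers, integer constants, one-character operators.
TOKEN_RE = re.compile(r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<op>[=+\-*/()]))")

def tokenize(source):
    """Return (tokens, symbol_table); an 'id' token refers to a symbol-table index."""
    tokens, symbol_table, index_of = [], [], {}
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        pos = match.end()
        if match.group("id"):
            name = match.group("id")
            if name not in index_of:
                # First occurrence: create a symbol-table entry for the identifier.
                index_of[name] = len(symbol_table)
                symbol_table.append({"name": name})
            tokens.append(("id", index_of[name]))
        elif match.group("num"):
            tokens.append(("num", int(match.group("num"))))
        else:
            tokens.append((match.group("op"), None))
    return tokens, symbol_table

tokens, symbol_table = tokenize("x = x0 + v*t + a*t*t/2")
```

Each identifier token carries only an index into the symbol table, so later phases can attach further attributes to the entry without touching the token stream.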

To reduce the cost of software verification, it is advisable to make static analysis more effective by detecting defects that violate the logic of the program's computations. To this end, a software verification method is proposed that is based on checking the natural semantics (NS) of the identifiers in the program source code. Here, semantics means the dimension of the physical quantity represented by a source code identifier.

A known method (Patent No. 2373570, "Method for verifying the software of distributed computing complexes and a system for its implementation") determines points and regions of vulnerability in the source code of software for distributed computing complexes (for example, buffer overflows) by automatically composing and solving the corresponding systems of equations, based on an internal representation of the source code stored as databases and knowledge bases.

A known method (Patent No. 2515684, "Method for parsing a programming language with an extensible grammar") solves the problem of dynamically modifying the compilation tables of an LR parser by means of grammar extension directives, specified separately for each nesting level of the programming language's grammar rules, that introduce new grammatical constructs. The method is aimed at creating a tool that lets a universal programming language with an extensible grammar be used to solve any specific task.

A known method (Patent No. 2103728, "Method for converting the input program of a translator and a device for its implementation") provides fast access to the current values of identifiers in the translation tree by storing pointers into the identifier table in the syntax tree instead of storing names.

A known method (Patent No. 2115158, "Method and device for reliable evaluation of semantic features in parsing during a forward left-to-right pass") performs a particular kind of semantic analysis (checking semantic features) while the parser is running, by modifying the format of the parse tree nodes and refining the actions associated with the grammar production rules. This parsing method checks the semantic features tied to the requirements of the programming language specification and verifies the syntactic correctness of source code constructs in a single pass, simultaneously with building the parse tree, which reduces the time required to compile the program.

The closest technical solution, taken as the prototype, is the known method of constructing syntax trees (Aho, Alfred V., Lam, Monica S., Sethi, Ravi, Ullman, Jeffrey D. Compilers: Principles, Techniques, and Tools, 2nd ed.: transl. from English. - Moscow: I.D. Williams, 2008. - 1184 p.: ill. - Parallel title in English, sect. 5.3.1), which uses a syntax-directed definition to form an internal representation of the program as an AST. To this end, each production of the source grammar is assigned a semantic rule that creates an AST node object with the corresponding number of fields.

The aim of the present invention is to increase the effectiveness of static analysis by expanding the set of detectable defect types, thereby reducing the total cost of software verification.

This technical result is achieved as follows. In the syntax-directed definition (SDD) of the method, which builds on the method of constructing syntax trees, the semantic rules are supplemented with a field for operations on the NS of identifiers. This makes it possible to build a modified AST during parsing. During a bottom-up traversal of that tree, the NS of the identifiers, stored in the extended symbol table as a one-dimensional integer array of nine elements, is checked to remain unchanged.

To build the modified AST, the productions of the context-free grammar of the programming language are associated in the SDD with program constructs (for example, in an object-oriented style) that describe the creation of AST nodes. The structure of an internal node is supplemented with a field for the operation on the NS of source code identifiers (op_NS), which is filled in automatically depending on the operation on identifiers in that node (op_I) and on the basis of the dimensional homogeneity of physical equations; the correspondence table is shown in FIG. 1.

The constructor that creates an object of the extended internal AST node has the form Node(op_I, op_NS, c₁, …, c_k), where op_I is the node label and the operation on identifiers; op_NS is the operation on the NS of identifiers; c₁, …, c_k are k additional fields for references to child objects.

The constructor that creates a leaf node has the form Leaf(op, val), where op is the node label and val is the lexical value, which is either a reference to the symbol table (for identifiers) or a constant.

When the SDD productions are applied during parsing, the node objects of the modified AST are created by the constructors described above.
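The two constructors can be sketched in Python as follows. The mapping `OP_NS_FOR` is an assumption standing in for the table of FIG. 1, which is not reproduced in this text: it relies only on the fact that multiplying quantities adds their dimension exponents, dividing subtracts them, and addition, subtraction, and assignment require equal dimensions. The helper `make_node` and the use of identifier names as symbol-table references are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Assumed stand-in for FIG. 1: operation on identifiers (op_I) -> operation on NS (op_NS).
OP_NS_FOR = {"*": "+", "/": "-", "+": "check", "-": "check", "=": "check"}

@dataclass
class Leaf:
    op: str                      # node label: "id" or "num"
    val: Union[str, int, float]  # symbol-table reference (here: a name) or a constant

@dataclass
class Node:
    op_i: str                    # node label and operation on identifiers
    op_ns: str                   # operation on the natural semantics
    children: Tuple[Union["Node", Leaf], ...]  # references to child objects

def make_node(op_i, *children):
    """Create an interior node, filling op_ns automatically from op_i."""
    return Node(op_i, OP_NS_FOR[op_i], children)

# The subexpression v * t: op_ns is derived, not supplied by the caller.
product = make_node("*", Leaf("id", "v"), Leaf("id", "t"))
```

Deriving `op_ns` inside the factory mirrors the statement that the field is filled in automatically from the operation on identifiers.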

The NS of source code identifiers (the dimension of a physical quantity) cannot be derived by analyzing the source program and must be assigned manually. It is proposed to initialize the symbol table with NS values for the detected identifiers after lexical and syntactic analysis. To this end, the symbol table is extended with a field that stores a one-dimensional integer array of nine elements (int natSem[9]), which can describe the dimension of any physical quantity represented by a source code identifier. The dimension formula of an arbitrary physical quantity (A) is a product of factors in which e_j is the exponent of the j-th factor. The array elements are these exponents, natSem[j - 1] = e_j; their assignment to the components of the dimension formula is shown in FIG. 2. Identifiers that have no NS correspond to an array of zeros.
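As a sketch, the dimension formula can be written in the following general form. The symbol D_j for the j-th factor is an assumption introduced here for illustration; which base dimension occupies which index is defined by FIG. 2, not by this sketch.

```latex
\dim A = \prod_{j=1}^{9} D_j^{\,e_j}, \qquad \mathrm{natSem}[j-1] = e_j
```

For example, the array {1,0,-1,0,0,0,0,0,0} used for a velocity in the worked example corresponds to e_1 = 1, e_3 = -1, and all other exponents equal to zero, that is, \dim v = D_1 D_3^{-1}.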

To carry out the verification itself, a bottom-up traversal of the modified AST is performed, that is, a traversal that computes the NS value of a parent node after the values of its child nodes have been computed. The NS values of the AST leaves are stored in the symbol table and can be obtained with a function that accesses the field in the symbol table through the val reference of the leaf object (getNaturalSemantics(val)). During the traversal, depending on the operation op_NS in the node, one of two actions is taken: computing the resulting NS value, or checking the correctness condition of the source code expression (FIG. 1).

The NS value of a parent AST node is computed by adding or subtracting (op_NS) the NS values of the child nodes labeled with source code identifiers. NS values of identifiers are not added to or subtracted from numbers (num); in that case the resulting NS value equals the NS value of the identifier in the child node. Because the NS is represented as an array, the operation (op_NS) is applied to the operands element by element, and the result is an array of the same size.
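The element-wise calculation can be sketched as below. The function name `combine_ns` and the convention of passing `None` for a numeric constant are assumptions made for this illustration.

```python
def combine_ns(op_ns, left, right):
    """Apply op_ns ('+' or '-') element by element to two NS arrays.

    A numeric constant is passed as None (an assumed convention): it has no
    natural semantics, so the result is the array of the other operand.
    """
    if left is None:
        return None if right is None else list(right)
    if right is None:
        return list(left)
    if len(left) != len(right):
        raise ValueError("NS arrays must have the same size")
    if op_ns == "+":
        return [a + b for a, b in zip(left, right)]
    if op_ns == "-":
        return [a - b for a, b in zip(left, right)]
    raise ValueError(f"unsupported NS operation: {op_ns!r}")

t = [0, 0, 1, 0, 0, 0, 0, 0, 0]   # time
a = [1, 0, -2, 0, 0, 0, 0, 0, 0]  # acceleration
t_squared = combine_ns("+", t, t)           # t * t
a_t_squared = combine_ns("+", a, t_squared) # a * t * t
```

Here `t_squared` is [0, 0, 2, 0, 0, 0, 0, 0, 0] and `a_t_squared` is [1, 0, 0, 0, 0, 0, 0, 0, 0], the dimension of a length.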

The correctness condition of source code expressions is checked in order to detect violations of the principle of dimensional homogeneity of the physical equations expressed in source code constructs, based on the NS values of the identifiers in the symbol table. A failed correctness condition indicates a logical defect in the program.
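One way to express such a check is sketched below. It assumes that the condition amounts to both operands of an addition, subtraction, or assignment having identical NS arrays; the exact condition used by the method is defined in FIG. 1, and the function names are hypothetical.

```python
def dimension_mismatch(left, right):
    """Element-wise difference of two NS arrays; all zeros means they agree."""
    if len(left) != len(right):
        raise ValueError("NS arrays must have the same size")
    return [a - b for a, b in zip(left, right)]

def is_homogeneous(left, right):
    """True when both operands have the same dimension."""
    return not any(dimension_mismatch(left, right))

length = [1, 0, 0, 0, 0, 0, 0, 0, 0]
length_times_time = [1, 0, 1, 0, 0, 0, 0, 0, 0]

ok = is_homogeneous(length, length)                 # adding two lengths
defect = not is_homogeneous(length, length_times_time)  # adding length to length*time
```

Keeping the raw difference available, rather than only a boolean, lets a tool report which exponent disagrees.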

As evidence that the claimed invention can be implemented and achieves the technical result stated above, consider the program construct for computing the coordinate under uniformly accelerated motion: x = x0+v*t+a*t*t/2.

The natSem arrays stored in the symbol table for the identifiers of this expression are: x - {1,0,0,0,0,0,0,0,0}; x0 - {1,0,0,0,0,0,0,0,0}; v - {1,0,-1,0,0,0,0,0,0}; t - {0,0,1,0,0,0,0,0,0}; a - {1,0,-2,0,0,0,0,0,0}.
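A minimal sketch of such an extended symbol table is shown below. The dictionary layout, the helper `new_entry`, the snake_case accessor `get_natural_semantics`, and the extra identifier `i` are assumptions for illustration; only the array values for x, x0, v, t, and a come from the description.

```python
NS_SIZE = 9  # number of exponents, as in `int natSem[9]`

def new_entry(nat_sem=None):
    """Create a symbol-table entry; identifiers without NS get an array of zeros."""
    values = [0] * NS_SIZE if nat_sem is None else list(nat_sem)
    if len(values) != NS_SIZE or not all(isinstance(v, int) for v in values):
        raise ValueError(f"natSem must hold exactly {NS_SIZE} integers")
    return {"natSem": values}

# NS values are assigned manually after lexical and syntactic analysis.
symbol_table = {
    "x":  new_entry([1, 0, 0, 0, 0, 0, 0, 0, 0]),
    "x0": new_entry([1, 0, 0, 0, 0, 0, 0, 0, 0]),
    "v":  new_entry([1, 0, -1, 0, 0, 0, 0, 0, 0]),
    "t":  new_entry([0, 0, 1, 0, 0, 0, 0, 0, 0]),
    "a":  new_entry([1, 0, -2, 0, 0, 0, 0, 0, 0]),
    "i":  new_entry(),  # hypothetical identifier with no physical meaning
}

def get_natural_semantics(val):
    """Return a copy of the natSem array for the symbol-table reference val."""
    return list(symbol_table[val]["natSem"])
```

Returning a copy keeps a traversal from accidentally modifying the stored values while it combines arrays.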

The SDD for parsing the source code and building the modified AST of this expression is shown in FIG. 3, where E, T, F are nonterminal symbols of the grammar, and num and id are grammar terminals representing numbers and identifiers, respectively. The AST for this expression is shown in FIG. 4.

The steps of the AST traversal that compute the NS and check the correctness conditions of the source code expressions are:

sem₁ = getNaturalSemantics(t); /* {0,0,1,0,0,0,0,0,0} */

sem₂ = num.val;

sem₃ = sem₁; /* {0,0,1,0,0,0,0,0,0} */

sem₄ = getNaturalSemantics(t); /* {0,0,1,0,0,0,0,0,0} */

sem₅ = sem₄ + sem₃; /* {0,0,2,0,0,0,0,0,0} */

sem₆ = getNaturalSemantics(a); /* {1,0,-2,0,0,0,0,0,0} */

sem₇ = sem₆ + sem₅; /* {1,0,0,0,0,0,0,0,0} */

sem₈ = getNaturalSemantics(v); /* {1,0,-1,0,0,0,0,0,0} */

sem₉ = getNaturalSemantics(t); /* {0,0,1,0,0,0,0,0,0} */

sem₁₀ = sem₉ + sem₈; /* {1,0,0,0,0,0,0,0,0} */

sem₁₁ = sem₁₀; /* {1,0,0,0,0,0,0,0,0}, T_ПО = 0 */

sem₁₂ = getNaturalSemantics(x0); /* {1,0,0,0,0,0,0,0,0} */

sem₁₃ = sem₁₂; /* {1,0,0,0,0,0,0,0,0}, T_ПО = 0 */

sem₁₄ = getNaturalSemantics(x); /* {1,0,0,0,0,0,0,0,0} */

sem₁₅ = sem₁₄; /* {1,0,0,0,0,0,0,0,0}, T_ПО = 0 */.

The resulting NS value of the expression is sem₁₅, and the correctness conditions are satisfied (T_ПО = 0).
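The traversal above can be reproduced with a short sketch. The tree is written as nested tuples whose shape is inferred from the order of the listed steps, since FIG. 4 is not reproduced in this text; the tuple encoding, the `NS` dictionary, and the function `evaluate` are assumptions for illustration.

```python
NS = {  # natSem arrays from the symbol table of the worked example
    "x":  [1, 0, 0, 0, 0, 0, 0, 0, 0],
    "x0": [1, 0, 0, 0, 0, 0, 0, 0, 0],
    "v":  [1, 0, -1, 0, 0, 0, 0, 0, 0],
    "t":  [0, 0, 1, 0, 0, 0, 0, 0, 0],
    "a":  [1, 0, -2, 0, 0, 0, 0, 0, 0],
}

def evaluate(node, defects):
    """Bottom-up walk: return the node's NS array (None for numeric constants)."""
    if not isinstance(node, tuple):
        return list(NS[node]) if isinstance(node, str) else None
    op, left_node, right_node = node
    left = evaluate(left_node, defects)
    right = evaluate(right_node, defects)
    if left is None or right is None:
        # Operations with a number keep the NS of the identifier operand.
        return left if right is None else right
    if op == "*":
        return [p + q for p, q in zip(left, right)]
    if op == "/":
        return [p - q for p, q in zip(left, right)]
    # '+', '-', '=': operands must have identical dimensions.
    if left != right:
        defects.append(node)
    return left

# x = x0 + (v*t + a*(t*(t/2)))
expr = ("=", "x", ("+", "x0", ("+", ("*", "v", "t"),
                               ("*", "a", ("*", "t", ("/", "t", 2))))))
defects = []
result = evaluate(expr, defects)
```

For this expression `result` equals the NS of a length and `defects` stays empty, which matches the outcome stated above.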

If the identifier a in this example is replaced with v, the expression becomes x = x0+v*t+v*t*t/2. The traversal steps of the modified AST whose values change are:

sem₆ = getNaturalSemantics(v); /* {1,0,-1,0,0,0,0,0,0} */

sem₇ = sem₆ + sem₅; /* {1,0,1,0,0,0,0,0,0} */

sem₁₁ = sem₁₀; /* {1,0,0,0,0,0,0,0,0}, T_ПО ≠ 0 */.

The correctness condition at step 11 is not satisfied (T_ПО ≠ 0); therefore there is an NS defect in the node being checked or in its child nodes.
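A sketch of how a tool might report this defect is given below. The nested-tuple tree, the helper `render`, the function `check`, and the message format are all assumptions for illustration; the tree shape is inferred from the traversal steps rather than taken from a figure.

```python
NS = {  # natSem arrays for the identifiers of the faulty expression
    "x":  [1, 0, 0, 0, 0, 0, 0, 0, 0],
    "x0": [1, 0, 0, 0, 0, 0, 0, 0, 0],
    "v":  [1, 0, -1, 0, 0, 0, 0, 0, 0],
    "t":  [0, 0, 1, 0, 0, 0, 0, 0, 0],
}

def render(node):
    """Turn a subtree back into text so a report can point at it."""
    if not isinstance(node, tuple):
        return str(node)
    op, left, right = node
    return f"({render(left)} {op} {render(right)})"

def check(node, report):
    """Bottom-up walk that records every dimensionally inconsistent node."""
    if not isinstance(node, tuple):
        return NS.get(node)  # None for numeric constants
    op, left_node, right_node = node
    left, right = check(left_node, report), check(right_node, report)
    if left is None or right is None:
        return left if right is None else right
    if op in ("*", "/"):
        sign = 1 if op == "*" else -1
        return [p + sign * q for p, q in zip(left, right)]
    if left != right:
        report.append(f"dimension mismatch in {render(node)}: {left} vs {right}")
    return left

# x = x0 + (v*t + v*(t*(t/2))), with a mistakenly replaced by v
faulty = ("=", "x", ("+", "x0", ("+", ("*", "v", "t"),
                                 ("*", "v", ("*", "t", ("/", "t", 2))))))
report = []
check(faulty, report)
```

With these inputs the report contains a single entry, for the node that adds v*t to v*t*t/2. As the description notes, the defect may lie in that node or in any of its child nodes.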

Thus, the present invention demonstrates that a type of defect new to static analysis can be detected, namely defects in the NS of program source code identifiers, which makes static analysis more effective and reduces the total cost of software verification.

Claims

A method of software verification consisting in static analysis of the program source code, characterized in that the semantic rules of a syntax-directed definition for creating abstract syntax tree nodes are supplemented, on the basis of the dimensional homogeneity of the physical equations implemented in the expressions of the program source code, with a field for operations on the natural semantics of identifiers, the natural semantics being represented as the dimension of the physical quantity interpreted in the identifier and described by a one-dimensional integer array of nine elements stored in the symbol table, which is initialized with the natural semantics values of the identifiers after lexical and syntactic analysis, wherein the said field is filled in automatically depending on the operation on identifiers in the given node and on the basis of the dimensional homogeneity of physical equations, and wherein, during a bottom-up traversal of the modified abstract syntax tree, the natural semantics values are calculated in the internal nodes and the fulfillment of the correctness conditions of the program source code expressions is checked on the basis of the natural semantics values of the identifiers in the symbol table.