CN115562645A - Configuration fault prediction method based on program semantics - Google Patents

Configuration fault prediction method based on program semantics

Info

Publication number
CN115562645A
Authority
CN
China
Prior art keywords
configuration
log information
node
configuration parameter
constraint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211200856.4A
Other languages
Chinese (zh)
Other versions
CN115562645B (en)
Inventor
李姗姗
周书林
郑思
董威
贾周阳
陈振邦
陈立前
张元良
王腾
廖湘科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202211200856.4A priority Critical patent/CN115562645B/en
Publication of CN115562645A publication Critical patent/CN115562645A/en
Application granted granted Critical
Publication of CN115562645B publication Critical patent/CN115562645B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/37 Compiler construction; Parser generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/71 Version control; Configuration management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a program semantic based configuration fault prediction method, which aims to solve the problems that the extraction capability of configuration parameter constraint information is insufficient and the occurrence of configuration faults cannot be prevented. The technical scheme is as follows: constructing a configuration parameter constraint extraction system consisting of a configuration parameter related code object extraction module, a log information related program object extraction module, a configuration related log information identification module, a natural language template generation module and a configuration constraint information identification module; a configuration parameter related code object extraction module and a log information related program object extraction module extract related information from the software source code, and a configuration related log information identification module identifies configuration related log information; the configuration constraint information identification module identifies configuration-related log information containing configuration constraints based on the template generated by the natural language template generation module. The invention can effectively extract the constraint information of the software configuration parameters, effectively detect the relevant defects of the software documents and prevent the occurrence of configuration faults.

Description

Configuration fault prediction method based on program semantics
Technical Field
The invention relates to the field of configuration fault prevention in large-scale software, in particular to a configuration fault prediction method based on program semantics.
Background
With the development of society, software is now used in every sector and has become part of the infrastructure of the information society. Software configuration is an indispensable part of a software system and arises throughout software deployment, operation, upgrade, and migration. It mainly refers to adjusting the values of a software system's configuration parameters through a specific interface or file. By setting configuration parameters, a software user can select among different implemented libraries, strategies, rules, and so on to customize functionality, and can regulate resource usage so that the software adapts to different environments and loads, improving performance, reliability, and other non-functional properties and meeting different user requirements. Unlike similar concepts such as configuration items in software configuration management and software testing, the configuration parameters in the present invention are those set through a specific interface or configuration file, usually provided to the software user as key-value pairs (the key is the configuration parameter name and the value is the configuration parameter value). For example, in MySQL database software, a user sets the data directory (datadir) of the MySQL database to the "/home/" directory by writing "datadir=/home/" in the configuration file. Large-scale infrastructure software systems are increasingly becoming highly configurable so as to adapt to complex environments and changing application requirements and to improve the reliability and availability of software services.
However, as software continues to grow in scale and the interactions between software systems become increasingly complex, configuration gives users flexibility in using software, but configuration faults also frequently cause software service failures, and configuration has gradually attracted wide attention from industry. A study from Google (the lead author's name appears only as an image in the source, Figure BDA0003872406450000011) found that configuration faults have become the second leading cause of Google service failures, accounting for nearly 29%. Randy Katz et al. of Washington University conducted a similar investigation of Hadoop clusters and found that configuration faults are the dominant cause of Hadoop cluster failures, measured both by the number of customer cases and by the duration of technical support. In recent years, many large companies such as Facebook, Microsoft, and Amazon have frequently suffered configuration faults that seriously affected the service quality of the related software and caused huge economic losses.
Taken together, the main reasons for frequent software configuration faults are the growing number of software configuration parameters and the growing complexity of configuration parameter constraints. On the one hand, as software is continuously developed and updated, the code base keeps expanding and the number of configuration parameters grows with it, which makes it much harder for users to understand and use them. For example, the Apache HTTP server software has more than 1000 configuration parameters and MySQL has more than 800. On the other hand, for the software to run normally, the configuration parameters and the environment must satisfy specific value conditions and association relationships, namely configuration constraints, which makes it still harder for users to customize the software correctly. For example, the configuration parameter Port in PostgreSQL database software specifies the port number on which the database listens for network access requests, so the value of Port must not only be an integer between 0 and 65535 in format, but the port corresponding to the current value must also not be occupied by another program.
The prevalence and severity of configuration faults have gradually attracted attention, and some researchers have begun to focus on prevention before configuration faults occur, mainly by checking in advance the configuration parameter values set by users. Compared with diagnosis and repair after a fault occurs, configuration fault prevention reduces the likelihood of a fault before it happens and thereby avoids the system losses it would cause. Current configuration fault prediction methods mainly comprise two steps: first, extract the conditions that configuration parameter values must satisfy, namely the constraint information of the configuration parameters (configuration parameter constraints, or configuration constraints for short); second, use this constraint information to check whether the configuration values set by the user satisfy the constraints, thereby predicting configuration faults. The key to configuration fault prevention is therefore the extraction of configuration constraints, and research on how to obtain software configuration constraints is of great significance.
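The two-step scheme described above (extract constraints, then check user values against them) can be sketched as follows. This is a minimal illustration, not the patented method: the `Constraint` class, `is_valid_port`, and `predict_faults` are hypothetical names, and only the format part of the Port constraint mentioned in the background (an integer between 0 and 65535) is modelled; the port occupancy check is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Constraint:
    parameter: str                 # configuration parameter name (the key)
    description: str               # human-readable statement of the constraint
    check: Callable[[str], bool]   # returns True when a value satisfies it


def is_valid_port(value: str) -> bool:
    """Format constraint from the Port example: an integer in [0, 65535]."""
    return value.isascii() and value.isdigit() and 0 <= int(value) <= 65535


def predict_faults(config: Dict[str, str], constraints: List[Constraint]) -> List[str]:
    """Second step: report every user-set value that violates a constraint."""
    warnings = []
    for c in constraints:
        if c.parameter in config and not c.check(config[c.parameter]):
            warnings.append(
                f"{c.parameter}={config[c.parameter]!r} violates: {c.description}"
            )
    return warnings


# First step (assumed already done): the extracted constraint set.
constraints = [
    Constraint("port", "must be an integer between 0 and 65535", is_valid_port)
]

if __name__ == "__main__":
    print(predict_faults({"port": "70000"}, constraints))
```

Keeping each constraint as data plus a small predicate means new constraints can be added without touching the checking loop.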
At present, the prior art mainly uses two types of methods to extract configuration parameter constraints for configuration fault prevention. The first type, represented by "EnCore: Exploiting System Environment and Correlation Information for Misconfiguration Detection" published by Jiaqi Zhang et al. in ASPLOS 2014, mines software configuration parameter constraint information from a large number of sample configuration files based on predefined constraint rule patterns. However, this type of method requires the possible forms of configuration parameter constraints to be summarized manually in advance, which demands substantial domain knowledge from researchers. In addition, its mining process depends on a large number of sample configuration files as input, and because of factors such as the protection of user privacy and the lack of platforms for sharing and maintaining such data, sample configuration files are often difficult to obtain, which directly affects how well configuration parameter constraints can be mined. The second type, represented by "Do Not Blame Users for Misconfigurations" published by Tianyin Xu et al. in SOSP 2013, uses static program analysis to track how the program variables corresponding to configuration parameters (hereinafter referred to as configuration variables) are used in the source code, and extracts constraints by matching predefined configuration constraint code patterns. This type of method likewise requires researchers to have rich domain knowledge and development experience in order to predefine effective configuration parameter constraint code patterns for obtaining the constraint information in the software source code.
In conclusion, how to extract more configuration parameter constraint information, thereby effectively preventing configuration faults and improving software reliability is a hot issue being discussed by the technical personnel in the field.
Disclosure of Invention
The technical problem to be solved by the invention is that existing methods have insufficient capability to extract configuration parameter constraint information and cannot prevent configuration faults from occurring. The invention provides a configuration fault prediction method based on program semantics, which extracts constraint information about configuration parameters by using the context-related program semantics of the log information in the software source code, and predicts configuration faults from the extracted constraint information, thereby preventing configuration faults from occurring.
To solve the above technical problem, the technical scheme of the invention is as follows. First, a configuration constraint extraction system is constructed, consisting of a configuration parameter related code object extraction module, a log information related program object extraction module, a configuration related log information identification module, a natural language template generation module, and a configuration constraint information identification module. The configuration constraint extraction system reads in the software source code and the software configuration parameter name list file. The configuration parameter related code object extraction module extracts from the software source code the set of code objects related to the configuration parameters listed in the software configuration parameter name list file. The log information related program object extraction module extracts the set of program objects related to the log information in the software source code. The configuration related log information identification module receives the set of code objects related to the configuration parameters from the configuration parameter related code object extraction module, receives the set of program objects related to the log information from the log information related program object extraction module, and screens out the configuration related log information. Meanwhile, the natural language template generation module generates a configuration constraint natural language template set from the configuration constraint description document set and the error description related word set. Then, the configuration constraint information identification module receives the configuration parameter name set and the binary pair set of configuration related log information from the configuration related log information identification module, receives the configuration constraint natural language description template set from the natural language template generation module, receives the error description related word set from the user, and identifies the log information containing configuration constraints by using the configuration constraint natural language templates. Finally, the user uses the log information containing configuration constraints to check whether the user's configuration settings satisfy the constraints, thereby predicting configuration faults.
The invention comprises the following steps:
The first step is to construct a configuration parameter constraint extraction system, which consists of a configuration parameter related code object extraction module, a log information related program object extraction module, a configuration related log information identification module, a natural language template generation module, and a configuration constraint information identification module. The configuration parameter related code object extraction module reads the software source code and the software configuration parameter name list file from the file system, obtains the configuration parameter name set C from the software configuration parameter name list file, extracts the configuration parameter related code objects from the software source code according to that file to obtain the configuration parameter related code object set, and sends the configuration parameter name set C and the configuration parameter related code object set to the configuration related log information identification module. The log information related program object extraction module reads the software source code from the file system, extracts all potential log information in the software source code to obtain the potential log information set L, extracts the related program object set of the potential log information in the software source code, and sends the potential log information set L and the related program object set of the potential log information to the configuration related log information identification module. The configuration related log information identification module receives the configuration parameter name set C and the configuration parameter related code object set from the configuration parameter related code object extraction module, receives the potential log information set L and the related program object set of the potential log information from the log information related program object extraction module, identifies the binary pair set CL of configuration related log information by matching the configuration parameter related code objects against the log information related program objects, and sends the configuration parameter name set C and the binary pair set CL to the configuration constraint information identification module. The natural language template generation module receives the configuration constraint description document set DS and the error description related word set lambda from the user, generates the configuration constraint natural language description template set, and sends it to the configuration constraint information identification module. The configuration constraint information identification module receives the configuration parameter name set C and the binary pair set CL of configuration related log information from the configuration related log information identification module, receives the configuration constraint natural language description template set from the natural language template generation module, receives the error description related word set from the user, matches the configuration related log information against the configuration constraint natural language description templates, and thereby identifies and obtains the configuration related log information that contains configuration constraints.
Secondly, the configuration parameter related code object extraction module reads the software source code and the software configuration parameter name list file from the file system, obtains a configuration parameter name set C from the software configuration parameter name list file, extracts the configuration parameter related code object from the software source code according to the software configuration parameter name list file, obtains a configuration parameter related code object set, and sends the configuration parameter name set C and the configuration parameter related code object set to the configuration related log information identification module, and the method is as follows:
2.1 The configuration parameter related code object extraction module reads the software configuration parameter name list file from the file system to obtain the configuration parameter name set C, C = {c_1, c_2, …, c_i, …, c_I}, where c_i is the name of the i-th configuration parameter in C and is a constant string, I is the total number of configuration parameter names in C, and 1 ≤ i ≤ I;
2.2 Use the Clang front end (version 10.0.0 and above) of the LLVM compiler framework to parse the software source code and generate the corresponding abstract syntax tree (Abstract Syntax Tree), denoted AST_root. Each node in AST_root represents a structure in the source code, such as the whole source file (TranslationUnitDecl), a function declaration (FunctionDecl), an if branch statement (IfStmt), an assignment statement (AssignStmt), a function call (CallExpr), a constant string (StringLiteral), a binary operation (BinaryOperator), a single variable (DeclRefExpr), or a struct variable (MemberExpr). The containment relationships between structures are also represented in the tree; for example, for an if branch statement inside a function declaration, the corresponding IfStmt node lies in the subtree rooted at the FunctionDecl node;
2.3 From the abstract syntax tree AST_root, extract the related code object sets CS_1, CS_2, ..., CS_i, ..., CS_I of c_1, c_2, ..., c_i, ..., c_I, where CS_i = <CV_i, CF_i>. CV_i is the set of program variables related to the configuration parameter named c_i, CV_i = {cv_i1, cv_i2, …, cv_ip, …, cv_iP}, where cv_ip is the p-th program variable in CV_i related to the configuration parameter named c_i. CF_i is the set of function signatures related to the configuration parameter named c_i (a function signature defines the inputs and outputs of a function or method and usually includes the parameters and their types and the return value and its type), CF_i = {cf_i1, cf_i2, ..., cf_iq, ..., cf_iQ}, where cf_iq is the q-th function signature in CF_i related to the configuration parameter named c_i. The specific method is as follows:
2.3.1 Initialize variable i = 1, CV_i = {}, and CF_i = {};
2.3.2 Use the traversal interfaces provided by Clang (version 10.0.0 and above), which take the form VisitNodeType, where NodeType is the type of a node in the abstract syntax tree (the node types are as described in step 2.2). The interfaces used during traversal include VisitFunctionDecl and VisitStringLiteral, which visit nodes of the FunctionDecl and StringLiteral types; for detailed interface information, see the Clang interface documentation at "https://llvm.org/doxygen/classclang_1_1RecursiveASTVisitor.html#details". Traverse each node of AST_root in turn, locate the node containing the constant string c_i, mark it as Init_Node, and also mark Init_Node as the current subtree root node Current_Sub_AST;
2.3.3 Determine whether the subtree rooted at Current_Sub_AST (including Current_Sub_AST itself) contains any other constant string c_e in C (1 ≤ e ≤ I, e ≠ i). If so, go to 2.3.5; otherwise, go to 2.3.4;
2.3.4 In AST_root, obtain the parent node of Current_Sub_AST, denoted Parent_Sub_AST. If Parent_Sub_AST is a TranslationUnitDecl node, the traversal has reached the root node of AST_root, so go to 2.3.5; otherwise, let Current_Sub_AST = Parent_Sub_AST and go to 2.3.3;
2.3.5 Locate the AST subtree within Current_Sub_AST that contains Init_Node, and mark this subtree as Minimum_Common_Sub_AST, the minimum common subtree of the code objects related to the configuration parameter named c_i;
2.3.6 Traverse all nodes in Minimum_Common_Sub_AST. Add the program variable name corresponding to each node of program variable type to CV_i, and add the function signature corresponding to each node of function declaration type to CF_i;
2.3.7 Let the binary pair formed by CV_i and CF_i be CS_i, the related code object set of the configuration parameter named c_i, that is, CS_i = <CV_i, CF_i>;
2.3.8 Let i = i + 1. If i ≤ I, let CV_i = {}, CF_i = {}, and go to 2.3.2; otherwise, the configuration parameter related code object extraction module sends the configuration parameter name set C and the related code object sets CS_1, CS_2, ..., CS_i, ..., CS_I of the configuration parameters to the configuration related log information identification module.
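The walk in steps 2.3.2 to 2.3.6 can be sketched on a toy tree. This is only an illustration under stated assumptions: the `Node` class and all function names are hypothetical stand-ins for Clang's AST and visitor interfaces, name matching is a plain substring test, and step 2.3.5 is read as "take the child of Current_Sub_AST on the path to Init_Node (or Init_Node itself)".

```python
class Node:
    """Toy AST node; `kind` mimics Clang node type names, `text` holds a payload."""

    def __init__(self, kind, text="", children=()):
        self.kind, self.text = kind, text
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def walk(node):
    """Pre-order traversal of the subtree rooted at `node`."""
    yield node
    for child in node.children:
        yield from walk(child)


def contains_other_name(subtree, name, all_names):
    """Step 2.3.3: does the subtree hold a constant string for another parameter?"""
    others = [c for c in all_names if c != name]
    return any(
        n.kind == "StringLiteral" and any(o in n.text for o in others)
        for n in walk(subtree)
    )


def minimum_common_subtree(root, name, all_names):
    """Steps 2.3.2 to 2.3.5 under the reading stated in the lead-in."""
    init = next(
        (n for n in walk(root) if n.kind == "StringLiteral" and name in n.text), None
    )
    if init is None:
        return None
    current, below = init, init  # `below`: subtree of `current` containing init
    while True:
        if contains_other_name(current, name, all_names):
            return below
        parent = current.parent
        if parent is None or parent.kind == "TranslationUnitDecl":
            return below
        below, current = current, parent


def related_code_objects(root, name, all_names):
    """Step 2.3.6: collect CV_i (variables) and CF_i (function signatures)."""
    subtree = minimum_common_subtree(root, name, all_names)
    cv, cf = set(), set()
    for n in walk(subtree) if subtree is not None else ():
        if n.kind in ("DeclRefExpr", "MemberExpr"):
            cv.add(n.text)
        elif n.kind == "FunctionDecl":
            cf.add(n.text)
    return cv, cf


# Toy configuration table with two entries, one per parameter (hypothetical).
entry_a = Node("InitListExpr", children=[
    Node("StringLiteral", "datadir"), Node("DeclRefExpr", "opt_datadir")])
entry_b = Node("InitListExpr", children=[
    Node("StringLiteral", "port"), Node("DeclRefExpr", "opt_port")])
tree = Node("TranslationUnitDecl", children=[
    Node("VarDecl", "options", [Node("InitListExpr", children=[entry_a, entry_b])])])

if __name__ == "__main__":
    print(related_code_objects(tree, "datadir", ["datadir", "port"]))
```

In this toy table the upward walk from the "datadir" literal stops at the outer initializer list, because that is the first subtree that also mentions "port", so only the variable in the "datadir" entry is collected.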
Thirdly, the log information related program object extraction module reads the software source code from the file system and uses static program analysis to extract all potential log information in the software source code, obtaining the potential log information set L, L = {l_1, l_2, ..., l_j, ..., l_J}, where l_j is the j-th log information, J is the total number of log information items in L, and 1 ≤ j ≤ J. It also obtains the related program object sets LS_1, LS_2, ..., LS_j, ..., LS_J of l_1, l_2, ..., l_j, ..., l_J, where LS_j = <LV_j, LF_j>. LV_j is the set of program variables related to log information l_j, LV_j = {lv_j1, lv_j2, ..., lv_ju, ..., lv_jU}, where lv_ju is the u-th related program variable and 1 ≤ u ≤ U. LF_j is the set of function signatures related to log information l_j, LF_j = {lf_j1, lf_j2, ..., lf_jv, ..., lf_jV}, where lf_jv is the v-th related function signature and 1 ≤ v ≤ V. The method is as follows:
3.1 initialize L = { };
3.2 Using the traversal interfaces provided by Clang (the VisitNodeType interfaces), traverse each node in AST_root in turn and select the nodes of constant string type. Denote the constant string of such a node as t and take t as candidate log information l_candidate. If the single complete program statement containing t contains multiple constant strings (denoted t, t', t'', etc.), represent every program variable appearing in that statement by the string "_VARIABLE_", then combine all constant strings (t, t', t'', etc.) and the program variables replaced by "_VARIABLE_" in their order of occurrence in the statement, separating the constant strings and "_VARIABLE_" with spaces, to form one piece of candidate log information l_candidate. For example, for the log information related program statement "apr_pstrcat(cmd->temp_pool, "LimitRequestFields \"", arg, "\" must be a non-negative integer (0 = no limit)", NULL);", the candidate log information l_candidate extracted by the log information related program object extraction module is "_VARIABLE_ LimitRequestFields \" _VARIABLE_ \" must be a non-negative integer (0 = no limit)"; in this example t is "LimitRequestFields \"" and t' is "\" must be a non-negative integer (0 = no limit)". If the length of l_candidate is less than 10 or l_candidate contains no space, discard l_candidate; otherwise, add l_candidate to the potential log information set L. After all nodes of AST_root have been traversed, go to 3.3;
3.3 After the constant string type (StringLiteral) nodes of AST_root have been traversed, the potential log information set L is obtained, L = {l_1, l_2, ..., l_j, ..., l_J}, where l_j is the j-th log information, J is the total number of log information items in L, and 1 ≤ j ≤ J;
3.4 Initialize variable j = 1, the program variable set LV_j = {}, and the function signature set LF_j = {};
3.5 For the j-th element l_j in L, extract the related program objects of l_j based on the backward slicing technique ("Thin Slicing", published by Sridharan M et al. in PLDI 2007) to form LS_j, the set of program objects related to potential log information l_j, LS_j = <LV_j, LF_j>. LV_j stores all program variables in the program context related to l_j; LF_j stores the function signatures of the function calls involved in the program context related to l_j. The method is as follows:
3.5.1 Mark the node where l_j is located in AST_root as the current node cur_node, and add all program variables in the subtree rooted at cur_node (nodes of type DeclRefExpr, representing a single variable, and MemberExpr, representing a struct variable) to LV_j. If cur_node is located in the Then/Else processing code of an if branch statement, add to LV_j all program variables (node types DeclRefExpr and MemberExpr) contained in the subtree rooted at the node of that statement's branch condition, and add to LF_j the function signatures corresponding to the function calls in that subtree;
3.5.2 Obtain the parent node parent_node of cur_node and number all child nodes in the subtree rooted at parent_node in order of appearance; denote the number of cur_node as x. If x = 1, no statement precedes the statement containing cur_node in the subtree rooted at parent_node, so go to 3.5.4; if x > 1, go to 3.5.3;
3.5.3 Traverse the (x-1)-th, (x-2)-th, ..., 1st child nodes of parent_node in turn to find the program variables and function signatures related to l_j, as follows:
3.5.3.1 If cur_node represents an assignment statement node (node type AssignStmt) and the lvalue of the assignment statement (the value to the left of the assignment operator, that is, an addressable object stored in memory) corresponds to a variable var ∈ LV_j, add all program variables contained in the rvalue of the assignment statement (the value to the right of the assignment operator, that is, readable data stored at some memory address) to LV_j; if the rvalue contains a function call, add the function signature corresponding to that call to LF_j. Go to 3.5.4;
3.5.3.2 If cur_node represents a function call statement node (node type CallExpr) and an argument var of the function call is in LV_j, add the function signature corresponding to the function call to LF_j. Go to 3.5.4;
3.5.4 Let cur_node = parent_node. If cur_node is a function declaration type node, the traversal has reached the root node of the abstract syntax subtree of the current function body, so go to 3.5.5; otherwise, go to 3.5.2;
3.5.5 Add the function signature of the function declaration corresponding to cur_node to LF_j;
3.6 making J = J +1, if J is less than or equal to J, rotating to 3.6; otherwise, the log information related program object extraction module obtains L and LS 1 ,LS 2 ,...,LS j ,...,LS J ,L={l 1 ,l 2 ,...,l j ,...,l J }; combining a set of potential log information L and a set of potential log information related program objects LS 1 ,LS 2 ,...,LS j ,...,LS J And sending the information to a configuration related log information identification module.
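The backward scan of steps 3.5.2 to 3.5.5 can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the patented implementation: statements are hypothetical dictionaries rather than Clang AST nodes, only one flat block is scanned (every preceding statement is examined instead of moving to the parent after the first match), and the names backward_scan, kind, lhs, rhs_vars, rhs_calls, func and args are invented for illustration.

```python
def backward_scan(preceding, log_vars, enclosing_signature):
    """Collect variables (LV_j) and function signatures (LF_j) related to a log statement.

    preceding: hypothetical statement records that appear before the log
    statement in the same block, in source order.
    """
    lv, lf = set(log_vars), set()
    # Visit statements x-1, x-2, ..., 1, i.e. the nearest statement first.
    for stmt in reversed(preceding):
        if stmt["kind"] == "assign" and stmt["lhs"] in lv:
            # 3.5.3.1: the assigned variable is relevant, so its sources are too.
            lv.update(stmt["rhs_vars"])
            lf.update(stmt["rhs_calls"])
        elif stmt["kind"] == "call" and lv.intersection(stmt["args"]):
            # 3.5.3.2: a call that takes a relevant variable as an argument.
            lf.add(stmt["func"])
    # 3.5.5: record the signature of the enclosing function declaration.
    lf.add(enclosing_signature)
    return lv, lf


statements = [
    {"kind": "assign", "lhs": "limit", "rhs_vars": ["arg"], "rhs_calls": ["atoi(const char *)"]},
    {"kind": "call", "func": "check_limit(int)", "args": ["limit"]},
]
lv_j, lf_j = backward_scan(statements, ["limit"], "set_limit(cmd_parms *, const char *)")
```

In this toy input the log statement mentions only limit; the scan pulls in arg through the assignment and records the two calls plus the enclosing function.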
Fourthly, the configuration related log information identification module receives C and CS_1, CS_2, ..., CS_i, ..., CS_I from the configuration parameter related code object extraction module, receives L and LS_1, LS_2, ..., LS_j, ..., LS_J from the log information related program object extraction module, screens the configuration related log information out of L to obtain the set CL of configuration related log information binary pairs, and sends the configuration parameter name set C and CL to the configuration constraint information identification module, as follows:
4.1 initializing variable j =1;
4.2 initializing variable i =1;
4.3 Match the j-th log information l_j in L against the i-th configuration parameter name c_i in C to determine whether l_j and c_i are associated, as follows:
4.3.1 Configuration parameter names in software usually consist of several words joined by camel case (Camel-Case, a naming convention used when writing computer programs that mixes upper- and lower-case letters to form the names of variables and functions) or by common separator characters (e.g., "_"). The configuration related log information identification module therefore splits c_i by its constituent words and camel-case boundaries to obtain the corresponding word set CWords_i. For example, if c_i is the name string of the configuration parameter "DataDirectory", then CWords_i = {"data", "directory"}. Let |CWords_i| denote the number of words in CWords_i;
4.3.2 If |CWords_i| = 1, go to 4.3.3; if |CWords_i| > 1, go to 4.3.4;
4.3.3 Take the single word cword in CWords_i and check whether cword is related to l_j, as follows:
4.3.3.1 If cword is quoted in l_j, i.e., cword is immediately preceded and followed by double or single quotation marks, go to 4.3.3.4; otherwise go to 4.3.3.2;
4.3.3.2 If any of the keywords "configuration", "option", "directive" and "parameter" modifies cword in l_j, i.e., the keyword appears adjacent to cword in l_j, go to 4.3.3.4; otherwise go to 4.3.3.3;
4.3.3.3 CWords_i fails to match log information l_j; go to 4.3.6;
4.3.3.4 CWords_i matches log information l_j; go to 4.3.7;
4.3.4 Join all words in CWords_i with the string "[. -]" to generate a string denoted CReg_i; for example, for the configuration parameter "DataDirectory" with CWords_i = {"data", "directory"}, the generated CReg_i is "data[. -]directory";
4.3.5 Search log information l_j for CReg_i using regular expression matching; if the match succeeds, go to 4.3.7; otherwise go to 4.3.6;
4.3.6 Match the code objects CS_i related to c_i against the program objects LS_j related to l_j, as follows:
4.3.6.1 Initialize variable p = 1;
4.3.6.2 Initialize variable u = 1;
4.3.6.3 If cv_ip = lv_ju, the match succeeds; go to 4.3.7; otherwise go to 4.3.6.4;
4.3.6.4 Let u = u + 1; if u ≤ U, go to 4.3.6.3; otherwise go to 4.3.6.5;
4.3.6.5 Let p = p + 1; if p ≤ P, go to 4.3.6.2; otherwise go to 4.3.6.6;
4.3.6.6 Initialize variable q = 1;
4.3.6.7 Initialize variable v = 1;
4.3.6.8 If cf_iq = lf_jv, the match succeeds; go to 4.3.7; otherwise go to 4.3.6.9;
4.3.6.9 Let v = v + 1; if v ≤ V, go to 4.3.6.8; otherwise go to 4.3.6.10;
4.3.6.10 Let q = q + 1; if q ≤ Q, go to 4.3.6.7; otherwise go to 4.3.6.11;
4.3.6.11 Initialize variable u = 1;
4.3.6.12 If Similarity(c_i, lv_ju) > 0.63, the match succeeds; go to 4.3.7; otherwise go to 4.3.6.13. Similarity(c_i, lv_ju) is the function that computes the similarity of c_i and lv_ju, calculated as follows:
For the two strings c_i and lv_ju whose similarity is to be computed, first apply word segmentation and lemmatization (i.e., removing the affixes of each word and extracting its main part): c_i yields the word set CW and lv_ju yields the word set VW. Then compute the weight of each word in CW and VW with the IDF (Inverse Document Frequency) algorithm (K. Spärck Jones, "A statistical interpretation of term specificity and its application in retrieval", Journal of Documentation, 1972). Specifically, each configuration parameter name provided by the software is treated as a document in the IDF algorithm, the set of all configuration parameter names is treated as the corpus, and the weight of each word in CW and VW is computed on that basis. The similarity between the two is calculated according to the formula
Figure BDA0003872406450000091
Wherein word is a word contained in the CW and VW sets;
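The IDF weighting described above can be sketched in Python as follows. This is an illustrative sketch only: the similarity formula itself is given in the figure and is not reproduced here, the common log(N / df) variant of IDF is an assumption, lemmatization is omitted, and the helper names words_of and idf_weights are hypothetical.

```python
import math
import re


def words_of(name):
    """Split a name into lower-case words on camel-case and separator boundaries."""
    return {w.lower() for w in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)}


def idf_weights(param_names):
    """Treat each configuration parameter name as a document and all names as the corpus."""
    n_docs = len(param_names)
    doc_freq = {}
    for name in param_names:
        for word in words_of(name):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    # Assumed IDF variant: log(N / df); rarer words receive larger weights.
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


weights = idf_weights(["DataDirectory", "LogDirectory", "MaxClients"])
```

With these three hypothetical names, "directory" occurs in two of them and therefore weighs less than "data", which occurs in one.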
4.3.6.13 Let u = u + 1; if u ≤ U, go to 4.3.6.12; otherwise the code objects CS_i related to c_i and the program objects LS_j related to l_j do not match; go to 4.4;
4.3.7 Log information l_j and configuration parameter name c_i match successfully; add the binary pair <c_i, l_j> to the set CL;
4.4 Let i = i + 1; if i ≤ I, go to 4.3; otherwise go to 4.5;
4.5 Let j = j + 1; if j ≤ J, go to 4.2; otherwise C and L have been fully processed, yielding the binary pair set CL = {<c_1, l_1>, <c_2, l_2>, ..., <c_m, l_m>, ..., <c_M, l_M>}; go to 4.6;
4.6 the configuration related log information identification module sends the configuration parameter name set C and the binary pair set CL of the configuration related log information to the configuration constraint information identification module;
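The name-based matching of steps 4.3.1 to 4.3.5 can be sketched in Python as follows. It is a rough illustration under assumptions rather than the patented implementation: the function names split_name and matches are hypothetical, the separator pattern "[-_. ]?" is an assumed stand-in for the joining string of step 4.3.4, and "adjacent" is interpreted as the keyword directly preceding or following the word.

```python
import re

KEYWORDS = ("configuration", "option", "directive", "parameter")


def split_name(name):
    """4.3.1: split a parameter name on separators and camel-case boundaries."""
    words = []
    for part in re.split(r"[-_.\s]+", name):
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [w.lower() for w in words]


def matches(name, log):
    """4.3.2 to 4.3.5: decide whether a log message mentions the parameter name."""
    words = split_name(name)
    text = log.lower()
    if len(words) == 1:
        w = re.escape(words[0])
        kw = "|".join(KEYWORDS)
        # 4.3.3.1: the word is enclosed in matching single or double quotes.
        quoted = re.search(rf"""(["']){w}\1""", text)
        # 4.3.3.2: a keyword appears directly before or after the word.
        adjacent = re.search(rf"\b(?:{kw})\s+{w}\b|\b{w}\s+(?:{kw})\b", text)
        return bool(quoted or adjacent)
    # 4.3.4 and 4.3.5: join the words with an assumed separator pattern and search.
    pattern = r"[-_. ]?".join(re.escape(w) for w in words)
    return re.search(pattern, text) is not None


hit = matches("DataDirectory", "cannot open data directory /var/lib")
```

Single-word names are held to the stricter quote or keyword test because a word such as "timeout" or "port" is common in ordinary log text.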
Fifthly, manually screen texts describing configuration constraints from the configuration related documents and source code logs of 11 software systems, namely Apache Hadoop, HDFS, Yarn, Alluxio, Cassandra, Spark, Hypertable, MongoDB, AOLServer, Subversion and OpenLDAP, collecting 338 configuration constraint text descriptions in total; these are recorded as the configuration constraint description document set DS (the number of descriptions in DS is |DS| = 338).
Sixthly, manually collect words that describe error-related states through the interface of the WordNet dictionary (WordNet is an English dictionary based on cognitive linguistics, jointly designed by psychologists, linguists and computer engineers at Princeton University) to form the error description related word set lambda.
Seventhly, the natural language template generation module receives DS and the error description related word set lambda from the user and generates the configuration constraint natural language description template set LanPattern, as follows:
7.1 The natural language template generation module derives the natural language templates describing configuration constraints from DS, as follows:
7.1.1 Initialize variable y = 1;
7.1.2 For the y-th description d_y in DS, use the spaCy open source library (spaCy is a natural language processing library written in Python and Cython; version 3.1.0 or later is required) to generate the POS tag sequence pairs <pos_y1, lemma_y1>, <pos_y2, lemma_y2>, ..., <pos_yz, lemma_yz>, ..., <pos_yh, lemma_yh> corresponding to d_y. The set of h POS tag sequence pairs is abbreviated as the first POS tag sequence set 《pos_yh, lemma_yh》, where pos_yz (1 ≤ z ≤ h) is the part-of-speech tag (e.g., noun (NN), verb (VB), adjective (JJ)) of the z-th word in d_y, lemma_yz is the base form of the z-th word in d_y after lemmatization, and h is the total number of POS tag sequence pairs;
7.1.3 Apply the removal, merging and replacement method to 《pos_yh, lemma_yh》 to obtain the second POS tag sequence set 《pos_yh′, lemma_yh′》, as follows:
Remove from 《pos_yh, lemma_yh》 the binary pairs whose pos_yz is DT (the part-of-speech tag of determiners) or SYM (the part-of-speech tag of symbols), and merge consecutive binary pairs whose pos_yz is NN (the part-of-speech tag of nouns) or JJ (the part-of-speech tag of adjectives): if pos_yz = NN and pos_y(z+1) = NN, or pos_yz = JJ and pos_y(z+1) = JJ, merge <pos_yz, lemma_yz> and <pos_y(z+1), lemma_y(z+1)> into <pos_yz, lemma_yz + lemma_y(z+1)>, where lemma_yz + lemma_y(z+1) denotes the words lemma_yz and lemma_y(z+1) joined by a space. When lemma_yz is a word in the set lambda, replace lemma_yz with the string "ERROR_STATUS". This yields the second POS tag sequence set 《pos_yh′, lemma_yh′》;
7.1.4 Let y = y + 1; if y ≤ 338, go to 7.1.2; otherwise go to 7.1.5;
7.1.5 Mine 《pos_yh′, lemma_yh′》 with the Apriori frequent itemset mining algorithm (see the book "Data Mining: Concepts and Techniques", Jiawei Han et al., 2011), and add the five most frequent sequences to the configuration constraint natural language description template set LanPattern;
7.2 The natural language template generation module sends the natural language description template set LanPattern and the error description related word set lambda to the configuration constraint information identification module.
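The removal, merging and replacement of step 7.1.3 can be illustrated with a short Python sketch. It assumes the (POS tag, lemma) pairs have already been produced by a tagger such as spaCy, so no tagging is done here; the function name normalize and the sample input are hypothetical.

```python
def normalize(pairs, error_words):
    """Apply step 7.1.3 to a list of (pos, lemma) pairs."""
    # Removal: drop determiners (DT) and symbols (SYM).
    kept = [(pos, lemma) for pos, lemma in pairs if pos not in ("DT", "SYM")]
    # Merging: join runs of consecutive nouns (NN) or adjectives (JJ) with a space.
    merged = []
    for pos, lemma in kept:
        if merged and pos in ("NN", "JJ") and merged[-1][0] == pos:
            merged[-1] = (pos, merged[-1][1] + " " + lemma)
        else:
            merged.append((pos, lemma))
    # Replacement: lemmas found in the error word set become ERROR_STATUS.
    return [(pos, "ERROR_STATUS" if lemma in error_words else lemma) for pos, lemma in merged]


tagged = [("DT", "the"), ("NN", "buffer"), ("NN", "size"), ("VB", "be"), ("JJ", "invalid")]
result = normalize(tagged, {"invalid"})
```

The sketch follows the order in which the text lists the operations (remove, merge, replace); a merged lemma such as "buffer size" is therefore compared with the error word set as a whole.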
Eighthly, the configuration constraint information identification module receives CL = {<c_1, l_1>, <c_2, l_2>, ..., <c_m, l_m>, ..., <c_M, l_M>} from the configuration related log information identification module, receives LanPattern and lambda from the natural language template generation module, and identifies the log information in CL that contains configuration constraint descriptions based on LanPattern, as follows:
8.1 Initialize variable m = 1 and initialize the set of log information containing configuration constraints ConstraintDescSet = {};
8.2 For the binary pair <c_m, l_m> in CL, if l_m contains c_m, replace c_m in l_m with the string "CONFIG" and go to 8.3; if l_m does not contain c_m, go directly to 8.3;
8.3 If l_m contains a word in lambda, replace the corresponding word in l_m with the string "ERROR_STATUS" and go to 8.4; if l_m contains no word in lambda, go directly to 8.4;
8.4 Use the spaCy open source library to generate the third POS tag sequence set 《pos_mh, lemma_mh》 corresponding to l_m, and apply the removal, merging and replacement method of step 7.1.3 to 《pos_mh, lemma_mh》 to obtain the fourth POS tag sequence set 《pos_mh′, lemma_mh′》;
8.5 Check whether 《pos_mh′, lemma_mh′》 matches any template in LanPattern; if so, add the binary pair <c_m, l_m> to the set ConstraintDescSet and go to 8.6; otherwise go directly to 8.6;
8.6 Let m = m + 1; if m ≤ |CL|, where |CL| is the number of elements in CL, go to 8.2; otherwise go to 8.7;
8.7 The configuration constraint information identification module outputs ConstraintDescSet to the user. ConstraintDescSet contains all log information in L that contains configuration constraints, ConstraintDescSet = {<c_1, l_1>, <c_2, l_2>, ..., <c_r, l_r>, ..., <c_R, l_R>}, where <c_r, l_r> is the r-th binary pair in ConstraintDescSet, c_r is the configuration parameter name, l_r is the corresponding log information containing a configuration constraint description, R is the total number of binary pairs in ConstraintDescSet, and 1 ≤ r ≤ R;
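The masking performed in steps 8.2 and 8.3 before POS tagging can be sketched as follows. This is a minimal Python illustration under assumptions: the function name mask_log is hypothetical, error words are compared case-insensitively, and the tagging and template matching of steps 8.4 and 8.5 are not shown.

```python
import re


def mask_log(log, param_name, error_words):
    """Steps 8.2 and 8.3: mask the parameter name and error-state words in a log message."""
    # 8.2: replace the configuration parameter name with CONFIG, if present.
    masked = log.replace(param_name, "CONFIG")

    # 8.3: replace every word found in the error word set with ERROR_STATUS.
    def replace_error(match):
        word = match.group(0)
        return "ERROR_STATUS" if word.lower() in error_words else word

    return re.sub(r"[A-Za-z]+", replace_error, masked)


masked = mask_log("Invalid value for Timeout", "Timeout", {"invalid", "fail"})
```

Masking both elements lets messages about different parameters and different error words collapse onto the same template shape.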
Ninthly, the user checks whether the configuration parameter settings in the configuration file satisfy the constraints according to ConstraintDescSet, the set of log information containing configuration constraints output by the configuration constraint information identification module, and thereby predicts configuration faults, as follows:
9.1 The user reads the ConstraintDescSet output by the configuration constraint information identification module;
9.2 Initialize variable r = 1;
9.3 The user reads the configuration file of the target software and checks whether it contains a configuration parameter named c_r; if so, go to 9.4; otherwise go to 9.6;
9.4 Check whether the value of the configuration parameter named c_r in the configuration file satisfies the configuration constraint described by l_r; if it does, go to 9.6; otherwise the configuration file contains a parameter setting that violates a configuration constraint, i.e., a configuration fault exists in the current configuration file; go to 9.5;
9.5 A configuration fault exists in the current configuration file and the check fails. Following the configuration constraint described by l_r, the user adjusts the value of the configuration parameter named c_r into the legal range, then goes to 9.6 to continue checking the next configuration parameter;
9.6 Let r = r + 1; if r ≤ R, go to 9.3; otherwise go to 9.7;
9.7 No configuration parameter setting that violates a configuration constraint remains in the configuration file, which indicates that the current configuration file has no configuration fault; the check passes and ends.
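The check loop of steps 9.2 to 9.6 is carried out by the user, but its control flow can be sketched in Python. Everything below is hypothetical illustration: the constraint in l_r is natural language, so the sketch assumes the user has written a small predicate for each constraint by hand (here non_negative_int), and the configuration file is assumed to be already parsed into a dictionary.

```python
def non_negative_int(value):
    """Hypothetical hand-written check for 'must be a non-negative integer'."""
    try:
        return int(value) >= 0
    except ValueError:
        return False


def find_violations(config, constraint_checks):
    """Report parameters whose configured value violates its constraint."""
    violations = []
    for name, (log_text, is_valid) in constraint_checks.items():
        # 9.3: skip parameters absent from the file; 9.4: test the value.
        if name in config and not is_valid(config[name]):
            # 9.5: a configuration fault; keep the log text to guide the fix.
            violations.append((name, config[name], log_text))
    return violations


checks = {
    "LimitRequestFields": (
        "LimitRequestFields must be a non-negative integer (0 = no limit)",
        non_negative_int,
    )
}
config = {"LimitRequestFields": "-5", "Timeout": "60"}
violations = find_violations(config, checks)
```

Returning the log text alongside each violation mirrors step 9.5, where the user adjusts the value according to the constraint described by l_r.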
Compared with the prior art, the invention can achieve the following beneficial effects:
1. The invention can effectively extract constraint information of software configuration parameters. It extracts 427 pieces of configuration parameter constraint information from 7 large open source software systems (MySQL, Apache Httpd, PostgreSQL, Nginx, Lighttpd, Squid and Postfix), of which 263 configuration parameter constraints cannot be extracted by existing work ("Do Not Blame Users for Misconfigurations", published by Tianyin Xu et al. at SOSP 2013, extracts configuration constraints based on predefined code patterns), so the invention extracts constraint information more completely than the existing extraction method.
2. With the invention, 67 documentation related defects were detected for the software communities and used to supplement missing or erroneous configuration parameter constraint descriptions in the documentation; 14 of the defect patches were accepted, preventing the occurrence of configuration faults. The IDs of the accepted documentation defect patches are: httpd-64893, httpd-64904, httpd-64909, mySQL-101512, mySQL-101513, mySQL-101514, mySQL-101515, mySQL-101516, mySQL-101519, mySQL-101520, lighttpd-3035, lighttpd-3036, lighttpd-3038, lighttpd-3040.
Drawings
FIG. 1 is a logic structure diagram of a configuration parameter constraint extraction system constructed in the first step of the present invention;
FIG. 2 is a general flow diagram of the present invention;
FIG. 3 is a table of configuration parameter constraint description natural language templates constructed in the fourth step of the present invention.
Detailed Description
The present invention is described below with reference to the accompanying drawings, taking the Httpd Web server software as an example.
As shown in fig. 2, the present invention comprises the steps of:
Firstly, a configuration parameter constraint extraction system is constructed. As shown in fig. 1, it consists of a configuration parameter related code object extraction module, a log information related program object extraction module, a configuration related log information identification module, a natural language template generation module and a configuration constraint information identification module. The configuration parameter related code object extraction module reads the Httpd software source code and the Httpd software configuration parameter name list file from the file system, obtains the configuration parameter name set C from the name list file, extracts the configuration parameter related code objects from the source code according to the name list file to obtain the configuration parameter related code object sets, and sends C and the configuration parameter related code object sets to the configuration related log information identification module. The log information related program object extraction module reads the Httpd software source code from the file system, extracts all potential log information in the source code to obtain the potential log information set L, extracts the related program object sets of the potential log information from the source code, and sends L and the related program object sets to the configuration related log information identification module. The configuration related log information identification module receives the configuration parameter name set C and the configuration parameter related code object sets from the configuration parameter related code object extraction module, receives the potential log information set L and the related program object sets of the potential log information from the log information related program object extraction module, identifies the set CL of configuration related log information binary pairs by matching configuration parameter related code objects against log information related program objects, and sends C and CL to the configuration constraint information identification module. The natural language template generation module receives the configuration constraint description document set DS and the error description related word set lambda from the user, generates the configuration constraint natural language description template set, and sends it to the configuration constraint information identification module. The configuration constraint information identification module receives C and CL from the configuration related log information identification module, receives the configuration constraint natural language description template set from the natural language template generation module, receives the error description related word set from the user, and matches the configuration related log information against the templates to identify and obtain the configuration related log information that contains configuration constraints.
Secondly, the configuration parameter related code object extraction module reads the Httpd software source code and the Httpd software configuration parameter name list file from the file system, obtains the configuration parameter name set C from the name list file, extracts the configuration parameter related code objects from the source code according to the name list file to obtain the configuration parameter related code object sets, and sends C and the configuration parameter related code object sets to the configuration related log information identification module, as follows:
2.1 The configuration parameter related code object extraction module reads the Httpd software configuration parameter name list file from the file system to obtain the configuration parameter name set C, C = {c_1, c_2, ..., c_i, ..., c_I}, where c_i, the name of the i-th configuration parameter in C, is a constant string, I is the total number of configuration parameter names in C, 1 ≤ i ≤ I, and I = 694;
2.2 Analyze the Httpd software source code with the Clang front end (version 10.0.0) of the LLVM compiler framework to generate the abstract syntax tree (Abstract Syntax Tree) AST_root corresponding to the source code. Each node in AST_root represents a construct in the source code, such as the whole source file (TranslationUnitDecl), a function declaration (FunctionDecl), an if branch statement (IfStmt), an assignment statement (AssignStmt), a function call (CallExpr), a constant string (StringLiteral), a binary operation (BinaryOperator), a single variable (DeclRefExpr) or a struct member variable (MemberExpr). The dependencies between different constructs are also represented in the tree structure; for example, for an if branch statement inside a function declaration, the corresponding IfStmt node lies in the subtree rooted at the FunctionDecl node;
2.3 Extract from the abstract syntax tree AST_root the code object sets CS_1, CS_2, ..., CS_i, ..., CS_I related to c_1, c_2, ..., c_i, ..., c_I, where CS_i = <CV_i, CF_i>. CV_i is the set of program variables related to the configuration parameter named c_i, CV_i = {cv_i1, cv_i2, ..., cv_ip, ..., cv_iP}, where cv_ip is the p-th program variable in CV_i related to the configuration parameter named c_i. CF_i is the set of function signatures related to the configuration parameter named c_i (a function signature defines the input and output of a function or method and usually includes the parameters and their types and the return value and its type), CF_i = {cf_i1, cf_i2, ..., cf_iq, ..., cf_iQ}, where cf_iq is the q-th function signature in CF_i related to the configuration parameter named c_i. The specific method is as follows:
2.3.1 Initialize variable i = 1, CV_i = {} and CF_i = {};
2.3.2 Traverse each node of AST_root in turn using the traversal interfaces provided by Clang (version 10.0.0 and above), i.e., the VisitNodeType interfaces, where NodeType is the type of a node in the abstract syntax tree as described in step 2.2; for example, VisitFunctionDecl traverses nodes of type FunctionDecl and VisitStringLiteral traverses nodes of type StringLiteral. Locate the node containing the constant string c_i, denote it Init_Node, and also take Init_Node as the current subtree root node Current_Sub_AST;
2.3.3 Determine whether the subtree rooted at Current_Sub_AST contains any other constant string c_e in C (1 ≤ e ≤ I, e ≠ i); if so, go to 2.3.5; otherwise go to 2.3.4;
2.3.4 Obtain the parent node Parent_Sub_AST of Current_Sub_AST in AST_root. If Parent_Sub_AST is a TranslationUnitDecl node, the traversal has reached the root node of AST_root; go to 2.3.5. Otherwise let Current_Sub_AST = Parent_Sub_AST and go to 2.3.3;
2.3.5 Within Current_Sub_AST, locate the AST subtree that contains Init_Node, and mark this subtree as the minimum common subtree Minimum_Common_Sub_AST of the code objects related to the configuration parameter named c_i;
2.3.6 Traverse all nodes in Minimum_Common_Sub_AST; add the program variable names corresponding to the nodes of program variable type in Minimum_Common_Sub_AST to CV_i, and add the function signatures corresponding to the nodes of function declaration type in Minimum_Common_Sub_AST to CF_i;
2.3.7 Let the pair formed by CV_i and CF_i be the code object set CS_i related to the configuration parameter named c_i, i.e., CS_i = <CV_i, CF_i>;
2.3.8 Let i = i + 1. If i ≤ I, let CV_i = {} and CF_i = {} and go to 2.3.2; otherwise the configuration parameter related code object extraction module sends the configuration parameter name set C and the configuration parameter related code object sets CS_1, CS_2, ..., CS_i, ..., CS_I to the configuration related log information identification module.
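The search for the minimum common subtree in steps 2.3.3 to 2.3.5 can be illustrated on a toy tree. The Python sketch below is an assumption-laden illustration, not Clang code: Node, walk and minimal_common_subtree are hypothetical names, a node's label stands in for either a node type or a string literal, and the tree shape (one "entry" per configuration parameter) is invented.

```python
class Node:
    """Toy AST node; `label` stands in for a node type or a string literal."""

    def __init__(self, label, children=()):
        self.label, self.children, self.parent = label, list(children), None
        for child in self.children:
            child.parent = self


def walk(node):
    """Yield the node and all of its descendants in pre-order."""
    yield node
    for child in node.children:
        yield from walk(child)


def minimal_common_subtree(init_node, other_names):
    current = init_node
    # 2.3.3 / 2.3.4: climb until the subtree holds another parameter name,
    # or until the parent is the root (TranslationUnitDecl in Clang).
    while not any(n.label in other_names for n in walk(current)):
        parent = current.parent
        if parent is None or parent.parent is None:
            break
        current = parent
    # 2.3.5: inside that subtree, keep the child subtree containing init_node.
    for child in current.children:
        if any(n is init_node for n in walk(child)):
            return child
    return current


timeout = Node("Timeout")
entry_a = Node("entry", [timeout, Node("timeout_var")])
entry_b = Node("entry", [Node("Port"), Node("port_var")])
root = Node("TranslationUnitDecl", [Node("FunctionDecl", [entry_a, entry_b])])
subtree = minimal_common_subtree(timeout, {"Port"})
found = [n.label for n in walk(subtree)]
```

The climb stops at the first ancestor that also contains "Port", and the returned subtree is the branch below it that holds "Timeout", so timeout_var is collected while port_var is not.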
Thirdly, the log information related program object extraction module reads the Httpd software source code from the file system and extracts all potential log information in the Httpd software source code to obtain the potential log information set L, L = {l_1, l_2, ..., l_j, ..., l_J}, where l_j is the j-th log information, J is the total number of log information in L, 1 ≤ j ≤ J, and J = 4382; it also obtains the related program object sets LS_1, LS_2, ..., LS_j, ..., LS_J of l_1, l_2, ..., l_j, ..., l_J, where LS_j = <LV_j, LF_j>. LV_j is the set of program variables related to log information l_j, LV_j = {lv_j1, lv_j2, ..., lv_ju, ..., lv_jU}, 1 ≤ u ≤ U, where lv_ju is the u-th related program variable; LF_j is the set of function signatures related to log information l_j, LF_j = {lf_j1, lf_j2, ..., lf_jv, ..., lf_jV}, where lf_jv is the v-th related function signature and 1 ≤ v ≤ V. The method is as follows:
3.1 initialize L = { };
3.2 Traverse each node of AST_root in turn using the traversal interfaces provided by Clang (interfaces of the form VisitNodeType), select the nodes of constant string type in AST_root, denote each such constant string as t, and take t as candidate log information l_candidate. If the single complete program statement containing t contains several constant strings (denoted t', t'', ...), uniformly replace every program variable appearing in that statement with the string "_VARIABLE_", and combine all constant strings (t, t', t'', ...) and the "_VARIABLE_" placeholders in their order of appearance in the statement, separated by spaces, to form one piece of candidate log information l_candidate. For example, for the program statement "apr_pstrcat(cmd->temp_pool, "LimitRequestFields \"", arg, "\" must be a non-negative integer (0 = no limit)", NULL);", the candidate log information l_candidate extracted by the log information related program object extraction module is "_VARIABLE_ LimitRequestFields _VARIABLE_ must be a non-negative integer (0 = no limit)"; in this example t is "LimitRequestFields \"" and t' is "\" must be a non-negative integer (0 = no limit)". If the length of l_candidate is less than 10 or l_candidate contains no space, discard l_candidate; otherwise add l_candidate to the potential log information set L. After all nodes of AST_root have been traversed, go to 3.3;
3.3 The traversal of the constant string type (StringLiteral) nodes in AST_root is complete, yielding the potential log information set L, L = {l_1, l_2, ..., l_j, ..., l_J}, where l_j is the j-th log information, J is the total number of log information in L, 1 ≤ j ≤ J, and J = 4382;
3.4 Initialize variable j = 1, the program variable set LV_j = {} and the function signature set LF_j = {};
3.5 For the j-th element l_j in L, extract the program objects related to l_j based on the backward slicing technique ("Thin Slicing", published by M. Sridharan et al. at PLDI 2007), forming the potential log information related program object set LS_j of l_j, LS_j = <LV_j, LF_j>, as follows:
3.5.1 Denote the node of l_j in AST_root as the current node cur_node, and add all program variables in the subtree rooted at cur_node (node types DeclRefExpr, representing a single variable, and MemberExpr, representing a struct member variable) to LV_j. If cur_node lies in the Then/Else handling code of an if branch statement, add all program variables (node types DeclRefExpr and MemberExpr) in the subtree rooted at the node of the branch condition of that if statement to LV_j, and add the function signatures of the function calls in that subtree to LF_j;
3.5.2, acquiring parent node parent _ node of cur _ node, numbering all child nodes in a subtree taking parent _ node as root node according to appearance sequence, marking the serial number of cur _ node as x, if x =1, indicating that cur _ node does not have other statements in the subtree taking parent _ node as root node and located before the statement where cur _ node is located, and turning to 3.5.2; if x is more than 1, rotating to 3.5.3;
3.5.3 go through the x-1, x-2, 1 child node of parent _ node in turn, find and l j Related program variables and function signatures by:
3.5.3.1 if cur _ node represents an assignment statement node (i.e. the node type is assignStmt), and the variable var corresponding to the left value of the assignment statement belongs to LV j Then add all program variables contained in the assignment statement right value to the LV j Performing the following steps; if the right value of the assignment statement contains a function call, adding a function signature corresponding to the function call into the LF j Middle, 3.5.4;
3.5.3.2 if cur _ node represents a function call statement node (i.e., the node type is CallExpr), and the argument var of the corresponding function call is within LV j Then add the function call corresponding function signature to the LF j Middle, 3.5.4;
3.5.4, making cur _ node = parent _ node, if cur _ node is a function declaration type node (namely, reaches the root node of the abstract syntax tree defined by the current function body), it indicates that the traversal has reached the root node of the subtree where the current function body is located, and jumps to 3.5.5; otherwise, skipping to 3.5.2;
3.5.5 adding the function signature of cur _ node corresponding function declaration to LF j Performing the following steps;
3.6 making J = J +1, if J is less than or equal to J, rotating to 3.6; otherwise, the log information related program object extraction module obtains L and LS 1 ,LS 2 ,...,LS j ,...,LS J ,L={l 1 ,l 2 ,...,l j ,...,l J }; combining a set of potential log information L and a set of potential log information related program objects LS 1 ,LS 2 ,...,LS j ,...,LS J And sending the information to a configuration related log information identification module.
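The backward traversal of steps 3.5.1 to 3.5.5 can be illustrated with a minimal Python sketch. It does not use Clang; the Node class, its fields and the toy function body are hypothetical stand-ins for AST nodes, and the sketch assumes that the statements of an If branch are direct children of the If node.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(eq=False)  # identity comparison, so list.index() finds the exact node
class Node:
    kind: str                                       # "func", "if", "assign", "call", "log"
    vars: Set[str] = field(default_factory=set)     # variables read (for "if": in the condition)
    calls: Set[str] = field(default_factory=set)    # signatures of called functions
    lhs: Optional[str] = None                       # assigned variable of an "assign" node
    signature: Optional[str] = None                 # signature of a "func" node
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def backward_slice(log_node: Node):
    """Collect (LV, LF) for one log statement, walking up to the function root."""
    lv, lf = set(log_node.vars), set(log_node.calls)        # 3.5.1
    if log_node.parent.kind == "if":                        # 3.5.1: log inside Then/Else
        lv |= log_node.parent.vars
        lf |= log_node.parent.calls
    cur = log_node
    while cur.kind != "func":
        parent = cur.parent                                 # 3.5.2
        x = parent.children.index(cur)
        for sib in reversed(parent.children[:x]):           # 3.5.3: earlier statements
            if sib.kind == "assign" and sib.lhs in lv:      # 3.5.3.1
                lv |= sib.vars
                lf |= sib.calls
            elif sib.kind == "call" and sib.vars & lv:      # 3.5.3.2
                lf |= sib.calls
        cur = parent                                        # 3.5.4
    lf.add(cur.signature)                                   # 3.5.5
    return lv, lf


# Toy function body: port = atoi(raw); unused = other; if (port) { log(name); }
func = Node("func", signature="int set_port(int)")
func.add(Node("assign", lhs="port", vars={"raw"}, calls={"int atoi(const char *)"}))
func.add(Node("assign", lhs="unused", vars={"other"}))
cond = func.add(Node("if", vars={"port"}))
log = cond.add(Node("log", vars={"name"}))
lv, lf = backward_slice(log)
```

In this toy case the slice keeps port and raw because the branch condition depends on them, and it ignores the unrelated assignment to unused.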
Fourthly, the configuration-related log information identification module receives C and CS 1, CS 2, ..., CS i, ..., CS I (I = 694) from the configuration parameter related code object extraction module, receives L and LS 1, LS 2, ..., LS j, ..., LS J (J = 4382) from the log information related program object extraction module, screens the configuration-related log information out of L to obtain the binary pair set CL of configuration-related log information, and sends the configuration parameter name set C and the binary pair set CL to the configuration constraint information identification module, as follows:
4.1 initializing variable j =1;
4.2 initializing variable i =1;
4.3 Match the j-th log information l j in L against the i-th configuration parameter name c i in C, searching for l j and c i that have an association relationship, as follows:
4.3.1 Considering that configuration parameter names in software are usually composed of words joined by camel case (hump nomenclature) or by common characters (here "_", "-" and "."), the configuration-related log information identification module splits c i according to word composition and camel case to obtain the corresponding word set CWords i; for example, if c i is the name string of the configuration parameter "DataDirectory", then CWords i = {"data", "directory"}; let the number of words in CWords i be |CWords i|;
4.3.2 If |CWords i| = 1, go to 4.3.3; if |CWords i| > 1, go to 4.3.4;
4.3.3 Take the single word cword in CWords i and check whether cword is related to l j, as follows:
4.3.3.1 If cword is quoted in l j, i.e. cword is immediately preceded and followed by double or single quotation marks, go to 4.3.3.4; otherwise go to 4.3.3.2;
4.3.3.2 If any of the keywords "configuration", "option", "directive" and "parameter" is used in l j to modify the word cword, i.e. the keyword occurs adjacent to cword in l j, go to 4.3.3.4; otherwise go to 4.3.3.3;
4.3.3.3 CWords i fails to match log information l j; go to 4.3.6;
4.3.3.4 CWords i matches log information l j successfully; go to 4.3.7;
4.3.4 Join all words in CWords i with the string "[._-]" to generate a string, denoted CReg i; for example, for the configuration parameter "DataDirectory", CWords i = {"data", "directory"} and the generated CReg i is "data[._-]directory";
4.3.5 Match the string CReg i against log information l j using regular expression matching rules; if the match succeeds, go to 4.3.7; otherwise go to 4.3.6;
4.3.6 Match the code objects CS i related to c i against the program objects LS j related to l j, as follows:
4.3.6.1 initialization variable p =1;
4.3.6.2 initialization variable u =1;
4.3.6.3 If cv ip = lv ju, the match succeeds; go to 4.3.7; otherwise go to 4.3.6.4;
4.3.6.4 Let u = u+1; if u ≤ U, go to 4.3.6.3; otherwise go to 4.3.6.5;
4.3.6.5 Let p = p+1; if p ≤ P, go to 4.3.6.2; otherwise go to 4.3.6.6;
4.3.6.6 initialization variable q =1;
4.3.6.7 initialization variable v =1;
4.3.6.8 If cf iq = lf jv, the match succeeds; go to 4.3.7; otherwise go to 4.3.6.9;
4.3.6.9 Let v = v+1; if v ≤ V, go to 4.3.6.8; otherwise go to 4.3.6.10;
4.3.6.10 Let q = q+1; if q ≤ Q, go to 4.3.6.7; otherwise go to 4.3.6.11;
4.3.6.11 initialization variable u =1;
4.3.6.12 If Similarity(c i, lv ju) > 0.63, the match succeeds; go to 4.3.7; otherwise go to 4.3.6.13; here Similarity(c i, lv ju) is the function that computes the similarity of c i and lv ju, calculated as follows:
[Equation image BDA0003872406450000191: formula for Similarity(c i, lv ju)]
wherein word is a word contained in the CW and VW sets;
4.3.6.13 Let u = u+1; if u ≤ U, go to 4.3.6.12; otherwise the code objects CS i related to c i and the program objects LS j related to l j fail to match; go to 4.4;
4.3.7 Log information l j and configuration parameter name c i match successfully; add the binary pair <c i, l j> to the set CL;
4.4 Let i = i+1; if i ≤ I, go to 4.3; otherwise go to 4.5;
4.5 Let j = j+1; if j ≤ J, go to 4.2; otherwise C and L have been fully processed, yielding the binary pair set CL, CL = {<c 1, l 1>, <c 2, l 2>, ..., <c m, l m>, ..., <c M, l M>}, M = 1545; go to 4.6;
4.6 the configuration related log information identification module sends the configuration parameter name set C and the binary pair set CL of the configuration related log information to the configuration constraint information identification module;
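The name-based and object-based matching of steps 4.3.1 to 4.3.6 can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the patented implementation: the function names are hypothetical, the separator class "[._-]" and the exact adjacency rule for keywords are assumed readings of the text, and the Similarity fallback of step 4.3.6.12 is omitted because its formula is given only as an image.

```python
import re

KEYWORDS = ("configuration", "option", "directive", "parameter")


def split_name(name):
    """Step 4.3.1: split on '.', '_', '-' and on camel-case boundaries."""
    words = []
    for part in re.split(r"[._-]+", name):
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [w.lower() for w in words]


def single_word_matches(cword, log):
    """Step 4.3.3: the word is quoted, or adjacent to a configuration keyword."""
    text = log.lower()
    w = re.escape(cword)
    if re.search("[\"']" + w + "[\"']", text):                 # 4.3.3.1
        return True
    kw = "|".join(KEYWORDS)
    adjacent = rf"\b(?:{kw})s?\s+{w}\b|\b{w}\s+(?:{kw})s?\b"   # 4.3.3.2
    return re.search(adjacent, text) is not None


def name_matches(param, log):
    """Steps 4.3.2 to 4.3.5: dispatch on the number of words in the name."""
    words = split_name(param)
    if not words:
        return False
    if len(words) == 1:
        return single_word_matches(words[0], log)
    creg = "[._-]".join(words)                                 # 4.3.4
    return re.search(creg, log.lower()) is not None            # 4.3.5


def objects_match(cs, ls):
    """Step 4.3.6 without the similarity fallback: shared variable or signature."""
    (cv, cf), (lv, lf) = cs, ls
    return bool(cv & lv) or bool(cf & lf)
```

For example, name_matches("DataDirectory", "cannot open data_directory for writing") succeeds through the generated pattern "data[._-]directory", whereas a bare "listen socket failed" does not match the single-word name "Listen" because the word is neither quoted nor modified by a keyword.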
Fifthly, texts describing configuration constraints are manually screened from the configuration-related documents and source code logs of 11 software systems (Apache Hadoop, HDFS, Yarn, Alluxio, Cassandra, Spark, Hypertable, MongoDB, AOLServer, Subversion and OpenLDAP); a total of 338 configuration constraint text descriptions are collected and recorded as the configuration constraint description document set DS (the number of configuration constraint text descriptions in DS is |DS| = 338).
And sixthly, manually acquiring words for describing error related states based on a WordNet dictionary interface to form an error description related word set lambda.
Seventhly, the natural language template generation module receives DS and the error description related word set lambda from the user and generates a configuration constraint natural language description template set LanPattern, and the method comprises the following steps:
7.1 the natural language template generating module obtains the natural language template describing the configuration constraints as shown in fig. 3 according to the DS, and the specific steps are as follows:
7.1.1 initialization variable y =1;
7.1.2 For the y-th description d y in the DS set, use the spaCy open-source library (version 3.1.0) to generate the POS tag sequence pairs <pos y1, lemma y1>, <pos y2, lemma y2>, ..., <pos yz, lemma yz>, ..., <pos yh, lemma yh> corresponding to d y; the set of h POS tag sequence pairs is abbreviated as the first POS tag sequence set 《pos yh, lemma yh》, where pos yz (1 ≤ z ≤ h) is the part-of-speech tag of the z-th word in d y (e.g. noun (NN), verb (VB), adjective (JJ)), lemma yz is the lemma (base form) of the z-th word in d y, and h is the total number of POS tag sequence pairs;
7.1.3 Apply the removal, merging and replacement method to 《pos yh, lemma yh》 to obtain the second POS tag sequence set 《pos yh ′, lemma yh ′》, as follows:
Remove from 《pos yh, lemma yh》 the binary pairs whose pos yz is DT (the part-of-speech tag of determiners) or SYM (the part-of-speech tag of symbols), and merge consecutive binary pairs whose pos yz is NN (the part-of-speech tag of nouns) or JJ (the part-of-speech tag of adjectives), i.e. if pos yz = NN and pos y z+1 = NN, or pos yz = JJ and pos y z+1 = JJ, then merge <pos yz, lemma yz> and <pos y z+1, lemma y z+1> into <pos yz, lemma yz + lemma y z+1>, where lemma yz + lemma y z+1 denotes the word lemma yz and the word lemma y z+1 joined by a space; when lemma yz is a word in the set λ, replace lemma yz with the string "ERROR_STATUS"; this yields the second POS tag sequence set 《pos yh ′, lemma yh ′》;
7.1.4 Let y = y+1; if y ≤ 338, go to 7.1.2; otherwise go to 7.1.5;
7.1.5 Use the Apriori frequent itemset mining algorithm (see the book "Data Mining: Concepts and Techniques" by Jiawei Han et al., 2011) to mine 《pos yh ′, lemma yh ′》, and add the five most frequently occurring sequences to the configuration constraint natural language description template set LanPattern; these correspond to the second to sixth rows of the first column of the table in FIG. 3, where each of those rows is a frequent item in the 《pos yh ′, lemma yh ′》 set and represents the POS tag sequence pattern of one configuration constraint natural language description template; for example, the second row of the first column represents configuration constraint descriptions in the form of a noun (NN) followed by a modal verb (MD), corresponding to the example description in the second row, second column of the table: "this value (NN) must (MD) be greater than 0";
7.2 the natural language template generation module sends the natural language description template set LanPattern and the error description related word set lambda to the configuration constraint information identification module.
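The removal, merging and replacement of step 7.1.3 can be illustrated with a short Python sketch that works on already tagged (pos, lemma) pairs, so that it runs without spaCy. The tags in the examples are assigned by hand for illustration and are not actual spaCy output; the function names are hypothetical. The second function is a deliberately simplified stand-in for the Apriori mining of step 7.1.5: it only counts, per description, which adjacent tag pairs occur, and is not the Apriori algorithm itself.

```python
from collections import Counter


def normalize(pairs, error_words):
    """Step 7.1.3: drop DT/SYM, merge adjacent NN or JJ pairs, mark error words.

    Assumption: merging is applied after removal, so two nouns separated
    only by a determiner are also merged.
    """
    out = []
    for pos, lemma in pairs:
        if pos in ("DT", "SYM"):                       # removal
            continue
        if out and pos in ("NN", "JJ") and out[-1][0] == pos:
            out[-1] = (pos, out[-1][1] + " " + lemma)  # merge, joined by a space
            continue
        out.append((pos, lemma))
    return [(pos, "ERROR_STATUS" if lemma in error_words else lemma)
            for pos, lemma in out]                     # replacement


def top_tag_bigrams(docs, k=5):
    """Simplified frequency count of adjacent POS tag pairs (not Apriori)."""
    support = Counter()
    for pairs in docs:
        tags = [pos for pos, _ in pairs]
        support.update(set(zip(tags, tags[1:])))       # count once per description
    return [bigram for bigram, _ in support.most_common(k)]


# Hand-tagged version of "this value must be": the determiner is removed.
example = normalize(
    [("DT", "this"), ("NN", "value"), ("MD", "must"), ("VB", "be")], set())
```

With these hand-assigned tags, the determiner "this" is dropped and the remaining sequence begins with the NN, MD pattern that the text gives as an example template.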
Eighthly, the configuration constraint information identification module receives CL, CL = {<c 1, l 1>, <c 2, l 2>, ..., <c m, l m>, ..., <c M, l M>}, M = 1545, from the configuration-related log information identification module, receives LanPattern and λ from the natural language template generation module, and identifies the log information in CL that contains a configuration constraint description based on LanPattern, as follows:
8.1 initializing a variable m =1, initializing a log information set ConstraintDescSet = { } containing configuration constraints;
8.2 For the binary pair <c m, l m> in CL, if l m contains c m, replace c m in l m with the string "CONFIG" and go to 8.3; if l m does not contain c m, go directly to 8.3;
8.3 If l m contains a word in λ, replace the corresponding word in l m with the string "ERROR_STATUS" and go to 8.4; if l m contains no word in λ, go directly to 8.4;
8.4 Use the spaCy open-source library to generate the third POS tag sequence set 《pos mh, lemma mh》 corresponding to l m, and apply the removal, merging and replacement method of step 7.1.3 to 《pos mh, lemma mh》 to obtain the fourth POS tag sequence set 《pos mh ′, lemma mh ′》;
8.5 Check whether 《pos mh ′, lemma mh ′》 matches any template in LanPattern; if the match succeeds, add the binary pair <c m, l m> to the set ConstraintDescSet and go to 8.6; otherwise go directly to 8.6;
8.6 Let m = m+1; if m ≤ M (M = 1545), go to 8.2; otherwise go to 8.7;
8.7 The configuration constraint information identification module outputs the ConstraintDescSet set to the user; ConstraintDescSet contains all log information in L that contains configuration constraints, ConstraintDescSet = {<c 1, l 1>, <c 2, l 2>, ..., <c r, l r>, ..., <c R, l R>}, where <c r, l r> denotes the r-th binary pair in ConstraintDescSet, c r denotes the configuration parameter name, l r denotes the corresponding log information containing a configuration constraint description, R is the total number of binary pairs in ConstraintDescSet, 1 ≤ r ≤ R, and R = 205;
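Steps 8.2, 8.3 and 8.5 can be sketched in Python as below. The text does not spell out how a tag sequence is compared with a template, so the sketch assumes that a template matches when it occurs as a contiguous run of POS tags; the function names are hypothetical, and the tagging of step 8.4 is left out so that the example runs without spaCy.

```python
def prepare(log, param, error_words):
    """Steps 8.2 and 8.3: substitute CONFIG and ERROR_STATUS placeholders."""
    text = log.replace(param, "CONFIG")
    tokens = []
    for word in text.split():
        bare = word.lower().strip(".,:;!?'\"")   # ignore surrounding punctuation
        tokens.append("ERROR_STATUS" if bare in error_words else word)
    return " ".join(tokens)


def contains(tags, template):
    """True if template occurs as a contiguous run inside tags (assumed rule)."""
    n = len(template)
    return any(tuple(tags[i:i + n]) == tuple(template)
               for i in range(len(tags) - n + 1))


def matches_any(tags, templates):
    """Step 8.5: the log is kept if any template in LanPattern matches."""
    return any(contains(tags, t) for t in templates)
```

For instance, prepare("MaxClients must be greater than 0", "MaxClients", set()) yields "CONFIG must be greater than 0", and a tag sequence starting with NN, MD would then match a template of the form (NN, MD).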
Ninthly, the user checks, according to the configuration-constraint log information set ConstraintDescSet output by the configuration constraint information identification module, whether the description information in the configuration-related documents of the Httpd software is correct, thereby detecting software document defects, as follows:
9.1 the user receives the constraintDescSet from the configuration constraint information identification module and detects whether the document information is sufficient according to the constraintDescSet;
9.2 initializing variable r =1;
9.3 Check whether the software document contains a textual description of the constraint information for the configuration parameter named c r; if it does, go to 9.4; otherwise go to 9.6;
9.4 Check whether the textual description in the software document of the constraint information for the configuration parameter named c r is consistent with the configuration constraint information described by l r; if consistent, go to 9.6; otherwise go to 9.5;
9.5 The textual description in the current software document of the constraint information for the configuration parameter named c r is defective; the user reports it to the software developers; go to 9.6;
9.6 Let r = r+1; if r ≤ R (R = 205), go to 9.3; otherwise go to 9.7;
9.7, the text description checking of the constraint information related to the configuration parameters in the target software document is finished.
By analyzing the Httpd software, 164 pieces of log information containing configuration constraints were extracted; by comparing them with the official Httpd documentation (ninth step), 25 document defects were found and 25 document defect patches were submitted, of which 3 were accepted. The steps of the invention can be applied in the same way to the other 6 experimental software systems (Nginx, MySQL, PostgreSQL, Lighttpd, Squid, Postfix); the fifth, sixth and seventh steps need to be executed only once. In total, 427 pieces of log information containing configuration constraints were extracted from the 7 software systems; by comparing them with the official documentation of each system (ninth step), 67 document defects were found and 67 document defect patches were submitted, of which 14 were accepted.

Claims (10)

1. A configuration failure prediction method based on program semantics is characterized by comprising the following steps:
firstly, a configuration parameter constraint extraction system is constructed, which consists of a configuration parameter related code object extraction module, a log information related program object extraction module, a configuration-related log information identification module, a natural language template generation module and a configuration constraint information identification module; the configuration parameter related code object extraction module reads the software source code and a software configuration parameter name list file from the file system, obtains a configuration parameter name set C from the software configuration parameter name list file, extracts the configuration parameter related code objects from the software source code according to the software configuration parameter name list file to obtain a configuration parameter related code object set, and sends the configuration parameter name set C and the configuration parameter related code object set to the configuration-related log information identification module; the log information related program object extraction module reads the software source code from the file system, extracts all potential log information in the software source code to obtain a potential log information set L, extracts the related program object sets of the potential log information from the software source code, and sends the potential log information set L and the related program object sets of the potential log information to the configuration-related log information identification module; the configuration-related log information identification module receives the configuration parameter name set C and the configuration parameter related code object set from the configuration parameter related code object extraction module, receives the potential log information set L and the related program object sets of the potential log information from the log information related program object extraction module, identifies a binary pair set CL of configuration-related log information by matching the configuration parameter related code objects against the log information related program objects, and sends the configuration parameter name set C and the binary pair set CL of configuration-related log information to the configuration constraint information identification module; the natural language template generation module receives a configuration constraint description document set DS and an error description related word set λ from the user, generates a configuration constraint natural language description template set, and sends it to the configuration constraint information identification module; the configuration constraint information identification module receives the configuration parameter name set C and the binary pair set CL of configuration-related log information from the configuration-related log information identification module, receives the configuration constraint natural language description template set from the natural language template generation module, receives the error description related word set from the user, matches the configuration-related log information against the configuration constraint natural language description templates, and identifies the configuration-related log information that contains configuration constraints;
secondly, the configuration parameter related code object extraction module reads the software source code and the software configuration parameter name list file from the file system, obtains a configuration parameter name set C from the software configuration parameter name list file, extracts the configuration parameter related code object from the software source code according to the software configuration parameter name list file, obtains a configuration parameter related code object set, and sends the configuration parameter name set C and the configuration parameter related code object set to the configuration related log information identification module, and the method is as follows:
2.1 The configuration parameter related code object extraction module reads the software configuration parameter name list file from the file system to obtain the configuration parameter name set C, C = {c 1, c 2, …, c i, …, c I}, where c i is the i-th configuration parameter name in C and is a constant string, I is the total number of configuration parameter names in C, and 1 ≤ i ≤ I;
2.2 Use the Clang front end of the LLVM compiler framework to analyze the software source code and generate the abstract syntax tree AST root corresponding to the software source code; each node of AST root represents a construct in the source code, and the tree structure represents the dependencies between constructs;
2.3 From the abstract syntax tree AST root, extract the code object sets CS 1, CS 2, …, CS i, …, CS I related to c 1, c 2, …, c i, …, c I, where CS i = <CV i, CF i>; CV i is the set of program variables related to the configuration parameter named c i, CV i = {cv i1, cv i2, …, cv ip, …, cv iP}, and cv ip is the p-th program variable in CV i related to the configuration parameter named c i; CF i is the set of function signatures related to the configuration parameter named c i, CF i = {cf i1, cf i2, …, cf iq, …, cf iQ}, and cf iq is the q-th function signature in CF i related to the configuration parameter named c i; a function signature defines the inputs and outputs of a function or method, including the parameters, the types of the parameters, the return value and the type of the return value;
thirdly, the log information related program object extraction module reads the software source code from the file system and uses static program analysis to extract all potential log information in the software source code, obtaining a potential log information set L, L = {l 1, l 2, …, l j, …, l J}, where l j is the j-th log information, J is the total number of log information items in L, and 1 ≤ j ≤ J; it also obtains the related program object sets LS 1, LS 2, …, LS j, …, LS J of l 1, l 2, …, l j, …, l J, where LS j = <LV j, LF j>; LV j is the set of program variables related to log information l j, LV j = {lv j1, lv j2, …, lv ju, …, lv jU}, 1 ≤ u ≤ U, where lv ju is the u-th related program variable; LF j is the set of function signatures related to log information l j, LF j = {lf j1, lf j2, …, lf jv, …, lf jV}, where lf jv is the v-th related function signature and 1 ≤ v ≤ V; the method is as follows:
3.1 initialize L = { };
3.2 Use the traversal interface provided by Clang to visit each node of AST root in turn and select the nodes of constant string type in AST root; denote the string of such a node as the constant string t and take t as candidate log information l candidate; if t is located in a single complete program statement that contains several constant strings, denoted t', t'', …, replace every program variable appearing in that statement with the string "_VARIABLE_", and finally combine all constant strings t, t', t'', … of the statement and the program variables replaced by "_VARIABLE_" in their order of appearance in the statement, separated by spaces, to form one candidate log information l candidate; if the length of l candidate is less than 10 or l candidate contains no space, discard l candidate; otherwise add l candidate to the potential log information set L; after all nodes of AST root have been traversed, go to 3.3;
3.3 After all constant string type nodes in AST root have been traversed, the potential log information set L is obtained, L = {l 1, l 2, …, l j, …, l J}, where l j is the j-th log information, J is the total number of log information items in L, and 1 ≤ j ≤ J;
3.4 Initialize variable j = 1, initialize the program variable set LV j = {}, and initialize the function signature set LF j = {};
3.5 For the j-th element l j in L, extract the program objects related to l j based on the slicing technique to form the set LS j of program objects related to the potential log information l j, LS j = <LV j, LF j>; LV j stores all program variables in the program context related to l j; LF j stores the function signatures corresponding to the function calls in the program context related to l j;
3.6 Let j = j+1; if j ≤ J, go to 3.5; otherwise the log information related program object extraction module has obtained L and LS 1, LS 2, …, LS j, …, LS J, where L = {l 1, l 2, …, l j, …, l J}; send the potential log information set L and the sets of program objects related to the potential log information, LS 1, LS 2, …, LS j, …, LS J, to the configuration-related log information identification module;
the fourth step, the configuration-related log information recognition module receives C and CS from the configuration parameter-related code object extraction module 1 ,CS 2 ,…,CS i ,…,CS I Receiving L and LS from log information related program object extraction module 1 ,LS 2 ,…,LS j ,…,LS J Screening out the configuration-related log information from the L to obtain a binary pair set CL of the configuration-related log information, and sending the configuration parameter name set C and the binary pair set CL of the configuration-related log information to a configuration constraint information identification module, wherein the method comprises the following steps of:
4.1 initialization variable j =1;
4.2 initializing variable i =1;
4.3 Match the j-th log information l j in L against the i-th configuration parameter name c i in C, searching for l j and c i that have an association relationship, as follows:
4.3.1 The configuration-related log information identification module splits c i according to word composition and camel case (hump nomenclature) to obtain the corresponding word set CWords i; let the number of words in CWords i be |CWords i|;
4.3.2 If |CWords i| = 1, go to 4.3.3; if |CWords i| > 1, go to 4.3.4;
4.3.3 Take the single word cword in CWords i and check whether cword is related to l j; if CWords i fails to match log information l j, go to 4.3.6; if CWords i matches log information l j successfully, go to 4.3.7;
4.3.4 Join all words in CWords i with the string "[._-]" to generate a string, denoted CReg i;
4.3.5 Match the string CReg i against log information l j using regular expression matching rules; if the match succeeds, go to 4.3.7; otherwise go to 4.3.6;
4.3.6 Match the code objects CS i related to c i against the program objects LS j related to l j; if the match succeeds, go to 4.3.7; if it fails, go to 4.4;
4.3.7 Log information l j and configuration parameter name c i match successfully; add the binary pair <c i, l j> to the set CL;
4.4 Let i = i+1; if i ≤ I, go to 4.3; otherwise go to 4.5;
4.5 Let j = j+1; if j ≤ J, go to 4.2; otherwise C and L have been fully processed, yielding the binary pair set CL, CL = {<c 1, l 1>, <c 2, l 2>, …, <c m, l m>, …, <c M, l M>}; go to 4.6;
4.6 the configuration related log information identification module sends the configuration parameter name set C and the binary pair set CL of the configuration related log information to the configuration constraint information identification module;
fifthly, texts describing configuration constraints are manually screened from the configuration-related documents and source code logs of 11 software systems (Apache Hadoop, HDFS, Yarn, Alluxio, Cassandra, Spark, Hypertable, MongoDB, AOLServer, Subversion and OpenLDAP); a total of 338 configuration constraint text descriptions are collected and recorded as the configuration constraint description document set DS, where the number of configuration constraint text descriptions in DS is |DS| = 338;
step six, manually acquiring words for describing error related states based on a WordNet dictionary interface to form an error description related word set lambda;
seventhly, the natural language template generating module receives DS and the error description related word set lambda from the user and generates a configuration constraint natural language description template set LanPattern, and the method comprises the following steps:
7.1 the natural language template generation module obtains the natural language template describing the configuration constraint according to the DS, and the specific steps are as follows:
7.1.1 initialization variable y =1;
7.1.2 For the y-th description d y in the DS set, use the spaCy open-source library to generate the POS tag sequence pairs <pos y1, lemma y1>, <pos y2, lemma y2>, …, <pos yz, lemma yz>, …, <pos yh, lemma yh> corresponding to d y; the set of h POS tag sequence pairs is abbreviated as the first POS tag sequence set 《pos yh, lemma yh》, where pos yz (1 ≤ z ≤ h) is the part-of-speech tag of the z-th word in d y, lemma yz is the lemma (base form) of the z-th word in d y, and h is the total number of POS tag sequence pairs;
7.1.3 Apply the removal, merging and replacement method to 《pos yh, lemma yh》 to obtain the second POS tag sequence set 《pos yh ′, lemma yh ′》, as follows:
remove from 《pos yh, lemma yh》 the binary pairs whose pos yz is DT, the part-of-speech tag of determiners, or SYM, the part-of-speech tag of symbols, and merge consecutive binary pairs whose pos yz is NN, the part-of-speech tag of nouns, or JJ, the part-of-speech tag of adjectives, i.e. if pos yz = NN and pos y z+1 = NN, or pos yz = JJ and pos y z+1 = JJ, then merge <pos yz, lemma yz> and <pos y z+1, lemma y z+1> into <pos yz, lemma yz + lemma y z+1>, where lemma yz + lemma y z+1 denotes the word lemma yz and the word lemma y z+1 joined by a space; when lemma yz is a word in the set λ, replace lemma yz with the string "ERROR_STATUS"; this yields the second POS tag sequence set 《pos yh ′, lemma yh ′》;
7.1.4 Let y = y+1; if y ≤ 338, go to 7.1.2; otherwise go to 7.1.5;
7.1.5 Use the Apriori frequent itemset mining algorithm to mine 《pos yh ′, lemma yh ′》, and add the five most frequently occurring sequences to the configuration constraint natural language description template set LanPattern;
7.2 the natural language template generation module sends a natural language description template set LanPattern and an error description related word set lambda to the configuration constraint information identification module;
eighthly, the configuration constraint information identification module receives CL, CL = {<c 1, l 1>, <c 2, l 2>, …, <c m, l m>, …, <c M, l M>}, from the configuration-related log information identification module, receives LanPattern and λ from the natural language template generation module, and identifies the log information in CL that contains a configuration constraint description based on LanPattern, as follows:
8.1 initializing variable m =1, initializing a log information set ConstraintDescSet = { } containing configuration constraints;
8.2 for binary pairs in CL<c m ,l m >If l is m In (a) contains c m Is prepared by m C in (1) m Replacing the character string 'CONFIG', and converting to 8.3; if l m In does not contain c m Directly rotating to 8.3;
8.3 if l m Contains the word in lambda, and l m Replacing the corresponding word in the Chinese character string with a character string 'ERROR _ STATUS', and turning to 8.4; if l m If the Chinese character does not contain the word in the lambda, directly converting to 8.4;
8.4 open Source Using spaCyLibrary Generation l m Corresponding third POS tag sequence set POS mh ,lemma mh Performing the removal, combination and replacement of pos according to the method for removing, combining and replacing in the step 7.1.3 mh ,lemma mh Removing, replacing and merging the cross-section to obtain a fourth POS tag sequence set POS mh ′,lemma mh ′》;
8.5 inspection of pos mh ′,lemma mh ' whether it can match any template in LanPattern, if matching is successful, couple two elements<c m ,l m >Add to the set ConstraintDescSet and turn 8.6; otherwise, directly rotating to 8.6;
8.6 making m = m +1, if m is less than or equal to | CL |, the | CL | represents the number of elements in CL, and the number is changed to 8.2; otherwise, 8.7 is rotated;
8.7 configuration constraint information identification module outputs the constraintdescet set to the user, the constraintdescet set includes all log information including configuration constraints in L, constraintdescet =<c 1 ,l 1 >,<c 2 ,l 2 >,…,<c r ,l r >,…,<c R ,l R >Therein of<c r ,l r >Represents the r-th binary pair in the constraintdescet, c r Denotes the configuration parameter name,/ r Representing the corresponding log information containing configuration constraint description, wherein R is the total number of binary pairs in the constraintdescet, and R is more than or equal to 1 and less than or equal to R;
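The loop of steps 8.1 to 8.7 can be sketched as below. The text does not define what "matches a template" means, so this sketch assumes that a template matches when it occurs as a contiguous subsequence of the normalized tag sequence. The `tag_and_normalize` callable is a hypothetical stand-in for spaCy tagging followed by the processing of step 7.1.3.

```python
import re


def contains_template(sequence, template):
    """True if `template` occurs as a contiguous run inside `sequence` (assumed rule)."""
    n = len(template)
    return any(tuple(sequence[k:k + n]) == tuple(template)
               for k in range(len(sequence) - n + 1))


def identify_constraint_logs(cl, lan_pattern, error_words, tag_and_normalize):
    """Return the pairs <c_m, l_m> whose log text matches a template.

    cl                -- list of (parameter_name, log_text) pairs
    lan_pattern       -- list of templates, each a sequence of (pos, lemma) pairs
    error_words       -- the set lambda of error-description words
    tag_and_normalize -- hypothetical callable: text -> list of (pos, lemma) pairs
    """
    constraint_desc_set = []
    for c_m, l_m in cl:
        # Step 8.2: abstract the parameter name.
        text = l_m.replace(c_m, "CONFIG")
        # Step 8.3: abstract error-description words.
        for word in error_words:
            text = re.sub(r"\b" + re.escape(word) + r"\b", "ERROR_STATUS",
                          text, flags=re.IGNORECASE)
        # Steps 8.4 and 8.5: tag, normalize, and compare against the templates.
        sequence = tag_and_normalize(text)
        if any(contains_template(sequence, t) for t in lan_pattern):
            constraint_desc_set.append((c_m, l_m))
    return constraint_desc_set
```

Passing the tagger in as a parameter keeps the sketch independent of any particular spaCy model.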
Ninth step: the user checks, according to the set ConstraintDescSet of log information containing configuration constraints output by the configuration constraint information identification module, whether the configuration parameter settings in the configuration file satisfy the constraints, and thereby predicts configuration faults, as follows:
9.1 the user reads the ConstraintDescSet output by the configuration constraint information identification module;
9.2 initialize variable r = 1;
9.3 the user reads the configuration file of the target software and checks whether the configuration file contains a configuration parameter named c_r; if so, go to 9.4; otherwise, go to 9.6;
9.4 check whether the value of the configuration parameter named c_r in the configuration file satisfies the configuration constraint described by l_r; if it does, go to 9.6; otherwise, the configuration file contains a parameter setting that violates a configuration constraint, so a configuration fault exists in the current configuration file; go to 9.5;
9.5 a configuration fault exists in the current configuration file and the check fails; the user adjusts the value of the configuration parameter named c_r according to the configuration constraint described by l_r so that it falls within the legal range; after the adjustment, go to 9.6 and continue with the next configuration parameter;
9.6 let r = r + 1; if r ≤ R, go to 9.3; otherwise, go to 9.7;
9.7 no configuration parameter setting that violates a configuration constraint is found in the configuration file, which indicates that the current configuration file contains no configuration fault; the check passes, and the process ends.
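The ninth step is described as a manual check by the user. As an illustration only, the same loop could be written as below, where the `satisfies` callable is a hypothetical placeholder for the user's judgment of whether a value meets the constraint described by a log message, and the configuration file is assumed to have already been parsed into a dictionary.

```python
def check_configuration(config, constraint_desc_set, satisfies):
    """Collect the settings that violate a described constraint.

    config              -- dict mapping parameter names to their configured values
    constraint_desc_set -- list of (parameter_name, log_text) pairs
    satisfies           -- hypothetical callable (name, value, log_text) -> bool
    """
    violations = []
    for c_r, l_r in constraint_desc_set:
        if c_r not in config:          # step 9.3: parameter absent, skip it
            continue
        value = config[c_r]
        if not satisfies(c_r, value, l_r):   # step 9.4: constraint violated
            violations.append((c_r, value, l_r))
    return violations
```

An empty result corresponds to step 9.7 (the check passes); each returned triple corresponds to a setting that the user would adjust in step 9.5.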
2. The program-semantics-based configuration fault prediction method of claim 1, wherein the version of the Clang front end is 10.0.0 or later; spaCy is a natural language processing (NLP) library for Python and Cython, and its version is required to be 3.1.0 or later.
3. The program-semantics-based configuration fault prediction method of claim 1, wherein each node of AST_root in step 2.2 represents a structure in the source code, the structures including the whole source code TranslationUnitDecl, the function declaration FunctionDecl, the If branch statement IfStmt, the assignment statement AssignStmt, the function call CallExpr, the constant string StringLiteral, the binary operation BinaryOperator, the single variable DeclRefExpr, and the struct variable MemberExpr.
4. The program-semantics-based configuration fault prediction method of claim 1, wherein in step 2.3 the method of extracting from the abstract syntax tree AST_root the code object sets CS_1, CS_2, …, CS_i, …, CS_I related to c_1, c_2, …, c_i, …, c_I is:
2.3.1 initialize variable i = 1, and initialize CV_i = {} and CF_i = {};
2.3.2 traverse each node of AST_root in turn using the related traversal interface provided by Clang, locate the node containing the constant string c_i, denote it Init_Node, and set Init_Node as the current subtree root node Current_Sub_AST;
2.3.3 determine whether the subtree rooted at Current_Sub_AST contains any other constant string c_e in C, where 1 ≤ e ≤ I and e ≠ i; if so, go to 2.3.5; otherwise, go to 2.3.4;
2.3.4 in AST_root, if the parent node Parent_Sub_AST of Current_Sub_AST is a TranslationUnitDecl node, the traversal has reached the root node of AST_root; go to 2.3.5; otherwise, let Current_Sub_AST = Parent_Sub_AST and go to 2.3.3;
2.3.5 locate, within Current_Sub_AST, the AST subtree that contains Init_Node, and denote this subtree as the minimum common subtree Minimum_Common_Sub_AST of the code objects related to the configuration parameter named c_i;
2.3.6 traverse all nodes in Minimum_Common_Sub_AST; add the program variable names corresponding to the nodes of program-variable type in Minimum_Common_Sub_AST to CV_i, and add the function signatures corresponding to the nodes of function-declaration type in Minimum_Common_Sub_AST to CF_i;
2.3.7 let the binary pair formed by CV_i and CF_i be the CS_i of the configuration parameter named c_i, that is, CS_i = <CV_i, CF_i>;
2.3.8 let i = i + 1; if i ≤ I, let CV_i = {} and CF_i = {}, and go to 2.3.2; otherwise, the configuration parameter related code object extraction module sends the configuration parameter name set C and the configuration-parameter-related code object sets CS_1, CS_2, …, CS_i, …, CS_I to the configuration-related log information identification module.
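Steps 2.3.2 to 2.3.6 can be illustrated on a toy tree. The sketch below does not use Clang: the `Node` class is a hypothetical stand-in for AST nodes, node kinds are plain strings, and a string-literal node is assumed to "contain" c_i when its text equals c_i exactly.

```python
class Node:
    """Hypothetical minimal AST node: a kind, a name or literal text, and children."""

    def __init__(self, kind, name="", children=()):
        self.kind, self.name = kind, name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def walk(node):
    """Yield `node` and all of its descendants in pre-order."""
    yield node
    for child in node.children:
        yield from walk(child)


def extract_code_objects(root, c_i, all_names):
    """Return (CV_i, CF_i) for the parameter name c_i, following steps 2.3.2-2.3.6."""
    others = set(all_names) - {c_i}
    # 2.3.2: locate the string literal holding c_i.
    init = next(n for n in walk(root)
                if n.kind == "StringLiteral" and n.name == c_i)
    current = init
    while True:
        # 2.3.3: stop climbing once another parameter name appears in the subtree.
        if any(n.kind == "StringLiteral" and n.name in others for n in walk(current)):
            break
        # 2.3.4: stop when the parent is the translation unit (or there is none).
        if current.parent is None or current.parent.kind == "TranslationUnitDecl":
            break
        current = current.parent
    # 2.3.5: the child subtree of `current` that contains the initial node.
    minimum = next((ch for ch in current.children
                    if any(n is init for n in walk(ch))), current)
    # 2.3.6: collect variable names and function declarations.
    cv = {n.name for n in walk(minimum) if n.kind in ("DeclRefExpr", "MemberExpr")}
    cf = {n.name for n in walk(minimum) if n.kind == "FunctionDecl"}
    return cv, cf
```

With a real Clang front end, the same walk would be driven by the visitor interfaces rather than by this hand-built tree.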
5. The program-semantics-based configuration fault prediction method of claim 1, wherein the related traversal interface provided by Clang in step 3.2 refers to an interface of the form VisitNodeType, where NodeType is a node type in the abstract syntax tree.
6. The program-semantics-based configuration fault prediction method of claim 1, wherein in step 3.5 the method of extracting the program objects related to l_j based on the backward slicing technique, to form the set LS_j of program objects related to the potential log information l_j, is:
3.5.1 denote the node where l_j is located in AST_root as the current node cur_node, and add all program variables in the subtree rooted at cur_node to LV_j; if cur_node is located in the Then/Else processing code of an If branch statement, add all program variables contained in the subtree rooted at the node of the branch condition of the If branch statement to LV_j, and add the function signatures corresponding to the function calls in the subtree rooted at the node of the If branch condition to LF_j;
3.5.2 obtain the parent node parent_node of cur_node, number all child nodes of the subtree rooted at parent_node in order of appearance, and record the number of cur_node as x; if x = 1, no statement in the subtree rooted at parent_node precedes the statement of cur_node, so go to 3.5.4; if x > 1, go to 3.5.3;
3.5.3 traverse the (x-1)-th, (x-2)-th, …, 1st child nodes of parent_node in turn, and search for the program variables and function signatures related to l_j as follows:
3.5.3.1 if cur_node represents an assignment statement node, that is, the node type is AssignStmt, and the variable var corresponding to the left value of the assignment statement belongs to LV_j, add all program variables contained in the right value of the assignment statement to LV_j; if the right value of the assignment statement contains a function call, add the function signature corresponding to the function call to LF_j; go to 3.5.4;
3.5.3.2 if cur_node represents a function call statement node, that is, the node type is CallExpr, and the argument var of the function call belongs to LV_j, add the function signature corresponding to the function call to LF_j; go to 3.5.4;
3.5.4 let cur_node = parent_node; if cur_node is a node of type FunctionDecl, which represents a function declaration, the traversal has reached the root node of the subtree of the current function body, so go to 3.5.5; otherwise, go to 3.5.2;
3.5.5 add the function signature of the function declaration corresponding to cur_node to LF_j.
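The backward slice of steps 3.5.1 to 3.5.5 can be sketched on a toy tree as follows. The `Node` class is a hypothetical stand-in for Clang AST nodes, an `IfStmt` is assumed to store its condition as the first child, an `AssignStmt` is assumed to have exactly two children (left value, right value), and the log node is assumed to lie inside a `FunctionDecl`.

```python
class Node:
    """Hypothetical minimal AST node used only for this illustration."""

    def __init__(self, kind, name="", children=()):
        self.kind, self.name = kind, name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def variables(node):
    """Names of single variables and struct members in the subtree."""
    return {n.name for n in walk(node) if n.kind in ("DeclRefExpr", "MemberExpr")}


def calls(node):
    """Names of the functions called in the subtree."""
    return {n.name for n in walk(node) if n.kind == "CallExpr"}


def backward_slice(log_node):
    """Return (LV_j, LF_j) for the statement node that emits the log message."""
    lv, lf = variables(log_node), set()
    # 3.5.1: if the log sits in a Then/Else branch, include the branch condition.
    child, ancestor = log_node, log_node.parent
    while ancestor is not None:
        if ancestor.kind == "IfStmt" and ancestor.children[0] is not child:
            lv |= variables(ancestor.children[0])
            lf |= calls(ancestor.children[0])
            break
        child, ancestor = ancestor, ancestor.parent
    # 3.5.2-3.5.4: climb towards the enclosing function, scanning earlier siblings.
    cur = log_node
    while cur.kind != "FunctionDecl":
        parent = cur.parent
        x = next(k for k, ch in enumerate(parent.children) if ch is cur)
        for stmt in reversed(parent.children[:x]):
            if stmt.kind == "AssignStmt":
                lhs, rhs = stmt.children
                if variables(lhs) & lv:      # 3.5.3.1: left value already tracked
                    lv |= variables(rhs)
                    lf |= calls(rhs)
            elif stmt.kind == "CallExpr" and variables(stmt) & lv:
                lf.add(stmt.name)            # 3.5.3.2: call on a tracked variable
        cur = parent
    lf.add(cur.name)                         # 3.5.5: enclosing function
    return lv, lf
```

Scanning the earlier siblings in reverse order lets a variable discovered in one assignment be traced further back through the assignments that precede it.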
7. The program-semantics-based configuration fault prediction method of claim 6, wherein in step 3.5.1 the node types corresponding to all program variables in the subtree rooted at cur_node include DeclRefExpr, representing a single variable, and MemberExpr, representing a struct variable; and the node types corresponding to all program variables contained in the subtree rooted at the node of the branch condition of the If branch statement include DeclRefExpr and MemberExpr.
8. The program-semantics-based configuration fault prediction method of claim 1, wherein in step 4.3.3 the method of detecting whether cword is related to l_j is:
4.3.3.1 if cword is quoted in l_j, that is, cword is both preceded and followed by a double or single quotation mark, go to 4.3.3.4; otherwise, go to 4.3.3.2;
4.3.3.2 if any one of the keywords "configuration", "option", "directive", and "parameter" modifies the word cword in l_j, that is, the keyword occurs adjacent to cword in l_j, go to 4.3.3.4; otherwise, go to 4.3.3.3;
4.3.3.3 CWords_i fails to match the log information l_j; end;
4.3.3.4 CWords_i successfully matches the log information l_j; end.
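The two tests of steps 4.3.3.1 and 4.3.3.2 can be sketched with regular expressions. This is an illustration under stated assumptions: the function name is hypothetical, cword is assumed to be a single token made of letters, digits, or underscores, and "adjacent" is taken to mean the immediately preceding or following token.

```python
import re

KEYWORDS = {"configuration", "option", "directive", "parameter"}


def cword_related_to_log(cword, log_text):
    """Return True if cword is quoted in the log or sits next to a keyword."""
    # Step 4.3.3.1: cword enclosed in single or double quotation marks.
    if re.search("[\"']" + re.escape(cword) + "[\"']", log_text):
        return True
    # Step 4.3.3.2: a keyword occurs immediately before or after cword.
    tokens = re.findall(r"\w+", log_text.lower())
    target = cword.lower()
    for k, token in enumerate(tokens):
        if token != target:
            continue
        before = tokens[k - 1] if k > 0 else ""
        after = tokens[k + 1] if k + 1 < len(tokens) else ""
        if before in KEYWORDS or after in KEYWORDS:
            return True
    return False
```

For example, a message such as "the timeout option must be positive" is accepted for cword "timeout", while "connection timeout reached" is not.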
9. The program-semantics-based configuration fault prediction method of claim 1, wherein in step 4.3.6 the method of matching the code objects CS_i related to c_i with the program objects LS_j related to l_j is:
4.3.6.1 initialize variable p = 1;
4.3.6.2 initialize variable u = 1;
4.3.6.3 if cv_ip = lv_ju, the match succeeds; end; otherwise, go to 4.3.6.4;
4.3.6.4 let u = u + 1; if u ≤ U, go to 4.3.6.3; otherwise, go to 4.3.6.5;
4.3.6.5 let p = p + 1; if p ≤ P, go to 4.3.6.2; otherwise, go to 4.3.6.6;
4.3.6.6 initialize variable q = 1;
4.3.6.7 initialize variable v = 1;
4.3.6.8 if cf_iq = lf_jv, the match succeeds; end; otherwise, go to 4.3.6.9;
4.3.6.9 let v = v + 1; if v ≤ V, go to 4.3.6.8; otherwise, go to 4.3.6.10;
4.3.6.10 let q = q + 1; if q ≤ Q, go to 4.3.6.7; otherwise, go to 4.3.6.11;
4.3.6.11 initialize variable u = 1;
4.3.6.12 if Similarity(c_i, lv_ju) > 0.63, the match succeeds; end; otherwise, go to 4.3.6.13; where Similarity(c_i, lv_ju) is a function that computes the similarity of c_i and lv_ju;
4.3.6.13 let u = u + 1; if u ≤ U, go to 4.3.6.12; otherwise, CS_i does not match the program objects LS_j related to l_j; end.
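The nested loops of steps 4.3.6.1 to 4.3.6.13 amount to three successive tests: a shared program variable, a shared function signature, and a name-similarity fallback. A compact sketch, in which the function name is hypothetical and the `similarity` callable is supplied by the caller:

```python
def code_objects_match(c_i, cs_i, ls_j, similarity, threshold=0.63):
    """Return True if CS_i matches LS_j.

    c_i        -- configuration parameter name
    cs_i       -- pair (CV_i, CF_i): variable names and function signatures for c_i
    ls_j       -- pair (LV_j, LF_j): variable names and function signatures for l_j
    similarity -- callable (c_i, variable_name) -> float
    """
    cv_i, cf_i = cs_i
    lv_j, lf_j = ls_j
    # Steps 4.3.6.1-4.3.6.5: any variable common to CV_i and LV_j.
    if set(cv_i) & set(lv_j):
        return True
    # Steps 4.3.6.6-4.3.6.10: any function signature common to CF_i and LF_j.
    if set(cf_i) & set(lf_j):
        return True
    # Steps 4.3.6.11-4.3.6.13: a variable whose name resembles the parameter name.
    return any(similarity(c_i, lv) > threshold for lv in lv_j)
```

Set intersection replaces the explicit index variables p, u, q, and v without changing the outcome, since the steps only ask whether at least one equal pair exists.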
10. The program-semantics-based configuration fault prediction method of claim 9, wherein in step 4.3.6.12 the method of calculating Similarity(c_i, lv_ju) is:
perform word segmentation and lemmatization on c_i to obtain the word set CW, and perform word segmentation and lemmatization on lv_ju to obtain the word set VW, where lemmatization means removing the affixes of a word and extracting its main part; then compute the weight of each word in CW and VW using the IDF algorithm, that is, treat each configuration parameter name provided by the software as a document in the IDF algorithm and the set of all configuration parameter names as the corpus, and compute the weight of each word in the CW and VW sets on that basis;
[Formula image FDA0003872406440000091: expression for Similarity(c_i, lv_ju)]
where word is a word contained in the sets CW and VW.
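The IDF weighting described in this claim can be sketched as below. The similarity formula itself is not available as text, so the final ratio here (IDF weight of the shared words divided by IDF weight of all words in CW and VW) is an assumption chosen for illustration and may differ from the claimed formula. Other simplifications: names are split on separators and camelCase boundaries without true lemmatization, and a smoothed IDF variant is used so that unseen words still receive a finite weight.

```python
import math
import re


def split_words(name):
    """Split an identifier into lowercase words (no lemmatization in this sketch)."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Za-z][a-z]*|\d+", name)
    return {p.lower() for p in parts}


def similarity(c_i, lv_ju, config_names):
    """Assumed weighted-overlap similarity between a parameter name and a variable name.

    config_names -- all configuration parameter names; each one is a "document"
                    and the whole list is the corpus for the IDF weights.
    """
    docs = [split_words(n) for n in config_names]

    def weight(word):
        # Smoothed IDF: rarer words across parameter names weigh more.
        df = sum(word in d for d in docs)
        return math.log((len(docs) + 1) / (df + 1)) + 1.0

    cw, vw = split_words(c_i), split_words(lv_ju)
    union = cw | vw
    if not union:
        return 0.0
    shared = sum(weight(w) for w in cw & vw)
    total = sum(weight(w) for w in union)
    return shared / total
```

The result lies between 0 and 1, so it can be compared directly against the 0.63 threshold of step 4.3.6.12.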
CN202211200856.4A 2022-09-29 2022-09-29 Configuration fault prediction method based on program semantics Active CN115562645B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211200856.4A CN115562645B (en) 2022-09-29 2022-09-29 Configuration fault prediction method based on program semantics

Publications (2)

Publication Number Publication Date
CN115562645A true CN115562645A (en) 2023-01-03
CN115562645B CN115562645B (en) 2023-06-09

Family

ID=84743497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211200856.4A Active CN115562645B (en) 2022-09-29 2022-09-29 Configuration fault prediction method based on program semantics

Country Status (1)

Country Link
CN (1) CN115562645B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant