CN115562645B - Configuration fault prediction method based on program semantics - Google Patents

Info

Publication number
CN115562645B
Authority
CN
China
Prior art keywords
configuration
log information
node
configuration parameter
constraint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211200856.4A
Other languages
Chinese (zh)
Other versions
CN115562645A (en)
Inventor
李姗姗
周书林
郑思
董威
贾周阳
陈振邦
陈立前
张元良
王腾
廖湘科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202211200856.4A
Publication of CN115562645A
Application granted
Publication of CN115562645B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/37 Compiler construction; Parser generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/71 Version control; Configuration management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a configuration fault prediction method based on program semantics, which aims to solve the problems that the existing capability to extract configuration parameter constraint information is insufficient and that configuration faults cannot be prevented. The technical solution is as follows: construct a configuration parameter constraint extraction system consisting of a configuration parameter related code object extraction module, a log information related program object extraction module, a configuration related log information identification module, a natural language template generation module, and a configuration constraint information identification module; the configuration parameter related code object extraction module and the log information related program object extraction module extract related information from the software source code, and the configuration related log information identification module identifies configuration related log information; the configuration constraint information identification module identifies the configuration related log information that contains configuration constraints, based on the templates generated by the natural language template generation module. The invention can effectively extract the constraint information of software configuration parameters, effectively detect related defects in software documentation, and prevent configuration faults.

Description

Configuration fault prediction method based on program semantics
Technical Field
The invention relates to the field of configuration fault prevention in large software, in particular to a configuration fault prediction method based on program semantics.
Background
As society develops, software is applied ever more widely, plays a vital role in many fields, and has become an infrastructure of the information society. Software configuration is an indispensable part of a software system and is present in scenarios such as software deployment, operation, upgrade, and migration. It mainly refers to setting the values of configuration parameters through a specific interface or file, so that users can select among different libraries, strategies, and rules to customize functionality, and can regulate resource utilization so that the software suits different environments and loads, thereby improving non-functional properties such as performance and reliability and meeting different user demands. Unlike the similarly named configuration items in software configuration management and software testing, the configuration parameters in the present invention are those set through a specific interface or configuration file, usually provided to the user as key-value pairs (the key is the configuration parameter name and the value is the configuration parameter value). For example, in the database software MySQL, the user sets the data directory (datadir) of the MySQL database to the "/home/" directory by writing "datadir=/home/" in the configuration file. At present, large-scale infrastructure software is becoming increasingly configurable in order to adapt to complex environments and changing application demands and to improve the reliability and availability of software services.
However, as software grows in scale and the interactions between software systems become more complex, configuration, while giving users the flexibility to tailor software, also brings configuration faults that frequently cause software service failures, and it has gradually attracted wide attention in industry. An investigation by researchers at Google found that configuration faults had become the second leading cause of Google service failures, accounting for nearly 29%. Randy Katz et al. of the University of Washington conducted a similar investigation of Hadoop clusters and found that configuration faults had become the most significant cause of Hadoop cluster failures, in terms of both the number of customer cases and the time spent on technical support. In recent years, large companies such as Facebook, Microsoft, and Amazon have frequently suffered configuration faults, which seriously affected the quality of the related software services and caused huge economic losses.
Overall, the main reasons for frequent software configuration faults are the growing number of software configuration parameters and the growing complexity of configuration parameter constraints. On the one hand, as software is continuously developed and updated, its code base keeps expanding and the number of configuration parameters increases markedly, which makes it much harder for users to understand and use them. For example, there are 1000 configuration parameters in Apache Httpd and 800 configuration parameters in MySQL. On the other hand, to ensure that the software runs normally, the configuration parameters and the environment must satisfy specific value conditions and association relationships, namely configuration constraints, which further increases the difficulty of correctly using configuration parameters to customize the software. For example, the configuration parameter port in the PostgreSQL database software specifies the port number on which the database listens for network access requests, so the value of port must not only be, in format, an integer between 0 and 65535, but the port corresponding to that value must also not be occupied by another program.
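The two kinds of constraint in the port example (a format condition on the value and a condition on the environment) can be sketched as follows. This is a minimal illustration, not part of the patented method: the function names `port_format_ok`, `port_is_free`, and `check_port`, the loopback host, and the error strings are all assumptions made for the example.

```python
import socket


def port_format_ok(value: str) -> bool:
    """Format constraint: the value must be an integer between 0 and 65535."""
    try:
        port = int(value)
    except ValueError:
        return False
    return 0 <= port <= 65535


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Environment constraint: no other program is bound to the port.

    Probing by binding a TCP socket is an assumption of this sketch.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
    return True


def check_port(value: str, is_free=port_is_free) -> list:
    """Return the violated constraints; an empty list means the value is valid."""
    if not port_format_ok(value):
        return ["format: not an integer in [0, 65535]"]
    if not is_free(int(value)):
        return ["environment: port already in use"]
    return []
```

The environment probe is passed in as a parameter so that the format check can be exercised without touching the network, for example `check_port("5432", is_free=lambda p: True)`.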
The prevalence and severity of configuration faults have gradually drawn great attention, and some researchers have begun to focus on prevention before configuration faults occur, mainly by checking the configuration parameter values set by users in advance so as to reduce the occurrence of configuration faults. Compared with diagnosis and repair after a fault has occurred, configuration fault prevention reduces the probability of faults beforehand and thus avoids the system losses they cause. Current configuration fault prediction methods mainly comprise two steps: first, extract the conditions that configuration parameter values must satisfy, namely the constraint information of the configuration parameters (configuration parameter constraints, or configuration constraints for short); second, use this constraint information to check whether the configuration values set by the user satisfy the constraints, thereby predicting configuration faults. The key to configuration fault prevention is therefore extracting configuration constraints, and studying how to obtain software configuration constraints is of great significance.
At present, the prior art mainly uses two types of methods to extract configuration parameter constraints for configuration fault prevention. The first type is represented by "EnCore: Exploiting System Environment and Correlation Information for Misconfiguration Detection", published by Jiaqi Zhang et al. at ASPLOS 2014, which mines software configuration parameter constraint information from a large number of sample configuration files based on predefined constraint rule patterns. However, on the one hand, this method requires the possible forms of configuration parameter constraints to be summarized manually in advance, which demands considerable domain knowledge from researchers; on the other hand, its mining process depends on a large number of sample configuration files as input, but because of factors such as user privacy protection and the lack of platforms for sharing and maintaining such data, sample configuration files are difficult to obtain, which directly affects the mining of configuration parameter constraints. The second type is represented by "Do Not Blame Users for Misconfigurations", published by Tianyin Xu et al. at SOSP 2013, which uses static program analysis to track how the program variables corresponding to configuration parameters (hereinafter, configuration variables) are used in the source code, and extracts constraints by matching predefined configuration constraint code patterns. This method likewise requires researchers to have rich domain knowledge and development experience in order to predefine effective configuration parameter constraint code patterns for obtaining configuration parameter constraint information from the software source code.
In summary, how to extract more configuration parameter constraint information, so as to prevent configuration faults effectively and improve software reliability, is a problem of active interest to those skilled in the art.
Disclosure of Invention
The technical problem to be solved by the invention is that the existing capability to extract configuration parameter constraint information is insufficient and configuration faults therefore cannot be prevented. To this end, the invention provides a configuration fault prediction method based on program semantics, which extracts configuration parameter constraint information using the context-related program semantics contained in the log information of the software source code, and performs configuration fault prediction according to the extracted constraint information, thereby preventing configuration faults.
To solve the above technical problems, the technical solution of the invention is as follows. First, a configuration constraint extraction system is constructed, consisting of a configuration parameter related code object extraction module, a log information related program object extraction module, a configuration related log information identification module, a natural language template generation module, and a configuration constraint information identification module. The configuration constraint extraction system reads in the software source code and the software configuration parameter name list file. The configuration parameter related code object extraction module extracts, from the software source code, the set of code objects related to the configuration parameters listed in the software configuration parameter name list file. The log information related program object extraction module extracts the set of program objects related to log information in the software source code. The configuration related log information identification module receives the set of code objects related to configuration parameters from the configuration parameter related code object extraction module, receives the set of program objects related to log information from the log information related program object extraction module, and screens out the configuration related log information. Meanwhile, the natural language template generation module generates a configuration constraint natural language template set from the configuration constraint description document set and the error description related word set. Then, the configuration constraint information identification module receives the configuration parameter name set and the binary pair set of configuration related log information from the configuration related log information identification module, receives the configuration constraint natural language description template set from the natural language template generation module, receives the error description related word set from the user, and uses the configuration constraint natural language templates to identify the log information that contains configuration constraints. Finally, the user uses the log information containing configuration constraints to check whether the user's configuration settings satisfy the constraints, thereby predicting configuration faults.
The invention comprises the following steps:
First step: construct a configuration parameter constraint extraction system consisting of a configuration parameter related code object extraction module, a log information related program object extraction module, a configuration related log information identification module, a natural language template generation module, and a configuration constraint information identification module. The configuration parameter related code object extraction module reads the software source code and the software configuration parameter name list file from the file system, obtains the configuration parameter name set C from the software configuration parameter name list file, extracts the configuration parameter related code objects from the software source code according to the software configuration parameter name list file to obtain the configuration parameter related code object sets, and sends C and the configuration parameter related code object sets to the configuration related log information identification module. The log information related program object extraction module reads the software source code from the file system, extracts all potential log information in the software source code to obtain the potential log information set L, extracts the related program object sets of the potential log information in the software source code, and sends L and the related program object sets of the potential log information to the configuration related log information identification module. The configuration related log information identification module receives C and the configuration parameter related code object sets from the configuration parameter related code object extraction module, receives L and the related program object sets of the potential log information from the log information related program object extraction module, identifies the binary pair set CL of configuration related log information by matching the configuration parameter related code objects against the log information related program objects, and sends C and CL to the configuration constraint information identification module. The natural language template generation module receives a configuration constraint description document set DS and an error description related word set $\lambda$ from the user, generates a configuration constraint natural language description template set, and sends it to the configuration constraint information identification module. The configuration constraint information identification module receives C and CL from the configuration related log information identification module, receives the configuration constraint natural language description template set from the natural language template generation module, receives the error description related word set from the user, matches the configuration related log information against the configuration constraint natural language description templates, and thereby obtains the configuration related log information that contains configuration constraints.
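The data exchanged between the first three modules can be pictured with a small sketch. It assumes that each related object set is a pair of a variable set and a function signature set, and it uses a deliberately simple stand-in for the matching rule (any shared variable or function signature); the class `ObjectSet`, the helper names, and the sample data are hypothetical and are not taken from the method itself.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectSet:
    """A pair <variables, function signatures>, used here for both kinds of object set."""
    variables: set = field(default_factory=set)
    functions: set = field(default_factory=set)


def shares_objects(cs, ls):
    # Hypothetical matching rule: any common variable or function signature.
    return bool(cs.variables & ls.variables or cs.functions & ls.functions)


def identify_config_related_logs(C, CS, L, LS):
    """Return CL as a list of (parameter name, log message) pairs."""
    return [(c, l)
            for c, cs in zip(C, CS)
            for l, ls in zip(L, LS)
            if shares_objects(cs, ls)]


# Illustrative inputs only.
C = ["port"]
CS = [ObjectSet({"port_number"}, {"int set_port(const char *)"})]
L = ["invalid port number _VARIABLE_", "server started successfully"]
LS = [ObjectSet({"port_number"}, set()), ObjectSet({"uptime"}, set())]
CL = identify_config_related_logs(C, CS, L, LS)
```

Only the first log message shares a program variable with the parameter's code objects, so it is the only one paired with `port` in `CL`.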
Second step: the configuration parameter related code object extraction module reads the software source code and the software configuration parameter name list file from the file system, obtains the configuration parameter name set C from the software configuration parameter name list file, extracts the configuration parameter related code objects from the software source code according to the software configuration parameter name list file to obtain the configuration parameter related code object sets, and sends C and the configuration parameter related code object sets to the configuration related log information identification module, as follows:
2.1 The configuration parameter related code object extraction module reads the software configuration parameter name list file from the file system to obtain the configuration parameter name set $C = \{c_1, c_2, \ldots, c_i, \ldots, c_I\}$, where $c_i$ is the $i$-th configuration parameter name in $C$ and is a constant string, $I$ is the total number of configuration parameter names in $C$, and $1 \le i \le I$;
2.2 Parse the software source code using the Clang front end (version 10.0.0 and above) of the LLVM compiler framework to generate the abstract syntax tree (AST) $AST_{root}$ corresponding to the software source code. Each node in $AST_{root}$ represents a structure in the source code, such as the whole source file (TranslationUnitDecl), a function declaration (FunctionDecl), an if branch statement (IfStmt), an assignment statement (AssignStmt), a function call (CallExpr), a constant string (StringLiteral), a binary operation (BinaryOperator), a single variable (DeclRefExpr), or a structure member variable (MemberExpr). The tree structure represents the containment relationships among these structures; for example, for an if branch statement inside a function declaration, the corresponding IfStmt node is located in the subtree rooted at the FunctionDecl node;
2.3 In the abstract syntax tree $AST_{root}$, extract the code object sets $CS_1, CS_2, \ldots, CS_i, \ldots, CS_I$ related to $c_1, c_2, \ldots, c_i, \ldots, c_I$, where $CS_i = \langle CV_i, CF_i \rangle$. $CV_i = \{cv_{i1}, cv_{i2}, \ldots, cv_{ip}, \ldots, cv_{iP}\}$ is the set of program variables related to the configuration parameter named $c_i$, and $cv_{ip}$ is the $p$-th program variable in $CV_i$ related to the configuration parameter named $c_i$. $CF_i = \{cf_{i1}, cf_{i2}, \ldots, cf_{iq}, \ldots, cf_{iQ}\}$ is the set of function signatures related to the configuration parameter named $c_i$ (a function signature defines the input and output of a function or method, and generally contains the parameters and their types, the return value and its type, etc.), and $cf_{iq}$ is the $q$-th function signature in $CF_i$ related to the configuration parameter named $c_i$. The specific method is as follows:
2.3.1 Initialize the variable $i = 1$, and initialize $CV_i = \{\}$ and $CF_i = \{\}$;
2.3.2 Traverse each node of $AST_{root}$ in turn using the traversal interfaces provided by Clang (version 10.0.0 and above), which take the form VisitNodeType, where NodeType is a node type in the abstract syntax tree as described in step 2.2. The interfaces used during traversal include VisitFunctionDecl and VisitStringLiteral, which visit nodes of type FunctionDecl and StringLiteral respectively; detailed interface information is given in the Clang interface document "https://clang.llvm.org/doxygen/classclang_1_1RecursiveASTVisitor.html#details". Locate the node containing the constant string $c_i$, mark it as Init_Node, and let Init_Node be the current subtree root node Current_Sub_AST;
2.3.3 Determine whether the subtree rooted at Current_Sub_AST contains a node for any other constant string $c_e$ in $C$ ($1 \le e \le I$, $e \ne i$); if so, go to 2.3.5; otherwise, go to 2.3.4;
2.3.4 Obtain the parent node Parent_Sub_AST of Current_Sub_AST in $AST_{root}$. If Parent_Sub_AST is a TranslationUnitDecl node, the traversal has reached $AST_{root}$; go to 2.3.5. Otherwise, let Current_Sub_AST = Parent_Sub_AST and go to 2.3.3;
2.3.5 Locate the AST subtree within Current_Sub_AST that contains Init_Node, and mark this subtree as the minimum common subtree Minimum_Common_Sub_AST of the code objects related to the configuration parameter named $c_i$;
2.3.6 Traverse all nodes in Minimum_Common_Sub_AST. Add the program variable name corresponding to each node of program variable type in Minimum_Common_Sub_AST to $CV_i$, and add the function signature corresponding to each node of function declaration type in Minimum_Common_Sub_AST to $CF_i$;
2.3.7 Take the pair formed by $CV_i$ and $CF_i$ as the related code object set $CS_i$ of the configuration parameter named $c_i$, that is, let $CS_i = \langle CV_i, CF_i \rangle$;
2.3.8 Let $i = i + 1$. If $i \le I$, let $CV_i = \{\}$ and $CF_i = \{\}$ and go to 2.3.2; otherwise, the configuration parameter related code object extraction module sends the configuration parameter name set $C$ and the configuration parameter related code object sets $CS_1, CS_2, \ldots, CS_i, \ldots, CS_I$ to the configuration related log information identification module.
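Steps 2.3.2 to 2.3.7 can be illustrated on a toy tree. The sketch below does not use Clang; the `Node` class and the hand-built tree are stand-ins for $AST_{root}$, and only the node type names follow step 2.2. It also adopts one reading of steps 2.3.3 to 2.3.5 as an assumption: climb from Init_Node while the parent subtree contains no other parameter name, and stop below the TranslationUnitDecl node.

```python
class Node:
    """A toy AST node (hypothetical stand-in for a Clang AST node)."""

    def __init__(self, kind, value=None, children=()):
        self.kind, self.value, self.parent = kind, value, None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()


def contains_other(node, other_names):
    """True if the subtree holds a constant string naming another parameter."""
    return any(n.kind == "StringLiteral" and n.value in other_names
               for n in node.walk())


def minimum_common_subtree(init_node, other_names):
    """Largest subtree around init_node that mentions no other parameter."""
    current = init_node
    while True:
        parent = current.parent
        if parent is None or parent.kind == "TranslationUnitDecl":
            return current
        if contains_other(parent, other_names):
            return current
        current = parent


def extract_code_objects(root, name, all_names):
    """Return (CV_i, CF_i) for the configuration parameter called `name`."""
    others = set(all_names) - {name}
    cv, cf = set(), set()
    for node in root.walk():
        if node.kind == "StringLiteral" and node.value == name:
            for n in minimum_common_subtree(node, others).walk():
                if n.kind in ("DeclRefExpr", "MemberExpr"):
                    cv.add(n.value)
                elif n.kind == "FunctionDecl":
                    cf.add(n.value)
    return cv, cf


# Hand-built example tree; all names are illustrative.
root = Node("TranslationUnitDecl", children=[
    Node("FunctionDecl", "void init()", [
        Node("CallExpr", "register", [Node("StringLiteral", "datadir"),
                                      Node("DeclRefExpr", "data_dir_var")]),
        Node("CallExpr", "register", [Node("StringLiteral", "port"),
                                      Node("DeclRefExpr", "port_var")]),
    ]),
    Node("FunctionDecl", "int get_timeout()", [
        Node("ReturnStmt", children=[Node("StringLiteral", "timeout"),
                                     Node("DeclRefExpr", "timeout_var")]),
    ]),
])
names = ["datadir", "port", "timeout"]
```

For `datadir`, the climb stops at the first `CallExpr` because the enclosing function also mentions `port`, so only `data_dir_var` is collected; for `timeout`, the whole `get_timeout` function is free of other names, so its signature is collected as well.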
Third step: the log information related program object extraction module reads the software source code from the file system and uses static program analysis to extract all potential log information in the software source code, obtaining the potential log information set $L = \{l_1, l_2, \ldots, l_j, \ldots, l_J\}$, where $l_j$ is the $j$-th piece of log information, $J$ is the total number of pieces of log information in $L$, and $1 \le j \le J$. It then obtains the related program object sets $LS_1, LS_2, \ldots, LS_j, \ldots, LS_J$ of $l_1, l_2, \ldots, l_j, \ldots, l_J$, where $LS_j = \langle LV_j, LF_j \rangle$. $LV_j = \{lv_{j1}, lv_{j2}, \ldots, lv_{ju}, \ldots, lv_{jU}\}$ is the set of program variables related to log information $l_j$, where $lv_{ju}$ is the $u$-th related program variable and $1 \le u \le U$. $LF_j = \{lf_{j1}, lf_{j2}, \ldots, lf_{jv}, \ldots, lf_{jV}\}$ is the set of function signatures related to log information $l_j$, where $lf_{jv}$ is the $v$-th related function signature and $1 \le v \le V$. The method is as follows:
3.1 Initialize $L = \{\}$;
3.2 Traverse each node of $AST_{root}$ in turn using the traversal interfaces provided by Clang (the VisitNodeType interfaces) and select the nodes of constant string type (StringLiteral). Denote the constant string of such a node as $t$ and take $t$ as the candidate log information $l_{candidate}$. If $t$ is located in a single complete program statement that contains multiple constant strings (denoted $t, t', t'', \ldots$), replace every program variable appearing in that statement with the string "_VARIABLE_", then combine all constant strings ($t, t', t'', \ldots$) and the "_VARIABLE_" placeholders in their order of appearance in the statement, separated by spaces, to form one piece of candidate log information $l_{candidate}$. For example, for the log information related program statement "apr_pstrcat(cmd->temp_pool, "LimitRequestFields \"", arg, "\" must be a non-negative integer (0=no limit)", NULL);", the log information related program object extraction module extracts the candidate log information $l_{candidate}$ "_VARIABLE_ LimitRequestFields _VARIABLE_ must be a non-negative integer (0=no limit)"; in this example, $t$ is "LimitRequestFields \"" and $t'$ is "\" must be a non-negative integer (0=no limit)". If the length of $l_{candidate}$ is less than 10 or $l_{candidate}$ contains no spaces, discard $l_{candidate}$; otherwise, add $l_{candidate}$ to the potential log information set $L$. After all nodes of $AST_{root}$ have been traversed, go to 3.3;
3.3 Once the constant string type (StringLiteral) nodes of $AST_{root}$ have been traversed, the potential log information set $L = \{l_1, l_2, \ldots, l_j, \ldots, l_J\}$ is obtained, where $l_j$ is the $j$-th piece of log information, $J$ is the total number of pieces of log information in $L$, and $1 \le j \le J$;
3.4 Initialize the variable $j = 1$, initialize the program variable set $LV_j = \{\}$, and initialize the function signature set $LF_j = \{\}$;
3.5 For the $j$-th element $l_j$ in $L$, extract the program objects related to $l_j$ based on a backward slicing technique ("Thin Slicing" by Sridharan M. et al., published in PLDI 2007) to form the potential log information related program object set $LS_j$ of $l_j$, $LS_j = \langle LV_j, LF_j \rangle$, where $LV_j$ stores all program variables in the program context related to $l_j$, and $LF_j$ stores the function signatures corresponding to the function calls in the program context related to $l_j$. The method is as follows:
3.5.1 Mark the node where $l_j$ is located in $AST_{root}$ as the current node cur_node, and add all program variables in the subtree rooted at cur_node (the corresponding node types are DeclRefExpr, representing a single variable, and MemberExpr, representing a structure member variable) to $LV_j$. If cur_node is located in the Then/Else processing code of an if branch statement, add all program variables (node types DeclRefExpr and MemberExpr) contained in the subtree rooted at the node of the branch condition of that if branch statement to $LV_j$, and add the function signatures corresponding to the function calls in the subtree rooted at the node of the branch condition to $LF_j$;
3.5.2 obtaining parent node parent_node of cur_node, numbering all child nodes in the subtree with parent_node as root node according to the appearance sequence, marking the sequence number of cur_node as x, if x=1, indicating that cur_node is before statement where cur_node is located in the subtree with parent_node as root node, turning to 3.5.2; if x is greater than 1, turn 3.5.3;
3.5.3 Traverse the (x-1)-th, (x-2)-th, ..., 1st child nodes of parent_node in turn and find the program variables and function signatures related to l_j, as follows:
3.5.3.1 If cur_node is an assignment statement node (node type AssignStmt) and the variable var corresponding to the left value of the assignment (the value on the left of the assignment operator, an addressable object stored in memory) satisfies var ∈ LV_j, add all program variables contained in the right value of the assignment (the value on the right of the assignment operator, i.e. readable data stored at a memory address) to LV_j; if the right value contains a function call, add the function signature of that call to LF_j; go to 3.5.4;
3.5.3.2 If cur_node is a function call statement node (node type CallExpr) and an actual-argument variable var of the call satisfies var ∈ LV_j, add the function signature of the call to LF_j; go to 3.5.4;
3.5.4 Let cur_node = parent_node; if cur_node is a FunctionDecl node (i.e. the root of the abstract syntax subtree of the current function definition has been reached), the traversal has reached the root of the subtree containing the current function body, so go to 3.5.5; otherwise go to 3.5.2;
3.5.5 Add the signature of the function declared by cur_node to LF_j;
3.6 Let j = j+1; if j ≤ J, go to 3.5; otherwise the log information related program object extraction module has obtained L = {l_1, l_2, ..., l_j, ..., l_J} and LS_1, LS_2, ..., LS_j, ..., LS_J, and sends the potential log information set L and the related program object sets LS_1, LS_2, ..., LS_j, ..., LS_J to the configuration related log information identification module.
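The upward walk of steps 3.5.1 to 3.5.5 can be sketched on a toy statement tree. This is a minimal illustration, not the patented implementation: the `Stmt` class and its fields are hypothetical stand-ins for Clang AST nodes, the If-condition handling of step 3.5.1 is omitted, and the sketch assumes that every preceding sibling is scanned at each level before moving up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Stmt:
    """Hypothetical stand-in for an AST node (not a Clang type)."""
    kind: str                                        # "assign", "call", "log", "block", "function"
    lhs: Optional[str] = None                        # assigned variable ("assign" only)
    uses: List[str] = field(default_factory=list)    # variables read or passed as arguments
    calls: List[str] = field(default_factory=list)   # signatures of called functions
    signature: Optional[str] = None                  # "function" only
    children: List["Stmt"] = field(default_factory=list)
    parent: Optional["Stmt"] = None


def link(node):
    """Fill in parent pointers for the whole tree."""
    for child in node.children:
        child.parent = node
        link(child)
    return node


def backward_slice(log_stmt):
    """Collect (LV_j, LF_j) for one log statement by walking towards the function root."""
    lv = set(log_stmt.uses)   # step 3.5.1 (without the If-condition case)
    lf = set()
    cur = log_stmt
    while cur.parent is not None:
        parent = cur.parent
        x = next(i for i, c in enumerate(parent.children) if c is cur)
        for sib in reversed(parent.children[:x]):     # step 3.5.3
            if sib.kind == "assign" and sib.lhs in lv:
                lv.update(sib.uses)                   # step 3.5.3.1
                lf.update(sib.calls)
            elif sib.kind == "call" and lv.intersection(sib.uses):
                lf.update(sib.calls)                  # step 3.5.3.2
        cur = parent                                  # step 3.5.4
        if cur.kind == "function":
            lf.add(cur.signature)                     # step 3.5.5
            break
    return lv, lf


# Toy function body: lim = atoi(arg); unrelated(other); { log(lim); }
fn = link(Stmt("function", signature="const char *set_limit(cmd_parms *, const char *)", children=[
    Stmt("assign", lhs="lim", uses=["arg"], calls=["int atoi(const char *)"]),
    Stmt("call", uses=["other"], calls=["void unrelated(int)"]),
    Stmt("block", children=[Stmt("log", uses=["lim"])]),
]))
lv, lf = backward_slice(fn.children[2].children[0])
```

In the toy example the slice picks up `arg` through the assignment to `lim` and ignores the unrelated call, which is the behaviour the steps describe for variables that do not feed the log statement.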
Fourth step, the configuration related log information identification module receives C and CS_1, CS_2, ..., CS_i, ..., CS_I from the configuration parameter related code object extraction module, receives L and LS_1, LS_2, ..., LS_j, ..., LS_J from the log information related program object extraction module, screens the configuration related log information out of L to obtain the binary pair set CL of configuration related log information, and sends the configuration parameter name set C and the binary pair set CL to the configuration constraint information identification module, as follows:
4.1 initializing variable j=1;
4.2 initializing variable i=1;
4.3 Match the j-th log information l_j in L against the i-th configuration parameter name c_i in C to find binary pairs of l_j and c_i that are associated, as follows:
4.3.1 Configuration parameter names in software usually consist of several words joined in camel case (Camel-Case, a naming convention in programming in which the names of variables and functions mix upper- and lower-case letters) or by common characters (such as "_"), so the configuration related log information identification module splits c_i by word composition and camel case to obtain the corresponding word set CWords_i; for example, if c_i is the name string of the configuration parameter "DataDirectory", then CWords_i = {"data", "directory"}; let the number of words in CWords_i be |CWords_i|;
4.3.2 If |CWords_i| = 1, go to 4.3.3; if |CWords_i| > 1, go to 4.3.4;
4.3.3 Take the word cword in CWords_i and check whether cword matches l_j, as follows:
4.3.3.1 If cword is quoted in l_j, i.e. cword is immediately preceded and followed by double or single quotation marks, go to 4.3.3.4; otherwise go to 4.3.3.2;
4.3.3.2 If l_j contains any of the keywords "configuration", "option", "direct" and "parameter" modifying the word cword, i.e. the keyword occurs immediately adjacent to cword in l_j, go to 4.3.3.4; otherwise go to 4.3.3.3;
4.3.3.3 CWords_i fails to match log information l_j; go to 4.3.6;
4.3.3.4 CWords_i matches log information l_j successfully; go to 4.3.7;
4.3.4 Join all words in CWords_i with the connector string (which appears as "[]" in the example that follows) to produce a string denoted CReg_i; for example, for the configuration parameter "DataDirectory", CWords_i = {"data", "directory"} and the generated CReg_i is "data []directory";
4.3.5 Match the string CReg_i in log information l_j using regular-expression matching rules; if the match succeeds, go to 4.3.7; otherwise go to 4.3.6;
4.3.6 Match the code object set CS_i related to c_i against the program object set LS_j related to l_j, as follows:
4.3.6.1 Initialize variable p = 1;
4.3.6.2 Initialize variable u = 1;
4.3.6.3 If cv_ip = lv_ju, the match succeeds; go to 4.3.7; otherwise go to 4.3.6.4;
4.3.6.4 Let u = u+1; if u ≤ U, go to 4.3.6.3; otherwise go to 4.3.6.5;
4.3.6.5 Let p = p+1; if p ≤ P, go to 4.3.6.2; otherwise go to 4.3.6.6;
4.3.6.6 Initialize variable q = 1;
4.3.6.7 Initialize variable v = 1;
4.3.6.8 If cf_iq = lf_jv, the match succeeds; go to 4.3.7; otherwise go to 4.3.6.9;
4.3.6.9 Let v = v+1; if v ≤ V, go to 4.3.6.8; otherwise go to 4.3.6.10;
4.3.6.10 Let q = q+1; if q ≤ Q, go to 4.3.6.7; otherwise go to 4.3.6.11;
4.3.6.11 Initialize variable u = 1;
4.3.6.12 If Similarity(c_i, lv_ju) > 0.63, the match succeeds; go to 4.3.7; otherwise go to 4.3.6.13; here Similarity(c_i, lv_ju) is the similarity between c_i and lv_ju, computed as follows:
For the two strings c_i and lv_ju whose similarity is to be calculated, first apply word segmentation and lemmatization (removing affixes and extracting the stem of each word): segmenting and lemmatizing c_i yields the word set CW, and segmenting and lemmatizing lv_ju yields the word set VW. Then compute the weight of each word in CW and VW with the IDF (Inverse Document Frequency) algorithm (K. S. Jones et al., "A statistical interpretation of term specificity and its application in retrieval", Journal of Documentation, 1972); specifically, each configuration parameter name provided by the software is treated as a document in the IDF algorithm and the set of all configuration parameter names as the corpus, and the weight of each word in CW and VW is computed on that basis. Finally, the similarity between the two is calculated by the following formula:
Figure BDA0003872406450000091
where word is a word contained in the sets CW and VW;
4.3.6.13 Let u = u+1; if u ≤ U, go to 4.3.6.12; otherwise the code object set CS_i related to c_i and the program object set LS_j related to l_j do not match; go to 4.4;
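The IDF-weighted comparison of step 4.3.6.12 can be illustrated as follows. The patent's exact formula is given only as a figure that is not reproduced in this text, so the ratio used here (IDF weight of the shared words divided by IDF weight of all words in CW and VW) is an assumption, as are the smoothed IDF definition, the omission of lemmatization, and all function names.

```python
import math
import re


def split_words(name):
    """Split a name on camel case, digits and separators; lower-case only, no lemmatization."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def idf_weights(config_names):
    """Treat each configuration parameter name as one document of the corpus."""
    n = len(config_names)
    df = {}
    for name in config_names:
        for word in set(split_words(name)):
            df[word] = df.get(word, 0) + 1
    # Smoothed IDF (assumption): log(N / df) + 1
    return {word: math.log(n / count) + 1.0 for word, count in df.items()}


def similarity(config_name, variable_name, idf, default_weight):
    """Assumed form: weight of shared words over weight of all words in CW and VW."""
    cw = set(split_words(config_name))
    vw = set(split_words(variable_name))
    union = cw | vw
    if not union:
        return 0.0
    weight = lambda w: idf.get(w, default_weight)
    return sum(weight(w) for w in cw & vw) / sum(weight(w) for w in union)


names = ["DataDirectory", "LogDirectory", "MaxClients"]
idf = idf_weights(names)
unseen = math.log(len(names)) + 1.0   # weight for words absent from the corpus (assumption)
same = similarity("DataDirectory", "data_directory", idf, unseen)
partial = similarity("DataDirectory", "data_dir", idf, unseen)
```

With this assumed formula, a variable named `data_directory` scores 1 against `DataDirectory`, while `data_dir` falls below the 0.63 threshold because the abbreviation `dir` is treated as a different word.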
4.3.7 Log information l_j and configuration parameter name c_i match successfully; add the binary pair <c_i, l_j> to the set CL;
4.4 Let i = i+1; if i ≤ I, go to 4.3; otherwise go to 4.5;
4.5 Let j = j+1; if j ≤ J, go to 4.2; otherwise the processing of C and L is complete and the binary pair set CL = {<c_1, l_1>, <c_2, l_2>, ..., <c_m, l_m>, ..., <c_M, l_M>} is obtained; go to 4.6;
4.6, the configuration related log information identification module sends the configuration parameter name set C and the binary pair set CL of the configuration related log information to the configuration constraint information identification module;
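The name-based matching of steps 4.3.1 to 4.3.5 can be sketched as below. The function names are hypothetical, the keyword list is copied from step 4.3.3.2 as it appears in this text, and the connector `[ _-]?` between words is an assumption, because the original connector string is not legible here.

```python
import re

# Keyword list as given in step 4.3.3.2 of the text.
KEYWORDS = ("configuration", "option", "direct", "parameter")
# Hypothetical connector between the words of a multi-word name (assumption).
CONNECTOR = r"[ _-]?"


def split_name(name):
    """Step 4.3.1: split on camel case and separators, lower-cased."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def name_matches_log(name, log):
    """Return True when the parameter name is judged to occur in the log message."""
    words = split_name(name)
    if not words:
        return False
    text = log.lower()
    if len(words) == 1:
        word = re.escape(words[0])
        quote = "[\"']"
        # Step 4.3.3.1: the word is enclosed in single or double quotation marks.
        if re.search(quote + word + quote, text):
            return True
        # Step 4.3.3.2: a keyword occurs immediately next to the word.
        kw = "(?:" + "|".join(KEYWORDS) + ")"
        adjacent = r"\b" + kw + r"\s+" + word + r"\b|\b" + word + r"\s+" + kw + r"\b"
        return re.search(adjacent, text) is not None
    # Steps 4.3.4 and 4.3.5: join the words and search with a regular expression.
    pattern = CONNECTOR.join(re.escape(w) for w in words)
    return re.search(pattern, text) is not None
```

The single-word branch is deliberately stricter than the multi-word branch: a one-word name such as `Listen` would otherwise match ordinary prose like "listening on port".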
Fifth step, manually screen texts describing configuration constraints from the configuration-related documents and source-code logs of 11 software systems, namely Apache Hadoop, HDFS, Yarn, Alluxio, Cassandra, Spark, Hypertable, MongoDB, AOLServer, Subversion and OpenLDAP, collecting 338 configuration constraint text descriptions in total; these 338 descriptions are recorded as the configuration constraint description document set DS (the number of descriptions in DS is |DS| = 338).
Sixth step, manually collect words that describe error-related states through the WordNet dictionary interface (WordNet is an English dictionary based on cognitive linguistics, jointly designed by psychologists, linguists and computer engineers at Princeton University) to form the error description related word set λ.
Seventh step, the natural language template generation module receives DS and the error description related word set λ from the user and generates the configuration constraint natural language description template set LanPatterns, as follows:
7.1, the natural language template generating module obtains a natural language template describing configuration constraint according to DS, and the specific steps are as follows:
7.1.1 initializing variable y=1;
7.1.2 For the y-th text description d_y in DS, generate the POS tag sequence pairs <pos_y1, lemma_y1>, <pos_y2, lemma_y2>, ..., <pos_yz, lemma_yz>, ..., <pos_yh, lemma_yh> corresponding to d_y using the spaCy open-source library (spaCy is a natural language processing library written in Python and Cython; version ≥ 3.1.0 is required); the set of these h pairs is called the first POS tag sequence set 《pos_yh, lemma_yh》, where pos_yz (1 ≤ z ≤ h) is the part-of-speech tag of the z-th word in d_y (e.g. noun (NN), verb (VB), adjective (JJ)), lemma_yz is the base form of the z-th word in d_y after lemmatization, and h is the total number of POS tag sequence pairs;
7.1.3 Apply removal, merging and replacement to 《pos_yh, lemma_yh》 to obtain the second POS tag sequence set 《pos_yh′, lemma_yh′》, as follows:
Remove from 《pos_yh, lemma_yh》 every binary pair whose pos_yz is DT (the part-of-speech tag of determiners) or SYM (the part-of-speech tag of symbols); merge consecutive binary pairs whose pos_yz is NN (noun) or JJ (adjective), i.e. if pos_yz = NN and pos_y(z+1) = NN, or pos_yz = JJ and pos_y(z+1) = JJ, merge <pos_yz, lemma_yz> and <pos_y(z+1), lemma_y(z+1)> into <pos_yz, lemma_yz + lemma_y(z+1)>, where lemma_yz + lemma_y(z+1) denotes the words lemma_yz and lemma_y(z+1) joined by a space; and when lemma_yz is a word in the set λ, replace lemma_yz with the string ERROR_STATUS. This yields the second POS tag sequence set 《pos_yh′, lemma_yh′》;
7.1.4 Let y = y+1; if y ≤ 338, go to 7.1.2; otherwise go to 7.1.5;
7.1.5 Mine the frequent items in 《pos_yh′, lemma_yh′》 using the Apriori frequent itemset mining algorithm (see the book "Data Mining: Concepts and Techniques" by Jiawei Han et al., 2011), and add the five most frequently occurring sequences to the configuration constraint natural language description template set LanPatterns;
7.2 The natural language template generation module sends the natural language description template set LanPatterns and the error description related word set λ to the configuration constraint information identification module.
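The removal, merging and replacement of step 7.1.3 can be sketched on plain (POS tag, lemma) pairs, without calling spaCy. The function name and the pass order (remove, then merge, then replace) are assumptions made for illustration; the text does not state whether adjacency is judged before or after removal.

```python
REMOVED_TAGS = ("DT", "SYM")   # determiners and symbols are dropped
MERGED_TAGS = ("NN", "JJ")     # runs of nouns or of adjectives are merged


def normalize_pairs(pairs, error_words):
    """Apply step 7.1.3 to a list of (pos, lemma) pairs."""
    # Removal: drop DT and SYM pairs.
    kept = [(pos, lemma) for pos, lemma in pairs if pos not in REMOVED_TAGS]
    # Merging: join consecutive pairs that share the tag NN or JJ with a space.
    merged = []
    for pos, lemma in kept:
        if merged and pos in MERGED_TAGS and merged[-1][0] == pos:
            merged[-1] = (pos, merged[-1][1] + " " + lemma)
        else:
            merged.append((pos, lemma))
    # Replacement: lemmas found in the error word set become ERROR_STATUS.
    return [(pos, "ERROR_STATUS" if lemma in error_words else lemma)
            for pos, lemma in merged]


example = [("DT", "the"), ("JJ", "invalid"), ("NN", "port"),
           ("NN", "number"), ("VB", "be"), ("SYM", "%")]
normalized = normalize_pairs(example, {"invalid"})
```

Here the two consecutive nouns collapse into one item, so a template mined from the normalized sequences does not depend on how many words a noun phrase happens to contain.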
Eighth step, the configuration constraint information identification module receives CL = {<c_1, l_1>, <c_2, l_2>, ..., <c_m, l_m>, ..., <c_M, l_M>} from the configuration related log information identification module, receives LanPatterns and λ from the natural language template generation module, and identifies the log information in CL that contains a configuration constraint description based on LanPatterns, as follows:
8.1 Initialize variable m = 1 and initialize the set of log information containing configuration constraints, ConstraintDescSet = {};
8.2 For the binary pair <c_m, l_m> in CL, if l_m contains c_m, replace c_m in l_m with the string CONFIG and go to 8.3; if l_m does not contain c_m, go directly to 8.3;
8.3 If l_m contains a word in λ, replace that word in l_m with the string "ERROR_STATUS" and go to 8.4; if l_m contains no word in λ, go directly to 8.4;
8.4 Generate the third POS tag sequence set 《pos_mh, lemma_mh》 corresponding to l_m using the spaCy open-source library, then apply the removal, merging and replacement methods of step 7.1.3 to 《pos_mh, lemma_mh》 to obtain the fourth POS tag sequence set 《pos_mh′, lemma_mh′》;
8.5 Check whether 《pos_mh′, lemma_mh′》 matches any template in LanPatterns; if the match succeeds, add the binary pair <c_m, l_m> to the set ConstraintDescSet and go to 8.6; otherwise go directly to 8.6;
8.6 Let m = m+1; if m ≤ |CL|, where |CL| is the number of elements in CL, go to 8.2; otherwise go to 8.7;
8.7 The configuration constraint information identification module outputs the set ConstraintDescSet to the user; ConstraintDescSet contains all log information in L that contains configuration constraints, ConstraintDescSet = {<c_1, l_1>, <c_2, l_2>, ..., <c_r, l_r>, ..., <c_R, l_R>}, where <c_r, l_r> is the r-th binary pair in ConstraintDescSet, c_r is the configuration parameter name, l_r is the corresponding log information containing a configuration constraint description, R is the total number of binary pairs in ConstraintDescSet, and 1 ≤ r ≤ R;
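Steps 8.2, 8.3 and 8.5 can be illustrated with the sketch below. The text does not spell out what it means for a tag sequence to match a template, so treating a template as an ordered subsequence of POS tags is an assumption, as are the function names and the whitespace tokenization used for the error-word replacement.

```python
def mask_message(message, config_name, error_words):
    """Steps 8.2 and 8.3: replace the parameter name and error words with placeholders."""
    masked = message.replace(config_name, "CONFIG")
    tokens = []
    for token in masked.split():
        # Compare without trailing punctuation; the whole token is replaced (simplification).
        bare = token.lower().strip(".,:;!?")
        tokens.append("ERROR_STATUS" if bare in error_words else token)
    return " ".join(tokens)


def is_subsequence(template, sequence):
    """Assumed matching rule for step 8.5: template items occur in order in the sequence."""
    remaining = iter(sequence)
    return all(item in remaining for item in template)


masked = mask_message("LimitRequestFields must be valid, got invalid value",
                      "LimitRequestFields", {"invalid"})
```

In a full pipeline the masked message would be tagged (step 8.4) and its normalized tag sequence compared against each template in LanPatterns with a rule such as `is_subsequence`.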
Ninth step, the user checks, according to the set ConstraintDescSet of log information containing configuration constraints output by the configuration constraint information identification module, whether the configuration parameter settings in the configuration file satisfy the constraints, and thereby predicts configuration faults, as follows:
9.1 The user reads the ConstraintDescSet output by the configuration constraint information identification module;
9.2 Initialize variable r = 1;
9.3 The user reads the configuration file of the target software and checks whether a configuration parameter named c_r exists in it; if so, go to 9.4; otherwise go to 9.6;
9.4 Check whether the value of the configuration parameter named c_r in the configuration file satisfies the configuration constraint described by l_r; if so, go to 9.6; otherwise the configuration file contains a parameter setting that violates a configuration constraint, i.e. the current configuration file has a configuration fault; go to 9.5;
9.5 The current configuration file has a configuration fault and the check fails; the user adjusts the value of the configuration parameter named c_r into the legal range according to the configuration constraint described by l_r, and after the adjustment goes to 9.6 to continue checking the remaining items.
9.6 Let r = r+1; if r ≤ R, go to 9.3; otherwise go to 9.7;
9.7 No configuration parameter setting that violates a configuration constraint is found in the configuration file, indicating that the current configuration file has no configuration fault; the check ends.
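In the text the ninth step is carried out by the user. Purely as an illustration, the loop of steps 9.2 to 9.6 could be automated once each constraint description l_r has been translated by hand into a predicate; the `Name Value` file layout, the function names and the example predicate below are all assumptions.

```python
def parse_config(text):
    """Parse 'Name Value' lines, skipping blanks and '#' comments (assumed file layout)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition(" ")
        settings[name] = value.strip()
    return settings


def find_violations(settings, constraints):
    """Steps 9.3 and 9.4 for every pair: report parameters whose value breaks a constraint."""
    violations = []
    for name, (description, predicate) in constraints.items():
        if name in settings and not predicate(settings[name]):
            violations.append((name, settings[name], description))
    return violations


# Hand-written predicate for one constraint description (hypothetical encoding).
constraints = {
    "LimitRequestFields": ("must be a non-negative integer (0 = no limit)",
                           lambda value: value.isdigit()),
}
bad = find_violations(parse_config("# demo\nLimitRequestFields -5\nTimeout 60"), constraints)
good = find_violations(parse_config("LimitRequestFields 100"), constraints)
```

Parameters that do not appear in the configuration file are skipped, mirroring the jump from 9.3 to 9.6.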
Compared with the prior art, the invention has the following beneficial effects:
1. The invention can effectively extract constraint information for software configuration parameters. It extracts 427 configuration parameter constraints from 7 large open-source software systems (MySQL, Apache Httpd, PostgreSQL, Nginx, Lighttpd, Squid and Postfix), of which 263 cannot be extracted by existing work (namely "Do Not Blame Users for Misconfigurations", published by Tianyin Xu et al. at SOSP 2013, which extracts configuration constraints based on predefined code patterns); compared with the existing extraction method, the invention therefore extracts constraint information more completely.
2. The invention detected 67 documentation-related defects for software communities, which serve to supplement missing or erroneous descriptions of configuration parameter constraints in the documentation; patches for 14 of these defects have been accepted, helping to prevent configuration faults. The accepted documentation defect patch IDs are: httpd-64893, httpd-64904, httpd-64909, mySQL-101512, mySQL-101513, mySQL-101514, mySQL-101515, mySQL-101516, mySQL-101519, mySQL-101520, lighttpd-3035, lighttpd-3036, lighttpd-3038, lighttpd-3040.
Drawings
FIG. 1 is a logical block diagram of a configuration parameter constraint extraction system constructed in a first step of the present invention;
FIG. 2 is a general flow chart of the present invention;
FIG. 3 is a table of configuration parameter constraint description natural language templates constructed in the fourth step of the present invention.
Detailed Description
The invention will be described below with reference to the accompanying drawings using Httpd Web server software as an example.
As shown in fig. 2, the present invention includes the steps of:
First step, construct a configuration parameter constraint extraction system. As shown in fig. 1, the system consists of a configuration parameter related code object extraction module, a log information related program object extraction module, a configuration related log information identification module, a natural language template generation module and a configuration constraint information identification module. The configuration parameter related code object extraction module reads the Httpd software source code and the Httpd software configuration parameter name list file from the file system, obtains the configuration parameter name set C from the name list file, extracts the configuration parameter related code objects from the source code according to the name list file to obtain the configuration parameter related code object sets, and sends C and the configuration parameter related code object sets to the configuration related log information identification module; the log information related program object extraction module reads the Httpd software source code from the file system, extracts all potential log information in it to obtain the potential log information set L, extracts the related program object sets of the potential log information from the source code, and sends L and these related program object sets to the configuration related log information identification module; the configuration related log information identification module receives C and the configuration parameter related code object sets from the configuration parameter related code object extraction module, receives L and the related program object sets of the potential log information from the log information related program object extraction module, identifies the binary pair set CL of configuration related log information by matching configuration parameter related code objects against log information related program objects, and sends C and CL to the configuration constraint information identification module; the natural language template generation module receives the configuration constraint description document set DS and the error description related word set λ from the user, generates the configuration constraint natural language description template set, and sends it to the configuration constraint information identification module; the configuration constraint information identification module receives C and CL from the configuration related log information identification module, receives the configuration constraint natural language description template set from the natural language template generation module, receives the error description related word set from the user, matches the configuration related log information against the configuration constraint natural language description templates, and identifies the configuration related log information that contains configuration constraints.
Second step, the configuration parameter related code object extraction module reads the Httpd software source code and the Httpd software configuration parameter name list file from the file system, obtains the configuration parameter name set C from the name list file, extracts the configuration parameter related code objects from the source code according to the name list file to obtain the configuration parameter related code object sets, and sends C and the configuration parameter related code object sets to the configuration related log information identification module, as follows:
2.1 The configuration parameter related code object extraction module reads the Httpd software configuration parameter name list file from the file system to obtain the configuration parameter name set C = {c_1, c_2, ..., c_i, ..., c_I}, where c_i is the i-th configuration parameter name in C and is a constant string, I is the total number of configuration parameter names in C, 1 ≤ i ≤ I, and I = 694;
2.2 Parse the Httpd software source code using the Clang front end (version 10.0.0) of the LLVM compiler framework to generate the abstract syntax tree (Abstract Syntax Tree) AST_root of the source code. Each node in AST_root represents a construct in the source code, such as the whole source file (TranslationUnitDecl), a function declaration (FunctionDecl), an If branch statement (IfStmt), an assignment statement (AssignStmt), a function call (CallExpr), a constant string (StringLiteral), a binary operation (BinaryOperator), a single variable (DeclRefExpr) or a structure member variable (MemberExpr), and the tree structure expresses the containment relationships between constructs; for example, for an If branch statement inside a function declaration, the corresponding IfStmt node lies in the subtree rooted at the FunctionDecl node;
2.3 Extract from the abstract syntax tree AST_root the code object sets CS_1, CS_2, ..., CS_i, ..., CS_I related to c_1, c_2, ..., c_i, ..., c_I, where CS_i = <CV_i, CF_i>; CV_i is the set of program variables related to the configuration parameter named c_i, CV_i = {cv_i1, cv_i2, ..., cv_ip, ..., cv_iP}, and cv_ip is the p-th program variable in CV_i related to the configuration parameter named c_i; CF_i is the set of function signatures related to the configuration parameter named c_i (a function signature defines the inputs and outputs of a function or method and generally includes the parameters and their types and the return value and its type), CF_i = {cf_i1, cf_i2, ..., cf_iq, ..., cf_iQ}, and cf_iq is the q-th function signature in CF_i related to the configuration parameter named c_i. The specific method is as follows:
2.3.1 Initialize variable i = 1, CV_i = {} and CF_i = {};
2.3.2 Using the traversal interfaces provided by Clang (version 10.0.0 and above), which take the form VisitNodeType, where NodeType is a node type in the abstract syntax tree as listed in step 2.2 (for example, VisitFunctionDecl visits FunctionDecl nodes and VisitStringLiteral visits StringLiteral nodes), traverse each node of AST_root in turn, locate the node containing the constant string c_i, denote it init_node, and mark init_node as the current subtree root node Current_Sub_AST;
2.3.3 Determine whether the subtree rooted at Current_Sub_AST contains a node holding another constant string c_e in C (1 ≤ e ≤ I, e ≠ i); if so, go to 2.3.5; otherwise go to 2.3.4;
2.3.4 Obtain the parent node Parent_Sub_AST of Current_Sub_AST in AST_root; if Parent_Sub_AST is a TranslationUnitDecl node, the traversal has reached the root of AST_root, so go to 2.3.5; otherwise let Current_Sub_AST = Parent_Sub_AST and go to 2.3.3;
2.3.5 Locate the AST subtree within Current_Sub_AST that contains init_node, and mark this subtree as the minimum common subtree Minimum_Common_Sub_AST of the code objects related to the configuration parameter named c_i;
2.3.6 Traverse all nodes in Minimum_Common_Sub_AST; add the program variable names of the nodes whose type is a program variable to CV_i, and add the function signatures of the nodes whose type is a function declaration to CF_i;
2.3.7 Take the binary pair formed by CV_i and CF_i as the related code object set CS_i of the configuration parameter named c_i, i.e. let CS_i = <CV_i, CF_i>;
2.3.8 Let i = i+1; if i ≤ I, let CV_i = {} and CF_i = {} and go to 2.3.2; otherwise the configuration parameter related code object extraction module sends the configuration parameter name set C and the configuration parameter related code object sets CS_1, CS_2, ..., CS_i, ..., CS_I to the configuration related log information identification module.
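The climb of steps 2.3.2 to 2.3.6 can be sketched on a toy tree. The `Node` class is a hypothetical stand-in for Clang AST nodes, a string literal is assumed to hold exactly one parameter name, and the example tree (a table of parameter names and handler variables) is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for an AST node (not a Clang type)."""
    kind: str
    value: str = ""
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None


def link(node):
    """Fill in parent pointers for the whole tree."""
    for child in node.children:
        child.parent = node
        link(child)
    return node


def walk(node):
    """Yield the node and all its descendants."""
    yield node
    for child in node.children:
        yield from walk(child)


def contains_other_name(subtree, names, own):
    """Step 2.3.3: does the subtree hold a string literal naming another parameter?"""
    return any(n.kind == "StringLiteral" and n.value in names and n.value != own
               for n in walk(subtree))


def minimum_common_subtree(init_node, names):
    current = init_node
    # Steps 2.3.3 and 2.3.4: climb until another name appears or the root is next.
    while not contains_other_name(current, names, init_node.value):
        parent = current.parent
        if parent is None or parent.kind == "TranslationUnitDecl":
            break
        current = parent
    # Step 2.3.5: the subtree within current that contains init_node.
    child = init_node
    while child is not current and child.parent is not current:
        child = child.parent
    return child


entry1 = Node("InitListExpr", children=[Node("StringLiteral", "Timeout"),
                                        Node("DeclRefExpr", "set_timeout")])
entry2 = Node("InitListExpr", children=[Node("StringLiteral", "Listen"),
                                        Node("DeclRefExpr", "set_listen")])
root = link(Node("TranslationUnitDecl", children=[
    Node("VarDecl", "core_cmds", children=[Node("InitListExpr", children=[entry1, entry2])])]))
subtree = minimum_common_subtree(entry1.children[0], {"Timeout", "Listen"})
cv = {n.value for n in walk(subtree) if n.kind == "DeclRefExpr"}   # step 2.3.6, variables only
```

Stopping just below the first ancestor that also mentions another parameter keeps the code objects of `Timeout` separate from those of `Listen`, even though both sit in the same table.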
Third step, the log information related program object extraction module reads the Httpd software source code from the file system and extracts all potential log information in it by static program analysis, obtaining the potential log information set L = {l_1, l_2, ..., l_j, ..., l_J}, where l_j is the j-th log information, J is the total number of log information items in L, 1 ≤ j ≤ J, and J = 4382; it also obtains the related program object sets LS_1, LS_2, ..., LS_j, ..., LS_J of l_1, l_2, ..., l_j, ..., l_J, where LS_j = <LV_j, LF_j>; LV_j is the set of program variables related to log information l_j, LV_j = {lv_j1, lv_j2, ..., lv_ju, ..., lv_jU}, 1 ≤ u ≤ U, where lv_ju is the u-th related program variable; LF_j is the set of function signatures related to log information l_j, LF_j = {lf_j1, lf_j2, ..., lf_jv, ..., lf_jV}, where lf_jv is the v-th related function signature and 1 ≤ v ≤ V. The method is as follows:
3.1 initializing l= { };
3.2 Using the traversal interfaces provided by Clang (of the form VisitNodeType), traverse each node of AST_root in turn and select the nodes whose type is constant string (StringLiteral); denote the constant string of such a node as t and take t as the candidate log information l_candidate. If the single complete program statement containing t contains several constant strings (denoted t′, t″, ...), represent every program variable appearing in that statement by the string "_VARIABLE_", and then combine all constant strings in the statement (t, t′, t″, ...) and the "_VARIABLE_" placeholders in their order of appearance in the statement, separated by spaces, to form one piece of candidate log information l_candidate. For example, for the log information related program statement 'apr_pstrcat(cmd->temp_pool, "LimitRequestFields \"", arg, "\" must be a non-negative integer (0=no limit)", NULL);', the candidate log information l_candidate extracted by the log information related program object extraction module is "_VARIABLE_ LimitRequestFields _VARIABLE_ must be a non-negative integer (0=no limit)"; in this example t is "LimitRequestFields \"" and t′ is "\" must be a non-negative integer (0=no limit)". If the length of l_candidate is less than 10 or l_candidate contains no spaces, discard l_candidate; otherwise add l_candidate to the potential log information set L. When all nodes of AST_root have been traversed, go to 3.3;
3.3 After the constant string (StringLiteral) nodes of AST_root have been traversed, the potential log information set L = {l_1, l_2, ..., l_j, ..., l_J} is obtained, where l_j is the j-th log information, J is the total number of log information items in L, 1 ≤ j ≤ J, and J = 4382;
3.4 Initialize variable j = 1, the program variable set LV_j = {} and the function signature set LF_j = {};
3.5 For the j-th element l_j in L, extract the program objects related to l_j based on a backward slicing technique ("Thin Slicing", published by M. Sridharan et al. at PLDI 2007) to form the related program object set LS_j of l_j, LS_j = <LV_j, LF_j>. The method is as follows:
3.5.1 Mark the node at which l_j is located in AST_root as the current node cur_node, and add all program variables in the subtree rooted at cur_node (node types DeclRefExpr, representing a single variable, and MemberExpr, representing a structure member variable) to LV_j; if cur_node lies in the Then/Else processing code of an If branch statement, also add all program variables (node types DeclRefExpr and MemberExpr) contained in the subtree rooted at the node of that If statement's branch condition to LV_j, and add the function signatures of the function calls in that subtree to LF_j;
3.5.2 Obtain the parent node parent_node of cur_node, number all child nodes in the subtree rooted at parent_node in order of appearance, and denote the sequence number of cur_node as x; if x = 1, no statement precedes the statement containing cur_node in the subtree rooted at parent_node, so go to 3.5.4; if x > 1, go to 3.5.3;
3.5.3 Traverse the (x-1)-th, (x-2)-th, ..., 1st child nodes of parent_node in turn and find the program variables and function signatures related to l_j, as follows:
3.5.3.1 If cur_node is an assignment statement node (node type AssignStmt) and the variable var corresponding to the left value of the assignment satisfies var ∈ LV_j, add all program variables contained in the right value of the assignment to LV_j; if the right value contains a function call, add the function signature of that call to LF_j; go to 3.5.4;
3.5.3.2 If cur_node is a function call statement node (node type CallExpr) and an actual-argument variable var of the call satisfies var ∈ LV_j, add the function signature of the call to LF_j; go to 3.5.4;
3.5.4 Let cur_node = parent_node; if cur_node is a FunctionDecl node (i.e. the root of the abstract syntax subtree of the current function definition has been reached), the traversal has reached the root of the subtree containing the current function body, so go to 3.5.5; otherwise go to 3.5.2;
3.5.5 Add the signature of the function declared by cur_node to LF_j;
3.6 Let j = j+1; if j ≤ J, go to 3.5; otherwise the log information related program object extraction module has obtained L = {l_1, l_2, ..., l_j, ..., l_J} and LS_1, LS_2, ..., LS_j, ..., LS_J, and sends the potential log information set L and the related program object sets LS_1, LS_2, ..., LS_j, ..., LS_J to the configuration related log information identification module.
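The candidate assembly and filtering rule of step 3.2 can be sketched as follows. The representation of a statement as an ordered list of `("str", text)` and `("var", name)` fragments is a hypothetical simplification of what a Clang visitor would provide, and the function names are invented for illustration.

```python
PLACEHOLDER = "_VARIABLE_"


def build_candidate(fragments):
    """Join string literals and variable placeholders in order of appearance, space-separated."""
    tokens = [PLACEHOLDER if kind == "var" else text.strip() for kind, text in fragments]
    return " ".join(token for token in tokens if token)


def is_potential_log(candidate):
    """Keep a candidate only if it has at least 10 characters and contains a space."""
    return len(candidate) >= 10 and " " in candidate


# Fragments of the apr_pstrcat statement used as the example in step 3.2.
fragments = [
    ("var", "cmd->temp_pool"),
    ("str", "LimitRequestFields \""),
    ("var", "arg"),
    ("str", "\" must be a non-negative integer (0=no limit)"),
]
candidate = build_candidate(fragments)
```

Unlike the example in the text, this sketch keeps the escaped quotation marks from the literals; stripping them would be a further normalization step.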
Fourth step, the configuration related log information identification module receives C and CS_1, CS_2, ..., CS_i, ..., CS_I (I = 694) from the configuration parameter related code object extraction module, receives L and LS_1, LS_2, ..., LS_j, ..., LS_J (J = 4382) from the log information related program object extraction module, screens the configuration related log information out of L to obtain the binary pair set CL of configuration related log information, and sends the configuration parameter name set C and the binary pair set CL to the configuration constraint information identification module, as follows:
4.1 initializing variable j=1;
4.2 initializing variable i=1;
4.3 Match the j-th log message l_j in L against the i-th configuration parameter name c_i in C to find binary pairs of l_j and c_i that are associated, as follows:
4.3.1 Configuration parameter names in software usually consist of several words joined by camel case or by a common connecting character (here "-"), so the configuration-related log information identification module splits c_i according to its word composition and the camel-case convention to obtain the corresponding word set CWords_i. For example, if c_i is the name string of the configuration parameter "DataDirectory", then CWords_i = {"data", "directory"}. Let |CWords_i| denote the number of words in CWords_i;
4.3.2 If |CWords_i| = 1, go to 4.3.3; if |CWords_i| > 1, go to 4.3.4;
4.3.3 Take the single word cword in CWords_i and check whether cword is related to l_j, as follows:
4.3.3.1 If cword is quoted in l_j, i.e. cword is immediately preceded and followed by double or single quotation marks, go to 4.3.3.4; otherwise go to 4.3.3.2;
4.3.3.2 If cword is modified in l_j by any of the keywords configuration, option, direct or parameter, i.e. one of these keywords occurs immediately adjacent to cword in l_j, go to 4.3.3.4; otherwise go to 4.3.3.3;
4.3.3.3 CWords_i fails to match log message l_j; go to 4.3.6;
4.3.3.4 CWords_i matches log message l_j; go to 4.3.7;
4.3.4 Join all words in CWords_i with the connector character string (shown between the words in the example) to generate a character string, denoted CReg_i. For example, for the configuration parameter "DataDirectory", CWords_i = {"data", "directory"} and the generated CReg_i is "data []directory";
4.3.5 Search log message l_j for the string CReg_i using regular expression matching. If the match succeeds, go to 4.3.7; otherwise, go to 4.3.6;
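The name-based matching of steps 4.3.1 to 4.3.5 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented code: the function names are hypothetical, the keyword list is copied verbatim from step 4.3.3.2, matching is assumed to be case-insensitive, and because the connector string of step 4.3.4 is damaged in this text, the default `connector` pattern below is purely an assumption.

```python
import re

# Keywords as listed in step 4.3.3.2 of the text.
KEYWORDS = ("configuration", "option", "direct", "parameter")


def split_name(name):
    """Step 4.3.1: split a parameter name on camel-case boundaries and separators."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def single_word_matches(word, log):
    """Step 4.3.3: the word is quoted, or sits next to one of the keywords."""
    w = re.escape(word)
    if re.search("([\"'])" + w + r"\1", log, re.IGNORECASE):
        return True
    kw = "|".join(KEYWORDS)
    adjacent = rf"\b(?:{kw})\s+{w}\b|\b{w}\s+(?:{kw})\b"
    return re.search(adjacent, log, re.IGNORECASE) is not None


def multi_word_matches(words, log, connector=r"[ _-]?"):
    """Steps 4.3.4-4.3.5: join the words into CReg_i and search the log message.

    The default connector is an assumption, not the string used in the patent.
    """
    creg = connector.join(re.escape(w) for w in words)
    return re.search(creg, log, re.IGNORECASE) is not None


def name_matches(name, log):
    """Step 4.3.2: dispatch on the number of words in the parameter name."""
    words = split_name(name)
    if len(words) == 1:
        return single_word_matches(words[0], log)
    return multi_word_matches(words, log)
```

When `name_matches` returns false, the procedure in the text falls through to the object-based matching of step 4.3.6 rather than rejecting the pair outright.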
4.3.6 Match the code objects CS_i related to c_i against the program objects LS_j related to l_j, as follows:
4.3.6.1 Initialize variable p = 1;
4.3.6.2 Initialize variable u = 1;
4.3.6.3 If cv_ip = lv_ju, the match succeeds; go to 4.3.7; otherwise, go to 4.3.6.4;
4.3.6.4 Let u = u + 1; if u ≤ U, go to 4.3.6.3; otherwise go to 4.3.6.5;
4.3.6.5 Let p = p + 1; if p ≤ P, go to 4.3.6.2; otherwise go to 4.3.6.6;
4.3.6.6 Initialize variable q = 1;
4.3.6.7 Initialize variable v = 1;
4.3.6.8 If cf_iq = lf_jv, the match succeeds; go to 4.3.7; otherwise go to 4.3.6.9;
4.3.6.9 Let v = v + 1; if v ≤ V, go to 4.3.6.8; otherwise go to 4.3.6.10;
4.3.6.10 Let q = q + 1; if q ≤ Q, go to 4.3.6.7; otherwise go to 4.3.6.11;
4.3.6.11 Initialize variable u = 1;
4.3.6.12 If Similarity(c_i, lv_ju) > 0.63, the match succeeds; go to 4.3.7; otherwise go to 4.3.6.13. Here Similarity(c_i, lv_ju) is the similarity between c_i and lv_ju, computed as follows:
Figure BDA0003872406450000191
where word is a word contained in the sets CW and VW;
4.3.6.13 Let u = u + 1; if u ≤ U, go to 4.3.6.12; otherwise, the code objects CS_i and the program objects LS_j related to l_j do not match; go to 4.4;
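The three nested loops of step 4.3.6 amount to two set-intersection tests followed by a thresholded similarity test. A minimal Python sketch is given below. The function name and parameters are hypothetical, and because the Similarity formula is only available as a figure in this text, the similarity measure is passed in as a caller-supplied function rather than reconstructed.

```python
def objects_match(cv, cf, lv, lf, name, similarity, threshold=0.63):
    """Return True if CS_i = <cv, cf> matches LS_j = <lv, lf>.

    `similarity` is a hypothetical callable standing in for Similarity(c_i, lv_ju).
    """
    # Steps 4.3.6.1-4.3.6.5: some configuration variable equals a log variable.
    if set(cv) & set(lv):
        return True
    # Steps 4.3.6.6-4.3.6.10: some function signature appears on both sides.
    if set(cf) & set(lf):
        return True
    # Steps 4.3.6.11-4.3.6.13: a log variable is similar enough to the name c_i.
    return any(similarity(name, v) > threshold for v in lv)


# Usage with a placeholder similarity that never fires.
matched = objects_match(
    cv={"data_dir"}, cf=set(),
    lv={"data_dir", "errno"}, lf={"void log_error(char*)"},
    name="DataDirectory", similarity=lambda a, b: 0.0,
)
```

Using set intersection instead of explicit index loops gives the same result as the step-by-step procedure, since the procedure stops at the first equal pair.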
4.3.7 Log message l_j and configuration parameter name c_i match successfully; add the binary pair <c_i, l_j> to the set CL;
4.4 Let i = i + 1; if i ≤ I, go to 4.3; otherwise, go to 4.5;
4.5 Let j = j + 1; if j ≤ J, go to 4.2; otherwise, the processing of C and L is complete and the binary pair set CL is obtained, CL = {<c_1, l_1>, <c_2, l_2>, ..., <c_m, l_m>, ..., <c_M, l_M>}, M = 1545; go to 4.6;
4.6 The configuration-related log information identification module sends the configuration parameter name set C and the binary pair set CL of configuration-related log information to the configuration constraint information identification module;
Fifth step: manually select texts describing configuration constraints from the configuration-related documents and source-code logs of 11 software systems (Apache Hadoop, HDFS, Yarn, Alluxio, Cassandra, Spark, Hypertable, MongoDB, AOLServer, Subversion and OpenLDAP), collecting 338 configuration constraint text descriptions in total; these 338 descriptions are recorded as the configuration constraint description document set DS (the number of configuration constraint text descriptions in DS is |DS| = 338).
Sixth step: manually collect words that describe error-related states through the WordNet dictionary interface, forming the error description related word set λ.
Seventh step: the natural language template generating module receives DS and the error description related word set λ from the user, and generates the configuration constraint natural language description template set LanPatterns, as follows:
7.1 The natural language template generating module derives from DS the natural language templates describing configuration constraints shown in Fig. 3, as follows:
7.1.1 initializing variable y=1;
7.1.2 For the y-th text description d_y in the set DS, use the spaCy open-source library (version 3.1.0) to generate the POS tag sequence pairs corresponding to d_y: <pos_y1, lemma_y1>, <pos_y2, lemma_y2>, ..., <pos_yz, lemma_yz>, ..., <pos_yh, lemma_yh>. The set of these h POS tag sequence pairs is called the first POS tag sequence set, abbreviated <<pos_yh, lemma_yh>>, where pos_yz (1 ≤ z ≤ h) is the part-of-speech tag (e.g. noun (NN), verb (VB), adjective (JJ)) of the z-th word in d_y, lemma_yz is the base form of the z-th word in d_y after lemmatization, and h is the total number of POS tag sequence pairs;
7.1.3 Apply the removal, merging and replacement methods to <<pos_yh, lemma_yh>> to obtain the second POS tag sequence set <<pos'_yh, lemma'_yh>>, as follows:
Remove from <<pos_yh, lemma_yh>> the binary pairs whose pos_yz is DT (the part-of-speech tag for determiners) or SYM (the part-of-speech tag for symbols), and merge consecutive binary pairs whose pos_yz is NN (the part-of-speech tag for nouns) or JJ (the part-of-speech tag for adjectives); that is, if pos_yz = NN and pos_y,z+1 = NN, or pos_yz = JJ and pos_y,z+1 = JJ, then <pos_yz, lemma_yz> and <pos_y,z+1, lemma_y,z+1> are merged into <pos_yz, lemma_yz + lemma_y,z+1>, where lemma_yz + lemma_y,z+1 denotes the words lemma_yz and lemma_y,z+1 joined by a space. When lemma_yz is a word in the set λ, replace lemma_yz with the character string "ERROR_STATUS". This yields the second POS tag sequence set <<pos'_yh, lemma'_yh>>;
7.1.4 Let y = y + 1; if y ≤ 338, go to 7.1.2; otherwise go to 7.1.5;
7.1.5 Use the Apriori frequent item mining algorithm (see the book "Data Mining: Concepts and Techniques" by Jiawei Han et al., 2011) to mine the frequent items in <<pos'_yh, lemma'_yh>>, and add the five most frequently occurring sequences to the configuration constraint natural language description template set LanPatterns; these are the contents of the second to sixth rows of the first column of the table in Fig. 3. Each of these rows is a frequent item in the set <<pos'_yh, lemma'_yh>> and represents the POS tag sequence pattern of one configuration constraint natural language description template. For example, the first row of the table indicates that the configuration constraint natural language description has the form of a noun (NN) followed by a modal verb (MD), and the second row of the table corresponds to the example description "this value (NN) must (MD) be greater than 0";
7.2 The natural language template generating module sends the natural language description template set LanPatterns and the error description related word set λ to the configuration constraint information identification module.
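The removal, merging and replacement of step 7.1.3 can be illustrated with a short Python sketch. It operates on already tagged (tag, lemma) pairs so that it runs without spaCy; in a spaCy pipeline such pairs would typically come from each token's `tag_` and `lemma_` attributes. The function name, the example tags and the example error word set are assumptions made for illustration, and the three operations are applied in the order in which the text lists them.

```python
def normalize_pairs(pairs, error_words):
    """Apply step 7.1.3 to a list of (pos_tag, lemma) pairs."""
    # Removal: drop determiners (DT) and symbols (SYM).
    kept = [(pos, lemma) for pos, lemma in pairs if pos not in ("DT", "SYM")]

    # Merging: join consecutive NN+NN or JJ+JJ pairs with a space.
    merged = []
    for pos, lemma in kept:
        if merged and pos in ("NN", "JJ") and merged[-1][0] == pos:
            merged[-1] = (pos, merged[-1][1] + " " + lemma)
        else:
            merged.append((pos, lemma))

    # Replacement: lemmas found in the error word set become ERROR_STATUS.
    return [(pos, "ERROR_STATUS" if lemma in error_words else lemma)
            for pos, lemma in merged]


# Hand-tagged example (tags chosen for illustration, not produced by spaCy).
tagged = [("DT", "this"), ("NN", "buffer"), ("NN", "size"),
          ("MD", "must"), ("VB", "be"), ("JJ", "invalid")]
normalized = normalize_pairs(tagged, error_words={"invalid"})
```

The tag sequences of the normalized pairs (for example NN, MD, VB, JJ above) are what a frequent item miner such as Apriori would then count across the 338 descriptions.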
Eighth step: the configuration constraint information identification module receives CL, CL = {<c_1, l_1>, <c_2, l_2>, ..., <c_m, l_m>, ..., <c_M, l_M>}, M = 1545, from the configuration-related log information identification module, receives LanPatterns and λ from the natural language template generating module, and identifies the log information in CL that contains configuration constraint descriptions based on LanPatterns, as follows:
8.1 Initialize variable m = 1, and initialize the set of log information containing configuration constraints, ConstraintDescSet = {};
8.2 For the binary pair <c_m, l_m> in CL, if l_m contains c_m, replace c_m in l_m with the character string "CONFIG" and go to 8.3; if l_m does not contain c_m, go directly to 8.3;
8.3 If l_m contains words in λ, replace the corresponding words in l_m with the character string "ERROR_STATUS" and go to 8.4; if l_m contains no word in λ, go directly to 8.4;
8.4 Use the spaCy open-source library to generate the third POS tag sequence set <<pos_mh, lemma_mh>> corresponding to l_m, then apply the removal, merging and replacement methods described in step 7.1.3 to <<pos_mh, lemma_mh>> to obtain the fourth POS tag sequence set <<pos'_mh, lemma'_mh>>;
8.5 Check whether <<pos'_mh, lemma'_mh>> matches any template in LanPatterns; if the match succeeds, add the binary pair <c_m, l_m> to the set ConstraintDescSet and go to 8.6; otherwise, go directly to 8.6;
8.6 Let m = m + 1; if m ≤ M (M = 1545), go to 8.2; otherwise, go to 8.7;
8.7 The configuration constraint information identification module outputs the set ConstraintDescSet to the user. ConstraintDescSet contains all log information in L that contains configuration constraints, ConstraintDescSet = {<c_1, l_1>, <c_2, l_2>, ..., <c_r, l_r>, ..., <c_R, l_R>}, where <c_r, l_r> is the r-th binary pair in ConstraintDescSet, c_r is the configuration parameter name, l_r is the corresponding log information containing a configuration constraint description, R is the total number of binary pairs in ConstraintDescSet, 1 ≤ r ≤ R, and R = 205;
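The placeholder substitution of steps 8.2 and 8.3 can be sketched as follows. This is a minimal sketch, not the patented code: the function name is hypothetical, and both case-insensitive matching and whole-word matching of the λ words are assumptions, since the text does not say how containment is tested.

```python
import re


def mask_log(log, param, error_words):
    """Replace the parameter name with CONFIG and lambda words with ERROR_STATUS."""
    # Step 8.2: substitute the configuration parameter name, if present.
    masked = re.sub(re.escape(param), "CONFIG", log, flags=re.IGNORECASE)

    # Step 8.3: substitute every whole word that belongs to the error word set.
    if error_words:
        alternatives = "|".join(
            re.escape(w) for w in sorted(error_words, key=len, reverse=True))
        masked = re.sub(rf"\b(?:{alternatives})\b", "ERROR_STATUS",
                        masked, flags=re.IGNORECASE)
    return masked


masked_message = mask_log(
    "MaxClients must be positive, invalid value _VARIABLE_",
    param="MaxClients",
    error_words={"invalid", "error"},
)
```

Masking before tagging keeps project-specific names out of the POS sequence, so one template set mined from other software can be applied to any target system.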
Ninth step: using the set ConstraintDescSet of log information containing configuration constraints output by the configuration constraint information identification module, the user checks whether the descriptions in the configuration-related documents of the Httpd software are correct, thereby detecting defects in the software documentation, as follows:
9.1 The user receives ConstraintDescSet from the configuration constraint information identification module, and checks whether the document information is sufficient according to ConstraintDescSet;
9.2 Initialize variable r = 1;
9.3 Check whether the software document contains a text description of constraint information for the configuration parameter named c_r; if it does, go to 9.4; otherwise, go to 9.6;
9.4 Check whether the text description of constraint information for the configuration parameter named c_r in the software document is consistent with the configuration constraint information described by l_r; if so, go to 9.6; otherwise, go to 9.5;
9.5 The text description of constraint information for the configuration parameter named c_r in the current software document is defective; the user reports the defect to the software developers; go to 9.6;
9.6 Let r = r + 1; if r ≤ R (R = 205), go to 9.3; otherwise, go to 9.7;
9.7 The check of the text descriptions of configuration parameter constraint information in the target software document is finished.
By analyzing the Httpd software, 164 pieces of log information containing configuration constraints were extracted in total; comparison with the official Httpd documentation (ninth step) revealed 25 documentation defects, and 25 documentation defect patches were submitted, of which 3 have been accepted. The steps of the invention were applied in the same way to the other 6 experimental software systems (Nginx, MySQL, PostgreSQL, Lighttpd, Squid, Postfix); the fifth, sixth and seventh steps need to be executed only once and can then be reused across software systems. In total, 427 pieces of log information containing configuration constraints were extracted from the 7 software systems; comparison with the official documentation of each system (ninth step) revealed 67 documentation defects, and 67 documentation defect patches were submitted, of which 14 have been accepted.

Claims (10)

1. A configuration fault prediction method based on program semantics, characterized by comprising the following steps:
firstly, constructing a configuration parameter constraint extraction system, wherein the configuration parameter constraint extraction system consists of a configuration parameter related code object extraction module, a log information related program object extraction module, a configuration related log information identification module, a natural language template generation module and a configuration constraint information identification module; the method comprises the steps that a configuration parameter related code object extraction module reads a software source code and a software configuration parameter name list file from a file system, a configuration parameter name set C is obtained from the software configuration parameter name list file, a configuration parameter related code object is extracted from the software source code according to the software configuration parameter name list file, a configuration parameter related code object set is obtained, and the configuration parameter name set C and the configuration parameter related code object set are sent to a configuration related log information identification module; the method comprises the steps that a log information related program object extraction module reads software source codes from a file system, extracts all potential log information in the software source codes to obtain a potential log information set L, extracts a related program object set of the potential log information in the software source codes, and sends the potential log information set L and the related program object set of the potential log information to a configuration related log information identification module; the configuration related log information identification module receives the C and configuration parameter related code object set from the configuration parameter related code object extraction module, receives the potential log information set L and the related program object set of the potential log 
information from the log information related program object extraction module, identifies the binary pair set CL of the configuration related log information by matching the configuration parameter related code object and the log information related program object, and sends the configuration parameter name set C and the binary pair set CL of the configuration related log information to the configuration constraint information identification module; the natural language template generation module receives a configuration constraint description document set DS and an error description related word set lambda from a user, generates a configuration constraint natural language description template set, and sends the configuration constraint natural language description template set to the configuration constraint information identification module; the configuration constraint information identification module receives a configuration parameter name set C and a binary pair set CL of configuration related log information from the configuration related log information identification module, receives a configuration constraint natural language description template set from the natural language template generation module, receives an error description related word set from a user, matches the configuration related log information by using the configuration constraint natural language description template, and identifies configuration related log information containing configuration constraint in the configuration related log information to obtain configuration related log information containing the configuration constraint;
The second step, the configuration parameter related code object extracting module reads the software source code and the software configuration parameter name list file from the file system, obtains the configuration parameter name set C from the software configuration parameter name list file, extracts the configuration parameter related code object from the software source code according to the software configuration parameter name list file, obtains the configuration parameter related code object set, and sends the configuration parameter name set C and the configuration parameter related code object set to the configuration related log information identifying module, the method is that:
2.1 The configuration parameter related code object extraction module reads the software configuration parameter name list file from the file system to obtain the configuration parameter name set C, C = {c_1, c_2, ..., c_i, ..., c_I}, where c_i is the i-th configuration parameter name in C and is a constant character string, I is the total number of configuration parameter names in C, and 1 ≤ i ≤ I;
2.2 Parse the software source code with the Clang front end of the LLVM compiler framework to generate the abstract syntax tree AST_root corresponding to the software source code; each node in AST_root represents a construct in the source code, and the tree structure expresses the containment relations between the constructs;
2.3 In the abstract syntax tree AST_root, extract the code object sets CS_1, CS_2, ..., CS_i, ..., CS_I related to c_1, c_2, ..., c_i, ..., c_I, where CS_i = <CV_i, CF_i>; CV_i is the set of program variables related to the configuration parameter named c_i, CV_i = {cv_i1, cv_i2, ..., cv_ip, ..., cv_iP}, where cv_ip is the p-th program variable in CV_i related to the configuration parameter named c_i; CF_i is the set of function signatures related to the configuration parameter named c_i, CF_i = {cf_i1, cf_i2, ..., cf_iq, ..., cf_iQ}, where cf_iq is the q-th function signature in CF_i related to the configuration parameter named c_i; a function signature defines the inputs and outputs of a function or method, including its parameters and their types and its return value and type;
Third step: the log information related program object extraction module reads the software source code from the file system and extracts all potential log information in the software source code by static program analysis, obtaining the potential log information set L, L = {l_1, l_2, ..., l_j, ..., l_J}, where l_j is the j-th log message, J is the total number of log messages in L, and 1 ≤ j ≤ J; it also obtains the related program object sets LS_1, LS_2, ..., LS_j, ..., LS_J of l_1, l_2, ..., l_j, ..., l_J, where LS_j = <LV_j, LF_j>; LV_j is the set of program variables related to log message l_j, LV_j = {lv_j1, lv_j2, ..., lv_ju, ..., lv_jU}, 1 ≤ u ≤ U, where lv_ju is the u-th related program variable; LF_j is the set of function signatures related to log message l_j, LF_j = {lf_j1, lf_j2, ..., lf_jv, ..., lf_jV}, where lf_jv is the v-th related function signature and 1 ≤ v ≤ V; the method is as follows:
3.1 initializing l= { };
3.2 Using the traversal interfaces provided by Clang, visit each node of AST_root in turn and select the nodes of AST_root whose type is the constant string type, i.e. StringLiteral; denote the constant string of such a node by t and take t as a candidate log message l_candidate. If the single complete program statement containing t contains several constant strings, denoted t, t', t'', ..., then uniformly represent every program variable appearing in that statement by the string "_VARIABLE_", and combine all constant strings t, t', t'', ... and the program variables replaced by "_VARIABLE_" in their order of appearance in the statement, separated by spaces, to form one candidate log message l_candidate. If the length of l_candidate is less than 10 or l_candidate contains no space, discard l_candidate; otherwise, add l_candidate to the potential log information set L. When all nodes of AST_root have been traversed, go to 3.3;
3.3 Once the constant string nodes in AST_root have been traversed, the potential log information set L is obtained, L = {l_1, l_2, ..., l_j, ..., l_J}, where l_j is the j-th log message, J is the total number of log messages in L, and 1 ≤ j ≤ J;
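The construction and filtering of candidate log messages in steps 3.2 and 3.3 can be sketched in Python. The input format is a hypothetical simplification: each statement is given as an ordered list of ("str", text) and ("var", name) items, standing in for the StringLiteral and variable nodes that a Clang traversal would yield. The function names are likewise illustrative.

```python
def build_candidate(parts):
    """Combine one statement's strings and variables, in order, into l_candidate."""
    tokens = [text if kind == "str" else "_VARIABLE_" for kind, text in parts]
    return " ".join(tokens)


def keep_candidate(candidate):
    """Keep candidates that are at least 10 characters long and contain a space."""
    return len(candidate) >= 10 and " " in candidate


def collect_potential_logs(statements):
    """Build the potential log information set L from all statements."""
    candidates = (build_candidate(parts) for parts in statements)
    return [c for c in candidates if keep_candidate(c)]


statements = [
    [("str", "cannot open"), ("var", "path"), ("str", "for writing")],
    [("str", "rb")],               # too short, discarded
    [("str", "configuration")],    # no space, discarded
]
potential_logs = collect_potential_logs(statements)
```

The two filters are cheap heuristics for separating message-like strings from identifiers, format flags and file modes; anything that survives is only a potential log message and is screened again in the fourth step.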
3.4 Initialize variable j = 1, initialize the program variable set LV_j = {}, and initialize the function signature set LF_j = {};
3.5 For the j-th element l_j in L, extract the program objects related to l_j using a backward slicing technique, forming the set LS_j of program objects related to the potential log message l_j, LS_j = <LV_j, LF_j>; LV_j stores all program variables in the program context related to l_j; LF_j stores the function signatures of the function calls in the program context related to l_j;
3.6 Let j = j + 1. If j ≤ J, go to 3.5; otherwise, the log information related program object extraction module has obtained L and LS_1, LS_2, ..., LS_j, ..., LS_J, with L = {l_1, l_2, ..., l_j, ..., l_J}; it sends the potential log information set L and the related program object sets LS_1, LS_2, ..., LS_j, ..., LS_J to the configuration-related log information identification module;
Fourth step: the configuration-related log information identification module receives C and CS_1, CS_2, ..., CS_i, ..., CS_I from the configuration parameter related code object extraction module, receives L and LS_1, LS_2, ..., LS_j, ..., LS_J from the log information related program object extraction module, screens the configuration-related log information out of L to obtain the binary pair set CL of configuration-related log information, and sends the configuration parameter name set C and the binary pair set CL to the configuration constraint information identification module, as follows:
4.1 Initialize variable j = 1;
4.2 Initialize variable i = 1;
4.3 Match the j-th log message l_j in L against the i-th configuration parameter name c_i in C to find binary pairs of l_j and c_i that are associated, as follows:
4.3.1 The configuration-related log information identification module splits c_i according to its word composition and the camel-case convention to obtain the corresponding word set CWords_i; let |CWords_i| denote the number of words in CWords_i;
4.3.2 If |CWords_i| = 1, go to 4.3.3; if |CWords_i| > 1, go to 4.3.4;
4.3.3 Take the word in CWords_i and check whether it is related to l_j; if CWords_i fails to match log message l_j, go to 4.3.6; if CWords_i matches log message l_j, go to 4.3.7;
4.3.4 Join all words in CWords_i with the connector character string to generate a character string, denoted CReg_i;
4.3.5 Search log message l_j for the string CReg_i using regular expression matching. If the match succeeds, go to 4.3.7; otherwise, go to 4.3.6;
4.3.6 Match the code objects CS_i related to c_i against the program objects LS_j related to l_j; if the match succeeds, go to 4.3.7; if the match fails, go to 4.4;
4.3.7 Log message l_j and configuration parameter name c_i match successfully; add the binary pair <c_i, l_j> to the set CL;
4.4 Let i = i + 1; if i ≤ I, go to 4.3; otherwise, go to 4.5;
4.5 Let j = j + 1; if j ≤ J, go to 4.2; otherwise, the processing of C and L is complete and the binary pair set CL is obtained, CL = {<c_1, l_1>, <c_2, l_2>, ..., <c_m, l_m>, ..., <c_M, l_M>}; go to 4.6;
4.6 The configuration-related log information identification module sends the configuration parameter name set C and the binary pair set CL of configuration-related log information to the configuration constraint information identification module;
Fifth step: manually select texts describing configuration constraints from the configuration-related documents and source-code logs of 11 software systems (Apache Hadoop, HDFS, Yarn, Alluxio, Cassandra, Spark, Hypertable, MongoDB, AOLServer, Subversion and OpenLDAP), collecting 338 configuration constraint text descriptions in total; these 338 descriptions are recorded as the configuration constraint description document set DS, and the number of configuration constraint text descriptions in DS is |DS| = 338;
Sixth step: manually collect words that describe error-related states through the WordNet dictionary interface, forming the error description related word set λ;
Seventh step: the natural language template generating module receives DS and the error description related word set λ from the user, and generates the configuration constraint natural language description template set LanPatterns, as follows:
7.1 The natural language template generating module derives the natural language templates describing configuration constraints from DS, as follows:
7.1.1 initializing variable y=1;
7.1.2 For the y-th text description d_y in the set DS, use the spaCy open-source library to generate the POS tag sequence pairs corresponding to d_y: <pos_y1, lemma_y1>, <pos_y2, lemma_y2>, ..., <pos_yz, lemma_yz>, ..., <pos_yh, lemma_yh>. The set of these h POS tag sequence pairs is called the first POS tag sequence set, abbreviated <<pos_yh, lemma_yh>>, where pos_yz (1 ≤ z ≤ h) is the part-of-speech tag of the z-th word in d_y, lemma_yz is the base form of the z-th word in d_y after lemmatization, and h is the total number of POS tag sequence pairs;
7.1.3 Apply the removal, merging and replacement methods to <<pos_yh, lemma_yh>> to obtain the second POS tag sequence set <<pos'_yh, lemma'_yh>>, as follows:
Remove from <<pos_yh, lemma_yh>> the binary pairs whose pos_yz is DT (the part-of-speech tag for determiners) or SYM (the part-of-speech tag for symbols), and merge consecutive binary pairs whose pos_yz is NN (the part-of-speech tag for nouns) or JJ (the part-of-speech tag for adjectives); that is, if pos_yz = NN and pos_y,z+1 = NN, or pos_yz = JJ and pos_y,z+1 = JJ, then <pos_yz, lemma_yz> and <pos_y,z+1, lemma_y,z+1> are merged into <pos_yz, lemma_yz + lemma_y,z+1>, where lemma_yz + lemma_y,z+1 denotes the words lemma_yz and lemma_y,z+1 joined by a space. When lemma_yz is a word in the set λ, replace lemma_yz with the character string "ERROR_STATUS". This yields the second POS tag sequence set <<pos'_yh, lemma'_yh>>;
7.1.4 Let y = y + 1; if y ≤ 338, go to 7.1.2; otherwise go to 7.1.5;
7.1.5 Use the Apriori frequent item mining algorithm to mine the frequent items in <<pos'_yh, lemma'_yh>>, and add the five most frequently occurring sequences to the configuration constraint natural language description template set LanPatterns;
7.2 The natural language template generating module sends the natural language description template set LanPatterns and the error description related word set λ to the configuration constraint information identification module;
Eighth step: the configuration constraint information identification module receives CL, CL = {<c_1, l_1>, <c_2, l_2>, ..., <c_m, l_m>, ..., <c_M, l_M>}, from the configuration-related log information identification module, receives LanPatterns and λ from the natural language template generating module, and identifies the log information in CL that contains configuration constraint descriptions based on LanPatterns, as follows:
8.1 Initialize variable m = 1, and initialize the set of log information containing configuration constraints, ConstraintDescSet = {};
8.2 For the binary pair <c_m, l_m> in CL, if l_m contains c_m, replace c_m in l_m with the character string "CONFIG" and go to 8.3; if l_m does not contain c_m, go directly to 8.3;
8.3 If l_m contains words in λ, replace the corresponding words in l_m with the character string "ERROR_STATUS" and go to 8.4; if l_m contains no word in λ, go directly to 8.4;
8.4 Use the spaCy open-source library to generate the third POS tag sequence set <<pos_mh, lemma_mh>> corresponding to l_m, then apply the removal, merging and replacement methods of step 7.1.3 to <<pos_mh, lemma_mh>> to obtain the fourth POS tag sequence set <<pos'_mh, lemma'_mh>>;
8.5 Check whether <<pos'_mh, lemma'_mh>> matches any template in LanPatterns; if the match succeeds, add the binary pair <c_m, l_m> to ConstraintDescSet and go to 8.6; otherwise, go directly to 8.6;
8.6 Let m = m + 1; if m ≤ |CL|, where |CL| denotes the number of elements in CL, go to 8.2; otherwise, go to 8.7;
8.7 The configuration constraint information identification module outputs ConstraintDescSet to the user. ConstraintDescSet contains all log information in L that contains configuration constraints, ConstraintDescSet = {<c_1, l_1>, <c_2, l_2>, ..., <c_r, l_r>, ..., <c_R, l_R>}, where <c_r, l_r> is the r-th binary pair in ConstraintDescSet, c_r is the configuration parameter name, l_r is the corresponding log information containing a configuration constraint description, R is the total number of binary pairs in ConstraintDescSet, and 1 ≤ r ≤ R;
Ninth, based on the set ConstraintDescSet of log information containing configuration constraints output by the configuration constraint information identification module, the user checks whether the configuration parameter settings in the configuration file satisfy the constraints and thereby predicts configuration faults, as follows:
9.1 The user reads the ConstraintDescSet output by the configuration constraint information identification module;
9.2 Initialize variable r = 1;
9.3 The user reads the configuration file of the target software and checks whether it contains a configuration parameter named c_r; if so, go to 9.4; otherwise, go to 9.6;
9.4 Check whether the value of the configuration parameter named c_r in the configuration file satisfies the configuration constraint information described by l_r; if so, go to 9.6; otherwise, the configuration file contains a configuration parameter setting that violates a configuration constraint, meaning that the current configuration file has a configuration fault; go to 9.5;
9.5 A configuration fault exists in the current configuration file and the check fails. The user adjusts the value of the configuration parameter named c_r according to the configuration constraint information described by l_r so that it falls within the legal range; after the adjustment, go to 9.6 to continue checking the next pair;
9.6 Let r = r + 1; if r ≤ R, go to 9.3; otherwise, go to 9.7;
9.7 No configuration parameter setting that violates a configuration constraint is found in the configuration file, indicating that the current configuration file has no configuration fault; the check ends.
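The checking loop of steps 9.2 to 9.7 can be sketched as below. In the claimed method the user interprets each log message l_r by hand; the sketch assumes, purely for illustration, that each constraint has already been turned into a predicate on the parameter value. The function name check_config and the sample parameters are hypothetical.

```python
def check_config(config, constraint_checks):
    """Return the names of parameters whose values violate their constraint.

    config            -- dict mapping parameter names to their configured values
    constraint_checks -- list of (parameter name, predicate) pairs, one per
                         <c_r, l_r> pair; the predicate is an assumed encoding
                         of the constraint described by l_r
    """
    violations = []
    for name, is_valid in constraint_checks:   # r = 1 .. R
        if name not in config:                  # step 9.3: parameter not set
            continue
        if not is_valid(config[name]):          # step 9.4: constraint violated
            violations.append(name)             # step 9.5: needs adjustment
    return violations                           # empty list corresponds to 9.7


# Example: one violated constraint, one parameter absent from the file.
example_config = {"MaxClients": "0", "Timeout": "30"}
example_checks = [
    ("MaxClients", lambda v: int(v) >= 1),
    ("KeepAlive", lambda v: v in ("On", "Off")),
]
```

Calling check_config(example_config, example_checks) reports MaxClients as the only violating parameter; KeepAlive is skipped because it does not appear in the configuration.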
2. The configuration fault prediction method based on program semantics as claimed in claim 1, wherein the Clang front end is version 10.0.0 or above; spaCy is a natural language processing (NLP) library for Python, implemented in Python and Cython, and its version is required to be 3.1.0 or above.
3. The configuration fault prediction method based on program semantics as claimed in claim 1, wherein in step 2.2 the source-code structures represented by the nodes of AST_root include the whole source code unit TranslationUnitDecl, the function declaration FunctionDecl, the if branch statement IfStmt, the assignment statement AssignStmt, the function call CallExpr, the constant string StringLiteral, the binary operation BinaryOper, the single variable DeclRefExpr, and the structure variable MemberExpr.
4. The configuration fault prediction method based on program semantics as claimed in claim 1, wherein the method in step 2.3 of extracting from the abstract syntax tree AST_root the code object sets CS_1, CS_2, ..., CS_i, ..., CS_I related to c_1, c_2, ..., c_i, ..., c_I is as follows:
2.3.1 Initialize variable i = 1, CV_i = {}, and CF_i = {};
2.3.2 Using the relevant traversal interfaces provided by Clang, traverse each node of AST_root in order, locate the node containing the constant string c_i, mark it as Init_Node, and mark Init_Node as the current subtree root node Current_Sub_AST;
2.3.3 Determine whether the subtree rooted at Current_Sub_AST contains any other constant string c_e in C, where 1 ≤ e ≤ I and e ≠ i; if so, go to 2.3.5; otherwise, go to 2.3.4;
2.3.4 Obtain the parent node Parent_Sub_AST of Current_Sub_AST in AST_root; if Parent_Sub_AST is a TranslationUnitDecl node, the traversal has reached the root of AST_root; go to 2.3.5; otherwise, let Current_Sub_AST = Parent_Sub_AST and go to 2.3.3;
2.3.5 Locate the AST subtree within Current_Sub_AST that contains Init_Node, and mark this subtree as the minimum common subtree Minimum_Common_Sub_AST of the code objects related to the configuration parameter named c_i;
2.3.6 Traverse all nodes in Minimum_Common_Sub_AST; add the program variable names corresponding to the nodes of program-variable type to CV_i, and add the function signatures corresponding to the nodes of function-declaration type to CF_i;
2.3.7 The pair composed of CV_i and CF_i is the code object set CS_i related to the configuration parameter named c_i; let CS_i = <CV_i, CF_i>;
2.3.8 Let i = i + 1; if i ≤ I, let CV_i = {}, CF_i = {} and go to 2.3.2; otherwise, the configuration parameter related code object extraction module sends the configuration parameter name set C and the configuration parameter related code object sets CS_1, CS_2, ..., CS_i, ..., CS_I to the configuration related log information identification module.
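The subtree search of steps 2.3.2 to 2.3.7 can be illustrated on a toy tree. The sketch below does not use Clang; the Node class, the helper names, and the example tree are hypothetical. It also assumes one reading of step 2.3.5: the minimum common subtree is taken to be the largest subtree that contains Init_Node but no other configuration parameter name.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    kind: str                      # node type, e.g. "StringLiteral"
    value: str = ""                # string content, variable name or signature
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def walk(node):
    """Yield the node and all of its descendants."""
    yield node
    for child in node.children:
        yield from walk(child)


def minimum_common_subtree(init_node, param_names):
    """Climb from init_node while the parent contains no other parameter name."""
    others = set(param_names) - {init_node.value}
    current = init_node
    while True:
        parent = current.parent
        if parent is None or parent.kind == "TranslationUnitDecl":
            return current         # reached the top of the tree (step 2.3.4)
        if any(n.kind == "StringLiteral" and n.value in others
               for n in walk(parent)):
            return current         # parent holds another parameter (step 2.3.3)
        current = parent


def related_code_objects(init_node, param_names):
    """Steps 2.3.6-2.3.7: collect CV_i and CF_i from the minimum common subtree."""
    sub = minimum_common_subtree(init_node, param_names)
    cv = {n.value for n in walk(sub) if n.kind in ("DeclRefExpr", "MemberExpr")}
    cf = {n.value for n in walk(sub) if n.kind == "FunctionDecl"}
    return cv, cf


def build_example():
    """Toy tree resembling an option table with two entries."""
    root = Node("TranslationUnitDecl")
    table = root.add(Node("VarDecl", "options"))
    first = table.add(Node("InitListExpr"))
    port = first.add(Node("StringLiteral", "port"))
    first.add(Node("DeclRefExpr", "cfg_port"))
    second = table.add(Node("InitListExpr"))
    second.add(Node("StringLiteral", "timeout"))
    second.add(Node("DeclRefExpr", "cfg_timeout"))
    return root, port
```

In the toy tree, the entry for "port" is bounded by its own initializer list because the enclosing table also contains "timeout", so only cfg_port is attributed to it.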
5. The configuration fault prediction method based on program semantics as claimed in claim 1, wherein the relevant traversal interfaces provided by Clang in step 3.2 are interfaces of the form VisitNodeType, where NodeType is a node type in the abstract syntax tree.
6. The configuration fault prediction method based on program semantics as claimed in claim 1, wherein the method in step 3.5 of extracting, based on the backward slicing technique, the set LS_j of program objects potentially related to log information l_j is as follows:
3.5.1 Mark the node where l_j is located in AST_root as the current node cur_node, and add all program variables in the subtree rooted at cur_node to LV_j; if cur_node is located in the Then/Else processing code of an If branch statement, add all program variables contained in the subtree rooted at the node of that If statement's branch condition to LV_j, and add the function signatures corresponding to the function calls in that subtree to LF_j;
3.5.2 Obtain the parent node parent_node of cur_node, number all child nodes in the subtree rooted at parent_node in order of appearance, and let x be the sequence number of cur_node; if x = 1, no statement precedes cur_node in the subtree rooted at parent_node; go to 3.5.4; if x > 1, go to 3.5.3;
3.5.3 Traverse the (x-1)-th, (x-2)-th, ..., 1st child nodes of parent_node in order, looking for program variables and function signatures related to l_j, as follows:
3.5.3.1 If cur_node is a node representing an assignment statement, i.e., its node type is AssignStmt, and the variable var corresponding to the left-hand value of the assignment statement satisfies var ∈ LV_j, add all program variables contained in the right-hand value of the assignment statement to LV_j; if the right-hand value contains a function call, add the function signature corresponding to the function call to LF_j; go to 3.5.4;
3.5.3.2 If cur_node is a node representing a function call statement, i.e., its node type is CallExpr, and an actual-argument variable var of the function call satisfies var ∈ LV_j, add the function signature corresponding to the function call to LF_j; go to 3.5.4;
3.5.4 Let cur_node = parent_node; if cur_node is a FunctionDecl node representing a function declaration, the traversal has reached the root node of the subtree containing the current function body; go to 3.5.5; otherwise, go to 3.5.2;
3.5.5 Add the function signature of the function declaration corresponding to cur_node to LF_j.
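The backward slice of steps 3.5.1 to 3.5.5 can be sketched on a simplified model. The sketch assumes a single flat list of statements inside one function body, so the parent-climbing of steps 3.5.2 and 3.5.4 and the If-condition handling of step 3.5.1 are omitted. The dictionary layout of a statement and all names in the example are hypothetical.

```python
def backward_slice(statements, log_index, function_signature):
    """Collect variables (LV_j) and function signatures (LF_j) related to a log call.

    statements -- list of dicts describing the statements of one function body:
                  {"kind": "log", "uses": [...]}
                  {"kind": "assign", "target": str, "uses": [...], "calls": [...]}
                  {"kind": "call", "callee": str, "args": [...]}
    log_index  -- position of the log statement l_j in the list
    """
    lv = set(statements[log_index]["uses"])     # step 3.5.1
    lf = set()
    # Step 3.5.3: visit the preceding statements from nearest to farthest.
    for stmt in reversed(statements[:log_index]):
        if stmt["kind"] == "assign" and stmt["target"] in lv:
            lv.update(stmt["uses"])             # variables on the right-hand side
            lf.update(stmt["calls"])            # calls on the right-hand side
        elif stmt["kind"] == "call" and lv.intersection(stmt["args"]):
            lf.add(stmt["callee"])
    lf.add(function_signature)                  # step 3.5.5: enclosing function
    return lv, lf


# Example body: n = atoi(value); check_range(n); k = 0; log(n)
example_body = [
    {"kind": "assign", "target": "n", "uses": ["value"], "calls": ["atoi"]},
    {"kind": "call", "callee": "check_range", "args": ["n"]},
    {"kind": "assign", "target": "k", "uses": [], "calls": []},
    {"kind": "log", "uses": ["n"]},
]
```

For the example body with the log statement at index 3, the slice picks up n and value as related variables and atoi, check_range, and the enclosing function as related functions; the unrelated assignment to k contributes nothing.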
7. The configuration fault prediction method based on program semantics as claimed in claim 6, wherein in step 3.5.1 the node types corresponding to all program variables in the subtree rooted at cur_node include DeclRefExpr, representing a single variable, and MemberExpr, representing a structure variable; and the node types corresponding to all program variables contained in the subtree rooted at the node of the If statement's branch condition include DeclRefExpr and MemberExpr.
8. The configuration fault prediction method based on program semantics as claimed in claim 1, wherein the method in step 4.3.3 of detecting whether cword is related to l_j is as follows:
4.3.3.1 If cword is quoted in l_j, i.e., cword is immediately preceded and followed by double or single quotation marks, go to 4.3.3.4; otherwise, go to 4.3.3.2;
4.3.3.2 If l_j contains any of the keywords configuration, option, direct, and parameter modifying the word cword, i.e., the keyword appears immediately adjacent to cword in l_j, go to 4.3.3.4; otherwise, go to 4.3.3.3;
4.3.3.3 CWords_i and log information l_j fail to match; end;
4.3.3.4 CWords_i and log information l_j match successfully; end.
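The two tests of steps 4.3.3.1 and 4.3.3.2 lend themselves to a regular-expression sketch. The function name cword_related is hypothetical, the keyword list is copied as it appears in the claim text, and treating "immediately adjacent" as separated only by whitespace is an assumption.

```python
import re

# Keywords as listed in step 4.3.3.2.
KEYWORDS = ("configuration", "option", "direct", "parameter")


def cword_related(cword, log):
    """Return True if cword is quoted in the log or sits next to a keyword."""
    w = re.escape(cword)
    # Step 4.3.3.1: cword enclosed in matching single or double quotes.
    if re.search(r"""(["'])""" + w + r"\1", log):
        return True
    # Step 4.3.3.2: a keyword directly before or after cword.
    kw = "|".join(KEYWORDS)
    adjacent = rf"\b(?:{kw})\s+{w}\b|\b{w}\s+(?:{kw})\b"
    return re.search(adjacent, log, flags=re.IGNORECASE) is not None
```

Under these assumptions, a message such as "invalid value for 'timeout'" or "option timeout must be positive" is treated as related to timeout, whereas "connection timeout reached" is not.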
9. The configuration fault prediction method based on program semantics as claimed in claim 1, wherein the method in step 4.3.6 of matching the code objects CS_i related to c_i with the program objects LS_j related to l_j is as follows:
4.3.6.1 Initialize variable p = 1;
4.3.6.2 Initialize variable u = 1;
4.3.6.3 If cv_ip = lv_ju, the match succeeds; end; otherwise, go to 4.3.6.4;
4.3.6.4 Let u = u + 1; if u ≤ U, go to 4.3.6.3; otherwise, go to 4.3.6.5;
4.3.6.5 Let p = p + 1; if p ≤ P, go to 4.3.6.2; otherwise, go to 4.3.6.6;
4.3.6.6 Initialize variable q = 1;
4.3.6.7 Initialize variable v = 1;
4.3.6.8 If cf_iq = lf_jv, the match succeeds; end; otherwise, go to 4.3.6.9;
4.3.6.9 Let v = v + 1; if v ≤ V, go to 4.3.6.8; otherwise, go to 4.3.6.10;
4.3.6.10 Let q = q + 1; if q ≤ Q, go to 4.3.6.7; otherwise, go to 4.3.6.11;
4.3.6.11 Initialize variable u = 1;
4.3.6.12 If Similarity(c_i, lv_ju) > 0.63, the match succeeds; end; otherwise, go to 4.3.6.13; here Similarity(c_i, lv_ju) is the function that computes the similarity between c_i and lv_ju;
4.3.6.13 Let u = u + 1; if u ≤ U, go to 4.3.6.12; otherwise, the code objects CS_i and the program objects LS_j related to l_j fail to match; end.
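The nested loops of steps 4.3.6.1 to 4.3.6.13 amount to three checks tried in order: a shared variable name, a shared function signature, and finally a name similarity above 0.63. A compact sketch, with the similarity function passed in as a parameter because its definition belongs to claim 10; the function name objects_match is hypothetical.

```python
def objects_match(c_i, cv_i, cf_i, lv_j, lf_j, similarity, threshold=0.63):
    """Return True if CS_i = <CV_i, CF_i> matches LS_j = <LV_j, LF_j>.

    similarity -- callable taking (c_i, lv) and returning a number; supplied
                  by the caller (see claim 10 for the method's own definition)
    """
    # Steps 4.3.6.1-4.3.6.5: any variable name present in both CV_i and LV_j.
    if set(cv_i) & set(lv_j):
        return True
    # Steps 4.3.6.6-4.3.6.10: any function signature present in both sets.
    if set(cf_i) & set(lf_j):
        return True
    # Steps 4.3.6.11-4.3.6.13: parameter name similar enough to a log variable.
    return any(similarity(c_i, lv) > threshold for lv in lv_j)
```

The set intersections replace the explicit index variables p, u, q, and v; the result is the same because the steps stop at the first equal pair.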
10. The configuration fault prediction method based on program semantics as claimed in claim 9, wherein the calculation method of Similarity(c_i, lv_ju) in step 4.3.6.12 is as follows:
Perform word segmentation and lemmatization on c_i to obtain the word set CW, and perform word segmentation and lemmatization on lv_ju to obtain the word set VW, where lemmatization means removing a word's affixes and extracting its stem; then compute the weight of each word in CW and VW using the IDF algorithm, i.e., treat each configuration parameter name provided by the software as a document in the IDF algorithm and the set of all configuration parameter names as the corpus, and compute the weight of each word in CW and VW based on the IDF algorithm;
Figure FDA0004177425910000091
where word is a word contained in the sets CW and VW.
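The IDF weighting described in claim 10 can be sketched as follows. The similarity formula itself appears only as an image in the source and is not reproduced here, so the weighted-overlap ratio in similarity below is a stand-in chosen for illustration, not the claimed formula. Other assumptions: words are split on case changes and non-letters without lemmatization, and the IDF uses the smoothed form log((1 + N) / (1 + df)) + 1 so that words absent from the corpus still receive a finite weight.

```python
import math
import re


def split_words(name):
    """Split an identifier such as 'MaxClients' or 'max_clients' into lowercase words."""
    return {part.lower() for part in re.findall(r"[A-Za-z][a-z]*|\d+", name)}


def idf_weights(param_names):
    """Treat each parameter name as a document and return an IDF lookup function."""
    n = len(param_names)
    df = {}
    for name in param_names:
        for word in split_words(name):
            df[word] = df.get(word, 0) + 1
    # Smoothed IDF (an assumption): rarer words receive larger weights.
    return lambda word: math.log((1 + n) / (1 + df.get(word, 0))) + 1.0


def similarity(c_i, lv, idf):
    """Illustrative stand-in: IDF weight of shared words over IDF weight of all words."""
    cw, vw = split_words(c_i), split_words(lv)
    union = cw | vw
    if not union:
        return 0.0
    shared = sum(idf(word) for word in cw & vw)
    total = sum(idf(word) for word in union)
    return shared / total
```

With parameter names such as MaxClients, MaxThreads, and KeepAliveTimeout as the corpus, the word max appears in two names and therefore weighs less than clients, which appears in one.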
CN202211200856.4A 2022-09-29 2022-09-29 Configuration fault prediction method based on program semantics Active CN115562645B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211200856.4A CN115562645B (en) 2022-09-29 2022-09-29 Configuration fault prediction method based on program semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211200856.4A CN115562645B (en) 2022-09-29 2022-09-29 Configuration fault prediction method based on program semantics

Publications (2)

Publication Number Publication Date
CN115562645A CN115562645A (en) 2023-01-03
CN115562645B true CN115562645B (en) 2023-06-09

Family

ID=84743497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211200856.4A Active CN115562645B (en) 2022-09-29 2022-09-29 Configuration fault prediction method based on program semantics

Country Status (1)

Country Link
CN (1) CN115562645B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114817932A (en) * 2022-04-26 2022-07-29 河海大学 Ethereum smart contract vulnerability detection method and system based on pre-training model

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105302547A (en) * 2015-09-19 2016-02-03 大连理工大学 Fault injection method for Verilog HDL design
CN106709356B (en) * 2016-12-07 2019-05-24 西安电子科技大学 Android application bug excavation method based on static stain analysis and semiology analysis
US10977879B2 (en) * 2017-06-29 2021-04-13 Volvo Car Corporation Method and system for vehicle platform validation
US10474435B2 (en) * 2017-08-07 2019-11-12 Sap Se Configuration model parsing for constraint-based systems
CN108804136B (en) * 2018-05-31 2021-10-01 中国人民解放军国防科技大学 Configuration item type constraint inference method based on name semantics
US10528454B1 (en) * 2018-10-23 2020-01-07 Fmr Llc Intelligent automation of computer software testing log aggregation, analysis, and error remediation
CN111597069B (en) * 2020-05-21 2023-06-13 中国工商银行股份有限公司 Program processing method, device, electronic equipment and storage medium
EP3916598A1 (en) * 2020-05-26 2021-12-01 Argus Cyber Security Ltd System and method for detecting exploitation of a vulnerability of software
CN111611177B (en) * 2020-06-29 2023-06-09 中国人民解放军国防科技大学 Software performance defect detection method based on configuration item performance expectation
US11294649B1 (en) * 2021-01-13 2022-04-05 Amazon Technologies, Inc. Techniques for translating between high level programming languages

Also Published As

Publication number Publication date
CN115562645A (en) 2023-01-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant