CN115048316A - Semi-supervised software code defect detection method and device - Google Patents
Semi-supervised software code defect detection method and device
- Publication number
- CN115048316A, CN202210971176.6A
- Authority
- CN
- China
- Prior art keywords
- code
- defect
- codes
- function
- semi
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3672—Test management
- G06F11/3688—Test management for test execution, e.g. scheduling of test suites
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Stored Programmes (AREA)
Abstract
The invention discloses a semi-supervised software code defect detection method and device, belonging to the field of network security. The method comprises the following steps: S1, extracting code features; S2, using the code features extracted in step S1, defining code generation rule templates based on a rule engine, and using a code generation module to generate defect codes and neutral codes that expand the code samples for machine learning training; S3, filtering the defect codes; and S4, constructing and training a model, then discriminating defect codes and outputting the detection result. The method expands the defect samples and normal samples available for machine learning training, improves the accuracy of the generated samples, and can greatly improve code defect detection precision.
Description
Technical Field
The invention relates to the field of network security, in particular to a semi-supervised software code defect detection method and device.
Background
With the wide application of open source software in different fields, the volume and complexity of software code have increased rapidly, and security incidents caused by code defects occur frequently. Efficiently detecting code defects has become an important problem for safeguarding national and social security as well as personal information and property. Conventional code defect detection relies on manual code security audit and rule-based static code detection. The former is carried out by experts who are proficient in code defects; it demands strong professional skills, is labor-intensive, and has low detection efficiency. The latter analyzes the lexis, syntax, data flow, and program control flow of defective code, extracts software defect features, and converts them into defect scanning rules that are then used to detect defects in program code.
In recent years, artificial intelligence technology has developed rapidly, and software defect detection based on machine learning shows broad application prospects. By combining formal inference with probabilistic inference, it can use the fuzzy information in software code to judge defects, avoids the constraints of rule-based detection methods, and offers strong robustness and high detection efficiency. However, because large-scale labeled defect samples from actual engineering code are lacking, current machine-learning-based detection methods have low detection precision and poor practicability on real code.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a semi-supervised software code defect detection method and device that expand the defect samples and normal samples for machine learning training, improve the accuracy of the generated samples, and greatly improve code defect detection precision.
The purpose of the invention is realized by the following scheme:
A semi-supervised software code defect detection method comprises the following steps:
S1, extracting code features;
S2, using the code features extracted in step S1, defining code generation rule templates based on a rule engine, and using a code generation module to generate defect codes and neutral codes for expanding the code samples for machine learning training;
S3, filtering the defect codes;
S4, constructing and training a model, then discriminating defect codes and outputting the detection result.
Further, in step S2, the method includes the sub-steps of: defining two types of code generation rule templates in a rule engine, wherein one type is a defect code generation rule template, and the purpose is to generate target codes with specific defects through a code function by using a code generation module for expanding the number of a code sample library; the other type is a neutral code generation rule template, and aims to generate codes with the same functions as the original codes but different character forms by using a code generation module on the premise of not changing the semantics of the original code program, so as to expand the number of the code sample libraries.
Further, in step S2, the code generation module includes the sub-steps of:
S21, after the code features are obtained, selecting whether to generate a defect code or a neutral code; according to the category of code to be generated, randomly selecting a defect generation rule or a neutral generation rule, and, according to the precondition for executing that rule, selecting a line of the code function as the target point at which the rule is executed; if no line satisfies the precondition, switching to another defect generation rule or neutral generation rule; and if all rules have been tried and no target point has been found, moving on to the next function fragment and repeating this step;
S22, executing the generation strategy on the target point line of the code function to generate a defect code with a specific defect category, or a neutral code;
S23, expanding the generated defect codes or neutral codes to obtain a defect code set or a neutral code set, the neutral code set being the normal code set; labeling the defect code set with category labels according to defect category and the normal code set with normal labels; and adding the labeled sets to the constructed code samples to expand the defect samples and normal samples for machine learning training.
Further, in step S3, the method includes the sub-steps of:
S31, obtaining the constructed normal code function fragments and defect code function fragments from step S2;
S32, inputting the normal code function fragments and the defect code function fragments into static code detection tools, performing static code detection, and collecting the detection result of each tool;
S33, when the input is a normal code function fragment, if the proportion of static tools that detect no defect exceeds a set threshold, the generated normal code is a correctly labeled sample and is added to the code sample library; otherwise the sample is discarded; when the input is a defect code function fragment, if the proportion of static tools that detect a defect of the specific category at a position within a set offset range of the code generation target point exceeds the set threshold, the generated defect code is a correctly labeled sample and is added to the code sample library; otherwise the sample is discarded.
Further, step S4 includes the sub-steps of: constructing a fusion-feature-based deep neural code defect detection model with a parameter updating function, the model comprising a feature fusion network, a defect discrimination network, and a model updating mechanism; the feature fusion network fuses code function sample features of different dimensions; the defect discrimination network classifies the fused features, identifies whether the sample is defect code, and outputs the defect category; the model updating mechanism incrementally updates the model without affecting business operation.
Further, step S1 includes the sub-steps of: slicing the source code at a given granularity, and extracting code character string features and function abstract syntax tree features.
Further, the feature fusion network utilizes three types of features of character level, word level and abstract syntax tree as input, and converts text and tree structure into vector form by using an embedding method.
Further, a cross entropy loss function is adopted as a loss function of the deep neural code defect detection model based on the fusion features in the training process.
Further, the model updating mechanism specifically includes the sub-steps of:
S41, defining, for the i-th day, an online fusion-feature-based deep neural code defect detection model, which serves requests, and an offline fusion-feature-based deep neural code defect detection model;
S42, collecting the code functions with label information of the i-th day from the code sample library and using them to incrementally update and train the offline model;
S43, when the traffic on the i-th day is below a set value, replacing the online model with the offline model.
A semi-supervised software code defect detecting device comprises a program instruction executing unit and a program instruction storing unit, and when the program instructions are loaded and executed by the program instruction executing unit, the semi-supervised software code defect detecting method is implemented.
The beneficial effects of the invention include:
the invention utilizes a large amount of existing codes existing in the Internet, expands the defect samples and normal samples of machine learning training, further improves the accuracy of generating the samples by combining with the existing code static detection tool, and can greatly improve the code defect detection precision by the proposed deep neural code defect detection model based on the fusion characteristics.
The invention provides a new rule-based defect code generation method, which makes full use of the scale effect of open source codes and solves the problem that the existing deep learning-based model lacks large-scale actual engineering code defect samples with labels.
The invention further adopts a plurality of static code detection tools, and the accuracy of the constructed defect code label is improved.
The invention provides a deep neural code defect detection model based on fusion characteristics, which fuses the multi-dimensional characteristics of a code function and improves the defect detection effect.
The invention designs a periodic incremental model updating mechanism, which enables continuous learning iteration of the deep learning model parameters without interfering with normal services.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a deployment scenario of an embodiment of the present invention;
FIG. 2 is a flow chart of the operation of a code defect generator according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating the operation of a code defect filter according to an embodiment of the present invention;
FIG. 4 is a deep learning model used by the code defect discriminator according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating steps of a method according to an embodiment of the present invention.
Detailed Description
All features disclosed in all embodiments in this specification, or all methods or process steps implicitly disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.
In the process of seeking to solve the problems in the background, the inventors of the present invention have found, through creative thinking: the semi-supervised learning can be used for mining the characteristics contained in the data under the condition of only a small number of labels, and completing the training of the model by methods such as pseudo label generation and the like. Therefore, the inventor of the invention provides a technical scheme, applies semi-supervised learning to the field of software defect detection, fully utilizes the scale effect of the existing codes, forms a mechanism of continuous learning iteration of an intelligent model, solves the problem of rare number of defect samples of actual engineering codes, improves the defect detection precision of real codes, and has very important significance for realizing automatic detection of code defects.
In order to solve the problem of efficient and automatic detection of software code defects, in a specific implementation manner, the embodiment of the invention also at least solves the technical problems in the following aspects:
1) How to generate realistic software defect samples from actual engineering code, which addresses the lack of real training samples for machine learning.
2) How the defect detection model realizes semi-supervised learning, which addresses iterative updating of the model weights and improves defect detection capability.
3) How the features of software code are fused, which provides a multi-dimensional representation of software code features and improves the defect detection effect.
In a specific embodiment, the invention aims to realize the aim of automatic detection of software code defects, namely, when an actual engineering project with complex source codes is faced, the source codes can be quickly detected and analyzed, and the defect codes can be screened out. Meanwhile, the screening result is required to be ensured to have higher accuracy so as to ensure the practical value of the machine learning model and provide powerful support for manual further inspection and analysis.
In one embodiment of the invention, software code defect detection based on semi-supervised learning makes full use of the existing code widely available on the Internet so that the detection model learns and iterates continuously. The embodiment mainly comprises four steps: first, code engineering features are extracted, that is, the source code is sliced at a given granularity and features such as code character strings and function Abstract Syntax Trees (AST) are extracted; second, a code defect generator expands the defect samples and normal samples for machine learning training; third, a code defect filter checks the generated defects and improves the accuracy of the generated samples; fourth, a code defect discriminator is trained and outputs the detection result.
The technical scheme provided by the embodiment of the invention mainly comprises a firewall, a code defect discriminator, a code defect filter, a code defect generator, a code feature extractor, a code generation rule base, a code feature base, and a source code base, which are connected through a network to form a complete semi-supervised software code defect detection system. The working principle of the embodiment is independent of the specific deployment mode, so it is explained using only the semi-supervised software code defect detection deployment scheme shown in FIG. 1. In the deployment scenario of FIG. 1, an end user uploads engineering code by calling a defect discrimination service interface, and firewall rules filter out invalid or unauthorized access; the code defect discriminator then returns the code segments that may contain defects, and the parameters of the defect detection model are learned autonomously through the code defect discriminator in a semi-supervised manner to improve defect detection precision on real code. Specifically, the following contents are included:
In the code feature extraction process, the autonomous semi-supervised learning of model parameters relies on massive amounts of existing code, which is available on the many source code hosting platforms on the Internet, such as GitHub, Gitee, and SourceForge. Subject to compliance with the applicable open source licenses (such as BSD, GPL, LGPL, and MIT), existing projects in various programming languages can be acquired from these platforms, and a source code acquisition service uploads the acquired projects to the source code library in real time.
Then a feature extraction operation is carried out on each code project, comprising function slicing, preprocessing, and code feature extraction. Different slicing strategies are adopted for projects in different languages, and the function blocks produced by slicing are then preprocessed. The specific steps are as follows:
1) reading a project from the source code library and, according to the programming language the project uses, applying the corresponding function segmentation method to slice it into function fragments;
2) preprocessing the function fragments according to the grammatical requirements of the programming language, for example by removing comments, blank lines, and keywords;
Finally, character string features and abstract syntax features are extracted according to the characteristics of the programming language, with syntax feature extraction completed using an abstract syntax tree extraction tool provided by the programming language.
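A minimal sketch of function slicing, preprocessing, and feature extraction, assuming for illustration that the project is written in Python so that the standard-library ast module can play the role of the language-provided abstract syntax tree tool. The function name extract_function_features and the layout of the returned dictionaries are hypothetical and do not come from the patent; the word-level tokens are simply whitespace-separated strings.

```python
import ast


def extract_function_features(source: str) -> list:
    """Slice a module into functions and extract string and AST features."""
    tree = ast.parse(source)
    features = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Regenerating the code from the tree drops comments and blank
            # lines, which stands in for the preprocessing step.
            text = ast.unparse(node)
            features.append({
                "name": node.name,
                "chars": list(text),      # character-level feature
                "words": text.split(),    # word-level feature (assumed tokenization)
                "ast_nodes": [type(n).__name__ for n in ast.walk(node)],
            })
    return features


if __name__ == "__main__":
    sample = "def add(a, b):\n    # sum two values\n\n    return a + b\n"
    for item in extract_function_features(sample):
        print(item["name"], item["ast_nodes"])
```

For C or C++ projects a separate parser would be needed; the sketch only shows the shape of the three feature types that later feed the model.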
In the code defect generation process, shown in FIG. 2, code defect generation is the key to realizing semi-supervised defect detection: it automatically converts a collected code function into target code with a specific defect, or into a code function with the same functionality but a different form. The code defect generator internally maintains a code generation module based on a rule engine, in which two types of code generation rule templates are defined. One type is defect rule code generation, whose purpose is to generate target code with specific defects from existing code functions; the other is neutral rule code generation, whose purpose is to generate a batch of codes that have the same functionality as the original code but different character forms, without changing the semantics of the original program, thereby expanding the code sample library. FIG. 2 is a schematic diagram of the workflow of the code defect generator, whose working principle is as follows:
a code function fragment and its corresponding function features are obtained from the code feature extractor;
according to the category of code to be generated, a defect generation rule is randomly selected, and, according to the precondition for executing the rule, a line of the code function is selected as the target point at which the rule is executed; if no line satisfies the precondition, another generation rule is selected; if all rules have been tried and no target point has been found, the next function fragment is taken and the process continues;
the generation strategy is executed on the target point line of the code function to generate a defect code with a specific defect category;
to expand the generated defect samples, some of the neutral code generation rules can be randomly selected and executed on the defect code to produce a series of defect codes;
similarly, a series of codes with the same functionality as the function fragment but different character forms can be generated;
the defect code set and the normal code set are labeled according to defect category and added to the constructed code samples.
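The rule selection loop can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the Rule dataclass, the generate_sample function, and the DOUBLE_FREE example rule are hypothetical names, and a function fragment is modelled simply as a list of source lines.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rule:
    label: str  # defect category, or "normal" for a neutral rule
    find_target: Callable[[List[str]], Optional[int]]  # precondition check
    apply: Callable[[List[str], int], List[str]]       # generation strategy


def generate_sample(lines, rules, rng=random):
    """Try the rules in random order until one finds a target line.

    Returns (generated_lines, label), or None when no rule applies, in
    which case the caller moves on to the next function fragment.
    """
    for rule in rng.sample(rules, len(rules)):
        target = rule.find_target(lines)
        if target is not None:
            return rule.apply(lines, target), rule.label
    return None


def _find_free(lines):
    # Precondition: the fragment contains a `free(x);` statement.
    for i, line in enumerate(lines):
        if re.search(r"\bfree\s*\(\s*\w+\s*\)\s*;", line):
            return i
    return None


def _repeat_line(lines, i):
    # Action: duplicate the release statement directly below the target.
    return lines[: i + 1] + [lines[i]] + lines[i + 1:]


DOUBLE_FREE = Rule("double free", _find_free, _repeat_line)

if __name__ == "__main__":
    fragment = ["char *p = malloc(8);", "free(p);", "return 0;"]
    print(generate_sample(fragment, [DOUBLE_FREE]))
```

Keeping the precondition and the action as separate callables mirrors the description, in which a rule is first checked for an applicable target line and only then executed.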
In a specific implementation, the defect rules and the code generation principle of the embodiment are illustrated with memory-type defect code generation:
for the double free defect (Double Free), target point selection works as follows: within the function fragment, regular-expression string matching finds a memory release statement such as delete(var1) or free(var1), and its line number n becomes the target point; the rule action adds an identical memory release statement below the target point. For example, original code: free(var1); code after rule execution: free(var1); free(var1);
for the use-after-free defect (Use After Free), target point selection works as follows: within the function fragment, regular-expression string matching finds a memory release statement such as delete(var1) or free(var1), and the line number n of a var1 = NULL statement that appears after it becomes the target point; the rule action deletes the line at the target point and adds a memory operation such as a memory copy. For example, original code: free(var1); var1 = NULL; code after rule execution: free(var1); memcpy(var1, "test", 10);
for the uninitialized variable defect (Use of Uninitialized Variable), target point selection works as follows: within the function fragment, regular-expression string matching finds a variable declaration such as int a or char c, and its line number n becomes the target point; the rule action adds an operation that prints the variable below the target point. For example, original code: int a; code after rule execution: int a; printf("%d", a);
for the undefined variable defect (Use of Undefined Variable), target point selection works as follows: within the function fragment, regular-expression string matching finds a variable assignment such as a = 5 or b = 1.5, and the line number n of the declaration int a or float b that appears before it becomes the target point; the rule action deletes the line at the target point. For example, original code: int a; a = 5; code after rule execution: a = 5;
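Two of the defect rules above can be sketched with regular expressions. The function names inject_use_after_free and inject_undefined_variable are hypothetical, the patterns assume one statement per line, and only a handful of C types are recognized; a real rule base would need a far more robust matcher.

```python
import re
from typing import List, Optional

FREE_RE = re.compile(r"^\s*free\s*\(\s*(\w+)\s*\)\s*;\s*$")
ASSIGN_RE = re.compile(r"^\s*(\w+)\s*=[^=].*;\s*$")


def inject_use_after_free(lines: List[str]) -> Optional[List[str]]:
    """Replace the `v = NULL;` that follows `free(v);` with a use of v."""
    for i, line in enumerate(lines):
        freed = FREE_RE.match(line)
        if not freed:
            continue
        var = freed.group(1)
        null_re = re.compile(rf"^(\s*){re.escape(var)}\s*=\s*NULL\s*;\s*$")
        for j in range(i + 1, len(lines)):
            reset = null_re.match(lines[j])
            if reset:
                # Target point found: drop the reset, add a memory operation.
                use = f'{reset.group(1)}memcpy({var}, "test", 5);'
                return lines[:j] + [use] + lines[j + 1:]
    return None  # precondition not met


def inject_undefined_variable(lines: List[str]) -> Optional[List[str]]:
    """Delete the declaration that precedes an assignment to a variable."""
    for i, line in enumerate(lines):
        assigned = ASSIGN_RE.match(line)
        if not assigned:
            continue
        var = assigned.group(1)
        decl_re = re.compile(
            rf"^\s*(int|float|char|double)\s+{re.escape(var)}\s*;\s*$")
        for j in range(i):
            if decl_re.match(lines[j]):
                return lines[:j] + lines[j + 1:]
    return None  # precondition not met


if __name__ == "__main__":
    print(inject_use_after_free(["free(p);", "p = NULL;"]))
    print(inject_undefined_variable(["int a;", "a = 5;"]))
```

Both functions return None when the precondition is not met, which lets a caller fall through to another rule.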
In a specific implementation, some of the neutral code generation rules and the code generation principle are described as follows:
Variable renaming: a local variable is renamed to a random name that is not already used in its scope, for example: int number is rewritten as int qsfgrward.
Variable definition order swapping: variables defined on the same line or on adjacent lines exchange positions, for example: int a, b is rewritten as int b, a.
Comparison operand swapping: the two sides of a comparison operator, such as the one in an if condition, are exchanged, and an ordering operator is mirrored so that the semantics are preserved, for example: if (a == b) is rewritten as if (b == a); if (a > b) is rewritten as if (b < a).
Conditional branch swapping: the branches of an if-else statement are exchanged and the condition is negated, for example: if (a > b) { a = 1; } else { b = 1; } is rewritten as if (a <= b) { b = 1; } else { a = 1; }.
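A minimal sketch of two neutral rules, variable renaming and comparison operand swapping. The names swap_operands, rename_variable, and MIRROR are hypothetical. Note that swapping the operands of an ordering comparison only preserves semantics if the operator is mirrored (a > b is equivalent to b < a), which the MIRROR table encodes. The renaming is a whole-word textual substitution and would also touch string literals or struct members with the same name, so it is a simplification.

```python
import random
import re
import string

# Operator to use after the two operands have exchanged sides.
MIRROR = {"==": "==", "!=": "!=", "<": ">", ">": "<", "<=": ">=", ">=": "<="}
CMP_RE = re.compile(r"if\s*\(\s*(\w+)\s*(==|!=|<=|>=|<|>)\s*(\w+)\s*\)")


def swap_operands(line: str) -> str:
    """Rewrite `if (a OP b)` as `if (b OP' a)` with the mirrored operator."""
    return CMP_RE.sub(
        lambda m: f"if ({m.group(3)} {MIRROR[m.group(2)]} {m.group(1)})",
        line,
    )


def rename_variable(code: str, old: str, rng=random) -> str:
    """Replace whole-word occurrences of `old` with an unused random name."""
    while True:
        suffix = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
        new = "v_" + suffix  # the prefix keeps the name clear of C keywords
        if not re.search(rf"\b{new}\b", code):
            break
    return re.sub(rf"\b{re.escape(old)}\b", new, code)


if __name__ == "__main__":
    print(swap_operands("if (a > b) { a = 1; }"))
    print(rename_variable("int number; number = 1;", "number"))
```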
In the code defect filtering process, shown in FIG. 3, the problem to be solved is that the code defect generation process constructs code automatically, so some generated defect codes inevitably carry the wrong category and some generated normal codes themselves contain defects. The code defect filtering designed in the embodiment filters the code produced by the code defect generator, which ensures the accuracy of the code sample labels and indirectly improves the precision of the code defect detection method. The code defect filter internally uses multiple existing static code detection tools (such as CodeQL, Flawfinder, and Cppcheck), and screens defect codes and normal codes from the tools' detection results according to a majority decision criterion.
In a specific implementation, FIG. 3 is a schematic flow chart of the code defect filter, whose working principle is as follows:
the constructed normal code function fragments and defect code function fragments are obtained from the code defect generator;
the normal code function fragments and defect code function fragments are input into the existing static code detection tools, static code detection is performed, and the detection result of each tool is collected;
when the input is a normal code function fragment, if most of the static tools detect no defect, the generated normal code is a correctly labeled sample and is added to the code sample library; otherwise the sample is discarded;
when the input is a defect code function fragment, if most of the static tools detect a defect of the specific category at a position within an offset of [-5, 5] lines from the code generation target point, the generated defect code is a correctly labeled sample and is added to the code sample library; otherwise the sample is discarded.
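The majority decision can be sketched as below. The Finding record and the two function names are hypothetical, and the sketch assumes each tool's output has already been parsed into (line, category) pairs; it does not invoke any real static analysis tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    line: int       # line number reported by a tool
    category: str   # defect category reported by a tool


def keep_normal_sample(tool_findings: List[List[Finding]]) -> bool:
    """Keep a generated normal sample if most tools report no defect."""
    clean = sum(1 for findings in tool_findings if not findings)
    return clean > len(tool_findings) / 2


def keep_defect_sample(tool_findings: List[List[Finding]], target_line: int,
                       category: str, offset: int = 5) -> bool:
    """Keep a generated defect sample if most tools report the expected
    category within `offset` lines of the generation target point."""
    hits = sum(
        1 for findings in tool_findings
        if any(f.category == category and abs(f.line - target_line) <= offset
               for f in findings)
    )
    return hits > len(tool_findings) / 2


if __name__ == "__main__":
    near = Finding(12, "double free")
    # Two of three tools agree on a double free close to line 10.
    print(keep_defect_sample([[near], [Finding(14, "double free")], []],
                             10, "double free"))
```

Each inner list holds the findings of one tool, so an empty list means that tool reported nothing for the fragment.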
In the code defect discrimination process, shown in FIG. 4, code defect discrimination is the core of semi-supervised defect detection. The embodiment designs a fusion-feature-based deep neural code defect detection model with a parameter updating function, which mainly comprises three parts: a feature fusion network, a defect discrimination network, and a model updating mechanism. The feature fusion network fuses code function sample features of different dimensions; the defect discrimination network classifies the fused features, identifies whether the sample is defect code, and outputs the defect category; and the model updating mechanism incrementally updates the model without affecting business operation.
As shown in FIG. 4, the deep learning model used by the code defect discriminator comprises two parts: a feature fusion network and a defect discrimination network. The feature fusion network takes three types of features as input (character level, word level, and abstract syntax tree) and uses embedding methods such as word2vec and node2vec to convert the text and tree structures into vectors. Three feature fusion networks are defined, one per feature type, whose input vectors are the character feature vector, the word-level feature vector, and the abstract syntax tree feature vector. Each layer of a feature fusion network is described by its input vector, its output, its weight, and its offset, together with a hyperparameter set to 0.5 by default. The activation function is a variant of the linear rectification function (Leaky ReLU), chosen to avoid the gradient becoming zero when the input is near zero or negative. The forward pass of the feature fusion network is defined layer by layer in terms of these quantities.
The defect discrimination network is defined in the same way: each of its layers is described by its input vector, its output, its weight, and its offset, and its activation function is likewise the variant linear rectification function. Because the task is a multi-class classification problem, the output function of the defect discrimination network is the softmax function. The forward pass of the defect discrimination network is defined layer by layer in terms of these quantities.
Training compares the label of the training data with the output label of the defect discrimination model. The index of the maximum output value is taken as the output of the defect discriminator, and a cross-entropy loss function is used when training the fusion-feature-based deep neural network.
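A small pure-Python sketch of the forward pass and loss may help fix ideas. It rests on several assumptions that are not stated in the text: each feature branch is a single dense layer, fusion is done by concatenating the branch outputs, the Leaky ReLU negative slope is 0.01, and the hyperparameter with default 0.5 is left out because its role is not recoverable here. All function names are hypothetical.

```python
import math


def leaky_relu(x, slope=0.01):
    """Variant linear rectification: small non-zero slope for x <= 0."""
    return x if x > 0 else slope * x


def dense(vec, weights, bias, activation=None):
    """One layer: returns activation(W @ vec + b); weights is a list of rows."""
    out = [sum(w * v for w, v in zip(row, vec)) + b
           for row, b in zip(weights, bias)]
    return [activation(o) for o in out] if activation else out


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross-entropy loss for a single sample with an integer class label."""
    return -math.log(probs[label])


def predict(char_vec, word_vec, ast_vec, branches, head):
    """Fuse the three feature branches (by concatenation, an assumption)
    and classify the fused vector with a softmax output layer."""
    fused = []
    for vec, (weights, bias) in zip((char_vec, word_vec, ast_vec), branches):
        fused += dense(vec, weights, bias, leaky_relu)
    weights, bias = head
    return softmax(dense(fused, weights, bias))


if __name__ == "__main__":
    branches = [([[1.0]], [0.0])] * 3                  # toy 1-unit branches
    head = ([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.0, 0.0])
    probs = predict([2.0], [-1.0], [0.5], branches, head)
    print(probs.index(max(probs)), cross_entropy(probs, 0))
```

The predicted class is the index of the largest softmax output, which corresponds to the maximum-index selection described for the defect discriminator.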
To realize continuous learning iteration of the deep learning model parameters, the scheme adopts an offline incremental model updating mechanism, whose working principle is as follows:
1) for the i-th day, define the defect detection discrimination model that works online and the offline defect detection discrimination model;
2) collect the code functions with label information of the i-th day from the code sample library and use them for incremental update training of the offline model;
3) at a time on the i-th day when traffic is low (for example, 11 o'clock at night), replace the online model with the offline model.
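The online/offline swap can be sketched as follows. The ModelUpdater class, its method names, and the use of a numeric traffic threshold are assumptions made for illustration; the model is treated as an opaque object that a caller-supplied train_step function updates in place.

```python
import copy


class ModelUpdater:
    """Keeps an online model serving requests and an offline copy in training."""

    def __init__(self, model, traffic_threshold):
        self.online = model
        self.offline = copy.deepcopy(model)
        self.traffic_threshold = traffic_threshold

    def train_offline(self, labelled_samples, train_step):
        """Incrementally update the offline copy with the day's samples."""
        for sample in labelled_samples:
            train_step(self.offline, sample)

    def maybe_swap(self, current_traffic) -> bool:
        """Promote the offline model when traffic is below the threshold."""
        if current_traffic < self.traffic_threshold:
            self.online = copy.deepcopy(self.offline)
            return True
        return False


if __name__ == "__main__":
    updater = ModelUpdater({"w": 0}, traffic_threshold=10)
    updater.train_offline([1, 2, 3],
                          lambda m, s: m.update(w=m["w"] + s))
    print(updater.maybe_swap(current_traffic=3), updater.online)
```

Copying on promotion keeps the online model isolated from further offline training, so requests are never served by a model that is mid-update.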
As shown in FIG. 5, the embodiment of the invention uses the large amount of existing code on the Internet to expand the defect samples and normal samples for machine learning training, uses existing static code detection tools to further improve the accuracy of the generated samples, and greatly improves code defect detection precision through the fusion-feature-based deep neural code defect detection model. Compared with existing schemes, it has the following beneficial effects and advantages:
1) a rule-based defect code generation method is designed, which makes full use of the scale effect of existing code and solves the problem that existing deep-learning-based models lack large-scale labeled defect samples from actual engineering code;
2) multiple existing static code detection tools are combined, producing a new technical effect, namely improved accuracy of the constructed defect code labels;
3) a fusion-feature-based deep neural code defect detection model is designed, which fuses the multi-dimensional features of a code function and improves the defect detection effect;
4) a periodic incremental model updating mechanism is designed, which enables continuous learning iteration of the deep learning model parameters without interfering with normal services.
Example 1
A semi-supervised software code defect detection method comprises the following steps:
S1, extracting code features;
S2, using the code features extracted in step S1, defining code generation rule templates based on a rule engine, and using a code generation module to generate defect codes and neutral codes to expand the code samples for machine learning training;
S3, filtering the defect codes;
S4, constructing and training a model, then discriminating defect codes and outputting the detection result.
Example 2
On the basis of embodiment 1, step S2 includes the following sub-steps: defining two types of code generation rule templates in the rule engine. One type is the defect code generation rule template, whose purpose is to use the code generation module to generate, from a code function, target code with a specific defect, thereby expanding the code sample library. The other type is the neutral code generation rule template, whose purpose is to use the code generation module to generate code that has the same function as the original code but a different textual form, without changing the semantics of the original program, likewise expanding the code sample library.
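The two template types can be illustrated with a minimal sketch. The `Rule` structure, the two example rules (an off-by-one loop bound as a defect rule, an expanded increment as a neutral rule) and the C-like input lines are all assumptions chosen for illustration; the source does not list concrete rules.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Rule:
    name: str
    kind: str                       # "defect" or "neutral"
    label: str                      # defect category, or "normal" for neutral rules
    pattern: str                    # precondition: regex a line must match
    rewrite: Callable[[str], str]   # transformation applied to the matching line

# Hypothetical defect rule: turn `i < n` in a for-loop bound into `i <= n`.
OFF_BY_ONE = Rule(
    name="loop-bound-off-by-one", kind="defect", label="out-of-bounds",
    pattern=r"for\s*\(.*;\s*\w+\s*<\s*\w+\s*;",
    rewrite=lambda line: re.sub(r"(;\s*\w+\s*)<(\s*\w+\s*;)", r"\1<=\2", line, count=1),
)

# Hypothetical neutral rule: rewrite `x++;` as `x = x + 1;` (same semantics, new text).
EXPAND_INCREMENT = Rule(
    name="expand-increment", kind="neutral", label="normal",
    pattern=r"^\s*\w+\+\+;\s*$",
    rewrite=lambda line: re.sub(r"(\w+)\+\+;", r"\1 = \1 + 1;", line),
)

def apply_rule(rule: Rule, line: str) -> Optional[str]:
    """Return the rewritten line, or None when the rule's precondition is not met."""
    if re.search(rule.pattern, line) is None:
        return None
    return rule.rewrite(line)
```

Keeping the precondition separate from the rewrite lets a generator test whether a rule applies before committing to it.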
Example 3
On the basis of embodiment 2, in step S2, the code generation module includes the sub-steps of:
S21, after the code features are obtained, choosing whether to generate a defect code or a neutral code; randomly selecting a defect generation rule or a neutral generation rule according to the required category; according to the precondition of the selected rule, selecting a line of the code function as the target point at which the rule is executed; if no line satisfies the precondition, switching to another defect generation rule or neutral generation rule; and if all rules have been tried and no target point has been found, moving on to the next function segment and repeating this step;
S22, executing the generation strategy on the target point line of the code function to generate a defect code with a specific defect category, or a neutral code;
S23, collecting the generated defect codes with specific defect categories, or the neutral codes, to obtain a defect code set or a neutral code set, where the neutral code set is the normal code set; labeling the defect code set with category labels according to defect category and the normal code set with the normal label; and adding them to the constructed code samples to expand the defect samples and normal samples for machine learning training.
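Steps S21 to S23 can be sketched as a small search loop. The rule table, the regex preconditions and the representation of a function as a list of lines are hypothetical; only the control flow (random rule choice, target-point search, fallback to the next function segment, labeling) follows the description above.

```python
import random
import re

# Hypothetical rule table: kind -> list of (label, precondition regex, rewrite function).
RULES = {
    "defect": [
        ("out-of-bounds", r";\s*\w+\s*<\s*\w+\s*;", lambda s: s.replace("<", "<=", 1)),
    ],
    "neutral": [
        ("normal", r"\w+\+\+;", lambda s: re.sub(r"(\w+)\+\+;", r"\1 += 1;", s)),
    ],
}

def generate(functions, kind, rng):
    """Return (mutated_lines, label, target_line) for the first function a rule fits, else None."""
    for lines in functions:                    # no rule fits: move on to the next function segment
        rules = RULES[kind][:]
        rng.shuffle(rules)                     # S21: random choice among rules of this kind
        for label, pattern, rewrite in rules:
            targets = [i for i, ln in enumerate(lines) if re.search(pattern, ln)]
            if not targets:
                continue                       # precondition unmet: try another rule
            t = rng.choice(targets)            # target point (index of the chosen line)
            mutated = lines[:t] + [rewrite(lines[t])] + lines[t + 1:]   # S22: apply the rule
            return mutated, label, t           # S23: the sample together with its label
    return None

funcs = [
    ["int f(void) {", "  return 0;", "}"],
    ["int sum(int *a, int n) {", "  int total = 0;",
     "  for (int i = 0; i < n; i++) {", "    total += a[i];", "  }",
     "  return total;", "}"],
]
result = generate(funcs, "defect", random.Random(0))
```

Returning the target line with the label keeps the information that a later filtering step needs in order to compare tool findings against the injected defect's position.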
Example 4
On the basis of embodiment 1, in step S3, the method includes the sub-steps of:
S31, obtaining the constructed normal code function segments and defect code function segments from step S2;
S32, inputting the normal code function segments and the defect code function segments into static code detection tools, performing static code detection, and collecting the detection results of each tool;
S33, for an input that is a normal code function segment, if the detection results of the static tools exceed the set range and no defect is detected, the generated normal code is a correctly labeled sample and is added to the code sample library; otherwise the sample is discarded. For an input that is a defect code function segment, if the position reported by the static tools lies within the set offset range of the code generation target point, and the detection results exceed the set range and detect the defect of the specific category, the generated defect code is a correctly labeled sample and is added to the code sample library; otherwise the sample is discarded.
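The filtering in S31 to S33 can be sketched as below. This sketch reads "the set range" as a minimum number of agreeing tools, which is an interpretation rather than something the source states; the `Finding` record, `max_offset` and `min_agree` are likewise hypothetical names and defaults.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Finding:
    tool: str        # name of the static tool that reported the issue
    category: str    # defect category reported by the tool
    line: int        # line number of the reported issue

def keep_sample(label: str, target_line: Optional[int], findings: List[Finding],
                tools_run: int, max_offset: int = 2, min_agree: int = 2) -> bool:
    """Decide whether a generated sample's label is trusted (assumed reading of S33)."""
    if label == "normal":
        # Normal sample: enough tools were run and none of them reported a defect.
        return tools_run >= min_agree and not findings
    # Defect sample: enough distinct tools report the expected category near the target point.
    agreeing = {
        f.tool for f in findings
        if f.category == label and abs(f.line - target_line) <= max_offset
    }
    return len(agreeing) >= min_agree
```

Counting distinct tool names rather than findings prevents one tool that reports the same issue twice from satisfying the agreement threshold on its own.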
Example 5
On the basis of embodiment 1, step S4 includes the following sub-steps: constructing a deep neural code defect detection model that is based on fusion features and has a parameter update function, the model comprising a feature fusion network, a defect discrimination network and a model update mechanism. The feature fusion network fuses code function sample features of different dimensions; the defect discrimination network classifies the fused features, identifies whether the code is defective, and outputs the defect category; the model update mechanism performs incremental updates of the model without affecting business operation.
Example 6
On the basis of embodiment 1, step S1 includes the following sub-steps: slicing the source code at a certain granularity, and extracting the code character string features and the function abstract syntax tree features.
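As a minimal sketch of this step, the example below slices source code at function granularity and extracts the code string and a flattened list of abstract syntax tree node types. It uses Python's standard `ast` module on Python source purely as a stand-in; the source does not name the target language or parser, and the feature dictionary layout is an assumption.

```python
import ast

def extract_features(source):
    """Slice source into functions; return code string and AST node types for each one."""
    tree = ast.parse(source)
    features = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            text = ast.get_source_segment(source, node)          # code string of the slice
            node_types = [type(n).__name__ for n in ast.walk(node)]  # flattened AST
            features.append({"name": node.name, "text": text, "ast_nodes": node_types})
    return features

example = "def add(a, b):\n    return a + b\n\ndef neg(x):\n    return -x\n"
feats = extract_features(example)
```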
Example 7
On the basis of embodiment 5, the feature fusion network takes three types of features as input (character-level, word-level and abstract syntax tree) and converts the text and tree structures into vector form using an embedding method.
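The idea of embedding three feature types and fusing them can be sketched as follows. The hash-based embedding, mean pooling and concatenation are stand-ins chosen to keep the example dependency-free; the source does not specify the embedding method or the fusion operator, and a real model would learn these embeddings.

```python
import hashlib

DIM = 8  # hypothetical embedding width per feature type

def embed_token(token, dim=DIM):
    """Deterministic hash-based stand-in for a learned embedding lookup."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [b / 255.0 - 0.5 for b in digest[:dim]]

def embed_sequence(tokens, dim=DIM):
    """Mean-pool the token embeddings into one fixed-size vector."""
    if not tokens:
        return [0.0] * dim
    vecs = [embed_token(t, dim) for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def fuse(chars, words, ast_nodes):
    """Concatenate character-level, word-level and AST vectors into one fused feature."""
    return embed_sequence(chars) + embed_sequence(words) + embed_sequence(ast_nodes)

fused = fuse(list("a + b"), ["a", "+", "b"], ["BinOp", "Name", "Add", "Name"])
```

The fused vector has a fixed length of `3 * DIM` regardless of function size, which is what allows a downstream classifier to consume it.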
Example 8
On the basis of embodiment 5, the fusion-feature-based deep neural code defect detection model uses a cross-entropy loss function during training.
Example 9
On the basis of embodiment 5, the model update mechanism specifically includes the sub-steps of:
S41, defining the fusion-feature-based deep neural code defect detection model working online on day i and the offline fusion-feature-based deep neural code defect detection model;
S42, collecting the labeled code functions of day i from the code sample library for incremental update training of the offline model;
S43, when the traffic on day i is less than the set value, replacing the online model with the offline model.
Example 10
A semi-supervised software code defect detecting apparatus includes a program instruction executing unit and a program instruction storing unit, and when a program instruction is loaded and executed by the program instruction executing unit, the semi-supervised software code defect detecting method as described in any one of embodiments 1 to 9 is performed.
The units described in the embodiments of the present invention may be implemented in software or in hardware, and the described units may also be disposed in a processor. The names of the units do not, in any case, constitute a limitation on the units themselves.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations described above.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method described in the above embodiments.
The parts not involved in the present invention are the same as or can be implemented using the prior art.
The above-described embodiment is only one embodiment of the present invention. It will be apparent to those skilled in the art that various modifications and variations can readily be made based on the application and principle disclosed herein; the present invention is not limited to the method described in the above embodiment, which is therefore only preferred and not restrictive.
Other embodiments than the above examples may be devised by those skilled in the art based on the foregoing disclosure, or by adapting and using knowledge or techniques of the relevant art, and features of various embodiments may be interchanged or substituted and such modifications and variations that may be made by those skilled in the art without departing from the spirit and scope of the present invention are intended to be within the scope of the following claims.
Claims (10)
1. A semi-supervised software code defect detection method is characterized by comprising the following steps:
S1, extracting code features;
S2, using the code features extracted in step S1, defining code generation rule templates based on a rule engine, and using a code generation module to generate defect codes and neutral codes to expand the code samples for machine learning training;
S3, filtering the defect codes;
S4, constructing and training a model, then discriminating defect codes and outputting the detection result.
2. The semi-supervised software code defect detection method according to claim 1, wherein step S2 comprises the sub-steps of: defining two types of code generation rule templates in the rule engine, wherein one type is the defect code generation rule template, whose purpose is to use the code generation module to generate, from a code function, target code with a specific defect, thereby expanding the code sample library; and the other type is the neutral code generation rule template, whose purpose is to use the code generation module to generate code that has the same function as the original code but a different textual form, without changing the semantics of the original program, likewise expanding the code sample library.
3. The semi-supervised software code defect detection method of claim 2, wherein in step S2, the code generation module comprises the sub-steps of:
S21, after the code features are obtained, choosing whether to generate a defect code or a neutral code; randomly selecting a defect generation rule or a neutral generation rule according to the required category; according to the precondition of the selected rule, selecting a line of the code function as the target point at which the rule is executed; if no line satisfies the precondition, switching to another defect generation rule or neutral generation rule; and if all rules have been tried and no target point has been found, moving on to the next function segment and repeating this step;
S22, executing the generation strategy on the target point line of the code function to generate a defect code with a specific defect category, or a neutral code;
S23, collecting the generated defect codes with specific defect categories, or the neutral codes, to obtain a defect code set or a neutral code set, wherein the neutral code set is the normal code set; labeling the defect code set with category labels according to defect category and the normal code set with the normal label; and adding them to the constructed code samples to expand the defect samples and normal samples for machine learning training.
4. The semi-supervised software code defect detection method according to claim 1, wherein step S3 comprises the sub-steps of:
S31, obtaining the constructed normal code function segments and defect code function segments from step S2;
S32, inputting the normal code function segments and the defect code function segments into static code detection tools, performing static code detection, and collecting the detection results of each tool;
S33, for an input that is a normal code function segment, if the detection results of the static tools exceed the set range and no defect is detected, the generated normal code is a correctly labeled sample and is added to the code sample library, otherwise the sample is discarded; and for an input that is a defect code function segment, if the position reported by the static tools lies within the set offset range of the code generation target point, and the detection results exceed the set range and detect the defect of the specific category, the generated defect code is a correctly labeled sample and is added to the code sample library, otherwise the sample is discarded.
5. The semi-supervised software code defect detection method according to claim 1, wherein step S4 comprises the sub-steps of: constructing a deep neural code defect detection model that is based on fusion features and has a parameter update function, the model comprising a feature fusion network, a defect discrimination network and a model update mechanism; the feature fusion network is used for fusing code function sample features of different dimensions; the defect discrimination network is used for classifying the fused features, identifying whether the code is defective, and then outputting the defect category; and the model update mechanism is used for performing incremental updates of the model without affecting business operation.
6. The semi-supervised software code defect detection method according to claim 1, wherein step S1 comprises the sub-steps of: slicing the source code at a certain granularity, and extracting the code character string features and the function abstract syntax tree features.
7. The semi-supervised software code defect detection method of claim 5, wherein the feature fusion network takes three types of features of character level, word level and abstract syntax tree as input, and converts text and tree structures into vector form by using an embedding method.
8. The semi-supervised software code defect detection method of claim 5, wherein the loss function of the fusion feature-based deep neural code defect detection model in the training process adopts a cross entropy loss function.
9. The semi-supervised software code defect detection method as recited in claim 5, wherein the model update mechanism specifically comprises the sub-steps of:
S41, defining the fusion-feature-based deep neural code defect detection model working online on day i and the offline fusion-feature-based deep neural code defect detection model;
S42, collecting the labeled code functions of day i from the code sample library for incremental update training of the offline model;
10. A semi-supervised software code defect detecting device, comprising a program instruction executing unit and a program instruction storing unit, wherein when the program instruction is loaded and executed by the program instruction executing unit, the semi-supervised software code defect detecting method according to any one of claims 1 to 9 is performed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210971176.6A CN115048316B (en) | 2022-08-15 | 2022-08-15 | Semi-supervised software code defect detection method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210971176.6A CN115048316B (en) | 2022-08-15 | 2022-08-15 | Semi-supervised software code defect detection method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115048316A true CN115048316A (en) | 2022-09-13 |
CN115048316B CN115048316B (en) | 2022-12-09 |
Family
ID=83166588
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210971176.6A Active CN115048316B (en) | 2022-08-15 | 2022-08-15 | Semi-supervised software code defect detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115048316B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115617694A (en) * | 2022-11-30 | 2023-01-17 | 中南大学 | Software defect prediction method, system, device and medium based on information fusion |
CN115629995A (en) * | 2022-12-21 | 2023-01-20 | 中南大学 | Software defect positioning method, system and equipment based on multi-dependency LSTM |
CN116662206A (en) * | 2023-07-24 | 2023-08-29 | 泰山学院 | Computer software online real-time visual debugging method and device |
CN117290238A (en) * | 2023-10-10 | 2023-12-26 | 湖北大学 | Software defect prediction method and system based on heterogeneous relational graph neural network |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9069737B1 (en) * | 2013-07-15 | 2015-06-30 | Amazon Technologies, Inc. | Machine learning based instance remediation |
US20190163609A1 (en) * | 2017-11-29 | 2019-05-30 | International Business Machines Corporation | Cognitive dynamic script language builder |
CN110162475A (en) * | 2019-05-27 | 2019-08-23 | 浙江工业大学 | A kind of Software Defects Predict Methods based on depth migration |
CN111459799A (en) * | 2020-03-03 | 2020-07-28 | 西北大学 | Software defect detection model establishing and detecting method and system based on Github |
CN112597063A (en) * | 2021-02-26 | 2021-04-02 | 北京北大软件工程股份有限公司 | Method, device and storage medium for positioning defect code |
CN113221960A (en) * | 2021-04-20 | 2021-08-06 | 西北大学 | Construction method and collection method of high-quality vulnerability data collection model |
US11106801B1 (en) * | 2020-11-13 | 2021-08-31 | Accenture Global Solutions Limited | Utilizing orchestration and augmented vulnerability triage for software security testing |
CN114490344A (en) * | 2021-12-31 | 2022-05-13 | 北京航空航天大学 | Software integration evaluation method based on machine learning and static analysis |
Non-Patent Citations (5)
Title |
---|
RUDOLF FERENC等: ""An automatically created novel bug dataset and its validation in bug prediction"", 《JOURNAL OF SYSTEMS AND SOFTWARE》 * |
WENBO ZHENG等: ""Software Defect Prediction Model Based on Improved Deep Forest and AutoEncoder by Forest"", 《THE 31ST INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING》 * |
张肖等: "一种半监督集成学习软件缺陷预测方法", 《小型微型计算机系统》 * |
郑显达: ""基于知识图谱和表示学习的软件缺陷预测系统设计与实现"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
郭宏宇 等: ""基于改进型循环神经网络的恶意代码分类检测"", 《信息技术》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115617694A (en) * | 2022-11-30 | 2023-01-17 | 中南大学 | Software defect prediction method, system, device and medium based on information fusion |
CN115629995A (en) * | 2022-12-21 | 2023-01-20 | 中南大学 | Software defect positioning method, system and equipment based on multi-dependency LSTM |
CN116662206A (en) * | 2023-07-24 | 2023-08-29 | 泰山学院 | Computer software online real-time visual debugging method and device |
CN116662206B (en) * | 2023-07-24 | 2024-02-13 | 泰山学院 | Computer software online real-time visual debugging method and device |
CN117290238A (en) * | 2023-10-10 | 2023-12-26 | 湖北大学 | Software defect prediction method and system based on heterogeneous relational graph neural network |
CN117290238B (en) * | 2023-10-10 | 2024-04-09 | 湖北大学 | Software defect prediction method and system based on heterogeneous relational graph neural network |
Also Published As
Publication number | Publication date |
---|---|
CN115048316B (en) | 2022-12-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115048316B (en) | Semi-supervised software code defect detection method and device | |
CN108647520B (en) | Intelligent fuzzy test method and system based on vulnerability learning | |
CN108446540B (en) | Program code plagiarism type detection method and system based on source code multi-label graph neural network | |
CN111459799B (en) | Software defect detection model establishing and detecting method and system based on Github | |
CN112733156B (en) | Intelligent detection method, system and medium for software vulnerability based on code attribute graph | |
US20080112620A1 (en) | Automated system for understanding document content | |
Al-Obeidallah et al. | A survey on design pattern detection approaches | |
CN112560036B (en) | C/C + + vulnerability static detection method based on neural network and deep learning | |
CN115357904B (en) | Multi-class vulnerability detection method based on program slicing and graph neural network | |
CN113138920B (en) | Software defect report allocation method and device based on knowledge graph and semantic role labeling | |
CN113742205B (en) | Code vulnerability intelligent detection method based on man-machine cooperation | |
CN117215935A (en) | Software defect prediction method based on multidimensional code joint graph representation | |
CN117236677A (en) | RPA process mining method and device based on event extraction | |
CN117520561A (en) | Entity relation extraction method and system for knowledge graph construction in helicopter assembly field | |
CN109800420A (en) | A kind of feasibility study review report automatic generation method and storage medium | |
CN117454387A (en) | Vulnerability code detection method based on multidimensional feature extraction | |
CN114218580A (en) | Intelligent contract vulnerability detection method based on multi-task learning | |
CN117390189A (en) | Neutral text generation method based on pre-classifier | |
CN116841869A (en) | Java code examination comment generation method and device based on code structured information and examination knowledge | |
CN116361816B (en) | Intelligent contract vulnerability detection method, system, storage medium and equipment | |
CN117473510B (en) | Automatic vulnerability discovery technology based on relationship between graph neural network and vulnerability patch | |
CN117592061B (en) | Source code security detection method and device integrating code vulnerability characteristics and attribute graphs | |
Li et al. | ACAGNN: Source Code Representation Based on Fine-Grained Multi-view Program Features | |
Dwivedi et al. | Applying reverse engineering techniques to analyze design patterns in source code | |
CN116707928A (en) | Threat knowledge extraction method and system combining rule matching and pre-training language model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||