CN115048316A - Semi-supervised software code defect detection method and device - Google Patents

Publication number
CN115048316A
Authority
CN
China
Prior art keywords
code
defect
codes
function
semi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210971176.6A
Other languages
Chinese (zh)
Other versions
CN115048316B (en)
Inventor
饶志宏
孙治
韩烨
毛得明
陈剑锋
和达
赵童
权赵恒
王炳文
辜彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 30 Research Institute
Original Assignee
CETC 30 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 30 Research Institute
Priority to CN202210971176.6A (CN115048316B)
Publication of CN115048316A
Application granted
Publication of CN115048316B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/36: Preventing errors by testing or debugging software
    • G06F11/3668: Software testing
    • G06F11/3672: Test management
    • G06F11/3688: Test management for test execution, e.g. scheduling of test suites
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a semi-supervised software code defect detection method and device, belonging to the field of network security. The method comprises the following steps: S1, extracting code features; S2, using the code features extracted in step S1, defining code generation rule templates based on a rule engine, and using a code generation module to generate defect codes and neutral codes that expand the code samples for machine learning training; S3, filtering the defect codes; and S4, constructing and training a model, then discriminating defect codes and outputting the detection result. The method expands the defect samples and normal samples for machine learning training, improves the accuracy of the generated samples, and can greatly improve code defect detection precision.

Description

Semi-supervised software code defect detection method and device
Technical Field
The invention relates to the field of network security, in particular to a semi-supervised software code defect detection method and device.
Background
With the wide application of open source software in many fields, the volume and complexity of software code are increasing rapidly, and security incidents caused by code defects occur frequently. How to detect code defects efficiently has become an important problem for safeguarding national and social security as well as personal information and property. Conventional code defect detection relies on manual code security audits and rule-based static code detection. The former is carried out by experts proficient in code defects; it demands strong professional skills, is labor-intensive, and has low detection efficiency. The latter analyzes the lexical structure, syntax, data flow, program control flow and the like of defective code, extracts software defect features, and converts them into defect scanning rules that are used to detect defects in program code.
In recent years, artificial intelligence technology has developed rapidly, and software defect detection based on machine learning shows broad application prospects. It combines formal inference with probabilistic inference, can use fuzzy information in software code to judge defects, avoids the constraints of rule-based detection methods, and offers strong robustness and high detection efficiency. However, because large-scale labeled defect samples from actual engineering code are lacking, current machine-learning-based detection methods have low detection precision and poor practicability on real code.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a semi-supervised software code defect detection method and device that expand the defect samples and normal samples for machine learning training, improve the accuracy of the generated samples, and greatly improve code defect detection precision.
The purpose of the invention is realized by the following scheme:
A semi-supervised software code defect detection method comprises the following steps:
S1, extracting code features;
S2, using the code features extracted in step S1, defining code generation rule templates based on a rule engine, and using a code generation module to generate defect codes and neutral codes that expand the code samples for machine learning training;
S3, filtering the defect codes;
S4, constructing and training a model, then discriminating defect codes and outputting the detection result.
Further, step S2 includes the sub-step of: defining two types of code generation rule templates in the rule engine. One type is the defect code generation rule template, whose purpose is to use the code generation module to generate, from a code function, target code with a specific defect, thereby expanding the code sample library. The other type is the neutral code generation rule template, whose purpose is to use the code generation module to generate code with the same function as the original code but a different character form, without changing the semantics of the original program, thereby expanding the code sample library.
Further, in step S2, the code generation module performs the sub-steps of:
S21, after the code features are obtained, choosing whether to generate a defect code or a neutral code; according to the category required for code generation, randomly selecting a defect generation rule or a neutral generation rule; according to the precondition for executing the selected rule, selecting a target point, namely a line of the code function at which the rule is executed; if no target point satisfies the precondition, switching to another defect generation rule or neutral generation rule; and if all rules have been tried and no target point is found, moving to the next function segment and repeating this step;
S22, executing the generation strategy at the target point line of the code function to generate a defect code of a specific defect category or a neutral code;
S23, expanding the generated defect codes or neutral codes to obtain a defect code set or a neutral code set, the neutral code set being the normal code set; labeling the defect code set with category labels according to defect category and the normal code set with normal labels; and adding both sets to the constructed code samples, which expand the defect samples and normal samples for machine learning training.
Further, step S3 includes the sub-steps of:
S31, obtaining the normal code function segments and defect code function segments constructed in step S2;
S32, inputting the normal code function segments and the defect code function segments into static code detection tools, performing static code detection, and collecting the detection result of each tool;
S33, for an input that is a normal code function segment, if the proportion of static tools that detect no defect exceeds the set threshold, the generated normal code is a correctly labeled sample and is added to the code sample library; otherwise the sample is discarded. For an input that is a defect code function segment, if the proportion of static tools that detect a defect of the specific category at a position within the set offset range of the code generation target point exceeds the set threshold, the generated defect code is a correctly labeled sample and is added to the code sample library; otherwise the sample is discarded.
Further, step S4 includes the sub-step of: constructing a fusion-feature-based deep neural code defect detection model with a parameter updating function, the model comprising a feature fusion network, a defect discrimination network and a model updating mechanism. The feature fusion network fuses code function sample features of different dimensions; the defect discrimination network classifies the fused features, identifies whether the code is defective, and outputs the defect category; the model updating mechanism realizes incremental updating of the model without affecting business operation.
Further, step S1 includes the sub-step of: slicing the source code at a certain granularity and extracting code character string features and function abstract syntax tree features.
Further, the feature fusion network takes three types of features (character level, word level and abstract syntax tree) as input, and converts text and tree structures into vector form using an embedding method.
Further, a cross-entropy loss function is used as the loss function of the fusion-feature-based deep neural code defect detection model during training.
Further, the model updating mechanism specifically includes the sub-steps of:
S41, denoting the fusion-feature-based deep neural code defect detection model working online on the i-th day as the online model, and the offline fusion-feature-based deep neural code defect detection model as the offline model;
S42, collecting the code functions with label information of the i-th day from the code sample library and using them for incremental update training of the offline model;
S43, when the traffic on the i-th day is below the set value, replacing the online model with the offline model.
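The update mechanism of S41 to S43 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Model` and `Updater` classes, the `traffic_threshold` parameter and the sample counter that stands in for real training are all hypothetical.

```python
import copy
from dataclasses import dataclass


@dataclass
class Model:
    """Hypothetical stand-in for a detector; `seen` counts training samples."""
    seen: int = 0

    def incremental_fit(self, samples):
        # A real model would run gradient updates; this sketch only counts.
        self.seen += len(samples)


@dataclass
class Updater:
    online: Model            # model serving requests (S41)
    offline: Model           # model being trained in the background (S41)
    traffic_threshold: int   # the "set value" for low traffic (assumed unit: requests per hour)

    def end_of_day(self, labelled_samples, current_traffic):
        # S42: only the offline model is trained on the day's labelled samples.
        self.offline.incremental_fit(labelled_samples)
        # S43: the swap happens only while traffic is below the set value.
        if current_traffic < self.traffic_threshold:
            self.online = copy.deepcopy(self.offline)
            return True
        return False
```

Copying the offline model before the swap keeps later offline training from changing the model that is serving requests.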
A semi-supervised software code defect detection device comprises a program instruction execution unit and a program instruction storage unit; when the program instructions are loaded and executed by the program instruction execution unit, the semi-supervised software code defect detection method described above is implemented.
The beneficial effects of the invention include:
The invention uses the large amount of existing code available on the Internet to expand the defect samples and normal samples for machine learning training, further improves the accuracy of the generated samples by combining existing static code detection tools, and can greatly improve code defect detection precision through the proposed fusion-feature-based deep neural code defect detection model.
The invention provides a new rule-based defect code generation method that makes full use of the scale of open source code and solves the problem that existing deep-learning-based models lack large-scale labeled defect samples from actual engineering code.
The invention further adopts a plurality of static code detection tools, which improves the accuracy of the constructed defect code labels.
The invention provides a fusion-feature-based deep neural code defect detection model that fuses the multi-dimensional features of a code function and improves the defect detection effect.
The invention designs a periodic incremental model updating mechanism that realizes continuous learning iteration of the deep learning model parameters without interfering with normal services.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a deployment scenario of an embodiment of the present invention;
FIG. 2 is a flow chart of the operation of a code defect generator according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating the operation of a code defect filter according to an embodiment of the present invention;
FIG. 4 is a deep learning model used by the code defect discriminator according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating steps of a method according to an embodiment of the present invention.
Detailed Description
All features disclosed in all embodiments in this specification, or all methods or process steps implicitly disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.
In seeking to solve the problems described in the background, the inventors found that semi-supervised learning can mine the features contained in data when only a small number of labels are available, and can complete model training by methods such as pseudo-label generation. The inventors therefore propose a technical scheme that applies semi-supervised learning to software defect detection, makes full use of the scale of existing code, and forms a mechanism for continuous learning iteration of the intelligent model. The scheme addresses the scarcity of defect samples from actual engineering code and improves defect detection precision on real code, which is of great significance for automatic detection of code defects.
To solve the problem of efficient and automatic detection of software code defects, in a specific implementation the embodiment of the invention also solves at least the following technical problems:
1) How to generate real software defect samples from actual engineering code, which solves the problem that machine learning lacks real training samples.
2) How the defect detection model realizes semi-supervised learning, which solves the problem of iteratively updating the weights of the intelligent model and improves its defect detection capability.
3) How the features of software code are fused, which solves the multi-dimensional representation of software code features and improves the defect detection effect.
In a specific embodiment, the invention aims at automatic detection of software code defects: when faced with an actual engineering project with complex source code, the source code can be quickly detected and analyzed, and the defect codes can be screened out. The screening result must also be highly accurate, so that the machine learning model has practical value and provides strong support for further manual inspection and analysis.
In one embodiment of the invention, semi-supervised software code defect detection makes full use of the existing code widely available on the Internet to continuously learn and iterate the detection model. It mainly comprises four steps: first, extracting code engineering features, namely slicing the source code at a certain granularity and extracting features such as code character strings and function abstract syntax trees (AST); second, using a code defect generator to expand the defect samples and normal samples for machine learning training; third, using a code defect filter to check the generated defects and improve the accuracy of the generated samples; and fourth, training a code defect discriminator and outputting the detection result.
The technical scheme provided by the embodiment of the invention mainly comprises a firewall, a code defect discriminator, a code defect filter, a code defect generator, a code feature extractor, a code generation rule base, a code feature base and a source code base, which are connected through a network to form a complete semi-supervised software code defect detection system. Because the working principle of the embodiment is independent of the specific deployment mode, it is explained using only the semi-supervised software code defect detection deployment scheme shown in fig. 1. In the deployment scenario of fig. 1, an end user uploads engineering code by requesting the defect discrimination service interface, and firewall rules filter out invalid or unauthorized access; the code defect discriminator then returns possible defect code segments, and completes autonomous learning of the defect detection model parameters in a semi-supervised manner, improving defect detection precision on real code. Specifically, the following contents are included:
In the code feature extraction process, massive existing code is used to complete semi-supervised autonomous learning of the model parameters. Many source code hosting platforms exist on the Internet, such as GitHub, Gitee and SourceForge. Subject to compliance with the applicable open source licenses (such as BSD, GPL, LGPL and MIT), existing projects in various programming languages can be acquired from these platforms and uploaded to the source code library in real time by a source code acquisition service.
A feature extraction operation is then carried out on each code project, comprising function slicing, preprocessing, code feature extraction and the like. Different slicing strategies are adopted for projects in different languages, and the function blocks produced by slicing are then preprocessed. The specific steps are as follows:
1) reading projects from the source code library and slicing them into functions, using the function segmentation method appropriate to the programming language of each project, to obtain function segments;
2) preprocessing the function segments according to the grammatical requirements of each programming language, with operations such as removing comments, blank lines and keywords;
and finally, extracting character string features and abstract syntax features according to the characteristics of the programming language, with syntax feature extraction completed by an abstract syntax tree extraction tool provided by the programming language.
The code defect generation process is shown in fig. 2. Code defect generation is the key to semi-supervised defect detection: it automatically converts a collected code function into target code with a specific defect, or into a code function with the same function but a different form. The code defect generator obtained from the code defect generation process of the embodiment internally maintains a rule-engine-based code generation module. Two types of code generation rule templates are defined in the rule engine. One is defect rule code generation, whose purpose is to generate target code with a specific defect from an existing code function. The other is neutral rule code generation, whose purpose is to generate a batch of codes with the same function as the original code but different character forms, without changing the semantics of the original program, and thereby expand the code sample library. Fig. 2 is a schematic diagram of the work flow of the code defect generator, whose working principle is as follows:
code function segments and their corresponding function features are obtained from the code feature extractor, which results from the code feature extraction process of the embodiment;
according to the category required for code generation, a defect generation rule is randomly selected; according to the precondition for executing the rule, a target point, namely a line of the code function at which the rule is executed, is selected; if no target point satisfies the precondition, another generation rule is selected; if all rules have been tried and no target point is found, the next function segment is taken and the process continues;
the generation strategy of the selected rule is executed at the target point line of the code function, generating a defect code of a specific defect category;
to expand the generated defect samples, a subset of the neutral code generation rules may be randomly selected and executed, producing a set of defect codes; similarly, a set of codes with the same function as the original function segment but different character forms may be generated;
the defect code set and the normal code set are labeled according to defect category and added to the constructed code samples.
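The rule selection loop described above can be sketched as follows. The `Rule` structure, the `generate_sample` function and the example rule are hypothetical names chosen for illustration; the target line index is zero-based here.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rule:
    label: str                            # defect category, or "normal" for a neutral rule
    precondition: Callable[[str], bool]   # does this line qualify as a target point?
    action: Callable[[str], List[str]]    # lines that replace the target line


def generate_sample(segments: List[str], rules: List[Rule],
                    rng: random.Random) -> Optional[dict]:
    for segment in segments:
        lines = segment.splitlines()
        # Try the rules in random order, each at most once per segment.
        for rule in rng.sample(rules, len(rules)):
            targets = [i for i, ln in enumerate(lines) if rule.precondition(ln)]
            if not targets:
                continue  # no target point for this rule: switch to another rule
            n = rng.choice(targets)
            mutated = lines[:n] + rule.action(lines[n]) + lines[n + 1:]
            return {"code": "\n".join(mutated), "label": rule.label, "target_line": n}
        # All rules tried without a target point: move on to the next segment.
    return None


# Hypothetical rule: print a variable right after its uninitialized declaration.
DECL = re.compile(r"^\s*int\s+(\w+)\s*;\s*$")
uninit = Rule(
    label="use-of-uninitialized-variable",
    precondition=lambda ln: DECL.match(ln) is not None,
    action=lambda ln: [ln, f'printf("%d", {DECL.match(ln).group(1)});'],
)
```

Returning the target line along with the label keeps the information that a later filtering stage needs to check where a static tool reports the defect.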
In a specific implementation, the defect rules and the code generation principle of the embodiment are described using the generation of memory-type defect codes:
for the double-free defect (Double Free), the target point is selected as follows: within the function segment, regular-expression string matching locates a memory release statement such as delete(var1) or free(var1), and its line number is taken as target line n; the rule action adds an identical memory release statement, such as delete(var1) or free(var1), immediately below the target point. For example, original code: free(var1); code after rule execution: free(var1); free(var1);
for the use-after-free defect (Use After Free), the target point is selected as follows: within the function segment, regular-expression string matching locates a memory release statement such as delete(var1) or free(var1), and the line number of the var1 = NULL statement appearing after it is taken as target line n; the rule action deletes the line at the target point and adds a memory operation such as a memory copy. For example, original code: free(var1); var1 = NULL; code after rule execution: free(var1); memcpy(var1, "test", 10);
for the uninitialized-variable defect (Use of Uninitialized Variable), the target point is selected as follows: within the function segment, regular-expression string matching locates a variable declaration such as int a or char c, and its line number is taken as target line n; the rule action adds an operation that prints the variable below the target point. For example, original code: int a; code after rule execution: int a; printf("%d", a);
for the undefined-variable defect (Use of Undefined Variable), the target point is selected as follows: within the function segment, regular-expression string matching locates a variable assignment such as a = 5 or b = 1.5, and the line number of the declaration appearing before it, such as int a or float b, is taken as target line n; the rule action deletes the line at the target point. For example, original code: int a; a = 5; code after rule execution: a = 5;
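Two of the memory-type rules can be sketched with regular expressions. This is an illustration under simplifying assumptions: each statement sits on its own line, only `free(var);` is matched (not `delete`), and line indices are zero-based. The function names are hypothetical.

```python
import re

# Matches a line consisting only of `free(<identifier>);`
FREE = re.compile(r"^(\s*)free\((\w+)\);\s*$")


def inject_double_free(lines):
    """Duplicate the first free(var); line. Returns (new_lines, target_line) or None."""
    for n, line in enumerate(lines):
        if FREE.match(line):
            return lines[:n + 1] + [line] + lines[n + 1:], n
    return None


def inject_use_after_free(lines):
    """Replace the `var = NULL;` that follows free(var); with a memcpy into var."""
    for i, line in enumerate(lines):
        m = FREE.match(line)
        if not m:
            continue
        indent, var = m.group(1), m.group(2)
        null_assign = re.compile(rf"^\s*{re.escape(var)}\s*=\s*NULL;\s*$")
        for n in range(i + 1, len(lines)):
            if null_assign.match(lines[n]):
                patched = f'{indent}memcpy({var}, "test", 10);'
                return lines[:n] + [patched] + lines[n + 1:], n
    return None
```

Both functions return `None` when the precondition is not met, which lets a caller fall through to another rule.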
In a specific implementation, some of the neutral code generation rules and the code generation principle are described as follows:
Variable renaming: a local variable is renamed to a random name that is not already used in its scope, for example: int number, rewritten as int qsfgrward.
Variable definition order swapping: variable definitions in the same line or in adjacent lines exchange positions, for example: int a, b, rewritten as int b, a.
Conditional operand swapping: the two sides of a comparison operator, for example in an if condition, are exchanged, and the operator is mirrored where necessary so that the meaning is preserved, for example: if (a == b), rewritten as if (b == a); if (a > b), rewritten as if (b < a).
Conditional branch swapping: the branches of an if-else conditional are exchanged and the condition is negated, for example: if (a > b) { a = 1; } else { b = 1; }, rewritten as if (a <= b) { b = 1; } else { a = 1; }.
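Two neutral rules can be sketched as text rewrites. Swapping the operands of a comparison preserves semantics only if the operator is mirrored (`>` becomes `<`, `<=` becomes `>=`), which the table below encodes. The function names are hypothetical, and the regular expressions assume simple identifier operands.

```python
import re

# Operand swap keeps the meaning only when the operator is mirrored.
MIRROR = {"==": "==", "!=": "!=", "<": ">", ">": "<", "<=": ">=", ">=": "<="}
COND = re.compile(r"if\s*\(\s*(\w+)\s*(==|!=|<=|>=|<|>)\s*(\w+)\s*\)")


def swap_operands(line: str) -> str:
    """Rewrite `if (a OP b)` as `if (b OP' a)` with the mirrored operator OP'."""
    return COND.sub(
        lambda m: f"if ({m.group(3)} {MIRROR[m.group(2)]} {m.group(1)})", line)


def rename_variable(code: str, old: str, new: str) -> str:
    """Rename whole-word occurrences of `old`.

    Assumption: the caller has checked that `new` is unused in the scope.
    This textual rename would also touch string literals and comments.
    """
    return re.sub(rf"\b{re.escape(old)}\b", new, code)
```

A production rule engine would more likely perform these rewrites on the syntax tree so that strings, comments and shadowed names are handled correctly.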
The code defect filtering process is shown in fig. 3. Because the code defect generation process constructs code automatically, problems such as defect codes with an incorrect category or normal codes that themselves contain defects inevitably arise. The code defect filtering designed in the embodiment filters the codes produced by the code defect generator, which ensures the accuracy of the code sample labels and indirectly improves the precision of the code defect detection method. The code defect filter obtained from the code defect filtering process of the embodiment internally uses multiple existing static code detection tools (such as CodeQL, Flawfinder and Cppcheck), and screens defect codes and normal codes from their detection results according to a majority criterion.
In a specific implementation, fig. 3 is a schematic flow chart of code defect filtering, whose working principle is as follows:
the normal code function segments and defect code function segments constructed by the code defect generator are obtained;
the normal code function segments and defect code function segments are input into the existing static code detection tools, static code detection is performed, and the detection result of each tool is collected;
for an input that is a normal code function segment, if the majority of the static tools detect no defect, the generated normal code is a correctly labeled sample and is added to the code sample library; otherwise the sample is discarded;
for an input that is a defect code function segment, if the majority of the static tools detect a defect of the specific category at a position within an offset of [-5, 5] lines of the code generation target point, the generated defect code is a correctly labeled sample and is added to the code sample library; otherwise the sample is discarded.
The code defect discriminating process is shown in fig. 4. Code defect discrimination is the core of semi-supervised defect detection. The embodiment of the invention designs a fusion-feature-based deep neural code defect detection model with a parameter updating function, which mainly comprises three parts: a feature fusion network, a defect discrimination network and a model updating mechanism. The feature fusion network fuses code function sample features of different dimensions; the defect discrimination network classifies the fused features, identifies whether the code is defective, and outputs the defect category; the model updating mechanism realizes incremental updating of the model without affecting business operation.
As shown in fig. 4, the deep learning model used by the code defect discriminator, which is obtained from the code defect discriminating process of the embodiment, comprises two parts: a feature fusion network and a defect discrimination network. The feature fusion network takes three types of features (character level, word level and abstract syntax tree) and uses embedding methods such as word2vec and node2vec to convert the text and tree structures into vector form. The following quantities are then defined for the feature fusion network: the three feature fusion networks; each layer of a feature fusion network; the input vectors of the feature fusion networks, namely the character-level feature vector, the word-level feature vector and the abstract syntax tree feature vector; the vector input to a given layer; a hyper-parameter set to 0.5 by default; the output of a given layer; the weights of a given layer; and the bias of a given layer. The activation function of the neural network is a variant of the linear rectification function (Leaky ReLU), which solves the problem that the gradient of the function becomes zero when the input is near zero or negative. The forward pass formula of the feature fusion network is:
[Formula image: forward pass of the feature fusion network]
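As a rough illustration of a Leaky ReLU layer and of fusing three feature vectors, the sketch below applies one dense layer per feature type and concatenates the results. This is not the formula of the source: fusion by concatenation, the single layer per feature and the slope of 0.01 are assumptions, and the hyper-parameter with default 0.5 mentioned in the text is not modeled because its role is not recoverable here.

```python
def leaky_relu(z, slope=0.01):
    """Leaky ReLU: identity for positive inputs, small slope otherwise."""
    return z if z > 0 else slope * z


def dense(x, weights, bias):
    """One layer: y[j] = leaky_relu(sum_k weights[j][k] * x[k] + bias[j])."""
    return [leaky_relu(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def fuse(features, layers):
    """features: [char_vec, word_vec, ast_vec]; layers: matching (weights, bias) pairs.

    Assumption: each feature type passes through its own layer and the
    outputs are concatenated into one fused vector.
    """
    fused = []
    for x, (weights, bias) in zip(features, layers):
        fused.extend(dense(x, weights, bias))
    return fused
```

Because the negative branch keeps a nonzero slope, a unit whose pre-activation is negative still receives a gradient during training.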
The defect discrimination network is defined in the same way, with symbols (given as formula images) for: each layer of the defect discrimination network; the vector input to a layer; the output of a layer; the weight of a layer; and the bias of a layer. The activation function of this network is again a variant of the linear rectification function. The output function of the defect discrimination network is the softmax function, because defect detection is a multi-class classification problem. The forward pass formulas of the defect discrimination network are then:
[Formula images: defect discrimination network forward pass and softmax output]
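The softmax output stage can be illustrated with a short Python sketch. It shows how raw layer outputs become class probabilities and how the index of the largest probability becomes the predicted class. The function names, the example scores and the class layout (0 for normal code, other indices for defect categories) are assumptions, not details taken from the patent.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits: List[float]) -> int:
    """Return the index of the largest class probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical class layout: 0 = normal, 1..n = defect categories.
logits = [0.2, 2.5, -1.0]
probs = softmax(logits)
label = predict(logits)
```

Subtracting the maximum score before exponentiating does not change the result but avoids overflow for large scores.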
One symbol (formula image) denotes the label of the training data and another denotes the output label of the defect discrimination model. The index of the maximum output value is then taken as the output of the defect discriminator. The fusion-feature-based deep neural network is trained with a cross-entropy loss function:
[Formula image: cross-entropy loss function]
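As a worked illustration of the loss, the Python sketch below computes the standard cross-entropy for one-hot labels, averaged over a batch. The patent's exact formula is an image, so the averaging, the clipping constant `eps` and the function names are assumptions.

```python
import math
from typing import List


def cross_entropy(probs: List[float], target: int, eps: float = 1e-12) -> float:
    """Cross-entropy for one sample with a one-hot label at index `target`."""
    # Clip to eps so that a predicted probability of 0 does not give log(0).
    return -math.log(max(probs[target], eps))


def mean_cross_entropy(batch_probs: List[List[float]], targets: List[int]) -> float:
    """Average the per-sample losses over a batch."""
    losses = [cross_entropy(p, t) for p, t in zip(batch_probs, targets)]
    return sum(losses) / len(losses)


# Two toy predictions over three classes, with their true class indices.
batch_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
targets = [0, 1]
loss = mean_cross_entropy(batch_probs, targets)
```

The loss falls toward zero as the probability assigned to the true class approaches 1, which is what drives training toward correct defect categories.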
To enable continuous learning iteration of the deep learning model parameters, the scheme adopts an offline incremental model update mechanism, which works as follows:
1) Define the defect detection discrimination model working online on day i as [formula image] and the offline defect detection discrimination model as [formula image].
2) Collect the code functions with label information for day i from the code sample library and use them for incremental update training of the offline model.
3) At a time on day i when traffic is low (e.g. 11 o'clock at night), replace the online model with the offline model:
[Formula image: the online model is replaced by the offline model]
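The update cycle can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `DefectModel` is a hypothetical stand-in that only records the samples it has seen, and `ModelUpdater`, `train_offline` and `swap` are invented names. A real system would run gradient updates in `incremental_fit` and trigger `swap` from a scheduler during a low-traffic window.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Tuple

Sample = Tuple[str, int]  # (code function text, label)


@dataclass
class DefectModel:
    """Stand-in for the defect discrimination model (hypothetical)."""
    version: int = 0
    seen: List[Sample] = field(default_factory=list)

    def incremental_fit(self, samples: List[Sample]) -> None:
        # A real model would perform incremental training here.
        self.seen.extend(samples)
        self.version += 1


class ModelUpdater:
    """Keeps an online model serving requests and an offline copy for training."""

    def __init__(self, model: DefectModel) -> None:
        self.online = model
        self.offline = copy.deepcopy(model)

    def train_offline(self, samples: List[Sample]) -> None:
        """Step 2: train only the offline model on the day's labelled samples."""
        self.offline.incremental_fit(samples)

    def swap(self) -> None:
        """Step 3: promote a copy of the offline model to online."""
        self.online = copy.deepcopy(self.offline)


updater = ModelUpdater(DefectModel())
updater.train_offline([("int f(int n) { return n; }", 0)])
version_before_swap = updater.online.version
updater.swap()
version_after_swap = updater.online.version
```

Copying the offline model on promotion keeps the two instances independent, so the next day's training never touches the model that is serving requests.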
As shown in fig. 5, the embodiment of the present invention uses the large body of existing code on the internet to expand the defect samples and normal samples for machine learning training, uses existing static code detection tools to further improve the accuracy of the generated samples, and improves code defect detection accuracy through the fusion-feature-based deep neural code defect detection model. Compared with existing schemes, it has the following beneficial effects and advantages:
1) A rule-based defect code generation method is designed that makes full use of the scale of existing code and addresses the lack of large-scale labelled defect samples from real engineering code for existing deep-learning-based models;
2) Multiple existing static code detection tools are combined to produce a new technical effect, namely improved accuracy of the constructed defect code labels;
3) A fusion-feature-based deep neural code defect detection model is designed that fuses the multi-dimensional features of a code function and improves the defect detection effect;
4) A periodic incremental model update mechanism is designed, so that the parameters of the deep learning model can be iterated through continuous learning without interfering with normal services.
Example 1
A semi-supervised software code defect detection method comprises the following steps:
S1, extracting code features;
S2, using the code features extracted in step S1, defining code generation rule templates based on a rule engine, and using a code generation module to generate defect codes and neutral codes for expanding the code samples for machine learning training;
S3, filtering the defect codes;
and S4, constructing and training a model, then discriminating defect codes and outputting the detection result.
Example 2
On the basis of embodiment 1, step S2 includes the sub-steps of: defining two types of code generation rule templates in the rule engine. One type is the defect code generation rule template, whose purpose is to use the code generation module to generate, from a code function, target code with a specific defect, thereby expanding the code sample library. The other type is the neutral code generation rule template, whose purpose is to use the code generation module to generate code that has the same function as the original code but a different textual form, without changing the semantics of the original program, likewise expanding the code sample library.
Example 3
On the basis of embodiment 2, in step S2, the code generation module includes the sub-steps of:
S21, after the code features are obtained, choosing whether to generate a defect code or a neutral code; randomly selecting a defect generation rule or a neutral generation rule according to the required category; according to the precondition of the selected rule, selecting a line of the code function as the target point at which the rule is executed; if the precondition is not met, switching to another defect generation rule or neutral generation rule; and if all rules have been tried and no target point is found, moving on to the next function segment and repeating this step;
S22, executing the generation strategy on the target line of the code function to generate a defect code of a specific defect category or a neutral code;
S23, collecting the generated defect codes of specific defect categories and neutral codes into a defect code set and a neutral code set, the neutral code set being the normal code set; labelling the defect code set with category labels according to defect category and the normal code set with the normal label; and adding both sets to the constructed code samples to expand the defect samples and normal samples for machine learning training.
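Steps S21 to S23 can be sketched as a small rule loop in Python. The two rules shown (an off-by-one loop bound as the defect rule, and rewriting `i++` as `i += 1` as the neutral rule), the `Rule` structure, the label strings and the C-style sample function are all hypothetical; the patent does not list its concrete rules.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    label: str                            # "normal" for neutral rules
    precondition: Callable[[str], bool]   # can the rule fire on this line?
    transform: Callable[[str], str]       # generation strategy for the line


# Hypothetical rules, for illustration only.
DEFECT_RULES = [
    Rule("loop_bound_off_by_one", "off_by_one",
         lambda line: line.lstrip().startswith("for") and " < " in line,
         lambda line: line.replace(" < ", " <= ", 1)),
]
NEUTRAL_RULES = [
    Rule("expand_increment", "normal",
         lambda line: "i++" in line,
         lambda line: line.replace("i++", "i += 1", 1)),
]


def generate(lines: List[str], rules: List[Rule],
             rng: random.Random) -> Optional[Tuple[List[str], str, int]]:
    """Try rules in random order; return (new_lines, label, target_index) or None."""
    for rule in rng.sample(rules, len(rules)):
        for index, line in enumerate(lines):
            if rule.precondition(line):
                mutated = list(lines)
                mutated[index] = rule.transform(line)
                return mutated, rule.label, index
    return None  # no target point found: the caller moves to the next function


function = [
    "int sum(int *a, int n) {",
    "    int s = 0;",
    "    for (int i = 0; i < n; i++)",
    "        s += a[i];",
    "    return s;",
    "}",
]
rng = random.Random(0)
defect_sample = generate(function, DEFECT_RULES, rng)
neutral_sample = generate(function, NEUTRAL_RULES, rng)
```

Returning the target index alongside the label matters for the later filtering step, which compares the position reported by static tools with the position where the defect was injected.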
Example 4
On the basis of embodiment 1, in step S3, the method includes the sub-steps of:
S31, obtaining the constructed normal code function segments and defect code function segments from step S2;
S32, inputting the normal code function segments and defect code function segments into static code detection tools, performing static code detection and collecting the detection results of each tool;
S33, for an input normal code function segment, if more than a set threshold of the static tool results report no defect, the generated normal code is a correctly labelled sample and is added to the code sample library; otherwise the sample is discarded. For an input defect code function segment, if more than a set threshold of the static tool results report a defect of the specific category at a position within a set offset range of the code generation target point, the generated defect code is a correctly labelled sample and is added to the code sample library; otherwise the sample is discarded.
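The filtering rule in S33 can be read as a vote across tools, which the following Python sketch illustrates. The static tools are modelled as plain callables that return findings; `bound_checker` and `silent_tool` are toy stand-ins, not real analysers, and the threshold of 0.5 and the offset of 2 lines are assumed values.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Finding:
    line: int        # zero-based line index reported by a tool
    category: str    # defect category reported by a tool


Tool = Callable[[Sequence[str]], List[Finding]]


def keep_normal(code: Sequence[str], tools: List[Tool],
                min_agree: float = 0.5) -> bool:
    """Keep a 'normal' sample if more than `min_agree` of tools report nothing."""
    clean = sum(1 for tool in tools if not tool(code))
    return clean / len(tools) > min_agree


def keep_defect(code: Sequence[str], tools: List[Tool], target_line: int,
                category: str, max_offset: int = 2,
                min_agree: float = 0.5) -> bool:
    """Keep a 'defect' sample if enough tools report the category near the target."""
    hits = sum(
        1 for tool in tools
        if any(f.category == category and abs(f.line - target_line) <= max_offset
               for f in tool(code))
    )
    return hits / len(tools) > min_agree


def bound_checker(code: Sequence[str]) -> List[Finding]:
    """Toy tool: flags inclusive loop bounds as possible off-by-one defects."""
    return [Finding(i, "off_by_one") for i, line in enumerate(code)
            if "for" in line and "<=" in line]


def silent_tool(code: Sequence[str]) -> List[Finding]:
    """Toy tool that never reports anything."""
    return []


tools = [bound_checker, bound_checker, silent_tool]
normal_code = ["for (int i = 0; i < n; i++)", "    s += a[i];"]
defect_code = ["for (int i = 0; i <= n; i++)", "    s += a[i];"]

normal_kept = keep_normal(normal_code, tools)
defect_kept = keep_defect(defect_code, tools, target_line=0, category="off_by_one")
```

Requiring agreement from several tools, plus a position check against the injection point, is what lets the generated labels be trusted more than the output of any single tool.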
Example 5
On the basis of embodiment 1, step S4 includes the sub-steps of: constructing a fusion-feature-based deep neural code defect detection model with a parameter update function, the model comprising a feature fusion network, a defect discrimination network and a model update mechanism; the feature fusion network is used for fusing code function sample features of different dimensions; the defect discrimination network is used for classifying the fused features, identifying whether the code is defective and then outputting the defect category; the model update mechanism is used for updating the model incrementally without affecting business operations.
Example 6
On the basis of embodiment 1, step S1 includes the sub-steps of: slicing the source code at a certain granularity, and extracting the features of the code character strings and the function abstract syntax trees.
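As an illustration of this step, the Python sketch below slices source code at function granularity and extracts character-level, token-level and abstract syntax tree features. It assumes the analysed code is Python so that the standard `ast` and `tokenize` modules can be used; the patent does not name a target language or a parser, and the function names and the feature dictionary layout are hypothetical.

```python
import ast
import io
import tokenize
from typing import Dict, List


def slice_functions(source: str) -> List[str]:
    """Slice a module into function-level segments."""
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]


def extract_features(function_source: str) -> Dict[str, List[str]]:
    """Return character-level, token-level and AST node-type features."""
    chars = list(function_source)
    wanted = (tokenize.NAME, tokenize.OP, tokenize.NUMBER, tokenize.STRING)
    tokens = [
        tok.string
        for tok in tokenize.generate_tokens(io.StringIO(function_source).readline)
        if tok.type in wanted
    ]
    ast_nodes = [type(node).__name__
                 for node in ast.walk(ast.parse(function_source))]
    return {"chars": chars, "tokens": tokens, "ast_nodes": ast_nodes}


SAMPLE = '''
def add(a, b):
    return a + b

def first(items):
    return items[0]
'''

functions = slice_functions(SAMPLE)
features = extract_features(functions[0])
```

The three feature lists correspond to the three inputs of the feature fusion network; in a full pipeline each list would next be mapped to vectors by an embedding method.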
Example 7
On the basis of the embodiment 5, the feature fusion network takes three types of features of character level, word level and abstract syntax tree as input, and converts text and tree structure into vector form by using an embedding method.
Example 8
On the basis of the embodiment 5, the loss function of the deep neural code defect detection model based on the fusion features in the training process adopts a cross entropy loss function.
Example 9
On the basis of embodiment 5, the model update mechanism specifically includes the sub-steps of:
S41, defining the fusion-feature-based deep neural code defect detection model working online on day i as [formula image] and the offline fusion-feature-based deep neural code defect detection model as [formula image];
S42, collecting the code functions with label information for day i from the code sample library, for incremental update training of the offline model;
S43, when the traffic on day i is less than the set value, replacing the online model with the offline model:
[Formula image: the online model is replaced by the offline model]
Example 10
A semi-supervised software code defect detecting apparatus includes a program instruction executing unit and a program instruction storing unit, and when a program instruction is loaded and executed by the program instruction executing unit, the semi-supervised software code defect detecting method as described in any one of embodiments 1 to 9 is performed.
The units described in the embodiments of the present invention may be implemented by software or by hardware, and the described units may also be disposed in a processor. The names of the units do not in any way limit the units themselves.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations described above.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method described in the above embodiments.
The parts not involved in the present invention are the same as or can be implemented using the prior art.
The above-described embodiment is only one embodiment of the present invention, and it will be apparent to those skilled in the art that various modifications and variations can be easily made based on the application and principle of the present invention disclosed in the present application, and the present invention is not limited to the method described in the above-described embodiment of the present invention, so that the above-described embodiment is only preferred, and not restrictive.
Other embodiments than the above examples may be devised by those skilled in the art based on the foregoing disclosure, or by adapting and using knowledge or techniques of the relevant art, and features of various embodiments may be interchanged or substituted and such modifications and variations that may be made by those skilled in the art without departing from the spirit and scope of the present invention are intended to be within the scope of the following claims.

Claims (10)

1. A semi-supervised software code defect detection method is characterized by comprising the following steps:
S1, extracting code features;
S2, using the code features extracted in step S1, defining code generation rule templates based on a rule engine, and using a code generation module to generate defect codes and neutral codes for expanding the code samples for machine learning training;
S3, filtering the defect codes;
and S4, constructing and training a model, then discriminating defect codes and outputting the detection result.
2. The semi-supervised software code defect detection method according to claim 1, wherein step S2 comprises the sub-steps of: defining two types of code generation rule templates in the rule engine, wherein one type is the defect code generation rule template, whose purpose is to use the code generation module to generate, from a code function, target code with a specific defect, thereby expanding the code sample library; and the other type is the neutral code generation rule template, whose purpose is to use the code generation module to generate code that has the same function as the original code but a different textual form, without changing the semantics of the original program, likewise expanding the code sample library.
3. The semi-supervised software code defect detection method of claim 2, wherein in step S2, the code generation module comprises the sub-steps of:
S21, after the code features are obtained, choosing whether to generate a defect code or a neutral code; randomly selecting a defect generation rule or a neutral generation rule according to the required category; according to the precondition of the selected rule, selecting a line of the code function as the target point at which the rule is executed; if the precondition is not met, switching to another defect generation rule or neutral generation rule; and if all rules have been tried and no target point is found, moving on to the next function segment and repeating this step;
S22, executing the generation strategy on the target line of the code function to generate a defect code of a specific defect category or a neutral code;
S23, collecting the generated defect codes of specific defect categories and neutral codes into a defect code set and a neutral code set, the neutral code set being the normal code set; labelling the defect code set with category labels according to defect category and the normal code set with the normal label; and adding both sets to the constructed code samples to expand the defect samples and normal samples for machine learning training.
4. The semi-supervised software code defect detection method according to claim 1, wherein step S3 comprises the sub-steps of:
S31, obtaining the constructed normal code function segments and defect code function segments from step S2;
S32, inputting the normal code function segments and defect code function segments into static code detection tools, performing static code detection and collecting the detection results of each tool;
S33, for an input normal code function segment, if more than a set threshold of the static tool results report no defect, the generated normal code is a correctly labelled sample and is added to the code sample library, otherwise the sample is discarded; and for an input defect code function segment, if more than a set threshold of the static tool results report a defect of the specific category at a position within a set offset range of the code generation target point, the generated defect code is a correctly labelled sample and is added to the code sample library, otherwise the sample is discarded.
5. The semi-supervised software code defect detection method according to claim 1, wherein step S4 comprises the sub-steps of: constructing a fusion-feature-based deep neural code defect detection model with a parameter update function, the model comprising a feature fusion network, a defect discrimination network and a model update mechanism; the feature fusion network is used for fusing code function sample features of different dimensions; the defect discrimination network is used for classifying the fused features, identifying whether the code is defective and then outputting the defect category; the model update mechanism is used for updating the model incrementally without affecting business operations.
6. The semi-supervised software code defect detection method according to claim 1, wherein step S1 comprises the sub-steps of: slicing the source code at a certain granularity, and extracting the features of the code character strings and the function abstract syntax trees.
7. The semi-supervised software code defect detection method of claim 5, wherein the feature fusion network takes three types of features of character level, word level and abstract syntax tree as input, and converts text and tree structures into vector form by using an embedding method.
8. The semi-supervised software code defect detection method of claim 5, wherein the loss function of the fusion feature-based deep neural code defect detection model in the training process adopts a cross entropy loss function.
9. The semi-supervised software code defect detection method as recited in claim 5, wherein the model update mechanism specifically comprises the sub-steps of:
S41, defining the fusion-feature-based deep neural code defect detection model working online on day i as [formula image] and the offline fusion-feature-based deep neural code defect detection model as [formula image];
S42, collecting the code functions with label information for day i from the code sample library, for incremental update training of the offline model;
S43, when the traffic on day i is less than the set value, replacing the online model with the offline model:
[Formula image: the online model is replaced by the offline model]
10. A semi-supervised software code defect detecting device, comprising a program instruction executing unit and a program instruction storing unit, wherein when the program instruction is loaded and executed by the program instruction executing unit, the semi-supervised software code defect detecting method according to any one of claims 1 to 9 is performed.
CN202210971176.6A 2022-08-15 2022-08-15 Semi-supervised software code defect detection method and device Active CN115048316B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210971176.6A CN115048316B (en) 2022-08-15 2022-08-15 Semi-supervised software code defect detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210971176.6A CN115048316B (en) 2022-08-15 2022-08-15 Semi-supervised software code defect detection method and device

Publications (2)

Publication Number Publication Date
CN115048316A true CN115048316A (en) 2022-09-13
CN115048316B CN115048316B (en) 2022-12-09

Family

ID=83166588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210971176.6A Active CN115048316B (en) 2022-08-15 2022-08-15 Semi-supervised software code defect detection method and device

Country Status (1)

Country Link
CN (1) CN115048316B (en)

Also Published As

Publication number Publication date
CN115048316B (en) 2022-12-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant