CN115454855B - Code defect report auditing method, device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115454855B
Authority
CN
China
Prior art keywords
vector
code
defect
code slice
slice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211130348.3A
Other languages
Chinese (zh)
Other versions
CN115454855A (en
Inventor
张萍
刘晓玲
马涛
梁冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202211130348.3A priority Critical patent/CN115454855B/en
Publication of CN115454855A publication Critical patent/CN115454855A/en
Application granted granted Critical
Publication of CN115454855B publication Critical patent/CN115454855B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G06F11/3608 Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Investigating Materials By The Use Of Optical Means Adapted For Particular Applications (AREA)

Abstract

The disclosure provides a code defect report auditing method, a device, an electronic device, and a storage medium, and relates to the technical field of the Internet. The method comprises: obtaining a defect report to be audited that is output by a source code scanning tool, wherein the defect report to be audited comprises a code slice and a defect type of the code slice; selecting, from a defect database, a target code slice sample having the same defect type as the code slice; converting the code slice and the target code slice sample into feature vectors of different vector categories to obtain a first vector set and a second vector set, respectively; using trained twin (Siamese) networks to obtain a vector distance for each pair of feature vectors of the same vector category in the first vector set and the second vector set, thereby obtaining a vector distance set; and determining an audit result of the defect report to be audited according to the vector distance set. The method and the device improve the accuracy and efficiency of code defect report auditing.

Description

Code defect report auditing method, device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of internet, and in particular relates to a code defect report auditing method, a device, electronic equipment and a storage medium.
Background
A source code defect scanning tool generally checks source code for defects by means of data flow, control flow, and semantic analysis, based on vulnerability characteristics and scanning rules defined by experts, and such tools generally have a high false alarm rate. Reference evaluation data for such tools published by OWASP (Open Web Application Security Project) show that the false alarm rate of some high-detection-rate tools even exceeds 50%, so it is necessary to audit the defect reports output by a source code defect scanning tool to reject false alarms.
Manual auditing of code defect reports often depends on the professional knowledge of auditors, requires a high investment of manpower, and is inefficient. Some existing automated methods simply filter out low-risk problems or filter defects through a preset whitelist, without substantively auditing the defects.
Based on this, how to improve the auditing efficiency of code defect reports has become a technical problem to be solved.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The disclosure provides a code defect report auditing method, a device, an electronic device and a storage medium, which at least overcome the problem of low defect report auditing efficiency in the related technology to a certain extent.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to one aspect of the present disclosure, there is provided a code defect report auditing method, including: obtaining a defect report to be audited output by a source code scanning tool, wherein the defect report to be audited comprises a code slice and a defect type of the code slice, and the defect type is the defect result identified for the code slice by the source code scanning tool; selecting, from a defect database according to the defect type of the code slice, a target code slice sample having the same defect type as the code slice, wherein the defect database stores triplet information of a plurality of audited code slice samples, and the triplet information is <defect type, code slice sample, audit label>; converting the code slice into feature vectors of different vector categories to obtain a first vector set, and converting the target code slice sample into feature vectors of different vector categories to obtain a second vector set; according to the mapping relation between the vector category of a feature vector and the category of a twin network, inputting each pair of feature vectors of the same vector category in the first vector set and the second vector set into the corresponding trained twin network, to obtain a vector distance set consisting of the vector distances output by a plurality of trained twin networks; and determining an audit result of the defect report to be audited according to the vector distance set, wherein the audit result is defect or false alarm.
In one embodiment of the disclosure, converting the code slice into feature vectors of different vector categories to obtain a first vector set, and converting the target code slice sample into feature vectors of different vector categories to obtain a second vector set, comprises: converting the code slice into a sequence vector, a tree vector, and a graph vector, respectively, to obtain the first vector set; and converting the target code slice sample into a sequence vector, a tree vector, and a graph vector, respectively, to obtain the second vector set.
In one embodiment of the present disclosure, inputting each pair of feature vectors of the same vector category in the first vector set and the second vector set into the corresponding trained twin network according to the mapping relation between the vector category of a feature vector and the category of a twin network, to obtain a vector distance set consisting of the vector distances output by a plurality of trained twin networks, includes: inputting a first feature vector in the first vector set into a first neural network of a trained first twin network, and inputting a second feature vector in the second vector set into a second neural network of the trained first twin network, to obtain a first vector distance output by the trained first twin network, wherein the first feature vector and the second feature vector have the same vector category, and the trained first twin network is determined by the vector category of the first feature vector or the second feature vector; inputting a third feature vector in the first vector set into a first neural network of a trained second twin network, and inputting a fourth feature vector in the second vector set into a second neural network of the trained second twin network, to obtain a second vector distance output by the trained second twin network, wherein the third feature vector and the fourth feature vector have the same vector category, and the trained second twin network is determined by the vector category of the third feature vector or the fourth feature vector; and determining the vector distance set based on the first vector distance and the second vector distance.
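The routing of same-category vector pairs to their networks can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the `SiameseNetwork` class, the toy linear encoder standing in for a trained network, the Euclidean distance, and the category keys `"sequence"`, `"tree"`, and `"graph"` are all hypothetical.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


class SiameseNetwork:
    """Two branches that share one encoder and output an embedding distance."""

    def __init__(self, encoder: Callable[[Vector], Vector]):
        # Both branches use the same encoder, i.e. shared weights.
        self.encoder = encoder

    def distance(self, a: Vector, b: Vector) -> float:
        ea, eb = self.encoder(a), self.encoder(b)
        # Euclidean distance between embeddings (an assumed metric).
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(ea, eb)))


def make_linear_encoder(scale: float) -> Callable[[Vector], Vector]:
    """Hypothetical stand-in for a trained encoder."""
    return lambda v: [scale * x for x in v]


def vector_distance_set(
    first_set: Dict[str, Vector],
    second_set: Dict[str, Vector],
    networks: Dict[str, SiameseNetwork],
) -> Dict[str, float]:
    """Send each same-category pair to the network mapped to that category."""
    return {
        category: networks[category].distance(
            first_set[category], second_set[category]
        )
        for category in first_set
    }


# One network per vector category (the mapping relation).
networks = {
    "sequence": SiameseNetwork(make_linear_encoder(1.0)),
    "tree": SiameseNetwork(make_linear_encoder(0.5)),
    "graph": SiameseNetwork(make_linear_encoder(2.0)),
}
first_set = {"sequence": [1.0, 2.0], "tree": [0.0, 4.0], "graph": [1.0, 1.0]}
second_set = {"sequence": [4.0, 6.0], "tree": [0.0, 0.0], "graph": [1.0, 1.0]}
distances = vector_distance_set(first_set, second_set, networks)
```

Keying both vector sets and the networks by category name keeps the pairing explicit: a sequence vector can only ever be compared with another sequence vector, by the network built for that category.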
In one embodiment of the disclosure, determining an audit result of the defect report to be audited according to the vector distance set includes: calculating code similarity between the code slice and the target code slice sample based on the set of vector distances; and determining an auditing result of the defect report to be audited based on the code similarity.
In one embodiment of the present disclosure, calculating the code similarity between the code slice and the target code slice sample based on the vector distance set comprises: calculating the vector similarity corresponding to each vector distance in the vector distance set to obtain a plurality of vector similarities; and calculating a weighted average of the plurality of vector similarities to obtain the code similarity between the code slice and the target code slice sample.
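A minimal sketch of this fusion step is given below. The text here does not fix how a distance is mapped to a similarity, so the mapping `1 / (1 + d)` and the weight values are assumptions chosen only for illustration; the function names are hypothetical.

```python
from typing import Dict, Optional


def distance_to_similarity(distance: float) -> float:
    """Map a non-negative distance into (0, 1]; this mapping is an assumption."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


def code_similarity(
    distances: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted average of the per-category vector similarities."""
    if not distances:
        raise ValueError("vector distance set is empty")
    if weights is None:
        # Default to equal weights when none are supplied.
        weights = {category: 1.0 for category in distances}
    total_weight = sum(weights[category] for category in distances)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(
        weights[category] * distance_to_similarity(d)
        for category, d in distances.items()
    )
    return weighted_sum / total_weight


distance_set = {"sequence": 0.0, "tree": 1.0, "graph": 3.0}
similarity = code_similarity(
    distance_set, {"sequence": 2.0, "tree": 1.0, "graph": 1.0}
)
```

With these illustrative inputs the per-category similarities are 1.0, 0.5, and 0.25, and the weighted average is (2.0 + 0.5 + 0.25) / 4 = 0.6875.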
In one embodiment of the present disclosure, the defect database is obtained by: obtaining a plurality of audited defect reports, wherein each audited defect report comprises a code slice sample, a defect type of the code slice sample, and an audit label; converting the plurality of audited defect reports into a triplet format to obtain triplet information of a plurality of code slice samples, wherein the triplet information is <defect type, code slice sample, audit label>; and storing the triplet information of the code slice samples into a database to obtain the defect database.
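The construction steps above can be sketched with an in-memory structure. The `Triplet` class, the dictionary field names of the input reports, and the choice to index triplets by defect type are assumptions made for illustration; the label convention (0 for defect, 1 for false alarm) follows the description.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Triplet:
    defect_type: str
    code_slice: str
    audit_label: int  # 0 = defect, 1 = false alarm


def build_defect_database(
    audited_reports: Iterable[dict],
) -> Dict[str, List[Triplet]]:
    """Convert audited reports into triplets, indexed by defect type."""
    database: Dict[str, List[Triplet]] = defaultdict(list)
    for report in audited_reports:
        label = report["audit_label"]
        if label not in (0, 1):
            raise ValueError(f"unexpected audit label: {label!r}")
        triplet = Triplet(report["defect_type"], report["code_slice"], label)
        database[triplet.defect_type].append(triplet)
    return dict(database)


# Hypothetical audited reports; the field names are assumptions.
reports = [
    {"defect_type": "SQL injection", "code_slice": "q = base + user", "audit_label": 0},
    {"defect_type": "SQL injection", "code_slice": "q = base + str(7)", "audit_label": 1},
    {"defect_type": "command injection", "code_slice": "run(cmd)", "audit_label": 0},
]
defect_db = build_defect_database(reports)
```

Indexing by defect type is a design convenience here: the later selection step only ever asks for samples of one defect type at a time.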
In one embodiment of the present disclosure, a trained twin network is obtained by the following steps: acquiring triplet information of a first code slice sample and triplet information of a second code slice sample from the defect database, wherein the first code slice sample and the second code slice sample have the same defect type; converting the first code slice sample and the second code slice sample into feature vectors of different vector categories, respectively, to obtain a first sample vector set and a second sample vector set; according to the mapping relation between the vector category of a feature vector and the category of a twin network, inputting each pair of feature vectors of the same vector category in the first sample vector set and the second sample vector set into the corresponding twin network to be trained, to obtain a target vector distance output by the twin network to be trained; and calculating a contrastive loss function value based on the target vector distance, and adjusting the model parameters of the twin network to be trained according to the contrastive loss function value to obtain the trained twin network.
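For orientation, a commonly used form of the contrastive loss is sketched below. The text here does not give the exact formula, so the margin-based form, the margin value of 1.0, and the rule that a pair counts as similar when both samples carry the same audit label are assumptions, not details confirmed by the disclosure.

```python
from typing import List, Tuple


def contrastive_loss(distance: float, similar: bool, margin: float = 1.0) -> float:
    """Margin-based contrastive loss for a single pair (assumed form)."""
    if similar:
        # Similar pairs are pulled together: the loss grows with distance.
        return 0.5 * distance ** 2
    # Dissimilar pairs are pushed apart until they are at least `margin` away.
    return 0.5 * max(0.0, margin - distance) ** 2


def mean_contrastive_loss(
    pairs: List[Tuple[float, int, int]], margin: float = 1.0
) -> float:
    """Average loss over (distance, audit_label_a, audit_label_b) pairs."""
    if not pairs:
        raise ValueError("no training pairs given")
    losses = [
        # Assumption: equal audit labels mark a similar pair.
        contrastive_loss(distance, label_a == label_b, margin)
        for distance, label_a, label_b in pairs
    ]
    return sum(losses) / len(losses)


# Hypothetical distances produced by a twin network being trained.
batch = [(0.5, 0, 0), (0.5, 0, 1), (2.0, 1, 0)]
batch_loss = mean_contrastive_loss(batch)
```

In an actual training loop this value would be backpropagated through the shared encoder; that part depends on the chosen deep learning framework and is omitted here.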
According to another aspect of the present disclosure, there is provided a code defect report auditing apparatus, comprising: a defect report acquisition module, configured to obtain a defect report to be audited output by a source code scanning tool, wherein the defect report to be audited comprises a code slice and a defect type of the code slice, and the defect type is the defect result identified for the code slice by the source code scanning tool; a slice sample selection module, configured to select, from a defect database according to the defect type of the code slice, a target code slice sample having the same defect type as the code slice, wherein the defect database stores triplet information of a plurality of audited code slice samples, and the triplet information is <defect type, code slice sample, audit label>; a feature vector conversion module, configured to convert the code slice into feature vectors of different vector categories to obtain a first vector set, and to convert the target code slice sample into feature vectors of different vector categories to obtain a second vector set; a vector distance set acquisition module, configured to input each pair of feature vectors of the same vector category in the first vector set and the second vector set into the corresponding trained twin network according to the mapping relation between the vector category of a feature vector and the category of a twin network, to obtain a vector distance set consisting of the vector distances output by a plurality of trained twin networks; and an audit result determining module, configured to determine the audit result of the defect report to be audited according to the vector distance set, wherein the audit result is defect or false alarm.
In an embodiment of the present disclosure, the above feature vector conversion module is further configured to convert the code slice into a sequence vector, a tree vector, and a graph vector, respectively, to obtain the first vector set; and to convert the target code slice sample into a sequence vector, a tree vector, and a graph vector, respectively, to obtain the second vector set.
In an embodiment of the present disclosure, the vector distance set acquisition module is further configured to: input a first feature vector in the first vector set into a first neural network of a trained first twin network, and input a second feature vector in the second vector set into a second neural network of the trained first twin network, to obtain a first vector distance output by the trained first twin network, wherein the first feature vector and the second feature vector have the same vector category, and the trained first twin network is determined by the vector category of the first feature vector or the second feature vector; input a third feature vector in the first vector set into a first neural network of a trained second twin network, and input a fourth feature vector in the second vector set into a second neural network of the trained second twin network, to obtain a second vector distance output by the trained second twin network, wherein the third feature vector and the fourth feature vector have the same vector category, and the trained second twin network is determined by the vector category of the third feature vector or the fourth feature vector; and determine the vector distance set based on the first vector distance and the second vector distance.
In one embodiment of the disclosure, the above audit result determining module is further configured to calculate a code similarity between the code slice and the target code slice sample based on the vector distance set; and determining an auditing result of the defect report to be audited based on the code similarity.
In an embodiment of the present disclosure, the above audit result determining module is further configured to calculate the vector similarity corresponding to each vector distance in the vector distance set to obtain a plurality of vector similarities; and to calculate a weighted average of the plurality of vector similarities to obtain the code similarity between the code slice and the target code slice sample.
In one embodiment of the disclosure, the apparatus further includes a database construction module, configured to obtain the defect database by: obtaining a plurality of audited defect reports, wherein each audited defect report comprises a code slice sample, a defect type of the code slice sample, and an audit label; converting the plurality of audited defect reports into a triplet format to obtain triplet information of a plurality of code slice samples, wherein the triplet information is <defect type, code slice sample, audit label>; and storing the triplet information of the code slice samples into a database to obtain the defect database.
In one embodiment of the disclosure, the apparatus further includes a twin network training module, configured to obtain a trained twin network by: acquiring triplet information of a first code slice sample and triplet information of a second code slice sample from the defect database, wherein the first code slice sample and the second code slice sample have the same defect type; converting the first code slice sample and the second code slice sample into feature vectors of different vector categories, respectively, to obtain a first sample vector set and a second sample vector set; according to the mapping relation between the vector category of a feature vector and the category of a twin network, inputting each pair of feature vectors of the same vector category in the first sample vector set and the second sample vector set into the corresponding twin network to be trained, to obtain a target vector distance output by the twin network to be trained; and calculating a contrastive loss function value based on the target vector distance, and adjusting the model parameters of the twin network to be trained according to the contrastive loss function value to obtain the trained twin network.
According to still another aspect of the present disclosure, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the code defect report auditing method described above via execution of the executable instructions.
According to yet another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the code defect report auditing method described above.
The embodiments of the disclosure provide a code defect report auditing method, a device, an electronic device, and a storage medium, wherein the method comprises: obtaining a defect report to be audited output by a source code scanning tool, wherein the defect report to be audited comprises a code slice and a defect type of the code slice; selecting, from a defect database according to the defect type of the code slice, a target code slice sample having the same defect type as the code slice; converting the code slice into feature vectors of different vector categories to obtain a first vector set, and converting the target code slice sample into feature vectors of different vector categories to obtain a second vector set; according to the mapping relation between the vector category of a feature vector and the category of a twin network, inputting each pair of feature vectors of the same vector category in the first vector set and the second vector set into the corresponding trained twin network, to obtain a vector distance set consisting of the vector distances output by a plurality of trained twin networks; and determining an audit result of the defect report to be audited according to the vector distance set. The method and the device improve the accuracy and efficiency of code defect report auditing.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 is a schematic diagram of a source code defect detection method according to an embodiment of the disclosure;
FIG. 2 illustrates a code defect report auditing method flow diagram in an embodiment of the present disclosure;
FIG. 3 illustrates a code slice acquisition method in an embodiment of the present disclosure;
FIG. 4 illustrates another code defect report auditing method flow diagram in an embodiment of the present disclosure;
FIG. 5 illustrates a code defect report auditing method, in accordance with an embodiment of the present disclosure;
FIG. 6 illustrates another code defect report auditing method schematic diagram in an embodiment of the present disclosure;
FIG. 7 illustrates another code defect report auditing method flow chart in an embodiment of the present disclosure;
FIG. 8 illustrates another code defect report auditing method flow chart in an embodiment of the present disclosure;
FIG. 9 illustrates another code defect report auditing method schematic diagram in an embodiment of the present disclosure;
FIG. 10 illustrates another code defect report auditing method schematic in an embodiment of the present disclosure;
FIG. 11 illustrates a code defect report auditing apparatus, in accordance with an embodiment of the present disclosure; and
FIG. 12 shows a block diagram of an electronic device in an embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
As mentioned in the background section above, manual auditing of code defect reports often relies on the expertise of auditors, requires a high investment of manpower, and is inefficient. Some existing automated methods simply filter out low-risk problems or filter defects through a preset whitelist, without substantively auditing the defects. Some deep-learning-based defect identification methods rely on a large amount of training data and are limited to the defect types covered by that data, so they can often only perform binary classification or identify a few types of defects.
Referring to the schematic diagram of a source code defect detection method shown in FIG. 1, FIG. 1 shows a manner of performing source code defect detection based on deep learning that fuses multiple features. As shown in FIG. 1, the source code is first converted into an abstract syntax tree (AST), and code paths are expanded through analysis means such as data flow and control flow to obtain code slices. The code slices are then vectorized in multiple dimensions so that multiple semantic features of a code slice can be represented, such as sequence feature vectors representing context semantics, tree feature vectors representing the AST, and graph feature vectors representing data flow and control flow graphs. Finally, the vector of each semantic feature is input into a suitable individual neural network (such as a recurrent neural network, a tree convolution network, or a graph convolution network) to extract high-dimensional features and classify them, and the classification results of the features are fused through an additional loss function.
The method shown in FIG. 1 does not rely on expert-defined defect scanning rules and integrates various semantic features such as code sequences, abstract syntax trees, data flow, and control flow, but the classification capability of the model depends on a large amount of training data and on the defect types that the training data can cover. Typical defect scanning tools support up to hundreds of defect types; for example, Fortify supports more than 800 defect types. Limited by the training data, the approach shown in FIG. 1 can often only perform binary classification or identify a few classes of defects.
Based on the above, the code defect report auditing method provided by the embodiments of the disclosure converts a code slice into a set of vectors representing various semantic features, and selects a suitable neural network for each semantic vector to construct a set of twin (Siamese) networks. The twin networks extract high-dimensional features from the semantic vectors of the defect to be audited and of already audited defects and calculate the distance between the vectors; the vector distances are converted into similarity values, the similarity values of the various semantic features are fused, and whether the defect report is a false alarm is identified by evaluating the similarity between defects. Twin networks are suited to problems with many classes and little training data, so the method can overcome the drawbacks of existing deep-learning-based defect detection methods, which need a large amount of training data and identify few defect types, and it can be used for automatic auditing of code defects to improve the efficiency of code auditing.
The code defect report auditing method provided by the disclosure can be applied to a DevOps security system, is suitable for automatic auditing of source code defects in a continuous integration environment, and improves the auditing efficiency of code defect reports; the method can be deployed on a research and development cloud platform for automatic auditing of code security, and can address the high false alarm rate of white-box source code defect inspection tools.
The present exemplary embodiment will be described in detail below with reference to the accompanying drawings and examples.
First, in an embodiment of the present disclosure, a code defect report auditing method is provided, which may be performed by any electronic device having computing processing capabilities.
FIG. 2 shows a flowchart of a code defect report auditing method according to an embodiment of the present disclosure. As shown in FIG. 2, the code defect report auditing method provided in the embodiment of the present disclosure includes the following steps:
S202, obtaining a defect report to be audited output by a source code scanning tool, wherein the defect report to be audited comprises a code slice and a defect type of the code slice, and the defect type is the defect result identified for the code slice by the source code scanning tool;
It should be noted that the defect report to be audited may further include a preset code path, and the code slice may be a code fragment intercepted from the source code according to the preset code path. The defect type may include at least one of an access control defect, an SQL (Structured Query Language) injection defect, a command injection defect, and a cross-site scripting defect, or the defect type may be a defect ID (identifier) defined by the CWE (Common Weakness Enumeration) database.
In one embodiment of the present disclosure, a code fragment may be intercepted from the source code as the code slice according to a preset code path, which may be the path from a source point (source) to a sink point (sink). Referring to the code slice acquisition method shown in FIG. 3, the portion of the source code between the source point function (getHeader) and the sink point function (printf) is taken as the code slice.
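As a rough illustration of cutting a slice between a source and a sink, the sketch below takes the lines from the first call of a source function to the next call of a sink function. This line-based cut is a deliberate simplification and an assumption: a real scanner follows data flow and control flow along the reported code path. The function name `extract_code_slice` and the sample program are hypothetical; only `getHeader` and `printf` come from the description of FIG. 3.

```python
import re
from typing import List


def extract_code_slice(source_code: str, source_fn: str, sink_fn: str) -> List[str]:
    """Return the lines from the first source call to the next sink call."""
    lines = source_code.splitlines()
    source_call = re.compile(r"\b" + re.escape(source_fn) + r"\s*\(")
    sink_call = re.compile(r"\b" + re.escape(sink_fn) + r"\s*\(")

    # Locate the first line that calls the source function.
    start = next((i for i, ln in enumerate(lines) if source_call.search(ln)), None)
    if start is None:
        return []
    # Locate the first sink call at or after the source line.
    end = next(
        (i for i in range(start, len(lines)) if sink_call.search(lines[i])), None
    )
    if end is None:
        return []
    return lines[start : end + 1]


# Hypothetical program text used only to exercise the function.
program = """init(req);
char *ua = getHeader(req, "User-Agent");
int n = strlen(ua);
log_count(n);
printf(ua);
cleanup(req);"""
code_slice = extract_code_slice(program, "getHeader", "printf")
```

Lines outside the source-to-sink span (`init` and `cleanup` here) are dropped, which is the point of slicing: only the code relevant to the reported path is compared later.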
S204, selecting, from a defect database according to the defect type of the code slice, a target code slice sample having the same defect type as the code slice, wherein the defect database stores triplet information of a plurality of audited code slice samples, and the triplet information is <defect type, code slice sample, audit label>;
It should be noted that the target code slice sample is a code slice sample in the defect database with the same defect type as the code slice; the audit label is determined by the audit result of the defect report corresponding to the code slice sample and is 0 or 1, where 0 represents a defect and 1 represents a false alarm.
In one embodiment of the present disclosure, selecting, from the defect database according to the defect type of the code slice, a target code slice sample having the same defect type as the code slice may include: when the defect type of the code slice is an access control defect, selecting from the defect database a target code slice sample whose defect type is an access control defect; when the defect type of the code slice is an SQL injection defect, selecting a target code slice sample whose defect type is an SQL injection defect; when the defect type of the code slice is a command injection defect, selecting a target code slice sample whose defect type is a command injection defect; when the defect type of the code slice is a cross-site scripting defect, selecting a target code slice sample whose defect type is a cross-site scripting defect; alternatively, selecting from the defect database, according to the defect ID of the code slice, a target code slice sample having the same defect ID as the code slice.
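All of the cases above reduce to one equality match on the defect type (or defect ID). A minimal sketch using an in-memory SQLite database follows; the table name `defect_samples`, its column names, and the sample rows are assumptions made for illustration and do not describe the storage used in the disclosure.

```python
import sqlite3
from typing import List, Tuple


def select_target_samples(
    conn: sqlite3.Connection, defect_type: str
) -> List[Tuple[str, int]]:
    """Return (code_slice, audit_label) rows whose defect type matches."""
    cursor = conn.execute(
        "SELECT code_slice, audit_label FROM defect_samples "
        "WHERE defect_type = ? ORDER BY rowid",
        (defect_type,),  # Parameter binding avoids building SQL by hand.
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE defect_samples ("
    "defect_type TEXT NOT NULL, code_slice TEXT NOT NULL, "
    "audit_label INTEGER NOT NULL)"
)
# Hypothetical triplets; defect types are written as CWE IDs here.
conn.executemany(
    "INSERT INTO defect_samples VALUES (?, ?, ?)",
    [
        ("CWE-89", "query = base + user_input", 0),
        ("CWE-78", "system(cmd)", 0),
        ("CWE-89", "query = base + str(42)", 1),
    ],
)
targets = select_target_samples(conn, "CWE-89")
```

Because the match is a plain equality, the same query serves both a named defect type and a CWE defect ID, depending on what is stored in the `defect_type` column.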
S206, converting the code slice into feature vectors of different vector categories to obtain a first vector set, and converting the target code slice sample into feature vectors of different vector categories to obtain a second vector set;
it should be noted that the vector classes may include sequence vectors, tree vectors, and graphic vectors, where the sequence vectors may be used to characterize the context semantics of the code slices, the tree vectors may represent the code slices in the form of abstract syntax trees AST, and the graphic vectors may be used to characterize the code slices in the form of data flows or control flow graphs.
In one embodiment of the present disclosure, converting a code slice into feature vectors of different vector classes may include: converting the code slice into a sequence vector to obtain the sequence vector of the code slice; converting the code slice into a tree vector to obtain the tree vector of the code slice; converting the code slice into a graphic vector to obtain the graphic vector of the code slice; and obtaining a first vector set based on the sequence vector, the tree vector, and the graphic vector of the code slice, wherein the first vector set may include the sequence vector, the tree vector, and the graphic vector of the code slice. Converting the target code slice sample into feature vectors of different vector classes may include: converting the target code slice sample into a sequence vector to obtain the sequence vector of the target code slice sample; converting the target code slice sample into a tree vector to obtain the tree vector of the target code slice sample; converting the target code slice sample into a graphic vector to obtain the graphic vector of the target code slice sample; and obtaining a second vector set based on the sequence vector, the tree vector, and the graphic vector of the target code slice sample, wherein the second vector set may include the sequence vector, the tree vector, and the graphic vector of the target code slice sample.
S208, according to the mapping relation between the vector category of the feature vector and the category of the twin network, inputting the feature vector of each group of the same vector category in the first vector set and the second vector set into the corresponding trained twin network to obtain a vector distance set consisting of vector distances output by a plurality of trained twin networks;
Twin network (Siamese Network): a twin network is a deep neural network architecture in which the first neural network (network1) and the second neural network (network2) have the same structure and share the network weight parameters (W). Network1 and network2 can therefore be understood as essentially the same network; that is, a twin network is composed of two neural networks of the same type that share weights.
In one embodiment of the present disclosure, the categories of the twin network may include a cyclic (recurrent) twin network, a tree convolution twin network, a graph convolution twin network, and the like, wherein the cyclic twin network consists of two recurrent neural networks sharing weights, the tree convolution twin network consists of two tree convolutional neural networks sharing weights, and the graph convolution twin network consists of two graph convolutional neural networks sharing weights. The recurrent neural network may include an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory network), and a BLSTM (Bidirectional Long Short-Term Memory network); the tree convolutional neural network may be a tree convolution network or a Tree-LSTM; and the graph convolutional neural network may include a GCN (Graph Convolutional Network) and a GAT (Graph Attention Network).
In one embodiment of the present disclosure, the present disclosure may include a twin network group including a plurality of twin networks, such as a cyclic twin network, a tree convolution twin network, a graph convolution twin network, and the like, from which a corresponding kind of twin network may be selected according to a vector class of the feature vector, such as when the vector class of the feature vector is a sequence vector, a cyclic twin network may be selected from the twin network group; when the vector category of the feature vector is a tree vector, selecting a tree convolution twin network from the twin network group; and when the vector category of the feature vector is a graph vector, selecting a graph convolution twin network from the twin network group.
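The mapping from vector category to twin network described above can be sketched as a lookup table. This is only an illustration under stated assumptions: the category keys, `TWIN_NETWORK_GROUP`, and `vector_distance_set` are hypothetical names, and a plain Euclidean distance stands in for each trained twin network, which in practice would first embed both inputs.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]
TwinNetwork = Callable[[Vector, Vector], float]


def euclidean(a: Vector, b: Vector) -> float:
    """Placeholder for a trained twin network: returns a vector distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical twin network group, keyed by vector category.
TWIN_NETWORK_GROUP: Dict[str, TwinNetwork] = {
    "sequence": euclidean,  # stand-in for the cyclic twin network
    "tree": euclidean,      # stand-in for the tree convolution twin network
    "graph": euclidean,     # stand-in for the graph convolution twin network
}


def vector_distance_set(first_set: Dict[str, Vector],
                        second_set: Dict[str, Vector]) -> Dict[str, float]:
    """Feed each pair of same-category vectors to the matching twin network."""
    distances = {}
    for category, x1 in first_set.items():
        network = TWIN_NETWORK_GROUP[category]
        distances[category] = network(x1, second_set[category])
    return distances


first = {"sequence": [0.0, 0.0], "tree": [1.0], "graph": [1.0, 1.0]}
second = {"sequence": [3.0, 4.0], "tree": [1.0], "graph": [1.0, 2.0]}
distances = vector_distance_set(first, second)
```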
S210, determining an auditing result of a defect report to be audited according to the vector distance set, wherein the auditing result is defect or false alarm.
It should be noted that an auditing result of defect indicates that the defect report to be audited passes the audit and the code slice has a defect, whereas an auditing result of false alarm indicates that the defect report to be audited does not pass the audit and the code slice has no defect.
In one embodiment of the present disclosure, a plurality of vector distances are included in a set of vector distances, where the vector distances may be euclidean distances or other vector distances. Each vector distance can be converted into a similarity to obtain a plurality of vector similarities, and a weighted average of the plurality of vector similarities is calculated to obtain the code similarity of the code slice and the target code slice sample. According to the code similarity, determining an auditing result of a defect report to be audited, for example, when the code similarity is larger than or equal to a preset similarity threshold value, determining that the auditing result of the code slice is the same as that of the target code slice sample, at this time, determining the auditing result of the defect report to be audited according to an auditing label of the target code slice sample, when the auditing label of the target code slice sample is 0, the auditing result of the defect report to be audited is defect, and when the auditing label of the target code slice sample is 1, the auditing result of the defect report to be audited is false report. The preset similarity threshold may be any percentage, such as 80%, 85%, or 90%, etc.
In another case, when the code similarity is smaller than the preset similarity threshold, it is determined that the auditing results of the code slice and the target code slice sample are opposite. The auditing result of the defect report to be audited is then determined according to the audit label of the target code slice sample: when the audit label of the target code slice sample is 0, the auditing result of the defect report to be audited is false alarm, and when the audit label of the target code slice sample is 1, the auditing result of the defect report to be audited is defect.
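The two cases above amount to a small decision rule, sketched below. The function name `audit_result` and the default threshold of 0.85 are illustrative assumptions; the disclosure only states that the threshold may be any percentage such as 80%, 85%, or 90%.

```python
DEFECT, FALSE_ALARM = 0, 1  # audit label encoding used in the defect database


def audit_result(code_similarity: float, target_label: int,
                 threshold: float = 0.85) -> str:
    """Derive the audit result of the report from one target sample.

    At or above the threshold the report inherits the target sample's label;
    below the threshold it receives the opposite label.
    """
    if target_label not in (DEFECT, FALSE_ALARM):
        raise ValueError("audit label must be 0 (defect) or 1 (false alarm)")
    same_result = code_similarity >= threshold
    label = target_label if same_result else 1 - target_label
    return "defect" if label == DEFECT else "false alarm"
```

For example, `audit_result(0.9, 0)` yields `"defect"`, while `audit_result(0.5, 0)` yields `"false alarm"` because the similarity falls below the threshold.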
In one embodiment of the disclosure, a plurality of target code slice samples with the same defect type as the code slice can be obtained from a defect database, the target code slice sample with the highest code similarity with the code slice is determined from the plurality of target code slice samples, and an audit result of a defect report to be audited is determined according to an audit label of the target code slice sample with the highest code similarity with the code slice. If the audit label of the target code slice sample with the highest code similarity with the code slice is 0, the audit result of the defect report to be audited is defect, and if the audit label of the target code slice sample with the highest code similarity with the code slice is 1, the audit result of the defect report to be audited is false alarm.
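When several target code slice samples are compared, the rule above reduces to taking the label of the most similar sample. A minimal sketch, assuming the code similarities have already been computed and using the hypothetical name `audit_by_nearest`:

```python
from typing import List, Tuple


def audit_by_nearest(scored: List[Tuple[float, int]]) -> str:
    """scored holds one (code similarity, audit label) pair per target sample.

    The report takes the audit label of the sample with the highest
    code similarity (0 = defect, 1 = false alarm).
    """
    if not scored:
        raise ValueError("at least one target code slice sample is required")
    _, label = max(scored, key=lambda pair: pair[0])
    return "defect" if label == 0 else "false alarm"


result = audit_by_nearest([(0.62, 1), (0.91, 0), (0.77, 1)])
```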
According to the code defect report auditing method provided by the embodiments of the present disclosure, a target code slice sample with the same defect type as the code slice is selected from the defect database based on the defect type; the code slice and the target code slice sample are respectively converted into feature vectors of different vector categories to obtain a first vector set and a second vector set; the trained twin networks are used to obtain the vector distance between each pair of feature vectors of the same vector category in the first vector set and the second vector set, yielding a vector distance set; and the auditing result of the defect report to be audited is determined according to the vector distance set. This improves the accuracy and efficiency of code defect report auditing.
The code defect report auditing method provided by the embodiments of the present disclosure can be used in a DevOps security system or a research and development cloud project to address the high false alarm rate of existing white-box source code defect inspection tools, as well as the high cost and low efficiency of manual code defect auditing. When the research and development cloud platform is in future offered on Tianyi Cloud as a DevOps capability, the scheme of the present disclosure can provide automatic auditing of source code defects, reducing information security risks during software development projects and moving security protection earlier in the process.
In one embodiment of the present disclosure, converting the code slice into feature vectors of different vector categories to obtain the first vector set, and converting the target code slice sample into feature vectors of different vector categories to obtain the second vector set, may be implemented through the steps disclosed in fig. 4. Referring to the flowchart of another code defect report auditing method shown in fig. 4, the steps may include:
S402, converting the code slice into a sequence vector, a tree vector and a graph vector respectively to obtain a first vector set;
S404, converting the target code slice sample into a sequence vector, a tree vector and a graph vector respectively to obtain a second vector set.
In one embodiment of the present disclosure, according to a mapping relationship between a vector class of a feature vector and a class of a twin network, inputting feature vectors of each group of the same vector class in a first vector set and a second vector set into a corresponding trained twin network to obtain a vector distance set composed of vector distances output by a plurality of trained twin networks, including:
inputting a first feature vector in a first vector set into a first neural network of a trained first twin network, inputting a second feature vector in a second vector set into a second neural network of the trained first twin network to obtain a first vector distance output by the trained first twin network, wherein the vector types of the first feature vector and the second feature vector are the same, and the trained first twin network is determined by the vector type of the first feature vector or the second feature vector;
Inputting a third feature vector in the first vector set into a first neural network of a trained second twin network, inputting a fourth feature vector in the second vector set into a second neural network of the trained second twin network to obtain a second vector distance output by the trained second twin network, wherein the vector type of the third feature vector is the same as that of the fourth feature vector, and the trained second twin network is determined by the vector type of the third feature vector or the fourth feature vector;
a set of vector distances is determined based on the first vector distance and the second vector distance.
It should be noted that the trained first twin network is determined by the vector class of the first feature vector or the second feature vector. For example, when the vector classes of the first feature vector and the second feature vector are both sequence vectors, the cyclic twin network is selected as the trained first twin network; when they are both tree vectors, the tree convolution twin network is selected; and when they are both graphic vectors, the graph convolution twin network is selected. The trained second twin network is determined by the vector class of the third feature vector or the fourth feature vector in the same way as the first twin network, which is not described in detail herein. In addition, the first twin network and the second twin network are of different types.
In one embodiment of the present disclosure, referring to a schematic diagram of a code defect report auditing method shown in fig. 5, fig. 5 discloses an application process of the twin network of the present disclosure, which may include: obtaining a defect report to be audited, and determining a code slice and the defect type of the code slice from the defect report to be audited; selecting an object code slice sample with the same defect type as the code slice from a defect database based on the defect type of the code slice, and respectively vectorizing the code slice and the object code slice sample to obtain a first vector set and a second vector set, wherein the first vector set comprises a sequence vector, a tree vector, a graph vector and other vectors of the code slice, and the second vector set comprises a sequence vector, a tree vector, a graph vector and other vectors of the object code slice sample; according to the mapping relation between the vector category of the feature vector and the category of the twin network, the sequence vector of the code slice and the sequence vector of the target code slice sample can be input into the twin network 1 to obtain a first vector distance; inputting the tree vector of the code slice and the tree vector of the target code slice sample into the twin network 2 to obtain a second vector distance; inputting the graphic vector of the code slice and the graphic vector of the target code slice sample into the twin network 3 to obtain a third vector distance; and inputting other vectors of the code slice and other vectors of the target code slice sample into other twin networks to obtain a fourth vector distance, and finally obtaining a vector distance set, wherein the vector distance set comprises a first vector distance, a second vector distance, a third vector distance, a fourth vector distance and the like.
The present disclosure may utilize audited defect reports to build a defect database that may be used for training of twin networks and as a similarity search space for code slices to reduce the cost of sample data acquisition.
In one embodiment of the present disclosure, the twin network is widely used in deep metric learning. Referring to the schematic diagram of another code defect report auditing method shown in fig. 6, a first feature vector (which may be denoted as $X_1$) and a second feature vector (which may be denoted as $X_2$) are respectively input into the two neural networks of the twin network and mapped into embedding vectors (which may be denoted as $G_W(X_1)$ and $G_W(X_2)$, respectively). The Euclidean or cosine distance between the two embedding vectors (which may be denoted as $\|G_W(X_1) - G_W(X_2)\|$) is then measured to obtain a distance metric value (which may be denoted as $D_W$), and the similarity of the first feature vector and the second feature vector can then be determined from this distance metric value. The two neural networks in the twin network are in fact the same network, and the network weight parameters ($W$) are shared.
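Weight sharing can be made concrete with a toy sketch in which both branches apply the same mapping with the same parameters. The single tanh layer in `embed`, the weight matrix `W`, and the function names are assumptions chosen for brevity; they are not the recurrent, tree, or graph networks named in the disclosure.

```python
import math
from typing import List

Matrix = List[List[float]]


def embed(W: Matrix, x: List[float]) -> List[float]:
    """Toy embedding G_W(x): one linear layer followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]


def twin_distance(W: Matrix, x1: List[float], x2: List[float]) -> float:
    """Euclidean distance between G_W(x1) and G_W(x2).

    Both inputs pass through embed() with the same W, which is what
    sharing weights between the two branches means.
    """
    g1 = embed(W, x1)
    g2 = embed(W, x2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g1, g2)))


W = [[0.5, -0.2], [0.1, 0.3]]  # made-up shared weights
d_same = twin_distance(W, [1.0, 2.0], [1.0, 2.0])
d_diff = twin_distance(W, [1.0, 2.0], [-1.0, 0.5])
```

Because the weights are shared, identical inputs always map to the same embedding and therefore to a distance of zero, and swapping the two inputs does not change the distance.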
In one embodiment of the present disclosure, determining an audit result of a defect report to be audited according to a vector distance set may be implemented through the steps disclosed in fig. 7, referring to another code defect report audit method flowchart shown in fig. 7, may include the steps of:
S702, calculating the code similarity between the code slice and the target code slice sample based on the vector distance set;
S704, determining an auditing result of the defect report to be audited based on the code similarity.
In one embodiment of the present disclosure, calculating the code similarity between the code slice and the target code slice sample from the set of vector distances may be implemented by the steps disclosed in fig. 8, referring to another code defect report auditing method flowchart shown in fig. 8, may include the steps of:
S802, calculating the vector similarity corresponding to each vector distance in the vector distance set to obtain a plurality of vector similarities;
S804, calculating a weighted average of the vector similarities to obtain the code similarity between the code slice and the target code slice sample.
The vector similarity is inversely proportional to the vector distance, and a larger vector distance indicates a smaller vector similarity and a smaller vector distance indicates a larger vector similarity.
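Steps S802 and S804 can be sketched as follows. The exact distance-to-similarity formula is not reproduced in this text, so the mapping `1 / (1 + distance)` is an assumption chosen only because it decreases as the distance grows and equals 1 at distance zero; the function names and weights are likewise hypothetical.

```python
from typing import Sequence


def vector_similarity(distance: float) -> float:
    """Assumed mapping: similarity shrinks as the vector distance grows."""
    if distance < 0:
        raise ValueError("a vector distance cannot be negative")
    return 1.0 / (1.0 + distance)


def code_similarity(distances: Sequence[float],
                    weights: Sequence[float]) -> float:
    """Weighted average of the per-category vector similarities."""
    if not distances or len(distances) != len(weights):
        raise ValueError("need one weight per vector distance")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    sims = [vector_similarity(d) for d in distances]
    return sum(p * s for p, s in zip(weights, sims)) / total


# Sequence, tree, and graph distances with made-up weights.
similarity = code_similarity([0.0, 1.0, 3.0], [1.0, 1.0, 2.0])
```

Here the three similarities are 1.0, 0.5, and 0.25, so the weighted average is (1.0 + 0.5 + 0.5) / 4 = 0.5.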
In one embodiment of the present disclosure, a defect database is obtained by:
obtaining a plurality of audited defect reports, wherein the audited defect reports comprise code slice samples, defect types of the code slice samples and audit labels;
Converting the plurality of audited defect reports into a triplet format to obtain triplet information of a plurality of code slice samples, wherein the triplet information is < defect type, code slice sample and audit label >;
and storing the triplet information of the plurality of code slice samples into a database to obtain a defect database.
Wherein the defect database stores the triplet information of a plurality of audited code slice samples.
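The database construction steps above can be sketched with an in-memory list. The report field names (`defect_type`, `code_slice`, `audit_result`) and the function names are hypothetical; only the triplet layout and the 0/1 label encoding come from the disclosure.

```python
from typing import Dict, List, Tuple

# <defect type, code slice sample, audit label>
Triplet = Tuple[str, str, int]

# Audit label encoding: 0 represents a defect, 1 represents a false alarm.
LABELS = {"defect": 0, "false alarm": 1}


def to_triplet(report: Dict[str, str]) -> Triplet:
    """Convert one audited defect report into triplet format."""
    return (report["defect_type"], report["code_slice"],
            LABELS[report["audit_result"]])


def build_defect_database(reports: List[Dict[str, str]]) -> List[Triplet]:
    """Store the triplet information of every audited report."""
    return [to_triplet(r) for r in reports]


audited_reports = [
    {"defect_type": "SQL injection", "code_slice": "query = base + name",
     "audit_result": "defect"},
    {"defect_type": "cross-site script", "code_slice": "write(encode(s))",
     "audit_result": "false alarm"},
]
defect_db = build_defect_database(audited_reports)
```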
In one embodiment of the present disclosure, a trained twin network is obtained by the steps comprising:
acquiring triplet information of a first code slice sample and triplet information of a second code slice sample from a defect database, wherein the defect types of the first code slice sample and the second code slice sample are the same;
respectively converting the first code slice sample and the second code slice sample into feature vectors of different vector categories to obtain a first sample vector set and a second sample vector set;
according to the mapping relation between the vector category of the feature vector and the category of the twin network, inputting the feature vector of each group of the same vector category in the first sample vector set and the second sample vector set into the twin network to be trained correspondingly, and obtaining the target vector distance output by the twin network to be trained;
And constructing a contrast loss function based on the target vector distance, and adjusting model parameters in the twin network to be trained according to the contrast loss function to obtain the trained twin network.
It should be noted that, in the process of training the twin network, the data in the defect database may be used as sample data of the twin network, two code slice samples with the same defect category may be selected from the defect database, after the two code slice samples are respectively converted into feature vectors with different vector categories, the obtained first sample vector set and second sample vector set are used as training samples, and according to the mapping relationship between the vector category of the feature vector and the category of the twin network, feature vectors with the same vector category in each of the first sample vector set and the second sample vector set are input into the twin network to be trained.
On the premise that the two code slice samples have the same defect type, the audit label combinations of the sample pairs should cover as many cases as possible, so as to improve the generalization ability of the resulting trained twin network. For example, the audit labels of the two code slice samples corresponding to the feature vectors input into the twin network to be trained may be <defect, defect>, <defect, false alarm>, or <false alarm, false alarm>.
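One way to obtain training pairs that respect both constraints (same defect type, all label combinations represented) is to enumerate pairs per defect type and group them by label combination. This is a sketch with hypothetical names, not the sampling procedure of the disclosure.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, int]  # <defect type, code slice sample, audit label>


def training_pairs(db: List[Triplet]) -> Dict[Tuple[int, int], list]:
    """Group same-defect-type sample pairs by audit label combination.

    Keys are (0, 0) for <defect, defect>, (0, 1) for <defect, false alarm>,
    and (1, 1) for <false alarm, false alarm>.
    """
    by_type = defaultdict(list)
    for triplet in db:
        by_type[triplet[0]].append(triplet)

    pairs = defaultdict(list)
    for samples in by_type.values():
        for a, b in combinations(samples, 2):
            combo = tuple(sorted((a[2], b[2])))
            pairs[combo].append((a, b))
    return pairs


db = [
    ("SQL injection", "slice_a", 0),
    ("SQL injection", "slice_b", 0),
    ("SQL injection", "slice_c", 1),
    ("SQL injection", "slice_d", 1),
    ("command injection", "slice_e", 0),
]
pairs = training_pairs(db)
```

Sampling a similar number of pairs from each key is one simple way to keep the three combinations balanced during training.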
In one embodiment of the present disclosure, the contrast loss function $L(W_i, Y, X_{1i}, X_{2i})$ can be expressed as formula (1), in which the value of $Y$ depends on whether the audit labels corresponding to the two feature vectors are consistent ($Y=1$ if they are, and $Y=0$ if they are not); $D_{W_i}$ is the $i$-th vector distance in the vector distance set; $m$ is a configurable constant; $N$ is the number of code slice samples in the defect database that have the same defect type as the code slice sample corresponding to the input feature vector; $X_{1i}$ is the $i$-th feature vector in the first vector set; $X_{2i}$ is the $i$-th feature vector in the second vector set; and $i$ is the index of the element in the first vector set or the second vector set.
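The expression of formula (1) itself does not appear in this text. As an assumption, and not as a statement of the exact formula of the disclosure, a commonly used contrastive loss that is consistent with the symbols defined above ($Y=1$ for consistent audit labels, constant $m$ acting as a margin, normalization by $N$) has the following form, where the summation index $n$ is hypothetical and $Y$ and $D_{W_i}$ are taken per sample pair:

```latex
L(W_i, Y, X_{1i}, X_{2i})
  = \frac{1}{2N} \sum_{n=1}^{N}
    \Bigl[\, Y \, D_{W_i}^{2}
      + (1 - Y)\,\max\bigl(m - D_{W_i},\, 0\bigr)^{2} \Bigr]
```

Under this form, a pair with consistent audit labels ($Y=1$) is penalized in proportion to its squared distance, while a pair with inconsistent labels ($Y=0$) is penalized only when its distance is smaller than $m$.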
In one embodiment of the present disclosure, the trained twin network may calculate the distance between two feature vectors by the following formula:
$$D_{W_i} = \left\| G_{W_i}(X_{1i}) - G_{W_i}(X_{2i}) \right\| \tag{2}$$
In one embodiment of the present disclosure, the vector similarity of two feature vectors may be calculated based on the vector distance by the following formula:
In one embodiment of the present disclosure, a weighted average of the plurality of vector similarities may be calculated by the following formula to obtain the code similarity between the code slice and the target code slice sample:
wherein $P_i$ is the weight of the $i$-th feature vector.
In one embodiment of the present disclosure, the contrast loss function of formula (1), together with the vector distance of formula (2), may be used to measure how well the semantic vectors match and to train the twin network.
In one embodiment of the present disclosure, referring to the schematic diagram of another code defect report auditing method shown in fig. 9, fig. 9 discloses a training process of the twin network of the present disclosure, which may include: obtaining a plurality of audited defect reports, converting each audited defect report into a triplet format to obtain triplet information of a plurality of code slice samples, and storing the triplet information of the plurality of code slice samples to construct the defect database. A first code slice sample and a second code slice sample with the same defect type are selected from the defect database and respectively converted into feature vectors, to obtain a sequence vector, a tree vector, a graph vector and other vectors of the first code slice sample, and a sequence vector, a tree vector, a graph vector and other vectors of the second code slice sample. According to the mapping relation between the vector category of the feature vector and the category of the twin network, the sequence vector of the first code slice sample and the sequence vector of the second code slice sample can be input into the twin network 1; the tree vector of the first code slice sample and the tree vector of the second code slice sample are input into the twin network 2; the graph vector of the first code slice sample and the graph vector of the second code slice sample are input into the twin network 3; and the other vectors of the first code slice sample and the other vectors of the second code slice sample are input into other twin networks.
A contrast loss function is constructed by vector distances output by the twin network 1, the twin network 2, the twin network 3 and other twin networks, and model parameters of the twin network 1, the twin network 2, the twin network 3 and other twin networks are adjusted based on the contrast loss function, so that the trained twin network 1, the trained twin network 2, the trained twin network 3 and other twin networks are obtained.
In one embodiment of the present disclosure, referring to the schematic diagram of another code defect report auditing method shown in fig. 10, each audited defect report may first be converted into triplet information of a code slice sample and stored in the defect database. After a code slice and the defect type of the code slice are determined from the defect report to be audited, a target code slice sample of the same defect type is selected from the defect database according to the defect type of the code slice, and the code slice and the target code slice sample are respectively vectorized to obtain a first vector set and a second vector set. According to the mapping relation between the vector category of the feature vector and the category of the twin network, the feature vectors of each group of the same vector category in the first vector set and the second vector set are input into the corresponding trained twin network to obtain a vector distance set consisting of the vector distances output by a plurality of trained twin networks. The code similarity between the code slice and the target code slice sample is calculated according to the vector distance set, and the audit result of the defect report to be audited is determined based on the code similarity and the audit label of the target code slice sample.
Based on the same inventive concept, the embodiment of the disclosure also provides a code defect report auditing device, as the following embodiment. Since the principle of solving the problem of the embodiment of the device is similar to that of the embodiment of the method, the implementation of the embodiment of the device can be referred to the implementation of the embodiment of the method, and the repetition is omitted.
FIG. 11 is a schematic diagram of a code defect report auditing apparatus according to an embodiment of the disclosure, as shown in FIG. 11, the apparatus includes:
a defect report obtaining module 1110, configured to obtain a defect report to be audited output by the source code scanning tool, where the defect report to be audited includes a code slice and a defect type of the code slice, and the defect type is a defect result of the code slice identified by the source code scanning tool;
the slice sample selection module 1120 is configured to select, according to a defect type of the code slice, an object code slice sample having the same defect type as the code slice from a defect database, where triple information of a plurality of audited code slice samples is stored in the defect database, and the triple information is < defect type, code slice sample, audit tag >;
the feature vector conversion module 1130 is configured to convert the code slice into feature vectors of different vector types to obtain a first vector set, and convert the target code slice sample into feature vectors of different vector types to obtain a second vector set;
The vector distance set obtaining module 1140 is configured to input, according to a mapping relationship between a vector category of a feature vector and a category of a twin network, feature vectors of each group of the same vector category in the first vector set and the second vector set into a corresponding trained twin network, to obtain a vector distance set composed of vector distances output by a plurality of trained twin networks, where the twin networks are composed of two neural networks of the same type that share weights;
the audit result determining module 1150 is configured to determine an audit result of the defect report to be audited according to the vector distance set, where the audit result is a defect or a false alarm.
In one embodiment of the present disclosure, the feature vector conversion module 1130 is further configured to convert the code slice into a sequence vector, a tree vector, and a graph vector, respectively, to obtain a first vector set; and to convert the target code slice sample into a sequence vector, a tree vector, and a graph vector, respectively, to obtain a second vector set.
In an embodiment of the present disclosure, the vector distance set obtaining module 1140 is further configured to input a first feature vector in a first vector set into a first neural network of a trained first twin network, input a second feature vector in a second vector set into a second neural network of the trained first twin network, and obtain a first vector distance output by the trained first twin network, where the first feature vector is the same as a vector class of the second feature vector, and the trained first twin network is determined by the vector class of the first feature vector or the second feature vector; inputting a third feature vector in the first vector set into a first neural network of a trained second twin network, inputting a fourth feature vector in the second vector set into a second neural network of the trained second twin network to obtain a second vector distance output by the trained second twin network, wherein the vector type of the third feature vector is the same as that of the fourth feature vector, and the trained second twin network is determined by the vector type of the third feature vector or the fourth feature vector; a set of vector distances is determined based on the first vector distance and the second vector distance.
In one embodiment of the present disclosure, the audit result determining module 1150 is further configured to calculate a code similarity between the code slice and the target code slice sample based on the vector distance set; based on the code similarity, determining an audit result of the defect report to be audited.
In one embodiment of the present disclosure, the above-mentioned audit result determining module 1150 is further configured to calculate a vector similarity corresponding to each vector distance in the vector distance set, so as to obtain a plurality of vector similarities; and calculating a weighted average of the plurality of vector similarities to obtain the code similarity between the code slice and the target code slice sample.
In one embodiment of the disclosure, the apparatus further includes a database construction module for obtaining the defect database by: obtaining a plurality of audited defect reports, wherein the audited defect reports comprise code slice samples, defect types of the code slice samples and audit labels; converting the plurality of audited defect reports into a triplet format to obtain triplet information of a plurality of code slice samples; storing the triplet information of the code slice samples into a database to obtain a defect database, wherein the defect database stores the triplet information of the audited code slice samples, and the triplet information is < defect type, code slice sample and audit label >.
In one embodiment of the disclosure, the apparatus further includes a twin network training module, configured to obtain a trained twin network by: acquiring triplet information of a first code slice sample and triplet information of a second code slice sample from a defect database, wherein the defect types of the first code slice sample and the second code slice sample are the same; respectively converting the first code slice sample and the second code slice sample into feature vectors of different vector categories to obtain a first sample vector set and a second sample vector set; according to the mapping relation between the vector category of the feature vector and the category of the twin network, inputting the feature vector of each group of the same vector category in the first sample vector set and the second sample vector set into the twin network to be trained correspondingly, and obtaining the target vector distance output by the twin network to be trained; and constructing a contrast loss function based on the target vector distance, and adjusting model parameters in the twin network to be trained according to the contrast loss function to obtain the trained twin network.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, any of which may be referred to herein as a "circuit," "module," or "system."
An electronic device 1200 according to such an embodiment of the present disclosure is described below with reference to fig. 12. The electronic device 1200 shown in fig. 12 is merely an example, and should not be construed as limiting the functionality and scope of use of the disclosed embodiments.
As shown in fig. 12, the electronic device 1200 is in the form of a general purpose computing device. Components of the electronic device 1200 may include, but are not limited to: at least one processing unit 1210, at least one storage unit 1220, and a bus 1230 connecting the different system components (including the storage unit 1220 and the processing unit 1210).
The storage unit 1220 stores program code that is executable by the processing unit 1210, such that the processing unit 1210 performs the steps according to the various exemplary embodiments of the present disclosure described in the "exemplary methods" section of this specification. For example, the processing unit 1210 may perform the following steps of the method embodiment described above: obtaining a defect report to be audited output by a source code scanning tool, wherein the defect report to be audited comprises a code slice and a defect type of the code slice, and the defect type is a defect result of the code slice identified by the source code scanning tool; selecting, from a defect database according to the defect type of the code slice, a target code slice sample with the same defect type as the code slice, wherein the defect database stores triplet information of a plurality of audited code slice samples, and the triplet information is <defect type, code slice sample, audit label>; converting the code slice into feature vectors of different vector categories to obtain a first vector set, and converting the target code slice sample into feature vectors of different vector categories to obtain a second vector set; inputting, according to the mapping relation between the vector category of a feature vector and the category of a twin network, each group of feature vectors of the same vector category in the first vector set and the second vector set into the corresponding trained twin network to obtain a vector distance set consisting of the vector distances output by a plurality of trained twin networks; and determining an auditing result of the defect report to be audited according to the vector distance set, wherein the auditing result is either a defect or a false report.
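The flow of these steps can be sketched in Python as follows. The sketch is illustrative only and rests on several assumptions: a Euclidean distance function stands in for each trained twin network, the category names and the 1 / (1 + d) similarity mapping are hypothetical, the similarities are averaged with equal weights, and the auditing result is taken from the audit label of the most similar sample. None of these choices is stated in this passage.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
DistanceFn = Callable[[Vector, Vector], float]


def euclidean(a: Vector, b: Vector) -> float:
    """Placeholder for a trained twin network: returns a vector distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Mapping relation: vector category -> twin network for that category.
TWIN_NETWORKS: Dict[str, DistanceFn] = {
    "sequence": euclidean,
    "tree": euclidean,
    "graph": euclidean,
}


def vector_distance_set(first: Dict[str, Vector], second: Dict[str, Vector]) -> Dict[str, float]:
    """Feed same-category vectors to the matching network, collect distances."""
    return {cat: net(first[cat], second[cat]) for cat, net in TWIN_NETWORKS.items()}


def audit(first_set: Dict[str, Vector],
          samples: List[Tuple[Dict[str, Vector], str]]) -> str:
    """Return the audit label of the most similar same-defect-type sample."""
    if not samples:
        raise ValueError("no target code slice samples for this defect type")
    best_label, best_similarity = "", -1.0
    for second_set, audit_label in samples:
        distances = vector_distance_set(first_set, second_set)
        # Equal-weight average of per-category similarities (assumption).
        similarity = sum(1.0 / (1.0 + d) for d in distances.values()) / len(distances)
        if similarity > best_similarity:
            best_label, best_similarity = audit_label, similarity
    return best_label
```

Keeping one network per vector category means a sequence vector is never compared against a tree or graph vector, which mirrors the mapping relation between vector categories and twin networks.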
The storage unit 1220 may include a readable medium in the form of a volatile storage unit, such as a Random Access Memory (RAM) 12201 and/or a cache memory 12202, and may further include a Read Only Memory (ROM) 12203.
Storage unit 1220 may also include a program/utility 12204 having a set (at least one) of program modules 12205, such program modules 12205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 1230 may represent one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor bus, or a local bus using any of a variety of bus architectures.
The electronic device 1200 may also communicate with one or more external devices 1240 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 1200, and/or any devices (e.g., routers, modems, etc.) that enable the electronic device 1200 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 1250. Also, the electronic device 1200 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet through the network adapter 1260. As shown, the network adapter 1260 communicates with other modules of the electronic device 1200 over bus 1230. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 1200, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software combined with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (e.g., a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes several instructions to cause a computing device (e.g., a personal computer, a server, a terminal device, or a network device) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium is also provided, which may be a readable signal medium or a readable storage medium, and on which is stored a program product capable of implementing the method of the present disclosure described above. In some possible implementations, various aspects of the disclosure may also be implemented in the form of a program product comprising program code which, when the program product is run on a terminal device, causes the terminal device to carry out the steps according to the various exemplary embodiments of the disclosure described in the "exemplary methods" section of this specification.
More specific examples of the computer readable storage medium in the present disclosure may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In this disclosure, a computer readable storage medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Alternatively, the program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
In particular implementations, the program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ and conventional procedural programming languages such as C or similar languages. The program code may execute entirely on the user's computing device, partly on the user's computing device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or it may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order or that all illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
From the description of the above embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software combined with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (e.g., a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes several instructions to cause a computing device (e.g., a personal computer, a server, a mobile terminal, or a network device) to perform the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. A code defect report auditing method, comprising:
obtaining a defect report to be audited output by a source code scanning tool, wherein the defect report to be audited comprises a code slice and a defect type of the code slice, and the defect type is a defect result of the code slice identified by the source code scanning tool;
selecting a target code slice sample with the same defect type as the code slice from a defect database according to the defect type of the code slice;
converting the code slice into feature vectors of different vector categories to obtain a first vector set, and converting the target code slice sample into feature vectors of different vector categories to obtain a second vector set;
according to the mapping relation between the vector category of the feature vector and the category of the twin network, inputting the feature vector of each group of the same vector category in the first vector set and the second vector set into the corresponding trained twin network to obtain a vector distance set consisting of vector distances output by a plurality of trained twin networks;
determining an auditing result of the defect report to be audited according to the vector distance set, wherein the auditing result is defect or false report;
wherein the vector categories include: sequence vectors, tree vectors, and graph vectors;
converting the code slice into feature vectors of different vector categories to obtain a first vector set includes: converting the code slice into a sequence vector to obtain the sequence vector of the code slice; converting the code slice into a tree vector to obtain the tree vector of the code slice; converting the code slice into a graph vector to obtain the graph vector of the code slice; and obtaining the first vector set based on the sequence vector, the tree vector, and the graph vector of the code slice.
2. The code defect report auditing method of claim 1, wherein converting the code slice into feature vectors of different vector categories to obtain a first vector set, and converting the target code slice sample into feature vectors of different vector categories to obtain a second vector set, comprises:
converting the code slice into a sequence vector, a tree vector, and a graph vector, respectively, to obtain the first vector set;
and converting the target code slice sample into a sequence vector, a tree vector, and a graph vector, respectively, to obtain the second vector set.
3. The code defect report auditing method of claim 1, wherein inputting, according to the mapping relation between the vector category of a feature vector and the category of a twin network, each group of feature vectors of the same vector category in the first vector set and the second vector set into the corresponding trained twin network to obtain a vector distance set composed of vector distances output by a plurality of trained twin networks comprises:
inputting a first feature vector in the first vector set into a first neural network of a trained first twin network, and inputting a second feature vector in the second vector set into a second neural network of the trained first twin network, to obtain a first vector distance output by the trained first twin network, wherein the first feature vector and the second feature vector have the same vector category, and the trained first twin network is determined by the vector category of the first feature vector or the second feature vector;
inputting a third feature vector in the first vector set into a first neural network of a trained second twin network, and inputting a fourth feature vector in the second vector set into a second neural network of the trained second twin network, to obtain a second vector distance output by the trained second twin network, wherein the third feature vector and the fourth feature vector have the same vector category, and the trained second twin network is determined by the vector category of the third feature vector or the fourth feature vector;
and determining the vector distance set based on the first vector distance and the second vector distance.
4. The code defect report auditing method of claim 1, wherein determining an audit result of the defect report to be audited from the set of vector distances comprises:
calculating code similarity between the code slice and the target code slice sample based on the set of vector distances;
and determining an auditing result of the defect report to be audited based on the code similarity.
5. The code defect report auditing method of claim 4, wherein calculating the code similarity between the code slice and the target code slice sample based on the set of vector distances comprises:
calculating the vector similarity corresponding to each vector distance in the vector distance set to obtain a plurality of vector similarities;
and calculating a weighted average of the plurality of vector similarities to obtain the code similarity between the code slice and the target code slice sample.
6. The code defect report auditing method of claim 1, wherein the defect database is obtained by:
obtaining a plurality of audited defect reports, wherein the audited defect reports comprise code slice samples, defect types of the code slice samples and audit labels;
converting the plurality of audited defect reports into a triplet format to obtain triplet information of a plurality of code slice samples, wherein the triplet information is <defect type, code slice sample, audit label>;
and storing the triplet information of the code slice samples into a database to obtain a defect database.
7. The code defect report auditing method of claim 1, wherein the trained twin network is obtained by:
acquiring triplet information of a first code slice sample and triplet information of a second code slice sample from the defect database, wherein the defect types of the first code slice sample and the second code slice sample are the same;
respectively converting the first code slice sample and the second code slice sample into feature vectors of different vector categories to obtain a first sample vector set and a second sample vector set;
inputting, according to the mapping relation between the vector category of a feature vector and the category of a twin network, each group of feature vectors of the same vector category in the first sample vector set and the second sample vector set into the corresponding twin network to be trained, and obtaining the target vector distance output by the twin network to be trained;
and calculating a contrastive loss function value based on the target vector distance, and adjusting model parameters of the twin network to be trained according to the contrastive loss function value to obtain the trained twin network.
8. A code defect report auditing apparatus, comprising:
the defect report acquisition module is used for acquiring a defect report to be audited output by the source code scanning tool, wherein the defect report to be audited comprises a code slice and a defect type of the code slice, and the defect type is a defect result of the code slice identified by the source code scanning tool;
the slice sample selection module is used for selecting a target code slice sample with the same defect type as the code slice from a defect database according to the defect type of the code slice;
the feature vector conversion module is used for converting the code slice into feature vectors of different vector categories to obtain a first vector set, and converting the target code slice sample into feature vectors of different vector categories to obtain a second vector set;
the vector distance set acquisition module is used for inputting the feature vectors of each group of the same vector categories in the first vector set and the second vector set into the corresponding trained twin network according to the mapping relation between the vector categories of the feature vectors and the categories of the twin network to obtain a vector distance set consisting of vector distances output by a plurality of trained twin networks;
the auditing result determining module is used for determining the auditing result of the defect report to be audited according to the vector distance set, wherein the auditing result is defect or false report;
wherein the vector categories include: sequence vectors, tree vectors, and graph vectors;
the feature vector conversion module is further configured to: convert the code slice into a sequence vector to obtain the sequence vector of the code slice; convert the code slice into a tree vector to obtain the tree vector of the code slice; convert the code slice into a graph vector to obtain the graph vector of the code slice; and obtain the first vector set based on the sequence vector, the tree vector, and the graph vector of the code slice.
9. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the code defect report auditing method of any of claims 1-7 via execution of the executable instructions.
10. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the code defect report auditing method of any of claims 1-7.
CN202211130348.3A 2022-09-16 2022-09-16 Code defect report auditing method, device, electronic equipment and storage medium Active CN115454855B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211130348.3A CN115454855B (en) 2022-09-16 2022-09-16 Code defect report auditing method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115454855A CN115454855A (en) 2022-12-09
CN115454855B CN115454855B (en) 2024-02-09

Family

ID=84304430

Country Status (1)

Country Link
CN (1) CN115454855B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017181286A1 (en) * 2016-04-22 2017-10-26 Lin Tan Method for determining defects and vulnerabilities in software code
CN107967208A (en) * 2016-10-20 2018-04-27 南京大学 A kind of Python resource sensitive defect code detection methods based on deep neural network
EP3392780A2 (en) * 2017-04-19 2018-10-24 Tata Consultancy Services Limited Systems and methods for classification of software defect reports
CN112579477A (en) * 2021-02-26 2021-03-30 北京北大软件工程股份有限公司 Defect detection method, device and storage medium
CN114077741A (en) * 2021-11-01 2022-02-22 清华大学 Software supply chain safety detection method and device, electronic equipment and storage medium
CN114297075A (en) * 2021-12-30 2022-04-08 中国电信股份有限公司 Code detection method and device, electronic equipment and computer readable medium
WO2022093250A1 (en) * 2020-10-29 2022-05-05 Veracode, Inc. Development pipeline integrated ongoing learning for assisted code remediation

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9378115B2 (en) * 2014-09-18 2016-06-28 Sap Se Base line for code analysis
US20170212829A1 (en) * 2016-01-21 2017-07-27 American Software Safety Reliability Company Deep Learning Source Code Analyzer and Repairer
US11238306B2 (en) * 2018-09-27 2022-02-01 International Business Machines Corporation Generating vector representations of code capturing semantic similarity
US11574252B2 (en) * 2020-02-19 2023-02-07 Raytheon Company System and method for prioritizing and ranking static analysis results using machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant