CN111832028A - Code auditing method and device, electronic equipment and medium - Google Patents

Code auditing method and device, electronic equipment and medium

Info

Publication number
CN111832028A
Authority
CN
China
Prior art keywords
code
attribute
segments
segment
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010734321.XA
Other languages
Chinese (zh)
Other versions
CN111832028B (en)
Inventor
姜又荷
苏建明
蒋家堂
尉翰龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202010734321.XA
Publication of CN111832028A
Application granted
Publication of CN111832028B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The disclosure relates to the technical field of artificial intelligence, and provides a code auditing method and device. The code auditing method comprises the following steps: acquiring a first code segment to be detected; processing the first code segment to obtain a first code attribute graph corresponding to the first code segment; inputting the first code attribute graph into a code auditing model, wherein the code auditing model is a machine learning model obtained by training based on N second code attribute graphs corresponding to N second code segments and N third code attribute graphs corresponding to N third code segments, the second code segments are code segments with vulnerabilities, and one third code segment is a code segment obtained by repairing the vulnerability in one second code segment; and obtaining the output of the code auditing model so as to obtain the detection result of auditing the first code segment. The disclosure also provides a training method and device for the code auditing model, an electronic device and a medium.

Description

Code auditing method and device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a code auditing method and apparatus, a code auditing model training method and apparatus, an electronic device, and a medium.
Background
Over the years, a vast number of businesses worldwide have suffered cyber attacks, with losses amounting to billions of dollars. Networks now reach into every aspect of work and life, and network security is very important for enterprises. Code auditing is an analytical approach aimed at discovering security vulnerabilities, bugs, and violations of coding specifications in program source code. In conventional penetration testing services and security checks of a software architecture, the security audit of source code is a crucial step.
Existing code auditing mostly combines source code scanning with manual analysis and confirmation, and the code auditing tools commonly available on the market can assist security personnel in white-box testing and manual vulnerability mining. The mainstream auditing methods of existing tools fall into two types, dynamic program analysis and static program analysis: dynamic program analysis triggers vulnerable code by constructing abnormal inputs, identifying code vulnerabilities in the running program; static program analysis examines code based on static code information, discovering code vulnerabilities from structural and code-feature aspects.
In the process of designing the technical scheme of the present disclosure, the inventors found that the prior art has the following defects: although dynamic program analysis can achieve high precision, it has difficulty detecting all abnormal code segments; and in static program analysis, because the difference between many vulnerable code segments and normal code is very small, the two often cannot be told apart, so there are many missed reports and false reports and accuracy is low.
Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a code auditing method and apparatus capable of improving vulnerability detection performance in a code, a training method and training apparatus for a code auditing model, an electronic device, and a medium.
In a first aspect of the disclosed embodiments, a code auditing method is provided. The method comprises the following steps: acquiring a first code segment to be detected; processing the first code segment to obtain a first code attribute graph corresponding to the first code segment; inputting the first code attribute graph into a code auditing model, wherein the code auditing model is a machine learning model obtained by training based on N second code attribute graphs corresponding to N second code segments and N third code attribute graphs corresponding to N third code segments, the second code segments are code segments with bugs, one third code segment is a code segment obtained by repairing a bug in one second code segment, the N third code segments correspond to the N second code segments one by one, and N is an integer greater than or equal to 1; and obtaining the output of the code auditing model so as to obtain the detection result of auditing the first code segment.
According to the embodiment of the disclosure, the output of the code auditing model comprises a first parameter R for characterizing whether the first code segment has the vulnerability and a second parameter Z for characterizing the position range of the vulnerability in the first code attribute map.
According to the embodiment of the disclosure, when the first parameter R represents that the first code segment has a vulnerability, extracting child code segments having the vulnerability from the first code segment based on the value of the second parameter Z.
According to an embodiment of the present disclosure, the method further includes performing the following operations in a loop, and terminating the loop when the deviation between the second parameter Z output by two adjacent rounds of the loop meets a preset condition: updating the first code segment with the sub-code segment and performing the processing, inputting, obtaining and extracting operations described above; and calculating the deviation between the second parameter Z obtained in the current round and the second parameter Z obtained in the previous round.
According to the embodiment of the disclosure, the deviation between the second parameter Z output by two adjacent rounds of the loop meeting the preset condition includes: the deviations calculated for a preset number of consecutive rounds are all smaller than a threshold value.
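As an illustrative, non-limiting sketch, the loop described above might be organized as follows. The sketch assumes that Z is a bounding box of the form (x_min, y_min, x_max, y_max) and that the deviation is the largest absolute coordinate difference; the callables `audit` and `extract`, and all parameter names and default values, are hypothetical stand-ins that do not come from the disclosure.

```python
from typing import Callable, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed form of Z: (x_min, y_min, x_max, y_max)


def deviation(z_prev: Box, z_curr: Box) -> float:
    """Assumed deviation measure: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(z_prev, z_curr))


def narrow_down(
    segment: str,
    audit: Callable[[str], Tuple[int, Optional[Box]]],  # hypothetical: returns (R, Z)
    extract: Callable[[str, Box], str],                 # hypothetical: sub-segment for Z
    threshold: float = 1.0,
    required_stable_rounds: int = 2,
    max_rounds: int = 50,
) -> Optional[str]:
    """Audit, extract the flagged sub-segment, and repeat until Z stabilizes."""
    z_prev: Optional[Box] = None
    stable = 0
    for _ in range(max_rounds):
        r, z = audit(segment)
        if r != 1 or z is None:
            # No vulnerability reported: keep the last narrowed segment, if any.
            return segment if z_prev is not None else None
        segment = extract(segment, z)
        if z_prev is not None:
            # Count consecutive rounds whose deviation is below the threshold.
            stable = stable + 1 if deviation(z_prev, z) < threshold else 0
            if stable >= required_stable_rounds:
                return segment
        z_prev = z
    return segment
```

The `max_rounds` guard is an added safety measure so that the sketch always terminates even if Z never stabilizes.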
According to an embodiment of the present disclosure, the processing the first code segment to obtain a first code attribute map corresponding to the first code segment includes: analyzing the first code segment to obtain an abstract syntax tree, a data flow graph and a control flow graph corresponding to the first code segment; and combining the abstract syntax tree, the data flow graph and the control flow graph to obtain the first code attribute graph.
According to an embodiment of the present disclosure, the method further comprises training the code audit model. The method specifically comprises the following steps: acquiring N second code attribute graphs; acquiring N third code attribute graphs; marking each code attribute graph in the N second code attribute graphs and the N third code attribute graphs to obtain training sample data; and training the code auditing model by using the training sample data.
In a second aspect of the disclosed embodiments, a method for training a code audit model is provided. The training method comprises the following steps: acquiring N second code attribute graphs corresponding to N second code segments, wherein the second code segments are code segments with vulnerabilities, and N is an integer greater than or equal to 1; acquiring N third code attribute graphs corresponding to N third code segments, wherein the N third code segments correspond to the N second code segments one by one, and one third code segment is a code segment obtained after repairing the vulnerability in one second code segment; marking each code attribute graph in the N second code attribute graphs and the N third code attribute graphs to obtain training sample data; and training the code auditing model by using the training sample data.
According to an embodiment of the present disclosure, said marking each of the N second and N third code attribute maps comprises: marking a first parameter R for each code attribute graph based on whether the code segment corresponding to each code attribute graph has a bug or not; marking a third parameter Type for each second code attribute graph based on the Type of the vulnerability in the second code segment corresponding to the second code attribute graph, and marking a second parameter Z based on the position range of the vulnerability in the second code attribute graph; and marking the third parameter Type and the second parameter Z of the third code attribute map according to the mark of the second code attribute map corresponding to the third code attribute map.
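As an illustrative sketch only, the marking scheme described above could be represented as follows. The record layout, the function name `label_pairs` and the encoding of Z as a four-number box are assumptions made for illustration and are not specified by the disclosure.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

Box = Tuple[int, int, int, int]  # assumed encoding of the position range Z


@dataclass
class LabeledGraph:
    graph_id: str
    R: int      # 1 = code segment has a vulnerability, 0 = repaired segment
    Type: str   # type of the vulnerability
    Z: Box      # position range of the vulnerability in the attribute graph


def label_pairs(pairs: Iterable[Tuple[str, str, str, Box]]) -> List[LabeledGraph]:
    """pairs: (second_graph_id, third_graph_id, vulnerability_type, region)."""
    samples: List[LabeledGraph] = []
    for second_id, third_id, vuln_type, region in pairs:
        # Second (vulnerable) graph: R = 1, with its own Type and Z.
        samples.append(LabeledGraph(second_id, 1, vuln_type, region))
        # Third (repaired) graph: R = 0, Type and Z taken from its counterpart.
        samples.append(LabeledGraph(third_id, 0, vuln_type, region))
    return samples
```

Copying Type and Z onto the repaired graph gives each pair of samples the same region of interest, so that the pair differs only in the label R.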
According to an embodiment of the present disclosure, said training the code audit model using the training sample data comprises: calculating a detection ratio index when the code auditing model is verified using part of the training sample data, wherein the detection ratio index is used for evaluating the prediction accuracy of the code auditing model; and stopping training the code auditing model when the detection ratio index reaches a preset precision.
According to an embodiment of the present disclosure, the training method further comprises setting the detection ratio indicator, wherein the detection ratio indicator comprises at least one of a false negative ratio or a false positive ratio.
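By way of a hedged example, the false negative ratio and false positive ratio could be computed on validation data as sketched below. The sketch uses the conventional definitions FNR = FN / (FN + TP) and FPR = FP / (FP + TN), which the disclosure does not spell out; the function names and the 0.05 limits are hypothetical.

```python
from typing import Sequence, Tuple


def detection_ratios(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (false_negative_ratio, false_positive_ratio) for labels R in {0, 1}."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fnr, fpr


def should_stop(fnr: float, fpr: float, max_fnr: float = 0.05, max_fpr: float = 0.05) -> bool:
    """Assumed stopping rule: both ratios at or below preset limits."""
    return fnr <= max_fnr and fpr <= max_fpr


fnr, fpr = detection_ratios([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 1, 1])
print(fnr, fpr)  # 0.25 0.5
```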
According to an embodiment of the present disclosure, the obtaining N second code attribute maps and N third code attribute maps includes: acquiring N second code segments; repairing the N second code segments one by one to obtain N corresponding third code segments; processing the N second code segments to obtain N second code attribute graphs corresponding to the N second code segments one by one; and processing the N third code segments to obtain N third code attribute maps which correspond to the N third code segments one by one.
In a third aspect of the disclosed embodiments, a code auditing apparatus is provided. The code auditing device comprises a first acquisition module, a first processing module, an input module and a result acquisition module. The first obtaining module is used for obtaining a first code segment to be detected. The first processing module is used for processing the first code segment to obtain a first code attribute graph corresponding to the first code segment. The input module is used for inputting the first code attribute graph into a code auditing model, wherein the code auditing model is a machine learning model obtained by training N second code attribute graphs corresponding to N second code segments and N third code attribute graphs corresponding to N third code segments, the second code segments are code segments with bugs, one of the third code segments is a code segment obtained after bug repair in the second code segment, N of the third code segments are in one-to-one correspondence with N of the second code segments, and N is an integer greater than or equal to 1. And the result obtaining module is used for obtaining the output of the code auditing model so as to obtain the detection result of auditing the first code segment.
According to the embodiment of the disclosure, the output of the code auditing model comprises a first parameter R for characterizing whether the first code segment has the vulnerability and a second parameter Z for characterizing the position range of the vulnerability in the first code attribute map.
According to the embodiment of the disclosure, the code auditing device further comprises an extraction module. The extracting module is used for extracting child code segments with the vulnerability from the first code segment based on the value of the second parameter Z when the first parameter R represents that the vulnerability exists in the first code segment.
According to the embodiment of the disclosure, the code auditing apparatus further comprises a loop module. The loop module is used for triggering the following loop operations, and terminating the loop when the deviation between the second parameter Z output by two adjacent rounds of the loop meets a preset condition: updating the first code segment with the sub-code segment and triggering execution of the processing, inputting, obtaining and extracting operations described above; and calculating the deviation between the second parameter Z obtained in the current round and the second parameter Z obtained in the previous round.
In a fourth aspect of the embodiments of the present disclosure, a training apparatus for a code audit model is provided. The training device comprises a second acquisition module, a marking module and a training module. The second acquisition module is configured to obtain N second code attribute maps corresponding to N second code segments, where the second code segments are code segments with vulnerabilities, and N is an integer greater than or equal to 1; and to obtain N third code attribute maps corresponding to N third code segments, wherein the N third code segments correspond to the N second code segments one by one, and one third code segment is a code segment obtained after repairing the vulnerability in one second code segment. The marking module is used for marking each code attribute map in the N second code attribute maps and the N third code attribute maps to obtain training sample data. The training module is used for training the code auditing model by using the training sample data.
According to an embodiment of the present disclosure, the tagging module includes a first tagging submodule, a second tagging submodule, and a third tagging submodule. The first marking submodule is used for marking a first parameter R for each code attribute graph based on whether the code segment corresponding to each code attribute graph has a bug. And the second marking submodule is used for marking a third parameter Type for each second code attribute graph based on the Type of the vulnerability in the second code segment corresponding to the second code attribute graph and marking a second parameter Z based on the position range of the vulnerability in the second code attribute graph. The third labeling sub-module is configured to label, for each third code attribute map, the third parameter Type and the second parameter Z of the third code attribute map according to a label of the second code attribute map corresponding to the third code attribute map.
According to an embodiment of the present disclosure, the second obtaining module is further configured to obtain N second code segments; repairing the N second code segments one by one to obtain N corresponding third code segments; processing the N second code segments to obtain N second code attribute maps corresponding to the N second code segments one by one, and processing the N third code segments to obtain N third code attribute maps corresponding to the N third code segments one by one.
In a fifth aspect of the disclosed embodiments, an electronic device is provided. The electronic device includes one or more memories, and one or more processors. The memory has stored thereon computer-executable instructions. The processor executes the instructions to implement the method as described in the first or second aspect above.
A sixth aspect of embodiments of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for implementing the method as described in the first or second aspect above when executed.
A seventh aspect of embodiments of the present disclosure provides a computer program comprising computer executable instructions for implementing a method as described in the first or second aspect above when executed.
One or more of the above-described embodiments may provide the following advantages or benefits: by converting the code segments with various types of vulnerability characteristics into the code attribute graph, replacing the process of detecting the code language with the process of detecting the graph, and combining the deep learning model to identify whether the program has the code segments which possibly cause the vulnerability, the detection error rate can be reduced to a certain extent, and the detection precision can be improved.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates a flow diagram of a code auditing method according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates an example of a code fragment;
FIG. 3 schematically illustrates an abstract syntax tree of the code fragment shown in FIG. 2;
FIG. 4 schematically illustrates a data flow diagram of the code fragment shown in FIG. 2;
FIG. 5 schematically illustrates a control flow diagram of the code fragment shown in FIG. 2;
FIG. 6 schematically illustrates a code property diagram of the code fragment shown in FIG. 2;
FIG. 7 schematically illustrates a flow diagram of a code auditing method according to another embodiment of the present disclosure;
FIG. 8 schematically illustrates a flow diagram of a code auditing method according to yet another embodiment of the present disclosure;
FIG. 9 schematically illustrates a flow diagram of a method of training a code audit model according to an embodiment of the present disclosure;
FIG. 10 schematically illustrates a flow diagram of a method of training a code audit model according to another embodiment of the present disclosure;
FIG. 11 schematically illustrates a block diagram of a code auditing apparatus according to an embodiment of the present disclosure;
FIG. 12 schematically illustrates a block diagram of an apparatus for training a code audit model, according to an embodiment of the present disclosure;
FIG. 13 schematically illustrates an exemplary architecture to which a code auditing apparatus may be applied, according to an embodiment of the present disclosure;
FIG. 14 schematically illustrates a structural schematic of the tool, shown in FIG. 13, for converting code into a code attribute graph;
FIG. 15 schematically illustrates a structural schematic of the tool, shown in FIG. 13, for converting a code attribute graph into a code attribute vector;
FIG. 16 schematically illustrates a structural schematic of the deep learning detection model tool illustrated in FIG. 13;
FIG. 17 schematically illustrates a method flow for code auditing in the architecture shown in FIG. 13; and
FIG. 18 schematically illustrates a block diagram of an electronic device suitable for implementing code auditing according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.). Where a convention analogous to "at least one of A, B or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
Various embodiments of the present disclosure mainly perform audit detection on code vulnerabilities, convert code segments into code attribute maps, and learn and identify the code attribute maps through machine learning models (e.g., deep learning algorithm models combining convolutional neural networks, deep learning neural networks, and the like), so as to detect whether vulnerabilities exist in the code segments corresponding to the code attribute maps.
FIG. 1 schematically shows a flow diagram of a code auditing method according to an embodiment of the present disclosure.
As shown in fig. 1, the code auditing method may include operations S110 to S140 according to an embodiment of the present disclosure.
First, in operation S110, a first code segment to be detected is obtained. The first code segment may be, for example, code segment 20 shown in fig. 2.
Fig. 2 schematically shows an example of a code fragment 20.
Code fragment 20 is a piece of code with a buffer overflow, in which the size of the declared array buf is MAXSIZE-40. The if statement specifies that the function test returns ERROR when the length of the incoming variable str is greater than or equal to 2 × MAXSIZE. In all other cases the incoming variable str is copied into the array buf (i.e., by strcpy).
The code segment 20 is flawed: the vulnerability is caused by the if statement followed by the strcpy function, and belongs to the buffer overflow type. The main code causing the vulnerability is the strcpy function (string copy function) after the if judgment statement, which copies the string str into the array buf; when the length of the string str exceeds the space length MAXSIZE of the buf array, the values stored after the buf array space are modified, which causes the buffer overflow vulnerability.
In particular, since the if statement specifies that the function test returns ERROR only when the length of the incoming variable str is greater than or equal to 2 × MAXSIZE, the strcpy function is still executed when the length of str is greater than MAXSIZE and less than 2 × MAXSIZE. This means that the data copied to the array buf occupies space beyond that allocated to buf, which may cause errors when the program runs. This type of vulnerable code is very typical and common in the source code of various types of programs.
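The boundary condition described above can be checked with a small sketch. It is written in Python purely for illustration and assumes, for simplicity, a buffer of MAXSIZE bytes and a guard of the form "return ERROR when length >= limit"; the function name `unsafe_lengths` and the value MAXSIZE = 8 are hypothetical. The sketch also counts the terminating NUL byte that strcpy writes, so a string of exactly MAXSIZE characters is flagged as well.

```python
from typing import List


def unsafe_lengths(buf_size: int, guard_limit: int) -> List[int]:
    """String lengths that pass the guard (length < guard_limit) but whose
    copy, including the terminating NUL byte, does not fit in buf_size bytes."""
    return [n for n in range(guard_limit) if n + 1 > buf_size]


MAXSIZE = 8  # hypothetical value chosen for illustration

# Guard at 2 * MAXSIZE: a whole range of lengths slips through and overflows.
print(unsafe_lengths(MAXSIZE, 2 * MAXSIZE))  # [8, 9, 10, 11, 12, 13, 14, 15]

# Guard equal to the buffer size: no accepted length overflows the buffer.
print(unsafe_lengths(MAXSIZE, MAXSIZE))      # []
```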
When repairing the code segment 20, "if (length >= 2 * MAXSIZE) return ERROR" may be modified to "if (length >= MAXSIZE) return ERROR". Fixing the vulnerability is therefore not necessarily the hardest part; what is critical is locating it within long program code, and the code auditing method of the embodiment of the disclosure aims to locate vulnerabilities in code segments quickly, accurately and efficiently.
Then, in operation S120, the first code segment is processed to obtain a first code attribute map corresponding to the first code segment.
Fig. 6 schematically shows a code property diagram 60 of the code fragment 20 shown in fig. 2.
In conjunction with fig. 2 and 6, the main portion of the code attribute graph 60 is shown as a solid line to show all the paths that the code segment 20 will traverse during program execution. Wherein the circled nodes represent individual operations in the code fragment 20.
In the code properties graph 60, for each operational node, an abstract syntax tree for that node may be drawn based on the syntax structure of the operations in that node, represented by the wide dashed lines in the graph.
The last layer of the abstract syntax tree of each operation node comprises data such as variables, constants or parameters processed by each operation, and the circulation and transformation process of each data in the graph can be indicated by data flow (dotted arrows in the graph).
Therefore, the code attribute diagram can be used for displaying the contents of the logic function, the data flow direction, the execution flow and the like of the code segment.
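As a minimal sketch of how such a graph might be held in memory, the three kinds of edges (abstract syntax tree, control flow, data flow) can be merged into one adjacency list in which every edge carries its kind. The function name, the edge-list input format and the toy node names below are assumptions made for illustration; they are not the representation used in the disclosure.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # (source node, destination node)


def build_code_attribute_graph(
    ast_edges: Iterable[Edge],
    cfg_edges: Iterable[Edge],
    dfg_edges: Iterable[Edge],
) -> Dict[str, List[Tuple[str, str]]]:
    """Merge AST, control-flow and data-flow edges into one labeled graph."""
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for kind, edges in (("AST", ast_edges), ("CFG", cfg_edges), ("DFG", dfg_edges)):
        for src, dst in edges:
            graph[src].append((dst, kind))
    return dict(graph)


# Toy edges loosely modeled on the if/strcpy example; node names are invented.
ast = [("if", "length >= 2 * MAXSIZE"), ("strcpy", "buf"), ("strcpy", "str")]
cfg = [("entry", "if"), ("if", "return ERROR"), ("if", "strcpy")]
dfg = [("str", "strcpy")]

cpg = build_code_attribute_graph(ast, cfg, dfg)
for node, edges in cpg.items():
    print(node, edges)
```

Keeping the edge kind on each edge lets one structure answer syntactic, control-flow and data-flow queries, which matches the role the text assigns to the code attribute graph.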
Next, in operation S130, the first code attribute map is input to a code auditing model, where the code auditing model is a machine learning model obtained by training based on N second code attribute maps corresponding to N second code segments and N third code attribute maps corresponding to N third code segments, where a second code segment is a code segment with a bug, and one third code segment is a code segment obtained by repairing a bug in one second code segment, where N third code segments correspond to N second code segments one to one, and N is an integer greater than or equal to 1.
The code auditing model may be a machine learning model based on a convolutional neural network (CNN) and a deep learning neural network, constructed with the TensorFlow framework. For example, a VGG-16 base network can be used to output the basic vector feature map corresponding to the code attribute map, and the basic vector feature map can then be analyzed and processed using a publicly available TensorFlow implementation of Faster R-CNN. The TensorFlow Object Detection API is a powerful tool that can help users quickly build and deploy an image recognition system.
For example, in one embodiment, 5000 code segments with buffer overflow vulnerabilities and the corresponding repaired code segments may be processed to obtain corresponding code attribute maps, and a code audit model is then trained using these code attribute maps, so that the code audit model learns the image differences between code attribute maps with buffer overflow vulnerabilities and code attribute maps without them, and can thereby distinguish the code attribute maps of code segments with vulnerabilities from those of code segments without.
Thereafter, in operation S140, an output of the code auditing model is obtained to obtain a detection result for auditing the first code segment.
In some embodiments, the output of the code audit model may include a first parameter R for characterizing whether the first code segment has a vulnerability. For example, R = 1 indicates that the first code segment has a vulnerability, and R = 0 indicates that the first code segment has no vulnerability.
In still other embodiments, the output of the code audit model may further include a second parameter Z for characterizing the location range of the vulnerability in the first code attribute map. The value of Z may be a coordinate range of a region, for example, the lower left corner in the first code attribute map may be defined as the origin of coordinates to determine the value of Z.
In another embodiment, the output of the code auditing model may further include a third parameter Type for characterizing the type of the vulnerability when a vulnerability exists in the first code segment.
Taking the code segment 20 as an example, the part of the corresponding code attribute graph 60 that contains the vulnerability is the graph region formed by the node where the if statement is located and the part below that node. More precisely, the abstract syntax tree of the if statement includes three levels of branches, whereas the abstract syntax tree of the if statement after the code segment 20 is patched essentially includes only two levels of branches. By learning from the code attribute graphs of code segments with vulnerabilities and of code segments without vulnerabilities, the code audit model learns to distinguish the image characteristics of a vulnerability in a code attribute graph.
In one embodiment, taking the lower left corner of the code attribute map 60 as the origin of coordinates, the code audit model may output the following after identifying the code attribute map 60: R = 1; type = buffer overflow; Z: 1588, 15883, 4235, 5540, where the range of Z determines the graph region covering the node where the if statement is located and the part below that node.
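The three outputs described above can be pictured as one small record. The following is a minimal sketch only: the class name `AuditResult`, the field names, and the reading of Z as an `(x_min, y_min, x_max, y_max)` box measured from the lower left corner are assumptions made for illustration, since the disclosure does not fix the ordering of the four Z values. The coordinates in the usage note are likewise illustrative.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed layout of Z: (x_min, y_min, x_max, y_max), origin at lower left.
Box = Tuple[float, float, float, float]


@dataclass
class AuditResult:
    """Hypothetical container for the model output (R, Type, Z)."""
    r: int                      # first parameter R: 1 = vulnerability, 0 = none
    vuln_type: Optional[str]    # third parameter, e.g. "buffer overflow"
    z: Optional[Box]            # second parameter Z: region in the attribute map


def contains(result: AuditResult, x: float, y: float) -> bool:
    """Return True if point (x, y) of the attribute map lies inside Z."""
    if result.r != 1 or result.z is None:
        return False
    x_min, y_min, x_max, y_max = result.z
    return x_min <= x <= x_max and y_min <= y <= y_max
```

For instance, `contains(AuditResult(1, "buffer overflow", (100, 200, 400, 500)), 150, 300)` checks whether a graph node drawn at (150, 300) falls inside the flagged region.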
Therefore, whether the first code segment has a vulnerability can be determined from the value of R output by the code auditing model. If a vulnerability exists, according to other embodiments of the disclosure, its location can be determined from the value of Z output by the code audit model, so that the vulnerability can be located quickly.
The embodiment of the disclosure converts the code language into the picture information better recognized by the deep learning model, wherein the semantic information of the code is retained in the picture information (namely, the code attribute map), and the combination process of the code language and the deep learning model is simplified.
The embodiments of the disclosure can process most code vulnerability characteristics by means of artificial intelligence, relying on a large number of data sources and algorithm support, and, combined with a machine learning model algorithm, can recognize software code vulnerabilities in batches, which can greatly reduce manual labor. In addition, deep learning models are by now relatively mature in the field of image detection and recognition. By converting code segments with various types of vulnerability characteristics into code attribute graphs, the process of detecting code language is replaced by a process of detecting images. Combined with an existing deep learning detection model to judge whether a program contains code segments that may cause vulnerabilities, this can reduce the detection error rate to a certain extent and improve detection precision.
According to the embodiment of the present disclosure, in operation S120, the first code segment is processed to obtain a first code attribute map corresponding to the first code segment, and specifically, the first code segment may be parsed to obtain an abstract syntax tree, a data flow graph, and a control flow graph corresponding to the first code segment, and the abstract syntax tree, the data flow graph, and the control flow graph are merged to obtain the first code attribute map. The following description will be made with reference to fig. 3 to 6, taking the code segment 20 as an example.
Fig. 3 schematically shows an abstract syntax tree 30 of the code fragment 20 shown in fig. 2.
As shown in fig. 3, the abstract syntax tree 30 is a syntax structure representing the code fragment 20 in the form of a tree, and each node on the tree represents a structure in the source code.
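To make the idea of "each node represents a structure in the source code" concrete, the sketch below parses a small snippet with Python's built-in `ast` module. This is a stand-in only: the code fragment 20 of the disclosure is not reproduced in this section, so the snippet `SOURCE` and the helper `node_kinds` are hypothetical, and the parser used by the disclosure is not specified.

```python
import ast

# Hypothetical snippet standing in for a code fragment with a length check.
SOURCE = """
def copy(data, max_size):
    buf = []
    if len(data) < max_size:
        buf.extend(data)
    return buf
"""

tree = ast.parse(SOURCE)


def node_kinds(root):
    """Return the class name of every AST node reachable from root."""
    return [type(node).__name__ for node in ast.walk(root)]


kinds = node_kinds(tree)

# Collect the if statements; their sub-trees are the branches discussed above.
if_nodes = [n for n in ast.walk(tree) if isinstance(n, ast.If)]

print(kinds)
print(ast.dump(if_nodes[0].test))
```

The `If` node carries its condition and body as child nodes, which is the kind of branch structure the model is described as learning from.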
Fig. 4 schematically shows a data flow diagram 40 of the code fragment 20 shown in fig. 2.
As shown in fig. 4, the data flow diagram 40 graphically expresses the logical functions of the system, the logical flow of data within the system, and the logical transformation process from a data transfer and processing perspective. Nodes within circles in the data flow diagram 40 represent operations on data, and the data being transformed in the flow, such as the transmitted data str, the declared variable MAXSIZE, the declared array variable buf[MAXSIZE], the parameter length, and the output data ERROR, are labeled above the arrowed lines.
Fig. 5 schematically shows a control flow diagram 50 of the code fragment 20 shown in fig. 2.
As shown in fig. 5, the control flow graph 50 represents all the paths traversed during the execution of the code segment 20, and the graph shows the possible flow directions of all the basic block executions in a process, and also reflects the real-time execution of a process. Each node in the control flow graph 50 corresponds to an operation of the code segment 20, and the connection relationship between the nodes represents an execution path of the operation.
Fig. 6 schematically shows a code property diagram 60 of the code fragment 20 shown in fig. 2.
With reference to fig. 3-6, the code property diagram 60 may be obtained by combining the abstract syntax tree 30, the data flow diagram 40, and the control flow diagram 50, according to an embodiment of the present disclosure. For example, the control flow graph 50 serves as the backbone. For each node in the control flow graph 50, the syntax structure corresponding to the operation in the node is looked up from the abstract syntax tree 30. Meanwhile, the data flows of the transmitted data str, the declared variable MAXSIZE, the declared array variable buf[MAXSIZE], the parameter length, and the output data ERROR are shown in the code attribute diagram 60 in conjunction with the data flow diagram 40. In this way, the code attribute map 60 shown in fig. 6 is formed.
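The merge described above (control flow as backbone, syntax attached per node, data flow overlaid) can be sketched with plain dictionaries. This is an illustrative sketch under stated assumptions, not the disclosure's implementation: the function name `build_code_property_graph`, the edge labels, and the node ids in the example (`decl_buf`, `if_len`, `copy`, `error`, `param_str`) are all hypothetical.

```python
from collections import defaultdict


def build_code_property_graph(cfg_edges, ast_children, dfg_edges):
    """Merge three views of a code segment into one edge-labelled graph.

    cfg_edges:    list of (src, dst) control flow edges between statements
    ast_children: dict mapping a statement id to its syntax sub-node ids
    dfg_edges:    list of (src, dst, variable) data flow edges
    Returns a dict: node -> list of (neighbour, label).
    """
    graph = defaultdict(list)
    # 1. The control flow graph is the backbone.
    for src, dst in cfg_edges:
        graph[src].append((dst, "CFG"))
        graph.setdefault(dst, [])
    # 2. Attach the syntax structure found for each control flow node.
    for stmt, children in ast_children.items():
        for child in children:
            graph[stmt].append((child, "AST"))
            graph.setdefault(child, [])
    # 3. Overlay the data flow, labelled with the variable that flows.
    for src, dst, var in dfg_edges:
        graph[src].append((dst, f"DFG:{var}"))
        graph.setdefault(dst, [])
    return dict(graph)


# Hypothetical node ids for a buffer-copy fragment.
cfg = [("decl_buf", "if_len"), ("if_len", "copy"), ("if_len", "error")]
syntax = {"if_len": ["cond", "then", "else"]}
dfg = [("decl_buf", "copy", "buf"), ("param_str", "copy", "str")]
cpg = build_code_property_graph(cfg, syntax, dfg)
```

Keeping the edge label alongside each neighbour lets a later rendering step draw the three kinds of edges differently in the resulting picture.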
FIG. 7 schematically shows a flow diagram of a code auditing method according to another embodiment of the present disclosure.
As shown in fig. 7, the code auditing method according to an embodiment of the present disclosure may include operations S110 to S140, and operation S750.
First, in operation S110, a first code segment to be detected is obtained.
Then, in operation S120, the first code segment is processed to obtain a first code attribute map corresponding to the first code segment.
Next, in operation S130, the first code attribute map is input to a code auditing model, where the code auditing model is a machine learning model obtained by training based on N second code attribute maps corresponding to N second code segments and N third code attribute maps corresponding to N third code segments, where a second code segment is a code segment with a bug, and one third code segment is a code segment obtained by repairing a bug in one second code segment, where N third code segments correspond to N second code segments one to one, and N is an integer greater than or equal to 1.
Thereafter, in operation S140, an output of the code auditing model is obtained to obtain a detection result for auditing the first code segment.
Operations S110 to S140 are the same as those described above, and are not described herein again.
Next, in operation S750, when the first parameter R represents that the first code segment has a vulnerability, child code segments having the vulnerability are extracted from the first code segment based on the value of the second parameter Z.
In one embodiment, the child code segments can be repaired after the child code segments are extracted.
In another embodiment, after the child code segments are extracted, the first code segment may be updated with the child code segments, and operations S110 to S140 are performed again to obtain a new value of the second parameter Z, so as to narrow down the location of the vulnerability. For example, in one embodiment, operations S110 to S750 may be performed in a loop, and after each round the deviation between the second parameter Z obtained in the current round and the second parameter Z obtained in the previous round is calculated. If the deviation of the second parameter Z output in two adjacent rounds does not meet a preset condition, the first code segment is updated with the child code segments extracted in the current round, and operations S110 to S750 are repeated. When the deviation of the second parameter Z output in two adjacent rounds meets the preset condition, the loop is terminated and the vulnerability is located by the second parameter Z of the last round. In this way the location of the vulnerability can be progressively refined, which improves the efficiency of repairing the vulnerability later.
The preset condition may be, for example, that the deviations calculated over a preset number of consecutive rounds are all smaller than a threshold, for example, that the deviation is smaller than 10% for 10 consecutive rounds.
FIG. 8 schematically illustrates a flow diagram of a code auditing method according to yet another embodiment of the present disclosure.
As shown in fig. 8, the code auditing method according to the embodiment of the present disclosure may include operations S810 to S880 in addition to operations S110 to S140, where operations S120 to S860 may be performed in a loop.
In operation S810, it is determined whether the first code segment has a vulnerability based on a value of the first parameter R output by the code auditing model. If no bug exists, in operation S880, it is determined that the first code segment does not have a bug. If there is a bug, operation S820 is performed.
In operation S820, when the first parameter R characterizes that the first code segment has a vulnerability, child code segments having the vulnerability are extracted from the first code segment based on the value of the second parameter Z.
In operation S830, the deviation between the second parameter Z obtained in the current round and the second parameter Z obtained in the previous round is calculated. If the current round is the first round, the second parameter Z of the previous round is taken to be the initial value of the second parameter Z. In one embodiment, the initial value of Z may be set to the full extent of the first code attribute map.
In operation S840, it is determined whether the deviation is less than a threshold (e.g., 10%). If yes, performing operation S850; if not, the first code segment is updated with the sub-code segments extracted in the current round in operation S860, and then the operation S120 is returned to enter the next round.
In operation S850, when the deviation between the second parameter Z obtained in the current round and the second parameter Z obtained in the previous round is less than the threshold, it is determined whether the number of times the deviation has been less than the threshold has reached a preset number (e.g., 10 times). If not, the first code segment is updated with the child code segments extracted in the current round in operation S860, and the process returns to operation S120 to enter the next round. If so, the loop is terminated and operation S870 is performed to output the second parameter Z obtained in the last round.
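The loop of operations S120 to S870 can be sketched as follows. Several things here are assumptions made for illustration: Z is simplified to a 0-based `(first_line, last_line)` range within the current segment rather than a graph region, the deviation is taken as the relative change in the size of that range, `audit` is a hypothetical callable standing in for the trained model, `max_rounds` is a safety guard not mentioned in the disclosure, and returning the last known range when a later round reports R = 0 is a choice of this sketch.

```python
from typing import Callable, List, Optional, Tuple

Span = Tuple[int, int]  # assumed form of Z: (first_line, last_line), 0-based


def deviation(prev: Span, cur: Span) -> float:
    """Assumed metric: relative change in the size of the flagged range."""
    prev_len = max(prev[1] - prev[0] + 1, 1)
    cur_len = cur[1] - cur[0] + 1
    return abs(prev_len - cur_len) / prev_len


def locate(lines: List[str],
           audit: Callable[[List[str]], Tuple[int, Optional[Span]]],
           threshold: float = 0.10,
           patience: int = 10,
           max_rounds: int = 1000) -> Optional[Span]:
    """Narrow down the vulnerable range; None means R was 0 from the start."""
    offset = 0                          # position of the segment in the original
    prev: Span = (0, len(lines) - 1)    # initial Z covers the whole segment
    stable = 0                          # consecutive rounds below the threshold
    result: Optional[Span] = None
    for _ in range(max_rounds):
        r, z = audit(lines)             # S130/S140: run the model
        if r != 1 or z is None:         # S810: no vulnerability reported
            return result
        start, end = z
        result = (offset + start, offset + end)
        dev = deviation(prev, z)        # S830
        stable = stable + 1 if dev < threshold else 0   # S840/S850
        if stable >= patience:
            return result               # S870: output the last Z
        lines = lines[start:end + 1]    # S820/S860: continue on the child segment
        offset += start
        prev = z
    return result


def fake_audit(lines: List[str]) -> Tuple[int, Optional[Span]]:
    """Hypothetical model stub: flags the line containing 'strcpy' plus one
    neighbouring line on each side."""
    for i, line in enumerate(lines):
        if "strcpy" in line:
            return 1, (max(0, i - 1), min(len(lines) - 1, i + 1))
    return 0, None


code = [
    "void f(char *str) {",
    "  int length = strlen(str);",
    "  char buf[MAXSIZE];",
    "  if (length < 0) {",
    "    puts(\"ERROR\");",
    "    strcpy(buf, str);",
    "    puts(buf);",
    "}",
]
located = locate(code, fake_audit, patience=3)
```

With the stub above, the first round shrinks the range from eight lines to three, and the following rounds no longer change it, so the counter of consecutive small deviations reaches `patience` and the loop stops.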
According to the embodiment of the disclosure, on the basis of training the code audit model, the code audit model is called for many times by combining with an algorithm, the range of the code to be detected is continuously reduced, and the detection precision of the specific fragile code in the code segment is improved.
In addition, the code auditing method of the embodiment of the disclosure further comprises training a code auditing model. In particular, the training method of the code audit model can refer to the related description of fig. 9 to fig. 10.
FIG. 9 schematically illustrates a flow diagram of a method of training a code audit model according to an embodiment of the present disclosure.
As shown in FIG. 9, the method of training a code audit model according to an embodiment of the present disclosure may include operations S910 to S940.
In operation S910, N second code attribute maps corresponding to N second code segments are obtained, where the second code segments are code segments with vulnerabilities, and N is an integer greater than or equal to 1.
In operation S920, N third code attribute maps corresponding to N third code segments are obtained, where the N third code segments correspond to the N second code segments one to one, and one third code segment is a code segment obtained after repairing a vulnerability in one second code segment.
In operation S930, each of the N second code attribute maps and the N third code attribute maps is labeled to obtain training sample data.
In operation S940, the code audit model is trained using the training sample data.
According to an embodiment of the present disclosure, marking each code attribute graph may include marking each code attribute graph with a first parameter R based on whether the code segment corresponding to that code attribute graph has a vulnerability. For example, R may take a value of 0 or 1, where R = 1 indicates that the code segment has a vulnerability and R = 0 indicates that it does not. For example, among the imported code segments, the code segments with vulnerabilities and the patched code segments are marked "1" and "0", respectively.
According to further embodiments of the present disclosure, marking each code attribute map may further include marking each second code attribute map with a third parameter Type based on a Type of a vulnerability in a second code segment to which the second code attribute map corresponds, and marking a second parameter Z based on a location range of the vulnerability in the second code attribute map. Accordingly, for each third code attribute map, the third parameter Type and the second parameter Z of the third code attribute map are labeled according to the label of the second code attribute map corresponding to the third code attribute map. That is, the third parameter Type and the second parameter Z of the second code attribute map corresponding to the third code attribute map are associated with each other and labeled on the third code attribute map. In this way, the code auditing model can be enabled to learn the difference between the code attribute graph with the bug and the code attribute graph after the bug is fixed through comparison.
In this way, when the code audit model is trained, it can learn not only the graph differences between code attribute graphs with vulnerabilities and code attribute graphs without vulnerabilities, but also the image characteristics of the vulnerabilities in the code attribute graphs.
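The labeling scheme can be illustrated with a small record per code attribute map. This is a sketch under assumptions: the names `LabeledMap` and `label_pair`, the map identifiers, and the four-value form of Z are hypothetical. What it takes from the description is that the vulnerable map gets R = 1, the repaired map gets R = 0, and the repaired map inherits Type and Z from its paired vulnerable map.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LabeledMap:
    """Hypothetical training label for one code attribute map."""
    map_id: str
    r: int                          # first parameter R
    vuln_type: str                  # third parameter Type
    z: Tuple[int, int, int, int]    # second parameter Z (assumed 4-value region)


def label_pair(vuln_map_id: str,
               fixed_map_id: str,
               vuln_type: str,
               z: Tuple[int, int, int, int]) -> List[LabeledMap]:
    """Label a vulnerable map and its repaired counterpart as one pair."""
    return [
        LabeledMap(vuln_map_id, 1, vuln_type, z),
        # The repaired map copies Type and Z from the paired vulnerable map,
        # so both labels point at the same region for comparison.
        LabeledMap(fixed_map_id, 0, vuln_type, z),
    ]


pair = label_pair("cpg_0001_vuln", "cpg_0001_fixed",
                  "buffer overflow", (10, 20, 30, 40))
```

Pointing both labels at the same region is what allows the model to compare how that region looks before and after the repair.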
In the training in operation S940, 80% of data in the training sample data may be used as the training set, and 20% of data may be used as the verification set.
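A minimal sketch of the 80%/20% split is given below. Shuffling before the split and the fixed `seed` are assumptions added for reproducibility; the disclosure only states the proportions. The helper name `split_samples` is hypothetical.

```python
import random


def split_samples(samples, train_fraction=0.8, seed=0):
    """Split labelled samples into a training set and a verification set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # assumption: shuffle before splitting
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# E.g. 5000 vulnerable maps plus 5000 repaired maps give 10000 samples.
train_set, verification_set = split_samples(range(10000))
```

Whether a vulnerable map and its repaired counterpart should be kept in the same subset is not stated in the disclosure; this sketch splits the samples individually.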
According to the embodiment of the disclosure, a detection ratio index can be set before the code audit model is verified by using the verification set, and the detection ratio index is used for evaluating the prediction accuracy of the code audit model.
Then, the detection ratio index is calculated while the code auditing model is verified using the verification set, and training of the code auditing model is stopped when the detection ratio index reaches the preset precision.
The detection ratio index may include at least one of a false negative ratio or a false positive ratio. The false negative ratio (missed-report ratio) characterizes the ratio of the number of code segments that actually have a vulnerability but were not detected to the total number of detected code segments. The false positive ratio (false-alarm ratio) characterizes the ratio of the number of code segments whose detection result is wrong to the total number of detected code segments.
For example:
false negative ratio (missed-report ratio) = N / (T + N);
false positive ratio (false-alarm ratio) = F / (F + T);
where:
T: number of samples whose vulnerabilities are correctly detected;
F: number of samples falsely reported as having vulnerabilities (false positives);
N: number of samples with real vulnerabilities that were not detected (false negatives).
The code attribute maps in the training set and the verification set are imported into the code auditing model. The code auditing model first learns from and identifies the code attribute maps in the training set, and is then tested and verified on the code attribute maps in the verification set. Since the code attribute maps in the verification set are marked with real data, the detection ratio index can be calculated during verification. Training of the code audit model is stopped when the false negative ratio and the false positive ratio are sufficiently small (reaching a preset accuracy, e.g., 1%).
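The two ratios and the stopping rule can be written out directly from the formulas N / (T + N) and F / (F + T). The function names, the handling of an empty denominator, and the use of "less than or equal to" for the 1% target are assumptions of this sketch.

```python
def detection_ratios(t: int, f: int, n: int):
    """Return (false_negative_ratio, false_positive_ratio).

    t: samples whose vulnerabilities are correctly detected (T)
    f: samples falsely reported as vulnerable (F)
    n: samples with real vulnerabilities that were missed (N)
    """
    # Assumption: report 0.0 when a denominator is zero (nothing to measure).
    fn_ratio = n / (t + n) if (t + n) else 0.0
    fp_ratio = f / (f + t) if (f + t) else 0.0
    return fn_ratio, fp_ratio


def should_stop(t: int, f: int, n: int, target: float = 0.01) -> bool:
    """Stop training once both ratios are at or below the preset target."""
    fn_ratio, fp_ratio = detection_ratios(t, f, n)
    return fn_ratio <= target and fp_ratio <= target
```

For example, with T = 90, F = 5 and N = 10 the false negative ratio is 10 / 100 and the false positive ratio is 5 / 95, so training would continue under a 1% target.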
According to the embodiment of the disclosure, the detection ratio index for evaluating the prediction accuracy of the code audit model can be used to train the code audit model to identify vulnerable code, so that the false positive ratio and the false negative ratio of code vulnerability detection are reduced.
FIG. 10 schematically illustrates a flow diagram of a method of training a code audit model, according to another embodiment of the present disclosure.
As shown in fig. 10, a method of training a code audit model according to an embodiment of the present disclosure may include operations S1010 to S1040, and operations S930 and S940.
In operation S1010, N second code segments are acquired.
In operation S1020, the N second code segments are repaired one by one to obtain corresponding N third code segments.
In operation S1030, the N second code segments are processed to obtain N second code attribute maps corresponding to the N second code segments one to one.
In operation S1040, the N third code segments are processed to obtain N third code attribute maps corresponding to the N third code segments one to one.
For example, a uniform interface may be called to import a code segment with a certain vulnerability and a corresponding repaired code segment in the vulnerability library, and then the code segment is converted into a code attribute graph combined by an abstract syntax tree, a control flow graph and a data flow graph according to corresponding rules.
In operation S930, each of the N second code attribute maps and the N third code attribute maps is labeled to obtain training sample data.
In operation S940, the code audit model is trained using the training sample data.
Here, operations S930 and S940 are the same as those described above, and are not described herein again.
FIG. 11 schematically shows a block diagram of a code auditing apparatus 1100 according to an embodiment of the present disclosure.
As shown in fig. 11, the code auditing apparatus 1100 may include a first obtaining module 1110, a first processing module 1120, an input module 1130, and a result obtaining module 1140 according to an embodiment of the present disclosure. According to other embodiments of the present disclosure, the code auditing apparatus 1100 may further include an extraction module 1150 and a loop module 1160. The code auditing apparatus 1100 may be used to implement the methods described with reference to fig. 1-8.
The first obtaining module 1110 is configured to obtain a first code segment to be detected.
The first processing module 1120 is configured to process the first code segment to obtain a first code attribute map corresponding to the first code segment.
The input module 1130 is configured to input the first code attribute map to a code auditing model, where the code auditing model is a machine learning model obtained by training N second code attribute maps corresponding to N second code segments and N third code attribute maps corresponding to N third code segments, where the second code segments are code segments with a bug, one third code segment is a code segment obtained by repairing a bug in one second code segment, where the N third code segments are in one-to-one correspondence with the N second code segments, and N is an integer greater than or equal to 1.
The result obtaining module 1140 is configured to obtain an output of the code auditing model to obtain a detection result for auditing the first code segment. According to the embodiment of the disclosure, the output of the code auditing model comprises a first parameter R for characterizing whether the vulnerability exists in the first code segment and a second parameter Z for characterizing the position range of the vulnerability in the first code attribute graph.
The extracting module 1150 is configured to extract, when the first parameter R represents that the first code segment has a vulnerability, child code segments having the vulnerability from the first code segment based on a value of the second parameter Z.
The loop module 1160 triggers a loop to perform operations S120 to S860 shown in fig. 8, for example, to update the first code segment with the sub-code segment, then audit the updated first code segment again, calculate a deviation between the second parameter Z obtained in the current loop and the second parameter Z obtained in the previous loop, and terminate the loop after multiple loops until the deviation between the second parameters Z output in two adjacent loops meets a preset condition.
FIG. 12 schematically illustrates a block diagram of an apparatus 1200 for training a code audit model according to an embodiment of the present disclosure.
As shown in fig. 12, the apparatus 1200 for training a code audit model according to an embodiment of the present disclosure may include a second obtaining module 1210, a marking module 1220, and a training module 1230. The apparatus 1200 may be used to implement the training method described with reference to fig. 9 and 10.
The second obtaining module 1210 is configured to obtain N second code attribute maps corresponding to N second code segments, where a second code segment is a code segment with a vulnerability, where N is an integer greater than or equal to 1; and acquiring N third code attribute graphs corresponding to the N third code segments, wherein the N third code segments correspond to the N second code segments one by one, and one third code segment is a code segment obtained after the bug in one second code segment is repaired.
According to an embodiment of the present disclosure, the second obtaining module 1210 is further configured to obtain N second code segments; repairing the N second code segments one by one to obtain N corresponding third code segments; and processing the N third code segments to obtain N third code attribute maps which are in one-to-one correspondence with the N third code segments.
The labeling module 1220 is configured to label each of the N second code attribute maps and the N third code attribute maps to obtain training sample data.
According to an embodiment of the present disclosure, the marking module 1220 may include a first marking submodule 1221, a second marking submodule 1222, and a third marking submodule 1223.
The first marking submodule 1221 is configured to mark a first parameter R for each code attribute map based on whether a code segment corresponding to each code attribute map has a bug.
The second labeling submodule 1222 is configured to label, for each second code attribute map, a third parameter Type based on the Type of the vulnerability in the second code segment corresponding to the second code attribute map, and a second parameter Z based on the location range of the vulnerability in the second code attribute map.
The third labeling sub-module 1223 is configured to label, for each third code attribute map, the third parameter Type and the second parameter Z of the third code attribute map according to the label of the second code attribute map corresponding to the third code attribute map.
The training module 1230 is configured to train the code audit model using the training sample data.
FIG. 13 schematically illustrates an exemplary architecture to which a code auditing apparatus 1300 may be applied, according to an embodiment of the present disclosure.
As shown in fig. 13, the code auditing apparatus 1300 may perform audit learning on the input code segment and obtain an output result. The output result may include whether the code segment has a bug, and if the bug exists, the output result may also include a location of the bug.
In this embodiment, the code auditing apparatus 1300 may include a code segment to code attribute map tool 1, a code attribute map to code attribute vector tool 2, and a deep learning detection model tool 3 for training detection codes. The code attribute graph conversion to code attribute vector tool 2 and the deep learning detection model tool 3 are combined to form the code auditing model of the embodiment of the disclosure.
The tool 1 for converting the code segments into the code attribute graph mainly has the functions of converting the code segments into the code attribute graph according to a certain rule, and reserving semantic information of the code segments for training the deep learning neural network.
The code attribute map conversion to code attribute vector tool 2 mainly functions to convert the code attribute map generated by the code segment conversion to code attribute map tool 1 into a vector that can be recognized by the deep learning neural network.
The deep learning detection model tool 3 recognizes and judges a large number of code vectors, and is trained on code segments with vulnerabilities and code segments without them. After training is completed, the three tools cooperate so that the code auditing apparatus 1300 can audit the source code to be detected through static feature recognition.
Fig. 14 schematically shows a structural schematic of the code conversion to code property diagram tool 1 shown in fig. 13.
As shown in fig. 14, the tool 1 for converting a code fragment into a code property graph includes an abstract syntax tree generating unit 11, a data flow graph generating unit 12, a control flow graph generating unit 13, and a code property graph generating unit 14.
The abstract syntax tree generating unit 11 is configured to analyze the source code, generate an Abstract Syntax Tree (AST), and express a syntax structure of the programming language in a tree form, where each node on the tree represents a structure in the source code.
The data flow graph generating unit 12 is used for analyzing the source code and generating a data flow graph, which graphically expresses the logical functions of the system, the logical flow direction of the data inside the system and the logical transformation process from the data transfer and processing perspective.
The control flow graph generating unit 13 is configured to analyze the source code and generate a Control Flow Graph (CFG), which represents all paths that may be traversed during the execution of a program. It graphically represents the possible flow directions of all basic block executions in a process and can also reflect the real-time execution of the process. For example, for the buffer overflow code shown in fig. 2, a control flow graph as shown in fig. 5 may be generated.
The code attribute graph generating unit 14 is configured to combine the contents of the abstract syntax tree, the data flow graph, and the control flow graph to generate a code attribute graph, and express the contents of the logic function, the data flow direction, the execution flow, and the like of the code segment.
Fig. 15 schematically shows a structural schematic of the code property graph shown in fig. 13 converted into the code property vector tool 2.
As shown in fig. 15, the tool 2 for converting the code attribute map into the code attribute vector may include a unit 21 for extracting the code attribute map and marking it with real data, and a unit 22 for generating the code attribute map into a vector.
The unit 21 for extracting the code attribute map and marking it with real data is used, when a code library is imported, to import the code segments with vulnerabilities and the repaired code segments separately and, after the code segments are converted into code attribute maps such as the one shown in fig. 6, to mark the code that specifically causes the vulnerability in each code segment with labeling parameters such as R, Z and type. Labels corresponding to the code attribute map are generated based on the marked parameters.
The code attribute map generation vector unit 22 may output a base vector feature map using a convolutional neural network VGG-16 basis network.
Fig. 16 schematically shows a structural schematic of the deep learning detection model tool 3 shown in fig. 13.
As shown in fig. 16, the deep learning detection model tool 3 includes a deep learning target detector unit 31, a deep learning target detection training set unit 32, and a deep learning target detection algorithm unit 33.
The deep learning target detector unit 31 may use the publicly available TensorFlow implementation of Faster R-CNN, which uses an artificial neural network to analyze and process complex input data. The TensorFlow Object Detection API is a tool that helps users quickly build and deploy an image recognition system; by setting up the TensorFlow Object Detection API framework on their own computer, users can build their own recognition model.
The deep learning target detection training set unit 32 may import training sample data. The training sample data is, for example, obtained by taking code segments having a specific vulnerability in the vulnerability library (for example, 5000 code segments having a buffer overflow vulnerability) and the corresponding patched segments, generating code attribute maps with the tool 1, and marking them distinctively. 80% of the training sample data is used as a training set, and 20% as a verification set.
The deep learning target detection algorithm unit 33 judges the input code segment in the prediction stage according to the TensorFlow framework of the deep learning target detector.
FIG. 17 schematically illustrates a method flow for code auditing in the architecture shown in FIG. 13.
Step S101: the imported code segments are converted into a code attribute graph through a code segment conversion tool 1 to generate a corresponding code attribute graph, and the code attribute graph is converted into a code attribute vector which can be identified by a deep learning target detector through a code attribute graph conversion vector tool 2.
Step S102: and setting detection ratio indexes for evaluating the prediction accuracy of the code audit model, such as a missing report ratio and a false report ratio.
And taking 80% of data in the training sample data as a training set and 20% of data as a verification set. And importing the code segments of the training set and the verification set into a deep learning neural network, wherein the deep learning neural network firstly learns and identifies data in the training set, and then detects and verifies the data in the verification set. The data in the verification set is marked with real data, so the index data is generated in the verification process, and when the false alarm rate and the false alarm rate are small enough, the training is stopped.
Step S103: a detection algorithm is designed to improve detection precision and to locate the specific position of the vulnerable code. A state tuple (R, Z) is set, where R represents whether the current code segment has a vulnerability (1 means a vulnerability exists, 0 means none exists) and Z represents the position of the vulnerable code within the current code segment. Initially, the R state of the whole code segment under detection is 0 and the Z state is the whole current code segment. All code segments to be detected are input into the trained model; when a part of the current code segment is detected to have a vulnerability, the R state of that part is set to 1 and the Z state is updated to the coordinates of that part in the attribute graph.
Step S104: after it is determined in step S103 that a part of the code has a vulnerability, the vulnerable code part marked by the current Z state is re-submitted to the code auditing apparatus to further search for the specific position of the vulnerable code.
Step S105: the above detection steps are repeated in a loop. In each iteration, the deviation between the current Z-state value and the previous Z-state value is calculated (that is, the part in which the two Z values differ, as a percentage of the currently detected vulnerable code segment). When the deviation is less than 10% for 10 consecutive iterations, the currently detected range cannot be narrowed further, so the loop stops and the final detection result is output.
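The loop of steps S104 and S105 can be sketched as follows, under stated assumptions: Z is a half-open interval `(start, end)`, the deviation is the size of the non-shared part of the two intervals divided by the size of the current interval, and `detect` is a hypothetical callable standing in for a pass through the code auditing apparatus. The `max_rounds` guard is an addition for safety and is not part of the described method.

```python
def z_deviation(prev, cur):
    """Share of the current range that differs from the previous range."""
    overlap = max(0, min(prev[1], cur[1]) - max(prev[0], cur[0]))
    differing = (prev[1] - prev[0] - overlap) + (cur[1] - cur[0] - overlap)
    size = cur[1] - cur[0]
    return differing / size if size else 0.0


def narrow_range(detect, z0, patience=10, threshold=0.10, max_rounds=1000):
    """Re-submit the flagged range until Z stops changing.

    detect(z) returns a narrower (start, end) range, or None when no
    vulnerability is reported inside z.
    """
    z = z0
    streak = 0  # consecutive iterations with deviation below threshold
    for _ in range(max_rounds):
        new_z = detect(z)
        if new_z is None:
            break
        streak = streak + 1 if z_deviation(z, new_z) < threshold else 0
        z = new_z
        if streak >= patience:
            break
    return z
```

With `patience=10` and `threshold=0.10` the defaults mirror the "10 consecutive times" and "less than 10%" values given in step S105.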
Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.
For example, any plurality of the first obtaining module 1110, the first processing module 1120, the input module 1130, the result obtaining module 1140, the extracting module 1150, the looping module 1160, the second obtaining module 1210, the labeling module 1220, the training module 1230, the first labeling submodule 1221, the second labeling submodule 1222, the third labeling submodule 1223, the code fragment converting to code attribute map tool 1, the code attribute map converting to code attribute vector tool 2, the deep learning detection model tool 3, the abstract syntax tree generating unit 11, the data flow map generating unit 12, the control flow map generating unit 13, the code attribute map generating unit 14, the code attribute map extracting and labeling them with the real data unit 21, the code attribute map converting to vector unit 22, the deep learning object detector unit 31, the deep learning object detection training set unit 32, and the deep learning object detection algorithm unit 33 may be combined in one module to implement, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. 
According to an embodiment of the present disclosure, at least one of the first obtaining module 1110, the first processing module 1120, the input module 1130, the result obtaining module 1140, the extracting module 1150, the looping module 1160, the second obtaining module 1210, the labeling module 1220, the training module 1230, the first labeling submodule 1221, the second labeling submodule 1222, the third labeling submodule 1223, the code fragment conversion to code attribute map tool 1, the code attribute map conversion to code attribute vector tool 2, the deep learning detection model tool 3, the generate abstract syntax tree unit 11, the generate data flow map unit 12, the generate control flow map unit 13, the generate code attribute map unit 14, the extract code attribute map and label it with the real data unit 21, the convert code attribute map to vector unit 22, the deep learning object detector unit 31, the deep learning object detection training set unit 32, and the deep learning object detection algorithm unit 33 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or in any one of or any suitable combination of software, hardware, and firmware. 
Alternatively, at least one of the first obtaining module 1110, the first processing module 1120, the input module 1130, the result obtaining module 1140, the extracting module 1150, the looping module 1160, the second obtaining module 1210, the labeling module 1220, the training module 1230, the first labeling submodule 1221, the second labeling submodule 1222, the third labeling submodule 1223, the code fragment conversion to code attribute map tool 1, the code attribute map conversion to code attribute vector tool 2, the deep learning detection model tool 3, the generating abstract syntax tree unit 11, the generating data flow map unit 12, the generating control flow map unit 13, the generating code attribute map unit 14, the extracting and labeling code attribute maps with real data unit 21, the converting code attribute maps to vector unit 22, the deep learning object detector unit 31, the deep learning object detection training set unit 32, and the deep learning object detection algorithm unit 33 may be at least partially implemented as a computer program module which, when run, may perform the corresponding functions.
FIG. 18 schematically illustrates a block diagram of an electronic device 1800 suitable for implementing code auditing according to an embodiment of the present disclosure. The electronic device 1800 shown in fig. 18 is only an example and should not bring any limitations to the functionality or scope of use of the embodiments of the present disclosure.
As shown in fig. 18, an electronic device 1800 according to an embodiment of the present disclosure includes a processor 1801, which may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 1802 or a program loaded from a storage portion 1808 into a Random Access Memory (RAM) 1803. The processor 1801 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 1801 may also include onboard memory for caching purposes. The processor 1801 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to the various embodiments of the present disclosure.
In the RAM 1803, various programs and data necessary for the operation of the electronic device 1800 are stored. The processor 1801, ROM 1802, and RAM 1803 are connected to one another by a bus 1804. The processor 1801 performs various operations of the method flows according to embodiments of the present disclosure by executing programs in the ROM 1802 and/or the RAM 1803. Note that the programs may also be stored in one or more memories other than ROM 1802 and RAM 1803. The processor 1801 may also perform various operations of method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 1800 may also include an input/output (I/O) interface 1805, the input/output (I/O) interface 1805 also being connected to the bus 1804. The electronic device 1800 can also include one or more of the following components connected to the I/O interface 1805: an input portion 1806 including a keyboard, a mouse, and the like; an output portion 1807 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 1808 including a hard disk and the like; and a communication portion 1809 including a network interface card such as a LAN card, a modem, or the like. The communication portion 1809 performs communication processing via a network such as the internet. A drive 1810 is also connected to the I/O interface 1805 as needed. A removable medium 1811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1810 as necessary, so that a computer program read out therefrom is installed in the storage portion 1808 as necessary.
According to embodiments of the present disclosure, method flows according to various embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication portion 1809, and/or installed from the removable media 1811. The computer program, when executed by the processor 1801, performs the above-described functions defined in the system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with various embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include ROM 1802 and/or RAM 1803 and/or one or more memories other than ROM 1802 and RAM 1803 described above.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or associations of features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or associations are not expressly recited in the present disclosure. In particular, various combinations and/or associations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (16)

1. A code auditing method, comprising:
acquiring a first code segment to be detected;
processing the first code segment to obtain a first code attribute graph corresponding to the first code segment;
inputting the first code attribute graph into a code auditing model, wherein the code auditing model is a machine learning model obtained by training based on N second code attribute graphs corresponding to N second code segments and N third code attribute graphs corresponding to N third code segments, the second code segments are code segments with vulnerabilities, one third code segment is a code segment obtained by repairing a vulnerability in one second code segment, the N third code segments correspond to the N second code segments one by one, and N is an integer greater than or equal to 1; and
obtaining the output of the code auditing model so as to obtain the detection result of auditing the first code segment.
2. The method of claim 1, wherein the output of the code auditing model comprises:
a first parameter R for characterizing whether the first code segment has a vulnerability; and
a second parameter Z for characterizing the position range of the vulnerability in the first code attribute graph.
3. The method of claim 2, wherein the method further comprises:
when the first parameter R represents that the first code segment has a vulnerability, extracting sub-code segments with vulnerabilities from the first code segment based on the value of the second parameter Z.
4. The method of claim 3, wherein the method further comprises performing the following operations in a loop, the loop being terminated when the deviation of the second parameter Z output by two adjacent cycles satisfies a preset condition:
updating the first code segment with the sub-code segments and performing the processing, inputting, obtaining and extracting operations as described above; and
calculating the deviation between the second parameter Z obtained in the current cycle and the second parameter Z obtained in the previous cycle.
5. The method of claim 4, wherein the deviation of the second parameter Z output by two adjacent cycles satisfying a preset condition comprises:
the values of the deviation obtained in a preset number of consecutive calculations are all smaller than a threshold value.
6. The method according to any one of claims 1 to 5, wherein the processing the first code segment to obtain a first code attribute map corresponding to the first code segment comprises:
analyzing the first code segment to obtain an abstract syntax tree, a data flow graph and a control flow graph corresponding to the first code segment; and
combining the abstract syntax tree, the data flow graph and the control flow graph to obtain the first code attribute graph.
7. The method of any of claims 1-5, wherein the method further comprises training the code audit model, comprising:
acquiring N second code attribute graphs;
acquiring N third code attribute graphs;
marking each code attribute graph in the N second code attribute graphs and the N third code attribute graphs to obtain training sample data; and
training the code auditing model by using the training sample data.
8. A training method of a code audit model comprises the following steps:
acquiring N second code attribute graphs corresponding to N second code segments, wherein the second code segments are code segments with vulnerabilities, and N is an integer greater than or equal to 1;
acquiring N third code attribute graphs corresponding to N third code segments, wherein the N third code segments correspond to the N second code segments one by one, and one third code segment is a code segment obtained after repairing a vulnerability in one second code segment;
marking each code attribute graph in the N second code attribute graphs and the N third code attribute graphs to obtain training sample data; and
training the code auditing model by using the training sample data.
9. The training method of claim 8, wherein said labeling each of said N second code attribute maps and said N third code attribute maps comprises:
marking a first parameter R for each code attribute graph based on whether the code segment corresponding to each code attribute graph has a bug or not;
marking a third parameter Type for each second code attribute graph based on the Type of the vulnerability in the second code segment corresponding to the second code attribute graph, and marking a second parameter Z based on the position range of the vulnerability in the second code attribute graph; and
for each of the third code attribute maps, marking the third parameter Type and the second parameter Z of the third code attribute map according to a mark of the second code attribute map corresponding to the third code attribute map.
10. The training method of claim 8, wherein said training the code audit model with the training sample data comprises:
calculating a detection ratio index when the code auditing model is verified by utilizing partial data in the training sample data, wherein the detection ratio index is used for evaluating the prediction accuracy of the code auditing model; and
stopping training the code auditing model when the detection ratio index reaches a preset precision.
11. The training method of claim 10, wherein the training method further comprises setting the detection ratio indicator, wherein the detection ratio indicator comprises at least one of a false negative ratio or a false positive ratio.
12. The training method of claim 8, wherein said obtaining N of said second code attribute maps and N of said third code attribute maps comprises:
acquiring N second code segments;
repairing the N second code segments one by one to obtain N corresponding third code segments;
processing the N second code segments to obtain N second code attribute graphs corresponding to the N second code segments one by one; and
processing the N third code segments to obtain N third code attribute graphs which correspond to the N third code segments one by one.
13. A code auditing apparatus, comprising:
the first acquisition module is used for acquiring a first code segment to be detected;
the first processing module is used for processing the first code segment to obtain a first code attribute graph corresponding to the first code segment;
an input module, configured to input the first code attribute map into a code auditing model, where the code auditing model is a machine learning model obtained by training based on N second code attribute maps corresponding to N second code segments and N third code attribute maps corresponding to N third code segments, where the second code segments are code segments with a vulnerability, and one third code segment is a code segment obtained by repairing a vulnerability in one second code segment, where the N third code segments correspond to the N second code segments one to one, and N is an integer greater than or equal to 1; and
the result obtaining module is used for obtaining the output of the code auditing model so as to obtain the detection result of auditing the first code segment.
14. A training apparatus for a code audit model, comprising:
a second obtaining module, configured to obtain N second code attribute maps corresponding to N second code segments, where the second code segments are code segments with vulnerabilities, and N is an integer greater than or equal to 1, and to obtain N third code attribute maps corresponding to N third code segments, where the N third code segments correspond to the N second code segments one to one, and one third code segment is a code segment obtained after repairing a vulnerability in one second code segment;
the marking module is used for marking each code attribute graph in the N second code attribute graphs and the N third code attribute graphs to obtain training sample data; and
the training module is used for training the code auditing model by utilizing the training sample data.
15. An electronic device, comprising:
one or more memories having stored thereon computer-executable instructions;
one or more processors that execute the instructions to implement:
the method according to any one of claims 1 to 7; or
the method according to any one of claims 8 to 12.
16. A computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform:
the method according to any one of claims 1 to 7; or
the method according to any one of claims 8 to 12.
CN202010734321.XA 2020-07-27 2020-07-27 Code auditing method and device, electronic equipment and medium Active CN111832028B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010734321.XA CN111832028B (en) 2020-07-27 2020-07-27 Code auditing method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010734321.XA CN111832028B (en) 2020-07-27 2020-07-27 Code auditing method and device, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN111832028A true CN111832028A (en) 2020-10-27
CN111832028B CN111832028B (en) 2024-08-02

Family

ID=72925045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010734321.XA Active CN111832028B (en) 2020-07-27 2020-07-27 Code auditing method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN111832028B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113946340A (en) * 2021-09-30 2022-01-18 北京五八信息技术有限公司 Code processing method and device, electronic equipment and storage medium
CN114443476A (en) * 2022-01-11 2022-05-06 阿里云计算有限公司 Code review method and device
CN114547085A (en) * 2022-03-22 2022-05-27 中国铁塔股份有限公司 Data processing method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110011986A (en) * 2019-03-20 2019-07-12 中山大学 A kind of source code leak detection method based on deep learning
CN110245496A (en) * 2019-05-27 2019-09-17 华中科技大学 A kind of source code leak detection method and detector and its training method and system
CN111259394A (en) * 2020-01-15 2020-06-09 中山大学 Fine-grained source code vulnerability detection method based on graph neural network

Also Published As

Publication number Publication date
CN111832028B (en) 2024-08-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant