CN115544522A - Source code vulnerability detection method and device, electronic equipment and storage medium - Google Patents

Source code vulnerability detection method and device, electronic equipment and storage medium

Info

Publication number
CN115544522A
CN115544522A
Authority
CN
China
Prior art keywords
semantic
source code
vectors
vector
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211497278.5A
Other languages
Chinese (zh)
Other versions
CN115544522B (en)
Inventor
时忆杰
涂腾飞
秦素娟
金正平
温巧燕
史武俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202211497278.5A priority Critical patent/CN115544522B/en
Publication of CN115544522A publication Critical patent/CN115544522A/en
Application granted granted Critical
Publication of CN115544522B publication Critical patent/CN115544522B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577Assessing vulnerabilities and evaluating computer system security
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033Test or assess software

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides a source code vulnerability detection method and device, an electronic device, and a storage medium. Semantic information is extracted from each source code segment in a source code file to obtain a plurality of first semantic vectors; the first semantic vectors are updated based on a preset first metric matrix to obtain second semantic vectors; semantic features of the second semantic vectors are extracted; and finally the semantic features are classified. This realizes source code vulnerability detection with high accuracy, high robustness, and low computational cost.

Description

Source code vulnerability detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of vulnerability detection technologies, and in particular, to a method and an apparatus for detecting a source code vulnerability, an electronic device, and a storage medium.
Background
Automated source code vulnerability detection is realized with machine learning, deep learning, or similar techniques, and does not depend on manual auditing. In real-world code, vulnerable code blocks are few and their positions within a source code file are random, which hinders automated source code vulnerability detection.
Conventional automated methods, such as vulnerability detection based on CNN (Convolutional Neural Network), on Bi-LSTM (Bidirectional Long Short-Term Memory) networks, or on graph neural networks, achieve automatic source code vulnerability detection but exhibit poor robustness during detection and low accuracy in the results.
Disclosure of Invention
In view of the above, an object of the present application is to provide a method and an apparatus for detecting a source code vulnerability, an electronic device and a storage medium.
Based on the above purpose, the present application provides a source code vulnerability detection method, which includes:
acquiring a source code file; the source code file comprises a plurality of source code segments;
extracting semantic information of each source code segment in the source code file to obtain a plurality of first semantic vectors;
updating the first semantic vector based on a preset first metric matrix to obtain a second semantic vector;
extracting semantic features of the second semantic vector;
classifying the semantic features to obtain semantic features with vulnerabilities and semantic features without vulnerabilities, and determining the source code segments corresponding to the semantic features with vulnerabilities as the source code segments with vulnerabilities.
Optionally, the extracting semantic information of each source code segment in the source code file to obtain a plurality of first semantic vectors includes:
converting the source code file into an abstract syntax tree; wherein each node in the abstract syntax tree corresponds to one of the source code segments in the source code file;
serializing the abstract syntax tree to obtain a serialized abstract syntax tree;
and extracting semantic information of each node in the serialized abstract syntax tree to obtain a plurality of first semantic vectors.
Optionally, the extracting semantic information of each node in the serialized abstract syntax tree to obtain a plurality of first semantic vectors includes:
selecting semantic information of any one node as central semantic information;
selecting quantity based on preset context to obtain a plurality of context semantic information of the central semantic information;
and converting the context semantic information into vector representation to obtain a plurality of first semantic vectors.
Optionally, the updating the first semantic vector based on a preset first metric matrix to obtain a second semantic vector includes:
classifying the plurality of first semantic vectors to obtain a first set and a second set; wherein the first set comprises pairs of first semantic vectors that have the same attribute but are not equal, and the second set comprises pairs of first semantic vectors that have different attributes and are not equal;
initializing the preset first metric matrix to obtain a second metric matrix;
pairing unequal vectors in the first set pairwise based on the second metric matrix, and reducing the distance between the paired vectors to obtain a third metric matrix;
pairing unequal vectors in the second set pairwise based on the third metric matrix, and expanding the distance between the paired vectors to obtain a fourth metric matrix;
and obtaining the second semantic vector according to the fourth metric matrix and the first semantic vector.
Optionally, the reducing the distance between the paired vectors to obtain a third metric matrix includes:
and reducing a first distance between every two paired vectors while keeping the first distance greater than a first threshold, so as to obtain the third metric matrix.
Optionally, the expanding the distance between the paired vectors to obtain a fourth metric matrix includes:
and expanding a second distance between every two paired vectors while keeping the second distance no greater than a second threshold, so as to obtain the fourth metric matrix.
Optionally, the extracting semantic features of the second semantic vector includes:
taking the first semantic vector as a time sequence;
based on the time sequence, obtaining, according to the second semantic vector, first relevant features between the second semantic vector at the current moment and the second semantic vector at the previous moment;
based on the time sequence, according to the second semantic vector, obtaining second relevant features of the second semantic vector at the current moment and the second semantic vector at the next moment;
and combining the first relevant features and the second relevant features to obtain semantic features of the second semantic vector.
Based on the above object, the present application further provides a source code vulnerability detection apparatus, including:
an acquisition module configured to acquire a source code file; the source code file comprises a plurality of source code segments;
the conversion module is configured to extract semantic information of each source code segment in the source code file to obtain a plurality of first semantic vectors;
the updating module is configured to update the first semantic vector based on a preset first metric matrix to obtain a second semantic vector;
an extraction module configured to extract semantic features of the second semantic vector;
and the classification module is configured to classify the semantic features to obtain semantic features with vulnerabilities and semantic features without vulnerabilities, and determine the source code segments corresponding to the semantic features with vulnerabilities as the source code segments with vulnerabilities.
In view of the foregoing, the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method according to any of the above embodiments is implemented.
In view of the above, the present application also provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method according to any of the above embodiments.
From the foregoing, it can be seen that in the source code vulnerability detection method, apparatus, electronic device, and storage medium provided by the present application, a plurality of first semantic vectors are obtained by extracting semantic information of each source code segment in the source code file, the first semantic vectors are updated based on the preset first metric matrix to obtain second semantic vectors, semantic features of the second semantic vectors are extracted, and the semantic features are finally classified, realizing source code vulnerability detection with high accuracy, high robustness, and low computational cost.
Drawings
To describe the technical solutions of the present application or the related art more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below depict only embodiments of the present application; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 shows a flowchart of an exemplary method for detecting a source code vulnerability according to an embodiment of the present application.
Fig. 2 shows a schematic diagram of an exemplary source code vulnerability detection method according to an embodiment of the present application.
Fig. 3 shows a flowchart of an exemplary source code vulnerability detection method according to an embodiment of the present application.
Fig. 4 is a schematic diagram illustrating an exemplary method for detecting a source code vulnerability according to an embodiment of the present application.
Fig. 5 is a schematic diagram illustrating an exemplary method for detecting a source code vulnerability according to an embodiment of the present application.
Fig. 6 is a schematic diagram illustrating an exemplary source code vulnerability detection apparatus according to an embodiment of the present application.
FIG. 7 shows a schematic diagram of an exemplary electronic device according to an embodiment of the application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings in combination with specific embodiments.
It should be noted that technical terms or scientific terms used in the embodiments of the present application should have a general meaning as understood by those having ordinary skill in the art to which the present application belongs, unless otherwise defined. The use of "first," "second," and similar terms in the embodiments of the present application is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item preceding the word comprises the element or item listed after the word and its equivalent, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used only to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
As described in the Background, CNN-based vulnerability detection treats word-embedded vectors as a time series and extracts their deep features with a one-dimensional convolutional neural network. Bi-LSTM-based vulnerability detection treats the sample as a time series with context correlation and extracts the sample's context information with a Bi-LSTM. Graph-neural-network-based vulnerability detection treats each element of the source code file as a node in a graph and extracts context information from the relationships between a node and its neighboring nodes. These existing methods extract semantic information among the elements of a source code file, but they do not exploit the class-contrast information of the samples, which degrades neural network performance and prediction accuracy. Moreover, the methods for extracting context information have grown increasingly complex, and vanishing and exploding gradients in the deep feature extractor slow convergence and reduce robustness.
In view of this, embodiments of the present application provide a source code vulnerability detection method, apparatus, electronic device, and storage medium. A source code file is processed with an abstract syntax tree tool and converted into a text file carrying context information; the processed text file is converted into vectors using the skip-gram (continuous skip-gram) algorithm of the Word2Vec (Word to Vector) model; a metric learning algorithm increases the discrimination between vulnerable and non-vulnerable samples; a Bi-Phased LSTM (bidirectional Phased Long Short-Term Memory) network model extracts high-level features from the vectors; and a random forest classifier classifies the samples. This realizes source code vulnerability detection with high accuracy, high robustness, and low computational cost.
Fig. 1 shows a flowchart of an exemplary source code vulnerability detection method according to an embodiment of the present application.
In step S101, a source code file is acquired; wherein, the source code file comprises a plurality of source code segments.
In step S103, semantic information of each source code segment in the source code file is extracted to obtain a plurality of first semantic vectors.
In some embodiments, the extracting semantic information of each source code segment in the source code file to obtain a plurality of first semantic vectors may further include: converting the source code file into an abstract syntax tree; wherein each node in the abstract syntax tree corresponds to one of the source code segments in the source code file; serializing the abstract syntax tree to obtain a serialized abstract syntax tree; and extracting semantic information of each node in the serialized abstract syntax tree to obtain a plurality of first semantic vectors.
Specifically, the source code file is preprocessed by using an abstract syntax tree extraction tool to obtain a text file with an abstract syntax tree. The semantic properties of the programming language are represented by its tree structure, and each node in the abstract syntax tree corresponds to a source code fragment in the source code file.
It will be appreciated that the abstract syntax tree is an abstract representation of the source code's syntactic structure that efficiently converts source code files into a tree structure carrying semantic information. The abstract syntax tree does not reproduce every detail of the concrete syntax. For example, nested parentheses are implicit in the structure of the tree and do not appear as nodes, while a conditional statement such as if-condition-then can be represented by a node with three branches. A useful property of the abstract syntax tree is that it filters out redundant information while extracting the semantic information of the source code.
In some embodiments, the document produced by abstract syntax tree processing is a tree structure containing the semantic information required for vulnerability detection, but a complex tree structure makes it difficult for a neural network to extract high-level features. Therefore, in this embodiment, the tree structure extracted from the abstract syntax tree is traversed in pre-order (root first) and serialized to obtain a serialized abstract syntax tree; the original information is retained while the structure becomes sequential, which facilitates processing by the neural network.
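As an illustration of the conversion and serialization steps, the following Python sketch parses a source fragment with the standard `ast` module and serializes the tree by a pre-order (root-first) traversal. The patent does not prescribe a particular extraction tool, so the function name and output format here are illustrative only.

```python
import ast

def serialize_ast(source):
    """Pre-order (root-first) traversal of the AST, emitting node type names.

    Comments, whitespace, and blank lines never appear in the output:
    the parser discards them, so only semantically relevant elements survive.
    """
    tree = ast.parse(source)
    tokens = []

    def visit(node):
        tokens.append(type(node).__name__)  # e.g. 'If', 'While', 'FunctionDef'
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return tokens

code = """
def check(x):
    # a comment that will not survive parsing
    if x > 0:
        return x
"""
print(serialize_ast(code))
```

The serialized token list preserves the tree's semantic structure (an `If` node still precedes its condition and body) while being a flat sequence a neural network can consume.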
In some embodiments, the extracting semantic information of each node in the serialized abstract syntax tree to obtain a plurality of first semantic vectors may further include: selecting semantic information of any one node as central semantic information; selecting quantity based on preset context to obtain a plurality of context semantic information of the central semantic information; and converting the context semantic information into vector representation to obtain a plurality of first semantic vectors.
Specifically, semantic information such as if-condition-then, while-condition, function and the like is extracted from each node in the serialized abstract syntax tree by using a skip-gram algorithm in the Word2Vec model, and is converted into a vector form to obtain a plurality of first semantic vectors.
In this embodiment, code files from 30 GitHub repositories with more than 500 stars may be collected as a corpus and converted into serialized abstract syntax tree texts, which serve as training data for the skip-gram algorithm. The skip-gram algorithm takes the semantic information of a given node as the center semantic information, sets the context selection number to 10 so that the semantic information of the 10 nodes before and after the center serves as the context semantic information, and then converts the context semantic information into vector representations. Finally, a number of first semantic vectors of length 128 are generated, each representing a node in the serialized abstract syntax tree.
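The context-window construction described above can be sketched as follows. The window of 10 matches the embodiment; the pair generation shown here is the input to skip-gram training, while the embedding training itself (e.g. via a Word2Vec library) is omitted, and the function name is an assumption for illustration.

```python
def skipgram_pairs(tokens, window=10):
    """For each token (the center semantic information), emit (center, context)
    pairs drawn from up to `window` tokens on each side of it."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

# Toy serialized-AST node sequence with a small window for readability.
nodes = ["FunctionDef", "If", "Compare", "Return"]
print(skipgram_pairs(nodes, window=1))
```

Training a skip-gram model on such pairs yields one length-128 vector per node type, which is how each node of the serialized abstract syntax tree becomes a first semantic vector.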
In step S105, the first semantic vector is updated based on a preset first metric matrix to obtain a second semantic vector. The first semantic vectors obtained by the skip-gram conversion serve as the input of the metric learning algorithm, which iteratively generates a metric that uses the samples' labels to project the sample vectors into a high-dimensional space.
In some embodiments, the updating the first semantic vector based on the preset first metric matrix to obtain a second semantic vector may further include: classifying the plurality of first semantic vectors to obtain a first set and a second set; the first set comprises a plurality of first semantic vectors with the same and unequal attributes, and the second set comprises a plurality of first semantic vectors with different and unequal attributes; initializing the preset first metric matrix to obtain a second metric matrix; pairing unequal vectors in the first set pairwise based on the second measurement matrix, and reducing the distance between the paired vectors to obtain a third measurement matrix; pairwise matching unequal vectors in the second set based on the third measurement matrix, and expanding the distance between pairwise matched vectors to obtain a fourth measurement matrix; and obtaining the second semantic vector according to the fourth measurement matrix and the first semantic vector.
Specifically, in some embodiments, classifying the first semantic vectors may include placing pairs of samples with the same attribute (e.g., the same class) that are not equal into a first set, and pairs of samples with different attributes (e.g., different classes) that are not equal into a second set. The first metric matrix A, randomly generated by a program before training, is initialized by adding a Gaussian perturbation, yielding a second metric matrix A'.
Further, each vector in the first set is traversed using the second metric matrix A'; unequal vectors are paired pairwise and mapped by A'. Whenever the first distance between a mapped pair satisfies the condition of being greater than the first threshold 0, that distance is reduced, and the metric matrix with the minimum difference between the second metric matrix A' and the first metric matrix A is assigned, yielding a third metric matrix A''.
It should be noted that the distance between homogeneous samples serves as the optimization target to be reduced. Because the way this distance is computed changes as the sample vectors change, a metric matrix learned on the training data can map test sample vectors with a similar probability distribution into a similar high-dimensional space, improving the feature extraction precision of the Bi-Phased LSTM.
It will be appreciated that a constraint requires the mapped distance between vectors being drawn closer to remain greater than 0. This condition prevents the generation of large numbers of overlapping vectors and thus keeps large amounts of repeated data from overfitting the model.
Further, each vector in the second set is traversed using the third metric matrix A''; unequal vectors are paired pairwise and mapped by A''. Whenever the second distance between a mapped pair satisfies the condition of being no greater than the second threshold 1, that distance is expanded, and the metric matrix with the minimum difference between the third metric matrix A'' and the second metric matrix A' is assigned, yielding a fourth metric matrix A'''. After mapping by this metric, distances between same-class samples are shortened and distances between different-class samples are increased.
It should be noted that the distance between samples of different classes is continually enlarged to make them more distinguishable. However, if this distance grew without bound, the model would converge slowly and predict poorly, so the distance between sample vectors is constrained to be no greater than 1.
Further, the fourth metric matrix is multiplied by the first semantic vector to obtain a second semantic vector.
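A minimal numpy sketch of this final mapping step follows. The learned fourth metric matrix is stood in for by a random matrix, since the iterative learning itself is described above; the distance-under-metric helper shows what the pairwise optimization in the preceding steps measures.

```python
import numpy as np

def mapped_distance(A, x, y):
    """Distance between two first semantic vectors after projection by metric matrix A.
    This is the quantity reduced for same-class pairs and expanded for
    different-class pairs during metric learning."""
    return float(np.linalg.norm(A @ (x - y)))

rng = np.random.default_rng(0)
d = 128                                          # vector length from the skip-gram step
A_final = rng.standard_normal((d, d)) / np.sqrt(d)  # stand-in for the learned A'''
v1 = rng.standard_normal(d)                      # a first semantic vector
v2 = A_final @ v1                                # the second semantic vector fed to the LSTM
```

The second semantic vector lives in the same 128-dimensional space but is re-oriented so that class membership is more linearly evident to the downstream feature extractor.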
In step S107, semantic features of the second semantic vector are extracted.
In some embodiments, a Bi-Phased LSTM (bidirectional Phased Long Short-Term Memory) network may be used to extract the high-level features of the second semantic vector. The product of the fourth metric matrix A''' and the first semantic vector serves as the input to the Bi-Phased LSTM network.
In some embodiments, the first semantic vector before metric mapping is regarded as a time sequence t, which embodies the original characteristics of the sample, and the second semantic vector obtained after mapping is regarded as a new sample x_t. The sequence t and x_t serve, respectively, as the input and hidden state of the Bi-Phased LSTM network and are passed through 16 Phased LSTM cells in forward and reverse order along t, yielding the first relevant features between the second semantic vector at the current moment and at the previous moment, and the second relevant features between the second semantic vector at the current moment and at the next moment. The two outputs are then concatenated to obtain the semantic features of the second semantic vector.
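The bidirectional extraction can be sketched in numpy as below. A plain LSTM cell stands in for the Phased LSTM (which additionally carries a periodic time gate); 16 hidden units per direction follow the embodiment, and all weights are random stand-ins rather than trained parameters.

```python
import numpy as np

def lstm_pass(X, Wx, Wh, b):
    """Run one LSTM direction over sequence X of shape (T, d_in);
    return the final hidden state."""
    d_h = Wh.shape[1]
    h, c = np.zeros(d_h), np.zeros(d_h)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for x in X:
        # Input, forget, output, and candidate gates packed in one matrix product.
        i, f, o, g = np.split(Wx @ x + Wh @ h + b, 4)
        c = sig(f) * c + sig(i) * np.tanh(g)
        h = sig(o) * np.tanh(c)
    return h

def bilstm_features(X, fwd, bwd):
    """Concatenate the forward pass (current vs. previous moment) with the
    backward pass over the reversed sequence (current vs. next moment)."""
    return np.concatenate([lstm_pass(X, *fwd), lstm_pass(X[::-1], *bwd)])

rng = np.random.default_rng(0)
d_in, d_h, T = 128, 16, 128
make_params = lambda: (rng.standard_normal((4 * d_h, d_in)) * 0.1,
                       rng.standard_normal((4 * d_h, d_h)) * 0.1,
                       np.zeros(4 * d_h))
X = rng.standard_normal((T, d_in))   # sequence of second semantic vectors
feats = bilstm_features(X, make_params(), make_params())
```

Concatenating the two directions gives a 32-dimensional feature vector per sample, combining past-looking and future-looking context.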
In step S109, the semantic features are classified to obtain semantic features with vulnerabilities and semantic features without vulnerabilities, and the source code segments corresponding to the semantic features with vulnerabilities are determined to be source code segments with vulnerabilities.
In some embodiments, a random forest classifier may be used to classify the semantic features, enabling automated source code vulnerability detection. To exploit the features in different source code files, the random forest classifier classifies the high-level features of the second semantic vector and produces a classification result for the corresponding semantic vector. To maximize classification precision, the random forest hyperparameters are selected by cross-validation: the maximum depth of each sub-tree is set to 40, the minimum number of samples per leaf node to 3, each internal node requires at least 4 samples to split, and the whole random forest contains 8000 sub-trees.
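A scaled-down scikit-learn sketch of that classifier configuration follows. The embodiment's 8000 trees are reduced to 50 to keep the demo fast, the training data is synthetic, and the label rule is invented purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 32))        # stand-in for Bi-Phased LSTM semantic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic vulnerable(1) / clean(0) labels

# Hyperparameters from the embodiment (selected there by cross-validation):
# max_depth=40, min_samples_leaf=3, min_samples_split=4; n_estimators
# is 8000 in the patent but scaled down here.
clf = RandomForestClassifier(n_estimators=50, max_depth=40,
                             min_samples_leaf=3, min_samples_split=4,
                             random_state=0)
clf.fit(X, y)
pred = clf.predict(X)
```

Samples predicted as class 1 correspond to semantic features with vulnerabilities, and their source code segments would be flagged as vulnerable.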
The present application will be described below with reference to specific examples.
In industry, vulnerable code blocks account for a small fraction of the total code in a source code file and differ little from non-vulnerable code blocks implementing the same functionality. Because source code files are large and the cell state of a Bi-LSTM is updated at every time step, key features are lost during Bi-LSTM training, so such samples often go undetected by traditional Bi-LSTM-based automated source code vulnerability detection frameworks.
In some embodiments, as shown in fig. 2, 30 code repositories with 500 or more stars are collected from GitHub, the source code files of these repositories serve as the data set, and the repositories are classified by programming language (as shown in fig. 2, into C++ source code files, Python source code files, Java source code files, and the like). Source code files of the same programming language are first preprocessed with the abstract syntax tree extraction tool for that language to obtain a tree-shaped abstract syntax structure. The tree structure is traversed in pre-order, serialized, and stored as a text file. Owing to the properties of the abstract syntax tree, the semantic features of the source code file are extracted during extraction and serialization, invalid information is filtered out, and elements irrelevant to semantics, such as comment symbols, spaces, and blank lines, are removed.
In some embodiments, the elements in the serialized abstract syntax tree are converted into vector representations using the skip-gram algorithm. Since skip-gram is an unsupervised training method, most of the source code data collected from GitHub is retained in this embodiment; files without semantic information, such as initialization files with no actual functionality or files containing only the names of other source files, are deleted. After 200 iterations, the elements in the serialized abstract syntax tree are transformed into vector representations of length 128. Because each source code file contains a different number of elements, the element count of every file is unified to 128 so that one source code file can be treated as one sample: if a file contains more than 128 elements, the excess elements are deleted; if it contains fewer, zero-valued vectors are appended and the result is regularized.
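The length-unification step might look like the following sketch (the function name is illustrative; the normalization applied to padded samples is a plain L2 normalization, one reasonable reading of "regularized"):

```python
import numpy as np

def unify_length(sample, target=128):
    """Truncate a (n_elements, 128) sample to `target` rows, or zero-pad
    shorter samples and L2-normalize the padded result, so every source
    code file becomes one fixed-size (128, 128) sample."""
    v = np.asarray(sample, dtype=float)[:target]   # drop excess elements
    if len(v) < target:
        v = np.vstack([v, np.zeros((target - len(v), v.shape[1]))])
        n = np.linalg.norm(v)
        if n > 0:
            v = v / n                              # regularize the padded sample
    return v
```

Fixing the sample shape this way lets every file, regardless of size, flow through the same metric learning and LSTM pipeline.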
In some embodiments, the accuracy of vulnerability detection may be improved by mapping the sample vectors into a new high-dimensional space using a metric learning algorithm built on the idea of contrastive learning. The specific training process is shown in fig. 3:
in step S301: and respectively putting the same-class samples and the heterogeneous samples into two sets and initializing a measurement matrix. The first metric matrix a may be randomly generated by a program. In the present embodiment, the first metric matrix a of size 128 × 128 is randomly generated by a program.
In some embodiments, two sets are initialized, the first set C1 and the second set C2 both starting as empty sets. A two-level loop traverses the sample vectors, placing pairs of samples that belong to the same class but are not identical into the first set C1, and pairs of samples that belong to different classes and are not identical into the second set C2. The first metric matrix A is initialized from a normal distribution with expected value 0 and variance 1.
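The two-level loop of step S301 might look like the following sketch (the function name and the representation of pairs as index tuples are assumptions):

```python
def build_pair_sets(samples, labels):
    """Two-level loop over the sample vectors: same-class, non-identical
    pairs go into C1; different-class pairs go into C2."""
    c1, c2 = [], []
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            if labels[i] == labels[j]:
                if samples[i] != samples[j]:  # same class, not identical
                    c1.append((i, j))
            else:
                c2.append((i, j))             # different classes
    return c1, c2
```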
In step S303: Gaussian perturbation (i.e., a probability-distribution offset) is added to the metric matrix to obtain a new metric matrix, recorded as the second metric matrix A'.
In step S305: the set of same-class samples is traversed. The first set of same-class samples is traversed using the second metric matrix A'; the unequal elements are paired pairwise, and whenever the distance between a pair of unequal elements after being mapped by the second metric matrix A' is greater than 0, that distance is reduced.
In some embodiments, after every sample in the first set C1 has been traversed, the metric matrix that minimizes the difference between the second metric matrix A' and the first metric matrix A is assigned to the first metric matrix, yielding a third metric matrix A''.
In step S307: the set of different-class samples is traversed. The second set of different-class samples is traversed using the third metric matrix A''; the unequal elements are paired pairwise, and whenever the distance between a pair of unequal elements after being mapped by the third metric matrix A'' is less than or equal to 1, that distance is expanded.
In some embodiments, after every sample in the second set C2 has been traversed, the metric matrix that minimizes the difference between the third metric matrix A'' and the second metric matrix A' is assigned to the second metric matrix A', yielding a fourth metric matrix A'''.
In step S309: the metric matrix is updated repeatedly until the error falls below a threshold, that is, until the difference between the third metric matrix A'' and the fourth metric matrix A''' drops to the threshold. In some embodiments, the threshold may be set to 10⁻³.
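One round of the perturb-and-adjust loop (steps S303–S307) could be sketched as below. This is a loose interpretation under stated assumptions: the distance is taken as a Mahalanobis-style norm ‖A(x−y)‖, the shrink/expand updates use the gradient of the squared distance with respect to A, and the learning rate, noise scale, and function names are all invented for illustration (the embodiment does not specify them). DIM is reduced from 128 to 4 for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4  # 128 in the embodiment; small here for illustration

def mahalanobis(a, x, y):
    """Distance between x and y after mapping by metric matrix a."""
    return float(np.linalg.norm(a @ (x - y)))

def update_metric(a, c1, c2, samples, lr=0.1):
    """Add Gaussian perturbation to a, then pull same-class pairs
    together (distance > 0) and push different-class pairs apart
    (distance <= 1), following steps S303-S307."""
    a2 = a + rng.normal(0.0, 1.0, a.shape) * 0.01  # Gaussian perturbation
    for i, j in c1:                                # same class: shrink
        diff = samples[i] - samples[j]
        if mahalanobis(a2, samples[i], samples[j]) > 0:
            # gradient of ||A d||^2 w.r.t. A is 2 (A d) d^T; descend
            a2 -= lr * np.outer(a2 @ diff, diff)
    for i, j in c2:                                # different class: expand
        diff = samples[i] - samples[j]
        if mahalanobis(a2, samples[i], samples[j]) <= 1:
            a2 += lr * np.outer(a2 @ diff, diff)
    return a2
```

In the embodiment this round repeats, keeping the best matrix each time, until the error drops below 10⁻³.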
In some embodiments, a fourth metric matrix of size 128 × 128 is generated iteratively; multiplying it with any length-128 sample vector yields a new length-128 sample vector.
In some embodiments, after the serialized abstract syntax tree text file has been converted into sample vectors, the sample vectors need to be projected into a new feature space for ease of classification. Since the sample vector before the word embedding operation retains the information of the original sample, it is used as the timestamp flag t of the Bi-Phased LSTM, and the sample vector after word embedding is used as the input sample x_t of the Bi-Phased LSTM. Using the metric learning algorithm, the input sample x_t is mapped to the new feature space by the metric learning matrix.
In some embodiments, to make the space corresponding to the metric learning matrix linearly separable for the classifier, the error is computed by iterating and projecting over time. In each iteration, the product of the iterated metric matrix and the sample vector is used as the input of the Bi-Phased LSTM; the other parameters are fixed, the loss is differentiated with respect to the metric matrix, and the derivative vector is projected onto the intersection of the first set C1 and the second set C2. The resulting projection vector is multiplied by a hyperparameter of 0.2 and added to the mapping metric matrix, and the sum is assigned back to the mapping metric matrix to obtain a new metric matrix.
In some embodiments, Phased LSTM units are used in place of conventional LSTM units, which improves the robustness of vulnerability detection. To address the vanishing-gradient problem during LSTM training, a time gate is added before the cell state and the hidden state, so that gradients update the network parameters only while the time gate is open. As shown in fig. 4, the input sample x_t is passed to the Phased LSTM cell, transformed by an activation function, and multiplied with the input sample processed by the input gate i_t; the product is denoted c_t. To filter out the largest and smallest gradient values and thereby control gradient explosion or vanishing, the input processed by the forget gate f_t is multiplied with c_t. c_t is then processed by a time gate k_t controlled by t to obtain the final cell state c_t. To obtain the hidden state, c_t is activated by an activation function, multiplied with the input processed by the output gate o_t, and processed by a second time gate k_t, yielding the hidden state h_t.
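The time gate k_t of a Phased LSTM follows the published formulation (Neil et al., 2016), which the embodiment appears to rely on: the gate is fully open only during a fraction r_on of each oscillation period τ, and otherwise passes a small leak α so gradients can still flow. The parameter values below are illustrative assumptions, not values from the patent:

```python
def time_gate(t, tau, shift, r_on, alpha=1e-3):
    """Phased LSTM time gate k_t: a ramp 0->1->0 during the open
    fraction r_on of each period tau, and a small leak alpha otherwise."""
    phase = ((t - shift) % tau) / tau   # position within the current period
    if phase < r_on / 2:                # first half of the open window: ramp up
        return 2.0 * phase / r_on
    if phase < r_on:                    # second half: ramp down
        return 2.0 - 2.0 * phase / r_on
    return alpha * phase                # closed: leak lets gradients pass
```

During the closed phase the cell and hidden states are mostly held, which is why the gradient only updates the network while the gate is open, as the embodiment states.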
The input samples x_t of the same class (e.g., all vulnerable or all non-vulnerable) are drawn closer together after being mapped by the metric learning matrix, while input samples x_t of different classes (e.g., vulnerable and non-vulnerable) are pushed farther apart. x_t is merged with t and fed into the Bi-Phased LSTM to extract high-level features. To obtain information about the elements before and after a given element simultaneously, two Phased LSTMs with the same structure are used. The forward Phased LSTM, taking x_t and t as input, propagates in sequence order and computes the correlation between each element and the elements after it, as indicated by the left-to-right arrows in fig. 5; the backward Phased LSTM propagates in reverse order and computes the correlation of each element with the elements before it. Finally, the results of the forward and backward Phased LSTMs (Phased LSTM 1 and Phased LSTM 2 in the figure) are combined as the final output vector. Although the two Phased LSTMs share the same structure, their parameters differ after iteration, so their output vectors differ. On top of the Bi-Phased LSTM, a max pooling layer is added; only after processing by an activation function and a linear layer are the final high-level features obtained.
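The bidirectional combination plus max pooling can be sketched in miniature, with each Phased LSTM abstracted as an opaque recurrence step (the scalar step functions and tuple outputs are simplifications for illustration, not the actual cell):

```python
def bi_directional_features(seq, step_fwd, step_bwd):
    """Run two identically-structured recurrences over the sequence in
    opposite directions and pair their outputs element by element."""
    h, fwd = 0.0, []
    for x in seq:                 # left to right: correlation with later elements
        h = step_fwd(h, x)
        fwd.append(h)
    h, bwd = 0.0, []
    for x in reversed(seq):       # right to left: correlation with earlier elements
        h = step_bwd(h, x)
        bwd.append(h)
    bwd.reverse()                 # realign with the original element order
    return list(zip(fwd, bwd))    # combined output vector per element

def max_pool(features):
    """Max pooling over the sequence dimension, per channel."""
    return tuple(max(channel) for channel in zip(*features))
```

The two step functions share a structure but, as the embodiment notes, would hold different parameters after training; the pooled result would then pass through an activation function and a linear layer.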
The embodiments of the application can also be applied to source code vulnerability detection in the smart contract field, which likewise falls within the scope of protection of the application. Applying the source code vulnerability detection method to the smart contract field has the advantages of the corresponding method embodiments, which are not repeated here.
It should be noted that the method of the embodiment of the present application may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the multiple devices may only perform one or more steps of the method of the embodiment, and the multiple devices interact with each other to complete the method.
It should be noted that the above describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same technical concept, the application also provides a source code vulnerability detection device corresponding to the method of any embodiment.
Referring to fig. 6, the source code vulnerability detection apparatus includes:
an obtaining module 601 configured to obtain a source code file; wherein, the source code file comprises a plurality of source code segments.
A conversion module 602, configured to extract semantic information of each source code segment in the source code file, to obtain several first semantic vectors.
In some embodiments, the converting module 602 is further configured to convert the source code file into an abstract syntax tree, where each node in the abstract syntax tree corresponds to one source code segment in the source code file, serialize the abstract syntax tree to obtain a serialized abstract syntax tree, and extract semantic information of each node in the serialized abstract syntax tree to obtain a plurality of the first semantic vectors.
In some embodiments, the conversion module 602 is further configured to select semantic information of any one of the nodes as central semantic information, obtain a plurality of context semantic information of the central semantic information based on a preset context selection number, and convert the plurality of context semantic information into vector representation to obtain a plurality of first semantic vectors.
The updating module 603 is configured to update the first semantic vector based on a preset first metric matrix to obtain a second semantic vector.
In some embodiments, the updating module 603 is further configured to classify the plurality of first semantic vectors into a first set and a second set, where the first set includes first semantic vectors that have the same attribute but are not equal and the second set includes first semantic vectors that have different attributes and are not equal; initialize the preset first metric matrix to obtain a second metric matrix; pair the unequal vectors in the first set pairwise based on the second metric matrix and reduce the distance between the paired vectors to obtain a third metric matrix; pair the unequal vectors in the second set pairwise based on the third metric matrix and expand the distance between the paired vectors to obtain a fourth metric matrix; and obtain the second semantic vector from the fourth metric matrix and the first semantic vector.
In some embodiments, the updating module 603 is further configured to narrow a first distance between pairwise paired vectors and make the first distance greater than a first threshold to obtain the third metric matrix.
In some embodiments, the updating module 603 is further configured to expand a second distance between pairwise paired vectors and make the second distance not greater than a second threshold to obtain the fourth metric matrix.
An extraction module 604 configured to extract semantic features of the second semantic vector.
In some embodiments, the extracting module 604 is further configured to take the first semantic vector as a time sequence; obtain, based on the time sequence, a first relevant feature between the second semantic vector at the current moment and the second semantic vector at the previous moment; obtain, based on the time sequence, a second relevant feature between the second semantic vector at the current moment and the second semantic vector at the next moment; and merge the first relevant feature and the second relevant feature to obtain the semantic features of the second semantic vector.
The classification module 605 is configured to classify the semantic features to obtain semantic features with vulnerabilities and semantic features without vulnerabilities, and determine the source code segments corresponding to the semantic features with vulnerabilities as source code segments with vulnerabilities.
For convenience of description, the above apparatus is described as being divided into various modules by function, each described separately. Of course, when implementing the present application, the functionality of the various modules may be realized in one or more pieces of software and/or hardware.
The apparatus of the foregoing embodiment is used to implement the corresponding source code vulnerability detection method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same technical concept, corresponding to the method of any embodiment, the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor executes the program to implement the method for detecting a source code vulnerability according to any embodiment.
Fig. 7 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static Memory device, a dynamic Memory device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solutions provided by the embodiments of the present specification are implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called by the processor 1010 for execution.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the above embodiment is used to implement the corresponding source code vulnerability detection method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same technical concept, corresponding to any of the above embodiments, the present application also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the source code vulnerability detection method according to any of the above embodiments.
The computer-readable media of the present embodiments include permanent and non-permanent, removable and non-removable media implemented in any method or technology for storage of information. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
The computer instructions stored in the storage medium of the foregoing embodiment are used to enable the computer to execute the source code vulnerability detection method according to any of the foregoing embodiments, and have the beneficial effects of the corresponding method embodiments, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the context of the present application, technical features in the above embodiments or in different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the application. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the application are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that the embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures, such as Dynamic RAM (DRAM), may use the discussed embodiments.
The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalents, improvements, and the like that may be made without departing from the spirit or scope of the embodiments of the present application are intended to be included within the scope of the claims.

Claims (10)

1. A source code vulnerability detection method, comprising:
acquiring a source code file; the source code file comprises a plurality of source code segments;
extracting semantic information of each source code segment in the source code file to obtain a plurality of first semantic vectors;
updating the first semantic vector based on a preset first metric matrix to obtain a second semantic vector;
extracting semantic features of the second semantic vector;
classifying the semantic features to obtain semantic features with vulnerabilities and semantic features without vulnerabilities, and determining the source code segments corresponding to the semantic features with vulnerabilities as the source code segments with vulnerabilities.
2. The method according to claim 1, wherein said extracting semantic information of each of the source code segments in the source code file to obtain a plurality of first semantic vectors comprises:
converting the source code file into an abstract syntax tree; wherein each node in the abstract syntax tree corresponds to one of the source code segments in the source code file;
serializing the abstract syntax tree to obtain a serialized abstract syntax tree;
and extracting semantic information of each node in the serialized abstract syntax tree to obtain a plurality of first semantic vectors.
3. The method according to claim 2, wherein said extracting semantic information of each of said nodes in said serialized abstract syntax tree to obtain a plurality of said first semantic vectors comprises:
selecting semantic information of any node as central semantic information;
selecting quantity based on preset context to obtain a plurality of context semantic information of the central semantic information;
and converting the plurality of context semantic information into vector representation to obtain a plurality of first semantic vectors.
4. The method according to claim 1, wherein the updating the first semantic vector based on a preset first metric matrix to obtain a second semantic vector comprises:
classifying the plurality of first semantic vectors to obtain a first set and a second set; the first set comprises a plurality of first semantic vectors with the same and unequal attributes, and the second set comprises a plurality of first semantic vectors with different and unequal attributes;
initializing the preset first metric matrix to obtain a second metric matrix;
pairing unequal vectors in the first set pairwise based on the second metric matrix, and reducing the distance between the paired vectors to obtain a third metric matrix;
pairing unequal vectors in the second set pairwise based on the third metric matrix, and expanding the distance between the paired vectors to obtain a fourth metric matrix;
and obtaining the second semantic vector according to the fourth metric matrix and the first semantic vector.
5. The method of claim 4, wherein narrowing the distance between the pairwise paired vectors to obtain a third metric matrix comprises:
and narrowing a first distance between every two paired vectors, and making the first distance greater than a first threshold, so as to obtain the third metric matrix.
6. The method of claim 4, wherein expanding the distance between pairwise paired vectors to obtain a fourth metric matrix comprises:
and expanding a second distance between every two paired vectors, and making the second distance not greater than a second threshold, so as to obtain the fourth metric matrix.
7. The method of claim 1, wherein the extracting semantic features of the second semantic vector comprises:
taking the first semantic vector as a time sequence;
based on the time sequence, obtaining a second semantic vector at the current moment and a first relevant feature of the second semantic vector at the last moment according to the second semantic vector;
based on the time sequence, obtaining second relevant features of the second semantic vector at the current moment and the second semantic vector at the next moment according to the second semantic vector;
and combining the first relevant feature and the second relevant feature to obtain the semantic feature of the second semantic vector.
8. A source code vulnerability detection apparatus, comprising:
an acquisition module configured to acquire a source code file; the source code file comprises a plurality of source code segments;
the conversion module is configured to extract semantic information of each source code segment in the source code file to obtain a plurality of first semantic vectors;
the updating module is configured to update the first semantic vector based on a preset first metric matrix to obtain a second semantic vector;
an extraction module configured to extract semantic features of the second semantic vector;
and the classification module is configured to classify the semantic features to obtain semantic features with vulnerabilities and semantic features without vulnerabilities, and determine the source code segments corresponding to the semantic features with vulnerabilities as the source code segments with vulnerabilities.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the program.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
CN202211497278.5A 2022-11-28 2022-11-28 Source code vulnerability detection method and device, electronic equipment and storage medium Active CN115544522B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211497278.5A CN115544522B (en) 2022-11-28 2022-11-28 Source code vulnerability detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211497278.5A CN115544522B (en) 2022-11-28 2022-11-28 Source code vulnerability detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115544522A true CN115544522A (en) 2022-12-30
CN115544522B CN115544522B (en) 2023-04-07

Family

ID=84722554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211497278.5A Active CN115544522B (en) 2022-11-28 2022-11-28 Source code vulnerability detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115544522B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635569A (en) * 2018-12-10 2019-04-16 国家电网有限公司信息通信分公司 A kind of leak detection method and device
GB201917161D0 (en) * 2019-08-23 2020-01-08 Praetorian System and method for automatically detecting a security vulnerability in a source code using a machine learning model
CN111611586A (en) * 2019-02-25 2020-09-01 上海信息安全工程技术研究中心 Software vulnerability detection method and device based on graph convolution network
CN115146267A (en) * 2022-06-22 2022-10-04 北京天融信网络安全技术有限公司 Method and device for detecting macro viruses in Office document, electronic equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635569A (en) * 2018-12-10 2019-04-16 国家电网有限公司信息通信分公司 A kind of leak detection method and device
CN111611586A (en) * 2019-02-25 2020-09-01 上海信息安全工程技术研究中心 Software vulnerability detection method and device based on graph convolution network
GB201917161D0 (en) * 2019-08-23 2020-01-08 Praetorian System and method for automatically detecting a security vulnerability in a source code using a machine learning model
CN115146267A (en) * 2022-06-22 2022-10-04 北京天融信网络安全技术有限公司 Method and device for detecting macro viruses in Office document, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN115544522B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
US20180336453A1 (en) Domain specific language for generation of recurrent neural network architectures
US9928040B2 (en) Source code generation, completion, checking, correction
CN111414987B (en) Training method and training device of neural network and electronic equipment
JP2019207685A (en) Method, device and system for estimating causal relation between observation variables
US11740879B2 (en) Creating user interface using machine learning
CN108664512B (en) Text object classification method and device
US11573771B2 (en) Predicting code editor
CN114897173B (en) Method and device for determining PageRank based on variable component sub-line
CN107391682B (en) Knowledge verification method, knowledge verification apparatus, and storage medium
CN112989363B (en) Vulnerability positioning method and device, electronic equipment and storage medium
CN114327483B (en) Graph tensor neural network model building method and source code semantic recognition method
CN112182217A (en) Method, device, equipment and storage medium for identifying multi-label text categories
CN102999318A (en) Programming assisting method and device
CN113535912B (en) Text association method and related equipment based on graph rolling network and attention mechanism
CN108681490B (en) Vector processing method, device and equipment for RPC information
CN115544522B (en) Source code vulnerability detection method and device, electronic equipment and storage medium
CN116644180A (en) Training method and training system for text matching model and text label determining method
CN114880457A (en) Training method of process recommendation model, process recommendation method and electronic equipment
CN114897183A (en) Problem data processing method, and deep learning model training method and device
CN114358011A (en) Named entity extraction method and device and electronic equipment
CN111552477A (en) Data processing method and device
CN110457455A (en) A kind of three-valued logic question and answer consulting optimization method, system, medium and equipment
CN116991459B (en) Software multi-defect information prediction method and system
CN117909505B (en) Event argument extraction method and related equipment
US20230008628A1 (en) Determining data suitability for training machine learning models

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant