CN115544522B

CN115544522B - Source code vulnerability detection method and device, electronic equipment and storage medium

Info

Publication number: CN115544522B
Application number: CN202211497278.5A
Authority: CN
Inventors: 时忆杰; 涂腾飞; 秦素娟; 金正平; 温巧燕; 史武俊
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2022-11-28
Filing date: 2022-11-28
Publication date: 2023-04-07
Anticipated expiration: 2042-11-28
Also published as: CN115544522A

Abstract

The application provides a source code vulnerability detection method, a source code vulnerability detection device, electronic equipment and a storage medium, wherein a plurality of first semantic vectors are obtained by extracting semantic information of each source code segment in a source code file, the first semantic vectors are updated based on a preset first metric matrix, a second semantic vector is obtained, semantic features of the second semantic vector are extracted, and finally the semantic features are classified to realize the source code vulnerability detection method with high accuracy, high robustness and low calculation amount.

Description

Source code vulnerability detection method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of vulnerability detection technologies, and in particular, to a method and an apparatus for detecting a source code vulnerability, an electronic device, and a storage medium.

Background

Automated source code vulnerability detection is implemented with machine learning, deep learning, or similar techniques and does not rely on manual effort. In the real world, code blocks that contain vulnerabilities are few, and their positions within a source code file are random, which hinders automated source code vulnerability detection.

Conventional automated source code vulnerability detection methods, such as detection based on CNNs (Convolutional Neural Networks), detection based on Bi-LSTM (Bidirectional Long Short-Term Memory) networks, and detection based on graph neural networks, achieve automated detection, but they have poor robustness during detection and their results have low accuracy.

Disclosure of Invention

In view of this, an object of the present application is to provide a method, an apparatus, an electronic device, and a storage medium for detecting a source code vulnerability.

Based on the above purpose, the present application provides a source code vulnerability detection method, which includes:

acquiring a source code file; the source code file comprises a plurality of source code segments;

extracting semantic information of each source code fragment in the source code file to obtain a plurality of first semantic vectors;

updating the first semantic vector based on a preset first metric matrix to obtain a second semantic vector;

extracting semantic features of the second semantic vector;

classifying the semantic features to obtain semantic features with vulnerabilities and semantic features without vulnerabilities, and determining the source code segments corresponding to the semantic features with vulnerabilities as the source code segments with vulnerabilities.

Optionally, the extracting semantic information of each source code segment in the source code file to obtain a plurality of first semantic vectors includes:

converting the source code file into an abstract syntax tree; wherein each node in the abstract syntax tree corresponds to one of the source code segments in the source code file;

serializing the abstract syntax tree to obtain a serialized abstract syntax tree;

and extracting semantic information of each node in the serialized abstract syntax tree to obtain a plurality of first semantic vectors.

Optionally, the extracting semantic information of each node in the serialized abstract syntax tree to obtain a plurality of first semantic vectors includes:

selecting semantic information of any one node as central semantic information;

obtaining a plurality of pieces of context semantic information of the central semantic information based on a preset context selection number;

and converting the plurality of context semantic information into vector representation to obtain a plurality of first semantic vectors.

Optionally, the updating the first semantic vector based on a preset first metric matrix to obtain a second semantic vector includes:

classifying the plurality of first semantic vectors to obtain a first set and a second set; the first set comprises a plurality of first semantic vectors that have the same attribute and are unequal to one another, and the second set comprises a plurality of first semantic vectors that have different attributes and are unequal to one another;

initializing the preset first metric matrix to obtain a second metric matrix;

pairing the unequal vectors in the first set two by two based on the second metric matrix, and reducing the distance between the paired vectors to obtain a third metric matrix;

pairing the unequal vectors in the second set two by two based on the third metric matrix, and expanding the distance between the paired vectors to obtain a fourth metric matrix;

and obtaining the second semantic vector according to the fourth metric matrix and the first semantic vector.

Optionally, the reducing a distance between the paired vectors to obtain a third metric matrix includes:

and narrowing a first distance between every two paired vectors while keeping the first distance larger than a first threshold, so as to obtain the third metric matrix.

Optionally, the expanding the distance between the pairwise paired vectors to obtain a fourth metric matrix includes:

and expanding a second distance between every two paired vectors while keeping the second distance not larger than a second threshold, so as to obtain the fourth metric matrix.

Optionally, the extracting semantic features of the second semantic vector includes:

taking the first semantic vector as a time sequence;

based on the time sequence, obtaining, according to the second semantic vector, a first relevant feature of the second semantic vector at the current moment and the second semantic vector at the previous moment;

based on the time sequence, obtaining second relevant features of the second semantic vector at the current moment and the second semantic vector at the next moment according to the second semantic vector;

and combining the first relevant feature and the second relevant feature to obtain the semantic feature of the second semantic vector.

Based on the above object, the present application further provides a source code vulnerability detection apparatus, including:

an acquisition module configured to acquire a source code file; the source code file comprises a plurality of source code segments;

the conversion module is configured to extract semantic information of each source code fragment in the source code file to obtain a plurality of first semantic vectors;

the updating module is configured to update the first semantic vector based on a preset first metric matrix to obtain a second semantic vector;

an extraction module configured to extract semantic features of the second semantic vector;

and the classification module is configured to classify the semantic features to obtain semantic features with vulnerabilities and semantic features without vulnerabilities, and determine the source code segments corresponding to the semantic features with vulnerabilities as the source code segments with vulnerabilities.

In view of the above, the present application also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and running on the processor, and when the processor executes the computer program, the method according to any of the above embodiments is implemented.

In view of the above, the present application also provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method according to any one of the above embodiments.

From the foregoing, it can be seen that according to the source code vulnerability detection method, device, electronic device and storage medium provided by the application, by extracting semantic information of each source code segment in the source code file, a plurality of first semantic vectors are obtained, the first semantic vectors are updated based on a preset first metric matrix, second semantic vectors are obtained, semantic features of the second semantic vectors are extracted, and finally the semantic features are classified to realize the source code vulnerability detection method with high accuracy and robustness and low computation amount.

Drawings

In order to more clearly illustrate the technical solutions in the present application or the related art, the drawings needed to be used in the description of the embodiments or the related art will be briefly introduced below, and it is obvious that the drawings in the following description are only embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.

Fig. 1 shows a flowchart of an exemplary method for detecting a source code vulnerability according to an embodiment of the present application.

Fig. 2 is a schematic diagram illustrating an exemplary method for detecting a source code vulnerability according to an embodiment of the present application.

Fig. 3 is a flowchart illustrating an exemplary method for detecting a source code vulnerability according to an embodiment of the present application.

Fig. 4 is a schematic diagram illustrating an exemplary method for detecting a source code vulnerability according to an embodiment of the present application.

Fig. 5 is a schematic diagram illustrating an exemplary method for detecting a source code vulnerability according to an embodiment of the present application.

Fig. 6 is a schematic diagram illustrating an exemplary source code vulnerability detection apparatus according to an embodiment of the present application.

Fig. 7 shows a schematic diagram of an exemplary electronic device according to an embodiment of the application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to specific embodiments and the accompanying drawings.

It should be noted that technical terms or scientific terms used in the embodiments of the present application should have a general meaning as understood by those having ordinary skill in the art to which the present application belongs, unless otherwise defined. The use of "first," "second," and similar terms in the embodiments of the present application do not denote any order, quantity, or importance, but rather the terms are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item preceding the word comprises the element or item listed after the word and its equivalent, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.

As described in the Background, CNN-based vulnerability detection treats the word-embedded vectors as a time series and extracts their deep features with a one-dimensional convolutional neural network. Bi-LSTM-based vulnerability detection treats a sample as a time series with contextual correlation and extracts the sample's context information with a Bi-LSTM. Vulnerability detection based on graph neural networks treats each element of the source code file as a node in a graph and extracts context information from the relationships between an element and its surrounding nodes. Existing vulnerability detection methods thus try to extract semantic information among the elements of a source code file but do not use the class-contrast information of the samples, which degrades the performance of the neural network and lowers prediction accuracy. Moreover, as the methods for extracting context information grow more complex, the vanishing-gradient and exploding-gradient problems of deep feature extractors slow convergence and reduce robustness.

In view of this, embodiments of the present application provide a source code vulnerability detection method and apparatus, an electronic device, and a storage medium. A source code file is processed with an abstract syntax tree tool and converted into a text file carrying context information; the processed text file is converted into vectors using the skip-gram algorithm of the Word2Vec (Word to Vector) model; a metric learning algorithm increases the discriminability between vulnerable and vulnerability-free samples; a Bi-Phased LSTM (Bi-Phased Long Short-Term Memory) network model extracts high-level features from the vectors; and a random forest classifier classifies the samples, thereby achieving source code vulnerability detection with high accuracy, high robustness, and a low computation amount.

In step S101, a source code file is acquired; wherein, the source code file comprises a plurality of source code segments.

In step S103, semantic information of each source code segment in the source code file is extracted to obtain a plurality of first semantic vectors.

In some embodiments, the extracting semantic information of each source code segment in the source code file to obtain a plurality of first semantic vectors may further include: converting the source code file into an abstract syntax tree; wherein each node in the abstract syntax tree corresponds to one of the source code segments in the source code file; serializing the abstract syntax tree to obtain a serialized abstract syntax tree; and extracting semantic information of each node in the serialized abstract syntax tree to obtain a plurality of first semantic vectors.

Specifically, the source code file is preprocessed by using an abstract syntax tree extraction tool to obtain a text file with an abstract syntax tree. The semantic characteristics of the programming language are expressed by utilizing the tree structure, and each node in the abstract syntax tree corresponds to one source code segment in the source code file.

It will be appreciated that the abstract syntax tree is an abstract representation of the syntactic structure of the source code and can efficiently convert source code files into a tree structure carrying semantic information. The abstract syntax tree does not represent every detail of the real syntax. For example, nested brackets are implicit in the structure of the tree and do not appear as nodes, whereas a conditional statement such as if-condition-then can be represented by a node with three branches. Because of this property, the abstract syntax tree filters out some redundant information while extracting the semantic information of the source code.

In some embodiments, the file processed by the abstract syntax tree tool has a tree structure that contains the semantic information required for vulnerability detection, but a complex tree structure makes it hard for a neural network to extract high-level features. Therefore, in this embodiment, the tree structure extracted by the abstract syntax tree tool is traversed in pre-order (root first) and serialized to obtain a serialized abstract syntax tree, which retains the original information while giving it a sequential structure that a neural network can process more easily.
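The conversion and serialization described above can be illustrated with a minimal sketch. It assumes Python source code and Python's standard `ast` module as the abstract syntax tree extraction tool; the application does not name a specific tool, and the function name `serialize_ast` is hypothetical.

```python
import ast
from typing import List


def serialize_ast(source: str) -> List[str]:
    """Return the AST node type names of `source` in pre-order (root first)."""
    tokens: List[str] = []

    def visit(node: ast.AST) -> None:
        tokens.append(type(node).__name__)      # visit the root before its children
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens


# The comment in the snippet never reaches the tree, so it is filtered out.
snippet = "if x > 0:\n    y = 1  # comment\n"
tokens = serialize_ast(snippet)
print(" ".join(tokens))
```

In this sketch each token is only the node type; a fuller serialization might also keep identifiers or literal values, which is a design choice the application leaves open.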

In some embodiments, the extracting semantic information of each node in the serialized abstract syntax tree to obtain a plurality of first semantic vectors may further include: selecting semantic information of any node as central semantic information; obtaining a plurality of pieces of context semantic information of the central semantic information based on a preset context selection number; and converting the context semantic information into vector representation to obtain a plurality of first semantic vectors.

Specifically, semantic information such as if-condition-then, while-condition, function and the like is extracted from each node in the serialized abstract syntax tree by using a skip-gram algorithm in the Word2Vec model, and is converted into a vector form to obtain a plurality of first semantic vectors.

In this embodiment, the code files of 30 GitHub repositories with more than 500 stars may be collected as a corpus and converted into serialized abstract syntax tree text, which serves as the training data for the skip-gram algorithm that performs the vector conversion. The skip-gram algorithm takes the semantic information of a given node as the central semantic information, sets the context selection number to 10, takes the semantic information of the 10 nodes before and after that node as the context semantic information, and converts it into a vector representation. Finally, a number of first semantic vectors of length 128 are generated, each representing a node in the serialized abstract syntax tree.
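As an illustration of how the central and context semantic information are paired, the following sketch enumerates the (center, context) training pairs that a skip-gram model would consume. It covers only the pairing step, not the training of the embedding itself; the function name `skipgram_pairs` and the token list are hypothetical.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 10) -> List[Tuple[str, str]]:
    """Pair every token with up to `window` tokens before and after it."""
    pairs: List[Tuple[str, str]] = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:                      # the center is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["Module", "If", "Compare", "Name", "Assign"]
pairs = skipgram_pairs(tokens, window=1)    # small window to keep the output short
print(pairs)
```

With the context selection number of 10 used in this embodiment, each node would be paired with up to 20 neighbors in the serialized tree.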

In step S105, the first semantic vector is updated based on a preset first metric matrix to obtain a second semantic vector. The first semantic vectors produced by the skip-gram algorithm are used as the input of the metric learning algorithm. Using the sample labels, the metric learning algorithm iteratively generates a metric that projects the sample vectors into a high-dimensional space.

In some embodiments, the updating the first semantic vector based on a preset first metric matrix to obtain a second semantic vector may further include: classifying the plurality of first semantic vectors to obtain a first set and a second set; the first set comprises a plurality of first semantic vectors that have the same attribute and are unequal to one another, and the second set comprises a plurality of first semantic vectors that have different attributes and are unequal to one another; initializing the preset first metric matrix to obtain a second metric matrix; pairing the unequal vectors in the first set two by two based on the second metric matrix, and reducing the distance between the paired vectors to obtain a third metric matrix; pairing the unequal vectors in the second set two by two based on the third metric matrix, and enlarging the distance between the paired vectors to obtain a fourth metric matrix; and obtaining the second semantic vector according to the fourth metric matrix and the first semantic vector.

Specifically, in some embodiments, classifying the first semantic vectors may include taking unequal samples with the same attribute (e.g., the same class) as the first set and unequal samples with different attributes (e.g., different classes) as the second set. Before training, a first metric matrix A is randomly generated by a program, and Gaussian perturbation is added to it for initialization to obtain the second metric matrix A'.

Further, each vector in the first set is traversed using the second metric matrix A': unequal vectors are paired two by two and mapped by A', and when the first distance between a pair is greater than the first threshold 0, that distance is reduced. The metric matrix having the smallest difference between the second metric matrix A' and the first metric matrix A is then assigned, yielding the third metric matrix A''.

It should be noted that the distance metric is used as the optimization target to reduce the distance between same-class samples. The way this distance is calculated changes as the sample vectors change. Therefore, the distance metric matrix computed on the training data can map test sample vectors to a similar high-dimensional space when the test data has a similar probability distribution, which improves the feature extraction precision of the Bi-Phased LSTM.

It will be appreciated that a constraint applies: the distance between vectors that are pulled closer must remain greater than 0 after mapping. This condition avoids generating a large number of overlapping vectors, and thus prevents large amounts of duplicated data from causing the model to overfit.

Further, each vector in the second set is traversed using the third metric matrix A'': unequal vectors are paired two by two and mapped by A'', and when the second distance between a pair is not greater than the second threshold 1, that distance is enlarged. The metric matrix having the smallest difference between the third metric matrix A'' and the second metric matrix A' is then assigned to the second metric matrix A', yielding the fourth metric matrix A'''. After mapping by this metric, the distance between samples of the same class is shortened and the distance between samples of different classes is increased.

It should be noted that, in order to make the samples of different types more distinguishable, the distance between the samples of different types is continuously enlarged. However, if the distance is infinitely increased, the convergence speed of the model is slow, the prediction effect is poor, and therefore the distance between sample vectors is restricted to be less than or equal to 1.

Further, the fourth metric matrix is multiplied by the first semantic vector to obtain a second semantic vector.

In step S107, semantic features of the second semantic vector are extracted.

In some embodiments, a Bi-Phased LSTM network (bidirectional phased long short-term memory network) may be used to extract the high-level features in the second semantic vector. The product of the fourth metric matrix A''' and the first semantic vector is used as the input to the Bi-Phased LSTM network.

In some embodiments, the first semantic vector before mapping by the metric matrix is regarded as a time sequence t, which reflects the original characteristics of the sample, and the second semantic vector obtained after mapping is regarded as a new sample x_t. t and x_t are used as the input and the hidden state of the Bi-Phased LSTM network, respectively, and are passed through 16 Phased LSTM cells in forward order and in reverse order along the time sequence t, yielding, respectively, the first relevant feature of the second semantic vector at the current moment and the second semantic vector at the previous moment, and the second relevant feature of the second semantic vector at the current moment and the second semantic vector at the next moment. The two outputs are concatenated to obtain the semantic features of the second semantic vector.

In step S109, the semantic features are classified to obtain semantic features with vulnerabilities and semantic features without vulnerabilities, and the source code segments corresponding to the semantic features with vulnerabilities are determined as source code segments with vulnerabilities.

In some embodiments, a random forest classifier may be used to classify the semantic features, enabling automated source code vulnerability detection. To make use of the features in different source code files, the random forest classifier classifies the high-level features of the second semantic vector and generates a classification result for the corresponding semantic vector. To maximize classification accuracy, the hyperparameters of the random forest are selected by cross-validation: the maximum depth of each subtree is set to 40, the number of leaf nodes is set to 3, each internal node in a tree is split by at least 4 child nodes, and the whole random forest has 8000 subtrees.
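For readers who wish to reproduce a comparable classifier, the hyperparameters above might be written as the following configuration. The mapping onto the keyword arguments of scikit-learn's `RandomForestClassifier` is an assumption: the application names no library, and reading "number of leaf nodes" as `min_samples_leaf` and the split condition as `min_samples_split` is only one possible interpretation of the text.

```python
# Hypothetical mapping of the stated hyperparameters onto scikit-learn
# keyword names; the interpretation of the last two entries is a guess.
RANDOM_FOREST_PARAMS = {
    "n_estimators": 8000,      # "the whole random forest has 8000 subtrees"
    "max_depth": 40,           # "maximum depth of each subtree is set to 40"
    "min_samples_leaf": 3,     # assumed reading of "number of leaf nodes is set to 3"
    "min_samples_split": 4,    # assumed reading of "split by at least 4"
}

# Possible usage, if scikit-learn is installed:
#   from sklearn.ensemble import RandomForestClassifier
#   classifier = RandomForestClassifier(**RANDOM_FOREST_PARAMS)
#   classifier.fit(train_features, train_labels)
print(RANDOM_FOREST_PARAMS)
```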

The present application will be described below with reference to specific examples.

In practice, code blocks with vulnerabilities account for a small share of the total code in a source code file and differ little from vulnerability-free code blocks that implement the same function. Because source code files are large and the cell state of a Bi-LSTM is updated at every time step, key features are lost during Bi-LSTM training. As a result, such samples often go undetected by traditional Bi-LSTM-based automated source code vulnerability detection frameworks.

In some embodiments, as shown in Fig. 2, 30 code repositories with at least 500 stars are collected from GitHub, their source code files are used as the data set, and the files are classified by programming language (as shown in Fig. 2, into C++ source code files, Python source code files, Java source code files, and so on). For source code files of the same programming language, an abstract syntax tree extraction tool for that language is first used to preprocess the files and obtain a tree-shaped abstract syntax structure. The tree is traversed in pre-order (root first), serialized, and stored as a text file. Owing to the properties of the abstract syntax tree, extraction and serialization capture the semantic features of the source code file while filtering out invalid information and removing elements with no semantic relevance, such as comment symbols, spaces, and blank lines.

In some embodiments, the elements in the serialized abstract syntax tree are converted into vector representations using the skip-gram algorithm. Because skip-gram is an unsupervised training method, this embodiment retains most of the source code data collected from GitHub and deletes only some files without semantic information, such as initialization files with no actual function and files containing only the names of other source code files. After 200 iterations, the elements in the serialized abstract syntax tree are converted into a vector representation of length 128. Since the number of elements in each source code file differs, and in order to treat one source code file as one sample, the number of elements in every source code file is unified to 128. If a source code file has more than 128 elements, the excess elements are deleted; if it has fewer than 128, vectors whose elements are all 0 are appended, and the result is normalized.

In some embodiments, a metric learning algorithm based on the idea of contrastive learning maps the sample vectors into a new high-dimensional space, which can improve the precision of vulnerability detection. The specific training process is shown in Fig. 3:

In step S301, same-class samples and different-class samples are placed into two separate sets, and a metric matrix is initialized. The first metric matrix A may be randomly generated by a program; in the present embodiment, a first metric matrix A of size 128 × 128 is randomly generated.
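The unification to 128 elements can be sketched as follows. This is a minimal illustration under the assumption that each source code file is a list of element vectors; the function name `pad_or_truncate` is hypothetical, and the normalization step is omitted because the text does not specify how it is performed.

```python
from typing import List, Sequence


def pad_or_truncate(vectors: Sequence[Sequence[float]],
                    length: int = 128, dim: int = 128) -> List[List[float]]:
    """Keep at most `length` element vectors and pad with all-zero vectors."""
    fixed = [list(v) for v in vectors[:length]]     # drop the excess elements
    while len(fixed) < length:
        fixed.append([0.0] * dim)                   # append vectors whose elements are all 0
    return fixed


short_file = [[1.0] * 128 for _ in range(3)]        # a file with 3 elements
long_file = [[1.0] * 128 for _ in range(200)]       # a file with 200 elements
short_sample = pad_or_truncate(short_file)
long_sample = pad_or_truncate(long_file)
print(len(short_sample), len(long_sample))
```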

In some embodiments, two sets are initialized: the first set C1 and the second set C2 start as empty sets, and a two-level loop over the sample vectors places unequal samples of the same class into the first set C1 and unequal samples of different classes into the second set C2. The first metric matrix A is initialized from a normal distribution with an expected value of 0 and a variance of 1.
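The two-level loop that fills C1 and C2 might look like the sketch below. It stores index pairs rather than the vectors themselves and visits each unordered pair once, both of which are assumptions made for brevity; the function name `build_pair_sets` is hypothetical.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]


def build_pair_sets(vectors: Sequence[Sequence[float]],
                    labels: Sequence[int]) -> Tuple[List[Pair], List[Pair]]:
    """Return (C1, C2): unequal same-class pairs and unequal different-class pairs."""
    c1: List[Pair] = []
    c2: List[Pair] = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if list(vectors[i]) == list(vectors[j]):
                continue                    # equal vectors are never paired
            if labels[i] == labels[j]:
                c1.append((i, j))           # same class, unequal
            else:
                c2.append((i, j))           # different class, unequal
    return c1, c2


vectors = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.0, 0.0]]
labels = [0, 0, 1, 0]
c1, c2 = build_pair_sets(vectors, labels)
print(c1, c2)
```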

In step S303, Gaussian perturbation (a deviation in probability distribution) is added to the metric matrix to obtain a new metric matrix, denoted the second metric matrix A'.

In step S305, the set of same-class samples is traversed. The first set of same-class samples is traversed using the second metric matrix A'; unequal elements are paired two by two, and when, after mapping by A', the distance between a pair is greater than 0, that distance is reduced.

In some embodiments, after each sample in the first set C1 has been traversed, the metric matrix having the smallest difference between the second metric matrix A' and the first metric matrix A is assigned to the first metric matrix, yielding the third metric matrix A''.

In step S307, the set of different-class samples is traversed. The second set of different-class samples is traversed using the third metric matrix A''; unequal elements are paired two by two, and when, after mapping by A'', the distance between a pair is less than or equal to 1, that distance is enlarged.

In some embodiments, after each sample in the second set C2 has been traversed, the metric matrix having the smallest difference between the third metric matrix A'' and the second metric matrix A' is assigned to the second metric matrix A', yielding the fourth metric matrix A'''.

In step S309, the metric matrix is updated continuously until the error falls to a threshold: updating continues until the difference between the third metric matrix A'' and the fourth metric matrix A''' falls below the threshold. In some embodiments, the threshold may be set to 10^-3.

In some embodiments, a fourth metric matrix of 128 x 128 size is iteratively generated, and multiplied by any second sample vector of 128 length to obtain a new first sample vector of 128 length.
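Steps S301 to S309 can be summarized in the following sketch. The application does not give the exact update rule, so several parts are assumptions and are marked as such: the gradient step on the squared mapped distance, the learning rate of 0.1, the perturbation scale of 0.01, and the iteration cap. All function names are hypothetical, and a 2 × 2 matrix stands in for the 128 × 128 matrix to keep the example small.

```python
import math
import random
from typing import List, Sequence

Matrix = List[List[float]]


def matvec(a: Matrix, x: Sequence[float]) -> List[float]:
    """Multiply matrix a by vector x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def mapped_distance(a: Matrix, x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between x and y after both are mapped by a."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    return math.sqrt(sum(v * v for v in matvec(a, diff)))


def _gradient_step(a, x, y, lr, sign):
    """Assumed update: move a along (+1) or against (-1) the gradient of the
    squared mapped distance, which is 2 * (a d) d^T for d = x - y."""
    d = [xi - yi for xi, yi in zip(x, y)]
    ad = matvec(a, d)
    return [[a[i][j] + sign * lr * 2.0 * ad[i] * d[j] for j in range(len(d))]
            for i in range(len(a))]


def pull(a, x, y, lr=0.1, low=0.0):
    """Reduce a same-class pair's distance while it is greater than `low`."""
    return _gradient_step(a, x, y, lr, -1.0) if mapped_distance(a, x, y) > low else a


def push(a, x, y, lr=0.1, high=1.0):
    """Enlarge a different-class pair's distance while it is at most `high`."""
    return _gradient_step(a, x, y, lr, 1.0) if mapped_distance(a, x, y) <= high else a


def fit_metric(vectors, c1, c2, dim, tol=1e-3, max_iters=200, seed=0):
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]  # S301: A ~ N(0, 1)
    for _ in range(max_iters):
        a3 = [[v + rng.gauss(0.0, 0.01) for v in row] for row in a]      # S303: A' (scale assumed)
        for i, j in c1:                                                   # S305: traverse C1, giving A''
            a3 = pull(a3, vectors[i], vectors[j])
        a4 = a3
        for i, j in c2:                                                   # S307: traverse C2, giving A'''
            a4 = push(a4, vectors[i], vectors[j])
        change = math.sqrt(sum((p - q) ** 2
                               for r3, r4 in zip(a3, a4) for p, q in zip(r3, r4)))
        a = a4
        if change < tol:                                                  # S309: stop below 10^-3
            break
    return a


vectors = [[0.0, 0.0], [0.0, 0.5], [0.3, 0.0]]       # toy samples; classes 0, 0, 1
metric = fit_metric(vectors, c1=[(0, 1)], c2=[(0, 2), (1, 2)], dim=2)
second = [matvec(metric, v) for v in vectors]        # mapped vectors = metric matrix x original vectors
print(second)
```

The stopping test compares the matrices before and after the C2 traversal, following the wording of step S309; the iteration cap is added only so that the sketch always terminates.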

In some embodiments, after converting the serialized abstract syntax tree text file into sample vectors, the sample vectors need to be projected into a new feature space for ease of classification. Since the sample vector before the word embedding operation retains the information of the original sample, it is used as the timestamp flag t of the Bi-Phased LSTM, and the sample vector after the word embedding is used as the input sample x of the Bi-Phased LSTM _t . Using a metric learning algorithm, input samples x _t The metric learning matrix is used to map to a new feature space.

In some embodiments, to make the space corresponding to the metric learning matrix linearly separable for the classifier, a way of computing errors that iterate and project over time is used. In each iteration, the product of the iterated measurement matrix and the sample vector is used as the input of the Bi-Phased LSTM, other parameters are fixed, the parameters of the measurement matrix are differentiated, and the derivative vector is projected onto the intersection of the first set C1 and the second set C2. And multiplying the obtained projection vector by the hyperparameter 0.2, adding the projection vector and the measurement matrix for mapping, and assigning the projection vector to the measurement matrix for mapping to obtain a new measurement matrix.

In some embodiments, Phased LSTM units are used in place of traditional LSTM units, which improves the robustness of vulnerability detection. To address the vanishing gradient problem in LSTM training, a time gate is added before the core state and the hidden state, so that the gradient updates the parameters of the network only while the time gate is open. As shown in fig. 4, after the input sample x_t is passed to the Phased LSTM cell, it is transformed by an activation function and multiplied by the input sample x_t processed by the input gate i_t; the result is denoted c_t'. To filter out extremely large or small gradient values and thereby control gradient explosion or vanishing, the input sample processed by the forget gate f_t is multiplied with c_t'. c_t' is then fed to the time gate k_t, which is controlled by t, to obtain the final core state c_t. To obtain the hidden state, c_t is activated by an activation function, multiplied by the input sample processed by the output gate o_t, and processed by a second time gate k_t, giving the hidden state h_t.
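The behavior of the time gate can be illustrated with the following sketch. The text does not give the gate's formula, so the sketch assumes the commonly published Phased LSTM formulation: a periodic gate with period `tau`, phase shift `shift`, open ratio `r_on`, and a small leak `alpha` while closed. These parameter names and the default value of `alpha` are assumptions.

```python
import numpy as np

def time_gate(t, tau, shift, r_on, alpha=1e-3):
    """Openness k_t in [0, 1] of the time gate at timestamp t (assumed formulation)."""
    phase = np.mod(t - shift, tau) / tau
    rising = 2.0 * phase / r_on          # first half of the open phase
    falling = 2.0 - 2.0 * phase / r_on   # second half of the open phase
    leak = alpha * phase                 # closed phase: small leak only
    return np.where(phase < 0.5 * r_on, rising,
                    np.where(phase < r_on, falling, leak))

def gated_update(k_t, candidate, previous):
    """Blend the candidate state with the previous state according to k_t."""
    return k_t * candidate + (1.0 - k_t) * previous

k_open = float(time_gate(0.25, tau=10.0, shift=0.0, r_on=0.1))
k_closed = float(time_gate(5.0, tau=10.0, shift=0.0, r_on=0.1))
```

When the gate is closed, `gated_update` returns almost exactly the previous state, which is how the state, and hence the gradient, is carried through time steps at which the gate is not open.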

After mapping by the metric learning matrix, the distance between input samples x_t of the same class (e.g., all vulnerable or all non-vulnerable) is reduced, and the distance between input samples x_t of different classes (e.g., vulnerable and non-vulnerable) is enlarged. After being merged with t, x_t is input to the Bi-Phased LSTM to extract high-level features. To obtain information about the elements both before and after a given element, two Phased LSTMs with the same structure are used. The forward Phased LSTM propagates in sequential order and calculates the correlation between each element and the elements after it, as indicated by the left-to-right arrows in fig. 5. The backward Phased LSTM also takes x_t and t as input, propagates in reverse order, and calculates the correlation between each element and the elements before it. Finally, the results of the forward and backward Phased LSTMs (Phased LSTM 1 and Phased LSTM 2 in the figure) are combined as the final output vector. Although the two Phased LSTMs have the same structure, their parameters differ after iteration, and so do their output vectors. A maximum pooling layer is added on top of the Bi-Phased LSTM, after which only an activation function and a linear layer are needed to obtain the final high-level features.
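The two-direction arrangement followed by maximum pooling can be sketched as below. To keep the example short, a plain tanh recurrent cell is substituted for the Phased LSTM cell, and the two directions are combined by concatenation; both choices, along with the hidden size of 16 and the random weights, are assumptions made for illustration.

```python
import numpy as np

def run_direction(inputs, w_in, w_rec):
    """Run a plain tanh recurrent cell (stand-in for Phased LSTM) over the sequence."""
    hidden = np.zeros(w_rec.shape[0])
    outputs = []
    for x_t in inputs:
        hidden = np.tanh(w_in @ x_t + w_rec @ hidden)
        outputs.append(hidden)
    return np.stack(outputs)

def bidirectional_features(inputs, fwd_params, bwd_params):
    forward = run_direction(inputs, *fwd_params)
    # Process the reversed sequence, then flip back to the original order.
    backward = run_direction(inputs[::-1], *bwd_params)[::-1]
    combined = np.concatenate([forward, backward], axis=1)  # (time, 2 * hidden)
    return combined.max(axis=0)                              # max pooling over time

rng = np.random.default_rng(0)
sequence = rng.standard_normal((6, 128))
fwd = (rng.standard_normal((16, 128)) * 0.1, rng.standard_normal((16, 16)) * 0.1)
bwd = (rng.standard_normal((16, 128)) * 0.1, rng.standard_normal((16, 16)) * 0.1)
features = bidirectional_features(sequence, fwd, bwd)
```

The two directions share a structure but hold separate parameters, matching the description that their outputs differ after training.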

The embodiments of the application can also be applied to source code vulnerability detection in the field of smart contracts, which likewise falls within the scope of protection of the application. The source code vulnerability detection method applied to the field of smart contracts has the advantages of the corresponding method embodiment, which are not described again here.

It should be noted that the method of the embodiment of the present application may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and is completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the multiple devices may only perform one or more steps of the method of the embodiment, and the multiple devices interact with each other to complete the method.

It should be noted that the above describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Based on the same technical concept, the application also provides a source code vulnerability detection device corresponding to the method of any embodiment.

Referring to fig. 6, the source code vulnerability detection apparatus includes:

an obtaining module 601 configured to obtain a source code file; wherein, the source code file comprises a plurality of source code segments.

A conversion module 602, configured to extract semantic information of each source code segment in the source code file, to obtain several first semantic vectors.

In some embodiments, the converting module 602 is further configured to convert the source code file into an abstract syntax tree, where each node in the abstract syntax tree corresponds to one source code segment in the source code file, serialize the abstract syntax tree to obtain a serialized abstract syntax tree, and extract semantic information of each node in the serialized abstract syntax tree to obtain a plurality of the first semantic vectors.
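The conversion and serialization performed by the conversion module can be sketched as follows. The application does not name a source language or a parser, so the sketch assumes Python source and the standard-library `ast` module, and it serializes the tree as a depth-first list of node type names; the function name `serialize_ast` is hypothetical.

```python
import ast

def serialize_ast(source):
    """Parse Python source and return node type names in depth-first order."""
    tokens = []

    def visit(node):
        tokens.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens

example = "def add(a, b):\n    return a + b\n"
sequence = serialize_ast(example)
```

Each entry of `sequence` corresponds to one node of the tree, and the resulting flat list is the kind of input from which per-node semantic information can then be extracted.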

In some embodiments, the conversion module 602 is further configured to select semantic information of any one of the nodes as central semantic information, obtain a plurality of context semantic information of the central semantic information based on a preset context selection number, and convert the plurality of context semantic information into vector representation to obtain a plurality of first semantic vectors.
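The selection of context semantic information around a center can be sketched as below. The parameter `window` plays the role of the preset context selection number, interpreted here as the number of neighbors taken on each side, which is an assumption; the subsequent conversion of the pairs into vectors (for example by a word embedding model) is left out.

```python
def context_pairs(tokens, window=2):
    """Pair each center token with its neighbors within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = context_pairs(["FunctionDef", "arguments", "Return", "BinOp"], window=1)
```

Tokens at either end of the sequence simply receive fewer context entries, so no padding is needed.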

The updating module 603 is configured to update the first semantic vector based on a preset first metric matrix to obtain a second semantic vector.

In some embodiments, the updating module 603 is further configured to: classify the plurality of first semantic vectors to obtain a first set and a second set, where the first set includes first semantic vectors that have the same attribute but are not equal to one another, and the second set includes first semantic vectors that have different attributes and are not equal to one another; initialize the preset first metric matrix to obtain a second metric matrix; based on the second metric matrix, pair the unequal vectors in the first set two by two and reduce the distance between the paired vectors to obtain a third metric matrix; based on the third metric matrix, pair the unequal vectors in the second set two by two and expand the distance between the paired vectors to obtain a fourth metric matrix; and obtain the second semantic vector according to the fourth metric matrix and the first semantic vector.

In some embodiments, the updating module 603 is further configured to narrow down a first distance between pairwise paired vectors and make the first distance greater than a first threshold to obtain the third metric matrix.

In some embodiments, the updating module 603 is further configured to expand a second distance between pairwise paired vectors and make the second distance not greater than a second threshold to obtain the fourth metric matrix.
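The pairing and the distance that the updating module reduces or expands can be sketched as follows. The sketch assumes that "attribute" is a class label and that the distance between two vectors is the Euclidean norm of their difference after mapping by the metric matrix; both are assumptions, and the function names are hypothetical.

```python
import itertools
import numpy as np

def build_pair_sets(vectors, labels):
    """Split unequal vector pairs into same-label and different-label index pairs."""
    same, different = [], []
    for i, j in itertools.combinations(range(len(vectors)), 2):
        if np.array_equal(vectors[i], vectors[j]):
            continue  # only unequal vectors are paired
        (same if labels[i] == labels[j] else different).append((i, j))
    return same, different

def mapped_distance(metric_matrix, x, y):
    """Distance between x and y after both are mapped by the metric matrix."""
    return float(np.linalg.norm(metric_matrix @ (x - y)))

vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
labels = [1, 1, 0, 0]
same_pairs, different_pairs = build_pair_sets(vectors, labels)

metric = np.diag([0.5, 2.0])
d_same = mapped_distance(metric, vectors[0], vectors[1])
d_diff = mapped_distance(metric, vectors[0], vectors[2])
```

With this toy diagonal matrix, the same-label pair moves closer and the different-label pair moves apart compared with the identity mapping, which is the effect the shrinking and expanding steps aim for.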

An extraction module 604 configured to extract semantic features of the second semantic vector.

In some embodiments, the extracting module 604 is further configured to: take the first semantic vector as a time sequence; based on the time sequence and according to the second semantic vector, obtain a first related feature between the second semantic vector at the current time and the second semantic vector at the previous time; based on the time sequence and according to the second semantic vector, obtain a second related feature between the second semantic vector at the current time and the second semantic vector at the next time; and combine the first related feature and the second related feature to obtain the semantic features of the second semantic vector.

The classification module 605 is configured to classify the semantic features to obtain semantic features with vulnerabilities and semantic features without vulnerabilities, and determine the source code segments corresponding to the semantic features with vulnerabilities as source code segments with vulnerabilities.

For convenience of description, the above apparatus is described as being divided into modules by function, each described separately. Of course, when the present application is implemented, the functionality of the modules may be implemented in one or more pieces of software and/or hardware.

The apparatus of the foregoing embodiment is used to implement the corresponding source code vulnerability detection method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.

Based on the same technical concept, corresponding to the method of any embodiment, the present application further provides an electronic device, which includes a memory, a processor, and a computer program that is stored in the memory and can run on the processor, and when the processor executes the program, the method for detecting a source code vulnerability according to any embodiment is implemented.

Fig. 7 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.

The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.

The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.

The input/output interface 1030 is used for connecting an input/output module to input and output information. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.

The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can communicate in a wired mode (for example, USB, network cable, etc.) or in a wireless mode (for example, mobile network, Wi-Fi, Bluetooth, etc.).

The bus 1050 includes a path to transfer information between various components of the device, such as the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.

It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.

The electronic device of the above embodiment is used to implement the corresponding source code vulnerability detection method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.

Based on the same technical concept, corresponding to the method of any of the above embodiments, the present application also provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the source code vulnerability detection method according to any of the above embodiments.

Computer-readable media of the present embodiments include both permanent and non-permanent, removable and non-removable media implemented in any method or technology for storage of information. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.

The computer instructions stored in the storage medium of the foregoing embodiment are used to enable the computer to execute the source code vulnerability detection method according to any of the foregoing embodiments, and have the beneficial effects of the corresponding method embodiments, which are not described herein again.

Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the context of the present application, technical features in the above embodiments or in different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application described above, which are not provided in detail for the sake of brevity.

In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the application. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the application are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that the embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.

While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures, such as Dynamic RAM (DRAM), may use the discussed embodiments.

The present embodiments are intended to embrace all such alternatives, modifications and variations which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present application are intended to be included within the scope of the present application.

Claims

1. A method for detecting a source code vulnerability includes:

extracting semantic information of each source code segment in the source code file to obtain a plurality of first semantic vectors;

wherein the second semantic vector is obtained by:

initializing the preset first metric matrix to obtain a second metric matrix;

obtaining the second semantic vector according to the fourth metric matrix and the first semantic vector;

extracting semantic features of the second semantic vector;

classifying the semantic features to obtain semantic features with vulnerabilities and semantic features without vulnerabilities, and determining the source code segments corresponding to the semantic features with vulnerabilities as source code segments with vulnerabilities.

2. The method of claim 1, wherein the extracting semantic information of each source code segment in the source code file to obtain a plurality of first semantic vectors comprises:

3. The method according to claim 2, wherein said extracting semantic information of each of said nodes in said serialized abstract syntax tree to obtain a plurality of said first semantic vectors comprises:

selecting semantic information of any node as central semantic information;

and converting the context semantic information into vector representation to obtain a plurality of first semantic vectors.

4. The method of claim 1, wherein reducing the distance between pairwise paired vectors to obtain a third metric matrix comprises:

5. The method of claim 1, wherein expanding the distance between pairwise paired vectors to obtain a fourth metric matrix comprises:

6. The method of claim 1, wherein the extracting semantic features of the second semantic vector comprises:

taking the first semantic vector as a time sequence;

based on the time sequence, obtaining, according to the second semantic vector, a first relevant feature between the second semantic vector at the current moment and the second semantic vector at the previous moment;

and combining the first relevant features and the second relevant features to obtain semantic features of the second semantic vector.

7. A source code vulnerability detection apparatus, comprising:

the conversion module is configured to extract semantic information of each source code segment in the source code file to obtain a plurality of first semantic vectors;

the updating module is configured to update the first semantic vector based on a preset first metric matrix to obtain a second semantic vector; wherein the second semantic vector is obtained by the following method:

classifying the plurality of first semantic vectors to obtain a first set and a second set; the first set comprises a plurality of first semantic vectors that have the same attribute but are not equal to one another, and the second set comprises a plurality of first semantic vectors that have different attributes and are not equal to one another;

initializing the preset first metric matrix to obtain a second metric matrix;

8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 6 when executing the program.

9. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 6.