CN115587029A

CN115587029A - Patch detection method and device, electronic equipment and computer readable medium

Info

Publication number: CN115587029A
Application number: CN202211194665.1A
Authority: CN
Inventors: 高思雨; 闻剑峰; 殷铭
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2022-09-28
Filing date: 2022-09-28
Publication date: 2023-01-10

Abstract

The embodiments of the application disclose a patch detection method and device, electronic equipment, and a computer-readable medium. The patch detection method comprises the following steps: acquiring an original code segment and a patch code segment, wherein the patch code segment is a code segment obtained by adding a patch to the original code segment; performing semantic analysis on the original code segment to obtain a first semantic analysis result, and performing semantic analysis on the patch code segment to obtain a second semantic analysis result; performing similarity calculation on the first semantic analysis result and intention behavior description data to obtain a first similarity value, and performing similarity calculation on the second semantic analysis result and the intention behavior description data to obtain a second similarity value; and detecting whether the patch is an over-fit patch according to the first similarity value and the second similarity value. With this technical scheme, over-fit patches are detected and the reliability of automatic program repair is improved.

Description

Patch detection method and device, electronic equipment and computer readable medium

Technical Field

The present application relates to the field of computer security technologies, and in particular, to a patch detection method, a patch detection apparatus, an electronic device, and a computer readable medium.

Background

Automatic Program Repair (APR) technology can effectively reduce software maintenance costs and the time developers spend debugging programs, while improving software quality by generating patches that automatically repair errors.

However, automatic program repair often generates many over-fit patches. In practice, the test suite used to check a patch often cannot guarantee that over-fit patches are identified correctly; that is, some over-fit patches can pass a specific test suite. Because of such over-fit patches, the program to be repaired is not correctly repaired, and program development efficiency is low.

Therefore, how to accurately detect over-fit patches so as to improve program development efficiency is an urgent problem to be solved.

Disclosure of Invention

The embodiment of the application provides a patch detection method and device, an electronic device and a computer readable medium, which can accurately detect an over-fit patch in a patch generated by automatic program repair, so as to prevent the over-fit patch from repairing a program, thereby improving the reliability of automatic program repair.

The embodiment of the application provides a patch detection method, which comprises the following steps: acquiring an original code segment and a patch code segment, wherein the patch code segment is a code segment obtained after adding a patch to the original code segment; performing semantic analysis on the original code segment to obtain a first semantic analysis result, and performing semantic analysis on the patch code segment to obtain a second semantic analysis result; performing similarity calculation on the first semantic analysis result and intention behavior description data to obtain a first similarity value, and performing similarity calculation on the second semantic analysis result and the intention behavior description data to obtain a second similarity value, wherein the intention behavior description data is used for representing the intention behind the development of the original code segment; and detecting whether the patch is an over-fit patch according to the first similarity value and the second similarity value.

In an embodiment of the present application, based on the foregoing scheme, a first abstract syntax tree is constructed according to the method included in the original code segment; determining a first set of leaf nodes from the first abstract syntax tree; wherein the first set of leaf nodes comprises a plurality of first path contexts; and calculating an original digital vector corresponding to each first path context, and taking the original digital vector corresponding to each first path context as the first semantic analysis result.

In an embodiment of the present application, based on the foregoing solution, a second abstract syntax tree is constructed according to a method included in the patch code segment; determining a second set of leaf nodes from the second abstract syntax tree; wherein the second set of leaf nodes comprises a plurality of second path contexts; and calculating a patch digital vector corresponding to each second path context, and taking the patch digital vector corresponding to each second path context as the second semantic analysis result.

In one embodiment of the present application, based on the foregoing scheme, a plurality of original embedded vectors are generated from the plurality of original digital vectors; performing similarity distance calculation on each original embedded vector and the embedded vector corresponding to the intention behavior description data to obtain a first similarity distance set; determining the first similarity value according to the first similarity distance set.

In an embodiment of the present application, based on the foregoing scheme, a minimum similarity distance is selected from the first similarity distance set as the first similarity value; or performing averaging operation on a plurality of similarity distances contained in the first similarity distance set to obtain a first average similarity distance, and taking the first average similarity distance as the first similarity value.

In an embodiment of the present application, based on the foregoing scheme, the second semantic analysis result includes a plurality of patch digital vectors; a plurality of patch embedding vectors are generated according to the plurality of patch digital vectors; similarity distance calculation is performed on each patch embedding vector and the embedding vector corresponding to the intention behavior description data to obtain a second similarity distance set; and the second similarity value is determined according to the second similarity distance set.

In an embodiment of the present application, based on the foregoing scheme, selecting a minimum similarity distance from the second similarity distance set as the second similarity value; or performing averaging operation on a plurality of similarity distances contained in the second similarity distance set to obtain a second average similarity distance, and taking the second average similarity distance as the second similarity value.

In an embodiment of the present application, based on the foregoing scheme, a difference operation is performed on the first similarity value and the second similarity value to obtain a similarity gain; if the similarity gain is larger than or equal to a preset gain threshold value, obtaining a detection result for representing that the patch is an over-fit patch; and if the similarity gain is smaller than the preset gain threshold, obtaining a detection result for representing that the patch is not an over-fit patch.

In an embodiment of the application, based on the foregoing scheme, before performing similarity calculation on the first semantic analysis result and intention behavior description data to obtain a first similarity value, acquiring development intention data; wherein the development intention data includes at least one method name; extracting the method name contained in the development intention data, and determining the intention behavior description data according to the extracted method name.
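This passage does not spell out how an extracted method name becomes intention behavior description data. One plausible reading, assumed here purely for illustration, is that the method name is split into its constituent words so that it can later be embedded as natural-language text. The function name `describe_intent` and the splitting rule are hypothetical and are not taken from the application.

```python
import re


def describe_intent(method_name: str) -> str:
    """Split a camelCase or snake_case method name into lowercase words.

    Hypothetical helper: the splitting rule is an assumption, not the
    procedure defined by the application.
    """
    words = re.findall(
        r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", method_name
    )
    return " ".join(word.lower() for word in words)


intent_simple = describe_intent("getUserName")
intent_acronym = describe_intent("parseHTTPResponse")
intent_snake = describe_intent("is_valid")
```

Under this assumption, a name such as `getUserName` yields a short phrase that a word-embedding model could consume.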

In a third aspect, an embodiment of the present application provides an electronic device, including one or more processors; a memory for storing one or more programs that, when executed by the one or more processors, cause the electronic device to implement a patch detection method as described above.

In a fourth aspect, an embodiment of the present application provides a computer-readable medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the patch detection method as described above.

In a fifth aspect, the present application provides a computer program product, which includes computer instructions, and when executed by a processor, the computer instructions implement the patch detection method described above.

In the technical scheme provided by the embodiment of the application:

obtaining a first semantic analysis result by acquiring an original code segment and a patch code segment and performing semantic analysis on the original code segment; performing semantic analysis on the patch code segment to obtain a second semantic analysis result; furthermore, intention behavior description data of the intention for developing the original code segment is obtained, and similarity calculation is carried out on the first semantic analysis result and the second semantic analysis result and the intention behavior description data respectively to obtain a first similarity value and a second similarity value. Whether the patch is an over-fit patch can be detected according to the first similarity value and the second similarity value.

By this method, over-fit patches can be accurately detected among the patches generated by automatic program repair, so that over-fit patches can be prevented from being applied to the program, and the reliability of automatic program repair is improved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

FIG. 1 is a schematic diagram of a computer-implemented framework to which embodiments of the present application may be applied;

FIG. 2 is a flow chart illustrating a method of patch detection in accordance with an exemplary embodiment of the present application;

FIG. 3 is a flow chart illustrating a method of patch detection in accordance with another exemplary embodiment of the present application;

FIG. 4 is a flow chart illustrating a method of patch detection in accordance with another exemplary embodiment of the present application;

FIG. 5 is a flow chart illustrating a method of patch detection in accordance with another exemplary embodiment of the present application;

FIG. 6 is a flow chart illustrating a method of patch detection in accordance with another exemplary embodiment of the present application;

FIG. 7 is a flow chart illustrating a method of detection of a patch in accordance with another exemplary embodiment of the present application;

FIG. 8 is a flow chart illustrating a method of patch detection in accordance with another exemplary embodiment of the present application;

FIG. 9 is a flow chart illustrating a method of patch detection in accordance with another exemplary embodiment of the present application;

FIG. 10 is a flow chart illustrating a method of detection of a patch in accordance with another exemplary embodiment of the present application;

FIG. 11 is a flow chart illustrating a method of patch detection in accordance with another exemplary embodiment of the present application;

FIG. 12 is a block diagram of a patch detection apparatus according to an embodiment of the present application;

FIG. 13 is a block diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.

The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.

The flowcharts shown in the figures are illustrative only and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.

It should also be noted that "a plurality" in this application means two or more. "And/or" describes the association relationship of the associated objects and means that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.

The description of the embodiments of the present application involves the following terminology and technical background:

A patch is a small program released to fix a problem exposed during the use of a large software system. In the embodiments of the present application, "patch" and "patch code segment" are the same concept unless otherwise specified. Further, "code segment" and "code fragment" are also the same concept.

Code2Vec is a framework that predicts program properties using neural networks. Its main idea is to turn code into embedded vectors, learning a distributed representation of the code. A distributed representation stores the semantic or grammatical features of a language dispersed across a low-dimensional, dense real-valued vector.

An Abstract Syntax Tree (AST), or simply syntax tree, is an abstract representation of the syntactic structure of source code. It represents the syntactic structure of the programming language in the form of a tree, with each node of the tree representing a construct in the source code.

Word embedding is the collective name for a family of language modeling and representation learning techniques in Natural Language Processing (NLP). Conceptually, it refers to embedding a high-dimensional space, whose dimension equals the number of all words, into a continuous vector space of much lower dimension, with each word or phrase mapped to a vector over the real numbers.

The patch detection method and apparatus, the electronic device, and the computer-readable medium provided in the embodiments of the present application relate to the field of computer security technologies, and the embodiments will be described in detail below.

Referring to FIG. 1, FIG. 1 is a schematic diagram of a computer-implemented framework according to the present application. The computer (PC) implementation framework may include a code segment extraction component 110, a code description generator 120, a code intention extractor 130, and a classifier 140. The code segment extraction component 110 may process the patch and the defective program code to obtain the corresponding original code segment, patch code segment, and development intention data. The development intention data may be the developer's intention that the defective program code implies or contains, that is, what the code would do if it had no defect. Illustratively, suppose a piece of program code is intended to identify the human behaviors contained in a picture through an image recognition algorithm, and the code has a logic error or bug. The code segment extraction component 110 extracts this program code and further extracts from it the original code segment containing the error. The original code segment is included in the program code.

It should be noted that the patch code segment may be obtained after a developer or artificial intelligence modifies an original code segment, and the developer of the patch code segment in the embodiment of the present application may be the same as or different from the developer of the original code segment, which is beneficial to perfecting the logic, syntax, and the like of the original code segment and optimizing the original code segment by different developers.

Further, the code description generator 120 may process the original code segment and the patch code segment to obtain the corresponding original code semantic analysis result and patch code semantic analysis result. The code intention extractor 130 may process the development intention data to obtain the corresponding intention behavior description data. The classifier 140 may be a patch classifier: the computer may classify the patch code segment according to the similarity between the original code semantic analysis result and the intention behavior description data and the similarity between the patch code semantic analysis result and the intention behavior description data, so as to determine whether the patch code segment is an over-fit patch.

FIG. 2 is a flow diagram illustrating a method for patch detection according to an example embodiment. As shown in fig. 2, in an exemplary embodiment, the method may include steps S210 to S240, and the executing subject of the embodiment of the present application may be a computer, and the detailed method is described as follows:

step S210: acquiring an original code segment and a patch code segment; wherein the patch code section is a code section obtained after adding a patch to the original code section.

In the embodiment of the application, the original code segment and the patch code segment are obtained by extracting the defect program code and the patch through a code segment extracting component. The patch code segment is a code segment obtained after adding a patch to the original code segment, and may be obtained after modifying the original code segment by a developer or by an artificial intelligence algorithm.

Specifically, the computer may obtain the original code segment and the patch code segment by first obtaining the patch and the defect program code, and then extracting the patch and the defect program code by the code segment extraction component. Wherein the defective program code may be a program code found to be defective by a developer or by an artificial intelligence algorithm.

In the embodiment of the application, the original code segment and the patch code segment both comprise unit functions, and one unit function usually implements one method. The code segment extraction component can extract the class declaration (such as the class name, inherited class name, and interface names) and the method declaration (such as the method name, method signature, and method documentation) corresponding to each method, so that the class declarations and method declarations corresponding to a plurality of methods can be obtained. The class declaration and method declaration corresponding to one method may be referred to as a triple. Further, after the code segment extraction component processes the defective program code and the patch, a plurality of methods, that is, a plurality of triples, can be obtained.

In the embodiment of the present application, $pat = \{m^1, m^2, \ldots, m^n\}$ may be used to represent the set of methods included in the patch code segment, where $m^i$ ($1 \le i \le n$) represents the i-th method or the i-th triple; that is, the patch code segment includes n methods. Similarly, $org = \{m^1, m^2, \ldots, m^p\}$ may be used to represent the set of methods included in the original code segment, where $m^i$ ($1 \le i \le p$) represents the i-th method or the i-th triple; that is, the original code segment includes p methods. n and p may be the same or different, which is not limited in the embodiments of the present application.
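As a rough illustration of the method sets described above, the following sketch models each extracted method as a small record holding its class declaration and method declaration. The class name `MethodTriple`, its field layout, and the sample values are assumptions made for this example; the application does not prescribe a concrete data structure.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MethodTriple:
    """One extracted method (hypothetical layout, not defined by the source)."""

    class_declaration: str  # e.g. class name, inherited class, interfaces
    method_name: str
    method_signature: str
    method_doc: str = ""


# org: the p methods of the original code segment (sample data only).
org: List[MethodTriple] = [
    MethodTriple("class Counter", "increment", "void increment(int step)"),
]

# pat: the n methods of the patch code segment (sample data only).
pat: List[MethodTriple] = [
    MethodTriple("class Counter", "increment", "void increment(int step)"),
    MethodTriple("class Counter", "reset", "void reset()"),
]

p, n = len(org), len(pat)  # n and p may be the same or different
```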

Step S220: and performing semantic analysis on the original code segment to obtain a first semantic analysis result, and performing semantic analysis on the patch code segment to obtain a second semantic analysis result.

In this step, semantic analysis may be performed on the original code segment by the code description generator, thereby obtaining a first semantic analysis result. The first semantic analysis result may include a plurality of semantics contained in the original code segment. Each semantic may have a corresponding original digital vector, each original digital vector may correspond to a first path context, and each first path context is determined from the first set of leaf nodes of the first abstract syntax tree. The first abstract syntax tree may be constructed from the plurality of methods included in the original code segment. The first semantic analysis result may be represented as $S_{text} = \{s_{text1}, s_{text2}, \ldots, s_{textk}\}$, where $s_{texti}$ ($1 \le i \le k$) represents the i-th original semantic. The original semantics may be mutually independent or mutually related, which is not limited in the embodiments of the present application.

Similarly, the patch code segment may be semantically analyzed by the code description generator to obtain a second semantic analysis result. The second semantic analysis result may include a plurality of semantics contained in the patch code segment. Each semantic may have a corresponding patch digital vector, each patch digital vector may correspond to a second path context, and each second path context is determined from the second set of leaf nodes of the second abstract syntax tree. The second abstract syntax tree may be constructed from the plurality of methods included in the patch code segment. The second semantic analysis result may be represented as $T_{text} = \{t_{text1}, t_{text2}, \ldots, t_{textl}\}$, where $t_{texti}$ ($1 \le i \le l$) represents the i-th patch semantic. The patch semantics may be mutually independent or mutually related, which is not limited in the embodiments of the present application.

Step S230: carrying out similarity calculation on the first semantic analysis result and the intention behavior description data to obtain a first similarity value, and carrying out similarity calculation on the second semantic analysis result and the intention behavior description data to obtain a second similarity value; the intention behavior description data is used for representing the intention corresponding to the original code segment.

The intention behavior description data can be represented as $Int_{text}$. In conjunction with step S220, the computer can compare each $s_{texti}$ ($1 \le i \le k$) in $S_{text}$ with $Int_{text}$ to obtain k similarity distances, and determine the first similarity value from the k similarity distances. The first similarity value may be the minimum of the k similarity distances, or the average of the k similarity distances, and so on, which is not limited in the embodiments of the present application.

Further, each $t_{texti}$ ($1 \le i \le l$) in $T_{text}$ may be compared with $Int_{text}$ to obtain l similarity distances, and the second similarity value is determined from the l similarity distances. The second similarity value may be the minimum of the l similarity distances, or the average of the l similarity distances, and so on, which is not limited in the embodiments of the present application.
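The reduction from a set of similarity distances to a single similarity value can be sketched as follows. This is a minimal illustration of the two options named in the text (minimum or average); the function name `similarity_value`, the `mode` parameter, and the sample distances are assumptions introduced for the example.

```python
from statistics import mean
from typing import Sequence


def similarity_value(distances: Sequence[float], mode: str = "min") -> float:
    """Reduce a set of similarity distances to one similarity value.

    mode="min" selects the smallest distance; mode="mean" averages them.
    Both the name and the mode switch are hypothetical.
    """
    if not distances:
        raise ValueError("at least one similarity distance is required")
    if mode == "min":
        return min(distances)
    if mode == "mean":
        return mean(distances)
    raise ValueError(f"unknown mode: {mode}")


sample_distances = [0.5, 0.25, 0.75]  # illustrative values only
value_min = similarity_value(sample_distances, "min")
value_mean = similarity_value(sample_distances, "mean")
```

The same helper would apply to both the k distances of the original code segment and the l distances of the patch code segment.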

Step S240: and detecting whether the patch is an over-fit patch or not according to the first similarity value and the second similarity value.

Specifically, the second similarity value may be subtracted from the first similarity value to obtain a similarity gain. If the similarity gain is larger than or equal to a preset gain threshold value, obtaining a detection result for representing that the patch is an over-fit patch; and if the similarity gain is smaller than the preset gain threshold, obtaining a detection result for representing that the patch is not the over-fit patch.
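The decision rule of step S240 can be written down directly. The sketch below follows the text as stated (gain equals the first similarity value minus the second, and a gain at or above the threshold marks the patch as over-fit); the function name and the numeric values are illustrative assumptions, and the application does not give a concrete threshold.

```python
def is_overfit_patch(
    first_similarity: float, second_similarity: float, gain_threshold: float
) -> bool:
    """Apply the gain-threshold rule of step S240.

    similarity gain = first similarity value - second similarity value.
    Returns True when the gain is greater than or equal to the threshold.
    """
    similarity_gain = first_similarity - second_similarity
    return similarity_gain >= gain_threshold


# Illustrative values only; the threshold is a preset, tunable parameter.
flagged = is_overfit_patch(0.75, 0.25, gain_threshold=0.5)
not_flagged = is_overfit_patch(0.5, 0.25, gain_threshold=0.5)
```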

In the embodiment of the application, an original code segment and a patch code segment are acquired; semantic analysis is performed on the original code segment to obtain a first semantic analysis result, and on the patch code segment to obtain a second semantic analysis result. Furthermore, intention behavior description data representing the intention behind the development of the original code segment is obtained, and similarity calculation is performed between the intention behavior description data and each of the first and second semantic analysis results to obtain a first similarity value and a second similarity value. Finally, whether the patch is an over-fit patch can be detected according to the first similarity value and the second similarity value. By this method, over-fit patches can be detected among the patches generated by automatic program repair, so that over-fit patches can be prevented from being applied to the program, and the reliability of automatic program repair is improved.

Fig. 3 is a flow chart of step S220 in an exemplary embodiment in the embodiment shown in fig. 2. As shown in fig. 3, in an exemplary embodiment, the process of performing semantic analysis on the original code segment to obtain the first semantic analysis result may include steps S310 to S330, which are described in detail as follows:

step S310: and constructing a first abstract syntax tree according to the method contained in the original code segment.

As described in step S210, the original code segment includes a plurality of methods, i.e., a plurality of triples, and each method may include a class declaration and a method declaration of the respective method. Thus, a first abstract syntax tree can be constructed from the methods contained in the original code fragment.

Step S320: determining a first set of leaf nodes from the first abstract syntax tree; wherein the first set of leaf nodes comprises a plurality of first path contexts.

The first abstract syntax tree may include a plurality of first leaf nodes, where a plurality of leaf nodes may correspond to one method, and the leaf nodes corresponding to different methods may intersect with each other, which is not limited in the embodiment of the present application. A plurality of first path contexts may be determined in the first abstract syntax tree. The plurality of first path contexts may be included in the first set of leaf nodes, a certain first path context may include a plurality of first leaf nodes, and paths of different first leaf nodes may form different first path contexts.
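To make the notion of a path context concrete, the sketch below enumerates leaf-to-leaf paths in an abstract syntax tree, in the style commonly associated with Code2Vec: each context is a pair of leaf tokens together with the node types on the path through their lowest common ancestor. It uses Python's standard `ast` module on Python source purely for illustration; the application does not name a parser or a target language, and the path notation (`^` for upward steps, `_` for downward steps) is an assumption.

```python
import ast
from itertools import combinations
from typing import List, Tuple

PathContext = Tuple[str, str, str]  # (start token, path, end token)


def path_contexts(source: str) -> List[PathContext]:
    """Enumerate leaf-to-leaf path contexts of a Python snippet."""
    leaves = []  # (token, list of nodes from the root down to the leaf)

    def visit(node: ast.AST, trail: list) -> None:
        trail = trail + [node]
        if isinstance(node, ast.Name):
            leaves.append((node.id, trail))
        elif isinstance(node, ast.arg):
            leaves.append((node.arg, trail))
        elif isinstance(node, ast.Constant):
            leaves.append((repr(node.value), trail))
        else:
            for child in ast.iter_child_nodes(node):
                visit(child, trail)

    visit(ast.parse(source), [])

    contexts = []
    for (tok_a, trail_a), (tok_b, trail_b) in combinations(leaves, 2):
        # Count the shared prefix by node identity to find the common ancestor.
        shared = 0
        limit = min(len(trail_a), len(trail_b))
        while shared < limit and trail_a[shared] is trail_b[shared]:
            shared += 1
        up = [type(n).__name__ for n in reversed(trail_a[shared:])]
        top = type(trail_a[shared - 1]).__name__
        down = [type(n).__name__ for n in trail_b[shared:]]
        contexts.append((tok_a, "^".join(up + [top]) + "_" + "_".join(down), tok_b))
    return contexts


contexts = path_contexts("def add(a, b):\n    return a + b\n")
```

For the two-parameter function above, the four leaves (two parameters, two variable uses) give six pairwise path contexts.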

Step S330: and calculating an original digital vector corresponding to each first path context, and taking the original digital vector corresponding to each first path context as a first semantic analysis result.

In this step, the original digital vector corresponding to each first path context can be calculated by the Code2Vec algorithm, yielding a plurality of original digital vectors. The Code2Vec algorithm can predict the method names of the methods in the original code segment and calculate the original digital vectors. Each original digital vector may include the weight of the corresponding first path context, and the path contexts may be aggregated according to their weights to obtain the first semantic analysis result, i.e., $S_{text}$. The weight of each first path context can be learned through an attention mechanism during training. The original digital vector corresponding to each first path context can be treated as one semantic, so that the first semantic analysis result is obtained. It is understood that, in $S_{text} = \{s_{text1}, s_{text2}, \ldots, s_{textk}\}$, $s_{texti}$ ($1 \le i \le k$) represents the i-th original semantic and may also represent the i-th original digital vector.
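The weighted aggregation mentioned above can be illustrated with a small attention computation: each path-context vector is scored against an attention vector, the scores are normalized with a softmax, and the vectors are summed with the resulting weights. This is a generic sketch of attention-weighted aggregation, not the trained Code2Vec model; the function name, the attention vector, and the input values are assumptions, and in practice the attention vector would be learned during training.

```python
import math
from typing import List, Sequence, Tuple


def attention_aggregate(
    context_vectors: Sequence[Sequence[float]], attention: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Return (weights, aggregated vector) for the given path-context vectors."""
    # Score each context vector by its dot product with the attention vector.
    scores = [sum(c * a for c, a in zip(vec, attention)) for vec in context_vectors]
    # Softmax, shifted by the maximum score for numerical stability.
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the context vectors, dimension by dimension.
    dim = len(context_vectors[0])
    aggregated = [
        sum(w * vec[d] for w, vec in zip(weights, context_vectors))
        for d in range(dim)
    ]
    return weights, aggregated


# Illustrative input: a zero attention vector gives every context equal weight.
weights, aggregated = attention_aggregate([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```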

In the embodiment of the application, a first abstract syntax tree is constructed from the methods of the original code segment, so that a plurality of first path contexts and the original digital vector corresponding to each first path context are determined, and the first semantic analysis result is obtained.

Fig. 4 is a flowchart of step S220 in an exemplary embodiment in the embodiment shown in fig. 2. As shown in fig. 4, in an exemplary embodiment, the process of semantically analyzing the patch code segment to obtain a second semantic analysis result may include steps S410 to S430, which are described in detail as follows:

step S410: and constructing a second abstract syntax tree according to the method contained in the patch code segment.

As described in step S210, the patch code segment includes a plurality of methods, i.e., a plurality of triples, and each method may include a class declaration and a method declaration of the corresponding method. Thus, the second abstract syntax tree may be constructed according to the methods contained in the patch code fragment.

Step S420: determining a second set of leaf nodes from the second abstract syntax tree; wherein the second set of leaf nodes comprises a plurality of second path contexts.

The second abstract syntax tree may include a plurality of second leaf nodes, where a plurality of leaf nodes may correspond to one method, and the leaf nodes corresponding to different methods may intersect, which is not limited in the embodiment of the present application. A plurality of second path contexts may be determined in the second abstract syntax tree. The plurality of second path contexts may be included in the second leaf node set, a certain second path context may include a plurality of second leaf nodes, and paths of different second leaf nodes may form different second path contexts.

Step S430: and calculating a patch digital vector corresponding to each second path context, and taking the patch digital vector corresponding to each second path context as a second semantic analysis result.

In this step, the patch digital vector corresponding to each second path context may be calculated by the Code2Vec algorithm, yielding a plurality of patch digital vectors. The Code2Vec algorithm can predict the method names of the methods in the patch code segment and calculate the patch digital vectors. Each patch digital vector may include the weight of the corresponding second path context, and the path contexts may be aggregated according to their weights to finally obtain the second semantic analysis result, i.e., $T_{text}$. The weight of each second path context may be learned through an attention mechanism during training. The patch digital vector corresponding to each second path context can be treated as one semantic, so that the second semantic analysis result is obtained. It is understood that, in $T_{text} = \{t_{text1}, t_{text2}, \ldots, t_{textl}\}$, $t_{texti}$ ($1 \le i \le l$) represents the i-th patch semantic and may also represent the i-th patch digital vector.

In the embodiment of the application, a second abstract syntax tree is constructed according to the methods contained in the patch code segment, so that a plurality of second path contexts and the patch digital vector corresponding to each second path context are determined, and the second semantic analysis result is obtained.
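The attention-weighted aggregation of path-context vectors mentioned above can be sketched as follows. This is a simplified illustration in the spirit of Code2Vec, not the trained model itself: the attention vector would normally be learned during training, and the function name `aggregate_path_contexts` and the plain-list vector representation are assumptions made for this example.

```python
import math


def aggregate_path_contexts(context_vectors, attention_vector):
    """Softmax-weight each path-context vector and sum them into one vector.

    context_vectors: list of equal-length lists of floats, one per path context.
    attention_vector: list of floats (learned in training; supplied by hand here).
    Returns (weights, aggregated_vector).
    """
    # Attention score of each context: dot product with the attention vector.
    scores = [sum(c * a for c, a in zip(vec, attention_vector))
              for vec in context_vectors]
    # Numerically stable softmax turns the scores into weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the context vectors, dimension by dimension.
    dim = len(context_vectors[0])
    aggregated = [sum(w * vec[d] for w, vec in zip(weights, context_vectors))
                  for d in range(dim)]
    return weights, aggregated
```

Path contexts with a higher attention score contribute more to the aggregated vector, which is how the weight of each path context influences the semantic analysis result.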

Fig. 5 is a flowchart of step S230 in an exemplary embodiment in the embodiment shown in fig. 2. As shown in fig. 5, in an exemplary embodiment, the process of calculating the similarity between the first semantic analysis result and the intention behavior description data to obtain the first similarity value may include steps S510 to S530, which are described in detail as follows:

Step S510: generating a plurality of original embedded vectors from the plurality of original digital vectors.

The first semantic analysis result may include a plurality of original digital vectors, and before the similarity calculation, a plurality of original embedded vectors may be generated from the plurality of original digital vectors.

Specifically, for $S_{text} = \{s_{text1}, s_{text2}, \ldots, s_{textk}\}$, the original embedded vector of $s_{texti}$ ($1 \le i \le k$) may be denoted $s_{veci}$.

A plurality of original embedded vectors can then be obtained, i.e., $S_{vec} = \{s_{vec1}, s_{vec2}, \ldots, s_{veck}\}$.

Step S520: performing similarity distance calculation on each original embedded vector and the embedded vector corresponding to the intention behavior description data to obtain a first similarity distance set.

The embedded vector corresponding to the intention behavior description data ($Int_{text}$) can be denoted $Int_{vec}$.

Each original embedded vector in $S_{vec}$ may in turn be used together with $Int_{vec}$ to calculate a similarity distance, and the formula may be:

$$dist_{orgi} = \mathrm{Distance}(s_{veci}, Int_{vec}), \quad 1 \le i \le k$$

where the function Distance() may be a function that returns the distance between real-valued vectors.

Through the calculation of the above formula, a first similarity distance set can be obtained, namely $dist_{org} = \{dist_{org1}, dist_{org2}, \ldots, dist_{orgk}\}$.

Step S530: determining the first similarity value according to the first similarity distance set.

In the embodiment of the application, similarity distance calculation can be performed on each original embedded vector and the embedded vector corresponding to the intention behavior description data through the Distance() function to obtain the similarity distance between each element of $S_{vec}$ and $Int_{vec}$, which helps improve the efficiency of calculating the similarity distance.
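Steps S510 to S520 can be sketched as below. The embodiment only requires that Distance() return a distance between real-valued vectors; cosine distance is used here as one possible choice and is an assumption of this example, as are the function names `cosine_distance` and `similarity_distance_set`.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def similarity_distance_set(embedded_vectors, intent_vector, distance=cosine_distance):
    """Distance between every embedded vector and the intent embedded vector."""
    return [distance(vec, intent_vector) for vec in embedded_vectors]
```

The same helper applies unchanged to the patch embedded vectors, since only the input set differs.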

Fig. 6 is a flowchart of step S530 in an exemplary embodiment in the embodiment shown in fig. 5. As shown in fig. 6, in an exemplary embodiment, the process of determining the first similarity value according to the first similarity distance set may include step S610 or step S620, which are described in detail as follows:

Step S610: selecting the smallest similarity distance from the first similarity distance set as the first similarity value.

That is, the smallest $dist$ is selected from $dist_{org} = \{dist_{org1}, dist_{org2}, \ldots, dist_{orgk}\}$ as the first similarity value.

Step S620: averaging the plurality of similarity distances contained in the first similarity distance set to obtain a first average similarity distance, and taking the first average similarity distance as the first similarity value.

It can be understood that the first similarity value $= (dist_{org1} + dist_{org2} + \cdots + dist_{orgk}) / k$.

It should be noted that, the determination method of the first similarity value may include, but is not limited to, the above two methods, and a specific calculation method may be designed by a person skilled in the art, and the embodiment of the present application is not limited.

In the embodiment of the application, the first similarity value can be determined by taking the minimum value, taking the average value and the like, and the most representative data can be selected more reasonably as the first similarity value, so that the final result is more reliable and accurate.
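The two ways of reducing a similarity distance set to a single similarity value (minimum or average) can be sketched as follows; the function name `similarity_value` and the `strategy` parameter are hypothetical names chosen for this example.

```python
from statistics import fmean


def similarity_value(distances, strategy="min"):
    """Reduce a similarity distance set to one similarity value.

    strategy="min"  selects the smallest distance (as in step S610);
    strategy="mean" averages all distances (as in step S620).
    """
    if not distances:
        raise ValueError("the similarity distance set must not be empty")
    if strategy == "min":
        return min(distances)
    if strategy == "mean":
        return fmean(distances)
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Other reductions could be added as further strategies, since the embodiment does not limit the determination method to these two.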

Fig. 7 is a flowchart of step S230 in an exemplary embodiment in the embodiment shown in fig. 2. As shown in fig. 7, in an exemplary embodiment, the process of calculating the similarity between the second semantic analysis result and the intention behavior description data to obtain the second similarity value may include steps S710 to S730, which are described in detail as follows:

Step S710: generating a plurality of patch embedding vectors from the plurality of patch digital vectors.

The second semantic analysis result may include a plurality of patch digital vectors, and before the similarity calculation, a plurality of patch embedding vectors may be generated according to the plurality of patch digital vectors.

Specifically, for $T_{text} = \{t_{text1}, t_{text2}, \ldots, t_{textl}\}$, the patch embedding vector of $t_{texti}$ ($1 \le i \le l$) may be denoted $t_{veci}$.

A plurality of patch embedding vectors can then be obtained, i.e., $T_{vec} = \{t_{vec1}, t_{vec2}, \ldots, t_{vecl}\}$.

Step S720: performing similarity distance calculation on each patch embedding vector and the embedding vector corresponding to the intention behavior description data to obtain a second similarity distance set.

The embedded vector corresponding to the intention behavior description data ($Int_{text}$) can be denoted $Int_{vec}$.

Each patch embedding vector in $T_{vec}$ may in turn be used together with $Int_{vec}$ to calculate a similarity distance, and a specific formula may be:

$$dist_{pati} = \mathrm{Distance}(t_{veci}, Int_{vec}), \quad 1 \le i \le l$$

Through the calculation of the above formula, a second similarity distance set can be obtained, namely $dist_{pat} = \{dist_{pat1}, dist_{pat2}, \ldots, dist_{patl}\}$.

Step S730: determining the second similarity value according to the second similarity distance set.

In the embodiment of the application, similarity distance calculation can be performed on each patch embedding vector and the embedding vector corresponding to the intention behavior description data through the Distance() function to obtain the similarity distance between each element of $T_{vec}$ and $Int_{vec}$, which helps improve the efficiency of calculating the similarity distance.

Fig. 8 is a flowchart of step S730 in an exemplary embodiment in the embodiment shown in fig. 7. As shown in fig. 8, in an exemplary embodiment, the process of determining the second similarity value according to the second similarity distance set may include step S810 or step S820, which are described in detail as follows:

Step S810: selecting the smallest similarity distance from the second similarity distance set as the second similarity value.

That is, the smallest $dist$ is selected from $dist_{pat} = \{dist_{pat1}, dist_{pat2}, \ldots, dist_{patl}\}$ as the second similarity value.

Step S820: averaging the plurality of similarity distances contained in the second similarity distance set to obtain a second average similarity distance, and taking the second average similarity distance as the second similarity value.

It can be understood that the second similarity value $= (dist_{pat1} + dist_{pat2} + \cdots + dist_{patl}) / l$.

It should be noted that, the determination method of the second similarity value may include, but is not limited to, the above two methods, and a specific calculation method may be designed by a person skilled in the art, and the embodiment of the present application is not limited.

In the embodiment of the application, the second similarity value can be determined by taking the minimum value, taking the average value and the like, and the most representative data can be more reasonably selected as the second similarity value, so that the final result is more reliable and accurate.

Fig. 9 is a flowchart of step S240 in an exemplary embodiment in the embodiment shown in fig. 2. As shown in fig. 9, in an exemplary embodiment, the process of detecting whether the patch is an over-fit patch according to the first similarity value and the second similarity value may include steps S910 to S930, which are described in detail as follows:

Step S910: performing a difference operation on the first similarity value and the second similarity value to obtain a similarity gain.

The similarity gain may be expressed as $g_m = dist_{org} - dist_{pat}$, where $dist_{org}$ denotes the first similarity value and $dist_{pat}$ denotes the second similarity value.

Step S920: if the similarity gain is greater than or equal to a preset gain threshold, obtaining a detection result representing that the patch is an over-fit patch.

The preset gain threshold may be set by a person skilled in the art and may be 0, for example.

Step S930: if the similarity gain is smaller than the preset gain threshold, obtaining a detection result representing that the patch is not an over-fit patch.

Exemplarily, when $g_m < 0$, the patch is indicated to be an over-fit patch; when $g_m \ge 0$, the patch code segment is better than the original code segment, and the defective program code can be repaired by the patch.

In the embodiment of the application, the patch code segments can be classified according to the preset gain threshold, that is, whether the patch code segments are over-fit patches or not is determined, and the preset gain threshold can be set by a person skilled in the art, so that the method has flexibility and can be applied to different over-fit patch detection scenarios.
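The gain computation and threshold comparison can be sketched as follows. By default the sketch follows the illustrative reading given with the example threshold of 0, in which a negative gain marks the patch as over-fit; because steps S920 and S930 word the comparison in the opposite direction, the direction is exposed as a flag rather than fixed. The function names and the `overfit_below_threshold` flag are assumptions of this example.

```python
def similarity_gain(first_similarity_value, second_similarity_value):
    """Similarity gain g_m = dist_org - dist_pat."""
    return first_similarity_value - second_similarity_value


def is_overfit(gain, threshold=0.0, overfit_below_threshold=True):
    """Classify a patch from its similarity gain.

    overfit_below_threshold=True  -> over-fit when gain <  threshold
                                     (the g_m < 0 example with threshold 0).
    overfit_below_threshold=False -> over-fit when gain >= threshold
                                     (the comparison as worded in steps S920/S930).
    """
    if overfit_below_threshold:
        return gain < threshold
    return gain >= threshold
```

With distances as the similarity values, a patch whose semantics lie closer to the intent than the original code gives a positive gain, so the default flag treats it as not over-fit.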

FIG. 10 is a flowchart illustrating a method for acquiring intention behavior description data according to another exemplary embodiment. As shown in fig. 10, in an exemplary embodiment, the method may be implemented before step S230 in fig. 2, and the method may include steps S1010 to S1020, which are described in detail as follows:

Step S1010: acquiring development intention data; wherein the development intention data includes at least one method name.

The development intention data may be the developer's development intention, stated or implied, that the defective program code would express in the absence of the defect.

Step S1020: extracting the method name contained in the development intention data, and determining the intention behavior description data according to the extracted method name.

The intention behavior description data can be determined by extracting, through a code intention extractor, the method name contained in the development intention data.

In the embodiment of the application, the development intention data is extracted through the code intention extractor, so that the development intention of developers when writing the original code segment can be obtained, the similarity comparison of the semantics of the original code segment and the patch code segment is facilitated, and the patch code segment is classified.
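One simple way to turn an extracted method name into a textual behavior description is to split it into lower-case sub-tokens, as sketched below. The embodiment does not specify how the code intention extractor works internally, so the splitting rule, the regular expression, and the function names `describe_method_name` and `intent_description` are purely illustrative assumptions.

```python
import re

# Sub-token pattern: acronyms (e.g. "HTTP"), capitalised or lower-case words, digit runs.
_SUBTOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def describe_method_name(method_name):
    """Split a camelCase or snake_case method name into a lower-case phrase."""
    return " ".join(token.lower() for token in _SUBTOKEN.findall(method_name))


def intent_description(method_names):
    """Build intention behavior description data from extracted method names."""
    return [describe_method_name(name) for name in method_names]
```

For instance, a method named `getMaxValue` would be described by the phrase "get max value", which can then be embedded and compared with the semantics of the original and patch code segments.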

FIG. 11 is a flowchart illustrating another method of patch detection according to an example embodiment. As shown in fig. 11, the method may be used to describe the complete steps of the method for detecting a patch, and the method may include steps S1110 to S1150, which are described in detail as follows:

S1110, acquiring the patch and the defective program code.

S1120, extracting the patch and the defective program code through the code segment extraction component to obtain an original code segment, development intention data and a patch code segment.

The specific extraction method has been described in detail in the foregoing embodiments, and is not described herein again.

S1130, processing the original code segment and the patch code segment respectively through the code description generator, and processing the development intention data through the code intention extractor.

After the code description generator processes the original code segment, a first semantic analysis result can be obtained; after the patch code segment is processed, a second semantic analysis result can be obtained. After the development intention data is processed by the code intention extractor, intention behavior description data can be obtained. Each specific processing procedure has been described in detail in the foregoing embodiments, and is not described herein again.

S1140, carrying out similarity comparison on the first semantic analysis result and the intention behavior description data to obtain a first similarity value; and comparing the similarity of the second semantic analysis result and the intention behavior description data to obtain a second similarity value.

The first semantic analysis result is the semantics of the original code segment, and the similarity between the semantics of the original code segment and the development intention data, namely the first similarity value, can be obtained through similarity comparison. Similarly, the second semantic analysis result is the semantics of the patch code segment, and the similarity between the semantics of the patch code segment and the development intention data, that is, the second similarity value, can be obtained through similarity comparison.

S1150, classifying the patches according to the first similarity value and the second similarity value through a classifier.

Specifically, if the similarity gain is greater than or equal to the preset gain threshold, a detection result for characterizing that the patch is an overfitting patch is obtained. And if the similarity gain is smaller than a preset gain threshold value, obtaining a detection result for representing that the patch is not an over-fit patch.

In the embodiment of the application, an original code segment and a patch code segment are acquired; semantic analysis is performed on the original code segment to obtain a first semantic analysis result, and on the patch code segment to obtain a second semantic analysis result. Furthermore, intention behavior description data representing the intention behind developing the original code segment is obtained, and similarity calculation is performed on the first semantic analysis result and the second semantic analysis result, each against the intention behavior description data, to obtain a first similarity value and a second similarity value. Finally, whether the patch is an over-fit patch can be detected according to the first similarity value and the second similarity value. By this method, over-fit patches can be detected among the patches generated by automatic program repair, so that over-fit patches are kept from being applied to repair the program, and the reliability of automatic program repair is improved.
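The overall flow from embedded vectors to a detection result can be wired together as in the sketch below. It assumes the embedded vectors have already been produced by upstream components, uses Euclidean distance (`math.dist`) as a stand-in for Distance(), and takes the classification rule as a caller-supplied function so that whichever comparison direction and threshold an embodiment adopts can be plugged in. The function name `detect_patch` and the result keys are hypothetical.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def detect_patch(
    original_vectors: Sequence[Vector],
    patch_vectors: Sequence[Vector],
    intent_vector: Vector,
    is_overfit: Callable[[float], bool],
    distance: Callable[[Vector, Vector], float] = math.dist,
    reduce: Callable[[Sequence[float]], float] = min,
) -> dict:
    """Compute both similarity values, the similarity gain, and the verdict."""
    # First similarity value: original code semantics versus the intent.
    first_value = reduce([distance(s, intent_vector) for s in original_vectors])
    # Second similarity value: patch code semantics versus the intent.
    second_value = reduce([distance(t, intent_vector) for t in patch_vectors])
    gain = first_value - second_value
    return {
        "first_similarity_value": first_value,
        "second_similarity_value": second_value,
        "gain": gain,
        "overfit": is_overfit(gain),  # rule supplied by the caller
    }
```

A caller might pass `is_overfit=lambda g: g < 0` to apply the illustrative zero-threshold rule, or a different predicate for another preset gain threshold.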

Fig. 12 is a schematic diagram illustrating a structure of a patch detection apparatus according to an exemplary embodiment. As shown in fig. 12, in an exemplary embodiment, the patch detecting device includes:

an obtaining unit 1210 configured to obtain an original code segment and a patch code segment; the patch code segment is a code segment obtained after adding a patch to the original code segment;

the processing unit 1220 is configured to perform semantic analysis on the original code segment to obtain a first semantic analysis result, and perform semantic analysis on the patch code segment to obtain a second semantic analysis result;

the calculating unit 1230 is configured to perform similarity calculation on the first semantic analysis result and the intentional behavior description data to obtain a first similarity value, and perform similarity calculation on the second semantic analysis result and the intentional behavior description data to obtain a second similarity value; wherein, the intention behavior description data is used for representing the intention corresponding to the original code segment;

a detecting unit 1240, configured to detect whether the patch is an over-fit patch according to the first similarity value and the second similarity value.

With this structure, over-fit patches can be detected among the patches generated by automatic program repair, so that over-fit patches are kept from being applied to repair the program, and the reliability of automatic program repair is improved.

In an embodiment, the processing unit 1220 is further configured to construct a first abstract syntax tree according to a method included in the original code segment; determining a first set of leaf nodes from the first abstract syntax tree; wherein the first set of leaf nodes comprises a plurality of first path contexts;

the calculating unit 1230 is further configured to calculate an original digital vector corresponding to each first path context, and use the original digital vector corresponding to each first path context as the first semantic analysis result.

In an embodiment, the processing unit 1220 is further configured to construct a second abstract syntax tree according to a method included in the patch code segment; determining a second set of leaf nodes from the second abstract syntax tree; wherein the second set of leaf nodes comprises a plurality of second path contexts;

the calculating unit 1230 is further configured to calculate a patch digital vector corresponding to each second path context, and use the patch digital vector corresponding to each second path context as the second semantic analysis result.

In one embodiment, the first semantic analysis result comprises a plurality of original digital vectors; the processing unit 1220 is further configured to generate a plurality of original embedded vectors from the plurality of original digital vectors;

the calculating unit 1230 is further configured to perform similarity distance calculation on each original embedded vector and the embedded vector corresponding to the intended behavior description data, so as to obtain a first similarity distance set;

the processing unit 1220 is further configured to determine the first similarity value according to the first similarity distance set.

In an embodiment, the processing unit 1220 is further configured to select a minimum similarity distance from the first similarity distance set as the first similarity value; or

The calculating unit 1230 is further configured to perform an averaging operation on the multiple similarity distances included in the first similarity distance set to obtain a first average similarity distance, and use the first average similarity distance as the first similarity value.

In one embodiment, the second semantic analysis result includes a plurality of patch digital vectors; the processing unit 1220 is further configured to generate a plurality of patch embedding vectors from the plurality of patch digital vectors;

the calculating unit 1230 is further configured to perform similarity distance calculation on each patch embedding vector and the embedding vector corresponding to the intention behavior description data, so as to obtain a second similarity distance set;

the processing unit 1220 is further configured to determine the second similarity value according to the second similarity distance set.

In an embodiment, the processing unit 1220 is further configured to select a minimum similarity distance from the second similarity distance set as the second similarity value; or

The calculating unit 1230 is further configured to perform an averaging operation on the multiple similarity distances included in the second similarity distance set to obtain a second average similarity distance, and use the second average similarity distance as the second similarity value.

In an embodiment, the processing unit 1220 is further configured to perform a difference operation on the first similarity value and the second similarity value to obtain a similarity gain; if the similarity gain is greater than or equal to a preset gain threshold, obtain a detection result for characterizing that the patch is an over-fit patch; and if the similarity gain is smaller than the preset gain threshold, obtain a detection result for representing that the patch is not an over-fit patch.

In an embodiment, before performing similarity calculation on the first semantic analysis result and the intention behavior description data to obtain a first similarity value, the obtaining unit 1210 is further configured to obtain development intention data; wherein the development intention data includes at least one method name;

the processing unit 1220 is further configured to extract the method name included in the development intention data, and determine intention behavior description data according to the extracted method name.

It should be noted that the apparatus for detecting a patch provided in the foregoing embodiment and the method for detecting a patch provided in the foregoing embodiment belong to the same concept, and specific ways of executing operations by each module and unit have been described in detail in the method embodiment, and are not described herein again.

An embodiment of the present application further provides an electronic device, including: one or more processors; the storage device is configured to store one or more programs, and when the one or more programs are executed by the one or more processors, the electronic device is enabled to implement the patch detection method provided in each of the above embodiments.

FIG. 13 illustrates a schematic structural diagram of a computer system suitable for implementing the electronic device of the embodiments of the present application.

It should be noted that the computer system 1300 of the electronic device shown in fig. 13 is only an example, and should not bring any limitation to the functions and the application scope of the embodiments of the present application.

As shown in fig. 13, a computer system 1300 includes a Central Processing Unit (CPU) 1301 that can perform various appropriate actions and processes, such as performing the methods in the above-described embodiments, according to a program stored in a Read-Only Memory (ROM) 1302 or a program loaded from a storage portion 1308 into a Random Access Memory (RAM) 1303. The RAM 1303 also stores various programs and data necessary for system operation. The CPU 1301, the ROM 1302, and the RAM 1303 are connected to each other via a bus 1304. An Input/Output (I/O) interface 1305 is also connected to the bus 1304.

The following components are connected to the I/O interface 1305: an input portion 1306 including a keyboard, a mouse, and the like; an output section 1307 including, for example, a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 1308 including a hard disk and the like; and a communication section 1309 including a network interface card such as a Local Area Network (LAN) card, a modem, and the like. The communication section 1309 performs communication processing via a network such as the internet. A drive 1310 is also connected to the I/O interface 1305 as needed. A removable medium 1311 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive 1310 as needed, so that a computer program read from it can be installed into the storage portion 1308 as needed.

In particular, according to embodiments of the application, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program carried on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication section 1309 and/or installed from the removable medium 1311. When the computer program is executed by the Central Processing Unit (CPU) 1301, various functions defined in the system of the present application are executed.

It should be noted that the computer readable media shown in the embodiments of the present application may be computer readable signal media or computer readable storage media or any combination of the two. The computer readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with a computer program embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. The computer program embodied on the computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present application may be implemented by software or by hardware, and the described units may also be disposed in a processor. The names of the units do not, in some cases, constitute a limitation on the units themselves.

Another aspect of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the foregoing patch detection method. The computer-readable storage medium may be included in the electronic device described in the above embodiment, or may exist alone without being assembled into the electronic device.

Another aspect of the application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the patch detection method provided in the above embodiments.

The above description is only a preferred exemplary embodiment of the present application, and is not intended to limit the embodiments of the present application, and those skilled in the art can easily make various changes and modifications according to the main concept and spirit of the present application, so that the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for detecting a patch, comprising:

acquiring an original code segment and a patch code segment; the patch code segment is a code segment obtained after adding a patch to the original code segment;

performing semantic analysis on the original code segment to obtain a first semantic analysis result, and performing semantic analysis on the patch code segment to obtain a second semantic analysis result;

performing similarity calculation on the first semantic analysis result and the intention behavior description data to obtain a first similarity value, and performing similarity calculation on the second semantic analysis result and the intention behavior description data to obtain a second similarity value; wherein, the intention behavior description data is used for representing the intention corresponding to the development of the original code segment;

and detecting whether the patch is an over-fit patch or not according to the first similarity value and the second similarity value.

2. The method as claimed in claim 1, wherein the performing semantic analysis on the original code segment to obtain a first semantic analysis result includes:

constructing a first abstract syntax tree according to the method contained in the original code segment;

determining a first set of leaf nodes from the first abstract syntax tree; wherein the first set of leaf nodes comprises a plurality of first path contexts;

and calculating an original digital vector corresponding to each first path context, and taking the original digital vector corresponding to each first path context as the first semantic analysis result.

3. The method as claimed in claim 1, wherein the performing semantic analysis on the patch code segment to obtain a second semantic analysis result includes:

constructing a second abstract syntax tree according to the method contained in the patch code segment;

determining a second set of leaf nodes from the second abstract syntax tree; wherein the second set of leaf nodes comprises a plurality of second path contexts;

and calculating a patch digital vector corresponding to each second path context, and taking the patch digital vector corresponding to each second path context as the second semantic analysis result.

4. The method of claim 1, wherein the first semantic analysis result comprises a plurality of original digital vectors; the calculating the similarity of the first semantic analysis result and the intention behavior description data to obtain a first similarity value includes:

generating a plurality of original embedded vectors from the plurality of original digital vectors;

performing similarity distance calculation on each original embedded vector and the embedded vector corresponding to the intention behavior description data to obtain a first similarity distance set;

determining the first similarity value according to the first similarity distance set.

5. The method of claim 4, wherein determining the first similarity value from the first set of similarity distances comprises:

selecting a minimum similarity distance from the first set of similarity distances as the first similarity value; or

performing an averaging operation on a plurality of similarity distances contained in the first similarity distance set to obtain a first average similarity distance, and taking the first average similarity distance as the first similarity value.

6. The method of claim 1, wherein the second semantic analysis result comprises a plurality of patch digital vectors; the calculating the similarity between the second semantic analysis result and the intention behavior description data to obtain a second similarity value includes:

generating a plurality of patch embedding vectors from the plurality of patch digital vectors;

performing similarity distance calculation on each patch embedding vector and an embedding vector corresponding to the intention behavior description data to obtain a second similarity distance set;

and determining the second similarity value according to the second similarity distance set.

7. The method of claim 6, wherein determining the second similarity value according to the second similarity distance set comprises:

selecting a minimum similarity distance from the second similarity distance set as the second similarity value; or

averaging the plurality of similarity distances contained in the second similarity distance set to obtain a second average similarity distance, and taking the second average similarity distance as the second similarity value.

8. The method of claim 1, wherein detecting whether the patch is an over-fit patch according to the first similarity value and the second similarity value comprises:

performing a difference operation on the first similarity value and the second similarity value to obtain a similarity gain;

if the similarity gain is greater than or equal to a preset gain threshold, obtaining a detection result indicating that the patch is an over-fit patch;

and if the similarity gain is smaller than the preset gain threshold, obtaining a detection result indicating that the patch is not an over-fit patch.
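The threshold decision can be sketched in a few lines. The claim does not state the order of subtraction, so taking the first similarity value minus the second is an assumption of this sketch. The function names and the threshold used in the example calls are hypothetical.

```python
def similarity_gain(first_value, second_value):
    """Difference of the two similarity values (subtraction order is assumed)."""
    return first_value - second_value


def is_overfit_patch(first_value, second_value, gain_threshold):
    """Return True when the gain reaches the preset gain threshold."""
    return similarity_gain(first_value, second_value) >= gain_threshold


# Hypothetical values, chosen only to exercise both branches.
flagged = is_overfit_patch(1.0, 0.5, gain_threshold=0.5)
accepted = is_overfit_patch(0.75, 0.5, gain_threshold=0.5)
```

A gain exactly equal to the threshold counts as over-fit, matching the "greater than or equal to" condition in the claim.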

9. The method according to any one of claims 1 to 7, wherein, before performing similarity calculation on the first semantic analysis result and the intention behavior description data to obtain the first similarity value, the method further comprises:

acquiring development intention data; wherein the development intention data includes at least one method name;

and extracting the method name contained in the development intention data, and determining the intention behavior description data according to the extracted method name.
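One possible way to derive intention behavior description data from a method name is sketched below. It assumes that the development intention data is a method signature string and that the description is the list of lowercase sub-tokens of the method name. Neither assumption is stated in the claim, and the helper names are hypothetical.

```python
import re


def extract_method_names(intention_data):
    """Find identifiers that are immediately followed by an opening parenthesis."""
    return re.findall(r"\b([A-Za-z_]\w*)\s*\(", intention_data)


def to_description(method_name):
    """Split a camelCase or snake_case method name into lowercase sub-tokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", method_name)
    return [p.lower() for p in parts]


names = extract_method_names("public int findMaxValue(int[] values)")
description = to_description(names[0])
```

The sub-token list could then be embedded in the same vector space as the code embeddings so that the similarity distances in claims 4 and 6 are well defined.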

10. An apparatus for detecting a patch, comprising:

an obtaining unit, configured to obtain an original code segment and a patch code segment; wherein the patch code segment is a code segment obtained after adding a patch to the original code segment;

a processing unit, configured to perform semantic analysis on the original code segment to obtain a first semantic analysis result, and to perform semantic analysis on the patch code segment to obtain a second semantic analysis result;

a calculation unit, configured to perform similarity calculation on the first semantic analysis result and intention behavior description data to obtain a first similarity value, and to perform similarity calculation on the second semantic analysis result and the intention behavior description data to obtain a second similarity value; wherein the intention behavior description data represents the development intention corresponding to the original code segment; and

a detecting unit, configured to detect whether the patch is an over-fit patch according to the first similarity value and the second similarity value.

11. An electronic device, comprising:

one or more processors;

a memory for storing one or more programs that, when executed by the electronic device, cause the electronic device to implement a method of detecting a patch as claimed in any one of claims 1 to 9.

12. A computer-readable medium storing a computer program which, when executed by a processor, implements the method of detecting a patch according to any one of claims 1 to 9.