CN117130645A

CN117130645A - Automatic program repairing method and system based on large language model and completion engine

Info

Publication number: CN117130645A
Application number: CN202311384703.4A
Authority: CN
Inventors: 胡鹏飞; 郝立鹏; 刘健雄; 李峰; 金岩
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2023-10-25
Filing date: 2023-10-25
Publication date: 2023-11-28

Abstract

The application belongs to the field of program repair, and in particular relates to an automatic program repair method and system based on a large language model and a completion engine. In addition, exploiting the completion engine's ability to provide code completion suggestions, when the completion engine offers only one candidate identifier suffix, that suffix is adopted directly to complete the context. This not only allows the proposed method to generate valid, rare and long identifiers in patch code, but also reduces the number of iterative large-language-model generation steps that long identifier names would otherwise require.

Description

Automatic program repairing method and system based on large language model and completion engine

Technical Field

The application belongs to the field of program repair, and particularly relates to an automatic program repair method and system based on a large language model and a completion engine.

Background

As software systems become more complex, so do software vulnerabilities. Traditional state-of-the-art automatic program repair tools rely primarily on hand-written repair templates, which match defective code and apply the corresponding code patches. Although they outperform other conventional techniques, such tools can only repair error types covered by a preset template and cannot generalize to new vulnerability types.

With the development of deep learning, researchers have built automatic program repair tools on neural machine translation architectures. The prior art CN116755753A uses a trained neural machine translation model to convert defective code into correct code, learning from defective code and the corresponding repaired code crawled from the code commits of open source projects. However, the training sets of these tools may be limited in size and may contain irrelevant or noisy data, which reduces the accuracy of model training.

Recently, with the development of large language models, a growing body of research has shown that they can help developers complete various coding tasks and can also be applied directly to automatic program repair to generate patch code. However, most large language models treat a program as a sequence of identifiers, which means that they do not understand the underlying semantic constraints of the target programming language. This can lead to large amounts of invalid patch code, reducing the practicality of the technique.

Disclosure of Invention

To solve the above problems, this patent provides an automatic program repair method based on a large language model and a programming-language auto-completion engine. By fusing the large language model with the completion engine, the method generates more valid patch code and uses artificial intelligence technology to improve the accuracy and usability of automatic program repair. The completion engine can parse incomplete program code and infer code semantics in a highly fault-tolerant manner. The autoregressive identifier generation process of a large language model can be compared to a human developer writing code, with the completion engine providing real-time feedback on whether the partial code written by the human or the large language model is valid. The technical solution is as follows.

An automatic program repair method based on a large language model and a completion engine comprises the following steps:

S1, generating an identifier library T(t_1, t_2, …, t_n) and the corresponding probabilities P(p_1, p_2, …, p_n);

S2, sampling the identifiers in the identifier library T in a set order to obtain an identifier t_i (1 ≤ i ≤ n), and checking whether a record corresponding to the identifier is stored in the memory: if the sampled identifier t_i hits a record in the memory, completion engine I does not need to be called, and the identifier is either retained or pruned directly, depending on the record; if the sampled identifier t_i does not hit the memory, completion engine I is called to generate a completion result from the sampled identifier t_i, the current program code context and the current insert location; if the result is not unknown and the completion result is empty, the sampled identifier t_i is judged to be rejected, its corresponding probability p_i (1 ≤ i ≤ n) is set to zero to complete the pruning operation, and the memory record is updated; otherwise the sampled identifier t_i is judged to be adopted and the memory record is updated; sampling of the identifiers in the identifier library T then continues, and step S2 is repeated until all identifiers in the identifier library T have been sampled;

S3, completion engine II actively completes the identifier string for the updated identifier library T, giving a series of candidate completion identifiers as candidate continuations of the adopted identifier string; steps S1-S3 are repeated with the updated identifier library until all identifier strings have been completed, forming a complete patch code.

Preferably, a large language model is used to provide the identifier library T(t_1, t_2, …, t_n) and the corresponding probability library P(p_1, p_2, …, p_n); the identifier library T and the corresponding probability library P are mapped onto the search space, and sampling proceeds in order of probability from high to low.

Preferably, the pruning operation is as follows: the pruning module sets the probability p_i (1 ≤ i ≤ n) corresponding to the identifier t_i (1 ≤ i ≤ n) in the search space to zero; an identifier whose probability is zero is not sampled.

Preferably, while the identifier library T is being sampled and judged in step S2, the large language model is not invoked again to generate a new identifier library; that is, step S2 only judges the current output of the large language model.

Preferably, the patch code generation method comprises the following steps:

first, a <SPAN> identifier is used as a mask to replace the vulnerable code block, forming a patch skeleton;

then, the large language model replaces the <SPAN> identifier, synthesizing the repair code patch from the code context surrounding the vulnerability location.
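The masking step can be sketched as follows. This is only an illustration under stated assumptions: the helper names `mask_buggy_lines` and `fill_span`, the line-based granularity and the Java-like sample source are hypothetical and are not specified by the patent, and the patch string passed to `fill_span` stands in for the output of the large language model.

```python
SPAN = "<SPAN>"  # mask identifier named in the text


def mask_buggy_lines(source: str, start: int, end: int) -> str:
    """Replace 1-indexed lines start..end (inclusive) with a single <SPAN> line.

    Assumption: the fault location is given as a contiguous line range.
    """
    lines = source.splitlines()
    first = lines[start - 1]
    indent = first[: len(first) - len(first.lstrip())]  # keep original indentation
    return "\n".join(lines[: start - 1] + [indent + SPAN] + lines[end:])


def fill_span(masked: str, patch: str) -> str:
    """Substitute the generated repair code for the <SPAN> mask."""
    return masked.replace(SPAN, patch, 1)


source = "int f(int a) {\n    return a - 1;\n}"
masked = mask_buggy_lines(source, 2, 2)        # patch skeleton
repaired = fill_span(masked, "return a + 1;")  # patch text would come from the model
```

The surrounding lines are left untouched, so the model sees the same context a developer would see around the fault location.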

Preferably, in step S2, the memory stores identifiers known to be infeasible and/or feasible, and an identifier can hit the memory in three cases:

if a rejected record in the memory is hit, the current identifier is judged unavailable, and the pruning module prunes the search space directly;

if the prefix tree (Trie) of rejected identifiers in the memory is hit, it is checked whether any identifier in the hit record is a prefix of the newly generated t_i (1 ≤ i ≤ n); if so, the pruning module prunes the search space directly, and sampling from the identifier library T is performed again;

if a feasible record stored in the memory is hit, the adopted identifier is retained.

Preferably, in step S2, the search space pruning process is as follows:

first, the candidate next identifier is sampled according to the mapping of the identifier library T and the corresponding probability P given by the large language model, the current program code body is updated accordingly, and the insert location is moved to just after the newly sampled identifier t_i (1 ≤ i ≤ n);

then, completion engine I is called on the currently generated identifier string to check it; if the result is not unknown and there are no completions, no candidate continuation can be formed after the currently generated identifier, so the pruning module sets the probability p_i (1 ≤ i ≤ n) corresponding to the identifier t_i (1 ≤ i ≤ n) in the search space to zero to prune it, an identifier is resampled from the identifier library T, and the next cycle is executed; otherwise, the current identifier is considered feasible.

Preferably, in step S2, the memory builds a prefix tree of rejected identifiers, specifically as follows: if an identifier is rejected, meaning that no candidate continuation can be formed after it to obtain statically valid patch code, then any identifier having that identifier as a prefix should also be rejected; a prefix tree of all rejected identifiers is therefore built for the given program code body and insert location, and it is checked whether any identifier in the tree is a prefix of the newly generated next identifier; if so, an identifier is resampled from T.

Preferably, in step S3, the active completion operation is as follows:

a completion result is obtained from the given program code body and the current insert location, and it is checked whether the result for the current identifier string is unknown; if so, the result is set to an empty string, meaning that no additional completion is generated; otherwise, the common prefix of all completion results is computed, adjusted to fit the vocabulary of the language model, and returned.

An automatic program repair system based on a large language model and a completion engine comprises a large language model, a pruning module, completion engine I and completion engine II;

the large language model is used for providing the identifier library T(t_1, t_2, …, t_n) and the corresponding probabilities P(p_1, p_2, …, p_n), and for mapping the identifier library T and the corresponding probabilities P onto a search space;

the pruning module is used for checking whether the identifier t_i (1 ≤ i ≤ n) sampled from the identifier library T hits the contents of the memory, and for pruning the search space accordingly;

when the identifier t_i does not hit the memory, completion engine I, on receiving the memory query result, performs a completion operation on the identifier string; if the result is not unknown and the completion result is empty, the pruning module is triggered to prune the search space; otherwise the identifier t_i is adopted;

completion engine II actively completes the identifier string, giving a series of candidate completion identifiers as candidate continuations of the currently adopted identifier string, and generates a completion result.

Compared with the prior art, the application has the following beneficial effects:

1) The method targets single-vulnerability repair: patches are developed on the premise that the faulty code location has been accurately identified, and the patch code is obtained by modifying a contiguous piece of code at that location. Furthermore, the proposed method can replace the code at each fault location with a separate infilling identifier and have the large language model generate the replacement code, which extends it to patching multiple buggy locations. The proposed method can be applied directly to general-purpose programming languages and introduces minimal overhead; the completion engine can actively complete the identifier currently being generated for the patch without calling the large language model many times.

2) The method first uses the large language model to provide the probabilities of the next identifier of the patch code being generated, then queries the completion engine and modifies the probability list by dynamically zeroing the probabilities of invalid identifiers, and finally selects the next identifier from the new probability list. In addition, exploiting the completion engine's ability to provide code completion suggestions, when the completion engine offers only one candidate identifier suffix, that suffix is adopted directly to complete the context. This not only allows the proposed method to generate valid, rare and long identifiers in patch code, but also reduces the number of iterative large-language-model generation steps that long identifier names would otherwise require.

Drawings

FIG. 1 is a flow chart of the present application.

FIG. 2 is a diagram of an embodiment of the present application.

Detailed Description

The following detailed description is exemplary and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application.

S1, a large language model provides the identifier library T(t_1, t_2, …, t_n) and the corresponding probability library P(p_1, p_2, …, p_n); the identifier library T and the corresponding probability library P are mapped onto the search space, and sampling proceeds in order of probability from high to low.

S2, sampling the identifiers in the identifier library T in a set order to obtain an identifier t_i (1 ≤ i ≤ n), and checking whether a record corresponding to the identifier is stored in the memory: if the sampled identifier t_i hits a record in the memory, the pruning operation is carried out directly without calling completion engine I; if the sampled identifier t_i does not hit the memory, completion engine I is called to generate a completion result from the sampled identifier t_i, the current program code context and the current insert location; if the result is not unknown and the completion result is empty, the sampled identifier t_i is judged to be rejected, its corresponding probability p_i (1 ≤ i ≤ n) is set to zero to complete the pruning operation, and the memory record is updated; otherwise the sampled identifier t_i is judged to be adopted and the memory record is updated. Sampling of the identifiers in the identifier library T then continues, and step S2 is repeated until all identifiers in the identifier library T have been sampled. Step S2 can be regarded as the inner identifier-sampling loop.

The judgment condition of completion engine I, "the result is not unknown and the completion result is empty", corresponds to the following logic:

(1) The sampled t_i is appended to the already generated code string to form an identifier string S'.

(2) S' is passed to completion engine I, which judges whether S' is legal. If S' is legal, then either t_i is the last word (the completion result is unknown), or possible continuations can still be generated on the basis of S' (the completion result is not empty); conversely, if the result produced by completion engine I is not unknown and the completion result is empty, S' is illegal, i.e. t_i is not suitable for placement after the already generated identifier string, and t_i is rejected.

(3) Based on the result in (2), t_i is judged to be rejected or adopted.
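The three-step judgment above reduces to a single predicate. The sketch below assumes the completion engine's answer can be modelled as an `unknown` flag plus a list of candidate continuations; the `CompletionResult` type and the `is_rejected` name are hypothetical and only serve to make the condition concrete.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CompletionResult:
    """Assumed shape of what completion engine I returns for the string S'."""
    unknown: bool                                   # engine cannot decide
    completions: List[str] = field(default_factory=list)


def is_rejected(result: CompletionResult) -> bool:
    """t_i is rejected only when the result is not unknown AND no completion exists."""
    return (not result.unknown) and len(result.completions) == 0


undecided = CompletionResult(unknown=True)                            # adopted
has_continuations = CompletionResult(unknown=False, completions=["gth"])  # adopted
dead_end = CompletionResult(unknown=False)                            # rejected
```

Note that an unknown result is treated conservatively: the identifier is kept rather than pruned, so the check never removes an identifier the engine could not analyse.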

The pruning operation is as follows: the pruning module sets the probability p_i (1 ≤ i ≤ n) corresponding to the identifier t_i (1 ≤ i ≤ n) in the search space to zero; an identifier whose probability is zero is not sampled.
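A minimal sketch of the pruning operation, assuming the search space is held as a plain dictionary from identifier to probability. The function names `prune` and `sample_highest` are illustrative and do not come from the patent; the sample values mirror the identifier library used in the embodiment (String, name, end).

```python
from typing import Dict, Optional


def prune(probs: Dict[str, float], identifier: str) -> None:
    """Zero the probability of a rejected identifier so it is never sampled again."""
    probs[identifier] = 0.0


def sample_highest(probs: Dict[str, float]) -> Optional[str]:
    """Return the remaining identifier with the highest non-zero probability."""
    live = {t: p for t, p in probs.items() if p > 0.0}
    if not live:
        return None  # the whole search space has been pruned
    return max(live, key=live.get)


probs = {"String": 0.91, "name": 0.03, "end": 0.002}
first = sample_highest(probs)   # highest-probability identifier
prune(probs, first)             # suppose the completion engine rejected it
second = sample_highest(probs)  # sampling continues on the pruned space
```

Zeroing the entry instead of deleting it keeps the identifier library and the probability library aligned, which matches the description of p_i being set to zero.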

In step S2, while the identifier library T is being sampled and judged, the large language model is not invoked again to generate a new identifier library; that is, step S2 only judges the current output of the large language model.

If the memory already stores a record for the sampled identifier t_i (1 ≤ i ≤ n), that is, a record in the memory is hit, the decision is made directly from that record without calling completion engine I.

The memory stores identifiers known to be rejected, identifiers known to be adopted, and a prefix tree of rejected identifiers; an identifier record can be hit in three cases:

(1) If a rejected record in the memory is hit, the current identifier t_i (1 ≤ i ≤ n) is judged unavailable; the pruning module prunes the search space directly, and sampling from the identifier library T continues.

(2) If an adopted record in the memory is hit, the current identifier t_i (1 ≤ i ≤ n) is judged available; its probability is kept, and sampling from the identifier library T continues.

(3) If the prefix tree (Trie) of rejected identifiers in the memory is hit, it is checked whether any identifier in the hit record is a prefix of the new identifier string; if so, the pruning module prunes the search space directly, and an identifier is resampled from the identifier library T.

The memory builds the prefix tree of rejected identifiers as follows: if an identifier is rejected, meaning that no candidate continuation can be formed after it to obtain statically valid patch code, then any identifier having that identifier as a prefix should also be rejected; a prefix tree of all rejected identifiers is therefore built for the given program code body and insert location, and it is checked whether any identifier in the tree is a prefix of the newly generated next identifier; if so, an identifier is resampled from T.

If the currently sampled identifier is not stored in the memory, completion engine I is called to generate a completion result from the sampled identifier t_i (1 ≤ i ≤ n), the current program code context and the current insert location; if the result is not unknown and the completion result is empty, the sampled identifier t_i is judged to be rejected, its corresponding probability p_i (1 ≤ i ≤ n) is set to zero to complete the pruning operation, and the memory record is updated; otherwise the sampled identifier t_i is judged to be adopted and the memory record is updated.

S3, completion engine II actively completes the identifier string, giving a series of possible completion identifiers as candidate continuations of the currently adopted identifier string; steps S1-S3 are repeated with the updated identifier library until all identifier strings have been completed, forming a complete patch code.

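The patch code generation method comprises the following steps: first, a <SPAN> identifier is used as a mask to replace the vulnerable code block, forming a patch skeleton; then, the large language model replaces the <SPAN> identifier, synthesizing the repair code patch from the code context surrounding the vulnerability location.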

In step S3, a completion result is obtained from the given program code body and the current insert location, and it is checked whether the result for the current identifier string is unknown; if so, the result is set to an empty string, meaning that no additional completion is generated; otherwise, the common prefix of all completion results is computed, adjusted to fit the vocabulary of the language model, and returned.
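The active completion of step S3 can be sketched as below. The sketch assumes the completion engine returns either `None` for an unknown result or a list of candidate suffixes; the function name `active_completion` is hypothetical. The step that adjusts the result to the language model's vocabulary is deliberately left out, because the text does not say how that adjustment is performed.

```python
import os
from typing import List, Optional


def active_completion(completions: Optional[List[str]]) -> str:
    """Return the text that every candidate continuation agrees on.

    completions is None when the engine's result is unknown; in that case
    (and when there are no candidates) nothing is completed.
    """
    if not completions:
        return ""
    # os.path.commonprefix compares strings character by character.
    return os.path.commonprefix(completions)


single = active_completion(["gth()"])                   # one candidate: take it all
shared = active_completion(["getName", "getNumber"])    # only the shared prefix
nothing = active_completion(None)                       # unknown result
```

When only one candidate exists, the common prefix is the candidate itself, which is how a long identifier can be finished without further calls to the large language model.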

The entire steps S1-S3 can be seen as one large outer loop.

The large language model is used for providing the identifier library T(t_1, t_2, …, t_n) and the corresponding probabilities P(p_1, p_2, …, p_n), and for mapping the identifier library T and the corresponding probabilities P onto a search space;

when the identifier t_i does not hit the memory, completion engine I, on receiving the memory query result, performs a completion operation on the identifier string; if the result is not unknown and the completion result is empty, the pruning module is triggered to prune the search space; otherwise the identifier t_i is adopted.

An embodiment, as shown in fig. 2, illustrates a process for implementing patch code generation proposed by the present patent.

The generation process consists of an inner loop and an outer loop; the outer loop repeatedly uses the cooperation between the large language model and completion engine II to generate new identifiers and update the generated result.

First, the outer loop feeds the currently generated identifiers to the large language model ((1) in Fig. 2); the large language model returns the identifier library T (String, name, end, …) and the corresponding probability library P (91%, 3%, 0.2%, …) for the given context, and these are mapped onto the search space of the inner loop, where sampling proceeds in order of probability from high to low. The inner loop then enters the identifier sampling phase: it repeatedly samples an identifier from the search space, checks its feasibility, and prunes the search space until an identifier is adopted.

Each time the identifier library T is sampled to obtain t_i (1 ≤ i ≤ n), it is first checked whether the identifier t_i hits the contents of the memory ((2) in Fig. 2); the memory stores records of identifiers known to be feasible or infeasible. The memory record of infeasible identifiers includes a custom prefix tree data structure (Trie), which is described later.

If the sampled identifier t_i hits an infeasible record in the memory, its probability p_i (1 ≤ i ≤ n) is set to zero ((3) in Fig. 2) to prune the search space, and the next sampling operation takes place on the updated search space. In this way, the same identifier is not resampled during the identifier selection phase, avoiding useless work. If the sampled identifier t_i does not hit the memory (for example, the identifier Name is not stored in the memory), completion engine I is called to generate a completion result from the sampled identifier t_i, the current program code context and the current insert location; if the result is not unknown and the completion result is empty, the sampled identifier t_i is judged to be rejected, its corresponding probability p_i (1 ≤ i ≤ n) is set to zero to complete the pruning operation, and the memory record is updated; otherwise the sampled identifier t_i is judged to be adopted and the memory record is updated. In both cases, the memory is updated ((5) in Fig. 2). After the identifier t_i is adopted ((6) in Fig. 2), completion engine II is further used to try to actively complete the identifier string ((7) in Fig. 2).

During the inner loop, the large language model is not invoked again to generate a new identifier library; that is, the inner loop only judges the current output of the large language model.

Finally, the proposed method appends all newly generated, adopted identifiers to the current generation result and starts a new loop, until a complete patch code is generated. The loop stops when the model generates a special flag indicating the end.
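The interplay of the outer and inner loops can be illustrated with a toy end-to-end sketch. Everything in it is an assumption made for illustration: `toy_llm` stands in for the large language model and returns a fixed table, `toy_engine` stands in for both completion engines and simply knows two valid patches, and the memory and the "unknown" result are omitted to keep the sketch short.

```python
import os
from typing import Dict, List

END = "<END>"  # assumed special end flag
VALID_PATCHES = ["name.length()", "name.isEmpty()"]  # toy stand-in for static analysis


def toy_llm(code: str) -> Dict[str, float]:
    """Hypothetical model: the same identifier/probability table at every step."""
    return {"size": 0.50, "len": 0.30, "name.": 0.15, END: 0.05}


def toy_engine(code: str) -> List[str]:
    """Hypothetical engine: continuations that keep `code` statically valid."""
    return [v[len(code):] for v in VALID_PATCHES if v.startswith(code)]


def generate_patch(max_steps: int = 20) -> str:
    code = ""
    for _ in range(max_steps):                  # outer loop: one model call per step
        probs = dict(toy_llm(code))             # S1: identifier library and probabilities
        while True:                             # inner loop (S2): sample, check, prune
            live = {t: p for t, p in probs.items() if p > 0.0}
            if not live:
                return code                     # search space exhausted
            token = max(live, key=live.get)     # highest probability first
            if token == END:
                if code in VALID_PATCHES:
                    return code                 # end flag accepted: patch complete
                probs[token] = 0.0              # ending here would be invalid
                continue
            if toy_engine(code + token):        # non-empty completions: adopted
                code += token
                break
            probs[token] = 0.0                  # empty completions: pruned
        code += os.path.commonprefix(toy_engine(code))  # S3: active completion
    return code


patch = generate_patch()
```

In this toy run the model is consulted three times: "name." is adopted after "size" and "len" are pruned, then "len" is adopted and actively completed to "length()", and finally the end flag is accepted.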

To mitigate the efficiency loss caused by the sample, check and prune cycle of the inner loop and to speed up the search, the proposed method applies a memoization technique that reduces how often the completion engine is called for analysis.

The memory mainly implements the following three functions:

(1) Remembering rejected identifiers:

repairing vulnerabilities in practice requires generating a large number of samples, which means that the same program code body and insert location may recur in pruning operations; the pruning process can therefore be accelerated by storing in the memory the identifiers that were judged rejected and pruned during sampling, and zeroing the probabilities of these rejected identifiers directly.

(2) Remembering adopted identifiers:

in addition to rejected identifiers, previously adopted identifiers can also be stored in the memory, so that an identifier can be judged available directly without calling completion engine I.

(3) Building a prefix tree of rejected identifiers:

many identifiers in a language model's vocabulary are prefixes of other identifiers, which is common in language models. Clearly, if an identifier is rejected, meaning that no possible continuation can be formed after it to obtain statically valid patch code, then any identifier having that identifier as a prefix should also be rejected. Thus, a prefix tree (denoted Trie) of all rejected identifiers is built for a given program code body and insert location, and it is checked whether any identifier in the tree is a prefix of the newly generated next identifier. If so, the method jumps directly to the next iteration, avoiding further analysis.
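The three memory functions can be sketched with one small class. The class name `Memory`, its method names and the choice of keying records by (program code, insert location) are assumptions for illustration. Functions (1) and (3) are merged here, because an exactly matching rejected identifier is also a prefix of itself in the trie.

```python
from typing import Dict, Optional, Set, Tuple

Key = Tuple[str, int]  # (program code body, insert location)


class Memory:
    def __init__(self) -> None:
        self.adopted: Dict[Key, Set[str]] = {}
        self.rejected_trie: Dict[Key, dict] = {}

    def remember(self, key: Key, identifier: str, adopted: bool) -> None:
        """Record the verdict reached for an identifier at this code position."""
        if adopted:
            self.adopted.setdefault(key, set()).add(identifier)
            return
        node = self.rejected_trie.setdefault(key, {})
        for ch in identifier:
            node = node.setdefault(ch, {})
        node[None] = True  # None marks the end of a rejected identifier

    def lookup(self, key: Key, identifier: str) -> Optional[bool]:
        """True: known adopted. False: rejected, or has a rejected prefix. None: miss."""
        if identifier in self.adopted.get(key, set()):
            return True
        node = self.rejected_trie.get(key, {})
        for ch in identifier:
            if ch not in node:
                return None        # no rejected identifier lies along this path
            node = node[ch]
            if None in node:
                return False       # a rejected identifier is a prefix of this one
        return None


memory = Memory()
key = ("name.", 5)
memory.remember(key, "siz", adopted=False)
memory.remember(key, "len", adopted=True)
```

Only a `None` answer from `lookup` requires a call to completion engine I; the other two answers are resolved from the memory alone.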

The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. An automatic program repair method based on a large language model and a completion engine, characterized by comprising the following steps:

S2, sampling the identifiers in the identifier library T in a set order to obtain an identifier t_i (1 ≤ i ≤ n), and checking whether a record corresponding to the identifier is stored in the memory: if the sampled identifier t_i hits a record in the memory, completion engine I does not need to be called, and the identifier is either retained or pruned directly, depending on the record; if the sampled identifier t_i does not hit the memory, completion engine I is called to generate a completion result from the sampled identifier t_i, the current program code context and the current insert location; if the result is not unknown and the completion result is empty, the sampled identifier t_i is judged to be rejected, its corresponding probability p_i (1 ≤ i ≤ n) is set to zero to complete the pruning operation, and the memory record is updated; otherwise the sampled identifier t_i is judged to be adopted and the memory record is updated; sampling of the identifiers in the identifier library T then continues, and step S2 is repeated until all identifiers in the identifier library T have been sampled;

2. The automatic program repair method based on a large language model and a completion engine according to claim 1, wherein the identifier library T and the corresponding probability library P are mapped onto the search space, and sampling proceeds in order of probability from high to low.

3. The automatic program repair method based on a large language model and a completion engine according to claim 1, wherein the pruning operation is as follows: the pruning module sets the probability p_i (1 ≤ i ≤ n) corresponding to the identifier t_i (1 ≤ i ≤ n) in the search space to zero; an identifier whose probability is zero is not sampled.

4. The automatic program repair method based on a large language model and a completion engine according to claim 1, wherein in step S2, while the identifier library T is being sampled and judged, the large language model is not invoked again to generate a new identifier library; that is, step S2 only judges the current output of the large language model.

5. The automatic program repair method based on a large language model and a completion engine according to claim 1, wherein the patch code generation method comprises the following steps:

6. The automatic program repair method based on a large language model and a completion engine according to claim 1, wherein in step S2, the memory stores identifiers known to be infeasible and/or feasible, and an identifier can hit the memory in three cases:

7. The automatic program repair method based on a large language model and a completion engine according to claim 1, wherein in step S2, the search space pruning process is as follows:

first, the candidate next identifier is sampled according to the mapping of the identifier library T and the corresponding probability P given by the large language model, the current program code body is updated accordingly, and the insert location is moved to just after the newly sampled identifier t_i (1 ≤ i ≤ n);

8. The automatic program repair method based on a large language model and a completion engine according to claim 1, wherein in step S2, the memory builds a prefix tree of rejected identifiers, specifically as follows: if an identifier is rejected, meaning that no candidate continuation can be formed after it to obtain statically valid patch code, then any identifier having that identifier as a prefix should also be rejected; a prefix tree of all rejected identifiers is built for the given program code body and insert location, and it is checked whether any identifier in the tree is a prefix of the newly generated next identifier; if so, an identifier is resampled from T.

9. The automatic program repair method based on a large language model and a completion engine according to claim 1, wherein in step S3, the active completion operation is as follows:

10. An automatic program repair system based on a large language model and a completion engine, characterized by comprising a large language model, a pruning module, completion engine I and completion engine II;

the large language model is used for providing the identifier library T(t_1, t_2, …, t_n) and the corresponding probabilities P(p_1, p_2, …, p_n), and for mapping the identifier library T and the corresponding probabilities P onto a search space;