CN117130645A - Automatic program repairing method and system based on large language model and completion engine - Google Patents

Info

Publication number
CN117130645A
Authority
CN
China
Prior art keywords
identifier
completion
language model
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311384703.4A
Other languages
Chinese (zh)
Inventor
胡鹏飞
郝立鹏
刘健雄
李峰
金岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202311384703.4A
Publication of CN117130645A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/60 Software deployment
    • G06F8/65 Updates
    • G06F8/658 Incremental updates; Differential updates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/71 Version control; Configuration management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The application belongs to the field of program repair, and particularly relates to an automatic program repair method and system based on a large language model and a completion engine. In addition, because the completion engine can provide code completion suggestions, the method adopts the suggested identifier directly whenever the completion engine offers only one candidate identifier suffix for the current context. This not only allows the proposed method to generate valid patch code containing uncommon and long identifiers, but also reduces the number of iterative generation steps that the large language model would otherwise need for long identifier names.

Description

Automatic program repairing method and system based on large language model and completion engine
Technical Field
The application belongs to the field of program repair, and particularly relates to an automatic program repair method and system based on a large language model and a completion engine.
Background
As software systems become more complex, so do software vulnerabilities. The most advanced traditional automatic program repair tools rely primarily on hand-written repair templates that match defective code and apply the corresponding code patches. Although they outperform other conventional techniques, such tools can only repair error types covered by a preset template and cannot generalize to new vulnerability types.
With the development of deep learning, researchers have built automatic program repair tools on neural machine translation architectures. The prior art CN116755753A uses a trained neural machine translation model that converts defective code into correct code by learning from defective code and repair code crawled from the code commits of open source projects. However, the training sets of these tools may be limited in size and may contain irrelevant or noisy data, which reduces the accuracy of model pre-training.
Recently, with the development of large language models, a growing body of research has shown that they help developers complete various coding tasks and can also be applied directly to automatic program repair to generate patch code. However, most large language models treat a program as a sequence of identifiers (tokens), which means that they do not understand the underlying semantic constraints of the target programming language. This can result in a large amount of invalid patch code being generated, which reduces the practical utility of the technique.
Disclosure of Invention
To solve the above problems, this patent provides an automatic program repair method based on a large language model and a programming-language auto-completion engine. By fusing the large language model with the completion engine, the method generates more effective patch code and improves the accuracy and usability of automatic program repair. The completion engine can parse incomplete program code and infer code semantics in a highly fault-tolerant manner. The autoregressive identifier generation of a large language model can be compared to a human developer writing code, with the completion engine providing real-time feedback on whether the partial code written by the human or the large language model is valid. The technical solution is as follows.
An automatic program repair method based on a large language model and a completion engine comprises the following steps:
S1, generating an identifier library T = (t_1, t_2, …, t_n) and the corresponding probabilities P = (p_1, p_2, …, p_n);
S2, sampling identifiers from the identifier library T in a set order to obtain an identifier t_i (1 ≤ i ≤ n), and checking whether a record corresponding to the identifier is stored in the memory: if the sampled identifier t_i hits a record in the memory, completion engine I need not be called, and the identifier is either retained or pruned directly, depending on the record; if the sampled identifier t_i does not hit the memory, completion engine I is called to generate a completion result from the sampled identifier t_i, the current program code context and the current insertion position; if the result is not unknown and the completion result is empty, the sampled identifier t_i is judged to be rejected, its probability p_i (1 ≤ i ≤ n) is set to zero to complete the pruning operation, and the memory record is updated; otherwise the sampled identifier t_i is judged to be adopted and the memory record is updated; sampling from the identifier library T then continues, and step S2 is repeated until all identifiers in the identifier library T have been sampled;
S3, completion engine II actively completes the identifier string for the updated identifier library T, giving a series of candidate completion identifiers as possible continuations of the adopted identifier string; steps S1-S3 are repeated with the updated identifier library until the whole identifier string has been completed, forming a complete patch code.
Preferably, a large language model provides the identifier library T = (t_1, t_2, …, t_n) and the corresponding probability library P = (p_1, p_2, …, p_n); the identifier library T and the probability library P are mapped onto the search space, and sampling proceeds from the highest probability to the lowest.
Preferably, the pruning operation is as follows: the pruning module sets the probability p_i (1 ≤ i ≤ n) of the identifier t_i in the search space to zero; an identifier whose probability is zero is not sampled.
Preferably, while the identifiers of an identifier library T are being sampled and judged in step S2, the large language model is not asked to generate a new identifier library; that is, step S2 only judges the current output of the large language model.
Preferably, the patch code is produced as follows:
first, a <SPAN> identifier is used as a mask to replace the code block containing the vulnerability, forming a patch skeleton;
then, a large language model replaces the <SPAN> identifier, synthesizing the repair code from the code context around the vulnerability location.
Preferably, in step S2, the memory stores identifiers that are known to be infeasible and/or feasible, and an identifier can hit the memory in three cases:
if a rejected record in the memory is hit, the current identifier is judged to be unavailable, and the pruning module prunes the search space directly;
if the prefix tree (Trie) of rejected identifiers in the memory is hit, it is checked whether any identifier in the hit record is a prefix of the newly generated t_i (1 ≤ i ≤ n); if so, the pruning module prunes the search space directly, and an identifier is resampled from the identifier library T;
if an adopted record stored in the memory is hit, the adopted identifier is retained.
Preferably, in step S2, the search space is pruned as follows:
first, the next candidate identifier is sampled according to the mapping between the identifier library T and the corresponding probabilities P given by the large language model, the current program code body is updated accordingly, and the insertion point is moved to just after the newly sampled identifier t_i (1 ≤ i ≤ n);
then, completion engine I is called on the identifier string generated so far to check it; if the result is not unknown and no completion is produced, no candidate continuation can follow the currently generated identifier, so the pruning module sets the probability p_i (1 ≤ i ≤ n) of the identifier t_i in the search space to zero to prune it, an identifier is resampled from the identifier library T, and the next cycle is executed; otherwise, the current identifier is considered feasible.
Preferably, in step S2, the memory builds a prefix tree of rejected identifiers as follows: if an identifier is rejected, no candidate continuation can follow it to yield statically valid patch code, so any identifier that has the rejected identifier as a prefix should also be rejected; a prefix tree of all rejected identifiers is therefore built for the given program code body and insertion position, and it is checked whether any identifier in the tree is a prefix of the newly generated next identifier; if so, an identifier is resampled from T.
Preferably, in step S3, the active completion operation is as follows:
a completion result is obtained from the given program code body and the current insertion position, and it is checked whether the result for the current identifier string is unknown; if so, the result is set to an empty string, meaning that no additional completion is generated; otherwise, the common prefix of all completion results is computed, adjusted to fit the vocabulary of the language model, and returned.
An automatic program repair system based on a large language model and a completion engine comprises the large language model, a pruning module, completion engine I and completion engine II;
the large language model is used for providing an identifier library T = (t_1, t_2, …, t_n) and the corresponding probabilities P = (p_1, p_2, …, p_n), and for mapping the identifier library T and the corresponding probabilities P onto the search space;
the pruning module is used for checking whether an identifier t_i (1 ≤ i ≤ n) sampled from the identifier library T hits the contents of the memory, and for pruning the search space accordingly;
when the identifier t_i does not hit the memory, completion engine I, after receiving the memory query result, performs a completion operation on the identifier string; if the completion result is not unknown and is empty, the pruning module is triggered to prune the search space; otherwise the identifier t_i is adopted;
completion engine II actively completes the identifier string, giving a series of candidate completion identifiers as possible continuations of the currently adopted identifier string, and generates a completion result.
Compared with the prior art, the application has the following beneficial effects:
1) The method targets the single-vulnerability repair scenario: the patch is developed on the premise that the faulty code has been accurately located, and the patch code is obtained by changing a contiguous piece of code at the located position. Furthermore, the method can be extended to patch multiple vulnerabilities by replacing the code at each fault location with a separate infill identifier and generating the replacement code with the large language model. The method can be applied directly to general-purpose programming languages with minimal overhead, and the completion engine can actively complete the identifier currently being generated without calling the large language model multiple times.
2) The method first uses the large language model to provide the probabilities of the next identifier in the patch code being generated, then queries the completion engine, modifies the probability list by dynamically zeroing the probabilities of invalid identifiers, and selects the next identifier from the new probability list. In addition, because the completion engine can provide code completion suggestions, the method adopts the suggested identifier directly whenever the completion engine offers only one candidate identifier suffix for the current context. This not only allows the proposed method to generate valid patch code containing uncommon and long identifiers, but also reduces the number of iterative generation steps that the large language model would otherwise need for long identifier names.
Drawings
FIG. 1 is a flow chart of the present application.
FIG. 2 is a diagram of an embodiment of the present application.
Detailed Description
The following detailed description is exemplary and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application.
An automatic program repair method based on a large language model and a completion engine comprises the following steps:
S1, a large language model provides an identifier library T = (t_1, t_2, …, t_n) and the corresponding probability library P = (p_1, p_2, …, p_n); the identifier library T and the probability library P are mapped onto the search space, and sampling proceeds from the highest probability to the lowest.
S2, identifiers are sampled from the identifier library T in a set order to obtain an identifier t_i (1 ≤ i ≤ n), and it is checked whether a record corresponding to the identifier is stored in the memory. If the sampled identifier t_i hits a record in the memory, completion engine I need not be called, and the identifier is retained or pruned directly according to the record. If the sampled identifier t_i does not hit the memory, completion engine I is called to generate a completion result from the sampled identifier t_i, the current program code context and the current insertion position. If the result is not unknown and the completion result is empty, the sampled identifier t_i is judged to be rejected, its probability p_i (1 ≤ i ≤ n) is set to zero to complete the pruning operation, and the memory record is updated; otherwise the sampled identifier t_i is judged to be adopted and the memory record is updated. Sampling from the identifier library T then continues, and step S2 is repeated until all identifiers in the identifier library T have been sampled. Step S2 can be regarded as the inner loop of identifier sampling.
The condition checked by completion engine I, namely that the result is not unknown and the completion result is empty, corresponds to the following logic:
(1) Suppose the sampled t_i is appended to the code string already generated, forming an identifier string S'.
(2) S' is passed to completion engine I, which judges whether S' is legal. If S' is legal, then either t_i is the last word (the completion result is unknown) or possible continuations can still be generated after S' (the completion result is not empty). Otherwise, if the result produced by completion engine I is not unknown and the completion result is empty, S' is illegal; that is, t_i is not suitable for placement after the already generated identifier string, and t_i is rejected.
(3) Based on the result of (2), t_i is judged to be either rejected or adopted.
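The decision in (1)-(3) can be written as a single predicate. The following is a minimal sketch, assuming (this encoding is not specified in the source) that an "unknown" result from completion engine I is represented as `None` and that any other result is a list of proposed continuations; the function name `is_rejected` is hypothetical.

```python
from typing import List, Optional


def is_rejected(completions: Optional[List[str]]) -> bool:
    """Return True when the sampled identifier t_i should be pruned.

    Assumed encoding: None means the engine reported "unknown";
    a list holds the continuations the engine proposes after S'.
    """
    # Rejected only when the result is NOT unknown AND the completion is empty.
    return completions is not None and len(completions) == 0


# "unknown" result: t_i is kept (it may be the last word).
print(is_rejected(None))
# Non-empty completions: S' can be continued, so t_i is kept.
print(is_rejected(["length", "isEmpty"]))
# Known result with no completions: S' is illegal, so t_i is rejected.
print(is_rejected([]))
```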
The pruning operation is as follows: the pruning module sets the probability p_i (1 ≤ i ≤ n) of the identifier t_i in the search space to zero; an identifier whose probability is zero is not sampled.
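A minimal sketch of this pruning rule, assuming the search space is held as a plain dictionary from identifier to probability and that "sampling from high to low" means taking the highest-probability identifier that has not been zeroed. The helper names `prune` and `next_candidate` are hypothetical; the example values are the ones used in the embodiment (String, Name, End).

```python
from typing import Dict, Optional


def prune(search_space: Dict[str, float], identifier: str) -> None:
    """Zero the probability of a rejected identifier so it is never sampled again."""
    search_space[identifier] = 0.0


def next_candidate(search_space: Dict[str, float]) -> Optional[str]:
    """Return the highest-probability identifier whose probability is non-zero."""
    live = {t: p for t, p in search_space.items() if p > 0.0}
    if not live:
        return None  # every identifier has been pruned
    return max(live, key=live.get)


space = {"String": 0.91, "Name": 0.03, "End": 0.002}
first = next_candidate(space)   # highest probability comes first
prune(space, first)             # suppose completion engine I rejected it
second = next_candidate(space)  # the pruned identifier is skipped
```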
In step S2, while the identifiers of an identifier library T are being sampled and judged, the large language model does not generate a new identifier library; that is, step S2 only judges the current output of the large language model.
The memory stores identifiers whose feasibility is already known. If the sampled identifier t_i (1 ≤ i ≤ n) hits a record in the memory, completion engine I need not be called, and the identifier is retained or pruned directly according to the record.
The memory stores identifiers known to be rejected, identifiers known to be adopted, and a prefix tree of the rejected identifiers. A record can be hit in three cases:
(1) If a rejected record in the memory is hit, the current identifier t_i (1 ≤ i ≤ n) is judged to be unavailable; the pruning module prunes the search space directly, and sampling from the identifier library T continues.
(2) If an adopted record in the memory is hit, the current identifier t_i (1 ≤ i ≤ n) is judged to be available; its probability is kept, and sampling from the identifier library T continues.
(3) If the prefix tree (Trie) of rejected identifiers in the memory is hit, it is checked whether any identifier in the hit record is a prefix of the new identifier string; if so, the pruning module prunes the search space directly, and an identifier is resampled from the identifier library T.
The memory builds the prefix tree of rejected identifiers as follows: if an identifier is rejected, no candidate continuation can follow it to yield statically valid patch code, so any identifier that has the rejected identifier as a prefix should also be rejected. A prefix tree of all rejected identifiers is therefore built for the given program code body and insertion position, and it is checked whether any identifier in the tree is a prefix of the newly generated next identifier; if so, an identifier is resampled from T.
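The three hit cases can be sketched as a small lookup class. This is an illustration under stated assumptions, not the patent's implementation: the names `Memory` and `Verdict` are hypothetical, one `Memory` instance is assumed to cover a single (program code body, insertion position) pair, and the prefix check uses a linear scan with `str.startswith` in place of a real prefix tree to keep the sketch short.

```python
from enum import Enum
from typing import Set


class Verdict(Enum):
    REJECTED = "rejected"  # prune directly, no engine call
    ADOPTED = "adopted"    # keep directly, no engine call
    MISS = "miss"          # not in memory: completion engine I must be called


class Memory:
    """Records for one (program code body, insertion position) pair (assumption)."""

    def __init__(self) -> None:
        self.rejected: Set[str] = set()
        self.adopted: Set[str] = set()

    def lookup(self, identifier: str) -> Verdict:
        if identifier in self.rejected:      # case (1): rejected record hit
            return Verdict.REJECTED
        if identifier in self.adopted:       # case (2): adopted record hit
            return Verdict.ADOPTED
        # case (3): some rejected identifier is a prefix of the new one
        if any(identifier.startswith(r) for r in self.rejected):
            return Verdict.REJECTED
        return Verdict.MISS

    def record(self, identifier: str, adopted: bool) -> None:
        """Store the verdict returned by completion engine I."""
        (self.adopted if adopted else self.rejected).add(identifier)


memory = Memory()
memory.record("Na", adopted=False)
memory.record("String", adopted=True)
print(memory.lookup("Name"))    # Verdict.REJECTED (prefix "Na" was rejected)
print(memory.lookup("String"))  # Verdict.ADOPTED
print(memory.lookup("End"))     # Verdict.MISS
```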
If the currently sampled identifier is not stored in the memory, completion engine I is called to generate a completion result from the sampled identifier t_i (1 ≤ i ≤ n), the current program code context and the current insertion position. If the result is not unknown and the completion result is empty, the sampled identifier t_i is judged to be rejected, its probability p_i (1 ≤ i ≤ n) is set to zero to complete the pruning operation, and the memory record is updated; otherwise the sampled identifier t_i is judged to be adopted and the memory record is updated.
S3, completion engine II actively completes the identifier string, giving a series of possible completion identifiers as candidate continuations of the currently adopted identifier string; steps S1-S3 are repeated with the updated identifier library until the whole identifier string has been completed, forming a complete patch code.
The patch code is produced as follows:
first, a <SPAN> identifier is used as a mask to replace the code block containing the vulnerability, forming a patch skeleton;
then, a large language model replaces the <SPAN> identifier, synthesizing the repair code from the code context around the vulnerability location.
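The masking step can be illustrated with plain string handling. This is a minimal sketch assuming the vulnerable block is given as a 1-based, inclusive line range; the helper names `mask_buggy_lines` and `fill_mask` are hypothetical, and the model call that would actually produce the patch text is left out.

```python
def mask_buggy_lines(source: str, start: int, end: int, mask: str = "<SPAN>") -> str:
    """Replace lines start..end (1-based, inclusive) with a single mask line."""
    lines = source.splitlines()
    return "\n".join(lines[: start - 1] + [mask] + lines[end:])


def fill_mask(skeleton: str, patch: str, mask: str = "<SPAN>") -> str:
    """Substitute the generated patch text for the first mask occurrence."""
    return skeleton.replace(mask, patch, 1)


buggy = "int n = s.size;\nreturn n;"
skeleton = mask_buggy_lines(buggy, 1, 1)          # "<SPAN>\nreturn n;"
fixed = fill_mask(skeleton, "int n = s.length();")
print(fixed)
```

In the method described here the text that replaces `<SPAN>` is generated identifier by identifier under the completion engine's checks, rather than supplied in one piece as in this sketch.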
In step S3, a completion result is obtained from the given program code body and the current insertion position, and it is checked whether the result for the current identifier string is unknown; if so, the result is set to an empty string, meaning that no additional completion is generated; otherwise, the common prefix of all completion results is computed, adjusted to fit the vocabulary of the language model, and returned.
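A minimal sketch of the common-prefix part of active completion, assuming an "unknown" result is encoded as `None` and the candidates are plain suffix strings. The function name `active_completion` is hypothetical, and the final adjustment to the language model's vocabulary is omitted because the source does not describe how it is done.

```python
import os
from typing import List, Optional


def active_completion(completions: Optional[List[str]]) -> str:
    """Return the text that every candidate completion agrees on.

    None (assumed encoding of "unknown") or an empty list yields "",
    meaning no additional completion is generated.
    """
    if not completions:
        return ""
    # os.path.commonprefix compares character by character, so it works
    # on arbitrary strings, not only on file-system paths.
    return os.path.commonprefix(completions)


print(active_completion(["toString", "toStringBuilder"]))  # toString
print(active_completion(["length"]))                       # length
print(repr(active_completion(None)))                       # ''
```

With a single candidate the common prefix is the candidate itself, which matches the case in which the completion engine offers only one suffix and it is adopted directly.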
Steps S1-S3 as a whole can be regarded as one large outer loop.
An automatic program repair system based on a large language model and a completion engine comprises the large language model, a pruning module, completion engine I and completion engine II;
the large language model is used for providing an identifier library T = (t_1, t_2, …, t_n) and the corresponding probabilities P = (p_1, p_2, …, p_n), and for mapping the identifier library T and the corresponding probabilities P onto the search space;
the pruning module is used for checking whether an identifier t_i (1 ≤ i ≤ n) sampled from the identifier library T hits the contents of the memory, and for pruning the search space accordingly;
when the identifier t_i does not hit the memory, completion engine I, after receiving the memory query result, performs a completion operation on the identifier string; if the completion result is not unknown and is empty, the pruning module is triggered to prune the search space; otherwise the identifier t_i is adopted.
Completion engine II actively completes the identifier string, giving a series of candidate completion identifiers as possible continuations of the currently adopted identifier string, and generates a completion result.
In one embodiment, shown in FIG. 2, the patch code generation process proposed by this patent is implemented as follows.
The generation process consists of an inner loop and an outer loop. The outer loop repeatedly uses the cooperation between the large language model and completion engine II to generate new identifiers and update the generated result.
First, the outer loop feeds the identifiers generated so far into the large language model ((1) in FIG. 2). The large language model returns the identifier library T (String, Name, End, …) and the corresponding probability library P (91%, 3%, 0.2%, …) for the current input; these are mapped onto the search space of the inner loop, and sampling proceeds from the highest probability to the lowest. The inner loop then enters the identifier sampling phase: it repeatedly samples an identifier from the search space, checks its feasibility, and prunes the search space until an identifier is adopted.
Each time an identifier t_i (1 ≤ i ≤ n) is sampled from the identifier library T, it is first checked whether t_i hits the contents of the memory ((2) in FIG. 2). The memory stores records of identifiers known to be feasible or infeasible. The records of infeasible identifiers include a custom prefix tree data structure (Trie), which is described later.
If the sampled identifier t_i hits an infeasible record in the memory, its probability p_i (1 ≤ i ≤ n) is set to zero ((3) in FIG. 2) to prune the search space, and the next sampling operation takes place on the updated search space. In this way the same identifier is not resampled during the identifier selection phase, which avoids useless work. If the sampled identifier t_i does not hit the memory (for example, the identifier Name is not stored in the memory), completion engine I is called to generate a completion result from the sampled identifier t_i, the current program code context and the current insertion position. If the result is not unknown and the completion result is empty, the sampled identifier t_i is judged to be rejected, its probability p_i (1 ≤ i ≤ n) is set to zero to complete the pruning operation, and the memory record is updated; otherwise the sampled identifier t_i is judged to be adopted and the memory record is updated. In both cases the memory is updated ((5) in FIG. 2). After the identifier t_i has been adopted ((6) in FIG. 2), completion engine II is used to try to actively complete the identifier string ((7) in FIG. 2).
During the inner loop, the large language model does not generate a new identifier library; that is, the inner loop only judges the current output of the large language model.
Finally, the method proposed in this patent appends all newly generated adopted identifiers to the current generation result and starts a new loop, until a complete patch code has been generated. The loop stops when the model generates a special end-of-generation marker.
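The interplay of the outer and inner loops can be sketched with stand-in components. Everything named here is an assumption for illustration: `propose` stands in for the large language model, `complete` for completion engine I (returning `None` for "unknown"), `END` for the special end marker, and the toy functions accept only the single string `s.length()`. The memory and completion engine II are left out to keep the sketch short.

```python
from typing import Callable, Dict, List, Optional

END = "<END>"  # hypothetical end-of-patch marker


def repair(
    propose: Callable[[str], Dict[str, float]],
    complete: Callable[[str], Optional[List[str]]],
    max_steps: int = 50,
) -> str:
    generated = ""
    for _ in range(max_steps):                    # outer loop: one model call per step
        space = dict(propose(generated))          # search space for this step
        adopted = None
        while adopted is None:                    # inner loop: sample, check, prune
            live = {t: p for t, p in space.items() if p > 0.0}
            if not live:
                return generated                  # nothing viable remains
            candidate = max(live, key=live.get)   # highest probability first
            if candidate == END:
                return generated                  # model signalled the end
            result = complete(generated + candidate)
            if result is not None and len(result) == 0:
                space[candidate] = 0.0            # rejected: prune and resample
            else:
                adopted = candidate               # unknown or non-empty: adopt
        generated += adopted
    return generated


VALID = "s.length()"  # the only statically valid text in this toy setting


def toy_complete(text: str) -> Optional[List[str]]:
    if text == VALID:
        return None                               # nothing left to complete: "unknown"
    if VALID.startswith(text):
        return [VALID[len(text):]]                # the remaining suffix
    return []                                     # illegal prefix: no completions


def toy_propose(generated: str) -> Dict[str, float]:
    table = {
        "": {"s.": 0.9, "x.": 0.1},
        "s.": {"size": 0.6, "length": 0.4},
        "s.length": {"()": 0.8, "[]": 0.2},
    }
    return table.get(generated, {END: 1.0})


print(repair(toy_propose, toy_complete))  # s.length()
```

In the second step the model prefers `size`, the toy engine returns an empty completion list for `s.size`, the probability is zeroed, and `length` is adopted on the next inner iteration without another model call.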
To offset the loss of efficiency caused by the repeated sample, test and prune cycle in the inner loop and to speed up the search, the method proposed in this patent uses memoization to reduce how often the completion engine is called for analysis.
The memory mainly provides the following three functions:
(1) Remembering rejected identifiers:
In practice, repairing a vulnerability requires generating a large number of samples, so the same program code body and insertion position may recur in pruning operations. The pruning process can therefore be sped up by storing in the memory the identifiers that were judged infeasible and pruned during sampling, and zeroing the probabilities of rejected identifiers directly.
(2) Remembering adopted identifiers:
In addition to rejected identifiers, previously adopted identifiers can also be stored in the memory, so that a call to completion engine I is avoided and the identifier is judged to be available directly.
(3) Building a prefix tree of rejected identifiers:
Many identifiers in a language model's vocabulary are prefixes of other identifiers; this is common in language models. Clearly, if an identifier is rejected, meaning that no possible continuation can follow it to yield statically valid patch code, then any identifier that has the rejected identifier as a prefix should also be rejected. A prefix tree (denoted Trie) of all rejected identifiers is therefore built for the given program code body and insertion position, and it is checked whether any identifier in the tree is a prefix of the newly generated next identifier. If so, the method jumps directly to the next iteration and avoids further analysis.
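A minimal sketch of such a prefix tree, assuming identifiers are non-empty strings and the tree is stored as nested dictionaries keyed by character. The class name `RejectedTrie` and the `"$end"` marker key are hypothetical; the marker is longer than one character so it cannot collide with a character key.

```python
class RejectedTrie:
    """Prefix tree of rejected identifiers for one code body and insertion position."""

    _END = "$end"  # marks the node where a rejected identifier ends

    def __init__(self) -> None:
        self.root: dict = {}

    def add(self, identifier: str) -> None:
        """Insert a rejected identifier, one character per tree level."""
        node = self.root
        for ch in identifier:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def has_prefix_of(self, identifier: str) -> bool:
        """True if some rejected identifier is a prefix of (or equal to) `identifier`."""
        node = self.root
        for ch in identifier:
            if self._END in node:
                return True       # a rejected identifier ended before this character
            if ch not in node:
                return False      # diverged from every rejected identifier
            node = node[ch]
        return self._END in node  # exact match with a rejected identifier


trie = RejectedTrie()
trie.add("Na")
print(trie.has_prefix_of("Name"))    # True: "Na" was rejected, so "Name" is too
print(trie.has_prefix_of("String"))  # False: completion engine I must decide
```

The lookup walks at most as many nodes as the new identifier has characters, regardless of how many identifiers have been rejected.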
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. An automatic program repair method based on a large language model and a completion engine, characterized by comprising the following steps:
S1, generating an identifier library T (t_1, t_2, …, t_n) and the corresponding probabilities P (p_1, p_2, …, p_n);
S2, sampling the identifiers in the identifier library T in a set order to obtain an identifier t_i (1 ≤ i ≤ n), and checking whether a record corresponding to the identifier is stored in the memory: if the sampled identifier t_i hits a record in the memory, completion engine I does not need to be called, and, depending on the record, the identifier is either retained or pruned directly; if the sampled identifier t_i does not hit the memory, completion engine I is called to generate a completion result based on the sampled identifier t_i, the current program code context and the current insertion position; if the result is not unknown and the completion result is empty, the sampled identifier t_i is determined to be rejected, its corresponding probability p_i (1 ≤ i ≤ n) is set to zero to complete the pruning operation, and the memory record is updated; otherwise the sampled identifier t_i is determined to be adopted and the memory record is updated; sampling of the identifiers in the identifier library T then continues, and step S2 is repeated until all identifiers in the identifier library T have been sampled;
S3, completion engine II performs active completion on the identifier string using the updated identifier library T, giving a series of candidate completion identifiers as candidate continuations of the adopted identifier string; steps S1 to S3 are repeated with the updated identifier library until all identifier strings are completed, forming a complete patch code.
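As an illustration only, steps S1 and S2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the fixed `library` stands in for the large language model's output, `toy_engine` is a hypothetical stand-in for completion engine I that only knows two member names, the memory is a plain dictionary, and "set order" is taken to mean descending probability.

```python
from typing import Callable, Dict, List, Optional, Tuple

UNKNOWN = None  # hypothetical marker: the engine cannot judge this position
KNOWN_MEMBERS = ["length", "lastIndexOf"]  # hypothetical members valid after "s."


def toy_engine(code: str) -> Optional[List[str]]:
    """Hypothetical completion engine I: continuations of the partial name after the last '.'."""
    if "." not in code:
        return UNKNOWN
    partial = code.rsplit(".", 1)[1]
    return [m[len(partial):] for m in KNOWN_MEMBERS if m.startswith(partial)]


def prune_library(
    library: Dict[str, float],
    context: str,
    engine: Callable[[str], Optional[List[str]]],
    memory: Dict[Tuple[str, str], bool],
) -> Dict[str, float]:
    """Step S2: sample every identifier once and zero the rejected ones."""
    updated: Dict[str, float] = {}
    # One possible "set order": descending probability.
    for token in sorted(library, key=library.get, reverse=True):
        key = (context, token)
        if key not in memory:  # memory miss: call completion engine I
            result = engine(context + token)
            rejected = result is not UNKNOWN and len(result) == 0
            memory[key] = not rejected  # update the memory record
        # Memory hit or fresh verdict: keep the probability or prune it to zero.
        updated[token] = library[token] if memory[key] else 0.0
    return updated


# S1: a fixed library standing in for the large language model's output.
library = {"size": 0.6, "len": 0.3, "last": 0.1}
memory: Dict[Tuple[str, str], bool] = {}
updated = prune_library(library, "s.", toy_engine, memory)
adopted = max(updated, key=updated.get)  # highest-probability surviving identifier
```

In this toy run "size" is pruned because no known member starts with it, while "len" and "last" survive; a second call with the same context would be answered entirely from `memory` without calling the engine.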
2. The automatic program repair method based on a large language model and a completion engine according to claim 1, wherein the identifier library T and the corresponding probabilities P are mapped onto the search space, and sampling is performed in descending order of probability.
3. The automatic program repair method based on a large language model and a completion engine according to claim 1, wherein the pruning operation is: the pruning module sets the probability p_i (1 ≤ i ≤ n) corresponding to the identifier t_i (1 ≤ i ≤ n) in the search space to zero; an identifier whose probability is zero is not sampled.
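The pruning operation and the descending-probability sampling order can be sketched as below. This is a hypothetical sketch: the function names `prune` and `sampling_order` and the list-based representation of T and P are assumptions made for illustration.

```python
from typing import Iterator, List


def prune(probs: List[float], i: int) -> None:
    """Pruning operation: set the probability p_i of identifier t_i to zero in place."""
    probs[i] = 0.0


def sampling_order(tokens: List[str], probs: List[float]) -> Iterator[str]:
    """Yield identifiers from highest to lowest probability, skipping pruned ones."""
    for i in sorted(range(len(tokens)), key=lambda k: probs[k], reverse=True):
        if probs[i] > 0.0:  # a zero-probability identifier is never sampled
            yield tokens[i]
```

For example, with identifiers `["a", "b", "c"]` and probabilities `[0.2, 0.5, 0.3]`, pruning index 1 leaves the sampling order "c" then "a".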
4. The automatic program repair method based on a large language model and a completion engine according to claim 1, wherein in step S2, while each identifier in the identifier library T is being sampled and judged, the large language model is not called again to generate a new identifier library; that is, step S2 only judges the current output of the large language model.
5. The automatic program repair method based on a large language model and a completion engine according to claim 1, wherein the patch code is produced by the following steps:
firstly, a <SPAN> identifier is used as a mask to replace the code block containing the vulnerability, forming a patch skeleton;
the large language model then replaces the <SPAN> identifier, synthesizing the repair code of the patch from the code context around the vulnerability location.
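The masking step can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the vulnerable code block is assumed to be a contiguous range of lines, and the repair code is passed in as a string where the method would obtain it from the large language model.

```python
from typing import List

MASK = "<SPAN>"


def make_patch_skeleton(lines: List[str], start: int, end: int) -> str:
    """Replace the vulnerable block lines[start:end] with a single mask identifier."""
    return "\n".join(lines[:start] + [MASK] + lines[end:])


def fill_mask(skeleton: str, repair_code: str) -> str:
    """Substitute the synthesized repair code for the mask identifier."""
    return skeleton.replace(MASK, repair_code, 1)


program = ["int n = s.size();", "return n;"]
skeleton = make_patch_skeleton(program, 0, 1)
patched = fill_mask(skeleton, "int n = s.length();")
```

The skeleton keeps the surrounding lines intact, so the code context on both sides of the mask remains available when the replacement is generated.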
6. The automatic program repair method based on a large language model and a completion engine according to claim 1, wherein in step S2 the memory stores identifiers known to be infeasible and/or feasible, and an identifier can hit the memory in three cases:
if a rejected record in the memory is hit, the current identifier is judged unavailable, and the pruning module prunes the search space directly;
if the prefix tree (Trie) of rejected identifiers in the memory is hit, it is checked whether any identifier in the hit record is a prefix of the newly generated t_i (1 ≤ i ≤ n); if so, the pruning module prunes the search space directly and sampling from the identifier library T is performed again;
if a feasible record stored in the memory is hit, the adopted identifier is retained.
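The three hit cases can be sketched as a single lookup. This is an illustrative sketch only: the `Memory` class, the `Hit` enumeration and the fourth `MISS` outcome are assumptions, one `Memory` instance is assumed per program code body and insertion position, and a linear scan stands in for the prefix tree.

```python
from enum import Enum
from typing import Set


class Hit(Enum):
    REJECTED = "rejected"                # case 1: exact rejected record
    REJECTED_PREFIX = "rejected_prefix"  # case 2: a rejected identifier is a prefix
    FEASIBLE = "feasible"                # case 3: known feasible record
    MISS = "miss"                        # no record: completion engine I must be called


class Memory:
    """Hypothetical memory for one program code body and insertion position."""

    def __init__(self) -> None:
        self.rejected: Set[str] = set()
        self.feasible: Set[str] = set()

    def lookup(self, token: str) -> Hit:
        if token in self.rejected:
            return Hit.REJECTED
        # Linear scan standing in for the prefix tree of rejected identifiers.
        if any(token.startswith(r) for r in self.rejected):
            return Hit.REJECTED_PREFIX
        if token in self.feasible:
            return Hit.FEASIBLE
        return Hit.MISS
```

The first two outcomes lead to pruning, the third to retaining the identifier, and only a miss requires a call to completion engine I.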
7. The automatic program repair method based on a large language model and a completion engine according to claim 1, wherein in step S2 the search space is pruned as follows:
firstly, the candidate next identifier is sampled according to the mapping of the identifier library T and the corresponding probabilities P given by the large language model, the current program code body is updated accordingly, and the insertion position is moved to just after the newly sampled identifier t_i (1 ≤ i ≤ n);
then, completion engine I is called on the currently generated identifier string and the current string result is checked; if the result is not unknown and no completion is returned, no further candidate continuation can be formed after the currently generated identifier, so the pruning module sets the probability p_i (1 ≤ i ≤ n) corresponding to the identifier t_i (1 ≤ i ≤ n) in the search space to zero to prune it, sampling from the identifier library T is performed again, and the next cycle is executed; otherwise, the current identifier is considered feasible.
8. The automatic program repair method based on a large language model and a completion engine according to claim 1, wherein in step S2 the memory builds a prefix tree of rejected identifiers, specifically comprising the following steps: if an identifier is rejected, meaning that no candidate continuation after the identifier can yield a statically valid patch code, then any identifier having the rejected identifier as a prefix should also be rejected; a prefix tree of all rejected identifiers is built for the given program code body and insertion position, and it is checked whether any identifier in the tree is a prefix of the newly generated next identifier; if so, sampling from T is performed again.
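A prefix tree supporting this check can be sketched as follows. The class name `RejectedTrie`, the nested-dictionary layout and the `"$end"` marker key are assumptions made for illustration; one tree would be kept per program code body and insertion position.

```python
class RejectedTrie:
    """Hypothetical prefix tree of rejected identifiers, stored as nested dicts."""

    _END = "$end"  # multi-character key, so it cannot collide with a single character

    def __init__(self) -> None:
        self.root: dict = {}

    def insert(self, token: str) -> None:
        """Record a rejected identifier."""
        node = self.root
        for ch in token:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def has_prefix_of(self, token: str) -> bool:
        """Return True if some rejected identifier is a prefix of `token`."""
        node = self.root
        for ch in token:
            if self._END in node:  # a rejected identifier ends here
                return True
            if ch not in node:
                return False
            node = node[ch]
        return self._END in node  # `token` itself was rejected
```

The lookup walks at most one node per character of the new identifier, so its cost does not grow with the number of rejected identifiers stored.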
9. The automatic program repair method based on a large language model and a completion engine according to claim 1, wherein in step S3 the active completion operation is performed as follows:
a completion result is obtained according to the given program code body and the current insertion position, and it is checked whether the result for the current identifier string is unknown; if so, the result is set to an empty string, meaning that no additional completion is generated; otherwise, the common prefix of all completion results is calculated, the completion result is adjusted to fit the vocabulary of the language model, and the result is returned.
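The common-prefix step of active completion can be sketched as below. This is a sketch under stated assumptions: `None` is used as a hypothetical marker for an unknown result, the completions are given as plain strings, and the adjustment to the language model's vocabulary is omitted because the claim does not specify it.

```python
import os
from typing import List, Optional


def active_completion(completions: Optional[List[str]]) -> str:
    """Return the text that every candidate completion agrees on."""
    if completions is None:  # unknown result: generate no additional completion
        return ""
    # os.path.commonprefix compares strings character by character,
    # so it also works on strings that are not file paths.
    return os.path.commonprefix(completions)
```

When the candidates share nothing, for example "astIndexOf" and "ength", the common prefix is empty and no text is appended.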
10. An automatic program repair system based on a large language model and a completion engine, characterized by comprising the large language model, a pruning module, a completion engine I and a completion engine II;
the large language model is used for providing an identifier library T (t_1, t_2, …, t_n) and the corresponding probabilities P (p_1, p_2, …, p_n), and for mapping the identifier library T and the corresponding probabilities P to a search space;
the pruning module is used for checking whether an identifier t_i (1 ≤ i ≤ n) sampled from the identifier library T hits the contents of the memory, and for pruning the search space accordingly;
when the identifier t_i does not hit the memory, completion engine I, upon obtaining the memory query result, performs a completion operation on the identifier string; if the completion result is not unknown and is empty, the pruning module is triggered to prune the search space; otherwise the identifier t_i is adopted;
completion engine II performs active completion on the identifier string, giving a series of candidate completion identifiers as candidate continuations of the currently hit identifier string, and generates a completion result.
CN202311384703.4A 2023-10-25 2023-10-25 Automatic program repairing method and system based on large language model and completion engine Pending CN117130645A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311384703.4A CN117130645A (en) 2023-10-25 2023-10-25 Automatic program repairing method and system based on large language model and completion engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311384703.4A CN117130645A (en) 2023-10-25 2023-10-25 Automatic program repairing method and system based on large language model and completion engine

Publications (1)

Publication Number Publication Date
CN117130645A true CN117130645A (en) 2023-11-28

Family

ID=88856696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311384703.4A Pending CN117130645A (en) 2023-10-25 2023-10-25 Automatic program repairing method and system based on large language model and completion engine

Country Status (1)

Country Link
CN (1) CN117130645A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101334777A (en) * 2007-06-29 2008-12-31 明基电通股份有限公司 System and method for digging regular access subtree
CN105759983A (en) * 2009-03-30 2016-07-13 触摸式有限公司 System and method for inputting text into electronic devices
CN111897946A (en) * 2020-07-08 2020-11-06 扬州大学 Vulnerability patch recommendation method, system, computer equipment and storage medium
CN114742036A (en) * 2022-03-21 2022-07-12 清华大学 Combined model compression method and system for pre-training language model
CN116108891A (en) * 2022-12-15 2023-05-12 南京邮电大学 Lightweight pruning method, device, terminal and computer readable storage medium based on Transformer
CN116745758A (en) * 2020-12-23 2023-09-12 甲骨文国际公司 Intelligent query editor using neural network-based machine learning
US20230312679A1 (en) * 2015-06-19 2023-10-05 Immatics Biotechnologies Gmbh Novel peptides and combination of peptides for use in immunotherapy and methods for generating scaffolds for the use against pancreatic cancer and other cancers

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WENJIAN WU ET AL.: "Effect of Embedding Protection of Iliohepatic Nerve on Pain After Mesh Plug and Patch Repair for Inguinal Hernia", 《CLINICAL MEDICINE & ENGINEERING》 *
YUXIANG WEI ET AL.: "Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair", 《ARXIV》, pages 5 - 7 *
YANG BO; ZHANG NENG; LI SHANPING; XIA XIN: "A Survey of Intelligent Code Completion", 《JOURNAL OF SOFTWARE》, no. 05 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination