CN117971704B

CN117971704B - Method for generating a self-guided code error correction dataset for teenager programming scenarios

Info

Publication number: CN117971704B
Application number: CN202410361536.XA
Authority: CN
Inventors: 苏喻; 朱林波; 陆君宇; 丁军; 陈恩红; 李嘉豪
Original assignee: Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Current assignee: Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority date: 2024-03-28
Filing date: 2024-03-28
Publication date: 2024-06-04
Anticipated expiration: 2044-03-28
Also published as: CN117971704A

Abstract

The invention relates to the technical field of programming education and discloses a method for generating a self-guided code error correction dataset for teenager programming scenarios. The method comprises: collecting topics from teenager programming scenarios; extracting a subset of the topics, writing structured prompt one, generating error codes with a large language model, passing the error codes to a compiler to obtain error messages, and clustering the error messages to obtain cluster information; taking error codes written during actual programming as seed samples of the sample pools; generating error codes of the error category corresponding to each sample pool through few-shot learning on the seed samples in that pool; arbitrating the generated error codes using the cluster information, storing error codes that satisfy the arbitration condition in the sample pool, and, for error codes that do not satisfy it, either regenerating them using the compiler's error messages or discarding them. The method ensures the quality of the generated dataset and addresses the high cost of constructing code error correction datasets.

Description

Teenager programming scene self-guide code error correction data set generation method

Technical Field

The invention relates to the technical field of programming education, in particular to a self-guidance code error correction data set generation method for teenager programming scenes.

Background

Intelligent code correction (Intelligent Code Repair) aims to repair and correct code written by users. It plays an important role in the field of code intelligence and has recently received increasing attention. A good intelligent code correction system can detect and repair errors and defects in code, helping programmers improve its accuracy, readability, and maintainability. It can automatically discover potential problems and provide repair suggestions, reducing the introduction of human errors and vulnerabilities and thereby improving overall code quality.

In addition, through an automatic error correction and repair process, the workload of manually searching and repairing errors by a programmer is reduced, the time and energy of a developer are saved, and the development efficiency is improved.

In teenager programming scenarios, intelligent code correction can help teenagers discover and correct errors in their own code faster, and can provide immediate feedback and guidance that helps them understand and master the rules of the programming language. Intelligent code correction is therefore critical to teenagers learning to program.

The dataset currently used for training intelligent code correction is mainly the Bugs2Fix dataset, which provides real-world error codes and the corresponding correct codes. However, for teenager programming scenarios, this dataset still has the following drawbacks:

1) The Bugs2Fix dataset obtains real code from open-source projects, which is mainly written by adult developers and does not necessarily cover the problems and errors common in teenager programming. As a result, it may fail to provide sufficiently accurate and targeted error correction suggestions in programming education for teenagers.

2) The Bugs2Fix data set, although a larger scale data set, may still suffer from insufficient data coverage relative to the entire programming domain. Code instances in the dataset may not cover all possible error categories, which would limit the generalization capability of the error correction system.

3) The Bugs2Fix dataset is constructed by identifying all GitHub commits whose messages indicate a modification intent such as "fix", "solve", or "bug", and directly taking the code fragment before the commit as the error code and the fragment after the commit as the correct code. Because the skill level of different authors varies, the rationality and correctness of each modification are difficult to guarantee, and so is the quality of the dataset.

Clearly, in teenager programming scenarios the Bugs2Fix dataset cannot fully meet the training requirements, so a code error correction dataset suited to teenager programming scenarios needs to be constructed. This dataset should make up for the shortcomings of the Bugs2Fix dataset described above, and its construction process should solve or alleviate the following problems faced when constructing a general code error correction dataset:

1) A code error correction dataset consists of correct codes, error codes, and error categories. Labeling the error categories is a complex process: the codes must be analyzed and compared to determine the location and type of each error, which requires significant labor and time and places high demands on the annotators' ability.

2) The data in the code correction task needs to be diversified, covering different programming languages, different technical fields and various error categories. Acquiring diverse data with wide coverage can be challenging, particularly for certain professional fields or errors in a particular programming language.

3) After a large amount of time and labor are consumed to collect and label large-scale data, reasonable means should be used to ensure the quality of the data.

In the prior art, code error correction datasets are obtained by manual labeling. Such methods do not match teenager programming scenarios well, are costly to implement, and apply no quality control to the generated data. Generating code error correction datasets for teenager programming scenarios therefore remains a challenge.

Disclosure of Invention

In order to solve the above technical problems, the invention provides a method for generating a self-guided code error correction dataset for teenager programming scenarios.

In order to solve the technical problems, the invention adopts the following technical scheme:

A method for generating a self-guided code error correction dataset for teenager programming scenarios uses real samples from teenager programming scenarios, incorporates the error messages reported by a compiler, and generates the code error correction data through self-guidance of a conversational large language model. The method comprises the following steps:

S1: collecting topics from teenager programming scenarios; each topic includes a topic description and correct code;

S2: repeating the following operations until the number of clusters no longer changes: extracting a subset of topics from all the topics, writing structured prompt one, generating error codes with a conversational large language model, passing the error codes to a compiler to obtain error messages, and clustering the encoded error messages with the AP clustering algorithm to obtain the cluster information of each cluster; each cluster represents an error category;

S3: setting up a sample pool for each error category, and taking error codes that were written during actual programming and belong to the corresponding error category as seed samples of the sample pool;

S4: writing structured prompt two for each error category of each topic, and having the conversational large language model perform few-shot learning on the seed samples in the sample pool corresponding to the error category to generate error codes of that error category; inputting the error codes generated by the conversational large language model into the compiler, arbitrating the generated error codes using the cluster information, storing the error codes that satisfy the arbitration condition in the sample pool, and, for error codes that do not satisfy the arbitration condition, either performing secondary code generation using the compiler's error messages or discarding them;

the error codes stored in each sample pool, together with the corresponding topic descriptions, constitute the code error correction data.

Further, in step S1, the teenager programming scenarios include teenager programming competitions; each topic also includes test cases.

Further, step S2 specifically includes:

extracting a subset of the topics from all the topics each time, writing structured prompt one, generating error codes for the topics on the basis of the correct codes with a conversational large language model, and obtaining error messages with a compiler; wherein structured prompt one, $\mathrm{prompt}_1$, is constructed as follows:

$$\mathrm{prompt}_1 = \oplus(\mathrm{identity},\ q_d,\ q_c);$$

wherein $\oplus$ represents concatenation, $\mathrm{identity}$ represents the identity set for the conversational large language model, $q_d$ represents the topic description, and $q_c$ represents the correct code;

encoding the error messages and passing the encoded error messages into a BERT model, obtaining the vector of each word of an error message from the final hidden-layer state of the BERT model, and averaging the vectors of all words of the error message to give the error message vector;

taking each error message vector as a point, clustering the error messages with the AP clustering algorithm, and forming a cluster information set C from the cluster information of all clusters; each element c of the cluster information set C comprises a vector representation v and a semantic representation l of the cluster center point p, and a cluster radius r; the cluster radius r is the maximum distance from any point in the cluster to the cluster center point; the semantic representation l of the cluster center point p is the semantic representation of the point closest to the cluster center point;

and stopping topic extraction when the number of clusters no longer changes, thereby obtaining a plurality of clusters in one-to-one correspondence with the error categories.

Further, the step S4 specifically includes:

for each error category $l$ of each topic, selecting a seed sample from the corresponding sample pool, constructing structured prompt two, and generating an error code $b$ of error category $l$ with the large language model; structured prompt two, $\mathrm{prompt}_2$, is constructed as follows:

$$\mathrm{prompt}_2 = \oplus(\mathrm{identity},\ q_d,\ q_c,\ l,\ e);$$

wherein $\oplus$ represents concatenation, $\mathrm{identity}$ represents the identity set for the conversational large language model, $q_d$ represents the topic description, $q_c$ represents the correct code, $l$ represents the error category, and $e$ represents the seed sample in the sample pool corresponding to error category $l$;

inputting the generated error code $b$ into the compiler, and encoding the compiler's error message into a feedback information vector $u$; the generated error code $b$ is arbitrated using the cluster information, which comprises the vector representation $v$ of the cluster center point $p$ and the cluster radius $r$; the arbitration condition is: whether the Euclidean distance $d$ between $u$ and $v$ is less than or equal to the cluster radius $r$ corresponding to error category $l$; if so, the generated error code $b$ satisfies the arbitration condition and is stored in the corresponding sample pool; if not, the generated error code $b$ does not satisfy the arbitration condition, and it is either discarded or secondary code generation is performed;

wherein

$$d = \sqrt{\sum_{k=1}^{D} (v_k - u_k)^2};$$

the vector representation of the cluster center point $p$ is $v = (v_1, \dots, v_D)$ and the feedback information vector is $u = (u_1, \dots, u_D)$; $v_k$ represents the value of $v$ in the $k$-th dimension; $u_k$ represents the value of $u$ in the $k$-th dimension.

Further, when the generated error code does not satisfy the arbitration condition and secondary code generation is performed, the compiler's error message is merged into structured prompt two, $\mathrm{prompt}_2$, to obtain structured prompt three, $\mathrm{prompt}_3$:

$$\mathrm{prompt}_3 = \oplus(\mathrm{identity},\ q_d,\ q_c,\ l,\ e,\ m_b);$$

wherein $m_b$ represents the error message obtained after the generated error code $b$ is input into the compiler;

the error code is then regenerated using structured prompt three, $\mathrm{prompt}_3$, and the conversational large language model.

Compared with the prior art, the invention has the beneficial technical effects that:

By using a dynamic AP clustering algorithm, the invention obtains a complete set of error code categories, which solves the problem of incomplete error categories in code error correction datasets and ensures the diversity of the dataset. Using a small number of error codes written by real teenagers as seed data ensures the authenticity of the generated data. Using compiler feedback and a conversational large language model for self-guided data generation ensures the quality of the generated dataset, addresses the high construction cost of code error correction datasets, and also offers an approach for constructing other types of datasets.

Drawings

FIG. 1 is a flow chart of the generation method of the present invention.

Detailed Description

A preferred embodiment of the present invention will be described in detail with reference to the accompanying drawings.

The method of the invention generates a self-guided code error correction dataset for teenager programming scenarios based on a conversational large language model. It comprehensively considers the various error codes that may arise when teenagers learn programming, and provides a large-scale, high-quality dataset for subsequent code error correction training at low cost.

Teenagers in the present invention refer to people between the ages of 10 and 19.

The invention relates to a method for generating a self-guiding code error correction data set based on a conversational large language model in a teenager programming scene, which comprises the following steps:

S1: Public topics from teenager programming competitions are collected; each topic comprises a topic description, correct code, and test cases.

S2: Topics are repeatedly extracted from the collected topics, a structured prompt is written, the conversational large language model generates related error codes, and the error codes are passed to a compiler to obtain error messages. The accumulated error messages are encoded and clustered with the AP clustering algorithm until the number of clusters no longer changes. At that point, each cluster represents an error category.

S3: A sample pool is set up for each error category, and a small number of corresponding error codes written by teenagers during actual programming are collected as seed samples. These guide the subsequent output of the conversational large language model and further improve the authenticity of the output data.

S4: A structured prompt is written for each error category of each collected topic, and the conversational large language model generates error codes of the corresponding error category after few-shot learning on the seed samples in the sample pool. The generated error codes are input into the compiler and arbitrated using the cluster information from step S2. High-quality data are stored in the sample pool; for data that are not of high quality, either secondary code generation is performed using the compiler's feedback or the data are discarded directly.

Step S1 specifically comprises: collecting public topics from teenager programming competitions. The collected information must contain not only the topic description and the correct code but also the test cases of the topic, so that the code error correction effect can later be evaluated with reasonable metrics.

The step S2 specifically comprises the following steps:

A subset of the topics is extracted from all the topics each time, structured prompt one is written, the conversational large language model generates error codes for the topics on the basis of the correct codes, and the compiler is used to obtain the error messages. Structured prompt one is constructed as follows:

$$\mathrm{prompt}_1 = \oplus(\mathrm{identity},\ q_d,\ q_c);$$

wherein $\oplus$ denotes joining the pieces of information with suitable connecting sentences so that they form a fluent passage. $\mathrm{identity}$ denotes the identity set for the large language model, such as "You are a teenager programming education teacher", which helps the large language model generate data that better fits the scenario. $q_d$ represents the topic description, and $q_c$ represents the correct code of the topic.

Each error message is first encoded and then passed into a BERT model; the final hidden-layer state of the BERT model gives a vector representation of each word in the error message, and the vectors of all words are averaged to give the error message vector.
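The averaging step can be sketched as follows. This is a minimal illustration, assuming the per-word vectors have already been read from the model's final hidden layer; the function name `mean_pool` and the toy 4-dimensional vectors are hypothetical stand-ins, not real BERT output.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average per-word vectors into a single error message vector."""
    if not token_vectors:
        raise ValueError("need at least one word vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[k] for vec in token_vectors) / count for k in range(dim)]


# Toy stand-in for the final hidden-layer states: 3 words, 4 dimensions each.
hidden_states = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
error_vector = mean_pool(hidden_states)
```

With a real model, `hidden_states` would hold one vector per word of the compiler message, and padding positions would need to be excluded before averaging.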

Taking each error message vector as a point, the error messages are clustered with the AP clustering algorithm, and the cluster information of all clusters is recorded as a set C. Each element c of the set C contains a vector representation v and a semantic representation l of the cluster center point p (the semantic representation of the point closest to the cluster center point is taken as the semantic representation of the cluster center point), and a cluster radius r (the maximum distance from any sample in the cluster to its cluster center point).
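As an illustration of what one element c of the set C holds, the sketch below computes v, l and r for a single cluster. The `ClusterInfo` class, the `cluster_info` function, and the sample messages are hypothetical names and data chosen for this example.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusterInfo:
    v: List[float]  # vector representation of the cluster center point
    l: str          # semantic representation: text of the point nearest the center
    r: float        # cluster radius: largest member-to-center distance


def cluster_info(center: Sequence[float],
                 members: Sequence[Sequence[float]],
                 messages: Sequence[str]) -> ClusterInfo:
    """Build the (v, l, r) record for one cluster."""
    dists = [math.dist(center, m) for m in members]
    nearest = min(range(len(members)), key=dists.__getitem__)
    return ClusterInfo(v=list(center), l=messages[nearest], r=max(dists))


# Toy cluster with three member vectors and their (illustrative) messages.
members = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
messages = ["NameError: name 'x' is not defined",
            "NameError: name 'total' is not defined",
            "NameError: name 'i' is not defined"]
info = cluster_info([0.0, 0.0], members, messages)
```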

The AP clustering algorithm has three important matrices:

Similarity matrix (similarity): the similarity between points i and k is denoted $s(i,k)$. It expresses how well suited point k is to be the cluster center point of point i, and is generally given by the negative Euclidean distance.

Responsibility (attraction degree) matrix (responsibility): the responsibility between points i and k is denoted $r(i,k)$. It is the message sent from point i to the candidate cluster center point k, and its value reflects how suitable point k is as the cluster center point of point i.

Availability (attribution degree) matrix (availability): the availability between points i and k is denoted $a(i,k)$. It is the message sent from the candidate cluster center point k to point i, and its value reflects how appropriate it is for point i to choose point k as its cluster center point.

The AP clustering algorithm does not require the number of cluster center points to be preset, and every point may become a cluster center point. The key factor in whether a point becomes a cluster center point is the value on the diagonal of the similarity matrix, $s(k,k)$, which is called the preference and denoted p. The p value strongly affects whether point k can become a cluster center point, and therefore also the number of cluster center points. To give every point an equal chance of becoming a cluster center point when clustering the error messages, p is set to the mean of the similarity matrix, and the p values of all data points are set equal.

The AP clustering algorithm used is specifically described as follows:

Input: n error messages, each represented as a $D$-dimensional vector; similarity matrix $S$, where $s(i,k)$ represents the similarity between two points; responsibility (attraction degree) matrix $R$, where $r(i,k)$ represents the responsibility between two points; availability (attribution degree) matrix $A$, where $a(i,k)$ represents the availability between two points;

Output: the cluster information of each cluster, recorded as a set C.

The AP clustering algorithm comprises the following steps:

1) Initialize the values of the similarity matrix, the responsibility matrix, and the availability matrix to 0. Then convert the error message vectors into the similarity matrix by taking the negative Euclidean distance:

$$s(i,k) = -\lVert x_i - x_k \rVert;$$

wherein point i is the error message vector $x_i$ and point k is the error message vector $x_k$.

2) Update the responsibility matrix using the availability matrix and the similarity matrix:

$$r(i,k) \leftarrow s(i,k) - \max_{k' \neq k}\{a(i,k') + s(i,k')\}.$$

3) Update the availability matrix:

$$a(i,k) \leftarrow \min\Big(0,\ r(k,k) + \sum_{i' \notin \{i,k\}} \max\big(0,\ r(i',k)\big)\Big), \quad i \neq k;$$

$$a(k,k) \leftarrow \sum_{i' \neq k} \max\big(0,\ r(i',k)\big).$$

4) Repeat steps 2) and 3) until the cluster center points no longer change or the maximum number of iterations is reached.

5) If the sum of the value on the availability diagonal and the value on the responsibility diagonal satisfies $a(k,k) + r(k,k) > 0$, point k is selected as a cluster center point. For each remaining point i, the point k with the largest value of $a(i,k) + r(i,k)$ is taken as the center point of the cluster to which point i belongs.

6) And calculating the cluster information of each cluster, and recording the cluster information as a set C.

As the number of vector points grows, topic extraction stops once the number of cluster center points no longer changes, which yields a complete representation of the error categories.
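The message-passing steps above can be sketched in plain Python as follows. This is an illustrative sketch, not the patented implementation: the damping factor, the iteration limits, and the use of the off-diagonal mean as the preference are assumptions added here (damping is a common way to keep the updates from oscillating), and the function name `affinity_propagation` is hypothetical.

```python
import math
from typing import List, Sequence


def affinity_propagation(points: Sequence[Sequence[float]],
                         damping: float = 0.5,      # assumption: not stated in the text
                         max_iter: int = 200,
                         stable_iter: int = 15) -> List[int]:
    """Return, for each point, the index of the point chosen as its cluster center."""
    n = len(points)
    if n < 2:
        return list(range(n))
    # Step 1: similarity is the negative Euclidean distance.
    s = [[-math.dist(points[i], points[k]) for k in range(n)] for i in range(n)]
    # Preference p: mean similarity (off-diagonal entries), equal for all points.
    off_diag = [s[i][k] for i in range(n) for k in range(n) if i != k]
    preference = sum(off_diag) / len(off_diag)
    for k in range(n):
        s[k][k] = preference
    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities
    centers: List[int] = []
    last: List[int] = []
    unchanged = 0
    for _ in range(max_iter):
        # Step 2: update responsibilities from availabilities and similarities.
        for i in range(n):
            for k in range(n):
                best_other = max(a[i][j] + s[i][j] for j in range(n) if j != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - best_other)
        # Step 3: update availabilities from responsibilities.
        for i in range(n):
            for k in range(n):
                support = sum(max(0.0, r[j][k]) for j in range(n) if j != i and j != k)
                target = support if i == k else min(0.0, r[k][k] + support)
                a[i][k] = damping * a[i][k] + (1 - damping) * target
        # Steps 4 and 5: a point k is a center when a(k,k) + r(k,k) > 0.
        centers = [k for k in range(n) if a[k][k] + r[k][k] > 0]
        unchanged = unchanged + 1 if centers and centers == last else 0
        last = centers
        if unchanged >= stable_iter:
            break
    if not centers:
        return list(range(n))  # fallback: no center emerged, keep points separate
    return [i if i in centers else max(centers, key=lambda k: a[i][k] + r[i][k])
            for i in range(n)]


# Two well-separated toy groups standing in for error message vectors.
toy_points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels = affinity_propagation(toy_points)
```

The number of distinct values in `labels` is the cluster count that the stopping rule for topic extraction would monitor from round to round.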

Further, the present invention provides a specific embodiment of step S2 as follows:

Input: a topic set T1 containing N topics (each topic contains a topic description and correct code); a topic set T2, initially empty; $m_0 = 0$, where $m_0$ is the number of clusters obtained in the previous round.

Output: the cluster information.

The method specifically comprises the following steps:

S21: n topics (n = N/10) are extracted from the topic set T1 and placed into the topic set T2; the topic set T1 is updated by deleting the n extracted topics from it.

S22: For the n topics in the topic set T2, structured prompt one is used to have the conversational large language model generate, for each topic, 10 segments of error code with different error categories on the basis of the topic and its correct code, forming an error code set D.

S23: Each error code in the error code set D is compiled by the compiler to obtain the compiler's error message, forming an error message set E.

S24: Each error message in the error message set E is given a vector representation using the BERT model, and the resulting error message vectors are placed into the error message vector set V.

S25: The error message vector set V is clustered with the AP clustering algorithm, and $m_1$ is set to the number of resulting clusters.

S26: If $m_1 = m_0$ or the set T1 is empty, step S27 is performed; otherwise $m_0$ is set to $m_1$ and steps S21 to S25 are repeated.

S27: The cluster information of each cluster is recorded as a set C. Each element c of the set C contains a vector representation v and a semantic representation l of the cluster center point p, and a cluster radius r.
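The loop in steps S21 to S26 can be sketched as below. The callables `generate`, `compile_msg`, `embed` and `cluster` are hypothetical stand-ins for the large language model, the compiler, the BERT encoder and the AP clustering step; taking topics in list order and the toy stubs in the usage part are assumptions made only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def discover_error_categories(
    topics: Sequence[str],
    generate: Callable[[str], List[str]],      # stand-in for the LLM (S22)
    compile_msg: Callable[[str], str],         # stand-in for the compiler (S23)
    embed: Callable[[str], Vector],            # stand-in for BERT encoding (S24)
    cluster: Callable[[List[Vector]], List[int]],  # stand-in for AP clustering (S25)
) -> Tuple[List[int], List[Vector], List[str]]:
    t1 = list(topics)
    n = max(1, len(t1) // 10)          # S21: batch size n = N/10
    vectors: List[Vector] = []
    messages: List[str] = []
    labels: List[int] = []
    m0 = 0                             # cluster count of the previous round
    while t1:
        t2, t1 = t1[:n], t1[n:]        # S21: move n topics from T1 to T2
        for topic in t2:
            for code in generate(topic):
                msg = compile_msg(code)
                messages.append(msg)
                vectors.append(embed(msg))
        labels = cluster(vectors)
        m1 = len(set(labels))
        if m1 == m0:                   # S26: cluster count is stable, stop
            break
        m0 = m1
    return labels, vectors, messages


# Toy stubs: every topic yields one syntax error and one name error.
toy_topics = [f"topic {i}" for i in range(20)]
labels, vectors, messages = discover_error_categories(
    toy_topics,
    generate=lambda t: [f"{t}: missing colon", f"{t}: undefined name"],
    compile_msg=lambda code: "SyntaxError" if "colon" in code else "NameError",
    embed=lambda msg: [0.0] if msg == "SyntaxError" else [1.0],
    cluster=lambda vs: [int(v[0]) for v in vs],
)
```

In this toy run the cluster count is already stable after the second batch, so extraction stops with most topics still unused.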

The following is the template of structured prompt one, $\mathrm{prompt}_1$, used in step S2:

prompt1 = "You are a teenager programming education teacher, and you are explaining {" + $q_d$ + "}. The correct code for this programming topic is {" + $q_c$ + "}. Drawing on your teaching experience, please write ten segments of error code that your students might produce on the basis of the correct code, to be used as counterexamples in subsequent programming teaching. Please return the written codes to me in JSON format, with no additional description other than the codes."
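A minimal sketch of filling in this template in Python. The function name `build_prompt1` and the sample topic are hypothetical; the English wording follows the translated template and may differ from the prompt actually used.

```python
def build_prompt1(q_d: str, q_c: str) -> str:
    """Concatenate identity, topic description q_d and correct code q_c."""
    identity = "You are a teenager programming education teacher"
    return (
        f"{identity}, and you are explaining {{{q_d}}}. "
        f"The correct code for this programming topic is {{{q_c}}}. "
        "Drawing on your teaching experience, please write ten segments of error code "
        "that your students might produce on the basis of the correct code, to be used "
        "as counterexamples in subsequent programming teaching. Please return the written "
        "codes to me in JSON format, with no additional description other than the codes."
    )


# Hypothetical topic used only for illustration.
prompt1 = build_prompt1(
    "Print the sum of two integers",
    "a, b = map(int, input().split())\nprint(a + b)",
)
```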

The step S4 specifically comprises the following steps:

For each topic and each error category, a seed sample is selected from the sample pool, structured prompt two is constructed, and the conversational large language model generates error code of the corresponding error category. Structured prompt two is constructed as follows:

$$\mathrm{prompt}_2 = \oplus(\mathrm{identity},\ q_d,\ q_c,\ l,\ e);$$

wherein $\oplus$ denotes concatenation, $q_d$ the topic description, $q_c$ the correct code, $l$ the error category, and $e$ the seed sample in the sample pool corresponding to error category $l$.

The generated error code is passed through the compiler to obtain the compiler's error message, which is encoded in the same way as in step S2 to obtain the feedback information vector $u$. The generated error code is then arbitrated using the cluster information obtained in step S2, with the following formula:

$$d = \sqrt{\sum_{k=1}^{D} (v_k - u_k)^2};$$

wherein the vector representation of the cluster center point p is $v = (v_1, \dots, v_D)$ and the feedback information vector is $u = (u_1, \dots, u_D)$.

Data arbitration compares the cluster radius r of the error category with d. If d ≤ r, the generated error code satisfies the arbitration condition and is stored in the corresponding sample pool. If d > r, the generated error code does not satisfy the arbitration condition; the compiler's error message is then merged into the prompt to inform the conversational large language model that a deviation has occurred, so that the model takes the deviation information into account and regenerates the data. In this case prompt3 is constructed as follows:

$$\mathrm{prompt}_3 = \oplus(\mathrm{identity},\ q_d,\ q_c,\ l,\ e,\ m_b);$$

wherein $m_b$ represents the compiler's error message, and the remaining parameters have the same meaning as before.
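The arbitration rule amounts to a distance check against the cluster radius. A minimal sketch, where the function name `arbitrate`, the returned strings and the toy numbers are chosen for this example:

```python
import math
from typing import Sequence


def arbitrate(u: Sequence[float], v: Sequence[float], r: float) -> str:
    """Compare the Euclidean distance d between u and v with the cluster radius r."""
    d = math.dist(u, v)
    return "store" if d <= r else "regenerate"


# Toy cluster: center v at the origin with radius 6.
v = [0.0, 0.0]
radius = 6.0
inside = arbitrate([3.0, 4.0], v, radius)    # d = 5, within the radius
outside = arbitrate([6.0, 8.0], v, radius)   # d = 10, outside the radius
```

A "regenerate" result is the case in which the compiler message would be merged into the prompt for a second attempt, or the sample discarded.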

Further, the present invention provides a specific embodiment of step S4 as follows:

Input: the topic set T1 containing N topics, each comprising a topic description $q_d$ and correct code $q_c$; the cluster information set C;

Output: the code error correction data.

The method specifically comprises the following steps:

S41: For each error category $l$ of each topic q in the topic set T1, a seed sample e is selected from the sample pool of error category $l$.

S42: Based on structured prompt two (prompt2), the conversational large language model learns the error category $l$ of the seed sample e and generates an error code b on the basis of the topic description $q_d$ and correct code $q_c$ of topic q.

S43: The error code b is passed through the compiler to obtain the compiler's error message $m_b$, and it is checked whether the compilation result meets expectations, i.e. whether the distance between the vector representation $u$ of the error message $m_b$ and v is less than or equal to r; if so, the result is deemed to meet expectations, and if not, it is deemed not to.

S44: If expectations are met, the error code b is placed into the sample pool of error category $l$ and step S45 is performed. If not, step S46 is performed.

S45: Determine whether the number of error codes obtained for the current error category $l$ of the current topic q has reached m2; if so, move on to generating error codes for the next error category; if not, repeat steps S41 to S44. m2 is the number of data items expected for each error category.

S46: Determine whether the number of failures to meet expectations is less than t; if so, merge the error message $m_b$ of the generated error code b into structured prompt two (prompt2) to obtain structured prompt three (prompt3), and repeat steps S42 to S44 based on prompt3; if not, move on to generating error codes for the next error category.
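Steps S41 to S46 for one topic and one error category can be sketched as follows. The callables `llm`, `compile_msg` and `embed` are hypothetical stand-ins, the prompt text is abbreviated, and counting failures cumulatively per category is an assumption, since the text does not say whether the counter resets after a success.

```python
import math
import random
from typing import Callable, List, Optional, Sequence


def fill_pool(q_d: str, q_c: str, label: str, pool: List[str],
              v: Sequence[float], r: float,
              llm: Callable[[str], str],
              compile_msg: Callable[[str], str],
              embed: Callable[[str], List[float]],
              m2: int, t: int, seed: int = 0) -> List[str]:
    """Generate up to m2 accepted error codes for one error category."""
    rng = random.Random(seed)
    accepted: List[str] = []
    failures = 0                      # assumption: cumulative per category
    feedback: Optional[str] = None    # compiler message of the last rejected code
    e = ""
    while len(accepted) < m2:
        if feedback is None:
            e = rng.choice(pool)      # S41: pick a seed sample
        prompt = (f"topic: {q_d}\ncorrect code: {q_c}\n"
                  f"error category: {label}\nexample: {e}")   # abbreviated prompt2
        if feedback is not None:
            prompt += f"\ndo not confuse with: {feedback}"     # abbreviated prompt3
        b = llm(prompt)               # S42: generate error code b
        m_b = compile_msg(b)          # S43: compile and arbitrate
        if math.dist(embed(m_b), v) <= r:
            pool.append(b)            # S44: store accepted code in the pool
            accepted.append(b)
            feedback = None
        else:
            failures += 1             # S46: retry with feedback, or give up
            if failures >= t:
                break
            feedback = m_b
    return accepted


# Toy stubs: the first generation is rejected, the next three are accepted.
outputs = iter(["bad1", "ok1", "ok2", "ok3"])
prompts: List[str] = []


def toy_llm(prompt: str) -> str:
    prompts.append(prompt)
    return next(outputs)


pool = ["seed sample"]
accepted = fill_pool(
    "q", "c", "NameError", pool, v=[0.0], r=1.0,
    llm=toy_llm,
    compile_msg=lambda code: "unexpected" if code.startswith("bad") else "expected",
    embed=lambda msg: [0.0] if msg == "expected" else [10.0],
    m2=3, t=2,
)
```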

Structured prompt two (prompt2) and structured prompt three (prompt3) are constructed as follows:

$$\mathrm{prompt}_2 = \oplus(\mathrm{identity},\ q_d,\ q_c,\ l,\ e), \qquad \mathrm{prompt}_3 = \oplus(\mathrm{identity},\ q_d,\ q_c,\ l,\ e,\ m_b);$$

wherein $l$ represents the error category, $e$ represents the seed sample in the sample pool corresponding to error category $l$, and $m_b$ represents the compiler's error message. The templates of structured prompt two (prompt2) and structured prompt three (prompt3) are as follows:

prompt2 = "You are a teenager programming education teacher. For the programming topic {" + $q_d$ + "}, the correct code is {" + $q_c$ + "}. Drawing on your teaching experience, please write, on the basis of the correct code, error code with the error {" + $l$ + "} that your students might write (the written error code will be used as a counterexample for teaching). Before writing, you can study code with this type of error: {" + $e$ + "} (this code is for reference only, not a basis for modification). Please return the written code to me in JSON format, with no additional description other than the code."

prompt3 = "You are a teenager programming education teacher. For the programming topic {" + $q_d$ + "}, the correct code is {" + $q_c$ + "}. Drawing on your teaching experience, please write, on the basis of the correct code, error code with the error {" + $l$ + "} that your students might write (the written error code will be used as a counterexample for teaching). Before writing, you can study code with this type of error: {" + $e$ + "} (this code is for reference only, not a basis for modification). Note: please consider this error category carefully and do not confuse it with {" + $m_b$ + "}. Please return the written code to me in JSON format, with no additional description other than the code."

The conversational large language model of the invention may be a currently available commercial conversational artificial intelligence model with code processing capabilities, such as OpenAI's ChatGPT, iFLYTEK's Spark (Xinghuo), or Baidu's ERNIE Bot (Wenxin Yiyan).

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of description is adopted for clarity only. Those skilled in the art should treat the specification as a whole, and the technical solutions of the embodiments may be combined appropriately to form other embodiments that can be understood by those skilled in the art.

Claims

1. A method for generating a self-guiding code error correction data set of a teenager programming scene uses a real sample of the teenager programming scene, fuses error information reported by a compiler, and generates the code error correction data through self-guiding of a conversational large language model, which comprises the following steps:

2. The method for generating a self-guided code error correction dataset for teenager programming scenarios according to claim 1, wherein in step S1, the teenager programming scenarios include teenager programming competitions; each topic also includes test cases.

3. The method for generating a self-guided code error correction dataset for teenager programming scenarios according to claim 1, wherein step S2 specifically comprises:

；

4. The method for generating a self-guided code error correction dataset for teenager programming scenarios according to claim 1, wherein step S4 specifically comprises:

；

Wherein, ；

5. The method for generating a self-guided code error correction dataset for teenager programming scenarios according to claim 4, wherein: when the generated error code does not satisfy the arbitration condition and secondary code generation is performed, the compiler's error message is merged into structured prompt two, $\mathrm{prompt}_2$, to obtain structured prompt three, $\mathrm{prompt}_3$:

$$\mathrm{prompt}_3 = \oplus(\mathrm{identity},\ q_d,\ q_c,\ l,\ e,\ m_b);$$