CN117215944A - Automatic evaluation method for large model code generation capacity and related products - Google Patents

Automatic evaluation method for large model code generation capacity and related products

Info

Publication number
CN117215944A
CN117215944A (application number CN202311212732.2A)
Authority
CN
China
Prior art keywords
evaluation
code
evaluated
test
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311212732.2A
Other languages
Chinese (zh)
Inventor
许娟婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pacific Insurance Technology Co Ltd
Original Assignee
Pacific Insurance Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pacific Insurance Technology Co Ltd filed Critical Pacific Insurance Technology Co Ltd
Priority to CN202311212732.2A priority Critical patent/CN117215944A/en
Publication of CN117215944A publication Critical patent/CN117215944A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses an automatic evaluation method for the code generation capability of large models, and related products, applicable to the technical field of data processing. The method comprises the following steps: inputting a preset Chinese test sample into a large language model to be evaluated, so that the large language model to be evaluated generates corresponding code to be evaluated based on the Chinese test sample; acquiring the code to be evaluated; and evaluating the code to be evaluated by using a preset test unit based on a pre-constructed execution evaluation standard to obtain an evaluation result. The evaluation method provided by the application therefore takes Chinese test samples into account, evaluates the code generation capability of the large model against the pre-constructed execution evaluation standard and preset test units, and improves both evaluation capability and evaluation flexibility.

Description

Automatic evaluation method for large model code generation capacity and related products
Technical Field
The application relates to the technical field of data processing, and in particular to an automatic evaluation method for the code generation capability of large models and a related product.
Background
With the development of the technology, large language models (LLMs) have made revolutionary breakthroughs in the field of natural language processing (NLP), and such large models also support code writing. When an LLM generates code, the main concern is the correctness of that code, so how to reasonably and accurately evaluate the code generation capability of large models is an important task.
Unlike natural language, two syntactically distinct code fragments may be semantically equivalent, which makes some classical NLP metrics unreliable. Existing evaluation methods generally rely on unit tests, but the number of such tests is limited, leading to insufficient testing and imprecise problem descriptions. In addition, the code evaluation benchmarks used in existing methods are all English data sets, which makes evaluation inflexible.
Therefore, how to improve the evaluation capability and evaluation flexibility for the code generation capability of large models is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
Based on the above problems, the application provides an automatic evaluation method for the code generation capability of large models and related products, which take Chinese test samples into account, evaluate the code generation capability of the large model through a pre-constructed execution evaluation standard and preset test units, and thereby address the insufficient evaluation capability and poor evaluation flexibility of existing evaluation methods.
In a first aspect, the present application provides an automatic evaluation method for large model code generation capability, including:
inputting a preset Chinese test sample into a large language model to be evaluated, so that the large language model to be evaluated generates corresponding code to be evaluated based on the Chinese test sample;
acquiring the code to be evaluated;
and evaluating the code to be evaluated by using a preset test unit based on a pre-constructed execution evaluation standard to obtain an evaluation result.
Optionally, before inputting the preset Chinese test sample into the large language model to be evaluated so that the large language model to be evaluated generates the corresponding code to be evaluated based on the Chinese test sample, the method further includes:
constructing a Chinese evaluation benchmark;
and acquiring a preset Chinese test sample from the Chinese evaluation benchmark.
Optionally, the constructing of a Chinese evaluation benchmark includes:
translating English evaluation problems in the basic English evaluation benchmark into Chinese by using a translation model, obtaining a first evaluation problem, and outputting the first evaluation problem for manual calibration;
acquiring a calibration result, and screening the first evaluation problem based on the calibration result to obtain a second evaluation problem; the second evaluation problem is an evaluation problem with an accurate calibration result in the first evaluation problem;
constructing a Chinese test sample based on the second evaluation problem and a basic Chinese programming problem;
and constructing a Chinese evaluation benchmark based on the Chinese test sample.
Optionally, before the evaluating of the code to be evaluated by using a preset test unit based on the pre-constructed execution evaluation standard to obtain an evaluation result, the method further includes:
constructing a first test unit group based on the neural network; the first test unit group comprises at least two test units to be screened;
testing the test unit to be screened by using a standard evaluation code to obtain a test result;
screening the test units to be screened based on the test results to obtain the test units to be screened whose test results are a pass, and marking them as preset test units;
and utilizing the preset test unit to construct a unit test sample library.
Optionally, before the evaluating of the code to be evaluated by using a preset test unit based on the pre-constructed execution evaluation standard to obtain an evaluation result, the method further includes:
determining a preset test unit corresponding to the code to be evaluated based on the code to be evaluated;
and acquiring the preset test unit from the unit test sample library.
Optionally, before the evaluating of the code to be evaluated by using a preset test unit based on the pre-constructed execution evaluation standard to obtain an evaluation result, the method further includes:
constructing an execution evaluation standard based on the code generation model evaluation index pass@k in combination with a unit test pass rate;
and constructing, based on the execution evaluation standard, evaluation dimensions that include compilation errors, run errors, memory overruns, time overruns, result errors, partial passes, and passes.
Optionally, the evaluating the code to be evaluated by using a preset test unit based on a pre-constructed execution evaluation standard to obtain an evaluation result, including:
evaluating the code to be evaluated by using a preset test unit based on the execution evaluation standard constructed from pass@k in combination with the unit test pass rate, and obtaining at least one of the evaluation dimensions as the evaluation result;
wherein, when the evaluation result of the code to be evaluated is a partial pass, the unit test pass rate is the proportion of the preset test units that the code to be evaluated passes out of the total number of preset test units used to evaluate it.
In a second aspect, the present application provides an automatic evaluation device for large model code generation capability, including:
the input module is used for inputting a preset Chinese test sample into a large language model to be evaluated, so that the large language model to be evaluated generates corresponding code to be evaluated based on the Chinese test sample;
the acquisition module is used for acquiring the code to be evaluated;
and the evaluation module is used for evaluating the code to be evaluated by utilizing a preset test unit based on a preset execution evaluation standard to obtain an evaluation result.
In a third aspect, the present application provides an automatic evaluation apparatus for large model code generation capability, including:
a memory for storing a computer program;
a processor for implementing the steps of the method for automatically evaluating large model code generation capabilities according to any one of the above, when executing the computer program.
In a fourth aspect, the present application provides a readable storage medium, wherein a computer program is stored on the readable storage medium, and the computer program, when executed by a processor, implements the steps of the method for automatically evaluating the large model code generating capability according to any one of the above.
Compared with the prior art, the application has the following advantages:
In the method, a preset Chinese test sample is first input into the large language model to be evaluated, so that the model generates corresponding code to be evaluated based on the Chinese test sample. The code to be evaluated is then acquired. Finally, the code to be evaluated is evaluated with a preset test unit based on a pre-constructed execution evaluation standard to obtain an evaluation result. The evaluation method provided by the application therefore takes Chinese test samples into account, evaluates the code generation capability of the large model against the pre-constructed execution evaluation standard and preset test units, and improves both evaluation capability and evaluation flexibility.
Drawings
FIG. 1 is a flow chart of an automatic evaluation method for large model code generation capability provided by the application;
FIG. 2 is a flow chart of a method of generating unit tests provided by the present application;
FIG. 3 is a schematic structural diagram of an automatic evaluation device for large model code generation capability provided by the application.
Detailed Description
As described above, existing evaluation methods suffer from insufficient evaluation capability and poor evaluation flexibility. Specifically, with the development of the technology, large language models (LLMs) have made revolutionary breakthroughs in the field of natural language processing (NLP), and such large models also support code writing. When an LLM generates code, the main concern is the correctness of that code. Unlike natural language, two syntactically distinct code fragments may be semantically equivalent, which makes classical NLP metrics such as BLEU unreliable. In the prior art, the correctness of code generated by large models is usually evaluated with unit tests, for example with the HumanEval evaluation benchmark. However, the number of unit tests is still insufficient, because manually constructing high-quality tests is difficult, especially for complex programs; this leads to insufficient testing and to problem descriptions that are neither well defined nor accurate. In addition, in existing evaluation methods the code evaluation benchmarks are essentially English data sets and there is no Chinese code evaluation benchmark, so the evaluation lacks flexibility.
In order to solve the above problems, the present application provides an automatic evaluation method for the code generation capability of large models, including: first, inputting a preset Chinese test sample into a large language model to be evaluated, so that the model generates corresponding code to be evaluated based on the Chinese test sample; then acquiring the code to be evaluated; and finally evaluating the code to be evaluated with a preset test unit based on a pre-constructed execution evaluation standard to obtain an evaluation result.
The evaluation method provided by the application therefore takes Chinese test samples into account, evaluates the code generation capability of the large model against the pre-constructed execution evaluation standard and preset test units, and improves both evaluation capability and evaluation flexibility.
It should be noted that the automatic evaluation method for large model code generation capability and the related products provided by the application can be applied to the technical field of data processing. The foregoing is merely exemplary, and the application fields of the automatic evaluation method for large model code generation capability and the related products provided by the present application are not limited thereto.
In order to make the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
FIG. 1 is a flow chart of an automatic evaluation method for large model code generation capability provided by the application. Referring to fig. 1, the method for automatically evaluating the large model code generating capability provided by the present application may include:
s101: inputting a preset Chinese test sample into a large-scale language model to be evaluated, and enabling the large-scale language model to be evaluated to generate a corresponding code to be evaluated based on the Chinese test sample.
In practical application, with the development of the technology, LLMs have made revolutionary breakthroughs in the NLP field, and such large models also support code writing. When an LLM generates code, the main concern is the correctness of that code. A large language model can generate code from natural language instructions, and that code can then be executed. Therefore, when evaluating the code generation capability of a large language model, a test sample needs to be input into the model so that the model generates the corresponding code. Specifically, the existing code evaluation benchmarks are essentially English data sets and a Chinese code evaluation benchmark is lacking. When evaluating, the method of the application first inputs the preset Chinese test sample into the large language model to be evaluated, and the model correspondingly generates the code to be evaluated. The Chinese test samples cover API calls, Chinese understanding, content understanding, algorithmic capability, mathematical reasoning, and simple interview questions.
In addition, since the Chinese test sample can be acquired in more than one manner, the present application is described in terms of one possible acquisition manner.
One case concerns how to obtain the Chinese test sample. Correspondingly, before inputting the preset Chinese test sample into the large language model to be evaluated so that the large language model to be evaluated generates the corresponding code to be evaluated based on the Chinese test sample, the method further comprises the steps of:
constructing a Chinese evaluation benchmark;
and acquiring a preset Chinese test sample from the Chinese evaluation benchmark.
In practical application, the existing code evaluation benchmarks are essentially English data sets, and a Chinese code evaluation benchmark is lacking. Therefore, the application first builds a Chinese evaluation benchmark. Specifically, the Chinese evaluation benchmark mainly comprises Chinese test samples covering API calls (including calls to third-party Python libraries), Chinese understanding, content understanding, algorithmic capability, mathematical reasoning, and simple programming interview questions. When evaluating the code generation capability of a large model, the automatic evaluation device for large model code generation capability can acquire a preset Chinese test sample from the Chinese evaluation benchmark.
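Purely for illustration, one Chinese test sample in such a benchmark could be represented roughly as in the following sketch; the ChineseTestSample class and its field names are assumptions introduced here, not structures disclosed by the application.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChineseTestSample:
    # Hypothetical representation of one sample in the Chinese evaluation benchmark.
    task_id: str        # illustrative identifier, e.g. "cn_eval/001"
    category: str       # API call, Chinese understanding, content understanding,
                        # algorithmic capability, mathematical reasoning, or interview question
    instruction: str    # Chinese natural language description (text-to-code style)
    ground_truth: str   # reference implementation used when screening generated unit tests
    unit_tests: List[str] = field(default_factory=list)   # assertions from the unit test sample library

sample = ChineseTestSample(
    task_id="cn_eval/001",
    category="mathematical reasoning",
    instruction="给定一个正浮点数，它可以分解为整数部分和小数部分。返回该数的小数部分。请用 python 代码实现。",
    ground_truth="def truncate_number(number: float) -> float:\n    return number % 1.0",
    unit_tests=["assert abs(truncate_number(3.5) - 0.5) < 1e-6"],
)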
In addition, since the Chinese test sample can be constructed in more than one manner, the present application is described in terms of one possible construction.
One case concerns how to construct the Chinese test sample. Correspondingly, the constructing of a Chinese evaluation benchmark comprises the following steps:
translating English evaluation problems in the basic English evaluation benchmark into Chinese by using a translation model, obtaining a first evaluation problem, and outputting the first evaluation problem for manual calibration;
acquiring a calibration result, and screening the first evaluation problem based on the calibration result to obtain a second evaluation problem; the second evaluation problem is an evaluation problem with an accurate calibration result in the first evaluation problem;
constructing a Chinese test sample based on the second evaluation problem and a basic Chinese programming problem;
and constructing a Chinese evaluation benchmark based on the Chinese test sample.
In practical application, the constructed Chinese evaluation benchmark comprises 370 Chinese test sample cases and is derived from Chinese programming questions, programming application questions, and questions selected from the English evaluation benchmarks HumanEval and MBPP that were translated into Chinese and manually checked for accuracy. Specifically, a translation model can be used to translate the English evaluation problems in the basic English evaluation benchmark into Chinese, so as to obtain first evaluation problems, which are output for manual calibration. This yields a set of problems marked as either accurately or inaccurately translated. The automatic evaluation device for large model code generation capability can then screen out the accurately translated evaluation problems based on the calibration result, record them as second evaluation problems, and use them to construct the Chinese evaluation benchmark. It should be noted that the original English evaluation benchmarks use a code-completion task format, where each test sample includes a natural language description of the problem, input and output descriptions, and some input/output examples (the original example is omitted here; an illustrative sketch of both formats is given after the next paragraph).
In contrast, to be more convenient for developers, the application constructs an instruction (text-to-code) format in which each test sample contains only a natural language description, for example: "Given a positive floating point number, it can be decomposed into an integer part and a fractional part. Return the fractional part of the number. Please implement this in Python code."
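For illustration only (the application's original example is not reproduced here), the difference between the two prompt styles might look as follows; the concrete task wording is a hypothetical stand-in.
# Hypothetical HumanEval/MBPP-style code-completion prompt: the model is asked to
# continue a function whose docstring carries the description and input/output examples.
completion_prompt = '''def truncate_number(number: float) -> float:
    """ Given a positive floating point number, it can be decomposed into
    an integer part and a fractional part. Return the fractional part.
    >>> truncate_number(3.5)
    0.5
    """
'''

# Instruction (text-to-code) prompt as described by the application: only a natural
# language description, with no function signature or input/output examples.
instruction_prompt = (
    "给定一个正浮点数，它可以分解为整数部分和小数部分。"
    "返回该数的小数部分。请用 python 代码实现。"
)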
S102: and acquiring the code to be evaluated.
In practical application, the automatic evaluation device for large model code generation capability inputs the preset Chinese test sample into the large language model to be evaluated, and the large language model to be evaluated correspondingly generates the code to be evaluated. The automatic evaluation device then acquires the code to be evaluated and evaluates it to obtain an evaluation result.
S103: and evaluating the code to be evaluated by using a preset test unit based on a pre-constructed execution evaluation standard to obtain an evaluation result.
In practical application, after the automatic evaluation device for large model code generation capability acquires the code to be evaluated that was generated by the large language model to be evaluated, it evaluates that code with a preset test unit based on the pre-constructed execution evaluation standard to obtain an evaluation result. It should be noted that the execution evaluation standard pre-constructed in the present application evaluates the code capability of large models at a finer granularity than the prior art, and the preset test units provided in the present application give a more comprehensive evaluation than the prior art.
In addition, since the unit test sample library can be constructed in more than one way, the present application is described in terms of one possible construction.
One case concerns how to construct the unit test sample library. Correspondingly, before the evaluating of the code to be evaluated by using a preset test unit based on the pre-constructed execution evaluation standard to obtain an evaluation result, the method further comprises:
constructing a first test unit group based on the neural network; the first test unit group comprises at least two test units to be screened;
testing the test unit to be screened by using a standard evaluation code to obtain a test result;
screening the test units to be screened based on the test results to obtain the test units to be screened whose test results are a pass, and marking them as preset test units;
and utilizing the preset test unit to construct a unit test sample library.
In practical application, the unit test sample library stores the test units used to test the code to be evaluated. Fig. 2 is a flowchart of the method for generating unit tests according to the present application. With reference to Fig. 2, a prompt is designed to generate sample unit tests. Specifically, a plurality of new unit tests are generated with the neural network ChatGPT, i.e. at least two test units to be screened are generated, forming the first test unit group. The test units to be screened are then tested against the standard evaluation code (the ground_truth) within the automatic evaluation framework, so as to screen out high-quality unit tests, i.e. the test units to be screened whose test results are a pass are retained and marked as preset test units. The library formed by the preset test units that pass this screening is the unit test sample library. It should be noted that the automatic evaluation framework is built on the automatic evaluation method for large model code generation capability provided by the application. In addition, the prompt is designed with a task description ("Please generate corresponding unit tests according to the provided problem description and code") and a problem description composed of two parts, the ground_truth code and unit test input/output samples.
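A minimal sketch of this generate-and-screen step, under the assumption that generate_unit_tests is a hypothetical wrapper around the ChatGPT prompt described above, might look like this; screening simply executes each candidate assertion against the ground_truth code and keeps only the assertions that pass.
from typing import Callable, List

def screen_unit_tests(ground_truth: str, candidates: List[str]) -> List[str]:
    # Keep only the candidate unit tests that the ground_truth (standard evaluation code) passes.
    kept = []
    for test in candidates:
        namespace: dict = {}
        try:
            exec(ground_truth, namespace)   # define the reference implementation
            exec(test, namespace)           # run the candidate assertion against it
        except Exception:
            continue                        # discard tests that the reference code cannot pass
        kept.append(test)
    return kept

def build_unit_test_library(problem: str, ground_truth: str,
                            generate_unit_tests: Callable[[str], List[str]]) -> List[str]:
    # generate_unit_tests is a hypothetical helper standing in for the ChatGPT call;
    # it returns candidate assertion strings (the first test unit group).
    prompt = ("请根据提供的问题描述和代码生成相应的单元测试。\n"
              f"问题描述：{problem}\n参考代码：\n{ground_truth}")
    candidates = generate_unit_tests(prompt)
    return screen_unit_tests(ground_truth, candidates)   # the retained preset test units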
In addition, since the preset test unit can be acquired in more than one way, the present application is described in terms of one possible acquisition manner.
One case concerns how to acquire the preset test unit. Correspondingly, before the evaluating of the code to be evaluated by using a preset test unit based on the pre-constructed execution evaluation standard to obtain an evaluation result, the method further comprises:
determining a preset test unit corresponding to the code to be evaluated based on the code to be evaluated;
and acquiring the preset test unit from the unit test sample library.
In practical application, various preset test units are stored in the constructed unit test sample library, and different test samples have corresponding test units. Therefore, before acquiring the preset test units, the automatic evaluation device for large model code generation capability needs to confirm the code to be evaluated generated by the large language model to be evaluated, determine the corresponding preset test units based on that code, and acquire those preset test units from the unit test sample library for testing.
In addition, since the execution evaluation standard can be constructed in more than one way, the present application is described in terms of one possible construction.
One case concerns how to construct the execution evaluation standard. Correspondingly, before the evaluating of the code to be evaluated by using a preset test unit based on the pre-constructed execution evaluation standard to obtain an evaluation result, the method further comprises:
constructing an execution evaluation standard based on the code generation model evaluation index pass@k in combination with a unit test pass rate;
and constructing, based on the execution evaluation standard, evaluation dimensions that include compilation errors, run errors, memory overruns, time overruns, result errors, partial passes, and passes.
In practical application, in order to evaluate the code generation capability of large models strictly and accurately, the application provides a fine-grained execution evaluation standard: the code generation model evaluation index pass@k plus the unit test pass rate. Then, based on this execution evaluation standard and in order to evaluate the usability of the code from the "executable" perspective of unit testing, 7 evaluation dimensions are designed, specifically including: (1) compilation error: the program fails to compile due to syntax errors; (2) run error: the program compiles but fails at runtime, for example due to environment problems; (3) memory overrun: the program occupies too much memory during execution and exceeds the memory limit; (4) time overrun: the program runs longer than the preset time limit; (5) result error: the program compiles and runs successfully, but the result is inconsistent with the correct result of the unit test; (6) partial pass: the percentage of unit tests that the program passes; (7) pass: the program passes all unit tests. Under the pass@k evaluation index, a sample is counted as passed only if it passes all of the unit tests set for it, i.e. all of its preset test units, and the statistic is the proportion of samples that pass all of their preset test units.
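The application does not spell out how pass@k itself is computed; as a sketch, the unbiased estimator commonly used in the code-generation literature (pass@k = 1 - C(n-c, k) / C(n, k), for n generated completions of which c pass all unit tests) can be combined with the per-sample unit test pass rate as follows. The numbers in the example are illustrative only.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator of the probability that at least one of k samples,
    # drawn from n generated completions of which c pass all unit tests, passes.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def unit_test_pass_ratio(n_passed: int, n_total: int) -> float:
    # Finer-grained measure for partially passing samples: pass_ratio = n / N * 100%.
    return n_passed / n_total * 100.0

print(pass_at_k(n=10, c=3, k=1))    # 0.3  (illustrative numbers)
print(unit_test_pass_ratio(2, 4))   # 50.0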
In addition, since the evaluation can be carried out in more than one way, the present application is described in terms of one possible evaluation manner.
One case concerns how the evaluation is performed. Correspondingly, the evaluating of the code to be evaluated by using a preset test unit based on the pre-constructed execution evaluation standard to obtain an evaluation result comprises the following steps:
evaluating the code to be evaluated by using a preset test unit based on the execution evaluation standard constructed from pass@k in combination with the unit test pass rate, and obtaining at least one of the evaluation dimensions as the evaluation result;
wherein, when the evaluation result of the code to be evaluated is a partial pass, the unit test pass rate is the proportion of the preset test units that the code to be evaluated passes out of the total number of preset test units used to evaluate it.
In practical application, the automatic evaluation device for large model code generation capability evaluates the code to be evaluated with preset test units that were generated with ChatGPT and screened, based on the execution evaluation standard constructed from pass@k plus the unit test pass rate, and obtains at least one of the evaluation dimensions as the evaluation result. The unit test pass rate, i.e. the percentage of preset test units passed within a partially passing sample, is the finer-grained evaluation measure. The unit test pass rate is calculated as:
pass_ratio = n / N * 100%
where N is the total number of preset test units for the sample and n is the number of preset test units that the code passes. The evaluation index of the application, namely pass@k plus the unit test pass rate, can therefore evaluate the code capability of a large model at a finer granularity.
In addition, in order to take security and executability into account, the application also provides an automatic evaluation framework that implements the automatic evaluation method for large model code generation capability. Specifically, the input to the code automatic evaluation framework comprises imports of the relevant Python libraries, the complete code implementation, a def check function, and the unit test assertions, in the following format (the original example is omitted; an illustrative sketch is given below):
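The original input example is omitted from the text above; a hypothetical input following the described structure (library imports, the complete code implementation as generated by the model, a def check function, and unit test assertions) might look like this.
# --- imports of the relevant Python libraries ---
import math

# --- complete code implementation (as generated by the model under evaluation) ---
def truncate_number(number: float) -> float:
    return number - math.floor(number)

# --- def check: unit test assertions taken from the preset test units ---
def check(candidate):
    assert abs(candidate(3.5) - 0.5) < 1e-6
    assert abs(candidate(1.25) - 0.25) < 1e-6

check(truncate_number)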
output of the code automatic evaluation framework: the reasons for whether pass, fail, and the unit test pass percentages are as follows:
{"result":"passed","whole_passed":true,"pass_ratio":100%}
{"result":"failed:unsupported operand type(s)for-:'list'and'list'","whole_passed":false,"pass_ratio":50%}。
In summary, the method first inputs a preset Chinese test sample into the large language model to be evaluated, so that the model generates corresponding code to be evaluated based on the Chinese test sample; the code to be evaluated is then acquired; and finally the code to be evaluated is evaluated with a preset test unit based on a pre-constructed execution evaluation standard to obtain an evaluation result. The evaluation method provided by the application therefore takes Chinese test samples into account, evaluates the code generation capability of the large model against the pre-constructed execution evaluation standard and preset test units, and improves both evaluation capability and evaluation flexibility.
Based on the automatic evaluation method of the large model code generation capability provided by the embodiment, the application also provides an automatic evaluation device of the large model code generation capability. The automatic evaluation device for the large model code generation capability is described below with reference to the embodiments and the drawings, respectively.
FIG. 3 is a schematic structural diagram of an automatic evaluation device for large model code generation capability provided by the application. Referring to fig. 3, an apparatus 200 for automatically evaluating a large model code generating capability according to an embodiment of the present application includes:
the input module 201 is configured to input a preset Chinese test sample into a large language model to be evaluated, so that the large language model to be evaluated generates corresponding code to be evaluated based on the Chinese test sample;
an acquisition module 202, configured to acquire the code to be evaluated;
and the evaluation module 203 is configured to evaluate the code to be evaluated by using a preset test unit based on a preset execution evaluation standard, so as to obtain an evaluation result.
As an embodiment, for obtaining the Chinese test sample cases, the automatic evaluation device 200 for large model code generation capability described above further includes: a first construction sub-module and a first acquisition sub-module;
the first construction submodule is used for constructing a Chinese evaluation benchmark;
the first acquisition sub-module is used for acquiring a preset Chinese test sample from the Chinese evaluation benchmark.
As an implementation manner, the first building sub-module is specifically configured to:
translating English evaluation problems in the basic English evaluation benchmark into Chinese by using a translation model, obtaining a first evaluation problem, and outputting the first evaluation problem for manual calibration;
acquiring a calibration result, and screening the first evaluation problem based on the calibration result to obtain a second evaluation problem; the second evaluation problem is an evaluation problem with an accurate calibration result in the first evaluation problem;
constructing a Chinese test sample based on the second evaluation problem and a basic Chinese programming problem;
and constructing a Chinese evaluation benchmark based on the Chinese test sample.
As an embodiment, the automatic evaluation apparatus 200 for large model code generation capability described above further includes: a second building sub-module;
the second construction submodule is used for constructing the first test unit group based on the neural network; the first test unit group comprises at least two test units to be screened;
testing the test unit to be screened by using a standard evaluation code to obtain a test result;
screening the test units to be screened based on the test results to obtain the test units to be screened whose test results are a pass, and marking them as preset test units;
and utilizing the preset test unit to construct a unit test sample library.
As an embodiment, the automatic evaluation device 200 for large model code generating capability further includes: a second acquisition sub-module;
the second acquisition submodule is used for determining a preset test unit corresponding to the code to be evaluated based on the code to be evaluated;
and acquiring the preset test unit from the unit test sample library.
As an embodiment, the automatic evaluation apparatus 200 for large model code generation capability described above further includes: a third building sub-module;
the third construction sub-module is used for constructing an execution evaluation standard based on the code generation model evaluation index pass@k in combination with a unit test pass rate;
and constructing, based on the execution evaluation standard, evaluation dimensions that include compilation errors, run errors, memory overruns, time overruns, result errors, partial passes, and passes.
As an embodiment, regarding how the evaluation is performed, the evaluation module 203 is specifically configured to:
evaluating the code to be evaluated by using a preset test unit based on the execution evaluation standard constructed from pass@k in combination with the unit test pass rate, and obtaining at least one of the evaluation dimensions as the evaluation result;
wherein, when the evaluation result of the code to be evaluated is a partial pass, the unit test pass rate is the proportion of the preset test units that the code to be evaluated passes out of the total number of preset test units used to evaluate it.
In summary, the device first inputs a preset Chinese test sample into the large language model to be evaluated, so that the model generates corresponding code to be evaluated based on the Chinese test sample; the code to be evaluated is then acquired; and finally the code to be evaluated is evaluated with a preset test unit based on a pre-constructed execution evaluation standard to obtain an evaluation result. The evaluation device provided by the application therefore takes Chinese test samples into account, evaluates the code generation capability of the large model against the pre-constructed execution evaluation standard and preset test units, and improves both evaluation capability and evaluation flexibility.
In addition, the application also provides an automatic evaluation device for the generation capacity of the large model code, which comprises the following components: a memory for storing a computer program; a processor for implementing the steps of the method for automatically evaluating large model code generation capabilities according to any one of the above, when executing the computer program.
In addition, the application also provides a readable storage medium, wherein the readable storage medium is stored with a computer program, and the computer program realizes the steps of the automatic evaluation method for the large model code generating capability according to any one of the above steps when being executed by a processor.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An automatic evaluation method for large model code generation capability, characterized in that the method comprises the following steps:
inputting a preset Chinese test sample into a large language model to be evaluated, so that the large language model to be evaluated generates corresponding code to be evaluated based on the Chinese test sample;
acquiring the code to be evaluated;
and evaluating the code to be evaluated by using a preset test unit based on a pre-constructed execution evaluation standard to obtain an evaluation result.
2. The method according to claim 1, wherein, before inputting the preset Chinese test sample into the large language model to be evaluated so that the large language model to be evaluated generates the corresponding code to be evaluated based on the Chinese test sample, the method further comprises:
constructing a Chinese evaluation benchmark;
and acquiring a preset Chinese test sample from the Chinese evaluation benchmark.
3. The method of claim 2, wherein said constructing a Chinese evaluation benchmark comprises:
translating English evaluation problems in the basic English evaluation standard into Chinese by using a translation model, obtaining a first evaluation problem and outputting the first evaluation problem for manual calibration;
acquiring a calibration result, and screening the first evaluation problem based on the calibration result to obtain a second evaluation problem; the second evaluation problem is an evaluation problem with an accurate calibration result in the first evaluation problem;
constructing a Chinese test sample based on the second evaluation problem and a basic Chinese programming problem;
and constructing a Chinese evaluation benchmark based on the Chinese test sample.
4. The method according to claim 1, wherein the evaluating the code to be evaluated by using a preset test unit based on a pre-constructed execution evaluation criterion, before obtaining an evaluation result, further comprises:
constructing a first test unit group based on the neural network; the first test unit group comprises at least two test units to be screened;
testing the test unit to be screened by using a standard evaluation code to obtain a test result;
screening the test units to be screened based on the test results to obtain the test units to be screened whose test results are a pass, and marking them as preset test units;
and utilizing the preset test unit to construct a unit test sample library.
5. The method according to claim 4, wherein the evaluating the code to be evaluated by using a preset test unit based on the pre-constructed execution evaluation criterion, before obtaining the evaluation result, further comprises:
determining a preset test unit corresponding to the code to be evaluated based on the code to be evaluated;
and acquiring the preset test unit from the unit test sample library.
6. The method according to claim 1, wherein the evaluating the code to be evaluated by using a preset test unit based on a pre-constructed execution evaluation criterion, before obtaining an evaluation result, further comprises:
based on a code generation model evaluation index pass@k, constructing an execution evaluation standard by combining unit test passing rate;
based on the execution evaluation criteria, an evaluation dimension is constructed that includes compilation errors, run errors, memory overruns, time overruns, result errors, partial passes, and passes.
7. The method according to claim 6, wherein the evaluating the code to be evaluated by using a preset test unit based on a pre-constructed execution evaluation criterion to obtain an evaluation result comprises:
evaluating the code to be evaluated by using a preset test unit based on the pass@k and the execution evaluation standard constructed by combining the unit test passing rate, and obtaining at least one of the evaluation dimensions as an evaluation result;
wherein, when the evaluation result of the code to be evaluated is a partial pass, the unit test pass rate is the proportion of the preset test units that the code to be evaluated passes out of the total number of preset test units used to evaluate it.
8. An automatic evaluation device for large model code generation capability, comprising:
the input module is used for inputting a preset Chinese test sample into a large language model to be evaluated, so that the large language model to be evaluated generates corresponding code to be evaluated based on the Chinese test sample;
the acquisition module is used for acquiring the code to be evaluated;
and the evaluation module is used for evaluating the code to be evaluated by utilizing a preset test unit based on a preset execution evaluation standard to obtain an evaluation result.
9. An automatic evaluation apparatus for large model code generation capability, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method for automatically evaluating large model code generation capabilities according to any one of claims 1 to 7 when executing said computer program.
10. A readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the steps of the method for automatically evaluating large model code generating capabilities according to any one of claims 1 to 7.
CN202311212732.2A 2023-09-19 2023-09-19 Automatic evaluation method for large model code generation capacity and related products Pending CN117215944A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311212732.2A CN117215944A (en) 2023-09-19 2023-09-19 Automatic evaluation method for large model code generation capacity and related products

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311212732.2A CN117215944A (en) 2023-09-19 2023-09-19 Automatic evaluation method for large model code generation capacity and related products

Publications (1)

Publication Number Publication Date
CN117215944A true CN117215944A (en) 2023-12-12

Family

ID=89043967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311212732.2A Pending CN117215944A (en) 2023-09-19 2023-09-19 Automatic evaluation method for large model code generation capacity and related products

Country Status (1)

Country Link
CN (1) CN117215944A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117421414A (en) * 2023-12-18 2024-01-19 珠海金智维信息科技有限公司 Design method of RPA intelligent interactive system based on AIGC
CN117421414B (en) * 2023-12-18 2024-03-26 珠海金智维信息科技有限公司 Design method of RPA intelligent interactive system based on AIGC
CN118312212A (en) * 2024-06-11 2024-07-09 阿里巴巴(中国)有限公司 Task test method, code annotation method, task test platform and equipment

Similar Documents

Publication Publication Date Title
CN117215944A (en) Automatic evaluation method for large model code generation capacity and related products
Gerpheide et al. Assessing and improving quality of QVTo model transformations
Cuadrado et al. Anatlyzer: An advanced ide for atl model transformations
Rani et al. A decade of code comment quality assessment: A systematic literature review
US11734159B2 (en) Ranking test cases specific to changes in software code
CN114116510A (en) Interface parameter checking method and device
CN109800152A (en) A kind of automated testing method and terminal device
CN111143228B (en) Test code generation method and device based on decision table method
US11526429B1 (en) Identifying critical methods and critical paths in software code
CN116521512A (en) Accurate test method and device for codes, electronic equipment and computer readable medium
CN109710523B (en) Visual draft test case generation method and device, storage medium and electronic equipment
Pelivani et al. An empirical study of user interface testing tools
CN118151998A (en) Code annotation quality determining method, device, equipment and readable storage medium
Tarassow The potential of LLMs for coding with low-resource and domain-specific programming languages
US10324829B2 (en) Application testing
Villalobos-Arias et al. Evaluation of a model‐based testing platform for Java applications
Mousavi Maintainability evaluation of single page application frameworks: Angular2 vs. react
CN114816971A (en) Data processing method, related equipment and storage medium
Beer et al. Examination of Code generated by Large Language Models
Tang et al. BioCoder: a benchmark for bioinformatics code generation with large language models
Faragó Maintainability of Source Code and its Connection to Version Control History Metrics
CN118034661B (en) Intelligent task application system of large language model
Stelzig MoPyRegtest: A Python package for continuous integration-friendly regression testing of Modelica libraries
Tarkhov et al. Assessing Software Interface Quality in the Humanmachine Interaction Systems
Fatima et al. Towards a Sustainability-Aware Software

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination