CN117993366A - Evaluation item dynamic generation method and system, electronic equipment and readable storage medium - Google Patents

Evaluation item dynamic generation method and system, electronic equipment and readable storage medium

Info

Publication number
CN117993366A
CN117993366A (application CN202410381770.9A)
Authority
CN
China
Prior art keywords
test
difficulty
test questions
questions
reply information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410381770.9A
Other languages
Chinese (zh)
Other versions
CN117993366B (en)
Inventor
何召锋
尚余虎
程祥
项刘宇
吴惠甲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202410381770.9A
Publication of CN117993366A
Application granted
Publication of CN117993366B
Legal status: Active (granted)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The disclosure provides a method and system for dynamically generating evaluation questions, an electronic device, and a readable storage medium, belonging to the field of model evaluation. The method includes: generating a plurality of first test questions based on seed questions and prompts; determining the difficulty of the plurality of first test questions based on first reply information of a target model for the plurality of first test questions; if the difficulty of the plurality of first test questions does not meet a preset difficulty, adjusting the plurality of first test questions and returning to the step of determining their difficulty based on the reply information of the target model; and if the difficulty of the plurality of first test questions meets the preset difficulty, determining the plurality of first test questions as the test questions for the target model. The method and system, the electronic device, and the readable storage medium solve the problem that existing evaluation methods lack adaptivity.

Description

Evaluation item dynamic generation method and system, electronic equipment and readable storage medium
Technical Field
The disclosure belongs to the technical field of model evaluation, and more particularly relates to a method and system for dynamically generating evaluation questions, an electronic device, and a readable storage medium.
Background
The large language model (Large Language Model, LLM) is an advanced natural language processing technique that learns rich language knowledge and patterns by pre-training on large amounts of text data. Such models can generate fluent, coherent, and logical text, and can perform tasks such as question answering and sentiment analysis. With the rapid development of large language models, concerns are growing about the risks they may carry and their potential negative social impact, so comprehensive evaluation of these models is becoming increasingly important. However, existing evaluation methods often lack adaptivity: question difficulty cannot be dynamically adjusted to different application scenarios and user requirements, which limits evaluation accuracy in the face of continually emerging large models.
Disclosure of Invention
The disclosure aims to provide a method and system for dynamically generating evaluation questions, an electronic device, and a readable storage medium, so as to solve the problem that existing evaluation methods lack adaptivity.
In a first aspect of the embodiments of the present disclosure, a method for dynamically generating evaluation questions is provided, including:
generating a plurality of first test questions based on seed questions and prompts;
determining the difficulty of the plurality of first test questions based on first reply information of a target model for the plurality of first test questions;
if the difficulty of the plurality of first test questions does not meet a preset difficulty, adjusting the plurality of first test questions and returning to the step of determining the difficulty of the plurality of first test questions based on the reply information of the target model for the plurality of first test questions;
and if the difficulty of the plurality of first test questions meets the preset difficulty, determining the plurality of first test questions as the test questions for the target model.
In a second aspect of the embodiments of the present disclosure, a system for dynamically generating evaluation questions is provided, including:
a question generation module, configured to generate a plurality of first test questions based on seed questions and prompts;
a question difficulty determination module, configured to determine the difficulty of the plurality of first test questions based on first reply information of a target model for the plurality of first test questions;
a question difficulty adjustment module, configured to, if the difficulty of the plurality of first test questions does not meet a preset difficulty, adjust the plurality of first test questions and return to the step of determining the difficulty of the plurality of first test questions based on the reply information of the target model for the plurality of first test questions;
and a question determination module, configured to determine the plurality of first test questions as the test questions for the target model if the difficulty of the plurality of first test questions meets the preset difficulty.
In a third aspect of the embodiments of the present disclosure, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the steps of the above method for dynamically generating evaluation questions are implemented when the processor executes the computer program.
In a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, implements the steps of the above method for dynamically generating evaluation questions.
The method and system for dynamically generating evaluation questions, the electronic device, and the readable storage medium have the following beneficial effects: the invention provides a method for adaptively adjusting question difficulty, dynamically adjusting the difficulty according to the performance of the large model. This makes the evaluation results more accurate and suitable for different tasks and fields. When facing continually emerging large models, the method can flexibly adjust question difficulty according to actual demand, improving the adaptability and practicality of the evaluation method.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required for the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a flowchart of a method for dynamically generating an evaluation item according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of a method for dynamically generating evaluation questions according to an embodiment of the disclosure;
FIG. 3 is a block diagram of an evaluation question dynamic generation system according to an embodiment of the present disclosure;
Fig. 4 is a schematic block diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
For the purposes of promoting an understanding of the principles and advantages of the disclosure, reference will now be made to the embodiments illustrated in the drawings.
Referring to fig. 1, fig. 1 is a flowchart of a method for dynamically generating an evaluation item according to an embodiment of the disclosure, where the method includes:
s101: a plurality of first test topics is generated based on the seed topics and the hints.
In this embodiment, a plurality of first test questions are generated according to the seed questions and the conditions for prompting to generate the plurality of first test questions.
Seed topics refer to data samples used to initialize or train a model, which are called "seeds" because they are the basis for model learning and eventually grow into a complete model just as seeds germinate.
In this embodiment, first, a seed question set S is given, where S is a set of questions in fields such as mathematics, physics, chemistry, history, or programming. It can be expressed as:

$S = \{s_1, s_2, \ldots, s_n\}$

where $s_i$ is the $i$-th seed question in the set $S$ and $n$ is the total number of seed questions.
Then, a prompt instruction P for guiding test question generation and a seed question generation condition set C are given. The condition set C contains the requirements that the generated questions must satisfy, including but not limited to question type, difficulty level, and content field. The condition set can be expressed as:

$C = \{c_1, c_2, \ldots, c_n\}$

where $c_i$ is the $i$-th condition in the set $C$ and $n$ is the total number of conditions.
Finally, based on the given seed questions and the prompt instruction P, a plurality of test questions (first test questions) are generated using the advanced natural language generation capability of a large language model. In this embodiment, the advanced natural language generation capability of GPT-4 may be employed to generate the test questions.
In this process, the seed questions S and the prompt instruction P guiding question generation must first be converted into a format that GPT-4 can understand;
the seed questions, the prompt instruction P, and the conditions are then injected into the input text box of the large language model, which automatically generates the questions, as sketched below.
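To make this step concrete, here is a minimal sketch of assembling the seed questions, the prompt instruction P, and a condition from C into a single model input; the `Condition` fields and the `generate` client function are hypothetical stand-ins for whatever LLM API is actually used:

```python
from dataclasses import dataclass

@dataclass
class Condition:
    question_type: str   # e.g. "single choice"
    difficulty: float    # target difficulty in [0, 1]
    field: str           # e.g. "world history"

def build_prompt(seeds: list[str], instruction: str, cond: Condition) -> str:
    """Convert seed questions S, prompt instruction P, and a condition c_i
    into one text block the large language model can understand."""
    seed_block = "\n".join(f"- {s}" for s in seeds)
    return (
        f"{instruction}\n"
        f"Seed questions:\n{seed_block}\n"
        f"Constraints: type={cond.question_type}, "
        f"difficulty={cond.difficulty}, field={cond.field}\n"
        "Generate one new test question that satisfies the constraints."
    )

def generate(prompt: str) -> str:
    """Hypothetical LLM call; substitute the concrete GPT-4 client in use."""
    raise NotImplementedError
```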
One specific embodiment of the present disclosure is as follows:
After the prompt instruction P is input, a new history test question is generated from the seed questions in the seed question set S and the condition set C; the question type of the new test question is single choice, the difficulty level is 0.6, and the content field is world history.
Generated question: which explorer first reached the American continent in 1492?
A. Christopher Columbus
B. Ferdinand Magellan
C. Vasco da Gama
D. James Cook
High-quality test questions help us evaluate the performance and accuracy of a model more precisely. Through test questions, the defects of the model can be discovered, and repair and optimization can be performed in time. In addition, test questions help evaluate the reliability and scalability of the model and provide a basis for further improvement. If the quality of the test questions is poor, for example because of erroneous data, missing data, or deviations in certain features, the measured performance of the model may be degraded, and misleading results may even be obtained. Therefore, to ensure the performance and quality of a large language model, we need to pay attention to the quality of the test questions and take corresponding measures to improve it.
In this embodiment, the quality of the plurality of first test questions is scored by a first formula:

$Q_i = \alpha \cdot Sim_i + \beta \cdot Cre_i + \gamma \cdot Flu_i$

where $Q_i$ is the quality of the $i$-th first test question, $Sim_i$ is the similarity between the $i$-th first test question and the corresponding seed question, $Cre_i$ is the creativity of the $i$-th first test question, $Flu_i$ is the language fluency of the $i$-th first test question, and $\alpha$, $\beta$, and $\gamma$ are the weight coefficients for similarity, creativity, and language fluency respectively.
The final quality score is obtained by weighting the different evaluation indexes.
In this embodiment, after the first test questions are generated, an evaluation function is defined to assess their quality in terms of similarity, creativity, and language fluency:
Similarity: the relevance of the generated first test question to the seed questions and the prompt instruction P;
Creativity: whether the first test question has low similarity to the existing seed questions and can stimulate the reasoning ability of the large language model;
Language fluency: whether the generated first test question is well formed in grammar, spelling, and language structure.
In this embodiment: the similarity between the first test question and the seed question is calculated through a text similarity algorithm, and the calculation formula of the similarity is as follows:
Wherein, And/>Vectors of the first test topic and seed topic, respectively,/>Represents the vector dot product of the first test question and the seed question, and II A II and II B II represent the Euclidean norms (i.e., the lengths of the vectors) of the first test question and the seed question, respectively.
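A minimal sketch of this computation, assuming the two questions have already been embedded as numeric vectors (the embedding step itself is not specified by the formula):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Sim(A, B) = (A . B) / (||A|| * ||B||) for two embedding vectors."""
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    if denom == 0.0:
        return 0.0  # guard against degenerate zero vectors
    return float(np.dot(a, b)) / denom
```

The same computation is reused below when scoring model replies against reference answers.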
The creativity of the first test questions is calculated by evaluating the diversity of the generated questions:

$H = -\sum_{i=1}^{n} p_i \log p_i$

where $p_i$ is the proportion of questions in the $i$-th category (e.g., knowledge point or question type), $n$ is the total number of categories, and $H$ is the diversity index. The higher the diversity index, the better the creativity of the first test questions.
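A short sketch of this diversity index over question-category labels; the natural logarithm is an assumption, since the source does not state the base:

```python
import math
from collections import Counter

def diversity_index(categories: list[str]) -> float:
    """Shannon diversity H = -sum(p_i * log p_i) over question categories."""
    total = len(categories)
    counts = Counter(categories)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())

# Questions spread over three knowledge points score higher than
# questions drawn from a single one:
print(diversity_index(["algebra", "geometry", "history"]))  # ~1.099
print(diversity_index(["algebra", "algebra", "algebra"]))   # 0.0
```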
The language fluency of the first test question is calculated with a readability formula:

$Flu_i = 206.835 - 1.015 \times ASL - 84.6 \times ASW$

where $Flu_i$ is the language fluency of the first test question, ASL (Average Sentence Length) is the average sentence length, i.e. the total number of words divided by SENTENCES, the number of sentences in the text, and ASW (Average Syllables per Word) is the average number of syllables per word. The higher the score, the better the readability of the text.
If the quality of the generated first test questions does not meet the expected requirement, the first test questions are adjusted and then re-scored until the expected requirement is met, as sketched below.
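Putting the three metrics together, a sketch of the first-formula quality score under the reconstruction above; the readability function is the Flesch reading-ease form, and the syllable counter and default weights are illustrative assumptions:

```python
def flesch_reading_ease(text: str) -> float:
    """Flu = 206.835 - 1.015 * ASL - 84.6 * ASW (reading-ease form)."""
    sentences = max(1, text.count(".") + text.count("?") + text.count("!"))
    words = text.split() or [""]
    def syllables(word: str) -> int:
        # Crude proxy: count groups of consecutive vowels (illustrative only).
        groups, prev = 0, False
        for ch in word.lower():
            is_vowel = ch in "aeiouy"
            groups += int(is_vowel and not prev)
            prev = is_vowel
        return max(1, groups)
    asl = len(words) / sentences
    asw = sum(syllables(w) for w in words) / len(words)
    return 206.835 - 1.015 * asl - 84.6 * asw

def quality_score(sim: float, cre: float, flu: float,
                  alpha: float = 0.4, beta: float = 0.3,
                  gamma: float = 0.3) -> float:
    """Q_i = alpha * Sim_i + beta * Cre_i + gamma * Flu_i (assumed weights)."""
    return alpha * sim + beta * cre + gamma * flu
```

In practice each component would be normalized to a common scale, e.g. mapping the 0-100 readability score into [0, 1], before the weighted sum is taken.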
S102: the difficulty of the plurality of first test questions is determined for the first reply information of the plurality of first test questions based on the target model.
The correctness of the target model's replies is determined according to the first reply information of the plurality of first test questions and the corresponding second reply information;
the difficulty of the plurality of first test questions is then determined based on the correctness;
the first reply information is the current reply information for the plurality of first test questions, and the second reply information is the correct reply information.
In this embodiment, the reply information is the answer given for the plurality of first test questions: the first reply information is the current answer given by the target model, and the second reply information is the correct answer corresponding to the plurality of first test questions.
The correctness of the target model's reply, determined from the first reply information of the plurality of first test questions and the corresponding second reply information, is calculated as:

$Acc_i = \frac{R_i \cdot A_i}{\|R_i\|\,\|A_i\|}$

where $Acc_i$ is the correctness of the target model on the $i$-th first test question, $R_i$ is the vector of the target model's first reply information for the $i$-th first test question, $A_i$ is the vector of the corresponding second reply information, and $\|R_i\|$ and $\|A_i\|$ are the lengths of the two vectors.
In practical application, multiple models may be evaluated simultaneously. Based on the list of first test questions $Q = \{q_1, q_2, \ldots, q_n\}$ and a set of large language models $M = \{M_1, M_2, \ldots, M_k\}$, with the configuration parameters of each model denoted $\theta_m$, an answer generation function is defined for each model's reply information:

$R_{m,i} = f_m(q_i; \theta_m)$

where $R_{m,i}$ is the $m$-th model's reply to question $q_i$.
The answers generated by all models are then collected and integrated:

$Responses = \{R_{m,i} \mid m = 1, \ldots, k;\ i = 1, \ldots, n\}$

where $Responses$ is the set of all answers of the $k$ models to the question set.
For the $i$-th first test question, the similarity between the first reply information generated by each model and the second reply information is first calculated with cosine similarity:

$Sim_{m,i} = \frac{R_{m,i} \cdot A_i}{\|R_{m,i}\|\,\|A_i\|}$

where $Sim_{m,i}$ is the similarity between the $m$-th model's first reply information for the $i$-th first test question and the corresponding second reply information.
The similarity score is then normalized to between 0 and 1 to obtain a correctness score for each model's answer, for example

$Score_{m,i} = \frac{Sim_{m,i} + 1}{2}$

where $Score_{m,i}$ is the correctness score of the $m$-th model's answer to the $i$-th question in the question list $Q$.
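A compact sketch of this multi-model scoring step; `embed` is a hypothetical sentence-embedding function, and the (Sim + 1) / 2 mapping is one simple way to bring cosine similarity from [-1, 1] into [0, 1]:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding function; substitute any sentence encoder."""
    raise NotImplementedError

def correctness_scores(replies: dict[str, list[str]],
                       answers: list[str]) -> dict[str, list[float]]:
    """Score each model's first replies against the second (reference) replies.

    replies: model name -> one reply string per question
    answers: one reference answer string per question
    """
    ref_vecs = [embed(a) for a in answers]
    scores: dict[str, list[float]] = {}
    for model, outs in replies.items():
        row = []
        for out, ref in zip(outs, ref_vecs):
            vec = embed(out)
            sim = float(np.dot(vec, ref) /
                        (np.linalg.norm(vec) * np.linalg.norm(ref)))
            row.append((sim + 1.0) / 2.0)  # normalize to [0, 1]
        scores[model] = row
    return scores
```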
S103: if the difficulty of the plurality of first test questions does not accord with the preset difficulty, the plurality of first test questions are adjusted, and the step of determining the difficulty of the plurality of first test questions based on the reply information of the target model for the plurality of first test questions is carried out.
In this embodiment, a dynamic adjustment policy is defined, and when the difficulty of the plurality of first test questions does not conform to the preset difficulty, the policy can adjust the difficulty of generating the questions according to the performance differences and feedback data of different models. For example, if a model answers poorly, the difficulty of adjusting the topic may be reduced to improve the performance of the model.
In this embodiment, if the difficulty of the plurality of first test questions does not meet the preset difficulty, the difficulty of the plurality of first test questions is adjusted according to a difficulty adjustment factor, calculated as:

$\Delta_i = \eta \cdot \sigma\!\left(k\,(\tau_i - \overline{Acc}_i)\right)$

where $\Delta_i$ is the difficulty adjustment factor for the $i$-th first test question, $\eta$ is an adjustment parameter, $\sigma$ is a smoothing function (the sigmoid), $\tau_i$ is the expected threshold for the $i$-th first test question, $\overline{Acc}_i$ is the average correctness on the first test questions, and $k$ is a smoothness parameter for the adjustment factor.
The sigmoid function maps $k(\tau_i - \overline{Acc}_i)$ to the $(0, 1)$ interval, so that the adjustment factor changes more smoothly as the average correctness approaches the expected threshold. If the average correctness $\overline{Acc}_i$ of the model is below the expected threshold $\tau_i$, the difference $\tau_i - \overline{Acc}_i$ is a positive number, indicating that the difficulty of the first test questions needs to be reduced.
If a model's average correctness $\overline{Acc}_i$ is above the expected threshold, the difficulty of the questions is raised to better challenge the model. Specifically, a parameter adjustment factor $\Delta'_i$ that increases the difficulty can be defined, for example:

$\Delta'_i = \eta \cdot \left(1 - \sigma\!\left(k\,(\overline{Acc}_i - \tau_i)\right)\right)$

If the average correctness of the model is above the expected threshold, $\overline{Acc}_i - \tau_i$ is a positive number, indicating that the difficulty of the first test questions can be increased. As the correctness grows, $\sigma(k(\overline{Acc}_i - \tau_i))$ approaches 1, i.e., the magnitude of the difficulty increase decreases with increasing accuracy.
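A minimal sketch of both adjustment factors as reconstructed above; the sigmoid form, the parameter names, and the default values are assumptions, since the original formula images are not recoverable:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def lower_difficulty_factor(avg_acc: float, threshold: float,
                            eta: float = 0.5, k: float = 10.0) -> float:
    """Delta_i: grows as average correctness falls below the expected
    threshold, signalling that question difficulty should be reduced."""
    return eta * sigmoid(k * (threshold - avg_acc))

def raise_difficulty_factor(avg_acc: float, threshold: float,
                            eta: float = 0.5, k: float = 10.0) -> float:
    """Delta'_i: shrinks as correctness rises further above the threshold,
    so the difficulty increase tapers off for already-strong models."""
    return eta * (1.0 - sigmoid(k * (avg_acc - threshold)))
```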
Introducing the smoothing function ensures that the question difficulty is not adjusted too abruptly, providing the model with a more stable and progressive learning environment. This smooth adjustment strategy helps the model remain stable in the face of performance fluctuations and adapt better to different learning phases.
According to the adjustment strategy, the original prompt instruction P, the conditions C, and the parameter adjustment factor $\Delta_i$ are injected into the input text box of the large language model, which automatically returns an updated prompt instruction P and conditions C for regenerating the questions.
The large language model is then invoked again with the updated parameters to generate a plurality of first test questions meeting the new difficulty requirement.
S104: if the difficulty of the plurality of first test questions meets the preset difficulty, the plurality of first test questions are determined as the test questions for the target model.
In this embodiment, the large language model (GPT-4) regenerates the plurality of first test questions with the updated parameters to meet the new difficulty requirement. This is iterated multiple times, and when the difficulty of the plurality of first test questions meets the preset difficulty, the final plurality of first test questions is obtained.
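Pulling S101-S104 together, a sketch of the overall generate-evaluate-adjust loop; `generate_questions` and `average_correctness` are hypothetical wrappers around the LLM-backed steps above, the adjustment-factor helpers are reused from the previous sketch, and the starting difficulty and tolerance are assumptions:

```python
def generate_questions(seeds: list[str], instruction: str,
                       difficulty: float) -> list[str]:
    """Hypothetical wrapper around the S101 generation step."""
    raise NotImplementedError

def average_correctness(questions: list[str]) -> float:
    """Hypothetical wrapper around the S102 answer-scoring step."""
    raise NotImplementedError

def dynamic_question_generation(seeds: list[str], instruction: str,
                                expected_acc: float, max_iters: int = 10,
                                tol: float = 0.05) -> list[str]:
    """Iterate until the target model's average correctness on the
    generated questions is close enough to the expected threshold."""
    difficulty = 0.5  # assumed starting difficulty
    questions: list[str] = []
    for _ in range(max_iters):
        questions = generate_questions(seeds, instruction, difficulty)  # S101
        avg_acc = average_correctness(questions)                        # S102
        if abs(avg_acc - expected_acc) <= tol:                          # S104
            break
        if avg_acc < expected_acc:   # too hard: reduce difficulty (S103)
            difficulty -= lower_difficulty_factor(avg_acc, expected_acc)
        else:                        # too easy: raise difficulty (S103)
            difficulty += raise_difficulty_factor(avg_acc, expected_acc)
        difficulty = min(1.0, max(0.0, difficulty))
    return questions
```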
As described above, the invention introduces objective evaluation criteria rather than relying on experts' subjective judgment as the evaluation basis. The evaluation criteria are generated automatically by the large language model, which eliminates differences of subjective opinion among experts about question difficulty and reduces the instability of evaluation results. This innovation addresses the subjectivity and inconsistency easily introduced in the prior art and makes model evaluation more objective.
The invention automatically generates evaluation criteria with the large language model, reducing the need for manual intervention. This improves the efficiency and scalability of the evaluation while ensuring the comprehensiveness and diversity of the criteria. Freeing experts from manually writing evaluation criteria helps reduce subjectivity in the evaluation process and improves the accuracy of the criteria.
The invention provides a method for adaptively adjusting question difficulty, dynamically adjusting the difficulty according to the performance of the large model. This makes the evaluation results more accurate and suitable for different tasks and fields. When facing continually emerging large models, the method can flexibly adjust question difficulty according to actual demand, improving the adaptability and practicality of the evaluation method.
In general, by introducing objective evaluation criteria, this method of automatically generating evaluation criteria and adaptively adjusting question difficulty effectively remedies the deficiencies of the prior art in subjectivity, inconsistency, accuracy, and adaptability.
Fig. 2 is another schematic flow chart of the method for dynamically generating evaluation questions according to an embodiment of the present disclosure. First, the initial conditions and requirements for question generation are determined, including the seed questions and prompts. Next, specific questions are generated with GPT-4. These generated questions are then provided to a plurality of large language models, and their answers are collected to evaluate each model's responsiveness to each question. By analyzing the answers of the open-source large language models, the difficulty level of each generated question can be determined. Finally, the difficulty of the generated questions is dynamically adjusted based on the questions generated by GPT-4 and the feedback from the open-source large language models, ensuring that the capabilities of the large language models under test are fully evaluated and challenged.
Corresponding to the method for dynamically generating evaluation questions in the above embodiment, fig. 3 is a block diagram of the evaluation question dynamic generation system provided by an embodiment of the present disclosure. For ease of illustration, only the portions relevant to the embodiments of the present disclosure are shown. Referring to fig. 3, the evaluation question dynamic generation system 20 includes: a question generation module 21, a question difficulty determination module 22, a question difficulty adjustment module 23, and a question determination module 24.
The question generation module 21 is configured to generate a plurality of first test questions based on seed questions and prompts;
the question difficulty determination module 22 is configured to determine the difficulty of the plurality of first test questions based on first reply information of a target model for the plurality of first test questions;
the question difficulty adjustment module 23 is configured to, if the difficulty of the plurality of first test questions does not meet the preset difficulty, adjust the plurality of first test questions and return to the step of determining the difficulty based on the reply information of the target model;
the question determination module 24 is configured to determine the plurality of first test questions as the test questions for the target model if the difficulty of the plurality of first test questions meets the preset difficulty.
In one embodiment of the present disclosure, the evaluation question dynamic generation system 20 further includes a quality scoring module configured to:
score the quality of the plurality of first test questions;
and adjust the plurality of first test questions when their quality scores are lower than a preset value.
In one embodiment of the present disclosure, the quality scoring module is specifically configured to score the quality of the plurality of first test questions by a first formula:

$Q_i = \alpha \cdot Sim_i + \beta \cdot Cre_i + \gamma \cdot Flu_i$

where $Q_i$ is the quality of the $i$-th first test question, $Sim_i$ is the similarity between the $i$-th first test question and the corresponding seed question, $Cre_i$ is the creativity, $Flu_i$ is the language fluency, and $\alpha$, $\beta$, and $\gamma$ are the corresponding weight coefficients for similarity, creativity, and language fluency.
In one embodiment of the present disclosure, the question difficulty determination module 22 is specifically configured to:
determine the correctness of the target model's reply according to the first reply information of the plurality of first test questions and the corresponding second reply information;
determine the difficulty of the plurality of first test questions based on the correctness;
wherein the first reply information is the current reply information for the plurality of first test questions, and the second reply information is the correct reply information.
In one embodiment of the present disclosure, the question difficulty determination module 22 is specifically configured to calculate the correctness of the target model's reply from the first reply information of the plurality of first test questions and the corresponding second reply information as:

$Acc_i = \frac{R_i \cdot A_i}{\|R_i\|\,\|A_i\|}$

where $Acc_i$ is the correctness of the target model on the $i$-th first test question, $R_i$ is the vector of the target model's first reply information for the $i$-th first test question, $A_i$ is the vector of the corresponding second reply information, and $\|R_i\|$ and $\|A_i\|$ are the lengths of the two vectors.
In one embodiment of the present disclosure, the question difficulty adjustment module 23 is specifically configured to:
adjust the difficulty of the plurality of first test questions according to the difficulty adjustment factors of the plurality of first test questions if the difficulty of the plurality of first test questions does not meet the preset difficulty.
In one embodiment of the present disclosure, the calculation formula of the difficulty adjustment factor used by the question difficulty adjustment module 23 is:

$\Delta_i = \eta \cdot \sigma\!\left(k\,(\tau_i - \overline{Acc}_i)\right)$

where $\Delta_i$ is the difficulty adjustment factor for the $i$-th first test question, $\eta$ is an adjustment parameter, $\sigma$ is a smoothing function (the sigmoid), $\tau_i$ is the expected threshold for the $i$-th first test question, $\overline{Acc}_i$ is the average correctness on the first test questions, and $k$ is a smoothness parameter for the adjustment factor.
Referring to fig. 4, fig. 4 is a schematic block diagram of an electronic device according to an embodiment of the disclosure. The electronic device 300 in the present embodiment as shown in fig. 4 may include: one or more processors 301, one or more input devices 302, one or more output devices 303, and one or more memories 304. The processor 301, the input device 302, the output device 303, and the memory 304 communicate with each other via a communication bus 305. The memory 304 is used to store a computer program comprising program instructions. The processor 301 is configured to execute program instructions stored in the memory 304. Wherein the processor 301 is configured to invoke program instructions to perform the functions of the systems of the system embodiments described above, such as the functions of the modules 21 to 24 shown in fig. 3.
It should be appreciated that in the disclosed embodiments, the processor 301 may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The input device 302 may include a touch pad, a fingerprint sensor (for collecting fingerprint information of a user and direction information of a fingerprint), a microphone, etc., and the output device 303 may include a display (LCD, etc.), a speaker, etc.
The memory 304 may include read only memory and random access memory and provides instructions and data to the processor 301. A portion of memory 304 may also include non-volatile random access memory. For example, the memory 304 may also store information of device type.
In a specific implementation, the processor 301, the input device 302, and the output device 303 described in the embodiments of the present disclosure may perform the implementation manners described in the first embodiment and the second embodiment of the method for dynamically generating an evaluation question provided in the embodiments of the present disclosure, and may also perform the implementation manners of the electronic device described in the embodiments of the present disclosure, which are not described herein again.
In another embodiment of the disclosure, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, implement all or part of the procedures in the method embodiments described above. The procedures may also be implemented by a computer program instructing related hardware; the computer program may be stored in a computer-readable storage medium, and when executed by the processor, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
The computer-readable storage medium may be an internal storage unit of the electronic device of any of the foregoing embodiments, such as a hard disk or a memory of the electronic device. The computer-readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the electronic device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the electronic device. The computer-readable storage medium is used to store the computer program and other programs and data required by the electronic device, and may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the electronic device and unit described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed electronic device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces or units, or may be an electrical, mechanical, or other form of connection.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the embodiments of the present disclosure.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a specific embodiment of the present disclosure, but the protection scope of the present disclosure is not limited thereto, and any equivalent modifications or substitutions will be apparent to those skilled in the art within the scope of the present disclosure, and these modifications or substitutions should be covered in the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (10)

1. A method for dynamically generating evaluation questions, characterized by comprising:
generating a plurality of first test questions based on seed questions and prompts;
determining the difficulty of the plurality of first test questions based on first reply information of a target model for the plurality of first test questions;
if the difficulty of the plurality of first test questions does not meet a preset difficulty, adjusting the plurality of first test questions and returning to the step of determining the difficulty of the plurality of first test questions based on the reply information of the target model for the plurality of first test questions;
and if the difficulty of the plurality of first test questions meets the preset difficulty, determining the plurality of first test questions as the test questions for the target model.
2. The method for dynamically generating evaluation questions according to claim 1, further comprising:
scoring the quality of the plurality of first test questions;
and adjusting the plurality of first test questions when their quality scores are lower than a preset value.
3. The method for dynamically generating evaluation questions according to claim 2, wherein the quality of the plurality of first test questions is scored by a first formula:

$Q_i = \alpha \cdot Sim_i + \beta \cdot Cre_i + \gamma \cdot Flu_i$

where $Q_i$ is the quality of the $i$-th first test question, $Sim_i$ is the similarity between the $i$-th first test question and the corresponding seed question, $Cre_i$ is the creativity of the $i$-th first test question, $Flu_i$ is the language fluency of the $i$-th first test question, and $\alpha$, $\beta$, and $\gamma$ are the corresponding weight coefficients for similarity, creativity, and language fluency.
4. The method of claim 1, wherein determining the difficulty of the plurality of first test questions based on the reply information of the target model comprises:
determining the correctness of the target model's reply according to the first reply information of the plurality of first test questions and the corresponding second reply information;
determining the difficulty of the plurality of first test questions based on the correctness;
wherein the first reply information is the current reply information for the plurality of first test questions, and the second reply information is the correct reply information.
5. The method for dynamically generating evaluation questions according to claim 4, wherein the correctness of the target model's reply, determined from the first reply information of the plurality of first test questions and the corresponding second reply information, is calculated as:

$Acc_i = \frac{R_i \cdot A_i}{\|R_i\|\,\|A_i\|}$

where $Acc_i$ is the correctness of the target model on the $i$-th first test question, $R_i$ is the vector of the target model's first reply information for the $i$-th first test question, $A_i$ is the vector of the corresponding second reply information, and $\|R_i\|$ and $\|A_i\|$ are the lengths of the two vectors.
6. The method for dynamically generating evaluation questions according to claim 1, wherein adjusting the plurality of first test questions if their difficulty does not meet the preset difficulty comprises:
adjusting the difficulty of the plurality of first test questions according to the difficulty adjustment factors of the plurality of first test questions if the difficulty of the plurality of first test questions does not meet the preset difficulty.
7. The method for dynamically generating evaluation questions according to claim 6, wherein the difficulty adjustment factor is calculated as:

$\Delta_i = \eta \cdot \sigma\!\left(k\,(\tau_i - \overline{Acc}_i)\right)$

where $\Delta_i$ is the difficulty adjustment factor for the $i$-th first test question, $\eta$ is an adjustment parameter, $\sigma$ is a smoothing function (the sigmoid), $\tau_i$ is the expected threshold for the $i$-th first test question, $\overline{Acc}_i$ is the average correctness on the first test questions, and $k$ is a smoothness parameter for the adjustment factor.
8. A system for dynamically generating evaluation questions, comprising:
a question generation module, configured to generate a plurality of first test questions based on seed questions and prompts;
a question difficulty determination module, configured to determine the difficulty of the plurality of first test questions based on first reply information of a target model for the plurality of first test questions;
a question difficulty adjustment module, configured to, if the difficulty of the plurality of first test questions does not meet a preset difficulty, adjust the plurality of first test questions and return to the step of determining the difficulty of the plurality of first test questions based on the reply information of the target model;
and a question determination module, configured to determine the plurality of first test questions as the test questions for the target model if the difficulty of the plurality of first test questions meets the preset difficulty.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.
CN202410381770.9A 2024-04-01 2024-04-01 Evaluation item dynamic generation method and system, electronic equipment and readable storage medium Active CN117993366B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410381770.9A CN117993366B (en) 2024-04-01 2024-04-01 Evaluation item dynamic generation method and system, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410381770.9A CN117993366B (en) 2024-04-01 2024-04-01 Evaluation item dynamic generation method and system, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN117993366A 2024-05-07
CN117993366B 2024-06-21

Family

ID=90892663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410381770.9A Active CN117993366B (en) 2024-04-01 2024-04-01 Evaluation item dynamic generation method and system, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN117993366B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105989555A (en) * 2015-03-05 2016-10-05 上海汉声信息技术有限公司 Language competence test method and system
CN110489454A (en) * 2019-07-29 2019-11-22 北京大米科技有限公司 A kind of adaptive assessment method, device, storage medium and electronic equipment
CN113673702A (en) * 2021-07-27 2021-11-19 北京师范大学 Method and device for evaluating pre-training language model and storage medium
US20220309087A1 (en) * 2021-03-29 2022-09-29 Google Llc Systems and methods for training language models to reason over tables
US20230068338A1 (en) * 2021-08-31 2023-03-02 Accenture Global Solutions Limited Virtual agent conducting interactive testing
CN117290694A (en) * 2023-11-24 2023-12-26 北京并行科技股份有限公司 Question-answering system evaluation method, device, computing equipment and storage medium
CN117493830A (en) * 2023-11-16 2024-02-02 郑州阿帕斯数云信息科技有限公司 Evaluation of training data quality, and generation method, device and equipment of evaluation model


Also Published As

Publication number Publication date
CN117993366B (en) 2024-06-21

Similar Documents

Publication Publication Date Title
US9652999B2 (en) Computer-implemented systems and methods for estimating word accuracy for automatic speech recognition
CN115358897B (en) Student management method, system, terminal and storage medium based on electronic student identity card
CN111653274A (en) Method, device and storage medium for awakening word recognition
CN117744753A (en) Method, device, equipment and medium for determining prompt word of large language model
CN117808946B (en) Method and system for constructing secondary roles based on large language model
CN117493830A (en) Evaluation of training data quality, and generation method, device and equipment of evaluation model
CN116663679A (en) Language model training method, device, equipment and storage medium
CN117539977A (en) Training method and device for language model
CN117993366B (en) Evaluation item dynamic generation method and system, electronic equipment and readable storage medium
CN115116474A (en) Spoken language scoring model training method, scoring method, device and electronic equipment
CN113094404B (en) Big data acquisition multi-core parameter self-adaptive time-sharing memory driving method and system
CN115392769A (en) Evaluation model training method, performance evaluation method and device
CN111581911B (en) Method for automatically adding punctuation to real-time text, model construction method and device
CN112767932A (en) Voice evaluation system, method, device, equipment and computer readable storage medium
CN112163975A (en) Intelligent learning guiding and prompting method and system
Liu et al. Deep learning scoring model in the evaluation of oral English teaching
CN112131889A (en) Intelligent Chinese subjective question scoring method and system based on big data
Meng et al. Nonlinear network speech recognition structure in a deep learning algorithm
CN118467709B (en) Evaluation method, device, medium and computer program product for visual question-answering task
US20220398496A1 (en) Learning effect estimation apparatus, learning effect estimation method, and program
CN115083437A (en) Method and device for determining uncertainty of learner pronunciation
CN118485553A (en) Examination evaluation method, system, medium, electronic equipment and product
CN113157713A (en) Topic grouping updating method and device, computer equipment and storage medium
CN118467709A (en) Evaluation method, device, medium and computer program product for visual question-answering task
Chen Entertainment social media based on deep learning and interactive experience application in English e-learning teaching system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant