CN117993366A - Evaluation item dynamic generation method and system, electronic equipment and readable storage medium - Google Patents

Evaluation item dynamic generation method and system, electronic equipment and readable storage medium

Info

Publication number
CN117993366A
CN117993366A (application CN202410381770.9A)
Authority
CN
China
Prior art keywords
test
difficulty
test questions
questions
reply information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410381770.9A
Other languages
Chinese (zh)
Other versions
CN117993366B (en)
Inventor
何召锋
尚余虎
程祥
项刘宇
吴惠甲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202410381770.9A
Publication of CN117993366A
Application granted
Publication of CN117993366B
Legal status: Active (granted)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The disclosure provides a method and system for dynamically generating evaluation questions, an electronic device, and a readable storage medium, belonging to the field of model evaluation. The method includes: generating a plurality of first test questions based on seed questions and prompts; determining the difficulty of the plurality of first test questions based on first reply information of a target model for the plurality of first test questions; if the difficulty of the plurality of first test questions does not meet a preset difficulty, adjusting the plurality of first test questions and returning to the step of determining their difficulty based on the reply information of the target model; and if the difficulty of the plurality of first test questions meets the preset difficulty, determining the plurality of first test questions as the test questions for the target model. The method and system, the electronic device, and the readable storage medium solve the problem that existing evaluation methods lack adaptivity.

Description

Evaluation item dynamic generation method and system, electronic equipment and readable storage medium
Technical Field
The disclosure belongs to the technical field of model evaluation, and more particularly relates to a method and system for dynamically generating evaluation questions, an electronic device, and a readable storage medium.
Background
The large language model (Large Language Model, LLM) is an advanced natural language processing technique that learns rich language knowledge and patterns by pre-training on large amounts of text data. Such models can generate fluent, coherent, and logical text, and can perform tasks such as question answering and sentiment analysis. With the rapid development of large language models, concerns are growing about the risks they may carry and their potential negative social impact, so comprehensive evaluation of these models is becoming increasingly important. However, existing evaluation methods often lack adaptivity: question difficulty cannot be dynamically adjusted to different application scenarios and user requirements, which limits evaluation accuracy in the face of continually emerging large models.
Disclosure of Invention
The disclosure aims to provide a method and system for dynamically generating evaluation questions, an electronic device, and a readable storage medium, so as to solve the problem that existing evaluation methods lack adaptivity.
In a first aspect of the embodiments of the present disclosure, a method for dynamically generating evaluation questions is provided, including:
generating a plurality of first test questions based on seed questions and prompts;
determining the difficulty of the plurality of first test questions based on first reply information of a target model for the plurality of first test questions;
if the difficulty of the plurality of first test questions does not meet a preset difficulty, adjusting the plurality of first test questions and returning to the step of determining the difficulty of the plurality of first test questions based on the reply information of the target model for the plurality of first test questions;
and if the difficulty of the plurality of first test questions meets the preset difficulty, determining the plurality of first test questions as the test questions for the target model.
In a second aspect of the embodiments of the present disclosure, a system for dynamically generating evaluation questions is provided, including:
a question generation module, configured to generate a plurality of first test questions based on seed questions and prompts;
a question difficulty determination module, configured to determine the difficulty of the plurality of first test questions based on first reply information of a target model for the plurality of first test questions;
a question difficulty adjustment module, configured to, if the difficulty of the plurality of first test questions does not meet a preset difficulty, adjust the plurality of first test questions and return to the step of determining the difficulty of the plurality of first test questions based on the reply information of the target model for the plurality of first test questions;
and a question determination module, configured to determine the plurality of first test questions as the test questions for the target model if the difficulty of the plurality of first test questions meets the preset difficulty.
In a third aspect of the embodiments of the present disclosure, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the steps of the above method for dynamically generating evaluation questions are implemented when the processor executes the computer program.
In a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, implements the steps of the above method for dynamically generating evaluation questions.
The method and system for dynamically generating evaluation questions, the electronic device, and the readable storage medium have the following beneficial effects: the invention provides a method for adaptively adjusting question difficulty, dynamically adjusting the difficulty according to the performance of the large model. This makes the evaluation results more accurate and suitable for different tasks and fields. When facing continually emerging large models, the method can flexibly adjust question difficulty according to actual demand, improving the adaptability and practicality of the evaluation method.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required for the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a flowchart of a method for dynamically generating an evaluation item according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of a method for dynamically generating evaluation questions according to an embodiment of the disclosure;
FIG. 3 is a block diagram of an evaluation question dynamic generation system according to an embodiment of the present disclosure;
Fig. 4 is a schematic block diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
For the purposes of promoting an understanding of the principles and advantages of the disclosure, reference will now be made to the embodiments illustrated in the drawings.
Referring to fig. 1, fig. 1 is a flowchart of a method for dynamically generating an evaluation item according to an embodiment of the disclosure, where the method includes:
s101: a plurality of first test topics is generated based on the seed topics and the hints.
In this embodiment, a plurality of first test questions are generated according to the seed questions and the conditions for prompting to generate the plurality of first test questions.
Seed topics refer to data samples used to initialize or train a model, which are called "seeds" because they are the basis for model learning and eventually grow into a complete model just as seeds germinate.
In this embodiment, first, a seed question set S is given, where S is a set of questions in fields such as mathematics, physics, chemistry, history, or programming. It can be expressed as:

$S = \{s_1, s_2, \ldots, s_n\}$

where $s_i$ is the $i$-th seed question in the set $S$ and $n$ is the total number of seed questions.
Then, a prompt instruction P for guiding test question generation and a seed question generation condition set C are given. The condition set C contains the requirements that the generated questions must satisfy, including but not limited to question type, difficulty level, and content field. The condition set can be expressed as:

$C = \{c_1, c_2, \ldots, c_n\}$

where $c_i$ is the $i$-th condition in the set $C$ and $n$ is the total number of conditions.
Finally, based on the given seed questions and the prompt instruction P, a plurality of test questions (first test questions) are generated using the advanced natural language generation capability of a large language model. In this embodiment, the advanced natural language generation capability of GPT-4 may be employed to generate the test questions.
In this process, the seed questions S and the prompt instruction P guiding question generation must first be converted into a format that GPT-4 can understand;
the seed questions, the prompt instruction P, and the conditions are then injected into the input text box of the large language model, which automatically generates the questions, as sketched below.
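To make this step concrete, here is a minimal sketch of assembling the seed questions, the prompt instruction P, and a condition from C into a single model input; the `Condition` fields and the `generate` client function are hypothetical stand-ins for whatever LLM API is actually used:

```python
from dataclasses import dataclass

@dataclass
class Condition:
    question_type: str   # e.g. "single choice"
    difficulty: float    # target difficulty in [0, 1]
    field: str           # e.g. "world history"

def build_prompt(seeds: list[str], instruction: str, cond: Condition) -> str:
    """Convert seed questions S, prompt instruction P, and a condition c_i
    into one text block the large language model can understand."""
    seed_block = "\n".join(f"- {s}" for s in seeds)
    return (
        f"{instruction}\n"
        f"Seed questions:\n{seed_block}\n"
        f"Constraints: type={cond.question_type}, "
        f"difficulty={cond.difficulty}, field={cond.field}\n"
        "Generate one new test question that satisfies the constraints."
    )

def generate(prompt: str) -> str:
    """Hypothetical LLM call; substitute the concrete GPT-4 client in use."""
    raise NotImplementedError
```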
One specific embodiment of the present disclosure is as follows:
After the prompt instruction P is input, a new history test question is generated from the seed questions in the seed question set S and the condition set C; the question type of the new test question is single choice, the difficulty level is 0.6, and the content field is world history.
Generated question: which explorer first reached the American continent in 1492?
A. Christopher Columbus
B. Ferdinand Magellan
C. Vasco da Gama
D. James Cook
High-quality test questions help us evaluate the performance and accuracy of a model more precisely. Through test questions, the defects of the model can be discovered, and repair and optimization can be performed in time. In addition, test questions help evaluate the reliability and scalability of the model and provide a basis for further improvement. If the quality of the test questions is poor, for example because of erroneous data, missing data, or deviations in certain features, the measured performance of the model may be degraded, and misleading results may even be obtained. Therefore, to ensure the performance and quality of a large language model, we need to pay attention to the quality of the test questions and take corresponding measures to improve it.
In this embodiment, the quality of the plurality of first test questions is scored by a first formula:

$Q_i = \alpha \cdot Sim_i + \beta \cdot Cre_i + \gamma \cdot Flu_i$

where $Q_i$ is the quality of the $i$-th first test question, $Sim_i$ is the similarity between the $i$-th first test question and the corresponding seed question, $Cre_i$ is the creativity of the $i$-th first test question, $Flu_i$ is the language fluency of the $i$-th first test question, and $\alpha$, $\beta$, and $\gamma$ are the weight coefficients for similarity, creativity, and language fluency respectively.
The final quality score is obtained by weighting the different evaluation indexes.
In this embodiment, after the first test questions are generated, an evaluation function is defined to assess their quality in terms of similarity, creativity, and language fluency:
Similarity: the relevance of the generated first test question to the seed questions and the prompt instruction P;
Creativity: whether the first test question has low similarity to the existing seed questions and can stimulate the reasoning ability of the large language model;
Language fluency: whether the generated first test question is well formed in grammar, spelling, and language structure.
In this embodiment: the similarity between the first test question and the seed question is calculated through a text similarity algorithm, and the calculation formula of the similarity is as follows:
Wherein, And/>Vectors of the first test topic and seed topic, respectively,/>Represents the vector dot product of the first test question and the seed question, and II A II and II B II represent the Euclidean norms (i.e., the lengths of the vectors) of the first test question and the seed question, respectively.
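A minimal sketch of this computation, assuming the two questions have already been embedded as numeric vectors (the embedding step itself is not specified by the formula):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Sim(A, B) = (A . B) / (||A|| * ||B||) for two embedding vectors."""
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    if denom == 0.0:
        return 0.0  # guard against degenerate zero vectors
    return float(np.dot(a, b)) / denom
```

The same computation is reused below when scoring model replies against reference answers.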
The creativity of the first test questions is calculated by evaluating the diversity of the generated questions:

$H = -\sum_{i=1}^{n} p_i \log p_i$

where $p_i$ is the proportion of questions in the $i$-th category (e.g., knowledge point or question type), $n$ is the total number of categories, and $H$ is the diversity index. The higher the diversity index, the better the creativity of the first test questions.
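A short sketch of this diversity index over question-category labels; the natural logarithm is an assumption, since the source does not state the base:

```python
import math
from collections import Counter

def diversity_index(categories: list[str]) -> float:
    """Shannon diversity H = -sum(p_i * log p_i) over question categories."""
    total = len(categories)
    counts = Counter(categories)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())

# Questions spread over three knowledge points score higher than
# questions drawn from a single one:
print(diversity_index(["algebra", "geometry", "history"]))  # ~1.099
print(diversity_index(["algebra", "algebra", "algebra"]))   # 0.0
```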
The language fluency of the first test question is calculated with a readability formula:

$Flu_i = 206.835 - 1.015 \times ASL - 84.6 \times ASW$

where $Flu_i$ is the language fluency of the first test question, ASL (Average Sentence Length) is the average sentence length, i.e. the total number of words divided by SENTENCES, the number of sentences in the text, and ASW (Average Syllables per Word) is the average number of syllables per word. The higher the score, the better the readability of the text.
If the quality of the generated first test questions does not meet the expected requirement, the first test questions are adjusted and then re-scored until the expected requirement is met, as sketched below.
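Putting the three metrics together, a sketch of the first-formula quality score under the reconstruction above; the readability function is the Flesch reading-ease form, and the syllable counter and default weights are illustrative assumptions:

```python
def flesch_reading_ease(text: str) -> float:
    """Flu = 206.835 - 1.015 * ASL - 84.6 * ASW (reading-ease form)."""
    sentences = max(1, text.count(".") + text.count("?") + text.count("!"))
    words = text.split() or [""]
    def syllables(word: str) -> int:
        # Crude proxy: count groups of consecutive vowels (illustrative only).
        groups, prev = 0, False
        for ch in word.lower():
            is_vowel = ch in "aeiouy"
            groups += int(is_vowel and not prev)
            prev = is_vowel
        return max(1, groups)
    asl = len(words) / sentences
    asw = sum(syllables(w) for w in words) / len(words)
    return 206.835 - 1.015 * asl - 84.6 * asw

def quality_score(sim: float, cre: float, flu: float,
                  alpha: float = 0.4, beta: float = 0.3,
                  gamma: float = 0.3) -> float:
    """Q_i = alpha * Sim_i + beta * Cre_i + gamma * Flu_i (assumed weights)."""
    return alpha * sim + beta * cre + gamma * flu
```

In practice each component would be normalized to a common scale, e.g. mapping the 0-100 readability score into [0, 1], before the weighted sum is taken.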
S102: the difficulty of the plurality of first test questions is determined for the first reply information of the plurality of first test questions based on the target model.
The correctness of the target model's replies is determined according to the first reply information of the plurality of first test questions and the corresponding second reply information;
the difficulty of the plurality of first test questions is then determined based on the correctness;
the first reply information is the current reply information for the plurality of first test questions, and the second reply information is the correct reply information.
In this embodiment, the reply information is the answer given for the plurality of first test questions: the first reply information is the current answer given by the target model, and the second reply information is the correct answer corresponding to the plurality of first test questions.
The correctness of the target model's reply, determined from the first reply information of the plurality of first test questions and the corresponding second reply information, is calculated as:

$Acc_i = \frac{R_i \cdot A_i}{\|R_i\|\,\|A_i\|}$

where $Acc_i$ is the correctness of the target model on the $i$-th first test question, $R_i$ is the vector of the target model's first reply information for the $i$-th first test question, $A_i$ is the vector of the corresponding second reply information, and $\|R_i\|$ and $\|A_i\|$ are the lengths of the two vectors.
In practical application, multiple models may be evaluated simultaneously. Based on the list of first test questions $Q = \{q_1, q_2, \ldots, q_n\}$ and a set of large language models $M = \{M_1, M_2, \ldots, M_k\}$, with the configuration parameters of each model denoted $\theta_m$, an answer generation function is defined for each model's reply information:

$R_{m,i} = f_m(q_i; \theta_m)$

where $R_{m,i}$ is the $m$-th model's reply to question $q_i$.
The answers generated by all models are then collected and integrated:

$Responses = \{R_{m,i} \mid m = 1, \ldots, k;\ i = 1, \ldots, n\}$

where $Responses$ is the set of all answers of the $k$ models to the question set.
For the $i$-th first test question, the similarity between the first reply information generated by each model and the second reply information is first calculated with cosine similarity:

$Sim_{m,i} = \frac{R_{m,i} \cdot A_i}{\|R_{m,i}\|\,\|A_i\|}$

where $Sim_{m,i}$ is the similarity between the $m$-th model's first reply information for the $i$-th first test question and the corresponding second reply information.
The similarity score is then normalized to between 0 and 1 to obtain a correctness score for each model's answer, for example

$Score_{m,i} = \frac{Sim_{m,i} + 1}{2}$

where $Score_{m,i}$ is the correctness score of the $m$-th model's answer to the $i$-th question in the question list $Q$.
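A compact sketch of this multi-model scoring step; `embed` is a hypothetical sentence-embedding function, and the (Sim + 1) / 2 mapping is one simple way to bring cosine similarity from [-1, 1] into [0, 1]:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding function; substitute any sentence encoder."""
    raise NotImplementedError

def correctness_scores(replies: dict[str, list[str]],
                       answers: list[str]) -> dict[str, list[float]]:
    """Score each model's first replies against the second (reference) replies.

    replies: model name -> one reply string per question
    answers: one reference answer string per question
    """
    ref_vecs = [embed(a) for a in answers]
    scores: dict[str, list[float]] = {}
    for model, outs in replies.items():
        row = []
        for out, ref in zip(outs, ref_vecs):
            vec = embed(out)
            sim = float(np.dot(vec, ref) /
                        (np.linalg.norm(vec) * np.linalg.norm(ref)))
            row.append((sim + 1.0) / 2.0)  # normalize to [0, 1]
        scores[model] = row
    return scores
```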
S103: if the difficulty of the plurality of first test questions does not accord with the preset difficulty, the plurality of first test questions are adjusted, and the step of determining the difficulty of the plurality of first test questions based on the reply information of the target model for the plurality of first test questions is carried out.
In this embodiment, a dynamic adjustment policy is defined, and when the difficulty of the plurality of first test questions does not conform to the preset difficulty, the policy can adjust the difficulty of generating the questions according to the performance differences and feedback data of different models. For example, if a model answers poorly, the difficulty of adjusting the topic may be reduced to improve the performance of the model.
In this embodiment, if the difficulty of the plurality of first test questions does not meet the preset difficulty, the difficulty of the plurality of first test questions is adjusted according to a difficulty adjustment factor, calculated as:

$\Delta_i = \eta \cdot \sigma\!\left(k\,(\tau_i - \overline{Acc}_i)\right)$

where $\Delta_i$ is the difficulty adjustment factor for the $i$-th first test question, $\eta$ is an adjustment parameter, $\sigma$ is a smoothing function (the sigmoid), $\tau_i$ is the expected threshold for the $i$-th first test question, $\overline{Acc}_i$ is the average correctness on the first test questions, and $k$ is a smoothness parameter for the adjustment factor.
The sigmoid function maps $k(\tau_i - \overline{Acc}_i)$ to the $(0, 1)$ interval, so that the adjustment factor changes more smoothly as the average correctness approaches the expected threshold. If the average correctness $\overline{Acc}_i$ of the model is below the expected threshold $\tau_i$, the difference $\tau_i - \overline{Acc}_i$ is a positive number, indicating that the difficulty of the first test questions needs to be reduced.
If a model's average correctness $\overline{Acc}_i$ is above the expected threshold, the difficulty of the questions is raised to better challenge the model. Specifically, a parameter adjustment factor $\Delta'_i$ that increases the difficulty can be defined, for example:

$\Delta'_i = \eta \cdot \left(1 - \sigma\!\left(k\,(\overline{Acc}_i - \tau_i)\right)\right)$

If the average correctness of the model is above the expected threshold, $\overline{Acc}_i - \tau_i$ is a positive number, indicating that the difficulty of the first test questions can be increased. As the correctness grows, $\sigma(k(\overline{Acc}_i - \tau_i))$ approaches 1, i.e., the magnitude of the difficulty increase decreases with increasing accuracy.
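A minimal sketch of both adjustment factors as reconstructed above; the sigmoid form, the parameter names, and the default values are assumptions, since the original formula images are not recoverable:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def lower_difficulty_factor(avg_acc: float, threshold: float,
                            eta: float = 0.5, k: float = 10.0) -> float:
    """Delta_i: grows as average correctness falls below the expected
    threshold, signalling that question difficulty should be reduced."""
    return eta * sigmoid(k * (threshold - avg_acc))

def raise_difficulty_factor(avg_acc: float, threshold: float,
                            eta: float = 0.5, k: float = 10.0) -> float:
    """Delta'_i: shrinks as correctness rises further above the threshold,
    so the difficulty increase tapers off for already-strong models."""
    return eta * (1.0 - sigmoid(k * (avg_acc - threshold)))
```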
Introducing the smoothing function ensures that the question difficulty is not adjusted too abruptly, providing the model with a more stable and progressive learning environment. This smooth adjustment strategy helps the model remain stable in the face of performance fluctuations and adapt better to different learning phases.
According to the adjustment strategy, the original prompt instruction P, the conditions C, and the parameter adjustment factor $\Delta_i$ are injected into the input text box of the large language model, which automatically returns an updated prompt instruction P and conditions C for regenerating the questions.
The large language model is then invoked again with the updated parameters to generate a plurality of first test questions meeting the new difficulty requirement.
S104: if the difficulty of the plurality of first test questions meets the preset difficulty, the plurality of first test questions are determined as the test questions for the target model.
In this embodiment, the large language model (GPT-4) regenerates the plurality of first test questions with the updated parameters to meet the new difficulty requirement. This is iterated multiple times, and when the difficulty of the plurality of first test questions meets the preset difficulty, the final plurality of first test questions is obtained.
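Pulling S101-S104 together, a sketch of the overall generate-evaluate-adjust loop; `generate_questions` and `average_correctness` are hypothetical wrappers around the LLM-backed steps above, the adjustment-factor helpers are reused from the previous sketch, and the starting difficulty and tolerance are assumptions:

```python
def generate_questions(seeds: list[str], instruction: str,
                       difficulty: float) -> list[str]:
    """Hypothetical wrapper around the S101 generation step."""
    raise NotImplementedError

def average_correctness(questions: list[str]) -> float:
    """Hypothetical wrapper around the S102 answer-scoring step."""
    raise NotImplementedError

def dynamic_question_generation(seeds: list[str], instruction: str,
                                expected_acc: float, max_iters: int = 10,
                                tol: float = 0.05) -> list[str]:
    """Iterate until the target model's average correctness on the
    generated questions is close enough to the expected threshold."""
    difficulty = 0.5  # assumed starting difficulty
    questions: list[str] = []
    for _ in range(max_iters):
        questions = generate_questions(seeds, instruction, difficulty)  # S101
        avg_acc = average_correctness(questions)                        # S102
        if abs(avg_acc - expected_acc) <= tol:                          # S104
            break
        if avg_acc < expected_acc:   # too hard: reduce difficulty (S103)
            difficulty -= lower_difficulty_factor(avg_acc, expected_acc)
        else:                        # too easy: raise difficulty (S103)
            difficulty += raise_difficulty_factor(avg_acc, expected_acc)
        difficulty = min(1.0, max(0.0, difficulty))
    return questions
```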
As described above, the invention introduces objective evaluation criteria rather than relying on experts' subjective judgment as the evaluation basis. The evaluation criteria are generated automatically by the large language model, which eliminates differences of subjective opinion among experts about question difficulty and reduces the instability of evaluation results. This innovation addresses the subjectivity and inconsistency easily introduced in the prior art and makes model evaluation more objective.
The invention automatically generates evaluation criteria with the large language model, reducing the need for manual intervention. This improves the efficiency and scalability of the evaluation while ensuring the comprehensiveness and diversity of the criteria. Freeing experts from manually writing evaluation criteria helps reduce subjectivity in the evaluation process and improves the accuracy of the criteria.
The invention provides a method for adaptively adjusting question difficulty, dynamically adjusting the difficulty according to the performance of the large model. This makes the evaluation results more accurate and suitable for different tasks and fields. When facing continually emerging large models, the method can flexibly adjust question difficulty according to actual demand, improving the adaptability and practicality of the evaluation method.
In general, by introducing objective evaluation criteria, this method of automatically generating evaluation criteria and adaptively adjusting question difficulty effectively remedies the deficiencies of the prior art in subjectivity, inconsistency, accuracy, and adaptability.
Fig. 2 is another schematic flow chart of the method for dynamically generating evaluation questions according to an embodiment of the present disclosure. First, the initial conditions and requirements for question generation are determined, including the seed questions and prompts. Next, specific questions are generated with GPT-4. These generated questions are then provided to a plurality of large language models, and their answers are collected to evaluate each model's responsiveness to each question. By analyzing the answers of the open-source large language models, the difficulty level of each generated question can be determined. Finally, the difficulty of the generated questions is dynamically adjusted based on the questions generated by GPT-4 and the feedback from the open-source large language models, ensuring that the capabilities of the large language models under test are fully evaluated and challenged.
Corresponding to the method for dynamically generating evaluation questions in the above embodiment, fig. 3 is a block diagram of the evaluation question dynamic generation system provided by an embodiment of the present disclosure. For ease of illustration, only the portions relevant to the embodiments of the present disclosure are shown. Referring to fig. 3, the evaluation question dynamic generation system 20 includes: a question generation module 21, a question difficulty determination module 22, a question difficulty adjustment module 23, and a question determination module 24.
The question generation module 21 is configured to generate a plurality of first test questions based on seed questions and prompts;
the question difficulty determination module 22 is configured to determine the difficulty of the plurality of first test questions based on first reply information of a target model for the plurality of first test questions;
the question difficulty adjustment module 23 is configured to, if the difficulty of the plurality of first test questions does not meet the preset difficulty, adjust the plurality of first test questions and return to the step of determining the difficulty based on the reply information of the target model;
the question determination module 24 is configured to determine the plurality of first test questions as the test questions for the target model if the difficulty of the plurality of first test questions meets the preset difficulty.
In one embodiment of the present disclosure, the evaluation question dynamic generation system 20 further includes a quality scoring module configured to:
score the quality of the plurality of first test questions;
and adjust the plurality of first test questions when their quality scores are lower than a preset value.
In one embodiment of the present disclosure, the quality scoring module is specifically configured to score the quality of the plurality of first test questions by a first formula:

$Q_i = \alpha \cdot Sim_i + \beta \cdot Cre_i + \gamma \cdot Flu_i$

where $Q_i$ is the quality of the $i$-th first test question, $Sim_i$ is the similarity between the $i$-th first test question and the corresponding seed question, $Cre_i$ is the creativity, $Flu_i$ is the language fluency, and $\alpha$, $\beta$, and $\gamma$ are the corresponding weight coefficients for similarity, creativity, and language fluency.
In one embodiment of the present disclosure, the question difficulty determination module 22 is specifically configured to:
determine the correctness of the target model's reply according to the first reply information of the plurality of first test questions and the corresponding second reply information;
determine the difficulty of the plurality of first test questions based on the correctness;
wherein the first reply information is the current reply information for the plurality of first test questions, and the second reply information is the correct reply information.
In one embodiment of the present disclosure, the question difficulty determination module 22 is specifically configured to calculate the correctness of the target model's reply from the first reply information of the plurality of first test questions and the corresponding second reply information as:

$Acc_i = \frac{R_i \cdot A_i}{\|R_i\|\,\|A_i\|}$

where $Acc_i$ is the correctness of the target model on the $i$-th first test question, $R_i$ is the vector of the target model's first reply information for the $i$-th first test question, $A_i$ is the vector of the corresponding second reply information, and $\|R_i\|$ and $\|A_i\|$ are the lengths of the two vectors.
In one embodiment of the present disclosure, the question difficulty adjustment module 23 is specifically configured to:
adjust the difficulty of the plurality of first test questions according to the difficulty adjustment factors of the plurality of first test questions if the difficulty of the plurality of first test questions does not meet the preset difficulty.
In one embodiment of the present disclosure, the calculation formula of the difficulty adjustment factor used by the question difficulty adjustment module 23 is:

$\Delta_i = \eta \cdot \sigma\!\left(k\,(\tau_i - \overline{Acc}_i)\right)$

where $\Delta_i$ is the difficulty adjustment factor for the $i$-th first test question, $\eta$ is an adjustment parameter, $\sigma$ is a smoothing function (the sigmoid), $\tau_i$ is the expected threshold for the $i$-th first test question, $\overline{Acc}_i$ is the average correctness on the first test questions, and $k$ is a smoothness parameter for the adjustment factor.
Referring to fig. 4, fig. 4 is a schematic block diagram of an electronic device according to an embodiment of the disclosure. The electronic device 300 in the present embodiment as shown in fig. 4 may include: one or more processors 301, one or more input devices 302, one or more output devices 303, and one or more memories 304. The processor 301, the input device 302, the output device 303, and the memory 304 communicate with each other via a communication bus 305. The memory 304 is used to store a computer program comprising program instructions. The processor 301 is configured to execute program instructions stored in the memory 304. Wherein the processor 301 is configured to invoke program instructions to perform the functions of the systems of the system embodiments described above, such as the functions of the modules 21 to 24 shown in fig. 3.
It should be appreciated that in the disclosed embodiments, the processor 301 may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The input device 302 may include a touch pad, a fingerprint sensor (for collecting fingerprint information of a user and direction information of a fingerprint), a microphone, etc., and the output device 303 may include a display (LCD, etc.), a speaker, etc.
The memory 304 may include read only memory and random access memory and provides instructions and data to the processor 301. A portion of memory 304 may also include non-volatile random access memory. For example, the memory 304 may also store information of device type.
In a specific implementation, the processor 301, the input device 302, and the output device 303 described in the embodiments of the present disclosure may perform the implementation manners described in the first embodiment and the second embodiment of the method for dynamically generating an evaluation question provided in the embodiments of the present disclosure, and may also perform the implementation manners of the electronic device described in the embodiments of the present disclosure, which are not described herein again.
In another embodiment of the disclosure, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, implement all or part of the procedures in the method embodiments described above. The procedures may also be implemented by a computer program instructing related hardware; the computer program may be stored in a computer-readable storage medium, and when executed by the processor, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
The computer-readable storage medium may be an internal storage unit of the electronic device of any of the foregoing embodiments, such as a hard disk or a memory of the electronic device. The computer-readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the electronic device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the electronic device. The computer-readable storage medium is used to store the computer program and other programs and data required by the electronic device, and may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the electronic device and unit described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed electronic device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces or units, or may be an electrical, mechanical, or other form of connection.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the embodiments of the present disclosure.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a specific embodiment of the present disclosure, but the protection scope of the present disclosure is not limited thereto, and any equivalent modifications or substitutions will be apparent to those skilled in the art within the scope of the present disclosure, and these modifications or substitutions should be covered in the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (10)

1. A method for dynamically generating evaluation questions, characterized by comprising:
generating a plurality of first test questions based on seed questions and prompts;
determining the difficulty of the plurality of first test questions based on first reply information of a target model for the plurality of first test questions;
if the difficulty of the plurality of first test questions does not meet a preset difficulty, adjusting the plurality of first test questions and returning to the step of determining the difficulty of the plurality of first test questions based on the reply information of the target model for the plurality of first test questions;
and if the difficulty of the plurality of first test questions meets the preset difficulty, determining the plurality of first test questions as the test questions for the target model.
2. The method for dynamically generating evaluation questions according to claim 1, further comprising:
scoring the quality of the plurality of first test questions;
and adjusting the plurality of first test questions when their quality scores are lower than a preset value.
3. The method for dynamically generating evaluation questions according to claim 2, wherein the quality of the plurality of first test questions is scored by a first formula:

$Q_i = \alpha \cdot Sim_i + \beta \cdot Cre_i + \gamma \cdot Flu_i$

where $Q_i$ is the quality of the $i$-th first test question, $Sim_i$ is the similarity between the $i$-th first test question and the corresponding seed question, $Cre_i$ is the creativity of the $i$-th first test question, $Flu_i$ is the language fluency of the $i$-th first test question, and $\alpha$, $\beta$, and $\gamma$ are the corresponding weight coefficients for similarity, creativity, and language fluency.
4. The method of claim 1, wherein determining the difficulty of the plurality of first test questions based on the reply information of the target model comprises:
determining the correctness of the target model's reply according to the first reply information of the plurality of first test questions and the corresponding second reply information;
determining the difficulty of the plurality of first test questions based on the correctness;
wherein the first reply information is the current reply information for the plurality of first test questions, and the second reply information is the correct reply information.
5. The method for dynamically generating evaluation questions according to claim 4, wherein the correctness of the target model's reply, determined from the first reply information of the plurality of first test questions and the corresponding second reply information, is calculated as:

$Acc_i = \frac{R_i \cdot A_i}{\|R_i\|\,\|A_i\|}$

where $Acc_i$ is the correctness of the target model on the $i$-th first test question, $R_i$ is the vector of the target model's first reply information for the $i$-th first test question, $A_i$ is the vector of the corresponding second reply information, and $\|R_i\|$ and $\|A_i\|$ are the lengths of the two vectors.
6. The method for dynamically generating evaluation questions according to claim 1, wherein adjusting the plurality of first test questions if their difficulty does not meet the preset difficulty comprises:
adjusting the difficulty of the plurality of first test questions according to the difficulty adjustment factors of the plurality of first test questions if the difficulty of the plurality of first test questions does not meet the preset difficulty.
7. The method for dynamically generating evaluation questions according to claim 6, wherein the difficulty adjustment factor is calculated as:

$\Delta_i = \eta \cdot \sigma\!\left(k\,(\tau_i - \overline{Acc}_i)\right)$

where $\Delta_i$ is the difficulty adjustment factor for the $i$-th first test question, $\eta$ is an adjustment parameter, $\sigma$ is a smoothing function (the sigmoid), $\tau_i$ is the expected threshold for the $i$-th first test question, $\overline{Acc}_i$ is the average correctness on the first test questions, and $k$ is a smoothness parameter for the adjustment factor.
8. A system for dynamically generating evaluation questions, comprising:
a question generation module, configured to generate a plurality of first test questions based on seed questions and prompts;
a question difficulty determination module, configured to determine the difficulty of the plurality of first test questions based on first reply information of a target model for the plurality of first test questions;
a question difficulty adjustment module, configured to, if the difficulty of the plurality of first test questions does not meet a preset difficulty, adjust the plurality of first test questions and return to the step of determining the difficulty of the plurality of first test questions based on the reply information of the target model;
and a question determination module, configured to determine the plurality of first test questions as the test questions for the target model if the difficulty of the plurality of first test questions meets the preset difficulty.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.
CN202410381770.9A 2024-04-01 2024-04-01 Evaluation item dynamic generation method and system, electronic equipment and readable storage medium Active CN117993366B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410381770.9A CN117993366B (en) 2024-04-01 2024-04-01 Evaluation item dynamic generation method and system, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410381770.9A CN117993366B (en) 2024-04-01 2024-04-01 Evaluation item dynamic generation method and system, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN117993366A 2024-05-07
CN117993366B 2024-06-21

Family

ID=90892663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410381770.9A Active CN117993366B (en) 2024-04-01 2024-04-01 Evaluation item dynamic generation method and system, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN117993366B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105989555A (en) * 2015-03-05 2016-10-05 上海汉声信息技术有限公司 Language competence test method and system
CN110489454A (en) * 2019-07-29 2019-11-22 北京大米科技有限公司 A kind of adaptive assessment method, device, storage medium and electronic equipment
CN113673702A (en) * 2021-07-27 2021-11-19 北京师范大学 Method and device for evaluating pre-training language model and storage medium
US20220309087A1 (en) * 2021-03-29 2022-09-29 Google Llc Systems and methods for training language models to reason over tables
US20230068338A1 (en) * 2021-08-31 2023-03-02 Accenture Global Solutions Limited Virtual agent conducting interactive testing
CN117290694A (en) * 2023-11-24 2023-12-26 北京并行科技股份有限公司 Question-answering system evaluation method, device, computing equipment and storage medium
CN117493830A (en) * 2023-11-16 2024-02-02 郑州阿帕斯数云信息科技有限公司 Evaluation of training data quality, and generation method, device and equipment of evaluation model


Also Published As

Publication number Publication date
CN117993366B (en) 2024-06-21

Similar Documents

Publication Publication Date Title
US9652999B2 (en) Computer-implemented systems and methods for estimating word accuracy for automatic speech recognition
CN115358897B (en) Student management method, system, terminal and storage medium based on electronic student identity card
CN111653274A (en) Method, device and storage medium for awakening word recognition
CN117744753A (en) Method, device, equipment and medium for determining prompt word of large language model
CN117808946B (en) Method and system for constructing secondary roles based on large language model
CN117493830A (en) Evaluation of training data quality, and generation method, device and equipment of evaluation model
CN116663679A (en) Language model training method, device, equipment and storage medium
CN117539977A (en) Training method and device for language model
CN117993366B (en) Evaluation item dynamic generation method and system, electronic equipment and readable storage medium
CN115116474A (en) Spoken language scoring model training method, scoring method, device and electronic equipment
CN113094404B (en) Big data acquisition multi-core parameter self-adaptive time-sharing memory driving method and system
CN115392769A (en) Evaluation model training method, performance evaluation method and device
CN111581911B (en) Method for automatically adding punctuation to real-time text, model construction method and device
CN112767932A (en) Voice evaluation system, method, device, equipment and computer readable storage medium
CN112163975A (en) Intelligent learning guiding and prompting method and system
Liu et al. Deep learning scoring model in the evaluation of oral English teaching
CN112131889A (en) Intelligent Chinese subjective question scoring method and system based on big data
Meng et al. Nonlinear network speech recognition structure in a deep learning algorithm
CN118467709B (en) Evaluation method, device, medium and computer program product for visual question-answering task
US20220398496A1 (en) Learning effect estimation apparatus, learning effect estimation method, and program
CN115083437A (en) Method and device for determining uncertainty of learner pronunciation
CN118485553A (en) Examination evaluation method, system, medium, electronic equipment and product
CN113157713A (en) Topic grouping updating method and device, computer equipment and storage medium
CN118467709A (en) Evaluation method, device, medium and computer program product for visual question-answering task
Chen Entertainment social media based on deep learning and interactive experience application in English e-learning teaching system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant