CN117093459A - Evaluation method and device of language model, electronic equipment and storage medium - Google Patents

Evaluation method and device of language model, electronic equipment and storage medium

Info

Publication number
CN117093459A
CN117093459A (application CN202310519580.4A)
Authority
CN
China
Prior art keywords
text
language model
data set
evaluation
target data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310519580.4A
Other languages
Chinese (zh)
Inventor
孙鹏飞
谢雨欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mobvoi Information Technology Co ltd
Original Assignee
Shanghai Mobvoi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Mobvoi Information Technology Co., Ltd.
Priority to CN202310519580.4A
Publication of CN117093459A
Legal status: Pending

Links

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 — Error detection; Error correction; Monitoring
    • G06F11/30 — Monitoring
    • G06F11/34 — Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 — Recording or statistical evaluation of computer activity for performance assessment
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 — Validation; Performance evaluation; Active pattern learning techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a method for evaluating a language model, comprising: generating, based on a target data set, a guide text to be input into the language model, wherein the guide text comprises a plurality of predicted text sequences or prompt information; processing the guide text with the language model to determine a predicted text of the language model; and comparing the predicted text with a reference text to generate a target dimension score, wherein the target dimension score characterizes the performance of the language model in the evaluation category corresponding to the target data set. The disclosure also provides an evaluation device for the language model, an electronic device, and a storage medium.

Description

Evaluation method and device of language model, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of machine learning, and in particular to a method and a device for evaluating a language model, an electronic device, and a storage medium.
Background
With the development of artificial intelligence, evaluation benchmarks have become a weathervane for the development of pre-trained language models. However, most existing test sets for evaluating pre-trained language models are English data sets, which can only verify a model's processing performance on English tasks. Chinese is logographic: there are no delimiters between characters, and different word-segmentation schemes produce different outputs, so even a pre-trained language model that evaluates well on English tasks cannot have its performance on Chinese tasks verified. The lack of Chinese data sets has created a barrier to progress in the language understanding capability of pre-trained language models, causing their development and application on Chinese tasks to lag behind English tasks.
Against this background, CLUE (Chinese Language Understanding Evaluation) emerged. As the first large-scale Chinese language evaluation benchmark, CLUE provides a large pre-training Chinese corpus and diagnostic evaluation data sets, supports public testing of various Chinese tasks for most pre-trained models, and has promoted the development and application of Chinese pre-trained models.
However, with the rise of generative language models such as GPT (Generative Pre-trained Transformer), the drawbacks of the CLUE evaluation scheme have also been revealed. Specifically, GPT, as a generative language model, mainly predicts the next word from the preceding words; that is, when GPT is evaluated, only the word at the end of a sentence can be predicted. A traditional pre-trained language model (e.g., the BERT model), by contrast, combines left and right context to understand and predict the text, so the predicted portion can be located anywhere in the sentence. Because of this difference, traditional evaluation schemes such as CLUE are limited to one category of language model, cannot meet the evaluation requirements of generative language models, and lack universality.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present disclosure provides a method, an apparatus, an electronic device, and a storage medium for evaluating a language model.
One aspect of the present disclosure provides a method for evaluating a language model, which may include: generating, based on a target data set, a guide text to be input into the language model, wherein the guide text comprises a plurality of predicted text sequences or prompt information; processing the guide text with the language model to determine a predicted text of the language model; and comparing the predicted text with a reference text to generate a target dimension score, wherein the target dimension score characterizes the performance of the language model in the evaluation category corresponding to the target data set.
In some implementations, generating the guide text based on the target data set includes: in response to the task category of the target data set being a classification task, determining a plurality of language tags corresponding to the attribute category of the target data set as candidate strings; screening, for each language tag, at least one candidate word representing that tag; and filling each candidate word into the evaluation template corresponding to the target data set to obtain a plurality of predicted text sequences, which serve as the guide text.
In some implementations, generating the guide text based on the target data set includes: in response to the task category of the target data set being a complete filling task, analysing the contextual meaning of the evaluation template; generating at least one candidate string matching the contextual meaning; and filling each candidate string into the evaluation template corresponding to the target data set to obtain a plurality of predicted text sequences, which serve as the guide text.
In some implementations, generating the guide text based on the target data set includes: extracting sample text from the target data set as the prompt information, and using the prompt information as the guide text so that the language model generates a predicted text conforming to the text generation rules.
In some embodiments, before generating the guide text based on the target data set, the method comprises: determining an evaluation template according to the test text and the evaluation category of the target data set, wherein the evaluation category comprises a task category and an attribute category, and the task category comprises at least a classification task, a complete filling (i.e., cloze) task and a generation task.
In some embodiments, after comparing the predicted text with the reference text to generate the target dimension score, the method comprises: integrating the target dimension scores of the language model in each evaluation category to obtain a comprehensive performance score characterizing the multidimensional performance of the language model.
In some embodiments, after comparing the predicted text with the reference text to generate the target dimension score, the method further comprises: comparing the target dimension score and/or the comprehensive performance score with a control evaluation system and/or an expected score, respectively, to obtain difference data of the language model; and performing targeted optimization of the language model according to the difference data.
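The aggregation and comparison described in the two embodiments above can be sketched as follows. This is an illustrative assumption, not part of the disclosure: the weighting scheme, category names, and score values are all hypothetical, and the disclosure does not fix any particular aggregation formula.

```python
def composite_score(dimension_scores, weights=None):
    """Integrate the target dimension scores of each evaluation category
    into one comprehensive performance score (here: a weighted mean)."""
    if weights is None:
        weights = {k: 1.0 for k in dimension_scores}  # equal weighting by default
    total = sum(weights[k] for k in dimension_scores)
    return sum(s * weights[k] for k, s in dimension_scores.items()) / total

def difference_data(scores, expected):
    """Compare per-category scores with expected scores, yielding the gap
    for each category so the model can be optimized where it lags."""
    return {k: scores[k] - expected[k] for k in scores}

# Hypothetical per-category target dimension scores.
scores = {"classification": 0.82, "complete_filling": 0.74, "generation": 0.61}
overall = composite_score(scores)
gaps = difference_data(
    scores,
    {"classification": 0.85, "complete_filling": 0.70, "generation": 0.75},
)
```

A negative gap in `gaps` marks an evaluation category where the model falls short of expectation and is therefore a candidate for targeted optimization.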
Another aspect of the present disclosure provides an evaluation device for a language model, which may include: a guide text generation module for generating, based on a target data set, a guide text to be input into a language model, wherein the guide text comprises a plurality of predicted text sequences or prompt information; a predicted text acquisition module for processing the guide text with the language model to determine a predicted text of the language model; and a performance calculation module for comparing the predicted text with a reference text to generate a target dimension score, wherein the target dimension score characterizes the performance of the language model in the evaluation category corresponding to the target data set.
In yet another aspect, the disclosure provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the program to implement the method for evaluating a language model according to any of the foregoing embodiments.
Yet another aspect of the present disclosure provides a readable storage medium storing a computer program adapted to be loaded by a processor to perform a method for evaluating a language model according to any of the embodiments described above.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
FIG. 1 is a schematic diagram of the process when the BERT model performs a complete filling (cloze) task;
FIG. 2 is a schematic diagram of the process when the GPT model performs a complete filling (cloze) task;
FIG. 3 is a flowchart of a method for evaluating a language model according to an exemplary embodiment of the present disclosure;
FIG. 4 is a block diagram of an evaluation device of a language model according to an exemplary embodiment of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the drawings and the embodiments. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant content and not limiting of the present disclosure. It should be further noted that, for convenience of description, only a portion relevant to the present disclosure is shown in the drawings.
In addition, embodiments of the present disclosure and features of the embodiments may be combined with each other without conflict. The technical aspects of the present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Unless otherwise indicated, the exemplary implementations/embodiments shown are to be understood as providing exemplary features of various details of some ways in which the technical concepts of the present disclosure may be practiced. Thus, unless otherwise indicated, features of the various implementations/embodiments may be additionally combined, separated, interchanged, and/or rearranged without departing from the technical concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, the terms "comprises" and/or "comprising," and variations thereof, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, components, and/or groups thereof, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It is also noted that, as used herein, the terms "substantially," "about," and other similar terms are used as terms of approximation and not as terms of degree, and as such are used to account for the inherent deviations in measured, calculated, and/or provided values that would be recognized by one of ordinary skill in the art.
FIG. 1 is a schematic diagram of the process when the BERT model performs a complete filling task; FIG. 2 is a schematic diagram of the process when the GPT model performs a complete filling task. The differences between the generative language model and the conventional masked language model will be described with reference to FIGS. 1 and 2.
Currently, there are at least three classes of language models: autoregressive language models, such as the GPT (Generative Pre-trained Transformer) model; masked language models, such as the BERT (Bidirectional Encoder Representations from Transformers) model; and encoder-decoder language models, such as the T5 (Text-to-Text Transfer Transformer) model.
A masked language model such as BERT can combine context and semantics to fill in characters or words at a mask in any position of the predicted language sequence; for example, in the sentence "the students [Mask] their books", the word at the [Mask] position can be predicted as "opened" by combining the surrounding context with the training text. An encoder-decoder language model such as T5 obtains a predicted sequence by corrupting the original order of the training text and predicting its actual order; for example, from the corrupted fragments "their books" and "the students opened", the predicted sequence "students opened their books" is recovered after the two are reassembled.
Generative language models such as GPT mainly predict the next word from the semantics of the preceding words, so the random-mask evaluation approach applicable to masked language models such as BERT cannot be applied to the evaluation of the GPT model.
Referring to FIGS. 1 and 2, the differences between a conventional masked language model such as BERT and a generative language model such as GPT are now described. Taking the CLUE evaluation as an example, the main idea is to convert downstream tasks (e.g., classification tasks) into a complete filling task by means of templates built in natural language, so that the words in the masked area can be predicted through the bidirectional MLM (Masked Language Model) task of BERT. For example, the training text for emotion classification is formed with a conditional prefix: "[Mask] satisfied. The meal delivery is particularly fast and the attitude is good." The specific template converts emotion classification into an MLM task: the text is input into the masked language model BERT, the [Mask] is predicted, and if the predicted result is "not", the emotion classification result of the sentence is judged to be "negative emotion". This particular prediction happens to be inaccurate, but the example shows that, when evaluating a traditional masked language model, this type of model supports mask evaluation at the sentence-initial position. Similarly, the training text for topic classification is formed with the conditional prefix "Broadcast a [Mask] news item. ××'s new song is online!"; the specific template converts it into an MLM task, the text is input into BERT, the [Mask] is predicted as "entertainment", and the classification result of the sentence is determined to be "entertainment news". As this example shows, when evaluating a traditional masked language model, the mask can be placed at any position in the sentence.
Of course, the mask evaluation scheme is also applicable to the unidirectional LM (Language Model) task of GPT. For example, the training text "The meal delivery is fast and the attitude is good. I feel [Mask]" converts emotion classification into an LM task through a specific template: it is input into the language model GPT, the [Mask] is predicted, the prediction result "very satisfied" is obtained, the classification result of the sentence is determined to be "positive emotion", and the end of the text is represented by [E]; the red font in FIG. 2 is the filling of the prediction result into the [Mask]. For another example, the training text "××'s new song is online! Thank you for listening to the [Mask]" converts news classification into an LM task: it is input into GPT, the [Mask] is predicted as "entertainment channel", the classification result "entertainment news" is obtained, and the end of the text is again represented by [E]; the red font in FIG. 2 is the filling of the prediction result into the [Mask]. These two examples clearly demonstrate that the GPT model can only decode from left to right, so the predicted portion can only be placed at the end of a sentence.
A natural language inference task can also serve as the downstream task and be converted into a complete filling task to evaluate the model. For example, given two sentences, it is determined whether they are compatible; the two sentences are spliced together and input into the language model for classification. To convert this into a complete filling task, the data can be constructed as follows:
{
    "sentence1": "Wrapped in a cotton overcoat sent by the factory, hands tucked into the sleeves,",
    "sentence2": "There is at least one garment on the person",
    "label": "entailment"
}
This natural language inference task can thus be converted into a training text for the complete filling task, such as: "Wrapped in a cotton overcoat sent by the factory, hands tucked into the sleeves, [Mask], there is at least one garment on the person."
Here, the language tag label is the entailment relation "entailment". The language tags may be "neutral", "entailment" and "contradiction", with the corresponding candidate strings "and", "yes" and "not", respectively. That is, one of the three candidate strings above must be selected for the position of [Mask]. If the language tag label is "neutral", "and" is the correct result for [Mask]; if label is "entailment", "yes" is the correct result; if label is "contradiction", "not" is the correct result. In this example the language tag is "entailment", so the correct result for [Mask] is "yes". On this basis, the language model can be made to predict, and its performance on the complete filling task observed.
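The construction above — one candidate sentence per language tag, with the tag determining the correct candidate — can be sketched in a few lines of Python. This is an illustrative sketch, not part of the disclosure; the English template text paraphrases the translated example.

```python
# Mapping from NLI language tags to candidate strings, as in the example above.
LABEL_TO_CANDIDATE = {"neutral": "and", "entailment": "yes", "contradiction": "not"}

TEMPLATE = ("Wrapped in a cotton overcoat sent by the factory, hands tucked "
            "into the sleeves, [Mask], there is at least one garment on the person.")

def build_predicted_sequences(template, label_to_candidate):
    """Fill each candidate string into the [Mask] slot, yielding one
    predicted text sequence per language tag."""
    return {label: template.replace("[Mask]", cand)
            for label, cand in label_to_candidate.items()}

sequences = build_predicted_sequences(TEMPLATE, LABEL_TO_CANDIDATE)
# The gold label of this example is "entailment", so a well-performing model
# should assign the lowest perplexity to sequences["entailment"].
```

The language model is then asked only to score these three complete sentences, never to fill the mask itself.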
A reading comprehension task can likewise serve as the downstream task and be converted into a complete filling task to evaluate the model. This includes coreference resolution of pronouns, performed by using templates to splice nouns and pronouns together. For example:
It can be converted, through a template, into a training text for the complete filling task: "It [Mask] the bed. When the mobile phone beside the pillow on the bed rang, I felt strange, because service had been suspended for two months of unpaid arrears, yet now it suddenly rang."
Here, "span2_index": 37 indicates that span 2 is the 37th word; "span1_index": 5 indicates that span 1 is the 5th word; "span1_text": "bed" indicates that span 1 is "bed" in the training text; "span2_text": "it" indicates that span 2 is "it" in the training text. The language tag label is "false"; therefore, in the training text, the candidate string for the [Mask] may be "is not", corresponding to "false", or "is", corresponding to "true". Since the training text gives the label "false", the prediction result for [Mask] should be "is not".
FIG. 3 is a flowchart of a method for evaluating a language model according to an exemplary embodiment of the present disclosure. The evaluation method S100 of the language model of the present disclosure will be explained below with reference to fig. 3.
Step S102, generating, based on the target data set, a guide text to be input into the language model.
The target data set includes a plurality of test texts and indicates the evaluation category. The target data set adopts an existing Chinese corpus, such as the training corpus in CLUE; by directly using publicly available test texts, the cost of re-annotating test texts is avoided, giving the method universality. Test texts in the same target data set share the same evaluation category, which comprises the task category of the evaluation and the attribute category of the test text. The task category may be a complete filling task, a classification task or a generation task, and may also include a natural language inference task; the attribute category of the test text may be an emotion category, a news category, etc.
The guide text comprises a plurality of predicted text sequences or prompt information, and its specific form is determined by the evaluation category indicated by the target data set. If the task category in the evaluation category is a complete filling task or a classification task, the guide text may take the form of a plurality of predicted text sequences; if the task category is a generation task, the guide text may take the form of prompt information. Of course, if the task category is a classification task, the guide text may also take the form of prompt information.
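The dispatch from task category to guide-text form described above can be sketched as a small function. This is an illustrative assumption (the disclosure allows a classification task to use either form; predicted text sequences are chosen here, and the category names are hypothetical identifiers):

```python
def guide_text_form(task_category):
    """Select the concrete form of the guide text from the task category.
    A classification task may use either form; predicted text sequences
    are chosen here as the default."""
    if task_category in ("complete_filling", "classification"):
        return "predicted_text_sequences"
    if task_category == "generation":
        return "prompt_information"
    raise ValueError(f"unknown task category: {task_category!r}")
```

Downstream, "predicted_text_sequences" leads to perplexity scoring over candidate sentences, while "prompt_information" leads to free generation guided by sample text.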
Step S104, processing the guide text with the language model to determine the predicted text of the language model.
The language model may be a GPT model, a BERT model, a T5 model, etc.; the type of the language model is not limited. The predicted text is output by the language model either by controlling the language model to compute the perplexity (PPL) of each of the plurality of predicted text sequences, or by controlling the language model to analyse the prompt information. The predicted text refers to the output result of the language model: when the guide text is a plurality of predicted text sequences, the predicted text is the candidate string corresponding to the sequence with the minimum perplexity after the language model computes the perplexity of each sequence; when the guide text is prompt information, the predicted text is a newly generated language segment that the language model produces under the guidance of the prompt information and that conforms to the text generation rules.
Step S106, comparing the predicted text with the reference text to generate a target dimension score.
The reference text is a reference standard given by the evaluation system, or a target text formulated by a tester according to the required evaluation rules; of course, the reference text may also be a result given in another evaluation benchmark, which is not limited here. When the task category of the target data set is a complete filling task or a classification task, the reference text may be a standard answer; when the task category is a generation task, the reference text is for reference only and does not indicate a standard answer.
The target dimension score characterizes the performance of the language model in the evaluation category corresponding to the target data set. Because different target data sets correspond to different evaluation categories, the target dimension score of the language model in each dimension can be obtained by traversing the data sets of each evaluation category in the public Chinese corpus using the above steps. In general, the closer the predicted text is to the reference text, the higher the target dimension score, and the better the prediction performance of the language model in the evaluation category corresponding to the target data set; conversely, the worse. The target dimension score may serve as one of the optimization parameters for the language model.
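The comparison of step S106 can be sketched as follows. The metric is an assumption: the disclosure does not fix a particular comparison function, so exact-match accuracy is used here as one plausible choice for classification and complete filling categories.

```python
def target_dimension_score(predicted, reference):
    """Compare predicted texts with reference texts and return a score in
    [0, 1].  Exact-match accuracy is one plausible metric for the
    classification and complete filling categories; generation tasks would
    need a softer similarity measure."""
    if len(predicted) != len(reference):
        raise ValueError("prediction/reference length mismatch")
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)

# Hypothetical predictions over a three-item news-topic data set.
score = target_dimension_score(
    ["entertainment", "sports", "finance"],
    ["entertainment", "sports", "politics"],
)
```

The closer the predictions track the references, the higher the score, matching the proportionality described above.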
In some embodiments, step S102 may be implemented as follows: in response to the task category of the target data set being a classification task, determining a plurality of language tags corresponding to the attribute category of the target data set as candidate strings; screening, for each language tag, at least one candidate word representing that tag; and filling each candidate word into the evaluation template corresponding to the target data set to obtain a plurality of predicted text sequences as the guide text.
The candidate strings are at least one character, a multi-character word, or a multi-character passage that can match the blank area in the training text, obtained directly or indirectly from the training text in the target data set.
The evaluation templates are generated according to the text content and task category of the different target data sets; different templates match different task categories. Since different templates have different mask positions and different candidate strings, filling each candidate string into the blank area of the evaluation template yields a plurality of predicted text sequences.
In this method, the language model does not need to fill in the mask itself — it merely scores complete predicted text sequences — so neither the type of the language model nor the position of the mask needs to be considered; compared with the traditional mask evaluation scheme, this approach is universal and suitable for generative language models.
Specifically, after the evaluation template is obtained, each tag data pair formed by a candidate string and a language tag is fused with the evaluation template to form a plurality of new sentences, i.e., the predicted text sequences; the predicted text sequences are the input data of the language model.
TABLE 1

label       template
like        This is an article about favorite emotion: I like to eat the discount cuisine of this restaurant.
happiness   This is an article about happy emotion: I like to eat the discount cuisine of this restaurant.
sadness     This is an article about sad emotion: I like to eat the discount cuisine of this restaurant.
fear        This is an article about fear emotion: I like to eat the discount cuisine of this restaurant.
surprise    This is an article about surprise emotion: I like to eat the discount cuisine of this restaurant.
anger       This is an article about anger emotion: I like to eat the discount cuisine of this restaurant.
disgust     This is an article about aversive emotion: I like to eat the discount cuisine of this restaurant.
Table 1 shows the correspondence between language tags and evaluation templates. As shown in Table 1, if the task category of the target data set is a classification task and the attribute category is an emotion category, the language tags may be "like", "happiness", "sadness", "fear", "surprise", "anger", "disgust", etc. Each language tag is then taken as a candidate string, and at least one candidate word representing the language tag is determined for each candidate string: for example, "like" is represented by "favorite", "happiness" by "happy", "sadness" by "sad", "fear" by "fear", "surprise" by "surprise", "anger" by "anger", "disgust" by "aversion", and so on. Each candidate word is then filled into the evaluation template to obtain a plurality of predicted text sequences; for example, filling "favorite" into the evaluation template "This is an article about [Mask] emotion: I like to eat the discount cuisine of this restaurant." yields the predicted text sequence "This is an article about favorite emotion: I like to eat the discount cuisine of this restaurant.", and so on for the rest, which are not listed here.
Obviously, not every candidate word suits the context of the evaluation template, so the perplexity of each predicted text sequence is computed with the language model, and the candidate string corresponding to the minimum perplexity is taken as the prediction result of the language model in the emotion classification dimension.
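The fill-then-score-by-perplexity mechanism can be made concrete with a toy language model. This is an illustrative sketch, not part of the disclosure: a tiny character-bigram model with add-one smoothing stands in for the real generative model (which would be a GPT-style network), and the training corpus and labels are hypothetical.

```python
import math
from collections import Counter

class BigramLM:
    """A tiny character-bigram model with add-one smoothing, standing in
    for the real generative language model so that the min-perplexity
    selection step is concrete and runnable."""
    def __init__(self, corpus):
        text = " ".join(corpus)
        self.uni = Counter(text)                 # character counts
        self.bi = Counter(zip(text, text[1:]))   # character-pair counts
        self.v = len(self.uni)                   # vocabulary size

    def perplexity(self, sentence):
        log_p, n = 0.0, 0
        for a, b in zip(sentence, sentence[1:]):
            p = (self.bi[(a, b)] + 1) / (self.uni[a] + self.v + 1)
            log_p += math.log(p)
            n += 1
        return math.exp(-log_p / max(n, 1))

def classify(model, template, label_to_word):
    """Fill each candidate word into the template, score every predicted
    text sequence by perplexity, and return the label whose sequence the
    model finds least surprising."""
    ppl = {label: model.perplexity(template.replace("[Mask]", word))
           for label, word in label_to_word.items()}
    return min(ppl, key=ppl.get)

model = BigramLM(["this is an article about a happy emotion",
                  "i feel happy and relaxed today"])
label = classify(model, "this is an article about a [Mask] emotion",
                 {"happiness": "happy", "sadness": "sad"})
```

Because the toy corpus contains the "happy" sequence verbatim, the "happiness" candidate receives the lowest perplexity, mirroring how a real model would prefer the contextually fluent candidate.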
In some embodiments, step S102 may also be implemented as follows: in response to the task category of the target data set being a complete filling task, analysing the contextual meaning of the evaluation template; generating at least one candidate string matching the contextual meaning; and filling each candidate string into the evaluation template corresponding to the target data set to obtain a plurality of predicted text sequences as the guide text. Specifically, after the evaluation template is obtained, each candidate string is filled into the mask position of the evaluation template, generating a plurality of new texts as the predicted text sequences. For example:
{
    "candidates": ["chocolate order", "overwhelming", "windy figure", "eight immortals crossing the sea", "lying flat", "planted soldiers", "identical words", ...],
    "content": "When Guangzhou lost to Beikong and Guo played sluggishly, the suspense of the CBA post-season seemed to vanish in an instant. Unexpectedly, after an interval of 1 day, the Beikong foreign player was banned over a personal decision (defaulting on a brokerage company's fee), which #idiom# and, together with Guo, turned Guangzhou's miracle into Tianjin's reversal, letting ...",
    "answer": 1
}
The candidate character strings (candidates) include "chocolate order", "overwhelming", "windy figure", "eight immortals crossing the sea", "lying flat", "planted soldiers", and "identical words" (literal renderings of the original Chinese idioms). The evaluation template is the content string of the example, in which #idiom# marks the mask area. The candidate strings are numbered sequentially starting from 0; an answer value of 1 indicates that the correct result is the candidate string numbered 1, namely "overwhelming". The prediction result of the language model is output as a number, and if the model also outputs 1, the prediction is correct; the accuracy of the model's predictions can therefore be calculated from the answer values.
First, the evaluation template is analyzed: its context describes Guangzhou's miracle being reversed by Tianjin, and a plurality of candidate character strings matching that context are determined, namely "chocolate order", "overwhelming", "windy figure", "eight immortals crossing the sea", "lying flat", "planted soldiers", and "identical words". Each candidate character string is then filled into the #idiom# region of the evaluation template to obtain a plurality of predicted text sequences. For example, filling the candidate string "chocolate order" into the template yields one such predicted text sequence; the remaining sequences are constructed in the same way and are not listed here.
Obviously, not every candidate character string is suitable for the context of the evaluation template, so the perplexity of each predicted text sequence is calculated with the language model, and the candidate character string whose sequence has the lowest perplexity is taken as the prediction result of the language model in the sports-news cloze dimension.
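When the data set also stores the index of the correct candidate in an answer field, as in the example above, scoring reduces to comparing numbers. A minimal sketch (the items and model outputs below are hypothetical):

```python
# Each cloze item pairs its candidate strings with the index of the correct
# one; the model's prediction is likewise an index (hypothetical data).
items = [
    {"candidates": ["idiom_a", "idiom_b", "idiom_c"], "answer": 1},
    {"candidates": ["idiom_a", "idiom_b", "idiom_c"], "answer": 2},
]
model_outputs = [1, 0]  # indices predicted by the language model

correct = sum(pred == item["answer"]
              for pred, item in zip(model_outputs, items))
accuracy = correct / len(items)  # fraction of items answered correctly
```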
In some embodiments, step S102 may also be implemented as follows: in response to the task category of the target data set being a generation task, extracting sample texts from the target data set as prompt information, and using the prompt information as the guide text, where the prompt information guides the language model to generate predicted text that conforms to the text generation rule.
Specifically, the target data set includes a plurality of sample texts, which are used as prompts. For example, given K prompts, the first K-1 prompts include answers, and the answer to the K-th prompt is to be predicted by the language model. The sample texts are input into the language model as guide text; from the first K-1 prompts and their answers, the language model derives the text generation rule for producing an answer, and then applies that rule to the K-th prompt to obtain its answer, which is the predicted text of the language model. Setting prompt information as the guide text improves the prediction accuracy of a generative language model.
For example, three original texts are given below. The first two are each paired with a summary extracted from the text; for the third, a corresponding summary must be generated according to the rule implied by the first two text-summary pairs:
{
    example1 = "Recently, the worldwide energy crisis has drawn great concern; due to factors such as unstable energy supply and surging energy prices in many countries, many industries and people face energy shortages and economic pressure."
    result1 = "The global energy crisis causes problems such as energy shortages and economic pressure"
    example2 = "With the rapid development of technology, information security is becoming a focus of social attention. In the digital age, people's personal information and confidential data are stored on the internet, and lawbreakers such as hackers and cybercriminals covet these precious resources. Frequent incidents such as information leakage, network attacks, and data tampering can cause serious losses to individuals and organizations."
    result2 = "Information security is threatened by lawbreakers such as hackers"
    prompt = example1 + result1 + example2 + result2 + input
}
Here, example1 denotes the first original text and result1 is the summary refined from it; example2 denotes the second original text and result2 is the summary refined from it; input denotes the third original text, the one for which a summary is needed; and prompt is the result of concatenating all of them. The language model derives the rule for generating a summary from the first two original texts and their summaries, and then applies that rule to the last original text to obtain its summary.
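The concatenation prompt = example1 + result1 + example2 + result2 + input can be sketched as below. The "Text:"/"Summary:" labels and the sample strings are illustrative assumptions; the source only specifies the concatenation order:

```python
def build_prompt(demonstrations, new_input):
    """Concatenate (text, summary) demonstration pairs followed by the new
    input, mirroring prompt = example1 + result1 + example2 + result2 + input."""
    parts = []
    for text, summary in demonstrations:
        parts.append(f"Text: {text}\nSummary: {summary}\n")
    # The final summary is left blank for the model to fill in.
    parts.append(f"Text: {new_input}\nSummary:")
    return "\n".join(parts)

demonstrations = [
    ("The global energy crisis brings unstable supply and surging prices...",
     "Energy crisis causes shortages and economic pressure"),
    ("Personal data stored online is coveted by hackers and cybercriminals...",
     "Information security is threatened by lawbreakers such as hackers"),
]
prompt = build_prompt(demonstrations, "A third article that needs a summary...")
```

The trailing "Summary:" leaves the K-th answer open, so a generative model continues the pattern established by the first K-1 pairs.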
In some embodiments, before step S102, the method includes: determining an evaluation template according to the test text and the evaluation category of the target data set, where the evaluation category includes a task category and an attribute category, and the task category includes at least a classification task, a cloze task, a generation task, and a natural language inference task.
In some embodiments, after step S106, the method includes: integrating the target dimension scores of the language model in each evaluation category to obtain a composite performance score characterizing the multidimensional performance of the language model.
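Integrating the per-category scores can be as simple as an (optionally weighted) average; the category names and score values below are hypothetical:

```python
# Hypothetical target dimension scores, one per evaluation category.
dimension_scores = {"classification": 0.82, "cloze": 0.74, "generation": 0.66}

# Unweighted mean as the composite performance score; per-category weights
# could be introduced if some dimensions matter more than others.
composite = sum(dimension_scores.values()) / len(dimension_scores)
```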
In some implementations, after comparing the predicted text with the reference text to generate the target dimension score, the method further includes: comparing the target dimension score and/or the composite performance score with a reference evaluation system and/or an expected score, respectively, to obtain difference data for the language model; and optimizing the language model in a targeted manner according to the difference data. The reference evaluation system may be an existing Chinese model evaluation benchmark such as CLUE.
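The difference data against a reference system can be computed per dimension, pointing optimization at the weakest category. The scores below are hypothetical:

```python
# Hypothetical per-dimension scores for the model and a reference system.
model_scores = {"classification": 0.82, "cloze": 0.74, "generation": 0.66}
reference_scores = {"classification": 0.85, "cloze": 0.70, "generation": 0.72}

# Positive values: the model beats the reference; negative: it lags behind.
difference = {k: round(model_scores[k] - reference_scores[k], 2)
              for k in model_scores}
weakest = min(difference, key=difference.get)  # dimension to optimize first
```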
According to the evaluation method of the language model provided by one aspect of the disclosure, the evaluation template is constructed directly from data sets in existing Chinese corpora, which saves the cost of constructing evaluation templates. In addition, the perplexity of each predicted text sequence is calculated with the language model to determine the model's performance state in each dimension, avoiding the limitation that mask-based evaluation imposes on unidirectional prediction models such as generative language models and broadening the applicable range of model evaluation. Furthermore, prompt information is added as guide text input to the model, helping a generative language model output more accurate predicted text.
FIG. 4 is a block diagram of an evaluation device of a language model according to an exemplary embodiment of the present disclosure.
As shown in FIG. 4, the present disclosure provides a device 1000 for evaluating a language model, which may include: a guide text generation module 1002 configured to generate guide text for input to a language model based on a target data set, where the guide text includes a plurality of predicted text sequences or prompt information; a predicted text acquisition module 1004 configured to process the guide text with the language model to determine the predicted text of the language model; and a performance calculation module 1006 configured to compare the predicted text with the reference text and generate a target dimension score, where the target dimension score characterizes the performance of the language model in the evaluation category corresponding to the target data set.
The modules of the evaluation device 1000 of the language model are provided to execute the steps of the evaluation method of the language model described above; their execution steps and principles are the same as described in the foregoing and are not repeated here.
The apparatus 1000 may include corresponding units for performing each of the steps in the flowcharts described above. Accordingly, each step, or group of steps, of those flowcharts may be performed by a corresponding unit, and the apparatus 1000 may include one or more such units. A unit may be one or more hardware modules specifically configured to perform the respective steps, may be implemented by a processor configured to perform the respective steps, may be stored in a computer-readable storage medium for implementation by a processor, or may be implemented by some combination of these.
The hardware structure may be implemented with a bus architecture. The bus architecture may include any number of interconnecting buses and bridges, depending on the specific application of the hardware and the overall design constraints. Bus 1100 connects together various circuits, including one or more processors 1200, memory 1300, and/or hardware modules. Bus 1100 may also connect various other circuits 1400, such as peripherals, voltage regulators, power management circuits, and external antennas.
Bus 1100 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one connection line is drawn in the figure, but this does not mean that there is only one bus or one type of bus.
According to the evaluation device of the language model provided by one aspect of the disclosure, the evaluation template is constructed directly from data sets in existing Chinese corpora, which saves the cost of constructing evaluation templates. In addition, the perplexity of each predicted text sequence is calculated with the language model to determine the model's performance state in each dimension, avoiding the limitation that mask-based evaluation imposes on unidirectional prediction models such as generative language models and broadening the applicable range of model evaluation. Furthermore, prompt information is added as guide text input to the model, helping a generative language model output more accurate predicted text.

Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present disclosure also includes implementations in which functions are executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art. The processor performs the various methods and processes described above. For example, method embodiments in the present disclosure may be implemented as a software program tangibly embodied on a machine-readable medium, such as a memory. In some embodiments, part or all of the software program may be loaded and/or installed via the memory and/or a communication interface.
One or more of the steps of the methods described above may be performed when a software program is loaded into memory and executed by a processor. Alternatively, in other embodiments, the processor may be configured to perform one of the methods described above in any other suitable manner (e.g., by means of firmware).
Logic and/or steps represented in the flowcharts or otherwise described herein may be embodied in any readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
For the purposes of this description, a "readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). The readable storage medium may even be paper or another suitable medium on which the program is printed, as the program can be captured electronically, for instance by optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a memory.
It should be understood that portions of the present disclosure may be implemented in hardware, software, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or part of the steps implementing the method of the above embodiment may be implemented by a program to instruct related hardware, and the program may be stored in a readable storage medium, where the program when executed includes one or a combination of the steps of the method embodiment.
Furthermore, each functional unit in each embodiment of the present disclosure may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. The storage medium may be a read-only memory, a magnetic disk or optical disk, etc.
It will be appreciated by those skilled in the art that the above-described embodiments are merely for clarity of illustration of the disclosure, and are not intended to limit the scope of the disclosure. Other variations or modifications will be apparent to persons skilled in the art from the foregoing disclosure, and such variations or modifications are intended to be within the scope of the present disclosure.

Claims (10)

1. A method for evaluating a language model, comprising:
generating a guide text for inputting a language model based on a target data set, wherein the guide text comprises a plurality of predicted text sequences or prompt information;
processing the guide text with the language model to determine a predicted text of the language model; and
comparing the predicted text with a reference text to generate a target dimension score, wherein the target dimension score is used for representing the performance of the language model in the evaluation category corresponding to the target data set.
2. The method for evaluating a language model according to claim 1, wherein generating a guide text for inputting the language model based on the target data set comprises:
determining a plurality of language tags corresponding to attribute categories of the target data set as candidate character strings in response to the task categories of the target data set being classification tasks;
screening at least one candidate word for representing the language tag for each language tag; and
filling each candidate word into an evaluation template corresponding to the target data set respectively to obtain a plurality of predicted text sequences, and taking the predicted text sequences as the guide text.
3. The method for evaluating a language model according to claim 1, wherein generating a guide text for inputting the language model based on the target data set comprises:
analyzing the contextual meaning of the evaluation template in response to the task category of the target data set being a cloze task;
generating at least one candidate character string matching the contextual meaning; and
filling each candidate character string into an evaluation template corresponding to the target data set respectively to obtain a plurality of predicted text sequences, and taking the predicted text sequences as the guide text.
4. The method for evaluating a language model according to claim 1, wherein generating a guide text for inputting the language model based on the target data set comprises:
extracting sample text from the target data set as the prompt information, and taking the prompt information as the guide text to guide generation of a predicted text conforming to the text generation rule.
5. The method for evaluating a language model according to any one of claims 1 to 4, comprising, before the generating of the guide text for inputting the language model based on the target data set:
determining an evaluation template according to the test text and the evaluation category of the target data set, wherein the evaluation category comprises a task category and an attribute category, and the task category comprises at least a classification task, a cloze task and a generation task.
6. The method for evaluating a language model according to claim 1, comprising, after said comparing the predicted text with a reference text to generate a target dimension score:
integrating the target dimension scores of the language model in each evaluation category to obtain a composite performance score for representing the multidimensional performance of the language model.
7. The method for evaluating a language model of claim 6, further comprising, after said comparing said predicted text with a reference text to generate a target dimension score:
comparing the target dimension score and/or the comprehensive performance score with a control evaluation system and/or an expected score respectively to obtain difference data of the language model; and
performing targeted optimization on the language model according to the difference data.
8. An evaluation device of a language model, comprising:
a guide text generation module for generating a guide text for inputting a language model based on a target data set, wherein the guide text comprises a plurality of predicted text sequences or prompt messages;
a predicted text acquisition module for processing the guide text by using the language model to determine a predicted text of the language model; and
a performance calculation module for comparing the predicted text with the reference text to generate a target dimension score, wherein the target dimension score is used for representing the performance of the language model in the evaluation category corresponding to the target data set.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method of evaluating a language model according to any one of claims 1 to 7.
10. A readable storage medium, characterized in that the readable storage medium stores a computer program, which is adapted to be loaded by a processor for performing the method of evaluating a language model according to any one of claims 1 to 7.
CN202310519580.4A 2023-05-09 2023-05-09 Evaluation method and device of language model, electronic equipment and storage medium Pending CN117093459A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310519580.4A CN117093459A (en) 2023-05-09 2023-05-09 Evaluation method and device of language model, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310519580.4A CN117093459A (en) 2023-05-09 2023-05-09 Evaluation method and device of language model, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117093459A true CN117093459A (en) 2023-11-21

Family

ID=88777742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310519580.4A Pending CN117093459A (en) 2023-05-09 2023-05-09 Evaluation method and device of language model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117093459A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117633225A (en) * 2023-11-30 2024-03-01 北京智谱华章科技有限公司 Alignment evaluation method for Chinese large language model
CN117633225B (en) * 2023-11-30 2024-05-28 北京智谱华章科技有限公司 Alignment evaluation method for Chinese large language model
CN117608997A (en) * 2024-01-15 2024-02-27 阿里云计算有限公司 Evaluation method, classification evaluation method, sorting evaluation method and sorting evaluation device
CN117608997B (en) * 2024-01-15 2024-04-30 阿里云计算有限公司 Evaluation method, classification evaluation method, sorting evaluation method and sorting evaluation device

Similar Documents

Publication Publication Date Title
Mansouri et al. Named entity recognition approaches
CN101539907B (en) Part-of-speech tagging model training device and part-of-speech tagging system and method thereof
CN110489760A (en) Based on deep neural network text auto-collation and device
CN117093459A (en) Evaluation method and device of language model, electronic equipment and storage medium
CN108108468A (en) A kind of short text sentiment analysis method and apparatus based on concept and text emotion
CN110348003A (en) The abstracting method and device of text effective information
CN115310551A (en) Text analysis model training method and device, electronic equipment and storage medium
CN111651569B (en) Knowledge base question-answering method and system in electric power field
Varaprasad et al. Applications and Techniques of Natural Language Processing: An Overview.
CN114722832A (en) Abstract extraction method, device, equipment and storage medium
Mahata et al. JUNLP@ Dravidian-CodeMix-FIRE2020: Sentiment classification of code-mixed tweets using bi-directional RNN and language tags
Han et al. Chinese spelling check based on sequence labeling
CN111968646A (en) Voice recognition method and device
Cho et al. Discourse component to sentence (DC2S): An efficient human-aided construction of paraphrase and sentence similarity dataset
Xiang et al. A cross-guidance cross-lingual model on generated parallel corpus for classical Chinese machine reading comprehension
Murugathas et al. Domain specific question & answer generation in tamil
CN110750967A (en) Pronunciation labeling method and device, computer equipment and storage medium
CN117493548A (en) Text classification method, training method and training device for model
Li et al. Neural-based automatic scoring model for Chinese-English interpretation with a multi-indicator assessment
CN115129859A (en) Intention recognition method, intention recognition device, electronic device and storage medium
Lee Natural Language Processing: A Textbook with Python Implementation
Black Knowledge‐based abstracting
Wang et al. MOTIF: Contextualized images for complex words to improve human reading
Amezian et al. Training an LSTM-based Seq2Seq Model on a Moroccan Biscript Lexicon
CN111814433B (en) Uygur language entity identification method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination