CN112052316A - Model evaluation method, model evaluation device, storage medium and electronic equipment - Google Patents

Model evaluation method, model evaluation device, storage medium and electronic equipment

Info

Publication number
CN112052316A
CN112052316A (application CN202010806446.9A)
Authority
CN
China
Prior art keywords
skill
text
model
recall
dialog
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010806446.9A
Other languages
Chinese (zh)
Inventor
雷士驰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Shenzhen Huantai Technology Co Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Shenzhen Huantai Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd, Shenzhen Huantai Technology Co Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202010806446.9A priority Critical patent/CN112052316A/en
Publication of CN112052316A publication Critical patent/CN112052316A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 — Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 — Querying
    • G06F 16/332 — Query formulation
    • G06F 16/3329 — Natural language query formulation or dialogue systems
    • G06F 16/3331 — Query processing
    • G06F 16/334 — Query execution
    • G06F 16/3344 — Query execution using natural language analysis
    • G06F 40/00 — Handling natural language data
    • G06F 40/30 — Semantic analysis

Abstract

The embodiments of the application disclose a model evaluation method, a model evaluation device, a storage medium and an electronic device. The method includes: when a dialog model processes a dialog text, determining at least one annotation skill and at least one recall skill corresponding to the dialog text; obtaining the dialog result output by the dialog model for the dialog text and determining at least one target skill corresponding to the dialog result; and performing a precision-recall evaluation of the dialog model based on the at least one annotation skill, the at least one recall skill and the at least one target skill. The embodiments of the application improve the accuracy of model evaluation.

Description

Model evaluation method, model evaluation device, storage medium and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a model evaluation method and apparatus, a storage medium, and an electronic device.
Background
With the rapid development of Artificial Intelligence (AI) technology, human-computer interaction has become increasingly common in daily life. Such interaction can be implemented based on a dialog model (also called a dialog system).
When the dialog model processes a dialog text input by a user, a Natural Language Understanding (NLU) layer of the dialog model semantically parses the dialog text and determines the recall skill corresponding to it; a Dialog Management (DM) layer of the dialog model then determines the skill behavior to be executed according to the recall skill and returns the execution result corresponding to that behavior; finally, the dialog model outputs a resource (namely, a dialog result) based on the execution result.
In practice, evaluating the dialog model is often required, for example assessing whether the output result of the dialog model is accurate or whether the recall skills of the dialog model are accurate.
Disclosure of Invention
The embodiments of the application provide a model evaluation method, a model evaluation device, a storage medium and an electronic device, which can improve the accuracy of model evaluation. The technical solutions of the embodiments of the application are as follows:
in a first aspect, an embodiment of the present application provides a model evaluation method, where the method includes:
when a dialog model processes a dialog text, determining at least one annotation skill and at least one recall skill corresponding to the dialog text;
obtaining the dialog result output by the dialog model for the dialog text, and determining at least one target skill corresponding to the dialog result;
performing a precision-recall evaluation of the dialog model based on the at least one annotation skill, the at least one recall skill and the at least one target skill.
In a second aspect, an embodiment of the present application provides a model evaluation apparatus, including:
a skill determination module, configured to determine at least one annotation skill and at least one recall skill corresponding to a dialog text when the dialog model processes the dialog text;
a result acquisition module, configured to obtain the dialog result output by the dialog model for the dialog text, and determine at least one target skill corresponding to the dialog result;
a precision-recall evaluation module, configured to perform a precision-recall evaluation of the dialog model based on the at least one annotation skill, the at least one recall skill and the at least one target skill.
In a third aspect, embodiments of the present application provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the above-mentioned method steps.
In a fourth aspect, an embodiment of the present application provides an electronic device, which may include: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the above-mentioned method steps.
The beneficial effects brought by the technical solutions provided by some embodiments of the application include at least the following:
In one or more embodiments of the application, when a dialog model processes a dialog text, an electronic device determines at least one annotation skill and at least one recall skill corresponding to the dialog text, obtains the dialog result output by the dialog model for the dialog text, determines at least one target skill corresponding to the dialog result, and performs a precision-recall evaluation of the dialog model based on the at least one annotation skill, the at least one recall skill and the at least one target skill. By labeling the dialog text with at least one annotation skill, fully recalling the at least one recall skill obtained from semantic understanding of the dialog text at the processing stage of the dialog model, determining at least one target skill from the dialog result, and then performing a precision-recall evaluation of the dialog model, the low evaluation accuracy of the single-label NLU precision-recall evaluation system in the related art can be avoided. Because multiple annotation skills can be labeled against all recall skills, the evaluation accurately covers the dialog model at the level of text semantic understanding; and because the "at least one target skill" corresponds to the output of the dialog result, the evaluation also covers the ranking layer used when the dialog model executes the recalled skills. The granularity of the evaluation dimensions is thereby refined, and the accuracy of model evaluation is improved.
Drawings
To describe the technical solutions in the embodiments of the application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the application, and a person skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a model evaluation method provided in an embodiment of the present application;
fig. 2 is a schematic view of a scenario in which a dialog system processes a dialog text according to a model evaluation method provided in an embodiment of the present application;
fig. 3 is a scene schematic diagram of dialog management corresponding to a dialog system related to a model evaluation method according to an embodiment of the present application;
FIG. 4 is a schematic flow chart diagram of another model evaluation method provided in the embodiments of the present application;
fig. 5 is a scene schematic diagram of evaluation set training related to a model evaluation method provided in an embodiment of the present application;
fig. 6 is a schematic diagram of a recall skill decomposition scenario involved in the model evaluation method provided in an embodiment of the present application;
fig. 7 is a schematic diagram of an annotation skill decomposition scenario involved in the model evaluation method provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of skill accuracy statistics involved in the model evaluation method provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of skill recall rate statistics involved in the model evaluation method provided in an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a model evaluation apparatus according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a skill training module provided in an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a skill annotation unit provided in an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a precision-recall evaluation module provided in an embodiment of the present application;
FIG. 14 is a schematic structural diagram of another model evaluation apparatus provided in an embodiment of the present application;
fig. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description of the present application, it is to be understood that the terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. It is also noted that, unless explicitly stated or limited otherwise, "including" and "having" and any variations thereof are intended to cover non-exclusive inclusions. For example, a process, method, system, article or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may include other steps or elements not listed or inherent to such process, method, article or apparatus. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art according to the specific situation. Further, in the description of the present application, "a plurality" means two or more unless otherwise specified. "And/or" describes the association relationship of associated objects and means that three relationships are possible; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
In the related art, a dialog system is composed of skills in multiple domains; for example, music, system settings, alarm clock and weather are all skills in different domains. The dialog system usually relies on the NLU to identify a single explicit recall skill for the dialog text input by the user, the DM performs the subsequent dialog processing according to the recall skill identified by the NLU, and the execution result corresponding to that explicit recall skill is finally taken as the dialog result of the dialog model.
When evaluating the dialog model, a precision-recall evaluation is usually performed based on a single-label NLU precision-recall evaluation system. In that system, a unique annotation skill (which can also be understood as a unique central-control result) is set for a dialog text (query), the dialog text is labeled with this unique label, and the evaluation is then performed jointly against the unique recall skill identified by the NLU. The effect of this evaluation approach is too limited, so the accuracy of the model evaluation is not high.
On the one hand, this evaluation approach can meet the requirement when the data boundary of each skill of the dialog system is clear (that is, when the semantic boundary used for annotating the dialog text is clear). In practice, however, as skill construction for the dialog model accelerates and the number of skills grows, the boundaries between skills become blurred, so the NLU usually does not have enough information to decide whether a text belongs to one specific skill; this is particularly true for question answering, knowledge graph (KG) and encyclopedia skills. That is, the same dialog text corresponds to multiple recall skills at the NLU semantic understanding stage, and a single-label precision-recall evaluation system alone can hardly delimit accurately the semantic boundaries of all dialog texts corresponding to a large number of recall skills. For example, the dialog text "mermaid" belongs to the music skill, the video skill and the story skill at the same time; in this case the recall skill of the dialog text "mermaid" cannot be precisely determined, so the dialog model cannot be evaluated;
On the other hand, with this evaluation approach, one dialog text usually corresponds to multiple recall skills at the NLU semantic processing stage, but only one recall skill is usually output at that stage, so the model effect of the dialog model's intent-understanding layer and ranking decision layer cannot be evaluated based on the existing single-label NLU precision-recall evaluation system.
the present application will be described in detail with reference to specific examples.
In one embodiment, as shown in fig. 1, a model evaluation method is proposed. The method can be implemented by means of a computer program and can run on a model evaluation device based on the von Neumann architecture. The computer program may be integrated into an application or may run as an independent tool-type application.
The model evaluation device may be an electronic device with a specific model evaluation function, and the electronic device includes but is not limited to: a server, a wearable device, a handheld device, a personal computer, a tablet, an in-vehicle device, a smartphone, a computing device, or other processing device connected to a wireless modem, and so forth. Electronic devices in different networks may be called different names, such as: user equipment, access terminal, subscriber unit, subscriber station, mobile station, remote terminal, mobile device, user terminal, wireless communication device, user agent or user equipment, cellular telephone, cordless telephone, Personal Digital Assistant (PDA), electronic device in a 5G network or future evolution network, and the like.
Specifically, the model evaluation method comprises the following steps:
step S101: when the dialogue model processes the dialogue text, at least one marking skill and at least one recalling skill corresponding to the dialogue text are determined.
The dialogue model may be understood as an intelligent assistant (e.g. an intelligent voice assistant), which may also be referred to as a dialogue system, a dialogue platform, a dialogue service, etc. in some embodiments, in general, a dialogue model (e.g. an intelligent assistant) may support tens or even hundreds of domain skills, such as music on demand skills, movie playing skills, date query skills, alarm clock skills, weather skills, etc.
The "skill" referred to in the embodiments of the present application refers to a service or a function that can be implemented by the dialogue model or the dialogue system, such as ordering, buying tickets, playing music, and the like.
The dialog text is generally the text content to be semantically recognized generated in the dialog scene, such as: the electronic device may collect dialog speech currently spoken by a user and perform Automatic Speech Recognition (ASR) processing on the dialog speech to text-convert the dialog speech to obtain dialog text entered by the user for the dialog system. For another example: the electronic device can collect the text content input by the user on the dialog interface corresponding to the dialog system, and at this time, the dialog system can use the text content as the dialog text.
For convenience of explanation, the dialog model according to the present embodiment is described as follows:
A dialog (or chat) handled by the dialog model, as shown in fig. 2, which is a schematic view of a dialog system processing a dialog text (query), generally involves at least: the user, text recall, Dialog Management (DM), and configuration ranking (of skill execution results). In a scenario involving the dialog model:
The user initiates a dialog by voice or by entering text; when the input is voice, Automatic Speech Recognition (ASR) converts the recognized speech into a dialog text. The dialog model performs Natural Language Understanding (NLU) on the user's dialog text input to parse out the user intent of the dialog text (which can also be understood as a dialog entity). An intent recall is then made according to the user intent (or dialog entity), which can be understood as follows: the dialog system determines the recalled "model behaviors or model skills to be performed", i.e., the recall skills, based on the user intent obtained from the semantic understanding of the dialog text (which can also be understood as the NLU semantic parsing result). Furthermore, in the embodiments of the present application there is usually at least one recall skill, i.e., the dialog system may determine at least one recalled skill based on the user intent.
In the embodiments of the present application, in order to evaluate the dialog model accurately, an evaluation of the intent-understanding dimension can be performed at the semantic-understanding level during text recall; it can be understood that intent understanding generally reflects the effect of the semantic-understanding algorithm in the dialog model. Specifically, an evaluation mechanism may be set for skill recall to characterize the intent-understanding dimension, and the precision-recall status of the dialog model's skill recall may be determined based on the skill recall result in combination with evaluation indicators.
Furthermore, in the process of creating and training the dialog model, skill annotation can be performed in advance on at least one class of dialog texts with corresponding or similar semantics, that is, a given dialog text (query), or a semantically equivalent class of dialog texts, is labeled with at least one annotation skill. An annotation skill can be understood as an execution skill explicitly set in advance for the corresponding dialog text. The at least one annotation skill of each dialog text can form a text annotation skill set, so that the electronic device can conveniently determine the at least one annotation skill corresponding to a dialog text by querying the text annotation skill set in real time; together with the determined at least one recall skill, this enables a precision-recall evaluation of the dialog model at the intent-understanding level.
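For illustration only (not part of the patent text), a text annotation skill set can be sketched as a simple multi-label mapping from a dialog text to its pre-assigned annotation skills; all names and example entries below are assumptions.

```python
# Minimal sketch of a text annotation skill set: a multi-label mapping from a
# dialog text (query) to the annotation skills assigned to it in advance.
# All names and example entries are illustrative assumptions, not from the patent.
from typing import Dict, Set

text_annotation_skill_set: Dict[str, Set[str]] = {
    "mermaid": {"music", "video", "story"},      # one query, several annotation skills
    "set an alarm for 7 am": {"alarm"},
    "how is the weather tomorrow": {"weather"},
}

def get_annotation_skills(dialog_text: str) -> Set[str]:
    """Return the annotation skills labeled in advance for a dialog text."""
    return text_annotation_skill_set.get(dialog_text.lower(), set())

print(get_annotation_skills("Mermaid"))  # {'music', 'video', 'story'}
```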
Step S102: obtain the dialog result output by the dialog model for the dialog text, and determine at least one target skill corresponding to the dialog result.
The dialog result is the resource result finally output by the dialog model; when outputting the result, the dialog model may perform resource integration or resource aggregation on the execution results corresponding to the at least one target skill taken as reference, so as to generate the final output corresponding to the current dialog text. The target skills are the skills finally taken as reference by the dialog model during the configuration ranking of the skill execution results, that is, the skills finally determined by the dialog model through a skill decision (such as skill ranking, skill recommendation or skill popularity) over the at least one recall skill. For example, if the dialog model obtains 6 recall skills at the text recall stage and finally selects the execution results of 3 of those recall skills for outputting the dialog result, those 3 recall skills are the target skills.
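As a hedged illustration of how target skills might be selected from the recall skills by such a skill decision (the patent names ranking, recommendation and popularity as factors but does not prescribe a formula), the following sketch scores each recalled skill with assumed weights and keeps the top-k:

```python
# Hypothetical sketch: choose target skills from recall skills by a weighted score
# over decision factors (ranking, recommendation, popularity). Weights are assumed.
def select_target_skills(recall_skills, scores, k=3):
    """recall_skills: list of skill names; scores: dict skill -> factor dict."""
    def score(skill):
        f = scores[skill]
        return 0.5 * f["ranking"] + 0.3 * f["recommendation"] + 0.2 * f["popularity"]
    return sorted(recall_skills, key=score, reverse=True)[:k]

scores = {
    "music": {"ranking": 0.9, "recommendation": 0.7, "popularity": 0.8},
    "video": {"ranking": 0.6, "recommendation": 0.8, "popularity": 0.9},
    "story": {"ranking": 0.4, "recommendation": 0.3, "popularity": 0.2},
    "chat":  {"ranking": 0.2, "recommendation": 0.1, "popularity": 0.5},
}
print(select_target_skills(list(scores), scores, k=2))  # ['music', 'video']
```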
In a possible implementation, the dialog model mentioned in step S102 is taken as an example.
After the electronic device determines at least one recall skill through the dialog model, it may determine, through a Dialog Management (DM) mechanism, which skill domain each recall skill enters, and obtain the dialog result of the corresponding skill based on at least one skill instance under that recall skill. As shown in fig. 3, which is a schematic diagram of the dialog management scenario of the dialog system, the electronic device distributes the dialog text according to the at least one recall skill through the background central-control service of the dialog model; that is, the dialog text is input, through the DM mechanism, into the skill domain corresponding to each recall skill for processing. Usually the dialog text enters the skill domains of several recall skills, such as the "navigation skill domain", the "multimedia domain" and the "chatting skill domain" in fig. 3, and the user's dialog state is configured as the multi-turn dialog state corresponding to the recall skills of this dialog text. The instances of, for example, the navigation skill domain are matched based on the at least one skill instance under each recall skill (e.g., composing the dialog in combination with the instances), and during instance matching a pre-trained Artificial Intelligence (AI) model corresponding to each recall skill is invoked for processing; for example, the navigation skill corresponds to a pre-trained navigation skill model in the model pool, so that the AI model can output, for each recall skill, the result of executing the dialog on the "dialog text". Furthermore, in practical applications the AI model corresponding to a skill may generally exist in the form of a skill service, and access to the skill service may be implemented through a unified skill data access portal (such as a service interface), through which, for example, third-party skill services and conventional skill services may be invoked. A third-party skill service may be a skill configured directly by the user, which may be triggered by a particular user speech input scenario or particular user speech input content. Access to third-party skill services improves the extensibility of the semantic decision dimension to some extent, and the skill domain corresponding to such a skill can be recalled through the DM, directly triggering a service call based on that skill domain. It should be noted that the dialog model subsequently collects the individual dialog results and outputs a dialog result corresponding to multiple recall skills; that is, in fig. 3, the skill execution result of each recall skill is returned through an RPC channel (the software interface of the dialog model's online service), namely the "multi-skill result" in fig. 3.
Specifically, the electronic device controls the dialog model to perform configuration ranking based on the returned "multi-skill result", that is, to sort and filter the "multi-skill result" before outputting it. In a specific implementation, during the configuration ranking of the skill execution results, a skill decision may be made over the "at least one recall skill" (for example, ranking the skill results according to factors such as the skill ranking dimension, the skill recommendation dimension and the skill popularity dimension), the target skills to be taken as reference are finally determined, and resource integration is then performed on the skill results corresponding to those target skills to obtain the dialog result corresponding to the dialog text; for example, the skill results of several target skills are composed into the same window card (i.e., the dialog result) and presented on the display interface. In the embodiments of the present application, the electronic device may identify the content of the dialog result output by the dialog model and determine the different skill portions composing the dialog result, thereby obtaining the at least one target skill corresponding to those portions. Optionally, the electronic device may also control the dialog model to mark the target skills taken as reference when the dialog model outputs the dialog result after configuring and ranking the "multi-skill result", so that the electronic device can conveniently obtain the marked target skills.
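A minimal sketch (with hypothetical skill services standing in for the RPC calls to skill domains) of distributing a dialog text to the skill domain of each recall skill and collecting the "multi-skill result":

```python
# Illustrative sketch (assumed interfaces): dispatch a dialog text to the skill
# service of each recall skill and collect the "multi-skill result" before ranking.
from typing import Callable, Dict, List

# Hypothetical skill services; in practice these would be RPC calls to skill domains.
def navigation_service(text: str) -> dict:
    return {"skill": "navigation", "result": f"route for '{text}'"}

def music_service(text: str) -> dict:
    return {"skill": "music", "result": f"playlist for '{text}'"}

SKILL_SERVICES: Dict[str, Callable[[str], dict]] = {
    "navigation": navigation_service,
    "music": music_service,
}

def collect_multi_skill_results(dialog_text: str, recall_skills: List[str]) -> List[dict]:
    """Send the dialog text to each recalled skill domain and gather execution results."""
    return [SKILL_SERVICES[s](dialog_text) for s in recall_skills if s in SKILL_SERVICES]

print(collect_multi_skill_results("play road trip songs", ["music", "navigation"]))
```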
Step S103: perform a precision-recall evaluation of the dialog model based on the at least one annotation skill, the at least one recall skill and the at least one target skill.
Specifically, the dialog texts (which can be understood as sample texts) input to the dialog model are skill-annotated in advance, so that the dialog model can be evaluated accurately. Further, an evaluation of the intent-understanding dimension is performed at the level of the dialog model's semantic understanding of the dialog text; it can be understood that intent understanding generally reflects the effect of the semantic-understanding algorithm in the dialog model. Specifically, an evaluation mechanism may be set for skill recall to characterize the intent-understanding dimension, and the precision-recall status of the dialog model's skill recall may be determined based on the skill recall result in combination with evaluation indicators. In a specific implementation, a large dialog data set of dialog texts is annotated in advance with multi-skill labels on each dialog text (query), i.e., the multiple sample skills that a dialog text may correspond to are all labeled for it. Then, when the dialog model processes the dialog text, precision-recall evaluation parameters such as the skill accuracy and the skill recall rate of the dialog model at intent understanding can be determined based on the at least one recall skill actually recalled when understanding a given dialog text and the at least one annotation skill labeled for it in advance, thereby reflecting the effect of skill recall at the intent-understanding stage of the dialog model.
Specifically, after the dialog texts (which can be understood as sample texts) input to the dialog model have been annotated in advance, and because the dialog model outputs the execution results corresponding to the "at least one target skill", whose determination is closely tied to the configuration ranking of the skill execution results by the dialog model, it can be understood that the electronic device can also perform a precision-recall evaluation of the configuration ranking process based on the "at least one target skill", that is, evaluate the effect of the dialog model's ranking policy. On the one hand, the skill decision parameters corresponding to the finally output skill execution results of the at least one target skill can be compared with the annotation information labeled in advance for the dialog text (such as a standard skill weight or a standard skill order); the skill decision parameters can be understood as decision information such as the number of recall skills filtered out, the actual ranking parameters of the recall skills, the popularity values of the recall skills and the recommendation values of the recall skills. On the other hand, during the configuration ranking of the skill execution results, the ranking information of the "at least one recall skill", of the "at least one annotation skill" and of the "at least one target skill" can be obtained and a comprehensive skill ranking parameter determined; for example, the recall ranking accuracy in the skill recall process and the output accuracy in the target skill output process can be computed based on the ranking information corresponding to the annotation skills.
It can be understood that, in the embodiments of the present application, the precision-recall evaluation of the dialog model is implemented based on the precision-recall evaluation parameters characterizing the intent-understanding effect of the dialog model and the ranking policy parameters characterizing the configuration ranking of the dialog model. The precision-recall effect at the algorithm level of the dialog model, the annotation effect of the dialog text labels and the ranking effect of the skill results are fed back through the precision-recall evaluation parameters and the ranking policy parameters.
In the embodiments of the present application, when the dialog model processes a dialog text, the electronic device determines at least one annotation skill and at least one recall skill corresponding to the dialog text, obtains the dialog result output by the dialog model for the dialog text, determines at least one target skill corresponding to the dialog result, and performs a precision-recall evaluation of the dialog model based on the at least one annotation skill, the at least one recall skill and the at least one target skill. By labeling the dialog text with at least one annotation skill, fully recalling the at least one recall skill obtained from semantic understanding of the dialog text at the processing stage of the dialog model, determining at least one target skill from the dialog result, and then performing a precision-recall evaluation of the dialog model, the low evaluation accuracy of the single-label NLU precision-recall evaluation system in the related art can be avoided. Because multiple annotation skills can be labeled against all recall skills, the evaluation accurately covers the dialog model at the level of text semantic understanding; and because the "at least one target skill" corresponds to the output of the dialog result, the evaluation also covers the ranking layer used when the dialog model executes the recalled skills. The granularity of the evaluation dimensions is thereby refined, and the accuracy of model evaluation is improved.
Referring to fig. 4, fig. 4 is a schematic flowchart of another embodiment of a model evaluation method according to the present application. Specifically, the method comprises the following steps:
step S201: and acquiring a newly added skill evaluation set, wherein the newly added skill evaluation set comprises at least one newly added tagging skill and at least one first dialog text corresponding to each newly added tagging skill.
The newly added skill evaluation set can be understood as an evaluation set or a corpus set corresponding to a newly added skill in the dialogue model, and is used for inputting the evaluation set or the corpus set into the dialogue model and performing training and labeling on a reference skill evaluation set corresponding to the dialogue model, where the training and labeling can be generally understood as performing multi-label or multi-skill labeling on a certain dialogue text. The newly added skill evaluation set comprises at least one newly added tagging skill and at least one first dialog text corresponding to each newly added tagging skill.
The newly added annotation skill can be understood as a newly added service or a newly added function which can be realized by a conversation model or a conversation system, such as ordering, buying tickets, playing music and the like.
The first dialog text can be understood as the dialog evaluation text corresponding to the newly added annotation skill.
Step S202: train and annotate the reference skill evaluation set corresponding to the dialog model based on the newly added skill evaluation set and the dialog model.
Specifically, the electronic device drives the dialog model to perform at least one dialog test in the evaluation environment of the dialog model based on the newly added skill evaluation set, determines the conflicting dialog texts fed back during the test stage (such as dialog texts that additionally recall other skills, dialog texts recalled by non-newly-added skills, and conflicting dialog texts already annotated in the evaluation set), and then performs the corresponding skill annotation according to the conflict type of each conflicting dialog text. It should be noted that a conflicting dialog text is usually labeled with multiple annotation skills. The specific implementation process is as follows:
1. The electronic device obtains the reference skill evaluation set corresponding to the dialog model and adds the newly added skill evaluation set to it. The reference skill evaluation set is the evaluation set corresponding to trained, conflict-free skills or conflict-free texts (it can also be understood as a conflict-free corpus), and the reference skill evaluation set corresponding to the dialog model includes at least one annotation skill and at least one second dialog text corresponding to each annotation skill. In the embodiments of the present application, reference may be made to fig. 5, which is a schematic diagram of an evaluation set training scenario. The newly added skill evaluation set is added to the conflict-free reference skill evaluation set, that is, the newly added annotation skills in the newly added skill evaluation set are taken as annotation skills and the first dialog texts corresponding to them are taken as second dialog texts, thereby fusing the newly added skill evaluation set with the reference skill evaluation set; evaluation processing is then performed based on the model evaluation environment (such as a V5 model evaluation environment) corresponding to the fused reference skill evaluation set together with the dialog model.
2. The electronic device evaluates the reference skill evaluation set with the dialog model to obtain the skill conflict information related to the newly added skill; that is, the skill conflict information involving the newly added skill in the reference skill evaluation set can be obtained through evaluation. The skill conflict information is the conflict text information, conflict skill information and other skill conflict information generated after the newly added skill evaluation set is added, such as dialog texts that additionally recall other skills, dialog texts recalled by non-newly-added skills, and conflicting dialog texts already annotated in the evaluation set.
3. Based on the skill conflict information obtained from the evaluation processing, the electronic device determines the conflicting target dialog texts among the second dialog texts of the reference skill evaluation set to which the newly added skill evaluation set has been added.
4. Skill annotation is performed on the target dialog texts among the second dialog texts.
The specific skill annotation may be performed as follows:
1) determining, among the second dialog texts and based on the skill conflict information, the target dialog texts that conflict with the newly added skill and the conflict type corresponding to each target dialog text;
Specifically, at the evaluation processing stage, the electronic device determines, based on the obtained skill conflict information, the conflicting target dialog texts and the conflict type of each of them among the second dialog texts of the reference skill evaluation set to which the newly added skill evaluation set has been added.
For example: a target dialog text a is a dialog text that additionally recalls the newly added skill (i.e., not a first dialog text); it can be understood as a second dialog text originally present in the reference skill evaluation set that additionally recalls the newly added skill during the evaluation processing.
For example: a target dialog text b is a dialog text recalled by annotation skills other than the newly added skill; in general it may have been a first dialog text before being added, and it can be understood as corresponding, in terms of semantic understanding, to both the newly added skill and a reference skill.
For example: a conflicting dialog text c (a target dialog text) already annotated in the newly added skill evaluation set, i.e., a dialog text determined before the annotation evaluation to conflict with an annotation skill in the reference skill evaluation set.
Then, at the evaluation processing stage, the electronic device may subdivide the conflict situations of all target dialog texts based on the conflicting target dialog texts together with the specific conflict situations and the conflicting annotation skills. For example, target dialog text a may be taken as a first type, which can be understood as a dialog text that needs to be labeled with multiple labels (i.e., the corresponding text annotation manner); target dialog text b may be taken as a second type, which can be understood as a text that needs text coverage (i.e., the corresponding text annotation manner). Further, the electronic device may also compute a degree of conflict for a target dialog text; for example, if the similarity between the target dialog text and one or more second dialog texts is high, which causes annotation skill conflicts between the target dialog text and several second dialog texts, the target dialog text may be taken as a third type, which can be understood as a text that needs to be discarded.
2) performing multi-skill annotation on the target dialog text according to the text annotation manner corresponding to its conflict type.
Different conflict types correspond to different text annotation manners. Taking the above example, the text annotation manner corresponding to the first type is: labeling the target dialog text with multiple labels or multiple skills, i.e., additionally labeling it with the newly added skill, so that the target dialog text carries multiple annotation skills. The text annotation manner corresponding to the second type is: performing text coverage or skill coverage on the target dialog text, where text coverage may mean covering the target dialog text with the conflicting dialog text, and skill coverage may mean overwriting the annotated skill of the target dialog text with the new skill. The text annotation manner corresponding to the third type is: discarding the target dialog text, and so on. In a specific implementation, the electronic device performs multi-skill annotation on the target dialog text according to the text annotation manner corresponding to its conflict type. After annotation is completed, the reference skill evaluation set can be synchronized online, or the evaluation set can be updated for terminals subscribed to the corresponding skill update service.
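As a rough sketch of the three text annotation manners above (multi-label, coverage, discard), under the assumption that the evaluation set is a mapping from dialog texts to annotation-skill sets:

```python
# Hypothetical sketch: update the reference skill evaluation set according to the
# conflict type of a target dialog text. Structures and type names are assumed.
from typing import Dict, Set

def handle_conflict(evaluation_set: Dict[str, Set[str]],
                    target_text: str,
                    conflict_type: str,
                    new_skill: str) -> None:
    if conflict_type == "multi_label":   # first type: add the new skill as an extra label
        evaluation_set.setdefault(target_text, set()).add(new_skill)
    elif conflict_type == "coverage":    # second type: overwrite the labels with the new skill
        evaluation_set[target_text] = {new_skill}
    elif conflict_type == "discard":     # third type: drop the highly ambiguous text
        evaluation_set.pop(target_text, None)

ref_set = {"mermaid": {"music"}}
handle_conflict(ref_set, "mermaid", "multi_label", "video")
print(ref_set)  # {'mermaid': {'music', 'video'}} (set order may vary)
```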
Step S203: when the dialog model processes a dialog text, determine at least one annotation skill and at least one recall skill corresponding to the dialog text.
Specifically, refer to step S101, which is not described herein again.
Step S204: obtain the dialog result output by the dialog model for the dialog text, and determine at least one target skill corresponding to the dialog result.
Specifically, refer to step S102, which is not described herein again.
Step S205: generate skill precision-recall evaluation information for the dialog model based on the at least one annotation skill and the at least one recall skill.
The skill precision-recall evaluation information can be understood as the evaluation of the intent-understanding dimension of the dialog model's semantic understanding of the dialog text. The generated evaluation information generally reflects the effect of the semantic-understanding algorithm in the dialog model; for example, an evaluation mechanism can be set for skill recall to characterize the intent-understanding dimension, and the precision-recall status of the dialog model's skill recall can be determined based on the skill recall result in combination with evaluation indicators.
Specifically, an evaluation mechanism may be set for skill recall to characterize the intent-understanding dimension, and the precision-recall status of the dialog model's skill recall may be determined based on the skill recall result in combination with evaluation indicators. In a specific implementation, a large dialog data set of dialog texts is annotated in advance with multi-skill labels on each dialog text (query), i.e., the multiple sample skills that a dialog text may correspond to are all labeled for it; then, when the dialog model processes the dialog text, precision-recall evaluation parameters such as the skill accuracy and the skill recall rate of the dialog model at intent understanding can be determined based on the at least one recall skill actually recalled during intent understanding of the dialog text and the at least one annotation skill labeled for it in advance, thereby reflecting the effect of skill recall at the intent-understanding stage of the dialog model. The specific steps are as follows:
1. Perform recall skill decomposition on the at least one recall skill corresponding to the dialog text to generate a recall decomposition result.
Recall skill decomposition can be understood as follows: when the dialog model is evaluated on the intent-understanding side for a dialog text, the multiple recall skills determined for that dialog text are split out one by one. As shown in fig. 6, which is a schematic diagram of decomposing the classified skills (i.e., the recall skills): when the dialog model performs intent understanding on the dialog text query A, 2 recall skills are determined, i.e., the recall skills of query A are "chat" and "music"; recall skill decomposition is then performed on these two recall skills, and the decomposed recall decomposition result can be expressed as: recall skill 1 of query A is "chat" and recall skill 2 of query A is "music"; similarly, recall skill 1 of query B is "encyclopedia", recall skill 2 of query B is "chat", and recall skill 1 of query C is "music".
2. The electronic device may perform annotation skill decomposition on the at least one annotation skill corresponding to the dialog text, and generate an annotation decomposition result.
Annotation skill decomposition can be understood as splitting all annotation skills corresponding to the dialog text one by one and, in combination with the recall decomposition result, generating the corresponding annotation decomposition result, as shown in fig. 7, which is a schematic diagram of decomposing the annotation skills.
Taking query A as an example: query A corresponds to 2 recall skills, "chat" and "music", and to 2 annotation skills, "music" and "media-qa". All annotation skills corresponding to the dialog text are split one by one and, combined with the recall decomposition result, 4 (2 × 2) labeled events (also called the annotation decomposition result) can be generated, namely labeled event 1: recall skill 1 of query A, "chat", with annotation skill 1 of query A, "music"; labeled event 2: recall skill 1 of query A, "chat", with annotation skill 2 of query A, "media-qa"; labeled event 3: recall skill 2 of query A, "music", with annotation skill 1 of query A, "music"; labeled event 4: recall skill 2 of query A, "music", with annotation skill 2 of query A, "media-qa".
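A minimal sketch of this 2 × 2 decomposition (the "match" flag, indicating whether the recalled skill equals the annotation skill in the pair, is an assumption added for illustration):

```python
# Minimal sketch: decompose one query's recall skills and annotation skills into
# (recall skill, annotation skill) labeled events, as in the query A example above.
from itertools import product

recall_skills = ["chat", "music"]          # recall decomposition result for query A
annotation_skills = ["music", "media-qa"]  # annotation skills labeled for query A

labeled_events = [
    {"recall": r, "annotation": a, "match": int(r == a)}
    for r, a in product(recall_skills, annotation_skills)
]
for event in labeled_events:
    print(event)
# 2 x 2 = 4 labeled events, matching labeled events 1-4 described above.
```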
3. Generate the skill accuracy of the dialog text according to the recall decomposition result, and generate the skill recall rate of the dialog text according to the annotation decomposition result.
The skill accuracy can be expressed as: the ratio of correctly predicted events to all predicted events, that is, the ratio of the number of correct skill predictions to the number of all skill prediction events. The statistics of all skill prediction events can be as shown in fig. 8. Taking the dialog text query A as an example, the skill prediction events can be expressed as: predicted event 1: for query A, recall skill "chat" against annotation skills "music" and "media-qa"; this event is predicted incorrectly and marked 0. Predicted event 2: for query A, recall skill "music" against annotation skills "music" and "media-qa"; this event is predicted correctly and marked 1. ...
The skill accuracy is counted as follows:
the skill "chat" is recalled 2 times, of which 1 recall is correct, so its skill accuracy is 50%;
the skill "music" is recalled 2 times, of which 2 recalls are correct, so its skill accuracy is 100%;
the skill "encyclopedia" is recalled 1 time, of which 1 recall is correct, so its skill accuracy is 100%.
The skill recall rate is usually defined with respect to the samples, that is, with respect to the annotation skills, and can be expressed as: the ratio of positive-class sample prediction events to all sample prediction events, where all sample prediction events consist of positive-class sample prediction events and negative-class sample prediction events. A positive-class sample prediction event indicates how many positive samples are predicted correctly; in a negative-class sample prediction event, an originally positive sample is predicted as negative.
The statistics of all sample prediction events can be as shown in fig. 9. Taking the dialog text query C as an example, the sample prediction events can be expressed as: predicted event 5: for query C, recall skill "chat" against annotation skill "music"; the sample event is predicted incorrectly (i.e., the annotation and the prediction are inconsistent), so it is judged a negative-class labeled event and marked 0. Predicted event 6: for query C, recall skill "music" against annotation skill "music"; the sample event is predicted correctly (i.e., the annotation and the prediction are consistent) and marked 1. ...
Referring to FIG. 9, the skill recall rate is counted as follows:
the skill "chat" has 2 true instances, of which 1 is correct, so its skill recall rate is 50%;
the skill "music" has 2 true instances, of which 2 are correct, so its skill recall rate is 100%;
the skill "encyclopedia" has 1 true instance, of which 1 is correct, so its skill recall rate is 100%;
the skill "media-qa" has 1 true instance, of which 0 are correct, so its skill recall rate is 0%.
In the embodiments of the present application, the skill recall rate and the skill accuracy are used to judge the quality of the dialog model's skill recall results: the skill accuracy measures how precise the dialog system's skill recalls are, and the skill recall rate measures how completely the dialog system's skills are recalled.
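The two metrics can be sketched as follows (illustrative code only; the annotation skills assumed here for query B and query C are chosen so that the per-skill totals come out as in the example statistics above):

```python
# Illustrative sketch: per-skill accuracy (how precise the recalls are) and
# per-skill recall rate, computed from (recall skills, annotation skills) pairs.
# The annotation sets for query B and query C are assumed for illustration.
from collections import defaultdict

records = [  # (recall skills, annotation skills) for each dialog text
    ({"chat", "music"}, {"music", "media-qa"}),            # query A
    ({"encyclopedia", "chat"}, {"encyclopedia", "chat"}),  # query B (assumed annotations)
    ({"music"}, {"music", "chat"}),                        # query C (assumed annotations)
]

recalled = defaultdict(int)        # times each skill was recalled
correct_recall = defaultdict(int)  # recalls that hit an annotation skill
annotated = defaultdict(int)       # true instances of each annotation skill
hit = defaultdict(int)             # annotated instances that were actually recalled

for recall_skills, annotation_skills in records:
    for skill in recall_skills:
        recalled[skill] += 1
        correct_recall[skill] += skill in annotation_skills
    for skill in annotation_skills:
        annotated[skill] += 1
        hit[skill] += skill in recall_skills

for skill in sorted(set(recalled) | set(annotated)):
    accuracy = correct_recall[skill] / recalled[skill] if recalled[skill] else None
    recall_rate = hit[skill] / annotated[skill] if annotated[skill] else None
    print(skill, "accuracy:", accuracy, "recall rate:", recall_rate)
# chat: accuracy 0.5, recall rate 0.5; music: 1.0, 1.0; encyclopedia: 1.0, 1.0;
# media-qa: accuracy None (never recalled), recall rate 0.0
```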
Step S206: obtain ranking decision information corresponding to the at least one target skill, and generate ranking precision-recall evaluation information of the dialog model based on the ranking decision information.
The ranking decision information can be understood as follows: after the dialog texts (which can be understood as sample texts) input to the dialog model have been annotated in advance, and because the dialog model outputs the execution results corresponding to the "at least one target skill", whose determination is closely tied to the configuration ranking of the skill execution results by the dialog model, the ranking decision information relates to the evaluation of the parameters of the dialog model's ranking policy. The ranking decision information can be the number of recall skills filtered out, the actual ranking parameters of the recall skills, the popularity values of the recall skills, the recommendation values of the recall skills and so on. Specifically, after the ranking decision information is obtained, it usually corresponds to several parameters associated with the ranking decision, and the ranking precision-recall evaluation information generated for the dialog model usually reflects an evaluation result obtained by combining these ranking decision parameters. For example, the filtering probability of the execution result of a single recall skill can be computed based on the number of filtered recall skills; it can be understood that when this filtering probability is high, it indicates that the result of this recall skill is frequently filtered out, and the annotation of the dialog texts corresponding to this recall skill needs to be adjusted. As another example, a ranking similarity can be computed based on the "actual ranking parameters of the recall skills" and the "ranking parameters of the target skills"; it can be understood that when the ranking similarity is small, it indicates that the result ranking accuracy of the recall skill is low, and the ranking weight of the corresponding recall skill needs to be adjusted. As yet another example, the confidence of at least one recall skill can be computed by combining several preset dimensions of the recall skills in the ranking policy, and whether the ranking weight of the recall skill is reasonable can be fed back based on that confidence; the preset dimensions can be one or more of the knowledge-base dimension, the context-information dimension, the current-skill dimension, the reference-popularity dimension, the user-customization dimension, the skill-recommendation dimension and the terminal-type dimension corresponding to the skills in the ranking policy. After several confidence scores are determined according to the several dimensions, the comprehensive confidence of a recall skill is determined based on a weighted value of those confidence scores; when the comprehensive confidence is too low, it is fed back that the ranking weight of the recall skill is unreasonable and needs to be adjusted adaptively.
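A minimal sketch of the weighted comprehensive confidence described above (the dimension names, weights and threshold are assumptions for illustration):

```python
# Hypothetical sketch: combine per-dimension confidence scores of a recall skill
# into a comprehensive confidence with assumed weights, then flag low confidence.
DIMENSION_WEIGHTS = {           # assumed weights; they sum to 1.0
    "knowledge_base": 0.2,
    "context": 0.2,
    "current_skill": 0.2,
    "popularity": 0.15,
    "user_customization": 0.1,
    "recommendation": 0.1,
    "terminal_type": 0.05,
}

def comprehensive_confidence(dimension_scores: dict) -> float:
    return sum(DIMENSION_WEIGHTS[d] * dimension_scores.get(d, 0.0)
               for d in DIMENSION_WEIGHTS)

scores = {"knowledge_base": 0.8, "context": 0.6, "current_skill": 0.7,
          "popularity": 0.4, "user_customization": 0.5,
          "recommendation": 0.3, "terminal_type": 0.9}
conf = comprehensive_confidence(scores)
print(f"comprehensive confidence = {conf:.2f}")
if conf < 0.5:  # assumed threshold
    print("ranking weight of this recall skill may be unreasonable; adjust it")
```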
Step S207: obtain the effect feedback information corresponding to the at least one target skill.
The effect feedback information can be understood as information reflecting the precision-recall effect of the execution results of the current target skills. It is determined, after the dialog model outputs the dialog result of the dialog text, based on the user's behavior toward the multi-skill result, and can be understood as directly indicating, through that behavior, whether the target skills deviate from the skills the user actually expected for the dialog text.
The effect feedback information can be the user's click response duration, i.e., the time it takes the user to click the resource corresponding to a target skill in the dialog result;
The effect feedback information can be the user's browsing duration, i.e., the time the user spends browsing the resource corresponding to a target skill in the dialog result;
The effect feedback information can be whether the user clicks the resource corresponding to a target skill, i.e., a click feedback result. For example: the user sends a corpus (i.e., the dialog text) to the voice assistant (which can be understood as a dialog model) and, after receiving the dialog result output by the voice assistant, closes the voice assistant or closes the resource corresponding to a target skill in the dialog result. This indicates that the voice assistant's dialog result is very likely unsatisfactory to the user, or that the target skill in the dialog result is not the skill the user wanted, i.e., the dialog model very likely "answered something other than what was asked".
The effect feedback information can be the user's emotional feedback parameters toward the dialog result after it is output; if it is determined, based on the emotional feedback parameters, that the user has a negative emotion, it indicates that the target skills determined by the dialog model very likely left the user dissatisfied, i.e., the dialog system very likely "answered something other than what was asked"; that is, the target skill is not the skill the user desired.
The effect feedback information can be the user's click-through rate on the dialog result corresponding to the at least one target skill.
Step S208: when the effect feedback information does not meet the recall feedback condition corresponding to the target skill, perform a recall evaluation on the dialog model.
The recall feedback condition may be understood as a threshold or critical condition for recall evaluation set for each type of parameter in the effect feedback information (e.g., click response duration, browsing response duration, click feedback result, emotion feedback parameter). That is, when at least one type of parameter in the effect feedback information does not satisfy the recall feedback condition corresponding to that type of parameter, for example when the click response duration is greater than a duration threshold, the browsing response duration is greater than a duration threshold, or the confidence corresponding to the click feedback result is less than a confidence threshold, it is determined that a recall evaluation is to be performed on the dialog model, and feedback recall information based on the user's operation behavior is generated for the corresponding effect feedback information. The feedback recall information is generated based on the effect feedback information; in some embodiments it may include at least the effect feedback information, and may further include evaluation parameters derived from the various types of parameters and their threshold conditions, such as the similarity between the click response duration and the duration threshold, or the divergence rate between the confidence of the click feedback result and the confidence threshold. For example, the higher the resource click rate of a target skill in the dialog result, the higher the similarity between that target skill and the skill the user expected, and after the evaluation the ranking weight of the corresponding target skill may be adjusted based on the click-through rate. In some embodiments, when the recall evaluation is performed on the dialog model, the generated feedback recall information can be used to reversely optimize the process by which the model outputs the dialog result corresponding to the target skill, for example by re-labeling the recall skills, adjusting the ranking weight of the recall skills at the ranking decision layer, or adjusting the recommendation degree or popularity value of the recall skills.
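A minimal Python sketch of this threshold check is given below; the field names and threshold values are assumptions made for the example and are not specified by this application.

```python
# Illustrative sketch: deciding whether effect feedback triggers a recall evaluation.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EffectFeedback:
    click_response_s: Optional[float]    # time until the user clicked the skill's resource
    browse_response_s: Optional[float]   # browsing response duration for the resource
    clicked: bool                        # click feedback result
    negative_emotion: bool               # emotion feedback parameter, simplified to a flag

# Hypothetical recall feedback conditions (one threshold per parameter type).
MAX_CLICK_RESPONSE_S = 10.0
MAX_BROWSE_RESPONSE_S = 30.0

def needs_recall_evaluation(fb: EffectFeedback) -> bool:
    """Return True when any parameter type violates its recall feedback condition."""
    if not fb.clicked or fb.negative_emotion:
        return True  # the target skill is likely not the skill the user expected
    if fb.click_response_s is not None and fb.click_response_s > MAX_CLICK_RESPONSE_S:
        return True
    if fb.browse_response_s is not None and fb.browse_response_s > MAX_BROWSE_RESPONSE_S:
        return True
    return False
```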
In the embodiment of the application, when a dialog model processes a dialog text, an electronic device determines at least one labeling skill and at least one recall skill corresponding to the dialog text, obtains the dialog result corresponding to the dialog text output by the dialog model, determines at least one target skill corresponding to the dialog result, and performs a recall evaluation on the dialog model based on the at least one labeling skill, the at least one recall skill and the at least one target skill. By labeling the dialog text with at least one labeling skill, performing a full recall of the at least one recall skill corresponding to the semantic understanding of the dialog text at the processing stage of the dialog model, determining at least one target skill based on the dialog result, and then performing a recall evaluation on the dialog model, the low model-evaluation accuracy of the single-label NLU recall evaluation systems in the related art is avoided: because multiple labeling skills can be labeled for all of the recall skills, the evaluation of the dialog model at the text semantic-understanding level is covered accurately, and because the ranking layer with which the dialog model configures the execution results (recall skills) when outputting the dialog result is evaluated based on the corresponding at least one target skill, the granularity of the evaluation dimensions is further refined and the accuracy of the model evaluation is improved. In addition, the multi-skill labeling of the dialog text makes it possible to quickly handle conflicting skills and conflicting dialog texts when a new skill evaluation set is added to the dialog model, which improves the end-to-end experience of the dialog model and shortens its update cycle. Furthermore, by incorporating user-level feedback into the evaluation of the dialog model through the effect feedback information corresponding to the at least one target skill on the user side, the coverage of the dialog model evaluation is further improved and its evaluation effect is enhanced.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Please refer to fig. 10, which shows a schematic structural diagram of a model evaluation apparatus according to an exemplary embodiment of the present application. The model evaluation apparatus may be implemented as all or part of a device by software, hardware or a combination of the two. The apparatus 1 comprises a skill determination module 11, a result acquisition module 12 and a recall evaluation module 13.
The skill determination module 11 is configured to determine at least one tagging skill and at least one recalling skill corresponding to a dialog text when the dialog model processes the dialog text;
a result obtaining module 12, configured to obtain a dialog result corresponding to the dialog text output by the dialog model, and determine at least one target skill corresponding to the dialog result;
a recall evaluation module 13, configured to perform a recall evaluation on the dialogue model based on the at least one annotation skill, the at least one recall skill, and the at least one target skill.
Optionally, as shown in fig. 14, the apparatus 1 includes:
the skill acquisition module 14 is configured to acquire a newly added skill evaluation set, where the newly added skill evaluation set includes at least one newly added tagging skill and at least one first dialog text corresponding to each newly added tagging skill;
and the skill training module 15 is configured to train and label a reference skill evaluation set corresponding to the dialogue model based on the newly added skill evaluation set and the dialogue model.
Optionally, as shown in fig. 11, the skill training module 15 includes:
an evaluation set adding unit 151, configured to obtain a reference skill evaluation set corresponding to the dialog model, and add the newly added skill evaluation set to the reference skill evaluation set, where the reference skill evaluation set includes at least one labeling skill and at least one second dialog text corresponding to each labeling skill;
an evaluation set processing unit 152, configured to perform evaluation processing on the reference skill evaluation set according to the dialog model to obtain skill conflict information corresponding to the newly added skill;
a skill labeling unit 153, configured to label a skill of the target dialog text in each of the second dialog texts based on the skill conflict information.
Optionally, as shown in fig. 12, the skill marking unit 153 includes:
a text and type determining subunit 1531, configured to determine, based on the skill conflict information, a target dialog text that conflicts with the newly added skill and a conflict type corresponding to the target dialog text in each of the second dialog texts;
and a multi-skill labeling subunit 1532, configured to perform multi-skill labeling on the target dialog text according to the text labeling manner corresponding to the conflict type.
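Purely as an illustration of the conflict handling performed by these units, the sketch below applies multi-skill labeling to a conflicting dialog text according to its conflict type; the conflict type names and labeling rules shown are assumptions of the example, not the text labeling manners defined by this application.

```python
# Illustrative sketch: multi-skill labeling of a target dialog text by conflict type.
from typing import Dict, List

def label_conflicting_text(labels: Dict[str, List[str]],
                           target_text: str,
                           new_skill: str,
                           conflict_type: str) -> Dict[str, List[str]]:
    """Update the skill labels of a dialog text that conflicts with a newly added skill."""
    existing = labels.setdefault(target_text, [])
    if conflict_type == "identical_corpus":
        # Hypothetical type: the new skill fully matches the text, so keep both labels.
        if new_skill not in existing:
            existing.append(new_skill)
    elif conflict_type == "partial_overlap":
        # Hypothetical type: only part of the semantics overlaps; record the new skill
        # with a marker so downstream evaluation can treat it differently.
        tag = new_skill + "(secondary)"
        if tag not in existing:
            existing.append(tag)
    return labels

labels = {"play some jazz": ["music"]}
print(label_conflicting_text(labels, "play some jazz", "radio", "identical_corpus"))
# -> {'play some jazz': ['music', 'radio']}
```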
Optionally, as shown in fig. 13, the recall evaluation module 13 includes:
a skill recall subunit 131, configured to generate skill recall assessment information of the dialogue model based on the at least one annotation skill and the at least one recall skill;
a ranking recall subunit 132, configured to obtain ranking decision information corresponding to the at least one target skill, and generate, based on the ranking decision information, ranking recall evaluation information of the conversation model.
Optionally, the skill recall subunit 131 is specifically configured to:
performing recall skill decomposition on at least one recall skill corresponding to the conversation text to generate a recall decomposition result;
performing annotation skill decomposition on the at least one annotation skill corresponding to the dialog text to generate an annotation decomposition result;
and generating the skill accuracy of the dialog text according to the recall decomposition result and the annotation decomposition result, and generating the skill recall rate of the dialog text according to the recall decomposition result and the annotation decomposition result.
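As a hedged illustration, the sketch below computes a skill accuracy and a skill recall rate for one dialog text from the decomposed recall skills and annotation skills; representing the decomposition results as sets is an assumption of this example, not a requirement of the application.

```python
# Illustrative sketch: skill accuracy and skill recall rate for one dialog text.
from typing import Set, Tuple

def skill_precision_recall(recall_skills: Set[str],
                           annotation_skills: Set[str]) -> Tuple[float, float]:
    """Compare the skills recalled by the model with the labeled (annotation) skills."""
    hits = recall_skills & annotation_skills          # correctly recalled skills
    accuracy = len(hits) / len(recall_skills) if recall_skills else 0.0
    recall_rate = len(hits) / len(annotation_skills) if annotation_skills else 0.0
    return accuracy, recall_rate

# Example: a dialog text labeled with two skills, three skills recalled by the model.
print(skill_precision_recall({"music", "video", "weather"}, {"music", "audiobook"}))
# -> (0.333..., 0.5)
```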
Optionally, the apparatus 1 is specifically configured to:
obtaining effect feedback information corresponding to the at least one target skill;
and when the effect feedback information does not meet the recall feedback condition corresponding to the target skill, performing a recall evaluation on the conversation model.
It should be noted that, when the model evaluation apparatus provided in the foregoing embodiment executes the model evaluation method, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the model evaluation device and the model evaluation method provided by the above embodiments belong to the same concept, and details of implementation processes are described in the method embodiments, which are not described herein again.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the embodiment of the application, when a dialog model processes a dialog text, an electronic device determines at least one labeling skill and at least one recall skill corresponding to the dialog text, obtains the dialog result corresponding to the dialog text output by the dialog model, determines at least one target skill corresponding to the dialog result, and performs a recall evaluation on the dialog model based on the at least one labeling skill, the at least one recall skill and the at least one target skill. By labeling the dialog text with at least one labeling skill, performing a full recall of the at least one recall skill corresponding to the semantic understanding of the dialog text at the processing stage of the dialog model, determining at least one target skill based on the dialog result, and then performing a recall evaluation on the dialog model, the low model-evaluation accuracy of the single-label NLU recall evaluation systems in the related art is avoided: because multiple labeling skills can be labeled for all of the recall skills, the evaluation of the dialog model at the text semantic-understanding level is covered accurately, and because the ranking layer with which the dialog model configures the execution results (recall skills) when outputting the dialog result is evaluated based on the corresponding at least one target skill, the granularity of the evaluation dimensions is further refined and the accuracy of the model evaluation is improved. In addition, the multi-skill labeling of the dialog text makes it possible to quickly handle conflicting skills and conflicting dialog texts when a new skill evaluation set is added to the dialog model, which improves the end-to-end experience of the dialog model and shortens its update cycle. Furthermore, by incorporating user-level feedback into the evaluation of the dialog model through the effect feedback information corresponding to the at least one target skill on the user side, the coverage of the dialog model evaluation is further improved and its evaluation effect is enhanced.
An embodiment of the present application further provides a computer storage medium, where the computer storage medium may store a plurality of instructions, and the instructions are suitable for being loaded by a processor and executing the model evaluation method according to the embodiment shown in fig. 1 to 9, and a specific execution process may refer to specific descriptions of the embodiment shown in fig. 1 to 9, which is not described herein again.
The present application further provides a computer program product, where at least one instruction is stored, and the at least one instruction is loaded by the processor and executes the model evaluation method according to the embodiment shown in fig. 1 to 9, where a specific execution process may refer to specific descriptions of the embodiment shown in fig. 1 to 9, and is not described herein again.
Fig. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 15, the electronic device 1000 may include: at least one processor 1001, at least one network interface 1004, a user interface 1003, memory 1005, at least one communication bus 1002.
Wherein a communication bus 1002 is used to enable connective communication between these components.
The user interface 1003 may include a Display screen (Display) and a Camera (Camera), and the optional user interface 1003 may also include a standard wired interface and a wireless interface.
The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.
Processor 1001 may include one or more processing cores. The processor 1001 connects various parts of the entire electronic device 1000 using various interfaces and lines, and performs various functions of the electronic device 1000 and processes data by running or executing instructions, programs, code sets or instruction sets stored in the memory 1005 and calling data stored in the memory 1005. Optionally, the processor 1001 may be implemented in at least one hardware form of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA) or a Programmable Logic Array (PLA). The processor 1001 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is used for rendering and drawing the content that the display screen needs to display; and the modem is used to handle wireless communication. It is understood that the modem may also not be integrated into the processor 1001 but be implemented by a separate chip.
The memory 1005 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 1005 includes a non-transitory computer-readable medium. The memory 1005 may be used to store instructions, programs, code, code sets or instruction sets. The memory 1005 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the above method embodiments, and the like; and the data storage area may store the data and the like referred to in the above method embodiments. Optionally, the memory 1005 may also be at least one storage device located remotely from the processor 1001. As shown in fig. 15, the memory 1005, which is a computer storage medium, may include an operating system, a network communication module, a user interface module, and a model evaluation application program.
In the electronic device 1000 shown in fig. 15, the user interface 1003 is mainly used to provide an input interface for the user and to acquire data input by the user; and the processor 1001 may be configured to invoke the model evaluation application stored in the memory 1005 and specifically perform the following operations:
when a dialogue model processes a dialogue text, determining at least one marking skill and at least one recalling skill corresponding to the dialogue text;
obtaining a conversation result corresponding to the conversation text output by the conversation model, and determining at least one target skill corresponding to the conversation result;
performing a recall assessment on the conversation model based on the at least one annotation skill, the at least one recall skill, and the at least one target skill.
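Under the same simplifying assumptions as the earlier sketches, the three operations listed above can be pictured as the short end-to-end driver below; the helper name and report fields are hypothetical.

```python
# Illustrative end-to-end sketch: one evaluation pass over a labeled dialog text.
from typing import Dict, List, Set

def evaluate_dialog_text(annotation_skills: Set[str],
                         recall_skills: Set[str],
                         target_skills: List[str]) -> Dict[str, float]:
    """Combine skill-level recall evaluation with a simple check of the output target skills."""
    hits = recall_skills & annotation_skills
    return {
        "skill_accuracy": len(hits) / len(recall_skills) if recall_skills else 0.0,
        "skill_recall_rate": len(hits) / len(annotation_skills) if annotation_skills else 0.0,
        # Fraction of output target skills that were among the labeled skills.
        "target_hit_rate": (sum(s in annotation_skills for s in target_skills) / len(target_skills)
                            if target_skills else 0.0),
    }

print(evaluate_dialog_text({"music", "audiobook"}, {"music", "video"}, ["music"]))
# -> {'skill_accuracy': 0.5, 'skill_recall_rate': 0.5, 'target_hit_rate': 1.0}
```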
In one embodiment, before performing the step of determining at least one annotation skill and at least one recall skill corresponding to the dialog text processed by the dialog model, the processor 1001 further performs the following operations:
acquiring a newly added skill evaluation set, wherein the newly added skill evaluation set comprises at least one newly added tagging skill and at least one first dialog text corresponding to each newly added tagging skill;
and training and labeling a reference skill evaluation set corresponding to the dialogue model based on the newly added skill evaluation set and the dialogue model.
In an embodiment, when the processor 1001 performs the training and labeling on the reference skill evaluation set corresponding to the dialogue model based on the newly added skill evaluation set and the dialogue model, specifically perform the following operations:
acquiring a reference skill evaluation set corresponding to the conversation model, and adding the newly added skill evaluation set to the reference skill evaluation set, wherein the reference skill evaluation set comprises at least one labeling skill and at least one second conversation text corresponding to each labeling skill;
evaluating the reference skill evaluation set according to the dialogue model to obtain skill conflict information corresponding to the newly added skill;
and carrying out skill annotation on the target dialog text in each second dialog text based on the skill conflict information.
In an embodiment, when executing the skill tagging on the second dialog text based on the skill conflict information, the processor 1001 specifically further executes the following steps:
determining a target dialog text which conflicts with the newly added skill and a conflict type corresponding to the target dialog text in each second dialog text based on the skill conflict information;
and performing multi-skill labeling on the target dialog text according to a text labeling mode corresponding to the conflict type.
In one embodiment, when performing the recall assessment of the dialogue model based on the at least one annotation skill, the at least one recall skill and the at least one target skill, the processor 1001 specifically further performs the following steps:
generating skill recall assessment information for the conversation model based on the at least one annotated skill and the at least one recall skill;
and obtaining ranking decision information corresponding to the at least one target skill, and generating ranking recall evaluation information of the conversation model based on the ranking decision information.
In one embodiment, when the processor 1001 executes the generating of the skill recall assessment information of the dialogue model based on the at least one annotation skill and the at least one recall skill, the following steps are specifically executed:
performing recall skill decomposition on at least one recall skill corresponding to the conversation text to generate a recall decomposition result;
performing annotation skill decomposition on the at least one annotation skill corresponding to the dialog text to generate an annotation decomposition result;
and generating the skill accuracy of the dialog text according to the recall decomposition result and the annotation decomposition result, and generating the skill recall rate of the dialog text according to the recall decomposition result and the annotation decomposition result.
In one embodiment, after executing the step of obtaining the dialog result output by the dialog model based on the dialog text and determining at least one target skill corresponding to the dialog result, the processor 1001 further executes the following steps:
obtaining effect feedback information corresponding to the at least one target skill;
and when the effect feedback information does not meet the recall feedback condition corresponding to the target skill, performing a recall evaluation on the conversation model.
In the embodiment of the application, when a dialog model processes a dialog text, an electronic device determines at least one labeling skill and at least one recall skill corresponding to the dialog text, obtains the dialog result corresponding to the dialog text output by the dialog model, determines at least one target skill corresponding to the dialog result, and performs a recall evaluation on the dialog model based on the at least one labeling skill, the at least one recall skill and the at least one target skill. By labeling the dialog text with at least one labeling skill, performing a full recall of the at least one recall skill corresponding to the semantic understanding of the dialog text at the processing stage of the dialog model, determining at least one target skill based on the dialog result, and then performing a recall evaluation on the dialog model, the low model-evaluation accuracy of the single-label NLU recall evaluation systems in the related art is avoided: because multiple labeling skills can be labeled for all of the recall skills, the evaluation of the dialog model at the text semantic-understanding level is covered accurately, and because the ranking layer with which the dialog model configures the execution results (recall skills) when outputting the dialog result is evaluated based on the corresponding at least one target skill, the granularity of the evaluation dimensions is further refined and the accuracy of the model evaluation is improved. In addition, the multi-skill labeling of the dialog text makes it possible to quickly handle conflicting skills and conflicting dialog texts when a new skill evaluation set is added to the dialog model, which improves the end-to-end experience of the dialog model and shortens its update cycle. Furthermore, by incorporating user-level feedback into the evaluation of the dialog model through the effect feedback information corresponding to the at least one target skill on the user side, the coverage of the dialog model evaluation is further improved and its evaluation effect is enhanced.
It is clear to a person skilled in the art that the solution of the present application can be implemented by means of software and/or hardware. The "unit" and "module" in this specification refer to software and/or hardware that can perform a specific function independently or in cooperation with other components, where the hardware may be, for example, a Field-Programmable Gate Array (FPGA), an Integrated Circuit (IC), or the like.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some service interfaces, devices or units, and may be an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program, which is stored in a computer-readable memory, and the memory may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above description is only an exemplary embodiment of the present disclosure, and the scope of the present disclosure should not be limited thereby. That is, all equivalent changes and modifications made in accordance with the teachings of the present disclosure are intended to be included within the scope of the present disclosure. Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. A method of model evaluation, the method comprising:
when a dialogue model processes a dialogue text, determining at least one marking skill and at least one recalling skill corresponding to the dialogue text;
obtaining a conversation result corresponding to the conversation text output by the conversation model, and determining at least one target skill corresponding to the conversation result;
performing a recall assessment on the conversation model based on the at least one annotation skill, the at least one recall skill, and the at least one target skill.
2. The method of claim 1, wherein prior to determining at least one tagging skill and at least one recall skill corresponding to the dialog text as it is processed by the dialog model, further comprising:
acquiring a newly added skill evaluation set, wherein the newly added skill evaluation set comprises at least one newly added tagging skill and at least one first dialog text corresponding to each newly added tagging skill;
and training and labeling a reference skill evaluation set corresponding to the dialogue model based on the newly added skill evaluation set and the dialogue model.
3. The method according to claim 2, wherein the training and labeling of the reference skill evaluation set corresponding to the dialogue model based on the newly added skill evaluation set and the dialogue model comprises:
acquiring a reference skill evaluation set corresponding to the conversation model, and adding the newly added skill evaluation set to the reference skill evaluation set, wherein the reference skill evaluation set comprises at least one labeling skill and at least one second conversation text corresponding to each labeling skill;
evaluating the reference skill evaluation set according to the dialogue model to obtain skill conflict information corresponding to the newly added skill;
and carrying out skill annotation on the target dialog text in each second dialog text based on the skill conflict information.
4. The method of claim 3, wherein the skill tagging the second dialog text based on the skill conflict information comprises:
determining a target dialog text which conflicts with the newly added skill and a conflict type corresponding to the target dialog text in each second dialog text based on the skill conflict information;
and performing multi-skill labeling on the target dialog text according to a text labeling mode corresponding to the conflict type.
5. The method of claim 1, wherein said recalling assessment of said dialogue model based on said at least one annotation skill, said at least one recall skill, and said at least one target skill comprises:
generating skill recall assessment information for the conversation model based on the at least one annotated skill and the at least one recall skill;
and obtaining ranking decision information corresponding to the at least one target skill, and generating ranking recall evaluation information of the conversation model based on the ranking decision information.
6. The method of claim 5, wherein generating skill recall assessment information for the conversation model based on the at least one annotation skill and the at least one recall skill comprises:
performing recall skill decomposition on at least one recall skill corresponding to the conversation text to generate a recall decomposition result;
performing annotation skill decomposition on the at least one annotation skill corresponding to the dialog text to generate an annotation decomposition result;
and generating the skill accuracy of the dialog text according to the recall decomposition result and the annotation decomposition result, and generating the skill recall rate of the dialog text according to the recall decomposition result and the annotation decomposition result.
7. The method of claim 1, wherein after obtaining the dialog result output by the dialog model based on the dialog text and determining at least one target skill corresponding to the dialog result, the method further comprises:
obtaining effect feedback information corresponding to the at least one target skill;
and when the effect feedback information does not meet the recall feedback condition corresponding to the target skill, performing a recall evaluation on the conversation model.
8. A model evaluation apparatus, the apparatus comprising:
the skill determination module is used for determining at least one annotation skill and at least one recall skill corresponding to the conversation text when the conversation model processes the conversation text;
the result acquisition module is used for acquiring a conversation result corresponding to the conversation text output by the conversation model and determining at least one target skill corresponding to the conversation result;
the recall evaluation module is used for performing a recall evaluation on the conversation model based on the at least one annotation skill, the at least one recall skill, and the at least one target skill.
9. A computer storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor and to carry out the method steps according to any one of claims 1 to 7.
10. An electronic device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1 to 7.
CN202010806446.9A 2020-08-12 2020-08-12 Model evaluation method, model evaluation device, storage medium and electronic equipment Pending CN112052316A (en)


Publications (1)

Publication Number Publication Date
CN112052316A (en) 2020-12-08




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination