CN114528821B - Understanding-assisted dialog system manual evaluation method and device and storage medium


Info

Publication number
CN114528821B
Authority
CN
China
Prior art keywords
evaluation
reading
understanding
conversation
dialogue
Prior art date
Legal status
Active
Application number
CN202210436767.3A
Other languages
Chinese (zh)
Other versions
CN114528821A (en)
Inventor
Li Huaqing
He Xiangnan
Xiang Yuanxin
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC
Priority to CN202210436767.3A
Publication of CN114528821A
Application granted
Publication of CN114528821B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/166: Editing, e.g. inserting or deleting
    • G06F40/186: Templates

Abstract

The invention discloses an understanding-assisted manual evaluation method, device and storage medium for dialogue systems. Commonly used evaluation criteria are grouped and organized, the dialogue criteria to be evaluated are selected, and a basic evaluation template is constructed. Drawing on the reading-comprehension tests and reading strategies commonly used in Chinese and English examinations, a single-choice dialogue-completion task and a drag-and-sort content-ordering task are embedded in the part of the basic evaluation template that presents the dialogue history. The time workers spend evaluating the dialogue history is recorded, consistency indices of the workers' answers are computed with a consistency algorithm, and the improvement that understanding the dialogue history brings to the evaluation results is demonstrated. Starting from task understanding, the proposed scheme refines the dialogue evaluation procedure by strengthening workers' understanding of the task, improving the reliability of worker evaluations and thereby yielding high-quality evaluation data.

Description

Understanding-assisted dialog system manual evaluation method and device and storage medium
Technical Field
The invention relates to the technical field of natural language processing and human-computer interaction, and in particular to an understanding-assisted manual evaluation method, apparatus and storage medium for dialogue systems.
Background
Chit-chat dialogue research is an important topic in the field of natural language processing, and the current immaturity of dialogue evaluation technology greatly limits further research on and application of dialogue systems. Dialogue system evaluation is generally divided into two approaches according to the evaluating subject: manual evaluation and automatic evaluation. Automatic evaluation mainly relies on evaluation indices and evaluation models, mining semantic relations in the dialogue context through statistical dialogue features or even deep learning models. For chit-chat dialogue, however, there is no standard reply to serve as a reference, and automatic evaluation methods struggle to perform well in this scenario. To achieve more accurate dialogue system evaluation, the reliability and consistency of manual evaluation must be ensured, so as to obtain high-quality manually annotated data (i.e., evaluation data).
A key problem in current manual evaluation of dialogue systems is the lack of a standard dialogue evaluation scheme, so different works overlap little and are poorly reproducible. The differences among current manual evaluation schemes mainly concern the evaluation form or the evaluation details. For example, in the Conversational Intelligence Challenge (ConvAI), each user is paired with a dialogue robot for 4-6 turns of conversation and then answers, on a scale of 1-4, the question "How much do you like talking to the user?". A Facebook AI research team proposed Acute-eval, a comparative evaluation scheme for multi-turn dialogue: considering the result bias that score-based evaluation may introduce, it shows the user two complete multi-turn dialogues, asks which speaker is preferable, and evaluates the dialogue system by preference. Furthermore, within the dialogue domain, the quality criteria being evaluated may vary with the task scenario or data set. Surveys by multiple researchers show a low degree of overlap among existing evaluation criteria, which introduces uncertainty into workers' evaluation work. Researchers have tabulated the different evaluation criteria, and how often each occurs, across papers in the field of natural language generation (NLG); the sparsity of the resulting table indicates that the criteria used in the surveyed papers are not uniform, making comparison between different works very difficult.
To ensure the validity of evaluation data (i.e., dialogue evaluation results) for chit-chat dialogue, existing schemes generally compute the consistency of worker evaluations to measure annotation quality. However, in most manual evaluation schemes the consistency among the workers is low and the evaluation results are unreliable, so much research has aimed at improving the consistency of evaluation schemes. For example, work by the researcher Novikova demonstrated that using a continuous scale when scoring can improve the consistency of language evaluation. Another group's research on crowd-sourced annotation tasks suggests that annotators may complete tasks inaccurately and with insufficient quality because they lack an understanding of the task. Sashank Santhanam ran dialogue evaluation experiments grounded in the theory of cognitive bias and showed that, under certain conditions, providing reference generated sentences can improve inter-worker consistency; a Facebook AI research team, considering workers' differing understandings of the task questions, measured worker consistency under different wordings of each question and selected the wording with the highest consistency for subsequent experiments. However, existing schemes such as changing the wording or adding references address only one or two research points, with no systematic consideration of the understanding process involved in the whole dialogue evaluation task. In sum, the reliability of evaluation results and the quality of evaluation data both remain to be improved.
Disclosure of Invention
The invention aims to provide an understanding-assisted manual evaluation method, apparatus and storage medium for dialogue systems that improve the consistency of worker evaluations and thereby the reliability of evaluation results and the quality of evaluation data.
The purpose of the invention is realized by the following technical scheme:
an understanding-assisted dialog system manual evaluation method, comprising:
screening several dialogue evaluation criteria from existing evaluation criteria, constructing an evaluation criteria framework, and generating a basic evaluation template;
designing reading questions modeled on reading-comprehension tests, embedding the reading questions into the dialogue content to be evaluated on the basic evaluation template, generating an evaluation template containing the reading questions, and providing it to the workers participating in manual evaluation of the dialogue system;
receiving the evaluation templates containing reading questions filled in by each worker, extracting the answers to the reading questions from them, screening the workers by the correctness of those answers, and extracting the evaluations of the dialogue content from the templates filled in by the screened workers as the manual evaluation results.
An understanding-assisted dialog system manual evaluation apparatus, comprising:
an evaluation criteria screening and basic evaluation template generation unit, configured to screen several dialogue evaluation criteria from existing evaluation criteria, construct an evaluation criteria framework, and generate a basic evaluation template;
a reading question embedding unit, configured to design reading questions modeled on reading-comprehension tests, embed them into the dialogue content to be evaluated on the basic evaluation template, generate an evaluation template containing the reading questions, and provide it to the workers participating in manual evaluation of the dialogue system;
and an evaluation result screening unit, configured to receive the evaluation templates containing reading questions filled in by each worker, extract the answers to the reading questions, screen the workers by the correctness of those answers, and extract the evaluations of the dialogue content from the templates filled in by the screened workers as the manual evaluation results.
A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium, storing a computer program which, when executed by a processor, implements the aforementioned method.
According to the technical scheme provided by the invention, starting from task understanding, the dialogue evaluation scheme is refined by strengthening workers' understanding of the task, improving the reliability of worker evaluations and yielding high-quality evaluation data.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a manual evaluation method for a dialog system for assisting understanding according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a basic evaluation template according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a missing sentence selection strategy according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the dialogue content ordering strategy provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a dialog system manual evaluation apparatus for assisting understanding according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. The described embodiments are only a part of the embodiments of the present invention, not all of them; all other embodiments that a person skilled in the art can derive from them without creative effort fall within the protection scope of the present invention.
The terms that may be used herein are first explained as follows:
The terms "comprising", "including", "containing", "having" and other similar terms should be construed as non-exclusive inclusions. For example, a feature that is included (e.g., a material, component, ingredient, carrier, formulation, dimension, part, mechanism, device, step, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product, or article of manufacture) should be construed to cover not only the particular feature explicitly listed but also other features, known in the art, that are not explicitly listed.
Next, the scheme provided by the invention is explained as a whole. The invention adds an understanding-oriented evaluation strategy to the basic dialogue evaluation flow and decomposes a worker's understanding of the task into understanding of the multi-turn dialogue and understanding of the dialogue evaluation criteria, where a worker is a user participating in dialogue evaluation. To this end, on the basis of a clear dialogue criteria framework, the invention improves the evaluation results by adding an understanding-assistance strategy to the basic evaluation flow. Specifically, in the study of dialogue criteria, the invention groups and organizes the criteria commonly used for dialogue, selects the dialogue criteria to be evaluated, and summarizes a clear definition for each quality criterion. In addition, drawing on the reading-comprehension tests and reading strategies commonly used in Chinese and English examinations, reading tasks of single-choice dialogue completion and drag-and-sort content ordering are embedded in the part of the basic evaluation template that presents the dialogue history; the time workers spend evaluating the dialogue history is recorded, consistency indices of their answers are computed with a consistency algorithm, and the improvement that understanding the dialogue history brings to the evaluation results is demonstrated.
The understanding-assisted manual evaluation method, apparatus and storage medium for dialogue systems provided by the present invention are described in detail below. Details not described in the embodiments belong to the prior art known to those skilled in the art. Where conditions are not specified, the embodiments follow the conventional conditions in the art or the conditions suggested by the manufacturer.
Example one
As shown in Fig. 1, the understanding-assisted manual evaluation method for dialogue systems mainly includes the following steps:
Step 11: screen several dialogue evaluation criteria from existing evaluation criteria, construct an evaluation criteria framework, and generate a basic evaluation template.
Step 12: design reading questions modeled on reading-comprehension tests, embed them into the dialogue content to be evaluated on the basic evaluation template, generate an evaluation template containing the reading questions, and provide it to the workers participating in manual evaluation of the dialogue system.
In the embodiment of the invention, a missing-sentence selection strategy and a dialogue-content ordering strategy are designed; the two strategies correspond to different reading questions. A strategy can be selected according to the dialogue content to be evaluated, and the corresponding reading question is generated based on the selected strategy and embedded into the dialogue content.
Step 13: receive the evaluation templates containing reading questions filled in by the workers, extract the answers to the reading questions, screen the workers by the correctness of those answers, and extract the evaluations of the dialogue content from the templates filled in by the screened workers as the manual evaluation results.
In the embodiment of the invention, a worker's level of understanding is assessed with the reading questions set in step 12; only evaluations submitted by workers who pass the understanding test are taken as manual evaluation results, and a consistency analysis is also performed on those results to demonstrate the advantage of the invention.
For ease of understanding, the main principles of the above scheme are described below.
To determine the evaluation criteria and their definitions, the invention surveyed 105 related papers from several main conferences in the field of natural language processing in 2016-2020, taking the 27 criteria used in them as research objects according to a grouping-analysis method. In addition, to better classify and define the criteria, the definitions and usage scenarios of these quality criteria were also explored in dictionaries and linguistics papers, so that the 27 criteria were divided into the 7 groups shown in Table 1.
Group 1: Fluency, Grammaticality, Correctness, Readability, Understandable (evaluation of sentence quality)
Group 2: Relevance, Coherence, Consistency, Sensibleness, Listening, Maintain Context, Logic (evaluation of the association with the dialogue history)
Group 3: Informativeness, Diversity, Specificity, Proactivity, Flexible (evaluation of whether sentences are generic or repetitive)
Group 4: Overall Quality, Appropriateness, Naturalness, Humanness, Adequacy (evaluation of overall sentence quality given the dialogue history)
Group 5: Engagement, Interestingness (evaluation of the interactive experience)
Group 6: Empathy, Emotion (evaluation of the emotional experience)
Group 7: Others
TABLE 1 Grouping of the 27 evaluation criteria
To select the final evaluation criteria, the definition of each criterion and its usage in the dialogue field must both be considered. In the first group, for example, "Grammaticality" and "Correctness" have identical definitions and attend to conformance with grammatical rules, yet such annotation need not be manual; "Readability" is generally considered better than "Grammaticality" because it emphasizes how easily a sentence can be understood. Furthermore, although "Fluency" is the most frequently used criterion in this group, it emphasizes the linguistic ability of a "human" or a "machine", whereas Readability is better suited to sentence-oriented evaluation. Hence "Readability" is selected as one of the dialogue criteria. In the end, five evaluation criteria (Readability, Relevance, Consistency, Informativeness, Naturalness) were selected as the criteria for evaluating dialogue replies in the subsequent experiments. Apart from Naturalness, which represents overall sentence quality, the principle followed is that the selected criteria have non-overlapping definitions and cover the various aspects of dialogue reply evaluation. The five selected dialogue evaluation criteria and their definitions are shown in Table 2:
Readability: the quality of the response being easy to understand
Relevance: the quality of the response connecting with the context
Consistency: the quality of the response agreeing with the known information
Informativeness: the quality of the response providing new information
Naturalness: the plausibility of the response having been generated by a human
TABLE 2 Screened evaluation criteria and their definitions
Based on these five dialogue evaluation criteria, an evaluation criteria framework is constructed and a basic evaluation template is generated. Fig. 2 gives an example of the basic evaluation template: its upper half contains the dialogue history and the dialogue reply (i.e., the dialogue content), and its lower half is the scoring area.
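As an illustration only (the patent does not prescribe an implementation), the basic evaluation template of Fig. 2 could be represented roughly as the following data structure; all class and field names here are assumptions:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# The five screened criteria of Table 2.
CRITERIA = ["Readability", "Relevance", "Consistency",
            "Informativeness", "Naturalness"]

@dataclass
class EvaluationTemplate:
    """Upper half: dialogue content; lower half: scoring area (hypothetical names)."""
    dialog_history: List[str]   # alternating utterances of the two speakers
    dialog_reply: str           # the reply to be evaluated
    scores: Dict[str, Optional[int]] = field(
        default_factory=lambda: {c: None for c in CRITERIA})

    def rate(self, criterion: str, score: int) -> None:
        """Record a worker's rating for one criterion on the 5-point Likert scale."""
        if criterion not in self.scores:
            raise KeyError(f"unknown criterion: {criterion}")
        if not 1 <= score <= 5:
            raise ValueError("ratings use a 5-point Likert scale")
        self.scores[criterion] = score

    def is_complete(self) -> bool:
        """True once every criterion has been scored."""
        return all(v is not None for v in self.scores.values())
```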
Assuming workers have basic language knowledge and reading ability, the invention provides an understanding-assistance strategy suited to chit-chat dialogue, helping workers understand the dialogue history and thereby improving the evaluation results.
The invention summarizes 7 text-based selection tasks applicable to chit-chat dialogue, considers whether each task type requires additional manual labeling, and gives the analysis results in Table 3. Missing-text selection and ordering selection can be designed without manual labeling or additional questions, so they are suitable to be added to the dialogue-history understanding-assistance scheme as reading tasks.
Task type (requires additional manual labeling):
Detail understanding selection: Yes
Topic summary selection: Yes
Sentence comprehension selection: Yes
Inferential judgment selection: Yes
Attitude/emotion selection: Yes
Missing text selection: No
Ordering selection: No
TABLE 3 Analysis of text-based selection reading tasks
Conventional reading-ability assessment relies on comprehension after reading is complete (e.g., multiple-choice questions set after the dialogue content), requiring the worker to answer the questions after reading the text. However, "understanding" occurs during reading, and answering several separate reading-comprehension questions after reading increases the difficulty of reasoning about the information in the material as well as the cost of evaluation and annotation. Therefore, in its understanding-assistance scheme for dialogue history, the invention embeds the questions into the process of reading the dialogue, helping workers better understand long dialogues. Unlike other texts, a long dialogue cannot be divided into paragraphs or chapters: it is one continuous process, and embedding an independent question breaks its coherence. During embedding, direct questions (such as asking "Which sentence do you think should be inserted in the blank?") are therefore omitted, and the reading-comprehension task is merged directly with the dialogue content. The specific strategies and front-end interfaces are designed as follows:
1) Missing-sentence selection strategy (strategy 1): while reading the dialogue history within a single task, the worker makes a single choice for a missing sentence in the dialogue and then scores the sentences.
Specifically: following the reading-comprehension assessment style of English examinations, a sentence at a designated position A in the dialogue content to be evaluated is turned into a single-choice test question whose options comprise the original sentence at position A and a sentence randomly drawn from the data set. The reading question under this strategy expects the worker to accurately pick out the original sentence at position A. Fig. 3 gives an example of the missing-sentence selection strategy: the missing sentence is the first sentence of the third turn of the dialogue, and the two options, presented in random order, are the original sentence and a random sentence from the data set. In the front-end page, before selecting, the worker sees a gray prompt "Please select the proper sentence" in the selection box; after the worker clicks the box, the two options pop up, namely a sentence randomly drawn from the data set and the original sentence of the dialogue. The background of the option under the mouse turns orange, and after the worker clicks an option its text replaces the gray prompt.
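A minimal sketch of how the missing-sentence question of strategy 1 could be generated and checked, assuming a simple list-of-utterances representation of the dialogue; the function and variable names are illustrative, not from the patent:

```python
import random
from typing import List, Optional, Tuple

def make_missing_sentence_question(
        dialog: List[str],
        position: int,
        distractor_pool: List[str],
        rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[str], str]:
    """Blank out the utterance at `position` and build the two options:
    the original sentence and one sentence drawn at random from the data
    set, presented in random order (as in Fig. 3)."""
    rng = rng or random.Random()
    original = dialog[position]
    blanked = dialog[:position] + ["____"] + dialog[position + 1:]
    distractor = rng.choice([s for s in distractor_pool if s != original])
    options = [original, distractor]
    rng.shuffle(options)
    return blanked, options, original

def passes_reading_check(chosen: str, correct: str) -> bool:
    """Correctness check against the original sentence; only workers who
    answer correctly are kept for the subsequent data analysis."""
    return chosen == correct
```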
2) Dialogue-content ordering strategy (strategy 2): while reading the dialogue history within a single task, the worker reorders randomly scrambled sentences in the dialogue and then scores the sentences.
Specifically: following the reading-comprehension assessment style of English examinations, sentences of the dialogue content to be evaluated are randomly scrambled and the worker is required to reorder them. Since scrambling isolated sentences would make reordering too difficult, as shown in Fig. 4, a dialogue turn of three middle sentences is taken as the unit to scramble. The turn to be reordered is marked in green with a text prompt; its sentences can be sorted by dragging, and can no longer be dragged after the confirm button is clicked. The front-end implementation includes the following check: if the worker scores directly without dragging, a popup prompts "You should drag and sort the above dialogue turn!" and the subsequent scoring task cannot proceed, ensuring that the worker evaluates the sentences only after completing the reading task.
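A corresponding sketch for strategy 2, again under assumed names: one three-sentence dialogue turn is scrambled, and the worker's drag-and-drop order is checked against the original order:

```python
import random
from typing import List, Optional, Tuple

def scramble_turn(dialog: List[str], start: int,
                  rng: Optional[random.Random] = None) -> Tuple[List[str], List[str]]:
    """Scramble the three-sentence dialogue turn beginning at `start`
    (the sentences are assumed distinct); returns (scrambled view shown
    to the worker, original order used as ground truth)."""
    rng = rng or random.Random()
    original = dialog[start:start + 3]
    scrambled = original[:]
    while scrambled == original:   # ensure the worker really has to sort
        rng.shuffle(scrambled)
    return scrambled, original

def passes_ordering_check(submitted: List[str], original: List[str]) -> bool:
    """Workers who reproduce the original order pass the reading task."""
    return submitted == original
```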
Because embedding the reading questions of both strategies in one dialogue at the same time would make reading too difficult, one strategy is selected in application, so as to assist workers in understanding the dialogue history. Specifically, when the number of turns of the dialogue history is smaller than a set value, the missing-sentence selection strategy of strategy 1 is suggested; when the number of turns is equal to or greater than the set value, either strategy may be used. The set value may be 4, for example. Under strategy 1, since a single-choice sentence-completion task is added while the dialogue history is read, correctness can be checked against the correct answer (the sentence of the original dialogue), and workers who answer correctly are screened out for subsequent data analysis. Under strategy 2, since a dialogue-ordering task is added while the dialogue history is read, correctness is checked against the correct ordering (the order of the original dialogue turn), and workers who order correctly are screened out. The invention takes only the evaluations provided by the screened workers as the manual evaluation results for subsequent data analysis.
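The strategy choice and worker screening just described could look roughly like this; the threshold of 4 follows the example set value in the text, and everything else is an illustrative assumption:

```python
from typing import Dict, List

TURN_THRESHOLD = 4   # the example set value given in the text

def applicable_strategies(num_turns: int) -> List[int]:
    """Strategy 1 only for short dialogues; either strategy otherwise."""
    return [1] if num_turns < TURN_THRESHOLD else [1, 2]

def screen_workers(reading_results: Dict[str, bool]) -> List[str]:
    """Keep only the workers whose reading answer (selection or ordering)
    was correct; only their scores enter the manual evaluation results."""
    return [w for w, correct in reading_results.items() if correct]

# e.g. screen_workers({"w1": True, "w2": False, "w3": True}) -> ["w1", "w3"]
```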
Compared with the existing manual evaluation flow for chit-chat dialogue, the scheme of the embodiment has the following advantages: (1) it improves the details of the manual evaluation flow for chit-chat dialogue; (2) adding the understanding-assistance strategy to the basic template raises workers' level of understanding and can improve the consistency of manual annotation in dialogue evaluation.
The technical effects and performance of the above scheme are verified through the following experiments.
First, the experimental setup.
To examine the advantages of each scheme and strategy more closely, the following settings were made on the basis of the evaluation template, as shown in Table 4: setting 1 is the basic evaluation template, and settings 2 and 3 add the missing-sentence selection strategy and the dialogue-content ordering strategy, respectively, to the basic template.
TABLE 4 Experimental settings (table rendered as an image in the original; not reproduced)
The missing-sentence selection strategy plus the basic evaluation template, and the dialogue-content ordering strategy plus the basic evaluation template, are both instances of the evaluation template containing reading questions defined in step 12 above.
To verify the advantages of the invention, each dialogue history and its corresponding replies were published as one task on the Amazon Mechanical Turk (AMT) platform, recruiting more than 20 workers per task, with the following eligibility requirements: (1) the worker's country is one of US (United States), CA (Canada) and AU (Australia), so that the worker's everyday language is English as far as possible; (2) HIT approval rate (the proportion of the worker's submitted tasks approved on the platform) > 95%; (3) number of approvals (the total number of the worker's submitted tasks approved on the platform) > 100. Workers who met these conditions and passed the correctness check under each setting were finally selected for subsequent data analysis, as shown in Table 5.
TABLE 5 Distribution of participants, as number of participants per HIT (table rendered as an image in the original; not reproduced)
In the experiment, based on the selected data set DailyDialog and 4 mainstream dialogue generation models (HRED, GPT, Blender, DialoGPT), three front-end interfaces were built by combining the obtained dialogue data and the basic template with the scheme of the first embodiment, to collect and observe workers' evaluation scores and submitted answers. Given the low usage of magnitude estimation and comparative evaluation in dialogue evaluation, a 5-point Likert scale was adopted for the experiment.
Second, the improvement in worker consistency.
In manual annotation experiments without a standard reply as reference, the validity of the data is often assessed through worker consistency. This experiment uses the intra-class correlation coefficient (ICC) to measure the consistency of worker ratings.
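For illustration, worker consistency could be computed with the intraclass_corr function of the pingouin package over a long-format table of ratings; the column names and the choice of the ICC2k variant are assumptions, since the patent only states that an intra-class correlation coefficient is used:

```python
import pandas as pd
import pingouin as pg

def worker_consistency(ratings: pd.DataFrame) -> float:
    """ICC over a long-format frame with columns: item, worker, score
    (one row per worker rating of one dialogue item)."""
    icc = pg.intraclass_corr(data=ratings, targets="item",
                             raters="worker", ratings="score")
    # ICC2k (average of k random raters) is one common variant when N
    # workers all rate the same items; the patent does not fix a variant.
    return float(icc.set_index("Type").loc["ICC2k", "ICC"])
```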
The experiments examine whether the above scheme provided by the invention positively affects consistency among workers. The intra-class correlation coefficient was computed over the interval N = [3, 20] (N being the number of workers), and the four evaluation criteria Readability, Relevance, Informativeness and Consistency were analyzed at N = 6; the results are shown in Table 6. The consistency of setting 1 (the basic template) is low on all 5 criteria, so the reliability of its results is low. After an understanding-assistance strategy is added to the basic template and a reading question is embedded in the dialogue history, the consistency of settings 2 and 3 improves on every criterion. In particular, on Relevance and Consistency, the two criteria belonging to the "association with the dialogue history" group of the dialogue framework, the consistency of setting 3 exceeds 0.6, showing that the drag-and-sort content-ordering strategy is very effective: ordering strengthens workers' understanding of the dialogue history and thereby improves the evaluation results. The experiments verify that adding a missing-sentence selection question or a drag-and-sort ordering question to the basic evaluation template can effectively improve worker consistency, as shown in Table 6.
TABLE 6 Consistency results under different criteria and settings (table rendered as an image in the original; not reproduced)
Third, average score analysis.
The experiment compares the average scores of the 4 dialogue systems, as shown in Table 7, where Human denotes the original replies in the dialogue data set. Experimentally, GPT and DialoGPT produce better replies in chit-chat dialogue than HRED and Blender, and even score higher than human replies on the Readability criterion, indicating that these two dialogue models can produce replies with extremely high readability. On the Relevance and Consistency criteria, however, the human replies score far beyond the dialogue models, indicating that dialogue models still need improvement on the criteria associated with the dialogue history.
TABLE 7 Average scores of the dialogue models under different criteria (table rendered as an image in the original; not reproduced)
Fourth, time cost analysis.
Because settings 2 and 3 add the extra reading tasks of selection and ordering, the experiment considers whether the time workers spend under each setting influences the results. Unlike previous evaluation work, which ignored time or used only the overall average answering time, this experiment focuses on two indicators, shown in Table 8: the time workers spend on reading comprehension of the task (reading time) and the time spent on evaluation scoring (answering time) under the different settings, both obtained from timestamps returned by the front-end code. Under settings 2 and 3, the time for processing the dialogue history (selection, ordering) is counted within the reading time. Keeping the two times separate helps distinguish the difficulty of reading the context from the difficulty of evaluating against the dialogue criteria.
TABLE 8 Time cost under different settings (table rendered as an image in the original; not reproduced)
Comparing the reading and answering times across settings shows that workers spend the least average time under setting 1, while settings 2 and 3 take longer. Combined with the differences in intra-class correlation coefficients, it is clear that after the understanding-assistance strategy is added, workers score the criteria more carefully than before. Since the reading time covers both the reading-task requirements and the dialogue history, the statistics show that setting 2's reading time is longer than setting 3's, indicating that missing-sentence selection as a reading task takes more time than the dialogue-content ordering strategy. Combining the answering times and the consistency results under the three settings shows that, once workers understand the dialogue criteria or the dialogue history, they give more careful evaluations with better consistency.
Example two
An embodiment of the present invention provides an understanding-assisted manual evaluation apparatus for dialogue systems, implemented mainly with the scheme provided in the first embodiment; as shown in Fig. 5, it mainly includes:
an evaluation criteria screening and basic evaluation template generation unit, configured to screen several dialogue evaluation criteria from existing evaluation criteria, construct an evaluation criteria framework, and generate a basic evaluation template;
a reading question embedding unit, configured to design reading questions modeled on reading-comprehension tests, embed them into the dialogue content to be evaluated on the basic evaluation template, generate an evaluation template containing the reading questions, and provide it to each worker participating in manual evaluation of the dialogue system;
and an evaluation result screening unit, configured to receive the evaluation templates containing reading questions filled in by each worker, extract the answers to the reading questions, screen the workers by the correctness of those answers, and extract the evaluations of the dialogue content from the templates filled in by the screened workers as the manual evaluation results.
The main principles of the units of the above apparatus have been described in detail in the first embodiment and are therefore not repeated here.
It is clear to those skilled in the art that, for convenience and brevity of description, the above division into functional modules is only an example; in practical applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above.
Example three
The present invention also provides a processing apparatus, as shown in fig. 6, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method provided by the first embodiment.
The processing device further comprises at least one input device and at least one output device; within the processing device, the processor, memory, input device and output device are connected through a bus.
In the embodiment of the present invention, the specific types of the memory, the input device, and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical button or a mouse and the like;
the output device may be a display terminal;
the Memory may be a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as a disk Memory.
Example four
The present invention further provides a readable storage medium storing a computer program, which when executed by a processor implements the method provided by the first embodiment.
The readable storage medium of this embodiment may be provided in the aforementioned processing device as a computer-readable storage medium, for example as the memory in the processing device. The readable storage medium may be any medium that can store program code, such as a USB disk, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
The above description covers only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions that can readily occur to those skilled in the art within the technical scope disclosed by the present invention fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. An understanding-assisted dialog system manual evaluation method, comprising:
screening several dialogue evaluation criteria from existing evaluation criteria, constructing an evaluation criteria framework, and generating a basic evaluation template, wherein the basic evaluation template comprises a dialogue history, a dialogue reply and a scoring area, and the scoring area contains the screened dialogue evaluation criteria;
designing reading questions modeled on reading-comprehension tests, embedding the reading questions into the dialogue content to be evaluated on the basic evaluation template, generating an evaluation template containing the reading questions, and providing it to the workers participating in manual evaluation of the dialogue system;
receiving the evaluation templates containing reading questions filled in by the workers, extracting the answers to the reading questions from them, screening the workers by the correctness of those answers, and extracting the evaluations of the dialogue content from the templates filled in by the screened workers as the manual evaluation results;
wherein designing reading questions modeled on reading-comprehension tests and embedding them into the dialogue content to be evaluated on the basic evaluation template comprises: designing a missing-sentence selection strategy and a dialogue-content ordering strategy modeled on reading-comprehension tests; selecting either strategy according to the dialogue content to be evaluated, generating the corresponding reading question based on the selected strategy, and embedding it into the dialogue content to be evaluated; when the missing-sentence selection strategy is selected, turning a sentence at a designated position A of the dialogue content to be evaluated into a single-choice test question whose options comprise the original sentence at position A and a sentence randomly drawn from the data set; and when the dialogue-content ordering strategy is selected, randomly scrambling the sentences of the dialogue content to be evaluated and requiring the workers to reorder them.
2. The understanding-assisted dialog system manual evaluation method according to claim 1, wherein the several dialogue evaluation criteria screened from the existing evaluation criteria are the five criteria Readability, Relevance, Consistency, Informativeness and Naturalness, representing readability, relevance, consistency, informativeness and naturalness, respectively.
3. The understanding-assisted dialog system manual evaluation method according to claim 1, wherein selecting either strategy according to the dialogue content to be evaluated comprises:
selecting the missing-sentence selection strategy or the dialogue-content ordering strategy when the number of dialogue turns of the dialogue content to be evaluated is equal to or greater than a set value;
and selecting the missing-sentence selection strategy when the number of dialogue turns of the dialogue content to be evaluated is less than the set value.
4. An understanding-assisted dialog system manual evaluation apparatus, implemented based on the method of any one of claims 1 to 3 and comprising:
an evaluation criteria screening and basic evaluation template generation unit, configured to screen several dialogue evaluation criteria from existing evaluation criteria, construct an evaluation criteria framework, and generate a basic evaluation template;
a reading question embedding unit, configured to design reading questions modeled on reading-comprehension tests, embed them into the dialogue content to be evaluated on the basic evaluation template, generate an evaluation template containing the reading questions, and provide it to the workers participating in manual evaluation of the dialogue system;
and an evaluation result screening unit, configured to receive the evaluation templates containing reading questions filled in by the workers, extract the answers to the reading questions, screen the workers by the correctness of those answers, and extract the evaluations of the dialogue content from the templates filled in by the screened workers as the manual evaluation results.
5. A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-3.
6. A readable storage medium, storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any of claims 1-3.
CN202210436767.3A 2022-04-25 2022-04-25 Understanding-assisted dialog system manual evaluation method and device and storage medium Active CN114528821B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210436767.3A CN114528821B (en) 2022-04-25 2022-04-25 Understanding-assisted dialog system manual evaluation method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210436767.3A CN114528821B (en) 2022-04-25 2022-04-25 Understanding-assisted dialog system manual evaluation method and device and storage medium

Publications (2)

Publication Number / Publication Date
CN114528821A (en) 2022-05-24
CN114528821B (en) 2022-09-06

Family

ID=81628251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210436767.3A Active CN114528821B (en) 2022-04-25 2022-04-25 Understanding-assisted dialog system manual evaluation method and device and storage medium

Country Status (1)

Country Link
CN (1) CN114528821B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7010746B2 (en) * 2002-07-23 2006-03-07 Xerox Corporation System and method for constraint-based document generation
EP3278319A4 (en) * 2015-04-03 2018-08-29 Kaplan Inc. System and method for adaptive assessment and training

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040007927A (en) * 2002-07-12 2004-01-28 유종연 Autonomy Human Power Estimation Service Through The Interet Communication
CN111046152A (en) * 2019-10-12 2020-04-21 平安科技(深圳)有限公司 FAQ question-answer pair automatic construction method and device, computer equipment and storage medium
CN112330303A (en) * 2020-11-27 2021-02-05 同济大学建筑设计研究院(集团)有限公司 Intelligent project evaluation cooperative management system
CN113536808A (en) * 2021-08-18 2021-10-22 北京师范大学 Reading understanding test question difficulty automatic prediction method introducing multiple text relations

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhang Shijing, "Research on Library Service Robots Based on Deep Learning", Automation & Instrumentation, 2022-03-25, full text *
Cao Yaru; Zhang Liping; Zhao Lele, "Research Progress on Multi-turn Task-oriented Dialogue Systems", Application Research of Computers, 2021 *

Also Published As

Publication number Publication date
CN114528821A (en) 2022-05-24


Legal Events

Date / Code / Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
CB03: Change of inventor or designer information
Inventor after change: Li Huaqing; Xiang Yuanxin; He Xiangnan
Inventor before change: Li Huaqing; He Xiangnan; Xiang Yuanxin