CN110489756B - Conversational human-computer interactive spoken language evaluation system - Google Patents


Info

Publication number
CN110489756B
CN110489756B (application CN201910781649.4A)
Authority
CN
China
Prior art keywords
module
evaluation
user
voice
human
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910781649.4A
Other languages
Chinese (zh)
Other versions
CN110489756A (en)
Inventor
王鑫 (Wang Xin)
许昭慧 (Xu Zhaohui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Squirrel Classroom Artificial Intelligence Technology Co Ltd
Original Assignee
Shanghai Squirrel Classroom Artificial Intelligence Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Squirrel Classroom Artificial Intelligence Technology Co Ltd filed Critical Shanghai Squirrel Classroom Artificial Intelligence Technology Co Ltd
Priority to CN202011100849.8A, published as CN112232083A
Priority to CN201910781649.4A, published as CN110489756B
Priority to CN202011101041.1A, published as CN112307742B
Publication of CN110489756A
Application granted
Publication of CN110489756B
Legal status: Active (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/20 Education
    • G06Q50/205 Education administration or guidance
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B7/00 Electrically-operated teaching apparatus or devices working with questions and answers
    • G09B7/02 Electrically-operated teaching apparatus or devices working with questions and answers of the type wherein the student is expected to construct an answer to the question which is presented or wherein the machine gives an answer to the question presented by a student

Abstract

The application relates to a conversational human-computer interactive spoken language evaluation system: a scene-driven, task-oriented dialogue system, built on human-computer dialogue and speech evaluation techniques, applied to spoken language assessment. The evaluation system of the present application has three main features: it is conversational, scene-driven, and task-oriented. Through a task-oriented dialogue system that communicates with the user in natural language, the student user's ability to actually use the language and to communicate comprehensively in English can be assessed, producing a washback effect on the student user's spoken-language learning and the teacher's spoken-language teaching.

Description

Conversational human-computer interactive spoken language evaluation system
Technical Field
The present application relates to the technical field of human-computer interaction, and in particular to a conversational human-computer interactive spoken language evaluation system.
Background
There are two main types of spoken-language testing: the face-to-face interview and the recorded oral test. The interview has high validity but is time-consuming and labor-intensive to organize. Large-scale oral tests therefore adopt a human-computer interaction mode: examinees complete listening and speaking test questions through a computer and headset equipment, and fully automatic intelligent scoring judges the answers along multiple dimensions such as sentence prosody, completeness, and accuracy, and can generate an evaluation report for the answers.
In online language-training products, speech recognition and speech evaluation technologies are widely adopted. The student user's pronunciation is compared with the machine's pronunciation and scored in a cycle of "listen to the original audio, read aloud or repeat, receive a system score with multi-color visual feedback, and adjust"; through repeated practice, this aims to improve students' English listening comprehension and pronunciation.
Disclosure of Invention
Through long-term observation and research, the inventors found that spoken English differs from other courses in that its main purpose is not imparting knowledge: English is a carrier of knowledge and culture, and a student user needs to use language to express ideas and communicate with others in order to achieve genuine communication. Developing students' ability to actually use the language and to communicate comprehensively in English has therefore become the main teaching task of spoken English. Examination and evaluation should serve teaching; however, the English evaluation technologies applied to human-computer interaction have the following disadvantages:
the examination method comprises the steps of firstly, examining the spoken language level of a student through prerecorded voice examination questions, wherein the form is single, the questions are specified in advance, the examination content is in an instruction form, the student passively receives the examination questions and scores, the examination-requiring spoken language examination is generally that the student speaks and auditors listen to the examination questions and then marks a score for the student, and the teaching and learning conditions cannot be comprehensively reflected. In the interview, emotional interaction between the examiner and the examinee can also interfere with the evaluation result.
Second, traditional classroom or online oral assessment is summative, exam-oriented assessment: a test-question-driven assessment experience that judges a student's learning result for a term through a single end-of-term examination, or determines the student's class level through a diagnostic test before the term begins and then promotes students level by level.
Third, through read-aloud/repeat activities in learning, the student user compares his or her own pronunciation with the machine's pronunciation and repeatedly revises it based on scoring feedback, which helps English listening ability and pronunciation. However, the existing techniques cannot measure the student's actual ability to use the language and to communicate comprehensively in English, still less produce an enlightening effect on spoken-English learning.
In view of the above defects in the prior art, the present application provides a conversational human-computer interactive spoken language assessment system: a scene-driven, task-oriented dialogue system, built on human-computer dialogue and speech evaluation techniques, applied to spoken language assessment. The evaluation system of the present application has three main features: it is conversational, scene-driven, and task-oriented. Through a task-oriented dialogue system that communicates with the user in natural language, the student user's ability to actually use the language and to communicate comprehensively in English can be assessed, producing a washback effect on the student user's spoken-language learning and the teacher's spoken-language teaching.
The application provides a conversational human-computer interactive spoken language evaluation system including a dialogue system, the dialogue system comprising: a voice recognition module configured to recognize a user's voice input and convert it into text; an intent understanding module configured to perform semantic understanding on the converted text to identify the user intent; a dialogue management module configured to generate a corresponding system action based on the understanding result of the intent understanding module; a language generation module configured to convert the system action generated by the dialogue management module into natural language; and a language synthesis module configured to convert the natural language into speech and feed it back to the user.
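The five-module turn described above can be sketched as a simple pipeline. This is an illustrative sketch only, not the patent's implementation; the function names and the toy stand-in modules below are hypothetical.

```python
# Illustrative sketch of the five-module dialogue pipeline (hypothetical names).
def run_turn(audio, asr, nlu, dm, nlg, tts):
    text = asr(audio)      # voice recognition: speech -> text
    intent = nlu(text)     # intent understanding: text -> intent
    action = dm(intent)    # dialogue management: intent -> system action
    reply = nlg(action)    # language generation: action -> natural language
    return tts(reply)      # language synthesis: text -> speech for the user

# Toy stand-ins so the pipeline runs end to end:
asr = lambda audio: "hello i am ray"
nlu = lambda text: {"intent": "greet", "name": "ray"}
dm  = lambda intent: {"action": "greet_back", "name": intent["name"]}
nlg = lambda action: f"Hello, {action['name'].title()}! Nice to meet you."
tts = lambda reply: ("<audio>", reply)

audio_out, reply_text = run_turn(b"...", asr, nlu, dm, nlg, tts)
```

In a real system each stand-in would be replaced by a model or service; the point of the sketch is only the data flow between the five modules.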
In some embodiments, optionally, the intent understanding module is further configured to perform slot filling, wherein a slot is a piece of information that must be completed during the session to turn the user intent into an explicit user instruction.
In some embodiments, optionally, the intent understanding module is further configured to understand the user intent based on the user profile and/or the scene information.
In some embodiments, optionally, the dialog management module further includes a dialog state tracking module configured to be able to represent the phase of the dialog and to fuse context information of the dialog process.
In some embodiments, optionally, the dialog management module further comprises a dialog policy learning module configured to generate a next operation of the system based on the current dialog state.
In some embodiments, optionally, the system further comprises an evaluation system, the evaluation system comprising: a scene-dialogue speech and semantic evaluation module configured to compare the similarity of the text converted from the user's speech against standard speech and semantic contents and obtain a speech evaluation score and a semantic evaluation score; a grammar evaluation and error-checking module configured to perform grammar checking on the text converted from the user's speech and obtain a grammar evaluation score; and a confusable-sound evaluation module configured to mark confusable-sound errors in the text converted from the user's speech, thereby evaluating confusable sounds.
In some embodiments, optionally, the dialog management module is further configured to generate a corresponding system action based on the evaluation result of the evaluation system.
In some embodiments, optionally, the higher the similarity between the user's speech and the phonemes of the standard speech, the higher the speech evaluation score; and the higher the similarity between the content expressed by the user and the reference answer, the higher the semantic evaluation score.
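The "higher similarity, higher score" rule can be illustrated with a token-overlap metric standing in for the real phoneme- and meaning-level similarity measures, which the patent does not specify; `SequenceMatcher` is used here only as a placeholder, and the scoring scale is invented for the example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Ratio in [0, 1]; a stand-in for phoneme- or meaning-level similarity."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

def speech_score(user_phonemes, standard_phonemes, scale=100):
    # Higher phoneme similarity -> higher speech evaluation score.
    return round(similarity(user_phonemes, standard_phonemes) * scale)

def semantic_score(user_text, reference_answer, scale=100):
    # Higher similarity to the reference answer -> higher semantic score.
    return round(similarity(user_text, reference_answer) * scale)
```

A production system would use an acoustic model for the phoneme comparison and a semantic-similarity model for the answer comparison; only the monotonic scoring relationship is taken from the text above.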
In some embodiments, optionally, the grammar evaluation and error-checking module is further configured to examine logical relationships in the sentence, the logical relationships including one or more of the following: subject-predicate agreement, tense expression, syntactic structure, and singular/plural agreement.
In some embodiments, optionally, the conversational human-computer interactive spoken language assessment system is based on stand-alone and/or online computer systems to carry out the evaluation of language-course content.
Compared with the prior art, the beneficial effects of this application include at least the following:
the first and the second application are conversation type human-computer interaction spoken language evaluation systems, a large number of communication opportunities with different virtual people are provided through human-computer conversation, communication scenes are created, a positive reverse dialing effect can be achieved for learning and teaching of student users through repeated communication practices, the learning attitude of students can be changed through a tested reverse dialing effect, and the enthusiasm of learning and using spoken language of the students at ordinary times is stimulated. Furthermore, the conversational human-computer interactive spoken language evaluation system can also avoid emotional interaction between human examiners and examinees.
Second, the application is a scene-driven spoken language assessment system, a meaningful technique that can reflect what has been taught as well as the learning content and the learning process. Not only can detailed evaluation feedback be obtained while completing the learning task; the system also discovers student users' problems in pronunciation, intonation, communication, and expression and analyzes their causes, and it can collect rich student-user speech together with the communication strategies adopted, which is of great significance for teachers to subsequently provide personalized guidance. Moreover, scene-driven assessment can reduce student users' tension and anxiety and truly reflect their actual level and performance.
Third, the application is a task-oriented spoken language evaluation system. Task-oriented spoken-language activities emphasize meaning rather than linguistic form, so student users can more easily experience success and achievement, which stimulates intrinsic learning interest and desire and leads to better performance. Interactive spoken English emphasizes giving student users opportunities for first-hand experience: by participating in real, natural, interactive activities they seek knowledge and discover problems, construct their own communication modes, concepts, and strategies, and, by completing tasks, achieve the learning purpose of transmitting information and expressing ideas.
The conception, specific structure, and technical effects of the present application are further described below in conjunction with the accompanying drawings, so that the purpose, characteristics, and effects of the present application can be fully understood.
Drawings
The present application will become more readily understood from the following detailed description when read in conjunction with the accompanying drawings, wherein like reference numerals designate like parts throughout the figures, and in which:
fig. 1 is a schematic structural diagram of a functional module according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a program module according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The present application may be embodied in many different forms of embodiments and the scope of the present application is not limited to only the embodiments set forth herein. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without making any creative effort shall fall within the protection scope of the present application.
Ordinal terms such as "first" and "second" are used herein only for distinguishing and identifying, and do not have any other meanings, unless otherwise specified, either by indicating a particular sequence or by indicating a particular relationship. For example, the term "first component" does not itself imply the presence of a "second component", nor does the term "second component" itself imply the presence of a "first component".
Fig. 1 is a schematic structural diagram of a functional module according to an embodiment of the present application. As shown in FIG. 1, the conversational human-computer interactive spoken language assessment system may be based on stand-alone and/or online computer systems to carry out the evaluation of language-course content, and includes a dialogue system and an evaluation system.
The dialog system includes a speech recognition module, an intent understanding module, a dialog management module, a language generation module, and a language synthesis module. The voice recognition module can recognize the voice input of the user and convert the voice input into text; the intention understanding module can carry out semantic understanding on the converted text to identify the intention of the user; the dialogue management module can generate corresponding system action based on the understanding result of the intention understanding module; the language generation module can convert the system action generated by the dialogue management module into natural language; the language synthesis module can convert the natural language into voice and feed back to the user.
In some embodiments, the speech recognition module is responsible for recognizing the student user's speech input and converting it into text; the intent understanding module is responsible for semantic understanding of the text converted from the student user's speech, including user intent recognition and slot filling, wherein a slot is a piece of information that must be completed during the session to turn the user intent into an explicit user instruction; the dialogue management module is responsible for managing the whole dialogue, including dialogue state tracking and dialogue policy learning; the language generation module is responsible for converting the system action selected by the dialogue policy module into natural language; and the language synthesis module is responsible for converting the text into speech and finally feeding it back to the student user. The intent understanding module can also understand the user intent based on the user profile and/or the scene information.
Intent recognition can be regarded as a text-based multi-class classification problem, i.e., determining the corresponding category from the user's utterance. An intent can be understood as a function or flow of an application that serves the user's request and purpose; for example, when the student user says "My name is Carol" or "This is Carol", the self-introduction intent can be triggered. A slot is a piece of information that must be completed during multi-turn dialogue to turn a preliminary user intent into an explicit user instruction; one slot corresponds to one piece of information that must be obtained in the course of handling one matter. In "My name is Carol" as expressed by the student user, "Carol" fills the name slot. The intent understanding module takes not only the speech as input but also considers the user profile and scene information; a more comprehensive context can improve the accuracy of intent understanding.
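The intent-classification and slot-filling ideas above can be sketched with a small rule-based recognizer. This is a minimal illustration, not the patent's model; the patterns, intent names, and slot handling are invented for the example (a production system would use a trained classifier).

```python
import re

# Hypothetical intent patterns; the first capture groups hold the name slot.
INTENT_PATTERNS = {
    "self_introduction": re.compile(r"\bmy name is (\w+)|\bthis is (\w+)", re.I),
    "greeting": re.compile(r"\b(hello|hi|good (morning|afternoon))\b", re.I),
}

def understand(text):
    """Return (intent, slots); slots holds e.g. the 'name' slot value."""
    for intent, pat in INTENT_PATTERNS.items():
        m = pat.search(text)
        if m:
            slots = {}
            if intent == "self_introduction":
                # Whichever alternative matched carries the name slot value.
                slots["name"] = next(g for g in m.groups() if g)
            return intent, slots
    return "unknown", {}
```

So "My name is Carol" triggers the self-introduction intent with the name slot filled by "Carol", matching the example in the paragraph above.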
The user profile may include the student user's name, grade, and location; dimensions of spoken-language level such as pronunciation accuracy, completeness, and fluency; and behavioral characteristics, interests and hobbies, and the like. The user profile can be updated in real time in every round of dialogue and, combined with the context information, influences the next round, giving the virtual person the function of memory. As the number of dialogues increases, the system understands the student user better, and the responses the virtual person gives the student user become smoother.
The dialogue management module may also include a dialogue state tracking module and/or a dialogue policy learning module. The dialogue state tracking module can represent the stage of the dialogue and fuse the context information of the dialogue process. The dialogue policy learning module can generate the system's next operation according to the current dialogue state. In some embodiments, the dialogue state tracking module represents the current dialogue state information, i.e., the current stage of the whole dialogue within the dialogue system, and fuses the context information of the dialogue process; the dialogue policy learning module generates the system's next operation according to the current dialogue state.
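The state-tracking and policy roles described above can be sketched as follows. The class, state fields, and hand-written policy rules are hypothetical stand-ins; the patent's policy module would typically be learned rather than written by hand.

```python
class DialogStateTracker:
    """Tracks the dialogue stage and fuses slot/context info from past turns."""
    def __init__(self):
        self.state = {"stage": 0, "slots": {}}

    def update(self, intent, slots):
        self.state["stage"] += 1                 # advance the dialogue stage
        self.state["slots"].update(slots)        # fuse context across turns
        self.state["last_intent"] = intent
        return self.state

def policy(state):
    """Map the current dialogue state to the system's next operation."""
    if state.get("last_intent") == "greeting" and "name" not in state["slots"]:
        return "ask_name"
    if "name" in state["slots"]:
        return "ask_hometown"
    return "fallback"
```

The tracker keeps accumulated slots across turns, which is what lets the policy choose a different next operation once the name slot has been filled.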
The evaluation system may comprise a scene-dialogue speech and semantic evaluation module, a grammar evaluation and error-checking module, and a confusable-sound evaluation module. The scene-dialogue speech and semantic evaluation module can compare the similarity of the text converted from the user's speech against standard speech and semantic contents and obtain a speech evaluation score and a semantic evaluation score; the grammar evaluation and error-checking module can perform grammar checking on the text converted from the user's speech and obtain a grammar evaluation score; and the confusable-sound evaluation module can mark confusable-sound errors in the text converted from the user's speech, thereby evaluating confusable sounds.
In some embodiments, the evaluation system may include three modules: a scene-dialogue speech and semantic evaluation module, a grammar evaluation and error-checking module, and a confusable-sound evaluation module. The scene-dialogue speech and semantic evaluation module is responsible for comparing the similarity of the text converted from the student user's speech against the standard speech and semantic contents: the higher the similarity between the user's speech and the phonemes of the standard speech, the higher the speech evaluation score, and the higher the similarity between the content expressed by the user and the reference answer, the higher the semantic evaluation score. The grammar evaluation and error-checking module is responsible for scoring grammar errors in the text converted from the student user's speech and pointing them out, mainly examining logical relationships in the sentence, including singular/plural agreement, subject-predicate agreement, tense expression, and use of syntactic structures; the fewer the grammar errors, the higher the evaluation score. The confusable-sound evaluation module is responsible for marking confusable-sound errors in the text converted from the student user's speech, thereby evaluating confusable sounds; this requires that errors frequently made by Chinese students be included in the training corpus of the model in the speech recognition module, so that the speech recognition module does not actively correct these errors.
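The subject-predicate agreement check and the "fewer errors, higher score" rule can be illustrated with a tiny rule-based sketch. The lexicon, penalty scheme, and function names are invented for the example and are far simpler than a real grammar checker.

```python
THIRD_PERSON_SINGULAR = {"he", "she", "it"}
BASE_FORMS = {"come", "go", "like", "have"}   # tiny illustrative verb lexicon
IRREGULAR_3SG = {"have": "has"}

def check_agreement(sentence):
    """Flag 'he come'-style subject-predicate agreement errors."""
    words = sentence.lower().rstrip(".!?").split()
    errors = []
    for subj, verb in zip(words, words[1:]):
        if subj in THIRD_PERSON_SINGULAR and verb in BASE_FORMS:
            fixed = IRREGULAR_3SG.get(verb, verb + "s")
            errors.append((verb, fixed))   # (wrong form, suggested fix)
    return errors

def grammar_score(sentence, full_marks=100, penalty=20):
    # Fewer grammar errors -> higher evaluation score.
    return max(0, full_marks - penalty * len(check_agreement(sentence)))
```

A real module would also cover tense, syntactic structure, and singular/plural agreement as listed above; the sketch shows only the agreement dimension.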
The dialogue management module can generate corresponding system action according to the evaluation result of the evaluation system. In some embodiments, the evaluation results of the three modules of the evaluation system enter the dialogue management module of the dialogue system, and the dialogue management module can respond by combining the evaluation target and the strategy after obtaining the evaluation result of the evaluation system on the user voice.
Fig. 2 is a schematic structural diagram of a program module according to an embodiment of the present application. As shown in fig. 2, the system first takes out the first test point; the test point corresponds to a task to be completed in a scene, and the student user sees the description of the task on the front-end interface.
In some embodiments, in the conversational human-computer interactive spoken language assessment system, the description of the task provides the student user with the dialogue background and scene information and is used to have the student user complete a real, natural, interpersonal task-type activity; the front end can present the scene through virtual reality, and from the rich three-dimensional information the student user can obtain an overall experience approximating a real conversation with a person.
By adopting the above technical scheme, the system starts the dialogue according to the context information; depending on the requirements of different test points, either the user or the system may ask first. After the student user's speech is converted into text through speech recognition and the intent is recognized by the intent recognition module, the text obtains scores and error contents along the multiple dimensions of speech, semantics, grammar, and confusable sounds through the evaluation module, and the new information is updated into the user profile.
In some embodiments, in the conversational human-computer interactive spoken language assessment system, the evaluation module includes: speech and semantic evaluation of scene dialogues, grammar evaluation and error checking, and confusable-sound evaluation. The evaluation results are needed for displaying the evaluation report after the evaluation finishes, and can also serve as information for the virtual person's dialogue responses, so that the language complexity, speed, or intelligibility of the machine's side of the dialogue can be automatically adjusted for different dialogue partners.
By adopting the above technical scheme: after the student user's speech is converted into text through speech recognition, the text obtains the intent of the dialogue through intent recognition, the slot is extracted from the student user's expression, the student user's speech is understood, the content of the next turn is determined, and the virtual person speaks it through language generation; the whole process cycles through several test points until the evaluation finishes, after which an evaluation report is generated.
In some embodiments, in the conversational human-computer interactive spoken language assessment system, the evaluation report includes the student's basic information and the process-level evaluation results of the spoken-language level. It can indicate the positions of the student user's speech and grammar errors, such as nonstandard pronunciation, inaccurate intonation, and frequently made grammar errors, and it can further analyze, from the student user's behavioral characteristics, the ability to use the language comprehensively and the communication strategies used.
In some embodiments, the conversational, human-computer interactive spoken language assessment system may include: a dialogue system and an evaluation system. In practice, as an example, the working process is as follows:
the system takes out a first examination point at first, the examination point corresponds to a task to be completed in a scene, and a student user sees a description of the task on a front-end interface, such as: the examination point is that strangers are acquainted through English expression, the system can display a proper conversation scene through rich text or virtual reality, and student users see the following task descriptions: recognize new friends, politely greet, and ask the other party for their name and where from.
The system starts the dialogue based on the context information, and this test point is set so that the user asks first. The student user says "Hello, I'm Ray." After the student user's speech is converted into text through speech recognition, intent recognition determines that the intent of the dialogue is a greeting, and the evaluation module produces scores along the multiple dimensions of speech, semantics, grammar, and confusable sounds, which are updated into the user profile.
The recognized intent is a greeting, and the slot is extracted from the student user's expression: a name slot with the parameter value "Ray". After the student user's speech is understood, the content of the next turn is determined and the virtual person speaks it through language generation. The whole process cycles through several test points until the evaluation finishes, after which an evaluation report is generated.
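The loop over test points ending in a report can be sketched as follows; the report layout, dimension names, and the stub evaluator are hypothetical, standing in for the full dialogue-plus-evaluation turn described above.

```python
def run_assessment(test_points, evaluate):
    """Loop through test points, collect per-dimension scores, build a report."""
    report = {"turns": [], "averages": {}}
    for point in test_points:
        scores = evaluate(point)   # speech/semantics/grammar/confusables
        report["turns"].append({"test_point": point, "scores": scores})
    # Average each scoring dimension across all test points.
    for dim in report["turns"][0]["scores"]:
        vals = [t["scores"][dim] for t in report["turns"]]
        report["averages"][dim] = sum(vals) / len(vals)
    return report

# Stub evaluator returning fixed scores, so the loop runs end to end:
fake_eval = lambda p: {"speech": 80, "semantics": 90,
                       "grammar": 70, "confusables": 100}
report = run_assessment(["greet a stranger", "ask for directions"], fake_eval)
```

In the described system, `evaluate` would encapsulate one full dialogue turn per test point before its scores land in the report.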
In some embodiments, the process further comprises: when the system asks "Where do you come from?" and the student user answers with the name of a small hometown city that is beyond the system's comprehension range, the dialogue state tracking module fuses the context information of the dialogue process according to the current stage of the whole dialogue, the dialogue policy learning module adopts a general response strategy, and the system responds through the virtual person with "Wow! This is a nice place!" to keep the session going.
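The general-response fallback strategy above can be sketched as follows; the known-place list and the response strings are invented for the illustration, with the fallback line taken from the example in the preceding paragraph.

```python
KNOWN_PLACES = {"shanghai", "beijing", "london", "new york"}

def respond_to_place(user_answer):
    """Use a generic response when the place is outside the system's range."""
    place = user_answer.strip().rstrip(".").lower()
    if place in KNOWN_PLACES:
        return f"Oh, {place.title()}! I have heard a lot about it."
    # General response strategy: keep the session going without understanding.
    return "Wow! This is a nice place!"
```

The design point is graceful degradation: an out-of-vocabulary answer does not stall the dialogue, it merely triggers a safe generic reply.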
In some embodiments, the process may further comprise: in a scene of taking an airplane, when the student user says something that goes against the rule that mobile phones are not allowed on the airplane, and the user profile indicates that the student user's social-norm score is low, the dialogue policy selection preferentially chooses a serious persuasion response.
In some embodiments, the various methods, processes, modules, apparatuses, devices, or systems described above may be implemented or performed in one or more processing devices (e.g., digital processors, analog processors, digital circuits designed to process information, analog circuits designed to process information, state machines, computing devices, computers, and/or other mechanisms for electronically processing information). The one or more processing devices may include one or more devices that perform some or all of the operations of a method in response to instructions stored electronically on an electronic storage medium, and may include one or more devices configured through hardware, firmware, and/or software to be specifically designed for performing one or more operations of a method.

The above description covers only preferred embodiments of the present application, but the scope of the present application is not limited thereto; equivalent alternatives or modifications that any person skilled in the art could readily conceive within the technical scope disclosed by the present application, according to its technical solutions and inventive concept, are all encompassed within the scope of the present application.
Embodiments of the present application may be implemented in hardware, firmware, software, or various combinations thereof. The present application may also be implemented as instructions stored on a machine-readable medium, which may be read and executed by one or more processing devices. In one implementation, a machine-readable medium may include various mechanisms for storing and/or transmitting information in a form readable by a machine (e.g., a computing device). For example, a machine-readable storage medium may include read-only memory, random-access memory, magnetic disk storage media, optical storage media, flash-memory devices, and other media for storing information, and a machine-readable transmission medium may include various forms of propagated signals (including carrier waves, infrared signals, and digital signals) and other media for transmitting information. While firmware, software, routines, or instructions may be described in the above disclosure as performing certain actions in certain exemplary aspects and embodiments, it will be apparent that such descriptions are merely for convenience and that such actions in fact result from a machine, computing device, processing device, processor, controller, or other device executing the firmware, software, routines, or instructions.
This specification discloses the application by way of examples, one or more of which are described or illustrated in the specification and drawings. Each example is provided by way of explanation of the application, not limitation of the application. In fact, it will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from its scope or spirit. For instance, features illustrated or described as part of one embodiment can be used with another embodiment to yield a still further embodiment. It is therefore intended that the present application cover the modifications and variations of this application provided they come within the scope of the appended claims and their equivalents, including any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application.
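The dialogue pipeline underlying the system recited in the claims (speech recognition, intent understanding, dialogue management, language generation, and speech synthesis) can be pictured as a simple chain of modules. The sketch below is illustrative only; the class name, the stub lambdas, and their signatures are assumptions rather than the patent's implementation.

```python
# Illustrative chain of the five dialogue-system modules; all class and
# method names here are hypothetical, not taken from the patent.

class DialogueSystem:
    def __init__(self, asr, nlu, dm, nlg, tts):
        self.asr, self.nlu, self.dm, self.nlg, self.tts = asr, nlu, dm, nlg, tts

    def turn(self, audio, user_portrait, scenario):
        text = self.asr(audio)                            # speech recognition -> text
        intent = self.nlu(text, user_portrait, scenario)  # intent understanding
        action = self.dm(intent)                          # dialogue management -> system action
        reply = self.nlg(action)                          # language generation -> natural language
        return self.tts(reply)                            # speech synthesis -> audio reply

# Stub functions standing in for real models:
system = DialogueSystem(
    asr=lambda audio: "i am from a small city",
    nlu=lambda text, portrait, scenario: {"intent": "tell_hometown"},
    dm=lambda intent: {"act": "praise_place"},
    nlg=lambda action: "Wow! This is a nice place!",
    tts=lambda reply: f"<audio:{reply}>",
)
print(system.turn(b"...", user_portrait={}, scenario="hometown"))
# -> <audio:Wow! This is a nice place!>
```

Each stage consumes the previous stage's output, which is why the evaluation system in the claims can operate on the recognized text without touching the audio path.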

Claims (9)

1. A conversational human-computer interaction spoken language evaluation system is characterized by comprising a dialogue system and an evaluation system, wherein the dialogue system comprises:
a voice recognition module configured to recognize a voice input of a student user and convert it into text;
an intent understanding module configured to perform semantic understanding of the converted text in conjunction with a user portrait comprising the user's spoken language proficiency dimensions and scenario information comprising the virtual scene in which the current conversation occurs, so as to identify the user intent of the student user in the spoken conversation;
a dialogue management module configured to produce a corresponding voice response based on the understanding result of the intent understanding module;
a language generation module configured to convert system actions produced by the dialogue management module into natural language; and
a speech synthesis module configured to convert the natural language into speech and feed it back to the student user;
the evaluation system is configured to perform multi-dimensional evaluation on the converted text and update the user portrait according to an evaluation result; and
the dialogue management module is further configured to enable a virtual human to give a corresponding voice response to the student user, in combination with an evaluation target and a dialogue strategy, based on the evaluation result of the evaluation system and the understanding result of the intent understanding module, so as to keep the conversation going;
when the response of the student user is beyond the range understandable by the intent understanding module, a dialogue state tracking module represents the current dialogue state according to the current stage of the whole conversation and the context information of the conversation process, and a dialogue strategy learning module adopts a general response strategy according to the current dialogue state and responds with general sentences through a virtual human to keep the conversation going.
2. The conversational, human-computer interactive spoken language assessment system according to claim 1, characterized by:
the intent understanding module is further configured to perform slot filling, extracting slots from user expressions to understand the user's speech and decide the content of the next turn of conversation, wherein a slot is the information that needs to be completed in order to convert the user intent into an explicit user instruction during the conversation.
3. The conversational, human-computer interactive spoken language assessment system according to claim 1, characterized by:
the dialogue management module further comprises a dialogue state tracking module configured to represent the stage of the dialogue and to fuse context information of the dialogue process.
4. The conversational, human-computer interactive spoken language assessment system according to claim 1, characterized by:
the dialogue management module further comprises a dialogue strategy learning module configured to generate the next operation of the system based on the current dialogue state.
5. The conversational, human-computer interactive spoken language assessment system according to claim 1, wherein the evaluation system comprises:
a scenario dialogue voice and semantic evaluation module configured to compare the similarity of the text converted from the user's voice against standard voice and semantic content, and to obtain a voice evaluation score and a semantic evaluation score;
a grammar evaluation and error check module configured to perform grammar checking on the text converted from the user's voice and to obtain a grammar evaluation score; and
an easily-confused-word evaluation module configured to mark easily-confused errors in the text converted from the user's voice, so as to evaluate easily-confused words.
6. The conversational, human-computer interactive spoken language assessment system according to claim 5, characterized by:
the dialogue management module is further configured to generate corresponding system actions according to the evaluation result of the evaluation system.
7. The conversational, human-computer interactive spoken language assessment system according to claim 5, characterized by:
the higher the similarity between the user's voice and the standard voice phonemes, the higher the voice evaluation score; and
the higher the similarity between the content expressed by the user and the reference answer, the higher the semantic evaluation score.
8. The conversational, human-computer interactive spoken language assessment system according to claim 5, characterized by:
the grammar evaluation and error check module is further configured to examine logical relationships in the sentence, the logical relationships including one or more of the following: subject-predicate agreement, tense expression, syntactic structure, and singular/plural number.
9. The conversational, human-computer interactive spoken language assessment system according to any of the preceding claims, characterized by:
the conversational human-computer interactive spoken language evaluation system is a computer system deployed in a stand-alone and/or online configuration to carry out evaluation of language content.
CN201910781649.4A 2019-08-23 2019-08-23 Conversational human-computer interactive spoken language evaluation system Active CN110489756B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202011100849.8A CN112232083A (en) 2019-08-23 2019-08-23 Man-machine conversation spoken language evaluation system
CN201910781649.4A CN110489756B (en) 2019-08-23 2019-08-23 Conversational human-computer interactive spoken language evaluation system
CN202011101041.1A CN112307742B (en) 2019-08-23 2019-08-23 Session type human-computer interaction spoken language evaluation method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910781649.4A CN110489756B (en) 2019-08-23 2019-08-23 Conversational human-computer interactive spoken language evaluation system

Related Child Applications (2)

Application Number Title Priority Date Filing Date
CN202011100849.8A Division CN112232083A (en) 2019-08-23 2019-08-23 Man-machine conversation spoken language evaluation system
CN202011101041.1A Division CN112307742B (en) 2019-08-23 2019-08-23 Session type human-computer interaction spoken language evaluation method, device and storage medium

Publications (2)

Publication Number Publication Date
CN110489756A CN110489756A (en) 2019-11-22
CN110489756B true CN110489756B (en) 2020-10-27

Family

ID=68553024

Family Applications (3)

Application Number Title Priority Date Filing Date
CN202011101041.1A Active CN112307742B (en) 2019-08-23 2019-08-23 Session type human-computer interaction spoken language evaluation method, device and storage medium
CN202011100849.8A Pending CN112232083A (en) 2019-08-23 2019-08-23 Man-machine conversation spoken language evaluation system
CN201910781649.4A Active CN110489756B (en) 2019-08-23 2019-08-23 Conversational human-computer interactive spoken language evaluation system

Family Applications Before (2)

Application Number Title Priority Date Filing Date
CN202011101041.1A Active CN112307742B (en) 2019-08-23 2019-08-23 Session type human-computer interaction spoken language evaluation method, device and storage medium
CN202011100849.8A Pending CN112232083A (en) 2019-08-23 2019-08-23 Man-machine conversation spoken language evaluation system

Country Status (1)

Country Link
CN (3) CN112307742B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110956142A (en) * 2019-12-03 2020-04-03 中国太平洋保险(集团)股份有限公司 Intelligent interactive training system
CN110910687A (en) * 2019-12-04 2020-03-24 深圳追一科技有限公司 Teaching method and device based on voice information, electronic equipment and storage medium
CN111368191B (en) * 2020-02-29 2021-04-02 重庆百事得大牛机器人有限公司 User portrait system based on legal consultation interaction process
CN111767718B (en) * 2020-07-03 2021-12-07 北京邮电大学 Chinese grammar error correction method based on weakened grammar error feature representation
CN111768667A (en) * 2020-07-15 2020-10-13 唐山劳动技师学院 Interactive cycle demonstration method and system for English teaching
CN112951207B (en) * 2021-02-10 2022-01-07 网易有道信息技术(北京)有限公司 Spoken language evaluation method and device and related product
CN114020894B (en) * 2021-11-08 2024-03-26 桂林电子科技大学 Intelligent evaluation system capable of realizing multi-wheel interaction
CN114170864B (en) * 2021-11-11 2024-03-29 卡斯柯信号有限公司 Scene comprehensive management and verification method and device for intelligent subway full-automatic operation
CN114339303A (en) * 2021-12-31 2022-04-12 北京有竹居网络技术有限公司 Interactive evaluation method and device, computer equipment and storage medium
CN115497455B (en) * 2022-11-21 2023-05-05 山东山大鸥玛软件股份有限公司 Intelligent evaluating method, system and device for oral English examination voice

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002061658A2 (en) * 2001-01-30 2002-08-08 Personal Genie, Inc. System and method for matching consumers with products
CN104050966B (en) * 2013-03-12 2019-01-01 百度国际科技(深圳)有限公司 The voice interactive method of terminal device and the terminal device for using this method
CN103594087B (en) * 2013-11-08 2016-10-12 科大讯飞股份有限公司 Improve the method and system of oral evaluation performance
CN105094315B (en) * 2015-06-25 2018-03-06 百度在线网络技术(北京)有限公司 The method and apparatus of human-machine intelligence's chat based on artificial intelligence
CN106326307A (en) * 2015-06-30 2017-01-11 芋头科技(杭州)有限公司 Language interaction method
CN105068661B (en) * 2015-09-07 2018-09-07 百度在线网络技术(北京)有限公司 Man-machine interaction method based on artificial intelligence and system
CN106558309B (en) * 2015-09-28 2019-07-09 中国科学院声学研究所 A kind of spoken dialog strategy-generating method and spoken dialog method
CN106558252B (en) * 2015-09-28 2020-08-21 百度在线网络技术(北京)有限公司 Spoken language practice method and device realized by computer
CN105513593B (en) * 2015-11-24 2019-09-17 南京师范大学 A kind of intelligent human-machine interaction method of voice driven
CN105741831B (en) * 2016-01-27 2019-07-16 广东外语外贸大学 A kind of oral evaluation method and system based on syntactic analysis
JP2018206055A (en) * 2017-06-05 2018-12-27 コニカミノルタ株式会社 Conversation recording system, conversation recording method, and care support system
CN109785698B (en) * 2017-11-13 2021-11-23 上海流利说信息技术有限公司 Method, device, electronic equipment and medium for oral language level evaluation

Also Published As

Publication number Publication date
CN112307742A (en) 2021-02-02
CN112307742B (en) 2021-10-22
CN112232083A (en) 2021-01-15
CN110489756A (en) 2019-11-22

Similar Documents

Publication Publication Date Title
CN110489756B (en) Conversational human-computer interactive spoken language evaluation system
Litman et al. Speech technologies and the assessment of second language speaking: Approaches, challenges, and opportunities
Ward et al. My science tutor: A conversational multimedia virtual tutor for elementary school science
Li et al. Effective approaches to teaching listening: Chinese EFL teachers’ perspectives
US11145222B2 (en) Language learning system, language learning support server, and computer program product
CN112819664A (en) Apparatus for learning foreign language and method for providing foreign language learning service using the same
Blanchard et al. A study of automatic speech recognition in noisy classroom environments for automated dialog analysis
Daniels et al. The suitability of cloud-based speech recognition engines for language learning.
CN106558252B (en) Spoken language practice method and device realized by computer
CN111833853A (en) Voice processing method and device, electronic equipment and computer readable storage medium
US10607504B1 (en) Computer-implemented systems and methods for a crowd source-bootstrapped spoken dialog system
KR20160008949A (en) Apparatus and method for foreign language learning based on spoken dialogue
McCrocklin Learners’ feedback regarding ASR-based dictation practice for pronunciation learning
Evanini et al. Overview of automated speech scoring
KR100995847B1 (en) Language training method and system based sound analysis on internet
Wilske Form and meaning in dialog-based computer-assisted language learning
Halimah et al. Cello As a Language Teaching Method in Industrial Revolution 4.0 Era
Lai et al. An exploratory study on the accuracy of three speech recognition software programs for young Taiwanese EFL learners
JP2015060056A (en) Education device and ic and medium for education device
CN114255759A (en) Method, apparatus and readable storage medium for spoken language training using machine
Liu Application of speech recognition technology in pronunciation correction of college oral English teaching
May et al. Bottom-Up, Top-Down, and Interactive Processing in Listening Comprehension
Shukla Development of a Human-AI Teaming Based Mobile Language Learning Solution for Dual Language Learners in Early and Special Educations
CN111078010A (en) Man-machine interaction method and device, terminal equipment and readable storage medium
Dalton et al. Using speech analysis to unmask perceptual bias: Dialect, difference, and tolerance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 10th Floor, Building B7, Huaxin Tiandi, 188 Yizhou Road, Xuhui District, Shanghai, 2003

Applicant after: Shanghai squirrel classroom Artificial Intelligence Technology Co., Ltd

Address before: 10th Floor, Building B7, Huaxin Tiandi, 188 Yizhou Road, Xuhui District, Shanghai, 2003

Applicant before: SHANGHAI YIXUE EDUCATION TECHNOLOGY Co.,Ltd.

GR01 Patent grant