CN114566167A - Voice answer method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114566167A
CN114566167A (application CN202210187828.7A)
Authority
CN
China
Prior art keywords
answer
voice
user
determining
finger
Prior art date
Legal status
Pending
Application number
CN202210187828.7A
Other languages
Chinese (zh)
Inventor
李守毅
徐言
王浩
Current Assignee
Anhui Toycloud Technology Co Ltd
Original Assignee
Anhui Toycloud Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Anhui Toycloud Technology Co Ltd
Priority to CN202210187828.7A
Publication of CN114566167A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination


Abstract

The invention provides a voice answering method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a user's answer speech for the current question; performing answer conversion on the speech recognition text of the answer speech, based on the characteristics of the current question's alternative answers in terms of answer number and/or answer content, to obtain the user's voice answer; and determining the answer result based on the voice answer and the standard answer of the current question. Because the method is not limited to specific trigger vocabulary, the user can answer more flexibly, effectively improving the user experience while ensuring the accuracy of voice answer recognition.

Description

Voice answer method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technology, and in particular to a voice answering method and apparatus, an electronic device, and a storage medium.
Background
Educational tool products provide a voice answering function: a user can complete online answering simply by listening to a question and then answering by voice, which effectively improves the convenience of answering.
Current voice answering schemes trigger an answer only on specific vocabulary; for example, the spoken answer must be A/B/C/D for the answer to register. If the answer repeatedly fails to trigger, the user easily becomes frustrated, and the user experience is poor.
Disclosure of Invention
The invention provides a voice answering method and apparatus, an electronic device, and a storage medium, to overcome the poor user experience caused in the prior art by triggering answers only on specific vocabulary.
The invention provides a voice answering method, comprising:
acquiring a user's answer speech for the current question;
performing answer conversion on the speech recognition text of the answer speech, based on the characteristics of the current question's alternative answers in terms of answer number and/or answer content, to obtain the user's voice answer; and
determining the answer result based on the voice answer and the standard answer of the current question.
According to the voice answering method provided by the invention, performing answer conversion on the speech recognition text of the answer speech based on the characteristics of the alternative answers in terms of answer number and/or answer content, to obtain the user's voice answer, comprises:
performing number-format conversion on the answer-number text in the speech recognition text to obtain a voice answer number conforming to the alternative answers' number format;
and/or performing entity extraction of a preset type on the speech recognition text and taking the extracted entities as the voice answer content, wherein the preset type is determined based on the characteristics of the alternative answers' answer content; and
determining the voice answer based on the voice answer number and/or the voice answer content.
According to the voice answering method provided by the invention, the preset type is determined based on the entity type of the answer content in the alternative answers.
According to the voice answering method provided by the invention, determining the answer result based on the voice answer and the standard answer of the current question comprises:
in the case that voice answer content exists in the voice answer, converting the voice answer content into pinyin to be checked; and
determining the answer result based on the pinyin to be checked and the answer pinyin of the standard answer.
According to the voice answering method provided by the invention, determining the answer result based on the voice answer and the standard answer of the current question comprises:
modifying the voice answer based on a finger answer and/or a gaze answer to obtain the user answer, wherein the finger answer is determined from an image of the user's finger pointing at the answer screen, and the gaze answer is determined from an image of the user's face; and
determining the answer result based on the user answer and the standard answer of the current question.
According to the voice answering method provided by the invention, the finger answer is determined by the following steps:
acquiring the finger pointing vector from the finger image; and
determining the finger answer based on the placement angle of the answer screen, the region occupied by each alternative answer on the answer screen, and the projected position of the finger pointing vector on the answer screen.
According to the voice answering method provided by the invention, the gaze answer is determined by the following steps:
determining the gaze direction of the user's eyes based on the eye features in the face image;
determining the spatial position of the eyes based on the key-point information of the face image; and
determining the gaze answer based on the gaze direction and the spatial position of the eyes.
The invention also provides a voice answering apparatus, comprising:
a speech acquisition unit, configured to acquire a user's answer speech for the current question;
an answer conversion unit, configured to perform answer conversion on the speech recognition text of the answer speech, based on the characteristics of the current question's alternative answers in terms of answer number and/or answer content, to obtain the user's voice answer; and
a result determination unit, configured to determine the answer result based on the voice answer and the standard answer of the current question.
The invention also provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements any of the voice answering methods described above.
The invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the voice answering methods described above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements any of the voice answering methods described above.
With the voice answering method and apparatus, electronic device, and storage medium provided by the invention, answer conversion is performed on the speech recognition text of the answer speech, based on the characteristics of the current question's alternative answers in terms of answer number and/or answer content, to obtain the user's voice answer. This answering scheme is not limited to specific trigger vocabulary, so the user can answer more flexibly, effectively improving the user experience while ensuring the accuracy of voice answer recognition.
Drawings
To illustrate the technical solutions of the present invention or the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the voice answering method provided by the present invention;
Fig. 2 is a schematic flowchart of step 120 of the voice answering method provided by the present invention;
Fig. 3 is the first schematic flowchart of step 130 of the voice answering method provided by the present invention;
Fig. 4 is the second schematic flowchart of step 130 of the voice answering method provided by the present invention;
Fig. 5 is a schematic flowchart of the finger answer determination method provided by the present invention;
Fig. 6 is a schematic flowchart of the gaze answer determination method provided by the present invention;
Fig. 7 is a schematic structural diagram of the voice answering apparatus provided by the present invention;
Fig. 8 is a schematic structural diagram of the electronic device provided by the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Existing voice answering methods usually trigger an answer only on specific vocabulary; for example, the spoken answer must be A/B/C/D for the voice answer to register. This answering mode is very restrictive, especially for young children, who like to answer in varied ways such as "the third one", "3", "I choose three", or "I choose the third". If the answer repeatedly fails to trigger, the user easily becomes frustrated, and the user experience is poor.
Therefore, a more flexible and accurate voice answering method, free of the limitation of specific trigger vocabulary, is needed to improve the user experience of voice answering.
In view of the above problems, an embodiment of the present invention provides a voice answering method. Fig. 1 is a schematic flowchart of the voice answering method provided by the present invention. As shown in Fig. 1, the method comprises:
step 110, obtaining the answer voice of the user for the current question.
Specifically, the current question is a question to be answered by voice, and may be any question drawn from a question bank, for example a mathematics question such as addition and subtraction within ten or within a hundred, an English question, or a Chinese comprehensive question. The current question may be a multiple-choice question, a true-or-false question, or a fill-in-the-blank question; this is not specifically limited in the embodiment of the present invention.
The user is the person answering the current question by voice; there may be one user or several. The user may be a preschool child, a primary school student, an adult, etc.
Here, the answer speech is the speech answering the current question and may be captured by a sound pickup device such as a smartphone, a tablet computer, or a dedicated voice answering machine. After picking up the answer speech through a microphone array, the device may further amplify it and reduce noise. The answer speech may be a speech segment formed after pickup ends, or a speech stream during real-time pickup; this is not specifically limited in the embodiment of the present invention.
Step 120, performing answer conversion on the speech recognition text of the answer speech, based on the characteristics of the current question's alternative answers in terms of answer number and/or answer content, to obtain the user's voice answer.
Specifically, the characteristic of the alternative answers in terms of answer number refers to the numbering feature of the candidate options presented for the user to choose from, for example the letter sequence A/B/C/D, the ordinal sequence "first", "second", "third", "fourth", or the numeric sequence "1", "2", "3", "4".
The characteristic of the alternative answers in terms of answer content refers to the feature of each alternative answer's content. For example, the answer content of alternatives A and B may be person names while that of alternatives C and D may be place names; as another example, the answer content of all of A/B/C/D may be times.
For example, the current question is: Who was the first emperor of China?
A: Huangdi  B: Yandi  C: Ying Zheng  D: Liu Bang
Here, the answer-number feature is the letter sequence A/B/C/D, and the answer-content feature is that "Huangdi", "Yandi", "Ying Zheng", and "Liu Bang" are person names.
In the prior art, answer triggering is usually performed on specific vocabulary. If the alternative answers of the current question are numbered with letters while the user's answer speech uses ordinals, the answer cannot be triggered, and the system may wrongly treat the user as not having answered or as having answered incorrectly. Likewise, if the user does not speak the answer number but directly speaks the answer content, the answer cannot be triggered either, giving a poor user experience.
The voice answering method provided by the embodiment of the present invention therefore first performs speech recognition on the answer speech to obtain its speech recognition text, and then performs answer conversion on that text; the user's voice answer obtained through answer conversion triggers the answer much more reliably, improving the user experience.
The answer conversion is performed according to the characteristics of the current question's alternative answers in terms of answer number and/or answer content: if the speech recognition text contains an answer number, the answer number is converted to conform to the number format of the current question's alternative answers; if the speech recognition text contains answer content, the answer content is converted to be closer to the answer content of the current question's alternative answers.
It can be understood that the voice answer obtained by answer conversion may contain only an answer number, only answer content, both, or neither.
Step 130, determining the answer result based on the voice answer and the standard answer of the current question.
Specifically, after the voice answer is obtained, it can be compared with the standard answer of the current question.
Further, in the case that the voice answer contains only an answer number, the answer number in the voice answer is compared with that of the standard answer; when they are the same, the answer result is judged correct, otherwise incorrect.
In the case that the voice answer contains only answer content, the answer content in the voice answer is compared with that of the standard answer; when their semantics are the same, the answer result is judged correct, otherwise incorrect.
In the case that the voice answer contains both an answer number and answer content, the answer is judged correct only if both match those of the standard answer; otherwise it is judged incorrect.
In the case that the voice answer contains neither an answer number nor answer content, the answer is judged incorrect.
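The four comparison cases above can be expressed as one decision function. This is a minimal sketch: the dict structure and function name are illustrative assumptions, and the "same semantics" comparison of answer content described in the embodiment is simplified here to exact string equality.

```python
def judge_answer(voice_answer, standard):
    """Decide the answer result from a parsed voice answer.

    voice_answer / standard: dicts with optional 'number' and 'content'
    keys (hypothetical structures, for illustration only).
    Returns 'correct' or 'wrong'.
    """
    number = voice_answer.get("number")
    content = voice_answer.get("content")

    if number is None and content is None:
        return "wrong"  # neither an answer number nor answer content
    if number is not None and content is not None:
        # both present: both must match the standard answer
        return ("correct"
                if number == standard["number"] and content == standard["content"]
                else "wrong")
    if number is not None:
        return "correct" if number == standard["number"] else "wrong"
    return "correct" if content == standard["content"] else "wrong"
```

A real implementation would replace the content equality test with the semantic (or pinyin-based) comparison described below.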
According to the voice answering method provided by the embodiment of the present invention, answer conversion is performed on the speech recognition text of the answer speech based on the characteristics of the current question's alternative answers in terms of answer number and/or answer content, yielding the user's voice answer. This answering scheme is not limited to specific trigger vocabulary, so answering becomes more flexible, effectively improving the user experience while ensuring the accuracy of voice answer recognition.
Based on the above embodiment, Fig. 2 is a schematic flowchart of step 120 of the voice answering method provided by the present invention. As shown in Fig. 2, step 120 specifically comprises:
Step 121, performing number-format conversion on the answer-number text in the speech recognition text to obtain a voice answer number conforming to the alternative answers' number format;
and/or Step 122, performing entity extraction of a preset type on the speech recognition text and taking the extracted entities as the voice answer content, wherein the preset type is determined based on the characteristics of the alternative answers' answer content;
Step 123, determining the voice answer based on the voice answer number and/or the voice answer content.
Specifically, the answer-number text in the speech recognition text may be extracted by keyword matching or preset rules. Number-format conversion converts the number format of the answer-number text into the format used by the alternative answers, so that the resulting voice answer number can directly trigger the answer.
For example, suppose the alternative answers use letter numbering, specifically the four numbers A/B/C/D. If the extracted answer-number text is "the third one", it is converted into "C"; if it is "the first one", it is converted into "A"; if it is "the fifth one", it is judged that no voice answer number exists in the voice answer.
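The number-format conversion just described can be sketched roughly as a lookup from spoken number text to letter numbers. The table entries and function name are hypothetical; a real implementation would cover many more spoken variants.

```python
# Hypothetical mapping from spoken answer-number text to the letter
# numbering A/B/C/D used by the alternative answers.
NUMBER_TEXT_TO_LETTER = {
    "第一个": "A", "第二个": "B", "第三个": "C", "第四个": "D",
    "一": "A", "二": "B", "三": "C", "四": "D",
    "1": "A", "2": "B", "3": "C", "4": "D",
}

def convert_number_format(number_text, valid_letters=("A", "B", "C", "D")):
    """Map an extracted answer-number text onto the alternative answers'
    letter numbering; return None when the text falls outside the range
    (e.g. 'the fifth one' for a four-option question)."""
    if number_text in valid_letters:  # already in letter format
        return number_text
    return NUMBER_TEXT_TO_LETTER.get(number_text)
```

A `None` result corresponds to the case above where no voice answer number exists in the voice answer.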
Entity extraction from the speech recognition text may be implemented with an entity extraction model. An entity here is a piece of answer content contained in the speech recognition text. The speech recognition text is input into a pre-trained entity extraction model, which performs entity recognition of the preset type and outputs an entity label for each character, using a labeling scheme such as BIO or BIOES, where B marks the beginning of an entity, I an entity-internal character, E the end of an entity, S a single-character entity, and O a non-entity character. The preset type is determined based on the characteristics of the alternative answers' answer content, and may be, for example, person name, place name, time, or classical poem.
Before step 122 is executed, the entity extraction model may be trained as follows: first, a large number of sample speech recognition texts are collected, and the entities and entity types in them are labeled manually; then, an initial entity extraction model is trained on the sample speech recognition texts and their labeled entities and entity types, yielding the entity extraction model.
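Independently of how the model is trained, decoding the per-character BIO labels it outputs into entity strings can be sketched as follows (the function name is illustrative; the E and S tags of BIOES would be handled analogously):

```python
def decode_bio(tokens, labels):
    """Collect entity strings from per-token BIO labels, as output by
    the entity extraction model described above (illustrative decoder)."""
    entities, current = [], []
    for token, label in zip(tokens, labels):
        if label.startswith("B"):
            if current:                     # close any open entity
                entities.append("".join(current))
            current = [token]               # start a new entity
        elif label.startswith("I") and current:
            current.append(token)           # extend the open entity
        else:                               # "O" ends any open entity
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities
```

For the spoken answer "我选李白" labeled O O B I, this yields the single entity "李白", which becomes the voice answer content.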
The entities extracted by the entity extraction model are taken as the voice answer content.
The voice answer number alone may serve as the voice answer; or the voice answer content alone; or the voice answer number and the voice answer content together.
According to the method provided by the embodiment of the present invention, number-format conversion is performed on the answer-number text to obtain the voice answer number, entity extraction of the preset type is performed on the speech recognition text to obtain the voice answer content, and the voice answer is determined based on the voice answer number and/or the voice answer content. The resulting voice answer conforms to the alternative answers' number format and/or content characteristics and can trigger the answer reliably without being limited to specific trigger vocabulary, making answering more flexible and improving the user experience.
According to any of the above embodiments, the preset type is determined based on the entity type of the answer content in the alternative answers.
Specifically, entity extraction of the preset type is performed on the speech recognition text, and the extracted entities are taken as the voice answer content, where the preset type is determined based on the entity type of the answer content in the alternative answers.
The entity type of the answer content in the alternative answers can also be obtained with the entity extraction model: the answer content of each alternative answer is input into the pre-trained entity extraction model, and the entity type it extracts is taken as the preset type.
Since the answer content in the alternative answers may contain one entity or several, the preset type may accordingly be one type or several. When there are several preset types, entities of each preset type are extracted from the speech recognition text respectively.
According to the method provided by the embodiment of the present invention, the entity type of the answer content in the alternative answers is used as the preset type, and entity extraction of the preset type is performed on the speech recognition text to obtain the voice answer content.
Based on any of the above embodiments, Fig. 3 is the first schematic flowchart of step 130 of the voice answering method provided by the present invention. As shown in Fig. 3, step 130 specifically comprises:
Step 131, in the case that voice answer content exists in the voice answer, converting the voice answer content into pinyin to be checked;
Step 132, determining the answer result based on the pinyin to be checked and the answer pinyin of the standard answer.
In the case that voice answer content exists in the voice answer, if the correctness of the voice answer is judged from the entity text extracted from the speech recognition text, the entity text may be homophonous with the text of the actual standard answer while being written with different characters, causing the answer to be recognized as wrong and degrading the user experience.
To further improve the accuracy of answer recognition and the user experience, the method provided by the embodiment of the present invention converts the voice answer content into pinyin to be checked, i.e. the pinyin corresponding to the characters of the voice answer content. Here, the answer pinyin of the standard answer is the pinyin corresponding to the characters of the standard answer's content.
The pinyin to be checked is then compared with the answer pinyin of the standard answer: if they are the same, the answer result is judged correct; otherwise, incorrect.
Further, considering that users' Mandarin pronunciation varies in accuracy (for example, the answer speech may confuse flat-tongue and retroflex initials, or fail to distinguish l and n), the recognition of answer speech can be further improved by synchronously expanding the conversion whenever a flat-tongue/retroflex or l/n initial is encountered while converting the standard answer into its answer pinyin. For example:
The current question is: Which poet in our country's history is known as the "Poet Immortal"?
A: Du Fu  B: Li Bai  C: Bai Juyi  D: …
The answer pinyin of the standard answer then includes both "li bai" and "ni bai", so the answer speech of users who do not distinguish l and n can still be recognized: even if the user's spoken answer sounds like "ni bai", the match succeeds, further improving the user experience.
Likewise, for user groups that easily confuse flat-tongue and retroflex pronunciation, the synchronous expansion conversion improves the recognition accuracy of the voice answer. For example, assuming option D is the standard answer and its content converts to the pinyin "shi hao", the expanded answer pinyin includes both "shi hao" and "si hao", so even if the user's spoken answer sounds like "si hao", the match succeeds.
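The synchronous expansion conversion can be sketched as generating every confusable variant of the standard answer's pinyin sequence. The confusion table below covers only the l/n and z/zh, c/ch, s/sh pairs named above, and all names are illustrative assumptions.

```python
from itertools import product

# Confusable initial pairs: n/l and flat-tongue vs retroflex,
# expanded symmetrically in both directions.
CONFUSION = {"l": "n", "n": "l", "zh": "z", "z": "zh",
             "ch": "c", "c": "ch", "sh": "s", "s": "sh"}

def expand_syllable(syl):
    """Return the syllable plus its confusable variant, if any.

    Two-letter initials are tested before one-letter ones so that
    'shi' expands via 'sh', not 's'."""
    for initial in ("zh", "ch", "sh", "z", "c", "s", "l", "n"):
        if syl.startswith(initial):
            return {syl, CONFUSION[initial] + syl[len(initial):]}
    return {syl}

def expand_pinyin(syllables):
    """All variants of a standard-answer pinyin sequence, so that
    e.g. ['li', 'bai'] also matches the spoken 'ni bai'."""
    return {" ".join(p) for p in product(*(expand_syllable(s) for s in syllables))}
```

Matching then succeeds if the pinyin to be checked is any member of the expanded set.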
According to the method provided by the embodiment of the present invention, the correctness of the user's voice answer content is judged by comparing its pinyin with the pinyin of the standard answer, which effectively handles homophones written with different characters arising during speech recognition of the answer speech, further improving the accuracy of answer recognition and the user experience.
In addition, the standard answer is converted into pinyin with synchronous expansion, so that even if the user's Mandarin pronunciation is imprecise, the standard answer pinyin can still be matched accurately, improving the accuracy of voice answer recognition and the user experience.
Based on any of the above embodiments, the user's answer speech for the current question may also be converted directly into pinyin to be checked, and the answer result determined based on the pinyin to be checked and the answer pinyin of the standard answer.
Based on any of the above embodiments, Fig. 4 is the second schematic flowchart of step 130 of the voice answering method provided by the present invention. As shown in Fig. 4, step 130 specifically comprises:
Step 133, modifying the voice answer based on a finger answer and/or a gaze answer to obtain the user answer, wherein the finger answer is determined from an image of the user's finger pointing at the answer screen, and the gaze answer is determined from an image of the user's face;
Step 134, determining the answer result based on the user answer and the standard answer of the current question.
Specifically, to further improve the convenience and accuracy of answering, the voice answer may be modified based on the finger answer and/or the gaze answer, yielding the modified user answer.
The finger answer is the answer determined from the region of the answer screen at which the user's finger points. The finger may touch the answer screen or point at it without contact; this is not specifically limited in the embodiment of the present invention.
The answer screen is the screen of the electronic device displaying the current question, and may be a touch screen or a non-touch screen.
The finger image is an image of the user's finger pointing at the answer screen and may be captured by an image acquisition device, which may be integrated into the answering electronic device or be an external device connected via Universal Serial Bus (USB).
The gaze answer is the answer determined from the region of the answer screen toward which the user's gaze is directed. It can be determined from an image of the user's face: the face image contains eye features, from which the region of the answer screen the user is looking at is determined.
First, the voice answer is corrected using the finger answer and/or the gaze answer to obtain the user answer. For example, when the finger answer and/or the gaze answer indicate the same answer as the voice answer, the recognized voice answer can be confirmed as the user answer. When the voice answer is empty but a finger answer and/or a gaze answer is obtained, the user may be promptly reminded to answer by voice, or asked whether the finger answer and/or gaze answer should be taken as the user answer; the finger answer and/or gaze answer may also be taken directly as the user answer.
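One possible correction policy following the cases above can be sketched as below. The function, return shape, and prompt texts are assumptions for illustration, not the patent's prescribed behavior.

```python
def merge_answers(voice, finger=None, gaze=None):
    """Correct the voice answer with finger/gaze answers.

    Returns (user_answer, prompt), where prompt is an optional
    message asking the user to confirm or to answer by voice.
    """
    pointing = finger or gaze  # prefer the finger answer when both exist
    if voice and pointing:
        if voice == pointing:
            return voice, None  # modalities agree: keep the voice answer
        return voice, "Finger/gaze answer disagrees with voice; please confirm."
    if voice:
        return voice, None
    if pointing:
        return pointing, "No voice answer detected; using the pointed answer."
    return None, "Please answer by voice."
```

Other policies (e.g. always requiring voice confirmation) fit the same interface.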
The answer result is then determined based on the user answer and the standard answer of the current question: the user answer is compared with the standard answer to determine whether it is correct.
Because the user answer comprehensively considers the voice answer together with the finger answer and/or the line-of-sight answer, the accuracy of user answer recognition can be effectively improved, which further improves the user experience. In addition, the finger answer and/or line-of-sight answer make the answering mode more flexible.
In the method provided by this embodiment of the present invention, the answer result is determined from a user answer that comprehensively considers the voice answer, the finger answer and/or the line-of-sight answer, which makes the answering mode more flexible, effectively improves the accuracy of user answer recognition, and further improves the user experience.
Based on any of the above embodiments, fig. 5 is a schematic flow chart of the finger answer determination method provided by the present invention. As shown in fig. 5, the finger answer is determined based on the following steps:
step 510, acquiring a finger pointing vector from the finger image;
step 520, determining the finger answer based on the placement angle of the answer screen, the area occupied by each alternative answer's text line on the answer screen, and the projection position of the finger pointing vector on the answer screen.
Specifically, the finger pointing vector represents the direction in which the finger points at the answer screen. A direction extraction model may be used to extract the finger pointing feature from the finger image and output it as a vector.
The placement angle of the answer screen is the included angle between the screen of the answering electronic device and the plane supporting the device: for example, 0 degrees when the screen lies flat and 90 degrees when it stands vertically. The placement angle can be set flexibly according to the user's preference.
The placement angle of the answer screen may be measured by an automatic goniometer; for example, the current placement angle may be displayed and adjusted through a simple angle-measurement application.
Taking a multiple-choice question as an example of the current question, a text detection algorithm may be used to perform text-line segmentation on the images of the current question and of each alternative answer, obtaining the area occupied by each alternative answer's text line on the answer screen. For example, DBNet (Real-time Scene Text Detection with Differentiable Binarization), PSENet, PANNet, or the like may be used. The area occupied by each alternative answer's text line can be represented by a rectangular box, specifically by the coordinates of each vertex of the rectangular box on the answer screen.
After the placement angle of the answer screen and the area occupied by each alternative answer's text line are obtained, the finger answer can be determined from the projection position of the finger pointing vector on the answer screen: the alternative answer corresponding to the screen area indicated by the projection position is the finger answer.
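The patent gives no formulas for this projection; the sketch below is one plausible geometric reading in which the placement angle is assumed to have already been folded into the screen plane's origin, normal, and in-plane axes. All names (`finger_answer`, `answer_boxes`, etc.) are illustrative.

```python
import numpy as np

def finger_answer(fingertip, direction, screen_origin, screen_normal,
                  x_axis, y_axis, answer_boxes):
    """Intersect the finger pointing ray with the answer-screen plane and
    return the alternative answer whose text-line box contains the hit
    point, or None. All 3-D inputs share one coordinate frame; x_axis and
    y_axis are unit vectors spanning the screen, and answer_boxes maps an
    answer label to its (x0, y0, x1, y1) box in screen coordinates."""
    p = np.asarray(fingertip, float)
    d = np.asarray(direction, float)
    o = np.asarray(screen_origin, float)
    n = np.asarray(screen_normal, float)
    denom = d @ n
    if abs(denom) < 1e-9:          # ray parallel to the screen plane
        return None
    t = ((o - p) @ n) / denom
    if t < 0:                      # screen lies behind the fingertip
        return None
    hit = p + t * d                # 3-D intersection point
    u = (hit - o) @ np.asarray(x_axis, float)
    v = (hit - o) @ np.asarray(y_axis, float)
    for label, (x0, y0, x1, y1) in answer_boxes.items():
        if x0 <= u <= x1 and y0 <= v <= y1:
            return label
    return None
```

For a vertically standing screen (placement angle of 90 degrees) facing the user, the normal points toward the user and the two in-plane axes run along the screen edges; the same function covers any other placement angle once those three vectors are set accordingly.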
In the method provided by this embodiment of the present invention, the finger answer is determined from the projection position of the finger pointing vector on the answer screen, which further improves answer recognition accuracy and the user experience.
Based on any of the above embodiments, fig. 6 is a schematic flow chart of the line-of-sight answer determination method provided by the present invention. As shown in fig. 6, the line-of-sight answer is determined based on the following steps:
step 610, determining the gaze direction of the human eyes based on the human-eye features of the face image;
step 620, determining the spatial position of the human eyes based on the key point information of the face image;
step 630, determining the line-of-sight answer based on the gaze direction and the spatial position of the human eyes.
Specifically, a face image of the user may be captured in real time during answering by an image acquisition device such as a camera, or a video frame containing the user's face may be selected from a real-time video as the face image. Human-eye features are then extracted from the face image.
The human-eye features are input into a neural network model, which predicts the deflection angles of the user's line of sight in the horizontal and vertical directions; post-processing then yields the gaze direction of the human eyes in the camera coordinate system.
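A common post-processing step — assumed here, since the patent does not specify the parameterization — converts the two predicted deflection angles into a unit 3-D gaze direction with a spherical-to-Cartesian mapping. The camera-frame convention below (x right, y down, z forward) is likewise an assumption.

```python
import math

def gaze_vector(yaw, pitch):
    """Convert the predicted horizontal (yaw) and vertical (pitch)
    deflection angles, in radians, into a unit gaze direction in the
    camera frame; looking straight ahead gives (0, 0, 1)."""
    x = math.cos(pitch) * math.sin(yaw)
    y = math.sin(pitch)
    z = math.cos(pitch) * math.cos(yaw)
    return (x, y, z)
```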
It should be understood that, before the human-eye features are fed into the neural network model, the human-eye feature image may be normalized, for example using a BN (Batch Normalization) algorithm.
On the other hand, the spatial position of the human eyes in the camera coordinate system can be estimated from the key point information of the face image. Specifically, the human-eye key points in the face region can be detected by a face key point detection model to obtain their coordinate values. Any existing face key point detection model may be used in this step, and details are not repeated here.
It should be noted that steps 610 and 620 may be executed sequentially, in either order, or simultaneously.
Finally, the user's line of sight is determined from the obtained gaze direction and spatial position of the eyes, the line of sight is mapped to a pixel on the answer screen, and the user's line-of-sight answer is determined from which alternative answer's area that pixel falls in.
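The mapping from a line-of-sight hit point on the screen to an alternative answer can be sketched as below, assuming the ray–plane intersection has already been computed (the same step as for the finger answer) and that the screen's physical size and pixel resolution are known; the function and parameter names are illustrative.

```python
def screen_point_to_answer(point_m, screen_size_m, resolution_px, regions):
    """Map a hit point given in metres from the screen's top-left corner
    to a pixel, then return the alternative answer whose pixel region
    (x0, y0, x1, y1) contains that pixel, or None if the line of sight
    falls outside every region."""
    px = point_m[0] / screen_size_m[0] * resolution_px[0]
    py = point_m[1] / screen_size_m[1] * resolution_px[1]
    for label, (x0, y0, x1, y1) in regions.items():
        if x0 <= px <= x1 and y0 <= py <= y1:
            return label
    return None
```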
In the method provided by this embodiment of the present invention, the line-of-sight answer is determined from the human-eye features and the key point information of the face image, which further improves answer recognition accuracy and the user experience.
Based on any one of the above embodiments, this embodiment provides a voice answer method, including:
(1) Acquiring the answer voice of the user for the current question.
(2) Performing number format conversion on the answer number text in the speech recognition text of the answer voice to obtain a voice answer number that conforms to the number format of the alternative answers; and/or performing entity extraction of a preset type on the speech recognition text and taking the extracted entity as the voice answer content, where the preset type is determined based on the entity type of the answer content in the alternative answers.
(3) Determining the voice answer based on the voice answer number and/or the voice answer content.
(4) Determining an answer result based on the voice answer and the standard answer of the current question, and displaying the user's answer result for the current question.
(5) When voice answer content exists in the voice answer, converting the voice answer content into pinyin to be checked; determining the answer result based on the pinyin to be checked and the answer pinyin of the standard answer, and displaying the user's answer result for the current question.
(6) To further improve the convenience of answering, the method further includes: correcting the voice answer based on the finger answer and/or the line-of-sight answer to obtain the user answer.
The finger answer is determined based on the following steps: acquiring a finger pointing vector from the finger image; and determining the finger answer based on the placement angle of the answer screen, the area occupied by each alternative answer's text line on the answer screen, and the projection position of the finger pointing vector on the answer screen.
The line-of-sight answer is determined based on the following steps: determining the gaze direction of the human eyes based on the human-eye features of the face image; determining the spatial position of the human eyes based on the key point information of the face image; and determining the line-of-sight answer based on the gaze direction and the spatial position of the human eyes.
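As an illustration of the number format conversion in step (2), a small normalizer can map spoken forms of an answer number to the letter format used by the alternative answers. The English vocabulary below is invented for the example — the patent does not list the actual mapping — and a real system would need more care with ambiguous tokens (e.g. the article "a").

```python
import re

# Illustrative spoken-form vocabulary; the real mapping is not specified.
SPOKEN_TO_LETTER = {
    "a": "A", "b": "B", "c": "C", "d": "D",
    "one": "A", "two": "B", "three": "C", "four": "D",
    "first": "A", "second": "B", "third": "C", "fourth": "D",
}

def normalize_answer_number(recognized_text):
    """Convert the answer-number text found in the speech-recognition
    result into the number format of the alternative answers (here, a
    single capital letter), or None when no number is recognized."""
    for token in re.findall(r"[a-z]+", recognized_text.lower()):
        if token in SPOKEN_TO_LETTER:
            return SPOKEN_TO_LETTER[token]
    return None
```

For the answer-content branch, the extracted entity would instead be converted to pinyin and compared with the standard answer's pinyin, as described in step (5).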
The voice answering device provided by the present invention is described below; the voice answering device described below and the voice answering method described above may be referred to in correspondence with each other.
Fig. 7 is a schematic structural diagram of the voice answering device provided by the present invention. As shown in fig. 7, the device includes:
a voice obtaining unit 710, configured to obtain an answer voice of a user for a current question;
an answer conversion unit 720, configured to perform answer conversion on the speech recognition text of the answer speech based on characteristics of the candidate answer of the current question on an answer number and/or answer content, so as to obtain a speech answer of the user;
a result determining unit 730, configured to determine an answer result based on the speech answer and the standard answer of the current question.
In the voice answering device provided by this embodiment of the present invention, answer conversion is performed on the speech recognition text of the answer voice based on the characteristics of the alternative answers of the current question in terms of answer number and/or answer content, so as to obtain the user's voice answer. Because the device is not limited by specific trigger words, the user can answer more flexibly, and the user experience can be effectively improved while the accuracy of voice answer recognition is ensured.
Based on any of the above embodiments, the answer conversion unit 720 is further configured to:
performing number format conversion on the answer number text in the speech recognition text to obtain a voice answer number that conforms to the number format of the alternative answers;
and/or performing entity extraction of a preset type on the voice recognition text, and taking the extracted entity as voice answer content, wherein the preset type is determined based on the characteristics of the alternative answer on the answer content;
determining the voice response based on the voice response number and/or the voice response content.
Based on any embodiment, the preset type is determined based on an entity type of answer content in the alternative answer.
Based on any of the above embodiments, the result determining unit 730 is further configured to:
when voice answer content exists in the voice answer, converting the voice answer content into pinyin to be checked;
and determining an answer result based on the pinyin to be checked and the answer pinyin of the standard answer.
Based on any of the above embodiments, the result determining unit 730 is further configured to:
modifying the voice answer based on a finger answer and/or a sight line answer to obtain a user answer, wherein the finger answer is determined based on a finger image of a finger of the user pointing to an answer screen, and the sight line answer is determined based on a face image of the user;
and determining an answer result based on the user answer and the standard answer of the current question.
Based on any of the above embodiments, the voice answering device provided in the embodiment of the present invention further includes a finger answer determining unit, configured to:
acquiring a finger pointing vector of the finger image;
and determining the finger answer based on the placement angle of the answer screen, the area occupied by each alternative answer's text line on the answer screen, and the projection position of the finger pointing vector on the answer screen.
Based on any of the above embodiments, the voice answering device provided in the embodiment of the present invention further includes a line-of-sight answer determining unit, configured to:
determining the sight direction of human eyes based on the human eye characteristics of the human face image;
determining the spatial position of human eyes based on the key point information of the human face image;
determining the gaze response based on a gaze direction and a spatial location of the human eye.
Fig. 8 illustrates a physical structure diagram of an electronic device, and as shown in fig. 8, the electronic device may include: a processor (processor)810, a communication Interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication Interface 820 and the memory 830 communicate with each other via the communication bus 840. Processor 810 may invoke logic instructions in memory 830 to perform a voice question answering method comprising: acquiring answer voices of a user for current questions; based on the characteristics of the alternative answer of the current question on the answer number and/or the answer content, carrying out answer conversion on the voice recognition text of the answer voice to obtain the voice answer of the user; and determining an answer result based on the voice answer and the standard answer of the current question.
In addition, the logic instructions in the memory 830 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, the computer program may be stored on a non-transitory computer readable storage medium, and when the computer program is executed by a processor, a computer can execute the method for answering questions provided by the above methods, where the method includes: acquiring answer voice of a user for a current question; based on the characteristics of the alternative answer of the current question on the answer number and/or the answer content, carrying out answer conversion on the voice recognition text of the answer voice to obtain the voice answer of the user; and determining an answer result based on the voice answer and the standard answer of the current question.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program being implemented by a processor to execute the voice answering method provided by the above methods, the method including: acquiring answer voices of a user for current questions; based on the characteristics of the alternative answer of the current question on the answer number and/or the answer content, carrying out answer conversion on the voice recognition text of the answer voice to obtain the voice answer of the user; and determining an answer result based on the voice answer and the standard answer of the current question.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A voice answering method, comprising:
acquiring answer voices of a user for current questions;
based on the characteristics of the alternative answer of the current question on the answer number and/or the answer content, carrying out answer conversion on the voice recognition text of the answer voice to obtain the voice answer of the user;
and determining an answer result based on the voice answer and the standard answer of the current question.
2. The voice answering method according to claim 1, wherein performing answer conversion on the speech recognition text of the answering speech based on the characteristics of the alternative answer to the current question on the answer number and/or the answer content to obtain the speech answer of the user comprises:
carrying out numbering format conversion on the answer number text in the voice recognition text to obtain a voice answer number which accords with the format of the alternative answer on the answer number;
and/or performing entity extraction of a preset type on the voice recognition text, and taking the extracted entity as voice answer content, wherein the preset type is determined based on the characteristics of the alternative answer on the answer content;
determining the voice response based on the voice response number and/or the voice response content.
3. The method of claim 2, wherein the predetermined type is determined based on an entity type of the content of the answer in the alternative answer.
4. The method of claim 1, wherein the determining an answer result based on the phonetic answer and the standard answer of the current question comprises:
under the condition that voice answer content exists in the voice answer, converting the voice answer content into pinyin to be audited;
and determining an answer result based on the pinyin to be checked and the answer pinyin of the standard answer.
5. The voice answering method according to any one of claims 1 to 4, wherein the determining an answer result based on the speech answer and the standard answer of the current question comprises:
modifying the voice answer based on a finger answer and/or a sight line answer to obtain a user answer, wherein the finger answer is determined based on a finger image of a finger of the user pointing to an answer screen, and the sight line answer is determined based on a face image of the user;
and determining an answer result based on the user answer and the standard answer of the current question.
6. The voice answering method according to claim 5, wherein the finger answer is determined based on the steps of:
acquiring a finger pointing vector of the finger image;
and determining the finger answers based on the placement angle of the answer screen, the area occupied by each alternative answer in the answer screen line and the projection position of the finger vectors on the answer screen.
7. The voice answering method according to claim 5, wherein the line-of-sight answer is determined based on the following steps:
determining the sight direction of human eyes based on the human eye characteristics of the human face image;
determining the spatial position of human eyes based on the key point information of the human face image;
determining the gaze response based on a gaze direction and a spatial location of the human eye.
8. A speech answering device, comprising:
the voice acquisition unit is used for acquiring the answer voice of the user for the current question;
the answer conversion unit is used for carrying out answer conversion on the voice recognition text of the answer voice based on the characteristics of the alternative answer of the current question on the answer number and/or the answer content to obtain the voice answer of the user;
and the result determining unit is used for determining an answer result based on the voice answer and the standard answer of the current question.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the voice answer method according to any one of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the voice answering method according to any one of claims 1 to 7.
CN202210187828.7A 2022-02-28 2022-02-28 Voice answer method and device, electronic equipment and storage medium Pending CN114566167A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210187828.7A CN114566167A (en) 2022-02-28 2022-02-28 Voice answer method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114566167A true CN114566167A (en) 2022-05-31

Family

ID=81715291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210187828.7A Pending CN114566167A (en) 2022-02-28 2022-02-28 Voice answer method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114566167A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109428719A (en) * 2017-08-22 2019-03-05 阿里巴巴集团控股有限公司 A kind of auth method, device and equipment
CN109964270A (en) * 2017-10-24 2019-07-02 北京嘀嘀无限科技发展有限公司 System and method for key phrase identification
CN110097880A (en) * 2019-04-20 2019-08-06 广东小天才科技有限公司 A kind of answer determination method and device based on speech recognition
JP2020013178A (en) * 2018-07-13 2020-01-23 株式会社内田洋行 Answer classification support system, answer classification support method and answer classification support program
CN112837687A (en) * 2021-03-03 2021-05-25 北京百家科技集团有限公司 Answering method, answering device, computer equipment and storage medium
CN112861784A (en) * 2020-08-19 2021-05-28 北京猿力未来科技有限公司 Answering method and device
CN113178208A (en) * 2021-04-20 2021-07-27 上海松鼠课堂人工智能科技有限公司 Intelligent on-line voice answer control method and system for chemists
CN113674572A (en) * 2021-06-17 2021-11-19 上海松鼠课堂人工智能科技有限公司 Method and system for prompting student to answer questions through playing voice
CN113870635A (en) * 2019-10-25 2021-12-31 北京猿力教育科技有限公司 Voice answering method and device
CN114005440A (en) * 2021-10-14 2022-02-01 上海众言网络科技有限公司 Question-answering method, system, electronic equipment and storage medium based on voice interaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination