WO2023143439A1 - Speech interaction method, system and apparatus, and device and storage medium - Google Patents

Speech interaction method, system and apparatus, and device and storage medium

Info

Publication number
WO2023143439A1
WO2023143439A1 (PCT/CN2023/073326)
Authority
WO
WIPO (PCT)
Prior art keywords
recognition engine
speech recognition
data
voice
speech
Prior art date
Application number
PCT/CN2023/073326
Other languages
French (fr)
Chinese (zh)
Inventor
王军锋
袁国勇
王伟健
Original Assignee
达闼机器人股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 达闼机器人股份有限公司 filed Critical 达闼机器人股份有限公司
Publication of WO2023143439A1 publication Critical patent/WO2023143439A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211: Selection of the most significant subset of features
    • G06F 18/2113: Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/01: Assessment or evaluation of speech recognition systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225: Feedback of the input speech

Definitions

  • the embodiments of the present application relate to the technical field of intelligent robots, and in particular to a voice interaction method, system, device, equipment and storage medium.
  • Embodiments of the present application provide a voice interaction method, system, apparatus, device, and storage medium, which are used to more accurately perform speech recognition on voice data input by a user and then provide the user with a reply that matches the voice information.
  • An embodiment of the present application provides a voice interaction method, including: acquiring voice data sent by a user for a device and facial feature data of the user; sorting at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine each corresponding to at least one language type; judging whether a first speech recognition engine corresponding to a preset first language type matches the voice data; if not, selecting, according to the ordering of the at least one standby speech recognition engine, a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine; and generating reply information for the voice data according to a second speech recognition result of the voice data by the target speech recognition engine.
  • sorting at least one standby speech recognition engine according to the facial feature data includes: identifying the target language type group to which the user belongs according to the facial feature data; and sorting the at least one standby speech recognition engine according to the target language type group to which the user belongs and the correspondence between language type groups and standby speech recognition engines. The facial feature data includes at least one of: skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.
  • before judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the method further includes: acquiring the current geographic location of the device; determining the first language type according to the language distribution features of the geographic location; and using the recognition engine corresponding to the first language type as the first speech recognition engine.
  • judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data includes: performing speech recognition on the voice data through the first speech recognition engine corresponding to the preset first language type to obtain a first speech recognition result; obtaining the text information in the first speech recognition result; calculating the recognition accuracy of the text information; and, if the recognition accuracy is less than a set accuracy threshold, determining that the voice data does not match the first speech recognition engine.
  • the method further includes: if the recognition accuracy is greater than or equal to the set accuracy threshold, using a question-answer matching model to perform question-answer matching on the text information to obtain reply information and a confidence level of the reply information; and, if the confidence level of the reply information is less than a preset confidence threshold, determining that the voice data does not match the first speech recognition engine.
  • selecting a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine includes: selecting from the at least one standby speech recognition engine in turn, following the ordering, to obtain a second speech recognition engine; judging, according to the second speech recognition result, whether the voice data matches the second speech recognition engine; and, if the voice data matches the second speech recognition engine, using the second speech recognition engine as the target speech recognition engine.
  • the embodiment of the present application also provides a voice interaction system, including a terminal device and a cloud server. The terminal device is mainly used to: obtain the voice data sent by the user for the device and the facial feature data of the user; and send the voice data and the facial feature data to the cloud server. The cloud server is mainly used to: receive the voice data and the facial feature data; sort at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine each corresponding to at least one language type; judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; if not, select a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine according to the sorting; and generate reply information for the voice data according to a second speech recognition result of the voice data by the target speech recognition engine.
  • the embodiment of the present application also provides a voice interaction device, including: an acquisition module, configured to acquire the voice data sent by the user for the device and the facial feature data of the user; a sorting module, configured to sort at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine each corresponding to at least one language type; a judging module, configured to judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; a selection module, configured to, if not, select a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine according to the ordering of the at least one standby speech recognition engine; and a generating module, configured to generate reply information for the voice data according to a second speech recognition result of the voice data by the target speech recognition engine.
  • the embodiment of the present application also provides a cloud server, including a memory, a processor, and a communication component, wherein the memory is used to store one or more computer instructions, and the processor is used to execute the one or more computer instructions to perform the steps in the voice interaction method.
  • the embodiment of the present application also provides a computer-readable storage medium storing a computer program, and when the computer program is executed by a processor, the processor is caused to implement the steps in the voice interaction method.
  • Embodiments of the present application provide a voice interaction method, system, apparatus, device, and storage medium, in which the terminal device can obtain the user's voice data and facial feature data, and the cloud server can sort the standby speech recognition engines according to the facial feature data.
  • when the first speech recognition engine does not match the voice data, the cloud server can select the target speech recognition engine from the standby speech recognition engines, and generate reply information for the voice data according to the second speech recognition result of the voice data by the target speech recognition engine.
  • in this way, the terminal device can more accurately perform speech recognition on the voice data input by the user, and can thus provide the user with a reply that matches the voice information.
  • FIG. 1 is a schematic structural diagram of a voice interaction system provided by an exemplary embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a voice interaction system in an actual scenario provided by an exemplary embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a voice interaction system in an actual scenario provided by another exemplary embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a voice interaction method provided by an exemplary embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a voice interaction device provided by an exemplary embodiment of the present application.
  • FIG. 6 is a schematic diagram of a cloud server provided by an exemplary embodiment of the present application.
  • FIG. 1 is a schematic structural diagram of a voice interaction system provided by an exemplary embodiment of the present application.
  • the voice interaction system 100 includes: a cloud server 10 and a terminal device 20 .
  • the cloud server 10 can be implemented as a cloud host, a virtual center in the cloud, or an elastic computing instance in the cloud, etc., which is not limited in this embodiment.
  • the composition of the cloud server 10 mainly includes a processor, a hard disk, a memory, a system bus, etc., and is similar to a general computer architecture, and will not be repeated here.
  • the terminal device 20 can be realized as a variety of terminal devices in different scenarios. For example, in hotel, guesthouse, or restaurant scenarios, it can be realized as a robot that provides services; in intelligent driving assistance or automatic driving scenarios, it can be realized as a controlled vehicle; in a banking scenario, it can be realized as a multi-functional financial terminal; in a hospital scenario, it can be realized as a registration and payment terminal; and in a movie theater scenario, it can be realized as a ticket collection terminal, etc.
  • a wireless communication connection can be established between the cloud server 10 and the terminal device 20, and the specific communication connection method can be determined according to different application scenarios.
  • the wireless communication connection can be implemented based on a virtual private network (VPN) to ensure communication security.
  • the terminal device 20 is mainly used to: obtain the voice data sent by the user to the terminal device 20 and the user's facial feature data, and send the voice data and the facial feature data to the cloud server 10.
  • the facial feature data is used to identify the language group to which the user belongs, and the facial feature data may include: at least one of skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.
  • for example, the user's facial feature data may indicate that the user's eyes are light green and deep-set, and the hair is blond.
  • the cloud server 10 is mainly used for: receiving the voice data and facial feature data, and sorting at least one standby voice recognition engine according to the facial feature data.
  • at least one spare speech recognition engine corresponds to at least one language type.
  • the at least one spare speech recognition engine includes: a speech recognition engine corresponding to Arabic, a speech recognition engine corresponding to German, and a speech recognition engine corresponding to French.
  • after sorting, the cloud server 10 can judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; if not, a target speech recognition engine matching the voice data is selected from the at least one standby speech recognition engine according to the sorting.
  • here, "first" is used only to distinguish between speech recognition engines.
  • the target speech recognition engine refers to a speech recognition engine that matches the speech data.
  • for example, suppose the user speaks French while the first speech recognition engine corresponding to the preset first language type recognizes Chinese. The cloud server 10 judges that this engine does not match the voice data and, following the order "speech recognition engine corresponding to French, speech recognition engine corresponding to German, speech recognition engine corresponding to Arabic", selects the target speech recognition engine matching the voice data from the standby speech recognition engines, namely the speech recognition engine corresponding to French.
  • the cloud server 10 can generate the reply information of the voice data according to the second voice recognition result of the voice data by the target voice recognition engine.
  • the reply information may be implemented as text information or audio information used to provide the user with a reply. For example, if the user says to the terminal device 20 "What time will dinner be served in the afternoon", the cloud server 10 may generate a reply message of "six o'clock in the afternoon". Further optionally, the cloud server 10 may send the generated reply information to the terminal device 20 in the form of text or audio, so that the terminal device 20 outputs the reply information to the user through an audio component or a display component.
  • the terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby voice engines according to the facial feature data.
  • when the speech recognition engine corresponding to the first language type does not match the voice data, the cloud server can select the target speech recognition engine from the standby speech recognition engines, and generate the reply information for the voice data according to the target engine's second speech recognition result.
  • the "sorting at least one spare speech recognition engine according to facial feature data" described in the foregoing embodiments can be implemented based on the following steps:
  • the cloud server 10 can obtain facial feature data by performing feature extraction on pre-collected user facial images, and further, the cloud server 10 can identify the target language group to which the user belongs according to the facial feature data.
  • the target language type group refers to the language type group to which the user belongs.
  • for example, if the user's facial feature data indicates that the user's eyes are light green and deep-set and the hair is blond, features that are common in European countries such as France or Germany, the cloud server 10 can identify from the facial feature data that the target language type group to which the user belongs is the French group or the German group.
  • when identifying the language type group, the cloud server 10 can input the facial feature data into a preset language-group SVM (Support Vector Machine) classifier.
  • the language-group SVM classifier is trained in advance and can divide human facial images into Korean, Chinese, French, and other groups according to language type, yielding categories for multiple language type groups. Therefore, after the cloud server 10 inputs the facial feature data into the classifier, the classifier can match the facial feature data against the multiple language-group categories, obtain the matching language type groups and the corresponding matching degrees (i.e., probabilities), and then output the language type group corresponding to the facial feature data.
  • for example, after the cloud server inputs the facial feature data into the preset classifier, the matching degrees obtained for the target language type groups may be 80% for the Chinese group, 70% for the French group, and 50% for the English group.
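  • By way of illustration, the following is a minimal Python sketch of the classification step described above, using scikit-learn's SVC with probability estimates playing the role of the "matching degree". The feature encoding, group labels, and training data below are assumptions made for the example; the patent does not specify them.

```python
# Hypothetical sketch of the language-group classification step.
# Feature encoding, labels, and training data are illustrative only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy facial-feature vectors: [skin_tone, hair_color, eye_color],
# each encoded in [0, 1] by some assumed upstream feature extractor.
centers = {
    "French group":  [0.9, 0.8, 0.7],
    "Chinese group": [0.4, 0.1, 0.1],
    "Korean group":  [0.5, 0.2, 0.2],
}
features, labels = [], []
for group, center in centers.items():
    features.extend(rng.normal(center, 0.05, size=(10, 3)))  # 10 toy samples per group
    labels.extend([group] * 10)

# probability=True lets the classifier return per-group matching degrees.
classifier = SVC(probability=True).fit(features, labels)

def matching_degrees(facial_features):
    """Return {language type group: matching degree} for one user."""
    probs = classifier.predict_proba([facial_features])[0]
    return dict(zip(classifier.classes_, probs))

print(matching_degrees([0.88, 0.78, 0.72]))  # French group scores highest
```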
  • the cloud server 10 sorts the at least one standby speech recognition engine according to the target language group to which the user belongs and the corresponding relationship between the language group and the spare speech recognition engine.
  • the correspondence between the language group and the standby speech recognition engine may be: the French group corresponds to the speech recognition engine corresponding to French, and the German group corresponds to the speech recognition engine corresponding to German.
  • based on the target language type groups and the corresponding relationships, the at least one standby speech recognition engine can then be arranged in descending order of matching degree, i.e., "the speech recognition engine corresponding to Chinese, the speech recognition engine corresponding to French, and the speech recognition engine corresponding to English".
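  • A compact sketch of this sorting step follows; the engine names and the group-to-engine correspondence table are hypothetical placeholders.

```python
# Hypothetical sketch: order standby speech recognition engines by the
# classifier's matching degree for their corresponding language type group.
matching_degrees = {"Chinese": 0.80, "French": 0.70, "English": 0.50}

# Assumed correspondence between language type groups and standby engines.
group_to_engine = {
    "Chinese": "chinese_asr_engine",
    "French":  "french_asr_engine",
    "English": "english_asr_engine",
}

ranked_groups = sorted(group_to_engine,
                       key=lambda g: matching_degrees.get(g, 0.0),
                       reverse=True)  # highest matching degree first
ranked_engines = [group_to_engine[g] for g in ranked_groups]
print(ranked_engines)  # ['chinese_asr_engine', 'french_asr_engine', 'english_asr_engine']
```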
  • the cloud server 10 may also acquire the current geographic location of the terminal device 20, determine the first language type according to the language distribution features of the geographic location, and use the recognition engine corresponding to the first language type as the first speech recognition engine.
  • for example, suppose the terminal device 20 is currently in a community inhabited mainly by Koreans, so that the community's language distribution feature is that many people speak Korean and few speak Chinese.
  • the cloud server 10 can determine the first language type as Korean according to the language distribution feature, and use the recognition engine corresponding to Korean as the first speech recognition engine.
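  • The location-based choice of the first language type could look like the following sketch; the language distribution data is a made-up placeholder, since the patent does not say where such statistics come from.

```python
# Hypothetical sketch: derive the preset first language type from the
# language distribution of the device's current geographic location.
language_distribution = {
    # location id -> {language: share of speakers} (illustrative numbers)
    "community_a": {"Korean": 0.85, "Chinese": 0.10, "English": 0.05},
}

def first_language(location_id: str, default: str = "Chinese") -> str:
    dist = language_distribution.get(location_id)
    if not dist:
        return default              # no statistics: fall back to a default
    return max(dist, key=dist.get)  # most widely spoken language wins

print(first_language("community_a"))  # Korean
```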
  • when judging whether the first speech recognition engine matches the voice data, the cloud server 10 may use the first speech recognition engine corresponding to the preset first language type to perform speech recognition on the voice data to obtain a first speech recognition result.
  • the cloud server 10 can obtain the text information in the first speech recognition result, and calculate the recognition accuracy of the text information.
  • the calculation of the recognition accuracy rate may be performed through a preset speech recognition model, or may be calculated through a preset algorithm.
  • for example, evaluation indicators such as the Sentence Error Rate (SER), Sentence Correct rate (S.Corr), or Character Error Rate (CER) of the text information can be computed through a preset model or algorithm, and the recognition accuracy of the text information is then calculated from the multiple evaluation indicators and their respective weights. The specific indicators and weights are not limited in this embodiment.
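  • Since the patent names SER, S.Corr, and CER as possible indicators but fixes neither the weights nor the combination rule, the sketch below assumes a simple weighted sum with error rates folded in as (1 - rate):

```python
# Hedged sketch of the weighted-indicator recognition accuracy described
# above. The weights and the combination rule are assumptions.
def recognition_accuracy(ser: float, s_corr: float, cer: float,
                         weights=(0.3, 0.4, 0.3)) -> float:
    """Combine indicators into one accuracy score in [0, 1].

    Error rates enter as (1 - rate) so a higher score always
    means a better recognition result.
    """
    indicators = (1.0 - ser, s_corr, 1.0 - cer)
    return sum(w * v for w, v in zip(weights, indicators))

score = recognition_accuracy(ser=0.10, s_corr=0.92, cer=0.05)
print(f"{score:.3f}", score >= 0.90)  # 0.923 True, with a 90% threshold
```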
  • the cloud server 10 can further judge whether the voice data matches the first speech recognition engine according to the confidence of the reply information generated in the question-answer matching stage. The details will be described below.
  • the cloud server 10 may use a question-answer matching model to perform question-answer matching on the text information based on NLP (Natural Language Processing) technology.
  • the question-answer matching model can search for a plurality of pre-selected information with different confidence levels corresponding to the text information in the built-in data set of the model according to the input text information.
  • the question-answer matching model can select the pre-selected information with the highest confidence as the answer information from the multiple pre-selected information.
  • for example, when the cloud server 10 uses the question-answer matching model to perform question-and-answer matching on the text information "which street is the nearest bank to me", it can obtain the pre-selected information "on street A" with a confidence level of 80% and the pre-selected information "on street B" with a confidence level of 85%; the pre-selected information "on street B", whose confidence level is 85%, can then be selected from the two as the answer information.
  • the cloud server 10 can obtain the reply information and the confidence level of the reply information. If the confidence of the reply information is less than the preset confidence threshold, it is determined that the voice data does not match the first voice recognition engine.
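  • The confidence check on the reply information can be pictured as follows; the candidate list mirrors the street example above, and the threshold value is an assumption.

```python
# Hypothetical sketch of the question-answer matching confidence check.
# Any QA model that returns (answer, confidence) candidates would fit.
CONFIDENCE_THRESHOLD = 0.85  # assumed preset confidence threshold

def pick_answer(candidates):
    """candidates: list of (answer_text, confidence); return the best."""
    return max(candidates, key=lambda c: c[1])

candidates = [("on street A", 0.80), ("on street B", 0.85)]
answer, confidence = pick_answer(candidates)

if confidence < CONFIDENCE_THRESHOLD:
    # Low confidence is treated as a mismatch with the first engine,
    # triggering fallback to the standby speech recognition engines.
    print("mismatch: fall back to standby engines")
else:
    print("reply:", answer)  # here: 'on street B' at 0.85
```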
  • if it is determined that the voice data does not match the first speech recognition engine, the cloud server 10 may select a target speech recognition engine matching the voice data from the at least one standby speech recognition engine.
  • when the cloud server 10 selects a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine, it can select one speech recognition engine from the at least one standby speech recognition engine as a second speech recognition engine.
  • at least one backup speech recognition engine includes a speech recognition engine corresponding to Chinese and a speech recognition engine corresponding to French, and the cloud server can select a speech recognition engine corresponding to French from the at least one backup speech recognition engine as the second voice recognition engine.
  • the second speech recognition engine can then perform speech recognition on the voice data to obtain a second speech recognition result.
  • the second speech recognition result refers to the speech recognition result obtained by performing speech recognition through the second speech recognition engine.
  • here, "second" is used only to distinguish the results obtained from multiple speech recognition passes.
  • the cloud server 10 can determine whether the voice data matches the second voice recognition engine according to the second voice recognition result. The details will be described below.
  • the cloud server 10 can obtain the text information in the second speech recognition result, and calculate the recognition accuracy of the text information.
  • the calculation of the recognition accuracy rate may be performed through a preset speech recognition model, or may be calculated through a preset algorithm.
  • multiple evaluation indicators such as the sentence error rate, sentence correct rate, or character error rate of the text information can be calculated through a preset model or algorithm, and the recognition accuracy of the text information can be calculated from the multiple evaluation indicators and their respective weights. If the recognition accuracy is greater than or equal to the set accuracy threshold, it is determined that the voice data matches the second speech recognition engine, and the cloud server 10 can use the second speech recognition engine as the target speech recognition engine. If the recognition accuracy is less than the set accuracy threshold, it is determined that the voice data does not match the second speech recognition engine. The threshold can be set to 90%, 85%, or 80%, etc., which is not limited in this embodiment.
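  • Putting the pieces together, the fallback selection over the ranked standby engines could be sketched as below; the engine interface (recognize, accuracy) is assumed for illustration, not taken from the patent.

```python
# Hedged sketch of the standby-engine fallback loop. The engine objects
# and their recognize()/accuracy() interface are illustrative assumptions.
ACCURACY_THRESHOLD = 0.90  # assumed set accuracy threshold

def select_target_engine(voice_data, ranked_standby_engines):
    """Try standby engines in ranked order; return the first that matches."""
    for engine in ranked_standby_engines:      # order from the sorting step
        text = engine.recognize(voice_data)    # second speech recognition result
        if engine.accuracy(text) >= ACCURACY_THRESHOLD:
            return engine                      # target speech recognition engine
    return None                                # no standby engine matched
```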
  • the voice interaction system will be further described below in conjunction with FIG. 2 and FIG. 3 and actual application scenarios.
  • the terminal device can collect the user's facial image and perform image recognition to obtain the user's facial feature data. Afterwards, the terminal device can identify the target language type group according to the facial feature data and set the standby speech recognition engines according to the target language type group. Based on the above steps, the terminal device can collect the user's initial voice data through the microphone and send it to the voice endpoint detection module, which intercepts the effective voice data in the initial voice data. Afterwards, the terminal device can perform speech recognition on the voice data through the first speech recognition engine corresponding to the first language type in the main module (i.e., the main engine) to obtain the text information corresponding to the voice data.
  • the terminal device may perform question-answer matching on the text information through the question-answer matching model corresponding to the first language type, and obtain answer information corresponding to the text information. If the confidence level of the reply information is greater than or equal to the confidence threshold, the text-to-speech module corresponding to the first language type will convert the reply information into speech and output the speech. If the confidence of the reply information is less than the confidence threshold, select a target speech recognition engine that matches the speech data from at least one spare speech recognition engine to perform speech recognition on the speech data again.
  • the terminal device performs speech recognition on the speech data through the backup speech recognition engine corresponding to Korean to obtain corresponding text information. Afterwards, the terminal device can perform question-answer matching on the text information through the standby question-answer matching model corresponding to Korean in the main module, and obtain corresponding answer information. If the confidence degree of the reply information is greater than or equal to the confidence degree threshold, the reply information is converted into speech through the text-to-speech module corresponding to Korean, and the speech output is performed. If the confidence degree of the reply information is less than the confidence degree threshold, the standby speech recognition engine is reselected to re-recognize the speech data.
  • the embodiment of the present application also provides a voice interaction method, which will be described in detail below with reference to FIG. 4 .
  • Step 401: Acquire the voice data sent by the user to the device and the user's facial feature data.
  • Step 402: Sort the at least one standby speech recognition engine according to the facial feature data; the at least one standby speech recognition engine each corresponds to at least one language type.
  • Step 403: Judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data.
  • Step 404: If not, select a target speech recognition engine matching the voice data from the at least one standby speech recognition engine according to the ranking of the at least one standby speech recognition engine.
  • Step 405: Generate reply information for the voice data according to the second speech recognition result of the voice data by the target speech recognition engine.
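  • The five steps can be read as one pipeline; the sketch below strings them together under the same assumed interfaces as the earlier examples (the device, classifier, and engine objects are hypothetical).

```python
# Hedged end-to-end sketch of steps 401-405; every interface is assumed.
def voice_interaction(device, first_engine, standby_engines, classifier):
    voice, face = device.capture()                           # step 401
    ranked = classifier.rank_engines(face, standby_engines)  # step 402
    if first_engine.matches(voice):                          # step 403
        target = first_engine
    else:                                                    # step 404
        target = next((e for e in ranked if e.matches(voice)), None)
    if target is None:
        return None  # no engine matched the user's language
    result = target.recognize(voice)                         # second recognition result
    return target.qa_model.reply(result)                     # step 405
```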
  • sorting the at least one standby speech recognition engine according to the facial feature data includes: identifying the target language type group to which the user belongs according to the facial feature data; and sorting the at least one standby speech recognition engine according to the target language type group and the correspondence between language type groups and standby speech recognition engines. The facial feature data includes at least one of: skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.
  • before judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the method also includes: acquiring the current geographical location of the device; determining the first language type according to the language distribution features of the geographical location; and using the recognition engine corresponding to the first language type as the first speech recognition engine.
  • judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data includes: performing speech recognition on the voice data through the first speech recognition engine corresponding to the preset first language type to obtain the first speech recognition result; obtaining the text information in the first speech recognition result; calculating the recognition accuracy of the text information; and, if the recognition accuracy is less than the set accuracy threshold, determining that the voice data does not match the first speech recognition engine.
  • the method also includes: if the recognition accuracy is greater than or equal to the set accuracy threshold, using the question-answer matching model to perform question-answer matching on the text information to obtain the reply information and the confidence of the reply information; and, if the confidence of the reply information is less than the preset confidence threshold, determining that the voice data does not match the first speech recognition engine.
  • selecting a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine includes: selecting from the at least one standby speech recognition engine in turn, according to the ordering, to obtain the second speech recognition engine; judging whether the voice data matches the second speech recognition engine according to the second speech recognition result; and, if the voice data matches the second speech recognition engine, using the second speech recognition engine as the target speech recognition engine.
  • the terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby voice engines according to the facial feature data.
  • when the speech recognition engine corresponding to the first language type does not match the voice data, the cloud server can select the target speech recognition engine from the standby speech recognition engines, and generate the reply information for the voice data according to the target engine's second speech recognition result.
  • the execution subject of each step of the method provided in the above embodiments may be the same device, or the method may be executed by different devices.
  • the execution subject of steps 401 to 405 may be device A; for another example, the execution subject of steps 401 to 403 may be device A, and the execution subject of steps 404 and 405 may be device B; and so on.
  • the voice interaction device includes: an acquisition module 501 , a sorting module 502 , a judgment module 503 , a selection module 504 and a generation module 505 .
  • the acquisition module 501 is used to: acquire the voice data sent by the user for the device and the facial feature data of the user;
  • the sorting module 502 is used to: sort at least one standby voice recognition engine according to the facial feature data
  • the at least one standby speech recognition engine corresponds to at least one language type respectively;
  • the judging module 503 is used to: judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; the selection module 504 is configured to: if not, select a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine according to the ranking of the at least one standby speech recognition engine;
  • generating module 505, configured to: generate reply information for the voice data according to a second voice recognition result of the voice data by the target voice recognition engine.
  • when the sorting module 502 sorts at least one standby speech recognition engine according to the facial feature data, it is specifically used to: identify the target language type group to which the user belongs according to the facial feature data; and sort the at least one standby speech recognition engine according to the target language type group to which the user belongs and the correspondence between language type groups and standby speech recognition engines. The facial feature data includes at least one of: skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.
  • the sorting module 502 is further configured to: acquire the current geographic location of the device; determine the first language type according to the language distribution features of the geographic location; and use the recognition engine corresponding to the first language type as the first speech recognition engine.
  • the judging module 503 is specifically configured to: use the first speech recognition engine corresponding to the preset first language type to perform speech recognition on the voice data to obtain a first speech recognition result; obtain the text information in the first speech recognition result; calculate the recognition accuracy of the text information; and, if the recognition accuracy is less than the set accuracy threshold, determine that the voice data does not match the first speech recognition engine.
  • the judging module 503 is also configured to: if the recognition accuracy is greater than or equal to the set accuracy threshold, use a question-answer matching model to perform question-answer matching on the text information to obtain reply information and the confidence of the reply information; and, if the confidence of the reply information is less than a preset confidence threshold, determine that the voice data does not match the first speech recognition engine.
  • when the selection module 504 selects a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine, it is specifically configured to: select from the at least one standby speech recognition engine in turn, according to the ranking, to obtain the second speech recognition engine; judge whether the voice data matches the second speech recognition engine according to the second speech recognition result; and, if the voice data matches the second speech recognition engine, use the second speech recognition engine as the target speech recognition engine.
  • the terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby voice engines according to the facial feature data.
  • when the speech recognition engine corresponding to the first language type does not match the voice data, the cloud server can select the target speech recognition engine from the standby speech recognition engines, and generate the reply information for the voice data according to the target engine's second speech recognition result.
  • Fig. 6 is a schematic structural diagram of a cloud server provided by an exemplary embodiment of the present application.
  • the server is suitable for the voice interaction system provided in the foregoing embodiment.
  • the server includes: a memory 601, a processor 602, and a communication component 603.
  • the memory 601 is used to store computer programs, and can be configured to store other various data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phonebook data, messages, pictures, videos, etc.
  • the processor 602, coupled with the memory 601, is used to execute the computer program in the memory 601, so as to: obtain the voice data sent by the user for the device and the facial feature data of the user; sort at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine each corresponding to at least one language type; judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; if not, select a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine according to the sorting; and generate reply information for the voice data according to the second speech recognition result of the voice data by the target speech recognition engine.
  • when the processor 602 sorts at least one standby speech recognition engine according to the facial feature data, it is specifically configured to: identify the target language type group to which the user belongs according to the facial feature data; and sort the at least one standby speech recognition engine according to the target language type group to which the user belongs and the correspondence between language type groups and standby speech recognition engines. The facial feature data includes at least one of: skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.
  • the processor 602 is further configured to: acquire the current geographic location of the device; determine the first language type according to the language distribution features of the geographic location; and use the recognition engine corresponding to the first language type as the first speech recognition engine.
  • when judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the processor 602 is specifically configured to: use the first speech recognition engine corresponding to the preset first language type to perform speech recognition on the voice data to obtain a first speech recognition result; obtain the text information in the first speech recognition result; calculate the recognition accuracy of the text information; and, if the recognition accuracy is less than the set accuracy threshold, determine that the voice data does not match the first speech recognition engine.
  • the processor 602 is further configured to: if the recognition accuracy is greater than or equal to the set accuracy threshold, use a question-answer matching model to perform question-answer matching on the text information to obtain reply information and a confidence degree of the reply information; and, if the confidence degree of the reply information is less than a preset confidence threshold, determine that the voice data does not match the first speech recognition engine.
  • when the processor 602 selects a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine, it is specifically configured to: select from the at least one standby speech recognition engine in turn, according to the ranking, to obtain the second speech recognition engine; judge whether the voice data matches the second speech recognition engine according to the second speech recognition result; and, if the voice data matches the second speech recognition engine, use the second speech recognition engine as the target speech recognition engine.
  • the cloud server further includes: a power supply component 604 and other components.
  • FIG. 6 only schematically shows some components, which does not mean that the cloud server only includes the components shown in FIG. 6 .
  • the terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby voice engines according to the facial feature data.
  • when the speech recognition engine corresponding to the first language type does not match the voice data, the cloud server can select the target speech recognition engine from the standby speech recognition engines, and generate the reply information for the voice data according to the target engine's second speech recognition result.
  • the embodiments of the present application also provide a computer-readable storage medium storing a computer program.
  • when the computer program is executed, the steps that can be executed by the cloud server in the above method embodiments can be implemented.
  • an embodiment of the present application further provides a computer program product, including computer programs/instructions.
  • when the computer program/instructions are executed by a processor, the steps that can be executed by the cloud server in the above method embodiments are implemented.
  • the memory 601 in the above-mentioned FIG. 6 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • the above-mentioned communication component 603 in FIG. 6 is configured to facilitate wired or wireless communication between the device where the communication component is located and other devices.
  • the device where the communication component is located can access a wireless network based on communication standards, such as WiFi, 2G, 3G, 4G or 5G, or a combination thereof.
  • the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component may be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • the power supply component 604 in FIG. 6 provides power for various components of the device where the power supply component is located.
  • a power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to the device in which the power supply component resides.
  • the embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, the instruction means implementing the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.
  • in a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • Memory may include non-permanent storage in computer readable media, in the form of random access memory (RAM) and/or nonvolatile memory such as read-only memory (ROM) or flash RAM. Memory is an example of computer readable media.
  • Computer-readable media, including both permanent and non-permanent, removable and non-removable media, can implement information storage by any method or technology.
  • Information may be computer readable instructions, data structures, modules of a program, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cartridges, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • computer-readable media excludes transitory computer-readable media, such as modulated data signals and carrier waves.


Abstract

A speech interaction method, system and apparatus, and a server and a storage medium. In the speech interaction system, a terminal device may acquire speech data and facial feature data of a user (401); a cloud server may sort at least one standby speech recognition engine according to the facial feature data, wherein the at least one standby speech recognition engine respectively corresponds to at least one language type (402); whether a first speech recognition engine corresponding to a preset first language type matches the speech data is determined (403); when the first speech recognition engine corresponding to the first language type does not match the speech data, the cloud server may select, from among the standby speech recognition engines, a target speech recognition engine that matches the speech data (404); and reply information for the speech data is generated according to a second speech recognition result, from the target speech recognition engine, of the speech data (405). By means of this implementation, when a user uses a different type of language, a terminal device can relatively accurately perform speech recognition on speech data input by the user, such that a reply that relatively matches the speech information can be provided for the user.

Description

Voice interaction method, system, apparatus, device and storage medium

Cross Reference

This application refers to Chinese Patent Application No. 2022101081354, entitled "Voice Interaction Method, System, Device, Equipment, and Storage Medium", filed on January 28, 2022, which is incorporated into this application by reference in its entirety.

Technical Field

The embodiments of the present application relate to the technical field of intelligent robots, and in particular to a voice interaction method, system, device, equipment and storage medium.

Background

With the continuous development of artificial intelligence technology, intelligent dialogue is becoming more and more popular. In shopping malls, supermarkets, restaurants and other scenarios, intelligent devices capable of intelligent dialogue (such as robots) are widely used. In the prior art, usually on the premise that the user's language type is known, an ASR (Automatic Speech Recognition) engine corresponding to the user's language type is manually set in advance to recognize the voice information input by the user, convert the voice information into text information, recognize the text information, and reply according to the recognition result. However, in many usage scenarios the user's language type is unknown and cannot be learned in advance, so the ASR engine cannot be configured before interacting with the user. As a result, speech recognition cannot be performed accurately on the voice information input by the user, and a reply matching the voice information cannot be provided to the user. Therefore, a solution is urgently needed.

Summary

Embodiments of the present application provide a voice interaction method, system, apparatus, device, and storage medium, which are used to more accurately perform speech recognition on the voice data input by a user and thus provide the user with a reply that matches the voice information.

An embodiment of the present application provides a voice interaction method, including: acquiring voice data sent by a user for a device and facial feature data of the user; sorting at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine each corresponding to at least one language type; judging whether a first speech recognition engine corresponding to a preset first language type matches the voice data; if not, selecting, according to the ordering of the at least one standby speech recognition engine, a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine; and generating reply information for the voice data according to a second speech recognition result of the voice data by the target speech recognition engine.

Further optionally, sorting at least one standby speech recognition engine according to the facial feature data includes: identifying the target language type group to which the user belongs according to the facial feature data; and sorting the at least one standby speech recognition engine according to the target language type group to which the user belongs and the correspondence between language type groups and standby speech recognition engines. The facial feature data includes at least one of: skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.

Further optionally, before judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the method further includes: acquiring the current geographic location of the device; determining the first language type according to the language distribution features of the geographic location; and using the recognition engine corresponding to the first language type as the first speech recognition engine.

Further optionally, judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data includes: performing speech recognition on the voice data through the first speech recognition engine corresponding to the preset first language type to obtain a first speech recognition result; obtaining the text information in the first speech recognition result; calculating the recognition accuracy of the text information; and, if the recognition accuracy is less than a set accuracy threshold, determining that the voice data does not match the first speech recognition engine.

Further optionally, the method further includes: if the recognition accuracy is greater than or equal to the set accuracy threshold, using a question-answer matching model to perform question-answer matching on the text information to obtain reply information and a confidence level of the reply information; and, if the confidence level of the reply information is less than a preset confidence threshold, determining that the voice data does not match the first speech recognition engine.

Further optionally, selecting a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine includes: selecting from the at least one standby speech recognition engine in turn, following the ordering, to obtain a second speech recognition engine; judging, according to the second speech recognition result, whether the voice data matches the second speech recognition engine; and, if the voice data matches the second speech recognition engine, using the second speech recognition engine as the target speech recognition engine.

An embodiment of the present application also provides a voice interaction system, including a terminal device and a cloud server. The terminal device is mainly used to: obtain the voice data sent by the user for the device and the facial feature data of the user; and send the voice data and the facial feature data to the cloud server. The cloud server is mainly used to: receive the voice data and the facial feature data; sort at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine each corresponding to at least one language type; judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; if not, select a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine according to the sorting; and generate reply information for the voice data according to a second speech recognition result of the voice data by the target speech recognition engine.

An embodiment of the present application also provides a voice interaction device, including: an acquisition module, configured to acquire the voice data sent by the user for the device and the facial feature data of the user; a sorting module, configured to sort at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine each corresponding to at least one language type; a judging module, configured to judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; a selection module, configured to, if not, select a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine according to the ordering of the at least one standby speech recognition engine; and a generating module, configured to generate reply information for the voice data according to a second speech recognition result of the voice data by the target speech recognition engine.
本申请实施例还提供一种云端服务器,包括:存储器、处理器以及通信组件;其中,所述存储器用于:存储一条或多条计算机指令;所述处理器用于执行所述一条或多条计算机指令,以用于:执行所述语音交互方法中的步骤。The embodiment of the present application also provides a cloud server, including: a memory, a processor, and a communication component; wherein, the memory is used to: store one or more computer instructions; and the processor is used to execute the one or more computer instructions. The instruction is used for: executing the steps in the voice interaction method.
本申请实施例还提供一种存储有计算机程序的计算机可读存储介质,当计算机程序被处理器执行时,致使处理器实现所述语音交互方法中的步骤。The embodiment of the present application also provides a computer-readable storage medium storing a computer program, and when the computer program is executed by a processor, the processor is caused to implement the steps in the voice interaction method.
本申请实施例提供一种语音交互方法、系统、装置、设备及存储介质中,终端设备可获取用户的语音数据和面部特征数据,云端服务器可根据面部特征数据对备用语音引擎进行排序。云端服务器可在第一语言类型对应的语音识别引擎与语音数据不匹配时,从备用语音识别引擎中选择出目标语音识别引擎,并根据目标语音识别引擎对语音数据的第二语音识别结 果,生成语音数据的答复信息。通过这种实施方式,当用户使用的语言类型不同时,终端设备可较为准确地对用户输入的语音数据进行语音识别,进而,可为用户提供与该语音信息较为匹配的答复。Embodiments of the present application provide a voice interaction method, system, device, device, and storage medium, in which the terminal equipment can obtain the user's voice data and facial feature data, and the cloud server can sort the standby voice engines according to the facial feature data. When the speech recognition engine corresponding to the first language type does not match the speech data, the cloud server can select the target speech recognition engine from the standby speech recognition engines, and select the target speech recognition engine according to the second speech recognition result of the speech data by the target speech recognition engine. As a result, answer information for the voice data is generated. Through this implementation manner, when the language types used by the users are different, the terminal device can more accurately perform speech recognition on the speech data input by the user, and then can provide the user with a reply that matches the speech information.
Description of Drawings

The drawings are only for illustrating the embodiments and are not to be construed as limiting the present invention. Throughout the drawings, the same reference numerals designate the same components. In the drawings:

Fig. 1 is a schematic structural diagram of a voice interaction system provided by an exemplary embodiment of the present application;

Fig. 2 is a schematic structural diagram of a voice interaction system in an actual scenario provided by an exemplary embodiment of the present application;

Fig. 3 is a schematic structural diagram of a voice interaction system in an actual scenario provided by another exemplary embodiment of the present application;

Fig. 4 is a schematic flowchart of a voice interaction method provided by an exemplary embodiment of the present application;

Fig. 5 is a schematic structural diagram of a voice interaction apparatus provided by an exemplary embodiment of the present application;

Fig. 6 is a schematic diagram of a cloud server provided by an exemplary embodiment of the present application.
Detailed Description

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

In the prior art, when users speak different language types, a robot cannot accurately perform speech recognition on the voice information input by a user and consequently cannot provide the user with a reply that matches the voice information. For this technical problem, some embodiments of the present application provide a solution. The technical solutions provided by the embodiments of the present application are described in detail below with reference to the drawings.

Fig. 1 is a schematic structural diagram of a voice interaction system provided by an exemplary embodiment of the present application. As shown in Fig. 1, the voice interaction system 100 includes a cloud server 10 and a terminal device 20.

The cloud server 10 may be implemented as a cloud host, a cloud-based virtual center, a cloud-based elastic computing instance, or the like, which is not limited in this embodiment. The cloud server 10 mainly includes a processor, a hard disk, a memory, a system bus, and the like, similar to a general-purpose computer architecture, and details are not repeated here.

The terminal device 20 may be implemented as various terminal devices in different scenarios. For example, in scenarios such as hotels, guesthouses, and restaurants, it may be implemented as a service robot; in intelligent driving assistance or autonomous driving scenarios, it may be implemented as a controlled vehicle; in a banking scenario, it may be implemented as a multi-functional financial terminal; in a hospital scenario, as a registration and payment terminal; in a movie theater scenario, as a ticket collection terminal; and so on.

In the voice interaction system 100, a wireless communication connection may be established between the cloud server 10 and the terminal device 20, and the specific communication connection method may depend on the application scenario. In some embodiments, the wireless communication connection may be implemented based on a virtual private network (VPN) to ensure communication security.

In the voice interaction system 100, the terminal device 20 is mainly configured to acquire voice data uttered by a user toward the terminal device 20 and the user's facial feature data, and to send the voice data to the cloud server 10. The facial feature data is used to identify the language type group to which the user belongs and may include at least one of: skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data. For example, the user's facial feature data may indicate light green, deep-set eyes and blonde hair.

Correspondingly, the cloud server 10 is mainly configured to receive the voice data and the facial feature data and to sort at least one standby speech recognition engine according to the facial feature data. The at least one standby speech recognition engine corresponds respectively to at least one language type. For example, the at least one standby speech recognition engine may include a speech recognition engine for Arabic, a speech recognition engine for German, and a speech recognition engine for French.

After sorting, the cloud server 10 may judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; if not, it selects, from the at least one standby speech recognition engine and in the order of the at least one standby speech recognition engine, a target speech recognition engine that matches the voice data. Here, the qualifier "first" is used only to distinguish between speech recognition engines, and the target speech recognition engine refers to the speech recognition engine that matches the voice data.

For example, suppose the user speaks French while the first speech recognition engine corresponding to the preset first language type recognizes Chinese. After determining that this engine does not match the voice data, the cloud server 10 may go through the standby engines in the order "speech recognition engine for French, speech recognition engine for German, speech recognition engine for Arabic" and select the target speech recognition engine that matches the voice data, namely the speech recognition engine for French.

Based on the above steps, the cloud server 10 may generate reply information for the voice data according to the second speech recognition result of the voice data produced by the target speech recognition engine. The reply information may be implemented as text information or audio information used to provide the user with a reply. For example, if the user asks the terminal device 20 "What time is dinner served?", the cloud server 10 may generate the reply "Six p.m.". Further optionally, the cloud server 10 may send the generated reply information to the terminal device 20 in text or audio form, so that the terminal device 20 outputs the reply information to the user through an audio component or a display component.
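As an illustration of this matching flow, the following Python sketch walks the preset first engine and then the sorted standby engines until one produces a sufficiently accurate result. The `Engine` interface, the per-engine `accuracy` callback, and the threshold value are assumptions introduced for the sketch, not part of the disclosed implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

ACCURACY_THRESHOLD = 0.85  # illustrative; the embodiment only requires "a set threshold"

@dataclass
class Engine:
    language: str
    recognize: Callable[[bytes], str]   # voice data -> recognized text
    accuracy: Callable[[str], float]    # recognized text -> recognition accuracy

def pick_target_engine(voice: bytes, first: Engine,
                       standby: List[Engine]) -> Optional[Tuple[Engine, str]]:
    """Try the preset first engine; on mismatch, walk the sorted standby list."""
    for engine in [first] + standby:    # standby list already sorted by match degree
        text = engine.recognize(voice)
        if engine.accuracy(text) >= ACCURACY_THRESHOLD:
            return engine, text         # this engine is the target engine
    return None                         # no engine matched the voice data
```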
In this embodiment, the terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby speech recognition engines according to the facial feature data. When the speech recognition engine corresponding to the first language type does not match the voice data, the cloud server can select a target speech recognition engine from the standby speech recognition engines and generate reply information for the voice data according to the second speech recognition result of the voice data produced by the target speech recognition engine. In this implementation, even when users speak different language types, the terminal device can perform speech recognition on the voice data input by a user relatively accurately and can therefore provide the user with a reply that matches the voice information.

Optionally, "sorting the at least one standby speech recognition engine according to the facial feature data" described in the foregoing embodiments may be implemented based on the following steps:

The cloud server 10 may obtain the facial feature data by performing feature extraction on a pre-collected facial image of the user, and may then identify, according to the facial feature data, the target language type group to which the user belongs, namely the language type group to which the user belongs. For example, suppose the facial feature data indicates that the user has light green, deep-set eyes and blonde hair; since people with these facial features are common in European countries such as France and Germany, the cloud server 10 may identify from the facial feature data that the target language type group to which the user belongs is the French group or the German group.

An optional process for identifying the language type group is described in detail below.

When identifying the language type group, the cloud server 10 may input the facial feature data into a preset language-type-group SVM (Support Vector Machine) classifier. Because this classifier has been trained in advance, it can classify facial images into language type groups such as a Korean group, a Chinese group, a French group, and so on, yielding multiple language type group categories. After the cloud server 10 inputs the facial feature data into the classifier, the classifier matches the facial feature data against the multiple language type group categories, obtains the groups that best match the facial feature data together with the corresponding match degrees (i.e., probabilities), and outputs the language type group corresponding to the facial feature data. For example, the cloud server may input the facial feature data into the preset classifier and obtain a match degree of 80% for the German group, 70% for the French group, and 50% for the English group as the target language type group.
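The following is a minimal sketch of such a language-type-group classifier using scikit-learn's `SVC` with probability estimates; the five-dimensional facial feature encoding and the randomly generated toy training data are hypothetical stand-ins for the pre-trained classifier described above.

```python
# Sketch of a language-type-group SVM classifier (toy data, illustrative only).
from sklearn.svm import SVC
import numpy as np

rng = np.random.default_rng(0)
groups = ["German", "French", "Chinese"]
# Hypothetical feature vectors [skin, hair, eyes, nose_bridge, lips], 10 per group.
X = np.vstack([rng.normal(loc=i, scale=0.3, size=(10, 5)) for i in range(3)])
y = np.repeat(groups, 10)

clf = SVC(probability=True).fit(X, y)   # probability=True yields match degrees

sample = rng.normal(loc=0.0, scale=0.3, size=(1, 5))  # a new user's features
probs = dict(zip(clf.classes_, clf.predict_proba(sample)[0]))
print(probs)  # per-group match degree, e.g. {'Chinese': ..., 'French': ..., 'German': ...}
```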
Based on the above steps, the cloud server 10 sorts the at least one standby speech recognition engine according to the target language type group to which the user belongs and the correspondence between language type groups and standby speech recognition engines.

For example, the correspondence between language type groups and standby speech recognition engines may be: the French group corresponds to the speech recognition engine for French, and the German group corresponds to the speech recognition engine for German. Continuing the preceding example, after identifying match degrees of 80% for the German group, 70% for the French group, and 50% for the English group as the target language type group, the cloud server 10 may, according to the target language type groups and the correspondence, arrange the standby speech recognition engines in descending order of match degree: "speech recognition engine for German, speech recognition engine for French, speech recognition engine for English".
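Given the match degrees from the classifier, the ordering step reduces to a sort over the correspondence table, as in the sketch below; the engine identifiers (`asr_de` and so on) are hypothetical names introduced for illustration.

```python
# Sorting the standby engines by the classifier's match degrees (sketch).
match_degree = {"German": 0.80, "French": 0.70, "English": 0.50}  # from the example
group_to_engine = {"German": "asr_de", "French": "asr_fr", "English": "asr_en"}

ordered_engines = [group_to_engine[g]
                   for g in sorted(match_degree, key=match_degree.get, reverse=True)]
print(ordered_engines)  # ['asr_de', 'asr_fr', 'asr_en']
```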
In some optional embodiments, before judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the cloud server 10 may acquire the geographic location where the terminal device 20 is currently located, determine the first language type according to the language distribution characteristics of that location, and use the recognition engine corresponding to the first language type as the first speech recognition engine. For example, suppose the terminal device 20 is currently in a residential community inhabited mainly by Koreans, so that the community's language distribution is characterized by many Korean speakers and few Chinese speakers. The cloud server 10 may then determine, from this language distribution, that the first language type is Korean and use the recognition engine for Korean as the first speech recognition engine.
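A minimal sketch of this location-based default follows, assuming a lookup table from known areas to their dominant language; the area names, table contents, and engine-naming scheme are illustrative assumptions only.

```python
# Choosing the preset first engine from the device's location (sketch).
LANGUAGE_DISTRIBUTION = {          # dominant language per known area (assumed table)
    "community_a": "Korean",
    "community_b": "Chinese",
}

def first_engine_for(location: str, default: str = "Chinese") -> str:
    """Map the device's location to the engine for the locally dominant language."""
    language = LANGUAGE_DISTRIBUTION.get(location, default)
    return f"asr_{language.lower()}"   # hypothetical engine id

print(first_engine_for("community_a"))  # asr_korean
```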
Optionally, when judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the cloud server 10 may perform speech recognition on the voice data through the first speech recognition engine corresponding to the preset first language type to obtain a first speech recognition result.

The cloud server 10 may then acquire the text information in the first speech recognition result and calculate the recognition accuracy of that text information. The recognition accuracy may be calculated by a preset speech recognition model or by a preset algorithm. For example, several evaluation indicators of the text information, such as the sentence error rate (SER), sentence correct rate (S.Corr), or character error rate (CER), may be computed by a preset model or algorithm, and the recognition accuracy of the text information is then calculated from these indicators and their respective weights.
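A sketch of the weighted combination described above follows; the particular weights and the conversion of error rates into correctness scores are assumptions, since the embodiment only states that several indicators are combined according to their respective weights.

```python
# Weighted combination of evaluation indicators into one accuracy score (sketch).
def recognition_accuracy(ser: float, s_corr: float, cer: float,
                         weights=(0.3, 0.4, 0.3)) -> float:
    """Combine SER, S.Corr, and CER; weights are illustrative assumptions."""
    scores = (1.0 - ser, s_corr, 1.0 - cer)   # turn error rates into correctness
    return sum(w * s for w, s in zip(weights, scores))

acc = recognition_accuracy(ser=0.10, s_corr=0.92, cer=0.05)
print(f"{acc:.2%}")   # compared against the set accuracy threshold, e.g. 85%
```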
If the calculated recognition accuracy is less than the set accuracy threshold, it is determined that the voice data does not match the first speech recognition engine. The threshold may be set to 90%, 85%, 80%, or the like, which is not limited in this embodiment.

If the calculated recognition accuracy is greater than or equal to the set accuracy threshold, it may be preliminarily determined that the voice data matches the first speech recognition engine. On this basis, the cloud server 10 may further judge whether the voice data matches the first speech recognition engine according to the confidence of the reply information generated in the question-answer matching stage, as described in detail below.

If the recognition accuracy of the text information in the first speech recognition result is greater than or equal to the set accuracy threshold, the cloud server 10 may, based on NLP (Natural Language Processing) technology, perform question-answer matching on the text information using a question-answer matching model. After prior model training, the question-answer matching model can, given input text information, search its built-in data set for multiple pieces of pre-selected information corresponding to the text, each with a different confidence, and then select the pre-selected information with the highest confidence as the reply information. For example, when the cloud server 10 performs question-answer matching on the text "On which street is the bank nearest to me?" through the question-answer matching model, it may obtain the pre-selected reply "On Street A" with a confidence of 80% and the pre-selected reply "On Street B" with a confidence of 85%, and then select the reply "On Street B" with the 85% confidence as the reply information.

Through the above question-answer matching, the cloud server 10 obtains the reply information and its confidence. If the confidence of the reply information is less than the preset confidence threshold, it is determined that the voice data does not match the first speech recognition engine.
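The confidence check then amounts to taking the best candidate and comparing it with the threshold, as in this sketch based on the example above; the threshold value and candidate list are illustrative.

```python
# Selecting the highest-confidence candidate reply and applying the threshold (sketch).
CONFIDENCE_THRESHOLD = 0.75   # illustrative value

candidates = [("On Street A", 0.80), ("On Street B", 0.85)]  # from the QA model
reply, confidence = max(candidates, key=lambda c: c[1])

if confidence >= CONFIDENCE_THRESHOLD:
    print(reply, confidence)  # On Street B 0.85 is returned as the reply information
else:
    print("mismatch: fall back to the standby engines")
```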
If it is determined that the voice data does not match the first speech recognition engine, the cloud server 10 may select, from the at least one standby speech recognition engine, a target speech recognition engine that matches the voice data.

In some optional embodiments, when selecting the target speech recognition engine from the at least one standby speech recognition engine, the cloud server 10 may select any one of the standby speech recognition engines as the second speech recognition engine. For example, if the at least one standby speech recognition engine includes a speech recognition engine for Chinese and a speech recognition engine for French, the cloud server may select the speech recognition engine for French as the second speech recognition engine.

After selecting the second speech recognition engine, the cloud server 10 may perform speech recognition on the voice data through this engine to obtain a second speech recognition result, i.e., the speech recognition result obtained by the second speech recognition engine. Here, the qualifier "second" is used only to distinguish between the speech recognition results obtained from multiple recognition passes.

After the speech recognition, the cloud server 10 may judge, according to the second speech recognition result, whether the voice data matches the second speech recognition engine, as described in detail below.

The cloud server 10 may acquire the text information in the second speech recognition result and calculate its recognition accuracy, again by a preset speech recognition model or a preset algorithm, for example by computing evaluation indicators such as the sentence error rate, sentence correct rate, or character error rate and combining them with their respective weights. If this recognition accuracy is greater than or equal to the set accuracy threshold, it is determined that the voice data matches the second speech recognition engine, and the cloud server 10 may use the second speech recognition engine as the target speech recognition engine. If the recognition accuracy is less than the set accuracy threshold, it is determined that the voice data does not match the second speech recognition engine. The threshold may be set to 90%, 85%, 80%, or the like, which is not limited in this embodiment.

The voice interaction system is further described below with reference to Fig. 2, Fig. 3, and practical application scenarios.

As shown in Fig. 2 and Fig. 3, the terminal device may collect a facial image of the user and perform image recognition to obtain the user's facial feature data. The terminal device may then identify the target language type group from the facial feature data and configure the standby speech recognition engines according to that group. On this basis, the terminal device may collect the user's initial voice data through a microphone and send it to a voice endpoint detection module, which extracts the valid voice data from the initial voice data. The terminal device may then perform speech recognition on that voice data through the first speech recognition engine (i.e., the main engine) corresponding to the first language type in the main module to obtain the corresponding text information, and perform question-answer matching on the text information through the question-answer matching model corresponding to the first language type to obtain the corresponding reply information. If the confidence of the reply information is greater than or equal to the confidence threshold, the reply information is converted into speech by the text-to-speech module corresponding to the first language type and output as speech. If the confidence of the reply information is less than the confidence threshold, a target speech recognition engine that matches the voice data is selected from the at least one standby speech recognition engine, and speech recognition is performed on the voice data again.

Taking the case where the target speech recognition engine is the standby engine for Korean as an example, the terminal device performs speech recognition on the voice data through the standby speech recognition engine for Korean to obtain the corresponding text information. The terminal device may then perform question-answer matching on that text information through the standby question-answer matching model for Korean in the main module to obtain the corresponding reply information. If the confidence of the reply information is greater than or equal to the confidence threshold, the reply information is converted into speech by the text-to-speech module for Korean and output as speech. If the confidence of the reply information is less than the confidence threshold, another standby speech recognition engine is selected and the voice data is recognized again.
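Putting the pieces together, the following sketch mirrors the flow of Figs. 2 and 3 with stubbed per-language components; `recognize`, `qa_match`, and `tts` are placeholder functions standing in for the per-language modules, and the confidence values are contrived so that the flow falls back from the main engine to the Korean standby engine.

```python
CONFIDENCE_THRESHOLD = 0.75  # illustrative value

# Stub per-language components; a real system would wrap actual ASR/QA/TTS services.
def recognize(lang, voice):
    return f"[{lang} transcript of {len(voice)} bytes]"

def qa_match(lang, text):
    return (f"[{lang} reply]", 0.8 if lang == "ko" else 0.4)

def tts(lang, reply):
    return f"[{lang} audio for {reply}]"

def answer(voice, engine_order):
    """Walk the main engine and the ordered standby engines until a confident reply."""
    for lang in engine_order:               # e.g. ["zh", "ko", "fr"], main engine first
        text = recognize(lang, voice)       # speech recognition for this language
        reply, conf = qa_match(lang, text)  # question-answer matching for this language
        if conf >= CONFIDENCE_THRESHOLD:
            return tts(lang, reply)         # reply synthesized in the same language
    return None                             # every engine fell below the threshold

print(answer(b"\x00" * 16, ["zh", "ko", "fr"]))  # falls back from zh to the ko engine
```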
An embodiment of the present application further provides a voice interaction method, which is described in detail below with reference to Fig. 4.

Step 401: Acquire voice data uttered by a user toward a device and facial feature data of the user.

Step 402: Sort at least one standby speech recognition engine according to the facial feature data, where the at least one standby speech recognition engine corresponds respectively to at least one language type.

Step 403: Judge whether a first speech recognition engine corresponding to a preset first language type matches the voice data.

Step 404: If not, select, from the at least one standby speech recognition engine and in the order of the at least one standby speech recognition engine, a target speech recognition engine that matches the voice data.

Step 405: Generate reply information for the voice data according to a second speech recognition result of the voice data produced by the target speech recognition engine.
Further optionally, sorting the at least one standby speech recognition engine according to the facial feature data includes: identifying, according to the facial feature data, the target language type group to which the user belongs; and sorting the at least one standby speech recognition engine according to the target language type group to which the user belongs and the correspondence between language type groups and standby speech recognition engines. The facial feature data includes at least one of: skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.

Further optionally, before judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the method further includes: acquiring the geographic location where the device is currently located; determining the first language type according to the language distribution characteristics of the geographic location; and using the recognition engine corresponding to the first language type as the first speech recognition engine.

Further optionally, judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data includes: performing speech recognition on the voice data through the first speech recognition engine corresponding to the preset first language type to obtain a first speech recognition result; acquiring the text information in the first speech recognition result; calculating the recognition accuracy of the text information; and if the recognition accuracy is less than the set accuracy threshold, determining that the voice data does not match the first speech recognition engine.

Further optionally, the method further includes: if the recognition accuracy is greater than or equal to the set accuracy threshold, performing question-answer matching on the text information using a question-answer matching model to obtain reply information and its confidence; and if the confidence of the reply information is less than the preset confidence threshold, determining that the voice data does not match the first speech recognition engine.

Further optionally, selecting, from the at least one standby speech recognition engine, the target speech recognition engine that matches the voice data includes: selecting from the at least one standby speech recognition engine in turn, in the order of the at least one standby speech recognition engine, to obtain a second speech recognition engine; judging, according to the second speech recognition result, whether the voice data matches the second speech recognition engine; and if the voice data matches the second speech recognition engine, using the second speech recognition engine as the target speech recognition engine.

In this embodiment, the terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby speech recognition engines according to the facial feature data. When the speech recognition engine corresponding to the first language type does not match the voice data, the cloud server can select a target speech recognition engine from the standby speech recognition engines and generate reply information for the voice data according to the second speech recognition result of the voice data produced by the target speech recognition engine. In this implementation, even when users speak different language types, the terminal device can perform speech recognition on the voice data input by a user relatively accurately and can therefore provide the user with a reply that matches the voice information.

It should be noted that the steps of the method provided in the above embodiments may all be executed by the same device, or the method may be executed by different devices. For example, steps 401 to 405 may all be executed by device A; alternatively, steps 401 to 403 may be executed by device A and steps 404 and 405 by device B; and so on.

In addition, some of the flows described in the above embodiments and drawings include multiple operations appearing in a specific order, but it should be clearly understood that these operations may be executed out of the order in which they appear herein or in parallel. Sequence numbers such as 401 and 402 are only used to distinguish different operations and do not by themselves represent any execution order. Moreover, these flows may include more or fewer operations, and these operations may be executed sequentially or in parallel.

It should also be noted that the descriptions "first", "second", and the like herein are used to distinguish different messages, devices, modules, and so on; they do not represent an order of precedence, nor do they require that the "first" and the "second" be of different types.
An embodiment of the present application provides a voice interaction apparatus. As shown in Fig. 5, the voice interaction apparatus includes: an acquisition module 501, a sorting module 502, a judging module 503, a selection module 504, and a generating module 505.

The acquisition module 501 is configured to acquire voice data uttered by a user toward a device and facial feature data of the user. The sorting module 502 is configured to sort at least one standby speech recognition engine according to the facial feature data, where the at least one standby speech recognition engine corresponds respectively to at least one language type. The judging module 503 is configured to judge whether a first speech recognition engine corresponding to a preset first language type matches the voice data. The selection module 504 is configured to, if not, select, from the at least one standby speech recognition engine and in the order of the at least one standby speech recognition engine, a target speech recognition engine that matches the voice data. The generating module 505 is configured to generate reply information for the voice data according to a second speech recognition result of the voice data produced by the target speech recognition engine.

Further optionally, when sorting the at least one standby speech recognition engine according to the facial feature data, the sorting module 502 is specifically configured to: identify, according to the facial feature data, the target language type group to which the user belongs; and sort the at least one standby speech recognition engine according to the target language type group to which the user belongs and the correspondence between language type groups and standby speech recognition engines. The facial feature data includes at least one of: skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.

Further optionally, before judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the sorting module 502 is further configured to: acquire the geographic location where the device is currently located; determine the first language type according to the language distribution characteristics of the geographic location; and use the recognition engine corresponding to the first language type as the first speech recognition engine.

Further optionally, when judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the judging module 503 is specifically configured to: perform speech recognition on the voice data through the first speech recognition engine corresponding to the preset first language type to obtain a first speech recognition result; acquire the text information in the first speech recognition result; calculate the recognition accuracy of the text information; and if the recognition accuracy is less than the set accuracy threshold, determine that the voice data does not match the first speech recognition engine.

Further optionally, the judging module 503 is further configured to: if the recognition accuracy is greater than or equal to the set accuracy threshold, perform question-answer matching on the text information using a question-answer matching model to obtain reply information and its confidence; and if the confidence of the reply information is less than the preset confidence threshold, determine that the voice data does not match the first speech recognition engine.

Further optionally, when selecting, from the at least one standby speech recognition engine, the target speech recognition engine that matches the voice data, the selection module 504 is specifically configured to: select from the at least one standby speech recognition engine in turn, in the order of the at least one standby speech recognition engine, to obtain the second speech recognition engine; judge, according to the second speech recognition result, whether the voice data matches the second speech recognition engine; and if the voice data matches the second speech recognition engine, use the second speech recognition engine as the target speech recognition engine.

In this embodiment, the terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby speech recognition engines according to the facial feature data. When the speech recognition engine corresponding to the first language type does not match the voice data, the cloud server can select a target speech recognition engine from the standby speech recognition engines and generate reply information for the voice data according to the second speech recognition result of the voice data produced by the target speech recognition engine. In this implementation, even when users speak different language types, the terminal device can perform speech recognition on the voice data input by a user relatively accurately and can therefore provide the user with a reply that matches the voice information.
Fig. 6 is a schematic structural diagram of a cloud server provided by an exemplary embodiment of the present application; the server is applicable to the voice interaction system provided in the foregoing embodiments. As shown in Fig. 6, the server includes: a memory 601, a processor 602, and a communication component 603.

The memory 601 is configured to store a computer program and may be configured to store various other data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phonebook data, messages, pictures, videos, and so on.

The processor 602, coupled to the memory 601, is configured to execute the computer program in the memory 601 so as to: acquire voice data uttered by a user toward a device and facial feature data of the user; sort at least one standby speech recognition engine according to the facial feature data, where the at least one standby speech recognition engine corresponds respectively to at least one language type; judge whether a first speech recognition engine corresponding to a preset first language type matches the voice data; if not, select, from the at least one standby speech recognition engine and in the order of the at least one standby speech recognition engine, a target speech recognition engine that matches the voice data; and generate reply information for the voice data according to a second speech recognition result of the voice data produced by the target speech recognition engine.

Further optionally, when sorting the at least one standby speech recognition engine according to the facial feature data, the processor 602 is specifically configured to: identify, according to the facial feature data, the target language type group to which the user belongs; and sort the at least one standby speech recognition engine according to the target language type group to which the user belongs and the correspondence between language type groups and standby speech recognition engines. The facial feature data includes at least one of: skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.

Further optionally, before judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the processor 602 is further configured to: acquire the geographic location where the device is currently located; determine the first language type according to the language distribution characteristics of the geographic location; and use the recognition engine corresponding to the first language type as the first speech recognition engine.

Further optionally, when judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the processor 602 is specifically configured to: perform speech recognition on the voice data through the first speech recognition engine corresponding to the preset first language type to obtain a first speech recognition result; acquire the text information in the first speech recognition result; calculate the recognition accuracy of the text information; and if the recognition accuracy is less than the set accuracy threshold, determine that the voice data does not match the first speech recognition engine.

Further optionally, the processor 602 is further configured to: if the recognition accuracy is greater than or equal to the set accuracy threshold, perform question-answer matching on the text information using a question-answer matching model to obtain reply information and its confidence; and if the confidence of the reply information is less than the preset confidence threshold, determine that the voice data does not match the first speech recognition engine.

Further optionally, when selecting, from the at least one standby speech recognition engine, the target speech recognition engine that matches the voice data, the processor 602 is specifically configured to: select from the at least one standby speech recognition engine in turn, in the order of the at least one standby speech recognition engine, to obtain the second speech recognition engine; judge, according to the second speech recognition result, whether the voice data matches the second speech recognition engine; and if the voice data matches the second speech recognition engine, use the second speech recognition engine as the target speech recognition engine.

Further, as shown in Fig. 6, the cloud server also includes other components such as a power supply component 604. Fig. 6 shows only some components schematically, which does not mean that the cloud server includes only the components shown in Fig. 6.

In this embodiment, the terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby speech recognition engines according to the facial feature data. When the speech recognition engine corresponding to the first language type does not match the voice data, the cloud server can select a target speech recognition engine from the standby speech recognition engines and generate reply information for the voice data according to the second speech recognition result of the voice data produced by the target speech recognition engine. In this implementation, even when users speak different language types, the terminal device can perform speech recognition on the voice data input by a user relatively accurately and can therefore provide the user with a reply that matches the voice information.
Correspondingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed, can implement the steps executable by the cloud server in the above method embodiments.

Correspondingly, an embodiment of the present application further provides a computer program product, including a computer program/instructions which, when executed by a processor, implement the steps executable by the cloud server in the above method embodiments.

The memory 601 in Fig. 6 above may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disc.

The communication component 603 in Fig. 6 above is configured to facilitate wired or wireless communication between the device where the communication component is located and other devices. The device where the communication component is located may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component may be implemented based on near field communication (NFC) technology, radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

The power supply component 604 in Fig. 6 above provides power for the various components of the device where the power supply component is located. The power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device where the power supply component is located.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and a memory.

The memory may include non-permanent memory, random access memory (RAM), and/or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device including a series of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, commodity, or device that includes the element.
以上所述仅为本申请的实施例而已,并不用于限制本申请。对于本领域技术人员来说,本申请可以有各种更改和变化。凡在本申请的精神和原理之内所作的任何修改、等同替换、改进等,均应包含在本申请的权利要求范围之内。 The above descriptions are only examples of the present application, and are not intended to limit the present application. For those skilled in the art, various modifications and changes may occur in this application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application shall be included within the scope of the claims of the present application.

Claims (11)

  1. An image-based voice interaction method, characterized by comprising:
    acquiring voice data uttered by a user toward a device and facial feature data of the user;
    sorting at least one backup speech recognition engine according to the facial feature data, the at least one backup speech recognition engine corresponding respectively to at least one language type;
    judging whether a first speech recognition engine corresponding to a preset first language type matches the voice data;
    if not, selecting, according to the sorting of the at least one backup speech recognition engine, a target speech recognition engine matching the voice data from the at least one backup speech recognition engine; and
    generating reply information for the voice data according to a second speech recognition result obtained by the target speech recognition engine for the voice data.
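Purely as an illustrative aid (not part of the claims), the control flow recited in claim 1 can be sketched in Python as below; every function, engine object, and parameter name is a hypothetical stand-in, not the patent's actual implementation:

```python
# Minimal sketch of the claim-1 flow, assuming injected engine objects with a
# recognize() method; nothing here is the patented implementation itself.

def handle_interaction(voice_data, facial_features, first_engine,
                       backup_engines, rank_engines, matches, generate_reply):
    # Sort the backup engines using the user's facial feature data.
    ranked = rank_engines(backup_engines, facial_features)

    # Try the engine preset for the first language type.
    if matches(first_engine, voice_data):
        return generate_reply(first_engine.recognize(voice_data))

    # On a mismatch, walk the ranked backups and use the first match.
    for engine in ranked:
        if matches(engine, voice_data):
            # This engine's output is the "second speech recognition result"
            # from which the reply information is generated.
            return generate_reply(engine.recognize(voice_data))
    return None  # no engine matched the voice data
```

The point of the ordering step is that the preset engine is tried exactly once, and the fallback search then proceeds in an order biased by the facial feature data rather than arbitrarily.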
  2. The method according to claim 1, characterized in that sorting at least one backup speech recognition engine according to the facial feature data comprises:
    identifying, according to the facial feature data, a target language type group to which the user belongs; and
    sorting the at least one backup speech recognition engine according to the target language type group to which the user belongs and a correspondence between language type groups and backup speech recognition engines, wherein the facial feature data comprise at least one of skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.
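One possible shape for the rank_engines step used in the sketch above, again with all names (the group classifier and the group-to-engine table) assumed for illustration only:

```python
# Hypothetical ranking by language type group; classify_group and
# group_to_engines are illustrative stand-ins, not the claimed models.

def make_ranker(classify_group, group_to_engines):
    """Return a rank_engines callable compatible with the earlier sketch."""
    def rank_engines(backup_engines, facial_features):
        # Predict the user's language type group from facial feature data
        # (e.g. skin, hair, eye, nose-bridge, and lip features).
        group = classify_group(facial_features)

        # Engines mapped to the predicted group sort ahead of the rest.
        preferred = set(group_to_engines.get(group, ()))
        return sorted(backup_engines,
                      key=lambda e: 0 if e.language_type in preferred else 1)
    return rank_engines
```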
  3. The method according to claim 1, characterized in that, before judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the method further comprises:
    acquiring a current geographic location of the device; and
    determining the first language type according to language distribution characteristics of the geographic location, and taking a recognition engine corresponding to the first language type as the first speech recognition engine.
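As an illustrative reading of this step, the preset first language could be the most prevalent language at the device's location; the region codes and distribution figures below are invented for the sketch:

```python
# Sketch only: a made-up language-distribution table keyed by region code.
LANGUAGE_DISTRIBUTION = {
    "region-A": {"zh-cmn": 0.85, "zh-yue": 0.10, "en": 0.05},
    "region-B": {"en": 0.60, "es": 0.35, "zh-cmn": 0.05},
}

def first_language_for(region, default="zh-cmn"):
    dist = LANGUAGE_DISTRIBUTION.get(region)
    if not dist:
        return default
    # The most widely spoken language at the location becomes the first
    # language type; its engine becomes the first speech recognition engine.
    return max(dist, key=dist.get)
```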
  4. The method according to claim 1, characterized in that judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data comprises:
    performing speech recognition on the voice data by means of the first speech recognition engine corresponding to the preset first language type, to obtain a first speech recognition result;
    acquiring text information in the first speech recognition result;
    calculating a recognition accuracy of the text information; and
    if the recognition accuracy is less than a set accuracy threshold, determining that the voice data does not match the first speech recognition engine.
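An illustrative reading of this match test follows; the accuracy estimator and threshold value are assumptions, since the claim does not fix how the accuracy is computed:

```python
# Sketch only: score_text is a hypothetical accuracy estimator, e.g. a
# language-model fluency score over the transcript; 0.6 is an arbitrary value.

def matches_first_engine(first_engine, voice_data, score_text, threshold=0.6):
    result = first_engine.recognize(voice_data)  # first speech recognition result
    text = result.text                           # text information in the result
    accuracy = score_text(text)                  # recognition accuracy of the text
    return accuracy >= threshold                 # below the threshold => mismatch
```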
  5. The method according to claim 4, characterized by further comprising:
    if the recognition accuracy is greater than or equal to the set accuracy threshold, performing question-answer matching on the text information by means of a question-answer matching model, to obtain reply information and a confidence level of the reply information; and
    if the confidence level of the reply information is less than a preset confidence threshold, determining that the voice data does not match the first speech recognition engine.
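In other words, even a transcript that passes the accuracy gate can still reveal an engine mismatch if the answer it produces is low-confidence. A minimal sketch of that second gate, with qa_model and the threshold assumed for illustration:

```python
# Sketch only: qa_model.match is a hypothetical question-answer matcher that
# returns (reply_info, confidence); 0.5 is an arbitrary threshold.

def check_reply(text, qa_model, confidence_threshold=0.5):
    reply, confidence = qa_model.match(text)  # reply info and its confidence
    if confidence < confidence_threshold:
        return None, False                    # treat as an engine mismatch
    return reply, True                        # reply information is usable
```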
  6. The method according to claim 1, characterized in that selecting a target speech recognition engine matching the voice data from the at least one backup speech recognition engine comprises:
    selecting from the at least one backup speech recognition engine in turn, according to the sorting of the at least one backup speech recognition engine, to obtain a second speech recognition engine;
    judging, according to the second speech recognition result, whether the voice data matches the second speech recognition engine; and
    if the voice data matches the second speech recognition engine, taking the second speech recognition engine as the target speech recognition engine.
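An illustrative sketch of this selection loop; the match predicate over the recognition result is an assumption carried over from the earlier sketches:

```python
# Sketch only: walk the ranked backups and keep the first engine whose
# recognition result passes the match test.

def select_target_engine(ranked_backups, voice_data, matches):
    for candidate in ranked_backups:              # second speech recognition engine
        result = candidate.recognize(voice_data)  # second speech recognition result
        if matches(result):                       # voice data matches this engine
            return candidate, result              # candidate becomes the target
    return None, None                             # no backup engine matched
```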
  7. A voice interaction system, characterized by comprising a terminal device and a cloud server;
    wherein the terminal device is mainly configured to: acquire voice data uttered by a user toward the device and facial feature data of the user; and send the voice data and the facial feature data to the cloud server; and
    the cloud server is mainly configured to: receive the voice data and the facial feature data; sort at least one backup speech recognition engine according to the facial feature data, the at least one backup speech recognition engine corresponding respectively to at least one language type; judge whether a first speech recognition engine corresponding to a preset first language type matches the voice data; if not, select, according to the sorting of the at least one backup speech recognition engine, a target speech recognition engine matching the voice data from the at least one backup speech recognition engine; and generate reply information for the voice data according to a second speech recognition result obtained by the target speech recognition engine for the voice data.
  8. A voice interaction apparatus, characterized by comprising:
    an acquisition module, configured to acquire voice data uttered by a user toward a device and facial feature data of the user;
    a sorting module, configured to sort at least one backup speech recognition engine according to the facial feature data, the at least one backup speech recognition engine corresponding respectively to at least one language type;
    a judging module, configured to judge whether a first speech recognition engine corresponding to a preset first language type matches the voice data;
    a selection module, configured to: if not, select, according to the sorting of the at least one backup speech recognition engine, a target speech recognition engine matching the voice data from the at least one backup speech recognition engine; and
    a generating module, configured to generate reply information for the voice data according to a second speech recognition result obtained by the target speech recognition engine for the voice data.
  9. A cloud server, characterized by comprising a memory, a processor, and a communication component;
    wherein the memory is configured to store one or more computer instructions; and
    the processor is configured to execute the one or more computer instructions so as to perform the steps in the method according to any one of claims 1-6.
  10. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it causes the processor to implement the steps in the method according to any one of claims 1-6.
  11. A computer program product, comprising a computer program/instructions, characterized in that, when executed by a processor, the computer program/instructions implement the steps in the method according to any one of claims 1-6.
PCT/CN2023/073326 2022-01-28 2023-01-20 Speech interaction method, system and apparatus, and device and storage medium WO2023143439A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210108135.4A CN114464179B (en) 2022-01-28 2022-01-28 Voice interaction method, system, device, equipment and storage medium
CN202210108135.4 2022-01-28

Publications (1)

Publication Number Publication Date
WO2023143439A1 (en)

Family

ID=81412433

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/073326 WO2023143439A1 (en) 2022-01-28 2023-01-20 Speech interaction method, system and apparatus, and device and storage medium

Country Status (2)

Country Link
CN (1) CN114464179B (en)
WO (1) WO2023143439A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114464179B (en) * 2022-01-28 2024-03-19 达闼机器人股份有限公司 Voice interaction method, system, device, equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5991719A * 1998-04-27 1999-11-23 Fujitsu Limited Semantic recognition system
CN107545887A (en) * 2016-06-24 2018-01-05 中兴通讯股份有限公司 Phonetic order processing method and processing device
CN109545197A (en) * 2019-01-02 2019-03-29 珠海格力电器股份有限公司 Recognition methods, device and the intelligent terminal of phonetic order
CN109949795A (en) * 2019-03-18 2019-06-28 北京猎户星空科技有限公司 A kind of method and device of control smart machine interaction
CN110491383A (en) * 2019-09-25 2019-11-22 北京声智科技有限公司 A kind of voice interactive method, device, system, storage medium and processor
CN111128194A (en) * 2019-12-31 2020-05-08 云知声智能科技股份有限公司 System and method for improving online voice recognition effect
CN112732887A (en) * 2021-01-22 2021-04-30 南京英诺森软件科技有限公司 Processing device and system for multi-turn conversation
CN113506565A (en) * 2021-07-12 2021-10-15 北京捷通华声科技股份有限公司 Speech recognition method, speech recognition device, computer-readable storage medium and processor
CN114464179A (en) * 2022-01-28 2022-05-10 达闼机器人股份有限公司 Voice interaction method, system, device, equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8909532B2 (en) * 2007-03-23 2014-12-09 Nuance Communications, Inc. Supporting multi-lingual user interaction with a multimodal application
WO2017112813A1 (en) * 2015-12-22 2017-06-29 Sri International Multi-lingual virtual personal assistant
CN106710586B (en) * 2016-12-27 2020-06-30 北京儒博科技有限公司 Automatic switching method and device for voice recognition engine
CN107391122B (en) * 2017-07-01 2020-03-27 珠海格力电器股份有限公司 Method and device for setting system language of terminal and terminal
CN108766414B (en) * 2018-06-29 2021-01-15 北京百度网讯科技有限公司 Method, apparatus, device and computer-readable storage medium for speech translation
CN111508472B (en) * 2019-01-11 2023-03-03 华为技术有限公司 Language switching method, device and storage medium
CN112116909A (en) * 2019-06-20 2020-12-22 杭州海康威视数字技术股份有限公司 Voice recognition method, device and system
CN111627432B (en) * 2020-04-21 2023-10-20 升智信息科技(南京)有限公司 Active outbound intelligent voice robot multilingual interaction method and device

Also Published As

Publication number Publication date
CN114464179A (en) 2022-05-10
CN114464179B (en) 2024-03-19

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23746318

Country of ref document: EP

Kind code of ref document: A1