WO2023143439A1 - Voice interaction method, system and apparatus, device, and storage medium - Google Patents

Voice interaction method, system and apparatus, device, and storage medium

Info

Publication number
WO2023143439A1
WO2023143439A1 (PCT/CN2023/073326, CN2023073326W)
Authority
WO
WIPO (PCT)
Prior art keywords
recognition engine
speech recognition
data
voice
speech
Prior art date
Application number
PCT/CN2023/073326
Other languages
English (en)
Chinese (zh)
Inventor
王军锋
袁国勇
王伟健
Original Assignee
达闼机器人股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 达闼机器人股份有限公司 filed Critical 达闼机器人股份有限公司
Publication of WO2023143439A1

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G06F18/2113 Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/01 Assessment or evaluation of speech recognition systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech

Definitions

  • The embodiments of the present application relate to the technical field of intelligent robots, and in particular to a voice interaction method, system, apparatus, device, and storage medium.
  • Embodiments of the present application provide a voice interaction method, system, apparatus, device, and storage medium, which are used to perform speech recognition more accurately on voice data input by a user and then provide the user with a reply that matches the voice information.
  • An embodiment of the present application provides a voice interaction method, including: acquiring voice data sent by a user for a device and facial feature data of the user; sorting at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine respectively corresponding to at least one language type; judging whether a first speech recognition engine corresponding to a preset first language type matches the voice data; if not, selecting, according to the sorting of the at least one standby speech recognition engine, a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine; and generating reply information for the voice data according to a second speech recognition result of the voice data by the target speech recognition engine.
  • Sorting the at least one standby speech recognition engine according to the facial feature data includes: identifying the target language group to which the user belongs according to the facial feature data; and sorting the at least one standby speech recognition engine according to the target language group to which the user belongs and the correspondence between language groups and standby speech recognition engines. The facial feature data includes at least one of skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.
  • Before judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the method further includes: acquiring the current geographic location of the device; determining the first language type according to the language distribution features of the geographic location; and using the recognition engine corresponding to the first language type as the first speech recognition engine.
  • Judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data includes: performing speech recognition on the voice data using the first speech recognition engine corresponding to the preset first language type to obtain a first speech recognition result; obtaining the text information in the first speech recognition result; calculating the recognition accuracy of the text information; and, if the recognition accuracy is less than a set accuracy threshold, determining that the voice data does not match the first speech recognition engine.
  • The method further includes: if the recognition accuracy is greater than or equal to the set accuracy threshold, performing question-answer matching on the text information using a question-answer matching model to obtain reply information and a confidence of the reply information; and, if the confidence of the reply information is less than a preset confidence threshold, determining that the voice data does not match the first speech recognition engine.
  • Selecting a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine includes: selecting a standby speech recognition engine in turn, according to the sorting of the at least one standby speech recognition engine, to obtain a second speech recognition engine; judging, according to the second speech recognition result, whether the voice data matches the second speech recognition engine; and, if the voice data matches the second speech recognition engine, using the second speech recognition engine as the target speech recognition engine.
  • The embodiment of the present application also provides a voice interaction system, including a terminal device and a cloud server. The terminal device is mainly used to: obtain the voice data sent by the user for the device and the facial feature data of the user; and send the voice data and the facial feature data to the cloud server. The cloud server is mainly used to: receive the voice data and the facial feature data; sort at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine respectively corresponding to at least one language type; judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; if not, select a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine according to the sorting; and generate reply information for the voice data according to the second speech recognition result of the voice data by the target speech recognition engine.
  • The embodiment of the present application also provides a voice interaction apparatus, including: an acquisition module configured to acquire the voice data sent by the user for the device and the facial feature data of the user; a sorting module configured to sort at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine respectively corresponding to at least one language type; a judging module configured to judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; a selection module configured to, if not, select a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine according to the sorting; and a generating module configured to generate reply information for the voice data according to a second speech recognition result of the voice data by the target speech recognition engine.
  • The embodiment of the present application also provides a cloud server, including a memory, a processor, and a communication component, wherein the memory is used to store one or more computer instructions, and the processor is used to execute the one or more computer instructions to perform the steps in the voice interaction method.
  • the embodiment of the present application also provides a computer-readable storage medium storing a computer program, and when the computer program is executed by a processor, the processor is caused to implement the steps in the voice interaction method.
  • Embodiments of the present application provide a voice interaction method, system, apparatus, device, and storage medium, in which the terminal device can obtain the user's voice data and facial feature data, and the cloud server can sort the standby speech recognition engines according to the facial feature data.
  • When the speech recognition engine corresponding to the first language type does not match the speech data, the cloud server can select the target speech recognition engine from the standby speech recognition engines and generate reply information for the voice data according to the second speech recognition result of the speech data by the target speech recognition engine.
  • the terminal device can more accurately perform speech recognition on the speech data input by the user, and then can provide the user with a reply that matches the speech information.
  • FIG. 1 is a schematic structural diagram of a voice interaction system provided by an exemplary embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a voice interaction system in an actual scenario provided by an exemplary embodiment of the present application
  • FIG. 3 is a schematic structural diagram of a voice interaction system in an actual scenario provided by another exemplary embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a voice interaction method provided by an exemplary embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a voice interaction device provided by an exemplary embodiment of the present application.
  • Fig. 6 is a schematic diagram of a cloud server provided by an exemplary embodiment of the present application.
  • FIG. 1 is a schematic structural diagram of a voice interaction system provided by an exemplary embodiment of the present application.
  • the voice interaction system 100 includes: a cloud server 10 and a terminal device 20 .
  • the cloud server 10 can be implemented as a cloud host, a virtual center in the cloud, or an elastic computing instance in the cloud etc., which is not limited in this embodiment.
  • the composition of the cloud server 10 mainly includes a processor, a hard disk, a memory, a system bus, etc., and is similar to a general computer architecture, and will not be repeated here.
  • The terminal device 20 can be implemented as various terminal devices in different scenarios. For example, in hotel, guesthouse, or restaurant scenarios, it can be implemented as a service robot; in intelligent driving assistance or automatic driving scenarios, as a controlled vehicle; in banking scenarios, as a multi-functional financial terminal; in hospital scenarios, as a registration and payment terminal; and in movie theater scenarios, as a ticket collection terminal.
  • a wireless communication connection can be established between the cloud server 10 and the terminal device 20, and the specific communication connection method can be determined according to different application scenarios.
  • The wireless communication connection can be implemented based on a virtual private network (VPN) to ensure communication security.
  • The terminal device 20 is mainly used to: obtain the voice data sent by the user to the terminal device 20 and the user's facial feature data, and send the voice data and the facial feature data to the cloud server 10.
  • the facial feature data is used to identify the language group to which the user belongs, and the facial feature data may include: at least one of skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.
  • For example, the user's facial feature data may indicate that the eyes are light green and relatively deep-set, and the hair is blond.
  • the cloud server 10 is mainly used for: receiving the voice data and facial feature data, and sorting at least one standby voice recognition engine according to the facial feature data.
  • at least one spare speech recognition engine corresponds to at least one language type.
  • the at least one spare speech recognition engine includes: a speech recognition engine corresponding to Arabic, a speech recognition engine corresponding to German, and a speech recognition engine corresponding to French.
  • After the cloud server 10 sorts the engines, it can judge whether the first speech recognition engine corresponding to the preset first language type matches the speech data; if not, a target speech recognition engine matching the speech data is selected from the at least one standby speech recognition engine according to the sorting.
  • Here, "first" qualifies the speech recognition engine only in order to distinguish between speech recognition engines.
  • the target speech recognition engine refers to a speech recognition engine that matches the speech data.
  • For example, suppose the user speaks French while the first speech recognition engine corresponding to the preset first language type recognizes Chinese. The cloud server 10 judges that this engine does not match the speech data and, following the order "speech recognition engine corresponding to French, speech recognition engine corresponding to German, speech recognition engine corresponding to Arabic", can select from these standby speech recognition engines the target speech recognition engine matching the speech data, namely the speech recognition engine corresponding to French.
  • the cloud server 10 can generate the reply information of the voice data according to the second voice recognition result of the voice data by the target voice recognition engine.
  • the reply information may be implemented as text information or audio information used to provide the user with a reply. For example, if the user says to the terminal device 20 "What time will dinner be served in the afternoon", the cloud server 10 may generate a reply message of "six o'clock in the afternoon". Further optionally, the cloud server 10 may send the generated reply information to the terminal device 20 in the form of text or audio, so that the terminal device 20 outputs the reply information to the user through an audio component or a display component.
  • The terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby speech recognition engines according to the facial feature data.
  • When the speech recognition engine corresponding to the first language type does not match the speech data, the cloud server can select the target speech recognition engine from the standby speech recognition engines and generate reply information for the voice data according to the second speech recognition result.
  • the "sorting at least one spare speech recognition engine according to facial feature data" described in the foregoing embodiments can be implemented based on the following steps:
  • the cloud server 10 can obtain facial feature data by performing feature extraction on pre-collected user facial images, and further, the cloud server 10 can identify the target language group to which the user belongs according to the facial feature data.
  • the target language type group refers to the language type group to which the user belongs.
  • For example, if the user's facial feature data indicates light-green, relatively deep-set eyes and golden-yellow hair, and people with these facial features often appear in European countries such as France or Germany, the cloud server 10 can identify from the facial feature data that the target language group to which the user belongs is the French group or the German group.
  • When the cloud server 10 identifies the language group, it can input the facial feature data into a preset language-group SVM (Support Vector Machine) classifier.
  • The language-group SVM classifier is trained in advance and can divide human facial images into Korean, Chinese, French, and other groups according to language group, yielding categories for multiple language groups. Therefore, after the cloud server 10 inputs the facial feature data into the classifier, the classifier can match the facial feature data against the multiple language-group categories, obtain the language groups that match the facial feature data together with the corresponding matching degrees (i.e., probabilities), and then output the language group corresponding to the facial feature data.
  • For example, after the cloud server inputs the facial feature data into the preset classifier, the matching degrees obtained may be 80% for the Chinese group, 70% for the French group, and 50% for the English group.
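  • For illustration, a minimal sketch of such a language-group classifier follows (Python with scikit-learn; the feature encoding, group labels, and toy training data are assumptions for illustration, not details fixed by this application):

        import numpy as np
        from sklearn.svm import SVC

        # Hypothetical feature vectors: [skin, hair, eye, nose_bridge, lip]
        # scores in [0, 1]; the application only names the feature types.
        rng = np.random.default_rng(0)
        centers = {
            "Chinese group": [0.6, 0.1, 0.2, 0.4, 0.5],
            "French group":  [0.2, 0.9, 0.8, 0.7, 0.4],
            "English group": [0.3, 0.7, 0.6, 0.6, 0.5],
        }
        X, y = [], []
        for group, center in centers.items():
            for _ in range(20):                       # 20 noisy samples per group
                X.append(rng.normal(center, 0.05))
                y.append(group)

        clf = SVC(probability=True).fit(np.array(X), y)   # probability-calibrated SVM

        user = np.array([[0.55, 0.15, 0.25, 0.45, 0.5]])
        degrees = dict(zip(clf.classes_, clf.predict_proba(user)[0]))
        # e.g. {'Chinese group': 0.8, 'English group': 0.05, 'French group': 0.15}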
  • the cloud server 10 sorts the at least one standby speech recognition engine according to the target language group to which the user belongs and the corresponding relationship between the language group and the spare speech recognition engine.
  • the correspondence between the language group and the standby speech recognition engine may be: the French group corresponds to the speech recognition engine corresponding to French, and the German group corresponds to the speech recognition engine corresponding to German.
  • Continuing this example, the cloud server 10 can, according to the target language groups and the correspondence, arrange the at least one standby speech recognition engine in descending order of matching degree, i.e., "the speech recognition engine corresponding to Chinese, the speech recognition engine corresponding to French, the speech recognition engine corresponding to English".
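  • A minimal sketch of this sorting step, reusing the matching degrees above and a hypothetical group-to-engine correspondence table (the engine names are placeholders):

        # Matching degrees from the classifier and the group->engine correspondence.
        match_degree = {"Chinese group": 0.80, "French group": 0.70, "English group": 0.50}
        group_to_engine = {
            "Chinese group": "zh_engine",
            "French group": "fr_engine",
            "English group": "en_engine",
        }

        # Rank the standby engines by matching degree, highest first.
        standby_engines = [
            group_to_engine[group]
            for group, _ in sorted(match_degree.items(), key=lambda kv: kv[1], reverse=True)
        ]
        # -> ['zh_engine', 'fr_engine', 'en_engine']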
  • Optionally, before making this judgment, the cloud server 10 may acquire the current geographic location of the terminal device 20, determine the first language type according to the language distribution features of that geographic location, and use the recognition engine corresponding to the first language type as the first speech recognition engine.
  • For example, suppose the terminal device 20 is currently in a community inhabited mainly by Koreans; the language distribution feature of this community is that many people speak Korean and few speak Chinese.
  • the cloud server 10 can determine the first language type as Korean according to the language distribution feature, and use the recognition engine corresponding to Korean as the first speech recognition engine.
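  • A minimal sketch of choosing the first speech recognition engine from the location's language distribution (the location key, the distribution table, and the engine names are assumptions):

        # Hypothetical language-distribution lookup keyed by a coarse location id.
        LANGUAGE_DISTRIBUTION = {
            "community_A": {"Korean": 0.85, "Chinese": 0.05, "English": 0.10},
        }
        ENGINES = {"Korean": "ko_engine", "Chinese": "zh_engine", "English": "en_engine"}

        def first_speech_engine(location_id: str) -> str:
            """Return the engine for the dominant language at the device's location."""
            distribution = LANGUAGE_DISTRIBUTION[location_id]
            dominant = max(distribution, key=distribution.get)   # e.g. "Korean"
            return ENGINES[dominant]

        assert first_speech_engine("community_A") == "ko_engine"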
  • When judging whether they match, the cloud server 10 may perform speech recognition on the speech data using the first speech recognition engine corresponding to the preset first language type to obtain a first speech recognition result.
  • the cloud server 10 can obtain the text information in the first speech recognition result, and calculate the recognition accuracy of the text information.
  • the calculation of the recognition accuracy rate may be performed through a preset speech recognition model, or may be calculated through a preset algorithm.
  • For example, multiple evaluation indicators of the text information, such as the Sentence Error Rate (SER), the Sentence Correct rate (S.Corr), or the Character Error Rate (CER), can be calculated through a preset model or algorithm, and the recognition accuracy of the text information is then calculated according to these evaluation indicators and their respective weights.
  • The specific calculation manner is not limited in this embodiment.
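  • One way such a weighted recognition accuracy could be computed, sketched under assumed weights and metric values (the application fixes neither):

        def recognition_accuracy(ser: float, s_corr: float, cer: float,
                                 weights=(0.3, 0.4, 0.3)) -> float:
            """Weighted combination of evaluation indicators.

            SER and CER are error rates, so they contribute inversely;
            the sentence-correct rate S.Corr contributes directly.
            """
            scores = (1.0 - ser, s_corr, 1.0 - cer)
            return sum(w * s for w, s in zip(weights, scores))

        accuracy = recognition_accuracy(ser=0.12, s_corr=0.90, cer=0.08)   # 0.90
        matches_first_engine = accuracy >= 0.85   # set accuracy threshold, e.g. 85%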
  • the cloud server 10 can further judge whether the voice data matches the first voice recognition engine according to the confidence of the answer information generated by the question-answer matching link. The details will be described below.
  • Optionally, the cloud server 10 may use a question-answer matching model to perform question-answer matching on the text information based on NLP (Natural Language Processing) technology.
  • the question-answer matching model can search for a plurality of pre-selected information with different confidence levels corresponding to the text information in the built-in data set of the model according to the input text information.
  • the question-answer matching model can select the pre-selected information with the highest confidence as the answer information from the multiple pre-selected information.
  • For example, when the cloud server 10 uses the question-answer matching model to perform question-answer matching on the text information "which street is the nearest bank to me", it may obtain the pre-selected information "on street A" with a confidence of 80% and the pre-selected information "on street B" with a confidence of 85%; from these two, it can then select the 85%-confidence "on street B" as the reply information.
  • the cloud server 10 can obtain the reply information and the confidence level of the reply information. If the confidence of the reply information is less than the preset confidence threshold, it is determined that the voice data does not match the first voice recognition engine.
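  • A minimal sketch of this confidence check, reusing the street example above (the threshold value is an assumption):

        # Pre-selected answers and confidences from the question-answer matching model.
        candidates = [("on street A", 0.80), ("on street B", 0.85)]
        reply, confidence = max(candidates, key=lambda c: c[1])   # "on street B", 0.85

        CONFIDENCE_THRESHOLD = 0.90   # hypothetical preset confidence threshold
        if confidence < CONFIDENCE_THRESHOLD:
            # The voice data is judged not to match the first engine; fall back
            # to the sorted standby speech recognition engines.
            ...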
  • In that case, the cloud server 10 may select a target speech recognition engine matching the speech data from the at least one standby speech recognition engine.
  • Optionally, when the cloud server 10 selects a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine, it can select a speech recognition engine from the at least one standby speech recognition engine, in turn according to the sorting, as the second speech recognition engine.
  • at least one backup speech recognition engine includes a speech recognition engine corresponding to Chinese and a speech recognition engine corresponding to French, and the cloud server can select a speech recognition engine corresponding to French from the at least one backup speech recognition engine as the second voice recognition engine.
  • The second speech recognition engine can then perform speech recognition on the speech data to obtain a second speech recognition result.
  • the second speech recognition result refers to the speech recognition result obtained by performing speech recognition through the second speech recognition engine.
  • Here, "second" qualifies the speech recognition result only to distinguish the results obtained from multiple speech recognitions.
  • the cloud server 10 can determine whether the voice data matches the second voice recognition engine according to the second voice recognition result. The details will be described below.
  • the cloud server 10 can obtain the text information in the second speech recognition result, and calculate the recognition accuracy of the text information.
  • the calculation of the recognition accuracy rate may be performed through a preset speech recognition model, or may be calculated through a preset algorithm.
  • For example, multiple evaluation indicators of the text information, such as the sentence error rate, sentence correct rate, or character error rate, can be calculated through a preset model or algorithm, and the recognition accuracy of the text information can be calculated based on these indicators and their respective weights. If the recognition accuracy is greater than or equal to the set accuracy threshold, it is determined that the speech data matches the second recognition engine, and the cloud server 10 can use the second speech recognition engine as the target speech recognition engine. If the recognition accuracy is less than the set accuracy threshold, it is determined that the voice data does not match the second speech recognition engine. The threshold can be set to 90%, 85%, or 80%, etc., which is not limited in this embodiment.
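  • A minimal sketch of walking the sorted standby engines in turn until one matches; recognize and accuracy_of stand in for the engine call and the accuracy computation above, and both are assumed helpers:

        def select_target_engine(voice_data, sorted_standby_engines,
                                 recognize, accuracy_of, threshold=0.85):
            """Try each standby engine in ranked order; return the first match."""
            for engine in sorted_standby_engines:      # each acts as the "second" engine
                text = recognize(engine, voice_data)   # second speech recognition result
                if accuracy_of(text) >= threshold:     # the voice data matches this engine
                    return engine, text
            return None, None                          # no standby engine matched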
  • the voice interaction system will be further described below in conjunction with FIG. 2 and FIG. 3 and actual application scenarios.
  • In practice, the terminal device can collect the user's facial image and perform image recognition to obtain the user's facial feature data. Afterwards, the terminal device can identify the target language group according to the facial feature data and set the standby speech recognition engines according to the target language group. Based on the above steps, the terminal device can collect the user's initial voice data through the microphone and send it to the voice endpoint detection module, which can intercept the effective voice data in the initial voice data. Afterwards, the terminal device can perform speech recognition on the voice data through the first speech recognition engine corresponding to the first language type in the main module (i.e., the main engine) and obtain the text information corresponding to the voice data.
  • the terminal device may perform question-answer matching on the text information through the question-answer matching model corresponding to the first language type, and obtain answer information corresponding to the text information. If the confidence level of the reply information is greater than or equal to the confidence threshold, the text-to-speech module corresponding to the first language type will convert the reply information into speech and output the speech. If the confidence of the reply information is less than the confidence threshold, select a target speech recognition engine that matches the speech data from at least one spare speech recognition engine to perform speech recognition on the speech data again.
  • the terminal device performs speech recognition on the speech data through the backup speech recognition engine corresponding to Korean to obtain corresponding text information. Afterwards, the terminal device can perform question-answer matching on the text information through the standby question-answer matching model corresponding to Korean in the main module, and obtain corresponding answer information. If the confidence degree of the reply information is greater than or equal to the confidence degree threshold, the reply information is converted into speech through the text-to-speech module corresponding to Korean, and the speech output is performed. If the confidence degree of the reply information is less than the confidence degree threshold, the standby speech recognition engine is reselected to re-recognize the speech data.
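  • Putting the pieces together, a minimal sketch of the overall loop described above (every helper, attribute, and threshold here is an assumed stand-in, not an interface defined by this application):

        def handle_utterance(voice_data, facial_features, ctx):
            """Main engine first, then fall back through the sorted standby engines."""
            engines = [ctx.first_engine] + ctx.sort_standby_engines(facial_features)
            for engine in engines:
                text = engine.recognize(voice_data)
                if ctx.accuracy_of(text) < ctx.accuracy_threshold:
                    continue                            # recognition too poor: next engine
                reply, confidence = ctx.question_answer(engine.language, text)
                if confidence >= ctx.confidence_threshold:
                    return ctx.text_to_speech(engine.language, reply)
            return None                                 # no engine produced a confident reply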
  • the embodiment of the present application also provides a voice interaction method, which will be described in detail below with reference to FIG. 4 .
  • Step 401: Acquire the voice data sent by the user to the device and the user's facial feature data.
  • Step 402: Sort the at least one standby speech recognition engine according to the facial feature data; the at least one standby speech recognition engine respectively corresponds to at least one language type.
  • Step 403: Judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data.
  • Step 404: If not, select a target speech recognition engine matching the voice data from the at least one standby speech recognition engine according to the sorting of the at least one standby speech recognition engine.
  • Step 405: Generate reply information for the voice data according to the second speech recognition result of the voice data by the target speech recognition engine.
  • Sorting the at least one standby speech recognition engine according to the facial feature data includes: identifying the target language group to which the user belongs according to the facial feature data; and sorting the at least one standby speech recognition engine according to the target language group and the correspondence between language groups and standby speech recognition engines. The facial feature data includes at least one of skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.
  • Before judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the method further includes: acquiring the current geographic location of the device; determining the first language type according to the language distribution features of the geographic location; and using the recognition engine corresponding to the first language type as the first speech recognition engine.
  • Judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data includes: performing speech recognition on the voice data through the first speech recognition engine corresponding to the preset first language type to obtain a first speech recognition result; obtaining the text information in the first speech recognition result; calculating the recognition accuracy of the text information; and, if the recognition accuracy is less than the set accuracy threshold, determining that the speech data does not match the first speech recognition engine.
  • The method also includes: if the recognition accuracy is greater than or equal to the set accuracy threshold, using the question-answer matching model to perform question-answer matching on the text information to obtain reply information and the confidence of the reply information; and, if the confidence of the reply information is less than the preset confidence threshold, determining that the voice data does not match the first speech recognition engine.
  • Selecting a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine includes: selecting from the at least one standby speech recognition engine in turn, according to the sorting, to obtain a second speech recognition engine; judging whether the voice data matches the second speech recognition engine according to the second speech recognition result; and, if the voice data matches the second speech recognition engine, using the second speech recognition engine as the target speech recognition engine.
  • The terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby speech recognition engines according to the facial feature data.
  • When the speech recognition engine corresponding to the first language type does not match the speech data, the cloud server can select the target speech recognition engine from the standby speech recognition engines and generate reply information for the voice data according to the second speech recognition result.
  • The execution subject of each step of the method provided in the above embodiments may be the same device, or the method may be executed by different devices.
  • the execution subject of steps 401 to 405 may be device A; for another example, the execution subject of steps 401 to 403 may be device A, and the execution subject of steps 404 and 405 may be device B; and so on.
  • the voice interaction device includes: an acquisition module 501 , a sorting module 502 , a judgment module 503 , a selection module 504 and a generation module 505 .
  • the acquisition module 501 is used to: acquire the voice data sent by the user for the device and the facial feature data of the user;
  • the sorting module 502 is used to: sort at least one standby voice recognition engine according to the facial feature data
  • the at least one standby speech recognition engine corresponds to at least one language type respectively;
  • The judging module 503 is used to: judge whether the first speech recognition engine corresponding to the preset first language type matches the speech data. The selection module 504 is configured to: if not, select a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine according to the ranking of the at least one standby speech recognition engine;
  • generating module 505, configured to: generate reply information for the voice data according to a second voice recognition result of the voice data by the target voice recognition engine.
  • When the sorting module 502 sorts at least one standby speech recognition engine according to the facial feature data, it is specifically used to: identify the target language group to which the user belongs according to the facial feature data; and sort the at least one standby speech recognition engine according to the target language group to which the user belongs and the correspondence between language groups and standby speech recognition engines. The facial feature data includes at least one of skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.
  • The sorting module 502 is further configured to: acquire the current geographic location of the device; determine the first language type according to the language distribution features of the geographic location; and use the recognition engine corresponding to the first language type as the first speech recognition engine.
  • The judging module 503 is specifically configured to: perform speech recognition on the speech data using the first speech recognition engine corresponding to the preset first language type to obtain a first speech recognition result; obtain the text information in the first speech recognition result; calculate the recognition accuracy of the text information; and, if the recognition accuracy is less than the set accuracy threshold, determine that the voice data does not match the first speech recognition engine.
  • the judging module 503 is also configured to: if the recognition accuracy rate is greater than or equal to the set accuracy rate threshold, use a question-answer matching model to perform question-answer matching on the text information to obtain answer information and the answer Confidence of the information; if the confidence of the reply information is less than a preset confidence threshold, it is determined that the voice data does not match the first voice recognition engine.
  • When the selection module 504 selects a target speech recognition engine that matches the voice data from at least one standby speech recognition engine, it is specifically configured to: select from the at least one standby speech recognition engine in turn, according to the sorting, to obtain a second speech recognition engine; judge whether the speech data matches the second speech recognition engine according to the second speech recognition result; and, if the voice data matches the second speech recognition engine, use the second speech recognition engine as the target speech recognition engine.
  • The terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby speech recognition engines according to the facial feature data.
  • When the speech recognition engine corresponding to the first language type does not match the speech data, the cloud server can select the target speech recognition engine from the standby speech recognition engines and generate reply information for the voice data according to the second speech recognition result.
  • Fig. 6 is a schematic structural diagram of a cloud server provided by an exemplary embodiment of the present application.
  • the server is suitable for the voice interaction system provided in the foregoing embodiment.
  • The server includes: a memory 601, a processor 602, and a communication component 603.
  • the memory 601 is used to store computer programs, and can be configured to store other various data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phonebook data, messages, pictures, videos, etc.
  • The processor 602, coupled with the memory 601, is used to execute the computer program in the memory 601 so as to: obtain the voice data sent by the user for the device and the facial feature data of the user; sort at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine respectively corresponding to at least one language type; judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; if not, select a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine according to the sorting; and generate reply information for the voice data according to the second speech recognition result of the voice data by the target speech recognition engine.
  • When the processor 602 sorts at least one standby speech recognition engine according to the facial feature data, it is specifically configured to: identify the target language group to which the user belongs according to the facial feature data; and sort the at least one standby speech recognition engine according to the target language group to which the user belongs and the correspondence between language groups and standby speech recognition engines. The facial feature data includes at least one of skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.
  • The processor 602 is further configured to: acquire the current geographic location of the device; determine the first language type according to the language distribution features of the geographic location; and use the recognition engine corresponding to the first language type as the first speech recognition engine.
  • When judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the processor 602 is specifically configured to: perform speech recognition on the speech data using the first speech recognition engine corresponding to the preset first language type to obtain a first speech recognition result; obtain the text information in the first speech recognition result; calculate the recognition accuracy of the text information; and, if the recognition accuracy is less than the set accuracy threshold, determine that the voice data does not match the first speech recognition engine.
  • the processor 602 is further configured to: if the recognition accuracy rate is greater than or equal to the set accuracy rate threshold, use a question-answer matching model to perform question-answer matching on the text information to obtain Reply information and a confidence degree of the reply information; if the confidence degree of the reply information is less than a preset confidence threshold, it is determined that the voice data does not match the first speech recognition engine.
  • When the processor 602 selects a target speech recognition engine that matches the voice data from at least one standby speech recognition engine, it is specifically configured to: select from the at least one standby speech recognition engine in turn, according to the sorting, to obtain a second speech recognition engine; judge whether the speech data matches the second speech recognition engine according to the second speech recognition result; and, if the voice data matches the second speech recognition engine, use the second speech recognition engine as the target speech recognition engine.
  • the cloud server further includes: a power supply component 604 and other components.
  • FIG. 6 only schematically shows some components, which does not mean that the cloud server only includes the components shown in FIG. 6 .
  • The terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby speech recognition engines according to the facial feature data.
  • When the speech recognition engine corresponding to the first language type does not match the speech data, the cloud server can select the target speech recognition engine from the standby speech recognition engines and generate reply information for the voice data according to the second speech recognition result.
  • The embodiments of the present application also provide a computer-readable storage medium storing a computer program; when the computer program is executed, the steps that can be executed by the cloud server in the above method embodiments are implemented.
  • An embodiment of the present application further provides a computer program product, including computer programs/instructions; when the computer programs/instructions are executed by a processor, the steps that can be executed by the cloud server in the above method embodiments are implemented.
  • the memory 601 in the above-mentioned Fig. 6 can be realized by any type of volatile or non-volatile storage device or their combination, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM) , Erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic Disk or Optical Disk.
  • the above-mentioned communication component 603 in FIG. 6 is configured to facilitate wired or wireless communication between the device where the communication component is located and other devices.
  • the device where the communication component is located can access a wireless network based on communication standards, such as WiFi, 2G, 3G, 4G or 5G, or a combination thereof.
  • the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • The communication component may also be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • the power supply component 604 in FIG. 6 provides power for various components of the device where the power supply component is located.
  • a power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to the device in which the power supply component resides.
  • the embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, the instruction means realizing the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.
  • In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • Memory may include non-permanent storage in computer readable media, in the form of random access memory (RAM) and/or nonvolatile memory such as read-only memory (ROM) or flash RAM. Memory is an example of computer readable media.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology.
  • Information may be computer readable instructions, data structures, modules of a program, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, A magnetic tape cartridge, disk storage or other magnetic storage device or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
  • computer-readable media excludes transitory computer-readable media, such as modulated data signals and carrier waves.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Voice interaction method, system and apparatus, server, and storage medium. In the voice interaction system, a terminal device can acquire voice data and facial feature data of a user (401); a cloud server can sort at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine respectively corresponding to at least one language type (402); whether a first speech recognition engine corresponding to a preset first language type matches the voice data is judged (403); when the first speech recognition engine corresponding to the first language type does not match the voice data, the cloud server can select, from the standby speech recognition engines, a target speech recognition engine that matches the voice data (404); and reply information for the voice data is generated according to a second speech recognition result of the voice data from the target speech recognition engine (405). By means of this implementation, when a user uses a different language type, a terminal device can perform speech recognition on voice data input by the user relatively accurately, so that a reply that relatively matches the voice information can be provided to the user.
PCT/CN2023/073326 2022-01-28 2023-01-20 Voice interaction method, system and apparatus, device, and storage medium WO2023143439A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210108135.4 2022-01-28
CN202210108135.4A CN114464179B (zh) 2022-01-28 2022-01-28 语音交互方法、系统、装置、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2023143439A1 true WO2023143439A1 (fr) 2023-08-03

Family

ID=81412433

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/073326 WO2023143439A1 (fr) 2022-01-28 2023-01-20 Procédé, système et appareil d'interaction vocale, dispositif, et support de stockage

Country Status (2)

Country Link
CN (1) CN114464179B (fr)
WO (1) WO2023143439A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114464179B (zh) * 2022-01-28 2024-03-19 达闼机器人股份有限公司 语音交互方法、系统、装置、设备及存储介质


Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8909532B2 (en) * 2007-03-23 2014-12-09 Nuance Communications, Inc. Supporting multi-lingual user interaction with a multimodal application
WO2017112813A1 (fr) * 2015-12-22 2017-06-29 Sri International Assistant personnel virtuel multilingue
CN106710586B (zh) * 2016-12-27 2020-06-30 北京儒博科技有限公司 一种语音识别引擎自动切换方法和装置
CN107391122B (zh) * 2017-07-01 2020-03-27 珠海格力电器股份有限公司 设置终端系统语言的方法、装置和终端
CN108766414B (zh) * 2018-06-29 2021-01-15 北京百度网讯科技有限公司 用于语音翻译的方法、装置、设备和计算机可读存储介质
CN111508472B (zh) * 2019-01-11 2023-03-03 华为技术有限公司 一种语种切换方法、装置及存储介质
CN112116909A (zh) * 2019-06-20 2020-12-22 杭州海康威视数字技术股份有限公司 语音识别方法、装置及系统
CN111627432B (zh) * 2020-04-21 2023-10-20 升智信息科技(南京)有限公司 主动式外呼智能语音机器人多语种交互方法及装置

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5991719A (en) * 1998-04-27 1999-11-23 Fujistu Limited Semantic recognition system
CN107545887A (zh) * 2016-06-24 2018-01-05 中兴通讯股份有限公司 语音指令处理方法及装置
CN109545197A (zh) * 2019-01-02 2019-03-29 珠海格力电器股份有限公司 语音指令的识别方法、装置和智能终端
CN109949795A (zh) * 2019-03-18 2019-06-28 北京猎户星空科技有限公司 一种控制智能设备交互的方法及装置
CN110491383A (zh) * 2019-09-25 2019-11-22 北京声智科技有限公司 一种语音交互方法、装置、系统、存储介质及处理器
CN111128194A (zh) * 2019-12-31 2020-05-08 云知声智能科技股份有限公司 一种提高在线语音识别效果的系统及方法
CN112732887A (zh) * 2021-01-22 2021-04-30 南京英诺森软件科技有限公司 一种多轮对话的处理装置及其系统
CN113506565A (zh) * 2021-07-12 2021-10-15 北京捷通华声科技股份有限公司 语音识别的方法、装置、计算机可读存储介质与处理器
CN114464179A (zh) * 2022-01-28 2022-05-10 达闼机器人股份有限公司 语音交互方法、系统、装置、设备及存储介质

Also Published As

Publication number Publication date
CN114464179B (zh) 2024-03-19
CN114464179A (zh) 2022-05-10

Similar Documents

Publication Publication Date Title
CN111488433B (zh) 一种适用于银行的提升现场体验感的人工智能交互系统
WO2020007129A1 (fr) Procédé et dispositif d'acquisition de contexte basés sur une interaction vocale
US20210073551A1 (en) Method and system for video segmentation
WO2023143439A1 (fr) Procédé, système et appareil d'interaction vocale, dispositif, et support de stockage
US11972598B2 (en) Context-based object location via augmented reality device
CN106649290A (zh) 语音翻译方法及系统
CN111341311A (zh) 一种语音对话方法及装置
CN110570208A (zh) 投诉预处理方法以及装置
CN111553706A (zh) 一种刷脸支付方法、装置及设备
CN108305629B (zh) 一种场景学习内容获取方法、装置、学习设备及存储介质
WO2022083114A1 (fr) Procédé, appareil, dispositif, support de stockage et programme de dialogue intelligent
CN113947209A (zh) 基于云边协同的集成学习方法、系统及存储介质
CN113436614A (zh) 语音识别方法、装置、设备、系统及存储介质
CN113766633A (zh) 数据处理方法、装置、电子设备以及存储介质
CN113076533B (zh) 一种业务处理方法及装置
WO2023124869A1 (fr) Procédé, dispositif et appareil de détection de caractère vivant, et support de stockage
CN111179913A (zh) 一种语音处理方法及装置
CN115602160A (zh) 基于语音识别的业务办理方法、装置及电子设备
KR102443629B1 (ko) 딥러닝 nlp 모델을 활용한 뉴스 긍정도 분석 솔루션 및 시스템
CN114495931A (zh) 语音交互方法、系统、装置、设备及存储介质
CN113269154A (zh) 一种图像识别方法、装置、设备及存储介质
JP2019212311A (ja) ネットワークメッセージサービスを利用して費用振り分ける方法、コンピュータプログラム、およびコンピューティングデバイス
CN105701459B (zh) 一种图片显示方法及终端设备
CN114283486B (zh) 图像处理、模型训练、识别方法、装置、设备及存储介质
CN108922547A (zh) 身份的识别方法、装置及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23746318

Country of ref document: EP

Kind code of ref document: A1