WO2023143439A1 - Speech interaction method, system and apparatus, and device and storage medium - Google Patents

Speech interaction method, system and apparatus, and device and storage medium

Info

Publication number
WO2023143439A1
WO2023143439A1 (PCT/CN2023/073326)
Authority
WO
WIPO (PCT)
Prior art keywords
recognition engine
speech recognition
data
voice
speech
Prior art date
Application number
PCT/CN2023/073326
Other languages
French (fr)
Chinese (zh)
Inventor
王军锋
袁国勇
王伟健
Original Assignee
达闼机器人股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 达闼机器人股份有限公司 filed Critical 达闼机器人股份有限公司
Publication of WO2023143439A1 publication Critical patent/WO2023143439A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211: Selection of the most significant subset of features
    • G06F 18/2113: Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/01: Assessment or evaluation of speech recognition systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225: Feedback of the input speech

Definitions

  • the embodiments of the present application relate to the technical field of intelligent robots, and in particular to a voice interaction method, system, device, equipment and storage medium.
  • Embodiments of the present application provide a voice interaction method, system, apparatus, device, and storage medium, which are used to more accurately perform speech recognition on voice data input by a user and then provide the user with a reply that matches the voice information.
  • An embodiment of the present application provides a voice interaction method, including: acquiring voice data sent by a user for a device and facial feature data of the user; sorting at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine each corresponding to at least one language type; judging whether a first speech recognition engine corresponding to a preset first language type matches the voice data; if not, selecting, according to the ordering of the at least one standby speech recognition engine, a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine; and generating reply information for the voice data according to a second speech recognition result of the voice data by the target speech recognition engine.
  • sorting at least one standby speech recognition engine according to the facial feature data includes: identifying the target language type group to which the user belongs according to the facial feature data; and sorting the at least one standby speech recognition engine according to the target language type group to which the user belongs and the correspondence between language type groups and standby speech recognition engines. The facial feature data includes at least one of: skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.
  • before judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the method further includes: acquiring the current geographic location of the device; determining the first language type according to the language distribution features of the geographic location; and using the recognition engine corresponding to the first language type as the first speech recognition engine.
  • judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data includes: performing speech recognition on the voice data through the first speech recognition engine corresponding to the preset first language type to obtain a first speech recognition result; obtaining the text information in the first speech recognition result; calculating the recognition accuracy of the text information; and, if the recognition accuracy is less than a set accuracy threshold, determining that the voice data does not match the first speech recognition engine.
  • the method further includes: if the recognition accuracy is greater than or equal to the set accuracy threshold, using a question-answer matching model to perform question-answer matching on the text information to obtain reply information and a confidence level of the reply information; and, if the confidence level of the reply information is less than a preset confidence threshold, determining that the voice data does not match the first speech recognition engine.
  • selecting a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine includes: selecting from the at least one standby speech recognition engine in turn, following the ordering, to obtain a second speech recognition engine; judging, according to the second speech recognition result, whether the voice data matches the second speech recognition engine; and, if the voice data matches the second speech recognition engine, using the second speech recognition engine as the target speech recognition engine.
  • the embodiment of the present application also provides a voice interaction system, including a terminal device and a cloud server. The terminal device is mainly used to: obtain the voice data sent by the user for the device and the facial feature data of the user; and send the voice data and the facial feature data to the cloud server. The cloud server is mainly used to: receive the voice data and the facial feature data; sort at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine each corresponding to at least one language type; judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; if not, select a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine according to the sorting; and generate reply information for the voice data according to a second speech recognition result of the voice data by the target speech recognition engine.
  • the embodiment of the present application also provides a voice interaction device, including: an acquisition module, configured to acquire the voice data sent by the user for the device and the facial feature data of the user; a sorting module, configured to sort at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine each corresponding to at least one language type; a judging module, configured to judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; a selection module, configured to, if not, select a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine according to the ordering of the at least one standby speech recognition engine; and a generating module, configured to generate reply information for the voice data according to a second speech recognition result of the voice data by the target speech recognition engine.
  • the embodiment of the present application also provides a cloud server, including a memory, a processor, and a communication component, wherein the memory is used to store one or more computer instructions, and the processor is used to execute the one or more computer instructions to perform the steps in the voice interaction method.
  • the embodiment of the present application also provides a computer-readable storage medium storing a computer program, and when the computer program is executed by a processor, the processor is caused to implement the steps in the voice interaction method.
  • Embodiments of the present application provide a voice interaction method, system, apparatus, device, and storage medium, in which the terminal device can obtain the user's voice data and facial feature data, and the cloud server can sort the standby speech recognition engines according to the facial feature data.
  • when the first speech recognition engine does not match the voice data, the cloud server can select the target speech recognition engine from the standby speech recognition engines, and generate reply information for the voice data according to the second speech recognition result of the voice data by the target speech recognition engine.
  • in this way, the terminal device can more accurately perform speech recognition on the voice data input by the user, and can thus provide the user with a reply that matches the voice information.
  • FIG. 1 is a schematic structural diagram of a voice interaction system provided by an exemplary embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a voice interaction system in an actual scenario provided by an exemplary embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a voice interaction system in an actual scenario provided by another exemplary embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a voice interaction method provided by an exemplary embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a voice interaction device provided by an exemplary embodiment of the present application.
  • FIG. 6 is a schematic diagram of a cloud server provided by an exemplary embodiment of the present application.
  • FIG. 1 is a schematic structural diagram of a voice interaction system provided by an exemplary embodiment of the present application.
  • the voice interaction system 100 includes: a cloud server 10 and a terminal device 20 .
  • the cloud server 10 can be implemented as a cloud host, a virtual center in the cloud, or an elastic computing instance in the cloud, etc., which is not limited in this embodiment.
  • the composition of the cloud server 10 mainly includes a processor, a hard disk, a memory, a system bus, etc., and is similar to a general computer architecture, and will not be repeated here.
  • the terminal device 20 can be realized as a variety of terminal devices in different scenarios. For example, in hotel, guesthouse, or restaurant scenarios, it can be realized as a robot that provides services; in intelligent driving assistance or automatic driving scenarios, it can be realized as a controlled vehicle; in a banking scenario, it can be realized as a multi-functional financial terminal; in a hospital scenario, it can be realized as a registration and payment terminal; and in a movie theater scenario, it can be realized as a ticket collection terminal, etc.
  • a wireless communication connection can be established between the cloud server 10 and the terminal device 20, and the specific communication connection method can be determined according to different application scenarios.
  • the wireless communication connection can be implemented based on a virtual private network (VPN) to ensure communication security.
  • the terminal device 20 is mainly used to: obtain the voice data sent by the user to the terminal device 20 and the user's facial feature data, and send the voice data and the facial feature data to the cloud server 10.
  • the facial feature data is used to identify the language group to which the user belongs, and the facial feature data may include: at least one of skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.
  • for example, the user's facial feature data may indicate that the user's eyes are light green and deep-set, and the hair is blond.
  • the cloud server 10 is mainly used for: receiving the voice data and facial feature data, and sorting at least one standby voice recognition engine according to the facial feature data.
  • at least one spare speech recognition engine corresponds to at least one language type.
  • the at least one spare speech recognition engine includes: a speech recognition engine corresponding to Arabic, a speech recognition engine corresponding to German, and a speech recognition engine corresponding to French.
  • after sorting, the cloud server 10 can judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; if not, a target speech recognition engine matching the voice data is selected from the at least one standby speech recognition engine according to the sorting.
  • here, "first" is used only to distinguish between speech recognition engines.
  • the target speech recognition engine refers to a speech recognition engine that matches the speech data.
  • for example, suppose the user speaks French while the first speech recognition engine corresponding to the preset first language type recognizes Chinese. The cloud server 10 judges that this engine does not match the voice data and, following the order "speech recognition engine corresponding to French, speech recognition engine corresponding to German, speech recognition engine corresponding to Arabic", selects the target speech recognition engine matching the voice data from the standby speech recognition engines, namely the speech recognition engine corresponding to French.
  • the cloud server 10 can generate the reply information of the voice data according to the second voice recognition result of the voice data by the target voice recognition engine.
  • the reply information may be implemented as text information or audio information used to provide the user with a reply. For example, if the user says to the terminal device 20 "What time will dinner be served in the afternoon", the cloud server 10 may generate a reply message of "six o'clock in the afternoon". Further optionally, the cloud server 10 may send the generated reply information to the terminal device 20 in the form of text or audio, so that the terminal device 20 outputs the reply information to the user through an audio component or a display component.
  • the terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby voice engines according to the facial feature data.
  • when the speech recognition engine corresponding to the first language type does not match the voice data, the cloud server can select the target speech recognition engine from the standby speech recognition engines, and generate the reply information for the voice data according to the target engine's second speech recognition result.
  • the "sorting at least one spare speech recognition engine according to facial feature data" described in the foregoing embodiments can be implemented based on the following steps:
  • the cloud server 10 can obtain facial feature data by performing feature extraction on pre-collected user facial images, and further, the cloud server 10 can identify the target language group to which the user belongs according to the facial feature data.
  • the target language type group refers to the language type group to which the user belongs.
  • for example, if the user's facial feature data indicates that the user's eyes are light green and deep-set and the hair is blond, features that are common in European countries such as France or Germany, the cloud server 10 can identify from the facial feature data that the target language type group to which the user belongs is the French group or the German group.
  • when identifying the language type group, the cloud server 10 can input the facial feature data into a preset language-group SVM (Support Vector Machine) classifier.
  • the language-group SVM classifier is trained in advance and can divide human facial images into Korean, Chinese, French, and other groups according to language type, yielding categories for multiple language type groups. Therefore, after the cloud server 10 inputs the facial feature data into the classifier, the classifier can match the facial feature data against the multiple language-group categories, obtain the matching language type groups and the corresponding matching degrees (i.e., probabilities), and then output the language type group corresponding to the facial feature data.
  • for example, after the cloud server inputs the facial feature data into the preset classifier, the matching degrees obtained for the target language type groups may be 80% for the Chinese group, 70% for the French group, and 50% for the English group.
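  • By way of illustration, the following is a minimal Python sketch of the classification step described above, using scikit-learn's SVC with probability estimates playing the role of the "matching degree". The feature encoding, group labels, and training data below are assumptions made for the example; the patent does not specify them.

```python
# Hypothetical sketch of the language-group classification step.
# Feature encoding, labels, and training data are illustrative only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy facial-feature vectors: [skin_tone, hair_color, eye_color],
# each encoded in [0, 1] by some assumed upstream feature extractor.
centers = {
    "French group":  [0.9, 0.8, 0.7],
    "Chinese group": [0.4, 0.1, 0.1],
    "Korean group":  [0.5, 0.2, 0.2],
}
features, labels = [], []
for group, center in centers.items():
    features.extend(rng.normal(center, 0.05, size=(10, 3)))  # 10 toy samples per group
    labels.extend([group] * 10)

# probability=True lets the classifier return per-group matching degrees.
classifier = SVC(probability=True).fit(features, labels)

def matching_degrees(facial_features):
    """Return {language type group: matching degree} for one user."""
    probs = classifier.predict_proba([facial_features])[0]
    return dict(zip(classifier.classes_, probs))

print(matching_degrees([0.88, 0.78, 0.72]))  # French group scores highest
```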
  • the cloud server 10 sorts the at least one standby speech recognition engine according to the target language group to which the user belongs and the corresponding relationship between the language group and the spare speech recognition engine.
  • the correspondence between the language group and the standby speech recognition engine may be: the French group corresponds to the speech recognition engine corresponding to French, and the German group corresponds to the speech recognition engine corresponding to German.
  • based on the target language type groups and the corresponding relationships, the at least one standby speech recognition engine can then be arranged in descending order of matching degree, i.e., "the speech recognition engine corresponding to Chinese, the speech recognition engine corresponding to French, and the speech recognition engine corresponding to English".
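  • A compact sketch of this sorting step follows; the engine names and the group-to-engine correspondence table are hypothetical placeholders.

```python
# Hypothetical sketch: order standby speech recognition engines by the
# classifier's matching degree for their corresponding language type group.
matching_degrees = {"Chinese": 0.80, "French": 0.70, "English": 0.50}

# Assumed correspondence between language type groups and standby engines.
group_to_engine = {
    "Chinese": "chinese_asr_engine",
    "French":  "french_asr_engine",
    "English": "english_asr_engine",
}

ranked_groups = sorted(group_to_engine,
                       key=lambda g: matching_degrees.get(g, 0.0),
                       reverse=True)  # highest matching degree first
ranked_engines = [group_to_engine[g] for g in ranked_groups]
print(ranked_engines)  # ['chinese_asr_engine', 'french_asr_engine', 'english_asr_engine']
```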
  • the cloud server 10 may also acquire the current geographic location of the terminal device 20, determine the first language type according to the language distribution features of the geographic location, and use the recognition engine corresponding to the first language type as the first speech recognition engine.
  • for example, suppose the terminal device 20 is currently in a community inhabited mainly by Koreans, so that the community's language distribution feature is that many people speak Korean and few speak Chinese.
  • the cloud server 10 can determine the first language type as Korean according to the language distribution feature, and use the recognition engine corresponding to Korean as the first speech recognition engine.
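  • The location-based choice of the first language type could look like the following sketch; the language distribution data is a made-up placeholder, since the patent does not say where such statistics come from.

```python
# Hypothetical sketch: derive the preset first language type from the
# language distribution of the device's current geographic location.
language_distribution = {
    # location id -> {language: share of speakers} (illustrative numbers)
    "community_a": {"Korean": 0.85, "Chinese": 0.10, "English": 0.05},
}

def first_language(location_id: str, default: str = "Chinese") -> str:
    dist = language_distribution.get(location_id)
    if not dist:
        return default              # no statistics: fall back to a default
    return max(dist, key=dist.get)  # most widely spoken language wins

print(first_language("community_a"))  # Korean
```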
  • when judging whether the first speech recognition engine matches the voice data, the cloud server 10 may use the first speech recognition engine corresponding to the preset first language type to perform speech recognition on the voice data to obtain a first speech recognition result.
  • the cloud server 10 can obtain the text information in the first speech recognition result, and calculate the recognition accuracy of the text information.
  • the calculation of the recognition accuracy rate may be performed through a preset speech recognition model, or may be calculated through a preset algorithm.
  • for example, evaluation indicators such as the Sentence Error Rate (SER), Sentence Correct rate (S.Corr), or Character Error Rate (CER) of the text information can be computed through a preset model or algorithm, and the recognition accuracy of the text information is then calculated from the multiple evaluation indicators and their respective weights. The specific indicators and weights are not limited in this embodiment.
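  • Since the patent names SER, S.Corr, and CER as possible indicators but fixes neither the weights nor the combination rule, the sketch below assumes a simple weighted sum with error rates folded in as (1 - rate):

```python
# Hedged sketch of the weighted-indicator recognition accuracy described
# above. The weights and the combination rule are assumptions.
def recognition_accuracy(ser: float, s_corr: float, cer: float,
                         weights=(0.3, 0.4, 0.3)) -> float:
    """Combine indicators into one accuracy score in [0, 1].

    Error rates enter as (1 - rate) so a higher score always
    means a better recognition result.
    """
    indicators = (1.0 - ser, s_corr, 1.0 - cer)
    return sum(w * v for w, v in zip(weights, indicators))

score = recognition_accuracy(ser=0.10, s_corr=0.92, cer=0.05)
print(f"{score:.3f}", score >= 0.90)  # 0.923 True, with a 90% threshold
```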
  • the cloud server 10 can further judge whether the voice data matches the first speech recognition engine according to the confidence of the reply information generated in the question-answer matching stage. The details will be described below.
  • the cloud server 10 may use a question-answer matching model to perform question-answer matching on the text information based on NLP (Natural Language Processing) technology.
  • the question-answer matching model can search for a plurality of pre-selected information with different confidence levels corresponding to the text information in the built-in data set of the model according to the input text information.
  • the question-answer matching model can select the pre-selected information with the highest confidence as the answer information from the multiple pre-selected information.
  • for example, when the cloud server 10 uses the question-answer matching model to perform question-and-answer matching on the text information "which street is the nearest bank to me", it can obtain the pre-selected information "on street A" with a confidence level of 80% and the pre-selected information "on street B" with a confidence level of 85%; the pre-selected information "on street B", whose confidence level is 85%, can then be selected from the two as the answer information.
  • the cloud server 10 can obtain the reply information and the confidence level of the reply information. If the confidence of the reply information is less than the preset confidence threshold, it is determined that the voice data does not match the first voice recognition engine.
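  • The confidence check on the reply information can be pictured as follows; the candidate list mirrors the street example above, and the threshold value is an assumption.

```python
# Hypothetical sketch of the question-answer matching confidence check.
# Any QA model that returns (answer, confidence) candidates would fit.
CONFIDENCE_THRESHOLD = 0.85  # assumed preset confidence threshold

def pick_answer(candidates):
    """candidates: list of (answer_text, confidence); return the best."""
    return max(candidates, key=lambda c: c[1])

candidates = [("on street A", 0.80), ("on street B", 0.85)]
answer, confidence = pick_answer(candidates)

if confidence < CONFIDENCE_THRESHOLD:
    # Low confidence is treated as a mismatch with the first engine,
    # triggering fallback to the standby speech recognition engines.
    print("mismatch: fall back to standby engines")
else:
    print("reply:", answer)  # here: 'on street B' at 0.85
```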
  • if it is determined that the voice data does not match the first speech recognition engine, the cloud server 10 may select a target speech recognition engine matching the voice data from the at least one standby speech recognition engine.
  • when the cloud server 10 selects a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine, it can select one speech recognition engine from the at least one standby speech recognition engine as a second speech recognition engine.
  • at least one backup speech recognition engine includes a speech recognition engine corresponding to Chinese and a speech recognition engine corresponding to French, and the cloud server can select a speech recognition engine corresponding to French from the at least one backup speech recognition engine as the second voice recognition engine.
  • the second speech recognition engine can then perform speech recognition on the voice data to obtain a second speech recognition result.
  • the second speech recognition result refers to the speech recognition result obtained by performing speech recognition through the second speech recognition engine.
  • here, "second" is used only to distinguish the results obtained from multiple speech recognition passes.
  • the cloud server 10 can determine whether the voice data matches the second voice recognition engine according to the second voice recognition result. The details will be described below.
  • the cloud server 10 can obtain the text information in the second speech recognition result, and calculate the recognition accuracy of the text information.
  • the calculation of the recognition accuracy rate may be performed through a preset speech recognition model, or may be calculated through a preset algorithm.
  • multiple evaluation indicators such as the sentence error rate, sentence correct rate, or character error rate of the text information can be calculated through a preset model or algorithm, and the recognition accuracy of the text information can be calculated from the multiple evaluation indicators and their respective weights. If the recognition accuracy is greater than or equal to the set accuracy threshold, it is determined that the voice data matches the second speech recognition engine, and the cloud server 10 can use the second speech recognition engine as the target speech recognition engine. If the recognition accuracy is less than the set accuracy threshold, it is determined that the voice data does not match the second speech recognition engine. The threshold can be set to 90%, 85%, or 80%, etc., which is not limited in this embodiment.
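  • Putting the pieces together, the fallback selection over the ranked standby engines could be sketched as below; the engine interface (recognize, accuracy) is assumed for illustration, not taken from the patent.

```python
# Hedged sketch of the standby-engine fallback loop. The engine objects
# and their recognize()/accuracy() interface are illustrative assumptions.
ACCURACY_THRESHOLD = 0.90  # assumed set accuracy threshold

def select_target_engine(voice_data, ranked_standby_engines):
    """Try standby engines in ranked order; return the first that matches."""
    for engine in ranked_standby_engines:      # order from the sorting step
        text = engine.recognize(voice_data)    # second speech recognition result
        if engine.accuracy(text) >= ACCURACY_THRESHOLD:
            return engine                      # target speech recognition engine
    return None                                # no standby engine matched
```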
  • the voice interaction system will be further described below in conjunction with FIG. 2 and FIG. 3 and actual application scenarios.
  • the terminal device can collect the user's facial image and perform image recognition to obtain the user's facial feature data. Afterwards, the terminal device can identify the target language type group according to the facial feature data and set the standby speech recognition engines according to the target language type group. Based on the above steps, the terminal device can collect the user's initial voice data through the microphone and send it to the voice endpoint detection module, which intercepts the effective voice data in the initial voice data. Afterwards, the terminal device can perform speech recognition on the voice data through the first speech recognition engine corresponding to the first language type in the main module (i.e., the main engine) to obtain the text information corresponding to the voice data.
  • the terminal device may perform question-answer matching on the text information through the question-answer matching model corresponding to the first language type, and obtain answer information corresponding to the text information. If the confidence level of the reply information is greater than or equal to the confidence threshold, the text-to-speech module corresponding to the first language type will convert the reply information into speech and output the speech. If the confidence of the reply information is less than the confidence threshold, select a target speech recognition engine that matches the speech data from at least one spare speech recognition engine to perform speech recognition on the speech data again.
  • the terminal device performs speech recognition on the speech data through the backup speech recognition engine corresponding to Korean to obtain corresponding text information. Afterwards, the terminal device can perform question-answer matching on the text information through the standby question-answer matching model corresponding to Korean in the main module, and obtain corresponding answer information. If the confidence degree of the reply information is greater than or equal to the confidence degree threshold, the reply information is converted into speech through the text-to-speech module corresponding to Korean, and the speech output is performed. If the confidence degree of the reply information is less than the confidence degree threshold, the standby speech recognition engine is reselected to re-recognize the speech data.
  • the embodiment of the present application also provides a voice interaction method, which will be described in detail below with reference to FIG. 4 .
  • Step 401: Acquire the voice data sent by the user to the device and the user's facial feature data.
  • Step 402: Sort the at least one standby speech recognition engine according to the facial feature data; the at least one standby speech recognition engine each corresponds to at least one language type.
  • Step 403: Judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data.
  • Step 404: If not, select a target speech recognition engine matching the voice data from the at least one standby speech recognition engine according to the ranking of the at least one standby speech recognition engine.
  • Step 405: Generate reply information for the voice data according to the second speech recognition result of the voice data by the target speech recognition engine.
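  • The five steps can be read as one pipeline; the sketch below strings them together under the same assumed interfaces as the earlier examples (the device, classifier, and engine objects are hypothetical).

```python
# Hedged end-to-end sketch of steps 401-405; every interface is assumed.
def voice_interaction(device, first_engine, standby_engines, classifier):
    voice, face = device.capture()                           # step 401
    ranked = classifier.rank_engines(face, standby_engines)  # step 402
    if first_engine.matches(voice):                          # step 403
        target = first_engine
    else:                                                    # step 404
        target = next((e for e in ranked if e.matches(voice)), None)
    if target is None:
        return None  # no engine matched the user's language
    result = target.recognize(voice)                         # second recognition result
    return target.qa_model.reply(result)                     # step 405
```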
  • sorting the at least one standby speech recognition engine according to the facial feature data includes: identifying the target language type group to which the user belongs according to the facial feature data; and sorting the at least one standby speech recognition engine according to the target language type group and the correspondence between language type groups and standby speech recognition engines. The facial feature data includes at least one of: skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.
  • before judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the method also includes: acquiring the current geographical location of the device; determining the first language type according to the language distribution features of the geographical location; and using the recognition engine corresponding to the first language type as the first speech recognition engine.
  • judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data includes: performing speech recognition on the voice data through the first speech recognition engine corresponding to the preset first language type to obtain the first speech recognition result; obtaining the text information in the first speech recognition result; calculating the recognition accuracy of the text information; and, if the recognition accuracy is less than the set accuracy threshold, determining that the voice data does not match the first speech recognition engine.
  • the method also includes: if the recognition accuracy is greater than or equal to the set accuracy threshold, using the question-answer matching model to perform question-answer matching on the text information to obtain the reply information and the confidence of the reply information; and, if the confidence of the reply information is less than the preset confidence threshold, determining that the voice data does not match the first speech recognition engine.
  • selecting a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine includes: selecting from the at least one standby speech recognition engine in turn, according to the ordering, to obtain the second speech recognition engine; judging whether the voice data matches the second speech recognition engine according to the second speech recognition result; and, if the voice data matches the second speech recognition engine, using the second speech recognition engine as the target speech recognition engine.
  • the terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby voice engines according to the facial feature data.
  • when the speech recognition engine corresponding to the first language type does not match the voice data, the cloud server can select the target speech recognition engine from the standby speech recognition engines, and generate the reply information for the voice data according to the target engine's second speech recognition result.
  • the execution subject of each step of the method provided in the above embodiments may be the same device, or the method may be executed by different devices.
  • the execution subject of steps 401 to 405 may be device A; for another example, the execution subject of steps 401 to 403 may be device A, and the execution subject of steps 404 and 405 may be device B; and so on.
  • the voice interaction device includes: an acquisition module 501 , a sorting module 502 , a judgment module 503 , a selection module 504 and a generation module 505 .
  • the acquisition module 501 is used to: acquire the voice data sent by the user for the device and the facial feature data of the user;
  • the sorting module 502 is used to: sort at least one standby voice recognition engine according to the facial feature data
  • the at least one standby speech recognition engine corresponds to at least one language type respectively;
  • the judging module 503 is used to: judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; the selection module 504 is configured to: if not, select a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine according to the ranking of the at least one standby speech recognition engine;
  • generating module 505, configured to: generate reply information for the voice data according to a second voice recognition result of the voice data by the target voice recognition engine.
  • when the sorting module 502 sorts at least one standby speech recognition engine according to the facial feature data, it is specifically used to: identify the target language type group to which the user belongs according to the facial feature data; and sort the at least one standby speech recognition engine according to the target language type group to which the user belongs and the correspondence between language type groups and standby speech recognition engines. The facial feature data includes at least one of: skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.
  • the sorting module 502 is further configured to: acquire the current geographic location of the device; determine the first language type according to the language distribution features of the geographic location; and use the recognition engine corresponding to the first language type as the first speech recognition engine.
  • the judging module 503 is specifically configured to: use the first speech recognition engine corresponding to the preset first language type to perform speech recognition on the voice data to obtain a first speech recognition result; obtain the text information in the first speech recognition result; calculate the recognition accuracy of the text information; and, if the recognition accuracy is less than the set accuracy threshold, determine that the voice data does not match the first speech recognition engine.
  • the judging module 503 is also configured to: if the recognition accuracy is greater than or equal to the set accuracy threshold, use a question-answer matching model to perform question-answer matching on the text information to obtain reply information and the confidence of the reply information; and, if the confidence of the reply information is less than a preset confidence threshold, determine that the voice data does not match the first speech recognition engine.
  • when the selection module 504 selects a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine, it is specifically configured to: select from the at least one standby speech recognition engine in turn, according to the ranking, to obtain the second speech recognition engine; judge whether the voice data matches the second speech recognition engine according to the second speech recognition result; and, if the voice data matches the second speech recognition engine, use the second speech recognition engine as the target speech recognition engine.
  • the terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby voice engines according to the facial feature data.
  • when the speech recognition engine corresponding to the first language type does not match the voice data, the cloud server can select the target speech recognition engine from the standby speech recognition engines, and generate the reply information for the voice data according to the target engine's second speech recognition result.
  • Fig. 6 is a schematic structural diagram of a cloud server provided by an exemplary embodiment of the present application.
  • the server is suitable for the voice interaction system provided in the foregoing embodiment.
  • the server includes: a memory 601, a processor 602, and a communication component 603.
  • the memory 601 is used to store computer programs, and can be configured to store other various data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phonebook data, messages, pictures, videos, etc.
  • the processor 602, coupled with the memory 601, is used to execute the computer program in the memory 601, so as to: obtain the voice data sent by the user for the device and the facial feature data of the user; sort at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine each corresponding to at least one language type; judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; if not, select a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine according to the sorting; and generate reply information for the voice data according to the second speech recognition result of the voice data by the target speech recognition engine.
  • when the processor 602 sorts at least one standby speech recognition engine according to the facial feature data, it is specifically configured to: identify the target language type group to which the user belongs according to the facial feature data; and sort the at least one standby speech recognition engine according to the target language type group to which the user belongs and the correspondence between language type groups and standby speech recognition engines. The facial feature data includes at least one of: skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.
  • the processor 602 is further configured to: acquire the current geographic location of the device; determine the first language type according to the language distribution features of the geographic location; and use the recognition engine corresponding to the first language type as the first speech recognition engine.
  • when judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the processor 602 is specifically configured to: use the first speech recognition engine corresponding to the preset first language type to perform speech recognition on the voice data to obtain a first speech recognition result; obtain the text information in the first speech recognition result; calculate the recognition accuracy of the text information; and, if the recognition accuracy is less than the set accuracy threshold, determine that the voice data does not match the first speech recognition engine.
  • the processor 602 is further configured to: if the recognition accuracy is greater than or equal to the set accuracy threshold, use a question-answer matching model to perform question-answer matching on the text information to obtain reply information and a confidence degree of the reply information; and, if the confidence degree of the reply information is less than a preset confidence threshold, determine that the voice data does not match the first speech recognition engine.
  • when the processor 602 selects a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine, it is specifically configured to: select from the at least one standby speech recognition engine in turn, according to the ranking, to obtain the second speech recognition engine; judge whether the voice data matches the second speech recognition engine according to the second speech recognition result; and, if the voice data matches the second speech recognition engine, use the second speech recognition engine as the target speech recognition engine.
  • the cloud server further includes: a power supply component 604 and other components.
  • FIG. 6 only schematically shows some components, which does not mean that the cloud server only includes the components shown in FIG. 6 .
  • the terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby voice engines according to the facial feature data.
  • when the speech recognition engine corresponding to the first language type does not match the voice data, the cloud server can select the target speech recognition engine from the standby speech recognition engines, and generate the reply information for the voice data according to the target engine's second speech recognition result.
  • the embodiments of the present application also provide a computer-readable storage medium storing a computer program.
  • when the computer program is executed, the steps that can be executed by the cloud server in the above method embodiments can be implemented.
  • an embodiment of the present application further provides a computer program product, including computer programs/instructions.
  • when the computer program/instructions are executed by a processor, the steps that can be executed by the cloud server in the above method embodiments are implemented.
  • the memory 601 in the above-mentioned FIG. 6 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • the above-mentioned communication component 603 in FIG. 6 is configured to facilitate wired or wireless communication between the device where the communication component is located and other devices.
  • the device where the communication component is located can access a wireless network based on communication standards, such as WiFi, 2G, 3G, 4G or 5G, or a combination thereof.
  • the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component may be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • the power supply component 604 in FIG. 6 provides power for various components of the device where the power supply component is located.
  • a power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to the device in which the power supply component resides.
  • the embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, the instruction means implementing the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.
  • in a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • Memory may include non-permanent storage in computer readable media, in the form of random access memory (RAM) and/or nonvolatile memory such as read-only memory (ROM) or flash RAM. Memory is an example of computer readable media.
  • Computer-readable media, including both permanent and non-permanent, removable and non-removable media, can implement information storage by any method or technology.
  • Information may be computer readable instructions, data structures, modules of a program, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cartridges, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • computer-readable media excludes transitory computer-readable media, such as modulated data signals and carrier waves.


Abstract

A speech interaction method, system and apparatus, and a server and a storage medium. In the speech interaction system, a terminal device may acquire speech data and facial feature data of a user (401); a cloud server may sort at least one standby speech recognition engine according to the facial feature data, wherein the at least one standby speech recognition engine respectively corresponds to at least one language type (402); whether a first speech recognition engine corresponding to a preset first language type matches the speech data is determined (403); when the first speech recognition engine corresponding to the first language type does not match the speech data, the cloud server may select, from among the standby speech recognition engines, a target speech recognition engine that matches the speech data (404); and reply information for the speech data is generated according to a second speech recognition result, from the target speech recognition engine, of the speech data (405). By means of this implementation, when a user uses a different type of language, a terminal device can relatively accurately perform speech recognition on speech data input by the user, such that a reply that relatively matches the speech information can be provided for the user.

Description

Voice interaction method, system, apparatus, device and storage medium

Cross Reference

This application refers to Chinese Patent Application No. 2022101081354, entitled "Voice Interaction Method, System, Device, Equipment, and Storage Medium", filed on January 28, 2022, which is incorporated into this application by reference in its entirety.

Technical Field

The embodiments of the present application relate to the technical field of intelligent robots, and in particular to a voice interaction method, system, device, equipment and storage medium.

Background

With the continuous development of artificial intelligence technology, intelligent dialogue is becoming more and more popular. In shopping malls, supermarkets, restaurants and other scenarios, intelligent devices capable of intelligent dialogue (such as robots) are widely used. In the prior art, usually on the premise that the user's language type is known, an ASR (Automatic Speech Recognition) engine corresponding to the user's language type is manually set in advance to recognize the voice information input by the user, convert the voice information into text information, recognize the text information, and reply according to the recognition result. However, in many usage scenarios the user's language type is unknown and cannot be learned in advance, so the ASR engine cannot be configured before interacting with the user. As a result, speech recognition cannot be performed accurately on the voice information input by the user, and a reply matching the voice information cannot be provided to the user. Therefore, a solution is urgently needed.

Summary

Embodiments of the present application provide a voice interaction method, system, apparatus, device, and storage medium, which are used to more accurately perform speech recognition on the voice data input by a user and thus provide the user with a reply that matches the voice information.

An embodiment of the present application provides a voice interaction method, including: acquiring voice data sent by a user for a device and facial feature data of the user; sorting at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine each corresponding to at least one language type; judging whether a first speech recognition engine corresponding to a preset first language type matches the voice data; if not, selecting, according to the ordering of the at least one standby speech recognition engine, a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine; and generating reply information for the voice data according to a second speech recognition result of the voice data by the target speech recognition engine.

Further optionally, sorting at least one standby speech recognition engine according to the facial feature data includes: identifying the target language type group to which the user belongs according to the facial feature data; and sorting the at least one standby speech recognition engine according to the target language type group to which the user belongs and the correspondence between language type groups and standby speech recognition engines. The facial feature data includes at least one of: skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.

Further optionally, before judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the method further includes: acquiring the current geographic location of the device; determining the first language type according to the language distribution features of the geographic location; and using the recognition engine corresponding to the first language type as the first speech recognition engine.

Further optionally, judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data includes: performing speech recognition on the voice data through the first speech recognition engine corresponding to the preset first language type to obtain a first speech recognition result; obtaining the text information in the first speech recognition result; calculating the recognition accuracy of the text information; and, if the recognition accuracy is less than a set accuracy threshold, determining that the voice data does not match the first speech recognition engine.

Further optionally, the method further includes: if the recognition accuracy is greater than or equal to the set accuracy threshold, using a question-answer matching model to perform question-answer matching on the text information to obtain reply information and a confidence level of the reply information; and, if the confidence level of the reply information is less than a preset confidence threshold, determining that the voice data does not match the first speech recognition engine.

Further optionally, selecting a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine includes: selecting from the at least one standby speech recognition engine in turn, following the ordering, to obtain a second speech recognition engine; judging, according to the second speech recognition result, whether the voice data matches the second speech recognition engine; and, if the voice data matches the second speech recognition engine, using the second speech recognition engine as the target speech recognition engine.

An embodiment of the present application also provides a voice interaction system, including a terminal device and a cloud server. The terminal device is mainly used to: obtain the voice data sent by the user for the device and the facial feature data of the user; and send the voice data and the facial feature data to the cloud server. The cloud server is mainly used to: receive the voice data and the facial feature data; sort at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine each corresponding to at least one language type; judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; if not, select a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine according to the sorting; and generate reply information for the voice data according to a second speech recognition result of the voice data by the target speech recognition engine.

An embodiment of the present application also provides a voice interaction device, including: an acquisition module, configured to acquire the voice data sent by the user for the device and the facial feature data of the user; a sorting module, configured to sort at least one standby speech recognition engine according to the facial feature data, the at least one standby speech recognition engine each corresponding to at least one language type; a judging module, configured to judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; a selection module, configured to, if not, select a target speech recognition engine that matches the voice data from the at least one standby speech recognition engine according to the ordering of the at least one standby speech recognition engine; and a generating module, configured to generate reply information for the voice data according to a second speech recognition result of the voice data by the target speech recognition engine.
本申请实施例还提供一种云端服务器,包括:存储器、处理器以及通信组件;其中,所述存储器用于:存储一条或多条计算机指令;所述处理器用于执行所述一条或多条计算机指令,以用于:执行所述语音交互方法中的步骤。The embodiment of the present application also provides a cloud server, including: a memory, a processor, and a communication component; wherein, the memory is used to: store one or more computer instructions; and the processor is used to execute the one or more computer instructions. The instruction is used for: executing the steps in the voice interaction method.
本申请实施例还提供一种存储有计算机程序的计算机可读存储介质,当计算机程序被处理器执行时,致使处理器实现所述语音交互方法中的步骤。The embodiment of the present application also provides a computer-readable storage medium storing a computer program, and when the computer program is executed by a processor, the processor is caused to implement the steps in the voice interaction method.
本申请实施例提供一种语音交互方法、系统、装置、设备及存储介质中,终端设备可获取用户的语音数据和面部特征数据,云端服务器可根据面部特征数据对备用语音引擎进行排序。云端服务器可在第一语言类型对应的语音识别引擎与语音数据不匹配时,从备用语音识别引擎中选择出目标语音识别引擎,并根据目标语音识别引擎对语音数据的第二语音识别结 果,生成语音数据的答复信息。通过这种实施方式,当用户使用的语言类型不同时,终端设备可较为准确地对用户输入的语音数据进行语音识别,进而,可为用户提供与该语音信息较为匹配的答复。Embodiments of the present application provide a voice interaction method, system, device, device, and storage medium, in which the terminal equipment can obtain the user's voice data and facial feature data, and the cloud server can sort the standby voice engines according to the facial feature data. When the speech recognition engine corresponding to the first language type does not match the speech data, the cloud server can select the target speech recognition engine from the standby speech recognition engines, and select the target speech recognition engine according to the second speech recognition result of the speech data by the target speech recognition engine. As a result, answer information for the voice data is generated. Through this implementation manner, when the language types used by the users are different, the terminal device can more accurately perform speech recognition on the speech data input by the user, and then can provide the user with a reply that matches the speech information.
Description of Drawings

The drawings are only for illustrating the embodiments and are not to be construed as limiting the present invention. Throughout the drawings, the same reference numerals designate the same components. In the drawings:

Fig. 1 is a schematic structural diagram of a voice interaction system provided by an exemplary embodiment of the present application;

Fig. 2 is a schematic structural diagram of a voice interaction system in an actual scenario provided by an exemplary embodiment of the present application;

Fig. 3 is a schematic structural diagram of a voice interaction system in an actual scenario provided by another exemplary embodiment of the present application;

Fig. 4 is a schematic flowchart of a voice interaction method provided by an exemplary embodiment of the present application;

Fig. 5 is a schematic structural diagram of a voice interaction apparatus provided by an exemplary embodiment of the present application;

Fig. 6 is a schematic diagram of a cloud server provided by an exemplary embodiment of the present application.
Detailed Description

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

In the prior art, when users speak different language types, a robot cannot accurately perform speech recognition on the voice information input by a user and consequently cannot provide the user with a reply that matches the voice information. For this technical problem, some embodiments of the present application provide a solution. The technical solutions provided by the embodiments of the present application are described in detail below with reference to the drawings.

Fig. 1 is a schematic structural diagram of a voice interaction system provided by an exemplary embodiment of the present application. As shown in Fig. 1, the voice interaction system 100 includes a cloud server 10 and a terminal device 20.

The cloud server 10 may be implemented as a cloud host, a cloud-based virtual center, a cloud-based elastic computing instance, or the like, which is not limited in this embodiment. The cloud server 10 mainly includes a processor, a hard disk, a memory, a system bus, and the like, similar to a general-purpose computer architecture, and details are not repeated here.

The terminal device 20 may be implemented as various terminal devices in different scenarios. For example, in scenarios such as hotels, guesthouses, and restaurants, it may be implemented as a service robot; in intelligent driving assistance or autonomous driving scenarios, it may be implemented as a controlled vehicle; in a banking scenario, it may be implemented as a multi-functional financial terminal; in a hospital scenario, as a registration and payment terminal; in a movie theater scenario, as a ticket collection terminal; and so on.

In the voice interaction system 100, a wireless communication connection may be established between the cloud server 10 and the terminal device 20, and the specific communication connection method may depend on the application scenario. In some embodiments, the wireless communication connection may be implemented based on a virtual private network (VPN) to ensure communication security.

In the voice interaction system 100, the terminal device 20 is mainly configured to acquire voice data uttered by a user toward the terminal device 20 and the user's facial feature data, and to send the voice data to the cloud server 10. The facial feature data is used to identify the language type group to which the user belongs and may include at least one of: skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data. For example, the user's facial feature data may indicate light green, deep-set eyes and blonde hair.

Correspondingly, the cloud server 10 is mainly configured to receive the voice data and the facial feature data and to sort at least one standby speech recognition engine according to the facial feature data. The at least one standby speech recognition engine corresponds respectively to at least one language type. For example, the at least one standby speech recognition engine may include a speech recognition engine for Arabic, a speech recognition engine for German, and a speech recognition engine for French.

After sorting, the cloud server 10 may judge whether the first speech recognition engine corresponding to the preset first language type matches the voice data; if not, it selects, from the at least one standby speech recognition engine and in the order of the at least one standby speech recognition engine, a target speech recognition engine that matches the voice data. Here, the qualifier "first" is used only to distinguish between speech recognition engines, and the target speech recognition engine refers to the speech recognition engine that matches the voice data.

For example, suppose the user speaks French while the first speech recognition engine corresponding to the preset first language type recognizes Chinese. After determining that this engine does not match the voice data, the cloud server 10 may go through the standby engines in the order "speech recognition engine for French, speech recognition engine for German, speech recognition engine for Arabic" and select the target speech recognition engine that matches the voice data, namely the speech recognition engine for French.

Based on the above steps, the cloud server 10 may generate reply information for the voice data according to the second speech recognition result of the voice data produced by the target speech recognition engine. The reply information may be implemented as text information or audio information used to provide the user with a reply. For example, if the user asks the terminal device 20 "What time is dinner served?", the cloud server 10 may generate the reply "Six p.m.". Further optionally, the cloud server 10 may send the generated reply information to the terminal device 20 in text or audio form, so that the terminal device 20 outputs the reply information to the user through an audio component or a display component.
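As an illustration of this matching flow, the following Python sketch walks the preset first engine and then the sorted standby engines until one produces a sufficiently accurate result. The `Engine` interface, the per-engine `accuracy` callback, and the threshold value are assumptions introduced for the sketch, not part of the disclosed implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

ACCURACY_THRESHOLD = 0.85  # illustrative; the embodiment only requires "a set threshold"

@dataclass
class Engine:
    language: str
    recognize: Callable[[bytes], str]   # voice data -> recognized text
    accuracy: Callable[[str], float]    # recognized text -> recognition accuracy

def pick_target_engine(voice: bytes, first: Engine,
                       standby: List[Engine]) -> Optional[Tuple[Engine, str]]:
    """Try the preset first engine; on mismatch, walk the sorted standby list."""
    for engine in [first] + standby:    # standby list already sorted by match degree
        text = engine.recognize(voice)
        if engine.accuracy(text) >= ACCURACY_THRESHOLD:
            return engine, text         # this engine is the target engine
    return None                         # no engine matched the voice data
```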
In this embodiment, the terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby speech recognition engines according to the facial feature data. When the speech recognition engine corresponding to the first language type does not match the voice data, the cloud server can select a target speech recognition engine from the standby speech recognition engines and generate reply information for the voice data according to the second speech recognition result of the voice data produced by the target speech recognition engine. In this implementation, even when users speak different language types, the terminal device can perform speech recognition on the voice data input by a user relatively accurately and can therefore provide the user with a reply that matches the voice information.

Optionally, "sorting the at least one standby speech recognition engine according to the facial feature data" described in the foregoing embodiments may be implemented based on the following steps:

The cloud server 10 may obtain the facial feature data by performing feature extraction on a pre-collected facial image of the user, and may then identify, according to the facial feature data, the target language type group to which the user belongs, namely the language type group to which the user belongs. For example, suppose the facial feature data indicates that the user has light green, deep-set eyes and blonde hair; since people with these facial features are common in European countries such as France and Germany, the cloud server 10 may identify from the facial feature data that the target language type group to which the user belongs is the French group or the German group.

An optional process for identifying the language type group is described in detail below.

When identifying the language type group, the cloud server 10 may input the facial feature data into a preset language-type-group SVM (Support Vector Machine) classifier. Because this classifier has been trained in advance, it can classify facial images into language type groups such as a Korean group, a Chinese group, a French group, and so on, yielding multiple language type group categories. After the cloud server 10 inputs the facial feature data into the classifier, the classifier matches the facial feature data against the multiple language type group categories, obtains the groups that best match the facial feature data together with the corresponding match degrees (i.e., probabilities), and outputs the language type group corresponding to the facial feature data. For example, the cloud server may input the facial feature data into the preset classifier and obtain a match degree of 80% for the German group, 70% for the French group, and 50% for the English group as the target language type group.
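The following is a minimal sketch of such a language-type-group classifier using scikit-learn's `SVC` with probability estimates; the five-dimensional facial feature encoding and the randomly generated toy training data are hypothetical stand-ins for the pre-trained classifier described above.

```python
# Sketch of a language-type-group SVM classifier (toy data, illustrative only).
from sklearn.svm import SVC
import numpy as np

rng = np.random.default_rng(0)
groups = ["German", "French", "Chinese"]
# Hypothetical feature vectors [skin, hair, eyes, nose_bridge, lips], 10 per group.
X = np.vstack([rng.normal(loc=i, scale=0.3, size=(10, 5)) for i in range(3)])
y = np.repeat(groups, 10)

clf = SVC(probability=True).fit(X, y)   # probability=True yields match degrees

sample = rng.normal(loc=0.0, scale=0.3, size=(1, 5))  # a new user's features
probs = dict(zip(clf.classes_, clf.predict_proba(sample)[0]))
print(probs)  # per-group match degree, e.g. {'Chinese': ..., 'French': ..., 'German': ...}
```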
Based on the above steps, the cloud server 10 sorts the at least one standby speech recognition engine according to the target language type group to which the user belongs and the correspondence between language type groups and standby speech recognition engines.

For example, the correspondence between language type groups and standby speech recognition engines may be: the French group corresponds to the speech recognition engine for French, and the German group corresponds to the speech recognition engine for German. Continuing the preceding example, after identifying match degrees of 80% for the German group, 70% for the French group, and 50% for the English group as the target language type group, the cloud server 10 may, according to the target language type groups and the correspondence, arrange the standby speech recognition engines in descending order of match degree: "speech recognition engine for German, speech recognition engine for French, speech recognition engine for English".
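Given the match degrees from the classifier, the ordering step reduces to a sort over the correspondence table, as in the sketch below; the engine identifiers (`asr_de` and so on) are hypothetical names introduced for illustration.

```python
# Sorting the standby engines by the classifier's match degrees (sketch).
match_degree = {"German": 0.80, "French": 0.70, "English": 0.50}  # from the example
group_to_engine = {"German": "asr_de", "French": "asr_fr", "English": "asr_en"}

ordered_engines = [group_to_engine[g]
                   for g in sorted(match_degree, key=match_degree.get, reverse=True)]
print(ordered_engines)  # ['asr_de', 'asr_fr', 'asr_en']
```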
In some optional embodiments, before judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the cloud server 10 may acquire the geographic location where the terminal device 20 is currently located, determine the first language type according to the language distribution characteristics of that location, and use the recognition engine corresponding to the first language type as the first speech recognition engine. For example, suppose the terminal device 20 is currently in a residential community inhabited mainly by Koreans, so that the community's language distribution is characterized by many Korean speakers and few Chinese speakers. The cloud server 10 may then determine, from this language distribution, that the first language type is Korean and use the recognition engine for Korean as the first speech recognition engine.
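A minimal sketch of this location-based default follows, assuming a lookup table from known areas to their dominant language; the area names, table contents, and engine-naming scheme are illustrative assumptions only.

```python
# Choosing the preset first engine from the device's location (sketch).
LANGUAGE_DISTRIBUTION = {          # dominant language per known area (assumed table)
    "community_a": "Korean",
    "community_b": "Chinese",
}

def first_engine_for(location: str, default: str = "Chinese") -> str:
    """Map the device's location to the engine for the locally dominant language."""
    language = LANGUAGE_DISTRIBUTION.get(location, default)
    return f"asr_{language.lower()}"   # hypothetical engine id

print(first_engine_for("community_a"))  # asr_korean
```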
Optionally, when judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the cloud server 10 may perform speech recognition on the voice data through the first speech recognition engine corresponding to the preset first language type to obtain a first speech recognition result.

The cloud server 10 may then acquire the text information in the first speech recognition result and calculate the recognition accuracy of that text information. The recognition accuracy may be calculated by a preset speech recognition model or by a preset algorithm. For example, several evaluation indicators of the text information, such as the sentence error rate (SER), sentence correct rate (S.Corr), or character error rate (CER), may be computed by a preset model or algorithm, and the recognition accuracy of the text information is then calculated from these indicators and their respective weights.
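A sketch of the weighted combination described above follows; the particular weights and the conversion of error rates into correctness scores are assumptions, since the embodiment only states that several indicators are combined according to their respective weights.

```python
# Weighted combination of evaluation indicators into one accuracy score (sketch).
def recognition_accuracy(ser: float, s_corr: float, cer: float,
                         weights=(0.3, 0.4, 0.3)) -> float:
    """Combine SER, S.Corr, and CER; weights are illustrative assumptions."""
    scores = (1.0 - ser, s_corr, 1.0 - cer)   # turn error rates into correctness
    return sum(w * s for w, s in zip(weights, scores))

acc = recognition_accuracy(ser=0.10, s_corr=0.92, cer=0.05)
print(f"{acc:.2%}")   # compared against the set accuracy threshold, e.g. 85%
```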
If the calculated recognition accuracy is less than the set accuracy threshold, it is determined that the voice data does not match the first speech recognition engine. The threshold may be set to 90%, 85%, 80%, or the like, which is not limited in this embodiment.

If the calculated recognition accuracy is greater than or equal to the set accuracy threshold, it may be preliminarily determined that the voice data matches the first speech recognition engine. On this basis, the cloud server 10 may further judge whether the voice data matches the first speech recognition engine according to the confidence of the reply information generated in the question-answer matching stage, as described in detail below.

If the recognition accuracy of the text information in the first speech recognition result is greater than or equal to the set accuracy threshold, the cloud server 10 may, based on NLP (Natural Language Processing) technology, perform question-answer matching on the text information using a question-answer matching model. After prior model training, the question-answer matching model can, given input text information, search its built-in data set for multiple pieces of pre-selected information corresponding to the text, each with a different confidence, and then select the pre-selected information with the highest confidence as the reply information. For example, when the cloud server 10 performs question-answer matching on the text "On which street is the bank nearest to me?" through the question-answer matching model, it may obtain the pre-selected reply "On Street A" with a confidence of 80% and the pre-selected reply "On Street B" with a confidence of 85%, and then select the reply "On Street B" with the 85% confidence as the reply information.

Through the above question-answer matching, the cloud server 10 obtains the reply information and its confidence. If the confidence of the reply information is less than the preset confidence threshold, it is determined that the voice data does not match the first speech recognition engine.
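The confidence check then amounts to taking the best candidate and comparing it with the threshold, as in this sketch based on the example above; the threshold value and candidate list are illustrative.

```python
# Selecting the highest-confidence candidate reply and applying the threshold (sketch).
CONFIDENCE_THRESHOLD = 0.75   # illustrative value

candidates = [("On Street A", 0.80), ("On Street B", 0.85)]  # from the QA model
reply, confidence = max(candidates, key=lambda c: c[1])

if confidence >= CONFIDENCE_THRESHOLD:
    print(reply, confidence)  # On Street B 0.85 is returned as the reply information
else:
    print("mismatch: fall back to the standby engines")
```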
If it is determined that the voice data does not match the first speech recognition engine, the cloud server 10 may select, from the at least one standby speech recognition engine, a target speech recognition engine that matches the voice data.

In some optional embodiments, when selecting the target speech recognition engine from the at least one standby speech recognition engine, the cloud server 10 may select any one of the standby speech recognition engines as the second speech recognition engine. For example, if the at least one standby speech recognition engine includes a speech recognition engine for Chinese and a speech recognition engine for French, the cloud server may select the speech recognition engine for French as the second speech recognition engine.

After selecting the second speech recognition engine, the cloud server 10 may perform speech recognition on the voice data through this engine to obtain a second speech recognition result, i.e., the speech recognition result obtained by the second speech recognition engine. Here, the qualifier "second" is used only to distinguish between the speech recognition results obtained from multiple recognition passes.

After the speech recognition, the cloud server 10 may judge, according to the second speech recognition result, whether the voice data matches the second speech recognition engine, as described in detail below.

The cloud server 10 may acquire the text information in the second speech recognition result and calculate its recognition accuracy, again by a preset speech recognition model or a preset algorithm, for example by computing evaluation indicators such as the sentence error rate, sentence correct rate, or character error rate and combining them with their respective weights. If this recognition accuracy is greater than or equal to the set accuracy threshold, it is determined that the voice data matches the second speech recognition engine, and the cloud server 10 may use the second speech recognition engine as the target speech recognition engine. If the recognition accuracy is less than the set accuracy threshold, it is determined that the voice data does not match the second speech recognition engine. The threshold may be set to 90%, 85%, 80%, or the like, which is not limited in this embodiment.

The voice interaction system is further described below with reference to Fig. 2, Fig. 3, and practical application scenarios.

As shown in Fig. 2 and Fig. 3, the terminal device may collect a facial image of the user and perform image recognition to obtain the user's facial feature data. The terminal device may then identify the target language type group from the facial feature data and configure the standby speech recognition engines according to that group. On this basis, the terminal device may collect the user's initial voice data through a microphone and send it to a voice endpoint detection module, which extracts the valid voice data from the initial voice data. The terminal device may then perform speech recognition on that voice data through the first speech recognition engine (i.e., the main engine) corresponding to the first language type in the main module to obtain the corresponding text information, and perform question-answer matching on the text information through the question-answer matching model corresponding to the first language type to obtain the corresponding reply information. If the confidence of the reply information is greater than or equal to the confidence threshold, the reply information is converted into speech by the text-to-speech module corresponding to the first language type and output as speech. If the confidence of the reply information is less than the confidence threshold, a target speech recognition engine that matches the voice data is selected from the at least one standby speech recognition engine, and speech recognition is performed on the voice data again.

Taking the case where the target speech recognition engine is the standby engine for Korean as an example, the terminal device performs speech recognition on the voice data through the standby speech recognition engine for Korean to obtain the corresponding text information. The terminal device may then perform question-answer matching on that text information through the standby question-answer matching model for Korean in the main module to obtain the corresponding reply information. If the confidence of the reply information is greater than or equal to the confidence threshold, the reply information is converted into speech by the text-to-speech module for Korean and output as speech. If the confidence of the reply information is less than the confidence threshold, another standby speech recognition engine is selected and the voice data is recognized again.
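Putting the pieces together, the following sketch mirrors the flow of Figs. 2 and 3 with stubbed per-language components; `recognize`, `qa_match`, and `tts` are placeholder functions standing in for the per-language modules, and the confidence values are contrived so that the flow falls back from the main engine to the Korean standby engine.

```python
CONFIDENCE_THRESHOLD = 0.75  # illustrative value

# Stub per-language components; a real system would wrap actual ASR/QA/TTS services.
def recognize(lang, voice):
    return f"[{lang} transcript of {len(voice)} bytes]"

def qa_match(lang, text):
    return (f"[{lang} reply]", 0.8 if lang == "ko" else 0.4)

def tts(lang, reply):
    return f"[{lang} audio for {reply}]"

def answer(voice, engine_order):
    """Walk the main engine and the ordered standby engines until a confident reply."""
    for lang in engine_order:               # e.g. ["zh", "ko", "fr"], main engine first
        text = recognize(lang, voice)       # speech recognition for this language
        reply, conf = qa_match(lang, text)  # question-answer matching for this language
        if conf >= CONFIDENCE_THRESHOLD:
            return tts(lang, reply)         # reply synthesized in the same language
    return None                             # every engine fell below the threshold

print(answer(b"\x00" * 16, ["zh", "ko", "fr"]))  # falls back from zh to the ko engine
```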
An embodiment of the present application further provides a voice interaction method, which is described in detail below with reference to Fig. 4.

Step 401: Acquire voice data uttered by a user toward a device and facial feature data of the user.

Step 402: Sort at least one standby speech recognition engine according to the facial feature data, where the at least one standby speech recognition engine corresponds respectively to at least one language type.

Step 403: Judge whether a first speech recognition engine corresponding to a preset first language type matches the voice data.

Step 404: If not, select, from the at least one standby speech recognition engine and in the order of the at least one standby speech recognition engine, a target speech recognition engine that matches the voice data.

Step 405: Generate reply information for the voice data according to a second speech recognition result of the voice data produced by the target speech recognition engine.
Further optionally, sorting the at least one standby speech recognition engine according to the facial feature data includes: identifying, according to the facial feature data, the target language type group to which the user belongs; and sorting the at least one standby speech recognition engine according to the target language type group to which the user belongs and the correspondence between language type groups and standby speech recognition engines. The facial feature data includes at least one of: skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.

Further optionally, before judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the method further includes: acquiring the geographic location where the device is currently located; determining the first language type according to the language distribution characteristics of the geographic location; and using the recognition engine corresponding to the first language type as the first speech recognition engine.

Further optionally, judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data includes: performing speech recognition on the voice data through the first speech recognition engine corresponding to the preset first language type to obtain a first speech recognition result; acquiring the text information in the first speech recognition result; calculating the recognition accuracy of the text information; and if the recognition accuracy is less than the set accuracy threshold, determining that the voice data does not match the first speech recognition engine.

Further optionally, the method further includes: if the recognition accuracy is greater than or equal to the set accuracy threshold, performing question-answer matching on the text information using a question-answer matching model to obtain reply information and its confidence; and if the confidence of the reply information is less than the preset confidence threshold, determining that the voice data does not match the first speech recognition engine.

Further optionally, selecting, from the at least one standby speech recognition engine, the target speech recognition engine that matches the voice data includes: selecting from the at least one standby speech recognition engine in turn, in the order of the at least one standby speech recognition engine, to obtain a second speech recognition engine; judging, according to the second speech recognition result, whether the voice data matches the second speech recognition engine; and if the voice data matches the second speech recognition engine, using the second speech recognition engine as the target speech recognition engine.

In this embodiment, the terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby speech recognition engines according to the facial feature data. When the speech recognition engine corresponding to the first language type does not match the voice data, the cloud server can select a target speech recognition engine from the standby speech recognition engines and generate reply information for the voice data according to the second speech recognition result of the voice data produced by the target speech recognition engine. In this implementation, even when users speak different language types, the terminal device can perform speech recognition on the voice data input by a user relatively accurately and can therefore provide the user with a reply that matches the voice information.

It should be noted that the steps of the method provided in the above embodiments may all be executed by the same device, or the method may be executed by different devices. For example, steps 401 to 405 may all be executed by device A; alternatively, steps 401 to 403 may be executed by device A and steps 404 and 405 by device B; and so on.

In addition, some of the flows described in the above embodiments and drawings include multiple operations appearing in a specific order, but it should be clearly understood that these operations may be executed out of the order in which they appear herein or in parallel. Sequence numbers such as 401 and 402 are only used to distinguish different operations and do not by themselves represent any execution order. Moreover, these flows may include more or fewer operations, and these operations may be executed sequentially or in parallel.

It should also be noted that the descriptions "first", "second", and the like herein are used to distinguish different messages, devices, modules, and so on; they do not represent an order of precedence, nor do they require that the "first" and the "second" be of different types.
An embodiment of the present application provides a voice interaction apparatus. As shown in Fig. 5, the voice interaction apparatus includes: an acquisition module 501, a sorting module 502, a judging module 503, a selection module 504, and a generating module 505.

The acquisition module 501 is configured to acquire voice data uttered by a user toward a device and facial feature data of the user. The sorting module 502 is configured to sort at least one standby speech recognition engine according to the facial feature data, where the at least one standby speech recognition engine corresponds respectively to at least one language type. The judging module 503 is configured to judge whether a first speech recognition engine corresponding to a preset first language type matches the voice data. The selection module 504 is configured to, if not, select, from the at least one standby speech recognition engine and in the order of the at least one standby speech recognition engine, a target speech recognition engine that matches the voice data. The generating module 505 is configured to generate reply information for the voice data according to a second speech recognition result of the voice data produced by the target speech recognition engine.

Further optionally, when sorting the at least one standby speech recognition engine according to the facial feature data, the sorting module 502 is specifically configured to: identify, according to the facial feature data, the target language type group to which the user belongs; and sort the at least one standby speech recognition engine according to the target language type group to which the user belongs and the correspondence between language type groups and standby speech recognition engines. The facial feature data includes at least one of: skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.

Further optionally, before judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the sorting module 502 is further configured to: acquire the geographic location where the device is currently located; determine the first language type according to the language distribution characteristics of the geographic location; and use the recognition engine corresponding to the first language type as the first speech recognition engine.

Further optionally, when judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the judging module 503 is specifically configured to: perform speech recognition on the voice data through the first speech recognition engine corresponding to the preset first language type to obtain a first speech recognition result; acquire the text information in the first speech recognition result; calculate the recognition accuracy of the text information; and if the recognition accuracy is less than the set accuracy threshold, determine that the voice data does not match the first speech recognition engine.

Further optionally, the judging module 503 is further configured to: if the recognition accuracy is greater than or equal to the set accuracy threshold, perform question-answer matching on the text information using a question-answer matching model to obtain reply information and its confidence; and if the confidence of the reply information is less than the preset confidence threshold, determine that the voice data does not match the first speech recognition engine.

Further optionally, when selecting, from the at least one standby speech recognition engine, the target speech recognition engine that matches the voice data, the selection module 504 is specifically configured to: select from the at least one standby speech recognition engine in turn, in the order of the at least one standby speech recognition engine, to obtain the second speech recognition engine; judge, according to the second speech recognition result, whether the voice data matches the second speech recognition engine; and if the voice data matches the second speech recognition engine, use the second speech recognition engine as the target speech recognition engine.

In this embodiment, the terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby speech recognition engines according to the facial feature data. When the speech recognition engine corresponding to the first language type does not match the voice data, the cloud server can select a target speech recognition engine from the standby speech recognition engines and generate reply information for the voice data according to the second speech recognition result of the voice data produced by the target speech recognition engine. In this implementation, even when users speak different language types, the terminal device can perform speech recognition on the voice data input by a user relatively accurately and can therefore provide the user with a reply that matches the voice information.
Fig. 6 is a schematic structural diagram of a cloud server provided by an exemplary embodiment of the present application; the server is applicable to the voice interaction system provided in the foregoing embodiments. As shown in Fig. 6, the server includes: a memory 601, a processor 602, and a communication component 603.

The memory 601 is configured to store a computer program and may be configured to store various other data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phonebook data, messages, pictures, videos, and so on.

The processor 602, coupled to the memory 601, is configured to execute the computer program in the memory 601 so as to: acquire voice data uttered by a user toward a device and facial feature data of the user; sort at least one standby speech recognition engine according to the facial feature data, where the at least one standby speech recognition engine corresponds respectively to at least one language type; judge whether a first speech recognition engine corresponding to a preset first language type matches the voice data; if not, select, from the at least one standby speech recognition engine and in the order of the at least one standby speech recognition engine, a target speech recognition engine that matches the voice data; and generate reply information for the voice data according to a second speech recognition result of the voice data produced by the target speech recognition engine.

Further optionally, when sorting the at least one standby speech recognition engine according to the facial feature data, the processor 602 is specifically configured to: identify, according to the facial feature data, the target language type group to which the user belongs; and sort the at least one standby speech recognition engine according to the target language type group to which the user belongs and the correspondence between language type groups and standby speech recognition engines. The facial feature data includes at least one of: skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.

Further optionally, before judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the processor 602 is further configured to: acquire the geographic location where the device is currently located; determine the first language type according to the language distribution characteristics of the geographic location; and use the recognition engine corresponding to the first language type as the first speech recognition engine.

Further optionally, when judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the processor 602 is specifically configured to: perform speech recognition on the voice data through the first speech recognition engine corresponding to the preset first language type to obtain a first speech recognition result; acquire the text information in the first speech recognition result; calculate the recognition accuracy of the text information; and if the recognition accuracy is less than the set accuracy threshold, determine that the voice data does not match the first speech recognition engine.

Further optionally, the processor 602 is further configured to: if the recognition accuracy is greater than or equal to the set accuracy threshold, perform question-answer matching on the text information using a question-answer matching model to obtain reply information and its confidence; and if the confidence of the reply information is less than the preset confidence threshold, determine that the voice data does not match the first speech recognition engine.

Further optionally, when selecting, from the at least one standby speech recognition engine, the target speech recognition engine that matches the voice data, the processor 602 is specifically configured to: select from the at least one standby speech recognition engine in turn, in the order of the at least one standby speech recognition engine, to obtain the second speech recognition engine; judge, according to the second speech recognition result, whether the voice data matches the second speech recognition engine; and if the voice data matches the second speech recognition engine, use the second speech recognition engine as the target speech recognition engine.

Further, as shown in Fig. 6, the cloud server also includes other components such as a power supply component 604. Fig. 6 shows only some components schematically, which does not mean that the cloud server includes only the components shown in Fig. 6.

In this embodiment, the terminal device can acquire the user's voice data and facial feature data, and the cloud server can sort the standby speech recognition engines according to the facial feature data. When the speech recognition engine corresponding to the first language type does not match the voice data, the cloud server can select a target speech recognition engine from the standby speech recognition engines and generate reply information for the voice data according to the second speech recognition result of the voice data produced by the target speech recognition engine. In this implementation, even when users speak different language types, the terminal device can perform speech recognition on the voice data input by a user relatively accurately and can therefore provide the user with a reply that matches the voice information.
Correspondingly, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed, can implement the steps executable by the cloud server in the above method embodiments.

Correspondingly, an embodiment of the present application further provides a computer program product, including a computer program/instructions which, when executed by a processor, implement the steps executable by the cloud server in the above method embodiments.

The memory 601 in Fig. 6 above may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disc.

The communication component 603 in Fig. 6 above is configured to facilitate wired or wireless communication between the device where the communication component is located and other devices. The device where the communication component is located may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component may be implemented based on near field communication (NFC) technology, radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

The power supply component 604 in Fig. 6 above provides power for the various components of the device where the power supply component is located. The power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device where the power supply component is located.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and a memory.

The memory may include non-permanent memory, random access memory (RAM), and/or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device including a series of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, commodity, or device that includes the element.
以上所述仅为本申请的实施例而已,并不用于限制本申请。对于本领域技术人员来说,本申请可以有各种更改和变化。凡在本申请的精神和原理之内所作的任何修改、等同替换、改进等,均应包含在本申请的权利要求范围之内。 The above descriptions are only examples of the present application, and are not intended to limit the present application. For those skilled in the art, various modifications and changes may occur in this application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application shall be included within the scope of the claims of the present application.

Claims (11)

  1. An image-based voice interaction method, characterized by comprising:
    acquiring voice data uttered by a user toward a device and facial feature data of the user;
    sorting at least one backup speech recognition engine according to the facial feature data, the at least one backup speech recognition engine corresponding respectively to at least one language type;
    judging whether a first speech recognition engine corresponding to a preset first language type matches the voice data;
    if not, selecting, according to the sorting of the at least one backup speech recognition engine, a target speech recognition engine matching the voice data from the at least one backup speech recognition engine; and
    generating reply information for the voice data according to a second speech recognition result obtained by the target speech recognition engine for the voice data.
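Purely as an illustrative aid (not part of the claims), the control flow recited in claim 1 can be sketched in Python as below; every function, engine object, and parameter name is a hypothetical stand-in, not the patent's actual implementation:

```python
# Minimal sketch of the claim-1 flow, assuming injected engine objects with a
# recognize() method; nothing here is the patented implementation itself.

def handle_interaction(voice_data, facial_features, first_engine,
                       backup_engines, rank_engines, matches, generate_reply):
    # Sort the backup engines using the user's facial feature data.
    ranked = rank_engines(backup_engines, facial_features)

    # Try the engine preset for the first language type.
    if matches(first_engine, voice_data):
        return generate_reply(first_engine.recognize(voice_data))

    # On a mismatch, walk the ranked backups and use the first match.
    for engine in ranked:
        if matches(engine, voice_data):
            # This engine's output is the "second speech recognition result"
            # from which the reply information is generated.
            return generate_reply(engine.recognize(voice_data))
    return None  # no engine matched the voice data
```

The point of the ordering step is that the preset engine is tried exactly once, and the fallback search then proceeds in an order biased by the facial feature data rather than arbitrarily.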
  2. The method according to claim 1, characterized in that sorting at least one backup speech recognition engine according to the facial feature data comprises:
    identifying, according to the facial feature data, a target language type group to which the user belongs; and
    sorting the at least one backup speech recognition engine according to the target language type group to which the user belongs and a correspondence between language type groups and backup speech recognition engines, wherein the facial feature data comprise at least one of skin feature data, hair feature data, eye feature data, nose bridge feature data, and lip feature data.
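One possible shape for the rank_engines step used in the sketch above, again with all names (the group classifier and the group-to-engine table) assumed for illustration only:

```python
# Hypothetical ranking by language type group; classify_group and
# group_to_engines are illustrative stand-ins, not the claimed models.

def make_ranker(classify_group, group_to_engines):
    """Return a rank_engines callable compatible with the earlier sketch."""
    def rank_engines(backup_engines, facial_features):
        # Predict the user's language type group from facial feature data
        # (e.g. skin, hair, eye, nose-bridge, and lip features).
        group = classify_group(facial_features)

        # Engines mapped to the predicted group sort ahead of the rest.
        preferred = set(group_to_engines.get(group, ()))
        return sorted(backup_engines,
                      key=lambda e: 0 if e.language_type in preferred else 1)
    return rank_engines
```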
  3. The method according to claim 1, characterized in that, before judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data, the method further comprises:
    acquiring a current geographic location of the device; and
    determining the first language type according to language distribution characteristics of the geographic location, and taking a recognition engine corresponding to the first language type as the first speech recognition engine.
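As an illustrative reading of this step, the preset first language could be the most prevalent language at the device's location; the region codes and distribution figures below are invented for the sketch:

```python
# Sketch only: a made-up language-distribution table keyed by region code.
LANGUAGE_DISTRIBUTION = {
    "region-A": {"zh-cmn": 0.85, "zh-yue": 0.10, "en": 0.05},
    "region-B": {"en": 0.60, "es": 0.35, "zh-cmn": 0.05},
}

def first_language_for(region, default="zh-cmn"):
    dist = LANGUAGE_DISTRIBUTION.get(region)
    if not dist:
        return default
    # The most widely spoken language at the location becomes the first
    # language type; its engine becomes the first speech recognition engine.
    return max(dist, key=dist.get)
```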
  4. The method according to claim 1, characterized in that judging whether the first speech recognition engine corresponding to the preset first language type matches the voice data comprises:
    performing speech recognition on the voice data by means of the first speech recognition engine corresponding to the preset first language type, to obtain a first speech recognition result;
    acquiring text information in the first speech recognition result;
    calculating a recognition accuracy of the text information; and
    if the recognition accuracy is less than a set accuracy threshold, determining that the voice data does not match the first speech recognition engine.
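An illustrative reading of this match test follows; the accuracy estimator and threshold value are assumptions, since the claim does not fix how the accuracy is computed:

```python
# Sketch only: score_text is a hypothetical accuracy estimator, e.g. a
# language-model fluency score over the transcript; 0.6 is an arbitrary value.

def matches_first_engine(first_engine, voice_data, score_text, threshold=0.6):
    result = first_engine.recognize(voice_data)  # first speech recognition result
    text = result.text                           # text information in the result
    accuracy = score_text(text)                  # recognition accuracy of the text
    return accuracy >= threshold                 # below the threshold => mismatch
```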
  5. The method according to claim 4, characterized by further comprising:
    if the recognition accuracy is greater than or equal to the set accuracy threshold, performing question-answer matching on the text information by means of a question-answer matching model, to obtain reply information and a confidence level of the reply information; and
    if the confidence level of the reply information is less than a preset confidence threshold, determining that the voice data does not match the first speech recognition engine.
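In other words, even a transcript that passes the accuracy gate can still reveal an engine mismatch if the answer it produces is low-confidence. A minimal sketch of that second gate, with qa_model and the threshold assumed for illustration:

```python
# Sketch only: qa_model.match is a hypothetical question-answer matcher that
# returns (reply_info, confidence); 0.5 is an arbitrary threshold.

def check_reply(text, qa_model, confidence_threshold=0.5):
    reply, confidence = qa_model.match(text)  # reply info and its confidence
    if confidence < confidence_threshold:
        return None, False                    # treat as an engine mismatch
    return reply, True                        # reply information is usable
```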
  6. The method according to claim 1, characterized in that selecting a target speech recognition engine matching the voice data from the at least one backup speech recognition engine comprises:
    selecting from the at least one backup speech recognition engine in turn, according to the sorting of the at least one backup speech recognition engine, to obtain a second speech recognition engine;
    judging, according to the second speech recognition result, whether the voice data matches the second speech recognition engine; and
    if the voice data matches the second speech recognition engine, taking the second speech recognition engine as the target speech recognition engine.
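An illustrative sketch of this selection loop; the match predicate over the recognition result is an assumption carried over from the earlier sketches:

```python
# Sketch only: walk the ranked backups and keep the first engine whose
# recognition result passes the match test.

def select_target_engine(ranked_backups, voice_data, matches):
    for candidate in ranked_backups:              # second speech recognition engine
        result = candidate.recognize(voice_data)  # second speech recognition result
        if matches(result):                       # voice data matches this engine
            return candidate, result              # candidate becomes the target
    return None, None                             # no backup engine matched
```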
  7. A voice interaction system, characterized by comprising a terminal device and a cloud server;
    wherein the terminal device is mainly configured to: acquire voice data uttered by a user toward the device and facial feature data of the user; and send the voice data and the facial feature data to the cloud server; and
    the cloud server is mainly configured to: receive the voice data and the facial feature data; sort at least one backup speech recognition engine according to the facial feature data, the at least one backup speech recognition engine corresponding respectively to at least one language type; judge whether a first speech recognition engine corresponding to a preset first language type matches the voice data; if not, select, according to the sorting of the at least one backup speech recognition engine, a target speech recognition engine matching the voice data from the at least one backup speech recognition engine; and generate reply information for the voice data according to a second speech recognition result obtained by the target speech recognition engine for the voice data.
  8. A voice interaction apparatus, characterized by comprising:
    an acquisition module, configured to acquire voice data uttered by a user toward a device and facial feature data of the user;
    a sorting module, configured to sort at least one backup speech recognition engine according to the facial feature data, the at least one backup speech recognition engine corresponding respectively to at least one language type;
    a judging module, configured to judge whether a first speech recognition engine corresponding to a preset first language type matches the voice data;
    a selection module, configured to: if not, select, according to the sorting of the at least one backup speech recognition engine, a target speech recognition engine matching the voice data from the at least one backup speech recognition engine; and
    a generating module, configured to generate reply information for the voice data according to a second speech recognition result obtained by the target speech recognition engine for the voice data.
  9. A cloud server, characterized by comprising a memory, a processor, and a communication component;
    wherein the memory is configured to store one or more computer instructions; and
    the processor is configured to execute the one or more computer instructions so as to perform the steps in the method according to any one of claims 1-6.
  10. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it causes the processor to implement the steps in the method according to any one of claims 1-6.
  11. A computer program product, comprising a computer program/instructions, characterized in that, when executed by a processor, the computer program/instructions implement the steps in the method according to any one of claims 1-6.
PCT/CN2023/073326 2022-01-28 2023-01-20 Speech interaction method, system and apparatus, and device and storage medium WO2023143439A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210108135.4A CN114464179B (en) 2022-01-28 2022-01-28 Voice interaction method, system, device, equipment and storage medium
CN202210108135.4 2022-01-28

Publications (1)

Publication Number Publication Date
WO2023143439A1 (en)

Family

ID=81412433

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/073326 WO2023143439A1 (en) 2022-01-28 2023-01-20 Speech interaction method, system and apparatus, and device and storage medium

Country Status (2)

Country Link
CN (1) CN114464179B (en)
WO (1) WO2023143439A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114464179B (en) * 2022-01-28 2024-03-19 达闼机器人股份有限公司 Voice interaction method, system, device, equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5991719A * 1998-04-27 1999-11-23 Fujitsu Limited Semantic recognition system
CN107545887A (en) * 2016-06-24 2018-01-05 中兴通讯股份有限公司 Phonetic order processing method and processing device
CN109545197A (en) * 2019-01-02 2019-03-29 珠海格力电器股份有限公司 Recognition methods, device and the intelligent terminal of phonetic order
CN109949795A (en) * 2019-03-18 2019-06-28 北京猎户星空科技有限公司 A kind of method and device of control smart machine interaction
CN110491383A (en) * 2019-09-25 2019-11-22 北京声智科技有限公司 A kind of voice interactive method, device, system, storage medium and processor
CN111128194A (en) * 2019-12-31 2020-05-08 云知声智能科技股份有限公司 System and method for improving online voice recognition effect
CN112732887A (en) * 2021-01-22 2021-04-30 南京英诺森软件科技有限公司 Processing device and system for multi-turn conversation
CN113506565A (en) * 2021-07-12 2021-10-15 北京捷通华声科技股份有限公司 Speech recognition method, speech recognition device, computer-readable storage medium and processor
CN114464179A (en) * 2022-01-28 2022-05-10 达闼机器人股份有限公司 Voice interaction method, system, device, equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8909532B2 (en) * 2007-03-23 2014-12-09 Nuance Communications, Inc. Supporting multi-lingual user interaction with a multimodal application
WO2017112813A1 (en) * 2015-12-22 2017-06-29 Sri International Multi-lingual virtual personal assistant
CN106710586B (en) * 2016-12-27 2020-06-30 北京儒博科技有限公司 Automatic switching method and device for voice recognition engine
CN107391122B (en) * 2017-07-01 2020-03-27 珠海格力电器股份有限公司 Method and device for setting system language of terminal and terminal
CN108766414B (en) * 2018-06-29 2021-01-15 北京百度网讯科技有限公司 Method, apparatus, device and computer-readable storage medium for speech translation
CN111508472B (en) * 2019-01-11 2023-03-03 华为技术有限公司 Language switching method, device and storage medium
CN112116909A (en) * 2019-06-20 2020-12-22 杭州海康威视数字技术股份有限公司 Voice recognition method, device and system
CN111627432B (en) * 2020-04-21 2023-10-20 升智信息科技(南京)有限公司 Active outbound intelligent voice robot multilingual interaction method and device

Also Published As

Publication number Publication date
CN114464179A (en) 2022-05-10
CN114464179B (en) 2024-03-19

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23746318

Country of ref document: EP

Kind code of ref document: A1