CN114495931A - Voice interaction method, system, device, equipment and storage medium


Info

Publication number: CN114495931A
Application number: CN202210106996.9A
Authority: CN (China)
Prior art keywords: voice; recognition engine; speech recognition; voice recognition; data
Legal status: Pending (the status listed is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventor: 王军锋 (Wang Junfeng)
Current and original assignee: Cloudminds Shanghai Robotics Co Ltd (the listed assignee may be inaccurate)
Application filed by Cloudminds Shanghai Robotics Co Ltd
Priority to CN202210106996.9A
Publication of CN114495931A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/01 - Assessment or evaluation of speech recognition systems
    • G10L 15/10 - Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 15/183 - Speech classification or search using natural language modelling, using context dependencies, e.g. language models
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 - Speech to text systems
    • G10L 15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 15/34 - Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
    • G10L 2015/223 - Execution procedure of a spoken command
    • G10L 2015/225 - Feedback of the input speech


Abstract

Embodiments of the present application provide a voice interaction method, system, device, equipment and storage medium. In the voice interaction system, a terminal device acquires the user's voice data and sends it to a cloud server. The cloud server performs speech recognition on the voice data through the speech recognition engine corresponding to a preset first language type, and judges from the recognition result whether the voice data matches that engine. If not, it selects a speech recognition engine matching the voice data from the standby speech recognition engines, performs speech recognition again, and generates reply information from the new recognition result. In this way, even when users speak different languages, the voice data they input can be recognized accurately, and each user can be provided with a reply that matches the voice information.

Description

Voice interaction method, system, device, equipment and storage medium
Technical Field
Embodiments of the present application relate to the technical field of intelligent robots, and in particular to a voice interaction method, system, device, equipment and storage medium.
Background
With the continuous development of artificial intelligence technology, intelligent dialogue has become increasingly popular, and intelligent devices capable of such dialogue (e.g., robots) are widely used in shopping malls, supermarkets, restaurants and the like. In the prior art, on the premise that the user's language type is known, an ASR (Automatic Speech Recognition) engine corresponding to that language type is generally set manually in advance; it recognizes the voice information input by the user and converts it into text information, and a reply is generated according to the recognition result of that text. However, in many usage scenarios the user's language type is unknown and cannot be learned in advance, so the ASR engine cannot be set before interaction with the user begins. As a result, the voice information input by the user cannot be recognized accurately, and the user cannot be provided with a reply that matches the voice information. A solution is therefore urgently needed.
Disclosure of Invention
Embodiments of the present application provide a voice interaction method, system, device, equipment and storage medium, which are used to perform speech recognition on voice data input by a user accurately and, in turn, to provide the user with a reply that matches the voice information.
An embodiment of the present application provides a voice interaction method, comprising the following steps: acquiring voice data uttered by a user toward a device; performing speech recognition on the voice data through a first speech recognition engine corresponding to a preset first language type to obtain a first speech recognition result; determining, according to the first speech recognition result, whether the voice data matches the first speech recognition engine; if not, selecting a target speech recognition engine matching the voice data from at least one standby speech recognition engine, the at least one standby speech recognition engine corresponding to at least one language type respectively; and generating reply information for the voice data according to a second speech recognition result obtained by the target speech recognition engine on the voice data.
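By way of illustration only, the claimed flow can be sketched in Python as follows; `Engine`, `interact`, `is_match` and `reply` are hypothetical stand-ins, since the embodiments prescribe no concrete API:

```python
# Minimal sketch of the claimed flow. All names are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class Engine:
    language: str                               # language type this engine serves
    recognize: Callable[[bytes], str]           # voice data -> recognized text

def interact(voice_data: bytes,
             first_engine: Engine,
             standby_engines: Sequence[Engine],
             is_match: Callable[[str], bool],
             reply: Callable[[str], str]) -> Optional[str]:
    text = first_engine.recognize(voice_data)   # first recognition pass
    if is_match(text):                          # accuracy / confidence checks
        return reply(text)
    for engine in standby_engines:              # one standby engine per language type
        text = engine.recognize(voice_data)     # second recognition pass
        if is_match(text):                      # target engine found
            return reply(text)
    return None                                 # no engine matched the voice data
```

The `is_match` callable stands for the accuracy and confidence checks described in the optional refinements below.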
Further optionally, determining whether the voice data matches the first speech recognition engine according to the first speech recognition result includes: acquiring the text information in the first speech recognition result; calculating the recognition accuracy of the text information; and, if the recognition accuracy is smaller than a set accuracy threshold, determining that the voice data does not match the first speech recognition engine.
Further optionally, the method further comprises: if the recognition accuracy is greater than or equal to the set accuracy threshold, performing question-answer matching on the text information with a question-answer matching model to obtain reply information and a confidence for the reply information; and, if the confidence of the reply information is smaller than a preset confidence threshold, determining that the voice data does not match the first speech recognition engine.
Further optionally, selecting a target speech recognition engine matching the voice data from at least one standby speech recognition engine comprises: selecting any speech recognition engine from the at least one standby speech recognition engine as a second speech recognition engine; performing speech recognition on the voice data through the second speech recognition engine to obtain a second speech recognition result; determining whether the voice data matches the second speech recognition engine according to the second speech recognition result; and, if the voice data matches the second speech recognition engine, taking the second speech recognition engine as the target speech recognition engine.
Further optionally, the at least one standby speech recognition engine is sorted according to a set priority order, and selecting any speech recognition engine from the at least one standby speech recognition engine as the second speech recognition engine comprises: selecting from the at least one standby speech recognition engine sequentially, in that sorted order, to obtain the second speech recognition engine.
Further optionally, before selecting a target speech recognition engine matching the voice data from at least one standby speech recognition engine, the method further includes: sorting the at least one standby speech recognition engine according to the region where the device is currently located; or sorting the at least one standby speech recognition engine according to the historical frequency of use of the at least one speech recognition engine on the device.
An embodiment of the present application further provides a voice interaction system, including a terminal device and a cloud server. The terminal device is mainly configured to: acquire voice data uttered by a user toward the terminal device, and send the voice data to the cloud server. The cloud server is mainly configured to: receive the voice data; perform speech recognition on it through a first speech recognition engine corresponding to a preset first language type to obtain a first speech recognition result; determine, according to the first speech recognition result, whether the voice data matches the first speech recognition engine; if not, select a target speech recognition engine matching the voice data from at least one standby speech recognition engine, the at least one standby speech recognition engine corresponding to at least one language type respectively; and generate reply information for the voice data according to a second speech recognition result obtained by the target speech recognition engine on the voice data.
An embodiment of the present application further provides a voice interaction apparatus, including: an acquisition module, configured to acquire voice data uttered by a user toward a device; a recognition module, configured to perform speech recognition on the voice data through a first speech recognition engine corresponding to a preset first language type to obtain a first speech recognition result; a judgment module, configured to determine, according to the first speech recognition result, whether the voice data matches the first speech recognition engine; a selection module, configured to select, if not, a target speech recognition engine matching the voice data from at least one standby speech recognition engine, the at least one standby speech recognition engine corresponding to at least one language type respectively; and a generation module, configured to generate reply information for the voice data according to a second speech recognition result obtained by the target speech recognition engine on the voice data.
An embodiment of the present application further provides a cloud server, including a memory, a processor and a communication component. The memory is configured to store one or more computer instructions; the processor is configured to execute the one or more computer instructions so as to perform the steps of the voice interaction method.
Embodiments of the present application also provide a computer readable storage medium storing a computer program, which, when executed by a processor, causes the processor to implement the steps in the voice interaction method.
In the voice interaction method, system, device, equipment and storage medium provided by the embodiments of the present application, the terminal device can acquire voice data uttered by a user toward the terminal device and send the voice data to the cloud server. After receiving the voice data, the cloud server can perform speech recognition on it through a first speech recognition engine corresponding to a first language type to obtain a first speech recognition result, and judge from that result whether the voice data matches the first speech recognition engine. If not, it selects a target speech recognition engine matching the voice data from the standby speech recognition engines and generates reply information for the voice data according to the second speech recognition result obtained by the target engine. In this way, even when users speak different languages, the terminal device can recognize the voice data input by each user accurately and, further, provide the user with a reply that matches the voice information.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic structural diagram of a voice interaction system according to an exemplary embodiment of the present application;
FIG. 2 is a schematic structural diagram of a voice interaction system in an actual scenario according to an exemplary embodiment of the present application;
FIG. 3 is a schematic structural diagram of a voice interaction system in an actual scenario according to another exemplary embodiment of the present application;
FIG. 4 is a flowchart illustrating a voice interaction method according to an exemplary embodiment of the present application;
FIG. 5 is a schematic structural diagram of a voice interaction apparatus according to an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of a cloud server according to an exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the prior art, when the types of languages used by the users are different, the robot cannot accurately perform voice recognition on the voice information input by the users, so that responses which are more matched with the voice information cannot be provided for the users. In view of this technical problem, in some embodiments of the present application, a solution is provided. Technical solutions provided by the embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic structural diagram of a voice interaction system according to an exemplary embodiment of the present application, and as shown in fig. 1, the voice interaction system 100 includes: cloud server 10 and terminal device 20.
The cloud server 10 may be implemented as a cloud host, a cloud virtualization center, an elastic compute instance in the cloud, and the like, which is not limited in this embodiment. The cloud server 10 mainly includes a processor, a hard disk, a memory, a system bus, and so on, similar to a general computer architecture, and is not described in detail here.
The terminal device 20 may be implemented as various terminal devices in different scenarios: for example, as a robot providing services in hotels, restaurants and similar settings; as an on-board vehicle terminal in intelligent driving-assistance or autonomous-driving scenarios; as a multifunctional financial terminal in a bank; as a registration and payment terminal in a hospital; as a ticketing terminal in a movie theater; and so on.
In the voice interaction system 100, a wireless communication connection may be established between the cloud server 10 and the terminal device 20; the specific connection manner may be determined by the application scenario. In some embodiments, the wireless communication connection may be implemented over a Virtual Private Network (VPN) to ensure communication security.
In the voice interaction system 100, the terminal device 20 is mainly configured to acquire voice data uttered by the user toward the terminal device 20 and send the voice data to the cloud server 10.
Accordingly, the cloud server 10 is mainly configured to receive the voice data and perform speech recognition on it through a first speech recognition engine corresponding to a preset first language type, obtaining a first speech recognition result. The qualifier "first" places no limitation on the speech recognition engine or the speech recognition result; it is used only to distinguish them from other engines and results. The speech recognition result may include text information corresponding to the voice data. For example, when the cloud server 10 performs speech recognition on the user's voice data, it may obtain a first speech recognition result containing the text "where is the movie theater closest to me".
After performing this speech recognition, the cloud server 10 may judge whether the voice data matches the first speech recognition engine according to the first speech recognition result. If not, it selects a target speech recognition engine matching the voice data from at least one standby speech recognition engine.
The target speech recognition engine is the speech recognition engine that matches the voice data, and the at least one standby speech recognition engine corresponds to at least one language type respectively. For example, the at least one standby speech recognition engine may include a speech recognition engine for Chinese, a speech recognition engine for Arabic and a speech recognition engine for Hindi. When the user speaks Arabic, once it is judged from the first speech recognition result that the voice data does not match the first speech recognition engine, the engine corresponding to Arabic, which matches the voice data, can be selected from the standby speech recognition engines.
Based on the above steps, the cloud server 10 may generate reply information for the voice data according to the second speech recognition result obtained by the target speech recognition engine. The reply information may be text information or audio information used to answer the user. For example, the user may ask the terminal device 20 what time dinner is served, and the cloud server 10 may generate the reply "six p.m.". Further optionally, the cloud server 10 may send the generated reply information to the terminal device 20 in text or audio form, so that the terminal device 20 outputs it to the user through an audio component or a display component.
In this embodiment, the terminal device 20 may acquire voice data uttered by a user toward it and send the voice data to the cloud server 10. After receiving the voice data, the cloud server 10 may perform speech recognition on it through the first speech recognition engine corresponding to the first language type to obtain a first speech recognition result, and judge from that result whether the voice data matches the first speech recognition engine. If not, it selects a target speech recognition engine matching the voice data from the standby speech recognition engines and generates reply information according to the second speech recognition result obtained by the target engine. In this way, even when users speak different languages, the terminal device 20 can recognize the voice data input by each user accurately and provide replies that match the voice information.
In some optional embodiments, "determining whether the voice data matches the first speech recognition engine according to the first speech recognition result", as described in the foregoing embodiments, may be implemented based on the following steps:
The cloud server 10 may obtain the text information in the first speech recognition result and calculate the recognition accuracy of that text. The recognition accuracy may be calculated through a preset speech recognition model or through a preset algorithm. For example, several evaluation indexes of the text information, such as the Sentence Error Rate (SER), Sentence Correct Rate (S.Corr) or Character Error Rate (CER), may be computed by a preset model or algorithm, and the recognition accuracy may then be calculated from these evaluation indexes and their respective weights.
If the calculated recognition accuracy is smaller than the set accuracy threshold, it is determined that the voice data does not match the first speech recognition engine. The threshold may be set to 90%, 85%, 80%, and so on, which is not limited in this embodiment.
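By way of illustration, one plausible way to fold such evaluation indexes into a single weighted score is sketched below; the embodiments name the indexes and mention per-index weights but fix neither a formula nor concrete weight values, so both are assumptions here:

```python
def recognition_accuracy(ser: float, s_corr: float, cer: float,
                         weights: tuple = (0.3, 0.4, 0.3)) -> float:
    """Weighted combination of sentence error rate (SER), sentence correct
    rate (S.Corr) and character error rate (CER), each in [0, 1]. Error
    rates count against the score; the correct rate counts toward it.
    The weight values are illustrative assumptions."""
    w_ser, w_corr, w_cer = weights
    return w_ser * (1.0 - ser) + w_corr * s_corr + w_cer * (1.0 - cer)

# A score below the set threshold means the engine does not match the speech.
ACCURACY_THRESHOLD = 0.85                       # 90%, 85% or 80% per the text
matched = recognition_accuracy(ser=0.4, s_corr=0.5, cer=0.3) >= ACCURACY_THRESHOLD
```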
If the calculated recognition accuracy is greater than or equal to the set accuracy threshold, it can be preliminarily determined that the voice data matches the first speech recognition engine. On this basis, the cloud server 10 may further judge whether the voice data matches the first speech recognition engine according to the confidence of the reply information generated in the question-answer matching procedure, as described in detail below.
If the recognition accuracy of the text information in the first speech recognition result is greater than or equal to the set accuracy threshold, the cloud server 10 may perform question-answer matching on the text information with a question-answer matching model based on NLP (Natural Language Processing) techniques. Once trained, the question-answer matching model can search a data set built into the model, according to the input text information, for several pieces of preselected information with different confidences, and then select the preselected information with the highest confidence as the reply information. For example, when the cloud server 10 performs question-answer matching on the text "which street is the bank nearest to me on" through the question-answer matching model, it may obtain the preselected information "Street A" with a confidence of 80% and the preselected information "Street B" with a confidence of 85%, and then select "Street B", the candidate with 85% confidence, as the reply information.
Through the above question-answer matching, the cloud server 10 obtains the reply information and its confidence. If the confidence of the reply information is smaller than the preset confidence threshold, it is determined that the voice data does not match the first speech recognition engine.
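This confidence check can be sketched as follows, reusing the "Street A"/"Street B" example above; the 0.9 threshold is illustrative, since the embodiments leave the concrete value open:

```python
def best_reply(candidates: list) -> tuple:
    """Pick the preselected (reply, confidence) pair with the highest confidence."""
    return max(candidates, key=lambda pair: pair[1])

reply, confidence = best_reply([("Street A", 0.80), ("Street B", 0.85)])
CONFIDENCE_THRESHOLD = 0.9                    # assumed value for illustration
# Below the threshold: the voice data is deemed not to match the engine,
# so a standby speech recognition engine will be tried next.
engine_matches = confidence >= CONFIDENCE_THRESHOLD   # False in this example
```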
If it is determined that the voice data does not match the first speech recognition engine, the cloud server 10 may select a target speech recognition engine matching the voice data from the at least one standby speech recognition engine.
In some optional embodiments, when the cloud server 10 selects a target speech recognition engine matching the voice data from the at least one standby speech recognition engine, it may first take any engine from the standby set as a second speech recognition engine. For example, if the at least one standby speech recognition engine includes an engine for Chinese and an engine for French, the cloud server may select the engine for French as the second speech recognition engine.
After selecting the second speech recognition engine, the cloud server 10 can perform speech recognition on the voice data through it to obtain a second speech recognition result, i.e., the result produced by the second speech recognition engine. As with "first", the qualifier "second" places no limitation on the speech recognition result; it is used only to distinguish the results obtained by the successive recognition passes.
After this recognition pass, the cloud server 10 may judge whether the voice data matches the second speech recognition engine according to the second speech recognition result, as described in detail below.
The cloud server 10 may obtain the text information in the second speech recognition result and calculate its recognition accuracy, again through a preset speech recognition model or a preset algorithm; for example, evaluation indexes such as the Sentence Error Rate (SER), Sentence Correct Rate (S.Corr) and Character Error Rate (CER) may be computed and combined with their respective weights, as above. If the recognition accuracy is greater than or equal to the set accuracy threshold, the voice data is determined to match the second speech recognition engine, and the cloud server 10 may take the second speech recognition engine as the target speech recognition engine. If the recognition accuracy is smaller than the set accuracy threshold, the voice data is determined not to match the second speech recognition engine. The threshold may again be set to 90%, 85%, 80%, and so on, which is not limited in this embodiment.
It is noted that the at least one standby speech recognition engine may be ranked according to a set priority order. The priority order may be preset by a user or derived from big-data analysis. For example, a user may preset the priority order so that the engine for Korean comes before the engine for German. Alternatively, if big-data analysis indicates that the shopping mall where the terminal device 20 is located is visited mainly by Arabic and French speakers, the cloud server 10 may rank the engine for Arabic ahead of the engine for French accordingly.
Based on this ranking, when the cloud server 10 selects a standby engine as the second speech recognition engine, it may try the standby engines sequentially in their ranked order. Continuing the example above, with the Korean engine ranked before the German engine: after recognition by the first engine fails, the cloud server 10 selects the Korean engine first, and then, if necessary, the German engine, to obtain the second speech recognition engine.
In some optional embodiments, before selecting a target speech recognition engine matching the voice data, the cloud server 10 may rank the at least one standby speech recognition engine according to the region where the terminal device 20 is currently located. For example, if the terminal device 20 is currently in a community inhabited mostly by Korean speakers along with a smaller number of German speakers, the cloud server 10 may set the priority order so that the engine for Korean comes before the engine for German.
In addition to the above embodiments, the cloud server 10 may rank the at least one standby speech recognition engine according to the historical frequency of use of the speech recognition engines on the terminal device 20. Taking a bank withdrawal terminal as an example: if it has historically been used 30 times in Korean, 20 times in French and 10 times in German, it is most often used by Korean speakers, then French speakers, then German speakers, so the cloud server 10 can order the standby engines as Korean, French, German.
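Both ranking strategies amount to sorting the standby engines by a per-language score. The sketch below uses the usage counts from the bank-terminal example; the dict-based store is an assumption, and a region-based ranking would simply substitute a region priority table for the counts:

```python
usage_counts = {"Korean": 30, "French": 20, "German": 10}   # from the example

def rank_standby_engines(languages: list, counts: dict) -> list:
    """Most frequently used language first; unseen languages sort last."""
    return sorted(languages, key=lambda lang: counts.get(lang, 0), reverse=True)

print(rank_standby_engines(["German", "Korean", "French"], usage_counts))
# -> ['Korean', 'French', 'German']
```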
The voice interaction system is further explained below with reference to fig. 2 and 3 and a practical application scenario.
As shown in fig. 2 and 3, the terminal device collects the user's initial voice data through a microphone and passes it to a voice endpoint detection module, which extracts the valid voice data from the initial voice data. Once extraction succeeds, the terminal device performs speech recognition on the voice data through the first speech recognition engine (i.e., the main engine) in the main module, corresponding to the first language type, and obtains the text information for the voice data. The terminal device then performs question-answer matching on the text information through the question-answer matching model corresponding to the first language type, obtaining the corresponding reply information.
If the confidence of the reply information is greater than or equal to the confidence threshold, the reply information is converted into speech through the text-to-speech module corresponding to the first language type, and the speech is output.
If the confidence of the reply information is smaller than the confidence threshold, a target speech recognition engine matching the voice data is selected from the at least one standby speech recognition engine, and speech recognition is performed on the voice data again.
Taking the target engine to be the standby engine for Korean as an example: the terminal device performs speech recognition on the voice data through the standby engine for Korean to obtain the corresponding text information, then performs question-answer matching on that text through the standby question-answer matching model for Korean in the main module to obtain the corresponding reply information. If the confidence of the reply information is greater than or equal to the confidence threshold, the reply is converted into speech through the text-to-speech module for Korean and output. If the confidence is smaller than the confidence threshold, another standby speech recognition engine is selected and the voice data is recognized again.
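The scenario as a whole can be sketched as the following pipeline: endpoint detection, recognition by the main engine, question-answer matching, and text-to-speech, falling back through the standby engines whenever the reply confidence is too low. Every component here is a hypothetical stub supplied by the caller:

```python
def run_pipeline(initial_audio,
                 detect_endpoints,        # initial audio -> valid voice data
                 engines,                 # [(language, asr_fn), ...], main engine first
                 qa_models,               # language -> (text -> (reply, confidence))
                 tts_modules,             # language -> (reply text -> audio)
                 confidence_threshold=0.9):
    voice_data = detect_endpoints(initial_audio)       # voice endpoint detection
    for language, asr in engines:                      # main engine, then standby ones
        text = asr(voice_data)                         # speech recognition
        reply, confidence = qa_models[language](text)  # question-answer matching
        if confidence >= confidence_threshold:
            return tts_modules[language](reply)        # convert reply text to speech
    return None                                        # no engine gave a confident reply
```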
An embodiment of the present application further provides a voice interaction method, which is described in detail below with reference to fig. 4.
Step 401, acquiring voice data uttered by a user toward the device.
Step 402, performing speech recognition on the voice data through a first speech recognition engine corresponding to a preset first language type to obtain a first speech recognition result.
Step 403, determining, according to the first speech recognition result, whether the voice data matches the first speech recognition engine.
Step 404, if not, selecting a target speech recognition engine matching the voice data from at least one standby speech recognition engine, the at least one standby speech recognition engine corresponding to at least one language type respectively.
Step 405, generating reply information for the voice data according to a second speech recognition result obtained by the target speech recognition engine on the voice data.
Further optionally, determining whether the voice data matches the first speech recognition engine according to the first speech recognition result includes: acquiring the text information in the first speech recognition result; calculating the recognition accuracy of the text information; and, if the recognition accuracy is smaller than a set accuracy threshold, determining that the voice data does not match the first speech recognition engine.
Further optionally, the method further comprises: if the recognition accuracy is greater than or equal to the set accuracy threshold, performing question-answer matching on the text information with a question-answer matching model to obtain reply information and a confidence for the reply information; and, if the confidence of the reply information is smaller than a preset confidence threshold, determining that the voice data does not match the first speech recognition engine.
Further optionally, selecting a target speech recognition engine matching the voice data from the at least one standby speech recognition engine comprises: selecting any speech recognition engine from the at least one standby speech recognition engine as a second speech recognition engine; performing speech recognition on the voice data through the second speech recognition engine to obtain a second speech recognition result; determining whether the voice data matches the second speech recognition engine according to the second speech recognition result; and, if the voice data matches the second speech recognition engine, taking the second speech recognition engine as the target speech recognition engine.
Further optionally, the at least one standby speech recognition engine is sorted according to a set priority order, and selecting any speech recognition engine from the at least one standby speech recognition engine as the second speech recognition engine comprises: selecting from the at least one standby speech recognition engine sequentially, in that sorted order, to obtain the second speech recognition engine.
Further optionally, before selecting a target speech recognition engine matching the voice data from the at least one standby speech recognition engine, the method further includes: sorting the at least one standby speech recognition engine according to the region where the device is currently located; or sorting the at least one standby speech recognition engine according to the historical frequency of use of the at least one speech recognition engine on the device.
In this embodiment, speech recognition may be performed on the voice data through the first speech recognition engine corresponding to the first language type to obtain a first speech recognition result, and whether the voice data matches the first speech recognition engine is judged from that result. If not, a target speech recognition engine matching the voice data is selected from the standby speech recognition engines, and reply information is generated according to the second speech recognition result obtained by the target engine. In this way, even when users speak different languages, the terminal device can recognize the voice data input by each user accurately and provide replies that match the voice information.
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may be the same device, or different devices may be used as the execution subjects of the methods. For example, the execution subjects of steps 401 to 405 may be device a; for another example, the execution subject of steps 401 to 403 may be device a, and the execution subject of steps 404 and 405 may be device B; and so on.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations are included in a specific order, but it should be clearly understood that the operations may be executed out of the order presented herein or in parallel, and the sequence numbers of the operations, such as 401, 402, etc., are merely used to distinguish various operations, and the sequence numbers themselves do not represent any execution order. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel.
It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
An embodiment of the present application provides a voice interaction apparatus. As shown in fig. 5, the voice interaction apparatus includes: an acquisition module 501, a recognition module 502, a judgment module 503, a selection module 504 and a generation module 505.
The acquisition module 501 is configured to acquire voice data uttered by a user toward a device. The recognition module 502 is configured to perform speech recognition on the voice data through a first speech recognition engine corresponding to a preset first language type to obtain a first speech recognition result. The judgment module 503 is configured to determine, according to the first speech recognition result, whether the voice data matches the first speech recognition engine. The selection module 504 is configured to select, if not, a target speech recognition engine matching the voice data from at least one standby speech recognition engine, the at least one standby speech recognition engine corresponding to at least one language type respectively. The generation module 505 is configured to generate reply information for the voice data according to a second speech recognition result obtained by the target speech recognition engine on the voice data.
Further optionally, when determining whether the voice data matches the first speech recognition engine according to the first speech recognition result, the judgment module 503 is specifically configured to: acquire the text information in the first speech recognition result; calculate the recognition accuracy of the text information; and, if the recognition accuracy is smaller than a set accuracy threshold, determine that the voice data does not match the first speech recognition engine.
Further optionally, the judgment module 503 is further configured to: if the recognition accuracy is greater than or equal to the set accuracy threshold, perform question-answer matching on the text information with a question-answer matching model to obtain reply information and a confidence for the reply information; and, if the confidence of the reply information is smaller than a preset confidence threshold, determine that the voice data does not match the first speech recognition engine.
Further optionally, when selecting a target speech recognition engine matching the voice data from the at least one standby speech recognition engine, the selection module 504 is specifically configured to: select any speech recognition engine from the at least one standby speech recognition engine as a second speech recognition engine; perform speech recognition on the voice data through the second speech recognition engine to obtain a second speech recognition result; determine whether the voice data matches the second speech recognition engine according to the second speech recognition result; and, if so, take the second speech recognition engine as the target speech recognition engine.
Further optionally, the at least one standby speech recognition engine is sorted according to a set priority order, and when selecting any speech recognition engine from the at least one standby speech recognition engine as the second speech recognition engine, the selection module 504 is specifically configured to: select from the at least one standby speech recognition engine sequentially, in that sorted order, to obtain the second speech recognition engine.
Further optionally, before selecting a target speech recognition engine matching the voice data from the at least one standby speech recognition engine, the selection module 504 is further configured to: sort the at least one standby speech recognition engine according to the region where the device is currently located; or sort the at least one standby speech recognition engine according to the historical frequency of use of the at least one speech recognition engine on the device.
In this embodiment, speech recognition may be performed on the voice data through the first speech recognition engine corresponding to the first language type to obtain a first speech recognition result, and whether the voice data matches the first speech recognition engine is judged from that result. If not, a target speech recognition engine matching the voice data is selected from the standby speech recognition engines, and reply information is generated according to the second speech recognition result obtained by the target engine. In this way, even when users speak different languages, the terminal device can recognize the voice data input by each user accurately and provide replies that match the voice information.
Fig. 6 is a schematic structural diagram of a cloud server according to an exemplary embodiment of the present application; this cloud server is suitable for the voice interaction system of the foregoing embodiments. As shown in fig. 6, the cloud server includes a memory 601, a processor 602 and a communication component 603.
The memory 601 is used to store computer programs and may be configured to store various other data to support operations on the device where it resides. Examples of such data include instructions for any application or method operating on the device, contact data, phonebook data, messages, pictures, videos, and the like.
The processor 602, coupled to the memory 601, executes the computer programs in the memory 601 to: acquire voice data uttered by a user toward a device; perform speech recognition on the voice data through a first speech recognition engine corresponding to a preset first language type to obtain a first speech recognition result; determine, according to the first speech recognition result, whether the voice data matches the first speech recognition engine; if not, select a target speech recognition engine matching the voice data from at least one standby speech recognition engine, the at least one standby speech recognition engine corresponding to at least one language type respectively; and generate reply information for the voice data according to a second speech recognition result obtained by the target speech recognition engine on the voice data.
Further optionally, when determining whether the voice data matches the first speech recognition engine according to the first speech recognition result, the processor 602 is specifically configured to: acquire the text information in the first speech recognition result; calculate the recognition accuracy of the text information; and, if the recognition accuracy is smaller than a set accuracy threshold, determine that the voice data does not match the first speech recognition engine.
Further optionally, the processor 602 is further configured to: if the recognition accuracy is greater than or equal to the set accuracy threshold, perform question-answer matching on the text information with a question-answer matching model to obtain reply information and a confidence for the reply information; and, if the confidence of the reply information is smaller than a preset confidence threshold, determine that the voice data does not match the first speech recognition engine.
Further optionally, when selecting a target speech recognition engine matching the voice data from the at least one standby speech recognition engine, the processor 602 is specifically configured to: select any speech recognition engine from the at least one standby speech recognition engine as a second speech recognition engine; perform speech recognition on the voice data through the second speech recognition engine to obtain a second speech recognition result; determine whether the voice data matches the second speech recognition engine according to the second speech recognition result; and, if so, take the second speech recognition engine as the target speech recognition engine.
Further optionally, the at least one standby speech recognition engine is sorted according to a set priority order, and when selecting any speech recognition engine from the at least one standby speech recognition engine as the second speech recognition engine, the processor 602 is specifically configured to: select from the at least one standby speech recognition engine sequentially, in that sorted order, to obtain the second speech recognition engine.
Further optionally, before selecting a target speech recognition engine matching the voice data from the at least one standby speech recognition engine, the processor 602 is further configured to: sort the at least one standby speech recognition engine according to the region where the device is currently located; or sort the at least one standby speech recognition engine according to the historical frequency of use of the at least one speech recognition engine on the device.
Further, as shown in fig. 6, the cloud server also includes a power supply component 604 and other components. Only some components are shown schematically in fig. 6; this does not mean that the cloud server includes only the components shown.
In this embodiment, speech recognition may be performed on the voice data through the first speech recognition engine corresponding to the first language type to obtain a first speech recognition result, and whether the voice data matches the first speech recognition engine is judged from that result. If not, a target speech recognition engine matching the voice data is selected from the standby speech recognition engines, and reply information is generated according to the second speech recognition result obtained by the target engine. In this way, even when users speak different languages, the terminal device can recognize the voice data input by each user accurately and provide replies that match the voice information.
Accordingly, an embodiment of the present application further provides a computer readable storage medium storing a computer program, and when the computer program is executed, the computer program can implement the steps that can be executed by the cloud server in the foregoing method embodiments.
The memory 601 in fig. 6 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The communications component 603 of fig. 6 described above is configured to facilitate communications between the device in which the communications component resides and other devices in a wired or wireless manner. The device in which the communication component is located may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component may be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power supply assembly 604 of fig. 6 provides power to the various components of the device in which the power supply assembly is located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present application shall be included in the scope of the claims of the present application.

Claims (10)

1. A method of voice interaction, comprising:
acquiring voice data uttered by a user to a device;
performing voice recognition on the voice data through a first voice recognition engine corresponding to a preset first language type to obtain a first voice recognition result;
determining, according to the first voice recognition result, whether the voice data matches the first voice recognition engine;
if not, selecting, from at least one standby voice recognition engine, a target voice recognition engine matching the voice data, the at least one standby voice recognition engine respectively corresponding to at least one language type;
and generating reply information for the voice data according to a second voice recognition result obtained by the target voice recognition engine for the voice data.
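For illustration only, the flow of claim 1 could be sketched in Python as follows. The engine objects with a recognize() method returning a dict, the accuracy threshold value, and the helper functions are hypothetical names introduced for this sketch, not part of the claimed method.

# Minimal sketch of the claim 1 flow. The engine interface and the
# helpers below are illustrative assumptions, not the claimed design.

ACCURACY_THRESHOLD = 0.8  # assumed value; the claims leave it open

def is_match(result):
    # Placeholder match test; claims 2 and 3 refine this check.
    return result.get("accuracy", 0.0) >= ACCURACY_THRESHOLD

def generate_reply(result):
    # Placeholder reply generation from the recognized text.
    return "reply to: " + result.get("text", "")

def handle_voice_data(voice_data, first_engine, standby_engines):
    # Recognize with the engine for the preset first language type.
    first_result = first_engine.recognize(voice_data)
    if is_match(first_result):
        return generate_reply(first_result)
    # Otherwise try the standby engines, each of which corresponds
    # to a different language type.
    for engine in standby_engines:
        second_result = engine.recognize(voice_data)
        if is_match(second_result):
            return generate_reply(second_result)
    return None  # no engine matched; the caller chooses a fallback reply

Claims 2 and 3 below refine the match test that is_match() stands in for.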
2. The method of claim 1, wherein determining whether the voice data matches the first voice recognition engine according to the first voice recognition result comprises:
acquiring text information in the first voice recognition result;
calculating a recognition accuracy of the text information;
and if the recognition accuracy is less than a set accuracy threshold, determining that the voice data does not match the first voice recognition engine.
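A minimal sketch of the claim 2 check, assuming the recognition result carries per-word confidences that can be averaged into a recognition accuracy; the claim itself does not fix how the accuracy is calculated or what the threshold is.

# Sketch of the claim 2 mismatch test. Averaging per-word confidences
# and the 0.8 default threshold are assumptions; the claim only
# requires comparing a computed accuracy against a set threshold.

def matches_first_engine(first_result, threshold=0.8):
    text = first_result.get("text", "")
    confidences = first_result.get("word_confidences", [])
    if not text or not confidences:
        return False
    accuracy = sum(confidences) / len(confidences)
    return accuracy >= threshold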
3. The method of claim 2, further comprising:
if the recognition accuracy is greater than or equal to the set accuracy threshold, performing question-answer matching on the text information using a question-answer matching model to obtain reply information and a confidence of the reply information;
and if the confidence of the reply information is less than a preset confidence threshold, determining that the voice data does not match the first voice recognition engine.
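Claim 3 adds a second gate: even when the recognition accuracy passes, the voice data is still treated as a mismatch if the question-answer model is not confident in its reply. A sketch building on matches_first_engine() above; the qa_model object and its match() interface returning a (reply, confidence) pair are assumed shapes.

# Sketch of the claim 3 two-stage gate; qa_model.match() is a
# hypothetical interface, and both thresholds are assumed values.

def matches_with_qa(first_result, qa_model,
                    accuracy_threshold=0.8, confidence_threshold=0.6):
    if not matches_first_engine(first_result, accuracy_threshold):
        return False, None
    reply, confidence = qa_model.match(first_result["text"])
    if confidence < confidence_threshold:
        return False, None  # low-confidence reply: treat as a mismatch
    return True, reply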
4. The method of claim 1, wherein selecting, from at least one standby voice recognition engine, a target voice recognition engine matching the voice data comprises:
selecting any voice recognition engine from the at least one standby voice recognition engine as a second voice recognition engine;
performing voice recognition on the voice data through the second voice recognition engine to obtain a second voice recognition result;
determining, according to the second voice recognition result, whether the voice data matches the second voice recognition engine;
and if the voice data matches the second voice recognition engine, taking the second voice recognition engine as the target voice recognition engine.
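A sketch of the claim 4 selection loop, reusing the hypothetical is_match() helper from the claim 1 sketch; the first standby engine whose recognition result matches the voice data becomes the target engine.

# Sketch of the claim 4 selection: try the standby engines one by one
# and return the first whose result matches the voice data.

def select_target_engine(voice_data, standby_engines):
    for engine in standby_engines:  # iteration order is the topic of claims 5-6
        result = engine.recognize(voice_data)
        if is_match(result):        # is_match as in the claim 1 sketch
            return engine, result
    return None, None  # no standby engine matched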
5. The method of claim 4, wherein the at least one standby voice recognition engine is ordered according to a set priority order;
and wherein selecting any voice recognition engine from the at least one standby voice recognition engine as the second voice recognition engine comprises:
selecting from the at least one standby voice recognition engine sequentially, in the order of the at least one standby voice recognition engine, to obtain the second voice recognition engine.
6. The method of claim 5, wherein before selecting, from the at least one standby voice recognition engine, the target voice recognition engine matching the voice data, the method further comprises:
ordering the at least one standby voice recognition engine according to the current region of the device; or,
ordering the at least one standby voice recognition engine according to a historical frequency of use of the at least one standby voice recognition engine on the device.
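The two orderings of claim 6 might be realized as below; the region-to-language ranking map, the engine language attribute, and the per-language usage counters are all assumed data structures that the claims do not specify.

# Sketch of the claim 6 orderings over hypothetical data structures.

def order_by_region(standby_engines, device_region, region_language_rank):
    # region_language_rank maps a region to a ranked list of language
    # types; engines for unranked languages keep a stable tail position.
    rank = {lang: i for i, lang in
            enumerate(region_language_rank.get(device_region, []))}
    return sorted(standby_engines,
                  key=lambda e: rank.get(e.language, len(rank)))

def order_by_history(standby_engines, usage_counts):
    # Most frequently used engines are tried first.
    return sorted(standby_engines,
                  key=lambda e: usage_counts.get(e.language, 0),
                  reverse=True)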
7. A voice interaction system, comprising a terminal device and a cloud server;
wherein the terminal device is configured to: acquire voice data uttered by a user to the terminal device; and send the voice data to the cloud server;
and the cloud server is configured to: receive the voice data; perform voice recognition on the voice data through a first voice recognition engine corresponding to a preset first language type to obtain a first voice recognition result; determine, according to the first voice recognition result, whether the voice data matches the first voice recognition engine; if not, select, from at least one standby voice recognition engine, a target voice recognition engine matching the voice data, the at least one standby voice recognition engine respectively corresponding to at least one language type; and generate reply information for the voice data according to a second voice recognition result obtained by the target voice recognition engine for the voice data.
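As a rough sketch of the terminal side of this split, the terminal only captures the voice data and forwards it to the cloud server, which runs the claim 1 flow (e.g., handle_voice_data() above) and returns the reply. The HTTP transport, the URL, and the JSON reply payload here are assumptions; the claim only requires that the voice data is sent to the cloud server.

# Sketch of the terminal side of claim 7; transport details are assumed.

import json
import urllib.request

def terminal_send(voice_bytes, server_url="http://cloud.example/voice"):
    # POST the raw captured audio to the (hypothetical) cloud endpoint
    # and return the reply text from the JSON response.
    req = urllib.request.Request(
        server_url, data=voice_bytes,
        headers={"Content-Type": "application/octet-stream"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))["reply"]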
8. A voice interaction apparatus, comprising:
an acquisition module configured to: acquire voice data uttered by a user to a device;
a recognition module configured to: perform voice recognition on the voice data through a first voice recognition engine corresponding to a preset first language type to obtain a first voice recognition result;
a determination module configured to: determine, according to the first voice recognition result, whether the voice data matches the first voice recognition engine;
a selection module configured to: if not, select, from at least one standby voice recognition engine, a target voice recognition engine matching the voice data, the at least one standby voice recognition engine respectively corresponding to at least one language type;
and a generation module configured to: generate reply information for the voice data according to a second voice recognition result obtained by the target voice recognition engine for the voice data.
9. A cloud server, comprising: a memory, a processor, and a communications component;
wherein the memory is configured to store one or more computer instructions;
and the processor is configured to execute the one or more computer instructions to perform the steps of the method of any one of claims 1-6.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method of any one of claims 1-6.
CN202210106996.9A 2022-01-28 2022-01-28 Voice interaction method, system, device, equipment and storage medium Pending CN114495931A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210106996.9A CN114495931A (en) 2022-01-28 2022-01-28 Voice interaction method, system, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114495931A true CN114495931A (en) 2022-05-13

Family

ID=81476314




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination