CN112002325A - Multi-language voice interaction method and device - Google Patents


Info

Publication number
CN112002325A
Authority
CN
China
Prior art keywords
language
language model
switching
audio
server
Prior art date
Legal status
Granted
Application number
CN202011162634.9A
Other languages
Chinese (zh)
Other versions
CN112002325B (en)
Inventor
宋泽
甘津瑞
邓建凯
Current Assignee
Sipic Technology Co Ltd
Original Assignee
AI Speech Ltd
Priority date
Filing date
Publication date
Application filed by AI Speech Ltd
Priority to CN202011162634.9A
Publication of CN112002325A
Application granted
Publication of CN112002325B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
        • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00 - Speech recognition
                    • G10L15/005 - Language recognition
                    • G10L15/08 - Speech classification or search
                        • G10L15/18 - Speech classification or search using natural language modelling
                    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
                        • G10L2015/223 - Execution procedure of a spoken command

Abstract

The invention discloses a multilingual voice interaction method and device. The method comprises the following steps: in response to acquired audio, sending the audio to a mixed language model for recognition, wherein the mixed language model is trained on language-switching command words of multiple languages and is stored locally; judging, based on the recognition result, whether a language-switching command word is present in the audio; if a language-switching command word is present, determining the target language from the command word; and setting an online default language model based on the target language and synchronizing it to a server, wherein the server comprises a plurality of single-language models. By using a mixed language model only for language-switching command words at the client and a plurality of single-language models at the server, the high cost of training a full mixed language model is avoided and the stability of voice interaction is improved.

Description

Multi-language voice interaction method and device
Technical Field
The invention belongs to the field of voice interaction, and particularly relates to a multi-language voice interaction method and device.
Background
Currently, individual technologies such as Speech Recognition (ASR), Natural Language Processing (NLP), and Dialogue Management (DM) are available on the market, providing the basic capabilities for voice interaction.
Speech recognition converts spoken content into text that a computer can read, and it has two working modes: a recognition mode and a command mode. A speech recognition program can be implemented in different ways depending on the mode. The recognition mode works as follows: the engine provides a word library and a recognition template library in the background, so a system built on it does not need to modify the recognition grammar further and only needs to adapt the main program source code supplied with the recognition engine. The command mode is harder to implement: the programmer must write the dictionary himself, program against it, and finally process and correct the output according to the phonetic dictionary. The biggest difference between the two modes is therefore whether the programmer checks and modifies the code according to the dictionary content.
Natural language processing is an important means of realizing natural-language communication between human and machine. It comprises two parts, Natural Language Understanding (NLU) and Natural Language Generation (NLG), which enable a computer both to understand the meaning of natural-language text and to express a given intention or thought as natural-language text. Natural language understanding builds a computer model that is grounded in linguistics and draws on logic, psychology, and computer science, and it attempts to answer how language organizes and transmits information and how a person obtains information from a sequence of language symbols; stated differently, it derives a semantic representation of natural language through syntactic, semantic, and pragmatic analysis, and so understands the intention expressed by a natural-language text. Natural language generation is a branch of artificial intelligence and computational linguistics; a language generation system is a computer model based on language information processing whose workflow is the reverse of natural language analysis: starting from an abstract conceptual level, it generates text by selecting and executing semantic and grammatical rules.
Disclosure of Invention
The embodiments of the present invention provide a multilingual voice interaction method and device for solving at least one of the above-mentioned technical problems.
In a first aspect, an embodiment of the present invention provides a multilingual voice interaction method for a client, comprising: in response to acquired audio, sending the audio to a mixed language model for recognition, wherein the mixed language model is trained on language-switching command words of multiple languages and is stored locally; judging, based on the recognition result, whether a language-switching command word is present in the audio; if a language-switching command word is present, determining the target language from the command word; and setting an online default language model based on the target language and synchronizing it to a server, wherein the server comprises a plurality of single-language models.
In a second aspect, an embodiment of the present invention provides a multilingual voice interaction method for a server, comprising: in response to acquired audio, sending the audio to a first single-language model for recognition, wherein a plurality of single-language models are trained at the server; and performing subsequent processing on the recognition result, the subsequent processing comprising semantic processing and dialogue processing.
In a third aspect, an embodiment of the present invention provides a multilingual voice interaction apparatus for a client, comprising: a first acquisition and recognition module configured to, in response to acquired audio, send the audio to a mixed language model for recognition, wherein the mixed language model is trained on language-switching command words of multiple languages and is stored locally; a judging module configured to judge, based on the recognition result, whether a language-switching command word is present in the audio; a switching module configured to, if a language-switching command word is present, determine the target language from the command word; and a setting-and-synchronization module configured to set an online default language model based on the target language and synchronize it to a server, wherein the server comprises a plurality of single-language models.
In a fourth aspect, an embodiment of the present invention provides a multilingual voice interaction apparatus for a server, comprising: a second acquisition and recognition module configured to, in response to acquired audio, send the audio to a first single-language model for recognition, wherein a plurality of single-language models are trained at the server; and a processing module configured to perform subsequent processing on the recognition result, the subsequent processing comprising semantic processing and dialogue processing.
In a fifth aspect, a computer program product is provided, the computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the multilingual speech interaction method of the first aspect.
In a sixth aspect, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of the first aspect.
According to the methods provided by the embodiments of the present application, a mixed language model covering only language-switching command words is used at the client, while a plurality of single-language models are used at the server; this avoids the high cost of training a full mixed language model and improves the stability of voice interaction.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a multilingual voice interaction method for a client according to an embodiment of the present invention;
FIG. 2 is a flowchart of a multilingual voice interaction method for a server according to an embodiment of the present invention;
FIG. 3 is a flow chart of another multilingual speech interaction method according to an embodiment of the present invention;
FIG. 4 is a flowchart of a multilingual voice interaction in accordance with an embodiment of the present invention;
FIG. 5 is a block diagram of a multilingual speech interaction apparatus for a client according to an embodiment of the present invention;
FIG. 6 is a block diagram of a multi-language voice interaction apparatus for a server according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, a flowchart of an embodiment of the multilingual voice interaction method of the present invention for a client is shown.
As shown in FIG. 1, in step 101, in response to acquired audio, the audio is sent to a mixed language model for recognition, wherein the mixed language model is trained on language-switching command words of multiple languages and is stored locally;
in step 102, whether a language-switching command word is present in the audio is judged based on the recognition result;
in step 103, if a language-switching command word is present, the target language is determined from the command word;
in step 104, an online default language model is set based on the target language and synchronized to a server, wherein the server comprises a plurality of single-language models.
In this embodiment, for step 101, the multilingual voice interaction apparatus, in response to the acquired audio, sends the audio to a mixed language model for recognition, the mixed language model having been trained on language-switching command words of multiple languages and stored locally. For example, the multiple languages may include Mandarin, several preset dialects, and several preset foreign languages.
For step 102, after the audio has been recognized by the mixed language model, the apparatus judges, based on the recognition result, whether a language-switching command word is present in the audio. Examples of language-switching command words are "can you speak Cantonese", "can you speak the Northeastern dialect", or "can you speak English".
For step 103, if a language-switching command word is present, the target language is determined from it. For example, if the command word is "can you speak Cantonese", the language to switch to is Cantonese; if the command word is "can you speak English", the language to switch to is English.
For step 104, the apparatus sets the online default language model based on the target language and synchronizes it to the server. For example, if the target language is Cantonese, the Cantonese model is set as the online default language model and synchronized to the server.
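To make the client-side flow concrete, a minimal sketch in Python follows. It is an illustration only: the class names, the command-word table, and the synchronization call are hypothetical stand-ins assumed for this sketch, not APIs from the patent or from any real SDK.

```python
# Hypothetical sketch of steps 101-104; every name is invented for illustration.

# Switch-language command words trained into the local mixed model,
# mapped to the language they request (assumed phrasing).
SWITCH_COMMANDS = {
    "can you speak cantonese": "cantonese",
    "can you speak english": "english",
}

class MixedLanguageModel:
    """Stand-in for the locally stored mixed language model."""
    def recognize(self, audio: bytes) -> str:
        return "can you speak cantonese"  # placeholder recognition result

class ServerClient:
    """Stand-in for the connection to the online recognition service."""
    def sync_default_language_model(self, language: str) -> None:
        print(f"online default language model set to: {language}")

def handle_audio(audio: bytes, model: MixedLanguageModel, server: ServerClient) -> None:
    text = model.recognize(audio)                       # step 101: local recognition
    target = SWITCH_COMMANDS.get(text.strip().lower())  # step 102: command word present?
    if target is None:
        return                                          # no command word: the switching flow ends
    # Steps 103-104: the target language determines the online default
    # language model, which is synchronized to the server.
    server.sync_default_language_model(target)

handle_audio(b"<pcm audio>", MixedLanguageModel(), ServerClient())
```

In a real client the text would come from the offline recognition kernel and the synchronization would be a network call; only the branching structure is the point here.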
In the solution of this embodiment, the mixed language model used at the client covers only the language-switching command words, so the high cost of training a full mixed language model is avoided.
In the method of the foregoing embodiment, the judging whether a language-switching command word is present in the audio further comprises:
if it is judged that no language-switching command word is present in the audio, ending the language-switching flow.
Referring to FIG. 2, a flowchart of an embodiment of the multilingual voice interaction method of the present invention for a server is shown.
As shown in FIG. 2, in step 201, in response to acquired audio, the audio is sent to a first single-language model for recognition, wherein a plurality of single-language models are trained at the server;
in step 202, subsequent processing is performed on the recognition result.
In this embodiment, for step 201, the multilingual voice interaction apparatus, in response to the acquired audio, sends the audio to the first single-language model for recognition. The first single-language model is the server's default language model and may be, for example, the Mandarin model or the Cantonese model; the server trains a plurality of single-language models, for example a Mandarin model, a Cantonese model, and an English model, the remaining languages not being enumerated here.
For step 202, the apparatus performs subsequent processing on the recognition result, for example semantic processing followed by dialogue processing. In a subway ticket-purchasing scenario, for instance, where the first single-language model is the Mandarin model and the user speaks Mandarin, after the user says "I want to buy a ticket" and recognition and semantic processing are performed, the dialogue result "please choose the destination station" can be obtained and output.
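The server side can be sketched in the same spirit. The per-language model registry and the stubbed semantic and dialogue stages below are assumptions made for illustration; the patent does not prescribe this structure.

```python
# Hypothetical sketch of the server-side flow (steps 201-202).

class SingleLanguageModel:
    """Stand-in for one trained single-language recognition model."""
    def __init__(self, language: str):
        self.language = language

    def recognize(self, audio: bytes) -> str:
        return f"<text recognized by the {self.language} model>"

# The server trains one single-language model per supported language.
MODELS = {lang: SingleLanguageModel(lang)
          for lang in ("mandarin", "cantonese", "english")}

def parse_semantics(text: str) -> dict:
    return {"intent": "buy_ticket", "text": text}   # stubbed semantic processing

def dialogue_reply(semantics: dict) -> str:
    if semantics["intent"] == "buy_ticket":         # stubbed dialogue processing
        return "please choose the destination station"
    return "sorry, I did not understand"

def process_audio(audio: bytes, current: SingleLanguageModel) -> str:
    text = current.recognize(audio)        # step 201: single-model recognition
    semantics = parse_semantics(text)      # step 202: semantic processing
    return dialogue_reply(semantics)       # step 202: dialogue processing

print(process_audio(b"<pcm audio>", MODELS["mandarin"]))  # default model: Mandarin
```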
In the solution of this embodiment, training a plurality of single-language models improves the stability of voice interaction.
Referring to FIG. 3, a flowchart of another multilingual voice interaction method according to an embodiment of the present invention is shown; this flow mainly comprises the steps that follow the "performing subsequent processing on the recognition result" of the foregoing embodiment.
As shown in FIG. 3, in step 301, whether an online-default-language-model synchronization instruction sent by a client has been received is judged;
in step 302, if so, whether the second single-language model named in the synchronization instruction is consistent with the server's current first single-language model is judged;
in step 303, if they are inconsistent, the first single-language model is switched to the second single-language model.
In this embodiment, for step 301, the multilingual voice interaction apparatus judges whether an online-default-language-model synchronization instruction sent by the client has been received. The client and the server may process the user's audio in parallel after it is acquired: for example, the audio may first undergo single-language-model recognition and semantic processing before any synchronization instruction from the client arrives, or this may happen only after the synchronization instruction has been received; furthermore, the client's function may also be handled at the server.
For step 302, if an online-default-language-model synchronization instruction sent by the client is received, whether the second single-language model named in the instruction is consistent with the server's current first single-language model is judged; for example, the first single-language model is the Mandarin model while the second is the Cantonese or English model.
For step 303, if they are inconsistent, the first single-language model is switched to the second single-language model.
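A minimal sketch of this check follows, assuming the synchronization instruction arrives as a simple mapping with a "language" field; the instruction format and all names are assumptions, not details given by the patent.

```python
# Hypothetical sketch of steps 301-303 on the server.
from typing import Optional

SUPPORTED = {"mandarin", "cantonese", "english"}

class LanguageSwitcher:
    def __init__(self, default: str = "mandarin"):
        self.current = default                  # the current "first" single-language model

    def on_sync_instruction(self, instruction: Optional[dict]) -> None:
        if instruction is None:                 # step 301: no synchronization instruction;
            return                              # keep the current model and output the result
        requested = instruction["language"]     # the "second" single-language model
        if requested in SUPPORTED and requested != self.current:  # step 302: consistency check
            self.current = requested            # step 303: switch first -> second

switcher = LanguageSwitcher()
switcher.on_sync_instruction({"language": "cantonese"})
print(switcher.current)  # -> cantonese
```

If the requested model equals the current one, nothing is switched and the dialogue result is simply output, matching the consistent case described below.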
In the solution of this embodiment, the language model required by the user can be switched to by checking the online-default-language-model synchronization instruction sent by the client.
In the method of the foregoing embodiment, the judging whether an online-default-language-model synchronization instruction sent by the client has been received further comprises:
if no online-default-language-model synchronization instruction sent by the client is received, outputting the processing result obtained by performing the subsequent processing on the recognition result.
For example, if the first single-language model is the Mandarin model and the user interacts in Mandarin, the dialogue result is output directly when no synchronization instruction is received from the client. If instead the user interacts in Cantonese, no synchronization instruction is received, and the Mandarin single-language model cannot accurately recognize the user's audio, the user is asked whether to switch languages.
In the solution of this embodiment, when no synchronization instruction is received from the client, the processing result obtained from the recognition result is output; thus the dialogue result is returned when recognition succeeds, and the user is prompted to switch languages when recognition fails.
In the method of the foregoing embodiment, after the judging whether the second single-language model in the synchronization instruction is consistent with the server's current first single-language model, the method further comprises:
if the second single-language model is consistent with the first single-language model, outputting the dialogue result.
For example, if the first single-language model is the Mandarin model and the synchronization instruction received from the client also names the Mandarin model, the dialogue result is output.
It should be noted that although the above embodiments use numbered steps such as step 101 and step 102, in an actual application scenario some steps may be executed in parallel; the numbers do not impose a strict execution order, and the present application is not limited in this respect.
The following describes some of the problems the inventors encountered in implementing the present disclosure and one specific embodiment of the solution finally identified, so that those skilled in the art can better understand it.
In the course of implementing the present invention, the inventors found the following defects in similar existing techniques:
in most ASR systems, different languages (dialects) are considered independently, and an AcousTIc Model (AM) is typically trained from scratch for each language. Thereby resulting in applications that support only monolingual interactions. Due to different application scenarios, there is an increasing demand for intelligent applications of multiple languages, such as: ticket buying systems and consultation systems of subway stations, autonomous registration systems of hospitals, etc., so that a hybrid recognition model appears, however, training of this model introduces the following problems. First, training an AM from scratch requires a large amount of manually labeled data, which is not only expensive, but also takes a lot of time to acquire. This also results in a considerable difference in the quality of the acoustic model between rich and poor-featured languages. This is because for starved languages only small models of low complexity can be estimated. The large amount of labeled training data is also an inevitable bottleneck for those languages that are low in traffic and newly published and difficult to obtain a large amount of representative corpora. Second, to achieve the same recognition rate, a hybrid language model is trained significantly more often than a monolingual language model.
In the course of implementing the invention, the inventors also identified the cause of these defects, which is not readily apparent:
mixed language models are usually adopted so that an application can support multilingual voice interaction, but using a mixed language model is expensive, which keeps its adoption low.
The present scheme realizes a multilingual voice interaction method with switchable language models by means of the company's Dialogue User Interface (DUI) platform.
First, the scheme trains the single-language models at the server and uses locally a mixed language model covering only the language-switching command words. Second, the client performs speech recognition with the mixed language model and judges from the recognition result whether the utterance is a language-switching command word: if not, voice interaction proceeds with the default single-language model; if so, interaction switches to the designated single-language model. By performing speech recognition with a command-word mixed model plus single-language models, followed by semantic processing and dialogue management, the scheme both reduces the high training cost and guarantees stability.
The technical innovations of the invention are as follows:
scenario 1: language switching instruction
The method comprises the following steps: inputting audio;
step two: the audio acquisition module acquires audio;
step three: sending the audio to an offline mixed language model recognition kernel;
step four; and performing instruction processing on the identified result.
Step five: judging whether the language switching instruction is a language switching instruction, and if the language switching instruction is the language switching instruction, setting an online default language model; otherwise, the process is finished.
Scenario 2: non-switching instruction
Step 1: input audio;
Step 2: the audio acquisition module acquires the audio;
Step 3: send the audio to the online recognition service;
Step 4: send the recognition result to the online semantic service;
Step 5: send the semantic result to the dialogue service;
Step 6: judge whether a language-switching instruction has been received; if not, output the dialogue result; otherwise, cancel the current dialogue result (a sketch of this round-level decision follows below).
Alternative versions formed by the inventors in the course of implementing the invention:
Alternative scheme: place the mixed language model at the server for processing.
Advantage: it occupies fewer local resources.
Disadvantage: network transmission takes time, so the response is not as fast as with local processing.
Beta version: the offline module is made functionally consistent with the online voice management module.
Disadvantage: an offline voice-interaction management module consumes more resources than offline recognition management alone.
Further effects found by the inventors in the course of implementing the invention:
the scheme supports voice interaction in multiple languages, largely satisfying the requirements of multilingual scenarios, with low latency and high stability within a round of interaction. Compared with the multilingual voice interaction schemes currently on the market, it reduces the training cost of the speech models.
Referring to FIG. 5, a block diagram of a multilingual voice interaction apparatus for a client according to an embodiment of the present invention is shown.
As shown in FIG. 5, the apparatus includes a first acquisition and recognition module 510, a judging module 520, a switching module 530, and a setting-and-synchronization module 540.
The first acquisition and recognition module 510 is configured to, in response to acquired audio, send the audio to a mixed language model for recognition, wherein the mixed language model is trained on language-switching command words of multiple languages and is stored locally. The judging module 520 is configured to judge, based on the recognition result, whether a language-switching command word is present in the audio. The switching module 530 is configured to, if a language-switching command word is present, determine the target language from the command word. The setting-and-synchronization module 540 is configured to set an online default language model based on the target language and synchronize it to a server, wherein the server comprises a plurality of single-language models.
Referring to FIG. 6, a block diagram of a multilingual voice interaction apparatus for a server according to an embodiment of the present invention is shown.
As shown in FIG. 6, the apparatus includes a second acquisition and recognition module 610 and a processing module 620.
The second acquisition and recognition module 610 is configured to, in response to acquired audio, send the audio to the first single-language model for recognition, wherein a plurality of single-language models are trained at the server. The processing module 620 is configured to perform subsequent processing on the recognition result.
It should be understood that the modules shown in FIG. 5 and FIG. 6 correspond to the steps of the methods described with reference to FIG. 1, FIG. 2, and FIG. 3. The operations and features described above for the methods, and the corresponding technical effects, therefore apply equally to these modules and are not repeated here.
It should be noted that the modules in the embodiments of the present application do not limit the scheme of the present application. For example, the first acquisition and recognition module may equivalently be described as a module that, in response to acquired audio, sends the audio to a mixed language model for recognition, the mixed language model being trained on language-switching command words of multiple languages and stored locally. In addition, the related functional modules may be implemented by a hardware processor; for example, the first acquisition and recognition module may be implemented by a processor, which is not described in detail here.
In other embodiments, the present invention further provides a non-volatile computer storage medium storing computer-executable instructions that can execute the multilingual voice interaction method of any of the above method embodiments.
As one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
in response to acquired audio, send the audio to a mixed language model for recognition, wherein the mixed language model is trained on language-switching command words of multiple languages and is stored locally;
judge, based on the recognition result, whether a language-switching command word is present in the audio;
if a language-switching command word is present, determine the target language from the command word;
set an online default language model based on the target language and synchronize it to a server, wherein the server comprises a plurality of single-language models.
As another embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
in response to acquired audio, send the audio to a first single-language model for recognition, wherein a plurality of single-language models are trained at the server;
and perform subsequent processing on the recognition result.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the multilingual voice-interaction device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the multilingual voice-interaction device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention also provide a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium, where the computer program includes program instructions, which, when executed by a computer, cause the computer to perform any one of the above-mentioned multilingual speech interaction methods.
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in FIG. 7, the electronic device includes one or more processors 710 and a memory 720, with one processor 710 illustrated in FIG. 7. The apparatus performing the multilingual voice interaction method may further include an input device 730 and an output device 740. The processor 710, the memory 720, the input device 730, and the output device 740 may be connected by a bus or by other means; connection by a bus is illustrated in FIG. 7. The memory 720 is a non-volatile computer-readable storage medium as described above. By running the non-volatile software programs, instructions, and modules stored in the memory 720, the processor 710 executes the various functional applications and data processing of the server, thereby implementing the multilingual voice interaction method of the above method embodiments. The input device 730 may receive input numeric or character information and generate key-signal inputs related to user settings and function control of the multilingual voice interaction apparatus. The output device 740 may include a display device such as a display screen.
The above product can execute the method provided by the embodiments of the present invention and has the functional modules and beneficial effects corresponding to the method. For technical details not described in this embodiment, refer to the method provided by the embodiments of the present invention.
As an embodiment, the electronic device is applied to a multilingual voice interaction apparatus, and includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
in response to acquired audio, send the audio to a mixed language model for recognition, wherein the mixed language model is trained on language-switching command words of multiple languages and is stored locally;
judge, based on the recognition result, whether a language-switching command word is present in the audio;
if a language-switching command word is present, determine the target language from the command word;
set an online default language model based on the target language and synchronize it to a server, wherein the server comprises a plurality of single-language models.
As another embodiment, the electronic device is applied to a multilingual voice interaction apparatus, and includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
in response to acquired audio, send the audio to a first single-language model for recognition, wherein a plurality of single-language models are trained at the server;
and perform subsequent processing on the recognition result.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: these devices are characterized by mobile communication capability and primarily provide voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: these belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID, and UMPC devices.
(3) Portable entertainment devices: these can display and play multimedia content. They include audio and video players, handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Servers: similar in architecture to a general-purpose computer, but with higher requirements on processing capability, stability, reliability, security, scalability, and manageability, because they must provide highly reliable services.
(5) Other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A multilingual voice interaction method for a client, comprising:
in response to acquired audio, sending the audio to a mixed language model for recognition, wherein the mixed language model is trained on language-switching command words of multiple languages and is stored locally;
judging, based on the recognition result, whether a language-switching command word is present in the audio;
if a language-switching command word is present, determining the target language from the command word;
setting an online default language model based on the target language and synchronizing it to a server, wherein the server comprises a plurality of single-language models.
2. The method of claim 1, wherein the judging whether a language-switching command word is present in the audio further comprises:
if it is judged that no language-switching command word is present in the audio, ending the language-switching flow.
3. A multilingual voice interaction method for a server, comprising:
in response to acquired audio, sending the audio to a first single-language model for recognition, wherein a plurality of single-language models are trained at the server;
performing subsequent processing on the recognition result, the subsequent processing comprising semantic processing and dialogue processing.
4. The method of claim 3, further comprising, after the subsequent processing of the recognition result:
judging whether an online-default-language-model synchronization instruction sent by a client has been received;
if so, judging whether a second single-language model in the synchronization instruction is consistent with a current first single-language model of the server;
if they are inconsistent, switching the first single-language model to the second single-language model.
5. The method of claim 4, wherein the judging whether an online-default-language-model synchronization instruction sent by the client has been received further comprises:
if no online-default-language-model synchronization instruction sent by the client is received, outputting the processing result obtained by performing the subsequent processing on the recognition result.
6. The method of claim 4, further comprising, after the judging whether the second single-language model in the synchronization instruction is consistent with the current first single-language model of the server:
if the second single-language model is consistent with the first single-language model, outputting the dialogue result.
7. A multilingual voice interaction apparatus for a client, comprising:
a first acquisition and recognition module configured to, in response to acquired audio, send the audio to a mixed language model for recognition, wherein the mixed language model is trained on language-switching command words of multiple languages and is stored locally;
a judging module configured to judge, based on the recognition result, whether a language-switching command word is present in the audio;
a switching module configured to, if a language-switching command word is present, determine the target language from the command word;
a setting-and-synchronization module configured to set an online default language model based on the target language and synchronize it to a server, wherein the server comprises a plurality of single-language models.
8. A multilingual voice interaction apparatus for a server, comprising:
a second acquisition and recognition module configured to, in response to acquired audio, send the audio to a first single-language model for recognition, wherein a plurality of single-language models are trained at the server;
a processing module configured to perform subsequent processing on the recognition result, the subsequent processing comprising semantic processing and dialogue processing.
9. A computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the method of any of claims 1 to 6.
10. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1 to 6.
CN202011162634.9A 2020-10-27 2020-10-27 Multi-language voice interaction method and device Active CN112002325B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011162634.9A CN112002325B (en) 2020-10-27 2020-10-27 Multi-language voice interaction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011162634.9A CN112002325B (en) 2020-10-27 2020-10-27 Multi-language voice interaction method and device

Publications (2)

Publication Number Publication Date
CN112002325A 2020-11-27
CN112002325B CN112002325B (en) 2021-02-09

Family

ID=73474430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011162634.9A Active CN112002325B (en) 2020-10-27 2020-10-27 Multi-language voice interaction method and device

Country Status (1)

Country Link
CN (1) CN112002325B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705240A (en) * 2021-08-03 2021-11-26 中科讯飞互联(北京)信息科技有限公司 Text processing method based on multi-language branch model and related device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101478613A (en) * 2009-02-03 2009-07-08 中国电信股份有限公司 Multi-language voice recognition method and system based on soft queuing call center
US20140129220A1 (en) * 2011-03-03 2014-05-08 Shilei ZHANG Speaker and call characteristic sensitive open voice search
CN108461082A (en) * 2017-02-20 2018-08-28 Lg 电子株式会社 The method that control executes the artificial intelligence system of more voice processing
CN109272983A (en) * 2018-10-12 2019-01-25 武汉辽疆科技有限公司 Bilingual switching device for child-parent education
CN109817213A (en) * 2019-03-11 2019-05-28 腾讯科技(深圳)有限公司 The method, device and equipment of speech recognition is carried out for adaptive languages
CN110634487A (en) * 2019-10-24 2019-12-31 科大讯飞股份有限公司 Bilingual mixed speech recognition method, device, equipment and storage medium
CN110838290A (en) * 2019-11-18 2020-02-25 中国银行股份有限公司 Voice robot interaction method and device for cross-language communication
CN111508472A (en) * 2019-01-11 2020-08-07 华为技术有限公司 Language switching method and device and storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101478613A (en) * 2009-02-03 2009-07-08 中国电信股份有限公司 Multi-language voice recognition method and system based on soft queuing call center
US20140129220A1 (en) * 2011-03-03 2014-05-08 Shilei ZHANG Speaker and call characteristic sensitive open voice search
CN108461082A (en) * 2017-02-20 2018-08-28 Lg 电子株式会社 The method that control executes the artificial intelligence system of more voice processing
CN109272983A (en) * 2018-10-12 2019-01-25 武汉辽疆科技有限公司 Bilingual switching device for child-parent education
CN111508472A (en) * 2019-01-11 2020-08-07 华为技术有限公司 Language switching method and device and storage medium
CN109817213A (en) * 2019-03-11 2019-05-28 腾讯科技(深圳)有限公司 The method, device and equipment of speech recognition is carried out for adaptive languages
CN110491382A (en) * 2019-03-11 2019-11-22 腾讯科技(深圳)有限公司 Audio recognition method, device and interactive voice equipment based on artificial intelligence
CN110634487A (en) * 2019-10-24 2019-12-31 科大讯飞股份有限公司 Bilingual mixed speech recognition method, device, equipment and storage medium
CN110838290A (en) * 2019-11-18 2020-02-25 中国银行股份有限公司 Voice robot interaction method and device for cross-language communication

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705240A (en) * 2021-08-03 2021-11-26 中科讯飞互联(北京)信息科技有限公司 Text processing method based on multi-language branch model and related device
CN113705240B (en) * 2021-08-03 2024-04-19 科大讯飞(北京)有限公司 Text processing method and related device based on multilingual branch model

Also Published As

Publication number Publication date
CN112002325B (en) 2021-02-09

Similar Documents

Publication Publication Date Title
CN112100349B (en) Multi-round dialogue method and device, electronic equipment and storage medium
CN108877782B (en) Speech recognition method and device
CN111402861B (en) Voice recognition method, device, equipment and storage medium
CN110930980B (en) Acoustic recognition method and system for Chinese and English mixed voice
CN113327609B (en) Method and apparatus for speech recognition
CN109429522A (en) Voice interactive method, apparatus and system
CN111027291B (en) Method and device for adding mark symbols in text and method and device for training model, and electronic equipment
CN108228574B (en) Text translation processing method and device
CN110517668B (en) Chinese and English mixed speech recognition system and method
CN109726397B (en) Labeling method and device for Chinese named entities, storage medium and electronic equipment
CN109256125B (en) Off-line voice recognition method and device and storage medium
CN116737908A (en) Knowledge question-answering method, device, equipment and storage medium
CN112463942A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN112560510A (en) Translation model training method, device, equipment and storage medium
CN116187320A (en) Training method and related device for intention recognition model
CN112002325B (en) Multi-language voice interaction method and device
KR20190074508A (en) Method for crowdsourcing data of chat model for chatbot
CN111160512B (en) Method for constructing double-discriminant dialogue generation model based on generation type countermeasure network
CN110473524B (en) Method and device for constructing voice recognition system
CN109273004B (en) Predictive speech recognition method and device based on big data
CN116189663A (en) Training method and device of prosody prediction model, and man-machine interaction method and device
CN113808572B (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN113763925B (en) Speech recognition method, device, computer equipment and storage medium
CN111966803B (en) Dialogue simulation method and device, storage medium and electronic equipment
CN114490967A (en) Training method of dialogue model, dialogue method and device of dialogue robot and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Patentee after: Sipic Technology Co.,Ltd.

Address before: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Patentee before: AI SPEECH Ltd.
