CN111627432B - Active outbound intelligent voice robot multilingual interaction method and device


Info

Publication number
CN111627432B
CN111627432B (application CN202010316400.9A)
Authority
CN
China
Prior art keywords
text
recognition
language
texts
engine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010316400.9A
Other languages
Chinese (zh)
Other versions
CN111627432A (en)
Inventor
李训林
王帅
张晋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shengzhi Information Technology Nanjing Co ltd
Original Assignee
Shengzhi Information Technology Nanjing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shengzhi Information Technology Nanjing Co ltd filed Critical Shengzhi Information Technology Nanjing Co ltd
Priority to CN202010316400.9A priority Critical patent/CN111627432B/en
Publication of CN111627432A publication Critical patent/CN111627432A/en
Priority to PCT/CN2021/071368 priority patent/WO2021212929A1/en
Application granted granted Critical
Publication of CN111627432B publication Critical patent/CN111627432B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses an active outbound intelligent voice robot multilingual interaction method and device, computer equipment and a storage medium.

Description

Active outbound intelligent voice robot multilingual interaction method and device
Technical Field
The application relates to the technical field of voice signal processing, and in particular to an active outbound intelligent voice robot multilingual interaction method and device, computer equipment and a storage medium.
Background
With the advent of the cloud era and continuous innovation in artificial intelligence technology, intelligent robots based on voice systems are entering many industries. The current intelligent voice robots take over much of the tedious, repetitive customer service work, freeing up human labor and providing broad coverage for intelligent replies across industries.
On the premise of a preset conversation scene, the active outbound intelligent voice robot guides the user through the conversation following a preset script, so as to achieve a marketing purpose. Its core functional modules are speech recognition (ASR), speech synthesis (TTS), dialogue management (DM), natural language processing (NLP) and natural language understanding (NLU).
In overseas markets, intelligent voice robots mostly operate in a single language, which serves about 95% of users. In actual outbound scenes, however, some users have weaker expressive ability in that single language: in Southeast Asia, for example, the main language is English, but the overseas Chinese community, about 5% of users, is more comfortable with Chinese. When the voice robot's opening announcement is heard in English, such a user will ask the voice robot whether it can provide service in another language, such as Chinese. In such a scenario the language barrier reduces the product's value and results in a poor user experience.
Disclosure of Invention
In view of these problems, the application provides an active outbound intelligent voice robot multilingual interaction method and device, computer equipment and a storage medium.
To achieve the purpose of the application, the active outbound intelligent voice robot multilingual interaction method comprises the following steps:
S10, detecting voice data sent by a user when the user enters a multilingual setting scene;
S20, sending the voice data to each language recognition engine to obtain the recognition texts returned by each language recognition engine;
S30, when none of the recognition texts is an empty text, detecting whether each recognition text carries a preset weight word, and determining the text carrying the weight word as the valid text;
S40, inputting the valid text into an NLU system, performing intention recognition on the valid text in the NLU system, and triggering an interaction action according to the intention recognition result.
In one embodiment, the language recognition engine includes an English language recognition engine and a Chinese language recognition engine.
In one embodiment, sending the voice data to each language recognition engine to obtain the recognition text returned by each language recognition engine includes:
the voice data is sent to an English language recognition engine to obtain English recognition text returned by the English language recognition engine;
and sending the voice data to a Chinese language recognition engine to obtain a Chinese recognition text returned by the Chinese language recognition engine.
In one embodiment, after detecting whether each recognition text carries a preset weight word, the method further comprises:
if all the recognition texts carry preset weight words, or none of them does, respectively invoking the language model corresponding to each recognition text, scoring each recognition text with its language model, determining the composite score of each recognition text according to its text score, hesitation time coefficient and adjustment coefficient, and determining the recognition text with the highest composite score as the valid text.
In one embodiment, after sending the voice data to each language recognition engine to obtain the recognition text returned by each language recognition engine, the method further comprises:
if every recognition text is an empty text, recording the language in use as the default language, and triggering the interaction action in the default language.
In one embodiment, after sending the voice data to each language recognition engine to obtain the recognition text returned by each language recognition engine, the method further comprises:
if exactly one non-empty text exists among the recognition texts, determining that non-empty text as the valid text.
In one embodiment, inputting the valid text into the NLU system and performing intention recognition on the valid text in the NLU system includes:
inputting the valid text into the NLU system, having the NLU system identify the language corresponding to the valid text to obtain the current language, and performing intention recognition on the valid text with the language algorithm model corresponding to the current language.
An active outbound intelligent voice robot multilingual interaction device, comprising:
the first detection module is used for detecting voice data sent by a user when the user enters a multilingual setting scene;
the sending module is used for sending the voice data to each language recognition engine to obtain the recognition texts returned by each language recognition engine;
the second detection module is used for detecting, when none of the recognition texts is an empty text, whether each recognition text carries a preset weight word, and determining the text carrying the weight word as the valid text;
and the input module is used for inputting the valid text into the NLU system, performing intention recognition on the valid text in the NLU system, and triggering the interaction action according to the intention recognition result.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the active outbound intelligent voice robot multilingual interaction method of any of the embodiments described above when the computer program is executed.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the active outbound intelligent voice robot multilingual interaction method of any of the above embodiments.
According to the active outbound intelligent voice robot multilingual interaction method, device, computer equipment and storage medium, when a user enters a multilingual setting scene, the voice data sent by the user can be detected and sent to each language recognition engine to obtain the recognition texts returned by each engine. When none of the recognition texts is an empty text, whether each recognition text carries a preset weight word is detected, and the text carrying the weight word is determined as the valid text. The valid text is input into an NLU (natural language understanding) system, intention recognition is performed on it there, and the interaction action is triggered according to the intention recognition result. This realizes multilingual service for the intelligent voice robot and raises its value, thereby improving the user experience.
Drawings
FIG. 1 is a flow chart of a method of active outbound intelligent voice robot multilingual interaction of one embodiment;
FIG. 2 is a schematic diagram of the intelligent voice robot operation of one embodiment;
FIG. 3 is a language decision flow diagram of one embodiment;
FIG. 4 is a schematic diagram of a multi-lingual interaction device of an active outbound intelligent voice robot according to one embodiment;
FIG. 5 is a schematic diagram of a computer device of an embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The active outbound intelligent voice robot multilingual interaction method provided by the application can be applied to related intelligent voice robots. When a user enters a multilingual setting scene, the intelligent voice robot detects the voice data sent by the user and sends it to each language recognition engine to obtain the recognition texts returned by each engine. When none of the recognition texts is an empty text, it detects whether each recognition text carries a preset weight word and determines the text carrying the weight word as the valid text; the valid text is then input into an NLU (natural language understanding) system, intention recognition is performed on it there, and the interaction action is triggered according to the intention recognition result. This realizes multilingual service for the intelligent voice robot and raises its value, thereby improving the user experience.
In one embodiment, as shown in fig. 1, an active outbound intelligent voice robot multilingual interaction method is provided. The method is described here, by way of illustration, as applied to an intelligent voice robot, and includes the following steps:
s10, detecting voice data sent by a user when the user enters a multilingual setting scene.
The intelligent voice robot can be preconfigured with a speaking scene through a language recognition setting, and the possible languages of the conversation, such as Chinese and English, can be set for scenes requiring multilingual recognition. When a user enters the scene where the intelligent voice robot operates, the robot uses the preset language recognition engines to identify the language the user is speaking.
S20, sending the voice data to each language recognition engine to obtain the recognition texts returned by each language recognition engine.
In one embodiment, the language recognition engine includes an English language recognition engine and a Chinese language recognition engine. The English language recognition engine may be the default language recognition engine, and correspondingly, English may be the default language.
Specifically, sending the voice data to each language recognition engine, and obtaining the recognition text returned by each language recognition engine includes:
the voice data is sent to the English language recognition engine to obtain the English recognition text returned by the English language recognition engine; the English recognition text may further be denoted TXT-EN;
and the voice data is sent to the Chinese language recognition engine to obtain the Chinese recognition text returned by the Chinese language recognition engine; the Chinese recognition text may further be denoted TXT-CN.
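As an illustrative sketch of this bilingual dispatch, assuming each engine exposes a blocking `recognize` call (a hypothetical client API), the same audio can be sent to both engines in parallel so that neither delays the language decision:

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_bilingual(voice_data, en_engine, cn_engine, timeout_s=5.0):
    # Dispatch the same audio to the English and Chinese ASR engines
    # concurrently and collect TXT-EN and TXT-CN.
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_en = pool.submit(en_engine.recognize, voice_data)
        fut_cn = pool.submit(cn_engine.recognize, voice_data)
        txt_en = fut_en.result(timeout=timeout_s)  # TXT-EN
        txt_cn = fut_cn.result(timeout=timeout_s)  # TXT-CN
    return txt_en, txt_cn
```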
S30, when none of the recognition texts is an empty text, detecting whether each recognition text carries a preset weight word, and determining the text carrying the weight word as the valid text.
The weight words can be preset in the intelligent voice robot. Specifically, particular weight words need to be set for the calculation; if the preset speaking scenes of the intelligent voice robot include a Chinese scene and an English scene, the process of setting the weight words may include:
Weight words are set according to analysis of the actual scene and of the expressions Chinese and English speakers are likely to use in it; they reflect the user's speech habits and the underlying psychology of each context. For example, when a person receives a call and the first sentence is in a familiar language, they answer normally. If the robot's opening line asks "Hello, this is XX calling from XX, may I speak to XXX?", a listener who understands English answers the sentence smoothly. This shows on two planes: first, the answer semantically fits the opening scene; second, the answer comes at normal speed, typically within 200 ms to 500 ms. A listener who does not understand English will first hesitate, and then reply in Chinese with something like "Can you speak Chinese?" or "You have the wrong number." For openings in different languages, a typical person's response falls within such a range. The intelligent voice robot uses this "range" to set the adjustment coefficient of each language, and determines the hesitation time coefficient from the time interval of the corresponding user's response.
The core function of a weight word is this: a speech recognition result containing a preset weight word indicates that the user is most likely expressing themselves in that language, so the weight word rule forms part of the core logic of the language decision. Scene analysis also shows that in an unfamiliar language scene a user hesitates 300-500 ms, and the more information is involved, the longer the hesitation; a time threshold T is therefore set from the scene analysis. The further the response time exceeds T, the less familiar the language.
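A minimal sketch of these two signals (the weight-word hit and the hesitation beyond the threshold T), assuming recognition results arrive as plain strings:

```python
def weight_word_hits(text, weight_words):
    # Preset weight words found in a recognition result; substring matching
    # is a simplification, a production system would tokenize per language.
    return [w for w in weight_words if w in text]

def hesitation_coefficient(delay_ms, threshold_ms):
    # Δt = (user delay time) - (preset time threshold T); the further the
    # response exceeds T, the less familiar the language is assumed to be.
    return delay_ms - threshold_ms
```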
S40, inputting the valid text into an NLU system, performing intention recognition on the valid text in the NLU system, and triggering the interaction action according to the intention recognition result.
This step can input the valid text, together with the language corresponding to the valid text, into the NLU system.
In one embodiment, inputting the valid text into the NLU system and performing intention recognition on the valid text in the NLU system includes:
inputting the valid text into the NLU system, having the NLU system identify the language corresponding to the valid text to obtain the current language, and performing intention recognition on the valid text with the language algorithm model corresponding to the current language.
Specifically, the NLU system acquires the valid text. Since different languages require different natural language processing models, the NLU selects its processing rules and models according to the acquired language. The recognition result is taken as the input for intention matching, and intention recognition is performed with pre-trained algorithm models for the different languages. Once intention recognition is complete, the action corresponding to the intention, such as a broadcast, is triggered; the action is processed according to the language, and the text-to-speech (TTS) service corresponding to that language is invoked to generate the corresponding voice for broadcasting, completing the feedback exchange with the user.
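A minimal sketch of this language-routed handling, assuming per-language model, action and TTS registries keyed by language code (all names are hypothetical):

```python
def handle_valid_text(valid_text, language, intent_models, actions, tts_services):
    # Select the pre-trained algorithm model for the current language.
    model = intent_models[language]
    intent = model.predict(valid_text)      # intention recognition
    reply_text = actions[language][intent]  # action bound to the recognized intent
    # Call the TTS service of the same language to generate the broadcast audio.
    return tts_services[language].synthesize(reply_text)
```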
According to the active outbound intelligent voice robot multilingual interaction method, when a user enters a multilingual setting scene, the voice data sent by the user can be detected and sent to each language recognition engine to obtain the recognition texts returned by each engine. When none of the recognition texts is an empty text, whether each recognition text carries a preset weight word is detected, and the text carrying the weight word is determined as the valid text. The valid text is input into an NLU (natural language understanding) system, intention recognition is performed on it there, and the interaction action is triggered according to the intention recognition result. This realizes multilingual service for the intelligent voice robot and raises its value, thereby improving the user experience.
In one embodiment, after detecting whether each recognition text carries a preset weight word, the method further comprises:
if all the recognition texts carry preset weight words, or none of them does, respectively invoking the language model corresponding to each recognition text, scoring each recognition text with its language model, determining the composite score of each recognition text according to its text score, hesitation time coefficient and adjustment coefficient, and determining the recognition text with the highest composite score as the valid text.
In one embodiment, after sending the voice data to each language recognition engine to obtain the recognition text returned by each language recognition engine, the method further comprises:
if every recognition text is an empty text, recording the language in use as the default language, and triggering the interaction action in the default language.
In one embodiment, after sending the voice data to each language recognition engine to obtain the recognition text returned by each language recognition engine, the method further comprises:
if exactly one non-empty text exists among the recognition texts, determining that non-empty text as the valid text.
Specifically, taking as an example the case where the language recognition engines comprise an English language recognition engine and a Chinese language recognition engine, the intelligent voice robot sends the user's voice (the voice data) to both engines for speech recognition; the English engine returns TXT-EN and the Chinese engine returns TXT-CN. Determining the corresponding recognition result (the valid text) may then include:
Scene one: both TXT-EN and TXT-CN are empty texts (no valid result is recognized); the current audio is judged, with high probability, to be noise, and the language in use is recorded as the default language (such as English);
Scene two: exactly one of TXT-EN and TXT-CN is an empty text (that engine recognized no valid result); the returned non-empty text is taken to be in the correct language, and that language is recorded;
Scene three: both TXT-EN and TXT-CN return non-empty texts, and the weights are calculated. It is checked whether TXT-EN or TXT-CN contains the weight words set in step two; a weight word fitting the scene indicates with high probability that the recognition result is the user's answer, so when exactly one of TXT-EN and TXT-CN contains a weight word, that recognition result is the optimal result. If both TXT-EN and TXT-CN contain weight words, or neither does, the English language model and the Chinese language model are invoked for TXT-EN and TXT-CN respectively, and the returned results are scored, giving scoreEN(TXT-EN) and scoreCN(TXT-CN). Because the Chinese and English model scores lie on different scales, an adjustment coefficient s is found from statistics of actual scene data, chosen so that scoreEN(TXT-EN) × s approaches the scale of scoreCN. At the same time, a hesitation time coefficient Δt = (user delay time) − T (the preset time threshold) is considered for the different language processing, paired with a sensitivity coefficient a. The empirical values of a (the sensitivity coefficient) and s (the adjustment coefficient) are obtained through extensive data verification, finally yielding the scoring formulas (the computation of the composite scores):
the English comprehensive score is as follows: sourcen (TXT-EN) s-a △t
The Chinese comprehensive score is as follows: sourceCN (TXT-CN)
The English composite score is compared with the Chinese composite score, and the higher-scoring result is selected as the user's recognition result and language.
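By way of illustration, the scene-three comparison can be sketched in Python under the linear reading of the formulas above; the values of s, a and T below are placeholders, not the empirical values the application refers to:

```python
def pick_by_composite_score(txt_en, txt_cn, delay_ms, score_en, score_cn,
                            s=1.0, a=0.001, threshold_ms=400):
    # score_en/score_cn are the English and Chinese language-model scorers;
    # s (adjustment coefficient) and a (sensitivity coefficient) stand in
    # for the empirical values found through data verification.
    delta_t = delay_ms - threshold_ms          # hesitation time coefficient Δt
    en = score_en(txt_en) * s - a * delta_t    # English composite score
    cn = score_cn(txt_cn)                      # Chinese composite score
    return ("en", txt_en) if en >= cn else ("zh", txt_cn)
```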
In one embodiment, the method for passively switching dialogue languages in a specific scene of the intelligent voice robot can comprise a speech synthesis module, a natural language processing module, a natural language understanding module, a dialogue management module and a speech recognition module.
Referring to fig. 2, when the intelligent voice robot is used, the voice types to be supported, for example English and Chinese, are configured for the different scenes. When the intelligent robot executes a scene, it acquires this configuration and processes the different languages in the corresponding scene according to it; in general, the configuration serves as the decision condition on which the language processing logic executes.
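Such a configuration might be expressed as a simple per-scene mapping; every key and value below is illustrative, not prescribed by the application:

```python
# Illustrative per-scene language configuration; all keys and values are
# assumptions, not values taken from the original disclosure.
SCENE_CONFIG = {
    "outbound_opening": {
        "languages": ["en", "zh"],         # languages supported in this scene
        "default_language": "en",
        "weight_words": {
            "en": ["yes", "speaking", "wrong number"],
            "zh": ["中文", "说中文", "打错了"],
        },
        "hesitation_threshold_ms": 400,    # the preset time threshold T
    },
}

def scene_languages(scene_name):
    # The language processing logic uses the configuration as its
    # decision condition for which recognition engines to invoke.
    return SCENE_CONFIG[scene_name]["languages"]
```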
Different language recognition engines are set according to the scene configuration; the multilingual speech recognition results are evaluated and scored by language-specific language models; the most suitable result is found from the evaluation, the language used by the user is marked according to that result, and the best result together with its language is provided to the NLU as the basis for the NLU's judgment.
The natural language understanding layer is compatible with different languages: according to the language determined in the speech recognition flow, the intelligent voice robot uses different semantic matching algorithms to improve the degree of semantic recognition, and adjusts its output speech (TTS) to the user's language.
At the psychological level, as noted above, a person receiving a call in a familiar language answers the opening line smoothly and at normal speed, typically within 200 ms to 500 ms, while a person who does not understand the language hesitates first and then asks back in their own language. For openings in different languages, a typical person's response falls within a characteristic range, and it is this "range" that is used to set the coefficients described above.
Building the Chinese script: with the dialogue management system's script preset in English, a corresponding set of Chinese language scenes is newly added on top of the preset English scene, in order to cover more language scenarios;
Building the Chinese intention: under the original active outbound scene, a new field is added, namely a client intention branch: the customer wants to speak Chinese. Possible customer utterances to configure under this intention branch include: "Can say Chinese?" and "Can you speak Chinese?"
Defining node hotwords for the specific dialogue management scene: this step mainly reflects the fact that, in a specific scene, a user whose English expression ability is weak may ask the AI back during the dialogue; the corresponding high-frequency hotwords in the language-switching intention dialogue are summarized and generalized, for example "Chinese" and "English".
Adding a Chinese engine for the specific scene under the original English engine scene: with the opening played in English, a customer who does not speak English will generally ask the speech system back, "Do you speak Chinese?". For this specific scene, a layer of Chinese ASR engine is added to cover the remaining 5% of cases;
The benefits of this approach: the multilingual user group is covered, while the negative recognition effects of supporting a bilingual engine throughout, such as a slower overall system response and lower recognition accuracy for the 95% of users in the single-language scene, are avoided by this targeted method.
In an example of applying the active outbound intelligent voice robot multilingual interaction method, after the voice data is obtained the decision process may refer to fig. 3. A fast decision is made first: two or more ASR engines are invoked for recognition respectively, and whichever engine returns a result determines the language. If no engine returns a decisive result, an acoustic model decision is made. The acoustic model mainly addresses similar pronunciations across different languages, where different ASRs may each return a plausible result; for example, the Chinese spoken word "that" (nei) sounds very similar to a trigger word in English, and an English ASR will typically recognize it as that trigger word. The sound is therefore decomposed into IPA, and the matching is then performed on the IPA.
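A hedged sketch of that fallback, with `to_ipa` standing in for a hypothetical grapheme-to-IPA converter and a deliberately simple similarity measure:

```python
def acoustic_decision(audio_ipa, candidates, to_ipa):
    # candidates: {"en": txt_en, "zh": txt_cn}. Pick the language whose
    # transcript, rendered as an IPA sequence, best matches the IPA
    # sequence decomposed from the audio itself.
    def similarity(a, b):
        # Position-wise match ratio as a placeholder for a real phonetic
        # alignment (e.g. edit distance over IPA symbols).
        matches = sum(1 for x, y in zip(a, b) if x == y)
        return matches / max(len(a), len(b), 1)

    return max(candidates,
               key=lambda lang: similarity(audio_ipa, to_ipa(candidates[lang], lang)))
```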
Referring to fig. 4, fig. 4 is a schematic structural diagram of an active outbound intelligent voice robot multilingual interaction device according to an embodiment, including:
the first detection module 10 is configured to detect voice data sent by a user when the user enters a multilingual setting scene;
the sending module 20 is configured to send the voice data to each language recognition engine, so as to obtain a recognition text returned by each language recognition engine;
the second detection module 30 is configured to detect, when none of the recognition texts is an empty text, whether each recognition text carries a preset weight word, and to determine the text carrying the weight word as the valid text;
and the input module 40 is used for inputting the effective text into the NLU system, carrying out intention recognition on the effective text in the NLU system, and triggering interaction according to the result of the intention recognition.
The specific limitations of the active outbound intelligent voice robot multilingual interaction device can be understood with reference to the limitations of the corresponding method above, and are not repeated here. Each module in the device can be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in software in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program when executed by the processor is used for realizing an active outbound intelligent voice robot multilingual interaction method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in FIG. 5 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
Based on the examples described above, in one embodiment there is also provided a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the active outbound intelligent voice robot multilingual interaction method as in any of the embodiments described above when the program is executed by the processor.
Those skilled in the art will appreciate that implementing all or part of the above-described embodiments of the method may be implemented by a computer program for instructing relevant hardware, where the program may be stored in a non-volatile computer readable storage medium, and as in the embodiment of the present application, the program may be stored in a storage medium of a computer system and executed by at least one processor in the computer system to implement the embodiment of the method for multilingual interaction of an active outbound intelligent voice robot as described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or the like.
Accordingly, in one embodiment, there is also provided a computer storage medium, on which is stored a computer program, wherein the program when executed by a processor implements the active outbound intelligent voice robot multilingual interaction method according to any one of the above embodiments.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered to be within the scope of this description.
It should be noted that the terms "first", "second" and "third" in the embodiments of the present application merely distinguish similar objects and do not denote a particular order; where permitted, objects so labelled may be interchanged, so that the embodiments described herein can be implemented in sequences other than those illustrated or described.
The terms "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, article or device that comprises a list of steps or modules is not limited to the steps or modules listed, and may include other steps or modules not listed, or steps or modules inherent to such a process, method, article or device.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (9)

1. An active outbound intelligent voice robot multilingual interaction method is characterized by comprising the following steps:
s10, detecting voice data sent by a user when the user enters a multilingual setting scene;
s20, sending the voice data to each language recognition engine to obtain recognition texts returned by each language recognition engine;
s30, when all the identification texts are not blank texts, detecting whether all the identification texts carry preset weight words, and determining the texts carrying the weight words as effective texts;
s40, inputting the effective text into an NLU system, carrying out intention recognition on the effective text in the NLU system, and triggering interaction according to an intention recognition result;
after detecting whether each recognition text carries a preset weight word, the method further comprises the following steps:
if all the recognition texts carry preset weight words, or none of them does, respectively invoking the language model corresponding to each recognition text, scoring each recognition text with its language model, determining the composite score of each recognition text according to its text score, hesitation time coefficient and adjustment coefficient, and determining the recognition text with the highest composite score as the valid text.
2. The method of claim 1, wherein the language recognition engine comprises an english language recognition engine and a chinese language recognition engine.
3. The method for multilingual interaction of an active outbound intelligent voice robot according to claim 2, wherein the step of transmitting voice data to each of the language recognition engines to obtain the recognition text returned by each of the language recognition engines comprises the steps of:
the voice data is sent to an English language recognition engine to obtain English recognition text returned by the English language recognition engine;
and sending the voice data to a Chinese language recognition engine to obtain a Chinese recognition text returned by the Chinese language recognition engine.
4. The method for multi-lingual interaction of an active outbound intelligent voice robot according to claim 1, wherein after sending voice data to each language recognition engine to obtain the recognition text returned by each language recognition engine, further comprising:
if every recognition text is an empty text, recording the language in use as the default language, and triggering the interaction action in the default language.
5. The method for multi-lingual interaction of an active outbound intelligent voice robot according to claim 1, wherein after sending voice data to each language recognition engine to obtain the recognition text returned by each language recognition engine, further comprising:
if exactly one non-empty text exists among the recognition texts, determining that non-empty text as the valid text.
6. The active outbound intelligent voice robot multilingual interaction method of claim 1, wherein inputting the valid text into the NLU system and performing intention recognition on the valid text in the NLU system comprises:
inputting the valid text into the NLU system, having the NLU system identify the language corresponding to the valid text to obtain the current language, and performing intention recognition on the valid text with the language algorithm model corresponding to the current language.
7. An apparatus for implementing the active outbound intelligent voice robot multilingual interaction method of claim 1, comprising:
the first detection module is used for detecting voice data sent by a user when the user enters a multilingual setting scene;
the sending module is used for sending the voice data to each language recognition engine to obtain the recognition texts returned by each language recognition engine;
the second detection module is used for detecting, when none of the recognition texts is an empty text, whether each recognition text carries a preset weight word, and determining the text carrying the weight word as the valid text;
and the input module is used for inputting the valid text into the NLU system, performing intention recognition on the valid text in the NLU system, and triggering the interaction action according to the intention recognition result.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 6 when the computer program is executed by the processor.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
CN202010316400.9A 2020-04-21 2020-04-21 Active outbound intelligent voice robot multilingual interaction method and device Active CN111627432B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010316400.9A CN111627432B (en) 2020-04-21 2020-04-21 Active outbound intelligent voice robot multilingual interaction method and device
PCT/CN2021/071368 WO2021212929A1 (en) 2020-04-21 2021-01-13 Multilingual interaction method and apparatus for active outbound intelligent speech robot

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010316400.9A CN111627432B (en) 2020-04-21 2020-04-21 Active outbound intelligent voice robot multilingual interaction method and device

Publications (2)

Publication Number Publication Date
CN111627432A CN111627432A (en) 2020-09-04
CN111627432B true CN111627432B (en) 2023-10-20

Family

ID=72258977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010316400.9A Active CN111627432B (en) 2020-04-21 2020-04-21 Active outbound intelligent voice robot multilingual interaction method and device

Country Status (2)

Country Link
CN (1) CN111627432B (en)
WO (1) WO2021212929A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111627432B (en) * 2020-04-21 2023-10-20 升智信息科技(南京)有限公司 Active outbound intelligent voice robot multilingual interaction method and device
CN112562640B (en) * 2020-12-01 2024-04-12 北京声智科技有限公司 Multilingual speech recognition method, device, system, and computer-readable storage medium
CN113571064B (en) * 2021-07-07 2024-01-30 肇庆小鹏新能源投资有限公司 Natural language understanding method and device, vehicle and medium
CN114918950A (en) * 2022-01-12 2022-08-19 国网吉林省电力有限公司延边供电公司 Intelligent robot for power supply of Xinji Jianfrontier
CN114464179B (en) * 2022-01-28 2024-03-19 达闼机器人股份有限公司 Voice interaction method, system, device, equipment and storage medium
CN115134466A (en) * 2022-06-07 2022-09-30 马上消费金融股份有限公司 Intention recognition method and device and electronic equipment
CN116343786A (en) * 2023-03-07 2023-06-27 南方电网人工智能科技有限公司 Customer service voice analysis method, system, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012037790A (en) * 2010-08-10 2012-02-23 Toshiba Corp Voice interaction device
CN107818781A (en) * 2017-09-11 2018-03-20 远光软件股份有限公司 Intelligent interactive method, equipment and storage medium
CN109104534A (en) * 2018-10-22 2018-12-28 北京智合大方科技有限公司 A kind of system for improving outgoing call robot and being intended to Detection accuracy, recall rate
CN109522564A (en) * 2018-12-17 2019-03-26 北京百度网讯科技有限公司 Voice translation method and device
CN109712607A (en) * 2018-12-30 2019-05-03 联想(北京)有限公司 A kind of processing method, device and electronic equipment

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170011735A1 (en) * 2015-07-10 2017-01-12 Electronics And Telecommunications Research Institute Speech recognition system and method
CN105957516B (en) * 2016-06-16 2019-03-08 百度在线网络技术(北京)有限公司 More voice identification model switching method and device
US20180137109A1 (en) * 2016-11-11 2018-05-17 The Charles Stark Draper Laboratory, Inc. Methodology for automatic multilingual speech recognition
CN108335692B (en) * 2018-03-21 2021-03-05 上海智蕙林医疗科技有限公司 Voice switching method, server and system
CN109065020B (en) * 2018-07-28 2020-11-20 重庆柚瓣家科技有限公司 Multi-language category recognition library matching method and system
KR20210009596A (en) * 2019-07-17 2021-01-27 엘지전자 주식회사 Intelligent voice recognizing method, apparatus, and intelligent computing device
CN111627432B (en) * 2020-04-21 2023-10-20 升智信息科技(南京)有限公司 Active outbound intelligent voice robot multilingual interaction method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012037790A (en) * 2010-08-10 2012-02-23 Toshiba Corp Voice interaction device
CN107818781A (en) * 2017-09-11 2018-03-20 远光软件股份有限公司 Intelligent interactive method, equipment and storage medium
CN109104534A (en) * 2018-10-22 2018-12-28 北京智合大方科技有限公司 A kind of system for improving outgoing call robot and being intended to Detection accuracy, recall rate
CN109522564A (en) * 2018-12-17 2019-03-26 北京百度网讯科技有限公司 Voice translation method and device
CN109712607A (en) * 2018-12-30 2019-05-03 联想(北京)有限公司 A kind of processing method, device and electronic equipment

Also Published As

Publication number Publication date
CN111627432A (en) 2020-09-04
WO2021212929A1 (en) 2021-10-28

Similar Documents

Publication Publication Date Title
CN111627432B (en) Active outbound intelligent voice robot multilingual interaction method and device
US20220335930A1 (en) Utilizing pre-event and post-event input streams to engage an automated assistant
CN112262430A (en) Automatically determining language for speech recognition of a spoken utterance received via an automated assistant interface
CN110148416A (en) Audio recognition method, device, equipment and storage medium
CN111052229A (en) Automatically determining a language for speech recognition of a spoken utterance received via an automated assistant interface
US11978432B2 (en) On-device speech synthesis of textual segments for training of on-device speech recognition model
US20150199340A1 (en) System for translating a language based on user's reaction and method thereof
CN112673421A (en) Training and/or using language selection models to automatically determine a language for voice recognition of spoken utterances
CN107093425A (en) Speech guide system, audio recognition method and the voice interactive method of power system
KR20220088926A (en) Use of Automated Assistant Function Modifications for On-Device Machine Learning Model Training
KR102140391B1 (en) Search method and electronic device using the method
KR20220166848A (en) User Moderation for Hotword/Keyword Detection
US20240021207A1 (en) Multi-factor audio watermarking
KR20230005966A (en) Detect close matching hotwords or phrases
CN115083412B (en) Voice interaction method and related device, electronic equipment and storage medium
KR20210042520A (en) An electronic apparatus and Method for controlling the electronic apparatus thereof
CN112037772B (en) Response obligation detection method, system and device based on multiple modes
CN113077790B (en) Multi-language configuration method, multi-language interaction method, device and electronic equipment
CN111062200A (en) Phonetics generalization method, phonetics identification method, device and electronic equipment
CN115116442B (en) Voice interaction method and electronic equipment
CN115662430B (en) Input data analysis method, device, electronic equipment and storage medium
CN111324703A (en) Man-machine conversation method and doll simulating human voice to carry out man-machine conversation
CN117711389A (en) Voice interaction method, device, server and storage medium
JP2005122194A (en) Voice recognition and dialog device and voice recognition and dialog processing method
CN118020100A (en) Voice data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant