CN111627432A - Active outbound intelligent voice robot multilingual interaction method and device


Info

Publication number
CN111627432A
Authority
CN
China
Prior art keywords
text
recognition
language
robot
intelligent voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010316400.9A
Other languages
Chinese (zh)
Other versions
CN111627432B (en)
Inventor
李训林
王帅
张晋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shengzhi Information Technology (Nanjing) Co., Ltd.
Original Assignee
Shengzhi Information Technology (Nanjing) Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shengzhi Information Technology (Nanjing) Co., Ltd.
Priority to CN202010316400.9A priority Critical patent/CN111627432B/en
Publication of CN111627432A publication Critical patent/CN111627432A/en
Priority to PCT/CN2021/071368 priority patent/WO2021212929A1/en
Application granted granted Critical
Publication of CN111627432B publication Critical patent/CN111627432B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/1822: Parsing for meaning understanding
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue

Abstract

The invention discloses a multilingual interaction method for an active outbound intelligent voice robot, together with a corresponding apparatus, computer device, and storage medium.

Description

Active outbound intelligent voice robot multilingual interaction method and device
Technical Field
The invention relates to the technical field of voice signal processing, and in particular to a multilingual interaction method and apparatus for an active outbound intelligent voice robot, as well as a corresponding computer device and storage medium.
Background
With the arrival of the cloud era and continuous innovation in artificial intelligence, intelligent robots built on voice systems have entered all kinds of industries. Intelligent voice robots now take over a large amount of tedious, repetitive customer-service work, freeing human labor and bringing great convenience to automated replies across industries.
the active outbound intelligent voice robot guides a user to have a conversation in a Torontal conversation mode on the premise of presetting a conversation scene, so that the marketing purpose is achieved. Its main core functional modules Are Speech Recognition (ASR), speech synthesis (TTS), Dialog Management (DM), Natural Language Processing (NLP), Natural Language Understanding (NLU).
In overseas markets, most intelligent voice robots work in a single language, which serves about 95% of users. In real outbound scenarios, however, some users express themselves poorly in that single language. In Southeast Asia, for example, the main language is English, yet roughly 5% of users, largely overseas Chinese residents, are more comfortable with Chinese. On hearing the voice robot broadcast in English, such a user will ask whether it offers service in another language, such as Chinese. In these scenarios the language barrier lowers the product's value and degrades the user experience.
Disclosure of Invention
To solve these problems, the invention provides a multilingual interaction method for an active outbound intelligent voice robot, together with a corresponding apparatus, a computer device, and a storage medium.
To achieve this aim, the invention provides a multilingual interaction method for an active outbound intelligent voice robot, comprising the following steps:
S10, when a user enters a multilingual setting scene, detecting voice data sent by the user;
S20, sending the voice data to each language recognition engine to obtain the recognition text returned by each language recognition engine;
S30, when none of the recognition texts is empty, detecting whether each recognition text carries preset weight words, and determining the text carrying the weight words as the valid text;
and S40, inputting the valid text into the NLU system, performing intention recognition on the valid text in the NLU system, and triggering an interactive action according to the intention recognition result.
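As an illustration only, the following Python sketch wires steps S10 to S40 together. Every object and method name here (recognize(), detect_intent(), trigger_action(), trigger_default_action()) is an assumed placeholder, and the weight-word examples are invented; the patent does not prescribe a concrete API.

```python
# Hedged sketch of steps S10-S40; names are illustrative placeholders.

WEIGHT_WORDS = {"en": ["English"], "zh": ["Chinese", "中文"]}  # assumed examples

def multilingual_interaction(voice_data, engines, nlu):
    # S10 has already captured voice_data in the multilingual setting scene.
    # S20: send the voice data to every language recognition engine.
    texts = {lang: eng.recognize(voice_data) for lang, eng in engines.items()}
    non_empty = {lang: t for lang, t in texts.items() if t}
    if not non_empty:                    # all empty: keep the default language
        return nlu.trigger_default_action()
    if len(non_empty) == 1:              # exactly one engine returned text
        lang, valid_text = next(iter(non_empty.items()))
    else:
        # S30: prefer the recognition text that carries a preset weight word;
        # ambiguous cases fall through to the comprehensive-score embodiment
        # described later, which this sketch omits.
        hits = {lang: t for lang, t in non_empty.items()
                if any(w in t for w in WEIGHT_WORDS.get(lang, []))}
        pool = hits if len(hits) == 1 else non_empty
        lang, valid_text = next(iter(pool.items()))
    # S40: intention recognition in the NLU system, then the interactive action.
    intent = nlu.detect_intent(valid_text, lang)
    return nlu.trigger_action(intent, lang)
```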
In one embodiment, the language recognition engines include an English language recognition engine and a Chinese language recognition engine.
As an embodiment, sending the voice data to each language recognition engine and obtaining the recognition text returned by each language recognition engine includes:
sending the voice data to an English language recognition engine to obtain an English recognition text returned by the English language recognition engine;
and sending the voice data to a Chinese language recognition engine to obtain a Chinese recognition text returned by the Chinese language recognition engine.
In one embodiment, after detecting whether each recognition text carries a preset weight word, the method further includes:
if every recognition text carries a preset weight word, or none does, calling the corresponding language model for each recognition text, using each language model to obtain a text score for its recognition text, determining the comprehensive score of each recognition text from its text score, the hesitation time coefficient, and the adjustment coefficient, and determining the recognition text with the highest comprehensive score as the valid text.
In one embodiment, after sending the speech data to each language recognition engine and obtaining the recognized text returned by each language recognition engine, the method further includes:
and if all the recognition texts are empty, recording the language in use as the default language, and triggering the interactive action in the default language.
In one embodiment, after sending the speech data to each language recognition engine and obtaining the recognized text returned by each language recognition engine, the method further includes:
and if exactly one non-empty text exists among the recognition texts, determining that non-empty text as the valid text.
In one embodiment, inputting the valid text into the NLU system and performing intention recognition on the valid text comprises:
inputting the valid text into the NLU system, having the NLU system identify the language corresponding to the valid text to obtain the current language, and performing intention recognition on the valid text with the language algorithm model corresponding to the current language.
A multilingual interaction apparatus for an active outbound intelligent voice robot, comprising:
a first detection module, configured to detect voice data sent by a user when the user enters a multilingual setting scene;
a sending module, configured to send the voice data to each language recognition engine to obtain the recognition text returned by each language recognition engine;
a second detection module, configured to detect, when none of the recognition texts is empty, whether each recognition text carries preset weight words, and to determine the text carrying the weight words as the valid text;
and an input module, configured to input the valid text into the NLU system, perform intention recognition on the valid text in the NLU system, and trigger an interactive action according to the intention recognition result.
A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the active outbound intelligent voice robot multilingual interaction method of any of the above embodiments when executing the computer program.
A computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the active outbound intelligent voice robot multilingual interaction method of any of the above embodiments.
With the above multilingual interaction method, apparatus, computer device, and storage medium, when a user enters a multilingual setting scene the robot detects the voice data sent by the user; sends the voice data to each language recognition engine to obtain the recognition text returned by each engine; when none of the recognition texts is empty, detects whether each recognition text carries preset weight words and determines the text carrying the weight words as the valid text; and inputs the valid text into the NLU (natural language understanding) system, where intention recognition is performed and an interactive action is triggered according to the intention recognition result. The intelligent voice robot can thus provide multilingual service, which raises its value and improves the user experience.
Drawings
FIG. 1 is a flowchart of the multilingual interaction method for an active outbound intelligent voice robot of an embodiment;
FIG. 2 is a schematic diagram of the working process of an intelligent voice robot of an embodiment;
FIG. 3 is a language decision flowchart of an embodiment;
FIG. 4 is a schematic diagram of the multilingual interaction apparatus for an active outbound intelligent voice robot of an embodiment;
FIG. 5 is a schematic diagram of a computer device of an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The multilingual interaction method for an active outbound intelligent voice robot can be applied to a suitable intelligent voice robot. When a user enters a multilingual setting scene, the robot detects the voice data sent by the user; sends the voice data to each language recognition engine to obtain the recognition text returned by each engine; when none of the recognition texts is empty, detects whether each recognition text carries preset weight words and determines the text carrying the weight words as the valid text; and inputs the valid text into the NLU (natural language understanding) system, where intention recognition is performed and an interactive action is triggered according to the intention recognition result. The robot can thus provide multilingual service, which raises its value and improves the user experience.
In one embodiment, as shown in FIG. 1, a multilingual interaction method for an active outbound intelligent voice robot is provided. Using its application to an intelligent voice robot as an example, the method includes the following steps:
s10, when the user enters the multilingual setting scenario, voice data uttered by the user is detected.
The intelligent voice robot can preset a language scene through a language-recognition configurator: in any scene that requires multilingual recognition, the possible languages, such as Chinese and English, are configured in advance. When a user enters a scene the intelligent voice robot is handling, the robot uses the preset language recognition engines to identify the language the user is speaking.
And S20, sending the voice data to each language recognition engine to obtain the recognition text returned by each language recognition engine.
In one embodiment, the language recognition engines include an English language recognition engine and a Chinese language recognition engine. The English language recognition engine may serve as the default language recognition engine, and correspondingly English may be the default language.
Specifically, sending the voice data to each language recognition engine and obtaining the recognition text returned by each language recognition engine includes:
sending the voice data to the English language recognition engine to obtain the English recognition text returned by the English language recognition engine; this English recognition text may further be recorded as TXT-EN;
and sending the voice data to the Chinese language recognition engine to obtain the Chinese recognition text returned by the Chinese language recognition engine, which may further be recorded as TXT-CN.
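Since the two engines are queried independently with the same audio, an implementation can issue both requests in parallel. Below is a minimal sketch using Python's standard concurrent.futures; recognize_en and recognize_cn stand in for the two engines' recognition calls, which the patent leaves unspecified.

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch(voice_data, recognize_en, recognize_cn):
    # Send the same audio to both engines at once and collect TXT-EN / TXT-CN.
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_en = pool.submit(recognize_en, voice_data)
        fut_cn = pool.submit(recognize_cn, voice_data)
        txt_en = fut_en.result()   # English recognition text, TXT-EN
        txt_cn = fut_cn.result()   # Chinese recognition text, TXT-CN
    return txt_en, txt_cn
```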
And S30, when none of the recognition texts is empty, detecting whether each recognition text carries preset weight words, and determining the text carrying the weight words as the valid text.
The weight words can be preset in the intelligent voice robot. Specifically, concrete weight words must be set for the calculation. If the conversational scenes preset by the intelligent voice robot include a Chinese scene and an English scene, the weight words may be set as follows:
according to the analysis of an actual scene, language expression analysis which is possibly used by Chinese and English in the scene sets weight words, and the related logic of the weight words can be according to the speech habits and related psychological levels of a user in each context; for example, a normal person receiving a call will answer normally the first sentence if the first sentence is in a familiar language. If an open-field white robot (intelligent speech robot) asks "Hello, this is XX calling from XX, may I speak to XXX? If the person receiving the phone understands english, the person will answer the word smoothly. This is understood from two levels. First, the answer words semantically conform to the answer of the open field white; secondly, the speed of the answer is normal. Typically within 200ms to 500 ms. If the person who does not know English answers the phone, the person who does not know English firstly takes "stupid" and then answers "ask can say Chinese" or "you make a mistake". For different languages, there is a range of responses from the average person. The intelligent voice robot utilizes the 'range' to set the adjustment coefficients of all languages, and determines the hesitation time coefficient according to the time interval in the response process of the corresponding user.
The core function of a weight word is this: if a speech recognition result contains a preset weight word, the user is most likely expressing that language, so the weight-word rule forms part of the core logic of language judgment. Scene analysis further shows that in an unfamiliar-language scene a user hesitates, and the less familiar the language, the longer the hesitation; a time threshold T, for example 500 ms, is therefore set according to the scene analysis, and the more time used beyond T, the lower the language familiarity.
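The two signals described above, a weight-word hit in the recognition text and the hesitation time measured against the threshold T, can be sketched as follows; the 500 ms value and the sample weight words are illustrative assumptions rather than values fixed by the patent.

```python
T_MS = 500                             # assumed hesitation threshold T
WEIGHT_WORDS_CN = ["中文", "Chinese"]   # assumed weight words, Chinese scene
WEIGHT_WORDS_EN = ["English"]          # assumed weight words, English scene

def carries_weight_word(recognition_text, weight_words):
    """True if the recognition text contains any preset weight word."""
    return any(word in recognition_text for word in weight_words)

def hesitation_dt(delay_ms):
    """Δt = DelayTime - T; a larger value indicates lower language familiarity."""
    return delay_ms - T_MS
```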
And S40, inputting the valid text into the NLU system, performing intention recognition on the valid text in the NLU system, and triggering an interactive action according to the intention recognition result.
In this step, the valid text and the language corresponding to the valid text can be input into the NLU system together.
In one embodiment, inputting the valid text into the NLU system and performing intention recognition on the valid text comprises:
inputting the valid text into the NLU system, having the NLU system identify the language corresponding to the valid text to obtain the current language, and performing intention recognition on the valid text with the language algorithm model corresponding to the current language.
Specifically, the NLU system receives the valid text. Considering that different languages require different natural language processing models, the NLU selects its processing rules and models according to the language it receives. The recognition result is used as the input for intention matching, and intention recognition is performed by pre-trained algorithm models for each language. After the intention is identified, the action corresponding to the intention is triggered, such as a broadcast: the action is processed according to the language, and the text-to-speech service (TTS) of that language is called to generate the corresponding speech for broadcasting, completing the feedback exchange with the user.
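A minimal sketch of this language-routed NLU step follows; the per-language model and TTS registries and their predict() and synthesize() interfaces are assumptions made for illustration.

```python
# Hedged sketch of the NLU routing and same-language TTS reply.

def nlu_step(valid_text, lang, intent_models, tts_services, action_replies):
    model = intent_models[lang]                # pre-trained model for this language
    intent = model.predict(valid_text)         # intention recognition
    reply_text = action_replies[intent]        # action mapped to the intention
    audio = tts_services[lang].synthesize(reply_text)  # same-language TTS
    return audio                               # speech to broadcast to the user
```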
With the above multilingual interaction method for an active outbound intelligent voice robot, when a user enters a multilingual setting scene the robot detects the voice data sent by the user; sends the voice data to each language recognition engine to obtain the recognition text returned by each engine; when none of the recognition texts is empty, detects whether each recognition text carries preset weight words and determines the text carrying the weight words as the valid text; and inputs the valid text into the NLU (natural language understanding) system, where intention recognition is performed and an interactive action is triggered according to the intention recognition result. The intelligent voice robot can thus provide multilingual service, which raises its value and improves the user experience.
In one embodiment, after detecting whether each recognition text carries a preset weight word, the method further includes:
if every recognition text carries a preset weight word, or none does, calling the corresponding language model for each recognition text, using each language model to obtain a text score for its recognition text, determining the comprehensive score of each recognition text from its text score, the hesitation time coefficient, and the adjustment coefficient, and determining the recognition text with the highest comprehensive score as the valid text.
In one embodiment, after sending the speech data to each language recognition engine and obtaining the recognized text returned by each language recognition engine, the method further includes:
and if all the recognition texts are empty, recording the language in use as the default language, and triggering the interactive action in the default language.
In one embodiment, after sending the speech data to each language recognition engine and obtaining the recognized text returned by each language recognition engine, the method further includes:
and if exactly one non-empty text exists among the recognition texts, determining that non-empty text as the valid text.
Specifically, taking the case where the language recognition engines are an English engine and a Chinese engine, the intelligent voice robot sends the user's voice (the voice data) to the English language recognition engine and the Chinese language recognition engine for recognition; the English engine returns TXT-EN and the Chinese engine returns TXT-CN. The corresponding recognition result (the valid text) is then determined as follows:
Scene one: if both TXT-EN and TXT-CN are empty (no valid result is recognized), the captured audio is most probably noise, and the language in use is recorded as the default language (for example English);
Scene two: if exactly one of TXT-EN and TXT-CN is empty, the returned non-empty text is considered to be in the correct language, and that language is recorded;
Scene three: if both TXT-EN and TXT-CN return non-empty text, weight calculation is performed. It is first judged whether TXT-EN or TXT-CN contains the weight words set in step two; a match against the weight words indicates with high probability that the recognition result is what the user answered. Therefore, if exactly one of TXT-EN and TXT-CN contains a weight word, that recognition result is taken as the optimal result. If both contain a weight word, or neither does, the English language model and the Chinese language model are called for TXT-EN and TXT-CN respectively, and the returned results are scored to obtain scoreEN(TXT-EN) and scoreCN(TXT-CN). Since the Chinese and English models score on different scales, an adjustment coefficient s is found from actual scene statistics such that scoreEN(TXT-EN) × s lies on approximately the same scale as scoreCN. The hesitation time Δt = DelayTime − T (the preset time threshold) of the different language processing is also considered, together with a sensitivity coefficient, to obtain the most suitable score treatment; the sensitivity coefficient a and the adjustment coefficient s are empirical values obtained by verification against a large amount of data. The comprehensive scores are finally computed as:
The English comprehensive score: scoreEN(TXT-EN) × s × a^(−Δt)
The Chinese comprehensive score: scoreCN(TXT-CN)
The English comprehensive score and the Chinese comprehensive score are then compared, and the result with the higher score is selected as the user's recognition result and language; a code sketch of this selection follows.
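The three scenes and the score formulas above can be combined into one selection routine. The sketch below reads the garbled original formula as scoreEN(TXT-EN) × s × a^(−Δt); the constant values, the scorer callables, and the weight-word lists are placeholders, not values fixed by the patent.

```python
S = 1.2     # adjustment coefficient s (placeholder; fitted from scene statistics)
A = 1.05    # sensitivity coefficient a (placeholder; empirical)
T_MS = 500  # preset time threshold T (placeholder)

def pick_result(txt_en, txt_cn, score_en, score_cn, delay_ms,
                weight_en, weight_cn):
    # Scene one: both texts empty, treat as noise, keep the default language.
    if not txt_en and not txt_cn:
        return None, "en"
    # Scene two: exactly one non-empty text, its language is recorded.
    if not txt_cn:
        return txt_en, "en"
    if not txt_en:
        return txt_cn, "zh"
    # Scene three: weight words decide if exactly one side matches ...
    hit_en = any(w in txt_en for w in weight_en)
    hit_cn = any(w in txt_cn for w in weight_cn)
    if hit_en != hit_cn:
        return (txt_en, "en") if hit_en else (txt_cn, "zh")
    # ... otherwise compare the comprehensive scores.
    dt = (delay_ms - T_MS) / 1000.0                # hesitation time Δt in seconds
    en_total = score_en(txt_en) * S * A ** (-dt)   # English comprehensive score
    cn_total = score_cn(txt_cn)                    # Chinese comprehensive score
    return (txt_en, "en") if en_total >= cn_total else (txt_cn, "zh")
```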
In an embodiment, the multilingual interaction method for an active outbound intelligent voice robot determines the interaction language through the recognition-and-judgment procedure above, solving the shortcomings existing intelligent voice robots show in multilingual scenes.
Referring to FIG. 2, when the intelligent voice robot is used, the voice languages to be supported, such as English and Chinese, are configured per scene. When the robot executes a scene it reads this configuration and handles the different languages accordingly; in general, the configuration provides the conditions the language-processing logic evaluates.
The speech recognition layer instantiates different language recognition engines according to the scene configuration, evaluates and scores the multilingual recognition results with language-specific models, selects the most appropriate result, marks the language the user used for that result, and hands the best result together with its language to the NLU as the basis for NLU judgment.
The intelligent voice robot then applies different semantic matching algorithms according to the language determined during speech recognition, improving semantic recognition, and adjusts its output speech (TTS) to the user's language.
On a psychological level, when an ordinary person receives a call whose first sentence is in a familiar language, they answer normally. If the robot's opening line asks "Hello, this is XX calling from XX, may I speak to XXX?", a listener who understands English answers fluently. This can be understood on two levels: first, the answer semantically matches the opening line; second, the response time is normal, typically within 200 ms to 500 ms. A listener who does not understand English is first stunned and then answers something like "Can you speak Chinese?" or "You have dialed the wrong number." For each language there is thus a characteristic response range for the average person, and the decision logic above makes use of this "range".
Construction of the Chinese script: where the script management system presets the script in English, a corresponding set of Chinese script scenes is added for each preset English-script scene, in order to cover more language scenarios;
building a Chinese intention: under the original active outbound scene, a new field, namely a client intention scene branch, is added: the customer wants to speak the branch of the intent of Chinese. Possible expressions of corresponding scenes of the client need to be configured under the intention branch, such as: can one say Chinese? can you spread Chinese?
Definition of node hot words for specific dialogue-management scenes: this step analyzes the counter-questions the AI may be asked in a specific scene when the user's English expression ability is weak, and summarizes the high-frequency hot words of the language-switching intention, here: "Chinese", "English".
Adding a Chinese engine layer for a specific scene on top of the original English engine scene: with an English opening line, a client who cannot speak English will, while English is being broadcast, typically ask the voice system back, "Can you speak Chinese?" For this specific scene an additional Chinese ASR engine is added to cover the remaining 5% of cases;
the advantages are that: the method covers multi-language user groups, and avoids the identification negative effects caused by using a support bilingual engine by using a smart method, such as the reaction rate of the whole system, the user identification accuracy rate aiming at 95% of scenes and the like.
In one example of applying the multilingual interaction method, after the voice data is obtained the decision process can proceed as shown in FIG. 3. A fast decision is made first; a simple implementation calls the two (or more) ASR engines separately, and whichever engine returns a result determines the language. If the fast decision yields no result, an acoustic-model decision is made. Acoustic models mainly address pronunciations that are similar across languages, where different ASRs may each return a plausible result; for instance, the Chinese demonstrative "那 (nèi)" is acoustically very close to an offensive English word, which an English ASR will typically return. The sound is therefore decomposed into IPA (International Phonetic Alphabet) symbols, and matching is performed at the IPA level.
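A sketch of this decision flow follows: a fast decision first, then an IPA-level comparison when both engines return text. The ipa_of transcriber and the simple symbol-distance are assumptions; the patent only states that the sound is decomposed into IPA and matched at the IPA level.

```python
# Hedged sketch of the decision flow in FIG. 3. ipa_of() stands for a
# phone-to-IPA transcriber, which the patent does not specify.

def fast_decide(txt_en, txt_cn, audio_ipa, ipa_of):
    # Fast decision: whichever engine returned text decides the language.
    if txt_en and not txt_cn:
        return "en"
    if txt_cn and not txt_en:
        return "zh"
    if not txt_en and not txt_cn:
        return None                      # leave to the default-language rule
    # Both engines returned text (similar pronunciations): match at IPA level.
    def dist(text):
        a, b = ipa_of(text), audio_ipa
        return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return "en" if dist(txt_en) <= dist(txt_cn) else "zh"
```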
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a multilingual interaction apparatus for an active outbound intelligent voice robot according to an embodiment, comprising:
the first detection module 10, configured to detect voice data sent by a user when the user enters a multilingual setting scene;
the sending module 20, configured to send the voice data to each language recognition engine to obtain the recognition text returned by each language recognition engine;
the second detection module 30, configured to detect, when none of the recognition texts is empty, whether each recognition text carries preset weight words, and to determine the text carrying the weight words as the valid text;
and the input module 40, configured to input the valid text into the NLU system, perform intention recognition on the valid text in the NLU system, and trigger an interactive action according to the intention recognition result.
For specific limitations of the multilingual interaction apparatus, reference may be made to the limitations of the multilingual interaction method above, which are not repeated here. Each module of the apparatus can be implemented wholly or partly in software, hardware, or a combination of the two. The modules can be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and execute their corresponding operations.
In one embodiment, a computer device is provided, which may be a terminal whose internal structure is shown in FIG. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor provides computing and control capabilities. The memory comprises a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running them. The network interface communicates with external terminals over a network connection. When executed by the processor, the computer program implements the multilingual interaction method for an active outbound intelligent voice robot. The display screen can be a liquid-crystal or electronic-ink display, and the input device can be a touch layer over the display, a key, trackball, or touchpad on the housing, or an external keyboard, touchpad, or mouse.
Those skilled in the art will appreciate that the architecture shown in FIG. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
Based on the above embodiments, a computer device is also provided in one embodiment, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the multilingual interaction method for an active outbound intelligent voice robot of any of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the above methods may be implemented by a computer program stored in a non-volatile computer-readable storage medium. In embodiments of the present invention the program may be stored in the storage medium of a computer system and executed by at least one processor in that system to implement the processes of the embodiments of the multilingual interaction method above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
Accordingly, in one embodiment a computer storage medium and a computer-readable storage medium are provided, on which a computer program is stored; when executed by a processor, the program implements the multilingual interaction method for an active outbound intelligent voice robot of any of the above embodiments.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
It should be noted that the terms "first", "second", and "third" in the embodiments of the present application merely distinguish similar objects and do not imply a specific ordering; where permitted, "first", "second", and "third" may be interchanged, so that the embodiments described herein can be implemented in an order other than that illustrated or described.
The terms "comprising" and "having" and any variations thereof in the embodiments of the present application are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, product, or device that comprises a list of steps or modules is not limited to the listed steps or modules but may alternatively include other steps or modules not listed or inherent to such process, method, product, or device.
The above embodiments express only several implementations of the present application, and while their description is relatively specific and detailed, they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these fall within its scope of protection. The scope of protection of this patent is therefore subject to the appended claims.

Claims (10)

1. A multilingual interaction method for an active outbound intelligent voice robot, characterized by comprising the following steps:
S10, when a user enters a multilingual setting scene, detecting voice data sent by the user;
S20, sending the voice data to each language recognition engine to obtain the recognition text returned by each language recognition engine;
S30, when none of the recognition texts is empty, detecting whether each recognition text carries preset weight words, and determining the text carrying the weight words as the valid text;
and S40, inputting the valid text into the NLU system, performing intention recognition on the valid text in the NLU system, and triggering an interactive action according to the intention recognition result.
2. The multilingual interaction method for an active outbound intelligent voice robot of claim 1, wherein the language recognition engines comprise an English language recognition engine and a Chinese language recognition engine.
3. The multilingual interaction method for an active outbound intelligent voice robot of claim 2, wherein sending the voice data to each language recognition engine to obtain the recognition text returned by each language recognition engine comprises:
sending the voice data to the English language recognition engine to obtain the English recognition text returned by the English language recognition engine;
and sending the voice data to the Chinese language recognition engine to obtain the Chinese recognition text returned by the Chinese language recognition engine.
4. The multilingual interaction method for an active outbound intelligent voice robot of claim 1, further comprising, after detecting whether each recognition text carries a preset weight word:
if every recognition text carries a preset weight word, or none does, calling the corresponding language model for each recognition text, using each language model to obtain a text score for its recognition text, determining the comprehensive score of each recognition text from its text score, the hesitation time coefficient, and the adjustment coefficient, and determining the recognition text with the highest comprehensive score as the valid text.
5. The multilingual interaction method for an active outbound intelligent voice robot of claim 1, wherein after sending the voice data to each language recognition engine and obtaining the recognition text returned by each language recognition engine, the method further comprises:
if all the recognition texts are empty, recording the language in use as the default language, and triggering the interactive action in the default language.
6. The multilingual interaction method for an active outbound intelligent voice robot of claim 1, wherein after sending the voice data to each language recognition engine and obtaining the recognition text returned by each language recognition engine, the method further comprises:
if exactly one non-empty text exists among the recognition texts, determining that non-empty text as the valid text.
7. The multilingual interaction method for an active outbound intelligent voice robot of claim 1, wherein inputting the valid text into the NLU system and performing intention recognition on the valid text in the NLU system comprises:
inputting the valid text into the NLU system, having the NLU system identify the language corresponding to the valid text to obtain the current language, and performing intention recognition on the valid text with the language algorithm model corresponding to the current language.
8. A multilingual interaction apparatus for an active outbound intelligent voice robot, characterized by comprising:
a first detection module, configured to detect voice data sent by a user when the user enters a multilingual setting scene;
a sending module, configured to send the voice data to each language recognition engine to obtain the recognition text returned by each language recognition engine;
a second detection module, configured to detect, when none of the recognition texts is empty, whether each recognition text carries preset weight words, and to determine the text carrying the weight words as the valid text;
and an input module, configured to input the valid text into the NLU system, perform intention recognition on the valid text in the NLU system, and trigger an interactive action according to the intention recognition result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202010316400.9A 2020-04-21 2020-04-21 Active outbound intelligent voice robot multilingual interaction method and device Active CN111627432B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010316400.9A CN111627432B (en) 2020-04-21 2020-04-21 Active outbound intelligent voice robot multilingual interaction method and device
PCT/CN2021/071368 WO2021212929A1 (en) 2020-04-21 2021-01-13 Multilingual interaction method and apparatus for active outbound intelligent speech robot

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010316400.9A CN111627432B (en) 2020-04-21 2020-04-21 Active outbound intelligent voice robot multilingual interaction method and device

Publications (2)

Publication Number Publication Date
CN111627432A (en) 2020-09-04
CN111627432B CN111627432B (en) 2023-10-20

Family

ID=72258977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010316400.9A Active CN111627432B (en) 2020-04-21 2020-04-21 Active outbound intelligent voice robot multilingual interaction method and device

Country Status (2)

Country Link
CN (1) CN111627432B (en)
WO (1) WO2021212929A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562640A (en) * 2020-12-01 2021-03-26 北京声智科技有限公司 Multi-language speech recognition method, device, system and computer readable storage medium
WO2021212929A1 (en) * 2020-04-21 2021-10-28 升智信息科技(南京)有限公司 Multilingual interaction method and apparatus for active outbound intelligent speech robot
CN113571064A (en) * 2021-07-07 2021-10-29 肇庆小鹏新能源投资有限公司 Natural language understanding method and device, vehicle and medium
CN114464179A (en) * 2022-01-28 2022-05-10 达闼机器人股份有限公司 Voice interaction method, system, device, equipment and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114918950A (en) * 2022-01-12 2022-08-19 国网吉林省电力有限公司延边供电公司 Intelligent robot for power supply of Xinji Jianfrontier
CN115134466A (en) * 2022-06-07 2022-09-30 马上消费金融股份有限公司 Intention recognition method and device and electronic equipment
CN116343786A (en) * 2023-03-07 2023-06-27 南方电网人工智能科技有限公司 Customer service voice analysis method, system, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012037790A (en) * 2010-08-10 2012-02-23 Toshiba Corp Voice interaction device
CN107818781A (en) * 2017-09-11 2018-03-20 远光软件股份有限公司 Intelligent interactive method, equipment and storage medium
CN109104534A (en) * 2018-10-22 2018-12-28 北京智合大方科技有限公司 A kind of system for improving outgoing call robot and being intended to Detection accuracy, recall rate
CN109522564A (en) * 2018-12-17 2019-03-26 北京百度网讯科技有限公司 Voice translation method and device
CN109712607A (en) * 2018-12-30 2019-05-03 联想(北京)有限公司 A kind of processing method, device and electronic equipment
US20200118544A1 (en) * 2019-07-17 2020-04-16 Lg Electronics Inc. Intelligent voice recognizing method, apparatus, and intelligent computing device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170011735A1 (en) * 2015-07-10 2017-01-12 Electronics And Telecommunications Research Institute Speech recognition system and method
CN105957516B (en) * 2016-06-16 2019-03-08 百度在线网络技术(北京)有限公司 More voice identification model switching method and device
US20180137109A1 (en) * 2016-11-11 2018-05-17 The Charles Stark Draper Laboratory, Inc. Methodology for automatic multilingual speech recognition
CN108335692B (en) * 2018-03-21 2021-03-05 上海智蕙林医疗科技有限公司 Voice switching method, server and system
CN109065020B (en) * 2018-07-28 2020-11-20 重庆柚瓣家科技有限公司 Multi-language category recognition library matching method and system
CN111627432B (en) * 2020-04-21 2023-10-20 升智信息科技(南京)有限公司 Active outbound intelligent voice robot multilingual interaction method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012037790A (en) * 2010-08-10 2012-02-23 Toshiba Corp Voice interaction device
CN107818781A (en) * 2017-09-11 2018-03-20 远光软件股份有限公司 Intelligent interactive method, equipment and storage medium
CN109104534A (en) * 2018-10-22 2018-12-28 北京智合大方科技有限公司 A kind of system for improving outgoing call robot and being intended to Detection accuracy, recall rate
CN109522564A (en) * 2018-12-17 2019-03-26 北京百度网讯科技有限公司 Voice translation method and device
CN109712607A (en) * 2018-12-30 2019-05-03 联想(北京)有限公司 A kind of processing method, device and electronic equipment
US20200118544A1 (en) * 2019-07-17 2020-04-16 Lg Electronics Inc. Intelligent voice recognizing method, apparatus, and intelligent computing device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021212929A1 (en) * 2020-04-21 2021-10-28 升智信息科技(南京)有限公司 Multilingual interaction method and apparatus for active outbound intelligent speech robot
CN112562640A (en) * 2020-12-01 2021-03-26 北京声智科技有限公司 Multi-language speech recognition method, device, system and computer readable storage medium
CN112562640B (en) * 2020-12-01 2024-04-12 北京声智科技有限公司 Multilingual speech recognition method, device, system, and computer-readable storage medium
CN113571064A (en) * 2021-07-07 2021-10-29 肇庆小鹏新能源投资有限公司 Natural language understanding method and device, vehicle and medium
CN113571064B (en) * 2021-07-07 2024-01-30 肇庆小鹏新能源投资有限公司 Natural language understanding method and device, vehicle and medium
CN114464179A (en) * 2022-01-28 2022-05-10 达闼机器人股份有限公司 Voice interaction method, system, device, equipment and storage medium
CN114464179B (en) * 2022-01-28 2024-03-19 达闼机器人股份有限公司 Voice interaction method, system, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111627432B (en) 2023-10-20
WO2021212929A1 (en) 2021-10-28

Similar Documents

Publication Publication Date Title
CN111627432B (en) Active outbound intelligent voice robot multilingual interaction method and device
CN110998717B (en) Automatically determining a language for speech recognition of a spoken utterance received through an automated assistant interface
CN110661927B (en) Voice interaction method and device, computer equipment and storage medium
KR102348904B1 (en) Method for providing chatting service with chatbot assisted by human counselor
US7529667B1 (en) Automated dialog system and method
US7127395B1 (en) Method and system for predicting understanding errors in a task classification system
US8862478B2 (en) Speech translation system, first terminal apparatus, speech recognition server, translation server, and speech synthesis server
US8144838B2 (en) Automated task classification system
CN112037799B (en) Voice interrupt processing method and device, computer equipment and storage medium
CN112262430A (en) Automatically determining language for speech recognition of a spoken utterance received via an automated assistant interface
KR20150085145A (en) System for translating a language based on user's reaction and method thereof
CN111429899A (en) Speech response processing method, device, equipment and medium based on artificial intelligence
KR102326853B1 (en) User adaptive conversation apparatus based on monitoring emotion and ethic and method for thereof
CN111159364A (en) Dialogue system, dialogue device, dialogue method, and storage medium
CN111986651A (en) Man-machine interaction method and device and intelligent interaction terminal
CN113643684A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112087726A (en) Method and system for identifying polyphonic ringtone, electronic equipment and storage medium
JP4000828B2 (en) Information system, electronic equipment, program
CN113345437B (en) Voice interruption method and device
JP2005520194A (en) Generating text messages
CN111667829A (en) Information processing method and device, and storage medium
CN115662430B (en) Input data analysis method, device, electronic equipment and storage medium
KR102268376B1 (en) Apparatus and method for providing multilingual conversation service
CN116306660A (en) Man-machine conversation breaking method, device, equipment and storage medium
CN116895275A (en) Dialogue system and control method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant