CN114283820A - Multi-character voice interaction method, electronic equipment and storage medium - Google Patents

Multi-character voice interaction method, electronic equipment and storage medium

Info

Publication number
CN114283820A
CN114283820A
Authority
CN
China
Prior art keywords
voice
information
recognition server
role
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111649321.0A
Other languages
Chinese (zh)
Inventor
宋泽
甘津瑞
陈铭竑
邓建凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202111649321.0A priority Critical patent/CN114283820A/en
Publication of CN114283820A publication Critical patent/CN114283820A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a multi-role voice interaction method, an electronic device and a storage medium. The method comprises the following steps: audio data is acquired through the local user side and sent to the voice recognition server of the remote side. The voice recognition server recognizes the text data to be recognized through a semantic recognition model to obtain semantic recognition result information, then recognizes the semantic recognition result information through a dialogue model to obtain dialogue result information and set role information. The voice recognition server synthesizes a reply voice according to the set role information and the dialogue information and sends it to the local user side, which plays the reply voice. The method supports multi-role voice interaction and recommends a suitable role to converse with the user according to the user's emotional state. It occupies few resources, offers high reliability and stability, switches roles automatically, greatly improves the interest of voice interaction, and is more robust than the multi-role interaction schemes currently on the market.

Description

Multi-character voice interaction method, electronic equipment and storage medium
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a multi-role voice interaction method, an electronic device and a storage medium.
Background
Technologies such as Automatic Speech Recognition (ASR), Natural Language Processing (NLP), speech synthesis (Text To Speech, TTS) and speech emotion recognition (ASER) are already available on the market and provide the basic capabilities of voice interaction.
Speech recognition mainly converts the speech uttered by a person into text information that a computer can read, and it has two working modes: a recognition mode and a command mode. Speech recognition programs can be implemented differently depending on the mode. In the recognition mode, the engine system directly provides a word library and a library of recognition templates in the background; the system does not need to further change the recognition grammar and only needs to rewrite code according to the main program source code provided by the recognition engine. The command mode is relatively harder to implement: the dictionary must be written by the programmer, compiled, and finally processed and corrected against the phonetic dictionary. The biggest difference between the recognition mode and the command mode lies in whether the programmer checks and modifies the code according to the dictionary content.
Natural Language Processing is an important means of realizing natural-language communication between humans and machines. It includes two parts, Natural Language Understanding (NLU) and Natural Language Generation (NLG), which enable a computer to understand the meaning of natural-language text and to express a given intention or thought in natural-language text. Natural language understanding builds a computer model that is grounded in linguistics and draws on disciplines such as logic, psychology and computer science; it attempts to answer questions such as: how is language organized to convey information, and how does a person in turn obtain information from a sequence of language symbols? In other words, it obtains a semantic representation of natural language through the analysis of syntax, semantics and pragmatics, and understands the intention expressed by the natural-language text. Natural language generation is a branch of artificial intelligence and computational linguistics; the corresponding language generation system is a computer model based on language information processing whose working process is the opposite of natural language analysis, namely generating text from an abstract concept level by selecting and executing certain semantic and grammatical rules.
Speech synthesis is a technique that can convert arbitrary text into the corresponding speech. Conventional speech synthesis systems typically include two modules, a front end and a back end. The front-end module mainly analyzes the input text and extracts the linguistic information needed by the back-end module, and generally comprises sub-modules such as text normalization, word segmentation, part-of-speech prediction, polyphone disambiguation and prosody prediction. The back-end module generates a speech waveform from the front-end analysis result, generally either by speech synthesis based on statistical parametric modeling (parametric synthesis) or by speech synthesis based on unit selection and waveform concatenation (concatenative synthesis). Parametric synthesis performs context-dependent modeling of acoustic features and duration information in the training stage; in the synthesis stage it predicts acoustic feature parameters with a duration model and an acoustic model, post-processes these parameters, and finally recovers the speech waveform through a vocoder. This method gives a relatively stable synthesis result when the speech corpus is small; its drawbacks are the "over-smoothed" acoustic feature parameters caused by statistical modeling and the damage the vocoder does to voice quality.
For concatenative synthesis, the training stage is basically the same as for parametric synthesis; in the synthesis stage, model-computed costs guide unit selection, a dynamic programming algorithm selects the optimal unit sequence, and the selected units undergo energy normalization and waveform concatenation. Concatenative synthesis uses real speech segments directly and can preserve voice quality to the greatest extent; its disadvantages are that it requires a large speech corpus and cannot guarantee the synthesis quality of out-of-domain text. Consequently, the front-end module requires a strong linguistic background and expert support in specific fields; the parametric back end requires some knowledge of the speech production mechanism, and the information loss of traditional parametric modeling limits further improvement of the expressiveness of the synthesized speech; and the concatenative back end places high demands on the speech database and needs manual intervention to craft many selection rules and parameters. All of this has contributed to the emergence of end-to-end speech synthesis. An end-to-end synthesis system takes text or phonetic (Zhuyin) characters as direct input and outputs the audio waveform directly. End-to-end systems reduce the need for linguistic knowledge, can be conveniently replicated across languages, and make it possible to build synthesis systems for dozens of languages or more in batches. Moreover, end-to-end speech synthesis systems exhibit powerful and rich pronunciation styles and prosodic expression.
Common emotion recognition methods fall into two main categories: recognition based on non-physiological signals and recognition based on physiological signals. Emotion recognition based on non-physiological signals mainly covers the recognition of facial expressions and of speech tone. Facial expression recognition identifies different emotions according to the correspondence between expressions and emotions: in a specific emotional state people produce specific facial muscle movements and expression patterns; for example, when happy the corners of the mouth turn up and wrinkles form around the eyes, while when angry people frown and widen their eyes.
At present, facial expression recognition is mostly implemented with image recognition methods. Speech tone recognition relies on the different ways people speak in different emotional states; for example, the tone of speech is cheerful when the mood is happy and dull when the mood is irritable. The advantages of non-physiological-signal methods are that they are simple to operate and require no special equipment. Their disadvantage is that the reliability of emotion recognition cannot be guaranteed, because people can hide their true emotions by disguising facial expressions and voice tone, and such disguises are often not easy to detect. In addition, methods based on non-physiological signals are often difficult to apply to disabled people suffering from certain specific diseases.
Emotion recognition based on physiological signals mainly comprises emotion recognition based on the autonomic nervous system and emotion recognition based on the central nervous system. Recognition based on the autonomic nervous system identifies the corresponding emotional state by measuring physiological signals such as heart rate, skin impedance and respiration. Although these autonomic-nervous-system signals cannot be disguised and yield real data, they are not well suited for practical use because of low accuracy and the lack of reasonable evaluation criteria. Recognition based on the central nervous system identifies the corresponding emotion by analyzing the different signals emitted by the brain in different emotional states. Compared with other physiological-signal methods, this approach is not easily disguised and has a high recognition rate, so it is increasingly applied in emotion recognition research.
Traditional voice interaction realizes single-role voice interaction through individual technologies such as speech recognition, semantic processing and speech synthesis, and converses in a single timbre. It therefore cannot give the user an interactive experience with role switching, nor can it perceive the user's emotion, so the user cannot interact with a matching role, and the interaction between machine and human lacks emotional color and interest.
The inventor finds that, in current voice interaction systems, single-role voice interaction satisfies most application scenarios, so customers have had little demand for multi-role voice interaction and multi-role systems have received insufficient attention. In addition, early emotion recognition technology could not recognize the user's emotion efficiently and stably. With the emergence of recognition methods based on the central nervous system, which identify the corresponding emotion by analyzing the different signals emitted by the brain in different emotional states, high recognition rates and high reliability become available, and multi-role voice interaction can be widely applied to scenarios such as novel broadcasting and children's toys.
Disclosure of Invention
The embodiments of the invention aim to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a multi-role voice interaction method, which can be implemented in a system comprising a local user side and a remote side, with a voice recognition server arranged at the remote end. The interaction method comprises the following steps:
The interactive audio data acquired by the local user side is sent to the voice recognition server of the remote side for processing, and a semantic recognition result of the interactive audio data is obtained from the voice recognition server.
The voice recognition server recognizes the semantic recognition result through a dialogue model and acquires a dialogue result, and then acquires the set role information according to the dialogue result.
The voice recognition server synthesizes a reply voice according to the set role information and the dialogue information and sends it to the local user side, which plays the reply voice.
In a second aspect, an embodiment of the present invention provides an electronic device, which includes: the system comprises at least one processor and a memory which is in communication connection with the at least one processor, wherein the memory stores instructions which can be executed by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute any multi-character voice interaction method.
In a third aspect, an embodiment of the present invention provides a storage medium, where one or more programs including execution instructions are stored, where the execution instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any one of the above-described interaction methods for multi-character voice according to the present invention.
In a fourth aspect, the embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a storage medium, and the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is caused to execute any one of the above interaction methods of multi-character voice.
In the embodiment of the invention, audio data is acquired through the local user side and sent to the voice recognition server of the remote side. The voice recognition server obtains the corresponding text data to be recognized from the audio data and recognizes it with a semantic recognition model to obtain semantic recognition result information. The voice recognition server then recognizes the semantic recognition result information with the dialogue model and acquires the dialogue result information and the set role information. Finally, the voice recognition server synthesizes a reply voice according to the set role information and the dialogue information and sends it to the local user side, which plays the reply voice.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart of an embodiment of a multi-character voice interaction method of the present invention;
FIG. 2 is a flow chart of another embodiment of a multi-character voice interaction method of the present invention;
fig. 3 is a schematic structural diagram of an embodiment of an electronic device according to the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used in this disclosure, "module," "device," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, an element may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, may be an element. One or more elements may be in a process and/or thread of execution and an element may be localized on one computer and/or distributed between two or more computers and may be operated by various computer-readable media. The elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., from a data packet interacting with another element in a local system, distributed system, and/or across a network in the internet with other systems by way of the signal.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiment of the invention provides an interaction method of multi-role voice, which can be applied to electronic equipment. The electronic device may be a computer, a server, or other electronic products, and the invention is not limited thereto.
Referring to fig. 1, an interaction method of multi-character voice according to an embodiment of the present invention is shown.
The multi-role voice interaction method can be realized in a system comprising a local user side and a remote side, with a voice recognition server arranged at the remote end. As shown in fig. 1, the multi-role voice interaction method includes:
step S101, audio data is acquired.
In this step, the local user side acquires audio data. The audio data includes question voice audio data or answer voice audio data in the voice conversation.
And step S102, obtaining semantic recognition result information.
In this step, the local user side sends the audio data to the voice recognition server of the remote end. The voice recognition server obtains the corresponding text data to be recognized from the audio data and recognizes it through a semantic recognition model to obtain semantic recognition result information.
In step S103, dialog result information is acquired.
In this step, the speech recognition server recognizes the semantic recognition result information through the dialogue model and acquires the dialogue result information, from which the set role information is then obtained.
And step S104, the voice recognition server acquires the set role information according to the conversation result information.
In step S105, a reply voice is synthesized.
In this step, the voice recognition server synthesizes a reply voice according to the set role information and the dialogue information.
Step S106, the reply voice is played at the local user terminal.
In this step, the voice recognition server sends the reply voice to the local user side, and the local user side plays the reply voice.
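To make the flow of steps S101 to S106 concrete, the following minimal client-side sketch strings the steps together. It is an illustration only: all helper functions are stubs, and their names, signatures and return values are assumptions rather than part of the disclosed implementation.

# Minimal client-side sketch of steps S101-S106 (illustrative; all helpers are stubs).

def capture_audio() -> bytes:
    """S101: acquire 16 kHz, 16-bit, mono PCM audio from the microphone (stubbed)."""
    return b"\x00\x00" * 16000  # one second of silence as placeholder PCM

def recognize_and_understand(pcm: bytes) -> dict:
    """S102: send audio to the remote speech/semantic recognition server (stubbed)."""
    return {"text": "what is the weather in Beijing today",
            "slots": {"city": "Beijing", "date": "today", "intent": "query weather"}}

def run_dialogue(semantic_result: dict) -> tuple[str, str]:
    """S103-S104: dialogue model produces a reply and the recommended role (stubbed)."""
    return "Beijing is cloudy today, -1 to 9 degrees.", "voice_female_adult"

def synthesize(reply_text: str, voice_id: str) -> bytes:
    """S105: remote TTS returns 16 kHz, 16-bit PCM for the given role (stubbed)."""
    return b"\x00\x00" * 16000

def play(pcm: bytes) -> None:
    """S106: hand the PCM stream to the local playback module (stubbed)."""
    print(f"playing {len(pcm)} bytes of PCM audio")

if __name__ == "__main__":
    audio = capture_audio()                                # S101
    semantic_result = recognize_and_understand(audio)      # S102
    reply_text, voice_id = run_dialogue(semantic_result)   # S103-S104
    reply_pcm = synthesize(reply_text, voice_id)           # S105
    play(reply_pcm)                                        # S106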
In some optional embodiments, an emotion recognition server is provided at the remote end. Step S104 further includes: sending the dialogue result information to the emotion recognition server, which acquires emotion result information through a local model or a recognition algorithm.
In some optional embodiments, the emotional result information comprises: gender, age, mood, and tone information.
In some optional embodiments, step S105 further includes: and the remote end acquires the current role information of the local user end. And judging whether the current role information is the set role information or not, and if not, setting the set role information as the current role information.
In some optional embodiments, step S103 includes: the speech recognition server recognizes the semantic recognition result information through a dialogue model based on a natural language algorithm model.
In some optional embodiments, the semantic recognition result information includes: the user's spoken content, the user's task field, the intent field of the user task, and the reply phrase field of the dialogue service.
In some optional embodiments, step S106 includes: the voice recognition service sends the reply voice to the local user side through the HTTP protocol, returning 16 kHz, 16-bit PCM data.
In some optional embodiments, step S101 includes: the local user side is provided with equipment comprising an intelligent mobile terminal, and audio data are collected through the intelligent mobile terminal.
In some alternative embodiments, the audio data is PCM (pulse-code modulation) data in 16 kHz, 16-bit, single-channel format.
According to the invention, emotion recognition and gender and age recognition technologies are adopted to accurately identify user information, and the voice dialogue system recommends a suitable role to interact emotionally with the user according to that information. For example, if the voice system recognizes that the user is a middle-aged male, it can recommend an intellectual female voice to communicate with him; if the system detects that the user's mood is low, it can recommend a cheerful song to relieve the depressed mood.
The method mainly collects the user's voice with a microphone, sends the voice data to a remote emotion recognition service over the network, lets the remote service analyze the voice, and then returns the analyzed data (including gender, age, mood and recommended timbre information) to the terminal; when the terminal receives the role information, it switches roles automatically and the application then uses that role's voice to communicate with the user. The main innovation is that the persona can be switched automatically according to the current user's emotion information; secondly, compared with an offline implementation, using the online voice dialogue service and online TTS service supports more personas for voice communication and richer content, including weather queries, music, stories, calendars and the like.
Another embodiment of the present invention provides an interactive method of multi-character voice, and the embodiment includes two scenarios. Referring to FIG. 2:
scenario 1 role switching:
the method comprises the following steps: audio is input.
Step two: and the audio acquisition module acquires audio.
The audio is acquired in real time and streamed into the recognition engine as 16 kHz, 16-bit, single-channel PCM data.
PCM (pulse-code modulation) is a digital representation of an analog signal sampled at a fixed sampling frequency.
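A minimal capture sketch, assuming the third-party sounddevice package (the patent does not specify a capture library, and feed_to_engine is a hypothetical stand-in for the recognition engine's streaming interface):

# Sketch: capture 16 kHz, 16-bit, single-channel PCM and hand fixed-size
# chunks to the recognition engine. Assumes the `sounddevice` package;
# `feed_to_engine` is a hypothetical callback, not a documented API.
import sounddevice as sd

SAMPLE_RATE = 16000   # 16 kHz
CHANNELS = 1          # mono
CHUNK_FRAMES = 1600   # 100 ms of audio per chunk

def feed_to_engine(pcm_chunk: bytes) -> None:
    print(f"streaming {len(pcm_chunk)} bytes to the recognition engine")

def capture_stream(seconds: float = 5.0) -> None:
    with sd.RawInputStream(samplerate=SAMPLE_RATE, channels=CHANNELS,
                           dtype="int16", blocksize=CHUNK_FRAMES) as stream:
        for _ in range(int(seconds * SAMPLE_RATE / CHUNK_FRAMES)):
            chunk, _overflowed = stream.read(CHUNK_FRAMES)
            feed_to_engine(bytes(chunk))

if __name__ == "__main__":
    capture_stream()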
Step three: the audio is fed into an emotion recognition kernel.
The client sends the user's voice to the emotion recognition service over the network, and the remote service returns the result. The output emotion result information mainly comprises sex, age, emotion and timbre information. The recognition algorithm and the network models are maintained by the company's R&D staff and are confidential.
Step four: perform role analysis on the emotion recognition result.
After the streaming audio is sent to the speech emotion recognition service through a websocket protocol, the server returns data describing the user's emotion at that moment, and the user side parses it from the corresponding fields (a minimal parsing sketch follows the list), for example:
sex: female indicates a woman, male indicates a man;
age: child, adult, elder;
emotion: angry, happy, sad, neutral;
tone color: voiceId (role ID);
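A minimal parsing sketch, assuming the service returns these fields as a JSON object (the exact field names and wire format are assumptions, not a documented API):

# Sketch: parse the emotion-recognition result returned by the remote service.
# Field names (sex, age, emotion, voiceId) mirror the list above but the
# JSON layout is an assumption about the wire format.
import json

def parse_emotion_result(raw: str) -> dict:
    data = json.loads(raw)
    return {
        "sex": data.get("sex", "unknown"),         # "female" or "male"
        "age": data.get("age", "unknown"),         # "child", "adult", "elder"
        "emotion": data.get("emotion", "neutral"), # "angry", "happy", "sad", "neutral"
        "voice_id": data.get("voiceId", ""),       # recommended role/timbre ID
    }

if __name__ == "__main__":
    sample = '{"sex": "male", "age": "adult", "emotion": "sad", "voiceId": "role_042"}'
    print(parse_emotion_result(sample))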
step five: judging whether role switching is carried out, and if the role switching is required, setting role information; otherwise, the process is finished.
The user starts the application, which detects a middle-aged male in a cheerful mood; the application can then recommend a voice with the "Dingling" timbre to communicate with him and sets the application's role to Dingling. If the user then suddenly receives bad news, such as a major accident or a falling stock price, his mood sinks and his speaking voice changes; when he talks to the application again, the system detects the change in the voice stream and that his emotion is low, so the application automatically switches to a suitable voice (for example a deep voice like Guo Degang's) to communicate with the user and modifies the current role information (a minimal sketch of this switching decision follows the field list below). The role information mainly comprises sex, age, emotion and timbre, as detailed below:
Sex: female indicates a woman, male indicates a man;
age: child, adult, elder;
emotion: angry, happy, sad, neutral;
tone color: voiceId (speaker ID);
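A minimal sketch of the switching decision described in step five. The recommendation table and role IDs are invented examples; only the overall switch-if-different logic follows the text above.

# Sketch of the role-switching decision: recommend a role from the recognized
# age/emotion and switch only if it differs from the current role.
# The RECOMMENDED_ROLE table and role IDs are illustrative assumptions.

RECOMMENDED_ROLE = {
    # (age, emotion) -> voiceId of the role to use for the reply voice
    ("adult", "happy"): "cheerful_female",
    ("adult", "sad"): "deep_male",
    ("child", "happy"): "playful_child",
    ("child", "sad"): "gentle_female",
}

def choose_role(age: str, emotion: str, default: str = "neutral_female") -> str:
    return RECOMMENDED_ROLE.get((age, emotion), default)

def maybe_switch_role(current_role: str, age: str, emotion: str) -> str:
    """Return the role to use; switch only when the recommendation differs."""
    recommended = choose_role(age, emotion)
    if recommended != current_role:
        print(f"switching role: {current_role} -> {recommended}")
        return recommended
    return current_role  # no switch needed; the process ends here

if __name__ == "__main__":
    role = "cheerful_female"
    role = maybe_switch_role(role, age="adult", emotion="sad")  # -> deep_male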
scenario 2: voice interaction
The method comprises the following steps: audio is input.
Step two: and the audio acquisition module acquires audio.
Step three: the audio is sent to the online recognition service.
The recognition service is not implemented locally but on a remote server; the local side and the remote side communicate via a network protocol. The process mainly collects 16 kHz, 16-bit, single-channel audio and sends the voice stream data to the server through the websocket network protocol; the server processes it and returns the result.
For example, when we say "what is the weather today", the local application collects PCM audio data through the microphone and sends the voice stream to the server; the server receives the PCM, converts the audio into text, and returns the text to the user side through the websocket.
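A minimal streaming sketch, assuming the third-party websockets package and a hypothetical service URL; the actual message protocol of the recognition service is not disclosed here, so the chunking and end-of-stream marker are assumptions.

# Sketch: stream 16 kHz, 16-bit mono PCM to an online recognition service over
# a websocket and read back the recognized text. The URL and the assumption
# that the server replies with plain text are illustrative only.
import asyncio
import websockets  # third-party package

ASR_URL = "wss://asr.example.com/recognize"  # hypothetical endpoint
CHUNK_BYTES = 3200  # 100 ms of 16 kHz, 16-bit mono PCM

async def recognize(pcm: bytes) -> str:
    async with websockets.connect(ASR_URL) as ws:
        for i in range(0, len(pcm), CHUNK_BYTES):
            await ws.send(pcm[i:i + CHUNK_BYTES])  # binary audio frames
        await ws.send(b"")                         # assumed end-of-stream marker
        result = await ws.recv()                   # recognized text from the server
        return result if isinstance(result, str) else result.decode("utf-8")

if __name__ == "__main__":
    silence = b"\x00\x00" * 16000  # one second of silence as a stand-in
    print(asyncio.run(recognize(silence)))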
Step four: and sending the recognition result to an online semantic service.
The recognition result mainly comprises:
the recognized text, i.e., the content of the user's utterance;
the pinyin corresponding to the recognized text;
a confidence field, which can be used to judge whether the text converted from the current voice stream is credible; its range is 0 to 1, and the closer it is to 1, the more accurate the result is considered.
The user side sends the audio to the voice recognition service to convert it into text, then sends the text to the semantic service, which analyzes and processes it with the natural language understanding (NLU) algorithm provided by the developers and finally outputs the corresponding semantic result (semantic slot information).
For example, when the user says "how is the weather in Beijing today", the semantic service analyzes the text and outputs the key semantic slot content (date: today; city: Beijing; target: weather; intention: query weather; and so on), letting the service program understand the user's intention and process it accordingly.
Example (the semantic slot output is shown as an image in the original document; an illustrative reconstruction follows):
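An illustrative reconstruction of such semantic slot output; the field names and nesting are assumptions based on the description above, not the content of the original image.

"semantics": {
    "request": {
        "task": "weather",
        "intentName": "query weather",
        "slots": [
            { "name": "date", "value": "today" },
            { "name": "city", "value": "Beijing" }
        ]
    }
}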
step five: feeding speech results into a conversational service
The dialogue processing process is deployed on a remote server, receives semantic slot (slots) information of semantic services, analyzes the intention of a user, and outputs a corresponding reply language according to a natural language generation algorithm (NLG).
Example of the result, including the user's spoken content (input field), the user's task field (task), the intention field of the user's task (intentName), and the reply text field of the dialogue service (nlg):
"dm":{
"input" how much weather today in Beijing,
"task": weather ",
"intentName": query weather ",
"nlg" Beijing is cloudy all day long today, has air temperature of-1 to 9 ℃, 8 ℃ lower than that of Suzhou city today, has 1 grade of north wind to south wind, has good air quality and cold weather, and keeps warm when going out. The information is broadcasted for you by ink weather. "
}
Step six: obtaining role information and sending conversation result to synthesis service
If the emotion recognition service has returned the user's emotion result, the program recommends the role information, then sends the reply text together with the role's voiceId to the synthesis service, which converts the text into a 16 kHz, 16-bit voice stream that is played by the player.
For example, the user asks "how is the weather in Beijing today", the voice dialogue service returns the reply "the weather in Beijing is clear", and the application sends the reply text and the voiceId information (for example, Lin Chi-ling) to the synthesis service; the service returns an audio stream, and the application then communicates with the user in Lin Chi-ling's voice.
The reply returned by the dialogue service is thus synthesized and then played, completing one round of voice interaction.
Speech synthesis mainly converts a piece of text into a piece of audio in a given role's voice. Because synthesizing each role locally consumes large resources and therefore limits the number of roles, online speech synthesis is used, with the resources and programs deployed on the remote service.
The content to be synthesized is the reply text returned by the voice dialogue service in response to the user's utterance.
The synthesis process comprises the following steps:
the application communicates with the service through an http protocol, a reply language of the voice conversation service and the voiceId of the role are sent to the remote server, then the remote server returns 16K 16bit PCM data through the http protocol in a streaming mode, and at the moment, the user end side sends the data to the playing module for playing.
Step seven: and broadcasting the synthesized audio.
Audio synthesis in the multi-role voice interaction method can also be done offline: offline synthesis is fast, but it consumes a large amount of resources and supports only a few offline roles, and offline voice interaction occupies more CPU and memory.
The invention has the following beneficial effects: it supports multi-role voice interaction and recommends a suitable role to converse with the user according to the user's emotional state; it occupies few resources, offers high reliability and stability, switches roles automatically, greatly improves the interest of voice interaction, and is more robust than the multi-role interaction schemes currently on the market.
It should be noted that for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In some embodiments, the present invention provides a non-transitory computer readable storage medium, in which one or more programs including executable instructions are stored, and the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any of the multi-character voice interaction methods described above.
In some embodiments, the present invention further provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any one of the above-mentioned multi-character voice interaction methods.
In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a multi-character voice interaction method.
Fig. 3 is a schematic hardware structure diagram of an electronic device for performing an interaction method of multi-character voice according to another embodiment of the present application, and as shown in fig. 3, the electronic device includes:
one or more processors 310 and a memory 320, one processor 310 being illustrated in fig. 3.
The apparatus for performing the interactive method of multi-character voice may further include: an input device 330 and an output device 430.
The processor 310, the memory 320, the input device 330, and the output device 430 may be connected by a bus or other means, such as the bus connection in fig. 3.
The memory 320 is a non-volatile computer-readable storage medium and can be used for storing non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the interaction method of multi-character voice in the embodiment of the present application. The processor 310 executes various functional applications of the server and data processing by running the non-volatile software programs, instructions and modules stored in the memory 320, that is, implements the multi-role voice interaction method of the above-described method embodiment.
The memory 320 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the interactive apparatus for multi-character voice, and the like. Further, the memory 320 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 320 optionally includes memory located remotely from processor 310, which may be connected to the multi-character voice interaction device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 330 may receive input numeric or character information and generate signals related to user settings and function control of the multi-character voice interactive apparatus. The output device 430 may include a display device such as a display screen.
The one or more modules are stored in the memory 320 and, when executed by the one or more processors 310, perform the multi-character voice interaction method of any of the above-described method embodiments.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones, multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, among others.
(3) Portable entertainment devices such devices may display and play multimedia content. The devices comprise audio and video players, handheld game consoles, electronic books, intelligent toys and portable vehicle-mounted navigation devices.
(4) Other onboard electronic devices with data interaction functions, such as a vehicle-mounted device mounted on a vehicle.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the above technical solutions substantially or contributing to the related art may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. An interactive method of multi-character voice can be realized in a system comprising a local user terminal and a remote terminal; a voice recognition server is arranged at the remote end; the interaction method comprises the following steps:
sending the interactive audio data acquired by the local user side to a voice recognition server of the remote side for processing, and obtaining a semantic recognition result of the interactive audio data from the voice recognition server;
the voice recognition server recognizes the semantic recognition result through a dialogue model to obtain a dialogue result; the voice recognition server acquires set role information according to the conversation result;
the voice recognition server synthesizes reply voice according to the set role information and the dialogue information; the voice recognition service sends the reply voice to a local user terminal; and the local user terminal plays the reply voice.
2. The interaction method of claim 1, wherein the interaction audio data comprises: question voice audio data or answer voice audio data.
3. The interaction method according to claim 1, wherein the step of sending the interactive audio data acquired by the local user side to the voice recognition server of the remote side for processing, and the step of obtaining the semantic recognition result of the interactive audio data from the voice recognition server comprises:
the voice recognition server acquires corresponding character data to be recognized according to the audio data; and identifying the character data to be identified through a semantic identification model to obtain semantic identification result information.
4. The interaction method according to claim 1, wherein an emotion recognition server is provided at the remote end;
the step of the voice recognition server obtaining the set role information according to the conversation result further comprises the following steps: sending the conversation result information to the emotion recognition server; the emotion recognition server acquires emotion result information through a local model or a recognition algorithm; the emotion result information comprises: gender, age, mood, and tone information.
5. The interaction method according to claim 4, wherein the step of the speech recognition server synthesizing a reply speech according to the set character information and the dialogue information further comprises: the remote end acquires the current role information of the local user end; and judging whether the current role information is set role information or not, and if not, setting the set role information as the current role information.
6. The interaction method according to claim 1, wherein the speech recognition server recognizes the semantic recognition result information through a dialogue model, and the step of obtaining the dialogue result information comprises: the voice recognition server recognizes the semantic recognition result information through a dialogue model based on a natural language algorithm model;
the semantic recognition result information includes: the speaking content of the user; a task field of a user; an intent field for a user task and a reply phrase field for a dialog service.
7. The interaction method according to claim 1, wherein the voice recognition service transmits the reply voice to a local user terminal; the step of playing the reply voice by the local user side comprises the following steps: and the voice recognition service sends the reply voice to the local user side through the PCM data of 16K 16bit returned by the http protocol.
8. The interactive method of claim 1, wherein the step of obtaining audio data at the local user end comprises: the local user side is provided with equipment comprising an intelligent mobile terminal, and audio data are collected through the intelligent mobile terminal;
the audio data is PCM pulse code modulation data with a format of 16K 16bit single channel.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 8.
10. A storage medium having stored thereon a computer program, characterized in that the program, when being executed by a processor, is adapted to carry out the steps of the method of any one of claims 1 to 8.
CN202111649321.0A 2021-12-30 2021-12-30 Multi-character voice interaction method, electronic equipment and storage medium Pending CN114283820A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111649321.0A CN114283820A (en) 2021-12-30 2021-12-30 Multi-character voice interaction method, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111649321.0A CN114283820A (en) 2021-12-30 2021-12-30 Multi-character voice interaction method, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114283820A true CN114283820A (en) 2022-04-05

Family

ID=80878468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111649321.0A Pending CN114283820A (en) 2021-12-30 2021-12-30 Multi-character voice interaction method, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114283820A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117198293A (en) * 2023-11-08 2023-12-08 北京烽火万家科技有限公司 Digital human voice interaction method, device, computer equipment and storage medium
CN117198293B (en) * 2023-11-08 2024-01-26 北京烽火万家科技有限公司 Digital human voice interaction method, device, computer equipment and storage medium
CN117560340A (en) * 2024-01-12 2024-02-13 腾讯科技(深圳)有限公司 Information interaction method, device and storage medium based on simulated roles
CN117560340B (en) * 2024-01-12 2024-04-09 腾讯科技(深圳)有限公司 Information interaction method, device and storage medium based on simulated roles

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination