CN113643691A - Far-field voice message interaction method and system - Google Patents


Info

Publication number
CN113643691A
Authority
CN
China
Prior art keywords: voice, message, text, voice message, user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110937579.4A
Other languages
Chinese (zh)
Inventor
陈明佳
Current Assignee
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202110937579.4A
Publication of CN113643691A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
        • G10L 15/04 Segmentation; Word boundary detection
        • G10L 15/08 Speech classification or search
        • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
        • G10L 15/26 Speech to text systems
        • G10L 15/28 Constructional details of speech recognition systems
        • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
        • G10L 2015/088 Word spotting
        • G10L 2015/223 Execution procedure of a spoken command

Abstract

An embodiment of the invention provides a far-field voice message interaction method comprising the following steps: caching the input user speech and sending it to a service cloud; receiving time alignment information fed back by the service cloud, cutting the cached user speech based on that information, and determining at least a voice message instruction audio segment and a voice message audio segment within the user speech; and triggering the message function with the voice message instruction audio segment, determining the message content from the voice message audio segment, and sending the message content to a specified device for playback. An embodiment of the invention also provides a far-field voice message interaction system for the device side. By combining cloud-side audio alignment, local audio cutting, and call-grade noise reduction, the embodiments achieve far-field voice message interaction that reduces the number of dialogue turns for the person leaving the message and improves their interaction experience.

Description

Far-field voice message interaction method and system
Technical Field
The invention relates to the field of intelligent voice, in particular to a far-field voice message interaction method and system.
Background
In existing voice interaction products, the far-field voice message function is basically realized by first waking the device, then speaking a fixed instruction, and then speaking the content to be left as a message; the voice message function is built on this flow.
In the existing question-and-answer scheme, the basic steps are:
User: wake-up word
Sys: wake-up feedback
User: fixed statement to enter message mode
Sys: feedback that message mode has been entered
User: message content
In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:
This approach is inefficient: the voice message function is only completed after multiple interactions, and if any single step goes wrong, the whole interaction must be restarted.
In schemes that play back the original audio, directly using a far-field recording for remote interaction leads to low volume or poor listening quality.
Disclosure of Invention
The method and the device aim to at least solve the problems in the prior art that far-field voice message interaction takes multiple turns to complete and is therefore inefficient, and that far-field recordings sound poor to the listener, all the more so when the message content is spoken with an accent in a far-field environment.
In a first aspect, an embodiment of the present invention provides a far-field voice message interaction method, applied to a device side, including:
caching the input user voice and sending the user voice to a service cloud;
receiving time alignment information fed back by the service cloud, cutting the cached user voice based on the time alignment information, and at least determining a voice message instruction audio segment and a voice message audio segment in the user voice;
and triggering a message leaving function by utilizing the voice message instruction audio segment, determining message leaving content based on the voice message audio segment, and sending the message leaving content to specified equipment for playing.
In a second aspect, an embodiment of the present invention provides a voice message interaction method, applied to a device side, including:
taking the awakening words and the voice message instructions as awakening statements of the equipment end;
when the input user voice hits the awakening statement, cutting the user voice by taking the hit awakening statement in the user voice as a node, and determining an audio segment after the node as a voice message audio segment;
and sending the voice message audio segment to a specified device for playing.
In a third aspect, an embodiment of the present invention provides a far-field voice message interaction method, applied to a service cloud, including:
the service cloud receives user voice sent by the equipment terminal, and identifies a text corresponding to the user voice, wherein the text at least comprises: a voice message instruction text and a voice message text;
determining a corresponding time point of each character in the text in the user voice, and at least marking the voice message instruction text and the time alignment information of the voice message text and the user voice;
and at least sending the voice message instruction text and the time alignment information of the voice message text and the user voice to the equipment end for assisting the equipment end in cutting the user voice.
In a fourth aspect, an embodiment of the present invention provides a far-field voice message interaction system for a device, including:
the information transmission program module is used for caching the input user voice and sending the user voice to the service cloud end;
the audio cutting program module is used for receiving time alignment information fed back by the service cloud, cutting the cached user voice based on the time alignment information, and at least determining a voice message instruction audio segment and a voice message audio segment in the user voice;
and the message playing program module is used for triggering a message function by utilizing the voice message instruction audio segment, determining message content based on the voice message audio segment, and sending the message content to specified equipment for playing.
In a fifth aspect, an embodiment of the present invention provides a far-field voice message interaction system for a device, including:
the wake-up statement determining program module is used for taking the wake-up words and the voice message leaving instructions as the wake-up statements of the equipment terminal;
the audio cutting program module is used for cutting the user voice by taking the hitting awakening statement in the user voice as a node when the input user voice hits the awakening statement, and determining an audio segment after the node as a voice message audio segment;
and the message playing program module is used for sending the voice message audio segment to the specified equipment for playing.
In a sixth aspect, an embodiment of the present invention provides a far-field voice message interaction system for a service cloud, including:
the voice recognition program module is used for receiving, at the service cloud, the user voice sent by the device side and recognizing the text corresponding to the user voice, wherein the text at least comprises: a voice message instruction text and a voice message text;
the audio alignment program module is used for determining a corresponding time point of each character in the text in the user voice and at least marking the voice message instruction text and the time alignment information of the voice message text and the user voice;
and the information transmission program module is used for at least sending the voice message instruction text and the time alignment information of the voice message text and the user voice to the equipment terminal and assisting the equipment terminal in cutting the user voice.
In a seventh aspect, an embodiment of the present invention provides an electronic device, including: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the far-field voice message interaction method of any embodiment of the present invention.
In an eighth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the far-field voice message interaction method according to any embodiment of the present invention.
The embodiments of the invention have the following beneficial effects: cloud-side recognition and semantic understanding distinguish the instruction part from the voice message part of the recognized content, align the time relation between audio and text, and return the time alignment information to the device side. The device side can then cut the user's speech according to the aligned time information and accurately extract the message content even when the wake-up word and the message content are spoken in one breath. At the same time, the number of interaction turns for a far-field voice message is reduced, preserving the interaction experience of the person leaving the message. To ensure that the voice message audio is pleasant to listen to, the local audio is processed with a voice-call noise reduction algorithm, keeping the tone of far-field voice messages comfortable for the listener.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a flowchart of a far-field voice message interaction method according to an embodiment of the present invention;
fig. 2 is a structural diagram of noise reduction in a far-field voice call of a far-field voice message interaction method according to an embodiment of the present invention;
fig. 3 is a flowchart of a far-field voice message interaction method according to another embodiment of the present invention;
fig. 4 is a flowchart of a far-field voice message interaction method according to another embodiment of the present invention;
fig. 5 is an overall structure diagram of a device side and a service cloud side of a far-field voice message interaction method according to an embodiment of the present invention;
fig. 6 is an audio text alignment diagram of a far-field voice message interaction method according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a far-field voice message interaction system for a device according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a far-field voice message interaction system for a device according to another embodiment of the present invention;
fig. 9 is a schematic structural diagram of a far-field voice message interaction system for a service cloud according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a far-field voice message interaction method according to an embodiment of the present invention, which includes the following steps:
s11: caching the input user voice and sending the user voice to a service cloud;
s12: receiving time alignment information fed back by the service cloud, cutting the cached user voice based on the time alignment information, and at least determining a voice message instruction audio segment and a voice message audio segment in the user voice;
s13: and triggering a message leaving function by utilizing the voice message instruction audio segment, determining message leaving content based on the voice message audio segment, and sending the message leaving content to specified equipment for playing.
In this embodiment, the goal is for the device side to distinguish the voice message instruction from the voice message content in the user's speech. Given the limited processing capability of the device side, the instruction part and the message part are distinguished using the recognition and semantic understanding technology of the service cloud.
For step S11, consider a far-field scenario: the user is heading out to work and, on reaching the door, suddenly remembers to leave a message, saying "Hello Xiaochi, leave a message to Dad: remember to pay the gas fee when coming home today". The sound travels down the corridor to the device side in the living room (for example a smart speaker, a smart TV, or another electronic device).
While the user speaks, the device side, on top of its local cache, streams the audio data in real time to the service cloud for voice recognition and related processing. The service cloud distinguishes the voice message instruction part from the voice message part of the user's utterance and determines the time alignment information between the voice message instruction text, the voice message text, and the user's speech.
Throughout this process, the device side always keeps a rolling buffer of audio whose length, generally about 2 s, matches the length of the wake-up word. After the device wakes up, subsequent audio is continuously appended to the buffered wake-up audio until the user finishes speaking. This audio is fed into the recognition system of the service cloud, which can recognize the wake-up word, the voice message instruction, and the voice message content, and marks the time correspondence between each part of the text and the audio.
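The caching behaviour described here can be sketched roughly as follows (an illustrative Python sketch, not code from the patent; the class name, the 2 s pre-wake window, and the 20 ms frame size are assumptions):

```python
from collections import deque

class AudioCache:
    """Rolling pre-wake buffer plus post-wake accumulation."""

    def __init__(self, pre_wake_seconds=2.0, frame_ms=20):
        # ~2 s ring buffer, roughly the length of the wake-up word
        max_frames = int(pre_wake_seconds * 1000 / frame_ms)
        self.pre_wake = deque(maxlen=max_frames)
        self.post_wake = []   # audio appended after wake-up until speech ends
        self.awake = False

    def push(self, frame):
        if self.awake:
            self.post_wake.append(frame)
        else:
            self.pre_wake.append(frame)   # oldest frames drop off automatically

    def on_wake(self):
        self.awake = True

    def full_utterance(self):
        # Audio sent to the cloud recognizer: cached wake-word audio + later speech
        return list(self.pre_wake) + self.post_wake
```

After `on_wake()`, `full_utterance()` yields the wake-up audio plus everything spoken afterwards, matching the description of what is fed to the cloud recognition system.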
If the user does not speak the whole sentence "Hello Xiaochi, leave a message to Dad: remember to pay the gas fee when coming home today" in one breath, the exchange instead looks like:
The user: wake-up word ("Hello Xiaochi")
The device side: wake-up feedback ("Mm?")
The user: voice message instruction + voice message content (e.g. "Leave a message to Dad: remember to pay the gas fee when coming home today.")
In this case the buffered audio is "Leave a message to Dad: remember to pay the gas fee when coming home today", and the recognition system of the service cloud identifies the voice message instruction + voice message content within it.
For step S12, after the service cloud finishes processing and determines the time alignment information between the voice message instruction text, the voice message text, and the user's speech, it sends this information back to the device. The device side receives the time alignment information fed back by the service cloud and cuts the user's speech accordingly.
For the sentence "Hello Xiaochi, leave a message to Dad: remember to pay the gas fee when coming home today", the time alignment information might be: wake-up word: "Hello Xiaochi", 0 ms to 2500 ms; object: Dad, 3500 ms to 4500 ms; time: today, 4500 ms to 5500 ms; content: remember to pay the gas fee when coming home, 5500 ms to 10000 ms. With this information, the local audio cutting program on the device side can cut the audio corresponding to the message content according to the time information returned by the remote service, thereby determining the voice message instruction audio segment and the voice message audio segment.
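Locally, cutting the cached speech by these millisecond ranges is straightforward; a hedged sketch (the function name, dictionary shape, and 16 kHz sample rate are illustrative assumptions, not taken from the patent):

```python
def cut_by_alignment(pcm, alignment, sample_rate=16000):
    """Cut cached user speech into labelled segments using the time
    alignment information returned by the service cloud.

    `pcm` is the buffered audio as a sample sequence; `alignment` maps a
    label to a (start_ms, end_ms) pair, as in the example above."""
    def ms_to_samples(ms):
        return ms * sample_rate // 1000
    return {label: pcm[ms_to_samples(start):ms_to_samples(end)]
            for label, (start, end) in alignment.items()}
```

For the example above, the slice for the range (5500, 10000) would be the voice message audio segment "remember to pay the gas fee when coming home".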
Similarly, if the user only says "Leave a message to Dad: remember to pay the gas fee when coming home today", the time alignment information fed back would be: object: Dad, 1000 ms to 2000 ms; time: today, 2000 ms to 3000 ms; content: remember to pay the gas fee when coming home, 3000 ms to 7500 ms.
The only difference between the two scenarios is the audio cached at the device side: in the second, the buffer does not span from the wake-up to the message content; only the audio of the voice message instruction and the voice message content needs to be cached. All other processing is the same.
For step S13, the device triggers the message function according to the voice message instruction audio segment, determines the message content "remember to pay the gas fee when coming home", and sends it to the specified device for playback, i.e. plays it on the device of the object "Dad".
As an embodiment, before the buffering the input user speech, the method further comprises:
inputting the collected user voice to a beam forming module for extracting clear user voice in a far-field voice environment;
the user voice is input to an automatic gain module after being processed by the beam forming module and is used for stabilizing the user voice in a far-field environment;
and the processed signal is input to a deep learning post-processing module after being processed by the automatic gain module and is used for reducing noise of user voice.
In this embodiment, to adapt to the far-field environment, a far-field voice-call noise reduction module is added on the device side as shown in fig. 2: the original far-field multi-channel audio of the user's speech passes through a beam forming module, an automatic gain module, and a deep learning post-processing module, finally yielding clear human speech. The beam forming module extracts the clear voice of the target speaker in noisy surroundings; the automatic gain module keeps the processed voice at a relatively stable volume when the target speaker suddenly speaks louder or more quietly, so that the result is stable to listen to; and the deep learning post-processing module handles the noise that the beam forming module cannot suppress, producing a still clearer far-field voice.
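The three-stage chain can be sketched as a simple composition (illustrative only: the beamformer below is a naive average across microphone channels and the post-filter is a pass-through, standing in for real beam steering and a trained deep-learning model):

```python
import numpy as np

def beamform(channels):
    # Placeholder beamformer: delay-and-sum reduced to a plain average;
    # a real module would steer toward the target speaker.
    return np.mean(channels, axis=0)

def auto_gain(signal, target_rms=0.1, eps=1e-8):
    # Keep volume stable when the speaker suddenly gets louder or quieter.
    rms = np.sqrt(np.mean(signal ** 2)) + eps
    return signal * (target_rms / rms)

def post_process(signal):
    # Stand-in for the deep-learning post-filter that removes residual noise.
    return signal

def denoise_far_field(channels):
    # Chain from the text: beam forming -> automatic gain -> post-processing.
    return post_process(auto_gain(beamform(np.asarray(channels, dtype=float))))
```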
As an embodiment, the message content includes: voice message audio segments or synthesized message audio;
and when the message content is the voice message audio segment, directly sending the voice message audio segment to a specified device for playing.
In this embodiment, the message content can be the user's own voice message audio segment, which usually expresses the user's intention most faithfully; since the method can segment this audio accurately, the user experience is improved.
As an embodiment, the method further comprises: receiving an identification text fed back by the service cloud;
when the message content is the synthesized message audio, cutting the recognition text based on the time alignment information to obtain a voice message text segment;
and generating synthesized message audio based on the voice message text segment, and sending the message audio to specified equipment for playing.
This embodiment considers the following scenario: a user who leaves a message may have a strong accent; for example, elderly relatives such as a grandmother or grandfather who migrated when young may speak with a heavy accent. If the listener has no long-standing familiarity with that accent, the message may be unintelligible even when heard clearly. In this case, synthesized message audio is an option: the service cloud hosts voice recognition models for a large number of languages and accents and can recognize the message content in such speech to the greatest possible extent.
Accordingly, if the listener still cannot understand the voice message audio segment even after voice noise reduction has compensated for the far field, the listener can ask the device side to play the synthesized message audio instead, avoiding the situation where the message cannot be understood at all. The method additionally optimizes for the far field, improving the accuracy of the synthesized message audio.
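Producing the synthesized message audio requires first cutting the recognized text by the same alignment information; a hedged sketch (the per-character time structure and function name are assumptions for illustration):

```python
def cut_text_by_alignment(recognized_text, char_times, window):
    """Select the characters whose aligned time spans fall inside the
    voice-message window; the result is the text segment handed to TTS.

    `char_times[i]` is the (start_ms, end_ms) span of character i."""
    start_ms, end_ms = window
    return "".join(ch for ch, (s, e) in zip(recognized_text, char_times)
                   if s >= start_ms and e <= end_ms)
```

The returned voice message text segment would then be fed to a speech synthesizer to produce the synthesized message audio.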
According to this embodiment, cloud recognition and semantic understanding distinguish the instruction part from the voice message part of the recognized content, the time relation between audio and text is aligned, and the time alignment information is returned to the device side. The device side can then cut the user's speech according to the aligned time information and accurately extract the message content even when the wake-up word and the message content are spoken in one breath. At the same time, the number of interaction turns for a far-field voice message is reduced, preserving the interaction experience of the person leaving the message. To ensure that the voice message audio is pleasant to listen to, the local audio is processed with a voice-call noise reduction algorithm, keeping the tone of far-field voice messages comfortable for the listener.
Fig. 3 is a flowchart of a far-field voice message interaction method according to an embodiment of the present invention, which includes the following steps:
s21: taking the awakening words and the voice message instructions as awakening statements of the equipment end;
s22: when the input user voice hits the awakening statement, cutting the user voice by taking the hit awakening statement in the user voice as a node, and determining an audio segment after the node as a voice message audio segment;
s23: and sending the voice message audio segment to a specified device for playing.
In this embodiment, all processing and operations are performed on the device side, since the service cloud may sometimes be unavailable; the overall method is relatively simple and easy to implement.
For step S21, in this wake-up-word-based cutting scheme, the wake-up model on the device side treats the entire wake-up word + voice message instruction as the wake-up statement; for example, "Hello Xiaochi, leave a message to Dad" as a whole serves as the wake-up statement.
In step S22, when the person leaving the message says wake-up word + voice message instruction + voice message content, the wake-up word + voice message instruction part triggers the wake-up and the wake-up time point is computed. The device uses this time information to cut the audio after the wake-up point as the audio segment of the message content; from the user's speech, "remember to pay the gas fee when coming home today" is cut out directly.
For step S23, the cut voice message audio segment is sent to the target listener, "Dad".
According to this embodiment, efficient message interaction is achieved in far-field voice messaging without a service cloud. Because the computing power of the device side is usually limited, this scheme cannot be extended indefinitely, and the device-plus-cloud combination is comparatively smarter in use; but under a poor network or other harsh conditions, this method realizes far-field voice message interaction efficiently using the device side alone.
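The device-only cutting step reduces to slicing off everything after the wake-up point; a minimal sketch (the 16 kHz sample rate and function name are assumptions):

```python
def cut_after_wake(pcm, wake_end_ms, sample_rate=16000):
    """Device-only scheme: the wake-up model reports when the wake-up
    statement (wake word + voice message instruction) ended; everything
    after that node is the voice message audio segment."""
    return pcm[wake_end_ms * sample_rate // 1000:]
```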
Fig. 4 is a flowchart of a far-field voice message interaction method according to an embodiment of the present invention, which includes the following steps:
s31: the service cloud receives user voice sent by the equipment terminal, and identifies a text corresponding to the user voice, wherein the text at least comprises: a voice message instruction text and a voice message text;
s32: determining a corresponding time point of each character in the text in the user voice, and at least marking the voice message instruction text and the time alignment information of the voice message text and the user voice;
s33: and at least sending the voice message instruction text and the time alignment information of the voice message text and the user voice to the equipment end for assisting the equipment end in cutting the user voice.
In this embodiment, the service cloud plays the main role: its recognition system recognizes the instruction text content of the user's speech and aligns the text information with the audio time information, storing the corresponding time relation. The recognized result is passed to an NLP (natural language processing) engine; if what the user said is content to be left as a message, the service cloud returns both the result of the semantic analysis and the audio-to-text time correspondence to the device side.
In step S31, the device side receives the speech "Leave a message to Dad: remember to pay the gas fee when coming home today", performs far-field voice noise reduction, and sends the denoised audio to the service cloud. The service cloud receives the user speech sent by the device side and recognizes the corresponding text; fig. 5 shows the overall structure of the device side and the service cloud. The service cloud feeds the user's speech into its recognition service system, whose voice recognition module determines the text content "Leave a message to Dad: remember to pay the gas fee when coming home today". Within this text it identifies the voice message instruction text "leave a message to Dad" and the voice message text "remember to pay the gas fee when coming home today".
For step S32, the time point corresponding to each character of the text in the user's speech is first determined, giving a time for every character; the characters are then combined into words, yielding a time period for each word and the time alignment information: object: Dad, 1000 ms to 2000 ms; time: today, 2000 ms to 3000 ms; content: remember to pay the gas fee when coming home, 3000 ms to 7500 ms. Concretely, the text and the audio timeline are aligned by a cloud audio-text alignment service, whose scheme is shown in fig. 6: the horizontal axis represents the input audio frames and the vertical axis the candidate text units for each input. A search algorithm finds the path (the dark dots in fig. 6) along which the input audio and the text information best match. Along this path, each text unit has its corresponding audio input (for example, the input corresponding to a is x2), so the audio time information corresponding to the text of the entire message content can be obtained.
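The path search of fig. 6 can be illustrated with a toy monotonic alignment by dynamic programming (a stand-in for the cloud alignment service, not the patented method; `cost[t][x]` scores how poorly audio frame x matches text unit t, and both the cost model and the two allowed moves are simplifying assumptions; it also assumes at least as many frames as text units):

```python
def align_path(cost):
    """Best monotonic path pairing each text unit (rows) with audio
    frames (columns), found by dynamic programming and backtracking."""
    T, X = len(cost), len(cost[0])
    INF = float("inf")
    acc = [[INF] * X for _ in range(T)]
    acc[0][0] = cost[0][0]
    for t in range(T):
        for x in range(X):
            if t == 0 and x == 0:
                continue
            stay = acc[t][x - 1] if x else INF               # same text unit, next frame
            advance = acc[t - 1][x - 1] if t and x else INF  # next text unit
            acc[t][x] = cost[t][x] + min(stay, advance)
    # Backtrack from the end to recover the matching path (the dark dots)
    path, t, x = [], T - 1, X - 1
    while True:
        path.append((t, x))
        if t == 0 and x == 0:
            break
        if t and x and acc[t - 1][x - 1] <= acc[t][x - 1]:
            t, x = t - 1, x - 1
        else:
            x -= 1
    return path[::-1]
```

Reading off which frames each text row spans on the returned path gives the per-word time ranges in the alignment information.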
For step S33, the time alignment information determined in step S32 is transmitted to the device side, and the device side is assisted by the time alignment information in cutting the user' S voice.
As an embodiment, the text further comprises: a wake-up word;
after the determining a corresponding time point of each word in the text in the user speech, the method further includes:
marking time alignment information of the wake-up word, the voice message instruction text, and the voice message text with the user voice;
and sending at least the wake-up word, the voice message instruction text, and the time alignment information of the voice message text and the user voice to the device side.
In this embodiment, considering the user's different usage scenarios, suppose the user says the wake-up word followed by "leave a message to dad, remember to pay the gas fee when he gets home today". The device side receives the voice, performs noise reduction, and sends it to the service cloud. The service cloud receives the user voice sent by the device side, recognizes the corresponding text, and a speech recognition module in the recognition service system determines the text content of the input voice. The recognized text contains the wake-up word, the voice message instruction text "leave a message to dad", and the voice message text "remember to pay the gas fee when he gets home today". First, the time point corresponding to each character is determined; the characters are then combined into words to determine the time period of each word, giving the time alignment information: wake-up word: 0 ms-2500 ms; object: dad, 3500 ms-4500 ms; time: today, 4500 ms-5500 ms; content: remember to pay the gas fee when he gets home, 5500 ms-10000 ms. The time alignment information is sent to the device side and used to assist the device in cutting the user's voice.
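Given spans like those above, the device-side cut reduces to slicing the cached PCM buffer by millisecond offsets. A minimal sketch, assuming 16 kHz 16-bit mono audio (the sample format is an assumption, not stated in the source):

```python
# Hypothetical device-side cut: slice the cached PCM buffer using the
# (start_ms, end_ms) spans returned by the service cloud, keeping only
# the message-content span and dropping the wake-word span.
SAMPLE_RATE = 16000       # assumed: 16 kHz
BYTES_PER_SAMPLE = 2      # assumed: 16-bit mono

def cut_segment(pcm: bytes, start_ms: int, end_ms: int) -> bytes:
    """Return the audio between start_ms and end_ms of the cached buffer."""
    bytes_per_ms = SAMPLE_RATE * BYTES_PER_SAMPLE // 1000  # 32 bytes per ms
    return pcm[start_ms * bytes_per_ms : end_ms * bytes_per_ms]

# e.g. keep only the message-content span 5500 ms - 10000 ms
buf = bytes(10_000 * 32)  # 10 s of silence as a stand-in cached buffer
message_audio = cut_segment(buf, 5500, 10000)
```

The same helper, applied with the instruction span, yields the voice message instruction audio segment that triggers the message function.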
As an implementation manner, before the sending of at least the voice message instruction text and the time alignment information of the voice message text and the user voice to the device side, the method further includes:
performing semantic understanding on the text, and extracting key message information in the text;
and sending the key message information, the voice message instruction text, and the time alignment information of the voice message text and the user voice to the device side.
The key message information at least comprises: a message object, a message time, and message content.
In this embodiment, Fig. 5 shows the overall structure of the device side and the service cloud of the method. Considering that the voice input by the user may or may not contain the wake-up word: if it does not, the service cloud can, after receiving the user voice, input the recognized text into the natural language understanding service, and the semantic understanding engine resolves it into the voice message instruction and the voice message content. The semantic understanding engine extracts key message information such as the message object, time, and message content (for example, object: dad; time: today; content: remember to pay the gas fee when he gets home). The semantically understood information can be matched with the text-audio time alignment information and returned to the device side, and varied voice message trigger instructions are accurately parsed through accurate NLP processing.
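The key-information extraction step can be illustrated with simple patterns. This is only a stand-in for the semantic understanding engine: the English patterns and the fixed time-word list are assumptions, and a real NLU service would use intent and slot models rather than regular expressions.

```python
import re

# Illustrative stand-in for the semantic understanding engine: pull the
# message object, time, and content out of the transcript with a regex.
# The pattern and the time-word list are assumptions, not the real NLU.
PATTERN = re.compile(
    r"leave a message (?:to|for) (?P<object>\w+)[,，]?\s*"
    r"(?P<content>.*?(?P<time>today|tomorrow|tonight).*)",
    re.IGNORECASE,
)

def extract_key_info(text: str):
    """Return {'object', 'time', 'content'} or None if no message intent."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return {"object": m.group("object"),
            "time": m.group("time"),
            "content": m.group("content")}
```

The resulting dictionary mirrors the example in the text (object: dad; time: today; content: remember to pay the gas fee when he gets home today) and can be matched with the time alignment spans before being returned to the device side.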
As an embodiment, the method further comprises:
performing wake-up word filtering on the text, performing semantic understanding on the filtered text, and extracting key message information from the text;
and sending the key message information, the wake-up word, the voice message instruction text, and the time alignment information of the voice message text and the user voice to the device side.
In this embodiment, considering that the voice input by the user may or may not contain the wake-up word, another processing method can be used when it does: before the text of the user's voice is sent to the semantic understanding engine, the text corresponding to the wake-up word is filtered out. This removes the wake-up word from the user's voice, reduces the workload of the natural language understanding service, and improves far-field voice interaction efficiency. The subsequent steps after obtaining the voice message instruction and the voice message content are the same as above and are not repeated here.
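The wake-word filtering just described can be sketched as a prefix strip applied before the transcript reaches the NLU engine. "Hello Xiaochi" is a placeholder wake phrase used for illustration, not the system's actual wake word:

```python
# Hypothetical sketch: strip a wake-word prefix from the transcript so
# the NLU engine only sees the instruction and the message content.
WAKE_WORDS = ("hello xiaochi", "hi xiaochi")  # placeholder wake phrases

def filter_wake_word(text: str) -> str:
    """Remove a leading wake phrase (and trailing comma/space) if present."""
    lowered = text.lower().lstrip()
    for w in WAKE_WORDS:
        if lowered.startswith(w):
            return text.lstrip()[len(w):].lstrip(" ,")
    return text
```

Transcripts without a wake phrase pass through unchanged, which matches the no-wake-word branch handled earlier.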
According to this embodiment, cloud recognition and semantic understanding are used to distinguish the instruction part from the voice message part in the recognized content; the time relationship between the audio and the text is then aligned, and the time alignment information is returned to the device side. The device side can cut the user's voice according to the aligned time information and accurately intercept the message content even when the wake-up word and the message content are spoken in one breath. At the same time, the number of interaction turns for far-field voice messages is reduced, ensuring the interaction experience of the person leaving the message.
Fig. 7 is a schematic structural diagram of a far-field voice message interaction system for a device end according to an embodiment of the present invention, where the system can execute the far-field voice message interaction method according to any of the above embodiments and is configured in a terminal.
The present embodiment provides a far-field voice message interaction system 10 for a device side, which includes: an information transmission program module 11, an audio cutting program module 12, and a message playing program module 13.
The information transmission program module 11 is configured to cache an input user voice and send the user voice to the service cloud; the audio cutting program module 12 is configured to receive time alignment information fed back by the service cloud, cut the cached user voice based on the time alignment information, and at least determine a voice message instruction audio segment and a voice message audio segment in the user voice; the message playing program module 13 is configured to trigger a message function by using the voice message instruction audio segment, determine message content based on the voice message audio segment, and send the message content to a designated device for playing.
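The three program modules can be tied together in a minimal device-side sketch. The 32 bytes/ms figure assumes 16 kHz 16-bit mono audio, and `upload` and `play` are hypothetical callbacks standing in for the cloud link and the target playback device:

```python
# Hypothetical sketch of the device-side system: cache-and-upload
# (information transmission module), cut on alignment feedback (audio
# cutting module), then play (message playing module).
class FarFieldMessenger:
    def __init__(self, upload, play):
        self.upload, self.play = upload, play
        self.cache = b""

    def on_audio(self, pcm: bytes):
        """Information transmission module: cache the voice and send it on."""
        self.cache += pcm
        self.upload(pcm)

    def on_alignment(self, message_span):
        """Audio cutting module: cut the cached voice with the cloud's span."""
        bpms = 32  # assumed 16 kHz * 16-bit mono = 32 bytes per ms
        start_ms, end_ms = message_span
        segment = self.cache[start_ms * bpms : end_ms * bpms]
        self.play(segment)  # message playing module
        return segment

sent, played = [], []
m = FarFieldMessenger(sent.append, played.append)
m.on_audio(bytes(1000 * 32))        # 1 s of cached audio
seg = m.on_alignment((200, 700))    # cloud says message spans 200-700 ms
```

In a real deployment `upload` would stream to the service cloud and `play` would forward the segment to the designated device.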
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the far-field voice message interaction method in any method embodiment;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
caching the input user voice and sending the user voice to a service cloud;
receiving time alignment information fed back by the service cloud, cutting the cached user voice based on the time alignment information, and at least determining a voice message instruction audio segment and a voice message audio segment in the user voice;
and triggering a message leaving function by utilizing the voice message instruction audio segment, determining message leaving content based on the voice message audio segment, and sending the message leaving content to specified equipment for playing.
Fig. 8 is a schematic structural diagram of a far-field voice message interaction system for a device end according to an embodiment of the present invention, where the system can execute the far-field voice message interaction method according to any of the above embodiments and is configured in a terminal.
The present embodiment provides a far-field voice message interaction system 20 for a device side, which includes: a wake-up sentence determination program module 21, an audio cutting program module 22 and a message playing program module 23.
The wake-up statement determining program module 21 is configured to use a wake-up word and a voice message instruction as the wake-up statement of the device side; the audio cutting program module 22 is configured to, when the input user voice hits the wake-up statement, cut the user voice with the hit wake-up statement in the user voice as a node, and determine an audio segment after the node as a voice message audio segment; the message playing program module 23 is configured to send the voice message audio segment to a designated device for playing.
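The cutting rule of this second scheme, treating everything after the wake-statement node as the message, can be sketched as follows. The node timestamp would come from a local wake-statement spotter, and 32 bytes/ms again assumes 16 kHz 16-bit mono audio:

```python
# Hypothetical sketch: once the wake statement ("wake word + message
# instruction") hits, the spotter's end time is the node; everything
# after the node in the buffer is the voice message audio segment.
def cut_after_node(pcm: bytes, node_ms: int, bpms: int = 32) -> bytes:
    """Return the audio segment after the wake-statement node."""
    return pcm[node_ms * bpms:]

buf = bytes(8000 * 32)               # 8 s stand-in buffer
message = cut_after_node(buf, 4500)  # spotter reported the node at 4500 ms
```

This variant needs no round trip to the cloud for alignment, at the cost of requiring the wake statement itself to carry the message instruction.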
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the far-field voice message interaction method in any method embodiment;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
taking the wake-up word and the voice message instruction as a wake-up statement of the device side;
when the input user voice hits the wake-up statement, cutting the user voice by taking the hit wake-up statement in the user voice as a node, and determining the audio segment after the node as a voice message audio segment;
and sending the voice message audio segment to a specified device for playing.
Fig. 9 is a schematic structural diagram of a far-field voice message interaction system for a service cloud according to an embodiment of the present invention, where the system can execute the far-field voice message interaction method according to any of the above embodiments and is configured in a terminal.
The far-field voice message interaction system 30 for the service cloud according to this embodiment includes: a speech recognition program module 31, an audio alignment program module 32, and an information transmission program module 33.
The speech recognition program module 31 is configured to receive, at the service cloud, the user voice sent by the device side and recognize a text corresponding to the user voice, where the text at least includes: a voice message instruction text and a voice message text; the audio alignment program module 32 is configured to determine a corresponding time point of each character in the text in the user voice, and at least mark the voice message instruction text and the time alignment information of the voice message text and the user voice; the information transmission program module 33 is configured to send at least the voice message instruction text and the time alignment information of the voice message text and the user voice to the device side, to assist the device side in cutting the user voice.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the far-field voice message interaction method in any method embodiment;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
the service cloud receives user voice sent by the equipment terminal, and identifies a text corresponding to the user voice, wherein the text at least comprises: a voice message instruction text and a voice message text;
determining a corresponding time point of each character in the text in the user voice, and at least marking the voice message instruction text and the time alignment information of the voice message text and the user voice;
and at least sending the voice message instruction text and the time alignment information of the voice message text and the user voice to the equipment end for assisting the equipment end in cutting the user voice.
The memory, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the far-field voice message interaction method of any of the above method embodiments.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: the apparatus includes at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the far-field voice message interaction method of any embodiment of the present invention.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: these are characterized by mobile communication capability, with voice and data communication as the primary goal. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: these belong to the category of personal computers, have computing and processing functions, and generally also offer mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices: these can display and play multimedia content, and include audio and video players, handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Other electronic devices with data processing capabilities.
As used herein, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (15)

1. A far-field voice message interaction method is applied to a device side and comprises the following steps:
caching the input user voice and sending the user voice to a service cloud;
receiving time alignment information fed back by the service cloud, cutting the cached user voice based on the time alignment information, and at least determining a voice message instruction audio segment and a voice message audio segment in the user voice;
and triggering a message leaving function by utilizing the voice message instruction audio segment, determining message leaving content based on the voice message audio segment, and sending the message leaving content to specified equipment for playing.
2. The method of claim 1, wherein prior to said caching the input user speech, the method further comprises:
inputting the collected user voice to a beam forming module for extracting clear user voice in a far-field voice environment;
the user voice processed by the beam forming module is input to an automatic gain module for stabilizing the user voice volume in the far-field environment;
and the signal processed by the automatic gain module is input to a deep learning post-processing module for denoising the user voice.
3. The method of claim 1, wherein the message content comprises: voice message audio segments or synthesized message audio;
and when the message content is the voice message audio segment, directly sending the voice message audio segment to a specified device for playing.
4. The method of claim 3, wherein the method further comprises:
receiving an identification text fed back by the service cloud;
when the message content is the synthesized message audio, cutting the recognition text based on the time alignment information to obtain a voice message text segment;
and generating synthesized message audio based on the voice message text segment, and sending the message audio to specified equipment for playing.
5. A voice message interaction method is applied to a device side and comprises the following steps:
taking the wake-up word and the voice message instruction as a wake-up statement of the device side;
when the input user voice hits the wake-up statement, cutting the user voice by taking the hit wake-up statement in the user voice as a node, and determining the audio segment after the node as a voice message audio segment;
and sending the voice message audio segment to a specified device for playing.
6. A far-field voice message interaction method is applied to a service cloud end and comprises the following steps:
the service cloud receives user voice sent by the equipment terminal, and identifies a text corresponding to the user voice, wherein the text at least comprises: a voice message instruction text and a voice message text;
determining a corresponding time point of each character in the text in the user voice, and at least marking the voice message instruction text and the time alignment information of the voice message text and the user voice;
and at least sending the voice message instruction text and the time alignment information of the voice message text and the user voice to the equipment end for assisting the equipment end in cutting the user voice.
7. The method of claim 6, wherein the text further comprises: a wake-up word;
after the determining a corresponding time point of each word in the text in the user speech, the method further includes:
marking time alignment information of the wake-up word, the voice message instruction text, and the voice message text with the user voice;
and sending at least the wake-up word, the voice message instruction text, and the time alignment information of the voice message text and the user voice to the device side.
8. The method according to claim 6, wherein before sending at least the voice message instruction text and the time alignment information of the voice message text and the user voice to the device side, the method further comprises:
performing semantic understanding on the text, and extracting key message information in the text;
and sending the message key information, the voice message instruction text and the time alignment information of the voice message text and the user voice to the equipment terminal.
9. The method according to claim 7, wherein before sending at least the voice message instruction text and the time alignment information of the voice message text and the user voice to the device side, the method further comprises:
performing wake-up word filtering on the text, performing semantic understanding on the filtered text, and extracting key message information from the text;
and sending the key message information, the wake-up word, the voice message instruction text, and the time alignment information of the voice message text and the user voice to the device side.
10. The method according to any one of claims 8-9, wherein the key message information at least comprises: a message object, a message time, and message content.
11. A far-field voice message interaction system for a device side comprises:
the information transmission program module is used for caching the input user voice and sending the user voice to the service cloud end;
the audio cutting program module is used for receiving time alignment information fed back by the service cloud, cutting the cached user voice based on the time alignment information, and at least determining a voice message instruction audio segment and a voice message audio segment in the user voice;
and the message playing program module is used for triggering a message function by utilizing the voice message instruction audio segment, determining message content based on the voice message audio segment, and sending the message content to specified equipment for playing.
12. A far-field voice message interaction system for a device side comprises:
the wake-up statement determining program module is used for taking the wake-up word and the voice message instruction as a wake-up statement of the device side;
the audio cutting program module is used for, when the input user voice hits the wake-up statement, cutting the user voice by taking the hit wake-up statement in the user voice as a node, and determining the audio segment after the node as a voice message audio segment;
and the message playing program module is used for sending the voice message audio segment to the specified equipment for playing.
13. A far-field voice message interaction system for a service cloud, comprising:
the voice recognition program module is used for receiving, at the service cloud, the user voice sent by the device side and recognizing a text corresponding to the user voice, wherein the text at least comprises: a voice message instruction text and a voice message text;
the audio alignment program module is used for determining a corresponding time point of each character in the text in the user voice and at least marking the voice message instruction text and the time alignment information of the voice message text and the user voice;
and the information transmission program module is used for at least sending the voice message instruction text and the time alignment information of the voice message text and the user voice to the equipment terminal and assisting the equipment terminal in cutting the user voice.
14. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-10.
15. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 10.
CN202110937579.4A 2021-08-16 2021-08-16 Far-field voice message interaction method and system Pending CN113643691A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110937579.4A CN113643691A (en) 2021-08-16 2021-08-16 Far-field voice message interaction method and system


Publications (1)

Publication Number Publication Date
CN113643691A true CN113643691A (en) 2021-11-12

Family

ID=78422044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110937579.4A Pending CN113643691A (en) 2021-08-16 2021-08-16 Far-field voice message interaction method and system

Country Status (1)

Country Link
CN (1) CN113643691A (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101101590A (en) * 2006-07-04 2008-01-09 王建波 Sound and character correspondence relation table generation method and positioning method
CN103714815A (en) * 2013-12-09 2014-04-09 何永 Voice control method and device thereof
CN104464723A (en) * 2014-12-16 2015-03-25 科大讯飞股份有限公司 Voice interaction method and system
CN105913838A (en) * 2016-05-19 2016-08-31 努比亚技术有限公司 Device and method of audio management
CN106292594A (en) * 2016-08-22 2017-01-04 美的智慧家居科技有限公司 A kind of household message leaving system and method
US20180152557A1 (en) * 2014-07-09 2018-05-31 Ooma, Inc. Integrating intelligent personal assistants with appliance devices
CN108648754A (en) * 2018-04-26 2018-10-12 北京小米移动软件有限公司 Sound control method and device
CN108962262A (en) * 2018-08-14 2018-12-07 苏州思必驰信息科技有限公司 Voice data processing method and device
CN109102806A (en) * 2018-09-29 2018-12-28 百度在线网络技术(北京)有限公司 Method, apparatus, equipment and computer readable storage medium for interactive voice
CN111048104A (en) * 2020-01-16 2020-04-21 北京声智科技有限公司 Speech enhancement processing method, device and storage medium
CN111161742A (en) * 2019-12-30 2020-05-15 朗诗集团股份有限公司 Directional person communication method, system, storage medium and intelligent voice device
US20200234698A1 (en) * 2019-01-23 2020-07-23 Soundhound, Inc. Storing and retrieving personal voice memos
CN111479151A (en) * 2020-03-04 2020-07-31 深圳创维-Rgb电子有限公司 Voice message display method based on information screen state, terminal and storage medium
CN111601154A (en) * 2020-05-08 2020-08-28 北京金山安全软件有限公司 Video processing method and related equipment


Similar Documents

Publication Publication Date Title
CN108962262B (en) Voice data processing method and device
CN110223697B (en) Man-machine conversation method and system
CN111862942B (en) Method and system for training mixed speech recognition model of Mandarin and Sichuan
CN108920128B (en) Operation method and system of presentation
CN103514882B (en) A kind of audio recognition method and system
CN111354363A (en) Vehicle-mounted voice recognition method and device, readable storage medium and electronic equipment
CN109671429B (en) Voice interaction method and device
CN110875045A (en) Voice recognition method, intelligent device and intelligent television
WO2021082133A1 (en) Method for switching between man-machine dialogue modes
CN111178081A (en) Semantic recognition method, server, electronic device and computer storage medium
CN113113009A (en) Multi-mode voice awakening and interrupting method and device
CN116417003A (en) Voice interaction system, method, electronic device and storage medium
JP2007328283A (en) Interaction system, program and interactive method
CN114283820A (en) Multi-character voice interaction method, electronic equipment and storage medium
CN110473524B (en) Method and device for constructing voice recognition system
CN112700767B (en) Man-machine conversation interruption method and device
CN110660393B (en) Voice interaction method, device, equipment and storage medium
CN112447177B (en) Full duplex voice conversation method and system
CN111540357A (en) Voice processing method, device, terminal, server and storage medium
CN113643691A (en) Far-field voice message interaction method and system
US20220208190A1 (en) Information providing method, apparatus, and storage medium, that transmit related information to a remote terminal based on identification information received from the remote terminal
CN113643706B (en) Speech recognition method, device, electronic equipment and storage medium
CN109243424A (en) One key voiced translation terminal of one kind and interpretation method
CN111968630B (en) Information processing method and device and electronic equipment
CN113763925A (en) Speech recognition method, speech recognition device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination