CN111916082A - Voice interaction method and device, computer equipment and storage medium - Google Patents

Voice interaction method and device, computer equipment and storage medium

Info

Publication number: CN111916082A
Application number: CN202010817186.5A
Authority: CN (China)
Prior art keywords: text, voice, target, response data, recognition
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN111916082B (en)
Inventors: 王宏景, 傅成彬, 陈龙
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd; priority to CN202010817186.5A
Publication of CN111916082A; application granted; publication of CN111916082B


Classifications

    • G10L 15/22 - Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06F 40/30 - Handling natural language data; semantic analysis
    • G10L 15/26 - Speech recognition; speech to text systems
    • G10L 15/34 - Speech recognition; adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to cloud servers and provides a voice interaction method and apparatus, a computer device, and a storage medium. The method includes: receiving voice data packets sent in sequence by a terminal; while the voice data packets are being received in sequence, performing speech recognition and silence detection on the received voice data packets in order of receiving time to obtain a recognized text and a continuous accumulated silence duration corresponding to each voice data packet; when the currently obtained continuous accumulated silence duration is greater than or equal to a silence duration threshold and the currently obtained recognized text is consistent with a target text, taking response data obtained in advance in an asynchronous manner based on the semantic parsing result of the target text as target response data, the target text being the recognized text that was most recently semantically parsed; and feeding the target response data back to the terminal. With this method, voice interaction efficiency can be improved.

Description

Voice interaction method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a voice interaction method, apparatus, computer device, and storage medium.
Background
With the continuous development of artificial intelligence technology, voice interaction based on artificial intelligence has gradually matured and now provides convenient services in many aspects of daily life. Voice interaction mainly involves operations such as speech recognition, semantic parsing, and response data acquisition. In the current approach, speech recognition is usually performed on the interactive speech; after a complete speech recognition result is obtained, semantic parsing is performed on that complete result, and corresponding response data is obtained and fed back according to the parsing result. However, this approach incurs a long response wait, so voice interaction efficiency is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a voice interaction method, apparatus, computer device and storage medium capable of improving voice interaction efficiency.
A method of voice interaction, the method comprising:
receiving voice data packets sent in sequence by a terminal;
while the voice data packets are being received in sequence, performing speech recognition and silence detection on the received voice data packets in order of receiving time, to obtain a recognized text and a continuous accumulated silence duration corresponding to each voice data packet;
when the currently obtained continuous accumulated silence duration is greater than or equal to a silence duration threshold and the currently obtained recognized text is consistent with a target text, taking response data obtained in advance in an asynchronous manner based on a semantic parsing result of the target text as target response data, the target text being the recognized text that was most recently semantically parsed; and
feeding the target response data back to the terminal.
A voice interaction apparatus, the apparatus comprising:
a receiving module, configured to receive voice data packets sent in sequence by a terminal;
a recognition module, configured to perform, while the voice data packets are being received in sequence, speech recognition and silence detection on the received voice data packets in order of receiving time, to obtain a recognized text and a continuous accumulated silence duration corresponding to each voice data packet;
an acquisition module, configured to take, when the currently obtained continuous accumulated silence duration is greater than or equal to a silence duration threshold and the currently obtained recognized text is consistent with a target text, response data obtained in advance in an asynchronous manner based on a semantic parsing result of the target text as target response data, the target text being the recognized text that was most recently semantically parsed; and
a response module, configured to feed the target response data back to the terminal.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
receiving voice data packets sent in sequence by a terminal;
while the voice data packets are being received in sequence, performing speech recognition and silence detection on the received voice data packets in order of receiving time, to obtain a recognized text and a continuous accumulated silence duration corresponding to each voice data packet;
when the currently obtained continuous accumulated silence duration is greater than or equal to a silence duration threshold and the currently obtained recognized text is consistent with a target text, taking response data obtained in advance in an asynchronous manner based on a semantic parsing result of the target text as target response data, the target text being the recognized text that was most recently semantically parsed; and
feeding the target response data back to the terminal.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
receiving voice data packets sent in sequence by a terminal;
while the voice data packets are being received in sequence, performing speech recognition and silence detection on the received voice data packets in order of receiving time, to obtain a recognized text and a continuous accumulated silence duration corresponding to each voice data packet;
when the currently obtained continuous accumulated silence duration is greater than or equal to a silence duration threshold and the currently obtained recognized text is consistent with a target text, taking response data obtained in advance in an asynchronous manner based on a semantic parsing result of the target text as target response data, the target text being the recognized text that was most recently semantically parsed; and
feeding the target response data back to the terminal.
With the above method, apparatus, computer device, and storage medium, while the voice data packets sent by the terminal are being received in sequence, speech recognition and silence detection are performed on the received voice data packets in order of receiving time to obtain the recognized text and continuous accumulated silence duration corresponding to each voice data packet. When the continuous accumulated silence duration is greater than or equal to the silence duration threshold, the currently obtained recognized text is a complete recognized text. If the currently obtained recognized text is consistent with the recognized text that was most recently semantically parsed, the response data obtained in advance based on the semantic parsing result of that most recently parsed text is taken as the target response data corresponding to the currently obtained recognized text and fed back to the terminal. Because the response data corresponding to the most recently parsed recognized text is obtained through parallel processing, in an asynchronous manner, while speech recognition and silence detection are performed on the received voice data packets in order of receiving time, the time spent obtaining response data from the recognized text is saved without affecting the recognition of the voice data packets, and the efficiency of obtaining the target response data is improved. Thus, when the complete recognized text is obtained, the response data obtained in advance for a recognized text consistent with it is used directly as its response data; the complete recognized text does not need to be parsed again, nor does the target response data need to be obtained from a new semantic parsing result. This shortens the time spent on semantic parsing and response data acquisition, shortens the response wait, and improves voice interaction efficiency.
Drawings
FIG. 1 is a diagram of an application environment of a voice interaction method in one embodiment;
FIG. 2 is a flow diagram that illustrates a method for voice interaction, according to one embodiment;
FIG. 3 is a flow diagram of a voice interaction method in another embodiment;
FIG. 4 is a block diagram of a voice interaction system to which the voice interaction method is applied in one embodiment;
FIG. 5 is an architecture diagram of an intelligent voice assistant in one embodiment;
FIG. 6 is a timing diagram of voice interactions in the case where the full recognition text coincides with the intermediate recognition text, under one embodiment;
FIG. 7 is a sequence diagram of voice interactions in the case where the full recognition text is inconsistent with the intermediate recognition text but the semantic parsing results of the full recognition text and the intermediate recognition text are consistent in one embodiment;
FIG. 8 is a comparison of time consumption profiles of a voice interaction process in one embodiment;
FIG. 9 is a block diagram showing the structure of a voice interactive apparatus according to an embodiment;
FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The voice interaction method provided by the application can be applied to the application environment shown in FIG. 1, in which a terminal 102 communicates with a server 104 via a network. The server 104 receives voice data packets sent in sequence by the terminal 102. While the voice data packets are being received in sequence, the server performs speech recognition and silence detection on the received voice data packets in order of receiving time to obtain a recognized text and a continuous accumulated silence duration corresponding to each voice data packet. When the currently obtained continuous accumulated silence duration is greater than or equal to a silence duration threshold and the currently obtained recognized text is consistent with the recognized text that was most recently semantically parsed, the server takes response data obtained in advance in an asynchronous manner based on the semantic parsing result of that text as target response data and feeds the target response data back to the terminal 102. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, or portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.
In one embodiment, the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
In one embodiment, as shown in fig. 2, a voice interaction method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
Step 202: receiving voice data packets sequentially sent by the terminal.
A voice data packet is a data packet obtained by encapsulating speech; the payload of a voice data packet is the speech encapsulated in it.
Specifically, the terminal collects voices from the environment in real time, packages the collected voices in sequence according to voice collection time to obtain corresponding voice data packets, and sends the obtained voice data packets to the server in sequence in the process of packaging the voices in sequence. Accordingly, the server receives the voice data packets sequentially generated and transmitted by the terminal. The voice acquisition time refers to the time for acquiring voice by the terminal.
In one embodiment, the terminal collects voice from the environment in real time, and sequentially encapsulates the collected voice according to the voice time length of single voice transmission to obtain a plurality of corresponding voice data packets. The voice duration of the single voice transmission can be customized according to actual requirements, such as 64 milliseconds, and can also be dynamically determined according to the dynamically detected network performance. Network performance is used to indicate the operating conditions of the network, such as the network operating bandwidth, or the rate of data transmission.
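Purely as an illustration (not part of the patent text), the following Python sketch shows one way a terminal-side component might split captured PCM audio into fixed-duration packets before sending them in sequence; the sample rate, sample width, 64 ms chunk duration, and function names are assumptions.

```python
# Hypothetical sketch: split captured PCM audio into fixed-duration voice
# data packets. Sample rate, sample width, and 64 ms duration are assumptions.
SAMPLE_RATE = 16000        # samples per second
SAMPLE_WIDTH = 2           # bytes per sample (16-bit PCM)
PACKET_MS = 64             # voice duration carried by a single transmission


def packetize(pcm: bytes, packet_ms: int = PACKET_MS) -> list[bytes]:
    """Split raw PCM audio into packets of packet_ms milliseconds each."""
    bytes_per_packet = SAMPLE_RATE * SAMPLE_WIDTH * packet_ms // 1000
    return [pcm[i:i + bytes_per_packet]
            for i in range(0, len(pcm), bytes_per_packet)]


if __name__ == "__main__":
    one_second_of_silence = bytes(SAMPLE_RATE * SAMPLE_WIDTH)
    packets = packetize(one_second_of_silence)
    print(len(packets), "packets")   # 1000 ms / 64 ms -> 16 packets (last one shorter)
```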
In one embodiment, when detecting a voice interaction triggering condition, the terminal performs the related steps of collecting voice in real time and sequentially sending voice data packets obtained by packaging the collected voice to the server. The voice interaction triggering condition is, for example, a voice interaction triggering operation of the user is detected, or a preset wake-up word is detected, and is not particularly limited herein.
Step 204: while the voice data packets are being received in sequence, performing speech recognition and silence detection on the received voice data packets in order of receiving time, to obtain a recognized text and a continuous accumulated silence duration corresponding to each voice data packet.
The receiving time refers to the time when the server receives the voice data packet. The recognized text is a text obtained by performing speech recognition on the speech in the speech data packet, that is, a text obtained by performing text conversion on the speech in the speech data packet. The continuous accumulated silence duration refers to an accumulated duration in which silence is continuously detected, for example, if voices in a plurality of continuous voice data packets are all silence, voice durations corresponding to the plurality of voice data packets are accumulated and summed, so that a corresponding continuous accumulated silence duration can be obtained.
Specifically, in the process of sequentially receiving the data packets sent by the terminal, the server sequentially performs voice recognition and silence detection on the received voice data packets according to the receiving time of each voice data packet, so as to obtain a recognition text and a continuous accumulated silence duration corresponding to each received voice data packet.
In one embodiment, while receiving the voice data packets in sequence, the server performs speech recognition and silence detection on the received packets in order of receiving time in an asynchronous manner. In this way, receiving the voice data packets sent by the terminal and performing speech recognition and silence detection on the received packets proceed in parallel.
In an embodiment, while sequentially performing speech recognition and silence detection on the received voice data packets, the server performs speech recognition and silence detection on the speech in the current voice data packet, obtains the recognized text corresponding to the current packet based on the speech recognition result of the current packet together with the recognized text corresponding to the previous packet, and, correspondingly, obtains the continuous accumulated silence duration corresponding to the current packet based on the silence detection result of the current packet together with the continuous accumulated silence duration of the previous packet. It can be understood that, if the silence in the current voice data packet is not contiguous with the silence in the previous voice data packet, the continuous accumulated silence duration is computed from the silence detection result of the current packet alone.
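A minimal sketch (not from the patent) of how the per-packet state described above might be maintained: the recognized text and the continuous accumulated silence duration are both updated from the previous packet's values. The recognizer output, the silence detector, and the simple concatenation of text are stand-in assumptions.

```python
from dataclasses import dataclass

PACKET_MS = 64  # assumed voice duration per packet


@dataclass
class PacketState:
    recognized_text: str      # cumulative recognized text up to this packet
    silence_ms: int           # continuous accumulated silence duration


def update_state(prev: PacketState, packet_text_delta: str,
                 packet_is_silence: bool) -> PacketState:
    """Combine the current packet's recognition/silence results with the
    previous packet's state, as described in the embodiment above."""
    text = prev.recognized_text + packet_text_delta
    if packet_is_silence:
        # silence contiguous with the previous packet's silence accumulates
        silence_ms = prev.silence_ms + PACKET_MS
    else:
        # a non-silent packet breaks the run; accumulation starts over
        silence_ms = 0
    return PacketState(text, silence_ms)
```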
Step 206: when the currently obtained continuous accumulated silence duration is greater than or equal to a silence duration threshold and the currently obtained recognized text is consistent with a target text, taking response data obtained in advance in an asynchronous manner based on the semantic parsing result of the target text as target response data; the target text is the recognized text that was most recently semantically parsed.
The mute duration threshold is a duration threshold used for comparing with the continuously accumulated mute duration to determine whether a complete recognized text is obtained, that is, a basis for determining whether the user has stopped speaking. The mute duration threshold may be customized based on existing experience, such as 500 milliseconds. The semantic analysis result refers to an analysis result obtained by performing semantic analysis on the recognized text. The recognition text subjected to the semantic analysis at the latest refers to the recognition text with the latest time of the semantic analysis, and the time of the semantic analysis refers to the time point of the semantic analysis of the recognition text. The asynchronous mode is to execute two different and relatively independent operations in parallel, for example, perform speech recognition and silence detection on a speech data packet in sequence, perform semantic parsing on the obtained recognition text in sequence, and obtain corresponding response data based on the result of the semantic parsing. The response data refers to data that is obtained as a response result of the speech corresponding to the recognition text based on the semantic parsing result of the recognition text, for example, the recognition text is "how today's weather is", and the corresponding response data is "today's weather is fine".
Specifically, while sequentially performing speech recognition and silence detection on the received voice data packets, the server performs speech recognition and silence detection on the current voice data packet to obtain the corresponding recognized text and continuous accumulated silence duration, and then compares the currently obtained continuous accumulated silence duration with a preset silence duration threshold. When the currently obtained continuous accumulated silence duration is less than the silence duration threshold, the currently obtained recognized text is an intermediate recognized text; the server semantically parses the currently obtained recognized text in an asynchronous manner to obtain a corresponding semantic parsing result and obtains response data corresponding to the intermediate recognized text based on that result. When the currently obtained continuous accumulated silence duration is greater than or equal to the silence duration threshold, the user has stopped speaking; the server determines that a complete recognized text has been obtained, treats the currently obtained recognized text as the complete recognized text, and compares it with the recognized text that was most recently semantically parsed.
Further, when the currently obtained recognized text is determined to be consistent with the recognized text that was most recently semantically parsed, the server takes that most recently parsed text as the target text and takes the response data obtained in advance for the target text, in an asynchronous manner and based on its semantic parsing result, as the target response data corresponding to the currently obtained recognized text; in other words, the response data obtained in advance in an asynchronous manner based on the target text serves as the target response data corresponding to the complete recognized text.
In one embodiment, while performing speech recognition and silence detection on the received voice data packets in order of receiving time, if the currently obtained continuous accumulated silence duration is less than the silence duration threshold, the server determines that the user has not stopped speaking, treats the currently obtained recognized text as an intermediate recognized text, semantically parses the intermediate recognized text in an asynchronous manner, and obtains corresponding response data based on the parsing result. If the currently obtained continuous accumulated silence duration is greater than or equal to the silence duration threshold, the server determines that the user has stopped speaking, treats the currently obtained recognized text as the complete recognized text, takes the most recently parsed intermediate recognized text as the target text, and, when the complete recognized text is consistent with the target text, takes the response data obtained in advance based on the target text as the target response data corresponding to the complete recognized text.
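The decision flow of this embodiment can be summarised with the hedged sketch below. The parser, the service call, the cache layout, and the 500 ms threshold are stand-in assumptions; the sketch also omits the preset parsing condition and any synchronization between the asynchronous prefetch and the final decision.

```python
import concurrent.futures
from typing import Optional

SILENCE_THRESHOLD_MS = 500   # example silence duration threshold
executor = concurrent.futures.ThreadPoolExecutor()

# cache of the most recently parsed intermediate text and its prefetched answer
last_parsed = {"text": None, "parse": None, "response": None}


def parse_semantics(text: str) -> dict:           # stand-in semantic parser
    return {"domain": "weather", "intent": "query"}


def fetch_response(parse: dict) -> str:           # stand-in service call
    return "Today's weather is fine"


def prefetch(text: str) -> None:
    """Asynchronously parse an intermediate recognized text and prefetch
    its response data while recognition of later packets continues."""
    def work():
        parse = parse_semantics(text)
        last_parsed.update(text=text, parse=parse,
                           response=fetch_response(parse))
    executor.submit(work)


def on_packet(recognized_text: str, silence_ms: int) -> Optional[str]:
    """Returns target response data once the complete text is detected."""
    if silence_ms < SILENCE_THRESHOLD_MS:
        prefetch(recognized_text)                 # intermediate text
        return None
    if recognized_text == last_parsed["text"]:    # complete text, cache hit
        return last_parsed["response"]
    # complete text differs from the most recently parsed text: parse it now
    return fetch_response(parse_semantics(recognized_text))
```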
It can be understood that, because the server determines whether the user has stopped speaking based on the continuous accumulated silence duration and the silence duration threshold, there may be a delay between the moment the user actually stops speaking and the moment the server determines that the user has stopped speaking; in other words, the user may have actually stopped speaking while the server still considers the user to be speaking because the accumulated silence duration has not yet reached the silence duration threshold.
For example, suppose that 20 voice data packets are received in sequence and that the user actually stops speaking at the 12th packet. The server performs speech recognition on the 12th packet, and the recognized text obtained is in fact the complete recognized text corresponding to the moment the user actually stopped speaking. However, after the server performs silence detection on the 12th packet, the continuous accumulated silence duration is still less than the silence duration threshold, so the server determines that the user has not stopped speaking and treats the recognized text obtained from the 12th packet as an intermediate recognized text.
Further, because the user actually stops speaking at the 12th voice data packet, the speech in all subsequent packets, i.e. the 13th to 20th packets, is silence. When the server performs speech recognition and silence detection on these packets in order of receiving time, every recognized text obtained is consistent with the one obtained from the 12th packet, while the continuous accumulated silence duration keeps growing. If, for example, the continuous accumulated silence duration obtained at the 19th packet is still less than the silence duration threshold, the server still determines that the user has not stopped speaking and still treats the recognized text obtained from the 19th packet as an intermediate recognized text. When the continuous accumulated silence duration obtained at the 20th packet is greater than or equal to the silence duration threshold, the server determines that the user has stopped speaking and treats the recognized text obtained from the 20th packet as the complete recognized text.
It can be seen that although the user actually stops speaking at the 12th packet, the server only determines this at the 20th packet, and the recognized text obtained from the 12th packet is consistent with the one obtained from the 20th packet. Because the server, while performing speech recognition and silence detection on the received packets in order of receiving time, obtains response data for the intermediate recognized texts in advance in an asynchronous manner, there exists an intermediate recognized text that is consistent with the eventually obtained complete recognized text, and the response data obtained from the semantic parsing result of that intermediate text is identical to the response data that would be obtained from the semantic parsing result of the complete recognized text. Therefore, when the server determines that the complete recognized text is consistent with the intermediate recognized text that was most recently semantically parsed, it takes the response data corresponding to that intermediate text as the target response data, avoiding semantic parsing of the complete recognized text and a subsequent fetch of response data based on a new parsing result, which improves the efficiency of obtaining the target response data.
It should be noted that, in the above example, the packet at which the user actually stops speaking, the number of packets needed for the continuous accumulated silence duration to reach the silence duration threshold, and the packet at which the server determines that the user has stopped speaking are merely illustrative and are not specifically limited.
In one embodiment, when the currently obtained continuous accumulated silence duration is less than the silence duration threshold, the server locally caches the currently obtained recognized text together with the response data obtained based on its semantic parsing result. It can be understood that the server may cache the currently obtained recognized text and the corresponding response data locally either in an overlay-cache manner or in an update-cache manner. In the overlay-cache manner, only the most recently obtained recognized text and its corresponding response data are cached locally. In the update-cache manner, all of the recognized texts obtained during a single voice interaction, together with their corresponding response data, are cached locally; in this case, the semantic parsing time may also be cached for each cached recognized text. In this way, when the server determines that a complete recognized text has been obtained, it can directly read the most recently parsed recognized text from the local cache, and when the complete recognized text is consistent with that text, it can quickly read the corresponding response data from the cache as the target response data, which reduces the time spent obtaining the target response data and improves voice interaction efficiency.
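The two caching modes mentioned here could look roughly like the following sketch (an illustration, not the patent's implementation): the overlay cache keeps only the latest recognized text and its response, while the update cache keeps every parsed text of the current interaction together with its parse time.

```python
import time


class OverlayCache:
    """Keeps only the most recently parsed recognized text and its response."""
    def __init__(self):
        self.text = None
        self.response = None

    def put(self, text: str, response: str) -> None:
        self.text, self.response = text, response

    def latest(self):
        return self.text, self.response


class UpdateCache:
    """Keeps every parsed recognized text of a single interaction,
    each with its semantic-parse timestamp."""
    def __init__(self):
        self.entries = {}   # text -> (parse_time, response)

    def put(self, text: str, response: str) -> None:
        self.entries[text] = (time.time(), response)

    def latest(self):
        if not self.entries:
            return None, None
        text, (_, response) = max(self.entries.items(),
                                  key=lambda kv: kv[1][0])
        return text, response
```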
In one embodiment, when the currently obtained continuous accumulated silence duration is less than the silence duration threshold, the server also caches the semantic parsing result of the currently obtained recognized text locally, in association with that recognized text.
Step 208: feeding the target response data back to the terminal.
Specifically, after the server obtains the target response data corresponding to the complete recognition text according to the above manner, the obtained target response data is fed back to the terminal. And the terminal displays the received target response data to a corresponding user so as to realize voice interaction. It is understood that the complete recognition text is the recognition text obtained based on the complete speech during the single speech interaction after the user stops speaking, i.e. the complete recognition text corresponds to the complete speech during the single speech interaction.
In one embodiment, the server may directly feed back the target response data to the corresponding terminal, or perform voice synthesis on the target response data to obtain a corresponding response data packet, and feed back the response data packet to the terminal, so as to instruct the terminal to display the target response data to the user in a voice broadcast manner based on the received response data packet.
In the above voice interaction method, while the voice data packets sent by the terminal are being received in sequence, speech recognition and silence detection are performed on the received packets in order of receiving time to obtain the recognized text and continuous accumulated silence duration corresponding to each packet. When the continuous accumulated silence duration is greater than or equal to the silence duration threshold, the currently obtained recognized text is the complete recognized text, and if it is consistent with the recognized text that was most recently semantically parsed, the response data obtained in advance in an asynchronous manner based on that text's semantic parsing result is taken as the target response data corresponding to the currently obtained recognized text and fed back to the terminal. Because the response data corresponding to the most recently parsed recognized text is obtained through parallel processing, in an asynchronous manner, while speech recognition and silence detection are performed on the received packets in order of receiving time, the time spent obtaining response data from the recognized text is saved without affecting the recognition of the voice data packets, and the efficiency of obtaining the target response data is improved. Thus, when the complete recognized text is obtained, the response data obtained in advance for a recognized text consistent with it is used directly as its response data; the complete recognized text does not need to be parsed again, nor does the target response data need to be fetched from a new semantic parsing result. This shortens the time spent on semantic parsing and response data acquisition, shortens the response wait, and improves voice interaction efficiency.
In one embodiment, the voice interaction method further includes: when the currently obtained continuous accumulated silence duration is greater than or equal to the silence duration threshold and the currently obtained recognized text is inconsistent with the target text, comparing the semantic parsing result of the currently obtained recognized text with the semantic parsing result of the target text; and when the semantic parsing result of the currently obtained recognized text matches the semantic parsing result of the target text, taking the response data obtained in advance in an asynchronous manner based on the semantic parsing result of the target text as the target response data.
Specifically, when the currently obtained continuous accumulated silence duration is greater than or equal to the silence duration threshold and the currently obtained recognized text is inconsistent with the recognized text that was most recently semantically parsed, the user has stopped speaking and the currently obtained recognized text is the complete recognized text of the single voice interaction, but it differs from the text most recently parsed during that interaction. The server then semantically parses the currently obtained recognized text to obtain a corresponding parsing result, determines the recognized text that was most recently parsed before the currently obtained one as the target text, and compares the parsing result of the currently obtained recognized text with the parsing result of the target text. When the two parsing results match, the server takes the response data obtained in advance in an asynchronous manner based on the semantic parsing result of the target text as the target response data corresponding to the currently obtained recognized text.
In one embodiment, the server reads the semantic parsing result corresponding to the target text directly from the cache, and when that result is consistent with the semantic parsing result of the currently obtained recognized text, reads the response data corresponding to the target text from the local cache as the target response data. When the two semantic parsing results are consistent, the response data that would be obtained from the parsing result of the currently obtained recognized text is identical to the response data obtained from the parsing result of the target text; therefore, using the response data obtained in advance for the target text as the target response data improves the efficiency of obtaining the target response data while preserving its accuracy, without having to fetch response data dynamically from the parsing result of the currently obtained recognized text.
In one embodiment, when the currently obtained continuous accumulated silence duration is greater than or equal to the silence duration threshold, the currently obtained recognized text is inconsistent with the target text, and the semantic parsing result of the currently obtained recognized text is also inconsistent with that of the target text, the server fetches corresponding response data according to the semantic parsing result of the currently obtained recognized text and uses it as the target response data. In this way, even when the complete recognized text differs from the intermediate recognized text in both its text and its semantic parsing result, the target response data can still be obtained from the parsing result of the complete recognized text, so the accuracy of the voice interaction is ensured while maintaining its efficiency. Here, the intermediate recognized text refers to the recognized text that was obtained before the complete recognized text and most recently semantically parsed during the single voice interaction.
In the above embodiment, when the complete recognition text is not consistent with the intermediate recognition text but the semantic parsing results of the complete recognition text and the intermediate recognition text are consistent, the response data obtained in advance based on the intermediate recognition text is directly used as the target response data corresponding to the complete recognition text, so that the time consumed for dynamically obtaining the corresponding response data based on the semantic parsing results of the complete recognition text can be reduced, and the voice interaction efficiency can be improved.
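A short sketch of the fallback described in this embodiment, under the assumptions that the cache stores the parse result alongside the response and that a parse-result match means the same service field and service intention; the function names are invented for illustration.

```python
def results_match(a: dict, b: dict) -> bool:
    """Assumed matching rule: same service field (domain) and same intent."""
    return (a.get("domain") == b.get("domain")
            and a.get("intent") == b.get("intent"))


def resolve_complete_text(complete_text: str, cached_text: str,
                          cached_parse: dict, cached_response: str,
                          parse_semantics, fetch_response) -> str:
    """Decide the target response data for a complete recognized text."""
    if complete_text == cached_text:
        return cached_response            # texts match: reuse prefetched data
    parse = parse_semantics(complete_text)
    if results_match(parse, cached_parse):
        return cached_response            # parse results match: reuse it anyway
    return fetch_response(parse)          # fall back to a fresh fetch
```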
In one embodiment, the voice interaction method further includes: when the continuously accumulated mute duration obtained currently is smaller than the mute duration threshold, performing semantic analysis on the identification text obtained currently to obtain a semantic analysis result; determining a service field and a service intention according to a currently obtained semantic analysis result; and acquiring corresponding response data according to the service field and the service intention.
The service domain refers to a domain to which a requested service belongs, and specifically may refer to a domain to which a service requested by a user through voice belongs. Service areas include, but are not limited to, weather, news, music, etc. The service intention refers to the requested service content, and specifically may be the service content requested by the user through voice. Service intentions include, but are not limited to, albums, songs, and current news, etc. For example, the identification text is "how the weather is today", and the semantic analysis result obtained by performing semantic analysis on the identification text may be represented as "weather the user wants to query", the service field determined based on the semantic analysis result is "weather", and the service intention is "query".
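Purely as an illustration, a semantic parsing result like the one in the example above might be represented as a small structure carrying the service field and service intention; the field names and slot layout are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParseResult:
    domain: str    # service field, e.g. "weather", "news", "music"
    intent: str    # service intention, e.g. "query", "play"
    slots: dict    # any additional parsed parameters, e.g. {"date": "today"}


# "How is the weather today?" -> service field "weather", intention "query"
example = ParseResult(domain="weather", intent="query", slots={"date": "today"})
```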
Specifically, when the currently obtained continuous accumulated silence duration is less than the silence duration threshold, the server treats the currently obtained recognized text as an intermediate recognized text and semantically parses it to obtain a corresponding semantic parsing result. The server then analyzes this parsing result to obtain the service field and service intention corresponding to the recognized text, and fetches the response data corresponding to that recognized text, in an asynchronous manner, according to the service field and service intention. It can be understood that the server performs these response-data operations for the currently obtained recognized text while continuing to perform speech recognition and silence detection on the received voice data packets in order of receiving time.
In one embodiment, the server calls a corresponding service to acquire response data corresponding to the identification text according to the service field and the service intention corresponding to the currently obtained identification text. The service for acquiring the response data corresponds to the service field and the service intention, so that the corresponding response data can be acquired quickly and accurately by calling the corresponding service based on the service field and the service intention.
In one embodiment, the service domain may be understood as a service skill. One or more service skills may be preconfigured for the voice assistant. And when the requested service skill is determined according to the semantic parsing result of the currently obtained identification text, the server calls the service skill to obtain corresponding response data.
In the above embodiment, when the intermediate recognition text is obtained through recognition, the corresponding service field and the corresponding service intention are determined based on the semantic analysis result of the intermediate recognition text, and the corresponding response data is obtained according to the service field and the service intention, so that the accuracy of obtaining the response data can be improved.
In one embodiment, when the currently obtained continuous accumulated mute duration is less than the mute duration threshold, performing semantic analysis on the currently obtained identification text to obtain a semantic analysis result, including: when the continuously accumulated mute duration obtained currently is smaller than the mute duration threshold, comparing the identification text obtained currently with a preset analysis condition; and when the currently obtained identification text meets the preset analysis conditions, performing semantic analysis on the currently obtained identification text to obtain a semantic analysis result.
The preset parsing condition is a condition or basis for triggering semantic parsing operation on the recognized text. The preset parsing condition includes, but is not limited to, that the text length of the currently obtained recognition text is greater than or equal to a text length threshold, and/or that the currently obtained recognition text is inconsistent with the previously obtained recognition text.
Specifically, when it is determined that the currently obtained continuous accumulated mute duration is smaller than the mute duration threshold, the server compares the currently obtained identification text with a preset parsing condition. And when the currently obtained identification text is judged to meet the preset analysis condition, the server carries out semantic analysis on the currently obtained identification text to obtain a corresponding semantic analysis result so as to obtain corresponding response data based on the semantic analysis result.
In one embodiment, when it is determined that the currently obtained recognized text does not satisfy the preset parsing condition, the server does not perform a relevant operation of semantic parsing on the currently obtained recognized text any more, but performs a relevant operation of semantic parsing on a next recognized text satisfying the preset parsing condition in time sequence.
In the above embodiment, semantic parsing is performed, in chronological order, only on the recognized texts that meet the preset parsing condition, rather than on every recognized text in turn. This avoids unnecessary semantic parsing of recognized texts that do not meet the condition, which saves computing resources and improves parsing efficiency, and it avoids the situation in which, when the complete recognized text is obtained, no response data has been prefetched asynchronously for an intermediate recognized text whose text or semantic parsing result is consistent with the complete recognized text. The efficiency of obtaining the target response data, and therefore the efficiency of the voice interaction, is thus improved.
In one embodiment, the preset parsing condition includes that the text length of the currently obtained recognition text is greater than or equal to a text length threshold, and the currently obtained recognition text is inconsistent with the previously obtained recognition text.
The text length threshold is a length threshold used for comparing with the text length of the recognition text to determine whether the recognition text meets the preset parsing condition, that is, a basis for determining whether the recognition text meets the preset parsing condition. The text length threshold may be customized, such as 4.
Specifically, when the currently obtained continuous accumulated silence duration is less than the silence duration threshold, the server obtains the text length of the currently obtained recognized text and compares it with the preset text length threshold. When the text length is greater than or equal to the text length threshold, the server obtains the recognized text that was obtained most recently before the currently obtained one and compares it with the currently obtained recognized text; when the obtained recognized text is judged to be consistent with the currently obtained recognized text, the server determines that the currently obtained recognized text satisfies the preset parsing condition. It can be understood that the recognized text obtained most recently before the currently obtained one, i.e. the previously obtained recognized text, is the one used for this comparison.
For example, assuming that the currently obtained identification text is an identification text obtained by performing voice recognition on the received third voice data packet, when the text length of the identification text is greater than or equal to a text length threshold and the identification text is consistent with the identification text obtained by performing voice recognition on the second voice data packet, it is determined that the currently obtained identification text satisfies a preset parsing condition.
In one embodiment, the preset parsing condition includes that the text length of the currently obtained recognized text is greater than or equal to a text length threshold, and/or the number of text changes is less than a change number threshold. The text change times refer to the change times of the currently obtained identification text relative to the previously obtained identification text, and are used for representing whether the currently obtained identification text is consistent with the previously obtained identification text or not. When the currently obtained recognition text is consistent with the recognition text obtained last time, the corresponding text change frequency is 0, and when the currently obtained recognition text is inconsistent with the previously obtained recognition text, the text change frequency is 1. The threshold of the number of changes is self-defined, such as 1.
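Following just the variant described in this paragraph (a text length threshold combined with a text-change counter), a hedged sketch of the trigger check might look as follows; the threshold values 4 and 1 are the example values given above, and the combination with "and" is one of the "and/or" options.

```python
TEXT_LENGTH_THRESHOLD = 4    # example value from the description
CHANGE_COUNT_THRESHOLD = 1   # example value from the description


def should_parse(current_text: str, previous_text: str) -> bool:
    """Trigger semantic parsing only when this variant's preset parsing
    condition is met."""
    long_enough = len(current_text) >= TEXT_LENGTH_THRESHOLD
    change_count = 0 if current_text == previous_text else 1
    return long_enough and change_count < CHANGE_COUNT_THRESHOLD
```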
In the above embodiment, whether the currently obtained identification text meets the preset parsing condition is determined according to the text length of the currently obtained identification text and whether the currently obtained identification text is consistent with the previously obtained identification text, so that when the identification text is determined to meet the preset parsing condition, a corresponding semantic parsing process is triggered.
In one embodiment, acquiring corresponding response data according to the service field and the service intention comprises: inquiring a target service field matched with the service field and a target service intention matched with the service intention from the response configuration set; and calling the service corresponding to the target service field to acquire corresponding response data through the interface corresponding to the target service intention.
The response configuration set is a pre-configured index set used to obtain the response data corresponding to a recognized text. It may specifically include candidate service fields and, for each candidate service field, the corresponding candidate service intentions, used to index the response data. The response configuration set may also include a service identifier corresponding to each candidate service field and a service interface identifier corresponding to each candidate service intention.
Specifically, after determining a corresponding service field and a service intention according to a semantic parsing result of a currently obtained recognition text, the server queries a candidate service field matched with the service field from a pre-configured response configuration set as a target service field, and queries a candidate service intention corresponding to the target service field and matched with the service intention from the response configuration set as the target service intention. Further, the server determines a service to be called according to the target service field, determines an interface for calling the service according to the target service intention, and calls the service to be called through the determined interface so as to obtain the response data corresponding to the currently obtained identification text through the service.
In one embodiment, the server obtains the corresponding service identifier according to the target service field, obtains the corresponding service interface identifier according to the target service intention, and calls the service corresponding to the service identifier to obtain the corresponding response data through the interface corresponding to the service interface identifier.
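The response configuration set described above can be pictured as a nested lookup from service field to service intention to a callable interface. This is a hypothetical sketch; the entries, function names, and the use of user location are invented for illustration and are not the patent's implementation.

```python
from typing import Callable, Optional


def query_weather(user_info: dict) -> str:
    # stand-in for a weather service invoked through its query interface
    return f"The weather in {user_info.get('location', 'your city')} is fine today."


# response configuration set: candidate service field -> candidate intent -> interface
ANSWER_CONFIG: dict[str, dict[str, Callable[[dict], str]]] = {
    "weather": {"query": query_weather},
}


def get_response(domain: str, intent: str, user_info: dict) -> Optional[str]:
    """Look up the target service field/intention and call the matching service."""
    interfaces = ANSWER_CONFIG.get(domain)
    if not interfaces or intent not in interfaces:
        return None   # no matching skill: the caller generates prompt info instead
    return interfaces[intent](user_info)


print(get_response("weather", "query", {"location": "Futian"}))
```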
In one embodiment, the server may invoke a local service to obtain the corresponding response data in the manner described above, or may invoke a service running on another computer device, such as a server that stores the response data.
In one embodiment, when the server invokes the service corresponding to the target service field, through the interface corresponding to the target service intention, to obtain the response data corresponding to the currently obtained recognized text, it may also need to obtain user information corresponding to the user currently performing the voice interaction and invoke the service with that user information to obtain the response data. The user information is information associated with the service field of the recognized text, such as the user's location, and is not specifically limited here. For example, when the recognized text is "how is the weather today", the corresponding user location information is obtained; if the user's location is "Futian", the service is invoked to obtain the weather information for Futian as the response data.
In one embodiment, when no target service field matching the service field and/or no target service intention matching the service intention is found in the response configuration set, the server does not obtain response data for the corresponding recognized text and instead performs the semantic parsing operations for the next recognized text in chronological order. In this way, when the service skill requested by the user is not one of the skills pre-configured for the voice assistant, no corresponding response data is fed back. It can be understood that, in the voice interaction method provided in one or more embodiments of the application, if no target response data corresponding to the complete recognized text is obtained, corresponding prompt information or query information is generated dynamically and fed back to the terminal as the target response data. For example, when the complete recognized text is "how is the weather today", the prompt information may be "sorry, no weather information for today was found", and the query information may be "do you want to query today's weather in Futian".
In the above embodiment, the target service field and the target service intention matched with the identification text are obtained from the response configuration set, and the response data corresponding to the identification text is obtained by calling the service corresponding to the target service field through the interface corresponding to the target service intention, so that the obtaining efficiency and accuracy of the response data can be improved.
In one embodiment, step 208 includes: performing speech synthesis on the target response data to obtain a response data packet; and sending the response data packet to the terminal, where the sent response data packet is used to instruct the terminal to parse the response data packet and broadcast the resulting response speech.
The response data packet is a voice data packet obtained by performing voice synthesis on the target response data to obtain corresponding response voice and packaging the response voice.
Specifically, after obtaining the target response data corresponding to the complete recognition text in a single voice interaction, the server performs voice synthesis on the target response data to obtain the corresponding response voice, packages the response voice into a response data packet, and sends the packet to the corresponding terminal. The terminal parses the received response data packet to obtain the response voice and broadcasts it by voice.
In the above embodiment, the response data packet obtained by performing voice synthesis on the target response data is sent to the terminal, instructing the terminal to broadcast the corresponding response voice to the user. Because the terminal does not need to synthesize the target response data itself before broadcasting the response voice, the voice interaction efficiency is not limited by the terminal's voice synthesis capability, and the overall voice interaction efficiency can be improved.
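A minimal sketch of this server-side synthesis and terminal-side parsing follows, assuming a stand-in TTS function and a simple JSON packet format; both are illustrative choices, not part of the disclosure:

```python
import json

def synthesize_speech(text: str) -> bytes:
    # Stand-in for a real TTS engine; the embodiment only requires that the
    # server, not the terminal, produces the response voice.
    return text.encode("utf-8")

def build_response_packet(target_response_data: str) -> bytes:
    """Server side: synthesize the response data and package the response voice."""
    voice = synthesize_speech(target_response_data)
    return json.dumps({"type": "tts_result", "audio": voice.hex()}).encode("utf-8")

def broadcast_response_packet(packet: bytes) -> None:
    """Terminal side: parse the response data packet and broadcast the response voice."""
    audio = bytes.fromhex(json.loads(packet)["audio"])
    print("broadcasting:", audio.decode("utf-8"))

broadcast_response_packet(build_response_packet("Futian is sunny today."))
```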
As shown in fig. 3, in an embodiment, a voice interaction method is provided, which specifically includes the following steps:
Step 302, receiving voice data packets sequentially sent by the terminal.
Step 304, in the process of sequentially receiving the voice data packets, sequentially performing voice recognition and mute detection on the received voice data packets according to the receiving time to obtain a recognition text and a continuous accumulated mute duration corresponding to each voice data packet.
Step 306, when the continuously accumulated mute duration obtained currently is smaller than the mute duration threshold, comparing the identification text obtained currently with a preset analysis condition; the preset analysis conditions comprise that the text length of the currently obtained identification text is larger than or equal to a text length threshold value, and the currently obtained identification text is inconsistent with the previously obtained identification text.
Step 308, when the currently obtained identification text meets the preset analysis condition, performing semantic analysis on the currently obtained identification text to obtain a semantic analysis result.
Step 310, determining a service field and a service intention according to the currently obtained semantic analysis result.
Step 312, querying, from the answer configuration set, the target service field matching the service field and the target service intention matching the service intention.
Step 314, calling the service corresponding to the target service field through the interface corresponding to the target service intention to acquire corresponding response data.
Step 316, when the currently obtained continuous accumulated mute duration is greater than or equal to the mute duration threshold and the currently obtained identification text is consistent with the target text, taking the response data obtained in advance in an asynchronous manner based on the semantic analysis result of the target text as the target response data; the target text is the recognition text that was most recently semantically parsed.
Step 318, when the currently obtained continuous accumulated mute duration is greater than or equal to the mute duration threshold and the currently obtained identification text is inconsistent with the target text, comparing the semantic analysis result of the currently obtained identification text with the semantic analysis result of the target text.
Step 320, when the semantic analysis result of the currently obtained identification text matches the semantic analysis result of the target text, taking the response data obtained in advance in an asynchronous manner based on the semantic analysis result of the target text as the target response data.
Step 322, performing voice synthesis on the target response data to obtain a response data packet.
Step 324, sending the response data packet to the terminal, where the sent response data packet instructs the terminal to parse the response data packet and broadcast the parsed response voice.
In the above embodiment, in the process of sequentially receiving voice data packets, voice recognition and silence detection are performed on the received packets in order of receiving time to obtain the recognition text and the continuous accumulated silence duration corresponding to each packet. When the continuous accumulated silence duration reaches or exceeds the silence duration threshold, the complete recognition text is considered to have been obtained; this recognition mode can be understood as streaming voice recognition, and when the continuous accumulated silence duration corresponding to a packet is still below the threshold, the recognition text of that packet is an intermediate recognition text of the streaming recognition. During the streaming recognition, service prefetching is performed on intermediate recognition texts in an asynchronous manner, that is, semantic parsing and service calling are carried out in advance to obtain the response data corresponding to an intermediate recognition text. If, after silence detection ends, the complete recognition text is consistent with the intermediate recognition text, or the semantic parsing result of the complete recognition text is consistent with that of the intermediate recognition text, the response data obtained in advance based on the intermediate recognition text is used as the target response data fed back to the terminal. This saves the time otherwise spent on semantic parsing or service calling for the complete recognition text and improves the voice interaction efficiency. Here, service prefetching refers to performing semantic parsing and service calling on the intermediate recognition text asynchronously before the complete recognition text is obtained, and service calling refers to obtaining the response data corresponding to the intermediate recognition text based on its semantic parsing result.
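The following Python sketch condenses this streaming prefetch flow (steps 302 to 320 above). The threshold values and the placeholder semantic parser and skill call are assumptions made purely for illustration:

```python
from dataclasses import dataclass

SILENCE_THRESHOLD_MS = 500  # assumed value; the embodiment only requires some threshold
MIN_TEXT_LENGTH = 2         # assumed text-length threshold of the preset parsing condition

def parse_semantics(text: str):
    """Placeholder semantic parser returning (service field, service intention)."""
    return ("weather", "query_weather") if "weather" in text else ("chat", "chitchat")

def call_service(semantics) -> str:
    """Placeholder skill call returning response data for a parsed (field, intention)."""
    return f"[response data for {semantics}]"

@dataclass
class Session:
    target_text: str = ""                    # most recently parsed (intermediate) recognition text
    cached_semantics: tuple | None = None
    cached_response: str | None = None

def on_voice_packet(session: Session, text: str, silence_ms: int):
    if silence_ms < SILENCE_THRESHOLD_MS:
        # Intermediate recognition text: prefetch only when the parsing condition holds.
        if len(text) >= MIN_TEXT_LENGTH and text != session.target_text:
            session.target_text = text
            session.cached_semantics = parse_semantics(text)
            session.cached_response = call_service(session.cached_semantics)
        return None
    # Silence threshold reached: `text` is the complete recognition text.
    if text == session.target_text:
        return session.cached_response                  # reuse the prefetched result
    if parse_semantics(text) == session.cached_semantics:
        return session.cached_response                  # texts differ, semantics match
    return call_service(parse_semantics(text))          # fall back to a fresh skill call

session = Session()
for packet_text, silence in [("today", 100),
                             ("how is the weather today", 120),
                             ("how is the weather today", 600)]:
    answer = on_voice_packet(session, packet_text, silence)
print(answer)  # prefetched response for ("weather", "query_weather")
```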
In one embodiment, the voice interaction method provided in one or more embodiments of the present application is applied to an intelligent voice assistant, also simply called a voice assistant. For an intelligent voice assistant that adopts streaming voice recognition, the voice interaction method provided by the present application can improve the end-to-end response speed, that is, the voice interaction efficiency, thereby improving the user experience.
In one embodiment, the intelligent voice assistant to which the voice interaction method provided by the present application is applied mainly comprises an access service, voice recognition, semantic parsing, skill central control, domain services, and the like. The access service refers to the access layer of the intelligent voice assistant. The skill central control implements the call control of service skills; specifically, according to the service field and service intention corresponding to an intermediate recognition text, it calls the corresponding domain service to obtain the response data corresponding to that intermediate recognition text.
FIG. 4 is a block diagram of a voice interaction system to which the voice interaction method is applied in one embodiment. As shown in fig. 4, the voice interaction system includes a terminal layer and a server layer, where the server layer includes an access layer and a logic layer. The terminals in the terminal layer include, but are not limited to, in-vehicle devices, smart speakers and robots. The access layer includes an AIProxy (the intelligent voice assistant access layer), which is the voice request channel connecting the terminal layer and the logic layer. The logic layer includes speech recognition (ASR), natural language processing (NLP), speech synthesis (TTS), skill central control (TSK), and domain services such as music, weather, search and news. Speech recognition (ASR), natural language processing (NLP) and speech synthesis (TTS) exchange data through the access layer. S1-S6 shown in fig. 4 represent the general interaction flow of the voice interaction method provided by the present application.
FIG. 5 is a diagram illustrating the architecture of an intelligent voice assistant in one embodiment. As shown in fig. 5, the intelligent voice assistant mainly includes an access layer, speech recognition (ASR), natural language processing (NLP), speech synthesis (TTS), and the like, where the NLP architecture mainly includes dialogue management (DM), semantic parsing (NLU) and skill central control (TSK). Speech recognition (ASR), natural language processing (NLP) and speech synthesis (TTS) each exchange data with the access layer, and semantic parsing (NLU) and skill central control (TSK) each exchange data with dialogue management (DM). The access layer, ASR, DM, NLU, TSK and TTS can be understood as functional units of the intelligent voice assistant. The semantic parsing unit performs semantic parsing on the recognition text based on one or more of template matching, corpus matching, KBQA (knowledge graph), QAPair (question-answer pairs), and the like, to obtain the corresponding semantic parsing result. S1-S6 shown in fig. 5 represent the general interaction flow inside the server in the voice interaction method provided by the present application, and S4.1 refers to the semantic parsing of the recognition text based on one or more of template matching, corpus matching, KBQA, QAPair, and the like.
In one embodiment, the access layer is configured to receive the voice data packets sequentially sent by the terminal layer and forward them to the voice recognition unit in time order. The voice recognition unit performs voice recognition and silence detection on the received packets in order of receiving time to obtain the recognition text and the continuous accumulated silence duration corresponding to each packet, and judges from the currently obtained continuous accumulated silence duration whether the currently obtained recognition text is the complete recognition text, that is, whether the user has stopped speaking. When the currently obtained recognition text is judged to be the complete recognition text, a voice tail-packet identifier is generated, and the complete recognition text and the corresponding voice tail-packet identifier are fed back to the access layer; when the currently obtained recognition text is an intermediate recognition text, it is fed back to the access layer directly. The access layer judges whether a received intermediate recognition text meets the preset parsing condition and, when it does, initiates an asynchronous call to the dialogue management unit with that intermediate recognition text. The dialogue management unit performs semantic parsing on the intermediate recognition text specified by the asynchronous call to obtain the corresponding semantic parsing result, initiates a service skill call to the skill central control unit based on that result, and receives and caches the response data fed back by the skill central control unit; this response data can be understood as the skill call result, that is, the prefetch result.
Accordingly, the access layer mainly controls the prefetch timing, that is, when to forward an intermediate recognition text after the voice recognition unit returns a packet, while the dialogue management unit mainly controls the configuration of the prefetch field and prefetch intention and the caching of the prefetch result. The prefetch timing refers to the moment at which an asynchronous call is initiated to the dialogue management unit, and the prefetch field and prefetch intention refer, respectively, to the service field and service intention mentioned in one or more embodiments above.
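A minimal sketch of this prefetch configuration and cache inside the dialogue management unit is shown below; the configured skill pairs and the dictionary-based cache are assumptions used only to illustrate the idea:

```python
# Illustrative prefetch configuration: only these (prefetch field, prefetch intention)
# pairs are prefetched asynchronously. The listed skills are assumptions.
PREFETCH_CONFIG = {
    ("weather", "query_weather"),
    ("music", "play_music"),
}

# Prefetch cache: intermediate recognition text -> (semantic parsing result, skill call result).
prefetch_cache: dict[str, tuple] = {}

def maybe_prefetch(intermediate_text: str, semantics: tuple, call_skill) -> None:
    """Prefetch only when the parsed (field, intention) is configured for prefetching."""
    if semantics in PREFETCH_CONFIG:
        prefetch_cache[intermediate_text] = (semantics, call_skill(semantics))

# Example: prefetch the weather skill for an intermediate recognition text.
maybe_prefetch("how is the weather today", ("weather", "query_weather"),
               lambda s: f"[skill call result for {s}]")
print(prefetch_cache)
```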
FIG. 6 is a timing diagram of voice interaction in the case where the complete recognition text is consistent with the intermediate recognition text, in one embodiment. The access layer receives the voice data packets that the terminal generates and sends in sequence from the collected voice, and forwards them to the voice recognition unit in the same order; for example, the voice collected by the terminal from the user is "how is the weather today", and the access layer receives and forwards voice data packet 1, voice data packet 2 and voice data packet 3 in turn. The voice recognition unit performs voice recognition and silence detection on the received packets in order of receiving time (which can also be understood as the receiving sequence) to obtain the corresponding recognition texts and continuous accumulated silence durations. When it judges from the continuous accumulated silence duration that the complete recognition text has been obtained, it feeds the complete recognition text and the corresponding voice tail-packet identifier back to the access layer; otherwise it feeds the currently obtained recognition text back to the access layer as an intermediate recognition text. For example, the recognition texts corresponding to voice data packet 1, voice data packet 2 and voice data packet 3 are "today", "how is the weather today" and "how is the weather today", respectively; the voice recognition unit judges, in the manner described above, that the recognition texts of packets 1 and 2 are intermediate recognition texts and feeds them back to the access layer directly, and judges that the recognition text of packet 3 is the complete recognition text and feeds the complete recognition text and the corresponding voice tail-packet identifier back to the access layer.
The access layer initiates an asynchronous call to the dialogue management unit with the received intermediate recognition text. The dialogue management unit performs semantic parsing on the intermediate recognition text specified by the asynchronous call, initiates a service skill call to the skill central control unit according to the semantic parsing result to obtain the corresponding skill call result, and caches the intermediate recognition text together with its semantic parsing result and skill call result locally; for example, based on the semantic parsing result of the intermediate recognition text corresponding to voice data packet 2, it determines that the service skill to be called is the weather skill and initiates a weather skill call to the skill central control unit. Correspondingly, when the access layer receives the complete recognition text and the voice tail-packet identifier fed back by the voice recognition unit, it initiates a call to the dialogue management unit with the complete recognition text. The dialogue management unit compares the complete recognition text with the locally cached intermediate recognition text and, when the two are consistent, feeds the cached skill call result of the intermediate recognition text back to the access layer; for example, because the intermediate recognition text "how is the weather today" of voice data packet 2 is consistent with the complete recognition text "how is the weather today" of voice data packet 3, the skill call result is taken directly from the cache. The access layer feeds the received skill call result back to the terminal as the target response data, for example "Futian is sunny today".
It is to be understood that the number of voice data packets, the timing at which the access layer initiates the asynchronous call to the dialogue management unit based on the intermediate recognition text, and the target response data finally fed back, as shown in fig. 6, are only examples and are not particularly limited. It should also be noted that obtaining the recognition text "today" from the single voice data packet 1 in fig. 6 is likewise only an example; the recognition text "today" could equally be obtained from multiple voice data packets. The terminal between the access layer and the user is not illustrated in fig. 6.
In the embodiment, when the complete recognition text is consistent with the intermediate recognition text, the skill calling result obtained in advance in an asynchronous mode based on the intermediate recognition text is used as the target response data, so that the time consumption for performing semantic analysis and service skill calling on the complete recognition text can be reduced, and the voice interaction efficiency can be improved.
FIG. 7 is a timing diagram of voice interaction in the case where the complete recognition text and the intermediate recognition text are inconsistent but their semantic parsing results match, in one embodiment. Since the operations performed by the terminal, the access layer, the voice recognition unit and the skill central control unit in fig. 7 are similar to those in fig. 6, they are not described one by one. As shown in fig. 7, the voice initiated by the user in this interaction concerns today's weather; the voice recognition unit obtains a recognition text for each of voice data packet 1, voice data packet 2 and voice data packet 3, and the access layer takes the recognition text corresponding to voice data packet 2 as the intermediate recognition text and initiates an asynchronous call to the dialogue management unit based on it. The dialogue management unit initiates a weather skill call to the skill central control unit based on the semantic parsing result of the intermediate recognition text, obtains the corresponding skill call result, and caches the intermediate recognition text, its semantic parsing result and the skill call result. When it receives the call initiated by the access layer based on the complete recognition text, the dialogue management unit compares the complete recognition text with the cached intermediate recognition text. Because the complete recognition text is worded differently from the intermediate recognition text, the dialogue management unit performs semantic parsing on the complete recognition text and then judges whether its semantic parsing result is consistent with that of the intermediate recognition text; because the two semantic parsing results are consistent, it feeds the skill call result cached for the intermediate recognition text back to the access layer. The access layer feeds this skill call result back to the terminal as the final target response data, instructing the terminal to present the target response data to the user.
It is to be understood that, as with fig. 6, the voice interaction timing diagram shown in fig. 7 is by way of example only and is not intended to be limiting in any way. The terminal between the access layer and the user is likewise not illustrated in fig. 7.
In the embodiment, when the complete recognition text is not consistent with the intermediate recognition text but the semantic parsing results of the complete recognition text and the intermediate recognition text are consistent, the skill calling result obtained in advance in an asynchronous mode based on the intermediate recognition text is used as the target response data, so that the time consumption for calling the service skill according to the semantic parsing result of the complete recognition text can be reduced, and the voice interaction efficiency can be improved.
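One way such a comparison of semantic parsing results might look is sketched below, assuming, purely for illustration, that a parsing result consists of a service field, a service intention and slot values, and that "matching" means all three coincide; the disclosure does not fix the exact criterion:

```python
def semantics_match(result_a: dict, result_b: dict) -> bool:
    """Illustrative comparison of two semantic parsing results.

    It is assumed here that matching means identical field, intention and slot values.
    """
    return all(result_a.get(k) == result_b.get(k) for k in ("field", "intention", "slots"))

# Differently worded texts can yield matching semantic parsing results.
intermediate = {"field": "weather", "intention": "query_weather", "slots": {"date": "today"}}
complete = {"field": "weather", "intention": "query_weather", "slots": {"date": "today"}}
assert semantics_match(intermediate, complete)  # the prefetched skill call result can be reused
```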
FIG. 8 is a comparison of time consumption distributions for the voice interaction process in one embodiment. Fig. 8(a) is the time consumption distribution of an existing voice interaction mode, and fig. 8(b) and (c) are time consumption distributions of the voice interaction mode provided by the present application: fig. 8(b) covers the case where the complete recognition text is consistent with the intermediate recognition text, and fig. 8(c) the case where the complete recognition text is inconsistent with the intermediate recognition text but their semantic parsing results are consistent. The time-consuming components of the three distributions are, in order: tail silence detection, the voice recognition tail packet, dialogue management, the voice synthesis head packet, terminal processing of the voice synthesis head packet, hardware writing, and the like, where dialogue management further includes semantic parsing and the skill central control call. The terminal determines through tail silence detection that the user has stopped speaking when the user finishes speaking. It can be understood that the server also performs tail silence detection and generates the corresponding voice recognition tail packet, and after obtaining the target response data, the server performs voice synthesis on the target response data to obtain the corresponding voice synthesis head packet, which is also the response data packet. After receiving the voice synthesis head packet sent by the server, the terminal processes it to obtain the corresponding response voice, writes the response voice to the hardware, and broadcasts it to the user through the hardware.
As shown in fig. 8(a), in the existing voice interaction mode, tail silence detection, the voice recognition tail packet, dialogue management, the voice synthesis head packet, and terminal processing of the voice synthesis head packet take 880 ms, 65 ms, 198 ms, 13 ms and 399 ms respectively. The time consumed by dialogue management includes the time for semantic parsing and the skill central control call, which take 92 ms and 47 ms respectively, plus other overhead. The total time consumed in this voice interaction mode is therefore 1555 ms, of which 276 ms is consumed on the server side.
As shown in fig. 8(b), in the voice interaction mode provided by the present application, when the complete recognition text is consistent with the intermediate recognition text, tail silence detection, the voice recognition tail packet, dialogue management, the voice synthesis head packet, and terminal processing of the voice synthesis head packet take 880 ms, 65 ms, 21 ms, 13 ms and 399 ms respectively. Because the semantic parsing and the skill central control call are executed asynchronously in advance, neither adds to the serial time consumption, that is, both contribute 0 ms; the 21 ms of dialogue management is the additional time for comparing the complete recognition text with the intermediate recognition text, plus other overhead. The total time consumed in this voice interaction mode is therefore 1378 ms, of which 99 ms is consumed on the server side. Thus, when the complete recognition text is consistent with the intermediate recognition text, the time for semantic parsing and the skill central control call is saved, and the total voice interaction time is reduced.
As shown in fig. 8(c), in the voice interaction mode provided by the present application, when the complete recognition text is inconsistent with the intermediate recognition text but their semantic parsing results are consistent, tail silence detection, the voice recognition tail packet, dialogue management, the voice synthesis head packet, and terminal processing of the voice synthesis head packet take 880 ms, 65 ms, 148 ms, 13 ms and 399 ms respectively. Because the semantic parsing and the skill central control call for the intermediate recognition text are executed asynchronously in advance, they add no serial time consumption; however, since the intermediate recognition text is inconsistent with the complete recognition text, the complete recognition text still has to be semantically parsed, which accounts for 91 ms, while the skill central control call contributes 0 ms. The 148 ms of dialogue management also includes the time for comparing the complete recognition text with the intermediate recognition text and comparing their semantic parsing results, plus other overhead. The total time consumed in this voice interaction mode is therefore 1505 ms, of which 226 ms is consumed on the server side. Thus, when the complete recognition text is inconsistent with the intermediate recognition text but their semantic parsing results are consistent, the time for the skill central control call is saved, and the total voice interaction time is reduced.
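The stated totals can be checked by summing the listed components (assuming, as the figures imply, that the hardware-write time is negligible and not counted in the totals):

```latex
\[
\begin{aligned}
T_{(a)} &= 880 + 65 + 198 + 13 + 399 = 1555~\mathrm{ms}, & T^{\mathrm{server}}_{(a)} &= 65 + 198 + 13 = 276~\mathrm{ms},\\
T_{(b)} &= 880 + 65 + 21 + 13 + 399 = 1378~\mathrm{ms}, & T^{\mathrm{server}}_{(b)} &= 65 + 21 + 13 = 99~\mathrm{ms},\\
T_{(c)} &= 880 + 65 + 148 + 13 + 399 = 1505~\mathrm{ms}, & T^{\mathrm{server}}_{(c)} &= 65 + 148 + 13 = 226~\mathrm{ms}.
\end{aligned}
\]
```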
Based on the time consumption comparison shown in fig. 8, the voice interaction method provided by the present application, which performs service prefetching on the basis of streaming recognition, can improve the end-to-end response speed of the voice assistant, that is, the voice interaction efficiency.
In one embodiment, in one or more of the above embodiments, the intermediate recognition text used for comparison with the complete recognition text specifically refers to the most recently semantically parsed recognition text preceding the complete recognition text.
In one embodiment, the steps performed by the server in one or more of the above embodiments may be specifically performed by a voice assistant deployed at the server.
It should be understood that although the steps in the flowcharts of figs. 2-3 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in figs. 2-3 may include multiple sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times; these sub-steps or stages are not necessarily performed in sequence, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, a voice interaction apparatus 900 is provided, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, and specifically includes: a receiving module 901, an identifying module 902, an obtaining module 903 and a responding module 904, wherein:
a receiving module 901, configured to receive voice data packets sent by terminals in sequence;
the recognition module 902 is configured to, in the process of sequentially receiving voice data packets, sequentially perform voice recognition and silence detection on the received voice data packets according to the receiving time to obtain a recognition text and a continuously accumulated silence duration corresponding to each voice data packet;
an obtaining module 903, configured to, when the currently obtained continuous accumulated mute duration is greater than or equal to the mute duration threshold and the currently obtained identification text is consistent with the target text, take the response data obtained in advance in an asynchronous manner based on the semantic analysis result of the target text as the target response data; the target text is the recognition text that was most recently semantically parsed;
and the response module 904 is configured to feed back the target response data to the terminal.
In an embodiment, the obtaining module 903 is further configured to compare a semantic parsing result of the currently obtained identification text with a semantic parsing result of the target text when the currently obtained continuous accumulated mute duration is greater than or equal to the mute duration threshold and the currently obtained identification text is inconsistent with the target text; and when the semantic analysis result of the current obtained identification text is matched with the semantic analysis result of the target text, taking the semantic analysis result based on the target text and response data obtained in advance in an asynchronous mode as target response data.
In an embodiment, the obtaining module 903 is further configured to perform semantic analysis on the currently obtained identification text to obtain a semantic analysis result when the currently obtained continuous accumulated mute duration is smaller than the mute duration threshold; determining a service field and a service intention according to a currently obtained semantic analysis result; and acquiring corresponding response data according to the service field and the service intention.
In an embodiment, the obtaining module 903 is further configured to compare the currently obtained identification text with a preset parsing condition when the currently obtained continuous accumulated mute duration is less than a mute duration threshold; and when the currently obtained identification text meets the preset analysis conditions, performing semantic analysis on the currently obtained identification text to obtain a semantic analysis result.
In one embodiment, the preset parsing condition includes that the text length of the currently obtained recognition text is greater than or equal to a text length threshold, and the currently obtained recognition text is inconsistent with the previously obtained recognition text.
In one embodiment, the obtaining module 903 is further configured to query, from the answer configuration set, a target service field matching the service field and a target service intention matching the service intention; and calling the service corresponding to the target service field to acquire corresponding response data through the interface corresponding to the target service intention.
In one embodiment, the response module 904 is further configured to perform voice synthesis on the target response data to obtain a response data packet; sending the response data packet to the terminal; and the sent response data packet is used for indicating the terminal to analyze the response data packet and broadcasting the analyzed response voice.
For the specific definition of the voice interaction apparatus, reference may be made to the definition of the voice interaction method above, which is not repeated here. Each module in the voice interaction apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory in the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing the received voice data packets, and corresponding recognized texts and continuous accumulated mute duration. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a voice interaction method.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware related to instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above examples only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method of voice interaction, the method comprising:
receiving voice data packets sent by a terminal in sequence;
in the process of sequentially receiving the voice data packets, sequentially carrying out voice recognition and mute detection on the received voice data packets according to the receiving time to obtain a recognition text and continuous accumulated mute duration corresponding to each voice data packet;
when the current continuous accumulated mute duration is larger than or equal to the mute duration threshold and the current obtained identification text is consistent with the target text, taking the response data obtained in advance in an asynchronous mode based on the semantic analysis result of the target text as the target response data; the target text is an identification text which is subjected to semantic analysis at the latest;
and feeding back the target response data to the terminal.
2. The method of claim 1, further comprising:
when the current continuous accumulated mute duration is larger than or equal to the mute duration threshold and the current obtained identification text is inconsistent with the target text, comparing the semantic analysis result of the current obtained identification text with the semantic analysis result of the target text;
and when the semantic analysis result of the currently obtained identification text is matched with the semantic analysis result of the target text, response data which is obtained in advance through an asynchronous mode based on the semantic analysis result of the target text is used as target response data.
3. The method of claim 1, further comprising:
when the continuously accumulated mute duration obtained currently is smaller than the mute duration threshold, performing semantic analysis on the identification text obtained currently to obtain a semantic analysis result;
determining a service field and a service intention according to a currently obtained semantic analysis result;
and acquiring corresponding response data according to the service field and the service intention.
4. The method according to claim 3, wherein when the currently obtained continuous accumulated mute duration is less than the mute duration threshold, performing semantic parsing on the currently obtained recognized text to obtain a semantic parsing result, including:
when the continuously accumulated mute duration obtained currently is smaller than the mute duration threshold, comparing the identification text obtained currently with a preset analysis condition;
and when the currently obtained identification text meets the preset analysis condition, performing semantic analysis on the currently obtained identification text to obtain a semantic analysis result.
5. The method according to claim 4, wherein the preset parsing condition includes that a text length of the currently obtained recognition text is greater than or equal to a text length threshold, and that the currently obtained recognition text is inconsistent with a previously obtained recognition text.
6. The method of claim 3, wherein the acquiring corresponding response data according to the service field and the service intention comprises:
inquiring a target service field matched with the service field and a target service intention matched with the service intention from a response configuration set;
and calling the service corresponding to the target service field to acquire corresponding response data through the interface corresponding to the target service intention.
7. The method according to any one of claims 1 to 6, wherein the feeding back the target response data to the terminal comprises:
carrying out voice synthesis on the target response data to obtain a response data packet;
sending the response data packet to the terminal; and the sent response data packet is used for indicating the terminal to analyze the response data packet and broadcasting the analyzed response voice.
8. A voice interaction apparatus, comprising:
the receiving module is used for receiving voice data packets sent by the terminal in sequence;
the identification module is used for carrying out voice identification and mute detection on the received voice data packets in sequence according to the receiving time in the process of receiving the voice data packets in sequence to obtain an identification text and continuous accumulated mute duration corresponding to each voice data packet;
the acquisition module is used for taking response data obtained in advance in an asynchronous mode based on the semantic analysis result of the target text as target response data when the currently obtained continuous accumulated mute duration is greater than or equal to the mute duration threshold and the currently obtained identification text is consistent with the target text; the target text is an identification text which is subjected to semantic analysis at the latest;
and the response module is used for feeding back the target response data to the terminal.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202010817186.5A 2020-08-14 2020-08-14 Voice interaction method, device, computer equipment and storage medium Active CN111916082B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010817186.5A CN111916082B (en) 2020-08-14 2020-08-14 Voice interaction method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010817186.5A CN111916082B (en) 2020-08-14 2020-08-14 Voice interaction method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111916082A true CN111916082A (en) 2020-11-10
CN111916082B CN111916082B (en) 2024-07-09

Family

ID=73283112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010817186.5A Active CN111916082B (en) 2020-08-14 2020-08-14 Voice interaction method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111916082B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201400604D0 (en) * 2013-01-25 2014-03-05 Pantel Lothar Method for voice activation of a software agent from standby mode
WO2016127550A1 (en) * 2015-02-13 2016-08-18 百度在线网络技术(北京)有限公司 Method and device for human-machine voice interaction
CN107665706A (en) * 2016-07-29 2018-02-06 科大讯飞股份有限公司 Rapid Speech exchange method and system
CN107146602A (en) * 2017-04-10 2017-09-08 北京猎户星空科技有限公司 A kind of audio recognition method, device and electronic equipment
CN108305628A (en) * 2017-06-27 2018-07-20 腾讯科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN108009303A (en) * 2017-12-30 2018-05-08 北京百度网讯科技有限公司 Searching method, device, electronic equipment and storage medium based on speech recognition
CN108877804A (en) * 2018-06-26 2018-11-23 苏州思必驰信息科技有限公司 Voice service method, system, electronic equipment and storage medium
CN110287303A (en) * 2019-06-28 2019-09-27 北京猎户星空科技有限公司 Human-computer dialogue processing method, device, electronic equipment and storage medium
CN111429899A (en) * 2020-02-27 2020-07-17 深圳壹账通智能科技有限公司 Speech response processing method, device, equipment and medium based on artificial intelligence

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112463939A (en) * 2020-11-12 2021-03-09 深圳市欢太科技有限公司 Man-machine conversation method, system, service device and computer storage medium
CN112463939B (en) * 2020-11-12 2024-05-24 深圳市欢太科技有限公司 Man-machine conversation method, system, service equipment and computer storage medium
CN112558491A (en) * 2020-11-27 2021-03-26 青岛海尔智能家电科技有限公司 Home scene linkage intelligent home system based on voice recognition and control method and control device thereof
WO2022143048A1 (en) * 2020-12-31 2022-07-07 华为技术有限公司 Dialogue task management method and apparatus, and electronic device
CN113066473A (en) * 2021-03-31 2021-07-02 建信金融科技有限责任公司 Voice synthesis method and device, storage medium and electronic equipment
CN115240716A (en) * 2021-04-23 2022-10-25 华为技术有限公司 Voice detection method, device and storage medium
CN113241071A (en) * 2021-05-10 2021-08-10 湖北亿咖通科技有限公司 Voice processing method, electronic equipment and storage medium
CN113241071B (en) * 2021-05-10 2024-10-01 亿咖通(湖北)技术有限公司 Voice processing method, electronic equipment and storage medium
CN113643696A (en) * 2021-08-10 2021-11-12 阿波罗智联(北京)科技有限公司 Voice processing method, device, equipment, storage medium and program
CN114360530A (en) * 2021-11-30 2022-04-15 北京罗克维尔斯科技有限公司 Voice test method and device, computer equipment and storage medium
WO2023246609A1 (en) * 2022-06-24 2023-12-28 华为技术有限公司 Speech interaction method, electronic device and speech assistant development platform
CN115457945A (en) * 2022-11-10 2022-12-09 广州小鹏汽车科技有限公司 Voice interaction method, server and storage medium

Also Published As

Publication number Publication date
CN111916082B (en) 2024-07-09

Similar Documents

Publication Publication Date Title
CN111916082B (en) Voice interaction method, device, computer equipment and storage medium
CN109378000B (en) Voice wake-up method, device, system, equipment, server and storage medium
CN107833574B (en) Method and apparatus for providing voice service
CN111105782B (en) Session interaction processing method and device, computer equipment and storage medium
WO2021004481A1 (en) Media files recommending method and device
US10824664B2 (en) Method and apparatus for providing text push information responsive to a voice query request
US20160125883A1 (en) Speech recognition client apparatus performing local speech recognition
CN111210829A (en) Speech recognition method, apparatus, system, device and computer readable storage medium
CN110942764B (en) Stream type voice recognition method
WO2019228138A1 (en) Music playback method and apparatus, storage medium, and electronic device
WO2022142031A1 (en) Invalid call determination method and apparatus, computer device, and storage medium
CN111341315A (en) Voice control method, device, computer equipment and storage medium
CN111899737B (en) Audio data processing method, device, server and storage medium
WO2021103741A1 (en) Content processing method and apparatus, computer device, and storage medium
CN111933149A (en) Voice interaction method, wearable device, terminal and voice interaction system
CN115050372A (en) Audio segment clustering method and device, electronic equipment and medium
CN113516963B (en) Audio data generation method and device, server and intelligent sound box
CN110502631B (en) Input information response method and device, computer equipment and storage medium
CN112306560B (en) Method and apparatus for waking up an electronic device
CN111354350B (en) Voice processing method and device, voice processing equipment and electronic equipment
CN114974232A (en) Voice information processing method and related product
CN114391165A (en) Voice information processing method, device, equipment and storage medium
CN112466283B (en) Cooperative software voice recognition system
CN114822506A (en) Message broadcasting method and device, mobile terminal and storage medium
CN111566727A (en) Multi-stage response in full duplex voice conversations

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant