CN109741753B - Voice interaction method, device, terminal and server - Google Patents


Info

Publication number
CN109741753B
Authority
CN
China
Prior art keywords
identification
audio data
result
server
terminal
Prior art date
Legal status
Active
Application number
CN201910026638.5A
Other languages
Chinese (zh)
Other versions
CN109741753A (en)
Inventor
王丹
邹赛赛
马赛
宇文宏伟
谢延
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910026638.5A priority Critical patent/CN109741753B/en
Publication of CN109741753A publication Critical patent/CN109741753A/en
Application granted granted Critical
Publication of CN109741753B publication Critical patent/CN109741753B/en

Abstract

The invention provides a voice interaction method, apparatus, terminal, and server. The voice interaction method includes: performing serial multi-round endpoint detection; for each piece of audio data obtained in each round of endpoint detection, sending identification session request information to a server, the server identifying the multiple pieces of audio data obtained across the rounds of endpoint detection; and receiving the identification result sent by the server and the target broadcast content corresponding to the identification result. Embodiments of the invention can realize a continuous, multi-pass voice recognition process, reducing the influence of environmental sounds, speaker pauses, and the like on recognition accuracy; they support hesitant questioning, improve recognition accuracy, and make the terminal's responses more natural.

Description

Voice interaction method, device, terminal and server
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a voice interaction method, apparatus, terminal, and server.
Background
In recent years, far-field speech recognition technology has developed rapidly. With microphone-array front-end processing algorithms, noise is effectively suppressed while the target speaker's voice is enhanced, so that far-field speech in scenarios such as smart homes, smart hardware, and robot voice interaction can be recognized accurately.
Currently, voice wake-up and voice recognition are used in combination, i.e., each wake-up triggers one recognition. The existing voice recognition flow is as follows: after a wake-up is triggered, the terminal obtains valid audio through Voice Activity Detection (VAD); at the same time, it sends a voice recognition request to a server (recognition server) and transmits the valid audio to the server, which recognizes it to obtain a recognition result. The recognition result may include voice recognition information, cloud resource data, and Text-To-Speech (TTS) broadcast information.
However, with one recognition per wake-up, recognition accuracy is strongly affected by environmental sounds and by pauses in the speaker's speech, so recognition accuracy is low and the terminal's responses feel unnatural.
Disclosure of Invention
Embodiments of the present invention provide a voice interaction method, apparatus, terminal, and server, aiming to solve the problem of low recognition accuracy in existing voice recognition methods.
In order to solve the technical problem, the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a voice interaction method, applied to a terminal, including:
performing serial multi-round endpoint detection;
sending identification session request information to a server for each piece of audio data obtained in each round of endpoint detection, the server identifying the multiple pieces of audio data obtained by the multi-round endpoint detection;
and receiving the identification result sent by the server and the target broadcast content corresponding to the identification result.
In a second aspect, an embodiment of the present invention provides a voice interaction method, applied to a server, including:
receiving identification session request information sent by a terminal, where the terminal performs serial multi-round endpoint detection and the identification session request information is sent by the terminal for each piece of audio data obtained in each round of endpoint detection;
according to the identification session request information, identifying a plurality of audio data obtained by the multi-round endpoint detection;
and sending the identification result and the target broadcast content corresponding to the identification result to the terminal.
In a third aspect, an embodiment of the present invention provides a voice interaction apparatus, which is applied to a terminal, and includes:
the detection module is used for carrying out serial multi-round endpoint detection;
the first sending module is used for sending identification session request information to a server for each piece of audio data obtained in each round of endpoint detection, the server identifying the multiple pieces of audio data obtained by the multi-round endpoint detection;
and the first receiving module is used for receiving the identification result sent by the server and the target broadcast content corresponding to the identification result.
In a fourth aspect, an embodiment of the present invention provides a voice interaction apparatus, which is applied to a server, and includes:
the second receiving module is used for receiving the identification session request information sent by the terminal, where the terminal performs serial multi-round endpoint detection and the identification session request information is sent by the terminal for each piece of audio data obtained in each round of endpoint detection;
the identification module is used for identifying a plurality of audio data obtained by the multi-round endpoint detection according to the identification session request information;
and the second sending module is used for sending the identification result and the target broadcast content corresponding to the identification result to the terminal.
In a fifth aspect, an embodiment of the present invention provides a terminal, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, can implement the steps of the voice interaction method applied to the terminal.
In a sixth aspect, an embodiment of the present invention provides a server, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, can implement the steps of the above-mentioned voice interaction method applied to the server.
In a seventh aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, can implement the steps of the voice interaction method applied to the terminal or the steps of the voice interaction method applied to the server.
In the embodiment of the invention, the terminal performs serial multi-round endpoint detection, sends identification session request information to the server for each piece of audio data obtained in each round of endpoint detection, has the server identify the multiple pieces of audio data obtained by the multi-round endpoint detection, and receives the identification result sent by the server together with the target broadcast content corresponding to the identification result. This realizes a continuous, multi-pass voice recognition process, thereby reducing the influence of environmental sounds, speaker pauses, and the like on recognition accuracy; it not only supports hesitant questioning but also improves recognition accuracy and makes the terminal's responses more natural.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the embodiments are briefly described below. The drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings based on them without inventive effort.
FIG. 1 is a flow chart of a voice interaction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an audio processing flow according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a speech recognition process according to an embodiment of the present invention;
FIG. 4 is a flow chart of another voice interaction method according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a voice interaction apparatus according to an embodiment of the present invention;
FIG. 6 is a second schematic structural diagram of a voice interaction apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a terminal according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a voice interaction method according to an embodiment of the present invention, where the method is applied to a terminal, and as shown in fig. 1, the method includes the following steps:
step 101: serial multi-round endpoint detection is performed.
The terminal is in a wake-up mode (also called a multi-recognition mode) while performing serial multi-round endpoint detection, so that a single wake-up yields multiple recognition passes. The wake-up mode can be actively triggered by the user; furthermore, it can be exited after receiving "stop listening" (Stop Listen) indication information from the cloud, or the user can exit it actively.
The endpoint detection mentioned above may be understood as VAD detection. Taking VAD as an example, the audio processing flow for multiple recognitions per wake-up may be as shown in fig. 2: in wake-up mode, the terminal may use a Recorder to continuously read the audio stream and pass it through MicServer (microphone server) and MicClient (microphone client) to VAD for multiple rounds of audio endpoint detection; each valid audio segment obtained in each VAD round can be continuously uploaded to a server (e.g., a recognition server) through a corresponding network module (Chunk1, Chunk2, …, or ChunkN) for recognition.
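The serial multi-round detection loop described above can be sketched as follows. This is an illustrative simplification, not the patent's implementation: `frames` stands for the audio stream read by the Recorder, and `is_speech` stands for the VAD decision function; both are assumed interfaces.

```python
def detect_rounds(frames, is_speech):
    """Split a frame stream into consecutive valid-audio segments
    (serial multi-round endpoint detection): each round starts at the
    first speech frame and ends at the first non-speech frame after it."""
    segments, current = [], []
    for frame in frames:
        if is_speech(frame):
            current.append(frame)
        elif current:                  # end of one VAD round:
            segments.append(current)   # emit the segment, reset,
            current = []               # and start a new round
    if current:                        # stream ended mid-segment
        segments.append(current)
    return segments
```

Each emitted segment would then be uploaded through its own network channel, as the flow in fig. 2 describes.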
Step 102: and sending session identification request information to a server aiming at each audio data obtained by detecting each round of endpoint, and identifying a plurality of audio data obtained by detecting a plurality of rounds of endpoints by the server.
The audio data may be understood as the valid audio obtained by the corresponding endpoint detection (e.g., VAD detection). Specifically, whenever a piece of valid audio data is detected, corresponding identification session request information is sent to the server, so that the valid audio data is sent to the server for identification. In this way, the terminal can transmit the multiple serially detected pieces of audio data to the server in sequence.
Step 103: and receiving the identification result sent by the server and the target broadcast content corresponding to the identification result.
The recognition result includes the recognition result of at least one piece of audio data. Optionally, the recognition results may be obtained by the server recognizing multiple pieces of audio data concurrently.
It should be noted that, normally, the server sends the recognition result of each piece of audio data to the terminal as soon as it is obtained; that is, the recognition results are not all sent to the terminal at the same time. The recognition result of a piece of audio data may include audio recognition information, cloud resource data, TTS broadcast information, and the like; the audio recognition information is the recognition information of the audio data, and the cloud resource data is cloud resources related to the recognized object, such as an audio address, a video address, or a picture resource.
It should be noted that the target broadcast content may be determined from the recognition result of one piece of audio data or from the recognition results of at least two pieces, depending on the situation. For example, when a user interacts with a smart speaker: if the user asks "How is the weather today?" completely, clearly, and without pauses, VAD detection yields one piece of valid audio, "How is the weather today?", and the server can determine the target broadcast content from the recognition result of that piece so that the terminal broadcasts the corresponding weather conditions. Alternatively, if the user says, haltingly, "Play me Jay Chou's … uh … Chrysanthemum Terrace", VAD detection may yield three pieces of valid audio: "Play me Jay Chou's", "… uh …", and "Chrysanthemum Terrace". The server can then determine the target broadcast content from the recognition results of the three pieces (the recognition result of "… uh …" possibly being rejection information), so that the terminal plays the song "Chrysanthemum Terrace" by Jay Chou.
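How per-segment recognition results might be combined into a single query, with rejected filler segments dropped, can be sketched as follows. The `text`/`rejected` field names are illustrative assumptions, not part of the patent:

```python
def merge_segment_results(results):
    """Combine per-segment recognition results into one query text,
    skipping segments the server rejected (e.g. hesitation fillers)."""
    kept = [r["text"] for r in results if not r["rejected"]]
    return " ".join(kept)

# Three VAD rounds from a hesitant request, as in the example above;
# the middle filler segment carries rejection information.
results = [
    {"text": "play me Jay Chou's", "rejected": False},
    {"text": "uh", "rejected": True},
    {"text": "Chrysanthemum Terrace", "rejected": False},
]
```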
Therefore, after the target broadcast content sent by the server is received, the terminal carries out voice broadcast based on the target broadcast content.
With the voice interaction method above, the terminal performs serial multi-round endpoint detection, sends identification session request information to the server for each piece of audio data obtained in each round, has the server identify the multiple pieces of audio data obtained across the rounds, and receives the identification result sent by the server and the corresponding target broadcast content. A continuous, multi-pass voice recognition process can thus be achieved, reducing the influence of environmental sounds, speaker pauses, and the like on recognition accuracy; hesitant questioning is supported, recognition accuracy is improved, and the terminal's responses become more natural.
In the embodiment of the invention, to ensure that the server (cloud) accurately identifies the corresponding audio data, information identifying the current recognition state parameters can be carried in each piece of identification session request information. Optionally, the identification session request information may include at least one of the following:
wake-up mode information, identification request index, identification sequence number.
In this way, the server can determine the current voice recognition state from the content of the identification session request information, so that the audio data can be identified accurately (for example, concurrently), improving recognition accuracy.
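The three optional fields listed above can be pictured as a simple request structure; the field names and types here are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class RecognitionSessionRequest:
    """Illustrative shape of the identification session request information;
    names and types are assumptions, not the patent's wire format."""
    wakeup_mode: bool     # wake-up mode information (multi-recognition mode on?)
    request_index: int    # identification request index (which wake-up session)
    sequence_number: int  # identification sequence number (which VAD round)
```

The sequence number in particular is what later allows the terminal to cache per-round results in order.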
Further, the identification result includes the identification result of at least one piece of audio data, and the identification result of each piece of audio data is obtained by the server according to the content of the corresponding identification session request information, in combination with at least one of the following processing manners:
voiceprint detection, context semantic recognition and noise detection.
When voiceprint detection is combined, it can be detected whether the voiceprint information of the current audio data matches preset voiceprint information. If it matches, the current audio data is content that needs to be identified; otherwise, the current audio data does not need to be identified and the terminal does not need to respond. The preset voiceprint information can be preset in a terminal such as a smart speaker. Alternatively, it can be detected whether the voiceprint information of the current audio data matches the voiceprint information of the user who started the wake-up mode; if so, the current audio data is content that needs to be identified, and otherwise it does not need to be identified and the terminal need not respond. Thus, by combining voiceprint detection, the method can respond only to a specific person and avoid interference.
When context semantic recognition is combined, the current audio data can be recognized together with the previous piece (or a preset number of previous pieces) of audio data, realizing contextual semantic integration, so that expression is more natural and the user's real intention can be responded to more quickly.
When the noise detection is combined, the environmental noise can be rejected, so that the influence of the external environmental noise is eliminated, and the identification accuracy is improved.
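The three processing manners above can be combined into one server-side check sequence, sketched below. Every callable here (`is_noise`, `matches_voiceprint`, `asr`) is an assumed interface standing in for implementations the patent does not specify:

```python
def recognize_segment(audio, is_noise, matches_voiceprint, asr, context):
    """Illustrative server-side flow combining the patent's three checks:
    noise detection, voiceprint detection, and context-aware recognition."""
    if is_noise(audio):                      # reject environmental noise
        return {"rejected": True, "reason": "noise"}
    if not matches_voiceprint(audio):        # respond only to a specific person
        return {"rejected": True, "reason": "voiceprint mismatch"}
    # recognize with preceding segments as semantic context
    return {"rejected": False, "text": asr(audio, context)}
```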
Optionally, the recognition result of each piece of audio data may include: first indication information indicating whether the audio data is rejected, and/or second indication information indicating whether the audio data is semantically complete. The terminal can thus quickly learn the status of the corresponding audio data, namely whether it was rejected and whether its semantics are complete.
In this embodiment of the present invention, after step 103, the method may further include:
and according to the identification serial number, orderly caching the identification result of each audio data in the identification result.
The above caching can be implemented by an SDK (Software Development Kit). The timing and integrity of the recognition results can be guaranteed by a local session processor (ASRMutiplyProcManager). Ordered caching makes it easier for the terminal to determine the broadcast content, improving response speed.
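The ordered caching by identification sequence number can be sketched as a reorder buffer that releases results only once every earlier round has arrived. This is an illustration, not the SDK's actual code:

```python
class OrderedResultCache:
    """Buffer per-round results keyed by identification sequence number
    and release them in round order, regardless of arrival order."""
    def __init__(self):
        self.buffer = {}
        self.next_sn = 1     # sequence number of the next result to release

    def put(self, sn, result):
        """Store a result; return the (possibly empty) list of results
        that are now ready to be released in order."""
        self.buffer[sn] = result
        ready = []
        while self.next_sn in self.buffer:   # release any contiguous prefix
            ready.append(self.buffer.pop(self.next_sn))
            self.next_sn += 1
        return ready
```

Even if round 2's result arrives before round 1's (e.g., a short filler segment recognized faster), the terminal still consumes results in round order.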
Further, after receiving the final recognition result (including rejection result information) of a piece of audio data from the server, the method may further include:
and detecting and calling back the historical identification result cached in the cache queue, ending the historical identification session which does not receive the identification final result in the cache queue, and calling back all the cache data corresponding to the audio data.
When a historical recognition session is ended, the session may be forcibly supplemented with end information. After each round's recognition callback finishes supplementing the session identifier ASR_FINISH, the SDK can clear the cached results in the session processor. Through this callback process, the terminal caches fully recognized results, making it easier to determine the broadcast content and improving response speed.
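The force-finish behavior described above can be sketched as follows; the dictionary-based cache and the `finished` flag are illustrative assumptions standing in for the session processor's internal state:

```python
def on_final_result(cache, final_sn):
    """When the final result for round `final_sn` arrives, force-finish any
    earlier rounds still open in the cache (supplementing the end marker,
    akin to ASR_FINISH) and flush their buffered results in order."""
    flushed = []
    for sn in sorted(k for k in cache if k < final_sn):
        entry = cache.pop(sn)
        if not entry.get("finished"):
            entry["finished"] = True   # forcibly supplement end information
        flushed.append(entry)
    return flushed
```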
It should be noted that, in a specific implementation, the recognition result of each piece of audio data may include at least one intermediate recognition result and a final recognition result, where the intermediate recognition result includes a part of audio recognition information, cloud-related resources, and a TTS broadcast data result, and the final recognition result includes the audio recognition information, the cloud-related resources, and the TTS broadcast data result.
The voice interaction process in the embodiment of the present invention is described in detail below with reference to fig. 3 by taking VAD detection as an example.
In the embodiment of the present invention, referring to fig. 3, after the voice client APP of the terminal starts ASR (Automatic Speech Recognition) and turns on the Mic (microphone), VAD can be started for multiple rounds of endpoint detection. After each VAD round detects the start of a valid audio segment, the segment can be sent to the recognition server through a network module such as ChunkNet1, ChunkNet2, …, or ChunkNetN, while information identifying the current recognition state parameters, such as AsrSn1, AsrSn2, …, or AsrSnN, is sent; after each VAD round detects the end of the valid audio, VAD is reset and a new round is started. The server can recognize the received valid audio concurrently, obtain the concurrent recognition results, and send them with the corresponding target broadcast content to the terminal for ordered caching, whose timing and integrity can be guaranteed by the session processor ASRMutiplyProcManager. After receiving the target broadcast content, the voice client can broadcast it, realizing multiple recognitions per single wake-up.
Referring to fig. 4, fig. 4 is a flowchart of another voice interaction method provided by an embodiment of the present invention, where the method is applied to a server (e.g., a recognition server), and as shown in fig. 4, the method includes the following steps:
step 401: receiving identification session request information sent by a terminal; the terminal carries out serial multi-round endpoint detection, and the identification session request information is sent by the terminal aiming at each audio data obtained by each round of endpoint detection.
Optionally, the identification session request information includes at least one of the following:
wake-up mode information, identification request index, identification sequence number.
Step 402: and identifying a plurality of audio data obtained by the multi-round endpoint detection according to the identification session request information.
Step 403: and sending the identification result and the target broadcast content corresponding to the identification result to the terminal.
The voice interaction method of the embodiment of the invention can realize a continuous, multi-pass voice recognition process, thereby reducing the influence of environmental sounds, speaker pauses, and the like on recognition accuracy; it not only supports hesitant questioning but also improves recognition accuracy and makes the terminal's responses more natural.
In this embodiment of the present invention, optionally, step 402 may include:
and according to the identification session request information, carrying out concurrent identification on a plurality of audio data obtained by the multi-round endpoint detection.
Optionally, step 402 may include:
according to the content included in the identification session request information, in combination with at least one of the following processing manners, identifying each audio data in the plurality of audio data respectively:
voiceprint detection, context semantic recognition and noise detection.
Optionally, the recognition result of each piece of audio data may include: first indication information indicating whether the audio data is rejected, and/or second indication information indicating whether the audio data is semantically complete.
The foregoing embodiment describes a voice interaction method according to the present invention, and a voice interaction apparatus, a terminal, and a server according to the present invention are described below with reference to the embodiment and the drawings.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a voice interaction apparatus according to an embodiment of the present invention, and as shown in fig. 5, the voice interaction apparatus 50 may include:
a detection module 51 for performing serial multi-round endpoint detection;
a first sending module 52, configured to send identification session request information to a server for each piece of audio data obtained in each round of endpoint detection, the server identifying the multiple pieces of audio data obtained by the multi-round endpoint detection;
and a first receiving module 53, configured to receive the identification result sent by the server and the target broadcast content corresponding to the identification result.
Optionally, the recognition result is obtained by the server performing concurrent recognition on the plurality of audio data.
Optionally, the identification session request information includes at least one of the following:
wake-up mode information, identification request index, identification sequence number.
Optionally, the identification result includes an identification result of at least one piece of audio data, and the identification result of each piece of audio data is obtained by identifying, by the server, according to content included in the identification session request information corresponding to the piece of audio data, in combination with at least one of the following processing manners:
voiceprint detection, context semantic recognition and noise detection.
Optionally, the recognition result of each piece of audio data includes: first indication information indicating whether the audio data is rejected, and/or second indication information indicating whether the audio data is semantically complete.
Optionally, the terminal further includes:
and the cache module is used for orderly caching the identification result of each audio data in the identification result according to the identification serial number.
Optionally, the terminal further includes:
and the callback module is used for detecting and calling back the historical identification result cached in the cache queue after receiving the identification final result of the audio data from the server, ending the historical identification session which does not receive the identification final result in the cache queue, and calling back all the cache data corresponding to the audio data.
The voice interaction apparatus 50 of the embodiment of the invention can realize a continuous, multi-pass voice recognition process, thereby reducing the influence of environmental sounds, speaker pauses, and the like on recognition accuracy; it not only supports hesitant questioning but also improves recognition accuracy and enhances response naturalness.
Referring to fig. 6, fig. 6 is a schematic structural diagram of another voice interaction apparatus according to an embodiment of the present invention, and as shown in fig. 6, the voice interaction apparatus 60 may include:
a second receiving module 61, configured to receive identification session request information sent by a terminal, where the terminal performs serial multi-round endpoint detection and the identification session request information is sent by the terminal for each piece of audio data obtained in each round of endpoint detection;
the identification module 62 is configured to identify, according to the identification session request information, a plurality of pieces of audio data obtained by the multi-round endpoint detection;
and a second sending module 63, configured to send the identification result and the target broadcast content corresponding to the identification result to the terminal.
Optionally, the identification module 62 is specifically configured to:
and according to the identification session request information, carrying out concurrent identification on a plurality of audio data obtained by the multi-round endpoint detection.
Optionally, the identification session request information includes at least one of the following:
wake-up mode information, identification request index, identification sequence number.
Optionally, the identification module 62 is specifically configured to:
according to the content included in the identification session request information, in combination with at least one of the following processing manners, identifying each audio data in the plurality of audio data respectively:
voiceprint detection, context semantic recognition and noise detection.
Optionally, the recognition result of each piece of audio data may include: first indication information indicating whether the audio data is rejected, and/or second indication information indicating whether the audio data is semantically complete.
The voice interaction apparatus 60 of the embodiment of the invention can realize a continuous, multi-pass voice recognition process, thereby reducing the influence of environmental sounds, speaker pauses, and the like on recognition accuracy; it not only supports hesitant questioning but also improves recognition accuracy and makes the terminal's responses more natural.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a terminal according to an embodiment of the present invention. As shown in fig. 7, the terminal 70 includes a processor 71, a memory 72, and a computer program stored on the memory 72 and executable on the processor. The components of the terminal 70 are coupled together through a bus interface 73. When executed by the processor 71, the computer program can implement each process of the above voice interaction method embodiment applied to the terminal and achieve the same technical effect; to avoid repetition, details are not repeated here.
In addition, referring to fig. 8, fig. 8 is a schematic structural diagram of a server according to an embodiment of the present invention. As shown in fig. 8, the server 80 includes a processor 81, a memory 82, and a computer program stored on the memory 82 and executable on the processor. The components of the server 80 are coupled together through a bus interface 83. When executed by the processor 81, the computer program can implement each process of the above voice interaction method embodiment applied to the server and achieve the same technical effect; to avoid repetition, details are not repeated here.
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements each process of the voice interaction method embodiment applied to the terminal, or each process of the voice interaction method embodiment applied to the server, and achieves the same technical effects; to avoid repetition, details are not described here again.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The serial numbers of the above embodiments of the present invention are merely for description and do not represent the relative merits of the embodiments.
From the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc), including instructions for causing a terminal device (such as a mobile phone, computer, server, air conditioner, or network device) to execute the methods of the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.
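The ordered caching and callback flow described in this document (cache each identification result by its identification serial number; when a final result arrives, call back all cached data for that utterance and end historical sessions that never received a final result) could be sketched as follows. The class and method names are illustrative assumptions, not the patent's implementation:

```python
class ResultCache:
    """Caches per-utterance recognition results in serial-number order and
    flushes them when a final result arrives (illustrative sketch only)."""

    def __init__(self, callback):
        self._queue = {}           # identification serial number -> list of cached results
        self._callback = callback  # invoked with (serial_number, cached_results)

    def cache(self, seq, result):
        """Cache an intermediate result under its identification serial number."""
        self._queue.setdefault(seq, []).append(result)

    def on_final_result(self, seq, result):
        """On a final result: end any older (historical) sessions that never
        received a final result, then call back all cached data for this audio."""
        self.cache(seq, result)
        for stale in sorted(k for k in self._queue if k < seq):
            self._callback(stale, self._queue.pop(stale))  # end stale historical session
        self._callback(seq, self._queue.pop(seq))          # call back this utterance's data
```

Flushing stale sessions in ascending serial-number order preserves the "orderly caching" property: the callback always observes utterances in the order the endpoint-detection rounds produced them.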

Claims (23)

1. A voice interaction method, applied to a terminal, characterized by comprising the following steps:
performing serial multi-round endpoint detection;
sending identification session request information to a server for each piece of audio data obtained by each round of endpoint detection, so that the server identifies the plurality of pieces of audio data obtained by the multiple rounds of endpoint detection;
receiving an identification result sent by the server and target broadcast content corresponding to the identification result;
wherein, after receiving the identification result sent by the server, the method further comprises:
caching, in order of identification serial number, the identification result of each piece of audio data among the identification results;
and after receiving a final identification result for the audio data from the server, detecting and calling back the historical identification results cached in the cache queue, ending any historical identification session in the cache queue for which no final identification result has been received, and calling back all cached data corresponding to the audio data.
2. The method of claim 1, wherein the recognition result is obtained by the server performing concurrent recognition on the plurality of audio data.
3. The method of claim 1, wherein the identification session request information comprises at least one of:
wake-up mode information, identification request index, identification sequence number.
4. The method according to claim 3, wherein the identification result comprises the identification result of at least one piece of audio data, and the identification result of each piece of audio data is obtained by the server identifying the audio data, according to the content included in the corresponding identification session request information, in combination with at least one of the following processing manners:
voiceprint detection, context semantic recognition and noise detection.
5. The method of claim 4, wherein the identification result of each piece of audio data comprises: first indication information indicating whether the audio data is rejected, and/or second indication information indicating whether the audio data is semantically complete.
6. A voice interaction method, applied to a server, characterized by comprising the following steps:
receiving identification session request information sent by a terminal, wherein the terminal performs serial multi-round endpoint detection, and the identification session request information is sent by the terminal for each piece of audio data obtained by each round of endpoint detection;
identifying, according to the identification session request information, the plurality of pieces of audio data obtained by the multiple rounds of endpoint detection;
and sending an identification result and target broadcast content corresponding to the identification result to the terminal, wherein the identification result is used by the terminal to cache, in order of identification serial number, the identification result of each piece of audio data among the identification results, and, after receiving a final identification result for the audio data from the server, to detect and call back the historical identification results cached in the cache queue, end any historical identification session in the cache queue for which no final identification result has been received, and call back all cached data corresponding to the audio data.
7. The method of claim 6, wherein the identifying the plurality of audio data resulting from the multiple rounds of endpoint detection comprises:
and carrying out concurrent identification on a plurality of audio data obtained by the multi-round endpoint detection.
8. The method of claim 6, wherein the identification session request information comprises at least one of:
wake-up mode information, identification request index, identification sequence number.
9. The method of claim 8, wherein the identifying, according to the identification session request information, the plurality of pieces of audio data obtained by the multiple rounds of endpoint detection comprises:
identifying each piece of audio data among the plurality of pieces of audio data respectively, according to the content included in the identification session request information, in combination with at least one of the following processing manners:
voiceprint detection, context semantic recognition and noise detection.
10. The method of claim 9, wherein the identification result of each piece of audio data comprises: first indication information indicating whether the audio data is rejected, and/or second indication information indicating whether the audio data is semantically complete.
11. A voice interaction device, applied to a terminal, characterized by comprising:
the detection module is used for carrying out serial multi-round endpoint detection;
the first sending module is used for sending identification session request information to a server for each piece of audio data obtained by each round of endpoint detection, so that the server identifies the plurality of pieces of audio data obtained by the multiple rounds of endpoint detection;
the first receiving module is used for receiving the identification result sent by the server and the target broadcast content corresponding to the identification result;
the device further comprises:
the cache module is used for caching, in order of identification serial number, the identification result of each piece of audio data among the identification results;
and the callback module is used for, after a final identification result for the audio data is received from the server, detecting and calling back the historical identification results cached in the cache queue, ending any historical identification session in the cache queue for which no final identification result has been received, and calling back all cached data corresponding to the audio data.
12. The apparatus according to claim 11, wherein the recognition result is obtained by the server performing concurrent recognition on the plurality of audio data.
13. The apparatus of claim 11, wherein the identification session request information comprises at least one of:
wake-up mode information, identification request index, identification sequence number.
14. The apparatus according to claim 13, wherein the identification result comprises the identification result of at least one piece of audio data, and the identification result of each piece of audio data is obtained by the server identifying the audio data, according to the content included in the corresponding identification session request information, in combination with at least one of the following processing manners:
voiceprint detection, context semantic recognition and noise detection.
15. The apparatus according to claim 14, wherein the identification result of each piece of audio data comprises: first indication information indicating whether the audio data is rejected, and/or second indication information indicating whether the audio data is semantically complete.
16. A voice interaction device, applied to a server, characterized by comprising:
the second receiving module is used for receiving identification session request information sent by a terminal, wherein the terminal performs serial multi-round endpoint detection, and the identification session request information is sent by the terminal for each piece of audio data obtained by each round of endpoint detection;
the identification module is used for identifying a plurality of audio data obtained by the multi-round endpoint detection according to the identification session request information;
and the second sending module is used for sending an identification result and target broadcast content corresponding to the identification result to the terminal, wherein the identification result is used by the terminal to cache, in order of identification serial number, the identification result of each piece of audio data among the identification results, and, after receiving a final identification result for the audio data from the server, to detect and call back the historical identification results cached in the cache queue, end any historical identification session in the cache queue for which no final identification result has been received, and call back all cached data corresponding to the audio data.
17. The apparatus of claim 16, wherein the identification module is specifically configured to:
and according to the identification session request information, carrying out concurrent identification on a plurality of audio data obtained by the multi-round endpoint detection.
18. The apparatus of claim 16, wherein the identification session request information comprises at least one of:
wake-up mode information, identification request index, identification sequence number.
19. The apparatus of claim 18, wherein the identification module is specifically configured to:
identifying each piece of audio data among the plurality of pieces of audio data respectively, according to the content included in the identification session request information, in combination with at least one of the following processing manners:
voiceprint detection, context semantic recognition and noise detection.
20. The apparatus of claim 19, wherein the identification result of each piece of audio data comprises: first indication information indicating whether the audio data is rejected, and/or second indication information indicating whether the audio data is semantically complete.
21. A terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program, when executed by the processor, implements the steps of the voice interaction method according to any of claims 1 to 5.
22. A server comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program, when executed by the processor, implements the steps of the voice interaction method according to any of claims 6 to 10.
23. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the voice interaction method as claimed in any one of claims 1 to 5, or the steps of the voice interaction method as claimed in any one of claims 6 to 10.
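For concreteness, the identification session request information of claims 3 and 8 (wake-up mode information, identification request index, identification serial number) might be assembled once per endpoint-detection round roughly as follows. The field names and the JSON transport are assumptions for illustration only; the patent does not specify a wire format:

```python
import itertools
import json

# Monotonically increasing identification request index (illustrative).
_request_index = itertools.count(1)

def build_session_request(wake_mode: str, sequence_number: int) -> str:
    """Build one identification session request for a round of endpoint
    detection (field names are hypothetical, not from the patent)."""
    return json.dumps({
        "wake_mode": wake_mode,                 # wake-up mode information
        "request_index": next(_request_index),  # identification request index
        "sequence_number": sequence_number,     # identification serial number
    })
```

Carrying the serial number in every request is what lets the terminal later reorder and cache the per-utterance results, since the server may return them concurrently and out of order.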
CN201910026638.5A 2019-01-11 2019-01-11 Voice interaction method, device, terminal and server Active CN109741753B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910026638.5A CN109741753B (en) 2019-01-11 2019-01-11 Voice interaction method, device, terminal and server


Publications (2)

Publication Number Publication Date
CN109741753A CN109741753A (en) 2019-05-10
CN109741753B true CN109741753B (en) 2020-07-28

Family

ID=66364514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910026638.5A Active CN109741753B (en) 2019-01-11 2019-01-11 Voice interaction method, device, terminal and server

Country Status (1)

Country Link
CN (1) CN109741753B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047471A (en) * 2019-05-13 2019-07-23 深圳市智宇盟科技有限公司 Voice awakening method
CN112037820B (en) * 2019-05-16 2023-09-05 杭州海康威视数字技术股份有限公司 Security alarm method, device, system and equipment
CN110223697B (en) * 2019-06-13 2022-04-22 思必驰科技股份有限公司 Man-machine conversation method and system
CN110335603A (en) * 2019-07-12 2019-10-15 四川长虹电器股份有限公司 Multi-modal exchange method applied to tv scene
CN111583919B (en) * 2020-04-15 2023-10-13 北京小米松果电子有限公司 Information processing method, device and storage medium
CN112084768A (en) * 2020-08-06 2020-12-15 珠海格力电器股份有限公司 Multi-round interaction method and device and storage medium
CN112243000B (en) * 2020-10-09 2023-04-25 北京达佳互联信息技术有限公司 Application data processing method and device, computer equipment and storage medium
CN112466304B (en) * 2020-12-03 2023-09-08 北京百度网讯科技有限公司 Offline voice interaction method, device, system, equipment and storage medium
CN114691844A (en) * 2020-12-31 2022-07-01 华为技术有限公司 Conversation task management method and device and electronic equipment

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000339134A (en) * 1999-05-27 2000-12-08 Nec Corp Voice synthesis system
US8090588B2 (en) * 2007-08-31 2012-01-03 Nokia Corporation System and method for providing AMR-WB DTX synchronization
CN103165130B (en) * 2013-02-06 2015-07-29 程戈 Speech text coupling cloud system
CN104916283A (en) * 2015-06-11 2015-09-16 百度在线网络技术(北京)有限公司 Voice recognition method and device
CN106558305B (en) * 2016-11-16 2020-06-02 北京云知声信息技术有限公司 Voice data processing method and device
CN108200122A (en) * 2017-12-08 2018-06-22 北京奇虎科技有限公司 A kind of HTTP data processing methods and device
EP3627501B1 (en) * 2018-05-18 2023-10-18 Shenzhen Aukey Smart Information Technology Co., Ltd. Headphones, wrist-worn voice interaction device and system

Also Published As

Publication number Publication date
CN109741753A (en) 2019-05-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant