CN111540357A - Voice processing method, device, terminal, server and storage medium - Google Patents

Voice processing method, device, terminal, server and storage medium

Info

Publication number
CN111540357A
Authority
CN
China
Prior art keywords
audio
voice
detected
end point
server
Prior art date
Legal status
Granted
Application number
CN202010315910.4A
Other languages
Chinese (zh)
Other versions
CN111540357B (en)
Inventor
杨香斌
Current Assignee
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd
Priority to CN202010315910.4A
Publication of CN111540357A
Application granted
Publication of CN111540357B
Legal status: Active
Anticipated expiration

Classifications

    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L2015/221: Announcement of recognition results
    • G10L2015/223: Execution procedure of a spoken command
    • Y02D30/70: Reducing energy consumption in wireless communication networks

Abstract

The application provides a voice processing method, apparatus, terminal, server and storage medium. The method comprises: collecting audio to be detected; and uploading a second audio to a server when a first audio is detected in the audio to be detected, wherein the first audio is the echo audio generated by a response word, the response word is played by the terminal in response to a wake-up word input by a user, the second audio is the audio located after the end point of the first audio in the audio to be detected, and the second audio is used by the server for voice endpoint detection. Because the terminal detects the first audio (the echo of the response word) and uploads only the second audio after its end point for voice endpoint detection, the audio on which the server performs endpoint detection does not contain the first audio. This prevents the server from misidentifying the echo's endpoints as the endpoints of the user's voice, and improves the accuracy of voice endpoint detection.

Description

Voice processing method, device, terminal, server and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech processing method, apparatus, terminal, server, and storage medium.
Background
With the rapid development of voice recognition technology, application scenarios for far-field voice interaction have become increasingly common: terminals such as smart televisions, smart speakers, smart home devices, smart vehicle-mounted terminals, smart robots and mobile phones can perform far-field voice interaction with users in order to provide services. In far-field voice interaction, the front endpoint and back endpoint of the user's voice are detected by a Voice Activity Detection (VAD) algorithm.
Generally, after receiving a wake-up word input by a user, the terminal plays a response word while starting audio collection, and uploads the collected audio to a server; the server identifies the front endpoint and back endpoint of the user's voice in the audio through a voice activity detection model based on deep learning.
However, the echo generated by the response word played by the terminal is sometimes collected by the terminal into that audio. The server then easily misidentifies the front and back endpoints of the echo as the front and back endpoints of the user's voice, so voice endpoint detection errors occur and the subsequent voice interaction goes wrong.
Disclosure of Invention
The embodiments of the present application provide a voice processing method, apparatus, terminal, server and storage medium to address the problem that voice endpoint detection is prone to error.
In a first aspect, an embodiment of the present application provides a speech processing method, which is applied to a terminal, and the method includes:
collecting audio to be detected;
and uploading a second audio to a server when a first audio is detected in the audio to be detected, wherein the first audio is the echo audio generated by a response word, and the response word is played by the terminal in response to a wake-up word input by a user;
the second audio is the audio located after the end point of the first audio in the audio to be detected, and the second audio is used by the server for voice endpoint detection.
In one possible embodiment, the method further comprises:
and determining whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word.
In a possible implementation, determining whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word includes:
extracting the audio features corresponding to the response word;
and when the audio features corresponding to the response word match the audio to be detected, determining that the first audio exists in the audio to be detected, wherein the first audio is the audio in the audio to be detected that matches the audio features corresponding to the response word.
In a possible implementation, determining whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word includes:
detecting whether the first audio exists in the audio to be detected through a first voice endpoint detection model based on deep learning, wherein the first voice endpoint detection model is trained with audio samples of the echo generated by the response word, and the audio samples contain the audio features corresponding to the response word.
In one possible embodiment, the method further comprises:
when it is detected that the first audio does not exist in the audio to be detected and a front endpoint of voice exists in the audio to be detected, uploading a third audio to the server so that the server detects the back endpoint of the voice in the third audio, wherein the third audio is the audio located after the front endpoint of the voice in the audio to be detected.
In one possible embodiment, collecting the audio to be detected includes:
after receiving the wake-up word, collecting the audio to be detected and playing the response word.
In one possible embodiment, the method further comprises:
receiving and displaying a voice recognition result sent by the server, wherein the voice recognition result is obtained by the server recognizing the voice after performing voice endpoint detection on the second audio to obtain the front endpoint and back endpoint of the voice in the second audio.
In a second aspect, an embodiment of the present application provides a speech processing method, which is applied to a server, and the method includes:
receiving a second audio sent by a terminal, wherein the second audio is the audio, in the audio to be detected collected by the terminal, located after the end point of a first audio, the first audio is the echo audio generated by a response word, and the response word is played by the terminal in response to a wake-up word input by a user;
and performing voice endpoint detection on the second audio.
In one possible embodiment, the voice endpoint detection of the second audio includes:
determining the starting point of the second audio as the front endpoint of the voice in the second audio, and detecting the back endpoint of the voice through a second voice endpoint detection model based on deep learning, wherein the second voice endpoint detection model is trained with audio samples containing back endpoints of voice.
In one possible embodiment, the voice endpoint detection of the second audio includes:
detecting the front endpoint and back endpoint of the voice in the second audio through a third voice endpoint detection model based on deep learning, wherein the third voice endpoint detection model is trained with audio samples containing front endpoints and back endpoints of voice.
In one possible implementation, after performing voice endpoint detection on the second audio, the method further includes:
extracting the voice from the second audio according to the detected front endpoint and back endpoint of the voice in the second audio;
recognizing the voice to obtain a voice recognition result;
and sending the voice recognition result to the terminal for display by the terminal.
In a third aspect, an embodiment of the present application provides a speech processing apparatus, which is applied to a terminal, and the apparatus includes:
the acquisition module is used for acquiring the audio to be detected;
the sending module is configured to upload a second audio to a server when a first audio is detected in the audio to be detected, wherein the first audio is the echo audio generated by a response word, and the response word is played by the terminal in response to a wake-up word input by a user;
the second audio is the audio located after the end point of the first audio in the audio to be detected, and the second audio is used by the server for voice endpoint detection.
In a possible embodiment, the apparatus further comprises: a detection module;
the detection module is configured to:
determining whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word.
In a possible implementation manner, the detection module is specifically configured to:
extracting the audio features corresponding to the response word;
and when the audio features corresponding to the response word match the audio to be detected, determining that the first audio exists in the audio to be detected, wherein the first audio is the audio in the audio to be detected that matches the audio features corresponding to the response word.
In a possible implementation manner, the detection module is specifically configured to:
detecting whether the first audio exists in the audio to be detected through a first voice endpoint detection model based on deep learning, wherein the first voice endpoint detection model is trained with audio samples of the echo generated by the response word, and the audio samples contain the audio features corresponding to the response word.
In a possible implementation manner, the sending module is further configured to:
when it is detected that the first audio does not exist in the audio to be detected and a front endpoint of voice exists in the audio to be detected, uploading a third audio to the server so that the server detects the back endpoint of the voice in the third audio, wherein the third audio is the audio located after the front endpoint of the voice in the audio to be detected.
In a possible embodiment, the acquisition module is specifically configured to:
after receiving the wake-up word, collecting the audio to be detected and playing the response word.
In a possible embodiment, the apparatus further comprises: a display module;
the display module is used for:
receiving and displaying a voice recognition result sent by the server, wherein the voice recognition result is obtained by the server recognizing the voice after performing voice endpoint detection on the second audio to obtain the front endpoint and back endpoint of the voice in the second audio.
In a fourth aspect, an embodiment of the present application provides a speech processing apparatus, which is applied to a server, and the apparatus includes:
the receiving module is configured to receive a second audio sent by the terminal, wherein the second audio is the audio, in the audio to be detected collected by the terminal, located after the end point of a first audio, the first audio is the echo audio generated by a response word, and the response word is played by the terminal in response to a wake-up word input by a user;
and the processing module is used for carrying out voice endpoint detection on the second audio.
In a possible implementation manner, the processing module is specifically configured to:
determining the starting point of the second audio as the front endpoint of the voice in the second audio, and detecting the back endpoint of the voice through a second voice endpoint detection model based on deep learning, wherein the second voice endpoint detection model is trained with audio samples containing back endpoints of voice.
In a possible implementation manner, the processing module is specifically configured to:
detecting the front endpoint and back endpoint of the voice in the second audio through a third voice endpoint detection model based on deep learning, wherein the third voice endpoint detection model is trained with audio samples containing front endpoints and back endpoints of voice.
In a possible implementation, the processing module is further configured to:
extracting the voice from the second audio according to the detected front endpoint and back endpoint of the voice in the second audio;
recognizing the voice to obtain a voice recognition result;
and sending the voice recognition result to the terminal for display by the terminal.
In a fifth aspect, an embodiment of the present application provides a terminal, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the speech processing method described in the first aspect and its various possible implementations.
In a sixth aspect, an embodiment of the present application provides a server, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored by the memory to cause the at least one processor to perform the speech processing method as described above in the second aspect and various possible embodiments of the second aspect.
In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the speech processing method according to the first aspect and its various possible implementations is implemented.
In an eighth aspect, embodiments of the present application provide a computer-readable storage medium, where computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the speech processing method according to the second aspect and various possible implementation manners of the second aspect is implemented.
According to the voice processing method, apparatus, terminal, server and storage medium provided by the embodiments, the terminal collects audio to be detected and, when a first audio is detected in the audio to be detected, uploads a second audio to a server, wherein the first audio is the echo audio generated by a response word that the terminal plays in response to a wake-up word input by a user, and the second audio is the audio located after the end point of the first audio in the audio to be detected and is used by the server for voice endpoint detection. Because the terminal detects the first audio (the echo of the response word) and uploads only the second audio after its end point for voice endpoint detection, the audio on which the server performs voice endpoint detection does not contain the first audio; this prevents the server from misidentifying the echo's endpoints as the endpoints of the user's voice, and improves the accuracy of voice endpoint detection.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present application, and a person skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of a scenario of a speech processing method according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of a speech processing method according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of a speech processing method according to another embodiment of the present application;
Fig. 4 is a schematic flowchart of a speech processing method according to yet another embodiment of the present application;
Fig. 5 is a signaling interaction diagram of a speech processing method according to yet another embodiment of the present application;
Fig. 6A is a schematic flowchart of a conventional speech processing method;
Fig. 6B is a schematic flowchart of a speech processing method according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a speech processing apparatus according to another embodiment of the present application;
Fig. 9 is a schematic diagram of the hardware structure of a terminal according to an embodiment of the present application;
Fig. 10 is a schematic diagram of the hardware structure of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
Fig. 1 is a schematic diagram of a scenario of a speech processing method according to an embodiment of the present application. As shown in Fig. 1, the scenario includes a terminal 11 and a server 12. The terminal 11 may be a smart television, a smart speaker, a smart home device, a smart vehicle-mounted terminal, a smart robot, a mobile phone, or the like, which is not limited here. The terminal 11 can perform far-field voice interaction with the user, capture audio containing the voice input by the user, and upload the captured audio to the server 12. The server 12 performs voice endpoint detection and voice recognition on the audio and returns the recognition result to the terminal 11. The terminal 11 then interacts with the user by playing or displaying the result, or executes the operation corresponding to the recognition result.
For example, suppose the terminal 11 is a smart television. When the user wants to use it, the user may speak the wake-up word "ABC". After receiving "ABC", the smart television wakes from its sleep state, plays the corresponding response word "I'm here", and starts voice collection. On hearing "I'm here", the user knows the smart television has been woken and makes the subsequent voice input, for example "What's the weather like today". The smart television collects audio containing the user voice "What's the weather like today" and uploads it to the server; the server performs voice endpoint detection on the audio, detects the front and back endpoints of "What's the weather like today", performs semantic recognition on the voice between the two endpoints, and feeds the recognition result back to the smart television. The smart television can then query today's weather according to the recognition result and inform the user by voice playback and/or on-screen display. Optionally, the smart television may also show the voice recognition result on the display screen so that the user knows the recognized voice content.
However, the response word played by the terminal 11 may generate an echo, for example the echo produced in a room by the response word played by a smart television. The echo is picked up into the audio by the terminal 11 and uploaded to the server. When the server performs voice endpoint detection, it first detects the front and back endpoints of the audio corresponding to the echo and misidentifies them as the front and back endpoints of the user's voice, so the server subsequently performs voice recognition on the echo audio, and the following voice interaction goes wrong. For example, the audio collected by the smart television contains both the echo generated by the response word "I'm here" and the user voice "What's the weather like today", with the echo located before the user voice. During voice endpoint detection the server first detects the front and back endpoints of the audio corresponding to "I'm here", performs semantic recognition on that audio, and mistakenly recognizes the user's voice as "I'm here"; the smart television therefore displays or plays a reply to "I'm here", and the voice interaction is wrong. In addition, in a single-round interaction scenario, once the server has detected the front and back endpoints of the audio corresponding to "I'm here", it no longer recognizes the subsequent audio, so the user's voice is not processed at all.
In this embodiment, the terminal detects the first audio, i.e. the echo generated by the response word, and uploads the second audio after the first audio's end point to the server for voice endpoint detection. The audio on which the server performs voice endpoint detection therefore does not contain the first audio, which prevents the server from misidentifying the echo's endpoints as the endpoints of the user's voice and improves the accuracy of voice endpoint detection. The following embodiments illustrate this.
Fig. 2 is a flowchart illustrating a speech processing method according to an embodiment of the present application. The method may be performed by the terminal described above. As shown in fig. 2, the method includes:
S201, collecting the audio to be detected.
In this embodiment, the terminal collects the audio to be detected. The audio to be detected may include one or more of ambient-noise audio, the audio of the echo generated by the response word played by the terminal, and the user's voice. Taking the smart television above as an example, the audio it collects may include indoor noise, the echo of the response word "I'm here", and the user voice "What's the weather like today".
Optionally, after the wake-up word is received, the audio to be detected is collected and the response word is played.
In this embodiment, the terminal interacts with the user after being woken by the wake-up word. The user first speaks the wake-up word to wake the terminal. After receiving the wake-up word input by the user, the terminal starts to collect the audio to be detected and plays the response word to inform the user that it has been woken. The response word may be chosen as needed and is not limited here; for example, it may be "I'm here", "Hello", "Good morning", or "How may I help you".
S202, uploading a second audio to a server when a first audio is detected in the audio to be detected, wherein the first audio is the echo audio generated by a response word, and the response word is played by the terminal in response to a wake-up word input by the user; the second audio is the audio located after the end point of the first audio in the audio to be detected, and the second audio is used by the server for voice endpoint detection.
In this embodiment, the terminal plays the response word after receiving the wake-up word input by the user, and the first audio is the audio of the echo generated by the response word. If the response word generates an echo, for example when the terminal and the user are in an enclosed space such as a room or a car, the first audio exists in the audio to be detected collected by the terminal; if the response word generates no echo, for example in an open outdoor space, the first audio does not exist in the audio to be detected.
The terminal detects whether the first audio exists in the audio to be detected, and if so, uploads the second audio to the server. When the terminal detects that the first audio exists, it determines the end point of the first audio and uploads the audio located after that end point as the second audio. The server may then perform voice endpoint detection on the second audio to detect the front and back endpoints of the voice in it, extract the voice between the two endpoints, recognize it to obtain a voice recognition result, and send the result to the terminal. The terminal may display the voice recognition result on a screen or push reply information to the user according to it. For example, if the voice recognition result is "What's the weather like today", the terminal may display that text on the screen, and may also query today's weather and reply to the user by on-screen display or voice broadcast. Through this local detection, when the first audio exists in the audio to be detected, only the second audio after the first audio's end point is uploaded to the server; the echo audio is never handed to the server for endpoint identification, so voice endpoint detection errors caused by the echo are avoided.
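A minimal Python sketch of this terminal-side flow is given below; it is illustrative only, and `echo_endpoint_detected` and `server.upload` are assumed helper interfaces rather than anything defined in the patent:

```python
def handle_after_wakeup(mic_stream, server):
    """Buffer frames until the echo of the response word (the first audio)
    has ended, then stream everything after its end point (the second
    audio) to the server for voice endpoint detection."""
    buffered = []
    echo_ended = False
    for frame in mic_stream:          # e.g. 20 ms PCM frames
        if echo_ended:
            server.upload(frame)      # second audio: after the echo's end point
        else:
            buffered.append(frame)
            # hypothetical detector for the first audio's end point
            if echo_endpoint_detected(buffered):
                echo_ended = True     # later frames form the second audio
```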
In this embodiment of the application, the terminal collects audio to be detected and, when a first audio is detected in it, uploads a second audio to a server, wherein the first audio is the echo audio generated by a response word that the terminal plays in response to a wake-up word input by the user, and the second audio is the audio located after the end point of the first audio and is used by the server for voice endpoint detection. Because the terminal detects the echo of the response word and uploads only the audio after its end point, the audio used by the server for voice endpoint detection does not contain the first audio; this prevents the server from misidentifying the echo's endpoints as the endpoints of the user's voice, and improves the accuracy of voice endpoint detection.
Optionally, the method may further include:
when it is detected that the first audio does not exist in the audio to be detected and a front endpoint of voice exists in the audio to be detected, uploading a third audio to the server so that the server detects the back endpoint of the voice in the third audio, wherein the third audio is the audio located after the front endpoint of the voice in the audio to be detected.
In this embodiment, the voice refers to the voice input by the user, for example "What's the weather like today". The terminal detects whether the first audio exists in the audio to be detected; if not, it detects whether a front endpoint of voice exists in the audio to be detected, and if so, it uploads the third audio, i.e. the audio located after the front endpoint, to the server. The server then detects the back endpoint of the voice in the third audio.
For example, suppose the terminal's environment is such that the played response word generates no echo, so the first audio does not exist in the collected audio to be detected. With the user input "What's the weather like today", when the terminal detects that the first audio does not exist but the front endpoint of "What's the weather like today" does, it uploads the audio after that front endpoint as the third audio to the server; the server detects the back endpoint of "What's the weather like today" in the third audio, and can then perform subsequent processing such as voice recognition according to the detected front and back endpoints.
Optionally, the terminal may locally adopt a VAD algorithm based on energy and zero-crossing rate to detect the front endpoint of the voice in the audio to be detected, while the server detects the back endpoint of the voice in the third audio through a deep-learning-based VAD model. Detecting the front endpoint with an energy and zero-crossing-rate VAD prevents sudden transient noise from interfering with voice endpoint detection; detecting the back endpoint with a deep-learning VAD model allows the model to be optimized through real-time updates, making it suitable for detecting back endpoints in more scenarios. Combining the two detection modes for the front and back endpoints respectively improves the accuracy of voice endpoint detection.
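For illustration, a minimal sketch of such a local energy and zero-crossing-rate front-endpoint detector follows; the frame size and thresholds are assumptions, not values taken from the patent:

```python
import numpy as np

FRAME = 320  # 20 ms frames at 16 kHz (illustrative)

def short_time_energy(frame):
    return float(np.mean(frame.astype(np.float64) ** 2))

def zero_crossing_rate(frame):
    signs = np.sign(frame.astype(np.float64))
    return float(np.mean(np.abs(np.diff(signs))) / 2)

def detect_front_endpoint(pcm, energy_th=2e5, zcr_th=0.15, min_frames=3):
    """Return the sample index of the voice front endpoint, or None.
    Requiring several consecutive active frames guards against the
    sudden transient noise mentioned in the description."""
    run = 0
    for i in range(0, len(pcm) - FRAME + 1, FRAME):
        frame = pcm[i:i + FRAME]
        active = (short_time_energy(frame) > energy_th
                  or zero_crossing_rate(frame) > zcr_th)
        run = run + 1 if active else 0
        if run >= min_frames:
            return i - (min_frames - 1) * FRAME  # start of the active run
    return None
```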
Optionally, after S202, the method may further include:
receiving and displaying a voice recognition result sent by the server, wherein the voice recognition result is obtained by the server recognizing the voice after performing voice endpoint detection on the second audio to obtain the front endpoint and back endpoint of the voice in the second audio.
In this embodiment, after performing voice endpoint detection on the second audio to obtain the front and back endpoints of the voice in it, the server may extract the voice between the two endpoints, recognize it to obtain a voice recognition result, and send the result to the terminal. The terminal may display the voice recognition result on a screen or push reply information to the user according to it. For example, if the result is "What's the weather like today", the terminal may display that text on the screen, and may also query today's weather and show it on the screen. Displaying the voice recognition result on the terminal lets the user check it and input the voice again if the result is wrong, improving user experience.
Fig. 3 is a flowchart illustrating a speech processing method according to another embodiment of the present application. The embodiment describes in detail a specific implementation process for detecting whether the first audio exists in the audio to be detected. As shown in fig. 3, the method includes:
S301, collecting the audio to be detected.
In this embodiment, S301 is similar to S201 in the embodiment of fig. 2, and is not described here again.
S302, determining whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word.
In this embodiment, the terminal may determine whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word. The audio features corresponding to the response word can be obtained from audio samples containing the response word. The types of audio features may include, but are not limited to, one or more of energy, zero-crossing rate, frequency-domain features, cepstral features, and harmonic features.
In one possible implementation, S302 may include:
extracting the audio features corresponding to the response word;
and when the audio features corresponding to the response word match the audio to be detected, determining that the first audio exists in the audio to be detected, wherein the first audio is the audio in the audio to be detected that matches those features.
In this embodiment, the terminal may extract the audio features corresponding to the response word from an audio sample containing the response word. When detecting whether the first audio exists, it matches these features against the audio to be detected; if audio matching the features exists, the terminal determines that the first audio exists, identifies the matching audio as the first audio, and thereby determines the first audio's end point.
For example, suppose the audio features corresponding to the response word include an energy feature and a zero-crossing-rate feature. The terminal can match these features against the audio to be detected with a voice detection algorithm based on energy and zero-crossing rate, and determine whether the first audio exists. Feature matching lets the terminal detect the first audio in the audio to be detected quickly, as sketched below.
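A minimal sketch of this feature matching, under the assumption of simple per-frame log-energy and zero-crossing-rate features and an illustrative distance threshold:

```python
import numpy as np

def frame_features(pcm, frame=320):
    """Per-frame (log-energy, zero-crossing-rate) features."""
    feats = []
    for i in range(0, len(pcm) - frame + 1, frame):
        f = pcm[i:i + frame].astype(np.float64)
        log_energy = np.log(np.mean(f ** 2) + 1e-9)
        zcr = np.mean(np.abs(np.diff(np.sign(f)))) / 2
        feats.append((log_energy, zcr))
    return np.array(feats)

def find_first_audio(audio, template, dist_th=1.0, frame=320):
    """Slide the response word's feature template over the audio to be
    detected; return (start, end) sample indices of the matched first
    audio, or None. The end index is the echo's end point, after which
    the second audio begins."""
    a = frame_features(audio, frame)
    n = len(template)
    for i in range(len(a) - n + 1):
        if np.mean(np.linalg.norm(a[i:i + n] - template, axis=1)) < dist_th:
            return i * frame, (i + n) * frame
    return None

# template = frame_features(response_word_pcm)  # precomputed from a sample of the response word
```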
In one possible implementation, S302 may include:
detecting whether the first audio exists in the audio to be detected through a first voice endpoint detection model based on deep learning, wherein the first voice endpoint detection model is trained with audio samples of the echo generated by the response word, and the audio samples contain the audio features corresponding to the response word.
In this embodiment, a first voice endpoint detection model based on deep learning may be built in advance, a number of audio samples of the echo generated by the response word may be collected to form a training set, and the model may be trained on that training set. The trained first voice endpoint detection model is preset on the terminal, which uses it to detect whether the first audio exists in the audio to be detected.
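As an illustrative sketch only, such a frame-level deep detector might look like the following (PyTorch is assumed here; the architecture and feature shapes are not specified by the patent):

```python
import torch
import torch.nn as nn

class EchoFrameDetector(nn.Module):
    """Minimal per-frame classifier: given a window of spectral features,
    predict whether each frame belongs to the response word's echo."""
    def __init__(self, n_features=40, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):            # x: (batch, frames, n_features)
        h, _ = self.rnn(x)
        return torch.sigmoid(self.head(h)).squeeze(-1)  # echo prob per frame

# Training would pair these probabilities with frame labels derived from
# the collected echo samples of the response word (binary cross-entropy loss).
```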
S303, uploading a second audio to a server when a first audio is detected in the audio to be detected, wherein the first audio is the echo audio generated by a response word, and the response word is played by the terminal in response to a wake-up word input by the user; the second audio is the audio located after the end point of the first audio in the audio to be detected, and the second audio is used by the server for voice endpoint detection.
In this embodiment, S303 is similar to S202 in the embodiment of Fig. 2 and is not described again here.
In this embodiment, whether the first audio exists in the audio to be detected is determined according to the audio features corresponding to the response word. The first audio can thus be detected accurately, which prevents it from being uploaded to the server and interfering with voice endpoint detection, and so improves the accuracy of voice endpoint detection.
Fig. 4 is a flowchart illustrating a speech processing method according to another embodiment of the present application. The method may be performed by a server. As shown in fig. 4, the method includes:
S401, receiving a second audio sent by a terminal, wherein the second audio is the audio, in the audio to be detected collected by the terminal, located after the end point of a first audio, the first audio is the echo audio generated by a response word, and the response word is played by the terminal in response to a wake-up word input by a user.
S402, performing voice endpoint detection on the second audio.
In this embodiment, the terminal collects the audio to be detected and, when it detects that the first audio exists, uploads the second audio after the first audio's end point to the server. The server receives the second audio sent by the terminal and performs voice endpoint detection on it.
In this embodiment of the application, the server receives a second audio sent by the terminal, wherein the second audio is the audio, in the audio to be detected collected by the terminal, located after the end point of the first audio, the first audio is the echo audio generated by a response word, and the response word is played by the terminal in response to a wake-up word input by the user; the server then performs voice endpoint detection on the second audio. Because the terminal detects the echo of the response word and uploads only the audio after its end point, the audio on which the server performs voice endpoint detection does not contain the first audio; this prevents the server from misidentifying the echo's endpoints as the endpoints of the user's voice, and improves the accuracy of voice endpoint detection.
In one possible implementation, S402 may include:
determining the starting point of the second audio as the front endpoint of the voice in the second audio, and detecting the back endpoint of the voice through a second voice endpoint detection model based on deep learning, wherein the second voice endpoint detection model is trained with audio samples containing back endpoints of voice.
In this embodiment, the server may directly use the starting point of the second audio as the front endpoint of the voice and detect only the back endpoint. Since the starting point of the second audio is the end point of the first audio, the second audio contains no echo audio, so detecting only the back endpoint is enough for subsequent voice recognition. For example, for the voice "What's the weather like today", the second audio may or may not contain a stretch of silence or noise before the voice; since neither silence nor noise is voice, it does not affect recognition, and the starting point of the second audio can be taken directly as the voice's front endpoint.
A second voice endpoint detection model based on deep learning can be built in advance, and audio samples containing back endpoints of voice can be collected to form a training set on which the model is trained. The samples may be voices such as "What's the weather like today", "Please open channel X" or "What are today's restricted license-plate numbers", so that the trained model can accurately detect the back endpoint of voice. While the second voice endpoint detection model is in use, it can be updated and optimized with newly collected user voice. The server may detect the back endpoint of the voice in the second audio through this model.
In this embodiment, the server determines the starting point of the second audio as the front endpoint of the voice and detects only the back endpoint through the deep-learning-based second voice endpoint detection model, which speeds up voice endpoint detection, improves response speed, and improves user experience.
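A sketch of this server-side step, assuming a hypothetical deep VAD model that outputs a per-frame speech probability for the second audio; the hangover length and threshold are illustrative:

```python
def detect_back_endpoint(speech_probs, frame_ms=20, hangover_ms=400, th=0.5):
    """The front endpoint is taken as 0, the start of the second audio.
    The back endpoint is declared once `hangover_ms` of continuous
    non-speech follows the last speech frame."""
    hang = hangover_ms // frame_ms
    silent, last_speech = 0, None
    for i, p in enumerate(speech_probs):
        if p >= th:
            last_speech, silent = i, 0
        elif last_speech is not None:
            silent += 1
            if silent >= hang:
                return (last_speech + 1) * frame_ms  # back endpoint in ms
    return None  # utterance not finished yet
```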
In one possible implementation, S402 may include:
detecting the front endpoint and back endpoint of the voice in the second audio through a third voice endpoint detection model based on deep learning, wherein the third voice endpoint detection model is trained with audio samples containing front endpoints and back endpoints of voice.
In this embodiment, a third voice endpoint detection model based on deep learning may be built in advance, and audio samples containing front and back endpoints of voice may be collected to form a training set on which the model is trained. The samples may be voices such as "What's the weather like today", "Please open channel X" or "What are today's restricted license-plate numbers", so that the trained third voice endpoint detection model can accurately detect the front and back endpoints of voice. While the model is in use, it can be updated and optimized with newly collected user voice. The server may detect the front and back endpoints of the voice in the second audio through this model.
The server can detect both endpoints of the voice in the second audio because a stretch of silence or noise may exist between the starting point of the second audio and the voice's front endpoint. Detecting the front endpoint with the deep-learning-based third model removes this silence or noise, and the model's characteristics allow clean voice to be extracted more accurately from stationary non-speech noise; the front endpoint used for recognition therefore neither cuts off voice nor brings in extra noise, further improving the accuracy of voice recognition.
Optionally, after S402, the method may further include:
extracting the voice from the second audio according to the detected front endpoint and back endpoint of the voice in the second audio;
recognizing the voice to obtain a voice recognition result;
and sending the voice recognition result to the terminal for display by the terminal.
For example, if the voice input by the user is "What's the weather like today", the server extracts that voice from the second audio according to its detected front and back endpoints and recognizes it to obtain the voice recognition result "What's the weather like today". The terminal can display the text on the screen, and can also query today's weather and show it on the screen.
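A sketch of this extraction-and-recognition step; `asr` stands in for whatever recognizer the server uses and is an assumption, not part of the patent:

```python
def recognize_between_endpoints(second_audio, front_ms, back_ms, asr, sr=16000):
    """Cut the voice between the detected endpoints out of the second audio
    and transcribe it; the resulting text is sent back to the terminal."""
    a = front_ms * sr // 1000
    b = back_ms * sr // 1000
    voice = second_audio[a:b]
    return asr.transcribe(voice)  # e.g. "What's the weather like today"
```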
The server sends the voice recognition result to the terminal for display, which lets the user check the result and input the voice again if it is wrong, improving user experience.
Fig. 5 is a signaling interaction diagram of a speech processing method according to still another embodiment of the present application. The method is executed jointly by the terminal and the server described above. As shown in Fig. 5, the method includes:
S501, after receiving the wake-up word, the terminal collects the audio to be detected and plays the response word.
S502, when the terminal detects that a first audio exists in the audio to be detected, wherein the first audio is the audio of the echo generated by the response word, the terminal uploads a second audio to the server, the second audio being the audio located after the end point of the first audio in the audio to be detected.
S503, the server performs voice endpoint detection on the second audio to obtain the front endpoint and back endpoint of the voice in the second audio.
S504, the server extracts the voice from the second audio according to the detected front and back endpoints and recognizes it to obtain a voice recognition result.
And S505, the server sends the voice recognition result to the terminal.
S506, the terminal displays the voice recognition result.
The speech processing method provided in this embodiment is similar to the method embodiments executed by the terminal and by the server; its implementation principle and technical effects are similar and are not described again here.
The following describes an implementation example of a conventional speech processing method in comparison with an implementation example of the speech processing method provided in the present application. Fig. 6A is a flowchart illustrating a conventional speech processing method, and fig. 6B is a flowchart illustrating a speech processing method according to an embodiment of the present application.
Referring to Fig. 6A, the flow of the conventional speech processing method is as follows: after being woken successfully, the terminal plays the response word "I'm here" and starts collecting the audio to be detected, uploading it to the server in real time. The server detects the front endpoint of the voice in the audio through VAD; once the front endpoint is detected, it starts voice recognition on the audio after the front endpoint and feeds the recognition result back to the terminal for real-time presentation, ending VAD and voice recognition when the back endpoint of the voice is detected. Because echo cancellation cannot remove the echo completely, if the response word generates an echo, the echo audio is collected into the audio to be detected and uploaded to the server, so the server mistakes the echo for the user's voice, and both voice endpoint detection and voice recognition go wrong.
Referring to Fig. 6B, the flow of the voice processing method provided in this embodiment is as follows: after being woken successfully, the terminal plays the response word "I'm here" and starts collecting the audio to be detected. The terminal detects through VAD whether the echo audio of "I'm here" exists in the audio to be detected; if it does, the terminal starts uploading the collected audio to the server in real time only after the end point of that echo audio. The server performs voice recognition on the received audio and feeds the result back to the terminal for real-time presentation, ending VAD and voice recognition once it detects the back endpoint of the voice through VAD. In this embodiment, the terminal locally detects the echo audio in the audio to be detected through VAD and uploads only the audio after the echo to the server, which avoids the server mistaking the echo for the user's voice and improves the accuracy of voice endpoint detection. The terminal's local VAD is combined with the server's cloud VAD: the local VAD can use a voice endpoint detection method based on energy and zero-crossing rate, while the cloud VAD can use one based on deep learning. The local VAD removes the interference of the echo with endpoint detection and can also remove interference from sudden transient noise, while the cloud VAD model can be optimized to recognize the back endpoint accurately. The accuracy of voice endpoint detection is thus improved while guaranteeing detection speed and occupying as little of the terminal's local computing resources as possible.
Fig. 7 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application. The voice processing device is applied to the terminal. As shown in fig. 7, the speech processing apparatus 70 includes: an acquisition module 701 and a sending module 702.
The acquisition module 701 is used for acquiring audio to be detected.
A sending module 702, configured to upload a second audio to a server when a first audio is detected in the audio to be detected, wherein the first audio is the echo audio generated by a response word, and the response word is played by the terminal in response to a wake-up word input by a user.
The second audio is the audio located after the end point of the first audio in the audio to be detected, and the second audio is used by the server for voice endpoint detection.
In this embodiment of the application, the acquisition module collects the audio to be detected, and the sending module uploads a second audio to a server when a first audio is detected in the audio to be detected, wherein the first audio is the echo audio generated by a response word that the terminal plays in response to a wake-up word input by a user, and the second audio is the audio located after the end point of the first audio and is used by the server for voice endpoint detection. Because the terminal detects the echo of the response word and uploads only the audio after its end point, the audio used by the server for voice endpoint detection does not contain the first audio; this prevents the server from misidentifying the echo's endpoints as the endpoints of the user's voice, and improves the accuracy of voice endpoint detection.
Optionally, the apparatus further comprises a detection module.
The detection module is configured to:
determining whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word.
Optionally, the detection module is specifically configured to:
extracting the audio features corresponding to the response word;
and when the audio features corresponding to the response word match the audio to be detected, determining that the first audio exists in the audio to be detected, wherein the first audio is the audio in the audio to be detected that matches the audio features corresponding to the response word.
Optionally, the detection module is specifically configured to:
detecting whether the first audio exists in the audio to be detected through a first voice endpoint detection model based on deep learning, wherein the first voice endpoint detection model is trained with audio samples of the echo generated by the response word, and the audio samples contain the audio features corresponding to the response word.
Optionally, the sending module 702 is further configured to:
when it is detected that the first audio does not exist in the audio to be detected and a front endpoint of voice exists in the audio to be detected, uploading a third audio to the server so that the server detects the back endpoint of the voice in the third audio, wherein the third audio is the audio located after the front endpoint of the voice in the audio to be detected.
Optionally, the acquisition module 701 is specifically configured to:
collect the audio to be detected and play the response word after receiving the wake-up word.
Optionally, the apparatus further comprises: a display module;
the display module is used for:
receive and display a voice recognition result sent by the server, where the voice recognition result is obtained by the server performing voice endpoint detection on the second audio to obtain the front end point and back end point of the voice in the second audio and then recognizing the voice.
The speech processing apparatus provided in the embodiment of the present application may be configured to execute the method embodiments in which the terminal is the execution subject; the implementation principles and technical effects are similar and are not repeated here.
Fig. 8 is a schematic structural diagram of a speech processing apparatus according to yet another embodiment of the present application. The voice processing device is applied to a server. As shown in fig. 8, the speech processing apparatus 80 includes:
The receiving module 801 is configured to receive a second audio sent by a terminal, where the second audio is the audio, collected by the terminal, located after the end point of a first audio in the audio to be detected, the first audio is the audio of an echo generated by a response word, and the response word is used by the terminal to respond to a wake-up word input by a user.
A processing module 802, configured to perform voice endpoint detection on the second audio.
In the embodiment of the application, the receiving module receives a second audio sent by the terminal, where the second audio is the audio, collected by the terminal, located after the end point of a first audio in the audio to be detected, the first audio is the audio of an echo generated by a response word, and the response word is used by the terminal to respond to a wake-up word input by a user; the processing module then performs voice endpoint detection on the second audio. Because the terminal detects the first audio of the echo generated by the response word and uploads only the second audio after the end point of the first audio to the server for voice endpoint detection, the audio on which the server performs voice endpoint detection does not contain the first audio. This avoids the server mistakenly identifying the end point of the echo as the end point of the user's voice and improves the accuracy of voice endpoint detection.
Optionally, the processing module 802 is specifically configured to:
determine the starting point of the second audio as the front end point of the voice in the second audio, and detect the back end point of the voice through a second voice endpoint detection model based on deep learning, where the second voice endpoint detection model is trained on audio samples containing back end points of voice. A sketch of this back-end-point logic is given below.
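A minimal sketch of this back-end-point logic, assuming the second model emits a per-frame speech probability; the threshold and hangover length are illustrative choices, not values from the patent.

def detect_back_endpoint(speech_probs, thr=0.5, hangover_frames=30):
    # The second audio begins at the front end point of the voice, so only the
    # back end point needs to be found: it is declared once enough consecutive
    # frames fall below the speech-probability threshold.
    silent = 0
    for i, p in enumerate(speech_probs):
        silent = silent + 1 if p < thr else 0
        if silent >= hangover_frames:        # e.g. 30 frames of non-speech
            return i - hangover_frames + 1   # first silent frame = back end point
    return None                              # voice has not ended yet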
Optionally, the processing module 802 is specifically configured to:
detect the front end point and back end point of the voice in the second audio through a third voice endpoint detection model based on deep learning, where the third voice endpoint detection model is trained on audio samples containing front end points and back end points of voice.
Optionally, the processing module 802 is further configured to:
extract the voice from the second audio according to the detected front end point and back end point of the voice in the second audio;
recognize the voice to obtain a voice recognition result;
send the voice recognition result to the terminal, where the voice recognition result is used for display by the terminal. A sketch of these steps is given below.
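The three steps above can be summarized in a short sketch; recognize and send_to_terminal are hypothetical stand-ins for the server's speech recognizer and its channel back to the terminal.

def handle_second_audio(samples, front, back, recognize, send_to_terminal):
    voice = samples[front:back]   # extract by the detected end points
    result = recognize(voice)     # speech recognition on the voice only
    send_to_terminal(result)      # the terminal displays this result
    return result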
The speech processing apparatus provided in the embodiment of the present application may be configured to execute the method embodiments in which the server is the execution subject; the implementation principles and technical effects are similar and are not repeated here.
Fig. 9 is a schematic hardware structure diagram of a terminal according to an embodiment of the present application. As shown in fig. 9, the terminal 90 provided in this embodiment includes: at least one processor 901 and a memory 902. The terminal 90 further comprises a communication component 903. The processor 901, the memory 902, and the communication component 903 are connected by a bus 904.
Optionally, the terminal 90 may also include an audio component and/or a multimedia component. The audio component is configured to output and/or input audio signals. For example, the audio component includes a microphone configured to receive external audio signals when the terminal is in an operational mode, such as a speech recognition mode. The received audio signal may further be stored in the memory or transmitted via the communication component 903. In some embodiments, the audio component further comprises a speaker for outputting audio signals. The multimedia component includes a screen providing an output interface between the terminal 90 and the user. In some embodiments, the screen may include a liquid crystal display and a touch panel. If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
In a specific implementation process, the at least one processor 901 executes the computer-executable instructions stored in the memory 902, so that the at least one processor 901 performs the voice processing method with the terminal as the execution subject as described above.
For the specific implementation process of the processor 901, reference may be made to the above method embodiments; the implementation principles and technical effects are similar and are not repeated here.
Fig. 10 is a schematic hardware structure diagram of a server according to another embodiment of the present application. As shown in fig. 10, the server 100 provided in this embodiment includes: at least one processor 1001 and a memory 1002. The server 100 further comprises a communication component 1003. The processor 1001, the memory 1002, and the communication component 1003 are connected by a bus 1004.
In a specific implementation process, the at least one processor 1001 executes the computer-executable instructions stored in the memory 1002, so that the at least one processor 1001 executes the voice processing method with the server as the execution subject.
For the specific implementation process of the processor 1001, reference may be made to the above method embodiments; the implementation principles and technical effects are similar and are not repeated here.
In the embodiments shown in fig. 9 and fig. 10, it should be understood that the processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in this application may be directly executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
The memory may comprise high speed RAM memory and may also include non-volatile storage NVM, such as at least one disk memory.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The application also provides a computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the voice processing method taking the terminal as an execution subject is realized.
The application also provides a computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the voice processing method taking the server as an execution subject is realized.
The readable storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. Readable storage media can be any available media that can be accessed by a general purpose or special purpose computer.
An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the readable storage medium may also reside as discrete components in the apparatus.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (26)

1. A voice processing method applied to a terminal, the method comprising:
collecting audio to be detected;
uploading a second audio to a server when detecting that a first audio exists in the audio to be detected, wherein the first audio is the audio of an echo generated by a response word, and the response word is used by the terminal to respond to a wake-up word input by a user;
wherein the second audio is the audio located after the end point of the first audio in the audio to be detected, and the second audio is used by the server for voice endpoint detection.
2. The method of claim 1, further comprising:
determining whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word.
3. The method according to claim 2, wherein determining whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word comprises:
extracting the audio features corresponding to the response word;
when the audio features corresponding to the response word match the audio to be detected, determining that the first audio exists in the audio to be detected, wherein the first audio is the portion of the audio to be detected that matches the audio features corresponding to the response word.
4. The method according to claim 2, wherein determining whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word comprises:
detecting whether the first audio exists in the audio to be detected through a first voice endpoint detection model based on deep learning, wherein the first voice endpoint detection model is trained on audio samples of the echo generated by the response word, and the audio samples contain the audio features corresponding to the response word.
5. The method according to any one of claims 1-4, further comprising:
uploading a third audio to a server when detecting that the first audio does not exist in the audio to be detected and a front end point of voice exists in the audio to be detected, so that the server detects the back end point of the voice in the third audio, wherein the third audio is the audio after the front end point of the voice in the audio to be detected.
6. The method according to any one of claims 1-4, wherein the acquiring audio to be detected comprises:
after receiving the wake-up word, collecting the audio to be detected and playing the response word.
7. The method according to any one of claims 1-4, further comprising:
receiving and displaying a voice recognition result sent by the server, wherein the voice recognition result is obtained by the server performing voice endpoint detection on the second audio to obtain the front end point and back end point of the voice in the second audio and then recognizing the voice.
8. A speech processing method applied to a server, the method comprising:
receiving a second audio sent by a terminal, wherein the second audio is the audio, collected by the terminal, located after the end point of a first audio in the audio to be detected, the first audio is the audio of an echo generated by a response word, and the response word is used by the terminal to respond to a wake-up word input by a user;
and performing voice endpoint detection on the second audio.
9. The method of claim 8, wherein performing voice endpoint detection on the second audio comprises:
determining a starting point of the second audio as the front end point of the voice in the second audio, and detecting the back end point of the voice through a second voice endpoint detection model based on deep learning; wherein the second voice endpoint detection model is trained on audio samples containing back end points of voice.
10. The method of claim 8, wherein performing voice endpoint detection on the second audio comprises:
detecting the front end point and back end point of the voice in the second audio through a third voice endpoint detection model based on deep learning, wherein the third voice endpoint detection model is trained on audio samples containing front end points and back end points of voice.
11. The method of any of claims 8-10, wherein after performing voice endpoint detection on the second audio, the method further comprises:
extracting the voice from the second audio according to the detected front end point and back end point of the voice in the second audio;
recognizing the voice to obtain a voice recognition result;
and sending the voice recognition result to the terminal, wherein the voice recognition result is used for display by the terminal.
12. A speech processing apparatus, applied to a terminal, the apparatus comprising:
the acquisition module is used for acquiring the audio to be detected;
the sending module is used for uploading a second audio to a server when detecting that a first audio exists in the audio to be detected, wherein the first audio is the audio of an echo generated by a response word, and the response word is used by the terminal to respond to a wake-up word input by a user;
wherein the second audio is the audio located after the end point of the first audio in the audio to be detected, and the second audio is used by the server for voice endpoint detection.
13. The apparatus of claim 12, further comprising: a detection module;
the detection module is configured to:
determine whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word.
14. The apparatus according to claim 13, wherein the detection module is specifically configured to:
extract the audio features corresponding to the response word;
when the audio features corresponding to the response word match the audio to be detected, determine that the first audio exists in the audio to be detected, wherein the first audio is the portion of the audio to be detected that matches the audio features corresponding to the response word.
15. The apparatus according to claim 13, wherein the detection module is specifically configured to:
detect whether the first audio exists in the audio to be detected through a first voice endpoint detection model based on deep learning, wherein the first voice endpoint detection model is trained on audio samples of the echo generated by the response word, and the audio samples contain the audio features corresponding to the response word.
16. The apparatus of any one of claims 12-15, wherein the sending module is further configured to:
upload a third audio to a server when detecting that the first audio does not exist in the audio to be detected and a front end point of voice exists in the audio to be detected, so that the server detects the back end point of the voice in the third audio, wherein the third audio is the audio after the front end point of the voice in the audio to be detected.
17. The apparatus according to any one of claims 12-15, wherein the acquisition module is specifically configured to:
after receiving the wake-up word, collect the audio to be detected and play the response word.
18. The apparatus according to any one of claims 12-15, further comprising: a display module;
the display module is used for:
receive and display a voice recognition result sent by the server, wherein the voice recognition result is obtained by the server performing voice endpoint detection on the second audio to obtain the front end point and back end point of the voice in the second audio and then recognizing the voice.
19. A speech processing apparatus, applied to a server, the apparatus comprising:
the receiving module is used for receiving a second audio sent by the terminal, wherein the second audio is the audio, collected by the terminal, located after the end point of a first audio in the audio to be detected, the first audio is the audio of an echo generated by a response word, and the response word is used by the terminal to respond to a wake-up word input by a user;
and the processing module is used for carrying out voice endpoint detection on the second audio.
20. The apparatus of claim 19, wherein the processing module is specifically configured to:
determine a starting point of the second audio as the front end point of the voice in the second audio, and detect the back end point of the voice through a second voice endpoint detection model based on deep learning; wherein the second voice endpoint detection model is trained on audio samples containing back end points of voice.
21. The apparatus of claim 19, wherein the processing module is specifically configured to:
detect the front end point and back end point of the voice in the second audio through a third voice endpoint detection model based on deep learning, wherein the third voice endpoint detection model is trained on audio samples containing front end points and back end points of voice.
22. The apparatus of any one of claims 19-21, wherein the processing module is further configured to:
extract the voice from the second audio according to the detected front end point and back end point of the voice in the second audio;
recognize the voice to obtain a voice recognition result;
and send the voice recognition result to the terminal, wherein the voice recognition result is used for display by the terminal.
23. A terminal, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the speech processing method of any one of claims 1-7.
24. A server, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the speech processing method of any one of claims 8-11.
25. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the speech processing method of any one of claims 1-7.
26. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the speech processing method of any one of claims 8-11.
CN202010315910.4A 2020-04-21 2020-04-21 Voice processing method, device, terminal, server and storage medium Active CN111540357B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010315910.4A CN111540357B (en) 2020-04-21 2020-04-21 Voice processing method, device, terminal, server and storage medium

Publications (2)

Publication Number Publication Date
CN111540357A true CN111540357A (en) 2020-08-14
CN111540357B CN111540357B (en) 2024-01-26

Family

ID=71975064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010315910.4A Active CN111540357B (en) 2020-04-21 2020-04-21 Voice processing method, device, terminal, server and storage medium

Country Status (1)

Country Link
CN (1) CN111540357B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104252872A (en) * 2014-09-23 2014-12-31 深圳市中兴移动通信有限公司 Lyric generating method and intelligent terminal
CN106341563A (en) * 2015-07-06 2017-01-18 北京视联动力国际信息技术有限公司 Terminal communication based echo suppression method and device
CN105472191A (en) * 2015-11-18 2016-04-06 百度在线网络技术(北京)有限公司 Method and device for tracking echo time delay
CN106653031A (en) * 2016-10-17 2017-05-10 海信集团有限公司 Voice wake-up method and voice interaction device
CN109285554A (en) * 2017-07-20 2019-01-29 阿里巴巴集团控股有限公司 A kind of echo cancel method, server, terminal and system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185388A (en) * 2020-09-14 2021-01-05 北京小米松果电子有限公司 Speech recognition method, device, equipment and computer readable storage medium
CN112185388B (en) * 2020-09-14 2024-04-09 北京小米松果电子有限公司 Speech recognition method, device, equipment and computer readable storage medium
CN112700782A (en) * 2020-12-25 2021-04-23 维沃移动通信有限公司 Voice processing method and electronic equipment

Also Published As

Publication number Publication date
CN111540357B (en) 2024-01-26

Similar Documents

Publication Publication Date Title
EP3611895B1 (en) Method and device for user registration, and electronic device
US11176938B2 (en) Method, device and storage medium for controlling game execution using voice intelligent interactive system
CN102568478B (en) Video play control method and system based on voice recognition
CN106971723B (en) Voice processing method and device for voice processing
CA2897365C (en) Method and system for recognizing speech commands
CN110914828B (en) Speech translation method and device
CN111667835A (en) Voice recognition method, living body detection method, model training method and device
CN110706707B (en) Method, apparatus, device and computer-readable storage medium for voice interaction
CN111540357B (en) Voice processing method, device, terminal, server and storage medium
CN109361995A (en) A kind of volume adjusting method of electrical equipment, device, electrical equipment and medium
CN110211609A (en) A method of promoting speech recognition accuracy
CN112509568A (en) Voice awakening method and device
US20200312305A1 (en) Performing speaker change detection and speaker recognition on a trigger phrase
CN111161746B (en) Voiceprint registration method and system
CN108322770A (en) Video frequency program recognition methods, relevant apparatus, equipment and system
CN111862943B (en) Speech recognition method and device, electronic equipment and storage medium
CN112185425A (en) Audio signal processing method, device, equipment and storage medium
CN110956958A (en) Searching method, searching device, terminal equipment and storage medium
CN112908336A (en) Role separation method for voice processing device and voice processing device thereof
CN111400463B (en) Dialogue response method, device, equipment and medium
US10818298B2 (en) Audio processing
CN114171029A (en) Audio recognition method and device, electronic equipment and readable storage medium
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
CN114664303A (en) Continuous voice instruction rapid recognition control system
CN111968630B (en) Information processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant