CN111540357A - Voice processing method, device, terminal, server and storage medium - Google Patents

Voice processing method, device, terminal, server and storage medium

Info

Publication number
CN111540357A
Authority
CN
China
Prior art keywords
audio
voice
detected
end point
server
Prior art date
Legal status
Granted
Application number
CN202010315910.4A
Other languages
Chinese (zh)
Other versions
CN111540357B (en)
Inventor
杨香斌
Current Assignee
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd
Priority to CN202010315910.4A
Publication of CN111540357A
Application granted
Publication of CN111540357B
Legal status: Active
Anticipated expiration

Classifications

    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L2015/221: Announcement of recognition results
    • G10L2015/223: Execution procedure of a spoken command
    • Y02D30/70: Reducing energy consumption in wireless communication networks

Abstract

The application provides a voice processing method, apparatus, terminal, server and storage medium. The method comprises: collecting audio to be detected; and uploading a second audio to a server when a first audio is detected in the audio to be detected, wherein the first audio is the echo audio generated by a response word, the response word is played by the terminal in response to a wake-up word input by a user, the second audio is the audio located after the end point of the first audio in the audio to be detected, and the second audio is used by the server for voice endpoint detection. Because the terminal detects the first audio (the echo of the response word) and uploads only the second audio after its end point for voice endpoint detection, the audio on which the server performs endpoint detection does not contain the first audio. This prevents the server from misidentifying the echo's endpoints as the endpoints of the user's voice, and improves the accuracy of voice endpoint detection.

Description

Voice processing method, device, terminal, server and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech processing method, apparatus, terminal, server, and storage medium.
Background
With the rapid development of voice recognition technology, application scenarios for far-field voice interaction have become increasingly common: terminals such as smart televisions, smart speakers, smart home devices, smart vehicle-mounted terminals, smart robots and mobile phones can perform far-field voice interaction with users in order to provide services. In far-field voice interaction, the front endpoint and back endpoint of the user's voice are detected by a Voice Activity Detection (VAD) algorithm.
Generally, after receiving a wake-up word input by a user, the terminal plays a response word while starting audio collection, and uploads the collected audio to a server; the server identifies the front endpoint and back endpoint of the user's voice in the audio through a voice activity detection model based on deep learning.
However, the echo generated by the response word played by the terminal is sometimes collected by the terminal into that audio. The server then easily misidentifies the front and back endpoints of the echo as the front and back endpoints of the user's voice, so voice endpoint detection errors occur and the subsequent voice interaction goes wrong.
Disclosure of Invention
The embodiments of the present application provide a voice processing method, apparatus, terminal, server and storage medium to address the problem that voice endpoint detection is prone to error.
In a first aspect, an embodiment of the present application provides a speech processing method, which is applied to a terminal, and the method includes:
collecting audio to be detected;
and uploading a second audio to a server when a first audio is detected in the audio to be detected, wherein the first audio is the echo audio generated by a response word, and the response word is played by the terminal in response to a wake-up word input by a user;
the second audio is the audio located after the end point of the first audio in the audio to be detected, and the second audio is used by the server for voice endpoint detection.
In one possible embodiment, the method further comprises:
and determining whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word.
In a possible implementation, determining whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word includes:
extracting the audio features corresponding to the response word;
and when the audio features corresponding to the response word match the audio to be detected, determining that the first audio exists in the audio to be detected, wherein the first audio is the audio in the audio to be detected that matches the audio features corresponding to the response word.
In a possible implementation, determining whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word includes:
detecting whether the first audio exists in the audio to be detected through a first voice endpoint detection model based on deep learning, wherein the first voice endpoint detection model is trained with audio samples of the echo generated by the response word, and the audio samples contain the audio features corresponding to the response word.
In one possible embodiment, the method further comprises:
when it is detected that the first audio does not exist in the audio to be detected and a front endpoint of voice exists in the audio to be detected, uploading a third audio to the server so that the server detects the back endpoint of the voice in the third audio, wherein the third audio is the audio located after the front endpoint of the voice in the audio to be detected.
In one possible embodiment, collecting the audio to be detected includes:
after receiving the wake-up word, collecting the audio to be detected and playing the response word.
In one possible embodiment, the method further comprises:
receiving and displaying a voice recognition result sent by the server, wherein the voice recognition result is obtained by the server recognizing the voice after performing voice endpoint detection on the second audio to obtain the front endpoint and back endpoint of the voice in the second audio.
In a second aspect, an embodiment of the present application provides a speech processing method, which is applied to a server, and the method includes:
receiving a second audio sent by a terminal, wherein the second audio is the audio, in the audio to be detected collected by the terminal, located after the end point of a first audio, the first audio is the echo audio generated by a response word, and the response word is played by the terminal in response to a wake-up word input by a user;
and performing voice endpoint detection on the second audio.
In one possible embodiment, the voice endpoint detection of the second audio includes:
determining the starting point of the second audio as the front endpoint of the voice in the second audio, and detecting the back endpoint of the voice through a second voice endpoint detection model based on deep learning, wherein the second voice endpoint detection model is trained with audio samples containing back endpoints of voice.
In one possible embodiment, the voice endpoint detection of the second audio includes:
detecting the front endpoint and back endpoint of the voice in the second audio through a third voice endpoint detection model based on deep learning, wherein the third voice endpoint detection model is trained with audio samples containing front endpoints and back endpoints of voice.
In one possible implementation, after performing voice endpoint detection on the second audio, the method further includes:
extracting the voice from the second audio according to the detected front endpoint and back endpoint of the voice in the second audio;
recognizing the voice to obtain a voice recognition result;
and sending the voice recognition result to the terminal for display by the terminal.
In a third aspect, an embodiment of the present application provides a speech processing apparatus, which is applied to a terminal, and the apparatus includes:
the acquisition module is used for acquiring the audio to be detected;
the sending module is configured to upload a second audio to a server when a first audio is detected in the audio to be detected, wherein the first audio is the echo audio generated by a response word, and the response word is played by the terminal in response to a wake-up word input by a user;
the second audio is the audio located after the end point of the first audio in the audio to be detected, and the second audio is used by the server for voice endpoint detection.
In a possible embodiment, the apparatus further comprises: a detection module;
the detection module is configured to:
determining whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word.
In a possible implementation manner, the detection module is specifically configured to:
extracting the audio features corresponding to the response word;
and when the audio features corresponding to the response word match the audio to be detected, determining that the first audio exists in the audio to be detected, wherein the first audio is the audio in the audio to be detected that matches the audio features corresponding to the response word.
In a possible implementation manner, the detection module is specifically configured to:
detecting whether the first audio exists in the audio to be detected through a first voice endpoint detection model based on deep learning, wherein the first voice endpoint detection model is trained with audio samples of the echo generated by the response word, and the audio samples contain the audio features corresponding to the response word.
In a possible implementation manner, the sending module is further configured to:
when it is detected that the first audio does not exist in the audio to be detected and a front endpoint of voice exists in the audio to be detected, uploading a third audio to the server so that the server detects the back endpoint of the voice in the third audio, wherein the third audio is the audio located after the front endpoint of the voice in the audio to be detected.
In a possible embodiment, the acquisition module is specifically configured to:
after receiving the wake-up word, collecting the audio to be detected and playing the response word.
In a possible embodiment, the apparatus further comprises: a display module;
the display module is used for:
receiving and displaying a voice recognition result sent by the server, wherein the voice recognition result is obtained by the server recognizing the voice after performing voice endpoint detection on the second audio to obtain the front endpoint and back endpoint of the voice in the second audio.
In a fourth aspect, an embodiment of the present application provides a speech processing apparatus, which is applied to a server, and the apparatus includes:
the receiving module is configured to receive a second audio sent by the terminal, wherein the second audio is the audio, in the audio to be detected collected by the terminal, located after the end point of a first audio, the first audio is the echo audio generated by a response word, and the response word is played by the terminal in response to a wake-up word input by a user;
and the processing module is used for carrying out voice endpoint detection on the second audio.
In a possible implementation manner, the processing module is specifically configured to:
determining the starting point of the second audio as the front endpoint of the voice in the second audio, and detecting the back endpoint of the voice through a second voice endpoint detection model based on deep learning, wherein the second voice endpoint detection model is trained with audio samples containing back endpoints of voice.
In a possible implementation manner, the processing module is specifically configured to:
detecting the front endpoint and back endpoint of the voice in the second audio through a third voice endpoint detection model based on deep learning, wherein the third voice endpoint detection model is trained with audio samples containing front endpoints and back endpoints of voice.
In a possible implementation, the processing module is further configured to:
extracting the voice from the second audio according to the detected front endpoint and back endpoint of the voice in the second audio;
recognizing the voice to obtain a voice recognition result;
and sending the voice recognition result to the terminal for display by the terminal.
In a fifth aspect, an embodiment of the present application provides a terminal, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the speech processing method described in the first aspect and its various possible implementations.
In a sixth aspect, an embodiment of the present application provides a server, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored by the memory to cause the at least one processor to perform the speech processing method as described above in the second aspect and various possible embodiments of the second aspect.
In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the speech processing method according to the first aspect and its various possible implementations is implemented.
In an eighth aspect, embodiments of the present application provide a computer-readable storage medium, where computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the speech processing method according to the second aspect and various possible implementation manners of the second aspect is implemented.
According to the voice processing method, apparatus, terminal, server and storage medium provided by the embodiments, the terminal collects audio to be detected and, when a first audio is detected in the audio to be detected, uploads a second audio to a server, wherein the first audio is the echo audio generated by a response word that the terminal plays in response to a wake-up word input by a user, and the second audio is the audio located after the end point of the first audio in the audio to be detected and is used by the server for voice endpoint detection. Because the terminal detects the first audio (the echo of the response word) and uploads only the second audio after its end point for voice endpoint detection, the audio on which the server performs voice endpoint detection does not contain the first audio; this prevents the server from misidentifying the echo's endpoints as the endpoints of the user's voice, and improves the accuracy of voice endpoint detection.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present application, and a person skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of a scenario of a speech processing method according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of a speech processing method according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of a speech processing method according to another embodiment of the present application;
Fig. 4 is a schematic flowchart of a speech processing method according to yet another embodiment of the present application;
Fig. 5 is a signaling interaction diagram of a speech processing method according to yet another embodiment of the present application;
Fig. 6A is a schematic flowchart of a conventional speech processing method;
Fig. 6B is a schematic flowchart of a speech processing method according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a speech processing apparatus according to another embodiment of the present application;
Fig. 9 is a schematic diagram of the hardware structure of a terminal according to an embodiment of the present application;
Fig. 10 is a schematic diagram of the hardware structure of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
Fig. 1 is a schematic diagram of a scenario of a speech processing method according to an embodiment of the present application. As shown in Fig. 1, the scenario includes a terminal 11 and a server 12. The terminal 11 may be a smart television, a smart speaker, a smart home device, a smart vehicle-mounted terminal, a smart robot, a mobile phone, or the like, which is not limited here. The terminal 11 can perform far-field voice interaction with the user, capture audio containing the voice input by the user, and upload the captured audio to the server 12. The server 12 performs voice endpoint detection and voice recognition on the audio and returns the recognition result to the terminal 11. The terminal 11 then interacts with the user by playing or displaying the result, or executes the operation corresponding to the recognition result.
For example, suppose the terminal 11 is a smart television. When the user wants to use it, the user may speak the wake-up word "ABC". After receiving "ABC", the smart television wakes from its sleep state, plays the corresponding response word "I'm here", and starts voice collection. On hearing "I'm here", the user knows the smart television has been woken and makes the subsequent voice input, for example "What's the weather like today". The smart television collects audio containing the user voice "What's the weather like today" and uploads it to the server; the server performs voice endpoint detection on the audio, detects the front and back endpoints of "What's the weather like today", performs semantic recognition on the voice between the two endpoints, and feeds the recognition result back to the smart television. The smart television can then query today's weather according to the recognition result and inform the user by voice playback and/or on-screen display. Optionally, the smart television may also show the voice recognition result on the display screen so that the user knows the recognized voice content.
However, the response word played by the terminal 11 may generate an echo, for example the echo produced in a room by the response word played by a smart television. The echo is picked up into the audio by the terminal 11 and uploaded to the server. When the server performs voice endpoint detection, it first detects the front and back endpoints of the audio corresponding to the echo and misidentifies them as the front and back endpoints of the user's voice, so the server subsequently performs voice recognition on the echo audio, and the following voice interaction goes wrong. For example, the audio collected by the smart television contains both the echo generated by the response word "I'm here" and the user voice "What's the weather like today", with the echo located before the user voice. During voice endpoint detection the server first detects the front and back endpoints of the audio corresponding to "I'm here", performs semantic recognition on that audio, and mistakenly recognizes the user's voice as "I'm here"; the smart television therefore displays or plays a reply to "I'm here", and the voice interaction is wrong. In addition, in a single-round interaction scenario, once the server has detected the front and back endpoints of the audio corresponding to "I'm here", it no longer recognizes the subsequent audio, so the user's voice is not processed at all.
In this embodiment, the terminal detects the first audio, i.e. the echo generated by the response word, and uploads the second audio after the first audio's end point to the server for voice endpoint detection. The audio on which the server performs voice endpoint detection therefore does not contain the first audio, which prevents the server from misidentifying the echo's endpoints as the endpoints of the user's voice and improves the accuracy of voice endpoint detection. The following embodiments illustrate this.
Fig. 2 is a flowchart illustrating a speech processing method according to an embodiment of the present application. The method may be performed by the terminal described above. As shown in fig. 2, the method includes:
S201, collecting the audio to be detected.
In this embodiment, the terminal collects the audio to be detected. The audio to be detected may include one or more of ambient-noise audio, the audio of the echo generated by the response word played by the terminal, and the user's voice. Taking the smart television above as an example, the audio it collects may include indoor noise, the echo of the response word "I'm here", and the user voice "What's the weather like today".
Optionally, after the wake-up word is received, the audio to be detected is collected and the response word is played.
In this embodiment, the terminal interacts with the user after being woken by the wake-up word. The user first speaks the wake-up word to wake the terminal. After receiving the wake-up word input by the user, the terminal starts to collect the audio to be detected and plays the response word to inform the user that it has been woken. The response word may be chosen as needed and is not limited here; for example, it may be "I'm here", "Hello", "Good morning", or "How may I help you".
S202, uploading a second audio to a server when a first audio is detected in the audio to be detected, wherein the first audio is the echo audio generated by a response word, and the response word is played by the terminal in response to a wake-up word input by the user; the second audio is the audio located after the end point of the first audio in the audio to be detected, and the second audio is used by the server for voice endpoint detection.
In this embodiment, the terminal plays the response word after receiving the wake-up word input by the user, and the first audio is the audio of the echo generated by the response word. If the response word generates an echo, for example when the terminal and the user are in an enclosed space such as a room or a car, the first audio exists in the audio to be detected collected by the terminal; if the response word generates no echo, for example in an open outdoor space, the first audio does not exist in the audio to be detected.
The terminal detects whether the first audio exists in the audio to be detected, and if so, uploads the second audio to the server. When the terminal detects that the first audio exists, it determines the end point of the first audio and uploads the audio located after that end point as the second audio. The server may then perform voice endpoint detection on the second audio to detect the front and back endpoints of the voice in it, extract the voice between the two endpoints, recognize it to obtain a voice recognition result, and send the result to the terminal. The terminal may display the voice recognition result on a screen or push reply information to the user according to it. For example, if the voice recognition result is "What's the weather like today", the terminal may display that text on the screen, and may also query today's weather and reply to the user by on-screen display or voice broadcast. Through this local detection, when the first audio exists in the audio to be detected, only the second audio after the first audio's end point is uploaded to the server; the echo audio is never handed to the server for endpoint identification, so voice endpoint detection errors caused by the echo are avoided.
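A minimal Python sketch of this terminal-side flow is given below; it is illustrative only, and `echo_endpoint_detected` and `server.upload` are assumed helper interfaces rather than anything defined in the patent:

```python
def handle_after_wakeup(mic_stream, server):
    """Buffer frames until the echo of the response word (the first audio)
    has ended, then stream everything after its end point (the second
    audio) to the server for voice endpoint detection."""
    buffered = []
    echo_ended = False
    for frame in mic_stream:          # e.g. 20 ms PCM frames
        if echo_ended:
            server.upload(frame)      # second audio: after the echo's end point
        else:
            buffered.append(frame)
            # hypothetical detector for the first audio's end point
            if echo_endpoint_detected(buffered):
                echo_ended = True     # later frames form the second audio
```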
In this embodiment of the application, the terminal collects audio to be detected and, when a first audio is detected in it, uploads a second audio to a server, wherein the first audio is the echo audio generated by a response word that the terminal plays in response to a wake-up word input by the user, and the second audio is the audio located after the end point of the first audio and is used by the server for voice endpoint detection. Because the terminal detects the echo of the response word and uploads only the audio after its end point, the audio used by the server for voice endpoint detection does not contain the first audio; this prevents the server from misidentifying the echo's endpoints as the endpoints of the user's voice, and improves the accuracy of voice endpoint detection.
Optionally, the method may further include:
when it is detected that the first audio does not exist in the audio to be detected and a front endpoint of voice exists in the audio to be detected, uploading a third audio to the server so that the server detects the back endpoint of the voice in the third audio, wherein the third audio is the audio located after the front endpoint of the voice in the audio to be detected.
In this embodiment, the voice refers to the voice input by the user, for example "What's the weather like today". The terminal detects whether the first audio exists in the audio to be detected; if not, it detects whether a front endpoint of voice exists in the audio to be detected, and if so, it uploads the third audio, i.e. the audio located after the front endpoint, to the server. The server then detects the back endpoint of the voice in the third audio.
For example, suppose the terminal's environment is such that the played response word generates no echo, so the first audio does not exist in the collected audio to be detected. With the user input "What's the weather like today", when the terminal detects that the first audio does not exist but the front endpoint of "What's the weather like today" does, it uploads the audio after that front endpoint as the third audio to the server; the server detects the back endpoint of "What's the weather like today" in the third audio, and can then perform subsequent processing such as voice recognition according to the detected front and back endpoints.
Optionally, the terminal may locally adopt a VAD algorithm based on energy and zero-crossing rate to detect the front endpoint of the voice in the audio to be detected, while the server detects the back endpoint of the voice in the third audio through a deep-learning-based VAD model. Detecting the front endpoint with an energy and zero-crossing-rate VAD prevents sudden transient noise from interfering with voice endpoint detection; detecting the back endpoint with a deep-learning VAD model allows the model to be optimized through real-time updates, making it suitable for detecting back endpoints in more scenarios. Combining the two detection modes for the front and back endpoints respectively improves the accuracy of voice endpoint detection.
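For illustration, a minimal sketch of such a local energy and zero-crossing-rate front-endpoint detector follows; the frame size and thresholds are assumptions, not values taken from the patent:

```python
import numpy as np

FRAME = 320  # 20 ms frames at 16 kHz (illustrative)

def short_time_energy(frame):
    return float(np.mean(frame.astype(np.float64) ** 2))

def zero_crossing_rate(frame):
    signs = np.sign(frame.astype(np.float64))
    return float(np.mean(np.abs(np.diff(signs))) / 2)

def detect_front_endpoint(pcm, energy_th=2e5, zcr_th=0.15, min_frames=3):
    """Return the sample index of the voice front endpoint, or None.
    Requiring several consecutive active frames guards against the
    sudden transient noise mentioned in the description."""
    run = 0
    for i in range(0, len(pcm) - FRAME + 1, FRAME):
        frame = pcm[i:i + FRAME]
        active = (short_time_energy(frame) > energy_th
                  or zero_crossing_rate(frame) > zcr_th)
        run = run + 1 if active else 0
        if run >= min_frames:
            return i - (min_frames - 1) * FRAME  # start of the active run
    return None
```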
Optionally, after S202, the method may further include:
receiving and displaying a voice recognition result sent by the server, wherein the voice recognition result is obtained by the server recognizing the voice after performing voice endpoint detection on the second audio to obtain the front endpoint and back endpoint of the voice in the second audio.
In this embodiment, after performing voice endpoint detection on the second audio to obtain the front and back endpoints of the voice in it, the server may extract the voice between the two endpoints, recognize it to obtain a voice recognition result, and send the result to the terminal. The terminal may display the voice recognition result on a screen or push reply information to the user according to it. For example, if the result is "What's the weather like today", the terminal may display that text on the screen, and may also query today's weather and show it on the screen. Displaying the voice recognition result on the terminal lets the user check it and input the voice again if the result is wrong, improving user experience.
Fig. 3 is a flowchart illustrating a speech processing method according to another embodiment of the present application. The embodiment describes in detail a specific implementation process for detecting whether the first audio exists in the audio to be detected. As shown in fig. 3, the method includes:
S301, collecting the audio to be detected.
In this embodiment, S301 is similar to S201 in the embodiment of fig. 2, and is not described here again.
S302, determining whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word.
In this embodiment, the terminal may determine whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word. The audio features corresponding to the response word can be obtained from audio samples containing the response word. The types of audio features may include, but are not limited to, one or more of energy, zero-crossing rate, frequency-domain features, cepstral features, and harmonic features.
In one possible implementation, S302 may include:
extracting the audio features corresponding to the response word;
and when the audio features corresponding to the response word match the audio to be detected, determining that the first audio exists in the audio to be detected, wherein the first audio is the audio in the audio to be detected that matches those features.
In this embodiment, the terminal may extract the audio features corresponding to the response word from an audio sample containing the response word. When detecting whether the first audio exists, it matches these features against the audio to be detected; if audio matching the features exists, the terminal determines that the first audio exists, identifies the matching audio as the first audio, and thereby determines the first audio's end point.
For example, suppose the audio features corresponding to the response word include an energy feature and a zero-crossing-rate feature. The terminal can match these features against the audio to be detected with a voice detection algorithm based on energy and zero-crossing rate, and determine whether the first audio exists. Feature matching lets the terminal detect the first audio in the audio to be detected quickly, as sketched below.
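A minimal sketch of this feature matching, under the assumption of simple per-frame log-energy and zero-crossing-rate features and an illustrative distance threshold:

```python
import numpy as np

def frame_features(pcm, frame=320):
    """Per-frame (log-energy, zero-crossing-rate) features."""
    feats = []
    for i in range(0, len(pcm) - frame + 1, frame):
        f = pcm[i:i + frame].astype(np.float64)
        log_energy = np.log(np.mean(f ** 2) + 1e-9)
        zcr = np.mean(np.abs(np.diff(np.sign(f)))) / 2
        feats.append((log_energy, zcr))
    return np.array(feats)

def find_first_audio(audio, template, dist_th=1.0, frame=320):
    """Slide the response word's feature template over the audio to be
    detected; return (start, end) sample indices of the matched first
    audio, or None. The end index is the echo's end point, after which
    the second audio begins."""
    a = frame_features(audio, frame)
    n = len(template)
    for i in range(len(a) - n + 1):
        if np.mean(np.linalg.norm(a[i:i + n] - template, axis=1)) < dist_th:
            return i * frame, (i + n) * frame
    return None

# template = frame_features(response_word_pcm)  # precomputed from a sample of the response word
```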
In one possible implementation, S302 may include:
detecting whether the first audio exists in the audio to be detected through a first voice endpoint detection model based on deep learning, wherein the first voice endpoint detection model is trained with audio samples of the echo generated by the response word, and the audio samples contain the audio features corresponding to the response word.
In this embodiment, a first voice endpoint detection model based on deep learning may be built in advance, a number of audio samples of the echo generated by the response word may be collected to form a training set, and the model may be trained on that training set. The trained first voice endpoint detection model is preset on the terminal, which uses it to detect whether the first audio exists in the audio to be detected.
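As an illustrative sketch only, such a frame-level deep detector might look like the following (PyTorch is assumed here; the architecture and feature shapes are not specified by the patent):

```python
import torch
import torch.nn as nn

class EchoFrameDetector(nn.Module):
    """Minimal per-frame classifier: given a window of spectral features,
    predict whether each frame belongs to the response word's echo."""
    def __init__(self, n_features=40, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):            # x: (batch, frames, n_features)
        h, _ = self.rnn(x)
        return torch.sigmoid(self.head(h)).squeeze(-1)  # echo prob per frame

# Training would pair these probabilities with frame labels derived from
# the collected echo samples of the response word (binary cross-entropy loss).
```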
S303, uploading a second audio to a server when a first audio is detected in the audio to be detected, wherein the first audio is the echo audio generated by a response word, and the response word is played by the terminal in response to a wake-up word input by the user; the second audio is the audio located after the end point of the first audio in the audio to be detected, and the second audio is used by the server for voice endpoint detection.
In this embodiment, S303 is similar to S202 in the embodiment of Fig. 2 and is not described again here.
In this embodiment, whether the first audio exists in the audio to be detected is determined according to the audio features corresponding to the response word. The first audio can thus be detected accurately, which prevents it from being uploaded to the server and interfering with voice endpoint detection, and so improves the accuracy of voice endpoint detection.
Fig. 4 is a flowchart illustrating a speech processing method according to another embodiment of the present application. The method may be performed by a server. As shown in fig. 4, the method includes:
S401, receiving a second audio sent by a terminal, wherein the second audio is the audio, in the audio to be detected collected by the terminal, located after the end point of a first audio, the first audio is the echo audio generated by a response word, and the response word is played by the terminal in response to a wake-up word input by a user.
S402, performing voice endpoint detection on the second audio.
In this embodiment, the terminal collects the audio to be detected and, when it detects that the first audio exists, uploads the second audio after the first audio's end point to the server. The server receives the second audio sent by the terminal and performs voice endpoint detection on it.
In this embodiment of the application, the server receives a second audio sent by the terminal, wherein the second audio is the audio, in the audio to be detected collected by the terminal, located after the end point of the first audio, the first audio is the echo audio generated by a response word, and the response word is played by the terminal in response to a wake-up word input by the user; the server then performs voice endpoint detection on the second audio. Because the terminal detects the echo of the response word and uploads only the audio after its end point, the audio on which the server performs voice endpoint detection does not contain the first audio; this prevents the server from misidentifying the echo's endpoints as the endpoints of the user's voice, and improves the accuracy of voice endpoint detection.
In one possible implementation, S402 may include:
determining the starting point of the second audio as the front endpoint of the voice in the second audio, and detecting the back endpoint of the voice through a second voice endpoint detection model based on deep learning, wherein the second voice endpoint detection model is trained with audio samples containing back endpoints of voice.
In this embodiment, the server may directly use the starting point of the second audio as the front endpoint of the voice and detect only the back endpoint. Since the starting point of the second audio is the end point of the first audio, the second audio contains no echo audio, so detecting only the back endpoint is enough for subsequent voice recognition. For example, for the voice "What's the weather like today", the second audio may or may not contain a stretch of silence or noise before the voice; since neither silence nor noise is voice, it does not affect recognition, and the starting point of the second audio can be taken directly as the voice's front endpoint.
A second voice endpoint detection model based on deep learning can be built in advance, and audio samples containing back endpoints of voice can be collected to form a training set on which the model is trained. The samples may be voices such as "What's the weather like today", "Please open channel X" or "What are today's restricted license-plate numbers", so that the trained model can accurately detect the back endpoint of voice. While the second voice endpoint detection model is in use, it can be updated and optimized with newly collected user voice. The server may detect the back endpoint of the voice in the second audio through this model.
In this embodiment, the server determines the starting point of the second audio as the front endpoint of the voice and detects only the back endpoint through the deep-learning-based second voice endpoint detection model, which speeds up voice endpoint detection, improves response speed, and improves user experience.
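A sketch of this server-side step, assuming a hypothetical deep VAD model that outputs a per-frame speech probability for the second audio; the hangover length and threshold are illustrative:

```python
def detect_back_endpoint(speech_probs, frame_ms=20, hangover_ms=400, th=0.5):
    """The front endpoint is taken as 0, the start of the second audio.
    The back endpoint is declared once `hangover_ms` of continuous
    non-speech follows the last speech frame."""
    hang = hangover_ms // frame_ms
    silent, last_speech = 0, None
    for i, p in enumerate(speech_probs):
        if p >= th:
            last_speech, silent = i, 0
        elif last_speech is not None:
            silent += 1
            if silent >= hang:
                return (last_speech + 1) * frame_ms  # back endpoint in ms
    return None  # utterance not finished yet
```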
In one possible implementation, S402 may include:
detecting the front endpoint and back endpoint of the voice in the second audio through a third voice endpoint detection model based on deep learning, wherein the third voice endpoint detection model is trained with audio samples containing front endpoints and back endpoints of voice.
In this embodiment, a third voice endpoint detection model based on deep learning may be built in advance, and audio samples containing front and back endpoints of voice may be collected to form a training set on which the model is trained. The samples may be voices such as "What's the weather like today", "Please open channel X" or "What are today's restricted license-plate numbers", so that the trained third voice endpoint detection model can accurately detect the front and back endpoints of voice. While the model is in use, it can be updated and optimized with newly collected user voice. The server may detect the front and back endpoints of the voice in the second audio through this model.
The server can detect both endpoints of the voice in the second audio because a stretch of silence or noise may exist between the starting point of the second audio and the voice's front endpoint. Detecting the front endpoint with the deep-learning-based third model removes this silence or noise, and the model's characteristics allow clean voice to be extracted more accurately from stationary non-speech noise; the front endpoint used for recognition therefore neither cuts off voice nor brings in extra noise, further improving the accuracy of voice recognition.
Optionally, after S402, the method may further include:
extracting the voice from the second audio according to the detected front endpoint and back endpoint of the voice in the second audio;
recognizing the voice to obtain a voice recognition result;
and sending the voice recognition result to the terminal for display by the terminal.
For example, if the voice input by the user is "What's the weather like today", the server extracts that voice from the second audio according to its detected front and back endpoints and recognizes it to obtain the voice recognition result "What's the weather like today". The terminal can display the text on the screen, and can also query today's weather and show it on the screen.
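A sketch of this extraction-and-recognition step; `asr` stands in for whatever recognizer the server uses and is an assumption, not part of the patent:

```python
def recognize_between_endpoints(second_audio, front_ms, back_ms, asr, sr=16000):
    """Cut the voice between the detected endpoints out of the second audio
    and transcribe it; the resulting text is sent back to the terminal."""
    a = front_ms * sr // 1000
    b = back_ms * sr // 1000
    voice = second_audio[a:b]
    return asr.transcribe(voice)  # e.g. "What's the weather like today"
```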
The server sends the voice recognition result to the terminal for display, which lets the user check the result and input the voice again if it is wrong, improving user experience.
Fig. 5 is a signaling interaction diagram of a speech processing method according to still another embodiment of the present application. The method is executed jointly by the terminal and the server described above. As shown in Fig. 5, the method includes:
S501, after receiving the wake-up word, the terminal collects the audio to be detected and plays the response word.
S502, when the terminal detects that a first audio exists in the audio to be detected, wherein the first audio is the audio of the echo generated by the response word, the terminal uploads a second audio to the server, the second audio being the audio located after the end point of the first audio in the audio to be detected.
S503, the server performs voice endpoint detection on the second audio to obtain the front endpoint and back endpoint of the voice in the second audio.
S504, the server extracts the voice from the second audio according to the detected front and back endpoints and recognizes it to obtain a voice recognition result.
And S505, the server sends the voice recognition result to the terminal.
S506, the terminal displays the voice recognition result.
The speech processing method provided in this embodiment is similar to the method embodiments executed by the terminal and by the server; its implementation principle and technical effects are similar and are not described again here.
The following describes an implementation example of a conventional speech processing method in comparison with an implementation example of the speech processing method provided in the present application. Fig. 6A is a flowchart illustrating a conventional speech processing method, and fig. 6B is a flowchart illustrating a speech processing method according to an embodiment of the present application.
Referring to Fig. 6A, the flow of the conventional speech processing method is as follows: after being woken successfully, the terminal plays the response word "I'm here" and starts collecting the audio to be detected, uploading it to the server in real time. The server detects the front endpoint of the voice in the audio through VAD; once the front endpoint is detected, it starts voice recognition on the audio after the front endpoint and feeds the recognition result back to the terminal for real-time presentation, ending VAD and voice recognition when the back endpoint of the voice is detected. Because echo cancellation cannot remove the echo completely, if the response word generates an echo, the echo audio is collected into the audio to be detected and uploaded to the server, so the server mistakes the echo for the user's voice, and both voice endpoint detection and voice recognition go wrong.
Referring to Fig. 6B, the flow of the voice processing method provided in this embodiment is as follows: after being woken successfully, the terminal plays the response word "I'm here" and starts collecting the audio to be detected. The terminal detects through VAD whether the echo audio of "I'm here" exists in the audio to be detected; if it does, the terminal starts uploading the collected audio to the server in real time only after the end point of that echo audio. The server performs voice recognition on the received audio and feeds the result back to the terminal for real-time presentation, ending VAD and voice recognition once it detects the back endpoint of the voice through VAD. In this embodiment, the terminal locally detects the echo audio in the audio to be detected through VAD and uploads only the audio after the echo to the server, which avoids the server mistaking the echo for the user's voice and improves the accuracy of voice endpoint detection. The terminal's local VAD is combined with the server's cloud VAD: the local VAD can use a voice endpoint detection method based on energy and zero-crossing rate, while the cloud VAD can use one based on deep learning. The local VAD removes the interference of the echo with endpoint detection and can also remove interference from sudden transient noise, while the cloud VAD model can be optimized to recognize the back endpoint accurately. The accuracy of voice endpoint detection is thus improved while guaranteeing detection speed and occupying as little of the terminal's local computing resources as possible.
Fig. 7 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application. The voice processing device is applied to the terminal. As shown in fig. 7, the speech processing apparatus 70 includes: an acquisition module 701 and a sending module 702.
The acquisition module 701 is used for acquiring audio to be detected.
A sending module 702, configured to upload a second audio to a server when a first audio is detected in the audio to be detected, wherein the first audio is the echo audio generated by a response word, and the response word is played by the terminal in response to a wake-up word input by a user.
The second audio is the audio located after the end point of the first audio in the audio to be detected, and the second audio is used by the server for voice endpoint detection.
In this embodiment of the application, the acquisition module collects the audio to be detected, and the sending module uploads a second audio to a server when a first audio is detected in the audio to be detected, wherein the first audio is the echo audio generated by a response word that the terminal plays in response to a wake-up word input by a user, and the second audio is the audio located after the end point of the first audio and is used by the server for voice endpoint detection. Because the terminal detects the echo of the response word and uploads only the audio after its end point, the audio used by the server for voice endpoint detection does not contain the first audio; this prevents the server from misidentifying the echo's endpoints as the endpoints of the user's voice, and improves the accuracy of voice endpoint detection.
Optionally, the apparatus further comprises a detection module.
The detection module is configured to:
determining whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word.
Optionally, the detection module is specifically configured to:
extracting the audio features corresponding to the response word;
and when the audio features corresponding to the response word match the audio to be detected, determining that the first audio exists in the audio to be detected, wherein the first audio is the audio in the audio to be detected that matches the audio features corresponding to the response word.
Optionally, the detection module is specifically configured to:
detecting whether the first audio exists in the audio to be detected through a first voice endpoint detection model based on deep learning, wherein the first voice endpoint detection model is trained with audio samples of the echo generated by the response word, and the audio samples contain the audio features corresponding to the response word.
Optionally, the sending module 702 is further configured to:
when it is detected that the first audio does not exist in the audio to be detected and a front endpoint of voice exists in the audio to be detected, uploading a third audio to the server so that the server detects the back endpoint of the voice in the third audio, wherein the third audio is the audio located after the front endpoint of the voice in the audio to be detected.
Optionally, the acquisition module 701 is specifically configured to:
collect the audio to be detected and play the response word after receiving the wake-up word.
Optionally, the apparatus further comprises: a display module;
the display module is used for:
receive and display a voice recognition result sent by the server, where the voice recognition result is obtained by the server performing voice endpoint detection on the second audio to obtain the front end point and back end point of the voice in the second audio and then recognizing the voice.
The speech processing apparatus provided in the embodiment of the present application may be configured to execute the method embodiments in which the terminal is the execution subject; the implementation principles and technical effects are similar and are not repeated here.
Fig. 8 is a schematic structural diagram of a speech processing apparatus according to yet another embodiment of the present application. The voice processing device is applied to a server. As shown in fig. 8, the speech processing apparatus 80 includes:
The receiving module 801 is configured to receive a second audio sent by a terminal, where the second audio is the audio, collected by the terminal, located after the end point of a first audio in the audio to be detected, the first audio is the audio of an echo generated by a response word, and the response word is used by the terminal to respond to a wake-up word input by a user.
A processing module 802, configured to perform voice endpoint detection on the second audio.
In the embodiment of the application, the receiving module receives a second audio sent by the terminal, where the second audio is the audio, collected by the terminal, located after the end point of a first audio in the audio to be detected, the first audio is the audio of an echo generated by a response word, and the response word is used by the terminal to respond to a wake-up word input by a user; the processing module then performs voice endpoint detection on the second audio. Because the terminal detects the first audio of the echo generated by the response word and uploads only the second audio after the end point of the first audio to the server for voice endpoint detection, the audio on which the server performs voice endpoint detection does not contain the first audio. This avoids the server mistakenly identifying the end point of the echo as the end point of the user's voice and improves the accuracy of voice endpoint detection.
Optionally, the processing module 802 is specifically configured to:
determine the starting point of the second audio as the front end point of the voice in the second audio, and detect the back end point of the voice through a second voice endpoint detection model based on deep learning, where the second voice endpoint detection model is trained on audio samples containing back end points of voice. A sketch of this back-end-point logic is given below.
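A minimal sketch of this back-end-point logic, assuming the second model emits a per-frame speech probability; the threshold and hangover length are illustrative choices, not values from the patent.

def detect_back_endpoint(speech_probs, thr=0.5, hangover_frames=30):
    # The second audio begins at the front end point of the voice, so only the
    # back end point needs to be found: it is declared once enough consecutive
    # frames fall below the speech-probability threshold.
    silent = 0
    for i, p in enumerate(speech_probs):
        silent = silent + 1 if p < thr else 0
        if silent >= hangover_frames:        # e.g. 30 frames of non-speech
            return i - hangover_frames + 1   # first silent frame = back end point
    return None                              # voice has not ended yet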
Optionally, the processing module 802 is specifically configured to:
detect the front end point and back end point of the voice in the second audio through a third voice endpoint detection model based on deep learning, where the third voice endpoint detection model is trained on audio samples containing front end points and back end points of voice.
Optionally, the processing module 802 is further configured to:
extract the voice from the second audio according to the detected front end point and back end point of the voice in the second audio;
recognize the voice to obtain a voice recognition result;
send the voice recognition result to the terminal, where the voice recognition result is used for display by the terminal. A sketch of these steps is given below.
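The three steps above can be summarized in a short sketch; recognize and send_to_terminal are hypothetical stand-ins for the server's speech recognizer and its channel back to the terminal.

def handle_second_audio(samples, front, back, recognize, send_to_terminal):
    voice = samples[front:back]   # extract by the detected end points
    result = recognize(voice)     # speech recognition on the voice only
    send_to_terminal(result)      # the terminal displays this result
    return result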
The speech processing apparatus provided in the embodiment of the present application may be configured to execute the method embodiments in which the server is the execution subject; the implementation principles and technical effects are similar and are not repeated here.
Fig. 9 is a schematic hardware structure diagram of a terminal according to an embodiment of the present application. As shown in fig. 9, the terminal 90 provided in this embodiment includes: at least one processor 901 and a memory 902. The terminal 90 further comprises a communication component 903. The processor 901, the memory 902, and the communication component 903 are connected by a bus 904.
Optionally, the terminal 90 may also include an audio component and/or a multimedia component. The audio component is configured to output and/or input audio signals. For example, the audio component includes a microphone configured to receive external audio signals when the terminal is in an operational mode, such as a speech recognition mode. The received audio signal may further be stored in the memory or transmitted via the communication component 903. In some embodiments, the audio component further comprises a speaker for outputting audio signals. The multimedia component includes a screen providing an output interface between the terminal 90 and the user. In some embodiments, the screen may include a liquid crystal display and a touch panel. If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
In a specific implementation process, the at least one processor 901 executes the computer-executable instructions stored in the memory 902, so that the at least one processor 901 performs the voice processing method with the terminal as the execution subject as described above.
For the specific implementation process of the processor 901, reference may be made to the above method embodiments; the implementation principles and technical effects are similar and are not repeated here.
Fig. 10 is a schematic hardware structure diagram of a server according to another embodiment of the present application. As shown in fig. 10, the server 100 provided in this embodiment includes: at least one processor 1001 and a memory 1002. The server 100 further comprises a communication component 1003. The processor 1001, the memory 1002, and the communication component 1003 are connected by a bus 1004.
In a specific implementation process, the at least one processor 1001 executes the computer-executable instructions stored in the memory 1002, so that the at least one processor 1001 executes the voice processing method with the server as the execution subject.
For the specific implementation process of the processor 1001, reference may be made to the above method embodiments; the implementation principles and technical effects are similar and are not repeated here.
In the embodiments shown in fig. 9 and fig. 10, it should be understood that the processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in this application may be directly executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
The memory may comprise high speed RAM memory and may also include non-volatile storage NVM, such as at least one disk memory.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The application also provides a computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the voice processing method taking the terminal as an execution subject is realized.
The application also provides a computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when a processor executes the computer-executable instructions, the voice processing method taking the server as an execution subject is realized.
The readable storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. Readable storage media can be any available media that can be accessed by a general purpose or special purpose computer.
An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the readable storage medium may also reside as discrete components in the apparatus.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (26)

1. A voice processing method applied to a terminal, the method comprising:
collecting audio to be detected;
uploading a second audio to a server when detecting that a first audio exists in the audio to be detected, wherein the first audio is the audio of an echo generated by a response word, and the response word is used by the terminal to respond to a wake-up word input by a user;
wherein the second audio is the audio located after the end point of the first audio in the audio to be detected, and the second audio is used by the server for voice endpoint detection.
2. The method of claim 1, further comprising:
determining whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word.
3. The method according to claim 2, wherein determining whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word comprises:
extracting the audio features corresponding to the response word;
when the audio features corresponding to the response word match the audio to be detected, determining that the first audio exists in the audio to be detected, wherein the first audio is the portion of the audio to be detected that matches the audio features corresponding to the response word.
4. The method according to claim 2, wherein determining whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word comprises:
detecting whether the first audio exists in the audio to be detected through a first voice endpoint detection model based on deep learning, wherein the first voice endpoint detection model is trained on audio samples of the echo generated by the response word, and the audio samples contain the audio features corresponding to the response word.
5. The method according to any one of claims 1-4, further comprising:
uploading a third audio to a server when detecting that the first audio does not exist in the audio to be detected and a front end point of voice exists in the audio to be detected, so that the server detects the back end point of the voice in the third audio, wherein the third audio is the audio after the front end point of the voice in the audio to be detected.
6. The method according to any one of claims 1-4, wherein the acquiring audio to be detected comprises:
after receiving the wake-up word, collecting the audio to be detected and playing the response word.
7. The method according to any one of claims 1-4, further comprising:
receiving and displaying a voice recognition result sent by the server, wherein the voice recognition result is obtained by the server performing voice endpoint detection on the second audio to obtain the front end point and back end point of the voice in the second audio and then recognizing the voice.
8. A speech processing method applied to a server, the method comprising:
receiving a second audio sent by a terminal, wherein the second audio is the audio, collected by the terminal, located after the end point of a first audio in the audio to be detected, the first audio is the audio of an echo generated by a response word, and the response word is used by the terminal to respond to a wake-up word input by a user;
and performing voice endpoint detection on the second audio.
9. The method of claim 8, wherein performing voice endpoint detection on the second audio comprises:
determining a starting point of the second audio as the front end point of the voice in the second audio, and detecting the back end point of the voice through a second voice endpoint detection model based on deep learning; wherein the second voice endpoint detection model is trained on audio samples containing back end points of voice.
10. The method of claim 8, wherein performing voice endpoint detection on the second audio comprises:
detecting the front end point and back end point of the voice in the second audio through a third voice endpoint detection model based on deep learning, wherein the third voice endpoint detection model is trained on audio samples containing front end points and back end points of voice.
11. The method of any of claims 8-10, wherein after performing voice endpoint detection on the second audio, the method further comprises:
extracting the voice from the second audio according to the detected front end point and back end point of the voice in the second audio;
recognizing the voice to obtain a voice recognition result;
and sending the voice recognition result to the terminal, wherein the voice recognition result is used for display by the terminal.
12. A speech processing apparatus, applied to a terminal, the apparatus comprising:
the acquisition module is used for acquiring the audio to be detected;
the sending module is used for uploading a second audio to a server when detecting that a first audio exists in the audio to be detected, wherein the first audio is the audio of an echo generated by a response word, and the response word is used by the terminal to respond to a wake-up word input by a user;
wherein the second audio is the audio located after the end point of the first audio in the audio to be detected, and the second audio is used by the server for voice endpoint detection.
13. The apparatus of claim 12, further comprising: a detection module;
the detection module is configured to:
determine whether the first audio exists in the audio to be detected according to the audio features corresponding to the response word.
14. The apparatus according to claim 13, wherein the detection module is specifically configured to:
extract the audio features corresponding to the response word;
when the audio features corresponding to the response word match the audio to be detected, determine that the first audio exists in the audio to be detected, wherein the first audio is the portion of the audio to be detected that matches the audio features corresponding to the response word.
15. The apparatus according to claim 13, wherein the detection module is specifically configured to:
detect whether the first audio exists in the audio to be detected through a first voice endpoint detection model based on deep learning, wherein the first voice endpoint detection model is trained on audio samples of the echo generated by the response word, and the audio samples contain the audio features corresponding to the response word.
16. The apparatus of any one of claims 12-15, wherein the sending module is further configured to:
upload a third audio to a server when detecting that the first audio does not exist in the audio to be detected and a front end point of voice exists in the audio to be detected, so that the server detects the back end point of the voice in the third audio, wherein the third audio is the audio after the front end point of the voice in the audio to be detected.
17. The apparatus according to any one of claims 12-15, wherein the acquisition module is specifically configured to:
after receiving the wake-up word, collect the audio to be detected and play the response word.
18. The apparatus according to any one of claims 12-15, further comprising: a display module;
the display module is used for:
receive and display a voice recognition result sent by the server, wherein the voice recognition result is obtained by the server performing voice endpoint detection on the second audio to obtain the front end point and back end point of the voice in the second audio and then recognizing the voice.
19. A speech processing apparatus, applied to a server, the apparatus comprising:
the receiving module is used for receiving a second audio sent by the terminal, wherein the second audio is the audio, collected by the terminal, located after the end point of a first audio in the audio to be detected, the first audio is the audio of an echo generated by a response word, and the response word is used by the terminal to respond to a wake-up word input by a user;
and the processing module is used for carrying out voice endpoint detection on the second audio.
20. The apparatus of claim 19, wherein the processing module is specifically configured to:
determine a starting point of the second audio as the front end point of the voice in the second audio, and detect the back end point of the voice through a second voice endpoint detection model based on deep learning; wherein the second voice endpoint detection model is trained on audio samples containing back end points of voice.
21. The apparatus of claim 19, wherein the processing module is specifically configured to:
detect the front end point and back end point of the voice in the second audio through a third voice endpoint detection model based on deep learning, wherein the third voice endpoint detection model is trained on audio samples containing front end points and back end points of voice.
22. The apparatus of any one of claims 19-21, wherein the processing module is further configured to:
extract the voice from the second audio according to the detected front end point and back end point of the voice in the second audio;
recognize the voice to obtain a voice recognition result;
and send the voice recognition result to the terminal, wherein the voice recognition result is used for display by the terminal.
23. A terminal, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the speech processing method of any one of claims 1-7.
24. A server, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the speech processing method of any one of claims 8-11.
25. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the speech processing method of any one of claims 1-7.
26. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the speech processing method of any one of claims 8-11.
CN202010315910.4A 2020-04-21 2020-04-21 Voice processing method, device, terminal, server and storage medium Active CN111540357B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010315910.4A CN111540357B (en) 2020-04-21 2020-04-21 Voice processing method, device, terminal, server and storage medium

Publications (2)

Publication Number Publication Date
CN111540357A true CN111540357A (en) 2020-08-14
CN111540357B CN111540357B (en) 2024-01-26

Family

ID=71975064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010315910.4A Active CN111540357B (en) 2020-04-21 2020-04-21 Voice processing method, device, terminal, server and storage medium

Country Status (1)

Country Link
CN (1) CN111540357B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104252872A (en) * 2014-09-23 2014-12-31 深圳市中兴移动通信有限公司 Lyric generating method and intelligent terminal
CN106341563A (en) * 2015-07-06 2017-01-18 北京视联动力国际信息技术有限公司 Terminal communication based echo suppression method and device
CN105472191A (en) * 2015-11-18 2016-04-06 百度在线网络技术(北京)有限公司 Method and device for tracking echo time delay
CN106653031A (en) * 2016-10-17 2017-05-10 海信集团有限公司 Voice wake-up method and voice interaction device
CN109285554A (en) * 2017-07-20 2019-01-29 阿里巴巴集团控股有限公司 A kind of echo cancel method, server, terminal and system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185388A (en) * 2020-09-14 2021-01-05 北京小米松果电子有限公司 Speech recognition method, device, equipment and computer readable storage medium
CN112185388B (en) * 2020-09-14 2024-04-09 北京小米松果电子有限公司 Speech recognition method, device, equipment and computer readable storage medium
CN112700782A (en) * 2020-12-25 2021-04-23 维沃移动通信有限公司 Voice processing method and electronic equipment

Also Published As

Publication number Publication date
CN111540357B (en) 2024-01-26

Similar Documents

Publication Publication Date Title
EP3611895B1 (en) Method and device for user registration, and electronic device
US11176938B2 (en) Method, device and storage medium for controlling game execution using voice intelligent interactive system
CN102568478B (en) Video play control method and system based on voice recognition
CN106971723B (en) Voice processing method and device for voice processing
CA2897365C (en) Method and system for recognizing speech commands
CN110914828B (en) Speech translation method and device
CN111667835A (en) Voice recognition method, living body detection method, model training method and device
CN110706707B (en) Method, apparatus, device and computer-readable storage medium for voice interaction
CN111540357B (en) Voice processing method, device, terminal, server and storage medium
CN109361995A (en) A kind of volume adjusting method of electrical equipment, device, electrical equipment and medium
CN110211609A (en) A method of promoting speech recognition accuracy
CN112509568A (en) Voice awakening method and device
US20200312305A1 (en) Performing speaker change detection and speaker recognition on a trigger phrase
CN111161746B (en) Voiceprint registration method and system
CN108322770A (en) Video frequency program recognition methods, relevant apparatus, equipment and system
CN111862943B (en) Speech recognition method and device, electronic equipment and storage medium
CN112185425A (en) Audio signal processing method, device, equipment and storage medium
CN110956958A (en) Searching method, searching device, terminal equipment and storage medium
CN112908336A (en) Role separation method for voice processing device and voice processing device thereof
CN111400463B (en) Dialogue response method, device, equipment and medium
US10818298B2 (en) Audio processing
CN114171029A (en) Audio recognition method and device, electronic equipment and readable storage medium
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
CN114664303A (en) Continuous voice instruction rapid recognition control system
CN111968630B (en) Information processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant