CN112382281A - Voice recognition method and device, electronic equipment and readable storage medium - Google Patents

Voice recognition method and device, electronic equipment and readable storage medium

Info

Publication number
CN112382281A
CN112382281A
Authority
CN
China
Prior art keywords
audio
channel
channel audio
voice recognition
server
Prior art date
Legal status
Granted
Application number
CN202011223168.0A
Other languages
Chinese (zh)
Other versions
CN112382281B (en)
Inventor
杨松
纪盛
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011223168.0A
Publication of CN112382281A
Application granted
Publication of CN112382281B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Abstract

The application discloses a speech recognition method and apparatus, an electronic device, and a readable storage medium, relating to the technical field of speech processing. The scheme adopted for speech recognition is as follows: acquire audio to be recognized; preprocess the audio to be recognized to obtain first multi-channel audio; perform wake-up detection on the first multi-channel audio and, when a wake-up word is detected, extract second multi-channel audio from the first multi-channel audio; perform multi-channel mixed compression on the second multi-channel audio and send the compressed audio to a server for speech recognition; and receive the speech recognition result returned by the server. The method and apparatus simplify the speech recognition procedure and improve recognition accuracy.

Description

Voice recognition method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular to a speech recognition method and apparatus, an electronic device, and a readable storage medium in the field of speech processing technologies.
Background
With the popularity of voice interaction, applications and products built around it are constantly emerging. Since voice interaction is implemented on the basis of speech recognition, the accuracy of speech recognition indirectly affects the accuracy of voice interaction.
In the prior art, the audio data used for speech recognition is usually processed into single-channel audio. Because this single-channel audio differs greatly from the original audio, it is severely distorted and of low quality, which reduces the accuracy of speech recognition.
Disclosure of Invention
To solve the above technical problem, the present application provides a speech recognition method, including: acquiring audio to be recognized; preprocessing the audio to be recognized to obtain first multi-channel audio; performing wake-up detection on the first multi-channel audio and, when a wake-up word is detected, extracting second multi-channel audio from the first multi-channel audio; performing multi-channel mixed compression on the second multi-channel audio and sending the compressed audio to a server for speech recognition; and receiving the speech recognition result returned by the server.
The present application further provides a speech recognition apparatus, including: an acquiring unit configured to acquire audio to be recognized; a preprocessing unit configured to preprocess the audio to be recognized to obtain first multi-channel audio; a detection unit configured to perform wake-up detection on the first multi-channel audio and to extract second multi-channel audio from the first multi-channel audio when a wake-up word is detected; a compression unit configured to perform multi-channel mixed compression on the second multi-channel audio and then send the compressed audio to a server for speech recognition; and a receiving unit configured to receive the speech recognition result returned by the server.
An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above method.
A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the above method.
An embodiment of the above application has the following advantages or benefits: it simplifies the speech recognition procedure and improves recognition accuracy. By performing speech recognition on multi-channel audio, it solves the prior-art problem that using severely distorted, low-quality single-channel audio lowers recognition accuracy, thereby simplifying the speech recognition procedure and improving the accuracy of speech recognition.
Other effects of the above alternatives will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present application;
FIG. 2 is a schematic diagram according to a second embodiment of the present application;
FIG. 3 is a schematic illustration according to a third embodiment of the present application;
fig. 4 is a block diagram of an electronic device for implementing a speech recognition method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present application. As shown in fig. 1, the speech recognition method of this embodiment may specifically include the following steps:
S101, acquiring audio to be recognized;
S102, preprocessing the audio to be recognized to obtain first multi-channel audio;
S103, performing wake-up detection on the first multi-channel audio, and extracting second multi-channel audio from the first multi-channel audio when a wake-up word is detected;
S104, performing multi-channel mixed compression on the second multi-channel audio, and sending the compressed audio to a server for speech recognition;
S105, receiving the speech recognition result returned by the server.
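The five steps above can be sketched end to end. The following is a minimal, hypothetical illustration: every helper (the wake-word detector, the preprocessing and compression stand-ins, and the fake server callback) is a trivial placeholder, not the patent's actual algorithm.

```python
# Hypothetical sketch of client-side steps S101-S105; all helpers are
# illustrative stand-ins, not the algorithms the patent actually uses.

def preprocess_channel(samples):
    # Stand-in for denoising/dereverberation (S102): here, just a copy.
    return list(samples)

def detect_wake_word(channels, wake_marker=1.0):
    # Stand-in detector (S103): index just past the first sample equal to
    # wake_marker on channel 0, or None if no wake word is found.
    for i, s in enumerate(channels[0]):
        if s == wake_marker:
            return i + 1
    return None

def compress_channels(channels):
    # Stand-in for multi-channel mixed compression (S104): identity.
    return channels

def recognize(raw_channels, send_to_server):
    first = [preprocess_channel(ch) for ch in raw_channels]   # S102
    idx = detect_wake_word(first)                             # S103
    if idx is None:
        return None                                           # no wake word
    second = [ch[idx:] for ch in first]                       # S103 (extract)
    compressed = compress_channels(second)                    # S104
    return send_to_server(compressed)                         # S104/S105

# Usage: a fake "server" that just counts the samples it received.
result = recognize([[0.0, 1.0, 0.5, 0.25], [0.1, 0.2, 0.3, 0.4]],
                   lambda audio: sum(len(ch) for ch in audio))
print(result)  # → 4  (two channels x two post-wake-word samples)
```

Note the slicing in S103 is applied to every channel, so the extracted audio stays multi-channel throughout.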
The speech recognition method of this embodiment processes the audio to be recognized into multi-channel audio for the server to use in speech recognition, which ensures that the audio used by the server is of high quality, simplifies the speech recognition procedure, and further improves recognition accuracy.
The method of this embodiment may be executed by a terminal device capable of voice interaction, such as a smart phone, a personal computer, a smart speaker, a smart home appliance, or a vehicle-mounted device; that is, in this embodiment, speech recognition is implemented through interaction between the terminal device and the server.
In this embodiment, the audio to be recognized acquired in S101 is audio data collected by a microphone of the terminal device, and this audio data is multi-channel audio.
After the audio to be recognized is acquired in S101, this embodiment performs S102 to preprocess it, thereby obtaining the first multi-channel audio.
Specifically, when preprocessing the acquired audio to be recognized in S102, an optional implementation is: perform at least one of noise reduction and dereverberation on the audio to be recognized, for example noise reduction followed by dereverberation, and take the processing result as the first multi-channel audio.
When performing noise reduction in S102, an existing multi-channel AEC (acoustic echo cancellation) algorithm may be used to remove echo from the audio; when performing dereverberation, an existing multi-channel WPE (weighted prediction error) algorithm may be used to remove reverberation.
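AEC and WPE are substantial signal-processing algorithms and are not reproduced here. Purely as an illustration of per-channel preprocessing that keeps the multi-channel layout intact, a toy noise gate (an assumed stand-in, not the patent's method) might look like:

```python
# Illustrative stand-in for S102 preprocessing: a per-channel noise gate
# that zeroes samples below a threshold. Real systems would apply AEC and
# WPE here; the point is only that each channel is processed separately
# and the output remains multi-channel.

def noise_gate(samples, threshold=0.05):
    return [s if abs(s) >= threshold else 0.0 for s in samples]

raw = [[0.01, 0.5, -0.02, -0.7],   # channel 1
       [0.03, 0.4, 0.01, -0.6]]    # channel 2
first_multichannel = [noise_gate(ch) for ch in raw]
print(first_multichannel)  # → [[0.0, 0.5, 0.0, -0.7], [0.0, 0.4, 0.0, -0.6]]
```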
In other words, the first multi-channel audio obtained in S102 is multi-channel audio from which echo and reverberation have been removed, which enhances the quality of the audio used for speech recognition and thereby improves recognition accuracy.
After obtaining the first multi-channel audio in S102, this embodiment performs S103 to run wake-up detection on it and, when a wake-up word is detected, extract the second multi-channel audio from the first multi-channel audio; this second multi-channel audio is the audio data ultimately used by the server for speech recognition.
Specifically, when extracting the second multi-channel audio in S103 after a wake-up word has been detected, an optional implementation is: extract the portion of audio after the wake-up word from the first multi-channel audio as the second multi-channel audio.
That is, the second multi-channel audio obtained in S103 is the first multi-channel audio with the portion up to and including the wake-up word removed; the remaining portion carries the actual voice interaction between the user and the terminal device, so performing speech recognition on the second multi-channel audio improves both the efficiency and the accuracy of recognition.
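A sketch of this extraction step, assuming the wake-up detector reports the time at which the wake word ends and assuming a 16 kHz sampling rate (both are illustrative choices, not stated in the patent):

```python
# Hypothetical sketch: keep only the audio after the wake word, on every
# channel, so the result stays multi-channel. The 16 kHz rate and the
# time-based cut point are assumptions for illustration.

SAMPLE_RATE = 16000  # assumed sampling rate

def audio_after_wake_word(channels, wake_end_seconds):
    cut = int(wake_end_seconds * SAMPLE_RATE)
    return [ch[cut:] for ch in channels]

first = [[0.0] * 16000 + [0.5] * 8000,   # 1 s wake word, then 0.5 s speech
         [0.0] * 16000 + [0.4] * 8000]
second = audio_after_wake_word(first, 1.0)
print(len(second), len(second[0]))  # → 2 8000
```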
After obtaining the second multi-channel audio in S103, this embodiment performs S104 to apply multi-channel mixed compression to it and then send the compressed audio to the server for speech recognition.
In the prior art, after wake-up detection is performed on the multi-channel audio, signal enhancement is additionally applied to produce a single channel of audio for speech recognition, and that single channel is then compressed and sent to the server. However, the single-channel audio obtained this way differs greatly from the original audio, so its quality is substantially degraded, which reduces the accuracy of the server's speech recognition.
In this embodiment, no signal enhancement is applied to the second multi-channel audio, so the audio ultimately used for speech recognition remains multi-channel. This avoids the large difference between the recognition audio and the original audio, preserves high audio quality, simplifies the speech recognition procedure, and improves recognition accuracy.
Specifically, when performing multi-channel mixed compression on the second multi-channel audio in S104, this embodiment may adopt the following optional implementation: determine the audio energy of each channel in the second multi-channel audio, then compress each channel at a compression rate corresponding to its audio energy to obtain the compressed audio.
By compressing each channel at a different rate according to its audio energy, this embodiment improves the compression efficiency of the multi-channel audio and reduces compression loss, so the compressed audio used by the server for speech recognition retains higher quality.
For example, suppose the second multi-channel audio has two channels, one corresponding to ambient sound and the other to the user's voice. If the ambient-sound channel has low audio energy, it can be compressed at a high compression rate; if the user-voice channel has high audio energy, it can be compressed at a low compression rate.
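This two-channel example can be sketched as follows. The RMS energy measure, the 8-bit/16-bit split, and the 0.1 threshold are all illustrative assumptions; the patent only specifies that the compression rate corresponds to the channel's audio energy:

```python
import math

# Hypothetical sketch of energy-dependent compression: quiet channels get
# coarser quantization (higher compression). The bit depths and threshold
# are illustrative choices, not the patent's parameters.

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def choose_bits(energy, threshold=0.1):
    return 8 if energy < threshold else 16   # low energy -> compress harder

def quantize(samples, bits):
    scale = 2 ** (bits - 1) - 1
    return [round(s * scale) for s in samples]

ambient = [0.01, -0.02, 0.015, -0.01]   # quiet environmental channel
speech  = [0.4, -0.5, 0.45, -0.3]       # louder user-speech channel

compressed = {name: quantize(ch, choose_bits(rms(ch)))
              for name, ch in [("ambient", ambient), ("speech", speech)]}
print(choose_bits(rms(ambient)), choose_bits(rms(speech)))  # → 8 16
```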
After the compressed audio is sent to the server in S104, for example via the communication module of the terminal device, the server first decompresses it, then extracts audio features from each of the decompressed channels, and finally performs speech recognition on the extracted features to obtain the speech recognition result.
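A hypothetical sketch of the server side: dequantize each channel, then compute a simple per-frame feature. Log frame energy is used here only as a stand-in for the filterbank or MFCC features a real ASR front end would extract; the frame length and bit depth are assumptions.

```python
import math

# Illustrative server-side pipeline: decompress (dequantize) each channel,
# then extract per-frame features. Log frame energy stands in for real
# ASR features; a recognizer would consume these feature vectors next.

def dequantize(samples, bits=16):
    scale = 2 ** (bits - 1) - 1
    return [s / scale for s in samples]

def frame_log_energy(samples, frame_len=4):
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        feats.append(math.log(energy + 1e-10))  # floor avoids log(0)
    return feats

channels = [[100, -200, 150, -50, 300, -250, 100, -100]]  # one decoded channel
feats = [frame_log_energy(dequantize(ch)) for ch in channels]
print(len(feats), len(feats[0]))  # → 1 2
```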
After sending the compressed audio to the server in S104, this embodiment performs S105 to receive the speech recognition result returned by the server.
It is understood that, after receiving the speech recognition result in S105, this embodiment may further: query with the received speech recognition result to obtain a query result, convert the query result into audio, and play that audio to the user.
That is, this embodiment completes the voice interaction according to the speech recognition result returned by the server. Because the server performs speech recognition on multi-channel audio of higher quality, both the accuracy of the recognition result and, correspondingly, the accuracy of the voice interaction are improved.
In the method of this embodiment, the terminal device processes the audio to be recognized into multi-channel audio for the server to use in speech recognition. Because this multi-channel audio differs little from the audio to be recognized, its quality is high and recognition accuracy improves; and because no signal enhancement or similar processing is needed, the speech recognition procedure is simplified and recognition efficiency improves.
Fig. 2 is a schematic diagram according to a second embodiment of the present application. As shown in fig. 2, the speech recognition apparatus of the present embodiment includes:
the acquiring unit 201, configured to acquire audio to be recognized;
the preprocessing unit 202, configured to preprocess the audio to be recognized to obtain first multi-channel audio;
the detection unit 203, configured to perform wake-up detection on the first multi-channel audio and to extract second multi-channel audio from the first multi-channel audio when a wake-up word is detected;
the compression unit 204, configured to perform multi-channel mixed compression on the second multi-channel audio and then send the compressed audio to a server for speech recognition;
the receiving unit 205, configured to receive the speech recognition result returned by the server.
The speech recognition apparatus of this embodiment may reside in a terminal device capable of voice interaction, such as a smart phone, a personal computer, a smart speaker, a smart home appliance, or a vehicle-mounted device; that is, speech recognition is implemented through interaction between the terminal device and the server.
The audio to be recognized acquired by the acquiring unit 201 is audio data collected by a microphone of the terminal device, and this audio data is multi-channel audio.
After the acquiring unit 201 acquires the audio to be recognized, the preprocessing unit 202 preprocesses it to obtain the first multi-channel audio.
Specifically, when the preprocessing unit 202 preprocesses the acquired audio to be recognized, an optional implementation is: perform at least one of noise reduction and dereverberation on the audio to be recognized, for example noise reduction followed by dereverberation, and take the processing result as the first multi-channel audio.
When the preprocessing unit 202 performs noise reduction, an existing multi-channel AEC (acoustic echo cancellation) algorithm may be used to remove echo from the audio; when it performs dereverberation, an existing multi-channel WPE (weighted prediction error) algorithm may be used to remove reverberation.
In other words, the first multi-channel audio obtained by the preprocessing unit 202 is multi-channel audio from which echo and reverberation have been removed, which enhances the quality of the audio used for speech recognition and thereby improves recognition accuracy.
After the preprocessing unit 202 obtains the first multi-channel audio, the detection unit 203 performs wake-up detection on it and, when a wake-up word is detected, extracts the second multi-channel audio from the first multi-channel audio; this second multi-channel audio is the audio data ultimately used by the server for speech recognition.
Specifically, when the detection unit 203 extracts the second multi-channel audio after a wake-up word has been detected, an optional implementation is: extract the portion of audio after the wake-up word from the first multi-channel audio as the second multi-channel audio.
That is, the second multi-channel audio obtained by the detection unit 203 is the first multi-channel audio with the portion up to and including the wake-up word removed; the remaining portion carries the actual voice interaction between the user and the terminal device, so performing speech recognition on the second multi-channel audio improves both the efficiency and the accuracy of recognition.
After the detection unit 203 obtains the second multi-channel audio, the compression unit 204 performs multi-channel mixed compression on it and then sends the compressed audio to the server for speech recognition.
In this embodiment, no signal enhancement is applied to the second multi-channel audio, so the audio ultimately used for speech recognition remains multi-channel. This avoids the large difference between the recognition audio and the original audio, preserves high audio quality, simplifies the speech recognition procedure, and improves recognition accuracy.
Specifically, when the compression unit 204 performs multi-channel mixed compression on the second multi-channel audio, it may adopt the following optional implementation: determine the audio energy of each channel in the second multi-channel audio, then compress each channel at a compression rate corresponding to its audio energy to obtain the compressed audio.
By compressing each channel at a different rate according to its audio energy, this embodiment improves the compression efficiency of the multi-channel audio and reduces compression loss, so the compressed audio used by the server for speech recognition retains higher quality.
After the compression unit 204 sends the compressed audio to the server, the server first decompresses it, then extracts audio features from each of the decompressed channels, and finally performs speech recognition on the extracted features to obtain the speech recognition result.
After the compression unit 204 sends the compressed audio to the server, the receiving unit 205 receives the speech recognition result returned by the server.
It is understood that, after receiving the speech recognition result, the receiving unit 205 may further: query with the received speech recognition result to obtain a query result, convert the query result into audio, and play that audio to the user.
Fig. 3 is a schematic diagram according to a third embodiment of the present application. As shown in fig. 3, the speech recognition flow of the present application is as follows: first, a microphone of the terminal device collects the audio to be recognized; the audio is then preprocessed, including noise reduction and dereverberation, to obtain the first multi-channel audio; wake-up detection is performed on the first multi-channel audio, and the second multi-channel audio, i.e., the multi-channel recognition audio, is output; multi-channel mixed compression is then applied to the second multi-channel audio, and the compressed audio is transmitted to the server over a network link; the server decompresses the compressed audio, completes speech recognition, and returns the speech recognition result to the terminal device. In fig. 3, PCM stands for Pulse Code Modulation and ASR for Automatic Speech Recognition.
According to an embodiment of the present application, an electronic device and a computer-readable storage medium are also provided.
Fig. 4 is a block diagram of an electronic device for the speech recognition method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant as examples only and are not meant to limit the implementations of the present application described and/or claimed herein.
As shown in fig. 4, the electronic device includes: one or more processors 401, a memory 402, and interfaces for connecting the various components, including high-speed and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other ways as required. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information for a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, if desired. Likewise, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 4, one processor 401 is taken as an example.
Memory 402 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the speech recognition methods provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the speech recognition method provided by the present application.
The memory 402, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the speech recognition method in the embodiments of the present application (for example, the acquiring unit 201, the preprocessing unit 202, the detection unit 203, the compression unit 204, and the receiving unit 205 shown in fig. 2). By running the non-transitory software programs, instructions, and modules stored in the memory 402, the processor 401 executes the various functional applications and data processing of the server, i.e., implements the speech recognition method in the above method embodiments.
The memory 402 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the electronic device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 402 may optionally include memory located remotely from the processor 401, which may be connected to the speech recognition method electronics over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the speech recognition method may further include: an input device 403 and an output device 404. The processor 401, the memory 402, the input device 403 and the output device 404 may be connected by a bus or other means, and fig. 4 illustrates an example of a connection by a bus.
The input device 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus of the voice recognition method, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or the like. The output devices 404 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services.
According to the technical solutions of the embodiments of the present application, the audio to be recognized is processed into multi-channel audio for the server to perform voice recognition. This ensures that the audio used by the server during voice recognition is of higher quality, simplifies the voice recognition steps, and further improves the accuracy of voice recognition.
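The flow summarized above (preprocess, wake-up detection, extraction of the audio after the wake-up word, compression, upload) can be sketched in outline. The snippet below is a minimal illustration only, assuming 16 kHz audio in a channels-by-samples NumPy array; the helper names (`preprocess`, `detect_wake_word`, `prepare_for_server`) are hypothetical stubs, not an API disclosed by the application.

```python
from typing import Optional

import numpy as np

# Hypothetical helper names; the application does not disclose an API.
def preprocess(raw: np.ndarray) -> np.ndarray:
    """Stand-in for noise reduction / de-reverberation, producing the
    'first multi-channel audio' (shape: channels x samples)."""
    return raw - raw.mean(axis=1, keepdims=True)  # toy DC-offset removal

def detect_wake_word(audio: np.ndarray) -> Optional[int]:
    """Return the sample index just after the wake-up word, or None if
    no wake-up word is present (stubbed to a fixed offset here)."""
    return 1600  # pretend the wake-up word ends 0.1 s in at 16 kHz

def prepare_for_server(raw: np.ndarray) -> Optional[np.ndarray]:
    first = preprocess(raw)           # step 1: first multi-channel audio
    offset = detect_wake_word(first)  # step 2: wake-up detection
    if offset is None:
        return None                   # no wake-up word: nothing is sent
    second = first[:, offset:]        # step 3: second multi-channel audio
    # step 4 (not shown): multi-channel mixed compression, then upload
    return second
```

Doing the wake-up detection on the device and sending only the post-wake-word audio keeps the server-side recognizer from ever seeing silence or the wake-up word itself, which is what lets the server skip those steps.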
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A voice recognition method, comprising:
acquiring audio to be recognized;
preprocessing the audio to be recognized to obtain a first multi-channel audio;
performing wake-up detection on the first multi-channel audio, and extracting a second multi-channel audio from the first multi-channel audio in the case where a wake-up word is detected;
performing multi-channel mixed compression on the second multi-channel audio, and sending the compressed audio to a server for voice recognition; and
receiving a voice recognition result returned by the server.
2. The method of claim 1, wherein the preprocessing the audio to be recognized to obtain a first multi-channel audio comprises:
performing at least one of noise reduction processing and de-reverberation processing on the audio to be recognized; and
taking the processing result as the first multi-channel audio.
3. The method of claim 1, wherein the extracting a second multi-channel audio from the first multi-channel audio in the case where a wake-up word is detected comprises:
extracting, from the first multi-channel audio, the audio portion following the wake-up word as the second multi-channel audio.
4. The method of claim 1, wherein the multi-channel mixed compression of the second multi-channel audio comprises:
determining the audio energy of each channel of audio in the second multi-channel audio; and
compressing each channel of audio according to the compression ratio corresponding to its audio energy, to obtain the compressed audio.
5. A voice recognition apparatus, comprising:
an acquisition unit configured to acquire audio to be recognized;
a preprocessing unit configured to preprocess the audio to be recognized to obtain a first multi-channel audio;
a detection unit configured to perform wake-up detection on the first multi-channel audio and to extract a second multi-channel audio from the first multi-channel audio in the case where a wake-up word is detected;
a compression unit configured to perform multi-channel mixed compression on the second multi-channel audio and to send the compressed audio to a server for voice recognition; and
a receiving unit configured to receive a voice recognition result returned by the server.
6. The apparatus according to claim 5, wherein the preprocessing unit, when preprocessing the audio to be recognized to obtain a first multi-channel audio, specifically performs:
performing at least one of noise reduction processing and de-reverberation processing on the audio to be recognized; and
taking the processing result as the first multi-channel audio.
7. The apparatus according to claim 5, wherein the detection unit, when extracting a second multi-channel audio from the first multi-channel audio in the case where a wake-up word is detected, specifically performs:
extracting, from the first multi-channel audio, the audio portion following the wake-up word as the second multi-channel audio.
8. The apparatus according to claim 5, wherein the compression unit, when performing the multi-channel mixed compression on the second multi-channel audio, specifically performs:
determining the audio energy of each channel of audio in the second multi-channel audio; and
compressing each channel of audio according to the compression ratio corresponding to its audio energy, to obtain the compressed audio.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.
10. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-4.
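Claim 4's energy-dependent compression can be illustrated with a toy sketch. The mapping below from channel energy to compression ratio (the most energetic channel keeps full fidelity, quieter channels are compressed up to 4:1) and the use of decimation as a stand-in for a real codec are assumptions for illustration only; the claims do not fix a particular mapping or codec, and the function names are hypothetical.

```python
import numpy as np

def channel_energies(audio: np.ndarray) -> np.ndarray:
    """Mean squared amplitude per channel; audio shape: (channels, samples)."""
    return np.mean(audio.astype(np.float64) ** 2, axis=1)

def compression_ratio(energy: float, max_energy: float) -> float:
    """Hypothetical mapping: the most energetic channel gets ratio 1.0
    (no extra compression); quieter channels go up to 4:1."""
    if max_energy == 0.0:
        return 4.0
    return 1.0 + 3.0 * (1.0 - energy / max_energy)

def compress_channels(audio: np.ndarray) -> list:
    """Compress each channel at the ratio implied by its energy.
    Decimation stands in for a real codec in this sketch."""
    energies = channel_energies(audio)
    max_energy = float(energies.max())
    compressed = []
    for channel, energy in zip(audio, energies):
        step = max(1, round(compression_ratio(float(energy), max_energy)))
        compressed.append(channel[::step])  # keep every step-th sample
    return compressed
```

For example, mixing a loud channel with a quiet one leaves the loud channel untouched (ratio 1.0) while the quiet channel is decimated roughly 4:1, so the payload sent to the server stays small without degrading the most informative channel.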
CN202011223168.0A 2020-11-05 2020-11-05 Voice recognition method, device, electronic equipment and readable storage medium Active CN112382281B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011223168.0A CN112382281B (en) 2020-11-05 2020-11-05 Voice recognition method, device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011223168.0A CN112382281B (en) 2020-11-05 2020-11-05 Voice recognition method, device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112382281A true CN112382281A (en) 2021-02-19
CN112382281B CN112382281B (en) 2023-11-21

Family

ID=74579404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011223168.0A Active CN112382281B (en) 2020-11-05 2020-11-05 Voice recognition method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112382281B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113393838A (en) * 2021-06-30 2021-09-14 北京探境科技有限公司 Voice processing method and device, computer readable storage medium and computer equipment

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020064284A1 (en) * 2000-11-24 2002-05-30 Yoshiaki Takagi Sound signal encoding apparatus and method
US20080199021A1 (en) * 2005-07-12 2008-08-21 Samsung Electronics Co., Ltd. Method and Apparatus For Providing Ip Datacasting Service in Digital Audio Broadcasting System
CN103295571A (en) * 2012-02-29 2013-09-11 Nvidia Corp. Control using time and/or spectrally compacted audio commands
CN106653031A (en) * 2016-10-17 2017-05-10 Hisense Group Co., Ltd. Voice wake-up method and voice interaction device
CN107223280A (en) * 2017-03-03 2017-09-29 Shenzhen Qianhai CloudMinds Cloud Intelligent Technology Co., Ltd. Robot awakening method, device and robot
CN108986822A (en) * 2018-08-31 2018-12-11 Mobvoi Information Technology Co., Ltd. Voice recognition method, device, electronic equipment and non-transitory computer storage medium
CN109859757A (en) * 2019-03-19 2019-06-07 Baidu Online Network Technology (Beijing) Co., Ltd. Voice device control method, apparatus and terminal
CN110060685A (en) * 2019-04-15 2019-07-26 Baidu Online Network Technology (Beijing) Co., Ltd. Voice wake-up method and device
CN110189753A (en) * 2019-05-28 2019-08-30 Beijing Baidu Netcom Science and Technology Co., Ltd. Bluetooth speaker and control method, system and storage medium thereof
CN110427097A (en) * 2019-06-18 2019-11-08 Huawei Technologies Co., Ltd. Voice data processing method, apparatus and system
CN111128201A (en) * 2019-12-31 2020-05-08 Baidu Online Network Technology (Beijing) Co., Ltd. Interaction method, device, system, electronic equipment and storage medium
CN111755002A (en) * 2020-06-19 2020-10-09 Beijing Baidu Netcom Science and Technology Co., Ltd. Speech recognition device, electronic apparatus, and speech recognition method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
周旺; 姜?|: "Design of a multi-channel sound card based on TDM", Applied Science and Technology, no. 10 *
杨松平: "Application of satellite receiving technology in live broadcast transmission", Video Engineering, no. 05 *


Also Published As

Publication number Publication date
CN112382281B (en) 2023-11-21

Similar Documents

Publication Publication Date Title
CN111192591B (en) Awakening method and device of intelligent equipment, intelligent sound box and storage medium
CN111402868B (en) Speech recognition method, device, electronic equipment and computer readable storage medium
US20210097994A1 (en) Data processing method and apparatus for intelligent device, and storage medium
CN111755002B (en) Speech recognition device, electronic apparatus, and speech recognition method
CN112434139A (en) Information interaction method and device, electronic equipment and storage medium
CN112908318A (en) Awakening method and device of intelligent sound box, intelligent sound box and storage medium
CN112634890B (en) Method, device, equipment and storage medium for waking up playing equipment
CN111128201A (en) Interaction method, device, system, electronic equipment and storage medium
CN112382294B (en) Speech recognition method, device, electronic equipment and storage medium
CN112382281B (en) Voice recognition method, device, electronic equipment and readable storage medium
CN112071323B (en) Method and device for acquiring false wake-up sample data and electronic equipment
CN110600039B (en) Method and device for determining speaker attribute, electronic equipment and readable storage medium
CN112382292A (en) Voice-based control method and device
CN111369999A (en) Signal processing method and device and electronic equipment
CN110633357A (en) Voice interaction method, device, equipment and medium
CN114333017A (en) Dynamic pickup method and device, electronic equipment and storage medium
CN111724805A (en) Method and apparatus for processing information
CN112037781B (en) Voice data acquisition method and device
CN115312042A (en) Method, apparatus, device and storage medium for processing audio
CN112329907A (en) Dialogue processing method and device, electronic equipment and storage medium
CN114221940B (en) Audio data processing method, system, device, equipment and storage medium
CN112164396A (en) Voice control method and device, electronic equipment and storage medium
CN111986682A (en) Voice interaction method, device, equipment and storage medium
CN114071318B (en) Voice processing method, terminal equipment and vehicle
CN113129904B (en) Voiceprint determination method, apparatus, system, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant