CN112382281A - Voice recognition method and device, electronic equipment and readable storage medium - Google Patents
- Publication number
- CN112382281A CN112382281A CN202011223168.0A CN202011223168A CN112382281A CN 112382281 A CN112382281 A CN 112382281A CN 202011223168 A CN202011223168 A CN 202011223168A CN 112382281 A CN112382281 A CN 112382281A
- Authority
- CN
- China
- Prior art keywords
- audio
- channel
- channel audio
- voice recognition
- server
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
Abstract
The application discloses a voice recognition method and apparatus, an electronic device, and a readable storage medium, relating to the field of speech processing. The scheme adopted for voice recognition is as follows: acquiring audio to be recognized; preprocessing the audio to be recognized to obtain a first multi-channel audio; performing wake-up detection on the first multi-channel audio, and extracting a second multi-channel audio from the first multi-channel audio when a wake-up word is detected; performing multi-channel mixed compression on the second multi-channel audio, and then sending the compressed audio to a server for voice recognition; and receiving the voice recognition result returned by the server. The method and apparatus simplify the steps of voice recognition and improve its accuracy.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular to a voice recognition method and apparatus, an electronic device, and a readable storage medium in the field of speech processing technologies.
Background
With the popularity of voice interaction, applications and products built around it are constantly emerging. Because voice interaction is implemented on the basis of voice recognition, the accuracy of voice recognition in turn affects the accuracy of voice interaction.
In the prior art, the audio data used for voice recognition is usually processed into single-channel audio. Because this single-channel audio differs considerably from the original audio, the audio used for recognition is seriously distorted and of low quality, which reduces the accuracy of voice recognition.
Disclosure of Invention
To solve this technical problem, the present application provides a voice recognition method, including: acquiring audio to be recognized; preprocessing the audio to be recognized to obtain a first multi-channel audio; performing wake-up detection on the first multi-channel audio, and extracting a second multi-channel audio from the first multi-channel audio when a wake-up word is detected; performing multi-channel mixed compression on the second multi-channel audio, and then sending the compressed audio to a server for voice recognition; and receiving a voice recognition result returned by the server.
The present application further provides a voice recognition apparatus, including: an acquiring unit, configured to acquire audio to be recognized; a preprocessing unit, configured to preprocess the audio to be recognized to obtain a first multi-channel audio; a detection unit, configured to perform wake-up detection on the first multi-channel audio and to extract a second multi-channel audio from the first multi-channel audio when a wake-up word is detected; a compression unit, configured to perform multi-channel mixed compression on the second multi-channel audio and then send the compressed audio to a server for voice recognition; and a receiving unit, configured to receive the voice recognition result returned by the server.
An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above method.
A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the above method.
An embodiment of the above application has the following advantage or benefit: it simplifies the steps of voice recognition and improves its accuracy. Because voice recognition is performed on multi-channel audio, the prior-art problem of low recognition accuracy caused by using seriously distorted, low-quality single-channel audio is solved; the voice recognition steps are simplified, and the accuracy of voice recognition is improved.
Other effects of the above-described alternatives will be described below in conjunction with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present application;
FIG. 2 is a schematic diagram according to a second embodiment of the present application;
FIG. 3 is a schematic illustration according to a third embodiment of the present application;
fig. 4 is a block diagram of an electronic device for implementing a speech recognition method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present application. As shown in fig. 1, the speech recognition method of this embodiment may specifically include the following steps:
S101, acquiring audio to be recognized;
S102, preprocessing the audio to be recognized to obtain a first multi-channel audio;
S103, performing wake-up detection on the first multi-channel audio, and extracting a second multi-channel audio from the first multi-channel audio when a wake-up word is detected;
S104, performing multi-channel mixed compression on the second multi-channel audio, and then sending the compressed audio to a server for voice recognition;
S105, receiving the voice recognition result returned by the server.
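For illustration, the five steps above can be sketched as the following client-side flow. This is a minimal, non-authoritative sketch: every function body is a placeholder (the toy detector simply assumes the wake word occupies the first two samples), and none of the names come from the patent itself.

```python
from typing import List

Channel = List[float]  # one channel of PCM samples

def preprocess(audio: List[Channel]) -> List[Channel]:
    # Placeholder for S102: noise reduction / dereverberation would go here.
    return audio

def detect_wake_word_end(audio: List[Channel]) -> int:
    # Placeholder for S103's detector: return the sample index right after
    # the wake word, or -1 if no wake word is present. A real detector runs
    # a keyword-spotting model; this toy assumes the wake word occupies the
    # first 2 samples whenever the input is longer than that.
    return 2 if audio and len(audio[0]) > 2 else -1

def compress(audio: List[Channel]) -> List[Channel]:
    # Placeholder for S104's multi-channel mixed compression.
    return audio

def send_to_server(compressed: List[Channel]) -> str:
    # Placeholder for S104/S105: upload and receive the recognition result.
    return "<recognition result>"

def recognize(audio_to_recognize: List[Channel]) -> str:
    first = preprocess(audio_to_recognize)      # S102: first multi-channel audio
    end = detect_wake_word_end(first)           # S103: wake-up detection
    if end < 0:
        return ""                               # no wake word: do nothing
    second = [ch[end:] for ch in first]         # S103: second multi-channel audio
    return send_to_server(compress(second))     # S104 + S105
```

Note that the audio stays multi-channel end to end; the later embodiments rely on exactly this property.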
The voice recognition method of this embodiment processes the audio to be recognized into multi-channel audio for the server to perform voice recognition, which ensures that the audio used by the server is of high quality, simplifies the voice recognition steps, and thereby improves the accuracy of voice recognition.
The voice recognition method of this embodiment may be executed by a terminal device capable of voice interaction, such as a smart phone, a personal computer, a smart speaker, a smart home appliance, or a vehicle-mounted device; that is, in this embodiment, voice recognition is implemented through interaction between the terminal device and the server.
In this embodiment, the audio to be recognized acquired in S101 is audio data collected by a microphone of the terminal device, and the collected audio data is multi-channel audio.
After the audio to be recognized is acquired in S101, S102 is performed to preprocess it and obtain a first multi-channel audio.
Specifically, when the acquired audio to be recognized is preprocessed in S102 to obtain the first multi-channel audio, an optional implementation is as follows: perform at least one of noise reduction and dereverberation on the acquired audio to be recognized, for example, first noise reduction and then dereverberation; and take the processing result as the first multi-channel audio.
In this embodiment, when S102 performs noise reduction on the audio to be recognized, an existing multi-channel AEC (acoustic echo cancellation) algorithm may be used to remove echo from the audio; when dereverberation is performed, an existing multi-channel WPE (weighted prediction error) algorithm may be used to remove reverberation from the audio.
That is to say, the first multi-channel audio obtained in S102 is multi-channel audio from which echo and reverberation have been removed, which enhances the quality of the multi-channel audio used for voice recognition and improves recognition accuracy.
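As a hedged illustration of the echo-removal idea only (not the AEC or WPE algorithms named above, which are adaptive multi-tap methods), a single-tap, fixed-gain echo subtraction applied per channel might look like:

```python
def cancel_echo(mic: list, ref: list, gain: float) -> list:
    # Subtract a scaled copy of the loudspeaker reference from the microphone
    # signal. This single-tap, fixed-gain model is a drastic simplification:
    # a real AEC adapts a multi-tap filter (e.g. with NLMS) to the echo path.
    return [m - gain * r for m, r in zip(mic, ref)]

def preprocess_channels(channels: list, ref: list, gain: float = 0.5) -> list:
    # Apply the toy echo canceller to every channel of the multi-channel
    # input; a real pipeline would follow with WPE-style dereverberation.
    return [cancel_echo(ch, ref, gain) for ch in channels]
```

The `gain` value and function names here are illustrative assumptions; the point is only that preprocessing operates channel by channel and leaves the multi-channel structure intact.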
After the first multi-channel audio is obtained in S102, S103 performs wake-up detection on it and, when a wake-up word is detected, extracts a second multi-channel audio from the first multi-channel audio; this second multi-channel audio is the audio data ultimately used by the server for voice recognition.
Specifically, when S103 extracts the second multi-channel audio from the first multi-channel audio upon detecting a wake-up word, an optional implementation is as follows: extract the audio portion following the wake-up word from the first multi-channel audio as the second multi-channel audio.
That is to say, the second multi-channel audio obtained in S103 is the first multi-channel audio with the portion up to and including the wake-up word removed. The remaining portion is the main content of the voice interaction between the user and the terminal device, and performing voice recognition on it improves both the efficiency and the accuracy of voice recognition.
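A minimal sketch of this extraction, assuming a toy keyword spotter over made-up frame labels (a real system would score audio frames with a keyword-spotting model; the wake word "hey assistant" is an invented example, not from the patent):

```python
def find_wake_word_end(frame_labels, wake_word=("hey", "assistant")):
    # Toy keyword spotter: scan the frame labels for the wake-word sequence
    # and return the index just after it, or -1 if it is absent.
    n = len(wake_word)
    for i in range(len(frame_labels) - n + 1):
        if tuple(frame_labels[i:i + n]) == wake_word:
            return i + n
    return -1

def extract_second_audio(first_multi, frame_labels):
    # Keep, in every channel, only the portion after the wake word.
    end = find_wake_word_end(frame_labels)
    if end < 0:
        return None  # no wake word detected: nothing to send for recognition
    return [channel[end:] for channel in first_multi]
```

Here one label corresponds to one sample for simplicity; in practice the detector would report a sample or frame boundary to cut at.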
After the second multi-channel audio is obtained in S103, S104 performs multi-channel mixed compression on it and then sends the compressed audio to the server for voice recognition.
In the prior art, after wake-up detection is performed on the multi-channel audio, signal enhancement is also applied to reduce it to a single channel of audio, which is then compressed for the server to recognize. However, this single-channel audio differs considerably from the original audio, so its quality suffers a large loss compared with the original, which reduces the accuracy of the server's voice recognition.
In this embodiment, no signal enhancement is performed on the second multi-channel audio, so the audio finally used for voice recognition remains multi-channel. This avoids a large difference between the recognition audio and the original audio, preserves high audio quality, simplifies the voice recognition steps, and improves recognition accuracy.
Specifically, when S104 performs multi-channel mixed compression on the second multi-channel audio, the following optional implementation may be adopted: determine the audio energy of each channel in the second multi-channel audio, and compress each channel at a compression rate corresponding to its audio energy to obtain the compressed audio.
By compressing each channel of the second multi-channel audio at a different rate according to its audio energy, this embodiment improves compression efficiency, reduces compression loss, and yields compressed audio of higher quality for the server's voice recognition.
For example, suppose the second multi-channel audio consists of two channels, one corresponding to ambient sound and the other to the user's voice. If the audio energy of the ambient sound is low, that channel may be compressed at a high compression rate; if the audio energy of the user's voice is high, that channel may be compressed at a low compression rate.
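The energy-dependent policy of this example can be sketched as follows; the energy threshold and the two target bitrates are made-up illustrative values, not taken from the patent, and the codec call itself is omitted:

```python
def channel_energy(samples):
    # Mean squared amplitude of one channel.
    return sum(s * s for s in samples) / max(len(samples), 1)

def compression_rate_for(energy, threshold=0.1):
    # Illustrative policy matching the example above: a low-energy channel
    # (e.g. ambient sound) tolerates heavy compression; a high-energy channel
    # (e.g. the user's voice) is compressed lightly. Threshold and bitrates
    # are assumptions for demonstration only.
    return 16_000 if energy < threshold else 64_000  # target bits per second

def compress_multichannel(channels):
    # Pair each channel with its chosen rate; a real implementation would
    # feed each channel to a codec at that rate before transmission.
    return [(compression_rate_for(channel_energy(ch)), ch) for ch in channels]
```

A step function over one threshold is the simplest possible mapping; a real system might choose rates from a finer energy-to-bitrate table.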
After S104 sends the compressed audio to the server, for example through the communication module of the terminal device, the server first decompresses it, then extracts audio features from each of the decompressed channels, and finally performs voice recognition on the extracted features to obtain the voice recognition result.
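The server-side sequence (decompress, extract per-channel features, recognize) can be sketched as below. The frame-energy "features" and the placeholder ASR function are stand-ins for real filterbank features and a real acoustic/language model; all names here are assumptions for illustration:

```python
def decompress(compressed):
    # Placeholder: the client sent (rate, samples) pairs; a real server
    # would run the matching codec here.
    return [samples for _rate, samples in compressed]

def extract_features(channel, frame_len=2):
    # Stand-in for real acoustic features (e.g. log filterbanks): per-frame
    # energy over non-overlapping frames of frame_len samples.
    return [sum(s * s for s in channel[i:i + frame_len])
            for i in range(0, len(channel), frame_len)]

def run_asr(features):
    # Placeholder for the server's acoustic and language models.
    return "<transcript>"

def server_recognize(compressed):
    channels = decompress(compressed)                     # 1. decompress
    features = [extract_features(ch) for ch in channels]  # 2. per-channel features
    return run_asr(features)                              # 3. recognize
```

The key point mirrored from the text is that features are extracted from each decompressed channel separately before recognition.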
After the compressed audio is sent to the server in S104, S105 is performed to receive the voice recognition result returned by the server.
It can be understood that, after the voice recognition result is received from the server in S105, this embodiment may further: query according to the received voice recognition result to obtain a query result; and convert the query result into audio and play it to the user.
That is to say, this embodiment completes the voice interaction according to the voice recognition result returned by the server. Because the server performs voice recognition on multi-channel audio of higher quality, the accuracy of the recognition result improves, and the accuracy of the voice interaction improves correspondingly.
With the method of this embodiment, the terminal device processes the audio to be recognized into multi-channel audio for the server to recognize. Because this multi-channel audio differs little from the audio to be recognized, its quality is high and recognition accuracy improves; and because no signal enhancement or similar processing of the multi-channel audio is needed, the voice recognition steps are simplified and recognition efficiency improves.
Fig. 2 is a schematic diagram according to a second embodiment of the present application. As shown in fig. 2, the speech recognition apparatus of the present embodiment includes:
the acquiring unit 201 is configured to acquire audio to be recognized;
the preprocessing unit 202 is configured to preprocess the audio to be recognized to obtain a first multi-channel audio;
the detection unit 203 is configured to perform wake-up detection on the first multi-channel audio, and to extract a second multi-channel audio from the first multi-channel audio when a wake-up word is detected;
the compression unit 204 is configured to perform multi-channel mixed compression on the second multi-channel audio and then send the compressed audio to a server for voice recognition;
the receiving unit 205 is configured to receive the voice recognition result returned by the server.
The voice recognition apparatus of this embodiment may be located in a terminal device capable of voice interaction, such as a smart phone, a personal computer, a smart speaker, a smart home appliance, or a vehicle-mounted device; that is, in this embodiment, voice recognition is implemented through interaction between the terminal device and a server.
The audio to be recognized acquired by the acquiring unit 201 in this embodiment is audio data collected by a microphone of the terminal device, and the collected audio data is multi-channel audio.
After the acquiring unit 201 acquires the audio to be recognized, the preprocessing unit 202 preprocesses it to obtain a first multi-channel audio.
Specifically, when the preprocessing unit 202 preprocesses the acquired audio to be recognized to obtain the first multi-channel audio, an optional implementation is as follows: perform at least one of noise reduction and dereverberation on the acquired audio to be recognized, for example, first noise reduction and then dereverberation; and take the processing result as the first multi-channel audio.
In this embodiment, when the preprocessing unit 202 performs noise reduction on the audio to be recognized, an existing multi-channel AEC algorithm may be used to remove echo from the audio; when dereverberation is performed, an existing multi-channel WPE algorithm may be used to remove reverberation from the audio.
That is to say, the first multi-channel audio obtained by the preprocessing unit 202 is multi-channel audio from which echo and reverberation have been removed, which enhances the quality of the multi-channel audio used for voice recognition and improves recognition accuracy.
After the preprocessing unit 202 obtains the first multi-channel audio, the detection unit 203 performs wake-up detection on it and, when a wake-up word is detected, extracts a second multi-channel audio from the first multi-channel audio; this second multi-channel audio is the audio data ultimately used by the server for voice recognition.
Specifically, when the detection unit 203 extracts the second multi-channel audio from the first multi-channel audio upon detecting a wake-up word, an optional implementation is as follows: extract the audio portion following the wake-up word from the first multi-channel audio as the second multi-channel audio.
That is to say, the second multi-channel audio obtained by the detection unit 203 is the first multi-channel audio with the portion up to and including the wake-up word removed. The remaining portion is the main content of the voice interaction between the user and the terminal device, and performing voice recognition on it improves both the efficiency and the accuracy of voice recognition.
After the detection unit 203 obtains the second multi-channel audio, the compression unit 204 performs multi-channel mixed compression on it and then sends the compressed audio to the server for voice recognition.
In this embodiment, no signal enhancement is performed on the second multi-channel audio, so the audio finally used for voice recognition remains multi-channel. This avoids a large difference between the recognition audio and the original audio, preserves high audio quality, simplifies the voice recognition steps, and improves recognition accuracy.
Specifically, when the compression unit 204 performs multi-channel mixed compression on the second multi-channel audio, the following optional implementation may be adopted: determine the audio energy of each channel in the second multi-channel audio, and compress each channel at a compression rate corresponding to its audio energy to obtain the compressed audio.
By compressing each channel of the second multi-channel audio at a different rate according to its audio energy, this embodiment improves compression efficiency, reduces compression loss, and yields compressed audio of higher quality for the server's voice recognition.
After the compression unit 204 sends the compressed audio to the server, the server first decompresses it, then extracts audio features from each of the decompressed channels, and finally performs voice recognition on the extracted features to obtain the voice recognition result.
After the compression unit 204 sends the compressed audio to the server, the receiving unit 205 receives the voice recognition result returned by the server.
It can be understood that, after receiving the voice recognition result returned by the server, the receiving unit 205 in this embodiment may further: query according to the received voice recognition result to obtain a query result; and convert the query result into audio and play it to the user.
Fig. 3 is a schematic diagram according to a third embodiment of the present application and shows the voice recognition flow of the present application: first, a microphone of the terminal device collects the audio to be recognized; the audio is then preprocessed, including noise reduction (echo cancellation) and dereverberation, to obtain the first multi-channel audio; voice wake-up detection is performed on the first multi-channel audio, which then outputs the second multi-channel audio, i.e. the multi-channel recognition audio; multi-channel mixed compression is performed on the second multi-channel audio, and the compressed audio is transmitted to the server over a network link; the server decompresses the compressed audio, completes the voice recognition, and returns the voice recognition result to the terminal device. In fig. 3, "pcm" stands for Pulse Code Modulation and "asr" stands for Automatic Speech Recognition.
According to an embodiment of the present application, an electronic device and a computer-readable storage medium are also provided.
Fig. 4 is a block diagram of an electronic device for the voice recognition method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only and are not meant to limit the implementations of the present application described and/or claimed herein.
As shown in fig. 4, the electronic device includes: one or more processors 401, a memory 402, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used with multiple memories, as desired. Likewise, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 4, one processor 401 is taken as an example.
The memory 402, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the voice recognition method in the embodiments of the present application (for example, the acquiring unit 201, the preprocessing unit 202, the detection unit 203, the compression unit 204, and the receiving unit 205 shown in fig. 2). The processor 401 executes various functional applications and data processing of the server, i.e., implements the voice recognition method in the above method embodiments, by running the non-transitory software programs, instructions, and modules stored in the memory 402.
The memory 402 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created according to the use of the electronic device, and the like. Further, the memory 402 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 402 may optionally include memory located remotely from the processor 401, which may be connected to the electronic device of the voice recognition method over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the speech recognition method may further include: an input device 403 and an output device 404. The processor 401, the memory 402, the input device 403 and the output device 404 may be connected by a bus or other means, and fig. 4 illustrates an example of a connection by a bus.
The input device 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus of the voice recognition method, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or the like. The output devices 404 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and addresses the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services.
According to the technical solution of the embodiments of the present application, the audio to be recognized is processed into multi-channel audio for the server to perform voice recognition. This ensures that the audio used by the server during voice recognition has higher quality, simplifies the voice recognition steps, and further improves the accuracy of the voice recognition.
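As a rough, non-authoritative sketch of this client-side flow (the function name, the list-of-lists audio representation, and the `wake_end` index are assumptions for illustration, not the patent's implementation):

```python
def extract_second_audio(first_audio, wake_end):
    """Keep only the part of the first multi-channel audio that follows a
    detected wake-up word. `first_audio` is a list of per-channel sample
    lists; `wake_end` is the sample index just past the wake-up word, or
    None when no wake-up word was detected. Illustrative only."""
    if wake_end is None:
        return None                        # no wake-up word: nothing is uploaded
    return [channel[wake_end:] for channel in first_audio]
```

Only this post-wake-word portion would then be compressed and sent to the server, avoiding the upload of leading silence and the wake-up word itself.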
It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (10)
1. A voice recognition method, comprising:
acquiring audio to be recognized;
preprocessing the audio to be recognized to obtain a first multi-channel audio;
performing wake-up detection on the first multi-channel audio, and extracting a second multi-channel audio from the first multi-channel audio in a case that a wake-up word is detected;
performing multi-channel mixed compression on the second multi-channel audio, and sending the compressed audio to a server for voice recognition; and
receiving a voice recognition result returned by the server.
2. The method of claim 1, wherein the preprocessing the audio to be recognized to obtain a first multi-channel audio comprises:
performing at least one of noise reduction processing and de-reverberation processing on the audio to be recognized; and
taking the processing result as the first multi-channel audio.
3. The method of claim 1, wherein the extracting a second multi-channel audio from the first multi-channel audio in a case that a wake-up word is detected comprises:
extracting the audio portion following the wake-up word from the first multi-channel audio as the second multi-channel audio.
4. The method of claim 1, wherein the performing multi-channel mixed compression on the second multi-channel audio comprises:
determining the audio energy of each channel of audio in the second multi-channel audio; and
compressing each channel of audio according to a compression ratio corresponding to its audio energy to obtain the compressed audio.
5. A voice recognition apparatus, comprising:
an acquisition unit configured to acquire audio to be recognized;
a preprocessing unit configured to preprocess the audio to be recognized to obtain a first multi-channel audio;
a detection unit configured to perform wake-up detection on the first multi-channel audio and to extract a second multi-channel audio from the first multi-channel audio in a case that a wake-up word is detected;
a compression unit configured to perform multi-channel mixed compression on the second multi-channel audio and to send the compressed audio to a server for voice recognition; and
a receiving unit configured to receive a voice recognition result returned by the server.
6. The apparatus according to claim 5, wherein the preprocessing unit, when preprocessing the audio to be recognized to obtain a first multi-channel audio, specifically performs:
performing at least one of noise reduction processing and de-reverberation processing on the audio to be recognized; and
taking the processing result as the first multi-channel audio.
7. The apparatus according to claim 5, wherein the detection unit, when extracting a second multi-channel audio from the first multi-channel audio in a case that a wake-up word is detected, specifically performs:
extracting the audio portion following the wake-up word from the first multi-channel audio as the second multi-channel audio.
8. The apparatus according to claim 5, wherein the compression unit, when performing multi-channel mixed compression on the second multi-channel audio, specifically performs:
determining the audio energy of each channel of audio in the second multi-channel audio; and
compressing each channel of audio according to a compression ratio corresponding to its audio energy to obtain the compressed audio.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.
10. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-4.
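For illustration only, the energy-dependent per-channel compression recited in claim 4 might be sketched as follows. The decimation scheme, the cutoff value, and the function name are all assumptions; a real device would apply a proper audio codec with an energy-dependent bitrate rather than sample decimation.

```python
def mixed_compress(second_audio, energy_cutoff=0.01):
    """Hedged sketch of claim 4: compress each channel of the second
    multi-channel audio with a ratio keyed to its audio energy. Louder
    channels keep every 2nd sample; quieter ones keep every 4th. The
    `second_audio` argument is a list of per-channel sample lists."""
    compressed = []
    for channel in second_audio:
        # Mean squared amplitude as the channel's audio energy.
        energy = sum(s * s for s in channel) / max(len(channel), 1)
        # Pick the "compression ratio" from the energy: gentler for loud channels.
        step = 2 if energy > energy_cutoff else 4
        compressed.append(channel[::step])
    return compressed
```

Usage: a two-channel input where one channel is loud and one is near-silent would come out with the loud channel at half its original length and the quiet channel at a quarter.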
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011223168.0A CN112382281B (en) | 2020-11-05 | 2020-11-05 | Voice recognition method, device, electronic equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112382281A true CN112382281A (en) | 2021-02-19 |
CN112382281B CN112382281B (en) | 2023-11-21 |
Family
ID=74579404
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011223168.0A Active CN112382281B (en) | 2020-11-05 | 2020-11-05 | Voice recognition method, device, electronic equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112382281B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113393838A (en) * | 2021-06-30 | 2021-09-14 | 北京探境科技有限公司 | Voice processing method and device, computer readable storage medium and computer equipment |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020064284A1 (en) * | 2000-11-24 | 2002-05-30 | Yoshiaki Takagi | Sound signal encoding apparatus and method |
US20080199021A1 (en) * | 2005-07-12 | 2008-08-21 | Samsung Electronics Co., Ltd. | Method and Apparatus For Providing Ip Datacasting Service in Digital Audio Broadcasting System |
CN103295571A (en) * | 2012-02-29 | 2013-09-11 | 辉达公司 | Control using time and/or spectrally compacted audio commands |
CN106653031A (en) * | 2016-10-17 | 2017-05-10 | 海信集团有限公司 | Voice wake-up method and voice interaction device |
CN107223280A (en) * | 2017-03-03 | 2017-09-29 | 深圳前海达闼云端智能科技有限公司 | robot awakening method, device and robot |
CN108986822A (en) * | 2018-08-31 | 2018-12-11 | 出门问问信息科技有限公司 | Audio recognition method, device, electronic equipment and non-transient computer storage medium |
CN109859757A (en) * | 2019-03-19 | 2019-06-07 | 百度在线网络技术(北京)有限公司 | A kind of speech ciphering equipment control method, device and terminal |
CN110060685A (en) * | 2019-04-15 | 2019-07-26 | 百度在线网络技术(北京)有限公司 | Voice awakening method and device |
CN110189753A (en) * | 2019-05-28 | 2019-08-30 | 北京百度网讯科技有限公司 | Baffle Box of Bluetooth and its control method, system and storage medium |
CN110427097A (en) * | 2019-06-18 | 2019-11-08 | 华为技术有限公司 | Voice data processing method, apparatus and system |
CN111128201A (en) * | 2019-12-31 | 2020-05-08 | 百度在线网络技术(北京)有限公司 | Interaction method, device, system, electronic equipment and storage medium |
CN111755002A (en) * | 2020-06-19 | 2020-10-09 | 北京百度网讯科技有限公司 | Speech recognition device, electronic apparatus, and speech recognition method |
Non-Patent Citations (2)
Title |
---|
周旺; 姜?|: "基于TDM的多通道声卡设计" [Design of a TDM-based multi-channel sound card], 应用科技 [Applied Science and Technology], no. 10 *
杨松平: "卫星接收技术在现场直播传输中的应用" [Application of satellite receiving technology in live broadcast transmission], 电视技术 [Video Engineering], no. 05 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111192591B (en) | Awakening method and device of intelligent equipment, intelligent sound box and storage medium | |
CN111402868B (en) | Speech recognition method, device, electronic equipment and computer readable storage medium | |
US20210097994A1 (en) | Data processing method and apparatus for intelligent device, and storage medium | |
CN111755002B (en) | Speech recognition device, electronic apparatus, and speech recognition method | |
CN112434139A (en) | Information interaction method and device, electronic equipment and storage medium | |
CN112908318A (en) | Awakening method and device of intelligent sound box, intelligent sound box and storage medium | |
CN112634890B (en) | Method, device, equipment and storage medium for waking up playing equipment | |
CN111128201A (en) | Interaction method, device, system, electronic equipment and storage medium | |
CN112382294B (en) | Speech recognition method, device, electronic equipment and storage medium | |
CN112382281B (en) | Voice recognition method, device, electronic equipment and readable storage medium | |
CN112071323B (en) | Method and device for acquiring false wake-up sample data and electronic equipment | |
CN110600039B (en) | Method and device for determining speaker attribute, electronic equipment and readable storage medium | |
CN112382292A (en) | Voice-based control method and device | |
CN111369999A (en) | Signal processing method and device and electronic equipment | |
CN110633357A (en) | Voice interaction method, device, equipment and medium | |
CN114333017A (en) | Dynamic pickup method and device, electronic equipment and storage medium | |
CN111724805A (en) | Method and apparatus for processing information | |
CN112037781B (en) | Voice data acquisition method and device | |
CN115312042A (en) | Method, apparatus, device and storage medium for processing audio | |
CN112329907A (en) | Dialogue processing method and device, electronic equipment and storage medium | |
CN114221940B (en) | Audio data processing method, system, device, equipment and storage medium | |
CN112164396A (en) | Voice control method and device, electronic equipment and storage medium | |
CN111986682A (en) | Voice interaction method, device, equipment and storage medium | |
CN114071318B (en) | Voice processing method, terminal equipment and vehicle | |
CN113129904B (en) | Voiceprint determination method, apparatus, system, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||