CN112382281A - Voice recognition method and device, electronic equipment and readable storage medium - Google Patents

Voice recognition method and device, electronic equipment and readable storage medium

Info

Publication number
CN112382281A
CN112382281A
Authority
CN
China
Prior art keywords
audio
channel
channel audio
voice recognition
server
Prior art date
Legal status
Granted
Application number
CN202011223168.0A
Other languages
Chinese (zh)
Other versions
CN112382281B (en)
Inventor
杨松
纪盛
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011223168.0A
Publication of CN112382281A
Application granted
Publication of CN112382281B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Abstract

The application discloses a speech recognition method and apparatus, an electronic device, and a readable storage medium, relating to the technical field of speech processing. The scheme adopted for speech recognition is as follows: acquire audio to be recognized; preprocess the audio to be recognized to obtain first multi-channel audio; perform wake-up detection on the first multi-channel audio and, when a wake-up word is detected, extract second multi-channel audio from the first multi-channel audio; perform multi-channel mixed compression on the second multi-channel audio and send the compressed audio to a server for speech recognition; and receive the speech recognition result returned by the server. The method and apparatus simplify the speech recognition procedure and improve recognition accuracy.

Description

Voice recognition method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular to a speech recognition method and apparatus, an electronic device, and a readable storage medium in the field of speech processing technologies.
Background
With the popularity of voice interaction, applications and products built around it are constantly emerging. Since voice interaction is implemented on the basis of speech recognition, the accuracy of speech recognition indirectly affects the accuracy of voice interaction.
In the prior art, the audio data used for speech recognition is usually processed into single-channel audio. Because this single-channel audio differs greatly from the original audio, it is severely distorted and of low quality, which reduces the accuracy of speech recognition.
Disclosure of Invention
To solve the above technical problem, the present application provides a speech recognition method, including: acquiring audio to be recognized; preprocessing the audio to be recognized to obtain first multi-channel audio; performing wake-up detection on the first multi-channel audio and, when a wake-up word is detected, extracting second multi-channel audio from the first multi-channel audio; performing multi-channel mixed compression on the second multi-channel audio and sending the compressed audio to a server for speech recognition; and receiving the speech recognition result returned by the server.
The present application further provides a speech recognition apparatus, including: an acquiring unit configured to acquire audio to be recognized; a preprocessing unit configured to preprocess the audio to be recognized to obtain first multi-channel audio; a detection unit configured to perform wake-up detection on the first multi-channel audio and to extract second multi-channel audio from the first multi-channel audio when a wake-up word is detected; a compression unit configured to perform multi-channel mixed compression on the second multi-channel audio and then send the compressed audio to a server for speech recognition; and a receiving unit configured to receive the speech recognition result returned by the server.
An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above method.
A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the above method.
An embodiment of the above application has the following advantages or benefits: it simplifies the speech recognition procedure and improves recognition accuracy. By performing speech recognition on multi-channel audio, it solves the prior-art problem that using severely distorted, low-quality single-channel audio lowers recognition accuracy, thereby simplifying the speech recognition procedure and improving the accuracy of speech recognition.
Other effects of the above alternatives will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present application;
FIG. 2 is a schematic diagram according to a second embodiment of the present application;
FIG. 3 is a schematic illustration according to a third embodiment of the present application;
fig. 4 is a block diagram of an electronic device for implementing a speech recognition method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present application. As shown in fig. 1, the speech recognition method of this embodiment may specifically include the following steps:
S101, acquiring audio to be recognized;
S102, preprocessing the audio to be recognized to obtain first multi-channel audio;
S103, performing wake-up detection on the first multi-channel audio, and extracting second multi-channel audio from the first multi-channel audio when a wake-up word is detected;
S104, performing multi-channel mixed compression on the second multi-channel audio, and sending the compressed audio to a server for speech recognition;
S105, receiving the speech recognition result returned by the server.
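The five steps above can be sketched end to end. The following is a minimal, hypothetical illustration: every helper (the wake-word detector, the preprocessing and compression stand-ins, and the fake server callback) is a trivial placeholder, not the patent's actual algorithm.

```python
# Hypothetical sketch of client-side steps S101-S105; all helpers are
# illustrative stand-ins, not the algorithms the patent actually uses.

def preprocess_channel(samples):
    # Stand-in for denoising/dereverberation (S102): here, just a copy.
    return list(samples)

def detect_wake_word(channels, wake_marker=1.0):
    # Stand-in detector (S103): index just past the first sample equal to
    # wake_marker on channel 0, or None if no wake word is found.
    for i, s in enumerate(channels[0]):
        if s == wake_marker:
            return i + 1
    return None

def compress_channels(channels):
    # Stand-in for multi-channel mixed compression (S104): identity.
    return channels

def recognize(raw_channels, send_to_server):
    first = [preprocess_channel(ch) for ch in raw_channels]   # S102
    idx = detect_wake_word(first)                             # S103
    if idx is None:
        return None                                           # no wake word
    second = [ch[idx:] for ch in first]                       # S103 (extract)
    compressed = compress_channels(second)                    # S104
    return send_to_server(compressed)                         # S104/S105

# Usage: a fake "server" that just counts the samples it received.
result = recognize([[0.0, 1.0, 0.5, 0.25], [0.1, 0.2, 0.3, 0.4]],
                   lambda audio: sum(len(ch) for ch in audio))
print(result)  # → 4  (two channels x two post-wake-word samples)
```

Note the slicing in S103 is applied to every channel, so the extracted audio stays multi-channel throughout.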
The speech recognition method of this embodiment processes the audio to be recognized into multi-channel audio for the server to use in speech recognition, which ensures that the audio used by the server is of high quality, simplifies the speech recognition procedure, and further improves recognition accuracy.
The method of this embodiment may be executed by a terminal device capable of voice interaction, such as a smart phone, a personal computer, a smart speaker, a smart home appliance, or a vehicle-mounted device; that is, in this embodiment, speech recognition is implemented through interaction between the terminal device and the server.
In this embodiment, the audio to be recognized acquired in S101 is audio data collected by a microphone of the terminal device, and this audio data is multi-channel audio.
After the audio to be recognized is acquired in S101, this embodiment performs S102 to preprocess it, thereby obtaining the first multi-channel audio.
Specifically, when preprocessing the acquired audio to be recognized in S102, an optional implementation is: perform at least one of noise reduction and dereverberation on the audio to be recognized, for example noise reduction followed by dereverberation, and take the processing result as the first multi-channel audio.
When performing noise reduction in S102, an existing multi-channel AEC (acoustic echo cancellation) algorithm may be used to remove echo from the audio; when performing dereverberation, an existing multi-channel WPE (weighted prediction error) algorithm may be used to remove reverberation.
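AEC and WPE are substantial signal-processing algorithms and are not reproduced here. Purely as an illustration of per-channel preprocessing that keeps the multi-channel layout intact, a toy noise gate (an assumed stand-in, not the patent's method) might look like:

```python
# Illustrative stand-in for S102 preprocessing: a per-channel noise gate
# that zeroes samples below a threshold. Real systems would apply AEC and
# WPE here; the point is only that each channel is processed separately
# and the output remains multi-channel.

def noise_gate(samples, threshold=0.05):
    return [s if abs(s) >= threshold else 0.0 for s in samples]

raw = [[0.01, 0.5, -0.02, -0.7],   # channel 1
       [0.03, 0.4, 0.01, -0.6]]    # channel 2
first_multichannel = [noise_gate(ch) for ch in raw]
print(first_multichannel)  # → [[0.0, 0.5, 0.0, -0.7], [0.0, 0.4, 0.0, -0.6]]
```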
In other words, the first multi-channel audio obtained in S102 is multi-channel audio from which echo and reverberation have been removed, which enhances the quality of the audio used for speech recognition and thereby improves recognition accuracy.
After obtaining the first multi-channel audio in S102, this embodiment performs S103 to run wake-up detection on it and, when a wake-up word is detected, extract the second multi-channel audio from the first multi-channel audio; this second multi-channel audio is the audio data ultimately used by the server for speech recognition.
Specifically, when extracting the second multi-channel audio in S103 after a wake-up word has been detected, an optional implementation is: extract the portion of audio after the wake-up word from the first multi-channel audio as the second multi-channel audio.
That is, the second multi-channel audio obtained in S103 is the first multi-channel audio with the portion up to and including the wake-up word removed; the remaining portion carries the actual voice interaction between the user and the terminal device, so performing speech recognition on the second multi-channel audio improves both the efficiency and the accuracy of recognition.
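A sketch of this extraction step, assuming the wake-up detector reports the time at which the wake word ends and assuming a 16 kHz sampling rate (both are illustrative choices, not stated in the patent):

```python
# Hypothetical sketch: keep only the audio after the wake word, on every
# channel, so the result stays multi-channel. The 16 kHz rate and the
# time-based cut point are assumptions for illustration.

SAMPLE_RATE = 16000  # assumed sampling rate

def audio_after_wake_word(channels, wake_end_seconds):
    cut = int(wake_end_seconds * SAMPLE_RATE)
    return [ch[cut:] for ch in channels]

first = [[0.0] * 16000 + [0.5] * 8000,   # 1 s wake word, then 0.5 s speech
         [0.0] * 16000 + [0.4] * 8000]
second = audio_after_wake_word(first, 1.0)
print(len(second), len(second[0]))  # → 2 8000
```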
After obtaining the second multi-channel audio in S103, this embodiment performs S104 to apply multi-channel mixed compression to it and then send the compressed audio to the server for speech recognition.
In the prior art, after wake-up detection is performed on the multi-channel audio, signal enhancement is additionally applied to produce a single channel of audio for speech recognition, and that single channel is then compressed and sent to the server. However, the single-channel audio obtained this way differs greatly from the original audio, so its quality is substantially degraded, which reduces the accuracy of the server's speech recognition.
In this embodiment, no signal enhancement is applied to the second multi-channel audio, so the audio ultimately used for speech recognition remains multi-channel. This avoids the large difference between the recognition audio and the original audio, preserves high audio quality, simplifies the speech recognition procedure, and improves recognition accuracy.
Specifically, when performing multi-channel mixed compression on the second multi-channel audio in S104, this embodiment may adopt the following optional implementation: determine the audio energy of each channel in the second multi-channel audio, then compress each channel at a compression rate corresponding to its audio energy to obtain the compressed audio.
By compressing each channel at a different rate according to its audio energy, this embodiment improves the compression efficiency of the multi-channel audio and reduces compression loss, so the compressed audio used by the server for speech recognition retains higher quality.
For example, suppose the second multi-channel audio has two channels, one corresponding to ambient sound and the other to the user's voice. If the ambient-sound channel has low audio energy, it can be compressed at a high compression rate; if the user-voice channel has high audio energy, it can be compressed at a low compression rate.
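This two-channel example can be sketched as follows. The RMS energy measure, the 8-bit/16-bit split, and the 0.1 threshold are all illustrative assumptions; the patent only specifies that the compression rate corresponds to the channel's audio energy:

```python
import math

# Hypothetical sketch of energy-dependent compression: quiet channels get
# coarser quantization (higher compression). The bit depths and threshold
# are illustrative choices, not the patent's parameters.

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def choose_bits(energy, threshold=0.1):
    return 8 if energy < threshold else 16   # low energy -> compress harder

def quantize(samples, bits):
    scale = 2 ** (bits - 1) - 1
    return [round(s * scale) for s in samples]

ambient = [0.01, -0.02, 0.015, -0.01]   # quiet environmental channel
speech  = [0.4, -0.5, 0.45, -0.3]       # louder user-speech channel

compressed = {name: quantize(ch, choose_bits(rms(ch)))
              for name, ch in [("ambient", ambient), ("speech", speech)]}
print(choose_bits(rms(ambient)), choose_bits(rms(speech)))  # → 8 16
```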
After the compressed audio is sent to the server in S104, for example via the communication module of the terminal device, the server first decompresses it, then extracts audio features from each of the decompressed channels, and finally performs speech recognition on the extracted features to obtain the speech recognition result.
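A hypothetical sketch of the server side: dequantize each channel, then compute a simple per-frame feature. Log frame energy is used here only as a stand-in for the filterbank or MFCC features a real ASR front end would extract; the frame length and bit depth are assumptions.

```python
import math

# Illustrative server-side pipeline: decompress (dequantize) each channel,
# then extract per-frame features. Log frame energy stands in for real
# ASR features; a recognizer would consume these feature vectors next.

def dequantize(samples, bits=16):
    scale = 2 ** (bits - 1) - 1
    return [s / scale for s in samples]

def frame_log_energy(samples, frame_len=4):
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        feats.append(math.log(energy + 1e-10))  # floor avoids log(0)
    return feats

channels = [[100, -200, 150, -50, 300, -250, 100, -100]]  # one decoded channel
feats = [frame_log_energy(dequantize(ch)) for ch in channels]
print(len(feats), len(feats[0]))  # → 1 2
```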
After sending the compressed audio to the server in S104, this embodiment performs S105 to receive the speech recognition result returned by the server.
It is understood that, after receiving the speech recognition result in S105, this embodiment may further: query with the received speech recognition result to obtain a query result, convert the query result into audio, and play that audio to the user.
That is, this embodiment completes the voice interaction according to the speech recognition result returned by the server. Because the server performs speech recognition on multi-channel audio of higher quality, both the accuracy of the recognition result and, correspondingly, the accuracy of the voice interaction are improved.
In the method of this embodiment, the terminal device processes the audio to be recognized into multi-channel audio for the server to use in speech recognition. Because this multi-channel audio differs little from the audio to be recognized, its quality is high and recognition accuracy improves; and because no signal enhancement or similar processing is needed, the speech recognition procedure is simplified and recognition efficiency improves.
Fig. 2 is a schematic diagram according to a second embodiment of the present application. As shown in fig. 2, the speech recognition apparatus of the present embodiment includes:
the acquiring unit 201, configured to acquire audio to be recognized;
the preprocessing unit 202, configured to preprocess the audio to be recognized to obtain first multi-channel audio;
the detection unit 203, configured to perform wake-up detection on the first multi-channel audio and to extract second multi-channel audio from the first multi-channel audio when a wake-up word is detected;
the compression unit 204, configured to perform multi-channel mixed compression on the second multi-channel audio and then send the compressed audio to a server for speech recognition;
the receiving unit 205, configured to receive the speech recognition result returned by the server.
The speech recognition apparatus of this embodiment may reside in a terminal device capable of voice interaction, such as a smart phone, a personal computer, a smart speaker, a smart home appliance, or a vehicle-mounted device; that is, speech recognition is implemented through interaction between the terminal device and the server.
The audio to be recognized acquired by the acquiring unit 201 is audio data collected by a microphone of the terminal device, and this audio data is multi-channel audio.
After the acquiring unit 201 acquires the audio to be recognized, the preprocessing unit 202 preprocesses it to obtain the first multi-channel audio.
Specifically, when the preprocessing unit 202 preprocesses the acquired audio to be recognized, an optional implementation is: perform at least one of noise reduction and dereverberation on the audio to be recognized, for example noise reduction followed by dereverberation, and take the processing result as the first multi-channel audio.
When the preprocessing unit 202 performs noise reduction, an existing multi-channel AEC (acoustic echo cancellation) algorithm may be used to remove echo from the audio; when it performs dereverberation, an existing multi-channel WPE (weighted prediction error) algorithm may be used to remove reverberation.
In other words, the first multi-channel audio obtained by the preprocessing unit 202 is multi-channel audio from which echo and reverberation have been removed, which enhances the quality of the audio used for speech recognition and thereby improves recognition accuracy.
After the preprocessing unit 202 obtains the first multi-channel audio, the detection unit 203 performs wake-up detection on it and, when a wake-up word is detected, extracts the second multi-channel audio from the first multi-channel audio; this second multi-channel audio is the audio data ultimately used by the server for speech recognition.
Specifically, when the detection unit 203 extracts the second multi-channel audio after a wake-up word has been detected, an optional implementation is: extract the portion of audio after the wake-up word from the first multi-channel audio as the second multi-channel audio.
That is, the second multi-channel audio obtained by the detection unit 203 is the first multi-channel audio with the portion up to and including the wake-up word removed; the remaining portion carries the actual voice interaction between the user and the terminal device, so performing speech recognition on the second multi-channel audio improves both the efficiency and the accuracy of recognition.
After the detection unit 203 obtains the second multi-channel audio, the compression unit 204 performs multi-channel mixed compression on it and then sends the compressed audio to the server for speech recognition.
In this embodiment, no signal enhancement is applied to the second multi-channel audio, so the audio ultimately used for speech recognition remains multi-channel. This avoids the large difference between the recognition audio and the original audio, preserves high audio quality, simplifies the speech recognition procedure, and improves recognition accuracy.
Specifically, when the compression unit 204 performs multi-channel mixed compression on the second multi-channel audio, it may adopt the following optional implementation: determine the audio energy of each channel in the second multi-channel audio, then compress each channel at a compression rate corresponding to its audio energy to obtain the compressed audio.
By compressing each channel at a different rate according to its audio energy, this embodiment improves the compression efficiency of the multi-channel audio and reduces compression loss, so the compressed audio used by the server for speech recognition retains higher quality.
After the compression unit 204 sends the compressed audio to the server, the server first decompresses it, then extracts audio features from each of the decompressed channels, and finally performs speech recognition on the extracted features to obtain the speech recognition result.
After the compression unit 204 sends the compressed audio to the server, the receiving unit 205 receives the speech recognition result returned by the server.
It is understood that, after receiving the speech recognition result, the receiving unit 205 may further: query with the received speech recognition result to obtain a query result, convert the query result into audio, and play that audio to the user.
Fig. 3 is a schematic diagram according to a third embodiment of the present application. As shown in fig. 3, the speech recognition flow of the present application is as follows: first, a microphone of the terminal device collects the audio to be recognized; the audio is then preprocessed, including noise reduction and dereverberation, to obtain the first multi-channel audio; wake-up detection is performed on the first multi-channel audio, and the second multi-channel audio, i.e., the multi-channel recognition audio, is output; multi-channel mixed compression is then applied to the second multi-channel audio, and the compressed audio is transmitted to the server over a network link; the server decompresses the compressed audio, completes speech recognition, and returns the speech recognition result to the terminal device. In fig. 3, PCM stands for Pulse Code Modulation and ASR for Automatic Speech Recognition.
According to an embodiment of the present application, an electronic device and a computer-readable storage medium are also provided.
Fig. 4 is a block diagram of an electronic device for the speech recognition method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant as examples only and are not meant to limit the implementations of the present application described and/or claimed herein.
As shown in fig. 4, the electronic device includes: one or more processors 401, a memory 402, and interfaces for connecting the various components, including high-speed and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other ways as required. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information for a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, if desired. Likewise, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 4, one processor 401 is taken as an example.
Memory 402 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the speech recognition methods provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the speech recognition method provided by the present application.
The memory 402, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the speech recognition method in the embodiments of the present application (for example, the acquiring unit 201, the preprocessing unit 202, the detection unit 203, the compression unit 204, and the receiving unit 205 shown in fig. 2). By running the non-transitory software programs, instructions, and modules stored in the memory 402, the processor 401 executes the various functional applications and data processing of the server, i.e., implements the speech recognition method in the above method embodiments.
The memory 402 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the electronic device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 402 may optionally include memory located remotely from the processor 401, which may be connected to the speech recognition method electronics over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the speech recognition method may further include: an input device 403 and an output device 404. The processor 401, the memory 402, the input device 403 and the output device 404 may be connected by a bus or other means, and fig. 4 illustrates an example of a connection by a bus.
The input device 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus of the voice recognition method, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or the like. The output devices 404 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services.
According to the technical solutions of the embodiments of the present application, the audio to be recognized is processed into multi-channel audio for the server to perform voice recognition. This ensures that the audio used by the server during voice recognition is of higher quality, simplifies the voice recognition steps, and further improves the accuracy of voice recognition.
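The flow summarized above (preprocess, wake-up detection, extraction of the audio after the wake-up word, compression, upload) can be sketched in outline. The snippet below is a minimal illustration only, assuming 16 kHz audio in a channels-by-samples NumPy array; the helper names (`preprocess`, `detect_wake_word`, `prepare_for_server`) are hypothetical stubs, not an API disclosed by the application.

```python
from typing import Optional

import numpy as np

# Hypothetical helper names; the application does not disclose an API.
def preprocess(raw: np.ndarray) -> np.ndarray:
    """Stand-in for noise reduction / de-reverberation, producing the
    'first multi-channel audio' (shape: channels x samples)."""
    return raw - raw.mean(axis=1, keepdims=True)  # toy DC-offset removal

def detect_wake_word(audio: np.ndarray) -> Optional[int]:
    """Return the sample index just after the wake-up word, or None if
    no wake-up word is present (stubbed to a fixed offset here)."""
    return 1600  # pretend the wake-up word ends 0.1 s in at 16 kHz

def prepare_for_server(raw: np.ndarray) -> Optional[np.ndarray]:
    first = preprocess(raw)           # step 1: first multi-channel audio
    offset = detect_wake_word(first)  # step 2: wake-up detection
    if offset is None:
        return None                   # no wake-up word: nothing is sent
    second = first[:, offset:]        # step 3: second multi-channel audio
    # step 4 (not shown): multi-channel mixed compression, then upload
    return second
```

Doing the wake-up detection on the device and sending only the post-wake-word audio keeps the server-side recognizer from ever seeing silence or the wake-up word itself, which is what lets the server skip those steps.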
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A voice recognition method, comprising:
acquiring audio to be recognized;
preprocessing the audio to be recognized to obtain a first multi-channel audio;
performing wake-up detection on the first multi-channel audio, and extracting a second multi-channel audio from the first multi-channel audio in the case where a wake-up word is detected;
performing multi-channel mixed compression on the second multi-channel audio, and sending the compressed audio to a server for voice recognition; and
receiving a voice recognition result returned by the server.
2. The method of claim 1, wherein the preprocessing the audio to be recognized to obtain a first multi-channel audio comprises:
performing at least one of noise reduction processing and de-reverberation processing on the audio to be recognized; and
taking the processing result as the first multi-channel audio.
3. The method of claim 1, wherein the extracting a second multi-channel audio from the first multi-channel audio in the case where a wake-up word is detected comprises:
extracting, from the first multi-channel audio, the audio portion following the wake-up word as the second multi-channel audio.
4. The method of claim 1, wherein the multi-channel mixed compression of the second multi-channel audio comprises:
determining the audio energy of each channel of audio in the second multi-channel audio; and
compressing each channel of audio according to the compression ratio corresponding to its audio energy, to obtain the compressed audio.
5. A voice recognition apparatus, comprising:
an acquisition unit configured to acquire audio to be recognized;
a preprocessing unit configured to preprocess the audio to be recognized to obtain a first multi-channel audio;
a detection unit configured to perform wake-up detection on the first multi-channel audio and to extract a second multi-channel audio from the first multi-channel audio in the case where a wake-up word is detected;
a compression unit configured to perform multi-channel mixed compression on the second multi-channel audio and to send the compressed audio to a server for voice recognition; and
a receiving unit configured to receive a voice recognition result returned by the server.
6. The apparatus according to claim 5, wherein the preprocessing unit, when preprocessing the audio to be recognized to obtain a first multi-channel audio, specifically performs:
performing at least one of noise reduction processing and de-reverberation processing on the audio to be recognized; and
taking the processing result as the first multi-channel audio.
7. The apparatus according to claim 5, wherein the detection unit, when extracting a second multi-channel audio from the first multi-channel audio in the case where a wake-up word is detected, specifically performs:
extracting, from the first multi-channel audio, the audio portion following the wake-up word as the second multi-channel audio.
8. The apparatus according to claim 5, wherein the compression unit, when performing the multi-channel mixed compression on the second multi-channel audio, specifically performs:
determining the audio energy of each channel of audio in the second multi-channel audio; and
compressing each channel of audio according to the compression ratio corresponding to its audio energy, to obtain the compressed audio.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.
10. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-4.
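Claim 4's energy-dependent compression can be illustrated with a toy sketch. The mapping below from channel energy to compression ratio (the most energetic channel keeps full fidelity, quieter channels are compressed up to 4:1) and the use of decimation as a stand-in for a real codec are assumptions for illustration only; the claims do not fix a particular mapping or codec, and the function names are hypothetical.

```python
import numpy as np

def channel_energies(audio: np.ndarray) -> np.ndarray:
    """Mean squared amplitude per channel; audio shape: (channels, samples)."""
    return np.mean(audio.astype(np.float64) ** 2, axis=1)

def compression_ratio(energy: float, max_energy: float) -> float:
    """Hypothetical mapping: the most energetic channel gets ratio 1.0
    (no extra compression); quieter channels go up to 4:1."""
    if max_energy == 0.0:
        return 4.0
    return 1.0 + 3.0 * (1.0 - energy / max_energy)

def compress_channels(audio: np.ndarray) -> list:
    """Compress each channel at the ratio implied by its energy.
    Decimation stands in for a real codec in this sketch."""
    energies = channel_energies(audio)
    max_energy = float(energies.max())
    compressed = []
    for channel, energy in zip(audio, energies):
        step = max(1, round(compression_ratio(float(energy), max_energy)))
        compressed.append(channel[::step])  # keep every step-th sample
    return compressed
```

For example, mixing a loud channel with a quiet one leaves the loud channel untouched (ratio 1.0) while the quiet channel is decimated roughly 4:1, so the payload sent to the server stays small without degrading the most informative channel.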
CN202011223168.0A 2020-11-05 2020-11-05 Voice recognition method, device, electronic equipment and readable storage medium Active CN112382281B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011223168.0A CN112382281B (en) 2020-11-05 2020-11-05 Voice recognition method, device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011223168.0A CN112382281B (en) 2020-11-05 2020-11-05 Voice recognition method, device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112382281A true CN112382281A (en) 2021-02-19
CN112382281B CN112382281B (en) 2023-11-21

Family

ID=74579404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011223168.0A Active CN112382281B (en) 2020-11-05 2020-11-05 Voice recognition method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112382281B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113393838A (en) * 2021-06-30 2021-09-14 北京探境科技有限公司 Voice processing method and device, computer readable storage medium and computer equipment

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020064284A1 (en) * 2000-11-24 2002-05-30 Yoshiaki Takagi Sound signal encoding apparatus and method
US20080199021A1 (en) * 2005-07-12 2008-08-21 Samsung Electronics Co., Ltd. Method and Apparatus For Providing Ip Datacasting Service in Digital Audio Broadcasting System
CN103295571A (en) * 2012-02-29 2013-09-11 Nvidia Corp. Control using time and/or spectrally compacted audio commands
CN106653031A (en) * 2016-10-17 2017-05-10 Hisense Group Co., Ltd. Voice wake-up method and voice interaction device
CN107223280A (en) * 2017-03-03 2017-09-29 Shenzhen Qianhai CloudMinds Cloud Intelligent Technology Co., Ltd. Robot awakening method, device and robot
CN108986822A (en) * 2018-08-31 2018-12-11 Mobvoi Information Technology Co., Ltd. Voice recognition method, device, electronic equipment and non-transitory computer storage medium
CN109859757A (en) * 2019-03-19 2019-06-07 Baidu Online Network Technology (Beijing) Co., Ltd. Voice device control method, apparatus and terminal
CN110060685A (en) * 2019-04-15 2019-07-26 Baidu Online Network Technology (Beijing) Co., Ltd. Voice wake-up method and device
CN110189753A (en) * 2019-05-28 2019-08-30 Beijing Baidu Netcom Science and Technology Co., Ltd. Bluetooth speaker and control method, system and storage medium thereof
CN110427097A (en) * 2019-06-18 2019-11-08 Huawei Technologies Co., Ltd. Voice data processing method, apparatus and system
CN111128201A (en) * 2019-12-31 2020-05-08 Baidu Online Network Technology (Beijing) Co., Ltd. Interaction method, device, system, electronic equipment and storage medium
CN111755002A (en) * 2020-06-19 2020-10-09 Beijing Baidu Netcom Science and Technology Co., Ltd. Speech recognition device, electronic apparatus, and speech recognition method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
周旺; 姜?|: "Design of a multi-channel sound card based on TDM", Applied Science and Technology, no. 10 *
杨松平: "Application of satellite receiving technology in live broadcast transmission", Video Engineering, no. 05 *


Also Published As

Publication number Publication date
CN112382281B (en) 2023-11-21

Similar Documents

Publication Publication Date Title
CN111192591B (en) Awakening method and device of intelligent equipment, intelligent sound box and storage medium
CN111402868B (en) Speech recognition method, device, electronic equipment and computer readable storage medium
US20210097994A1 (en) Data processing method and apparatus for intelligent device, and storage medium
CN111755002B (en) Speech recognition device, electronic apparatus, and speech recognition method
CN112434139A (en) Information interaction method and device, electronic equipment and storage medium
CN112908318A (en) Awakening method and device of intelligent sound box, intelligent sound box and storage medium
CN112634890B (en) Method, device, equipment and storage medium for waking up playing equipment
CN111128201A (en) Interaction method, device, system, electronic equipment and storage medium
CN112382294B (en) Speech recognition method, device, electronic equipment and storage medium
CN112382281B (en) Voice recognition method, device, electronic equipment and readable storage medium
CN112071323B (en) Method and device for acquiring false wake-up sample data and electronic equipment
CN110600039B (en) Method and device for determining speaker attribute, electronic equipment and readable storage medium
CN112382292A (en) Voice-based control method and device
CN111369999A (en) Signal processing method and device and electronic equipment
CN110633357A (en) Voice interaction method, device, equipment and medium
CN114333017A (en) Dynamic pickup method and device, electronic equipment and storage medium
CN111724805A (en) Method and apparatus for processing information
CN112037781B (en) Voice data acquisition method and device
CN115312042A (en) Method, apparatus, device and storage medium for processing audio
CN112329907A (en) Dialogue processing method and device, electronic equipment and storage medium
CN114221940B (en) Audio data processing method, system, device, equipment and storage medium
CN112164396A (en) Voice control method and device, electronic equipment and storage medium
CN111986682A (en) Voice interaction method, device, equipment and storage medium
CN114071318B (en) Voice processing method, terminal equipment and vehicle
CN113129904B (en) Voiceprint determination method, apparatus, system, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant