CN117153178B - Audio signal processing method, device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN117153178B
CN117153178B (application CN202311397646.3A)
Authority
CN
China
Prior art keywords
audio
signal
target
extraction network
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311397646.3A
Other languages
Chinese (zh)
Other versions
CN117153178A (en)
Inventor
梁俊斌 (Liang Junbin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311397646.3A
Publication of CN117153178A
Application granted
Publication of CN117153178B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/007 Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L21/0232 Noise filtering characterised by the method used for estimating noise: processing in the frequency domain
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/21 Speech or voice analysis techniques in which the extracted parameters are power information
    • G10L25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks

Abstract

The embodiment of the application discloses an audio signal processing method, an audio signal processing device, an electronic device, and a storage medium; the method may include: acquiring an audio signal to be processed; extracting sub-audio signals corresponding to a plurality of audio types from the audio signal to be processed; for each sub-audio signal, performing audio enhancement processing on the sub-audio signal based on the audio type corresponding to the sub-audio signal to obtain an enhanced sub-audio signal; and performing signal reconstruction based on the enhanced sub-audio signals to obtain a target audio signal. The scheme can effectively improve the quality of the audio signal.

Description

Audio signal processing method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of audio enhancement technologies, and in particular, to an audio signal processing method, an audio signal processing device, an electronic device, and a storage medium.
Background
In recent years, with the continuous development and innovation of technology, people's requirements for audio quality have become higher and higher, and audio processing technology has been widely applied in daily life. The most common approach is to record audio through a recording device and then perform audio enhancement processing on the recording to obtain audio with better sound quality.
Current audio enhancement algorithms are mainly directed at simple sound signals. In voice call applications, for example, the input signal is mainly human voice, so the processing is relatively simple.
However, with the popularization of applications such as live broadcasting and internet entertainment, the input signal is no longer a simple noisy speech signal but includes various signals of different types (such as the sounds of different musical instruments). If the original audio enhancement algorithm is still used for enhancement processing, audio quality cannot be guaranteed and the user experience suffers.
Disclosure of Invention
The embodiment of the application provides an audio signal processing method, an audio signal processing device, an electronic device, and a storage medium, which can improve the quality of the processed audio signal.
The embodiment of the application provides an audio signal processing method, which comprises the following steps:
acquiring an audio signal to be processed;
extracting sub-audio signals corresponding to a plurality of audio types from the audio signals to be processed;
for each sub-audio signal, performing audio enhancement processing on the sub-audio signal based on the audio type corresponding to the sub-audio signal to obtain an enhanced sub-audio signal;
and performing signal reconstruction based on the enhanced sub-audio signals to obtain a target audio signal.
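As a rough illustration, the four claimed steps can be sketched as a small pipeline. Everything here is a toy stand-in: `process_audio`, the moving-average "extractors", and the gain-scaling "enhancers" are hypothetical placeholders, not the patent's trained extraction networks or enhancement strategies.

```python
import numpy as np

def process_audio(signal, extractors, enhancers):
    """Sketch of the claimed method: split the input by audio type,
    enhance each sub-signal with a type-matched strategy, then
    reconstruct by linear superposition."""
    # Step 2: extract one sub-audio signal per audio type
    subs = {atype: extract(signal) for atype, extract in extractors.items()}
    # Step 3: enhance each sub-audio signal according to its type
    enhanced = [enhancers[atype](sub) for atype, sub in subs.items()]
    # Step 4: signal reconstruction by linear superposition
    return np.sum(enhanced, axis=0)

# Toy split: "voice" = low band (moving average), "other" = the residual
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
mix = np.sin(2 * np.pi * 200 * t) + 0.1 * rng.standard_normal(8000)  # step 1
low = lambda x: np.convolve(x, np.ones(32) / 32, mode="same")  # crude low-pass
extractors = {"voice": low, "other": lambda x: x - low(x)}
enhancers = {"voice": lambda s: 1.2 * s, "other": lambda s: 0.5 * s}
target = process_audio(mix, extractors, enhancers)
```

Because the two toy extractors partition the input exactly, the superposition in step 4 recombines every enhanced part into a single signal of the original length.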
The embodiment of the application also provides an audio signal processing device, which comprises:
the signal acquisition unit is used for acquiring an audio signal to be processed;
the signal extraction unit is used for extracting sub-audio signals corresponding to a plurality of audio types from the audio signals to be processed;
the enhancement unit is used for carrying out audio enhancement processing on each sub-audio signal based on the audio type corresponding to the sub-audio signal to obtain an enhanced sub-audio signal;
and the reconstruction unit is used for performing signal reconstruction based on the enhanced sub-audio signals to obtain a target audio signal.
In some embodiments, the signal extraction unit comprises:
a network acquisition subunit, configured to acquire at least one signal extraction network, where each signal extraction network in the at least one signal extraction network is configured to extract an audio signal of a target audio type;
and the extraction subunit is used for carrying out signal extraction on the audio signal to be processed through the at least one signal extraction network to obtain sub audio signals corresponding to various audio types.
In some embodiments, the extraction subunit comprises:
the first screening module is used for screening a target signal extraction network from the at least one signal extraction network;
the extraction module is used for performing signal extraction on the audio signal to be processed through the target signal extraction network;
the determining module is used for determining, if a signal is extracted from the audio signal to be processed, the extracted signal as a sub-audio signal of the target audio type corresponding to the target signal extraction network, and taking the audio signal to be processed remaining after the extraction as a new audio signal to be processed;
a second screening module, configured to screen a new target signal extraction network from the at least one signal extraction network;
a return module, configured to return, based on the new target signal extraction network and the new audio signal to be processed, to the step of performing signal extraction on the audio signal to be processed through the target signal extraction network, until no signal can be extracted from the new audio signal to be processed by any signal extraction network in the at least one signal extraction network, and to determine the new audio signal to be processed as the sub-audio signal corresponding to other audio types.
In some embodiments, the extraction module is specifically further configured to:
if a signal is extracted from the audio signal to be processed, obtaining the duty ratio (i.e., the proportion) of the extracted signal in the audio signal to be processed;
and if the duty ratio is greater than or equal to a preset duty ratio threshold, determining the extracted signal as a sub-audio signal of the target audio type corresponding to the target signal extraction network.
In some embodiments, the first screening module is specifically configured to:
acquiring an extraction priority of each signal extraction network in the at least one signal extraction network;
screening a target signal extraction network from the at least one signal extraction network based on the extraction priority;
correspondingly, the second screening module is specifically configured to:
and screening a new target signal extraction network from the at least one signal extraction network based on the extraction priority.
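The priority-ordered iterative scheme described in the clauses above can be sketched as follows. The band-splitting callables are toy stand-ins for the patent's trained extraction networks, and the 0.05 duty-ratio threshold is an arbitrary illustrative value, not one from the patent.

```python
import numpy as np

def iterative_extraction(signal, nets, ratio_threshold=0.05):
    """Walk the extraction networks in extraction-priority order; if the
    signal a network pulls out meets the duty-ratio threshold, keep it
    as that type's sub-audio signal and use the remainder as the new
    audio signal to be processed. Whatever no network can extract
    becomes the sub-audio signal for "other" audio types.
    `nets` is a list of (priority, audio_type, callable) tuples."""
    residual = np.asarray(signal, dtype=float)
    subs = {}
    for _, atype, net in sorted(nets, key=lambda item: item[0]):
        extracted = net(residual)
        # duty ratio: energy share of the extracted signal in the input
        ratio = np.sum(extracted ** 2) / (np.sum(residual ** 2) + 1e-12)
        if ratio >= ratio_threshold:
            subs[atype] = extracted
            residual = residual - extracted  # new signal to be processed
    subs["other"] = residual
    return subs

# Toy usage: a 100 Hz "bass" tone plus a 1 kHz "treble" tone at 8 kHz
t = np.arange(4000) / 8000.0
mix = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)
lp = lambda x: np.convolve(x, np.ones(16) / 16, mode="same")
parts = iterative_extraction(mix, [(1, "bass", lp), (2, "treble", lambda x: x - lp(x))])
```

Each accepted extraction shrinks the residual, so later (lower-priority) networks operate on a progressively simpler signal, mirroring the claimed loop.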
In some embodiments, the extraction module is specifically further configured to:
extracting features of the audio signal to be processed to obtain signal features, wherein the signal features comprise a power spectrum or a mel-frequency cepstrum;
and carrying out signal extraction on the audio signal to be processed based on the target signal extraction network and the signal characteristics.
In some embodiments, the extraction module is specifically further configured to:
performing Fourier transform on the audio signal to be processed to obtain a power spectrum and a phase spectrum;
and taking the power spectrum and the phase spectrum as the signal characteristics;
the signal extraction of the audio signal to be processed based on the target signal extraction network and the signal characteristics includes:
inputting the power spectrum into the target signal extraction network, and acquiring frequency point gain output by the target signal extraction network;
generating a target power spectrum based on the frequency bin gain and the power spectrum;
and performing inverse Fourier transform on the target power spectrum based on the phase spectrum to obtain an extracted signal.
In some embodiments, the extraction subunit is further configured to:
respectively extracting the audio signals to be processed through the at least one signal extraction network;
taking each signal extraction network that has extracted a signal as a target signal extraction network, and taking the signal extracted by that target signal extraction network as a sub-audio signal of the target audio type corresponding to that network;
and filtering sub-audio signals of the target audio type corresponding to the target signal extraction network in the audio signals to be processed, and taking the filtered audio signals to be processed as audio signals corresponding to other types.
In some embodiments, the extraction subunit is further configured to:
acquiring the duty ratio of sub-audio signals of the target audio type corresponding to each target signal extraction network in the audio signals to be processed;
sequentially filtering sub-audio signals of the target audio type corresponding to the target signal extraction network in the audio signals to be processed according to the order of the duty ratio from large to small so as to obtain filtered audio signals to be processed;
and determining the filtered audio signal to be processed as other types of corresponding audio signals.
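The parallel variant above (run every network on the same input, then filter the extracted sub-signals out of the mix in descending order of duty ratio) can be sketched the same way. The nets and the 0.05 threshold are again illustrative assumptions, not the patent's models or parameters.

```python
import numpy as np

def parallel_extraction(signal, nets, ratio_threshold=0.05):
    """Run every extraction network on the same input; each network that
    extracts a meaningful signal yields that type's sub-audio signal.
    The sub-audio signals are then filtered out of the input in
    descending order of their duty ratio; what remains is the audio
    signal corresponding to the other types."""
    signal = np.asarray(signal, dtype=float)
    total = np.sum(signal ** 2) + 1e-12
    subs, ratios = {}, {}
    for atype, net in nets.items():
        extracted = net(signal)
        ratio = np.sum(extracted ** 2) / total        # duty ratio
        if ratio >= ratio_threshold:
            subs[atype], ratios[atype] = extracted, ratio
    residual = signal.copy()
    for atype in sorted(ratios, key=ratios.get, reverse=True):
        residual = residual - subs[atype]             # largest share first
    subs["other"] = residual
    return subs

# Same toy tones and band-splitting nets as a usage example
t = np.arange(4000) / 8000.0
mix = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)
lp = lambda x: np.convolve(x, np.ones(16) / 16, mode="same")
parts = parallel_extraction(mix, {"bass": lp, "treble": lambda x: x - lp(x)})
```

Unlike the iterative variant, every network sees the full mix, so the ordering only matters for how the residual "other" signal is carved out.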
In some embodiments, the plurality of audio types includes a target audio type and other audio types, the enhancement unit comprising:
a first enhancement subunit, configured to perform audio enhancement processing on the sub-audio signal according to a first enhancement policy if the audio type corresponding to the sub-audio signal is a target audio type, so as to obtain an enhanced sub-audio signal, where the first enhancement policy includes at least one of equalizer enhancement processing and harmonic enhancement processing;
and the second enhancement subunit is used for carrying out audio enhancement processing on the sub-audio signals through a second enhancement strategy if the audio types corresponding to the sub-audio signals are other audio types, so as to obtain enhanced sub-audio signals, wherein the second enhancement strategy comprises noise reduction processing.
In some embodiments, the target audio type comprises a plurality of sub-audio types, a first enhancement subunit, specifically configured to:
determining a target sub-audio type corresponding to the sub-audio signal from the plurality of sub-audio types;
determining target enhancement parameters corresponding to the first enhancement strategy according to the target sub-audio type;
and carrying out audio enhancement processing on the sub-audio signals based on the target enhancement parameters and the first enhancement strategy.
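The two-branch dispatch above can be sketched as follows. The patent does not publish concrete enhancement parameters, so the `EQ_PRESETS` band gains are invented, the 4-band FFT equalizer is only a crude example of the first enhancement strategy, and the moving-average smoother merely stands in for a real noise-reduction algorithm in the second.

```python
import numpy as np

# Hypothetical per-sub-type equalizer gains in dB for 4 coarse bands
EQ_PRESETS = {"piano": [0.0, 2.0, 3.0, 1.0], "violin": [1.0, 3.0, 2.0, 0.0]}

def enhance_sub_signal(sub, audio_type, sub_type=None):
    """Target-type sub-signals get an equalizer-style enhancement whose
    parameters are chosen by sub-audio type (first enhancement
    strategy); any other type gets a simple noise reduction (second
    enhancement strategy)."""
    sub = np.asarray(sub, dtype=float)
    if audio_type == "target":
        gains_db = EQ_PRESETS.get(sub_type, [0.0] * 4)
        spectrum = np.fft.rfft(sub)
        # crude 4-band equalizer: split the bins into 4 bands, apply gain
        for idx, g in zip(np.array_split(np.arange(len(spectrum)), 4), gains_db):
            spectrum[idx] *= 10.0 ** (g / 20.0)
        return np.fft.irfft(spectrum, n=len(sub))
    return np.convolve(sub, np.ones(5) / 5.0, mode="same")  # noise reduction

piano = np.sin(2 * np.pi * 440 * np.arange(1024) / 8000.0)
enhanced_piano = enhance_sub_signal(piano, "target", "piano")
enhanced_other = enhance_sub_signal(
    np.random.default_rng(1).standard_normal(1024), "other")
```

Selecting `gains_db` from `EQ_PRESETS` by `sub_type` mirrors the claim that the target enhancement parameters are determined by the target sub-audio type.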
In some embodiments, the reconstruction unit is specifically configured to:
and performing linear superposition processing on the enhancer audio signals to obtain target audio signals.
The embodiment of the application also provides an electronic device, which comprises a processor and a memory, wherein the memory stores a plurality of instructions; the processor loads the instructions from the memory to perform the steps in any of the audio signal processing methods provided by the embodiments of the present application.
Embodiments of the present application also provide a computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform steps in any of the audio signal processing methods provided in the embodiments of the present application.
Embodiments of the present application also provide a computer program product comprising computer programs/instructions which, when executed by a processor, implement steps in any of the audio signal processing methods provided in the embodiments of the present application.
According to the embodiment of the application, after the audio signal to be processed is obtained, sub-audio signals corresponding to various audio types are extracted from it. Then, for each sub-audio signal, audio enhancement processing is performed based on the audio type corresponding to that sub-audio signal to obtain an enhanced sub-audio signal. Finally, signal reconstruction is performed based on the enhanced sub-audio signals to obtain a target audio signal. That is, an audio signal to be processed that includes multiple audio types can be split into multiple sub-audio signals according to audio type; each sub-audio signal can then be individually enhanced with an audio enhancement mode matched to its type, so that the sub-audio signal corresponding to each audio type is effectively enhanced; finally, the enhanced sub-audio signals are reconstructed into a target audio signal, which is thereby guaranteed better audio quality.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1a is a schematic view of a scenario of an audio signal processing method according to an embodiment of the present application;
fig. 1b is a schematic diagram of a result of noise reduction of a music signal through existing deep learning according to an embodiment of the present application;
fig. 1c is a schematic flow chart of an audio signal processing method according to an embodiment of the present application;
fig. 1d is a flowchart of an implementation of steps A221 to A222 according to an embodiment of the present application;
fig. 2a is a schematic flow chart of an audio signal processing method applied to an electronic device according to an embodiment of the present application;
fig. 2b is a schematic diagram of an audio signal processing architecture according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an audio signal processing apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The embodiment of the application provides an audio signal processing method, an audio signal processing device, electronic equipment and a storage medium.
The audio signal processing device may be integrated in an electronic device, which may be a terminal, a server, or other devices. The terminal can be a mobile phone, a tablet computer, an intelligent Bluetooth device, a notebook computer, a personal computer (Personal Computer, PC) or the like; the server may be a single server or a server cluster composed of a plurality of servers.
In some embodiments, the audio signal processing apparatus may also be integrated in a plurality of electronic devices, for example, the audio signal processing apparatus may be integrated in a plurality of servers, and the audio signal processing method of the present application is implemented by the plurality of servers.
In some embodiments, the server may also be implemented in the form of a terminal.
For example, referring to fig. 1a, fig. 1a shows a schematic view of a scenario of an audio signal processing method provided in an embodiment of the present application, as shown in fig. 1a, where the scenario may include an electronic device, and the electronic device may have an audio signal receiver (such as a microphone), and the electronic device may perform the following steps:
an audio signal to be processed is acquired.
Sub-audio signals corresponding to a plurality of audio types are extracted from the audio signals to be processed.
And aiming at each sub-audio signal, carrying out audio enhancement processing on the sub-audio signal based on the audio type corresponding to the sub-audio signal to obtain an enhanced sub-audio signal.
And carrying out signal reconstruction based on the enhancer audio signal to obtain a target audio signal.
The following will describe the embodiments in detail. The numbering of the following embodiments is not intended to limit their preferred order.
Alternative embodiments of the present application may be implemented based on artificial intelligence techniques. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence research covers the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technology. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, pre-training model technology, operation/interaction systems, mechatronics, and the like. The pre-training model, also called a large model or foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
The key technologies of speech technology (Speech Technology) are automatic speech recognition, speech synthesis, and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and voice is expected to become one of the best human-computer interaction modes in the future.
Machine Learning (ML) is a multi-domain interdisciplinary, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, etc. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, confidence networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like.
With the research and advancement of artificial intelligence technology, it has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart healthcare, smart customer service, the internet of vehicles, smart transportation, audio enhancement, and so on. It is believed that with the development of technology, artificial intelligence will be applied in more fields and play an increasingly important role.
The audio enhancement technology refers to the technical field of improving audio quality and improving audio effect through a series of signal processing algorithms and technical means. Several common audio enhancement techniques are presented below:
noise suppression: noise suppression is an important technique in audio enhancement that reduces the effect of noise on audio signals and improves speech intelligibility. Common noise suppression methods include spectral subtraction, subband iterative noise suppression, frequency domain filtering, and the like.
Echo cancellation: echo cancellation techniques are used to remove echoes resulting from reflections of an audio signal in space, thereby improving the audible effect of the audio. The echo cancellation algorithm is realized by methods such as model estimation, filter design and the like.
Speech enhancement: speech enhancement techniques aim to improve the quality and intelligibility of speech signals. For example, speech may be made clearer and more natural by enhancing the time and frequency domain characteristics of the speech signal. Audio noise reduction: the audio noise reduction technology is mainly used for removing interference of noise on audio signals, so that audio is clearer. Common audio noise reduction methods include frequency domain filtering, adaptive filtering, and the like.
Volume balance: the volume balancing technology is used for adjusting the volume difference between different audio signals, so that the sound is more balanced and coordinated when the audio is played. This technique is widely used in music production and audio post-processing.
Audio restoration: audio restoration techniques are mainly used to recover old, damaged or low-quality audio signals. The quality and integrity of the audio can be improved by removing noise, recovering missing parts, and compensating for lost spectral information.
End-to-end audio enhancement model: in recent years, end-to-end models based on deep learning are widely used in audio enhancement. The models directly generate the enhanced audio signals from the input audio signals by using the deep neural network, so that more efficient and accurate audio enhancement effects are realized.
These audio enhancement techniques may be applied in different fields, such as communications, entertainment, speech recognition, audio devices, etc., to provide better audio experience and services. With the development of technology, audio enhancement technology will continue to innovate and advance, bringing better hearing experience for people.
Illustratively, the following are some common audio enhancement technology applications:
communication and speech recognition: audio enhancement techniques play a key role in the fields of communications and speech recognition. By reducing noise, removing echoes, and improving speech intelligibility, the quality of speech and speech recognition can be improved. This is critical for applications such as cell phones, teleconferencing, voice assistants, and voice recognition systems.
Entertainment industry: audio enhancement techniques are widely used in the entertainment industry. Through enhancing sound effects and music effects, entertainment experiences such as movies, television shows, games, virtual reality and the like can be improved. In addition, audio enhancement techniques may also be used for audio mixing, audio post-processing, and audio repair during music production.
Audio equipment and consumer electronics: audio enhancement techniques are widely used in a variety of audio devices and consumer electronics. For example, audio enhancement algorithms in products such as smart speakers, headphones, speakers, and audio players may provide better sound quality and sound effects.
Vehicle audio system: the audio enhancement technology in the vehicle audio system can improve the in-vehicle hearing environment and provide better music enjoyment and conversation experience. The vehicle noise reduction device can reduce vehicle noise, inhibit echo and carry out self-adaptive adjustment according to the acoustic environment in the vehicle.
Audio repair and cultural heritage protection: audio repair is the process of repairing and recovering old, damaged or low quality audio recordings. The audio enhancement technology plays an important role in cultural heritage protection, music archiving, history recording and the like, and can help to protect and save important audio data.
Virtual assistant and intelligent house: audio enhancement techniques are also being applied in virtual assistants and smart home devices. By optimizing voice recognition and voice interaction, recognition accuracy and response speed of virtual assistants (e.g., siri, alexa, google assant, etc.) can be improved. Meanwhile, the intelligent household equipment can be more easily interacted with the user in a voice mode.
An Equalizer (EQ) is used to achieve the purpose of adjusting the tone by gain or attenuation of one or more frequency bands of sound.
In the related art, audio enhancement algorithms mainly target simple sound signals. In a voice call application, for example, the input signal consists mainly of human voice and environmental noise. Because speech exhibits short-time stationarity and inter-frame correlation, while conventional stationary noise has a Gaussian distribution and long-term stationary characteristics, statistical noise reduction methods such as Wiener filters and least-squares filters can suppress stationary noise while protecting the speech signal to a certain extent. For non-stationary noise, whose characteristics are unstable, conventional statistical noise reduction is much less effective, and deep-learning-based noise reduction is needed instead: a noise reduction network model is trained on a large amount of noisy data, and the trained model acquires a good capability to identify and suppress such noise.
However, with the popularity of live streaming, internet entertainment, and similar applications, the input sound signal is no longer a simple noisy speech signal but a composite of many different signal types: piano, flute, violin, drums, and so on, mixed with human voice and environmental noise. Such composite signals, with their diverse and complex features, rarely satisfy the assumptions of conventional audio enhancement schemes. The performance of deep-learning noise reduction depends heavily on the feature coverage of the training samples, and the possible combinations of composite signals in the above applications are so variable that adequate coverage is extremely difficult to achieve. The audio enhancement schemes in the related art therefore cannot guarantee the enhancement effect on mixed audio signals.
In addition, some related technologies propose noise reduction for electronic music signals based on wavelet and Fourier transforms. These approaches analyze the feature distance between a music signal and a noise signal in different transform domains for one particular kind of music only, and suppress noise in a specific feature domain. The application limitation is obvious: enhancement works only for a particular instrument, and because the effective signal feature domains of a composite signal are distributed differently, such schemes cannot be compatible with different signals and therefore cannot solve the existing problem.
In addition, beyond noise reduction and suppression, conventional audio enhancement includes various processing algorithms for optimizing the sound quality of an audio signal. An EQ algorithm, for example, divides the input signal into different frequency bands according to the perceptual characteristics of the human ear and applies a different gain to each band to adjust its energy, making the characteristic tone of the input signal brighter and fuller. However, because different signal types have different tonal characteristics, a set of EQ parameters that improves the sound quality of one type of signal may cause problems such as blurred sound or degraded tone for other types. Conventional EQ enhancement therefore cannot achieve an ideal effect on a mixed signal.
In addition, a mixed signal raises the problem of mixing weight, that is, the component ratio between the independent signals within the mix. For example, a mixed signal may contain a human voice signal and music signals from different instruments; because of differences in recording distance and in the intensity of each sound source during recording, the energy distribution of the components in the finally recorded mixed signal is not necessarily reasonable. The human voice may be masked by the music signals, or one signal may be too strong, making the final recording sound inharmonious. Current audio enhancement schemes do not address this problem.
In summary, the audio enhancement method in the related art has the following problems:
1. Research has covered only a few types of music signals, and the results are ineffective for complex composite signals: the effective feature domains of the components of a composite signal differ, and the effective feature domain of one signal type may overlap with the ineffective feature domain of another, so these schemes cannot solve the enhancement of composite signals in actual scenarios.
2. Existing schemes are mainly applied to voice call applications and solve the noise problem of a single signal (such as a human voice signal); because a composite signal does not satisfy the assumptions of short-time stationarity and inter-frame correlation, the existing enhancement schemes fail completely in the applications considered here. Enhancement of music signals remains difficult to break through. The root cause is that there are numerous types of music signals, and a piece may be played by several different instruments simultaneously or include a vocal part, so the spectral information is rich, complex, and changeable, and it is hard to find effective features that clearly distinguish the music signal from the noise signal. As a result, the music signal is easily damaged by a noise reduction algorithm, or the noise in the music cannot be effectively suppressed. Fig. 1b shows a schematic diagram of a music signal after noise reduction through existing deep learning; comparing the music signal before and after noise reduction in fig. 1b, it can be seen that the music signal is severely damaged.
In view of the above-mentioned problems, this embodiment provides an audio signal processing method based on deep learning, which may be applied to the electronic device shown in fig. 1a. As shown in fig. 1c, the specific flow of the audio signal processing method may be as follows:
101. an audio signal to be processed is acquired.
The audio signal to be processed refers to an audio signal that needs to be subjected to audio enhancement processing, and in this embodiment, the audio signal to be processed may be a mixed audio signal, that is, the audio signal to be processed may be an audio signal formed by superimposing sub-audio signals corresponding to multiple audio types.
The audio type may be a type previously classified according to audio features, wherein the audio features may include, but are not limited to: frequency characteristics, tone characteristics, etc. In this embodiment, the audio types may be divided in advance according to the audio features corresponding to different music instruments, the audio features corresponding to the voice, and other features, where for example, the audio types obtained by the division include: keyboards (e.g., pianos, electronic organ, etc.), violins, flute, tap, voice, noise, etc.
In some embodiments, the audio signal to be processed may be collected in real-time on site, for example, the electronic device collects an audio signal on site (e.g., in a live scene) through its configured audio receiver, and takes the collected audio signal as the audio signal to be processed.
In other embodiments, the audio signal to be processed may be collected by another device and sent to the electronic device, for example, in a live broadcast scenario, the electronic device may be a server, where the server is respectively communicatively connected to a plurality of clients, and the plurality of clients may upload the audio signal collected by the plurality of clients to the server, and the server may receive the audio signal and use the audio signal as the audio signal to be processed.
102. Sub-audio signals corresponding to a plurality of audio types are extracted from the audio signals to be processed.
In some embodiments, in step 102, a specific embodiment of extracting sub-audio signals corresponding to a plurality of audio types from the audio signal to be processed may include:
a1, acquiring at least one signal extraction network, wherein each signal extraction network in the at least one signal extraction network is used for extracting an audio signal of a target audio type.
The signal extraction network may be a deep learning network trained based on an audio signal sample corresponding to a certain target audio type, and may be used to extract a sub-audio signal corresponding to the target audio type from an input audio signal and output the sub-audio signal. The target audio type may refer to an audio type that needs to be identified and extracted currently, where the target audio type may be set in a user-defined manner according to actual requirements, and is not limited herein.
The signal extraction network may be obtained from a preset network database, where the preset network database may be pre-stored with signal extraction networks corresponding to each of a plurality of audio types. Optionally, the preset network database may be stored locally on the electronic device, or may be stored in a cloud server that is communicatively connected to the electronic device, which is not limited herein.
In some embodiments, after determining the target audio type, the corresponding signal extraction network may be acquired according to the target audio type, for example, the user may input the target audio type into the electronic device, and if the input target audio type is a flute type, the electronic device may call out the signal extraction network corresponding to the flute type from the preset network database.
A2, carrying out signal extraction on the audio signal to be processed through at least one signal extraction network to obtain sub-audio signals corresponding to various audio types.
In some embodiments, in step A2, signal extraction is performed on an audio signal to be processed through at least one signal extraction network, and a specific embodiment for obtaining sub-audio signals corresponding to multiple audio types may include:
A21, screening a target signal extraction network from at least one signal extraction network.
The target signal extraction network may refer to a signal extraction network currently required to extract an audio signal.
In some embodiments, when there are a plurality of signal extraction networks, the plurality of networks may be regarded as a signal extraction network set; one signal extraction network may then be randomly selected from the set as the target signal extraction network and deleted from the set.
Illustratively, for example, the signal extraction network set includes the signal extraction network 1, the signal extraction network 2, the signal extraction network 3, and the signal extraction network 4, and when the signal extraction network 1 is selected as the target signal extraction network, the signal extraction network 1 may be deleted from the signal extraction network set.
A22, extracting the signal of the audio signal to be processed through the target signal extraction network.
In some embodiments, in step a22, a specific embodiment of performing signal extraction on the audio signal to be processed through the target signal extraction network may include:
A221, extracting features of the audio signal to be processed to obtain signal features, wherein the signal features comprise a power spectrum or a Mel cepstrum.
The power spectrum (Power Spectrum) is a spectral representation commonly used in signal processing to describe the energy distribution of a signal over frequency. It helps in understanding the frequency-domain characteristics of a signal and in performing spectral analysis. In this embodiment, the power spectrum may be extracted from the audio signal to be processed by means of a Fourier transform; for example, performing a discrete Fourier transform (Discrete Fourier Transform, DFT) or fast Fourier transform (Fast Fourier Transform, FFT) on the signal yields its spectral representation, including amplitude and phase information, from which the power spectrum is obtained.
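As a concrete illustration, the power spectrum of one frame can be computed with NumPy's FFT (a minimal sketch; the frame length, FFT size, and sample rate below are illustrative choices, not values fixed by this embodiment):

```python
import numpy as np

def power_spectrum(frame: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Power spectrum of one audio frame via the FFT."""
    spectrum = np.fft.rfft(frame, n=n_fft)   # complex spectrum (amplitude + phase)
    return (np.abs(spectrum) ** 2) / n_fft   # per-bin power

# Example: a 1 kHz sine sampled at 16 kHz concentrates its power in one bin.
sr, f = 16000, 1000
t = np.arange(512) / sr
ps = power_spectrum(np.sin(2 * np.pi * f * t))
peak_bin = int(np.argmax(ps))   # 1000 Hz maps to bin 1000 * 512 / 16000 = 32
```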
Mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC) are a feature extraction method widely used in fields such as speech recognition and music information retrieval. MFCCs are a nonlinear transformation of the spectral envelope whose main purpose is to simulate the perception mechanism of the human ear. The human ear perceives sound frequency unevenly: it resolves low frequencies finely and high frequencies coarsely. To simulate this perceptual characteristic, the MFCC may be obtained through the following steps:
1. Pre-emphasis (Pre-emphasis): to enhance the high-frequency content, the signal is pre-emphasized, typically with a filter that increases the amplitude of the high-frequency part relative to the low-frequency part.
2. Framing (Framing): the pre-emphasized signal is divided into small segments of frames (frames), each Frame typically lasting 20-40 milliseconds. To ensure continuity, there is typically some overlap between adjacent frames.
3. Windowing (Windowing): a windowing operation, such as a Hamming window (Hamming Window), is applied to each frame of the signal to reduce boundary effects. The window smooths the frame signal so that samples near the frame boundaries approach zero.
4. Fourier transform (Fourier Transform): a fast fourier transform (Fast Fourier Transform, FFT) is performed on the windowed signal for each frame to obtain the frequency spectrum for that frame.
5. Mel Filter Bank (Mel Filter Bank): in the frequency domain, the continuous frequency range is divided into a series of mel filters. The mel filter bank has higher resolution in the low frequency region and lower resolution in the high frequency region, and simulates the perception characteristics of human ears on sound frequency.
6. Mel frequency cepstral coefficient calculation (Mel Frequency Cepstral Coefficients): the energy value output by each mel filter is logarithmically calculated and then converted to cepstral coefficients by discrete cosine transform (Discrete Cosine Transform, DCT). The first few cepstrum coefficients are typically taken as the final MFCC feature vector, i.e., mel-frequency cepstrum.
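The six steps above can be sketched end to end in NumPy (a simplified illustration; all sizes such as frame length, hop, filter count, and coefficient count are illustrative defaults, and production MFCC implementations differ in details):

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, frame_len=400, hop=160,
         n_mels=26, n_ceps=13):
    """Sketch of the six MFCC steps; all sizes are illustrative defaults."""
    # 1. Pre-emphasis: boost high frequencies relative to low ones.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing: 25 ms frames with a 10 ms hop (overlapping frames).
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx]
    # 3. Windowing with a Hamming window to reduce boundary effects.
    frames = frames * np.hamming(frame_len)
    # 4. FFT -> power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 5. Mel filter bank: triangular filters evenly spaced on the mel scale.
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_energy = np.log(power @ fbank.T + 1e-10)
    # 6. DCT-II of the log filter-bank energies; keep the first n_ceps.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return mel_energy @ dct.T   # shape: (frames, n_ceps)

feats = mfcc(np.random.default_rng(0).standard_normal(16000))
```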
It will be appreciated that the signal characteristics extracted from the audio signal to be processed may be other signal characteristics than the power spectrum, mel-frequency cepstrum mentioned above, without limitation.
And A222, extracting signals from the audio signal to be processed based on the target signal extraction network and the signal characteristics.
For example, referring to fig. 1d, the specific embodiments of step a221 to step a222 may include:
and step one, carrying out Fourier transform on an audio signal to be processed to obtain a power spectrum and a phase spectrum.
Wherein the phase spectrum refers to the phase information of the audio signal to be processed in the frequency domain: it represents the phase angle of the audio signal to be processed at each frequency. The phase angle describes the starting phase of the signal at that frequency and is usually expressed in radians or degrees.
Wherein the fourier transform of the audio signal to be processed may be a Fast Fourier Transform (FFT).
And step two, taking the power spectrum and the phase spectrum as signal characteristics.
For example, as shown in fig. 1d, in the first and second steps, the audio signal to be processed may be used as an input signal, where the input signal is subjected to fast fourier transform to obtain signal characteristics: the power spectrum and the phase spectrum (also called as the phase for short) are input into the target signal extraction network.
And step three, inputting the power spectrum into the target signal extraction network, and obtaining the frequency bin gains output by the target signal extraction network.
As shown in fig. 1d, the target signal extraction network may include a fully connected (Full Connected, FC) unit, a one-dimensional convolution (Conv1d) network unit, gated recurrent units (Gated Recurrent Unit, GRU), and a Sigmoid function.
The Sigmoid function is a commonly used nonlinear activation function that maps input values to a continuous value ranging from 0 to 1.
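A minimal NumPy illustration of this output stage (the weights below are random placeholders, not the trained network of this embodiment; the point is only how the Sigmoid maps the final activations to one gain per frequency bin in (0, 1)):

```python
import numpy as np

def sigmoid(x):
    """Map real-valued network activations to values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative final stage: a random FC layer followed by Sigmoid.
# (Placeholder weights; the patent's trained FC/Conv1d/GRU stack is not reproduced.)
rng = np.random.default_rng(0)
hidden = rng.standard_normal(64)                       # e.g. GRU output features
w, b = rng.standard_normal((257, 64)), rng.standard_normal(257)
bin_gains = sigmoid(w @ hidden + b)                    # one gain per frequency bin
```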
And step four, generating a target power spectrum based on the frequency point gain and the power spectrum.
And fifthly, performing inverse Fourier transform on the target power spectrum based on the phase spectrum to obtain an extracted signal.
In the third to fifth steps, the power spectrum input to the target signal extraction network first passes through the fully connected and convolution layers to generate a high-dimensional feature; this feature then passes through the multi-stage gated recurrent units and the final fully connected unit to output a power-spectrum gain value for each frequency bin of the frequency domain, i.e. the frequency bin gains. Multiplying the frequency bin gains by the power spectrum of the signal to be processed yields the filtered signal power spectrum, i.e. the target power spectrum. Finally, an inverse Fourier transform of the target power spectrum together with the phase spectrum of the audio signal to be processed yields the filter output signal of the network, which is the signal extracted from the audio signal to be processed through the target signal extraction network.
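Steps three to five can be sketched for a single frame as follows (a simplified single-frame illustration; a real implementation processes frame by frame with windowing and overlap-add, and the bin gains come from the trained network rather than being passed in):

```python
import numpy as np

def filter_frame(frame, bin_gains):
    """Apply per-bin power-spectrum gains to one frame and reconstruct
    the time-domain signal with the original phase (steps three to five)."""
    spectrum = np.fft.rfft(frame)
    power, phase = np.abs(spectrum) ** 2, np.angle(spectrum)   # signal features
    target_power = bin_gains * power                           # step four
    magnitude = np.sqrt(target_power)
    return np.fft.irfft(magnitude * np.exp(1j * phase), n=len(frame))  # step five

frame = np.sin(2 * np.pi * 1000 * np.arange(512) / 16000)
out = filter_frame(frame, np.ones(257))   # all-pass gains reproduce the input
```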
It can be understood that the signal extraction network cannot ensure that the signal can be successfully extracted, and if the sub-audio signal of the audio type corresponding to the signal extraction network does not exist in the audio signal to be processed, the signal cannot be extracted.
A23, if the signal is extracted from the audio signal to be processed, determining the extracted signal as a sub-audio signal of a target audio type corresponding to the target signal extraction network, and taking the extracted audio signal to be processed as a new audio signal to be processed.
Following the above example, suppose the target signal extraction network is signal extraction network 1, and the audio type corresponding to signal extraction network 1 is the human-voice type. If a signal is extracted from the audio signal to be processed through signal extraction network 1, the extracted signal may be determined to be a sub-audio signal of the human-voice type, and the post-extraction audio signal to be processed (i.e., the audio signal remaining after the human-voice sub-audio signal is filtered out) may be used as the new audio signal to be processed.
If no signal is extracted from the audio signal to be processed, a new target signal extraction network may be selected from the at least one signal extraction network (e.g., the signal extraction network 2 is used as a new target signal extraction network), and step a22 is performed based on the new target signal extraction network, until a signal is extracted from the audio signal to be processed through the new target signal extraction network.
It will be appreciated that each signal extraction network can be selected once, as the selected signal extraction network will be deleted from the collection of signal extraction networks.
In some embodiments, a specific implementation of step a23 may include:
If the signal is extracted from the audio signal to be processed, the duty ratio of the extracted signal in the audio signal to be processed is obtained.
The time segments in which the extracted signal occurs may be identified in the audio signal to be processed, their total duration computed, and the ratio between that duration and the total audio duration of the audio signal to be processed calculated, thereby obtaining the duty ratio of the extracted signal in the audio signal to be processed.
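A minimal sketch of this duty-ratio computation (representing the active segments as (start, end) pairs in seconds is an assumption for illustration):

```python
def duty_ratio(active_segments, total_duration):
    """Fraction of the audio occupied by the extracted signal.
    `active_segments` is a list of (start, end) times, in seconds,
    where the extracted signal is present."""
    active = sum(end - start for start, end in active_segments)
    return active / total_duration

# Extracted signal present for 2 s in total out of a 20 s recording -> 10 %.
ratio = duty_ratio([(1.0, 2.5), (10.0, 10.5)], 20.0)
keep = ratio >= 0.05   # compare against a preset duty-ratio threshold
```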
And if the duty ratio is greater than or equal to a preset duty ratio threshold, determining the extracted signal as a sub-audio signal of a target audio type corresponding to the target signal extraction network.
For example, if the duty ratio of the extracted signal in the audio signal to be processed is 10%, the extracted signal may be determined as a sub-audio signal of the target audio type corresponding to the target signal extraction network, and the subsequent step a24 may be performed.
If the duty ratio of the extracted signal in the audio signal to be processed is smaller than the preset duty ratio threshold, a new target signal extraction network may be screened from the at least one signal extraction network, and step a22 is executed back based on the new target signal extraction network until the duty ratio of the extracted signal in the audio signal to be processed is greater than or equal to the preset duty ratio threshold.
This threshold accounts for the case in which the proportion of the signal extracted through the target signal extraction network in the audio signal to be processed is too small: enhancing such a signal brings no obvious improvement in sound quality and basically amounts to enhancing ineffective audio, so it is filtered out by the duty-ratio check.
A24, screening out a new target signal extraction network from at least one signal extraction network.
Along the above example, after the signal extraction network 1 in the signal extraction network set extracts the sub-audio signal of the corresponding audio type, one signal extraction network (e.g., the signal extraction network 2) may be randomly selected from the signal extraction network set as a new target signal extraction network.
A25, based on the new target signal extraction network and the new audio signal to be processed, return to the step of performing signal extraction on the audio signal to be processed through the target signal extraction network, until no signal can be extracted from the new audio signal to be processed by any of the at least one signal extraction network; the new audio signal to be processed is then determined to be the sub-audio signal corresponding to other audio types.
Following the above example, step A22 may be performed with a new target signal extraction network (e.g., signal extraction network 2) and the new audio signal to be processed (e.g., the audio signal remaining after the human-voice sub-audio signal was extracted); that is, signal extraction network 2 performs signal extraction on the audio signal remaining after the human-voice sub-audio signal was removed. And so on: target signal extraction networks are repeatedly selected from the at least one signal extraction network to extract signals, so the sub-audio signals contained in the audio signal to be processed become fewer and fewer, until no signal can be extracted by any signal extraction network. At that point the new audio signal to be processed no longer contains sub-audio signals of any target audio type, so it can be determined to be the sub-audio signal corresponding to other audio types.
In this embodiment, the audio signal to be processed is extracted layer by layer through at least one signal extraction network, so as to obtain sub-audio signals corresponding to other audio types, and the audio signal to be processed can be rapidly and efficiently divided into the target audio type and other audio types, so that different enhancement strategies can be adopted for the two audio types in the following process.
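The layer-by-layer procedure of steps A21 to A25 can be sketched as a loop. The stub "networks" below are hypothetical stand-ins that merely report whether their audio type is present in the mix; the real networks of this embodiment operate on waveforms:

```python
def extract_layer_by_layer(mixed, networks):
    """Layer-by-layer extraction sketch. `mixed` is a set of component
    labels standing in for the mixed audio; each stub network 'extracts'
    its own audio type if present, and the residue becomes the new input."""
    sub_audio = {}
    remaining = set(mixed)
    pool = list(networks)
    while pool:
        net = pool.pop(0)            # screen one target network (selected once)
        if net in remaining:         # signal successfully extracted
            sub_audio[net] = net
            remaining.discard(net)   # filtered residue is the new input
        # otherwise: nothing extracted, try the next network
    return sub_audio, remaining      # remaining -> "other audio types"

subs, other = extract_layer_by_layer(
    {"vocal", "keyboard", "flute", "noise"},
    ["vocal", "keyboard", "flute", "percussion"])
```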
In some embodiments, in step a21, a specific embodiment of screening one target signal extraction network from at least one signal extraction network may include:
an extraction priority of each of the at least one signal extraction network is obtained.
The extraction priority may be a marker of the relative importance or urgency of a signal extraction network when it performs the audio signal extraction task. It will be appreciated that the higher the extraction priority of a signal extraction network, the more important the sub-audio signal of the audio type it extracts. In this embodiment, the network with the higher extraction priority is selected first as the target signal extraction network.
The extraction priority of each signal extraction network may be preset, and when the extraction priority is set, the user may set the extraction priority according to their preference degrees for different audio types. The extraction priority may be set according to the frequency of use of the audio of different audio types in practical applications, for example, the extraction priority of the audio type (such as keyboard type) of the main stream may be set higher. It will be appreciated that the extraction priority may be set up by user according to the actual requirement, and is not limited herein.
A target signal extraction network is screened from the at least one signal extraction network based on the extraction priority.
For example, a signal extraction network having the highest extraction priority may be selected from the at least one signal extraction network as the target signal extraction network.
Accordingly, in step a24, the screening of the at least one signal extraction network for a new target signal extraction network may include:
a new target signal extraction network is screened from the at least one signal extraction network based on the extraction priority.
For example, if the signal extraction network 1 is selected in the first screening, the signal extraction network having the highest extraction priority among at least one signal extraction network other than the signal extraction network 1 may be selected as the target signal extraction network in the present screening.
Extracting one audio type from the audio signal to be processed may affect the other audio types it contains, and the earlier a signal is extracted, the less it is affected. In this embodiment, by acquiring the extraction priority of each signal extraction network and screening the target signal extraction network from the at least one signal extraction network based on that priority, sub-audio signals of important audio types in the audio signal to be processed suffer less interference, thereby effectively improving the audio enhancement effect.
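The priority-based screening can be sketched as follows (the priority values and network names are illustrative):

```python
def next_target(pool, priority):
    """Pick the highest-priority network from the remaining pool and
    remove it, so each network is selected at most once."""
    best = max(pool, key=lambda name: priority[name])
    pool.remove(best)
    return best

# Illustrative priorities: vocals first, then keyboard, then flute.
priority = {"vocal": 3, "keyboard": 2, "flute": 1}
pool = ["flute", "vocal", "keyboard"]
order = [next_target(pool, priority) for _ in range(3)]
```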
In other embodiments, specific embodiments of step A2 may include:
s1, respectively extracting the audio signals to be processed through at least one signal extraction network.
For example, the audio signal to be processed may be duplicated to obtain as many copies as there are signal extraction networks, and signal extraction may then be performed on each copy through one of the signal extraction networks.
S2, taking the signal extraction network of the extracted signal as a target signal extraction network, and taking the signal extracted by the target signal extraction network as a sub-audio signal of a target audio type corresponding to the target signal extraction network.
Illustratively, the at least one signal extraction network comprises, for example, a signal extraction network 1 corresponding to a human voice class, a signal extraction network 2 corresponding to a keyboard class, a signal extraction network 3 corresponding to a flute class, and a signal extraction network 4 corresponding to a tap class. After four audio signals to be processed are respectively subjected to signal extraction through the signal extraction network 1, the signal extraction network 2, the signal extraction network 3 and the signal extraction network 4, the signal extraction network 1, the signal extraction network 2 and the signal extraction network 3 extract signals of corresponding audio types, and the signal extraction network 4 does not extract signals. The signal extraction network 1, the signal extraction network 2, the signal extraction network 3 may be regarded as target signal extraction networks. And can obtain the sub-audio signals corresponding to the human voice class, the sub-audio signals corresponding to the keyboard class and the sub-audio signals corresponding to the flute class.
S3, filtering sub-audio signals of the target audio type corresponding to the target signal extraction network in the audio signals to be processed, and taking the filtered audio signals to be processed as audio signals corresponding to other types.
According to the above example, the sub-audio signals corresponding to the human voice, the sub-audio signals corresponding to the keyboard and the sub-audio signals corresponding to the flute may be filtered from a complete audio signal to be processed, so as to obtain a filtered audio signal to be processed, and the filtered audio signal to be processed may be used as other types of corresponding audio signals.
In some embodiments, the specific implementation of step S3 may include:
and acquiring the duty ratio of the sub-audio signals of the corresponding target audio type of each target signal extraction network in the audio signals to be processed.
For a specific embodiment of obtaining the duty ratio of the sub-audio signal of the target audio type corresponding to each target signal extraction network in the audio signal to be processed, reference may be made to the above embodiment of obtaining the duty ratio of the extracted signal in the audio signal to be processed in step a23, so that details are not repeated herein.
And filtering out, in descending order of duty ratio, the sub-audio signals of the target audio types corresponding to the target signal extraction networks from the audio signal to be processed, so as to obtain the filtered audio signal to be processed.
Following the above example, suppose the duty ratio of the human-voice sub-audio signal is 50%, that of the keyboard sub-audio signal is 10%, and that of the flute sub-audio signal is 30%. Then the human-voice sub-audio signal in the audio signal to be processed is filtered out first, then the flute sub-audio signal, and finally the keyboard sub-audio signal.
And determining the filtered audio signal to be processed as other types of corresponding audio signals.
Since processing and analyzing the audio signal consumes computing resources, in this embodiment preferentially filtering out the sub-audio signals with the larger duty ratios reduces the amount of data processed and saves processing resources. In addition, the larger a sub-audio signal's duty ratio, the more likely it is to overlap with other sub-audio signals; giving it filtering priority reduces the processing of overlapping parts, which improves filtering efficiency and quickly yields the audio signals corresponding to the other audio types.
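The descending-duty-ratio ordering can be sketched as a one-line sort (the duty-ratio values reuse the example figures above):

```python
def filtering_order(duty_ratios):
    """Order in which sub-audio signals are filtered out: descending
    duty ratio, so the largest components are removed first."""
    return sorted(duty_ratios, key=duty_ratios.get, reverse=True)

order = filtering_order({"vocal": 0.5, "keyboard": 0.1, "flute": 0.3})
```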
Considering that when layer-by-layer signal extraction is performed on the audio signal to be processed, the quality of later-extracted signals may be affected by the earlier extraction operations, this embodiment copies the audio signal to be processed into multiple copies and extracts each copy independently through the signal extraction network of one target audio type. This avoids the interference introduced by layer-by-layer extraction and preserves the quality of the sub-audio signal extracted for each audio type.
103. And aiming at each sub-audio signal, carrying out audio enhancement processing on the sub-audio signal based on the audio type corresponding to the sub-audio signal to obtain an enhanced sub-audio signal.
In some embodiments, the plurality of audio types includes a target audio type and other audio types. In step 103, a specific embodiment of performing audio enhancement processing on the sub-audio signal based on its corresponding audio type to obtain an enhanced sub-audio signal may include:
and B1, if the audio type corresponding to the sub-audio signal is the target audio type, performing audio enhancement processing on the sub-audio signal through a first enhancement strategy to obtain an enhanced sub-audio signal, wherein the first enhancement strategy comprises at least one of equalizer enhancement processing and harmonic enhancement processing.
The main principle of Equalizer (EQ) enhancement processing is to configure gain parameter values for different frequency bands and multiply the energy of the input signal in each band by the corresponding band gain, thereby boosting or attenuating the energy of that band. The EQ algorithm can be implemented as a multi-stage IIR filter combination.
An IIR filter (Infinite Impulse Response filter) is a digital filter whose current output value is a linear combination of the current input value and a number of previous input and output values.
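As a hedged illustration of the multi-stage IIR combination mentioned above, the following sketch builds a peaking EQ from cascaded second-order IIR (biquad) sections. The band frequency, gain, and Q values are illustrative, and the coefficient formulas follow the widely used RBJ audio-EQ cookbook rather than anything specified in this embodiment:

```python
import numpy as np
from scipy.signal import lfilter

def peaking_biquad(fs, f0, gain_db, q=1.0):
    """RBJ peaking-EQ biquad: boosts/attenuates energy around f0 by gain_db."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([1 + alpha * a_lin, -2 * np.cos(w0), 1 - alpha * a_lin])
    a = np.array([1 + alpha / a_lin, -2 * np.cos(w0), 1 - alpha / a_lin])
    return b / a[0], a / a[0]

def eq_enhance(signal, fs, bands):
    """Apply a cascade of IIR peaking sections; bands = [(f0, gain_db, Q), ...]."""
    out = signal
    for f0, gain_db, q in bands:
        b, a = peaking_biquad(fs, f0, gain_db, q)
        out = lfilter(b, a, out)
    return out

fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)                 # 1 kHz test tone
boosted = eq_enhance(tone, fs, [(1000, 6.0, 1.0)])  # +6 dB band at 1 kHz
```

At the band's center frequency a +6 dB peaking section scales the amplitude by roughly a factor of two, which is exactly the band-gain multiplication described above.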
The harmonic enhancement processing can be implemented by a harmonic excitation method or by a deep-learning audio enhancement method. Harmonic excitation first extracts the fundamental frequency of the input signal; since harmonic frequencies are integer multiples of the fundamental frequency, the signal can be boosted at those integer-multiple positions, achieving the harmonic enhancement effect on the input signal.
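A frequency-domain sketch of the harmonic-excitation idea described above: the fundamental is estimated from the largest FFT bin, and the bins at its integer multiples are boosted. The boost factor and the single-bin boost are illustrative simplifications, not parameters from this embodiment:

```python
import numpy as np

def harmonic_enhance(signal, fs, boost=2.0, max_harmonics=8):
    """Estimate the fundamental from the spectral peak, then boost the
    bins at integer multiples of the fundamental frequency."""
    n = len(signal)
    spectrum = np.fft.rfft(signal)
    f0_bin = np.argmax(np.abs(spectrum[1:])) + 1   # skip the DC bin
    for k in range(2, max_harmonics + 1):          # harmonics only, not f0
        h = k * f0_bin
        if h >= len(spectrum):
            break
        spectrum[h] *= boost                       # boost the harmonic bin
    return np.fft.irfft(spectrum, n)

fs = 8000
t = np.arange(fs) / fs
# 200 Hz fundamental plus a weak 400 Hz second harmonic.
x = np.sin(2 * np.pi * 200 * t) + 0.1 * np.sin(2 * np.pi * 400 * t)
y = harmonic_enhance(x, fs, boost=3.0)
```

With one second of signal at 8 kHz the FFT resolution is 1 Hz, so the 400 Hz harmonic lands in a single bin and is scaled by the boost factor while the fundamental is left unchanged.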
In some embodiments, the target audio type includes a plurality of sub-audio types. In step B1, a specific embodiment of performing audio enhancement processing on the sub-audio signal through the first enhancement strategy may include:
b11, determining a target sub-audio type corresponding to the sub-audio signal from the plurality of sub-audio types.
Since the audio type corresponding to each signal extraction network is a sub-audio type within the target audio type, the target sub-audio type of a sub-audio signal follows naturally from the signal extraction network that extracted it.
For example, referring to table 1, table 1 shows a mapping relationship among audio type, sub-audio type, enhancement policy, and enhancement parameters. The mapping relationship may be set in a user-defined manner according to actual requirements, which is not limited herein.
TABLE 1
As can be seen from table 1, after determining the target sub-audio type of the sub-audio signal, the corresponding enhancement strategy and enhancement parameters can be found from table 1 according to the target sub-audio type.
In table 1, the first enhancement strategy (1) may indicate that only equalizer enhancement processing is used, the first enhancement strategy (2) that only harmonic enhancement processing is used, and the first enhancement strategy (3) that both harmonic enhancement processing and equalizer enhancement processing are used. The enhancement parameter a corresponding to the first enhancement strategy (1) is an EQ parameter (such as a frequency band gain value), the enhancement parameter b corresponding to the first enhancement strategy (2) is a harmonic enhancement parameter, and the enhancement parameter d corresponding to the first enhancement strategy (3) comprises both harmonic enhancement parameters and EQ parameters. The values of the enhancement parameters differ between sub-audio types: for example, enhancement parameter a and enhancement parameter c are both EQ parameters, but their values differ because they correspond to different sub-audio types. The specific values of the enhancement parameters can be user-defined according to actual requirements and are not limited herein.
And B12, determining target enhancement parameters corresponding to the first enhancement strategy according to the target sub-audio type.
Following the above example, if the target sub-audio type is flute, the target enhancement parameter may be determined from table 1 to be the enhancement parameter c corresponding to the first enhancement strategy (1).
B13, performing audio enhancement processing on the sub-audio signal based on the target enhancement parameter and the first enhancement strategy.
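Since the body of table 1 is not reproduced in this text, the lookup in steps B11-B13 can only be sketched with hypothetical table entries; all type names, strategy labels, and parameter values below are stand-ins:

```python
# Hypothetical stand-in for table 1: a mapping from sub-audio type to its
# enhancement strategy and target enhancement parameters. The real entries
# are user-defined according to actual requirements.
ENHANCEMENT_TABLE = {
    "human_voice": {"strategy": "eq",          "params": {"eq_bands": [(300, 3.0), (3000, 4.0)]}},
    "flute":       {"strategy": "eq",          "params": {"eq_bands": [(800, 2.0)]}},
    "piano":       {"strategy": "harmonic",    "params": {"boost": 2.0}},
    "violin":      {"strategy": "eq_harmonic", "params": {"eq_bands": [(1200, 2.0)], "boost": 1.5}},
}

def lookup_enhancement(target_sub_type):
    """Steps B12-B13: map the determined target sub-audio type to its
    enhancement strategy and target enhancement parameters."""
    entry = ENHANCEMENT_TABLE[target_sub_type]
    return entry["strategy"], entry["params"]

strategy, params = lookup_enhancement("flute")
print(strategy)  # eq
```

The sub-audio signal would then be enhanced by dispatching on `strategy` with `params`, mirroring the table-driven flow of steps B11-B13.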
And B2, if the audio type corresponding to the sub-audio signal is other audio types, performing audio enhancement processing on the sub-audio signal through a second enhancement strategy to obtain an enhanced sub-audio signal, wherein the second enhancement strategy comprises noise reduction processing.
Noise reduction processing reduces or removes the noise components in a signal so as to improve its quality and clarity. In the present embodiment, the noise reduction processing may employ at least one of the following manners: filter-based noise reduction, machine-learning and deep-learning noise reduction methods, spectral subtraction, wavelet transformation, and the like.
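Of the listed noise reduction manners, spectral subtraction is the easiest to sketch. The following minimal single-frame version assumes a noise magnitude estimate is available; the spectral-floor value is illustrative:

```python
import numpy as np

def spectral_subtraction(noisy, noise_estimate, floor=0.01):
    """Basic spectral subtraction: subtract an estimated noise magnitude
    spectrum from the noisy magnitude spectrum, keeping the noisy phase."""
    n = len(noisy)
    spec = np.fft.rfft(noisy)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = np.abs(np.fft.rfft(noise_estimate, n))
    # Clamp to a small spectral floor instead of letting magnitudes go negative.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n)

fs = 8000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)
noise = 0.3 * np.random.default_rng(0).standard_normal(fs)
denoised = spectral_subtraction(clean + noise, noise)
```

With a good noise estimate, the residual error of the denoised signal is far below the original noise energy; real implementations work frame by frame with windowing rather than on one full-length frame.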
104. Signal reconstruction is performed based on the enhanced sub-audio signals to obtain a target audio signal.
In step 104, performing signal reconstruction based on the enhanced sub-audio signals may include:
performing linear superposition processing on the enhanced sub-audio signals to obtain the target audio signal.
In some embodiments, according to the linear superposition principle, each enhanced sub-audio signal may be added with a given weight, resulting in the reconstructed signal, i.e., the target audio signal. Optionally, the weight corresponding to each enhanced sub-audio signal may be positively correlated with the extraction priority described above.
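The weighted linear superposition can be sketched as follows; the weight values and type names are illustrative:

```python
import numpy as np

def reconstruct(enhanced_subs, weights=None):
    """Linearly superpose the enhanced sub-audio signals with the given
    weights to obtain the target (reconstructed) audio signal."""
    if weights is None:
        weights = {t: 1.0 for t in enhanced_subs}   # plain sum by default
    target = np.zeros_like(next(iter(enhanced_subs.values())))
    for audio_type, sub in enhanced_subs.items():
        target = target + weights[audio_type] * sub
    return target

subs = {"voice": np.array([0.5, -0.5]), "flute": np.array([0.2, 0.2])}
# Weights positively correlated with extraction priority (illustrative values).
out = reconstruct(subs, weights={"voice": 1.0, "flute": 0.8})
print(out)  # [ 0.66 -0.34]
```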
It can be seen that, in this embodiment, after the audio signal to be processed is obtained, sub-audio signals corresponding to multiple audio types are extracted from it; then, for each sub-audio signal, audio enhancement processing is performed based on its corresponding audio type to obtain an enhanced sub-audio signal; finally, signal reconstruction is performed based on the enhanced sub-audio signals to obtain the target audio signal. That is, an audio signal to be processed that contains multiple audio types can be split into multiple sub-audio signals by audio type, and each sub-audio signal can then be individually enhanced with an audio enhancement method matched to its type, so that the sub-audio signal of every audio type in the audio to be processed is effectively enhanced. Reconstructing the enhanced sub-audio signals finally yields the target audio signal, which is thereby guaranteed better audio quality.
The method described in the above embodiments will be described in further detail below.
In this embodiment, an electronic device will be taken as an example, and a method of the embodiment of the present application will be described in detail.
As shown in fig. 2a, a specific flow of an audio signal processing method is as follows:
201. the electronic device obtains an audio signal to be processed.
202. The electronic device obtains at least one signal extraction network, each of the at least one signal extraction network for extracting an audio signal of a target audio type.
203. The electronic equipment performs signal extraction on the audio signals to be processed through at least one signal extraction network to obtain sub-audio signals corresponding to various audio types.
In step 203, signal extraction is performed on an audio signal to be processed through at least one signal extraction network to obtain sub-audio signals corresponding to multiple audio types, including:
screening a target signal extraction network from the at least one signal extraction network;
performing signal extraction on the audio signal to be processed through a target signal extraction network;
if a signal is extracted from the audio signal to be processed, determining the extracted signal to be the sub-audio signal of the target audio type corresponding to the target signal extraction network, and taking the audio signal to be processed after extraction as a new audio signal to be processed;
screening a new target signal extraction network from the at least one signal extraction network;
based on the new target signal extraction network and the new audio signal to be processed, returning to perform: signal extraction on the audio signal to be processed through the target signal extraction network, until no signal can be extracted from the new audio signal to be processed by any of the at least one signal extraction network, whereupon the new audio signal to be processed is determined to be the sub-audio signal corresponding to the other audio types.
The step of determining the extracted signal as a sub-audio signal of the target audio type corresponding to the target signal extraction network if the signal is extracted from the audio signal to be processed may include:
if a signal is extracted from the audio signal to be processed, obtaining the proportion of the extracted signal in the audio signal to be processed;
and if the proportion is greater than or equal to a preset proportion threshold, determining the extracted signal to be the sub-audio signal of the target audio type corresponding to the target signal extraction network.
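The priority-ordered, layer-by-layer extraction loop in the steps above can be sketched as follows. The stand-in "networks" are plain callables, and the energy-based proportion measure and the 5% threshold are illustrative assumptions:

```python
import numpy as np

def extract_sub_signals(mixed, networks, ratio_threshold=0.05):
    """networks: list of (priority, audio_type, extract_fn), where extract_fn
    maps a waveform to the extracted signal of that type.
    Returns {audio_type: sub_signal} plus the residual as "other"."""
    subs = {}
    residual = mixed.copy()
    # Screen target networks in descending extraction priority.
    for _prio, audio_type, extract_fn in sorted(networks, reverse=True):
        extracted = extract_fn(residual)
        ratio = np.sum(extracted ** 2) / (np.sum(residual ** 2) + 1e-12)
        if ratio >= ratio_threshold:          # keep only significant extractions
            subs[audio_type] = extracted
            residual = residual - extracted   # new audio signal to be processed
    subs["other"] = residual                  # nothing more can be extracted
    return subs

# Toy "networks": each returns a fixed component assumed present in the input.
voice = np.array([1.0, 1.0, 1.0, 1.0])
flute = np.array([0.5, -0.5, 0.5, -0.5])
mixed = voice + flute
nets = [
    (2, "voice", lambda _x: voice.copy()),    # higher priority, extracted first
    (1, "flute", lambda _x: flute.copy()),
]
subs = extract_sub_signals(mixed, nets)
```

Each pass extracts one target type from the current residual and the residual left after all passes becomes the sub-audio signal of the other audio types, mirroring the loop described above.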
Wherein, the step of "screening a target signal extraction network from at least one signal extraction network" may comprise:
Acquiring an extraction priority of each signal extraction network in at least one signal extraction network;
screening a target signal extraction network from the at least one signal extraction network based on the extraction priority;
screening a new target signal extraction network from at least one signal extraction network, comprising:
a new target signal extraction network is screened from the at least one signal extraction network based on the extraction priority.
The step of extracting the signal of the audio signal to be processed through the target signal extraction network may include:
extracting features of an audio signal to be processed to obtain signal features, wherein the signal features comprise a power spectrum or a mel-frequency cepstrum;
and extracting the signal of the audio signal to be processed based on the target signal extraction network and the signal characteristics.
The step of extracting features of the audio signal to be processed to obtain signal features may include:
performing Fourier transform on the audio signal to be processed to obtain a power spectrum and a phase spectrum;
taking the power spectrum and the phase spectrum as signal characteristics;
based on the target signal extraction network and the signal characteristics, extracting the signal of the audio signal to be processed comprises the following steps:
Inputting the power spectrum into a target signal extraction network, and obtaining the frequency point gain output by the target signal extraction network;
generating a target power spectrum based on the frequency point gain and the power spectrum;
based on the phase spectrum, the target power spectrum is subjected to an inverse fourier transform to obtain an extracted signal.
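The Fourier-transform steps above can be sketched as a single full-frame pass: FFT to power spectrum and phase spectrum, a gain per frequency bin from the network, the target power spectrum from gain times power, and an inverse FFT with the original phase. The gain "network" here is a hypothetical stand-in that keeps low-frequency bins:

```python
import numpy as np

def extract_with_gains(signal, gain_network, fs):
    """FFT -> power/phase spectra; the network outputs a per-bin gain;
    the gains scale the power spectrum; inverse FFT with the original
    phase rebuilds the extracted waveform."""
    n = len(signal)
    spec = np.fft.rfft(signal)
    power, phase = np.abs(spec) ** 2, np.angle(spec)
    gains = gain_network(power, fs)            # one gain per frequency bin
    target_power = gains * power               # target power spectrum
    magnitude = np.sqrt(target_power)
    return np.fft.irfft(magnitude * np.exp(1j * phase), n)

# Hypothetical "network": keep bins below 1 kHz, suppress the rest.
def lowband_gain(power, fs):
    freqs = np.fft.rfftfreq(2 * (len(power) - 1), d=1.0 / fs)
    return (freqs < 1000).astype(float)

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 2000 * t)
y = extract_with_gains(x, lowband_gain, fs)    # only the 500 Hz tone survives
```

A real signal extraction network would predict these per-bin gains from learned features (and typically frame by frame), but the surrounding transform pipeline is the same.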
In other embodiments, in step 203, signal extraction is performed on an audio signal to be processed through at least one signal extraction network to obtain sub-audio signals corresponding to a plurality of audio types, including:
respectively extracting the audio signals to be processed through at least one signal extraction network;
taking the signal extraction network of the extracted signal as a target signal extraction network, and taking the signal extracted by the target signal extraction network as a sub-audio signal of a target audio type corresponding to the target signal extraction network;
and filtering sub-audio signals of the target audio type corresponding to the target signal extraction network in the audio signals to be processed, and taking the filtered audio signals to be processed as audio signals corresponding to other types.
The step of filtering sub-audio signals of the target audio type corresponding to the target signal extraction network in the audio signal to be processed may include:
acquiring, for each target signal extraction network, the proportion in the audio signal to be processed of the sub-audio signal of its corresponding target audio type;
filtering the sub-audio signals of the target audio types corresponding to the target signal extraction networks out of the audio signal to be processed in descending order of proportion, to obtain the filtered audio signal to be processed;
and determining the filtered audio signal to be processed to be the audio signal corresponding to the other audio types.
204. The electronic equipment carries out audio enhancement processing on the sub-audio signals based on the audio types corresponding to the sub-audio signals aiming at each sub-audio signal to obtain enhanced sub-audio signals.
Wherein the plurality of audio types includes a target audio type and other audio types, the specific implementation in step 204 may include:
if the audio type corresponding to the sub-audio signal is the target audio type, performing audio enhancement processing on the sub-audio signal through a first enhancement strategy to obtain an enhanced sub-audio signal, wherein the first enhancement strategy comprises at least one of equalizer enhancement processing and harmonic enhancement processing;
if the audio type corresponding to the sub-audio signal is other audio types, performing audio enhancement processing on the sub-audio signal through a second enhancement strategy to obtain an enhanced sub-audio signal, wherein the second enhancement strategy comprises noise reduction processing.
Wherein the target audio type includes a plurality of sub-audio types, and the step of performing audio enhancement processing on the sub-audio signal through the first enhancement policy may include:
determining a target sub-audio type corresponding to the sub-audio signal from among the plurality of sub-audio types;
determining target enhancement parameters corresponding to the first enhancement strategy according to the target sub-audio type;
the sub-audio signal is audio enhanced based on the target enhancement parameter and the first enhancement policy.
205. The electronic device performs signal reconstruction based on the enhanced sub-audio signals to obtain a target audio signal.
The specific implementation manner of step 205 may include:
and performing linear superposition processing on the enhanced sub-audio signals to obtain the target audio signal.
Illustratively, in practical applications, steps 201 to 205 may be implemented by an audio signal processing architecture as shown in fig. 2b, which may be provided in an electronic device. The composite recording signal in fig. 2b corresponds to the signal to be processed in the present embodiment, the circle in fig. 2b represents the filtering module, the "+" beside the filtering module in fig. 2b represents the signal input to the filtering module, and the "-" represents the signal to be filtered. The decomposition network in fig. 2b may be used to identify and decompose signals of different audio types, where the decomposition network may include decomposition modules corresponding to different audio types (such as type 1, type 2, etc.), where the decomposition modules may be equivalent to the signal extraction network in the foregoing embodiment, and are used to decompose sub-signals of corresponding audio types. Wherein the different audio types may include, but are not limited to: human voice, different instrument signals (pianos, violins, flute, tap, etc.), and other signals.
As shown in fig. 2b, the input composite audio signal is decomposed into different audio type signals layer by the decomposing network.
The signal decomposition referred to herein is that a signal of a specific type is extracted from an input signal of a previous stage through a deep learning network, for example, a composite recording signal is decomposed by a "type 1 signal decomposition" module to obtain a signal 1.
The type 1 definition can be a human voice signal or a certain type of musical instrument signal type, a deep learning network is arranged in the module, the input of the network is a residual signal extracted from a previous-stage signal, and the corresponding previous-stage residual signal is a composite recording signal or the characteristics of the composite recording signal: such as power spectrum, mel-frequency cepstrum features, etc.
The input signal is processed by the deep learning network to obtain the signal component belonging to type 1. This output is subtracted from the residual signal of the previous stage to obtain a new residual signal (the new audio signal to be processed in the above embodiment), which is then fed into the next-stage "type 2 signal decomposition" module for type-2 signal extraction; its output is again subtracted from the previous stage's residual to produce the next residual. Extracting and subtracting layer by layer in this way finally yields the "other signals" (the sub-audio signals corresponding to the other audio types in the above embodiment), which comprise noise signals and a very small number of signals that cannot be classified.
Then, the other signals undergo noise reduction processing to further extract their effective signal components, while the different types of signals extracted at the earlier stages undergo equalization processing, harmonic enhancement processing, and the like, using the EQ parameters matched to each type. After this series of independent signal enhancement steps, the enhanced signals are linearly superposed to obtain the reconstructed signal, which is the audio enhancement signal finally output by this embodiment.
This embodiment therefore provides an overall enhancement solution based on deep-learning multi-signal decomposition, independent signal enhancement, and signal reconstruction. It differs from existing audio enhancement solutions, which treat the input audio signal only as a whole: because the input is a complex mixture of multiple signals, such schemes have a limited enhancement effect and commonly either damage the effective signal or leave too much of the ineffective signal behind. The audio signal processing method of this application decomposes the original mixed signal into independent signals of different types, enhances each independently according to the sound-quality characteristics of its type, and reconstructs the multiple enhanced independent signals into a whole signal. The sound-quality improvement is particularly noticeable for music signals produced by multiple musical instruments.
In order to better implement the above method, the embodiment of the application also provides an audio signal processing device, where the audio signal processing device may be specifically integrated in an electronic device, and the electronic device may be a terminal, a server, or other devices. The terminal can be a mobile phone, a tablet personal computer, an intelligent Bluetooth device, a notebook computer, a personal computer and other devices; the server may be a single server or a server cluster composed of a plurality of servers.
For example, in the present embodiment, the method of the embodiment of the present application will be described in detail taking an example in which the audio signal processing apparatus is specifically integrated in an electronic device.
For example, as shown in fig. 3, the audio signal processing apparatus may include a signal acquisition unit 301, a signal extraction unit 302, an enhancement unit 303, and a reconstruction unit 304, as follows:
a signal acquisition unit 301 for acquiring an audio signal to be processed;
a signal extraction unit 302, configured to extract sub-audio signals corresponding to a plurality of audio types from the audio signal to be processed;
an enhancing unit 303, configured to perform audio enhancement processing on the sub-audio signals based on the audio types corresponding to the sub-audio signals, for each sub-audio signal, to obtain enhanced sub-audio signals;
and a reconstruction unit 304, configured to perform signal reconstruction based on the enhanced sub-audio signals, so as to obtain a target audio signal.
In some embodiments, the signal extraction unit 302 includes:
a network acquisition subunit, configured to acquire at least one signal extraction network, where each signal extraction network in the at least one signal extraction network is configured to extract an audio signal of a target audio type;
and the extraction subunit is used for carrying out signal extraction on the audio signals to be processed through at least one signal extraction network to obtain sub audio signals corresponding to various audio types.
In some embodiments, the extraction subunit comprises:
the first screening module is used for screening a target signal extraction network from at least one signal extraction network;
the extraction module is used for extracting the signal of the audio signal to be processed through the target signal extraction network;
the determining module is used for determining the extracted signal as a sub-audio signal of a target audio type corresponding to the target signal extraction network if the signal is extracted from the audio signal to be processed, and taking the extracted audio signal to be processed as a new audio signal to be processed;
the second screening module is used for screening out a new target signal extraction network from at least one signal extraction network;
a return module, configured to, based on the new target signal extraction network and the new audio signal to be processed, return to perform: signal extraction on the audio signal to be processed through the target signal extraction network, until no signal can be extracted from the new audio signal to be processed by any of the at least one signal extraction network, whereupon the new audio signal to be processed is determined to be the sub-audio signal corresponding to the other audio types.
In some embodiments, the extraction module is specifically further configured to:
if a signal is extracted from the audio signal to be processed, obtain the proportion of the extracted signal in the audio signal to be processed;
and if the proportion is greater than or equal to a preset proportion threshold, determine the extracted signal to be the sub-audio signal of the target audio type corresponding to the target signal extraction network.
In some embodiments, the first screening module is specifically configured to:
acquiring an extraction priority of each signal extraction network in at least one signal extraction network;
screening a target signal extraction network from the at least one signal extraction network based on the extraction priority;
correspondingly, the second screening module is specifically configured to:
A new target signal extraction network is screened from the at least one signal extraction network based on the extraction priority.
In some embodiments, the extraction module is specifically further configured to:
extracting features of an audio signal to be processed to obtain signal features, wherein the signal features comprise a power spectrum or a mel-frequency cepstrum;
and extracting the signal of the audio signal to be processed based on the target signal extraction network and the signal characteristics.
In some embodiments, the extraction module is specifically further configured to:
performing Fourier transform on the audio signal to be processed to obtain a power spectrum and a phase spectrum;
and taking the power spectrum and the phase spectrum as signal characteristics;
based on the target signal extraction network and the signal characteristics, extracting the signal of the audio signal to be processed comprises the following steps:
inputting the power spectrum into a target signal extraction network, and obtaining the frequency point gain output by the target signal extraction network;
generating a target power spectrum based on the frequency point gain and the power spectrum;
based on the phase spectrum, the target power spectrum is subjected to an inverse fourier transform to obtain an extracted signal.
In some embodiments, the extraction subunit is further configured to:
respectively extracting the audio signals to be processed through at least one signal extraction network;
Taking the signal extraction network of the extracted signal as a target signal extraction network, and taking the signal extracted by the target signal extraction network as a sub-audio signal of a target audio type corresponding to the target signal extraction network;
and filtering sub-audio signals of the target audio type corresponding to the target signal extraction network in the audio signals to be processed, and taking the filtered audio signals to be processed as audio signals corresponding to other types.
In some embodiments, the extraction subunit is further configured to:
acquire, for each target signal extraction network, the proportion in the audio signal to be processed of the sub-audio signal of its corresponding target audio type;
filter the sub-audio signals of the target audio types corresponding to the target signal extraction networks out of the audio signal to be processed in descending order of proportion, to obtain the filtered audio signal to be processed;
and determine the filtered audio signal to be processed to be the audio signal corresponding to the other audio types.
In some implementations, the plurality of audio types includes a target audio type and other audio types, the enhancement unit 303, including:
the first enhancement subunit is configured to perform audio enhancement processing on the sub-audio signal through a first enhancement policy if the audio type corresponding to the sub-audio signal is a target audio type, so as to obtain an enhanced sub-audio signal, where the first enhancement policy includes at least one of equalizer enhancement processing and harmonic enhancement processing;
And the second enhancement subunit is used for carrying out audio enhancement processing on the sub-audio signals through a second enhancement strategy if the audio types corresponding to the sub-audio signals are other audio types, so as to obtain enhanced sub-audio signals, wherein the second enhancement strategy comprises noise reduction processing.
In some embodiments, the target audio type comprises a plurality of sub-audio types, a first enhancement subunit, specifically for:
determining a target sub-audio type corresponding to the sub-audio signal from among the plurality of sub-audio types;
determining target enhancement parameters corresponding to the first enhancement strategy according to the target sub-audio type;
the sub-audio signal is audio enhanced based on the target enhancement parameter and the first enhancement policy.
In some embodiments, the reconstruction unit 304 is specifically configured to:
and performing linear superposition processing on the enhanced sub-audio signals to obtain the target audio signal.
In the implementation, each unit may be implemented as an independent entity, or may be implemented as the same entity or several entities in any combination, and the implementation of each unit may be referred to the foregoing method embodiment, which is not described herein again.
The embodiment of the application also provides electronic equipment which can be a terminal, a server and other equipment. The terminal can be a mobile phone, a tablet computer, an intelligent Bluetooth device, a notebook computer, a personal computer and the like; the server may be a single server, a server cluster composed of a plurality of servers, or the like.
In some embodiments, the audio signal processing apparatus may also be integrated in a plurality of electronic devices, for example, the audio signal processing apparatus may be integrated in a plurality of servers, and the audio signal processing method of the present application is implemented by the plurality of servers.
In this embodiment, a detailed description will be given taking the electronic device as a terminal as an example. Fig. 4 shows a schematic structural diagram of the electronic device according to the embodiment of the present application. Specifically:
the electronic device may include one or more processor cores 401, one or more computer-readable storage media memory 402, a power supply 403, an input module 404, and a communication module 405, among other components. Those skilled in the art will appreciate that the electronic device structure shown in fig. 4 is not limiting of the electronic device and may include more or fewer components than shown, or may combine certain components, or may be arranged in different components. Wherein:
The processor 401 is the control center of the electronic device. It connects the various parts of the entire electronic device using various interfaces and lines, and performs the various functions of the electronic device and processes data by running or executing the software programs and/or modules stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the electronic device as a whole. In some embodiments, the processor 401 may include one or more processing cores; in some embodiments, the processor 401 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by executing the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data created according to the use of the electronic device, etc. In addition, memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device also includes a power supply 403 for powering the various components. In some embodiments, the power supply 403 may be logically connected to the processor 401 through a power management system, so that charging, discharging, and power-consumption management are handled by the power management system. The power supply 403 may also include one or more of a direct-current or alternating-current power supply, a recharging system, a power-failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may also include an input module 404, which input module 404 may be used to receive entered numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
The electronic device may also include a communication module 405. In some embodiments, the communication module 405 may include a wireless module, through which the electronic device can perform short-range wireless transmission, thereby providing wireless broadband internet access to the user. For example, the communication module 405 may be used to help the user send and receive e-mail, browse web pages, access streaming media, and so forth.
Although not shown, the electronic device may further include a display unit and the like, which are not described herein again. In particular, in this embodiment, the processor 401 in the electronic device loads the executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402, so as to implement various functions.
For the specific implementation of each of the above operations, reference may be made to the previous embodiments, which are not described herein again.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods in the above embodiments may be completed by instructions, or by instructions controlling related hardware; the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform steps in any of the audio signal processing methods provided by the embodiments of the present application.
Wherein the storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method provided in the above-described embodiment.
Since the instructions stored in the storage medium may perform the steps in any audio signal processing method provided in the embodiments of the present application, the beneficial effects achievable by any audio signal processing method provided in the embodiments of the present application can be achieved; for details, refer to the previous embodiments, which are not repeated herein.
The audio signal processing method, apparatus, electronic device, and computer-readable storage medium provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, those skilled in the art may make changes to the specific implementations and the application scope in light of the ideas of the present application. In view of the above, the content of this specification should not be construed as limiting the present application.

Claims (12)

1. An audio signal processing method, comprising:
acquiring an audio signal to be processed;
acquiring at least one signal extraction network, each of the at least one signal extraction network being for extracting an audio signal of a target audio type;
Performing signal extraction on the audio signal to be processed through the at least one signal extraction network to obtain sub-audio signals corresponding to a plurality of audio types; the signal extraction is performed on the audio signal to be processed through the at least one signal extraction network to obtain sub-audio signals corresponding to multiple audio types, including:
screening a target signal extraction network from the at least one signal extraction network;
performing signal extraction on the audio signal to be processed through the target signal extraction network;
if a signal is extracted from the audio signal to be processed, determining the extracted signal as a sub-audio signal of the target audio type corresponding to the target signal extraction network, and taking the audio signal to be processed after extraction as a new audio signal to be processed;
screening a new target signal extraction network from the at least one signal extraction network;
and based on the new target signal extraction network and the new audio signal to be processed, returning to execute the step of performing signal extraction on the audio signal to be processed through the target signal extraction network, until no signal extraction network in the at least one signal extraction network can extract a signal from the new audio signal to be processed, and determining the new audio signal to be processed as a sub-audio signal corresponding to other audio types;
Or, respectively extracting the audio signals to be processed through the at least one signal extraction network;
taking a signal extraction network of the extracted signals as a target signal extraction network, and taking the signals extracted by the target signal extraction network as sub-audio signals of a target audio type corresponding to the target signal extraction network;
filtering sub-audio signals of the target audio type corresponding to the target signal extraction network in the audio signals to be processed, and taking the filtered audio signals to be processed as audio signals corresponding to other types;
for each sub-audio signal, performing audio enhancement processing on the sub-audio signal based on the audio type corresponding to the sub-audio signal to obtain an enhanced sub-audio signal;
and performing signal reconstruction based on the enhanced sub-audio signals to obtain a target audio signal.
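To aid understanding, the iterative extraction recited in this claim can be sketched as follows; the representation of signals as sample lists, the extractor callables, and all names are illustrative assumptions and not part of the claimed method.

```python
def iterative_extraction(signal, extractors):
    """Apply each extraction network in turn; whatever no extractor
    can pull out is labelled as the 'other' audio type.

    `extractors` maps an audio type to a callable that returns
    (extracted_component, residual) or None when nothing is found.
    """
    components = {}
    pending = list(extractors.items())
    while pending:
        audio_type, extract = pending.pop(0)
        result = extract(signal)
        if result is None:
            continue  # this network found nothing; try the next one
        extracted, residual = result
        components[audio_type] = extracted
        signal = residual  # the filtered residual becomes the new input
    components["other"] = signal  # residue no network could extract
    return components
```

Here each extractor plays the role of one signal extraction network: it either returns the extracted component together with the residual signal, or None when it cannot extract anything, in which case the loop moves on to the next network.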
2. The method according to claim 1, wherein if a signal is extracted from the audio signal to be processed, determining the extracted signal as a sub-audio signal of a target audio type corresponding to the target signal extraction network includes:
if a signal is extracted from the audio signal to be processed, acquiring the duty ratio of the extracted signal in the audio signal to be processed;
And if the duty ratio is greater than or equal to a preset duty ratio threshold, determining the extracted signal as a sub-audio signal of a target audio type corresponding to the target signal extraction network.
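The duty-ratio check of this claim can be illustrated as follows; the claim does not define how the duty ratio is computed, so the energy-based definition and the example threshold below are assumptions.

```python
def passes_duty_ratio(extracted, original, threshold=0.1):
    """Accept an extracted component only if it accounts for at least
    `threshold` of the original signal's energy (the 'duty ratio').
    The energy-based definition of the ratio is an assumption here."""
    e_ext = sum(x * x for x in extracted)
    e_orig = sum(x * x for x in original)
    if e_orig == 0.0:
        return False  # empty input: nothing to attribute a ratio to
    return e_ext / e_orig >= threshold
```

A component failing the check would not be treated as a sub-audio signal of the target audio type.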
3. The audio signal processing method according to claim 1, wherein said screening out a target signal extraction network from said at least one signal extraction network comprises:
acquiring an extraction priority of each signal extraction network in the at least one signal extraction network;
screening a target signal extraction network from the at least one signal extraction network based on the extraction priority;
the screening the new target signal extraction network from the at least one signal extraction network comprises:
and screening a new target signal extraction network from the at least one signal extraction network based on the extraction priority.
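The priority-based screening of this claim amounts to ordering the candidate networks by their extraction priority; the pair representation and the convention that a lower number means a higher priority are assumptions made for illustration.

```python
def order_by_priority(networks):
    """Return extraction-network names sorted by extraction priority
    (lower number = tried first). `networks` is a list of
    (name, priority) pairs; names and values are illustrative."""
    return [name for name, prio in sorted(networks, key=lambda p: p[1])]
```

Screening a target network then means taking the first entry of this order, and screening a "new" target network means moving to the next entry.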
4. The audio signal processing method according to claim 1, wherein the signal extraction of the audio signal to be processed through the target signal extraction network includes:
extracting features of the audio signal to be processed to obtain signal features, wherein the signal features comprise a power spectrum or a mel-frequency cepstrum;
And carrying out signal extraction on the audio signal to be processed based on the target signal extraction network and the signal characteristics.
5. The method for processing an audio signal according to claim 4, wherein the feature extraction of the audio signal to be processed to obtain a signal feature comprises:
performing Fourier transform on the audio signal to be processed to obtain a power spectrum and a phase spectrum;
taking the power spectrum and the phase spectrum as the signal characteristics;
the signal extraction of the audio signal to be processed based on the target signal extraction network and the signal characteristics includes:
inputting the power spectrum into the target signal extraction network, and acquiring the frequency bin gains output by the target signal extraction network;
generating a target power spectrum based on the frequency bin gain and the power spectrum;
and performing inverse Fourier transform on the target power spectrum based on the phase spectrum to obtain an extracted signal.
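The spectral extraction of this claim (transform, per-bin gain with preserved phase, inverse transform) can be sketched as follows; a plain O(N²) DFT stands in for the Fourier transform, and the `gains` values would in practice be the output of the target signal extraction network rather than given constants.

```python
import cmath

def extract_via_bin_gain(frame, gains):
    """DFT the frame, scale each frequency bin by the network's
    predicted gain (0..1) while keeping the original phase, and
    inverse-DFT back to the time domain. The plain O(N^2) DFT and
    the supplied `gains` are illustrative stand-ins."""
    n = len(frame)
    spec = [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]
    # a real gain scales the bin's magnitude and preserves its phase
    shaped = [g * c for g, c in zip(gains, spec)]
    out = [sum(shaped[k] * cmath.exp(2j * cmath.pi * k * t / n)
               for k in range(n)).real / n for t in range(n)]
    return out
```

With all gains equal to 1 the frame is reconstructed unchanged, and with all gains equal to 0 the extracted signal is silence, matching the intuition that the gains select how much of each bin belongs to the target audio type.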
6. The method for processing an audio signal according to claim 1, wherein filtering sub-audio signals of a target audio type corresponding to the target signal extraction network in the audio signal to be processed comprises:
Acquiring the duty ratio of sub-audio signals of the target audio type corresponding to each target signal extraction network in the audio signals to be processed;
sequentially filtering out the sub-audio signals of the target audio type corresponding to each target signal extraction network from the audio signal to be processed in descending order of the duty ratio, so as to obtain a filtered audio signal to be processed;
and determining the filtered audio signal to be processed as the audio signal corresponding to the other audio types.
7. The audio signal processing method according to any one of claims 1 to 6, wherein the plurality of audio types includes a target audio type and other audio types, the audio enhancement processing is performed on the sub-audio signal based on the audio type corresponding to the sub-audio signal, to obtain an enhanced sub-audio signal, including:
if the audio type corresponding to the sub-audio signal is the target audio type, performing audio enhancement processing on the sub-audio signal through a first enhancement strategy to obtain an enhanced sub-audio signal, wherein the first enhancement strategy comprises at least one of equalizer enhancement processing and harmonic enhancement processing;
and if the audio type corresponding to the sub-audio signal is other audio types, performing audio enhancement processing on the sub-audio signal through a second enhancement strategy to obtain an enhanced sub-audio signal, wherein the second enhancement strategy comprises noise reduction processing.
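The type-dependent dispatch of this claim can be illustrated as below; the concrete operations (a flat gain standing in for equalizer/harmonic enhancement, and a simple amplitude gate standing in for noise reduction) are placeholders, not the enhancement strategies of the claims.

```python
def enhance(sub_signal, audio_type):
    """Dispatch to a per-type enhancement strategy as in claim 7.
    The operations below are illustrative placeholders only."""
    if audio_type == "target":
        # first strategy: stand-in for equalizer/harmonic enhancement
        return [1.2 * x for x in sub_signal]
    # second strategy: a crude gate standing in for noise reduction
    return [x if abs(x) > 0.05 else 0.0 for x in sub_signal]
```

The point of the dispatch is that each sub-audio signal is processed with parameters suited to its audio type rather than with one global setting.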
8. The audio signal processing method according to claim 7, wherein the target audio type includes a plurality of sub-audio types, the audio enhancement processing is performed on the sub-audio signals by a first enhancement policy, comprising:
determining a target sub-audio type corresponding to the sub-audio signal from the plurality of sub-audio types;
determining target enhancement parameters corresponding to the first enhancement strategy according to the target sub-audio type;
and carrying out audio enhancement processing on the sub-audio signals based on the target enhancement parameters and the first enhancement strategy.
9. The audio signal processing method according to any one of claims 1 to 6, wherein the performing signal reconstruction based on the enhanced sub-audio signals to obtain a target audio signal includes:
and performing linear superposition processing on the enhanced sub-audio signals to obtain the target audio signal.
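The linear superposition of this claim is simply a sample-wise sum of the enhanced sub-audio signals, for example (the list-of-lists representation is an assumption):

```python
def reconstruct(sub_signals):
    """Claim-9-style reconstruction: the target signal is the
    sample-wise linear superposition (sum) of the enhanced
    sub-audio signals, which are assumed to be equal-length lists."""
    return [sum(samples) for samples in zip(*sub_signals)]
```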
10. An audio signal processing apparatus, comprising:
the signal acquisition unit is used for acquiring an audio signal to be processed;
a signal extraction unit, configured to acquire at least one signal extraction network, each signal extraction network in the at least one signal extraction network being configured to extract an audio signal of a target audio type; and perform signal extraction on the audio signal to be processed through the at least one signal extraction network to obtain sub-audio signals corresponding to a plurality of audio types; the signal extraction unit is further configured to screen a target signal extraction network from the at least one signal extraction network; perform signal extraction on the audio signal to be processed through the target signal extraction network; if a signal is extracted from the audio signal to be processed, determine the extracted signal as a sub-audio signal of the target audio type corresponding to the target signal extraction network, and take the audio signal to be processed after extraction as a new audio signal to be processed; screen a new target signal extraction network from the at least one signal extraction network; and based on the new target signal extraction network and the new audio signal to be processed, return to execute the step of performing signal extraction on the audio signal to be processed through the target signal extraction network, until no signal extraction network in the at least one signal extraction network can extract a signal from the new audio signal to be processed, and determine the new audio signal to be processed as a sub-audio signal corresponding to other audio types; or, respectively perform signal extraction on the audio signal to be processed through the at least one signal extraction network; take a signal extraction network that has extracted a signal as a target signal extraction network, and take the signal extracted by the target signal extraction network as a sub-audio signal of the target audio type corresponding to the target signal extraction network; filter out the sub-audio signals of the target audio type corresponding to the target signal extraction network from the audio signal to be processed, and take the filtered audio signal to be processed as the audio signal corresponding to the other audio types;
an enhancement unit, configured to perform, for each sub-audio signal, audio enhancement processing on the sub-audio signal based on the audio type corresponding to the sub-audio signal, to obtain an enhanced sub-audio signal;
and a reconstruction unit, configured to perform signal reconstruction based on the enhanced sub-audio signals to obtain a target audio signal.
11. An electronic device comprising a processor and a memory, the memory storing a plurality of instructions; the processor loads instructions from the memory to perform the steps of the audio signal processing method according to any one of claims 1 to 9.
12. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor for performing the steps of the audio signal processing method according to any one of claims 1 to 9.
CN202311397646.3A 2023-10-26 2023-10-26 Audio signal processing method, device, electronic equipment and storage medium Active CN117153178B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311397646.3A CN117153178B (en) 2023-10-26 2023-10-26 Audio signal processing method, device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN117153178A CN117153178A (en) 2023-12-01
CN117153178B (en) 2024-01-30

Family

ID=88902950


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110491406A (en) * 2019-09-25 2019-11-22 电子科技大学 A kind of multimode inhibits double noise speech Enhancement Methods of variety classes noise
CN111583958A (en) * 2020-05-19 2020-08-25 北京达佳互联信息技术有限公司 Audio signal processing method, audio signal processing device, electronic equipment and storage medium
WO2020248485A1 (en) * 2019-06-13 2020-12-17 平安科技(深圳)有限公司 Monophonic separation method and apparatus of audio, computer device and storage medium
CN113647119A (en) * 2019-01-25 2021-11-12 索诺瓦有限公司 Signal processing apparatus, system and method for processing audio signals
CN114299976A (en) * 2022-03-06 2022-04-08 荣耀终端有限公司 Audio data processing method and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPWO2021200260A1 (en) * 2020-04-01 2021-10-07




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant