CN113113040B - Audio processing method and device, terminal and storage medium

Audio processing method and device, terminal and storage medium

Info

Publication number
CN113113040B
CN113113040B (application CN202110303110.5A)
Authority
CN
China
Prior art keywords
data
mixed audio
training
preset component
audio data
Prior art date
Legal status
Active
Application number
CN202110303110.5A
Other languages
Chinese (zh)
Other versions
CN113113040A (en)
Inventor
王昭
Current Assignee
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202110303110.5A priority Critical patent/CN113113040B/en
Publication of CN113113040A publication Critical patent/CN113113040A/en
Application granted granted Critical
Publication of CN113113040B publication Critical patent/CN113113040B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/0272 Voice signal separating (under G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility; G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique, using neural networks (under G10L25/27)
    • G10L25/81 Detection of presence or absence of voice signals for discriminating voice from music (under G10L25/78)

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The disclosure relates to an audio processing method and device, a terminal and a storage medium. The method comprises the following steps: acquiring mixed audio data, wherein the mixed audio data comprises audio data with multiple components; processing the mixed audio data to obtain time-frequency spectrum characteristic data of the mixed audio data; inputting the time-frequency spectrum characteristic data into a trained neural network model for separation, and determining audio data corresponding to a preset component label in the mixed audio data; wherein the neural network model comprises an encoder-decoder structure. Based on a deep learning scheme, the method can effectively separate independent components in waveform music, is suitable for most audio, imposes no restriction on music style, and therefore has strong extensibility and generality.

Description

Audio processing method and device, terminal and storage medium
Technical Field
The disclosure relates to the field of electronic technology, and in particular, to an audio processing method and device, a terminal and a storage medium.
Background
Audio separation is an audio processing technique that extracts signals of specified kinds from a single mixed audio signal. Traditional music separation software, commonly referred to as "silencing" (vocal removal) software, generally relies on band-stop filtering. In addition, the related art includes methods that separate the human voice and the accompaniment by spectral subtraction. However, because the frequency ranges of some instruments overlap with that of the human voice, and the pitch of the voice or of the instruments is not fixed, vocal-removal techniques based on band-stop filtering make the instruments in the same frequency range disappear together with the voice. Spectral subtraction, in turn, introduces musical noise and therefore greatly degrades the listening experience.
Disclosure of Invention
To overcome the problems in the related art to some extent, the present disclosure provides an audio processing method and apparatus, a terminal, and a storage medium.
According to a first aspect of embodiments of the present disclosure, there is provided an audio processing method, including:
acquiring mixed audio data, wherein the mixed audio data comprises audio data with multiple components;
processing the mixed audio data to obtain time-frequency spectrum characteristic data of the mixed audio data;
inputting the time-frequency spectrum characteristic data into a trained neural network model for separation, and determining audio data corresponding to a preset component label in the mixed audio data;
wherein the neural network model comprises an encoder-decoder structure.
In some embodiments, the processing the mixed audio data to determine time-spectral feature data of the mixed audio data includes:
performing truncation processing on the mixed audio data to obtain target audio data with preset length;
and preprocessing and normalizing the target audio data to obtain the processed time-frequency spectrum characteristic data of the mixed audio data.
In some embodiments, the inputting the time-frequency spectrum characteristic data into a trained neural network model for separation to obtain audio data corresponding to a preset component tag in the mixed audio data includes:
Determining feature extraction data corresponding to each preset component label in the time spectrum feature data based on a pre-trained extraction model associated with the preset component label;
and determining the audio data corresponding to the preset component tag in the mixed audio data based on the feature extraction data.
In some embodiments, the extraction model associated with the pre-set component tags is trained by:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component tag and mixed audio sample training data corresponding to the preset component tag;
and respectively inputting each preset component label and the time-frequency spectrum characteristics of the mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the extraction model associated with the pre-set component tags is trained by:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component tag and mixed audio sample training data corresponding to the preset component tag;
changing the vocal tract movement rate of the mixed audio training data while keeping the fundamental frequency of the mixed audio training data unchanged, so as to obtain processed target mixed audio sample training data;
and respectively inputting each preset component label and the time-frequency spectrum characteristics of the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the extraction model associated with the pre-set component tags is trained by:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component tag and mixed audio sample training data corresponding to the preset component tag;
changing the fundamental frequency of the mixed audio training data while keeping the envelope of the mixed audio training data unchanged, so as to obtain processed target mixed audio sample training data;
and respectively inputting each preset component label and the time-frequency spectrum characteristics of the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the neural network model is a fully convolutional UNET neural network model built using the TensorFlow deep learning framework.
In some embodiments, the neural network model is trained by:
inputting the time spectrum data of the mixed audio training set into a pre-trained separation model, and outputting estimated target time spectrum data;
updating model parameters of the separation model according to the error between the time-frequency spectrum data of the mixed audio training set and the target time-frequency spectrum data;
repeating the training step until the loss function of the separation model converges;
determining target model parameters that minimize the error, and determining the trained neural network model according to the target model parameters.
According to a second aspect of embodiments of the present disclosure, there is provided an audio processing apparatus comprising:
an acquisition module, configured to acquire mixed audio data, wherein the mixed audio data comprises audio data with multiple components;
the processing module is used for processing the mixed audio data to obtain time-frequency spectrum characteristic data of the mixed audio data;
the separation module is used for inputting the time-frequency spectrum characteristic data into a trained neural network model for separation and determining audio data corresponding to a preset component label in the mixed audio data;
Wherein the neural network model comprises an encoder-decoder structure.
In some embodiments, the processing module is specifically configured to:
performing truncation processing on the mixed audio data to obtain target audio data with preset length;
preprocessing and normalizing the target audio data to obtain the processed mixed time spectrum data;
and performing alignment operation on the processed mixed time spectrum data to generate time spectrum characteristic data of the mixed audio data.
In some embodiments, the separation module is specifically configured to:
determining feature extraction data corresponding to each preset component label in the time spectrum feature data based on a pre-trained extraction model associated with the preset component label;
and determining the audio data corresponding to the preset component tag in the mixed audio data based on the feature extraction data.
In some embodiments, the apparatus further comprises a first training module, the first training module being specifically configured to:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component tag and mixed audio sample training data corresponding to the preset component tag;
And respectively inputting each preset component label and the time-frequency spectrum characteristics of the mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the apparatus further comprises a first training module, the first training module being specifically configured to:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component tag and mixed audio sample training data corresponding to the preset component tag;
changing the vocal tract movement rate of the mixed audio training data while keeping the fundamental frequency of the mixed audio training data unchanged, so as to obtain processed target mixed audio sample training data;
and respectively inputting each preset component label and the time-frequency spectrum characteristics of the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the apparatus further comprises a first training module, the first training module being specifically configured to:
Determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component tag and mixed audio sample training data corresponding to the preset component tag;
changing the fundamental frequency of the mixed audio training data while keeping the envelope of the mixed audio training data unchanged, so as to obtain processed target mixed audio sample training data;
and respectively inputting each preset component label and the time-frequency spectrum characteristics of the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the neural network model is a fully convolutional UNET neural network model built using the TensorFlow deep learning framework.
In some embodiments, the apparatus further comprises a second training module, the second training module being specifically configured to:
inputting the time spectrum data of the mixed audio training set into a pre-trained separation model, and outputting estimated target time spectrum data;
updating model parameters of the separation model according to the error between the time-frequency spectrum data of the mixed audio training set and the target time-frequency spectrum data;
Repeating the training step until the loss function of the separation model converges;
determining target model parameters that minimize the error, and determining the trained neural network model according to the target model parameters.
According to a third aspect of embodiments of the present disclosure, there is provided a terminal comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the audio processing method as described in the first aspect above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium comprising:
the instructions in the storage medium, when executed by a processor of the terminal, enable the terminal to perform the audio processing method as described in the first aspect above.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
in the embodiment of the disclosure, the obtained mixed audio data is processed, the time-frequency spectrum feature data of the processed mixed audio data is input into a trained neural network model for separation, and the audio data corresponding to the preset component tag in the mixed audio data is determined. This audio processing method, based on deep learning and a neural network model, can effectively separate independent components in waveform music, is suitable for most audio, and imposes no restriction on music style, and therefore has strong extensibility and generality.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart illustrating an audio processing method according to an exemplary embodiment of the present disclosure.
Fig. 2 is a flowchart illustrating another audio processing method according to an exemplary embodiment of the present disclosure.
Fig. 3 is a structural example diagram showing a neural network structure according to an exemplary embodiment of the present disclosure.
Fig. 4 is a waveform diagram illustrating a voice signal according to an exemplary embodiment of the present disclosure.
Fig. 5 is a waveform schematic diagram of another speech signal shown according to an exemplary embodiment of the present disclosure.
Fig. 6 is a waveform schematic diagram of yet another speech signal shown according to an exemplary embodiment of the present disclosure.
Fig. 7 is a diagram of an audio processing apparatus according to an exemplary embodiment of the present disclosure.
Fig. 8 is a block diagram of a terminal according to an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
Audio separation is an audio processing technique that extracts signals of specified kinds from a single mixed audio signal. Traditional music separation software, commonly referred to as "silencing" (vocal removal) software, generally relies on band-stop filtering. In addition, the related art includes methods that separate the human voice and the accompaniment by spectral subtraction. However, because the frequency ranges of some instruments overlap with that of the human voice, and the pitch of the voice or of the instruments is not fixed, vocal-removal techniques based on band-stop filtering make the instruments in the same frequency range disappear together with the voice. Spectral subtraction, in turn, introduces musical noise and thus greatly affects the hearing experience of the user.
To overcome the problems in the related art to some extent, the present disclosure provides an audio processing method and apparatus, a terminal, and a storage medium.
Fig. 1 is a flowchart of an audio processing method according to an embodiment of the present disclosure, and as shown in fig. 1, the audio processing method includes the following steps:
s101, acquiring mixed audio data, wherein the mixed audio data comprises audio data with multiple components;
s102, processing the mixed audio data to obtain time-frequency spectrum characteristic data of the mixed audio data;
s103, inputting the time-frequency spectrum characteristic data into a trained neural network model for separation, and determining audio data corresponding to a preset component label in the mixed audio data.
In an embodiment of the present disclosure, the audio processing method may be applied to a terminal device, and the terminal device may include mobile devices and stationary devices. The mobile devices include mobile phones, tablet computers, wearable devices, smart speakers, smart microphones, and the like. The stationary devices include, but are not limited to, personal computers (PCs), smart televisions, and the like.
In step S101, mixed audio data is first acquired; there may be more than one piece of mixed audio data, and each piece is audio data that includes multiple components. For example, popular music typically includes a human voice and an accompaniment and can be understood as one kind of mixed audio data; as another example, rock music generally includes a human voice, percussion, bass, strings, and other components and may likewise be understood as mixed audio data. The mixed audio data in the embodiments of the present disclosure may be used to characterize waveform music.
In step S102, processing the mixed audio data may include performing operations such as time-domain segmentation, framing, windowing, Fourier transform, normalization, and alignment on the mixed audio data to obtain the time-frequency spectrum feature data of the mixed audio data.
In step S103, the neural network may be a fully convolutional UNET neural network built using the TensorFlow deep learning framework. The neural-network-based separator may adopt an encoder-decoder structure. The mixed time-frequency spectrum data is input into the separator for separation, and the audio data corresponding to the preset component tag in the mixed audio data is determined by supervised learning, i.e. by minimizing the error between the output of the separation network and the data of the preset component tag, thereby separating the sound of the specific component in the mixed audio data.
According to the audio processing method described above, the obtained mixed audio data is processed, the time-frequency spectrum feature data of the processed mixed audio data is input into a trained neural network model for separation, and the audio data corresponding to the preset component tags in the mixed audio data is determined. This audio processing method, based on deep learning and a neural network model, can effectively separate independent components in waveform music, is suitable for most audio, and imposes no restriction on music style, and therefore has strong extensibility and generality.
In some embodiments, the processing the mixed audio data to determine time-spectral feature data of the mixed audio data includes:
performing truncation processing on the mixed audio data to obtain target audio data with preset length;
and preprocessing and normalizing the target audio data to obtain the processed time-frequency spectrum characteristic data of the mixed audio data.
Because the duration of the obtained mixed audio data is not fixed, the obtained mixed audio data can be uniformly truncated in the time domain during processing to obtain target audio data with a preset length. The preset length may be set as desired, which is not limited by the present disclosure. For example, the preset length may be set to 20 s, that is, all acquired mixed audio data is truncated to 20 s in the time domain before the subsequent preprocessing. In this way, the processing time of the subsequent separation model does not become excessively long, while audio data of the preset length is still sufficient to provide all the required input features.
The truncated target audio data is then preprocessed, which may include operations such as time-domain segmentation, framing, windowing, and Fourier transform. For example, about 90 ms of data may be taken as one frame, a Hann window may be used as the windowing function, and the sampling rate is set to 44100 Hz, so that 4096 sampling points are taken for the Fourier transform; frames with fewer than 4096 points are zero-padded.
The Fourier transform converts the time-domain signal into a frequency-domain signal, after which a normalization operation is performed. Owing to the symmetry of the Fourier transform, only the first 2048 points need to be used for spectral analysis. Since the hearing range of the human ear is between 20 Hz and 20000 Hz, to reduce the amount of computation the embodiments of the present disclosure take the first 1024 frequency points to cover the hearing range of the human ear.
In some possible embodiments, to ensure the accuracy of the input signal, the signal may be further aligned after the Fourier transform and then normalized.
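As an illustration of the preprocessing steps described above, the following is a minimal sketch in NumPy. The 20 s truncation, Hann window, 44100 Hz sampling rate, 4096-point FFT with zero padding, and retention of the first 1024 frequency bins follow the example values given above; the hop length, the peak-normalization scheme, and all function and variable names are illustrative assumptions rather than details fixed by the disclosure.

```python
import numpy as np

def mixed_audio_to_spectrogram(audio, sr=44100, n_fft=4096, hop=1024,
                               n_bins=1024, max_seconds=20):
    """Truncate, frame, window, transform and normalize a mixed audio signal.

    Sketch of the preprocessing in the disclosure: truncate to a preset
    length, split into roughly 90 ms frames, apply a Hann window, take a
    4096-point FFT (zero-padding short frames) and keep the first 1024
    frequency bins.  The hop length and the normalization scheme here are
    assumptions, not values fixed by the disclosure.
    """
    audio = audio[: sr * max_seconds]                 # truncate to the preset length
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(audio), hop):
        frame = audio[start:start + n_fft]
        if len(frame) < n_fft:                        # zero-pad the last frame
            frame = np.pad(frame, (0, n_fft - len(frame)))
        spectrum = np.fft.rfft(frame * window)        # symmetric spectrum, rfft suffices
        frames.append(np.abs(spectrum[:n_bins]))      # keep the first 1024 bins
    spec = np.stack(frames)                           # shape: (frames, 1024)
    return spec / (spec.max() + 1e-8)                 # simple peak normalization
```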
In some embodiments, the inputting the time-frequency spectrum characteristic data into a trained neural network model for separation to obtain audio data corresponding to a preset component tag in the mixed audio data includes:
determining feature extraction data corresponding to each preset component label in the time spectrum feature data based on a pre-trained extraction model associated with the preset component label;
and determining the audio data corresponding to the preset component tag in the mixed audio data based on the feature extraction data.
In some embodiments, the extraction model associated with the pre-set component tags is trained by:
Determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component tag and mixed audio sample training data corresponding to the preset component tag;
and respectively inputting each preset component label and the time-frequency spectrum characteristics of the mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the extraction model associated with the pre-set component tags is trained by:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component tag and mixed audio sample training data corresponding to the preset component tag;
changing the vocal tract movement rate of the mixed audio training data while keeping the fundamental frequency of the mixed audio training data unchanged, so as to obtain processed target mixed audio sample training data;
and respectively inputting each preset component label and the time-frequency spectrum characteristics of the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the extraction model associated with the pre-set component tags is trained by:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component tag and mixed audio sample training data corresponding to the preset component tag;
changing the fundamental frequency of the mixed audio training data while keeping the envelope of the mixed audio training data unchanged, so as to obtain processed target mixed audio sample training data;
and respectively inputting each preset component label and the time-frequency spectrum characteristics of the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In the above embodiment, when the processed time-spectrum feature data is input into the trained neural network model for separation, since the extraction model associated with each preset component tag is trained in advance, during separation processing, feature extraction data corresponding to each preset component tag in the time-spectrum feature data may be determined based on the extraction model associated with each preset component tag that is trained in advance, and then audio data corresponding to the preset component tag in the mixed audio data may be determined based on the feature extraction data.
The separation model is trained by feeding, one by one, the mixed audio data corresponding to each preset component tag and the time-frequency spectrum samples corresponding to that mixed audio data through the neural network; this yields the trained neural network model, a data model for each preset component tag, and an extraction model for each preset component tag. When the trained separation model is then used, the deep learning approach adopted for training and inference can separate the independent components in the mixed audio data, and the purity of each separated component is preserved so that data mixed in from other components does not degrade the separation result.
In order to ensure the training effect of the separation model, in the training process, the mixed audio sample training set needs to be determined when mixed audio sample data is acquired, and the mixed audio sample training set comprises preset component labels and mixed audio sample training data corresponding to the preset component labels. For mixed audio sample data, since the mixed audio sample data contains audio of multiple components, a mixed audio sample training set is required to be constructed in advance according to the mixed audio sample data.
The tag type of the mixed audio sample data, and the preset component tags it contains, can be determined by machine learning or by manual labeling.
For example, a tag type matching the mixed audio data is first determined. There may be a variety of tag types. For example, the tag types may include a first type of tag, which may include human voice, percussion, bass, strings, and others; the tag types may also include a second type of tag, which may be human voice and accompaniment. Different tag types correspond to different music styles.
The mixed audio sample training data is then mapped to the preset component tags included in the tag type, and a corresponding first audio data set of the mixed audio sample is determined. Taking the second type of tag as an example, it includes two preset component tags: human voice and accompaniment. When the mixed audio sample training data is determined to match the second type of tag, the data corresponding to each preset component tag in the mixed audio sample training data needs to be determined in order to construct the first audio data set of the mixed audio sample. For example, first vocal audio data corresponding to the vocal tag is determined, first accompaniment audio data corresponding to the accompaniment tag is determined, and a first mixed audio sample data set is determined from the first vocal audio data and the first accompaniment audio data.
In this way, the tag type matching the mixed audio sample training data is determined, the mixed audio sample training data is mapped to the preset component tags included in that tag type, and a corresponding mixed audio sample data set is determined, which lays the foundation for subsequent separation processing. This also simplifies the subsequent training step: each preset component tag, together with the time-frequency spectrum features of the mixed audio sample training data corresponding to that tag, is input into the neural network model for training, so as to obtain the extraction model associated with each preset component tag.
To obtain a richer data set, substantially improve the performance of the model during subsequent training, and enhance the accuracy with which the separation model extracts the target waveform music, data augmentation can be performed on the current data set. The augmentation includes any one or more of the following operations: speed change (time stretching), pitch change, envelope modification, adding reverberation and noise, and channel swapping.
When no data augmentation is performed, after the mixed audio sample training set (comprising the preset component tags and the corresponding mixed audio sample training data) is determined, the dimensions of the time-frequency spectrum of the mixed audio sample training data input into the separation network generally comprise batch size, frame sequence, frequency sequence, and number of audio channels. The frame sequence and the frequency sequence are obtained by applying the short-time Fourier transform to the time-domain waveform.
Taking speed change as an example of data augmentation: after the mixed audio sample training set is determined, the training data can be stretched or compressed along the time dimension of the spectrogram, so that the fundamental frequency remains almost unchanged and the pitch is therefore preserved. However, because the entire time course is compressed or expanded, the number of glottal cycles decreases or increases, the vocal tract movement rate changes, and the speech rate changes accordingly.
The speed-change operation thus enriches the training data when the available data is limited. The time-frequency spectrum features after the speed change are input into the separation network to train a separation mask, and each preset component tag, together with the time-frequency spectrum features of the target mixed audio sample training data corresponding to that tag, is input into the neural network model for training, so as to obtain the extraction model associated with each preset component tag.
Taking pitch change as an example of data augmentation: a voice pitch-shift operation changes the fundamental frequency of the speaker while keeping the speech rate and semantics unchanged, i.e. the short-time spectral envelope (the positions and bandwidths of the formants) and the time course remain essentially unchanged. After the mixed audio sample training set is determined, the fundamental frequency of the mixed audio training data is changed while its envelope is kept unchanged, so as to obtain the processed target mixed audio sample training data. During training, two-channel data can be used as input; because a two-channel corpus contains spatial information, it greatly enriches the input features. The pitch-shifted time-frequency spectrum features are input into the separation network to train a separation mask, and each preset component tag, together with the time-frequency spectrum features of the target mixed audio sample training data corresponding to that tag, is input into the neural network model for training, so as to obtain the extraction model associated with each preset component tag. A sketch of both augmentation operations is given below.
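The two augmentation operations described above (speed change with pitch preserved, and pitch change with envelope preserved) can be sketched as follows. This assumes the librosa library, which the disclosure does not name, and the stretch rate and pitch step are illustrative assumptions.

```python
import librosa

def augment(waveform, sr=44100, rate=1.1, n_steps=2):
    """Return speed-changed and pitch-shifted copies of a training waveform.

    time_stretch changes the time course (vocal tract movement rate and
    speech rate) while the fundamental frequency stays essentially unchanged;
    pitch_shift changes the fundamental frequency while the spectral envelope
    and time course stay essentially unchanged.  The rate and n_steps values
    here are illustrative assumptions.
    """
    stretched = librosa.effects.time_stretch(waveform, rate=rate)            # speed change
    shifted = librosa.effects.pitch_shift(waveform, sr=sr, n_steps=n_steps)  # pitch change
    return stretched, shifted
```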
Fig. 2 is a flowchart illustrating another audio processing method according to an exemplary embodiment of the present disclosure.
As shown in fig. 2, in some embodiments, the neural network model is trained by:
step S201, inputting the time-frequency spectrum data of the mixed audio training set into a pre-trained separation model, and outputting estimated target time-frequency spectrum data;
step S202, updating model parameters of the separation model according to the error between the time-frequency spectrum data of the mixed audio training set and the target time-frequency spectrum data;
repeating the training step until the loss function of the separation model converges;
and step S203, determining target model parameters that minimize the error, and determining the trained neural network model according to the target model parameters.
In the above training process, taking a UNET structure as an example, when the mixed audio training set and its corresponding tags and tag data are used to train the separation network, the time-frequency spectrum data of the mixed audio training set obtained after the preprocessing and normalization operations can be input into the separation model; at the same time, the tag type used to compute the loss, and the specific preset tags included in that tag type, need to be defined. For example, the tag type may be set to the first type of tag, in which case the preset tags may include bass, percussion, vocals, and others, and the tag of the loss function may be set to label(other).
The mixed time-frequency spectrum data x is input into the separation model, and the separation model outputs estimated time-frequency spectrum data x̂. The error between x and x̂ is calculated, and the model parameters of the separation model are updated according to this error, which can be understood as finding the model parameters that minimize the error. The training step is repeated until the loss function of the separation model converges. The target model parameters that minimize the error are determined, and the trained neural network model is determined according to these target model parameters.
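A minimal sketch of this training loop, assuming the TensorFlow/Keras API. The disclosure does not specify the loss function, optimizer, or data pipeline, so the mean-squared-error loss, the Adam optimizer, and the assumption that the dataset yields pairs of mixed spectra and target-component spectra for one preset component tag are all illustrative.

```python
import tensorflow as tf

def train_separator(model, dataset, epochs=50, lr=1e-4):
    """Train the separation model on (mixed spectrum, target spectrum) pairs.

    `model` is the encoder-decoder separator and `dataset` is assumed to yield
    batches of mixed time-frequency spectra x and target spectra y for one
    preset component tag; loss, optimizer and epoch count are assumptions.
    """
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    loss_fn = tf.keras.losses.MeanSquaredError()
    for epoch in range(epochs):
        for x, y in dataset:
            with tf.GradientTape() as tape:
                y_hat = model(x, training=True)      # estimated target spectrum
                loss = loss_fn(y, y_hat)             # error used to update the parameters
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
        # training is repeated until the loss converges; a fixed epoch count
        # stands in for an explicit convergence check in this sketch
```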
In some embodiments, when the trained separation model is used, the acquired time-frequency spectrum feature data of the mixed audio data is input into the trained model for separation, which yields the time-frequency spectrum data with the maximum probability of corresponding to the preset component tag. An inverse Fourier transform is then applied to this time-frequency spectrum data to obtain an estimated time-domain waveform, from which the waveform music data corresponding to the preset component tag in the mixed audio data is obtained.
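A minimal sketch of this inference path, assuming NumPy, a single-channel magnitude spectrogram of shape (frames, frequency bins), and a Keras-style model. Reusing the phase of the mixture for the inverse transform is an assumption; the disclosure only states that an inverse Fourier transform of the estimated time-frequency spectrum yields the estimated time-domain waveform.

```python
import numpy as np

def separate_component(model, mixed_spec, mixed_phase, n_fft=4096, hop=1024):
    """Estimate one component's waveform from the mixed spectrogram.

    `model` is the trained extraction model for one preset component tag;
    its expected input shape (here, a batch of (frames, bins) magnitude
    spectra) and the reuse of the mixture phase are assumptions.
    """
    est_mag = model.predict(mixed_spec[np.newaxis, ...])[0]   # estimated magnitude spectrum
    est = est_mag * np.exp(1j * mixed_phase)                  # reattach the mixture phase
    frames = np.fft.irfft(est, n=n_fft, axis=-1)              # back to the time domain
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for i, frame in enumerate(frames):                        # overlap-add the frames
        out[i * hop:i * hop + n_fft] += frame
    return out
```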
In some possible embodiments, the neural network model is a fully convolutional UNET neural network model built using the TensorFlow deep learning framework.
Fig. 3 is a structural example diagram showing a neural network structure according to an exemplary embodiment of the present disclosure. As shown in fig. 3, the separator adopts an encoder-decoder structure: the encoder is responsible for feature extraction, and the decoder is responsible for restoring the original resolution. The UNET structure fuses features by cropping and concatenation instead of the conventional residual connection with point-wise addition, which effectively prevents vanishing gradients as the number of network layers grows.
In the separation process, abstract features are recovered to form the separation mask, and upsampling together with skip connections plays a significant role here. In the encoder section, the convolutional layers downsample the feature-map resolution to a very small size, which would hinder formation of the separation mask; the embodiments of the present disclosure therefore use skip connections to introduce features from shallower convolutional layers when forming the final separation mask. This effectively fuses deep and shallow sub-features and helps ensure the purity of the separated mixed audio data. The scheme is based on the UNET fully convolutional neural network: the input of the network is the magnitude signal after short-time Fourier transform and normalization, the output is the estimated magnitude spectrum of the voice signal, and the output signal is guaranteed to be consistent in shape with the input signal. By training on multiple mixed audio training samples and their corresponding time-frequency spectrum samples with this fully convolutional deep neural network and the TensorFlow deep learning framework, the accuracy of extracting the original individual component features is improved, so that the separation model extracts the target waveform music more accurately.
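A minimal sketch of such an encoder-decoder UNET separator with skip connections, using the TensorFlow/Keras functional API. The depth, filter counts, input shape, and the sigmoid mask applied to the mixture are illustrative assumptions; only the overall encoder-decoder layout with concatenation-based skip connections follows the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_unet(input_shape=(None, 1024, 2), depth=4, base_filters=16):
    """Fully convolutional UNET separator: encoder, decoder and skip connections.

    Input: normalized magnitude spectrogram of the mixture
    (frames x frequency bins x channels); output: estimated magnitude
    spectrogram of the target component with the same shape.  The time
    dimension must be divisible by 2**depth at run time.
    """
    inputs = tf.keras.Input(shape=input_shape)
    x, skips = inputs, []
    for d in range(depth):                                   # encoder: feature extraction
        x = layers.Conv2D(base_filters * 2 ** d, 3, padding="same", activation="relu")(x)
        skips.append(x)
        x = layers.Conv2D(base_filters * 2 ** d, 3, strides=2, padding="same",
                          activation="relu")(x)              # downsample
    for d in reversed(range(depth)):                         # decoder: restore resolution
        x = layers.Conv2DTranspose(base_filters * 2 ** d, 3, strides=2, padding="same",
                                   activation="relu")(x)     # upsample
        x = layers.Concatenate()([x, skips[d]])              # skip connection by concatenation
        x = layers.Conv2D(base_filters * 2 ** d, 3, padding="same", activation="relu")(x)
    mask = layers.Conv2D(input_shape[-1], 3, padding="same", activation="sigmoid")(x)
    outputs = layers.Multiply()([inputs, mask])              # apply the separation mask
    return tf.keras.Model(inputs, outputs)
```

With "same" padding the encoder and decoder feature maps already line up, so plain concatenation stands in for the crop-and-concatenate fusion described above.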
In some embodiments, the neural network model may also be a CRN deep neural network built using the TensorFlow deep learning framework. Because the bidirectional long short-term memory network (BLSTM) has good long-range feature-capturing capability, it is better suited than other types of neural networks to training on and evaluating audio signals of arbitrary length; the embodiments of the present disclosure may therefore also adopt a CRN deep-neural-network separation-network architecture built using the TensorFlow deep learning framework.
For this model architecture, during training the time-frequency spectrum of the mixed audio training sample data is input first; the dimensions of the mixed time-frequency spectrum are batch size, frame sequence, frequency sequence, and number of audio channels, where the frame sequence and frequency sequence are obtained beforehand by applying the short-time Fourier transform to the time-domain waveform. For each discrete frequency in each frame, the input data is normalized using the global mean and standard deviation, which reduces redundant information, speeds up model convergence, and shortens training time.
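A minimal sketch of this per-frequency normalization, assuming NumPy and an input tensor of shape (samples, frames, frequency bins, channels); the shape, variable names, and the small epsilon are assumptions.

```python
import numpy as np

def normalize_per_frequency(spectra):
    """Normalize spectrogram features bin by bin with a global mean and std.

    `spectra` is assumed to have shape (samples, frames, freq_bins, channels);
    the mean and standard deviation are computed over all samples and frames
    for each discrete frequency and then applied to every frame.
    """
    mean = spectra.mean(axis=(0, 1), keepdims=True)   # global mean per frequency bin
    std = spectra.std(axis=(0, 1), keepdims=True)     # global std per frequency bin
    return (spectra - mean) / (std + 1e-8)
```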
Figs. 4-6 show waveforms of speech signals. Fig. 4 is a schematic waveform diagram of the mixed audio signal before component separation, fig. 5 is a schematic waveform diagram of a first audio signal matching the human voice tag before component separation, and fig. 6 is a schematic waveform diagram of the audio signal matching the human voice tag after component separation. From the amplitude spectra of the speech signals shown in figs. 4-6, it can be seen that the output signals produced by the neural-network-based separator in this scheme are consistent in shape with the input signals, and that independent components in waveform music can be effectively separated.
Fig. 7 is a diagram of an audio processing apparatus according to an exemplary embodiment. Referring to fig. 7, in an alternative embodiment, the audio processing apparatus 100 includes an acquisition module 101, a processing module 102, and a separation module 103, where:
An acquisition module 101 for acquiring mixed audio data including audio data of a plurality of components;
a processing module 102, configured to process the mixed audio data to obtain time-frequency spectrum feature data of the mixed audio data;
the separation module 103 is configured to input the time-frequency spectrum feature data into a trained neural network model for separation, and determine audio data corresponding to a preset component tag in the mixed audio data;
wherein the neural network model comprises an encoder-decoder structure.
In some embodiments, the processing module 102 is specifically configured to:
performing truncation processing on the mixed audio data to obtain target audio data with preset length;
preprocessing and normalizing the target audio data to obtain the processed mixed time spectrum data;
and performing alignment operation on the processed mixed time spectrum data to generate time spectrum characteristic data of the mixed audio data.
In other embodiments, the separation module 103 is specifically configured to:
determining feature extraction data corresponding to each preset component label in the time spectrum feature data based on a pre-trained extraction model associated with the preset component label;
And determining the audio data corresponding to the preset component tag in the mixed audio data based on the feature extraction data.
In some embodiments, the apparatus 100 further comprises a first training module 104, where the first training module 104 is specifically configured to:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component tag and mixed audio sample training data corresponding to the preset component tag;
and respectively inputting each preset component label and the time-frequency spectrum characteristics of the mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the apparatus 100 further comprises a first training module 104, where the first training module 104 is specifically configured to:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component tag and mixed audio sample training data corresponding to the preset component tag;
changing the vocal tract movement rate of the mixed audio training data while keeping the fundamental frequency of the mixed audio training data unchanged, so as to obtain processed target mixed audio sample training data;
And respectively inputting each preset component label and the time-frequency spectrum characteristics of the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the apparatus 100 further comprises a first training module 104, where the first training module 104 is specifically configured to:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component tag and mixed audio sample training data corresponding to the preset component tag;
changing the fundamental frequency of the mixed audio training data while keeping the envelope of the mixed audio training data unchanged, so as to obtain processed target mixed audio sample training data;
and respectively inputting each preset component label and the time-frequency spectrum characteristics of the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the neural network model is a fully convolutional UNET neural network model built using the TensorFlow deep learning framework.
In some embodiments, the apparatus 100 further comprises a second training module 105, the second training module 105 being specifically configured to:
inputting the time spectrum data of the mixed audio training set into a pre-trained separation model, and outputting estimated target time spectrum data;
updating model parameters of the separation model according to the error between the time-frequency spectrum data of the mixed audio training set and the target time-frequency spectrum data;
repeating the training step until the loss function of the separation model converges;
determining target model parameters that minimize the error, and determining the trained neural network model according to the target model parameters.
The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in connection with the embodiments of the method and will not be described again here.
Fig. 8 is a block diagram of a terminal 800, according to an example embodiment. For example, the terminal 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, or the like.
Referring to fig. 8, a terminal 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the terminal 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the audio processing method described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the terminal 800. Examples of such data include instructions for any application or method operating on the terminal 800, contact data, phonebook data, messages, pictures, videos, and the like. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power component 806 provides power to the various components of the terminal 800. Power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for terminal 800.
The multimedia component 808 includes a screen between the terminal 800 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the terminal 800 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the terminal 800 is in an operation mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessment of various aspects of the terminal 800. For example, the sensor assembly 814 may detect an on/off state of the terminal 800, a relative positioning of the components, such as a display and keypad of the terminal 800, a change in position of the terminal 800 or a component of the terminal 800, the presence or absence of user contact with the terminal 800, an orientation or acceleration/deceleration of the terminal 800, and a change in temperature of the terminal 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communication between the terminal 800 and other devices, either wired or wireless. The terminal 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the terminal 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for performing the audio processing method described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as the memory 804 including instructions executable by the processor 820 of the terminal 800 to perform the audio processing method described above. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
In another exemplary embodiment, a computer program product is also provided, comprising a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned audio processing method when executed by the programmable apparatus.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of audio processing, the method comprising:
Acquiring mixed audio data, wherein the mixed audio data comprises audio data with multiple components;
processing the mixed audio data to obtain time-frequency spectrum characteristic data of the mixed audio data;
determining feature extraction data corresponding to each preset component label in the time-frequency spectrum feature data based on a pre-trained extraction model associated with the preset component label;
determining audio data corresponding to the preset component label in the mixed audio data based on the feature extraction data;
the extraction model associated with the preset component label is obtained through training by the following steps:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component label and mixed audio sample training data corresponding to the preset component label;
changing the playback speed of the mixed audio sample training data while keeping the fundamental frequency of the mixed audio sample training data unchanged, so as to obtain processed target mixed audio sample training data, or changing the fundamental frequency of the mixed audio sample training data while keeping the envelope of the mixed audio sample training data unchanged, so as to obtain processed target mixed audio sample training data;
respectively inputting each preset component label and the time-frequency spectrum features of the target mixed audio sample training data corresponding to the preset component label into a neural network model for training, so as to obtain the extraction model associated with each preset component label;
wherein the neural network model comprises an encoder-decoder structure.
2. The method of claim 1, wherein the processing the mixed audio data to obtain the time-frequency spectrum feature data of the mixed audio data comprises:
performing truncation processing on the mixed audio data to obtain target audio data with a preset length; and
preprocessing and normalizing the target audio data to obtain the processed time-frequency spectrum feature data of the mixed audio data.
3. The method of claim 1, wherein the neural network model is a fully convolutional U-Net neural network model built using the TensorFlow deep learning framework.
4. The method of claim 1, wherein the neural network model is trained by:
inputting the time-frequency spectrum data of the mixed audio training set into a pre-trained separation model, and outputting estimated target time-frequency spectrum data;
updating model parameters of the separation model according to the error between the time-frequency spectrum data of the mixed audio training set and the target time-frequency spectrum data;
repeating the training step until the loss function of the separation model converges; and
determining target model parameters that minimize the error, and determining the trained neural network model according to the target model parameters.
5. An audio processing apparatus, the apparatus comprising:
an acquisition module, configured to acquire mixed audio data, wherein the mixed audio data comprises audio data with multiple components;
a processing module, configured to process the mixed audio data to obtain time-frequency spectrum feature data of the mixed audio data; and
a separation module, configured to input the time-frequency spectrum feature data into a trained neural network model for separation and determine audio data corresponding to a preset component label in the mixed audio data;
wherein the separation module is specifically configured to:
determine feature extraction data corresponding to each preset component label in the time-frequency spectrum feature data based on a pre-trained extraction model associated with the preset component label; and
determine audio data corresponding to the preset component label in the mixed audio data based on the feature extraction data;
wherein the apparatus further comprises a first training module, the first training module being specifically configured to:
determine a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component label and mixed audio sample training data corresponding to the preset component label;
change the playback speed of the mixed audio sample training data while keeping the fundamental frequency of the mixed audio sample training data unchanged, so as to obtain processed target mixed audio sample training data, or change the fundamental frequency of the mixed audio sample training data while keeping the envelope of the mixed audio sample training data unchanged, so as to obtain processed target mixed audio sample training data;
respectively input each preset component label and the time-frequency spectrum features of the target mixed audio sample training data corresponding to the preset component label into a neural network model for training, so as to obtain the extraction model associated with each preset component label;
wherein the neural network model comprises an encoder-decoder structure.
6. The audio processing apparatus according to claim 5, wherein the processing module is specifically configured to:
perform truncation processing on the mixed audio data to obtain target audio data with a preset length;
preprocess and normalize the target audio data to obtain processed mixed time-frequency spectrum data; and
perform an alignment operation on the processed mixed time-frequency spectrum data to generate the time-frequency spectrum feature data of the mixed audio data.
7. The audio processing apparatus of claim 5, wherein the neural network model is a fully convolutional U-Net neural network model built using the TensorFlow deep learning framework.
8. The audio processing apparatus according to claim 5, further comprising a second training module, the second training module being specifically configured to:
input the time-frequency spectrum data of the mixed audio training set into a pre-trained separation model, and output estimated target time-frequency spectrum data;
update model parameters of the separation model according to the error between the time-frequency spectrum data of the mixed audio training set and the target time-frequency spectrum data;
repeat the training step until the loss function of the separation model converges; and
determine target model parameters that minimize the error, and determine the trained neural network model according to the target model parameters.
9. A terminal, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the audio processing method of any of claims 1 to 4.
10. A non-transitory computer readable storage medium having stored thereon instructions which, when executed by a processor of a terminal, cause the terminal to perform the audio processing method of any one of claims 1 to 4.
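Editorial illustration of claims 1 and 2: the feature-extraction step (truncating the mixture to a preset length, computing a time-frequency spectrum, and normalizing it) might be realized as in the minimal sketch below. It assumes librosa and NumPy; the sample rate, clip length, and STFT parameters are placeholders, not values taken from the patent.

import librosa
import numpy as np

def mixed_audio_to_features(path, sr=44100, clip_seconds=10.0, n_fft=2048, hop=512):
    # Load the mixed audio and truncate (or pad) it to a preset length.
    audio, _ = librosa.load(path, sr=sr, mono=True)
    audio = librosa.util.fix_length(audio, size=int(sr * clip_seconds))
    # Short-time Fourier transform -> magnitude time-frequency spectrum.
    spec = np.abs(librosa.stft(audio, n_fft=n_fft, hop_length=hop))
    # Per-clip normalization so inputs reach the network on a comparable scale.
    return spec / (spec.max() + 1e-8)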
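The two augmentations recited in claim 1 (changing playback speed while keeping the fundamental frequency unchanged, and changing the fundamental frequency while keeping the envelope unchanged) could be approximated as follows. librosa's phase-vocoder time_stretch changes tempo while leaving pitch essentially unchanged; a strictly envelope-preserving pitch shift would normally require a vocoder such as WORLD or PSOLA, so librosa.effects.pitch_shift is used here only as a rough stand-in, not as the patented operation.

import librosa

def augment_speed(audio, rate=1.2):
    # Tempo change; the fundamental frequency is left essentially unchanged.
    return librosa.effects.time_stretch(audio, rate=rate)

def augment_pitch(audio, sr=44100, n_steps=2):
    # Fundamental-frequency change; this simple approach also moves the
    # spectral envelope, so it only approximates the claimed variant.
    return librosa.effects.pitch_shift(audio, sr=sr, n_steps=n_steps)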
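Claims 1 and 3 name a fully convolutional U-Net style encoder-decoder built with TensorFlow. A deliberately small sketch of such a network, producing a soft mask over the mixture spectrogram, is shown below; the layer counts, filter sizes, and sigmoid-mask output are assumptions rather than the patented architecture. Under the claims' formulation, one such model would be trained per preset component label, each yielding that component's feature-extraction data.

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_unet(freq_bins=512, frames=128):
    # Input spectrogram dimensions must be divisible by 4 for the two
    # stride-2 stages below (crop or pad the STFT accordingly).
    inp = layers.Input(shape=(freq_bins, frames, 1))

    # Encoder: strided convolutions halve the time-frequency resolution.
    e1 = layers.Conv2D(16, 5, strides=2, padding="same", activation="relu")(inp)
    e2 = layers.Conv2D(32, 5, strides=2, padding="same", activation="relu")(e1)

    # Decoder: transposed convolutions restore resolution; skip connections
    # concatenate matching encoder features (the U-Net pattern).
    d1 = layers.Conv2DTranspose(16, 5, strides=2, padding="same", activation="relu")(e2)
    d1 = layers.Concatenate()([d1, e1])
    d2 = layers.Conv2DTranspose(1, 5, strides=2, padding="same", activation="sigmoid")(d1)

    # The sigmoid output acts as a soft mask applied to the mixture spectrogram.
    out = layers.Multiply()([d2, inp])
    return Model(inp, out)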
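Claims 4 and 8 describe the training procedure: feed the training set's time-frequency spectra through the separation model, update the parameters from the error against the target spectra, repeat until the loss converges, and keep the parameters that minimize the error. A hypothetical TensorFlow loop in that spirit, with the dataset handling, optimizer, and stopping criterion all assumed, might look like this:

import tensorflow as tf

def train(model, dataset, epochs=50, patience=5):
    optimizer = tf.keras.optimizers.Adam(1e-3)
    loss_fn = tf.keras.losses.MeanAbsoluteError()
    best_loss, best_weights, stale = float("inf"), model.get_weights(), 0

    for epoch in range(epochs):
        epoch_loss = 0.0
        for mix_spec, target_spec in dataset:  # batches of (mixture, target) spectrograms
            with tf.GradientTape() as tape:
                est_spec = model(mix_spec, training=True)
                loss = loss_fn(target_spec, est_spec)
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            epoch_loss += float(loss)

        # "Repeat until the loss converges": stop once the loss stops improving,
        # and keep the parameters that minimized the error.
        if epoch_loss < best_loss:
            best_loss, best_weights, stale = epoch_loss, model.get_weights(), 0
        else:
            stale += 1
            if stale >= patience:
                break

    model.set_weights(best_weights)
    return model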
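Finally, the last step of claim 1, determining the audio data for a preset component label from its feature-extraction data, can be illustrated by inverting the masked magnitude spectrogram with the mixture's phase. Reusing the mixture phase is a common convention and an assumption of this sketch; the claims do not specify how the waveform is reconstructed.

import librosa
import numpy as np

def features_to_audio(masked_mag, mixture_stft, hop=512):
    # Combine the extracted magnitude with the mixture's phase, then invert
    # with the overlap-add inverse STFT to recover the component waveform.
    phase = np.angle(mixture_stft)
    return librosa.istft(masked_mag * np.exp(1j * phase), hop_length=hop)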
CN202110303110.5A 2021-03-22 2021-03-22 Audio processing method and device, terminal and storage medium Active CN113113040B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110303110.5A CN113113040B (en) 2021-03-22 2021-03-22 Audio processing method and device, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110303110.5A CN113113040B (en) 2021-03-22 2021-03-22 Audio processing method and device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN113113040A CN113113040A (en) 2021-07-13
CN113113040B (en) 2023-05-09

Family

ID=76710424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110303110.5A Active CN113113040B (en) 2021-03-22 2021-03-22 Audio processing method and device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN113113040B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116185165B (en) * 2022-06-17 2024-04-02 武汉市聚芯微电子有限责任公司 Haptic sensation generation method, system, device and computer storage medium
CN115206345A (en) * 2022-07-19 2022-10-18 深圳万兴软件有限公司 Music and human voice separation method, device, equipment and medium based on time-frequency combination

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050091041A1 (en) * 2003-10-23 2005-04-28 Nokia Corporation Method and system for speech coding
CN109147809A (en) * 2018-09-20 2019-01-04 广州酷狗计算机科技有限公司 Acoustic signal processing method, device, terminal and storage medium
CN110335622B (en) * 2019-06-13 2024-03-01 平安科技(深圳)有限公司 Audio single-tone color separation method, device, computer equipment and storage medium
CN110503976B (en) * 2019-08-15 2021-11-23 广州方硅信息技术有限公司 Audio separation method and device, electronic equipment and storage medium
CN110782915A (en) * 2019-10-31 2020-02-11 广州艾颂智能科技有限公司 Waveform music component separation method based on deep learning
CN111261186B (en) * 2020-01-16 2023-05-30 南京理工大学 Audio sound source separation method based on improved self-attention mechanism and cross-band characteristics
CN111370019B (en) * 2020-03-02 2023-08-29 字节跳动有限公司 Sound source separation method and device, and neural network model training method and device

Also Published As

Publication number Publication date
CN113113040A (en) 2021-07-13

Similar Documents

Publication Publication Date Title
CN109801644B (en) Separation method, separation device, electronic equipment and readable medium for mixed sound signal
US10614803B2 (en) Wake-on-voice method, terminal and storage medium
CN111261144B (en) Voice recognition method, device, terminal and storage medium
CN111369971B (en) Speech synthesis method, device, storage medium and electronic equipment
CN110097890B (en) Voice processing method and device for voice processing
CN110210310B (en) Video processing method and device for video processing
CN111583944A (en) Sound changing method and device
CN111508511A (en) Real-time sound changing method and device
CN111798821B (en) Sound conversion method, device, readable storage medium and electronic equipment
CN113113040B (en) Audio processing method and device, terminal and storage medium
CN113362812A (en) Voice recognition method and device and electronic equipment
CN113223542B (en) Audio conversion method and device, storage medium and electronic equipment
CN110648656A (en) Voice endpoint detection method and device, electronic equipment and storage medium
CN113362813B (en) Voice recognition method and device and electronic equipment
CN113257218B (en) Speech synthesis method, device, electronic equipment and storage medium
CN116129931B (en) Audio-visual combined voice separation model building method and voice separation method
CN108073572A (en) Information processing method and its device, simultaneous interpretation system
US11354520B2 (en) Data processing method and apparatus providing translation based on acoustic model, and storage medium
CN113205793A (en) Audio generation method and device, storage medium and electronic equipment
CN113488022B (en) Speech synthesis method and device
WO2022147692A1 (en) Voice command recognition method, electronic device and non-transitory computer-readable storage medium
CN110930977B (en) Data processing method and device and electronic equipment
CN112151072A (en) Voice processing method, apparatus and medium
CN113345452A (en) Voice conversion method, training method, device and medium of voice conversion model
CN108364631B (en) Speech synthesis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant