CN113113040A - Audio processing method and device, terminal and storage medium - Google Patents

Audio processing method and device, terminal and storage medium

Info

Publication number
CN113113040A
Authority
CN
China
Prior art keywords
data
mixed audio
preset component
frequency spectrum
time
Prior art date
Legal status
Granted
Application number
CN202110303110.5A
Other languages
Chinese (zh)
Other versions
CN113113040B (en)
Inventor
王昭
Current Assignee
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202110303110.5A priority Critical patent/CN113113040B/en
Publication of CN113113040A publication Critical patent/CN113113040A/en
Application granted granted Critical
Publication of CN113113040B publication Critical patent/CN113113040B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/81: Detection of presence or absence of voice signals for discriminating voice from music

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The present disclosure relates to an audio processing method and apparatus, a terminal, and a storage medium. The method comprises: acquiring mixed audio data, wherein the mixed audio data comprises audio data with multiple components; processing the mixed audio data to obtain time-frequency spectrum feature data of the mixed audio data; and inputting the time-frequency spectrum feature data into a trained neural network model for separation to determine the audio data corresponding to a preset component label in the mixed audio data, wherein the neural network model comprises an encoder-decoder structure. Based on a deep learning scheme, the method can effectively separate the independent components in waveform music, is applicable to most audio, and places no restriction on music style, and therefore offers strong extensibility and generality.

Description

Audio processing method and device, terminal and storage medium
Technical Field
The present disclosure relates to the field of electronic technologies, and in particular, to an audio processing method and apparatus, a terminal, and a storage medium.
Background
Audio separation is an audio processing technique that extracts a signal of a given type from a mixed audio signal. The so-called "silencing" feature of conventional music separation software generally relies on band-stop filtering, and the related art also includes methods that separate vocals from accompaniment by spectral subtraction. However, with silencing based on band-stop filtering, because the frequencies of some instruments overlap those of the human voice and the frequencies of the voice or harmony are not fixed, instrument sounds in the same frequency band disappear along with the voice. Spectral subtraction, in turn, introduces musical noise, which greatly degrades the user's listening experience.
Disclosure of Invention
To overcome the problems in the related art to some extent, the present disclosure provides an audio processing method and apparatus, a terminal, and a storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided an audio processing method, including:
acquiring mixed audio data, wherein the mixed audio data comprises audio data with multiple components;
processing the mixed audio data to obtain time-frequency spectrum characteristic data of the mixed audio data;
inputting the time-frequency spectrum characteristic data into a trained neural network model for separation, and determining audio data corresponding to a preset component label in the mixed audio data;
wherein the neural network model comprises an encoder-decoder structure.
In some embodiments, the processing the mixed audio data to determine time-frequency spectral feature data of the mixed audio data includes:
performing truncation processing on the mixed audio data to obtain target audio data with a preset length;
and preprocessing and normalizing the target audio data to obtain the processed mixed time-frequency spectrum characteristic data.
In some embodiments, the inputting the time-frequency spectrum feature data into a trained neural network model for separation to obtain audio data corresponding to a preset component tag in the mixed audio data includes:
determining feature extraction data corresponding to each preset component label in the time-frequency spectrum feature data based on a pre-trained extraction model associated with the preset component labels;
and determining audio data corresponding to a preset component label in the mixed audio data based on the feature extraction data.
In some embodiments, the extraction model associated with the preset composition label is trained by:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component labels and mixed audio sample training data corresponding to the preset component labels;
and respectively inputting the time-frequency spectrum characteristics of each preset component label and the mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the extraction model associated with the preset composition label is trained by:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component labels and mixed audio sample training data corresponding to the preset component labels;
under the condition of keeping the fundamental frequency of the mixed audio training data unchanged, changing the vocal tract movement rate of the mixed audio training data to obtain processed target mixed audio sample training data;
and respectively inputting the time-frequency spectrum characteristics of each preset component label and the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the extraction model associated with the preset composition label is trained by:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component labels and mixed audio sample training data corresponding to the preset component labels;
under the condition of keeping the envelope of the mixed audio training data unchanged, changing the fundamental frequency of the mixed audio training data to obtain processed target mixed audio sample training data;
and respectively inputting the time-frequency spectrum characteristics of each preset component label and the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the neural network model is a fully convolutional UNET neural network model built using the TensorFlow deep learning framework.
In some embodiments, the neural network model is trained by:
inputting the time-frequency spectrum data of the mixed audio training set into a pre-trained separation model, and outputting estimated target time-frequency spectrum data;
updating model parameters of the separation model according to the error between the time-frequency spectrum data set and the target time-frequency spectrum data;
repeating the training step until the loss function of the separation model converges;
and determining target model parameters which minimize the error, and determining the trained neural network model according to the target model parameters.
According to a second aspect of the embodiments of the present disclosure, there is provided an audio processing apparatus including:
the acquisition module is used for acquiring mixed audio data, wherein the mixed audio data comprises audio data with multiple components;
the processing module is used for processing the mixed audio data to obtain time-frequency spectrum characteristic data of the mixed audio data;
the separation module is used for inputting the time-frequency spectrum characteristic data into a trained neural network model for separation, and determining audio data corresponding to a preset component label in the mixed audio data;
wherein the neural network model comprises an encoder-decoder structure.
In some embodiments, the processing module is specifically configured to:
performing truncation processing on the mixed audio data to obtain target audio data with a preset length;
preprocessing and normalizing the target audio data to obtain processed mixed time-frequency spectrum data;
and performing alignment operation on the processed mixed time-frequency spectrum data to generate time-frequency spectrum characteristic data of the mixed audio data.
In some embodiments, the separation module is specifically configured to:
determining feature extraction data corresponding to each preset component label in the time-frequency spectrum feature data based on a pre-trained extraction model associated with the preset component labels;
and determining audio data corresponding to a preset component label in the mixed audio data based on the feature extraction data.
In some embodiments, the apparatus further comprises a first training module, the first training module being specifically configured to:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component labels and mixed audio sample training data corresponding to the preset component labels;
and respectively inputting the time-frequency spectrum characteristics of each preset component label and the mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the apparatus further comprises a first training module, the first training module being specifically configured to:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component labels and mixed audio sample training data corresponding to the preset component labels;
under the condition of keeping the fundamental frequency of the mixed audio training data unchanged, changing the vocal tract movement rate of the mixed audio training data to obtain processed target mixed audio sample training data;
and respectively inputting the time-frequency spectrum characteristics of each preset component label and the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the apparatus further comprises a first training module, the first training module being specifically configured to:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component labels and mixed audio sample training data corresponding to the preset component labels;
under the condition of keeping the envelope of the mixed audio training data unchanged, changing the fundamental frequency of the mixed audio training data to obtain processed target mixed audio sample training data;
and respectively inputting the time-frequency spectrum characteristics of each preset component label and the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the neural network model is a fully convolutional UNET neural network model built using the TensorFlow deep learning framework.
In some embodiments, the apparatus further comprises a second training module, the second training module specifically configured to:
inputting the time-frequency spectrum data of the mixed audio training set into a pre-trained separation model, and outputting estimated target time-frequency spectrum data;
updating model parameters of the separation model according to the error between the time-frequency spectrum data set and the target time-frequency spectrum data;
repeating the training step until the loss function of the separation model converges;
and determining target model parameters which minimize the error, and determining the trained neural network model according to the target model parameters.
According to a third aspect of the embodiments of the present disclosure, there is provided a terminal, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the audio processing method as described in the first aspect above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium including:
the instructions in said storage medium, when executed by a processor of the terminal, enable the terminal to perform the audio processing method as described in the first aspect above.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
In the embodiments of the present disclosure, the acquired mixed audio data is processed, the time-frequency spectrum feature data of the processed mixed audio data is input into a trained neural network model for separation, and the audio data corresponding to the preset component label in the mixed audio data is determined. This audio processing method, based on deep learning and a neural network model, can effectively separate the independent components in waveform music, is applicable to most audio, and places no restriction on music style, and therefore offers strong extensibility and generality.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart illustrating an audio processing method according to an exemplary embodiment of the present disclosure.
Fig. 2 is a flow chart illustrating another audio processing method according to an exemplary embodiment of the present disclosure.
Fig. 3 is a structural example diagram illustrating a neural network structure according to an exemplary embodiment of the present disclosure.
Fig. 4 is a waveform diagram illustrating a speech signal according to an exemplary embodiment of the present disclosure.
Fig. 5 is a waveform diagram illustrating another speech signal according to an exemplary embodiment of the present disclosure.
Fig. 6 is a waveform diagram illustrating yet another speech signal according to an exemplary embodiment of the present disclosure.
Fig. 7 is a diagram illustrating an audio processing device according to an exemplary embodiment of the present disclosure.
Fig. 8 is a block diagram illustrating a terminal according to an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Audio separation is an audio processing technique that extracts a signal of a given type from a mixed audio signal. The so-called "silencing" feature of conventional music separation software generally relies on band-stop filtering, and the related art also includes methods that separate vocals from accompaniment by spectral subtraction. However, with silencing based on band-stop filtering, because the frequencies of some instruments overlap those of the human voice and the frequencies of the voice or harmony are not fixed, instrument sounds in the same frequency band disappear along with the voice. Spectral subtraction, in turn, introduces musical noise, which greatly degrades the user's listening experience.
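For reference on the spectral-subtraction limitation described above, the following is a minimal NumPy sketch of magnitude spectral subtraction. It is not part of the patent: the flooring constant and the assumption that an accompaniment magnitude estimate is available are illustrative only, and the hard clipping of negative bins is precisely what leaves the isolated residual peaks perceived as musical noise.

```python
import numpy as np

def spectral_subtract(mixture_mag: np.ndarray,
                      accomp_estimate_mag: np.ndarray,
                      floor: float = 0.02) -> np.ndarray:
    """Subtract an estimated accompaniment magnitude spectrum from the
    mixture magnitude spectrum, flooring negative values."""
    residual = mixture_mag - accomp_estimate_mag
    # Negative bins are clipped to a small fraction of the mixture, which
    # leaves scattered residual peaks heard as "musical noise".
    return np.maximum(residual, floor * mixture_mag)
```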
To overcome the problems in the related art to some extent, the present disclosure provides an audio processing method and apparatus, a terminal, and a storage medium.
Fig. 1 is a flowchart of an audio processing method according to an embodiment of the present disclosure. As shown in Fig. 1, the audio processing method includes the following steps:
s101, acquiring mixed audio data, wherein the mixed audio data comprises audio data with multiple components;
s102, processing the mixed audio data to obtain time-frequency spectrum characteristic data of the mixed audio data;
s103, inputting the time-frequency spectrum characteristic data into a trained neural network model for separation, and determining audio data corresponding to a preset component label in the mixed audio data.
In an embodiment of the present disclosure, the audio processing method may be applied to a terminal device, and the terminal device may include a mobile device or a fixed device. The mobile devices include mobile phones, tablet computers, wearable devices, smart speakers, smart microphones, and the like. The fixed devices include, but are not limited to, personal computers (PCs), smart televisions, and the like.
In step S101, mixed audio data, which may be plural in number, is first acquired, the mixed audio data being audio data including plural components. For example, pop music, which generally includes a human voice and an accompaniment, can be understood as a kind of mixed audio data; as another example, rock music typically includes human voice, percussion music, bass, string music, and others, and may also be understood as a kind of mixed audio data. The mixed audio data in the embodiments of the present disclosure may be used to characterize waveform music.
In step S102, the processing of the mixed audio may include performing operations such as time-domain segmentation, framing, windowing, Fourier transform, normalization, and alignment on the mixed audio data to obtain the time-frequency spectrum feature data of the mixed audio data.
In step S103, the neural network may be a fully convolutional UNET neural network built with the TensorFlow deep learning framework. The neural-network-based separator may adopt an encoder-decoder structure. The mixed time-frequency spectrum data is input into the separator, and with a supervised learning method the error between the output of the separation network and the data labeled by the preset component tag is minimized, so that the audio data corresponding to the preset component tag in the mixed audio data can be determined, achieving the goal of separating the sound of a specific component from the mixed audio data.
According to the audio processing method, the acquired mixed audio data is processed, the time-frequency spectrum feature data of the processed mixed audio data is input into a trained neural network model for separation, and the audio data corresponding to the preset component labels in the mixed audio data is determined. This audio processing method, based on deep learning and a neural network model, can effectively separate the independent components in waveform music, is applicable to most audio, and places no restriction on music style, and therefore offers strong extensibility and generality.
In some embodiments, the processing the mixed audio data to determine time-frequency spectral feature data of the mixed audio data includes:
performing truncation processing on the mixed audio data to obtain target audio data with a preset length;
and preprocessing and normalizing the target audio data to obtain the processed mixed time-frequency spectrum characteristic data.
Because the duration of the acquired mixed audio data is not fixed, the mixed audio data can be uniformly truncated in the time domain during processing to obtain target audio data of a preset length. The preset length may be set as desired, and the present disclosure does not limit it. For example, the preset length may be set to 20 s; that is, all acquired mixed audio data are truncated in time to 20 s before subsequent preprocessing. In this way, the processing time after the data are fed into the separation model is not excessive, while audio of the preset length still provides all the required input features.
The truncated target audio data is then preprocessed; the preprocessing may include time-domain segmentation, framing, windowing, Fourier transform, and other operations. For example, about 90 ms of data may be taken as one frame, a Hanning window may be chosen as the window function, and the sampling frequency set to 44100 Hz, so that 4096 sampling points are taken for the Fourier transform, with zero padding applied if a frame has fewer than 4096 points.
The Fourier transform converts the time-domain signal into a frequency-domain signal, after which a normalization operation is performed. Owing to the symmetry of the Fourier transform, only the first 2048 points need be used for spectral analysis. Since human hearing lies between 20 Hz and 20000 Hz, to reduce the amount of computation the embodiments of the present disclosure take the first 1024 frequency bins to cover the hearing range of the human ear.
In some possible embodiments, to ensure the accuracy of the input signal, the Fourier transform may be followed by an alignment operation on the signal and then by the normalization process.
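The following is a minimal sketch of this preprocessing (truncation to the preset length, Hann windowing, 4096-point Fourier transform at 44100 Hz, retention of the first 1024 frequency bins, and normalization) written with TensorFlow's signal utilities. The hop size, the per-clip max normalization, and the omission of the alignment step are assumptions for illustration and are not specified by the patent.

```python
import numpy as np
import tensorflow as tf

SAMPLE_RATE = 44100
CLIP_SECONDS = 20      # preset truncation length from the description
FRAME_LENGTH = 4096    # roughly 90 ms per frame at 44.1 kHz
FRAME_STEP = 1024      # hop size: an assumption, not stated in the patent
NUM_BINS = 1024        # keep only the first 1024 frequency bins

def mixed_audio_to_features(waveform: np.ndarray) -> tf.Tensor:
    """Truncate, window, Fourier-transform, and normalize a mixed waveform."""
    # Truncate (or zero-pad) to the preset length.
    target_len = SAMPLE_RATE * CLIP_SECONDS
    waveform = waveform[:target_len]
    waveform = np.pad(waveform, (0, max(0, target_len - len(waveform))))

    # Short-time Fourier transform with a Hann window.
    stft = tf.signal.stft(
        tf.constant(waveform, dtype=tf.float32),
        frame_length=FRAME_LENGTH,
        frame_step=FRAME_STEP,
        window_fn=tf.signal.hann_window,
    )
    magnitude = tf.abs(stft)[:, :NUM_BINS]  # low-frequency bins only

    # Normalize magnitudes to [0, 1] per clip.
    return magnitude / (tf.reduce_max(magnitude) + 1e-8)
```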
In some embodiments, the inputting the time-frequency spectrum feature data into a trained neural network model for separation to obtain audio data corresponding to a preset component tag in the mixed audio data includes:
determining feature extraction data corresponding to each preset component label in the time-frequency spectrum feature data based on a pre-trained extraction model associated with the preset component labels;
and determining audio data corresponding to a preset component label in the mixed audio data based on the feature extraction data.
In some embodiments, the extraction model associated with the preset composition label is trained by:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component labels and mixed audio sample training data corresponding to the preset component labels;
and respectively inputting the time-frequency spectrum characteristics of each preset component label and the mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the extraction model associated with the preset composition label is trained by:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component labels and mixed audio sample training data corresponding to the preset component labels;
under the condition of keeping the fundamental frequency of the mixed audio training data unchanged, changing the vocal tract movement rate of the mixed audio training data to obtain processed target mixed audio sample training data;
and respectively inputting the time-frequency spectrum characteristics of each preset component label and the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the extraction model associated with the preset composition label is trained by:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component labels and mixed audio sample training data corresponding to the preset component labels;
under the condition of keeping the envelope of the mixed audio training data unchanged, changing the fundamental frequency of the mixed audio training data to obtain processed target mixed audio sample training data;
and respectively inputting the time-frequency spectrum characteristics of each preset component label and the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In the above embodiments, when the processed time-frequency spectrum feature data is input into the trained neural network model for separation, because an extraction model associated with each preset component label has been trained in advance, the feature extraction data corresponding to each preset component label in the time-frequency spectrum feature data can first be determined based on the pre-trained extraction model associated with that label, and the audio data corresponding to the preset component label in the mixed audio data can then be determined based on the feature extraction data.
Training the separation model means training, one by one through the neural network, a plurality of mixed audio data corresponding to each preset component label together with the time-frequency spectrum sample corresponding to each piece of mixed audio data, so as to obtain a trained neural network model, an individual data model for each preset component label, and an individual extraction model for each preset component label. When the trained separation model is used, because the embodiments of the present disclosure rely on deep learning for both training and inference, the independent components in the mixed audio data can be separated; the purity of each separated component is preserved, and contamination of the separation result by data of other components is avoided.
To ensure the training effect of the separation model, when the mixed audio sample data is obtained during training, the mixed audio sample training set needs to be determined; this training set includes the preset component labels and the mixed audio sample training data corresponding to those labels. After each preset component label and its corresponding mixed audio sample training data are determined, the mixed audio sample training set is built from the mixed audio sample data together with its labels and the label data it contains.
The label type of the mixed audio sample data and the preset component labels it contains can be determined either by machine learning or by manual annotation.
For example, a tag type matching the mixed audio data is first determined. There may be several tag types: a first tag type may comprise vocals, percussion, bass, strings, and others, while a second tag type may comprise vocals and accompaniment. Different tag types correspond to different genres of music.
The mixed audio sample training data is then mapped to the preset component labels included in the tag type, and the corresponding first audio data set of the mixed audio sample is determined. Taking the second tag type as an example, it includes two preset component labels, namely vocals and accompaniment. When the mixed audio sample training data is determined to match the second tag type, the data corresponding to each preset component label must be determined so as to construct the first audio data set of the mixed audio sample: first vocal audio data corresponding to the vocal label is determined, first accompaniment audio data corresponding to the accompaniment label is determined, and the first mixed audio sample data set is formed from the first vocal audio data and the first accompaniment audio data.
Thus, by determining the tag type that matches the mixed audio sample training data, mapping the training data to the preset component labels included in that tag type, and determining the corresponding mixed audio sample data set, a foundation is laid for subsequent separation. In the subsequent separation step, each preset component label and the time-frequency spectrum features of its corresponding mixed audio sample training data are then respectively input into the neural network model for training, so that the extraction model associated with each preset component label is obtained.
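One possible way to organize such a training set is sketched below: each sample pairs a mixture with one stem per preset component label for the chosen tag type. The class and field names are hypothetical illustrations, not structures defined in the patent.

```python
from dataclasses import dataclass
from typing import Dict, List

# Tag types described in the text; the names here are illustrative.
FIRST_TYPE_LABELS = ["vocals", "percussion", "bass", "strings", "other"]
SECOND_TYPE_LABELS = ["vocals", "accompaniment"]

@dataclass
class MixedAudioSample:
    """One training example: the mixture plus one stem per preset label."""
    mixture_path: str
    stems: Dict[str, str]  # preset component label -> audio file path

def build_training_set(samples: List[MixedAudioSample],
                       label_type: List[str]) -> List[MixedAudioSample]:
    """Keep only samples that provide a stem for every preset label."""
    return [s for s in samples if all(lbl in s.stems for lbl in label_type)]
```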
To obtain a richer data set, thereby markedly improving model performance during subsequent training and increasing the accuracy with which the separation model extracts the target waveform music, data expansion may be performed on the current data set. The expansion may comprise any one or more of the following operations: speed change, pitch change, envelope change, addition of reverberation and noise, and channel swapping.
When no data expansion is performed, once the mixed audio sample training set (comprising the preset component labels and the corresponding mixed audio sample training data) has been determined, the dimensions of the time-frequency spectrum fed into the separation network are generally batch size, frame sequence, frequency sequence, and number of audio channels, where the frame and frequency sequences are obtained beforehand by a short-time Fourier transform of the time-domain waveform.
Taking speed change as an example of data expansion, after the mixed audio sample training set is determined, the training data can be stretched or compressed along the time dimension of the spectrogram, so that the fundamental frequency remains almost unchanged and the pitch therefore does not change. However, as the whole time course is compressed or expanded, the number of glottal periods decreases or increases, the vocal tract movement rate changes, and the speech rate changes accordingly.
The speed-change operation thus enriches the training data when data are limited. The speed-changed time-frequency spectrum features are input into the separation network for training to obtain a separation mask, and each preset component label together with the time-frequency spectrum features of its corresponding target mixed audio sample training data is respectively input into the neural network model for training, so that the extraction model associated with each preset component label is obtained.
Taking pitch change as an example of data expansion: the voice pitch change operation changes the speaker's fundamental frequency while keeping the speech rate and semantics unchanged, that is, the short-time spectral envelope (the positions and bandwidths of the formants) and the time course remain essentially unchanged. After the mixed audio sample training set is determined, the fundamental frequency of the mixed audio training data is changed while its envelope is kept unchanged, yielding the processed target mixed audio sample training data. Two-channel data can be used as input during training; because a two-channel corpus contains spatial information, the input features are greatly enriched. The pitch-shifted time-frequency spectrum features are input into the separation network for training to obtain a separation mask, and each preset component label together with the time-frequency spectrum features of its corresponding target mixed audio sample training data is respectively input into the neural network model for training, so that the extraction model associated with each preset component label is obtained.
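A hedged sketch of the speed-change and pitch-change expansion operations is given below, using librosa's time-stretch and pitch-shift utilities plus additive noise. The parameter ranges, the noise level, and the use of librosa itself are assumptions for illustration; note that librosa's phase-vocoder pitch shifter only approximately preserves the spectral envelope described above.

```python
import numpy as np
import librosa

def augment_clip(y: np.ndarray, sr: int = 44100) -> dict:
    """Return speed-changed, pitch-changed, and noise-added versions of y."""
    out = {}
    # Speed change: stretch the time axis; the fundamental frequency stays
    # roughly the same, so the vocal tract movement rate (speech rate) changes.
    out["speed"] = librosa.effects.time_stretch(y, rate=np.random.uniform(0.9, 1.1))
    # Pitch change: shift the fundamental frequency by a few semitones while
    # keeping the time course unchanged.
    out["pitch"] = librosa.effects.pitch_shift(y, sr=sr,
                                               n_steps=np.random.uniform(-2, 2))
    # Additive noise at a low level.
    out["noise"] = y + 0.005 * np.random.randn(len(y)).astype(y.dtype)
    return out
```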
Fig. 2 is a flow chart illustrating another audio processing method according to an exemplary embodiment of the present disclosure.
As shown in fig. 2, in some embodiments, the neural network model is trained by:
step S201, inputting time-frequency spectrum data of a mixed audio training set into a pre-trained separation model, and outputting estimated target time-frequency spectrum data;
step S202, updating model parameters of the separation model according to the error between the time-frequency spectrum data set and the target time-frequency spectrum data;
repeating the training step until the loss function of the separation model converges;
step S203, determining a target model parameter which minimizes the error, and determining the trained neural network model according to the target model parameter.
Taking a UNET structure as an example of the neural network, when the separation network is trained using the mixed audio training set together with its corresponding tags and tag data, the time-frequency spectrum data of the mixed audio training set obtained after the preprocessing and normalization operations may be input into the separation model; at the same time, the tag type used for computing the loss, and the specific preset tags it contains, need to be defined. For example, the tag type may be set to the first tag type, in which case the preset tags may include bass, percussion, vocals, and others, and the label of the loss function may be set as label(cars).
The mixed time-frequency spectrum data x is input into the separation model, which outputs estimated time-frequency spectrum data x̂. The error between x and x̂ is computed and the model parameters of the separation model are updated from it, which can be understood as finding the model parameters that minimize the error. The training step is repeated until the loss function of the separation model converges. The target model parameters that minimize the error are then determined, and the trained neural network model is determined according to these target model parameters.
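A minimal TensorFlow training-loop sketch of this procedure follows: the mixed spectrum is fed to the separation model, the error between the model output and a reference spectrum from the training set is computed, and the parameters are updated until the loss converges. The choice of Adam, an L1 loss, a fixed epoch count standing in for an explicit convergence test, and treating the reference as the labeled target spectrum are all assumptions, not details given in the patent.

```python
import tensorflow as tf

def train_separation_model(model: tf.keras.Model,
                           dataset: tf.data.Dataset,
                           epochs: int = 50,
                           lr: float = 1e-4) -> tf.keras.Model:
    """Supervised training loop; `dataset` is assumed to yield
    (mixed_spectrogram, target_spectrogram) pairs for one preset label."""
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    loss_fn = tf.keras.losses.MeanAbsoluteError()

    for epoch in range(epochs):
        epoch_loss, num_batches = 0.0, 0
        for mix_spec, target_spec in dataset:
            with tf.GradientTape() as tape:
                estimate = model(mix_spec, training=True)  # estimated spectrum
                loss = loss_fn(target_spec, estimate)       # error to minimize
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            epoch_loss += float(loss)
            num_batches += 1
        # Training is repeated until the loss converges; the fixed epoch
        # count here is a stand-in for a real convergence check.
        print(f"epoch {epoch}: mean loss {epoch_loss / max(num_batches, 1):.4f}")
    return model
```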
In some embodiments, when the trained separation model is used, the acquired time-frequency spectrum feature data of the mixed audio data is input into the trained model for separation, yielding the time-frequency spectrum data most probably corresponding to the preset component tag. An inverse Fourier transform of that time-frequency spectrum data gives an estimated time-domain waveform, from which the waveform music data corresponding to the preset component tag in the mixed audio data is obtained.
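A sketch of this inference path is shown below: the trained model produces an estimated magnitude spectrum, which is combined with the mixture phase and passed through an inverse short-time Fourier transform to recover the estimated time-domain waveform. Reusing the mixture phase and the specific frame parameters are assumptions; the patent only states that an inverse Fourier transform yields the estimated waveform.

```python
import tensorflow as tf

def separate_component(model: tf.keras.Model,
                       mixed_stft: tf.Tensor) -> tf.Tensor:
    """Recover the time-domain waveform for one preset component label.
    `model` is assumed to map a magnitude spectrogram to an estimate of
    the same shape; `mixed_stft` is the complex STFT of the mixture."""
    mixture_mag = tf.abs(mixed_stft)
    mixture_phase = tf.math.angle(mixed_stft)

    # Estimated magnitude spectrum for the preset component label.
    estimated_mag = model(mixture_mag[tf.newaxis, ...], training=False)[0]

    # Recombine the estimated magnitude with the mixture phase and invert.
    estimated_stft = tf.cast(estimated_mag, tf.complex64) * tf.exp(
        tf.complex(tf.zeros_like(mixture_phase), mixture_phase))
    return tf.signal.inverse_stft(
        estimated_stft, frame_length=4096, frame_step=1024,
        window_fn=tf.signal.inverse_stft_window_fn(1024))
```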
In some possible embodiments, the neural network model is a fully convolutional UNET neural network model built using the TensorFlow deep learning framework.
Fig. 3 is a structural example diagram illustrating a neural network structure according to an exemplary embodiment of the present disclosure. As shown in Fig. 3, the separator adopts an encoder-decoder structure: the encoder is responsible for feature extraction, and the decoder is responsible for restoring the original resolution. In the UNET structure, feature fusion is performed by cropping and concatenation instead of the conventional residual connection with point-wise addition, which effectively prevents vanishing gradients as the network grows deeper.
During separation, abstract features are recovered to form a separation mask, in which upsampling and skip connections play a significant role. Because the convolutional layers down-sample the feature maps to a very small resolution, which would hinder formation of the separation mask, embodiments of the present disclosure use skip connections to bring shallower convolutional-layer features from the encoder into the final separation mask. This effectively fuses deep and shallow features and preserves the purity of the separated mixed audio data. The method is based on a fully convolutional UNET network: the network input is the magnitude signal after short-time Fourier transform and normalization, the output is the magnitude spectrum of the estimated speech signal, and the output keeps the same shape as the input. Training multiple mixed audio training samples and their corresponding time-frequency spectrum samples with a fully convolutional deep neural network and the TensorFlow deep learning framework improves the accuracy of extracting the original individual components, so that the separation model extracts the target waveform music more accurately.
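A compact Keras sketch of a UNET-style encoder-decoder of the kind described above is given below, with a strided-convolution encoder, a transposed-convolution decoder that restores the input resolution, concatenation-based skip connections, and a sigmoid mask applied to the input magnitude spectrogram. The layer counts, filter sizes, and input shape are illustrative assumptions, not the patent's architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_unet(input_shape=(512, 1024, 1), depth=4,
               base_filters=16) -> tf.keras.Model:
    inputs = tf.keras.Input(shape=input_shape)
    x, skips = inputs, []

    # Encoder: feature extraction with strided convolutions.
    for d in range(depth):
        x = layers.Conv2D(base_filters * 2 ** d, 5, strides=2,
                          padding="same", activation="relu")(x)
        skips.append(x)

    # Decoder: restore the original resolution; fuse features by concatenating
    # the shallower encoder outputs rather than adding them point-wise.
    for d in reversed(range(depth)):
        x = layers.Conv2DTranspose(base_filters * 2 ** d, 5, strides=2,
                                   padding="same", activation="relu")(x)
        if d > 0:
            x = layers.Concatenate()([x, skips[d - 1]])

    # Sigmoid mask with the same shape as the input magnitude spectrogram.
    mask = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)
    outputs = layers.Multiply()([inputs, mask])
    return tf.keras.Model(inputs, outputs)
```

The masking output keeps the output shape identical to the input shape, matching the statement above that the output signal is consistent in shape with the input signal.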
In some embodiments, the neural network model may instead be a CRN deep neural network built with the TensorFlow deep learning framework. Because a bidirectional long short-term memory (BLSTM) network captures long-range features well and, compared with other network types, is better suited to training on and evaluating audio signals of arbitrary length, embodiments of the present disclosure may also adopt a separation network with a CRN architecture built with the TensorFlow deep learning framework.
For this architecture, the specific training process first inputs the time-frequency spectrum of the mixed audio training sample data, whose dimensions are batch size, frame sequence, frequency sequence, and number of audio channels; the frame and frequency sequences are obtained by a short-time Fourier transform of the time-domain waveform. For each discrete frequency in each frame, the input data are standardized with a global mean and standard deviation, which reduces redundant information, accelerates model convergence, and shortens training time.
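A sketch of a CRN-style separator along these lines is shown below: the input of shape (frame sequence, frequency sequence, audio channels) is standardized with global statistics, passed through a convolutional front end, modeled over the frame sequence by a BLSTM, and turned into a per-frame mask. The statistics, layer sizes, and masking output are illustrative assumptions rather than the patent's specific network.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_crn(num_bins=1024, channels=2) -> tf.keras.Model:
    # Input: (frame sequence, frequency sequence, audio channels); the frame
    # count is left open so clips of arbitrary length can be processed.
    inputs = tf.keras.Input(shape=(None, num_bins, channels))

    # Standardize with a precomputed global mean and standard deviation
    # (placeholder values here; real values come from the training set).
    global_mean, global_std = 0.25, 0.1
    x = layers.Lambda(lambda t: (t - global_mean) / global_std)(inputs)

    # Convolutional front end over the frequency axis.
    x = layers.Conv2D(16, (3, 5), strides=(1, 4), padding="same",
                      activation="relu")(x)

    # One feature vector per frame, then a BLSTM over the frame sequence.
    x = layers.TimeDistributed(layers.Flatten())(x)
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)

    # Per-frame sigmoid mask over all bins and channels, applied to the input.
    mask = layers.Dense(num_bins * channels, activation="sigmoid")(x)
    mask = layers.Lambda(
        lambda t: tf.reshape(t, (tf.shape(t)[0], -1, num_bins, channels)))(mask)
    outputs = layers.Multiply()([inputs, mask])
    return tf.keras.Model(inputs, outputs)
```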
Figs. 4-6 respectively show waveform diagrams of speech signals. Fig. 4 shows the waveform of a mixed audio signal before component separation, Fig. 5 shows the waveform of a first audio signal matching the vocal tag before component separation, and Fig. 6 shows the waveform of the audio signal matching the vocal tag after component separation. From the amplitude spectra of the speech signals shown in Figs. 4-6, it can be seen that the output signal processed by the neural-network-based separator of the present disclosure is consistent in shape with the input signal and that the independent components in the waveform music are effectively separated.

Fig. 7 is a diagram illustrating an audio processing device according to an example embodiment. Referring to Fig. 7, in an alternative embodiment, the audio processing apparatus 100 includes an obtaining module 101, a processing module 102, and a separating module 103, where:
an obtaining module 101, configured to obtain mixed audio data, where the mixed audio data includes audio data of multiple components;
a processing module 102, configured to process the mixed audio data to obtain time-frequency spectrum feature data of the mixed audio data;
a separation module 103, configured to input the time-frequency spectrum feature data into a trained neural network model for separation, and determine audio data corresponding to a preset component tag in the mixed audio data;
wherein the neural network model comprises an encoder-decoder structure.
In some embodiments, the processing module 102 is specifically configured to:
performing truncation processing on the mixed audio data to obtain target audio data with a preset length;
preprocessing and normalizing the target audio data to obtain processed mixed time-frequency spectrum data;
and performing alignment operation on the processed mixed time-frequency spectrum data to generate time-frequency spectrum characteristic data of the mixed audio data.
In other embodiments, the separation module 103 is specifically configured to:
determining feature extraction data corresponding to each preset component label in the time-frequency spectrum feature data based on a pre-trained extraction model associated with the preset component labels;
and determining audio data corresponding to a preset component label in the mixed audio data based on the feature extraction data.
In some embodiments, the apparatus 100 further includes a first training module 104, where the first training module 104 is specifically configured to:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component labels and mixed audio sample training data corresponding to the preset component labels;
and respectively inputting the time-frequency spectrum characteristics of each preset component label and the mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the apparatus 100 further includes a first training module 104, where the first training module 104 is specifically configured to:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component labels and mixed audio sample training data corresponding to the preset component labels;
under the condition of keeping the fundamental frequency of the mixed audio training data unchanged, changing the vocal tract movement rate of the mixed audio training data to obtain processed target mixed audio sample training data;
and respectively inputting the time-frequency spectrum characteristics of each preset component label and the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the apparatus 100 further includes a first training module 104, where the first training module 104 is specifically configured to:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component labels and mixed audio sample training data corresponding to the preset component labels;
under the condition of keeping the envelope of the mixed audio training data unchanged, changing the fundamental frequency of the mixed audio training data to obtain processed target mixed audio sample training data;
and respectively inputting the time-frequency spectrum characteristics of each preset component label and the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
In some embodiments, the neural network model is a fully convolutional UNET neural network model built using the TensorFlow deep learning framework.
In some embodiments, the apparatus 100 further includes a second training module 105, and the second training module 105 is specifically configured to:
inputting the time-frequency spectrum data of the mixed audio training set into a pre-trained separation model, and outputting estimated target time-frequency spectrum data;
updating model parameters of the separation model according to the error between the time-frequency spectrum data set and the target time-frequency spectrum data;
repeating the training step until the loss function of the separation model converges;
and determining target model parameters which minimize the error, and determining the trained neural network model according to the target model parameters.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 8 is a block diagram illustrating a terminal 800 according to an example embodiment. For example, the terminal 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
Referring to fig. 8, terminal 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the terminal 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or some of the steps of the audio processing method described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the terminal 800. Examples of such data include instructions for any application or method operating on terminal 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of terminal 800. Power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for terminal 800.
The multimedia component 808 includes a screen providing an output interface between the terminal 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the terminal 800 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the terminal 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals. The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
Sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for terminal 800. For example, sensor assembly 814 can detect an open/closed state of terminal 800 and the relative positioning of components, such as the display and keypad of terminal 800; sensor assembly 814 can also detect a change in position of terminal 800 or a component of terminal 800, the presence or absence of user contact with terminal 800, the orientation or acceleration/deceleration of terminal 800, and a change in temperature of terminal 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
Communication component 816 is configured to facilitate communications between terminal 800 and other devices in a wired or wireless manner. The terminal 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the terminal 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components for performing the audio processing method described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the terminal 800 to perform the audio processing method described above is also provided. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In another exemplary embodiment, a computer program product is also provided, which contains a computer program executable by a programmable apparatus, the computer program having code portions for performing the audio processing method described above when executed by the programmable apparatus.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (18)

1. A method of audio processing, the method comprising:
acquiring mixed audio data, wherein the mixed audio data comprises audio data with multiple components;
processing the mixed audio data to obtain time-frequency spectrum characteristic data of the mixed audio data;
inputting the time-frequency spectrum characteristic data into a trained neural network model for separation, and determining audio data corresponding to a preset component label in the mixed audio data;
wherein the neural network model comprises an encoder-decoder structure.
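As a quick orientation for the method of claim 1, the sketch below shows one plausible realization: the mixed audio is converted into a magnitude time-frequency spectrum, a trained encoder-decoder network estimates the spectrum of the target component, and the mixture phase is reused to synthesize the separated waveform. The helper name `separate_component`, the FFT parameters, and the assumption that `separation_model` is a trained Keras model are illustrative and not taken from the patent.

```python
# Illustrative sketch only: assumes a trained Keras model `separation_model`
# that maps a mixture magnitude spectrum to the estimated magnitude spectrum
# of one preset component label (e.g. "vocals").
import numpy as np
import librosa

def separate_component(mixture_wav, separation_model, n_fft=1024, hop_length=256):
    # Time-frequency spectrum of the mixed audio data.
    stft = librosa.stft(mixture_wav, n_fft=n_fft, hop_length=hop_length)
    magnitude, phase = np.abs(stft), np.angle(stft)

    # The encoder-decoder model estimates the target component's magnitude.
    # (In practice the spectrum is cropped/padded to the network's input shape.)
    est_magnitude = separation_model.predict(magnitude[np.newaxis, ..., np.newaxis])[0, ..., 0]

    # Reuse the mixture phase and invert back to a waveform.
    return librosa.istft(est_magnitude * np.exp(1j * phase), hop_length=hop_length)
```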
2. The method of claim 1, wherein the processing the mixed audio data to obtain the time-frequency spectrum characteristic data of the mixed audio data comprises:
performing truncation processing on the mixed audio data to obtain target audio data with a preset length;
and preprocessing and normalizing the target audio data to obtain the processed mixed time-frequency spectrum characteristic data.
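A minimal sketch of the truncation and normalization step of claim 2, assuming a fixed segment length, an STFT magnitude spectrum, and simple max normalization; the patent does not fix these parameter values, so all of them are assumptions.

```python
# Minimal sketch, assuming a fixed segment length, an STFT magnitude
# spectrum and simple max normalization; all parameter values are assumed.
import numpy as np
import librosa

def mixture_features(mixture_wav, sr=16000, segment_seconds=10.0,
                     n_fft=1024, hop_length=256):
    # Truncate (or zero-pad) the mixed audio to the preset length.
    target = librosa.util.fix_length(mixture_wav, size=int(segment_seconds * sr))

    # Preprocess into a magnitude time-frequency spectrum and normalize.
    magnitude = np.abs(librosa.stft(target, n_fft=n_fft, hop_length=hop_length))
    return magnitude / (magnitude.max() + 1e-8)
```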
3. The method according to claim 1, wherein the inputting the time-frequency spectrum feature data into a trained neural network model for separation to obtain audio data corresponding to a preset component label in the mixed audio data comprises:
determining feature extraction data corresponding to each preset component label in the time-frequency spectrum feature data based on a pre-trained extraction model associated with the preset component labels;
and determining audio data corresponding to a preset component label in the mixed audio data based on the feature extraction data.
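Claim 3 applies one extraction model per preset component label. The sketch below assumes the trained models are held in a dictionary keyed by label and reuses the hypothetical `separate_component` helper shown after claim 1.

```python
# Sketch of claim 3: one trained extraction model per preset component label,
# held in a dict; reuses the hypothetical separate_component helper above.
def separate_all_labels(mixture_wav, extraction_models):
    # extraction_models: e.g. {"vocals": model_v, "accompaniment": model_a}
    return {label: separate_component(mixture_wav, model)
            for label, model in extraction_models.items()}
```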
4. The method of claim 3, wherein the extraction model associated with the predetermined component label is trained by:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component labels and mixed audio sample training data corresponding to the preset component labels;
and respectively inputting the time-frequency spectrum characteristics of each preset component label and the mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
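Claim 4 trains one extraction model per label from pairs of mixture and target spectra. The loop below is an illustrative Keras version; `build_unet` (a hypothetical constructor sketched after claim 7), the loss, the batch size, and the array shapes are assumptions.

```python
# Illustrative per-label training loop; build_unet is the hypothetical
# constructor sketched after claim 7, and all hyperparameters are assumed.
def train_extraction_models(training_set, build_unet, epochs=50):
    # training_set: {label: (mix_specs, target_specs)}, numpy arrays shaped
    # (num_samples, freq_bins, time_frames, 1).
    models = {}
    for label, (mix_specs, target_specs) in training_set.items():
        model = build_unet(input_shape=mix_specs.shape[1:])
        model.compile(optimizer="adam", loss="mean_absolute_error")
        model.fit(mix_specs, target_specs, batch_size=8, epochs=epochs)
        models[label] = model
    return models
```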
5. The method of claim 3, wherein the extraction model associated with the predetermined component label is trained by:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component labels and mixed audio sample training data corresponding to the preset component labels;
under the condition of keeping the fundamental frequency of the mixed audio training data unchanged, changing the sound channel movement rate of the mixed audio training data to obtain processed target mixed audio sample training data;
and respectively inputting the time-frequency spectrum characteristics of each preset component label and the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
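Claim 5 augments the training data by changing how fast the signal evolves while keeping the fundamental frequency fixed (the claim's "sound channel movement rate"). The sketch below interprets this as time-scale modification and uses librosa's phase-vocoder time stretch as an approximation; the patent's exact algorithm may differ.

```python
# Assumed interpretation: time-scale modification with pitch held constant,
# via librosa's phase-vocoder time stretch.
import librosa

def augment_rate(mix_wav, rates=(0.9, 1.1)):
    # Each stretched copy keeps the fundamental frequency roughly unchanged.
    return [librosa.effects.time_stretch(mix_wav, rate=r) for r in rates]
```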
6. The method of claim 3, wherein the extraction model associated with the predetermined component label is trained by:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component labels and mixed audio sample training data corresponding to the preset component labels;
under the condition of keeping the envelope of the mixed audio training data unchanged, changing the fundamental frequency of the mixed audio training data to obtain processed target mixed audio sample training data;
and respectively inputting the time-frequency spectrum characteristics of each preset component label and the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
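Claim 6 instead changes the fundamental frequency while leaving the envelope as untouched as possible. Pitch shifting is one common approximation; librosa's implementation below is purely illustrative and does not strictly preserve the spectral envelope.

```python
# Assumed approximation: semitone pitch shifts with duration preserved.
import librosa

def augment_pitch(mix_wav, sr=16000, steps=(-2, 2)):
    return [librosa.effects.pitch_shift(mix_wav, sr=sr, n_steps=s) for s in steps]
```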
7. The method of claim 1, wherein the neural network model is a fully convolutional UNet neural network model built using the TensorFlow deep learning framework.
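A compact, fully convolutional UNet-style encoder-decoder in TensorFlow/Keras, in the spirit of claim 7. Network depth, filter counts, kernel sizes, and the (512, 128, 1) spectrogram shape are assumptions; in practice the spectrum from claims 2 and 10 would be cropped or padded to match this input shape.

```python
# Sketch of a fully convolutional UNet-style encoder-decoder in TensorFlow/
# Keras; depth, filters and the (512, 128, 1) input shape are assumptions.
from tensorflow.keras import layers, Model

def build_unet(input_shape=(512, 128, 1)):
    inputs = layers.Input(shape=input_shape)

    # Encoder: strided convolutions halve the frequency/time resolution.
    e1 = layers.Conv2D(16, 5, strides=2, padding="same", activation="relu")(inputs)
    e2 = layers.Conv2D(32, 5, strides=2, padding="same", activation="relu")(e1)
    e3 = layers.Conv2D(64, 5, strides=2, padding="same", activation="relu")(e2)
    e4 = layers.Conv2D(128, 5, strides=2, padding="same", activation="relu")(e3)

    # Decoder: transposed convolutions with skip connections from the encoder.
    d3 = layers.Conv2DTranspose(64, 5, strides=2, padding="same", activation="relu")(e4)
    d3 = layers.Concatenate()([d3, e3])
    d2 = layers.Conv2DTranspose(32, 5, strides=2, padding="same", activation="relu")(d3)
    d2 = layers.Concatenate()([d2, e2])
    d1 = layers.Conv2DTranspose(16, 5, strides=2, padding="same", activation="relu")(d2)
    d1 = layers.Concatenate()([d1, e1])

    # Sigmoid mask at the input resolution, applied to the mixture spectrum,
    # so the model outputs an estimated target time-frequency spectrum.
    mask = layers.Conv2DTranspose(1, 5, strides=2, padding="same",
                                  activation="sigmoid")(d1)
    return Model(inputs, layers.Multiply()([inputs, mask]), name="unet_separator")
```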
8. The method of claim 1, wherein the neural network model is trained by:
inputting the time-frequency spectrum data of the mixed audio training set into a pre-trained separation model, and outputting estimated target time-frequency spectrum data;
updating model parameters of the separation model according to an error between the time-frequency spectrum data set and the estimated target time-frequency spectrum data;
repeating the training step until the loss function of the separation model converges;
and determining target model parameters which minimize the error, and determining the trained neural network model according to the target model parameters.
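One way to realize the training procedure of claim 8 in Keras: iterate until the validation loss stops improving and keep the parameters that gave the smallest error. EarlyStopping with restore_best_weights stands in for "determining target model parameters which minimize the error"; every hyperparameter below is assumed.

```python
# Assumed Keras realization: train until the validation loss stops improving
# and restore the weights with the smallest error.
import tensorflow as tf

def train_separation_model(model, mix_specs, target_specs, val_split=0.1):
    model.compile(optimizer="adam", loss="mean_absolute_error")
    stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=5, restore_best_weights=True)
    model.fit(mix_specs, target_specs, validation_split=val_split,
              batch_size=8, epochs=200, callbacks=[stop])
    return model
```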
9. An audio processing apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring mixed audio data, wherein the mixed audio data comprises audio data with multiple components;
the processing module is used for processing the mixed audio data to obtain time-frequency spectrum characteristic data of the mixed audio data;
the separation module is used for inputting the time-frequency spectrum characteristic data into a trained neural network model for separation, and determining audio data corresponding to a preset component label in the mixed audio data;
wherein the neural network model comprises an encoder-decoder structure.
10. The audio processing device according to claim 9, wherein the processing module is specifically configured to:
performing truncation processing on the mixed audio data to obtain target audio data with a preset length;
preprocessing and normalizing the target audio data to obtain processed mixed time-frequency spectrum data;
and performing alignment operation on the processed mixed time-frequency spectrum data to generate time-frequency spectrum characteristic data of the mixed audio data.
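Claim 10 adds an explicit alignment operation after preprocessing. One plausible reading, assumed here, is padding or cropping the spectrum so every example has the fixed (frequency, time) shape the network expects.

```python
# Assumed reading of the alignment operation: crop or zero-pad the spectrum
# to the fixed (frequency, time) shape the network expects.
import numpy as np

def align_spectrum(magnitude, freq_bins=512, time_frames=128):
    aligned = np.zeros((freq_bins, time_frames), dtype=magnitude.dtype)
    f = min(freq_bins, magnitude.shape[0])
    t = min(time_frames, magnitude.shape[1])
    aligned[:f, :t] = magnitude[:f, :t]
    return aligned
```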
11. The audio processing device according to claim 9, wherein the separation module is specifically configured to:
determining feature extraction data corresponding to each preset component label in the time-frequency spectrum feature data based on a pre-trained extraction model associated with the preset component labels;
and determining audio data corresponding to a preset component label in the mixed audio data based on the feature extraction data.
12. The audio processing device according to claim 11, wherein the device further comprises a first training module, the first training module being configured to:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component labels and mixed audio sample training data corresponding to the preset component labels;
and respectively inputting the time-frequency spectrum characteristics of each preset component label and the mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
13. The audio processing device according to claim 11, wherein the device further comprises a first training module, the first training module being configured to:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component labels and mixed audio sample training data corresponding to the preset component labels;
under the condition of keeping the fundamental frequency of the mixed audio training data unchanged, changing the sound channel movement rate of the mixed audio training data to obtain processed target mixed audio sample training data;
and respectively inputting the time-frequency spectrum characteristics of each preset component label and the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
14. The audio processing device according to claim 11, wherein the device further comprises a first training module, the first training module being configured to:
determining a mixed audio sample training set, wherein the mixed audio sample training set comprises the preset component labels and mixed audio sample training data corresponding to the preset component labels;
under the condition of keeping the envelope of the mixed audio training data unchanged, changing the fundamental frequency of the mixed audio training data to obtain processed target mixed audio sample training data;
and respectively inputting the time-frequency spectrum characteristics of each preset component label and the target mixed audio sample training data corresponding to the preset component labels into the neural network model for training so as to obtain the extraction model associated with each preset component label.
15. The audio processing apparatus according to claim 9, wherein the neural network model is a fully convolutional UNet neural network model built using the TensorFlow deep learning framework.
16. The audio processing device according to claim 9, wherein the device further comprises a second training module, the second training module being configured to:
inputting the time-frequency spectrum data of the mixed audio training set into a pre-trained separation model, and outputting estimated target time-frequency spectrum data;
updating model parameters of the separation model according to an error between the time-frequency spectrum data set and the estimated target time-frequency spectrum data;
repeating the training step until the loss function of the separation model converges;
and determining target model parameters which minimize the error, and determining the trained neural network model according to the target model parameters.
17. A terminal, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the audio processing method of any of claims 1 to 8.
18. A non-transitory computer-readable storage medium, instructions in which, when executed by a processor of a terminal, enable the terminal to perform the audio processing method of any one of claims 1 to 8.
CN202110303110.5A 2021-03-22 2021-03-22 Audio processing method and device, terminal and storage medium Active CN113113040B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110303110.5A CN113113040B (en) 2021-03-22 2021-03-22 Audio processing method and device, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110303110.5A CN113113040B (en) 2021-03-22 2021-03-22 Audio processing method and device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN113113040A true CN113113040A (en) 2021-07-13
CN113113040B CN113113040B (en) 2023-05-09

Family

ID=76710424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110303110.5A Active CN113113040B (en) 2021-03-22 2021-03-22 Audio processing method and device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN113113040B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050091041A1 (en) * 2003-10-23 2005-04-28 Nokia Corporation Method and system for speech coding
CN109147809A (en) * 2018-09-20 2019-01-04 广州酷狗计算机科技有限公司 Acoustic signal processing method, device, terminal and storage medium
CN110335622A (en) * 2019-06-13 2019-10-15 平安科技(深圳)有限公司 Voice frequency tone color separation method, apparatus, computer equipment and storage medium
CN110503976A (en) * 2019-08-15 2019-11-26 广州华多网络科技有限公司 Audio separation method, device, electronic equipment and storage medium
CN110782915A (en) * 2019-10-31 2020-02-11 广州艾颂智能科技有限公司 Waveform music component separation method based on deep learning
CN111261186A (en) * 2020-01-16 2020-06-09 南京理工大学 Audio sound source separation method based on improved self-attention mechanism and cross-frequency band characteristics
CN111370019A (en) * 2020-03-02 2020-07-03 字节跳动有限公司 Sound source separation method and device, and model training method and device of neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YAN Jingbin et al.: "Audio classification based on one-class support vector machine", Journal of Computer Applications *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116185165A (en) * 2022-06-17 2023-05-30 武汉市聚芯微电子有限责任公司 Haptic sensation generation method, system, device and computer storage medium
CN116185165B (en) * 2022-06-17 2024-04-02 武汉市聚芯微电子有限责任公司 Haptic sensation generation method, system, device and computer storage medium
CN115206345A (en) * 2022-07-19 2022-10-18 深圳万兴软件有限公司 Music and human voice separation method, device, equipment and medium based on time-frequency combination

Also Published As

Publication number Publication date
CN113113040B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN109801644B (en) Separation method, separation device, electronic equipment and readable medium for mixed sound signal
CN110097890B (en) Voice processing method and device for voice processing
CN111261144B (en) Voice recognition method, device, terminal and storage medium
CN110808063A (en) Voice processing method and device for processing voice
CN110210310B (en) Video processing method and device for video processing
CN111508511A (en) Real-time sound changing method and device
CN111583944A (en) Sound changing method and device
CN113362812B (en) Voice recognition method and device and electronic equipment
CN113223542B (en) Audio conversion method and device, storage medium and electronic equipment
CN113409764B (en) Speech synthesis method and device for speech synthesis
CN113113040B (en) Audio processing method and device, terminal and storage medium
CN111883107B (en) Speech synthesis and feature extraction model training method, device, medium and equipment
CN108364635B (en) Voice recognition method and device
CN116129931B (en) Audio-visual combined voice separation model building method and voice separation method
CN108073572A (en) Information processing method and its device, simultaneous interpretation system
CN113362813A (en) Voice recognition method and device and electronic equipment
US11354520B2 (en) Data processing method and apparatus providing translation based on acoustic model, and storage medium
CN113205793A (en) Audio generation method and device, storage medium and electronic equipment
CN113488022B (en) Speech synthesis method and device
CN115148185A (en) Speech synthesis method and device, electronic device and storage medium
CN113889070A (en) Voice synthesis method and device for voice synthesis
WO2022147692A1 (en) Voice command recognition method, electronic device and non-transitory computer-readable storage medium
CN110930977B (en) Data processing method and device and electronic equipment
CN109102813B (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN113409765B (en) Speech synthesis method and device for speech synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant