CN117012218A - Audio processing method, device, electronic equipment, storage medium and program product


Info

Publication number
CN117012218A
CN117012218A (application CN202211146105.9A)
Authority
CN
China
Prior art keywords
signal
musical instrument
noise reduction
music
filtering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211146105.9A
Other languages
Chinese (zh)
Inventor
梁俊斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority claimed from application CN202211146105.9A
Publication of CN117012218A
Legal status: Pending


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The embodiments of the present application provide an audio processing method, an audio processing device, electronic equipment, a storage medium and a program product, relate to the technical field of audio processing, and can be applied to music noise reduction. The method comprises the following steps: acquiring a first signal, the first signal being a noisy music signal generated by recording; filtering the first signal in parallel based on at least two pre-trained music noise reduction models to obtain at least two filtered second signals; and performing signal reconstruction on each second signal to obtain a noise-reduced target signal. Each music noise reduction model is a deep neural network model trained on training samples corresponding to a preset musical instrument; a corresponding music noise reduction model is trained for each preset musical instrument, and each preset musical instrument comprises at least one musical instrument classified into the same type based on its sounding mode and/or instrument type. Implementing the application effectively prevents the music signal from being damaged during noise reduction.

Description

Audio processing method, device, electronic equipment, storage medium and program product
Technical Field
The present application relates to the field of audio processing technology, and in particular, to an audio processing method, an apparatus, an electronic device, a storage medium, and a program product.
Background
In some music recording scenarios, environmental noise, such as vehicle sounds and wind noise, is easily introduced during recording because the recording environment is not quiet, and electronic noise may be introduced because the recording device itself is relatively simple. When performing noise reduction on a noisy music signal, it is very difficult to suppress the noise components while ensuring that the useful signal other than the noise components is not damaged.
In the prior art, music signals and noise signals have very similar characteristics, so the noise suppression effect is poor; moreover, because the music signal is distributed over a wide spectral range, it is easily damaged by mistake during suppression.
Disclosure of Invention
Embodiments of the present application provide an audio processing method, an apparatus, an electronic device, a storage medium, and a program product for solving at least one technical problem described above. The technical scheme is as follows:
in a first aspect, an embodiment of the present application provides an audio processing method, including:
acquiring a first signal, wherein the first signal is a music signal with noise generated by recording;
performing signal filtering on the first signal based on at least two pre-trained music noise reduction models to obtain at least two filtered second signals;
performing signal reconstruction on each second signal to obtain a noise-reduced target signal;
wherein each music noise reduction model is a deep neural network model trained on training samples corresponding to a preset musical instrument; a corresponding music noise reduction model is trained for each preset musical instrument, and each preset musical instrument comprises at least one musical instrument classified into the same type based on its sounding mode and/or instrument type.
In a possible embodiment, if the first signal includes a voice portion, the music noise reduction model further includes a deep neural network model trained by using training samples corresponding to the voice portion.
In a possible embodiment, performing signal filtering on the first signal based on the at least two pre-trained music noise reduction models to obtain at least two filtered second signals includes:
performing Fourier transform on the first signal to obtain spectral features;
filtering the spectral features in parallel based on the at least two pre-trained music noise reduction models to obtain at least two filtering signals;
and performing inverse Fourier transform on each filtering signal based on the spectral features and the phase information of the first signal to obtain the filtered second signals.
In a possible embodiment, the music noise reduction model includes a first fully connected unit, a convolutional network unit, a gated recurrent unit and a second fully connected unit which are connected in sequence;
filtering the spectral features in parallel based on the at least two pre-trained music noise reduction models to obtain at least two filtering signals comprises performing the following operations for each music noise reduction model:
encoding the spectral features through the first fully connected unit and the convolutional network unit to obtain high-dimensional features;
decoding the high-dimensional features through the gated recurrent unit and the second fully connected unit to obtain a filtering signal;
wherein the convolutional network unit comprises at least two convolutional layers connected in sequence; the input of each convolutional layer includes the output of the first fully connected unit; the input of the gated recurrent unit includes the output of the first fully connected unit and the output of the convolutional network unit; and the second fully connected unit comprises a fully connected layer with an activation function.
In a possible embodiment, performing signal filtering on the first signal based on the at least two pre-trained music noise reduction models to obtain at least two filtered second signals includes:
performing feature analysis on the first signal;
and if the feature analysis determines that the first signal includes the human voice and/or at least one first musical instrument, filtering the first signal based on the music noise reduction model corresponding to the human voice and/or the first musical instrument to obtain the filtered second signal corresponding to the human voice and/or the first musical instrument;
wherein the preset musical instruments include the at least one first musical instrument.
In a possible embodiment, the method further comprises at least one of:
if, before signal filtering, a filtering operation for a second musical instrument or the human voice is triggered in response to an operation object, indicating that during the signal filtering, filtering of the first signal by the music noise reduction model corresponding to the second musical instrument or the human voice is suspended;
if, before signal reconstruction, a filtering operation for the second musical instrument or the human voice is triggered in response to the operation object, indicating that during the signal reconstruction, the weight of the signal corresponding to the second musical instrument or the human voice among the second signals is set to a preset threshold;
wherein the preset musical instruments include the second musical instrument.
In a possible embodiment, the music noise reduction models further include deep neural network models trained on training samples corresponding to preset playing modes; a corresponding music noise reduction model is trained for each playing mode; the preset playing modes include at least one of solo, duet (part playing), ensemble, accompaniment, unison and lead playing;
the method further comprises:
if, before signal filtering, a noise reduction operation for a target playing mode of a third musical instrument is triggered in response to an operation object, indicating that during the signal filtering, the first signal is filtered based on the music noise reduction model corresponding to the third musical instrument and the music noise reduction model corresponding to the target playing mode, to obtain a second signal corresponding to the third musical instrument and a second signal corresponding to the target playing mode;
performing signal reconstruction on each second signal to obtain the noise-reduced target signal then includes:
performing signal reconstruction on the second signal corresponding to the third musical instrument and the second signal corresponding to the target playing mode by means of signal multiplication to obtain the noise-reduced target signal;
wherein the preset musical instruments include the third musical instrument.
In a second aspect, an embodiment of the present application provides an audio processing apparatus, including:
the acquisition module is used for acquiring a first signal, wherein the first signal is a noisy music signal generated by recording;
the filtering module is used for carrying out signal filtering on the first signal based on at least two pre-trained music noise reduction models to obtain at least two filtered second signals;
the reconstruction module is used for performing signal reconstruction on each second signal to obtain a noise-reduced target signal;
wherein each music noise reduction model is a deep neural network model trained on training samples corresponding to a preset musical instrument; a corresponding music noise reduction model is trained for each preset musical instrument, and each preset musical instrument comprises at least one musical instrument classified into the same type based on its sounding mode and/or instrument type.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory, where the processor executes the computer program to implement the steps of the audio processing method provided in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the audio processing method provided in the first aspect described above.
In a fifth aspect, an embodiment of the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the audio processing method provided in the first aspect.
The technical solutions provided by the embodiments of the present application have the following beneficial effects:
the embodiments of the present application provide an audio processing method, an apparatus, an electronic device, a computer-readable storage medium and a computer program product. Specifically, when the first signal is obtained, signal filtering can be performed on the first signal based on at least two pre-trained music noise reduction models to obtain at least two filtered second signals; signal reconstruction can then be performed on each second signal to obtain a noise-reduced target signal. The first signal is a noisy music signal generated by recording; each music noise reduction model is a deep neural network model trained on training samples corresponding to a preset musical instrument; a corresponding music noise reduction model is trained for each preset musical instrument, and each preset musical instrument comprises at least one musical instrument classified into the same type based on its sounding mode and/or instrument type. The application thus filters the different components of the first signal (the noisy music signal) in parallel using several music noise reduction models corresponding to different preset instruments, and reconstructs the several filtered second signals (clean signals) into the noise-reduced music signal. This addresses the problem that existing noise reduction algorithms cannot adapt to the diversity of music signals and therefore easily damage them: the mixed noisy music signal is filtered by models corresponding to different instruments, the filtered clean signals are obtained, and the clean signals are then reconstructed into the noise-reduced music signal, which solves the problem that existing noise reduction algorithms either fail to suppress the noise components in a music recording effectively or severely damage the music signal.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
FIG. 1a is a schematic diagram of an original music signal;
FIG. 1b is a schematic diagram of a result of denoising the music signal shown in FIG. 1a using a conventional denoising algorithm;
FIG. 2 is a schematic diagram of an operating environment of an audio processing method according to an embodiment of the present application;
fig. 3 is a flowchart of an audio processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an algorithm architecture according to an embodiment of the present application;
FIG. 5 is a block diagram of a deep neural network according to an embodiment of the present application;
FIG. 6 is a schematic view of a scenario provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of an interactive interface according to an embodiment of the present application;
FIG. 8 is a schematic diagram of another interactive interface provided by an embodiment of the present application;
fig. 9 is a schematic structural diagram of an audio processing device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings in the present application. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present application, and the technical solutions of the embodiments of the present application are not limited.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise. It will be further understood that the terms "comprises" and "comprising", when used in this specification, specify the presence of stated features, information, data, steps, operations, elements and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The term "and/or" indicates at least one of the items it joins; e.g., "A and/or B" may be implemented as "A", as "B", or as "A and B".
Related audio noise reduction algorithms require a specific noise model and signal type; for example, human voice and noise keep a certain characteristic distance in signal correlation and spectral distribution, and can be suppressed by some statistical noise reduction methods. However, across the much wider range of music scenes, some stationary music spectral features come close to stationary noise, and existing noise reduction algorithms easily misjudge such music signals as noise and damage them by mistake; as shown in figs. 1a and 1b, after the music signal is processed by an existing noise reduction algorithm, the damage to the music signal is severe. Existing noise reduction algorithms therefore have a poor noise suppression effect.
This scheme recognizes that music signals differ from traditional voice signals: there are many more types of music signals, including classical, rock, blues, electronic music and so on, and the audio characteristics of different music types differ markedly. An audio processing method is therefore proposed for noise reduction of music signals, to avoid damaging the music signal during noise reduction while adapting to the diversity of music signals.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Embodiments of the present application relate to artificial intelligence (AI). AI is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines can perceive, reason and make decisions. Artificial intelligence technology is a comprehensive subject spanning a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, intelligent transportation and other directions.
The audio processing method provided by the embodiments of the present application particularly relates to machine learning (ML). Machine learning is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration; for example, a noisy music signal can be filtered through a deep-learning-based music noise reduction model.
The audio processing method provided by the embodiment of the application can be applied to audio noise reduction scenes related to music signals.
The technical solutions of the embodiments of the present application and technical effects produced by the technical solutions of the present application are described below by describing several exemplary embodiments. It should be noted that the following embodiments may be referred to, or combined with each other, and the description will not be repeated for the same terms, similar features, similar implementation steps, and the like in different embodiments.
Fig. 2 is a schematic diagram of an operation environment of an audio processing method according to an embodiment of the present application, where the environment may include a terminal 20 and a server 10.
Wherein the terminal 20 may run a client or a service platform. Terminals (which may also be referred to as devices) may be, but are not limited to, smartphones, tablet computers, notebook computers, desktop computers, intelligent voice interaction devices (e.g., smart speakers), wearable electronic devices (e.g., smart watches), vehicle terminals, smart appliances (e.g., smart televisions), AR/VR devices, and the like. Optionally, the terminal 20 may execute the audio processing method provided by the embodiment of the present application, and in an example, after the operation object performs recording related to the music signal through the terminal 20, the terminal 20 may reduce noise on a music signal with noise generated by the recording by executing the audio processing method provided by the embodiment of the present application, and feed back the target signal after noise reduction to the operation object. Alternatively, the terminal 20 may transmit the noisy music signal and the noise-reduced target signal to the server 10 through the network 30 for storage.
The server 10 may execute the audio processing method provided in the embodiment of the present application. The server may be an independent physical server, a server cluster or a distributed system (such as a distributed cloud storage system) formed by a plurality of physical servers, or a cloud server for providing cloud computing services. In one example, the operator may record a sound related to a music signal through the terminal 20, and send the music signal with noise generated by the sound recording to the server 10 through the network 30, so as to obtain a target signal after noise reduction fed back by the server 10 after executing the audio processing method provided by the embodiment of the present application.
In a possible embodiment, the terminal 20 and the server 10 may be directly or indirectly connected through wired or wireless communication, which is not limited herein. Such as the terminal 20 may issue a music noise reduction request to the server 10 via the network 30.
In a possible embodiment, the operating environment may further include a database, which may be used to store noisy music signals received by the server 10 and/or target signals resulting from the noise reduction process.
The following describes the solution provided by the embodiments of the present application, which involves artificial intelligence technologies such as deep learning:
specifically, as shown in fig. 3, the method includes the following steps S101 to S103:
step S101: and acquiring a first signal, wherein the first signal is a music signal with noise generated by recording.
Specifically, the first signal may be any music signal that needs noise reduction; it may be a signal in which noise is actually present, or a music signal that the operation object considers to need noise reduction. In an example scenario, when the operation object records a violin, piano and cello performance in an open park, the music signal generated by the recording may contain, in addition to the instrument sounds played by the operation object, environmental sounds such as birdsong, the sound of flowing river water and instrument echoes. If the operation object chose this environment precisely in order to record the environmental sounds along with the performance, those sounds are not, in essence, noise; if only the instrument sounds are expected in the output, the environmental sounds count as noise. This can be set by the operation object when triggering the audio noise reduction processing. That is, in this scene the first signal is a music signal that includes both instrument playing sounds and environmental sounds.
Alternatively, the first signal may be acquired during recording (in real time) or after recording is completed (at any time thereafter). That is, the audio processing method provided by the embodiment of the application supports both online real-time processing and offline background processing.
Step S102: performing signal filtering on the first signal based on the at least two pre-trained music noise reduction models to obtain at least two filtered second signals.
Each music noise reduction model is a deep neural network model trained on training samples corresponding to a preset musical instrument; a corresponding music noise reduction model is trained for each preset musical instrument, and each preset musical instrument comprises at least one musical instrument classified into the same type based on its sounding mode and/or instrument type.
In a possible example, a network may be trained for each individual instrument to obtain a music noise reduction model per instrument, which can improve the purity of the noise-reduced music signal output when that model filters the first signal; that is, the output of each music noise reduction model covers the music signal of exactly one instrument, which benefits the suppression effect of the music noise reduction. In this case, a preset musical instrument corresponds to one instrument.
In a possible example, considering the diversity of musical instruments, training one network per instrument not only consumes a large amount of resources in the model training phase; in the model application phase, an extremely large number of models would have to filter the first signal in parallel, so the resource consumption of signal filtering would also be very large (when all models filter the first signal synchronously, the transient computation load is very high; the signals could instead be filtered in batches to reduce the resources occupied at any one time, but that takes relatively long). Therefore, in the embodiment of the application, similar instruments are classified into one type based on the sounding mode (sounding principle) and/or the instrument type, and instruments of the same type share one network, which effectively reduces the resources consumed in model training and model application while preserving the music noise reduction effect. It follows that in the embodiment of the application a preset musical instrument does not correspond to one specific instrument but to one type of instrument; i.e., the output of a music noise reduction model may correspond to the music signal of one or more instruments.
The following describes the classification of musical instruments in the embodiment of the present application:
in particular, the musical instruments may be classified into string musical instruments, woodwind musical instruments, brass musical instruments, keyboard musical instruments, and percussion musical instruments based on the type of musical instrument.
String instruments can be divided into bowed string instruments and plucked string instruments based on their sounding mode. The bowed string instruments may be of the violin family, such as the Violin, Viola and Cello; the plucked string instruments may be the Guitar, Electric Guitar, Harp, and so on.
Woodwind instruments can be divided into lip-sounded (edge-blown) instruments and reed instruments based on their sounding mode. The lip-sounded instruments may be the Flute, Piccolo, and so on; the reed instruments may be the Clarinet, Oboe, English Horn, Bassoon, Saxophone, and so on.
Brass instruments may include the Trumpet, Cornet, Trombone, French Horn, Tuba, and so on.
Keyboard instruments may include the Piano, Organ, Accordion, Electronic Keyboard, and so on.
Percussion instruments are divided into tuned percussion instruments and untuned percussion instruments according to type; the tuned percussion instruments may include the Timpani, Xylophone, and so on; the untuned percussion instruments may include the Snare Drum, Bass Drum, Triangle, Tambourine, Castanets, Maracas, Cymbals, Gong, and so on.
Optionally, considering that Chinese and Western instruments differ to some extent in how they produce sound, their corresponding effective features may also differ to some extent; therefore, in order to train music noise reduction models more closely matched to the scene, the above classification may be further subdivided according to whether an instrument is Chinese or Western.
Chinese instruments, i.e., traditional instruments, can be divided into wind instruments, plucked instruments, percussion instruments and bowed string instruments. The wind instruments may be the Dizi (bamboo flute), Xiao (vertical bamboo flute), Paixiao (panpipes), and so on; the plucked instruments may be the Liuqin, Pipa, Yueqin, Guqin, Guzheng, and so on; the percussion instruments may be drums and so on; the bowed string instruments may be the Erhu, the horsehead fiddle, and so on.
Given the above classification, the range of instruments covered is very wide. When selecting instruments for model training, historical noise reduction data can be obtained and the usage rate of each instrument in that data analyzed; after the more frequently used instruments are selected, they are classified with the above method, and a music noise reduction model is then trained for the preset instruments of each class.
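As a purely illustrative aid, the grouping described above can be represented as a mapping from preset-instrument classes to member instruments; the class names and membership below are example assumptions, not a taxonomy fixed by the application:

```python
# Example grouping by sounding mode and/or instrument type; one music
# noise reduction model would be trained per key (per preset instrument).
PRESET_INSTRUMENT_CLASSES = {
    "bowed_strings":      ["Violin", "Viola", "Cello"],
    "plucked_strings":    ["Guitar", "Electric Guitar", "Harp"],
    "lip_sounded_winds":  ["Flute", "Piccolo"],
    "reed_winds":         ["Clarinet", "Oboe", "English Horn", "Bassoon", "Saxophone"],
    "brass":              ["Trumpet", "Cornet", "Trombone", "French Horn", "Tuba"],
    "keyboards":          ["Piano", "Organ", "Accordion", "Electronic Keyboard"],
    "tuned_percussion":   ["Timpani", "Xylophone"],
    "untuned_percussion": ["Snare Drum", "Bass Drum", "Triangle", "Cymbals", "Gong"],
}
```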
In the signal filtering process, as shown in fig. 4, if N music noise reduction models are currently included, the first signal may be concurrently filtered using the N music noise reduction models. It will be understood that, assuming that the music noise reduction model 1 is a deep neural network model corresponding to the preset musical instrument 1, when the first signal is filtered by the music noise reduction model 1, a clean music signal corresponding to the preset musical instrument 1 is output.
Step S103: performing signal reconstruction on each second signal to obtain a noise-reduced target signal;
specifically, when the N music noise reduction models are adopted to perform parallel filtering on the first signals, N second signals are output correspondingly, and signal reconstruction can be performed on the N second signals in a linear superposition mode, so that noise-reduced target signals generated based on the N second signals are obtained.
In a possible embodiment, if the first signal includes a voice portion, the music noise reduction model further includes a deep neural network model trained using training samples corresponding to the voice portion.
Specifically, in some music recording applications the operation object may sing over a played accompaniment, in which case the recorded signal may include human voice, music, environmental sound and so on (as shown in fig. 6). Since the human voice part is not noise in this recording scene, in order to better adapt to noisy music signals that contain a human voice part, a music noise reduction model can be trained correspondingly for the human voice part; that is, this model outputs the clean signal of the human voice part when filtering the first signal.
The following describes details of signal filtering in the embodiments of the present application.
In a possible embodiment, performing signal filtering on the first signal in step S102 based on the at least two pre-trained music noise reduction models to obtain at least two filtered second signals includes the following steps A1-A3:
Step A1: performing Fourier transform on the first signal to obtain spectral features.
Specifically, before signal filtering with the music noise reduction models, the acquired first signal may be transformed, specifically by Fourier transform, or by fast Fourier transform (FFT), converting the signal from the time domain to the frequency domain; the spectral features extracted in this process may be the power spectrum, the mel-frequency cepstrum, discrete cosine coefficients, and the like.
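For illustration, a sketch of such a front end is shown below, assuming a power-spectrum feature; the frame length, hop size and window are example values not specified by the application:

```python
import numpy as np

def power_spectrum_frames(signal, frame_len=512, hop=256):
    """Frame the signal, apply the FFT, and return per-frame power
    spectra plus the phases needed later for reconstruction."""
    window = np.hanning(frame_len)
    powers, phases = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame)           # time domain -> frequency domain
        powers.append(np.abs(spectrum) ** 2)    # power spectrum feature
        phases.append(np.angle(spectrum))       # phase of the first signal
    return np.array(powers), np.array(phases)
```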
Step A2: filtering the spectral features in parallel based on the at least two pre-trained music noise reduction models to obtain at least two filtering signals.
Specifically, after step A1, the extracted spectral features may be used as the network input features of each music noise reduction model. As shown in fig. 4, the input to each music noise reduction model is identical, while each model corresponds to a different preset instrument, so each model's output differs (each is the clean signal of a different preset instrument).
The following describes the network structure of the music noise reduction model and the processing procedure of the signal:
alternatively, as shown in fig. 5, the music noise reduction model may include a first fully connected unit, a convolutional network unit, a gated recurrent unit and a second fully connected unit, which are connected in sequence.
The first fully connected unit may be formed of a fully connected (FC) layer.
The convolutional network unit may comprise at least two convolutional layers (CONV1D) connected in sequence; the input of each convolutional layer includes the output of the first fully connected unit.
The gated recurrent unit (GRU) may comprise a plurality of gated recurrent networks. A gated recurrent unit introduces a reset gate and an update gate. In the embodiment of the application, the input of the first-stage gated recurrent unit includes the output of the first fully connected unit and the output of the convolutional network unit; the input of each subsequent gated recurrent unit is the output of the preceding gated recurrent unit.
The second fully connected unit comprises a fully connected layer with an activation function; the activation function may be a sigmoid function, matching the configuration of the gated recurrent unit.
In a possible example, filtering the spectral features in parallel in step A2 based on the at least two pre-trained music noise reduction models to obtain at least two filtering signals includes performing the following steps A21-A22 for each music noise reduction model:
step A21: and encoding the frequency spectrum characteristics through the first full connection unit and the convolution network unit to obtain high-dimensional characteristics.
Step A22: and decoding the high-dimensional characteristics through the gating circulating unit and the second full-connection unit to obtain a filtered signal.
Specifically, as shown in fig. 5, the network generates a high-dimensional characteristic for the input power spectrum through the first full-connection unit and the convolution network unit, and outputs a power spectrum gain value (filtering signal) of each frequency point of the frequency domain through the multi-stage gating circulation unit and the second full-connection unit.
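A hypothetical sketch of this structure follows, in PyTorch; the layer widths, kernel sizes and activation placement are assumptions, and only the wiring (first FC unit, conv layers that also see the FC output, a GRU fed the FC and conv outputs, and a sigmoid FC emitting per-bin gains) follows the text:

```python
import torch
import torch.nn as nn

class MusicDenoiseNet(nn.Module):
    """Sketch of: first FC unit -> conv network unit (each conv layer also
    receives the FC output) -> multi-stage GRU (fed the FC and conv
    outputs) -> second FC unit with sigmoid, yielding per-bin gains."""
    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.fc_in = nn.Linear(n_bins, hidden)
        self.conv1 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(2 * hidden, hidden, kernel_size=3, padding=1)
        self.gru = nn.GRU(2 * hidden, hidden, num_layers=2, batch_first=True)
        self.fc_out = nn.Linear(hidden, n_bins)

    def forward(self, power_spec):                     # (batch, frames, bins)
        x = torch.relu(self.fc_in(power_spec))         # first FC unit
        c = x.transpose(1, 2)                          # (batch, hidden, frames)
        c1 = torch.relu(self.conv1(c))
        # Skip connection: the second conv layer also sees the FC output.
        c2 = torch.relu(self.conv2(torch.cat([c1, c], dim=1)))
        conv_out = c2.transpose(1, 2)
        # The GRU input includes the FC output and the conv output.
        h, _ = self.gru(torch.cat([x, conv_out], dim=2))
        return torch.sigmoid(self.fc_out(h))           # per-bin gains in [0, 1]
```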
Step A3: performing inverse Fourier transform on each filtering signal based on the spectral features and the phase information of the first signal to obtain the filtered second signals.
Specifically, as shown in fig. 5, the frequency bin gains (the filtering signal) are multiplied by the power spectrum of the input signal (the first signal) to obtain the filtered signal power spectrum, and the filtered second signal is obtained by inverse fast Fourier transform (IFFT) combined with the phase information of the first signal.
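Continuing the sketch above, the back end might look as follows; the overlap-add reconstruction and the square-root recovery of magnitudes from the power spectrum are illustrative assumptions:

```python
import numpy as np

def apply_gains_and_reconstruct(gains, powers, phases, frame_len=512, hop=256):
    """Multiply per-bin gains with the input power spectrum, restore the
    first signal's phase, and invert frame by frame (IFFT + overlap-add)."""
    filtered_power = gains * powers                    # filtered power spectrum
    magnitudes = np.sqrt(np.maximum(filtered_power, 0.0))
    out = np.zeros(hop * (len(powers) - 1) + frame_len)
    for i, (mag, ph) in enumerate(zip(magnitudes, phases)):
        frame = np.fft.irfft(mag * np.exp(1j * ph), n=frame_len)
        out[i * hop:i * hop + frame_len] += frame      # overlap-add
    return out
```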
Alternatively, the fourier transform and inverse fourier transform may be performed with reference to the related art, and will not be described herein.
In a possible embodiment, performing signal filtering on the first signal in step S102 based on the at least two pre-trained music noise reduction models to obtain at least two filtered second signals includes steps B1-B2:
Step B1: performing feature analysis on the first signal.
Step B2: if the feature analysis determines that the first signal includes the human voice and/or at least one first musical instrument, filtering the first signal based on the music noise reduction model corresponding to the human voice and/or the first musical instrument to obtain the filtered second signal corresponding to the human voice and/or the first musical instrument; wherein the preset musical instruments include the at least one first musical instrument.
Specifically, fig. 7 and 8 show two interfaces. The interface in fig. 7 shows the default (built-in) preset instruments of the music noise reduction algorithm before any feature analysis has been performed on the first signal (a display area corresponding to the human voice part may also be shown). Because there may be many default preset instruments, not all of them may fit in the interface; the operation object can view the preset instruments not currently displayed by a sliding operation on the interface (for example, in fig. 7 only part of the display area of preset instrument 5 is shown, and the operation object can reveal the whole area by sliding). The interface in fig. 8 shows the first instruments contained in the analysis result after feature analysis of the first signal. It will be appreciated that the preset instruments include the analyzed first instruments.
Comparing fig. 7 and fig. 8: before feature analysis of the first signal, the first signal must be filtered in parallel by the N music noise reduction models corresponding to all N preset instruments; after feature analysis, the first signal only needs to be filtered in parallel by the 3 models corresponding to the 3 detected preset instruments. The filtering workload is thus greatly reduced, which effectively lowers the resources consumed by signal filtering. In addition, performing feature analysis on the first signal before filtering determines in advance which instruments and/or human voice may be involved in the first signal, which improves the suppression effect of the subsequent signal filtering and effectively reduces the instability introduced by applying the music noise reduction models.
Alternatively, the feature analysis operation may also be implemented by a deep network model. If the noisy music signals used for signal filtering contain preset instruments and human voice, training samples can be constructed from the feature information corresponding to the preset instruments and the human voice, and a deep neural network model trained on them to obtain the feature analysis model. It will be appreciated that the feature analysis model analyzes which signal types (e.g., a preset instrument or the human voice) may be present in the first signal; the feature analysis model can therefore be regarded as a classifier, and correspondingly each music noise reduction model as a filter.
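A minimal sketch of this classifier-then-filter arrangement is given below; the classifier object, its predict method and the dictionary of models are hypothetical placeholders:

```python
def select_models(first_signal, classifier, model_bank):
    """Feature analysis as a gate: predict which signal types (preset
    instruments and/or human voice) the first signal contains, then run
    only the matching music noise reduction models in parallel."""
    detected = classifier.predict(first_signal)   # e.g. {"bowed_strings", "voice"}
    return [model_bank[t] for t in detected if t in model_bank]
```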
Some possible operations involving interaction in the method provided by the embodiment of the present application are described below:
in a possible embodiment, the audio processing method provided in the embodiment of the present application further includes at least one of the following steps C1 to C2:
step C1: if the response operation object triggers filtering operation for the second musical instrument or the human voice before the signal filtering, the filtering of the first signal by the music noise reduction model corresponding to the second musical instrument or the human voice is stopped in the signal filtering.
Step C2: if the filtering operation is triggered for the second musical instrument or the voice in response to the operation object before the signal reconstruction, the method indicates that in the signal reconstruction, the weight of the signal corresponding to the second musical instrument or the voice in the second signal is set to be a preset threshold.
Wherein the preset musical instrument includes a second musical instrument.
Specifically, when triggering signal filtering, the operation object may also instruct that a certain preset instrument or the human voice be filtered out; that is, the operation object wants the noise-reduced target signal to exclude the signal of the designated second instrument or human voice part.
In an example, as shown in fig. 7, the operation object may select a preset instrument 2 (the selected object is shown by a dashed box) as an instrument corresponding to the instrument signal to be filtered out in the first signal in the default page.
In an example, as shown in fig. 8, the operation object may select a preset instrument 3 (the selected object is shown by a dashed box) in the analysis result page as an instrument corresponding to the instrument signal to be filtered out in the first signal.
Optionally, the default page or the analysis result page may also display a display area corresponding to the human voice, and the operation object may select the human voice part as the signal part to be filtered out of the first signal; for example, if the audience watching the performance talks during it, the recorded signal may include a human voice part, and that voice counts as noise.
Alternatively, the operation object may switch the currently displayed page content by triggering the "default" or "analysis result" control; fig. 7 shows the default page displayed when the operation object triggers the "default" control (the selected control is shown with a dotted line), and fig. 8 shows the analysis result page displayed when the operation object triggers the "analysis result" control (the selected control is shown with a dotted line).
In an example, if the filtering operation by the operation object occurs before signal filtering, the music noise reduction model corresponding to the second instrument or the human voice may be set to an unavailable state when the first signal is filtered, so that the first signal is not processed by that model; as a result, the output second signals contain no clean signal corresponding to the second instrument or the human voice.
In an example, if the filtering operation by the operation object occurs before signal reconstruction (specifically, after signal filtering and before signal reconstruction), the weight of the second signal corresponding to the second instrument or the human voice may be reset to a preset threshold (e.g., 0) when the output second signals are reconstructed. For example: assume the currently output second signals comprise three clean signals corresponding to preset instrument 1, preset instrument 3 and preset instrument 6, and that preset instrument 3 is the object of the operation object's filtering operation; when signal reconstruction is performed by linear superposition, the following exemplary calculation may be used:
target signal = second signal of preset instrument 1 + 0 × second signal of preset instrument 3 + second signal of preset instrument 6.
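The weighted linear superposition in this example can be sketched as follows; the weight vector is an assumption mirroring the calculation above, with 0 for the filtered-out preset instrument 3:

```python
def reconstruct_weighted(second_signals, weights):
    """Weighted linear superposition; a filtered-out instrument or human
    voice is given weight 0 (the preset threshold in the example)."""
    return sum(w * s for w, s in zip(weights, second_signals))

# Example from the text: instruments 1, 3 and 6, with instrument 3 filtered out.
# target_signal = reconstruct_weighted([s1, s3, s6], weights=[1.0, 0.0, 1.0])
```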
In an example, if the filtering operation by the operation object occurs before signal filtering (which, in processing order, is also before signal reconstruction), either or both of steps C1 and C2 above may be performed.
Optionally, the operation object may select one or more preset instruments as needed to trigger the filtering operation, as long as this happens before signal filtering or signal reconstruction is performed.
In a possible embodiment, the music noise reduction models further include deep neural network models trained on training samples corresponding to preset playing modes; a corresponding music noise reduction model is trained for each playing mode; the preset playing modes include at least one of solo, duet (part playing), ensemble, accompaniment, unison and lead playing.
Solo refers to a performance by a single instrument. Duet (part playing) refers to playing by at least two instruments (which may belong to the same preset instrument). Ensemble refers to the performance of a multi-part musical composition by many instruments. Accompaniment refers to an instrument or band playing the introduction and backing that set off the main melody during singing or an instrumental solo. Unison means that two or more instruments play the same melody at the same time. Lead playing refers to one or more instruments leading during an ensemble.
Specifically, the audio processing method provided by the embodiment of the present application further includes step D1:
step D1: if the noise reduction operation is triggered by the response operation object aiming at the target playing mode of the third musical instrument before the signal filtering, the first signal is indicated to be filtered based on the music noise reduction model corresponding to the third musical instrument and the music noise reduction model corresponding to the target playing mode in the signal filtering, and a second signal corresponding to the third musical instrument and a second signal corresponding to the target playing mode are obtained; wherein the preset musical instrument includes a third musical instrument.
Specifically, when the noise reduction processing is performed on the first signal, the operation object may select a corresponding playing mode to perform the noise reduction operation at the same time when at least one of the preset musical instruments is selected; the noise reduction operation instruction only carries out filtering processing aiming at a third musical instrument selected by an operation object and a corresponding target playing mode in signal filtering, so as to obtain a pure signal corresponding to the third musical instrument after filtering and obtain a pure signal corresponding to the target playing mode.
In the embodiment of the present application, the music noise reduction model corresponding to the preset musical instrument and the music noise reduction corresponding to the preset playing mode are independent from each other, so that considering that the same playing mode is adopted by a plurality of different preset musical instruments possibly included in the same section of first signals, a signal reconstruction method is adaptively provided, specifically, in step S103, signal reconstruction is performed on each second signal, and a noise-reduced target signal is obtained, which includes the steps D2:
step D2: and carrying out signal reconstruction on the second signal corresponding to the third musical instrument and the second signal corresponding to the target playing mode in a signal multiplication mode to obtain a noise-reduced target signal.
Specifically, when the second signal corresponding to the third musical instrument and the second signal corresponding to the target playing mode are subjected to signal reconstruction by means of signal multiplication, the signal unrelated to the second signal corresponding to the third musical instrument can be removed from the second signal corresponding to the target playing mode, and finally the target signal corresponding to the target playing mode of the third musical instrument after noise reduction is obtained.
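One way to read this signal-multiplication reconstruction is as masking; the sketch below treats the playing-mode second signal as a mask over the instrument's second signal, and the normalization used here is an assumption, since the text does not specify one:

```python
import numpy as np

def reconstruct_by_multiplication(instrument_signal, mode_signal, eps=1e-8):
    """Keep only the parts of the instrument's clean signal that the
    playing-mode model also passed (e.g. the violin's solo passages)."""
    mask = np.abs(mode_signal) / (np.max(np.abs(mode_signal)) + eps)
    return instrument_signal * mask
```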
The following provides a feasible application example for the audio processing method provided by the embodiment of the present application:
application example I (execution subject of Audio processing method is terminal)
Implementation background: the operation object A performs a road show at a certain commercial center, and the program of the road show comprises violin playing under accompaniment; and in the course of the director, live broadcasting is also carried out through a certain live broadcasting platform.
When a director starts, an operation object A shoots a live broadcast through a smart phone, and in consideration of feedback sound of a live broadcast with a spectator, the audio data recorded by the smart phone accompanies environmental noise besides accompaniment and violin sound played by the operation object A, so that the operation object A synchronously starts an audio noise reduction mode when the live broadcast is carried out, a mobile phone end executes an audio processing method aiming at a first signal acquired in real time, and a noise reduction target signal is output to a live broadcast channel.
Optionally, since the operation object a does not need to introduce a program in the process of the director and directly broadcasts the played content, when the operation object a sets audio to reduce noise, filtering operation can be performed on the voice part through selection, so that when the background of the mobile phone end filters the first signal through the music noise reduction model, the background of the mobile phone end pauses to filter the first signal through the music noise reduction model corresponding to the voice, and finally, the output target signal does not have the signal of the voice part.
Alternatively, since the accompaniment is selected and prepared by the operation object a, the operation object a may learn in advance what kind of instrument sound is included in the accompaniment, and therefore, when triggering to perform audio noise reduction, the operation object a may select a corresponding preset instrument in the page content shown in fig. 7 to perform filtering, such as selecting a preset instrument corresponding to the instrument included in the accompaniment and a preset instrument corresponding to the violin, so that the first signal is filtered only through the music noise reduction model corresponding to the selected preset instrument, and finally, a target signal with a better suppression effect is obtained after the signal is reconstructed.
Optionally, in the first application example, since the noise reduction algorithm is implemented at the mobile phone end and the mobile phone end performs live broadcasting synchronously, if a default all music noise reduction models (K) are adopted to perform parallel filtering on the first signal, a problem of live broadcasting and blocking may occur in a short time due to insufficient resources, in order to avoid the occurrence of the problem, the background of the mobile phone end may actively trigger to perform feature analysis before filtering the first signal, first determine which music signals of preset musical instruments may be involved in the first signal, and then perform parallel filtering (where M is less than or equal to K) on the first signal with the music noise reduction models corresponding to M preset musical instruments involved in the analysis result, so as to effectively reduce resources and calculation amount consumed by signal filtering and signal reconstruction.
Alternatively, in the above-described application example one, the operation object a involves two times of solo and one time of ensemble with other musical instruments when performing violin performance, that is, two times of solo belonging to violin in the first signal; after live broadcasting is finished, the operation object A hopefully can filter the solo part of the violin after acquiring the recording signal, and at the moment, the operation object A can trigger the noise reduction operation in the audio noise reduction mode according to the solo playing mode of a preset musical instrument to which the violin belongs; for example, the operation object a may select a preset musical instrument Q including a violin in the interface shown in fig. 7, and then select a solo mode from a selection frame (the selection frame is not shown, and may specifically be displayed in a pop-up window, a suspension or a drop-down frame or may jump to other page display in the page shown in fig. 7) for displaying a playing mode, and trigger to perform noise reduction; at this time, the background at the mobile phone end respectively filters the first signals by adopting a music noise reduction model 1 corresponding to a preset musical instrument Q and a music noise reduction model 2 corresponding to a solo mode, wherein the music noise reduction model comprises a violin, so as to obtain a second signal 1 corresponding to the preset musical instrument Q and a second signal 2 corresponding to the solo mode; and then, the second signal 1 and the second signal 2 are subjected to signal reconstruction in a signal multiplication mode, so that after signals of the violin and other instruments are eliminated, a target signal corresponding to the solo of the violin can be obtained.
Application example II (the execution subject of the audio processing method is a server, and the terminal sends the first signal to the server through the network)
Implementation background: when operation object B shares a recorded song through a short video application, operation object B finds that, besides the accompaniment and its own voice, the recording contains environmental noise (such as the hum of an outdoor air-conditioning unit or the sound of passing vehicles). To improve the quality of the shared recording, operation object B enables the audio noise reduction mode when triggering the upload, so that the recorded song (the first signal) is uploaded to the server; after executing the audio processing method, the server publishes the recording formed from the noise-reduced target signal on the short video sharing platform.
Specifically, after obtaining the first signal, the server can filter it with every pre-trained music noise reduction model to obtain the clean signal output by each model. Since the first signal may lack the characteristic information of some preset instruments, part of the models may produce an empty output (i.e., some second signals are empty). In that case, signal reconstruction of the second signals output by the music noise reduction models can be performed directly by linear superposition, so that the noise-reduced target signal is obtained simply and quickly; a minimal sketch is given below.
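A sketch of this linear-superposition reconstruction; treating empty model outputs as `None` or all-zero arrays is an illustrative assumption.

```python
import numpy as np

def reconstruct_by_superposition(second_signals):
    """Sum the per-model clean outputs into the noise-reduced target signal.

    Models whose preset instrument is absent from the recording are assumed
    to output nothing (None) or silence, so they can simply be skipped; the
    remaining clean components add back up to the full denoised mix.
    """
    valid = [s for s in second_signals if s is not None and np.any(s)]
    if not valid:
        raise ValueError("no model produced a non-empty second signal")
    return np.sum(valid, axis=0)
```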
After the server performs the noise reduction processing to obtain the target signal, it can feed the target signal back to the terminal; once operation object B confirms that the noise reduction result is acceptable, the noise-reduced recording is published on the short video sharing platform.
Optionally, if the song that the operation object wants to share should present an a cappella effect, a noise reduction operation targeting only the human voice part can be set when the audio noise reduction mode is triggered, so that the server filters the first signal with the music noise reduction model corresponding to the human voice alone, yielding a clean voice signal (i.e., the target signal). Since only one music noise reduction model filters the first signal, no signal reconstruction of its output is required. That is, the audio processing method provided by the embodiment of the application can be applied not only to audio noise reduction scenarios but also to signal screening scenarios.
It should be noted that, in the optional embodiments of the present application, when the above embodiments are applied to specific products or technologies, the collection, use and processing of the related data (such as data related to the first signal, the second signal and the target signal) require the user's permission or consent and must comply with the relevant laws, regulations and standards of the relevant countries and regions. That is, if data related to a subject is involved in the embodiments of the present application, the authorization and consent of the subject must be obtained and the relevant laws, regulations and standards complied with.
An embodiment of the present application provides an audio processing apparatus. As shown in fig. 9, the audio processing apparatus 100 may include: an acquisition module 101, a filtering module 102 and a reconstruction module 103.
The acquisition module 101 is configured to acquire a first signal, where the first signal is a music signal with noise generated by recording; the filtering module 102 is configured to perform signal filtering on the first signal based on at least two pre-trained music noise reduction models to obtain at least two filtered second signals; the reconstruction module 103 is configured to perform signal reconstruction on each second signal to obtain a noise-reduced target signal. Each music noise reduction model comprises a deep neural network model trained with training samples corresponding to a preset musical instrument; a corresponding music noise reduction model is trained for each preset musical instrument, and each preset musical instrument comprises at least one musical instrument classified into the same type based on the sounding mode and/or the type of the musical instrument.
In a possible embodiment, if the first signal includes a human voice part, the music noise reduction models further include a deep neural network model trained with training samples corresponding to the human voice.
In a possible embodiment, the filtering module 102, when configured to perform signal filtering on the first signal based on the at least two pre-trained music noise reduction models, is further configured to:
perform a Fourier transform on the first signal to obtain spectral features;
filter the spectral features in parallel based on the at least two pre-trained music noise reduction models to obtain at least two filtered signals;
and perform an inverse Fourier transform on each filtered signal based on the spectral features and the phase information of the first signal to obtain the filtered second signals. A minimal sketch of this pipeline is given below.
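The following sketch illustrates the transform-filter-inverse pipeline; the hypothetical models exposing a `filter` method over magnitude spectrograms, as well as the frame length, hop size and windowing, are illustrative choices not specified in this application.

```python
import numpy as np

def filter_first_signal(x, models, n_fft=512, hop=256):
    """STFT -> per-model magnitude filtering -> inverse STFT.

    The noisy signal's own phase is reused for the inverse transform, as
    described above; overlap-add with a Hann analysis window at 50% overlap
    approximately reconstructs each filtered waveform.
    """
    window = np.hanning(n_fft)
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    spec = np.fft.rfft(frames * window, axis=-1)         # spectral features
    magnitude, phase = np.abs(spec), np.angle(spec)

    second_signals = []
    for model in models:
        filtered_mag = model.filter(magnitude)           # one model, one filtering
        filtered_spec = filtered_mag * np.exp(1j * phase)  # reuse noisy phase
        frames_out = np.fft.irfft(filtered_spec, n=n_fft, axis=-1)
        y = np.zeros(len(x))
        for i, frame in enumerate(frames_out):           # overlap-add synthesis
            y[i * hop:i * hop + n_fft] += frame
        second_signals.append(y)
    return second_signals
```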
In a possible embodiment, the music noise reduction model comprises a first fully connected unit, a convolutional network unit, a gated recurrent unit and a second fully connected unit connected in sequence;
the filtering module 102, when configured to filter the spectral features in parallel based on the at least two pre-trained music noise reduction models to obtain at least two filtered signals, is further configured to perform the following operations for each music noise reduction model:
encode the spectral features through the first fully connected unit and the convolutional network unit to obtain high-dimensional features;
and decode the high-dimensional features through the gated recurrent unit and the second fully connected unit to obtain a filtered signal.
The convolutional network unit comprises at least two convolutional layers connected in sequence; the input of each convolutional layer includes the output of the first fully connected unit; the input of the gated recurrent unit includes the output of the first fully connected unit and the output of the convolutional network unit; the second fully connected unit comprises a fully connected layer with an activation function. A minimal sketch of this architecture is given below.
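A minimal PyTorch sketch of this connection pattern; the layer widths, kernel sizes, the additive form of the skip connections and the sigmoid-mask output are illustrative assumptions, and only the unit order and the described inputs come from the text above.

```python
import torch
import torch.nn as nn

class MusicDenoiseNet(nn.Module):
    """First FC unit -> convolutional unit -> gated recurrent unit -> second FC unit."""

    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.fc_in = nn.Linear(n_bins, hidden)            # first fully connected unit
        self.conv1 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.gru = nn.GRU(hidden * 2, hidden, batch_first=True)
        self.fc_out = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, mag):                               # mag: (batch, frames, bins)
        h = torch.relu(self.fc_in(mag))                   # encode to high dimension
        c = h.transpose(1, 2)                             # (batch, hidden, frames)
        c1 = torch.relu(self.conv1(c) + c)                # conv layer also sees FC output
        c2 = torch.relu(self.conv2(c1) + c)               # additive skip from FC output
        g_in = torch.cat([h, c2.transpose(1, 2)], dim=-1) # FC output + conv output
        g, _ = self.gru(g_in)                             # decode temporally
        mask = self.fc_out(g)                             # FC layer with activation
        return mask * mag                                 # filtered magnitude spectrum
```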
In a possible embodiment, the filtering module 102, when configured to perform signal filtering on the first signal based on the at least two pre-trained music noise reduction models, is further configured to:
perform feature analysis on the first signal;
if the feature analysis determines that the first signal includes the human voice and/or at least one first musical instrument, filter the first signal based on the music noise reduction model corresponding to the human voice and/or the first musical instrument to obtain the filtered second signal corresponding to the human voice and/or the first musical instrument;
wherein the preset musical instruments include the at least one first musical instrument.
In a possible embodiment, the apparatus 100 further comprises a response module configured to perform at least one of the following:
in response to an operation object triggering a filtering-out operation for a second musical instrument or the human voice before signal filtering, instructing that filtering of the first signal by the music noise reduction model corresponding to the second musical instrument or the human voice be suspended during signal filtering;
in response to an operation object triggering a filtering-out operation for the second musical instrument or the human voice before signal reconstruction, instructing that, during signal reconstruction, the weight of the signal corresponding to the second musical instrument or the human voice among the second signals be set to a preset threshold (a sketch of such weighted reconstruction follows this list);
wherein the preset musical instruments include the second musical instrument.
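A sketch of the weighted reconstruction; the component naming, the default weight of 1.0 and taking the preset threshold to be 0.0 are illustrative assumptions.

```python
import numpy as np

def reconstruct_with_weights(second_signals, muted=(), mute_weight=0.0):
    """Weighted superposition; filtered-out components get the preset threshold.

    `second_signals` maps a component name (a preset instrument or the human
    voice) to its filtered signal; components named in `muted` receive the
    preset-threshold weight, removing them from the target signal.
    """
    target = None
    for name, signal in second_signals.items():
        w = mute_weight if name in muted else 1.0
        term = w * signal
        target = term if target is None else target + term
    return target
```

For example, `reconstruct_with_weights(signals, muted=("drums",))` would rebuild the mix with the (hypothetical) drum component suppressed.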
In a possible embodiment, the music noise reduction models further include deep neural network models trained with training samples corresponding to preset playing modes; a corresponding music noise reduction model is trained for each playing mode; the preset playing modes include at least one of solo, duet, ensemble, accompaniment, unison and lead playing;
the response module of the apparatus 100 is further configured to: in response to an operation object triggering a noise reduction operation for a target playing mode of a third musical instrument before signal filtering, instruct that, during signal filtering, the first signal be filtered based on the music noise reduction model corresponding to the third musical instrument and the music noise reduction model corresponding to the target playing mode, so as to obtain a second signal corresponding to the third musical instrument and a second signal corresponding to the target playing mode;
the reconstruction module 103, when configured to perform signal reconstruction on each second signal to obtain the noise-reduced target signal, is further configured to:
perform signal reconstruction on the second signal corresponding to the third musical instrument and the second signal corresponding to the target playing mode by signal multiplication to obtain the noise-reduced target signal;
wherein the preset musical instruments include the third musical instrument.
The apparatus of the embodiments of the present application can perform the method provided by the embodiments of the present application, and its implementation principle is similar; the actions performed by the modules of the apparatus correspond to the steps of the method, and detailed functional descriptions of the modules can be found in the descriptions of the corresponding method shown above, which are not repeated here.
The first signal, the second signal, the target signal and the like in the embodiments of the present application may be stored using blockchain technology. A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. Essentially a decentralized database, a blockchain is a chain of data blocks generated in association using cryptographic methods, each block containing a batch of processed data used to verify the validity (tamper resistance) of its information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer and the like.
An embodiment of the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory, the processor executing the computer program to implement the steps of the audio processing method. Compared with the related art, the following can be achieved: when the first signal is obtained, signal filtering can be performed on the first signal based on at least two pre-trained music noise reduction models to obtain at least two filtered second signals; signal reconstruction can then be performed on each second signal to obtain a noise-reduced target signal. The first signal is a music signal with noise generated by recording; each music noise reduction model comprises a deep neural network model trained with training samples corresponding to a preset musical instrument; a corresponding music noise reduction model is trained for each preset musical instrument, and each preset musical instrument comprises at least one musical instrument classified into the same type based on the sounding mode and/or the type of the musical instrument. The implementation of the application filters different components of the first signal (the music signal with noise) in parallel with multiple music noise reduction models corresponding to different preset musical instruments, and reconstructs the filtered clean second signals into the noise-reduced music signal. This addresses the inability of existing noise reduction algorithms to adapt to the diversity of music signals, which makes music signals easy to damage: the noisy mixed music signal is filtered through the models corresponding to different musical instruments to obtain clean filtered signals, which are then reconstructed into the noise-reduced music signal, solving the problems that existing noise reduction algorithms either cannot effectively suppress the noise components in music recordings or severely damage the music signal.
In an alternative embodiment, an electronic device is provided. As shown in fig. 10, the electronic device 4000 includes a processor 4001 and a memory 4003, the processor 4001 being coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between this electronic device and other electronic devices, such as sending and/or receiving data. It should be noted that, in practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 4001 may also be a combination implementing computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 4002 may include a path for transferring information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like, and may be divided into an address bus, a data bus, a control bus and so on. For ease of illustration, the bus is represented by only one thick line in fig. 10, but this does not mean that there is only one bus or only one type of bus.
The memory 4003 may be, but is not limited to, a ROM (Read-Only Memory) or other type of static storage device capable of storing static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device capable of storing information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store a computer program and can be read by a computer.
The memory 4003 is used to store the computer program that executes the embodiments of the present application, and its execution is controlled by the processor 4001. The processor 4001 is configured to execute the computer program stored in the memory 4003 to implement the steps shown in the foregoing method embodiments.
The electronic devices include, but are not limited to, servers and terminals.
Embodiments of the present application provide a computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the steps and corresponding content of the foregoing method embodiments.
Embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements the steps and corresponding content of the foregoing method embodiments.
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the application described herein may be implemented in other sequences than those illustrated or otherwise described.
It should be understood that although the operation steps are indicated by arrows in the flowcharts of the embodiments of the present application, these steps need not be performed in the order indicated by the arrows. Unless explicitly stated herein, in some implementation scenarios the steps in the flowcharts may be performed in other orders as required. Furthermore, depending on the actual implementation scenario, some or all of the steps in the flowcharts may include multiple sub-steps or stages, some or all of which may be performed at the same moment or at different moments; where they are performed at different moments, their execution order can be configured flexibly as required, which is not limited by the embodiments of the present application.
The foregoing is merely an optional implementation of some implementation scenarios of the present application. It should be noted that, for those of ordinary skill in the art, other similar implementations based on the technical ideas of the present application that do not depart from the technical ideas of the present scheme also fall within the protection scope of the embodiments of the present application.

Claims (11)

1. An audio processing method, comprising:
acquiring a first signal, wherein the first signal is a music signal with noise generated by recording;
performing signal filtering on the first signal based on at least two pre-trained music noise reduction models to obtain at least two filtered second signals;
performing signal reconstruction on each second signal to obtain a noise-reduced target signal;
the music noise reduction model comprises a deep neural network model which is obtained by training a training sample corresponding to a preset musical instrument; a corresponding music noise reduction model is trained for each preset musical instrument, and each preset musical instrument comprises at least one musical instrument which is classified into the same type based on the sounding mode and/or the type of the musical instrument.
2. The method of claim 1, wherein, if the first signal includes a human voice part, the music noise reduction models further include a deep neural network model trained with training samples corresponding to the human voice.
3. The method of claim 1, wherein the performing signal filtering on the first signal based on the at least two pre-trained music noise reduction models to obtain at least two filtered second signals comprises:
performing a Fourier transform on the first signal to obtain spectral features;
filtering the spectral features in parallel based on the at least two pre-trained music noise reduction models to obtain at least two filtered signals;
and performing an inverse Fourier transform on each filtered signal based on the spectral features and the phase information of the first signal to obtain the filtered second signals.
4. The method according to claim 3, wherein the music noise reduction model comprises a first fully connected unit, a convolutional network unit, a gated recurrent unit and a second fully connected unit connected in sequence;
the filtering the spectral features in parallel based on the at least two pre-trained music noise reduction models to obtain at least two filtered signals comprises performing the following operations for each music noise reduction model:
encoding the spectral features through the first fully connected unit and the convolutional network unit to obtain high-dimensional features;
decoding the high-dimensional features through the gated recurrent unit and the second fully connected unit to obtain a filtered signal;
wherein the convolutional network unit comprises at least two convolutional layers connected in sequence; the input of each convolutional layer includes the output of the first fully connected unit; the input of the gated recurrent unit includes the output of the first fully connected unit and the output of the convolutional network unit; and the second fully connected unit comprises a fully connected layer with an activation function.
5. The method of claim 2, wherein the performing signal filtering on the first signal based on the at least two pre-trained music noise reduction models to obtain at least two filtered second signals comprises:
performing feature analysis on the first signal;
if the feature analysis determines that the first signal includes the human voice and/or at least one first musical instrument, filtering the first signal based on the music noise reduction model corresponding to the human voice and/or the first musical instrument to obtain the filtered second signal corresponding to the human voice and/or the first musical instrument;
wherein the preset musical instruments include the at least one first musical instrument.
6. The method according to claim 2 or 5, further comprising at least one of:
in response to an operation object triggering a filtering-out operation for a second musical instrument or the human voice before signal filtering, instructing that filtering of the first signal by the music noise reduction model corresponding to the second musical instrument or the human voice be suspended during signal filtering;
in response to an operation object triggering a filtering-out operation for the second musical instrument or the human voice before signal reconstruction, instructing that, during signal reconstruction, the weight of the signal corresponding to the second musical instrument or the human voice among the second signals be set to a preset threshold;
wherein the preset musical instruments include the second musical instrument.
7. The method of claim 1, wherein the music noise reduction models further include deep neural network models trained with training samples corresponding to preset playing modes; a corresponding music noise reduction model is trained for each playing mode; and the preset playing modes include at least one of solo, duet, ensemble, accompaniment, unison and lead playing;
the method further comprising:
in response to an operation object triggering a noise reduction operation for a target playing mode of a third musical instrument before signal filtering, instructing that, during signal filtering, the first signal be filtered based on the music noise reduction model corresponding to the third musical instrument and the music noise reduction model corresponding to the target playing mode, so as to obtain a second signal corresponding to the third musical instrument and a second signal corresponding to the target playing mode;
wherein the performing signal reconstruction on each second signal to obtain the noise-reduced target signal comprises:
performing signal reconstruction on the second signal corresponding to the third musical instrument and the second signal corresponding to the target playing mode by signal multiplication to obtain the noise-reduced target signal;
wherein the preset musical instruments include the third musical instrument.
8. An audio processing apparatus, comprising:
the acquisition module is used for acquiring a first signal, wherein the first signal is a music signal with noise generated by recording;
the filtering module is used for carrying out signal filtering on the first signal based on at least two pre-trained music noise reduction models to obtain at least two filtered second signals;
the reconstruction module is used for carrying out signal reconstruction on each second signal to obtain a target signal after noise reduction;
the music noise reduction model comprises a deep neural network model which is obtained by training a training sample corresponding to a preset musical instrument; a corresponding music noise reduction model is trained for each preset musical instrument, and each preset musical instrument comprises at least one musical instrument which is classified into the same type based on the sounding mode and/or the type of the musical instrument.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method according to any one of claims 1-7.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-7.
11. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-7.
CN202211146105.9A 2022-09-20 2022-09-20 Audio processing method, device, electronic equipment, storage medium and program product Pending CN117012218A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211146105.9A CN117012218A (en) 2022-09-20 2022-09-20 Audio processing method, device, electronic equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN117012218A 2023-11-07

Family

ID=88562467


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination