WO2023030017A1 - Audio data processing method, apparatus, device, and medium

Audio data processing method, apparatus, device, and medium

Info

Publication number
WO2023030017A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
recording
sample
speech
network model
Prior art date
Application number
PCT/CN2022/113179
Other languages
English (en)
French (fr)
Inventor
梁俊斌
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Priority to EP22863157.8A priority Critical patent/EP4300493A1/en
Publication of WO2023030017A1 publication Critical patent/WO2023030017A1/zh
Priority to US18/137,332 priority patent/US20230260527A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L2021/02085 Periodic noise
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval

Definitions

  • The present application relates to the technical field of audio processing, and in particular to an audio data processing method, apparatus, device, and medium.
  • With the rapid popularization of audio and video service applications, users share everyday music recordings through these applications with increasing frequency. For example, a user may sing along with an accompaniment while recording through a device with a recording function (such as a mobile phone or a sound card device connected to a microphone). The user may be in a noisy environment, or the device used may be too simple, so that in addition to the user's singing voice (the human voice signal) and the accompaniment (the music signal), the music recording signal captured by the device may also contain noise signals from the noisy environment, electronic noise from the device, and so on. If the unprocessed music recording signal is shared directly to the audio service application, other users will find it difficult to hear the user's singing clearly when playing the music recording signal in that application. Therefore, noise reduction processing needs to be performed on the recorded music recording signal.
  • Current noise reduction algorithms require the noise type and the signal type to be distinguished. For example, human voice and noise differ to a certain extent in signal correlation and spectral distribution, so noise can be suppressed with statistical noise reduction or deep-learning noise reduction methods. However, a music recording signal may contain many types of music (for example, classical music, folk music, rock music, etc.); some music types are similar to certain environmental noise types, or some music spectral features are close to those of certain noises. When such noise reduction algorithms are applied to a music recording signal, the music signal may be misjudged as a noise signal and suppressed, or the noise signal may be misjudged as a music signal and retained, so the noise reduction effect on the music recording signal is unsatisfactory.
  • Embodiments of the present application provide an audio data processing method, apparatus, device, and medium, which can improve the noise reduction effect of recorded audio.
  • In one aspect, an embodiment of the present application provides an audio data processing method. The method is executed by a computer device and includes:
  • obtaining recording audio; the recording audio includes a background reference audio component, a speech audio component, and an environmental noise component;
  • determining, from an audio database, prototype audio that matches the recording audio;
  • obtaining candidate speech audio from the recording audio according to the prototype audio; the candidate speech audio includes the speech audio component and the environmental noise component;
  • determining the difference between the recording audio and the candidate speech audio as the background reference audio component contained in the recording audio;
  • performing environmental noise reduction processing on the candidate speech audio to obtain noise-reduced speech audio corresponding to the candidate speech audio, and merging the noise-reduced speech audio with the background reference audio component to obtain noise-reduced recording audio.
  • In another aspect, an embodiment of the present application provides an audio data processing method. The method is executed by a computer device and includes:
  • acquiring voice sample audio, noise sample audio, and standard sample audio, and generating sample recording audio based on the voice sample audio, the noise sample audio, and the standard sample audio; the voice sample audio and the noise sample audio are collected through recording, and the standard sample audio is pure audio stored in an audio database;
  • obtaining sample predicted speech audio in the sample recording audio through a first initial network model; the first initial network model is used to filter out the standard sample audio included in the sample recording audio, and the expected predicted speech audio of the first initial network model is determined by the voice sample audio and the noise sample audio;
  • obtaining sample predicted noise-reduced audio corresponding to the sample predicted speech audio through a second initial network model; the second initial network model is used to suppress the noise sample audio contained in the sample predicted speech audio, and the expected predicted noise-reduced audio of the second initial network model is determined by the voice sample audio;
  • adjusting the network parameters of the first initial network model based on the sample predicted speech audio and the expected predicted speech audio to obtain a first deep network model; the first deep network model is used to filter recording audio to obtain candidate speech audio, the recording audio includes a background reference audio component, a speech audio component, and an environmental noise component, and the candidate speech audio includes the speech audio component and the environmental noise component;
  • adjusting the network parameters of the second initial network model based on the sample predicted noise-reduced audio and the expected predicted noise-reduced audio to obtain a second deep network model; the second deep network model is used to perform noise reduction processing on the candidate speech audio to obtain noise-reduced speech audio.
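  • For illustration only, the following is a minimal Python sketch of how such sample recording audio could be assembled from voice sample audio, noise sample audio, and standard sample audio; the mixing gains and the function name make_sample_recording are assumptions for this example, not part of the claimed method.

```python
import numpy as np

def make_sample_recording(voice, noise, standard, noise_gain=0.3, music_gain=0.8):
    """Mix recorded voice sample audio, recorded noise sample audio and pure
    standard sample audio (from the audio database) into one sample recording
    audio, and return the training targets for the two models."""
    n = min(len(voice), len(noise), len(standard))
    voice, noise, standard = voice[:n], noise[:n], standard[:n]
    sample_recording = voice + noise_gain * noise + music_gain * standard
    # Expected outputs used as training targets:
    expected_predicted_speech = voice + noise_gain * noise   # target of the first model
    expected_predicted_denoised = voice                      # target of the second model
    return sample_recording, expected_predicted_speech, expected_predicted_denoised
```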
  • In another aspect, an embodiment of the present application provides an audio data processing apparatus. The apparatus is deployed on a computer device and includes:
  • the audio acquisition module is used to obtain the recording audio;
  • the recording audio includes background reference audio components, speech audio components and environmental noise components;
  • a retrieval module is used to determine the prototype audio matching the recording audio from the audio database
  • the audio filtering module is used to obtain candidate speech audio from recording audio according to prototype audio; candidate speech audio includes speech audio components and environmental noise components;
  • the audio determination module is used to determine the difference between the recording audio and the candidate speech audio as the background reference audio component contained in the recording audio;
  • the noise reduction processing module is used to perform environmental noise reduction processing on the candidate speech audio, obtain the noise reduction speech audio corresponding to the candidate speech audio, merge the noise reduction speech audio and the background reference audio component, and obtain the recorded audio after noise reduction.
  • In another aspect, an embodiment of the present application provides an audio data processing apparatus. The apparatus is deployed on a computer device and includes:
  • the sample acquisition module is used to obtain voice sample audio, noise sample audio, and standard sample audio, and generate sample recording audio according to the voice sample audio, the noise sample audio, and the standard sample audio; the voice sample audio and the noise sample audio are collected through recording, and the standard sample audio is pure audio stored in the audio database;
  • the first prediction module is used to obtain the sample predicted speech audio in the sample recording audio through the first initial network model; the first initial network model is used to filter out the standard sample audio contained in the sample recording audio, and the expected predicted speech audio of the first initial network model is determined by the voice sample audio and the noise sample audio;
  • the second prediction module is used to obtain the sample predicted noise-reduced audio corresponding to the sample predicted speech audio through the second initial network model; the second initial network model is used to suppress the noise sample audio contained in the sample predicted speech audio, and the expected predicted noise-reduced audio of the second initial network model is determined by the voice sample audio;
  • the first adjustment module is used to adjust the network parameters of the first initial network model based on the sample predicted speech audio and the expected predicted speech audio to obtain the first deep network model; the first deep network model is used to filter the recording audio to obtain candidate speech audio, the recording audio includes a background reference audio component, a speech audio component, and an environmental noise component, and the candidate speech audio includes the speech audio component and the environmental noise component;
  • the second adjustment module is used to adjust the network parameters of the second initial network model based on the sample predicted noise-reduced audio and the expected predicted noise-reduced audio to obtain a second deep network model; the second deep network model is used to perform noise reduction processing on the candidate speech audio to obtain the noise-reduced speech audio.
  • In another aspect, an embodiment of the present application provides a computer device, including a memory and a processor. The memory is connected to the processor, the memory is used to store a computer program, and the processor is used to call the computer program, so that the computer device executes the method described in the embodiments of the present application.
  • In another aspect, embodiments of the present application provide a computer-readable storage medium in which a computer program is stored, and the computer program is adapted to be loaded and executed by a processor, so that a computer device having the processor executes the method described in the embodiments of the present application.
  • In another aspect, embodiments of the present application provide a computer program product or computer program, comprising computer instructions stored in a computer-readable storage medium.
  • The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device executes the method provided in the above aspects.
  • In the embodiments of the present application, the recording audio containing the background reference audio component, the speech audio component, and the environmental noise component can be obtained, the prototype audio matching the recording audio can be obtained from the audio database, and the candidate speech audio can then be obtained from the recording audio according to the prototype audio; the candidate speech audio includes the speech audio component and the environmental noise component.
  • In this way, the noise reduction processing problem of the recording audio can be converted into the noise reduction processing problem of the candidate speech audio, and the candidate speech audio is then directly subjected to environmental noise reduction processing to obtain the noise-reduced speech audio corresponding to the candidate speech audio, which avoids confusing the background reference audio component in the recording audio with the environmental noise component.
  • Finally, the noise-reduced speech audio is combined with the background reference audio component to obtain the noise-reduced recording audio. It can be seen that this application converts the noise reduction processing problem of the recording audio into the noise reduction processing problem of the candidate speech audio, which can avoid confusing the background reference audio component and the environmental noise component in the recording audio, thereby improving the noise reduction effect of the recording audio.
  • FIG. 1 is a schematic structural diagram of a network architecture provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a music recording audio noise reduction scene provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of an audio data processing method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a music recording scene provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of an audio data processing method provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a first deep network model provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a second deep network model provided by an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a recording audio noise reduction process provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of an audio data processing method provided in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of training a deep network model provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of an audio data processing device provided in an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of an audio data processing device provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • The solution provided by the embodiments of the present application relates to the AI (Artificial Intelligence) noise reduction service in artificial intelligence cloud services.
  • The AI noise reduction service can be accessed through an API (Application Programming Interface). Through the AI noise reduction service, noise reduction processing is performed on recorded audio shared on social platforms (for example, music recording sharing applications), so as to improve the noise reduction effect of the recorded audio.
  • FIG. 1 is a schematic structural diagram of a network architecture provided by an embodiment of the present application.
  • the network architecture may include a server 10d and a user terminal cluster, and the user terminal cluster may include one or more user terminals, and the number of user terminals is not limited here.
  • the user terminal cluster may specifically include a user terminal 10a, a user terminal 10b, and a user terminal 10c.
  • The server 10d can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms.
  • The user terminal 10a, the user terminal 10b, the user terminal 10c, etc. may include, but are not limited to: smart phones, tablet computers, notebook computers, palmtop computers, mobile internet devices (MID), wearable devices (such as smart watches, smart bracelets, etc.), smart terminals with a recording function such as smart TVs, or sound card devices connected to microphones, etc.
  • the user terminal 10a, user terminal 10b, and user terminal 10c can respectively be connected to the server 10d through a network, so that each user terminal can exchange data with the server 10d through the network connection.
  • The user terminal 10a can be integrated with a recording function. When the user wants to record audio data of himself or others, he can use an audio playback device to play background reference audio (the background reference audio here can be a music accompaniment, or the background audio and dubbing audio in a video, etc.), start the recording function in the user terminal 10a, and start recording the mixed audio that contains the background reference audio played by the above-mentioned audio playback device. In the embodiments of the present application, this mixed audio is called recording audio.
  • The above-mentioned audio playback device can be the user terminal 10a itself, or another device with an audio playback function other than the user terminal 10a. The above-mentioned recording audio can include the background reference audio played by the audio playback device, the environmental noise in the environment where the audio playback device and the user are located, and the user's voice. The recorded background reference audio can be used as the background reference audio component in the recording audio, the recorded environmental noise can be used as the environmental noise component in the recording audio, and the recorded user voice can be used as the speech audio component in the recording audio.
  • The user terminal 10a can upload the recording audio to a social platform. For example, when a client of the social platform is installed on the user terminal 10a, the recording audio can be uploaded to the client of the social platform, and the client of the social platform can transmit the recording audio to the background server of the social platform (for example, the server 10d shown in FIG. 1 above).
  • The noise reduction processing of the recording audio can be performed as follows: obtain the prototype audio matching the recording audio from the audio database (the prototype audio here can be understood as the official genuine audio corresponding to the background reference audio component in the recording audio); obtain the candidate speech audio (including the above-mentioned environmental noise and the above-mentioned user voice) from the recording audio according to the prototype audio; then determine the difference between the recording audio and the candidate speech audio as the background reference audio component; after noise reduction processing is performed on the candidate speech audio, the noise-reduced speech audio corresponding to the candidate speech audio can be obtained.
  • After superimposing the noise-reduced speech audio and the background reference audio component, the noise-reduced recording audio can be obtained. At this time, the noise-reduced recording audio can be shared on the social platform. By converting the noise reduction processing problem of the recording audio into the noise reduction processing problem of the candidate speech audio, the noise reduction efficiency of the recording audio can be improved.
  • FIG. 2 is a schematic diagram of a music recording audio noise reduction scenario provided by an embodiment of the present application.
  • The user terminal 20a shown in FIG. 2 may be a terminal device held by user A (for example, any user terminal in the user terminal cluster shown in FIG. 1 above). The user terminal 20a is integrated with a recording function and an audio playback function, so the user terminal 20a can be used both as a recording device and as an audio playback device.
  • When user A wants to record his own singing, he can start the recording function in the user terminal 20a, sing while the user terminal 20a plays the music accompaniment, and record the music.
  • The recording audio in this embodiment of the present application is the music recording audio 20b, and the music recording audio 20b can include user A's singing voice (i.e., the speech audio component) and the music accompaniment played by the user terminal 20a (i.e., the background reference audio component).
  • The user terminal 20a can upload the recorded music recording audio 20b to the client corresponding to the music application. After the client obtains the music recording audio 20b, it transmits the music recording audio 20b to the background server corresponding to the music application (for example, the server 10d shown in FIG. 1 above), so that the background server can store and share the music recording audio 20b.
  • In addition to user A's singing voice and the music accompaniment played by the user terminal 20a, the music recording audio 20b recorded by the user terminal 20a will also include environmental noise; that is, the music recording audio 20b may include three audio components: environmental noise, music accompaniment, and the user's singing voice. The environmental noise in the music recording audio 20b recorded by the user terminal 20a can be the whistle of vehicles, the shouting of roadside stores, and the voices of passers-by; the environmental noise can also include electronic noise.
  • If the background server directly shares the music recording audio 20b uploaded by the user terminal 20a, other terminal devices will not be able to clearly hear user A's singing when accessing the music application and playing the music recording audio 20b. Therefore, before sharing the music recording audio 20b in the music application, it is necessary to perform noise reduction processing on the music recording audio 20b and then share the noise-reduced music recording audio, so that other terminal devices can play the noise-reduced music recording audio when accessing the music application and get a sense of user A's real singing level. In other words, the user terminal 20a is only responsible for the collection and uploading of the music recording audio 20b, and the noise reduction processing of the music recording audio 20b can be performed by the background server corresponding to the music application.
  • Alternatively, the user terminal 20a can perform the noise reduction processing on the music recording audio 20b itself and upload the noise-reduced music recording audio to the music application, and the background server corresponding to the music application can then directly share the noise-reduced music recording audio; that is, the noise reduction processing of the music recording audio 20b can also be performed by the user terminal 20a.
  • the background server (for example, the server 10d) of a music application is taken as an example below to describe the noise reduction process of the music recording audio 20b.
  • the essence of the noise reduction processing of the music recording audio 20b is to suppress the environmental noise in the music recording audio 20b, and retain the music accompaniment and user A's singing voice in the music recording audio 20b.
  • the noise reduction of the music recording audio 20b is to eliminate the ambient noise in the music recording audio 20b as much as possible, but it is necessary to keep the music accompaniment and the singing voice of user A in the music recording audio 20b unchanged as much as possible.
  • After acquiring the music recording audio 20b, the background server (for example, the above-mentioned server 10d) can perform frequency domain conversion on the music recording audio 20b, that is, transform the music recording audio 20b from the time domain to the frequency domain to obtain the frequency domain power spectrum corresponding to the music recording audio 20b. This frequency domain power spectrum can include the energy value corresponding to each frequency point, as shown by the frequency domain power spectrum 20i in FIG. 2; an energy value in the frequency domain power spectrum 20i corresponds to one frequency point, and a frequency point is a frequency sampling point.
  • The audio fingerprint 20c corresponding to the music recording audio 20b (that is, the audio fingerprint to be matched) can be extracted; the audio fingerprint can be understood as a compact digital feature summary of the music recording audio 20b.
  • the background server can obtain the music library 20d in the music application, and the audio fingerprint library 20e corresponding to the music library 20d.
  • The music library 20d can include all music audio stored in the music application, and the audio fingerprint library 20e can include the audio fingerprint corresponding to each piece of music audio in the music library 20d.
  • An audio fingerprint search can then be carried out in the audio fingerprint library 20e to obtain the fingerprint search result corresponding to the audio fingerprint 20c (that is, the audio fingerprint in the audio fingerprint library 20e that matches the audio fingerprint 20c). The music prototype audio 20f (that is, the prototype audio, such as the music prototype corresponding to the music accompaniment in the music recording audio 20b) can be determined from the music library 20d according to the fingerprint retrieval result.
  • the frequency domain transformation can be performed on the music prototype audio 20f, that is, the music prototype audio 20f is transformed from the time domain to the frequency domain to obtain the frequency domain power spectrum corresponding to the music prototype audio 20f.
  • the first-order deep network model 20g may be a pre-trained network model capable of performing de-music processing on music recording audio, and the training process of the first-order deep network model 20g may refer to the process described in S304 below .
  • The frequency domain power spectrum of the music recording audio 20b and the frequency domain power spectrum of the music prototype audio 20f can be input to the first-order deep network model 20g, the frequency point gains output by the first-order deep network model 20g are weighted with the frequency domain power spectrum 20i to obtain the weighted recording audio frequency domain signal, and the weighted recording audio frequency domain signal is subjected to time domain transformation, that is, transformed from the frequency domain to the time domain, to obtain the de-music audio 20k. The de-music audio 20k refers to the audio signal obtained after the music accompaniment is filtered out of the music recording audio 20b.
  • Assume that the frequency point gain output by the first-order deep network model 20g is the frequency point gain sequence 20h, which includes the speech gains corresponding to 5 frequency points: speech gain 5 for frequency point 1, speech gain 7 for frequency point 2, speech gain 8 for frequency point 3, speech gain 10 for frequency point 4, and speech gain 3 for frequency point 5. Assume that the frequency domain power spectrum 20i includes the energy values corresponding to the same five frequency points: energy value 1 for frequency point 1, energy value 2 for frequency point 2, energy value 3 for frequency point 3, energy value 2 for frequency point 4, and energy value 1 for frequency point 5.
  • The weighted recording audio frequency domain signal 20j is obtained by multiplying the two point by point: the product of speech gain 5 and energy value 1 gives the weighted energy value 5 for frequency point 1 in the weighted recording audio frequency domain signal 20j; the product of speech gain 7 and energy value 2 gives the energy value 14 for frequency point 2; the product of speech gain 8 and energy value 3 gives the energy value 24 for frequency point 3; the product of speech gain 10 and energy value 2 gives the energy value 20 for frequency point 4; and the product of speech gain 3 and energy value 1 gives the energy value 3 for frequency point 5.
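  • The point-by-point weighting in this example can be reproduced with a few lines of numpy (an illustrative sketch only; the array names are not from the patent):

```python
import numpy as np

# Speech gains output by the first-order deep network model 20g (frequency points 1-5)
gain_sequence_20h = np.array([5, 7, 8, 10, 3], dtype=float)
# Energy values of the frequency domain power spectrum 20i (frequency points 1-5)
power_spectrum_20i = np.array([1, 2, 3, 2, 1], dtype=float)

# Point-by-point product gives the weighted recording audio frequency domain signal 20j
weighted_signal_20j = gain_sequence_20h * power_spectrum_20i
print(weighted_signal_20j)  # [ 5. 14. 24. 20.  3.]
```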
  • After obtaining the de-music audio 20k, the background server can determine the difference between the music recording audio 20b and the de-music audio 20k as the pure music audio 20p (that is, the background reference audio component) contained in the music recording audio 20b. The pure music audio 20p corresponds to the music accompaniment played by the audio playback device.
  • The de-music audio 20k can then be input to the second-order deep network model 20m. The second-order deep network model 20m may be a pre-trained network model capable of performing noise reduction processing on speech audio carrying noise, and its training process may refer to the process described in S305 below. The human voice noise-removed audio 20n here refers to the audio signal obtained after noise suppression is performed on the de-music audio 20k, such as the singing voice of user A in the music recording audio 20b.
  • The above-mentioned first-order deep network model 20g and second-order deep network model 20m can be deep networks with different network structures; the calculation process of the human voice noise-removed audio 20n is similar to that of the above-mentioned de-music audio 20k and will not be described in detail here.
  • the background server can superimpose the pure music audio 20p and the human voice noise-removed audio 20n to obtain the noise-reduced music recording audio 20q (ie, the noise-reduced recording audio).
  • In this way, the noise reduction processing of the music recording audio 20b is converted into the noise reduction processing of the de-music audio 20k (which can be understood as human voice audio), so that the noise-reduced music recording audio 20q not only retains user A's singing voice and the music accompaniment, but also suppresses the environmental noise in the music recording audio 20b to the greatest extent, which improves the noise reduction effect of the music recording audio 20b.
  • FIG. 3 is a schematic flowchart of an audio data processing method provided in an embodiment of the present application. It can be understood that the audio data processing method can be executed by a computer device, and the computer device can be a user terminal, or a server, or a computer program application (including program code) in the computer device, which is not specifically limited here. As shown in Figure 3, the audio data processing method may include the following S101-S105:
  • S101: Acquire recording audio; the recording audio includes a background reference audio component, a speech audio component, and an environmental noise component.
  • The computer device can obtain the recording audio including the background reference audio component, the speech audio component, and the environmental noise component. The recording audio can be a mixed recording, captured by a recording device, of the object to be recorded and the audio playback device in the environment to be recorded.
  • the recording device can be a device with a recording function, such as a sound card device connected to a microphone, a mobile phone, etc.
  • the audio playback device can be a device with an audio playback function, such as a mobile phone, a music player, and an audio device
  • The object to be recorded can refer to the user whose voice needs to be recorded, such as user A in the embodiment corresponding to FIG. 2 above.
  • The environment to be recorded can be the recording environment where the object to be recorded and the audio playback device are located, such as an indoor space or an outdoor space in which the object to be recorded and the audio playback device are located.
  • When a device has both a recording function and an audio playback function, that device can be used as both the recording device and the audio playback device; that is, the audio playback device and the recording device in this application can be the same device, such as the user terminal 20a in the embodiment corresponding to FIG. 2 above.
  • The recording audio obtained by the computer device can be recording data transmitted from the recording device to the computer device, or recording data collected by the computer device itself. For example, when the above-mentioned computer device has audio recording and playback functions, it can also serve as the recording device and the audio playback device; the computer device can be installed with an audio application, and the above audio recording process can be realized through the recording function of the audio application.
  • In a music recording scene, the object to be recorded can start the recording function in the recording device, use the audio playback device to play the music accompaniment, sing against that background, and record the music with the recording device. After the recording is completed, the recorded music recording can be used as the above-mentioned recording audio. The recording audio at this time can include the music accompaniment played by the audio playback device and the singing voice of the object to be recorded; if the environment to be recorded is noisy, the recording audio can also include the environmental noise in the environment to be recorded. The recorded music accompaniment can be used as the background reference audio component in the recording audio, such as the music accompaniment played by the user terminal 20a in the embodiment corresponding to FIG. 2 above; the singing voice of the object to be recorded can be used as the speech audio component in the recording audio, such as the singing voice of user A in the embodiment corresponding to FIG. 2; the recorded environmental noise can be used as the environmental noise component in the recording audio, such as the environmental noise in the environment where the user terminal 20a is located.
  • In a dubbing recording scene, the object to be recorded can start the recording function in the recording device, use the audio playback device to play the background audio of the segment to be dubbed, perform dubbing over that background audio, and record the dubbing with the recording device. After the recording is completed, the recorded dubbing audio can be used as the above-mentioned recording audio. The recording audio at this time can include the background audio played by the audio playback device and the dubbing voice of the object to be recorded; if the environment to be recorded is noisy, the recording audio can also include the environmental noise in the environment to be recorded. The recorded background audio can be used as the background reference audio component in the recording audio; the recorded dubbing of the object to be recorded can be used as the speech audio component in the recording audio; the recorded environmental noise can be used as the environmental noise component in the recording audio.
  • The recording audio acquired by the computer device may include the audio played by the audio playback device (for example, the above-mentioned music accompaniment, the background audio of the segment to be dubbed, etc.), the voice output by the object to be recorded (for example, the above-mentioned user's dubbing, singing, etc.), and the environmental noise in the environment to be recorded.
  • The above-mentioned music recording scene and dubbing recording scene are only examples in this application; this application can also be applied to other audio recording scenes, for example, a human-computer question-and-answer interaction scene between the object to be recorded and the audio playback device, a language performance scene such as a cross talk performance scene, etc.
  • the recorded audio acquired by the computer device may include not only the audio output by the object to be recorded and the audio played by the audio playback device, but also the ambient noise in the environment to be recorded.
  • For example, the environmental noise in the above-mentioned recording audio may be the broadcast sound of promotional activities in a shopping mall, the yelling of shop assistants, and the electronic noise of the recording equipment; or the environmental noise in the above-mentioned recording audio may be the running sound of an air conditioner, the rotation sound of a fan, and the electronic noise of the recording device.
  • The computer device needs to perform noise reduction processing on the acquired recording audio. The goal of the noise reduction processing is to suppress the environmental noise in the recording audio as much as possible while keeping the audio output by the object to be recorded and the audio played by the audio playback device, both contained in the recording audio, unchanged.
  • The noise reduction processing problem of the recording audio can be transformed into a human voice noise reduction problem that does not involve the background reference audio component. Therefore, the prototype audio matching the recording audio can be determined from the audio database first, so that the candidate speech audio, from which the background reference audio component is removed, can be obtained.
  • S102: Determine, from the audio database, the prototype audio that matches the recording audio.
  • S102 may be implemented by directly matching the recording audio to obtain the prototype audio; it may also be implemented by first obtaining the audio fingerprint to be matched corresponding to the recording audio, and then obtaining, from the audio database according to the audio fingerprint to be matched, the prototype audio that matches the recording audio.
  • The computer device can perform data compression on the recording audio and map the recording audio to digital summary information, where the digital summary information can be referred to as the audio fingerprint to be matched corresponding to the recording audio. The data volume of the audio fingerprint to be matched is much smaller than the data volume of the recording audio, which improves retrieval accuracy and retrieval efficiency.
  • The computer device can also obtain the audio database and the audio fingerprint library corresponding to the audio database, match the above-mentioned audio fingerprint to be matched against the audio fingerprints contained in the audio fingerprint library, and find, in the audio fingerprint library, the audio fingerprint that matches the audio fingerprint to be matched; that is, audio fingerprint retrieval technology is used to retrieve, from the audio database, the prototype audio that matches the recording audio.
  • The above-mentioned audio database may include all audio data included in the audio application, the audio fingerprint library may include the audio fingerprint corresponding to each piece of audio data in the audio database, and the audio database and the audio fingerprint library may be pre-configured. For example, in a music recording scene, the audio database can be a database containing all music in the music application; in a dubbing recording scene, the audio database can be a database containing the audio of all video data; and so on.
  • the computer device can directly access the audio database and the audio fingerprint library when performing audio fingerprint retrieval on the recorded audio, so as to retrieve the prototype audio that matches the recorded audio.
  • The prototype audio can refer to the original audio corresponding to the audio played by the audio playback device in the recording audio. When the recording audio is music recording audio, the prototype audio can be the music prototype corresponding to the music accompaniment contained in the music recording audio; when the recording audio is dubbing recording audio, the prototype audio can be the prototype audio corresponding to the video background audio contained in the dubbing recording audio; and so on.
  • The audio fingerprint retrieval technology adopted by the computer device may include but is not limited to: Philips audio retrieval technology (a retrieval technology that may include two parts: a highly robust fingerprint extraction method and an efficient fingerprint search strategy) and Shazam audio retrieval technology (an audio retrieval technology that may include two parts: audio fingerprint extraction and audio fingerprint matching). The application can select an appropriate audio retrieval technology according to actual needs to retrieve the above-mentioned prototype audio, for example, an improved technique based on the above two audio fingerprint retrieval technologies; this application does not limit the audio retrieval technology used.
  • The audio fingerprint to be matched extracted by the computer device can be represented by common audio features of the recording audio, where the common audio features can include but are not limited to: Fourier coefficients, Mel-Frequency Cepstral Coefficients (MFCC), spectral flatness, sharpness, LPC (Linear Prediction Coefficient) coefficients, etc.
  • The audio fingerprint matching algorithm adopted by the computer device may include but is not limited to: a distance-based matching algorithm (when the computer device finds that audio fingerprint A has the shortest distance to the audio fingerprint to be matched among all fingerprints in the audio fingerprint library, the audio data corresponding to audio fingerprint A is the prototype audio corresponding to the recording audio), an index-based matching method, and a threshold-based matching method. The application can select an appropriate audio fingerprint extraction algorithm and audio fingerprint matching algorithm according to actual needs, and this application does not limit this.
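  • As a hedged illustration of distance-based fingerprint matching (not the Philips or Shazam algorithms themselves), the following Python sketch computes a crude band-energy fingerprint and retrieves the closest entry in a fingerprint library by Euclidean distance; the frame length, band count, and function names are assumptions made for this example.

```python
import numpy as np

def extract_fingerprint(audio, frame_len=1024, n_bands=16):
    """Very rough audio fingerprint: log band energies per frame (a compact digital summary)."""
    n_frames = len(audio) // frame_len
    fp = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        spectrum = np.abs(np.fft.rfft(frame)) ** 2          # power spectrum of the frame
        bands = np.array_split(spectrum, n_bands)
        fp.append([np.log1p(b.sum()) for b in bands])
    return np.array(fp).ravel()

def match_fingerprint(query_fp, fingerprint_library):
    """Distance-based matching: the library entry closest to the fingerprint
    to be matched is taken as the prototype audio."""
    best_key, best_dist = None, np.inf
    for key, fp in fingerprint_library.items():
        n = min(len(fp), len(query_fp))
        dist = np.linalg.norm(fp[:n] - query_fp[:n])
        if dist < best_dist:
            best_key, best_dist = key, dist
    return best_key, best_dist
```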
  • S103: Obtain candidate speech audio from the recording audio according to the prototype audio; the candidate speech audio includes the speech audio component and the environmental noise component.
  • After the computer device retrieves the prototype audio that matches the recording audio from the audio database, it can filter the recording audio according to the prototype audio to obtain the candidate speech audio (also called noise-carrying human voice) contained in the recording audio. The candidate speech audio can include the speech audio component and the environmental noise component in the recording audio; in other words, the candidate speech audio can be understood as the recording audio with the audio output by the audio playback device filtered out.
  • The computer device may perform frequency domain transformation on the recording audio to obtain the first spectral feature corresponding to the recording audio, and perform frequency domain transformation on the prototype audio to obtain the second spectral feature corresponding to the prototype audio.
  • The frequency domain transformation method in the present application may include but is not limited to: Fourier Transformation (FT), Laplace Transformation, Z-transformation, and variants or improvements of the above three frequency domain transformation methods, such as Fast Fourier Transformation (FFT) and Discrete Fourier Transform (DFT); this application does not limit the frequency domain transformation method adopted.
  • The above-mentioned first spectral feature may be the power spectrum data obtained after performing frequency domain transformation on the recording audio, or may be the result obtained after normalizing that power spectrum data; the above-mentioned second spectral feature is acquired in the same way as the first spectral feature. That is, when the first spectral feature is the power spectrum data corresponding to the recording audio, the second spectral feature is the power spectrum data corresponding to the prototype audio; when the first spectral feature is normalized power spectrum data, the second spectral feature is also normalized power spectrum data, and the normalization processing method adopted for the first spectral feature and the second spectral feature is the same.
  • The above normalization processing methods may include but are not limited to: iLN (instant layer normalization), LN (Layer Normalization), IN (Instance Normalization), GN (Group Normalization), SN (Switchable Normalization), and other normalization processing; this application does not limit the normalization processing method adopted.
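  • As a small illustration, power spectrum data can be normalized frame by frame in the spirit of layer normalization; since the exact normalization used is not limited by this application, the sketch below is only one assumed possibility.

```python
import numpy as np

def layer_normalize(power_spectrum, eps=1e-8):
    """Normalize each frame of power spectrum data to zero mean and unit variance."""
    mean = power_spectrum.mean(axis=-1, keepdims=True)
    std = power_spectrum.std(axis=-1, keepdims=True)
    return (power_spectrum - mean) / (std + eps)
```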
  • The computer device can perform feature combination (concat) on the first spectral feature and the second spectral feature, and input the combined spectral feature into the first deep network model (for example, the first-order deep network model 20g in the embodiment corresponding to FIG. 2 above). The first frequency point gain (for example, the frequency point gain sequence 20h in the embodiment corresponding to FIG. 2 above) can be output through the first deep network model, and the candidate speech audio can then be obtained according to the first frequency point gain and the recording power spectrum data: the first frequency point gain can be multiplied by the power spectrum data corresponding to the recording audio, and the above-mentioned candidate speech audio can then be obtained through time domain transformation. The time domain transformation here is the inverse of the above-mentioned frequency domain transformation; for example, when the frequency domain transformation uses the Fourier transform, the time domain transformation uses the inverse Fourier transform.
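  • A hedged end-to-end sketch of this filtering step is shown below; the deep network is represented by a placeholder function first_deep_network (assumed to return one gain per frequency point), and, as a simplification, the gains are applied to the complex recording spectrum before the inverse transform rather than to the power spectrum data described above.

```python
import numpy as np

def first_deep_network(combined_feature):
    # Placeholder for the trained first deep network model: in practice this would be
    # a GRU/LSTM/DNN/CNN mapping the combined spectral feature to a gain per frequency
    # point. Here a dummy constant gain is returned so the sketch runs end to end.
    n_bins = combined_feature.shape[-1] // 2
    return np.full(n_bins, 0.5)

def extract_candidate_speech(recording_frame, prototype_frame):
    """One frame of S103: filter the background reference audio out of the recording."""
    rec_spec = np.fft.rfft(recording_frame)                  # frequency domain transformation
    proto_spec = np.fft.rfft(prototype_frame)
    rec_power = np.abs(rec_spec) ** 2                        # first spectral feature
    proto_power = np.abs(proto_spec) ** 2                    # second spectral feature
    combined = np.concatenate([rec_power, proto_power])      # feature combination (concat)
    gains = first_deep_network(combined)                     # first frequency point gains
    weighted_spec = gains * rec_spec                         # weight the recording spectrum
    return np.fft.irfft(weighted_spec, n=len(recording_frame))  # back to the time domain
```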
  • The above-mentioned first deep network model can be used to filter out, from the recording audio, the audio output by the audio playback device. The first deep network model can include but is not limited to: a gated recurrent unit (GRU), a long short-term memory network (LSTM), a deep neural network (DNN), a convolutional neural network (CNN), a variant of any one of the above network models, or a combination of two or more of the above network models, etc.; the present application does not limit the network structure adopted for the first deep network model.
  • The second deep network model involved in the following may also include but is not limited to the above-mentioned network models, where the second deep network model is used to perform noise reduction processing on the candidate speech audio. The second deep network model and the first deep network model can have the same network structure but different model parameters (the functions of the two network models are not the same); or the second deep network model and the first deep network model can have different network structures and different model parameters. The type of the second deep network model will not be described in detail later.
  • S104: Determine the difference between the recording audio and the candidate speech audio as the background reference audio component included in the recording audio.
  • After the computer device obtains the candidate speech audio according to the first deep network model, it can subtract the above-mentioned candidate speech audio from the recording audio to obtain the audio output by the audio playback device, which may be referred to as the background reference audio component in the recording audio (for example, the pure music audio 20p in the embodiment corresponding to FIG. 2 above). Because the candidate speech audio includes the environmental noise component and the speech audio component in the recording audio, the result obtained after subtracting the candidate speech audio from the recording audio is the background reference audio component contained in the recording audio.
  • the difference between the recording audio and the candidate speech audio may be a waveform difference in the time domain, or a spectrum difference in the frequency domain.
  • When the recording audio and the candidate speech audio are time domain waveform signals, the first signal waveform corresponding to the recording audio and the second signal waveform corresponding to the candidate speech audio can be obtained. Both the first signal waveform and the second signal waveform can be represented in a two-dimensional coordinate system (the abscissa can be expressed as time, and the ordinate can be expressed as signal strength, also called signal amplitude). The second signal waveform can then be subtracted from the first signal waveform to obtain the waveform difference between the recording audio and the candidate speech audio in the time domain, and this new waveform signal can be considered as the time domain waveform signal corresponding to the background reference audio component.
  • When the recording audio and the candidate speech audio are frequency domain signals, the speech power spectrum data corresponding to the candidate speech audio can be subtracted from the recording power spectrum data corresponding to the recording audio to obtain the spectral difference between the two, and the spectral difference can be considered as the frequency domain signal corresponding to the background reference audio component.
  • For example, if the recording power spectrum data corresponding to the recording audio is (5, 8, 10, 9, 7) and the speech power spectrum data corresponding to the candidate speech audio is (2, 4, 1, 5, 6), subtracting the two gives the spectral difference (3, 4, 9, 4, 1), and this spectral difference (3, 4, 9, 4, 1) can be called the frequency domain signal corresponding to the background reference audio component.
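  • The spectral difference example above, written as a couple of lines of numpy (illustrative only):

```python
import numpy as np

recording_power = np.array([5, 8, 10, 9, 7], dtype=float)   # recording power spectrum data
speech_power = np.array([2, 4, 1, 5, 6], dtype=float)        # speech power spectrum data
background_reference = recording_power - speech_power        # background reference audio component
print(background_reference)  # [3. 4. 9. 4. 1.]
```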
  • S105: Perform environmental noise reduction processing on the candidate speech audio to obtain noise-reduced speech audio corresponding to the candidate speech audio, and combine the noise-reduced speech audio with the background reference audio component to obtain noise-reduced recording audio.
  • The computer device can perform noise reduction processing on the candidate speech audio, that is, suppress the environmental noise in the candidate speech audio, to obtain the noise-reduced speech audio corresponding to the candidate speech audio (for example, the human voice noise-removed audio 20n in the embodiment corresponding to FIG. 2 above).
  • the noise reduction processing of the above-mentioned candidate speech audio can be realized by the above-mentioned second deep network model.
  • The computer device can perform frequency domain conversion on the candidate speech audio to obtain the power spectrum data corresponding to the candidate speech audio (which can be referred to as speech power spectrum data), and input the speech power spectrum data into the second deep network model. The second frequency point gain can be output through the second deep network model, the weighted speech frequency domain signal corresponding to the candidate speech audio is obtained according to the second frequency point gain and the speech power spectrum data, and the weighted speech frequency domain signal is then time domain transformed to obtain the noise-reduced speech audio corresponding to the candidate speech audio; for example, the second frequency point gain is multiplied by the speech power spectrum data corresponding to the candidate speech audio, and the time domain transformation is then performed to obtain the above-mentioned noise-reduced speech audio.
  • The noise-reduced speech audio can then be superimposed with the background reference audio component to obtain the noise-reduced recording audio (for example, the noise-reduced music recording audio 20q in the embodiment corresponding to FIG. 2 above).
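  • A minimal sketch of this step in the same style as the earlier filtering sketch; second_deep_network is a placeholder for the trained second deep network model, and the framing and gain values are assumptions for illustration.

```python
import numpy as np

def second_deep_network(speech_power):
    # Placeholder for the trained second deep network model: it would map the speech
    # power spectrum data to a noise-suppression gain per frequency point.
    return np.full(speech_power.shape, 0.9)

def denoise_and_recombine(candidate_speech_frame, background_reference_frame):
    """Suppress environmental noise in the candidate speech audio, then superimpose
    the result with the background reference audio component."""
    spec = np.fft.rfft(candidate_speech_frame)
    speech_power = np.abs(spec) ** 2                          # speech power spectrum data
    gains = second_deep_network(speech_power)                 # second frequency point gains
    denoised_frame = np.fft.irfft(gains * spec, n=len(candidate_speech_frame))
    return denoised_frame + background_reference_frame        # noise-reduced recording audio
```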
  • the computer device can share the noise-reduced audio recording to the social platform, so that the terminal device in the social platform can play the noise-reduced audio recording when accessing the noise-reduced audio recording .
  • the aforementioned social platform refers to applications, webpages, etc. that can be used to share and disseminate audio and video data.
  • a social platform can be an audio application, or a video application, or a content sharing platform.
  • the noise-reduced recording audio can be the noise-reduced music recording audio
  • the computer device can share the noise-reduced music recording audio to the content sharing platform (in this case, the social platform is a content sharing platform)
  • the terminal device can play the noise-reduced music recording audio when accessing the noise-reduced music recording audio shared on the content sharing platform.
  • FIG. 4 is a schematic diagram of a music recording scene provided by an embodiment of the present application.
  • the server 30a shown in Figure 4 can be the background server of the content sharing platform
  • the user terminal 30b can be the terminal device used by user A, and user A is the user who shares the noise-reduced music recording audio 30e on the content sharing platform.
  • the user terminal 30c may be a terminal device used by user B
  • the user terminal 30d may be a terminal device used by user C.
  • After the server 30a obtains the noise-reduced music recording audio 30e, the noise-reduced music recording audio 30e can be shared to the content sharing platform, and the content sharing platform in the user terminal 30b can display the noise-reduced music recording audio 30e together with information such as the sharing time corresponding to the noise-reduced music recording audio 30e.
  • the content shared by different users can be displayed on the content sharing platform of the user terminal 30c, and the content can include the noise-reduced music recording audio 30e shared by user A; after the noise-reduced music recording audio 30e is clicked, the noise-reduced music recording audio 30e can be played in the user terminal 30c.
  • Similarly, the noise-reduced music recording audio 30e shared by user A can be displayed on the content sharing platform of the user terminal 30d; after the noise-reduced music recording audio 30e is clicked, the noise-reduced music recording audio 30e can be played in the user terminal 30d.
  • the recorded audio may be a mixed audio containing speech audio components, background reference audio components, and environmental noise components.
  • the prototype audio corresponding to the recorded audio may be found in the audio database; according to the prototype audio, the candidate speech audio can be screened out from the recording audio, and the background reference audio component can be obtained by subtracting the candidate speech audio from the above recording audio; then the candidate speech audio can be denoised to obtain the noise-reduced speech audio, and the noise-reduced recording audio can be obtained by superimposing the noise-reduced speech audio and the background reference audio component.
  • By converting the noise reduction problem of the recorded audio into the noise reduction problem of the candidate speech audio, it is possible to avoid confusing the background reference audio component in the recorded audio with the environmental noise, thereby improving the noise reduction effect on the recorded audio.
  • FIG. 5 is a schematic flowchart of an audio data processing method provided by an embodiment of the present application. It can be understood that the audio data processing method can be executed by a computer device, and the computer device can be a user terminal, or a server, or a computer program application (including program code) in the computer device, which is not specifically limited here. As shown in Figure 5, the audio data processing method may include the following S201-S210:
  • S201 Acquire recording audio; the recording audio includes a background reference audio component, a voice audio component, and an environmental noise component.
  • S202 Divide the recording audio into M recording data frames, perform frequency domain transformation on the i-th recording data frame among the M recording data frames, and obtain the power spectrum data corresponding to the i-th recording data frame; both i and M are positive integers, and i is less than or equal to M.
  • the computer device can perform frame processing on the recording audio, divide the recording audio into M recording data frames, and perform frequency domain transformation on the i-th recording data frame among the M recording data frames, for example, perform a Fourier transform on the i-th recording data frame to obtain the power spectrum data corresponding to the i-th recording data frame; M can be a positive integer greater than 1, such as 2, 3, ..., and i can be a positive integer less than or equal to M.
  • the computer device can realize the frame processing of the recording audio through a sliding window, and then obtain M recording data frames. In order to maintain continuity between adjacent recording data frames, overlapping segments can usually be used when the recording audio is divided into frames, and the size of each recording data frame can be related to the size of the sliding window.
  • For each recording data frame, frequency domain transformation (such as a Fourier transform) can be performed independently to obtain the power spectrum data corresponding to that recording data frame; the power spectrum data can include the energy value corresponding to each frequency point (the energy value here can also be called the amplitude of the power spectrum data), one energy value in the power spectrum data corresponds to one frequency point, and a frequency point can be understood as a frequency sampling point.
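  • The framing and frequency domain transformation described above can be sketched as follows; the frame length, hop size, and Hann window are illustrative assumptions rather than values taken from this application:

```python
import numpy as np

def frame_power_spectra(audio, frame_len=1024, hop=512):
    """Split audio into overlapping frames (sliding window) and return the
    power spectrum of each frame. frame_len and hop are illustrative values."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame)            # frequency domain transform
        frames.append(np.abs(spectrum) ** 2)     # energy value per frequency point
    return np.array(frames)                      # shape: (M, frame_len // 2 + 1)
```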
  • the computer device can construct the sub-fingerprint information corresponding to each recording data frame according to the power spectrum data corresponding to each recording data frame; the key to constructing the sub-fingerprint information is to select, from the power spectrum data corresponding to each recording data frame, the energy values with the highest degree of discrimination.
  • the following takes the i-th recording data frame as an example to describe the construction process of the sub-fingerprint information.
  • the computer device can divide the power spectrum data corresponding to the i-th recording data frame into N spectral bands, and select the peak signal in each spectral band (that is, the maximum value in each spectral band, which can also be understood as the maximum energy value in each spectral band) as the signature of that spectral band, so as to construct the sub-fingerprint information corresponding to the i-th recording data frame, where N can be a positive integer, such as 1, 2, ....
  • the sub-fingerprint information corresponding to the i-th recording data frame may include peak signals corresponding to N spectral bands respectively.
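  • A minimal sketch of the sub-fingerprint construction, assuming the power spectrum of one recording data frame is available as a numpy array and the N spectral bands are of roughly equal width:

```python
import numpy as np

def sub_fingerprint(power_spectrum, n_bands=8):
    """Divide one frame's power spectrum into n_bands spectral bands and keep
    the peak (maximum energy value) of each band as that band's signature."""
    bands = np.array_split(power_spectrum, n_bands)
    return np.array([band.max() for band in bands])
```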
  • the computer device can obtain the sub-fingerprint information corresponding to the M recording data frames according to the description in the above S203, and then sequentially combine the sub-fingerprint information corresponding to the M recording data frames according to the time sequence of the M recording data frames in the recording audio, so as to obtain the audio fingerprint to be matched corresponding to the recorded audio.
  • the computer device can obtain the audio database and the audio fingerprint library corresponding to the audio database; for each audio data in the audio database, the corresponding audio fingerprint can be obtained according to the description in S201-S204 above, and the audio fingerprints corresponding to all audio data may constitute the audio fingerprint library corresponding to the audio database.
  • the audio fingerprint library is pre-built; after the computer device obtains the audio fingerprint to be matched corresponding to the recorded audio, it can directly obtain the audio fingerprint library and search it for an audio fingerprint that matches the audio fingerprint to be matched. The matching audio fingerprint can be used as the fingerprint retrieval result corresponding to the audio fingerprint to be matched, and the audio data corresponding to the fingerprint retrieval result can then be determined as the prototype audio that matches the recorded audio.
  • the computer device may store the audio fingerprint as a key of the audio retrieval hash table.
  • a single audio data frame contained in each audio data can correspond to one piece of sub-fingerprint information, and one piece of sub-fingerprint information can correspond to one key value of the audio retrieval hash table; all of the sub-fingerprint information of an audio data may constitute the audio fingerprint corresponding to that audio data.
  • each piece of sub-fingerprint information can be used as a key value of the hash table, and each key value can point to the time when the sub-fingerprint information appears in the audio data to which it belongs, and can also point to the identifier of the audio data to which the sub-fingerprint information belongs. For example, if a piece of sub-fingerprint information is converted into a hash value, the hash value can be saved as a key value in the audio retrieval hash table, and the key value points to 02:30 as the time when the sub-fingerprint information appears in the audio data to which it belongs, and points to audio data 1 as the identifier of that audio data. It can be understood that the above-mentioned audio fingerprint library may include one or more hash values corresponding to each audio data in the audio database.
  • the audio fingerprint to be matched corresponding to the recorded audio may include M sub-fingerprint information, and one sub-fingerprint information corresponds to one audio data frame.
  • the computer device can map the M pieces of sub-fingerprint information contained in the audio fingerprint to be matched into M hash values to be matched, and obtain the recording time corresponding to each of the M hash values to be matched; the recording time corresponding to a hash value to be matched is used to represent the time when the sub-fingerprint information corresponding to that hash value appears in the recording audio. If the p-th hash value to be matched among the M hash values to be matched matches the first hash value contained in the audio fingerprint library, the first time difference between the recording time corresponding to the p-th hash value to be matched and the time information corresponding to the first hash value is obtained, where p is a positive integer less than or equal to M; if the q-th hash value to be matched among the M hash values to be matched matches the second hash value contained in the audio fingerprint library, the second time difference between the recording time corresponding to the q-th hash value to be matched and the time information corresponding to the second hash value is obtained, where q is a positive integer less than or equal to M. When the first time difference and the second time difference meet the numerical threshold, and the first hash value and the second hash value belong to the same audio fingerprint, the audio fingerprint to which the first hash value belongs may be determined as the fingerprint retrieval result, and the audio data corresponding to the fingerprint retrieval result may be determined as the prototype audio corresponding to the recorded audio.
  • In other words, the computer device can match the above M hash values to be matched against the hash values in the audio fingerprint library; each successfully matched hash value can be used to calculate a time difference. After all the hash values are matched, the largest count of identical time differences can be determined; this maximum can be set as the above numerical threshold, and the audio data corresponding to this maximum can be determined as the prototype audio corresponding to the recorded audio.
  • For example, suppose the M hash values to be matched include hash value 1, hash value 2, hash value 3, hash value 4, hash value 5, and hash value 6. Hash value A in the audio fingerprint library matches hash value 1, hash value A points to audio data 1, and the time difference between hash value A and hash value 1 is t1; hash value B in the audio fingerprint library matches hash value 2, hash value B points to audio data 1, and the time difference between hash value B and hash value 2 is t2; hash value C in the audio fingerprint library matches hash value 3, hash value C points to audio data 1, and the time difference between hash value C and hash value 3 is t3; hash value D in the audio fingerprint library matches hash value 4, hash value D points to audio data 1, and the time difference between hash value D and hash value 4 is t4; hash value E in the audio fingerprint library matches hash value 5, hash value E points to audio data 2, and the time difference between hash value E and hash value 5 is t5. If t1, t2, t3, and t4 are identical, audio data 1 accumulates four identical time differences while audio data 2 accumulates only one, so the audio data 1 can be used as the prototype audio corresponding to the recorded audio.
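  • The time-difference voting described above can be sketched as follows; the hash construction itself and the index layout are assumptions introduced for illustration, and only the voting logic follows the description:

```python
from collections import Counter

def retrieve_prototype(query_hashes, fingerprint_index):
    """query_hashes: list of (hash_value, recording_time) from the recording audio.
    fingerprint_index: dict mapping hash_value -> list of (audio_id, time_in_audio),
    i.e. a stand-in for the pre-built audio fingerprint library.
    Returns the audio_id with the most consistent time differences."""
    votes = Counter()
    for h, t_query in query_hashes:
        for audio_id, t_ref in fingerprint_index.get(h, []):
            # each matched hash pair yields a time difference; identical
            # differences for the same audio accumulate votes
            votes[(audio_id, round(t_ref - t_query, 1))] += 1
    if not votes:
        return None
    (audio_id, _offset), _count = votes.most_common(1)[0]
    return audio_id
```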
  • the computer device can obtain the recording power spectrum data corresponding to the recording audio; the recording power spectrum data can be composed of the power spectrum data corresponding to the above M recording data frames, and can include the energy value corresponding to each frequency point in the recording audio. The recording power spectrum data is normalized to obtain the first spectral feature; if the normalization processing here is iLN, the energy value corresponding to each frequency point in the recording power spectrum data can be normalized independently; of course, other normalization processes, such as BN, can also be used in this application.
  • the power spectrum data of the recording may be directly used as the first spectral feature without performing normalization processing on the power spectrum data of the recording.
  • For the prototype audio, the same frequency domain transformation (to obtain the prototype power spectrum data) and normalization operations as for the above-mentioned recording audio can be performed to obtain the second spectral feature corresponding to the prototype audio; the first spectral feature and the second spectral feature can then be concatenated (concat) and combined as an input feature.
  • the computer device can input the input features to the first deep network model, and the first deep network model can output the first frequency point gain for the recording audio, where the first frequency point gain can include the speech gain corresponding to each frequency point in the recording audio.
  • the input feature is first input to the feature extraction network layer in the first deep network model, and according to the feature extraction network layer, the time series distribution feature corresponding to the input feature can be obtained; the time series distribution feature can be used to represent the context semantics in the recording audio. According to the fully connected network layer in the first deep network model, the time series feature vector corresponding to the time series distribution feature is obtained, and then, according to the time series feature vector, the first frequency point gain is output through the activation layer in the first deep network model; for example, a Sigmoid function (as the activation layer) can output the speech gains corresponding to each frequency point included in the recorded audio (i.e., the first frequency point gain).
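  • A minimal sketch of the feature preparation described above (per-frequency-point normalization and concatenation of the two spectral features); the exact form of iLN is not specified here, so an independent per-bin standardization is used as a stand-in:

```python
import numpy as np

def normalize_feature(power_spectra, eps=1e-8):
    """Independently normalize the energy value of each frequency point over time.
    This is a stand-in for the iLN-style normalization mentioned above; BN or
    no normalization at all would also be possible."""
    mean = power_spectra.mean(axis=0, keepdims=True)
    std = power_spectra.std(axis=0, keepdims=True)
    return (power_spectra - mean) / (std + eps)

# recording_spectra, prototype_spectra: (M, F) power spectrum data
# first_feature = normalize_feature(recording_spectra)
# second_feature = normalize_feature(prototype_spectra)
# input_feature = np.concatenate([first_feature, second_feature], axis=-1)  # concat
```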
  • S208 According to the first frequency point gain and the recording power spectrum data, obtain the candidate speech audio contained in the recording audio; determine the difference between the recording audio and the candidate speech audio as the background reference audio component contained in the recording audio; the candidate speech audio includes the speech audio component and the environmental noise component.
  • the first frequency point gain can include the voice gains corresponding to the T frequency points
  • the recording power spectrum data includes the energy values corresponding to the T frequency points respectively
  • T speech gains correspond to T energy values one-to-one.
  • the computer device can weight the energy values belonging to the same frequency points in the recording power spectrum data according to the speech gains corresponding to the T frequency points in the first frequency point gain, to obtain the weighted energy values corresponding to the T frequency points respectively; the weighted recording audio domain signal corresponding to the recording audio can then be determined according to the weighted energy values corresponding to the T frequency points, and the candidate speech audio contained in the recording audio can be obtained by performing a time domain transformation (the inverse of the aforementioned frequency domain transformation) on the weighted recording audio domain signal.
  • For example, suppose the recording audio includes two frequency points (where T is set to 2): the speech gain of the first frequency point in the first frequency point gain is 2 and its energy value in the recording power spectrum data is 1, and the speech gain of the second frequency point in the first frequency point gain is 3 and its energy value in the recording power spectrum data is 2. The weighted recording audio domain signal can then be calculated as (2, 6), and the candidate speech audio contained in the recording audio can be obtained by performing a time domain transformation on the weighted recording audio domain signal.
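  • A minimal sketch of applying the first frequency point gain to the recording power spectrum data and transforming back to the time domain; reusing the original phase is an assumption, since the description only states that an inverse (time domain) transform is applied:

```python
import numpy as np

def apply_gains(power_spectrum, gains, phase):
    """Weight each frequency point's energy value by its gain and transform back
    to the time domain."""
    weighted_power = gains * power_spectrum          # e.g. (2, 3) * (1, 2) -> (2, 6)
    magnitude = np.sqrt(weighted_power)              # back from power to magnitude
    spectrum = magnitude * np.exp(1j * phase)        # reattach the original phase
    return np.fft.irfft(spectrum)                    # inverse FFT to the time domain
```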
  • the difference between the recording audio and the candidate speech audio may be determined as the background reference audio component, that is, the audio output by the audio playback device.
  • FIG. 6 is a schematic structural diagram of a first deep network model provided by an embodiment of the present application; taking a music recording scene as an example, the network structure of the first deep network model is described.
  • After the computer device retrieves the music prototype audio 40b (i.e., the prototype audio) corresponding to the music recording audio 40a (i.e., the recording audio) from the audio database, a fast Fourier transform can be performed on the music recording audio 40a and the music prototype audio 40b respectively to obtain the power spectrum data 40c (i.e., the recording power spectrum data) and phase corresponding to the music recording audio 40a, and the power spectrum data 40d (i.e., the prototype power spectrum data) corresponding to the music prototype audio 40b. The above fast Fourier transform is only an example in this embodiment, and other frequency-domain transform methods, such as the discrete Fourier transform, can also be used in this application.
  • the first deep network model 40e can be composed of a gated recurrent unit 1, a gated recurrent unit 2, and a fully connected network 1, and finally outputs the first frequency point gain through a Sigmoid function. After the speech gain of each frequency point included in the first frequency point gain is multiplied by the energy value of the corresponding frequency point in the power spectrum data 40c (which can also be referred to as the frequency point power spectrum), the music-free audio 40f (i.e., the above-mentioned candidate speech audio) can be obtained through an inverse fast Fourier transform (iFFT); the inverse fast Fourier transform is a time domain transform method, i.e., a transform from the frequency domain to the time domain. It can be understood that the network structure of the first deep network model 40e shown in FIG. 6 is only an example, and this application does not limit the specific network structure.
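  • A rough sketch of a network with the shape described for the first deep network model 40e (two gated recurrent units, a fully connected network, and a Sigmoid output), assuming PyTorch; the layer sizes are illustrative and not taken from this application:

```python
import torch
import torch.nn as nn

class FirstDeepNetworkSketch(nn.Module):
    """Illustrative model: GRU -> GRU -> fully connected -> Sigmoid gain per bin."""
    def __init__(self, feature_dim, num_bins, hidden=256):
        super().__init__()
        self.gru1 = nn.GRU(feature_dim, hidden, batch_first=True)  # gated recurrent unit 1
        self.gru2 = nn.GRU(hidden, hidden, batch_first=True)       # gated recurrent unit 2
        self.fc = nn.Linear(hidden, num_bins)                      # fully connected network 1

    def forward(self, x):                 # x: (batch, frames, feature_dim)
        h, _ = self.gru1(x)
        h, _ = self.gru2(h)
        return torch.sigmoid(self.fc(h))  # first frequency point gain in (0, 1)
```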
  • After the computer device acquires the candidate speech audio, it can perform frequency domain conversion on the candidate speech audio to obtain the speech power spectrum data corresponding to the candidate speech audio, and input the speech power spectrum data into the second deep network model. Through the feature extraction network layer (which can be a GRU), the fully connected network layer, and the activation layer (a Sigmoid function) in the second deep network model, the second frequency point gain for the candidate speech audio can be output; the second frequency point gain can include the noise reduction gains corresponding to the respective frequency points in the candidate speech audio, which may be the output values of the Sigmoid function.
  • the candidate speech audio includes D frequency points (D is a positive integer greater than 1; D here may or may not be equal to the above T, both can be set according to actual needs, and their values are not limited in this application).
  • the second frequency point gain can include noise reduction gains corresponding to D frequency points respectively
  • the voice power spectrum data includes energy values corresponding to the D frequency points respectively, and the D noise reduction gains correspond one-to-one to the D energy values.
  • the computer device can weight the energy values belonging to the same frequency points in the voice power spectrum data according to the noise reduction gains corresponding to the D frequency points in the second frequency point gain, to obtain the weighted energy values corresponding to the D frequency points respectively; furthermore, according to the weighted energy values corresponding to the D frequency points respectively, the weighted speech audio domain signal corresponding to the candidate speech audio can be determined, and by performing a time domain transformation on the weighted speech audio domain signal (the inverse of the aforementioned frequency domain transformation), the noise-reduced speech audio corresponding to the candidate speech audio can be obtained.
  • For example, the candidate speech audio can include two frequency points (where D takes a value of 2): the noise reduction gain of the first frequency point in the second frequency point gain is 0.1 and its energy value in the speech power spectrum data is 5, and the noise reduction gain of the second frequency point in the second frequency point gain is 0.5 and its energy value in the speech power spectrum data is 8; the weighted speech audio domain signal can then be calculated as (0.5, 4).
  • FIG. 7 is a schematic structural diagram of a second deep network model provided by an embodiment of the present application.
  • the computer device can perform a fast Fourier transform (FFT) on the music-free audio 40f to obtain the power spectrum data 40g (that is, the speech power spectrum data) and phase corresponding to the music-free audio 40f.
  • the power spectrum data 40g is used as the input data of the second deep network model 40h
  • the second deep network model 40h can be composed of a fully connected network 2, a gated recurrent unit 3, a gated recurrent unit 4, and a fully connected network 3, and finally outputs the second frequency point gain through a Sigmoid function; after the noise reduction gain of each frequency point included in the second frequency point gain is multiplied by the energy value of the corresponding frequency point in the power spectrum data 40g, the human voice noise-removed audio 40i (that is, the above-mentioned noise-reduced speech audio) can be obtained through an inverse fast Fourier transform (iFFT).
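  • A minimal end-to-end sketch of the second-stage noise reduction shown in FIG. 7, assuming numpy and a trained callable second_model that maps one frame's speech power spectrum to per-frequency-point noise reduction gains; the model interface, the use of the original phase, and the omitted overlap-add reconstruction are all assumptions:

```python
import numpy as np

def denoise_candidate_speech(candidate_speech, second_model, frame_len=1024, hop=512):
    """Frame the candidate speech audio, apply the model's gains per frame,
    and transform each frame back to the time domain."""
    window = np.hanning(frame_len)
    out_frames = []
    for start in range(0, len(candidate_speech) - frame_len + 1, hop):
        frame = candidate_speech[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame)                       # FFT: power spectrum + phase
        power, phase = np.abs(spectrum) ** 2, np.angle(spectrum)
        gains = second_model(power)                         # second frequency point gain
        weighted = np.sqrt(gains * power) * np.exp(1j * phase)
        out_frames.append(np.fft.irfft(weighted))           # iFFT back to the time domain
    # overlap-add reconstruction of the full signal is omitted for brevity
    return out_frames
```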
  • FIG. 8 is a schematic flowchart of a recording audio noise reduction process provided by an embodiment of the present application. As shown in Figure 8, this embodiment takes the music recording scene as an example. After the computer device obtains the music recording audio 50a, it can obtain the audio fingerprint 50b corresponding to the music recording audio 50a. Based on the audio fingerprint 50b, audio fingerprint retrieval can be performed in the audio fingerprint library 50d corresponding to the music library 50c (i.e., the audio database), and the audio data in the music library 50c whose audio fingerprint matches the audio fingerprint 50b can be determined as the music prototype audio 50e corresponding to the music recording audio 50a; the extraction process of the audio fingerprint 50b and its audio fingerprint retrieval process can refer to the description in the aforementioned S202-S205, and will not be repeated here.
  • spectral feature extraction can be performed on the music recording audio 50a and the music prototype audio 50e respectively, and the obtained spectral features are combined and then input to the first-order deep network 50h (that is, the aforementioned first deep network model); through the first-order deep network 50h, the music-free audio 50i can be obtained (the acquisition process of the music-free audio 50i can refer to the embodiment corresponding to the above-mentioned FIG. 6, and will not be repeated here). The spectral feature extraction process can include frequency domain transformation such as a Fourier transform and normalization processing such as iLN. Further, the music-free audio 50i can be subtracted from the music recording audio 50a to obtain the pure music audio 50j (i.e., the above-mentioned background reference audio component).
  • Frequency domain transformation can then be performed on the music-free audio 50i, and the resulting power spectrum data can be used as the input of the second-order deep network 50k (i.e., the above-mentioned second deep network model); through the second-order deep network 50k, the human voice noise-removed audio 50m can be obtained (the acquisition process of the human voice noise-removed audio 50m can refer to the embodiment corresponding to the above-mentioned FIG. 7, and will not be repeated here). Finally, the pure music audio 50j and the human voice noise-removed audio 50m are superimposed to obtain the final noise-reduced music recording audio 50n (i.e., the noise-reduced recording audio).
  • the recorded audio may be a mixed audio containing a speech audio component, a background reference audio component, and an environmental noise component.
  • The prototype audio corresponding to the recording audio can be retrieved from the audio database; according to the prototype audio, the candidate speech audio can be screened out from the recording audio, and the background reference audio component can be obtained by subtracting the candidate speech audio from the above recording audio. Then, the candidate speech audio can be denoised to obtain the noise-reduced speech audio, and after superimposing the noise-reduced speech audio and the background reference audio component, the noise-reduced recording audio can be obtained.
  • the prototype audio can be retrieved through the audio fingerprint retrieval technology, which can improve the retrieval accuracy and retrieval efficiency.
  • FIG. 9 is a schematic flowchart of an audio data processing method provided by an embodiment of the present application. It can be understood that the audio data processing method can be executed by a computer device, and the computer device can be a user terminal, or a server, or a computer program application (including program code) in the computer device, which is not specifically limited here. As shown in Figure 9, the audio data processing method may include the following S301-S305:
  • the computer device may pre-acquire a large amount of voice sample audio, a large amount of noise sample audio, and a large amount of standard sample audio.
  • the voice sample audio may be an audio sequence containing only human voices; for example, the voice sample audio may be a pre-recorded singing sequence of various users, or a dubbing sequence of various users.
  • the noise sample audio can be an audio sequence containing only noise, and the noise sample audio can be pre-recorded noise of different scenes; for example, the noise sample audio can be the sound of a vehicle horn, typing sounds, metal sounds, and other types of noise.
  • the standard sample audio may be pure audio stored in an audio database; for example, the standard sample audio may be a music sequence, or a video dubbing sequence, and the like.
  • the voice sample audio and the noise sample audio can be collected through recording, and the standard sample audio can be pure audio stored in various platforms, and the computer device needs to obtain the authorization of the platform when obtaining the standard sample audio in the platform.
  • the speech sample audio may be a human voice sequence
  • the noise sample audio may be a noise sequence of different scenes
  • the standard sample audio may be a music sequence.
  • the computer device can superimpose the voice sample audio, the noise sample audio and the standard sample audio to obtain the sample recording audio.
  • Not only can different voice sample audio, noise sample audio, and standard sample audio be randomly combined, but the same group of voice sample audio, noise sample audio, and standard sample audio can also be weighted with different coefficients to obtain different sample recording audio.
  • the computer device can obtain a set of weighting coefficients for the first initial network model; the set of weighting coefficients can be a set of randomly generated floating-point numbers, and K arrays can be constructed according to the set of weighting coefficients. Each array can include three numerical values with an arrangement order (three numerical values in different arrangement orders can form different arrays), and the three numerical values contained in an array are the coefficients of the voice sample audio, the noise sample audio, and the standard sample audio respectively. According to the coefficients contained in the j-th array among the K arrays, the voice sample audio, the noise sample audio, and the standard sample audio are weighted respectively, so as to obtain the sample recording audio corresponding to the j-th array.
  • K different sample audio recordings can be constructed.
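  • A minimal sketch of constructing K sample recording audios from one group of voice, noise, and standard sample audio with randomly generated weighting coefficients, reading the mixture as y = r1·x1 + r2·x2 + r3·x3 (x1 voice, x2 noise, x3 standard sample audio); the coefficient range is an illustrative assumption:

```python
import numpy as np

def build_sample_recordings(speech, noise, standard, k=10, seed=0):
    """Construct K sample recording audios by weighting the same group of
    voice sample audio, noise sample audio, and standard sample audio with
    K arrays of randomly generated coefficients (assumed to lie in (0, 1))."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(k):
        r1, r2, r3 = rng.random(3)            # one array of three ordered coefficients
        samples.append(r1 * speech + r2 * noise + r3 * standard)
    return samples
```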
  • the processing process of each sample audio recording in the two initial network models is the same.
  • the sample recording audio can be input into the first initial network model in batches, that is, all sample recording audios are trained in batches; for the convenience of description, the following takes any one sample recording audio among all sample recording audios as an example to describe the training process of the above two initial network models.
  • FIG. 10 is a schematic diagram of training a deep network model provided by an embodiment of the present application.
  • the computer device can perform frequency domain transformation on the sample recording audio y to obtain the sample power spectrum data corresponding to the sample recording audio y, and perform normalization processing (for example, iLN normalization) on the sample power spectrum data to obtain the sample spectral feature; the sample spectral feature can then be used to form the input of the first initial network model 60b, and the first initial network model 60b can output the first sample frequency point gain.
  • the first initial network model 60b may refer to the first deep network model in the training phase, and the purpose of training the first initial network model 60b is to filter the standard sample audio contained in the sample recording audio.
  • the computer device can obtain the sample predicted speech audio 60c according to the first sample frequency point gain and the sample power spectrum data.
  • the calculation process of the sample predicted speech audio 60c is similar to the calculation process of the aforementioned candidate speech audio, and will not be repeated here.
  • the expected predicted speech audio corresponding to the first initial network model 60b can be determined by the speech sample audio x1 and the noise sample audio x2, and the expected predicted speech audio can be the signal (r1·x1 + r2·x2) in the above-mentioned sample recording audio y; that is to say, the expected output result of the first initial network model 60b can be the square-root-processed result of dividing the energy value of each frequency point in the power spectrum data of the signal (r1·x1 + r2·x2) (also called the power spectrum value of each frequency point) by the energy value of the corresponding frequency point in the sample power spectrum data.
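  • Reading the expected output as the square root of the ratio between the target power spectrum and the mixture power spectrum, a hedged sketch of the training target computation is shown below; the epsilon guard and the exact placement of the square root are assumptions:

```python
import numpy as np

def expected_gain(target_power, mixture_power, eps=1e-8):
    """Per-frequency-point training target: sqrt(target power / mixture power).
    For the first model, target_power is the power spectrum of (r1*x1 + r2*x2)
    and mixture_power is that of y; for the second model, target_power is the
    power spectrum of (r1*x1) and mixture_power is that of the sample
    predicted speech audio."""
    return np.sqrt(target_power / (mixture_power + eps))
```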
  • the computer device can input the power spectrum data corresponding to the sample predicted speech audio 60c into the second initial network model 60f, and the second sample frequency point gain corresponding to the sample predicted speech audio 60c can be output through the second initial network model 60f; the second sample frequency point gain may include the noise reduction gain of each frequency point corresponding to the sample predicted speech audio 60c, and the second sample frequency point gain is the actual output result of the second initial network model 60f for the above sample predicted speech audio 60c.
  • the second initial network model 60f may refer to the second deep network model in the training phase, and the purpose of training the second initial network model 60f is to suppress the environmental noise contained in the sample prediction speech audio.
  • the training samples of the second initial network model 60f need to be aligned with part of the samples of the first initial network model 60b; for example, the training samples of the second initial network model 60f can be the sample predicted speech audio 60c determined based on the first initial network model 60b.
  • the computer device can obtain the sample predicted noise-reduced audio 60g according to the second sample frequency point gain and the power spectrum data of the sample predicted speech audio 60c; the calculation process of the sample predicted noise-reduced audio 60g is similar to the calculation process of the aforementioned noise-reduced speech audio, and will not be repeated here.
  • the expected predicted noise-reduced audio corresponding to the second initial network model 60f can be determined by the speech sample audio x1, and the expected predicted noise-reduced audio can be the signal (r1·x1) in the above-mentioned sample recording audio y; that is to say, the expected output result of the second initial network model 60f can be the square-root-processed result of dividing the energy value of each frequency point in the power spectrum data of the signal (r1·x1) (also called the power spectrum value of each frequency point) by the energy value of the corresponding frequency point in the power spectrum data of the sample predicted speech audio 60c.
  • Based on the sample predicted speech audio and the expected predicted speech audio, the network parameters of the first initial network model are adjusted to obtain the first deep network model; the first deep network model is used to filter the recorded audio to obtain the candidate speech audio, where the recording audio includes a background reference audio component, a speech audio component, and an environmental noise component, and the candidate speech audio includes a speech audio component and an environmental noise component.
  • the process of using the first deep network model 60e can refer to the description in S207 above.
  • the above-mentioned first loss function 60d may also be a square term between the expected output result of the first initial network model 60b and the first sample frequency point gain (the actual output result).
  • Based on the sample predicted noise-reduced audio and the expected predicted noise-reduced audio, the second loss function 60h corresponding to the second initial network model 60f is determined; by optimizing the second loss function 60h to its minimum value, that is, minimizing the training loss, the network parameters of the second initial network model 60f are adjusted until the number of training iterations reaches the preset maximum number of iterations (or the training of the second initial network model 60f converges). The second initial network model at this time can be used as the second deep network model 60i, and the trained second deep network model 60i can be used to obtain the noise-reduced speech audio after performing noise reduction processing on the candidate speech audio.
  • the use process of the second deep network model 60i can refer to the description in S209 above.
  • the above-mentioned second loss function 60h may also be a square term between the expected output result of the second initial network model 60f and the second sample frequency point gain (the actual output result).
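  • A minimal sketch of the square-term (mean squared error) loss between the expected output gain and the network's predicted frequency point gain, assuming PyTorch tensors of the same shape:

```python
import torch

def gain_loss(expected_gain, predicted_gain):
    """Mean of the squared difference between the expected output result and
    the actually predicted frequency point gain."""
    return torch.mean((expected_gain - predicted_gain) ** 2)
```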
  • By weighting the same group of voice sample audio, noise sample audio, and standard sample audio with different coefficients, the number of sample recording audios can be expanded, and training the first initial network model and the second initial network model with these sample recording audios can improve the generalization ability of the network models. By aligning the training samples of the second initial network model with part of the training samples of the first initial network model (partial signals contained in the sample recording audio), the overall correlation between the first initial network model and the second initial network model can be enhanced, which can improve the noise reduction effect on the recorded audio when the trained first deep network model and second deep network model are used for noise reduction processing.
  • FIG. 11 is a schematic structural diagram of an audio data processing device provided by an embodiment of the present application.
  • the audio data processing device 1 may include: an audio acquisition module 11, a retrieval module 12, an audio filtering module 13, an audio determination module 14, and a noise reduction processing module 15;
  • Audio acquisition module 11 is used to obtain audio recording; audio recording includes background reference audio component, voice audio component and environmental noise component;
  • Retrieval module 12 is used for determining the prototype audio that matches with recording audio from audio database
  • Audio filtering module 13 for obtaining candidate voice audio from recording audio according to prototype audio;
  • Candidate voice audio includes voice audio component and environmental noise component;
  • Audio determining module 14 is used to determine the difference between recording audio and candidate voice audio as the background reference audio component contained in the recording audio;
  • the noise reduction processing module 15 is used to perform environmental noise reduction processing on the candidate speech audio, obtain the noise reduction speech audio corresponding to the candidate speech audio, merge the noise reduction speech audio and the background reference audio component, and obtain the recorded audio after the noise reduction .
  • For the specific function implementation of the audio acquisition module 11, the retrieval module 12, the audio filtering module 13, the audio determination module 14, and the noise reduction processing module 15, refer to S101-S105 in the above-mentioned embodiment corresponding to FIG. 3, which will not be repeated here.
  • the retrieval module 12 is specifically configured to obtain the audio fingerprint to be matched corresponding to the recorded audio, and obtain the prototype audio matching the recorded audio from the audio database according to the audio fingerprint to be matched.
  • the fingerprint retrieval module 12 may include: a frequency domain transformation unit 121, a spectrum band division unit 122, an audio fingerprint combination unit 123, and a prototype audio matching unit 124;
  • the frequency domain conversion unit 121 is used to divide the recording audio into M recording data frames, perform frequency domain conversion on the i-th recording data frame in the M recording data frames, and obtain the power spectrum data corresponding to the i-th recording data frame ; Both i and M are positive integers, and i is less than or equal to M;
  • the spectral band division unit 122 is used to divide the power spectrum data corresponding to the i-th recording data frame into N spectral bands, and construct sub-fingerprint information corresponding to the i-th recording data frame according to the peak signal in the N spectral bands; N is a positive integer;
  • the audio fingerprint combination unit 123 is used to combine the sub-fingerprint information corresponding to the M recording data frames according to the time sequence of the M recording data frames in the recording audio, so as to obtain the audio fingerprint to be matched corresponding to the recording audio;
  • the prototype audio matching unit 124 is used to obtain the audio fingerprint database corresponding to the audio database, perform fingerprint retrieval in the audio fingerprint database according to the audio fingerprint to be matched, and determine the prototype audio matching the recorded audio in the audio database according to the fingerprint retrieval result.
  • the prototype audio matching unit 124 is specifically used for:
  • the audio fingerprint to which the first hash value belongs is determined as the fingerprint retrieval result, and the audio data corresponding to the fingerprint retrieval result is determined as the prototype audio corresponding to the recording audio.
  • the specific function implementation of the frequency domain conversion unit 121, the spectrum band division unit 122, the audio fingerprint combination unit 123, and the prototype audio matching unit 124 can refer to S202 and S205 in the embodiment corresponding to FIG. 5 above, and will not be repeated here.
  • the audio filtering module 13 may include: a normalization processing unit 131, a first frequency point gain output unit 132, and a speech audio acquisition unit 133;
  • the normalization processing unit 131 is used to obtain the recording power spectrum data corresponding to the recording audio, and perform normalization processing on the recording power spectrum data to obtain the first spectral feature;
  • the above-mentioned normalization processing unit 131 is also used to obtain the prototype power spectrum data corresponding to the prototype audio, perform normalization processing on the prototype power spectrum data, obtain the second spectrum feature, and combine the first spectrum feature and the second spectrum feature into input features;
  • the first frequency point gain output unit 132 is used to input the input feature to the first deep network model, and output the first frequency point gain for recording audio through the first deep network model;
  • the speech audio obtaining unit 133 is configured to obtain the candidate speech audio included in the recorded audio according to the first frequency point gain and the recorded power spectrum data.
  • the first frequency point gain output unit 132 may include: a feature extraction subunit 1321, an activation subunit 1322;
  • the feature extraction subunit 1321 is configured to input the input features to the first deep network model, extract the network layer according to the features in the first deep network model, and obtain the time series distribution features corresponding to the input features;
  • the activation subunit 1322 is used to obtain the time series feature vector corresponding to the time series distribution feature according to the fully connected network layer in the first deep network model, and output the first frequency point gain through the activation layer in the first deep network model according to the time series feature vector.
  • the first frequency point gain includes speech gains corresponding to T frequency points respectively
  • the recording power spectrum data includes energy values corresponding to the T frequency points respectively
  • the T speech gains correspond one-to-one to the T energy values; T is a positive integer greater than 1;
  • the speech audio acquisition unit 133 may include: a frequency point weighting subunit 1331, a weighted energy value combination subunit 1332, and a time domain transformation subunit 1333;
  • the frequency point weighting subunit 1331 is used to weight the energy values belonging to the same frequency point in the recording power spectrum data according to the speech gains corresponding to the T frequency points in the first frequency point gain, so as to obtain the T frequency points corresponding to The weighted energy value of ;
  • the weighted energy value combination subunit 1332 is used to determine the weighted recorded audio domain signal corresponding to the recorded audio according to the weighted energy values corresponding to the T frequency points respectively;
  • the time-domain transformation subunit 1333 is configured to perform time-domain transformation on the weighted recorded audio domain signal to obtain candidate speech audio contained in the recorded audio.
  • For the specific function implementation of the normalization processing unit 131, the first frequency point gain output unit 132, the speech audio acquisition unit 133, the feature extraction subunit 1321, the activation subunit 1322, the frequency point weighting subunit 1331, the weighted energy value combination subunit 1332, and the time domain transformation subunit 1333, refer to S206 and S208 in the above embodiment corresponding to FIG. 5, which will not be repeated here.
  • the noise reduction processing module 15 may include: a second frequency point gain output unit 151, a signal weighting unit 152, and a time domain transformation unit 153;
  • the second frequency point gain output unit 151 is used to obtain the voice power spectrum data corresponding to the candidate voice audio, input the voice power spectrum data to the second deep network model, and output the second frequency for the candidate voice audio through the second deep network model. point gain;
  • a signal weighting unit 152 configured to obtain a weighted voice domain signal corresponding to the candidate voice audio according to the second frequency point gain and the voice power spectrum data;
  • the time-domain transformation unit 153 is configured to perform time-domain transformation on the weighted voice domain signal to obtain the noise-reduced voice audio corresponding to the candidate voice audio.
  • the specific function implementation manners of the second frequency point gain output unit 151, the signal weighting unit 152, and the time domain transformation unit 153 can refer to S209 and S210 in the above-mentioned embodiment corresponding to FIG. 5 , and will not be repeated here.
  • the audio data processing device 1 may also include: an audio sharing module 16;
  • the audio sharing module 16 is configured to share the noise-reduced recorded audio to the social platform, so that the terminal device in the social platform plays the noise-reduced recorded audio when accessing the social platform.
  • the specific function implementation manner of the audio sharing module 16 can refer to S105 in the above-mentioned embodiment corresponding to FIG. 3 , which will not be repeated here.
  • FIG. 12 is a schematic structural diagram of an audio data processing device provided by an embodiment of the present application.
  • the audio data processing device 2 may include: a sample acquisition module 21, a first prediction module 22, a second prediction module 23, a first adjustment module 24, and a second adjustment module 25;
  • the sample acquisition module 21 is used to obtain voice sample audio, noise sample audio, and standard sample audio, and generate sample recording audio according to the voice sample audio, the noise sample audio, and the standard sample audio; the voice sample audio and the noise sample audio are collected through recording, and the standard sample audio is the pure audio stored in the audio database;
  • the first prediction module 22 is used to obtain the sample predicted speech audio in the sample recording audio through the first initial network model; the first initial network model is used to filter the standard sample audio contained in the sample recording audio, and the expected predicted speech audio of the first initial network model is determined by the voice sample audio and the noise sample audio;
  • the second prediction module 23 is used to obtain the sample predicted noise-reduced audio corresponding to the sample predicted speech audio through the second initial network model; the second initial network model is used to suppress the noise sample audio contained in the sample predicted speech audio, and the expected predicted noise-reduced audio of the second initial network model is determined by the voice sample audio;
  • the first adjustment module 24 is used to adjust the network parameters of the first initial network model based on the sample predicted speech audio and the expected predicted speech audio to obtain the first deep network model; the first deep network model is used to filter the recorded audio to obtain the candidate speech audio; the recording audio includes a background reference audio component, a voice audio component, and an environmental noise component, and the candidate speech audio includes a voice audio component and an environmental noise component;
  • the second adjustment module 25 is used to adjust the network parameters of the second initial network model based on the sample predicted noise-reduced audio and the expected predicted noise-reduced audio to obtain the second deep network model; the second deep network model is used to perform noise reduction processing on the candidate speech audio to obtain the noise-reduced speech audio.
  • For the specific function implementation of the sample acquisition module 21, the first prediction module 22, the second prediction module 23, the first adjustment module 24, and the second adjustment module 25, refer to S301-S305 in the above-mentioned embodiment corresponding to FIG. 9, which will not be repeated here.
  • the number of sample audio recordings is K, and K is a positive integer
  • the sample acquisition module 21 may include: an array construction unit 211, a sample recording construction unit 212;
  • An array construction unit 211 configured to obtain a set of weighted coefficients for the first initial network model, and construct K arrays according to the set of weighted coefficients; each array includes coefficients respectively corresponding to voice sample audio, noise sample audio, and standard sample audio;
  • the sample recording construction unit 212 is used to weight the speech sample audio, noise sample audio and standard sample audio respectively according to the coefficients contained in the j-th array in the K arrays, so as to obtain the sample recording audio corresponding to the j-th array; j is a positive integer less than or equal to K.
  • FIG. 13 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the computer device 1000 may be a user terminal, for example, the user terminal 10a in the embodiment corresponding to Figure 1 above, or a server, for example, the server 10d in the embodiment corresponding to Figure 1 above, which is not limited here.
  • For ease of description, this application takes the case where the computer device is a user terminal as an example.
  • the computer equipment 1000 may include: a processor 1001, a network interface 1004, and a memory 1005.
  • the computer equipment 1000 may also include: a user interface 1003, and at least one communication bus 1002 . Wherein, the communication bus 1002 is used to realize connection and communication between these components.
  • the user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 may include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the memory 1005 can be a high-speed RAM memory, or a non-volatile memory, such as at least one disk memory.
  • the memory 1005 may also be at least one storage device located away from the aforementioned processor 1001 .
  • the memory 1005 as a computer-readable storage medium may include an operating system, a network communication module, a user interface module, and a device control application program.
  • the network interface 1004 in the computer device 1000 can also provide a network communication function
  • the optional user interface 1003 can also include a display screen (Display) and a keyboard (Keyboard).
  • the network interface 1004 can provide a network communication function
  • the user interface 1003 is mainly used to provide an input interface for the user
  • the processor 1001 can be used to call the device control application program stored in the memory 1005 to implement the following:
  • acquire recording audio; the recording audio includes a background reference audio component, a speech audio component, and an environmental noise component; determine, from an audio database, the prototype audio matching the recording audio; obtain candidate speech audio from the recording audio according to the prototype audio; the candidate speech audio includes the speech audio component and the environmental noise component; determine the difference between the recording audio and the candidate speech audio as the background reference audio component contained in the recording audio; perform environmental noise reduction processing on the candidate speech audio to obtain the noise-reduced speech audio corresponding to the candidate speech audio, and combine the noise-reduced speech audio with the background reference audio component to obtain the noise-reduced recording audio.
  • processor 1001 can also implement:
  • Acquire voice sample audio, noise sample audio, and standard sample audio, and generate sample recording audio based on the voice sample audio, the noise sample audio, and the standard sample audio; the voice sample audio and the noise sample audio are collected through recording, and the standard sample audio is the pure audio stored in the audio database;
  • obtain the sample predicted speech audio in the sample recording audio through the first initial network model; the first initial network model is used to filter the standard sample audio included in the sample recording audio, and the expected predicted speech audio of the first initial network model is determined by the voice sample audio and the noise sample audio;
  • obtain the sample predicted noise-reduced audio corresponding to the sample predicted speech audio through the second initial network model; the second initial network model is used to suppress the noise sample audio contained in the sample predicted speech audio, and the expected predicted noise-reduced audio of the second initial network model is determined by the voice sample audio;
  • based on the sample predicted speech audio and the expected predicted speech audio, adjust the network parameters of the first initial network model to obtain the first deep network model; the first deep network model is used to filter the recorded audio to obtain the candidate speech audio; the recorded audio includes a background reference audio component, a speech audio component, and an environmental noise component, and the candidate speech audio includes a speech audio component and an environmental noise component;
  • based on the sample predicted noise-reduced audio and the expected predicted noise-reduced audio, adjust the network parameters of the second initial network model to obtain the second deep network model; the second deep network model is used to perform noise reduction processing on the candidate speech audio to obtain the noise-reduced speech audio.
  • The computer device 1000 described in the embodiment of the present application can execute the description of the audio data processing method in any one of the embodiments corresponding to FIG. 3, FIG. 5, and FIG. 9, and can also execute the description of the audio data processing device 1 in the embodiment corresponding to FIG. 11 above, or the description of the audio data processing device 2 in the embodiment corresponding to FIG. 12, which will not be repeated here.
  • the description of the beneficial effect of adopting the same method will not be repeated here.
  • The embodiment of the present application also provides a computer-readable storage medium, and the computer-readable storage medium stores the computer program executed by the above-mentioned audio data processing device 1 and audio data processing device 2, and the computer program includes program instructions. When the processor executes the program instructions, it can execute the description of the audio data processing method in any one of the above-mentioned embodiments corresponding to FIG. 3, FIG. 5, and FIG. 9, which will not be repeated here.
  • the description of the beneficial effect of adopting the same method will not be repeated here.
  • program instructions may be deployed to execute on one computing device, or on multiple computing devices located at one site, or, alternatively, on multiple computing devices distributed across multiple sites and interconnected by a communication network .
  • Multiple computing devices distributed in multiple locations and interconnected by a communication network can form a blockchain system.
  • The embodiments of this application further provide a computer program product or computer program, where the computer program product or computer program may include computer instructions, and the computer instructions may be stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the audio data processing method described in any one of the embodiments corresponding to FIG. 3, FIG. 5, and FIG. 9 above, which therefore will not be repeated here. The description of the beneficial effects of using the same method will likewise not be repeated.
  • The modules in the apparatus of the embodiments of this application can be combined, divided, and deleted according to actual needs.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM) or a random access memory (Random Access Memory, RAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

一种音频数据处理方法、装置、设备以及介质,该方法包括:获取录音音频;录音音频包括背景基准音频分量、语音音频分量以及环境噪声分量(S101);从音频数据库中获取与录音音频相匹配的原型音频(S102);根据原型音频从录音音频中获取候选语音音频;候选语音音频包括语音音频分量和环境噪声分量(S103);将录音音频与候选语音音频之间的差值,确定为录音音频中所包含的背景基准音频分量(S104);对候选语音音频进行环境噪声降噪处理,得到候选语音音频对应的降噪语音音频,将降噪语音音频与背景基准音频分量进行合并,得到降噪后的录音音频(S105)。该方法可以提升录音音频的降噪效果。

Description

音频数据处理方法、装置、设备以及介质
本申请要求于2021年09月03日提交中国专利局、申请号202111032206.9、申请名称为“音频数据处理方法、装置、设备以及介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及音频处理技术领域,尤其涉及一种音频数据处理方法、装置、设备以及介质。
背景技术
随着音视频业务应用的迅速推广普及,用户使用音频业务应用分享日常音乐录音的频率日益增加。例如,当用户听着伴唱唱歌,通过具有录音功能的设备(例如手机或者接入麦克风的声卡设备)进行录音时,该用户可能处在嘈杂的环境中,或者使用的设备过于简易,这就导致该设备所录制的音乐录音信号除了包括用户的歌声(人声信号)、伴唱(音乐信号)之外,还可能会引入嘈杂环境中的噪声信号、设备中的电子噪声信号等。若是直接将未处理的音乐录音信号分享至音频业务应用,会导致其余用户在音频业务应用中播放音乐录音信号时很难听清用户的歌声,因此需要对所录制的音乐录音信号进行降噪处理。
目前的降噪算法需要明确噪声类型和信号类型,如基于人声和噪声从信号相关性、频谱分布特征上具有一定的特征距离,通过一些统计降噪或者深度学习降噪的方法进行噪声抑制。然而,音乐录音信号的音乐类型较多(例如,古典音乐、民族音乐、摇滚音乐等),有些音乐类型与一些环境噪声类型相似,或者一些音乐频谱特征与一些噪声比较接近,采用上述降噪算法对音乐录音信号进行降噪处理时,可能会将音乐信号误判为噪声信号进行抑制,或者将噪声信号误判别音乐信号进行保留,造成音乐录音信号的降噪效果并不理想。
发明内容
本申请实施例提供一种音频数据处理方法、装置、设备以及介质,可以提升录音音频的降噪效果。
本申请实施例一方面提供了一种音频数据处理方法,该方法由计算机设备执行,包括:
获取录音音频;录音音频包括背景基准音频分量、语音音频分量以及环境噪声分量;
从音频数据库中确定与录音音频相匹配的原型音频;
根据原型音频从录音音频中获取候选语音音频;候选语音音频包括语音音频分量和环境噪声分量;
将录音音频与候选语音音频之间的差值,确定为录音音频中所包含的背景基准音频分量;
对候选语音音频进行环境噪声降噪处理,得到候选语音音频对应的降噪语音音频,将降噪语音音频与背景基准音频分量进行合并,得到降噪后的录音音频。
本申请实施例一方面提供了一种音频数据处理方法,该方法由计算机设备执行,包括:
获取语音样本音频、噪声样本音频以及标准样本音频,根据语音样本音频、噪声样本音频以及标准样本音频,生成样本录音音频;语音样本音频和噪声样本音频是通过录音采集得到的,标准样本音频是音频数据库中所存储的纯净音频;
通过第一初始网络模型获取样本录音音频中的样本预测语音音频;第一初始网络模型用于过滤样本录音音频所包含的标准样本音频,第一初始网络模型的期望预测语音音频由语音样本音频和噪声样本音频所确定;
通过第二初始网络模型获取样本预测语音音频对应的样本预测降噪音频;第二初始网络模型用于抑制样本预测语音音频所包含的噪声样本音频,第二初始网络模型的期望预测降噪音频由语音样本音频所确定;
基于样本预测语音音频和期望预测语音音频,对第一初始网络模型的网络参数进行调整,得到第一深度网络模型;第一深度网络模型用于对录音音频进行过滤后得到候选语音音频,录音音频包括背景基准音频分量、语音音频分量以及环境噪声分量,候选语音音频包括语音音频分量和环境噪声分量;
基于样本预测降噪音频和期望预测降噪音频,对第二初始网络模型的网络参数进行调整,得到第二深度网络模型;第二深度网络模型用于对候选语音音频进行降噪处理后得到降噪语音音频。
本申请实施例一方面提供了一种音频数据处理装置,该装置部署在计算机设备上,包括:
音频获取模块,用于获取录音音频;录音音频包括背景基准音频分量、语音音频分量以及环境噪声分量;
检索模块,用于从音频数据库中确定与录音音频相匹配的原型音频;
音频过滤模块,用于根据原型音频从录音音频中获取候选语音音频;候选语音音频包括语音音频分量和环境噪声分量;
音频确定模块,用于将录音音频与候选语音音频之间的差值,确定为录音音频中所包含的背景基准音频分量;
降噪处理模块,用于对候选语音音频进行环境噪声降噪处理,得到候选语音音频对应的降噪语音音频,将降噪语音音频与背景基准音频分量进行合并,得到降噪后的录音音频。
本申请实施例一方面提供了一种音频数据处理装置,该装置部署在计算机设备上,包括:
样本获取模块,用于获取语音样本音频、噪声样本音频以及标准样本音频,根据语音样本音频、噪声样本音频以及标准样本音频,生成样本录音音频;语音样本音频和噪声样本音频是通过录音采集得到的,标准样本音频是音频数据库中所存储的纯净音频;
第一预测模块,用于通过第一初始网络模型获取样本录音音频中的样本预测语音音频;第一初始网络模型用于过滤样本录音音频所包含的标准样本音频,第一初始网络模型的期望预测语音音频由语音样本音频和噪声样本音频所确定;
第二预测模块,用于通过第二初始网络模型获取样本预测语音音频对应的样本预测降噪音频;第二初始网络模型用于抑制样本预测语音音频所包含的噪声样本音频,第二初始网络模型的期望预测降噪音频由语音样本音频所确定;
第一调整模块,用于基于样本预测语音音频和期望预测语音音频,对第一初始网络模型的网络参数进行调整,得到第一深度网络模型;第一深度网络模型用于对录音音频进行过滤后得到候选语音音频,录音音频包括背景基准音频分量、语音音频分量以及环境噪声分量,候选语音音频包括语音音频分量和环境噪声分量;
第二调整模块,用于基于样本预测降噪音频和期望预测降噪音频,对第二初始网络模型的网络参数进行调整,得到第二深度网络模型;第二深度网络模型用于对候选语音音频进行降噪处理后得到降噪语音音频。
本申请实施例一方面提供了一种计算机设备,包括存储器和处理器,存储器与处理器相连,存储器用于存储计算机程序,处理器用于调用计算机程序,以使得该计算机设备执行本申请实施例中上述一方面提供的方法。
本申请实施例一方面提供了一种计算机可读存储介质,计算机可读存储介质中存储有计算机程序,计算机程序适于由处理器加载并执行,以使得具有处理器的计算机设备执行本申请实施例中上述一方面提供的方法。
根据本申请的一个方面,提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述一方面提供的方法。
本申请实施例可以获取包含背景基准音频分量、语音音频分量以及环境噪声分量的录音音频,从音频数据库中获取与录音音频相匹配的原型音频,进而可以根据原型音频从录音音频中获取候选语音音频,该候选语音音频包括语音音频分量和环境噪声分量。通过这种方式可以将对录音音频的降噪处理问题转换为候选语音音频的降噪处理问题,进而直接对候选语音音频进行环境噪声降噪处理,得到候选语音音频对应的降噪语音音频,避免录音音频中的背景基准音频分量与环境噪声分量进行混淆。由于录音音频与候选语音音频之间的差值即为背景基准音频分量,故将该降噪语音音频与背景基准音频分量进行合并,可以得到降噪后的录音音频。可见,本申请通过将录音音频的降噪处理问题转换为候选语音音频的降噪处理问题,可以避免将录音音频中的背景基准音频分量与环境噪声分量进行混淆,进而可以提升录音音频的降噪效果。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的一种网络架构的结构示意图;
图2是本申请实施例提供的一种音乐录音音频的降噪场景示意图;
图3是本申请实施例提供的一种音频数据处理方法的流程示意图;
图4是本申请实施例提供的一种音乐录音场景的示意图;
图5是本申请实施例提供的一种音频数据处理方法的流程示意图;
图6是本申请实施例提供的一种第一深度网络模型的结构示意图;
图7是本申请实施例提供的一种第二深度网络模型的结构示意图;
图8是本申请实施例提供的一种录音音频降噪处理的流程示意图;
图9是本申请实施例提供的一种音频数据处理方法的流程示意图;
图10是本申请实施例提供的一种深度网络模型的训练示意图;
图11是本申请实施例提供的一种音频数据处理装置的结构示意图;
图12是本申请实施例提供的一种音频数据处理装置的结构示意图;
图13是本申请实施例提供的一种计算机设备的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请实施例提供的方案涉及人工智能云服务中的AI(人工智能,Artificial Intelligence)降噪服务,本申请实施例中可以通过API(应用程序接口,Application Program Interface)的方式接入AI降噪服务,通过AI降噪服务对分享至社交平台(例如,音乐类录音分享应用)的录音音频进行降噪处理,以提升录音音频的降噪效果。
请参见图1,图1是本申请实施例提供的一种网络架构的结构示意图。如图1所示,该网络架构可以包括服务器10d和用户终端集群,该用户终端集群可以包括一个或者多个用户终端,这里不对用户终端的数量进行限制。如图1所示,该用户终端集群可以具体包括用户终端10a、用户终端10b以及用户终端10c等。其中,服务器10d可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN、以及大数据和人工智能平台等基础云计算服务的云服务器。用户终端10a、用户终端10b以及用户终端10c等均可以包括但不限于:智能手机、平板电脑、笔记本电脑、掌上电脑、移动互联网设备(mobile internet device,MID)、可穿戴设备(例如智能手表、智能手环等)以及智能电视等具有录音功能的智能终端,或者为接入麦克风的声卡设备等。如图1所示,用户终端10a、用户终端10b以及用户终端10c等可以分别与服务器10d进行网络连接,以便于每个用户终端可以通过该网络连接与服务器10d之间进行数据交互。
以图1所示的用户终端10a为例,该用户终端10a可以集成有录音功能,当用户想要录制自己或他人的音频数据时,可以使用音频播放设备播放背景基准音频(此处的背景基准音频可以为音乐伴唱,或者为视频中的背景音频和字幕配音音频等),并启动用户终端10a中的录音功能,开始录制包含上述音频播放设备所播放的背景基准音频的混合音频,本申请可以将该混合音频称为录音音频,背景基准音频可以作为上述录音音频中的背景基准音频分量。其中,当用户终端10a具有音频播放功能时,上述音频播放设备可以是用户终 端10a本身;或者,音频播放设备还可以是用户终端10a之外的其余具有音频播放功能的设备;上述录音音频可以为包含音频播放设备所播放的背景基准音频、音频播放设备/用户所处环境中的环境噪声以及用户语音的缓和音频,录制的背景基准音频可以作为录音音频中的背景基准音频分量,录制的环境噪声可以作为录音音频中的环境噪声分量,录制的用户语音可以作为录音音频中的语音音频分量。用户终端10a可以将录制好的录音音频上传至社交平台;例如,用户终端10a安装有社交平台的客户端时,可以将录制好的录音音频上传至社交平台的客户端,该社交平台的客户端可以将录音音频传输至社交平台的后台服务器(例如,上述图1所示的服务器10d)。
由于可能录制到环境噪声,故录音音频中包含环境噪声分量,因此社交平台的后台服务器需要对该录音音频进行降噪处理。该录音音频的降噪处理过程可以为:从音频数据库中获取与录音音频相匹配的原型音频(此处的原型音频可以理解为录音音频中的背景基准音频分量所对应的官方正版音频);基于原型音频可以从录音音频中获取候选语音音频(包括上述环境噪声和上述用户语音),进而可以将录音音频与候选语音音频之间的差值确定为背景基准音频分量;对候选语音音频进行降噪处理,可以得到候选语音音频对应的降噪语音音频,将降噪语音音频与背景基准音频分量进行叠加后可以得到降噪后的录音音频,此时降噪后的录音音频可以在社交平台中进行分享。通过将录音音频的降噪处理问题转换为候选语音音频的降噪处理问题,可以提升录音音频的降噪效率。
请参见图2,图2是本申请实施例提供的一种音乐录音音频的降噪场景示意图。如图2所示的用户终端20a可以为用户A所持有的终端设备(例如,上述图1所示的用户终端集群中的任一个用户终端),该用户终端20a中集成有录音功能和音频播放功能,因此该用户终端20a既可以作为录音设备,也可以作为音频播放设备。当用户A想要录制自己演唱的音乐录音时,可以启动该用户终端20a中的录音功能,在该用户终端20a播放音乐伴唱的背景下开始演唱歌曲,并开始录制音乐,录制完成后,可以得到音乐录音音频20b,此时本申请实施例的录音音频为音乐录音音频20b,该音乐录音音频20b可以包含用户A的歌声(即语音音频分量)和用户终端20a所播放的音乐伴唱(即背景基准音频分量)。用户终端20a可以将录制的音乐录音音频20b上传至音乐类应用对应的客户端,该客户端获取到音乐录音音频20b后,将音乐录音音频20b传输至音乐类应用对应的后台服务器(例如,上述图1所示的服务器10d),以使后台服务器对该音乐录音音频20b进行存储和分享。
其中,在实际的音乐录音场景中,用户A可能会处于嘈杂的环境中,因此,上述用户终端20a所录制的音乐录音音频20b中除了包含用户A的歌声和该用户终端20a所播放的音乐伴唱之外,还会包含环境噪声,即音乐录音音频20b可以包括环境噪声、音乐伴唱以及用户歌声三个音频分量。假设用户A在街道上,那么用户终端20a所录制的音乐录音音频20b中的环境噪声可以为车辆的鸣笛声、路边门店的吆喝声以及路人的说话声等;当然,音乐录音音频20b中的环境噪声还可以包括电子噪声。若是后台服务器直接将用户终端20a所上传的音乐录音音频20b进行分享,会导致其余终端设备在访问音乐类应用并播放音乐录音音频20a时无法听清用户A所录制的音乐。因此,在音乐类应用中分享音乐录音音频 20b之前,需要对音乐录音音频20b进行降噪处理,再将降噪后的音乐录音音频进行分享,使得其余终端设备在访问音乐类应用时可以播放降噪后的音乐录音音频,了解用户A的真实歌唱水平;换言之,用户终端20a仅负责音乐录音音频20b的采集及上传操作,音乐录音音频20b的降噪处理过程可以由音乐类应用对应的后台服务器执行。在一种可能的实现方式中,用户终端20a在采集到音乐录音音频20b后,可以由用户终端20a对音乐录音音频20b进行降噪处理,并将降噪后的音乐录音音频上传至音乐类应用,该音乐类应用对应的后台服务器接收到降噪后的音乐录音音频后,可以直接对降噪后的音乐录音音频进行分享,即音乐录音音频20b的降噪处理可以由用户终端20a执行。
其中,下面以音乐类应用的后台服务器(例如,上述服务器10d)为例,对音乐录制音频20b的降噪处理过程进行描述。该音乐录制音频20b的降噪处理的本质是对该音乐录制音频20b中的环境噪声进行抑制,并保留该音乐录音音频20b中的音乐伴唱和用户A的歌声。换言之,对音乐录音音频20b进行降噪,就是尽可能地消除音乐录音音乐20b中的环境噪声,但是需要尽可能地保持音乐录音音频20b中的音乐伴唱和用户A的歌声不被改变。
如图2所示,音乐类应用的后台服务器(例如,上述服务器10d)获取到音乐录音音频20b后,可以对该音乐录音音频20b进行频域变换,即将音乐录音音频20b由时域变换到频域,得到音乐录音音频20b对应的频域功率谱;该频域功率谱可以包括各个频点分别对应的能量值,该频域功率谱可以如图2中的频域功率谱20i所示,该频域功率谱20i中的一个能量值对应于一个频点,一个频点即为一个频率采样点。
根据音乐录音音频20b对应的频域功率谱,可以提取该音乐录音音频20b对应的音频指纹20c(即待匹配音频指纹);其中,音频指纹可以是指以标识符的形式表示一段音频中独有的数字特征。后台服务器可以获取音乐类应用中的曲库20d,以及该曲库20d对应的音频指纹库20e,该曲库20d可以包括音乐类应用中所存储的所有音乐音频,该音频指纹库20e可以包括曲库20d中的每首音乐音频分别对应的音频指纹。进而可以根据音乐录音音频20b对应的音频指纹20c,在音频指纹库20e中进行音频指纹检索,得到该音频指纹20c对应的指纹检索结果(即音频指纹库20e中与音频指纹20b相匹配的音频指纹),根据指纹检索结果可以从曲库20d中确定与音乐录音音频20b相匹配的音乐原型音频20f(如音乐录音音频20b中的音乐伴唱所对应的音乐原型,即原型音频)。同样地,可以对音乐原型音频20f进行频域变换,即将音乐原型音频20f由时域变换到频域,得到音乐原型音频20f对应的频域功率谱。
将音乐录音音频20b对应的频域功率谱与音乐原型音乐20f对应的频域功率谱进行特征组合,并将组合后的频域功率谱输入至第一阶深度网络模型20g,通过第一阶深度网络模型20g输出频点增益。其中,第一阶深度网络模型20g可以为预先训练好的、具备对音乐录音音频进行去音乐处理能力的网络模型,第一阶深度网络模型20g的训练过程可以参见下述S304中所描述的过程。通过将第一阶深度网络模型20g输出的频点增益与音乐录音音频20b对应的频域功率谱相乘,得到加权录音频域信号,将加权录音频域信号进行时域 变换,即将加权录音频域信号由频域变换到时域,得到去音乐音频20k,此处的去音乐音频20k可以是指从音乐录音音频20b中过滤掉音乐伴唱的音频信号。
如图2所示,假设第一阶深度网络模型20g输出的频点增益为频点增益序列20h,该频点增益序列20h中包括5个频点分别对应的语音增益,包括频点1对应的语音增益5、频点2对应的语音增益7、频点3对应的语音增益8、频点4对应的语音增益10以及频点5对应的语音增益3。假设音乐录音音频20b对应的频域功率谱为频域功率谱20i,该频域功率谱20i中也包括上述5个频点分别对应的能量值,具体包括频点1对应的能量值1、频点2对应的能量值2、频点3对应的能量值3、频点4对应的能量值2以及频点5对应的能量值1。通过计算频点增益序列20h中各个频点的语音增益和频域功率谱20i中对应于相同频点的能量值之间的乘积,得到加权录音频域信号20j;其计算过程具体为:计算频点增益序列20h中的频点1对应的语音增益5与频域功率谱20i中的频点1对应的能量值1之间的乘积,得到加权后的能量值5,该加权后的能量值5即为加权录音频域信号20j中针对频点1的能量值5;计算频点增益序列20h中的频点2对应的语音增益7与频域功率谱20i中的频点2对应的能量值2之间的乘积,得到加权录音频域信号20j中针对频点2的能量值14;计算频点增益序列20h中的频点3对应的语音增益8与频域功率谱20i中的频点3对应的能量值3之间的乘积,得到加权录音频域信号20j中针对频点3的能量值24;计算频点增益序列20h中的频点4对应的语音增益10与频域功率谱20i中的频点4对应的能量值2之间的乘积,得到加权录音频域信号20j中针对频点4的能量值20;计算频点增益序列20h中的频点5对应的语音增益3与频域功率谱20i中的频点4对应的能量值1之间的乘积,得到加权录音频域信号20j中针对频点5的能量值3。通过对加权录音频域信号20j进行时域变换,可以得到去音乐音频20k(即候选语音音频),该去音乐音频20k可以包含环境噪声和用户歌声两个分量。
后台服务器在得到去音乐音频20k后,可以将音乐录音音频20b与去音乐音频20k之间的差值,确定为音乐录音音频20b中所包含的纯音乐音频20p(即背景基准音频分量),此处的纯音乐音频20p可以为音乐播放设备所播放的音乐伴唱。与此同时,还可以对去音乐音频20k进行频域变换,得到去音乐音频20k对应的频域功率谱,将去音乐音频20k对应的频域功率谱输入第二阶深度网络模型20m中,通过第二阶深度网络模型20m输出去音乐音频20k对应的频点增益。其中,第二阶深度网络模型20m可以为预先训练好的、具备对携带噪声的语音音频进行降噪处理能力的网络模型,第二阶深度网络模型20m的训练过程可以参见下述S305中所描述的过程。通过将第二阶深度网络模型20m输出的频点增益与去音乐音频20k对应的频域功率谱相乘,得到加权语音频域信号,将加权语音频域信号进行时域变换,得到人声去噪音频20n(即降噪语音音频),此处的人声去噪音频20n可以是指对去音乐音频20k进行噪声抑制后所得到的音频信号,如音乐录音音频20b中的用户A的歌声。其中,上述第一阶深度网络模型20g和第二阶深度网络模型20m可以为具有不同网络结构的深度网络;人声去噪音频20n的计算过程与上述去音乐音频20k的计算过程类似,此处不再进行赘述。
后台服务器可以将纯音乐音频20p与人声去噪音频20n进行叠加,得到降噪后的音乐录音音频20q(即降噪后的录音音频)。通过从音乐录音音频20b中分离出纯音乐音频20q,将音乐录音音频20b的降噪处理转换为去音乐音频20k(可以理解为人声音频)的降噪处理,使得降噪后的音乐录音音频20q既保留了用户A的歌声和音乐伴唱,又能够最大程度抑制音乐录音音频20b中的环境噪声,提升了音乐录音音频20b的降噪效果。
请参见图3,图3是本申请实施例提供的一种音频数据处理方法的流程示意图。可以理解地,该音频数据处理方法可以由计算机设备执行,该计算机设备可以为用户终端,或者为服务器,或者为计算机设备中的一个计算机程序应用(包括程序代码),这里不做具体限定。如图3所示,该音频数据处理方法可以包括以下S101-S105:
S101,获取录音音频;录音音频包括背景基准音频分量、语音音频分量以及环境噪声分量。
计算机设备可以获取包含背景基准音频分量、语音音频分量以及环境噪声分量的录音音频,该录音音频可以是通过录音设备对处于待录制环境下的待录制对象和音频播放设备进行共同录音采集得到的混合音频。其中,录音设备可以为具有录音功能的设备,如接入麦克风的声卡设备、手机等;音频播放设备可以为具有音频播放功能的设备,如手机、音乐播放设备以及音响设备等;待录制对象可以是指需要进行语音录制的用户,如上述图2所对应实施例中的用户A;待录制环境可以为待录制对象和音频播放设备所处的录制环境,如待录制对象和音频播放设备所处的室内空间、室外空间(例如,街道、公园)等。当某一个设备同时具备录音功能和音频播放功能时,该设备既可以作为录音设备,也可以作为音频播放设备,即本申请中的音频播放设备和录音设备可以为同一个设备,如上述图2所对应实施例中的用户终端20a。需要说明的是,计算机设备所获取到的录音音频可以为录音设备传输至该计算机设备的录音数据,或者可以为计算机设备自身采集到的录音数据,如上述计算机设备具备录音功能和音频播放功能时,同样既可以作为录音设备又可以作为音频播放设备,该计算机设备可以安装有音频类应用,可以通过该音频类应用中的录制功能,来实现上述录音音频的录制过程。
在一种可能的实现方式中,假设待录制对象想要录制自己演唱的音乐录音,那么该待录制对象可以启动录音设备中的录音功能,并使用音频播放设备播放音乐伴唱,在播放音乐伴唱的背景下演唱歌曲,开始使用录音设备录制音乐;录制完成后,可以将所录制的音乐录音作为上述录音音频,此时的录音音频可以包括音频播放设备所播放的音乐伴唱、待录制对象的歌声;若待录制环境是一个嘈杂的环境,则录音音频中还可以包括待录制环境中的环境噪声;此处录制的音乐伴唱可以作为录音音频中的背景基准音频分量,如上述图2所对应实施例中用户终端20a所播放的音乐伴唱;录制的待录制对象的歌声可以作为录音音频中的语音音频分量,如上述图2所对应实施例中用户A的歌声;录制的环境噪声可以作为录音音频中的环境噪声分量,如上述图2所对应实施例中用户终端20a所处环境中的环境噪声,该录音音频可以如上述图2所对应实施例中的音乐录音音频20b。
在一种可能的实现方式中,假设目标用户想要录制自己的配音音频,那么该待录制对象可以启动录音设备中的录音功能,并使用音频播放设备播放待配音片段中的背景音频, 在播放背景音频的基础上进行配音,开始使用录制设备录制配音;录制完成后,可以将所录制的配音音频作为上述录音音频,此时的录音音频可以包括音频播放设备所播放的背景音频、待录制对象的配音;若待录制环境是一个嘈杂的环境,则录音音频中还可以包括待录制环境中的环境噪声;此处录制的背景音频可以作为录音音频中的背景基准音频分量;录制的待录制对象的配音可以作为录音音频中的语音音频分量;录制的环境噪声可以作为录音音频中的环境噪声分量。
换言之,计算机设备所获取的录音音频可以包括音频播放设备所播出的音频(例如,上述音乐伴唱、待配音片段中的背景音频等)、待录制对象所输出的语音(例如,上述用户的配音、歌声等)以及待录制环境中的环境噪声。可以理解的是,上述音乐录制场景以及配音录制场景仅为本申请中的举例,本申请还可以应用在其余音频录制场景中,例如:待录制对象与音频播放设备之间的人机问答交互场景、待录制对象与音频播放设备之间的语言类表演场景(相声表演场景等),本申请对此不做限定。
S102,从音频数据库中确定与录音音频相匹配的原型音频。
由于计算机设备所获取到的录音音频中除了包含待录制对象所输出的音频和音频播放设备所播放的音频之外,还可能包含待录制环境中的环境噪声。例如,待录制对象和音频播放设备所处的待录制环境为商场时,上述录音音频中的环境噪声可以是商场的宣传活动广播声、商铺店员的吆喝声,以及录音设备的电子噪声等;待录制对象和音频播放设备所处的待录制环境为办公室内时,上述录音音频中的环境噪声可以是空调机的运行声音或者风扇的转动声音,以及录音设备的电子噪声等。因此,计算机设备需要对获取到的录音音频进行降噪处理,而降噪处理所要达到的效果为尽可能地抑制录音音频中的环境噪声,而保持录音音频中所包含的待录制对象所输出的音频和音频播放设备所播放的音频不被改变。
由于背景基准音频分量和环境噪声分量可能存在相似之处,为了避免将背景基准音频分量和环境噪声分量混淆,可以将录音音频的降噪处理问题转换为不包括背景基准音频分量的人声降噪。因此,可以先从音频数据库中确定与录音音频相匹配的原型音频,以便得到去除背景基准音频分量的候选语音音频。
在一种可能的实现方式中,S102的实现方式可以是直接根据录音音频进行匹配,得到原型音频;也可以是先获取录音音频对应的待匹配音频指纹,根据待匹配音频指纹在音频数据库中获取与录音音频相匹配的原型音频。
在对录音音频进行降噪处理的过程中,计算机设备可以对录音音频进行数据压缩,将录音音频映射为数字摘要信息,此处的数字摘要信息可以称为该录音音频对应的待匹配音频指纹,待匹配音频指纹的数据量远小于上述录音音频的数据量,从而提高检索准确性和检索效率。计算机设备还可以获取音频数据库,并获取该音频数据库对应的音频指纹库,将上述待匹配音频指纹与音频指纹库中所包含的音频指纹进行匹配,在音频指纹库中找到与待匹配音频指纹相匹配的音频指纹,并将相匹配的音频指纹所对应的音频数据确定为录音音频对应的原型音频(例如,上述图2所对应实施例中的音乐原型音频20f);换言之,计算机设备可以基于音频指纹检索技术,从音频数据库中检索到与录音音频相匹配的原型音频。其中,上述音频数据库中可以包括音频类应用所包含的所有音频数据,音频指纹库 中可以包括音频数据库中的每个音频数据所对应的音频指纹,该音频数据库和音频指纹库可以是预先配置好的;例如,上述录音音频为音乐录音音频时,音频数据库可以为包含全部音乐序列的数据库;上述录音音频为配音录制音频时,音频数据库可以为包含全部视频数据中的音频的数据库;等等。计算机设备在对录音音频进行音频指纹检索时可以直接访问音频数据库和音频指纹库,以检索得到与录音音频相匹配的原型音频,原型音频可以是指录音音频中的语音播放设备所播放的音频对应的原始音频;例如,当录音音频为音乐录音音频时,原型音频可以为音乐录音音频中所包含的音乐伴唱所对应的音乐原型;当录音音频为配音录制音频时,原型音频可以为配音录制音频中所包含的视频背景音频所对应的原型配音等。
其中,计算机设备所采用的音频指纹检索技术可以包括但不限于:philips音频检索技术(一种检索技术,可以包括高度鲁棒性的指纹提取方法和高效的指纹搜索策略两个部分)、shazam音频检索技术(一种音频检索技术,可以包括音频指纹提取和音频指纹匹配两个部分);本申请可以根据实际需求选择合适的音频检索技术来检索得到上述原型音频,例如:基于上述两种音频指纹检索技术的改进技术,本申请对所使用的音频检索技术不做限定。其中,在音频指纹检索技术中,计算机设备所提取的待匹配音频指纹可以通过录音音频的常用音频特征来表示,其中常用音频特征可以包括但不限于:傅里叶系数、梅尔频率倒谱系数(Mel-Frequency Cepstral Coefficients,MFCC)、谱平坦度、锐度、LPC(线性预测系数)系数等。计算机设备所采用的音频指纹匹配算法可以包括但不限于:基于距离的匹配算法(当计算机设备在音频指纹库中找到音频指纹A与待匹配音频指纹之间的距离最短时,表明该音频指纹A所对应的音频数据即为录音音频对应的原型音频),基于索引的匹配方法,基于阈值的匹配方法;本申请可以根据实际需求选择合适的音频指纹提取算法和音频指纹匹配算法,本申请对此不做限定。
S103,根据原型音频从录音音频中获取候选语音音频;候选语音音频包括语音音频分量和环境噪声分量。
计算机设备从音频数据库中检索得到与录音音频相匹配的原型音频后,可以根据该原型音频对录音音频进行过滤,得到该录音音频中所包含的候选语音音频(也可以称为携带噪声的人声信号,如上述图2所对应实施例中的去音乐音频20k),该候选语音音频可以包括录音音频中语音音频分量和环境噪声分量;换言之,候选语音音频可以理解为过滤了音频播放设备所输出的音频后的录音音频,即将录音音频中所包含的音频播放设备所输出的音频进行消除处理后可以得到上述候选语音音频。
在一种可能的实现方式中,计算机设备可以对录音音频进行频域变换,得到录音音频对应的第一频谱特征;对原型音频进行频域变换,得到原型音频对应的第二频谱特征。其中,本申请中的频域变换方法可以包括但不限于:傅里叶变换(Fourier Transformation,FT)、拉普拉斯变换(Laplace Transform)、z变换(Z-transformation)、以及上述三种频域变换方法的变形或改进方法,如快速傅里叶变换(Fast Fourier Transformation,FFT)、离散傅里叶变换(Discrete Fourier Transform,DFT)等;本申请对所采用的频域变换方法不做限定。上述第一频谱特征可以为对录音音频进行频域变换之后所得到的功率谱数据,也可以 为对其功率谱数据进行归一化处理后所得到的结果;上述第二频谱特征与上述第一频谱特征的获取过程是相同的,如第一频谱特征为录音音频对应的功率谱数据时,第二频谱特征为原型音频对应的功率谱数据;第一频谱特征为归一化处理后的功率谱数据时,第二频谱特征为归一化处理后的功率谱数据,第一频谱特征和第二频谱特征所采用的归一化处理方法是相同的。其中,上述归一化处理方法可以包括但不限于:iLN(instant layer normalization)、LN(Layer Normalizaiton)、IN(Instance Normalization)、GN(Group Normalization)、SN(Switchable Normalization)等归一化处理;本申请对所采用的归一化处理方法不做限定。
计算机设备可以对第一频谱特征和第二频谱特征进行特征组合(concat),将组合后的频谱特征作为输入特征输入至第一深度网络模型(例如,上述如2所对应实施例中的第一深度网络模型20g),通过第一深度网络模型可以输出第一频点增益(例如,上述图2所对应实施例中的频点增益序列20h),进而根据第一频点增益和录音功率谱数据确定候选语音音频,例如可以将第一频点增益与录音音频对应的功率谱数据相乘后再经过时域变换可以得到上述候选语音音频;此处的时域变换与上述频域变换互为逆变换,如频率变换所采用的方法为傅里叶变换时,此处所采用的时域变换的方法为逆傅里叶变换。其中,候选语音音频的计算过程可以参见上述图2所对应实施例中针对去音乐音频20k的计算过程,此处不再进行赘述。
上述第一深度网络模型可以用于过滤录音音频中的音频播放设备所输出的音频,该第一深度神经网络可以包括但不限于:门循环单元(Gate Recurrent Unit,GRU)、长短期记忆网络(Long Short Term Memory,LSTM)、深度神经网络(Deep Neural Networks,DNN)、卷积神经网络(Convolutional Neural Network,CNN),以及上述任意一个网络模型的变形,或者两个以及两个网络模型的组合模型等,本申请对所采用的第一深度网络模型的网络结构不做限定。需要说明的是,对于下述涉及的第二深度网络模型同样可以包括但不限于上述网络模型,其中,第二深度网络模型用于对候选语音音频进行降噪处理,该第二深度网络模型与第一深度网络模型可以具有相同的网络结构,但是具有不同的模型参数(两个网络模型所具备的功能是不一样的);或者,第二深度网络模型与第一深度网络模型可以具有不同的网络结构,且具有不同的模型参数,后续不再对第二深度网络模型的类型进行赘述。
S104,将录音音频与候选语音音频之间的差值,确定为录音音频中所包含的背景基准音频分量。
计算机设备根据第一深度网络模型得到候选语音音频后,可以将录音音频减去上述候选语音音频,得到音频播放设备所输出的音频。本申请中,可以将音频设备所输出的音频称为录音音频中的背景基准音频分量(例如,上述图2所对应实施例中的纯音乐音频20p)。其中,候选语音音频包含录音音频中的环境噪声分量和语音音频分量,录音音频与候选语音音频相减后所得到的结果即为该录音音频中所包含的背景基准音频分量。
其中,录音音频与候选语音音频之间的差值可以为时域上的波形差,也可以为频域上的频谱差。当录音音频与候选语音音频为时域波形信号时,可以获取录音音频对应的第一 信号波形,以及候选语音音频对应的第二信号波形,第一信号波形与第二信号波形均可以在二维坐标系(横坐标可以表示为时间,纵坐标可以表示为信号强度,也可以称为信号幅度)中进行表示,进而可以将第一信号波形与第二信号波形相减,得到录音音频与候选语音音频在时域上的波形差。录音音频与候选语音音频在时域上相减时,第一信号波形和第二信号波形的横坐标保持不变,仅将横坐标值对应的纵坐标值相减,可以得到一个新的波形信号,这个新的波形信号可以认为是背景基准音频分量所对应的时域波形信号。
在一种可能的实现方式中,当录音音频与候选语音音频为频域信号时,可以将录音音频对应的录音功率谱数据与候选语音音频对应的语音功率谱数据相减,得到两者之间的频谱差值,该频谱差值可以认为是背景基准音频分量所对应的频域信号。例如,假设录音音频对应的录音功率谱数据为(5,8,10,9,7),候选语音音频对应的语音功率谱数据为(2,4,1,5,6),两者相减后所得到的频谱差值可以为(3,4,9,4,1),此时的频谱差值(3,4,9,4,1)可以称为背景基准音频分量所对应的频域信号。
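作为对上述相减过程的补充说明,下面给出一段极简的numpy示意代码,数组取值沿用本段举例,仅用于说明逐频点/逐时刻相减的计算方式,并非对实现方式的限定:

```python
import numpy as np

# 录音音频与候选语音音频对应的功率谱数据(沿用本段举例中的数值)
recording_power = np.array([5, 8, 10, 9, 7], dtype=float)
candidate_power = np.array([2, 4, 1, 5, 6], dtype=float)

# 逐频点相减,得到背景基准音频分量对应的频域信号
background_power = recording_power - candidate_power
print(background_power)  # [3. 4. 9. 4. 1.]

# 时域上同理:时间轴保持不变,仅将相同时刻的幅度值相减
recording_wave = np.array([0.2, 0.5, -0.1, 0.3])
candidate_wave = np.array([0.1, 0.2, -0.3, 0.1])
background_wave = recording_wave - candidate_wave
```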
S105,对候选语音音频进行环境噪声降噪处理,得到候选语音音频对应的降噪语音音频,将降噪语音音频与背景基准音频分量进行合并,得到降噪后的录音音频。
计算机设备可以对候选语音音频进行降噪处理,即对候选语音音频中的环境噪声进行抑制,得到候选语音音频对应的降噪语音音频(例如,上述图2所对应实施例中的人声去噪音频20n)。
其中,上述候选语音音频的降噪处理可以通过上述第二深度网络模型来实现。计算机设备可以对候选语音音频进行频域变换,得到候选语音音频对应的功率谱数据(可以称为语音功率谱数据),将语音功率谱数据输入至第二深度网络模型,通过第二深度网络模型可以输出第二频点增益,根据第二频点增益与所述语音功率谱数据,获取候选语音音频对应的加权语音频域信号,再对加权语音频域信号进行时域变换,得到候选语音音频对应的所述降噪语音音频,例如将第二频点增益与候选语音音频对应的语音功率谱数据相乘后再经过时域变换可以得到上述降噪语音音频。进而可以将降噪语音音频与上述背景基准音频分量进行叠加,得到降噪后的录音音频(例如,上述图2所对应实施例中的降噪后的音乐录音音频20q)。
需要说明的是,本申请实施例对S104与S105中“对候选语音音频进行环境噪声降噪处理,得到候选语音音频对应的降噪语音音频”的执行顺序不做限定。
在一种可能的实现方式中,计算机设备可以将降噪后的录音音频分享至社交平台,以使社交平台中的终端设备在访问降噪后的录音音频时,可以播放降噪后的录音音频。其中,上述社交平台是指可以用于分享并传播音视频数据的应用、网页等,如社交平台可以为音频类应用,或者视频类应用,或者为内容分享平台等。
举例来说,在音乐录音场景中,降噪后的录音音频可以为降噪后的音乐录音音频,计算机设备可以将降噪后的音乐录音音频分享至内容分享平台(此时的社交平台默认为内容分享平台),终端设备在访问内容分享平台中所分享的降噪后的音乐录音音频时,可以播放降噪后的音乐录音音频。请参见图4,图4是本申请实施例提供的一种音乐录音场景的示意图。如图4所示的服务器30a可以为内容分享平台的后台服务器,用户终端30b可以 为用户小A所使用的终端设备,用户小A为在内容分享平台中分享降噪后的音乐录音音频30e的用户;用户终端30c可以为用户小B所使用的终端设备,用户终端30d可以为用户小C所使用的终端设备。当服务器30a得到降噪后的音乐录音音频30e后,可以将降噪后的音乐录音音频30e分享至内容分享平台,此时用户终端30b中的内容分享平台中可以显示降噪后的音乐录音音频30e,以及降噪后的音乐录音音频30e对应的分享时间等信息。当用户小B所使用的用户终端30c访问内容分享平台时,可以在用户终端30c的内容分享平台中显示不同用户所分享的内容,该内容可以包括用户小A所分享的降噪后的音乐录音音频30e,点击降噪后的音乐录音音频30e后,可以在用户终端30c中播放降噪后的音乐录音音频30e。同理,当用户小C所使用的用户终端30d访问内容分享平台时,可以在用户终端30d的内容分享平台中显示用户小A所分享的降噪后的音乐录音音频30e,点击降噪后的音乐录音音频30e后,可以在用户终端30d中播放降噪后的音乐录音音频30e。
本申请实施例中,录音音频可以为包含语音音频分量、背景基准音频分量以及环境噪声分量的混合音频,在对录音音频进行降噪处理的过程中,可以在音频数据库中找到录音音频对应的原型音频,根据该原型音频可以从录音音频中筛选出候选语音音频,将上述录音音频减去候选语音音频可以得到背景基准音频分量;进而可以对候选语音音频进行降噪处理,得到降噪语音音频,将降噪语音音频与背景基准音频分量进行叠加后可以得到降噪后的录音音频。换言之,通过将录音音频的降噪处理问题转换为候选语音音频的降噪处理问题,可以避免将录音音频中的背景基准音频分量误与环境噪声进行混淆,进而可以提升录音音频的降噪效果。
请参见图5,图5是本申请实施例提供的一种音频数据处理方法的流程示意图。可以理解地,该音频数据处理方法可以由计算机设备执行,该计算机设备可以为用户终端,或者为服务器,或者为计算机设备中的一个计算机程序应用(包括程序代码),这里不做具体限定。如图5所示,该音频数据处理方法可以包括以下S201-S210:
S201,获取录音音频;录音音频包括背景基准音频分量、语音音频分量以及环境噪声分量。
其中,S201的具体实现方式可以参见上述图3所对应实施例中的S101,此处不再进行赘述。
S202,将录音音频划分为M个录音数据帧,对M个录音数据帧中的第i个录音数据帧进行频域变换,得到第i个录音数据帧对应的功率谱数据;i和M均为正整数,且i小于或等于M。
计算机设备可以对录音音频进行分帧处理,将该录音音频划分为M个录音数据帧,对M个录音数据帧中的第i个录音数据帧进行频域变换,如对第i个录音数据帧进行傅里叶变换,可以得到第i个录音数据帧对应的功率谱数据;其中,M可以为大于1的正整数,如M可以取值为2,3,……,i可以为小于或等于M的正整数。其中,计算机设备可以通过滑动窗实现对录音音频的分帧处理,进而可以得到M个录音数据帧,为了保持相邻的录音数据帧之间的连续性,通常可以采用交叠分段的方式对录音音频进行分帧处理,录音数据帧的大小可以与滑动窗的大小相关联。
对于M个录音数据帧中的每个录音数据帧,均可以独立进行频域变换(如傅里叶变换),可以得到每个录音数据帧分别对应的功率谱数据,该功率谱数据可以包括各个频点分别对应的能量值(此处的能量值也可以称为功率谱数据的幅值),功率谱数据中的一个能量值对应于一个频点,一个频点可以理解为频域变换时的一个频率采样点。
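为便于理解上述分帧与逐帧频域变换的过程,下面给出一段示意性的Python(numpy)代码草图;其中帧长、帧移与加窗方式均为便于说明而假设的取值,并非本申请限定的参数:

```python
import numpy as np

def frame_power_spectra(signal, frame_len=1024, hop=512):
    """将一维时域录音信号交叠分帧,并逐帧计算功率谱(每个频点对应一个能量值)。"""
    window = np.hanning(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame)          # 频域变换(此处以快速傅里叶变换为例)
        spectra.append(np.abs(spectrum) ** 2)  # 功率谱:各频点的能量值
    return np.array(spectra)                   # 形状为 [录音数据帧数M, 频点数]
```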
S203,将第i个录音数据帧对应的功率谱数据划分为N个频谱带,根据N个频谱带中的峰值信号,构建第i个录音数据帧对应的子指纹信息;N为正整数。
计算机设备可以根据每个录音数据帧分别对应的功率谱数据,构造每个录音数据帧分别对应的子指纹信息;其中,构造子指纹信息的关键在于从每个录音数据帧所对应的功率谱数据中选出区分度最大的能量值,下面以第i个录音数据帧为例,对子指纹信息的构造过程进行描述。计算机设备可以将第i个录音数据帧对应的功率谱数据划分为N个频谱带,选取每个频谱带中的峰值信号(即每个频谱带中的极大值,也可以理解为每个频谱带中的最大能量值)作为该频谱带的签名,以此来构造第i个录音数据帧对应的子指纹信息,其中N可以为正整数,如N可以取值1,2,……。换言之,第i个录音数据帧对应的子指纹信息可以包括N个频谱带分别对应的峰值信号。
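针对“划分频谱带并取各带峰值构造子指纹信息”这一步,下面给出一个思路性的实现草图;频谱带数量及划分方式均为假设,仅作说明:

```python
import numpy as np

def sub_fingerprint(frame_power_spectrum, num_bands=8):
    """将单帧功率谱划分为 num_bands 个频谱带,取每个频谱带内峰值信号所在的频点作为签名。"""
    bands = np.array_split(frame_power_spectrum, num_bands)
    peaks, offset = [], 0
    for band in bands:
        peaks.append(offset + int(np.argmax(band)))  # 该频谱带中的峰值(极大值)位置
        offset += len(band)
    return tuple(peaks)  # 一个录音数据帧对应一个子指纹信息
```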
S204,按照M个录音数据帧在录音音频中的时间顺序,对M个录音数据帧分别对应的子指纹信息进行组合,得到录音音频对应的待匹配音频指纹。
计算机设备可以按照上述S203中的描述,获取M个录音数据帧分别对应的子指纹信息,进而可以按照M个录音数据帧在录音音频中的时间顺序,依次对M个录音数据帧分别对应的子指纹信息进行组合,可以得到录音音频对应的待匹配音频指纹。通过选取峰值信号构建待匹配音频指纹,可以尽可能地确保该待匹配音频指纹在各种噪声和失真环境下保持不变。
S205,获取音频数据库对应的音频指纹库,根据待匹配音频指纹在音频指纹库中进行指纹检索,根据指纹检索结果在音频数据库中确定原型音频。
计算机设备可以获取音频数据库,并获取音频数据库对应的音频指纹库,音频数据库中的每个音频数据都可以按照上述S201-S204中的描述,得到音频数据库中的每个音频数据分别对应的音频指纹,每个音频数据所对应的音频指纹可以构成音频数据库对应的音频指纹库。其中,音频指纹库是预先构建的,计算机设备在获取了录音音频对应的待匹配音频指纹后,可以直接获取音频指纹库,基于待匹配音频指纹在音频指纹库中进行指纹检索,可以得到与待匹配音频指纹相匹配的音频指纹,该相匹配的音频指纹可以作为该待匹配音频指纹对应的指纹检索结果,进而可以将指纹检索结果所对应的音频数据确定为与录音音频相匹配的原型音频。
计算机设备可以将音频指纹作为音频检索哈希表的键值(key)进行保存。每个音频数据所包含的单个音频数据帧可以对应于一个子指纹信息,一个子指纹信息可以对应于音频检索哈希的一个键值;每个音频数据所包含的所有音频数据帧所对应的子指纹信息可以组成该音频数据对应的音频指纹。为方便查找,每个子指纹信息可以作为哈希表的键值,每个键值可以指向该子指纹信息在所属的音频数据中出现的时间,还可以指向该子指纹信息所属的音频数据的标识;如某个子指纹信息转换为哈希值后,该哈希值可以作为音频检索 哈希表中的键值进行保存,该键值指向该子指纹信息在所属的音频数据中出现的时间为02:30,指向的音频数据的标识为:音频数据1。可以理解地,上述音频指纹库可以包括音频数据库中的每个音频数据所对应的一个或多个哈希值。
当录音音频划分为M个音频数据帧时,该录音音频所对应的待匹配音频指纹可以包括M个子指纹信息,一个子指纹信息对应一个音频数据帧。计算机设备可以将待匹配音频指纹中所包含的M个子指纹信息映射为M个待匹配哈希值,并获取M个待匹配哈希值分别对应的录音时间,一个待匹配哈希值所对应的录音时间用于表征该待匹配哈希值对应的子指纹信息在录音音频中出现的时间;若M个待匹配哈希值中的第p个待匹配哈希值与音频指纹库所包含的第一哈希值相匹配,则获取第p个待匹配哈希值对应的录音时间与第一哈希值对应的时间信息之间的第一时间差,其中p为小于或等于M的正整数;若M个待匹配哈希值中的第q个待匹配哈希值与音频指纹库所包含的第二哈希值相匹配,则获取第q个待匹配哈希值对应的录音时间与第二哈希值对应的时间信息之间的第二时间差;q为小于或等于M的正整数;当第一时间差和第二时间差满足数值阈值,且第一哈希值和第二哈希值属于相同的音频指纹时,可以将第一哈希值所属的音频指纹确定为指纹检索结果,将指纹检索结果所对应的音频数据确定为录音音频对应的原型音频。更多的,计算机设备可以对上述M个待匹配哈希值与音频指纹库中的哈希值进行匹配,每一个匹配成功的待匹配哈希值均可以计算得到一个时间差,在M个待匹配哈希值都完成匹配后,可以统计相同时间差的最大值,此时的最大值可以设置为上述数值阈值,将最大值所对应的音频数据确定为录音音频对应的原型音频。
举例来说,M个待匹配哈希值包括哈希值1、哈希值2、哈希值3、哈希值4、哈希值5以及哈希值6,音频指纹库中的哈希值A与哈希值1相匹配,且哈希值A指向音频数据1,哈希值A与哈希值1之间的时间差为t1;音频指纹库中的哈希值B与哈希值2相匹配,且哈希值B指向音频数据1,哈希值B与哈希值2之间的时间差为t2;音频指纹库中的哈希值C与哈希值3相匹配,且哈希值C指向音频数据1,哈希值C与哈希值3之间的时间差为t3;音频指纹库中的哈希值D与哈希值4相匹配,且哈希值D指向音频数据1,哈希值D与哈希值4之间的时间差为t4;音频指纹库中的哈希值E与哈希值5相匹配,且哈希值E指向音频数据2,哈希值E与哈希值5之间的时间差为t5;音频指纹库中的哈希值F与哈希值6相匹配,且哈希值6指向音频数据2,哈希值F与哈希值6之间的时间差为t6。若上述时间差t1、时间差t2、时间差t3以及时间差t4为相同的时间差,时间差t5和时间差t6为相同的时间差,则可以将音频数据1作为录音音频对应的原型音频。
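下面用一段简化的Python草图示意上述基于哈希键值与时间差统计的指纹检索过程;其中哈希表的组织方式与数据结构均为便于说明而假设,与实际工程实现可能不同:

```python
from collections import Counter

def match_fingerprint(query_hashes, fingerprint_index):
    """query_hashes: [(待匹配哈希值, 录音时间)] 列表;
    fingerprint_index: {哈希值: [(音频数据标识, 在该音频中出现的时间)]}。
    对每个匹配成功的哈希值统计(音频标识, 时间差),出现次数最多者对应的音频即视为原型音频。"""
    votes = Counter()
    for h, t_query in query_hashes:
        for audio_id, t_ref in fingerprint_index.get(h, []):
            votes[(audio_id, round(t_ref - t_query, 2))] += 1
    if not votes:
        return None
    (audio_id, _), _count = votes.most_common(1)[0]
    return audio_id
```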
S206,获取录音音频对应的录音功率谱数据,对录音功率谱数据进行归一化处理,得到第一频谱特征;获取原型音频对应的原型功率谱数据,对原型功率谱数据进行归一化处理,得到第二频谱特征,将第一频谱特征和第二频谱特征组合为输入特征。
计算机设备可以获取录音音频对应的录音功率谱数据,该录音功率谱数据可以由上述M个音频数据帧分别对应的功率谱数据组成,录音功率谱数据可以包括录音音频中的各个频点分别对应的能量值;对录音功率谱数据进行归一化处理,得到第一频谱特征;其中,若此处的归一化处理为iLN,则可以对录音功率谱数据中各个频点所对应的能量值进行独 立归一化;当然,本申请还可以采用其余归一化处理,如BN等。可选的,本申请实施例还可以无需对录音功率谱数据进行归一化处理,直接将录音功率谱数据作为第一频谱特征。同理,对于原型音频,可以执行如上述录音音频相同的频域变换(得到原型功率谱数据)、归一化处理操作,得到原型音频对应的第二频谱特征;进而可以通过concat(连接)将第一频谱特征和第二频谱特征组合为输入特征。
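以下示意代码演示对录音功率谱与原型功率谱分别做逐帧归一化后再进行特征组合(concat)的过程;这里以简单的逐帧均值-方差归一化近似代替iLN,仅作思路示意,并非限定的归一化方式:

```python
import numpy as np

def per_frame_normalize(power_spectra, eps=1e-8):
    """对每帧功率谱独立做均值-方差归一化,作为iLN思路的一种简化近似。"""
    mean = power_spectra.mean(axis=-1, keepdims=True)
    std = power_spectra.std(axis=-1, keepdims=True)
    return (power_spectra - mean) / (std + eps)

def build_input_feature(recording_power, prototype_power):
    """分别归一化录音功率谱与原型功率谱(形状均为[帧数, 频点数]),再按特征维拼接为输入特征。"""
    first_feature = per_frame_normalize(recording_power)    # 第一频谱特征
    second_feature = per_frame_normalize(prototype_power)   # 第二频谱特征
    return np.concatenate([first_feature, second_feature], axis=-1)
```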
S207,将输入特征输入至第一深度网络模型,通过第一深度网络模型输出针对录音音频的第一频点增益。
计算机设备可以将输入特征输入至第一深度网络模型,通过第一深度网络模型可以输出针对录音音频的第一频点增益,此处的第一频点增益可以包括录音音频中的各个频点分别对应的语音增益。
其中,当第一深度网络模型包括GRU(可以作为特征提取网络层)、全连接网络(可以作为全连接网络层)以及Sigmoid函数(可以称为激活层,在本申请中可以作为输出层)时,输入特征首先输入至第一深度网络模型中的特征提取网络层,根据特征提取网络层,可以获取输入特征对应的时序分布特征,该时序分布特征可以用于表征录音音频中的上下文语义;根据第一深度网络模型中的全连接网络层,获取时序分布特征对应的时序特征向量,进而根据时序特征向量,通过第一深度网络模型中的激活层,输出第一频点增益,如可以由Sigmoid函数(作为激活层)输出录音音频中所包含的各个频点分别对应的语音增益(即第一频点增益)。
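为直观说明“特征提取网络层(GRU)+全连接网络层+激活层(Sigmoid)”这一结构,下面给出一个基于PyTorch的示意模型草图;层数、隐藏单元数、频点数等超参数均为假设值,并非本申请限定的网络结构:

```python
import torch
import torch.nn as nn

class GainEstimator(nn.Module):
    """输入组合后的频谱特征,输出各频点的增益(取值在0到1之间)。"""
    def __init__(self, input_dim, hidden_dim=256, num_bins=513):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, num_layers=2, batch_first=True)  # 特征提取网络层
        self.fc = nn.Linear(hidden_dim, num_bins)                                  # 全连接网络层

    def forward(self, features):                 # features: [batch, 帧数, input_dim]
        hidden, _ = self.gru(features)           # 时序分布特征
        gains = torch.sigmoid(self.fc(hidden))   # 激活层输出第一频点增益
        return gains                             # [batch, 帧数, num_bins]
```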
S208,根据第一频点增益和录音功率谱数据,获取录音音频中所包含的候选语音音频;将录音音频与候选语音音频之间的差值,确定为录音音频中所包含的背景基准音频分量;候选语音音频包括语音音频分量和环境噪声分量。
假设录音音频包括T个频点(T为大于1的正整数),那么第一频点增益可以包括T个频点分别对应的语音增益,录音功率谱数据包括T个频点分别对应的能量值,T个语音增益与T个能量值一一对应。计算机设备可以根据第一频点增益中的T个频点分别对应的语音增益,对录音功率谱数据中属于相同频点的能量值进行加权,得到T个频点分别对应的加权能量值;进而可以根据T个频点分别对应的加权能量值,确定录音音频对应的加权录音频域信号;通过对加权录音频域信号进行时域变换(与前述频域变换互为逆变换),得到录音音频中所包含的候选语音音频。例如,当第一深度网络模型输出的第一频点增益为(2,3),录音功率谱数据为(1,2)时,表示录音音频可以包括两个频点(此处T取值为2),第一个频点在第一频点增益中的语音增益为2,在录音功率谱数据中的能量值为1,第二个频点在第一频点增益中的语音增益为3,在录音功率谱数据中的能量值为2;可以计算得到加权录音频域信号为(2,6),通过对加权录音频域信号进行时域变换,可以得到录音音频中所包含的候选语音音频。进一步地,可以将录音音频与候选语音音频之间的差值,确定为背景基准音频分量,即音频播放设备所输出的音频。
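下面的numpy示意代码对应“用第一频点增益对录音功率谱逐频点加权,再经时域变换得到候选语音音频”的单帧计算;示例中直接沿用原帧相位、对复数频谱按增益加权,属于一种简化写法,仅作说明:

```python
import numpy as np

def apply_bin_gains(frame, gains):
    """对单帧录音信号应用频点增益:频域逐点加权后再做时域变换(逆FFT)。
    这里沿用原帧的相位,仅按增益对各频点加权。"""
    spectrum = np.fft.rfft(frame)                  # 复数频谱
    weighted = gains * spectrum                    # 加权录音频域信号(gains长度需等于频点数)
    return np.fft.irfft(weighted, n=len(frame))    # 变换回时域,得到该帧的候选语音音频
```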
请参见图6,图6是本申请实施例提供的一种第一深度网络模型的结构示意图;以音乐录音场景为例,对第一深度网络模型的网络结构进行说明。如图6所示,计算机设备从音频数据库中检索到音乐录音音频40a(即录音音频)对应的音乐原型音频40b(即原型音 频)后,可以分别对音乐录音音频40a和音乐原型音频40b进行快速傅里叶变换(FFT),得到音乐录音音频40a对应的功率谱数据40c(即录音功率谱数据)和相位,以及音乐原型音频40b对应的功率谱数据40d(即原型功率谱数据),上述快速傅里叶变换仅仅只是本实施例中的一种举例,本申请还可以使用其余频域变换方法,如离散傅里叶变换等。对功率谱数据40c和功率谱数据40d中的各帧功率谱进行iLN归一化处理后通过concat进行特征组合,将组合得到的输入特征作为第一深度网络模型40e的输入数据,该第一深度网络模型40e可以由门循环单元1、门循环单元2、全连接网络1组成,最后通过Sigmoid函数输出第一频点增益;第一频点增益所包含的各个频点的语音增益与功率谱数据40c中对应频点的能量值(也可以称为频点功率谱)相乘后,再经过逆快速傅里叶变换(iFFT)可以得到去音乐音频40f(即上述候选语音音频);其中,逆快速傅里叶变换可以为时域变换方法,即从频域转换到时域。可以理解的是,如图6所示的第一深度网络模型40e的网络结构仅为一种举例,本申请实施例所使用的第一深度网络模型还可以在上述第一深度网络模型40e的基础上增加门循环单元或全连接网络结构,本申请对此不做限定。
S209,获取候选语音音频对应的语音功率谱数据,将语音功率谱数据输入至第二深度网络模型,通过第二深度网络模型输出针对候选语音音频的第二频点增益。
计算机设备在获取到候选语音音频后,可以对候选语音音频进行频域变换,得到候选语音音频对应的语音功率谱数据,将语音功率谱数据输入至第二深度网络模型,通过第二深度网络模型中的特征提取网络层(可以为GRU)、全连接网络层(可以为全连接网络)、激活层(Sigmoid函数),可以输出针对候选语音音频的第二频点增益,第二频点增益可以包括候选语音音频中的各个频点分别对应的降噪增益,可以为Sigmoid函数的输出值。
S210,根据第二频点增益与语音功率谱数据,获取候选语音音频对应的加权语音频域信号;对加权语音频域信号进行时域变换,得到候选语音音频对应的降噪语音音频,将降噪语音音频与背景基准音频分量进行合并,得到降噪后的录音音频。
假设候选语音音频包括D个频点(D为大于1的正整数,此处的D可以等于上述T,也可以不等于上述T,两者可以根据实际需求进行取值,本申请对D和T的取值不做限定),那么第二频点增益可以包括D个频点分别对应的降噪增益,语音功率谱数据包括D个频点分别对应的能量值,D个降噪增益与D个能量值一一对应。计算机设备可以根据第二频点增益中的D个频点分别对应的降噪增益,对语音功率谱数据中属于相同频点的能量值进行加权,得到D个频点分别对应的加权能量值;进而可以根据D个频点分别对应的加权能量值,确定候选语音音频对应的加权语音频域信号;通过对加权语音频域信号进行时域变换(与前述频域变换互为逆变换),得到候选语音音频对应的降噪语音音频。例如,当第二深度网络模型输出的第二频点增益为(0.1,0.5),语音功率谱数据为(5,8)时,表示候选语音音频可以包括两个频点(此处D取值为2),第一个频点在第二频点增益中的降噪增益为0.1,在语音功率谱数据中的能量值为5,第二个频点在第二频点增益中的降噪增益为0.5,在语音功率谱数据中的能量值为8;可以计算得到加权语音频域信号为(0.5,4),通过对加权语音频域信号进行时域变换,可以得到候选语音音频对应的降噪语音音频。进一步地,可以将降噪语音音频与背景基准音频分量进行叠加,可以得到降噪后的录音音频。
请参见图7,图7是本申请实施例提供的一种第二深度网络模型的结构示意图。如图7所示,如前述图6所对应实施例,计算机设备通过第一深度网络模型40e得到去音乐音频40f后,可以对去音乐音频40f进行快速傅里叶变换(FFT),得到去音乐音频40f对应的功率谱数据40g(即上述语音功率谱数据)和相位。将功率谱数据40g作为第二深度网络模型40h的输入数据,该第二深度网络模型40h可以由全连接网络2、门循环单元3、门循环单元4、全连接网络3组成,最后通过Sigmoid函数可以输出第二频点增益;第二频点增益所包含的各个频点的降噪增益与功率谱数据40g中对应频点的能量值相乘后,再经过逆快速傅里叶变换(iFFT)可以得到人声去噪音频40i(即上述降噪语音音频)。可以理解的是,如图7所示的第二深度网络模型40h的网络结构仅为一种举例,本申请实施例所使用的第二深度网络模型还可以在上述第二深度网络模型40h的基础上增加门循环单元或全连接网络结构,本申请对此不做限定。
请参见图8,图8是本申请实施例提供的一种录音音频降噪处理的流程示意图。如图8所示,本实施例以音乐录音场景为例,计算机设备在获取到音乐录音音频50a后,可以获取该音乐录音音频50a对应的音频指纹50b,基于该音频指纹50b,在曲库50c(即上述音频数据库)所对应的音频指纹库50d中进行音频指纹检索,当曲库50c中的某个音频数据所对应的音频指纹与音频指纹50b相匹配时,可以将区块50c中的该音频数据确定为音乐录音音频50a对应的音乐原型音频50e;其中,音频指纹50b的提取过程以及音频指纹50b的音频指纹检索过程可以参见前述S202-S205中的描述,在此不再进行赘述。
在一种可能的实现方式中,可以对音乐录音音频50a和音乐原型音频50e分别进行频谱特征提取,将获取到的频谱特征进行特征组合后输入至第一阶深度网络50h(即前述第一深度网络模型),通过第一阶深度网络50h可以得到去音乐音频50i(去音乐音频50i的获取过程可以参见上述图6所对应的实施例,此处不再进行赘述);其中,频谱特征提取过程可以包括傅里叶变换等频域变换和iLN等归一化处理。进而可以将音乐录音音频50a与去音乐音频50i相减,可以得到纯音乐音频50j(即上述背景基准音频分量)。
对去音乐音频50i进行快速傅里叶变换后可以得到其对应的功率谱数据,将该功率谱数据作为第二阶深度网络50k(即上述第二深度网络模型)的输入,通过第二阶深度网络50k可以得到人声去噪音频50m(人声去噪音频50m的获取过程可以参见上述图7所对应的实施例,此处不再进行赘述);进而可以将纯音乐音频50j与人声去噪音频50m进行叠加,可以得到最终降噪后的音乐录音音频50n(即降噪后的录音音频)。
本申请实施例中,录音音频可以为包含语音音频分量、背景基准音频分量以及环境噪声分量的背景基准音频分量的混合音频,在对录音音频进行降噪处理的过程中,可以通过音频指纹检索找到录音音频对应的原型音频,根据该原型音频可以从录音音频中筛选出候选语音音频,将上述录音音频减去候选语音音频可以得到背景基准音频分量;进而可以对候选语音音频进行降噪处理,得到降噪语音音频,将降噪语音音频与背景基准音频分量进行叠加后可以得到降噪后的录音音频。换言之,通过将录音音频的降噪处理问题转换为候选语音音频的降噪处理问题,可以避免将录音音频中的背景基准音频分量误与环境噪声进 行混淆,进而可以提升录音音频的降噪效果;通过音频指纹检索技术进行检索得到原型音频,可以提高检索准确性和检索效率。
在录音场景中使用前述第一深度网络模型和第二深度网络模型之前,还需要对其进行训练,下面将通过附图9和附图10对第一深度网络模型和第二深度网络模型的训练过程进行描述。
请参见图9,图9是本申请实施例提供的一种音频数据处理方法的流程示意图。可以理解地,该音频数据处理方法可以由计算机设备执行,该计算机设备可以为用户终端,或者为服务器,或者为计算机设备中的一个计算机程序应用(包括程序代码),这里不做具体限定。如图9所示,该音频数据处理方法可以包括以下S301-S305:
S301,获取语音样本音频、噪声样本音频以及标准样本音频,根据语音样本音频、噪声样本音频以及标准样本音频,生成样本录音音频。
计算机设备可以预先获取大量的语音样本音频、大量的噪声样本音频以及大量的标准样本音频。其中,语音样本音频可以为仅包含人声的音频序列;例如,该语音样本音频可以是预先录制好的各种用户的歌声序列,或者为各种用户的配音序列等。噪声样本音频可以为仅包含噪声的音频序列,该噪声样本音频可以是预先录制好的不同场景的噪声;例如,噪声样本音频可以是车辆鸣笛的声音、敲击键盘的声音、敲击各种金属的声音等各种类型的噪声。标准样本音频可以为音频数据库中所存储的纯净音频;例如,该标准样本音频可以为音乐序列,或者视频配音序列等。换言之,语音样本音频和噪声样本音频可以是通过录采集,标准样本音频可以为各种平台中所存储的纯净音频,其中计算机设备在获取平台中的标准样本音频时需要获得该平台的授权许可。举例来说,在音乐录音场景中,语音样本音频可以为人声序列,噪声样本音频可以为不同场景的噪声序列,标准样本音频可以为音乐序列。
计算机设备可以对语音样本音频、噪声样本音频以及标准样本音频进行叠加,得到样本录音音频。为了构建更多的样本录音音频,不仅可以对不同的语音样本音频、噪声样本音频以及标准样本音频进行随机组合,还可以使用不同的系数对同一组语音样本音频、噪声样本音频以及标准样本音频进行加权,可以得到不同的样本录音音频。例如,计算机设备可以获取针对第一初始网络模型的加权系数集合,该加权系数集合可以为一组随机生成的浮点数,根据该加权系数集合可以构建K个数组,每个数组都可以包括三个具有排列顺序的数值,具有不同排列顺序的三个数值可以构成不同的数组,一个数组中所包含的三个数值分别为语音样本音频、噪声样本音频以及标准样本音频的系数;根据K个数组中的第j个数组所包含的系数,分别对语音样本音频、噪声样本音频以及标准样本音频进行加权,可以得到第j个数组对应的样本录音音频。换言之,对于任意一个语音样本音频、一个噪声样本音频以及一个标准样本音频,可以构建K个不同的样本录音音频。
举例来说,假设K个数组包括以下4个数组(此时的K取值为4),该4个数组分别为[0.1,0.5,0.3],[0.5,0.6,0.8],[0.6,0.1,0.4],[1,0.7,0.3],对于语音样本音频a、噪声样本音频b以及标准样本音频c,可以构建如下样本录音音频:样本录音音频 y1=0.1a+0.5b+0.3c,样本录音音频y2=0.5a+0.6b+0.8c,样本录音音频y3=0.6a+0.1b+0.4c,样本录音音频y4=a+0.7b+0.3c。
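下面给出按随机系数数组叠加语音样本、噪声样本与标准样本以构造样本录音音频的示意代码;系数取值范围与数组个数K均为假设,仅用于说明样本扩充的方式:

```python
import numpy as np

def build_sample_recordings(speech, noise, standard, k=4, seed=0):
    """speech/noise/standard为等长的一维时域样本序列,返回K个按不同系数叠加得到的样本录音音频。"""
    rng = np.random.default_rng(seed)
    coeff_groups = rng.uniform(0.1, 1.0, size=(k, 3))     # K个数组,每个数组含三个系数
    samples = [r1 * speech + r2 * noise + r3 * standard    # y = r1·x1 + r2·x2 + r3·x3
               for r1, r2, r3 in coeff_groups]
    return samples, coeff_groups
```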
S302,通过第一初始网络模型获取样本录音音频中的样本预测语音音频;第一初始网络模型用于过滤样本录音音频所包含的标准样本音频,第一初始网络模型的期望预测语音音频由语音样本音频和噪声样本音频所确定。
对于用来训练两个初始网络模型(包括第一初始网络模型和第二初始网络模型)的所有样本录音音频,每个样本录音音频在两个初始网络模型中的处理过程是相同。在训练阶段,可以将样本录音音频分批次输入第一初始网络模型,即对所有样本录音音频进行分批次训练;为方便描述,下面以所有样本录音音频中的任一个样本录音音频为例,对上述两个初始网络模型的训练过程进行描述。
请参见图10,图10是本申请实施例提供的一种深度网络模型的训练示意图。如图10所示,样本录音音频y可以由样本数据库60a中的语音样本音频x1、噪声样本序列x2以及标准样本音频所确定,如样本录音音频y=r1×x1+r2×x2+r3×x3。计算机设备可以对该样本录音音频y进行频域变换,得到该样本录音音频y对应的样本功率谱数据,并对该样本功率谱数据进行归一化处理(例如,iLN归一化),得到该样本录音音频y对应的样本频谱特征;将该样本频谱特征输入至第一初始网络模型60b,通过第一初始网络模型60b可以输出样本频谱特征对应的第一样本频点增益,该第一样本频点增益可以包括样本录音音频所对应的各个频点的语音增益,此处的第一样本频点增益即为第一初始网络模型60b针对上述样本录音音频y的实际输出结果。其中,第一初始网络模型60b可以是指处于训练阶段的第一深度网络模型,训练第一初始网络模型60b是为了过滤样本录音音频所包含的标准样本音频。
计算机设备可以根据第一样本频点增益和样本功率谱数据,得到样本预测语音音频60c,该样本预测语音音频60c的计算过程与前述候选语音音频的计算过程类似,此处不再赘述。其中,第一初始网络模型60b对应的期望预测语音音频可以由语音样本音频x1和噪声样本音频x2所确定,该期望预测语音音频可以为上述样本录音音频y中的信号(r1×x1+r2×x2);也就是说,第一初始网络模型60b的期望输出结果可以为信号(r1×x1+r2×x2)的功率谱数据中的各频点能量值(或者称为各频点功率谱值)除以样本功率谱数据中对应的频点能量值后的开平方处理结果。
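按照本段的文字描述,第一初始网络模型的期望输出结果(理想频点增益)可以整理为如下示意表达式,其中P_s(k)表示信号s的功率谱在第k个频点上的能量值;该式仅是对上文的归纳表示,记法为便于说明而引入:

```latex
g_1^{*}(k) = \sqrt{\frac{P_{r_1 x_1 + r_2 x_2}(k)}{P_{y}(k)}}
```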
S303,通过第二初始网络模型获取样本预测语音音频对应的样本预测降噪音频;第二初始网络模型用于抑制样本预测语音音频所包含的噪声样本音频,第二初始网络模型的期望预测降噪音频由语音样本音频所确定。
如图10所示,计算机设备可以将样本预测语音音频60c对应的功率谱数据输入至第二初始网络模型60f,通过第二初始网络模型60f可以输出样本预测语音音频60c对应的第二样本频点增益,该第二样本频点增益可以包括样本预测语音音频60c所对应的各个频点的降噪增益,此处的第二样本频点增益即为第二初始网络模型60f针对上述样本预测语音音频60c的实际输出结果。其中,第二初始网络模型60f可以是指处于训练阶段的第二深度网络模型,训练第二初始网络模型60f是为了对样本预测语音音频中所包含的环境噪声进 行抑制。需要说明的是,第二初始网络模型60f的训练样本需要与第一初始网络模型60b的部分样本对齐,如第二初始网络模型60f的训练样本可以为基于第一初始网络模型60b所确定的样本预测语音音频60c。
计算机设备可以根据第二样本频点增益和样本预测语音音频60c的功率谱数据,得到样本预测降噪音频60g,该样本预测降噪音频60g的计算过程与前述降噪语音音频的计算过程类似,此处不再赘述。其中,第二初始网络模型60f对应的期望预测降噪音频可以由语音样本音频x1所确定,该期望预测降噪音频可以为上述样本录音音频y中的信号(r1×x1);也就是说,第二初始网络模型60f的期望输出结果可以为信号(r1×x1)的功率谱数据中的各频点能量值(或者称为各频点功率谱值),除以样本预测语音音频60c的功率谱数据中对应的频点能量值后的开平方处理结果。
S304,基于样本预测语音音频和期望预测语音音频,对第一初始网络模型的网络参数进行调整,得到第一深度网络模型;第一深度网络模型用于对录音音频进行过滤后得到候选语音音频,录音音频包括背景基准音频分量、语音音频分量以及环境噪声分量,候选语音音频包括语音音频分量和环境噪声分量。
如图10所示,根据第一初始网络模型60b对应的样本预测语音音频60c与期望预测语音音频(r1×x1+r2×x2)之间的差值,确定针对第一初始网络模型60b对应的第一损失函数60d,通过优化第一损失函数60d至最小值,即使得训练损失最小,对第一初始网络模型60b的网络参数进行调整,直至训练迭代次数达到预先设置的最大迭代次数(或第一初始网络模型60b的训练达到收敛),此时的第一初始网络模型60b可以作为第一深度网络模型60e,训练完成的第一深度网络模型60e可以用于对录音音频进行过滤后得到候选语音音频,第一深度网络模型60e的使用过程可以参见上述S207中的描述。可选的,上述第一损失函数60d还可以为第一初始网络模型60b的期望输出结果与第一频点增益(实际输出结果)之间的平方项。
S305,基于样本预测降噪音频和期望预测降噪音频,对第二初始网络模型的网络参数进行调整,得到第二深度网络模型;第二深度网络模型用于对候选语音音频进行降噪处理后得到降噪语音音频。
如图10所示,根据第二初始网络模型60f对应的样本预测降噪音频60g与期望预测语音音频(r1×x1)之间的差值,确定针对第二初始网络模型60f对应的第二损失函数60h,通过优化第二损失函数60h至最小值,即使得训练损失最小,对第二初始网络模型60f的网络参数进行调整,直至训练迭代次数达到预先设置的最大迭代次数(或第二初始网络模型60f的训练达到收敛),此时的第二初始网络模型可以作为第二深度网络模型60i,训练完成的第二深度网络模型60i可以用于对候选语音音频进行降噪处理后得到降噪语音音频,第二深度网络模型60i的使用过程可以参见上述S209中的描述。在一种可能的实现方式中,上述第二损失函数60h还可以为第二初始网络模型60f的期望输出结果与第二频点增益(实际输出结果)之间的平方项。
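结合上文对两级模型训练目标的描述,下面给出以“实际输出增益与期望增益之间的平方误差”作为损失进行参数调整的简化训练片段(PyTorch);优化器与学习率等均为假设,仅示意训练流程:

```python
import torch

def train_step(model, optimizer, input_feature, target_gain):
    """单步参数调整:最小化模型实际输出增益与期望增益之间的均方误差。"""
    optimizer.zero_grad()
    predicted_gain = model(input_feature)                     # 实际输出结果(样本频点增益)
    loss = torch.mean((predicted_gain - target_gain) ** 2)    # 期望输出与实际输出之间的平方项
    loss.backward()
    optimizer.step()
    return loss.item()
```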
本申请实施例中,通过为语音样本音频、噪声样本音频以及标准样本音频加权不同的系数,可以扩展样本录音音频的数量,通过这些样本录音音频对第一初始网络模型和第二 初始网络模型进行训练,可以提高网络模型的泛化能力;通过将第二初始网络模型的训练样本与第一初始网络模型的部分训练样本(样本录音音频中所包含的部分信号)进行对齐,可以增强第一初始网络模型与第二初始网络模型之间的整体关联性,在使用训练完成的第一深度网络模型与第二深度网络模型进行降噪处理时,可以提高录音音频的降噪效果。
请参见图11,图11是本申请实施例提供的一种音频数据处理装置的结构示意图。如图11所示,该音频数据处理装置1可以包括:音频获取模块11,检索模块12,音频过滤模块13,音频确定模块14,降噪处理模块15;
音频获取模块11,用于获取录音音频;录音音频包括背景基准音频分量、语音音频分量以及环境噪声分量;
检索模块12,用于从音频数据库中确定与录音音频相匹配的原型音频;
音频过滤模块13,用于根据原型音频从录音音频中获取候选语音音频;候选语音音频包括语音音频分量和环境噪声分量;
音频确定模块14,用于将录音音频与候选语音音频之间的差值,确定为录音音频中所包含的背景基准音频分量;
降噪处理模块15,用于对候选语音音频进行环境噪声降噪处理,得到候选语音音频对应的降噪语音音频,将降噪语音音频与背景基准音频分量进行合并,得到降噪后的录音音频。
其中,音频获取模块11,指纹检索模块12,音频过滤模块13,音频确定模块14,降噪处理模块15的具体功能实现方式可以参见上述图3所对应实施例中的S101-S105,这里不再进行赘述。
在一个或多个实施例中,检索模块12,具体用于获取录音音频对应的待匹配音频指纹,根据待匹配音频指纹在音频数据库中获取与录音音频相匹配的原型音频。
在一个或多个实施例中,指纹检索模块12可以包括:频域变换单元121,频谱带划分单元122,音频指纹组合单元123,原型音频匹配单元124;
频域变换单元121,用于将录音音频划分为M个录音数据帧,对M个录音数据帧中的第i个录音数据帧进行频域变换,得到第i个录音数据帧对应的功率谱数据;i和M均为正整数,且i小于或等于M;
频谱带划分单元122,用于将第i个录音数据帧对应的功率谱数据划分为N个频谱带,根据N个频谱带中的峰值信号,构建第i个录音数据帧对应的子指纹信息;N为正整数;
音频指纹组合单元123,用于按照M个录音数据帧在录音音频中的时间顺序,对M个录音数据帧分别对应的子指纹信息进行组合,得到录音音频对应的待匹配音频指纹;
原型音频匹配单元124,用于获取音频数据库对应的音频指纹库,根据待匹配音频指纹在音频指纹库中进行指纹检索,根据指纹检索结果在音频数据库中确定与录音音频相匹配的原型音频。
其中,原型音频匹配单元124具体用于:
将待匹配音频指纹中所包含的M个子指纹信息映射为M个待匹配哈希值,获取M个待匹配哈希值分别对应的录音时间;一个待匹配哈希值所对应的录音时间用于表征该待匹配哈希值对应的子指纹信息在录音音频中出现的时间;
若M个待匹配哈希值中的第p个待匹配哈希值与音频指纹库所包含的第一哈希值相匹配,则获取第p个待匹配哈希值对应的录音时间与第一哈希值对应的时间信息之间的第一时间差;p为小于或等于M的正整数;
若M个待匹配哈希值中的第q个待匹配哈希值与音频指纹库所包含的第二哈希值相匹配,则获取第q个待匹配哈希值对应的录音时间与第二哈希值对应的时间信息之间的第二时间差;q为小于或等于M的正整数;
当第一时间差和第二时间差满足数值阈值,且第一哈希值和第二哈希值属于相同的音频指纹时,将第一哈希值所属的音频指纹确定为指纹检索结果,将指纹检索结果所对应的音频数据确定为录音音频对应的原型音频。
其中,频域变换单元121,频谱带划分单元122,音频指纹组合单元123,原型音频匹配单元124的具体功能实现方式可以参见上述图5所对应实施例中的S202-S205,这里不再进行赘述。
在一个或多个实施例中,音频过滤模块13可以包括:归一化处理单元131,第一频点增益输出单元132,语音音频获取单元133;
归一化处理单元131,用于获取录音音频对应的录音功率谱数据,对录音功率谱数据进行归一化处理,得到第一频谱特征;
上述归一化处理单元131,还用于获取原型音频对应的原型功率谱数据,对原型功率谱数据进行归一化处理,得到第二频谱特征,将第一频谱特征和第二频谱特征组合为输入特征;
第一频点增益输出单元132,用于将输入特征输入至第一深度网络模型,通过第一深度网络模型输出针对录音音频的第一频点增益;
语音音频获取单元133,用于根据第一频点增益和录音功率谱数据,获取录音音频中所包含的候选语音音频。
在一个或多个实施例中,第一频点增益输出单元132可以包括:特征提取子单元1321,激活子单元1322;
特征提取子单元1321,用于将输入特征输入至第一深度网络模型,根据第一深度网络模型中的特征提取网络层,获取输入特征对应的时序分布特征;
激活子单元1322,用于根据第一深度网络模型中的全连接网络层,获取时序分布特征对应的时序特征向量,根据时序特征向量,通过第一深度网络模型中的激活层,输出第一频点增益。
在一个或多个实施例中,第一频点增益包括T个频点分别对应的语音增益,录音功率谱数据包括T个频点分别对应的能量值,T个语音增益与T个能量值一一对应;T为大于1的正整数;
语音音频获取单元133可以包括:频点加权子单元1331,加权能量值组合子单元1332,时域变换子单元1333;
频点加权子单元1331,用于根据第一频点增益中的T个频点分别对应的语音增益,对录音功率谱数据中属于相同频点的能量值进行加权,得到T个频点分别对应的加权能量值;
加权能量值组合子单元1332,用于根据T个频点分别对应的加权能量值,确定录音音频对应的加权录音频域信号;
时域变换子单元1333,用于对加权录音频域信号进行时域变换,得到录音音频中所包含的候选语音音频。
其中,归一化处理单元131,第一频点增益输出单元132,语音音频获取单元133,特征提取子单元1321,激活子单元1322,频点加权子单元1331,加权能量值组合子单元1332,时域变换子单元1333的具体功能实现方式可以参见上述图5所对应实施例中的S206-S208,这里不再进行赘述。
在一个或多个实施例中,降噪处理模块15可以包括:第二频点增益输出单元151,信号加权单元152,时域变换单元153;
第二频点增益输出单元151,用于获取候选语音音频对应的语音功率谱数据,将语音功率谱数据输入至第二深度网络模型,通过第二深度网络模型输出针对候选语音音频的第二频点增益;
信号加权单元152,用于根据第二频点增益与语音功率谱数据,获取候选语音音频对应的加权语音频域信号;
时域变换单元153,用于对加权语音频域信号进行时域变换,得到候选语音音频对应的降噪语音音频。
其中,第二频点增益输出单元151,信号加权单元152,时域变换单元153的具体功能实现方式可以参见上述图5所对应实施例中的S209和S210,这里不再进行赘述。
在一个或多个实施例中,该音频数据处理装置1还可以包括:音频分享模块16;
音频分享模块16,用于将降噪后的录音音频分享至社交平台,以使社交平台中的终端设备在访问社交平台时,播放降噪后的录音音频。
其中,音频分享模块16的具体功能实现方式可以参见上述图3所对应实施例中的S105,这里不再进行赘述。
本申请中,上述各个模块、单元、子单元可以实现前述图3、图5任一个方法实施例中的描述,对采用相同方法的有益效果描述,也不再进行赘述。
请参见图12,图12是本申请实施例提供的一种音频数据处理装置的结构示意图。如图12所示,该音频数据处理装置2可以包括:样本获取模块21,第一预测模块22,第二预测模块23,第一调整模块24,第二调整模块25;
样本获取模块21,用于获取语音样本音频、噪声样本音频以及标准样本音频,根据语音样本音频、噪声样本音频以及标准样本音频,生成样本录音音频;语音样本音频和噪声样本音频是通过录音采集得到的,标准样本音频是音频数据库中所存储的纯净音频;
第一预测模块22,用于通过第一初始网络模型获取样本录音音频中的样本预测语音音频;第一初始网络模型用于过滤样本录音音频所包含的标准样本音频,第一初始网络模型的期望预测语音音频由语音样本音频和噪声样本音频所确定;
第二预测模块23,用于通过第二初始网络模型获取样本预测语音音频对应的样本预测降噪音频;第二初始网络模型用于抑制样本预测语音音频所包含的噪声样本音频,第二初始网络模型的期望预测降噪音频由语音样本音频所确定;
第一调整模块24,用于基于样本预测语音音频和期望预测语音音频,对第一初始网络模型的网络参数进行调整,得到第一深度网络模型;第一深度网络模型用于对录音音频进行过滤后得到候选语音音频,录音音频包括背景基准音频分量、语音音频分量以及环境噪声分量,候选语音音频包括语音音频分量和环境噪声分量;
第二调整模块25,用于基于样本预测降噪音频和期望预测降噪音频,对第二初始网络模型的网络参数进行调整,得到第二深度网络模型;第二深度网络模型用于对候选语音音频进行降噪处理后得到降噪语音音频。
其中,样本获取模块21,第一预测模块22,第二预测模块23,第一调整模块24,第二调整模块25的具体功能实现方式可以参见上述图9所对应实施例中的S301-S305,这里不再进行赘述。
在一个或多个实施例中,样本录音音频的数量为K个,K为正整数;
样本获取模块21可以包括:数组构建单元211,样本录音构建单元212;
数组构建单元211,用于获取针对第一初始网络模型的加权系数集合,根据加权系数集合构建K个数组;每个数组包括语音样本音频、噪声样本音频以及标准样本音频分别对应的系数;
样本录音构建单元212,用于根据K个数组中的第j个数组所包含的系数,分别对语音样本音频、噪声样本音频以及标准样本音频进行加权,得到第j个数组对应的样本录音音频;j为小于或等于K的正整数。
其中,数组构建单元211,样本录音构建单元212的具体功能实现方式可以参见上述图9所对应实施例中的S301,这里不再进行赘述。
本申请中,上述各个模块、单元、子单元可以实现前述图9所对应的方法实施例中的描述,对采用相同方法的有益效果描述,也不再进行赘述。
请参见图13,图13是本申请实施例提供的一种计算机设备的结构示意图。如图13所示,该计算机设备1000可以为用户终端,例如,上述图1所对应实施例中的用户终端10a,还可以为服务器,例如,上述图1所对应实施例中的服务器10d,这里将不对其进行限制。为便于理解,本申请以计算机设备为用户终端为例,该计算机设备1000可以包括:处理器1001,网络接口1004和存储器1005,此外,该计算机设备1000还可以包括:用户接口1003,和至少一个通信总线1002。其中,通信总线1002用于实现这些组件之间的连接通信。其中,用户接口1003还可以包括标准的有线接口、无线接口。网络接口1004可选的可以包括标准的有线接口、无线接口(如WI-FI接口)。存储器1004可以是高速RAM存储器,也可以是非不稳定的存储器(non-volatile memory),例如至少一个磁盘存储器。存储器1005 可选的还可以是至少一个位于远离前述处理器1001的存储装置。如图13所示,作为一种计算机可读存储介质的存储器1005中可以包括操作系统、网络通信模块、用户接口模块以及设备控制应用程序。
其中,该计算机设备1000中的网络接口1004还可以提供网络通讯功能,且可选用户接口1003还可以包括显示屏(Display)、键盘(Keyboard)。在图13所示的计算机设备1000中,网络接口1004可提供网络通讯功能;而用户接口1003主要用于为用户提供输入的接口;而处理器1001可以用于调用存储器1005中存储的设备控制应用程序,以实现:
获取录音音频;录音音频包括背景基准音频分量、语音音频分量以及环境噪声分量;
从音频数据库中确定与录音音频相匹配的原型音频;
根据原型音频从录音音频中获取候选语音音频;候选语音音频包括语音音频分量和环境噪声分量;
将录音音频与候选语音音频之间的差值,确定为录音音频中所包含的背景基准音频分量;
对候选语音音频进行环境噪声降噪处理,得到候选语音音频对应的降噪语音音频,将降噪语音音频与背景基准音频分量进行合并,得到降噪后的录音音频。
或者,处理器1001还可以实现:
获取语音样本音频、噪声样本音频以及标准样本音频,根据语音样本音频、噪声样本音频以及标准样本音频,生成样本录音音频;语音样本音频和噪声样本音频是通过录音采集得到的,标准样本音频是音频数据库中所存储的纯净音频;
通过第一初始网络模型获取样本录音音频中的样本预测语音音频;第一初始网络模型用于过滤样本录音音频所包含的标准样本音频,第一初始网络模型的期望预测语音音频由语音样本音频和噪声样本音频所确定;
通过第二初始网络模型获取样本预测语音音频对应的样本预测降噪音频;第二初始网络模型用于抑制样本预测语音音频所包含的噪声样本音频,第二初始网络模型的期望预测降噪音频由语音样本音频所确定;
基于样本预测语音音频和期望预测语音音频,对第一初始网络模型的网络参数进行调整,得到第一深度网络模型;第一深度网络模型用于对录音音频进行过滤后得到候选语音音频,录音音频包括背景基准音频分量、语音音频分量以及环境噪声分量,候选语音音频包括语音音频分量和环境噪声分量;
基于样本预测降噪音频和期望预测降噪音频,对第二初始网络模型的网络参数进行调整,得到第二深度网络模型;第二深度网络模型用于对候选语音音频进行降噪处理后得到降噪语音音频。
应当理解,本申请实施例中所描述的计算机设备1000可执行前文图3、图5以及图9任一个所对应实施例中对音频数据处理方法的描述,也可执行前文图11所对应实施例中对音频数据处理装置1的描述,或者图12所对应实施例中对音频数据处理装置2的描述,在此不再赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。
此外,这里需要指出的是:本申请实施例还提供了一种计算机可读存储介质,且计算机可读存储介质中存储有前文提及的音频数据处理装置1和音频数据处理装置2所执行的计算机程序,且计算机程序包括程序指令,当处理器执行程序指令时,能够执行前文图3、图5以及图9任一个所对应实施例中对音频数据处理方法的描述,因此,这里将不再进行赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。对于本申请所涉及的计算机可读存储介质实施例中未披露的技术细节,请参照本申请方法实施例的描述。作为示例,程序指令可被部署在一个计算设备上执行,或者在位于一个地点的多个计算设备上执行,又或者,在分布在多个地点且通过通信网络互连的多个计算设备上执行,分布在多个地点且通过通信网络互连的多个计算设备可以组成区块链系统。
此外,需要说明的是:本申请实施例还提供了一种计算机程序产品或计算机程序,该计算机程序产品或者计算机程序可以包括计算机指令,该计算机指令可以存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器可以执行该计算机指令,使得该计算机设备执行前文图3、图5以及图9任一个所对应实施例中对音频数据处理方法的描述,因此,这里将不再进行赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。对于本申请所涉及的计算机程序产品或者计算机程序实施例中未披露的技术细节,请参照本申请方法实施例的描述。
需要说明的是,对于前述的各个方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某一些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。
本申请实施例方法中的步骤可以根据实际需要进行顺序调整、合并和删减。
本申请实施例装置中的模块可以根据实际需要进行合并、划分和删减。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,计算机程序可存储于一计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,存储介质可为磁碟、光盘、只读存储器(Read-Only Memory,ROM)或随机存储器(Random Access Memory,RAM)等。
以上所揭露的仅为本申请较佳实施例而已,当然不能以此来限定本申请之权利范围,因此依本申请权利要求所作的等同变化,仍属本申请所涵盖的范围。

Claims (16)

  1. 一种音频数据处理方法,所述方法由计算机设备执行,包括:
    获取录音音频;所述录音音频包括背景基准音频分量、语音音频分量以及环境噪声分量;
    从音频数据库中确定与所述录音音频相匹配的原型音频;
    根据所述原型音频从所述录音音频中获取候选语音音频;所述候选语音音频包括所述语音音频分量和所述环境噪声分量;
    将所述录音音频与所述候选语音音频之间的差值,确定为所述录音音频中所包含的所述背景基准音频分量;
    对所述候选语音音频进行环境噪声降噪处理,得到所述候选语音音频对应的降噪语音音频,将所述降噪语音音频与所述背景基准音频分量进行合并,得到降噪后的录音音频。
  2. 根据权利要求1所述的方法,所述从音频数据库中确定与所述录音音频相匹配的原型音频,包括:
    获取所述录音音频对应的待匹配音频指纹;
    根据所述待匹配音频指纹在所述音频数据库中获取与所述录音音频相匹配的原型音频。
  3. 根据权利要求2所述的方法,所述获取所述录音音频对应的待匹配音频指纹,包括:
    将所述录音音频划分为M个录音数据帧,对所述M个录音数据帧中的第i个录音数据帧进行频域变换,得到所述第i个录音数据帧对应的功率谱数据;i和M均为正整数,且i小于或等于M;
    将所述第i个录音数据帧对应的功率谱数据划分为N个频谱带,根据所述N个频谱带中的峰值信号,构建所述第i个录音数据帧对应的子指纹信息;N为正整数;
    按照所述M个录音数据帧在所述录音音频中的时间顺序,对所述M个录音数据帧分别对应的子指纹信息进行组合,得到所述录音音频对应的待匹配音频指纹;
    所述根据所述待匹配音频指纹在所述音频数据库中获取与所述录音音频相匹配的原型音频,包括:
    获取所述音频数据库对应的音频指纹库;
    根据所述待匹配音频指纹在所述音频指纹库中进行指纹检索,根据指纹检索结果在所述音频数据库中确定所述原型音频。
  4. 根据权利要求3所述的方法,所述根据所述待匹配音频指纹在所述音频指纹库中进行指纹检索,根据指纹检索结果在所述音频数据库中确定所述原型音频,包括:
    将所述待匹配音频指纹中所包含的M个子指纹信息映射为M个待匹配哈希值,获取所述M个待匹配哈希值分别对应的录音时间;一个待匹配哈希值所对应的录音时间用于表征该待匹配哈希值对应的子指纹信息在所述录音音频中出现的时间;
    若所述M个待匹配哈希值中的第p个待匹配哈希值与所述音频指纹库所包含的第一哈希值相匹配,则获取所述第p个待匹配哈希值对应的录音时间与所述第一哈希值对应的时间信息之间的第一时间差;p为小于或等于M的正整数;
    若所述M个待匹配哈希值中的第q个待匹配哈希值与所述音频指纹库所包含的第二哈希值相匹配,则获取所述第q个待匹配哈希值对应的录音时间与所述第二哈希值对应的时间信息之间的第二时间差;q为小于或等于M的正整数;
    当所述第一时间差和所述第二时间差满足数值阈值,且所述第一哈希值和所述第二哈希值属于相同的音频指纹时,将所述第一哈希值所属的音频指纹确定为所述指纹检索结果,将所述指纹检索结果所对应的音频数据确定为所述录音音频对应的原型音频。
  5. 根据权利要求1所述的方法,所述根据所述原型音频从所述录音音频中获取候选语音音频,包括:
    获取所述录音音频对应的录音功率谱数据,对所述录音功率谱数据进行归一化处理,得到第一频谱特征;
    获取所述原型音频对应的原型功率谱数据,对所述原型功率谱数据进行归一化处理,得到第二频谱特征,将所述第一频谱特征和所述第二频谱特征组合为输入特征;
    将所述输入特征输入至第一深度网络模型,通过所述第一深度网络模型输出针对所述录音音频的第一频点增益;
    根据所述第一频点增益和所述录音功率谱数据,获取所述录音音频中所包含的候选语音音频。
  6. 根据权利要求5所述的方法,所述将所述输入特征输入至第一深度网络模型,通过所述第一深度网络模型输出第一频点增益,包括:
    将所述输入特征输入至第一深度网络模型,根据所述第一深度网络模型中的特征提取网络层,获取所述输入特征对应的时序分布特征;
    根据所述第一深度网络模型中的全连接网络层,获取所述时序分布特征对应的时序特征向量;
    根据所述时序特征向量,通过所述第一深度网络模型中的激活层,输出所述第一频点增益。
  7. 根据权利要求5所述的方法,所述第一频点增益包括T个频点分别对应的语音增益,所述录音功率谱数据包括所述T个频点分别对应的能量值,T个语音增益与T个能量值一一对应;T为大于1的正整数;
    所述根据所述第一频点增益和所述录音功率谱数据,获取所述录音音频中所包含的候选语音音频,包括:
    根据所述第一频点增益中的所述T个频点分别对应的语音增益,对所述录音功率谱数据中属于相同频点的能量值进行加权,得到所述T个频点分别对应的加权能量值;
    根据所述T个频点分别对应的加权能量值,确定所述录音音频对应的加权录音频域信号;
    对所述加权录音频域信号进行时域变换,得到所述录音音频中所包含的所述候选语音音频。
  8. 根据权利要求1所述的方法,所述对所述候选语音音频进行环境噪声降噪处理,得到所述候选语音音频对应的降噪语音音频,包括:
    获取所述候选语音音频对应的语音功率谱数据,将所述语音功率谱数据输入至第二深度网络模型,通过所述第二深度网络模型输出针对所述候选语音音频的第二频点增益;
    根据所述第二频点增益与所述语音功率谱数据,获取所述候选语音音频对应的加权语音频域信号;
    对所述加权语音频域信号进行时域变换,得到所述候选语音音频对应的所述降噪语音音频。
  9. 根据权利要求1-8任一项所述的方法,还包括:
    将所述降噪后的录音音频分享至社交平台,以使所述社交平台中的终端设备在访问所述社交平台时,播放所述降噪后的录音音频。
  10. 一种音频数据处理方法,所述方法由计算机设备执行,包括:
    获取语音样本音频、噪声样本音频以及标准样本音频,根据所述语音样本音频、所述噪声样本音频以及所述标准样本音频,生成样本录音音频;所述语音样本音频和所述噪声样本音频是通过录音采集得到的,所述标准样本音频是音频数据库中所存储的纯净音频;
    通过第一初始网络模型获取所述样本录音音频中的样本预测语音音频;所述第一初始网络模型用于过滤所述样本录音音频所包含的标准样本音频,所述第一初始网络模型的期望预测语音音频由所述语音样本音频和所述噪声样本音频所确定;
    通过第二初始网络模型获取所述样本预测语音音频对应的样本预测降噪音频;所述第二初始网络模型用于抑制所述样本预测语音音频所包含的噪声样本音频,所述第二初始网络模型的期望预测降噪音频由所述语音样本音频所确定;
    基于所述样本预测语音音频和所述期望预测语音音频,对所述第一初始网络模型的网络参数进行调整,得到第一深度网络模型;所述第一深度网络模型用于对录音音频进行过滤后得到候选语音音频,所述录音音频包括背景基准音频分量、语音音频分量以及环境噪声分量,所述候选语音音频包括所述语音音频分量和所述环境噪声分量;
    基于所述样本预测降噪音频和所述期望预测降噪音频,对所述第二初始网络模型的网络参数进行调整,得到第二深度网络模型;所述第二深度网络模型用于对所述候选语音音频进行降噪处理后得到降噪语音音频。
  11. 根据权利要求10所述的方法,所述样本录音音频的数量为K个,K为正整数;
    所述根据所述语音样本音频、所述噪声样本音频以及所述标准样本音频,生成样本录音音频,包括:
    获取针对所述第一初始网络模型的加权系数集合,根据所述加权系数集合构建K个数组;每个数组包括所述语音样本音频、所述噪声样本音频以及所述标准样本音频分别对应的系数;
    根据所述K个数组中的第j个数组所包含的系数,分别对所述语音样本音频、所述噪声样本音频以及所述标准样本音频进行加权,得到所述第j个数组对应的样本录音音频;j为小于或等于K的正整数。
  12. 一种音频数据处理装置,所述装置部署在计算机设备上,包括:
    音频获取模块,用于获取录音音频;所述录音音频包括背景基准音频分量、语音音频分量以及环境噪声分量;
    检索模块,用于从音频数据库中确定与所述录音音频相匹配的原型音频;
    音频过滤模块,用于根据所述原型音频从所述录音音频中获取候选语音音频;所述候选语音音频包括所述语音音频分量和所述环境噪声分量;
    音频确定模块,用于将所述录音音频与所述候选语音音频之间的差值,确定为所述录音音频中所包含的背景基准音频分量;
    降噪处理模块,用于对所述候选语音音频进行环境噪声降噪处理,得到所述候选语音音频对应的降噪语音音频,将所述降噪语音音频与所述背景基准音频分量进行合并,得到降噪后的录音音频。
  13. 一种音频数据处理装置,所述装置部署在计算机设备上,包括:
    样本获取模块,用于获取语音样本音频、噪声样本音频以及标准样本音频,根据所述语音样本音频、所述噪声样本音频以及所述标准样本音频,生成样本录音音频;所述语音样本音频和所述噪声样本音频是通过录音采集得到的,所述标准样本音频是音频数据库中所存储的纯净音频;
    第一预测模块,用于通过第一初始网络模型获取所述样本录音音频中的样本预测语音音频;所述第一初始网络模型用于过滤所述样本录音音频所包含的标准样本音频,所述第一初始网络模型的期望预测语音音频由所述语音样本音频和所述噪声样本音频所确定;
    第二预测模块,用于通过第二初始网络模型获取所述样本预测语音音频对应的样本预测降噪音频;所述第二初始网络模型用于抑制所述样本预测语音音频所包含的噪声样本音频,所述第二初始网络模型的期望预测降噪音频由所述语音样本音频所确定;
    第一调整模块,用于基于所述样本预测语音音频和所述期望预测语音音频,对所述第一初始网络模型的网络参数进行调整,得到第一深度网络模型;所述第一深度网络模型用于对录音音频进行过滤后得到候选语音音频,所述录音音频包括背景基准音频分量、语音音频分量以及环境噪声分量,所述候选语音音频包括所述语音音频分量和所述环境噪声分量;
    第二调整模块,用于基于所述样本预测降噪音频和所述期望预测降噪音频,对所述第二初始网络模型的网络参数进行调整,得到第二深度网络模型;所述第二深度网络模型用于对所述候选语音音频进行降噪处理后得到降噪语音音频。
  14. 一种计算机设备,包括存储器和处理器;
    所述存储器与所述处理器相连,所述存储器用于存储计算机程序,所述处理器用于调用所述计算机程序,以使得所述计算机设备执行权利要求1-11任一项所述的方法。
  15. 一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,所述计算机程序适于由处理器加载并执行,以使得具有所述处理器的计算机设备执行权利要求1-11任一项所述的方法。
  16. 一种计算机程序产品,包括计算机程序/指令,所述计算机程序/指令被处理器执行时实现权利要求1-11任一项所述的方法。
PCT/CN2022/113179 2021-09-03 2022-08-18 音频数据处理方法、装置、设备以及介质 WO2023030017A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22863157.8A EP4300493A1 (en) 2021-09-03 2022-08-18 Audio data processing method and apparatus, device and medium
US18/137,332 US20230260527A1 (en) 2021-09-03 2023-04-20 Audio data processing method and apparatus, device, and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111032206.9A CN115762546A (zh) 2021-09-03 2021-09-03 音频数据处理方法、装置、设备以及介质
CN202111032206.9 2021-09-03

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/137,332 Continuation US20230260527A1 (en) 2021-09-03 2023-04-20 Audio data processing method and apparatus, device, and medium

Publications (1)

Publication Number Publication Date
WO2023030017A1 true WO2023030017A1 (zh) 2023-03-09

Family

ID=85332470

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/113179 WO2023030017A1 (zh) 2021-09-03 2022-08-18 音频数据处理方法、装置、设备以及介质

Country Status (4)

Country Link
US (1) US20230260527A1 (zh)
EP (1) EP4300493A1 (zh)
CN (1) CN115762546A (zh)
WO (1) WO2023030017A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994600B (zh) * 2023-09-28 2023-12-12 中影年年(北京)文化传媒有限公司 基于音频驱动角色口型的方法及系统

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070131094A1 (en) * 2005-11-09 2007-06-14 Sony Deutschland Gmbh Music information retrieval using a 3d search algorithm
US9947333B1 (en) * 2012-02-10 2018-04-17 Amazon Technologies, Inc. Voice interaction architecture with intelligent background noise cancellation
CN108140399A (zh) * 2015-09-25 2018-06-08 高通股份有限公司 用于超宽带音乐的自适应噪声抑制
CN106024005A (zh) * 2016-07-01 2016-10-12 腾讯科技(深圳)有限公司 一种音频数据的处理方法及装置
CN111046226A (zh) * 2018-10-15 2020-04-21 阿里巴巴集团控股有限公司 一种音乐的调音方法及装置
CN110675886A (zh) * 2019-10-09 2020-01-10 腾讯科技(深圳)有限公司 音频信号处理方法、装置、电子设备及存储介质
CN110808063A (zh) * 2019-11-29 2020-02-18 北京搜狗科技发展有限公司 一种语音处理方法、装置和用于处理语音的装置
CN111128214A (zh) * 2019-12-19 2020-05-08 网易(杭州)网络有限公司 音频降噪方法、装置、电子设备及介质
CN111524530A (zh) * 2020-04-23 2020-08-11 广州清音智能科技有限公司 一种基于膨胀因果卷积的语音降噪方法
CN113257283A (zh) * 2021-03-29 2021-08-13 北京字节跳动网络技术有限公司 音频信号的处理方法、装置、电子设备和存储介质

Also Published As

Publication number Publication date
US20230260527A1 (en) 2023-08-17
CN115762546A (zh) 2023-03-07
EP4300493A1 (en) 2024-01-03

Similar Documents

Publication Publication Date Title
US20210089967A1 (en) Data training in multi-sensor setups
CN112289333B (zh) 语音增强模型的训练方法和装置及语音增强方法和装置
JP2019216408A (ja) 情報を出力するための方法、及び装置
US20080120115A1 (en) Methods and apparatuses for dynamically adjusting an audio signal based on a parameter
CN111798821B (zh) 声音转换方法、装置、可读存储介质及电子设备
CN109584904B (zh) 应用于基础音乐视唱教育的视唱音频唱名识别建模方法
CN113611324B (zh) 一种直播中环境噪声抑制的方法、装置、电子设备及存储介质
CN110047497B (zh) 背景音频信号滤除方法、装置及存储介质
CN113257283B (zh) 音频信号的处理方法、装置、电子设备和存储介质
CN111091835A (zh) 模型训练的方法、声纹识别的方法、系统、设备及介质
WO2023030017A1 (zh) 音频数据处理方法、装置、设备以及介质
Mittal et al. Static–dynamic features and hybrid deep learning models based spoof detection system for ASV
Kamuni et al. Advancing Audio Fingerprinting Accuracy with AI and ML: Addressing Background Noise and Distortion Challenges
CN113436609A (zh) 语音转换模型及其训练方法、语音转换方法及系统
CN113205793A (zh) 音频生成方法、装置、存储介质及电子设备
Jensen et al. Evaluation of MFCC estimation techniques for music similarity
CN113614828A (zh) 经由归一化对音频信号进行指纹识别的方法和装置
Liu et al. Anti-forensics of fake stereo audio using generative adversarial network
CN116312559A (zh) 跨信道声纹识别模型的训练方法、声纹识别方法及装置
CN105589970A (zh) 音乐搜索方法和装置
CN115116469A (zh) 特征表示的提取方法、装置、设备、介质及程序产品
Baroughi et al. Additive attacks on speaker recognition
Choi et al. Light-weight Frequency Information Aware Neural Network Architecture for Voice Spoofing Detection
CN113362849A (zh) 一种语音数据处理方法以及装置
EP4343761A1 (en) Enhanced audio file generator

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22863157; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2022863157; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2022863157; Country of ref document: EP; Effective date: 20230926)
NENP Non-entry into the national phase (Ref country code: DE)