CN112242149B - Audio data processing method and device, earphone and computer readable storage medium - Google Patents


Info

Publication number
CN112242149B
Authority
CN
China
Prior art keywords
audio data
feature
target
voice
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011398525.7A
Other languages
Chinese (zh)
Other versions
CN112242149A (en)
Inventor
陈孝良
冯大航
赵力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd
Priority to CN202011398525.7A
Publication of CN112242149A
Application granted
Publication of CN112242149B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 - Coding or decoding using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 - Coding or decoding using spectral analysis, using orthogonal transformation
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L25/21 - Speech or voice analysis techniques in which the extracted parameters are power information
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques using neural networks
    • G10L25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques specially adapted for comparison or discrimination

Abstract

The application provides an audio data processing method and device, an earphone and a computer-readable storage medium, and belongs to the technical field of data processing. When two paths of audio data are acquired through a master microphone assembly and a slave microphone assembly located in the earphone, the energy difference feature of the two paths of audio data is determined and fused with a target voice feature, which is the voice feature of a target user. By incorporating the target voice feature into the voice separation process, voice separation is then performed on the first audio data collected by the master microphone assembly of the earphone according to the fusion feature of the energy difference feature and the target voice feature, so that target audio data including only the voice of the target user is obtained and the voice separation effect is improved.

Description

Audio data processing method and device, earphone and computer readable storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for processing audio data, an earphone, and a computer-readable storage medium.
Background
As an efficient and convenient audio input and output device, the earphone has become a necessity in people's daily life. In actual use, the signal received by a microphone assembly of the earphone may include not only the voice of the target user (i.e., the target speaker), but also the voices of other people nearby, noise of the environment where the target speaker is located, and the like. The earphone therefore needs to separate out the voice of the target user to improve call quality and the accuracy of back-end processing such as speech recognition.
At present, separation of the target user's voice exploits the fact that different microphone assemblies are at different distances from a sound source. Taking an earphone with two microphone assemblies as an example, the target user is close to both assemblies, so the energy difference between the audio data they collect is relatively large, whereas noise sources are farther away, so the corresponding energy difference is relatively small; voice and noise can therefore be separated by the energy difference between the audio data collected by the two microphones of the earphone. However, when another person is close to the target user, the energy difference of that person's voice as received by the two microphone assemblies is also large, so during separation the voice of the other person is extracted as the voice of the target user, resulting in a poor voice separation effect.
Disclosure of Invention
The embodiments of the application provide a method and a device for processing audio data, an earphone and a computer-readable storage medium, which can improve the voice separation effect. The technical solution is described below.
In one aspect, a method for processing audio data is provided, and the method includes:
acquiring first audio data collected by a master microphone assembly of an earphone and second audio data collected by a slave microphone assembly of the earphone, wherein the master microphone assembly and the slave microphone assembly are both located in the earphone;
determining an energy difference characteristic of the first audio data and the second audio data;
performing feature fusion on the energy difference feature and a target voice feature to obtain a fusion feature, wherein the target voice feature is a voice feature registered by a target user in advance;
and carrying out voice separation on the first audio data according to the fusion characteristics to obtain target audio data.
In one possible implementation, the determining the energy difference characteristic of the first audio data and the second audio data comprises:
respectively determining spectral characteristics of the first audio data and the second audio data;
determining an energy spectral feature of the first audio data and an energy spectral feature of the second audio data based on the spectral feature of the first audio data and the spectral feature of the second audio data, respectively;
energy difference features of the first audio data and the second audio data are determined based on the energy spectral features of the first audio data and the energy spectral features of the second audio data.
In a possible implementation manner, the performing feature fusion on the energy difference feature and the target speech feature to obtain a fusion feature includes:
splicing the energy difference feature and the target voice feature in an up-sampling manner to obtain a fusion feature; or,
splicing the energy difference feature and the target voice feature to obtain a fusion feature; or,
and performing feature fusion on the energy difference feature and the target voice feature by a canonical correlation analysis method to obtain a fusion feature.
In a possible implementation manner, the performing voice separation on the first audio data according to the fusion feature to obtain target audio data includes:
and fusing the mask corresponding to the fusion feature with the first audio data to obtain target audio data.
In a possible implementation manner, before the mask corresponding to the fusion feature is fused with the first audio data to obtain the target audio data, the method further includes:
extracting the convolution characteristic of the fusion characteristic;
and determining a mask corresponding to the fused feature based on the convolution feature of the fused feature.
In a possible implementation manner, the fusing the mask corresponding to the fusion feature with the first audio data to obtain the target audio data includes:
acquiring the spectral characteristics of the first audio data;
fusing the mask and the frequency spectrum characteristic to obtain a fused frequency spectrum characteristic;
and generating the target audio data based on the fused spectral features.
In a possible implementation manner, the target voice feature is obtained by performing feature extraction on registered audio data input by a target user through a feature extraction model, and the feature extraction model is obtained by training based on multiple loss functions.
In one possible implementation manner, a target control is arranged on the earphone, and the target control is used for registering the target voice characteristics based on the registered audio data input by the target user when the target control is triggered;
the registration process of the target voice feature comprises the following steps:
responding to the trigger operation of the target control, and sending voice prompt information, wherein the voice prompt information is used for prompting a target user to input the registered audio data;
acquiring registered audio data input by a target user;
based on the enrollment audio data, the target speech feature is determined.
In one aspect, an apparatus for processing audio data is provided, the apparatus comprising:
an acquisition module for acquiring first audio data collected by a master microphone assembly of an earphone and second audio data collected by a slave microphone assembly of the earphone, wherein the master microphone assembly and the slave microphone assembly are both located in the earphone;
a determining module for determining an energy difference characteristic of the first audio data and the second audio data;
the fusion module is used for carrying out feature fusion on the energy difference feature and a target voice feature to obtain a fusion feature, wherein the target voice feature is a voice feature registered by a target user in advance;
and the separation module is used for carrying out voice separation on the first audio data according to the fusion characteristics to obtain target audio data.
In a possible implementation manner, the determining module is configured to determine spectral features of the first audio data and the second audio data respectively; determining an energy spectral feature of the first audio data and an energy spectral feature of the second audio data based on the spectral feature of the first audio data and the spectral feature of the second audio data, respectively; energy difference features of the first audio data and the second audio data are determined based on the energy spectral features of the first audio data and the energy spectral features of the second audio data.
In a possible implementation manner, the fusion module is configured to splice the energy difference feature and the target voice feature in an up-sampling manner to obtain the fusion feature; or splice the energy difference feature and the target voice feature to obtain the fusion feature; or perform feature fusion on the energy difference feature and the target voice feature by a canonical correlation analysis method to obtain the fusion feature.
In a possible implementation manner, the separation module is configured to fuse the mask corresponding to the fusion feature with the first audio data to obtain the target audio data.
In one possible implementation, the apparatus further includes:
the extraction module is used for extracting the convolution characteristic of the fusion characteristic;
the determining module is further configured to determine a mask corresponding to the fused feature based on the convolution feature of the fused feature.
In a possible implementation manner, the fusion module is configured to obtain a spectral feature of the first audio data; fusing the mask and the frequency spectrum characteristic to obtain a fused frequency spectrum characteristic; and generating the target audio data based on the fused spectral features.
In a possible implementation manner, the target voice feature is obtained by performing feature extraction on registered audio data input by a target user through a feature extraction model, and the feature extraction model is obtained by training based on multiple loss functions.
In one possible implementation manner, a target control is arranged on the earphone, and the target control is used for registering the target voice characteristics based on the registered audio data input by the target user when the target control is triggered;
the registration process of the target voice feature comprises the following steps:
responding to the trigger operation of the target control, and sending voice prompt information, wherein the voice prompt information is used for prompting a target user to input the registered audio data;
acquiring registered audio data input by a target user;
based on the enrollment audio data, the target speech feature is determined.
In one aspect, a headset is provided that includes one or more processors and one or more memories having at least one program code stored therein, the program code being loaded and executed by the one or more processors to implement operations performed by the processing method of audio data.
In one aspect, there is provided a computer-readable storage medium having at least one program code stored therein, the program code being loaded and executed by a processor to implement the operations performed by the processing method of the audio data.
In an aspect, a computer program product or a computer program is provided, the computer program product or the computer program comprising computer program code, the computer program code being stored in a computer readable storage medium. The processor of the computer device reads the computer program code from the computer-readable storage medium, and the processor executes the computer program code to implement the operations performed by the processing method of the audio data.
According to the scheme provided by the embodiments of the application, when two paths of audio data are respectively obtained through the master microphone assembly and the slave microphone assembly located in the earphone, the energy difference feature of the two paths of audio data is determined and fused with the target voice feature, which is the voice feature of the target user. The target voice feature is thereby incorporated into the voice separation process, and voice separation is performed on the first audio data collected by the master microphone assembly of the earphone according to the fusion feature of the energy difference feature and the target voice feature, so that target audio data including only the voice of the target user is obtained and the voice separation effect is improved.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present application; those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic diagram of an implementation environment of a method for processing audio data according to an embodiment of the present application;
fig. 2 is a flowchart of a method for processing audio data according to an embodiment of the present application;
fig. 3 is a flowchart of a method for processing audio data according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a feature extraction model provided in an embodiment of the present application;
fig. 5 is a schematic diagram of a processing procedure of audio data according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an apparatus for processing audio data according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an earphone according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation environment of a method for processing audio data according to an embodiment of the present application, and referring to fig. 1, the implementation environment includes: a headset 101 and a computer device 102.
The earphone 101 may be a headphone, a True Wireless Stereo (TWS) earphone, or the like; optionally, the earphone 101 is of another type, which is not limited in the embodiment of the present application. The earphone 101 is provided with a plurality of microphone assemblies, collects audio data through the plurality of microphone assemblies, and further separates, from the collected audio data, target audio data including only voice. The earphone 101 is connected to the computer device 102 through a wired or wireless connection, and the embodiment of the present application does not limit the specific connection manner, so that the earphone 101 can transmit the target audio data to the computer device 102.
The computer device 102 may be at least one of a smart phone, a tablet computer, a smart watch, a portable computer, an MP3 player (Moving Picture Experts Group Audio Layer III, motion Picture Experts compression standard Audio Layer 3), an MP4 player (Moving Picture Experts Group Audio Layer IV, motion Picture Experts compression standard Audio Layer 4), and a laptop computer. Optionally, the headset 101 and the computer device 102 are connected by a wired or wireless connection, and the embodiment of the present application does not limit the specific connection manner. The computer device 102 sends the target audio data acquired through the earphone 101 to other computer devices, receives audio data sent by other computer devices, and further plays the received audio data through the speaker assembly, or plays the received audio data through the earphone 101.
The headset 101 may generally refer to one of a plurality of headsets, and the computer device 102 may generally refer to one of a plurality of computer devices; this embodiment is illustrated only with the headset 101 and the computer device 102. Those skilled in the art will appreciate that the number of headsets and computer devices may be greater or fewer. For example, there may be only a few headsets and computer devices, or there may be tens or hundreds of them, or more; the number and types of headsets and computer devices are not limited in the embodiments of the present application.
Fig. 2 is a flowchart of a method for processing audio data according to an embodiment of the present application, and referring to fig. 2, the method includes the following steps.
201. The headset acquires first audio data collected by a master microphone assembly of the headset and second audio data collected by a slave microphone assembly of the headset, wherein the master microphone assembly and the slave microphone assembly are both located in the headset.
202. The headset determines an energy difference characteristic of the first audio data and the second audio data.
203. And the earphone performs feature fusion on the energy difference feature and the target voice feature to obtain a fusion feature, wherein the target voice feature is a voice feature registered by a target user in advance.
204. And the earphone performs voice separation on the first audio data according to the fusion characteristic to obtain target audio data.
According to the scheme provided by the embodiments of the application, when two paths of audio data are respectively obtained through the master microphone assembly and the slave microphone assembly located in the earphone, the energy difference feature of the two paths of audio data is determined and fused with the target voice feature, which is the voice feature of the target user. The target voice feature is thereby incorporated into the voice separation process, and voice separation is performed on the first audio data collected by the master microphone assembly of the earphone according to the fusion feature of the energy difference feature and the target voice feature, so that target audio data including only the voice of the target user is obtained and the voice separation effect is improved.
In one possible implementation, the determining the energy difference characteristic of the first audio data and the second audio data comprises:
respectively determining spectral characteristics of the first audio data and the second audio data;
determining an energy spectral feature of the first audio data and an energy spectral feature of the second audio data based on the spectral feature of the first audio data and the spectral feature of the second audio data, respectively;
energy difference features of the first audio data and the second audio data are determined based on the energy spectral features of the first audio data and the energy spectral features of the second audio data.
In a possible implementation manner, the performing feature fusion on the energy difference feature and the target speech feature to obtain a fusion feature includes:
splicing the energy difference feature and the target voice feature in an up-sampling manner to obtain a fusion feature; or,
splicing the energy difference feature and the target voice feature to obtain a fusion feature; or,
and performing feature fusion on the energy difference feature and the target voice feature by a canonical correlation analysis method to obtain a fusion feature.
In a possible implementation manner, the performing voice separation on the first audio data according to the fusion feature to obtain target audio data includes:
and fusing the mask corresponding to the fusion feature with the first audio data to obtain target audio data.
In a possible implementation manner, before the mask corresponding to the fusion feature is fused with the first audio data to obtain the target audio data, the method further includes:
extracting the convolution characteristic of the fusion characteristic;
and determining a mask corresponding to the fused feature based on the convolution feature of the fused feature.
In a possible implementation manner, the fusing the mask corresponding to the fusion feature with the first audio data to obtain the target audio data includes:
acquiring the spectral characteristics of the first audio data;
fusing the mask and the frequency spectrum characteristic to obtain a fused frequency spectrum characteristic;
and generating the target audio data based on the fused spectral features.
In a possible implementation manner, the target voice feature is obtained by performing feature extraction on registered audio data input by a target user through a feature extraction model, and the feature extraction model is obtained by training based on multiple loss functions.
In one possible implementation manner, a target control is arranged on the earphone, and the target control is used for registering the target voice characteristics based on the registered audio data input by the target user when the target control is triggered;
the registration process of the target voice feature comprises the following steps:
responding to the trigger operation of the target control, and sending voice prompt information, wherein the voice prompt information is used for prompting a target user to input the registered audio data;
acquiring registered audio data input by a target user;
based on the enrollment audio data, the target speech feature is determined.
Fig. 3 is a flowchart of a method for processing audio data according to an embodiment of the present application, and referring to fig. 3, the method includes the following steps.
301. The headset acquires first audio data collected by a master microphone assembly of the headset and second audio data collected by a slave microphone assembly of the headset, wherein the master microphone assembly and the slave microphone assembly are both located in the headset.
It should be noted that steps 301 to 306 are described by taking as an example an earphone that includes two microphone assemblies, namely one master microphone assembly and one slave microphone assembly. In a further possible implementation, the earphone includes more than two microphone assemblies, of which one is the master microphone assembly and at least one is a slave microphone assembly; the data collected by each slave microphone assembly is second audio data, and the second audio data corresponding to each slave microphone assembly is processed in the same way. For the specific procedure, refer to the following steps 302 to 306, which are not repeated here.
302. The headset determines an energy difference characteristic of the first audio data and the second audio data.
It should be noted that the energy difference feature is used to represent the difference between the energy of the first audio data and the energy of the second audio data.
In one possible implementation, the headphone determines spectral features of the first audio data and the second audio data, respectively, determines energy spectral features of the first audio data and energy spectral features of the second audio data based on the spectral features of the first audio data and the spectral features of the second audio data, respectively, and determines energy difference features of the first audio data and the second audio data based on the energy spectral features of the first audio data and the energy spectral features of the second audio data.
For either of the first audio data and the second audio data, the earphone performs a fast Fourier transform on that audio data to obtain its amplitude information in the frequency domain, that is, the amplitude corresponding to each frequency, which constitutes the spectral feature of that audio data. The amplitude corresponding to each frequency in the spectral feature is then squared to obtain the energy value corresponding to each frequency, which constitutes the energy spectral feature of that audio data. After the energy spectral features of the first audio data and the second audio data are determined, the energy difference between the first audio data and the second audio data can be determined by comparing the two energy spectral features, and the energy difference feature of the first audio data and the second audio data is thereby extracted.
It should be noted that, the above is only an exemplary method for determining the energy difference characteristic of the first audio data and the second audio data, and in a more possible implementation manner, other manners are used to determine the energy difference characteristic of the first audio data and the second audio data, and the embodiment of the present application does not limit what manner is specifically used.
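As an illustration of the exemplary procedure above, the following Python sketch computes an energy difference feature from one frame of each path; the frame length, the use of a per-frequency log-energy ratio as the difference measure, and the variable names are assumptions made for illustration and are not limitations of the embodiment.

```python
import numpy as np

def energy_difference_feature(frame_primary, frame_secondary, eps=1e-10):
    """Sketch of the exemplary energy-difference computation.

    frame_primary / frame_secondary: 1-D numpy arrays holding one frame of
    the first (master microphone) and second (slave microphone) audio data.
    """
    # Spectral features: FFT amplitude per frequency bin.
    spec_primary = np.abs(np.fft.rfft(frame_primary))
    spec_secondary = np.abs(np.fft.rfft(frame_secondary))

    # Energy spectral features: square of the amplitude at each frequency.
    energy_primary = spec_primary ** 2
    energy_secondary = spec_secondary ** 2

    # Energy difference feature: an assumed per-frequency log-energy ratio,
    # one way of "comparing" the two energy spectral features.
    return np.log(energy_primary + eps) - np.log(energy_secondary + eps)

# Example usage with 512-sample frames (assumed values):
# feat = energy_difference_feature(x1[:512], x2[:512])
```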
303. And the earphone performs feature fusion on the energy difference feature and the target voice feature to obtain a fusion feature, wherein the target voice feature is a voice feature registered by a target user in advance.
In a possible implementation manner, the headset splices the energy difference feature and the target voice feature in an up-sampling manner to obtain the fusion feature. In another possible implementation manner, the headset directly splices the energy difference feature and the target voice feature to obtain the fusion feature. In another possible implementation, the headset performs feature fusion using methods such as canonical correlation analysis. Optionally, the headset performs feature fusion on the energy difference feature and the target voice feature in other ways, and the embodiment of the present application does not limit which way is specifically adopted.
The process of obtaining the fusion Feature by the upsampling method is similar to the processing process of a Feature Pyramid Network (FPN), and is not described here again.
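A minimal sketch of the direct splicing option and of an upsample-then-splice variant is given below; the linear interpolation used for up-sampling and the feature dimensionalities are assumptions for illustration only, and the FPN-style fusion mentioned above may differ.

```python
import numpy as np

def fuse_by_concatenation(energy_diff_feat, target_voice_feat):
    # Direct splicing of the two feature vectors into one fusion feature.
    return np.concatenate([energy_diff_feat, target_voice_feat])

def fuse_with_upsampling(energy_diff_feat, target_voice_feat):
    # Up-sample the (typically shorter) target voice feature to the length of
    # the energy difference feature before splicing; linear interpolation is
    # an assumed choice here.
    upsampled = np.interp(
        np.linspace(0.0, 1.0, num=len(energy_diff_feat)),
        np.linspace(0.0, 1.0, num=len(target_voice_feat)),
        target_voice_feat,
    )
    return np.concatenate([energy_diff_feat, upsampled])
```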
It should be noted that the target voice feature is registered in advance by the target user before using the headset. The registration process includes: a target control is arranged on the earphone and is used for registering the target voice feature based on registered audio data input by the target user when the target control is triggered. When the target user triggers the target control, the earphone responds to the trigger operation on the target control and sends out voice prompt information, which prompts the target user to input the registered audio data. After hearing the voice prompt information, the target user inputs audio data whose duration is longer than a first target duration as the registered audio data for registering the voice feature. The earphone acquires the registered audio data input by the target user, determines the target voice feature based on the registered audio data, and stores it.
The target control is a physical control or a touch control; optionally, the target control is another type of control, which is not limited in the embodiment of the present application. When the target voice feature is determined based on the registered audio data, the earphone performs feature extraction on the registered audio data input by the target user through a feature extraction model to obtain the target voice feature of the target user. Optionally, the target voice feature includes sound intensity and sound intensity level, loudness, pitch period and pitch frequency, signal-to-noise ratio, and the like; the specific type of the target voice feature is not limited in the embodiment of the present application. The first target duration is any positive value, which is not limited in the embodiments of the present application.
It should be noted that the above-mentioned registration process of the target voice feature is described by taking as an example that the target voice feature is registered based on the trigger of the user on the target control, that is, if the user triggers the target control, the registered audio data is obtained through the above-mentioned process to determine the target voice feature. Optionally, the earphone automatically acquires, every second target duration, audio data that is input by the user and has a duration longer than the first target duration, as the registered audio data, and further determines the target voice feature based on the registered audio data.
When the target voice feature is stored, if a target voice feature is already stored in the earphone, either the stored target voice feature is deleted and the newly determined target voice feature is stored, or the newly determined target voice feature and the stored target voice feature are stored together; this is not limited in the embodiment of the present application.

Storing the newly extracted target voice feature together with the stored target voice features allows the earphone to subsequently perform voice separation based on multiple target voice features. When the earphone is frequently used by multiple people, it can perform voice separation based on the target voice features of those people, which expands the applicable range of the earphone and improves user experience.
The feature extraction model is an encoding-decoding (Encoder-Decoder) model. Referring to fig. 4, fig. 4 is a schematic structural diagram of a feature extraction model provided in an embodiment of the present application; the feature extraction model includes an Encoder network and a Decoder network, both of which are Recurrent Neural Networks (RNN). Optionally, the feature extraction model adopts another type of model, which is not limited in the embodiment of the present application. The feature extraction model is trained based on multiple loss functions (Loss), such as a multi-classification loss function, a Mean Square Error (MSE) loss function, and a Mean Absolute Error (MAE) loss function. The advantage of the multi-classification Loss is that the high-level features (i.e., voice features) extracted by a feature extraction model trained with it behave like voiceprint features, so the speaker to whom the input voice belongs can be distinguished through the voice features. The effect of the MSE Loss or MAE Loss is that, for a feature extraction model trained with MSE Loss or MAE Loss, the audio data output after the model processes the input audio data stays consistent with the input audio data, so the extracted voice features also retain the temporal characteristics of the audio data, and the obtained voice features give a good voice separation effect. Optionally, the loss functions include other types of loss functions, which are not limited in the embodiment of the present application.
The following description takes as an example a feature extraction model that is an Encoder-Decoder model whose loss functions are MAE Loss and multi-classification Loss. The training process of the feature extraction model includes: obtaining a training data set that includes a plurality of sample audio data; inputting the first sample audio data in the training data set into the Encoder network of an initial feature extraction model, and extracting the voice feature of the first sample audio data through the hidden layer of the Encoder network; inputting the obtained voice feature into the Decoder network of the initial feature extraction model, and outputting, through the hidden layer of the Decoder network, the model-processed audio data based on the voice feature of the first sample audio data; determining a loss function value of the initial feature extraction model, combining MAE Loss and multi-classification Loss, based on the first sample audio data, its model-processed audio data and the voice feature extracted from it; and adjusting the parameters of the initial feature extraction model by using the gradient to correct the network according to the loss function value, so as to obtain the feature extraction model after the first parameter adjustment. The second sample audio data in the training data set is then input into the Encoder network of the feature extraction model after the first parameter adjustment, and the same procedure is repeated: the voice feature of the second sample audio data is extracted through the hidden layer of the Encoder network, the obtained voice feature is input into the Decoder network, the model-processed audio data is output through the hidden layer of the Decoder network, the loss function value is determined based on the second sample audio data, its model-processed audio data and its extracted voice feature, combining MAE Loss and multi-classification Loss, and the parameters are further adjusted according to the loss function value by using the gradient to correct the network, so as to obtain the feature extraction model after the second parameter adjustment. By analogy, the sample audio data in the training data set are processed continuously until a feature extraction model meeting a target condition is obtained. The target condition is that the similarity between the audio data output by the model and the input sample audio data meets an iteration cut-off condition, or that the loss function value of the model meets an iteration cut-off condition, or that the number of iterations reaches a preset number; which condition is specifically adopted is not limited in the embodiment of the present application.
It should be noted that, after the feature extraction model is obtained by training on the training data set, only the Encoder network part of the feature extraction model is retained in actual use, as the feature extraction model used to obtain the target voice feature in this step. Using the Encoder network as the feature extraction model ensures that the high-level features it extracts, namely the target voice features, retain temporal ordering; using the extracted target voice features in the subsequent voice separation task then improves the accuracy of voice separation.
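The following PyTorch sketch illustrates, under assumptions, an Encoder-Decoder built from recurrent layers and trained with a combination of MAE Loss and a multi-classification Loss as described above; the layer sizes, the GRU cell type, the mel-spectrogram input and the Adam optimizer are illustrative choices, not the configuration fixed by the embodiment.

```python
import torch
import torch.nn as nn

class SpeakerFeatureExtractor(nn.Module):
    """Encoder-Decoder sketch; only the encoder is kept at inference time."""

    def __init__(self, n_mels=40, hidden=128, n_speakers=100):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.reconstruct = nn.Linear(hidden, n_mels)   # used by the MAE Loss
        self.classify = nn.Linear(hidden, n_speakers)  # used by the multi-classification Loss

    def forward(self, x):                      # x: (batch, frames, n_mels)
        voice_feat, _ = self.encoder(x)        # high-level, time-ordered voice feature
        decoded, _ = self.decoder(voice_feat)
        recon = self.reconstruct(decoded)      # model-processed audio features
        logits = self.classify(voice_feat.mean(dim=1))
        return voice_feat, recon, logits

model = SpeakerFeatureExtractor()
mae_loss, cls_loss = nn.L1Loss(), nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(sample, speaker_id):
    """One parameter adjustment: combined MAE + multi-classification loss."""
    _, recon, logits = model(sample)
    # Reconstruction keeps the timing information; classification makes the
    # extracted feature behave like a voiceprint feature.
    loss = mae_loss(recon, sample) + cls_loss(logits, speaker_id)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```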
304. The headphone extracts the convolution features of the fused features.
It should be noted that this step is implemented by a voice separation model. The voice separation model is a Convolutional Neural Network (CNN); optionally, the voice separation model is another type of model, which is not limited in the embodiment of the present application. The process of determining the mask of the fusion feature in steps 304 to 305 is described below by taking the voice separation model as a CNN as an example.
In one possible implementation, the earphone inputs the fusion feature into a voice separation model, and extracts the convolution feature of the fusion feature through the convolution layer of the voice separation model.
305. The earphone determines a mask corresponding to the fusion feature based on the convolution feature of the fusion feature.
In one possible implementation, the headphone inputs the convolution feature of the fusion feature into the pooling layer of the voice separation model, and the pooling layer outputs the mask (Mask) corresponding to the fusion feature.
Each numerical value in the mask is any value greater than or equal to 0 and less than or equal to 1, and the specific value is not limited in the embodiment of the application.
It should be noted that, the foregoing steps 304 to 305 are described by taking an example of determining a mask corresponding to a fusion feature through a speech separation model, and in a more possible implementation manner, the mask is determined in another manner, and the embodiment of the present application does not specifically limit which manner is used.
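A minimal sketch of a convolutional mask estimator in this spirit follows; the 1-D convolution layout and the sigmoid used to keep every mask value between 0 and 1 are assumptions, since the embodiment does not fix a specific architecture for the voice separation model.

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """CNN sketch that maps a fusion feature sequence to a mask in [0, 1]."""

    def __init__(self, feat_dim=257, channels=64):
        super().__init__()
        self.conv = nn.Sequential(                     # convolution layers
            nn.Conv1d(feat_dim, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.project = nn.Conv1d(channels, feat_dim, kernel_size=1)

    def forward(self, fusion_feat):                    # (batch, feat_dim, frames)
        conv_feat = self.conv(fusion_feat)             # convolution feature of the fusion feature
        # Sigmoid bounds each mask value to the range [0, 1].
        return torch.sigmoid(self.project(conv_feat))
```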
306. And the earphone fuses the mask corresponding to the fusion feature with the first audio data to obtain target audio data.
In a possible implementation manner, the headphone acquires the spectral feature of the first audio data determined in step 302, fuses the mask and the spectral feature to obtain a fused spectral feature, and generates the target audio data based on the fused spectral feature.
When the mask and the spectral feature are fused, the mask can be multiplied element-wise with the spectral feature, that is, each value in the mask is multiplied by the corresponding value in the spectral feature.
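Under the assumption that the spectral feature of the first audio data is a short-time Fourier transform and that the phase of the first audio data is reused for reconstruction (neither of which is fixed by the embodiment), the fusion of the mask with the spectral feature and the generation of the target audio data can be sketched as follows.

```python
import numpy as np

def apply_mask_and_reconstruct(first_audio, mask, frame_len=512, hop=256):
    """Element-wise masking of the STFT magnitude, then overlap-add synthesis.

    mask: array of shape (frame_len // 2 + 1, n_frames) with values in [0, 1].
    """
    window = np.hanning(frame_len)
    n_frames = mask.shape[1]
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    for t in range(n_frames):
        frame = first_audio[t * hop:t * hop + frame_len]
        if len(frame) < frame_len:                     # pad the last frame
            frame = np.pad(frame, (0, frame_len - len(frame)))
        frame = frame * window
        spec = np.fft.rfft(frame)
        # Fuse: multiply each spectral value by the corresponding mask value,
        # keeping the original phase (an assumed reconstruction choice).
        masked = np.abs(spec) * mask[:, t] * np.exp(1j * np.angle(spec))
        out[t * hop:t * hop + frame_len] += np.fft.irfft(masked, n=frame_len) * window
    return out
```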
It should be noted that the processes of steps 301 to 306 are shown in fig. 5. Fig. 5 is a schematic diagram of an audio data processing procedure provided in an embodiment of the present application, and the procedure includes: the earphone acquires first audio data and second audio data, performs feature extraction on the first audio data and the second audio data respectively to obtain the energy difference feature of the two, then performs feature fusion on the energy difference feature and the target voice feature registered in advance by the target user, inputs the fusion feature obtained by the feature fusion into the voice separation model to obtain a mask for separating the voice of the target user, and thereby separates the voice of the target user from the first audio data collected by the earphone. The target voice feature is obtained by collecting the registered audio data of the target user through the earphone and extracting the voice feature of the registered audio data through the feature extraction model.
In addition, it should be noted that the above steps 302 to 306 take as an example an earphone that includes only one master microphone assembly and one slave microphone assembly. In a further possible implementation, if the earphone includes more than one slave microphone assembly, the energy difference feature between the second audio data collected by each slave microphone assembly and the first audio data collected by the master microphone assembly is determined one by one, each energy difference feature is fused with the target voice feature, and a mask corresponding to each fusion feature is determined. Based on these masks, the mask finally used for processing the first audio data is determined, and the first audio data is processed with it to obtain the target audio data (see the sketch after this paragraph). By collecting multiple paths of second audio data and performing voice separation based on them, the accuracy of voice separation can be improved, the voice separation effect can be improved, and user experience can be further improved.
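The embodiment only states that the final mask is determined based on the per-slave-microphone masks, without fixing how; as one hedged illustration, the masks could simply be averaged.

```python
import numpy as np

def combine_masks(masks):
    """Assumed combination rule: average the masks obtained for each slave
    microphone assembly. The embodiment does not specify the combination."""
    return np.mean(np.stack(masks, axis=0), axis=0)
```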
According to the scheme provided by the embodiments of the application, when two paths of audio data are respectively obtained through the master microphone assembly and the slave microphone assembly located in the earphone, the energy difference feature of the two paths of audio data is determined and fused with the target voice feature, which is the voice feature of the target user; the target voice feature is thereby incorporated into the voice separation process, and voice separation is performed on the first audio data collected by the master microphone assembly of the earphone according to the fusion feature of the energy difference feature and the target voice feature, so that target audio data including only the voice of the target user is obtained and the voice separation effect is improved. In the scheme provided by the embodiments of the application, the target voice feature registered in advance by the target user is introduced in view of the relationship between the earphone and the target user; the target voice feature can guide the model to distinguish the voice components belonging to the target user in the input audio, and the earphone processing flow is optimized in combination with the target voice feature, so the voice separation performance can be significantly improved, only the voice of the target user is included in the separated voice, and the user experience of the target user is improved.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
Fig. 6 is a schematic structural diagram of an apparatus for processing audio data according to an embodiment of the present application, and referring to fig. 6, the apparatus includes:
an obtaining module 601, configured to obtain first audio data collected by a master microphone assembly of an earphone and second audio data collected by a slave microphone assembly of the earphone, where the master microphone assembly and the slave microphone assembly are both located in the earphone;
a determining module 602 for determining an energy difference characteristic of the first audio data and the second audio data;
a fusion module 603, configured to perform feature fusion on the energy difference feature and a target voice feature to obtain a fusion feature, where the target voice feature is a voice feature pre-registered by a target user;
the separation module 604 is configured to perform voice separation on the first audio data according to the fusion feature to obtain target audio data.
According to the device provided by the embodiments of the application, when two paths of audio data are respectively obtained through the master microphone assembly and the slave microphone assembly located in the earphone, the energy difference feature of the two paths of audio data is determined and fused with the target voice feature, which is the voice feature of the target user; the target voice feature is thereby incorporated into the voice separation process, and voice separation is performed on the first audio data collected by the master microphone assembly of the earphone according to the fusion feature of the energy difference feature and the target voice feature, so that target audio data including only the voice of the target user is obtained and the voice separation effect is improved.
In a possible implementation manner, the determining module 602 is configured to determine spectral characteristics of the first audio data and the second audio data, respectively; determining an energy spectral feature of the first audio data and an energy spectral feature of the second audio data based on the spectral feature of the first audio data and the spectral feature of the second audio data, respectively; energy difference features of the first audio data and the second audio data are determined based on the energy spectral features of the first audio data and the energy spectral features of the second audio data.
In a possible implementation manner, the fusion module 603 is configured to splice the energy difference feature and the target voice feature in an up-sampling manner to obtain the fusion feature; or splice the energy difference feature and the target voice feature to obtain the fusion feature; or perform feature fusion on the energy difference feature and the target voice feature by a canonical correlation analysis method to obtain the fusion feature.
In a possible implementation manner, the separation module 604 is configured to fuse the mask corresponding to the fusion feature with the first audio data to obtain the target audio data.
In one possible implementation, the apparatus further includes:
the extraction module is used for extracting the convolution characteristic of the fusion characteristic;
the determining module 602 is further configured to determine a mask corresponding to the fused feature based on the convolution feature of the fused feature.
In a possible implementation manner, the fusion module 603 is configured to obtain a spectral feature of the first audio data; fusing the mask and the frequency spectrum characteristic to obtain a fused frequency spectrum characteristic; and generating the target audio data based on the fused spectral features.
In a possible implementation manner, the target voice feature is obtained by performing feature extraction on registered audio data input by a target user through a feature extraction model, and the feature extraction model is obtained by training based on multiple loss functions.
In one possible implementation manner, a target control is arranged on the earphone, and the target control is used for registering the target voice characteristics based on the registered audio data input by the target user when the target control is triggered;
the registration process of the target voice feature comprises the following steps:
responding to the trigger operation of the target control, and sending voice prompt information, wherein the voice prompt information is used for prompting a target user to input the registered audio data;
acquiring registered audio data input by a target user;
based on the enrollment audio data, the target speech feature is determined.
It should be noted that: in the above embodiment, when the processing apparatus for audio data is used to process audio data collected by an earphone, only the division of the above functional modules is used as an example, and in practical applications, the above functions may be distributed by different functional modules according to needs, that is, the internal structure of the earphone is divided into different functional modules, so as to complete all or part of the above described functions. In addition, the processing apparatus for audio data and the processing method for audio data provided in the foregoing embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments and are not described herein again.
Fig. 7 is a schematic structural diagram of a headset 700 according to an embodiment of the present application, where the headset 700 may have a relatively large difference due to different configurations or performances, and may include one or more processors (CPUs) 701 and one or more memories 702, where at least one program code is stored in the one or more memories 702, and is loaded and executed by the one or more processors 701 to implement the methods provided by the above method embodiments. Certainly, the earphone 700 may further have a wired or wireless network interface, a keyboard, an input/output interface, and other components to facilitate input and output, and the earphone 700 may further include other components for implementing the device functions, which are not described herein.
In an exemplary embodiment, there is also provided a computer-readable storage medium, such as a memory, including program code executable by a processor to perform the method of processing audio data in the above-described embodiments. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or a computer program is also provided, which comprises computer program code stored in a computer-readable storage medium, which is read by a processor of a computer device from the computer-readable storage medium, the processor executing the computer program code to cause the computer device to perform the method steps of the method of processing audio data provided in the above-mentioned embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by hardware associated with program code, and the program may be stored in a computer readable storage medium, where the above mentioned storage medium may be a read-only memory, a magnetic or optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method of processing audio data, the method comprising:
acquiring first audio data collected by a main microphone assembly of an earphone and second audio data collected by a slave microphone assembly of the earphone, wherein the main microphone assembly and the slave microphone assembly are both located in the earphone;
determining an energy difference feature of the first audio data and the second audio data, the energy difference feature being used for indicating a difference between the energy of the first audio data and the energy of the second audio data;
performing feature fusion on the energy difference feature and a target voice feature to obtain a fusion feature, wherein the target voice feature is a voice feature registered by a target user in advance;
and performing voice separation on the first audio data according to the fusion feature to obtain target audio data.
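Purely as a non-limiting illustration of the method of claim 1, the following Python sketch shows the data flow under the assumptions that the two microphone signals are single-channel float arrays of equal length, that the target voice feature is a fixed-length embedding, and that an untrained linear projection stands in for the learned mask-estimation layers of claims 4 and 5.

    import numpy as np
    from scipy.signal import stft, istft

    def separate_target_speech(primary, secondary, target_feature, fs=16000):
        # Short-time spectra of the two microphone signals.
        _, _, P = stft(primary, fs=fs, nperseg=512)
        _, _, S = stft(secondary, fs=fs, nperseg=512)
        # Energy spectra and their per-bin difference (the energy difference feature).
        energy_diff = np.abs(P) ** 2 - np.abs(S) ** 2               # (freq, frames)
        # Feature fusion: tile the registered target voice feature over time
        # and splice it with the energy difference feature.
        tiled = np.repeat(target_feature[:, None], energy_diff.shape[1], axis=1)
        fused = np.concatenate([energy_diff, tiled], axis=0)        # (freq + D, frames)
        # Untrained stand-in for the network that maps the fusion feature to a mask.
        rng = np.random.default_rng(0)
        W = rng.standard_normal((P.shape[0], fused.shape[0])) * 0.01
        mask = 1.0 / (1.0 + np.exp(-W @ fused))                     # values in (0, 1)
        # Voice separation: apply the mask to the first audio data and reconstruct.
        _, target_audio = istft(P * mask, fs=fs, nperseg=512)
        return target_audio

Calling separate_target_speech(primary, secondary, embedding) with two equal-length recordings and a registered embedding returns a time-domain estimate of the target speaker's speech; in a real system the random projection would be replaced by the trained model of the embodiments.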
2. The method of claim 1, wherein the determining an energy difference feature of the first audio data and the second audio data comprises:
determining a spectral feature of the first audio data and a spectral feature of the second audio data, respectively;
determining an energy spectral feature of the first audio data and an energy spectral feature of the second audio data based on the spectral feature of the first audio data and the spectral feature of the second audio data, respectively;
determining the energy difference feature of the first audio data and the second audio data based on the energy spectral feature of the first audio data and the energy spectral feature of the second audio data.
3. The method according to claim 1, wherein the performing feature fusion on the energy difference feature and the target voice feature to obtain the fusion feature comprises:
splicing the energy difference feature and the target voice feature in an up-sampling manner to obtain the fusion feature; or,
splicing the energy difference feature and the target voice feature to obtain the fusion feature; or,
performing feature fusion on the energy difference feature and the target voice feature through a canonical correlation analysis method to obtain the fusion feature.
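As one hypothetical reading of the three alternatives in claim 3, the sketches below splice the features directly, splice them after up-sampling a coarser target-feature sequence to the frame rate of the energy difference feature, or fuse them with canonical correlation analysis; the exact up-sampling scheme and the use of scikit-learn's CCA are assumptions for illustration.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    def fuse_by_concat(energy_diff, target_feature):
        # Direct splicing: tile the enrollment embedding over time and stack
        # it under the energy difference feature.
        tiled = np.repeat(target_feature[:, None], energy_diff.shape[1], axis=1)
        return np.concatenate([energy_diff, tiled], axis=0)

    def fuse_by_upsampling(energy_diff, target_feature_seq):
        # Splicing in an up-sampling manner: linearly interpolate a coarser
        # target-feature sequence (D, T_low) up to the frame count of the
        # energy difference feature before concatenating.
        d, t_low = target_feature_seq.shape
        t_high = energy_diff.shape[1]
        grid_low = np.linspace(0.0, 1.0, t_low)
        grid_high = np.linspace(0.0, 1.0, t_high)
        upsampled = np.stack([np.interp(grid_high, grid_low, target_feature_seq[i])
                              for i in range(d)])
        return np.concatenate([energy_diff, upsampled], axis=0)

    def fuse_by_cca(energy_diff_frames, target_feature_frames, n_components=8):
        # Canonical correlation analysis over per-frame features; both inputs
        # are (frames, feature_dim) matrices with the same number of frames.
        cca = CCA(n_components=n_components)
        x_c, y_c = cca.fit_transform(energy_diff_frames, target_feature_frames)
        return np.concatenate([x_c, y_c], axis=1)   # (frames, 2 * n_components)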
4. The method of claim 1, wherein the performing voice separation on the first audio data according to the fusion feature to obtain target audio data comprises:
and fusing the mask corresponding to the fusion feature with the first audio data to obtain target audio data.
5. The method according to claim 4, wherein before the fusing the mask corresponding to the fusion feature with the first audio data to obtain the target audio data, the method further comprises:
extracting a convolution feature of the fusion feature;
and determining the mask corresponding to the fusion feature based on the convolution feature of the fusion feature.
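One way to picture claims 4 and 5, under the assumption of an untrained one-dimensional convolution along the frame axis standing in for the learned layers, is the following sketch; the kernel size, weight initialisation, and STFT parameters are illustrative choices only.

    import numpy as np
    from scipy.signal import stft, istft

    def mask_from_fused(fused, n_freq=257, kernel=3, seed=0):
        # Convolution feature of the fusion feature: a 1-D convolution along
        # the frame axis with an untrained kernel, followed by a sigmoid that
        # yields the time-frequency mask.
        rng = np.random.default_rng(seed)
        w = rng.standard_normal((n_freq, fused.shape[0], kernel)) * 0.01
        padded = np.pad(fused, ((0, 0), (kernel // 2, kernel // 2)))
        conv = np.stack([np.einsum('ck,ick->i', padded[:, t:t + kernel], w)
                         for t in range(fused.shape[1])], axis=1)
        return 1.0 / (1.0 + np.exp(-conv))          # mask values in (0, 1)

    def apply_mask(primary, mask, fs=16000):
        # Fuse the mask with the first audio data in the time-frequency domain
        # and reconstruct the target audio data.
        _, _, P = stft(primary, fs=fs, nperseg=512)
        _, target_audio = istft(P * mask, fs=fs, nperseg=512)   # shapes assumed to match
        return target_audio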
6. The method of claim 1, wherein the target voice feature is obtained by performing feature extraction on the registered audio data input by the target user through a feature extraction model, and the feature extraction model is trained based on a plurality of loss functions.
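Claim 6 only states that the feature extraction model is trained with a plurality of loss functions; one hypothetical combination, shown below with PyTorch, adds a speaker-classification cross-entropy term to a similarity term that pulls embeddings of the same speaker together. The loss weighting and the specific terms are assumptions, not the training objective actually used in this application.

    import torch
    import torch.nn.functional as F

    def combined_loss(embeddings, logits, labels, weight=0.5):
        # Loss 1: speaker-classification cross-entropy on the extractor's logits.
        ce = F.cross_entropy(logits, labels)
        # Loss 2: pull embeddings of the same speaker towards each other.
        normed = F.normalize(embeddings, dim=1)
        sim = normed @ normed.t()                              # cosine similarities
        same = labels[:, None].eq(labels[None, :]).float()
        pull = ((1.0 - sim) * same).sum() / same.sum()
        return ce + weight * pull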
7. The method of claim 1, wherein a target control is provided on the earphone, and the target control is used for registering the target voice feature based on registered audio data input by the target user when triggered;
the registration process of the target voice feature comprises the following steps:
in response to a trigger operation on the target control, playing voice prompt information, wherein the voice prompt information is used for prompting the target user to input the registered audio data;
acquiring the registered audio data input by the target user;
determining the target voice feature based on the registered audio data.
8. An apparatus for processing audio data, the apparatus comprising:
an acquisition module for acquiring first audio data collected by a main microphone assembly of an earphone and second audio data collected by a slave microphone assembly of the earphone, wherein the main microphone assembly and the slave microphone assembly are both located in the earphone;
a determining module for determining an energy difference feature of the first audio data and the second audio data, the energy difference feature being used for indicating a difference between the energy of the first audio data and the energy of the second audio data;
a fusion module for performing feature fusion on the energy difference feature and a target voice feature to obtain a fusion feature, wherein the target voice feature is a voice feature registered in advance by a target user;
and a separation module for performing voice separation on the first audio data according to the fusion feature to obtain target audio data.
9. An earphone, characterized in that the earphone comprises one or more processors and one or more memories, wherein at least one piece of program code is stored in the one or more memories, and the program code is loaded and executed by the one or more processors to implement the operations performed by the method for processing audio data according to any one of claims 1 to 7.
10. A computer-readable storage medium having at least one program code stored therein, the program code being loaded and executed by a processor to implement the operations performed by the method for processing audio data according to any one of claims 1 to 7.
CN202011398525.7A 2020-12-03 2020-12-03 Audio data processing method and device, earphone and computer readable storage medium Active CN112242149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011398525.7A CN112242149B (en) 2020-12-03 2020-12-03 Audio data processing method and device, earphone and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011398525.7A CN112242149B (en) 2020-12-03 2020-12-03 Audio data processing method and device, earphone and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112242149A CN112242149A (en) 2021-01-19
CN112242149B (en) 2021-03-26

Family

ID=74175465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011398525.7A Active CN112242149B (en) 2020-12-03 2020-12-03 Audio data processing method and device, earphone and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112242149B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112820300B (en) * 2021-02-25 2023-12-19 北京小米松果电子有限公司 Audio processing method and device, terminal and storage medium
CN113194372B (en) * 2021-04-27 2022-11-15 歌尔股份有限公司 Earphone control method and device and related components
CN113113041B (en) * 2021-04-29 2022-10-11 电子科技大学 Voice separation method based on time-frequency cross-domain feature selection
CN113539291A (en) * 2021-07-09 2021-10-22 北京声智科技有限公司 Method and device for reducing noise of audio signal, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103269465A (en) * 2013-05-22 2013-08-28 歌尔声学股份有限公司 Headset communication method under loud-noise environment and headset
US20170084292A1 (en) * 2015-09-23 2017-03-23 Samsung Electronics Co., Ltd. Electronic device and method capable of voice recognition
CN108986838A (en) * 2018-09-18 2018-12-11 东北大学 A kind of adaptive voice separation method based on auditory localization
CN109410978A (en) * 2018-11-06 2019-03-01 北京智能管家科技有限公司 A kind of speech signal separation method, apparatus, electronic equipment and storage medium
CN111445920A (en) * 2020-03-19 2020-07-24 西安声联科技有限公司 Multi-sound-source voice signal real-time separation method and device and sound pick-up


Also Published As

Publication number Publication date
CN112242149A (en) 2021-01-19

Similar Documents

Publication Publication Date Title
CN112242149B (en) Audio data processing method and device, earphone and computer readable storage medium
US11023690B2 (en) Customized output to optimize for user preference in a distributed system
US10923137B2 (en) Speech enhancement and audio event detection for an environment with non-stationary noise
Stern et al. Hearing is believing: Biologically inspired methods for robust automatic speech recognition
US20170061978A1 (en) Real-time method for implementing deep neural network based speech separation
US10825353B2 (en) Device for enhancement of language processing in autism spectrum disorders through modifying the auditory stream including an acoustic stimulus to reduce an acoustic detail characteristic while preserving a lexicality of the acoustics stimulus
Yüncü et al. Automatic speech emotion recognition using auditory models with binary decision tree and svm
KR101616112B1 (en) Speaker separation system and method using voice feature vectors
CN109308900B (en) Earphone device, voice processing system and voice processing method
CN112863538A (en) Audio-visual network-based multi-modal voice separation method and device
CN111261145B (en) Voice processing device, equipment and training method thereof
KR102062454B1 (en) Music genre classification apparatus and method
CN113921026A (en) Speech enhancement method and device
Poorjam et al. A parametric approach for classification of distortions in pathological voices
CN113205803A (en) Voice recognition method and device with adaptive noise reduction capability
CN113035225B (en) Visual voiceprint assisted voice separation method and device
US11551707B2 (en) Speech processing method, information device, and computer program product
CN115050372A (en) Audio segment clustering method and device, electronic equipment and medium
Kundegorski et al. Two-Microphone dereverberation for automatic speech recognition of Polish
JP2019053180A (en) Audio processing device, voice recognition device, audio processing method, voice recognition method, audio processing program and voice recognition program
CN112118511A (en) Earphone noise reduction method and device, earphone and computer readable storage medium
Imoto et al. Acoustic scene classification using asynchronous multichannel observations with different lengths
CN110738990A (en) Method and device for recognizing voice
CN117238311B (en) Speech separation enhancement method and system in multi-sound source and noise environment
Jokic et al. Towards enabling measurement of similarity of acoustical environments using mobile devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant