CN116959470A - Audio extraction method, device, equipment and storage medium - Google Patents

Audio extraction method, device, equipment and storage medium

Info

Publication number
CN116959470A
Authority
CN
China
Prior art keywords: frequency bands, frequency, feature, features, sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311045708.4A
Other languages
Chinese (zh)
Inventor
顾容之
罗艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311045708.4A
Publication of CN116959470A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 - Processing in the frequency domain
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band

Abstract

The application discloses an audio extraction method, device, equipment and storage medium, belonging to the technical field of audio analysis. The method comprises the following steps: acquiring time-frequency features of a plurality of input audios; determining an angle distribution feature according to the time-frequency features of the plurality of input audios; dividing the time-frequency feature of a first input audio in the frequency domain dimension according to K frequency bands to obtain time-frequency sub-features corresponding to the K frequency bands; dividing the angle distribution feature in the frequency domain dimension according to the K frequency bands to obtain angle distribution sub-features corresponding to the K frequency bands; performing feature extraction on the time-frequency sub-features corresponding to the K frequency bands and the angle distribution sub-features corresponding to the K frequency bands; and extracting output audio of the first input audio within a specified angle range according to the feature extraction results corresponding to the K frequency bands. Because the application performs frequency band segmentation on both the time-frequency feature and the angle distribution feature, the output audio obtained by sound zone extraction can be analyzed independently for different frequency bands, which improves the performance of sound zone extraction.

Description

Audio extraction method, device, equipment and storage medium
Technical Field
The present application relates to the field of audio analysis technologies, and in particular, to an audio extraction method, apparatus, device, and storage medium.
Background
In some scenarios, for speech collected by a microphone array, speech needs to be extracted by sound zone.
In the related art, the audio collected in a designated sound zone (a designated angle range) of a user can be extracted from the audio collected by a microphone array through a sound zone extraction model, so that subsequent processing, such as speech enhancement, can be performed on the extracted audio of the designated sound zone.
The sound zone extraction model in the related art generally constructs the features used for sound zone extraction by analyzing only the time-frequency features of the audio, so the accuracy of sound zone extraction is low.
Disclosure of Invention
The application provides an audio extraction method, an audio extraction device, audio extraction equipment and a storage medium, which can improve the performance of audio extraction. The technical scheme is as follows:
according to an aspect of the present application, there is provided an audio extraction method, the method comprising:
acquiring time-frequency characteristics of a plurality of input audios, wherein each input audio in the plurality of input audios is acquired through one sound sensor in a sound sensor array, and the plurality of input audios comprise a first input audio;
determining an angle distribution feature according to the time-frequency features of the plurality of input audios, wherein the angle distribution feature is used for representing the proportion, in each input audio, of the audio from N angle directions within a specified angle range, and N is a positive integer;
Dividing the time-frequency characteristic of the first input audio in a frequency domain dimension according to K frequency bands to obtain time-frequency sub-characteristics corresponding to the K frequency bands, wherein K is a positive integer greater than 1; dividing the angle distribution characteristics in the frequency domain dimension according to the K frequency bands to obtain angle distribution sub-characteristics corresponding to the K frequency bands;
performing feature extraction on the time-frequency sub-features corresponding to the K frequency bands and the angle distribution sub-features corresponding to the K frequency bands to obtain feature extraction results corresponding to the K frequency bands;
and extracting output audio of the first input audio within the specified angle range according to the feature extraction results corresponding to the K frequency bands.
According to another aspect of the present application, there is provided an audio extraction apparatus, the apparatus comprising:
a feature extraction module, configured to acquire time-frequency features of a plurality of input audios, wherein each input audio in the plurality of input audios is acquired through one sound sensor in a sound sensor array, and the plurality of input audios comprise a first input audio;
the feature extraction module is further configured to determine an angle distribution feature according to the time-frequency features of the plurality of input audios, where the angle distribution feature is used to characterize the audio of N angular directions of each input audio within a specified angle range, and N is a positive integer;
a frequency band dividing module, configured to divide the time-frequency feature of the first input audio in the frequency domain dimension according to K frequency bands to obtain time-frequency sub-features corresponding to the K frequency bands, wherein K is a positive integer greater than 1; and divide the angle distribution feature in the frequency domain dimension according to the K frequency bands to obtain angle distribution sub-features corresponding to the K frequency bands;
the feature modeling module is used for carrying out feature extraction on the time-frequency sub-features corresponding to the K frequency bands and the angle distribution sub-features corresponding to the K frequency bands to obtain feature extraction results corresponding to the K frequency bands;
and the mask estimation module is used for extracting output audio of the first input audio within the specified angle range according to the feature extraction results corresponding to the K frequency bands.
In an alternative design, the angular distribution features include a first distribution feature and a second distribution feature, where the first distribution feature includes a first phase difference of input audio corresponding to each two sound sensors, and the second distribution feature is used to reflect similarity between the first phase difference and a second phase difference, where the second phase difference is a phase difference of sampling pulse signals of N angular directions in the specified angular range for each two sound sensors; the frequency band dividing module includes:
The first dividing sub-module is used for dividing the first distribution characteristics in the frequency domain dimension according to the K frequency bands to obtain first distribution sub-characteristics corresponding to the K frequency bands;
the second dividing sub-module is used for dividing the second distribution characteristics in the frequency domain dimension according to the K frequency bands to obtain similarity distribution sub-characteristics corresponding to the K frequency bands, each frequency band in the K frequency bands is provided with N similarity distribution sub-characteristics, and the N similarity distribution sub-characteristics are in one-to-one correspondence with the N angle directions;
an integration sub-module, configured to perform feature integration on the N similarity distribution sub-features corresponding to each of the K frequency bands, so as to obtain second distribution sub-features corresponding to the K frequency bands;
the frequency band dividing module is further configured to determine a first distribution sub-feature corresponding to the K frequency bands and a second distribution sub-feature corresponding to the K frequency bands as angle distribution sub-features corresponding to the K frequency bands.
In an alternative design, the integration sub-module is configured to:
and splicing the N similarity distribution sub-features corresponding to each frequency band in the K frequency bands to obtain second distribution sub-features corresponding to the K frequency bands.
In an alternative design, the integration sub-module is configured to:
performing linear or nonlinear feature transformation on the N similarity distribution sub-features corresponding to each of the K frequency bands to obtain N feature transformation results corresponding to each of the K frequency bands;
and splicing N characteristic transformation results corresponding to each frequency band in the K frequency bands to obtain second distribution sub-characteristics corresponding to the K frequency bands.
In an alternative design, the integration sub-module is configured to:
performing linear or nonlinear feature transformation on the N similarity distribution sub-features corresponding to each of the K frequency bands to obtain N feature transformation results corresponding to each of the K frequency bands;
and determining average characteristics of N characteristic transformation results corresponding to each frequency band in the K frequency bands to obtain second distribution sub-characteristics corresponding to the K frequency bands.
In an alternative design, the integration sub-module is configured to:
extracting cascade features from the N similarity distribution sub-features corresponding to each of the K frequency bands according to the sequence of the angle directions to obtain second distribution sub-features corresponding to the K frequency bands;
The cascade feature extraction comprises feature extraction of a cascade feature extraction result of the previous i level and similarity distribution sub-features of the (i+1) th level, so that a cascade feature extraction result of the previous i+1 level is obtained, and i is a positive integer.
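As a non-authoritative illustration of the integration variants described above (direct splicing, and linear transformation followed by averaging), the following minimal numpy sketch uses hypothetical shapes and parameters that are not specified in the application:

```python
import numpy as np

def integrate_concat(sub_feats):
    # Variant: directly splice the N similarity distribution sub-features of one frequency band.
    return np.concatenate(sub_feats, axis=-1)

def integrate_transform_average(sub_feats, weight, bias):
    # Variant: apply a (shared, illustrative) linear transformation to each of the N
    # sub-features, then average the N transformation results.
    transformed = [f @ weight + bias for f in sub_feats]
    return np.mean(transformed, axis=0)

# N = 5 similarity distribution sub-features for one of the K frequency bands,
# each of hypothetical shape (frames, bins_in_band).
sub_feats = [np.random.randn(100, 32) for _ in range(5)]
weight, bias = np.random.randn(32, 32), np.zeros(32)

second_sub_concat = integrate_concat(sub_feats)                        # shape (100, 160)
second_sub_avg = integrate_transform_average(sub_feats, weight, bias)  # shape (100, 32)
```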
In an alternative design, the frequency band dividing module further includes:
the first mapping sub-module is used for mapping the time-frequency sub-feature corresponding to each frequency band in the K frequency bands to a specified dimension to obtain a first sequence feature corresponding to the K frequency bands; the second mapping sub-module is used for mapping the first distribution sub-feature corresponding to each frequency band in the K frequency bands to the appointed dimension to obtain a second sequence feature corresponding to the K frequency bands; a third mapping sub-module, configured to map the second distribution sub-feature corresponding to each of the K frequency bands to the specified dimension, to obtain a third sequence feature corresponding to the K frequency bands;
the merging sub-module is used for splicing the first sequence features corresponding to the K frequency bands, the second sequence features corresponding to the K frequency bands and the third sequence features corresponding to the K frequency bands to obtain splicing features corresponding to the K frequency bands;
and the feature modeling module is used for carrying out feature extraction on the spliced features corresponding to the K frequency bands to obtain feature extraction results corresponding to the K frequency bands.
In an alternative design, the feature modeling module includes:
the first modeling module is used for modeling the splicing features corresponding to the K frequency bands along a first dimension to obtain first modeling sequence features corresponding to the K frequency bands, wherein the first modeling sequence features carry correlations among features of the splicing features at different positions of the first dimension;
the second modeling sub-module is used for modeling the first modeling sequence features corresponding to the K frequency bands along a second dimension to obtain second modeling sequence features corresponding to the K frequency bands, wherein the second modeling sequence features carry correlations among the features of the first modeling sequence features at different positions of the second dimension;
the feature modeling module is configured to determine features of the second modeling sequences corresponding to the K frequency bands as feature extraction results corresponding to the K frequency bands;
the first dimension is a time dimension, and the second dimension is a frequency band dimension; or, the first dimension is the frequency band dimension, and the second dimension is the time dimension.
In an alternative design, the mask estimation module is configured to:
Predicting masks corresponding to the K frequency bands through second modeling sequence features corresponding to the K frequency bands, wherein the mask corresponding to each frequency band in the K frequency bands is used for indicating the duty ratio of the output audio at different time frequency positions of the first input audio on each frequency band;
combining masks corresponding to each frequency band in the K frequency bands to obtain combined masks;
and determining the output audio according to the first input audio and the merging mask.
In an alternative design, the feature extraction module is configured to:
determining one or more pairs of sound sensors in the sound sensor array, the pairs of sound sensors comprising a first sound sensor and a second sound sensor;
determining the first phase difference according to the time-frequency characteristic corresponding to the first sound sensor and the time-frequency characteristic corresponding to the second sound sensor;
determining the second phase difference of the first sound sensor and the second sound sensor for sampling pulse signals in N angle directions in the specified angle range;
determining a similarity of the first phase difference and the second phase difference;
and determining the second distribution characteristic according to the similarity.
In an alternative design, the feature extraction module is configured to:
summing M second distribution characteristics corresponding to each of the N angular directions;
wherein M represents the number of the sound sensor pairs, and M is a positive integer.
According to another aspect of the present application, there is provided a computer device comprising a processor and a memory, the memory having stored therein at least one program that is loaded and executed by the processor to implement the audio extraction method as described in the above aspect.
According to another aspect of the present application, there is provided a computer-readable storage medium having stored therein at least one program loaded and executed by a processor to implement the audio extraction method as described in the above aspect.
According to another aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the audio extraction method provided in various alternative implementations of the above aspects.
The technical scheme provided by the application has the beneficial effects that at least:
By constructing the angle distribution feature used for sound zone extraction based on the time-frequency features of the input audio, and performing frequency band segmentation on both the time-frequency feature and the angle distribution feature, the output audio obtained by sound zone extraction can be analyzed independently for different frequency bands, thereby improving the performance of sound zone extraction.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of a computer system provided in accordance with an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of a process for soundfield extraction provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of an audio extraction method according to an exemplary embodiment of the present application;
FIG. 4 is a flow chart of an audio extraction method according to an exemplary embodiment of the present application;
FIG. 5 is a schematic illustration of a specified angular range provided by an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of a first network provided by an exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of a process for feature integration provided by an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of a second network provided by an exemplary embodiment of the present application;
FIG. 9 is a schematic diagram of a third network provided by an exemplary embodiment of the present application;
FIG. 10 is a schematic diagram of a fourth network provided by an exemplary embodiment of the present application;
FIG. 11 is a flow chart of a model training method provided by an exemplary embodiment of the present application;
fig. 12 is a schematic structural view of an audio extraction apparatus according to an exemplary embodiment of the present application;
fig. 13 is a schematic structural view of a computer device according to an exemplary embodiment of the present application.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
First, the related terms related to the present application will be described:
feature modeling: the pointer analyzes a certain dimension of the feature and sequences formed by the feature in the dimension to obtain correlations among the features at different positions of the feature in the dimension. The dimensions of the features do not change before and after feature modeling, but the modeling results can carry correlations between features at different locations of the features in the dimension used for modeling.
Feature integration: refers to a process of integrating multiple features of the same dimension into one feature.
Mask (mask): refers to occluding the image being processed (fully or partially) with a selected or predicted image, graphic or object, so as to control the region or process of image processing. In the field of audio processing, a mask is used to occlude the time-frequency feature (spectrogram) of the input audio, thereby obtaining the spectrogram of the predicted output audio, from which the predicted output audio can be obtained.
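Purely as an illustration of this term (with hypothetical shapes, not taken from the application), a mask can be applied to a spectrogram by element-wise multiplication:

```python
import numpy as np

frames, bins = 100, 257
mixture_spec = np.random.randn(frames, bins) + 1j * np.random.randn(frames, bins)  # input spectrogram
mask = np.random.rand(frames, bins)            # predicted values in [0, 1] per time-frequency point
estimated_spec = mask * mixture_spec           # spectrogram of the predicted output audio
```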
Artificial intelligence (Artificial Intelligence, AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, pre-training models, operation/interaction systems and mechatronics. A pre-training model, also called a large model or a foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing and machine learning/deep learning.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration. The pre-training model is the latest development of deep learning and integrates the above techniques. In some embodiments, the methods provided by the present application may be implemented as a machine learning model.
FIG. 1 is a block diagram of a computer system provided in an exemplary embodiment of the application. The computer system includes a terminal 110, a server 120, and a communication network 130 between the terminal 110 and the server 120.
In some embodiments, the terminal 110 is configured to send a plurality of input audio to the server 120, where each of the plurality of input audio is acquired by one of the sound sensors in the sound sensor array 140. In some embodiments, an application program having an audio acquisition function is installed in the terminal 110 to acquire the plurality of input audio. In some embodiments, the terminal 110 is connected to the sound sensor array 140 either by wire or wirelessly so that a plurality of input audio captured by the sound sensor array 140 may be obtained. In some embodiments, the terminal 110 is a management end of the sound sensor array 140, so that a plurality of input audio collected by the sound sensor array 140 can be obtained. In some embodiments, the sound sensor array 140 refers to a microphone array.
The audio extraction method provided in the embodiment of the present application may be implemented by the terminal 110 alone, or may be implemented by the server 120 alone, or may be implemented by the terminal 110 and the server 120 through data interaction, which is not limited in the embodiment of the present application. In this embodiment, after a plurality of input audio is acquired by the terminal 110 through an application program having an audio acquisition function, the acquired plurality of input audio is sent to the server 120, and illustratively, audio extraction of the input audio by the server 120 is described as an example.
Optionally, after receiving the plurality of input audios sent by the terminal 110, the server 120 inputs the plurality of input audios into the sound zone extraction model 121. The sound zone extraction model 121 includes a first network, a second network, a third network, and a fourth network. The first network is used for constructing the features used for sound zone extraction, the second network is used for segmenting the constructed features according to frequency bands, the third network is used for modeling the segmented features, and the fourth network is used for estimating the output audio according to the result of feature modeling. For example, the sound zone extraction model 121 is used to extract audio within a specified angle range 141, whereas audio extraction is not performed for the range 142 outside the specified angle range 141. That audio belongs to a certain angle range means that the sound source corresponding to the audio belongs to that angle range.
In the process of performing sound zone extraction on a first input audio among the plurality of input audios, the first network can extract time-frequency features of the plurality of input audios and generate an angle distribution feature according to the time-frequency features, where the angle distribution feature is used for representing the proportion, in the input audio, of the audio from N angle directions within the specified angle range. The second network and the third network can then segment the time-frequency feature of the first input audio and the angle distribution feature according to frequency bands and perform feature extraction/modeling respectively. The fourth network then estimates, according to the modeling result of the sequence and frequency band modeling network (the third network), a mask indicating the duty ratio of the output audio of the first input audio within the specified angle range at different time-frequency positions of the first input audio. The output audio is then determined from the mask in combination with the first input audio, thereby completing sound zone extraction. The above is just one exemplary construction of the sound zone extraction model 121.
Alternatively, the server 120 transmits the output audio to the terminal 110, and the output audio is received, played, displayed, etc. by the terminal 110.
It should be noted that the above-mentioned terminals include, but are not limited to, mobile terminals such as mobile phones, tablet computers, portable laptop computers, intelligent voice interaction devices, intelligent home appliances, vehicle-mounted terminals, and the like, and may also be implemented as desktop computers and the like; the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), basic cloud computing services such as big data and artificial intelligence platforms.
Cloud technology (Cloud technology) refers to a hosting technology that unifies serial resources such as hardware, application programs, networks and the like in a wide area network or a local area network to realize calculation, storage, processing and sharing of data. The cloud technology is based on the general names of network technology, information technology, integration technology, management platform technology, application technology and the like applied by the cloud computing business mode, can form a resource pool, and is flexible and convenient as required.
In some embodiments, the servers described above may also be implemented as nodes in a blockchain system.
Fig. 2 is a schematic diagram of a sound zone extraction process according to an exemplary embodiment of the present application. As shown in fig. 2, the computer device acquires a plurality of input audios 201, where the plurality of input audios 201 are acquired by the sound sensor array, and the plurality of input audios 201 include a first input audio, the first input audio being the input audio, among the plurality of input audios 201, on which sound zone extraction is performed.
The computer device inputs the plurality of input audios 201 into a first network 202 (feature extraction network) of the sound zone extraction model. The first network 202 generates time-frequency features corresponding to the plurality of input audios 201, and generates angle distribution features according to the time-frequency features corresponding to the plurality of input audios 201, where the angle distribution features include a first distribution feature and a second distribution feature. The first distribution feature includes a first phase difference of the input audios corresponding to each two sound sensors, and the second distribution feature is used to reflect the similarity between the first phase difference and a second phase difference, where the second phase difference is the phase difference with which each two sound sensors sample pulse signals from N angle directions within the specified angle range. Optionally, the specified angle range and N are set manually. The first phase difference is determined according to the time-frequency features of the input audio actually collected by the two sound sensors; for example, the first phase difference is calculated from the phase spectrum decomposed from the spectrogram (time-frequency feature). The second phase difference is used to estimate the phase difference with which the two sound sensors sample the same pulse signal. For example, with continued reference to fig. 1, assuming that the pulse signal is located in the specified angle range 141, the second phase difference corresponding to the two sound sensors on the left side is the phase difference with which the pulse signal in the specified angle range 141 is sampled. The second phase difference is affected by the position and frequency of the pulse signal and the spacing of the two sound sensors. That is, the second phase difference can be estimated from the assumed position and frequency of the pulse signal (which can be regarded as a sound source) and the spacing of the two sound sensors.
The computer device inputs the time-frequency feature of the first input audio, the first distribution feature and the second distribution feature into a second network 203 (frequency band segmentation and subband modeling network) of the sound zone extraction model. The second network segments each type of feature in the frequency domain dimension according to K frequency bands, so as to obtain time-frequency sub-features corresponding to the K frequency bands, first distribution sub-features corresponding to the K frequency bands, and second distribution sub-features corresponding to the K frequency bands, where K is a positive integer greater than 1. The second network 203 then maps all the sub-features to a specified dimension and splices them, so as to obtain splice features corresponding to the K frequency bands.
The computer device inputs the splice features into a third network 204 of the sound zone extraction model to model the splice features corresponding to the K frequency bands along a first dimension to obtain first modeling sequence features corresponding to the K frequency bands, and then models the first modeling sequence features corresponding to the K frequency bands along a second dimension to obtain second modeling sequence features corresponding to the K frequency bands. The first dimension is a time dimension and the second dimension is a frequency band dimension; or, the first dimension is the frequency band dimension and the second dimension is the time dimension.
The computer device inputs the second modeling sequence features corresponding to the K frequency bands output by the third network 204 into a fourth network 205 (mask estimation network) of the sound zone extraction model. The fourth network 205 predicts masks corresponding to the K frequency bands, where the mask corresponding to each of the K frequency bands is used to indicate the duty ratio of the output audio 206 at different time-frequency positions of the first input audio on that frequency band. The fourth network 205 merges the masks corresponding to each of the K frequency bands to obtain a merged mask, so that the output audio 206 of the first input audio within the specified angle range is determined from the first input audio and the merged mask.
By setting the specified angle range, the corresponding angle distribution feature can be generated. The sound zone extraction model can perform sound zone extraction on the first input audio according to the angle distribution feature, so as to obtain the output audio of the first input audio within the specified angle range. By adjusting the specified angle range, the sound zone that the model supports extracting can be dynamically adjusted, which improves the flexibility of sound zone extraction by the model. Further, since frequency band division is performed, different frequency bands can be analyzed individually, which can improve the performance of sound zone extraction.
Fig. 3 is a flow chart of an audio extraction method according to an exemplary embodiment of the present application. The method may be used with a computer device or a client on a computer device. As shown in fig. 3, the method includes:
step 302: a time-frequency characteristic of a plurality of input audio is acquired.
Each of the plurality of input audio is acquired by one of the sound sensors in the sound sensor array. For example, the plurality of input audio signals corresponds one-to-one to the sound sensors in the sound sensor array. In some embodiments, the input audio is obtained by sampling speech.
The sound sensor array comprises a plurality of sound sensors, and the sound sensor array is arranged according to a certain arrangement mode. In some embodiments, the sound sensor is a microphone and the sound sensor array is a microphone array.
The plurality of input audios may be referred to as multi-channel audio, with each channel corresponding to one sound sensor. The number of channels of audio refers to the number of sound channels, typically mono and dual channel (e.g., the left and right channels of stereo). Because of their spatial positions, a person's left and right ears hear a sound at slightly different times, and dual-channel playback simulates this situation to create the sense of sound arriving from different directions. Audio is used to indicate data carrying sound information, such as a piece of music or a piece of speech.
Illustratively, with continued reference to fig. 1, the sound sensor array is a microphone array. The microphone array comprises 3 microphones, arranged as shown. The audio collected by the microphone array is three-channel audio.
Each input audio has its corresponding time-frequency characteristic, which is used to reflect the characteristics of the input audio in the time domain dimension and the frequency domain dimension, and is the characteristics obtained by extracting the characteristics of the input audio from the time domain dimension and the frequency domain dimension. Illustratively, the time domain dimension is the dimension in which the time scale is used to record the change in time of the input audio; the frequency domain dimension is used to describe the dimension of the input audio in terms of frequency characteristics.
Illustratively, after the input audio is analyzed along the time domain dimension, a time domain feature is obtained, and the time domain feature cannot provide oscillation information of the input audio in the frequency domain dimension; after the input audio is analyzed along the frequency domain dimension, frequency domain features are obtained, and the frequency domain features cannot provide information of frequency spectrum signals in the input audio along with time variation. Therefore, a dimension analysis method of the time domain dimension and the frequency domain dimension is comprehensively adopted, and comprehensive analysis is carried out on the input audio along the time domain dimension and the frequency domain dimension, so that the time-frequency characteristic is obtained.
Optionally, the computer device calculates a spectrogram of the input audio as the time-frequency feature of the input audio by performing a short-time Fourier transform on each input audio.
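A minimal numpy sketch of this step, assuming a Hann window, a 512-point frame and a 256-sample hop (illustrative values not given in the application):

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    # Short-time Fourier transform of a mono signal -> complex (frames, frame_len//2 + 1) spectrogram.
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

# One waveform per sound sensor in the array (three hypothetical 1-second signals at 16 kHz).
rng = np.random.default_rng(0)
input_audios = [rng.standard_normal(16000) for _ in range(3)]
tf_features = [stft(x) for x in input_audios]   # one time-frequency feature per input audio
```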
The plurality of input audio includes a first input audio, the first input audio being input audio awaiting a region extraction. The first input audio is any one of a plurality of input audio. Optionally, the first input audio is manually selected.
Step 304: an angular distribution characteristic is determined from time-frequency characteristics of the plurality of input audio frequencies.
The angle distribution feature is used for representing the audio of N angle directions of each input audio within a specified angle range, where N is a positive integer. Optionally, the specified angle range is set manually, and the specified angle range is used to indicate the range of the region in which sound zone extraction is performed. For example, it is set by a parameter [θ_l, θ_h], where θ_l represents the lower bound of the specified angle range and θ_h represents the upper bound of the specified angle range. The specified angle range is set based on a coordinate system established from the sound sensor array, for example, a coordinate system established with the center of the sound sensor array as the origin. Optionally, the value of N is set manually.
For example, if the specified angle range set by the user is 30 to 90 degrees, its width is 60 degrees. If an angular direction is selected every 15 degrees within this range, there are five directions: 30, 45, 60, 75 and 90 degrees; in this case N=5. The specified angle range and N may be preset in the computer device (sound separation model), and this information is invoked when determining the angle distribution feature. Alternatively, the specified angle range and N are input to the computer device (sound separation model) when the angle distribution feature is determined. In practice, audio extraction is performed for the N angular directions within the specified angle range, for example, in the above case, for the 5 angular directions of 30, 45, 60, 75 and 90 degrees; by setting a larger number of angular directions within the specified angle range, the accuracy of audio extraction can be improved.
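Using the example in the preceding paragraph, the N angular directions can be enumerated as follows (a trivial sketch; the uniform 15-degree spacing is only the example given above):

```python
import numpy as np

theta_l, theta_h, N = 30.0, 90.0, 5              # specified angle range [30, 90] degrees, N = 5
directions = np.linspace(theta_l, theta_h, N)    # array([30., 45., 60., 75., 90.])
```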
The angular distribution characteristic is related to time-frequency characteristics of the plurality of input audio. Optionally, the angular distribution feature includes a first distribution feature including a first phase difference of the input audio corresponding to each of the two sound sensors, and a second distribution feature for reflecting a similarity of the first phase difference to a second phase difference, which is a phase difference of sampling pulse signals of N angular directions within a specified angular range for each of the two sound sensors.
Optionally, in determining the angular distribution characteristic, the computer device determines one or more pairs of sound sensors in the sound sensor array, e.g., M pairs of sensors, M being a positive integer. The pair of acoustic sensors includes a first acoustic sensor and a second acoustic sensor, the first acoustic sensor and the second acoustic sensor being randomly selected in the sensor array. The computer device can determine a corresponding phase map from the time-frequency characteristic corresponding to the first sound sensor and the time-frequency characteristic corresponding to the second sound sensor, thereby determining the first phase difference. Illustratively, with continued reference to FIG. 1, the acoustic sensor array includes acoustic sensors 1, 2, and 3. The random pair of acoustic sensors may be (acoustic sensor 1, acoustic sensor 2), (acoustic sensor 2, acoustic sensor 3), (acoustic sensor 1, acoustic sensor 3). The computer device also determines a second phase difference of the first and second sound sensors for sampling the pulse signals in the N angular directions within the specified angular range, thereby determining a similarity of the first and second phase differences, and determines a second distribution characteristic based on the similarity. The pulse signal is not an actual signal, but a signal which is virtually derived from the calculation of the second phase difference. The computer device may determine the similarity as the second distribution characteristic or information related to the similarity as the second distribution characteristic. The second phase difference is used to estimate the phase difference of the same pulse signal sampled by the two acoustic sensors. The pulse signal corresponds to a position and a frequency, and the second phase difference is influenced by the position and the frequency of the pulse signal and the distance between the two sound sensors. I.e. the second phase difference can be estimated from the assumed position, frequency of the pulse signal and the spacing of the two sound sensors. For example, with continued reference to fig. 1, it is assumed that the pulse signal is located in the specified angle range 141, and the second phase difference corresponding to the two sound sensors on the left side, that is, the phase difference at which the pulse signal in the specified angle range 141 is sampled.
The computer device acquires the phase difference of the pulse signal over the sound sensor pair, i.e., the second phase difference is obtained. Optionally, for the specific calculation of the second phase difference, the computer device calculates the signal delay of the pulse signal to the sound sensor pair, where the signal delay is related to the sampling position (the angular direction) of the pulse signal and the spacing of the sound sensor pair. The computer device also obtains the signal frequency of the pulse signal, and linearly integrates the signal delay and the signal frequency, thereby obtaining the second phase difference. In this process, the computer device calculates what theoretical value the phase difference between different sound sensors (a sound sensor pair) takes at different frequencies if a pulse signal is emitted from a certain angular direction. It should be noted that the signal delay is related to the sampling position and the spacing of the sound sensor pair, and may be calculated specifically by using a plane wave model or a spherical wave model. Linear integration refers to the process of combining the signal delay and the signal frequency through a linear function, such as addition or multiplication.
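A sketch of this calculation under a plane-wave (far-field) assumption; the geometry convention, sensor spacing and sampling parameters below are illustrative assumptions, not values from the application:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def signal_delay(theta_deg, sensor_spacing):
    # Plane-wave model: time delay between the two sound sensors of a pair for a
    # far-field pulse arriving from angle theta (measured from the array axis).
    return sensor_spacing * np.cos(np.deg2rad(theta_deg)) / SPEED_OF_SOUND

def second_phase_difference(theta_deg, freqs_hz, sensor_spacing=0.05):
    # Linear integration of signal delay and signal frequency: TPD = 2*pi*f*tau.
    tau = signal_delay(theta_deg, sensor_spacing)
    return 2.0 * np.pi * freqs_hz * tau

freqs = np.fft.rfftfreq(512, d=1.0 / 16000.0)        # frequency of each STFT bin
tpd_45deg = second_phase_difference(45.0, freqs)      # second phase difference for one direction
```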
In some embodiments, the method provided by the embodiments of the present application is implemented by a sound separation model. The sound separation model includes a first network for implementing the methods provided in steps 302 and 304 above.
Step 306: and segmenting the time-frequency characteristic of the first input audio in the frequency domain dimension according to the K frequency bands to obtain time-frequency sub-characteristics corresponding to the K frequency bands.
The time-frequency characteristic of the first input audio is frequency-dependent and therefore has characteristic dimensions in the frequency domain so that it can be sliced. The bandwidths of the K frequency bands are the same, or the bandwidths of the K frequency bands are different, or the bandwidths of the K frequency bands are partially the same, and K is a positive integer greater than 1. The relevant parameters used in the slicing process may be determined manually, for example, set according to the distribution of the sound to be extracted in the frequency domain. The time-frequency sub-features are features of the time-frequency features that are distributed in the corresponding frequency band range.
Step 308: and cutting the angle distribution characteristics in the frequency domain dimension according to the K frequency bands to obtain angle distribution sub-characteristics corresponding to the K frequency bands.
The angular distribution features are frequency dependent and therefore have a feature dimension in the frequency domain so that they can be sliced. The bandwidths of the K frequency bands are the same, or the bandwidths of the K frequency bands are different, or the bandwidths of the K frequency bands are partially the same, and K is a positive integer greater than 1. The relevant parameters used in the slicing process may be determined manually, for example, set according to the distribution of the sound to be extracted in the frequency domain. It should be noted that the above parameters used for slicing the time-frequency characteristic of the first input audio are the same as the above parameters used for slicing the angular distribution characteristic. The angular distribution sub-feature is a feature of the angular distribution feature that is distributed within a corresponding frequency band.
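A minimal sketch of the frequency-band segmentation described in steps 306 and 308, assuming hypothetical band edges (the application leaves the bandwidths to be chosen, equal or unequal):

```python
import numpy as np

def split_bands(feature, band_edges):
    # Slice a (frames, freq_bins) feature along the frequency axis into K sub-features.
    return [feature[:, lo:hi] for lo, hi in zip(band_edges[:-1], band_edges[1:])]

frames, bins = 100, 257
tf_feature = np.random.randn(frames, bins) + 1j * np.random.randn(frames, bins)
angle_feature = np.random.randn(frames, bins)

edges = [0, 32, 64, 128, 257]                           # K = 4 bands with unequal bandwidths
tf_sub_features = split_bands(tf_feature, edges)        # time-frequency sub-features per band
angle_sub_features = split_bands(angle_feature, edges)  # same edges, so the bands line up
```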
In some embodiments, the method provided by the embodiments of the present application is implemented by a sound separation model. The sound separation model includes a second network that is cascaded with the first network, the second network being configured to implement the methods provided in steps 306 and 308.
Step 310: and carrying out feature extraction on the time-frequency sub-features corresponding to the K frequency bands and the angle distribution sub-features corresponding to the K frequency bands to obtain feature extraction results corresponding to the K frequency bands.
The computer equipment performs feature modeling on the time-frequency sub-features corresponding to the K frequency bands and the angle distribution sub-features corresponding to the K frequency bands, so that a feature extraction result is obtained. Optionally, after the computer device performs frequency band division, the frequency band division result is mapped to the same designated dimension in a unified manner, and the mapped features are spliced to obtain splicing features corresponding to the K frequency bands, so that feature modeling is performed on the splicing features corresponding to the K frequency bands, and feature extraction results corresponding to the K frequency bands are obtained. During modeling, the computer device will model along the first dimension and the second dimension in turn, and the modeling results will carry correlations between features of the modeling input at different locations of the first dimension/the second dimension. Optionally, the first dimension is a time dimension (sequence dimension) and the second dimension is a frequency band dimension; or, the first dimension is a frequency band dimension, and the second dimension is a time dimension.
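The two-stage modeling can be sketched, for example, with recurrent layers applied first along the time dimension and then along the frequency band dimension; the layer types and sizes below are illustrative assumptions, not the application's architecture:

```python
import torch
import torch.nn as nn

class DualDimBlock(nn.Module):
    # Models splice features of shape (K bands, T frames, D dims) along the time
    # dimension, then along the band dimension, with residual connections.
    def __init__(self, dim=64, hidden=128):
        super().__init__()
        self.time_rnn = nn.LSTM(dim, hidden, batch_first=True)                      # causal over time
        self.time_proj = nn.Linear(hidden, dim)
        self.band_rnn = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)  # over the K bands
        self.band_proj = nn.Linear(2 * hidden, dim)

    def forward(self, x):                      # x: (K, T, D) splice features
        y, _ = self.time_rnn(x)                # sequence over T for each band
        x = x + self.time_proj(y)              # first modeling sequence features
        z = x.transpose(0, 1)                  # (T, K, D): sequence over bands for each frame
        y, _ = self.band_rnn(z)
        z = z + self.band_proj(y)              # second modeling sequence features
        return z.transpose(0, 1)               # back to (K, T, D)

splice_features = torch.randn(4, 100, 64)      # K=4 bands, T=100 frames, D=64 (hypothetical)
extraction_result = DualDimBlock()(splice_features)
```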
In some embodiments, the method provided by the embodiments of the present application is implemented by a sound separation model. The sound separation model comprises a third network, which is cascaded with the second network described above, for implementing the method provided in step 310 described above.
Step 312: and extracting output audio of the first input audio within a specified angle range according to the feature extraction results corresponding to the K frequency bands.
The computer device can predict masks corresponding to the K frequency bands according to the feature extraction results, where the mask corresponding to each of the K frequency bands is used to indicate the duty ratio of the output audio at different time-frequency positions of the first input audio on that frequency band. The masks corresponding to each of the K frequency bands are combined to obtain a merged mask. Then, according to the first input audio and the merged mask, the time-frequency feature of the output audio of the first input audio within the specified angle range can be determined, and the output audio can be obtained by performing an inverse short-time Fourier transform on this time-frequency feature.
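A numpy sketch of this final step, pairing with the earlier STFT sketch; the band widths and the overlap-add inverse transform below are illustrative assumptions:

```python
import numpy as np

def istft(spec, frame_len=512, hop=256):
    # Inverse short-time Fourier transform by windowed overlap-add (matches the STFT sketch above).
    window = np.hanning(frame_len)
    frames = np.fft.irfft(spec, n=frame_len, axis=-1) * window
    n = frame_len + hop * (len(frames) - 1)
    out, norm = np.zeros(n), np.zeros(n)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + frame_len] += frame
        norm[i * hop : i * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)

frames_n = 100
band_masks = [np.random.rand(frames_n, w) for w in (32, 32, 64, 129)]   # one mask per frequency band
merged_mask = np.concatenate(band_masks, axis=1)                        # (100, 257) merged mask
first_input_spec = np.random.randn(frames_n, 257) + 1j * np.random.randn(frames_n, 257)

output_spec = merged_mask * first_input_spec      # duty ratio applied at each time-frequency position
output_audio = istft(output_spec)                 # output audio within the specified angle range
```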
In some embodiments, the first input audio is speech collected by the sound sensor array. The computer device can then extract the output speech of the first input audio within the specified angle range according to the feature extraction results corresponding to the K frequency bands. In this case, the mask can also provide a noise reduction effect; that is, it not only yields the output speech but also reduces noise in the output speech.
The method provided in this embodiment processes the input audio in a streaming manner, so streaming processing of audio data is supported; that is, for the audio data of a given input time-frequency point, the corresponding time-frequency point data obtained by sound zone extraction can be output in real time.
In some embodiments, the method provided by the embodiments of the present application is implemented by a sound separation model. The sound separation model includes a fourth network that is cascaded with the third network, the fourth network being configured to implement the method provided in step 312.
In summary, in the method provided by this embodiment, the angle distribution feature used for sound zone extraction is constructed based on the time-frequency features of the input audio, and frequency band segmentation is performed on both the time-frequency feature and the angle distribution feature, so that the output audio obtained by sound zone extraction can be analyzed independently for different frequency bands, which improves the performance of sound zone extraction.
Fig. 4 is a flow chart of an audio extraction method according to an exemplary embodiment of the present application. The method may be used with a computer device or a client on a computer device. As shown in fig. 4, the method includes:
step 402: a time-frequency characteristic of a plurality of input audio is acquired.
Each of the plurality of input audio is acquired by one of the sound sensors in the sound sensor array. In some embodiments, the input audio is obtained by sampling speech. The plurality of input audio may be referred to as multi-channel audio, with each channel corresponding to one sound sensor.
Each input audio has its corresponding time-frequency characteristic, which is used to reflect the characteristics of the input audio in the time domain dimension and the frequency domain dimension, and is the characteristics obtained by extracting the characteristics of the input audio from the time domain dimension and the frequency domain dimension. Optionally, the computer device calculates a spectrogram of the input audio as a time-frequency characteristic of the input audio by performing a short-time fourier transform on each input audio.
The plurality of input audio includes a first input audio, the first input audio being input audio awaiting a region extraction. The first input audio is any one of a plurality of input audio. Optionally, the first input audio is manually selected.
Step 404: a first distribution characteristic and a second distribution characteristic are determined according to time-frequency characteristics of a plurality of input audios.
The first distribution feature includes a first phase difference of the input audio corresponding to each two sound sensors, the second distribution feature is used to reflect the similarity between the first phase difference and a second phase difference, and the second phase difference is the phase difference with which each two sound sensors sample pulse signals from N angle directions within the specified angle range. Optionally, the specified angle range is set manually, and the specified angle range is used to indicate the range of the region in which sound zone extraction is performed. For example, it is set by the parameter [θ_l, θ_h]. The specified angle range is set based on a coordinate system established from the sound sensor array. Optionally, the value of N is set manually. The specified angle range and N may be preset in the computer device (sound separation model), and this information is invoked when determining the angle distribution feature. Alternatively, the specified angle range and N are manually input to the computer device (sound separation model) when the angle distribution feature is determined.
Illustratively, FIG. 5 is a schematic diagram of a specified angle range provided by an exemplary embodiment of the present application. As shown in fig. 5, the specified angle range 502 may be determined, according to [θ_l, θ_h], in a coordinate system corresponding to the sound sensor array 501. According to N, N angular directions can be determined within the specified angle range 502, namely θ_l, θ_(l+1), θ_(l+2), …, θ_h.
Optionally, in determining the angle distribution characteristic, the computer device determines one or more pairs of sound sensors in the sound sensor array, for example M pairs, M being a positive integer. Each sound sensor pair includes a first sound sensor and a second sound sensor, which are randomly selected from the sensor array. The computer device can determine a corresponding phase map from the time-frequency characteristic of the first sound sensor and the time-frequency characteristic of the second sound sensor, and thereby determine the first phase difference. The computer device may also determine the second phase difference with which the first sound sensor and the second sound sensor sample pulse signals from the N angular directions within the specified angle range, determine the similarity between the first phase difference and the second phase difference (for example, by a similarity calculation between vectors), and determine the second distribution characteristic based on the similarity. The pulse signal is not an actual signal but a virtual signal used for calculating the second phase difference: for this calculation, the computer device evaluates what the theoretical phase difference between the two sound sensors of a pair would be, at each frequency, if a pulse signal were emitted from a given angular direction. The computer device may determine the similarity itself, or information derived from the similarity, as the second distribution characteristic.
The computer device obtains the phase difference of the pulse signal across the sound sensor pair, i.e. the second phase difference. Optionally, for the specific calculation of the second phase difference, the computer device may obtain the signal delay of the pulse signal to the sound sensor pair, the signal delay being related to the sampling position of the pulse signal and the spacing of the sound sensor pair. The computer device obtains the signal frequency of the pulse signal and combines the signal delay and the signal frequency through a linear function, thereby obtaining the second phase difference. It should be noted that the signal delay, which depends on the sampling position and the spacing of the sound sensor pair, may specifically be calculated using a plane wave model or a spherical wave model. The linear combination refers to combining the signal delay and the signal frequency through a linear function, such as addition or multiplication. The first phase difference may be referred to as the inter-channel phase difference (Inter-Channel Phase Difference, IPD), and the second phase difference may be referred to as the target angular phase difference (Target Phase Difference, TPD).
Illustratively, the calculation formulas of the above-mentioned similarity, the first phase difference, and the second phase difference are as follows:
where p = (p1, p2) is the index of a sound sensor pair and M denotes the total number of sound sensor pairs. The IPD is the observed phase difference (first phase difference) across the sound sensor pair, i.e. the difference between the phase at sound sensor p1 and the phase at sound sensor p2. The TPD is the theoretical phase difference (second phase difference), across the sound sensor pair, of a pulse signal arriving from the angular direction l_v(t) at time t; it is determined by the pure propagation delay τ^(p)(l_v(t)) (the signal delay) with which the pulse signal from direction l_v(t) reaches the sound sensor pair (p1, p2), and by the frequency f (the signal frequency). The delay τ^(p)(l_v(t)) depends on the sound sensor pair and the angular direction l_v(t), and can be calculated using a plane wave model or a spherical wave model. V(·) denotes the calculated similarity.
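Illustratively, the relationship between the IPD, the TPD and their similarity can be sketched as follows. The plane-wave delay formula and the cosine-based similarity are assumptions chosen for illustration, not the only forms covered by the embodiment.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed propagation speed

def ipd(spec_p1, spec_p2):
    """First phase difference (IPD) between a sensor pair, shape (F, T)."""
    return np.angle(spec_p1) - np.angle(spec_p2)

def plane_wave_delay(mic_spacing, angle_rad):
    """Signal delay of a virtual pulse from a given direction (plane-wave model, assumed)."""
    return mic_spacing * np.cos(angle_rad) / SPEED_OF_SOUND

def tpd(freqs_hz, delay):
    """Second phase difference (TPD): linear combination 2*pi*f*tau, shape (F, 1)."""
    return 2.0 * np.pi * freqs_hz[:, None] * delay

def similarity(ipd_ft, tpd_f1):
    """Similarity of observed and theoretical phase differences (cosine form assumed)."""
    return np.cos(ipd_ft - tpd_f1)
```

Summing the similarities over the M sensor pairs for each of the N angular directions then yields the second distribution feature described below.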
For each angular direction, a second distribution characteristic with dimension M×F×T can be obtained, and there are N angular directions in total. Optionally, the computer device sums, along the M dimension, the M second distribution characteristics corresponding to each of the N angular directions, thereby obtaining a second distribution characteristic with dimension N×F×T. Here M denotes the number of sound sensor pairs, M being a positive integer.
In some embodiments, the method provided by the embodiments of the present application is implemented by a sound separation model. The sound separation model includes a first network for implementing the methods provided by steps 402 and 404 described above.
Illustratively, fig. 6 is a schematic diagram of a first network according to an exemplary embodiment of the present application. As shown in fig. 6, the first network 601 performs a short-time Fourier transform on the plurality of input audios to obtain their time-frequency characteristics (Y), which include the time-frequency characteristic of the first input audio. The time-frequency characteristic of the first input audio is fed into the subsequent network. The first network 601 determines the first phase difference (IPD), i.e. the first distribution feature, according to the time-frequency characteristics of the plurality of input audios, and calculates the second phase difference (TPD). The second distribution feature (V) is then determined according to the similarity between the TPD and the IPD. The second distribution feature may also be referred to as the angle region definition feature, and the computation of the TPD depends on the parameters [θ_l, θ_h].
Step 406: and segmenting the time-frequency characteristic of the first input audio in the frequency domain dimension according to the K frequency bands to obtain time-frequency sub-characteristics corresponding to the K frequency bands.
The time-frequency characteristic of the first input audio is frequency-dependent and therefore has characteristic dimensions in the frequency domain so that it can be sliced. The bandwidths of the K frequency bands are the same, or the bandwidths of the K frequency bands are different, or the bandwidths of the K frequency bands are partially the same, and K is a positive integer greater than 1. The relevant parameters used in the slicing process may be determined manually, for example, set according to the distribution of the sound to be extracted in the frequency domain. The time-frequency sub-features are features of the time-frequency features that are distributed in the corresponding frequency band range.
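Illustratively, the frequency-band segmentation can be sketched as follows; the band widths are assumptions and may be uniform or non-uniform, and the same routine applies to any feature that has a frequency axis (including the distribution features split in the following steps).

```python
import numpy as np

def split_bands(feature, band_widths):
    """Split a (..., F, T) feature into K sub-features along the frequency axis.

    band_widths: list of K bandwidths (in frequency bins) that sum to F;
    uniform and non-uniform widths are both allowed.
    """
    assert sum(band_widths) == feature.shape[-2]
    sub_features, start = [], 0
    for bw in band_widths:
        sub_features.append(feature[..., start:start + bw, :])
        start += bw
    return sub_features

# Example: 257 frequency bins split into K = 4 bands of chosen widths (assumed values).
# sub_specs = split_bands(spec, [32, 64, 64, 97])
```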
Step 408: and cutting the first distribution characteristic in the frequency domain dimension according to the K frequency bands to obtain first distribution sub-characteristics corresponding to the K frequency bands.
The first distribution feature is frequency dependent and therefore has a feature dimension in the frequency domain so that it can be sliced. The bandwidths of the K frequency bands are the same, or the bandwidths of the K frequency bands are different, or the bandwidths of the K frequency bands are partially the same, and K is a positive integer greater than 1. The relevant parameters used in the slicing process may be determined manually, for example, set according to the distribution of the sound to be extracted in the frequency domain. It should be noted that the above parameters used for slicing the time-frequency characteristic of the first input audio are the same as the above parameters used for slicing the first distribution characteristic. The first distribution sub-feature is a feature of the first distribution feature that is distributed within a corresponding frequency band range.
Step 410: and cutting the second distribution characteristic in the frequency domain dimension according to the K frequency bands to obtain similarity distribution sub-characteristics corresponding to the K frequency bands.
The second distribution feature is frequency dependent and therefore has a feature dimension in the frequency domain so that it can be sliced. The bandwidths of the K frequency bands are the same, or the bandwidths of the K frequency bands are different, or the bandwidths of the K frequency bands are partially the same, and K is a positive integer greater than 1. The relevant parameters used in the slicing process may be determined manually, for example, set according to the distribution of the sound to be extracted in the frequency domain. The above parameters used for splitting the time-frequency characteristic of the first input audio are the same as those used for splitting the second distribution characteristic. The similarity distribution sub-feature is a feature of the second distribution feature that is distributed within the corresponding frequency band range. Each of the K frequency bands has N similarity distribution sub-features, which are in one-to-one correspondence with the N angular directions.
Step 412: and carrying out feature integration on N similarity distribution sub-features corresponding to each frequency band in the K frequency bands to obtain second distribution sub-features corresponding to the K frequency bands.
The computer equipment respectively integrates the N similarity distribution sub-features corresponding to each frequency band into one feature, so as to obtain second distribution sub-features corresponding to K frequency bands. For feature integration (aggregation), this can be achieved by at least one of the following:
(1) Splicing (concat): the computer device splices the N similarity distribution sub-features corresponding to each of the K frequency bands to obtain the second distribution sub-features corresponding to the K frequency bands. Optionally, the splicing is performed along any feature dimension of the similarity distribution sub-features.
(2) Transform-And-Concat (TAC): the computer device performs a linear or nonlinear feature transformation on the N similarity distribution sub-features corresponding to each of the K frequency bands to obtain N feature transformation results for each of the K frequency bands, and then splices the N feature transformation results of each frequency band to obtain the second distribution sub-features corresponding to the K frequency bands. Optionally, the splicing is performed along any feature dimension of the similarity distribution sub-features. Optionally, the nonlinear feature transformation may be implemented by a neural network, such as a multi-layer perceptron (Multilayer Perceptron, MLP), whose parameters differ for each frequency band.
(3) Transform-And-Average (TAA): the computer device performs a linear or nonlinear feature transformation on the N similarity distribution sub-features corresponding to each of the K frequency bands to obtain N feature transformation results for each of the K frequency bands, and then determines the average feature of the N feature transformation results of each frequency band to obtain the second distribution sub-features corresponding to the K frequency bands. Optionally, the nonlinear feature transformation may be implemented by a neural network, such as an MLP, whose parameters differ for each frequency band.
(4) Cascade feature extraction: the computer device performs cascade feature extraction on the N similarity distribution sub-features corresponding to each of the K frequency bands in the order of the angular directions, thereby obtaining the second distribution sub-features corresponding to the K frequency bands. The cascade feature extraction performs feature extraction on the cascade feature extraction result of the first i levels together with the (i+1)-th similarity distribution sub-feature, thereby obtaining the cascade feature extraction result of the first i+1 levels, i being a positive integer. In the case where i is 1, feature extraction is performed on the first similarity distribution sub-feature. The final cascade feature extraction result carries the result of feature extraction for each of the N similarity distribution sub-features. Optionally, the computer device may implement the above feature extraction process through a recurrent neural network (Recurrent Neural Network, RNN).
Illustratively, FIG. 7 is a schematic diagram of a feature integration process provided by an exemplary embodiment of the present application. As shown in fig. 7, in mode 701, the computer device concatenates the N similarity distribution sub-features of each frequency band along a feature dimension. In mode 702, the computer device feeds each similarity distribution sub-feature of each frequency band into an MLP for nonlinear feature transformation and then concatenates the N output features along a feature dimension, the MLP being different for each frequency band. In mode 703, the computer device feeds each similarity distribution sub-feature of each frequency band into an MLP for nonlinear feature transformation and then calculates the average of the N output features, the MLP again being different for each frequency band. In mode 704, the computer device treats the N similarity distribution sub-features of each frequency band as a feature sequence (the ordering can be manually specified, for example clockwise along the angular direction), performs sequence modeling on this feature sequence with an RNN, and takes the output of the last time step of the RNN as the integrated feature of the N similarity distribution sub-features. Here BW_k denotes the width of the k-th frequency band, and H is the feature dimension obtained by the feature mapping.
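Illustratively, a minimal sketch of the four integration modes above is given below. PyTorch, the hidden sizes, and the choice of a GRU as the recurrent network for mode 704 are assumptions; each of the K frequency bands would use its own module instance so that the parameters differ across bands.

```python
import torch
import torch.nn as nn

class BandFeatureIntegration(nn.Module):
    """Integrate N per-direction sub-features of one frequency band into one feature.

    Hidden sizes and the GRU for the cascade variant are illustrative assumptions;
    any of the four modes described above could be selected via `mode`.
    """

    def __init__(self, in_dim, hidden_dim, mode="taa"):
        super().__init__()
        self.mode = mode
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim))
        self.rnn = nn.GRU(in_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        # x: (batch, N, in_dim) -- the N similarity distribution sub-features of one band
        if self.mode == "concat":          # mode (1): splice along the feature dimension
            return x.flatten(start_dim=1)
        if self.mode == "tac":             # mode (2): transform, then splice
            return self.mlp(x).flatten(start_dim=1)
        if self.mode == "taa":             # mode (3): transform, then average
            return self.mlp(x).mean(dim=1)
        # mode (4): cascade feature extraction over the ordered direction sequence
        _, h = self.rnn(x)                 # last hidden state summarizes all N sub-features
        return h[-1]
```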
Step 414: and determining the first distribution sub-feature corresponding to the K frequency bands and the second distribution sub-feature corresponding to the K frequency bands as the angle distribution sub-feature corresponding to the K frequency bands.
In some embodiments, the method provided by the embodiments of the present application is implemented by a sound separation model. The sound separation model comprises a second network, which is cascaded with the first network described above, for implementing the method provided by steps 406-414 described above.
Illustratively, fig. 8 is a schematic diagram of a second network provided by an exemplary embodiment of the present application. As shown in fig. 8, the second network 801 includes fully connected layers: each of the K frequency bands corresponds to one fully connected layer, and separate fully connected layers are provided for the time-frequency characteristic, the first distribution characteristic and the second distribution characteristic of the first input audio. The second network 801 divides the time-frequency characteristic, the first distribution characteristic and the second distribution characteristic of the first input audio into K frequency bands along the frequency direction, thereby obtaining the time-frequency sub-features corresponding to the K frequency bands, the first distribution sub-features corresponding to the K frequency bands and the similarity distribution sub-features corresponding to the K frequency bands. The feature integration described above is then performed on the similarity distribution sub-features corresponding to the K frequency bands, yielding the second distribution sub-features corresponding to the K frequency bands. In addition, the second network 801 maps the input features to a specified dimension through the fully connected layer corresponding to each frequency band, the specified dimension being set manually; that is, the feature output by each fully connected layer has the specified dimension. Each fully connected layer is also cascaded with a layer normalization (Layer Normalization) layer.
Step 416: and carrying out feature extraction on the time-frequency sub-features corresponding to the K frequency bands and the angle distribution sub-features corresponding to the K frequency bands to obtain feature extraction results corresponding to the K frequency bands.
The computer equipment maps the time-frequency sub-features corresponding to each frequency band in the K frequency bands to the appointed dimension to obtain first sequence features corresponding to the K frequency bands. And mapping the first distribution sub-feature corresponding to each frequency band in the K frequency bands to the appointed dimension to obtain the second sequence feature corresponding to the K frequency bands. And mapping the second distribution sub-feature corresponding to each frequency band in the K frequency bands to the appointed dimension to obtain a third sequence feature corresponding to the K frequency bands. And then splicing the first sequence features corresponding to the K frequency bands, the second sequence features corresponding to the K frequency bands and the third sequence features corresponding to the K frequency bands, thereby obtaining splicing features corresponding to the K frequency bands. For the specific process of mapping and splicing (merging), reference may be made to the example of fig. 8, and the embodiments of the present application are not described herein.
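Illustratively, the per-band mapping and splicing could be sketched as follows. PyTorch and the layer-normalization-plus-linear projection are assumptions for illustration; each band would use its own instance, and a sum of the three projected features would be an equally valid merge, as in the later description of fig. 8.

```python
import torch
import torch.nn as nn

class PerBandProjection(nn.Module):
    """Map one band's time-frequency, first distribution and second distribution
    sub-features to a common specified dimension and merge them (sketch)."""

    def __init__(self, tf_dim, ipd_dim, angle_dim, specified_dim):
        super().__init__()
        self.tf_proj = nn.Sequential(nn.LayerNorm(tf_dim), nn.Linear(tf_dim, specified_dim))
        self.ipd_proj = nn.Sequential(nn.LayerNorm(ipd_dim), nn.Linear(ipd_dim, specified_dim))
        self.angle_proj = nn.Sequential(nn.LayerNorm(angle_dim), nn.Linear(angle_dim, specified_dim))

    def forward(self, tf_sub, ipd_sub, angle_sub):
        # Each input: (batch, T, feature_dim) for this frequency band.
        seq1 = self.tf_proj(tf_sub)        # first sequence feature
        seq2 = self.ipd_proj(ipd_sub)      # second sequence feature
        seq3 = self.angle_proj(angle_sub)  # third sequence feature
        return torch.cat([seq1, seq2, seq3], dim=-1)  # splice feature for this band
```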
After the splicing features corresponding to the K frequency bands are obtained, the computer device performs feature extraction on the splicing features corresponding to the K frequency bands, thereby obtaining the feature extraction results corresponding to the K frequency bands.
Alternatively, the process of feature extraction refers to feature modeling. And modeling the splicing features corresponding to the K frequency bands along a first dimension by the computer equipment to obtain first modeling sequence features corresponding to the K frequency bands. Wherein the first modeled sequence features carry correlations between features of the stitching features at different positions of the first dimension. And modeling the first modeling sequence features corresponding to the K frequency bands along a second dimension to obtain second modeling sequence features corresponding to the K frequency bands. Wherein the second modeled sequence features carry correlations between features of the first modeled sequence features at different positions in the second dimension. The computer equipment determines the second modeling sequence features corresponding to the K frequency bands as feature extraction results corresponding to the K frequency bands. The first dimension is a time dimension, and the second dimension is a frequency band dimension; or, the first dimension is a frequency band dimension, and the second dimension is a time dimension.
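Illustratively, modeling along the two dimensions in sequence might be sketched as follows. PyTorch, the LSTM layers, the hidden size, and the residual wiring are illustrative assumptions rather than the only instantiation covered by the embodiment.

```python
import torch
import torch.nn as nn

class DualDimensionBlock(nn.Module):
    """Model the spliced features first along one dimension (e.g. time) and then
    along the other (e.g. band index), carrying correlations across positions."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.time_norm = nn.LayerNorm(feat_dim)
        self.time_rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.time_fc = nn.Linear(hidden_dim, feat_dim)
        self.band_norm = nn.LayerNorm(feat_dim)
        self.band_rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.band_fc = nn.Linear(2 * hidden_dim, feat_dim)

    def forward(self, x):
        # x: (batch, K, T, feat_dim) -- K frequency bands, T frames
        b, k, t, d = x.shape
        # First dimension: model each band's sequence over time, with a residual connection.
        y = x.reshape(b * k, t, d)
        y = self.time_fc(self.time_rnn(self.time_norm(y))[0]) + y
        # Second dimension: model across the K bands at each frame, with a residual connection.
        z = y.reshape(b, k, t, d).permute(0, 2, 1, 3).reshape(b * t, k, d)
        z = self.band_fc(self.band_rnn(self.band_norm(z))[0]) + z
        return z.reshape(b, t, k, d).permute(0, 2, 1, 3)  # back to (batch, K, T, feat_dim)
```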
In some embodiments, the method provided by the embodiments of the present application is implemented by a sound separation model. The sound separation model includes a third network that is cascaded with the second network, the third network being configured to implement the method provided in step 416.
Illustratively, fig. 9 is a schematic diagram of a third network provided by an exemplary embodiment of the present application. As shown in fig. 9, the third network 901 includes a sub-network 902 and a sub-network 903. For the output of the second network, i.e. the splicing features corresponding to the K frequency bands, the third network 901 performs modeling sequentially along the dimensions of the sequence length T and the number of frequency bands K. Optionally, the sub-networks 902 and 903 include recurrent neural networks; they may also be other neural networks, which is not limited by the embodiments of the present application. With continued reference to fig. 9, the sub-networks 902 and 903 include a layer normalization layer and a fully connected (Fully Connected, FC) layer, and a residual connection structure is provided after the fully connected layer to enhance the performance of the model.
Step 418: and extracting output audio of the first input audio within a specified angle range according to the feature extraction results corresponding to the K frequency bands.
The computer equipment can predict masks corresponding to the K frequency bands through the second modeling sequence features corresponding to the K frequency bands. Wherein the mask corresponding to each of the K frequency bands is used to indicate a duty cycle of the output audio at a different time-frequency location of the first input audio at each frequency band. And combining the masks corresponding to each of the K frequency bands, thereby obtaining a combined mask. According to the time-frequency characteristics of the first input audio and the merging mask, the time-frequency characteristics of the output audio of the first input audio in a specified angle range can be determined, and the output audio can be obtained by performing short-time inverse Fourier transform on the time-frequency characteristics.
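Illustratively, per-band mask prediction, mask merging and the short-time inverse Fourier transform could be sketched as follows. PyTorch and SciPy, the MLP sizes, and the real-valued mask form are assumptions; the STFT settings must match those used for analysis.

```python
import torch
import torch.nn as nn
from scipy.signal import istft

class PerBandMaskHead(nn.Module):
    """Predict one band's mask from that band's feature extraction result (sketch)."""

    def __init__(self, feat_dim, band_width):
        super().__init__()
        self.net = nn.Sequential(nn.LayerNorm(feat_dim),
                                 nn.Linear(feat_dim, feat_dim), nn.Tanh(),
                                 nn.Linear(feat_dim, band_width))

    def forward(self, band_feat):
        # band_feat: (batch, T, feat_dim) -> mask: (batch, T, band_width)
        return self.net(band_feat)

def extract_output_audio(spec_first, band_masks, sample_rate=16000, n_fft=512, hop=256):
    """Merge the per-band masks along frequency, apply them to the first input audio's
    complex spectrogram, and invert with a short-time inverse Fourier transform."""
    merged = torch.cat(band_masks, dim=-1)                            # (batch, T, F) combined mask
    masked = spec_first * merged.permute(0, 2, 1).detach().numpy()    # (batch, F, T)
    _, audio = istft(masked, fs=sample_rate, nperseg=n_fft, noverlap=n_fft - hop)
    return audio
```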
In some embodiments, the first input audio is acquired by a sound sensor array, and the computer device extracts output speech of the first input audio within the specified angle range according to the feature extraction results corresponding to the K frequency bands. In this case, the mask can also provide a noise reduction effect; that is, speech is not only extracted but also denoised in the output.
In some embodiments, the method provided by the embodiments of the present application is implemented by a sound separation model. The sound separation model includes a fourth network that is cascaded with the third network described above for implementing the method provided by step 418 described above.
Illustratively, fig. 10 is a schematic diagram of a fourth network provided by an exemplary embodiment of the present application. As shown in fig. 10, after the feature extraction results corresponding to the K frequency bands are input to the fourth network 1001, they are predicted through an MLP and a layer normalization layer, so that the mask corresponding to each of the K frequency bands can be obtained. The masks corresponding to each of the K frequency bands are combined, thereby obtaining a combined mask. According to the time-frequency characteristic of the first input audio and the combined mask, the time-frequency characteristic of the output audio of the first input audio within the specified angle range can be determined, and the estimated output audio can be obtained by performing a short-time inverse Fourier transform on this time-frequency characteristic.
In summary, according to the method provided by the embodiment, the angle distribution feature for extracting the voice zone is constructed based on the time-frequency feature of the input audio, and the frequency band segmentation is performed for both the time-frequency feature and the angle distribution feature, so that the output audio obtained by extracting the voice zone can be independently analyzed for different frequency bands, and the performance of extracting the voice zone is improved.
According to the method provided by the embodiment, the feature integration is performed on the similarity distribution sub-features corresponding to each frequency band, so that feature dimensions of features processed by the model can be effectively reduced, and the processing efficiency and the reasoning speed of the model can be improved.
The method provided by the embodiment also provides various feature integration implementation modes, so that the feature integration implementation mode with the best performance can be selected according to actual conditions to perform feature integration. The flexibility of feature integration is improved.
According to the method provided by the embodiment, the sub-features corresponding to each frequency band are mapped to the appointed dimension, so that the features corresponding to each frequency band can be spliced, and the subsequent feature extraction process can be conveniently executed.
According to the method provided by this embodiment, the features are modeled along both the time domain dimension and the frequency domain dimension, so that the output audio can be predicted according to the correlations between sequences, improving the accuracy of the predicted output audio.
According to the method provided by the embodiment, the mask is predicted on different frequency bands, so that the distribution of the output audio on the different frequency bands can be accurately predicted according to the characteristics of the different frequency bands.
According to the method provided by the embodiment, the first distribution characteristic and the second distribution characteristic are constructed in the process of predicting the output audio, so that the accuracy of predicting the output audio can be improved.
The method provided by the embodiment can reduce the number of the processed features by summing the second distribution features corresponding to each angle direction.
Under the condition that the method provided by the application is realized through the machine learning model, the following training method is also provided aiming at the training of the machine learning model. FIG. 11 is a flow chart of a model training method provided in an exemplary embodiment of the application. The method may be used with a computer device or a client on a computer device. As shown in fig. 11, the method includes:
step 1102: a time-frequency characteristic of a plurality of sample audio is acquired.
Each of the plurality of sample audio is acquired by one of the sound sensors in the sound sensor array, the plurality of sample audio including the first sample audio. The process of acquiring the time-frequency characteristic may be illustrated with reference to fig. 6, and the embodiment of the present application will not be described herein. It should be noted that the sound sensor array herein may be the same as or different from the sound sensor array in the implementation of the above-described audio extraction method.
Step 1104: sample angle distribution characteristics are determined according to time-frequency characteristics of a plurality of sample audios.
The sample angle distribution feature is used to characterize, for each sample audio, the audio from N angular directions within a sample specified angle range, N being a positive integer. The process of determining the sample angle distribution feature may be illustrated with reference to fig. 6, and is not described here again.
Step 1106: and segmenting the time-frequency characteristic of the first sample audio according to K frequency bands in the frequency domain dimension to obtain sample time-frequency sub-characteristics corresponding to the K frequency bands.
K is a positive integer greater than 1. The process of dividing the sample time-frequency sub-features can be illustrated with reference to fig. 8, and the embodiments of the present application are not described herein.
Step 1108: and cutting the sample angle distribution characteristics in the frequency domain dimension according to K frequency bands to obtain sample angle distribution sub-characteristics corresponding to the K frequency bands.
The process of dividing the angular distribution sub-feature of the sample may be illustrated with reference to fig. 8, and the embodiments of the present application are not described herein. The parameters used for the frequency band splitting are the same or different from the relevant parameters in the implementation of the above-described audio extraction method.
Step 1110: and carrying out feature extraction on the sample time-frequency sub-features corresponding to the K frequency bands and the sample angle distribution sub-features corresponding to the K frequency bands through a sound region extraction model to obtain sample feature extraction results corresponding to the K frequency bands.
The process of extracting the sample feature extraction result may be illustrated with reference to fig. 9, and the embodiment of the present application will not be described herein.
Step 1112: and extracting predicted audio of the first sample audio within a sample specified angle range according to sample feature extraction results corresponding to the K frequency bands through the voice zone extraction model.
The process of extracting the predicted audio may be illustrated with reference to fig. 10, and embodiments of the present application are not described herein. The sample specified angular range is the same as or different from the specified angular range in the implementation of the audio extraction method described above.
Step 1114: and training a sound zone extraction model according to errors of the real audio and the predicted audio.
The computer device can construct a loss function according to the error of the real audio and the predicted audio, and can train a sound zone extraction model according to the loss function. The real audio is audio manually extracted from the first sample audio based on a specified angular range.
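Illustratively, a minimal sketch of one possible training criterion is shown below; the embodiment does not fix the loss form, and the L1 waveform error, the `model` and `optimizer` names, and the training loop are assumptions purely for illustration.

```python
import torch

def waveform_l1_loss(predicted_audio, real_audio):
    """One possible error measure between the real audio and the predicted audio.

    The L1 waveform error is an assumption; SI-SDR or spectral losses would be
    equally plausible choices for constructing the loss function.
    """
    return torch.mean(torch.abs(predicted_audio - real_audio))

# Illustrative training step, assuming `model` maps sample features to predicted
# audio and `optimizer` is any torch optimizer (both names are hypothetical):
# loss = waveform_l1_loss(model(sample_features), real_audio)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```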
In summary, according to the method provided by the embodiment, by training the audio region extraction model, by constructing the angle distribution feature for audio region extraction based on the time-frequency feature of the input audio, and performing band segmentation for both the time-frequency feature and the angle distribution feature, the output audio obtained by audio region extraction can be analyzed separately for different frequency bands, so that the performance of audio region extraction is improved.
In a specific example, the method provided by the embodiment of the application is implemented through a soundfield extraction model. Referring to fig. 6, 8, 9, 10, the soundfield extraction model includes a first network (feature extraction network), a second network (band segmentation and subband modeling network), a third network (sequence and band modeling network), and a fourth network (mask estimation network).
(1) Feature extraction network: the network performs feature extraction of the speech signal. A complex spectrogram (complex-valued spectrogram) of the speech signal is extracted by using a short-time Fourier transform. The network extracts complex spectra of all channels in the multi-channel input, but only the reference microphone (reference mic) is used as a specific model input, and the other microphones are used only to calculate the spatial angle features described below.
For the spatial angle features, two classes of features are included: the inter-channel phase difference (IPD) and the angle region definition feature (query region feature). The definition of the IPD is detailed in the foregoing embodiments and is not repeated here. The calculation of the angle region definition feature depends on the similarity between the target angular phase difference (TPD) and the IPD, likewise as described in the foregoing embodiments, and is not detailed here.
Regarding feature dimensions: the dimension of the spectrogram Y of the reference microphone is 2×F×T, where F is the number of frequency points, T is the number of feature frames, and 2 represents the real part and the imaginary part of the complex spectrogram. The dimension of the IPD is M×F×T, where M is the number of microphone pairs (assuming 3 microphones, there can be up to 3 microphone pairs, i.e. (mic1, mic2), (mic1, mic3), (mic2, mic3); the model does not necessarily need to use all optional microphone pairs). The dimension of the angle region definition feature is N×F×T, where N is the number of angular directions selected in the specified angle range. First, the similarity between the TPD of each angular direction (dimension F×T) and the IPD of each microphone pair (dimension M×F×T) is calculated; the N resulting features (of dimension N×M×F×T at this point) are combined and then summed along the microphone-pair dimension (the M dimension), giving a feature of dimension N×F×T.
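Illustratively, the assembly of the N×F×T angle region definition feature from the IPD and the TPD can be sketched as follows; the cosine similarity is an assumed concrete form of the similarity measure.

```python
import numpy as np

def angle_region_feature(ipd_mft, tpd_nmf):
    """Assemble the N x F x T angle region definition feature described above.

    ipd_mft: observed phase differences (IPD), shape (M, F, T).
    tpd_nmf: theoretical phase differences (TPD) per direction and pair, shape (N, M, F).
    """
    # Broadcast to (N, M, F, T), compute the similarity, then sum over the M microphone pairs.
    sim = np.cos(ipd_mft[None, :, :, :] - tpd_nmf[:, :, :, None])
    return sim.sum(axis=1)  # (N, F, T)
```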
(2) Band segmentation and subband modeling network: the network divides the spectrogram and the angular zone definition feature into K sub-bands which are not overlapped with each other along the F dimension. In particular, for angular region definition features, N features per subband are feature integrated (aggregation) into one feature. The specific integration process can refer to fig. 7, and the embodiment of the present application is not described herein.
For the spectrogram of each sub-band, the segmented IPD and the integrated angle region definition feature, the network maps them into a feature space of the same dimension by using any linear or nonlinear transformation (fig. 8 shows a linear transformation consisting of layer normalization plus a fully connected layer), and then sums the mapped features corresponding to each sub-band to generate a unified feature for each sub-band that covers the spectral (time-frequency) feature, the IPD feature and the angle region definition feature.
For the sequence and band modeling network (3) and the mask estimation network (4), reference may be made to the content in the foregoing embodiments, and the embodiments of the present application are not described herein.
The voice zone extraction model provided by the embodiment of the present application can support streaming, variable-angle voice zone speech extraction. A voice zone extraction model supporting variable-angle extraction can flexibly support voice zone ranges of different angles, and can effectively detect the number of speakers in the voice zone (if there is no speaker, audio with extremely low energy is output, from which it can be judged that no one is present) and extract their speech.
Table 1 shows the performance of the soundfield extraction model and other models provided by the embodiment of the present application in a soundfield extraction task.
TABLE 1
As shown in table 1, model 1 is the ResLSTM model, and model 2 is the model provided by the embodiments of the present application. SA indicates the accuracy (in percent) with which the model detects that no speaker is present in the specified angle range when indeed no speaker is present; higher is better. Q=1 and Q=2 represent scenes in which there are 1 and 2 speakers, respectively, in the specified angle range. SDR is a signal-to-distortion metric expressing the quality of the model output; higher is better. STOI and PESQ are speech intelligibility and perceptual quality metrics; higher is better. #param is the number of model parameters, and MACs is the model complexity. ResLSTM is a baseline model that uses the same IPD and angle region definition features but performs no band segmentation and includes neither the feature integration module nor the inter-band modeling module. The suffix of model 2 indicates different model parameter configurations.
It can be seen that model 2-XXS has a model complexity similar to ResLSTM and significantly fewer parameters, yet its performance is significantly better than ResLSTM on all evaluation metrics; model 2-XXXS has only 40% of the complexity of ResLSTM and only 10% of its parameters, yet its performance is on par with ResLSTM; and as the model parameters are increased (-M, -S, -XS), the model provided by the embodiments of the present application further improves its performance steadily.
It should be noted that, before collecting relevant data of a user (for example, input audio in the present application) and during collecting relevant data of a user, the present application may display a prompt interface, a pop-up window or output voice prompt information, where the prompt interface, the pop-up window or the voice prompt information is used to prompt the user to collect relevant data currently, so that the present application only starts to execute the relevant step of obtaining relevant data of the user after obtaining the confirmation operation of the user on the prompt interface or the pop-up window, otherwise (i.e., when the confirmation operation of the user on the prompt interface or the pop-up window is not obtained), ends the relevant step of obtaining relevant data of the user, i.e., does not obtain relevant data of the user. In other words, all user data collected by the present application is collected with the consent and authorization of the user, and the collection, use and processing of relevant user data requires compliance with relevant laws and regulations and standards of the relevant country and region.
It should be noted that, the sequence of the steps of the method provided in the embodiment of the present application may be appropriately adjusted, the steps may also be increased or decreased according to the situation, and any method that is easily conceivable to be changed by those skilled in the art within the technical scope of the present disclosure should be covered within the protection scope of the present disclosure, so that no further description is given.
Fig. 12 is a schematic structural view of an audio extraction apparatus according to an exemplary embodiment of the present application. As shown in fig. 12, the apparatus includes:
the feature extraction module 1201 is configured to obtain time-frequency features of a plurality of input audio, where each of the plurality of input audio is acquired by one of the sound sensors in the sound sensor array, and the plurality of input audio includes a first input audio. In some embodiments, feature extraction module 1201 is implemented by the first network in fig. 6.
The feature extraction module 1201 is further configured to determine an angle distribution feature according to the time-frequency features of the plurality of input audio, where the angle distribution feature is used to characterize audio in N angular directions of each input audio within a specified angle range, and N is a positive integer.
The frequency band dividing module 1202 is configured to segment the time-frequency characteristic of the first input audio in a frequency domain dimension according to K frequency bands to obtain time-frequency sub-characteristics corresponding to the K frequency bands, where K is a positive integer greater than 1; and cutting the angle distribution characteristics in the frequency domain dimension according to the K frequency bands to obtain angle distribution sub-characteristics corresponding to the K frequency bands. In some embodiments, band partitioning module 1202 is implemented by the second network in fig. 8.
The feature modeling module 1203 is configured to perform feature extraction on the time-frequency sub-features corresponding to the K frequency bands and the angle distribution sub-features corresponding to the K frequency bands, so as to obtain feature extraction results corresponding to the K frequency bands. In some embodiments, feature modeling module 1203 is implemented via third network in fig. 9.
And a mask estimation module 1204, configured to extract output audio of the first input audio within the specified angle range according to the feature extraction results corresponding to the K frequency bands. In some embodiments, mask estimation module 1204 is implemented through a fourth network in fig. 10.
In an alternative design, the angular distribution features include a first distribution feature and a second distribution feature, where the first distribution feature includes a first phase difference of input audio corresponding to each two sound sensors, and the second distribution feature is used to reflect similarity between the first phase difference and a second phase difference, where the second phase difference is a phase difference of sampling pulse signals of N angular directions in the specified angular range for each two sound sensors; the band division module 1202 includes:
and the first dividing sub-module 12021 is configured to segment the first distribution feature in the frequency domain dimension according to the K frequency bands, so as to obtain first distribution sub-features corresponding to the K frequency bands.
And a second dividing sub-module 12022, configured to divide the second distribution feature in the frequency domain dimension according to the K frequency bands, obtain similarity distribution sub-features corresponding to the K frequency bands, where each frequency band in the K frequency bands has N similarity distribution sub-features, and the N similarity distribution sub-features are in one-to-one correspondence with the N angle directions.
In some embodiments, the band division module 1202 further includes a third division sub-module 12023, configured to divide the time-frequency characteristic of the first input audio according to K frequency bands in a frequency domain dimension, so as to obtain time-frequency sub-characteristics corresponding to the K frequency bands.
And an integration sub-module 12024, configured to perform feature integration on the N similarity distribution sub-features corresponding to each of the K frequency bands, so as to obtain second distribution sub-features corresponding to the K frequency bands. In some embodiments, the integration sub-module 12024 is implemented by a module corresponding to the different implementations in fig. 7.
The band division module 1202 is further configured to determine a first distribution sub-feature corresponding to the K frequency bands and a second distribution sub-feature corresponding to the K frequency bands as angle distribution sub-features corresponding to the K frequency bands.
In an alternative design, the integration submodule 12024 is configured to:
and splicing the N similarity distribution sub-features corresponding to each frequency band in the K frequency bands to obtain second distribution sub-features corresponding to the K frequency bands.
In an alternative design, the integration submodule 12024 is configured to:
performing linear or nonlinear feature transformation on the N similarity distribution sub-features corresponding to each of the K frequency bands to obtain N feature transformation results corresponding to each of the K frequency bands;
and splicing N characteristic transformation results corresponding to each frequency band in the K frequency bands to obtain second distribution sub-characteristics corresponding to the K frequency bands.
In an alternative design, the integration submodule 12024 is configured to:
performing linear or nonlinear feature transformation on the N similarity distribution sub-features corresponding to each of the K frequency bands to obtain N feature transformation results corresponding to each of the K frequency bands;
and determining average characteristics of N characteristic transformation results corresponding to each frequency band in the K frequency bands to obtain second distribution sub-characteristics corresponding to the K frequency bands.
In an alternative design, the integration submodule 12024 is configured to:
Extracting cascade features from the N similarity distribution sub-features corresponding to each of the K frequency bands according to the sequence of the angle directions to obtain second distribution sub-features corresponding to the K frequency bands;
the cascade feature extraction comprises feature extraction of a cascade feature extraction result of the previous i level and similarity distribution sub-features of the (i+1) th level, so that a cascade feature extraction result of the previous i+1 level is obtained, and i is a positive integer.
In an alternative design, the frequency band dividing module further includes:
a first mapping submodule 12025, configured to map a time-frequency sub-feature corresponding to each of the K frequency bands to a specified dimension, so as to obtain a first sequence feature corresponding to the K frequency bands; a second mapping sub-module 12026, configured to map the first distribution sub-feature corresponding to each of the K frequency bands to the specified dimension, so as to obtain a second sequence feature corresponding to the K frequency bands; and a third mapping sub-module 12027, configured to map the second distribution sub-feature corresponding to each of the K frequency bands to the specified dimension, so as to obtain a third sequence feature corresponding to the K frequency bands. In some embodiments, the first mapping sub-module 12025 is implemented by the corresponding layer specification layer and full connection layer in fig. 8. In some embodiments, the second mapping sub-module 12026 is implemented by the corresponding layer specification layer and full connection layer in fig. 8. In some embodiments, the third mapping sub-module 12027 is implemented by the corresponding layer specification layer and full connection layer in fig. 8.
And a merging submodule 12028, configured to splice the first sequence feature corresponding to the K frequency bands, the second sequence feature corresponding to the K frequency bands, and the third sequence feature corresponding to the K frequency bands, so as to obtain a splice feature corresponding to the K frequency bands. In some embodiments, the merge sub-module 12028 is implemented by the module in FIG. 8 that performs the merge.
The feature modeling module 1203 is configured to perform feature extraction on the spliced features corresponding to the K frequency bands, so as to obtain feature extraction results corresponding to the K frequency bands.
In an alternative design, the feature modeling module 1203 includes:
and a first modeling submodule 12031, configured to model the splicing features corresponding to the K frequency bands along a first dimension, so as to obtain first modeling sequence features corresponding to the K frequency bands, where the first modeling sequence features carry correlations between features of the splicing features at different positions of the first dimension. In some embodiments, the first modeling sub-module 12031 is implemented by the sub-network 902 in fig. 9.
And a second modeling submodule 12032, configured to perform modeling on the first modeling sequence features corresponding to the K frequency bands along a second dimension, so as to obtain second modeling sequence features corresponding to the K frequency bands, where the second modeling sequence features carry correlations between features of the first modeling sequence features at different positions in the second dimension. In some embodiments, the second modeling sub-module 12032 is implemented by the sub-network 903 in fig. 9.
The feature modeling module 1203 is configured to determine the features of the second modeling sequence corresponding to the K frequency bands as feature extraction results corresponding to the K frequency bands.
The first dimension is a time dimension, and the second dimension is a frequency band dimension; or, the first dimension is the frequency band dimension, and the second dimension is the time dimension.
In an alternative design, the mask estimation module 1204 is configured to:
predicting masks corresponding to the K frequency bands through second modeling sequence features corresponding to the K frequency bands, wherein the mask corresponding to each frequency band in the K frequency bands is used for indicating the duty ratio of the output audio at different time frequency positions of the first input audio on each frequency band;
combining masks corresponding to each frequency band in the K frequency bands to obtain combined masks;
and determining the output audio according to the first input audio and the merging mask.
In an alternative design, the feature extraction module 1201 is configured to:
determining one or more pairs of sound sensors in the sound sensor array, the pairs of sound sensors comprising a first sound sensor and a second sound sensor;
Determining the first phase difference according to the time-frequency characteristic corresponding to the first sound sensor and the time-frequency characteristic corresponding to the second sound sensor;
determining the second phase difference of the first sound sensor and the second sound sensor for sampling pulse signals in N angle directions in the specified angle range;
determining a similarity of the first phase difference and the second phase difference;
and determining the second distribution characteristic according to the similarity.
In an alternative design, the feature extraction module 1201 is configured to:
summing M second distribution characteristics corresponding to each of the N angular directions;
wherein M represents the number of the sound sensor pairs, and M is a positive integer.
It should be noted that: the audio extraction device provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to perform all or part of the functions described above. In addition, the audio extraction apparatus and the audio extraction method provided in the foregoing embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
Embodiments of the present application also provide a computer device comprising: the audio extraction method comprises a processor and a memory, wherein at least one instruction, at least one section of program, a code set or an instruction set is stored in the memory, and the at least one instruction, the at least one section of program, the code set or the instruction set is loaded and executed by the processor to realize the audio extraction method provided by each method embodiment.
Optionally, the computer device is a server. Illustratively, fig. 13 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
The computer apparatus 1300 includes a central processing unit (Central Processing Unit, CPU) 1301, a system Memory 1304 including a random access Memory (Random Access Memory, RAM) 1302 and a Read-Only Memory (ROM) 1303, and a system bus 1305 connecting the system Memory 1304 and the central processing unit 1301. The computer device 1300 also includes a basic Input/Output system (I/O) 1306 to facilitate the transfer of information between various devices within the computer device, and a mass storage device 1307 for storing an operating system 1313, application programs 1314, and other program modules 1315.
The basic input/output system 1306 includes a display 1308 for displaying information, and an input device 1309, such as a mouse, keyboard, etc., for a user to input information. Wherein the display 1308 and the input device 1309 are connected to the central processing unit 1301 through an input output controller 1310 connected to the system bus 1305. The basic input/output system 1306 may also include an input/output controller 1310 for receiving and processing input from a keyboard, mouse, or electronic stylus, among a plurality of other devices. Similarly, the input output controller 1310 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1307 is connected to the central processing unit 1301 through a mass storage controller (not shown) connected to the system bus 1305. The mass storage device 1307 and its associated computer-readable storage media provide non-volatile storage for the computer device 1300. That is, the mass storage device 1307 may include a computer-readable storage medium (not shown) such as a hard disk or a compact disk-Only (CD-ROM) drive.
The computer-readable storage medium may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable storage instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, erasable programmable read-Only register (Erasable Programmable Read Only Memory, EPROM), electrically erasable programmable read-Only Memory (EEPROM), flash Memory or other solid state Memory devices, CD-ROM, digital versatile disks (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to the one described above. The system memory 1304 and mass storage device 1307 described above may be referred to collectively as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1301, the one or more programs containing instructions for implementing the above-described method embodiments, the central processing unit 1301 executing the one or more programs to implement the methods provided by the various method embodiments described above.
According to various embodiments of the present application, the computer device 1300 may also operate by connecting, through a network such as the Internet, to a remote computer device on the network. That is, the computer device 1300 may be connected to the network 1312 through the network interface unit 1311 coupled to the system bus 1305, or the network interface unit 1311 may be used to connect to other types of networks or remote computer device systems (not shown).
The memory also includes one or more programs stored in the memory, the one or more programs including steps for performing the methods provided by the embodiments of the present application, as performed by the computer device.
The embodiment of the application also provides a computer readable storage medium, wherein at least one instruction, at least one section of program, code set or instruction set is stored in the computer readable storage medium, and when the at least one instruction, the at least one section of program, the code set or the instruction set is loaded and executed by a processor of computer equipment, the audio extraction method provided by each method embodiment is realized.
The present application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the audio extraction method provided by the above-mentioned method embodiments.
Those of ordinary skill in the art will appreciate that all or a portion of the steps implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the above mentioned computer readable storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and principles of the application are intended to be included within the scope of the application.

Claims (15)

1. An audio extraction method, the method comprising:
acquiring time-frequency characteristics of a plurality of input audios, wherein each input audio in the plurality of input audios is acquired through one sound sensor in a sound sensor array, and the plurality of input audios comprise a first input audio;
determining angle distribution features according to the time-frequency features of the plurality of input audios, wherein the angle distribution features are used for representing the proportion, in each input audio, of the audio from N angle directions within a specified angle range, and N is a positive integer;
dividing the time-frequency feature of the first input audio in a frequency domain dimension according to K frequency bands to obtain time-frequency sub-features corresponding to the K frequency bands, wherein K is a positive integer greater than 1; dividing the angle distribution features in the frequency domain dimension according to the K frequency bands to obtain angle distribution sub-features corresponding to the K frequency bands;
performing feature extraction on the time-frequency sub-features corresponding to the K frequency bands and the angle distribution sub-features corresponding to the K frequency bands to obtain feature extraction results corresponding to the K frequency bands;
and extracting output audio of the first input audio within the specified angle range according to the feature extraction results corresponding to the K frequency bands.
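By way of non-limiting illustration only, the following Python sketch shows one possible way to obtain the time-frequency feature of a single input audio and divide it along the frequency axis into K band-wise sub-features; the STFT parameters and the band boundaries are assumptions made for the example, not limitations of the claim.

import numpy as np
from scipy.signal import stft

def band_split(waveform, sr=16000, n_fft=512, hop=256, band_edges=None):
    # Time-frequency feature of one input audio: complex STFT of shape (F, T).
    _, _, spec = stft(waveform, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    if band_edges is None:
        band_edges = [0, 32, 64, 128, spec.shape[0]]   # K = 4 hypothetical, non-uniform bands
    # Time-frequency sub-features, one per frequency band; the same slicing along
    # the frequency axis can be applied to the angle distribution features.
    return [spec[band_edges[k]:band_edges[k + 1]] for k in range(len(band_edges) - 1)]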
2. The method of claim 1, wherein the angle distribution features comprise a first distribution feature and a second distribution feature, the first distribution feature comprising a first phase difference between the input audios corresponding to each pair of sound sensors, and the second distribution feature reflecting a similarity between the first phase difference and a second phase difference, the second phase difference being the phase difference with which pulse signals from the N angle directions within the specified angle range are sampled by each pair of sound sensors;
the dividing the angle distribution features in the frequency domain dimension according to the K frequency bands to obtain the angle distribution sub-features corresponding to the K frequency bands comprises:
dividing the first distribution feature in the frequency domain dimension according to the K frequency bands to obtain first distribution sub-features corresponding to the K frequency bands;
dividing the second distribution feature in the frequency domain dimension according to the K frequency bands to obtain similarity distribution sub-features corresponding to the K frequency bands, wherein each frequency band in the K frequency bands has N similarity distribution sub-features, and the N similarity distribution sub-features are in one-to-one correspondence with the N angle directions;
performing feature integration on the N similarity distribution sub-features corresponding to each frequency band in the K frequency bands to obtain second distribution sub-features corresponding to the K frequency bands;
and determining the first distribution sub-feature corresponding to the K frequency bands and the second distribution sub-feature corresponding to the K frequency bands as the angle distribution sub-feature corresponding to the K frequency bands.
3. The method of claim 2, wherein the performing feature integration on the N similarity distribution sub-features corresponding to each of the K frequency bands to obtain second distribution sub-features corresponding to the K frequency bands comprises:
splicing the N similarity distribution sub-features corresponding to each frequency band in the K frequency bands to obtain the second distribution sub-features corresponding to the K frequency bands.
4. The method of claim 2, wherein the performing feature integration on the N similarity distribution sub-features corresponding to each of the K frequency bands to obtain second distribution sub-features corresponding to the K frequency bands comprises:
performing linear or nonlinear feature transformation on the N similarity distribution sub-features corresponding to each of the K frequency bands to obtain N feature transformation results corresponding to each of the K frequency bands;
splicing the N feature transformation results corresponding to each frequency band in the K frequency bands to obtain the second distribution sub-features corresponding to the K frequency bands.
5. The method of claim 2, wherein the performing feature integration on the N similarity distribution sub-features corresponding to each of the K frequency bands to obtain second distribution sub-features corresponding to the K frequency bands comprises:
performing linear or nonlinear feature transformation on the N similarity distribution sub-features corresponding to each of the K frequency bands to obtain N feature transformation results corresponding to each of the K frequency bands;
and determining average features of the N feature transformation results corresponding to each frequency band in the K frequency bands to obtain the second distribution sub-features corresponding to the K frequency bands.
6. The method of claim 2, wherein the performing feature integration on the N similarity distribution sub-features corresponding to each of the K frequency bands to obtain second distribution sub-features corresponding to the K frequency bands comprises:
performing cascade feature extraction on the N similarity distribution sub-features corresponding to each of the K frequency bands in the order of the angle directions to obtain the second distribution sub-features corresponding to the K frequency bands;
wherein the cascade feature extraction comprises performing feature extraction on the cascade feature extraction result of the first i levels and the similarity distribution sub-feature of the (i+1)-th level to obtain the cascade feature extraction result of the first i+1 levels, where i is a positive integer.
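Purely as an illustrative sketch of the alternative integration schemes of claims 3 to 6 (the tensor shapes, the linear projection, and the GRU-based cascade extractor are assumptions, not features recited in the claims), each band's N per-direction similarity distribution sub-features, each shaped (T, C), might be integrated as follows:

import torch
import torch.nn as nn

def integrate_concat(sub_feats):                       # claim 3: splice along the feature axis
    return torch.cat(sub_feats, dim=-1)                # (T, N*C)

class IntegrateTransform(nn.Module):                   # claims 4 and 5: transform, then splice or average
    def __init__(self, c_in, c_out, average=False):
        super().__init__()
        self.proj = nn.Linear(c_in, c_out)             # linear feature transformation (could be nonlinear)
        self.average = average
    def forward(self, sub_feats):                      # list of N tensors, each (T, c_in)
        transformed = [self.proj(f) for f in sub_feats]
        if self.average:                               # claim 5: average the N transformation results
            return torch.stack(transformed).mean(dim=0)
        return torch.cat(transformed, dim=-1)          # claim 4: splice the N transformation results

class IntegrateCascade(nn.Module):                     # claim 6: cascade over the ordered directions
    def __init__(self, c_in, hidden):
        super().__init__()
        self.rnn = nn.GRU(c_in, hidden, batch_first=True)
    def forward(self, sub_feats):                      # sub_feats ordered by angle direction
        state = None
        for f in sub_feats:                            # fold direction i+1 into the result of the first i levels
            out, state = self.rnn(f.unsqueeze(0), state)
        return out.squeeze(0)                          # (T, hidden)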
7. The method according to any one of claims 2 to 6, wherein the performing feature extraction on the time-frequency sub-features corresponding to the K frequency bands and the angle distribution sub-features corresponding to the K frequency bands to obtain feature extraction results corresponding to the K frequency bands includes:
mapping the time-frequency sub-features corresponding to each frequency band in the K frequency bands to a designated dimension to obtain first sequence features corresponding to the K frequency bands; mapping the first distribution sub-feature corresponding to each frequency band in the K frequency bands to the designated dimension to obtain second sequence features corresponding to the K frequency bands; mapping the second distribution sub-feature corresponding to each frequency band in the K frequency bands to the designated dimension to obtain third sequence features corresponding to the K frequency bands;
splicing the first sequence features corresponding to the K frequency bands, the second sequence features corresponding to the K frequency bands and the third sequence features corresponding to the K frequency bands to obtain spliced features corresponding to the K frequency bands;
and performing feature extraction on the spliced features corresponding to the K frequency bands to obtain feature extraction results corresponding to the K frequency bands.
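As a non-limiting sketch of claim 7 (the designated dimension D and the use of per-band linear layers are assumptions), each band's three kinds of sub-features may be mapped to a common dimension and spliced as follows:

import torch
import torch.nn as nn

class BandFusion(nn.Module):
    def __init__(self, tf_dims, first_dims, second_dims, D=64):
        super().__init__()
        # One linear map per band and per feature type, so that sub-features of
        # different band widths all land in the same designated dimension D.
        self.tf_proj = nn.ModuleList(nn.Linear(d, D) for d in tf_dims)
        self.first_proj = nn.ModuleList(nn.Linear(d, D) for d in first_dims)
        self.second_proj = nn.ModuleList(nn.Linear(d, D) for d in second_dims)

    def forward(self, tf_subs, first_subs, second_subs):   # lists of K tensors, each (T, dim_k)
        spliced = []
        for k in range(len(tf_subs)):
            a = self.tf_proj[k](tf_subs[k])                 # first sequence feature,  (T, D)
            b = self.first_proj[k](first_subs[k])           # second sequence feature, (T, D)
            c = self.second_proj[k](second_subs[k])         # third sequence feature,  (T, D)
            spliced.append(torch.cat([a, b, c], dim=-1))    # spliced feature of band k, (T, 3D)
        return torch.stack(spliced)                         # (K, T, 3D)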
8. The method of claim 7, wherein the performing feature extraction on the spliced features corresponding to the K frequency bands to obtain feature extraction results corresponding to the K frequency bands includes:
modeling the spliced features corresponding to the K frequency bands along a first dimension to obtain first modeling sequence features corresponding to the K frequency bands, wherein the first modeling sequence features carry correlations among features of the spliced features at different positions of the first dimension;
modeling the first modeling sequence features corresponding to the K frequency bands along a second dimension to obtain second modeling sequence features corresponding to the K frequency bands, wherein the second modeling sequence features carry correlations among features of the first modeling sequence features at different positions of the second dimension;
determining the second modeling sequence features corresponding to the K frequency bands as the feature extraction results corresponding to the K frequency bands;
the first dimension is a time dimension, and the second dimension is a frequency band dimension; or, the first dimension is the frequency band dimension, and the second dimension is the time dimension.
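A minimal sketch of the two-stage modeling of claim 8, assuming the spliced features are arranged as a (K, T, C) tensor; the LSTMs and residual connections are illustrative choices of sequence model, and the order of the two dimensions may be swapped as the claim allows.

import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    # Illustrative only: model the spliced band features first along the time
    # dimension, then along the frequency band dimension.
    def __init__(self, dim):
        super().__init__()
        self.time_rnn = nn.LSTM(dim, dim, batch_first=True)
        self.band_rnn = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.band_proj = nn.Linear(2 * dim, dim)

    def forward(self, x):                      # x: (K, T, C) spliced features
        t_out, _ = self.time_rnn(x)            # sequence over time within each band
        first = t_out + x                      # first modeling sequence features
        b_in = first.transpose(0, 1)           # (T, K, C): sequence over the K bands at each frame
        b_out, _ = self.band_rnn(b_in)
        second = self.band_proj(b_out).transpose(0, 1) + first   # second modeling sequence features
        return second                          # (K, T, C), used as the feature extraction result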
9. The method according to claim 8, wherein the extracting output audio of the first input audio within the specified angle range according to the feature extraction results corresponding to the K frequency bands comprises:
predicting masks corresponding to the K frequency bands from the second modeling sequence features corresponding to the K frequency bands, wherein the mask corresponding to each frequency band in the K frequency bands is used for indicating the proportion of the output audio at different time-frequency positions of the first input audio in that frequency band;
combining masks corresponding to each frequency band in the K frequency bands to obtain combined masks;
and determining the output audio according to the first input audio and the merging mask.
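As an illustrative sketch of claim 9 (the sigmoid mask head, the magnitude-domain mask, and the layer sizes are assumptions), one mask per band may be predicted from that band's second modeling sequence feature, merged along the frequency axis, and applied to the STFT of the first input audio:

import torch
import torch.nn as nn

class MaskHead(nn.Module):
    # Illustrative only: one small head per band predicts that band's mask; the
    # K masks are merged along frequency and applied to the first input audio.
    def __init__(self, feat_dim, band_widths):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(feat_dim, w) for w in band_widths)

    def forward(self, band_feats, first_input_stft):
        # band_feats: list of K tensors (T, feat_dim); first_input_stft: (F, T) magnitudes.
        masks = [torch.sigmoid(h(f)).transpose(0, 1) for h, f in zip(self.heads, band_feats)]
        merged = torch.cat(masks, dim=0)        # (F, T) merged mask, F = sum of band widths
        return merged * first_input_stft        # masked spectrogram; an inverse STFT yields the output audio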
10. The method according to any one of claims 2 to 6, further comprising:
determining one or more pairs of sound sensors in the sound sensor array, the pairs of sound sensors comprising a first sound sensor and a second sound sensor;
determining the first phase difference according to the time-frequency feature corresponding to the first sound sensor and the time-frequency feature corresponding to the second sound sensor;
determining the second phase difference with which the first sound sensor and the second sound sensor sample pulse signals from the N angle directions within the specified angle range;
determining a similarity of the first phase difference and the second phase difference;
and determining the second distribution feature according to the similarity.
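By way of example only (the two-sensor far-field geometry, the constants, and the cosine similarity are illustrative assumptions), the first phase difference, the second phase difference, and their similarity for one sound sensor pair may be computed as follows:

import numpy as np

def directional_similarity(spec_a, spec_b, mic_dist, angles_deg, sr=16000, n_fft=512, c=343.0):
    # spec_a, spec_b: complex STFTs (F, T) of the two sensors of one pair.
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)                 # (F,)
    first_pd = np.angle(spec_a) - np.angle(spec_b)             # observed first phase difference, (F, T)
    sims = []
    for theta in np.deg2rad(angles_deg):                       # N candidate angle directions
        tdoa = mic_dist * np.cos(theta) / c                    # time difference of arrival for that direction
        second_pd = 2.0 * np.pi * freqs * tdoa                 # theoretical second phase difference, (F,)
        sims.append(np.cos(first_pd - second_pd[:, None]))     # similarity per time-frequency bin, (F, T)
    return np.stack(sims, axis=0)                              # (N, F, T) second distribution feature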
11. The method according to claim 10, wherein the method further comprises:
summing M second distribution features corresponding to each of the N angle directions;
wherein M represents the number of the sound sensor pairs, and M is a positive integer.
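A short sketch of claim 11, assuming the per-pair similarities are stacked into an array of shape (M, N, F, T):

import numpy as np

def sum_over_pairs(pair_similarities):
    # pair_similarities: (M, N, F, T), one similarity map per sound sensor pair.
    # Summing over the M pairs yields one second distribution feature per direction.
    return np.asarray(pair_similarities).sum(axis=0)    # (N, F, T)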
12. An audio extraction apparatus, the apparatus comprising:
the device comprises a feature extraction module, a frequency band dividing module, a feature modeling module and a mask estimation module, wherein the feature extraction module is used for acquiring time-frequency features of a plurality of input audios, each input audio in the plurality of input audios is acquired through one sound sensor in a sound sensor array, and the plurality of input audios comprise a first input audio;
the feature extraction module is further used for determining angle distribution features according to the time-frequency features of the plurality of input audios, wherein the angle distribution features are used for representing the proportion, in each input audio, of the audio from N angle directions within a specified angle range, and N is a positive integer;
the frequency band dividing module is used for dividing the time-frequency feature of the first input audio in a frequency domain dimension according to K frequency bands to obtain time-frequency sub-features corresponding to the K frequency bands, wherein K is a positive integer greater than 1; and dividing the angle distribution features in the frequency domain dimension according to the K frequency bands to obtain angle distribution sub-features corresponding to the K frequency bands;
the feature modeling module is used for carrying out feature extraction on the time-frequency sub-features corresponding to the K frequency bands and the angle distribution sub-features corresponding to the K frequency bands to obtain feature extraction results corresponding to the K frequency bands;
and the mask estimation module is used for extracting output audio of the first input audio within the specified angle range according to the feature extraction results corresponding to the K frequency bands.
13. A computer device comprising a processor and a memory, wherein the memory has stored therein at least one program that is loaded and executed by the processor to implement the audio extraction method of any of claims 1 to 11.
14. A computer readable storage medium having stored therein at least one program loaded and executed by a processor to implement the audio extraction method of any one of claims 1 to 11.
15. A computer program product, characterized in that the computer program product comprises computer instructions stored in a computer-readable storage medium, from which computer instructions a processor of a computer device reads, the processor executing the computer instructions, causing the computer device to perform the audio extraction method according to any one of claims 1 to 11.
CN202311045708.4A 2023-08-18 2023-08-18 Audio extraction method, device, equipment and storage medium Pending CN116959470A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311045708.4A CN116959470A (en) 2023-08-18 2023-08-18 Audio extraction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311045708.4A CN116959470A (en) 2023-08-18 2023-08-18 Audio extraction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116959470A true CN116959470A (en) 2023-10-27

Family

ID=88446342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311045708.4A Pending CN116959470A (en) 2023-08-18 2023-08-18 Audio extraction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116959470A (en)

Similar Documents

Publication Publication Date Title
CN108564963B (en) Method and apparatus for enhancing voice
WO2021196905A1 (en) Voice signal dereverberation processing method and apparatus, computer device and storage medium
CN110459241B (en) Method and system for extracting voice features
CN110880329B (en) Audio identification method and equipment and storage medium
US11074925B2 (en) Generating synthetic acoustic impulse responses from an acoustic impulse response
CN104995679A (en) Signal source separation
EP4266308A1 (en) Voice extraction method and apparatus, and electronic device
CN115602165A (en) Digital staff intelligent system based on financial system
US20190172477A1 (en) Systems and methods for removing reverberation from audio signals
CN112397090B (en) Real-time sound classification method and system based on FPGA
WO2023226572A1 (en) Feature representation extraction method and apparatus, device, medium and program product
CN112735466A (en) Audio detection method and device
CN115426582B (en) Earphone audio processing method and device
CN116959470A (en) Audio extraction method, device, equipment and storage medium
CN114822457A (en) Music score determination method and device, electronic equipment and computer readable medium
CN114283833A (en) Speech enhancement model training method, speech enhancement method, related device and medium
Lluís et al. Points2Sound: from mono to binaural audio using 3D point cloud scenes
JPWO2020066542A1 (en) Acoustic object extraction device and acoustic object extraction method
CN115116460B (en) Audio signal enhancement method, device, apparatus, storage medium and program product
CN116959478A (en) Sound source separation method, device, equipment and storage medium
Li et al. MAF-Net: multidimensional attention fusion network for multichannel speech separation
CN115910047B (en) Data processing method, model training method, keyword detection method and equipment
Singh pyAudioProcessing: Audio Processing, Feature Extraction, and Machine Learning Modeling.
US20230154480A1 (en) Adl-ufe: all deep learning unified front-end system
Jiang et al. A Complex Neural Network Adaptive Beamforming for Multi-channel Speech Enhancement in Time Domain

Legal Events

Date Code Title Description
PB01 Publication