CN116092501B - Speech enhancement method, speech recognition method, speaker recognition method and speaker recognition system - Google Patents


Info

Publication number
CN116092501B
CN116092501B (application CN202310238080.3A, published as CN202310238080A)
Authority
CN
China
Prior art keywords
voice
target
speech
module
speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310238080.3A
Other languages
Chinese (zh)
Other versions
CN116092501A (en)
Inventor
柯登峰
聂帅
刘文举
梁山
罗琪
胡睿欣
姚文翰
舒文涛
王运峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Weiou Technology Co ltd
Original Assignee
Shenzhen Weiou Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Weiou Technology Co ltd filed Critical Shenzhen Weiou Technology Co ltd
Priority to CN202310238080.3A priority Critical patent/CN116092501B/en
Publication of CN116092501A publication Critical patent/CN116092501A/en
Application granted granted Critical
Publication of CN116092501B publication Critical patent/CN116092501B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
      • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L17/00 - Speaker identification or verification techniques
            • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
            • G10L17/04 - Training, enrolment or model building
            • G10L17/06 - Decision making techniques; Pattern matching strategies
              • G10L17/14 - Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
            • G10L17/18 - Artificial neural networks; Connectionist approaches
          • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
            • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
              • G10L21/0208 - Noise filtering
                • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
                  • G10L21/0232 - Processing in the frequency domain
                  • G10L2021/02161 - Number of inputs available containing the signal or the noise to be suppressed
                    • G10L2021/02165 - Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal
                • G10L21/0264 - Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
          • Y02T10/00 - Road transport of goods or passengers
            • Y02T10/10 - Internal combustion engine [ICE] based vehicles
              • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention relates to the technical field of speech recognition and discloses a speech enhancement method, a speech recognition method, a speaker recognition method and corresponding systems. The method comprises: generating dual-microphone far-field noisy speech from clean speech, clean noise and diffuse noise; generating a plurality of target speech signals from the clean speech and recording their azimuths; uniformly dividing the spatial azimuth range into a plurality of target regions; labeling the target speech azimuths with these regions to obtain tagged target speech azimuths; extracting features of the dual-microphone far-field noisy speech for each tagged azimuth to obtain the features of each target region; constructing a masking neural speech enhancement model; training the model on the per-region features, the target speech and the tagged azimuths; and enhancing speech signals with the trained model. The method selectively attends to the target speech direction and thereby achieves speech enhancement.

Description

Speech enhancement method, speech recognition method, speaker recognition method and speaker recognition system
Technical Field
The invention relates to the technical field of multi-channel speech recognition with microphone arrays, and in particular to a speech enhancement method, a speech recognition method, a speaker recognition method and a speaker recognition system.
Background
Under far-field conditions, speech signals are easily corrupted by noise and reverberation, which severely degrades applications such as voice communication and speech recognition. Multi-channel speech enhancement has been shown to improve speech intelligibility, perceptual quality and far-field speech recognition performance significantly compared with single-channel enhancement. However, when the position of the target sound source is unknown or moving, target speech enhancement remains a very challenging task. Although many enhancement methods, such as MVDR and PMWF, do not require prior knowledge of the target source position, their performance depends heavily on estimating a covariance matrix for each frequency band and computing its inverse, which is often difficult and time-consuming.
The azimuth of the target speech is an important cue for improving enhancement performance. On the one hand, directional beamforming can enhance signals from the target direction while suppressing signals from other directions; it has been shown to suppress noise, effectively avoid speech distortion and markedly improve recognition performance. On the other hand, when the target source position is known, rich directional information can be exploited to improve enhancement. Sound source localization is therefore usually regarded as an indispensable module in many speech enhancement systems. Such systems typically localize the source with a signal obtained before enhancement, for example a wake-up-word audio clip. However, real-time sound source localization is very difficult, especially in reverberant scenes or when the source is moving, and inaccurate position estimates can cause the enhancement performance to drop sharply.
For scenes where the source position is unknown or moving, a spatial attention mechanism can selectively attend to the target source direction and is a promising approach to target speech enhancement in such scenes. However, existing spatial attention mechanisms lack effective target guidance, so their selective attention to the source azimuth is inaccurate and unstable. It is therefore necessary to study target-direction-guided spatial attention to improve its performance and realize multi-channel speech enhancement when the target direction is unknown.
Disclosure of Invention
In view of the foregoing, a first object of the present invention is to provide a speech enhancement method that uses target-speech-guided spatial attention to selectively attend to the target speech direction and performs weighted fusion of the directional and spectral information extracted from a plurality of sampled directions, thereby enhancing the speech signal from the target direction.
A second object of the invention is to provide a speech enhancement system.
A third object of the present invention is to provide a speech recognition method that employs the speech enhancement system together with a speech recognition model (i.e., a speech recognition module); the speech enhancement system uses target-speech-guided spatial attention to selectively attend to the target speech direction and performs weighted fusion of the directional and spectral information extracted from a plurality of sampled directions, thereby enhancing the speech signal from the target direction.
A fourth object of the invention is to provide a speech recognition system.
A fifth object of the present invention is to provide a speaker recognition method that employs the speech enhancement system together with a speaker recognition model (i.e., a speaker recognition module); the speech enhancement system uses target-speech-guided spatial attention to selectively attend to the target speech direction and performs weighted fusion of the directional and spectral information extracted from a plurality of sampled directions, thereby enhancing the speech signal from the target direction.
A sixth object of the invention is to provide a speaker recognition system.
The first technical scheme adopted by the invention is as follows: a speech enhancement method comprising the steps of:
S100: generating dual-microphone far-field noisy speech from clean speech, clean noise and diffuse noise; generating a plurality of target speech signals from the clean speech and recording their azimuths;
S200: uniformly dividing the spatial azimuth range into a plurality of target regions; labeling the target speech azimuths with these regions to obtain tagged target speech azimuths;
S300: extracting features of the dual-microphone far-field noisy speech for each tagged target speech azimuth to obtain the features of each target region;
S400: constructing a masking neural speech enhancement model; training it on the features of each target region, the target speech and the tagged target speech azimuths to obtain a trained masking neural speech enhancement model; the masking neural speech enhancement model comprises an encoder, a spatial attention module, a decoder and neural beams;
S500: enhancing the speech signal with the trained masking neural speech enhancement model.
Preferably, the dual-microphone far-field noisy speech in step S100 is obtained as follows:
the clean speech is converted into dual-microphone far-field clean speech data by a simulation tool, and the clean noise and diffuse noise are converted into dual-microphone far-field noise data; the dual-microphone far-field clean speech data and the dual-microphone far-field noise data are then mixed at a given signal-to-noise ratio to obtain the dual-microphone far-field noisy speech.
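As a concrete illustration of this mixing step, the following Python sketch (not from the patent; the function name, reference-microphone convention and SNR definition are assumptions) mixes two-channel clean speech with two-channel noise at a requested signal-to-noise ratio:

```python
import numpy as np

def mix_at_snr(clean_2ch, noise_2ch, snr_db):
    """Mix dual-microphone far-field clean speech with dual-microphone far-field
    noise at a given SNR; arrays are shaped (channels, samples)."""
    n = min(clean_2ch.shape[1], noise_2ch.shape[1])
    clean, noise = clean_2ch[:, :n], noise_2ch[:, :n]
    # Scale the noise so the mixture reaches the requested SNR,
    # measured on the first (reference) microphone.
    p_clean = np.mean(clean[0] ** 2)
    p_noise = np.mean(noise[0] ** 2) + 1e-12
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

# Example usage with a randomly drawn SNR (the range is an assumption):
# noisy = mix_at_snr(clean_2ch, noise_2ch, snr_db=np.random.uniform(-5, 20))
```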
Preferably, the target speech in step S100 is obtained as follows:
the clean speech is converted into dual-microphone far-field clean speech data by a simulation tool, and the azimuth of that data is recorded during simulation; the dual-microphone far-field clean speech data is spatially filtered by a fixed beamformer pointing at that azimuth, and the filtered signal is the target speech; the recorded azimuth of the dual-microphone far-field clean speech data is the azimuth of the target speech.
Preferably, labeling the plurality of target speech azimuths based on the plurality of target regions in step S200 comprises:
discretizing the target regions and marking them as azimuth labels; labeling each target speech azimuth with the corresponding azimuth label to obtain the tagged target speech azimuth.
Preferably, step S300 comprises the following sub-steps:
S310: framing the dual-microphone far-field noisy speech, applying a Hamming window, and obtaining the Fourier coefficients of the dual-microphone far-field noisy speech by the Fourier transform;
S320: extracting, for each tagged target speech azimuth, the spectral feature and the directional coherence feature of the Fourier coefficients of the dual-microphone far-field noisy speech;
S330: concatenating the spectral feature and the directional coherence feature to obtain the feature of each target region.
Preferably, training the masking neural speech enhancement model based on the features of each target region, the target speech and the tagged target speech azimuths in step S400 comprises the following sub-steps:
S410: inputting the features of each target region into the masking neural speech enhancement model to obtain an estimated target speech direction and enhanced speech;
S420: training the masking neural speech enhancement model based on the tagged target speech azimuth and the estimated target speech direction, and on the target speech and the enhanced speech, to obtain a trained masking neural speech enhancement model.
Preferably, step S410 comprises:
the encoder encodes the features of each target region to obtain a feature encoding of each target region;
the spatial attention module computes the spatial attention weights at the current frame from the feature encodings of the target regions and the decoder state at the previous frame, and performs a weighted sum of the feature encodings with these weights to obtain a spatially aggregated representation vector; the direction with the largest attention weight is selected as the estimated target speech direction;
the decoder predicts the time-frequency mask of the target speech from the spatially aggregated representation vector to obtain an estimated time-frequency mask;
the neural beam corresponding to the estimated target speech direction is selected to spatially filter the dual-microphone far-field noisy speech, yielding a spatially filtered enhanced signal; the estimated time-frequency mask is applied to the spatially filtered enhanced signal to obtain the enhanced speech.
Preferably, step S420 comprises:
computing a speech enhancement loss from the target speech and the enhanced speech, and a cross-entropy azimuth classification loss from the tagged target speech azimuth and the estimated target speech direction; the encoder, spatial attention module, decoder and neural beams are trained jointly with the speech enhancement loss and the cross-entropy azimuth classification loss as the optimization targets.
The second technical scheme adopted by the invention is as follows: a speech enhancement system comprising a data generation module, a labeling module, a feature extraction module, a model construction and training module, and a speech enhancement module;
the data generation module is used to generate dual-microphone far-field noisy speech from clean speech, clean noise and diffuse noise, to generate a plurality of target speech signals from the clean speech, and to record their azimuths;
the labeling module is used to uniformly divide the spatial azimuth range into a plurality of target regions and to label the target speech azimuths with these regions to obtain tagged target speech azimuths;
the feature extraction module is used to extract features of the dual-microphone far-field noisy speech for each tagged target speech azimuth to obtain the features of each target region;
the model construction and training module is used to construct a masking neural speech enhancement model and to train it on the features of each target region, the target speech and the tagged target speech azimuths, thereby obtaining a trained masking neural speech enhancement model;
the speech enhancement module is used to enhance a speech signal with the trained masking neural speech enhancement model.
The third technical scheme adopted by the invention is as follows: a speech recognition method based on the speech enhancement system as described in the second technical solution, comprising the steps of:
s100: constructing a voice data set with text labels, and enhancing voice signals in the voice data set with text labels based on the voice enhancement system so as to obtain enhanced voice signals;
s200: preprocessing the enhanced speech signal to obtain mel spectrum characteristics;
s300: constructing a voice recognition model, and inputting the Mel frequency spectrum characteristics into the voice recognition model to obtain a decoded recognition result; training the voice recognition model based on the decoded recognition result and the text to obtain a trained voice recognition model;
S400: and recognizing a voice signal based on the trained voice recognition model.
Preferably, the speech recognition model in step S300 comprises an acoustic encoding module and a text decoding module;
the mel spectrum features are input into the acoustic encoding module to obtain high-dimensional acoustic features; the high-dimensional acoustic features are input into the text decoding module for step-by-step decoding to obtain the decoded recognition result;
the cross-entropy loss between the decoded recognition result and the text is computed; the mean of all cross-entropy losses gives the final loss, which is used as the training target of the speech recognition model.
The fourth technical scheme adopted by the invention is as follows: a voice recognition system based on the voice enhancement system as described in the second technical scheme, which comprises a data acquisition module, a preprocessing module, a voice recognition model construction and training module and a recognition module;
the data acquisition module is used for constructing a voice data set with text labels, and enhancing voice signals in the voice data set with the text labels based on the voice enhancement system so as to obtain enhanced voice signals;
The preprocessing module is used for preprocessing the enhanced voice signal to obtain a Mel frequency spectrum characteristic;
the voice recognition model construction and training module is used for constructing a voice recognition model, and inputting the Mel frequency spectrum characteristics into the voice recognition model to obtain a decoded recognition result; training the voice recognition model based on the decoded recognition result and the text to obtain a trained voice recognition model;
the recognition module is used for recognizing the voice signal based on the trained voice recognition model.
The fifth technical scheme adopted by the invention is as follows: a speaker recognition method based on the speech enhancement system as described in the second technical solution, comprising the steps of:
s100: constructing a target speaker voice data set with text labels, and enhancing a target speaker voice signal in the target speaker voice data set with text labels based on the voice enhancement system, so as to obtain an enhanced voice signal; the enhanced speech signal includes information related to speaker identity;
s200: preprocessing the enhanced voice signal and the corresponding text to obtain a phoneme text sequence, a phoneme duration and a mel spectrum characteristic;
S300: constructing a speaker recognition model, and inputting the phoneme text sequence, the phoneme durations and the mel spectrum features into the speaker recognition model to obtain a text-information hidden vector, a predicted text hidden vector and predicted speaker identity information; training the speaker recognition model based on the text-information hidden vector and the predicted text hidden vector, and on the speaker-identity-related information and the predicted speaker identity information, thereby obtaining a trained speaker recognition model;
s400: and identifying the voice signal of the unknown speaker identity based on the trained speaker identification model.
Preferably, the speaker recognition model in the step S300 includes a text encoding module, a voiceprint feature extraction module, and a feature classification module;
inputting the phoneme text sequence and the phoneme durations into the text encoding module to obtain the text-information hidden vector; inputting the mel spectrum features into the voiceprint feature extraction module to obtain a voiceprint feature vector and the predicted text hidden vector; inputting the voiceprint feature vector into the feature classification module to obtain the predicted speaker identity information;
calculating a mean absolute error loss from the text-information hidden vector and the predicted text hidden vector to obtain the text information loss, and calculating a cross-entropy loss from the speaker-identity-related information and the predicted speaker identity information; the text information loss and the cross-entropy loss are used as the joint optimization targets for training the speaker recognition model.
The sixth technical scheme adopted by the invention is as follows: a speaker recognition system based on the speech enhancement system as described in the second technical scheme, comprising a data acquisition module, a preprocessing module, a speaker recognition model construction and training module and a recognition module;
the data acquisition module is used for constructing a target speaker voice data set with text labels, and enhancing target speaker voice signals in the target speaker voice data set with the text labels based on the voice enhancement system so as to obtain enhanced voice signals and corresponding text; the enhanced speech signal includes information related to speaker identity;
the preprocessing module is used for preprocessing the enhanced voice signal and the corresponding text to obtain a phoneme text sequence, a phoneme duration and a mel frequency spectrum characteristic;
The speaker recognition model construction and training module is used to construct a speaker recognition model, to input the phoneme text sequences, phoneme durations and mel spectrum features into it to obtain text-information hidden vectors, predicted text hidden vectors and predicted speaker identity information, and to train the speaker recognition model based on the text-information hidden vectors and the predicted text hidden vectors, and on the speaker-identity-related information and the predicted speaker identity information, thereby obtaining a trained speaker recognition model;
the recognition module is used for recognizing the voice signal of the unknown speaker identity based on the trained speaker recognition model.
The beneficial effects of the above technical schemes are as follows:
(1) To solve target speech enhancement when the target source position is unknown or moving, the disclosed speech enhancement method constructs and trains a direction-guided, spatial-attention-based masking neural beamforming enhancement network (i.e., a spatial-attention-based masking neural speech enhancement model) comprising an encoder, a spatial attention module, a decoder and neural beams; the modules are jointly optimized and trained in a supervised manner on simulated data with a spectral loss, a waveform loss and a spatial direction classification loss:
first, the spatial region is uniformly divided; the encoder encodes the spectral and directional features extracted for each spatial region, the spatial attention module performs weighted fusion of the feature encodings of the regions, and the decoder estimates the time-frequency mask of the target speech from the fused feature vector;
the neural beam corresponding to the direction with the largest spatial attention weight is then selected for spatial filtering, and the final enhanced target speech is obtained with the masking technique.
(2) The disclosed speech enhancement method uses target-speech-guided spatial attention to selectively attend to the target speech direction and performs weighted fusion of the directional and spectral information extracted from a plurality of sampled directions, finally enhancing the speech signal from the target direction.
(3) Existing spatial attention lacks effective information guidance and does not consider the short-time correlation of speech and spatial azimuth; the invention formulates spatial attention as a classification problem over sampled directions, proposes target-direction-guided spatial attention with a target-direction classification objective, and takes the short-time correlation of spatial azimuth and the inherent short-time correlation of speech signals into account when computing the spatial attention.
(4) The invention selectively attends to a plurality of uniformly sampled spatial directions through a spatial attention mechanism; spatial attention is formulated as a classification problem over sampled directions, and the classification loss between the spatial attention weights and the tagged target direction guides the attention toward the target direction; finally, the masking neural beam is used to enhance the target speech signal.
Drawings
FIG. 1 is a flow chart of a method for speech enhancement according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a data simulation flow provided in an embodiment of the present invention;
FIG. 3 is a schematic diagram of spatial sampling provided by one embodiment of the present invention;
FIG. 4 is a schematic diagram of a feature extraction process according to an embodiment of the present invention;
FIG. 5 is a block diagram of a masking neural speech enhancement network provided in one embodiment of the present invention;
FIG. 6 is a schematic diagram of a speech enhancement system according to an embodiment of the present invention;
FIG. 7 is a flowchart of a speech recognition method according to an embodiment of the present invention;
FIG. 8 is a basic block diagram of a speech recognition model provided by one embodiment of the present invention;
FIG. 9 is a schematic diagram of a network structure of an acoustic coding module in a speech recognition model according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a network structure of a text decoding module in a speech recognition model according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a speech recognition system according to an embodiment of the present invention;
FIG. 12 is a flowchart of a speaker recognition method according to an embodiment of the present invention;
FIG. 13 is a flow chart illustrating data preprocessing in a speaker recognition model according to an embodiment of the present invention;
FIG. 14 is a basic block diagram of a speaker recognition model network provided in accordance with one embodiment of the present invention;
FIG. 15 is a schematic diagram of a basic structure of a single affine coupling layer in a voiceprint feature extraction module according to one embodiment of the present invention;
fig. 16 is a schematic structural diagram of a speaker recognition system according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following detailed description and drawings illustrate the principles of the invention; they do not limit its scope, which is defined by the claims, and the invention is not restricted to the preferred embodiments described.
In the description of the present invention, it should be noted that, unless otherwise indicated, "plurality" means two or more; the terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance; the specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art as appropriate.
Example 1
As shown in fig. 1, one embodiment of the present invention provides a speech enhancement method comprising the steps of:
S100: generating dual-microphone far-field noisy speech from clean speech, clean noise and diffuse noise; generating a plurality of target speech signals from the clean speech and recording their azimuths;
data preparation: as shown in fig. 2, the noise-containing far-field double-microphone speech is randomly generated by using a simulation tool pyromamacotics through pure speech, pure noise and scattered noise.
Collecting and sorting open source clean speech from 863-1 chinese speech data, AISHELL-1 and AISHELL-2 to form a clean speech dataset; collecting and sorting online open source clean noise data from Google audio to form a clean noise data set; the scatter-like noise is collected and sorted from noiex-92, air conditioning noise, wind noise, cafe noise, square noise, etc. to form a scatter-like noise data set, and a large amount (about 20,000 sentences) of scatter noise (i.e., double microphone scatter noise) is generated by the scatter noise simulator ANF-Generator based on the scatter-like noise data set, with a pitch between the double microphones of 4 cm.
Randomly selecting clean voice from the clean voice data set, randomly selecting clean noise from the clean noise data set, and randomly selecting scattered noise from a large amount of simulated scattered noise; according to simulation parameters shown in table 1, pure voice is converted into double-microphone far-field pure voice data through a simulation tool (far-field data of two channels are obtained by convoluting pure voice through room impact response), then pure noise and scattered noise are converted into double-microphone far-field noise data through the same mode, and finally double-microphone far-field pure voice data and double-microphone far-field noise data are mixed according to a certain signal to noise ratio, so that double-microphone far-field noisy voice is obtained; for example, using simulation tool pyromamacometrics to randomly generate 10,000,000 sentences of bimorphan far-field noisy speech (i.e., bimorphan far-field noisy data) as a training set; in addition, 10,000 sentences of double-microphone far-field noisy speech are respectively generated as a test set and a development set.
In the simulation process, besides the double-microphone far-field noisy speech, double-microphone far-field clean speech data before noise addition is stored, namely clean speech is generated into double-microphone far-field clean speech data through a simulation tool, the azimuth of the double-microphone far-field clean speech data in the simulation process is recorded, the double-microphone far-field clean speech data is spatially filtered through a fixed beam former corresponding to the azimuth of the double-microphone far-field clean speech data, and the filtered signal is the target speech; the azimuth of the recorded double-microphone far-field pure voice data is the azimuth (namely the target direction) of the target voice.
The method disclosed by the invention reserves double-microphone far-field pure voice data before noise addition, is used for generating a plurality of target voices, guides voice enhancement network training based on the target voices, records the spatial orientation of the target voices in a simulation stage, and guides a spatial attention module to pay attention to the spatial orientation of the target in subsequent model training.
Table 1 simulation parameters
S200: uniformly dividing the space orientation into a plurality of target areas; labeling the plurality of target voice orientations based on a plurality of target areas to obtain labeled target voice orientations;
Spatial sampling regions: as shown in fig. 3, the spatial azimuth range is uniformly divided into five target regions [0, 22.5], [22.5, 67.5], [67.5, 112.5], [112.5, 157.5] and [157.5, 180]; these 5 regions are discretized and marked as the five azimuth labels {0, 1, 2, 3, 4}. Assuming the target speech comes from one of these 5 regions, the target speech azimuth is labeled with the azimuth labels to obtain the tagged target speech azimuth, which is used in subsequent model training to guide the spatial attention module to attend to the azimuth of the target speech.
For a linear microphone array (i.e., microphones placed on a straight line), directions can only be discriminated within 0-180 degrees. Any target direction $\theta \in [0, 360]$ is therefore first mapped into the range 0-180 degrees:

$$\theta' = \begin{cases} \theta, & 0 \le \theta \le 180 \\ 360 - \theta, & 180 < \theta \le 360 \end{cases}$$

where $\theta'$ is the target speech azimuth (i.e., the target direction) and $\theta$ is an arbitrary target direction.

The five target regions are mapped to the five labels {0, 1, 2, 3, 4}, one label per region, and the target speech azimuth $\theta'$ is labeled accordingly to obtain the tagged target speech azimuth $\varphi \in \{0, 45, 90, 135, 180\}$; the spatial region label corresponding to the tagged target speech azimuth $\varphi$ is

$$y = \varphi / 45 \in \{0, 1, 2, 3, 4\}.$$

For example, if a target speech azimuth lies in the region [0, 22.5], it is labeled 0; if it lies in [22.5, 67.5], it is labeled 1; if it lies in [67.5, 112.5], it is labeled 2; if it lies in [112.5, 157.5], it is labeled 3; and if it lies in [157.5, 180], it is labeled 4.
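A minimal Python sketch of this labeling rule follows (illustrative only; rounding to the nearest 45-degree sampled direction is an assumed reading of the region boundaries given above):

```python
import numpy as np

def label_target_azimuth(theta_deg):
    """Map a simulated target azimuth in [0, 360) to the linear-array range [0, 180],
    then to one of the five uniformly sampled regions.
    Returns (tagged azimuth in {0,45,90,135,180}, region label in {0..4})."""
    theta = theta_deg if theta_deg <= 180 else 360.0 - theta_deg  # front-back symmetry
    label = int(np.clip(np.round(theta / 45.0), 0, 4))            # nearest sampled direction
    return 45 * label, label

# e.g. label_target_azimuth(100.0) -> (90, 2), since 100 degrees lies in [67.5, 112.5]
```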
S300: extracting the characteristics of the double-microphone far-field noisy speech from the labeled target speech azimuth to obtain the characteristics of each target region;
according to the array shape and geometric parameters of the microphone array, designing a super-directional beam forming device and a fixed beam forming device with a notch (the notch direction is the candidate direction of an interference source corresponding to the labeling target voice azimuth) for a plurality of target areas; i.e. designing a super-directional beamformer, labeling target speech orientations, e.g. each pointing at 0,45,90,135,180 degrees, while designing a fixed beamformer with notches for each individual target region divided; then, utilizing a designed wave beam former to extract frequency spectrum characteristics and spatial coherence characteristics from 5 labeled target voice orientations for each sentence of double-microphone far-field noisy voices (namely double-microphone signals) respectively; the fixed beamformer with notch suppresses as much as possible the sound source from the notch direction while ensuring that the target speech orientation is not distorted.
Feature extraction: as shown in fig. 4, extracting the features of the dual-microphone far-field noisy speech for each tagged target speech azimuth comprises the following sub-steps:
S310: framing the dual-microphone far-field noisy speech (i.e., the dual-microphone signal), applying a Hamming window, and obtaining the Fourier coefficients of the dual-microphone far-field noisy speech by the Fourier transform;
the dual-microphone far-field noisy speech in the training set is framed with a frame length of 512 and a frame shift of 256, a Hamming window is applied to each frame, and the Fourier coefficients $X_t$ of the dual-microphone far-field noisy speech are obtained by the Fourier transform.
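The framing and transform of S310 can be sketched in Python as follows (assumptions: NumPy arrays shaped channels x samples and a one-sided FFT, giving 257 frequency bins per frame):

```python
import numpy as np

def stft_dual_mic(noisy_2ch, frame_len=512, frame_shift=256):
    """Frame each microphone signal, apply a Hamming window and take the FFT,
    returning complex coefficients of shape (mics, frames, frame_len//2 + 1)."""
    window = np.hamming(frame_len)
    n_frames = 1 + (noisy_2ch.shape[1] - frame_len) // frame_shift
    coeffs = np.empty((noisy_2ch.shape[0], n_frames, frame_len // 2 + 1), dtype=complex)
    for m, channel in enumerate(noisy_2ch):
        for t in range(n_frames):
            frame = channel[t * frame_shift: t * frame_shift + frame_len]
            coeffs[m, t] = np.fft.rfft(frame * window)
    return coeffs
```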
S320: extracting, for each tagged target speech azimuth, the spectral feature and the directional coherence feature of the Fourier coefficients of the dual-microphone far-field noisy speech;
for each of the five tagged target speech azimuths $\varphi \in \{0, 45, 90, 135, 180\}$, the spectral feature $S_\varphi$ and the directional coherence feature $C_\varphi$ of the Fourier coefficients of the dual-microphone far-field noisy speech are extracted.
S330: concatenating the spectral feature and the directional coherence feature to obtain the feature of each target region;
the spectral feature and the directional coherence feature are concatenated along the feature dimension to obtain the feature of each target region (i.e., the input feature for the tagged target speech azimuth $\varphi$):

$$F_\varphi = [S_\varphi; C_\varphi]$$

where $F_\varphi$ is the input feature for the tagged target speech azimuth $\varphi$, $S_\varphi$ is the spectral feature, and $C_\varphi$ is the directional coherence feature.
S400: constructing a masking neural speech enhancement model; training the masking neural speech enhancement model based on the features of each target region, the target speech and the tagged target speech azimuths;
fig. 5 shows the network framework of the masking neural speech enhancement model, which comprises an encoder, a spatial attention module, a decoder and neural beams.
Training the masking neural speech enhancement model based on the features of each target region, the target speech and the tagged target speech azimuths comprises the following sub-steps:
S410: inputting the features of each target region into the masking neural speech enhancement model to obtain an estimated target speech direction and enhanced speech;
the encoder encodes the features of each target region to obtain a feature encoding of each target region;
the spatial attention module computes the spatial attention weights at the current frame from the encoder output (i.e., the feature encodings of the target regions), the decoder state and the spatial attention weights at the previous frame, and performs a weighted sum of the feature encodings with the current weights to obtain a spatially aggregated representation vector; the direction with the largest attention weight is selected as the estimated target speech direction;
the decoder predicts the time-frequency mask of the target speech from the spatially aggregated representation vector to obtain an estimated time-frequency mask;
the neural beam corresponding to the estimated target speech direction is selected to spatially filter the dual-microphone far-field noisy speech, yielding a spatially filtered enhanced signal; finally, the estimated time-frequency mask is applied to the spatially filtered enhanced signal to obtain the final enhanced speech.
1) The encoder;
the encoder is a TDNN encoding network consisting of 2 TDNN layers, each with 256 nodes and ReLU activations. The space is uniformly divided into the 5 target regions [0, 22.5], [22.5, 67.5], [67.5, 112.5], [112.5, 157.5] and [157.5, 180], and the feature $F_{t,\varphi}$ of each target region is extracted; the feature of each target region is input to the shared TDNN encoder to obtain the feature encoding $h_{t,\varphi}$ of that region.
The feature encoding of each target region is given by

$$h_{t,\varphi} = \mathrm{TDNN}(F_{t,\varphi})$$

where $h_{t,\varphi}$ is the encoded representation of direction $\varphi$ at frame $t$ (i.e., the feature encoding of each target region), $t$ is the time frame, and $F_{t,\varphi}$ is the feature extracted for direction $\varphi$ at frame $t$.
In the invention, the encoder encodes the features of every sampled direction, and the feature encoders of all sampled directions share the same parameter weights.
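A possible PyTorch sketch of such a shared two-layer TDNN encoder is given below (kernel sizes and the tensor layout are assumptions; only the 256-unit ReLU layers and parameter sharing across directions come from the text):

```python
import torch
import torch.nn as nn

class SharedTDNNEncoder(nn.Module):
    """Two-layer TDNN (Conv1d over time) with 256 units and ReLU, shared across
    all sampled directions."""
    def __init__(self, feat_dim, hidden=256, kernel=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel, padding=kernel // 2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2), nn.ReLU(),
        )

    def forward(self, feats):
        # feats: (batch, directions, frames, feat_dim) -> (batch, directions, frames, 256)
        b, d, t, f = feats.shape
        x = feats.reshape(b * d, t, f).transpose(1, 2)   # (b*d, feat_dim, frames)
        h = self.net(x).transpose(1, 2)                  # same encoder for every direction
        return h.reshape(b, d, t, -1)
```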
2) The spatial attention module;
because the speech signal has short-time temporal correlation, the invention takes the decoder state at the previous frame into account when computing the spatial attention weights; the spatial attention module computes the spatial attention weights at the current frame from the encoder output and the decoder state at the previous frame, and performs a weighted sum of the feature encodings of the target regions with these weights to obtain a spatially aggregated representation vector.
The spatially aggregated representation vector is

$$c_t = \mathrm{Attend}\left(\{h_{t,\varphi}\}, s_{t-1}\right)$$

where $c_t$ is the spatially aggregated representation vector, $h_{t,\varphi}$ is the encoded representation of direction $\varphi$ at frame $t$ (i.e., the feature encoding of each target region), $\varphi$ ranges over the tagged target speech azimuths, and $s_{t-1}$ is the decoder state at frame $t-1$.
The spatially aggregated representation vector $c_t$ is computed as

$$e_{t,\varphi} = V^{T} \tanh\left(W s_{t-1} + U h_{t,\varphi} + b\right)$$
$$a_{t,\varphi} = \frac{\exp\left(\beta\, e_{t,\varphi}\right)}{\sum_{\varphi'} \exp\left(\beta\, e_{t,\varphi'}\right)}$$
$$c_t = \sum_{\varphi} a_{t,\varphi}\, h_{t,\varphi}$$

where $e_{t,\varphi}$ is the intermediate representation used to compute the attention coefficients; $W$, $U$, $V$ and $b$ are the weight matrices (and bias) of the attention model; $T$ denotes matrix transposition; $s_{t-1}$ is the decoder state at frame $t-1$; $h_{t,\varphi}$ is the encoded representation of direction $\varphi$ at frame $t$; $a_{t,\varphi}$ is the attention weight of direction $\varphi$ at frame $t$; $\beta$ is a constant attention control factor; and $c_t$ is the spatially aggregated representation vector.
The direction with the largest attention weight is selected as the estimated target speech direction:

$$\hat{\varphi}_t = \arg\max_{\varphi} a_{t,\varphi}$$

where $\hat{\varphi}_t$ is the estimated target speech direction and $a_{t,\varphi}$ is the attention weight of direction $\varphi$ at frame $t$.
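The attention computation above can be sketched in PyTorch roughly as follows (a simplified illustration: the dependence on the previous attention weights mentioned in S410 is omitted, and all dimensions are assumptions):

```python
import torch
import torch.nn as nn

class TargetGuidedSpatialAttention(nn.Module):
    """Additive spatial attention over the 5 direction encodings, conditioned on
    the previous decoder state s_{t-1}; beta is the constant attention-control factor."""
    def __init__(self, enc_dim=256, dec_dim=256, attn_dim=128, beta=1.0):
        super().__init__()
        self.W = nn.Linear(dec_dim, attn_dim, bias=False)
        self.U = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=True)
        self.beta = beta

    def forward(self, h_t, s_prev):
        # h_t: (batch, directions, enc_dim); s_prev: (batch, dec_dim)
        e = self.v(torch.tanh(self.W(s_prev).unsqueeze(1) + self.U(h_t))).squeeze(-1)
        a = torch.softmax(self.beta * e, dim=-1)        # attention weights a_{t,phi}
        c = torch.sum(a.unsqueeze(-1) * h_t, dim=1)     # spatially aggregated vector c_t
        theta_hat = a.argmax(dim=-1)                    # estimated target direction index
        return c, a, theta_hat
```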
3) The decoder;
the decoder network consists of 2 LSTM layers and 1 fully connected layer; each LSTM layer has 256 nodes, the fully connected layer has 257 nodes, and its activation function is the Sigmoid, so that the output of the mask prediction network is a masking value in [0, 1]. The decoder input is the spatially aggregated representation vector $c_t$ from the spatial attention module. Let $\hat{M}_t$ be the time-frequency mask estimated by the decoder:

$$\hat{M}_t = \mathrm{Decoder}(c_t)$$

where $\hat{M}_t$ is the estimated time-frequency mask of the target speech at frame $t$ predicted by the masking neural speech enhancement model and $c_t$ is the spatially aggregated representation vector.
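A matching PyTorch sketch of the decoder (two 256-unit LSTM layers plus a 257-unit Sigmoid output layer, as stated above; the batch-first layout is an assumption):

```python
import torch
import torch.nn as nn

class MaskDecoder(nn.Module):
    """Predicts a [0,1] time-frequency mask from the spatially aggregated vectors."""
    def __init__(self, in_dim=256, hidden=256, n_freq=257):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, n_freq)

    def forward(self, c, state=None):
        # c: (batch, frames, in_dim) spatially aggregated vectors
        out, state = self.lstm(c, state)
        mask = torch.sigmoid(self.fc(out))   # (batch, frames, 257)
        return mask, state
```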
4) The neural beams;
each neural beam consists of a complex affine transformation layer, and 5 independent neural beams are designed for the 5 directions 0, 45, 90, 135 and 180 degrees. Considering that existing beamforming does not exploit the history signal, the invention introduces an additional filter to model the previous frame. After the estimated target speech direction $\hat{\varphi}_t$ has been selected according to the attention weights, the neural beam corresponding to $\hat{\varphi}_t$ is selected to filter the microphone-array signal, yielding the spatially filtered enhanced signal:

$$Y_t = w_{\hat{\varphi},1}^{H} X_t + w_{\hat{\varphi},2}^{H} X_{t-1}$$

where $Y_t$ is the spatially filtered enhanced signal; $w_{\hat{\varphi},1}$ and $w_{\hat{\varphi},2}$ are the complex weight coefficients of the neural beam for direction $\hat{\varphi}$; $X_t$ is the Fourier transform coefficients of the $t$-th frame of the far-field dual-microphone noisy speech signal; and $X_{t-1}$ is the Fourier transform coefficients of the $(t-1)$-th frame of the far-field dual-microphone noisy speech signal.
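A PyTorch sketch of the per-direction neural beam follows (the complex-parameter layout, initialization scale and use of conjugated weights are assumptions; only the structure of one beam per direction, applied to the current and previous frame, comes from the text):

```python
import torch
import torch.nn as nn

class NeuralBeam(nn.Module):
    """One learnable complex filter pair (current frame, previous frame) per direction."""
    def __init__(self, n_dirs=5, n_mics=2, n_freq=257):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(n_dirs, n_mics, n_freq, dtype=torch.cfloat) * 0.1)
        self.w2 = nn.Parameter(torch.randn(n_dirs, n_mics, n_freq, dtype=torch.cfloat) * 0.1)

    def forward(self, X, dir_index):
        # X: (frames, mics, n_freq) complex STFT of the noisy dual-mic signal;
        # dir_index: integer index of the estimated target direction for this utterance.
        X_prev = torch.cat([X[:1], X[:-1]], dim=0)          # X_{t-1}, first frame repeated
        w1, w2 = self.w1[dir_index], self.w2[dir_index]      # weights of the selected beam
        Y = (w1.conj() * X + w2.conj() * X_prev).sum(dim=1)  # (frames, n_freq)
        return Y
```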
Applying the estimated time-frequency mask to the spatially filtered enhanced signal to obtain the final enhanced speech comprises:
obtaining the Fourier coefficients of the final enhanced target speech with the masking technique, and applying the inverse Fourier transform to them to obtain the final enhanced speech, i.e., the waveform signal $\hat{s}$ of the enhanced speech.
The Fourier coefficients of the final enhanced target speech are

$$\hat{S}_t = \hat{M}_t \odot Y_t$$

where $\hat{S}_t$ is the Fourier coefficients of the final enhanced target speech, $\hat{M}_t$ is the time-frequency mask of the $t$-th frame of target speech estimated by the masking neural speech enhancement model, $\odot$ denotes the element-wise (dot) product, and $Y_t$ is the spatially filtered enhanced signal.
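Applying the mask and returning to the time domain can be sketched as follows (assumes torch complex STFT coefficients and the 512/256 Hamming-window analysis used earlier):

```python
import torch

def reconstruct_enhanced(Y, mask, frame_len=512, frame_shift=256):
    """Apply the estimated time-frequency mask to the beamformed signal Y and
    return a waveform via the inverse STFT (overlap-add with a Hamming window)."""
    S_hat = mask * Y                                      # (frames, 257), complex
    return torch.istft(S_hat.transpose(0, 1),             # istft expects (freq, frames)
                       n_fft=frame_len, hop_length=frame_shift,
                       window=torch.hamming_window(frame_len))
```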
S420: training the masking neural speech enhancement model based on the tagged target speech azimuth and the estimated target speech direction, and on the target speech and the enhanced speech, to obtain a trained masking neural speech enhancement model;
the method jointly trains the encoder, the spatial attention module, the decoder and the neural beams with the speech enhancement loss (computed from the target speech and the enhanced speech) and the cross-entropy azimuth classification loss (computed from the tagged target speech azimuth and the estimated target speech direction) as the optimization targets;
the parameters of the encoder, spatial attention module, decoder and neural beams are all learnable and are jointly optimized end-to-end based on the spectral loss, the waveform loss (i.e., the speech enhancement loss) and the cross-entropy azimuth classification loss; in addition, the cross-entropy azimuth classification loss guides the spatial attention module to focus better on the target speech direction;
speech enhancement loss as an optimization target: the minimum mean-square error (MSE) of the exponentially compressed energy spectrum (i.e., the spectral loss) and the scale-invariant signal-to-distortion ratio (SI-SDR) (i.e., the waveform loss) are used as the optimization targets for speech enhancement; the invention computes these targets with the target speech;
cross-entropy azimuth classification loss as an optimization target: in the training stage, the tagged target speech azimuth guides the spatial attention module to focus on the target speech direction and ignore other directions; the spatial region label corresponding to the tagged target speech azimuth provides supervision, strengthening the direction-selection capability of the spatial attention module; the invention formulates the selective spatial attention as a classification problem over sampled directions and guides the optimization of the spatial attention module with the cross-entropy azimuth classification loss between the target speech direction predicted from the attention weights (the estimated target speech direction) and the tagged target speech azimuth.
$$\mathcal{L} = \mathcal{L}_{\mathrm{spec}} + \mathcal{L}_{\mathrm{wav}} + \mathcal{L}_{\mathrm{CE}}$$

where $\mathcal{L}$ is the total loss, $\mathcal{L}_{\mathrm{spec}}$ is the spectral loss, $\mathcal{L}_{\mathrm{wav}}$ is the waveform loss, and $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy azimuth classification loss. With $T$ the total number of time frames, $F$ the number of Fourier transform frequency bins, $S_{t,f}$ the Fourier coefficients of the target speech, $\hat{S}_{t,f}$ the Fourier coefficients of the final enhanced target speech, $s$ the waveform signal of the target speech, $\hat{s}$ the waveform signal of the enhanced speech, $a_t = [a_{t,\varphi}]$ the vector of attention weights at frame $t$, and $y$ the spatial region label corresponding to the target speech azimuth:

$\mathcal{L}_{\mathrm{spec}}$ is the mean-square error between the exponentially compressed magnitude spectra of $S_{t,f}$ and $\hat{S}_{t,f}$, averaged over the $T$ frames and $F$ frequency bins;

$\mathcal{L}_{\mathrm{wav}}$ is the negative SI-SDR between $\hat{s}$ and $s$, $\mathcal{L}_{\mathrm{wav}} = -10\log_{10}\dfrac{\lVert \alpha s \rVert^2}{\lVert \alpha s - \hat{s} \rVert^2}$ with $\alpha = \dfrac{\hat{s}^{T} s}{\lVert s \rVert^2}$;

$\mathcal{L}_{\mathrm{CE}} = -\dfrac{1}{T}\sum_{t=1}^{T} \log a_{t,y}$ is the cross-entropy between the attention weights and the spatial region label $y$.
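A hedged PyTorch sketch of the joint objective is shown below (the spectral-compression exponent c, the equal loss weighting and the tensor shapes are assumptions not given in the patent):

```python
import torch
import torch.nn.functional as F

def si_sdr_loss(s_hat, s, eps=1e-8):
    """Negative scale-invariant SDR between enhanced and target waveforms (..., samples)."""
    alpha = (s_hat * s).sum(-1, keepdim=True) / (s.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * s
    return -10 * torch.log10(target.pow(2).sum(-1) / ((target - s_hat).pow(2).sum(-1) + eps) + eps)

def total_loss(S_hat, S, s_hat, s, attn, region_label, c=0.3):
    """Spectral loss + waveform loss + cross-entropy azimuth classification loss."""
    spec = F.mse_loss(S_hat.abs().pow(c), S.abs().pow(c))     # compressed-spectrum MSE
    wav = si_sdr_loss(s_hat, s).mean()                         # waveform loss
    # attn: (batch, frames, 5) softmaxed weights; region_label: (batch,) long tensor in {0..4}
    logp = torch.log(attn + 1e-8).flatten(0, 1)                # (batch*frames, 5)
    labels = region_label.unsqueeze(1).expand(attn.shape[0], attn.shape[1]).flatten()
    ce = F.nll_loss(logp, labels)                              # direction classification
    return spec + wav + ce
```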
The weight coefficients of the encoder, the spatial attention module, the decoder and the neural beams are randomly initialized; the spectral loss, the waveform loss and the cross-entropy azimuth classification loss are used as the joint optimization target of the whole network, with the target speech direction as the target of the cross-entropy azimuth classification loss; the whole network is trained with the Adam optimizer, the learning rate is adjusted with a warmup strategy, mini-batch training is used with a batch size of 64, and training runs for 50 epochs until convergence.
Further, in one embodiment, the method further comprises testing the trained masking neural speech enhancement model, including:
the dual-microphone far-field noisy speech in the test set is used as the test speech; following step S300, the spectral and spatial features for the directions $\varphi \in \{0, 45, 90, 135, 180\}$ are first extracted and concatenated along the feature dimension to obtain the features of each target region; the features of each target region are input into the encoder to obtain the feature encodings of the target regions, the spatial attention module performs weighted fusion of these encodings to obtain the spatially aggregated representation vector, and the decoder then produces the estimated time-frequency mask of the target speech from it; meanwhile, according to the spatial attention module, the direction with the largest attention weight is selected for neural beam filtering to obtain the spatially filtered enhanced signal; finally, the masking technique yields the Fourier coefficients of the final enhanced speech signal, and the waveform of the enhanced speech is obtained by the inverse Fourier transform.
S500: the speech signal is enhanced with the trained masking neural speech enhancement model.
The azimuth of the target speech is an important cue for improving multi-channel speech enhancement, yet real-time estimation of the source azimuth is very difficult when the source is in a far-field, reverberant or moving scene. A spatial attention mechanism can selectively attend to the target source direction and is a promising approach when the source position is unknown, but existing spatial attention mechanisms lack effective target guidance, making the selective attention to the source azimuth inaccurate and unstable. The invention automatically attends to the target speech direction with a target-direction-guided spatial attention mechanism and integrates it into the masking neural beam to realize multi-channel speech enhancement.
Example two
As shown in fig. 6, one embodiment of the present invention provides a speech enhancement system comprising a data generation module, a labeling module, a feature extraction module, a model construction and training module, and a speech enhancement module;
the data generation module is used to generate dual-microphone far-field noisy speech from clean speech, clean noise and diffuse noise, to generate a plurality of target speech signals from the clean speech, and to record their azimuths;
the labeling module is used to uniformly divide the spatial azimuth range into a plurality of target regions and to label the target speech azimuths with these regions to obtain tagged target speech azimuths;
the feature extraction module is used to extract features of the dual-microphone far-field noisy speech for each tagged target speech azimuth to obtain the features of each target region;
the model construction and training module is used to construct a masking neural speech enhancement model and to train it on the features of each target region, the target speech and the tagged target speech azimuths;
the speech enhancement module is used to enhance a speech signal with the trained masking neural speech enhancement model.
Example III
As shown in fig. 7, one embodiment of the present invention provides a voice recognition method, including the steps of:
S100: constructing a speech dataset with text labels, and enhancing the speech signals in the dataset with the speech enhancement system of the second embodiment to obtain enhanced speech signals and the corresponding text;
constructing the speech dataset with text labels comprises: collecting and organizing speech signals from the 863-1 Chinese speech data, AISHELL-1 and AISHELL-2 to form a speech dataset with text labels;
the speech signals in the dataset with text labels are then processed by the speech enhancement system to obtain the enhanced speech signals.
S200: preprocessing the enhanced speech signal to obtain mel-spectrum features (i.e., mel-spectrum features);
S210: Pre-emphasis is applied to the enhanced speech signal, which is then divided into frames with a frame length of 1024 and a frame shift of 256, and a Hamming window is applied;
S220: A fast Fourier transform is performed on the pre-emphasized, framed and Hamming-windowed enhanced speech signal to obtain the magnitude spectrum of the enhanced speech signal;
S230: The magnitude spectrum of the enhanced speech signal is passed through a Mel filter bank to obtain the Mel spectrum features;
S300: Constructing a voice recognition model, and inputting the Mel frequency spectrum features into the voice recognition model to obtain a decoded recognition result; training the voice recognition model based on the decoded recognition result and the text;
As shown in fig. 8, the speech recognition model (i.e., speech recognition module) includes an acoustic encoding module (i.e., acoustic encoder) and a text decoding module (i.e., text decoder); the present invention uses the cross entropy loss as the optimization objective to jointly optimize and perform supervised training of the respective modules.
(1) Inputting the Mel spectrum characteristics into a voice recognition model to obtain a recognition result of each step of decoding;
1) An acoustic encoding module;
As shown in fig. 9, the acoustic encoding module takes Mel spectrum features as input and outputs high-dimensional acoustic features. The 80-dimensional Mel spectrum features are first mapped to a higher-dimensional feature representation through an affine transformation layer; then, three modules, each composed of a one-dimensional convolution, one-dimensional Batch Normalization and one-dimensional MaxPooling (maximum pooling) layer, apply max pooling to the convolution output along the time dimension, shortening the length of the hidden variables while preserving local information; finally, the hidden vectors are fed into a multi-head Self-attention module, a residual connection is introduced by adding the hidden vectors to the output of the multi-head attention module, and the high-dimensional acoustic features are obtained as output.
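A PyTorch sketch of this acoustic encoder is shown below; the 80-dimensional Mel input and the overall structure (affine layer, three Conv1d + BatchNorm1d + MaxPool1d blocks, multi-head self-attention with a residual connection) follow the description, while the hidden size, kernel size and number of heads are assumptions.

```python
# Illustrative sketch of the acoustic encoding module; hyper-parameters are assumed.
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    def __init__(self, n_mels=80, d_model=256, n_heads=4):
        super().__init__()
        self.affine = nn.Linear(n_mels, d_model)                  # affine transformation layer
        self.blocks = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
                          nn.BatchNorm1d(d_model),
                          nn.MaxPool1d(kernel_size=2))            # max pooling along time
            for _ in range(3)])
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, mel):                                       # mel: [B, T, 80]
        h = self.affine(mel)
        h = self.blocks(h.transpose(1, 2)).transpose(1, 2)        # shortened hidden sequence
        attn_out, _ = self.attn(h, h, h)                          # multi-head self-attention
        return h + attn_out                                       # residual connection
```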
2) A text decoding module;
As shown in fig. 10, the high-dimensional acoustic features are input to the text decoding module, which decodes step by step while an alignment attention module learns the alignment between the speech frames and the text sequence.
A randomly initialized embedding vector v_0, which represents the sentence start symbol <s/>, serves as the start of the decoding step; it is passed through a Pre-net (preprocessing network) operation and an Attention RNN (attention recurrent network layer) to obtain the hidden state h_0.
h_0 and the output x of the acoustic encoding module are fed into the alignment attention module (SMA), which employs a stepwise monotonic attention mechanism (Stepwise Monotonic Attention) defined as follows:

e_{i,j} = Energy(h_{i-1}, x_j)
p_{i,j} = sigmoid(e_{i,j})
a_{i,j} = a_{i-1,j-1} · p_{i,j-1} + a_{i-1,j} · (1 - p_{i,j})
R_i = Σ_j a_{i,j} · x_j

where e_{i,j} is the energy between the hidden state h_{i-1} and the output x_j of the acoustic encoding module; h_{i-1} is the hidden state of the Attention RNN at step i-1; x = [x_1, x_2, ..., x_j, ..., x_n] is the output of the acoustic encoding module; p_{i,j} is the probability obtained by applying the sigmoid operation to the energy e_{i,j}, and is used to decide whether the attention moves forward; a_{i,j} is the attention weight of position j at decoding step i; a_{i-1,j-1} is the attention weight of position j-1 at step i-1; p_{i,j-1} is the forward-move probability of position j-1 at step i; a_{i-1,j} is the attention weight of position j at step i-1; n is the length of the acoustic encoding module output x; R = [R_1, ..., R_i, ...] is the output of the alignment attention module; the index i denotes the decoding step and the index j runs over the positions of the acoustic encoding module output attended to at each decoding step.
In the speech recognition task the input speech and the output text are monotonically aligned, i.e. they share the same temporal order; therefore a stepwise monotonically aligned attention module is introduced into the text decoding module to align the speech features with the text content. The alignment attention module uses the predicted value p_{i,j} to decide whether to move forward, and the attention weight a_{i,j} at the current step depends only on the previous step, so the attention is computed step by step without skipping any information.
The output R of the alignment attention module and h_0 are then fed together into the decoder recurrent network layer (Decoder RNN) to obtain the text hidden vector c_0; the text hidden vector is passed through an affine transformation and a softmax operation to obtain the classification weights, where the classification categories include all Chinese characters, the sentence start symbol <s/> and the sentence end symbol </s>. The character with the largest weight is selected as the recognition result of this decoding step, i.e. the text content output by the speech recognition model, which completes the first decoding step. The second decoding step uses the text hidden vector c_0 of the first step as its starting embedding vector v_1; the above steps are repeated until the decoding result is the sentence end symbol </s>, which ends the final decoding step and completes the speech recognition. The overall procedure can be expressed by the following formulas:

h_i = AttentionRNN(PreNet(v_i), h_{i-1}),  i = 1, ..., K
R_i = SMA(h_{i-1}, x)
c_i = DecoderRNN(R_i, h_i),  with v_i = c_{i-1}

where h_i is the hidden state of the Attention RNN at step i; K is the number of characters in the recognition result; v_i is the starting embedding vector of the i-th decoding step; c_{i-1} is the text hidden vector of step i-1; h_{i-1} is the hidden state of the Attention RNN at step i-1; R is the output of the alignment attention module; x is the output of the acoustic encoding module; c_i is the text hidden vector of step i.
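A sketch of one step of the stepwise monotonic attention recurrence given above is shown below; the energy function is passed in as a callable, and the batch handling is an assumption.

```python
# One decoding step of stepwise monotonic attention (sketch).
import torch
import torch.nn.functional as F

def sma_step(h_prev, x, alpha_prev, energy_fn):
    # h_prev: [B, D] hidden state h_{i-1}; x: [B, n, D] encoder outputs x_j
    # alpha_prev: [B, n] attention weights a_{i-1, j} of the previous step
    e = energy_fn(h_prev, x)                              # energies e_{i,j}, shape [B, n]
    p = torch.sigmoid(e)                                  # forward-move probabilities p_{i,j}
    moved = F.pad((alpha_prev * p)[:, :-1], (1, 0))       # a_{i-1,j-1} * p_{i,j-1}
    alpha = moved + alpha_prev * (1.0 - p)                # a_{i,j}
    context = torch.bmm(alpha.unsqueeze(1), x).squeeze(1) # R_i = sum_j a_{i,j} x_j
    return context, alpha
```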
Relevant parameters of the speech recognition model are shown in Table 2, for example.
Table 2 model parameters
(2) Training the speech recognition model based on the recognition result of each decoding step and the corresponding text:
randomly initializing weight coefficients of an acoustic encoding module and a text decoding module; cross entropy loss is used as an optimization target for the whole network (i.e. the loss function is cross entropy loss);
Since the text decoding module decodes the hidden vectors step by step, the cross entropy loss between each decoded character (i.e. the recognition result of each decoding step) and the corresponding text is calculated, and the average of all these cross entropy losses gives the final loss; the final loss is used as the target for training the speech recognition model.
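A minimal sketch of this loss computation, assuming the per-step classification weights have been collected into one logits matrix:

```python
# Cross entropy of every decoding step against its reference character, averaged over steps.
import torch.nn.functional as F

def asr_loss(step_logits, target_ids):
    # step_logits: [T_dec, num_classes], one row per decoding step
    # target_ids:  [T_dec] reference character indices (including </s>)
    return F.cross_entropy(step_logits, target_ids)   # reduction='mean' performs the averaging
```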
The whole network adopts the Adam optimizer, the learning rate is adjusted with a warmup strategy, and a mini-batch training mode with a batch size of 48 is adopted; the network is trained for 100 epochs in total until convergence.
Further, in an embodiment, the method further comprises testing the speech recognition model;
A small amount of speech audio with known text content is selected and enhanced by the speech enhancement system; feature extraction is performed on the enhanced speech signals according to step S200 to obtain the Mel spectrum features, which are input into the speech recognition model and decoded step by step; at each step, the character with the largest weight is selected as the recognition result, yielding the finally recognized text content.
S400: recognizing the voice signal based on the trained voice recognition model;
After the speech signal to be recognized is obtained and enhanced by the speech enhancement system, feature extraction is performed on it according to step S200 to obtain the Mel spectrum features; these are input into the trained speech recognition model and decoded step by step, a group of weight values is computed at each step, the character corresponding to the largest weight is taken as the recognition result of that decoding step, and the complete recognition result is obtained after all decoding steps are finished.
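The step-by-step (greedy) decoding described here can be sketched as follows; the model attribute names and decoder interface are assumptions.

```python
# Greedy step-by-step decoding until the sentence end symbol </s> (sketch).
import torch

def greedy_decode(model, mel, sos_id, eos_id, max_steps=200):
    x = model.acoustic_encoder(mel)                  # high-dimensional acoustic features
    token, state, result = sos_id, None, []
    for _ in range(max_steps):
        weights, state = model.text_decoder(token, x, state)  # classification weights of this step
        token = int(torch.argmax(weights))           # character with the largest weight
        if token == eos_id:                          # </s> ends decoding
            break
        result.append(token)
    return result
```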
Embodiment 4
As shown in FIG. 11, one embodiment of the present invention provides a speech recognition system, comprising a data acquisition module, a preprocessing module, a speech recognition model construction and training module, and a recognition module;
the data acquisition module is used for constructing a voice data set with text labels, and enhancing voice signals in the voice data set with the text labels based on the voice enhancement system so as to obtain enhanced voice signals;
the preprocessing module is used for preprocessing the enhanced voice signal to obtain a Mel frequency spectrum characteristic;
the voice recognition model construction and training module is used for constructing a voice recognition model, and inputting the Mel frequency spectrum characteristics into the voice recognition model to obtain a decoded recognition result; training the voice recognition model based on the decoded recognition result and the text to obtain a trained voice recognition model;
the recognition module is used for recognizing the voice signal based on the trained voice recognition model.
Embodiment 5
As shown in fig. 12, one embodiment of the present invention provides a speaker recognition method, which includes the steps of:
S100: constructing a target speaker voice data set with text labels, and enhancing a target speaker voice signal in the data set based on the voice enhancement system in the second embodiment so as to obtain an enhanced voice signal; the enhanced speech signal includes information related to speaker identity;
Constructing the target speaker voice data set with text labels includes: collecting the collated speech signals from the 863-1 Chinese speech data, AISHELL-1 and AISHELL-2 corpora to form a target speaker voice data set with text labels; the speech signals in the target speaker voice data set are all speaker speech signals whose speaker identity information is known.
After the voice signals in the voice data set of the target speaker are subjected to voice enhancement processing of a voice enhancement system, enhanced voice signals and corresponding text are obtained; the enhanced speech signal includes information related to the identity of the speaker.
S200: as shown in fig. 13, preprocessing the enhanced speech signal and the corresponding text to obtain a phoneme text sequence, a phoneme duration and mel spectrum characteristics;
S210: Converting the text into a pinyin text sequence: the text sequence is converted into a pinyin text sequence with tone marks using the pypinyin tool; for example, the text sequence "Zhang Wei" becomes the pinyin text sequence "zhang1 wei3";
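A small usage sketch of the pypinyin conversion described above (Style.TONE3 appends the tone number to each syllable; writing the example name "Zhang Wei" as 张伟 is an assumption):

```python
from pypinyin import lazy_pinyin, Style

pinyin_seq = lazy_pinyin("张伟", style=Style.TONE3)
print(" ".join(pinyin_seq))   # expected: "zhang1 wei3"
```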
S220: Cutting the pinyin text sequence to obtain a phoneme text sequence; forcibly aligning the phoneme text sequence with the corresponding enhanced speech signal to obtain the phoneme durations;
The phoneme text sequence and the corresponding enhanced speech signal are input into an alignment model (obtained with the Montreal Forced Aligner (MFA) tool) for forced alignment; the time interval corresponding to each phoneme is obtained from the forced-alignment result, the number of frames corresponding to each phoneme in the phoneme text sequence is determined from the frame shift, and the total number of frames is counted to obtain the phoneme duration;
Phoneme duration: the enhanced speech signal and the pinyin text sequence are forcibly aligned using MFA (Montreal Forced Aligner); the pinyin text sequence is cut according to the pronunciation dictionary provided with the MFA tool to obtain the phoneme text sequence, and the duration of each corresponding phoneme, i.e. the pronunciation time of each phoneme, is obtained at the same time;
S230: Extracting the Mel spectrum features of the enhanced speech signal: first, pre-emphasis is applied to the enhanced speech signal, which is divided into frames with a frame length of 1024 and a frame shift of 256, and a Hamming window is applied; then, a fast Fourier transform is performed on the pre-emphasized, framed and Hamming-windowed enhanced speech signal to obtain its magnitude spectrum; finally, the magnitude spectrum of the enhanced speech signal is passed through a Mel filter bank to obtain the Mel spectrum features;
S300: constructing a speaker recognition model, and inputting a phoneme text sequence, a phoneme duration and Mel frequency spectrum characteristics into the speaker recognition model to obtain a text information hiding vector, a predicted text hiding vector and predicted speaker identity information; training the speaker recognition model based on the text information hiding vector and the predicted text hiding vector, and information related to the speaker identity and the predicted speaker identity information;
As shown in fig. 14, the speaker recognition model (i.e., speaker recognition module) includes a text encoding module, a voiceprint feature extraction module and a feature classification module (i.e., feature classifier); the present invention uses the text information loss and the feature classification loss to jointly optimize and perform supervised training of the modules, wherein the text encoding module is used only during model training.
(1) Inputting the phoneme text sequence, the phoneme duration and the mel spectrum characteristics into a speaker recognition model to obtain text information hiding vectors, predicted text hiding vectors and predicted speaker identity information;
1) A text encoding module;
The text encoding module follows the model structure of FastSpeech 2; it takes the phoneme text sequence and the phoneme durations as input and outputs the text information hidden vector. The text encoding module comprises a text encoder and a length adjuster; the text encoder is composed of 4 FFT blocks (Feed-Forward Transformer), takes the phoneme text sequence as input, encodes it and extracts semantic information, i.e. obtains and outputs a phoneme-level text hidden vector; the length adjuster expands the phoneme-level text hidden vector according to the phoneme durations to the frame-level length (i.e., the text information hidden vector): each phoneme-level hidden vector is simply copied according to its duration [n1, n2, ...] and the copies are concatenated, where nk is the duration of the k-th phoneme.
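The length adjuster can be sketched as a simple duration-based repetition of the phoneme-level hidden vectors; this is a simplified, assumed implementation.

```python
# Expand phoneme-level hidden vectors to frame-level length using phoneme durations.
import torch

def length_regulate(phoneme_hidden, durations):
    # phoneme_hidden: [num_phonemes, D]; durations: LongTensor [num_phonemes] (frame counts)
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)   # [sum(durations), D]
```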
2) A voiceprint feature extraction module; the voiceprint feature extraction module comprises a multi-level statistics computation sub-module and a feature vector extraction sub-module; it takes the Mel spectrum features as input and outputs the voiceprint feature vector and the predicted text hidden vector.
The multi-level statistics computation sub-module is formed by sequentially stacking several one-dimensional convolution layers and instance normalization operations; it performs multi-level normalization on the input Mel spectrum features, yielding the predicted text hidden vector (i.e., the predicted text hidden vector is obtained by stripping features from the Mel spectrum features) together with the means and variances;
Further, the means and variances are concatenated to obtain a statistics group; the statistics group is used to characterize speaker identity information.
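An illustrative sketch of the multi-level statistics computation follows: stacked one-dimensional convolutions with instance normalization, collecting the per-level means and variances that are stripped out by the normalization. The layer sizes and number of levels are assumptions.

```python
import torch
import torch.nn as nn

class MultiLevelStats(nn.Module):
    def __init__(self, n_mels=80, channels=256, n_levels=4):
        super().__init__()
        dims = [n_mels] + [channels] * n_levels
        self.convs = nn.ModuleList([
            nn.Conv1d(dims[i], dims[i + 1], kernel_size=3, padding=1)
            for i in range(n_levels)])

    def forward(self, mel):                               # mel: [B, 80, T]
        h, stats = mel, []
        for conv in self.convs:
            h = conv(h)
            mean = h.mean(dim=-1, keepdim=True)           # per-channel mean over time
            var = h.var(dim=-1, keepdim=True)             # per-channel variance over time
            stats += [mean.squeeze(-1), var.squeeze(-1)]
            h = (h - mean) / torch.sqrt(var + 1e-5)       # instance normalization
        statistics_group = torch.cat(stats, dim=1)        # concatenated means and variances
        return h, statistics_group                        # h feeds the predicted text hidden vector
```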
The feature vector extraction sub-module comprises a fully-connected layer and multiple affine coupling layers, and is used to output the voiceprint feature vector;
The operation of an affine coupling layer is shown in fig. 15: the hidden vector obtained by passing the statistics group through one fully-connected layer is split into two parts [x1, x2]; x1 is mapped by s(·) and t(·), and x2 undergoes multiplicative and additive coupling to obtain z2, while z1 is a direct copy of x1, according to the following formulas:
z1 = x1
z2 = s(x1) * x2 + t(x1)
Before being input into the next coupling layer, [z1, z2] is shuffled and re-split, and the operation is repeated; the split-and-shuffle operation is performed along the channel dimension so as not to destroy local dependencies.
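One affine coupling layer from the formulas above can be sketched as follows; the form of s(·) and t(·) and the simple half-swap used as the channel shuffle are assumptions.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.s = nn.Sequential(nn.Linear(half, half), nn.Tanh())   # scale network s(.)
        self.t = nn.Linear(half, half)                             # shift network t(.)

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)           # split along the channel dimension
        z1 = x1                               # z1 = x1 (direct copy)
        z2 = self.s(x1) * x2 + self.t(x1)     # z2 = s(x1) * x2 + t(x1)
        return torch.cat([z2, z1], dim=-1)    # simple channel shuffle before the next layer
```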
Relevant parameters of the text encoding module and the voiceprint feature extraction module in the speaker recognition model are shown in Table 3, for example.
Table 3 Parameters of the speaker recognition model
3) A feature classification module;
The feature classification module is connected to the voiceprint feature extraction module; it takes the voiceprint feature vector as input and outputs the identity-information weight sequence. The module comprises a fully-connected layer and a softmax activation function; it obtains the weights of the voiceprint features corresponding to the different speaker identities and classifies according to these weights, thereby determining the speaker identity and obtaining the predicted speaker identity information.
(2) Training the speaker recognition model based on the text information hiding vector and the predicted text hiding vector, and information related to the speaker identity and the predicted speaker identity information;
randomly initializing weight coefficients of a text coding module, a voiceprint feature extraction module and a feature classification module; text information loss and feature classification loss are used as joint optimization objectives for the entire network (i.e., the loss function includes text information loss and feature classification loss);
Text information loss: the MAE (mean absolute error) loss between the text information hidden vector and the predicted text hidden vector gives the text information loss; since the text information hidden vector does not contain any content related to acoustic features, the text information loss constrains the vector obtained after the multi-level normalization operation to contain no acoustic-feature-related information, i.e. the feature information related to the speaker identity is stripped into the voiceprint feature vector as far as possible;
feature classification loss:
The cross entropy loss between the information related to the speaker identity (the ground-truth identity label) and the predicted speaker identity information is calculated, in combination with the softmax activation function, to obtain the feature classification loss.
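A minimal sketch of the joint objective, assuming the two terms are simply summed (their relative weighting is not specified here):

```python
# Text information loss (MAE) + feature classification loss (cross entropy).
import torch.nn.functional as F

def speaker_losses(text_hidden, pred_text_hidden, identity_logits, speaker_ids):
    text_info_loss = F.l1_loss(pred_text_hidden, text_hidden)   # MAE between hidden vectors
    cls_loss = F.cross_entropy(identity_logits, speaker_ids)    # softmax + cross entropy
    return text_info_loss + cls_loss
```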
The whole network of the speaker recognition model adopts the Adam optimizer, the learning rate is adjusted with a warmup strategy, and a mini-batch training mode with a batch size of 48 is adopted; the network is trained for 100 epochs in total until convergence.
Further, in an embodiment, the method further comprises testing the speaker recognition model;
A small number of speech audios of speakers with known identities are selected and enhanced by the speech enhancement system; feature extraction is performed on the enhanced speech signals according to step S200 to obtain the Mel spectrum features, which are input into the speaker recognition model to obtain a group of weights; the identity corresponding to the largest weight is the result of the speaker recognition model for the speaker identity.
S400: identifying a speech signal of an unknown speaker identity based on the trained speaker identification model;
After the speech audio of the unknown speaker is obtained and enhanced by the speech enhancement system, feature extraction is performed on the speech signal according to step S200 to obtain the Mel spectrum features, which are input into the trained speaker recognition model to obtain a group of weights; the identity corresponding to the largest weight is the recognition result of the speaker recognition model for the unknown speaker (i.e., the identity of the unknown speaker is obtained).
Different from other speaker recognition methods, the speaker recognition model of the invention introduces a text coding module, and by using text information loss to restrict, all information irrelevant to text content in the voice is stripped as far as possible, namely, the information relevant to the speaker identity is stripped, and the stripped information relevant to the speaker identity is used as a voiceprint feature vector; and using a plurality of affine coupling layers in the voiceprint feature extraction module to further refine information relating only to speaker identity.
Embodiment 6
As shown in fig. 16, one embodiment of the present invention provides a speaker recognition system, which includes a data acquisition module, a preprocessing module, a speaker recognition model construction and training module, and a recognition module;
The data acquisition module is used for constructing a target speaker voice data set with text labels and enhancing the target speaker voice signals in the data set based on the voice enhancement system, so as to obtain enhanced voice signals and the corresponding text; the enhanced voice signal includes information related to the speaker identity;
the preprocessing module is used for preprocessing the enhanced voice signal and the corresponding text to obtain a phoneme text sequence, a phoneme duration and a mel frequency spectrum characteristic;
the speaker recognition model construction and training module is used for constructing a speaker recognition model, and inputting the phoneme text sequence, the phoneme duration and the mel frequency spectrum characteristics into the speaker recognition model so as to obtain text information hiding vectors, predicted text hiding vectors and predicted speaker identity information; training the speaker recognition model based on the text information hiding vector and the predicted text hiding vector, and the information related to the speaker identity and the predicted speaker identity information, thereby obtaining a trained speaker recognition model;
the recognition module is used for recognizing the voice signal of the unknown speaker identity based on the trained speaker recognition model.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk, etc.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. A method of speech enhancement comprising the steps of:
s100: generating a double-microphone far-field noisy speech based on the clean speech, the clean noise and the scattered noise; generating a plurality of target voices based on the pure voices, and recording a plurality of target voice orientations;
s200: uniformly dividing the space orientation into a plurality of target areas; labeling the plurality of target voice orientations based on a plurality of target areas to obtain labeled target voice orientations;
s300: extracting the characteristics of the double-microphone far-field noisy speech from the labeled target speech azimuth to obtain the characteristics of each target region;
S400: Constructing a masking neural speech enhancement model; training the masking neural speech enhancement model based on the characteristics of each target area, the target voice and the labeled target voice azimuth, thereby obtaining a trained masking neural speech enhancement model; the masking neural speech enhancement model includes an encoder, a spatial attention module, a decoder and a neural beam;
S500: Enhancing the voice signal based on the trained masking neural speech enhancement model.
2. The method according to claim 1, wherein labeling the plurality of target speech orientations based on the plurality of target regions in step S200 comprises:
Discretizing a plurality of target areas and marking the target areas as azimuth labels; and labeling the target voice azimuth based on the azimuth label to obtain a labeled target voice azimuth.
3. The method according to claim 1, wherein training the masking neural speech enhancement model based on the feature of each target region, target speech, and tagged target speech orientation in step S400 comprises the sub-steps of:
s410: inputting the features of each target region into a masking neural speech enhancement model to obtain an estimated target speech direction and enhanced speech;
S420: Training the masking neural speech enhancement model based on the tagged target speech orientation and the estimated target speech direction, and the target speech and the enhanced speech, to obtain a trained masking neural speech enhancement model.
4. The voice enhancement method according to claim 3, wherein the step S410 includes:
the encoder encodes the characteristics of each target region to obtain a characteristic encoded representation of each target region;
the spatial attention module calculates the spatial attention weight at the current moment based on the feature coding representation of each target area and the decoding state of a decoder at the last moment, and performs weighted summation on the feature coding representation of each target area based on the spatial attention weight at the current moment to obtain a spatially aggregated representation vector; selecting the direction with the largest attention weight as the estimated target voice direction;
The decoder predicts a time-frequency mask of the target speech based on the spatially aggregated representation vector to obtain an estimated time-frequency mask;
selecting the neural beam corresponding to the estimated target voice direction to carry out spatial filtering on the double-microphone far-field noisy voice so as to obtain a spatially filtered enhanced signal; the estimated time-frequency mask is applied to the spatially filtered enhanced signal to obtain the enhanced speech.
5. The voice enhancement method according to claim 3, wherein said step S420 comprises:
calculating a speech enhancement loss based on the target speech and the enhanced speech; and calculating cross entropy orientation classification loss based on the tagged target speech orientation and the estimated target speech orientation; the encoder, spatial attention module, decoder and neural beam are jointly trained with the speech enhancement loss and the cross entropy orientation classification loss as optimization targets.
6. A voice enhancement system, characterized by comprising a data generation module, a labeling module, a feature extraction module, a model construction and training module and a voice enhancement module;
the data generation module is used for generating double-microphone far-field noisy speech based on pure speech, pure noise and scattered noise; generating a plurality of target voices based on the pure voices, and recording a plurality of target voice orientations;
The labeling module is used for uniformly dividing the space orientation into a plurality of target areas; labeling the plurality of target voice orientations based on a plurality of target areas to obtain labeled target voice orientations;
the feature extraction module is used for extracting features of the double-microphone far-field noisy speech from the labeled target speech azimuth so as to obtain features of each target area;
the model construction and training module is used for constructing a masking neural speech enhancement model; training the masking neural speech enhancement model based on the characteristics of each target area, the target voice and the labeled target voice azimuth, thereby obtaining a trained masking neural speech enhancement model;
the speech enhancement module is configured to enhance a speech signal based on the trained masking neural speech enhancement model.
7. A method of speech recognition based on the speech enhancement system of claim 6, comprising the steps of:
s100: constructing a voice data set with text labels, and enhancing voice signals in the voice data set with text labels based on the voice enhancement system so as to obtain enhanced voice signals;
S200: preprocessing the enhanced speech signal to obtain mel spectrum characteristics;
s300: constructing a voice recognition model, and inputting the Mel frequency spectrum characteristics into the voice recognition model to obtain a decoded recognition result; training the voice recognition model based on the decoded recognition result and the text to obtain a trained voice recognition model;
s400: and recognizing a voice signal based on the trained voice recognition model.
8. A speech recognition system based on the speech enhancement system of claim 6, comprising a data acquisition module, a preprocessing module, a speech recognition model construction and training module, and a recognition module;
the data acquisition module is used for constructing a voice data set with text labels, and enhancing voice signals in the voice data set with the text labels based on the voice enhancement system so as to obtain enhanced voice signals;
the preprocessing module is used for preprocessing the enhanced voice signal to obtain a Mel frequency spectrum characteristic;
the voice recognition model construction and training module is used for constructing a voice recognition model, and inputting the Mel frequency spectrum characteristics into the voice recognition model to obtain a decoded recognition result; training the voice recognition model based on the decoded recognition result and the text to obtain a trained voice recognition model;
The recognition module is used for recognizing the voice signal based on the trained voice recognition model.
9. A speaker recognition method based on the speech enhancement system of claim 6, comprising the steps of:
s100: constructing a target speaker voice data set with text labels, and enhancing a target speaker voice signal in the target speaker voice data set with text labels based on the voice enhancement system, so as to obtain an enhanced voice signal; the enhanced speech signal includes information related to speaker identity;
s200: preprocessing the enhanced voice signal and the corresponding text to obtain a phoneme text sequence, a phoneme duration and a mel spectrum characteristic;
s300: constructing a speaker recognition model, and inputting the phoneme text sequence, the phoneme duration and the Mel frequency spectrum characteristics into the speaker recognition model to obtain a text information hiding vector, a predicted text hiding vector and predicted speaker identity information; training the speaker recognition model based on the text information hiding vector and the predicted text hiding vector, and the information related to the speaker identity and the predicted speaker identity information, thereby obtaining a trained speaker recognition model;
S400: and identifying the voice signal of the unknown speaker identity based on the trained speaker identification model.
10. A speaker recognition system based on the speech enhancement system of claim 6, comprising a data acquisition module, a preprocessing module, a speaker recognition model construction and training module, and a recognition module;
the data acquisition module is used for constructing a target speaker voice data set with text labels, and enhancing target speaker voice signals in the target speaker voice data set with the text labels based on the voice enhancement system so as to obtain enhanced voice signals and corresponding text; the enhanced speech signal includes information related to speaker identity;
the preprocessing module is used for preprocessing the enhanced voice signal and the corresponding text to obtain a phoneme text sequence, a phoneme duration and a mel frequency spectrum characteristic;
the speaker recognition model construction and training module is used for constructing a speaker recognition model, and inputting the phoneme text sequence, the phoneme duration and the mel frequency spectrum characteristics into the speaker recognition model so as to obtain text information hiding vectors, predicted text hiding vectors and predicted speaker identity information; training the speaker recognition model based on the text information hiding vector and the predicted text hiding vector, and the information related to the speaker identity and the predicted speaker identity information, thereby obtaining a trained speaker recognition model;
The recognition module is used for recognizing the voice signal of the unknown speaker identity based on the trained speaker recognition model.
CN202310238080.3A 2023-03-14 2023-03-14 Speech enhancement method, speech recognition method, speaker recognition method and speaker recognition system Active CN116092501B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310238080.3A CN116092501B (en) 2023-03-14 2023-03-14 Speech enhancement method, speech recognition method, speaker recognition method and speaker recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310238080.3A CN116092501B (en) 2023-03-14 2023-03-14 Speech enhancement method, speech recognition method, speaker recognition method and speaker recognition system

Publications (2)

Publication Number Publication Date
CN116092501A CN116092501A (en) 2023-05-09
CN116092501B true CN116092501B (en) 2023-07-25

Family

ID=86204708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310238080.3A Active CN116092501B (en) 2023-03-14 2023-03-14 Speech enhancement method, speech recognition method, speaker recognition method and speaker recognition system

Country Status (1)

Country Link
CN (1) CN116092501B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116311003B (en) * 2023-05-23 2023-08-01 澳克多普有限公司 Video detection method and system based on dual-channel loading mechanism
CN116778913B (en) * 2023-08-25 2023-10-20 澳克多普有限公司 Speech recognition method and system for enhancing noise robustness
CN117935834B (en) * 2024-03-12 2024-05-28 深圳市声优创科技有限公司 Intelligent audio noise reduction method and equipment


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014196769A1 (en) * 2013-06-03 2014-12-11 삼성전자 주식회사 Speech enhancement method and apparatus for same
CN110739002A (en) * 2019-10-16 2020-01-31 中山大学 Complex domain speech enhancement method, system and medium based on generation countermeasure network
CN112151059A (en) * 2020-09-25 2020-12-29 南京工程学院 Microphone array-oriented channel attention weighted speech enhancement method
CN114023346A (en) * 2021-11-01 2022-02-08 北京语言大学 Voice enhancement method and device capable of separating circulatory attention
CN113889137A (en) * 2021-12-06 2022-01-04 中国科学院自动化研究所 Microphone array speech enhancement method and device, electronic equipment and storage medium
CN115410589A (en) * 2022-09-05 2022-11-29 新疆大学 Attention generation confrontation voice enhancement method based on joint perception loss

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"复杂环境下基于神经网络的语音增强算法研究";商骞;《中国优秀硕士学位论文全文数据库(信息科技辑)》;全文 *
Li HF 等."Improving speech enchancement by focusing on smaller values using relative loss".《IET Signal Process》.2020,全文. *
柯登峰."基于噪音估计和参数估计的优化语音增强算法".《第七届全国人机语音通讯学术会议(NCMMSC7)论文集》.2003,全文. *

Also Published As

Publication number Publication date
CN116092501A (en) 2023-05-09

Similar Documents

Publication Publication Date Title
Yu et al. Recent progresses in deep learning based acoustic models
CN116092501B (en) Speech enhancement method, speech recognition method, speaker recognition method and speaker recognition system
Li et al. Robust automatic speech recognition: a bridge to practical applications
Xiao et al. Single-channel speech extraction using speaker inventory and attention network
Wu et al. An end-to-end deep learning approach to simultaneous speech dereverberation and acoustic modeling for robust speech recognition
Kanda et al. Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers
Feng et al. Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition
Zhang et al. Improving end-to-end single-channel multi-talker speech recognition
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
Delcroix et al. Context adaptive neural network based acoustic models for rapid adaptation
Nakagome et al. Mentoring-Reverse Mentoring for Unsupervised Multi-Channel Speech Source Separation.
Li et al. Listening and grouping: an online autoregressive approach for monaural speech separation
Ochiai et al. Does speech enhancement work with end-to-end ASR objectives?: Experimental analysis of multichannel end-to-end ASR
Zhang et al. Time-domain speech extraction with spatial information and multi speaker conditioning mechanism
Wang et al. Exploring end-to-end multi-channel ASR with bias information for meeting transcription
Wang Supervised speech separation using deep neural networks
Ng et al. Teacher-student training for text-independent speaker recognition
Soni et al. State-of-the-art analysis of deep learning-based monaural speech source separation techniques
Kim et al. Semi-supervsied Learning-based Sound Event Detection using Freuqency Dynamic Convolution with Large Kernel Attention for DCASE Challenge 2023 Task 4
Park et al. The Second DIHARD Challenge: System Description for USC-SAIL Team.
Dong et al. Towards real-world objective speech quality and intelligibility assessment using speech-enhancement residuals and convolutional long short-term memory networks
EP4177882B1 (en) Methods and systems for synthesising speech from text
Yu et al. Recent progresses in deep learning based acoustic models (updated)
Zorilă et al. An investigation into the multi-channel time domain speaker extraction network
Shi et al. Casa-asr: Context-aware speaker-attributed asr

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230625

Address after: Room 1404, Unit 2, Building 1, Haizhi Yungu, No. 2094, Shenyan Road, Tiandong Community, Haishan Street, Yantian District, Shenzhen, Guangdong 518081

Applicant after: Shenzhen Weiou Technology Co.,Ltd.

Address before: Room 401, Building B, No. 2, Lanshui Industrial Zone, Longxin Community, Baolong Street, Longgang District, Shenzhen City, Guangdong Province, 518116

Applicant before: Ocdop Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant