CN113160850A - Audio feature extraction method and device based on re-parameterization decoupling mode - Google Patents

Audio feature extraction method and device based on re-parameterization decoupling mode

Info

Publication number
CN113160850A
Authority
CN
China
Prior art keywords
convolutional layer
network
unit
module
training
Prior art date: 2021-04-27
Legal status
Pending
Application number
CN202110460111.0A
Other languages
Chinese (zh)
Inventor
许敏强
马雨枫
赵淼
刘敏
Current Assignee
Guangzhou Speakin Intelligent Technology Co., Ltd.
Original Assignee
Guangzhou Speakin Intelligent Technology Co., Ltd.
Priority date: 2021-04-27
Filing date: 2021-04-27
Publication date: 2021-07-23
Application filed by Guangzhou Speakin Intelligent Technology Co., Ltd.
Priority to CN202110460111.0A
Publication of CN113160850A


Classifications

    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique using neural networks
    • G06N3/045: Computing arrangements based on biological models; neural networks; architecture; combinations of networks
    • G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G10L17/02: Speaker identification or verification techniques; preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L17/18: Speaker identification or verification techniques; artificial neural networks; connectionist approaches


Abstract

The application discloses an audio feature extraction method and device based on a re-parameterization decoupling mode. The method comprises the following steps: acquiring a voice sample to be detected of a target speaker; preprocessing the voice sample to be detected; extracting acoustic features of the preprocessed voice sample to be detected; and inputting the acoustic features into a network inference module to obtain a voiceprint feature vector, wherein the network inference module is a single-path network model converted from a trained multi-layer network training module through re-parameterization. The application uses a multi-branch structure in the training stage to achieve better convergence, and re-parameterizes it into a single-path structure in the inference stage, obtaining better accuracy than a multi-branch structure with an equivalent number of parameters while running faster and consuming less memory.

Description

Audio feature extraction method and device based on re-parameterization decoupling mode
Technical Field
The application relates to the technical field of voiceprint feature extraction, and in particular to an audio feature extraction method and device based on a re-parameterization decoupling mode.
Background
Existing high-performance network architectures rely on multi-branch structures and on network components with superior performance. Compared with earlier single-path structures, multi-branch structures such as GoogLeNet and Inception deliver greatly improved performance, and high-performance components such as depthwise separable convolutions and group convolutions can likewise significantly improve accuracy. However, although multi-branch structures and such components improve model accuracy, they ultimately make the model slower and more memory-hungry at inference time, which is very unfavorable for industrial scenarios, especially where computation is limited.
There have also been many attempts at single-path networks in recent years, mostly centred on training deeper plain networks, but without much success: their performance is generally inferior to that of multi-branch structures, and the resulting models are often neither simple nor practical.
Disclosure of Invention
The application provides an audio feature extraction method and device based on a re-parameterization decoupling mode, so that a multi-branch structure is used in the training phase to achieve better convergence and is re-parameterized into a single-path structure in the inference phase, obtaining a better result than a multi-branch structure with an equivalent number of parameters while running faster and consuming less memory.
In view of this, a first aspect of the present application provides an audio feature extraction method based on a re-parameterization decoupling manner, where the method includes:
acquiring a voice sample to be detected of a target speaker;
preprocessing the voice sample to be detected;
extracting acoustic features of the preprocessed voice sample to be detected;
and inputting the acoustic features into a network inference module to obtain a voiceprint feature vector, wherein the network inference module is a single-path network model converted from a trained multi-layer network training module through re-parameterization.
Optionally, before the inputting the acoustic features into the network inference module to obtain the voiceprint feature vector, the method further includes:
collecting voice samples of a large number of target speakers as training voice samples;
preprocessing the training voice sample;
extracting acoustic features of the preprocessed training voice sample;
inputting the acoustic features into the network training module to obtain the trained network training module, wherein the network training module comprises a plurality of parallel first 3x3 convolutional layers, first 1x1 convolutional layers and direct-connection layers.
Optionally, converting the trained multi-layer network training module into the single-path network inference module through re-parameterization specifically includes:
combining the first 3x3 convolutional layer in the trained network training module with a BN layer unit to obtain a second 3x3 convolutional layer;
combining the first 1x1 convolutional layer in the trained network training module with a BN layer unit to obtain a second 1x1 convolutional layer;
combining the direct-connection layer in the trained network training module with a BN layer unit to obtain a third 1x1 convolutional layer;
expanding the second 1x1 convolutional layer to a third 3x3 convolutional layer;
expanding the third 1x1 convolutional layer to a fourth 3x3 convolutional layer;
and adding the second 3x3 convolutional layer, the third 3x3 convolutional layer and the fourth 3x3 convolutional layer according to the additive principle of convolution to obtain a fifth 3x3 convolutional layer in the network inference module.
Optionally, the preprocessing the voice sample to be detected includes:
resampling the voice sample to be detected and carrying out noise-reduction transformation.
A second aspect of the present application provides an audio feature extraction device based on a decoupling manner of reparameterization, the device including:
the acquisition unit is used for acquiring a voice sample to be detected of a target speaker;
the first preprocessing unit is used for preprocessing the voice sample to be detected;
the first feature extraction unit is used for extracting acoustic features of the preprocessed voice sample to be detected;
and the voiceprint feature obtaining unit is used for inputting the acoustic features into a network inference module to obtain a voiceprint feature vector, wherein the network inference module is a single-path network model converted from a trained multi-layer network training module through re-parameterization.
Optionally, the method further includes:
the acquisition unit is used for acquiring voice samples of a large number of target speakers as training voice samples;
the second preprocessing unit is used for preprocessing the training voice sample;
the second feature extraction unit is used for extracting acoustic features of the preprocessed training voice sample;
the training unit is used for inputting the acoustic features into the network training module to obtain the trained network training module, and the network training module comprises a plurality of parallel first 3x3 convolutional layers, a first 1x1 convolutional layer and a direct connection layer.
Optionally, the voiceprint feature obtaining unit includes:
the first merging unit is used for merging the first 3x3 convolutional layer in the trained network training module with the BN layer unit to obtain a second 3x3 convolutional layer;
a second merging unit, configured to merge the first 1x1 convolutional layer in the trained network training module with a BN layer unit to obtain a second 1x1 convolutional layer;
a third merging unit, configured to merge the direct-connection layer in the trained network training module with a BN layer unit to obtain a third 1x1 convolutional layer;
a first expanding unit for expanding the second 1x1 convolutional layer into a third 3x3 convolutional layer;
a second expanding unit for expanding the third 1x1 convolutional layer into a fourth 3x3 convolutional layer;
and the adding unit is used for adding the second 3x3 convolutional layer, the third 3x3 convolutional layer and the fourth 3x3 convolutional layer according to the additive principle of convolution to obtain a fifth 3x3 convolutional layer in the network inference module.
Optionally, the first preprocessing unit is specifically configured to perform resampling and noise-reduction transformation on the voice sample to be detected.
A third aspect of the present application provides an audio feature extraction device based on a re-parameterized decoupling manner, the device including a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the steps of the method for extracting audio features based on the decoupled manner of reparameterization according to the first aspect.
A fourth aspect of the present application provides a computer-readable storage medium for storing program code for performing the method of the first aspect.
According to the technical scheme, the method has the following advantages:
the application provides an audio feature extraction method based on a re-parameterization decoupling mode, which comprises the following steps: acquiring a voice sample to be detected of a target speaker; preprocessing a voice sample to be detected; extracting acoustic characteristics of the preprocessed voice sample to be detected; inputting the acoustic features into a network inference module to obtain voiceprint feature vectors, wherein the network inference module is a network model of a single-path structure converted by a trained multilayer network training module through parameterization. This application uses the multi-branch structure at the training stage to reach better convergence effect, at the inference stage, heavily parameterize to the one-way structure, in order to obtain the better effect than the multi-branch structure that the parameter quantity is equivalent, and enable speed faster, it is lower to consume the memory.
Drawings
Fig. 1 is a flowchart of a method in an embodiment of an audio feature extraction method based on a re-parameterization decoupling mode according to the present application;
Fig. 2 is a flowchart of a method in another embodiment of an audio feature extraction method based on a re-parameterization decoupling mode according to the present application;
Fig. 3 is a schematic structural diagram of an embodiment of an audio feature extraction apparatus based on a re-parameterization decoupling mode according to the present application;
Fig. 4 is a schematic diagram of the network structures of the network training module and the network inference module according to a specific embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a flowchart of a method in an embodiment of an audio feature extraction method based on a re-parameterization decoupling mode according to the present application. As shown in Fig. 1, the method includes:
101. acquiring a voice sample to be detected of a target speaker;
It should be noted that the method and device can obtain, for testing, the to-be-detected voice sample of the target speaker collected by any terminal.
102. Preprocessing a voice sample to be detected;
It should be noted that, in the present application, preprocessing the to-be-detected voice sample may include transformations such as resampling and noise reduction.
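By way of illustration, a minimal preprocessing sketch in Python. The 16 kHz target rate is an assumed choice, and the noise-reduction transformation, which the application does not specify, is only marked by a comment:

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def preprocess(wave: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    # Resample to a common rate (polyphase resampling).
    if orig_sr != target_sr:
        g = gcd(orig_sr, target_sr)
        wave = resample_poly(wave, target_sr // g, orig_sr // g)
    # The noise-reduction transformation mentioned in the application is
    # not specified and would be applied here.
    return np.asarray(wave, dtype=np.float32)
```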
103. Extracting acoustic characteristics of the preprocessed voice sample to be detected;
it should be noted that the FBANK feature can be obtained by performing a series of operations such as pre-emphasis, framing, windowing, fourier transform, mel filter bank, and logarithm operation on the preprocessed voice sample to be detected, and then normalization and effective sound extraction are performed on the FBANK feature, so that the acoustic feature of the voice sample to be detected can be obtained.
104. Inputting the acoustic features into a network inference module to obtain a voiceprint feature vector, wherein the network inference module is a single-path network model converted from a trained multi-layer network training module through re-parameterization.
It should be noted that re-parameterization means that the network parameters of the network inference module are reconstructed by applying a certain transformation to the trained network parameters of the network training module. The network inference module in this application is a single-path network model obtained by converting the trained multi-layer network training module through re-parameterization. Using the multi-layer network structure in the training stage guarantees network performance and achieves better convergence; using the single-path network inference module obtained through re-parameterization in the voiceprint feature extraction stage makes computation faster and memory consumption lower.
In this application, a multi-branch structure is used in the training stage to achieve better convergence, and is re-parameterized into a single-path structure in the inference stage to obtain better accuracy than a multi-branch structure with an equivalent number of parameters, while running faster and consuming less memory.
The present application further provides another embodiment of the audio feature extraction method based on a re-parameterization decoupling manner. As shown in Fig. 2, the method includes:
201. collecting voice samples of a large number of target speakers as training voice samples;
it should be noted that a large number of voice samples of the target speaker can be collected as training voice samples for training the network model in the network training module.
202. Preprocessing a training voice sample;
it should be noted that the training speech samples are preprocessed, and the preprocessing includes resampling the training speech samples, and performing noise reduction and other transformations.
203. Extracting acoustic features of the preprocessed training voice sample;
it should be noted that the FBANK features can be obtained by performing a series of operations such as pre-emphasis, framing, windowing, fourier transform, mel filter bank, and logarithm operation on the preprocessed training speech sample, and then normalization and effective sound extraction are performed on the FBANK features, so that the acoustic features of the speech sample to be detected can be obtained.
204. Inputting the acoustic features into a network training module to obtain a trained network training module, wherein the network training module comprises a plurality of parallel first 3x3 convolutional layers, a first 1x1 convolutional layer and a direct connection layer;
It should be noted that the acoustic features can be input into the network training module to obtain the classification corresponding to the acoustic features; a loss function is then computed, and the parameters of the training network are iteratively updated through backpropagation to obtain the network model of the training stage, that is, the trained network training module.
Specifically, the network structure of the network training module may refer to the multi-layer network on the left side of Fig. 4. On the basis of the original VGG, residual branches and 1x1 convolution branches are introduced, and the positions of the multi-path branches are arranged so that they can subsequently be re-parameterized into a single-path structure. Each block thus contains, in parallel, a first 3x3 convolutional layer, a first 1x1 convolutional layer and a direct-connection layer, and the main body of the network training module uses only one type of operator: a 3x3 convolution followed by a ReLU activation function. A sketch of such a training-time block is given below.
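By way of illustration, a minimal PyTorch sketch of one such training-time block; the class name, channel counts and the stem convolution in the usage example are illustrative assumptions, with the branch topology following the RepVGG-style design cited against this application:

```python
import torch
import torch.nn as nn

class TrainBlock(nn.Module):
    """Training-time block: parallel first 3x3 convolution, first 1x1
    convolution and direct-connection (identity) branches, each followed
    by its own BN layer, summed and passed through ReLU."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.bnid = nn.BatchNorm2d(channels)  # BN on the direct-connection branch
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn3(self.conv3(x))
                         + self.bn1(self.conv1(x))
                         + self.bnid(x))

# Illustrative usage for step 204: a stem convolution (an assumption) lifts
# the single-channel FBANK map to 64 channels, followed by stacked blocks;
# a classifier head, cross-entropy loss and backpropagation would complete
# the training loop.
net = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1),
                    TrainBlock(64), TrainBlock(64))
out = net(torch.randn(8, 1, 200, 40))  # (batch, channel, frames, Mel bins)
```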
205. Combining the first 3x3 convolutional layer in the trained network training module with the BN layer unit to obtain a second 3x3 convolutional layer;
It should be noted that, in the network training module, each of the first 3x3 convolutional layer, the first 1x1 convolutional layer and the direct-connection layer is followed by a BN (batch normalization) layer. In order to convert the network training module into a single-path structure, the first 3x3 convolutional layer and its BN layer unit may be merged to obtain the second 3x3 convolutional layer, as sketched below.
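Because batch normalization is a fixed affine transformation at inference time, each convolution followed by BN collapses into a single convolution with rescaled weights and an added bias: with BN parameters gamma, beta and running statistics mean, var, the fused kernel is W' = W * gamma / sqrt(var + eps) and the fused bias is b' = beta - mean * gamma / sqrt(var + eps). A minimal sketch continuing the PyTorch example above (bias-free convolutions before BN are assumed):

```python
def fuse_conv_bn(conv_weight, bn):
    # Inference-time BN computes y = gamma * (z - mean) / sqrt(var + eps) + beta
    # on the convolution output z, so folding it into the convolution gives
    # W' = W * gamma / sqrt(var + eps), b' = beta - mean * gamma / sqrt(var + eps).
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std
    fused_weight = conv_weight * scale.reshape(-1, 1, 1, 1)
    fused_bias = bn.bias - bn.running_mean * scale
    return fused_weight, fused_bias
```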
206. Combining the first 1x1 convolutional layer in the trained network training module with the BN layer unit to obtain a second 1x1 convolutional layer;
It should be noted that the first 1x1 convolutional layer in the trained network training module may be merged with its BN layer unit in the same way as in step 205 to obtain the second 1x1 convolutional layer. Likewise, the direct-connection layer is equivalent to a 1x1 convolution whose kernel is the identity mapping, so it can also be merged with its BN layer unit.
207. Combining the direct-connection layer in the trained network training module with the BN layer unit to obtain a third 1x1 convolutional layer;
208. Expanding the second 1x1 convolutional layer into a third 3x3 convolutional layer;
209. Expanding the third 1x1 convolutional layer into a fourth 3x3 convolutional layer;
It should be noted that a 1x1 convolution is equivalent to a 3x3 convolution whose kernel is zero everywhere except at the centre tap, so each 1x1 kernel can be zero-padded to 3x3 without changing the output, as sketched below.
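A sketch of both expansions, continuing the example above (function names are illustrative):

```python
import torch.nn.functional as F

def identity_to_1x1_kernel(channels: int) -> torch.Tensor:
    # The direct-connection layer equals a 1x1 convolution whose kernel is
    # the identity mapping: output channel i copies input channel i.
    kernel = torch.zeros(channels, channels, 1, 1)
    for i in range(channels):
        kernel[i, i, 0, 0] = 1.0
    return kernel

def pad_1x1_to_3x3(kernel_1x1: torch.Tensor) -> torch.Tensor:
    # Zero-pad a 1x1 kernel to 3x3; the single tap lands on the kernel
    # centre, so the padded convolution computes the same output.
    return F.pad(kernel_1x1, [1, 1, 1, 1])
```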
210. and adding the second 3x3 convolutional layer, the third 3x3 convolutional layer and the fourth 3x3 convolutional layer according to the additive principle of convolution to obtain a fifth 3x3 convolutional layer in the network inference module.
It should be noted that, when the three convolutional layers are added, the corresponding trained parameters of the network training module are converted accordingly, which completes the re-parameterization and yields the fifth 3x3 convolutional layer of the network inference module.
The resulting network inference module can be composed of a plurality of serially stacked fifth 3x3 convolutional layers, forming the single-path network model shown on the right side of Fig. 4. A sketch of the complete conversion, with an equivalence check, is given below.
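Putting the steps together, a sketch of the complete conversion reusing the helpers defined above (all names are illustrative); by the linearity of convolution, the single 3x3 convolution reproduces the three-branch training block exactly:

```python
@torch.no_grad()
def reparameterize(block: TrainBlock) -> nn.Conv2d:
    c = block.conv3.out_channels
    # Steps 205-207: fold each branch's BN layer into a convolution
    # (uses the BN running statistics, so the block must be in eval mode).
    w3, b3 = fuse_conv_bn(block.conv3.weight, block.bn3)
    w1, b1 = fuse_conv_bn(block.conv1.weight, block.bn1)
    wid, bid = fuse_conv_bn(identity_to_1x1_kernel(c), block.bnid)
    # Steps 208-210: pad the 1x1 kernels to 3x3, then sum kernels and
    # biases into the fifth 3x3 convolutional layer.
    fused = nn.Conv2d(c, c, 3, padding=1)
    fused.weight.copy_(w3 + pad_1x1_to_3x3(w1) + pad_1x1_to_3x3(wid))
    fused.bias.copy_(b3 + b1 + bid)
    return fused

# Equivalence check: the single-path convolution plus the shared ReLU
# reproduces the three-branch training block exactly.
block = TrainBlock(8).eval()
x = torch.randn(1, 8, 32, 32)
fused = reparameterize(block)
assert torch.allclose(block.relu(fused(x)), block(x), atol=1e-5)
```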
The present application further provides an embodiment of an audio feature extraction apparatus based on a re-parameterization decoupling manner. As shown in Fig. 3, the apparatus includes:
an obtaining unit 301, configured to obtain a to-be-detected speech sample of a target speaker;
a first preprocessing unit 302, configured to preprocess a voice sample to be detected;
a first feature extraction unit 303, configured to extract acoustic features of the preprocessed voice sample to be detected;
the voiceprint feature obtaining unit 304 is configured to input the acoustic features into the network inference module to obtain a voiceprint feature vector, where the network inference module is a single-path network model converted from a trained multi-layer network training module through re-parameterization.
In a specific embodiment, the method further comprises the following steps:
the acquisition unit is used for acquiring voice samples of a large number of target speakers as training voice samples;
the second preprocessing unit is used for preprocessing the training voice sample;
the second feature extraction unit is used for extracting acoustic features of the preprocessed training voice sample;
the training unit is used for inputting the acoustic features into the network training module to obtain the trained network training module, and the network training module comprises a plurality of parallel first 3x3 convolutional layers, first 1x1 convolutional layers and direct connection layers.
In a specific embodiment, the voiceprint feature obtaining unit includes:
the first merging unit is used for merging the first 3x3 convolutional layer in the trained network training module with the BN layer unit to obtain a second 3x3 convolutional layer;
the second merging unit is used for merging the first 1x1 convolutional layer in the trained network training module with the BN layer unit to obtain a second 1x1 convolutional layer;
the third merging unit is used for merging the direct-connection layer in the trained network training module with the BN layer unit to obtain a third 1x1 convolutional layer;
a first expanding unit for expanding the second 1x1 convolutional layer into a third 3x3 convolutional layer;
a second expansion unit for expanding the third 1x1 convolutional layer into a fourth 3x3 convolutional layer;
and the adding unit is used for adding the second 3x3 convolutional layer, the third 3x3 convolutional layer and the fourth 3x3 convolutional layer according to the additive principle of convolution to obtain a fifth 3x3 convolutional layer in the network inference module.
The first preprocessing unit is specifically used for resampling the voice sample to be detected and carrying out noise-reduction transformation.
The application also provides an audio feature extraction device based on a re-parameterization decoupling mode, which comprises a processor and a memory: the memory is used for storing program code and transmitting the program code to the processor; the processor is used for executing the above embodiments of the audio feature extraction method based on a re-parameterization decoupling mode according to the instructions in the program code.
The present application further provides a computer-readable storage medium for storing program code for executing the above embodiments of the audio feature extraction method based on a re-parameterization decoupling mode.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present application that in essence contributes beyond the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. An audio feature extraction method based on a re-parameterization decoupling manner, characterized by comprising:
acquiring a voice sample to be detected of a target speaker;
preprocessing the voice sample to be detected;
extracting acoustic features of the preprocessed voice sample to be detected;
and inputting the acoustic features into a network inference module to obtain a voiceprint feature vector, wherein the network inference module is a single-path network model converted from a trained multi-layer network training module through re-parameterization.
2. The method for extracting audio features based on a re-parameterized decoupling manner according to claim 1, wherein before inputting the acoustic features into the network inference module to obtain a voiceprint feature vector, the method further comprises:
collecting voice samples of a large number of target speakers as training voice samples;
preprocessing the training voice sample;
extracting acoustic features of the preprocessed training voice sample;
inputting the acoustic features into the network training module to obtain the trained network training module, wherein the network training module comprises a plurality of parallel first 3x3 convolutional layers, first 1x1 convolutional layers and direct-connection layers.
3. The audio feature extraction method based on a re-parameterization decoupling manner according to claim 2, wherein the network inference module is a single-path network model converted from the trained multi-layer network training module through re-parameterization, the conversion specifically including:
combining the first 3x3 convolutional layer in the trained network training module with a BN layer unit to obtain a second 3x3 convolutional layer;
combining the first 1x1 convolutional layer in the trained network training module with a BN layer unit to obtain a second 1x1 convolutional layer;
combining the direct-connection layer in the trained network training module with a BN layer unit to obtain a third 1x1 convolutional layer;
expanding the second 1x1 convolutional layer to a third 3x3 convolutional layer;
expanding the third 1x1 convolutional layer to a fourth 3x3 convolutional layer;
and adding the second 3x3 convolutional layer, the third 3x3 convolutional layer and the fourth 3x3 convolutional layer according to the additive principle of convolution to obtain a fifth 3x3 convolutional layer in the network inference module.
4. The audio feature extraction method based on the re-parameterized decoupling manner according to claim 1, wherein the preprocessing the to-be-detected speech sample comprises:
resampling the voice sample to be detected and carrying out noise-reduction transformation.
5. An audio feature extraction device based on a re-parameterization decoupling mode is characterized by comprising the following components:
the acquisition unit is used for acquiring a voice sample to be detected of a target speaker;
the first preprocessing unit is used for preprocessing the voice sample to be detected;
the first feature extraction unit is used for extracting acoustic features of the preprocessed voice sample to be detected;
and the voiceprint feature obtaining unit is used for inputting the acoustic features into a network inference module to obtain a voiceprint feature vector, wherein the network inference module is a single-path network model converted from a trained multi-layer network training module through re-parameterization.
6. The apparatus according to claim 5, further comprising:
the acquisition unit is used for acquiring voice samples of a large number of target speakers as training voice samples;
the second preprocessing unit is used for preprocessing the training voice sample;
the second feature extraction unit is used for extracting acoustic features of the preprocessed training voice sample;
the training unit is used for inputting the acoustic features into the network training module to obtain the trained network training module, and the network training module comprises a plurality of parallel first 3x3 convolutional layers, a first 1x1 convolutional layer and a direct connection layer.
7. The apparatus according to claim 6, wherein the voiceprint feature obtaining unit comprises:
the first merging unit is used for merging the first 3x3 convolutional layer in the trained network training module with the BN layer unit to obtain a second 3x3 convolutional layer;
a second merging unit, configured to merge the first 1x1 convolutional layer in the trained network training module with a BN layer unit to obtain a second 1x1 convolutional layer;
a third merging unit, configured to merge the direct-connection layer in the trained network training module with a BN layer unit to obtain a third 1x1 convolutional layer;
a first expanding unit for expanding the second 1x1 convolutional layer into a third 3x3 convolutional layer;
a second expanding unit for expanding the third 1x1 convolutional layer into a fourth 3x3 convolutional layer;
and the adding unit is used for adding the second 3x3 convolutional layer, the third 3x3 convolutional layer and the fourth 3x3 convolutional layer according to the additive principle of convolution to obtain a fifth 3x3 convolutional layer in the network inference module.
8. The device according to claim 5, wherein the first preprocessing unit is configured to resample the voice sample to be detected and carry out noise-reduction transformation.
9. An audio feature extraction device based on a re-parameterization decoupling manner, the device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the audio feature extraction method based on the re-parameterized decoupling manner according to any one of claims 1 to 4 according to instructions in the program code.
10. A computer-readable storage medium for storing program code for executing the audio feature extraction method based on a re-parameterization decoupling manner according to any one of claims 1 to 4.
CN202110460111.0A, filed 2021-04-27 (priority date 2021-04-27), published as CN113160850A, status pending: Audio feature extraction method and device based on re-parameterization decoupling mode

Priority Applications (1)

CN202110460111.0A, filed 2021-04-27, claiming priority of 2021-04-27: Audio feature extraction method and device based on re-parameterization decoupling mode

Publications (1)

CN113160850A, published 2021-07-23

Family

ID=76871528

Family Applications (1)

CN202110460111.0A, priority and filing date 2021-04-27, status pending: Audio feature extraction method and device based on re-parameterization decoupling mode

Country Status (1)

CN: CN113160850A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114219817A * 2022-02-22 2022-03-22 湖南师范大学 (Hunan Normal University) COVID-19 CT image segmentation method and terminal device


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170358306A1 (en) * 2016-06-13 2017-12-14 Alibaba Group Holding Limited Neural network-based voiceprint information extraction method and apparatus
US20180336889A1 (en) * 2017-05-19 2018-11-22 Baidu Online Network Technology (Beijing) Co., Ltd . Method and Apparatus of Building Acoustic Feature Extracting Model, and Acoustic Feature Extracting Method and Apparatus
CN108399913A (en) * 2018-02-12 2018-08-14 北京容联易通信息技术有限公司 High robust audio fingerprinting method and system
WO2019179036A1 (en) * 2018-03-19 2019-09-26 平安科技(深圳)有限公司 Deep neural network model, electronic device, identity authentication method, and storage medium
CN110021307A (en) * 2019-04-04 2019-07-16 Oppo广东移动通信有限公司 Audio method of calibration, device, storage medium and electronic equipment
US20200364603A1 (en) * 2019-05-15 2020-11-19 Google Llc Compression of Machine-Learned Models via Entropy Penalized Weight Reparameterization
CN110782907A (en) * 2019-11-06 2020-02-11 腾讯科技(深圳)有限公司 Method, device and equipment for transmitting voice signal and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xiaohan Ding et al., "RepVGG: Making VGG-style ConvNets Great Again", arXiv, 29 March 2021, pp. 1-10 *
Zhang Haibo (张海波), Smart Library Technology and Application (《智慧图书馆技术及应用》), Hebei Science and Technology Press, p. 256 *



Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2021-07-23)