CN113160850A - Audio feature extraction method and device based on re-parameterization decoupling mode - Google Patents

Audio feature extraction method and device based on re-parameterization decoupling mode

Info

Publication number
CN113160850A
Authority
CN
China
Prior art keywords
convolutional layer
network
unit
module
training
Prior art date: 2021-04-27
Legal status
Pending
Application number
CN202110460111.0A
Other languages
Chinese (zh)
Inventor
许敏强
马雨枫
赵淼
刘敏
Current Assignee
Guangzhou Speakin Intelligent Technology Co., Ltd.
Original Assignee
Guangzhou Speakin Intelligent Technology Co., Ltd.
Priority date: 2021-04-27
Filing date: 2021-04-27
Publication date: 2021-07-23
Application filed by Guangzhou Speakin Intelligent Technology Co., Ltd.
Priority to CN202110460111.0A
Publication of CN113160850A


Classifications

    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique using neural networks
    • G06N3/045: Computing arrangements based on biological models; neural networks; architecture; combinations of networks
    • G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G10L17/02: Speaker identification or verification techniques; preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L17/18: Speaker identification or verification techniques; artificial neural networks; connectionist approaches


Abstract

The application discloses an audio feature extraction method and device based on a re-parameterization decoupling mode. The method comprises the following steps: acquiring a voice sample to be detected of a target speaker; preprocessing the voice sample to be detected; extracting acoustic features of the preprocessed voice sample to be detected; and inputting the acoustic features into a network inference module to obtain a voiceprint feature vector, wherein the network inference module is a single-path network model converted from a trained multi-layer network training module through re-parameterization. The application uses a multi-branch structure in the training stage to achieve better convergence, and re-parameterizes it into a single-path structure in the inference stage, obtaining better accuracy than a multi-branch structure with an equivalent number of parameters while running faster and consuming less memory.

Description

Audio feature extraction method and device based on re-parameterization decoupling mode
Technical Field
The application relates to the technical field of voiceprint feature extraction, and in particular to an audio feature extraction method and device based on a re-parameterization decoupling mode.
Background
Existing high-performance network architectures rely on multi-branch structures and on network components with superior performance. Compared with earlier single-path structures, multi-branch structures such as GoogLeNet and Inception deliver greatly improved performance, and high-performance components such as depthwise separable convolutions and group convolutions can likewise significantly improve accuracy. However, although multi-branch structures and such components improve model accuracy, they ultimately make the model slower and more memory-hungry at inference time, which is very unfavorable for industrial scenarios, especially where computation is limited.
There have also been many attempts at single-path networks in recent years, mostly centred on training deeper plain networks, but without much success: their performance is generally inferior to that of multi-branch structures, and the resulting models are often neither simple nor practical.
Disclosure of Invention
The application provides an audio feature extraction method and device based on a re-parameterization decoupling mode, so that a multi-branch structure is used in the training phase to achieve better convergence and is re-parameterized into a single-path structure in the inference phase, obtaining a better result than a multi-branch structure with an equivalent number of parameters while running faster and consuming less memory.
In view of this, a first aspect of the present application provides an audio feature extraction method based on a re-parameterization decoupling manner, where the method includes:
acquiring a voice sample to be detected of a target speaker;
preprocessing the voice sample to be detected;
extracting acoustic features of the preprocessed voice sample to be detected;
and inputting the acoustic features into a network inference module to obtain a voiceprint feature vector, wherein the network inference module is a single-path network model converted from a trained multi-layer network training module through re-parameterization.
Optionally, before the inputting the acoustic features into the network inference module to obtain the voiceprint feature vector, the method further includes:
collecting voice samples of a large number of target speakers as training voice samples;
preprocessing the training voice sample;
extracting acoustic features of the preprocessed training voice sample;
inputting the acoustic features into the network training module to obtain the trained network training module, wherein the network training module comprises a plurality of parallel first 3x3 convolutional layers, first 1x1 convolutional layers and direct-connection layers.
Optionally, converting the trained multi-layer network training module into the single-path network inference module through re-parameterization specifically includes:
combining the first 3x3 convolutional layer in the trained network training module with a BN layer unit to obtain a second 3x3 convolutional layer;
combining the first 1x1 convolutional layer in the trained network training module with a BN layer unit to obtain a second 1x1 convolutional layer;
combining the direct-connection layer in the trained network training module with a BN layer unit to obtain a third 1x1 convolutional layer;
expanding the second 1x1 convolutional layer to a third 3x3 convolutional layer;
expanding the third 1x1 convolutional layer to a fourth 3x3 convolutional layer;
and adding the second 3x3 convolutional layer, the third 3x3 convolutional layer and the fourth 3x3 convolutional layer according to the additive principle of convolution to obtain a fifth 3x3 convolutional layer in the network inference module.
Optionally, the preprocessing the voice sample to be detected includes:
resampling the voice sample to be detected and carrying out noise-reduction transformation.
A second aspect of the present application provides an audio feature extraction device based on a decoupling manner of reparameterization, the device including:
the acquisition unit is used for acquiring a voice sample to be detected of a target speaker;
the first preprocessing unit is used for preprocessing the voice sample to be detected;
the first feature extraction unit is used for extracting acoustic features of the preprocessed voice sample to be detected;
and the voiceprint feature obtaining unit is used for inputting the acoustic features into a network inference module to obtain a voiceprint feature vector, wherein the network inference module is a single-path network model converted from a trained multi-layer network training module through re-parameterization.
Optionally, the method further includes:
the acquisition unit is used for acquiring voice samples of a large number of target speakers as training voice samples;
the second preprocessing unit is used for preprocessing the training voice sample;
the second feature extraction unit is used for extracting acoustic features of the preprocessed training voice sample;
the training unit is used for inputting the acoustic features into the network training module to obtain the trained network training module, and the network training module comprises a plurality of parallel first 3x3 convolutional layers, a first 1x1 convolutional layer and a direct connection layer.
Optionally, the voiceprint feature obtaining unit includes:
the first merging unit is used for merging the first 3x3 convolutional layer in the trained network training module with the BN layer unit to obtain a second 3x3 convolutional layer;
a second merging unit, configured to merge the first 1x1 convolutional layer in the trained network training module with a BN layer unit to obtain a second 1x1 convolutional layer;
a third merging unit, configured to merge the direct-connection layer in the trained network training module with a BN layer unit to obtain a third 1x1 convolutional layer;
a first expanding unit for expanding the second 1x1 convolutional layer into a third 3x3 convolutional layer;
a second expanding unit for expanding the third 1x1 convolutional layer into a fourth 3x3 convolutional layer;
and the adding unit is used for adding the second 3x3 convolutional layer, the third 3x3 convolutional layer and the fourth 3x3 convolutional layer according to the additive principle of convolution to obtain a fifth 3x3 convolutional layer in the network inference module.
Optionally, the first preprocessing unit is specifically configured to perform resampling and noise-reduction transformation on the voice sample to be detected.
A third aspect of the present application provides an audio feature extraction device based on a re-parameterized decoupling manner, the device including a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the steps of the method for extracting audio features based on the decoupled manner of reparameterization according to the first aspect.
A fourth aspect of the present application provides a computer-readable storage medium for storing program code for performing the method of the first aspect.
According to the technical scheme, the method has the following advantages:
the application provides an audio feature extraction method based on a re-parameterization decoupling mode, which comprises the following steps: acquiring a voice sample to be detected of a target speaker; preprocessing a voice sample to be detected; extracting acoustic characteristics of the preprocessed voice sample to be detected; inputting the acoustic features into a network inference module to obtain voiceprint feature vectors, wherein the network inference module is a network model of a single-path structure converted by a trained multilayer network training module through parameterization. This application uses the multi-branch structure at the training stage to reach better convergence effect, at the inference stage, heavily parameterize to the one-way structure, in order to obtain the better effect than the multi-branch structure that the parameter quantity is equivalent, and enable speed faster, it is lower to consume the memory.
Drawings
Fig. 1 is a flowchart of a method in an embodiment of an audio feature extraction method based on a re-parameterization decoupling mode according to the present application;
Fig. 2 is a flowchart of a method in another embodiment of an audio feature extraction method based on a re-parameterization decoupling mode according to the present application;
Fig. 3 is a schematic structural diagram of an embodiment of an audio feature extraction apparatus based on a re-parameterization decoupling mode according to the present application;
Fig. 4 is a schematic diagram of the network structures of the network training module and the network inference module according to a specific embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a flowchart of a method in an embodiment of an audio feature extraction method based on a re-parameterization decoupling mode according to the present application. As shown in Fig. 1, the method includes:
101. acquiring a voice sample to be detected of a target speaker;
It should be noted that the method and device can obtain, for testing, the to-be-detected voice sample of the target speaker collected by any terminal.
102. Preprocessing a voice sample to be detected;
It should be noted that, in the present application, preprocessing the to-be-detected voice sample may include transformations such as resampling and noise reduction.
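By way of illustration, a minimal preprocessing sketch in Python. The 16 kHz target rate is an assumed choice, and the noise-reduction transformation, which the application does not specify, is only marked by a comment:

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def preprocess(wave: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    # Resample to a common rate (polyphase resampling).
    if orig_sr != target_sr:
        g = gcd(orig_sr, target_sr)
        wave = resample_poly(wave, target_sr // g, orig_sr // g)
    # The noise-reduction transformation mentioned in the application is
    # not specified and would be applied here.
    return np.asarray(wave, dtype=np.float32)
```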
103. Extracting acoustic characteristics of the preprocessed voice sample to be detected;
it should be noted that the FBANK feature can be obtained by performing a series of operations such as pre-emphasis, framing, windowing, fourier transform, mel filter bank, and logarithm operation on the preprocessed voice sample to be detected, and then normalization and effective sound extraction are performed on the FBANK feature, so that the acoustic feature of the voice sample to be detected can be obtained.
104. Inputting the acoustic features into a network inference module to obtain a voiceprint feature vector, wherein the network inference module is a single-path network model converted from a trained multi-layer network training module through re-parameterization.
It should be noted that re-parameterization means that the network parameters of the network inference module are reconstructed by applying a certain transformation to the trained network parameters of the network training module. The network inference module in this application is a single-path network model obtained by converting the trained multi-layer network training module through re-parameterization. Using the multi-layer network structure in the training stage guarantees network performance and achieves better convergence; using the single-path network inference module obtained through re-parameterization in the voiceprint feature extraction stage makes computation faster and memory consumption lower.
In this application, a multi-branch structure is used in the training stage to achieve better convergence, and is re-parameterized into a single-path structure in the inference stage to obtain better accuracy than a multi-branch structure with an equivalent number of parameters, while running faster and consuming less memory.
The present application further provides another embodiment of the audio feature extraction method based on a re-parameterization decoupling manner. As shown in Fig. 2, the method includes:
201. collecting voice samples of a large number of target speakers as training voice samples;
it should be noted that a large number of voice samples of the target speaker can be collected as training voice samples for training the network model in the network training module.
202. Preprocessing a training voice sample;
it should be noted that the training speech samples are preprocessed, and the preprocessing includes resampling the training speech samples, and performing noise reduction and other transformations.
203. Extracting acoustic features of the preprocessed training voice sample;
it should be noted that the FBANK features can be obtained by performing a series of operations such as pre-emphasis, framing, windowing, fourier transform, mel filter bank, and logarithm operation on the preprocessed training speech sample, and then normalization and effective sound extraction are performed on the FBANK features, so that the acoustic features of the speech sample to be detected can be obtained.
204. Inputting the acoustic features into a network training module to obtain a trained network training module, wherein the network training module comprises a plurality of parallel first 3x3 convolutional layers, a first 1x1 convolutional layer and a direct connection layer;
It should be noted that the acoustic features can be input into the network training module to obtain the classification corresponding to the acoustic features; a loss function is then computed, and the parameters of the training network are iteratively updated through backpropagation to obtain the network model of the training stage, that is, the trained network training module.
Specifically, the network structure of the network training module may refer to the multi-layer network on the left side of Fig. 4. On the basis of the original VGG, residual branches and 1x1 convolution branches are introduced, and the positions of the multi-path branches are arranged so that they can subsequently be re-parameterized into a single-path structure. Each block thus contains, in parallel, a first 3x3 convolutional layer, a first 1x1 convolutional layer and a direct-connection layer, and the main body of the network training module uses only one type of operator: a 3x3 convolution followed by a ReLU activation function. A sketch of such a training-time block is given below.
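By way of illustration, a minimal PyTorch sketch of one such training-time block; the class name, channel counts and the stem convolution in the usage example are illustrative assumptions, with the branch topology following the RepVGG-style design cited against this application:

```python
import torch
import torch.nn as nn

class TrainBlock(nn.Module):
    """Training-time block: parallel first 3x3 convolution, first 1x1
    convolution and direct-connection (identity) branches, each followed
    by its own BN layer, summed and passed through ReLU."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.bnid = nn.BatchNorm2d(channels)  # BN on the direct-connection branch
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn3(self.conv3(x))
                         + self.bn1(self.conv1(x))
                         + self.bnid(x))

# Illustrative usage for step 204: a stem convolution (an assumption) lifts
# the single-channel FBANK map to 64 channels, followed by stacked blocks;
# a classifier head, cross-entropy loss and backpropagation would complete
# the training loop.
net = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1),
                    TrainBlock(64), TrainBlock(64))
out = net(torch.randn(8, 1, 200, 40))  # (batch, channel, frames, Mel bins)
```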
205. Combining the first 3x3 convolutional layer in the trained network training module with the BN layer unit to obtain a second 3x3 convolutional layer;
It should be noted that, in the network training module, each of the first 3x3 convolutional layer, the first 1x1 convolutional layer and the direct-connection layer is followed by a BN (batch normalization) layer. In order to convert the network training module into a single-path structure, the first 3x3 convolutional layer and its BN layer unit may be merged to obtain the second 3x3 convolutional layer, as sketched below.
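Because batch normalization is a fixed affine transformation at inference time, each convolution followed by BN collapses into a single convolution with rescaled weights and an added bias: with BN parameters gamma, beta and running statistics mean, var, the fused kernel is W' = W * gamma / sqrt(var + eps) and the fused bias is b' = beta - mean * gamma / sqrt(var + eps). A minimal sketch continuing the PyTorch example above (bias-free convolutions before BN are assumed):

```python
def fuse_conv_bn(conv_weight, bn):
    # Inference-time BN computes y = gamma * (z - mean) / sqrt(var + eps) + beta
    # on the convolution output z, so folding it into the convolution gives
    # W' = W * gamma / sqrt(var + eps), b' = beta - mean * gamma / sqrt(var + eps).
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std
    fused_weight = conv_weight * scale.reshape(-1, 1, 1, 1)
    fused_bias = bn.bias - bn.running_mean * scale
    return fused_weight, fused_bias
```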
206. Combining the first 1x1 convolutional layer in the trained network training module with the BN layer unit to obtain a second 1x1 convolutional layer;
It should be noted that the first 1x1 convolutional layer in the trained network training module may be merged with its BN layer unit in the same way as in step 205 to obtain the second 1x1 convolutional layer. Likewise, the direct-connection layer is equivalent to a 1x1 convolution whose kernel is the identity mapping, so it can also be merged with its BN layer unit.
207. Combining the direct-connection layer in the trained network training module with the BN layer unit to obtain a third 1x1 convolutional layer;
208. Expanding the second 1x1 convolutional layer into a third 3x3 convolutional layer;
209. Expanding the third 1x1 convolutional layer into a fourth 3x3 convolutional layer;
It should be noted that a 1x1 convolution is equivalent to a 3x3 convolution whose kernel is zero everywhere except at the centre tap, so each 1x1 kernel can be zero-padded to 3x3 without changing the output, as sketched below.
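A sketch of both expansions, continuing the example above (function names are illustrative):

```python
import torch.nn.functional as F

def identity_to_1x1_kernel(channels: int) -> torch.Tensor:
    # The direct-connection layer equals a 1x1 convolution whose kernel is
    # the identity mapping: output channel i copies input channel i.
    kernel = torch.zeros(channels, channels, 1, 1)
    for i in range(channels):
        kernel[i, i, 0, 0] = 1.0
    return kernel

def pad_1x1_to_3x3(kernel_1x1: torch.Tensor) -> torch.Tensor:
    # Zero-pad a 1x1 kernel to 3x3; the single tap lands on the kernel
    # centre, so the padded convolution computes the same output.
    return F.pad(kernel_1x1, [1, 1, 1, 1])
```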
210. and adding the second 3x3 convolutional layer, the third 3x3 convolutional layer and the fourth 3x3 convolutional layer according to the additive principle of convolution to obtain a fifth 3x3 convolutional layer in the network inference module.
It should be noted that, when the three convolutional layers are added, the corresponding trained parameters of the network training module are converted accordingly, which completes the re-parameterization and yields the fifth 3x3 convolutional layer of the network inference module.
The resulting network inference module can be composed of a plurality of serially stacked fifth 3x3 convolutional layers, forming the single-path network model shown on the right side of Fig. 4. A sketch of the complete conversion, with an equivalence check, is given below.
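Putting the steps together, a sketch of the complete conversion reusing the helpers defined above (all names are illustrative); by the linearity of convolution, the single 3x3 convolution reproduces the three-branch training block exactly:

```python
@torch.no_grad()
def reparameterize(block: TrainBlock) -> nn.Conv2d:
    c = block.conv3.out_channels
    # Steps 205-207: fold each branch's BN layer into a convolution
    # (uses the BN running statistics, so the block must be in eval mode).
    w3, b3 = fuse_conv_bn(block.conv3.weight, block.bn3)
    w1, b1 = fuse_conv_bn(block.conv1.weight, block.bn1)
    wid, bid = fuse_conv_bn(identity_to_1x1_kernel(c), block.bnid)
    # Steps 208-210: pad the 1x1 kernels to 3x3, then sum kernels and
    # biases into the fifth 3x3 convolutional layer.
    fused = nn.Conv2d(c, c, 3, padding=1)
    fused.weight.copy_(w3 + pad_1x1_to_3x3(w1) + pad_1x1_to_3x3(wid))
    fused.bias.copy_(b3 + b1 + bid)
    return fused

# Equivalence check: the single-path convolution plus the shared ReLU
# reproduces the three-branch training block exactly.
block = TrainBlock(8).eval()
x = torch.randn(1, 8, 32, 32)
fused = reparameterize(block)
assert torch.allclose(block.relu(fused(x)), block(x), atol=1e-5)
```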
The present application further provides an embodiment of an audio feature extraction apparatus based on a re-parameterization decoupling manner. As shown in Fig. 3, the apparatus includes:
an obtaining unit 301, configured to obtain a to-be-detected speech sample of a target speaker;
a first preprocessing unit 302, configured to preprocess a voice sample to be detected;
a first feature extraction unit 303, configured to extract acoustic features of the preprocessed voice sample to be detected;
the voiceprint feature obtaining unit 304 is configured to input the acoustic features into the network inference module to obtain a voiceprint feature vector, where the network inference module is a single-path network model converted from a trained multi-layer network training module through re-parameterization.
In a specific embodiment, the method further comprises the following steps:
the acquisition unit is used for acquiring voice samples of a large number of target speakers as training voice samples;
the second preprocessing unit is used for preprocessing the training voice sample;
the second feature extraction unit is used for extracting acoustic features of the preprocessed training voice sample;
the training unit is used for inputting the acoustic features into the network training module to obtain the trained network training module, and the network training module comprises a plurality of parallel first 3x3 convolutional layers, first 1x1 convolutional layers and direct connection layers.
In a specific embodiment, the voiceprint feature obtaining unit includes:
the first merging unit is used for merging the first 3x3 convolutional layer in the trained network training module with the BN layer unit to obtain a second 3x3 convolutional layer;
the second merging unit is used for merging the first 1x1 convolutional layer in the trained network training module with the BN layer unit to obtain a second 1x1 convolutional layer;
the third merging unit is used for merging the direct-connection layer in the trained network training module with the BN layer unit to obtain a third 1x1 convolutional layer;
a first expanding unit for expanding the second 1x1 convolutional layer into a third 3x3 convolutional layer;
a second expansion unit for expanding the third 1x1 convolutional layer into a fourth 3x3 convolutional layer;
and the adding unit is used for adding the second 3x3 convolutional layer, the third 3x3 convolutional layer and the fourth 3x3 convolutional layer according to the additive principle of convolution to obtain a fifth 3x3 convolutional layer in the network inference module.
The first preprocessing unit is specifically used for resampling the voice sample to be detected and carrying out noise-reduction transformation.
The application also provides an audio feature extraction device based on a re-parameterization decoupling mode, which comprises a processor and a memory: the memory is used for storing program code and transmitting the program code to the processor; the processor is used for executing the above embodiments of the audio feature extraction method based on a re-parameterization decoupling mode according to the instructions in the program code.
The present application further provides a computer-readable storage medium for storing program code for executing the above embodiments of the audio feature extraction method based on a re-parameterization decoupling mode.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present application that in essence contributes beyond the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. An audio feature extraction method based on a re-parameterization decoupling manner, characterized by comprising:
acquiring a voice sample to be detected of a target speaker;
preprocessing the voice sample to be detected;
extracting acoustic features of the preprocessed voice sample to be detected;
and inputting the acoustic features into a network inference module to obtain a voiceprint feature vector, wherein the network inference module is a single-path network model converted from a trained multi-layer network training module through re-parameterization.
2. The method for extracting audio features based on a re-parameterized decoupling manner according to claim 1, wherein before inputting the acoustic features into the network inference module to obtain a voiceprint feature vector, the method further comprises:
collecting voice samples of a large number of target speakers as training voice samples;
preprocessing the training voice sample;
extracting acoustic features of the preprocessed training voice sample;
inputting the acoustic features into the network training module to obtain the trained network training module, wherein the network training module comprises a plurality of parallel first 3x3 convolutional layers, first 1x1 convolutional layers and direct-connection layers.
3. The audio feature extraction method based on a re-parameterization decoupling manner according to claim 2, wherein the network inference module is a single-path network model converted from the trained multi-layer network training module through re-parameterization, the conversion specifically including:
combining the first 3x3 convolutional layer in the trained network training module with a BN layer unit to obtain a second 3x3 convolutional layer;
combining the first 1x1 convolutional layer in the trained network training module with a BN layer unit to obtain a second 1x1 convolutional layer;
combining the direct-connection layer in the trained network training module with a BN layer unit to obtain a third 1x1 convolutional layer;
expanding the second 1x1 convolutional layer to a third 3x3 convolutional layer;
expanding the third 1x1 convolutional layer to a fourth 3x3 convolutional layer;
and adding the second 3x3 convolutional layer, the third 3x3 convolutional layer and the fourth 3x3 convolutional layer according to the additive principle of convolution to obtain a fifth 3x3 convolutional layer in the network inference module.
4. The audio feature extraction method based on the re-parameterized decoupling manner according to claim 1, wherein the preprocessing the to-be-detected speech sample comprises:
resampling the voice sample to be detected and carrying out noise-reduction transformation.
5. An audio feature extraction device based on a re-parameterization decoupling mode is characterized by comprising the following components:
the acquisition unit is used for acquiring a voice sample to be detected of a target speaker;
the first preprocessing unit is used for preprocessing the voice sample to be detected;
the first feature extraction unit is used for extracting acoustic features of the preprocessed voice sample to be detected;
and the voiceprint feature obtaining unit is used for inputting the acoustic features into a network inference module to obtain a voiceprint feature vector, wherein the network inference module is a single-path network model converted from a trained multi-layer network training module through re-parameterization.
6. The apparatus according to claim 5, further comprising:
the acquisition unit is used for acquiring voice samples of a large number of target speakers as training voice samples;
the second preprocessing unit is used for preprocessing the training voice sample;
the second feature extraction unit is used for extracting acoustic features of the preprocessed training voice sample;
the training unit is used for inputting the acoustic features into the network training module to obtain the trained network training module, and the network training module comprises a plurality of parallel first 3x3 convolutional layers, a first 1x1 convolutional layer and a direct connection layer.
7. The apparatus according to claim 6, wherein the voiceprint feature obtaining unit comprises:
the first merging unit is used for merging the first 3x3 convolutional layer in the trained network training module with the BN layer unit to obtain a second 3x3 convolutional layer;
a second merging unit, configured to merge the first 1x1 convolutional layer in the trained network training module with a BN layer unit to obtain a second 1x1 convolutional layer;
a third merging unit, configured to merge the direct-connection layer in the trained network training module with a BN layer unit to obtain a third 1x1 convolutional layer;
a first expanding unit for expanding the second 1x1 convolutional layer into a third 3x3 convolutional layer;
a second expanding unit for expanding the third 1x1 convolutional layer into a fourth 3x3 convolutional layer;
and the adding unit is used for adding the second 3x3 convolutional layer, the third 3x3 convolutional layer and the fourth 3x3 convolutional layer according to the additive principle of convolution to obtain a fifth 3x3 convolutional layer in the network inference module.
8. The device according to claim 5, wherein the first preprocessing unit is configured to resample the voice sample to be detected and carry out noise-reduction transformation.
9. An audio feature extraction device based on a re-parameterization decoupling manner, the device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the audio feature extraction method based on the re-parameterized decoupling manner according to any one of claims 1 to 4 according to instructions in the program code.
10. A computer-readable storage medium for storing program code for executing the audio feature extraction method based on a re-parameterization decoupling manner according to any one of claims 1 to 4.
CN202110460111.0A, filed 2021-04-27 (priority date 2021-04-27), published as CN113160850A, status pending: Audio feature extraction method and device based on re-parameterization decoupling mode

Priority Applications (1)

CN202110460111.0A, filed 2021-04-27, claiming priority of 2021-04-27: Audio feature extraction method and device based on re-parameterization decoupling mode

Publications (1)

CN113160850A, published 2021-07-23

Family

ID=76871528

Family Applications (1)

CN202110460111.0A, priority and filing date 2021-04-27, status pending: Audio feature extraction method and device based on re-parameterization decoupling mode

Country Status (1)

CN: CN113160850A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114219817A * 2022-02-22 2022-03-22 湖南师范大学 (Hunan Normal University) COVID-19 CT image segmentation method and terminal device


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170358306A1 (en) * 2016-06-13 2017-12-14 Alibaba Group Holding Limited Neural network-based voiceprint information extraction method and apparatus
US20180336889A1 (en) * 2017-05-19 2018-11-22 Baidu Online Network Technology (Beijing) Co., Ltd . Method and Apparatus of Building Acoustic Feature Extracting Model, and Acoustic Feature Extracting Method and Apparatus
CN108399913A (en) * 2018-02-12 2018-08-14 北京容联易通信息技术有限公司 High robust audio fingerprinting method and system
WO2019179036A1 (en) * 2018-03-19 2019-09-26 平安科技(深圳)有限公司 Deep neural network model, electronic device, identity authentication method, and storage medium
CN110021307A (en) * 2019-04-04 2019-07-16 Oppo广东移动通信有限公司 Audio method of calibration, device, storage medium and electronic equipment
US20200364603A1 (en) * 2019-05-15 2020-11-19 Google Llc Compression of Machine-Learned Models via Entropy Penalized Weight Reparameterization
CN110782907A (en) * 2019-11-06 2020-02-11 腾讯科技(深圳)有限公司 Method, device and equipment for transmitting voice signal and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xiaohan Ding et al., "RepVGG: Making VGG-style ConvNets Great Again", arXiv, 29 March 2021, pp. 1-10 *
Zhang Haibo (张海波), Smart Library Technology and Application (《智慧图书馆技术及应用》), Hebei Science and Technology Press, p. 256 *



Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2021-07-23)