CN113362814B - Voice identification model compression method fusing combined model information - Google Patents

Voice identification model compression method fusing combined model information

Info

Publication number
CN113362814B
CN113362814B (application CN202110910114.XA)
Authority
CN
China
Prior art keywords
model
combined
information
combined model
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110910114.XA
Other languages
Chinese (zh)
Other versions
CN113362814A (en)
Inventor
易江燕 (Yi Jiangyan)
陶建华 (Tao Jianhua)
田正坤 (Tian Zhengkun)
傅睿博 (Fu Ruibo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202110910114.XA priority Critical patent/CN113362814B/en
Publication of CN113362814A publication Critical patent/CN113362814A/en
Application granted granted Critical
Publication of CN113362814B publication Critical patent/CN113362814B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
        • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00 - Speech recognition
                    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
                    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L15/063 - Training
                    • G10L15/08 - Speech classification or search
                        • G10L15/16 - Speech classification or search using artificial neural networks
                • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
                    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
                        • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
                    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
                        • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Abstract

The invention provides a voice identification model compression method fusing combined model information, which comprises the following steps: collecting training data of a target model; extracting acoustic features of the training data of the target model; extracting sample label information from the training data of the target model to serve as hard label information; meanwhile, adopting a forward computation to obtain posterior probability information of the combined model; performing linear interpolation on the posterior probability information of the combined model and the hard label information to obtain supervision probability information of the combined model; and training the target model with the assistance of the supervision probability information of the combined model, obtaining the trained target model by minimizing the probability distribution distance between the target model and the combined model.

Description

Voice identification model compression method fusing combined model information
Technical Field
The invention relates to the field of voice identification, in particular to a voice identification model compression method fusing combined model information.
Background
In recent years, with the rapid development of deep learning, speech generation technology has matured and can produce speech very close to that of a real person; the corresponding voice authenticity identification technology has therefore attracted wide attention. Current voice authenticity identification techniques can be grouped into two main categories: the first attempts improvements at the feature level, and the second at the model-structure level. Development at the model-structure level is fast, and the identification accuracy of a combined model is far higher than that of a single model.
Publication number CN111564163A discloses an RNN-based speech detection method for multiple forgery operations, which comprises the following steps: 1) obtaining original voice samples and applying M kinds of forgery processing to them, yielding M forged voices and 1 unprocessed original voice; performing feature extraction on these voices to obtain the LFCC matrix of each training sample, and feeding the LFCC matrices into an RNN classifier network for training to obtain a multi-class model; 2) for a section of test voice, extracting its features to obtain the LFCC matrix of the test data, feeding it into the RNN classifier trained in step 1) for classification, obtaining an output probability for each test voice, and combining all output probabilities into the final prediction result: if the prediction result is the original voice, the test voice is recognized as original voice; if the prediction result is a voice subjected to a certain forgery operation, the test voice is recognized as a forged voice produced by the corresponding forgery operation.
The speech detection method of publication number CN112712809B extracts several kinds of speech feature information from the speech to be detected; the feature information is fed into a plurality of pre-trained voice source models to determine a first matching degree between the speech to be detected and the source type of each voice source model; for each voice class model, a second matching degree between the speech to be detected and the class type corresponding to that voice class model is then determined from the first matching degrees; and the class type and source type of the speech to be detected are determined from the first and second matching degrees. Voice detection is thus carried out with a voice class model and the voice source models under it, completing detection of both speech authenticity and speech source.
The prior art has the following defects:
However, the defects of the combined model are obvious: it is large in size and slow in computation. In real applications the amount of speech to be screened is massive, and the low computation speed of the combined model makes it difficult to meet practical requirements.
Disclosure of Invention
In view of the above, a first aspect of the present invention provides a voice identification model compression method fusing combined model information, comprising:
Training process of the combined model:
S1: collecting training data of the combined model;
S2: extracting acoustic features of the training data of the combined model;
S3: training a plurality of single models using the acoustic features of the training data of the combined model;
S4: learning the optimal weight coefficient β_j of each single model in the combined model using linear regression; learning the weight coefficients of the single models yields the trained combined model;
Compression of the target model:
S5: collecting training data of the target model;
S6: extracting acoustic features of the training data of the target model;
S7: extracting sample label information from the training data of the target model as hard label information; meanwhile, computing posterior probability information from the combined model by a forward computation;
S8: performing linear interpolation on the hard label information and the posterior probability information to obtain supervision probability information of the combined model;
S9: training the target model with the assistance of the supervision probability information of the combined model, and obtaining the trained target model by minimizing the probability distribution distance between the target model and the combined model;
Prediction after target model compression:
S10: collecting prediction data of the target model;
S11: extracting acoustic features of the prediction data of the target model;
S12: inputting the acoustic features of the prediction data into the trained target model and outputting the genuine or fake category of the voice.
In an exemplary embodiment of the present application, the training data of the combined model includes real audio data and fake audio data.
In an exemplary embodiment of the present application, the acoustic features of the training data of the combined model are constant Q cepstral coefficients (CQCC).
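For concreteness, the following is a minimal sketch of CQCC-style feature extraction, assuming librosa and scipy are available; the function name, parameter values and the omission of the uniform resampling step of the full CQCC recipe are illustrative assumptions, not details taken from the patent:

    import numpy as np
    import librosa
    from scipy.fftpack import dct

    def extract_cqcc_like(wav_path, sr=16000, n_bins=96, bins_per_octave=12, n_ceps=20):
        # Rough CQCC-style features: constant-Q transform -> log power spectrum -> DCT.
        # The full CQCC recipe also resamples the log spectrum uniformly before the DCT;
        # that step is omitted here, so these are only CQCC-like features.
        y, sr = librosa.load(wav_path, sr=sr)
        cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=n_bins, bins_per_octave=bins_per_octave))
        log_power = np.log(cqt ** 2 + 1e-10)                          # (n_bins, n_frames)
        ceps = dct(log_power, type=2, axis=0, norm='ortho')[:n_ceps]  # keep first n_ceps coefficients
        return ceps.T                                                 # (n_frames, n_ceps)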
In an exemplary embodiment of the present application, the plurality of single models includes: a Gaussian mixture model, a convolutional neural network model, a residual network model and a long short-term memory model.
In an exemplary embodiment of the present application, a plurality of trained single models are subjected to linear regression, wherein the bias in the linear regression formula is 0.
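Below is a minimal sketch of this weight fitting, assuming the genuine-class scores of the already-trained single models are collected into a matrix; solving a zero-bias least-squares problem with numpy and then clipping and renormalizing the weights so that β_j ∈ [0, 1] and the weights sum to 1 (the constraints given below) is an assumed implementation choice, since the patent only states the constraints:

    import numpy as np

    def learn_ensemble_weights(single_model_scores, hard_labels):
        # single_model_scores: (N, M) array; column j holds P_j(genuine | x) for N utterances.
        # hard_labels: (N,) array of 0/1 ground-truth labels.
        X = np.asarray(single_model_scores, dtype=float)
        y = np.asarray(hard_labels, dtype=float)
        # Least squares without an intercept term (bias fixed to 0, as stated above).
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        # Project onto the stated constraints beta_j in [0, 1] and sum(beta) = 1
        # (clip-and-renormalize is an assumption, not a detail from the patent).
        beta = np.clip(beta, 0.0, 1.0)
        beta = beta / beta.sum() if beta.sum() > 0 else np.full(len(beta), 1.0 / len(beta))
        return beta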
In an exemplary embodiment of the present application, the specific calculation formula of the supervision probability information of the combined model is as follows:
P_ens(c_i | x) = α · P_hard(c_i | x) + (1 − α) · Q(c_i | x)
where
x: the acoustic features of each sentence of speech;
c_i: the output label corresponding to the acoustic features, taking the value genuine or fake, with i the label index;
P_ens(c_i | x): the supervision probability information of the combined model;
P_hard(c_i | x): the hard label value, obtained directly from the label data of the training data of the combined model;
Q(c_i | x): the posterior probability of the combined model;
α: the weight coefficient of P_hard(c_i | x).
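As a quick illustration of this interpolation (all numbers here are illustrative assumptions, not values from the patent): with two labels (genuine, fake), a hard label P_hard = (1, 0) for a genuine utterance, a combined-model posterior Q = (0.8, 0.2) and α = 0.5, the supervision probability becomes P_ens = 0.5·(1, 0) + 0.5·(0.8, 0.2) = (0.9, 0.1). A minimal Python sketch of the same computation over a batch, assuming integer hard labels and one-hot encoding (both assumptions for illustration):

    import numpy as np

    def supervision_probability(hard_labels, q_posterior, alpha=0.5, n_classes=2):
        # P_ens = alpha * P_hard + (1 - alpha) * Q, applied row-wise over a batch.
        # hard_labels: (N,) integer labels; q_posterior: (N, C) combined-model posterior Q(c_i | x).
        p_hard = np.eye(n_classes)[np.asarray(hard_labels, dtype=int)]   # one-hot hard labels
        return alpha * p_hard + (1.0 - alpha) * np.asarray(q_posterior, dtype=float)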
In an exemplary embodiment of the present application, the posterior probability information Q(c_i | x) of the combined model is calculated as follows:
Q(c_i | x) = Σ_{j=1}^{M} β_j · P_j(c_i | x)
where
P_j(c_i | x): the output probability of the j-th single model that the input x belongs to label c_i;
M: the total number of single models in the combined model;
β_j: the weight coefficient of the j-th single model, with β_j ∈ [0, 1].
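A minimal sketch of this forward computation, assuming the per-model posteriors have already been collected into a single array; the array shapes and function name are illustrative assumptions:

    import numpy as np

    def combined_posterior(per_model_probs, beta):
        # per_model_probs: (M, N, C) array of P_j(c_i | x) from a forward pass of each of the
        # M single models over N utterances and C labels (here C = 2: genuine / fake).
        # beta: (M,) array of weight coefficients, beta_j in [0, 1].
        probs = np.asarray(per_model_probs, dtype=float)
        weights = np.asarray(beta, dtype=float)
        # Weighted sum over the single-model axis gives Q(c_i | x), shape (N, C).
        return np.tensordot(weights, probs, axes=1)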
In an exemplary embodiment of the present application, Σ_{j=1}^{M} β_j = 1, i.e. the sum of the weight coefficients of all the single models in the combined model is 1.
In an exemplary embodiment of the present application, the training criterion of the target model is to minimize the probability distribution distance D(P_ens ‖ P_sin) between the target model and the combined model, which is calculated by the following formula:
D(P_ens ‖ P_sin) = Σ_i P_ens(c_i | x) · log( P_ens(c_i | x) / P_sin(c_i | x) )
where
P_sin(c_i | x): the posterior probability of the target model that the input x belongs to label c_i.
This distance is the Kullback-Leibler divergence between the two distributions; it reaches zero exactly when the target model reproduces the supervision distribution of the combined model.
In an exemplary embodiment of the present application, the training data of the target model includes real audio data and fake audio data; the prediction data of the target model includes real audio data and fake audio data.
According to a second aspect of the present application, there is provided a readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the speech discrimination model compression method of fusing combination model information as described in the first aspect of the present application.
According to a third aspect of the present application, there is provided a computer apparatus comprising a processor and a memory, wherein the memory is for storing a computer program; the processor, when executing the computer program stored in the memory, is configured to implement the steps of the voice identification model compression method fusing combined model information according to the first aspect of the present application.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
(1) supervision probability knowledge is extracted from the large and computationally complex combined model to assist the training of a small single model;
(2) the size of the model is compressed and the computation speed is improved, while the performance of the model is hardly lost.
Drawings
FIG. 1 is a flow chart of a combined model training process according to an embodiment of the present invention;
FIG. 2 is a flow chart of a target model training process provided by an embodiment of the present invention;
fig. 3 is a flowchart of target model prediction according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Example 1:
The invention provides a voice identification model compression method fusing combined model information, which comprises the following steps:
Training process of the combined model:
S1: collecting training data of the combined model;
S2: extracting acoustic features of the training data of the combined model;
S3: training a plurality of single models using the acoustic features of the training data of the combined model;
S4: learning the optimal weight coefficient β_j of each single model in the combined model using linear regression; learning the weight coefficients of the single models yields the trained combined model;
Compression of the target model:
S5: collecting training data of the target model;
S6: extracting acoustic features of the training data of the target model;
S7: extracting sample label information from the training data of the target model as hard label information; meanwhile, computing posterior probability information from the combined model by a forward computation;
S8: performing linear interpolation on the hard label information and the posterior probability information to obtain supervision probability information of the combined model;
S9: training the target model with the assistance of the supervision probability information of the combined model, and obtaining the trained target model by minimizing the probability distribution distance between the target model and the combined model;
Prediction after target model compression:
S10: collecting prediction data of the target model;
S11: extracting acoustic features of the prediction data of the target model;
S12: inputting the acoustic features of the prediction data into the trained target model and outputting the genuine or fake category of the voice.
In an exemplary embodiment of the present application, the training data of the combined model includes real audio data and fake audio data.
In an exemplary embodiment of the present application, the acoustic features of the training data of the combined model are constant Q cepstral coefficients (CQCC).
In an exemplary embodiment of the present application, the plurality of single models includes: a Gaussian mixture model, a convolutional neural network model, a residual network model and a long short-term memory model.
In an exemplary embodiment of the present application, a plurality of trained single models are subjected to linear regression, wherein the bias in the linear regression formula is 0.
In an exemplary embodiment of the present application, the specific calculation formula of the supervision probability information of the combined model is as follows:
P_ens(c_i | x) = α · P_hard(c_i | x) + (1 − α) · Q(c_i | x)
where
x: the acoustic features of each sentence of speech;
c_i: the output label corresponding to the acoustic features, taking the value genuine or fake, with i the label index;
P_ens(c_i | x): the supervision probability information of the combined model;
P_hard(c_i | x): the hard label value, obtained directly from the label data of the training data of the combined model;
Q(c_i | x): the posterior probability of the combined model;
α: the weight coefficient of P_hard(c_i | x).
In an exemplary embodiment of the present application, the posterior probability information Q(c_i | x) of the combined model is calculated as follows:
Q(c_i | x) = Σ_{j=1}^{M} β_j · P_j(c_i | x)
where
P_j(c_i | x): the posterior probability of the j-th single model that the input x belongs to label c_i;
M: the total number of single models in the combined model;
β_j: the weight coefficient of the j-th single model, with β_j ∈ [0, 1].
In an exemplary embodiment of the present application, Σ_{j=1}^{M} β_j = 1, i.e. the sum of the weight coefficients of all the single models in the combined model is 1.
In an exemplary embodiment of the present application, the training criterion of the target model is to minimize the probability distribution distance D(P_ens ‖ P_sin) between the target model and the combined model, which is calculated by the following formula:
D(P_ens ‖ P_sin) = Σ_i P_ens(c_i | x) · log( P_ens(c_i | x) / P_sin(c_i | x) )
where
P_sin(c_i | x): the posterior probability of the target model that the input x belongs to label c_i.
In an exemplary embodiment of the present application, the training data of the target model includes real audio data and fake audio data; the prediction data of the target model includes real audio data and fake audio data.
Example 2:
the first aspect of the embodiments of the present application provides a method for compressing a speech recognition model by fusing combination model information, including:
as shown in fig. 1, the training process of the combined model:
s1: collecting real audio data and false audio data as training data of a combined model;
s2: extracting a constant Q spectral coefficient CQCC of training data of a combined model as acoustic characteristics of the training data of the combined model;
s3: training a plurality of single models using acoustic features of training data of the combined model, the plurality of single models including: the system comprises a Gaussian mixture model, a convolutional neural network model, a residual error network model and a long-time and short-time memory model;
s4: learning optimal weight coefficients for each single model in a combined model using linear regressionß j Wherein the bias in the linear regression equation is 0. Learning the optimal weight coefficient of the single model in the combined model to obtain a trained combined model;
as shown in fig. 2, the target model is a compressed model, which is a small-volume single model, a convolutional neural network model:
s5: collecting real audio data and false audio data as training data of a target model;
s6: extracting a constant Q spectral coefficient CQCC of training data of a target model as acoustic characteristics of the training data of the target model;
s7: extracting sample label information from training data of target model as hard label informationP hard (c i |x) (ii) a Meanwhile, a forward calculation method is adopted to calculate posterior probability information from the combined model,
Q(c i |x) The specific calculation method comprises the following steps:
Figure 77618DEST_PATH_IMAGE008
where
x: the acoustic features of each sentence of speech;
c_i: the output label corresponding to the acoustic features, taking the value genuine or fake, with i the label index;
P_j(c_i | x): the posterior probability of the j-th single model that the input x belongs to label c_i;
M: the total number of single models in the combined model;
β_j: the weight coefficient of the j-th single model, with β_j ∈ [0, 1];
and Σ_{j=1}^{M} β_j = 1, that is, the sum of the weight coefficients of all the single models in the combined model is 1;
s8: performing linear interpolation on the posterior probability information and the hard tag information of the combined model to obtain supervision probability information of the combined model, wherein the concrete formula is as follows:
P_ens(c_i | x) = α · P_hard(c_i | x) + (1 − α) · Q(c_i | x)
where
P_ens(c_i | x): the supervision probability information of the combined model;
P_hard(c_i | x): the hard label value, obtained directly from the label data of the training data of the combined model;
Q(c_i | x): the posterior probability of the combined model;
α: the weight coefficient of P_hard(c_i | x).
S9: using the combined model to guide the target model to train better, i.e. forcing the probability distribution P_sin of the target model to be as close as possible to the probability distribution P_ens of the combined model, which is realized by minimizing the probability distribution distance between the combined model and the target model
D(P_ens ‖ P_sin), which is calculated by the following formula:
D(P_ens ‖ P_sin) = Σ_i P_ens(c_i | x) · log( P_ens(c_i | x) / P_sin(c_i | x) )
where
P_sin(c_i | x): the posterior probability of the target model that the input x belongs to label c_i.
The invention thus provides a method of fusing combined model information to compress the volume of a model: the combined model is used to compute supervision probability information, and this supervision probability information is then used to guide the target model to train better. Here, the combined model is a model that makes decisions by using a plurality of single models, and the target model is the compressed model, which is a single model of small volume.
The core idea of the method can be understood intuitively as forcing the target model to mimic the behaviour of the combined model, i.e. making the probability distribution P_sin of the target model as close as possible to the probability distribution P_ens of the combined model, which is realized by minimizing the probability distribution distance D(P_ens ‖ P_sin) between the combined model and the target model.
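A minimal PyTorch sketch of this training criterion, assuming the target model outputs raw logits; using F.kl_div on the log-softmax output is one common way to minimize D(P_ens ‖ P_sin), and the surrounding names (target_model, optimizer, p_ens_batch) are assumptions for illustration rather than identifiers from the patent:

    import torch
    import torch.nn.functional as F

    def distillation_loss(target_logits, p_ens):
        # D(P_ens || P_sin): KL divergence between the supervision probability of the
        # combined model and the posterior of the small target model.
        # target_logits: (N, C) raw outputs of the target model; p_ens: (N, C) probabilities.
        log_p_sin = F.log_softmax(target_logits, dim=-1)
        # kl_div expects log-probabilities as input and probabilities as target.
        return F.kl_div(log_p_sin, p_ens, reduction='batchmean')

    # One assumed training step:
    #   loss = distillation_loss(target_model(features), p_ens_batch)
    #   loss.backward(); optimizer.step(); optimizer.zero_grad()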
As shown in fig. 3, the prediction procedure of the compressed target model is as follows:
S10: collecting real audio data and fake audio data as prediction data of the target model;
S11: extracting constant Q cepstral coefficients (CQCC) from the prediction data of the target model as the acoustic features of the prediction data of the target model;
S12: inputting the acoustic features of the prediction data into the trained target model and outputting the genuine or fake category of the voice.
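A minimal sketch of this prediction step, assuming a trained PyTorch target model that accepts a batch of CQCC feature matrices; the expected input shape and the label order (index 0 = genuine, index 1 = fake) are assumptions for illustration:

    import torch

    def predict_genuine_or_fake(target_model, cqcc_features):
        # cqcc_features: (n_frames, n_ceps) feature matrix of one utterance.
        target_model.eval()
        with torch.no_grad():
            batch = torch.as_tensor(cqcc_features, dtype=torch.float32).unsqueeze(0)
            probs = torch.softmax(target_model(batch), dim=-1).squeeze(0)
        label = "genuine" if int(probs.argmax()) == 0 else "fake"
        return label, probs.tolist()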
Example 3:
according to a second aspect of the present application, there is provided a readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the speech discrimination model compression method of fusing combination model information as described in the first aspect of the present application.
Example 4:
according to a third aspect of the present application, there is provided a computer apparatus comprising a processor and a memory, wherein the memory is for storing a computer program; the processor, when executing the computer program stored in the memory, is configured to implement the steps of the method for compressing a speech recognition model by fusing combination model information according to the first aspect of the present application.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "upon" or "when" or "in response to determining", depending on the context.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A speech discrimination model compression method fusing combined model information is characterized by comprising the following steps:
Training process of the combined model:
S1: collecting training data of the combined model;
S2: extracting acoustic features of the training data of the combined model;
S3: training a plurality of single models using the acoustic features of the training data of the combined model;
S4: learning the optimal weight coefficient β_j of each single model in the combined model using linear regression, and learning the weight coefficients of the single models in the combined model to obtain a trained combined model;
compression of the target model:
s5: collecting training data of a target model;
s6: extracting acoustic features of training data of the target model;
s7: extracting sample label information from training data of a target model to serve as hard label information; meanwhile, calculating posterior probability information from the combined model by adopting a forward calculation method;
s8: performing linear interpolation on the hard tag information and the posterior probability information to obtain supervision probability information of the combined model;
s9: training by using supervision probability information of the combined model to assist the target model, and obtaining a trained target model by minimizing the probability distribution distance between the target model and the combined model;
prediction after target model compression:
s10: collecting prediction data of a target model;
s11: extracting acoustic features of prediction data of the target model;
s12: inputting the acoustic characteristics of the prediction data into the trained target model, and outputting the true and false types of the voice;
the specific calculation formula of the supervision probability information of the combined model is as follows:
P_ens(c_i | x) = α · P_hard(c_i | x) + (1 − α) · Q(c_i | x)
wherein
x: the acoustic features of each sentence of speech;
c_i: the output label corresponding to the acoustic features, taking the value genuine or fake, with i the label index;
P_ens(c_i | x): the supervision probability information of the combined model;
P_hard(c_i | x): the hard label value, obtained directly from the label data of the training data of the combined model;
Q(c_i | x): the posterior probability of the combined model;
α: the weight coefficient of P_hard(c_i | x).
2. The method of claim 1, wherein the training data of the combined model comprises real audio data and dummy audio data.
3. The method of claim 1, wherein the acoustic features of the training data of the combined model are constant Q cepstral coefficients (CQCC).
4. The method of compressing a speech recognition model by fusing combination model information according to claim 1, wherein the plurality of single models includes: a Gaussian mixture model, a convolutional neural network model, a residual network model and a long short-term memory model.
5. The method of claim 1, wherein the plurality of trained single models are subjected to linear regression, and wherein the bias in the linear regression equation is 0.
6. The method of claim 1, wherein the posterior probability information Q(c_i | x) of the combined model is calculated as follows:
Q(c_i | x) = Σ_{j=1}^{M} β_j · P_j(c_i | x)
wherein
P_j(c_i | x): the posterior probability of the j-th single model that the input x belongs to label c_i;
M: the total number of single models in the combined model;
β_j: the weight coefficient of the j-th single model, with β_j ∈ [0, 1].
7. The method of claim 6, wherein Σ_{j=1}^{M} β_j = 1, i.e. the sum of the weight coefficients of all the single models in the combined model is 1.
8. The method of claim 7, wherein the training criterion of the target model is to minimize the probability distribution distance D(P_ens ‖ P_sin) between the probability distributions of the target model and the combined model, the specific formula being:
D(P_ens ‖ P_sin) = Σ_i P_ens(c_i | x) · log( P_ens(c_i | x) / P_sin(c_i | x) )
wherein
P_sin(c_i | x): the posterior probability of the target model that the input x belongs to label c_i.
9. A readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the method for speech recognition model compression fusing combination model information according to any one of claims 1 to 8.
CN202110910114.XA 2021-08-09 2021-08-09 Voice identification model compression method fusing combined model information Active CN113362814B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110910114.XA CN113362814B (en) 2021-08-09 2021-08-09 Voice identification model compression method fusing combined model information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110910114.XA CN113362814B (en) 2021-08-09 2021-08-09 Voice identification model compression method fusing combined model information

Publications (2)

Publication Number Publication Date
CN113362814A CN113362814A (en) 2021-09-07
CN113362814B true CN113362814B (en) 2021-11-09

Family

ID=77540689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110910114.XA Active CN113362814B (en) 2021-08-09 2021-08-09 Voice identification model compression method fusing combined model information

Country Status (1)

Country Link
CN (1) CN113362814B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887679B (en) * 2021-12-08 2022-03-08 四川大学 Model training method, device, equipment and medium integrating posterior probability calibration
CN115083421B (en) * 2022-07-21 2022-11-15 中国科学院自动化研究所 Method and device for constructing automatic parameter-searching speech identification model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105355199A (en) * 2015-10-20 2016-02-24 河海大学 Model combination type speech recognition method based on GMM (Gaussian mixture model) noise estimation
CN108877769A (en) * 2018-06-25 2018-11-23 北京语言大学 The method and apparatus for identifying dialect type
CN111564163A (en) * 2020-05-08 2020-08-21 宁波大学 RNN-based voice detection method for various counterfeit operations
CN111816203A (en) * 2020-06-22 2020-10-23 天津大学 Synthetic speech detection method for inhibiting phoneme influence based on phoneme-level analysis
CN111833852A (en) * 2020-06-30 2020-10-27 苏州思必驰信息科技有限公司 Acoustic model training method and device and computer readable storage medium
CN112712809A (en) * 2021-03-29 2021-04-27 北京远鉴信息技术有限公司 Voice detection method and device, electronic equipment and storage medium
CN112992126A (en) * 2021-04-22 2021-06-18 北京远鉴信息技术有限公司 Voice authenticity verification method and device, electronic equipment and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9230550B2 (en) * 2013-01-10 2016-01-05 Sensory, Incorporated Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105355199A (en) * 2015-10-20 2016-02-24 河海大学 Model combination type speech recognition method based on GMM (Gaussian mixture model) noise estimation
CN108877769A (en) * 2018-06-25 2018-11-23 北京语言大学 The method and apparatus for identifying dialect type
CN111564163A (en) * 2020-05-08 2020-08-21 宁波大学 RNN-based voice detection method for various counterfeit operations
CN111816203A (en) * 2020-06-22 2020-10-23 天津大学 Synthetic speech detection method for inhibiting phoneme influence based on phoneme-level analysis
CN111833852A (en) * 2020-06-30 2020-10-27 苏州思必驰信息科技有限公司 Acoustic model training method and device and computer readable storage medium
CN112712809A (en) * 2021-03-29 2021-04-27 北京远鉴信息技术有限公司 Voice detection method and device, electronic equipment and storage medium
CN112992126A (en) * 2021-04-22 2021-06-18 北京远鉴信息技术有限公司 Voice authenticity verification method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN113362814A (en) 2021-09-07

Similar Documents

Publication Publication Date Title
CN110189769B (en) Abnormal sound detection method based on combination of multiple convolutional neural network models
CN111061843B (en) Knowledge-graph-guided false news detection method
CN108346436B (en) Voice emotion detection method and device, computer equipment and storage medium
CN108229649A (en) For the method and apparatus of deep learning training
CN110211574A (en) Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism
CN113362814B (en) Voice identification model compression method fusing combined model information
CN110232123B (en) Text emotion analysis method and device, computing device and readable medium
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
CN113284513B (en) Method and device for detecting false voice based on phoneme duration characteristics
Chakraborty et al. Bird call identification using dynamic kernel based support vector machines and deep neural networks
Massoudi et al. Urban sound classification using CNN
CN110459207A (en) Wake up the segmentation of voice key phrase
Miao et al. The audio auditor: user-level membership inference in internet of things voice services
CN111653275A (en) Method and device for constructing voice recognition model based on LSTM-CTC tail convolution and voice recognition method
CN115310534A (en) Underwater sound target detection training method, underwater sound target identification device, underwater sound target detection equipment and underwater sound target identification medium
CN110827809B (en) Language identification and classification method based on condition generation type confrontation network
US11763836B2 (en) Hierarchical generated audio detection system
CN115497564A (en) Antigen identification model establishing method and antigen identification method
CN115240656A (en) Training of audio recognition model, audio recognition method and device and computer equipment
CN114121018A (en) Voice document classification method, system, device and storage medium
CN112133291B (en) Language identification model training and language identification method and related device
Pappagari et al. Unsupervised spoken word retrieval using Gaussian-Bernoulli restricted Boltzmann machines
CN113380235B (en) Knowledge migration-based telephone channel false voice identification method and storage medium
CN116935889B (en) Audio category determining method and device, electronic equipment and storage medium
CN116705063B (en) Manifold measurement-based multi-model fusion voice fake identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant