CN113362814B - Voice identification model compression method fusing combined model information - Google Patents

Voice identification model compression method fusing combined model information

Info

Publication number
CN113362814B
CN113362814B (application CN202110910114.XA)
Authority
CN
China
Prior art keywords
model
combined
information
combined model
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110910114.XA
Other languages
Chinese (zh)
Other versions
CN113362814A (en)
Inventor
易江燕 (Yi Jiangyan)
陶建华 (Tao Jianhua)
田正坤 (Tian Zhengkun)
傅睿博 (Fu Ruibo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202110910114.XA priority Critical patent/CN113362814B/en
Publication of CN113362814A publication Critical patent/CN113362814A/en
Application granted granted Critical
Publication of CN113362814B publication Critical patent/CN113362814B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
        • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00 - Speech recognition
                    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
                    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L15/063 - Training
                    • G10L15/08 - Speech classification or search
                        • G10L15/16 - Speech classification or search using artificial neural networks
                • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
                    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
                        • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
                    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
                        • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Abstract

The invention provides a voice identification model compression method fusing combined model information, which comprises the following steps: collecting training data of a target model; extracting acoustic features of the training data of the target model; extracting sample label information from the training data of the target model to serve as hard label information; meanwhile, adopting a forward computation to obtain posterior probability information of the combined model; performing linear interpolation on the posterior probability information of the combined model and the hard label information to obtain supervision probability information of the combined model; and training the target model with the assistance of the supervision probability information of the combined model, obtaining the trained target model by minimizing the probability distribution distance between the target model and the combined model.

Description

Voice identification model compression method fusing combined model information
Technical Field
The invention relates to the field of voice identification, in particular to a voice identification model compression method fusing combined model information.
Background
In recent years, with the rapid development of deep learning, speech generation technology has matured and can produce speech very close to that of a real person; the corresponding voice authenticity identification technology has therefore attracted wide attention. Current voice authenticity identification techniques can be grouped into two main categories: the first attempts improvements at the feature level, and the second at the model-structure level. Development at the model-structure level is fast, and the identification accuracy of a combined model is far higher than that of a single model.
Publication number CN111564163A discloses an RNN-based speech detection method for multiple forgery operations, which comprises the following steps: 1) obtaining original voice samples and applying M kinds of forgery processing to them, yielding M forged voices and 1 unprocessed original voice; performing feature extraction on these voices to obtain the LFCC matrix of each training sample, and feeding the LFCC matrices into an RNN classifier network for training to obtain a multi-class model; 2) for a section of test voice, extracting its features to obtain the LFCC matrix of the test data, feeding it into the RNN classifier trained in step 1) for classification, obtaining an output probability for each test voice, and combining all output probabilities into the final prediction result: if the prediction result is the original voice, the test voice is recognized as original voice; if the prediction result is a voice subjected to a certain forgery operation, the test voice is recognized as a forged voice produced by the corresponding forgery operation.
The speech detection method of publication number CN112712809B extracts several kinds of speech feature information from the speech to be detected; the feature information is fed into a plurality of pre-trained voice source models to determine a first matching degree between the speech to be detected and the source type of each voice source model; for each voice class model, a second matching degree between the speech to be detected and the class type corresponding to that voice class model is then determined from the first matching degrees; and the class type and source type of the speech to be detected are determined from the first and second matching degrees. Voice detection is thus carried out with a voice class model and the voice source models under it, completing detection of both speech authenticity and speech source.
The prior art has the following defects:
However, the defects of the combined model are obvious: it is large in size and slow in computation. In real applications the amount of speech to be screened is massive, and the low computation speed of the combined model makes it difficult to meet practical requirements.
Disclosure of Invention
In view of the above, a first aspect of the present invention provides a voice identification model compression method fusing combined model information, comprising:
Training process of the combined model:
S1: collecting training data of the combined model;
S2: extracting acoustic features of the training data of the combined model;
S3: training a plurality of single models using the acoustic features of the training data of the combined model;
S4: learning the optimal weight coefficient β_j of each single model in the combined model using linear regression; learning the weight coefficients of the single models yields the trained combined model;
Compression of the target model:
S5: collecting training data of the target model;
S6: extracting acoustic features of the training data of the target model;
S7: extracting sample label information from the training data of the target model as hard label information; meanwhile, computing posterior probability information from the combined model by a forward computation;
S8: performing linear interpolation on the hard label information and the posterior probability information to obtain supervision probability information of the combined model;
S9: training the target model with the assistance of the supervision probability information of the combined model, and obtaining the trained target model by minimizing the probability distribution distance between the target model and the combined model;
Prediction after target model compression:
S10: collecting prediction data of the target model;
S11: extracting acoustic features of the prediction data of the target model;
S12: inputting the acoustic features of the prediction data into the trained target model and outputting the genuine or fake category of the voice.
In an exemplary embodiment of the present application, the training data of the combined model includes real audio data and fake audio data.
In an exemplary embodiment of the present application, the acoustic features of the training data of the combined model are constant Q cepstral coefficients (CQCC).
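For concreteness, the following is a minimal sketch of CQCC-style feature extraction, assuming librosa and scipy are available; the function name, parameter values and the omission of the uniform resampling step of the full CQCC recipe are illustrative assumptions, not details taken from the patent:

    import numpy as np
    import librosa
    from scipy.fftpack import dct

    def extract_cqcc_like(wav_path, sr=16000, n_bins=96, bins_per_octave=12, n_ceps=20):
        # Rough CQCC-style features: constant-Q transform -> log power spectrum -> DCT.
        # The full CQCC recipe also resamples the log spectrum uniformly before the DCT;
        # that step is omitted here, so these are only CQCC-like features.
        y, sr = librosa.load(wav_path, sr=sr)
        cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=n_bins, bins_per_octave=bins_per_octave))
        log_power = np.log(cqt ** 2 + 1e-10)                          # (n_bins, n_frames)
        ceps = dct(log_power, type=2, axis=0, norm='ortho')[:n_ceps]  # keep first n_ceps coefficients
        return ceps.T                                                 # (n_frames, n_ceps)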
In an exemplary embodiment of the present application, the plurality of single models includes: a Gaussian mixture model, a convolutional neural network model, a residual network model and a long short-term memory model.
In an exemplary embodiment of the present application, a plurality of trained single models are subjected to linear regression, wherein the bias in the linear regression formula is 0.
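Below is a minimal sketch of this weight fitting, assuming the genuine-class scores of the already-trained single models are collected into a matrix; solving a zero-bias least-squares problem with numpy and then clipping and renormalizing the weights so that β_j ∈ [0, 1] and the weights sum to 1 (the constraints given below) is an assumed implementation choice, since the patent only states the constraints:

    import numpy as np

    def learn_ensemble_weights(single_model_scores, hard_labels):
        # single_model_scores: (N, M) array; column j holds P_j(genuine | x) for N utterances.
        # hard_labels: (N,) array of 0/1 ground-truth labels.
        X = np.asarray(single_model_scores, dtype=float)
        y = np.asarray(hard_labels, dtype=float)
        # Least squares without an intercept term (bias fixed to 0, as stated above).
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        # Project onto the stated constraints beta_j in [0, 1] and sum(beta) = 1
        # (clip-and-renormalize is an assumption, not a detail from the patent).
        beta = np.clip(beta, 0.0, 1.0)
        beta = beta / beta.sum() if beta.sum() > 0 else np.full(len(beta), 1.0 / len(beta))
        return beta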
In an exemplary embodiment of the present application, the specific calculation formula of the supervision probability information of the combined model is as follows:
P_ens(c_i | x) = α · P_hard(c_i | x) + (1 − α) · Q(c_i | x)
where
x: the acoustic features of each sentence of speech;
c_i: the output label corresponding to the acoustic features, taking the value genuine or fake, with i the label index;
P_ens(c_i | x): the supervision probability information of the combined model;
P_hard(c_i | x): the hard label value, obtained directly from the label data of the training data of the combined model;
Q(c_i | x): the posterior probability of the combined model;
α: the weight coefficient of P_hard(c_i | x).
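As a quick illustration of this interpolation (all numbers here are illustrative assumptions, not values from the patent): with two labels (genuine, fake), a hard label P_hard = (1, 0) for a genuine utterance, a combined-model posterior Q = (0.8, 0.2) and α = 0.5, the supervision probability becomes P_ens = 0.5·(1, 0) + 0.5·(0.8, 0.2) = (0.9, 0.1). A minimal Python sketch of the same computation over a batch, assuming integer hard labels and one-hot encoding (both assumptions for illustration):

    import numpy as np

    def supervision_probability(hard_labels, q_posterior, alpha=0.5, n_classes=2):
        # P_ens = alpha * P_hard + (1 - alpha) * Q, applied row-wise over a batch.
        # hard_labels: (N,) integer labels; q_posterior: (N, C) combined-model posterior Q(c_i | x).
        p_hard = np.eye(n_classes)[np.asarray(hard_labels, dtype=int)]   # one-hot hard labels
        return alpha * p_hard + (1.0 - alpha) * np.asarray(q_posterior, dtype=float)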
In an exemplary embodiment of the present application, the posterior probability information Q(c_i | x) of the combined model is calculated as follows:
Q(c_i | x) = Σ_{j=1}^{M} β_j · P_j(c_i | x)
where
P_j(c_i | x): the output probability of the j-th single model that the input x belongs to label c_i;
M: the total number of single models in the combined model;
β_j: the weight coefficient of the j-th single model, with β_j ∈ [0, 1].
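A minimal sketch of this forward computation, assuming the per-model posteriors have already been collected into a single array; the array shapes and function name are illustrative assumptions:

    import numpy as np

    def combined_posterior(per_model_probs, beta):
        # per_model_probs: (M, N, C) array of P_j(c_i | x) from a forward pass of each of the
        # M single models over N utterances and C labels (here C = 2: genuine / fake).
        # beta: (M,) array of weight coefficients, beta_j in [0, 1].
        probs = np.asarray(per_model_probs, dtype=float)
        weights = np.asarray(beta, dtype=float)
        # Weighted sum over the single-model axis gives Q(c_i | x), shape (N, C).
        return np.tensordot(weights, probs, axes=1)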
In an exemplary embodiment of the present application, Σ_{j=1}^{M} β_j = 1, i.e. the sum of the weight coefficients of all the single models in the combined model is 1.
In an exemplary embodiment of the present application, the training criterion of the target model is to minimize the probability distribution distance D(P_ens ‖ P_sin) between the target model and the combined model, which is calculated by the following formula:
D(P_ens ‖ P_sin) = Σ_i P_ens(c_i | x) · log( P_ens(c_i | x) / P_sin(c_i | x) )
where
P_sin(c_i | x): the posterior probability of the target model that the input x belongs to label c_i.
This distance is the Kullback-Leibler divergence between the two distributions; it reaches zero exactly when the target model reproduces the supervision distribution of the combined model.
In an exemplary embodiment of the present application, the training data of the target model includes real audio data and fake audio data; the prediction data of the target model includes real audio data and fake audio data.
According to a second aspect of the present application, there is provided a readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the speech discrimination model compression method of fusing combination model information as described in the first aspect of the present application.
According to a third aspect of the present application, there is provided a computer apparatus comprising a processor and a memory, wherein the memory is for storing a computer program; the processor, when executing the computer program stored in the memory, is configured to implement the steps of the voice identification model compression method fusing combined model information according to the first aspect of the present application.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
(1) supervision probability knowledge is extracted from the large and computationally complex combined model to assist the training of a small single model;
(2) the size of the model is compressed and the computation speed is improved, while the performance of the model is hardly lost.
Drawings
FIG. 1 is a flow chart of a combined model training process according to an embodiment of the present invention;
FIG. 2 is a flow chart of a target model training process provided by an embodiment of the present invention;
fig. 3 is a flowchart of target model prediction according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Example 1:
The invention provides a voice identification model compression method fusing combined model information, which comprises the following steps:
Training process of the combined model:
S1: collecting training data of the combined model;
S2: extracting acoustic features of the training data of the combined model;
S3: training a plurality of single models using the acoustic features of the training data of the combined model;
S4: learning the optimal weight coefficient β_j of each single model in the combined model using linear regression; learning the weight coefficients of the single models yields the trained combined model;
Compression of the target model:
S5: collecting training data of the target model;
S6: extracting acoustic features of the training data of the target model;
S7: extracting sample label information from the training data of the target model as hard label information; meanwhile, computing posterior probability information from the combined model by a forward computation;
S8: performing linear interpolation on the hard label information and the posterior probability information to obtain supervision probability information of the combined model;
S9: training the target model with the assistance of the supervision probability information of the combined model, and obtaining the trained target model by minimizing the probability distribution distance between the target model and the combined model;
Prediction after target model compression:
S10: collecting prediction data of the target model;
S11: extracting acoustic features of the prediction data of the target model;
S12: inputting the acoustic features of the prediction data into the trained target model and outputting the genuine or fake category of the voice.
In an exemplary embodiment of the present application, the training data of the combined model includes real audio data and fake audio data.
In an exemplary embodiment of the present application, the acoustic features of the training data of the combined model are constant Q cepstral coefficients (CQCC).
In an exemplary embodiment of the present application, the plurality of single models includes: a Gaussian mixture model, a convolutional neural network model, a residual network model and a long short-term memory model.
In an exemplary embodiment of the present application, a plurality of trained single models are subjected to linear regression, wherein the bias in the linear regression formula is 0.
In an exemplary embodiment of the present application, the specific calculation formula of the supervision probability information of the combined model is as follows:
P_ens(c_i | x) = α · P_hard(c_i | x) + (1 − α) · Q(c_i | x)
where
x: the acoustic features of each sentence of speech;
c_i: the output label corresponding to the acoustic features, taking the value genuine or fake, with i the label index;
P_ens(c_i | x): the supervision probability information of the combined model;
P_hard(c_i | x): the hard label value, obtained directly from the label data of the training data of the combined model;
Q(c_i | x): the posterior probability of the combined model;
α: the weight coefficient of P_hard(c_i | x).
In an exemplary embodiment of the present application, the posterior probability information Q(c_i | x) of the combined model is calculated as follows:
Q(c_i | x) = Σ_{j=1}^{M} β_j · P_j(c_i | x)
where
P_j(c_i | x): the posterior probability of the j-th single model that the input x belongs to label c_i;
M: the total number of single models in the combined model;
β_j: the weight coefficient of the j-th single model, with β_j ∈ [0, 1].
In an exemplary embodiment of the present application, Σ_{j=1}^{M} β_j = 1, i.e. the sum of the weight coefficients of all the single models in the combined model is 1.
In an exemplary embodiment of the present application, the training criterion of the target model is to minimize the probability distribution distance D(P_ens ‖ P_sin) between the target model and the combined model, which is calculated by the following formula:
D(P_ens ‖ P_sin) = Σ_i P_ens(c_i | x) · log( P_ens(c_i | x) / P_sin(c_i | x) )
where
P_sin(c_i | x): the posterior probability of the target model that the input x belongs to label c_i.
In an exemplary embodiment of the present application, the training data of the target model includes real audio data and fake audio data; the prediction data of the target model includes real audio data and fake audio data.
Example 2:
the first aspect of the embodiments of the present application provides a method for compressing a speech recognition model by fusing combination model information, including:
as shown in fig. 1, the training process of the combined model:
s1: collecting real audio data and false audio data as training data of a combined model;
s2: extracting a constant Q spectral coefficient CQCC of training data of a combined model as acoustic characteristics of the training data of the combined model;
s3: training a plurality of single models using acoustic features of training data of the combined model, the plurality of single models including: the system comprises a Gaussian mixture model, a convolutional neural network model, a residual error network model and a long-time and short-time memory model;
s4: learning optimal weight coefficients for each single model in a combined model using linear regressionß j Wherein the bias in the linear regression equation is 0. Learning the optimal weight coefficient of the single model in the combined model to obtain a trained combined model;
as shown in fig. 2, the target model is a compressed model, which is a small-volume single model, a convolutional neural network model:
s5: collecting real audio data and false audio data as training data of a target model;
s6: extracting a constant Q spectral coefficient CQCC of training data of a target model as acoustic characteristics of the training data of the target model;
s7: extracting sample label information from training data of target model as hard label informationP hard (c i |x) (ii) a Meanwhile, a forward calculation method is adopted to calculate posterior probability information from the combined model,
Q(c i |x) The specific calculation method comprises the following steps:
Figure 77618DEST_PATH_IMAGE008
where
x: the acoustic features of each sentence of speech;
c_i: the output label corresponding to the acoustic features, taking the value genuine or fake, with i the label index;
P_j(c_i | x): the posterior probability of the j-th single model that the input x belongs to label c_i;
M: the total number of single models in the combined model;
β_j: the weight coefficient of the j-th single model, with β_j ∈ [0, 1];
and Σ_{j=1}^{M} β_j = 1, that is, the sum of the weight coefficients of all the single models in the combined model is 1;
s8: performing linear interpolation on the posterior probability information and the hard tag information of the combined model to obtain supervision probability information of the combined model, wherein the concrete formula is as follows:
P_ens(c_i | x) = α · P_hard(c_i | x) + (1 − α) · Q(c_i | x)
where
P_ens(c_i | x): the supervision probability information of the combined model;
P_hard(c_i | x): the hard label value, obtained directly from the label data of the training data of the combined model;
Q(c_i | x): the posterior probability of the combined model;
α: the weight coefficient of P_hard(c_i | x).
S9: using the combined model to guide the target model to train better, i.e. forcing the probability distribution P_sin of the target model to be as close as possible to the probability distribution P_ens of the combined model, which is realized by minimizing the probability distribution distance between the combined model and the target model
D(P_ens ‖ P_sin), which is calculated by the following formula:
D(P_ens ‖ P_sin) = Σ_i P_ens(c_i | x) · log( P_ens(c_i | x) / P_sin(c_i | x) )
where
P_sin(c_i | x): the posterior probability of the target model that the input x belongs to label c_i.
The invention thus provides a method of fusing combined model information to compress the volume of a model: the combined model is used to compute supervision probability information, and this supervision probability information is then used to guide the target model to train better. Here, the combined model is a model that makes decisions by using a plurality of single models, and the target model is the compressed model, which is a single model of small volume.
The core idea of the method can be understood intuitively as forcing the target model to mimic the behaviour of the combined model, i.e. making the probability distribution P_sin of the target model as close as possible to the probability distribution P_ens of the combined model, which is realized by minimizing the probability distribution distance D(P_ens ‖ P_sin) between the combined model and the target model.
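A minimal PyTorch sketch of this training criterion, assuming the target model outputs raw logits; using F.kl_div on the log-softmax output is one common way to minimize D(P_ens ‖ P_sin), and the surrounding names (target_model, optimizer, p_ens_batch) are assumptions for illustration rather than identifiers from the patent:

    import torch
    import torch.nn.functional as F

    def distillation_loss(target_logits, p_ens):
        # D(P_ens || P_sin): KL divergence between the supervision probability of the
        # combined model and the posterior of the small target model.
        # target_logits: (N, C) raw outputs of the target model; p_ens: (N, C) probabilities.
        log_p_sin = F.log_softmax(target_logits, dim=-1)
        # kl_div expects log-probabilities as input and probabilities as target.
        return F.kl_div(log_p_sin, p_ens, reduction='batchmean')

    # One assumed training step:
    #   loss = distillation_loss(target_model(features), p_ens_batch)
    #   loss.backward(); optimizer.step(); optimizer.zero_grad()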
As shown in fig. 3, the prediction procedure of the compressed target model is as follows:
S10: collecting real audio data and fake audio data as prediction data of the target model;
S11: extracting constant Q cepstral coefficients (CQCC) from the prediction data of the target model as the acoustic features of the prediction data of the target model;
S12: inputting the acoustic features of the prediction data into the trained target model and outputting the genuine or fake category of the voice.
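A minimal sketch of this prediction step, assuming a trained PyTorch target model that accepts a batch of CQCC feature matrices; the expected input shape and the label order (index 0 = genuine, index 1 = fake) are assumptions for illustration:

    import torch

    def predict_genuine_or_fake(target_model, cqcc_features):
        # cqcc_features: (n_frames, n_ceps) feature matrix of one utterance.
        target_model.eval()
        with torch.no_grad():
            batch = torch.as_tensor(cqcc_features, dtype=torch.float32).unsqueeze(0)
            probs = torch.softmax(target_model(batch), dim=-1).squeeze(0)
        label = "genuine" if int(probs.argmax()) == 0 else "fake"
        return label, probs.tolist()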
Example 3:
according to a second aspect of the present application, there is provided a readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the speech discrimination model compression method of fusing combination model information as described in the first aspect of the present application.
Example 4:
according to a third aspect of the present application, there is provided a computer apparatus comprising a processor and a memory, wherein the memory is for storing a computer program; the processor, when executing the computer program stored in the memory, is configured to implement the steps of the method for compressing a speech recognition model by fusing combination model information according to the first aspect of the present application.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "upon" or "when" or "in response to determining", depending on the context.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A speech discrimination model compression method fusing combined model information is characterized by comprising the following steps:
Training process of the combined model:
S1: collecting training data of the combined model;
S2: extracting acoustic features of the training data of the combined model;
S3: training a plurality of single models using the acoustic features of the training data of the combined model;
S4: learning the optimal weight coefficient β_j of each single model in the combined model using linear regression, and learning the weight coefficients of the single models in the combined model to obtain a trained combined model;
compression of the target model:
s5: collecting training data of a target model;
s6: extracting acoustic features of training data of the target model;
s7: extracting sample label information from training data of a target model to serve as hard label information; meanwhile, calculating posterior probability information from the combined model by adopting a forward calculation method;
s8: performing linear interpolation on the hard tag information and the posterior probability information to obtain supervision probability information of the combined model;
s9: training by using supervision probability information of the combined model to assist the target model, and obtaining a trained target model by minimizing the probability distribution distance between the target model and the combined model;
prediction after target model compression:
s10: collecting prediction data of a target model;
s11: extracting acoustic features of prediction data of the target model;
s12: inputting the acoustic characteristics of the prediction data into the trained target model, and outputting the true and false types of the voice;
the specific calculation formula of the supervision probability information of the combined model is as follows:
P_ens(c_i | x) = α · P_hard(c_i | x) + (1 − α) · Q(c_i | x)
wherein
x: the acoustic features of each sentence of speech;
c_i: the output label corresponding to the acoustic features, taking the value genuine or fake, with i the label index;
P_ens(c_i | x): the supervision probability information of the combined model;
P_hard(c_i | x): the hard label value, obtained directly from the label data of the training data of the combined model;
Q(c_i | x): the posterior probability of the combined model;
α: the weight coefficient of P_hard(c_i | x).
2. The method of claim 1, wherein the training data of the combined model comprises real audio data and dummy audio data.
3. The method of claim 1, wherein the acoustic features of the training data of the combined model are constant Q cepstral coefficients (CQCC).
4. The method of compressing a speech recognition model by fusing combination model information according to claim 1, wherein the plurality of single models includes: a Gaussian mixture model, a convolutional neural network model, a residual network model and a long short-term memory model.
5. The method of claim 1, wherein the plurality of trained single models are subjected to linear regression, and wherein the bias in the linear regression equation is 0.
6. The method of claim 1, wherein the posterior probability information Q(c_i | x) of the combined model is calculated as follows:
Q(c_i | x) = Σ_{j=1}^{M} β_j · P_j(c_i | x)
wherein
P_j(c_i | x): the posterior probability of the j-th single model that the input x belongs to label c_i;
M: the total number of single models in the combined model;
β_j: the weight coefficient of the j-th single model, with β_j ∈ [0, 1].
7. The method of claim 6, wherein Σ_{j=1}^{M} β_j = 1, i.e. the sum of the weight coefficients of all the single models in the combined model is 1.
8. The method of claim 7, wherein the training criterion of the target model is to minimize the probability distribution distance D(P_ens ‖ P_sin) between the probability distributions of the target model and the combined model, the specific formula being:
D(P_ens ‖ P_sin) = Σ_i P_ens(c_i | x) · log( P_ens(c_i | x) / P_sin(c_i | x) )
wherein
P_sin(c_i | x): the posterior probability of the target model that the input x belongs to label c_i.
9. A readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the method for speech recognition model compression fusing combination model information according to any one of claims 1 to 8.
CN202110910114.XA 2021-08-09 2021-08-09 Voice identification model compression method fusing combined model information Active CN113362814B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110910114.XA CN113362814B (en) 2021-08-09 2021-08-09 Voice identification model compression method fusing combined model information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110910114.XA CN113362814B (en) 2021-08-09 2021-08-09 Voice identification model compression method fusing combined model information

Publications (2)

Publication Number Publication Date
CN113362814A CN113362814A (en) 2021-09-07
CN113362814B true CN113362814B (en) 2021-11-09

Family

ID=77540689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110910114.XA Active CN113362814B (en) 2021-08-09 2021-08-09 Voice identification model compression method fusing combined model information

Country Status (1)

Country Link
CN (1) CN113362814B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887679B (en) * 2021-12-08 2022-03-08 四川大学 Model training method, device, equipment and medium integrating posterior probability calibration
CN115083421B (en) * 2022-07-21 2022-11-15 中国科学院自动化研究所 Method and device for constructing automatic parameter-searching speech identification model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105355199A (en) * 2015-10-20 2016-02-24 河海大学 Model combination type speech recognition method based on GMM (Gaussian mixture model) noise estimation
CN108877769A (en) * 2018-06-25 2018-11-23 北京语言大学 The method and apparatus for identifying dialect type
CN111564163A (en) * 2020-05-08 2020-08-21 宁波大学 RNN-based voice detection method for various counterfeit operations
CN111816203A (en) * 2020-06-22 2020-10-23 天津大学 Synthetic speech detection method for inhibiting phoneme influence based on phoneme-level analysis
CN111833852A (en) * 2020-06-30 2020-10-27 苏州思必驰信息科技有限公司 Acoustic model training method and device and computer readable storage medium
CN112712809A (en) * 2021-03-29 2021-04-27 北京远鉴信息技术有限公司 Voice detection method and device, electronic equipment and storage medium
CN112992126A (en) * 2021-04-22 2021-06-18 北京远鉴信息技术有限公司 Voice authenticity verification method and device, electronic equipment and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9230550B2 (en) * 2013-01-10 2016-01-05 Sensory, Incorporated Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105355199A (en) * 2015-10-20 2016-02-24 河海大学 Model combination type speech recognition method based on GMM (Gaussian mixture model) noise estimation
CN108877769A (en) * 2018-06-25 2018-11-23 北京语言大学 The method and apparatus for identifying dialect type
CN111564163A (en) * 2020-05-08 2020-08-21 宁波大学 RNN-based voice detection method for various counterfeit operations
CN111816203A (en) * 2020-06-22 2020-10-23 天津大学 Synthetic speech detection method for inhibiting phoneme influence based on phoneme-level analysis
CN111833852A (en) * 2020-06-30 2020-10-27 苏州思必驰信息科技有限公司 Acoustic model training method and device and computer readable storage medium
CN112712809A (en) * 2021-03-29 2021-04-27 北京远鉴信息技术有限公司 Voice detection method and device, electronic equipment and storage medium
CN112992126A (en) * 2021-04-22 2021-06-18 北京远鉴信息技术有限公司 Voice authenticity verification method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN113362814A (en) 2021-09-07

Similar Documents

Publication Publication Date Title
CN110189769B (en) Abnormal sound detection method based on combination of multiple convolutional neural network models
CN111061843B (en) Knowledge-graph-guided false news detection method
CN108346436B (en) Voice emotion detection method and device, computer equipment and storage medium
CN108229649A (en) For the method and apparatus of deep learning training
CN110211574A (en) Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism
CN113362814B (en) Voice identification model compression method fusing combined model information
CN110232123B (en) Text emotion analysis method and device, computing device and readable medium
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
CN113284513B (en) Method and device for detecting false voice based on phoneme duration characteristics
Chakraborty et al. Bird call identification using dynamic kernel based support vector machines and deep neural networks
Massoudi et al. Urban sound classification using CNN
CN110459207A (en) Wake up the segmentation of voice key phrase
Miao et al. The audio auditor: user-level membership inference in internet of things voice services
CN111653275A (en) Method and device for constructing voice recognition model based on LSTM-CTC tail convolution and voice recognition method
CN115310534A (en) Underwater sound target detection training method, underwater sound target identification device, underwater sound target detection equipment and underwater sound target identification medium
CN110827809B (en) Language identification and classification method based on condition generation type confrontation network
US11763836B2 (en) Hierarchical generated audio detection system
CN115497564A (en) Antigen identification model establishing method and antigen identification method
CN115240656A (en) Training of audio recognition model, audio recognition method and device and computer equipment
CN114121018A (en) Voice document classification method, system, device and storage medium
CN112133291B (en) Language identification model training and language identification method and related device
Pappagari et al. Unsupervised spoken word retrieval using Gaussian-Bernoulli restricted Boltzmann machines
CN113380235B (en) Knowledge migration-based telephone channel false voice identification method and storage medium
CN116935889B (en) Audio category determining method and device, electronic equipment and storage medium
CN116705063B (en) Manifold measurement-based multi-model fusion voice fake identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant