CN117577117A - Training method and device for orthogonalization low-rank adaptive matrix voice detection model - Google Patents

Training method and device for orthogonalization low-rank adaptive matrix voice detection model

Info

Publication number
CN117577117A
Authority
CN
China
Prior art keywords
training
voice
data set
low
model
Prior art date
Legal status
Granted
Application number
CN202410063975.2A
Other languages
Chinese (zh)
Other versions
CN117577117B (en)
Inventor
陶建华
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202410063975.2A
Publication of CN117577117A
Application granted
Publication of CN117577117B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/04: Training, enrolment or model building
    • G10L17/26: Recognition of special voice characteristics, e.g. for use in lie detectors; recognition of animal voices

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Filters That Use Time-Delay Elements (AREA)

Abstract

The invention provides a training method and device for a speech detection model with orthogonalized low-rank adaptation matrices, in the technical field of speech recognition. A new training data set is acquired; a pre-trained large speech model is loaded, its parameters are frozen, and a first low-rank adaptation matrix and a second low-rank adaptation matrix are introduced to obtain a speech detection model to be trained; the new training data set is input into the speech detection model to be trained, and training is completed by orthogonally optimizing the parameters of the first and second low-rank adaptation matrices, yielding the speech detection model. For a new data set obtained in practice, training the speech detection model with this method, which introduces low-rank adaptation matrices and fine-tunes the model, significantly reduces training cost, greatly improves the model's ability to detect generated audio from the new data set, and leaves its ability to detect previously learned generation algorithms essentially unaffected.

Description

Training method and device for orthogonalization low-rank adaptive matrix voice detection model
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a training method and device for a speech detection model with orthogonalized low-rank adaptation matrices.
Background
With the rapid development of deep learning, voice conversion and speech synthesis technologies have matured, and speech generated by deep learning models is widely used in human-computer interaction. However, abuse of generated speech harms individuals and society, so techniques for distinguishing genuine speech from generated speech have drawn wide attention. Generated-speech detection based on a speech detection model performs well on most data sets, but its detection accuracy drops sharply when facing speech produced by new or unknown generation algorithms.
At present, retraining a speech detection model on speech produced by new and unknown algorithms causes the model to forget the known algorithms it has already learned, consumes large amounts of computing resources and training time, and is therefore costly in practical applications.
To address these problems, the present invention proposes a training method for a speech detection model based on orthogonalized low-rank adaptation matrices.
Disclosure of Invention
The invention provides a training method and device for a speech detection model with orthogonalized low-rank adaptation matrices to solve the above problems.
In a first aspect of the present invention, a training method for a speech detection model with orthogonalized low-rank adaptation matrices is provided, where the training method includes:
acquiring a new training data set, wherein the new training data set comprises a plurality of speech samples generated by generation algorithms unknown to a pre-trained large speech model;
loading the pre-trained large speech model, freezing its parameters, and introducing a first low-rank adaptation matrix and a second low-rank adaptation matrix to obtain a speech detection model to be trained;
inputting the new training data set into the speech detection model to be trained, and completing training by orthogonally optimizing the parameters of the first low-rank adaptation matrix and the second low-rank adaptation matrix to obtain the speech detection model, wherein the orthogonal optimization means that, during training of the speech detection model to be trained, the training of the first and second low-rank adaptation matrices on each data set is independent, so that knowledge learned from previously learned training data sets is not forgotten.
In an alternative embodiment of the present invention, the training process of the pre-trained large speech model is as follows:
acquiring an old training data set;
and pre-training the large speech model on the old training data set to obtain the pre-trained large speech model, which can identify the generation algorithms of the speech in the old training data set.
In an alternative embodiment of the present invention, orthogonally optimizing the parameters of the first low-rank adaptation matrix and the second low-rank adaptation matrix includes:
during training of the speech detection model to be trained, dividing the new training data set into a plurality of batches of sub-data sets for training, wherein the weight-update direction for the sub-data set of the i-th batch is orthogonal to the weight-update direction for the sub-data set of the (i-1)-th batch, so that the weight update of each sub-data set does not affect the weight updates of the sub-data sets of the other batches.
In an alternative embodiment of the present invention, in the formula for the orthogonal optimization, i denotes the batch to which the input sub-data set belongs when training the speech detection model, j denotes that the training data set containing that sub-data set is the j-th training data set, x denotes speech in the new training data set, α denotes a preset constant, T denotes the transpose, and x̄ denotes the average of the input speech in the new training data set.
In an alternative embodiment of the present invention, after the speech detection model is obtained, the training method further includes:
acquiring speech to be detected;
inputting the speech to be detected into the speech detection model and outputting a detection result, wherein, when the speech to be detected was generated by an algorithm that the pre-trained large speech model has learned, the output of the pre-trained large speech model for the speech to be detected is taken as the detection result;
and, when the speech to be detected was generated by an algorithm that the pre-trained large speech model has not learned, the sum of the outputs of the pre-trained large speech model, the first low-rank adaptation matrix and the second low-rank adaptation matrix is taken as the detection result.
In an alternative embodiment of the present invention, the formula of the detection result is as follows:
h_model = W_SOM · x + B · A · x
wherein h_model is the detection result output by the speech detection model, x is the input speech to be detected, W_SOM is the weight of the pre-trained large speech model, A is the first low-rank adaptation matrix, and B is the second low-rank adaptation matrix.
In a second aspect of the embodiment of the present invention, a training device for a speech detection model with orthogonalized low-rank adaptation matrices is provided, where the training device includes:
a new training data set acquisition module, configured to acquire a new training data set, wherein the new training data set comprises a plurality of speech samples generated by generation algorithms unknown to a pre-trained large speech model;
a to-be-trained speech detection model acquisition module, configured to load the pre-trained large speech model, freeze its parameters, and introduce a first low-rank adaptation matrix and a second low-rank adaptation matrix to obtain a speech detection model to be trained;
and a speech detection model acquisition module, configured to input the new training data set into the speech detection model to be trained and complete training by orthogonally optimizing the parameters of the first and second low-rank adaptation matrices to obtain the speech detection model, wherein the orthogonal optimization means that, during training of the speech detection model to be trained, the training of the first and second low-rank adaptation matrices on each data set is independent, so that knowledge learned from previously learned training data sets is not forgotten.
In an optional embodiment of the invention, the to-be-trained speech detection model acquisition module further includes a first training submodule, where the first training submodule includes:
an acquisition unit, configured to acquire an old training data set;
and a pre-training unit, configured to pre-train the large speech model on the old training data set to obtain the pre-trained large speech model, which can identify the generation algorithms of the speech in the old training data set.
In an optional embodiment of the invention, the speech detection model acquisition module further includes:
an orthogonal optimization submodule, configured to divide the new training data set into a plurality of batches of sub-data sets for training during training of the speech detection model to be trained, wherein the weight-update direction for the sub-data set of the i-th batch is orthogonal to the weight-update direction for the sub-data set of the (i-1)-th batch, so that the weight update of each sub-data set does not affect the weight updates of the sub-data sets of the other batches.
In an alternative embodiment of the present invention, in the orthogonal optimization formula in the orthogonal optimization submodule, i denotes the batch to which the input sub-data set belongs when training the speech detection model, j denotes that the training data set containing that sub-data set is the j-th training data set, x denotes speech in the new training data set, α denotes a preset constant, T denotes the transpose, and x̄ denotes the average of the input speech in the new training data set.
In an alternative embodiment of the present invention, after the speech detection model is obtained, the training device further includes:
a to-be-detected speech acquisition module, configured to acquire speech to be detected;
and a detection result acquisition module, configured to input the speech to be detected into the speech detection model and output a detection result, wherein, when the speech to be detected was generated by an algorithm that the pre-trained large speech model has learned, the output of the pre-trained large speech model for the speech to be detected is taken as the detection result; and, when the speech to be detected was generated by an algorithm that the pre-trained large speech model has not learned, the sum of the outputs of the pre-trained large speech model, the first low-rank adaptation matrix and the second low-rank adaptation matrix is taken as the detection result.
In an alternative embodiment of the present invention, the formula of the detection result in the detection result acquisition module is as follows:
h_model = W_SOM · x + B · A · x
wherein h_model is the detection result output by the speech detection model, x is the input speech to be detected, W_SOM is the weight of the pre-trained large speech model, A is the first low-rank adaptation matrix, and B is the second low-rank adaptation matrix.
In a third aspect of the embodiments of the present invention, an electronic device is provided, including: a memory for storing one or more programs; and a processor, wherein, when the one or more programs are executed by the processor, the processor implements the training method for a speech detection model with orthogonalized low-rank adaptation matrices according to any one of the first aspect.
In a fourth aspect of the embodiments of the present invention, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the program implements the training method for a speech detection model with orthogonalized low-rank adaptation matrices according to any one of the first aspect.
The invention has the following advantages. The embodiment of the invention provides a training method and device for a speech detection model with orthogonalized low-rank adaptation matrices. A new training data set is acquired, comprising a plurality of speech samples generated by generation algorithms unknown to a pre-trained large speech model; the pre-trained large speech model is loaded, its parameters are frozen, and a first low-rank adaptation matrix and a second low-rank adaptation matrix are introduced to obtain a speech detection model to be trained; the new training data set is input into the speech detection model to be trained, and training is completed by orthogonally optimizing the parameters of the two low-rank adaptation matrices, yielding the speech detection model. The orthogonal optimization means that, during training, the training of the first and second low-rank adaptation matrices on each data set is independent, so that knowledge learned from previously learned training data sets is not forgotten. For a new data set obtained in practice, training the speech detection model with this method, which introduces low-rank adaptation matrices and fine-tunes the model, significantly reduces training cost, greatly improves the model's ability to detect generated audio from the new data set, and leaves its ability to detect previously learned generation algorithms essentially unaffected.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the invention, and that a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flow chart of the steps of a training method for a speech detection model with orthogonalized low-rank adaptation matrices according to an embodiment of the present invention;
Fig. 2 is a block diagram of a training device for a speech detection model with orthogonalized low-rank adaptation matrices according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following describes the technical solutions in the embodiments of the present invention clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
With the rapid development of deep learning, voice conversion and speech synthesis technologies have matured, and speech generated by deep learning models is widely used in human-computer interaction, for example in smart home, entertainment and education scenarios. However, abuse of generated speech harms individuals and society, so techniques for distinguishing genuine speech from generated speech have drawn wide attention. Generated-speech detection based on a speech detection model performs well on most data sets, but its detection accuracy drops sharply when facing speech produced by new or unknown generation algorithms.
At present, retraining a speech detection model on speech produced by new and unknown algorithms causes the model to forget the known algorithms it has already learned, consumes large amounts of computing resources and training time, and is therefore costly in practical applications.
The invention therefore provides a training method for a speech detection model based on low-rank adaptation matrices, so that the generated-speech detection model can learn speech produced by unknown generation algorithms through additionally introduced low-rank adaptation matrices, while its detection accuracy on speech produced by known generation algorithms does not decrease.
In a first aspect of the present invention, referring to fig. 1, fig. 1 is a flow chart of the steps of a training method for a speech detection model with orthogonalized low-rank adaptation matrices according to an embodiment of the present invention, where the training method includes the following steps:
step 101: acquiring a new training data set, wherein the new training data set comprises a plurality of speech samples generated by generation algorithms unknown to a pre-trained large speech model;
step 102: loading the pre-trained large speech model, freezing its parameters, and introducing a first low-rank adaptation matrix and a second low-rank adaptation matrix to obtain a speech detection model to be trained;
step 103: inputting the new training data set into the speech detection model to be trained, and completing training by orthogonally optimizing the parameters of the first and second low-rank adaptation matrices to obtain the speech detection model, wherein the orthogonal optimization means that, during training of the speech detection model to be trained, the training of the first and second low-rank adaptation matrices on each data set is independent, so that knowledge learned from previously learned training data sets is not forgotten.
In order to give the speech detection model high detection accuracy on generated speech produced by new and unknown algorithms, the invention adopts a fine-tuning-based method to fine-tune the parameters of the speech detection model on such generated speech. First, on the basis of the pre-trained large speech model, low-rank adaptation matrices are introduced and combined into a speech detection model to be trained; then the new training data set is used to train the speech detection model to be trained, and the parameters of the first low-rank adaptation matrix and the second low-rank adaptation matrix are optimized orthogonally to obtain the speech detection model. The specific implementation steps are as follows.
In the specific implementation of step 101, the new training data set consists of speech samples produced by generation algorithms that the pre-trained large speech model has not learned; such algorithms are obtained and used to generate a plurality of speech samples, which form the new training data set for subsequently training the speech detection model to be trained. A large speech model here refers to a neural network model with a large number of parameters that is capable of learning more complex representations.
In the implementation of step 102, the pre-trained large speech model, which has been pre-trained on the old training data set, is loaded. Since only the low-rank adaptation matrices need to be trained on the new training data set, the parameters of the pre-trained large speech model are frozen. A first low-rank adaptation matrix and a second low-rank adaptation matrix are then introduced, and the pre-trained large speech model is combined with the two matrices to obtain the speech detection model to be trained, as illustrated in the sketch below.
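As an illustration of step 102, the following minimal PyTorch sketch shows how a frozen pre-trained weight can be combined with a first low-rank adaptation matrix A and a second low-rank adaptation matrix B; the class name, rank, and initialization choices are assumptions for illustration, not the patent's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer augmented with two trainable
    low-rank adaptation matrices, so that h = W_SOM x + B A x."""
    def __init__(self, pretrained: nn.Linear, rank: int = 8):
        super().__init__()
        self.pretrained = pretrained
        for p in self.pretrained.parameters():
            p.requires_grad = False                             # freeze W_SOM
        d_in, d_out = pretrained.in_features, pretrained.out_features
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)   # first low-rank matrix
        self.B = nn.Parameter(torch.zeros(d_out, rank))         # second low-rank matrix

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus low-rank path; B starts at zero, so the combined
        # model initially behaves exactly like the pre-trained model.
        return self.pretrained(x) + x @ self.A.T @ self.B.T
```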
The pre-trained large speech model (SOM) refers to a large speech model that has been pre-trained on an old training data set, so that it can already recognize the generated-speech types of the known generation algorithms in that data set. Specifically, its training process first acquires an old training data set containing speech generated by several generation algorithms, and then pre-trains the large speech model on it; through this pre-training, the resulting model can identify the generation algorithm of each speech sample in the old training data set. For example, when the old training data set contains speech generated by generation algorithm 1, generation algorithm 2 and generation algorithm 3, the pre-trained large speech model obtained by pre-training on it can identify generation algorithms 1, 2 and 3.
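For completeness, a hedged sketch of this pre-training stage follows; the data loader, the label convention (generation-algorithm index or bona fide), and the hyperparameters are placeholders, not taken from the patent.

```python
import torch
import torch.nn as nn

def pretrain(model: nn.Module, old_loader, epochs: int = 10, lr: float = 1e-4) -> nn.Module:
    """Supervised pre-training of the large speech model (W_SOM) on the
    old training data set, so it learns the known generation algorithms."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for speech, label in old_loader:   # label: known algorithm / bona fide
            optimizer.zero_grad()
            loss = criterion(model(speech), label)
            loss.backward()
            optimizer.step()
    return model
```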
When step 103 is implemented, the new training data set is input into the speech detection model to be trained obtained in step 102, and only the parameters of the first low-rank adaptation matrix and the second low-rank adaptation matrix are optimized during training, so as to learn the new generation algorithms of the speech in the new training data set and to optimize the model's detection performance on them. Optimizing these parameters orthogonally means that, during training of the speech detection model to be trained, the training of the first and second low-rank adaptation matrices on each data set is kept orthogonal and independent, so that after orthogonal optimization the trained speech detection model does not forget knowledge learned from previously learned training data sets, which include, but are not limited to, the old training data set learned by the pre-trained large speech model.
In an optional embodiment of the present invention, orthogonally optimizing the parameters of the first and second low-rank adaptation matrices specifically includes: during training of the speech detection model to be trained, first dividing the new training data set into multiple batches of sub-data sets, where a batch refers to a training batch; the weight-update direction for the sub-data set of the i-th batch is orthogonal to the weight-update direction for the sub-data set of the (i-1)-th batch, so that each sub-data set's weight update does not affect the weight updates of the sub-data sets of the other batches. Since the weight-update directions of the batches are mutually orthogonal, the weight-update directions for the new and old training data sets are also orthogonal.
Specifically, in the formula for the orthogonal optimization, i denotes the batch to which the input sub-data set belongs when training the speech detection model, j denotes that the training data set containing that sub-data set is the j-th training data set (i and j are indices manually assigned to the sub-data sets), x denotes speech in the new training data set, α denotes a preset constant, T denotes the transpose, and x̄ denotes the average of the input speech in the new training data set.
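The rendered formula itself is not reproduced in this text. One standard update that matches the notation listed above (batch index i, mean input x̄, preset constant α, transpose T) is the recursive orthogonal projector of orthogonal weight modification (OWM); treating it as the patent's exact rule is an assumption, and the sketch below is illustrative only.

```python
import torch

def update_projector(P: torch.Tensor, x_bar: torch.Tensor, alpha: float = 1e-3) -> torch.Tensor:
    """Assumed OWM-style recursive update of the projector P after one batch.
    P: (d, d) projection matrix, initialized to the identity;
    x_bar: (d, 1) mean of the batch's input speech features."""
    Px = P @ x_bar                                    # (d, 1)
    return P - (Px @ Px.T) / (alpha + x_bar.T @ Px)   # rank-1 shrink of P

def project_gradient(P: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Project a weight gradient so the i-th batch's update direction is
    orthogonal to the input directions of batches 1, ..., i-1."""
    return grad @ P
```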
When the speech detection model to be trained is trained, the new training data set is input simultaneously into the frozen pre-trained large speech model and into the first and second low-rank adaptation matrices, the parameters of the two matrices are optimized orthogonally, and training ends when the detection results output by the model match the speech types in the new training data set (a hedged sketch of such a loop follows below). The parameters of the pre-trained large speech model remain frozen throughout, i.e., they are not modified during training on the new training data set, so its detection accuracy on speech generated by already-learned generation algorithms does not decrease. Meanwhile, training on the new training data set updates only the parameters of the low-rank adaptation matrices (the first and the second), which greatly reduces the computing resources and training time required to train the speech detection model and thus lowers the training cost.
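Putting the pieces together, a hedged sketch of the fine-tuning loop; it assumes the LoRALinear layer and projector helpers sketched above, and the choice of loss, learning rate, and of projecting only the gradient of A are assumptions for illustration.

```python
import torch

def finetune(model, new_loader, feat_dim: int, lr: float = 1e-3, alpha: float = 1e-3):
    """Train only the low-rank matrices A and B on the new training data set,
    orthogonalizing each batch's update against earlier batches."""
    P = torch.eye(feat_dim)                                 # projector, starts as identity
    optimizer = torch.optim.SGD([model.A, model.B], lr=lr)  # W_SOM stays frozen
    criterion = torch.nn.CrossEntropyLoss()
    for speech, label in new_loader:                        # batches i = 1, 2, ...
        optimizer.zero_grad()
        loss = criterion(model(speech), label)
        loss.backward()
        model.A.grad = project_gradient(P, model.A.grad)    # orthogonalized update
        optimizer.step()
        x_bar = speech.mean(dim=0, keepdim=True).T          # mean input of this batch
        P = update_projector(P, x_bar, alpha)
    return model
```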
The resulting speech detection model has high detection accuracy on the speech in the old training data set as well as on the speech in the new training data set. For example, when the old training data set contains speech generated by generation algorithms 1, 2 and 3, and the new training data set contains speech generated by generation algorithms 4, 5 and 6, the pre-trained large speech model accurately detects generation algorithms 1, 2 and 3, while the introduced low-rank adaptation matrices (the first and the second), after training, accurately detect generation algorithms 4, 5 and 6. The speech detection model can therefore accurately detect the generation algorithms in the old training data set (algorithms 1, 2 and 3) and those in the new training data set (algorithms 4, 5 and 6). During training, the additionally introduced low-rank adaptation matrices are trained on the new training data set instead of adjusting the parameters of the whole speech detection model, which markedly reduces resource consumption and training time.
In an optional embodiment of the present invention, after the speech detection model is obtained, the training method further includes using the speech detection model to detect unknown speech. Specifically, speech to be detected is first acquired; it may be any speech arising in a variety of situations, for example speech input by a user, speech output by a machine during interaction, or speech collected during the collection and verification procedures of a voice verification process.
The speech to be detected is then input into the speech detection model, which outputs a corresponding detection result. The detection result covers the following cases. When the speech to be detected was generated by an algorithm that the pre-trained large speech model has learned, the output of the pre-trained large speech model for the speech to be detected is taken as the detection result; the speech to be detected is then generated speech that the pre-trained large speech model has learned, i.e., it is fake. When the speech to be detected was generated by an algorithm that the pre-trained large speech model has not learned but the first and second low-rank adaptation matrices have, the sum of the outputs of the pre-trained large speech model, the first low-rank adaptation matrix and the second low-rank adaptation matrix is taken as the detection result; the speech to be detected is again fake. When the speech detection model cannot recognize any generation algorithm for the speech to be detected, that is, the speech was not produced by a generation algorithm, the speech to be detected is genuine.
Specifically, the formula of the detection result is as follows:
h_model = W_SOM · x + B · A · x
wherein h_model is the detection result output by the speech detection model, x is the input speech to be detected, W_SOM is the weight of the pre-trained large speech model, A is the first low-rank adaptation matrix, and B is the second low-rank adaptation matrix.
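A short sketch of inference with the combined model follows, in which the detection output is the sum h_model = W_SOM · x + B · A · x; the two-class softmax and the 0.5 threshold are assumptions for illustration.

```python
import torch

def detect(model, x: torch.Tensor) -> str:
    """x: one speech feature vector of shape (1, d); model: a detector of the
    LoRALinear style sketched above, producing two logits (genuine, generated)."""
    model.eval()
    with torch.no_grad():
        h = model(x)                                    # W_SOM x + B A x
        prob_generated = torch.softmax(h, dim=-1)[0, 1]
    return "generated (fake)" if prob_generated > 0.5 else "bona fide (genuine)"
```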
The embodiment of the invention thus provides a training method for a speech detection model with orthogonalized low-rank adaptation matrices: a new training data set is acquired, comprising a plurality of speech samples generated by generation algorithms unknown to a pre-trained large speech model; the pre-trained large speech model is loaded, its parameters are frozen, and a first and a second low-rank adaptation matrix are introduced to obtain a speech detection model to be trained; the new training data set is input into the model to be trained, and training is completed by orthogonally optimizing the parameters of the two low-rank adaptation matrices, yielding the speech detection model, where the orthogonal optimization keeps the training of the two matrices on each data set independent so that knowledge learned from previously learned training data sets is not forgotten. For a new data set obtained in practice, training the speech detection model with this method, which introduces low-rank adaptation matrices and fine-tunes the model, significantly reduces training cost, greatly improves detection of generated audio from the new data set, and leaves detection of previously learned generation algorithms essentially unaffected.
In a second aspect of the present invention, referring to fig. 2, fig. 2 is a block diagram of a training device for a speech detection model with orthogonalized low-rank adaptation matrices, where the training device includes:
a new training data set acquisition module 201, configured to acquire a new training data set, wherein the new training data set comprises a plurality of speech samples generated by generation algorithms unknown to a pre-trained large speech model;
a to-be-trained speech detection model acquisition module 202, configured to load the pre-trained large speech model, freeze its parameters, and introduce a first low-rank adaptation matrix and a second low-rank adaptation matrix to obtain a speech detection model to be trained;
and a speech detection model acquisition module 203, configured to input the new training data set into the speech detection model to be trained and complete training by orthogonally optimizing the parameters of the first and second low-rank adaptation matrices to obtain the speech detection model, wherein the orthogonal optimization means that, during training of the speech detection model to be trained, the training of the first and second low-rank adaptation matrices on each data set is independent, so that knowledge learned from previously learned training data sets is not forgotten.
The to-be-trained speech detection model acquisition module further includes a first training submodule, where the first training submodule includes:
an acquisition unit, configured to acquire an old training data set;
and a pre-training unit, configured to pre-train the large speech model on the old training data set to obtain the pre-trained large speech model, which can identify the generation algorithms of the speech in the old training data set.
The speech detection model acquisition module further includes:
an orthogonal optimization submodule, configured to divide the new training data set into a plurality of batches of sub-data sets for training during training of the speech detection model to be trained, wherein the weight-update direction for the sub-data set of the i-th batch is orthogonal to the weight-update direction for the sub-data set of the (i-1)-th batch, so that the weight update of each sub-data set does not affect the weight updates of the sub-data sets of the other batches.
In the orthogonal optimization formula in the orthogonal optimization submodule, i denotes the batch to which the input sub-data set belongs when training the speech detection model, j denotes that the training data set containing that sub-data set is the j-th training data set, x denotes speech in the new training data set, α denotes a preset constant, T denotes the transpose, and x̄ denotes the average of the input speech in the new training data set.
After the speech detection model is obtained, the training device further includes:
a to-be-detected speech acquisition module, configured to acquire speech to be detected;
and a detection result acquisition module, configured to input the speech to be detected into the speech detection model and output a detection result, wherein, when the speech to be detected was generated by an algorithm that the pre-trained large speech model has learned, the output of the pre-trained large speech model for the speech to be detected is taken as the detection result; and, when the speech to be detected was generated by an algorithm that the pre-trained large speech model has not learned, the sum of the outputs of the pre-trained large speech model, the first low-rank adaptation matrix and the second low-rank adaptation matrix is taken as the detection result.
The formula of the detection result in the detection result acquisition module is as follows:
h_model = W_SOM · x + B · A · x
wherein h_model is the detection result output by the speech detection model, x is the input speech to be detected, W_SOM is the weight of the pre-trained large speech model, A is the first low-rank adaptation matrix, and B is the second low-rank adaptation matrix.
Based on the same inventive concept, an embodiment of the present invention discloses an electronic device. Fig. 3 shows a schematic diagram of this electronic device. As shown in fig. 3, the electronic device 100 includes a memory 110 and a processor 120, where the memory is at least 12 GB and the processor's clock frequency is at least 2.4 GHz; the memory 110 is connected to the processor 120 via a bus, and a computer program stored in the memory 110 can run on the processor 120 to implement the training method for a speech detection model with orthogonalized low-rank adaptation matrices disclosed in the embodiments of the present invention.
Based on the same inventive concept, an embodiment of the present invention discloses a computer-readable storage medium on which a computer program or instructions are stored; when executed by a processor, the program implements the training method for a speech detection model with orthogonalized low-rank adaptation matrices disclosed in the embodiments of the present invention.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the others, and for the identical or similar parts the embodiments may be referred to one another.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, electronic devices, and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that comprises the element.
The training method and device for a speech detection model with orthogonalized low-rank adaptation matrices provided by the invention have been described in detail above. Specific examples are used herein to illustrate the principle and implementation of the invention, and the description of the embodiments is only intended to help readers understand the method and its core idea. Since a person skilled in the art may vary the specific embodiments and the scope of application in accordance with the idea of the invention, the content of this description should not be construed as limiting the invention.

Claims (10)

1. A training method for a speech detection model with orthogonalized low-rank adaptation matrices, the training method comprising:
acquiring a new training data set, wherein the new training data set comprises a plurality of speech samples generated by generation algorithms unknown to a pre-trained large speech model;
loading the pre-trained large speech model, freezing its parameters, and introducing a first low-rank adaptation matrix and a second low-rank adaptation matrix to obtain a speech detection model to be trained;
inputting the new training data set into the speech detection model to be trained, and completing training by orthogonally optimizing the parameters of the first low-rank adaptation matrix and the second low-rank adaptation matrix to obtain the speech detection model, wherein the orthogonal optimization means that, during training of the speech detection model to be trained, the training of the first and second low-rank adaptation matrices on each data set is independent, so that knowledge learned from previously learned training data sets is not forgotten.
2. The training method for a speech detection model with orthogonalized low-rank adaptation matrices according to claim 1, wherein the training process of the pre-trained large speech model is as follows:
acquiring an old training data set;
and pre-training the large speech model on the old training data set to obtain the pre-trained large speech model, which can identify the generation algorithms of the speech in the old training data set.
3. The training method for a speech detection model with orthogonalized low-rank adaptation matrices according to claim 2, wherein orthogonally optimizing the parameters of the first low-rank adaptation matrix and the second low-rank adaptation matrix comprises:
during training of the speech detection model to be trained, dividing the new training data set into a plurality of batches of sub-data sets for training, wherein the weight-update direction for the sub-data set of the i-th batch is orthogonal to the weight-update direction for the sub-data set of the (i-1)-th batch, so that the weight update of each sub-data set does not affect the weight updates of the sub-data sets of the other batches.
4. The training method according to claim 3, wherein, in the formula for the orthogonal optimization, i denotes the batch to which the input sub-data set belongs when training the speech detection model, j denotes that the training data set containing that sub-data set is the j-th training data set, x denotes speech in the new training data set, α denotes a preset constant, T denotes the transpose, and x̄ denotes the average of the input speech in the new training data set.
5. The training method for a speech detection model with orthogonalized low-rank adaptation matrices according to claim 1, wherein, after the speech detection model is obtained, the training method further comprises:
acquiring speech to be detected;
inputting the speech to be detected into the speech detection model and outputting a detection result, wherein, when the speech to be detected was generated by an algorithm that the pre-trained large speech model has learned, the output of the pre-trained large speech model for the speech to be detected is taken as the detection result;
and, when the speech to be detected was generated by an algorithm that the pre-trained large speech model has not learned, the sum of the outputs of the pre-trained large speech model, the first low-rank adaptation matrix and the second low-rank adaptation matrix is taken as the detection result.
6. The training method for a speech detection model with orthogonalized low-rank adaptation matrices according to claim 5, wherein the formula of the detection result is:
h_model = W_SOM · x + B · A · x
wherein h_model is the detection result output by the speech detection model, x is the input speech to be detected, W_SOM is the weight of the pre-trained large speech model, A is the first low-rank adaptation matrix, and B is the second low-rank adaptation matrix.
7. A training device for a speech detection model with orthogonalized low-rank adaptation matrices, the training device comprising:
a new training data set acquisition module, configured to acquire a new training data set, wherein the new training data set comprises a plurality of speech samples generated by generation algorithms unknown to a pre-trained large speech model;
a to-be-trained speech detection model acquisition module, configured to load the pre-trained large speech model, freeze its parameters, and introduce a first low-rank adaptation matrix and a second low-rank adaptation matrix to obtain a speech detection model to be trained;
and a speech detection model acquisition module, configured to input the new training data set into the speech detection model to be trained and complete training by orthogonally optimizing the parameters of the first and second low-rank adaptation matrices to obtain the speech detection model, wherein the orthogonal optimization means that, during training of the speech detection model to be trained, the training of the first and second low-rank adaptation matrices on each data set is independent, so that knowledge learned from previously learned training data sets is not forgotten.
8. The training device for a speech detection model with orthogonalized low-rank adaptation matrices according to claim 7, wherein the to-be-trained speech detection model acquisition module further comprises a first training submodule, the first training submodule comprising:
an acquisition unit, configured to acquire an old training data set;
and a pre-training unit, configured to pre-train the large speech model on the old training data set to obtain the pre-trained large speech model, which can identify the generation algorithms of the speech in the old training data set.
9. An electronic device, comprising:
a memory for storing one or more programs;
and a processor,
wherein, when the one or more programs are executed by the processor, the processor implements the training method for a speech detection model with orthogonalized low-rank adaptation matrices according to any one of claims 1-6.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the training method for a speech detection model with orthogonalized low-rank adaptation matrices according to any one of claims 1-6.
CN202410063975.2A (priority and filing date: 2024-01-17): Training method and device for orthogonalization low-rank adaptive matrix voice detection model. Active; granted as CN117577117B (en).

Priority Applications (1)

CN202410063975.2A (priority and filing date: 2024-01-17): Training method and device for orthogonalization low-rank adaptive matrix voice detection model

Applications Claiming Priority (1)

CN202410063975.2A (priority and filing date: 2024-01-17): Training method and device for orthogonalization low-rank adaptive matrix voice detection model

Publications (2)

CN117577117A (en): published 2024-02-20
CN117577117B (en): published 2024-03-19

Family

ID=89862941

Family Applications (1)

CN202410063975.2A (filing date: 2024-01-17): Training method and device for orthogonalization low-rank adaptive matrix voice detection model (Active)

Country Status (1)

Country Link
CN (1) CN117577117B (en)

Citations (5)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140019388A1 (en) * 2012-07-13 2014-01-16 International Business Machines Corporation System and method for low-rank matrix factorization for deep belief network training with high-dimensional output targets
US20190122108A1 (en) * 2017-10-24 2019-04-25 Baidu Usa Llc Systems and methods for trace norm regularization and faster inference for embedded models
CN113076215A (en) * 2021-04-08 2021-07-06 华南理工大学 Unsupervised anomaly detection method independent of data types
WO2022245502A1 (en) * 2021-05-19 2022-11-24 Microsoft Technology Licensing, Llc Low-rank adaptation of neural network models
CN117059103A (en) * 2023-10-12 2023-11-14 慧言科技(天津)有限公司 Acceleration method of voice recognition fine tuning task based on low-rank matrix approximation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
路成; 田猛; 周健; 王华彬; 陶亮: "Single-channel speech enhancement method using convolutive non-negative matrix factorization with an L1/2 sparsity constraint", Acta Acustica (声学学报), No. 03, 15 May 2017 *

Also Published As

CN117577117B (en): published 2024-03-19

Similar Documents

Publication Publication Date Title
CN110288978B (en) Speech recognition model training method and device
US20180158449A1 (en) Method and device for waking up via speech based on artificial intelligence
CN102737633B (en) Method and device for recognizing speaker based on tensor subspace analysis
CN110223680A (en) Method of speech processing, recognition methods and its device, system, electronic equipment
CN108197669B (en) Feature training method and device of convolutional neural network
Miao et al. Underwater acoustic signal classification based on sparse time–frequency representation and deep learning
CN110929836B (en) Neural network training and image processing method and device, electronic equipment and medium
CN115146670A (en) Radio frequency fingerprint identification method and system based on data enhancement and comparison learning
CN112509600A (en) Model training method and device, voice conversion method and device and storage medium
CN109150538A (en) A kind of fingerprint merges identity identifying method with vocal print
CN117577117B (en) Training method and device for orthogonalization low-rank adaptive matrix voice detection model
CN113762503A (en) Data processing method, device, equipment and computer readable storage medium
WO2023093029A1 (en) Wake-up word energy calculation method and system, and voice wake-up system and storage medium
CN112738724B (en) Method, device, equipment and medium for accurately identifying regional target crowd
CN113435263A (en) CGAN data enhancement-based spectrum sensing method and system
CN113450800A (en) Method and device for determining activation probability of awakening words and intelligent voice product
CN112528068A (en) Voiceprint feature storage method, voiceprint feature matching method and device and electronic equipment
CN108206024B (en) Voice data processing method based on variational Gaussian regression process
CN112669836A (en) Command recognition method and device and computer readable storage medium
CN112115509A (en) Data generation method and device
Huang et al. Sampling adaptive learning algorithm for mobile blind source separation
CN113822445B (en) Model integrated prediction method, system, electronic equipment and storage medium
CN114677519A (en) Feature extraction network training and image recognition method and device
CN113299302A (en) Audio noise reduction method and device and electronic equipment
CN117726478A (en) Intelligent decision-making method for dispatching of power system unit, terminal equipment and storage medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant