CN114464195A - Voiceprint recognition model training method and device for self-supervised learning, and readable medium - Google Patents


Info

Publication number
CN114464195A
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111591886.8A
Other languages
Chinese (zh)
Inventor
张广学
肖龙源
李稀敏
叶志坚
Current Assignee
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202111591886.8A
Publication of CN114464195A


Classifications

    • G: Physics
    • G10: Musical instruments; acoustics
    • G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/04: Training, enrolment or model building
    • G10L 17/06: Decision making techniques; pattern matching strategies
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, characterised by the type of extracted parameters


Abstract

The invention discloses a voiceprint recognition model training method, a training device, and a readable medium for self-supervised learning. The optimized training method classifies the training data in three rounds and selects training data appropriate to the current state of the voiceprint recognition model, so that the model acquires different capabilities at different stages: the first round of training and classification lets the model accurately distinguish data with large voiceprint-feature differences; the second round gives the model features that can distinguish speakers with similar voiceprints; and the third round converges the model and keeps a fixed classification-feature network, which benefits the convergence and stability of the voiceprint recognition model.

Description

Voiceprint recognition model training method and device for self-supervised learning, and readable medium
Technical Field
The invention relates to the field of voiceprint recognition, and in particular to a voiceprint recognition model training method and device for self-supervised learning, and a readable medium.
Background
As voiceprint recognition is widely applied in public security, financial anti-fraud, criminal investigation, and other directions, application scenarios keep multiplying and the requirements on the related technical indicators keep rising. Among existing models, supervised models currently perform best, yet they still leave considerable room for improvement. Moreover, real-scene data often carry no labels or wrong labels; using such data for training and optimization requires exploring the advantages of self-supervised models.
Most existing self-supervised approaches first pre-train a supervised model, assign pseudo labels by clustering, and then run pseudo-label self-supervised training. For large-scale data, however, the recognition and discrimination ability of the supervised model is limited, so the self-supervised model can at best approach the supervised model's performance. In addition, the clustering operates on the extracted voiceprint features and therefore depends on the performance of that model.
Disclosure of Invention
Data collected in the practical application scenarios above suffer from missing or erroneous labels, and self-supervised models depend on the performance of a supervised model. Embodiments of the present application therefore aim to provide a voiceprint recognition model training method, device, and readable medium for self-supervised learning that solve the technical problems mentioned in the background section above.
In a first aspect, an embodiment of the present application provides a voiceprint recognition model training method for self-supervised learning, including the following steps:
S1, training the voiceprint recognition model with first training data for at least one generation (epoch) to obtain a trained voiceprint recognition model, computing the similarity of the first training data with the trained model to obtain a first similarity result, and dividing the first training data into second and third training data based on the first similarity result;
S2, training the trained voiceprint recognition model with the second training data for at least one generation to obtain an optimized voiceprint recognition model, computing the similarity of the first training data with the optimized model to obtain a second similarity result, and dividing the first training data into fourth and fifth training data based on the second similarity result;
S3, extracting a certain amount of data from the fifth training data and combining it with the fourth training data to form sixth training data, training the optimized voiceprint recognition model with the sixth training data for at least one generation, computing the similarity of the first training data with the resulting model to obtain a third similarity result, and dividing the first training data into seventh and eighth training data based on the third similarity result;
S4, repeating steps S1-S3 with the seventh training data taken as the first training data, until the voiceprint recognition model achieves the expected effect or a training-end condition is met.
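The iterative schedule of steps S1-S4 can be sketched as follows. This is a minimal illustration, not the patent's implementation: `train_epochs`, `score_similarity`, and `split_by_score` are hypothetical placeholders for the real training, scoring, and interval-based splitting.

```python
import random

def train_epochs(model, data, epochs):
    # Placeholder: a real implementation would run forward/backward
    # passes of the voiceprint network for `epochs` epochs.
    return model

def score_similarity(model, data):
    # Placeholder: a real implementation would score each utterance
    # with the model (e.g. cosine similarity of embeddings).
    return {x: random.random() for x in data}

def split_by_score(scores, keep_ratio=0.7):
    # Sort by similarity score and keep the top fraction as the
    # "well-separated" partition; the rest is the "hard" partition.
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = int(len(ranked) * keep_ratio)
    return ranked[:k], ranked[k:]

def self_supervised_round(model, first_data):
    # S1: train on all data, then make the first split.
    model = train_epochs(model, first_data, epochs=2)
    second, third = split_by_score(score_similarity(model, first_data))
    # S2: train only on the well-separated (second) data.
    model = train_epochs(model, second, epochs=2)
    fourth, fifth = split_by_score(score_similarity(model, first_data))
    # S3: mix one third of the hard (fifth) data back in as a perturbation.
    sixth = fourth + fifth[: len(fifth) // 3]
    model = train_epochs(model, sixth, epochs=2)
    seventh, eighth = split_by_score(score_similarity(model, first_data))
    # S4: the seventh set seeds the next round of S1-S3.
    return model, seventh
```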
In some embodiments, the training process of the voiceprint recognition model comprises:
performing VAD processing on the voice data in the training data to extract the effective speech segments, wherein the training data is one of the first, second, third, fourth, fifth, sixth, seventh, and eighth training data;
extracting 2 or more (n) fixed-length segments from the effective speech, extracting speech features from those segments, inputting the features into the voiceprint recognition model, computing a triplet ranking loss with each piece of voice data serving as its own label, and outputting a voiceprint recognition result;
and updating the network parameters through back propagation.
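The segment-extraction step above might look like the sketch below, assuming `frames` is the voiced frame sequence returned by VAD; spreading the windows evenly across the utterance is an assumption, since the patent does not specify how the fixed-length chunks are positioned.

```python
def extract_fixed_segments(frames, seg_len=200, n=2):
    # Cut n fixed-length windows out of a voiced frame sequence.
    # Windows are spread evenly; utterances shorter than seg_len
    # are skipped (returning an empty list).
    total = len(frames)
    if total < seg_len:
        return []
    step = (total - seg_len) // max(1, n - 1) if n > 1 else 0
    starts = [min(i * step, total - seg_len) for i in range(n)]
    return [frames[s:s + seg_len] for s in starts]
```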
In some embodiments, dividing the first training data into the second and third training data based on the first similarity result in step S1 specifically comprises: sorting by the similarity scores in the first similarity result, selecting from the first training data the samples whose scores fall within a specified interval as the second training data, and taking the remainder as the third training data;
in step S2, dividing the first training data into the fourth and fifth training data based on the second similarity result specifically comprises: sorting by the similarity scores in the second similarity result, selecting from the first training data the samples whose scores fall within a specified interval as the fourth training data, and taking the remainder as the fifth training data;
in step S3, dividing the first training data into the seventh and eighth training data based on the third similarity result specifically comprises: sorting by the similarity scores in the third similarity result, selecting from the first training data the samples whose scores fall within a specified interval as the seventh training data, and taking the remainder as the eighth training data.
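The interval-based split described above can be sketched as follows; the function name and the inclusive treatment of the interval bounds are illustrative assumptions.

```python
def partition_by_interval(scores, low, high):
    # Keep the items whose similarity score falls inside [low, high];
    # the remainder forms the second ("hard") partition. Both lists
    # come out sorted by descending score.
    kept, rest = [], []
    for item, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        (kept if low <= s <= high else rest).append(item)
    return kept, rest
```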
In some embodiments, each training generation of the voiceprint recognition model comprises at least one training batch, each batch consisting of batch-size voice utterances; each utterance yields 2 (or n) speech features, so the number of labels is batch size x (2 or n). The model is trained for 1-3 generations in step S1 and for 2 generations in each of steps S2 and S3.
In some embodiments, the sixth training data in step S3 includes one third of the fifth training data.
In some embodiments, the voiceprint recognition model comprises an ECAPA-TDNN network.
In some embodiments, the similarity comprises a cosine similarity.
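For reference, the cosine similarity between two voiceprint embeddings can be computed as below (a plain-Python sketch; a real system would operate on the model's embedding vectors, typically with a vectorized library):

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|); 1.0 means identical direction,
    # 0.0 means orthogonal embeddings.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```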
In a second aspect, an embodiment of the present application provides a voiceprint recognition model training apparatus for self-supervised learning, including:
the first training module is configured to train the voiceprint recognition model with first training data for at least one generation to obtain a trained voiceprint recognition model, compute the similarity of the first training data with the trained model to obtain a first similarity result, and divide the first training data into second and third training data based on that result;
the second training module is configured to train the trained model with the second training data for at least one generation to obtain an optimized voiceprint recognition model, compute the similarity of the first training data with the optimized model to obtain a second similarity result, and divide the first training data into fourth and fifth training data based on that result;
the third training module is configured to extract a certain amount of data from the fifth training data and combine it with the fourth training data to form sixth training data, train the optimized model with the sixth training data for at least one generation, compute the similarity of the first training data with the resulting model to obtain a third similarity result, and divide the first training data into seventh and eighth training data based on that result;
and the repeated-training module is configured to repeatedly execute the first training module through the third training module, training the voiceprint recognition model with the seventh training data taken as the first training data, until the model achieves the expected effect or a training-end condition is met.
In a third aspect, embodiments of the present application provide an electronic device comprising one or more processors; storage means for storing one or more programs which, when executed by one or more processors, cause the one or more processors to carry out a method as described in any one of the implementations of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the method as described in any of the implementations of the first aspect.
Compared with the prior art, the invention has the following beneficial effects:
(1) Given that data collected in real application scenarios often lack labels or carry wrong labels, the optimized self-supervised learning scheme can effectively exploit real-scene data and continuously optimize the model.
(2) In the early stages, the method avoids training on voice data whose voiceprint features are too similar and instead selects data with larger feature differences, which improves the effectiveness of model training.
(3) Once the model has gradually stabilized, the method uses voice data with similar voiceprint features as a perturbation, or to mine new distinguishing features, which benefits the convergence and stability of the model.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is an exemplary device architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a schematic flow chart of a voiceprint recognition model training method for self-supervised learning according to an embodiment of the present invention;
FIG. 3 is a network structure diagram of a voiceprint recognition model of the voiceprint recognition model training method of the self-supervised learning according to the embodiment of the present invention;
FIG. 4 is a schematic diagram of a training apparatus for a voiceprint recognition model for self-supervised learning according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a computer device suitable for implementing an electronic apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 illustrates an exemplary device architecture 100 to which the voiceprint recognition model training method for self-supervised learning or the voiceprint recognition model training device for self-supervised learning of an embodiment of the present application may be applied.
As shown in fig. 1, the apparatus architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal devices 101, 102, 103 to interact with a server 105 over a network 104 to receive or send messages or the like. Various applications, such as data processing type applications, file processing type applications, etc., may be installed on the terminal apparatuses 101, 102, 103.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 105 may be a server that provides various services, such as a background data processing server that processes files or data uploaded by the terminal devices 101, 102, 103. The background data processing server can process the acquired file or data to generate a processing result.
It should be noted that the voiceprint recognition model training method for the self-supervised learning provided in the embodiment of the present application may be executed by the server 105, or may also be executed by the terminal devices 101, 102, and 103, and accordingly, the voiceprint recognition model training device for the self-supervised learning may be disposed in the server 105, or may also be disposed in the terminal devices 101, 102, and 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In the case where the processed data does not need to be acquired from a remote location, the above device architecture may not include a network, but only a server or a terminal device.
Fig. 2 illustrates a voiceprint recognition model training method for self-supervised learning, provided by an embodiment of the present application, including the following steps:
and S1, training by using the first training data in at least one generation of training process of the voiceprint recognition model to obtain a trained voiceprint recognition model, performing similarity calculation on the first training data through the trained voiceprint recognition model to obtain a first similarity result, and dividing the first training data into second training data and third training data based on the first similarity result.
In a specific embodiment, the voiceprint recognition model is an ECAPA-TDNN network, whose architecture is shown in fig. 3. In alternative embodiments, the voiceprint recognition model may adopt other network architectures, for example
TDNN, E-TDNN, CNN-ECAPA, ResNet, etc. One generation of training is one epoch, i.e. the model is trained once over all data of the training set, with every training sample propagated forward and backward through the voiceprint recognition network. The data of one epoch may be divided into multiple batches, each of the batch size.
In a specific embodiment, the training process of the voiceprint recognition model specifically includes:
performing VAD processing on voice data in the training data to extract an effective voice segment, wherein the training data is one of first training data, second training data, third training data, fourth training data, fifth training data, sixth training data, seventh training data and eighth training data;
extracting 2 or more (n) fixed-length segments from the effective speech, extracting speech features from those segments, inputting the features into the voiceprint recognition model, computing a triplet ranking loss with each piece of voice data serving as its own label, and outputting a voiceprint recognition result;
and updating the network parameters through back propagation.
Specifically, the voice data is VAD-processed in the kaldi manner to extract effective speech segments; 2 (or n) 200-frame chunks are extracted from each effective segment, MFCC features are extracted from each chunk, and those MFCC features are fed into the ECAPA-TDNN network. Triplet Ranking Loss is used for the loss calculation; it involves a positive sample, a negative sample, and an anchor. In this self-supervised setting, one part of an utterance serves as the anchor, another part of the same utterance as the positive sample, and a different utterance as the negative sample. In the early rounds, the requirements on the negative sample are loose; in the later rounds, after sorting, a voice with a larger distance is chosen as the negative sample.
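The triplet ranking loss just described can be sketched as follows, using Euclidean distance between embeddings and a hypothetical margin of 0.2 (the patent does not state a margin value):

```python
import math

def l2_distance(u, v):
    # Euclidean distance between two embedding vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_ranking_loss(anchor, positive, negative, margin=0.2):
    # max(0, d(a, p) - d(a, n) + margin): pull the positive (another
    # segment of the same utterance) closer to the anchor than the
    # negative (a segment from a different utterance) by at least
    # the margin; the loss is zero once that ordering holds.
    return max(0.0,
               l2_distance(anchor, positive)
               - l2_distance(anchor, negative)
               + margin)
```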
In a specific embodiment, each generation of training of the voiceprint recognition model comprises at least one batch, each batch consisting of batch-size voice utterances, each utterance yielding 2 (or n) speech features, so the number of labels is batch size x (2 or n). The network operations performed during training follow fig. 3; the specific computation is not the focus of this application and is not detailed here. In one embodiment, the similarity is cosine similarity; other similarity measures, such as the Euclidean distance, the Pearson correlation coefficient, or the Tanimoto coefficient, may also be used. The embodiments of the present application take cosine similarity as the example. In step S1, 1-3 generations of training are used and all the speech data of the first training data are trained; cosine similarity is then computed for the first training data with the trained voiceprint recognition model to obtain the first similarity result. The result is sorted by similarity, the samples whose scores fall within the specified interval are selected from the first training data as the second training data, and the remainder becomes the third training data. Training for 1-3 generations yields the first split of the training data and a trained model that has learned some basic features. After each epoch the model is assumed to move toward the optimal classification, and the distance between the voiceprint features of different speakers is assumed to grow. Nevertheless, some speakers have similar voiceprint features, and different audio of the same speaker can sound alike, which makes such data hard to distinguish.
It should be noted that the first training data, the second training data, the third training data, the fourth training data, the fifth training data, the sixth training data, the seventh training data, and the eighth training data all use their own labels, for example, the voice feature extracted from the first training data, and the label is the first training data itself.
And S2, training by using second training data in at least one generation of training process of the trained voiceprint recognition model to obtain an optimized voiceprint recognition model, performing similarity calculation on the first training data through the optimized voiceprint recognition model to obtain a second similarity result, and dividing the first training data into fourth training data and fifth training data based on the second similarity result.
In a specific embodiment, the training of the trained voiceprint recognition model follows the training process described above; the second training data produced by the first split is used for 2 generations of training, so that the model learns to accurately distinguish individuals with large voiceprint-feature differences, i.e. the model converges rapidly. The second split is then performed on this basis, with a model that can already classify individuals with larger feature differences. In these early epochs, voices with similar voiceprint features are avoided as much as possible, and data with larger feature differences are selected for model training.
After some epochs, the model gradually stabilizes, and the voices of speakers with similar voiceprint features, or of the same speaker, are then brought in. During this phase the fifth training data serves as a perturbation, or as a source of new distinguishing features; this data-selection strategy for self-supervised training helps the voiceprint recognition model converge and stay stable.
In step S2, 2 generations of training are used and all the speech data of the second training data are trained; cosine similarity is computed for the first training data with the optimized voiceprint recognition model to obtain the second similarity result. Sorted by similarity, the samples whose scores fall within the specified interval are selected from the first training data as the fourth training data, and the remainder becomes the fifth training data.
And S3, extracting a certain amount of data from the fifth training data to combine with the fourth training data to form sixth training data, training by using the sixth training data in at least one generation of training process of the optimized voiceprint recognition model to obtain the optimized voiceprint recognition model, performing similarity calculation on the first training data through the optimized voiceprint recognition model to obtain a third similarity result, and dividing the first training data into seventh training data and eighth training data based on the third similarity result.
In a specific embodiment, the training of the optimized voiceprint recognition model follows the process described above, using the fourth training data plus part of the fifth training data produced by the second split for 2 generations of training. Specifically, the sixth training data in step S3 comprises 1/3 of the fifth training data. Using the fourth training data plus 1/3 of the fifth as the sixth training data lets the model acquire features that distinguish speakers with similar voiceprints, adds some perturbation to the training, and mines new distinguishing features.
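Building the sixth training data can be sketched as below; sampling the 1/3 of the fifth data at random is an assumption, since the patent only states the fraction, not the selection rule.

```python
import random

def build_sixth(fourth, fifth, frac=1 / 3, seed=0):
    # Combine all of the "easy" fourth data with a random third of the
    # "hard" fifth data, forming the perturbation set used in step S3.
    rng = random.Random(seed)
    k = int(len(fifth) * frac)
    return list(fourth) + rng.sample(list(fifth), k)
```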
In a specific embodiment, step S3 uses 2 generations of training over all the speech data of the sixth training data; cosine similarity is computed for the first training data with the resulting voiceprint recognition model to obtain the third similarity result. Sorted by similarity, the samples whose scores fall within the specified interval are selected from the first training data as the seventh training data, and the remainder becomes the eighth training data.
And S4, repeating the steps S1-S3, and training the voiceprint recognition model by taking the seventh training data as the first training data until the voiceprint recognition model achieves the expected effect or the training end condition is met.
Specifically, the seventh training data produced by the third split is used as the first training data to train the voiceprint recognition model further, so that the model converges and keeps a fixed classification-feature network. The above steps are repeated until the network parameters of the model are stable.
This self-supervised voiceprint recognition model training method can effectively exploit real-scene data whose labels are missing or wrong, and benefits the continuous optimization of the model.
With further reference to fig. 4, as an implementation of the method shown in the above figures, the present application provides an embodiment of a voiceprint recognition model training apparatus for self-supervised learning, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be applied to various electronic devices in particular.
The embodiment of the application provides a voiceprint recognition model training device for self-supervision learning, which comprises:
the first training module 1 is configured to train the voiceprint recognition model with first training data for at least one generation to obtain a first voiceprint recognition model, compute the similarity of the first training data with the first model to obtain a first similarity result, and divide the first training data into second and third training data based on that result;
the second training module 2 is configured to train the first voiceprint recognition model with the second training data for at least one generation to obtain a trained voiceprint recognition model, compute the similarity of the first training data with the trained model to obtain a second similarity result, and divide the first training data into fourth and fifth training data based on that result;
the third training module 3 is configured to extract a certain amount of data from the fifth training data and combine it with the fourth training data to form sixth training data, train the trained model with the sixth training data for at least one generation to obtain a mixed-trained voiceprint recognition model, compute the similarity of the first training data with the mixed-trained model to obtain a third similarity result, and divide the first training data into seventh and eighth training data based on that result;
and the repeated training module 4 is configured to repeatedly execute the first training module to the third training module, training the voiceprint recognition model by taking the seventh training data as the first training data, until the voiceprint recognition model achieves the expected effect or a training end condition is met.
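The four modules above implement one round of an iterative self-training loop. The following Python sketch is illustrative only: the half/half split ratio, the one-third mix-in fraction (suggested by claim 5), and the `train_generations` and `score_similarity` callables are assumptions standing in for the actual voiceprint network training and similarity scoring.

```python
# Hypothetical sketch of the loop driven by modules 1-4; not the patent's
# implementation. `train_generations(model, data)` trains for a few
# generations and returns the updated model; `score_similarity(model, data)`
# returns one similarity score per sample.

def split_by_similarity(data, scores, keep_ratio=0.5):
    """Sort samples by descending similarity score and keep the top fraction."""
    order = sorted(range(len(data)), key=lambda i: scores[i], reverse=True)
    cut = int(len(data) * keep_ratio)
    kept = [data[i] for i in order[:cut]]
    rest = [data[i] for i in order[cut:]]
    return kept, rest

def self_training_rounds(model, first_data, train_generations,
                         score_similarity, max_rounds=5):
    for _ in range(max_rounds):
        # Module 1: train on all current data, then score and split it
        model = train_generations(model, first_data)
        second, third = split_by_similarity(
            first_data, score_similarity(model, first_data))
        # Module 2: train on the high-similarity subset, re-score, re-split
        model = train_generations(model, second)
        fourth, fifth = split_by_similarity(
            first_data, score_similarity(model, first_data))
        # Module 3: mix part of the low-similarity data back in and retrain
        sixth = fourth + fifth[:len(fifth) // 3]
        model = train_generations(model, sixth)
        seventh, eighth = split_by_similarity(
            first_data, score_similarity(model, first_data))
        # Module 4: the retained subset becomes the next round's first data
        first_data = seventh
    return model
```

In this reading, each round shrinks the training pool toward samples the model scores consistently, which is how pseudo-label noise would be filtered out over successive rounds.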
Referring now to fig. 5, a schematic diagram of a computer apparatus 500 suitable for implementing an electronic device (e.g., the server or the terminal device shown in fig. 1) according to an embodiment of the present application is shown. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 5, the computer apparatus 500 includes a Central Processing Unit (CPU) 501 and a Graphics Processing Unit (GPU) 502, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 503 or a program loaded from a storage section 509 into a Random Access Memory (RAM) 504. The RAM 504 also stores various programs and data necessary for the operation of the apparatus 500. The CPU 501, GPU 502, ROM 503, and RAM 504 are connected to each other via a bus 505. An input/output (I/O) interface 506 is also connected to the bus 505.
The following components are connected to the I/O interface 506: an input section 507 including a keyboard, a mouse, and the like; an output section 508 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage section 509 including a hard disk and the like; and a communication section 510 including a network interface card such as a LAN card or a modem. The communication section 510 performs communication processing via a network such as the Internet. A drive 511 is also connected to the I/O interface 506 as necessary. A removable medium 512, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 511 as necessary, so that a computer program read from it is installed into the storage section 509 as needed.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication section 510, and/or installed from the removable medium 512. When executed by the Central Processing Unit (CPU) 501 and the Graphics Processing Unit (GPU) 502, the computer program performs the above-described functions defined in the method of the present application.
It should be noted that the computer readable medium of the present application can be a computer readable signal medium, a computer readable storage medium, or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based devices that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present application may be implemented by software or hardware. The modules described may also be provided in a processor.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: train with first training data in at least one generation of training process of the voiceprint recognition model to obtain a trained voiceprint recognition model, perform similarity calculation on the first training data through the trained voiceprint recognition model to obtain a first similarity result, and divide the first training data into second training data and third training data based on the first similarity result; train with the second training data in at least one generation of training process of the trained voiceprint recognition model to obtain an optimized voiceprint recognition model, perform similarity calculation on the first training data through the optimized voiceprint recognition model to obtain a second similarity result, and divide the first training data into fourth training data and fifth training data based on the second similarity result; extract a certain amount of data from the fifth training data and combine it with the fourth training data to form sixth training data, train with the sixth training data in at least one generation of training process of the optimized voiceprint recognition model to obtain a further optimized voiceprint recognition model, perform similarity calculation on the first training data through the further optimized voiceprint recognition model to obtain a third similarity result, and divide the first training data into seventh training data and eighth training data based on the third similarity result; and repeat the above steps, taking the seventh training data as the first training data to train the voiceprint recognition model, until the voiceprint recognition model achieves the expected effect or a training end condition is met.
The above description is only a preferred embodiment of the present application and an illustration of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention disclosed herein is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. A voiceprint recognition model training method for self-supervised learning, characterized by comprising the following steps:
s1, training by using first training data in at least one generation of training process of the voiceprint recognition model to obtain a trained voiceprint recognition model, performing similarity calculation on the first training data through the trained voiceprint recognition model to obtain a first similarity result, and dividing the first training data into second training data and third training data based on the first similarity result;
s2, training by using the second training data in at least one generation of training process of the trained voiceprint recognition model to obtain an optimized voiceprint recognition model, performing similarity calculation on the first training data through the optimized voiceprint recognition model to obtain a second similarity result, and dividing the first training data into fourth training data and fifth training data based on the second similarity result;
s3, extracting a certain amount of data from the fifth training data to combine with the fourth training data to form sixth training data, training with the sixth training data in at least one generation of training process of the optimized voiceprint recognition model to obtain a further optimized voiceprint recognition model, performing similarity calculation on the first training data through the further optimized voiceprint recognition model to obtain a third similarity result, and dividing the first training data into seventh training data and eighth training data based on the third similarity result;
s4, repeating the steps S1-S3, and training the voiceprint recognition model by taking the seventh training data as the first training data until the voiceprint recognition model achieves the expected effect or the training end condition is met.
2. The training method of the voiceprint recognition model for self-supervised learning according to claim 1, wherein the training process of the voiceprint recognition model comprises the following steps:
performing VAD processing on voice data in training data to extract an effective voice segment, wherein the training data is one of first training data, second training data, third training data, fourth training data, fifth training data, sixth training data, seventh training data and eighth training data;
extracting at least 2 (or n) fixed-length segments from the effective voice segment, extracting voice features from the fixed-length segments, inputting the voice features into the voiceprint recognition model, performing triplet ranking loss calculation with each piece of voice data as a label, and outputting a voiceprint recognition result;
and updating the network parameters through back propagation.
3. The training method of the voiceprint recognition model for self-supervised learning according to claim 1, wherein the dividing of the first training data into second training data and third training data based on the first similarity result in step S1 specifically includes: sorting according to the similarity scores of the first similarity result, selecting the scored data within the specified interval distance from the first training data as the second training data, and taking the rest as the third training data;
the dividing of the first training data into fourth training data and fifth training data based on the second similarity result in step S2 specifically includes: sorting according to the similarity scores of the second similarity result, selecting the scored data within the specified interval distance from the first training data as the fourth training data, and taking the rest as the fifth training data;
the dividing of the first training data into seventh training data and eighth training data based on the third similarity result in step S3 specifically includes: sorting according to the similarity scores of the third similarity result, selecting the scored data within the specified interval distance from the first training data as the seventh training data, and taking the rest as the eighth training data.
4. The training method of the voiceprint recognition model for self-supervised learning according to claim 2, wherein each training generation of the voiceprint recognition model comprises at least one training batch, each training batch is composed of a batch-size number of pieces of voice data, each piece of voice data has 2 (or n) voice features, and the number of labels is batch size × (2 or n); the voiceprint recognition model is trained for 1-3 generations in step S1, and for 2 generations in each of steps S2 and S3.
5. The training method of the voiceprint recognition model for self-supervised learning according to claim 1, wherein the sixth training data in step S3 contains one third of the fifth training data.
6. The training method of the voiceprint recognition model for self-supervised learning according to claim 1, wherein the voiceprint recognition model comprises an ECAPA-TDNN network.
7. The method of claim 1, wherein the similarity comprises a cosine similarity.
8. A voiceprint recognition model training apparatus for self-supervised learning, characterized by comprising:
the first training module is configured to use first training data to train in at least one generation of training process of the voiceprint recognition model to obtain a trained voiceprint recognition model, perform similarity calculation on the first training data through the trained voiceprint recognition model to obtain a first similarity result, and divide the first training data into second training data and third training data based on the first similarity result;
a second training module configured to perform training using the second training data in at least one generation of training process of the trained voiceprint recognition model to obtain an optimized voiceprint recognition model, perform similarity calculation on the first training data through the optimized voiceprint recognition model to obtain a second similarity result, and divide the first training data into fourth training data and fifth training data based on the second similarity result;
a third training module, configured to extract a certain amount of data from the fifth training data and combine the data with the fourth training data to form sixth training data, train with the sixth training data in at least one generation of training process of the optimized voiceprint recognition model to obtain a further optimized voiceprint recognition model, perform similarity calculation on the first training data through the further optimized voiceprint recognition model to obtain a third similarity result, and divide the first training data into seventh training data and eighth training data based on the third similarity result;
and a repeated training module, configured to repeatedly execute the first training module to the third training module, training the voiceprint recognition model by taking the seventh training data as the first training data, until the voiceprint recognition model achieves the expected effect or a training end condition is met.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
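The training step recited in claim 2 (VAD, at least two fixed-length segments per utterance, feature extraction, a ranking loss with each utterance as its own label, then back-propagation) can be illustrated with a toy loss computation. This is a hedged sketch under two assumptions: the garbled phrase "three-ranking loss" is read as a triplet ranking loss, and cosine similarity (claim 7) is used as the scoring function; the embeddings and margin value are illustrative stand-ins, not the output of the ECAPA-TDNN network.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors (claim 7)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on cosine similarities: pull the positive in, push the negative away."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

def batch_triplet_loss(segment_embeddings, margin=0.2):
    """segment_embeddings: one (emb_a, emb_b) pair per utterance, i.e. the two
    fixed-length segments cut from the same VAD-filtered utterance.
    Segments of the same utterance are positives; all other utterances supply
    negatives, so each utterance acts as its own label."""
    losses = []
    for i, (anchor, positive) in enumerate(segment_embeddings):
        for j, (negative, _) in enumerate(segment_embeddings):
            if i != j:
                losses.append(triplet_loss(anchor, positive, negative, margin))
    return sum(losses) / max(1, len(losses))
```

In an actual training loop this loss would be computed on network outputs and back-propagated to update the network parameters, which is the final step of claim 2.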
CN202111591886.8A 2021-12-23 2021-12-23 Voiceprint recognition model training method and device for self-supervision learning and readable medium Pending CN114464195A (en)

Publications (1)

Publication Number Publication Date
CN114464195A (en) 2022-05-10

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116072125A (en) * 2023-04-07 2023-05-05 Chengdu University of Information Technology Method and system for constructing self-supervision speaker recognition model in noise environment
CN116072125B (en) * 2023-04-07 2023-10-17 Chengdu University of Information Technology Method and system for constructing self-supervision speaker recognition model in noise environment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination