CN111341322A - Voiceprint model training method, device and equipment - Google Patents

Voiceprint model training method, device and equipment

Info

Publication number
CN111341322A
Authority
CN
China
Prior art keywords
voiceprint
voice data
user
environmental noise
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010293641.6A
Other languages
Chinese (zh)
Inventor
肖龙源
李稀敏
刘晓葳
谭玉坤
叶志坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202010293641.6A
Publication of CN111341322A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • G10L 17/18: Artificial neural networks; Connectionist approaches
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering

Abstract

The invention discloses a voiceprint model training method, device and equipment. The method comprises the following steps: collecting voice data of at least one user; acquiring the environmental noise data in the collected voice data of each user; eliminating the environmental noise in the collected voice data of each user according to the environmental noise data; performing voiceprint feature extraction on the voice data of each user after the environmental noise is eliminated; training the extracted voiceprint features to obtain a voiceprint feature training set; and forming a voiceprint model according to the voiceprint feature training set. In this way, the environmental noise in the collected voice data of each user can be eliminated, and the accuracy of the formed voiceprint model can be improved.

Description

Voiceprint model training method, device and equipment
Technical Field
The invention relates to the technical field of voiceprints, in particular to a voiceprint model training method, a voiceprint model training device and voiceprint model training equipment.
Background
A voiceprint is the spectrum of sound waves carrying speech information, as displayed by an electro-acoustic instrument. Modern scientific research shows that a voiceprint is not only specific to each person but also relatively stable: after adulthood, a person's voice remains relatively stable and unchanged for a long time. Experiments have proven that every person's voiceprint is different, and a speaker's voiceprint remains distinct whether the speaker deliberately imitates another person's voice and tone or speaks in a whisper, even if the imitation is vivid and lifelike.
However, the inventors found that the prior art has at least the following problem:
an existing voiceprint model training scheme generally collects voice data of at least one user, performs voiceprint feature extraction on the collected voice data of each user, trains the extracted voiceprint features to obtain a voiceprint feature training set, and then forms a voiceprint model according to the voiceprint feature training set. The collected voice data, however, generally contains environmental noise, and this noise runs through the entire process of forming the voiceprint model, so the accuracy of the formed voiceprint model is mediocre.
Disclosure of Invention
In view of this, the present invention provides a method, an apparatus and a device for training a voiceprint model, which can improve the accuracy of the formed voiceprint model.
According to an aspect of the present invention, there is provided a voiceprint model training method, including: collecting voice data of at least one user; acquiring environmental noise data in the acquired voice data of each user; according to the environmental noise data, eliminating the environmental noise in the collected voice data of each user; performing voiceprint feature extraction on the voice data of each user after the environmental noise is eliminated; training the extracted voiceprint features to obtain a voiceprint feature training set; and forming a voiceprint model according to the voiceprint feature training set.
Wherein acquiring the environmental noise data in the collected voice data of each user comprises: acquiring the environmental noise data in the collected voice data of each user through a preset general environmental noise model.
Wherein training the extracted voiceprint features to obtain a voiceprint feature training set comprises: training the extracted voiceprint features through a long short-term memory (LSTM) network and a convolutional neural network to obtain the voiceprint feature training set.
Wherein forming a voiceprint model according to the voiceprint feature training set comprises: forming, according to the voiceprint feature training set, a voiceprint model of the voice data of each user after the environmental noise is eliminated and/or a voiceprint model of the voice data of all users after the environmental noise is eliminated.
Wherein, after the forming of the voiceprint model according to the voiceprint feature training set, the method further comprises: optimizing the formed voiceprint model through a preset number of iterations.
According to another aspect of the present invention, there is provided a voiceprint model training apparatus, comprising: a collecting module, an obtaining module, an eliminating module, an extracting module, a training module and a forming module; the collecting module is used for collecting voice data of at least one user; the obtaining module is used for obtaining the environmental noise data in the collected voice data of each user; the eliminating module is used for eliminating the environmental noise in the collected voice data of each user according to the environmental noise data; the extracting module is used for performing voiceprint feature extraction on the voice data of each user after the environmental noise is eliminated; the training module is used for training the extracted voiceprint features to obtain a voiceprint feature training set; and the forming module is used for forming a voiceprint model according to the voiceprint feature training set.
The obtaining module is specifically configured to: and acquiring the environmental noise data in the collected voice data of each user through a preset general environmental noise model.
Wherein the training module is specifically configured to: train the extracted voiceprint features through a long short-term memory (LSTM) network and a convolutional neural network to obtain the voiceprint feature training set.
Wherein the forming module is specifically configured to: and forming a voiceprint model of the voice data of each user after the environmental noise is eliminated and/or a voiceprint model of the voice data of all users after the environmental noise is eliminated according to the voiceprint feature training set.
Wherein the voiceprint model training apparatus further comprises an optimization module, and the optimization module is used for optimizing the formed voiceprint model through a preset number of iterations.
According to still another aspect of the present invention, there is provided a voiceprint model training apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any of the voiceprint model training methods described above.
According to a further aspect of the present invention, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements a voiceprint model training method as described in any one of the above.
It can be found that, in the above scheme, voice data of at least one user can be collected; the environmental noise data in the collected voice data of each user can be acquired; the environmental noise in the collected voice data of each user can be eliminated according to the environmental noise data; voiceprint feature extraction can be performed on the voice data of each user after the environmental noise is eliminated; the extracted voiceprint features can be trained to obtain a voiceprint feature training set; and a voiceprint model can be formed according to the voiceprint feature training set. In this way, eliminating the environmental noise in the collected voice data of each user improves the accuracy of the formed voiceprint model.
Further, in the above scheme, the environmental noise data in the collected voice data of each user can be acquired through a preset general environmental noise model. The advantage is that, because the general environmental noise model is applicable to all types of environmental noise, the environmental noise data in the collected voice data of each user can be acquired more accurately.
Furthermore, in the above scheme, the extracted voiceprint features can be trained through a long short-term memory (LSTM) network and a convolutional neural network to obtain the voiceprint feature training set. The advantage is that the LSTM network and the convolutional neural network can retain the contextual information of the voiceprint features, so the continuity and accuracy of the voiceprint feature training set can be improved.
Further, in the above scheme, a voiceprint model of the voice data of each user after the environmental noise is eliminated and/or a voiceprint model of the voice data of all users after the environmental noise is eliminated can be formed according to the voiceprint feature training set. The advantage is that voiceprint models of the denoised voice data of each user and/or of all users can be formed, which makes it convenient to manage the voiceprint models of each user and of all users.
Furthermore, in the above scheme, the formed voiceprint model can be optimized through a preset number of iterations, which has the advantage of further improving the accuracy of the formed voiceprint model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flow chart diagram illustrating a voiceprint model training method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating a voiceprint model training method according to another embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an embodiment of a voiceprint model training apparatus according to the present invention;
FIG. 4 is a schematic structural diagram of another embodiment of the voiceprint model training apparatus of the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of a voiceprint model training apparatus according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and embodiments. It should be noted that the following embodiments are only illustrative of the present invention and do not limit its scope. Similarly, the following embodiments are only some, but not all, of the embodiments of the present invention; all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present invention.
The invention provides a voiceprint model training method which can improve the accuracy of a formed voiceprint model.
Referring to fig. 1, fig. 1 is a schematic flow chart of a voiceprint model training method according to an embodiment of the invention. It should be noted that the method of the present invention is not limited to the flow sequence shown in fig. 1 if the results are substantially the same. As shown in fig. 1, the method comprises the steps of:
S101: Voice data of at least one user is collected.
In this embodiment, the voice data of multiple users may be collected at one time, the voice data of multiple users may be collected over multiple sessions, the voice data of users may be collected one by one, and the like.
In this embodiment, multiple voice recordings of the same user may be collected, a single recording of the same user may be collected, multiple recordings of multiple users may be collected, and the like; the present invention is not limited thereto.
S102: ambient noise data in the collected voice data of each user is acquired.
Wherein the acquiring of the ambient noise data in the collected voice data of each user may include:
the acquisition of the environmental noise data in the collected voice data of each user through the preset general environmental noise model has the advantage that the acquisition of the environmental noise data in the collected voice data of each user can be more accurate because the general environmental noise model is suitable for all types of environmental noise.
In this embodiment, the environmental noise may be sound generated in industrial production, construction, transportation and social life that interferes with the surrounding living environment, and the like; the present invention is not limited thereto.
In this embodiment, the general environmental noise model may be an environmental noise model trained in advance on speech data covering all types of environmental noise, and therefore applicable to all types of environmental noise; it may also be an environmental noise model applicable to all types of environmental noise that is formed in another way, and the like. The present invention is not limited thereto.
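For illustration only, the following is a minimal Python sketch of one way the noise-acquisition step of S102 could be realized. The patent does not disclose the internals of the general environmental noise model, so the approach below (treating the lowest-energy frames of each recording as the noise profile) and all function names are assumptions, not the disclosed model.

```python
# Hedged sketch: estimate an ambient-noise spectrum from a recording by
# averaging its quietest frames. This stands in for the patent's "general
# environmental noise model", whose internals are not disclosed.
import numpy as np
import librosa

def estimate_noise_profile(wav_path, n_fft=1024, hop=512, quantile=0.1):
    """Return an estimated magnitude spectrum of the ambient noise."""
    y, sr = librosa.load(wav_path, sr=16000)
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    # Frame energies; the quietest frames are assumed to be noise-only.
    energy = mag.sum(axis=0)
    noise_frames = mag[:, energy <= np.quantile(energy, quantile)]
    return noise_frames.mean(axis=1), sr
```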
S103: ambient noise in the collected voice data of each user is removed based on the ambient noise data.
In this embodiment, noise matching the environmental noise data may be eliminated from the collected voice data of each user according to the environmental noise data, and the like; the present invention is not limited thereto.
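As a hedged illustration of S103, the sketch below removes the estimated noise by spectral subtraction. The patent does not name a specific elimination algorithm; spectral subtraction is one standard choice assumed here, and `estimate_noise_profile` refers to the illustrative helper sketched above.

```python
# Hedged sketch: spectral subtraction of an estimated noise spectrum.
# One standard denoising technique, assumed here; not the claimed method.
import numpy as np
import librosa
import soundfile as sf

def remove_noise(wav_path, noise_mag, out_path, n_fft=1024, hop=512):
    y, sr = librosa.load(wav_path, sr=16000)
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(stft), np.angle(stft)
    # Subtract the noise magnitude per frequency bin, flooring at zero.
    clean_mag = np.maximum(mag - noise_mag[:, None], 0.0)
    clean = librosa.istft(clean_mag * np.exp(1j * phase), hop_length=hop)
    sf.write(out_path, clean, sr)
```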
S104: and carrying out voiceprint feature extraction on the voice data of each user after the environmental noise is eliminated.
In this embodiment, voiceprint feature extraction may be performed once on the voice data of each user after the environmental noise is eliminated, performed multiple times, performed user by user, and the like.
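A minimal sketch of the feature-extraction step follows. The patent does not specify which voiceprint features are extracted; MFCCs are assumed here purely for illustration.

```python
# Hedged sketch: per-frame MFCC features as the "voiceprint features".
# The patent does not commit to a specific feature type.
import librosa

def extract_voiceprint_features(wav_path, n_mfcc=20):
    y, sr = librosa.load(wav_path, sr=16000)
    # Shape (frames, n_mfcc): one feature vector per frame.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
```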
S105: and training the extracted voiceprint features to obtain a voiceprint feature training set.
Wherein training the extracted voiceprint features to obtain the voiceprint feature training set may include:
training the extracted voiceprint features through a long short-term memory (LSTM) network and a convolutional neural network to obtain the voiceprint feature training set. The advantage is that the LSTM network and the convolutional neural network can retain the contextual information of the voiceprint features, so the continuity and accuracy of the voiceprint feature training set can be improved.
In this embodiment, the voiceprint feature training set may be a voiceprint feature training set of the voice data of each user after the environmental noise is eliminated, or a voiceprint feature training set of the voice data of all users after the environmental noise is eliminated, and the like; this is not limited in the present invention.
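As an illustration of the CNN plus LSTM combination named in S105, the following PyTorch sketch shows one plausible arrangement. The patent discloses no architecture details (layer sizes, ordering, loss), so every choice below (a 1-D convolution over MFCC frames, an LSTM whose final hidden state serves as the voiceprint embedding, and a speaker-classification head) is an assumption.

```python
# Hedged sketch: a CNN front end followed by an LSTM whose final hidden
# state is used as the voiceprint embedding. All sizes are illustrative.
import torch
import torch.nn as nn

class CnnLstmVoiceprint(nn.Module):
    def __init__(self, n_mfcc=20, n_speakers=100, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, embed_dim, batch_first=True)
        self.classifier = nn.Linear(embed_dim, n_speakers)

    def forward(self, x):  # x: (batch, frames, n_mfcc)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # conv over time
        _, (hn, _) = self.lstm(h)   # hn: (1, batch, embed_dim)
        embedding = hn.squeeze(0)   # the voiceprint embedding
        return self.classifier(embedding), embedding
```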
S106: and forming a voiceprint model according to the voiceprint feature training set.
Wherein forming the voiceprint model according to the voiceprint feature training set may include:
forming, according to the voiceprint feature training set, a voiceprint model of the voice data of each user after the environmental noise is eliminated and/or a voiceprint model of the voice data of all users after the environmental noise is eliminated. The advantage is that voiceprint models of the denoised voice data of each user and/or of all users can be formed, which makes it convenient to manage the voiceprint models of each user and of all users.
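For illustration, one common way (assumed here; the patent does not state it) to realize the per-user and all-user voiceprint models of S106 is to average the embeddings of each user's denoised utterances into a single enrollment vector; the all-user model is then simply the collection of those vectors.

```python
# Hedged sketch: per-user voiceprint models as averaged embeddings,
# using the illustrative CnnLstmVoiceprint model sketched above.
import torch

def build_voiceprint_models(model, features_by_user):
    """features_by_user: {user_id: [tensor of shape (frames, n_mfcc), ...]}"""
    model.eval()
    voiceprints = {}
    with torch.no_grad():
        for user, utterances in features_by_user.items():
            embs = [model(f.unsqueeze(0))[1].squeeze(0) for f in utterances]
            voiceprints[user] = torch.stack(embs).mean(dim=0)  # per-user model
    return voiceprints  # all-user model: one enrollment vector per user
```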
Wherein, after forming the voiceprint model according to the voiceprint feature training set, the method may further include:
optimizing the formed voiceprint model through a preset number of iterations. The advantage is that the accuracy of the formed voiceprint model can be further improved.
It can be found that, in this embodiment, voice data of at least one user may be collected; the environmental noise data in the collected voice data of each user may be acquired; the environmental noise in the collected voice data of each user may be eliminated according to the environmental noise data; voiceprint feature extraction may be performed on the voice data of each user after the environmental noise is eliminated; the extracted voiceprint features may be trained to obtain a voiceprint feature training set; and a voiceprint model may be formed according to the voiceprint feature training set. In this way, eliminating the environmental noise in the collected voice data of each user improves the accuracy of the formed voiceprint model.
Further, in this embodiment, the environmental noise data in the collected voice data of each user can be acquired through a preset general environmental noise model. The advantage is that, because the general environmental noise model is applicable to all types of environmental noise, the environmental noise data in the collected voice data of each user can be acquired more accurately.
Further, in this embodiment, the extracted voiceprint features can be trained through a long short-term memory (LSTM) network and a convolutional neural network to obtain the voiceprint feature training set. The advantage is that the LSTM network and the convolutional neural network can retain the contextual information of the voiceprint features, so the continuity and accuracy of the voiceprint feature training set can be improved.
Further, in this embodiment, a voiceprint model of the voice data of each user after the environmental noise is eliminated and/or a voiceprint model of the voice data of all users after the environmental noise is eliminated can be formed according to the voiceprint feature training set. The advantage is that voiceprint models of the denoised voice data of each user and/or of all users can be formed, which makes it convenient to manage the voiceprint models of each user and of all users.
Referring to fig. 2, fig. 2 is a schematic flow chart of a voiceprint model training method according to another embodiment of the invention. In this embodiment, the method includes the steps of:
S201: Voice data of at least one user is collected.
As described above in S101; further description is omitted here.
S202: ambient noise data in the collected voice data of each user is acquired.
As described above in S102; further description is omitted here.
S203: ambient noise in the collected voice data of each user is removed based on the ambient noise data.
As described above in S103; further description is omitted here.
S204: and carrying out voiceprint feature extraction on the voice data of each user after the environmental noise is eliminated.
As described above in S104; further description is omitted here.
S205: and training the extracted voiceprint features to obtain a voiceprint feature training set.
As described above in S105; further description is omitted here.
S206: and forming a voiceprint model according to the voiceprint feature training set.
As described above in S106; further description is omitted here.
S207: The formed voiceprint model is optimized through a preset number of iterations.
It can be found that, in this embodiment, the formed voiceprint model can be optimized through a preset number of iterations, which has the advantage of further improving the accuracy of the formed voiceprint model.
The invention also provides a voiceprint model training device which can improve the accuracy of the formed voiceprint model.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a voiceprint model training device according to an embodiment of the present invention. In this embodiment, the voiceprint model training device 30 includes a collecting module 31, an obtaining module 32, an eliminating module 33, an extracting module 34, a training module 35 and a forming module 36.
The collecting module 31 is configured to collect voice data of at least one user.
The obtaining module 32 is configured to obtain ambient noise data in the collected voice data of each user.
The eliminating module 33 is configured to eliminate the environmental noise in the collected voice data of each user according to the environmental noise data.
The extracting module 34 is configured to perform voiceprint feature extraction on the voice data of each user after the environmental noise is removed.
The training module 35 is configured to train the extracted voiceprint features to obtain a voiceprint feature training set.
The forming module 36 is configured to form a voiceprint model according to the training set of voiceprint features.
Optionally, the obtaining module 32 may be specifically configured to:
and acquiring the environmental noise data in the collected voice data of each user through a preset general environmental noise model.
Optionally, the training module 35 may be specifically configured to:
and training the extracted voiceprint features through a long-term and short-term memory network and a convolutional neural network to obtain a voiceprint feature training set.
Optionally, the forming module 36 may be specifically configured to:
and forming a voiceprint model of the voice data of each user after the environmental noise is eliminated and/or a voiceprint model of the voice data of all users after the environmental noise is eliminated according to the voiceprint feature training set.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a voiceprint model training device according to another embodiment of the present invention. Different from the previous embodiment, the voiceprint model training apparatus 40 according to the present embodiment further includes an optimization module 41.
The optimizing module 41 is configured to optimize the formed voiceprint model through a preset number of iterations.
Each module of the voiceprint model training device 30/40 can execute the corresponding steps in the above method embodiments; the details of each module are therefore not repeated here. Please refer to the description of the corresponding steps above.
The present invention further provides a voiceprint model training device, as shown in fig. 5, including: at least one processor 51; and a memory 52 communicatively coupled to the at least one processor 51; the memory 52 stores instructions executable by the at least one processor 51, and the instructions are executed by the at least one processor 51 to enable the at least one processor 51 to execute the above-mentioned voiceprint model training method.
Wherein the memory 52 and the processor 51 are connected by a bus. The bus may comprise any number of interconnected buses and bridges that couple various circuits of the processor 51 and the memory 52 together. The bus may also connect various other circuits, such as peripherals, voltage regulators and power management circuits, which are well known in the art and therefore are not described further herein. A bus interface provides an interface between the bus and a transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor 51 is transmitted over a wireless medium via an antenna, and the antenna also receives data and transmits it to the processor 51.
The processor 51 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And the memory 52 may be used to store data used by the processor 51 in performing operations.
The present invention further provides a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.
It can be found that, in the above scheme, voice data of at least one user can be collected; the environmental noise data in the collected voice data of each user can be acquired; the environmental noise in the collected voice data of each user can be eliminated according to the environmental noise data; voiceprint feature extraction can be performed on the voice data of each user after the environmental noise is eliminated; the extracted voiceprint features can be trained to obtain a voiceprint feature training set; and a voiceprint model can be formed according to the voiceprint feature training set. In this way, eliminating the environmental noise in the collected voice data of each user improves the accuracy of the formed voiceprint model.
Further, in the above scheme, the environmental noise data in the collected voice data of each user can be acquired through a preset general environmental noise model. The advantage is that, because the general environmental noise model is applicable to all types of environmental noise, the environmental noise data in the collected voice data of each user can be acquired more accurately.
Furthermore, in the above scheme, the extracted voiceprint features can be trained through a long short-term memory (LSTM) network and a convolutional neural network to obtain the voiceprint feature training set. The advantage is that the LSTM network and the convolutional neural network can retain the contextual information of the voiceprint features, so the continuity and accuracy of the voiceprint feature training set can be improved.
Further, in the above scheme, a voiceprint model of the voice data of each user after the environmental noise is eliminated and/or a voiceprint model of the voice data of all users after the environmental noise is eliminated can be formed according to the voiceprint feature training set. The advantage is that voiceprint models of the denoised voice data of each user and/or of all users can be formed, which makes it convenient to manage the voiceprint models of each user and of all users.
Furthermore, in the above scheme, the formed voiceprint model can be optimized through a preset number of iterations, which has the advantage of further improving the accuracy of the formed voiceprint model.
In the several embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be substantially embodied in the form of a software product; the software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The above description covers only some embodiments of the present invention and is not intended to limit its scope. All equivalent structures or equivalent process transformations made using the contents of this specification and the drawings, whether applied directly or indirectly in other related technical fields, are likewise included within the protection scope of the present invention.

Claims (10)

1. A voiceprint model training method is characterized by comprising the following steps:
collecting voice data of at least one user;
acquiring environmental noise data in the acquired voice data of each user;
according to the environmental noise data, eliminating the environmental noise in the collected voice data of each user;
performing voiceprint feature extraction on the voice data of each user after the environmental noise is eliminated;
training the extracted voiceprint features to obtain a voiceprint feature training set;
and forming a voiceprint model according to the voiceprint feature training set.
2. The voiceprint model training method according to claim 1, wherein the acquiring of the environmental noise data in the collected voice data of each user comprises:
and acquiring the environmental noise data in the collected voice data of each user through a preset general environmental noise model.
3. The voiceprint model training method according to claim 1, wherein the training of the extracted voiceprint features to obtain the voiceprint feature training set comprises:
and training the extracted voiceprint features through a long-term and short-term memory network and a convolutional neural network to obtain a voiceprint feature training set.
4. The voiceprint model training method according to claim 1, wherein the forming of the voiceprint model according to the voiceprint feature training set comprises:
and forming a voiceprint model of the voice data of each user after the environmental noise is eliminated and/or a voiceprint model of the voice data of all users after the environmental noise is eliminated according to the voiceprint feature training set.
5. The voiceprint model training method according to claim 1, wherein, after the forming of the voiceprint model according to the voiceprint feature training set, the method further comprises:
and optimizing the formed voiceprint model through iteration of preset times.
6. A voiceprint model training apparatus comprising:
the device comprises an acquisition module, an elimination module, an extraction module, a training module and a forming module;
the acquisition module is used for acquiring voice data of at least one user;
the acquisition module is used for acquiring the environmental noise data in the acquired voice data of each user;
the eliminating module is used for eliminating the environmental noise in the collected voice data of each user according to the environmental noise data;
the extraction module is used for extracting the voiceprint characteristics of the voice data of each user after the environmental noise is eliminated;
the training module is used for training the extracted voiceprint features to obtain a voiceprint feature training set;
and the forming module is used for forming a voiceprint model according to the voiceprint feature training set.
7. The voiceprint model training apparatus according to claim 6, wherein the obtaining module is specifically configured to:
and acquiring the environmental noise data in the collected voice data of each user through a preset general environmental noise model.
8. The voiceprint model training apparatus according to claim 6, wherein the training module is specifically configured to:
and training the extracted voiceprint features through a long-term and short-term memory network and a convolutional neural network to obtain a voiceprint feature training set.
9. The voiceprint model training apparatus according to claim 6, wherein the forming module is specifically configured to:
and forming a voiceprint model of the voice data of each user after the environmental noise is eliminated and/or a voiceprint model of the voice data of all users after the environmental noise is eliminated according to the voiceprint feature training set.
10. The voiceprint model training apparatus according to claim 6, further comprising:
an optimization module;
and the optimization module is used for optimizing the formed voiceprint model through iteration of preset times.
CN202010293641.6A (priority date 2020-04-15; filing date 2020-04-15): Voiceprint model training method, device and equipment. Publication: CN111341322A (pending).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010293641.6A | 2020-04-15 | 2020-04-15 | Voiceprint model training method, device and equipment

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010293641.6A | 2020-04-15 | 2020-04-15 | Voiceprint model training method, device and equipment

Publications (1)

Publication Number | Publication Date
CN111341322A | 2020-06-26

Family

ID=71184831

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202010293641.6A | Voiceprint model training method, device and equipment | 2020-04-15 | 2020-04-15

Country Status (1)

Country Link
CN (1): CN111341322A


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971690A (en) * 2013-01-28 2014-08-06 Tencent Technology (Shenzhen) Co., Ltd. Voiceprint recognition method and device
US20170358306A1 (en) * 2016-06-13 2017-12-14 Alibaba Group Holding Limited Neural network-based voiceprint information extraction method and apparatus
CN107180628A (en) * 2017-05-19 2017-09-19 Baidu Online Network Technology (Beijing) Co., Ltd. Method for establishing an acoustic feature extraction model, and method and device for extracting acoustic features
CN107610709A (en) * 2017-08-01 2018-01-19 Baidu Online Network Technology (Beijing) Co., Ltd. Method and system for training a voiceprint recognition model
CN109087659A (en) * 2018-08-03 2018-12-25 Samsung Electronics (China) R&D Center Audio optimization method and apparatus
CN110895941A (en) * 2018-08-23 2020-03-20 Shenzhen Ubtech Technology Co., Ltd. Voiceprint recognition method and device and storage device
CN110956965A (en) * 2019-12-12 2020-04-03 University of Electronic Science and Technology of China Personalized smart home security control system and method based on voiceprint recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yan He (闫河) et al., "Research on voiceprint recognition based on CNN-LSTM networks" (基于CNN-LSTM网络的声纹识别研究), Computer Applications and Software (计算机应用与软件) *

Similar Documents

Publication Publication Date Title
CN105489221B (en) A kind of audio recognition method and device
CN102404278A (en) Song request system based on voiceprint recognition and application method thereof
CN103680495A (en) Speech recognition model training method, speech recognition model training device and terminal
US11587547B2 (en) Electronic apparatus and method for controlling thereof
CN108335694A (en) Far field ambient noise processing method, device, equipment and storage medium
CN103632668B (en) A kind of method and apparatus for training English speech model based on Chinese voice information
CN106157974A (en) Text recites quality assessment device and method
CN111210840A (en) Age prediction method, device and equipment
CN114863905A (en) Voice category acquisition method and device, electronic equipment and storage medium
CN109686365B (en) Voice recognition method and voice recognition system
CN111261196A (en) Age estimation method, device and equipment
US20180033432A1 (en) Voice interactive device and voice interaction method
CN107818792A (en) Audio conversion method and device
CN111341322A (en) Voiceprint model training method, device and equipment
CN111415669B (en) Voiceprint model construction method, device and equipment
CN111326163B (en) Voiceprint recognition method, device and equipment
CN111128127A (en) Voice recognition processing method and device
CN111508500B (en) Voice emotion recognition method, system, device and storage medium
CN111128234B (en) Spliced voice recognition detection method, device and equipment
CN111128235A (en) Age prediction method, device and equipment based on voice
CN110890085B (en) Voice recognition method and system
CN111444377A (en) Voiceprint identification authentication method, device and equipment
CN111341304A (en) Method, device and equipment for training speech characteristics of speaker based on GAN
CN112201227A (en) Voice sample generation method and device, storage medium and electronic device
CN111477235A (en) Voiceprint acquisition method, device and equipment

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication (application publication date: 2020-06-26)