CN110648671A - Voiceprint model reconstruction method, terminal, device and readable storage medium - Google Patents

Voiceprint model reconstruction method, terminal, device and readable storage medium Download PDF

Info

Publication number
CN110648671A
Authority
CN
China
Prior art keywords
voiceprint
sub
sample data
voiceprint model
voice sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910775992.8A
Other languages
Chinese (zh)
Inventor
陈昊亮 (Chen Haoliang)
罗伟航 (Luo Weihang)
李炳霖 (Li Binglin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou National Acoustic Intelligent Technology Co Ltd
Original Assignee
Guangzhou National Acoustic Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou National Acoustic Intelligent Technology Co Ltd
Priority to CN201910775992.8A
Publication of CN110648671A
Legal status: Pending (current)

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a voiceprint model reconstruction method comprising the following steps: obtaining voice sample data, which comprises a plurality of sub-voice sample data, and generating an initial voiceprint model based on the voice sample data; obtaining a voiceprint feature vector of each sub-voice sample data based on the initial voiceprint model; clustering the voice sample data based on a K-Means algorithm and the voiceprint feature vectors, thereby dividing the voice sample data into a preset number of sub-sample sets; and generating a target voiceprint model based on the preset number of sub-sample sets. The invention also discloses a device, a terminal and a readable storage medium. By clustering and grouping the voice sample data and iteratively training the voiceprint model on the grouped sub-sample sets, the method improves both the training efficiency and the robustness of the voiceprint model.

Description

Voiceprint model reconstruction method, terminal, device and readable storage medium
Technical Field
The invention relates to the field of voiceprint recognition, in particular to a voiceprint model reconstruction method, a terminal, a device and a readable storage medium.
Background
Voiceprints are the spectra of sound waves carrying verbal information, displayed with an electro-acoustic instrument. Modern research shows that a voiceprint is not only specific to the individual but also relatively stable: after adulthood, a person's voice remains relatively stable for a long time. A voiceprint recognition algorithm establishes a voiceprint recognition model by learning various voice features from the voice spectrogram, thereby identifying the speaker.
At present, however, voiceprint models are trained on user speech data under a supervised, sample-guided paradigm; specifically, the training data can be modeled with a Gaussian Mixture Model-Universal Background Model (GMM-UBM), a Total Variability (TV) system, or a deep neural network system, and a large amount of user speech is used during training to learn a feature vector representing the user. In practice, voice samples commonly lack labels, and labeling them manually introduces large errors: it is very difficult for an annotator to identify a user from unfamiliar speech, so the error is large and the labeling cost is very high. Therefore, how to train a better voiceprint model on user training data with incomplete labels is a problem to be solved at present.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a voiceprint model reconstruction method, a terminal, a device and a readable storage medium, and aims to solve the technical problem that a voiceprint model trained on user training data with incomplete labels is not robust.
In order to achieve the above object, the present invention provides a method for reconstructing a voiceprint model, which comprises the following steps:
acquiring voice sample data, and generating an initial voiceprint model based on the voice sample data, wherein the voice sample data comprises a plurality of sub-voice sample data;
acquiring a voiceprint feature vector of each sub-voice sample data based on the initial voiceprint model, clustering the voice sample data based on a K-Means algorithm and each voiceprint feature vector, and dividing the voice sample data into a preset number of sub-sample sets;
and generating a target voiceprint model based on the preset number of sub-sample sets.
Further, in an embodiment, the preset number of sub-sample sets includes a first sub-sample set, a second sub-sample set, and a third sub-sample set, and a first preset value is smaller than a second preset value; the step of clustering the voice sample data based on the K-Means algorithm and each voiceprint feature vector and dividing the voice sample data into the preset number of sub-sample sets includes:
calculating the distance between each voiceprint characteristic vector and a preset clustering center based on the K-Means algorithm;
when a first sub-distance smaller than or equal to the first preset value exists in all the distances, taking sub-voice sample data corresponding to the first sub-distance as voice sample data in a first sub-sample set;
when a second sub-distance which is greater than the first preset value and less than or equal to the second preset value exists in all the distances, taking sub-voice sample data corresponding to the second sub-distance as voice sample data in a second sub-sample set;
and when a third sub-distance larger than the second preset value exists in all the distances, taking the sub-voice sample data corresponding to the third sub-distance as the voice sample data in a third sub-sample set.
Further, in one embodiment, the cluster center is calculated by:
and calculating the average value of each voiceprint feature vector, and taking the average value as the clustering center.
Further, in an embodiment, the step of generating the target voiceprint model based on the preset number of subsample sets comprises:
generating a first voiceprint model based on the first set of subsamples;
generating a target voiceprint model based on the first set of subsamples, the second set of subsamples, the third set of subsamples, and the first voiceprint model.
Further, in an embodiment, the step of generating a target voiceprint model based on the first set of subsamples, the second set of subsamples, the third set of subsamples and the first voiceprint model comprises:
generating a second voiceprint model based on the first set of subsamples, the second set of subsamples, and the first voiceprint model;
generating a target voiceprint model based on the first set of subsamples, the second set of subsamples, the third set of subsamples, and the second voiceprint model.
Further, in an embodiment, after the step of generating the target voiceprint model based on the preset number of subsample sets, the method further includes:
when a voiceprint authentication request is received, acquiring voice data to be authenticated based on the voiceprint authentication request;
and determining a voiceprint authentication result of the voice data to be authenticated based on the target voiceprint model.
Further, in an embodiment, after the step of determining the voiceprint authentication result of the voice data to be authenticated based on the target voiceprint model, the method further includes:
and when the voiceprint authentication result is that the voiceprint authentication is passed, sending prompt information that the voiceprint authentication request is passed to a preset terminal.
Further, in an embodiment, the voiceprint model reconstruction apparatus includes:
an acquisition module, configured to acquire voice sample data and generate an initial voiceprint model based on the voice sample data, wherein the voice sample data comprises a plurality of sub-voice sample data;
the processing module is used for acquiring the voiceprint characteristic vector of each sub-voice sample data based on the initial voiceprint model, clustering the voice sample data based on a K-Means algorithm and each voiceprint characteristic vector, and dividing the voice sample data into a preset number of sub-sample sets;
and the generating module is used for generating a target voiceprint model based on the sub-sample sets with the preset number.
In addition, to achieve the above object, the present invention also provides a terminal, including: a memory, a processor and a voiceprint model reconstruction program stored on the memory and executable on the processor, the voiceprint model reconstruction program when executed by the processor implementing the steps of the voiceprint model reconstruction method of any one of the above.
In addition, to achieve the above object, the present invention further provides a readable storage medium having stored thereon a voiceprint model reconstruction program, which when executed by a processor, implements the steps of the voiceprint model reconstruction method according to any one of the above.
The method obtains voice sample data and generates an initial voiceprint model based on it, the voice sample data comprising a plurality of sub-voice sample data; then obtains a voiceprint feature vector of each sub-voice sample data based on the initial voiceprint model; clusters the voice sample data based on a K-Means algorithm and the voiceprint feature vectors, dividing it into a preset number of sub-sample sets; and then generates a target voiceprint model based on the preset number of sub-sample sets. The voice sample data is clustered and grouped by the unsupervised K-Means algorithm, which weakens the influence of missing labels on model training; the voiceprint model is trained iteratively on the grouped voice sample data in order from easy to difficult, which improves the performance of the voiceprint model; and the difficult voice sample data is used for model training through repeated iteration, which effectively improves the robustness of the voiceprint model.
Drawings
Fig. 1 is a schematic structural diagram of a terminal in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of a voiceprint model reconstruction method according to the present invention;
FIG. 3 is a flowchart illustrating a voiceprint model reconstruction method according to a second embodiment of the present invention;
fig. 4 is a functional block diagram of an embodiment of a voiceprint model reconstruction apparatus according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic structural diagram of a terminal in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the terminal may include: a processor 1001, such as a CPU, a network interface 1004, a client interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. The client interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the client interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Optionally, the terminal may further include a camera, a Radio Frequency (RF) circuit, a sensor, an audio circuit, a WiFi module, and the like. Sensors such as light sensors and motion sensors are not described in detail herein.
Those skilled in the art will appreciate that the system architecture shown in fig. 1 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a readable storage medium, may include therein an operating system, a network communication module, a client interface module, and a voiceprint model reconstruction program.
In the system shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and communicating with it; the client interface 1003 is mainly used for connecting to a client and exchanging data with it; and the processor 1001 may be used to invoke the voiceprint model reconstruction program stored in the memory 1005.
In this embodiment, the terminal includes: the system comprises a memory 1005, a processor 1001 and a voiceprint model reconstruction program stored in the memory 1005 and capable of running on the processor 1001, wherein when the processor 1001 calls the voiceprint model reconstruction program stored in the memory 1005, the steps of the voiceprint model reconstruction method provided by each embodiment of the application are executed.
The invention also provides a voiceprint model reconstruction method, and referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the voiceprint model reconstruction method of the invention.
While a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in a different order than presented.
In this embodiment, the method for reconstructing a voiceprint model includes:
step S100, obtaining voice sample data, and generating an initial voiceprint model based on the voice sample data, wherein the voice sample data comprises a plurality of sub-voice sample data;
in this embodiment, voiceprint recognition, a type of biometric technology, also known as speaker recognition, is classified into two categories, namely speaker recognition and speaker verification. Different tasks and applications may use different voiceprint recognition techniques, such as recognition techniques may be required to narrow criminal investigation, and validation techniques may be required for banking transactions. Voiceprint recognition is the conversion of acoustic signals into electrical signals, which are then recognized by a computer. A large amount of voice sample data are needed for training the voiceprint model, and the voice data can be collected by a voice data collecting system and stored in a database and used during voiceprint model training. Since the collected voice data is limited, more voice data needs to be acquired to improve the accuracy of the voiceprint model. The difficulty and the cost for acquiring the standard voice data are high, so that a large amount of voice data without labels exist in the voice sample data.
Further, an initial voiceprint model is generated from the acquired voice sample data; the voiceprint model can be trained by methods such as a Gaussian Mixture Model-Universal Background Model (GMM-UBM), a Total Variability (TV) system, or a deep neural network system.
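For illustration only, the following is a minimal sketch of this step under the assumption that a Gaussian mixture fitted over pooled frame-level MFCC features stands in for the UBM-style initial model; the sampling rate, feature dimension, component count, and the extract_mfcc helper are hypothetical choices, not values taken from this disclosure:

```python
# Minimal sketch: fit a GMM over pooled MFCC features as a stand-in for the
# UBM-style initial voiceprint model. All parameters (16 kHz audio, 20 MFCCs,
# 64 diagonal-covariance components) are illustrative assumptions.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(wav_path, sr=16000, n_mfcc=20):
    """Hypothetical helper: frame-level MFCC features for one utterance."""
    audio, _ = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)

def train_initial_model(wav_paths, n_components=64):
    features = np.vstack([extract_mfcc(p) for p in wav_paths])
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
    ubm.fit(features)
    return ubm
```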
S200, acquiring a voiceprint characteristic vector of each sub-voice sample data based on the initial voiceprint model, clustering the voice sample data based on a K-Means algorithm and each voiceprint characteristic vector, and dividing the voice sample data into a preset number of sub-sample sets;
in this embodiment, first, a voiceprint feature vector of each voice sample data is obtained by using an initial voiceprint model, then, the voice sample data is clustered according to a K-Means algorithm and each voiceprint feature vector, the voice sample data is divided into a plurality of sub-sample sets, and the number of the sub-sample sets is determined according to an actual situation. The K-Means algorithm is a clustering analysis algorithm for iterative solution, and the method comprises the steps of randomly selecting K objects as initial clustering centers, then calculating the distance between each object and each seed clustering center, and allocating each object to the nearest clustering center. The cluster centers and the objects assigned to them represent a cluster. The cluster center of a cluster is recalculated for each sample assigned based on the objects existing in the cluster. This process will be repeated until some termination condition is met. The termination condition may be that no (or minimum number) objects are reassigned to different clusters, no (or minimum number) cluster centers are changed again, and the sum of squared errors is locally minimal. In the present invention, the average value of each voiceprint feature vector is calculated and used as the clustering center.
Specifically, step S200 includes:
step S210, calculating the distance between each voiceprint characteristic vector and a preset clustering center based on the K-Means algorithm;
in this embodiment, a cluster center is determined first, and the determination method is as follows: and calculating the average value of each voiceprint feature vector, and taking the average value as the clustering center, wherein the process of calculating the voiceprint feature vector is well known to those skilled in the art and is not described herein again.
Next, when calculating the distance between each voiceprint feature vector and the clustering center according to the K-Means algorithm, the Euclidean distance is used as the distance measure:

$$d(x, c) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2}$$

where $c$ and $x$ respectively represent the clustering center and a voiceprint feature vector.
Step S220, when a first sub-distance smaller than or equal to the first preset value exists in all distances, taking sub-voice sample data corresponding to the first sub-distance as voice sample data in a first sub-sample set;
in this embodiment, taking three subsample sets as an example, two preset values, namely a first preset value and a second preset value, are set empirically, where the first preset value is smaller than the second preset value, then distances between each voiceprint feature vector and a cluster center are calculated according to a K-Means algorithm, and voice sample data corresponding to the voiceprint feature vectors whose distances among all distances are smaller than or equal to the first preset value is divided into the first subsample set.
Step S230, when a second sub-distance greater than the first preset value and less than or equal to the second preset value exists in all distances, taking sub-voice sample data corresponding to the second sub-distance as voice sample data in a second sub-sample set;
in this embodiment, among the distances between each voiceprint feature vector and the center of the cluster calculated according to the K-Means algorithm, the voice sample data corresponding to all the voiceprint feature vectors whose distances are greater than the first preset value and less than or equal to the second preset value are divided into a second sub-sample set.
Step S240, when a third sub-distance greater than the second preset value exists in all distances, taking sub-voice sample data corresponding to the third sub-distance as voice sample data in a third sub-sample set.
In this embodiment, among all the distances, the voice sample data corresponding to the voiceprint feature vectors whose distance is greater than the second preset value are divided into the third sub-sample set.
It should be noted that the voiceprint feature vectors in the first sub-sample set are closest to the cluster center and are therefore considered voice samples that are easy to learn, while those in the second and third sub-sample sets are farther from the cluster center and are voice samples that are difficult to recognize; these are precisely the samples that the voiceprint model most needs to learn.
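A minimal sketch of the partitioning in steps S210 to S240, assuming the mean of the voiceprint feature vectors as the single clustering center described above; the two preset values t1 and t2 are illustrative placeholders that would in practice be set empirically:

```python
import numpy as np

def partition_samples(voiceprint_vectors, t1, t2):
    """Split sub-voice samples into three sub-sample sets by distance to the center.

    t1 < t2 play the role of the first and second preset values.
    Returns index arrays for the first (easy), second (harder),
    and third (hardest) sub-sample sets.
    """
    center = voiceprint_vectors.mean(axis=0)                     # mean as cluster center
    dists = np.linalg.norm(voiceprint_vectors - center, axis=1)  # Euclidean distances
    first = np.where(dists <= t1)[0]                             # step S220
    second = np.where((dists > t1) & (dists <= t2))[0]           # step S230
    third = np.where(dists > t2)[0]                              # step S240
    return first, second, third
```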
And step S300, generating a target voiceprint model based on the preset number of the sub-sample sets.
In this embodiment, the target voiceprint model is generated from the preset number of sub-sample sets; for example, when there are three sub-sample sets, the target voiceprint model is generated from the first sub-sample set, the second sub-sample set, and the third sub-sample set.
Specifically, step S300 includes:
step S310, generating a first voiceprint model based on the first subsample set;
in the present embodiment, the learning order is set according to how easy it is to be (i.e., the closer the distance from the cluster center is considered, the easier the distance is considered, and the farther the distance is considered, the harder the distance is), that is, the first set of subsamples is set as simple voice sample data, the second set of subsamples is set as the harder voice sample data, and the third set of subsamples is set as the hardest voice sample data. The learning sequence includes learning simple voice sample data, learning harder voice sample data and finally learning the hardest voice sample data. That is, the simplest first set of subsamples is learned first, then the harder first set of subsamples is learned, and finally the hardest first set of subsamples is learned.
Specifically, a first voiceprint model is generated from the first subsample set; the voiceprint model can be trained by methods such as a Gaussian Mixture Model-Universal Background Model (GMM-UBM), a Total Variability (TV) system, or a deep neural network system.
Step S320, generating a target voiceprint model based on the first subsample set, the second subsample set, the third subsample set and the first voiceprint model.
In this embodiment, after the first voiceprint model is generated according to the first sub-sample set, the first voiceprint model is trained by using the first sub-sample set, the second sub-sample set, and the third sub-sample set, so as to generate the target voiceprint model.
Specifically, step S320 includes:
step S321, generating a second voiceprint model based on the first subsample set, the second subsample set and the first voiceprint model;
in this embodiment, a first voiceprint model is used as an initial model, the first voiceprint model is trained by using a first subsample set and a second subsample set, and the learning rate of the speech training of the first subsample set and the second subsample set is set according to empirical data. Specifically, the first voiceprint Model is used as an initial Model, and all the voice sample data in the first sub-sample set and the second sub-sample set are input into the first voiceprint Model for training by using a Gaussian Mixture Model-Universal Background Model (GMM-UBM), a total variance modeling (TV) system or a deep neural network system, and the like, so as to obtain the second voiceprint Model.
Step S322, generating a target voiceprint model based on the first set of subsamples, the second set of subsamples, the third set of subsamples and the second voiceprint model.
In this embodiment, after the second voiceprint model is generated according to the first subsample set and the second subsample set, the second voiceprint model is trained by using the first subsample set, the second subsample set, and the third subsample set, so as to generate the target voiceprint model.
Specifically, the method first obtains the relatively pure voices closest to the clustering center (the first subsample set), then the voices farther from the clustering center that are considered difficult (the second subsample set), and finally the voices farthest from the clustering center that are considered the most difficult (the third subsample set). A voiceprint model is first trained with the first subsample set (the first voiceprint model); then, with the first voiceprint model as the initial model, the difficult second subsample set is trained together with the first subsample set to obtain a second voiceprint model; finally, with the second voiceprint model as the initial model, the most difficult third subsample set is trained together with the second and first subsample sets, yielding the target voiceprint model. The training process simulates the way humans learn knowledge from simple to difficult, makes good use of the difficult training voice samples, effectively improves the robustness of the voiceprint model, and gives the voiceprint model better performance.
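As a rough sketch of this easy-to-hard schedule, the snippet below reuses a warm-started Gaussian mixture as a stand-in for whichever trainer (GMM-UBM, TV, or deep neural network system) is actually used; the component count is an illustrative assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def build_target_model(first_set, second_set, third_set, n_components=64):
    """Easy-to-hard iterative training; each *_set is an array of feature rows."""
    model = GaussianMixture(n_components=n_components,
                            covariance_type="diag", warm_start=True)
    # warm_start=True makes each fit() continue from the previous solution.
    model.fit(first_set)                                      # first voiceprint model
    model.fit(np.vstack([first_set, second_set]))             # second voiceprint model
    model.fit(np.vstack([first_set, second_set, third_set]))  # target voiceprint model
    return model
```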
The voiceprint model reconstruction method provided by this embodiment obtains voice sample data and generates an initial voiceprint model based on it, the voice sample data comprising a plurality of sub-voice sample data; then obtains a voiceprint feature vector of each sub-voice sample data based on the initial voiceprint model; clusters the voice sample data based on a K-Means algorithm and the voiceprint feature vectors, dividing it into a preset number of sub-sample sets; and then generates a target voiceprint model based on the preset number of sub-sample sets. The voice sample data is clustered and grouped by the unsupervised K-Means algorithm, weakening the influence of missing labels on model training; iterative training on the grouped voice sample data in order from easy to difficult gives the voiceprint model better performance; and using the difficult voice sample data for model training through multiple iterations effectively improves the robustness of the voiceprint model.
Based on the first embodiment, referring to fig. 3, a second embodiment of the voiceprint model reconstruction method of the present invention is provided, in this embodiment, after step S300, the method further includes:
step S400, when a voiceprint authentication request is received, acquiring voice data to be authenticated based on the voiceprint authentication request;
in this embodiment, the Voiceprint (Voiceprint) is a spectrum of sound waves carrying speech information displayed by an electro-acoustic apparatus. The generation of human language is a complex physiological and physical process between the human language center and the vocal organs, and the vocal print maps of any two people are different because the size and the shape of the vocal organs, namely tongue, teeth, larynx, lung and nasal cavity, used by a person during speaking are different greatly. The speech acoustic characteristics of each individual are both relatively stable and variable, not absolute, but invariable. The variation can come from physiology, pathology, psychology, simulation, camouflage and is also related to environmental interference. However, since the pronunciation organs of each person are different, in general, people can distinguish different sounds or judge whether the sounds are the same. Voiceprint recognition has two categories, Speaker Identification (Speaker Identification) and Speaker Verification (Speaker Verification). The former is used for judging which one of a plurality of people said a certain section of voice, and is a 'one-out-of-multiple' problem; the latter is used to confirm whether a certain speech is spoken by a specified person, which is a one-to-one discrimination problem. Different voiceprint recognition techniques are used for different tasks and applications, such as identification techniques may be required for criminal investigation and validation techniques for bank transactions. Therefore, voiceprint recognition is widely applied in the field of identity authentication.
Specifically, when a voiceprint authentication request is received, the voice data to be authenticated is obtained according to the request; a voiceprint feature vector can then be extracted from this voice data by the voiceprint model for voiceprint authentication.
Step S500, determining the voiceprint authentication result of the voice data to be authenticated based on the target voiceprint model.
In this embodiment, the voice data to be authenticated is used as input to the target voiceprint model, which produces the corresponding voiceprint feature vector; this vector is compared with the voiceprint feature vector registered by the user to determine the voiceprint authentication result. Specifically, a matching value is calculated between the voiceprint feature vector corresponding to the voice data to be authenticated and the user's registered voiceprint feature vector; if the matching value is greater than or equal to a preset threshold, the voiceprint authentication passes, and if the matching value is smaller than the preset threshold, the voiceprint authentication fails.
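A minimal sketch of this comparison, assuming embedding-style voiceprint feature vectors and cosine similarity as the matching value; the preset threshold of 0.7 is an illustrative placeholder:

```python
import numpy as np

def voiceprint_authenticate(vector_to_verify, enrolled_vector, threshold=0.7):
    """Compare the voiceprint feature vector of the voice data to be
    authenticated against the user's registered voiceprint feature vector."""
    a = vector_to_verify / np.linalg.norm(vector_to_verify)
    b = enrolled_vector / np.linalg.norm(enrolled_vector)
    score = float(np.dot(a, b))   # cosine similarity as the matching value
    return score >= threshold     # pass iff the matching value meets the threshold
```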
And step S600, when the voiceprint authentication result is that the voiceprint authentication is passed, sending a prompt message that the voiceprint authentication request is passed to a preset terminal.
In this embodiment, when the voiceprint authentication result is that the voiceprint authentication passes, the prompt message that the voiceprint authentication request passes is sent to the preset terminal, and similarly, when the voiceprint authentication result is that the voiceprint authentication fails, the prompt message that the voiceprint authentication request fails is sent to the preset terminal.
According to the voiceprint model reconstruction method provided by this embodiment, when a voiceprint authentication request is received, the voice data to be authenticated is obtained based on the request, and the voiceprint authentication result of that voice data is determined based on the target voiceprint model, so that the user's identity is authenticated by the user's own voiceprint, improving convenience of use and user experience.
The invention further provides a voiceprint model reconstruction device, and referring to fig. 4, fig. 4 is a functional module schematic diagram of an embodiment of the voiceprint model reconstruction device of the invention.
The acquisition module 10 is configured to acquire voice sample data and generate an initial voiceprint model based on the voice sample data, wherein the voice sample data comprises a plurality of sub-voice sample data;
the processing module 20 is configured to obtain a voiceprint feature vector of each sub-voice sample data based on the initial voiceprint model, perform clustering on the voice sample data based on a K-Means algorithm and each voiceprint feature vector, and divide the voice sample data into a preset number of sub-sample sets;
and a generating module 30 for generating a target voiceprint model based on the preset number of sub-sample sets.
Further, the processing module 20 is further configured to:
calculating the distance between each voiceprint characteristic vector and a preset clustering center based on the K-Means algorithm;
when a first sub-distance smaller than or equal to the first preset value exists in all the distances, taking sub-voice sample data corresponding to the first sub-distance as voice sample data in a first sub-sample set;
when a second sub-distance which is greater than the first preset value and less than or equal to the second preset value exists in all the distances, taking sub-voice sample data corresponding to the second sub-distance as voice sample data in a second sub-sample set;
and when a third sub-distance larger than the second preset value exists in all the distances, taking the sub-voice sample data corresponding to the third sub-distance as the voice sample data in a third sub-sample set.
Further, the processing module 20 is further configured to:
and calculating the average value of each voiceprint feature vector, and taking the average value as the clustering center.
Further, the generating module 30 is further configured to:
generating a first voiceprint model based on the first set of subsamples;
generating a target voiceprint model based on the first set of subsamples, the second set of subsamples, the third set of subsamples, and the first voiceprint model.
Further, the generating module 30 is further configured to:
generating a second voiceprint model based on the first set of subsamples, the second set of subsamples, and the first voiceprint model;
generating a target voiceprint model based on the first set of subsamples, the second set of subsamples, the third set of subsamples, and the second voiceprint model.
Further, the voiceprint model reconstruction device further includes:
an acquisition module, configured to acquire voice data to be authenticated based on a voiceprint authentication request when the voiceprint authentication request is received;
and the determining module is used for determining the voiceprint authentication result of the voice data to be authenticated based on the target voiceprint model.
Further, the voiceprint model reconstruction device further includes:
and the sending module is used for sending prompt information that the voiceprint authentication request passes to a preset terminal when the voiceprint authentication result is that the voiceprint authentication passes.
In addition, an embodiment of the present invention further provides a readable storage medium, where a voiceprint model reconstruction program is stored on the readable storage medium, and when being executed by a processor, the voiceprint model reconstruction program implements the steps of the voiceprint model reconstruction method in the foregoing embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better embodiment. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a readable storage medium (such as ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a system device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the present specification and drawings, or used directly or indirectly in other related fields, are included in the scope of the present invention.

Claims (10)

1. A voiceprint model reconstruction method is characterized by comprising the following steps:
acquiring voice sample data, and generating an initial voiceprint model based on the voice sample data, wherein the voice sample data comprises a plurality of sub-voice sample data;
acquiring a voiceprint characteristic vector of each sub-voice sample data based on the initial voiceprint model, clustering the voice sample data based on a K-Means algorithm and each voiceprint characteristic vector, and dividing the voice sample data into a preset number of sub-sample sets;
and generating a target voiceprint model based on the preset number of sub-sample sets.
2. The method of claim 1, wherein the preset number of sub-sample sets includes a first sub-sample set, a second sub-sample set, and a third sub-sample set, a first preset value is smaller than a second preset value, and the clustering of the voice sample data based on the K-Means algorithm and each voiceprint feature vector and the dividing of the voice sample data into the preset number of sub-sample sets include:
calculating the distance between each voiceprint characteristic vector and a preset clustering center based on the K-Means algorithm;
when a first sub-distance smaller than or equal to the first preset value exists in all the distances, taking sub-voice sample data corresponding to the first sub-distance as voice sample data in a first sub-sample set;
when a second sub-distance which is greater than the first preset value and less than or equal to the second preset value exists in all the distances, taking sub-voice sample data corresponding to the second sub-distance as voice sample data in a second sub-sample set;
and when a third sub-distance larger than the second preset value exists in all the distances, taking the sub-voice sample data corresponding to the third sub-distance as the voice sample data in a third sub-sample set.
3. The voiceprint model reconstruction method of claim 2 wherein the cluster center is calculated by:
and calculating the average value of each voiceprint feature vector, and taking the average value as the clustering center.
4. The voiceprint model reconstruction method of claim 2 wherein said step of generating a target voiceprint model based on said preset number of subsample sets comprises:
generating a first voiceprint model based on the first set of subsamples;
generating a target voiceprint model based on the first set of subsamples, the second set of subsamples, the third set of subsamples, and the first voiceprint model.
5. The method of voiceprint model reconstruction according to claim 4 wherein said step of generating a target voiceprint model based on said first set of subsamples, said second set of subsamples, said third set of subsamples and said first voiceprint model comprises:
generating a second voiceprint model based on the first set of subsamples, the second set of subsamples, and the first voiceprint model;
generating a target voiceprint model based on the first set of subsamples, the second set of subsamples, the third set of subsamples, and the second voiceprint model.
6. The voiceprint model reconstruction method according to any one of claims 1 to 5, wherein said step of generating a target voiceprint model based on said preset number of sets of subsamples is followed by further comprising:
when a voiceprint authentication request is received, acquiring voice data to be authenticated based on the voiceprint authentication request;
and determining a voiceprint authentication result of the voice data to be authenticated based on the target voiceprint model.
7. The voiceprint model reconstruction method according to claim 6, wherein after the step of determining the voiceprint authentication result of the voice data to be authenticated based on the target voiceprint model, further comprising:
and when the voiceprint authentication result is that the voiceprint authentication is passed, sending prompt information that the voiceprint authentication request is passed to a preset terminal.
8. A voiceprint model reconstruction apparatus, comprising:
the acquisition module acquires voice sample data and generates an initial voiceprint model based on the voice sample data, wherein the voice sample data comprises a plurality of sub-voice sample data;
the processing module is used for acquiring the voiceprint characteristic vector of each sub-voice sample data based on the initial voiceprint model, clustering the voice sample data based on a K-Means algorithm and each voiceprint characteristic vector, and dividing the voice sample data into a preset number of sub-sample sets;
and the generating module is used for generating a target voiceprint model based on the sub-sample sets with the preset number.
9. A terminal, characterized in that the terminal comprises: a memory, a processor and a voiceprint model reconstruction program stored on the memory and executable on the processor, the voiceprint model reconstruction program when executed by the processor implementing the steps of the voiceprint model reconstruction method of any one of claims 1 to 7.
10. A readable storage medium, having stored thereon the voiceprint model reconstruction program which, when executed by a processor, implements the steps of the voiceprint model reconstruction method according to any one of claims 1 to 7.
CN201910775992.8A 2019-08-21 2019-08-21 Voiceprint model reconstruction method, terminal, device and readable storage medium Pending CN110648671A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910775992.8A CN110648671A (en) 2019-08-21 2019-08-21 Voiceprint model reconstruction method, terminal, device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910775992.8A CN110648671A (en) 2019-08-21 2019-08-21 Voiceprint model reconstruction method, terminal, device and readable storage medium

Publications (1)

Publication Number Publication Date
CN110648671A 2020-01-03

Family

ID=68990284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910775992.8A Pending CN110648671A (en) 2019-08-21 2019-08-21 Voiceprint model reconstruction method, terminal, device and readable storage medium

Country Status (1)

Country Link
CN (1) CN110648671A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180211670A1 (en) * 2015-01-26 2018-07-26 Verint Systems Ltd. Acoustic signature building for a speaker from multiple sessions
CN106782564A (en) * 2016-11-18 2017-05-31 百度在线网络技术(北京)有限公司 Method and apparatus for processing speech data
CN109145148A (en) * 2017-06-28 2019-01-04 百度在线网络技术(北京)有限公司 Information processing method and device
CN108460081A (en) * 2018-01-12 2018-08-28 平安科技(深圳)有限公司 Voice data base establishing method, voiceprint registration method, apparatus, equipment and medium
CN109378003A (en) * 2018-11-02 2019-02-22 科大讯飞股份有限公司 A kind of method and system of sound-groove model training

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111063360A (en) * 2020-01-21 2020-04-24 北京爱数智慧科技有限公司 Voiceprint library generation method and device
CN111063360B (en) * 2020-01-21 2022-08-19 北京爱数智慧科技有限公司 Voiceprint library generation method and device
CN111415669A (en) * 2020-04-15 2020-07-14 厦门快商通科技股份有限公司 Voiceprint model construction method, device and equipment
CN111785283A (en) * 2020-05-18 2020-10-16 北京三快在线科技有限公司 Voiceprint recognition model training method and device, electronic equipment and storage medium
CN111833851A (en) * 2020-06-16 2020-10-27 杭州云嘉云计算有限公司 Method for automatically learning and optimizing acoustic model
CN111833851B (en) * 2020-06-16 2021-03-16 杭州云嘉云计算有限公司 Method for automatically learning and optimizing acoustic model
CN112530409A (en) * 2020-12-01 2021-03-19 平安科技(深圳)有限公司 Voice sample screening method and device based on geometry and computer equipment
CN112530409B (en) * 2020-12-01 2024-01-23 平安科技(深圳)有限公司 Speech sample screening method and device based on geometry and computer equipment
CN113409795A (en) * 2021-08-19 2021-09-17 北京世纪好未来教育科技有限公司 Training method, voiceprint recognition method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN110648671A (en) Voiceprint model reconstruction method, terminal, device and readable storage medium
EP3477519B1 (en) Identity authentication method, terminal device, and computer-readable storage medium
JP6429945B2 (en) Method and apparatus for processing audio data
US11875799B2 (en) Method and device for fusing voiceprint features, voice recognition method and system, and storage medium
CN112435684B (en) Voice separation method and device, computer equipment and storage medium
CN111079791A (en) Face recognition method, face recognition device and computer-readable storage medium
CN112233698B (en) Character emotion recognition method, device, terminal equipment and storage medium
CN112989108B (en) Language detection method and device based on artificial intelligence and electronic equipment
CN112071322A (en) End-to-end voiceprint recognition method, device, storage medium and equipment
CN109448732B (en) Digital string voice processing method and device
CN111179940A (en) Voice recognition method and device and computing equipment
CN113223536A (en) Voiceprint recognition method and device and terminal equipment
CN111862945A (en) Voice recognition method and device, electronic equipment and storage medium
CN113327620A (en) Voiceprint recognition method and device
CN111243604B (en) Training method for speaker recognition neural network model supporting multiple awakening words, speaker recognition method and system
CN111613230A (en) Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium
CN116895273B (en) Output method and device for synthesized audio, storage medium and electronic device
CN110827834B (en) Voiceprint registration method, system and computer readable storage medium
CN113053395A (en) Pronunciation error correction learning method and device, storage medium and electronic equipment
CN116631380A (en) Method and device for waking up audio and video multi-mode keywords
CN116486789A (en) Speech recognition model generation method, speech recognition method, device and equipment
CN114220177A (en) Lip syllable recognition method, device, equipment and medium
CN113870896A (en) Motion sound false judgment method and device based on time-frequency graph and convolutional neural network
CN113823294B (en) Cross-channel voiceprint recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200103)