CN106971725B - Voiceprint recognition method and system with priority - Google Patents

Voiceprint recognition method and system with priority

Info

Publication number
CN106971725B
CN106971725B (application CN201610024164.7A)
Authority
CN
China
Prior art keywords
voiceprint
unidentified
standard
processing
unrecognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610024164.7A
Other languages
Chinese (zh)
Other versions
CN106971725A (en)
Inventor
祝铭明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yutou Technology Hangzhou Co Ltd
Original Assignee
Yutou Technology Hangzhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yutou Technology Hangzhou Co Ltd filed Critical Yutou Technology Hangzhou Co Ltd
Priority to CN201610024164.7A
Publication of CN106971725A
Application granted
Publication of CN106971725B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Abstract

The invention discloses a method and a system for improving voiceprint recognition accuracy. A voice clip source is collected, the unrecognized voiceprints in the voice clip source are analyzed and counted, and the unrecognized voiceprints are sorted by recognition priority according to the statistical result. The unrecognized voiceprint features in each unrecognized voiceprint are then obtained, wherein the unrecognized voiceprint features at least comprise wavelet elements of the unrecognized voiceprint. Each set of unrecognized voiceprint features is processed together with the standard voiceprint features in a voiceprint recognition model to obtain a discrimination degree for each unrecognized voiceprint. Whether each discrimination degree is greater than a preset standard threshold is then judged, the unrecognized voiceprints whose discrimination degree is greater than the standard threshold are retained, and the retained unrecognized voiceprint with the maximum discrimination degree is selected and identified as the locked voiceprint. The technical scheme has the advantage that, because the voices to be recognized are processed in advance, voices with a high recognition frequency are preferentially recognized by voiceprint, which improves recognition efficiency.

Description

Voiceprint recognition method and system with priority
Technical Field
The invention relates to the technical field of voiceprint recognition, in particular to a voiceprint recognition method and system with priority.
Background
With the rapid development of information and network technology, requirements for identity recognition are ever higher, and identity recognition based on traditional password authentication has exposed more and more defects in practical networked applications. Voiceprint Recognition (VPR), also called Speaker Recognition, is a more effective identity recognition technology based on biometric authentication. Voiceprint recognition is divided into two categories: Speaker Identification and Speaker Verification. The former judges which of several people spoke a given section of speech, a "one-out-of-many" problem; the latter confirms whether a given speech was spoken by a specified person, a "one-to-one decision" problem.
In the voiceprint recognition process, when the identities of users must be recognized from multiple voices in a timely manner, the existing voiceprint recognition systems recognize the multiple voices one by one, which lowers recognition efficiency and makes the recognition process cumbersome.
Disclosure of Invention
According to the above problems in the prior art, a technical solution for a voiceprint recognition method and system with priority is provided, which specifically includes:
a voiceprint recognition method with priority, comprising:
collecting a voice clip source, and identifying an unrecognized voiceprint existing in the voice clip source;
counting the unrecognized voiceprints identified each time, and forming a statistical result;
performing priority identification sequencing on the unrecognized voiceprints according to the statistical result;
obtaining unidentified voiceprint features in each unidentified voiceprint according to a priority order, wherein the unidentified voiceprint features at least comprise wavelet elements of the unidentified voiceprint;
processing according to each unrecognized voiceprint feature and a standard voiceprint feature in a voiceprint recognition model to obtain a discrimination degree corresponding to each unrecognized voiceprint;
respectively judging whether each discrimination degree is greater than a preset standard threshold value, and reserving the unidentified voiceprints with the discrimination degrees greater than the standard threshold value;
selecting the unidentified voiceprint with the maximum discrimination degree from the retained unidentified voiceprints and identifying the unidentified voiceprint as a locked voiceprint;
the wavelet elements comprise real wavelet elements and/or complex wavelet elements, wherein the obtaining of the unidentified voiceprint features in the unidentified voiceprint comprises:
detecting voiced intervals in the unidentified voiceprint;
and detecting a pitch interval in each voiced interval, and acquiring the real wavelet elements and/or the complex wavelet elements of the voiceprint features in each pitch interval.
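The claim above does not specify how voiced intervals are detected. As a non-authoritative sketch, short-time energy thresholding is one common approach; the frame length and threshold below are illustrative assumptions:

```python
# Hedged sketch: energy-based voiced-interval detection. The patent does
# not name a detection method; frame_len and threshold are assumptions.
def detect_voiced_intervals(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices of contiguous runs of frames
    whose mean energy exceeds `threshold`."""
    intervals = []
    current = None
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > threshold:
            if current is None:
                current = [start, start + frame_len]
            else:
                current[1] = start + frame_len
        elif current is not None:
            intervals.append(tuple(current))
            current = None
    if current is not None:
        intervals.append(tuple(current))
    return intervals
```

A pitch-interval detector (for example autocorrelation-based) would then run inside each returned interval before the wavelet elements are extracted.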
Preferably, the method for voiceprint recognition with priority further includes, before the extracting the unrecognized voiceprint features in the unrecognized voiceprint, the steps of:
collecting the unidentified voiceprints;
and adjusting the voiceprint characteristic vector parameters corresponding to the unidentified voiceprint characteristic vector in a pre-constructed standard identification model at least according to the unidentified voiceprint characteristic vector in the unidentified voiceprint characteristics so as to construct the standard voiceprint characteristic vector in the standard voiceprint characteristics in the voiceprint identification model, which is adaptive to the unidentified voiceprint.
Preferably, the voiceprint recognition method with priority includes that the unrecognized voiceprint features include a plurality of the unrecognized voiceprint feature vectors, and the standard voiceprint features include a plurality of the standard voiceprint feature vectors, where the obtaining of the degree of discrimination of the unrecognized voiceprint according to at least the unrecognized voiceprint features and the standard voiceprint feature processing in the voiceprint recognition model includes:
processing to obtain the vector distance between each unidentified voiceprint feature vector in the unidentified voiceprint features and each standard voiceprint feature vector corresponding to the unidentified voiceprint feature vector in the standard voiceprint features;
processing according to the plurality of vector distances obtained by processing to obtain the target distance between the unidentified voiceprint feature and the standard voiceprint feature;
and processing by using at least the target distance between the unrecognized voiceprint feature and the standard voiceprint feature to obtain the discrimination degree of the unrecognized voiceprint.
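The three processing steps above can be sketched as follows. The patent names a "vector distance" and a "target distance" without fixing a metric or a reduction; Euclidean distance and a plain average are assumptions used here for illustration:

```python
import math

# Hedged sketch: per-pair vector distances reduced to one target distance.
# Euclidean distance and averaging are assumptions, not the patent's text.
def vector_distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def target_distance(unrecognized_vectors, standard_vectors):
    """Average the distances between corresponding feature vectors of the
    unrecognized voiceprint and the standard voiceprint."""
    distances = [vector_distance(u, s)
                 for u, s in zip(unrecognized_vectors, standard_vectors)]
    return sum(distances) / len(distances)
```

The resulting target distance is then used in the discrimination-degree computation.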
Preferably, the method for voiceprint recognition with priority further comprises, before acquiring the unidentified voiceprint:
acquiring a plurality of voiceprints and obtaining background voiceprint characteristics of each voiceprint in the plurality of voiceprints so as to construct a plurality of background identification models corresponding to the voiceprints, wherein the background voiceprint characteristics comprise a plurality of background voiceprint characteristic vectors;
and constructing the standard identification model according to the background identification model.
Preferably, the processing, at least using the target distance between the unrecognized voiceprint feature and the standard voiceprint feature, to obtain the discrimination degree of the unrecognized voiceprint includes:
processing to obtain the background distance between the unrecognized voiceprint features and the background voiceprint features of each voiceprint corresponding to the plurality of background recognition models;
processing according to the plurality of background distances to obtain a distance average value and a distance standard deviation;
processing to obtain a difference value between the target distance of the unidentified voiceprint feature and the standard voiceprint feature and the distance average value;
and processing to obtain a ratio of the difference value to the distance standard deviation, and taking the ratio as the discrimination of the unidentified voiceprint.
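The four sub-steps above amount to normalizing the target distance against the background-model distances, i.e. a z-score (similar in spirit to score normalization in speaker verification). A minimal sketch, assuming the population standard deviation:

```python
import math

# Hedged sketch of the discrimination degree: (target - mean) / std over
# the background distances. Using the population standard deviation is an
# assumption; in practice the sign convention may be flipped so that a
# smaller (better-matching) distance yields a larger discrimination degree.
def discrimination_degree(target_dist, background_dists):
    mean = sum(background_dists) / len(background_dists)
    var = sum((d - mean) ** 2 for d in background_dists) / len(background_dists)
    return (target_dist - mean) / math.sqrt(var)
```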
Preferably, the voiceprint recognition method with priority, wherein the obtaining the real wavelet elements and/or the complex wavelet elements of the voiceprint features in each of the pitch intervals includes:
acquiring a preset feature vector in each pitch interval, dividing the feature vectors in the pitch intervals into sample vectors with preset lengths according to a wavelet filter, and normalizing the sample vectors with the preset lengths;
performing at least one of the following wavelet transforms on the normalized sample vector of the predetermined length:
performing real wavelet transform on the normalized sample vector with the preset length to obtain a real part coefficient of a first preset frequency band, and selecting a frequency band meeting a first preset condition from the first preset frequency band for sampling to obtain the real wavelet element in the unidentified voiceprint feature;
and performing dual-tree complex wavelet transform on the normalized sample vector with the preset length to obtain a real part coefficient and an imaginary part coefficient of a second preset frequency band, and selecting a frequency band meeting a second preset condition from the second preset frequency band for sampling to obtain the complex wavelet elements in the unidentified voiceprint features.
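As a rough illustration of the real-wavelet branch above, the sketch below normalizes a sample vector and applies one level of a Haar analysis filter. The actual wavelet filter, decomposition depth, and band-selection conditions are not specified in the patent, and the dual-tree complex wavelet branch (two parallel filter trees yielding real and imaginary coefficients) is omitted here:

```python
import math

# Hedged sketch: single-level Haar wavelet analysis standing in for the
# real wavelet transform; the patent does not fix the filter or depth.
def normalize(vec):
    """Normalize a sample vector to unit energy before the transform."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def haar_step(signal):
    """One analysis level: returns (approximation, detail) coefficient
    bands; a selected band would be sampled to form wavelet elements."""
    assert len(signal) % 2 == 0
    approx = [(signal[i] + signal[i + 1]) / math.sqrt(2)
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / math.sqrt(2)
              for i in range(0, len(signal), 2)]
    return approx, detail
```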
Preferably, the method for voiceprint recognition with priority, wherein after detecting a voiced interval in the unrecognized voiceprint, the obtaining the unrecognized voiceprint features in the unrecognized voiceprint further comprises:
acquiring the Mel-frequency cepstral coefficients of each frame in the unrecognized voiceprint to obtain the Mel cepstral coefficient features in the unrecognized voiceprint features;
and processing the Mel cepstral coefficients to obtain the delta Mel cepstral coefficient features of each frame in the unrecognized voiceprint, so as to obtain the delta Mel cepstral coefficient features in the unrecognized voiceprint features.
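The differential (delta) cepstral features described above can be sketched as follows. The patent does not give the exact differencing scheme; the standard regression formula over a ±2-frame window is assumed:

```python
# Hedged sketch: first-order (delta) coefficients from per-frame MFCC
# vectors via the common regression formula; window size n=2 is assumed.
def delta_features(frames, n=2):
    """frames: list of equal-length MFCC vectors, one per frame.
    Edges are handled by repeating the first/last frame."""
    padded = [frames[0]] * n + frames + [frames[-1]] * n
    denom = 2 * sum(k * k for k in range(1, n + 1))
    deltas = []
    for t in range(n, n + len(frames)):
        delta = [sum(k * (padded[t + k][i] - padded[t - k][i])
                     for k in range(1, n + 1)) / denom
                 for i in range(len(frames[0]))]
        deltas.append(delta)
    return deltas
```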
A system to improve voiceprint recognition accuracy, comprising:
the first acquisition unit is used for acquiring a voice clip source;
the recognition unit is connected with the first acquisition unit and used for recognizing the unrecognized voiceprints existing in the voice clip source;
the statistical unit is connected with the identification unit and used for carrying out statistics on the unidentified voiceprints and forming a statistical result;
the first processing unit is used for carrying out priority sequencing on the unidentified voiceprints according to the statistical result;
the acquisition unit is connected with the first processing unit and used for acquiring unidentified voiceprint features in the unidentified voiceprints which are prioritized, wherein the unidentified voiceprint features at least comprise wavelet elements of the unidentified voiceprints;
the second processing unit is connected with the acquisition unit and is used for processing the discrimination degree of the unidentified voiceprint according to the unidentified voiceprint characteristics and standard voiceprint characteristics in a voiceprint identification model;
the judging unit is connected with the second processing unit and used for judging whether the discrimination is greater than a preset standard threshold value or not and reserving the unidentified voiceprint of which the discrimination is greater than the standard threshold value;
an identifying unit connected to the judging unit, for selecting the unidentified voiceprint with the highest discrimination degree from the retained unidentified voiceprints, and identifying as a locked voiceprint;
the wavelet elements include real wavelet elements and/or complex wavelet elements, and the obtaining unit includes:
a detection module for detecting voiced intervals in the unrecognized voiceprint;
a first obtaining module, connected to the detecting module, configured to detect a pitch interval in each of the voiced intervals, and obtain the real wavelet elements and/or the complex wavelet elements of the voiceprint features in each of the pitch intervals.
Preferably, the voiceprint recognition system with priority further comprises:
and the adjusting unit is connected with the first acquisition unit and used for adjusting the voiceprint characteristic vector parameters corresponding to the unidentified voiceprint characteristic vector in a pre-constructed standard identification model at least according to the unidentified voiceprint characteristic vector in the unidentified voiceprint characteristics so as to construct the standard voiceprint characteristic vector in the standard voiceprint characteristics in the voiceprint identification model, which is adaptive to the unidentified voiceprint.
Preferably, in the voiceprint recognition system with priority, the unrecognized voiceprint features include a plurality of the unrecognized voiceprint feature vectors, and the standard voiceprint features include a plurality of the standard voiceprint feature vectors, the processing unit includes:
the first processing module is used for processing to obtain the vector distance between each unidentified voiceprint feature vector in the unidentified voiceprint features and each standard voiceprint feature vector corresponding to the unidentified voiceprint feature vector in the standard voiceprint features;
the second processing module is connected with the first processing module and used for processing according to the plurality of vector distances obtained by processing to obtain the target distance between the unidentified voiceprint feature and the standard voiceprint feature;
and the third processing module is connected with the second processing module and used for processing by using the target distance between the unrecognized voiceprint feature and the standard voiceprint feature to obtain the discrimination of the unrecognized voiceprint.
Preferably, the voiceprint recognition system with priority further comprises:
the second acquisition unit is used for acquiring a plurality of voiceprints and acquiring background voiceprint characteristics of each voiceprint in the voiceprints so as to construct a plurality of background identification models corresponding to the voiceprints, wherein the background voiceprint characteristics comprise a plurality of background voiceprint characteristic vectors;
and the construction unit is connected with the second acquisition unit and used for constructing the standard identification model according to the background identification model.
Preferably, in the voiceprint recognition system with priority, the third processing module comprises:
the first processing submodule is used for processing to obtain the background distance between the unidentified voiceprint features and the background voiceprint features of each voiceprint corresponding to the plurality of background identification models;
the second processing submodule is connected with the first processing submodule and is used for processing according to the plurality of background distances to obtain a distance average value and a distance standard deviation;
the third processing submodule is connected with the second processing submodule and used for processing to obtain the difference value between the target distance of the unidentified voiceprint feature and the standard voiceprint feature and the distance average value;
and the fourth processing submodule is respectively connected with the second processing submodule and the third processing submodule and used for processing to obtain a ratio of the difference value to the distance standard deviation, and the ratio is used as the discrimination of the unidentified voiceprint.
Preferably, in the voiceprint recognition system with priority, the first obtaining module includes:
the first obtaining submodule is used for obtaining a preset feature vector in each pitch interval, dividing the feature vectors in the pitch intervals into sample vectors with preset length according to a wavelet filter, and normalizing the sample vectors with the preset length;
a transform submodule, connected to the first obtaining submodule, for performing at least one of the following wavelet transforms on the normalized sample vector of the predetermined length:
performing real wavelet transform on the normalized sample vector with the preset length to obtain a real part coefficient of a first preset frequency band, and selecting a frequency band meeting a first preset condition from the first preset frequency band for sampling to obtain the real wavelet element in the unidentified voiceprint feature;
and performing dual-tree complex wavelet transform on the normalized sample vector with the preset length to obtain a real part coefficient and an imaginary part coefficient of a second preset frequency band, and selecting a frequency band meeting a second preset condition from the second preset frequency band for sampling to obtain the complex wavelet elements in the unidentified voiceprint features.
Preferably, in the voiceprint recognition system with priority, the acquiring unit further includes:
a second obtaining module, configured to, after detecting a voiced interval in the unidentified voiceprint, obtain a mel cepstrum coefficient of each frame in the unidentified voiceprint to obtain the mel cepstrum coefficient feature in the unidentified voiceprint feature;
and the fourth processing module is connected with the second acquiring module and used for processing according to the mel cepstrum coefficient to obtain the differential mel cepstrum coefficient characteristics of each frame in the unidentified voiceprint so as to obtain the differential mel cepstrum coefficient characteristics in the unidentified voiceprint characteristics.
The beneficial effects of the above technical scheme are: the voice to be recognized is processed in advance, so that the voice with high recognition frequency can be recognized through voiceprint preferentially, and the recognition efficiency is improved.
Drawings
FIG. 1 is a general flow diagram of a method for prioritized voiceprint recognition in a preferred embodiment of the invention;
FIG. 2 is a flow chart of constructing a standard voiceprint feature vector in a preferred embodiment of the invention;
FIG. 3 is a flow chart of the process of deriving the degree of discrimination in a preferred embodiment of the present invention;
FIG. 4 is a flow chart of a process for obtaining a standard recognition model in a preferred embodiment of the present invention;
FIG. 5 is a flow chart of the process of deriving the degree of discrimination in a preferred embodiment of the present invention;
FIG. 6 is a flow chart of extracting unidentified voiceprint features in a preferred embodiment of the invention;
FIG. 7 is a flow chart of wavelet element extraction within each pitch interval in a preferred embodiment of the present invention;
FIG. 8 is a flow chart for obtaining differential Mel cepstral coefficient features in a preferred embodiment of the present invention;
FIG. 9 is a flow chart of obtaining unidentified voiceprint features after detecting voiced intervals in a preferred embodiment of the invention;
FIG. 10 is a block diagram of a prioritized voiceprint recognition system in a preferred embodiment of the invention;
FIG. 11 is a block diagram of a processing unit in the system in a preferred embodiment of the invention;
FIG. 12 is a block diagram of a third processing module in the processing unit in a preferred embodiment of the invention;
FIG. 13 is a block diagram of a first acquisition module in the acquisition unit in a preferred embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged under appropriate circumstances in order to facilitate the description of the embodiments of the invention herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In a preferred embodiment of the present invention, there is provided a voiceprint recognition method with priority, as shown in fig. 1, the method comprising:
step S1, collecting a voice clip source, and analyzing the unrecognized voiceprints existing in the voice clip source;
step S2, carrying out statistics on the unidentified voiceprints identified each time, and forming a statistical result;
step S3, carrying out priority identification and sequencing on the collected unidentified voiceprints according to the statistical result;
step S4, obtaining the unidentified voiceprint characteristics in each unidentified voiceprint according to a priority sequence, wherein the unidentified voiceprint characteristics at least comprise wavelet elements of the unidentified voiceprint;
step S5, processing according to each unrecognized voiceprint feature and the standard voiceprint feature in the voiceprint recognition model to obtain the discrimination corresponding to each unrecognized voiceprint;
step S6, respectively judging whether each discrimination degree is greater than a preset standard threshold value, and reserving unidentified voiceprints with discrimination degrees greater than the standard threshold value;
in step S7, the unrecognized voiceprint with the largest discrimination is selected from the remaining unrecognized voiceprints and recognized as the locked voiceprint.
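Steps S1 to S7 can be sketched end to end as follows. The `score` callback stands in for the feature extraction and discrimination computation of steps S4 and S5 and is an assumption for illustration:

```python
from collections import Counter

# Hedged sketch of steps S1-S7: count occurrences of each voiceprint,
# score them in descending-frequency order, keep those above the standard
# threshold, and lock the one with the highest discrimination degree.
def lock_voiceprint(observed_ids, score, threshold):
    """observed_ids: one entry per occurrence of an unrecognized voiceprint.
    score(voiceprint_id) -> discrimination degree (assumed callback)."""
    stats = Counter(observed_ids)                    # S2: statistics
    ordered = [vp for vp, _ in stats.most_common()]  # S3: priority order
    retained = []                                    # S6: keep > threshold
    for vp in ordered:                               # S4/S5 per priority
        degree = score(vp)
        if degree > threshold:
            retained.append((degree, vp))
    if not retained:
        return None
    return max(retained)[1]                          # S7: lock best match
```

Because frequently heard voiceprints are scored first, the device can react to its usual speakers with less latency, which is the efficiency gain the scheme claims.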
According to the technical scheme above, the unrecognized voiceprints existing in the collected voice clip source are identified and counted at each recognition to form a statistical result, and the unrecognized voiceprints are then sorted by priority according to the statistical result, so that unrecognized voiceprints with a high recognition frequency are recognized preferentially, which effectively improves recognition efficiency. In a preferred embodiment of the present invention, the voiceprint recognition method with priority can be applied to, but is not limited to, recognizing the voiceprints of a limited number of users on an intelligent device placed in a private personal space; the intelligent device can likewise be placed in a relatively open space to recognize the voiceprints of a limited number of users. In such an environment there may be several unrecognized voiceprints to be recognized, and likewise several unrecognized voiceprints whose discrimination degree is greater than the standard threshold (i.e., which could in general be recognized as locked voiceprints). For a smart device, however, it is best to operate according to the voice command of only one user at a time, otherwise the user experience may suffer. Therefore, when several unrecognized voiceprints satisfy the rule (discrimination degree greater than the standard threshold), all of them are retained, the one with the highest discrimination degree is selected and identified as the locked voiceprint, and the intelligent device with the voiceprint recognition function then performs the corresponding subsequent operations according to the locked voiceprint.
In a preferred embodiment of the present invention, the unrecognized voiceprint feature includes wavelet elements, that is, the wavelet elements of the voiceprint are combined on the basis of the original feature, thereby improving the accuracy and stability of the voiceprint recognition system with priority. And the problem of inaccurate identification caused by the fact that the identification result of the existing voiceprint identification mode is easily interfered by various factors is further solved. Furthermore, by directly comparing with the voiceprint recognition model, the complexity and the construction period of model construction are reduced, and therefore the stability and the recognition efficiency of voiceprint recognition are improved. Details regarding the above wavelet elements are described below.
In a preferred embodiment of the present invention, the unrecognized voiceprint feature in the unrecognized voiceprint may include a plurality of unrecognized voiceprint feature vectors. Accordingly, the standard voiceprint feature may include a plurality of standard voiceprint feature vectors.
Further, in the present embodiment, the above-mentioned unrecognized voiceprint feature may include the contents described below, but the composition thereof is not limited to the contents described below:
4 real wavelet elements, 4 dual-tree complex wavelet elements, Mel-frequency cepstral coefficient (MFCC) features, and delta Mel-frequency cepstral coefficient features.
Wherein, the wavelet element comprises at least one of real wavelet and complex wavelet.
In a preferred embodiment of the present invention, the voiceprint recognition model may include the following contents:
and adjusting the standard recognition model according to a plurality of voiceprint characteristic vectors (unrecognized voiceprint characteristic vectors) in the unrecognized voiceprint to obtain the standard recognition model which is adapted to the unrecognized voiceprint and is used for recognizing the unrecognized voiceprint.
The standard recognition model may include: different voiceprints associated with multiple people are collected, and corresponding voiceprint characteristics are obtained from the voiceprints of each person. Then, a Background recognition model corresponding to the voiceprint of each person is respectively constructed according to different voiceprint characteristics, and then the Background voiceprint characteristics in the plurality of Background recognition models are clustered, so that a standard recognition model, such as a Universal Background Model (UBM), is constructed.
In a preferred embodiment of the present invention, for example, a voiceprint feature includes 10 voiceprint feature vectors (i.e., 10 types of features), after the voiceprints of multiple persons are collected, the 10 types of features are obtained from the voiceprints of each person, and then each type of feature is clustered (e.g., including 32 centers). Then, a UBM model of 10 codebooks (i.e. voiceprint feature parameters corresponding to 10 voiceprint feature vectors) containing 32 codewords is obtained according to the result obtained by clustering. Furthermore, each speaker can also construct a corresponding background recognition model according to the voiceprint characteristics of the speaker.
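The clustering step in the example above (32 centers per feature type) can be sketched with a plain k-means (Lloyd's algorithm). The initialization, iteration count, and squared-Euclidean distance below are illustrative assumptions, with k kept small for brevity:

```python
import random

# Hedged sketch: one VQ codebook per feature type via k-means clustering,
# as in the 32-codeword example above. Random init and a fixed iteration
# count are assumptions; production systems typically use LBG splitting.
def build_codebook(vectors, k, iterations=10, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)  # initial codewords
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            # assign each vector to its nearest codeword (squared distance)
            nearest = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(v, centers[i])))
            clusters[nearest].append(v)
        for i, members in enumerate(clusters):
            if members:  # recompute codeword as the cluster centroid
                centers[i] = [sum(coord) / len(members)
                              for coord in zip(*members)]
    return centers
```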
In a preferred embodiment of the present invention, before obtaining the unrecognized voiceprint features in the unrecognized voiceprint, the following steps are further included as shown in fig. 2:
step A1, collecting unidentified voiceprints;
step A2, adjusting the voiceprint feature vector parameters corresponding to the unrecognized voiceprint feature vector in the pre-constructed standard recognition model at least according to the unrecognized voiceprint feature vector in the unrecognized voiceprint features to construct the standard voiceprint feature vector in the standard voiceprint features in the voiceprint recognition model which is adaptive to the unrecognized voiceprint.
In a preferred embodiment of the present invention, the unrecognized voiceprint may be collected as follows: a voice collection device (for example, a microphone) collects the human voice to be identified for a preset duration (for example, 5 seconds of continuous voice), with an audio format of 16 kHz sampling rate, 16-bit quantization depth, and a single (mono) channel.
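The capture format stated above (16 kHz, 16-bit quantization, mono) can be checked with the standard library `wave` module. The sketch below writes one second of digital silence in that format to an in-memory buffer and reads the parameters back:

```python
import io
import wave

def write_silence(stream, seconds=1, rate=16000):
    """Write `seconds` of 16-bit mono silence at `rate` Hz as a WAV."""
    with wave.open(stream, "wb") as w:
        w.setnchannels(1)     # mono
        w.setsampwidth(2)     # 16-bit quantization depth (2 bytes)
        w.setframerate(rate)  # 16 kHz sampling rate
        w.writeframes(b"\x00\x00" * rate * seconds)

def capture_format(stream):
    """Return (channels, bit depth, sample rate) of a WAV stream."""
    with wave.open(stream, "rb") as w:
        return w.getnchannels(), w.getsampwidth() * 8, w.getframerate()

buf = io.BytesIO()
write_silence(buf)
buf.seek(0)
fmt = capture_format(buf)  # expected: (1, 16, 16000)
```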
In a preferred embodiment of the present invention, the unrecognized voiceprint feature may include a plurality of unrecognized voiceprint feature vectors, and similarly, the standard voiceprint feature may include a plurality of standard voiceprint feature vectors. For example, each unrecognized voiceprint feature includes 10 VQ codebooks, that is, each unrecognized voiceprint feature vector corresponds to one VQ codebook, wherein each VQ codebook corresponds to a set of feature sets. Similarly, each standard voiceprint feature may also include 10 VQ codebooks, and each standard voiceprint feature vector corresponds to one VQ codebook.
In a preferred embodiment of the present invention, the standard recognition model may be adjusted according to a plurality of unrecognized voiceprint feature vectors in the unrecognized voiceprint features, so as to obtain a voiceprint recognition model adapted to the unrecognized voiceprint, thereby facilitating to recognize the voiceprint collected later by using the voiceprint recognition model.
According to the preferred embodiment of the invention, before the unrecognized voiceprint features in the unrecognized voiceprint are obtained, the voiceprint recognition model adaptive to the unrecognized voiceprint is obtained by adjusting the standard recognition model, and the pre-registration of the unrecognized voiceprint is realized, so that the voiceprint can be directly and accurately recognized according to the pre-registered voiceprint recognition model during voiceprint recognition, the complexity and the construction period of model construction are reduced, and the reliability and the efficiency of voiceprint recognition are further improved.
In a preferred embodiment of the present invention, the unrecognized voiceprint features include a plurality of unrecognized voiceprint feature vectors, and the standard voiceprint features include a plurality of standard voiceprint feature vectors. Obtaining the discrimination degree of the unrecognized voiceprint by processing at least the unrecognized voiceprint features and the standard voiceprint features in the voiceprint recognition model, as shown in fig. 3, includes:
step B1, processing to obtain the vector similarity of each unidentified voiceprint feature vector in the unidentified voiceprint features and each standard voiceprint feature vector corresponding to the unidentified voiceprint feature vector in the standard voiceprint features;
step B2, processing according to the processed vector similarity to obtain the target distance between the unrecognized voiceprint feature and the standard voiceprint feature;
and step B3, obtaining the discrimination degree of the unidentified voiceprint by processing at least the target distance between the unidentified voiceprint characteristic and the standard voiceprint characteristic.
In a preferred embodiment of the present invention, processing to obtain the vector similarity between an unrecognized voiceprint feature vector in the unrecognized voiceprint features and a standard voiceprint feature vector in the standard voiceprint features includes: processing to obtain the distance between the unrecognized voiceprint feature vector and the standard voiceprint feature vector.
Specifically, suppose the vector distance between an unrecognized voiceprint feature vector in the unrecognized voiceprint features and the corresponding standard voiceprint feature vector in the standard voiceprint features of the voiceprint recognition model is a. Normalization processing is performed on the plurality of vector distances, which are then weighted and summed to obtain the target distance S between the unrecognized voiceprint features and the standard voiceprint features. The discrimination of the unrecognized voiceprint is then obtained by processing at least the target distance S. In a preferred embodiment of the present invention, the weights may be preset according to the importance of the different feature vectors; in other embodiments of the present invention, the weights may be set or computed in other suitable manners.
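The distance-combination step can be sketched as follows. The normalization scheme here (dividing each vector distance by their sum) is an assumption for illustration; the text only states that normalization and weighted summation are performed, with the weights taken as given.

```python
def target_distance(vector_dists, weights):
    """Normalize the per-feature vector distances, then take their
    weighted sum to obtain the target distance S.

    The sum-to-one normalization is an illustrative assumption."""
    total = sum(vector_dists) or 1.0  # guard against an all-zero case
    normalized = [d / total for d in vector_dists]
    return sum(w * d for w, d in zip(weights, normalized))
```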
In the preferred embodiment of the invention, the vector distances between the unrecognized voiceprint features and the plurality of voiceprint feature vectors in the standard voiceprint features are obtained through processing, and the target distance between the unrecognized voiceprint features and the standard voiceprint features is then obtained accurately through weighted summation of these vector distances, which further ensures the accuracy of voiceprint discrimination.
In a preferred embodiment of the present invention, before the step of collecting the unidentified voiceprint, the following steps are further included as shown in fig. 4:
step C1, collecting a plurality of voiceprints and obtaining background voiceprint characteristics of each voiceprint in the plurality of voiceprints so as to construct a plurality of background recognition models corresponding to the voiceprints, wherein the background voiceprint characteristics comprise a plurality of background voiceprint characteristic vectors;
and step C2, constructing a standard recognition model according to the background recognition model.
Specifically, in a preferred embodiment of the present invention, the voiceprints of multiple users are collected while they speak, and multiple background recognition models are constructed from the collected voiceprints. A standard recognition model covering the voiceprint characteristics of these users is then constructed from the background recognition models, so that the voiceprint recognition model used for voiceprint recognition is built in advance, shortening the model construction period and improving voiceprint recognition efficiency.
In a preferred embodiment of the present invention, the degree of discrimination of the unrecognized voiceprint may be obtained by processing the distance between the unrecognized voiceprint features and the standard voiceprint features; the steps, as shown in fig. 5, include:
step D1, processing to obtain the background distance between the unrecognized voiceprint feature and the background voiceprint feature of each voiceprint corresponding to the plurality of background recognition models;
step D2, processing according to the plurality of background distances to obtain a distance average value and a distance standard deviation;
step D3, processing to obtain the difference value between the target distance and the distance average value of the unidentified voiceprint features and the standard voiceprint features;
and D4, processing to obtain the ratio of the difference value to the distance standard deviation, and taking the ratio as the discrimination of the unidentified voiceprint.
In a preferred embodiment of the present invention, let S denote the target distance between the unrecognized voiceprint features of the unrecognized voiceprint and the standard voiceprint features, and suppose i voiceprints are collected in total to construct i background recognition models, where the background distances between the unrecognized voiceprint features and the i background voiceprint features corresponding to the i voiceprints are D1, D2, D3, ..., Di, respectively. The distance average u and the distance standard deviation σ of these background distances are then obtained through processing. The degree of discrimination of the unrecognized voiceprint is obtained by the following formula:
S' = (S - u) / σ (1)
Further, the discrimination S' of the unrecognized voiceprint is compared with a preset standard threshold; if the discrimination is greater than the standard threshold, the unrecognized voiceprint is identified as the locked voiceprint.
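Formula (1) and the threshold judgment can be sketched as below; `is_locked` is a hypothetical helper name, and the convention that a larger discrimination means a match follows the text above.

```python
from statistics import mean, pstdev

def discrimination(target_dist, background_dists):
    """S' = (S - u) / sigma, where u and sigma are the mean and
    standard deviation of the background distances D1..Di."""
    u = mean(background_dists)
    sigma = pstdev(background_dists)  # population standard deviation
    return (target_dist - u) / sigma

def is_locked(target_dist, background_dists, threshold):
    """Hypothetical helper: apply the standard-threshold judgment."""
    return discrimination(target_dist, background_dists) > threshold
```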
It should be noted that the hardware environment and conditions used to collect the unrecognized voiceprint may differ from those used when constructing the voiceprint recognition model; for example, the model of microphone may change. Such a change can cause a large deviation between the unrecognized voiceprint features of the unrecognized voiceprint and the standard voiceprint features of the voiceprint recognition model, which in turn affects the judgment of the unrecognized voiceprint. The degree of discrimination of the unrecognized voiceprint is therefore obtained by additionally processing the voiceprint features in the background recognition models, which further ensures the accuracy of the discrimination of the unrecognized voiceprint.
Specifically, as described in a preferred embodiment of the present invention, if the recording is collected with the same sound card, microphone, and other devices as those used to construct the background recognition models, the voiceprint features of the unrecognized voiceprint obtained after recording are close both to the voiceprint recognition model and to the background recognition models. If a different sound card, microphone, or other devices are used, the voiceprint features obtained after recording are farther both from the voiceprint recognition model and from the background recognition models; however, because both distances increase together, the unrecognized voiceprint remains relatively closer to the voiceprint recognition model than to the background recognition models.
In the preferred embodiment of the invention, the discrimination of the unrecognized voiceprint is obtained by combining the pre-trained voiceprint recognition model and the background recognition model, so that the problem of inaccurate calculation of the discrimination of the unrecognized voiceprint caused by the change of the environment and the condition for collecting the unrecognized voiceprint is solved.
In a preferred embodiment of the present invention, as described above, if the wavelet elements include real wavelet elements and/or complex wavelet elements, the step of obtaining the unrecognized voiceprint features in the unrecognized voiceprint is shown in fig. 6, and includes:
step E1, detecting voiced intervals in unrecognized voiceprints;
step E2, pitch intervals are detected in each voiced interval, and real wavelet elements and/or complex wavelet elements of the voiceprint features are obtained in each pitch interval.
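Step E1 is not fully specified at this point; a later passage mentions detecting voiced intervals using energy (e.g., the low/high band energy ratio) and the zero-crossing rate. The sketch below uses a simplified frame-level version of that idea; the thresholds and the use of plain energy rather than a band-energy ratio are illustrative assumptions.

```python
def voiced_frames(signal, frame_len=360, hop=180,
                  energy_thresh=0.01, zcr_thresh=0.25):
    """Flag a frame as voiced when its short-time energy is high and its
    zero-crossing rate is low (thresholds here are illustrative only)."""
    flags = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (frame_len - 1)
        flags.append(energy > energy_thresh and zcr < zcr_thresh)
    return flags
```

Consecutive voiced frames would then be merged into the voiced intervals within which pitch intervals are searched.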
Further, in this embodiment, the step of obtaining the real wavelet element and/or the complex wavelet element of the voiceprint feature in each pitch interval is specifically shown in fig. 7, and includes:
step E21, obtaining a predetermined feature vector in each pitch interval, dividing the feature vectors in the pitch intervals into sample vectors with a predetermined length according to the wavelet filter, and normalizing the sample vectors with the predetermined length;
step E22, performing wavelet transformation on the normalized sample vector with the preset length;
specifically, in the above step E22, the wavelet transform is performed in the manner described below:
1) performing real wavelet transformation on the normalized sample vector with the preset length to obtain a real part coefficient of a first preset frequency band, and selecting a frequency band meeting a first preset condition from the first preset frequency band for sampling to obtain a real wavelet element in the unidentified voiceprint feature;
2) and performing double-tree complex wavelet transformation on the normalized sample vector with the preset length to obtain a real part coefficient and an imaginary part coefficient of a second preset frequency band, and selecting a frequency band meeting a second preset condition from the second preset frequency band for sampling to obtain a complex wavelet element in the unidentified voiceprint feature.
In a preferred embodiment of the present invention, the predetermined length of the above sample vectors may be determined according to the length of the wavelet filter employed.
In a preferred embodiment of the present invention, after detecting the voiced intervals in the unrecognized voiceprint, the step of obtaining the unrecognized voiceprint features in the unrecognized voiceprint further includes, as shown in fig. 8:
step F1, obtaining the Mel cepstrum coefficient of each frame in the unidentified voiceprint to obtain the Mel cepstrum coefficient characteristic in the unidentified voiceprint characteristic;
and F2, processing according to the Mel cepstrum coefficient to obtain the difference Mel cepstrum coefficient characteristics of each frame in the unrecognized voiceprint so as to obtain the difference Mel cepstrum coefficient characteristics in the unrecognized voiceprint characteristics.
In a preferred embodiment of the present invention, voiced interval detection is performed on unrecognized voiceprints, for example as described above, followed by pre-emphasis processing. The pre-emphasis process is actually a process using a high-pass filter, and the specific formula is as follows:
y(n)=x(n)-0.9375*x(n-1) (2)
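Formula (2) is a first-order high-pass filter and can be sketched directly; how the first sample y(0) is handled is not defined in the text, so passing x(0) through unchanged is an assumption.

```python
def pre_emphasis(x, alpha=0.9375):
    """y(n) = x(n) - 0.9375 * x(n-1), per formula (2).
    The first sample is passed through unchanged (an assumption)."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]
```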
Then, as described above, feature extraction is performed on the pre-emphasized voiceprint. After several processing steps, such as a 3-level real wavelet transform, a 3-level dual-tree complex wavelet transform, mel cepstrum coefficient extraction, and differential processing of the mel cepstrum coefficients, the differential mel cepstrum coefficients are obtained, yielding 10 groups of 20-dimensional voiceprint feature vectors.
In the preferred embodiment of the invention, the wavelet elements in the voiceprint features are obtained, so that new features are formed by combining the wavelet elements with the original features. Because the wavelet elements can reflect speech characteristics that the original features cannot capture, the accuracy and stability of the voiceprint recognition system with priority are improved.
Specifically, the following description is made by taking an example that the voiceprint features in the voiceprint recognition model include 10 voiceprint feature vectors:
For example, a background recognition model is constructed from the features collected from the voice data of each of dozens of speakers. Each background recognition model includes 10 VQ codebooks, one per feature type: a mel cepstrum, a differential mel cepstrum, 4 real wavelet elements, and 4 complex wavelet elements, each feature being a 20-dimensional vector. A UBM model is then constructed from the background recognition models. Further, an unidentified voiceprint is registered: the unidentified voiceprint is collected, its features are obtained, and each feature group is adapted through the corresponding VQ codebook of the UBM model, thereby constructing the VQ codebooks in the voiceprint recognition model (i.e., the standard voiceprint feature vectors in the standard voiceprint features).
Further, mel cepstrum coefficients, differential mel cepstrum coefficients, and 8 wavelet elements (4 real wavelets and 4 complex wavelets) in each codebook are acquired.
Specifically, a voiced interval is detected in the input signal {s(i): i = 0, ..., N-1}; voiced intervals are detected using energy measures, such as the energy ratio of the low and high frequency bands, and the zero-crossing rate. Pre-emphasis processing is then performed on the input signal.
s′(i)=s(i)-0.9375*s(i-1),i=1,...,N-1;
The following operations are then performed on the pre-emphasized voiceprint as shown in fig. 9:
and G1, processing to obtain the Mel cepstrum coefficient of each frame, wherein each frame has 360 samples, and the frame interval is 180 samples.
The dimension of the resulting mel-frequency cepstrum vector is 20.
{MFCCi,i=0,...,Nm-1};
{MFCCi={MFCCi(k)};k=0,...,19};
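The framing convention of step G1 (360-sample frames with a 180-sample hop) can be sketched as follows; the mel cepstrum computation applied to each frame is omitted here.

```python
def frames(signal, frame_len=360, hop=180):
    """Split a voiced interval into overlapping frames (360 samples,
    180-sample hop), ready for per-frame mel cepstrum extraction."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]
```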
And G2, for each frame, processing the difference of the obtained Mel cepstrum vectors to form a difference Mel cepstrum vector.
DMFCC_i = MFCC_{i+2} - MFCC_{i-2};
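The differential mel cepstrum of step G2 follows directly from the formula above; how the first and last two frames are handled is not specified in the text, so this sketch simply skips them.

```python
def delta_mfcc(mfcc):
    """DMFCC_i = MFCC_{i+2} - MFCC_{i-2}, computed element-wise per frame.
    Edge frames (i < 2 or i >= N-2) are skipped (an assumption)."""
    return [[a - b for a, b in zip(mfcc[i + 2], mfcc[i - 2])]
            for i in range(2, len(mfcc) - 2)]
```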
And G3, detecting a pitch interval in each voiced interval, and processing each obtained pitch interval to obtain real wavelets and complex wavelets of pitch synchronization.
Wherein, pitch intervals and maximum peaks are detected in the input speech signal {s(i): i = 0, ..., N-1}, where N is the length of the speech interval and Np is the number of pitch intervals. The starting position and length of each pitch interval are as follows:
{Pit_st(i):i=0,...,Np-1};
{Pit_ln(i):i=0,...,Np-1};
further, the real wavelet is processed as follows:
Four 20-dimensional feature vectors are acquired for each pitch interval. For each pitch interval, an interval is cut out containing the pitch interval itself together with a certain number of samples before and after it, giving the following vectors:
{s(Pit_st(i) - l1), ..., s(Pit_st(i) + Pit_ln(i) + l1)}, i = 0, ..., Np-1;
the vector is then normalized to have a norm of 1.
For the above vector, a three-level real wavelet packet transform (e.g., using a Daubechies wavelet) is performed to obtain eight coefficient sequences:
{RWi0},i=1,...,8;
{RWi0}={RWi0(k)},k=1,...,M;
Each coefficient sequence corresponds to a specific frequency band, and all sequences have the same length, equal to 1/8 of the pitch interval length.
Among the 8 sequences obtained above, 4 sequences corresponding to the low frequency band are resampled to generate 4 20-dimensional vectors:
{RWi},i=1,...,4;
RWi = {RWi(k)}, k = 1, ..., 20;
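The real-wavelet procedure above (three-level wavelet packet transform, keep the four low-frequency bands, resample each to a fixed dimension) can be sketched with a Haar wavelet, the simplest member of the Daubechies family. The nearest-neighbour resampling and the natural band ordering are simplifying assumptions, and the toy dimensions below stand in for the 20-dimensional vectors of the text.

```python
def haar_packet_3(x):
    """3-level Haar wavelet packet transform: each level splits every
    sequence into a low-pass (scaled sum) and high-pass (scaled difference)
    half, giving 8 coefficient sequences of length len(x)/8.
    len(x) must be divisible by 8."""
    bands = [x]
    for _ in range(3):
        nxt = []
        for b in bands:
            lo = [(b[2 * i] + b[2 * i + 1]) / 2 ** 0.5 for i in range(len(b) // 2)]
            hi = [(b[2 * i] - b[2 * i + 1]) / 2 ** 0.5 for i in range(len(b) // 2)]
            nxt += [lo, hi]
        bands = nxt
    return bands

def resample(seq, n):
    """Nearest-neighbour resampling of a band to a fixed n-dim vector."""
    return [seq[int(i * len(seq) / n)] for i in range(n)]

def real_wavelet_features(x, dim=20):
    """Fixed-length vectors from the 4 lowest-index bands (an
    approximation of the 4 low-frequency bands kept in the text)."""
    return [resample(b, dim) for b in haar_packet_3(x)[:4]]
```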
further, the complex wavelet is processed in the following manner:
For each pitch interval, 4 20-dimensional feature vectors are acquired. The interval containing that pitch interval together with a certain number of samples before and after it is cut out, and the resulting vector is normalized so that its norm is 1.
For this interval, a three-level dual-tree complex wavelet packet transform (DT-CWPT) is performed to obtain coefficients corresponding to 8 frequency bands, each band having real and imaginary coefficients, where all coefficient sequences have the same length, equal to 1/8 of the pitch interval length. For each frequency band, a sequence of absolute values is obtained from the real and imaginary sequences.
Among the 8 sequences obtained above, 4 sequences corresponding to the low frequency band are resampled to generate 4 20-dimensional vectors:
{CWi},i=1,...,4;
CWi = {CWi(k)}, k = 1, ..., 20;
and G4, according to the 10 groups of feature sets obtained above, normalization processing is performed using a standard test method to obtain the similarity between the unrecognized voiceprint and the voiceprint recognition model. When the similarity is judged to be greater than the standard threshold, the voiceprints are recognized as the same, i.e., the speaker to be recognized is the same person as the speaker from whom the voiceprint recognition model was constructed.
In a preferred embodiment of the present invention, based on the above-mentioned voiceprint recognition method with priority, there is further provided a voiceprint recognition system a with priority, whose structure is specifically shown in fig. 9, and includes:
the first acquisition unit 1 is used for acquiring a voice film source;
the analysis unit 2 is connected with the first acquisition unit and used for analyzing the unrecognized voiceprint existing in the voice film source;
the statistical unit 3 is connected with the analysis unit 2 and used for performing statistics on the unidentified voiceprints and forming a statistical result;
the first processing unit 4 is used for carrying out priority sequencing on the unidentified voiceprints according to the statistical result;
an obtaining unit 5, connected to the first processing unit, configured to obtain an unidentified voiceprint feature in an unidentified voiceprint that is prioritized, where the unidentified voiceprint feature at least includes a wavelet element of the unidentified voiceprint;
the second processing unit 6 is connected with the acquisition unit 1 and is used for processing the discrimination of the unidentified voiceprint according to the unidentified voiceprint characteristics and the standard voiceprint characteristics in the voiceprint identification model;
the judging unit 7 is connected with the second processing unit 6 and used for judging whether the discrimination is greater than a preset standard threshold value or not and reserving the unidentified voiceprint of which the discrimination is greater than the standard threshold value;
an identifying unit 8 connected to the judging unit for selecting an unidentified voiceprint with the highest degree of discrimination among the retained unidentified voiceprints and identifying as a locked voiceprint;
the wavelet elements include real wavelet elements and/or complex wavelet elements, and the obtaining unit 5 further includes:
a detection module 51 for detecting voiced intervals in unrecognized voiceprints;
a first obtaining module 52, connected to the detecting module 51, is configured to detect a pitch interval in each voiced interval and obtain a real wavelet element and/or a complex wavelet element of a voiceprint feature in each pitch interval.
In a preferred embodiment of the present invention, as shown in fig. 9, the system a further includes:
and the adjusting unit 9 is connected with the first acquisition unit 1 and is used for adjusting the voiceprint feature vector parameters corresponding to the unidentified voiceprint feature vector in the pre-constructed standard identification model at least according to the unidentified voiceprint feature vector in the unidentified voiceprint features, so as to construct the standard voiceprint feature vector in the standard voiceprint features in the voiceprint identification model corresponding to the unidentified voiceprint.
In a preferred embodiment of the present invention, the unrecognized voiceprint feature comprises a plurality of unrecognized voiceprint feature vectors, and the standard voiceprint feature comprises a plurality of standard voiceprint feature vectors.
Then, as shown in fig. 10, the second processing unit 6 specifically includes:
the first processing module 61 is configured to process to obtain a vector distance between each unrecognized voiceprint feature vector in the unrecognized voiceprint features and each standard voiceprint feature vector corresponding to the unrecognized voiceprint feature vector in the standard voiceprint features;
the second processing module 62 is connected to the first processing module 61 and is configured to process the plurality of vector distances obtained through the processing to obtain a target distance between the unrecognized voiceprint feature and the standard voiceprint feature;
and the third processing module 63 is connected to the second processing module 62, and is configured to obtain the degree of discrimination of the unidentified voiceprint by using at least the target distance processing between the unidentified voiceprint features and the standard voiceprint features.
In a preferred embodiment of the present invention, as still shown in fig. 9, the system further includes:
the second acquisition unit 10 is configured to acquire a plurality of voiceprints and obtain background voiceprint characteristics of each voiceprint in the plurality of voiceprints to construct a plurality of background recognition models corresponding to the voiceprints, where the background voiceprint characteristics include a plurality of background voiceprint characteristic vectors;
and the construction unit 11 is connected with the second acquisition unit 10 and is used for constructing a standard identification model according to the background identification model.
Further, in a preferred embodiment of the present invention, as shown in fig. 11, the third processing module 63 includes:
the first processing sub-module 631 is configured to process the background distance between the unrecognized voiceprint feature and the background voiceprint feature of each voiceprint corresponding to the plurality of background recognition models;
the second processing submodule 632 is connected to the first processing submodule 631 and configured to process the plurality of background distances to obtain a distance average and a distance standard deviation;
the third processing submodule 633, connected to the second processing submodule 632, is configured to process to obtain the difference between the target distance of the unidentified voiceprint features from the standard voiceprint features and the distance average;
the fourth processing submodule 634 is connected to the second processing submodule 632 and the third processing submodule 633, respectively, and is configured to process to obtain the ratio of the difference value to the distance standard deviation, and use the ratio as the discrimination degree of the unidentified voiceprint.
In a preferred embodiment of the present invention, as shown in fig. 12, the first obtaining module 52 in the above description includes:
a first obtaining submodule 521, configured to obtain a predetermined feature vector in each pitch interval, divide the feature vectors in the multiple pitch intervals into sample vectors of a predetermined length according to a wavelet filter, and normalize the sample vectors of the predetermined length;
the transform submodule 522, connected to the first obtaining submodule 521, is configured to perform at least one of the following wavelet transforms on the normalized sample vector of the predetermined length:
performing real wavelet transformation on the normalized sample vector with the preset length to obtain a real part coefficient of a first preset frequency band, and selecting a frequency band meeting a first preset condition from the first preset frequency band for sampling to obtain a real wavelet element in the unidentified voiceprint feature;
and performing dual-tree complex wavelet transform on the normalized sample vector with the preset length to obtain a real part coefficient and an imaginary part coefficient of a second preset frequency band, and selecting a frequency band meeting a second preset condition from the second preset frequency band for sampling to obtain a complex wavelet element in the unidentified voiceprint feature.
In a preferred embodiment of the present invention, as also shown in fig. 9, the obtaining unit 5 described above further includes:
a second obtaining module 53, configured to, after detecting a voiced interval in an unidentified voiceprint, obtain a mel cepstrum coefficient of each frame in the unidentified voiceprint to obtain a mel cepstrum coefficient feature in the unidentified voiceprint feature;
and the fourth processing module 54 is connected to the second obtaining module 53, and configured to obtain a difference mel-frequency cepstrum coefficient feature of each frame in the unidentified voiceprint according to the mel-frequency cepstrum coefficient processing, so as to obtain the difference mel-frequency cepstrum coefficient features in the unidentified voiceprint features.

The preferred embodiments of the present invention described above are for illustrative purposes only and do not represent the merits of the embodiments.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed system may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, systems or units, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a mobile terminal, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The beneficial effects of the above technical scheme are: the voice to be recognized is processed in advance, so that the voice with high recognition frequency can be recognized through voiceprint preferentially, and the recognition efficiency is improved.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims (7)

1. A method for voiceprint recognition with priority, comprising:
collecting a voice film source, and analyzing an unrecognized voiceprint existing in the voice film source;
counting the unidentified voiceprints identified each time, and forming a counting result;
performing priority identification sequencing on the unidentified voiceprints according to the statistical result;
obtaining unidentified voiceprint features in each unidentified voiceprint according to a priority order, wherein the unidentified voiceprint features at least comprise wavelet elements of the unidentified voiceprint;
processing according to each unrecognized voiceprint feature and a standard voiceprint feature in a voiceprint recognition model to obtain a discrimination degree corresponding to each unrecognized voiceprint;
respectively judging whether each discrimination degree is greater than a preset standard threshold value, and reserving the unidentified voiceprints with the discrimination degrees greater than the standard threshold value;
selecting the unidentified voiceprint with the maximum discrimination degree from the retained unidentified voiceprints and identifying it as a locked voiceprint; and performing recognition, by an intelligent device having a voiceprint recognition function, according to the locked voiceprint;
the wavelet elements comprise real wavelet elements and/or complex wavelet elements, wherein the obtaining of the unidentified voiceprint features in the unidentified voiceprint comprises:
detecting voiced intervals in the unidentified voiceprint;
detecting a pitch interval in each of the voiced intervals, and acquiring the real wavelet elements and/or the complex wavelet elements of the voiceprint features in each of the pitch intervals;
before extracting the unrecognized voiceprint features in the unrecognized voiceprint, the method further comprises:
collecting the unidentified voiceprints;
adjusting voiceprint characteristic vector parameters corresponding to the unidentified voiceprint characteristic vectors in a pre-constructed standard identification model at least according to the unidentified voiceprint characteristic vectors in the unidentified voiceprint characteristics so as to construct standard voiceprint characteristic vectors in the standard voiceprint characteristics in the voiceprint identification model, which are adaptive to the unidentified voiceprint;
the unrecognized voiceprint features comprise a plurality of unrecognized voiceprint feature vectors, the standard voiceprint features comprise a plurality of standard voiceprint feature vectors, wherein the obtaining of the degree of discrimination of the unrecognized voiceprint at least according to the unrecognized voiceprint features and the standard voiceprint feature processing in the voiceprint recognition model comprises:
processing to obtain the vector distance between each unidentified voiceprint feature vector in the unidentified voiceprint features and each standard voiceprint feature vector corresponding to the unidentified voiceprint feature vector in the standard voiceprint features;
processing the plurality of vector distances obtained above to obtain the target distance between the unidentified voiceprint feature and the standard voiceprint feature;
processing by using at least the target distance between the unrecognized voiceprint feature and the standard voiceprint feature to obtain the discrimination of the unrecognized voiceprint;
before acquiring the unrecognized voiceprint, the method further comprises:
acquiring a plurality of voiceprints and obtaining background voiceprint characteristics of each voiceprint in the plurality of voiceprints so as to construct a plurality of background identification models corresponding to the voiceprints, wherein the background voiceprint characteristics comprise a plurality of background voiceprint characteristic vectors;
constructing the standard identification model according to the background identification model;
the processing to obtain the discrimination degree of the unidentified voiceprint by at least utilizing the distance between the unidentified voiceprint feature and the standard voiceprint feature comprises:
processing to obtain the background distance between the unrecognized voiceprint features and the background voiceprint features of each voiceprint corresponding to the plurality of background recognition models;
processing according to the plurality of background distances to obtain a distance average value and a distance standard deviation;
processing to obtain a difference value between the target distance of the unidentified voiceprint feature and the standard voiceprint feature and the distance average value;
processing to obtain a ratio of the difference value to the distance standard deviation, and taking the ratio as the discrimination of the unidentified voiceprint;
the obtaining the real wavelet elements and/or the complex wavelet elements of the voiceprint feature in each of the pitch intervals comprises:
acquiring a preset feature vector in each pitch interval, dividing the feature vectors in the pitch intervals into sample vectors with preset lengths according to a wavelet filter, and normalizing the sample vectors with the preset lengths;
performing at least one of the following wavelet transforms on the normalized sample vector of the predetermined length:
performing real wavelet transform on the normalized sample vector with the preset length to obtain a real part coefficient of a first preset frequency band, and selecting a frequency band meeting a first preset condition from the first preset frequency band for sampling to obtain the real wavelet element in the unidentified voiceprint feature;
performing dual-tree complex wavelet transform on the normalized sample vector with the predetermined length to obtain a real part coefficient and an imaginary part coefficient of a second predetermined frequency band, and selecting a frequency band meeting a second predetermined condition from the second predetermined frequency band for sampling to obtain the complex wavelet elements in the unidentified voiceprint features;
after detecting voiced intervals in the unidentified voiceprint, the obtaining unidentified voiceprint features in the unidentified voiceprint further comprises:
acquiring a Mel cepstrum coefficient of each frame in the unidentified voiceprint to obtain the Mel cepstrum coefficient characteristics in the unidentified voiceprint characteristics;
processing according to the Mel cepstrum coefficients to obtain a differential Mel cepstrum coefficient characteristic of each frame in the unidentified voiceprint, so as to obtain the differential Mel cepstrum coefficient characteristic in the unidentified voiceprint characteristics; and
a voiceprint recognition system with the priority, comprising:
the first acquisition unit is used for acquiring a speech source;
the analysis unit is connected with the first acquisition unit and used for analyzing the unrecognized voiceprints present in the speech source;
the statistical unit is connected with the identification unit and used for performing statistics on the unidentified voiceprints and forming a statistical result;
the first processing unit is used for carrying out priority sequencing on the unidentified voiceprints according to the statistical result;
the acquisition unit is connected with the first processing unit and used for acquiring unidentified voiceprint features in the unidentified voiceprints which are prioritized, wherein the unidentified voiceprint features at least comprise wavelet elements of the unidentified voiceprints;
the second processing unit is connected with the acquisition unit and is used for processing the discrimination degree of the unidentified voiceprint according to the unidentified voiceprint characteristics and standard voiceprint characteristics in a voiceprint identification model;
the judging unit is connected with the second processing unit and used for judging whether the discrimination is greater than a preset standard threshold value or not and reserving the unidentified voiceprint of which the discrimination is greater than the standard threshold value;
an identifying unit connected to the judging unit, for selecting the unidentified voiceprint with the highest discrimination degree from the retained unidentified voiceprints, and identifying as a locked voiceprint;
the wavelet elements include real wavelet elements and/or complex wavelet elements, and the obtaining unit includes:
a detection module for detecting voiced intervals in the unrecognized voiceprint;
a first obtaining module, connected to the detecting module, configured to detect a pitch interval in each of the voiced intervals, and obtain the real wavelet elements and/or the complex wavelet elements of the voiceprint features in each of the pitch intervals.
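The discrimination computation recited in claim 1 (per-vector distances, a combined target distance, then a ratio against background-model statistics) can be sketched as below. The Euclidean metric and the use of the mean to combine per-vector distances are assumptions for illustration; the claim leaves both unspecified.

```python
import math

def vector_distance(u, v):
    # Distance between one unrecognized feature vector and its
    # corresponding standard feature vector (Euclidean metric assumed).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def target_distance(unrecognized_vecs, standard_vecs):
    # Combine the per-vector distances into one target distance;
    # the mean is an assumption, since the claim leaves the rule open.
    dists = [vector_distance(u, s) for u, s in zip(unrecognized_vecs, standard_vecs)]
    return sum(dists) / len(dists)

def discrimination(target, background_dists):
    # As recited: (target distance - mean background distance)
    # divided by the background distance standard deviation.
    mean = sum(background_dists) / len(background_dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in background_dists) / len(background_dists))
    return (target - mean) / std
```

Normalizing by the background-model statistics in this way makes the threshold comparison in the claim independent of the absolute scale of the chosen distance metric.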
2. The voiceprint recognition system with priority according to claim 1, wherein the voiceprint recognition system with priority further comprises:
and the adjusting unit is connected with the first acquisition unit and used for adjusting the voiceprint characteristic vector parameters corresponding to the unidentified voiceprint characteristic vector in a pre-constructed standard identification model at least according to the unidentified voiceprint characteristic vector in the unidentified voiceprint characteristics so as to construct the standard voiceprint characteristic vector in the standard voiceprint characteristics in the voiceprint identification model, which is adaptive to the unidentified voiceprint.
3. The voiceprint recognition system according to claim 2, wherein the unrecognized voiceprint features comprise a plurality of the unrecognized voiceprint feature vectors, the standard voiceprint features comprise a plurality of the standard voiceprint feature vectors, and the second processing unit comprises:
the first processing module is used for processing to obtain the vector distance between each unidentified voiceprint feature vector in the unidentified voiceprint features and each standard voiceprint feature vector corresponding to the unidentified voiceprint feature vector in the standard voiceprint features;
the second processing module is connected with the first processing module and used for processing the plurality of vector distances obtained above to obtain the target distance between the unidentified voiceprint feature and the standard voiceprint feature;
and the third processing module is connected with the second processing module and used for processing by using the target distance between the unrecognized voiceprint feature and the standard voiceprint feature to obtain the discrimination of the unrecognized voiceprint.
4. The prioritized voiceprint recognition system according to claim 3, wherein the prioritized voiceprint recognition system further comprises:
the second acquisition unit is used for acquiring a plurality of voiceprints and acquiring background voiceprint characteristics of each voiceprint in the voiceprints so as to construct a plurality of background identification models corresponding to the voiceprints, wherein the background voiceprint characteristics comprise a plurality of background voiceprint characteristic vectors;
and the construction unit is connected with the second acquisition unit and used for constructing the standard identification model according to the background identification model.
5. The system of claim 4, wherein the third processing module comprises:
the first processing submodule is used for processing to obtain the background distance between the unidentified voiceprint features and the background voiceprint features of each voiceprint corresponding to the plurality of background identification models;
the second processing submodule is connected with the first processing submodule and is used for processing according to the plurality of background distances to obtain a distance average value and a distance standard deviation;
the third processing submodule is connected with the second processing submodule and used for processing to obtain the difference value between the target distance of the unidentified voiceprint feature and the standard voiceprint feature and the distance average value;
and the fourth processing submodule is respectively connected with the second processing submodule and the third processing submodule and used for processing to obtain a ratio of the difference value to the distance standard deviation, and the ratio is used as the discrimination of the unidentified voiceprint.
6. The system of claim 2, wherein the first obtaining module comprises:
the first obtaining submodule is used for obtaining a preset feature vector in each pitch interval, dividing the feature vectors in the pitch intervals into sample vectors with preset length according to a wavelet filter, and normalizing the sample vectors with the preset length;
a transform submodule, connected to the first obtaining submodule, for performing at least one of the following wavelet transforms on the normalized sample vector of the predetermined length:
performing real wavelet transform on the normalized sample vector with the preset length to obtain a real part coefficient of a first preset frequency band, and selecting a frequency band meeting a first preset condition from the first preset frequency band for sampling to obtain the real wavelet element in the unidentified voiceprint feature;
and performing dual-tree complex wavelet transform on the normalized sample vector with the preset length to obtain a real part coefficient and an imaginary part coefficient of a second preset frequency band, and selecting a frequency band meeting a second preset condition from the second preset frequency band for sampling to obtain the complex wavelet elements in the unidentified voiceprint features.
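The real-wavelet branch of claim 6 (normalize a predetermined-length sample vector, transform it, then keep a sub-band meeting a predetermined condition) can be sketched with a single Haar analysis level. The Haar filter, the L2 normalization, and the band-selection rule are assumptions here; the dual-tree complex branch would add a second, shifted filter pair producing the imaginary coefficients, which this sketch omits.

```python
import numpy as np

def real_wavelet_elements(sample, keep_detail=True):
    # Normalize the predetermined-length sample vector
    # (normalization scheme assumed to be L2).
    x = np.asarray(sample, dtype=float)
    x = x / (np.linalg.norm(x) or 1.0)
    # One Haar analysis level: low-pass and high-pass sub-bands.
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    # "Selecting a frequency band meeting a predetermined condition"
    # is modeled here as simply choosing one of the two bands.
    return detail if keep_detail else approx
```

A production implementation would use a multi-level decomposition with a longer filter (and a dual-tree pair for the complex elements), but the normalize-transform-select pipeline is the same.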
7. The voiceprint recognition system with priority according to claim 6, wherein the obtaining unit further comprises:
a second obtaining module, configured to, after detecting a voiced interval in the unidentified voiceprint, obtain a Mel cepstrum coefficient of each frame in the unidentified voiceprint to obtain the Mel cepstrum coefficient feature in the unidentified voiceprint features;
and a fourth processing module, connected with the second obtaining module and used for processing according to the Mel cepstrum coefficients to obtain the differential Mel cepstrum coefficient features of each frame in the unidentified voiceprint so as to obtain the differential Mel cepstrum coefficient features in the unidentified voiceprint features.
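The differential (delta) Mel cepstrum step in claim 7 can be sketched with the standard regression formula over neighboring frames. The window half-width N=2 and the edge-replication padding are assumptions for illustration; the patent does not fix either choice.

```python
import numpy as np

def delta_features(mfcc_frames, N=2):
    # First-order differential MFCCs: a regression over the N preceding
    # and N following frames, with edge frames repeated for padding.
    frames = np.asarray(mfcc_frames, dtype=float)
    padded = np.pad(frames, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return np.array([
        sum(n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)) / denom
        for t in range(len(frames))
    ])
```

On a constant cepstral track the deltas are zero, and on a linearly increasing track they recover the slope, which is what makes them useful as a complementary dynamic feature alongside the static coefficients.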
CN201610024164.7A 2016-01-14 2016-01-14 Voiceprint recognition method and system with priority Active CN106971725B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610024164.7A CN106971725B (en) 2016-01-14 2016-01-14 Voiceprint recognition method and system with priority

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610024164.7A CN106971725B (en) 2016-01-14 2016-01-14 Voiceprint recognition method and system with priority

Publications (2)

Publication Number Publication Date
CN106971725A CN106971725A (en) 2017-07-21
CN106971725B true CN106971725B (en) 2021-06-15

Family

ID=59334423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610024164.7A Active CN106971725B (en) 2016-01-14 2016-01-14 Voiceprint recognition method and system with priority

Country Status (1)

Country Link
CN (1) CN106971725B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110265038B (en) * 2019-06-28 2021-10-22 联想(北京)有限公司 Processing method and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001022385A (en) * 1999-07-07 2001-01-26 Yoshimi Baba Individual authentication by voiceprint and its enhancing method
JP2002244690A (en) * 2001-02-16 2002-08-30 Chikako Hayashi System making it possible to converse with dead person by analyzing and recording wavelength, voiceprint, etc., of the person before death
CN102231277A (en) * 2011-06-29 2011-11-02 电子科技大学 Method for protecting mobile terminal privacy based on voiceprint recognition
CN103489454A (en) * 2013-09-22 2014-01-01 浙江大学 Voice endpoint detection method based on waveform morphological characteristic clustering
CN104485102A (en) * 2014-12-23 2015-04-01 智慧眼(湖南)科技发展有限公司 Voiceprint recognition method and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8903859B2 (en) * 2005-04-21 2014-12-02 Verint Americas Inc. Systems, methods, and media for generating hierarchical fused risk scores
CN103426428B (en) * 2012-05-18 2016-05-25 华硕电脑股份有限公司 Audio recognition method and system
CN103226951B (en) * 2013-04-19 2015-05-06 清华大学 Speaker verification system creation method based on model sequence adaptive technique
CN103400580A (en) * 2013-07-23 2013-11-20 华南理工大学 Method for estimating importance degree of speaker in multiuser session voice
CN105138541B (en) * 2015-07-08 2018-02-06 广州酷狗计算机科技有限公司 The method and apparatus of audio-frequency fingerprint matching inquiry
CN105139858B (en) * 2015-07-27 2019-07-26 联想(北京)有限公司 A kind of information processing method and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001022385A (en) * 1999-07-07 2001-01-26 Yoshimi Baba Individual authentication by voiceprint and its enhancing method
JP2002244690A (en) * 2001-02-16 2002-08-30 Chikako Hayashi System making it possible to converse with dead person by analyzing and recording wavelength, voiceprint, etc., of the person before death
CN102231277A (en) * 2011-06-29 2011-11-02 电子科技大学 Method for protecting mobile terminal privacy based on voiceprint recognition
CN103489454A (en) * 2013-09-22 2014-01-01 浙江大学 Voice endpoint detection method based on waveform morphological characteristic clustering
CN104485102A (en) * 2014-12-23 2015-04-01 智慧眼(湖南)科技发展有限公司 Voiceprint recognition method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"A Study on voiceprint identification system under the Internet environment";Haiyan Yang;《IEEE》;20110908;全文 *
"基于VQ和GMM的实时声纹识别研究";鲁小倩;《计算机系统应用》;20140915;全文 *

Also Published As

Publication number Publication date
CN106971725A (en) 2017-07-21

Similar Documents

Publication Publication Date Title
Tiwari MFCC and its applications in speaker recognition
WO2017162017A1 (en) Method and device for voice data processing and storage medium
EP3599606A1 (en) Machine learning for authenticating voice
CN111524527B (en) Speaker separation method, speaker separation device, electronic device and storage medium
CN104485102A (en) Voiceprint recognition method and device
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
CN106971724A (en) A kind of anti-tampering method for recognizing sound-groove and system
CN112382300A (en) Voiceprint identification method, model training method, device, equipment and storage medium
CN110767239A (en) Voiceprint recognition method, device and equipment based on deep learning
CN111816185A (en) Method and device for identifying speaker in mixed voice
CN112735435A (en) Voiceprint open set identification method with unknown class internal division capability
Maazouzi et al. MFCC and similarity measurements for speaker identification systems
CN106971725B (en) Voiceprint recognition method and system with priority
KR101208678B1 (en) Incremental personal autentication system and method using multi bio-data
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
Wiśniewski et al. Automatic detection of prolonged fricative phonemes with the hidden Markov models approach
Kekre et al. Speaker identification using row mean vector of spectrogram
CN114038469B (en) Speaker identification method based on multi-class spectrogram characteristic attention fusion network
CN106971732A (en) A kind of method and system that the Application on Voiceprint Recognition degree of accuracy is lifted based on identification model
Ganoun et al. Performance analysis of spoken arabic digits recognition techniques
Hossan et al. Speaker recognition utilizing distributed DCT-II based Mel frequency cepstral coefficients and fuzzy vector quantization
CN112309404A (en) Machine voice identification method, device, equipment and storage medium
CN106887229A (en) A kind of method and system for lifting the Application on Voiceprint Recognition degree of accuracy
CN111524524A (en) Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and storage medium
Bora et al. Speaker identification for biometric access control using hybrid features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant