CN110459226A

CN110459226A - Method for identity verification by detecting human voice or machine sound with a voiceprint engine

Info

Publication number: CN110459226A
Application number: CN201910765307.3A
Authority: CN
Inventors: 任超; 钟亚希; 陈志骏
Original assignee: Effective Software Technology (shanghai) Co Ltd
Current assignee: Effective Software Technology (shanghai) Co Ltd
Priority date: 2019-08-19
Filing date: 2019-08-19
Publication date: 2019-11-15

Abstract

The invention discloses a method for identity verification in which a voiceprint engine detects whether a sound is a live human voice or a machine sound. The method comprises the following steps: (A) receiving the audio to be verified; (B) preprocessing the received audio, the preprocessing comprising a) duration detection, b) silence detection, c) pre-emphasis, and d) framing and windowing; (C) performing voiceprint comparison to determine identity. The voiceprint engine can efficiently, quickly and accurately determine whether a detected sound is spoken live by the speaker or is a recording played through an audio player, and whether the sound belongs to the claimed person. This gives voiceprint recognition applications higher accuracy and efficiency, and the method has a wide range of application scenarios.

Description

Method for identity verification by detecting human voice or machine sound with a voiceprint engine

Technical field

The present invention relates to the technical field of identity recognition, and specifically to a method for identity verification by detecting human voice or machine sound with a voiceprint engine.

Background

The theoretical basis of voiceprint recognition is that every voice has unique characteristics, so the voices of different people can be effectively distinguished. These characteristics come mainly from two sources. 1. The size of the vocal cavities, including the throat, nasal cavity and oral cavity. The shape, size and position of these organs determine the tension of the vocal cords and the range of sound frequencies. Therefore, even when different people say the same words, the frequency distributions of their voices differ, so some voices sound deep and others sound bright. Each person's vocal cavities are different, so, like a fingerprint, each person's voice has unique characteristics. 2. The way the articulators are manipulated. The articulators include the lips, teeth, tongue, soft palate and palatal muscles, and their interaction produces clear speech. How they cooperate is learned incidentally as a person communicates with the people around them. While learning to speak, a person imitates the speech of different people nearby and gradually forms their own voiceprint characteristics.

Voiceprint recognition technology was first developed by AT&T Labs and was mainly used in military intelligence. As the technology matured, from the late 1960s it was adopted in the United States in fields such as forensic identification and court evidence. Since 1967, voiceprint recognition has provided effective clues and strong evidence in at least 5,000 U.S. cases, including murder, rape, extortion, drug smuggling, gambling and political corruption. Notably, voiceprint identification is now covered by a standard of the Ministry of Public Security and can be admitted as evidence.

Voiceprint recognition is a broad concept and is technically divided into two classes: speaker verification and speaker identification. Speaker verification determines whether an unknown speaker is a particular claimed person; speaker identification determines which of a set of enrolled speakers an unknown speaker is.

What is commonly meant by voiceprint recognition is speaker identification, which is typically applied to criminal investigation, tracking criminals, national defense monitoring, personalized applications and so on. Speaker verification is typically applied to securities trading, banking transactions, forensic evidence collection by public security agencies, voice-controlled PC locks, voice-controlled car locks, and identity card and credit card authentication.

At present, common voiceprint recognition methods include template matching, nearest-neighbor methods, neural network learning algorithms and VQ (vector quantization) clustering. Although these methods use different processing techniques, their basic principles are similar, as the spectrogram illustrates. A spectrogram is an image representation of a speech signal: the horizontal axis represents time, the vertical axis represents frequency, and the amplitude of the speech at each frequency point is shown by color. The fundamental frequency and harmonics of a speaker's voice appear on the spectrogram as a series of bright lines. The similarity between different spectrograms can then be computed by various processing methods, which ultimately achieves voiceprint recognition.
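To make the spectrogram description concrete, here is a minimal sketch of computing a magnitude spectrogram with a short-time Fourier transform in NumPy. The frame length, hop size and the function name `spectrogram` are illustrative assumptions, not details from this patent.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram with shape (num_frames, frame_len // 2 + 1).

    Rows are time (the horizontal axis of a plotted spectrogram) and columns
    are frequency bins (the vertical axis); the values are what a plot would
    map to colour. Assumes len(signal) >= frame_len.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hamming(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(num_frames)])
    # One-sided FFT magnitude of every windowed frame.
    return np.abs(np.fft.rfft(frames, axis=1))

# A steady 1 kHz tone sampled at 8 kHz appears as a single bright line.
fs = 8000
t = np.arange(fs) / fs
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
peak_bin = int(spec.mean(axis=0).argmax())
print(peak_bin * fs / 256)  # → 1000.0
```

Each frequency bin is `fs / frame_len` = 31.25 Hz wide here, so the tone lands exactly in bin 32.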

The speaker verification function of voiceprint recognition plays a very important role in protecting people's information and property in daily life. For example, using voiceprint recognition at login during credit card use, banking transactions or system access helps protect people's property and information. Remote identity confirmation is also used in routine corporate attendance checks. However, in all of these identity verification processes, a recording can be used to pass authentication, which creates a system vulnerability and seriously undermines the reliability of the voiceprint system. Without good controls against this, the speaker verification function of voiceprint recognition becomes merely decorative. In addition, for detecting machine sound, some payment services extract the background sound and use a secondary auxiliary verification to judge whether machine sound is present; to some extent this prevents users from passing verification directly with a recording. However, relying on background sound for secondary verification not only makes user verification more complex, it also cannot directly determine whether a recording is present, and a user's background sound may change significantly from one second to the next, causing recognition to fail. Moreover, verification usually has strict real-time requirements, so the program must respond quickly.

Therefore, a technology capable of detecting whether a sound is a human voice or a machine sound is needed, so that the speaker verification function of voiceprint recognition can verify speakers more accurately.

Summary of the invention

The purpose of the present invention is to provide a method for identity verification by detecting human voice or machine sound with a voiceprint engine, so as to solve the problems described in the background above.

To achieve the above object, the invention provides the following technical scheme:

A method for identity verification by detecting human voice or machine sound with a voiceprint engine comprises the following steps:

A. Receive the audio to be verified;

B. Preprocess the received audio, the preprocessing comprising: a) duration detection; b) silence detection; c) pre-emphasis; d) framing and windowing;

C. Perform voiceprint comparison to determine identity.

As a further technical solution of the present invention, step a) is specifically: based on the incoming speech, determine whether its duration meets the configured detection-duration requirement; if it does not, return an error message directly.

As a further technical solution of the present invention, step b) is specifically: detect whether the decibel values of the incoming speech exceed a set threshold; when the decibel values of all audio samples are below the threshold, return an error message.

As a further technical solution of the present invention, step c) is implemented by a high-pass filter with the transfer function $H(z) = 1 - \mu z^{-1}$, where $\mu$ is the pre-emphasis coefficient, $z$ is the z-transform variable, and $H(z)$ is the transfer function of the filter.

As a further technical solution of the present invention, step d) is implemented by weighting with a movable window function of finite length, using the Hamming window as the window function:

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$

where $N$ is the window length, $n$ is the index of the sample point within the window, and $w(n)$ is the amplitude of the window at sample point $n$.

As a further technical solution of the present invention, when the voiceprint engine's switch for detecting machine sound is turned on, the voiceprint engine, after preprocessing the received audio, asynchronously performs two steps: voiceprint comparison and verification of whether the audio is machine sound. Asynchronous here means the two steps are carried out simultaneously using multiple threads. With the switch on, the engine, in parallel, obtains the feature model of the speech sample from the voiceprint model database and extracts the speech feature vector, and obtains the noise feature model and extracts the optimal classification surface of the noise.

As a further technical solution of the present invention, the voiceprint verification is specifically as follows. During speaker comparison, the voiceprint engine first extracts voiceprint features from the preprocessed received audio, and then uses pattern matching to verify whether the received audio and the sample speech model under test belong to the same person. Pattern matching is implemented with a probability-maximizing matching algorithm, whose recognition function is:

$$n^* = \arg\max_{1 \le n \le N} P(\lambda_n \mid z)$$

where $n^*$ is the identified speaker, $N$ is the number of speakers, $\lambda_n$ is the GMM of the $n$-th speaker, and $z$ is the observation vector. The feature parameters extracted from the speech signal to be recognized are matched probabilistically against the voiceprint model feature vectors, and the identity with the highest matching probability is the recognition result.

As a further technical solution of the present invention, during machine-sound detection the channel pattern noise is first extracted from the preprocessed received audio. The long-term feature vector of the audio to be verified is then extracted using Legendre polynomial coefficients and statistical feature values, and finally an SVM classification method determines whether the audio under test is machine sound. The Legendre polynomial fitting expression is:

$$f(x) \approx \sum_{n=0}^{K} L_n P_n(x), \quad x \in [-1, 1]$$

where $f(x)$ is the fitted curve, $L_n$ are the Legendre polynomial coefficients, $K$ is the fitting order, and $P_n$ is the Legendre polynomial of degree $n$, defined as:

$$P_n(x) = \frac{1}{2^n n!} \frac{d^n}{dx^n} \left(x^2 - 1\right)^n$$

Because the channel noise generated by the device is present throughout the speech signal and changes very slowly, statistical feature values are used to represent the frame-level statistical characteristics of that channel noise. The values of these two kinds of channel pattern noise features together form the feature vector used to judge whether the speech is playback, which completes the extraction of the long-term feature vector. The SVM classification method then uses the SVM classification function to classify the extracted long-term feature parameters of the audio to be verified against the noise optimal classification surface extracted from the voiceprint model database.

Compared with the prior art, the present invention has the following advantages: it performs identity verification by detecting human voice or machine sound with a voiceprint engine. The voiceprint engine can efficiently, quickly and accurately determine whether a detected sound is spoken live by the speaker or is a recording played through an audio player, and whether the sound belongs to the claimed person. This gives voiceprint recognition applications higher accuracy and efficiency, and the method has a wide range of application scenarios.

Brief description of the drawings

Fig. 1 is a flowchart of recognition by the voiceprint engine.

Fig. 2 is a flowchart of preprocessing.

Fig. 3 is a flowchart of extracting the long-term feature vector.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.

Referring to Figs. 1-3, a method for identity verification by detecting human voice or machine sound with a voiceprint engine comprises the following steps:

A. Receive the audio to be verified;

B. Preprocess the received audio;

C. Perform voiceprint comparison to determine identity.

In the present invention, original speech means that the system acquires the speaker's utterance directly with its own signal acquisition device, without any intermediate device. Playback speech means that the speaker's voice was first captured by a recording device, so what the system acquires is the speaker's utterance from a recording file. Because playback speech passes through the additional channel of the recording device before acquisition, this extra channel adds the playback equipment's channel noise to the speech segment.

The voiceprint model library holds corresponding voiceprint models both for speakers' speech model features and for environmental noise models.

When the voiceprint engine receives audio to be verified, it first preprocesses the audio, as shown in Fig. 2.

Duration detection: based on the incoming speech, determine whether its duration meets the configured detection-duration requirement; if it does not, return an error message directly.
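A minimal sketch of duration detection, assuming the audio arrives as a sample count plus a sample rate. The function name `check_duration` and the 2-second minimum are hypothetical; the patent only refers to a configured duration requirement.

```python
def check_duration(num_samples, sample_rate, min_seconds=2.0, max_seconds=None):
    """Duration detection: return None if acceptable, otherwise an error message.

    min_seconds / max_seconds are hypothetical settings standing in for the
    configured detection-duration requirement; the patent gives no values.
    """
    duration = num_samples / sample_rate
    if duration < min_seconds:
        return f"audio too short: {duration:.2f} s < {min_seconds} s"
    if max_seconds is not None and duration > max_seconds:
        return f"audio too long: {duration:.2f} s > {max_seconds} s"
    return None  # requirement met; continue with silence detection

print(check_duration(48000, 16000))  # 3 s of 16 kHz audio → None
print(check_duration(16000, 16000))  # → audio too short: 1.00 s < 2.0 s
```

Returning the message rather than raising mirrors the described behavior of handing an error straight back to the caller.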

Silence detection: detect whether the decibel values of the incoming speech exceed a set threshold; when the decibel values of all audio samples are below the threshold, return an error message.
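The silence check can be sketched as follows, assuming floating-point samples normalised to [-1, 1] and measuring each sample's level in dBFS. The -40 dBFS threshold and the name `is_silent` are hypothetical.

```python
import math

def is_silent(samples, threshold_db=-40.0):
    """Silence detection: True when no sample's level exceeds threshold_db.

    Samples are assumed to be floats normalised to [-1.0, 1.0], so each
    sample's level is 20 * log10(|x|) in dBFS; -40 dBFS is a hypothetical
    threshold, not a value from the patent. Zero samples count as silent.
    """
    for x in samples:
        if x != 0.0 and 20.0 * math.log10(abs(x)) > threshold_db:
            return False  # at least one point is louder than the threshold
    return True  # no point exceeds the threshold: caller returns an error

print(is_silent([0.001, -0.002, 0.0]))  # → True  (loudest point is about -54 dBFS)
print(is_silent([0.001, 0.5]))          # → False (0.5 is about -6 dBFS)
```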

Pre-emphasis: to compensate for the loss of high-frequency components caused by attenuation, the high-frequency components are boosted before the spectrum of the speech signal is analyzed. This flattens the spectrum so that the high-frequency and low-frequency bands differ less, and improves the resolution of the high-frequency components, making the pre-emphasized signal better suited to spectral analysis or channel-parameter analysis. The process is implemented by a high-pass filter with the transfer function $H(z) = 1 - \mu z^{-1}$, where $\mu$ is the pre-emphasis coefficient.
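In the time domain, $H(z) = 1 - \mu z^{-1}$ corresponds to the difference equation $y(n) = x(n) - \mu x(n-1)$. The sketch below applies that equation; the coefficient μ = 0.97 is a commonly used value assumed here, since the patent does not state one.

```python
def pre_emphasis(samples, mu=0.97):
    """Apply H(z) = 1 - mu * z^-1, i.e. y[n] = x[n] - mu * x[n-1].

    mu = 0.97 is an assumed, commonly used value. The first sample is
    passed through unchanged (x[-1] is taken as 0).
    """
    if not samples:
        return []
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - mu * prev)
    return out

# A constant (lowest-frequency) input is attenuated, while a signal that
# alternates sign every sample (highest frequency) is amplified.
print(pre_emphasis([1.0, 1.0, 1.0], mu=0.5))   # → [1.0, 0.5, 0.5]
print(pre_emphasis([1.0, -1.0, 1.0], mu=0.5))  # → [1.0, -1.5, 1.5]
```

The two usage lines show the high-pass behavior directly: the gain at DC is $1 - \mu$ and the gain at the Nyquist frequency is $1 + \mu$.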

Framing and windowing: this step mainly determines the frame length and the frame shift, both measured in milliseconds (ms); the speech within one frame length is called a speech frame. This patent implements it by weighting with a movable window function of finite length, using the Hamming window as the window function:

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$

where $N$ is the window length and $n$ is the index of the sample point within the window.
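A sketch of framing and Hamming windowing, assuming 25 ms frames with a 10 ms shift (typical values, not taken from the patent) and dropping any trailing partial frame. The helper names `hamming` and `frame_and_window` are hypothetical.

```python
import math

def hamming(N):
    """w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)) for 0 <= n <= N - 1."""
    if N == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def frame_and_window(samples, sample_rate, frame_ms=25, shift_ms=10):
    """Split samples into overlapping frames, each weighted by a Hamming window.

    frame_ms / shift_ms are assumed typical values. Samples at the end that
    do not fill a whole frame are dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames

# One second of 16 kHz audio: 400-sample frames advancing by 160 samples.
frames = frame_and_window([1.0] * 16000, 16000)
print(len(frames), len(frames[0]))  # → 98 400
```

The window tapers each frame to 0.08 at its edges, which reduces the spectral leakage caused by cutting the signal into frames.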

When the voiceprint engine's switch for detecting machine sound is turned on, the voiceprint engine, after preprocessing the received audio, asynchronously performs two steps: speaker comparison and verification of whether the audio is machine sound.
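The asynchronous comparison can be sketched with a two-worker thread pool that runs both checks at once. The callables `compare_speaker` and `detect_machine_sound`, the `check_machine` flag, and the rule "accept only a live utterance from the claimed speaker" are assumptions about how the two results might be combined; the patent does not specify this.

```python
from concurrent.futures import ThreadPoolExecutor

def verify(features, compare_speaker, detect_machine_sound, check_machine=True):
    """Run speaker comparison and machine-sound detection concurrently.

    compare_speaker and detect_machine_sound are hypothetical caller-supplied
    callables; each takes the preprocessed features and returns a bool.
    check_machine mirrors the engine's "detect machine sound" switch.
    """
    if not check_machine:
        return compare_speaker(features)
    with ThreadPoolExecutor(max_workers=2) as pool:
        same_speaker = pool.submit(compare_speaker, features)
        is_machine = pool.submit(detect_machine_sound, features)
        # Accept only a live utterance from the claimed speaker.
        return same_speaker.result() and not is_machine.result()

print(verify([0.1, 0.2], lambda f: True, lambda f: False))  # → True
print(verify([0.1, 0.2], lambda f: True, lambda f: True))   # → False
```

Leaving the `with` block waits for both workers, so no task outlives the call even when the first result already decides the outcome.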

During speaker comparison, the voiceprint engine first extracts voiceprint features from the preprocessed received audio, and then uses pattern matching to verify whether the received audio and the sample speech model under test belong to the same person.

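Pattern matching is implemented with a probability-maximizing matching algorithm, which provides the module's recognition function:

$$n^* = \arg\max_{1 \le n \le N} P(\lambda_n \mid z)$$

where $n^*$ is the identified speaker, $N$ is the number of speakers, $\lambda_n$ is the GMM of the $n$-th speaker, and $z$ is the observation vector.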

The feature parameters extracted from the speech signal to be recognized are matched probabilistically against the voiceprint model feature vectors, and the identity with the highest matching probability is the recognition result.
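A minimal sketch of this maximum-probability matching rule, assuming each enrolled speaker's model is an already-trained GMM with diagonal covariances and that all speakers have equal prior probability. Under that assumption, maximizing the posterior probability of speaker model $\lambda_n$ reduces to maximizing the log-likelihood summed over frames. Model training and feature extraction are out of scope, and the function names are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(Z, weights, means, variances):
    """log p(Z | lambda) = sum_t log sum_m w_m N(z_t; mu_m, diag(var_m)).

    Z: (T, D) observation vectors; weights: (M,); means, variances: (M, D).
    """
    Z = np.asarray(Z, dtype=float)
    diff = Z[:, None, :] - means[None, :, :]                     # (T, M, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)    # (T, M)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_weighted = log_gauss + np.log(weights)                   # (T, M)
    # Log-sum-exp over mixture components, for numerical stability.
    peak = log_weighted.max(axis=1, keepdims=True)
    per_frame = peak[:, 0] + np.log(np.exp(log_weighted - peak).sum(axis=1))
    return float(per_frame.sum())

def identify(Z, models):
    """n* = argmax_n log p(Z | lambda_n); equals argmax_n P(lambda_n | Z)
    when all speakers have equal prior probability."""
    scores = [gmm_log_likelihood(Z, *model) for model in models]
    return int(np.argmax(scores))

# Two hypothetical enrolled speakers, each a one-component 2-D GMM.
models = [
    (np.array([1.0]), np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]])),
    (np.array([1.0]), np.array([[3.0, 3.0]]), np.array([[1.0, 1.0]])),
]
Z = np.array([[2.9, 3.1], [3.2, 2.8]])
print(identify(Z, models))  # → 1
```

For verification rather than identification, the same score could be compared against a threshold for the single claimed speaker instead of taking the argmax.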

During machine-sound detection, the channel pattern noise is first extracted from the preprocessed received audio. The long-term feature vector of the audio to be verified is then extracted using Legendre polynomial coefficients and statistical feature values, and finally an SVM classification method determines whether the audio under test is machine sound. The Legendre polynomial fitting expression is:

$$f(x) \approx \sum_{n=0}^{K} L_n P_n(x), \quad x \in [-1, 1]$$

where $f(x)$ is the fitted curve, $L_n$ are the Legendre polynomial coefficients, $K$ is the fitting order, and $P_n$ is the Legendre polynomial of degree $n$, defined as:

$$P_n(x) = \frac{1}{2^n n!} \frac{d^n}{dx^n} \left(x^2 - 1\right)^n$$

Because the channel noise generated by the device is present throughout the speech signal and changes very slowly, statistical feature values are used to represent the frame-level statistical characteristics of that channel noise. The values of these two kinds of channel pattern noise features together form the feature vector used to judge whether the speech is playback, which completes the extraction of the long-term feature vector.
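The sketch below shows one way to build such a long-term feature vector with NumPy's Legendre module (`numpy.polynomial.legendre.legfit`). What exactly is fitted is not specified in the text, so this assumes a per-frame channel-noise spectrum is already available, fits the frame-averaged spectrum over a frequency axis mapped to [-1, 1], and uses the mean and standard deviation of the per-frame noise level as the statistical features. The fitting order 4 and the function name are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre

def long_term_features(noise_frames, order=4):
    """Long-term feature vector from per-frame channel-noise spectra.

    noise_frames: (T, F) array, one estimated channel-noise spectrum per
    frame (how the noise is estimated is outside this sketch). Returns the
    Legendre coefficients L_0..L_order of the frame-averaged noise spectrum,
    fitted over a frequency axis mapped to [-1, 1], followed by two frame
    statistics: the mean and standard deviation of the per-frame noise level.
    """
    noise_frames = np.asarray(noise_frames, dtype=float)
    mean_spectrum = noise_frames.mean(axis=0)
    x = np.linspace(-1.0, 1.0, mean_spectrum.size)
    coeffs = legendre.legfit(x, mean_spectrum, order)  # least-squares L_n
    frame_level = noise_frames.mean(axis=1)            # one value per frame
    stats = np.array([frame_level.mean(), frame_level.std()])
    return np.concatenate([coeffs, stats])

# A noise spectrum shaped exactly like P_2(x) = (3x^2 - 1) / 2 should give
# coefficients close to [0, 0, 1, 0, 0]; identical frames give a std of 0.
x = np.linspace(-1.0, 1.0, 64)
frames = np.tile((3 * x**2 - 1) / 2, (5, 1))
print(np.round(long_term_features(frames)[:5], 6))
```

Fitting Legendre coefficients compresses a spectrum of any length into a fixed, short vector, which is convenient as input to a classifier.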

Judging whether an input utterance is playback speech is essentially a binary classification problem, and the SVM (support vector machine) is a well-established binary classification algorithm; it is a statistical method based on structural risk minimization.

Embodiment 2: on the basis of Embodiment 1, the SVM classification method uses the SVM classification function to classify the extracted long-term feature parameters of the audio to be verified against the noise optimal classification surface extracted from the voiceprint model database.
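The patent does not name a kernel or training procedure, so the following is only a sketch under the assumption of a linear SVM trained with the Pegasos stochastic subgradient method; the learned weight vector plays the role of the stored noise optimal classification surface. Labels +1 (playback or machine sound) and -1 (live voice) are an assumed convention, and a library implementation such as scikit-learn's `SVC` could be used instead.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Fit a linear SVM (hinge loss + L2 penalty) with Pegasos.

    X: (n, d) long-term feature vectors; y: labels in {-1, +1}.
    A constant 1 is appended to each x so the bias is learned as an extra
    weight (and, as a common simplification, regularised with the rest).
    The hyperplane w . [x, 1] = 0 acts as the classification surface.
    """
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(Xb)):
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * Xb[i] @ w      # margin under the current w
            w *= (1 - eta * lam)           # shrink step from the L2 term
            if margin < 1:                 # hinge-loss subgradient step
                w += eta * y[i] * Xb[i]
    return w

def is_machine_sound(w, x):
    """True if feature vector x falls on the +1 (playback) side."""
    return float(np.append(x, 1.0) @ w) > 0

# Hypothetical, linearly separable 2-D features.
X = np.array([[-2.0, -2.0], [-1.5, -2.5], [-2.5, -1.5],
              [2.0, 2.0], [1.5, 2.5], [2.5, 1.5]])
y = np.array([-1, -1, -1, 1, 1, 1])
w = train_linear_svm(X, y)
print(is_machine_sound(w, [2.2, 1.8]), is_machine_sound(w, [-2.2, -1.8]))  # → True False
```

Storing only `w` keeps the per-request cost of the machine-sound check to a single dot product, which suits the real-time requirement stated in the background.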

It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments and can be implemented in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as illustrative and not restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are intended to be embraced by the invention. Any reference signs in the claims should not be construed as limiting the claims concerned.

In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of description is merely for clarity. Those skilled in the art should take the specification as a whole, and the technical solutions in the various embodiments may be suitably combined to form other embodiments understandable to those skilled in the art.

Claims

1. A method for identity verification by detecting human voice or machine sound with a voiceprint engine, characterized by comprising the following steps:

A. receiving the audio to be verified;

B. preprocessing the received audio, the preprocessing comprising: a) duration detection; b) silence detection; c) pre-emphasis; d) framing and windowing;

C. performing voiceprint comparison to determine identity.

2. The method for identity verification by detecting human voice or machine sound with a voiceprint engine according to claim 1, characterized in that step a) is specifically: based on the incoming speech, determining whether its duration meets the configured detection-duration requirement, and if it does not, directly returning an error message.

3. The method for identity verification by detecting human voice or machine sound with a voiceprint engine according to claim 1, characterized in that step b) is specifically: detecting whether the decibel values of the incoming speech exceed a set threshold, and returning an error message when the decibel values of all audio samples are below the threshold.

4. The method for identity verification by detecting human voice or machine sound with a voiceprint engine according to claim 1, characterized in that step c) is implemented by a high-pass filter with the transfer function:

$$H(z) = 1 - \mu z^{-1}$$

where $\mu$ is the pre-emphasis coefficient, $z$ is the z-transform variable, and $H(z)$ is the transfer function of the filter.

5. The method for identity verification by detecting human voice or machine sound with a voiceprint engine according to claim 1, characterized in that step d) is implemented by weighting with a movable window function of finite length, using the Hamming window as the window function, whose expression is:

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$

where $N$ is the window length, $n$ is the index of the sample point within the window, and $w(n)$ is the amplitude of the window at sample point $n$.

6. The method for identity verification by detecting human voice or machine sound with a voiceprint engine according to claim 1, characterized in that, when the voiceprint engine's switch for detecting machine sound is turned on, the voiceprint engine, after preprocessing the received audio, asynchronously performs two steps, voiceprint comparison and verification of whether the audio is machine sound, where asynchronous means the two steps are carried out simultaneously using multiple threads; with the switch on, the engine, in parallel, obtains the feature model of the speech sample from the voiceprint model database and extracts the speech feature vector, and obtains the noise feature model and extracts the optimal classification surface of the noise.

7. The method for identity verification by detecting human voice or machine sound with a voiceprint engine according to claim 6, characterized in that the voiceprint verification is specifically: during speaker comparison, the voiceprint engine first extracts voiceprint features from the preprocessed received audio, and then uses pattern matching to verify whether the received audio and the sample speech model under test belong to the same person, the pattern matching being implemented with a probability-maximizing matching algorithm whose recognition function is:

$$n^* = \arg\max_{1 \le n \le N} P(\lambda_n \mid z)$$

where $n^*$ is the identified speaker, $N$ is the number of speakers, $\lambda_n$ is the GMM of the $n$-th speaker, and $z$ is the observation vector;

the feature parameters extracted from the speech signal to be recognized are matched probabilistically against the voiceprint model feature vectors, and the identity with the highest matching probability is the recognition result.

8. The method for identity verification by detecting human voice or machine sound with a voiceprint engine according to claim 6, characterized in that, during machine-sound detection, the channel pattern noise is first extracted from the preprocessed received audio; the long-term feature vector of the audio to be verified is then extracted using Legendre polynomial coefficients and statistical feature values; and finally an SVM classification method determines whether the audio under test is machine sound, the Legendre polynomial fitting expression being:

$$f(x) \approx \sum_{n=0}^{K} L_n P_n(x), \quad x \in [-1, 1]$$

where $f(x)$ is the fitted curve, $L_n$ are the Legendre polynomial coefficients, $K$ is the fitting order, and $P_n$ is the Legendre polynomial of degree $n$, defined as:

$$P_n(x) = \frac{1}{2^n n!} \frac{d^n}{dx^n} \left(x^2 - 1\right)^n$$

because the channel noise generated by the device is present throughout the speech signal and changes very slowly, statistical feature values are used to represent the frame-level statistical characteristics of that channel noise; the values of these two kinds of channel pattern noise features together form the feature vector used to judge whether the speech is playback, which completes the extraction of the long-term feature vector; and the SVM classification method uses the SVM classification function to classify the extracted long-term feature parameters of the audio to be verified against the noise optimal classification surface extracted from the voiceprint model database.