CN111081256A - Digital string voiceprint password verification method and system

Info

Publication number
CN111081256A
Authority
CN
China
Prior art keywords
audio
digital
xvector
training
string information
Prior art date
2019-12-31
Legal status
Withdrawn
Application number
CN201911416538.XA
Other languages
Chinese (zh)
Inventor
黄厚军
项煦
钱彦旻
Current Assignee
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date
2019-12-31
Filing date
2019-12-31
Publication date
2020-04-28
Application filed by AI Speech Ltd
Priority to CN201911416538.XA
Publication of CN111081256A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a digital string voiceprint password verification method and system, the method comprising the following steps: training and obtaining a background model; acquiring the digital string information in the registrant's audio; acquiring the digital string information in the currently decoded audio; if it matches the set digital string information, processing the audio through the background model to obtain the xvector voiceprint feature of the current audio, and if it does not match, ending the procedure or reacquiring the digital string information in the currently decoded audio; scoring the current xvector voiceprint feature against the corresponding registered xvector voiceprint feature in the speaker library to obtain a score value; and judging the verification result according to the score value. By simply adjusting the digit classification head, the invention can be used for both text-dependent and text-independent voiceprint recognition tasks, and can thus be made into a general-purpose voiceprint recognition scheme.

Description

Digital string voiceprint password verification method and system
Technical Field
The invention belongs to the technical field of audio processing, and particularly relates to a digital string voiceprint password verification method and system.
Background
In the related art, the audio segments corresponding to the digits 0-9 are separated out of the registered audio and the test audio, representation vectors such as ivectors are extracted from each, ivectors of the same digit are scored for similarity, and the scores over all digits are averaged to obtain the speaker similarity score between the registered audio and the test audio.
The technologies currently used in the market have two obvious defects: first, the information in the registered audio is easily left under-utilized; second, the joint information between different digits cannot be exploited well. Digital string voiceprint password systems currently on the market score similarity digit by digit, so when the set of distinct digits contained in the test audio differs from that of the registered audio, the ivectors of some digits necessarily go unused, which degrades system performance. In addition, when the background model is trained, a separate background model is trained for each digit, and the audio of each digit uses the background model of the corresponding digit to extract its ivector, with no information combined across different digits, so the system can hardly achieve optimal performance.
In the course of implementing the present application, the inventors found the following: the approach the industry would usually think of is to use deep embedding technology to extract an xvector instead of an ivector, improving the modeling capability of the representation vector, and meanwhile to average the xvectors of all digits in the registered audio and the xvectors extracted from all digits in the test audio, scoring speaker similarity with the two mean vectors. When training the background model, the present scheme uses multi-task training and combines multi-digit information to extract the xvector, so the complete scheme is not straightforward to arrive at.
Disclosure of Invention
An embodiment of the present invention provides a digital string voiceprint password verification method and system, which are intended to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a method for verifying a voiceprint password of a numeric string, including:
Step S101, training and obtaining a background model.
Step S102, acquiring the digital string information in the registrant's audio. If it matches the set digital string information, the registrant's audio is processed through the background model to obtain the registered xvector voiceprint feature, and a speaker library is established from the registered xvector voiceprint features. If it does not match the set digital string information, this step ends or the digital string information in the current registrant's audio is reacquired.
Step S103, acquiring the digital string information in the currently decoded audio. If it matches the set digital string information, the tester's audio is processed through the background model to obtain the current xvector voiceprint feature. If it does not match the set digital string information, this step ends or the digital string information in the currently decoded audio is reacquired.
Step S104, scoring the current tester's xvector voiceprint feature against the corresponding registered xvector voiceprint feature in the speaker library to obtain a score value.
Step S105, judging whether the score value exceeds a set threshold; if so, verification-pass information is generated, and if not, verification-failure information is generated.
In a preferred embodiment, the training and obtaining of the background model in step S101 includes:
Step S1011, obtaining training-set audio, the training-set audio being audio containing the speech of multiple speakers; acquiring the time periods corresponding to each digit and acquiring the digit labels.
Step S1012, extracting features from the training-set audio according to the time periods and training through a deep convolutional neural network; passing the result of the convolutional training through a first fully connected layer, the first fully connected layer having ten nodes defined for the digits 0-9; normalizing the output of the first fully connected layer and mapping it to a plurality of supervectors, one corresponding to each defined node.
Step S1013, averaging the plurality of supervectors and obtaining an xvector voiceprint feature of a set dimension through a second fully connected layer.
In a preferred embodiment, step S1013 further includes: for the set-dimension xvector voiceprint features of each speaker in the multi-speaker speech, reducing the distance between xvectors of the same speaker and increasing the distance between xvectors of different speakers through a cross-entropy loss function.
In a preferred embodiment, step S1011 further includes: obtaining a digit-classification-head loss function through the digit labels.
Speaker labels are obtained from the training-set audio, and a speaker-classification loss function is obtained from the speaker labels.
Step S1013 is followed by:
Step S1014, obtaining the system total loss function through the digit-classification-head loss function and the speaker-classification loss function.
In a second aspect, an embodiment of the present invention provides a digital string voiceprint password verification system, which includes:
a training unit configured to train and obtain a background model;
a registration unit configured to acquire the digital string information in the registrant's audio; if it matches the set digital string information, the registrant's audio is processed through the background model to obtain the registered xvector voiceprint feature, and a speaker library is established from the registered xvector voiceprint features; if it does not match, this step ends or the digital string information in the current registrant's audio is reacquired;
a verification unit configured to acquire the digital string information in the currently decoded audio; if it matches the set digital string information, the tester's audio is processed through the background model to obtain the current xvector voiceprint feature; if it does not match, this step ends or the digital string information in the currently decoded audio is reacquired;
a scoring unit configured to score the current tester's xvector voiceprint feature against the corresponding registered xvector voiceprint feature in the speaker library to obtain a score value;
and a result output unit configured to judge whether the score value exceeds a set threshold, generating verification-pass information if so and verification-failure information if not.
In a preferred embodiment of the system, the training unit is further configured to:
obtain training-set audio, the training-set audio being audio containing the speech of multiple speakers; acquire the time periods corresponding to each digit and acquire the digit labels;
extract features from the training-set audio according to the time periods and train through a deep convolutional neural network; pass the result of the convolutional training through a first fully connected layer, the first fully connected layer having ten nodes defined for the digits 0-9; normalize the output of the first fully connected layer and map it to a plurality of supervectors, one corresponding to each defined node;
and average the plurality of supervectors and obtain an xvector voiceprint feature of a set dimension through a second fully connected layer.
In a preferred embodiment of the system, the training unit is further configured to: for the set-dimension xvector voiceprint features of each speaker in the multi-speaker speech, reduce the distance between xvectors of the same speaker and increase the distance between xvectors of different speakers through a cross-entropy loss function.
In a preferred embodiment of the system, the training unit is further configured to: obtain a digit-classification-head loss function through the digit labels;
obtain speaker labels from the training-set audio and a speaker-classification loss function from the speaker labels;
and obtain the system total loss function through the digit-classification-head loss function and the speaker-classification loss function.
In a third aspect, an electronic device is provided, which includes: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the method of any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer program product, the computer program product including a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to perform the steps of the method of any embodiment of the present invention.
At present the method is used only in a digital string voiceprint password system. For wake-up-word-dependent voiceprint recognition, the ten digits 0-9 simply become the wake-up-word content; a text-independent voiceprint can likewise enumerate all the words in the audio. Viewed this way, the scheme can be applied to both text-dependent and text-independent voiceprint recognition tasks by simply adjusting the digit classification head, and can therefore be made into a general-purpose voiceprint recognition scheme.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below are of only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flowchart illustrating a method for verifying a voiceprint password for a numeric string according to an embodiment of the present invention;
FIG. 2 is a flow chart of a digital string voiceprint password system scheme provided by an embodiment of the present invention;
FIG. 3 is a diagram illustrating a background model for speaker recognition according to an embodiment of the present invention;
FIG. 4 is a block diagram of a digital string voiceprint password authentication system according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
In the following, the embodiments of the present application are first described, and the different and advantageous effects that the scheme of the present application can achieve compared with the prior art are then set out.
Referring to FIG. 1, a flow chart of the digital string voiceprint password verification method of the present application is shown.
Step S101, training and obtaining a background model.
Step S102, acquiring the registered xvector voiceprint feature.
In this step, the digital string information in the registrant's audio is acquired. If it matches the set digital string information, the registrant's audio is processed through the background model to obtain the registered xvector voiceprint feature, and a speaker library is established from the registered xvector voiceprint features. If it does not match, this step ends or the digital string information in the current registrant's audio is reacquired.
Step S103, acquiring the current xvector voiceprint feature.
In this step, the digital string information in the currently decoded audio is acquired. If it matches the set digital string information, the tester's audio is processed through the background model to obtain the current xvector voiceprint feature. If it does not match, this step ends or the digital string information in the currently decoded audio is reacquired.
Step S104, obtaining the score value.
In this step, the current tester's xvector voiceprint feature is scored against the corresponding registered xvector voiceprint feature in the speaker library to obtain a score value.
Step S105, judging whether the score value exceeds a set threshold.
If the score value exceeds the set threshold, verification-pass information is generated; if not, verification-failure information is generated.
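For illustration only, the scoring and decision of steps S104 and S105 can be sketched as follows. This is a minimal sketch: the cosine metric is the one named later in this description, while the function names and the threshold value 0.5 are assumptions, since the description does not specify a numeric threshold.

```python
# Minimal sketch of steps S104-S105: cosine scoring of the current xvector
# against the registered xvector, then the threshold decision.
# The threshold value 0.5 is an assumption.
import numpy as np

def cosine_score(test_xvec, enrolled_xvec):
    """Cosine similarity between the current and the registered xvector."""
    return float(np.dot(test_xvec, enrolled_xvec) /
                 (np.linalg.norm(test_xvec) * np.linalg.norm(enrolled_xvec)))

def decide(score, threshold=0.5):
    """Verification passes only if the score exceeds the set threshold."""
    return score > threshold
```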
In a preferred embodiment, the training and obtaining of the background model in step S101 includes:
Step S1011, obtaining training-set audio, the training-set audio being audio containing the speech of multiple speakers; acquiring the time periods corresponding to each digit and acquiring the digit labels.
Step S1012, extracting features from the training-set audio according to the time periods and training through a deep convolutional neural network; passing the result of the convolutional training through a first fully connected layer, the first fully connected layer having ten nodes defined for the digits 0-9; normalizing the output of the first fully connected layer and mapping it to a plurality of supervectors, one corresponding to each defined node.
Step S1013, averaging the plurality of supervectors and obtaining an xvector voiceprint feature of a set dimension through a second fully connected layer.
In a preferred embodiment, step S1013 further includes: for the set-dimension xvector voiceprint features of each speaker in the multi-speaker speech, reducing the distance between xvectors of the same speaker and increasing the distance between xvectors of different speakers through a cross-entropy loss function.
In a preferred embodiment, step S1011 further includes: obtaining a digit-classification-head loss function through the digit labels.
Speaker labels are obtained from the training-set audio, and a speaker-classification loss function is obtained from the speaker labels.
Step S1013 is followed by:
Step S1014, obtaining the system total loss function through the digit-classification-head loss function and the speaker-classification loss function.
The following technical scheme is adopted to solve the above problems. The scheme uses a deep convolutional neural network with batch normalization. The result of the convolution operations is passed through a fully connected layer whose output nodes represent the ten digits 0-9, and a softmax is applied to the output. According to its output on these 10 nodes, each convolutional-layer output vector is mapped to a supervector; the supervectors are averaged, and an xvector of fixed dimension is obtained through another fully connected layer. A cross-entropy loss function is adopted to reduce the distance between xvectors of the same speaker and increase the distance between xvectors of different speakers. The cosine distance between the xvector of the registered audio and the xvector of the test audio is then computed to represent the speaker similarity.
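As an illustrative sketch of this background model (the layer sizes, the 40-dimensional FBANK input, and all names below are assumptions; only the overall structure, i.e. convolution with batch normalization, a ten-node digit layer with softmax, posterior-weighted supervectors, frame averaging, and a second fully connected layer producing the xvector, is taken from the description above):

```python
# Illustrative PyTorch sketch of the background model described above.
# Layer sizes, names, and the 40-dim FBANK input are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DigitXvectorExtractor(nn.Module):
    def __init__(self, feat_dim=40, conv_dim=512, xvec_dim=256, n_digits=10):
        super().__init__()
        # Deep convolutional front end with batch normalization.
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, conv_dim, kernel_size=5, padding=2),
            nn.BatchNorm1d(conv_dim), nn.ReLU(),
            nn.Conv1d(conv_dim, conv_dim, kernel_size=3, padding=1),
            nn.BatchNorm1d(conv_dim), nn.ReLU(),
        )
        # First fully connected layer: one output node per digit 0-9.
        self.digit_head = nn.Linear(conv_dim, n_digits)
        # Second fully connected layer: supervector -> fixed-dim xvector.
        self.xvec_layer = nn.Linear(conv_dim * n_digits, xvec_dim)

    def forward(self, fbank):                 # fbank: (batch, feat_dim, frames)
        h = self.conv(fbank)                  # (batch, conv_dim, frames)
        logits = self.digit_head(h.transpose(1, 2))   # (batch, frames, 10)
        post = F.softmax(logits, dim=-1)      # per-frame digit posteriors
        # Each frame's conv output is distributed over the 10 digit nodes by
        # its posterior, then averaged over frames to form the supervector:
        # sv[b, d, c] = mean_t post[b, t, d] * h[b, c, t]
        sv = torch.einsum("btd,bct->bdc", post, h) / h.size(-1)
        xvec = self.xvec_layer(sv.flatten(1)) # (batch, xvec_dim)
        return xvec, logits
```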
The flow of the whole system is shown in FIG. 2; the voiceprint recognition system is divided into a voiceprint registration flow and a voiceprint recognition flow.
In the voiceprint registration phase, the system prompts the user with a randomly generated digital string, which the user reads aloud. After the device-side microphone collects the user's voice, Voice Activity Detection (VAD) is used to extract the audio of the user speaking, and this audio is sent to the digital string recognizer. If the recognition result of the digital string recognizer is consistent with the digital string prompted by the system, the audio is sent to the xvector extraction module, and the extracted xvector is put into the speaker database; otherwise, registration fails and the registration flow ends.
In the voiceprint recognition phase, the system again prompts the user to read a randomly generated digital string, and after the device-side microphone collects the user's voice, VAD is used to extract the audio of the user speaking, which is sent to the digital string recognizer. If the recognition result of the digital string recognizer is inconsistent with the prompted digital string, the test flow ends. If it is consistent, the audio is sent to the xvector extractor to extract an xvector, which is then scored by cosine distance against the xvector of the enrolled speaker speakerA in the database; if the score is higher than the threshold, the current tester is judged to be speakerA, and otherwise the current tester is judged not to be speakerA.
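The two flows can be sketched as follows. The VAD, the digital string recognizer, and the xvector extractor appear as stubs; all function names and the threshold value are assumptions made for this sketch.

```python
# Sketch of the FIG. 2 registration/verification flows. vad(), recognize_digits()
# and extract_xvector() are stubs standing in for the real VAD, digital string
# recognizer, and background model; names and the threshold are assumptions.
import numpy as np

def vad(audio): ...                  # returns the speech-only audio
def recognize_digits(speech): ...    # returns the recognized digital string
def extract_xvector(speech): ...     # returns a 1-D numpy xvector

def cosine_score(x, y):              # same scoring as the earlier sketch
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def register(audio, prompted, speaker_db, speaker_id):
    speech = vad(audio)
    if recognize_digits(speech) != prompted:
        return False                 # recognizer mismatch: registration fails
    speaker_db[speaker_id] = extract_xvector(speech)
    return True

def verify(audio, prompted, speaker_db, claimed_id, threshold=0.5):
    speech = vad(audio)
    if recognize_digits(speech) != prompted:
        return False                 # recognizer mismatch: test flow ends
    score = cosine_score(extract_xvector(speech), speaker_db[claimed_id])
    return score > threshold         # True: tester judged to be claimed_id
```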
The core component of the whole system is the xvector extractor, i.e., the background model for speaker recognition, a schematic diagram of which is shown in FIG. 3.
As shown in FIG. 3, when training the background model, each piece of training data needs to contain the audio, the time period corresponding to each digit in the audio, and a speaker label. The model contains two classification heads, each shown as box 91 in FIG. 3: one head classifies the digits in the audio, and the other head classifies the speaker. During background model training, the sum of the loss function of the digit classification head and the loss function of the speaker classification is optimized as the total loss function of the system.
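A sketch of this two-head objective follows. The equal 1:1 weighting of the two losses and the linear speaker head over the xvector are assumptions; the description above states only that the two losses are summed.

```python
# Sketch of the total training loss in FIG. 3: digit-classification-head loss
# plus speaker-classification loss. The 1:1 weighting, the speaker-head shape,
# and the label layout are assumptions.
import torch
import torch.nn as nn

n_speakers = 1000                          # assumed number of training speakers
speaker_head = nn.Linear(256, n_speakers)  # classifies speakers from the xvector
digit_loss_fn = nn.CrossEntropyLoss()      # per-frame loss over the 10 digits
speaker_loss_fn = nn.CrossEntropyLoss()    # per-utterance loss over speakers

def total_loss(xvec, digit_logits, digit_labels, speaker_labels):
    # digit_logits: (batch, frames, 10); digit_labels: (batch, frames)
    l_digit = digit_loss_fn(digit_logits.reshape(-1, 10),
                            digit_labels.reshape(-1))
    l_speaker = speaker_loss_fn(speaker_head(xvec), speaker_labels)
    return l_digit + l_speaker             # optimized as the system total loss
```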
At speaker registration and test time, the FBANK features extracted from the audio first pass through a multi-layer convolutional neural network; the supervectors are then computed jointly from the digit-classification outputs and the convolutional-layer outputs, and the supervectors are mapped to an xvector through a fully connected layer.
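Continuing the earlier model sketch, extraction at registration or test time then reduces to a single forward pass (shapes and names remain the sketch's assumptions):

```python
# Usage sketch: extract an xvector with the DigitXvectorExtractor sketched
# earlier. The input shape is illustrative.
import torch

model = DigitXvectorExtractor()           # class from the earlier sketch
model.eval()
fbank = torch.randn(1, 40, 300)           # (batch, feat_dim, frames), ~3 s clip
with torch.no_grad():
    xvec, digit_logits = model(fbank)     # xvec: (1, 256) fixed-dim embedding
```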
Referring to FIG. 4, a digital string voiceprint password verification system according to an embodiment of the present invention is shown, which includes:
a training unit 101 configured to train and obtain a background model;
a registration unit 102 configured to acquire the digital string information in the registrant's audio; if it matches the set digital string information, the registrant's audio is processed through the background model to obtain the registered xvector voiceprint feature, and a speaker library is established from the registered xvector voiceprint features; if it does not match, this step ends or the digital string information in the current registrant's audio is reacquired;
a verification unit 103 configured to acquire the digital string information in the currently decoded audio; if it matches the set digital string information, the tester's audio is processed through the background model to obtain the current xvector voiceprint feature; if it does not match, this step ends or the digital string information in the currently decoded audio is reacquired;
a scoring unit 104 configured to score the current xvector voiceprint feature against the corresponding registered xvector voiceprint feature in the speaker library to obtain a score value;
and a result output unit 105 configured to judge whether the score value exceeds a set threshold, generating verification-pass information if so and verification-failure information if not.
In a preferred embodiment of the system, the training unit is further configured to:
obtain training-set audio, the training-set audio being audio containing the speech of multiple speakers; acquire the time periods corresponding to each digit and acquire the digit labels;
extract features from the training-set audio according to the time periods and train through a deep convolutional neural network; pass the result of the convolutional training through a first fully connected layer, the first fully connected layer having ten nodes defined for the digits 0-9; normalize the output of the first fully connected layer and map it to a plurality of supervectors, one corresponding to each defined node;
and average the plurality of supervectors and obtain an xvector voiceprint feature of a set dimension through a second fully connected layer.
In a preferred embodiment of the system, the training unit is further configured to: for the set-dimension xvector voiceprint features of each speaker in the multi-speaker speech, reduce the distance between xvectors of the same speaker and increase the distance between xvectors of different speakers through a cross-entropy loss function.
In a preferred embodiment of the system, the training unit is further configured to: obtain a digit-classification-head loss function through the digit labels;
obtain speaker labels from the training-set audio and a speaker-classification loss function from the speaker labels;
and obtain the system total loss function through the digit-classification-head loss function and the speaker-classification loss function.
In other embodiments, an embodiment of the present invention further provides a non-volatile computer storage medium storing computer-executable instructions that can perform the digital string voiceprint password verification method in any of the above method embodiments.
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
and training and acquiring a background model.
And acquiring the digital string information in the registrant audio. And if the set digital string information is matched, identifying the voice frequency of the registrant through the background model to obtain the voiceprint characteristics of the registered xvector, and establishing a speaker library according to the voiceprint characteristics of the registered xvector. If the set number string information is not matched, ending or reacquiring the number string information in the current registrant audio in the step.
And acquiring the digital string information in the currently decoded audio. And if the set number string information is matched, processing the registrant audio through the background model to obtain the current xvector voiceprint characteristic. If the set digital string information is not matched, the digital string information in the current decoding audio in the step is finished or obtained again.
And scoring the voiceprint characteristics of the xvector of the current tester according to the corresponding registered voiceprint characteristics of the xvector in the speaker library to obtain a scoring value.
And judging whether the score value exceeds a set threshold value, if so, generating verification passing information, and if not, generating verification failure information.
The steps of training-based and obtaining a background model include: step S1011, obtaining a training set audio, wherein the training set audio is a training set audio with multi-person voice. And acquiring a time period corresponding to each number and acquiring a digital label. Step S1012, extracting features from the training set audio according to the time period and training by a deep convolutional neural network. And outputting the result after the convolution training through a first full-connection layer, wherein the first full-connection layer is provided with 0-9 digital definition nodes. The first fully-connected layer output is normalized and mapped to a plurality of supervectors corresponding to each defined node. And S1013, averaging the plurality of supervectors, and acquiring the xvector voiceprint feature with the set dimension through the second full connection layer.
Based on step S1013, the method further includes: for the set dimension xvector voiceprint characteristics of each person in the multi-person voice, the distance between the xvectors of the same speaker is reduced through the cross entropy loss function, and the distance between the xvectors of different speakers is increased.
The method further includes, in step S1011: and acquiring a digital classification head loss function through the digital label.
Speaker labels are obtained from the training set audio. And obtaining a speaker classification loss function according to the speaker label. Step S1013 is followed by: step 1014, a system total loss function is obtained through the digital classification head loss function and the speaker classification loss function.
As a non-volatile computer-readable storage medium, it can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the digital string voiceprint password verification method in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the digital string voiceprint password verification method in any of the above method embodiments.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the voice signal processing apparatus, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the voice signal processing apparatus over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention also provide a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium, where the computer program includes program instructions, which, when executed by a computer, cause the computer to execute any of the above-mentioned digital string voiceprint password verification methods.
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in FIG. 5, the electronic device includes one or more processors 510 and a memory 520, with one processor 510 taken as an example in FIG. 5. The device for the digital string voiceprint password verification method may further include an input device 530 and an output device 540. The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or in another manner; connection by a bus is taken as an example in FIG. 5. The memory 520 is a non-volatile computer-readable storage medium as described above. By running the non-volatile software programs, instructions, and modules stored in the memory 520, the processor 510 executes the various functional applications and data processing of the server, that is, implements the digital string voiceprint password verification method of the above method embodiments. The input device 530 may receive input numeric or character information and generate key-signal inputs related to user settings and function control of the information delivery device. The output device 540 may include a display device such as a display screen.
The above product can execute the method provided by the embodiments of the present invention, and has the functional modules and advantageous effects corresponding to executing the method. For technical details not described in detail in this embodiment, reference may be made to the method provided by the embodiments of the present invention.
As an embodiment, the electronic device may be applied to an intelligent voice dialog platform, and includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
and training and acquiring a background model.
And acquiring the digital string information in the registrant audio. And if the set digital string information is matched, processing the voice frequency of the registrant through the background model to obtain the voiceprint characteristics of the registered xvector, and establishing a speaker library according to the voiceprint characteristics of the registered xvector. If the set number string information is not matched, ending or reacquiring the number string information in the current registrant audio in the step.
And acquiring the digital string information in the currently decoded audio. And if the set digital string information is matched, processing the audio of the tester through the background model to obtain the current xvector voiceprint characteristics. If the set digital string information is not matched, the digital string information in the current decoding audio in the step is finished or obtained again.
And scoring the voiceprint characteristics of the xvector of the current tester according to the corresponding registered voiceprint characteristics of the xvector in the speaker library to obtain a scoring value.
And judging whether the score value exceeds a set threshold value, if so, generating verification passing information, and if not, generating verification failure information.
The steps of training-based and obtaining a background model include: step S1011, obtaining a training set audio, wherein the training set audio is a training set audio with multi-person voice. A plurality of time periods corresponding to each number are acquired and a digital label is acquired. Step S1012, extracting features from the training set audio according to a plurality of time segments and training by a deep convolutional neural network. And outputting the result after the convolution training through a first full-connection layer, wherein the first full-connection layer is provided with 0-9 digital definition nodes. The first fully-connected layer output is normalized and mapped to a plurality of supervectors corresponding to each defined node. And S1013, averaging the plurality of supervectors, and acquiring the xvector voiceprint feature with the set dimension through the second full connection layer.
Based on step S1013, the method further includes: for the set dimension xvector voiceprint characteristics of each person in the multi-person voice, the distance between the xvectors of the same speaker is reduced through the cross entropy loss function, and the distance between the xvectors of different speakers is increased.
The method further includes, in step S1011: and acquiring a digital classification head loss function through the digital label.
Speaker labels are obtained from the training set audio. And obtaining a speaker classification loss function according to the speaker label. Step S1013 is followed by: step 1014, a system total loss function is obtained through the digital classification head loss function and the speaker classification loss function.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communication capability and are primarily aimed at providing voice and data communication. Such terminals include smart phones (e.g., iPhone), multimedia phones, feature phones, and low-end phones.
(2) An ultra-mobile personal computer device: such devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID, and UMPC devices, e.g., iPad.
(3) A portable entertainment device: such devices can display and play multimedia content. They include audio and video players (e.g., iPod), handheld game consoles, electronic books, smart toys, and portable car navigation devices.
(4) A server: similar in architecture to a general-purpose computer, but with higher requirements on processing capability, stability, reliability, security, scalability, manageability, and the like, because highly reliable services must be provided.
(5) Other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods of the various embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A digital string voiceprint password verification method, comprising the following steps:
step S101, training and obtaining a background model;
step S102, acquiring the digital string information in the registrant's audio; if it matches the set digital string information, processing the registrant's audio through the background model to obtain the registered xvector voiceprint feature, and establishing a speaker library from the registered xvector voiceprint features; if it does not match the set digital string information, ending this step or reacquiring the digital string information in the current registrant's audio;
step S103, acquiring the digital string information in the currently decoded audio; if it matches the set digital string information, processing the tester's audio through the background model to obtain the current xvector voiceprint feature; if it does not match the set digital string information, ending this step or reacquiring the digital string information in the currently decoded audio;
step S104, scoring the current tester's xvector voiceprint feature against the corresponding registered xvector voiceprint feature in the speaker library to obtain a score value;
step S105, judging whether the score value exceeds a set threshold; if so, generating verification-pass information, and if not, generating verification-failure information.
2. The verification method according to claim 1, wherein the training and obtaining of the background model in step S101 comprises:
step S1011, obtaining training-set audio, the training-set audio being audio containing the speech of multiple speakers; acquiring the time period corresponding to each digit and acquiring the digit labels;
step S1012, extracting features from the training-set audio according to the time periods and training through a deep convolutional neural network; passing the result of the convolutional training through a first fully connected layer, the first fully connected layer having ten nodes defined for the digits 0-9; normalizing the output of the first fully connected layer and mapping it to a plurality of supervectors, one corresponding to each defined node;
step S1013, averaging the plurality of supervectors and obtaining an xvector voiceprint feature of a set dimension through a second fully connected layer.
3. The verification method according to claim 2, wherein step S1013 further comprises: for the set-dimension xvector voiceprint features of each speaker in the multi-speaker speech, reducing the distance between xvectors of the same speaker and increasing the distance between xvectors of different speakers through a cross-entropy loss function.
4. The verification method according to claim 2 or 3, wherein step S1011 further comprises: obtaining a digit-classification-head loss function through the digit labels;
obtaining speaker labels through the training-set audio; obtaining a speaker-classification loss function according to the speaker labels;
and step S1013 is followed by:
step S1014, obtaining the system total loss function through the digit-classification-head loss function and the speaker-classification loss function.
5. A digital string voiceprint password verification system, comprising:
a training unit configured to train and obtain a background model;
a registration unit configured to acquire the digital string information in the registrant's audio; if it matches the set digital string information, process the registrant's audio through the background model to obtain the registered xvector voiceprint feature, and establish a speaker library from the registered xvector voiceprint features; if it does not match, end this step or reacquire the digital string information in the current registrant's audio;
a verification unit configured to acquire the digital string information in the currently decoded audio; if it matches the set digital string information, process the tester's audio through the background model to obtain the current xvector voiceprint feature; if it does not match, end this step or reacquire the digital string information in the currently decoded audio;
a scoring unit configured to score the current tester's xvector voiceprint feature against the corresponding registered xvector voiceprint feature in the speaker library to obtain a score value;
and a result output unit configured to judge whether the score value exceeds a set threshold, generate verification-pass information if so, and generate verification-failure information if not.
6. The verification system according to claim 5, wherein the training unit is further configured to:
obtain training-set audio, the training-set audio being audio containing the speech of multiple speakers; acquire the time period corresponding to each digit and acquire the digit labels;
extract features from the training-set audio according to the time periods and train through a deep convolutional neural network; pass the result of the convolutional training through a first fully connected layer, the first fully connected layer having ten nodes defined for the digits 0-9; normalize the output of the first fully connected layer and map it to a plurality of supervectors, one corresponding to each defined node;
and average the plurality of supervectors and obtain an xvector voiceprint feature of a set dimension through a second fully connected layer.
7. The verification system according to claim 6, wherein the training unit is further configured to: for the set-dimension xvector voiceprint features of each speaker in the multi-speaker speech, reduce the distance between xvectors of the same speaker and increase the distance between xvectors of different speakers through a cross-entropy loss function.
8. The verification system according to claim 6 or 7, wherein the training unit is further configured to: obtain a digit-classification-head loss function through the digit labels;
obtain speaker labels through the training-set audio; obtain a speaker-classification loss function according to the speaker labels;
and obtain the system total loss function through the digit-classification-head loss function and the speaker-classification loss function.
CN201911416538.XA (priority date 2019-12-31, filing date 2019-12-31): Digital string voiceprint password verification method and system. Status: Withdrawn. Publication: CN111081256A (en)

Priority Applications (1)

CN201911416538.XA (priority date 2019-12-31, filing date 2019-12-31): Digital string voiceprint password verification method and system

Applications Claiming Priority (1)

CN201911416538.XA (priority date 2019-12-31, filing date 2019-12-31): Digital string voiceprint password verification method and system

Publications (1)

CN111081256A, published 2020-04-28

Family

Family ID: 70320938

Family Applications (1)

CN201911416538.XA (priority date 2019-12-31, filing date 2019-12-31): Digital string voiceprint password verification method and system

Country Status (1)

CN: CN111081256A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
US20160358608A1 * (priority 2009-10-28, published 2016-12-08) NEC Corporation: Speech recognition system, speech recognition request device, speech recognition method, speech recognition program, and recording medium
EP3180785A1 * (priority 2014-12-15, published 2017-06-21) Baidu USA LLC: Systems and methods for speech transcription
CN106098068A * (priority 2016-06-12, published 2016-11-09) Tencent Technology (Shenzhen) Co., Ltd.: Voiceprint recognition method and device
CN107104803A * (priority 2017-03-31, published 2017-08-29) Tsinghua University: User identity authentication method based on a numeric password combined with voiceprint confirmation
CN110047491A * (priority 2018-01-16, published 2019-07-23) Institute of Acoustics, Chinese Academy of Sciences: Random digit password dependent speaker verification method and device
CN110047504A * (priority 2019-04-18, published 2019-07-23) Donghua University: Speaker recognition method under identity-vector (x-vector) linear transformation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
DONG, Yingyan: "Research on Voiceprint Recognition Methods Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
CN111640438A * (priority 2020-05-26, published 2020-09-08) Tongdun Holdings Co., Ltd.: Audio data processing method and device, storage medium, and electronic equipment
CN111640438B * (priority 2020-05-26, published 2023-09-05) Tongdun Holdings Co., Ltd.: Audio data processing method and device, storage medium, and electronic equipment
CN112735438A * (priority 2020-12-29, published 2021-04-30) iFLYTEK Co., Ltd.: Online voiceprint feature updating method and device, storage device, and modeling device
CN112735438B (priority 2020-12-29, published 2024-05-31) iFLYTEK Co., Ltd.: Online voiceprint feature updating method and device, storage device, and modeling device

Similar Documents

CN110136749B: Speaker-dependent end-to-end voice endpoint detection method and device
KR101757990B1: Method and device for voiceprint identification
CN107147618B: User registration method and device, and electronic equipment
JP6096333B2: Method, apparatus and system for verifying payment
CN108346427A: Speech recognition method, device, equipment and storage medium
CN107623614A: Method and apparatus for pushing information
CN108428446A: Speech recognition method and device
CN111081280B: Text-independent speech emotion recognition method and device, and emotion recognition algorithm model generation method
CN111081255B: Speaker verification method and device
CN108447471A: Speech recognition method and speech recognition device
US9251808B2: Apparatus and method for clustering speakers, and a non-transitory computer readable medium thereof
CN104462912B: Improved biometric password security
CN109462603A: Voiceprint authentication method, equipment, storage medium and device based on blind detection
CN110349569B: Method and device for training and recognizing a customized product language model
CN111081260A: Method and system for wake-up word voiceprint recognition
CN111312218A: Neural network training and voice endpoint detection method and device
CN111159358A: Multi-intention recognition training and using method and device
CN111243604B: Training method for a speaker recognition neural network model supporting multiple wake-up words, and speaker recognition method and system
CN110232927B: Speaker verification anti-spoofing method and device
CN112397072B: Voice detection method and device, electronic equipment and storage medium
CN104901807A: Voiceprint password method usable on low-end chips
CN111081256A: Digital string voiceprint password verification method and system
CN113205809A: Voice wake-up method and device
CN113362829B: Speaker verification method, electronic device and storage medium
CN109273004B: Predictive speech recognition method and device based on big data

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
CB02: Change of applicant information
  Address after: 215123 Building 14, Tengfei Innovation Park, 388 Xinping Street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
  Applicant after: Sipic Technology Co., Ltd.
  Address before: 215123 Building 14, Tengfei Innovation Park, 388 Xinping Street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
  Applicant before: AI SPEECH Co., Ltd.
WW01: Invention patent application withdrawn after publication (application publication date: 2020-04-28)