CN111081260A - Method and system for wake-up word voiceprint recognition - Google Patents

Method and system for wake-up word voiceprint recognition

Info

Publication number
CN111081260A
Authority
CN
China
Prior art keywords: audio, wake-up, current, voice, word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201911422078.1A
Other languages
Chinese (zh)
Inventor
黄厚军
项煦
钱彦旻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AI Speech Ltd
Priority to CN201911422078.1A
Publication of CN111081260A
Withdrawn

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/12 Score normalisation
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/18 Artificial neural networks; Connectionist approaches

Abstract

The invention discloses a method and a system for wake-up word voiceprint recognition, wherein the method comprises the following steps: training and obtaining a background model; obtaining the voice wake-up word in the registrant audio; obtaining the current voice wake-up word in the current wake-up audio; scoring the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library to obtain a score; and judging whether the score exceeds a set threshold, generating wake-up pass information if it does, and wake-up failure information if it does not. The invention can record wake-up word data with an online recording tool such as a WeChat applet, reducing time and cost to a certain extent, and can use larger models to improve the performance of the text-dependent voiceprint recognition model.

Description

Method and system for wake-up word voiceprint recognition
Technical Field
The invention belongs to the technical field of voice processing, and particularly relates to a method and a system for wake-up word voiceprint recognition.
Background
At present, wake-up word voiceprint recognition systems on the market either require recording of wake-up word training data or test directly with a text-independent model. Recording wake-up word training data is time-consuming and costly, while the recognition accuracy of a text-independent model falls short of the voiceprint recognition performance a product requires.
The wake-up words of intelligent products currently on the market are rarely common words, so for every new wake-up word project, training data of a large number of speakers saying the wake-up word must be recorded. Recording data for one project costs tens of thousands to hundreds of thousands, the recording cycle is long, and the project schedule is delayed. If no wake-up word training data is recorded and a text-independent model is used directly, performance on the wake-up word voiceprint recognition task is far inferior to a model customized with matched wake-up word training data.
The common industry view is: record wake-up word data with an online recording tool such as a WeChat applet, which reduces time and cost to a certain extent; and use a larger model to improve the performance of the text-independent voiceprint recognition model.
Thus, the prior art offers no scheme for a wake-up word voiceprint recognition system under the condition of zero custom data.
Disclosure of Invention
An embodiment of the present invention provides a method and a system for wake-up word voiceprint recognition, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a method for wake-up word voiceprint recognition, including:
Step S101, training and obtaining a background model.
Step S102, obtaining the voice wake-up word in the registrant audio. If the voice wake-up word matches the set voice wake-up word, the registrant audio is processed through the background model to obtain the registered x-vector voiceprint feature, and a speaker library is built from the registered x-vector voiceprint feature. If it does not match the set voice wake-up word, the process ends or the voice wake-up word in the current registrant audio is re-acquired.
Step S103, obtaining the current voice wake-up word in the current wake-up audio. If the current voice wake-up word matches the set voice wake-up word, the tester audio is processed through the background model to obtain the current x-vector voiceprint feature. If it does not match the set voice wake-up word, the process ends or the voice wake-up word in the current wake-up audio is re-acquired.
Step S104, scoring the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library to obtain a score.
Step S105, judging whether the score exceeds a set threshold; if so, wake-up pass information is generated, and if not, wake-up failure information is generated.
The training and obtaining of the background model in step S101 includes: step S1011, obtaining the current wake-up word audio sequence from the wake training set audio; step S1012, training on the current wake-up word audio sequence through a deep convolutional neural network to obtain the background model.
Step S1011 further includes: acquiring the wake training set audio, where the wake training set audio is the full audio of a single person's recorded text-independent data, and the full audio contains a plurality of set words. Speech recognition is performed on the wake training set to obtain audio segments of the plurality of set words. According to the plurality of set words, the sets of all audio segments corresponding to each word are obtained from the wake training set audio; each audio segment carries playback time information. A current audio segment sequence is randomly extracted from the sets of audio segments corresponding to the words and arranged by segment playback time to obtain the current wake-up word audio sequence.
The obtaining of the voice wake-up word in the registrant audio in step S102 further includes: collecting the registrant audio, and extracting the voice wake-up word in the registrant audio through voice activity detection. The obtaining of the current voice wake-up word in the current wake-up audio in step S103 further includes: collecting the current wake-up audio, and extracting the current voice wake-up word from the current wake-up audio through voice activity detection.
The scoring of the current x-vector voiceprint feature in step S104 includes scoring the current x-vector voiceprint feature with a cosine distance function against the corresponding registered x-vector voiceprint feature in the speaker library.
In a second aspect, an embodiment of the present invention provides a system for wake-up word voiceprint recognition, including:
a background model training unit configured to train and obtain a background model;
a registration unit configured to obtain the voice wake-up word in the registrant audio; if the voice wake-up word matches the set voice wake-up word, process the registrant audio through the background model to obtain the registered x-vector voiceprint feature, and build a speaker library from the registered x-vector voiceprint feature; if it does not match the set voice wake-up word, end the process or re-acquire the voice wake-up word in the current registrant audio;
a verification unit configured to obtain the current voice wake-up word in the current wake-up audio; if the current voice wake-up word matches the set voice wake-up word, process the tester audio through the background model to obtain the current x-vector voiceprint feature; if it does not match the set voice wake-up word, end the process or re-acquire the voice wake-up word in the current wake-up audio;
a scoring unit configured to score the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library to obtain a score;
a result unit configured to judge whether the score exceeds a set threshold, generate wake-up pass information if so, and wake-up failure information if not.
In a preferred embodiment of the present system, the background model training unit is further configured to: obtain the current wake-up word audio sequence from the wake training set audio, and train on the current wake-up word audio sequence through a deep convolutional neural network to obtain the background model.
In a preferred embodiment of the present system, the background model training unit is further configured to: acquire the wake training set audio, where the wake training set audio is the full audio of a single person's recorded text-independent data and contains a plurality of set words; perform speech recognition on the wake training set to obtain audio segments of the plurality of set words; obtain, according to the plurality of set words, the sets of all audio segments corresponding to each word in the wake training set audio, each segment carrying playback time information; randomly extract a current audio segment sequence from the sets of audio segments corresponding to the words, and arrange by segment playback time to obtain the current wake-up word audio sequence.
In a preferred embodiment of the present system, the registration unit is further configured to: collect the registrant audio, and extract the voice wake-up word in the registrant audio through voice activity detection; and the verification unit is further configured to: collect the current wake-up audio, and extract the current voice wake-up word from the current wake-up audio through voice activity detection.
In a preferred embodiment of the present system, the scoring unit is further configured to: score the current x-vector voiceprint feature with a cosine distance function against the corresponding registered x-vector voiceprint feature in the speaker library.
In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of the embodiments of the present invention.
In a fourth aspect, the present invention also provides a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions that, when executed by a computer, cause the computer to perform the steps of the method of any of the embodiments of the present invention.
In the invention, text-independent training data is used: all characters in the text-independent training data are segmented, then spliced in wake-up word order to obtain wake-up word training data, and a convolutional neural network is then used to train a deep embedding model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for wake-up word voiceprint recognition according to an embodiment of the present invention;
fig. 2 is a flowchart of another scheme of the method for wake-up word voiceprint recognition according to an embodiment of the present invention;
fig. 3 is a flowchart of the per-speaker concatenation synthesis algorithm in the method for wake-up word voiceprint recognition according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a system for wake-up word voiceprint recognition according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the given embodiments without creative effort fall within the protection scope of the present invention.
Embodiments of the present application are described below, after which it is shown what distinct and advantageous effects the scheme of the present application can achieve compared with the prior art.
Please refer to fig. 1, which shows a flowchart of an embodiment of the method for wake-up word voiceprint recognition of the present application. The embodiment of the invention provides a method for wake-up word voiceprint recognition, comprising the following steps:
Step S101, training and obtaining a background model.
Step S102, obtaining the registered voice wake-up word.
In this step, the voice wake-up word in the registrant audio is obtained. If the voice wake-up word matches the set voice wake-up word, the registrant audio is processed through the background model to obtain the registered x-vector voiceprint feature, and a speaker library is built from the registered x-vector voiceprint feature. If it does not match the set voice wake-up word, the process ends or the voice wake-up word in the current registrant audio is re-acquired.
Step S103, obtaining the current voice wake-up word.
In this step, the current voice wake-up word in the current wake-up audio is obtained. If the current voice wake-up word matches the set voice wake-up word, the tester audio is processed through the background model to obtain the current x-vector voiceprint feature. If it does not match the set voice wake-up word, the process ends or the voice wake-up word in the current wake-up audio is re-acquired.
Step S104, obtaining the score.
In this step, the current x-vector voiceprint feature is scored against the corresponding registered x-vector voiceprint feature in the speaker library to obtain a score.
Step S105, obtaining the wake-up result.
In this step, it is judged whether the score exceeds a set threshold; if so, wake-up pass information is generated, and if not, wake-up failure information is generated.
The training and obtaining of the background model in step S101 includes: step S1011, obtaining the current wake-up word audio sequence from the wake training set audio; step S1012, training on the current wake-up word audio sequence through a deep convolutional neural network to obtain the background model.
Step S1011 further includes: acquiring the wake training set audio, where the wake training set audio is the full audio of a single person's recorded text-independent data and contains a plurality of set words; performing speech recognition on the wake training set to obtain audio segments of the plurality of set words; obtaining, according to the plurality of set words, the sets of all audio segments corresponding to each word in the wake training set audio, each segment carrying playback time information; randomly extracting a current audio segment sequence from the sets of audio segments corresponding to the words, and arranging by segment playback time to obtain the current wake-up word audio sequence.
The obtaining of the voice wake-up word in the registrant audio in step S102 further includes: collecting the registrant audio, and extracting the voice wake-up word in the registrant audio through voice activity detection. The obtaining of the current voice wake-up word in the current wake-up audio in step S103 further includes: collecting the current wake-up audio, and extracting the current voice wake-up word from the current wake-up audio through voice activity detection.
The scoring of the current x-vector voiceprint feature in step S104 includes scoring the current x-vector voiceprint feature with a cosine distance function against the corresponding registered x-vector voiceprint feature in the speaker library.
The following technical scheme is adopted to solve the above problems: in the invention, text-independent voiceprint recognition training data is used; all characters in it are segmented and then spliced in wake-up word order to obtain wake-up word training data, and a convolutional neural network is then used to train a deep embedding model.
The flow of the whole system is shown in fig. 2. The scheme comprises three steps: background model training, voiceprint registration, and voiceprint recognition.
In the background model training stage, text-independent general data is synthesized into wake-word-dependent training data through the concatenation synthesis algorithm, and the background model based on a convolutional neural network is then trained. A sketch of such a model follows.
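The patent specifies only a "deep convolutional neural network" and does not give the architecture; the following is a minimal, hedged sketch in PyTorch of what an x-vector-style convolutional background model could look like. The feature dimension, layer widths, embedding size, and speaker count are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class XvectorCNN(nn.Module):
    """Minimal x-vector-style background model: 1-D convolutions over
    frame-level features, statistics pooling over time, an embedding
    (x-vector) layer, and a speaker classification head used only
    during training. All sizes here are illustrative assumptions."""

    def __init__(self, feat_dim=40, emb_dim=256, num_speakers=1000):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
        )
        self.embedding = nn.Linear(512 * 2, emb_dim)    # mean + std pooled
        self.classifier = nn.Linear(emb_dim, num_speakers)

    def forward(self, feats):
        # feats: (batch, feat_dim, num_frames) acoustic features
        h = self.frame_layers(feats)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        xvec = self.embedding(stats)    # the x-vector kept at test time
        return self.classifier(xvec), xvec

# Training would minimize cross-entropy of the classifier output over the
# speakers of the synthesized wake-word training set; at registration and
# test time, only `xvec` is used.
```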
In the voiceprint registration stage, after the device-side microphone collects the user's voice, voice activity detection (VAD) is used to cut out the audio of the user's speech, which is sent to the keyword wake-up system. If the system wakes up normally, the audio is sent to the x-vector extraction module, and the extracted x-vector is stored in the speaker database; otherwise, registration fails and the registration process ends.
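The patent calls for voice activity (point) detection but does not name an algorithm; as a stand-in, a crude energy-based VAD could cut out the spoken segment along the following lines (the frame sizes and relative threshold are assumptions):

```python
import numpy as np

def energy_vad(wav, sr, frame_ms=25, hop_ms=10, rel_threshold=0.1):
    """Return (start, end) sample indices of detected speech, or None.
    A crude energy-based stand-in for the unspecified VAD: frames whose
    short-time energy exceeds a fraction of the maximum frame energy
    are treated as speech."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    energies = np.array([np.sum(wav[i:i + frame] ** 2)
                         for i in range(0, len(wav) - frame + 1, hop)])
    if energies.size == 0:
        return None
    active = np.flatnonzero(energies > rel_threshold * energies.max())
    if active.size == 0:
        return None
    return active[0] * hop, active[-1] * hop + frame
```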
In the voiceprint recognition stage, after the device-side microphone collects the user's voice, VAD is likewise used to cut out the user's audio, which is sent to the keyword wake-up system. If the system cannot be woken up, the test flow ends and user recognition fails; if the system wakes up normally, the audio is sent to the x-vector extractor to extract an x-vector, which is then scored by cosine distance against the x-vector of the enrolled speaker speakerA in the database. If the score is above the threshold, the current tester is judged to be speakerA; otherwise, the tester is judged not to be speakerA.
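Given the enrolled and test x-vectors, the cosine scoring and threshold decision of steps S104-S105 reduce to a few lines. A sketch, where `speaker_db` is a hypothetical name for the speaker library and the 0.6 threshold is an illustrative placeholder, not a value from the patent:

```python
import numpy as np

def cosine_score(enrolled_xvec, test_xvec):
    """Cosine similarity between the enrolled and test x-vectors."""
    a = np.asarray(enrolled_xvec, dtype=float)
    b = np.asarray(test_xvec, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(speaker_db, test_xvec, threshold=0.6):
    """Score the test x-vector against each enrolled x-vector in the
    speaker library; return the best matching speaker name, or None
    if no score exceeds the threshold (wake-up failure)."""
    best_name, best_score = None, threshold
    for name, enrolled_xvec in speaker_db.items():
        score = cosine_score(enrolled_xvec, test_xvec)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# e.g. verify({"speakerA": enrolled_xvec}, test_xvec) -> "speakerA" or None
```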
The core component of the whole scheme is the concatenation synthesis algorithm used to synthesize the wake-up word training data. Taking the wake-up word "ni3 hao3 xiao3 le4" as an example, the processing flow for each speaker is shown in fig. 3.
As shown in fig. 3, the per-speaker concatenation synthesis algorithm proceeds as follows:
Step 1: for all the audio of one speaker, recognize the content of each recording through a speech recognition system, which also gives the position of each word in the audio.
Step 2: according to all the recognition results, collect all audio segments of the four words "ni3", "hao3", "xiao3" and "le4".
Step 3: randomly take one segment from the audio segments corresponding to each of the four words and splice them, in order, into one audio. A sketch of these three steps follows.
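A minimal sketch of these three steps, assuming a hypothetical aligner `asr_align(path)` that returns per-word time boundaries and the `soundfile` library for audio I/O (neither is named in the patent):

```python
import random
import numpy as np
import soundfile as sf  # assumed I/O library; the patent names none

WAKE_WORD = ["ni3", "hao3", "xiao3", "le4"]

def build_segment_pools(audio_paths, asr_align, sr=16000):
    """Steps 1-2: recognize each recording of one speaker and pool every
    clip of each wake-word syllable. `asr_align(path)` is a hypothetical
    aligner returning [(word, start_sec, end_sec), ...]; all files are
    assumed to share the sample rate `sr`."""
    pools = {w: [] for w in WAKE_WORD}
    for path in audio_paths:
        wav, _ = sf.read(path)
        for word, start, end in asr_align(path):
            if word in pools:
                pools[word].append(wav[int(start * sr):int(end * sr)])
    return pools

def synthesize_wake_word(pools):
    """Step 3: randomly draw one clip per syllable and concatenate them
    in wake-word order into one synthetic training utterance."""
    return np.concatenate([random.choice(pools[w]) for w in WAKE_WORD])
```

Repeating `synthesize_wake_word` many times per speaker yields the wake-word-dependent training set on which the background model is trained.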
In a second aspect, as shown in fig. 4, an embodiment of the present invention provides a system for wake-up word voiceprint recognition, including:
a background model training unit 101 configured to train and obtain a background model;
a registration unit 102 configured to obtain the voice wake-up word in the registrant audio; if the voice wake-up word matches the set voice wake-up word, process the registrant audio through the background model to obtain the registered x-vector voiceprint feature, and build a speaker library from the registered x-vector voiceprint feature; if it does not match the set voice wake-up word, end the process or re-acquire the voice wake-up word in the current registrant audio;
a verification unit 103 configured to obtain the current voice wake-up word in the current wake-up audio; if the current voice wake-up word matches the set voice wake-up word, process the tester audio through the background model to obtain the current x-vector voiceprint feature; if it does not match the set voice wake-up word, end the process or re-acquire the voice wake-up word in the current wake-up audio;
a scoring unit 104 configured to score the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library to obtain a score;
a result unit 105 configured to judge whether the score exceeds a set threshold, generate wake-up pass information if so, and wake-up failure information if not.
In a preferred embodiment of the present system, the background model training unit 101 is further configured to: obtain the current wake-up word audio sequence from the wake training set audio, and train on the current wake-up word audio sequence through a deep convolutional neural network to obtain the background model.
In a preferred embodiment of the present system, the background model training unit 101 is further configured to: acquire the wake training set audio, where the wake training set audio is the full audio of a single person's recorded text-independent data and contains a plurality of set words; perform speech recognition on the wake training set to obtain audio segments of the plurality of set words; obtain, according to the plurality of set words, the sets of all audio segments corresponding to each word in the wake training set audio, each segment carrying playback time information; randomly extract a current audio segment sequence from the sets of audio segments corresponding to the words, and arrange by segment playback time to obtain the current wake-up word audio sequence.
In a preferred embodiment of the present system, the registration unit 102 is further configured to: collect the registrant audio, and extract the voice wake-up word in the registrant audio through voice activity detection; and the verification unit 103 is further configured to: collect the current wake-up audio, and extract the current voice wake-up word from the current wake-up audio through voice activity detection.
In a preferred embodiment of the present system, the scoring unit 104 is further configured to: score the current x-vector voiceprint feature with a cosine distance function against the corresponding registered x-vector voiceprint feature in the speaker library.
In other embodiments, the present invention further provides a non-volatile computer storage medium storing computer-executable instructions that can perform the method for wake-up word voiceprint recognition in any of the above method embodiments.
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
step S101, training and obtaining a background model.
Step S102, obtaining the voice wake-up word in the registrant audio. If the voice wake-up word matches the set voice wake-up word, the registrant audio is processed through the background model to obtain the registered x-vector voiceprint feature, and a speaker library is built from the registered x-vector voiceprint feature. If it does not match the set voice wake-up word, the process ends or the voice wake-up word in the current registrant audio is re-acquired.
Step S103, obtaining the current voice wake-up word in the current wake-up audio. If the current voice wake-up word matches the set voice wake-up word, the tester audio is processed through the background model to obtain the current x-vector voiceprint feature. If it does not match the set voice wake-up word, the process ends or the voice wake-up word in the current wake-up audio is re-acquired.
Step S104, scoring the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library to obtain a score.
Step S105, judging whether the score exceeds a set threshold; if so, wake-up pass information is generated, and if not, wake-up failure information is generated.
The training and obtaining of the background model in step S101 includes: step S1011, obtaining the current wake-up word audio sequence from the wake training set audio; step S1012, training on the current wake-up word audio sequence through a deep convolutional neural network to obtain the background model.
Step S1011 further includes: acquiring the wake training set audio, where the wake training set audio is the full audio of a single person's recorded text-independent data and contains a plurality of set words; performing speech recognition on the wake training set to obtain audio segments of the plurality of set words; obtaining, according to the plurality of set words, the sets of all audio segments corresponding to each word in the wake training set audio, each segment carrying playback time information; randomly extracting a current audio segment sequence from the sets of audio segments corresponding to the words, and arranging by segment playback time to obtain the current wake-up word audio sequence.
The obtaining of the voice wake-up word in the registrant audio in step S102 further includes: collecting the registrant audio, and extracting the voice wake-up word in the registrant audio through voice activity detection. The obtaining of the current voice wake-up word in the current wake-up audio in step S103 further includes: collecting the current wake-up audio, and extracting the current voice wake-up word from the current wake-up audio through voice activity detection.
The scoring in step S104 includes scoring the current x-vector voiceprint feature with a cosine distance function against the corresponding registered x-vector voiceprint feature in the speaker library.
As a non-volatile computer-readable storage medium, it can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the method for wake-up word voiceprint recognition in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the method for wake-up word voiceprint recognition in any of the above method embodiments.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the voice signal processing apparatus, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the voice signal processing apparatus over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides a computer program product, including a computer program stored on a non-volatile computer-readable storage medium; the computer program includes program instructions which, when executed by a computer, cause the computer to execute any one of the above methods for wake-up word voiceprint recognition.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 5, the electronic device includes one or more processors 510 and a memory 520, with one processor 510 taken as the example in fig. 5. The device for the method of wake-up word voiceprint recognition may further include an input device 530 and an output device 540. The processor 510, the memory 520, the input device 530 and the output device 540 may be connected by a bus or in other ways; connection by a bus is taken as the example in fig. 5. The memory 520 is a non-volatile computer-readable storage medium as described above. By running the non-volatile software programs, instructions and modules stored in the memory 520, the processor 510 executes the various functional applications and data processing of the server, i.e., implements the method for wake-up word voiceprint recognition of the above method embodiments. The input device 530 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the information delivery device. The output device 540 may include a display device such as a display screen.
The above product can execute the method provided by the embodiments of the present invention, and has the functional modules and beneficial effects corresponding to the executed method. For technical details not described in detail in this embodiment, refer to the method provided by the embodiments of the present invention.
As an embodiment, the electronic device may be applied to an intelligent voice dialog platform and includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
train and obtain a background model;
obtain the voice wake-up word in the registrant audio; if the voice wake-up word matches the set voice wake-up word, process the registrant audio through the background model to obtain the registered x-vector voiceprint feature, and build a speaker library from the registered x-vector voiceprint feature; if it does not match the set voice wake-up word, end the process or re-acquire the voice wake-up word in the current registrant audio;
obtain the current voice wake-up word in the current wake-up audio; if the current voice wake-up word matches the set voice wake-up word, process the tester audio through the background model to obtain the current x-vector voiceprint feature; if it does not match the set voice wake-up word, end the process or re-acquire the voice wake-up word in the current wake-up audio;
score the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library to obtain a score;
judge whether the score exceeds a set threshold; if so, generate wake-up pass information, and if not, generate wake-up failure information.
The training and obtaining of the background model includes: obtaining the current wake-up word audio sequence from the wake training set audio, and training on the current wake-up word audio sequence through a deep convolutional neural network to obtain the background model.
It further includes: acquiring the wake training set audio, where the wake training set audio is the full audio of a single person's recorded text-independent data and contains a plurality of set words; performing speech recognition on the wake training set to obtain audio segments of the plurality of set words; obtaining, according to the plurality of set words, the sets of all audio segments corresponding to each word in the wake training set audio, each segment carrying playback time information; randomly extracting a current audio segment sequence from the sets of audio segments corresponding to the words, and arranging by segment playback time to obtain the current wake-up word audio sequence.
Obtaining the voice wake-up word in the registrant audio further includes: collecting the registrant audio, and extracting the voice wake-up word in the registrant audio through voice activity detection. Obtaining the current voice wake-up word in the current wake-up audio further includes: collecting the current wake-up audio, and extracting the current voice wake-up word from the current wake-up audio through voice activity detection.
Scoring the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library includes scoring the current x-vector voiceprint feature with a cosine distance function against the corresponding registered x-vector voiceprint feature in the speaker library.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: these devices feature mobile communication capability and primarily provide voice and data communication. Such terminals include smart phones (e.g., iPhone), multimedia phones, feature phones and low-end phones.
(2) Ultra-mobile personal computer devices: these belong to the category of personal computers, have computing and processing functions, and generally also offer mobile internet access. Such terminals include PDA, MID and UMPC devices, e.g., iPad.
(3) Portable entertainment devices: these can display and play multimedia content. They include audio and video players (e.g., iPod), handheld game consoles, e-book readers, smart toys and portable car navigation devices.
(4) Servers: similar in architecture to general-purpose computers, but with higher requirements on processing capability, stability, reliability, security, scalability and manageability, since highly reliable services must be provided.
(5) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods of the various embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for wake-up word voiceprint recognition, comprising the following steps:
step S101, training and obtaining a background model;
step S102, obtaining the voice wake-up word in the registrant audio; if the voice wake-up word matches the set voice wake-up word, processing the registrant audio through the background model to obtain the registered x-vector voiceprint feature, and building a speaker library from the registered x-vector voiceprint feature; if it does not match the set voice wake-up word, ending the process or re-acquiring the voice wake-up word in the current registrant audio;
step S103, obtaining the current voice wake-up word in the current wake-up audio; if the current voice wake-up word matches the set voice wake-up word, processing the tester audio through the background model to obtain the current x-vector voiceprint feature; if it does not match the set voice wake-up word, ending the process or re-acquiring the voice wake-up word in the current wake-up audio;
step S104, scoring the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library to obtain a score;
step S105, judging whether the score exceeds a set threshold; if so, generating wake-up pass information, and if not, generating wake-up failure information.
2. The recognition method of claim 1, wherein the training and obtaining of the background model in step S101 comprises:
step S1011, obtaining the current wake-up word audio sequence from the wake training set audio;
step S1012, training on the current wake-up word audio sequence through a deep convolutional neural network to obtain the background model.
3. The recognition method of claim 2, wherein step S1011 further comprises:
acquiring the wake training set audio, the wake training set audio being the full audio of a single person's recorded text-independent data, the full audio containing a plurality of set words;
performing speech recognition on the wake training set to obtain audio segments of the plurality of set words;
obtaining, according to the plurality of set words, the sets of all audio segments corresponding to each word in the wake training set audio, each audio segment carrying playback time information;
randomly extracting a current audio segment sequence from the sets of audio segments corresponding to the words, and arranging by segment playback time to obtain the current wake-up word audio sequence.
4. The recognition method of claim 1, wherein obtaining the voice wake-up word in the registrant audio in step S102 further comprises:
collecting the registrant audio;
extracting the voice wake-up word in the registrant audio through voice activity detection;
and obtaining the current voice wake-up word in the current wake-up audio in step S103 further comprises:
collecting the current wake-up audio;
extracting the current voice wake-up word from the current wake-up audio through voice activity detection.
5. The recognition method of claim 1, wherein scoring the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library in step S104 comprises:
scoring the current x-vector voiceprint feature with a cosine distance function against the corresponding registered x-vector voiceprint feature in the speaker library.
6. A system for wake-up word voiceprint recognition, comprising:
a background model training unit configured to train and obtain a background model;
a registration unit configured to obtain the voice wake-up word in the registrant audio; if the voice wake-up word matches the set voice wake-up word, process the registrant audio through the background model to obtain the registered x-vector voiceprint feature, and build a speaker library from the registered x-vector voiceprint feature; if it does not match the set voice wake-up word, end the process or re-acquire the voice wake-up word in the current registrant audio;
a verification unit configured to obtain the current voice wake-up word in the current wake-up audio; if the current voice wake-up word matches the set voice wake-up word, process the tester audio through the background model to obtain the current x-vector voiceprint feature; if it does not match the set voice wake-up word, end the process or re-acquire the voice wake-up word in the current wake-up audio;
a scoring unit configured to score the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library to obtain a score;
a result unit configured to judge whether the score exceeds a set threshold, generate wake-up pass information if so, and wake-up failure information if not.
7. The recognition system of claim 6, wherein the background model training unit is further configured to: obtain the current wake-up word audio sequence from the wake training set audio; and train on the current wake-up word audio sequence through a deep convolutional neural network to obtain the background model.
8. The recognition system of claim 7, wherein the background model training unit is further configured to:
acquire the wake training set audio, the wake training set audio being the full audio of a single person's recorded text-independent data, the full audio containing a plurality of set words;
perform speech recognition on the wake training set to obtain audio segments of the plurality of set words;
obtain, according to the plurality of set words, the sets of all audio segments corresponding to each word in the wake training set audio, each audio segment carrying playback time information;
randomly extract a current audio segment sequence from the sets of audio segments corresponding to the words, and arrange by segment playback time to obtain the current wake-up word audio sequence.
9. The recognition system of claim 6, wherein the registration unit is further configured to:
collect the registrant audio;
extract the voice wake-up word in the registrant audio through voice activity detection;
and the verification unit is further configured to:
collect the current wake-up audio;
extract the current voice wake-up word from the current wake-up audio through voice activity detection.
10. The recognition system of claim 6, wherein the scoring unit is further configured to: score the current x-vector voiceprint feature with a cosine distance function against the corresponding registered x-vector voiceprint feature in the speaker library.
CN201911422078.1A 2019-12-31 2019-12-31 Method and system for wake-up word voiceprint recognition Withdrawn CN111081260A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911422078.1A CN111081260A (en) 2019-12-31 2019-12-31 Method and system for wake-up word voiceprint recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911422078.1A CN111081260A (en) 2019-12-31 2019-12-31 Method and system for wake-up word voiceprint recognition

Publications (1)

Publication Number Publication Date
CN111081260A (en) 2020-04-28

Family

ID=70321418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911422078.1A Withdrawn CN111081260A (en) 2019-12-31 2019-12-31 Method and system for identifying voiceprint of awakening word

Country Status (1)

Country Link
CN (1) CN111081260A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735438A (en) * 2020-12-29 2021-04-30 科大讯飞股份有限公司 Online voiceprint feature updating method and device, storage device and modeling device
CN113948091A (en) * 2021-12-20 2022-01-18 山东贝宁电子科技开发有限公司 Air-ground communication voice recognition engine for civil aviation passenger plane and application method thereof
WO2023207185A1 (en) * 2022-04-29 2023-11-02 荣耀终端有限公司 Voiceprint recognition method, graphical interface, and electronic device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020178004A1 (en) * 2001-05-23 2002-11-28 Chienchung Chang Method and apparatus for voice recognition
WO2016100231A1 (en) * 2014-12-15 2016-06-23 Baidu Usa Llc Systems and methods for speech transcription
CN106057206A (en) * 2016-06-01 2016-10-26 腾讯科技(深圳)有限公司 Voiceprint model training method, voiceprint recognition method and device
CN106098068A (en) * 2016-06-12 2016-11-09 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN108648759A (en) * 2018-05-14 2018-10-12 华南理工大学 Text-independent voiceprint recognition method
CN110047491A (en) * 2018-01-16 2019-07-23 中国科学院声学研究所 Random digit password related speaker recognition method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020178004A1 (en) * 2001-05-23 2002-11-28 Chienchung Chang Method and apparatus for voice recognition
WO2016100231A1 (en) * 2014-12-15 2016-06-23 Baidu Usa Llc Systems and methods for speech transcription
CN106057206A (en) * 2016-06-01 2016-10-26 腾讯科技(深圳)有限公司 Voiceprint model training method, voiceprint recognition method and device
CN106098068A (en) * 2016-06-12 2016-11-09 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN110047491A (en) * 2018-01-16 2019-07-23 中国科学院声学研究所 Random digit password related speaker recognition method and device
CN108648759A (en) * 2018-05-14 2018-10-12 华南理工大学 Text-independent voiceprint recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
董莺艳: "Research on voiceprint recognition methods based on deep learning" (基于深度学习的声纹识别方法研究), China Master's Theses Full-text Database, Information Science and Technology *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735438A (en) * 2020-12-29 2021-04-30 科大讯飞股份有限公司 Online voiceprint feature updating method and device, storage device and modeling device
CN113948091A (en) * 2021-12-20 2022-01-18 山东贝宁电子科技开发有限公司 Air-ground communication voice recognition engine for civil aviation passenger plane and application method thereof
WO2023207185A1 (en) * 2022-04-29 2023-11-02 荣耀终端有限公司 Voiceprint recognition method, graphical interface, and electronic device

Similar Documents

Publication Publication Date Title
CN107147618B (en) User registration method and device and electronic equipment
JP6394709B2 (en) SPEAKER IDENTIFYING DEVICE AND FEATURE REGISTRATION METHOD FOR REGISTERED SPEECH
CN111081260A (en) Method and system for wake-up word voiceprint recognition
CN111063341A (en) Method and system for segmenting and clustering multi-person voice in complex environment
CN111081280B (en) Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN104143326A (en) Voice command recognition method and device
CN108831477B (en) Voice recognition method, device, equipment and storage medium
CN111341325A (en) Voiceprint recognition method and device, storage medium and electronic device
CN102404278A (en) Song request system based on voiceprint recognition and application method thereof
CN109462603A (en) Voiceprint authentication method, equipment, storage medium and device based on blind Detecting
CN104462912B (en) Improved biometric password security
WO2017166651A1 (en) Voice recognition model training method, speaker type recognition method and device
WO2018129869A1 (en) Voiceprint verification method and apparatus
CN110634468B (en) Voice wake-up method, device, equipment and computer readable storage medium
CN102916815A (en) Method and device for checking identity of user
CN110544469A (en) Training method and device of voice recognition model, storage medium and electronic device
CN111179915A (en) Age identification method and device based on voice
CN111243604B (en) Training method for speaker recognition neural network model supporting multiple awakening words, speaker recognition method and system
CN108847243B (en) Voiceprint feature updating method and device, storage medium and electronic equipment
CN110544468A (en) Application awakening method and device, storage medium and electronic equipment
JP6915637B2 (en) Information processing equipment, information processing methods, and programs
CN110600029A (en) User-defined awakening method and device for intelligent voice equipment
CN110580897A (en) audio verification method and device, storage medium and electronic equipment
CN111081256A (en) Digital string voiceprint password verification method and system
CN109087647A (en) Application on Voiceprint Recognition processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

WW01 Invention patent application withdrawn after publication

Application publication date: 20200428
