CN111081260A - Method and system for wake-up word voiceprint recognition - Google Patents

Method and system for wake-up word voiceprint recognition

Info

Publication number
CN111081260A
Authority
CN
China
Prior art keywords: audio, wake-up, current, voice, word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201911422078.1A
Other languages
Chinese (zh)
Inventor
黄厚军
项煦
钱彦旻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AI Speech Ltd
Priority to CN201911422078.1A
Publication of CN111081260A
Withdrawn

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G10L17/12 Score normalisation
    • G10L17/14 Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/18 Artificial neural networks; Connectionist approaches

Abstract

The invention discloses a method and a system for wake-up word voiceprint recognition, wherein the method comprises the following steps: training and obtaining a background model; obtaining the voice wake-up word in the registrant audio; obtaining the current voice wake-up word in the current wake-up audio; scoring the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library to obtain a score; and judging whether the score exceeds a set threshold, generating wake-up pass information if it does, and wake-up failure information if it does not. The invention can record wake-up word data with an online recording tool such as a WeChat applet, reducing time and cost to a certain extent, and can use larger models to improve the performance of the text-dependent voiceprint recognition model.

Description

Method and system for wake-up word voiceprint recognition
Technical Field
The invention belongs to the technical field of voice processing, and particularly relates to a method and a system for wake-up word voiceprint recognition.
Background
At present, wake-up word voiceprint recognition systems on the market either require recording of wake-up word training data or test directly with a text-independent model. Recording wake-up word training data is time-consuming and costly, while the recognition accuracy of a text-independent model falls short of the voiceprint recognition performance a product requires.
The wake-up words of intelligent products currently on the market are rarely common words, so for every new wake-up word project, training data of a large number of speakers saying the wake-up word must be recorded. Recording data for one project costs tens of thousands to hundreds of thousands, the recording cycle is long, and the project schedule is delayed. If no wake-up word training data is recorded and a text-independent model is used directly, performance on the wake-up word voiceprint recognition task is far inferior to a model customized with matched wake-up word training data.
The common industry view is: record wake-up word data with an online recording tool such as a WeChat applet, which reduces time and cost to a certain extent; and use a larger model to improve the performance of the text-independent voiceprint recognition model.
Thus, the prior art offers no scheme for a wake-up word voiceprint recognition system under the condition of zero custom data.
Disclosure of Invention
An embodiment of the present invention provides a method and a system for wake-up word voiceprint recognition, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a method for wake-up word voiceprint recognition, including:
Step S101, training and obtaining a background model.
Step S102, obtaining the voice wake-up word in the registrant audio. If the voice wake-up word matches the set voice wake-up word, the registrant audio is processed through the background model to obtain the registered x-vector voiceprint feature, and a speaker library is built from the registered x-vector voiceprint feature. If it does not match the set voice wake-up word, the process ends or the voice wake-up word in the current registrant audio is re-acquired.
Step S103, obtaining the current voice wake-up word in the current wake-up audio. If the current voice wake-up word matches the set voice wake-up word, the tester audio is processed through the background model to obtain the current x-vector voiceprint feature. If it does not match the set voice wake-up word, the process ends or the voice wake-up word in the current wake-up audio is re-acquired.
Step S104, scoring the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library to obtain a score.
Step S105, judging whether the score exceeds a set threshold; if so, wake-up pass information is generated, and if not, wake-up failure information is generated.
The training and obtaining of the background model in step S101 includes: step S1011, obtaining the current wake-up word audio sequence from the wake training set audio; step S1012, training on the current wake-up word audio sequence through a deep convolutional neural network to obtain the background model.
Step S1011 further includes: acquiring the wake training set audio, where the wake training set audio is the full audio of a single person's recorded text-independent data, and the full audio contains a plurality of set words. Speech recognition is performed on the wake training set to obtain audio segments of the plurality of set words. According to the plurality of set words, the sets of all audio segments corresponding to each word are obtained from the wake training set audio; each audio segment carries playback time information. A current audio segment sequence is randomly extracted from the sets of audio segments corresponding to the words and arranged by segment playback time to obtain the current wake-up word audio sequence.
The obtaining of the voice wake-up word in the registrant audio in step S102 further includes: collecting the registrant audio, and extracting the voice wake-up word in the registrant audio through voice activity detection. The obtaining of the current voice wake-up word in the current wake-up audio in step S103 further includes: collecting the current wake-up audio, and extracting the current voice wake-up word from the current wake-up audio through voice activity detection.
The scoring of the current x-vector voiceprint feature in step S104 includes scoring the current x-vector voiceprint feature with a cosine distance function against the corresponding registered x-vector voiceprint feature in the speaker library.
In a second aspect, an embodiment of the present invention provides a system for wake-up word voiceprint recognition, including:
a background model training unit configured to train and obtain a background model;
a registration unit configured to obtain the voice wake-up word in the registrant audio; if the voice wake-up word matches the set voice wake-up word, process the registrant audio through the background model to obtain the registered x-vector voiceprint feature, and build a speaker library from the registered x-vector voiceprint feature; if it does not match the set voice wake-up word, end the process or re-acquire the voice wake-up word in the current registrant audio;
a verification unit configured to obtain the current voice wake-up word in the current wake-up audio; if the current voice wake-up word matches the set voice wake-up word, process the tester audio through the background model to obtain the current x-vector voiceprint feature; if it does not match the set voice wake-up word, end the process or re-acquire the voice wake-up word in the current wake-up audio;
a scoring unit configured to score the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library to obtain a score;
a result unit configured to judge whether the score exceeds a set threshold, generate wake-up pass information if so, and wake-up failure information if not.
In a preferred embodiment of the present system, the background model training unit is further configured to: obtain the current wake-up word audio sequence from the wake training set audio, and train on the current wake-up word audio sequence through a deep convolutional neural network to obtain the background model.
In a preferred embodiment of the present system, the background model training unit is further configured to: acquire the wake training set audio, where the wake training set audio is the full audio of a single person's recorded text-independent data and contains a plurality of set words; perform speech recognition on the wake training set to obtain audio segments of the plurality of set words; obtain, according to the plurality of set words, the sets of all audio segments corresponding to each word in the wake training set audio, each segment carrying playback time information; randomly extract a current audio segment sequence from the sets of audio segments corresponding to the words, and arrange by segment playback time to obtain the current wake-up word audio sequence.
In a preferred embodiment of the present system, the registration unit is further configured to: collect the registrant audio, and extract the voice wake-up word in the registrant audio through voice activity detection; and the verification unit is further configured to: collect the current wake-up audio, and extract the current voice wake-up word from the current wake-up audio through voice activity detection.
In a preferred embodiment of the present system, the scoring unit is further configured to: score the current x-vector voiceprint feature with a cosine distance function against the corresponding registered x-vector voiceprint feature in the speaker library.
In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of the embodiments of the present invention.
In a fourth aspect, the present invention also provides a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions that, when executed by a computer, cause the computer to perform the steps of the method of any of the embodiments of the present invention.
In the invention, text-independent training data is used: all characters in the text-independent training data are segmented, then spliced in wake-up word order to obtain wake-up word training data, and a convolutional neural network is then used to train a deep embedding model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for wake-up word voiceprint recognition according to an embodiment of the present invention;
fig. 2 is a flowchart of another scheme of the method for wake-up word voiceprint recognition according to an embodiment of the present invention;
fig. 3 is a flowchart of the per-speaker concatenation synthesis algorithm in the method for wake-up word voiceprint recognition according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a system for wake-up word voiceprint recognition according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the given embodiments without creative effort fall within the protection scope of the present invention.
Embodiments of the present application are described below, after which it is shown what distinct and advantageous effects the scheme of the present application can achieve compared with the prior art.
Please refer to fig. 1, which shows a flowchart of an embodiment of the method for wake-up word voiceprint recognition of the present application. The embodiment of the invention provides a method for wake-up word voiceprint recognition, comprising the following steps:
Step S101, training and obtaining a background model.
Step S102, obtaining the registered voice wake-up word.
In this step, the voice wake-up word in the registrant audio is obtained. If the voice wake-up word matches the set voice wake-up word, the registrant audio is processed through the background model to obtain the registered x-vector voiceprint feature, and a speaker library is built from the registered x-vector voiceprint feature. If it does not match the set voice wake-up word, the process ends or the voice wake-up word in the current registrant audio is re-acquired.
Step S103, obtaining the current voice wake-up word.
In this step, the current voice wake-up word in the current wake-up audio is obtained. If the current voice wake-up word matches the set voice wake-up word, the tester audio is processed through the background model to obtain the current x-vector voiceprint feature. If it does not match the set voice wake-up word, the process ends or the voice wake-up word in the current wake-up audio is re-acquired.
Step S104, obtaining the score.
In this step, the current x-vector voiceprint feature is scored against the corresponding registered x-vector voiceprint feature in the speaker library to obtain a score.
Step S105, obtaining the wake-up result.
In this step, it is judged whether the score exceeds a set threshold; if so, wake-up pass information is generated, and if not, wake-up failure information is generated.
The training and obtaining of the background model in step S101 includes: step S1011, obtaining the current wake-up word audio sequence from the wake training set audio; step S1012, training on the current wake-up word audio sequence through a deep convolutional neural network to obtain the background model.
Step S1011 further includes: acquiring the wake training set audio, where the wake training set audio is the full audio of a single person's recorded text-independent data and contains a plurality of set words; performing speech recognition on the wake training set to obtain audio segments of the plurality of set words; obtaining, according to the plurality of set words, the sets of all audio segments corresponding to each word in the wake training set audio, each segment carrying playback time information; randomly extracting a current audio segment sequence from the sets of audio segments corresponding to the words, and arranging by segment playback time to obtain the current wake-up word audio sequence.
The obtaining of the voice wake-up word in the registrant audio in step S102 further includes: collecting the registrant audio, and extracting the voice wake-up word in the registrant audio through voice activity detection. The obtaining of the current voice wake-up word in the current wake-up audio in step S103 further includes: collecting the current wake-up audio, and extracting the current voice wake-up word from the current wake-up audio through voice activity detection.
The scoring of the current x-vector voiceprint feature in step S104 includes scoring the current x-vector voiceprint feature with a cosine distance function against the corresponding registered x-vector voiceprint feature in the speaker library.
The following technical scheme is adopted to solve the above problems: in the invention, text-independent voiceprint recognition training data is used; all characters in it are segmented and then spliced in wake-up word order to obtain wake-up word training data, and a convolutional neural network is then used to train a deep embedding model.
The flow of the whole system is shown in fig. 2. The scheme comprises three steps: background model training, voiceprint registration, and voiceprint recognition.
In the background model training stage, text-independent general data is synthesized into wake-word-dependent training data through the concatenation synthesis algorithm, and the background model based on a convolutional neural network is then trained. A sketch of such a model follows.
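The patent specifies only a "deep convolutional neural network" and does not give the architecture; the following is a minimal, hedged sketch in PyTorch of what an x-vector-style convolutional background model could look like. The feature dimension, layer widths, embedding size, and speaker count are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class XvectorCNN(nn.Module):
    """Minimal x-vector-style background model: 1-D convolutions over
    frame-level features, statistics pooling over time, an embedding
    (x-vector) layer, and a speaker classification head used only
    during training. All sizes here are illustrative assumptions."""

    def __init__(self, feat_dim=40, emb_dim=256, num_speakers=1000):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
        )
        self.embedding = nn.Linear(512 * 2, emb_dim)    # mean + std pooled
        self.classifier = nn.Linear(emb_dim, num_speakers)

    def forward(self, feats):
        # feats: (batch, feat_dim, num_frames) acoustic features
        h = self.frame_layers(feats)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        xvec = self.embedding(stats)    # the x-vector kept at test time
        return self.classifier(xvec), xvec

# Training would minimize cross-entropy of the classifier output over the
# speakers of the synthesized wake-word training set; at registration and
# test time, only `xvec` is used.
```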
In the voiceprint registration stage, after the device-side microphone collects the user's voice, voice activity detection (VAD) is used to cut out the audio of the user's speech, which is sent to the keyword wake-up system. If the system wakes up normally, the audio is sent to the x-vector extraction module, and the extracted x-vector is stored in the speaker database; otherwise, registration fails and the registration process ends.
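The patent calls for voice activity (point) detection but does not name an algorithm; as a stand-in, a crude energy-based VAD could cut out the spoken segment along the following lines (the frame sizes and relative threshold are assumptions):

```python
import numpy as np

def energy_vad(wav, sr, frame_ms=25, hop_ms=10, rel_threshold=0.1):
    """Return (start, end) sample indices of detected speech, or None.
    A crude energy-based stand-in for the unspecified VAD: frames whose
    short-time energy exceeds a fraction of the maximum frame energy
    are treated as speech."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    energies = np.array([np.sum(wav[i:i + frame] ** 2)
                         for i in range(0, len(wav) - frame + 1, hop)])
    if energies.size == 0:
        return None
    active = np.flatnonzero(energies > rel_threshold * energies.max())
    if active.size == 0:
        return None
    return active[0] * hop, active[-1] * hop + frame
```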
In the voiceprint recognition stage, after the device-side microphone collects the user's voice, VAD is likewise used to cut out the user's audio, which is sent to the keyword wake-up system. If the system cannot be woken up, the test flow ends and user recognition fails; if the system wakes up normally, the audio is sent to the x-vector extractor to extract an x-vector, which is then scored by cosine distance against the x-vector of the enrolled speaker speakerA in the database. If the score is above the threshold, the current tester is judged to be speakerA; otherwise, the tester is judged not to be speakerA.
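Given the enrolled and test x-vectors, the cosine scoring and threshold decision of steps S104-S105 reduce to a few lines. A sketch, where `speaker_db` is a hypothetical name for the speaker library and the 0.6 threshold is an illustrative placeholder, not a value from the patent:

```python
import numpy as np

def cosine_score(enrolled_xvec, test_xvec):
    """Cosine similarity between the enrolled and test x-vectors."""
    a = np.asarray(enrolled_xvec, dtype=float)
    b = np.asarray(test_xvec, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(speaker_db, test_xvec, threshold=0.6):
    """Score the test x-vector against each enrolled x-vector in the
    speaker library; return the best matching speaker name, or None
    if no score exceeds the threshold (wake-up failure)."""
    best_name, best_score = None, threshold
    for name, enrolled_xvec in speaker_db.items():
        score = cosine_score(enrolled_xvec, test_xvec)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# e.g. verify({"speakerA": enrolled_xvec}, test_xvec) -> "speakerA" or None
```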
The core component of the whole scheme is the concatenation synthesis algorithm used to synthesize the wake-up word training data. Taking the wake-up word "ni3 hao3 xiao3 le4" as an example, the processing flow for each speaker is shown in fig. 3.
As shown in fig. 3, the per-speaker concatenation synthesis algorithm proceeds as follows:
Step 1: for all the audio of one speaker, recognize the content of each recording through a speech recognition system, which also gives the position of each word in the audio.
Step 2: according to all the recognition results, collect all audio segments of the four words "ni3", "hao3", "xiao3" and "le4".
Step 3: randomly take one segment from the audio segments corresponding to each of the four words and splice them, in order, into one audio. A sketch of these three steps follows.
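A minimal sketch of these three steps, assuming a hypothetical aligner `asr_align(path)` that returns per-word time boundaries and the `soundfile` library for audio I/O (neither is named in the patent):

```python
import random
import numpy as np
import soundfile as sf  # assumed I/O library; the patent names none

WAKE_WORD = ["ni3", "hao3", "xiao3", "le4"]

def build_segment_pools(audio_paths, asr_align, sr=16000):
    """Steps 1-2: recognize each recording of one speaker and pool every
    clip of each wake-word syllable. `asr_align(path)` is a hypothetical
    aligner returning [(word, start_sec, end_sec), ...]; all files are
    assumed to share the sample rate `sr`."""
    pools = {w: [] for w in WAKE_WORD}
    for path in audio_paths:
        wav, _ = sf.read(path)
        for word, start, end in asr_align(path):
            if word in pools:
                pools[word].append(wav[int(start * sr):int(end * sr)])
    return pools

def synthesize_wake_word(pools):
    """Step 3: randomly draw one clip per syllable and concatenate them
    in wake-word order into one synthetic training utterance."""
    return np.concatenate([random.choice(pools[w]) for w in WAKE_WORD])
```

Repeating `synthesize_wake_word` many times per speaker yields the wake-word-dependent training set on which the background model is trained.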
In a second aspect, as shown in fig. 4, an embodiment of the present invention provides a system for wake-up word voiceprint recognition, including:
a background model training unit 101 configured to train and obtain a background model;
a registration unit 102 configured to obtain the voice wake-up word in the registrant audio; if the voice wake-up word matches the set voice wake-up word, process the registrant audio through the background model to obtain the registered x-vector voiceprint feature, and build a speaker library from the registered x-vector voiceprint feature; if it does not match the set voice wake-up word, end the process or re-acquire the voice wake-up word in the current registrant audio;
a verification unit 103 configured to obtain the current voice wake-up word in the current wake-up audio; if the current voice wake-up word matches the set voice wake-up word, process the tester audio through the background model to obtain the current x-vector voiceprint feature; if it does not match the set voice wake-up word, end the process or re-acquire the voice wake-up word in the current wake-up audio;
a scoring unit 104 configured to score the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library to obtain a score;
a result unit 105 configured to judge whether the score exceeds a set threshold, generate wake-up pass information if so, and wake-up failure information if not.
In a preferred embodiment of the present system, the background model training unit 101 is further configured to: obtain the current wake-up word audio sequence from the wake training set audio, and train on the current wake-up word audio sequence through a deep convolutional neural network to obtain the background model.
In a preferred embodiment of the present system, the background model training unit 101 is further configured to: acquire the wake training set audio, where the wake training set audio is the full audio of a single person's recorded text-independent data and contains a plurality of set words; perform speech recognition on the wake training set to obtain audio segments of the plurality of set words; obtain, according to the plurality of set words, the sets of all audio segments corresponding to each word in the wake training set audio, each segment carrying playback time information; randomly extract a current audio segment sequence from the sets of audio segments corresponding to the words, and arrange by segment playback time to obtain the current wake-up word audio sequence.
In a preferred embodiment of the present system, the registration unit 102 is further configured to: collect the registrant audio, and extract the voice wake-up word in the registrant audio through voice activity detection; and the verification unit 103 is further configured to: collect the current wake-up audio, and extract the current voice wake-up word from the current wake-up audio through voice activity detection.
In a preferred embodiment of the present system, the scoring unit 104 is further configured to: score the current x-vector voiceprint feature with a cosine distance function against the corresponding registered x-vector voiceprint feature in the speaker library.
In other embodiments, the present invention further provides a non-volatile computer storage medium storing computer-executable instructions that can perform the method for wake-up word voiceprint recognition in any of the above method embodiments.
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
step S101, training and obtaining a background model.
Step S102, obtaining the voice wake-up word in the registrant audio. If the voice wake-up word matches the set voice wake-up word, the registrant audio is processed through the background model to obtain the registered x-vector voiceprint feature, and a speaker library is built from the registered x-vector voiceprint feature. If it does not match the set voice wake-up word, the process ends or the voice wake-up word in the current registrant audio is re-acquired.
Step S103, obtaining the current voice wake-up word in the current wake-up audio. If the current voice wake-up word matches the set voice wake-up word, the tester audio is processed through the background model to obtain the current x-vector voiceprint feature. If it does not match the set voice wake-up word, the process ends or the voice wake-up word in the current wake-up audio is re-acquired.
Step S104, scoring the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library to obtain a score.
Step S105, judging whether the score exceeds a set threshold; if so, wake-up pass information is generated, and if not, wake-up failure information is generated.
The training and obtaining of the background model in step S101 includes: step S1011, obtaining the current wake-up word audio sequence from the wake training set audio; step S1012, training on the current wake-up word audio sequence through a deep convolutional neural network to obtain the background model.
Step S1011 further includes: acquiring the wake training set audio, where the wake training set audio is the full audio of a single person's recorded text-independent data and contains a plurality of set words; performing speech recognition on the wake training set to obtain audio segments of the plurality of set words; obtaining, according to the plurality of set words, the sets of all audio segments corresponding to each word in the wake training set audio, each segment carrying playback time information; randomly extracting a current audio segment sequence from the sets of audio segments corresponding to the words, and arranging by segment playback time to obtain the current wake-up word audio sequence.
The obtaining of the voice wake-up word in the registrant audio in step S102 further includes: collecting the registrant audio, and extracting the voice wake-up word in the registrant audio through voice activity detection. The obtaining of the current voice wake-up word in the current wake-up audio in step S103 further includes: collecting the current wake-up audio, and extracting the current voice wake-up word from the current wake-up audio through voice activity detection.
The scoring in step S104 includes scoring the current x-vector voiceprint feature with a cosine distance function against the corresponding registered x-vector voiceprint feature in the speaker library.
As a non-volatile computer-readable storage medium, it can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the method for wake-up word voiceprint recognition in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the method for wake-up word voiceprint recognition in any of the above method embodiments.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the voice signal processing apparatus, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the voice signal processing apparatus over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides a computer program product, including a computer program stored on a non-volatile computer-readable storage medium; the computer program includes program instructions which, when executed by a computer, cause the computer to execute any one of the above methods for wake-up word voiceprint recognition.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 5, the electronic device includes one or more processors 510 and a memory 520, with one processor 510 taken as the example in fig. 5. The device for the method of wake-up word voiceprint recognition may further include an input device 530 and an output device 540. The processor 510, the memory 520, the input device 530 and the output device 540 may be connected by a bus or in other ways; connection by a bus is taken as the example in fig. 5. The memory 520 is a non-volatile computer-readable storage medium as described above. By running the non-volatile software programs, instructions and modules stored in the memory 520, the processor 510 executes the various functional applications and data processing of the server, i.e., implements the method for wake-up word voiceprint recognition of the above method embodiments. The input device 530 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the information delivery device. The output device 540 may include a display device such as a display screen.
The above product can execute the method provided by the embodiments of the present invention, and has the functional modules and beneficial effects corresponding to the executed method. For technical details not described in detail in this embodiment, refer to the method provided by the embodiments of the present invention.
As an embodiment, the electronic device may be applied to an intelligent voice dialog platform and includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
train and obtain a background model;
obtain the voice wake-up word in the registrant audio; if the voice wake-up word matches the set voice wake-up word, process the registrant audio through the background model to obtain the registered x-vector voiceprint feature, and build a speaker library from the registered x-vector voiceprint feature; if it does not match the set voice wake-up word, end the process or re-acquire the voice wake-up word in the current registrant audio;
obtain the current voice wake-up word in the current wake-up audio; if the current voice wake-up word matches the set voice wake-up word, process the tester audio through the background model to obtain the current x-vector voiceprint feature; if it does not match the set voice wake-up word, end the process or re-acquire the voice wake-up word in the current wake-up audio;
score the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library to obtain a score;
judge whether the score exceeds a set threshold; if so, generate wake-up pass information, and if not, generate wake-up failure information.
The training and obtaining of the background model includes: obtaining the current wake-up word audio sequence from the wake training set audio, and training on the current wake-up word audio sequence through a deep convolutional neural network to obtain the background model.
It further includes: acquiring the wake training set audio, where the wake training set audio is the full audio of a single person's recorded text-independent data and contains a plurality of set words; performing speech recognition on the wake training set to obtain audio segments of the plurality of set words; obtaining, according to the plurality of set words, the sets of all audio segments corresponding to each word in the wake training set audio, each segment carrying playback time information; randomly extracting a current audio segment sequence from the sets of audio segments corresponding to the words, and arranging by segment playback time to obtain the current wake-up word audio sequence.
Obtaining the voice wake-up word in the registrant audio further includes: collecting the registrant audio, and extracting the voice wake-up word in the registrant audio through voice activity detection. Obtaining the current voice wake-up word in the current wake-up audio further includes: collecting the current wake-up audio, and extracting the current voice wake-up word from the current wake-up audio through voice activity detection.
Scoring the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library includes scoring the current x-vector voiceprint feature with a cosine distance function against the corresponding registered x-vector voiceprint feature in the speaker library.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: these devices feature mobile communication capability and primarily provide voice and data communication. Such terminals include smart phones (e.g., iPhone), multimedia phones, feature phones and low-end phones.
(2) Ultra-mobile personal computer devices: these belong to the category of personal computers, have computing and processing functions, and generally also offer mobile internet access. Such terminals include PDA, MID and UMPC devices, e.g., iPad.
(3) Portable entertainment devices: these can display and play multimedia content. They include audio and video players (e.g., iPod), handheld game consoles, e-book readers, smart toys and portable car navigation devices.
(4) Servers: similar in architecture to general-purpose computers, but with higher requirements on processing capability, stability, reliability, security, scalability and manageability, since highly reliable services must be provided.
(5) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods of the various embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for wake-up word voiceprint recognition, comprising the following steps:
step S101, training and obtaining a background model;
step S102, obtaining the voice wake-up word in the registrant audio; if the voice wake-up word matches the set voice wake-up word, processing the registrant audio through the background model to obtain the registered x-vector voiceprint feature, and building a speaker library from the registered x-vector voiceprint feature; if it does not match the set voice wake-up word, ending the process or re-acquiring the voice wake-up word in the current registrant audio;
step S103, obtaining the current voice wake-up word in the current wake-up audio; if the current voice wake-up word matches the set voice wake-up word, processing the tester audio through the background model to obtain the current x-vector voiceprint feature; if it does not match the set voice wake-up word, ending the process or re-acquiring the voice wake-up word in the current wake-up audio;
step S104, scoring the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library to obtain a score;
step S105, judging whether the score exceeds a set threshold; if so, generating wake-up pass information, and if not, generating wake-up failure information.
2. The recognition method of claim 1, wherein the training and obtaining of the background model in step S101 comprises:
step S1011, obtaining the current wake-up word audio sequence from the wake training set audio;
step S1012, training on the current wake-up word audio sequence through a deep convolutional neural network to obtain the background model.
3. The recognition method of claim 2, wherein step S1011 further comprises:
acquiring the wake training set audio, the wake training set audio being the full audio of a single person's recorded text-independent data, the full audio containing a plurality of set words;
performing speech recognition on the wake training set to obtain audio segments of the plurality of set words;
obtaining, according to the plurality of set words, the sets of all audio segments corresponding to each word in the wake training set audio, each audio segment carrying playback time information;
randomly extracting a current audio segment sequence from the sets of audio segments corresponding to the words, and arranging by segment playback time to obtain the current wake-up word audio sequence.
4. The recognition method of claim 1, wherein obtaining the voice wake-up word in the registrant audio in step S102 further comprises:
collecting the registrant audio;
extracting the voice wake-up word in the registrant audio through voice activity detection;
and obtaining the current voice wake-up word in the current wake-up audio in step S103 further comprises:
collecting the current wake-up audio;
extracting the current voice wake-up word from the current wake-up audio through voice activity detection.
5. The recognition method of claim 1, wherein scoring the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library in step S104 comprises:
scoring the current x-vector voiceprint feature with a cosine distance function against the corresponding registered x-vector voiceprint feature in the speaker library.
6. A system for wake-up word voiceprint recognition, comprising:
a background model training unit configured to train and obtain a background model;
a registration unit configured to obtain the voice wake-up word in the registrant audio; if the voice wake-up word matches the set voice wake-up word, process the registrant audio through the background model to obtain the registered x-vector voiceprint feature, and build a speaker library from the registered x-vector voiceprint feature; if it does not match the set voice wake-up word, end the process or re-acquire the voice wake-up word in the current registrant audio;
a verification unit configured to obtain the current voice wake-up word in the current wake-up audio; if the current voice wake-up word matches the set voice wake-up word, process the tester audio through the background model to obtain the current x-vector voiceprint feature; if it does not match the set voice wake-up word, end the process or re-acquire the voice wake-up word in the current wake-up audio;
a scoring unit configured to score the current x-vector voiceprint feature against the corresponding registered x-vector voiceprint feature in the speaker library to obtain a score;
a result unit configured to judge whether the score exceeds a set threshold, generate wake-up pass information if so, and wake-up failure information if not.
7. The recognition system of claim 6, wherein the background model training unit is further configured to: obtain the current wake-up word audio sequence from the wake training set audio; and train on the current wake-up word audio sequence through a deep convolutional neural network to obtain the background model.
8. The recognition system of claim 7, wherein the background model training unit is further configured to:
acquire the wake training set audio, the wake training set audio being the full audio of a single person's recorded text-independent data, the full audio containing a plurality of set words;
perform speech recognition on the wake training set to obtain audio segments of the plurality of set words;
obtain, according to the plurality of set words, the sets of all audio segments corresponding to each word in the wake training set audio, each audio segment carrying playback time information;
randomly extract a current audio segment sequence from the sets of audio segments corresponding to the words, and arrange by segment playback time to obtain the current wake-up word audio sequence.
9. The recognition system of claim 6, wherein the registration unit is further configured to:
collect the registrant audio;
extract the voice wake-up word in the registrant audio through voice activity detection;
and the verification unit is further configured to:
collect the current wake-up audio;
extract the current voice wake-up word from the current wake-up audio through voice activity detection.
10. The recognition system of claim 6, wherein the scoring unit is further configured to: score the current x-vector voiceprint feature with a cosine distance function against the corresponding registered x-vector voiceprint feature in the speaker library.
CN201911422078.1A 2019-12-31 2019-12-31 Method and system for wake-up word voiceprint recognition Withdrawn CN111081260A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911422078.1A CN111081260A (en) 2019-12-31 2019-12-31 Method and system for wake-up word voiceprint recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911422078.1A CN111081260A (en) 2019-12-31 2019-12-31 Method and system for wake-up word voiceprint recognition

Publications (1)

Publication Number Publication Date
CN111081260A (en) 2020-04-28

Family

ID=70321418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911422078.1A Withdrawn CN111081260A (en) 2019-12-31 2019-12-31 Method and system for identifying voiceprint of awakening word

Country Status (1)

Country Link
CN (1) CN111081260A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735438A (en) * 2020-12-29 2021-04-30 科大讯飞股份有限公司 Online voiceprint feature updating method and device, storage device and modeling device
CN113948091A (en) * 2021-12-20 2022-01-18 山东贝宁电子科技开发有限公司 Air-ground communication voice recognition engine for civil aviation passenger plane and application method thereof
WO2023207185A1 (en) * 2022-04-29 2023-11-02 荣耀终端有限公司 Voiceprint recognition method, graphical interface, and electronic device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020178004A1 (en) * 2001-05-23 2002-11-28 Chienchung Chang Method and apparatus for voice recognition
WO2016100231A1 (en) * 2014-12-15 2016-06-23 Baidu Usa Llc Systems and methods for speech transcription
CN106057206A (en) * 2016-06-01 2016-10-26 腾讯科技(深圳)有限公司 Voiceprint model training method, voiceprint recognition method and device
CN106098068A (en) * 2016-06-12 2016-11-09 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN108648759A (en) * 2018-05-14 2018-10-12 华南理工大学 Text-independent voiceprint recognition method
CN110047491A (en) * 2018-01-16 2019-07-23 中国科学院声学研究所 Random digit password related speaker recognition method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020178004A1 (en) * 2001-05-23 2002-11-28 Chienchung Chang Method and apparatus for voice recognition
WO2016100231A1 (en) * 2014-12-15 2016-06-23 Baidu Usa Llc Systems and methods for speech transcription
CN106057206A (en) * 2016-06-01 2016-10-26 腾讯科技(深圳)有限公司 Voiceprint model training method, voiceprint recognition method and device
CN106098068A (en) * 2016-06-12 2016-11-09 腾讯科技(深圳)有限公司 Voiceprint recognition method and device
CN110047491A (en) * 2018-01-16 2019-07-23 中国科学院声学研究所 Random digit password related speaker recognition method and device
CN108648759A (en) * 2018-05-14 2018-10-12 华南理工大学 Text-independent voiceprint recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
董莺艳: "Research on voiceprint recognition methods based on deep learning" (基于深度学习的声纹识别方法研究), China Master's Theses Full-text Database, Information Science and Technology *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735438A (en) * 2020-12-29 2021-04-30 科大讯飞股份有限公司 Online voiceprint feature updating method and device, storage device and modeling device
CN113948091A (en) * 2021-12-20 2022-01-18 山东贝宁电子科技开发有限公司 Air-ground communication voice recognition engine for civil aviation passenger plane and application method thereof
WO2023207185A1 (en) * 2022-04-29 2023-11-02 荣耀终端有限公司 Voiceprint recognition method, graphical interface, and electronic device

Similar Documents

Publication Publication Date Title
CN107147618B (en) User registration method and device and electronic equipment
JP6394709B2 (en) SPEAKER IDENTIFYING DEVICE AND FEATURE REGISTRATION METHOD FOR REGISTERED SPEECH
CN111081260A (en) Method and system for wake-up word voiceprint recognition
CN111063341A (en) Method and system for segmenting and clustering multi-person voice in complex environment
CN111081280B (en) Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN104143326A (en) Voice command recognition method and device
CN108831477B (en) Voice recognition method, device, equipment and storage medium
CN111341325A (en) Voiceprint recognition method and device, storage medium and electronic device
CN102404278A (en) Song request system based on voiceprint recognition and application method thereof
CN109462603A (en) Voiceprint authentication method, equipment, storage medium and device based on blind Detecting
CN104462912B (en) Improved biometric password security
WO2017166651A1 (en) Voice recognition model training method, speaker type recognition method and device
WO2018129869A1 (en) Voiceprint verification method and apparatus
CN110634468B (en) Voice wake-up method, device, equipment and computer readable storage medium
CN102916815A (en) Method and device for checking identity of user
CN110544469A (en) Training method and device of voice recognition model, storage medium and electronic device
CN111179915A (en) Age identification method and device based on voice
CN111243604B (en) Training method for speaker recognition neural network model supporting multiple awakening words, speaker recognition method and system
CN108847243B (en) Voiceprint feature updating method and device, storage medium and electronic equipment
CN110544468A (en) Application awakening method and device, storage medium and electronic equipment
JP6915637B2 (en) Information processing equipment, information processing methods, and programs
CN110600029A (en) User-defined awakening method and device for intelligent voice equipment
CN110580897A (en) audio verification method and device, storage medium and electronic equipment
CN111081256A (en) Digital string voiceprint password verification method and system
CN109087647A (en) Application on Voiceprint Recognition processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

WW01 Invention patent application withdrawn after publication

Application publication date: 20200428
