CN111145758A

CN111145758A - Voiceprint recognition method, system, mobile terminal and storage medium

Info

Publication number: CN111145758A
Application number: CN201911357829.6A
Authority: CN
Inventors: 叶林勇; 肖龙源; 李稀敏; 蔡振华; 刘晓葳
Original assignee: Xiamen Kuaishangtong Technology Co Ltd
Current assignee: Xiamen Kuaishangtong Technology Co Ltd
Priority date: 2019-12-25
Filing date: 2019-12-25
Publication date: 2020-05-12

Abstract

The invention provides a voiceprint recognition method, a system, a mobile terminal and a storage medium, wherein the method comprises the following steps: acquiring voice data to be recognized, and performing feature extraction on the voice data to be recognized to obtain acoustic features; decoding and identifying the acoustic features to obtain text contents, and performing text cutting on the voice data to be identified according to the text contents; judging the text type of the voice data to be recognized according to the text cutting result, and inquiring a target recognition model according to the text type; and carrying out voiceprint recognition on the voice data to be recognized according to the target recognition model so as to obtain a voiceprint recognition result. According to the invention, the text type of the voice data to be recognized is judged by performing the text cutting design on the voice data to be recognized according to the text content, so that the voice data to be recognized can be transmitted to the corresponding voiceprint recognition model for voiceprint recognition according to the judged text type, the problem of inconsistency between the registered voice and the voice to be recognized in the voiceprint recognition process is prevented, and the accuracy of voiceprint recognition is improved.

Description

Voiceprint recognition method, system, mobile terminal and storage medium

Technical Field

The invention belongs to the technical field of voiceprint recognition, and particularly relates to a voiceprint recognition method, a voiceprint recognition system, a mobile terminal and a storage medium.

Background

The voice of each person implies unique biological characteristics, and the voiceprint recognition refers to a technical means for recognizing a speaker by using the voice of the speaker. The voiceprint recognition has high safety and reliability as the techniques of fingerprint recognition and the like, and can be applied to all occasions needing identity recognition. Such as in the financial fields of criminal investigation, banking, securities, insurance, and the like. Compared with the traditional identity recognition technology, the voiceprint recognition technology has the advantages of simple voiceprint extraction process, low cost, uniqueness and difficulty in counterfeiting and counterfeit.

The existing voiceprint recognition scheme is to collect voice data of at least one user in advance, extract a characteristic value of the voice data, and input the extracted characteristic value into a voiceprint model to obtain an N-dimensional voiceprint vector. And when confirming or identifying, firstly acquiring the voice data of any user, then extracting a characteristic value of the voice data, inputting the characteristic value into a voiceprint model to obtain an N-dimensional voiceprint vector, and then performing similarity matching with the original voiceprint vector in a voiceprint library, wherein each matched user can obtain a score, and the voiceprint with the highest score and larger than a threshold value is the user corresponding to the voice to be detected. However, in the prior art, when the speech to be detected is not a speaking speech, for example, when the speech to be detected is a section of random text and the registered speech is a sentence of fixed text, the obtained recognition result is inaccurate, so that the voiceprint recognition accuracy is low.

Disclosure of Invention

The embodiment of the invention aims to provide a voiceprint recognition method, a voiceprint recognition system, a mobile terminal and a storage medium, and aims to solve the problem that the existing voiceprint recognition method is low in recognition accuracy.

The embodiment of the invention is realized in such a way that a voiceprint recognition method comprises the following steps:

acquiring voice data to be recognized, and performing feature extraction on the voice data to be recognized to obtain acoustic features;

decoding and identifying the acoustic features to obtain text content, and performing text cutting on the voice data to be identified according to the text content;

judging the text type of the voice data to be recognized according to the text cutting result, and inquiring a target recognition model according to the text type, wherein the target recognition model is a text-related recognition model, a text-unrelated recognition model or a text semi-related recognition model;

and carrying out voiceprint recognition on the voice data to be recognized according to the target recognition model so as to obtain a voiceprint recognition result.

Further, the step of decoding and identifying the acoustic features comprises:

inputting the acoustic features into an acoustic model to obtain phoneme information;

and inputting the phoneme information into a language model and decoding according to a preset text dictionary to obtain the text content.

Further, the step of performing text segmentation on the speech data to be recognized according to the text content includes:

judging whether text characters are stored in the text content;

when the text characters are judged to be stored in the text content, carrying out text marking on corresponding voice in the voice data to be recognized according to the text characters;

when the text word is judged not to be stored in the text content, judging whether a number is stored in the text content;

and when the number is judged to be stored in the text content, carrying out digital marking on the corresponding voice in the voice data to be recognized according to the number.

Further, the step of determining the text type of the speech data to be recognized according to the text cutting result comprises:

judging whether the text characters are fixed texts prestored locally;

if so, judging that the voice data to be recognized is a text related type;

and if not, judging that the voice data to be recognized is a text-independent type.

when the number is judged to be stored in the text content, judging whether the number value of the number is a number threshold value;

if so, judging that the voice data to be recognized is a text semi-correlation type;

if not, sending out a text content error prompt.

Further, the step of querying the target recognition model according to the text type comprises:

when the voice data to be recognized is judged to be the text-related type, judging that the target recognition model is the text-related recognition model;

when the voice data to be recognized is judged to be the text-independent type, judging that the target recognition model is the text-independent recognition model;

and when the voice data to be recognized is judged to be the text semi-correlation type, judging that the target recognition model is the text semi-correlation recognition model.

Further, the step of performing voiceprint recognition on the voice data to be recognized according to the target recognition model comprises:

inputting the acoustic features into the target recognition model to obtain feature vectors;

calculating a matching value between the characteristic vector and a locally pre-stored sample vector according to an Euclidean distance formula, and acquiring a serial number value of the sample vector corresponding to the maximum value in the matching value;

and when the number value is judged to be larger than the number threshold value, judging that the voiceprint recognition of the voice data to be recognized is qualified.

Another object of an embodiment of the present invention is to provide a voiceprint recognition system, which includes:

the acoustic feature extraction module is used for acquiring voice data to be recognized and extracting features of the voice data to be recognized to obtain acoustic features;

the text cutting module is used for decoding and identifying the acoustic features to obtain text contents and performing text cutting on the voice data to be identified according to the text contents;

the model query module is used for judging the text type of the voice data to be recognized according to the text cutting result and querying a target recognition model according to the text type, wherein the target recognition model is a text-related recognition model, a text-unrelated recognition model or a text semi-related recognition model;

and the voiceprint recognition module is used for carrying out voiceprint recognition on the voice data to be recognized according to the target recognition model so as to obtain a voiceprint recognition result.

Another object of an embodiment of the present invention is to provide a mobile terminal, including a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal execute the above voiceprint recognition method.

Another object of an embodiment of the present invention is to provide a storage medium, which stores a computer program used in the above-mentioned mobile terminal, wherein the computer program, when executed by a processor, implements the steps of the above-mentioned voiceprint recognition method.

According to the method and the device, the text type of the voice data to be recognized is judged by performing text cutting design on the voice data to be recognized according to the text content, so that the voice data to be recognized can be transmitted to the corresponding voice print recognition model according to the judged text type for voice print recognition, the problem that the registered voice is inconsistent with the voice to be recognized in the voice print recognition process is avoided, and the accuracy of the voice print recognition is effectively improved.

Drawings

Fig. 1 is a flowchart of a voiceprint recognition method provided by a first embodiment of the invention;

FIG. 2 is a flow chart of a voiceprint recognition method provided by a second embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a voiceprint recognition system provided by a third embodiment of the present invention;

fig. 4 is a schematic structural diagram of a mobile terminal according to a fourth embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In order to explain the technical means of the present invention, the following description will be given by way of specific examples.

Example one

Referring to fig. 1, a flowchart of a voiceprint recognition method according to a first embodiment of the present invention is shown, which includes the following steps:

step S10, acquiring voice data to be recognized, and performing feature extraction on the voice data to be recognized to obtain acoustic features;

wherein, the acoustic feature is extracted by adopting an MFCC algorithm to obtain Mel-scale frequency Cepstral coeffients (Mel-scale frequency Cepstral coeffients);

specifically, the extraction method of the mel-frequency cepstrum coefficient includes: pre-emphasis, framing, windowing, FFT processing, filter processing, logarithm operation and discrete cosine transform;

step S20, decoding and identifying the acoustic features to obtain text content, and performing text cutting on the voice data to be identified according to the text content;

specifically, in this embodiment, the acoustic model and the language model are respectively subjected to model training according to locally pre-stored sample voice data and sample text data, so that the voice in the voice data to be recognized can be effectively subjected to character recognition according to the trained acoustic model and voice model, and the text content is obtained;

in the step, data such as characters, numbers or letters and the like can be stored in the text content obtained by decoding, and the text content obtained by decoding corresponds to the information in the voice data to be recognized one by one;

step S30, judging the text type of the voice data to be recognized according to the text cutting result, and inquiring a target recognition model according to the text type;

the target identification model is a text-related identification model, a text-unrelated identification model or a text semi-related identification model;

specifically, the text related model is used for recognizing fixed voices, such as fixed texts like 'a magpie is on a tree', 'shopping is really convenient', 'a jujube tree is in front of my family', and the like, wherein the registration of the text related model needs a section of voice file repeated for three times to ensure the quality of the extracted voiceprint characteristic value;

the text-independent voiceprint model is used for identifying random texts, and the registration requires that the effective voice text length is more than 30 s;

the text semi-correlation voiceprint model is used for identifying random 8-bit length dynamic numbers, and the registration of the text semi-correlation model requires 5 groups of 8-bit random dynamic number voices;

preferably, in this step, the step of querying the target recognition model according to the text type includes:

when the voice data to be recognized is judged to be the text-related type, judging that the target recognition model is the text-related recognition model, namely when the voice data to be recognized is judged to be a fixed text pre-stored locally, judging that the voice data to be recognized can be subjected to voiceprint recognition by adopting the text-related recognition model currently;

when the voice data to be recognized is judged to be the text-independent type, the target recognition model is judged to be the text-independent recognition model, namely when the voice data to be recognized is judged to be a random text, the voice print recognition of the voice data to be recognized can be carried out by adopting the text-independent recognition model at present;

when the voice data to be recognized is judged to be the text semi-correlation type, the target recognition model is judged to be the text semi-correlation recognition model, namely when the voice data to be recognized is judged to be a digital sequence, the voice print recognition of the voice data to be recognized can be carried out by adopting the text semi-correlation recognition model at present;

step S40, performing voiceprint recognition on the voice data to be recognized according to the target recognition model to obtain a voiceprint recognition result;

extracting acoustic features of the voice data to be recognized by using an MFCC algorithm, and then inputting the extracted acoustic features into a corresponding target recognition model (a text-related model, a text-unrelated model or a text semi-related model) to output a voiceprint recognition result, wherein the voiceprint recognition result is qualified or unqualified voiceprint recognition detection of the voice data to be recognized;

in the embodiment, the text type of the voice data to be recognized is judged by performing text cutting on the voice data to be recognized according to the text content, so that the voice data to be recognized can be transmitted to the corresponding voice print recognition model according to the judged text type for voice print recognition, the problem of inconsistency between the registered voice and the voice to be recognized in the voice print recognition process is prevented, and the accuracy of the voice print recognition is effectively improved.

Example two

Referring to fig. 2, a flowchart of a voiceprint recognition method according to a second embodiment of the present invention is shown, which includes the following steps:

step S11, acquiring voice data to be recognized, and performing feature extraction on the voice data to be recognized to obtain acoustic features;

step S21, inputting the acoustic features into an acoustic model to obtain phoneme information, inputting the phoneme information into a language model and decoding according to a preset text dictionary to obtain the text content;

the method comprises the steps that model training is respectively carried out on an acoustic model and a language model according to sample voice data and sample text data which are pre-stored locally, so that phoneme information acquisition and text decoding can be effectively carried out on voice in the voice data to be recognized according to the acoustic model and the voice model which are trained, and text content is obtained, preferably, in the step, the preset text dictionary can be coded in a unique hot coding mode;

step S31, judging whether the text content stores text characters;

the text characters are any preset characters, and the preset characters can be Chinese, English, Japanese or Korean;

specifically, in this step, whether the text content stores the text characters is determined by sequentially matching the characters in the text content with preset characters, wherein the matching between the text content and the preset characters can be performed by adopting an image matching manner;

when the step S31 determines that the text word is stored in the text content, text labeling the corresponding voice in the voice data to be recognized according to the text word, and executing step S41;

by designing the text marks on the corresponding voices, the subsequent capturing of the corresponding voices is effectively facilitated, and the recognition efficiency and accuracy of the voiceprint recognition method are improved;

step S41, judging whether the text characters are fixed texts pre-stored locally;

the fixed text can be set according to requirements, for example, the fixed text can be set to 'a magpie is on a tree', 'shopping is convenient', 'a jujube tree is in front of my home', and the like, that is, the step judges whether the speech to be recognized is a text related type by judging whether text characters are pre-stored fixed text;

when the judgment result of the step S41 is yes, step S51 is performed;

step S51, judging the voice data to be recognized as text-related type, and setting the text-related recognition model as a target recognition model;

when the voice data to be recognized is judged to be the text related type, the voice to be recognized is judged to be the voice data sent out aiming at the fixed text, for example, when the locally pre-stored fixed text is 'a magpie exists on a tree', and the text stored in the text content is judged to be 'a magpie exists on a tree', the voice data to be recognized is judged to be the text related type;

preferably, in this embodiment, when it is determined that the repetition probability between the text characters and the fixed text is greater than or equal to a probability threshold, it is determined that the speech data to be recognized is a text-related type, where the probability threshold may be set according to a requirement, and in this embodiment, the probability threshold is 50%, for example:

when the locally pre-stored fixed text is 'magpie only on tree' and the text stored in the text content is 'magpie only', the repetition probability is 50 percent, so that the voice data to be recognized is judged to be a text related type;

when the judgment result of the step S41 is no, step S61 is performed;

step S61, judging the voice data to be recognized as a text-independent type, and setting a text-independent recognition model as a target recognition model;

when the voice data to be recognized is judged to be a text-independent type, judging that the voice to be recognized is the voice data sent out by aiming at the random text, wherein the random text is subjected to character dynamic change along with the change of time;

when the step S31 determines that the text word is not stored in the text content, execute step S71;

step S71, judging whether the text content stores numbers;

preferably, in other embodiments, the step may further determine whether a preset identifier is stored in the text content, where the preset identifier may be a letter or a symbol;

when the step S71 determines that the number is stored in the text content, digitally labeling the corresponding voice in the voice data to be recognized according to the number, and executing step S81;

step S81, determining whether the number value of the number is a number threshold;

in this embodiment, the number threshold is 8, that is, whether the number of the digits in the text content is 8 is determined;

preferably, in this step, when it is determined that the ratio between the number of the digits and the number threshold is greater than or equal to a preset ratio, it is determined that the number of the digits is the number threshold, and the preset ratio may be set according to a requirement, where the preset ratio in this embodiment is 0.5, for example:

when the number of the numbers is equal to 4:8, the ratio of the number of the numbers to the number threshold value is 4:8, so that the number of the numbers is judged to be the number threshold value;

when the judgment result of the step S81 is yes, step S91 is performed;

step S91, judging the voice data to be recognized as a text semi-correlation type, and setting a text semi-correlation recognition model as a target recognition model;

when the voice data to be recognized is judged to be of a text semi-correlation type, judging that the voice to be recognized is the voice data sent out aiming at the dynamic number;

when the judgment result of step S71 or step S81 is no, step 101 is executed;

step 101, sending out a text content error prompt;

when the text content is judged to be not stored with characters and numbers, sending a text content error prompt, wherein the sent text content error prompt is used for prompting a user that errors exist in the current collection or decoding of the voice data to be recognized;

step S111, performing voiceprint recognition on the voice data to be recognized according to the target recognition model to obtain a voiceprint recognition result;

specifically, in this step, the step of performing voiceprint recognition on the speech data to be recognized according to the target recognition model includes:

when the number value is judged to be larger than the number threshold value, judging that the voiceprint recognition of the voice data to be recognized is qualified;

in the embodiment, the text type of the voice data to be recognized is judged by performing text cutting on the voice data to be recognized according to the text content, so that the voice data to be recognized can be transmitted to the corresponding voice print recognition model according to the judged text type for voice print recognition, the problem that the registered voice is inconsistent with the voice to be recognized in the voice print recognition process is prevented, and the accuracy of the voice print recognition is effectively improved.

EXAMPLE III

Referring to fig. 3, a schematic structural diagram of a voiceprint recognition system 100 according to a third embodiment of the present invention is shown, including: the acoustic feature extraction module 10, the text cutting module 11, the model query module 12 and the voiceprint recognition module 13, wherein:

the acoustic feature extraction module 10 is configured to acquire voice data to be recognized and perform feature extraction on the voice data to be recognized to obtain acoustic features;

and the text cutting module 11 is configured to decode and identify the acoustic features to obtain text content, and perform text cutting on the voice data to be identified according to the text content.

Wherein the text cutting module 11 is further configured to: inputting the acoustic features into an acoustic model to obtain phoneme information; and inputting the phoneme information into a language model and decoding according to a preset text dictionary to obtain the text content.

Preferably, the text cutting module 11 is further configured to: judging whether text characters are stored in the text content;

And the model query module 12 is configured to determine a text type of the speech data to be recognized according to the result of the text segmentation, and query a target recognition model according to the text type, where the target recognition model is a text-related recognition model, a text-unrelated recognition model, or a text semi-related recognition model.

Further, the model query module 12 is further configured to determine whether the text characters are fixed texts pre-stored locally; if so, judging that the voice data to be recognized is a text related type; and if not, judging that the voice data to be recognized is a text-independent type.

Preferably, the model query module 12 is further configured to: when the number is judged to be stored in the text content, judging whether the number value of the number is a number threshold value; if so, judging that the voice data to be recognized is a text semi-correlation type; if not, sending out a text content error prompt. A

Further, the model query module 12 is further configured to: when the voice data to be recognized is judged to be the text-related type, judging that the target recognition model is the text-related recognition model; when the voice data to be recognized is judged to be the text-independent type, judging that the target recognition model is the text-independent recognition model; and when the voice data to be recognized is judged to be the text semi-correlation type, judging that the target recognition model is the text semi-correlation recognition model.

And the voiceprint recognition module 13 is configured to perform voiceprint recognition on the voice data to be recognized according to the target recognition model to obtain a voiceprint recognition result.

Wherein the voiceprint recognition module 13 is further configured to: inputting the acoustic features into the target recognition model to obtain feature vectors; calculating a matching value between the characteristic vector and a locally pre-stored sample vector according to an Euclidean distance formula, and acquiring a serial number value of the sample vector corresponding to the maximum value in the matching value; and when the number value is judged to be larger than the number threshold value, judging that the voiceprint recognition of the voice data to be recognized is qualified.

Example four

Referring to fig. 4, a mobile terminal 101 according to a fourth embodiment of the present invention includes a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal 101 execute the above voiceprint recognition method.

The present embodiment also provides a storage medium on which a computer program used in the above-mentioned mobile terminal 101 is stored, which when executed, includes the steps of:

and carrying out voiceprint recognition on the voice data to be recognized according to the target recognition model so as to obtain a voiceprint recognition result. The storage medium, such as: ROM/RAM, magnetic disk, optical disk, etc.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is used as an example, in practical applications, the above-mentioned function distribution may be performed by different functional units or modules according to needs, that is, the internal structure of the storage device is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit, and the integrated unit may be implemented in a form of hardware, or may be implemented in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application.

Those skilled in the art will appreciate that the component structures shown in fig. 3 are not intended to be limiting of the voiceprint recognition system of the present invention and can include more or fewer components than shown, or some components in combination, or a different arrangement of components, and that the voiceprint recognition method of fig. 1-2 can also be implemented using more or fewer components than shown in fig. 3, or some components in combination, or a different arrangement of components. The units, modules, etc. referred to herein are a series of computer programs that can be executed by a processor (not shown) in the target voiceprint recognition system and that are functionally capable of performing certain functions, all of which can be stored in a storage device (not shown) of the target voiceprint recognition system.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A voiceprint recognition method, the method comprising:

2. The voiceprint recognition method of claim 1 wherein said step of decoding and recognizing said acoustic features comprises:

3. The voiceprint recognition method according to claim 1, wherein the step of text-cutting the speech data to be recognized according to the text content comprises:

judging whether text characters are stored in the text content;

4. The voiceprint recognition method according to claim 3, wherein the step of determining the text type of the speech data to be recognized according to the result of the text cutting comprises:

judging whether the text characters are fixed texts prestored locally;

if so, judging that the voice data to be recognized is a text related type;

5. The voiceprint recognition method according to claim 4, wherein the step of determining the text type of the speech data to be recognized according to the result of the text cutting comprises:

if not, sending out a text content error prompt.

6. The voiceprint recognition method of claim 5 wherein said step of querying a target recognition model based on said text type comprises:

7. The voiceprint recognition method according to claim 1, wherein the step of voiceprint recognizing the speech data to be recognized according to the target recognition model comprises:

8. A voiceprint recognition system, said system comprising:

9. A mobile terminal, characterized in that it comprises a storage device for storing a computer program and a processor running the computer program to make the mobile terminal execute the voiceprint recognition method according to any one of claims 1 to 7.

10. A storage medium, characterized in that it stores a computer program for use in a mobile terminal according to claim 9, which computer program, when executed by a processor, implements the steps of the voiceprint recognition method according to any one of claims 1 to 7.