WO2018095167A1 - Voiceprint identification method and voiceprint identification system - Google Patents


Info

Publication number
WO2018095167A1
WO2018095167A1 (PCT/CN2017/106886)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
tested
sample
type
feature matrix
Prior art date
Application number
PCT/CN2017/106886
Other languages
French (fr)
Chinese (zh)
Inventor
雷利博
薛韬
罗超
Original Assignee
北京京东尚科信息技术有限公司
北京京东世纪贸易有限公司
Priority date
Filing date
Publication date
Application filed by 北京京东尚科信息技术有限公司 and 北京京东世纪贸易有限公司
Publication of WO2018095167A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 — Speaker identification or verification techniques
    • G10L 17/02 — Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/04 — Training, enrolment or model building

Definitions

  • the present disclosure relates to the field of voiceprint recognition, and in particular to a voiceprint recognition method and a voiceprint recognition system.
  • Voiceprint refers to a spectrum pattern showing sound-wave characteristics drawn by a special electro-acoustic conversion instrument (such as a sonograph or a speech spectrograph), and is a collection of various acoustic feature maps.
  • for the human body, the voiceprint is a long-term stable characteristic signal. Due to innate physiological differences of the vocal organs and acquired behavioral differences, each person's voiceprint carries a strong personal color.
  • Voiceprint recognition is a biometric method that automatically recognizes the identity of a speaker based on characteristic parameters such as unique physiological and behavioral characteristics contained in human speech.
  • voiceprint recognition mainly collects a person's voice information, extracts unique voice features and converts them into digital symbols, and saves them as a feature template, so that during application the voice to be recognized is matched against the templates in the database, thereby discriminating the speaker's identity.
  • Sound spectrum analysis plays a major role in modern life. For example, the installation, adjustment and operation of machinery in industrial production can be monitored by means of sound spectrum analysis. In addition, sound spectrum analysis is widely applied in the scientific testing of musical instrument manufacturing, jewelry identification, and the effective use of communication and broadcast equipment.
  • the "voiceprint recognition" technology can be used for identity authentication to discriminate the identity of the speaker.
  • most of the research results in this field are based on text dependence; that is, the person being verified must speak according to prescribed text, which has limited the development of the technology.
  • the fault tolerance of existing algorithms is poor: they basically rely on a single similarity score to assess whether two speech-feature samples belong to the same person. If the sample size is not large enough, or the speech features of the samples are highly similar, it is difficult to make an accurate judgment.
  • a voiceprint recognition method may include: receiving audio to be tested and dividing the audio to be tested into a first part and a second part; selecting one sample audio from the sample database and dividing the selected sample audio into a first part and a second part; extracting feature matrices for the audio to be tested and the selected sample audio by using the Mel cepstrum coefficient extraction method; performing support vector machine training by using the feature matrix of the first part of the audio to be tested as the first type of samples and the feature matrix of the selected sample audio as the second type of samples, and calculating the ratio a of the second part of the audio to be tested belonging to the second type of samples; performing support vector machine training by using the feature matrix of the first part of the selected sample audio as the first type of samples and the feature matrix of the audio to be tested as the second type of samples, and calculating the ratio b of the second part of the selected sample audio belonging to the second type of samples; performing support vector machine training by using the feature matrix of the second part of the audio to be tested as the first type of samples and the feature matrix of the selected sample audio as the second type of samples, and calculating the ratio c of the first part of the audio to be tested belonging to the second type of samples; performing support vector machine training by using the feature matrix of the second part of the selected sample audio as the first type of samples and the feature matrix of the audio to be tested as the second type of samples, and calculating the ratio d of the first part of the selected sample audio belonging to the second type of samples; and calculating, according to the calculated a, b, c, and d, the degree to which the audio to be tested matches the selected sample audio, so as to determine whether they are from the same person.
  • the voiceprint recognition method further includes preprocessing the received audio to be tested, wherein the preprocessing includes at least one of: pre-emphasizing the audio to be tested; framing the audio to be tested by using an overlapping-segment framing method; applying a Hamming window to eliminate the Gibbs effect; and distinguishing speech frames from non-speech frames and discarding the non-speech frames.
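The preprocessing chain above can be sketched in NumPy as follows. This is an illustrative sketch, not part of the disclosure: the 25 ms frame length, 10 ms hop, 0.97 pre-emphasis coefficient, and the energy threshold used for speech/non-speech discrimination are all assumed values.

```python
import numpy as np

def preprocess(signal, sr, frame_ms=25, hop_ms=10, alpha=0.97):
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1] boosts high frequencies
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Overlapping-segment framing
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)
    frames = np.stack([emphasized[i * hop_len : i * hop_len + frame_len]
                       for i in range(n_frames)])

    # Hamming window to suppress the Gibbs effect at frame edges
    frames = frames * np.hamming(frame_len)

    # Crude energy-based endpoint detection: drop low-energy (non-speech) frames
    energy = (frames ** 2).sum(axis=1)
    keep = energy > 0.1 * energy.mean()
    return frames[keep]
```

A real implementation would typically combine short-time energy with the zero-crossing rate for endpoint detection, as the description below mentions; the single energy threshold here is the simplest stand-in.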
  • the dividing the audio to be tested into the first portion and the second portion includes dividing the audio to be tested into two portions of equal length.
  • the dividing the selected sample audio into the first portion and the second portion comprises dividing the selected sample audio into two portions of equal length.
  • calculating the degree of matching of the audio to be tested and the sample audio comprises: calculating an average value of a, b, c, and d; and determining the ratio of the average value to 0.5 as the degree of matching between the audio to be tested and the sample audio.
  • a voiceprint recognition system, comprising: a receiver configured to receive audio to be tested; a sample database configured to store one or more sample audios; a support vector machine configured to classify test data according to classification samples; and a controller configured to: divide the audio to be tested from the receiver into a first part and a second part, and select one sample audio from the sample database and divide the selected sample audio into a first part and a second part; extract feature matrices for the audio to be tested and the selected sample audio by using the Mel cepstrum coefficient extraction method; input to the support vector machine the feature matrix of the first part of the audio to be tested as the first type of samples and the feature matrix of the selected sample audio as the second type of samples, train the support vector machine, and calculate the ratio a of the second part of the audio to be tested belonging to the second type of samples; input to the support vector machine the feature matrix of the first part of the selected sample audio as the first type of samples and the feature matrix of the audio to be tested as the second type of samples, train the support vector machine, and calculate the ratio b of the second part of the selected sample audio belonging to the second type of samples; calculate, in the same manner, the ratio c of the first part of the audio to be tested and the ratio d of the first part of the selected sample audio belonging to their respective second types of samples; and calculate, according to a, b, c, and d, the degree of matching between the audio to be tested and the selected sample audio.
  • the controller may be further configured to pre-process the received audio to be tested, wherein the pre-processing comprises at least one of: pre-emphasizing the audio to be tested; framing the audio to be tested by using an overlapping-segment framing method; applying a Hamming window to eliminate the Gibbs effect; and distinguishing speech frames from non-speech frames and discarding the non-speech frames.
  • the controller is further configured to divide the audio to be tested into two parts of equal length.
  • the controller is further configured to split the selected sample audio into two parts of equal length.
  • the controller is further configured to: calculate an average value of a, b, c, and d; and determine a ratio of the average value to 0.5 as a degree of matching of the audio to be tested and the sample audio.
  • a computer system comprising: one or more processors; a memory for storing one or more programs, wherein when the one or more programs are When executed by a plurality of processors, the one or more processors are caused to implement the voiceprint recognition method as described above.
  • a computer readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to implement the voiceprint recognition method as described above.
  • FIG. 1 is a block diagram showing the structure of a voiceprint recognition system according to an exemplary embodiment of the present disclosure
  • FIG. 2 illustrates an operational logic diagram of a voiceprint recognition method in accordance with an example embodiment of the present disclosure
  • FIG. 3 illustrates a flow chart of a voiceprint recognition method according to an example embodiment of the present disclosure
  • FIG. 4 is a diagram showing an example of a process of training the support vector machine of FIG. 3 and calculating an audio matching degree
  • FIG. 5 schematically illustrates a block diagram of a computer system suitable for implementing a voiceprint recognition method in accordance with an embodiment of the present disclosure.
  • the present disclosure provides a text-independent voiceprint recognition method and voiceprint recognition system, wherein the voiceprint recognition method can effectively improve the fault tolerance of voiceprint recognition with small samples, and can quickly and efficiently identify whether two segments of audio belong to the same person, giving it broad application prospects. Through speaker recognition in voiceprint recognition technology, identity identification using voice information can be achieved.
  • FIG. 1 shows a block diagram of a structure of a voiceprint recognition system 100 in accordance with an exemplary embodiment of the present disclosure.
  • the voiceprint recognition system 100 includes a receiver 110 configured to receive audio to be tested; a sample database 120 configured to store one or more sample audios; a support vector machine 130 configured to classify test data according to classification samples; and a controller 140.
  • the support vector machine 130 is capable of performing a classification function.
  • the input space is first transformed into a high-dimensional space by a nonlinear transformation, so that the samples become linearly separable, wherein the nonlinear transformation is achieved by an appropriate inner product (kernel) function; the optimal linear classification surface is then sought in the new space to achieve the classification function.
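This kernel-based classification behavior can be illustrated with scikit-learn's `SVC` standing in for the support vector machine 130. The ring-shaped toy data and the RBF kernel choice are illustrative assumptions, not details from the disclosure.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two classes that are not linearly separable in the input space:
# class 0 forms an inner disk, class 1 an outer ring around it.
r0 = rng.uniform(0.0, 1.0, 200)
r1 = rng.uniform(2.0, 3.0, 200)
th = rng.uniform(0.0, 2.0 * np.pi, 400)
X = np.c_[np.r_[r0, r1] * np.cos(th), np.r_[r0, r1] * np.sin(th)]
y = np.r_[np.zeros(200), np.ones(200)]

# The RBF kernel plays the role of the inner product function: it implicitly
# maps inputs into a high-dimensional space where a linear surface separates them.
clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))  # typically close to 1.0 on this cleanly separated data
```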
  • the controller 140 may be configured to divide the audio to be tested from the receiver 110 into a first portion and a second portion, and select one sample audio from the sample database 130 and divide the selected sample audio into the first portion and the second portion. For example, the audio to be tested and the selected sample audio are both divided into two parts of equal length.
  • the controller 140 extracts a feature matrix for the audio to be tested and the selected sample audio by using the extraction method of the Mel Cepstrum Coefficient (MFCC).
  • the Mel frequency is based on the auditory characteristics of the human ear, which is nonlinearly related to the frequency (Hz).
  • the Mel Frequency Cepstral Coefficient (MFCC) is a Hz spectral feature calculated using this relationship between them.
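A commonly used form of this nonlinear relationship is mel(f) = 2595 · log10(1 + f/700); the disclosure does not specify its exact constants, so this standard formula is an assumption here.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Common Mel-scale mapping: roughly linear below ~1 kHz, logarithmic above."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)
```

With these constants, 1000 Hz maps to approximately 1000 mel, which is the usual calibration point of the scale.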
  • the controller 140 determines whether the audio to be tested and the selected sample audio are from the same person by using a support vector machine.
  • the feature matrix of the first part of the to-be-tested audio as the first type of samples and the feature matrix of the selected sample audio as the second type of samples may be input to the support vector machine 130, and the support vector machine 130 may be trained, to calculate the ratio a of the second part of the audio to be tested belonging to the second type of samples; by inputting to the support vector machine 130 the feature matrix of the first part of the selected sample audio as the first type of samples and the feature matrix of the audio to be tested as the second type of samples and training the support vector machine 130, the ratio b of the second part of the selected sample audio belonging to the second type of samples is calculated; by inputting to the support vector machine 130 the feature matrix of the second part of the audio to be tested as the first type of samples and the feature matrix of the selected sample audio as the second type of samples and training the support vector machine 130, the ratio c of the first part of the audio to be tested belonging to the second type of samples is calculated; and by inputting to the support vector machine 130 the feature matrix of the second part of the selected sample audio as the first type of samples and the feature matrix of the audio to be tested as the second type of samples and training the support vector machine 130, the ratio d of the first part of the selected sample audio belonging to the second type of samples is calculated.
  • the controller 140 may be further configured to pre-process the received audio to be tested, for example, to pre-emphasize the audio to be tested with pre-filtering and high-frequency compensation; to frame the audio by using an overlapping-segment framing method; to apply a Hamming window to eliminate the Gibbs effect; and to distinguish speech frames from non-speech frames and discard the non-speech frames.
  • since the sound signal usually changes continuously, in order to simplify the continuously varying signal, it is assumed that the audio signal does not change over a short time scale; the signal is therefore grouped into units of multiple sampling points, each of which is called a "frame". A frame is often 20-40 milliseconds: if the frame is shorter, there will not be enough sampling points in each frame for a reliable spectrum calculation, while if it is too long, the signal will change too much within each frame.
  • FIG. 2 illustrates an operational logic diagram of a voiceprint recognition method in accordance with an example embodiment of the present disclosure.
  • the audio to be tested is received by the receiver; then, in operation S05, the audio to be tested is pre-processed, for example, by pre-filtering and high-frequency compensation; the audio to be tested is then framed by using the overlapping-segment method; a Hamming window is then applied to eliminate the Gibbs effect; and speech frames are distinguished from non-speech frames and the non-speech frames are discarded.
  • the audio to be tested is then split into a first part and a second part.
  • sample audio may be selected from the sample database, and the selected sample audio is divided into a first portion and a second portion at operation S20.
  • feature vectors for the respective parts of the audio to be tested and the selected sample audio are extracted by using the Mel cepstrum coefficient extraction method, so that one or more of the feature vectors are used in operation S30 to train the support vector machine.
  • in operation S35, it is determined whether the audio to be tested and the selected sample audio are from the same person.
  • FIG. 3 illustrates a flow chart of a voiceprint recognition method in accordance with an example embodiment of the present disclosure.
  • the audio A to be tested is received and the audio A to be tested is divided into a first part A1 and a second part A2.
  • a sample audio B is selected from the sample database and the selected sample audio B is divided into a first portion B1 and a second portion B2.
  • the audio A to be tested can be divided from the middle into two equal-length parts A1 and A2, while the sample audio B is likewise equally divided from the middle into two parts B1 and B2.
  • the audio to be tested and the selected sample audio may also be divided in other ratios; for example, the audio to be tested may be divided into two parts in a ratio of 1:2, and the selected sample audio into two parts in a ratio of 2:3.
  • the method may further include pre-processing the audio to be tested, for example, pre-emphasizing the audio to be detected; framing the test audio by using a framing method of overlapping segments; applying Hamming Window to eliminate the Gibbs effect; and distinguish between speech frames and non-speech frames and discard non-speech frames.
  • a filter is first designed according to the frequency characteristics of the speech signal to perform pre-filtering and high-frequency compensation; the overlapping-segment method is then used for framing; next, a window is applied to the signal to eliminate the Gibbs effect; finally, endpoint detection based on short-time energy and the short-time average zero-crossing rate is used to distinguish speech frames from non-speech frames, and the non-speech frames are discarded.
  • a feature matrix for the audio to be tested and the selected sample audio is extracted by using the Mel cepstrum coefficient extraction method. That is, according to the Mel cepstrum coefficient extraction method, a vector of 1 row and 20 columns is extracted from each frame of each speaker's speech as its feature vector; a speaker's n frames then constitute a feature matrix of n rows and 20 columns.
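An end-to-end sketch of producing such an n×20 MFCC feature matrix using only NumPy and SciPy is shown below. The filter-bank size, FFT length, and frame parameters are illustrative assumptions; a production system would typically use a dedicated speech-processing library.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_matrix(signal, sr, n_mfcc=20, frame_ms=25, hop_ms=10,
                n_filters=26, n_fft=512):
    """Return an (n_frames x n_mfcc) feature matrix: one 1x20 vector per frame."""
    # Pre-emphasis and overlapping framing, as in the preprocessing step
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    flen, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n = 1 + max(0, (len(sig) - flen) // hop)
    frames = np.stack([sig[i * hop : i * hop + flen]
                       for i in range(n)]) * np.hamming(flen)

    # Power spectrum of each frame
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft

    # Triangular Mel filter bank between 0 Hz and sr/2
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log filter-bank energies, then a DCT; keep the first n_mfcc coefficients
    feats = np.log(power @ fbank.T + 1e-10)
    return dct(feats, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```

With `n_mfcc=20`, a recording of n frames yields exactly the n-row, 20-column matrix described above.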
  • in step S320, support vector machine training is performed by using the feature matrix of the first part A1 of the audio to be tested as the first type of samples and the feature matrix of the selected sample audio B as the second type of samples, and the ratio a of the second part A2 of the audio to be tested belonging to the second type of samples is calculated, in order to determine whether the second part A2 of the audio to be tested belongs to the selected sample audio. Then, in step S325, support vector machine training is performed by using the feature matrix of the first part B1 of the selected sample audio as the first type of samples and the feature matrix of the audio A to be tested as the second type of samples, and the ratio b of the second part B2 of the selected sample audio belonging to the second type of samples is calculated. Then, in step S330, support vector machine training is performed by using the feature matrix of the second part A2 of the audio to be tested as the first type of samples and the feature matrix of the selected sample audio B as the second type of samples, and the ratio c of the first part A1 of the audio to be tested belonging to the second type of samples is calculated. Finally, in step S335, support vector machine training is performed by using the feature matrix of the second part B2 of the selected sample audio as the first type of samples and the feature matrix of the audio A to be tested as the second type of samples, and the ratio d of the first part B1 of the selected sample audio belonging to the second type of samples is calculated.
  • in step S340, based on the calculated a, b, c, and d, the degree of matching between the audio to be tested and the selected sample audio is calculated to determine whether the audio to be tested and the selected sample audio are from the same person.
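Steps S320-S335 can be sketched as follows, with scikit-learn's `SVC` standing in for the support vector machine. This is an illustrative sketch: the helper names, the RBF kernel, and the synthetic feature matrices are assumptions, not details from the disclosure.

```python
import numpy as np
from sklearn.svm import SVC

def class2_ratio(first_class, second_class, probe):
    """Train an SVM on two labelled feature matrices, then return the
    fraction of probe frames the SVM assigns to the second class."""
    X = np.vstack([first_class, second_class])
    y = np.r_[np.zeros(len(first_class)), np.ones(len(second_class))]
    clf = SVC(kernel="rbf").fit(X, y)
    return float(np.mean(clf.predict(probe) == 1))

def match_ratios(A1, A2, B1, B2):
    """Compute a, b, c, d as in steps S320-S335 for halves of audio A and B."""
    A = np.vstack([A1, A2])
    B = np.vstack([B1, B2])
    a = class2_ratio(A1, B, A2)   # S320: does A2 look like sample audio B?
    b = class2_ratio(B1, A, B2)   # S325: does B2 look like test audio A?
    c = class2_ratio(A2, B, A1)   # S330: does A1 look like sample audio B?
    d = class2_ratio(B2, A, B1)   # S335: does B1 look like test audio A?
    return a, b, c, d
```

When A and B come from different speakers, each probe half resembles the first-class training data and the ratios fall toward 0; when they come from the same speaker, the classes overlap and the ratios rise toward 0.5.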
  • For example, the average of a, b, c, and d can be calculated, and the ratio of the average to 0.5 can be determined as the degree of matching of the audio to be tested with the sample audio. In this case, if the audio to be tested and the selected sample audio belong to one person, the average value should be close to 0.5; if they are not from the same person, the average should be close to zero.
  • the ratio of the average value to 0.5 can thus be regarded as the degree of matching of the audio to be tested with the sample audio. According to this matching degree, it is possible to confirm whether the matched sample and the audio to be tested are the same person's voice and to prevent misjudgment.
  • different proportional thresholds may be set based on the requirements of different application environments to determine whether the audio to be tested and the sample audio are from the same person. For example, in lower-security settings, the threshold may be set to a lower value, for example 70%: if the calculated matching degree is greater than or equal to 70%, the two are considered to be from the same person; otherwise, they are considered to be from different people. In higher-security settings (e.g., an access control system), the threshold may be set to a higher value, for example 95%. The recognition accuracy can thereby be adjusted according to the needs of the application, which is more convenient for the user.
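The matching-degree computation and the adjustable threshold described above can be sketched as follows; capping the degree at 1.0 when the average exceeds 0.5 is an assumption, since the disclosure only defines the ratio of the average to 0.5.

```python
def matching_degree(a, b, c, d):
    """If both recordings come from the same speaker, each ratio should hover
    near 0.5, so their average divided by 0.5 approaches 1; for different
    speakers the ratios (and hence the degree) fall toward 0."""
    avg = (a + b + c + d) / 4.0
    return min(avg / 0.5, 1.0)  # ratio of the average to the ideal 0.5

def same_speaker(a, b, c, d, threshold=0.70):
    # Lower thresholds (e.g. 0.70) favor convenience; high-security uses
    # such as access control might raise the threshold to 0.95.
    return matching_degree(a, b, c, d) >= threshold
```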
  • by segmenting the to-be-matched audio and the sample audio and classifying the segmented samples in different ways, the voiceprint recognition method and system proposed by the present disclosure can achieve highly fault-tolerant and efficient identification under various small-sample conditions.
  • a computer system comprising: one or more processors; a memory for storing one or more programs, wherein when the one or more programs are When executed by a plurality of processors, the one or more processors are caused to implement the voiceprint recognition method as described above.
  • a computer readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to implement the voiceprint recognition method as described above.
  • FIG. 5 schematically illustrates a block diagram of a computer system suitable for implementing a voiceprint recognition method in accordance with an embodiment of the present disclosure.
  • the computer system shown in FIG. 5 is merely an example and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • a computer system 500 in accordance with an embodiment of the present disclosure includes a processor 501 that can perform various appropriate actions and processes according to a program stored in a read only memory (ROM) 502 or a program loaded from a storage portion 508 into a random access memory (RAM) 503.
  • Processor 501 can include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor, and/or a related chipset and/or a special purpose microprocessor (e.g., an application specific integrated circuit (ASIC)), and the like.
  • Processor 501 can also include an onboard memory for caching purposes.
  • the processor 501 may include a single processing unit or a plurality of processing units for performing different actions of the method flow according to the embodiments of the present disclosure described with reference to FIGS. 2 and 3.
  • in the RAM 503, various programs and data required for the operation of the system 500 are stored.
  • the processor 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504.
  • the processor 501 performs the various operations described above with reference to FIGS. 2 and 3 by executing programs in the ROM 502 and/or the RAM 503. It is noted that the program can also be stored in one or more memories other than ROM 502 and RAM 503.
  • the processor 501 can also perform the various operations described above with reference to FIGS. 2 and 3 by executing a program stored in the one or more memories.
  • in accordance with an embodiment of the present disclosure, system 500 may also include an input/output (I/O) interface 505, which is also coupled to the bus 504.
  • System 500 can also include one or more of the following components coupled to I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including, for example, a cathode ray tube (CRT), a liquid crystal display (LCD), and a speaker; a storage portion 508 including a hard disk or the like; and a communication portion 509 including a network interface card such as a LAN card, a modem, and the like.
  • the communication section 509 performs communication processing via a network such as the Internet.
  • A drive 510 is also coupled to I/O interface 505 as needed.
  • a removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory or the like is mounted on the drive 510 as needed so that a computer program read therefrom is installed into the storage portion 508 as needed.
  • an embodiment of the present disclosure includes a computer program product comprising a computer program carried on a computer readable storage medium, the computer program comprising program code for executing the method illustrated in the flowchart.
  • the computer program can be downloaded and installed from the network via the communication portion 509, and/or installed from the removable medium 511.
  • the above-described functions defined in the system of the embodiments of the present disclosure are executed when the computer program is executed by the processor 501.
  • the systems, devices, devices, modules, units, and the like described above may be implemented by a computer program module in accordance with an embodiment of the present disclosure.
  • the computer readable storage medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two.
  • the computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer readable storage media may include, but are not limited to, electrical connections having one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the foregoing.
  • a computer readable storage medium may be any tangible medium that can contain or store a program, which can be used by or in connection with an instruction execution system, apparatus, or device.
  • a computer readable signal medium may include a data signal that is propagated in the baseband or as part of a carrier, carrying computer readable program code. Such propagated data signals can take a variety of forms including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the computer readable signal medium can also be any computer readable medium other than a computer readable storage medium, which can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable storage medium may be transmitted by any suitable medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the foregoing.
  • each block of the flowchart or block diagrams can represent a module, a program segment, or a portion of code that includes one or more Executable instructions.
  • the functions noted in the blocks may also occur in a different order than that illustrated in the drawings. For example, two successively represented blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts can be implemented by a dedicated hardware-based system that performs the specified function or operation, or can be used A combination of dedicated hardware and computer instructions is implemented.
  • the foregoing method can be implemented in the form of program commands executable by various computer devices and recorded in a computer readable recording medium.
  • the computer readable recording medium may include a separate program command, a data file, a data structure, or a combination thereof.
  • program commands recorded in the recording medium may be those specially designed or configured for the present disclosure, or those known to persons skilled in the art of computer software.
  • the computer readable recording medium includes a magnetic medium such as a hard disk, a floppy disk or a magnetic tape; an optical medium such as a compact disk read only memory (CD-ROM) or a digital versatile disk (DVD); a magneto-optical medium such as a magneto-optical floppy disk; and hardware devices, such as ROM, RAM and flash memory, that store and execute program commands.
  • the program commands include machine language code produced by a compiler as well as high-level language code that the computer can execute by using an interpreter.
  • the foregoing hardware device may be configured to operate as at least one software module in order to perform the operations of the present disclosure, and vice versa.


Abstract

A voiceprint identification method and system. The method comprises: receiving an audio to be tested and segmenting the audio to be tested into a first part and a second part; selecting a sample audio and segmenting the sample audio into a first part and a second part; extracting characteristic matrixes for the audio to be tested and the sample audio by using Mel-frequency cepstral coefficient extraction method; executing support vector machine training by using the characteristic matrix of the first part of the audio to be tested as a first type of sample and using the characteristic matrix of the selected sample audio as a second type of sample, and calculating the matching degree of the second part of the audio to be tested and the second type of sample; performing a similar process on the first part of the sample audio, the first part of the audio to be tested and the second part of the sample audio, and respectively calculating the matching degree between the three with the audio to be tested, the selected sample audio and the audio to be tested as the respective corresponding second type of sample; and determining, according to the matching degree, whether the voice in the audio to be tested and the sample audio are from the same person.

Description

Voiceprint recognition method and voiceprint recognition system

Technical field
The present disclosure relates to the field of voiceprint recognition, and in particular to a voiceprint recognition method and a voiceprint recognition system.
Background
A voiceprint is a spectrographic pattern of sound-wave characteristics drawn by special electro-acoustic conversion instruments (such as a sound spectrograph or a speech spectrograph); it is a collection of various acoustic feature maps. For a human being, the voiceprint is a long-term stable characteristic signal: owing to innate physiological differences in the vocal organs and behavioral differences acquired later in life, each person's voiceprint carries a strongly individual character.
Voiceprint recognition is a biometric method that automatically identifies a speaker based on characteristic parameters such as the unique articulatory physiology and behavioral characteristics contained in human speech. Voiceprint recognition mainly collects a person's voice information, extracts distinctive voice features, converts them into digital symbols, and stores them as feature templates, so that at application time the speech to be recognized is matched against the templates in a database, thereby determining the speaker's identity. Beginning in the 1960s, research techniques for sound spectrum analysis were proposed and applied to speaker feature analysis. At present, voiceprint recognition technology is relatively mature and has entered practical use.
Sound spectrum analysis plays a major role in modern life; for example, the installation, adjustment, and operation of machinery in industrial production can be monitored by means of sound spectrum analysis. In addition, sound spectrum analysis is widely applied in the scientific inspection of musical-instrument manufacturing, jewelry appraisal, and the effective use of communication and broadcasting equipment. In communications, voiceprint recognition technology can be used for identity authentication, that is, to determine a speaker's identity. Most current research results in this field are text-dependent: the person being verified must pronounce a prescribed text, which has limited the development of the technology. Moreover, the fault tolerance of existing algorithms is poor; they essentially rely on a single similarity score to decide whether two speech-feature samples belong to the same person. If the sample size is not large enough, or the speech features of the samples are highly similar, it is difficult to make an accurate judgment.
Summary of the invention
According to a first aspect of the present disclosure, a voiceprint recognition method is provided. The voiceprint recognition method may include: receiving audio to be tested and dividing the audio to be tested into a first part and a second part; selecting a sample audio from a sample database and dividing the selected sample audio into a first part and a second part; extracting feature matrices for the audio to be tested and the selected sample audio by using the Mel-frequency cepstral coefficient extraction method; performing support vector machine training with the feature matrix of the first part of the audio to be tested as the first-class sample and the feature matrix of the selected sample audio as the second-class sample, and calculating the proportion a of the second part of the audio to be tested that belongs to the second class; performing support vector machine training with the feature matrix of the first part of the selected sample audio as the first-class sample and the feature matrix of the audio to be tested as the second-class sample, and calculating the proportion b of the second part of the selected sample audio that belongs to the second class; performing support vector machine training with the feature matrix of the second part of the audio to be tested as the first-class sample and the feature matrix of the selected sample audio as the second-class sample, and calculating the proportion c of the first part of the audio to be tested that belongs to the second class; performing support vector machine training with the feature matrix of the second part of the selected sample audio as the first-class sample and the feature matrix of the audio to be tested as the second-class sample, and calculating the proportion d of the first part of the selected sample audio that belongs to the second class; and calculating, from the computed a, b, c, and d, the degree to which the audio to be tested matches the selected sample audio, so as to determine whether the audio to be tested and the selected sample audio come from the same person's voice.
According to an embodiment of the present disclosure, the voiceprint recognition method further includes preprocessing the received audio to be tested, wherein the preprocessing includes at least one of the following operations: pre-emphasizing the audio to be tested; framing the audio to be tested by using an overlapping-segment framing method; applying a Hamming window to eliminate the Gibbs effect; and distinguishing speech frames from non-speech frames and discarding the non-speech frames.
According to an embodiment of the present disclosure, dividing the audio to be tested into a first part and a second part includes dividing the audio to be tested into two parts of equal length.
According to an embodiment of the present disclosure, dividing the selected sample audio into a first part and a second part includes dividing the selected sample audio into two parts of equal length.
According to an embodiment of the present disclosure, calculating the degree to which the audio to be tested matches the sample audio includes: calculating the average of a, b, c, and d; and determining the ratio of the average to 0.5 as the degree to which the audio to be tested matches the sample audio.
According to a second aspect of the present disclosure, a voiceprint recognition system is provided. The voiceprint recognition system may include: a receiver configured to receive audio to be tested; a sample database configured to store one or more sample audios; a support vector machine configured to classify test data according to classification samples; and a controller configured to: divide the audio to be tested from the receiver into a first part and a second part, select a sample audio from the sample database, and divide the selected sample audio into a first part and a second part; extract feature matrices for the audio to be tested and the selected sample audio by using the Mel-frequency cepstral coefficient extraction method; input the feature matrix of the first part of the audio to be tested as the first-class sample and the feature matrix of the selected sample audio as the second-class sample to the support vector machine, train the support vector machine, and calculate the proportion a of the second part of the audio to be tested that belongs to the second class; input the feature matrix of the first part of the selected sample audio as the first-class sample and the feature matrix of the audio to be tested as the second-class sample to the support vector machine, train the support vector machine, and calculate the proportion b of the second part of the selected sample audio that belongs to the second class; input the feature matrix of the second part of the audio to be tested as the first-class sample and the feature matrix of the selected sample audio as the second-class sample to the support vector machine, train the support vector machine, and calculate the proportion c of the first part of the audio to be tested that belongs to the second class; input the feature matrix of the second part of the selected sample audio as the first-class sample and the feature matrix of the audio to be tested as the second-class sample to the support vector machine, train the support vector machine, and calculate the proportion d of the first part of the selected sample audio that belongs to the second class; and calculate, from the computed a, b, c, and d, the degree to which the audio to be tested matches the sample audio, so as to determine whether the audio to be tested and the sample audio come from the same person's voice.
According to an embodiment of the present disclosure, the controller may be further configured to preprocess the received audio to be tested, wherein the preprocessing includes at least one of the following operations: pre-emphasizing the audio to be tested; framing the audio to be tested by using an overlapping-segment framing method; applying a Hamming window to eliminate the Gibbs effect; and distinguishing speech frames from non-speech frames and discarding the non-speech frames.
According to an embodiment of the present disclosure, the controller is further configured to divide the audio to be tested into two parts of equal length.
According to an embodiment of the present disclosure, the controller is further configured to divide the selected sample audio into two parts of equal length.
According to an embodiment of the present disclosure, the controller is further configured to: calculate the average of a, b, c, and d; and determine the ratio of the average to 0.5 as the degree to which the audio to be tested matches the sample audio.
According to an embodiment of the present disclosure, a computer system is also provided, including: one or more processors; and a memory for storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the voiceprint recognition method described above.
According to an embodiment of the present disclosure, a computer-readable storage medium is also provided, having executable instructions stored thereon which, when executed by a processor, cause the processor to implement the voiceprint recognition method described above.
Brief description of the drawings
The above and other aspects, features, and advantages of the example embodiments of the present disclosure will become clearer from the following description taken in conjunction with the accompanying drawings, in which:
FIG. 1 shows a structural block diagram of a voiceprint recognition system according to an example embodiment of the present disclosure;
FIG. 2 shows an operational logic diagram of a voiceprint recognition method according to an example embodiment of the present disclosure;
FIG. 3 shows a flowchart of a voiceprint recognition method according to an example embodiment of the present disclosure;
FIG. 4 shows an example diagram of the process of training the support vector machine and calculating the audio matching degree in FIG. 3; and
FIG. 5 schematically shows a block diagram of a computer system suitable for implementing a voiceprint recognition method according to an embodiment of the present disclosure.
Detailed description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood, however, that these descriptions are merely illustrative and are not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted to avoid unnecessarily obscuring the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. The terms "including", "comprising", and the like, as used herein, indicate the presence of the stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used herein are to be interpreted as having meanings consistent with the context of this specification and should not be interpreted in an idealized or overly rigid manner.
Where an expression such as "at least one of A, B, and C" is used, it should generally be interpreted according to the meaning commonly understood by those skilled in the art (for example, "a system having at least one of A, B, and C" shall include, but is not limited to, a system having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C). Where an expression such as "at least one of A, B, or C" is used, it should likewise be interpreted according to the meaning commonly understood by those skilled in the art (for example, "a system having at least one of A, B, or C" shall include, but is not limited to, a system having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C). Those skilled in the art should further understand that essentially any disjunctive conjunction and/or phrase presenting two or more alternative items, whether in the specification, the claims, or the drawings, should be understood to contemplate the possibility of including one of the items, either of the items, or both items. For example, the phrase "A or B" should be understood to include the possibility of "A", "B", or "A and B".
Example implementations of the present disclosure are described below with reference to the drawings. The present disclosure provides a text-independent voiceprint recognition method and voiceprint recognition system. The voiceprint recognition method can effectively improve the fault tolerance of voiceprint recognition under small-sample conditions and can quickly and efficiently determine whether two audio segments belong to the same person, and thus has broad application prospects. Through speaker recognition within voiceprint recognition technology, identity authentication based on voice information can be achieved.
FIG. 1 shows a structural block diagram of a voiceprint recognition system 100 according to an example embodiment of the present disclosure. As shown in FIG. 1, the voiceprint recognition system 100 includes: a receiver 110 configured to receive audio to be tested; a sample database 120 configured to store one or more sample audios; a support vector machine 130 configured to classify test data according to classification samples; and a controller 140. The support vector machine 130 performs a classification function. Specifically, for a linearly inseparable case, the input space is first transformed into a high-dimensional space by a nonlinear transformation so that the samples become linearly separable, where the nonlinear transformation is realized by an appropriate inner-product (kernel) function; the optimal linear classification surface is then sought in the new space, thereby achieving classification. The controller 140 may be configured to divide the audio to be tested from the receiver 110 into a first part and a second part, select a sample audio from the sample database 120, and divide the selected sample audio into a first part and a second part; for example, both the audio to be tested and the selected sample audio may be divided into two parts of equal length. Although this embodiment describes dividing both the audio to be tested and the selected sample audio into two parts of equal length, it should be noted that the audio to be tested and the selected sample audio may also be divided in different ratios, and the two ratios may differ from each other. Next, the controller 140 extracts feature matrices for the audio to be tested and the selected sample audio by using the Mel-frequency cepstral coefficient (MFCC) extraction method. The Mel frequency scale is based on the auditory characteristics of the human ear and has a nonlinear correspondence with frequency in Hz. Mel-frequency cepstral coefficients (MFCCs) are spectral features calculated by exploiting this relationship. At present, MFCCs and their extraction methods are widely used in the field of speech recognition.
According to an embodiment of the present disclosure, the controller 140 uses the support vector machine to determine whether the audio to be tested and the selected sample audio come from the same person. Specifically, the controller 140 may: input the feature matrix of the first part of the audio to be tested as the first-class sample and the feature matrix of the selected sample audio as the second-class sample to the support vector machine 130, train the support vector machine 130, and calculate the proportion a of the second part of the audio to be tested that belongs to the second class; input the feature matrix of the first part of the selected sample audio as the first-class sample and the feature matrix of the audio to be tested as the second-class sample to the support vector machine 130, train the support vector machine 130, and calculate the proportion b of the second part of the selected sample audio that belongs to the second class; input the feature matrix of the second part of the audio to be tested as the first-class sample and the feature matrix of the selected sample audio as the second-class sample to the support vector machine 130, train the support vector machine 130, and calculate the proportion c of the first part of the audio to be tested that belongs to the second class; input the feature matrix of the second part of the selected sample audio as the first-class sample and the feature matrix of the audio to be tested as the second-class sample to the support vector machine 130, train the support vector machine 130, and calculate the proportion d of the first part of the selected sample audio that belongs to the second class; and calculate, from the computed a, b, c, and d, the degree to which the audio to be tested matches the sample audio, so as to determine whether the audio to be tested and the sample audio come from the same person's voice. In one embodiment, the controller 140 may calculate the average of a, b, c, and d and determine the ratio of the average to 0.5 as the degree to which the audio to be tested matches the sample audio.
In an alternative embodiment, the controller 140 may be further configured to preprocess the received audio to be tested: for example, pre-emphasizing the audio to be tested with pre-filtering and high-frequency compensation; then framing the audio to be tested by using an overlapping-segment framing method; then applying a Hamming window to eliminate the Gibbs effect; and distinguishing speech frames from non-speech frames and discarding the non-speech frames. Because a sound signal varies continuously, to simplify the continuously varying signal it is assumed that the audio signal does not change within a short time scale, so that the signal is grouped into units of multiple sampling points, each unit called a "frame". A frame is typically 20-40 milliseconds: if the frame is shorter, the sampling points within each frame are insufficient for a reliable spectral calculation, whereas if it is too long, the signal within each frame changes too much.
FIG. 2 shows an operational logic diagram of a voiceprint recognition method according to an example embodiment of the present disclosure. First, in operation S01, the audio to be tested is received by the receiver. Then, in operation S05, the audio to be tested is preprocessed: for example, pre-filtering and high-frequency compensation are applied; the audio is framed by using an overlapping-segment framing method; a Hamming window is then applied to eliminate the Gibbs effect; and speech frames are distinguished from non-speech frames, with the non-speech frames discarded. In operation S10, the audio to be tested is divided into first and second parts. In addition, in operation S15, a sample audio may be selected from the sample database, and in operation S20 the selected sample audio is divided into a first part and a second part. Subsequently, in operation S25, feature vectors for the respective parts of the audio to be tested and the selected sample audio are extracted by using the Mel-frequency cepstral coefficient extraction method, so that in operation S30 one or more of the feature vectors are used to train the support vector machine. Finally, in operation S35, it is determined whether the audio to be tested and the selected sample audio come from the same person.
FIG. 3 shows a flowchart of a voiceprint recognition method according to an example embodiment of the present disclosure. In step S305, the audio A to be tested is received and divided into a first part A1 and a second part A2. In step S310, a sample audio B is selected from the sample database and divided into a first part B1 and a second part B2. For example, the audio A to be tested may be split down the middle into two parts A1 and A2 of equal length, while the sample audio B is likewise split down the middle into two parts B1 and B2. In addition to this division, the audio to be tested and the selected sample audio may also be divided in other ratios; for example, the audio to be tested may be divided into two parts in a 1:2 ratio while the selected sample audio is divided into two parts in a 2:3 ratio.
Furthermore, before step S305 is performed, the method may further include preprocessing the audio to be tested: for example, pre-emphasizing the audio to be tested; framing it by using an overlapping-segment framing method; applying a Hamming window to eliminate the Gibbs effect; and distinguishing speech frames from non-speech frames and discarding the non-speech frames. In one embodiment, a special filter is first designed according to the frequency characteristics of the speech signal to filter the signal and apply high-frequency compensation; the signal is then framed by using the overlapping-segment framing method; next, a Hamming window is applied to the signal to eliminate the Gibbs effect; and finally, using an endpoint-detection method, speech frames are distinguished from non-speech frames according to the short-time energy and the short-time average zero-crossing rate, and the non-speech frames are discarded.
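The preprocessing chain above can be sketched as follows. The frame length, hop, and energy threshold are illustrative choices not fixed by the disclosure (400-sample frames with a 160-sample hop correspond to 25 ms / 10 ms at a 16 kHz sampling rate, within the 20-40 ms range mentioned later), and a plain short-time-energy gate stands in here for the combined energy / zero-crossing-rate endpoint detection:

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97, energy_ratio=0.1):
    """Pre-emphasize, frame with overlap, window, and drop low-energy frames.

    All parameter values are illustrative; the disclosure does not fix them.
    """
    # Pre-emphasis: boost high frequencies, y[n] = x[n] - alpha * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Overlapping segmentation into frames.
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])

    # Hamming window to suppress the Gibbs effect at frame edges.
    frames = frames * np.hamming(frame_len)

    # Simplified endpoint detection: keep frames whose short-time energy
    # exceeds a fraction of the mean energy (a stand-in for the text's
    # combined short-time energy / zero-crossing-rate test).
    energy = np.sum(frames ** 2, axis=1)
    keep = energy > energy_ratio * energy.mean()
    return frames[keep]
```

Applied to a signal that is half silence and half tone, the silent frames are discarded and only the speech-like frames survive for feature extraction.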
Next, in step S315, feature matrices for the audio to be tested and the selected sample audio are extracted by using the Mel-frequency cepstral coefficient extraction method. That is, according to the MFCC extraction method, a 1-row, 20-column vector is extracted from each frame of each speaker's speech as its feature vector, so that a speaker's n frames constitute a feature matrix of n rows and 20 columns.
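One way to realize this step is sketched below: it turns already-framed, windowed audio into an n×20 MFCC matrix matching the layout described above. The filterbank size (26 mel bands), sampling rate, and unnormalized DCT-II are illustrative assumptions rather than values fixed by the disclosure; a production system would typically use an established MFCC routine:

```python
import numpy as np

def mfcc_matrix(frames, sr=16000, n_mels=26, n_coef=20):
    """Compute an (n_frames x n_coef) MFCC feature matrix from windowed
    frames of shape (n_frames, frame_len). Compact sketch only."""
    n_fft = frames.shape[1]
    # Power spectrum of each frame.
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular mel filterbank between 0 Hz and sr/2.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, spec.shape[1]))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fbank[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i, k] = (right - k) / max(right - center, 1)

    # Log mel energies, then DCT-II to decorrelate into cepstral coefficients.
    logmel = np.log(spec @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_coef), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T
```

Each row of the returned matrix is the 1×20 feature vector of one frame, so n frames yield the n×20 matrix described in the text.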
Next, the steps of training the support vector machine are performed. In step S320, support vector machine training is performed with the feature matrix of the first part A1 of the audio to be tested as the first-class sample and the feature matrix of the selected sample audio B as the second-class sample, and the proportion a of the second part A2 of the audio to be tested that belongs to the second class is calculated, so as to judge whether the second part A2 of the audio to be tested belongs to the selected sample audio. Then, in step S325, support vector machine training is performed with the feature matrix of the first part B1 of the selected sample audio as the first-class sample and the feature matrix of the audio A to be tested as the second-class sample, and the proportion b of the second part B2 of the selected sample audio that belongs to the second class is calculated. Then, in step S330, support vector machine training is performed with the feature matrix of the second part A2 of the audio to be tested as the first-class sample and the feature matrix of the selected sample audio B as the second-class sample, and the proportion c of the first part A1 of the audio to be tested that belongs to the second class is calculated. Finally, in step S335, support vector machine training is performed with the feature matrix of the second part B2 of the selected sample audio as the first-class sample and the feature matrix of the audio A to be tested as the second-class sample, and the proportion d of the first part B1 of the selected sample audio that belongs to the second class is calculated. Any one of the above operations S320 to S335 may be exemplified as shown in FIG. 4, which shows an example diagram of the process of training the support vector machine and calculating the audio matching degree.
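The four-way training and scoring protocol of steps S320-S335 can be outlined as below. To keep the sketch self-contained, a simple nearest-centroid rule stands in for the support vector machine, and the function names are illustrative; in the actual method, each call would instead train the kernel SVM described for the support vector machine 130 on the two labeled classes and count the fraction of held-out frames it assigns to class 2:

```python
import numpy as np

def fraction_in_class2(class1, class2, test):
    """Train a two-class model on (class1, class2) feature matrices and
    return the fraction of `test` rows assigned to class 2. A nearest-
    centroid rule is a stand-in here for the SVM of the disclosure."""
    c1, c2 = class1.mean(axis=0), class2.mean(axis=0)
    d1 = np.linalg.norm(test - c1, axis=1)
    d2 = np.linalg.norm(test - c2, axis=1)
    return np.mean(d2 < d1)

def cross_match(A1, A2, B1, B2):
    """The four trainings of steps S320-S335: one half of one audio is the
    first-class sample, the whole other audio is the second class, and the
    remaining half is scored against class 2."""
    A = np.vstack([A1, A2])
    B = np.vstack([B1, B2])
    a = fraction_in_class2(A1, B, A2)   # S320
    b = fraction_in_class2(B1, A, B2)   # S325
    c = fraction_in_class2(A2, B, A1)   # S330
    d = fraction_in_class2(B2, A, B1)   # S335
    return a, b, c, d
```

When the two audios come from the same speaker, the held-out half is ambiguous between the two classes and the four fractions cluster near 0.5; when the speakers differ, the held-out half is pulled back to its own speaker's class and the fractions fall toward 0.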
Finally, with continued reference to FIG. 3, in step S340, the degree to which the audio to be tested matches the selected sample audio is calculated from the computed a, b, c, and d, so as to determine whether the audio to be tested and the selected sample audio come from the same person's voice. For example, the average of a, b, c, and d may be calculated, and the ratio of the average to 0.5 may be determined as the degree to which the audio to be tested matches the sample audio. In this case, if the audio to be tested and the selected sample audio belong to the same person, the average should be close to 0.5; if they do not come from the same person, the average should be close to 0. Therefore, the ratio of the average to 0.5 can be regarded as the matching degree between the audio to be tested and the sample audio. From this matching degree, it can be confirmed whether the matching result and the test sample are the same person's voice, preventing misjudgment.
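The computation of step S340 reduces to a one-line normalization:

```python
def matching_degree(a, b, c, d):
    """Average the four second-class proportions and normalize by 0.5,
    the value expected when both audios come from the same speaker."""
    return ((a + b + c + d) / 4.0) / 0.5
```

For a same-speaker pair with a = b = c = d = 0.5 this yields 1.0; for a clearly different pair with all proportions near 0 it approaches 0.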
It should be noted that different matching thresholds may be set according to the requirements of different application environments to determine whether the audio to be tested and the sample audio come from the same person. For example, in a low-security scenario the threshold may be set to a lower value such as 70%: if the computed ratio is greater than or equal to 70%, the two recordings are considered to come from the same person; otherwise they are considered to come from different people. In a high-security scenario (for example, an access control system), the threshold may be set to a higher value such as 95%. Recognition accuracy can thus be adjusted to the needs of the application, which makes the system more convenient to use.
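The matching-degree computation and the threshold check described above amount to only a few lines. The function names below are illustrative, not taken from the disclosure:

```python
def match_degree(a, b, c, d):
    """Ratio of the mean of the four proportions to 0.5: near 1 when the
    two recordings come from the same speaker, near 0 otherwise."""
    return (a + b + c + d) / 4.0 / 0.5

def same_speaker(a, b, c, d, threshold=0.70):
    # The threshold is application-dependent: ~0.70 for low-security uses,
    # ~0.95 for high-security ones such as access control.
    return match_degree(a, b, c, d) >= threshold

print(match_degree(0.5, 0.5, 0.5, 0.5))      # 1.0 (ideal same-speaker case)
print(same_speaker(0.46, 0.52, 0.49, 0.51))  # True at the 70% threshold
print(same_speaker(0.03, 0.01, 0.00, 0.02))  # False
```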
Therefore, by splitting the audio to be matched and the sample audio and classifying the resulting segments in different combinations, the voiceprint recognition method and system proposed in the present disclosure achieve accurate, highly fault-tolerant, and efficient identification even under small-sample conditions.
According to an embodiment of the present disclosure, there is also provided a computer system comprising one or more processors and a memory storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the voiceprint recognition method described above.
According to an embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to implement the voiceprint recognition method described above.
FIG. 5 schematically shows a block diagram of a computer system suitable for implementing the voiceprint recognition method according to an embodiment of the present disclosure. The computer system shown in FIG. 5 is merely an example and should not impose any limitation on the function or scope of use of the embodiments of the present disclosure.
As shown in FIG. 5, a computer system 500 according to an embodiment of the present disclosure includes a processor 501, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage portion 508 into a random-access memory (RAM) 503. The processor 501 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction-set processor and/or a related chipset, and/or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)). The processor 501 may also include onboard memory for caching purposes, and may comprise a single processing unit or multiple processing units for performing the different actions of the method flows described with reference to FIGS. 2 and 3.
The RAM 503 stores the various programs and data required for the operation of the system 500. The processor 501, the ROM 502, and the RAM 503 are connected to one another through a bus 504. The processor 501 performs the various operations described above with reference to FIGS. 2 and 3 by executing programs in the ROM 502 and/or the RAM 503. Note that the programs may also be stored in one or more memories other than the ROM 502 and the RAM 503, in which case the processor 501 performs those operations by executing the programs stored in the one or more memories.
According to an embodiment of the present disclosure, the system 500 may further include an input/output (I/O) interface 505, which is also connected to the bus 504. The system 500 may also include one or more of the following components connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 508 including a hard disk and the like; and a communication portion 509 including a network interface card such as a LAN card or a modem. The communication portion 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read from it can be installed into the storage portion 508 as needed.
According to an embodiment of the present disclosure, the methods described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program carried on a computer-readable storage medium, the computer program containing program code for executing the methods illustrated in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 509 and/or installed from the removable medium 511. When the computer program is executed by the processor 501, the functions defined in the system of the embodiments of the present disclosure are performed. According to embodiments of the present disclosure, the systems, devices, apparatuses, modules, units, and the like described above may be implemented as computer program modules.
It should be noted that the computer-readable storage medium shown in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction-execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction-execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted by any suitable medium, including but not limited to wireless, wireline, optical cable, RF, and the like, or any suitable combination of the foregoing.
The flowcharts and block diagrams in the drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur in an order different from that shown in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
It should be noted that the above scheme is only one specific implementation of the concept of the present disclosure, and the present disclosure is not limited to it. Some of the processing in the above implementation may be omitted or skipped without departing from the spirit and scope of the present disclosure.
The foregoing methods may be implemented in the form of program commands executable by various computer devices and recorded on a computer-readable recording medium. In this case, the computer-readable recording medium may include program commands, data files, and data structures, separately or in combination. The program commands recorded on the recording medium may be specially designed or configured for the present disclosure, or may be known to those skilled in the art of computer software. Computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as compact disc read-only memories (CD-ROMs) and digital versatile discs (DVDs); magneto-optical media such as magneto-optical floppy disks; and hardware devices such as ROM, RAM, and flash memory that store and execute program commands. In addition, program commands include machine-language code produced by a compiler as well as high-level-language code that a computer can execute using an interpreter. The foregoing hardware devices may be configured to operate as at least one software module to perform the operations of the present disclosure, and vice versa.
Although the operations of the methods herein are shown and described in a particular order, the order of the operations of each method may be changed so that particular operations are performed in the reverse order, or at least partially concurrently with other operations. Furthermore, the present disclosure is not limited to the example embodiments described above; one or more other components or operations may be included, or one or more components or operations may be omitted, without departing from the spirit and scope of the present disclosure.
Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure may be combined in multiple ways, even if such combinations are not explicitly recited in the present disclosure. In particular, various combinations of the features recited in the embodiments and/or claims of the present disclosure can be made without departing from its spirit and teachings, and all such combinations fall within the scope of the present disclosure.
The present disclosure has been described above in connection with its preferred embodiments, but those skilled in the art will understand that various modifications, substitutions, and changes may be made without departing from the spirit and scope of the present disclosure. Therefore, the present disclosure should not be limited by the embodiments described above, but should be defined by the appended claims and their equivalents.

Claims (12)

  1. A voiceprint recognition method, comprising:
    receiving audio to be tested and dividing the audio to be tested into a first part and a second part;
    selecting a sample audio from a sample database and dividing the selected sample audio into a first part and a second part;
    extracting feature matrices for the audio to be tested and the selected sample audio by using a Mel-frequency cepstral coefficient extraction method;
    performing support vector machine training by using the feature matrix of the first part of the audio to be tested as a first type of sample and the feature matrix of the selected sample audio as a second type of sample, and calculating a proportion a of the second part of the audio to be tested that belongs to the second type of sample;
    performing support vector machine training by using the feature matrix of the first part of the selected sample audio as the first type of sample and the feature matrix of the audio to be tested as the second type of sample, and calculating a proportion b of the second part of the selected sample audio that belongs to the second type of sample;
    performing support vector machine training by using the feature matrix of the second part of the audio to be tested as the first type of sample and the feature matrix of the selected sample audio as the second type of sample, and calculating a proportion c of the first part of the audio to be tested that belongs to the second type of sample;
    performing support vector machine training by using the feature matrix of the second part of the selected sample audio as the first type of sample and the feature matrix of the audio to be tested as the second type of sample, and calculating a proportion d of the first part of the selected sample audio that belongs to the second type of sample; and
    calculating, from the calculated a, b, c, and d, the degree of matching between the audio to be tested and the selected sample audio, in order to determine whether the audio to be tested and the selected sample audio come from the same person's voice.
  2. The method according to claim 1, further comprising pre-processing the received audio to be tested, wherein the pre-processing comprises at least one of the following operations:
    pre-emphasizing the audio to be tested;
    framing the audio to be tested by using an overlapping-segment framing method;
    applying a Hamming window to eliminate the Gibbs effect; and
    distinguishing speech frames from non-speech frames and discarding the non-speech frames.
  3. The method according to claim 1, wherein dividing the audio to be tested into a first part and a second part comprises dividing the audio to be tested into two parts of equal length.
  4. The method according to claim 1, wherein dividing the selected sample audio into a first part and a second part comprises dividing the selected sample audio into two parts of equal length.
  5. The method according to claim 1, wherein calculating the degree of matching between the audio to be tested and the sample audio comprises:
    calculating the average of a, b, c, and d; and
    determining the ratio of the average to 0.5 as the degree of matching between the audio to be tested and the sample audio.
  6. A voiceprint recognition system, comprising:
    a receiver configured to receive audio to be tested;
    a sample database configured to store one or more sample audios;
    a support vector machine configured to classify test data according to classified samples; and
    a controller configured to:
    divide the audio to be tested from the receiver into a first part and a second part, select a sample audio from the sample database, and divide the selected sample audio into a first part and a second part;
    extract feature matrices for the audio to be tested and the selected sample audio by using a Mel-frequency cepstral coefficient extraction method;
    calculate a proportion a of the second part of the audio to be tested that belongs to a second type of sample, by inputting to the support vector machine the feature matrix of the first part of the audio to be tested as a first type of sample and the feature matrix of the selected sample audio as the second type of sample, and training the support vector machine;
    calculate a proportion b of the second part of the selected sample audio that belongs to the second type of sample, by inputting to the support vector machine the feature matrix of the first part of the selected sample audio as the first type of sample and the feature matrix of the audio to be tested as the second type of sample, and training the support vector machine;
    calculate a proportion c of the first part of the audio to be tested that belongs to the second type of sample, by inputting to the support vector machine the feature matrix of the second part of the audio to be tested as the first type of sample and the feature matrix of the selected sample audio as the second type of sample, and training the support vector machine;
    calculate a proportion d of the first part of the selected sample audio that belongs to the second type of sample, by inputting to the support vector machine the feature matrix of the second part of the selected sample audio as the first type of sample and the feature matrix of the audio to be tested as the second type of sample, and training the support vector machine; and
    calculate, from the calculated a, b, c, and d, the degree of matching between the audio to be tested and the sample audio, in order to determine whether the audio to be tested and the sample audio come from the same person's voice.
  7. The system according to claim 6, wherein the controller is further configured to pre-process the received audio to be tested, the pre-processing comprising at least one of the following operations:
    pre-emphasizing the audio to be tested;
    framing the audio to be tested by using an overlapping-segment framing method;
    applying a Hamming window to eliminate the Gibbs effect; and
    distinguishing speech frames from non-speech frames and discarding the non-speech frames.
  8. The system according to claim 6, wherein the controller is further configured to divide the audio to be tested into two parts of equal length.
  9. The system according to claim 6, wherein the controller is further configured to divide the selected sample audio into two parts of equal length.
  10. The system according to claim 6, wherein the controller is further configured to:
    calculate the average of a, b, c, and d; and
    determine the ratio of the average to 0.5 as the degree of matching between the audio to be tested and the sample audio.
  11. A computer system, comprising:
    one or more processors; and
    a memory for storing one or more programs,
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the voiceprint recognition method according to any one of claims 1 to 5.
  12. A computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to implement the voiceprint recognition method according to any one of claims 1 to 5.
PCT/CN2017/106886 2016-11-22 2017-10-19 Voiceprint identification method and voiceprint identification system WO2018095167A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611035943.3A CN108091340B (en) 2016-11-22 2016-11-22 Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium
CN201611035943.3 2016-11-22

Publications (1)

Publication Number Publication Date
WO2018095167A1 true WO2018095167A1 (en) 2018-05-31

Family

ID=62168704

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/106886 WO2018095167A1 (en) 2016-11-22 2017-10-19 Voiceprint identification method and voiceprint identification system

Country Status (2)

Country Link
CN (1) CN108091340B (en)
WO (1) WO2018095167A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109031961A (en) * 2018-06-29 2018-12-18 百度在线网络技术(北京)有限公司 Method and apparatus for controlling operation object
CN111489756A (en) * 2020-03-31 2020-08-04 中国工商银行股份有限公司 Voiceprint recognition method and device
CN115100776A (en) * 2022-05-30 2022-09-23 厦门快商通科技股份有限公司 Access control authentication method, system and storage medium based on voice recognition

Families Citing this family (2)

Publication number Priority date Publication date Assignee Title
CN108908377B (en) * 2018-07-06 2020-06-23 达闼科技(北京)有限公司 Speaker recognition method and device and robot
CN110889008B (en) * 2018-09-10 2021-11-09 珠海格力电器股份有限公司 Music recommendation method and device, computing device and storage medium

Citations (6)

Publication number Priority date Publication date Assignee Title
US20070239457A1 (en) * 2006-04-10 2007-10-11 Nokia Corporation Method, apparatus, mobile terminal and computer program product for utilizing speaker recognition in content management
CN102664011A (en) * 2012-05-17 2012-09-12 吉林大学 Method for quickly recognizing speaker
CN102737633A (en) * 2012-06-21 2012-10-17 北京华信恒达软件技术有限公司 Method and device for recognizing speaker based on tensor subspace analysis
CN102820033A (en) * 2012-08-17 2012-12-12 南京大学 Voiceprint identification method
CN103562993A (en) * 2011-12-16 2014-02-05 华为技术有限公司 Speaker recognition method and device
CN104464756A (en) * 2014-12-10 2015-03-25 黑龙江真美广播通讯器材有限公司 Small speaker emotion recognition system

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
JP2001318692A (en) * 2000-05-11 2001-11-16 Yasutaka Sakamoto Individual identification system by speech recognition
CN101562012B (en) * 2008-04-16 2011-07-20 创而新(中国)科技有限公司 Method and system for graded measurement of voice
EP3123468A1 (en) * 2014-03-28 2017-02-01 Intel IP Corporation Training classifiers using selected cohort sample subsets
CN104485102A (en) * 2014-12-23 2015-04-01 智慧眼(湖南)科技发展有限公司 Voiceprint recognition method and device
CN105244026B (en) * 2015-08-24 2019-09-20 北京意匠文枢科技有限公司 A kind of method of speech processing and device
CN105244031A (en) * 2015-10-26 2016-01-13 北京锐安科技有限公司 Speaker identification method and device


Cited By (4)

Publication number Priority date Publication date Assignee Title
CN109031961A (en) * 2018-06-29 2018-12-18 百度在线网络技术(北京)有限公司 Method and apparatus for controlling operation object
CN111489756A (en) * 2020-03-31 2020-08-04 中国工商银行股份有限公司 Voiceprint recognition method and device
CN115100776A (en) * 2022-05-30 2022-09-23 厦门快商通科技股份有限公司 Access control authentication method, system and storage medium based on voice recognition
CN115100776B (en) * 2022-05-30 2023-12-26 厦门快商通科技股份有限公司 Entrance guard authentication method, system and storage medium based on voice recognition

Also Published As

Publication number Publication date
CN108091340B (en) 2020-11-03
CN108091340A (en) 2018-05-29

Similar Documents

Publication Publication Date Title
Boles et al. Voice biometrics: Deep learning-based voiceprint authentication system
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
WO2018095167A1 (en) Voiceprint identification method and voiceprint identification system
US9536547B2 (en) Speaker change detection device and speaker change detection method
Ahmad et al. A unique approach in text independent speaker recognition using MFCC feature sets and probabilistic neural network
CN108922541B (en) Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models
US20170154640A1 (en) Method and electronic device for voice recognition based on dynamic voice model selection
WO2020034628A1 (en) Accent identification method and device, computer device, and storage medium
JP2019211749A (en) Method and apparatus for detecting starting point and finishing point of speech, computer facility, and program
Vyas A Gaussian mixture model based speech recognition system using Matlab
CN108335699A (en) A kind of method for recognizing sound-groove based on dynamic time warping and voice activity detection
Archana et al. Gender identification and performance analysis of speech signals
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
CN109215634A (en) A kind of method and its system of more word voice control on-off systems
Pao et al. Combining acoustic features for improved emotion recognition in mandarin speech
Ramgire et al. A survey on speaker recognition with various feature extraction and classification techniques
GB2576960A (en) Speaker recognition
CN109065026A (en) A kind of recording control method and device
Krishna et al. Emotion recognition using dynamic time warping technique for isolated words
CN111429919A (en) Anti-sound crosstalk method based on conference recording system, electronic device and storage medium
Budiga et al. CNN trained speaker recognition system in electric vehicles
Komlen et al. Text independent speaker recognition using LBG vector quantization
Tahliramani et al. Performance analysis of speaker identification system with and without spoofing attack of voice conversion
Estrebou et al. Voice recognition based on probabilistic SOM
Bonifaco et al. Comparative analysis of filipino-based rhinolalia aperta speech using mel frequency cepstral analysis and Perceptual Linear Prediction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17874980

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 04.09.2019)

122 Ep: pct application non-entry in european phase

Ref document number: 17874980

Country of ref document: EP

Kind code of ref document: A1