WO2018095167A1 - Voiceprint identification method and voiceprint identification system - Google Patents


Info

Publication number
WO2018095167A1
WO2018095167A1 (PCT/CN2017/106886)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
tested
sample
type
feature matrix
Prior art date
Application number
PCT/CN2017/106886
Other languages
French (fr)
Chinese (zh)
Inventor
雷利博
薛韬
罗超
Original Assignee
北京京东尚科信息技术有限公司
北京京东世纪贸易有限公司
Priority date
Filing date
Publication date
Application filed by 北京京东尚科信息技术有限公司 and 北京京东世纪贸易有限公司
Publication of WO2018095167A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 — Speaker identification or verification techniques
    • G10L 17/02 — Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/04 — Training, enrolment or model building

Definitions

  • the present disclosure relates to the field of voiceprint recognition, and in particular to a voiceprint recognition method and a voiceprint recognition system.
  • Voiceprint refers to a spectrum pattern showing sound-wave characteristics drawn by a special electro-acoustic conversion instrument (such as a sonograph or a speech spectrograph), and is a collection of various acoustic feature maps.
  • for the human body, the voiceprint is a long-term stable characteristic signal. Due to innate physiological differences of the vocal organs and acquired behavioral differences, each person's voiceprint carries a strong personal color.
  • Voiceprint recognition is a biometric method that automatically recognizes the identity of a speaker based on characteristic parameters such as unique physiological and behavioral characteristics contained in human speech.
  • voiceprint recognition mainly collects a person's voice information, extracts unique voice features and converts them into digital symbols, and saves them as a feature template, so that during application the voice to be recognized is matched against the templates in the database, thereby discriminating the speaker's identity.
  • Sound spectrum analysis plays a major role in modern life. For example, the installation, adjustment and operation of machinery in industrial production can be monitored by means of sound spectrum analysis. In addition, sound spectrum analysis is widely applied in the scientific testing of musical instrument manufacturing, jewelry identification, and the effective use of communication and broadcast equipment.
  • the "voiceprint recognition" technology can be used for identity authentication to discriminate the identity of the speaker.
  • most of the research results in this field are based on text dependence; that is, the person being verified must speak according to prescribed text, which has limited the development of the technology.
  • the fault tolerance of existing algorithms is poor: they basically rely on a single similarity score to assess whether two speech-feature samples belong to the same person. If the sample size is not large enough, or the speech features of the samples are highly similar, it is difficult to make an accurate judgment.
  • a voiceprint recognition method may include: receiving audio to be tested and dividing the audio to be tested into a first part and a second part; selecting one sample audio from the sample database and dividing the selected sample audio into a first part and a second part; extracting feature matrices for the audio to be tested and the selected sample audio by using the Mel cepstrum coefficient extraction method; performing support vector machine training by using the feature matrix of the first part of the audio to be tested as the first type of samples and the feature matrix of the selected sample audio as the second type of samples, and calculating the ratio a of the second part of the audio to be tested belonging to the second type of samples; performing support vector machine training by using the feature matrix of the first part of the selected sample audio as the first type of samples and the feature matrix of the audio to be tested as the second type of samples, and calculating the ratio b of the second part of the selected sample audio belonging to the second type of samples; performing support vector machine training by using the feature matrix of the second part of the audio to be tested as the first type of samples and the feature matrix of the selected sample audio as the second type of samples, and calculating the ratio c of the first part of the audio to be tested belonging to the second type of samples; performing support vector machine training by using the feature matrix of the second part of the selected sample audio as the first type of samples and the feature matrix of the audio to be tested as the second type of samples, and calculating the ratio d of the first part of the selected sample audio belonging to the second type of samples; and calculating, according to the calculated a, b, c, and d, the degree to which the audio to be tested matches the selected sample audio, so as to determine whether they are from the same person.
  • the voiceprint recognition method further includes preprocessing the received audio to be tested, wherein the preprocessing includes at least one of: pre-emphasizing the audio to be tested; framing the audio to be tested by using an overlapping-segment framing method; applying a Hamming window to eliminate the Gibbs effect; and distinguishing speech frames from non-speech frames and discarding the non-speech frames.
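The preprocessing chain above can be sketched in NumPy as follows. This is an illustrative sketch, not part of the disclosure: the 25 ms frame length, 10 ms hop, 0.97 pre-emphasis coefficient, and the energy threshold used for speech/non-speech discrimination are all assumed values.

```python
import numpy as np

def preprocess(signal, sr, frame_ms=25, hop_ms=10, alpha=0.97):
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1] boosts high frequencies
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Overlapping-segment framing
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)
    frames = np.stack([emphasized[i * hop_len : i * hop_len + frame_len]
                       for i in range(n_frames)])

    # Hamming window to suppress the Gibbs effect at frame edges
    frames = frames * np.hamming(frame_len)

    # Crude energy-based endpoint detection: drop low-energy (non-speech) frames
    energy = (frames ** 2).sum(axis=1)
    keep = energy > 0.1 * energy.mean()
    return frames[keep]
```

A real implementation would typically combine short-time energy with the zero-crossing rate for endpoint detection, as the description below mentions; the single energy threshold here is the simplest stand-in.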
  • the dividing the audio to be tested into the first portion and the second portion includes dividing the audio to be tested into two portions of equal length.
  • the dividing the selected sample audio into the first portion and the second portion comprises dividing the selected sample audio into two portions of equal length.
  • calculating the degree of matching of the audio to be tested and the sample audio comprises: calculating an average value of a, b, c, and d; and determining the ratio of the average value to 0.5 as the degree of matching between the audio to be tested and the sample audio.
  • a voiceprint recognition system, comprising: a receiver configured to receive audio to be tested; a sample database configured to store one or more sample audios; a support vector machine configured to classify test data according to classification samples; and a controller configured to: divide the audio to be tested from the receiver into a first part and a second part, and select one sample audio from the sample database and divide the selected sample audio into a first part and a second part; extract feature matrices for the audio to be tested and the selected sample audio by using the Mel cepstrum coefficient extraction method; input to the support vector machine the feature matrix of the first part of the audio to be tested as the first type of samples and the feature matrix of the selected sample audio as the second type of samples, train the support vector machine, and calculate the ratio a of the second part of the audio to be tested belonging to the second type of samples; input to the support vector machine the feature matrix of the first part of the selected sample audio as the first type of samples and the feature matrix of the audio to be tested as the second type of samples, train the support vector machine, and calculate the ratio b of the second part of the selected sample audio belonging to the second type of samples; calculate, in the same manner, the ratio c of the first part of the audio to be tested and the ratio d of the first part of the selected sample audio belonging to their respective second types of samples; and calculate, according to a, b, c, and d, the degree of matching between the audio to be tested and the selected sample audio.
  • the controller may be further configured to pre-process the received audio to be tested, wherein the pre-processing comprises at least one of: pre-emphasizing the audio to be tested; framing the audio to be tested by using an overlapping-segment framing method; applying a Hamming window to eliminate the Gibbs effect; and distinguishing speech frames from non-speech frames and discarding the non-speech frames.
  • the controller is further configured to divide the audio to be tested into two parts of equal length.
  • the controller is further configured to split the selected sample audio into two parts of equal length.
  • the controller is further configured to: calculate an average value of a, b, c, and d; and determine a ratio of the average value to 0.5 as a degree of matching of the audio to be tested and the sample audio.
  • a computer system comprising: one or more processors; a memory for storing one or more programs, wherein when the one or more programs are When executed by a plurality of processors, the one or more processors are caused to implement the voiceprint recognition method as described above.
  • a computer readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to implement the voiceprint recognition method as described above.
  • FIG. 1 is a block diagram showing the structure of a voiceprint recognition system according to an exemplary embodiment of the present disclosure
  • FIG. 2 illustrates an operational logic diagram of a voiceprint recognition method in accordance with an example embodiment of the present disclosure
  • FIG. 3 illustrates a flow chart of a voiceprint recognition method according to an example embodiment of the present disclosure
  • FIG. 4 is a diagram showing an example of a process of training the support vector machine of FIG. 3 and calculating an audio matching degree
  • FIG. 5 schematically illustrates a block diagram of a computer system suitable for implementing a voiceprint recognition method in accordance with an embodiment of the present disclosure.
  • the present disclosure provides a text-independent voiceprint recognition method and voiceprint recognition system, wherein the voiceprint recognition method can effectively improve the fault tolerance of voiceprint recognition with small samples, and can quickly and efficiently identify whether two segments of audio belong to the same person, giving it broad application prospects. Through speaker recognition in voiceprint recognition technology, identity identification using voice information can be achieved.
  • FIG. 1 shows a block diagram of a structure of a voiceprint recognition system 100 in accordance with an exemplary embodiment of the present disclosure.
  • the voiceprint recognition system 100 includes a receiver 110 configured to receive audio to be tested; a sample database 120 configured to store one or more sample audios; a support vector machine 130 configured to classify test data according to classification samples; and a controller 140.
  • the support vector machine 130 is capable of performing a classification function.
  • the input space is first transformed into a high-dimensional space by a nonlinear transformation, so that the samples become linearly separable, wherein the nonlinear transformation is achieved by an appropriate inner product (kernel) function; the optimal linear classification surface is then sought in the new space to achieve the classification function.
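This kernel-based classification behavior can be illustrated with scikit-learn's `SVC` standing in for the support vector machine 130. The ring-shaped toy data and the RBF kernel choice are illustrative assumptions, not details from the disclosure.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two classes that are not linearly separable in the input space:
# class 0 forms an inner disk, class 1 an outer ring around it.
r0 = rng.uniform(0.0, 1.0, 200)
r1 = rng.uniform(2.0, 3.0, 200)
th = rng.uniform(0.0, 2.0 * np.pi, 400)
X = np.c_[np.r_[r0, r1] * np.cos(th), np.r_[r0, r1] * np.sin(th)]
y = np.r_[np.zeros(200), np.ones(200)]

# The RBF kernel plays the role of the inner product function: it implicitly
# maps inputs into a high-dimensional space where a linear surface separates them.
clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))  # typically close to 1.0 on this cleanly separated data
```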
  • the controller 140 may be configured to divide the audio to be tested from the receiver 110 into a first portion and a second portion, and select one sample audio from the sample database 130 and divide the selected sample audio into the first portion and the second portion. For example, the audio to be tested and the selected sample audio are both divided into two parts of equal length.
  • the controller 140 extracts a feature matrix for the audio to be tested and the selected sample audio by using the extraction method of the Mel Cepstrum Coefficient (MFCC).
  • the Mel frequency is based on the auditory characteristics of the human ear, which is nonlinearly related to the frequency (Hz).
  • the Mel Frequency Cepstral Coefficient (MFCC) is a Hz spectral feature calculated using this relationship between them.
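A commonly used form of this nonlinear relationship is mel(f) = 2595 · log10(1 + f/700); the disclosure does not specify its exact constants, so this standard formula is an assumption here.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Common Mel-scale mapping: roughly linear below ~1 kHz, logarithmic above."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)
```

With these constants, 1000 Hz maps to approximately 1000 mel, which is the usual calibration point of the scale.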
  • the controller 140 determines whether the audio to be tested and the selected sample audio are from the same person by using a support vector machine.
  • the feature matrix of the first part of the to-be-tested audio as the first type of samples and the feature matrix of the selected sample audio as the second type of samples may be input to the support vector machine 130, and the support vector machine 130 may be trained, to calculate the ratio a of the second part of the audio to be tested belonging to the second type of samples; by inputting to the support vector machine 130 the feature matrix of the first part of the selected sample audio as the first type of samples and the feature matrix of the audio to be tested as the second type of samples and training the support vector machine 130, the ratio b of the second part of the selected sample audio belonging to the second type of samples is calculated; by inputting to the support vector machine 130 the feature matrix of the second part of the audio to be tested as the first type of samples and the feature matrix of the selected sample audio as the second type of samples and training the support vector machine 130, the ratio c of the first part of the audio to be tested belonging to the second type of samples is calculated; and by inputting to the support vector machine 130 the feature matrix of the second part of the selected sample audio as the first type of samples and the feature matrix of the audio to be tested as the second type of samples and training the support vector machine 130, the ratio d of the first part of the selected sample audio belonging to the second type of samples is calculated.
  • the controller 140 may be further configured to pre-process the received audio to be tested, for example, to pre-emphasize the audio to be tested with pre-filtering and high-frequency compensation; to frame the audio by using an overlapping-segment framing method; to apply a Hamming window to eliminate the Gibbs effect; and to distinguish speech frames from non-speech frames and discard the non-speech frames.
  • since the sound signal usually changes continuously, in order to simplify the continuously varying signal, it is assumed that the audio signal does not change over a short time scale; the signal is therefore grouped into units of multiple sampling points, each of which is called a "frame". A frame is often 20-40 milliseconds: if the frame is shorter, there will not be enough sampling points in each frame for a reliable spectrum calculation, while if it is too long, the signal will change too much within each frame.
  • FIG. 2 illustrates an operational logic diagram of a voiceprint recognition method in accordance with an example embodiment of the present disclosure.
  • the audio to be tested is received by the receiver; then, in operation S05, the audio to be tested is pre-processed, for example, by pre-filtering and high-frequency compensation; the audio to be tested is then framed by using the overlapping-segment method; a Hamming window is then applied to eliminate the Gibbs effect; and speech frames are distinguished from non-speech frames and the non-speech frames are discarded.
  • the audio to be tested is then split into a first part and a second part.
  • sample audio may be selected from the sample database, and the selected sample audio is divided into a first portion and a second portion at operation S20.
  • feature vectors for the respective parts of the audio to be tested and the selected sample audio are extracted by using the Mel cepstrum coefficient extraction method, so that one or more of the feature vectors are used in operation S30 to train the support vector machine.
  • in operation S35, it is determined whether the audio to be tested and the selected sample audio are from the same person.
  • FIG. 3 illustrates a flow chart of a voiceprint recognition method in accordance with an example embodiment of the present disclosure.
  • the audio A to be tested is received and the audio A to be tested is divided into a first part A1 and a second part A2.
  • a sample audio B is selected from the sample database and the selected sample audio B is divided into a first portion B1 and a second portion B2.
  • the audio A to be tested can be divided from the middle into two equal-length parts A1 and A2, while the sample audio B is likewise equally divided from the middle into two parts B1 and B2.
  • the audio to be tested and the selected sample audio may also be divided in other ratios; for example, the audio to be tested may be divided into two parts in a ratio of 1:2, and the selected sample audio into two parts in a ratio of 2:3.
  • the method may further include pre-processing the audio to be tested, for example, pre-emphasizing the audio to be detected; framing the test audio by using a framing method of overlapping segments; applying Hamming Window to eliminate the Gibbs effect; and distinguish between speech frames and non-speech frames and discard non-speech frames.
  • a filter is first designed according to the frequency characteristics of the speech signal to perform pre-filtering and high-frequency compensation; the overlapping-segment method is then used for framing; next, a window is applied to the signal to eliminate the Gibbs effect; finally, endpoint detection based on short-time energy and the short-time average zero-crossing rate is used to distinguish speech frames from non-speech frames, and the non-speech frames are discarded.
  • a feature matrix for the audio to be tested and the selected sample audio is extracted by using the Mel cepstrum coefficient extraction method. That is, according to the Mel cepstrum coefficient extraction method, a vector of 1 row and 20 columns is extracted from each frame of each speaker's speech as its feature vector; a speaker's n frames then constitute a feature matrix of n rows and 20 columns.
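An end-to-end sketch of producing such an n×20 MFCC feature matrix using only NumPy and SciPy is shown below. The filter-bank size, FFT length, and frame parameters are illustrative assumptions; a production system would typically use a dedicated speech-processing library.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_matrix(signal, sr, n_mfcc=20, frame_ms=25, hop_ms=10,
                n_filters=26, n_fft=512):
    """Return an (n_frames x n_mfcc) feature matrix: one 1x20 vector per frame."""
    # Pre-emphasis and overlapping framing, as in the preprocessing step
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    flen, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n = 1 + max(0, (len(sig) - flen) // hop)
    frames = np.stack([sig[i * hop : i * hop + flen]
                       for i in range(n)]) * np.hamming(flen)

    # Power spectrum of each frame
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft

    # Triangular Mel filter bank between 0 Hz and sr/2
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log filter-bank energies, then a DCT; keep the first n_mfcc coefficients
    feats = np.log(power @ fbank.T + 1e-10)
    return dct(feats, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```

With `n_mfcc=20`, a recording of n frames yields exactly the n-row, 20-column matrix described above.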
  • in step S320, support vector machine training is performed by using the feature matrix of the first part A1 of the audio to be tested as the first type of samples and the feature matrix of the selected sample audio B as the second type of samples, and the ratio a of the second part A2 of the audio to be tested belonging to the second type of samples is calculated, in order to determine whether the second part A2 of the audio to be tested belongs to the selected sample audio. Then, in step S325, support vector machine training is performed by using the feature matrix of the first part B1 of the selected sample audio as the first type of samples and the feature matrix of the audio A to be tested as the second type of samples, and the ratio b of the second part B2 of the selected sample audio belonging to the second type of samples is calculated. Then, in step S330, support vector machine training is performed by using the feature matrix of the second part A2 of the audio to be tested as the first type of samples and the feature matrix of the selected sample audio B as the second type of samples, and the ratio c of the first part A1 of the audio to be tested belonging to the second type of samples is calculated. Finally, in step S335, support vector machine training is performed by using the feature matrix of the second part B2 of the selected sample audio as the first type of samples and the feature matrix of the audio A to be tested as the second type of samples, and the ratio d of the first part B1 of the selected sample audio belonging to the second type of samples is calculated.
  • in step S340, based on the calculated a, b, c, and d, the degree of matching between the audio to be tested and the selected sample audio is calculated to determine whether the audio to be tested and the selected sample audio are from the same person.
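Steps S320-S335 can be sketched as follows, with scikit-learn's `SVC` standing in for the support vector machine. This is an illustrative sketch: the helper names, the RBF kernel, and the synthetic feature matrices are assumptions, not details from the disclosure.

```python
import numpy as np
from sklearn.svm import SVC

def class2_ratio(first_class, second_class, probe):
    """Train an SVM on two labelled feature matrices, then return the
    fraction of probe frames the SVM assigns to the second class."""
    X = np.vstack([first_class, second_class])
    y = np.r_[np.zeros(len(first_class)), np.ones(len(second_class))]
    clf = SVC(kernel="rbf").fit(X, y)
    return float(np.mean(clf.predict(probe) == 1))

def match_ratios(A1, A2, B1, B2):
    """Compute a, b, c, d as in steps S320-S335 for halves of audio A and B."""
    A = np.vstack([A1, A2])
    B = np.vstack([B1, B2])
    a = class2_ratio(A1, B, A2)   # S320: does A2 look like sample audio B?
    b = class2_ratio(B1, A, B2)   # S325: does B2 look like test audio A?
    c = class2_ratio(A2, B, A1)   # S330: does A1 look like sample audio B?
    d = class2_ratio(B2, A, B1)   # S335: does B1 look like test audio A?
    return a, b, c, d
```

When A and B come from different speakers, each probe half resembles the first-class training data and the ratios fall toward 0; when they come from the same speaker, the classes overlap and the ratios rise toward 0.5.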
  • For example, the average of a, b, c, and d can be calculated, and the ratio of the average to 0.5 can be determined as the degree of matching of the audio to be tested with the sample audio. In this case, if the audio to be tested and the selected sample audio belong to one person, the average value should be close to 0.5; if they are not from the same person, the average should be close to zero.
  • the ratio of the average value to 0.5 can thus be regarded as the degree of matching of the audio to be tested with the sample audio. According to this matching degree, it is possible to confirm whether the matched sample and the audio to be tested are the same person's voice and to prevent misjudgment.
  • different proportional thresholds may be set based on the requirements of different application environments to determine whether the audio to be tested and the sample audio are from the same person. For example, in lower-security settings, the threshold may be set to a lower value, for example 70%: if the calculated matching degree is greater than or equal to 70%, the two are considered to be from the same person; otherwise, they are considered to be from different people. In higher-security settings (e.g., an access control system), the threshold may be set to a higher value, for example 95%. The recognition accuracy can thereby be adjusted according to the needs of the application, which is more convenient for the user.
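The matching-degree computation and the adjustable threshold described above can be sketched as follows; capping the degree at 1.0 when the average exceeds 0.5 is an assumption, since the disclosure only defines the ratio of the average to 0.5.

```python
def matching_degree(a, b, c, d):
    """If both recordings come from the same speaker, each ratio should hover
    near 0.5, so their average divided by 0.5 approaches 1; for different
    speakers the ratios (and hence the degree) fall toward 0."""
    avg = (a + b + c + d) / 4.0
    return min(avg / 0.5, 1.0)  # ratio of the average to the ideal 0.5

def same_speaker(a, b, c, d, threshold=0.70):
    # Lower thresholds (e.g. 0.70) favor convenience; high-security uses
    # such as access control might raise the threshold to 0.95.
    return matching_degree(a, b, c, d) >= threshold
```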
  • by segmenting the to-be-matched audio and the sample audio and classifying the segmented samples in different ways, the voiceprint recognition method and system proposed by the present disclosure can achieve highly fault-tolerant and efficient identification under various small-sample conditions.
  • a computer system comprising: one or more processors; a memory for storing one or more programs, wherein when the one or more programs are When executed by a plurality of processors, the one or more processors are caused to implement the voiceprint recognition method as described above.
  • a computer readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to implement the voiceprint recognition method as described above.
  • FIG. 5 schematically illustrates a block diagram of a computer system suitable for implementing a voiceprint recognition method in accordance with an embodiment of the present disclosure.
  • the computer system shown in FIG. 5 is merely an example and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • a computer system 500 in accordance with an embodiment of the present disclosure includes a processor 501 that can perform various appropriate actions and processes according to a program stored in a read only memory (ROM) 502 or a program loaded from a storage portion 508 into a random access memory (RAM) 503.
  • Processor 501 can include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor, and/or a related chipset and/or a special purpose microprocessor (e.g., an application specific integrated circuit (ASIC)), and the like.
  • Processor 501 can also include an onboard memory for caching purposes.
  • the processor 501 may include a single processing unit or a plurality of processing units for performing different actions of the method flow according to the embodiments of the present disclosure described with reference to FIGS. 2 and 3.
  • in the RAM 503, various programs and data required for the operation of the system 500 are stored.
  • the processor 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504.
  • the processor 501 performs the various operations described above with reference to FIGS. 2 and 3 by executing programs in the ROM 502 and/or the RAM 503. It is noted that the program can also be stored in one or more memories other than ROM 502 and RAM 503.
  • the processor 501 can also perform the various operations described above with reference to FIGS. 2 and 3 by executing a program stored in the one or more memories.
  • in accordance with an embodiment of the present disclosure, system 500 may also include an input/output (I/O) interface 505, which is also coupled to the bus 504.
  • System 500 can also include one or more of the following components coupled to I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including, for example, a cathode ray tube (CRT), a liquid crystal display (LCD), and a speaker; a storage portion 508 including a hard disk or the like; and a communication portion 509 including a network interface card such as a LAN card, a modem, and the like.
  • the communication section 509 performs communication processing via a network such as the Internet.
  • A drive 510 is also coupled to I/O interface 505 as needed.
  • a removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory or the like is mounted on the drive 510 as needed so that a computer program read therefrom is installed into the storage portion 508 as needed.
  • an embodiment of the present disclosure includes a computer program product comprising a computer program carried on a computer readable storage medium, the computer program comprising program code for executing the method illustrated in the flowchart.
  • the computer program can be downloaded and installed from the network via the communication portion 509, and/or installed from the removable medium 511.
  • the above-described functions defined in the system of the embodiments of the present disclosure are executed when the computer program is executed by the processor 501.
  • the systems, devices, devices, modules, units, and the like described above may be implemented by a computer program module in accordance with an embodiment of the present disclosure.
  • the computer readable storage medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two.
  • the computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer readable storage media may include, but are not limited to, electrical connections having one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the foregoing.
  • a computer readable storage medium may be any tangible medium that can contain or store a program, which can be used by or in connection with an instruction execution system, apparatus, or device.
  • a computer readable signal medium may include a data signal that is propagated in the baseband or as part of a carrier, carrying computer readable program code. Such propagated data signals can take a variety of forms including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the computer readable signal medium can also be any computer readable medium other than a computer readable storage medium, which can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable storage medium may be transmitted by any suitable medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the foregoing.
  • each block of the flowchart or block diagrams can represent a module, a program segment, or a portion of code that includes one or more Executable instructions.
  • the functions noted in the blocks may also occur in a different order than that illustrated in the drawings. For example, two successively represented blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts can be implemented by a dedicated hardware-based system that performs the specified function or operation, or can be used A combination of dedicated hardware and computer instructions is implemented.
  • the foregoing method can be implemented in the form of program commands executable by various computer devices and recorded in a computer readable recording medium.
  • the computer readable recording medium may include a separate program command, a data file, a data structure, or a combination thereof.
  • program commands recorded in the recording medium may be those specially designed or configured for the present disclosure, or those known to persons skilled in the art of computer software.
  • the computer readable recording medium includes a magnetic medium such as a hard disk, a floppy disk or a magnetic tape; an optical medium such as a compact disk read only memory (CD-ROM) or a digital versatile disk (DVD); a magneto-optical medium such as a magneto-optical floppy disk; and hardware devices, such as ROM, RAM and flash memory, that store and execute program commands.
  • the program commands include machine language code produced by a compiler as well as high-level language code that the computer can execute by using an interpreter.
  • the foregoing hardware device may be configured to operate as at least one software module in order to perform the operations of the present disclosure, and vice versa.


Abstract

A voiceprint identification method and system. The method comprises: receiving an audio to be tested and segmenting the audio to be tested into a first part and a second part; selecting a sample audio and segmenting the sample audio into a first part and a second part; extracting characteristic matrixes for the audio to be tested and the sample audio by using Mel-frequency cepstral coefficient extraction method; executing support vector machine training by using the characteristic matrix of the first part of the audio to be tested as a first type of sample and using the characteristic matrix of the selected sample audio as a second type of sample, and calculating the matching degree of the second part of the audio to be tested and the second type of sample; performing a similar process on the first part of the sample audio, the first part of the audio to be tested and the second part of the sample audio, and respectively calculating the matching degree between the three with the audio to be tested, the selected sample audio and the audio to be tested as the respective corresponding second type of sample; and determining, according to the matching degree, whether the voice in the audio to be tested and the sample audio are from the same person.

Description

Voiceprint recognition method and voiceprint recognition system

Technical field
The present disclosure relates to the field of voiceprint recognition, and in particular to a voiceprint recognition method and a voiceprint recognition system.
Background
A voiceprint is a spectrographic pattern of sound-wave characteristics drawn by special electro-acoustic conversion instruments (such as a sound spectrograph or a speech spectrograph); it is a collection of various acoustic feature maps. For a human being, the voiceprint is a long-term stable characteristic signal: owing to innate physiological differences in the vocal organs and behavioral differences acquired later in life, each person's voiceprint carries a strongly individual character.
Voiceprint recognition is a biometric method that automatically identifies a speaker based on characteristic parameters such as the unique articulatory physiology and behavioral characteristics contained in human speech. Voiceprint recognition mainly collects a person's voice information, extracts distinctive voice features, converts them into digital symbols, and stores them as feature templates, so that at application time the speech to be recognized is matched against the templates in a database, thereby determining the speaker's identity. Beginning in the 1960s, research techniques for sound spectrum analysis were proposed and applied to speaker feature analysis. At present, voiceprint recognition technology is relatively mature and has entered practical use.
Sound spectrum analysis plays a major role in modern life; for example, the installation, adjustment, and operation of machinery in industrial production can be monitored by means of sound spectrum analysis. In addition, sound spectrum analysis is widely applied in the scientific inspection of musical-instrument manufacturing, jewelry appraisal, and the effective use of communication and broadcasting equipment. In communications, voiceprint recognition technology can be used for identity authentication, that is, to determine a speaker's identity. Most current research results in this field are text-dependent: the person being verified must pronounce a prescribed text, which has limited the development of the technology. Moreover, the fault tolerance of existing algorithms is poor; they essentially rely on a single similarity score to decide whether two speech-feature samples belong to the same person. If the sample size is not large enough, or the speech features of the samples are highly similar, it is difficult to make an accurate judgment.
Summary of the invention
According to a first aspect of the present disclosure, a voiceprint recognition method is provided. The voiceprint recognition method may include: receiving audio to be tested and dividing the audio to be tested into a first part and a second part; selecting a sample audio from a sample database and dividing the selected sample audio into a first part and a second part; extracting feature matrices for the audio to be tested and the selected sample audio by using the Mel-frequency cepstral coefficient extraction method; performing support vector machine training with the feature matrix of the first part of the audio to be tested as the first-class sample and the feature matrix of the selected sample audio as the second-class sample, and calculating the proportion a of the second part of the audio to be tested that belongs to the second class; performing support vector machine training with the feature matrix of the first part of the selected sample audio as the first-class sample and the feature matrix of the audio to be tested as the second-class sample, and calculating the proportion b of the second part of the selected sample audio that belongs to the second class; performing support vector machine training with the feature matrix of the second part of the audio to be tested as the first-class sample and the feature matrix of the selected sample audio as the second-class sample, and calculating the proportion c of the first part of the audio to be tested that belongs to the second class; performing support vector machine training with the feature matrix of the second part of the selected sample audio as the first-class sample and the feature matrix of the audio to be tested as the second-class sample, and calculating the proportion d of the first part of the selected sample audio that belongs to the second class; and calculating, from the computed a, b, c, and d, the degree to which the audio to be tested matches the selected sample audio, so as to determine whether the audio to be tested and the selected sample audio come from the same person's voice.
According to an embodiment of the present disclosure, the voiceprint recognition method further includes preprocessing the received audio to be tested, wherein the preprocessing includes at least one of the following operations: pre-emphasizing the audio to be tested; framing the audio to be tested by using an overlapping-segment framing method; applying a Hamming window to eliminate the Gibbs effect; and distinguishing speech frames from non-speech frames and discarding the non-speech frames.
According to an embodiment of the present disclosure, dividing the audio to be tested into a first part and a second part includes dividing the audio to be tested into two parts of equal length.
According to an embodiment of the present disclosure, dividing the selected sample audio into a first part and a second part includes dividing the selected sample audio into two parts of equal length.
According to an embodiment of the present disclosure, calculating the degree to which the audio to be tested matches the sample audio includes: calculating the average of a, b, c, and d; and determining the ratio of the average to 0.5 as the degree to which the audio to be tested matches the sample audio.
According to a second aspect of the present disclosure, a voiceprint recognition system is provided. The voiceprint recognition system may include: a receiver configured to receive audio to be tested; a sample database configured to store one or more sample audios; a support vector machine configured to classify test data according to classification samples; and a controller configured to: divide the audio to be tested from the receiver into a first part and a second part, select a sample audio from the sample database, and divide the selected sample audio into a first part and a second part; extract feature matrices for the audio to be tested and the selected sample audio by using the Mel-frequency cepstral coefficient extraction method; input the feature matrix of the first part of the audio to be tested as the first-class sample and the feature matrix of the selected sample audio as the second-class sample to the support vector machine, train the support vector machine, and calculate the proportion a of the second part of the audio to be tested that belongs to the second class; input the feature matrix of the first part of the selected sample audio as the first-class sample and the feature matrix of the audio to be tested as the second-class sample to the support vector machine, train the support vector machine, and calculate the proportion b of the second part of the selected sample audio that belongs to the second class; input the feature matrix of the second part of the audio to be tested as the first-class sample and the feature matrix of the selected sample audio as the second-class sample to the support vector machine, train the support vector machine, and calculate the proportion c of the first part of the audio to be tested that belongs to the second class; input the feature matrix of the second part of the selected sample audio as the first-class sample and the feature matrix of the audio to be tested as the second-class sample to the support vector machine, train the support vector machine, and calculate the proportion d of the first part of the selected sample audio that belongs to the second class; and calculate, from the computed a, b, c, and d, the degree to which the audio to be tested matches the sample audio, so as to determine whether the audio to be tested and the sample audio come from the same person's voice.
According to an embodiment of the present disclosure, the controller may be further configured to preprocess the received audio to be tested, wherein the preprocessing includes at least one of the following operations: pre-emphasizing the audio to be tested; framing the audio to be tested by using an overlapping-segment framing method; applying a Hamming window to eliminate the Gibbs effect; and distinguishing speech frames from non-speech frames and discarding the non-speech frames.
According to an embodiment of the present disclosure, the controller is further configured to divide the audio to be tested into two parts of equal length.
According to an embodiment of the present disclosure, the controller is further configured to divide the selected sample audio into two parts of equal length.
According to an embodiment of the present disclosure, the controller is further configured to: calculate the average of a, b, c, and d; and determine the ratio of the average to 0.5 as the degree to which the audio to be tested matches the sample audio.
According to an embodiment of the present disclosure, a computer system is also provided, including: one or more processors; and a memory for storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the voiceprint recognition method described above.
According to an embodiment of the present disclosure, a computer-readable storage medium is also provided, having executable instructions stored thereon which, when executed by a processor, cause the processor to implement the voiceprint recognition method described above.
Brief description of the drawings
The above and other aspects, features, and advantages of the example embodiments of the present disclosure will become clearer from the following description taken in conjunction with the accompanying drawings, in which:
FIG. 1 shows a structural block diagram of a voiceprint recognition system according to an example embodiment of the present disclosure;
FIG. 2 shows an operational logic diagram of a voiceprint recognition method according to an example embodiment of the present disclosure;
FIG. 3 shows a flowchart of a voiceprint recognition method according to an example embodiment of the present disclosure;
FIG. 4 shows an example diagram of the process of training the support vector machine and calculating the audio matching degree in FIG. 3; and
FIG. 5 schematically shows a block diagram of a computer system suitable for implementing a voiceprint recognition method according to an embodiment of the present disclosure.
Detailed description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood, however, that these descriptions are merely illustrative and are not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted to avoid unnecessarily obscuring the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. The terms "including", "comprising", and the like, as used herein, indicate the presence of the stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used herein are to be interpreted as having meanings consistent with the context of this specification and should not be interpreted in an idealized or overly rigid manner.
Where an expression such as "at least one of A, B, and C" is used, it should generally be interpreted according to the meaning commonly understood by those skilled in the art (for example, "a system having at least one of A, B, and C" shall include, but is not limited to, a system having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C). Where an expression such as "at least one of A, B, or C" is used, it should likewise be interpreted according to the meaning commonly understood by those skilled in the art (for example, "a system having at least one of A, B, or C" shall include, but is not limited to, a system having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C). Those skilled in the art should further understand that essentially any disjunctive conjunction and/or phrase presenting two or more alternative items, whether in the specification, the claims, or the drawings, should be understood to contemplate the possibility of including one of the items, either of the items, or both items. For example, the phrase "A or B" should be understood to include the possibility of "A", "B", or "A and B".
Example implementations of the present disclosure are described below with reference to the drawings. The present disclosure provides a text-independent voiceprint recognition method and voiceprint recognition system. The voiceprint recognition method can effectively improve the fault tolerance of voiceprint recognition under small-sample conditions and can quickly and efficiently determine whether two audio segments belong to the same person, and thus has broad application prospects. Through speaker recognition within voiceprint recognition technology, identity authentication based on voice information can be achieved.
FIG. 1 shows a structural block diagram of a voiceprint recognition system 100 according to an example embodiment of the present disclosure. As shown in FIG. 1, the voiceprint recognition system 100 includes: a receiver 110 configured to receive audio to be tested; a sample database 120 configured to store one or more sample audios; a support vector machine 130 configured to classify test data according to classification samples; and a controller 140. The support vector machine 130 performs a classification function. Specifically, for a linearly inseparable case, the input space is first transformed into a high-dimensional space by a nonlinear transformation so that the samples become linearly separable, where the nonlinear transformation is realized by an appropriate inner-product (kernel) function; the optimal linear classification surface is then sought in the new space, thereby achieving classification. The controller 140 may be configured to divide the audio to be tested from the receiver 110 into a first part and a second part, select a sample audio from the sample database 120, and divide the selected sample audio into a first part and a second part; for example, both the audio to be tested and the selected sample audio may be divided into two parts of equal length. Although this embodiment describes dividing both the audio to be tested and the selected sample audio into two parts of equal length, it should be noted that the audio to be tested and the selected sample audio may also be divided in different ratios, and the two ratios may differ from each other. Next, the controller 140 extracts feature matrices for the audio to be tested and the selected sample audio by using the Mel-frequency cepstral coefficient (MFCC) extraction method. The Mel frequency scale is based on the auditory characteristics of the human ear and has a nonlinear correspondence with frequency in Hz. Mel-frequency cepstral coefficients (MFCCs) are spectral features calculated by exploiting this relationship. At present, MFCCs and their extraction methods are widely used in the field of speech recognition.
According to an embodiment of the present disclosure, the controller 140 uses the support vector machine to determine whether the audio to be tested and the selected sample audio come from the same person. Specifically, the controller 140 may: input the feature matrix of the first part of the audio to be tested as the first-class sample and the feature matrix of the selected sample audio as the second-class sample to the support vector machine 130, train the support vector machine 130, and calculate the proportion a of the second part of the audio to be tested that belongs to the second class; input the feature matrix of the first part of the selected sample audio as the first-class sample and the feature matrix of the audio to be tested as the second-class sample to the support vector machine 130, train the support vector machine 130, and calculate the proportion b of the second part of the selected sample audio that belongs to the second class; input the feature matrix of the second part of the audio to be tested as the first-class sample and the feature matrix of the selected sample audio as the second-class sample to the support vector machine 130, train the support vector machine 130, and calculate the proportion c of the first part of the audio to be tested that belongs to the second class; input the feature matrix of the second part of the selected sample audio as the first-class sample and the feature matrix of the audio to be tested as the second-class sample to the support vector machine 130, train the support vector machine 130, and calculate the proportion d of the first part of the selected sample audio that belongs to the second class; and calculate, from the computed a, b, c, and d, the degree to which the audio to be tested matches the sample audio, so as to determine whether the audio to be tested and the sample audio come from the same person's voice. In one embodiment, the controller 140 may calculate the average of a, b, c, and d and determine the ratio of the average to 0.5 as the degree to which the audio to be tested matches the sample audio.
In an alternative embodiment, the controller 140 may be further configured to preprocess the received audio to be tested: for example, pre-emphasizing the audio to be tested with pre-filtering and high-frequency compensation; then framing the audio to be tested by using an overlapping-segment framing method; then applying a Hamming window to eliminate the Gibbs effect; and distinguishing speech frames from non-speech frames and discarding the non-speech frames. Because a sound signal varies continuously, to simplify the continuously varying signal it is assumed that the audio signal does not change within a short time scale, so that the signal is grouped into units of multiple sampling points, each unit called a "frame". A frame is typically 20-40 milliseconds: if the frame is shorter, the sampling points within each frame are insufficient for a reliable spectral calculation, whereas if it is too long, the signal within each frame changes too much.
FIG. 2 shows an operational logic diagram of a voiceprint recognition method according to an example embodiment of the present disclosure. First, in operation S01, the audio to be tested is received by the receiver. Then, in operation S05, the audio to be tested is preprocessed: for example, pre-filtering and high-frequency compensation are applied; the audio is framed by using an overlapping-segment framing method; a Hamming window is then applied to eliminate the Gibbs effect; and speech frames are distinguished from non-speech frames, with the non-speech frames discarded. In operation S10, the audio to be tested is divided into first and second parts. In addition, in operation S15, a sample audio may be selected from the sample database, and in operation S20 the selected sample audio is divided into a first part and a second part. Subsequently, in operation S25, feature vectors for the respective parts of the audio to be tested and the selected sample audio are extracted by using the Mel-frequency cepstral coefficient extraction method, so that in operation S30 one or more of the feature vectors are used to train the support vector machine. Finally, in operation S35, it is determined whether the audio to be tested and the selected sample audio come from the same person.
FIG. 3 shows a flowchart of a voiceprint recognition method according to an example embodiment of the present disclosure. In step S305, the audio A to be tested is received and divided into a first part A1 and a second part A2. In step S310, a sample audio B is selected from the sample database and divided into a first part B1 and a second part B2. For example, the audio A to be tested may be split down the middle into two parts A1 and A2 of equal length, while the sample audio B is likewise split down the middle into two parts B1 and B2. In addition to this division, the audio to be tested and the selected sample audio may also be divided in other ratios; for example, the audio to be tested may be divided into two parts in a 1:2 ratio while the selected sample audio is divided into two parts in a 2:3 ratio.
Furthermore, before step S305 is performed, the method may further include preprocessing the audio to be tested: for example, pre-emphasizing the audio to be tested; framing it by using an overlapping-segment framing method; applying a Hamming window to eliminate the Gibbs effect; and distinguishing speech frames from non-speech frames and discarding the non-speech frames. In one embodiment, a special filter is first designed according to the frequency characteristics of the speech signal to filter the signal and apply high-frequency compensation; the signal is then framed by using the overlapping-segment framing method; next, a Hamming window is applied to the signal to eliminate the Gibbs effect; and finally, using an endpoint-detection method, speech frames are distinguished from non-speech frames according to the short-time energy and the short-time average zero-crossing rate, and the non-speech frames are discarded.
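The preprocessing chain above can be sketched as follows. The frame length, hop, and energy threshold are illustrative choices not fixed by the disclosure (400-sample frames with a 160-sample hop correspond to 25 ms / 10 ms at a 16 kHz sampling rate, within the 20-40 ms range mentioned later), and a plain short-time-energy gate stands in here for the combined energy / zero-crossing-rate endpoint detection:

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97, energy_ratio=0.1):
    """Pre-emphasize, frame with overlap, window, and drop low-energy frames.

    All parameter values are illustrative; the disclosure does not fix them.
    """
    # Pre-emphasis: boost high frequencies, y[n] = x[n] - alpha * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Overlapping segmentation into frames.
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])

    # Hamming window to suppress the Gibbs effect at frame edges.
    frames = frames * np.hamming(frame_len)

    # Simplified endpoint detection: keep frames whose short-time energy
    # exceeds a fraction of the mean energy (a stand-in for the text's
    # combined short-time energy / zero-crossing-rate test).
    energy = np.sum(frames ** 2, axis=1)
    keep = energy > energy_ratio * energy.mean()
    return frames[keep]
```

Applied to a signal that is half silence and half tone, the silent frames are discarded and only the speech-like frames survive for feature extraction.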
Next, in step S315, feature matrices for the audio to be tested and the selected sample audio are extracted by using the Mel-frequency cepstral coefficient extraction method. That is, according to the MFCC extraction method, a 1-row, 20-column vector is extracted from each frame of each speaker's speech as its feature vector, so that a speaker's n frames constitute a feature matrix of n rows and 20 columns.
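One way to realize this step is sketched below: it turns already-framed, windowed audio into an n×20 MFCC matrix matching the layout described above. The filterbank size (26 mel bands), sampling rate, and unnormalized DCT-II are illustrative assumptions rather than values fixed by the disclosure; a production system would typically use an established MFCC routine:

```python
import numpy as np

def mfcc_matrix(frames, sr=16000, n_mels=26, n_coef=20):
    """Compute an (n_frames x n_coef) MFCC feature matrix from windowed
    frames of shape (n_frames, frame_len). Compact sketch only."""
    n_fft = frames.shape[1]
    # Power spectrum of each frame.
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular mel filterbank between 0 Hz and sr/2.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, spec.shape[1]))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fbank[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i, k] = (right - k) / max(right - center, 1)

    # Log mel energies, then DCT-II to decorrelate into cepstral coefficients.
    logmel = np.log(spec @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_coef), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T
```

Each row of the returned matrix is the 1×20 feature vector of one frame, so n frames yield the n×20 matrix described in the text.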
Next, the steps of training the support vector machine are performed. In step S320, support vector machine training is performed with the feature matrix of the first part A1 of the audio to be tested as the first-class sample and the feature matrix of the selected sample audio B as the second-class sample, and the proportion a of the second part A2 of the audio to be tested that belongs to the second class is calculated, so as to judge whether the second part A2 of the audio to be tested belongs to the selected sample audio. Then, in step S325, support vector machine training is performed with the feature matrix of the first part B1 of the selected sample audio as the first-class sample and the feature matrix of the audio A to be tested as the second-class sample, and the proportion b of the second part B2 of the selected sample audio that belongs to the second class is calculated. Then, in step S330, support vector machine training is performed with the feature matrix of the second part A2 of the audio to be tested as the first-class sample and the feature matrix of the selected sample audio B as the second-class sample, and the proportion c of the first part A1 of the audio to be tested that belongs to the second class is calculated. Finally, in step S335, support vector machine training is performed with the feature matrix of the second part B2 of the selected sample audio as the first-class sample and the feature matrix of the audio A to be tested as the second-class sample, and the proportion d of the first part B1 of the selected sample audio that belongs to the second class is calculated. Any one of the above operations S320 to S335 may be exemplified as shown in FIG. 4, which shows an example diagram of the process of training the support vector machine and calculating the audio matching degree.
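The four-way training and scoring protocol of steps S320-S335 can be outlined as below. To keep the sketch self-contained, a simple nearest-centroid rule stands in for the support vector machine, and the function names are illustrative; in the actual method, each call would instead train the kernel SVM described for the support vector machine 130 on the two labeled classes and count the fraction of held-out frames it assigns to class 2:

```python
import numpy as np

def fraction_in_class2(class1, class2, test):
    """Train a two-class model on (class1, class2) feature matrices and
    return the fraction of `test` rows assigned to class 2. A nearest-
    centroid rule is a stand-in here for the SVM of the disclosure."""
    c1, c2 = class1.mean(axis=0), class2.mean(axis=0)
    d1 = np.linalg.norm(test - c1, axis=1)
    d2 = np.linalg.norm(test - c2, axis=1)
    return np.mean(d2 < d1)

def cross_match(A1, A2, B1, B2):
    """The four trainings of steps S320-S335: one half of one audio is the
    first-class sample, the whole other audio is the second class, and the
    remaining half is scored against class 2."""
    A = np.vstack([A1, A2])
    B = np.vstack([B1, B2])
    a = fraction_in_class2(A1, B, A2)   # S320
    b = fraction_in_class2(B1, A, B2)   # S325
    c = fraction_in_class2(A2, B, A1)   # S330
    d = fraction_in_class2(B2, A, B1)   # S335
    return a, b, c, d
```

When the two audios come from the same speaker, the held-out half is ambiguous between the two classes and the four fractions cluster near 0.5; when the speakers differ, the held-out half is pulled back to its own speaker's class and the fractions fall toward 0.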
Finally, with continued reference to FIG. 3, in step S340, the degree to which the audio to be tested matches the selected sample audio is calculated from the computed a, b, c, and d, so as to determine whether the audio to be tested and the selected sample audio come from the same person's voice. For example, the average of a, b, c, and d may be calculated, and the ratio of the average to 0.5 may be determined as the degree to which the audio to be tested matches the sample audio. In this case, if the audio to be tested and the selected sample audio belong to the same person, the average should be close to 0.5; if they do not come from the same person, the average should be close to 0. Therefore, the ratio of the average to 0.5 can be regarded as the matching degree between the audio to be tested and the sample audio. From this matching degree, it can be confirmed whether the matching result and the test sample are the same person's voice, preventing misjudgment.
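The computation of step S340 reduces to a one-line normalization:

```python
def matching_degree(a, b, c, d):
    """Average the four second-class proportions and normalize by 0.5,
    the value expected when both audios come from the same speaker."""
    return ((a + b + c + d) / 4.0) / 0.5
```

For a same-speaker pair with a = b = c = d = 0.5 this yields 1.0; for a clearly different pair with all proportions near 0 it approaches 0.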
It should be noted that different matching thresholds may be set according to the requirements of different application environments to determine whether the audio to be tested and the sample audio come from the same person. For example, in a low-security scenario the threshold may be set to a lower value such as 70%: if the computed ratio is greater than or equal to 70%, the two recordings are considered to come from the same person; otherwise they are considered to come from different people. In a high-security scenario (for example, an access control system), the threshold may be set to a higher value such as 95%. Recognition accuracy can thus be adjusted to the needs of the application, which makes the system more convenient to use.
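The matching-degree computation and the threshold check described above amount to only a few lines. The function names below are illustrative, not taken from the disclosure:

```python
def match_degree(a, b, c, d):
    """Ratio of the mean of the four proportions to 0.5: near 1 when the
    two recordings come from the same speaker, near 0 otherwise."""
    return (a + b + c + d) / 4.0 / 0.5

def same_speaker(a, b, c, d, threshold=0.70):
    # The threshold is application-dependent: ~0.70 for low-security uses,
    # ~0.95 for high-security ones such as access control.
    return match_degree(a, b, c, d) >= threshold

print(match_degree(0.5, 0.5, 0.5, 0.5))      # 1.0 (ideal same-speaker case)
print(same_speaker(0.46, 0.52, 0.49, 0.51))  # True at the 70% threshold
print(same_speaker(0.03, 0.01, 0.00, 0.02))  # False
```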
Therefore, by splitting the audio to be matched and the sample audio and classifying the resulting segments in different combinations, the voiceprint recognition method and system proposed in the present disclosure achieve accurate, highly fault-tolerant, and efficient identification even under small-sample conditions.
According to an embodiment of the present disclosure, there is also provided a computer system comprising one or more processors and a memory storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the voiceprint recognition method described above.
According to an embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to implement the voiceprint recognition method described above.
FIG. 5 schematically shows a block diagram of a computer system suitable for implementing the voiceprint recognition method according to an embodiment of the present disclosure. The computer system shown in FIG. 5 is merely an example and should not impose any limitation on the function or scope of use of the embodiments of the present disclosure.
As shown in FIG. 5, a computer system 500 according to an embodiment of the present disclosure includes a processor 501, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage portion 508 into a random-access memory (RAM) 503. The processor 501 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction-set processor and/or a related chipset, and/or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)). The processor 501 may also include onboard memory for caching purposes, and may comprise a single processing unit or multiple processing units for performing the different actions of the method flows described with reference to FIGS. 2 and 3.
The RAM 503 stores the various programs and data required for the operation of the system 500. The processor 501, the ROM 502, and the RAM 503 are connected to one another through a bus 504. The processor 501 performs the various operations described above with reference to FIGS. 2 and 3 by executing programs in the ROM 502 and/or the RAM 503. Note that the programs may also be stored in one or more memories other than the ROM 502 and the RAM 503, in which case the processor 501 performs those operations by executing the programs stored in the one or more memories.
According to an embodiment of the present disclosure, the system 500 may further include an input/output (I/O) interface 505, which is also connected to the bus 504. The system 500 may also include one or more of the following components connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 508 including a hard disk and the like; and a communication portion 509 including a network interface card such as a LAN card or a modem. The communication portion 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read from it can be installed into the storage portion 508 as needed.
According to an embodiment of the present disclosure, the methods described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program carried on a computer-readable storage medium, the computer program containing program code for executing the methods illustrated in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 509 and/or installed from the removable medium 511. When the computer program is executed by the processor 501, the functions defined in the system of the embodiments of the present disclosure are performed. According to embodiments of the present disclosure, the systems, devices, apparatuses, modules, units, and the like described above may be implemented as computer program modules.
It should be noted that the computer-readable storage medium shown in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction-execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction-execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted by any suitable medium, including but not limited to wireless, wireline, optical cable, RF, and the like, or any suitable combination of the foregoing.
The flowcharts and block diagrams in the drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur in an order different from that shown in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
It should be noted that the above scheme is only one specific implementation of the concept of the present disclosure, and the present disclosure is not limited to it. Some of the processing in the above implementation may be omitted or skipped without departing from the spirit and scope of the present disclosure.
The foregoing methods may be implemented in the form of program commands executable by various computer devices and recorded on a computer-readable recording medium. In this case, the computer-readable recording medium may include program commands, data files, and data structures, separately or in combination. The program commands recorded on the recording medium may be specially designed or configured for the present disclosure, or may be known to those skilled in the art of computer software. Computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as compact disc read-only memories (CD-ROMs) and digital versatile discs (DVDs); magneto-optical media such as magneto-optical floppy disks; and hardware devices such as ROM, RAM, and flash memory that store and execute program commands. In addition, program commands include machine-language code produced by a compiler as well as high-level-language code that a computer can execute using an interpreter. The foregoing hardware devices may be configured to operate as at least one software module to perform the operations of the present disclosure, and vice versa.
Although the operations of the methods herein are shown and described in a particular order, the order of the operations of each method may be changed so that particular operations are performed in the reverse order, or at least partially concurrently with other operations. Furthermore, the present disclosure is not limited to the example embodiments described above; one or more other components or operations may be included, or one or more components or operations may be omitted, without departing from the spirit and scope of the present disclosure.
Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure may be combined in multiple ways, even if such combinations are not explicitly recited in the present disclosure. In particular, various combinations of the features recited in the embodiments and/or claims of the present disclosure can be made without departing from its spirit and teachings, and all such combinations fall within the scope of the present disclosure.
The present disclosure has been described above in connection with its preferred embodiments, but those skilled in the art will understand that various modifications, substitutions, and changes may be made without departing from the spirit and scope of the present disclosure. Therefore, the present disclosure should not be limited by the embodiments described above, but should be defined by the appended claims and their equivalents.

Claims (12)

  1. A voiceprint recognition method, comprising:
    receiving audio to be tested and dividing the audio to be tested into a first part and a second part;
    selecting a sample audio from a sample database and dividing the selected sample audio into a first part and a second part;
    extracting feature matrices for the audio to be tested and the selected sample audio by using a Mel-frequency cepstral coefficient extraction method;
    performing support vector machine training by using the feature matrix of the first part of the audio to be tested as a first type of sample and the feature matrix of the selected sample audio as a second type of sample, and calculating a proportion a of the second part of the audio to be tested that belongs to the second type of sample;
    performing support vector machine training by using the feature matrix of the first part of the selected sample audio as the first type of sample and the feature matrix of the audio to be tested as the second type of sample, and calculating a proportion b of the second part of the selected sample audio that belongs to the second type of sample;
    performing support vector machine training by using the feature matrix of the second part of the audio to be tested as the first type of sample and the feature matrix of the selected sample audio as the second type of sample, and calculating a proportion c of the first part of the audio to be tested that belongs to the second type of sample;
    performing support vector machine training by using the feature matrix of the second part of the selected sample audio as the first type of sample and the feature matrix of the audio to be tested as the second type of sample, and calculating a proportion d of the first part of the selected sample audio that belongs to the second type of sample; and
    calculating, from the calculated a, b, c, and d, the degree of matching between the audio to be tested and the selected sample audio, in order to determine whether the audio to be tested and the selected sample audio come from the same person's voice.
  2. The method according to claim 1, further comprising pre-processing the received audio to be tested, wherein the pre-processing comprises at least one of the following operations:
    pre-emphasizing the audio to be tested;
    framing the audio to be tested by using an overlapping-segment framing method;
    applying a Hamming window to eliminate the Gibbs effect; and
    distinguishing speech frames from non-speech frames and discarding the non-speech frames.
  3. The method according to claim 1, wherein dividing the audio to be tested into a first part and a second part comprises dividing the audio to be tested into two parts of equal length.
  4. The method according to claim 1, wherein dividing the selected sample audio into a first part and a second part comprises dividing the selected sample audio into two parts of equal length.
  5. The method according to claim 1, wherein calculating the degree of matching between the audio to be tested and the sample audio comprises:
    calculating the average of a, b, c, and d; and
    determining the ratio of the average to 0.5 as the degree of matching between the audio to be tested and the sample audio.
  6. A voiceprint recognition system, comprising:
    a receiver configured to receive audio to be tested;
    a sample database configured to store one or more sample audios;
    a support vector machine configured to classify test data according to classified samples; and
    a controller configured to:
    divide the audio to be tested from the receiver into a first part and a second part, select a sample audio from the sample database, and divide the selected sample audio into a first part and a second part;
    extract feature matrices for the audio to be tested and the selected sample audio by using a Mel-frequency cepstral coefficient extraction method;
    calculate a proportion a of the second part of the audio to be tested that belongs to a second type of sample, by inputting to the support vector machine the feature matrix of the first part of the audio to be tested as a first type of sample and the feature matrix of the selected sample audio as the second type of sample, and training the support vector machine;
    calculate a proportion b of the second part of the selected sample audio that belongs to the second type of sample, by inputting to the support vector machine the feature matrix of the first part of the selected sample audio as the first type of sample and the feature matrix of the audio to be tested as the second type of sample, and training the support vector machine;
    calculate a proportion c of the first part of the audio to be tested that belongs to the second type of sample, by inputting to the support vector machine the feature matrix of the second part of the audio to be tested as the first type of sample and the feature matrix of the selected sample audio as the second type of sample, and training the support vector machine;
    calculate a proportion d of the first part of the selected sample audio that belongs to the second type of sample, by inputting to the support vector machine the feature matrix of the second part of the selected sample audio as the first type of sample and the feature matrix of the audio to be tested as the second type of sample, and training the support vector machine; and
    calculate, from the calculated a, b, c, and d, the degree of matching between the audio to be tested and the sample audio, in order to determine whether the audio to be tested and the sample audio come from the same person's voice.
  7. The system according to claim 6, wherein the controller is further configured to pre-process the received audio to be tested, the pre-processing comprising at least one of the following operations:
    pre-emphasizing the audio to be tested;
    framing the audio to be tested by using an overlapping-segment framing method;
    applying a Hamming window to eliminate the Gibbs effect; and
    distinguishing speech frames from non-speech frames and discarding the non-speech frames.
  8. The system according to claim 6, wherein the controller is further configured to divide the audio to be tested into two parts of equal length.
  9. The system according to claim 6, wherein the controller is further configured to divide the selected sample audio into two parts of equal length.
  10. The system according to claim 6, wherein the controller is further configured to:
    calculate the average of a, b, c, and d; and
    determine the ratio of the average to 0.5 as the degree of matching between the audio to be tested and the sample audio.
  11. A computer system, comprising:
    one or more processors; and
    a memory for storing one or more programs,
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the voiceprint recognition method according to any one of claims 1 to 5.
  12. A computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to implement the voiceprint recognition method according to any one of claims 1 to 5.
PCT/CN2017/106886 2016-11-22 2017-10-19 Voiceprint identification method and voiceprint identification system WO2018095167A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611035943.3A CN108091340B (en) 2016-11-22 2016-11-22 Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium
CN201611035943.3 2016-11-22

Publications (1)

Publication Number Publication Date
WO2018095167A1 true WO2018095167A1 (en) 2018-05-31

Family

ID=62168704

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/106886 WO2018095167A1 (en) 2016-11-22 2017-10-19 Voiceprint identification method and voiceprint identification system

Country Status (2)

Country Link
CN (1) CN108091340B (en)
WO (1) WO2018095167A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109031961A (en) * 2018-06-29 2018-12-18 百度在线网络技术(北京)有限公司 Method and apparatus for controlling operation object
CN111489756A (en) * 2020-03-31 2020-08-04 中国工商银行股份有限公司 Voiceprint recognition method and device
CN115100776A (en) * 2022-05-30 2022-09-23 厦门快商通科技股份有限公司 Access control authentication method, system and storage medium based on voice recognition

Families Citing this family (2)

Publication number Priority date Publication date Assignee Title
CN108908377B (en) * 2018-07-06 2020-06-23 达闼科技(北京)有限公司 Speaker recognition method and device and robot
CN110889008B (en) * 2018-09-10 2021-11-09 珠海格力电器股份有限公司 Music recommendation method and device, computing device and storage medium

Citations (6)

Publication number Priority date Publication date Assignee Title
US20070239457A1 (en) * 2006-04-10 2007-10-11 Nokia Corporation Method, apparatus, mobile terminal and computer program product for utilizing speaker recognition in content management
CN102664011A (en) * 2012-05-17 2012-09-12 吉林大学 Method for quickly recognizing speaker
CN102737633A (en) * 2012-06-21 2012-10-17 北京华信恒达软件技术有限公司 Method and device for recognizing speaker based on tensor subspace analysis
CN102820033A (en) * 2012-08-17 2012-12-12 南京大学 Voiceprint identification method
CN103562993A (en) * 2011-12-16 2014-02-05 华为技术有限公司 Speaker recognition method and device
CN104464756A (en) * 2014-12-10 2015-03-25 黑龙江真美广播通讯器材有限公司 Small speaker emotion recognition system

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
JP2001318692A (en) * 2000-05-11 2001-11-16 Yasutaka Sakamoto Individual identification system by speech recognition
CN101562012B (en) * 2008-04-16 2011-07-20 创而新(中国)科技有限公司 Method and system for graded measurement of voice
EP3123468A1 (en) * 2014-03-28 2017-02-01 Intel IP Corporation Training classifiers using selected cohort sample subsets
CN104485102A (en) * 2014-12-23 2015-04-01 智慧眼(湖南)科技发展有限公司 Voiceprint recognition method and device
CN105244026B (en) * 2015-08-24 2019-09-20 北京意匠文枢科技有限公司 A kind of method of speech processing and device
CN105244031A (en) * 2015-10-26 2016-01-13 北京锐安科技有限公司 Speaker identification method and device


Cited By (4)

Publication number Priority date Publication date Assignee Title
CN109031961A (en) * 2018-06-29 2018-12-18 百度在线网络技术(北京)有限公司 Method and apparatus for controlling operation object
CN111489756A (en) * 2020-03-31 2020-08-04 中国工商银行股份有限公司 Voiceprint recognition method and device
CN115100776A (en) * 2022-05-30 2022-09-23 厦门快商通科技股份有限公司 Access control authentication method, system and storage medium based on voice recognition
CN115100776B (en) * 2022-05-30 2023-12-26 厦门快商通科技股份有限公司 Entrance guard authentication method, system and storage medium based on voice recognition

Also Published As

Publication number Publication date
CN108091340B (en) 2020-11-03
CN108091340A (en) 2018-05-29

Similar Documents

Publication Publication Date Title
Boles et al. Voice biometrics: Deep learning-based voiceprint authentication system
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
WO2018095167A1 (en) Voiceprint identification method and voiceprint identification system
US9536547B2 (en) Speaker change detection device and speaker change detection method
Ahmad et al. A unique approach in text independent speaker recognition using MFCC feature sets and probabilistic neural network
CN108922541B (en) Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models
US20170154640A1 (en) Method and electronic device for voice recognition based on dynamic voice model selection
WO2020034628A1 (en) Accent identification method and device, computer device, and storage medium
JP2019211749A (en) Method and apparatus for detecting starting point and finishing point of speech, computer facility, and program
Vyas A Gaussian mixture model based speech recognition system using Matlab
CN108335699A (en) A kind of method for recognizing sound-groove based on dynamic time warping and voice activity detection
Archana et al. Gender identification and performance analysis of speech signals
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
CN109215634A (en) A kind of method and its system of more word voice control on-off systems
Pao et al. Combining acoustic features for improved emotion recognition in mandarin speech
Ramgire et al. A survey on speaker recognition with various feature extraction and classification techniques
GB2576960A (en) Speaker recognition
CN109065026A (en) A kind of recording control method and device
Krishna et al. Emotion recognition using dynamic time warping technique for isolated words
CN111429919A (en) Anti-sound crosstalk method based on conference recording system, electronic device and storage medium
Budiga et al. CNN trained speaker recognition system in electric vehicles
Komlen et al. Text independent speaker recognition using LBG vector quantization
Tahliramani et al. Performance analysis of speaker identification system with and without spoofing attack of voice conversion
Estrebou et al. Voice recognition based on probabilistic SOM
Bonifaco et al. Comparative analysis of filipino-based rhinolalia aperta speech using mel frequency cepstral analysis and Perceptual Linear Prediction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17874980

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 04.09.2019)

122 Ep: pct application non-entry in european phase

Ref document number: 17874980

Country of ref document: EP

Kind code of ref document: A1