WO2019136909A1 - Voice living-body detection method based on deep learning, server and storage medium - Google Patents

Voice living-body detection method based on deep learning, server and storage medium

Info

Publication number
WO2019136909A1
WO2019136909A1, PCT/CN2018/089203, CN2018089203W
Authority
WO
WIPO (PCT)
Prior art keywords
speech
neural network
network model
dimensional
frames
Prior art date
Application number
PCT/CN2018/089203
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
郑斯奇
于夕畔
肖京
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date: 2018-01-12 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2019136909A1 publication Critical patent/WO2019136909A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination

Definitions

  • the present application relates to the field of computer technologies, and in particular, to a voice living-body detection method based on deep learning, a server, and a storage medium.
  • the voice of a non-real person is generally referred to as a forged recording, including music input, recording replay, voice generated by technical means such as speech synthesis, and the like.
  • Forged recordings are often used in the financial and security fields.
  • attackers use forged recordings to break into voiceprint recognition and log into the victim's account, so as to steal money or damage the reputation and property of others.
  • the present application provides a voice living-body detection method based on deep learning, a server, and a storage medium, so that before voice is used in a corresponding application, it can be quickly detected whether the voice was output directly by the user or maliciously forged by another person; this provides a higher level of security for voice control and promotes the development of voice recognition technology.
  • the present application provides a server, where the server includes a memory and a processor, and the memory stores a deep learning-based voice living-body detection program executable on the processor; when the deep learning-based voice living-body detection program is executed by the processor, the following steps are performed: training a deep neural network model to obtain an optimal deep neural network model; acquiring the speech to be detected and framing it to obtain a 1000*20-dimensional matrix; inputting the 1000*20-dimensional matrix into the optimal deep neural network model; calculating on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector, where the 1*4-dimensional output vector represents four speech categories; and selecting the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
  • the present application further provides a voice living-body detection method based on deep learning, applied to a server, the method including the following steps: training a deep neural network model to obtain an optimal deep neural network model; acquiring the speech to be detected and framing it to obtain a 1000*20-dimensional matrix; inputting the 1000*20-dimensional matrix into the optimal deep neural network model; calculating on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector, where the 1*4-dimensional output vector represents four speech categories; and selecting the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
  • the present application further provides a storage medium storing a deep learning-based voice living-body detection program, where the deep learning-based voice living-body detection program can be executed by at least one processor to cause the at least one processor to perform the steps of the deep learning-based voice living-body detection method described above.
  • the deep learning-based voice living-body detection method, server, and storage medium proposed by the present application first train the deep neural network model to obtain an optimal deep neural network model; second, acquire the speech to be detected and frame it to obtain a 1000*20-dimensional matrix; third, input the 1000*20-dimensional matrix into the optimal deep neural network model; then calculate on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector that represents the four speech categories; and finally select the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
  • FIG. 1 is a schematic diagram of an optional hardware architecture of the server of the present application.
  • FIG. 2 is a program module diagram of a first embodiment of the deep learning-based voice living-body detection program of the present application.
  • FIG. 3 is a flowchart of a first embodiment of the deep learning-based voice living-body detection method of the present application.
  • referring to FIG. 1, it is a schematic diagram of an optional hardware architecture of the server 1.
  • the server 1 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server.
  • the server 1 may be a standalone server or a server cluster composed of multiple servers.
  • the server 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicably connected to one another through a system bus.
  • the server 1 connects to the network through the network interface 13 to obtain information.
  • the network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or a telephone network.
  • Figure 1 only shows the server 1 with the components 11-13, but it should be understood that not all illustrated components are required to be implemented, and more or fewer components may be implemented instead.
  • the memory 11 includes at least one type of storage medium, including flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like.
  • the memory 11 may be an internal storage unit of the server 1, such as a hard disk or memory of the server 1.
  • the memory 11 may also be an external storage device of the server 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card (Flash Card) equipped on the server 1.
  • the memory 11 can also include both the internal storage unit of the server 1 and its external storage device.
  • the memory 11 is generally used to store an operating system installed in the server 1 and various types of application software, such as program code of the deep learning-based voice living body detection program 200. Further, the memory 11 can also be used to temporarily store various types of data that have been output or are to be output.
  • the processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments.
  • the processor 12 is typically used to control the overall operation of the server 1, such as performing data interaction or communication related control and processing, and the like.
  • the processor 12 is configured to run program code or process data stored in the memory 11, such as running the deep learning-based voice biometric detection program 200 and the like.
  • the network interface 13 may comprise a wireless network interface or a wired network interface, which is typically used to establish a communication connection between the server 1 and other electronic devices.
  • a deep learning-based voice biometric detection program 200 is installed and run in the server 1.
  • when the deep learning-based voice living-body detection program 200 runs, the server 1 trains the deep neural network model to obtain an optimal deep neural network model; acquires the speech to be detected and frames it to obtain a 1000*20-dimensional matrix; inputs the 1000*20-dimensional matrix into the optimal deep neural network model; calculates on the matrix with the model to obtain a 1*4-dimensional output vector representing four speech categories; and selects the class with the largest value in the output vector as the category of the speech to be detected.
  • the present application proposes a deep learning-based voice living-body detection program 200.
  • referring to FIG. 2, it is a program module diagram of the first embodiment of the deep learning-based voice living-body detection program 200 of the present application.
  • the server 1 includes a series of computer program instructions stored in the memory 11, namely the deep learning-based voice living-body detection program 200; when these computer program instructions are executed by the processor 12, the deep learning-based voice living-body detection operations of the embodiments of the present application can be implemented.
  • the deep learning-based voice living-body detection program 200 can be divided into one or more modules based on the particular operations implemented by the various portions of the computer program instructions. For example, in FIG. 2, the deep learning-based voice living-body detection program 200 is divided into a training module 201, a voice processing module 202, a matrix input module 203, a matrix calculation module 204, and a determination module 205, wherein:
  • the training module 201 is configured to train the deep neural network model to obtain an optimal deep neural network model.
  • the training module 201 is specifically configured to frame the training speech, taking every 1000 frames as one sample; to label each sample with its class; and to use the labeled samples as training samples for the deep neural network model.
  • the purpose of truncating every 1000 frames is to give the model a fixed-length input: recordings of different lengths produce different MFCC (Mel-Frequency Cepstral Coefficients) feature distributions, and if the input features are not fixed, the model's recognition easily becomes inaccurate. Recordings shorter than 1000 frames but longer than 100 frames are padded with all-zero frames; recordings shorter than 100 frames are discarded as containing no speech.
  • the advantage of making every 1000 frames of all recordings a training sample is that the model can learn the sound characteristics of each type of speech over every time period, which is more robust than training on a single 1000-frame segment of a single recording.
  • in the training phase, each input recording is tagged with a label, such as the genuine class [0000], the first type of forgery [0100], the second type of forgery [0010], and the third type of forgery [0001].
  • the genuine class is, as the name implies, real speech; forged speech is divided into three types: the first type is music forgery, the second type is recording-replay forgery, and the third type is technical voice forgery.
  • the first type of forgery refers to music fed into the voiceprint recognition input: because music contains rich sound components, it can pass voice registration and verification normally, but it carries no information about the speaker's voice and so is not the target recording of voiceprint recognition.
  • the second type of forgery is mainly a simple replay of recordings, such as recording a target person's speech or music with a voice recorder, a mobile phone, or similar devices and then replaying it directly into the input of voiceprint recognition.
  • the third type of forgery mainly refers to forging the target person's speech with speech synthesis or voice conversion techniques.
  • speech synthesis generally collects a certain amount of the target person's voice data and can then synthesize speech of the target person for specified text content; voice conversion directly alters the spectrum of an original recording. Because this type of forgery involves a great deal of speech signal processing technology, it is called technical forgery.
  • model training is carried out with the open-source Keras framework. Considering hardware limitations, DNN training uses minibatch technology with a batch size of 128; each iteration trains 1000 batches, and N iterations are trained in total. Each batch randomly selects 128 speech MFCC feature samples from the full data set, produces the model output, and then updates the model parameters by backward feedback according to the loss function, completing one batch computation; 1000 such batches complete one iteration and yield that iteration's model output.
  • in general, the model with the best loss over 50 iterations is selected: the convolution kernel of the first layer is 9*20, Nfilters is set to 512, the loss function is set to the categorical cross-entropy (categorical_crossentropy) over all classes, and the optimizer is Adagrad.
  • the voice processing module 202 is configured to acquire a voice to be detected and frame the to-be-detected voice to obtain a 1000*20-dimensional matrix.
  • the voice processing module 202 is specifically configured to: after framing the speech to be detected, extract 1000 frames and compute a 20-dimensional MFCC feature for each; and generate the 1000*20-dimensional matrix from the 20-dimensional MFCCs of the 1000 frames.
  • the framing of the speech to be detected is the same as the processing of the training speech described above.
  • the computation of MFCC features is a conventional algorithm, and the present application does not repeat it here.
  • the matrix input module 203 is configured to input the 1000*20-dimensional matrix into the optimal deep neural network model.
  • the input layer of the obtained optimal deep neural network model (DNN) takes a matrix, so the 1000*20-dimensional matrix obtained by the voice processing module 202 can be input directly into the model.
  • the matrix calculation module 204 is configured to calculate on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector, where the 1*4-dimensional output vector represents the four speech categories.
  • the matrix calculation module 204 convolves the input features with a 1000*20 convolution kernel in the first layer of the DNN model.
  • the purpose of this layer is to project the features of adjacent frames; the number of convolution kernels is controlled by Nfilters, and each kernel yields one channel feature after convolution, giving N channel features.
  • in the second to fourth layers, 1*1 convolution kernels are used with the LeakyReLU activation function; these 1*1 kernels allow the channels to connect and interact, so the model learns more intra-frame and inter-frame features. The fifth layer performs pooling, extracting the maximum over each 2*2 kernel range (2*2 MaxPooling) with a default stride of 1*1; this layer subsamples the upper-layer nodes, reducing the model parameters and making overfitting less likely. The sixth layer flattens the output nodes of the previous layer into a 1*P-dimensional feature. The seventh layer, a linear layer, reduces the dimension of the sixth layer's output to obtain Out7, which is passed through a softmax activation to produce the 1*4 output vector, i.e., four values, as the detection result.
  • the determining module 205 is configured to select a class with the largest value among the 1*4-dimensional output vectors as the category of the voice to be detected.
  • the 1*4 dimensional output vector is a value in the range of 0 to 1.
  • the 1*4-dimensional output vector expresses, through four decimals in the 0-1 range, the probability of belonging to the corresponding class, namely the probabilities of genuine speech and of the first, second, and third types of forgery.
  • the class with the largest of the four probabilities represents the category of the input speech; that is, the output values directly and effectively indicate whether the speech to be detected is live, i.e., genuine, speech.
  • the server proposed by the present application trains the deep neural network model to obtain an optimal deep neural network model; acquires the speech to be detected and frames it to obtain a 1000*20-dimensional matrix; inputs the 1000*20-dimensional matrix into the optimal deep neural network model; calculates on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector that represents the four speech categories; and selects the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
  • the present application also proposes a voice living body detection method based on deep learning.
  • referring to FIG. 3, it is a schematic flowchart of a first embodiment of the deep learning-based voice living-body detection method of the present application.
  • the order of execution of the steps in the flowchart shown in FIG. 3 may be changed according to different requirements, and some steps may be omitted.
  • Step S301: train the deep neural network model to obtain an optimal deep neural network model.
  • the foregoing step specifically includes framing the training speech, taking every 1000 frames as one sample; labeling each sample with its class; and using the labeled samples as training samples for the deep neural network model.
  • the purpose of truncating every 1000 frames is to give the model a fixed-length input: recordings of different lengths produce different MFCC (Mel-Frequency Cepstral Coefficients) feature distributions, and if the input features are not fixed, the model's recognition easily becomes inaccurate. Recordings shorter than 1000 frames but longer than 100 frames are padded with all-zero frames; recordings shorter than 100 frames are discarded as containing no speech.
  • the advantage of making every 1000 frames of all recordings a training sample is that the model can learn the sound characteristics of each type of speech over every time period, which is more robust than training on a single 1000-frame segment of a single recording.
  • in the training phase, each input recording is tagged with a label, such as the genuine class [0000], the first type of forgery [0100], the second type of forgery [0010], and the third type of forgery [0001].
  • the genuine class is, as the name implies, real speech; forged speech is divided into three types: the first type is music forgery, the second type is recording-replay forgery, and the third type is technical voice forgery.
  • the first type of forgery refers to music fed into the voiceprint recognition input: because music contains rich sound components, it can pass voice registration and verification normally, but it carries no information about the speaker's voice and so is not the target recording of voiceprint recognition.
  • the second type of forgery is mainly a simple replay of recordings, such as recording a target person's speech or music with a voice recorder, a mobile phone, or similar devices and then replaying it directly into the input of voiceprint recognition.
  • the third type of forgery mainly refers to forging the target person's speech with speech synthesis or voice conversion techniques.
  • speech synthesis generally collects a certain amount of the target person's voice data and can then synthesize speech of the target person for specified text content; voice conversion directly alters the spectrum of an original recording. Because this type of forgery involves a great deal of speech signal processing technology, it is called technical forgery.
  • model training is carried out with the open-source Keras framework. Considering hardware limitations, DNN training uses minibatch technology with a batch size of 128; each iteration trains 1000 batches, and N iterations are trained in total. Each batch randomly selects 128 speech MFCC feature samples from the full data set, produces the model output, and then updates the model parameters by backward feedback according to the loss function, completing one batch computation; 1000 such batches complete one iteration and yield that iteration's model output.
  • in general, the model with the best loss over 50 iterations is selected: the convolution kernel of the first layer is 9*20, Nfilters is set to 512, the loss function is set to the categorical cross-entropy (categorical_crossentropy) over all classes, and the optimizer is Adagrad.
  • Step S302: acquire the speech to be detected and frame it to obtain a 1000*20-dimensional matrix.
  • the voice processing module 202 is specifically configured to: after framing the speech to be detected, extract 1000 frames and compute a 20-dimensional MFCC feature for each; and generate the 1000*20-dimensional matrix from the 20-dimensional MFCCs of the 1000 frames.
  • the framing of the speech to be detected is the same as the processing of the training speech described above.
  • the computation of MFCC features is a conventional algorithm, and the present application does not repeat it here.
  • Step S303: input the 1000*20-dimensional matrix into the optimal deep neural network model.
  • the input layer of the obtained optimal deep neural network model (DNN) takes a matrix, so the 1000*20-dimensional matrix obtained by the voice processing module 202 can be input directly into the model.
  • Step S304: calculate on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector, where the 1*4-dimensional output vector represents the four speech categories.
  • the matrix calculation module 204 convolves the input features with a 1000*20 convolution kernel in the first layer of the DNN model.
  • the purpose of this layer is to project the features of adjacent frames; the number of convolution kernels is controlled by Nfilters, and each kernel yields one channel feature after convolution, giving N channel features.
  • in the second to fourth layers, 1*1 convolution kernels are used with the LeakyReLU activation function; these 1*1 kernels allow the channels to connect and interact, so the model learns more intra-frame and inter-frame features. The fifth layer performs pooling, extracting the maximum over each 2*2 kernel range (2*2 MaxPooling) with a default stride of 1*1; this layer subsamples the upper-layer nodes, reducing the model parameters and making overfitting less likely. The sixth layer flattens the output nodes of the previous layer into a 1*P-dimensional feature. The seventh layer, a linear layer, reduces the dimension of the sixth layer's output to obtain Out7, which is passed through a softmax activation to produce the 1*4 output vector, i.e., four values, as the detection result.
  • Step S305: select the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
  • the 1*4 dimensional output vector is a value in the range of 0 to 1.
  • the 1*4-dimensional output vector expresses, through four decimals in the 0-1 range, the probability of belonging to the corresponding class, namely the probabilities of genuine speech and of the first, second, and third types of forgery.
  • the class with the largest of the four probabilities represents the category of the input speech; that is, the output values directly and effectively indicate whether the speech to be detected is live, i.e., genuine, speech.
  • the deep learning-based voice living-body detection method proposed by the present application first trains the deep neural network model to obtain an optimal deep neural network model; second, acquires the speech to be detected and frames it to obtain a 1000*20-dimensional matrix; third, inputs the 1000*20-dimensional matrix into the optimal deep neural network model; then calculates on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector that represents the four speech categories; and finally selects the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
  • the present application further provides another embodiment, namely a storage medium storing a deep learning-based voice living-body detection program, where the deep learning-based voice living-body detection program can be executed by at least one processor to cause the at least one processor to perform the steps of the deep learning-based voice living-body detection method described above.
  • the methods of the foregoing embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, or the part of it that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) that includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.

Abstract

Disclosed is a voice living-body detection method based on deep learning, applied to a server. The method comprises: training a deep neural network model to obtain an optimal deep neural network model; acquiring voice to be detected and framing the voice to be detected to obtain a 1000*20-dimensional matrix; inputting the 1000*20-dimensional matrix into the optimal deep neural network model; calculating the 1000*20-dimensional matrix by using the optimal deep neural network model to obtain a 1*4-dimensional output vector, wherein the 1*4-dimensional output vector represents four voice categories; and selecting the category with the maximum value from the 1*4-dimensional output vector as the category of the voice to be detected. Further provided are a server and a storage medium. By implementing the above-mentioned solution, a higher level of security can be provided for voice control, and the development of voice recognition technology is promoted.

Description

Voice living-body detection method based on deep learning, server, and storage medium
Priority claim
This application claims priority to Chinese Patent Application No. 201810029892.6, filed with the Chinese Patent Office on January 12, 2018 and entitled "Voice Living-Body Detection Method Based on Deep Learning, Server and Storage Medium", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of computer technologies, and in particular, to a voice living-body detection method based on deep learning, a server, and a storage medium.
Background
With the continuous development of speech recognition technology, speech recognition applications are increasing, including voice control, voice payment, and so on. At present, however, speech recognition generally only recognizes semantics and cannot reliably distinguish whether speech was produced by a person or is some other recorded input. With Apple's Siri, for example, whether it is the owner speaking or a recording, once "hi, siri" is input, the terminal device wakes up; the source of the voice cannot be distinguished. Liveness detection for speech is therefore particularly important. Voice living-body detection determines whether the input is a real person speaking; speech not from a real person is generally called a forged recording, and includes music input, recording replay, and speech generated by technical means such as speech synthesis. Forged recordings are often used in the financial and security fields: attackers break into voiceprint recognition with forged recordings and log into victims' accounts to steal money or damage others' reputation and property.
Summary of the invention
In view of this, the present application provides a voice living-body detection method based on deep learning, a server, and a storage medium, so that before voice is used in a corresponding application, it can be quickly detected whether the voice was output directly by the user or maliciously forged by another person; this provides a higher level of security for voice control and promotes the development of voice recognition technology.
First, to achieve the above object, the present application provides a server, where the server includes a memory and a processor, and the memory stores a deep learning-based voice living-body detection program executable on the processor. When the deep learning-based voice living-body detection program is executed by the processor, the following steps are performed: training a deep neural network model to obtain an optimal deep neural network model; acquiring the speech to be detected and framing it to obtain a 1000*20-dimensional matrix; inputting the 1000*20-dimensional matrix into the optimal deep neural network model; calculating on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector, where the 1*4-dimensional output vector represents four speech categories; and selecting the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
In addition, to achieve the above object, the present application further provides a voice living-body detection method based on deep learning, applied to a server, the method including the following steps: training a deep neural network model to obtain an optimal deep neural network model; acquiring the speech to be detected and framing it to obtain a 1000*20-dimensional matrix; inputting the 1000*20-dimensional matrix into the optimal deep neural network model; calculating on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector, where the 1*4-dimensional output vector represents four speech categories; and selecting the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
Further, to achieve the above object, the present application also provides a storage medium storing a deep learning-based voice living-body detection program, where the deep learning-based voice living-body detection program can be executed by at least one processor to cause the at least one processor to perform the steps of the deep learning-based voice living-body detection method described above.
Compared with the prior art, the deep learning-based voice living-body detection method, server, and storage medium proposed by the present application first train the deep neural network model to obtain an optimal deep neural network model; second, acquire the speech to be detected and frame it to obtain a 1000*20-dimensional matrix; third, input the 1000*20-dimensional matrix into the optimal deep neural network model; then calculate on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector that represents the four speech categories; and finally select the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected. In this way, before voice is used in a corresponding application, it can be quickly detected whether the voice was output directly by the user or maliciously forged by another person; this provides a higher level of security for voice control and promotes the development of voice recognition technology.
Brief description of the drawings
FIG. 1 is a schematic diagram of an optional hardware architecture of the server of the present application;
FIG. 2 is a program module diagram of a first embodiment of the deep learning-based voice living-body detection program of the present application;
FIG. 3 is a flowchart of a first embodiment of the deep learning-based voice living-body detection method of the present application. Reference numerals: (provided as images in the original publication)
The implementation, functional features, and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description
In order to make the objects, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present application and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
It should be noted that descriptions involving "first", "second", and the like in the present application are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with one another, but only on the basis that a person of ordinary skill in the art can realize the combination; when a combination of technical solutions is contradictory or impossible to implement, the combination should be considered nonexistent and outside the protection scope claimed by the present application.
Referring to FIG. 1, it is a schematic diagram of an optional hardware architecture of the server 1.
The server 1 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server, and may be a standalone server or a server cluster composed of multiple servers.
In this embodiment, the server 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicably connected to one another through a system bus.
The server 1 connects to a network through the network interface 13 to obtain information. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or a telephone network.
It should be noted that FIG. 1 only shows the server 1 with components 11-13, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
The memory 11 includes at least one type of storage medium, including flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 11 may be an internal storage unit of the server 1, such as a hard disk or memory of the server 1. In other embodiments, the memory 11 may also be an external storage device of the server 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card (Flash Card) equipped on the server 1. Of course, the memory 11 may also include both the internal storage unit of the server 1 and its external storage device. In this embodiment, the memory 11 is generally used to store the operating system installed on the server 1 and various types of application software, such as the program code of the deep learning-based voice living-body detection program 200. In addition, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 12 may in some embodiments be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is generally used to control the overall operation of the server 1, such as performing control and processing related to data interaction or communication. In this embodiment, the processor 12 is configured to run the program code or process the data stored in the memory 11, for example to run the deep learning-based voice living-body detection program 200.
The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the server 1 and other electronic devices.
In this embodiment, the deep learning-based voice living-body detection program 200 is installed and runs on the server 1. When the deep learning-based voice living-body detection program 200 runs, the server 1 trains the deep neural network model to obtain an optimal deep neural network model; acquires the speech to be detected and frames it to obtain a 1000*20-dimensional matrix; inputs the 1000*20-dimensional matrix into the optimal deep neural network model; calculates on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector that represents the four speech categories; and selects the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected. In this way, before voice is used in a corresponding application, it can be quickly detected whether the voice was output directly by the user or maliciously forged by another person; this provides a higher level of security for voice control and promotes the development of voice recognition technology.
So far, the application environment of the embodiments of the present application and the hardware structure and functions of the related devices have been described in detail. Below, various embodiments of the present application are proposed based on the above application environment and related devices.
First, the present application proposes a deep learning-based voice living-body detection program 200.
Referring to FIG. 2, it is a program module diagram of the first embodiment of the deep learning-based voice living-body detection program 200 of the present application.
In this embodiment, the server 1 includes a series of computer program instructions stored in the memory 11, namely the deep learning-based voice living-body detection program 200; when these computer program instructions are executed by the processor 12, the deep learning-based voice living-body detection operations of the embodiments of the present application can be implemented. In some embodiments, the deep learning-based voice living-body detection program 200 can be divided into one or more modules based on the particular operations implemented by the various portions of the computer program instructions. For example, in FIG. 2, the deep learning-based voice living-body detection program 200 is divided into a training module 201, a voice processing module 202, a matrix input module 203, a matrix calculation module 204, and a determination module 205, wherein:
The training module 201 is configured to train the deep neural network model to obtain an optimal deep neural network model.
Specifically, the training module 201 is configured to frame the training speech, taking every 1000 frames as one sample; to label each sample with its class; and to use the labeled samples as training samples for the deep neural network model.
In this embodiment, the purpose of truncating every 1000 frames is to give the model a fixed-length input: recordings of different lengths produce different MFCC (Mel-Frequency Cepstral Coefficients) feature distributions, and if the input features are not fixed, the model's recognition easily becomes inaccurate. For recordings shorter than 1000 frames but longer than 100 frames, all-zero frames are appended; recordings shorter than 100 frames are discarded directly, on the assumption that no one is speaking in them. The advantage of making every 1000 frames of all recordings a training sample is that the model can learn the sound characteristics of each type of speech over every time period, which is more robust than training on a single 1000-frame segment of a single recording.
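As a concrete illustration of this sample-construction rule, the following is a minimal NumPy sketch. It assumes the per-frame 20-dimensional MFCC features have already been computed as a (T, 20) array; the function name make_samples and the extension of the padding rule to the leftover tail of long recordings are assumptions, since the text only states the rule per recording.

```python
import numpy as np

def make_samples(frames: np.ndarray) -> list:
    """Cut a (T, 20) array of per-frame MFCC features into (1000, 20) samples.

    Rule described above: every full 1000 frames is one sample; a leftover
    of 100..999 frames is padded with all-zero frames; a leftover shorter
    than 100 frames is discarded (treated as containing no speech).
    """
    samples = []
    t = frames.shape[0]
    for start in range(0, t - t % 1000, 1000):   # full 1000-frame chunks
        samples.append(frames[start:start + 1000])
    rest = frames[t - t % 1000:]                 # remainder of the recording
    if len(rest) >= 100:
        pad = np.zeros((1000 - len(rest), frames.shape[1]))
        samples.append(np.vstack([rest, pad]))
    return samples
```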
In the training phase, each input recording is tagged with a label: the genuine class is [0000], the first type of forgery [0100], the second type of forgery [0010], and the third type of forgery [0001]. Specifically, the genuine class is, as the name implies, real speech, while forged speech is divided into three types: the first type is music forgery, the second type is recording-replay forgery, and the third type is technical voice forgery. The first type of forgery refers to music fed into the voiceprint recognition input: because music contains rich sound components, it can pass voice registration and verification normally, but it carries no information about the speaker's voice and so is not the target recording of voiceprint recognition. The second type of forgery is mainly a simple replay of recordings, such as recording a target person's speech or music with a voice recorder, a mobile phone, or similar devices and then replaying it directly into the input of voiceprint recognition. The third type of forgery mainly refers to forging the target person's speech with speech synthesis or voice conversion techniques: speech synthesis generally collects a certain amount of the target person's voice data and can then synthesize speech of the target person for specified text content, while voice conversion directly alters the spectrum of an original recording. Because this type of forgery involves a great deal of speech signal processing technology, it is called technical forgery.
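The label codes above can be captured in a small, hedged mapping; the class names below are hypothetical shorthand. Note that the text writes the genuine class as [0000], whereas a standard one-hot encoding for categorical_crossentropy training, as produced by Keras' to_categorical, would be [1 0 0 0] for class index 0.

```python
from tensorflow.keras.utils import to_categorical

# Hypothetical class names for the four categories described above.
CLASSES = ["genuine", "music_forgery", "replay_forgery", "technical_forgery"]

# Standard one-hot labels (index 0 -> [1, 0, 0, 0], etc.).
LABELS = {name: to_categorical(i, num_classes=4) for i, name in enumerate(CLASSES)}
```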
As to how the training samples are used to train the DNN (deep neural network), briefly: model training is carried out with the open-source Keras framework. Considering hardware limitations, DNN training uses minibatch technology with a batch size of 128; each iteration trains 1000 batches, and N iterations are trained in total. Each batch randomly selects 128 speech MFCC feature samples from the full data set, produces the model output, and then updates the model parameters by backward feedback according to the loss function, completing one batch computation; 1000 such batches complete one iteration and yield that iteration's model output. In general, the model with the best loss over 50 iterations is selected: the convolution kernel of the first layer is 9*20, Nfilters is set to 512, the loss function is set to the categorical cross-entropy (categorical_crossentropy) over all classes, and the optimizer is Adagrad.
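A minimal sketch of that training recipe follows, assuming the samples and one-hot labels have been collected into arrays x_all (shape (num_samples, 1000, 20)) and y_all (shape (num_samples, 4)), and that build_model is the architecture sketched further below; both names are assumptions.

```python
import numpy as np

model = build_model()                       # architecture sketched below
model.compile(optimizer="adagrad", loss="categorical_crossentropy")

best_loss, best_weights = float("inf"), None
for iteration in range(50):                 # 50 iterations, as stated above
    batch_losses = []
    for _ in range(1000):                   # 1000 batches per iteration
        idx = np.random.choice(len(x_all), size=128, replace=False)
        batch_losses.append(model.train_on_batch(x_all[idx], y_all[idx]))
    mean_loss = float(np.mean(batch_losses))
    if mean_loss < best_loss:               # keep the best-loss model
        best_loss, best_weights = mean_loss, model.get_weights()
model.set_weights(best_weights)
```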
The voice processing module 202 is configured to acquire the speech to be detected and frame it to obtain a 1000*20-dimensional matrix.
In this embodiment, the voice processing module 202 is specifically configured to: after framing the speech to be detected, extract 1000 frames and compute a 20-dimensional MFCC feature for each; and generate the 1000*20-dimensional matrix from the 20-dimensional MFCCs of the 1000 frames.
The framing of the speech to be detected is the same as the processing of the training speech described above: for recordings shorter than 1000 frames but longer than 100 frames, all-zero frames are appended; recordings shorter than 100 frames are discarded directly, on the assumption that no one is speaking in them. The computation of MFCC features is a conventional algorithm, and the present application does not repeat it here.
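Since the patent does not name an MFCC implementation, the sketch below uses librosa as an assumed choice; the function name speech_to_matrix is hypothetical. It applies the framing rule above to produce the 1000*20 input matrix for a single recording.

```python
import librosa
import numpy as np

def speech_to_matrix(path: str) -> np.ndarray:
    """Load a recording and build the 1000*20 MFCC matrix described above."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape (20, T)
    frames = mfcc.T                                      # shape (T, 20)
    t = frames.shape[0]
    if t < 100:
        raise ValueError("under 100 frames: treated as containing no speech")
    if t < 1000:                                         # pad with all-zero frames
        frames = np.vstack([frames, np.zeros((1000 - t, 20))])
    return frames[:1000]
```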
The matrix input module 203 is configured to input the 1000*20-dimensional matrix into the optimal deep neural network model.
In this embodiment, the input layer of the obtained optimal deep neural network model (DNN) takes a matrix, so the 1000*20-dimensional matrix obtained by the voice processing module 202 can be input directly into the model.
The matrix calculation module 204 is configured to calculate on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector, where the 1*4-dimensional output vector represents the four speech categories.
Specifically, the matrix calculation module 204 convolves the input features with a 1000*20 convolution kernel in the first layer of the DNN model; the purpose of this layer is to project the features of adjacent frames, the number of convolution kernels is controlled by Nfilters, and each kernel yields one channel feature after convolution, giving N channel features. In the second to fourth layers, 1*1 convolution kernels are used with the LeakyReLU activation function; these 1*1 kernels allow the channels to connect and interact, so the model learns more intra-frame and inter-frame features. The fifth layer performs pooling, extracting the maximum over each 2*2 kernel range (2*2 MaxPooling) with a default stride of 1*1; this layer subsamples the upper-layer nodes, reducing the model parameters and making overfitting less likely. The sixth layer flattens the output nodes of the previous layer into a 1*P-dimensional feature. The seventh layer, a linear layer, reduces the dimension of the sixth layer's output to obtain Out7, which is passed through a softmax activation to produce the 1*4 output vector, i.e., four values, as the detection result.
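A hedged Keras sketch of this seven-layer architecture follows. Two points are assumptions rather than statements of the patent: the text gives the first-layer kernel as 1000*20 here but as 9*20 in the training configuration above, and a kernel spanning all 20 MFCC dimensions is equivalent to a one-dimensional convolution over the time axis, so the sketch uses Conv1D with the 9-frame kernel; likewise, the stated 2*2 pooling is rendered as 1D max pooling over time so that the layer shapes compose.

```python
from tensorflow.keras import layers, models

def build_model(nfilters: int = 512) -> models.Model:
    inp = layers.Input(shape=(1000, 20))            # 1000 frames x 20 MFCCs
    # Layer 1: adjacent-frame feature projection; a 9*20 kernel covering all
    # 20 feature dimensions acts as a 1D convolution with kernel size 9,
    # producing nfilters channel features.
    x = layers.Conv1D(nfilters, kernel_size=9)(inp)
    # Layers 2-4: 1*1 convolutions with LeakyReLU let the channels interact.
    for _ in range(3):
        x = layers.Conv1D(nfilters, kernel_size=1)(x)
        x = layers.LeakyReLU()(x)
    # Layer 5: max pooling with the default stride of 1.
    x = layers.MaxPooling1D(pool_size=2, strides=1)(x)
    # Layer 6: flatten to a 1*P feature vector.
    x = layers.Flatten()(x)
    # Layer 7: linear dimension reduction to 4 (Out7), then softmax -> 1*4.
    out = layers.Softmax()(layers.Dense(4)(x))
    return models.Model(inp, out)
```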
The determination module 205 is configured to select the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected, where the 1*4-dimensional output vector contains values in the range 0 to 1.
In this embodiment, the 1*4-dimensional output vector expresses, through four decimals in the 0-1 range, the probability of belonging to the corresponding class, namely the probabilities of genuine speech and of the first, second, and third types of forgery. The class with the largest of the four probabilities represents the category of the input speech; that is, the output values directly and effectively indicate whether the speech to be detected is live, i.e., genuine, speech.
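Tying the sketches together, detection of a single recording might look like the following; the file name is a placeholder.

```python
import numpy as np

matrix = speech_to_matrix("utterance_to_check.wav")   # (1000, 20)
probs = model.predict(matrix[np.newaxis, ...])[0]     # (4,), values in 0..1
category = CLASSES[int(np.argmax(probs))]             # largest probability wins
print(f"detected category: {category} (p={probs.max():.3f})")
```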
Through the above program modules 201-205, the server proposed by the present application trains the deep neural network model to obtain an optimal deep neural network model; acquires the speech to be detected and frames it to obtain a 1000*20-dimensional matrix; inputs the 1000*20-dimensional matrix into the optimal deep neural network model; calculates on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector that represents the four speech categories; and selects the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected. In this way, before voice is used in a corresponding application, it can be quickly detected whether the voice was output directly by the user or maliciously forged by another person; this provides a higher level of security for voice control and promotes the development of voice recognition technology.
此外,本申请还提出一种基于深度学习的语音活体检测方法。In addition, the present application also proposes a voice living body detection method based on deep learning.
Referring to FIG. 3, which is a schematic flowchart of the implementation of the first embodiment of the deep learning-based voice living-body detection method of the present application. In this embodiment, the order of execution of the steps in the flowchart shown in FIG. 3 may be changed according to different requirements, and some steps may be omitted.
Step S301, training the deep neural network model to obtain an optimal deep neural network model.
Specifically, the above step includes: framing the training speech and taking every 1000 frames as one sample; assigning a class label to each sample; and using the labeled samples as training samples for the deep neural network model.
In this embodiment, the purpose of truncating the speech every 1000 frames is to give the model a fixed-length input: recordings of different lengths produce different MFCC (Mel-Frequency Cepstral Coefficients) feature distributions, and if the input features are not of fixed size the model's recognition easily becomes inaccurate. For recordings shorter than 1000 frames but longer than 100 frames, all-zero frames are appended to reach 1000 frames; recordings shorter than 100 frames are simply discarded, on the assumption that no one is speaking in them. Taking every 1000 frames of every recording as a training sample lets the model learn the acoustic characteristics of each time segment of that class of speech, which is more robust than training on a single 1000-frame segment of a recording.
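By way of non-limiting illustration, the framing rule above may be sketched in Python as follows (the function name and the NumPy frames-as-rows layout are assumptions of this sketch, not part of the application):

    import numpy as np

    def make_samples(frames, sample_len=1000, min_len=100):
        """Split a (num_frames, 20) MFCC array into fixed 1000*20 samples.

        Chunks shorter than min_len frames are discarded (treated as
        containing no speech); chunks shorter than sample_len frames are
        padded with all-zero frames, as described above.
        """
        samples = []
        for start in range(0, len(frames), sample_len):
            chunk = frames[start:start + sample_len]
            if len(chunk) < min_len:
                continue  # shorter than 100 frames: assume nobody is speaking
            if len(chunk) < sample_len:
                pad = np.zeros((sample_len - len(chunk), frames.shape[1]))
                chunk = np.vstack([chunk, pad])
            samples.append(chunk)
        return samples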
In the training phase, every input recording is labeled, e.g. the genuine class as [1000], type-1 forgery as [0100], type-2 forgery as [0010] and type-3 forgery as [0001]. Specifically, the genuine class is, as the name suggests, real speech, while forged speech is divided into three types: type-1 forgery is music, type-2 forgery is recording replay, and type-3 forgery is technical voice forgery. Type-1 forgery refers to music being fed to the voiceprint recognition input: because music contains rich acoustic components it can pass voice registration and verification normally, but it carries no information about the speaker's voice and is therefore not a target recording for voiceprint recognition. Type-2 forgery is mainly simple replay of recordings, e.g. recording the target person's speech or music with a voice recorder or mobile phone and then replaying it directly into the voiceprint recognition input. Type-3 forgery mainly refers to forging the target person's speech by speech synthesis or voice conversion: speech synthesis generally collects a certain amount of the target person's voice data and then synthesizes speech for any specified text, while voice conversion directly modifies the spectrum of an original recording. Because this type of forgery involves a large amount of speech signal processing technology, it is called technical forgery.
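Written out as data, the labeling scheme amounts to the following one-hot mapping (the class names are illustrative, since the application only numbers the forgery types; the genuine-class vector is assumed to be [1,0,0,0]):

    # One-hot labels for the four speech categories described above
    LABELS = {
        "genuine":   [1, 0, 0, 0],
        "music":     [0, 1, 0, 0],  # type-1 forgery
        "replay":    [0, 0, 1, 0],  # type-2 forgery
        "technical": [0, 0, 0, 1],  # type-3 forgery (synthesis or conversion)
    }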
As for how the training samples are used to train the DNN (deep neural network), briefly: model training uses the open-source Keras framework. In view of hardware limitations, minibatch training is adopted, with a batch size of 128; each iteration trains 1000 batches, and N iterations are trained in total. Each batch randomly selects 128 speech MFCC feature samples from the full data set, produces the model output, and then updates the model parameters by back-propagation according to the loss function, completing one batch computation; 1000 such batches complete the 1000 training steps of one iteration and yield that iteration's model output. In general, the model with the best loss over 50 iterations is selected. The convolution kernel of the first layer is 9*20 and Nfilters is set to 512; the loss function is the categorical cross-entropy over all classes, categorical_crossentropy, and the optimizer is adagrad.
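By way of non-limiting illustration, the seven-layer network and training configuration described here and in step S304 below might be sketched with Keras as follows; the padding choices, the width of the linear seventh layer, and the data pipeline are assumptions the text does not fix:

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        # Layer 1: adjacent-frame feature projection, Nfilters = 512
        layers.Conv2D(512, kernel_size=(9, 20), input_shape=(1000, 20, 1)),
        # Layers 2-4: 1*1 convolutions with LeakyReLU let channels interact
        layers.Conv2D(512, (1, 1)), layers.LeakyReLU(),
        layers.Conv2D(512, (1, 1)), layers.LeakyReLU(),
        layers.Conv2D(512, (1, 1)), layers.LeakyReLU(),
        # Layer 5: 2*2 max pooling, stride 1*1 ("same" padding is an assumption)
        layers.MaxPooling2D(pool_size=(2, 2), strides=(1, 1), padding="same"),
        # Layer 6: flatten to a 1*P feature vector
        layers.Flatten(),
        # Layer 7: linear dimensionality reduction, then softmax over 4 classes
        layers.Dense(64),  # the width of the linear layer is an assumption
        layers.Dense(4, activation="softmax"),
    ])

    model.compile(optimizer="adagrad", loss="categorical_crossentropy")
    # model.fit(x_train, y_train, batch_size=128, steps_per_epoch=1000, epochs=50)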
Step S302, acquiring the speech to be detected and framing the speech to be detected to obtain a 1000*20-dimensional matrix.
In this embodiment, the speech processing module 202 is specifically configured to: after framing the speech to be detected, extract 1000 frames and compute a 20-dimensional MFCC feature for each frame; and generate the 1000*20-dimensional matrix from the 20-dimensional MFCCs of the 1000 frames.
In this embodiment, the framing of the speech to be detected is the same as the processing of the training speech described above: for recordings shorter than 1000 frames but longer than 100 frames, all-zero frames are appended; recordings shorter than 100 frames are simply discarded, on the assumption that no one is speaking in them. The computation of the MFCC features is a conventional algorithm, which this application does not describe further.
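Since the MFCC computation is left to conventional tooling, the detection-side feature extraction may be sketched as follows (the use of librosa and the reuse of the make_samples helper from the training sketch above are assumptions of this sketch):

    import librosa

    def speech_to_matrix(wav_path):
        """Return a 1000*20 MFCC matrix for the recording, or None if too short."""
        signal, sr = librosa.load(wav_path, sr=None)
        # librosa returns (n_mfcc, num_frames); transpose to frames-as-rows
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20).T
        samples = make_samples(mfcc)  # pad/split exactly as for training speech
        return samples[0] if samples else None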
Step S303, inputting the 1000*20-dimensional matrix into the optimal deep neural network model.
In this embodiment, the input layer of the optimal deep neural network model (DNN) obtained above takes a matrix as input, so the 1000*20-dimensional matrix produced by the speech processing module 202 can be fed directly into the optimal deep neural network model.
Step S304, using the optimal deep neural network model to compute on the 1000*20-dimensional matrix a 1*4-dimensional output vector, where the 1*4-dimensional output vector represents four speech categories.
Specifically, the matrix calculation module 204 convolves the input features with a 1000*20 convolution kernel in the first layer of the DNN model. The purpose of this layer is to project the features of adjacent frames, and the Nfilters setting (the number of filters) controls how each convolution kernel yields N channel features after convolution. In the second to fourth layers, 1*1 convolution kernels are used for convolution together with the LeakyReLU activation function; the role of these 1*1 kernels is to let the channels connect and interact, so that the model learns more intra-frame and inter-frame features. The fifth layer performs pooling, extracting the maximum over each 2*2 kernel range (2*2 max pooling with the stride left at its default of 1*1); this layer keeps only selected upper-layer nodes, reducing the model parameters and making overfitting less likely. The sixth layer performs flattening, i.e. the output nodes of the previous layer are flattened into a 1*P-dimensional feature vector. The seventh layer, a linear layer, reduces the dimensionality of the sixth layer's output to obtain the output Out7; Out7 is then fed through a softmax activation function, producing a 1*4 vector, i.e. four values, as the detection result.
Step S305, selecting the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected, where the values of the 1*4-dimensional output vector lie in the range 0 to 1.
In this embodiment, the four values of the 1*4-dimensional output vector are decimals in the range 0-1 that represent the probabilities of the corresponding classes, i.e. the probabilities of genuine speech and of type-1, type-2 and type-3 forgeries. The class with the largest of these four probabilities represents the category of the input speech, so the output values directly and effectively indicate whether the speech to be detected is live speech, i.e. genuine speech.
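Selecting the class is then a single argmax over the softmax output; in the following sketch, model and the class names carry over from the earlier sketches and are assumptions of this illustration:

    import numpy as np

    CLASSES = ["genuine", "music", "replay", "technical"]

    def classify(matrix_1000x20):
        # Keras expects a batch axis and a channel axis: (1, 1000, 20, 1)
        probs = model.predict(matrix_1000x20[np.newaxis, ..., np.newaxis])[0]
        return CLASSES[int(np.argmax(probs))]  # class with the largest probability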
Through the above steps S301-S305, the deep learning-based voice living-body detection method proposed in the present application first trains the deep neural network model to obtain an optimal deep neural network model; second, acquires the speech to be detected and frames it to obtain a 1000*20-dimensional matrix; third, inputs the 1000*20-dimensional matrix into the optimal deep neural network model; then uses the optimal deep neural network model to compute on the 1000*20-dimensional matrix a 1*4-dimensional output vector that represents four speech categories; and finally selects the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected. In this way, before speech is used in an application, it can be quickly determined whether the speech was produced directly by the user or is someone else's maliciously forged speech; this provides a higher level of security for voice-controlled systems and promotes the development of speech recognition technology.
The present application further provides another embodiment, namely a storage medium storing a deep learning-based voice living-body detection program, where the deep learning-based voice living-body detection program is executable by at least one processor to cause the at least one processor to perform the steps of the deep learning-based voice living-body detection method described above.
The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the various embodiments of the present application.
The above are only preferred embodiments of the present application and do not thereby limit the patent scope of the present application; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (20)

  1. A deep learning-based voice living-body detection method, applied to a server, wherein the method comprises the following steps:
    training a deep neural network model to obtain an optimal deep neural network model;
    acquiring speech to be detected and framing the speech to be detected to obtain a 1000*20-dimensional matrix;
    inputting the 1000*20-dimensional matrix into the optimal deep neural network model;
    using the optimal deep neural network model to compute on the 1000*20-dimensional matrix a 1*4-dimensional output vector, the 1*4-dimensional output vector representing four speech categories; and
    selecting the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
  2. The deep learning-based voice living-body detection method according to claim 1, wherein the step of training a deep neural network model to obtain an optimal deep neural network model comprises:
    framing training speech and taking every 1000 frames as one sample;
    assigning a class label to each sample; and
    using the labeled samples as training samples for the deep neural network model.
  3. The deep learning-based voice living-body detection method according to claim 2, wherein the step of framing training speech and taking every 1000 frames as one sample specifically comprises:
    for training speech shorter than 1000 frames but longer than 100 frames, appending all-zero frames after the training speech to reach 1000 frames; and
    for training speech shorter than 100 frames, discarding it directly.
  4. The deep learning-based voice living-body detection method according to claim 1, wherein the step of acquiring speech to be detected and framing the speech to be detected to obtain a 1000*20-dimensional matrix specifically comprises:
    after framing the speech to be detected, extracting 1000 frames and computing a 20-dimensional MFCC feature for each frame; and
    generating the 1000*20-dimensional matrix from the 20-dimensional MFCCs of the 1000 frames.
  5. The deep learning-based voice living-body detection method according to claim 1, wherein the step of using the optimal deep neural network model to compute on the 1000*20-dimensional matrix a 1*4-dimensional output vector comprises:
    in the first layer, convolving the input features with a 1000*20 convolution kernel;
    in the second to fourth layers, convolving with 1*1 convolution kernels and using the LeakyReLU activation function;
    in the fifth layer, pooling by extracting the maximum over each 2*2 kernel range;
    in the sixth layer, flattening;
    in the seventh layer, reducing the dimensionality of the sixth layer's output to obtain an output Out7; and
    in the seventh layer, taking the Out7 as input and applying a softmax activation function to output a 1*4 vector as the detection result.
  6. The deep learning-based voice living-body detection method according to claim 5, wherein the values of the 1*4-dimensional output vector lie in the range 0 to 1.
  7. The deep learning-based voice living-body detection method according to claim 1, wherein the four speech categories comprise genuine speech, type-1 forged speech, type-2 forged speech and type-3 forged speech.
  8. A server, wherein the server comprises a memory and a processor, the memory storing a deep learning-based voice living-body detection program executable on the processor, and the deep learning-based voice living-body detection program, when executed by the processor, implements the following steps:
    training a deep neural network model to obtain an optimal deep neural network model;
    acquiring speech to be detected and framing the speech to be detected to obtain a 1000*20-dimensional matrix;
    inputting the 1000*20-dimensional matrix into the optimal deep neural network model;
    using the optimal deep neural network model to compute on the 1000*20-dimensional matrix a 1*4-dimensional output vector, the 1*4-dimensional output vector representing four speech categories; and
    selecting the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
  9. The server according to claim 8, wherein when the deep learning-based voice living-body detection program is executed by the processor, the step of training a deep neural network model to obtain an optimal deep neural network model comprises:
    framing training speech and taking every 1000 frames as one sample;
    assigning a class label to each sample; and
    using the labeled samples as training samples for the deep neural network model.
  10. The server according to claim 9, wherein when the deep learning-based voice living-body detection program is executed by the processor, the step of framing training speech and taking every 1000 frames as one sample specifically comprises:
    for training speech shorter than 1000 frames but longer than 100 frames, appending all-zero frames after the training speech to reach 1000 frames; and
    for training speech shorter than 100 frames, discarding it directly.
  11. The server according to claim 8, wherein when the deep learning-based voice living-body detection program is executed by the processor, the step of acquiring speech to be detected and framing the speech to be detected to obtain a 1000*20-dimensional matrix specifically comprises:
    after framing the speech to be detected, extracting 1000 frames and computing a 20-dimensional MFCC feature for each frame; and
    generating the 1000*20-dimensional matrix from the 20-dimensional MFCCs of the 1000 frames.
  12. The server according to claim 8, wherein when the deep learning-based voice living-body detection program is executed by the processor, the step of using the optimal deep neural network model to compute on the 1000*20-dimensional matrix a 1*4-dimensional output vector comprises:
    in the first layer, convolving the input features with a 1000*20 convolution kernel;
    in the second to fourth layers, convolving with 1*1 convolution kernels and using the LeakyReLU activation function;
    in the fifth layer, pooling by extracting the maximum over each 2*2 kernel range;
    in the sixth layer, flattening;
    in the seventh layer, reducing the dimensionality of the sixth layer's output to obtain an output Out7; and
    in the seventh layer, taking the Out7 as input and applying a softmax activation function to output a 1*4 vector as the detection result.
  13. The server according to claim 12, wherein when the deep learning-based voice living-body detection program is executed by the processor, the values of the 1*4-dimensional output vector lie in the range 0 to 1.
  14. The server according to claim 8, wherein when the deep learning-based voice living-body detection program is executed by the processor, the four speech categories comprise genuine speech, type-1 forged speech, type-2 forged speech and type-3 forged speech.
  15. A storage medium, the storage medium storing a deep learning-based voice living-body detection program, wherein the deep learning-based voice living-body detection program, when executed by at least one processor, implements the following steps:
    training a deep neural network model to obtain an optimal deep neural network model;
    acquiring speech to be detected and framing the speech to be detected to obtain a 1000*20-dimensional matrix;
    inputting the 1000*20-dimensional matrix into the optimal deep neural network model;
    using the optimal deep neural network model to compute on the 1000*20-dimensional matrix a 1*4-dimensional output vector, the 1*4-dimensional output vector representing four speech categories; and
    selecting the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
  16. The storage medium according to claim 15, wherein when the deep learning-based voice living-body detection program is executed by the processor, the step of training a deep neural network model to obtain an optimal deep neural network model comprises:
    framing training speech and taking every 1000 frames as one sample;
    assigning a class label to each sample; and
    using the labeled samples as training samples for the deep neural network model.
  17. The storage medium according to claim 16, wherein when the deep learning-based voice living-body detection program is executed by the processor, the step of framing training speech and taking every 1000 frames as one sample specifically comprises:
    for training speech shorter than 1000 frames but longer than 100 frames, appending all-zero frames after the training speech to reach 1000 frames; and
    for training speech shorter than 100 frames, discarding it directly.
  18. The storage medium according to claim 15, wherein when the deep learning-based voice living-body detection program is executed by the processor, the step of acquiring speech to be detected and framing the speech to be detected to obtain a 1000*20-dimensional matrix specifically comprises:
    after framing the speech to be detected, extracting 1000 frames and computing a 20-dimensional MFCC feature for each frame; and
    generating the 1000*20-dimensional matrix from the 20-dimensional MFCCs of the 1000 frames.
  19. The storage medium according to claim 15, wherein when the deep learning-based voice living-body detection program is executed by the processor, the step of using the optimal deep neural network model to compute on the 1000*20-dimensional matrix a 1*4-dimensional output vector comprises:
    in the first layer, convolving the input features with a 1000*20 convolution kernel;
    in the second to fourth layers, convolving with 1*1 convolution kernels and using the LeakyReLU activation function;
    in the fifth layer, pooling by extracting the maximum over each 2*2 kernel range;
    in the sixth layer, flattening;
    in the seventh layer, reducing the dimensionality of the sixth layer's output to obtain an output Out7; and
    in the seventh layer, taking the Out7 as input and applying a softmax activation function to output a 1*4 vector as the detection result.
  20. The storage medium according to claim 19, wherein when the deep learning-based voice living-body detection program is executed by the processor, the values of the 1*4-dimensional output vector lie in the range 0 to 1.
PCT/CN2018/089203 2018-01-12 2018-05-31 Voice living-body detection method based on deep learning, server and storage medium WO2019136909A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810029892.6A CN108281158A (en) 2018-01-12 2018-01-12 Voice living-body detection method based on deep learning, server and storage medium
CN201810029892.6 2018-01-12

Publications (1)

Publication Number Publication Date
WO2019136909A1 true

Family

ID=62803422

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/089203 WO2019136909A1 (en) 2018-01-12 2018-05-31 Voice living-body detection method based on deep learning, server and storage medium

Country Status (2)

Country Link
CN (1) CN108281158A (en)
WO (1) WO2019136909A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021031279A1 (en) * 2019-08-20 2021-02-25 东北大学 Deep-learning-based intelligent pneumonia diagnosis system and method for x-ray chest radiograph
EP4091164A4 (en) * 2020-01-13 2024-01-24 Univ Michigan Regents Secure automatic speaker verification system

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036459B (en) * 2018-08-22 2019-12-27 百度在线网络技术(北京)有限公司 Voice endpoint detection method and device, computer equipment and computer storage medium
CN109346089A (en) * 2018-09-27 2019-02-15 深圳市声扬科技有限公司 Living body identity identifying method, device, computer equipment and readable storage medium storing program for executing
CN109801638B (en) * 2019-01-24 2023-10-13 平安科技(深圳)有限公司 Voice verification method, device, computer equipment and storage medium
CN111933154B (en) * 2020-07-16 2024-02-13 平安科技(深圳)有限公司 Method, equipment and computer readable storage medium for recognizing fake voice
CN112489677B (en) * 2020-11-20 2023-09-22 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and medium based on neural network
CN112735431B (en) * 2020-12-29 2023-12-22 三星电子(中国)研发中心 Model training method and device and artificial intelligent dialogue recognition method and device
CN112735381B (en) * 2020-12-29 2022-09-27 四川虹微技术有限公司 Model updating method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105869630A (en) * 2016-06-27 2016-08-17 上海交通大学 Method and system for detecting voice spoofing attack of speakers on basis of deep learning
US20170061966A1 (en) * 2015-08-25 2017-03-02 Nuance Communications, Inc. Audio-Visual Speech Recognition with Scattering Operators
CN107545248A (en) * 2017-08-24 2018-01-05 北京小米移动软件有限公司 Biological characteristic biopsy method, device, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102436810A (en) * 2011-10-26 2012-05-02 华南理工大学 Record replay attack detection method and system based on channel mode noise
CN106409298A (en) * 2016-09-30 2017-02-15 广东技术师范学院 Identification method of sound rerecording attack
CN106531172B (en) * 2016-11-23 2019-06-14 湖北大学 Speaker's audio playback discrimination method and system based on ambient noise variation detection

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061966A1 (en) * 2015-08-25 2017-03-02 Nuance Communications, Inc. Audio-Visual Speech Recognition with Scattering Operators
CN105869630A (en) * 2016-06-27 2016-08-17 上海交通大学 Method and system for detecting voice spoofing attack of speakers on basis of deep learning
CN107545248A (en) * 2017-08-24 2018-01-05 北京小米移动软件有限公司 Biological characteristic biopsy method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU, XIAO: "Research on Live Face Detection Algorithm Based on Deep Learning", CHINA MASTER'S THESES FULL-TEXT DATABASE, 15 March 2017 (2017-03-15), pages 1-39 and 55, ISSN: 1674-0246 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021031279A1 (en) * 2019-08-20 2021-02-25 东北大学 Deep-learning-based intelligent pneumonia diagnosis system and method for x-ray chest radiograph
EP4091164A4 (en) * 2020-01-13 2024-01-24 Univ Michigan Regents Secure automatic speaker verification system

Also Published As

Publication number Publication date
CN108281158A (en) 2018-07-13

Similar Documents

Publication Publication Date Title
WO2019136909A1 (en) Voice living-body detection method based on deep learning, server and storage medium
US10332507B2 (en) Method and device for waking up via speech based on artificial intelligence
US20200372905A1 (en) Mixed speech recognition method and apparatus, and computer-readable storage medium
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN109859772B (en) Emotion recognition method, emotion recognition device and computer-readable storage medium
US11875799B2 (en) Method and device for fusing voiceprint features, voice recognition method and system, and storage medium
WO2019174131A1 (en) Identity authentication method, server, and computer readable storage medium
CN107564513A (en) Audio recognition method and device
CN112418059B (en) Emotion recognition method and device, computer equipment and storage medium
WO2021051572A1 (en) Voice recognition method and apparatus, and computer device
WO2019136911A1 (en) Voice recognition method for updating voiceprint data, terminal device, and storage medium
CN110675862A (en) Corpus acquisition method, electronic device and storage medium
WO2018210323A1 (en) Method and device for providing social object
CN109658921A (en) A kind of audio signal processing method, equipment and computer readable storage medium
CN109658943A (en) A kind of detection method of audio-frequency noise, device, storage medium and mobile terminal
WO2019196305A1 (en) Electronic device, identity verification method, and storage medium
US10910000B2 (en) Method and device for audio recognition using a voting matrix
US11830493B2 (en) Method and apparatus with speech processing
CN112071331B (en) Voice file restoration method and device, computer equipment and storage medium
CN114241411B (en) Counting model processing method and device based on target detection and computer equipment
CN116386664A (en) Voice counterfeiting detection method, device, system and storage medium
CN111951793B (en) Method, device and storage medium for awakening word recognition
Tai et al. Seef-aldr: A speaker embedding enhancement framework via adversarial learning based disentangled representation
CN113011532A (en) Classification model training method and device, computing equipment and storage medium
US20210020167A1 (en) Apparatus and method with speech recognition and learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18899637; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.10.2020))
122 Ep: pct application non-entry in european phase (Ref document number: 18899637; Country of ref document: EP; Kind code of ref document: A1)