WO2019136909A1 - Voice living-body detection method based on deep learning, server and storage medium - Google Patents

Voice living-body detection method based on deep learning, server and storage medium

Info

Publication number
WO2019136909A1
WO2019136909A1, PCT/CN2018/089203, CN2018089203W
Authority
WO
WIPO (PCT)
Prior art keywords
speech
neural network
network model
dimensional
frames
Prior art date
Application number
PCT/CN2018/089203
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
郑斯奇
于夕畔
肖京
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date: 2018-01-12 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2019136909A1 publication Critical patent/WO2019136909A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination

Definitions

  • the present application relates to the field of computer technologies, and in particular, to a voice living-body detection method based on deep learning, a server, and a storage medium.
  • the voice of a non-real person is generally referred to as a forged recording, including music input, recording replay, voice generated by technical means such as speech synthesis, and the like.
  • Forged recordings are often used in the financial and security fields.
  • attackers use forged recordings to break into voiceprint recognition and log into the victim's account, so as to steal money or damage the reputation and property of others.
  • the present application provides a voice living-body detection method based on deep learning, a server, and a storage medium, so that before voice is used in a corresponding application, it can be quickly detected whether the voice was output directly by the user or maliciously forged by another person; this provides a higher level of security for voice control and promotes the development of voice recognition technology.
  • the present application provides a server, where the server includes a memory and a processor, and the memory stores a deep learning-based voice living-body detection program executable on the processor; when the deep learning-based voice living-body detection program is executed by the processor, the following steps are performed: training a deep neural network model to obtain an optimal deep neural network model; acquiring the speech to be detected and framing it to obtain a 1000*20-dimensional matrix; inputting the 1000*20-dimensional matrix into the optimal deep neural network model; calculating on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector, where the 1*4-dimensional output vector represents four speech categories; and selecting the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
  • the present application further provides a voice living-body detection method based on deep learning, applied to a server, the method including the following steps: training a deep neural network model to obtain an optimal deep neural network model; acquiring the speech to be detected and framing it to obtain a 1000*20-dimensional matrix; inputting the 1000*20-dimensional matrix into the optimal deep neural network model; calculating on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector, where the 1*4-dimensional output vector represents four speech categories; and selecting the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
  • the present application further provides a storage medium storing a deep learning-based voice living-body detection program, where the deep learning-based voice living-body detection program can be executed by at least one processor to cause the at least one processor to perform the steps of the deep learning-based voice living-body detection method described above.
  • the deep learning-based voice living-body detection method, server, and storage medium proposed by the present application first train the deep neural network model to obtain an optimal deep neural network model; second, acquire the speech to be detected and frame it to obtain a 1000*20-dimensional matrix; third, input the 1000*20-dimensional matrix into the optimal deep neural network model; then calculate on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector that represents the four speech categories; and finally select the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
  • FIG. 1 is a schematic diagram of an optional hardware architecture of the server of the present application.
  • FIG. 2 is a program module diagram of a first embodiment of the deep learning-based voice living-body detection program of the present application.
  • FIG. 3 is a flowchart of a first embodiment of the deep learning-based voice living-body detection method of the present application.
  • referring to FIG. 1, it is a schematic diagram of an optional hardware architecture of the server 1.
  • the server 1 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server.
  • the server 1 may be a standalone server or a server cluster composed of multiple servers.
  • the server 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicably connected to one another through a system bus.
  • the server 1 connects to the network through the network interface 13 to obtain information.
  • the network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or a telephone network.
  • Figure 1 only shows the server 1 with the components 11-13, but it should be understood that not all illustrated components are required to be implemented, and more or fewer components may be implemented instead.
  • the memory 11 includes at least one type of storage medium, including flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like.
  • the memory 11 may be an internal storage unit of the server 1, such as a hard disk or memory of the server 1.
  • the memory 11 may also be an external storage device of the server 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card (Flash Card) equipped on the server 1.
  • the memory 11 can also include both the internal storage unit of the server 1 and its external storage device.
  • the memory 11 is generally used to store an operating system installed in the server 1 and various types of application software, such as program code of the deep learning-based voice living body detection program 200. Further, the memory 11 can also be used to temporarily store various types of data that have been output or are to be output.
  • the processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments.
  • the processor 12 is typically used to control the overall operation of the server 1, such as performing data interaction or communication related control and processing, and the like.
  • the processor 12 is configured to run program code or process data stored in the memory 11, such as running the deep learning-based voice biometric detection program 200 and the like.
  • the network interface 13 may comprise a wireless network interface or a wired network interface, which is typically used to establish a communication connection between the server 1 and other electronic devices.
  • a deep learning-based voice biometric detection program 200 is installed and run in the server 1.
  • when the deep learning-based voice living-body detection program 200 runs, the server 1 trains the deep neural network model to obtain an optimal deep neural network model; acquires the speech to be detected and frames it to obtain a 1000*20-dimensional matrix; inputs the 1000*20-dimensional matrix into the optimal deep neural network model; calculates on the matrix with the model to obtain a 1*4-dimensional output vector representing four speech categories; and selects the class with the largest value in the output vector as the category of the speech to be detected.
  • the present application proposes a deep learning-based voice living-body detection program 200.
  • referring to FIG. 2, it is a program module diagram of the first embodiment of the deep learning-based voice living-body detection program 200 of the present application.
  • the server 1 includes a series of computer program instructions stored in the memory 11, namely the deep learning-based voice living-body detection program 200; when these computer program instructions are executed by the processor 12, the deep learning-based voice living-body detection operations of the embodiments of the present application can be implemented.
  • the deep learning-based voice living-body detection program 200 can be divided into one or more modules based on the particular operations implemented by the various portions of the computer program instructions. For example, in FIG. 2, the deep learning-based voice living-body detection program 200 is divided into a training module 201, a voice processing module 202, a matrix input module 203, a matrix calculation module 204, and a determination module 205, wherein:
  • the training module 201 is configured to train the deep neural network model to obtain an optimal deep neural network model.
  • the training module 201 is specifically configured to frame the training speech, taking every 1000 frames as one sample; to label each sample with its class; and to use the labeled samples as training samples for the deep neural network model.
  • the purpose of truncating every 1000 frames is to give the model a fixed-length input: recordings of different lengths produce different MFCC (Mel-Frequency Cepstral Coefficients) feature distributions, and if the input features are not fixed, the model's recognition easily becomes inaccurate. Recordings shorter than 1000 frames but longer than 100 frames are padded with all-zero frames; recordings shorter than 100 frames are discarded as containing no speech.
  • the advantage of making every 1000 frames of all recordings a training sample is that the model can learn the sound characteristics of each type of speech over every time period, which is more robust than training on a single 1000-frame segment of a single recording.
  • in the training phase, each input recording is tagged with a label, such as the genuine class [0000], the first type of forgery [0100], the second type of forgery [0010], and the third type of forgery [0001].
  • the genuine class is, as the name implies, real speech; forged speech is divided into three types: the first type is music forgery, the second type is recording-replay forgery, and the third type is technical voice forgery.
  • the first type of forgery refers to music fed into the voiceprint recognition input: because music contains rich sound components, it can pass voice registration and verification normally, but it carries no information about the speaker's voice and so is not the target recording of voiceprint recognition.
  • the second type of forgery is mainly a simple replay of recordings, such as recording a target person's speech or music with a voice recorder, a mobile phone, or similar devices and then replaying it directly into the input of voiceprint recognition.
  • the third type of forgery mainly refers to forging the target person's speech with speech synthesis or voice conversion techniques.
  • speech synthesis generally collects a certain amount of the target person's voice data and can then synthesize speech of the target person for specified text content; voice conversion directly alters the spectrum of an original recording. Because this type of forgery involves a great deal of speech signal processing technology, it is called technical forgery.
  • model training is carried out with the open-source Keras framework. Considering hardware limitations, DNN training uses minibatch technology with a batch size of 128; each iteration trains 1000 batches, and N iterations are trained in total. Each batch randomly selects 128 speech MFCC feature samples from the full data set, produces the model output, and then updates the model parameters by backward feedback according to the loss function, completing one batch computation; 1000 such batches complete one iteration and yield that iteration's model output.
  • in general, the model with the best loss over 50 iterations is selected: the convolution kernel of the first layer is 9*20, Nfilters is set to 512, the loss function is set to the categorical cross-entropy (categorical_crossentropy) over all classes, and the optimizer is Adagrad.
  • the voice processing module 202 is configured to acquire a voice to be detected and frame the to-be-detected voice to obtain a 1000*20-dimensional matrix.
  • the voice processing module 202 is specifically configured to: after framing the speech to be detected, extract 1000 frames and compute a 20-dimensional MFCC feature for each; and generate the 1000*20-dimensional matrix from the 20-dimensional MFCCs of the 1000 frames.
  • the framing of the speech to be detected is the same as the processing of the training speech described above.
  • the computation of MFCC features is a conventional algorithm, and the present application does not repeat it here.
  • the matrix input module 203 is configured to input the 1000*20-dimensional matrix into the optimal deep neural network model.
  • the input layer of the obtained optimal deep neural network model (DNN) takes a matrix, so the 1000*20-dimensional matrix obtained by the voice processing module 202 can be input directly into the model.
  • the matrix calculation module 204 is configured to calculate on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector, where the 1*4-dimensional output vector represents the four speech categories.
  • the matrix calculation module 204 convolves the input features with a 1000*20 convolution kernel in the first layer of the DNN model.
  • the purpose of this layer is to project the features of adjacent frames; the number of convolution kernels is controlled by Nfilters, and each kernel yields one channel feature after convolution, giving N channel features.
  • in the second to fourth layers, 1*1 convolution kernels are used with the LeakyReLU activation function; these 1*1 kernels allow the channels to connect and interact, so the model learns more intra-frame and inter-frame features. The fifth layer performs pooling, extracting the maximum over each 2*2 kernel range (2*2 MaxPooling) with a default stride of 1*1; this layer subsamples the upper-layer nodes, reducing the model parameters and making overfitting less likely. The sixth layer flattens the output nodes of the previous layer into a 1*P-dimensional feature. The seventh layer, a linear layer, reduces the dimension of the sixth layer's output to obtain Out7, which is passed through a softmax activation to produce the 1*4 output vector, i.e., four values, as the detection result.
  • the determining module 205 is configured to select a class with the largest value among the 1*4-dimensional output vectors as the category of the voice to be detected.
  • the 1*4 dimensional output vector is a value in the range of 0 to 1.
  • the 1*4-dimensional output vector expresses, through four decimals in the 0-1 range, the probability of belonging to the corresponding class, namely the probabilities of genuine speech and of the first, second, and third types of forgery.
  • the class with the largest of the four probabilities represents the category of the input speech; that is, the output values directly and effectively indicate whether the speech to be detected is live, i.e., genuine, speech.
  • the server proposed by the present application trains the deep neural network model to obtain an optimal deep neural network model; acquires the speech to be detected and frames it to obtain a 1000*20-dimensional matrix; inputs the 1000*20-dimensional matrix into the optimal deep neural network model; calculates on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector that represents the four speech categories; and selects the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
  • the present application also proposes a voice living body detection method based on deep learning.
  • referring to FIG. 3, it is a schematic flowchart of a first embodiment of the deep learning-based voice living-body detection method of the present application.
  • the order of execution of the steps in the flowchart shown in FIG. 3 may be changed according to different requirements, and some steps may be omitted.
  • Step S301: train the deep neural network model to obtain an optimal deep neural network model.
  • the foregoing step specifically includes framing the training speech, taking every 1000 frames as one sample; labeling each sample with its class; and using the labeled samples as training samples for the deep neural network model.
  • the purpose of truncating every 1000 frames is to give the model a fixed-length input: recordings of different lengths produce different MFCC (Mel-Frequency Cepstral Coefficients) feature distributions, and if the input features are not fixed, the model's recognition easily becomes inaccurate. Recordings shorter than 1000 frames but longer than 100 frames are padded with all-zero frames; recordings shorter than 100 frames are discarded as containing no speech.
  • the advantage of making every 1000 frames of all recordings a training sample is that the model can learn the sound characteristics of each type of speech over every time period, which is more robust than training on a single 1000-frame segment of a single recording.
  • in the training phase, each input recording is tagged with a label, such as the genuine class [0000], the first type of forgery [0100], the second type of forgery [0010], and the third type of forgery [0001].
  • the genuine class is, as the name implies, real speech; forged speech is divided into three types: the first type is music forgery, the second type is recording-replay forgery, and the third type is technical voice forgery.
  • the first type of forgery refers to music fed into the voiceprint recognition input: because music contains rich sound components, it can pass voice registration and verification normally, but it carries no information about the speaker's voice and so is not the target recording of voiceprint recognition.
  • the second type of forgery is mainly a simple replay of recordings, such as recording a target person's speech or music with a voice recorder, a mobile phone, or similar devices and then replaying it directly into the input of voiceprint recognition.
  • the third type of forgery mainly refers to forging the target person's speech with speech synthesis or voice conversion techniques.
  • speech synthesis generally collects a certain amount of the target person's voice data and can then synthesize speech of the target person for specified text content; voice conversion directly alters the spectrum of an original recording. Because this type of forgery involves a great deal of speech signal processing technology, it is called technical forgery.
  • model training is carried out with the open-source Keras framework. Considering hardware limitations, DNN training uses minibatch technology with a batch size of 128; each iteration trains 1000 batches, and N iterations are trained in total. Each batch randomly selects 128 speech MFCC feature samples from the full data set, produces the model output, and then updates the model parameters by backward feedback according to the loss function, completing one batch computation; 1000 such batches complete one iteration and yield that iteration's model output.
  • in general, the model with the best loss over 50 iterations is selected: the convolution kernel of the first layer is 9*20, Nfilters is set to 512, the loss function is set to the categorical cross-entropy (categorical_crossentropy) over all classes, and the optimizer is Adagrad.
  • Step S302: acquire the speech to be detected and frame it to obtain a 1000*20-dimensional matrix.
  • the voice processing module 202 is specifically configured to: after framing the speech to be detected, extract 1000 frames and compute a 20-dimensional MFCC feature for each; and generate the 1000*20-dimensional matrix from the 20-dimensional MFCCs of the 1000 frames.
  • the framing of the speech to be detected is the same as the processing of the training speech described above.
  • the computation of MFCC features is a conventional algorithm, and the present application does not repeat it here.
  • Step S303: input the 1000*20-dimensional matrix into the optimal deep neural network model.
  • the input layer of the obtained optimal deep neural network model (DNN) takes a matrix, so the 1000*20-dimensional matrix obtained by the voice processing module 202 can be input directly into the model.
  • Step S304: calculate on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector, where the 1*4-dimensional output vector represents the four speech categories.
  • the matrix calculation module 204 convolves the input features with a 1000*20 convolution kernel in the first layer of the DNN model.
  • the purpose of this layer is to project the features of adjacent frames; the number of convolution kernels is controlled by Nfilters, and each kernel yields one channel feature after convolution, giving N channel features.
  • in the second to fourth layers, 1*1 convolution kernels are used with the LeakyReLU activation function; these 1*1 kernels allow the channels to connect and interact, so the model learns more intra-frame and inter-frame features. The fifth layer performs pooling, extracting the maximum over each 2*2 kernel range (2*2 MaxPooling) with a default stride of 1*1; this layer subsamples the upper-layer nodes, reducing the model parameters and making overfitting less likely. The sixth layer flattens the output nodes of the previous layer into a 1*P-dimensional feature. The seventh layer, a linear layer, reduces the dimension of the sixth layer's output to obtain Out7, which is passed through a softmax activation to produce the 1*4 output vector, i.e., four values, as the detection result.
  • Step S305: select the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
  • the 1*4 dimensional output vector is a value in the range of 0 to 1.
  • the 1*4-dimensional output vector expresses, through four decimals in the 0-1 range, the probability of belonging to the corresponding class, namely the probabilities of genuine speech and of the first, second, and third types of forgery.
  • the class with the largest of the four probabilities represents the category of the input speech; that is, the output values directly and effectively indicate whether the speech to be detected is live, i.e., genuine, speech.
  • the deep learning-based voice living-body detection method proposed by the present application first trains the deep neural network model to obtain an optimal deep neural network model; second, acquires the speech to be detected and frames it to obtain a 1000*20-dimensional matrix; third, inputs the 1000*20-dimensional matrix into the optimal deep neural network model; then calculates on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector that represents the four speech categories; and finally selects the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
  • the present application further provides another embodiment, namely a storage medium storing a deep learning-based voice living-body detection program, where the deep learning-based voice living-body detection program can be executed by at least one processor to cause the at least one processor to perform the steps of the deep learning-based voice living-body detection method described above.
  • the methods of the foregoing embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, or the part of it that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) that includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.

Abstract

Disclosed is a voice living-body detection method based on deep learning, applied to a server. The method comprises: training a deep neural network model to obtain an optimal deep neural network model; acquiring voice to be detected and framing the voice to be detected to obtain a 1000*20-dimensional matrix; inputting the 1000*20-dimensional matrix into the optimal deep neural network model; calculating the 1000*20-dimensional matrix by using the optimal deep neural network model to obtain a 1*4-dimensional output vector, wherein the 1*4-dimensional output vector represents four voice categories; and selecting the category with the maximum value from the 1*4-dimensional output vector as the category of the voice to be detected. Further provided are a server and a storage medium. By implementing the above-mentioned solution, a higher level of security can be provided for voice control, and the development of voice recognition technology is promoted.

Description

Voice living-body detection method based on deep learning, server, and storage medium
Priority claim
This application claims priority to Chinese Patent Application No. 201810029892.6, filed with the Chinese Patent Office on January 12, 2018 and entitled "Voice Living-Body Detection Method Based on Deep Learning, Server and Storage Medium", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of computer technologies, and in particular, to a voice living-body detection method based on deep learning, a server, and a storage medium.
Background
With the continuous development of speech recognition technology, speech recognition applications are increasing, including voice control, voice payment, and so on. At present, however, speech recognition generally only recognizes semantics and cannot reliably distinguish whether speech was produced by a person or is some other recorded input. With Apple's Siri, for example, whether it is the owner speaking or a recording, once "hi, siri" is input, the terminal device wakes up; the source of the voice cannot be distinguished. Liveness detection for speech is therefore particularly important. Voice living-body detection determines whether the input is a real person speaking; speech not from a real person is generally called a forged recording, and includes music input, recording replay, and speech generated by technical means such as speech synthesis. Forged recordings are often used in the financial and security fields: attackers break into voiceprint recognition with forged recordings and log into victims' accounts to steal money or damage others' reputation and property.
Summary of the invention
In view of this, the present application provides a voice living-body detection method based on deep learning, a server, and a storage medium, so that before voice is used in a corresponding application, it can be quickly detected whether the voice was output directly by the user or maliciously forged by another person; this provides a higher level of security for voice control and promotes the development of voice recognition technology.
First, to achieve the above object, the present application provides a server, where the server includes a memory and a processor, and the memory stores a deep learning-based voice living-body detection program executable on the processor. When the deep learning-based voice living-body detection program is executed by the processor, the following steps are performed: training a deep neural network model to obtain an optimal deep neural network model; acquiring the speech to be detected and framing it to obtain a 1000*20-dimensional matrix; inputting the 1000*20-dimensional matrix into the optimal deep neural network model; calculating on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector, where the 1*4-dimensional output vector represents four speech categories; and selecting the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
In addition, to achieve the above object, the present application further provides a voice living-body detection method based on deep learning, applied to a server, the method including the following steps: training a deep neural network model to obtain an optimal deep neural network model; acquiring the speech to be detected and framing it to obtain a 1000*20-dimensional matrix; inputting the 1000*20-dimensional matrix into the optimal deep neural network model; calculating on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector, where the 1*4-dimensional output vector represents four speech categories; and selecting the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
Further, to achieve the above object, the present application also provides a storage medium storing a deep learning-based voice living-body detection program, where the deep learning-based voice living-body detection program can be executed by at least one processor to cause the at least one processor to perform the steps of the deep learning-based voice living-body detection method described above.
Compared with the prior art, the deep learning-based voice living-body detection method, server, and storage medium proposed by the present application first train the deep neural network model to obtain an optimal deep neural network model; second, acquire the speech to be detected and frame it to obtain a 1000*20-dimensional matrix; third, input the 1000*20-dimensional matrix into the optimal deep neural network model; then calculate on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector that represents the four speech categories; and finally select the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected. In this way, before voice is used in a corresponding application, it can be quickly detected whether the voice was output directly by the user or maliciously forged by another person; this provides a higher level of security for voice control and promotes the development of voice recognition technology.
Brief description of the drawings
FIG. 1 is a schematic diagram of an optional hardware architecture of the server of the present application;
FIG. 2 is a program module diagram of a first embodiment of the deep learning-based voice living-body detection program of the present application;
FIG. 3 is a flowchart of a first embodiment of the deep learning-based voice living-body detection method of the present application. Reference numerals: (provided as images in the original publication)
The implementation, functional features, and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description
In order to make the objects, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present application and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
It should be noted that descriptions involving "first", "second", and the like in the present application are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with one another, but only on the basis that a person of ordinary skill in the art can realize the combination; when a combination of technical solutions is contradictory or impossible to implement, the combination should be considered nonexistent and outside the protection scope claimed by the present application.
Referring to FIG. 1, it is a schematic diagram of an optional hardware architecture of the server 1.
The server 1 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server, and may be a standalone server or a server cluster composed of multiple servers.
In this embodiment, the server 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicably connected to one another through a system bus.
The server 1 connects to a network through the network interface 13 to obtain information. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or a telephone network.
It should be noted that FIG. 1 only shows the server 1 with components 11-13, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
The memory 11 includes at least one type of storage medium, including flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 11 may be an internal storage unit of the server 1, such as a hard disk or memory of the server 1. In other embodiments, the memory 11 may also be an external storage device of the server 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card (Flash Card) equipped on the server 1. Of course, the memory 11 may also include both the internal storage unit of the server 1 and its external storage device. In this embodiment, the memory 11 is generally used to store the operating system installed on the server 1 and various types of application software, such as the program code of the deep learning-based voice living-body detection program 200. In addition, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 12 may in some embodiments be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is generally used to control the overall operation of the server 1, such as performing control and processing related to data interaction or communication. In this embodiment, the processor 12 is configured to run the program code or process the data stored in the memory 11, for example to run the deep learning-based voice living-body detection program 200.
The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the server 1 and other electronic devices.
In this embodiment, the deep learning-based voice living-body detection program 200 is installed and runs on the server 1. When the deep learning-based voice living-body detection program 200 runs, the server 1 trains the deep neural network model to obtain an optimal deep neural network model; acquires the speech to be detected and frames it to obtain a 1000*20-dimensional matrix; inputs the 1000*20-dimensional matrix into the optimal deep neural network model; calculates on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector that represents the four speech categories; and selects the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected. In this way, before voice is used in a corresponding application, it can be quickly detected whether the voice was output directly by the user or maliciously forged by another person; this provides a higher level of security for voice control and promotes the development of voice recognition technology.
So far, the application environment of the embodiments of the present application and the hardware structure and functions of the related devices have been described in detail. Below, various embodiments of the present application are proposed based on the above application environment and related devices.
First, the present application proposes a deep learning-based voice living-body detection program 200.
Referring to FIG. 2, it is a program module diagram of the first embodiment of the deep learning-based voice living-body detection program 200 of the present application.
In this embodiment, the server 1 includes a series of computer program instructions stored in the memory 11, namely the deep learning-based voice living-body detection program 200; when these computer program instructions are executed by the processor 12, the deep learning-based voice living-body detection operations of the embodiments of the present application can be implemented. In some embodiments, the deep learning-based voice living-body detection program 200 can be divided into one or more modules based on the particular operations implemented by the various portions of the computer program instructions. For example, in FIG. 2, the deep learning-based voice living-body detection program 200 is divided into a training module 201, a voice processing module 202, a matrix input module 203, a matrix calculation module 204, and a determination module 205, wherein:
The training module 201 is configured to train the deep neural network model to obtain an optimal deep neural network model.
Specifically, the training module 201 is configured to frame the training speech, taking every 1000 frames as one sample; to label each sample with its class; and to use the labeled samples as training samples for the deep neural network model.
In this embodiment, the purpose of truncating every 1000 frames is to give the model a fixed-length input: recordings of different lengths produce different MFCC (Mel-Frequency Cepstral Coefficients) feature distributions, and if the input features are not fixed, the model's recognition easily becomes inaccurate. For recordings shorter than 1000 frames but longer than 100 frames, all-zero frames are appended; recordings shorter than 100 frames are discarded directly, on the assumption that no one is speaking in them. The advantage of making every 1000 frames of all recordings a training sample is that the model can learn the sound characteristics of each type of speech over every time period, which is more robust than training on a single 1000-frame segment of a single recording.
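As a concrete illustration of this sample-construction rule, the following is a minimal NumPy sketch. It assumes the per-frame 20-dimensional MFCC features have already been computed as a (T, 20) array; the function name make_samples and the extension of the padding rule to the leftover tail of long recordings are assumptions, since the text only states the rule per recording.

```python
import numpy as np

def make_samples(frames: np.ndarray) -> list:
    """Cut a (T, 20) array of per-frame MFCC features into (1000, 20) samples.

    Rule described above: every full 1000 frames is one sample; a leftover
    of 100..999 frames is padded with all-zero frames; a leftover shorter
    than 100 frames is discarded (treated as containing no speech).
    """
    samples = []
    t = frames.shape[0]
    for start in range(0, t - t % 1000, 1000):   # full 1000-frame chunks
        samples.append(frames[start:start + 1000])
    rest = frames[t - t % 1000:]                 # remainder of the recording
    if len(rest) >= 100:
        pad = np.zeros((1000 - len(rest), frames.shape[1]))
        samples.append(np.vstack([rest, pad]))
    return samples
```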
In the training phase, each input recording is tagged with a label: the genuine class is [0000], the first type of forgery [0100], the second type of forgery [0010], and the third type of forgery [0001]. Specifically, the genuine class is, as the name implies, real speech, while forged speech is divided into three types: the first type is music forgery, the second type is recording-replay forgery, and the third type is technical voice forgery. The first type of forgery refers to music fed into the voiceprint recognition input: because music contains rich sound components, it can pass voice registration and verification normally, but it carries no information about the speaker's voice and so is not the target recording of voiceprint recognition. The second type of forgery is mainly a simple replay of recordings, such as recording a target person's speech or music with a voice recorder, a mobile phone, or similar devices and then replaying it directly into the input of voiceprint recognition. The third type of forgery mainly refers to forging the target person's speech with speech synthesis or voice conversion techniques: speech synthesis generally collects a certain amount of the target person's voice data and can then synthesize speech of the target person for specified text content, while voice conversion directly alters the spectrum of an original recording. Because this type of forgery involves a great deal of speech signal processing technology, it is called technical forgery.
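The label codes above can be captured in a small, hedged mapping; the class names below are hypothetical shorthand. Note that the text writes the genuine class as [0000], whereas a standard one-hot encoding for categorical_crossentropy training, as produced by Keras' to_categorical, would be [1 0 0 0] for class index 0.

```python
from tensorflow.keras.utils import to_categorical

# Hypothetical class names for the four categories described above.
CLASSES = ["genuine", "music_forgery", "replay_forgery", "technical_forgery"]

# Standard one-hot labels (index 0 -> [1, 0, 0, 0], etc.).
LABELS = {name: to_categorical(i, num_classes=4) for i, name in enumerate(CLASSES)}
```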
As to how the training samples are used to train the DNN (deep neural network), briefly: model training is carried out with the open-source Keras framework. Considering hardware limitations, DNN training uses minibatch technology with a batch size of 128; each iteration trains 1000 batches, and N iterations are trained in total. Each batch randomly selects 128 speech MFCC feature samples from the full data set, produces the model output, and then updates the model parameters by backward feedback according to the loss function, completing one batch computation; 1000 such batches complete one iteration and yield that iteration's model output. In general, the model with the best loss over 50 iterations is selected: the convolution kernel of the first layer is 9*20, Nfilters is set to 512, the loss function is set to the categorical cross-entropy (categorical_crossentropy) over all classes, and the optimizer is Adagrad.
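A minimal sketch of that training recipe follows, assuming the samples and one-hot labels have been collected into arrays x_all (shape (num_samples, 1000, 20)) and y_all (shape (num_samples, 4)), and that build_model is the architecture sketched further below; both names are assumptions.

```python
import numpy as np

model = build_model()                       # architecture sketched below
model.compile(optimizer="adagrad", loss="categorical_crossentropy")

best_loss, best_weights = float("inf"), None
for iteration in range(50):                 # 50 iterations, as stated above
    batch_losses = []
    for _ in range(1000):                   # 1000 batches per iteration
        idx = np.random.choice(len(x_all), size=128, replace=False)
        batch_losses.append(model.train_on_batch(x_all[idx], y_all[idx]))
    mean_loss = float(np.mean(batch_losses))
    if mean_loss < best_loss:               # keep the best-loss model
        best_loss, best_weights = mean_loss, model.get_weights()
model.set_weights(best_weights)
```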
The voice processing module 202 is configured to acquire the speech to be detected and frame it to obtain a 1000*20-dimensional matrix.
In this embodiment, the voice processing module 202 is specifically configured to: after framing the speech to be detected, extract 1000 frames and compute a 20-dimensional MFCC feature for each; and generate the 1000*20-dimensional matrix from the 20-dimensional MFCCs of the 1000 frames.
The framing of the speech to be detected is the same as the processing of the training speech described above: for recordings shorter than 1000 frames but longer than 100 frames, all-zero frames are appended; recordings shorter than 100 frames are discarded directly, on the assumption that no one is speaking in them. The computation of MFCC features is a conventional algorithm, and the present application does not repeat it here.
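Since the patent does not name an MFCC implementation, the sketch below uses librosa as an assumed choice; the function name speech_to_matrix is hypothetical. It applies the framing rule above to produce the 1000*20 input matrix for a single recording.

```python
import librosa
import numpy as np

def speech_to_matrix(path: str) -> np.ndarray:
    """Load a recording and build the 1000*20 MFCC matrix described above."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape (20, T)
    frames = mfcc.T                                      # shape (T, 20)
    t = frames.shape[0]
    if t < 100:
        raise ValueError("under 100 frames: treated as containing no speech")
    if t < 1000:                                         # pad with all-zero frames
        frames = np.vstack([frames, np.zeros((1000 - t, 20))])
    return frames[:1000]
```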
The matrix input module 203 is configured to input the 1000*20-dimensional matrix into the optimal deep neural network model.
In this embodiment, the input layer of the obtained optimal deep neural network model (DNN) takes a matrix, so the 1000*20-dimensional matrix obtained by the voice processing module 202 can be input directly into the model.
The matrix calculation module 204 is configured to calculate on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector, where the 1*4-dimensional output vector represents the four speech categories.
Specifically, the matrix calculation module 204 convolves the input features with a 1000*20 convolution kernel in the first layer of the DNN model; the purpose of this layer is to project the features of adjacent frames, the number of convolution kernels is controlled by Nfilters, and each kernel yields one channel feature after convolution, giving N channel features. In the second to fourth layers, 1*1 convolution kernels are used with the LeakyReLU activation function; these 1*1 kernels allow the channels to connect and interact, so the model learns more intra-frame and inter-frame features. The fifth layer performs pooling, extracting the maximum over each 2*2 kernel range (2*2 MaxPooling) with a default stride of 1*1; this layer subsamples the upper-layer nodes, reducing the model parameters and making overfitting less likely. The sixth layer flattens the output nodes of the previous layer into a 1*P-dimensional feature. The seventh layer, a linear layer, reduces the dimension of the sixth layer's output to obtain Out7, which is passed through a softmax activation to produce the 1*4 output vector, i.e., four values, as the detection result.
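A hedged Keras sketch of this seven-layer architecture follows. Two points are assumptions rather than statements of the patent: the text gives the first-layer kernel as 1000*20 here but as 9*20 in the training configuration above, and a kernel spanning all 20 MFCC dimensions is equivalent to a one-dimensional convolution over the time axis, so the sketch uses Conv1D with the 9-frame kernel; likewise, the stated 2*2 pooling is rendered as 1D max pooling over time so that the layer shapes compose.

```python
from tensorflow.keras import layers, models

def build_model(nfilters: int = 512) -> models.Model:
    inp = layers.Input(shape=(1000, 20))            # 1000 frames x 20 MFCCs
    # Layer 1: adjacent-frame feature projection; a 9*20 kernel covering all
    # 20 feature dimensions acts as a 1D convolution with kernel size 9,
    # producing nfilters channel features.
    x = layers.Conv1D(nfilters, kernel_size=9)(inp)
    # Layers 2-4: 1*1 convolutions with LeakyReLU let the channels interact.
    for _ in range(3):
        x = layers.Conv1D(nfilters, kernel_size=1)(x)
        x = layers.LeakyReLU()(x)
    # Layer 5: max pooling with the default stride of 1.
    x = layers.MaxPooling1D(pool_size=2, strides=1)(x)
    # Layer 6: flatten to a 1*P feature vector.
    x = layers.Flatten()(x)
    # Layer 7: linear dimension reduction to 4 (Out7), then softmax -> 1*4.
    out = layers.Softmax()(layers.Dense(4)(x))
    return models.Model(inp, out)
```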
The determination module 205 is configured to select the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected, where the 1*4-dimensional output vector contains values in the range 0 to 1.
In this embodiment, the 1*4-dimensional output vector expresses, through four decimals in the 0-1 range, the probability of belonging to the corresponding class, namely the probabilities of genuine speech and of the first, second, and third types of forgery. The class with the largest of the four probabilities represents the category of the input speech; that is, the output values directly and effectively indicate whether the speech to be detected is live, i.e., genuine, speech.
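Tying the sketches together, detection of a single recording might look like the following; the file name is a placeholder.

```python
import numpy as np

matrix = speech_to_matrix("utterance_to_check.wav")   # (1000, 20)
probs = model.predict(matrix[np.newaxis, ...])[0]     # (4,), values in 0..1
category = CLASSES[int(np.argmax(probs))]             # largest probability wins
print(f"detected category: {category} (p={probs.max():.3f})")
```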
Through the above program modules 201-205, the server proposed by the present application trains the deep neural network model to obtain an optimal deep neural network model; acquires the speech to be detected and frames it to obtain a 1000*20-dimensional matrix; inputs the 1000*20-dimensional matrix into the optimal deep neural network model; calculates on the 1000*20-dimensional matrix with the optimal deep neural network model to obtain a 1*4-dimensional output vector that represents the four speech categories; and selects the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected. In this way, before voice is used in a corresponding application, it can be quickly detected whether the voice was output directly by the user or maliciously forged by another person; this provides a higher level of security for voice control and promotes the development of voice recognition technology.
此外,本申请还提出一种基于深度学习的语音活体检测方法。In addition, the present application also proposes a voice living body detection method based on deep learning.
Referring to FIG. 3, which is a schematic flowchart of the implementation of the first embodiment of the deep learning-based voice living-body detection method of the present application. In this embodiment, the order of execution of the steps in the flowchart shown in FIG. 3 may be changed according to different requirements, and some steps may be omitted.
Step S301, training the deep neural network model to obtain an optimal deep neural network model.
Specifically, the above step includes: framing the training speech and taking every 1000 frames as one sample; assigning a class label to each sample; and using the labeled samples as training samples for the deep neural network model.
In this embodiment, the purpose of truncating the speech every 1000 frames is to give the model a fixed-length input: recordings of different lengths produce different MFCC (Mel-Frequency Cepstral Coefficients) feature distributions, and if the input features are not of fixed size the model's recognition easily becomes inaccurate. For recordings shorter than 1000 frames but longer than 100 frames, all-zero frames are appended to reach 1000 frames; recordings shorter than 100 frames are simply discarded, on the assumption that no one is speaking in them. Taking every 1000 frames of every recording as a training sample lets the model learn the acoustic characteristics of each time segment of that class of speech, which is more robust than training on a single 1000-frame segment of a recording.
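By way of non-limiting illustration, the framing rule above may be sketched in Python as follows (the function name and the NumPy frames-as-rows layout are assumptions of this sketch, not part of the application):

    import numpy as np

    def make_samples(frames, sample_len=1000, min_len=100):
        """Split a (num_frames, 20) MFCC array into fixed 1000*20 samples.

        Chunks shorter than min_len frames are discarded (treated as
        containing no speech); chunks shorter than sample_len frames are
        padded with all-zero frames, as described above.
        """
        samples = []
        for start in range(0, len(frames), sample_len):
            chunk = frames[start:start + sample_len]
            if len(chunk) < min_len:
                continue  # shorter than 100 frames: assume nobody is speaking
            if len(chunk) < sample_len:
                pad = np.zeros((sample_len - len(chunk), frames.shape[1]))
                chunk = np.vstack([chunk, pad])
            samples.append(chunk)
        return samples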
In the training phase, every input recording is labeled, e.g. the genuine class as [1000], type-1 forgery as [0100], type-2 forgery as [0010] and type-3 forgery as [0001]. Specifically, the genuine class is, as the name suggests, real speech, while forged speech is divided into three types: type-1 forgery is music, type-2 forgery is recording replay, and type-3 forgery is technical voice forgery. Type-1 forgery refers to music being fed to the voiceprint recognition input: because music contains rich acoustic components it can pass voice registration and verification normally, but it carries no information about the speaker's voice and is therefore not a target recording for voiceprint recognition. Type-2 forgery is mainly simple replay of recordings, e.g. recording the target person's speech or music with a voice recorder or mobile phone and then replaying it directly into the voiceprint recognition input. Type-3 forgery mainly refers to forging the target person's speech by speech synthesis or voice conversion: speech synthesis generally collects a certain amount of the target person's voice data and then synthesizes speech for any specified text, while voice conversion directly modifies the spectrum of an original recording. Because this type of forgery involves a large amount of speech signal processing technology, it is called technical forgery.
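Written out as data, the labeling scheme amounts to the following one-hot mapping (the class names are illustrative, since the application only numbers the forgery types; the genuine-class vector is assumed to be [1,0,0,0]):

    # One-hot labels for the four speech categories described above
    LABELS = {
        "genuine":   [1, 0, 0, 0],
        "music":     [0, 1, 0, 0],  # type-1 forgery
        "replay":    [0, 0, 1, 0],  # type-2 forgery
        "technical": [0, 0, 0, 1],  # type-3 forgery (synthesis or conversion)
    }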
As for how the training samples are used to train the DNN (deep neural network), briefly: model training uses the open-source Keras framework. In view of hardware limitations, minibatch training is adopted, with a batch size of 128; each iteration trains 1000 batches, and N iterations are trained in total. Each batch randomly selects 128 speech MFCC feature samples from the full data set, produces the model output, and then updates the model parameters by back-propagation according to the loss function, completing one batch computation; 1000 such batches complete the 1000 training steps of one iteration and yield that iteration's model output. In general, the model with the best loss over 50 iterations is selected. The convolution kernel of the first layer is 9*20 and Nfilters is set to 512; the loss function is the categorical cross-entropy over all classes, categorical_crossentropy, and the optimizer is adagrad.
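By way of non-limiting illustration, the seven-layer network and training configuration described here and in step S304 below might be sketched with Keras as follows; the padding choices, the width of the linear seventh layer, and the data pipeline are assumptions the text does not fix:

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        # Layer 1: adjacent-frame feature projection, Nfilters = 512
        layers.Conv2D(512, kernel_size=(9, 20), input_shape=(1000, 20, 1)),
        # Layers 2-4: 1*1 convolutions with LeakyReLU let channels interact
        layers.Conv2D(512, (1, 1)), layers.LeakyReLU(),
        layers.Conv2D(512, (1, 1)), layers.LeakyReLU(),
        layers.Conv2D(512, (1, 1)), layers.LeakyReLU(),
        # Layer 5: 2*2 max pooling, stride 1*1 ("same" padding is an assumption)
        layers.MaxPooling2D(pool_size=(2, 2), strides=(1, 1), padding="same"),
        # Layer 6: flatten to a 1*P feature vector
        layers.Flatten(),
        # Layer 7: linear dimensionality reduction, then softmax over 4 classes
        layers.Dense(64),  # the width of the linear layer is an assumption
        layers.Dense(4, activation="softmax"),
    ])

    model.compile(optimizer="adagrad", loss="categorical_crossentropy")
    # model.fit(x_train, y_train, batch_size=128, steps_per_epoch=1000, epochs=50)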
Step S302, acquiring the speech to be detected and framing the speech to be detected to obtain a 1000*20-dimensional matrix.
In this embodiment, the speech processing module 202 is specifically configured to: after framing the speech to be detected, extract 1000 frames and compute a 20-dimensional MFCC feature for each frame; and generate the 1000*20-dimensional matrix from the 20-dimensional MFCCs of the 1000 frames.
In this embodiment, the framing of the speech to be detected is the same as the processing of the training speech described above: for recordings shorter than 1000 frames but longer than 100 frames, all-zero frames are appended; recordings shorter than 100 frames are simply discarded, on the assumption that no one is speaking in them. The computation of the MFCC features is a conventional algorithm, which this application does not describe further.
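Since the MFCC computation is left to conventional tooling, the detection-side feature extraction may be sketched as follows (the use of librosa and the reuse of the make_samples helper from the training sketch above are assumptions of this sketch):

    import librosa

    def speech_to_matrix(wav_path):
        """Return a 1000*20 MFCC matrix for the recording, or None if too short."""
        signal, sr = librosa.load(wav_path, sr=None)
        # librosa returns (n_mfcc, num_frames); transpose to frames-as-rows
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20).T
        samples = make_samples(mfcc)  # pad/split exactly as for training speech
        return samples[0] if samples else None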
Step S303, inputting the 1000*20-dimensional matrix into the optimal deep neural network model.
In this embodiment, the input layer of the optimal deep neural network model (DNN) obtained above takes a matrix as input, so the 1000*20-dimensional matrix produced by the speech processing module 202 can be fed directly into the optimal deep neural network model.
Step S304, using the optimal deep neural network model to compute on the 1000*20-dimensional matrix a 1*4-dimensional output vector, where the 1*4-dimensional output vector represents four speech categories.
Specifically, the matrix calculation module 204 convolves the input features with a 1000*20 convolution kernel in the first layer of the DNN model. The purpose of this layer is to project the features of adjacent frames, and the Nfilters setting (the number of filters) controls how each convolution kernel yields N channel features after convolution. In the second to fourth layers, 1*1 convolution kernels are used for convolution together with the LeakyReLU activation function; the role of these 1*1 kernels is to let the channels connect and interact, so that the model learns more intra-frame and inter-frame features. The fifth layer performs pooling, extracting the maximum over each 2*2 kernel range (2*2 max pooling with the stride left at its default of 1*1); this layer keeps only selected upper-layer nodes, reducing the model parameters and making overfitting less likely. The sixth layer performs flattening, i.e. the output nodes of the previous layer are flattened into a 1*P-dimensional feature vector. The seventh layer, a linear layer, reduces the dimensionality of the sixth layer's output to obtain the output Out7; Out7 is then fed through a softmax activation function, producing a 1*4 vector, i.e. four values, as the detection result.
Step S305, selecting the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected, where the values of the 1*4-dimensional output vector lie in the range 0 to 1.
In this embodiment, the four values of the 1*4-dimensional output vector are decimals in the range 0-1 that represent the probabilities of the corresponding classes, i.e. the probabilities of genuine speech and of type-1, type-2 and type-3 forgeries. The class with the largest of these four probabilities represents the category of the input speech, so the output values directly and effectively indicate whether the speech to be detected is live speech, i.e. genuine speech.
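Selecting the class is then a single argmax over the softmax output; in the following sketch, model and the class names carry over from the earlier sketches and are assumptions of this illustration:

    import numpy as np

    CLASSES = ["genuine", "music", "replay", "technical"]

    def classify(matrix_1000x20):
        # Keras expects a batch axis and a channel axis: (1, 1000, 20, 1)
        probs = model.predict(matrix_1000x20[np.newaxis, ..., np.newaxis])[0]
        return CLASSES[int(np.argmax(probs))]  # class with the largest probability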
Through the above steps S301-S305, the deep learning-based voice living-body detection method proposed in the present application first trains the deep neural network model to obtain an optimal deep neural network model; second, acquires the speech to be detected and frames it to obtain a 1000*20-dimensional matrix; third, inputs the 1000*20-dimensional matrix into the optimal deep neural network model; then uses the optimal deep neural network model to compute on the 1000*20-dimensional matrix a 1*4-dimensional output vector that represents four speech categories; and finally selects the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected. In this way, before speech is used in an application, it can be quickly determined whether the speech was produced directly by the user or is someone else's maliciously forged speech; this provides a higher level of security for voice-controlled systems and promotes the development of speech recognition technology.
The present application further provides another embodiment, namely a storage medium storing a deep learning-based voice living-body detection program, where the deep learning-based voice living-body detection program is executable by at least one processor to cause the at least one processor to perform the steps of the deep learning-based voice living-body detection method described above.
The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the various embodiments of the present application.
The above are only preferred embodiments of the present application and do not thereby limit the patent scope of the present application; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (20)

  1. A deep learning-based voice living-body detection method, applied to a server, wherein the method comprises the following steps:
    training a deep neural network model to obtain an optimal deep neural network model;
    acquiring speech to be detected and framing the speech to be detected to obtain a 1000*20-dimensional matrix;
    inputting the 1000*20-dimensional matrix into the optimal deep neural network model;
    using the optimal deep neural network model to compute on the 1000*20-dimensional matrix a 1*4-dimensional output vector, the 1*4-dimensional output vector representing four speech categories; and
    selecting the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
  2. The deep learning-based voice living-body detection method according to claim 1, wherein the step of training a deep neural network model to obtain an optimal deep neural network model comprises:
    framing training speech and taking every 1000 frames as one sample;
    assigning a class label to each sample; and
    using the labeled samples as training samples for the deep neural network model.
  3. The deep learning-based voice living-body detection method according to claim 2, wherein the step of framing training speech and taking every 1000 frames as one sample specifically comprises:
    for training speech shorter than 1000 frames but longer than 100 frames, appending all-zero frames after the training speech to reach 1000 frames; and
    for training speech shorter than 100 frames, discarding it directly.
  4. The deep learning-based voice living-body detection method according to claim 1, wherein the step of acquiring speech to be detected and framing the speech to be detected to obtain a 1000*20-dimensional matrix specifically comprises:
    after framing the speech to be detected, extracting 1000 frames and computing a 20-dimensional MFCC feature for each frame; and
    generating the 1000*20-dimensional matrix from the 20-dimensional MFCCs of the 1000 frames.
  5. The deep learning-based voice living-body detection method according to claim 1, wherein the step of using the optimal deep neural network model to compute on the 1000*20-dimensional matrix a 1*4-dimensional output vector comprises:
    in the first layer, convolving the input features with a 1000*20 convolution kernel;
    in the second to fourth layers, convolving with 1*1 convolution kernels and using the LeakyReLU activation function;
    in the fifth layer, pooling by extracting the maximum over each 2*2 kernel range;
    in the sixth layer, flattening;
    in the seventh layer, reducing the dimensionality of the sixth layer's output to obtain an output Out7; and
    in the seventh layer, taking the Out7 as input and applying a softmax activation function to output a 1*4 vector as the detection result.
  6. The deep learning-based voice living-body detection method according to claim 5, wherein the values of the 1*4-dimensional output vector lie in the range 0 to 1.
  7. The deep learning-based voice living-body detection method according to claim 1, wherein the four speech categories comprise genuine speech, type-1 forged speech, type-2 forged speech and type-3 forged speech.
  8. A server, wherein the server comprises a memory and a processor, the memory storing a deep learning-based voice living-body detection program executable on the processor, and the deep learning-based voice living-body detection program, when executed by the processor, implements the following steps:
    training a deep neural network model to obtain an optimal deep neural network model;
    acquiring speech to be detected and framing the speech to be detected to obtain a 1000*20-dimensional matrix;
    inputting the 1000*20-dimensional matrix into the optimal deep neural network model;
    using the optimal deep neural network model to compute on the 1000*20-dimensional matrix a 1*4-dimensional output vector, the 1*4-dimensional output vector representing four speech categories; and
    selecting the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
  9. The server according to claim 8, wherein when the deep learning-based voice living-body detection program is executed by the processor, the step of training a deep neural network model to obtain an optimal deep neural network model comprises:
    framing training speech and taking every 1000 frames as one sample;
    assigning a class label to each sample; and
    using the labeled samples as training samples for the deep neural network model.
  10. The server according to claim 9, wherein when the deep learning-based voice living-body detection program is executed by the processor, the step of framing training speech and taking every 1000 frames as one sample specifically comprises:
    for training speech shorter than 1000 frames but longer than 100 frames, appending all-zero frames after the training speech to reach 1000 frames; and
    for training speech shorter than 100 frames, discarding it directly.
  11. The server according to claim 8, wherein when the deep learning-based voice living-body detection program is executed by the processor, the step of acquiring speech to be detected and framing the speech to be detected to obtain a 1000*20-dimensional matrix specifically comprises:
    after framing the speech to be detected, extracting 1000 frames and computing a 20-dimensional MFCC feature for each frame; and
    generating the 1000*20-dimensional matrix from the 20-dimensional MFCCs of the 1000 frames.
  12. The server according to claim 8, wherein when the deep learning-based voice living-body detection program is executed by the processor, the step of using the optimal deep neural network model to compute on the 1000*20-dimensional matrix a 1*4-dimensional output vector comprises:
    in the first layer, convolving the input features with a 1000*20 convolution kernel;
    in the second to fourth layers, convolving with 1*1 convolution kernels and using the LeakyReLU activation function;
    in the fifth layer, pooling by extracting the maximum over each 2*2 kernel range;
    in the sixth layer, flattening;
    in the seventh layer, reducing the dimensionality of the sixth layer's output to obtain an output Out7; and
    in the seventh layer, taking the Out7 as input and applying a softmax activation function to output a 1*4 vector as the detection result.
  13. The server according to claim 12, wherein when the deep learning-based voice living-body detection program is executed by the processor, the values of the 1*4-dimensional output vector lie in the range 0 to 1.
  14. The server according to claim 8, wherein when the deep learning-based voice living-body detection program is executed by the processor, the four speech categories comprise genuine speech, type-1 forged speech, type-2 forged speech and type-3 forged speech.
  15. A storage medium, the storage medium storing a deep learning-based voice living-body detection program, wherein the deep learning-based voice living-body detection program, when executed by at least one processor, implements the following steps:
    training a deep neural network model to obtain an optimal deep neural network model;
    acquiring speech to be detected and framing the speech to be detected to obtain a 1000*20-dimensional matrix;
    inputting the 1000*20-dimensional matrix into the optimal deep neural network model;
    using the optimal deep neural network model to compute on the 1000*20-dimensional matrix a 1*4-dimensional output vector, the 1*4-dimensional output vector representing four speech categories; and
    selecting the class with the largest value in the 1*4-dimensional output vector as the category of the speech to be detected.
  16. The storage medium according to claim 15, wherein when the deep learning-based voice living-body detection program is executed by the processor, the step of training a deep neural network model to obtain an optimal deep neural network model comprises:
    framing training speech and taking every 1000 frames as one sample;
    assigning a class label to each sample; and
    using the labeled samples as training samples for the deep neural network model.
  17. The storage medium according to claim 16, wherein when the deep learning-based voice living-body detection program is executed by the processor, the step of framing training speech and taking every 1000 frames as one sample specifically comprises:
    for training speech shorter than 1000 frames but longer than 100 frames, appending all-zero frames after the training speech to reach 1000 frames; and
    for training speech shorter than 100 frames, discarding it directly.
  18. The storage medium according to claim 15, wherein when the deep learning-based voice living-body detection program is executed by the processor, the step of acquiring speech to be detected and framing the speech to be detected to obtain a 1000*20-dimensional matrix specifically comprises:
    after framing the speech to be detected, extracting 1000 frames and computing a 20-dimensional MFCC feature for each frame; and
    generating the 1000*20-dimensional matrix from the 20-dimensional MFCCs of the 1000 frames.
  19. The storage medium according to claim 15, wherein when the deep learning-based voice living-body detection program is executed by the processor, the step of using the optimal deep neural network model to compute on the 1000*20-dimensional matrix a 1*4-dimensional output vector comprises:
    in the first layer, convolving the input features with a 1000*20 convolution kernel;
    in the second to fourth layers, convolving with 1*1 convolution kernels and using the LeakyReLU activation function;
    in the fifth layer, pooling by extracting the maximum over each 2*2 kernel range;
    in the sixth layer, flattening;
    in the seventh layer, reducing the dimensionality of the sixth layer's output to obtain an output Out7; and
    in the seventh layer, taking the Out7 as input and applying a softmax activation function to output a 1*4 vector as the detection result.
  20. The storage medium according to claim 19, wherein when the deep learning-based voice living-body detection program is executed by the processor, the values of the 1*4-dimensional output vector lie in the range 0 to 1.
PCT/CN2018/089203 2018-01-12 2018-05-31 Voice living-body detection method based on deep learning, server and storage medium WO2019136909A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810029892.6A CN108281158A (en) 2018-01-12 2018-01-12 Voice living-body detection method based on deep learning, server and storage medium
CN201810029892.6 2018-01-12

Publications (1)

Publication Number Publication Date
WO2019136909A1 true

Family

ID=62803422

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/089203 WO2019136909A1 (en) 2018-01-12 2018-05-31 Voice living-body detection method based on deep learning, server and storage medium

Country Status (2)

Country Link
CN (1) CN108281158A (en)
WO (1) WO2019136909A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021031279A1 (en) * 2019-08-20 2021-02-25 东北大学 Deep-learning-based intelligent pneumonia diagnosis system and method for x-ray chest radiograph
EP4091164A4 (en) * 2020-01-13 2024-01-24 Univ Michigan Regents Secure automatic speaker verification system

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036459B (en) * 2018-08-22 2019-12-27 百度在线网络技术(北京)有限公司 Voice endpoint detection method and device, computer equipment and computer storage medium
CN109346089A (en) * 2018-09-27 2019-02-15 深圳市声扬科技有限公司 Living body identity identifying method, device, computer equipment and readable storage medium storing program for executing
CN109801638B (en) * 2019-01-24 2023-10-13 平安科技(深圳)有限公司 Voice verification method, device, computer equipment and storage medium
CN111933154B (en) * 2020-07-16 2024-02-13 平安科技(深圳)有限公司 Method, equipment and computer readable storage medium for recognizing fake voice
CN112489677B (en) * 2020-11-20 2023-09-22 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and medium based on neural network
CN112735431B (en) * 2020-12-29 2023-12-22 三星电子(中国)研发中心 Model training method and device and artificial intelligent dialogue recognition method and device
CN112735381B (en) * 2020-12-29 2022-09-27 四川虹微技术有限公司 Model updating method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105869630A (en) * 2016-06-27 2016-08-17 上海交通大学 Method and system for detecting voice spoofing attack of speakers on basis of deep learning
US20170061966A1 (en) * 2015-08-25 2017-03-02 Nuance Communications, Inc. Audio-Visual Speech Recognition with Scattering Operators
CN107545248A (en) * 2017-08-24 2018-01-05 北京小米移动软件有限公司 Biological characteristic biopsy method, device, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102436810A (en) * 2011-10-26 2012-05-02 华南理工大学 Record replay attack detection method and system based on channel mode noise
CN106409298A (en) * 2016-09-30 2017-02-15 广东技术师范学院 Identification method of sound rerecording attack
CN106531172B (en) * 2016-11-23 2019-06-14 湖北大学 Speaker's audio playback discrimination method and system based on ambient noise variation detection

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061966A1 (en) * 2015-08-25 2017-03-02 Nuance Communications, Inc. Audio-Visual Speech Recognition with Scattering Operators
CN105869630A (en) * 2016-06-27 2016-08-17 上海交通大学 Method and system for detecting voice spoofing attack of speakers on basis of deep learning
CN107545248A (en) * 2017-08-24 2018-01-05 北京小米移动软件有限公司 Biological characteristic biopsy method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU, XIAO: "Research on Live Face Detection Algorithm Based on Deep Learning", CHINA MASTER'S THESES FULL-TEXT DATABASE, 15 March 2017 (2017-03-15), pages 1-39 and 55, ISSN: 1674-0246 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021031279A1 (en) * 2019-08-20 2021-02-25 东北大学 Deep-learning-based intelligent pneumonia diagnosis system and method for x-ray chest radiograph
EP4091164A4 (en) * 2020-01-13 2024-01-24 Univ Michigan Regents Secure automatic speaker verification system

Also Published As

Publication number Publication date
CN108281158A (en) 2018-07-13

Similar Documents

Publication Publication Date Title
WO2019136909A1 (en) Voice living-body detection method based on deep learning, server and storage medium
US10332507B2 (en) Method and device for waking up via speech based on artificial intelligence
US20200372905A1 (en) Mixed speech recognition method and apparatus, and computer-readable storage medium
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN109859772B (en) Emotion recognition method, emotion recognition device and computer-readable storage medium
US11875799B2 (en) Method and device for fusing voiceprint features, voice recognition method and system, and storage medium
WO2019174131A1 (en) Identity authentication method, server, and computer readable storage medium
CN107564513A (en) Audio recognition method and device
CN112418059B (en) Emotion recognition method and device, computer equipment and storage medium
WO2021051572A1 (en) Voice recognition method and apparatus, and computer device
WO2019136911A1 (en) Voice recognition method for updating voiceprint data, terminal device, and storage medium
CN110675862A (en) Corpus acquisition method, electronic device and storage medium
WO2018210323A1 (en) Method and device for providing social object
CN109658921A (en) A kind of audio signal processing method, equipment and computer readable storage medium
CN109658943A (en) A kind of detection method of audio-frequency noise, device, storage medium and mobile terminal
WO2019196305A1 (en) Electronic device, identity verification method, and storage medium
US10910000B2 (en) Method and device for audio recognition using a voting matrix
US11830493B2 (en) Method and apparatus with speech processing
CN112071331B (en) Voice file restoration method and device, computer equipment and storage medium
CN114241411B (en) Counting model processing method and device based on target detection and computer equipment
CN116386664A (en) Voice counterfeiting detection method, device, system and storage medium
CN111951793B (en) Method, device and storage medium for awakening word recognition
Tai et al. Seef-aldr: A speaker embedding enhancement framework via adversarial learning based disentangled representation
CN113011532A (en) Classification model training method and device, computing equipment and storage medium
US20210020167A1 (en) Apparatus and method with speech recognition and learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18899637; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.10.2020))
122 Ep: pct application non-entry in european phase (Ref document number: 18899637; Country of ref document: EP; Kind code of ref document: A1)