WO2020238046A1

WO2020238046A1 - Human voice smart detection method and apparatus, and computer readable storage medium

Info

Publication number: WO2020238046A1
Application number: PCT/CN2019/117352
Authority: WO
Inventors: 王健宗; 程宁
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-05-29
Filing date: 2019-11-12
Publication date: 2020-12-03
Also published as: CN110246506A

Abstract

The present application relates to artificial intelligence technology, and discloses a human voice smart detection method, comprising: receiving a label set and a training set comprising a positive sample set and a negative sample set and, after performing pre-processing operations including pre-emphasis and windowing and framing on the training set, inputting same into a human voice detection model, and inputting the label set into a loss function; the human voice detection model receives the pre-processed training set, performs training to obtain a training value, inputs the training value into the loss function, the loss function being calculated to obtain a loss value, and determines the size of the loss value and a preset threshold value until the loss value is less than the preset threshold value, and then the human voice detection model exits training; receiving inputted audio data, using the human voice detection model to determine whether the audio data comprises a human voice, and outputting a determining result. Also provided in the present application are a human voice smart detection apparatus and a computer readable storage medium. The present application can implement highly effective human voice detection.

Description

Human voice intelligent detection method, device and computer readable storage medium

This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on May 29, 2019, the application number is 201910468133.4, and the invention title is "Human Voice Intelligent Detection Method, Device and Computer-readable Storage Medium". The reference is incorporated in the application.

Technical field

This application relates to the field of artificial intelligence technology, and in particular to a method, device and computer-readable storage medium that can intelligently detect whether there is a human voice based on voice data input.

Background technique

Video surveillance systems have been widely used, but most video surveillance systems currently do not detect human voices. Furthermore, the main content of research in the field of human voice detection at home and abroad includes the recognition of voice features of different people, voice recognition of different semantic features, and voice recognition of different emotional state features, but the commonality of most research is that people are known to be human. Under the premise of the spoken voice, to study a certain aspect of the human voice, there are few studies that directly detect whether it is a human voice, and due to the variability between human voice and the environment, most human voices The detection method is not ideal in practical applications, and the effect of human voice detection needs to be resolved in time.

Summary of the invention

This application provides a human voice intelligent detection method, device, and computer-readable storage medium, the main purpose of which is to determine whether the voice data includes accurate results of human voice when the user inputs voice data.

In order to achieve the above objective, a human voice intelligent detection method provided by this application includes:

The data processing layer receives a training set and a label set including a positive sample set and a negative sample set, where the positive sample set includes human voice data and the negative sample set does not include human voice data, and the training set includes pre-processing. Preprocessing operations of emphasis and windowing and framing, input the training set completed by the preprocessing operation to the human voice detection model, and input the label set to the loss function;

The human voice detection model receives the training set completed by the preprocessing operation and performs training to obtain training values, and inputs the training values to the loss function, which is based on the label set and the training value The loss value is calculated, and the size of the loss value and the preset threshold is judged, until the loss value is less than the preset threshold, the human voice detection model exits training;

The input voice data is received and input to the human voice detection model, and the human voice detection model determines whether the voice data includes human voice and outputs the judgment result.

Optionally, performing pre-processing operations including pre-emphasis and windowing and framing on the training set includes:

Perform pre-emphasis on the sound frequency of the training set based on a digital filter, and the pre-emphasis method is:

H(z) = 1-μz ^-1

Wherein, H(z) is the training set after the pre-emphasis, z is the sound frequency, and μ is the pre-emphasis coefficient;

Based on the pre-emphasized training set, perform windowing and framing processing according to the Hamming window method, and the Hamming window method ω(n) is:

Wherein, n is the training set after the pre-emphasis, N is the window length of the Hamming window method, and cos is the cosine function.

H(z) = 1-μz ^-1

Optionally,

The human voice detection model receives the training set completed by the preprocessing operation and performs training to obtain training values, including:

Input the training set to the first convolutional layer of the human voice detection model to perform a convolution operation to obtain a first convolutional data set, and input the first convolutional data set to the first layer of pooling Floor;

The first-level pooling layer performs a maximization pooling operation on the first convolutional data set to obtain a first dimensionality reduction data set, and inputs the first dimensionality reduction data set to the second-level convolutional layer Perform the convolution operation to obtain a second convolution data set, and input the second convolution data set to the second pooling layer to perform the maximization pooling operation to obtain a second dimensionality reduction data set, and Input the second dimensionality reduction data set to the fully connected layer;

The fully connected layer combines an activation function to perform calculation on the second dimensionality reduction data set to obtain the training value.

Optionally, the convolution operation is:

Where ω'is output data, ω is input data, k is the size of the convolution kernel, s is the step size of the convolution operation, and p is the data zero-filling matrix;

The activation function is:

Where y is the second dimensionality reduction data set, and e is an infinite non-cyclic decimal.

In addition, in order to achieve the above-mentioned object, the present application also provides a human voice intelligent detection device, which includes a memory and a processor. The memory stores a human voice intelligent detection program that can run on the processor. When the human voice intelligent detection program is executed by the processor, the following steps are implemented:

H(z) = 1-μz ^-1

Optionally,

In addition, in order to achieve the above object, the present application also provides a computer-readable storage medium, the computer-readable storage medium stores a human voice intelligent detection program, and the human voice intelligent detection program can be executed by one or more processors , In order to realize the steps of the human voice intelligent detection method described above.

The human voice detection model of the present application uses a convolutional neural network. The convolutional neural network retains the associated information between voices based on the idea of local perception and weight sharing, which can greatly reduce the number of required parameters and is further improved by pooling. The number of network parameters is reduced and the robustness of the model is improved. Therefore, the human voice intelligent detection method, device, and computer-readable storage medium proposed in this application can realize efficient human voice detection judgment.

Description of the drawings

FIG. 1 is a schematic flowchart of a human voice intelligent detection method provided by an embodiment of the application;

2 is a schematic diagram of the internal structure of a human voice intelligent detection device provided by an embodiment of the application;

3 is a schematic diagram of modules of a human voice intelligent detection program in a human voice intelligent detection device provided by an embodiment of the application.

The realization, functional characteristics, and advantages of the purpose of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed ways

It should be understood that the specific embodiments described here are only used to explain the application, and are not used to limit the application.

This application provides a human voice intelligent detection method. Referring to FIG. 1, it is a schematic flowchart of a human voice intelligent detection method provided by an embodiment of this application. The method can be executed by a device, and the device can be implemented by software and/or hardware.

In this embodiment, the human voice intelligent detection method includes:

S1. The data processing layer receives a training set and a label set including a positive sample set and a negative sample set, where the positive sample set includes human voice data and the negative sample set does not include human voice data, and the training set is performed The pre-processing operations include pre-emphasis and windowing and framing. The training set completed by the pre-processing operation is input to the human voice detection model, and the label set is input to the loss function.

In a preferred embodiment of the present application, the positive sample set including human voice data is recorded by a microphone in a quiet environment, the sampling frequency of the microphone recording is 16kHz, the sampling size is 16bits, and the persons participating in the admission record at least two different sections The vocal data, one section is admitted in standard Mandarin, and the other section is admitted in the local dialect of the admitted person. The duration of each piece of human voice data in the positive sample set is not less than 10 seconds.

In a preferred embodiment of the present application, the negative sample set comes from the audio data set AudioSet, which includes multiple manually marked sound clips. The AudioSet is a large-scale and complete audio data set that is currently open. Further, the Multiple manually marked sound clips include 2084320 manually marked sound clips each with a length of 10 seconds.

The preferred implementation of the pre-emphasis pre-processing operation in this application is to improve the high-frequency range part of the training set, so that the signal spectrum from the low-frequency range to the high-frequency range of the training set becomes flat, and at the same time, it can suppress random noise and Further, the pre-emphasis is based on the digital filter to pre-emphasize the sound frequency of the training set, and the method of the pre-emphasis, that is, the pre-emphasis, is:

H(z) = 1-μz ^-1

The preferred implementation of the windowing and framing in this application is based on the feature that the audio signal of the training set remains unchanged within a small range of time, and the audio signal of the training set is subjected to framing processing, and further, The windowing and framing is based on the pre-emphasized training set, and the windowing and framing processing is performed according to the Hamming window method, and the Hamming window method ω(n) is:

S2. The human voice detection model receives the training set completed by the preprocessing operation and performs training to obtain training values, and inputs the training values to the loss function, which is based on the label set and the The training value is calculated to obtain a loss value, and the size of the loss value and a preset threshold is judged, until the loss value is less than the preset threshold, the human voice detection model exits the training.

The human voice detection model in the preferred embodiment of the present application receives the training set completed by the preprocessing operation, and inputs the training set to the first layer of convolutional layer. After the first layer of convolutional layer is subjected to convolution, Obtain a convolutional data set and input it to the first layer of pooling layer; then, the first layer of pooling layer performs a maximization pooling operation on the convolutional data set to obtain a dimensionality reduction data set and input it to the second layer of volume Multilayer, the second layer of convolutional layer performs the convolution operation and then inputs to the second layer of pooling layer for the maximum pooling operation, until finally input to the fully connected layer; the fully connected layer is combined with activation Function calculation to obtain the training value;

The convolution operation described in the preferred embodiment of this application is:

Where ω’ is the output data, ω is the input data, k is the size of the convolution kernel, s is the step size of the convolution operation, and p is the data zero-filling matrix;

The activation function described in the preferred embodiment of this application is:

The loss value T in the preferred embodiment of the present application is:

Wherein, n is the size of the training set, y _t is the training value, and μ _t is the label set.

S3. Receive the input sound data and input it to the human voice detection model. The human voice detection model judges whether the sound data includes human voice and outputs a judgment result.

The invention also provides a human voice intelligent detection device. Referring to FIG. 2, it is a schematic diagram of the internal structure of a human voice intelligent detection device provided by an embodiment of this application.

In this embodiment, the human voice intelligent detection device 1 may be a PC (Personal Computer, personal computer), or a terminal device such as a smart phone, a tablet computer, or a portable computer, or a server. The human voice intelligent detection device 1 at least includes a memory 11, a processor 12, a communication bus 13, and a network interface 14.

Wherein, the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may be an internal storage unit of the human voice intelligent detection device 1 in some embodiments, such as a hard disk of the human voice intelligent detection device 1. In other embodiments, the memory 11 may also be an external storage device of the human voice intelligent detection device 1, for example, a plug-in hard disk equipped on the human voice intelligent detection device 1, a smart media card (SMC), and a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc. Further, the memory 11 may also include both an internal storage unit of the human voice intelligent detection device 1 and an external storage device. The memory 11 can be used not only to store application software and various data installed in the human voice intelligent detection device 1, such as the code of the human voice intelligent detection program 01, etc., but also to temporarily store data that has been output or will be output.

In some embodiments, the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, and is used to run the program code or processing stored in the memory 11 Data, for example, execute the human voice intelligent detection program 01 (the human voice intelligent detection program 01 is essentially a software system) and so on.

The communication bus 13 is used to realize the connection and communication between these components.

The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the device 1 and other electronic devices.

Optionally, the device 1 may also include a user interface. The user interface may include a display (Display) and an input unit such as a keyboard (Keyboard). The optional user interface may also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light emitting diode) touch device, etc. Among them, the display may also be appropriately called a display screen or a display unit, which is used to display the information processed in the human voice intelligent detection device 1 and to display a visualized user interface.

Figure 2 only shows the human voice intelligent detection device 1 with components 11-14 and human voice intelligent detection program 01. Those skilled in the art will understand that the structure shown in Figure 1 does not constitute a human voice intelligent detection device The definition of 1 may include fewer or more components than shown, or a combination of certain components, or different component arrangements.

In the embodiment of the device 1 shown in FIG. 2, the human voice intelligent detection program 01 is stored in the memory 11; when the processor 12 executes the human voice intelligent detection program 01 stored in the memory 11, the following steps are implemented:

Step 1. The data processing layer receives a training set and a label set including a positive sample set and a negative sample set, where the positive sample set includes human voice data and the negative sample set does not include human voice data, and the training set A pre-processing operation including pre-emphasis and windowing and framing is performed, the training set completed by the pre-processing operation is input to the human voice detection model, and the label set is input to the loss function.

In a preferred embodiment of the present application, the positive sample set including human voice data is recorded by a microphone in a quiet environment, the sampling frequency of the microphone recording is 16kHz, the sampling size is 16bits, and the persons participating in the admission record at least two different sections The vocal data, one section is admitted in standard Mandarin, and the other section is admitted in the local dialect of the admitted person. The duration of each segment of human voice data in the positive sample set is not less than 10 seconds.

H(z) = 1-μz ^-1

Step 2: The human voice detection model receives the training set completed by the preprocessing operation and performs training to obtain training values, and inputs the training values to the loss function, which is based on the label set and the The training value is calculated to obtain a loss value, and the size of the loss value and a preset threshold is judged, until the loss value is less than the preset threshold, the human voice detection model exits the training.

The human voice detection model described in the preferred embodiment of the present application receives the training set completed by the preprocessing operation, and inputs the training set to the first layer of convolutional layer. After the first layer of convolutional layer performs convolution operation Obtain the convolutional data set and input it to the first-layer pooling layer; then, the first-layer pooling layer performs a maximization pooling operation on the convolutional data set to obtain a dimensionality reduction data set and input it to the second-layer volume Layers, the second convolutional layer performs the convolution operation and then inputs to the second pooling layer for the maximization pooling operation, until finally input to the fully connected layer; the fully connected layer is combined with activation Function calculation to obtain the training value;

The loss value T in the preferred embodiment of the present application is:

Step 3: The input voice data is received and input to the human voice detection model, and the human voice detection model judges whether the voice data includes human voice and outputs the judgment result.

Optionally, in other embodiments, the human voice intelligent detection program can also be divided into one or more modules, and the one or more modules are stored in the memory 11 and run by one or more processors (this embodiment It is executed by the processor 12) to complete this application. The module referred to in this application refers to a series of computer program instruction segments that can complete specific functions, and is used to describe the execution process of the human voice intelligent detection program in the human voice intelligent detection device .

For example, referring to FIG. 3, it is a schematic diagram of the program modules of the human voice intelligent detection program in an embodiment of the applicant's voice intelligent detection device. In this embodiment, the human voice intelligent detection program can be divided into data receiving modules 10. Model training module 20, human voice result output module 30, exemplarily:

The data receiving module 10 is configured to receive a positive sample set including human voice data, a negative sample set not including human voice data, and a label set. The positive sample set and the negative sample set are collectively referred to as a training set, and the The training set is subjected to pre-processing operations including pre-emphasis and windowing and framing, the training set completed by the pre-processing operation is input to the human voice detection model, and the label set is input to the loss function.

The model training module 20 is configured to: the human voice detection model receives the training set completed by the preprocessing operation to obtain training values, and inputs the training values into the loss function, which is based on the The label set and the training value are calculated to obtain a loss value, and the size of the loss value and a preset threshold is judged, until the loss value is less than the preset threshold, the human voice detection model exits training.

The human voice result output module 30 is configured to receive input voice data and input it to the human voice detection model, and the human voice detection model determines whether the voice data includes human voice and outputs the judgment result.

The functions or operation steps implemented by the program modules such as the data receiving module 10, the model training module 20, and the human voice result output module 30 when executed are substantially the same as those in the foregoing embodiment, and will not be repeated here.

In addition, the embodiment of the present application also proposes a computer-readable storage medium. The computer-readable storage medium stores a human voice intelligent detection program, and the human voice intelligent detection program can be executed by one or more processors to realize Do as follows:

Receive a positive sample set including human voice data, a negative sample set that does not include human voice data, and a label set, the positive sample set and the negative sample set are collectively referred to as a training set, and pre-emphasis and windowing are performed on the training set The pre-processing operation of framing is to input the training set completed by the pre-processing operation into the human voice detection model, and input the label set into the loss function.

The human voice detection model receives the training set completed by the preprocessing operation for training to obtain training values, and inputs the training values to the loss function, and the loss function is calculated based on the label set and the training value The loss value is obtained, and the size of the loss value and a preset threshold is judged, and the human voice detection model exits training when the loss value is less than the preset threshold.

The specific implementation of the computer-readable storage medium of the present application is basically the same as the foregoing embodiments of the human voice intelligent detection device and method, and will not be repeated here.

It should be noted that the serial numbers of the above embodiments of the present application are only for description, and do not represent the advantages and disadvantages of the embodiments. And the terms "include", "include" or any other variants thereof in this article are intended to cover non-exclusive inclusion, so that a process, device, article or method including a series of elements not only includes those elements, but also includes The other elements listed may also include elements inherent to the process, device, article, or method. Without more restrictions, the element defined by the sentence "including a..." does not exclude the existence of other identical elements in the process, device, article, or method that includes the element.

Through the description of the above embodiments, those skilled in the art can clearly understand that the method of the above embodiments can be implemented by means of software plus the necessary general hardware platform. Of course, it can also be implemented by hardware, but in many cases the former is better.的实施方式。 Based on this understanding, the technical solution of this application essentially or the part that contributes to the existing technology can be embodied in the form of a software product, and the computer software product is stored in a storage medium (such as ROM/RAM) as described above. , Magnetic disk, optical disk), including several instructions to make a terminal device (which can be a mobile phone, a computer, a server, or a network device, etc.) execute the method described in each embodiment of the present application.

The above are only preferred embodiments of this application, and do not limit the scope of this application. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or directly or indirectly used in other related technical fields , The same reason is included in the scope of patent protection of this application.

Claims

A human voice intelligent detection method, characterized in that the method includes:

The data processing layer receives a training set and a label set including a positive sample set and a negative sample set, where the positive sample set includes human voice data and the negative sample set does not include human voice data, and the training set includes pre-processing. Preprocessing operations of emphasis and windowing and framing, input the training set completed by the preprocessing operation to the human voice detection model, and input the label set to the loss function;

The human voice detection model receives the training set completed by the preprocessing operation and performs training to obtain training values, and inputs the training values to the loss function, which is based on the label set and the training value The loss value is calculated, and the size of the loss value and the preset threshold is judged, until the loss value is less than the preset threshold, the human voice detection model exits training;

The input voice data is received and input to the human voice detection model, and the human voice detection model is used to determine whether the voice data includes human voice and output the judgment result.
The human voice intelligent detection method according to claim 1, wherein the data processing layer receives a training set and a label set including a positive sample set and a negative sample set, including:

Extracting a plurality of manually marked sound clips included in the preset audio data set AudioSet as the negative sample set;

Recording vocals of multiple sampling frequencies to construct the positive sample set;

A corresponding label set is established based on the positive sample set and the negative sample set.
The human voice intelligent detection method according to claim 2, wherein performing pre-processing operations including pre-emphasis and windowing and framing on the training set includes:

Perform pre-emphasis on the sound frequency of the training set based on a digital filter, and the pre-emphasis method is:

H(z) = 1-μz -1

Wherein, H(z) is the training set after the pre-emphasis, z is the sound frequency, and μ is the pre-emphasis coefficient;

Based on the pre-emphasized training set, perform windowing and framing processing according to the Hamming window method, and the Hamming window method ω(n) is:

Wherein, n is the training set after the pre-emphasis, N is the window length of the Hamming window method, and cos is the cosine function.
The human voice intelligent detection method according to any one of claims 1 to 3, wherein the human voice detection model receives a training set completed by the preprocessing operation and performs training to obtain a training value, comprising:

Input the training set to the first convolutional layer of the human voice detection model to perform a convolution operation to obtain a first convolutional data set, and input the first convolutional data set to the first layer of pooling Floor;

The first-level pooling layer performs a maximization pooling operation on the first convolutional data set to obtain a first dimensionality reduction data set, and inputs the first dimensionality reduction data set to the second-level convolutional layer Perform the convolution operation to obtain a second convolution data set, and input the second convolution data set to the second pooling layer to perform the maximization pooling operation to obtain a second dimensionality reduction data set, and Input the second dimensionality reduction data set to the fully connected layer;

The fully connected layer combines an activation function to perform calculation on the second dimensionality reduction data set to obtain the training value.
The human voice intelligent detection method according to claim 4, wherein the convolution operation is:

Where ω'is output data, ω is input data, k is the size of the convolution kernel, s is the step size of the convolution operation, and p is the data zero-filling matrix;

The activation function is:

Where y is the second dimensionality reduction data set, and e is an infinite non-cyclic decimal.
A human voice intelligent detection device, characterized in that the device includes a memory and a processor, the memory stores a human voice intelligent detection program that can run on the processor, and the human voice intelligent detection program is The processor implements the following steps when executing:

The data processing layer receives a training set and a label set including a positive sample set and a negative sample set, where the positive sample set includes human voice data and the negative sample set does not include human voice data, and the training set includes pre-processing. Preprocessing operations of emphasis and windowing and framing, input the training set completed by the preprocessing operation to the human voice detection model, and input the label set to the loss function;

The human voice detection model receives the training set completed by the preprocessing operation and performs training to obtain training values, and inputs the training values to the loss function, which is based on the label set and the training value The loss value is calculated, and the size of the loss value and the preset threshold is judged, until the loss value is less than the preset threshold, the human voice detection model exits training;

The input voice data is received and input to the human voice detection model, and the human voice detection model determines whether the voice data includes human voice and outputs the judgment result.
7. The human voice intelligent detection device according to claim 6, wherein the data processing layer receives a training set and a label set including a positive sample set and a negative sample set, including:

Extracting a plurality of manually marked sound clips included in the preset audio data set AudioSet as the negative sample set;

Recording vocals of multiple sampling frequencies to construct the positive sample set;

A corresponding label set is established based on the positive sample set and the negative sample set.
8. The human voice intelligent detection device according to claim 7, wherein the pre-processing operation including pre-emphasis and windowing and framing on the training set comprises:

Perform pre-emphasis on the sound frequency of the training set based on a digital filter, and the pre-emphasis method is:

H(z) = 1-μz -1

Wherein, H(z) is the training set after the pre-emphasis, z is the sound frequency, and μ is the pre-emphasis coefficient;

Based on the pre-emphasized training set, perform windowing and framing processing according to the Hamming window method, and the Hamming window method ω(n) is:

Wherein, n is the training set after the pre-emphasis, N is the window length of the Hamming window method, and cos is the cosine function.
8. The human voice intelligent detection device according to any one of claims 6 to 8, wherein the human voice detection model receives the training set completed by the preprocessing operation and performs training to obtain training values, comprising:

Input the training set to the first convolutional layer of the human voice detection model to perform a convolution operation to obtain a first convolutional data set, and input the first convolutional data set to the first layer of pooling Floor;

The first-level pooling layer performs a maximization pooling operation on the first convolutional data set to obtain a first dimensionality reduction data set, and inputs the first dimensionality reduction data set to the second-level convolutional layer Perform the convolution operation to obtain a second convolution data set, and input the second convolution data set to the second pooling layer to perform the maximization pooling operation to obtain a second dimensionality reduction data set, and Input the second dimensionality reduction data set to the fully connected layer;

The fully connected layer combines an activation function to perform calculation on the second dimensionality reduction data set to obtain the training value.
The human voice intelligent detection device according to claim 9, wherein the convolution operation is:

Where ω'is output data, ω is input data, k is the size of the convolution kernel, s is the step size of the convolution operation, and p is the data zero-filling matrix;

The activation function is:

Where y is the second dimensionality reduction data set, and e is an infinite non-cyclic decimal.
A computer-readable storage medium, characterized in that a human voice intelligent detection program is stored on the computer-readable storage medium, and the human voice intelligent detection program can be executed by one or more processors to implement the following steps:

The data processing layer receives a training set and a label set including a positive sample set and a negative sample set, where the positive sample set includes human voice data and the negative sample set does not include human voice data, and the training set includes pre-processing. Preprocessing operations of emphasis and windowing and framing, input the training set completed by the preprocessing operation to the human voice detection model, and input the label set to the loss function;

The human voice detection model receives the training set completed by the preprocessing operation and performs training to obtain training values, and inputs the training values to the loss function, which is based on the label set and the training value The loss value is calculated, and the size of the loss value and the preset threshold is judged, until the loss value is less than the preset threshold, the human voice detection model exits training;

The input voice data is received and input to the human voice detection model, and the human voice detection model is used to determine whether the voice data includes human voice and output the judgment result.
11. The computer-readable storage medium of claim 11, wherein the data processing layer receives a training set and a label set including a positive sample set and a negative sample set, comprising:

Extracting a plurality of manually marked sound clips included in the preset audio data set AudioSet as the negative sample set;

Recording vocals of multiple sampling frequencies to construct the positive sample set;

A corresponding label set is established based on the positive sample set and the negative sample set.
The computer-readable storage medium of claim 12, wherein performing pre-processing operations including pre-emphasis and windowing and framing on the training set comprises:

Perform pre-emphasis on the sound frequency of the training set based on a digital filter, and the pre-emphasis method is:

H(z) = 1-μz -1

Wherein, H(z) is the training set after the pre-emphasis, z is the sound frequency, and μ is the pre-emphasis coefficient;

Based on the pre-emphasized training set, perform windowing and framing processing according to the Hamming window method, and the Hamming window method ω(n) is:

Wherein, n is the training set after the pre-emphasis, N is the window length of the Hamming window method, and cos is the cosine function.
The computer-readable storage medium according to any one of claims 11 to 13, wherein the human voice detection model receives a training set completed by the preprocessing operation and performs training to obtain a training value, comprising:

Input the training set to the first convolutional layer of the human voice detection model to perform a convolution operation to obtain a first convolutional data set, and input the first convolutional data set to the first layer of pooling Floor;

The first-level pooling layer performs a maximization pooling operation on the first convolutional data set to obtain a first dimensionality reduction data set, and inputs the first dimensionality reduction data set to the second-level convolutional layer Perform the convolution operation to obtain a second convolution data set, and input the second convolution data set to the second pooling layer to perform the maximization pooling operation to obtain a second dimensionality reduction data set, and Input the second dimensionality reduction data set to the fully connected layer;

The fully connected layer combines an activation function to perform calculation on the second dimensionality reduction data set to obtain the training value.
The computer-readable storage medium of claim 14, wherein the convolution operation is:

Where ω'is output data, ω is input data, k is the size of the convolution kernel, s is the step size of the convolution operation, and p is the data zero-filling matrix;

The activation function is:

Where y is the second dimensionality reduction data set, and e is an infinite non-cyclic decimal.
A human voice intelligent detection system, characterized in that the human voice intelligent detection system includes:

The data receiving module is configured to: the data processing layer receives a training set and a label set including a positive sample set and a negative sample set, wherein the positive sample set includes human voice data and the negative sample set does not include human voice data, right The training set is subjected to pre-processing operations including pre-emphasis and windowing and framing, the training set completed by the pre-processing operation is input to a human voice detection model, and the label set is input to a loss function;

The model training module is configured to: the human voice detection model receives the training set completed by the preprocessing operation and performs training to obtain training values, and inputs the training values to the loss function, which is based on the The label set and the training value are calculated to obtain a loss value, and the size of the loss value and a preset threshold is judged, until the loss value is less than the preset threshold, the human voice detection model exits training;

The human voice result output module is configured to: receive the input voice data and input it into the human voice detection model, use the human voice detection model to determine whether the voice data includes human voice and output the judgment result.
The human voice intelligent detection system according to claim 16, wherein the data processing layer receives a training set and a label set including a positive sample set and a negative sample set, including:

Extracting a plurality of manually marked sound clips included in the preset audio data set AudioSet as the negative sample set;

Recording vocals of multiple sampling frequencies to construct the positive sample set;

A corresponding label set is established based on the positive sample set and the negative sample set.
The human voice intelligent detection system according to claim 17, wherein the pre-processing operations including pre-emphasis and windowing and framing on the training set include:

Perform pre-emphasis on the sound frequency of the training set based on a digital filter, and the pre-emphasis method is:

H(z) = 1-μz -1

Wherein, H(z) is the training set after the pre-emphasis, z is the sound frequency, and μ is the pre-emphasis coefficient;

Based on the pre-emphasized training set, perform windowing and framing processing according to the Hamming window method, and the Hamming window method ω(n) is:

Wherein, n is the training set after the pre-emphasis, N is the window length of the Hamming window method, and cos is the cosine function.
The human voice intelligent detection system according to claims 16 to 18, wherein the human voice detection model receives the training set completed by the pre-processing operation and performs training to obtain training values, comprising:

Input the training set to the first convolutional layer of the human voice detection model to perform a convolution operation to obtain a first convolutional data set, and input the first convolutional data set to the first layer of pooling Floor;

The first-level pooling layer performs a maximization pooling operation on the first convolutional data set to obtain a first dimensionality reduction data set, and inputs the first dimensionality reduction data set to the second-level convolutional layer Perform the convolution operation to obtain a second convolution data set, and input the second convolution data set to the second pooling layer to perform the maximization pooling operation to obtain a second dimensionality reduction data set, and Input the second dimensionality reduction data set to the fully connected layer;

The fully connected layer combines an activation function to perform calculation on the second dimensionality reduction data set to obtain the training value.
The human voice intelligent detection system according to claim 19, wherein the convolution operation is:

Where ω'is output data, ω is input data, k is the size of the convolution kernel, s is the step size of the convolution operation, and p is the data zero-filling matrix;

The activation function is:

Where y is the second dimensionality reduction data set, and e is an infinite non-cyclic decimal.