CN112489677B - Voice endpoint detection method, device, equipment and medium based on neural network - Google Patents

Voice endpoint detection method, device, equipment and medium based on neural network

Info

Publication number
CN112489677B
Authority
CN
China
Prior art keywords
voice
neural network
feature
network model
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011309613.5A
Other languages
Chinese (zh)
Other versions
CN112489677A (en)
Inventor
郑振鹏
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011309613.5A
Publication of CN112489677A
Priority to PCT/CN2021/083937 (published as WO2021208728A1)
Application granted
Publication of CN112489677B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: characterised by the analysis technique
    • G10L25/30: characterised by the analysis technique using neural networks
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/48: specially adapted for particular use
    • G10L25/51: specially adapted for particular use, for comparison or discrimination
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/87: Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to the technical field of voice detection and discloses a voice endpoint detection method, device, equipment and medium based on a neural network. The method comprises: extracting acoustic features of a sample voice file; distributing context feature information to each frame of voice features to obtain a feature matrix; performing feature processing on the feature matrix through a neural network model to obtain a one-dimensional feature vector, and performing learning processing of the sequence information of the voice frames on the one-dimensional feature vector to obtain a predicted value; calculating a loss function value of the predicted value and the real voice value, updating the network parameters of the neural network model according to the loss function value, and outputting the prediction result of the voice file to be detected through the trained neural network model. The application also relates to blockchain technology: the sample voice file may be stored in a blockchain. By training the neural network model with the combined context feature information, the accuracy of voice endpoint detection by the neural network is improved.

Description

Voice endpoint detection method, device, equipment and medium based on neural network
Technical Field
The present application relates to the field of speech detection technologies, and in particular, to a method, an apparatus, a device, and a medium for detecting a speech endpoint based on a neural network.
Background
Voice endpoint detection (Voice Activity Detection) is an important part of voice processing. Accurate voice endpoint detection reduces the computation required for voice signal processing, improves the real-time performance of the system, and improves the robustness of the voice system and the accuracy of subsequent voice processing. In a practical environment, however, excessive background noise poses a great challenge to voice endpoint detection, so an accurate voice endpoint detection system is of great importance.
Existing voice endpoint detection models can only use fixed context information and cannot adaptively select the optimal context information according to the surrounding voice conditions. When longer context information is used, too much noise enters the model, causing endpoint detection anomalies; when shorter context information is used, the context of the voice cannot be fully exploited, so voice endpoints cannot be judged correctly. The prior art therefore cannot provide sufficiently accurate voice endpoint detection, and a method that improves the accuracy of voice endpoint detection is needed.
Disclosure of Invention
The embodiment of the application aims to provide a voice endpoint detection method, device, equipment and medium based on a neural network so as to improve the accuracy of voice endpoint detection.
In order to solve the above technical problems, an embodiment of the present application provides a voice endpoint detection method based on a neural network, including:
acquiring a sample voice file, and extracting acoustic characteristics of the sample voice file to obtain voice characteristics, wherein the voice characteristics comprise characteristic information;
distributing N frames of context feature information to each frame of voice feature to obtain a feature matrix, wherein N is a positive integer;
performing feature processing on the feature matrix through a neural network model to obtain a one-dimensional feature vector, and performing learning processing on sequence information of a voice frame on the one-dimensional feature vector to obtain a predicted value;
calculating a loss function value of the predicted value and the real voice value, and updating network parameters of the neural network model according to the loss function value to obtain a trained neural network model;
acquiring a voice file to be detected, and extracting acoustic characteristics of the voice file to be detected to obtain voice characteristics of the voice file to be detected;
And inputting the voice characteristics of the voice file to be detected into the trained neural network model to obtain a prediction result.
In order to solve the above technical problems, an embodiment of the present application provides a voice endpoint detection apparatus based on a neural network, including:
the voice feature extraction module is used for obtaining a sample voice file, and extracting acoustic features of the sample voice file to obtain voice features, wherein the voice features comprise feature information;
the feature matrix acquisition module is used for distributing N frames of context feature information to each frame of the voice feature to obtain a feature matrix, wherein N is a positive integer;
the predicted value acquisition module is used for performing feature processing on the feature matrix through a neural network model to obtain a one-dimensional feature vector, and performing learning processing of the sequence information of the voice frames on the one-dimensional feature vector to obtain a predicted value;
the neural network model training module is used for calculating a loss function value of the predicted value and the real voice value, and updating the network parameters of the neural network model by the loss function value to obtain a trained neural network model;
the system comprises an acoustic feature information extraction module, a sound detection module and a sound detection module, wherein the acoustic feature information extraction module is used for acquiring a to-be-detected sound file and extracting acoustic features of the to-be-detected sound file to obtain acoustic feature information;
And the prediction result acquisition module is used for inputting the acoustic characteristic information into the trained neural network model to obtain a prediction result.
In order to solve the technical problems, the invention adopts a technical scheme that: a computer device is provided comprising one or more processors; a memory for storing one or more programs to cause the one or more processors to implement the neural network-based voice endpoint detection method of any of the above.
In order to solve the technical problems, the invention adopts a technical scheme that: a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the neural network-based voice endpoint detection method of any of the above.
The embodiment of the invention provides a voice endpoint detection method, device, equipment and medium based on a neural network. The method comprises the following steps: acquiring a sample voice file and extracting acoustic features of it to obtain voice features; distributing N frames of context feature information to each frame of voice features to obtain a feature matrix; performing feature processing on the feature matrix through a neural network model to obtain a one-dimensional feature vector, and performing learning processing of the sequence information of the voice frames on the one-dimensional feature vector to obtain a predicted value; calculating a loss function value of the predicted value and the real voice value, and updating the network parameters of the neural network model according to the loss function value to obtain a trained neural network model; acquiring a voice file to be detected and extracting acoustic features of it to obtain its voice features; and inputting the voice features of the voice file to be detected into the trained neural network model to obtain a prediction result. By extracting voice features and distributing context feature information to them to obtain a feature matrix, training the neural network model on the feature matrix, and outputting the prediction result of the voice file to be detected with the trained neural network model, the embodiment of the invention improves the accuracy of voice endpoint detection by the neural network.
Drawings
In order to more clearly illustrate the solution of the present application, a brief description will be given below of the drawings required for the description of the embodiments of the present application, it being apparent that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without the exercise of inventive effort for a person of ordinary skill in the art.
Fig. 1 is a schematic diagram of an application environment of a voice endpoint detection method based on a neural network according to an embodiment of the present application;
FIG. 2 is a flowchart of an implementation of a voice endpoint detection method based on a neural network according to an embodiment of the present application;
FIG. 3 is a flowchart of an implementation of a sub-process in a neural network-based voice endpoint detection method according to an embodiment of the present application;
FIG. 4 is a flowchart of another implementation of a sub-process in a neural network-based voice endpoint detection method according to an embodiment of the present application;
FIG. 5 is a flowchart of another implementation of a sub-process in a neural network-based voice endpoint detection method according to an embodiment of the present application;
FIG. 6 is a flowchart of another implementation of a sub-process in a neural network-based voice endpoint detection method according to an embodiment of the present application;
FIG. 7 is a flowchart of another implementation of a sub-process in a neural network-based voice endpoint detection method according to an embodiment of the present application;
FIG. 8 is a flowchart of another implementation of a sub-process in a neural network-based voice endpoint detection method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a voice endpoint detection apparatus based on a neural network according to an embodiment of the present application;
fig. 10 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
The present application will be described in detail with reference to the drawings and embodiments.
Referring to fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a search class application, an instant messaging tool, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the voice endpoint detection method based on the neural network provided by the embodiment of the present application is generally executed by a server, and correspondingly, the voice endpoint detection device based on the neural network is generally configured in the server.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring to fig. 2, fig. 2 illustrates one embodiment of a neural network-based voice endpoint detection method.
It should be noted that, provided substantially the same result is obtained, the method of the present application is not limited to the flow sequence shown in fig. 2. The method includes the following steps:
s1: and acquiring a sample voice file, and extracting acoustic features of the sample voice file to obtain voice features, wherein the voice features comprise feature information.
Specifically, the sample voice file includes real voice signals and non-real voice signals, where the non-real voice signals include noise signals. The embodiment of the application aims to train a neural network model on the real and non-real voice signals to obtain a trained neural network model, so that when a voice file to be detected is received, the real voice signals in it can be output directly by the trained neural network model. During training, the server acquires a sample voice file for training the neural network model and performs acoustic feature extraction on it to obtain voice features. The voice features include feature information, which refers to information such as the real voice and noise contained in the sample voice file. The detailed process of step S1 is described in steps S11 to S13 below and is not repeated here.
Further, the acoustic feature extracted from the sample voice file is specifically the acoustic Fbank feature, because the Fbank feature better reflects the nature of the sound signal and fits the hearing characteristics of the human ear.
S2: and distributing N frames of context feature information to each frame of voice feature to obtain a feature matrix, wherein N is a positive integer.
Specifically, compared with modeling on single-frame voice alone, modeling voice endpoints with the context feature information of the voice greatly improves the accuracy of voice endpoint detection. Therefore, the embodiment of the application distributes N frames of context feature information to each frame of voice features and establishes a feature matrix, providing the basis for the subsequent training of the neural network model.
The feature matrix is the matrix established after the feature information of the voice features is converted to numerical form. For example, if a frame of voice features Y contains feature information y1, y2, y3 and y4, the feature matrix is Y = {y1, y2, y3, y4}.
The N frames are set according to actual conditions, and are not limited here. In one embodiment, the N frames are 5 frames.
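For illustration only (the patent does not specify whether the N context frames are taken from one side or both, nor how file edges are handled), the context feature matrix for each frame could be assembled as in the following sketch, which assumes symmetric context and edge padding by repetition:

```python
import numpy as np

def stack_context(features: np.ndarray, n: int) -> np.ndarray:
    """Attach n frames of left and right context to each frame.

    features: (T, D) array of per-frame acoustic features (e.g. Fbank).
    Returns a (T, 2*n + 1, D) feature matrix; edges are padded by repetition.
    """
    padded = np.pad(features, ((n, n), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * n + 1] for t in range(features.shape[0])])

# Example: 100 frames of 40-dimensional features, 5 context frames on each side
fbank = np.random.randn(100, 40).astype(np.float32)
context_matrix = stack_context(fbank, n=5)   # shape (100, 11, 40)
```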
S3: and carrying out feature processing on the feature matrix through the neural network model to obtain a one-dimensional feature vector, and carrying out learning processing on the sequence information of the voice frame on the one-dimensional feature vector to obtain a predicted value.
Specifically, the feature matrix is input into the neural network model; the self-adaptive receptive field attention module of the neural network model constructs feature matrices of different receptive fields from it, subjects them to pooling processing, and finally converts them into a one-dimensional feature vector; the one-dimensional feature vector then undergoes learning processing of the sequence information of the voice frames through the bidirectional long short-term memory (BiLSTM) neural network module, finally yielding the predicted value. The detailed process of step S3 is described in steps S31 to S34 and is not repeated here.
The self-adaptive receptive field attention module is a module in the neural network model used to select the most suitable context receptive field information and obtain the corresponding feature vector. It also contains an attention mapping module, which models the gated feature matrices of different receptive fields to obtain the weight coefficient of each gated feature matrix.
Feature processing refers to the feature conversion, feature mapping, pooling, normalization and the like performed on the feature matrix, with the aim of obtaining a one-dimensional feature vector. Learning processing refers to learning the sequence information of the voice frames through a 2-layer BiLSTM module and then outputting the prediction result of the current frame through a 1-layer fully-connected neural network.
The predicted value is the value output by the neural network model after processing, approximating the real voice signal in the sample voice file.
S4: and calculating a loss function value of the predicted value and the real voice value, and updating network parameters of the neural network model according to the loss function value to obtain a trained neural network model.
Specifically, the degree to which the predicted value deviates is judged by calculating the loss function value of the predicted value and the real voice value, and the network parameters of the neural network model are updated step by step according to the loss function value. When the loss function value is small enough, the predicted value can be considered infinitely close to the real voice value, meaning the value output by the neural network model is infinitely close to the real voice value; at that point, updating of the network parameters is stopped, and the trained neural network model is obtained.
The real voice value is a value obtained by carrying out numerical processing on a real voice signal in the sample voice file.
S5: and acquiring a voice file to be detected, and extracting acoustic characteristics of the voice file to be detected to obtain voice characteristics of the voice file to be detected.
Specifically, after the neural network model has been trained through the above steps, when a voice file to be detected is obtained, acoustic feature extraction is performed on the voice file to be detected to obtain its voice features. This extraction is consistent with the acoustic feature extraction method in step S1 and is not repeated here.
S6: inputting the voice characteristics of the voice file to be detected into the trained neural network model to obtain a prediction result.
Specifically, the voice features of each frame of the voice file to be detected are input into the trained neural network model, and a prediction result is obtained for the voice features of each frame. The prediction result of each frame is one of two cases: the voice features of the frame are real voice, or the voice features of the frame are non-real voice. Obtaining the prediction result for the voice features of each frame achieves the purpose of voice endpoint detection.
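For illustration, the per-frame real/non-real decisions can be turned into endpoint times as sketched below. The conversion routine and the 15-millisecond frame length (borrowed from the framing embodiment in step S11) are assumptions; the patent only states that per-frame prediction achieves endpoint detection.

```python
import numpy as np

def frames_to_endpoints(is_speech: np.ndarray, frame_ms: float = 15.0):
    """Convert per-frame speech decisions into (start_ms, end_ms) segments.

    is_speech: boolean array with one entry per frame (model output, assumed).
    """
    endpoints, start = [], None
    for t, s in enumerate(is_speech):
        if s and start is None:
            start = t                                  # voice segment begins
        elif not s and start is not None:
            endpoints.append((start * frame_ms, t * frame_ms))
            start = None                               # voice segment ends
    if start is not None:                              # file ends mid-speech
        endpoints.append((start * frame_ms, len(is_speech) * frame_ms))
    return endpoints

print(frames_to_endpoints(np.array([0, 1, 1, 1, 0, 0, 1, 1], dtype=bool)))
# [(15.0, 60.0), (90.0, 120.0)]
```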
In this embodiment, a sample voice file is obtained and acoustic feature extraction is performed on it to obtain voice features; N frames of context feature information are distributed to each frame of voice features to obtain a feature matrix; feature processing is performed on the feature matrix through a neural network model to obtain a one-dimensional feature vector, and learning processing of the sequence information of the voice frames is performed on the one-dimensional feature vector to obtain a predicted value; the loss function value of the predicted value and the real voice value is calculated, and the network parameters of the neural network model are updated according to the loss function value to obtain a trained neural network model; a voice file to be detected is acquired and its acoustic features are extracted to obtain its voice features; and the voice features of the voice file to be detected are input into the trained neural network model to obtain a prediction result. By extracting voice features and distributing context feature information to them to obtain a feature matrix, training the neural network model on the feature matrix, and outputting the prediction result of the voice file to be detected with the trained neural network model, the accuracy of voice endpoint detection by the neural network is improved.
Referring to fig. 3, fig. 3 shows a specific implementation manner of step S3, in which in step S3, feature processing is performed on a feature matrix through a neural network model to obtain a one-dimensional feature vector, and sequence information learning processing of a speech frame is performed on the one-dimensional feature vector to obtain a specific implementation process of a predicted value, which is described in detail as follows:
S31, inputting the feature matrix into a neural network model, and performing vector processing on the feature matrix through the self-adaptive receptive field attention module to obtain a feature vector.
Specifically, the feature matrix containing N frames of context feature information is sent to the adaptive receptive field attention module of the neural network model. In the adaptive receptive field attention module, feature matrices of different receptive fields are constructed, converted into feature matrices of the same size, and subjected to pooling processing to obtain the feature vector.
Vector processing means constructing feature matrices of different receptive fields from the feature matrix, converting them into feature matrices of the same size, and pooling them to obtain the feature vector.
S32, inputting the feature vector into the full-connection layer network, and carrying out normalization processing on the feature vector to obtain a target feature matrix.
Specifically, the feature vectors are input into a shared 2-layer full-connection layer network to obtain the coefficient vectors corresponding to the pooled feature vectors, and these coefficient vectors are then normalized to obtain the target feature matrix.
Normalization is a dimensionless processing means that changes absolute values of physical quantities into a relative-value relationship. In this embodiment, the pooled coefficient vectors are converted into the normalized coefficient through normalization processing, and the target feature matrix is finally obtained.
S33, converting the target feature matrix of each frame into a one-dimensional vector according to a Reshape function mode to obtain a one-dimensional feature vector.
Specifically, the Reshape function is a MATLAB function that transforms a given matrix into a matrix of specified dimensions while keeping the number of elements unchanged; it can readjust the number of rows, columns and dimensions of a matrix. The syntax B = reshape(A, sz) returns an array with the same elements as A whose dimensions are determined by the vector sz. In this embodiment, the target feature matrix of each frame is converted into a one-dimensional vector in the manner of the Reshape function, so that it can conveniently be input into the BiLSTM module to obtain the predicted value.
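A minimal NumPy equivalent of this flattening step (the 11 × 40 dimensions are assumptions carried over from the earlier context-matrix sketch):

```python
import numpy as np

# One frame's target feature matrix, e.g. 11 context frames x 40 feature bins
target = np.arange(11 * 40, dtype=np.float32).reshape(11, 40)

# Flatten into a one-dimensional vector; the element count is unchanged,
# exactly as with MATLAB's reshape
one_dim = target.reshape(-1)        # shape (440,)
assert one_dim.size == target.size
```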
S34, inputting the one-dimensional feature vector into the BiLSTM module to perform learning processing of the sequence information of the voice frames, so as to obtain a predicted value.
Specifically, the one-dimensional feature vectors are input into the 2-layer BiLSTM module, which performs learning processing on the sequence information of the voice frames; the result of each voice frame after the learning processing is correspondingly input into the full-connection layer neural network classifier, yielding the predicted value of each frame.
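A sketch of this stage in PyTorch, assuming the flattened 440-dimensional frame vectors of the earlier examples; the hidden size, the sigmoid output and modelling the classifier as a single linear layer are assumptions, not details given by the patent:

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """2-layer BiLSTM over the per-frame vectors, followed by a per-frame
    fully-connected classifier (all sizes are illustrative)."""
    def __init__(self, input_dim: int = 440, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, 1)   # speech / non-speech

    def forward(self, x):                # x: (batch, T, input_dim)
        out, _ = self.lstm(x)            # (batch, T, 2 * hidden_dim)
        return torch.sigmoid(self.classifier(out)).squeeze(-1)   # (batch, T)

frames = torch.randn(1, 100, 440)        # 100 one-dimensional feature vectors
pred = BiLSTMClassifier()(frames)        # per-frame predicted values in (0, 1)
```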
In this implementation, the feature matrix is input into the neural network model and vector-processed by the self-adaptive receptive field attention module to obtain a feature vector; the feature vector is input into the full-connection layer network and normalized to obtain a target feature matrix; the target feature matrix of each frame is converted into a one-dimensional vector in the manner of the Reshape function to obtain a one-dimensional feature vector; and the one-dimensional feature vector is input into the BiLSTM module for learning processing of the sequence information of the voice frames to obtain a predicted value. Through the pooling, normalization and related processing of the feature matrix, the predicted value is finally obtained, which facilitates the subsequent updating of the network parameters of the neural network model and thereby improves the accuracy of endpoint detection.
Referring to fig. 4, fig. 4 shows a specific implementation manner of step S31, in which a feature matrix is input into a neural network model in step S31, and vector processing is performed on the feature matrix by an adaptive receptive field attention module to obtain a specific implementation process of the feature vector, which is described in detail below:
S311, inputting the feature matrix into the neural network model, and converting the feature matrix into feature matrices of different receptive fields through the adaptive receptive field attention module to serve as basic feature matrices.
Specifically, in a convolutional neural network, the receptive field is defined as the size of the region of the original image onto which a pixel of the feature map output by each layer of the convolutional neural network maps. In this embodiment, the feature matrix is converted into feature matrices of different receptive fields, which serve as the basic feature matrices, to facilitate the subsequent pooling processing.
S312, converting the basic feature matrices into basic feature matrices of the same size according to the gating function mapping mode.
Specifically, the basic feature matrices are mapped to regions of the same size in the manner of gating function mapping, thereby obtaining basic feature matrices of the same size.
The gating function mapping mode maps input matrices to regions of the same size, yielding matrices of equal size.
S313, carrying out global maximum pooling and global average pooling on the basic feature matrices of the same size to obtain a global maximum pooling vector and a global average pooling vector, and taking the global maximum pooling vector and the global average pooling vector as the feature vector.
Specifically, global max pooling obtains global context by taking the maximum over an entire feature map rather than over a sliding window; global average pooling obtains global context by averaging over the entire feature map rather than over a window. In this embodiment, global max pooling and global average pooling are performed on the basic feature matrices of the same size, obtaining the global max pooling vector and the global average pooling vector.
In this embodiment, the feature matrix is input into the neural network model and converted by the adaptive receptive field attention module into feature matrices of different receptive fields, which serve as basic feature matrices; the basic feature matrices are converted into basic feature matrices of the same size in the manner of gating function mapping; and global max pooling and global average pooling are applied to them to obtain the global max pooling vector and the global average pooling vector, which are used as the feature vectors. Selecting the most suitable context receptive field information in this way helps improve the accuracy of voice endpoint detection.
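The patent does not fix the concrete operations of the adaptive receptive field attention module, so the following sketch is one plausible reading: different receptive fields are built with 1-D convolutions of different kernel sizes (step S311), a sigmoid gate maps every branch to the same size (step S312), and global max/average pooling yields the two vectors of step S313. All channel counts and kernel widths are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReceptiveFieldBranch(nn.Module):
    """One receptive-field view of a (feature x context) matrix; the kernel
    size k controls how much context the branch sees."""
    def __init__(self, feat_dim: int = 40, channels: int = 32, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, channels, kernel_size=k, padding=k // 2)
        self.gate = nn.Conv1d(feat_dim, channels, kernel_size=k, padding=k // 2)

    def forward(self, x):                        # x: (batch, feat_dim, context)
        return self.conv(x) * torch.sigmoid(self.gate(x))   # gated, equal size

x = torch.randn(8, 40, 11)                   # batch of 11-frame context matrices
branches = [ReceptiveFieldBranch(k=k) for k in (3, 5, 7)]
feats = [b(x) for b in branches]             # equal-sized gated feature matrices
# Step S313: global pooling of each gated feature matrix
gmax = [F.adaptive_max_pool1d(f, 1).squeeze(-1) for f in feats]   # (8, 32) each
gavg = [F.adaptive_avg_pool1d(f, 1).squeeze(-1) for f in feats]
```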
Referring to fig. 5, fig. 5 shows a specific implementation of step S32, in which the feature vector is input into the fully-connected layer network and normalized to obtain the target feature matrix, described in detail as follows:
S321, inputting the feature vector into the full-connection layer network to correspondingly obtain a maximum pooling coefficient vector and an average pooling coefficient vector.
Specifically, in the above steps, the global maximum pooling vector and the global average pooling vector are already obtained, and by inputting the global maximum pooling vector and the global average pooling vector into the fully-connected layer network, the maximum pooling coefficient vector and the average pooling coefficient vector to be obtained can be obtained correspondingly.
S322, adding the maximum pooled coefficient vector and the average pooled coefficient vector to obtain a coefficient vector accumulated value, and carrying out normalization processing on the coefficient vector accumulated value to obtain a normalized coefficient.
Specifically, the normalization processing is performed on the coefficient vector accumulated values, so that the normalized coefficient of each basic feature matrix with the same size can be obtained, and the normalized coefficient can be conveniently multiplied with the basic feature matrix with the same size in the follow-up process.
S323, multiplying the normalized coefficient by the basic feature matrix with the same size to obtain the target feature matrix.
Specifically, the normalized coefficient is multiplied by the basic feature matrix with the same size, so that the feature matrix of the attention processing of the attention mapping module, namely the target feature matrix, can be obtained.
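Continuing the receptive-field sketch above, steps S321 to S323 can be read as a shared 2-layer fully-connected network scoring each branch's pooled vectors, the two coefficient vectors being added and normalized (softmax across receptive fields is assumed here, since the patent only says "normalization"), and the normalized coefficients weighting the equal-sized gated feature matrices:

```python
import torch
import torch.nn as nn

# Shared 2-layer fully-connected network (sizes are illustrative)
fc = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 1))

# gmax / gavg / feats come from the receptive-field sketch above
coeffs = torch.stack([fc(m) + fc(a) for m, a in zip(gmax, gavg)])  # accumulated values
weights = torch.softmax(coeffs, dim=0)        # normalized coefficient per branch
# Step S323: weight each equal-sized feature matrix by its normalized coefficient
target = sum(w.unsqueeze(-1) * f for w, f in zip(weights, feats))  # (8, 32, 11)
```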
In this embodiment, the feature vector is input into the fully-connected layer network to correspondingly obtain the maximum pooling coefficient vector and the average pooling coefficient vector; the two are added to obtain the coefficient vector accumulated value, which is normalized to obtain the normalized coefficient; and the normalized coefficient is multiplied by the basic feature matrix of the same size to obtain the target feature matrix. The feature vector is thereby processed into the feature matrix attended to by the attention mapping module, which helps improve the accuracy of voice endpoint detection.
Referring to fig. 6, fig. 6 shows a specific implementation of step S34, in which the one-dimensional feature vector is input into the BiLSTM module for learning processing of the sequence information of the voice frames to obtain the predicted value, described in detail as follows:
S341, inputting the one-dimensional feature vector into the BiLSTM module, and learning the sequence information of the voice frame of each frame of the one-dimensional feature vector to obtain the voice frame results.
Specifically, the one-dimensional feature vectors are input into the 2-layer BiLSTM module, which learns the sequence information of the voice frame of each frame of the one-dimensional feature vector, finally producing the voice frame results.
S342, inputting the voice frame result of each frame into the full-connection layer neural network classifier correspondingly to obtain a predicted value.
Specifically, the full-connection layer neural network classifier classifies the result of each frame of voice to obtain the predicted value of each frame of voice; these predicted values are subsequently compared with the real voice values to measure how far they differ.
In this embodiment, the one-dimensional feature vector is input into the BiLSTM module, the sequence information of each frame's voice frame is learned to obtain the voice frame results, and each frame's result is correspondingly input into the full-connection layer neural network classifier to obtain the predicted value. Obtaining the predicted value facilitates the subsequent updating of the network parameters of the neural network, thereby improving the accuracy of voice endpoint detection.
Referring to fig. 7, fig. 7 shows a specific implementation manner of step S1, in which a sample voice file is obtained in step S1, and acoustic feature extraction is performed on the sample voice file, so as to obtain a specific implementation process of the voice feature, which is described in detail as follows:
S11, acquiring a sample voice file, and carrying out framing treatment on the voice file according to a preset length to obtain voice framing.
Specifically, the acquired sample voice file is segmented according to a preset length, namely framing processing is carried out, and voice framing with the same length is obtained.
The preset length is set according to the actual situation, and is not limited herein. In one embodiment, the predetermined length is 15 milliseconds.
S12, converting the time domain signals of the voice framing into frequency domain signals according to a Fourier transform mode to obtain basic voice framing.
Specifically, the voice framing obtained after the framing processing is a time-domain signal and needs to be converted into a frequency-domain signal in order to extract the acoustic Fbank features. The Fourier transform can transform a signal from the time domain to the frequency domain. It can be divided into the continuous Fourier transform and the discrete Fourier transform; since the voice framing is digital audio rather than analog audio, the discrete Fourier transform is adopted, thereby obtaining the basic voice framing.
S13, selecting basic voice framing which accords with a preset frequency spectrum range as voice characteristics.
Specifically, the energy spectrum of the basic voice framing is calculated, then the mel-frequency spectrum is calculated according to the energy spectrum, the basic voice framing which accords with the preset frequency spectrum range is selected, and finally the logarithm operation is carried out on the basic voice framing, so that the voice characteristics are finally obtained.
The mel-frequency cepstrum (MFC) is a spectrum that can be used to represent short-term audio; it is based on a log spectrum represented on a nonlinear mel scale and the linear cosine transform of that log spectrum.
Note that, the preset spectrum range is set according to the actual situation, and is not limited herein.
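Steps S11 to S13 can be sketched as below. The sampling rate, FFT size, filter count, non-overlapping framing, and realizing the "preset frequency spectrum range" through the span of the mel filters are all assumptions made for illustration:

```python
import numpy as np

def mel_filterbank(n_mels: int, n_fft: int, sr: int) -> np.ndarray:
    """Standard triangular mel filters covering 0 .. sr/2."""
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = inv(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb

def fbank(signal: np.ndarray, sr=16000, frame_ms=15, n_fft=512, n_mels=40):
    # S11: split the file into equal frames of the preset length (15 ms here)
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    # S12: discrete Fourier transform, time domain -> frequency domain
    power = np.abs(np.fft.rfft(frames * np.hamming(frame_len), n_fft)) ** 2
    # S13: energy spectrum -> mel spectrum -> logarithm = Fbank voice features
    return np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)

features = fbank(np.random.randn(16000))   # 1 s of audio -> (66, 40) Fbank
```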
In this embodiment, a sample voice file is obtained and framed according to a preset length to obtain the voice framing; the time-domain signal of the voice framing is converted into a frequency-domain signal in the manner of the Fourier transform to obtain the basic voice framing; and the basic voice framing conforming to the preset frequency spectrum range is selected as the voice features. Acoustic feature extraction is thereby realized, which facilitates the subsequent training of the neural network model and helps improve the accuracy of voice endpoint detection.
Referring to fig. 8, fig. 8 shows a specific implementation of step S4, in which the loss function value of the predicted value and the real voice value is calculated and the network parameters of the neural network model are updated according to the loss function value to obtain the trained neural network model, described in detail as follows:
S41, acquiring a real voice value of the sample voice file.
Specifically, the sample voice file includes real voice signals and non-real voice signals. In order to calculate the loss function value against the subsequent predicted value and judge how far the predicted value deviates from the real voice value, the real voice signal must first be acquired and then converted to numerical form to obtain the real voice value.
S42, calculating a loss function value of the predicted value and the real voice value, inputting the gradient of the loss function value into the neural network model, and updating the network parameters of the neural network model.
Specifically, the loss function value of the predicted value and the real voice value is calculated, and its gradient is fed into the neural network model to update the network parameters of the neural network model.
And S43, stopping updating the network parameters of the neural network model when the loss function value reaches a preset threshold value, and obtaining the trained neural network model.
Specifically, when the loss function value reaches the preset threshold, that is, when the predicted value is very close to the real voice value, updating of the network parameters of the neural network model is stopped, and the trained neural network model is obtained.
The preset threshold is set according to the actual situation, and is not limited herein. In one embodiment, the predetermined threshold is 0.005.
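A minimal training-loop sketch for steps S41 to S43, reusing the BiLSTMClassifier sketch from step S34. The binary cross-entropy loss, the Adam optimizer and the dummy `loader` of (features, per-frame real voice values) pairs are assumptions; the 0.005 threshold is the embodiment's example value.

```python
import torch
import torch.nn as nn

model = BiLSTMClassifier()                     # sketch model from step S34
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.BCELoss()                         # loss choice is an assumption
threshold = 0.005                              # preset threshold of the embodiment

# Hypothetical stand-in for a real dataset of labelled sample voice files
loader = [(torch.randn(1, 100, 440), torch.randint(0, 2, (1, 100)).float())]

for features, labels in loader:                # labels: per-frame real voice values
    pred = model(features)
    loss = loss_fn(pred, labels)               # loss function value
    optimizer.zero_grad()
    loss.backward()                            # gradient of the loss function value
    optimizer.step()                           # update the network parameters
    if loss.item() < threshold:                # S43: stop once close enough
        break
```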
In this embodiment, the real voice value of the sample voice file is acquired, the loss function value of the predicted value and the real voice value is calculated, the gradient of the loss function value is fed into the neural network model to update its network parameters, and updating is stopped when the loss function value reaches the preset threshold, yielding the trained neural network model. Training of the neural network model is thus achieved, which facilitates the subsequent output of prediction results for voice files to be detected and thereby helps improve the accuracy of voice endpoint detection.
It should be emphasized that, to further ensure the privacy and security of the sample voice file, the sample voice file may also be stored in a node of a blockchain.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored in a computer-readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
Referring to fig. 9, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a voice endpoint detection apparatus based on a neural network, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 9, the voice endpoint detection apparatus based on the neural network of the present embodiment includes: a voice feature extraction module 71, a feature matrix acquisition module 72, a predicted value acquisition module 73, a neural network model training module 74, an acoustic feature information extraction module 75 and a predicted result acquisition module 76, wherein:
the voice feature extraction module 71 is configured to obtain a sample voice file, and perform acoustic feature extraction on the sample voice file to obtain voice features, where the voice features include feature information;
a feature matrix obtaining module 72, configured to allocate N frames of context feature information to each frame of speech feature, to obtain a feature matrix, where N is a positive integer;
the predicted value obtaining module 73 is configured to perform feature processing on the feature matrix through the neural network model to obtain a one-dimensional feature vector, and perform learning processing of the sequence information of the voice frames on the one-dimensional feature vector to obtain a predicted value;
The neural network model training module 74 is configured to calculate a loss function value of the predicted value and the real voice value, and update network parameters of the neural network model according to the loss function value to obtain a trained neural network model;
the acoustic feature information extraction module 75 is configured to obtain a to-be-detected voice file, and perform acoustic feature extraction on the to-be-detected voice file to obtain a voice feature of the to-be-detected voice file;
the prediction result obtaining module 76 is configured to input the voice feature of the voice file to be detected into the trained neural network model, so as to obtain a prediction result.
Further, the predicted value obtaining module 73 includes:
the feature vector acquisition unit is used for inputting the feature matrix into the neural network model, and carrying out vector processing on the feature matrix through the self-adaptive receptive field attention module to obtain a feature vector;
the target feature matrix acquisition unit is used for inputting the feature vector into the full-connection layer network, and carrying out normalization processing on the feature vector to obtain a target feature matrix;
the one-dimensional feature vector acquisition unit is used for converting the target feature matrix of each frame into a one-dimensional vector according to the mode of a Reshape function to obtain a one-dimensional feature vector;
And the sequence information learning unit is used for inputting the one-dimensional feature vector into the BiLSTM module and performing learning processing of the sequence information of the voice frames to obtain a predicted value.
Further, the feature vector acquisition unit includes:
the basic feature matrix acquisition subunit is used for inputting the feature matrix into the neural network model, and converting the feature matrix into a feature matrix of a receptive field through the self-adaptive receptive field attention module to serve as a basic feature matrix;
the basic feature matrix conversion subunit is used for converting the basic feature matrix into a basic feature matrix with the same size according to the gating function mapping mode;
the basic feature matrix pooling unit is used for carrying out global maximum pooling and global average pooling on basic feature matrices with the same size to obtain global maximum pooling vectors and global average pooling vectors, and taking the global maximum pooling vectors and the global average pooling vectors as feature vectors.
Further, the target feature matrix acquisition unit includes:
the characteristic vector input subunit is used for correspondingly obtaining a maximum pooling coefficient vector and an average pooling coefficient vector by inputting the characteristic vector into the full-connection layer network;
The normalization coefficient acquisition subunit is used for adding the maximum pooling coefficient vector and the average pooling coefficient vector to obtain a coefficient vector accumulated value, and carrying out normalization processing on the coefficient vector accumulated value to obtain a normalization coefficient;
and the normalized coefficient processing subunit is used for multiplying the normalized coefficient with the basic feature matrix with the same size to obtain the target feature matrix.
Further, the sequence information learning unit includes:
the voice frame result obtaining subunit is used for inputting the one-dimensional feature vector into the BiLSTM module and performing learning processing on the sequence information of the voice frame of each frame of the one-dimensional feature vector to obtain the voice frame results;
and the voice frame input subunit is used for correspondingly inputting the voice frame result of each frame into the full-connection layer neural network classifier to obtain a predicted value.
Further, the voice feature extraction module 71 includes:
the voice framing acquisition unit is used for acquiring a sample voice file, and framing the voice file according to a preset length to obtain voice framing;
the basic voice framing acquisition unit is used for converting the time domain signals of the voice framing into frequency domain signals according to a Fourier transform mode to obtain basic voice framing;
The basic voice framing selection unit is used for selecting basic voice framing which accords with a preset frequency spectrum range as voice characteristics.
Further, the neural network model training module 74 includes:
the real voice value acquisition unit is used for acquiring the real voice value of the sample voice file;
the network parameter updating unit is used for calculating a loss function value of the predicted value and the real voice value, inputting the gradient of the loss function value into the neural network model and updating the network parameter of the neural network model;
and the network parameter stopping updating unit is used for stopping updating the network parameters of the neural network model when the loss function value reaches a preset threshold value to obtain the trained neural network model.
It should be emphasized that, to further ensure the privacy and security of the sample voice file, the sample voice file may also be stored in a node of a blockchain.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 10, fig. 10 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 8 comprises a memory 81, a processor 82 and a network interface 83 communicatively connected to each other via a system bus. It should be noted that only a computer device 8 having the three components memory 81, processor 82 and network interface 83 is shown in the figure, but it should be understood that not all of the illustrated components are required to be implemented, and more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device here is a device capable of automatically performing numerical calculation and/or information processing in accordance with preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.
The computer device may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The computer device can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 81 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 81 may be an internal storage unit of the computer device 8, such as the hard disk or memory of the computer device 8. In other embodiments, the memory 81 may also be an external storage device of the computer device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 8. Of course, the memory 81 may also include both an internal storage unit of the computer device 8 and an external storage device. In this embodiment, the memory 81 is typically used to store the operating system and the various types of application software installed on the computer device 8, such as the program code of the voice endpoint detection method based on the neural network. Further, the memory 81 may be used to temporarily store various types of data that have been output or are to be output.
The processor 82 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 82 is typically used to control the overall operation of the computer device 8. In this embodiment, the processor 82 is configured to execute a program code stored in the memory 81 or process data, for example, a program code for executing a voice endpoint detection method based on a neural network.
The network interface 83 may comprise a wireless network interface or a wired network interface, which network interface 83 is typically used to establish a communication connection between the computer device 8 and other electronic devices.
The present application also provides another embodiment, namely a computer readable storage medium storing a computer program executable by at least one processor, so that the at least one processor performs the steps of the voice endpoint detection method based on the neural network as described above.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, though in many cases the former is preferred. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods of the embodiments of the present application.
The blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain, essentially a decentralized database, is a chain of data blocks generated in association with one another by cryptographic methods; each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
It is apparent that the above-described embodiments are only some, not all, of the embodiments of the present application; the preferred embodiments shown in the drawings do not limit the scope of the claims. This application may be embodied in many different forms; the embodiments are provided so that this disclosure will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions recorded in the foregoing embodiments or substitute equivalents for some of their elements. Any equivalent structure made using the content of the specification and drawings of the application, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the application.

Claims (9)

1. A voice endpoint detection method based on a neural network, comprising:
obtaining a sample voice file, and framing the voice file according to a preset length to obtain voice frames;
converting the time-domain signal of the voice frames into a frequency-domain signal by means of a Fourier transform to obtain basic voice frames;
calculating an energy spectrum of the basic voice frames, calculating a mel spectrum from the energy spectrum, selecting the basic voice frames that conform to a preset spectrum range, and performing a logarithmic operation on the selected basic voice frames to obtain voice features, wherein the voice features comprise feature information;
assigning N frames of context feature information to each frame of the voice features to obtain a feature matrix, wherein N is a positive integer;
performing feature processing on the feature matrix through a neural network model to obtain a one-dimensional feature vector, and performing learning processing of voice-frame sequence information on the one-dimensional feature vector to obtain a predicted value;
calculating a loss function value from the predicted value and a real voice value, and updating network parameters of the neural network model according to the loss function value to obtain a trained neural network model;
acquiring a voice file to be detected, and performing acoustic feature extraction on the voice file to be detected to obtain voice features of the voice file to be detected;
and inputting the voice features of the voice file to be detected into the trained neural network model to obtain a prediction result.
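For illustration only, the following is a minimal Python/NumPy sketch of the front end described in claim 1 (framing, Fourier transform, log-mel features, context stacking). The frame length, hop size, mel filterbank `mel_fb`, and context width `n` are assumptions chosen for the example, not values fixed by the claims.

```python
# Illustrative sketch of the claim-1 front end (not the patented implementation).
import numpy as np

def frame_signal(wav, frame_len=400, hop=160):
    """Split a 1-D waveform into overlapping windowed frames of a preset length.
    Defaults correspond to 25 ms / 10 ms at 16 kHz; assumes len(wav) >= frame_len."""
    n_frames = 1 + max(0, (len(wav) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return wav[idx] * np.hanning(frame_len)

def log_mel_features(frames, mel_fb):
    """Energy spectrum -> mel spectrum -> log, as in the claim.
    mel_fb is an assumed mel filterbank of shape (n_mels, frame_len // 2 + 1)."""
    spec = np.fft.rfft(frames, axis=1)       # time-domain -> frequency-domain signal
    energy = np.abs(spec) ** 2               # energy spectrum per frame
    mel = energy @ mel_fb.T                  # mel spectrum
    return np.log(mel + 1e-8)                # logarithmic operation -> voice features

def add_context(feats, n=5):
    """Attach N frames of context on each side of every frame (claim 1)."""
    padded = np.pad(feats, ((n, n), (0, 0)), mode="edge")
    return np.stack([padded[i:i + len(feats)] for i in range(2 * n + 1)], axis=1)
```

Calling `add_context(log_mel_features(frame_signal(wav), mel_fb))` yields, for every frame, a (2n+1) x n_mels feature matrix of the kind the later claims feed into the neural network model.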
2. The voice endpoint detection method based on the neural network according to claim 1, wherein performing feature processing on the feature matrix through the neural network model to obtain a one-dimensional feature vector, and performing learning processing of voice-frame sequence information on the one-dimensional feature vector to obtain a predicted value, comprises:
inputting the feature matrix into the neural network model, and performing vector processing on the feature matrix through an adaptive receptive field attention module to obtain a feature vector;
inputting the feature vector into a fully-connected layer network and normalizing it to obtain a target feature matrix;
converting the target feature matrix of each frame into a one-dimensional vector by means of a Reshape function to obtain the one-dimensional feature vector;
and inputting the one-dimensional feature vector into a bidirectional long short-term memory neural network module, and performing learning processing on the voice-frame sequence information to obtain the predicted value.
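A hedged PyTorch sketch of the claim-2 forward pass follows. `VADNet`, `flat_dim`, and `hidden` are illustrative names and assumptions: the `attention` submodule stands in for the adaptive receptive field attention module of claim 3, and `nn.LSTM` with `bidirectional=True` plays the role of the bidirectional long short-term memory module.

```python
# Illustrative sketch of the claim-2 model structure; dimensions are assumptions.
import torch
import torch.nn as nn

class VADNet(nn.Module):
    def __init__(self, attention, flat_dim, hidden=128):
        super().__init__()
        self.attention = attention               # adaptive receptive field attention (claim 3)
        # flat_dim must equal the flattened size of the attention output per frame
        self.blstm = nn.LSTM(flat_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 2)   # speech / non-speech per frame

    def forward(self, x):                        # x: (batch, frames, context, mels)
        m = self.attention(x)                    # target feature matrix per frame (claims 3-4)
        flat = m.reshape(m.size(0), m.size(1), -1)   # Reshape -> one-dimensional feature vector
        seq, _ = self.blstm(flat)                # learn voice-frame sequence information
        return self.classifier(seq)              # per-frame predicted values (logits)
```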
3. The voice endpoint detection method based on the neural network according to claim 2, wherein inputting the feature matrix into the neural network model and performing vector processing on the feature matrix through the adaptive receptive field attention module to obtain the feature vector comprises:
inputting the feature matrix into the neural network model, and converting the feature matrix into receptive-field feature matrices through the adaptive receptive field attention module to serve as basic feature matrices;
converting the basic feature matrices into basic feature matrices of the same size by means of gating function mapping;
and performing global max pooling and global average pooling on the same-size basic feature matrices to obtain a global max pooling vector and a global average pooling vector, and taking the global max pooling vector and the global average pooling vector as the feature vector.
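The sketch below illustrates the pooling step of claim 3 under two stated assumptions: that the receptive-field feature matrices are produced by parallel convolutions with different kernel sizes, and that the gating function is a sigmoid. `ReceptiveFieldPooling` and its parameters are invented names for this example, not terms from the patent.

```python
# Assumed realization of claim 3's pooling: multi-receptive-field branches are
# gated to same-size maps, then globally max- and average-pooled.
import torch
import torch.nn as nn

class ReceptiveFieldPooling(nn.Module):
    def __init__(self, in_ch, out_ch, kernels=(3, 5, 7)):
        super().__init__()
        # branches with different receptive fields; padding k//2 keeps sizes equal
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in kernels
        )

    def forward(self, x):                        # x: (batch, in_ch, H, W)
        # sigmoid gating maps each branch to a same-size basic feature matrix
        base = torch.stack([torch.sigmoid(b(x)) for b in self.branches], dim=1)
        gmax = base.amax(dim=(-2, -1))           # global max pooling vector
        gavg = base.mean(dim=(-2, -1))           # global average pooling vector
        return base, gmax, gavg                  # pooled vectors form the feature vector
```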
4. The voice endpoint detection method based on the neural network according to claim 3, wherein inputting the feature vector into the fully-connected layer network and normalizing it to obtain the target feature matrix comprises:
inputting the feature vector into the fully-connected layer network to correspondingly obtain a max pooling coefficient vector and an average pooling coefficient vector;
adding the max pooling coefficient vector and the average pooling coefficient vector to obtain a coefficient vector sum, and normalizing the coefficient vector sum to obtain a normalization coefficient;
and multiplying the normalization coefficient by the same-size basic feature matrices to obtain the target feature matrix.
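Continuing the same assumptions, a minimal sketch of claim 4's recalibration: the two pooled vectors pass through fully-connected layers, their sum is normalized with a sigmoid, and the resulting coefficients rescale the same-size basic feature matrices. `PoolingRecalibration` is again an illustrative name.

```python
# Assumed realization of claim 4; pairs with ReceptiveFieldPooling above.
import torch
import torch.nn as nn

class PoolingRecalibration(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc_max = nn.Linear(dim, dim)        # -> max pooling coefficient vector
        self.fc_avg = nn.Linear(dim, dim)        # -> average pooling coefficient vector

    def forward(self, base, gmax, gavg):         # base: (B, n, C, H, W); gmax/gavg: (B, n, C)
        coeff = self.fc_max(gmax) + self.fc_avg(gavg)   # coefficient vector sum
        weight = torch.sigmoid(coeff)                   # normalization coefficient
        return base * weight[..., None, None]           # target feature matrix
```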
5. The voice endpoint detection method based on the neural network according to claim 2, wherein inputting the one-dimensional feature vector into the bidirectional long short-term memory neural network module and performing learning processing on the voice-frame sequence information to obtain the predicted value comprises:
inputting the one-dimensional feature vector into the bidirectional long short-term memory neural network module, and performing learning processing on the voice-frame sequence information of each frame of the one-dimensional feature vector to obtain a voice-frame result;
and correspondingly inputting the voice-frame result of each frame into a fully-connected-layer neural network classifier to obtain the predicted value.
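As a usage-level sketch of claim 5, assuming PyTorch and illustrative sizes, per-frame outputs of a bidirectional LSTM are passed through a fully-connected classifier to obtain per-frame predicted values:

```python
import torch
import torch.nn as nn

blstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True, bidirectional=True)
frame_classifier = nn.Linear(2 * 128, 2)     # fully-connected classifier per frame

x = torch.randn(4, 100, 64)                  # (batch, frames, one-dimensional feature vector)
seq_out, _ = blstm(x)                        # per-frame voice-frame results carry sequence info
logits = frame_classifier(seq_out)           # (batch, frames, 2) predicted values
probs = logits.softmax(dim=-1)               # speech / non-speech probabilities
```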
6. The voice endpoint detection method based on the neural network according to claim 1, wherein calculating the loss function value from the predicted value and the real voice value and updating the network parameters of the neural network model according to the loss function value to obtain the trained neural network model comprises:
acquiring the real voice value of the sample voice file;
calculating the loss function value from the predicted value and the real voice value, feeding the gradient of the loss function value into the neural network model, and updating the network parameters of the neural network model;
and when the loss function value reaches a preset threshold, stopping updating the network parameters of the neural network model to obtain the trained neural network model.
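A schematic training loop matching claim 6, with cross-entropy standing in for the unspecified loss function and all hyperparameters (learning rate, threshold, epoch cap) assumed for the example:

```python
import torch
import torch.nn as nn

def train_until_threshold(model, loader, threshold=0.05, lr=1e-3, max_epochs=50):
    """Update parameters from the loss gradient; stop once the loss function
    value reaches the preset threshold (claim 6). Hyperparameters are illustrative."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(max_epochs):
        for feats, labels in loader:             # labels: real voice values (LongTensor, 0/1 per frame)
            logits = model(feats)                # (batch, frames, 2) predicted values
            loss = criterion(logits.reshape(-1, 2), labels.reshape(-1))
            opt.zero_grad()
            loss.backward()                      # feed the gradient into the network
            opt.step()                           # update the network parameters
            if loss.item() <= threshold:
                return model                     # trained neural network model
    return model
```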
7. A voice endpoint detection apparatus based on a neural network, comprising:
a voice frame acquisition unit, configured to acquire a sample voice file and to frame the voice file according to a preset length to obtain voice frames;
a basic voice frame acquisition unit, configured to convert the time-domain signal of the voice frames into a frequency-domain signal by means of a Fourier transform to obtain basic voice frames;
a basic voice frame selection unit, configured to calculate an energy spectrum of the basic voice frames, calculate a mel spectrum from the energy spectrum, select the basic voice frames that conform to a preset spectrum range, and perform a logarithmic operation on the selected basic voice frames to obtain voice features, wherein the voice features comprise feature information;
a feature matrix acquisition module, configured to assign N frames of context feature information to each frame of the voice features to obtain a feature matrix, wherein N is a positive integer;
a predicted value acquisition module, configured to perform feature processing on the feature matrix through a neural network model to obtain a one-dimensional feature vector, and to perform learning processing of voice-frame sequence information on the one-dimensional feature vector to obtain a predicted value;
a neural network model training module, configured to calculate a loss function value from the predicted value and a real voice value, and to update network parameters of the neural network model according to the loss function value to obtain a trained neural network model;
an acoustic feature extraction module, configured to acquire a voice file to be detected and to extract acoustic features of the voice file to be detected, obtaining voice features of the voice file to be detected;
and a prediction result acquisition module, configured to input the voice features of the voice file to be detected into the trained neural network model to obtain a prediction result.
8. A computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program implementing a neural network-based voice endpoint detection method as claimed in any one of claims 1 to 6.
9. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the neural network-based voice endpoint detection method according to any one of claims 1 to 6.
CN202011309613.5A 2020-11-20 2020-11-20 Voice endpoint detection method, device, equipment and medium based on neural network Active CN112489677B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011309613.5A CN112489677B (en) 2020-11-20 2020-11-20 Voice endpoint detection method, device, equipment and medium based on neural network
PCT/CN2021/083937 WO2021208728A1 (en) 2020-11-20 2021-03-30 Method and apparatus for speech endpoint detection based on neural network, device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011309613.5A CN112489677B (en) 2020-11-20 2020-11-20 Voice endpoint detection method, device, equipment and medium based on neural network

Publications (2)

Publication Number Publication Date
CN112489677A CN112489677A (en) 2021-03-12
CN112489677B true CN112489677B (en) 2023-09-22

Family

ID=74932667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011309613.5A Active CN112489677B (en) 2020-11-20 2020-11-20 Voice endpoint detection method, device, equipment and medium based on neural network

Country Status (2)

Country Link
CN (1) CN112489677B (en)
WO (1) WO2021208728A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489677B (en) * 2020-11-20 2023-09-22 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and medium based on neural network
CN113327604A (en) * 2021-07-02 2021-08-31 因诺微科技(天津)有限公司 Ultrashort speech language identification method
CN113744725A (en) * 2021-08-19 2021-12-03 清华大学苏州汽车研究院(相城) Training method of voice endpoint detection model and voice noise reduction method
CN113780107B (en) * 2021-08-24 2024-03-01 电信科学技术第五研究所有限公司 Radio signal detection method based on deep learning dual-input network model
WO2023042377A1 (en) * 2021-09-17 2023-03-23 日本電信電話株式会社 Learning device, conversion device, learning method, and program
CN115546212B (en) * 2022-11-29 2023-04-07 浙江大学计算机创新技术研究院 Image anomaly detection method for generating countermeasure network based on global context embedding
CN115862676A (en) * 2023-02-22 2023-03-28 南方电网数字电网研究院有限公司 Voice superposition detection method and device based on deep learning and computer equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445913B (en) * 2020-03-24 2023-04-07 南开大学 Voiceprint feature extraction method and device based on neural network
CN112489677B (en) * 2020-11-20 2023-09-22 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and medium based on neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610707A (en) * 2016-12-15 2018-01-19 平安科技(深圳)有限公司 A kind of method for recognizing sound-groove and device
CN108281158A (en) * 2018-01-12 2018-07-13 平安科技(深圳)有限公司 Voice biopsy method, server and storage medium based on deep learning
CN110767218A (en) * 2019-10-31 2020-02-07 南京励智心理大数据产业研究院有限公司 End-to-end speech recognition method, system, device and storage medium thereof

Also Published As

Publication number Publication date
WO2021208728A1 (en) 2021-10-21
CN112489677A (en) 2021-03-12

Similar Documents

Publication Publication Date Title
CN112489677B (en) Voice endpoint detection method, device, equipment and medium based on neural network
US11164573B2 (en) Method and apparatus for controlling page
US20200372905A1 (en) Mixed speech recognition method and apparatus, and computer-readable storage medium
CN110555714A (en) method and apparatus for outputting information
CN112562691A (en) Voiceprint recognition method and device, computer equipment and storage medium
CN112289299B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN112466314A (en) Emotion voice data conversion method and device, computer equipment and storage medium
WO2021237923A1 (en) Smart dubbing method and apparatus, computer device, and storage medium
WO2021127982A1 (en) Speech emotion recognition method, smart device, and computer-readable storage medium
CN112836521A (en) Question-answer matching method and device, computer equipment and storage medium
CN112699213A (en) Speech intention recognition method and device, computer equipment and storage medium
CN116684330A (en) Traffic prediction method, device, equipment and storage medium based on artificial intelligence
CN115238909A (en) Data value evaluation method based on federal learning and related equipment thereof
CN116777646A (en) Artificial intelligence-based risk identification method, apparatus, device and storage medium
CN116703659A (en) Data processing method and device applied to engineering consultation and electronic equipment
CN116739154A (en) Fault prediction method and related equipment thereof
CN113421554B (en) Voice keyword detection model processing method and device and computer equipment
CN113555005B (en) Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium
CN109754319B (en) Credit score determination system, method, terminal and server
CN116129881A (en) Voice task processing method and device, electronic equipment and storage medium
CN115206321A (en) Voice keyword recognition method and device and electronic equipment
CN115242927A (en) Customer service object distribution method and device, computer equipment and storage medium
CN113160823A (en) Voice awakening method and device based on pulse neural network and electronic equipment
CN113936677A (en) Tone conversion method, device, computer equipment and storage medium
CN114913860A (en) Voiceprint recognition method, voiceprint recognition device, computer equipment, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant