CN116259312A - Method for automatic speech clipping tasks and neural network model training method


Info

Publication number
CN116259312A
Authority
CN
China
Prior art keywords
neural network
audio
training
voice
network model
Prior art date
2021-12-21
Legal status
Pending
Application number
CN202111568954.9A
Other languages
Chinese (zh)
Inventor
刘臣
倪仁倢
周立欣
Current Assignee
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date
2021-12-21
Filing date
2021-12-21
Publication date
2023-06-13
Application filed by University of Shanghai for Science and Technology
Priority to CN202111568954.9A
Publication of CN116259312A
Status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 characterised by the type of extracted parameters
    • G10L 25/24 the extracted parameters being the cepstrum
    • G10L 25/27 characterised by the analysis technique
    • G10L 25/30 using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention discloses a method for automatic speech clipping tasks and a neural network model training method. The training method trains the neural network model on whole recordings first and on local segments afterwards, with a fast-first-then-slow training speed. The clipping method comprises the following steps: extracting the audio from the original audio/video and preprocessing it to obtain preprocessed audio; extracting several acoustic features from the preprocessed audio; inputting the acoustic features into the trained neural network model for speech detection and obtaining the output; and automatically clipping the original audio/video according to the output of the trained neural network model. The invention rapidly detects the speech and non-speech segments in audio and video and automatically clips the original material in an artistically pleasing way.

Description

Method for automatic speech clipping tasks and neural network model training method
Technical Field
The invention relates to the technical fields of artificial intelligence, speech detection and audio/video editing, and in particular to a method for automatic speech clipping tasks and a neural network model training method.
Background
With the spread of the internet, the digital media industry is growing rapidly and the volume of audio and video media is increasing exponentially; none of this post-production can do without editing. Editing is a highly stylized craft, and different kinds of media have different editing styles and requirements. Compared with other post-production work such as color grading or subtitling, editing is usually the most labor- and time-intensive task. Speech-centered audio and video have always made up a large share of broadcasting and television, and these programs tend to be long; an editor must review the material in full before starting to cut, so editing this kind of content consumes a great deal of manpower and time.
Because editing depends heavily on the context before and after each segment, plain voice endpoint detection performs poorly when applied directly to speech clipping: the transitions in the edited media become abrupt and the quality drops. Current automatic editing devices mainly rely on machine learning algorithms, for example a hidden Markov model (HMM) combined with the Viterbi algorithm; but since the HMM is constrained by the Markov property, its ability to capture long-range temporal information is limited, so the editing result cannot exploit the before-and-after context of the audio and video.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a method for automatic speech clipping tasks and a neural network model training method that rapidly detect speech and non-speech segments in audio and video and automatically clip the original material in an artistically pleasing way. To achieve the above and other advantages and in accordance with the purpose of the invention, a method for automatic speech clipping tasks is provided, comprising:
S1, building a neural network model comprising a convolution layer, a bidirectional recurrent neural network and a feed-forward neural network;
S2, training the neural network model from step S1;
S3, performing automatic audio/video clipping based on the neural network model, comprising the following steps:
S11, extracting the audio from the original audio/video and preprocessing it to obtain preprocessed audio;
S12, extracting several acoustic features from the audio preprocessed in step S11;
S13, inputting the acoustic features from step S12 into the trained neural network model for speech detection and obtaining the output;
S14, automatically clipping the original audio/video according to the output from step S13.
Preferably, the convolution layer comprises several convolutional neural networks, each performing a convolution operation on a different acoustic feature; the convolution layer is combined with a bidirectional recurrent neural network and a feed-forward neural network to obtain the neural network model for the automatic speech clipping task, the bidirectional recurrent neural network consisting of one forward recurrent layer and one backward recurrent layer.
Preferably, the convolution layer passes the output of each of its networks through an activation function and stacks the activated outputs to obtain the final output of the convolution layer; this output is fed into the bidirectional recurrent neural network to obtain its output, which in turn is fed into the feed-forward neural network. The output of the feed-forward neural network is activated and classified with a Softmax activation function to obtain the final output of the neural network model.
Preferably, in step S2 a binary cross-entropy loss function is used to compute the loss value and a back-propagation algorithm updates the parameters along the gradient; the trained model parameters are then saved. During training, the model that performs best on the validation set is selected and evaluated on the test set, and finally the parameters of the model that performs best on the test set are saved.
Preferably, the acoustic features in step S3 include the log-mel spectrum, short-time energy and short-time zero-crossing rate; each of the three features is normalized to ease the subsequent neural network computation, and the extracted log-mel spectrum, short-time energy and short-time zero-crossing rate have the same sequence length.
Preferably, the length of the output sequence of the neural network model equals the length of the input feature sequence, the number of networks in the convolution layer equals the number of input features, and the output of the neural network model is a one-dimensional, two-class time series.
A training method for the neural network model performs a first round of training with larger batches of training data and a larger learning rate and stops when the loss function is close to convergence; it then performs a second round with smaller batches and a smaller learning rate and stops when the loss function converges. The hyperparameters are adjusted continuously with this procedure, and finally the model parameters that perform best on the test set are saved.
Compared with the prior art, the invention has the following beneficial effects:
(1) Compared with traditional machine learning algorithms, the method for automatic speech clipping tasks and the neural network model training method read the contextual information of the audio and video more effectively, making the clipping process more intelligent. In addition, compared with traditional voice endpoint detection, the method keeps margins of varying length before and after each speech segment, so the clipped result transitions naturally and is more artistically pleasing.
(2) The device provided by the invention is computationally efficient: processing 60 minutes of audio takes less than 1 minute, saving a great deal of manpower and time compared with traditional manual editing.
Drawings
FIG. 1 is a schematic diagram of the network architecture of the neural network model according to the present invention;
FIG. 2 is a schematic diagram of the convolutional layer structure according to the present invention;
FIG. 3 is a schematic diagram of the bidirectional gated recurrent neural network structure according to the present invention;
FIG. 4 is a flow chart of the neural network model training method according to the present invention;
FIG. 5 is a schematic diagram of the data set labeling process in an embodiment of the present invention;
FIG. 6 is a flow chart of the neural-network-based automatic audio/video clipping method according to the present invention;
FIG. 7 is a schematic diagram of the components of the neural-network-based automatic audio/video clipping apparatus according to the present invention;
FIG. 8 compares the model clipping results with manual clipping results according to the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The embodiments described are obviously only some, not all, of the embodiments of the invention. All other embodiments that a person skilled in the art can derive from them without inventive effort fall within the scope of the invention.
Terms and concepts related to neural networks that are used in the embodiments of the present application are described below.
(1) Feedforward neural network
A feed-forward neural network (FNN) is a unidirectional multi-layer structure consisting of an input layer, one or more hidden layers and a final output layer. Each layer contains several neurons; neurons in adjacent layers are fully connected. Each neuron receives the signals of the neurons in the previous layer and produces an output that is passed on to the next layer.
(2) Convolutional neural network
A convolutional neural network (CNN) is a feed-forward neural network with a convolutional structure. It slides a filter (consisting of a convolution kernel and a bias) over the data, takes the Hadamard product of the kernel with the data inside the current window, sums the result and adds the bias to obtain a new value; the computation is shown in formula (1). Thanks to weight sharing, a CNN has few learnable parameters, which helps avoid overfitting and keeps computation efficient. In addition, because of its local receptive field, a CNN can combine information from surrounding data, and as the convolution layers deepen it can abstract increasingly high-dimensional features of the data.
$$y_{i,j} = \sum_{m=0}^{h-1}\sum_{n=0}^{w-1} x_{i+m,\,j+n}\, k_{m,n} + b \qquad (1)$$
where $k$ is the $h \times w$ convolution kernel, $b$ is the bias, and the sum is taken over the Hadamard (element-wise) product of the kernel with the current window of the input $x$.
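As an illustration of formula (1), the following NumPy sketch slides one filter over a two-dimensional input; the function and variable names are illustrative only and do not appear in the patent.

```python
import numpy as np

def conv2d_single_filter(x, kernel, bias, stride=1):
    """Slide one filter over x: at each position take the Hadamard
    (element-wise) product of the kernel with the current window,
    sum it and add the bias, as in formula (1)."""
    kh, kw = kernel.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(window * kernel) + bias  # Hadamard product, then sum
    return out

# Example: a 5x5 input, a 3x3 kernel and a bias of 0.1
y = conv2d_single_filter(np.random.randn(5, 5), np.random.randn(3, 3), 0.1)
```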
(3) Recurrent neural network
A recurrent neural network (RNN) is a class of neural networks with short-term memory. It memorizes the previous hidden-state information and uses it to influence the output at the current node; that is, the hidden-layer nodes of an RNN are connected over time, and the input of the hidden layer includes not only the output of the input layer but also the hidden state of the previous time step. Although in theory an RNN can process sequences of arbitrary length, a conventional RNN cannot control how its hidden state is updated, suffers from vanishing or exploding gradients, and cannot effectively capture long-range temporal information. The later gated units, GRU and LSTM, introduce the notion of control gates so that the hidden state can be updated selectively; this largely avoids vanishing and exploding gradients and lets the network combine long-range temporal information more effectively.
(4) Gated recurrent unit
A gated recurrent unit (GRU) is a variant of the RNN that adds a reset gate and an update gate to control how the hidden state is updated, so that it can effectively capture long-range dependencies and, to a certain extent, avoid the vanishing-gradient problems of the classical RNN. It is similar to long short-term memory (LSTM), except that it merges the input gate and the forget gate of the LSTM into a single update gate, so its computational complexity is lower than that of the LSTM. The update gate determines how much of the previous hidden information is retained at the current node, and the reset gate controls how much information is forgotten at the current node.
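The patent does not write out the gate equations; as a reference, the standard GRU update (in the convention used by PyTorch's GRU implementation) is:

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$\tilde{h}_t = \tanh\!\left(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\right)$$
$$h_t = (1 - z_t) \odot \tilde{h}_t + z_t \odot h_{t-1}$$

where $z_t$ is the update gate, $r_t$ the reset gate, $h_t$ the hidden state and $\sigma$ the sigmoid function.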
(5) Loss function
The loss function measures the discrepancy between the model prediction f(x) and the true value Y. It is a non-negative function, usually written L(Y, f(x)); the smaller the loss, the closer the prediction is to the true value. Training a neural network model means using an optimizer and the back-propagation algorithm to update the model parameters iteratively so that the loss keeps decreasing; when the loss has fully converged, training ends.
(6) Back-propagation algorithm
The back-propagation (BP) algorithm is a learning algorithm for multi-layer neural networks based on gradient descent. It consists of a forward pass and a backward pass. In the forward pass, the input is processed layer by layer from the input layer through the hidden layers and propagated to the output layer. The difference between the model output and the true value is then used as the objective function; in the backward pass, the partial derivative of the objective with respect to each neuron's weights is computed layer by layer, forming the gradient of the objective with respect to the weight vector, which serves as the basis for modifying the weights. The network is trained through this weight-modification process, and training ends when the error reaches the desired range.
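A minimal PyTorch sketch of one forward/backward iteration as described above; the tiny network and the random batch are hypothetical stand-ins used only to show the cycle.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a small network and one random batch of
# per-frame features with binary speech / non-speech targets.
model = nn.Sequential(nn.Linear(42, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
features = torch.randn(8, 42)
labels = torch.randint(0, 2, (8, 1)).float()

criterion = nn.BCELoss()                                   # binary cross-entropy, as in step S2
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

optimizer.zero_grad()
loss = criterion(model(features), labels)  # forward pass and loss value
loss.backward()                            # backward pass: gradients of the loss w.r.t. the weights
optimizer.step()                           # weight update along the gradient
```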
Referring to FIGS. 1-8, a method for automatic speech clipping tasks comprises: S1, building a neural network model comprising a convolution layer, a bidirectional gated recurrent neural network (Bi-GRU) and an FNN;
s2, training the neural network model in the step S1, training the neural network model in a data form of firstly integrating and then locally, and simultaneously training quickly and slowly.
In an embodiment of the application, the training-set audio is recorded manually, the validation set and the test set are each a one-hour excerpt from the CHiME-5 data set, and a professional clipping engineer is invited to label the data. For the first training round, the training set is cut into longer audio segments as the data of each batch and a larger learning rate is set. For the second round, the training set is cut into shorter audio segments as the data of each batch and a smaller learning rate is used;
s3, performing audio and video automatic editing based on the neural network model, wherein the audio and video automatic editing comprises the following steps of:
s11, extracting audio in original audio and video and preprocessing to obtain preprocessed audio, wherein in the embodiment of the application, the preprocessing process comprises the steps of downsampling the audio in the original audio and video to obtain preprocessed audio, and if the audio in the original audio and video is multichannel, converting the audio into mono;
s12, extracting various acoustic features from the audio preprocessed in the step S11;
s13, inputting the acoustic characteristics in the step S12 into a trained neural network model for voice detection, and outputting a result;
s14, automatically editing the original audio and video according to the output result in the step S13.
Further, the convolution layer comprises several convolutional neural networks, each performing a convolution operation on a different acoustic feature; the convolution layer is combined with the Bi-GRU and the FNN to obtain the neural network model for the automatic speech clipping task.
Further, the convolution layer activates the output of each convolutional neural network with the LeakyReLU function and stacks the activated outputs to obtain its final output; this is fed into the Bi-GRU, the output of the Bi-GRU is fed into the FNN, and the output of the FNN is activated and classified with a Softmax activation function to obtain the final output of the neural network model.
Further, the convolution layer extracts and abstracts high-dimensional information from the acoustic features, the forward gated recurrent unit reads the temporal information before the current node, and the backward gated recurrent unit reads the temporal information after it.
Further, in step S2 a binary cross-entropy loss function is used to compute the loss value and a back-propagation algorithm updates the parameters along the gradient; the trained model parameters are then saved. During training, the model that performs best on the validation set is selected and evaluated on the test set, and finally the parameters of the model that performs best on the test set are saved. Unlike traditional mini-batch gradient descent, this training method varies the batch size and the learning rate, which lets the Bi-GRU make better predictions by combining the temporal structure of the data.
Further, the acoustic features in step S3 include the log-mel spectrum, short-time energy and short-time zero-crossing rate; each of the three features is normalized to ease the subsequent neural network computation, the extracted log-mel spectrum, short-time energy and short-time zero-crossing rate have the same time-sequence length, and the trained neural network model can distinguish the speech parts from the non-speech parts of the audio.
Further, the length of the output sequence of the neural network model equals the length of the input feature sequence, the number of networks in the convolution layer equals the number of input features, and the output of the neural network model is a one-dimensional, two-class time series. In an embodiment of the application, the audio of the original audio/video is downsampled to the output frequency of the model, and the original audio/video is then clipped according to the model output.
A training method for the neural network model performs a first round of training with larger batches of training data and a larger learning rate and stops when the loss function is close to convergence; it then performs a second round with smaller batches and a smaller learning rate and stops when the loss function converges. The hyperparameters are adjusted continuously with this procedure, and finally the model parameters that perform best on the test set are saved.
Specifically, the segments of the original audio/video to be kept are selected according to the model output and then recombined to obtain the clipped audio/video.
The neural-network-based automatic audio/video clipping apparatus comprises:
a preprocessing module, for preprocessing the original audio/video to obtain preprocessed audio;
a feature extraction module, for extracting several acoustic features from the preprocessed audio;
a speech detection module, for inputting the acoustic features into the trained neural network model for speech detection and obtaining the output;
a clipping module, for automatically clipping the original audio/video according to the output of the trained neural network model.
Example 1
Referring to FIG. 1, FIG. 1 is a schematic diagram of the network structure of the neural network model used in an embodiment of the present application. The model consists of three parts: a convolution layer composed of several CNNs, a Bi-GRU composed of one forward GRU layer and one backward GRU layer, and one FNN layer.
The structure of the convolution layer in this embodiment is shown in FIG. 2. For the log-mel spectrum features, a convolution is first performed with two layers of two-dimensional CNNs. The first layer convolves with 8 filters of size 1×40 and stride 1 and is used to abstract the features at each time step. The second layer convolves with 3 filters of size 5×8, again with stride 1 and with padding set to 2, so that the output sequence length of the CNN matches the input sequence length, which simplifies the final clipping; this second layer can incorporate feature information from neighboring frames. Because audio features are highly nonlinear, the output of the convolution layer is activated with the LeakyReLU function, which has a slope parameter α on the negative interval and effectively avoids zero gradients. The LeakyReLU is defined as follows:
$$\mathrm{LeakyReLU}(x) = \begin{cases} x, & x \ge 0 \\ \alpha x, & x < 0 \end{cases} \qquad (2)$$
For the short-time energy features, one layer of one-dimensional CNN is used, with filter size 5, stride 1, padding 2 and LeakyReLU activation. Likewise, for the short-time zero-crossing rate, a one-dimensional CNN with filter size 5, stride 1, padding 2 and LeakyReLU activation is used. Finally, the convolution layer fuses the outputs of all CNNs into its final output, which is then fed into the Bi-GRU.
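A PyTorch sketch of the convolution layer just described. The filter sizes, strides and padding follow the text; the output channel counts of the energy and zero-crossing branches, the class name and the tensor layout are assumptions.

```python
import torch
import torch.nn as nn

class ConvFrontEnd(nn.Module):
    """Sketch of the convolution layer: a two-layer 2-D CNN branch for the
    log-mel spectrum and 1-D CNN branches for energy and zero-crossing rate."""

    def __init__(self, n_mels: int = 40):
        super().__init__()
        # Log-mel branch: 8 filters of size 1 x 40 abstract each time frame,
        # then 3 filters of length 5 over time (padding 2 keeps the length).
        self.mel_frame = nn.Conv2d(1, 8, kernel_size=(1, n_mels), stride=1)
        self.mel_time = nn.Conv1d(8, 3, kernel_size=5, stride=1, padding=2)
        # Short-time energy and zero-crossing-rate branches.
        self.energy = nn.Conv1d(1, 1, kernel_size=5, stride=1, padding=2)
        self.zcr = nn.Conv1d(1, 1, kernel_size=5, stride=1, padding=2)
        # The text names LeakyReLU; a learnable negative slope would be nn.PReLU.
        self.act = nn.LeakyReLU()

    def forward(self, logmel, energy, zcr):
        # logmel: (batch, T, n_mels); energy and zcr: (batch, T)
        m = self.act(self.mel_frame(logmel.unsqueeze(1)))   # (batch, 8, T, 1)
        m = self.act(self.mel_time(m.squeeze(-1)))          # (batch, 3, T)
        e = self.act(self.energy(energy.unsqueeze(1)))      # (batch, 1, T)
        z = self.act(self.zcr(zcr.unsqueeze(1)))            # (batch, 1, T)
        return torch.cat([m, e, z], dim=1)                  # fused output: (batch, 5, T)
```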
For the structure of the bidirectional gated recurrent neural network in this embodiment, refer to FIG. 3. The Bi-GRU combines one forward GRU layer, one backward GRU layer and one FNN layer, so that the output at each time step can read both the forward and backward hidden-state information. To prevent overfitting and reduce unnecessary parameters, the width of the GRU hidden layer is set to 2 and the dropout rate of the hidden units is set to 0.2. The FNN takes the hidden states of the forward and backward GRUs as input (4 input nodes) and produces a 1-dimensional output, which is activated with a Softmax activation function to obtain the final model output. The Softmax function, which converts numerical values into a probability distribution, is defined as follows:
$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{c=1}^{C} e^{z_c}} \qquad (3)$$
where $z_i$ is the output value of the i-th node and C is the number of output nodes, i.e. the number of classes.
Combining the convolution layer with the Bi-GRU and the FNN yields the neural network model for the automatic clipping task.
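Putting the pieces together, a hedged PyTorch sketch of the whole model follows; it reuses the ConvFrontEnd sketch above, and the sigmoid output head stands in for the single-node Softmax described in the text.

```python
import torch
import torch.nn as nn

class SpeechClipModel(nn.Module):
    """Sketch of the full model: convolution layer -> Bi-GRU -> FNN.
    The 5-channel front-end output is an assumption carried over from
    the ConvFrontEnd sketch above."""

    def __init__(self, conv_channels: int = 5):
        super().__init__()
        self.frontend = ConvFrontEnd()
        # One forward and one backward GRU layer, hidden width 2 each.
        self.bigru = nn.GRU(input_size=conv_channels, hidden_size=2,
                            num_layers=1, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(p=0.2)   # random inactivation of hidden units
        self.fnn = nn.Linear(4, 1)         # 2 forward + 2 backward hidden states -> 1 output

    def forward(self, logmel, energy, zcr):
        feats = self.frontend(logmel, energy, zcr)      # (batch, 5, T)
        h, _ = self.bigru(feats.transpose(1, 2))        # (batch, T, 4)
        logits = self.fnn(self.dropout(h)).squeeze(-1)  # (batch, T)
        # With a single output node, a sigmoid yields the per-frame speech
        # probability that the text activates with Softmax.
        return torch.sigmoid(logits)
```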
Example 2
Referring to FIG. 4, FIG. 4 is a flow chart of the neural network model training method in an embodiment of the application. The training set is recorded manually and lasts about 20 minutes, of which roughly 60% is speech; the language is Chinese. During recording, a pause of 3 seconds is left between sentences and 30 seconds between paragraphs; some noise is produced deliberately, and urban street noise is added to simulate an outdoor recording environment. Such irregular noise is more disturbing than steady white noise. The processed speech has a signal-to-noise ratio of about 0 dB, which increases the training difficulty and strengthens the robustness of the model. The validation set and the test set use the CHiME-5 data set, a speech recognition challenge data set recorded in noisy environments. The data set contains real, simulated and clean recordings; the real recordings were captured by six four-channel microphone arrays at family dinners, the language is English, and each recording lasts about 120 minutes. The audio recorded by the microphone farthest from the speaker is selected, and 60 minutes each are taken as the validation set and the test set, with speech making up roughly 75%. The speech volume of this data set is low, and there is a large amount of irregular noise as well as far-field reverberation and other interference; about 15% of the speech only reaches mean opinion score level 2, meaning listeners must concentrate considerably to understand it. These data are used to check the model's performance in extreme environments.
The labeling is done manually; the process is shown in FIG. 5. In this embodiment, a clipping engineer of the Shanghai radio and television station is invited to annotate the data set. The audio is displayed as a waveform; after listening, the annotator selects the audio to keep, and the kept parts are labeled as positive examples with value 1, the rest as negative examples with value 0, as shown in FIG. 6. The clipping engineer's labeling habits are as follows: when the pause between sentences exceeds 2 s, the clip is cut, while 0.5-1 s is kept at both ends of each sentence as a buffer. When there is a longer pause between speech paragraphs, about 1-2 seconds are left free at the beginning and end of the paragraph. When sentences are continuous, for example when the pause between them is shorter than 2 s, they are not cut. In addition, artificially produced noise is removed, but environmental noise occurring within speech segments is not. The manually clipped audio transitions smoothly, has no obvious pauses, and listeners can still distinguish the different paragraphs. The raw labels are stored as a binary time series with as many entries as there are audio samples, and are then downsampled so that their length matches the feature sequence length.
To let the model in this embodiment better exploit the relations between audio features when clipping, training proceeds on the whole first and on local details afterwards, similar to the growth of a tree from trunk to branches. The training speed is likewise fast first and slow afterwards: fast for the whole, slow for the details. First, the audio of the training set is split into one batch item per 95 seconds; the optimizer is Adam and the learning rate is set to 0.01. The purpose of this round is to let the model learn the overall contextual features, i.e. the links between paragraphs. After a certain number of iterations, training is stopped before the model has fully converged; if the model were trained to full convergence on such large batches, it would overfit, and although the optimizer might find the global optimum of the gradient, the results on the validation set would be poor. In the second round, the audio is split into one batch item per 21 seconds; the optimizer is again Adam, but the initial learning rate is reduced to 0.001 and an exponential learning-rate decay with factor 0.95 is set. The purpose of the second round is to let the model focus on details, i.e. the links between sentences; because small batches vary strongly, the small learning rate keeps the model parameters relatively stable. With a view to practical engineering use, the validation and test data are not segmented, which preserves the integrity of the test data.
Compared with traditional mini-batch gradient descent with a fixed batch size, the training method of this embodiment increases the diversity of the data by changing the batch size, so the model can effectively weigh macroscopic and microscopic information and be optimized to its best state. The hyperparameters are adjusted continuously during training, the model that performs best on the validation set is selected and evaluated on the test set, and its parameters are saved.
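The schedule above can be sketched as follows; the two DataLoaders (95-second and 21-second segments) and the epoch counts are assumptions, since the patent only states that the first round stops when the loss is close to convergence.

```python
import torch
import torch.nn as nn

def train_two_stage(model, coarse_loader, fine_loader,
                    coarse_epochs=20, fine_epochs=40):
    """Whole-first-then-local training: long segments with a large learning
    rate, then short segments with a small, exponentially decayed rate."""
    criterion = nn.BCELoss()

    # Round 1: one batch item per 95 s of audio, Adam with lr = 0.01,
    # stopped before full convergence.
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    for _ in range(coarse_epochs):
        for logmel, energy, zcr, labels in coarse_loader:
            opt.zero_grad()
            loss = criterion(model(logmel, energy, zcr), labels)
            loss.backward()
            opt.step()

    # Round 2: one batch item per 21 s of audio, Adam with lr = 0.001
    # and exponential learning-rate decay (factor 0.95).
    opt = torch.optim.Adam(model.parameters(), lr=0.001)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)
    for _ in range(fine_epochs):
        for logmel, energy, zcr, labels in fine_loader:
            opt.zero_grad()
            loss = criterion(model(logmel, energy, zcr), labels)
            loss.backward()
            opt.step()
        sched.step()
    return model
```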
Table 1: Results of comparative experiments with different training schemes
[Table content not reproduced; provided as an image in the original document.]
The formula of the accuracy (Acc) evaluation index in table 1 is as follows:
$$Acc = \frac{TP + TN}{R} \qquad (4)$$
where TP represents the number of positive samples predicted correctly, TN represents the number of negative samples predicted correctly, and R represents the total number of samples of the original audio.
The experimental results show that the training scheme used in this embodiment outperforms traditional mini-batch gradient descent and improves the model's accuracy by about 3%. When the model trained in this way performs speech detection, it distinguishes speech from non-speech accurately and keeps a small amount of audio at both ends of each speech part, improving the fluency of the subsequent clipping result.
Example 3
As shown in FIG. 6, FIG. 6 is a flow chart of the neural-network-based automatic audio/video clipping method in an embodiment of the present application:
s11: sampling the original audio and video to obtain preprocessed audio;
In an implementation of the invention, the original video or audio is sampled at 22050 Hz to obtain the preprocessed audio; if the preprocessed audio has multiple channels, it is mixed down to a single channel.
S12: extracting a plurality of acoustic features from the preprocessed audio;
In this implementation, the extracted acoustic features are the log-mel spectrum, short-time energy and short-time zero-crossing rate.
The log-mel spectrum is extracted as follows: first, pre-emphasis with coefficient 0.97 is applied to the sampled audio to obtain the pre-emphasized audio; the pre-emphasized audio is then split into frames of length 46 ms with a frame shift of 23 ms; the framed audio is windowed with a Hamming window to obtain the windowed audio, the Hamming window function being:
$$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$
The windowed audio is converted from the time domain to the frequency domain with a fast Fourier transform, and the frequency scale is then converted to the mel scale using:
$$\mathrm{mel}(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right)$$
Filtering on the mel scale with 40 equal-area triangular filters then yields the log-mel spectrum features. After extraction, each dimension of the log-mel spectrum is normalized separately: along the time axis, the mean and standard deviation of each dimension are computed, and the value of each time unit has the mean subtracted and is divided by the standard deviation. The normalization is:
$$\hat{x}_i = \frac{x_i - \mu}{\sigma}$$
where $x_i$ is the value at time i, μ is the mean and σ is the standard deviation.
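A librosa-based sketch of this pipeline; n_fft = 1024 (about 46 ms at 22050 Hz) and hop_length = 512 (about 23 ms) are assumed as the nearest power-of-two equivalents of the frame parameters given above.

```python
import numpy as np
import librosa

def log_mel_features(path, sr=22050, n_mels=40):
    """Pre-emphasis, Hamming-windowed STFT, 40 mel filters, log scaling
    and per-dimension z-score normalization, as described above."""
    y, _ = librosa.load(path, sr=sr, mono=True)        # resample and force mono
    y = librosa.effects.preemphasis(y, coef=0.97)      # pre-emphasis
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=512, window="hamming", n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                 # (n_mels, T) log-mel spectrum
    mean = log_mel.mean(axis=1, keepdims=True)         # mean over the time axis
    std = log_mel.std(axis=1, keepdims=True) + 1e-8    # standard deviation over time
    return (log_mel - mean) / std
```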
Short-time energy is the energy contained in one frame of audio. It is extracted as follows: every 512 samples are taken as one frame; a rectangular window is applied; the short-time energy is then computed as:
$$E_n = \sum_{m=-\infty}^{\infty} \left[ x(m)\,\omega(n-m) \right]^2$$
where $E_n$ is the short-time energy, x(m) is the audio signal, and ω(n) is the window function.
The extracted short-time energy features are then normalized in the same way as described above for the log-mel spectrum.
The short-time zero-crossing rate is the number of times the signal crosses zero within a frame; its value is higher for unvoiced sounds and lower for voiced sounds. It is computed as follows: every 512 samples are taken as one frame; the number of zero crossings in each frame is counted and divided by the number of samples per frame, which gives the short-time zero-crossing rate:
$$Z_n = \frac{1}{2N} \sum_{m=1}^{N-1} \left| \operatorname{sgn}[x_n(m)] - \operatorname{sgn}[x_n(m-1)] \right|$$
where $x_n(m)$ is the audio signal within frame n, N is the number of samples per frame, and sgn(·) is the sign function.
The extracted short-time zero-crossing-rate features are then normalized in the same way as described above.
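A NumPy sketch of both of these features, using non-overlapping 512-sample frames and a rectangular window as described above; the small epsilon added to the standard deviation is an implementation detail, not part of the patent.

```python
import numpy as np

def energy_and_zcr(y, frame_len=512):
    """Short-time energy (sum of squared samples per frame) and
    zero-crossing rate (sign changes per frame divided by frame length)."""
    n_frames = len(y) // frame_len
    frames = y[:n_frames * frame_len].reshape(n_frames, frame_len)

    energy = np.sum(frames ** 2, axis=1)
    signs = np.sign(frames)
    zcr = np.sum(np.abs(np.diff(signs, axis=1)) > 0, axis=1) / frame_len

    def normalize(x):  # same z-score normalization as for the log-mel spectrum
        return (x - x.mean()) / (x.std() + 1e-8)

    return normalize(energy), normalize(zcr)
```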
An ablation experiment is carried out to check the performance of different acoustic features on the same model.
Table 2: Results of the ablation experiment comparing different acoustic features on the same model
[Table content not reproduced; provided as an image in the original document.]
The accuracy in Table 2 is computed with the same formula (4) as above.
experimental results show that the combination of the logarithmic Mel frequency spectrum, the short-time energy and the short-time zero-crossing rate used in the embodiment enables the performance of the model to reach the optimal state.
S13: inputting the acoustic features into the trained neural network model for speech detection and obtaining the output;
In this embodiment, the extracted log-mel spectrum, short-time energy and short-time zero-crossing rate are fed into the trained neural network model, which computes a one-dimensional time-series output. Since this output is a sequence of probabilities, it is rounded to convert it into classes: every value of 0.5 or more is treated as a positive example with value 1, and every value below 0.5 as a negative example with value 0, which yields a binary time series. The output sequence has the same length as the original feature sequence, which simplifies the subsequent clipping.
S14: automatically clipping the original audio/video according to the model output;
The final output of the neural network model is a binary time series with 40 units per second. The original video and audio are downsampled at 40 Hz to obtain audio frames with the same rate as the model output. The downsampled result is then clipped according to the model output: whenever the model output is positive, the corresponding audio frame is kept; the content of the original audio/video is extracted according to the kept frames and recombined to obtain the clipped audio/video.
Example 4
In this embodiment of the present application, the environment used to develop the neural-network-based automatic audio/video clipping apparatus is as follows:
The operating system is Windows 10 Pro, the CPU is an AMD Ryzen 2700, the GPU an Nvidia GeForce GTX 1080, and the memory 16 GB dual-channel DDR4; the development language is Python 3.8, the deep learning framework PyTorch 1.9.0 with CUDA 11.2, and the development tool PyCharm.
Referring to FIG. 7, FIG. 7 is a schematic structural diagram of the neural-network-based automatic audio/video clipping apparatus in an embodiment of the present application. The apparatus includes:
the preprocessing module 201: preprocessing the original audio and video to obtain preprocessed audio;
specifically, using a wave toolkit to downsample the original audio and video, wherein the sampling frequency is 22050Hz; if the original audio and video are dual-channel, the original audio and video are compressed into a single channel, and the preprocessed audio is obtained.
Feature extraction module 202: extracts several acoustic features from the preprocessed audio.
Specifically, the preprocessed audio is framed with the librosa and soundfile toolkits, with a frame length of 46 ms and a frame shift of 23 ms; the framed audio is windowed with a Hamming window function; a fast Fourier transform converts the windowed audio from the time domain to the frequency domain, and the frequency scale is converted to the mel scale; filtering on the mel scale with 40 equal-area triangular filters yields the log-mel spectrum features. The wave toolkit is used to take every 512 samples of the audio as one frame; a rectangular window is applied; the short-time energy and short-time zero-crossing-rate features are computed with the math toolkit; and the values of the log-mel spectrum, short-time energy and short-time zero-crossing rate are normalized with the numpy toolkit to obtain the normalized feature values.
Speech detection module 203: inputs the acoustic features into the trained neural network model for speech detection and obtains the output.
Specifically, a neural network model is built with the PyTorch framework and trained, and the best model parameters obtained during training are saved. The normalized log-mel spectrum, short-time energy and short-time zero-crossing-rate feature values are fed into the saved neural network model, whose computation yields a one-dimensional time series of probabilities; the probabilities are rounded with the round function in PyTorch to obtain the binary time-series data.
Table 3: Time the model needs to process 60 minutes of audio features
[Table content not reproduced; provided as an image in the original document.]
In Table 3, CPU denotes the time required when computing on the CPU, and GPU the time required when computing on the GPU.
The results show that the time required for the model to calculate 60 minutes of audio is less than 1 minute, whereas manual editing takes around 35 minutes.
Clipping module 204: automatically clips the original audio/video according to the output of the trained neural network model.
Specifically, the moviepy toolkit is used to downsample the audio of the original audio/video to the output frequency of the model, and the corresponding fragments of the original audio/video are extracted according to the binary time-series data; the fragments are then combined (with the tqdm toolkit used during the process) to obtain the clipped audio/video. Referring to FIG. 8, FIG. 8 compares the model clipping with the manual clipping result in this embodiment. The results show that the apparatus of this embodiment achieves a result very close to manual clipping. It can be observed that when a person cuts speech, the time kept on either side of a cut fluctuates, and the fluctuation grows as the clipping work proceeds, whereas the time kept by the apparatus of this embodiment is relatively fixed. In one speech segment with severe noise interference, the human clipper deleted the segment by mistake, while this embodiment kept it correctly. The method and apparatus of this application therefore have high application value in places such as broadcasting and television stations where large amounts of speech media must be clipped, and can also be applied to other fields such as the audio and video of online courses and conferences.
The quantities and processing scales described herein are intended to simplify the description of the present invention; applications, modifications and variations of the present invention will be readily apparent to those skilled in the art.
Although embodiments of the present invention have been disclosed above, the invention is not limited to the applications, details and embodiments shown and described; it is well suited to various fields of use that will be apparent to those skilled in the art, and it may depart from the specific details and illustrations shown and described herein without departing from the general concepts defined by the claims and their equivalents.

Claims (7)

1. A method for automatic speech clipping tasks, comprising:
S1, building a neural network model comprising a convolution layer, a recurrent neural network and a feed-forward neural network;
S2, training the neural network model from step S1;
S3, performing automatic audio/video clipping based on the neural network model, comprising the following steps:
S11, extracting the audio from the original audio/video and preprocessing it to obtain preprocessed audio;
S12, extracting several acoustic features from the audio preprocessed in step S11;
S13, inputting the acoustic features from step S12 into the trained neural network model for speech detection and obtaining the output;
S14, automatically clipping the original audio/video according to the output from step S13.
2. The method for automatic speech clipping tasks according to claim 1, wherein the convolution layer comprises several convolutional neural networks, each performing a convolution operation on a different acoustic feature, and the convolution layer is combined with a bidirectional recurrent neural network and a feed-forward neural network to obtain the neural network model for the automatic speech clipping task, the bidirectional recurrent neural network comprising one forward recurrent layer and one backward recurrent layer.
3. The method for automatic speech clipping tasks according to claim 2, wherein the convolution layer activates the output of each of its networks with an activation function and stacks the activated outputs to obtain its final output; the output of the convolution layer is fed into the bidirectional recurrent neural network to obtain its output, the output of the bidirectional recurrent neural network is fed into the feed-forward neural network to obtain its output, and the output of the feed-forward neural network is activated and classified with an activation function to obtain the final output of the neural network model.
4. The method for automatic speech clipping tasks according to claim 3, wherein the convolution layer extracts and abstracts high-dimensional information from the acoustic features, the forward recurrent neural network reads the temporal information before the current node, the backward recurrent neural network reads the temporal information after the current node, and the feed-forward neural network combines the temporal information from both directions.
5. The method for automatic speech clipping tasks according to claim 4, wherein in step S2 a binary cross-entropy loss function is used to compute the loss value and a back-propagation algorithm updates the parameters along the gradient; during training, the model that performs best on the validation set is selected and evaluated on the test set, and finally the parameters of the model that performs best on the test set are saved.
6. The method for automatic speech clipping tasks according to claim 5, wherein the acoustic features in step S3 include the log-mel spectrum, short-time energy and short-time zero-crossing rate; each of the three features is normalized to ease the subsequent neural network computation, and the extracted log-mel spectrum, short-time energy and short-time zero-crossing rate have the same time-sequence length.
7. The training method according to claim 6, wherein a first round of training is performed with larger batches of training data and a larger learning rate and stopped when the loss function is close to convergence; a second round is performed with smaller batches and a smaller learning rate and stopped when the loss function converges; the hyperparameters are adjusted continuously with this training method, and finally the model parameters that perform best on the test set are saved.
CN202111568954.9A (priority date 2021-12-21, filing date 2021-12-21): Method for automatic speech clipping tasks and neural network model training method; status: Pending; published as CN116259312A.

Priority Applications (1)

Application number: CN202111568954.9A; priority date: 2021-12-21; filing date: 2021-12-21; title: Method for automatic speech clipping tasks and neural network model training method


Publications (1)

CN116259312A, published 2023-06-13

Family

ID=86681423

Family Applications (1)

CN202111568954.9A (priority date 2021-12-21, filing date 2021-12-21): Method for automatic speech clipping tasks and neural network model training method; status: Pending; published as CN116259312A

Country Status (1)

Country Link
CN (1) CN116259312A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
CN117690435A (cited by examiner); priority date 2024-02-04; published 2024-03-12; assignee: 深圳市索迪新创科技有限公司; title: Intelligent voice recognition electric switch for curtain control



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination