CN110827798B - Audio signal processing method and device

Audio signal processing method and device

Info

Publication number
CN110827798B (application CN201911103069.6A)
Authority
CN
China
Prior art keywords
audio
classification
sequence
training data
specified
Prior art date
Legal status
Active
Application number
CN201911103069.6A
Other languages
Chinese (zh)
Other versions
CN110827798A
Inventor
盘子圣
丁宁
Current Assignee
Guangzhou Huanlao Network Technology Co ltd
Original Assignee
Guangzhou Huanlao Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Huanlao Network Technology Co ltd filed Critical Guangzhou Huanlao Network Technology Co ltd
Priority to CN201911103069.6A
Publication of CN110827798A
Application granted
Publication of CN110827798B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/08 Speech classification or search
    • G10L 19/02 Speech or audio analysis-synthesis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L 25/03 Analysis characterised by the type of extracted parameters
    • G10L 25/30 Analysis technique using neural networks
    • G10L 25/45 Analysis characterised by the type of analysis window
    • G10L 25/51 Analysis specially adapted for comparison or discrimination

Abstract

The application relates to an audio signal processing method and device. The method comprises: preprocessing the audio to be detected to obtain a multi-dimensional Mel spectrum feature sequence; slicing the multi-dimensional Mel spectrum feature sequence and inputting the slices into a trained audio recognition model to obtain the prediction probability output by the model for each audio segment, where the prediction probability is the probability that the audio segment contains audio of a specified type, each audio segment has a specified duration, and the specified type of audio comprises audio signals without specific semantics; generating a binary classification sequence from the obtained prediction probabilities, where each sequence element corresponds to an audio segment of the specified duration; and determining, according to the specified duration, the time information of the specified type of audio in the audio to be detected from the binary classification sequence. The method and device can improve both the accuracy and the efficiency of recognizing the specified type of audio.

Description

Audio signal processing method and device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for processing an audio signal.
Background
With the development of Internet technology, the ways and channels of information dissemination have changed dramatically. Information spread over the network takes many forms, some of which may involve the dissemination of pornography. Therefore, in order to keep the network environment clean, the disseminated information needs to be audited.
For example, for audio programs distributed on the Internet, the audio can be converted into text by a speech recognition algorithm for pornography-related screening. However, much of the offending audio carries no semantic information, such as panting or moaning sounds, so such content is missed and the recognition accuracy is low.
Disclosure of Invention
In view of the above, the present application is proposed to provide an audio signal processing method and apparatus that overcome, or at least partially solve, the above problems.
In a first aspect, the present application provides a method of audio signal processing, the method comprising:
preprocessing the audio to be detected to obtain a multi-dimensional Mel frequency spectrum characteristic sequence;
slicing the multi-dimensional Mel frequency spectrum characteristic sequence, inputting the sliced multi-dimensional Mel frequency spectrum characteristic sequence into a trained audio recognition model, and obtaining a prediction probability corresponding to each audio clip output by the audio recognition model, wherein the prediction probability is the probability of predicting that the audio clip has an audio of a specified type, the audio clip has a specified duration, and the audio of the specified type comprises an audio signal without specific semantics;
generating two classification sequences according to the obtained prediction probabilities, wherein each sequence element in the two classification sequences corresponds to an audio clip with specified duration;
and according to the specified duration, determining the time information of the specified type of audio in the audio to be tested from the two classification sequences.
Optionally, before the determining, according to the specified duration, the time information where the specified type of audio is located from the two classification sequences, the method further includes:
judging whether the two classification sequences have sequence elements which accord with a preset correction rule or not;
and if so, correcting the sequence element.
Optionally, the determining whether there are sequence elements that meet a preset modification rule in the binary sequence includes:
traversing the binary classification sequence, and if the classification value of the currently traversed sequence element is the first preset value, reading the classification values of N1 consecutive elements starting from the current element, wherein N1 is a positive integer;
if, among the N1 classification values, there are M1 elements whose classification value is the first preset value and those M1 elements are not consecutive, or there is exactly 1 element whose classification value is the first preset value, reading the classification values of the N2 elements before and the N2 elements after the N1 consecutive elements, wherein 1 < M1 < N1;
if the number of elements whose classification value is the first preset value among the N1+2N2 classification values read is less than M2, determining that the current element meets the preset correction rule, wherein M1 < M2;
the modifying the sequence element includes:
and setting the classification value of the current element as a second preset value.
Optionally, the preprocessing the audio to be detected to obtain a multi-dimensional Mel spectrum feature sequence includes:
framing the audio to be detected according to a specified framing rule to obtain a corresponding audio frame sequence;
performing short-time Fourier transform on each frame of the audio frame sequence to generate a magnitude spectrum corresponding to the audio frame sequence;
and filtering the amplitude spectrum through a preset Mel filter bank to obtain a multi-dimensional Mel frequency spectrum characteristic sequence.
Optionally, the slicing the multi-dimensional Mel spectrum feature sequence and inputting the sliced sequence into a trained audio recognition model includes:
slicing the multi-dimensional Mel frequency spectrum characteristic sequence into audio segments with the length of a specified duration, wherein each audio segment is partially overlapped with the adjacent front and back audio segments respectively, and the duration of each overlap is half of the specified duration;
and respectively inputting the audio segments into the audio recognition models.
Optionally, the determining, according to the specified duration, the time information of the audio to be tested where the audio of the specified type is located from the two classification sequences includes:
determining the position of an element with a second classification value as a first preset value from the two classification sequences, and taking the position as a target position of the audio of the specified type in the audio to be detected;
and calculating time information corresponding to the target position according to the specified duration and the target position.
Optionally, the audio recognition model is trained in the following manner:
acquiring audio training data, wherein the audio training data comprises first audio training data containing audio of a specified type and second audio training data not containing audio of the specified type, and time information corresponding to the audio of the specified type in the first audio training data is labeled in advance;
according to the pre-marked time information, extracting audio of a specified type from the first audio training data, and marking the classification type of the audio of the specified type as a first class;
labeling the audio in the first audio training data except the audio of the specified type and the classification type of the second audio training data as a second class;
preprocessing the audio training data to obtain a training data multi-dimensional Mel frequency spectrum characteristic sequence;
segmenting the training data multi-dimensional Mel frequency spectrum characteristic sequence to obtain a plurality of training data audio frequency segments;
and modeling the training audio segments and the corresponding classification types by adopting a convolutional neural network CNN and a long-short term memory network LSTM to generate an audio recognition model.
Optionally, the generating a binary sequence according to the obtained plurality of prediction probabilities includes:
judging whether the current prediction probability is greater than or equal to a preset probability threshold value;
if so, setting the two classification values corresponding to the current prediction probability as a first preset value;
if not, setting the two classification values corresponding to the current prediction probability as a second preset value;
all binary values are organized into a binary sequence.
In a second aspect, the present application provides an apparatus for audio signal processing, the apparatus comprising:
the Mel frequency spectrum processing module is used for preprocessing the audio to be detected to obtain a multi-dimensional Mel frequency spectrum characteristic sequence;
a prediction probability obtaining module, configured to slice the multi-dimensional mel frequency spectrum feature sequence and then input the sliced multi-dimensional mel frequency spectrum feature sequence into a trained audio frequency recognition model, and obtain a prediction probability corresponding to each audio frequency segment output by the audio frequency recognition model, where the prediction probability is a probability that a specified type of audio frequency exists in the audio frequency segment, the audio frequency segment has a specified duration, and the specified type of audio frequency includes an audio signal without specific semantics;
the two-classification sequence generating module is used for generating two-classification sequences according to the obtained prediction probabilities, wherein each sequence element in the two-classification sequences corresponds to an audio clip with specified duration;
and the audio time identification module is used for determining the time information of the audio of the specified type in the audio to be tested from the two classification sequences according to the specified duration.
Optionally, the apparatus further comprises:
the correction judging module is used for judging whether sequence elements which accord with a preset correction rule exist in the two classification sequences or not before determining the time information of the audio of the specified type from the two classification sequences according to the specified duration; if yes, calling a correction module;
and the correcting module is used for correcting the sequence elements.
Optionally, the correction determining module is specifically configured to:
traversing the binary classification sequence, and if the classification value of the currently traversed sequence element is the first preset value, reading the classification values of N1 consecutive elements starting from the current element, wherein N1 is a positive integer;
if, among the N1 classification values, there are M1 elements whose classification value is the first preset value and those M1 elements are not consecutive, or there is exactly 1 element whose classification value is the first preset value, reading the classification values of the N2 elements before and the N2 elements after the N1 consecutive elements, wherein 1 < M1 < N1;
if the number of elements whose classification value is the first preset value among the N1+2N2 classification values read is less than M2, determining that the current element meets the preset correction rule, wherein M1 < M2;
the correction module is specifically configured to:
and setting the classification value of the current element as a second preset value.
Optionally, the mel-frequency spectrum processing module is specifically configured to:
framing the audio to be detected according to a specified framing rule to obtain a corresponding audio frame sequence;
performing short-time Fourier transform on each frame of the audio frame sequence to generate a magnitude spectrum corresponding to the audio frame sequence;
and filtering the amplitude spectrum through a preset Mel filter bank to obtain a multi-dimensional Mel frequency spectrum characteristic sequence.
Optionally, the prediction probability obtaining module includes:
the slicing submodule is used for slicing the multi-dimensional Mel frequency spectrum characteristic sequence into audio segments with the length of the specified duration, wherein each audio segment is partially overlapped with the adjacent front and back audio segments respectively, and the overlapped duration is half of the specified duration;
and the audio input sub-module is used for respectively inputting the audio segments into the audio recognition model.
Optionally, the audio time identification module is specifically configured to:
determining the position of an element with a second classification value as a first preset value from the two classification sequences, and taking the position as a target position of the audio of the specified type in the audio to be detected;
and calculating time information corresponding to the target position according to the specified duration and the target position.
Optionally, the audio recognition model is trained in the following manner:
acquiring audio training data, wherein the audio training data comprises first audio training data containing audio of a specified type and second audio training data not containing audio of the specified type, and time information corresponding to the audio of the specified type in the first audio training data is labeled in advance;
according to the pre-marked time information, extracting audio of a specified type from the first audio training data, and marking the classification type of the audio of the specified type as a first class;
labeling the audio in the first audio training data except the audio of the specified type and the classification type of the second audio training data as a second class;
preprocessing the audio training data to obtain a training data multi-dimensional Mel frequency spectrum characteristic sequence;
segmenting the training data multi-dimensional Mel frequency spectrum characteristic sequence to obtain a plurality of training data audio frequency segments;
and modeling the training audio segments and the corresponding classification types by adopting a convolutional neural network CNN and a long-short term memory network LSTM to generate an audio recognition model.
Optionally, the two-classification sequence generating module is specifically configured to:
judging whether the current prediction probability is greater than or equal to a preset probability threshold value;
if so, setting the two classification values corresponding to the current prediction probability as a first preset value;
if not, setting the two classification values corresponding to the current prediction probability as a second preset value;
all binary values are organized into a binary sequence.
In a third aspect, the present application further provides an electronic device, including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method as described above.
In a fourth aspect, the present application also provides a storage medium, wherein instructions of the storage medium, when executed by a processor of the device, enable the electronic device to perform the method as described above.
The application has the following beneficial effects:
In this embodiment, after the multi-dimensional Mel spectrum feature sequence of the audio to be detected is extracted, it is sliced and input into the audio recognition model, which outputs a prediction probability for each audio segment. A binary classification sequence can then be generated from the prediction probabilities, with each sequence element corresponding to an audio segment of the specified duration, and the time information of the specified type of audio, which has no specific semantics, can be determined in the audio to be detected from the binary classification sequence according to the specified duration. Recognition of audio without specific semantics is thus achieved; by combining audio slicing with the model, both the accuracy and the efficiency of recognizing the specified type of audio can be improved.
Drawings
FIG. 1 is a flowchart illustrating steps of an embodiment of a method for audio signal processing according to the present application;
FIG. 2 is a schematic diagram of audio slicing according to the present application;
FIG. 3 is a flowchart illustrating steps of an embodiment of a method for training an audio recognition model according to the present application;
FIG. 4 is a network diagram of the CNN + LSTM of the present application;
FIG. 5 is a flow chart illustrating steps of another embodiment of a method of audio signal processing according to the present application;
fig. 6 is a block diagram of an embodiment of an apparatus for audio signal processing according to the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
Referring to fig. 1, a flow chart of steps of an embodiment of a method of audio signal processing according to an embodiment of the present application is shown, which may include the steps of:
step 101, preprocessing the audio frequency to be detected to obtain a multi-dimensional Mel frequency spectrum characteristic sequence.
In this step, the audio to be tested may be processed into a multi-dimensional mel-frequency spectrum feature sequence composed of audio frames.
In one embodiment, step 101 may include the following sub-steps:
and a substep S11, framing the audio to be tested according to a specified framing rule to obtain a corresponding audio frame sequence.
As an example, the specified framing rules may include, but are not limited to, pre-emphasis, specified frame shift information, window length information, hamming window information, and the like.
For example, pre-emphasis with a coefficient of 0.97 may be performed on the audio to be tested, then frame division with a frame shift of 256 and a window length of 512 is adopted, and each frame is multiplied by a hamming window of 512 points to finally obtain N frames of window data as the audio frame sequence.
And a substep S12 of performing a short-time fourier transform on each frame of the sequence of audio frames to generate a magnitude spectrum corresponding to the sequence of audio frames.
In this step, after the audio frame sequence is obtained, a short-time Fourier transform (STFT) process may be performed on the audio frame sequence frame by frame, so as to obtain a magnitude spectrum.
The idea of the STFT is to select a time-frequency localized window function g(t) and assume it is stationary (pseudo-stationary) over a short time interval; the window is then shifted so that f(t)g(t) is a stationary signal over each finite time span, and the amplitude spectrum at each time is computed. The short-time Fourier transform uses a fixed window function whose shape, once chosen, does not change, so its resolution is fixed; changing the resolution requires selecting a different window function.
And a substep S13, filtering the amplitude spectrum through a preset Mel filter bank to obtain a multidimensional Mel frequency spectrum characteristic sequence.
In this step, a group of mel filter banks simulating the auditory sense of human ears may be created in advance, and the amplitude spectrum after short-time fourier transform may be passed through the mel filter banks to obtain a multi-dimensional mel frequency spectrum feature sequence capable of reflecting the auditory sense of human ears, for example, a mel frequency spectrum feature sequence of 80 dimensions may be obtained by the mel filter banks.
It should be noted that, the creating process of the mel filter bank is not limited in this embodiment, and those skilled in the art may create an appropriate mel filter bank according to actual requirements.
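As a concrete illustration of sub-steps S11 to S13, a minimal sketch follows. It assumes the librosa library is available; the 0.97 pre-emphasis coefficient, frame shift of 256, 512-point Hamming window and 80 Mel bands follow the examples above, and the function name is only illustrative.

```python
import numpy as np
import librosa

def mel_feature_sequence(path, sr=16000, n_fft=512, hop=256, n_mels=80):
    """Sub-steps S11-S13: pre-emphasis, framed STFT, Mel filter bank."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    # S11: pre-emphasis with coefficient 0.97 before framing
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    # S12: short-time Fourier transform, frame shift 256, 512-point Hamming window
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop,
                              win_length=n_fft, window="hamming"))
    # S13: 80-band Mel filter bank simulating human auditory perception
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return (mel_fb @ mag).T        # shape (num_frames, 80): one 80-dim vector per frame
```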
And 102, slicing the multi-dimensional Mel frequency spectrum characteristic sequence, inputting the sliced multi-dimensional Mel frequency spectrum characteristic sequence into a trained audio recognition model, and obtaining a prediction probability corresponding to each audio segment output by the audio recognition model, wherein the prediction probability is the probability of predicting that the audio segment has an audio of a specified type, and the audio segment has a specified duration.
In this embodiment, after the multi-dimensional Mel spectrum feature sequence is obtained, it may be sliced in order to obtain prediction results with finer temporal granularity and a smoother output probability.
In an embodiment, the step of slicing the multidimensional mel-frequency spectrum feature sequence and inputting the sliced multidimensional mel-frequency spectrum feature sequence into the trained audio recognition model may include the following sub-steps:
and a substep S11, slicing the multi-vimel spectral feature sequence into audio segments with a specified duration, wherein each audio segment is partially overlapped with adjacent front and back audio segments, and the respective overlapping duration is half of the specified duration.
And a substep S12 of inputting the audio segments into the audio recognition models, respectively.
It should be noted that the specified time period can be determined by those skilled in the art according to actual needs, and the embodiment is not limited thereto.
In order to make the temporal granularity of prediction finer, in one embodiment, each audio segment may partially overlap the adjacent preceding and following segments when slicing is performed, with each overlap lasting half of the specified duration. Thus, when an audio segment is input into the audio recognition model, its first half has already been covered by the prediction for the previous segment, so the model outputs a prediction probability every half of the specified duration, which saves prediction time.
For example, as shown in the audio slice diagram of fig. 2, assuming that the specified time duration is 6 seconds, the multidimensional mel-frequency spectrum feature sequence may be sliced into audio segments with a length of 6 seconds (a segment of 384 × 80), and an overlap of 3 seconds is formed between the front and rear audio segments, as shown in fig. 2, the 0 th to 6 th seconds are audio segment 1, the 3 rd to 9 th seconds are audio segment 2, the 6 th to 12 th seconds are audio segment 3, the portion where audio segment 2 overlaps audio segment 1 is the 3 rd to 6 seconds, the portion where audio segment 2 overlaps audio segment 3 is the 6 th to 9 th seconds, and so on.
After the segmentation is completed, the obtained audio segments can be input into the audio recognition model according to the time sequence, and the audio recognition model predicts the prediction probability corresponding to each audio segment. The prediction probability may be a probability that the audio segment is predicted to have the specified type of audio, and the value of the probability may range from 0 to 1.
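The overlapped slicing described above can be sketched as follows, assuming the feature sequence is a (num_frames, 80) array as produced earlier; 384 frames correspond to roughly 6 seconds at a 256-sample hop, and the commented model.predict call is only a placeholder for the trained audio recognition model.

```python
def slice_with_overlap(mel_seq, seg_frames=384):
    """Cut the (num_frames, 80) Mel sequence into 384-frame segments (about 6 s
    at a 256-sample hop) with 50% overlap, as illustrated in Fig. 2."""
    step = seg_frames // 2                       # overlap is half the specified duration
    segments = []
    for start in range(0, max(len(mel_seq) - seg_frames, 0) + 1, step):
        segments.append(mel_seq[start:start + seg_frames])
    return segments

# Feeding the segments to the trained model (placeholder call):
# probs = [float(model.predict(seg[None, :, :, None])) for seg in slice_with_overlap(mel_seq)]
```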
As an example, the specified type of audio may be sounds without specific semantics, including, by way of example and not limitation: human panting/breathing sounds; animal sounds such as cat meows, dog barks, and bird calls; and various environmental sounds such as waves, running water, wind blowing through treetops, trains, whistles, office air conditioners, and roadside traffic.
In one embodiment, the audio recognition model may be a two-class model, and if the audio segment is recognized by the model to have the specified type of audio, the model outputs the probability that the current audio segment has the specified type of audio. In other embodiments, the audio recognition model may also be a multi-classification model, and the model may output probabilities that various types of audio exist in the current audio piece respectively. The present embodiment does not limit the type of the audio recognition model. In addition, a training process regarding the audio recognition model will be explained in the next embodiment.
And 103, generating two classification sequences according to the obtained prediction probabilities, wherein each sequence element in the two classification sequences corresponds to the audio clip with the specified duration.
In this step, after the prediction probabilities output by the audio recognition model are obtained, they may be converted into a binary classification sequence. Since the sequence contains only two values, recognizing the specified type of audio from it is more efficient than working directly with the individual prediction probabilities.
In an embodiment, the binary sequence may include the first preset value and/or the second preset value, and then step 103 may include the following sub-steps:
judging whether the current prediction probability is greater than or equal to a preset probability threshold value; if so, setting the two classification values corresponding to the current prediction probability as a first preset value; if not, setting the two classification values corresponding to the current prediction probability as a second preset value; all binary values are organized into a binary sequence.
In one example, after obtaining the prediction probability from the audio recognition model, the currently obtained prediction probability may be associated with the audio segment currently input to the audio model, and then it is determined whether the prediction probability is greater than or equal to a preset probability threshold; if so, setting the two classification values corresponding to the prediction probability as a first preset value; if not, setting the two classification values corresponding to the prediction probability as a second preset value. And when all the prediction probabilities are traversed, organizing all the binary values into a binary sequence.
For example, assuming that the preset probability threshold is 0.7, if the current prediction probability is greater than or equal to 0.7, the two classification values corresponding to the prediction probability may be set to be a value 1 (i.e., a first preset value); if the current prediction probability is less than 0.7, the binary value corresponding to the prediction probability may be set to be a value of 0 (i.e., a second preset value), so that the binary sequence is a sequence including only 0 and/or 1. For example, assuming that the obtained prediction probabilities are "0.3, 0.6, 0.87, 0.9, 0.7, 0.55, and 0.6", respectively, the generated binary sequence is: "0011100".
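A minimal sketch of this thresholding step, assuming the probabilities are collected in a Python list; the 0.7 threshold and the 0/1 values match the example above.

```python
def to_binary_sequence(probs, threshold=0.7):
    """Map each prediction probability to 1 (>= threshold) or 0 (< threshold)."""
    return [1 if p >= threshold else 0 for p in probs]

# to_binary_sequence([0.3, 0.6, 0.87, 0.9, 0.7, 0.55, 0.6]) -> [0, 0, 1, 1, 1, 0, 0]
```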
It should be noted that, for the time when the binary values corresponding to the prediction probabilities are to be determined, this embodiment is not limited to this, for example, when each prediction probability is output by the audio recognition model, the binary values corresponding to the prediction probabilities may be determined in real time; or after the audio recognition model outputs a plurality of prediction probabilities, two classification values corresponding to the prediction probabilities can be respectively determined.
In addition, it should be noted that, if the audio recognition model is a binary model, a binary sequence may be determined for the output of the model; if the audio recognition model is a multi-classification model, a corresponding two-classification sequence may be generated for each classification, and the specific manner refers to the description in step 103, which is not described herein again.
And step 104, determining the time information of the audio of the specified type in the audio to be tested from the two classification sequences according to the specified duration.
In this step, each two classification values in the two classification sequences have an associated audio segment, and then the time information of the specified type of audio in the audio to be tested can be determined according to the specified duration of the audio segment and each two classification values.
In one embodiment, step 104 may include the following sub-steps:
and a substep S21, determining the position of the element with the second classification value as the first preset value from the two classification sequences, and taking the position as the target position of the audio of the specified type in the audio to be tested.
For example, assuming that the first preset value is 1 and the classification sequence is "0011100", the position where the element with the value of 1 is located may be taken as the target position, that is, the positions where the 3 rd, 4 th and 5 th audio segments are located may be taken as the target position where the specified type of audio is located in the audio to be tested.
And a substep S22, calculating time information corresponding to the target position according to the specified time length and the target position.
In one embodiment, because each audio segment overlaps the preceding and following segments by half of the specified duration, the audio recognition model outputs a prediction probability every half of the specified duration. For example, as shown in fig. 2, each audio segment has a specified duration of 6 seconds and overlaps the preceding and following segments by 3 seconds each, so the model outputs a prediction probability every 3 seconds; that is, the interval between two adjacent classification values in the binary classification sequence is 3 seconds. For example, for the binary classification sequence "0011100", the times corresponding to the elements are 00:00:00-00:00:03, 00:00:03-00:00:06, 00:00:06-00:00:09, 00:00:09-00:00:12, 00:00:12-00:00:15, 00:00:15-00:00:18, and 00:00:18-00:00:21. The time information corresponding to the target position is 00:00:06-00:00:09, 00:00:09-00:00:12, and 00:00:12-00:00:15; that is, the time corresponding to the specified type of audio (such as panting audio) in the audio to be detected is 00:00:06-00:00:15.
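A small sketch of this time mapping, under the assumptions of the example above (a 3-second interval between adjacent elements); the merging of adjacent intervals is only illustrative.

```python
def positions_to_times(binary_seq, hop_seconds=3):
    """Return (start, end) second intervals covered by elements equal to 1;
    each element spans hop_seconds because of the 50% segment overlap."""
    intervals = []
    for i, value in enumerate(binary_seq):
        if value != 1:
            continue
        start, end = i * hop_seconds, (i + 1) * hop_seconds
        if intervals and intervals[-1][1] == start:
            intervals[-1] = (intervals[-1][0], end)   # merge adjacent hits
        else:
            intervals.append((start, end))
    return intervals

# positions_to_times([0, 0, 1, 1, 1, 0, 0]) -> [(6, 15)], i.e. 00:00:06-00:00:15
```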
In this embodiment, after the multi-dimensional Mel spectrum feature sequence of the audio to be detected is extracted, it is sliced and input into the audio recognition model, which outputs a prediction probability for each audio segment. A binary classification sequence can then be generated from the prediction probabilities, with each sequence element corresponding to an audio segment of the specified duration, and the time information of the specified type of audio, which has no specific semantics, can be determined in the audio to be detected from the binary classification sequence according to the specified duration. Recognition of audio without specific semantics is thus achieved; by combining audio slicing with the model, both the accuracy and the efficiency of recognizing the specified type of audio can be improved.
Referring to fig. 3, a flowchart illustrating steps of an embodiment of a method for training an audio recognition model according to the present application is shown, and may include the following steps:
step 301, obtaining audio training data, where the audio training data includes first audio training data including a specified type of audio and second audio training data not including the specified type of audio, and time information corresponding to the specified type of audio in the first audio training data is pre-labeled.
In this step, a large amount of the first audio training data and the second audio training data may be collected in advance as the audio training data. The first audio training data comprises the audio of the designated type, and the second audio training data does not comprise the audio of the designated type.
Illustratively, the specified type of audio may be sounds without specific semantics, including, by way of example and not limitation: human panting/breathing sounds; animal sounds such as cat meows, dog barks, and bird calls; and various environmental sounds such as waves, running water, wind blowing through treetops, trains, whistles, office air conditioners, and roadside traffic.
For example, assuming that the audio recognition model to be trained is used to recognize panting audio, a large amount of both panting and non-panting audio may be collected as audio training data.
For the first audio training data, the time information corresponding to the audio of the specified type may be labeled manually, and the labeled time information may be accurate to the second; for example, the time periods in which panting occurs in the first audio training data are labeled as 00:10:01-00:11:05, 00:15:06-00:15:52, and so on.
The embodiment labels the audio of the specified type according to the time dimension, so that the audio can be accurately processed, and the processing efficiency is improved.
It should be noted that, in the audio training data, the ratio of positive to negative samples may be determined according to actual requirements; for example, total duration of collected panting audio : total duration of non-panting audio ≈ 1:5.
step 302, according to the pre-labeled time information, extracting a specified type of audio from the first audio training data, and labeling the classification type of the specified type of audio as a first class.
In this step, each first audio training data may be preprocessed, which in one example may include: and automatically cutting the corresponding first audio training data according to the time information marked in advance in each first audio training data to obtain the audio of the specified type.
The classification type of the extracted audio of the specified type may then be labeled as the first class, which may illustratively be represented by the value 1; for example, the classification type of the extracted panting audio may be labeled as the value 1, indicating the first class.
Step 303, labeling the audio of the first audio training data except the audio of the specified type and the classification type of the second audio training data as a second class.
In this step, the audio in the first audio training data other than the audio of the specified type, together with the second audio training data, may be labeled as the second class, which may be represented by the value 0; for example, the non-panting time periods within the panting audio and the non-panting audio may be labeled with the value 0, indicating the second class.
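A minimal sketch of steps 302 and 303, assuming the first audio training data has been loaded as a sample array and the pre-labeled time information is given as (start, end) pairs in seconds; the function and variable names are illustrative only.

```python
def cut_and_label(samples, sr, panting_intervals):
    """Steps 302-303: cut the pre-labeled (start_s, end_s) intervals out of one
    first-audio-training-data waveform; those clips get label 1 (first class),
    the remaining parts get label 0 (second class)."""
    clips, cursor = [], 0
    for start_s, end_s in sorted(panting_intervals):
        start, end = int(start_s * sr), int(end_s * sr)
        if start > cursor:
            clips.append((samples[cursor:start], 0))   # non-panting part of the file
        clips.append((samples[start:end], 1))          # panting part -> first class
        cursor = end
    if cursor < len(samples):
        clips.append((samples[cursor:], 0))
    return clips
```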
And 304, preprocessing the audio training data to obtain a training data multi-dimensional Mel frequency spectrum characteristic sequence.
In this step, preprocessing such as feature extraction may be performed on the audio training data to obtain a training data multi-dimensional Mel spectrum feature sequence.
In an embodiment, the process of preprocessing the audio training data is similar to the preprocessing process of the audio to be tested in the embodiment of fig. 1, and may include the following processes:
1) and framing the audio training data according to a specified framing rule to obtain a corresponding training audio frame sequence.
As an example, the above-mentioned specified framing rules may include, but are not limited to, specifying a sampling rate, specifying a channel, pre-emphasis, specified frame shift information, window length information, hamming window information, and the like.
For example, the audio training data may be uniformly converted to 16000 sampling rate, monaural audio, and then the converted audio may be pre-emphasized to remove the effect of lip radiation and compensate for the loss of the high frequency part, and the pre-emphasis factor may be 0.97 as an example. Then, frame shift is 256, window length is 512, and each frame is multiplied by 512-point Hamming window, finally N frame window data is obtained as training audio frame sequence.
2) And performing short-time Fourier transform on each frame of the training audio frame sequence to generate a training data amplitude spectrum corresponding to the training audio frame sequence.
3) And filtering the training data amplitude spectrum through a preset Mel filter bank to obtain a training data multi-dimensional Mel frequency spectrum characteristic sequence.
In this step, a group of mel filter banks simulating the auditory sensation of human ears may be created in advance, and the training data amplitude spectrum after short-time fourier transform may be passed through the mel filter banks to obtain a training data multi-dimensional mel frequency spectrum feature sequence capable of reflecting the auditory sensation of human ears, for example, an 80-dimensional training data mel frequency spectrum feature sequence may be obtained by the mel filter banks.
It should be noted that, the creating process of the mel filter bank is not limited in this embodiment, and those skilled in the art may create an appropriate mel filter bank according to actual requirements.
In the embodiment, the Mel frequency spectrum is adopted for feature extraction, so that more feature information can be provided for model learning, and the accuracy of the model is improved.
Step 305, segmenting the training data multi-dimensional Mel frequency spectrum characteristic sequence to obtain a plurality of training data audio frequency segments.
The slicing of the training data multi-dimensional Mel spectrum feature sequence in this step is similar to the slicing of the multi-dimensional Mel spectrum feature sequence in the embodiment of fig. 1; for example, the 80-dimensional training data Mel spectrum feature sequence may be sliced into a plurality of training audio segments 6 seconds long, i.e. 384x80 (unlike the slicing of the audio to be predicted, the segments here may be non-overlapping).
And step 306, modeling the training audio segments and the corresponding classification types by adopting a convolutional neural network CNN and a long-short term memory network LSTM, and generating an audio recognition model.
In this step, after obtaining a plurality of training audio segments, the obtained plurality of training audio segments may be used as an input of a training model, and the classification type (0 or 1) corresponding to each training audio segment may be used as a corresponding training label, and an audio recognition model is trained through a training algorithm.
In this embodiment, CNN (Convolutional Neural Networks) + LSTM (Long Short-Term Memory) can be used as the training algorithm.
In one embodiment, a CNN may be used to abstract features layer by layer from each 384x80 Mel spectrum segment (i.e., training audio segment). In the last CNN layer, three one-dimensional convolution kernels of lengths 1, 2 and 3 convert the feature map into three 24x1x256 feature matrices, each with a height of 1 and the original width and channel number; these three matrices are then concatenated and fused along the channel dimension to obtain a 24x768 feature matrix. An LSTM layer with 24 time steps processes the speech features along the time dimension, the output of the last time step is taken, and a fully connected layer with a sigmoid activation function produces the classification (a code sketch of this network follows the parameter list below).
In one example, as shown in the network diagram of CNN + LSTM of fig. 4, the network parameters from top to bottom layer by layer are as follows:
(1) the size of the convolution kernel is 3x3x1, the number of the convolution kernels is 32, and the returned characteristic matrix is 384x80x 32;
(2) the size of the convolution kernel is 3x3x32, the number of the convolution kernels is 32, and the returned characteristic matrix is 384x80x 32;
(3) maximum pooling level, step size 2, returned feature matrix 192x40x 32;
(4) a batch normalization layer with an activation function Relu;
(5) the size of the convolution kernel is 3x3x32, the number of the convolution kernels is 64, and the returned characteristic matrix is 192x40x 64;
(6) the size of the convolution kernel is 3x3x64, the number of the convolution kernels is 64, and the returned characteristic matrix is 192x40x 64;
(7) maximum pooling layer, step size 2, returned feature matrix 96x20x 64;
(8) a batch normalization layer with an activation function Relu;
(9) the size of the convolution kernel is 3x3x64, the number of the convolution kernels is 128, and the returned characteristic matrix is 96x20x 128;
(10) the size of the convolution kernel is 3x3x128, the number of the convolution kernels is 128, and the returned characteristic matrix is 96x20x 128;
(11) maximum pooling layer, step size 2, returned feature matrix 48x10x 128;
(12) a batch normalization layer with an activation function Relu;
(13) the size of the convolution kernel is 3x3x128, the number of the convolution kernels is 256, and the returned characteristic matrix is 48x10x 256;
(14) the size of the convolution kernel is 3x3x256, the number of the convolution kernels is 256, and the returned characteristic matrix is 48x10x 256;
(15) the maximum pooling layer, the step length is 2, and the returned feature matrix is 24x5x 256;
(16) a batch normalization layer with an activation function Relu;
(17) three one-dimensional convolution kernels are respectively 1x5x256, 2x5x256 and 3x5x256, the number of the convolution kernels is 256, three groups of feature matrices are all 24x1x256, splicing is carried out in the channel dimension, a second dimension with the dimension number of 1 is removed, and the returned feature matrix is 24x 768;
(18) the step length of the LSTM is 24, the node number is 512, and the output of the last time step is returned, namely 512;
(19) and the full connection layer has a node number of 1 and an activation function of Sigmoid.
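A minimal Keras sketch of the network parameters (1) to (19) above, assuming TensorFlow/Keras; the "same" padding in the 3x3 convolutions and the height-only zero padding before the 1x5/2x5/3x5 kernels are assumptions made so that the intermediate shapes match the sizes listed, not a statement of the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(x, filters):
    # layers (1)-(4), (5)-(8), (9)-(12), (13)-(16): two 3x3 convs, 2x2 max pooling, BN + ReLU
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def build_audio_recognition_model(seg_frames=384, n_mels=80):
    inp = layers.Input(shape=(seg_frames, n_mels, 1))          # one 6 s Mel slice, 384x80
    x = inp
    for filters in (32, 64, 128, 256):                          # 384x80 -> 24x5x256
        x = conv_block(x, filters)
    # layer (17): "N-Gram"-style kernels of height 1, 2, 3 over the full 5-band width,
    # each zero-padded in height so the output stays 24x1x256, then concatenated to 24x768
    branches = []
    for k in (1, 2, 3):
        b = layers.ZeroPadding2D(padding=((0, k - 1), (0, 0)))(x)
        b = layers.Conv2D(256, (k, 5), padding="valid")(b)      # -> (24, 1, 256)
        branches.append(b)
    x = layers.Concatenate(axis=-1)(branches)                   # (24, 1, 768)
    x = layers.Reshape((24, 768))(x)
    x = layers.LSTM(512)(x)                                     # layer (18): last time step only
    out = layers.Dense(1, activation="sigmoid")(x)              # layer (19)
    return models.Model(inp, out)
```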
Based on the above CNN + LSTM network, the following Focal Loss may be used as a Loss function:
FL(p_t) = -a · (1 - p_t)^r · log(p_t), where p_t denotes the predicted probability for the true class.
The r factor makes the gradient descent during training focus more on misclassified samples with larger classification errors; for example, r may be set to 2. The a factor balances the unequal proportion of positive and negative samples, since the training data contains fewer positive samples than negative samples; a may be set to 0.2.
Of course, other Loss functions, such as cross-entropy Loss function, may be used instead of the Focal local Loss function, which is not limited in this embodiment.
In addition, during model training, the batch size can be set to 64, the optimization algorithm is Adam, and the learning rate is 0.0001.
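A sketch of the Focal Loss and training setup described above, assuming TensorFlow/Keras; the way the a factor is applied to the positive class is an assumption based on the description.

```python
import tensorflow as tf

def focal_loss(r=2.0, a=0.2):
    """Binary Focal Loss with focusing factor r and balance factor a (r=2, a=0.2 as above)."""
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)   # prob. of the true class
        a_t = y_true * a + (1.0 - y_true) * (1.0 - a)             # class-balance weight
        return -tf.reduce_mean(a_t * tf.pow(1.0 - p_t, r) * tf.math.log(p_t))
    return loss

# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss=focal_loss())
# model.fit(train_segments, train_labels, batch_size=64, ...)
```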
It should be noted that the model trained with the CNN + LSTM network shown in fig. 4 is a binary classification model; if a multi-class model needs to be trained, the single-node sigmoid activation function of the last layer in fig. 4 may be replaced by a multi-node Softmax activation function (the number of nodes corresponding to the number of classes) to obtain a multi-class model.
In this embodiment, when training the audio recognition model, the CNN and LSTM are fused: acoustic features are extracted layer by layer by the CNN, the N-Gram concept commonly used in natural language processing is borrowed in the last CNN layer to fuse acoustic features over different lengths, and the LSTM then models the preceding and following features over time, achieving higher accuracy.
Referring to fig. 5, a flow chart of steps of another method embodiment of audio signal processing of the present application embodiment is shown, which may include the steps of:
step 501, preprocessing the audio frequency to be detected to obtain a multi-dimensional Mel frequency spectrum characteristic sequence.
Step 502, slicing the multi-dimensional Mel frequency spectrum characteristic sequence, inputting the sliced multi-dimensional Mel frequency spectrum characteristic sequence into a trained audio recognition model, and obtaining a prediction probability corresponding to each audio segment output by the audio recognition model, wherein the prediction probability is a probability of predicting that the audio segment has an audio of a specified type, and the audio segment has a specified duration.
Step 503, generating two classification sequences according to the obtained plurality of prediction probabilities, wherein each sequence element in the two classification sequences corresponds to an audio segment with a specified duration.
Step 504, judging whether the two classification sequences have sequence elements which accord with a preset correction rule, if so, correcting the sequence elements.
In one example, if the specified type of audio is panting, sounds that frequently occur in everyday speech, such as "heke", "kawa", inhalations and sighs, resemble panting breath sounds and are easily misidentified by the audio recognition model. Therefore, in order to improve recognition accuracy, after the binary classification sequence is obtained, it can be checked whether it contains sequence elements that meet a preset correction rule, and any such elements can be corrected.
In an embodiment, the step of determining whether there are sequence elements in the binary sequence that meet the preset modification rule may include the following sub-steps:
And a substep S31, traversing the binary classification sequence, and if the classification value of the currently traversed sequence element is the first preset value, reading the classification values of N1 consecutive elements starting from the current element, wherein N1 is a positive integer.
Substep S32, if, among the N1 classification values, there are M1 elements whose classification value is the first preset value and those M1 elements are not consecutive, or there is exactly 1 element whose classification value is the first preset value, reading the classification values of the N2 elements before and the N2 elements after the N1 consecutive elements, wherein 1 < M1 < N1;
In substep S33, if the number of elements whose classification value is the first preset value among the N1+2N2 classification values read is less than M2, it is determined that the current element meets the preset correction rule, wherein M1 < M2.
In one embodiment, the modifying the sequence element includes: and setting the classification value of the current element as a second preset value.
For example, assuming that the first preset value is 1, the second preset value is 0, N1 is 5, M1 is 2, N2 is 1, and M2 is 3, then:
when the current frame is 1, the following judgment is carried out:
(1) when at least 3 of the 5 consecutive frames are 1, no correction is made;
(2) when 2 of the 5 consecutive frames are 1 and those 2 frames are consecutive, no correction is made;
(3) when 2 of the 5 consecutive frames are 1 and those 2 frames are not consecutive, no correction is made if at least 3 of the 7 frames (the 5 frames plus the 1 frame before and the 1 frame after) are 1; otherwise, the current frame is reset to 0;
(4) when only 1 of the 5 consecutive frames is 1, no correction is made if at least 3 of the 7 frames (the 5 frames plus the 1 frame before and the 1 frame after) are 1; otherwise, the current frame is reset to 0 (see the sketch below).
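A sketch of the correction pass for rules (1) to (4), with N1=5, M1=2, N2=1 and M2=3 as in the example; the handling of windows near the ends of the sequence is an assumption.

```python
def correct_binary_sequence(seq, n1=5, m1=2, n2=1, m2=3):
    """Smoothing pass implementing rules (1)-(4) above with N1=5, M1=2, N2=1, M2=3:
    isolated 1s that look like misrecognitions are reset to 0."""
    out = list(seq)
    for i, value in enumerate(seq):
        if value != 1:
            continue
        window = seq[i:i + n1]                   # N1 consecutive values from the current element
        ones = [j for j, x in enumerate(window) if x == 1]
        consecutive_pair = len(ones) == m1 and ones[-1] - ones[0] == m1 - 1
        if len(ones) >= m2 or consecutive_pair:
            continue                             # rules (1) and (2): keep the current element
        # rules (3) and (4): widen by N2 elements on each side and re-count
        wide = seq[max(i - n2, 0):i + n1 + n2]
        if sum(wide) < m2:
            out[i] = 0                           # reset the current frame to 0
    return out
```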
And 505, determining the time information of the audio of the specified type in the audio to be tested from the two classification sequences according to the specified duration.
In this embodiment, after the binary classification sequence is determined, whether to correct it can be decided by checking for sequence elements that meet the preset correction rule, so that the corrected sequence reflects the specified type of audio more accurately, which helps improve the subsequent recognition accuracy for that type of audio.
Based on the method for processing the audio signal, referring to fig. 6, a block diagram of an embodiment of the apparatus for processing an audio signal according to the present application is shown, and the apparatus may include the following modules:
a mel-frequency spectrum processing module 601, configured to pre-process the audio to be detected to obtain a multi-dimensional mel-frequency spectrum feature sequence;
a prediction probability obtaining module 602, configured to slice the multi-dimensional Mel frequency spectrum feature sequence, input the slices into a trained audio recognition model, and obtain the prediction probability corresponding to each audio segment output by the audio recognition model, where the prediction probability is the probability that the specified type of audio exists in the audio segment, the audio segment has a specified duration, and the specified type of audio includes an audio signal without specific semantics;
a binary classification sequence generating module 603, configured to generate a binary classification sequence from the obtained prediction probabilities, where each sequence element in the binary classification sequence corresponds to an audio segment of the specified duration;
and an audio time identification module 604, configured to determine, according to the specified duration, the time information of the specified type of audio in the audio to be tested from the binary classification sequence.
In one embodiment, the apparatus further comprises:
a correction judging module, configured to judge, before the time information of the specified type of audio is determined from the binary classification sequence according to the specified duration, whether the binary classification sequence contains sequence elements that meet the preset correction rule, and if so, to call a correction module;
and the correction module, configured to correct those sequence elements.
In an embodiment, the modification determining module is specifically configured to:
traverse the binary classification sequence, and if the classification value of the currently traversed sequence element is the first preset value, read the classification values of N1 consecutive elements starting from the current element, where N1 is a positive integer;
if, among the N1 classification values, there are M1 elements whose classification value is the first preset value and the M1 elements are not consecutive, or there is exactly 1 element whose classification value is the first preset value, read the classification values of the N2 elements before and the N2 elements after the N1 consecutive elements, where 1 < M1 < N1;
if, among the read classification values of the N1+2N2 elements, the number of elements whose classification value is the first preset value is smaller than M2, determine that the current element meets the preset correction rule, where M1 < M2;
the correction module is specifically configured to:
and setting the classification value of the current element as a second preset value.
In an embodiment, the mel-frequency spectrum processing module 601 is specifically configured to:
framing the audio to be detected according to a specified framing rule to obtain a corresponding audio frame sequence;
performing short-time Fourier transform on each frame of the audio frame sequence to generate a magnitude spectrum corresponding to the audio frame sequence;
and filtering the amplitude spectrum through a preset Mel filter bank to obtain a multi-dimensional Mel frequency spectrum characteristic sequence.
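As a rough illustration of this preprocessing chain (framing, short-time Fourier transform, Mel filtering), the following Python sketch uses librosa; the sample rate, frame length, hop length and number of Mel bands are illustrative assumptions, not values fixed by the application.

import librosa
import numpy as np

def mel_feature_sequence(path, sr=16000, n_fft=1024, hop_length=512, n_mels=64):
    """Sketch: audio file -> multi-dimensional Mel frequency spectrum feature sequence."""
    y, sr = librosa.load(path, sr=sr)                                  # the audio to be detected
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)         # framing + short-time Fourier transform
    magnitude = np.abs(stft)                                           # magnitude spectrum for each frame
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)    # preset Mel filter bank
    mel_features = mel_fb @ magnitude                                  # (n_mels, n_frames)
    return mel_features.T                                              # one n_mels-dimensional vector per frame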
In one embodiment, the prediction probability obtaining module 602 includes:
the slicing submodule is used for slicing the multi-dimensional Mel frequency spectrum characteristic sequence into audio segments with the length of the specified duration, wherein each audio segment is partially overlapped with the adjacent front and back audio segments respectively, and the overlapped duration is half of the specified duration;
and the audio input sub-module is used for respectively inputting the audio segments into the audio recognition model.
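A minimal sketch of the half-overlapping slicing step, assuming the feature sequence is a NumPy array of shape (n_frames, n_mels) and that the specified duration corresponds to a fixed number of frames; the value 100 below is purely illustrative.

import numpy as np

def slice_with_half_overlap(features, segment_frames=100):
    """Cut the feature sequence into segments, each overlapping its neighbours by half the specified duration."""
    step = segment_frames // 2                       # overlap duration = half of the specified duration
    segments = []
    for start in range(0, len(features) - segment_frames + 1, step):
        segments.append(features[start:start + segment_frames])
    if not segments:
        return np.empty((0, segment_frames, features.shape[1]))
    return np.stack(segments)                        # (n_segments, segment_frames, n_mels)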
In an embodiment, the audio time identification module 604 is specifically configured to:
determining, from the binary classification sequence, the position of an element whose classification value is the first preset value, and taking the position as a target position of the specified type of audio in the audio to be detected;
and calculating time information corresponding to the target position according to the specified duration and the target position.
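The conversion from target positions back to time information can be sketched as follows. This is only an illustrative Python sketch: it assumes the half-overlap slicing described above, so each sequence element is taken to advance by half of the specified duration, and segment_seconds stands for the specified duration.

def positions_to_times(binary_seq, segment_seconds, first_val=1):
    """Sketch: map elements whose value is the first preset value to (start, end) times in seconds."""
    hop = segment_seconds / 2.0              # assumed half-overlap between adjacent audio segments
    times = []
    for idx, value in enumerate(binary_seq):
        if value == first_val:
            start = idx * hop                # target position -> start time in the audio to be tested
            times.append((start, start + segment_seconds))
    return times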
In one embodiment, the audio recognition model is trained by:
acquiring audio training data, wherein the audio training data comprises first audio training data containing audio of a specified type and second audio training data not containing audio of the specified type, and time information corresponding to the audio of the specified type in the first audio training data is labeled in advance;
according to the pre-marked time information, extracting audio of a specified type from the first audio training data, and marking the classification type of the audio of the specified type as a first class;
labeling the audio in the first audio training data except the audio of the specified type and the classification type of the second audio training data as a second class;
preprocessing the audio training data to obtain a training data multi-dimensional Mel frequency spectrum characteristic sequence;
segmenting the training data multi-dimensional Mel frequency spectrum characteristic sequence to obtain a plurality of training data audio frequency segments;
and modeling the training audio segments and the corresponding classification types by adopting a convolutional neural network CNN and a long-short term memory network LSTM to generate an audio recognition model.
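One way to realize the CNN plus LSTM modelling step is sketched below with Keras; the layer sizes, the input shape (segment_frames x n_mels x 1) and the training settings are illustrative assumptions, not the architecture actually claimed by the application.

from tensorflow.keras import layers, models

def build_audio_recognition_model(segment_frames=100, n_mels=64):
    """Sketch: CNN front-end + LSTM, ending in a single probability for the specified type of audio."""
    # segment_frames and n_mels are assumed divisible by 4 so that the reshape below is valid
    inputs = layers.Input(shape=(segment_frames, n_mels, 1))
    x = layers.Conv2D(16, (3, 3), padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    # collapse the frequency axis so the LSTM sees one feature vector per time step
    x = layers.Reshape((segment_frames // 4, (n_mels // 4) * 32))(x)
    x = layers.LSTM(64)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)      # prediction probability per audio segment
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

In this sketch, model.fit would then be called with the labelled training-data audio segments (first class for the specified type of audio, second class for everything else).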
In an embodiment, the binary classification sequence generating module 603 is specifically configured to:
judging whether the current prediction probability is greater than or equal to a preset probability threshold value;
if so, setting the two classification values corresponding to the current prediction probability as a first preset value;
if not, setting the two classification values corresponding to the current prediction probability as a second preset value;
organize all of the classification values into the binary classification sequence.
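The generation of the binary classification sequence reduces to a threshold comparison on the prediction probabilities; a short NumPy sketch follows, where the 0.5 threshold stands in for the preset probability threshold and is only an assumed value.

import numpy as np

def to_binary_sequence(probabilities, threshold=0.5, first_val=1, second_val=0):
    """Probabilities >= threshold map to the first preset value, all others to the second."""
    probs = np.asarray(probabilities)
    return np.where(probs >= threshold, first_val, second_val).tolist()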
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
An embodiment of the present application further provides an electronic device, including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method in the method embodiment of FIG. 1.
The embodiment of the present application further provides a storage medium; when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the method according to the method embodiment of FIG. 1.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.

Claims (14)

1. A method of audio signal processing, the method comprising:
preprocessing the audio to be detected to obtain a multi-dimensional Mel frequency spectrum characteristic sequence;
slicing the multi-dimensional Mel frequency spectrum characteristic sequence, inputting the sliced multi-dimensional Mel frequency spectrum characteristic sequence into a trained audio recognition model, and obtaining a prediction probability corresponding to each audio clip output by the audio recognition model, wherein the prediction probability is the probability of predicting that the audio clip has an audio of a specified type, the audio clip has a specified duration, and the audio of the specified type comprises an audio signal without specific semantics;
generating two classification sequences according to the obtained prediction probabilities, wherein each sequence element in the two classification sequences corresponds to an audio clip with specified duration;
judging whether the two classification sequences have sequence elements which accord with a preset correction rule or not; if so, correcting the sequence elements;
according to the specified duration, determining the time information of the specified type of audio in the audio to be tested from the corrected two classification sequences;
wherein, the judging whether the two classification sequences have sequence elements which accord with a preset correction rule comprises the following steps:
traversing the two classification sequences, and if the classification value of the currently traversed sequence element is a first preset value, reading the classification values of N1 consecutive elements starting from the current element, wherein N1 is a positive integer;
if, among the N1 classification values, there are M1 elements whose classification value is the first preset value and the M1 elements are not consecutive, or there is exactly 1 element whose classification value is the first preset value, reading the classification values of the N2 elements before and the N2 elements after the N1 consecutive elements, wherein 1 < M1 < N1;
if, among the read classification values of the N1+2N2 elements, the number of elements whose classification value is the first preset value is less than M2, determining that the current element meets the preset correction rule, wherein M1 is less than M2;
the modifying the sequence element includes:
and setting the classification value of the current element as a second preset value.
2. The method of claim 1, wherein the preprocessing the audio to be detected to obtain a multi-dimensional Mel frequency spectrum characteristic sequence comprises:
framing the audio to be detected according to a specified framing rule to obtain a corresponding audio frame sequence;
performing short-time Fourier transform on each frame of the audio frame sequence to generate a magnitude spectrum corresponding to the audio frame sequence;
and filtering the amplitude spectrum through a preset Mel filter bank to obtain a multi-dimensional Mel frequency spectrum characteristic sequence.
3. The method of claim 1, wherein the slicing the multi-dimensional Mel frequency spectrum characteristic sequence and inputting the sliced multi-dimensional Mel frequency spectrum characteristic sequence into a trained audio recognition model comprises:
slicing the multi-dimensional Mel frequency spectrum characteristic sequence into audio segments with the length of a specified duration, wherein each audio segment is partially overlapped with the adjacent front and back audio segments respectively, and the duration of each overlap is half of the specified duration;
and respectively inputting the audio segments into the audio recognition model.
4. The method according to claim 1, wherein the determining, according to the specified duration, time information of the specified type of audio in the audio to be tested from the modified two-class sequence comprises:
determining, from the two classification sequences, the position of an element whose classification value is the first preset value, and taking the position as a target position of the specified type of audio in the audio to be detected;
and calculating time information corresponding to the target position according to the specified duration and the target position.
5. The method of claim 1, 3 or 4, wherein the audio recognition model is trained by:
acquiring audio training data, wherein the audio training data comprises first audio training data containing audio of a specified type and second audio training data not containing audio of the specified type, and time information corresponding to the audio of the specified type in the first audio training data is labeled in advance;
according to the time information marked in advance, extracting audio of a specified type from the first audio training data, and marking the classification type of the audio of the specified type as a first class;
labeling the audio in the first audio training data except the audio of the specified type and the classification type of the second audio training data as a second class;
preprocessing the audio training data to obtain a training data multi-dimensional Mel frequency spectrum characteristic sequence;
segmenting the training data multi-dimensional Mel frequency spectrum characteristic sequence to obtain a plurality of training data audio frequency segments;
and modeling the training audio segments and the corresponding classification types by adopting a convolutional neural network CNN and a long-short term memory network LSTM to generate an audio recognition model.
6. The method of claim 1, wherein the generating two classification sequences according to the obtained prediction probabilities comprises:
judging whether the current prediction probability is greater than or equal to a preset probability threshold value;
if so, setting the two classification values corresponding to the current prediction probability as a first preset value;
if not, setting the two classification values corresponding to the current prediction probability as a second preset value;
all binary values are organized into a binary sequence.
7. An apparatus for audio signal processing, the apparatus comprising:
the Mel frequency spectrum processing module is used for preprocessing the audio to be detected to obtain a multi-dimensional Mel frequency spectrum characteristic sequence;
a prediction probability obtaining module, configured to slice the multi-dimensional Mel frequency spectrum characteristic sequence and then input the sliced sequence into a trained audio recognition model, and to obtain a prediction probability corresponding to each audio segment output by the audio recognition model, wherein the prediction probability is the probability that the specified type of audio exists in the audio segment, the audio segment has a specified duration, and the specified type of audio comprises an audio signal without specific semantics;
the two-classification sequence generating module is used for generating two-classification sequences according to the obtained prediction probabilities, wherein each sequence element in the two-classification sequences corresponds to an audio clip with specified duration;
the correction judging module is used for judging whether the two classification sequences have sequence elements which accord with a preset correction rule or not; if yes, calling a correction module;
the correction module is used for correcting the sequence elements;
the audio time identification module is used for determining the time information of the audio of the specified type in the audio to be detected from the corrected two classification sequences according to the specified duration;
wherein, the correction judging module is specifically configured to:
traverse the two classification sequences, and if the classification value of the currently traversed sequence element is a first preset value, read the classification values of N1 consecutive elements starting from the current element, wherein N1 is a positive integer;
if, among the N1 classification values, there are M1 elements whose classification value is the first preset value and the M1 elements are not consecutive, or there is exactly 1 element whose classification value is the first preset value, read the classification values of the N2 elements before and the N2 elements after the N1 consecutive elements, wherein 1 < M1 < N1;
if, among the read classification values of the N1+2N2 elements, the number of elements whose classification value is the first preset value is less than M2, determine that the current element meets the preset correction rule, wherein M1 is less than M2;
the correction module is specifically configured to:
and setting the classification value of the current element as a second preset value.
8. The apparatus of claim 7, wherein the Mel spectral processing module is specifically configured to:
framing the audio to be detected according to a specified framing rule to obtain a corresponding audio frame sequence;
performing short-time Fourier transform on each frame of the audio frame sequence to generate a magnitude spectrum corresponding to the audio frame sequence;
and filtering the amplitude spectrum through a preset Mel filter bank to obtain a multi-dimensional Mel frequency spectrum characteristic sequence.
9. The apparatus of claim 7, wherein the prediction probability obtaining module comprises:
the slicing submodule is used for slicing the multi-dimensional Mel frequency spectrum characteristic sequence into audio segments with the length of the specified duration, wherein each audio segment is partially overlapped with the adjacent front and back audio segments respectively, and the overlapped duration is half of the specified duration;
and the audio input sub-module is used for respectively inputting the audio segments into the audio recognition model.
10. The apparatus of claim 7, wherein the audio time identification module is specifically configured to:
determining, from the two classification sequences, the position of an element whose classification value is the first preset value, and taking the position as a target position of the specified type of audio in the audio to be detected;
and calculating time information corresponding to the target position according to the specified duration and the target position.
11. The apparatus according to claim 7, 9 or 10, wherein the audio recognition model is trained by:
acquiring audio training data, wherein the audio training data comprises first audio training data containing audio of a specified type and second audio training data not containing audio of the specified type, and time information corresponding to the audio of the specified type in the first audio training data is labeled in advance;
according to the time information marked in advance, extracting audio of a specified type from the first audio training data, and marking the classification type of the audio of the specified type as a first class;
labeling the audio in the first audio training data except the audio of the specified type and the classification type of the second audio training data as a second class;
preprocessing the audio training data to obtain a training data multi-dimensional Mel frequency spectrum characteristic sequence;
segmenting the training data multi-dimensional Mel frequency spectrum characteristic sequence to obtain a plurality of training data audio frequency segments;
and modeling the training audio segments and the corresponding classification types by adopting a convolutional neural network CNN and a long-short term memory network LSTM to generate an audio recognition model.
12. The apparatus of claim 7, wherein the two-class sequence generation module is specifically configured to:
judging whether the current prediction probability is greater than or equal to a preset probability threshold value;
if so, setting the two classification values corresponding to the current prediction probability as a first preset value;
if not, setting the two classification values corresponding to the current prediction probability as a second preset value;
all binary values are organized into a binary sequence.
13. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1-6.
14. A storage medium having instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the method of any of claims 1-6.
CN201911103069.6A 2019-11-12 2019-11-12 Audio signal processing method and device Active CN110827798B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911103069.6A CN110827798B (en) 2019-11-12 2019-11-12 Audio signal processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911103069.6A CN110827798B (en) 2019-11-12 2019-11-12 Audio signal processing method and device

Publications (2)

Publication Number Publication Date
CN110827798A CN110827798A (en) 2020-02-21
CN110827798B true CN110827798B (en) 2020-09-11

Family

ID=69554369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911103069.6A Active CN110827798B (en) 2019-11-12 2019-11-12 Audio signal processing method and device

Country Status (1)

Country Link
CN (1) CN110827798B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111246285A (en) * 2020-03-24 2020-06-05 北京奇艺世纪科技有限公司 Method for separating sound in comment video and method and device for adjusting volume
CN111579056A (en) * 2020-05-19 2020-08-25 北京快鱼电子股份公司 Transformer direct-current magnetic bias prediction method and system
CN111933109A (en) * 2020-07-24 2020-11-13 南京烽火星空通信发展有限公司 Audio monitoring method and system
CN112053686B (en) * 2020-07-28 2024-01-02 出门问问信息科技有限公司 Audio interruption method, device and computer readable storage medium
CN112270933B (en) * 2020-11-12 2024-03-12 北京猿力未来科技有限公司 Audio identification method and device
CN112382310B (en) * 2020-11-12 2022-09-27 北京猿力未来科技有限公司 Human voice audio recording method and device
CN113657209B (en) * 2021-07-30 2023-09-12 北京百度网讯科技有限公司 Action recognition method, device, electronic equipment and storage medium
CN114339392B (en) * 2021-11-12 2023-09-12 腾讯科技(深圳)有限公司 Video editing method, device, computer equipment and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100937101B1 (en) * 2008-05-20 2010-01-15 성균관대학교산학협력단 Emotion Recognizing Method and Apparatus Using Spectral Entropy of Speech Signal
CN102446504B (en) * 2010-10-08 2013-10-09 华为技术有限公司 Voice/Music identifying method and equipment
US9547471B2 (en) * 2014-07-03 2017-01-17 Microsoft Technology Licensing, Llc Generating computer responses to social conversational inputs
WO2018160943A1 (en) * 2017-03-03 2018-09-07 Pindrop Security, Inc. Method and apparatus for detecting spoofing conditions
CN108172213B (en) * 2017-12-26 2022-09-30 北京百度网讯科技有限公司 Surge audio identification method, surge audio identification device, surge audio identification equipment and computer readable medium
US11024329B2 (en) * 2018-03-28 2021-06-01 International Business Machines Corporation Word repetition in separate conversations for detecting a sign of cognitive decline
CN109545191B (en) * 2018-11-15 2022-11-25 电子科技大学 Real-time detection method for initial position of human voice in song
CN109582774A (en) * 2018-11-30 2019-04-05 北京羽扇智信息科技有限公司 Natural language classification method, device, equipment and storage medium
CN109978877B (en) * 2019-04-04 2022-08-23 北京百度网讯科技有限公司 Method and device for classifying by using screening model and storage medium
CN110349597B (en) * 2019-07-03 2021-06-25 山东师范大学 Voice detection method and device

Also Published As

Publication number Publication date
CN110827798A (en) 2020-02-21

Similar Documents

Publication Publication Date Title
CN110827798B (en) Audio signal processing method and device
CN111477250B (en) Audio scene recognition method, training method and device for audio scene recognition model
US9875739B2 (en) Speaker separation in diarization
Stöter et al. Countnet: Estimating the number of concurrent speakers using supervised learning
Stöter et al. Classification vs. regression in supervised learning for single channel speaker count estimation
US10475484B2 (en) Method and device for processing speech based on artificial intelligence
JP6732296B2 (en) Audio information processing method and device
US10198697B2 (en) Employing user input to facilitate inferential sound recognition based on patterns of sound primitives
CN110852215A (en) Multi-mode emotion recognition method and system and storage medium
US10580436B2 (en) Method and device for processing speech based on artificial intelligence
CN112270933B (en) Audio identification method and device
CN112989822B (en) Method, device, electronic equipment and storage medium for recognizing sentence categories in conversation
CN110245227B (en) Training method and device for text classification fusion classifier
CN112885335A (en) Speech recognition method and related device
CN113593606B (en) Audio recognition method and device, computer equipment and computer-readable storage medium
Clink et al. GIBBONFINDR: An R package for the detection and classification of acoustic signals
CN113111855B (en) Multi-mode emotion recognition method and device, electronic equipment and storage medium
Bayat et al. Identification of Aras Birds with convolutional neural networks
CN114121018A (en) Voice document classification method, system, device and storage medium
CN110753288B (en) Method and system for automatically adjusting sound volume of sound box
Diez Gaspon et al. Deep learning for natural sound classification
Jingzhou et al. Audio segmentation and classification approach based on adaptive CNN in broadcast domain
Hajihashemi et al. Novel time-frequency based scheme for detecting sound events from sound background in audio segments
CN112489630A (en) Voice recognition method and device
CN115022733B (en) Digest video generation method, digest video generation device, computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant