CN108648748B - Acoustic event detection method under hospital noise environment - Google Patents

Acoustic event detection method under hospital noise environment

Info

Publication number
CN108648748B
CN108648748B (Application CN201810297418.1A)
Authority
CN
China
Prior art keywords
acoustic event
audio
event
target acoustic
mfcc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810297418.1A
Other languages
Chinese (zh)
Other versions
CN108648748A (en)
Inventor
邵虹
田影
刘阳
崔文成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang University of Technology
Original Assignee
Shenyang University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang University of Technology filed Critical Shenyang University of Technology
Priority to CN201810297418.1A priority Critical patent/CN108648748B/en
Publication of CN108648748A publication Critical patent/CN108648748A/en
Application granted granted Critical
Publication of CN108648748B publication Critical patent/CN108648748B/en

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/14Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142Hidden Markov Models [HMMs]
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Abstract

The invention relates to an acoustic event detection method, in particular to an acoustic event detection method in a hospital noise environment. The method and device enable speech to be accurately recognized as text, improving the recognition rate of voice-entry electronic medical records and reducing the misrecognition rate. The method comprises the following steps: step 1, intercept the characteristic segments of the audio signal of each acoustic event and label the corresponding audio segments; step 2, extract the MFCC feature coefficients of each target acoustic event in the audio; step 3, align the speech phonemes; step 4, generate the feature matrix of the speech; step 5, build a CRNN model for each target acoustic event; step 6, preprocess the audio signal of the target acoustic event to be detected, acquired in real time in the hospital noise environment, and then extract its MFCC features; step 7, obtain the category of the target acoustic event to be detected; and step 8, filter out audio segments irrelevant to the target acoustic event.

Description

Acoustic event detection method under hospital noise environment
Technical Field
The invention relates to an acoustic event detection method, in particular to an acoustic event detection method in a hospital noise environment.
Background
Under very low signal-to-noise ratios, or when several people speak at once, the recognition rate of existing voice-entry electronic medical records drops sharply. Acoustic event detection therefore becomes a key step in removing the influence of noise in the hospital environment.
Current speech recognizers lump all non-speech sounds into a single class: noise. In practice, real-world noise can be more complex than speech, and if the various noise types can also be modeled, the recognizer can more easily distinguish which audio is useful speech, which is significant for the recognition of voice-entry electronic medical records.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an acoustic event detection method in a hospital noise environment that can accurately recognize speech as text, improve the recognition rate of voice-entry electronic medical records, and reduce the misrecognition rate.
To achieve this purpose, the invention adopts the following technical scheme, which comprises the steps below.
Step 1, in the training stage, taking the audio signals of the target acoustic events and hospital environmental noise signals as training data, intercepting the characteristic segments of the audio signal of each acoustic event, and labelling the corresponding audio segments.
Step 2, extracting the MFCC feature coefficients of each target acoustic event in the audio according to the segments intercepted in step 1, wherein the intercepted audio segments contain the features of the acoustic events.
Step 3, training an HMM-CNN alignment model on the extracted MFCC feature coefficients and aligning the speech phonemes.
Step 4, computing the cepstral mean and variance normalization statistics of the MFCC features aligned in step 3, indexed by sound-event number; each set of statistics is a matrix, i.e. the feature matrix of the speech.
Step 5, building a CRNN model for each target acoustic event from the feature matrix generated in step 4, using Keras with Theano as the backend.
Step 6, in the recognition stage, preprocessing the audio signal of the target acoustic event to be detected, acquired in real time in the hospital noise environment, and then extracting its MFCC features.
Step 7, classifying the MFCC coefficients extracted in step 6 with the CRNN model obtained in step 5 to obtain the category of the target acoustic event to be detected.
Step 8, for the acoustic events whose categories were determined in step 7, analyzing the noise events jointly by time sequence and direction to obtain the corresponding event sequence codes, filtering the current event sequence according to these codes, and filtering out the audio segments irrelevant to the target acoustic events.
Preferably, the method adopted by the model for aligning the speech phonemes in step 3 comprises the following steps.
Step 3-1, processing the MFCC features to extract a frame sequence; each frame is normalized to the same scale and fed into the CNN, which outputs the posterior probability of the frame belonging to each class.
Step 3-2, reducing the number of training parameters through CNN weight sharing to suppress overfitting, enhancing the original speech signal features by the convolution operation of the convolutional layers, and reducing the background noise.
Step 3-3, sub-sampling the features in the pooling layer by exploiting the local correlation of the speech spectrum, reducing the dimensionality of the data while retaining the useful information.
Step 3-4, after normalization, this probability is used as the output (emission) probability of the HMM, which is used to infer the most likely sequence of feature frames.
Preferably, the feature matrix in step 4 is obtained as follows: compute the cepstral mean and variance normalization statistics of the extracted features, indexed by sound-event number; each set of statistics is a feature matrix.
Preferably, the method for training the CRNN acoustic model in step 5 includes the following steps.
Step 5-1, using gated linear units (GLUs) as activation functions in the CNN, which in audio classification introduce an attention mechanism into all layers of the neural network; relevant audio events are attended to by driving the gate values of irrelevant time-frequency units close to zero; convolutional layers are applied to extract high-level features.
Step 5-2, capturing the temporal context information with a bidirectional recurrent neural network (Bi-RNN), and predicting the posterior of each audio category for each frame with a feed-forward neural network (FNN) whose output size equals the number of audio categories. The predicted probability of each audio tag is obtained by averaging the posteriors over all frames.
Step 5-3, applying a binary cross-entropy loss between the predicted probability of the audio recording and the ground truth; the weights of the neural network are updated using the weight gradients computed by back-propagation.
Wherein the GLU is given by the following formula.
Y=(W*X+b)⊙σ(V*X+c) (1)
In the above formula, σ is the sigmoid nonlinearity, ⊙ is the element-wise product, and * is the convolution operator; W and V are convolution filters, and b and c are biases; X represents the input time-frequency (T-F) representation in the first layer, or the feature map of an intermediate layer.
In addition, training uses the following binary cross-entropy loss.
E = -∑_{n=1}^{N}[P_n log O_n + (1 - P_n) log(1 - O_n)] (2)
Where E is the binary cross entropy, and O_n and P_n represent the estimated and reference label vectors at sample index n, respectively; the batch size is denoted by N. Adam is used as the stochastic optimization method.
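As an illustration of formula (1), the following Python sketch applies a one-dimensional gated linear unit to a toy feature sequence; the 1-D convolution, filter taps, and shapes are illustrative assumptions standing in for the 2-D time-frequency case, not parameters fixed by the invention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu_1d(x, w, b, v, c):
    """Gated linear unit, formula (1): (W*X + b) gated by sigma(V*X + c)."""
    linear = np.convolve(x, w, mode="same") + b           # linear branch W*X + b
    gate = sigmoid(np.convolve(x, v, mode="same") + c)    # gate sigma(V*X + c)
    return linear * gate                                   # element-wise product

# Toy usage with made-up filter taps and a random 128-frame feature sequence.
x = np.random.randn(128)
y = glu_1d(x, w=np.array([0.2, 0.5, 0.2]), b=0.0,
           v=np.array([0.1, 0.3, 0.1]), c=0.0)
```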
Compared with the prior art, the invention has the following beneficial effects.
The invention enables voice-entry electronic medical records to transcribe speech into text accurately in a hospital noise environment, thereby improving the recognition rate of voice-entry electronic medical records and reducing the misrecognition rate. Target acoustic event detection in the hospital noise environment is realized, with a degree of robustness to noise.
Drawings
The invention is further described below with reference to the figures and the detailed description. The scope of the invention is not limited to the following description.
Fig. 1 is a block diagram of the overall structure of the present invention.
FIG. 2 is a model diagram of a CNN-HMM according to the present invention.
FIG. 3 is a diagram of the CRNN model of the present invention.
In the figure, 1 is a pooling layer, 2 is a hidden layer, and 3 is a convolutional layer.
Detailed Description
As shown in Figs. 1-3, the present invention relates to an acoustic event detection method, in particular for noisy data in a hospital environment. The method comprises the following steps.
In the training phase:
1. First, the hospital noise is analyzed comprehensively, and various acoustic events occurring in the hospital are used for detection and classification. Six acoustic events are detected: the sounds of the medical equipment, namely the ventilator, the ECG monitor, and the cardiac defibrillation and pacing equipment, together with the movement of the nursing cart, the printer, and the patient crying. Each class comprises 100 events, each no shorter than 1 second, with audio recordings of ten seconds or more; the target sound event categories are selected according to their frequency of occurrence in the original annotation and the number of different recordings in which they occur. The data set is divided into training and evaluation subsets according to the number of examples available for each event class, while also taking the recording location into account. To tune the parameters for the best effect, the development set is further divided into four folds, and each recording is used only once as test data. At this stage, the only condition imposed is that the test subset does not contain training data. For part of the sound event data, the evaluation set consists of five recordings and 12 recordings are distributed over the four folds into the training and testing subsets; for the remaining sound event data, the evaluation set consists of five recordings and 10 recordings are distributed over the four folds into the training and testing subsets.
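A minimal sketch of the four-fold development split described above is given below; it assumes a simple list of recording identifiers and uses scikit-learn, which is not named in the patent, purely for illustration. Each recording lands in the test split of exactly one fold.

```python
from sklearn.model_selection import KFold

# Hypothetical recording identifiers for the 12 development recordings.
recordings = [f"rec_{i:02d}" for i in range(12)]

kf = KFold(n_splits=4, shuffle=True, random_state=0)   # four folds, as in the text
for fold, (train_idx, test_idx) in enumerate(kf.split(recordings)):
    train_recs = [recordings[i] for i in train_idx]
    test_recs = [recordings[i] for i in test_idx]      # each recording is tested once
    print(f"fold {fold}: train={train_recs}, test={test_recs}")
```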
2. Then, the audio signals and noise signals of the various target acoustic events are acquired with a microphone-array voice recording system; the audio of the target acoustic events and the hospital noise signals are taken as training data, the characteristic segments of the audio signal of each acoustic event are intercepted, and the corresponding audio segments are labelled.
3. The MFCC feature coefficients of each target acoustic event in the audio are extracted from the intercepted segments, the audio segments containing the features of the acoustic events.
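The MFCC extraction step could look like the following sketch. The patent does not name a feature library; librosa and the parameter values (16 kHz sample rate, 25 ms window, 10 ms hop, 13 coefficients) are assumptions chosen only to make the example concrete.

```python
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Return an MFCC matrix of shape (frames, n_mfcc) for one labelled segment."""
    y, _ = librosa.load(wav_path, sr=sr)                # mono audio at the target rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)  # 25 ms window, 10 ms hop
    return mfcc.T

# features = extract_mfcc("ventilator_001.wav")         # hypothetical file name
```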
4. The model that aligns the speech phonemes from the extracted MFCC coefficients adopts the following method.
(1) The MFCC features are processed to extract a frame sequence. Each frame is normalized to the same scale and fed into the CNN, which outputs the posterior probability of the frame belonging to each class.
(2) The weight-sharing property of the CNN reduces the number of training parameters and suppresses overfitting; the convolution operation of the convolutional layers enhances the original speech signal features and reduces background noise. The CNN model comprises three kinds of layers: a pooling layer, a hidden layer, and a convolutional layer.
(3) The pooling layer sub-samples the features by exploiting the local correlation of the speech spectrum, reducing the dimensionality of the data while retaining the useful information.
(4) After normalization, this probability is used as the output (emission) probability of the HMM, which is then used to infer the most likely sequence of feature frames.
Referring to Fig. 2, which shows the CNN-HMM model, the speech phonemes are aligned by the trained CNN-HMM acoustic model, as illustrated by the sketch below.
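The alignment idea of Fig. 2 can be sketched as follows: the normalized per-frame class posteriors from the CNN serve as HMM emission probabilities, and Viterbi decoding recovers the most likely frame-level state sequence. The transition matrix, priors, and state inventory are illustrative assumptions, not values given in the patent.

```python
import numpy as np

def viterbi_align(log_emission, log_trans, log_init):
    """log_emission: (T, S) log posteriors from the CNN, one row per frame;
    log_trans: (S, S) log transition matrix; log_init: (S,) log state priors."""
    T, S = log_emission.shape
    delta = np.full((T, S), -np.inf)                    # best log score per state
    back = np.zeros((T, S), dtype=int)                  # back-pointers
    delta[0] = log_init + log_emission[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans      # (from_state, to_state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emission[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                      # trace the best path backwards
        path[t] = back[t + 1, path[t + 1]]
    return path                                          # frame-level state alignment
```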
5. The cepstral mean and variance normalization statistics of the aligned MFCC features are computed, indexed by sound-event number; each set of statistics is a matrix, i.e. the feature matrix of the speech.
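A compact sketch of the cepstral mean and variance normalization (CMVN) statistics is shown below; the per-event dictionary layout is an assumption, since the patent only states that each statistic set, indexed by sound-event number, forms a matrix.

```python
import numpy as np

def cmvn_stats(mfcc):
    """mfcc: (frames, coeffs) aligned MFCC features for one acoustic event."""
    mean = mfcc.mean(axis=0)
    std = mfcc.std(axis=0) + 1e-8                       # avoid division by zero
    normalized = (mfcc - mean) / std
    return normalized, np.stack([mean, std])            # per-event statistic matrix

# Hypothetical feature matrix indexed by sound-event number:
# feature_matrix = {event_id: cmvn_stats(m)[1] for event_id, m in enumerate(mfcc_list)}
```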
6. A CRNN model is established for each target acoustic event from the generated feature matrix, using Keras with Theano as the backend.
The method for training the CRNN acoustic model is as follows.
(1) The CNN uses gated linear units (GLUs) as activation functions, which in audio classification introduce an attention mechanism into all layers of the neural network. Relevant audio events are attended to by driving the gate values of irrelevant time-frequency (T-F) units close to zero: if a GLU gate is close to 1, the corresponding T-F unit is attended to; if it is close to 0, the corresponding T-F unit is ignored. In this way, the network learns to focus on the audio events and to ignore irrelevant sounds. Convolutional layers are applied to extract high-level features.
(2) The temporal context information is captured with a bidirectional recurrent neural network (Bi-RNN), and the posterior of each audio category for each frame is predicted with a feed-forward neural network (FNN) whose output size equals the number of audio categories. The predicted probability of each audio tag is obtained by averaging the posteriors over all frames.
(3) A binary cross-entropy loss is applied between the predicted probability of the audio recording and the ground truth. The weights of the neural network are updated using the weight gradients computed by back-propagation; a CRNN acoustic model is created for each target acoustic event, as shown in Fig. 3, which is a diagram of the CRNN model.
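The CRNN structure described in (1)-(3) could be sketched in Keras as below: a gated (GLU) convolutional block, a bidirectional recurrent layer for temporal context, a frame-wise sigmoid output layer, and averaging of the frame posteriors into a clip-level prediction. The layer sizes, the choice of GRU cells, and the pooling settings are illustrative assumptions; the patent fixes only the overall CNN, Bi-RNN, FNN, and averaging pipeline.

```python
from keras.layers import (Input, Conv2D, MaxPooling2D, Multiply, Reshape,
                          Bidirectional, GRU, TimeDistributed, Dense,
                          GlobalAveragePooling1D)
from keras.models import Model

def build_crnn(n_frames=240, n_mels=64, n_classes=6):
    x_in = Input(shape=(n_frames, n_mels, 1))            # time-frequency input
    # Gated linear unit: linear branch modulated by a sigmoid "attention" gate.
    linear = Conv2D(32, (3, 3), padding="same", activation="linear")(x_in)
    gate = Conv2D(32, (3, 3), padding="same", activation="sigmoid")(x_in)
    h = Multiply()([linear, gate])
    h = MaxPooling2D(pool_size=(1, 2))(h)                # pool frequency, keep time
    h = Reshape((n_frames, -1))(h)                       # (time, features) for the RNN
    h = Bidirectional(GRU(64, return_sequences=True))(h) # temporal context (Bi-RNN)
    frame_post = TimeDistributed(Dense(n_classes, activation="sigmoid"))(h)
    clip_prob = GlobalAveragePooling1D()(frame_post)     # average frame posteriors
    return Model(x_in, clip_prob)
```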
With reference to Fig. 1, in the recognition phase:
1. The audio signal of the target acoustic event to be detected, acquired in real time in the hospital noise environment with the microphone-array voice recording system, is preprocessed and its MFCC features are then extracted.
2. The extracted MFCC coefficients are classified with the CRNN model obtained in the training phase to obtain the category of the target acoustic event to be detected.
3. Among the acoustic events whose category has been determined, the noise events are analyzed jointly by time sequence and direction to obtain the corresponding event sequence codes; the current event sequence is filtered according to these codes, and the audio segments irrelevant to the target acoustic events are filtered out.
The GLU formula is as follows.
Y=(W*X+b)⊙σ(V*X+c) (1)
Where σ is the sigmoid nonlinearity, ⊙ is the element-wise product, and * is the convolution operator. W and V are convolution filters, and b and c are biases. X represents the input T-F representation in the first layer, or the feature map of an intermediate layer.
In addition, training uses the following binary cross-entropy loss.
E = -∑_{n=1}^{N}[P_n log O_n + (1 - P_n) log(1 - O_n)] (2)
Where E is the binary cross entropy, and O_n and P_n represent the estimated and reference label vectors at sample index n, respectively. The batch size is denoted by N; Adam is used as the stochastic optimization method.
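Consistent with formula (2), a minimal training sketch pairing binary cross entropy with the Adam optimizer might look as follows; the batch size, epoch count, and data variables are placeholders, and build_crnn refers to the hypothetical model sketch given earlier.

```python
from keras.optimizers import Adam

model = build_crnn()                                     # hypothetical sketch from above
model.compile(optimizer=Adam(),                          # Adam as the stochastic optimizer
              loss="binary_crossentropy",                # formula (2) per audio tag
              metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=32, epochs=50,
#           validation_data=(x_val, y_val))
```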
The invention solves the recognition-rate problem of traditional voice-entry electronic medical records in a noisy environment, can greatly improve the working efficiency and effectiveness of medical personnel, and can be widely popularized and applied in the field of voice-entry electronic medical records.
It should be understood that the detailed description is only intended to illustrate the present invention and that the invention is not limited to the technical solutions described in the embodiments; those skilled in the art will understand that the invention may be modified or equivalently substituted to achieve the same technical effects, and as long as the use requirements are met, such modifications fall within the protection scope of the invention.

Claims (1)

1. The method for detecting the acoustic events in the noise environment of the hospital is characterized by comprising the following steps of:
step 1, in a training stage, taking an audio signal of a target acoustic event and a hospital environment noise signal as training data, carrying out feature interception on the audio signal of each acoustic event, and carrying out corresponding marking on an audio segment of the audio signal;
step 2, extracting MFCC feature coefficients of each target acoustic event in the audio according to the segments intercepted in step 1, wherein the intercepted audio segments contain the features of the acoustic events;
step 3, according to the extracted MFCC characteristic coefficients, adopting an HMM-CNN training alignment model to align the speech phonemes;
step 4, calculating statistics of cepstrum mean and variance normalization of the MFCC features after alignment in the step 3, taking the sound event number as an index, wherein each statistic set is a matrix, namely a feature matrix of the generated voice;
step 5, establishing a CRNN model for each target acoustic event according to the feature matrix generated in step 4, using Keras with Theano as the backend;
step 6, in the identification stage, the audio signals of the target acoustic events to be detected, which are acquired in real time in the noise environment of the hospital, are preprocessed and then MFCC (Mel frequency cepstrum coefficient) feature extraction is carried out;
step 7, classifying and identifying by adopting the CRNN model obtained in the step 5 according to the MFCC coefficient extracted in the step 6 to obtain the category of the target acoustic event to be detected;
step 8, in the acoustic events of the determined category in the step 7, carrying out comprehensive analysis based on time sequence and direction on the noise events to obtain corresponding event sequence codes, filtering the current event sequence according to the obtained event sequence codes, and filtering out audio segments irrelevant to the target acoustic events;
the method adopted by the model for aligning the speech phonemes in the step 3 comprises the following steps:
step 3-1, processing according to MFCC characteristics to extract a frame sequence; each frame is normalized to the same scale and fed into the CNN which produces a posterior probability of belonging to one class;
step 3-2, reducing the number of training parameters through CNN weight sharing to suppress overfitting, enhancing the features of the original speech signal by the convolution operation of the convolutional layer, and reducing background noise;
3-3, performing sub-sampling on the characteristics by utilizing a speech signal frequency spectrum local correlation principle at a pooling layer, reducing the dimension of data and reserving useful information;
step 3-4, after normalization, this probability is used as the output probability of the HMM, which is used to infer the most likely sequence of feature frames.
CN201810297418.1A 2018-03-30 2018-03-30 Acoustic event detection method under hospital noise environment Expired - Fee Related CN108648748B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810297418.1A CN108648748B (en) 2018-03-30 2018-03-30 Acoustic event detection method under hospital noise environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810297418.1A CN108648748B (en) 2018-03-30 2018-03-30 Acoustic event detection method under hospital noise environment

Publications (2)

Publication Number Publication Date
CN108648748A CN108648748A (en) 2018-10-12
CN108648748B true CN108648748B (en) 2021-07-13

Family

ID=63745447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810297418.1A Expired - Fee Related CN108648748B (en) 2018-03-30 2018-03-30 Acoustic event detection method under hospital noise environment

Country Status (1)

Country Link
CN (1) CN108648748B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259189B (en) * 2018-11-30 2023-04-18 马上消费金融股份有限公司 Music classification method and device
CN109961017A (en) * 2019-02-26 2019-07-02 杭州电子科技大学 A kind of cardiechema signals classification method based on convolution loop neural network
CN110097193B (en) * 2019-04-28 2021-03-19 第四范式(北京)技术有限公司 Method and system for training model and method and system for predicting sequence data
CN110085249B (en) * 2019-05-09 2021-03-16 南京工程学院 Single-channel speech enhancement method of recurrent neural network based on attention gating
CN110147788B (en) * 2019-05-27 2021-09-21 东北大学 Feature enhancement CRNN-based metal plate strip product label character recognition method
CN110179466A (en) * 2019-06-03 2019-08-30 珠海涵辰科技有限公司 Breathing detection system after calamity based on intelligent terminal
CN110223713A (en) * 2019-06-11 2019-09-10 苏州思必驰信息科技有限公司 Sound event detection model training method and sound event detection method
CN110232927B (en) * 2019-06-13 2021-08-13 思必驰科技股份有限公司 Speaker verification anti-spoofing method and device
CN110334243A (en) * 2019-07-11 2019-10-15 哈尔滨工业大学 Audio representation learning method based on multilayer timing pond
CN110600059B (en) * 2019-09-05 2022-03-15 Oppo广东移动通信有限公司 Acoustic event detection method and device, electronic equipment and storage medium
CN111261192A (en) * 2020-01-15 2020-06-09 厦门快商通科技股份有限公司 Audio detection method based on LSTM network, electronic equipment and storage medium
CN111259188B (en) * 2020-01-19 2023-07-25 成都潜在人工智能科技有限公司 Lyric alignment method and system based on seq2seq network
CN111899760A (en) * 2020-07-17 2020-11-06 北京达佳互联信息技术有限公司 Audio event detection method and device, electronic equipment and storage medium
CN111933188B (en) * 2020-09-14 2021-02-05 电子科技大学 Sound event detection method based on convolutional neural network
CN112309405A (en) * 2020-10-29 2021-02-02 平安科技(深圳)有限公司 Method and device for detecting multiple sound events, computer equipment and storage medium
CN112712804B (en) * 2020-12-23 2022-08-26 哈尔滨工业大学(威海) Speech recognition method, system, medium, computer device, terminal and application
CN112863492B (en) * 2020-12-31 2022-06-10 思必驰科技股份有限公司 Sound event positioning model training method and device
CN113159217B (en) * 2021-05-12 2023-08-01 深圳龙岗智能视听研究院 Attention mechanism target detection method based on event camera
CN113761269B (en) * 2021-05-21 2023-10-10 腾讯科技(深圳)有限公司 Audio recognition method, apparatus and computer readable storage medium
CN113903003B (en) * 2021-10-15 2022-07-29 宿迁硅基智能科技有限公司 Event occurrence probability determination method, storage medium, and electronic apparatus
CN113920473B (en) * 2021-10-15 2022-07-29 宿迁硅基智能科技有限公司 Complete event determination method, storage medium and electronic device
CN114974303B (en) * 2022-05-16 2023-05-12 江苏大学 Self-adaptive hierarchical aggregation weak supervision sound event detection method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1598770A2 (en) * 2004-05-20 2005-11-23 Microsoft Corporation Low resolution optical character recognition for camera acquired documents
WO2013057652A3 (en) * 2011-10-17 2013-07-18 Koninklijke Philips Electronics N.V. A medical feedback system based on sound analysis in a medical environment
CN104916289A (en) * 2015-06-12 2015-09-16 哈尔滨工业大学 Quick acoustic event detection method under vehicle-driving noise environment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1598770A2 (en) * 2004-05-20 2005-11-23 Microsoft Corporation Low resolution optical character recognition for camera acquired documents
WO2013057652A3 (en) * 2011-10-17 2013-07-18 Koninklijke Philips Electronics N.V. A medical feedback system based on sound analysis in a medical environment
CN103875034A (en) * 2011-10-17 2014-06-18 皇家飞利浦有限公司 A medical feedback system based on sound analysis in a medical environment
CN104916289A (en) * 2015-06-12 2015-09-16 哈尔滨工业大学 Quick acoustic event detection method under vehicle-driving noise environment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Emre Çakır; "Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection"; Transactions on Audio, Speech, and Language Processing; 2017-12-31; pp. 1291-1302 *
Oscar Koller; "Re-Sign: Re-Aligned End-to-End Sequence Modelling with Deep Recurrent CNN-HMMs"; 2017 IEEE Conference on Computer Vision and Pattern Recognition; 2017-12-31; full text *

Also Published As

Publication number Publication date
CN108648748A (en) 2018-10-12

Similar Documents

Publication Publication Date Title
CN108648748B (en) Acoustic event detection method under hospital noise environment
CN107492382B (en) Voiceprint information extraction method and device based on neural network
CN109044396B (en) Intelligent heart sound identification method based on bidirectional long-time and short-time memory neural network
Ghoraani et al. Time–frequency matrix feature extraction and classification of environmental audio signals
CN109036382B (en) Audio feature extraction method based on KL divergence
CN110033756B (en) Language identification method and device, electronic equipment and storage medium
CN113537005B (en) Online examination student behavior analysis method based on attitude estimation
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN111951824A (en) Detection method for distinguishing depression based on sound
CN111986699B (en) Sound event detection method based on full convolution network
CN110364168B (en) Voiceprint recognition method and system based on environment perception
CN111816185A (en) Method and device for identifying speaker in mixed voice
CN110991238A (en) Speech auxiliary system based on speech emotion analysis and micro-expression recognition
CN111666996A (en) High-precision equipment source identification method based on attention mechanism
CN112466284B (en) Mask voice identification method
CN107274912A (en) A kind of equipment source discrimination method of mobile phone recording
CN115831352B (en) Detection method based on dynamic texture features and time slicing weight network
CN116842460A (en) Cough-related disease identification method and system based on attention mechanism and residual neural network
CN109584861A (en) The screening method of Alzheimer's disease voice signal based on deep learning
CN114881668A (en) Multi-mode-based deception detection method
Xie et al. Image processing and classification procedure for the analysis of australian frog vocalisations
CN113571092A (en) Method for identifying abnormal sound of engine and related equipment thereof
CN111785262A (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN111524523A (en) Instrument and equipment state detection system and method based on voiceprint recognition technology
CN116052725B (en) Fine granularity borborygmus recognition method and device based on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210713