CN116052725A - Fine granularity borborygmus recognition method and device based on deep neural network - Google Patents

Fine granularity borborygmus recognition method and device based on deep neural network

Info

Publication number
CN116052725A
CN116052725A (application CN202310335591.7A)
Authority
CN
China
Prior art keywords
neural network
module
borborygmus
deep neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310335591.7A
Other languages
Chinese (zh)
Other versions
CN116052725B (en)
Inventor
胡兵
刘瑞德
黄凯得
冯亦龙
袁湘蕾
刘伟
林怡秀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
West China Hospital of Sichuan University
Original Assignee
West China Hospital of Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by West China Hospital of Sichuan University
Priority to CN202310335591.7A
Publication of CN116052725A
Application granted
Publication of CN116052725B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/48 Other medical applications
    • A61B5/4803 Speech analysis specially adapted for diagnostic purposes
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/72 Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B5/7235 Details of waveform analysis
    • A61B5/7264 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
    • A61B5/7267 Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems involving training the classification device
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B7/00 Instruments for auscultation
    • A61B7/02 Stethoscopes
    • A61B7/04 Electric stethoscopes
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Signal Processing (AREA)
  • Animal Behavior & Ethology (AREA)
  • Veterinary Medicine (AREA)
  • Surgery (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Computational Linguistics (AREA)
  • Pathology (AREA)
  • Biophysics (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Epidemiology (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physiology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a device for fine-grained borborygmus recognition based on a deep neural network, mainly solving two problems of the prior art: a great amount of relevant information is lost when 1-dimensional audio data are converted into a 2-dimensional feature map, and the specific moment at which borborygmus occurs cannot be accurately located because the signal features depend heavily on manual extraction. The fine-grained borborygmus recognition method based on the deep neural network collects and labels abdominal auscultation recording data; constructs a deep neural network model based on the Transformer structure and trains the deep neural network with the labeled abdominal auscultation recording data to obtain a final model; and loads the final model so that the input abdominal sound signal yields the corresponding borborygmus recognition result. With this scheme, accurate fine-grained recognition of borborygmus events can be achieved through end-to-end training alone, without manual feature extraction.

Description

Fine granularity borborygmus recognition method and device based on deep neural network
Technical Field
The invention relates to the technical field of deep neural networks, and in particular to a method and a device for fine-grained borborygmus recognition based on a deep neural network.
Background
Human bowel sounds (BS) are sounds produced by peristaltic movement of the intestines pushing food, liquids and gases. Many studies have shown that borborygmus is closely related to the functional status of the gastrointestinal tract, and auscultation of borborygmus is helpful in diagnosing functional bowel diseases such as irritable bowel syndrome and functional constipation; abdominal borborygmus auscultation is therefore an important examination item in clinical work.
Prior art schemes fall into two main categories: methods based on conventional signal processing and methods based on deep learning. Their disadvantages are as follows. (1) Most existing methods rely on manually extracted signal features, but because of the complexity of borborygmus these hand-crafted features are often suboptimal, which limits the recognition performance of the subsequent classification algorithm. (2) The current mainstream algorithms are based on Convolutional Neural Networks (CNNs), which are designed for 2-dimensional image data; sound signals such as bowel sounds are 1-dimensional sequence data whose characteristics differ greatly from those of images, so the bowel sounds must first be converted into 2-dimensional feature maps, and a large amount of relevant information is lost in this conversion. (3) The prior art only achieves coarse-grained recognition of borborygmus events: it judges whether a long abdominal sound segment contains a borborygmus event, but cannot accurately locate the specific moment at which an event occurs, nor determine related parameters such as the number and frequency of occurrences, which are critical to the diagnosis of intestinal diseases.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a fine-grained borborygmus recognition method based on a deep neural network, which trains a novel Transformer-based borborygmus recognition neural network model on collected borborygmus data and achieves fine-grained, accurate recognition of borborygmus events without manual extraction of audio features.
The invention provides the following technical scheme:
In one aspect, a method for fine-grained borborygmus recognition based on a deep neural network includes:
collecting and labeling abdominal auscultation recording data;
constructing a deep neural network model based on the Transformer structure, and training the deep neural network with the labeled abdominal auscultation recording data to obtain a final model;
and loading the final model and inputting the abdominal sound signal to be detected to obtain the corresponding borborygmus recognition result.
In a preferred embodiment, collecting and labeling the abdominal auscultation recording data includes: collecting the abdominal auscultation audio stream and splitting the auscultation recording into audio pieces of equal length.
In a preferred embodiment, collecting and labeling the abdominal auscultation recording data includes: screening from the audio clips those containing borborygmus events and those containing other sounds, including Gaussian noise, heart beat sounds, breathing sounds, speaking sounds and stethoscope friction sounds, to form separate data sets.
In a preferred embodiment, collecting and labeling the abdominal auscultation recording data includes: marking in detail the starting and ending times of the borborygmus event in each sound clip of the borborygmus event data set, with labeling precision at the set millisecond level.
In a preferred embodiment, collecting and labeling the abdominal auscultation recording data includes: dividing the labeled borborygmus event data set into a training set, a verification set and a test set.
On the other hand, the deep neural network model is a fine-grained borborygmus recognition model constructed from a frame embedding module, a position encoding module, a plurality of stacked Transformer encoders and a linear classifier;
the frame embedding module is composed of a 1-dimensional convolution layer with multiple output channels;
the position coding module is responsible for automatically coding the position information of the audio frame sequence and adding the coded position information into the original input sequence through addition operation;
the Transformer encoder is composed of a layer normalization module, a multi-head self-attention module and a multi-layer perceptron module;
the linear classifier module is composed of one layer of linear neurons and is responsible for classifying the audio features extracted by the Transformer encoders, dividing each audio frame into two classes: borborygmus events and non-borborygmus events.
In a preferred embodiment, the multi-head self-attention module and the multi-layer perceptron module are provided with residual modules.
In a preferred embodiment, constructing the Transformer-based deep neural network model includes the following steps:
S301, the position encoding module adds position information; the data then enter a layer normalization module, which performs a normalization operation on the training set data so that the normalized data follow a normal distribution with mean 0 and standard deviation 1;
S302, the normalized data from step S301 enter a multi-head self-attention module, which outputs more abstract sequence features; the multi-head attention module is composed of a plurality of parallel self-attention modules;
S303, the sequence features output in step S302 are layer-normalized again and enter a multi-layer perceptron module, which outputs more abstract classification features; the multi-layer perceptron module consists of two layers of linear neurons connected in the middle by a nonlinear activation function; the classification features pass through a residual module, after which advanced features are output;
S304, the advanced features enter a linear classifier module and are classified into borborygmus events and non-borborygmus events.
In a preferred embodiment, obtaining the corresponding borborygmus recognition result from the input abdominal sound signal includes: accurately recognizing borborygmus events in the clinic using the trained deep neural network model, and calculating the occurrence frequency of the borborygmus events from the output of the deep neural network model using a non-maximum suppression algorithm.
In a third aspect, a device for fine-grained borborygmus recognition based on a deep neural network includes a memory for storing executable instructions and a processor for executing the executable instructions stored in the memory, thereby implementing the fine-grained borborygmus recognition method based on the deep neural network.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention can automatically learn the characteristics of the audio signal. Compared with other existing methods, the neural network module provided by the invention automatically learns how to extract optimal, highly abstract audio signal features from a large amount of abdominal auscultation data, and thus achieves a better recognition effect.
(2) After training, the neural network model provided by the invention can directly process the original audio stream to obtain a refined borborygmus recognition result. The neural network models proposed by other methods require low-level audio features, such as frequency-domain histograms or Mel-frequency cepstral coefficients, to be extracted manually from the original audio signal before the model can compute a recognition result, and that result is only coarse-grained.
(3) The deep neural network provided by the invention is not based on the traditional CNN structure but on the Transformer structure, which is designed specifically for sequence data. The proposed network structure not only extracts the local response characteristics of the borborygmus signal but also effectively extracts long-range dependency information between different sound signals, so the context of the sequence signal is modeled better.
(4) The current mainstream technology only achieves coarse-grained recognition of borborygmus events and cannot accurately locate the specific moment at which an event occurs, which makes large-scale real-world application difficult. The neural network model provided by the invention achieves fine-grained modeling of the abdominal auscultation signal through the Transformer structure, and thereby refined recognition of borborygmus events, which greatly helps clinical borborygmus auscultation.
Drawings
For a clearer description of the embodiments of the invention or of the prior art, the drawings needed for that description are briefly introduced below. The drawings described below are only some embodiments of the invention; for a person skilled in the art, further drawings may be obtained from them without inventive effort.
FIG. 1 is a flow chart of the present invention.
Fig. 2 is an overall structure of the neural network model.
Fig. 3 is a specific structure of a frame embedding module.
Fig. 4 is a specific structure of the position coding module.
Fig. 5 is a specific structure of the Transformer encoder module.
Fig. 6 is a specific structure of the linear classifier.
Description of the embodiments
In order to make the objects, technical solutions and advantages of the present invention more apparent, the invention is described in further detail below with reference to figs. 1 to 6. The described embodiments should not be construed as limiting the invention; all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of protection of the invention.
Aiming at the defects of the prior art, the invention provides a fine-grained borborygmus recognition method based on a deep neural network. It is based on the Transformer structure, requires no manual feature extraction, and achieves fine-grained, accurate recognition of borborygmus events through end-to-end training alone.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a borborygmus fine granularity identification method based on a deep neural network comprises the following steps:
(1) Data collection and labeling: collect desensitized abdominal auscultation recordings and mark the starting and ending positions of each borborygmus event in detail.
(2) Create a deep neural network based on the Transformer structure and train the neural network model using the collected abdominal auscultation recordings.
(3) Load the model to obtain the corresponding bowel sound recognition result from the input abdominal sound signal.
The overall flow is shown in FIG. 1:
Further, in step (1), the abdominal auscultation recording data for training the deep neural network model are collected and labeled by the following steps:
(11) Record and collect abdominal auscultation audio streams, divide the auscultation recordings into audio clips of equal length, and store them in wav format to facilitate subsequent training.
(12) Screen out the audio clips containing bowel sound events and those containing other sounds (Gaussian noise, heart beat sounds, breathing sounds, talking sounds, and other noise such as stethoscope friction sounds) to form the data sets.
(13) Mark in detail the beginning and ending times of the borborygmus event in each sound clip, with a labeling accuracy of 10 milliseconds; other, non-borborygmus noise is not marked.
(14) Divide the labeled data set into a training set, a verification set and a test set. A sketch of these preprocessing steps follows.
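For illustration, steps (11) and (14) could look as follows in Python. This is a minimal sketch, not the patent's own code: the 60-second clip length, the file naming and the 8:1:1 split ratio are assumptions, and the soundfile package is used for wav I/O.

import random
import soundfile as sf

def split_recording(wav_path, clip_seconds=60, out_prefix="clip"):
    # step (11): split one auscultation recording into equal-length wav clips
    audio, sr = sf.read(wav_path)
    samples_per_clip = int(clip_seconds * sr)
    n_clips = len(audio) // samples_per_clip
    for i in range(n_clips):
        piece = audio[i * samples_per_clip:(i + 1) * samples_per_clip]
        sf.write(f"{out_prefix}_{i:04d}.wav", piece, sr)
    return n_clips

def split_dataset(clip_paths, ratios=(0.8, 0.1, 0.1), seed=0):
    # step (14): random split into training, verification and test sets
    clips = list(clip_paths)
    random.Random(seed).shuffle(clips)
    n_train = int(ratios[0] * len(clips))
    n_val = int(ratios[1] * len(clips))
    return clips[:n_train], clips[n_train:n_train + n_val], clips[n_train + n_val:]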
Further, in the step (2), a deep neural network based on a transducer structure is created, and the collected belly auscultation recording is used for training a neural network model, which specifically comprises the following steps:
(21) First, in order to achieve fine-grained division of borborygmus events, the input audio data are divided into a number of equal-sized "audio frames" in 10-millisecond steps, the same fineness as that of the data labeling, as sketched below.
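A sketch of this framing step, assuming a mono signal array audio with sampling rate sr and non-overlapping frames (the names are illustrative):

import numpy as np

def frame_audio(audio, sr, frame_ms=10):
    # cut a 1-D signal into consecutive 10 ms frames, matching the labeling fineness
    hop = int(sr * frame_ms / 1000)            # samples per 10 ms frame
    n_frames = len(audio) // hop
    return np.asarray(audio[:n_frames * hop]).reshape(n_frames, hop)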
(22) Subsequently, the deep neural network model is constructed: a fine-grained borborygmus recognition model is built from a frame embedding module, a position encoding module, a plurality of stacked Transformer encoders and a linear classifier, as shown in fig. 2. The specific description of each module is as follows:
(i) The frame embedding module is formed by a 1-dimensional convolution layer with multiple output channels. As shown in fig. 3, the module takes a single audio frame as input, outputs several feature vectors for that frame after the convolution calculation, and splices the calculated vectors into one one-dimensional feature vector. Specifically, assume the input data is $X \in \mathbb{R}^{N \times C \times L}$, where $N$, $C$ and $L$ respectively denote the number of audio frames, the number of channels per audio frame and the number of features per frame. The audio features output by the convolution layer, $F$, can be expressed as:

$$F = \mathrm{Conv1D}(X)$$

where $\mathrm{Conv1D}$ denotes a one-dimensional multichannel convolution operation, $F \in \mathbb{R}^{N \times K \times D}$, and $K$ and $D$ respectively denote the number of convolution operators in the convolution layer and the feature number of each vector obtained after convolution. The $K$ feature vectors of each audio frame are then spliced into a single one-dimensional vector, giving the final frame embedding result $E \in \mathbb{R}^{N \times (K \cdot D)}$. In this module, the 1-dimensional convolution layer can be regarded as a set of learnable filters, so the module replaces the feature extraction step of the prior art and automatically learns to extract shallow features of an audio frame during end-to-end training.
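A minimal PyTorch sketch of such a frame embedding module is given below. The filter count K, kernel size and stride are illustrative assumptions, since the concrete values of the invention appear only in its figures.

import torch
import torch.nn as nn

class FrameEmbedding(nn.Module):
    # a single 1-D multichannel convolution layer acting as K learnable filters
    def __init__(self, in_channels=1, k_filters=8, kernel_size=16, stride=8):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, k_filters, kernel_size, stride=stride)

    def forward(self, x):                 # x: (n_frames, channels, frame_length)
        f = self.conv(x)                  # f: (n_frames, K, D)
        return f.flatten(start_dim=1)     # splice the K vectors: (n_frames, K*D)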
(ii) The position encoding module is responsible for automatically encoding the position information of the audio frame sequence and adding the encoded position information to the original input sequence through an addition operation. As shown in fig. 4, the position encoding module is composed of a set of learnable parameters whose dimensions are the same as those of the input audio frame sequence, so that the optimal position encoding is learned automatically during end-to-end training. Specifically, assume the input data is $E \in \mathbb{R}^{N \times D_E}$, where $D_E$ denotes the feature number of the input data. Its corresponding position code can be expressed as $P \in \mathbb{R}^{N \times D_E}$, and the output data after position encoding is:

$$Z = E + P$$
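A sketch of this learnable position encoding (the shapes are assumptions; torch and nn are imported as in the previous sketch):

class LearnablePositionalEncoding(nn.Module):
    # one learnable parameter per (frame, feature) position, trained end to end
    def __init__(self, n_frames, dim):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(n_frames, dim))

    def forward(self, e):                 # e: (batch, n_frames, dim)
        return e + self.pos               # Z = E + P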
(iii) As shown in fig. 5, the Transformer encoder is composed of a layer normalization module, a multi-head self-attention module, a residual module and a multi-layer perceptron module. Meanwhile, in order to further extract high-level features of borborygmus events and improve the accuracy of borborygmus recognition, the invention stacks 3 Transformer encoders. The sub-modules are described in detail below:
First, the input data are passed to a layer normalization module, which normalizes the input data so that the normalized data follow a normal distribution with a mean of 0 and a standard deviation of 1.
The normalized data are then fed into a multi-head self-attention module, which is made up of a number of parallel self-attention modules. The self-attention module imitates the attention behavior of a person interacting with the outside world: during its operation a large amount of irrelevant interference information is removed from the sequence, which yields a better recognition effect and higher training efficiency. Compared with the convolutional neural network models used extensively in the prior art, the self-attention module is a neural network module designed specifically for sequence data; it extracts features from sequence data better, especially long-range dependency features, so the invention achieves better recognition results than the prior art.
Specifically, for input data $Z$, the self-attention module uses linear transformation matrices $W_Q$, $W_K$ and $W_V$ to map the input $Z$ to the corresponding query vectors $Q$, keyword (key) vectors $K$ and value vectors $V$, where $d_q$, $d_k$ and $d_v$ respectively denote the feature numbers of the query, key and value vectors; in the present invention, $d_q = d_k = d_v = 64$. The self-attention module then calculates the similarity between the query vectors $Q$ and the key vectors $K$ to obtain the relevance between different audio frames, and finally performs a weighted sum over all value vectors $V$ according to the relevance scores to obtain a new feature representation. The similarity calculation used in the present invention is the scaled vector dot-product method based on the softmax function, with the following formula:

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{D_K}}\right) V$$

where $\mathrm{head}_i$ denotes the feature vector calculated by the $i$-th self-attention module; $Q$, $K$ and $V$ represent the query, key and value vectors respectively; $D_K$ represents the dimension of the key vectors; and softmax represents the normalized exponential function. The multi-head self-attention module then splices the audio features calculated by the parallel self-attention modules along the feature direction to obtain a new feature vector $H$, and maps $H$ through a linear transformation matrix $W_O$ to the final output feature vector $Z'$, i.e.:

$$Z' = H\,W_O$$
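A single self-attention head implementing the formula above could be sketched as follows ($d_q = d_k = d_v = 64$ as stated; the input width dim is an assumption):

class SelfAttentionHead(nn.Module):
    # W_Q, W_K, W_V map the input to Q, K, V; output = softmax(QK^T / sqrt(D_K)) V
    def __init__(self, dim, d_head=64):
        super().__init__()
        self.w_q = nn.Linear(dim, d_head, bias=False)
        self.w_k = nn.Linear(dim, d_head, bias=False)
        self.w_v = nn.Linear(dim, d_head, bias=False)

    def forward(self, z):                 # z: (batch, n_frames, dim)
        q, k, v = self.w_q(z), self.w_k(z), self.w_v(z)
        scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
        return torch.softmax(scores, dim=-1) @ v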
Then, in order to further extract abstract features of the borborygmus event, the invention applies layer normalization once more to the features calculated by the multi-head self-attention module and passes the normalized data to the multi-layer perceptron module. The multi-layer perceptron module in the invention is composed of two layers of linear neurons connected in the middle by a GeLU nonlinear activation function. As shown in the figure, the specific structure of the module can be expressed as:

$$\mathrm{MLP}(Z') = \mathrm{GeLU}\big(\mathrm{LayerNorm}(Z')\,W_0 + b_0\big)\,W_1 + b_1$$

where $Z'$ is the abstract feature calculated by the multi-head self-attention module, serving as the input vector of the multi-layer perceptron; LayerNorm represents the layer normalization module; GeLU represents the nonlinear activation function; $W_0$ and $W_1$ represent the linear mapping matrices in the first- and second-layer linear neurons, respectively; and $b_0$ and $b_1$ represent the bias terms of the first- and second-layer linear neurons, respectively.
Finally, in order to cope with the vanishing-gradient problem of deep neural networks during training, residual modules are introduced on both the multi-head self-attention module and the multi-layer perceptron module.
To summarize, assume the input of the $\ell$-th Transformer encoder is $Z^{(\ell-1)}$; the module can then be represented as:

$$\hat{Z} = \mathrm{LayerNorm}\big(Z^{(\ell-1)}\big)$$

$$A = \mathrm{softmax}\!\left(\frac{\big(\hat{Z}W_Q\big)\big(\hat{Z}W_K\big)^T}{\sqrt{D_K}}\right)\big(\hat{Z}W_V\big)\,W_O + Z^{(\ell-1)}$$

$$\hat{A} = \mathrm{LayerNorm}(A)$$

$$Z^{(\ell)} = \mathrm{GeLU}\big(\hat{A}\,W_0 + b_0\big)\,W_1 + b_1 + A$$

where LayerNorm represents the layer normalization module and $Z^{(\ell-1)}$ is the input vector of the first layer normalization module; $W_Q$, $W_K$ and $W_V$ represent the linear mapping matrices in the multi-head self-attention mechanism, responsible for linearly transforming the input vector into the corresponding query, key and value vectors; $D_K$ represents the dimension of the key vectors; softmax represents the normalized exponential function; $W_O$ is another linear mapping matrix in the multi-head self-attention module, responsible for mapping the feature vector calculated by the self-attention mechanism into the output domain; GeLU represents the nonlinear activation function; $W_0$ and $W_1$ represent the linear mapping matrices in the first- and second-layer linear neurons of the multi-layer perceptron, respectively, and $b_0$ and $b_1$ the corresponding bias terms; $Z^{(\ell)}$ denotes the final output of the $\ell$-th Transformer encoder.
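These four formulas describe a pre-norm encoder block; a compact PyTorch sketch follows (the head count and the hidden width of the perceptron are assumptions, as the text does not state them):

class TransformerEncoderBlock(nn.Module):
    # LayerNorm -> multi-head self-attention -> residual; LayerNorm -> MLP -> residual
    def __init__(self, dim, n_heads=4, mlp_dim=None):
        super().__init__()
        mlp_dim = mlp_dim or 4 * dim
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim),      # W_0, b_0
            nn.GELU(),
            nn.Linear(mlp_dim, dim),      # W_1, b_1
        )

    def forward(self, z):
        h = self.ln1(z)
        a = self.msa(h, h, h, need_weights=False)[0] + z   # first residual connection
        return self.mlp(self.ln2(a)) + a                   # second residual connection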
(iv) Finally, as shown in fig. 6, the linear classifier module is composed of one layer of linear neurons and is responsible for classifying the audio features extracted by the Transformer encoders, dividing each audio frame into two classes: borborygmus events and non-borborygmus events.
As can be seen from the above, the Transformer encoder used in the invention can effectively calculate the correlation between each audio frame and all other frames and thus extract high-level features for every audio frame; the deep neural network provided by the invention can therefore recognize each audio frame fully and carefully, achieving fine-grained segmentation of borborygmus events and precisely locating the boundary of each event.
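Assembling the sketches above into the whole network (the sizes are illustrative, and it is assumed that the frame-embedding output width K*D equals the encoder width embed_dim):

class BowelSoundTransformer(nn.Module):
    # frame embedding -> position encoding -> 3 stacked encoders -> per-frame classifier
    def __init__(self, n_frames, embed_dim, depth=3, n_classes=2):
        super().__init__()
        self.embed = FrameEmbedding()
        self.pos = LearnablePositionalEncoding(n_frames, embed_dim)
        self.encoders = nn.Sequential(*(TransformerEncoderBlock(embed_dim) for _ in range(depth)))
        self.classifier = nn.Linear(embed_dim, n_classes)  # borborygmus vs. non-borborygmus

    def forward(self, frames):            # frames: (batch, n_frames, channels, frame_length)
        b, n = frames.shape[:2]
        e = self.embed(frames.flatten(0, 1)).view(b, n, -1)
        z = self.encoders(self.pos(e))
        return self.classifier(z)         # (batch, n_frames, 2) per-frame logits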
(23) The deep neural network model constructed in the previous step is then trained with the acquired abdominal auscultation data. The invention trains with stochastic gradient descent (SGD) and uses the cross-entropy function as the training objective. Specifically, let all parameters of the constructed deep neural network model be $\theta$; at the beginning, all parameters of the model are randomly initialized. The forward computation stage then begins: 512 abdominal auscultation fragments are randomly selected from the training data set without replacement, the neural network model being trained predicts these 512 samples, and the objective function computes the loss between the model predictions and the correct labels. The gradient $\nabla_\theta$ of the model parameters $\theta$ is then calculated by back-propagation, and the parameters are updated by the following formula:

$$\theta_{new} = \theta_{old} - \eta \nabla_\theta$$

where $\eta$ denotes the learning rate, $\theta_{old}$ the parameters before the update and $\theta_{new}$ the updated parameters. This completes one training iteration. Iterations are repeated until all data in the training set have been drawn. At that point all data in the verification set are taken, the partially trained model predicts them, and indexes such as the prediction loss and prediction accuracy are calculated to evaluate the quality of the model. If the model does not meet the requirement, the training steps are repeated until its performance on the verification set reaches the expected value, after which the model parameters are saved. In the invention, the learning rate used during training is 0.1, the momentum is 0.9 and the weight decay rate is 0.02.
(24) The trained neural network model is used to accurately recognize borborygmus events in the clinic; related parameters such as the number and frequency of occurrences of borborygmus events are then calculated from the recognition result, assisting doctors in diagnosing diseases.
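The patent does not spell out its non-maximum suppression step. A simple stand-in, which thresholds the per-frame probabilities and merges detections closer than a minimum gap into one event, could look like this (the threshold and gap are illustrative; with 10 ms frames, min_gap=5 merges detections less than 50 ms apart):

def count_events(frame_probs, threshold=0.5, min_gap=5):
    # merge above-threshold frames closer than min_gap frames into one event, then count
    events, last_hit = 0, None
    for i, p in enumerate(frame_probs):
        if p >= threshold:
            if last_hit is None or i - last_hit > min_gap:
                events += 1               # a new borborygmus event starts here
            last_hit = i
    return events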
A device for fine-grained borborygmus recognition based on a deep neural network comprises a memory for storing executable instructions and a processor for executing the executable instructions stored in the memory, implementing the fine-grained borborygmus recognition method based on the deep neural network.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may also be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the flowcharts and block diagrams in the figures show the architecture, functionality and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the invention. Each block in a flowchart or block diagram may represent a module, segment or portion of code comprising one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in a block may occur out of the order noted in the figures; for example, two blocks shown in succession may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the functionality involved. Each block of the block diagrams and/or flowchart illustrations, and combinations of such blocks, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present invention may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for fine-grained borborygmus recognition based on a deep neural network, characterized by comprising:
collecting and labeling abdominal auscultation recording data;
constructing a deep neural network model based on the Transformer structure, and training the deep neural network with the labeled abdominal auscultation recording data to obtain a final model;
and loading the final model and inputting the abdominal sound signal to be detected to obtain the corresponding borborygmus recognition result.
2. The method for fine-grained borborygmus recognition based on a deep neural network according to claim 1, wherein collecting and labeling the abdominal auscultation recording data comprises: collecting the abdominal auscultation audio stream and splitting the auscultation recording into audio pieces of equal length.
3. The method for fine-grained borborygmus recognition based on a deep neural network according to claim 2, wherein collecting and labeling the abdominal auscultation recording data comprises: screening from the audio clips those containing borborygmus events and those containing other sounds, including Gaussian noise, heart beat sounds, breathing sounds, speaking sounds and stethoscope friction sounds, to form separate data sets.
4. The method for fine-grained borborygmus recognition based on a deep neural network according to claim 3, wherein collecting and labeling the abdominal auscultation recording data comprises: marking in detail the starting and ending times of the borborygmus event in each sound clip of the borborygmus event data set, with labeling precision at the set millisecond level.
5. The method for fine-grained borborygmus recognition based on a deep neural network according to claim 4, wherein collecting and labeling the abdominal auscultation recording data comprises: dividing the labeled borborygmus event data set into a training set, a verification set and a test set.
6. The method for fine-grained borborygmus recognition based on a deep neural network according to any one of claims 1 to 5, wherein the deep neural network model is a fine-grained borborygmus recognition model constructed from a frame embedding module, a position encoding module, a plurality of stacked Transformer encoders and a linear classifier;
the frame embedding module is composed of a 1-dimensional convolution layer with multiple output channels;
the position coding module is responsible for automatically coding the position information of the audio frame sequence and adding the coded position information into the original input sequence through addition operation;
the Transformer encoder is composed of a layer normalization module, a multi-head self-attention module and a multi-layer perceptron module;
the linear classifier module is composed of one layer of linear neurons and is responsible for classifying the audio features extracted by the Transformer encoders, dividing each audio frame into two classes: borborygmus events and non-borborygmus events.
7. The method for fine-grained borborygmus recognition based on a deep neural network according to claim 6, wherein the multi-head self-attention module and the multi-layer perceptron module are provided with residual modules.
8. The method for fine-grained borborygmus recognition based on a deep neural network according to claim 7, wherein constructing the Transformer-based deep neural network model comprises the following steps:
S301, the position encoding module adds position information; the data then enter a layer normalization module, which performs a normalization operation on the training set data so that the normalized data follow a normal distribution with mean 0 and standard deviation 1;
S302, the normalized data from step S301 enter a multi-head self-attention module, which outputs sequence features; the multi-head attention module is composed of a plurality of parallel self-attention modules;
S303, the sequence features output in step S302 are layer-normalized again and enter a multi-layer perceptron module, which outputs classification features; the multi-layer perceptron module consists of two layers of linear neurons connected in the middle by a nonlinear activation function; the classification features pass through a residual module, after which advanced features are output;
S304, the advanced features enter a linear classifier module and are classified into borborygmus events and non-borborygmus events.
9. The method for fine-grained borborygmus recognition based on a deep neural network according to claim 1, wherein obtaining the corresponding borborygmus recognition result from the input abdominal sound signal comprises: accurately recognizing borborygmus events in the clinic using the trained deep neural network model, and calculating the occurrence frequency of the borborygmus events from the model output using a non-maximum suppression algorithm.
10. A device for fine-grained borborygmus recognition based on a deep neural network, characterized by comprising
A memory: for storing executable instructions;
a processor: for executing executable instructions stored in said memory, implementing a deep neural network based fine-grained borborygmus recognition method according to any of claims 1-9.
CN202310335591.7A 2023-03-31 2023-03-31 Fine granularity borborygmus recognition method and device based on deep neural network Active CN116052725B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310335591.7A CN116052725B (en) 2023-03-31 2023-03-31 Fine granularity borborygmus recognition method and device based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310335591.7A CN116052725B (en) 2023-03-31 2023-03-31 Fine granularity borborygmus recognition method and device based on deep neural network

Publications (2)

Publication Number Publication Date
CN116052725A 2023-05-02
CN116052725B (en) 2023-06-23

Family

ID=86127645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310335591.7A Active CN116052725B (en) 2023-03-31 2023-03-31 Fine granularity borborygmus recognition method and device based on deep neural network

Country Status (1)

Country Link
CN (1) CN116052725B (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1952852A (en) * 2006-10-13 2007-04-25 杨益民 Voice pick device of circumstance around foetus
CN104305961A (en) * 2014-10-20 2015-01-28 清华大学 Bowel sounds monitoring and recognizing system
CN106021948A (en) * 2016-05-30 2016-10-12 清华大学 Signal processing method for borborygmus signal monitoring system
CN106328150A (en) * 2016-08-18 2017-01-11 北京易迈医疗科技有限公司 Bowel sound detection method, device and system under noisy environment
CN109620154A (en) * 2018-12-21 2019-04-16 平安科技(深圳)有限公司 Borborygmus voice recognition method and relevant apparatus based on deep learning
CN110432924A (en) * 2019-08-06 2019-11-12 杭州智团信息技术有限公司 Borborygmus sound detection device, method and electronic equipment
CN115206347A (en) * 2021-04-13 2022-10-18 浙江荷清柔性电子技术有限公司 Method and device for identifying bowel sounds, storage medium and computer equipment
CN113674734A (en) * 2021-08-24 2021-11-19 中国铁道科学研究院集团有限公司电子计算技术研究所 Information query method, system, equipment and storage medium based on voice recognition
CN114023316A (en) * 2021-11-04 2022-02-08 匀熵科技(无锡)有限公司 TCN-Transformer-CTC-based end-to-end Chinese voice recognition method
CN114283791A (en) * 2021-11-30 2022-04-05 广东电力信息科技有限公司 Speech recognition method based on high-dimensional acoustic features and model training method
CN114305484A (en) * 2021-12-15 2022-04-12 浙江大学医学院附属儿童医院 Heart disease heart sound intelligent classification method, device and medium based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ASHISH VASWANI et al.: "Attention is all you need", https://arxiv.org/abs/1706.03762, pages 1-15 *
Y. HUANG: "PhysioVec: IoT Biosignal Based Search Engine for Gastrointestinal Health", 2022 7th International Conference on Computational Intelligence and Applications (ICCIA), vol. 1, pages 230-236 *

Also Published As

Publication number Publication date
CN116052725B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
CN111161715B (en) Specific sound event retrieval and positioning method based on sequence classification
CN113806609A (en) Multi-modal emotion analysis method based on MIT and FSM
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN111986699A (en) Sound event detection method based on full convolution network
CN111341294A (en) Method for converting text into voice with specified style
KR102406512B1 (en) Method and apparatus for voice recognition
CN115393968A (en) Audio-visual event positioning method fusing self-supervision multi-mode features
Kohlsdorf et al. An auto encoder for audio dolphin communication
Bu et al. A Monte Carlo search-based triplet sampling method for learning disentangled representation of impulsive noise on steering gear
Lu et al. Temporal Attentive Pooling for Acoustic Event Detection.
CN112735466A (en) Audio detection method and device
CN117310668A (en) Underwater sound target identification method integrating attention mechanism and depth residual error shrinkage network
CN116052725B (en) Fine granularity borborygmus recognition method and device based on deep neural network
CN112052880A (en) Underwater sound target identification method based on weight updating support vector machine
CN113488069B (en) Speech high-dimensional characteristic rapid extraction method and device based on generation type countermeasure network
CN115472182A (en) Attention feature fusion-based voice emotion recognition method and device of multi-channel self-encoder
Anindya et al. Development of Indonesian speech recognition with deep neural network for robotic command
CN112951270B (en) Voice fluency detection method and device and electronic equipment
CN113488027A (en) Hierarchical classification generated audio tracing method, storage medium and computer equipment
JP2017134321A (en) Signal processing method, signal processing device, and signal processing program
Wilkinghoff et al. TACos: Learning temporally structured embeddings for few-shot keyword spotting with dynamic time warping
Sharma et al. Comparative analysis of various feature extraction techniques for classification of speech disfluencies
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
Frost Deep learning based methods for tuberculosis cough classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant