CN113408664B - Training method, classification method, device, electronic equipment and storage medium

Training method, classification method, device, electronic equipment and storage medium

Info

Publication number
CN113408664B
CN113408664B
Authority
CN
China
Prior art keywords
audio
classification
sample
sample audio
module
Prior art date
Legal status
Active
Application number
CN202110821915.9A
Other languages
Chinese (zh)
Other versions
CN113408664A (en)
Inventor
张沁怡
马彩虹
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110821915.9A priority Critical patent/CN113408664B/en
Publication of CN113408664A publication Critical patent/CN113408664A/en
Application granted granted Critical
Publication of CN113408664B publication Critical patent/CN113408664B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a training method, a classification method, an apparatus, an electronic device and a storage medium, relating to the field of artificial intelligence and in particular to the field of deep learning. The specific implementation scheme is as follows: the abnormal audio classification model comprises a feature extraction module, a time sequence feature learning module and a multi-classification module, and the method comprises the following steps: processing each sample audio in the training sample set by using the feature extraction module to obtain depth feature data corresponding to each sample audio; processing the depth feature data corresponding to each sample audio by using the time sequence feature learning module to obtain a time sequence feature sequence corresponding to each sample audio; processing the time sequence feature sequence corresponding to each sample audio by using the multi-classification module to obtain a multi-classification result corresponding to each sample audio; and jointly training the feature extraction module, the time sequence feature learning module and the multi-classification module according to the multi-classification result and the multi-classification label corresponding to each sample audio to obtain the abnormal audio classification model.

Description

Training method, classification method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, more particularly to the field of deep learning, and specifically to a training method, a classification method, an apparatus, an electronic device and a storage medium.
Background
With the development of internet technology, information can spread over a network. Such information comes in many types and may include abnormal audio. Since the spread of abnormal audio can have an adverse effect on listeners, audio needs to be audited to keep abnormal audio out of the network as much as possible. For example, abnormal audio may include suggestive panting audio or moaning audio. Auditing audio requires classifying the audio.
Disclosure of Invention
The disclosure provides a training method, a classification method, a device, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a training method of an abnormal audio classification model, the abnormal audio classification model including a feature extraction module, a time sequence feature learning module, and a multi-classification module, the method including: processing each sample audio in the training sample set by using the feature extraction module to obtain depth feature data corresponding to each sample audio; processing the depth feature data corresponding to each sample audio by using the time sequence feature learning module to obtain a time sequence feature sequence corresponding to each sample audio; processing the time sequence feature sequence corresponding to each sample audio by using the multi-classification module to obtain a multi-classification result corresponding to each sample audio; and jointly training the feature extraction module, the time sequence feature learning module and the multi-classification module according to the multi-classification result and the multi-classification label corresponding to each sample audio to obtain the abnormal audio classification model.
According to another aspect of the present disclosure, there is provided an abnormal audio classification method including: acquiring target audio; and inputting the target audio into the abnormal audio classification model to obtain a multi-classification result corresponding to the target audio, wherein the abnormal audio classification model is trained by the method.
According to another aspect of the present disclosure, there is provided a training apparatus for an abnormal audio classification model, the abnormal audio classification model including a feature extraction module, a time sequence feature learning module, and a multi-classification module, the apparatus including: a first processing module, configured to process each sample audio in the training sample set by using the feature extraction module to obtain depth feature data corresponding to each sample audio; a second processing module, configured to process the depth feature data corresponding to each sample audio by using the time sequence feature learning module to obtain a time sequence feature sequence corresponding to each sample audio; a third processing module, configured to process the time sequence feature sequence corresponding to each sample audio by using the multi-classification module to obtain a multi-classification result corresponding to each sample audio; and a training module, configured to jointly train the feature extraction module, the time sequence feature learning module and the multi-classification module according to the multi-classification result and the multi-classification label corresponding to each sample audio to obtain the abnormal audio classification model.
According to another aspect of the embodiments of the present disclosure, there is provided an abnormal audio classification apparatus including: a first obtaining module, configured to obtain target audio; and a second obtaining module, configured to input the target audio into the abnormal audio classification model to obtain a multi-classification result corresponding to the target audio, wherein the abnormal audio classification model is obtained by training using the apparatus described above.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method as described above.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates an exemplary system architecture of a training method, an abnormal audio classification method, and an apparatus to which an abnormal audio classification model may be applied, according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a training method of an abnormal audio classification model according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a flow chart of inputting a time series feature sequence corresponding to each sample audio into a plurality of classification units, respectively, resulting in classification results corresponding to each sample audio, according to an embodiment of the disclosure;
fig. 4 schematically illustrates a schematic diagram of inputting a time-series feature sequence corresponding to sample audio into a plurality of classification units, respectively, resulting in classification results corresponding to the sample audio according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a schematic diagram of a training process of an abnormal audio classification model according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow diagram for deriving a training sample set using a sample equalization strategy, in accordance with an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow chart for deriving a training sample set using a sample equalization strategy according to another embodiment of the present disclosure;
FIG. 8 schematically illustrates a flow chart of an abnormal audio classification method according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates a block diagram of a training apparatus of an abnormal audio classification model according to an embodiment of the present disclosure;
FIG. 10 schematically illustrates a block diagram of an abnormal audio classification apparatus according to an embodiment of the present disclosure; and
fig. 11 illustrates a block diagram of an electronic device suitable for use in a training method of an abnormal audio classification model or an abnormal audio classification method in accordance with an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
One way to obtain an abnormal audio classification model is to split the training process into several mutually independent operations, such as feature extraction and classifier design. That is, a feature extraction model is first used to extract features from the audio to obtain feature data, and the feature data is then used to train a classification model that predicts whether the audio is abnormal audio. The training processes of the feature extraction model and the classification model are independent of each other; that is, during training of the abnormal audio classification model (i.e., the classification model), the feature extraction model can be understood as having been trained in advance.
In the course of realizing the concept of the present disclosure, it was found that this approach suffers at least from low prediction accuracy. Further research found that this is mainly caused by two reasons.
First, the feature data extracted by the feature extraction model has difficulty reflecting the features of abnormal audio. During training of the classification model, the feature extraction model has already been trained in advance, so its model parameters are fixed. As a result, the extracted feature data has difficulty reflecting features related to abnormal audio, and the abnormal audio classification model trained on such feature data has low prediction accuracy. This can also be understood as the difficulty of reaching a globally optimal solution: the training scheme described above decomposes one problem into several independent sub-problems. Although an optimal solution can be sought for each sub-problem, each such solution is only a local optimum. Because the sub-problems are solved independently and information is not used in a unified way, the result assembled from the local optima is unlikely to be the globally optimal solution; in other words, it is difficult to establish that the result obtained from the local optima is globally optimal.
Second, the classification model may misclassify normal audio as abnormal audio. Audio data is diverse, and real audio contains many sounds that resemble abnormal audio. A model that only distinguishes abnormal audio from normal audio may therefore classify normal audio as abnormal, which lowers the prediction accuracy of the abnormal audio classification model and makes it hard to meet the classification requirements for abnormal audio. For example, when the abnormal audio is panting audio, real audio may include similar-sounding audio such as wheezing, borderline sounds, or sighing.
Therefore, in order for the feature data to reflect the features of abnormal audio more accurately, that is, in order to approach a globally optimal solution, an end-to-end training scheme can be adopted: a deep network model directly learns the mapping between the training sample set fed to its input and the prediction result produced at its output, and the model parameters of every layer of the deep network model are adjusted according to the output value of the loss function during training. In addition, to reduce the possibility of misclassification and meet the classification requirements for abnormal audio, a multi-class classification scheme can be used.
Based on the foregoing, embodiments of the present disclosure propose an end-to-end solution for feature extraction and multi-classification. That is, embodiments of the present disclosure provide a training method for an abnormal audio classification model, an abnormal audio classification method, an apparatus, an electronic device, a non-transitory computer-readable storage medium storing computer instructions, and a computer program product. The abnormal audio classification model comprises a feature extraction module, a time sequence feature learning module and a multi-classification module, and the training method of the abnormal audio classification model comprises the following steps: processing each sample audio in the training sample set by utilizing a feature extraction module to obtain depth feature data corresponding to each sample audio, processing the depth feature data corresponding to each sample audio by utilizing a time sequence feature learning module to obtain a time sequence feature sequence corresponding to each sample audio, processing the time sequence feature sequence corresponding to each sample audio by utilizing a multi-classification module to obtain a multi-classification result corresponding to each sample audio, and carrying out joint training on the feature extraction module, the time sequence feature learning module and the multi-classification module according to the multi-classification result and the multi-classification label corresponding to each sample audio to obtain an abnormal audio classification model.
Fig. 1 schematically illustrates an exemplary system architecture of a training method, an abnormal audio classification method and apparatus to which an abnormal audio classification model may be applied according to an embodiment of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios. For example, in another embodiment, an exemplary system architecture to which the training method, the abnormal audio classification method, and the apparatus of the abnormal audio classification model may be applied may include a terminal device, but the terminal device may implement the training method, the abnormal audio classification method, and the apparatus of the abnormal audio classification model provided by the embodiments of the present disclosure without interacting with a server.
As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as a knowledge reading class application, a web browser application, a search class application, an instant messaging tool, a mailbox client and/or social platform software, etc. (as examples only).
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for content browsed by the user using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
The server 105 may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that remedies the drawbacks of traditional physical hosts and Virtual Private Server (VPS) services, namely high management difficulty and weak service scalability. The server 105 may also be a server of a distributed system or a server incorporating a blockchain.
It should be noted that, the training method and the abnormal audio classification method of the abnormal audio classification model provided by the embodiments of the present disclosure may be generally performed by the terminal device 101, 102, or 103. Accordingly, the training apparatus and the abnormal audio classification apparatus for an abnormal audio classification model provided by the embodiments of the present disclosure may also be provided in the terminal device 101, 102, or 103.
Alternatively, the training method and the abnormal audio classification method of the abnormal audio classification model provided by the embodiments of the present disclosure may also generally be performed by the server 105. Accordingly, the training apparatus and the abnormal audio classification apparatus provided by the embodiments of the present disclosure may generally be provided in the server 105. The training method and the abnormal audio classification method may also be performed by a server or server cluster that is different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the training apparatus and the abnormal audio classification apparatus may also be provided in such a server or server cluster.
For example, the server 105 processes each sample audio in the training sample set by using the feature extraction module to obtain depth feature data corresponding to each sample audio, processes the depth feature data corresponding to each sample audio by using the time sequence feature learning module to obtain a time sequence feature sequence corresponding to each sample audio, processes the time sequence feature sequence corresponding to each sample audio by using the multi-classification module to obtain a multi-classification result corresponding to each sample audio, and performs joint training on the feature extraction module, the time sequence feature learning module and the multi-classification module according to the multi-classification result and the multi-classification label corresponding to each sample audio to obtain an abnormal audio classification model. Or the feature extraction module, the time sequence feature learning module and the multi-classification module are jointly trained by a server or a server cluster capable of communicating with the terminal equipment 101, 102, 103 and/or the server 105 to obtain an abnormal audio classification model.
The server 105 acquires the target audio and inputs the target audio into the abnormal audio classification model to obtain a classification result corresponding to the target audio. Or the server cluster capable of communicating with the terminal equipment 101, 102, 103 and/or the server 105 acquires the target audio, and inputs the target audio into the abnormal audio classification model to obtain a classification result corresponding to the target audio.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
According to an embodiment of the present disclosure, a training method of an abnormal audio classification model is provided. The abnormal audio classification model may include a feature extraction module, a time series feature learning module, and a multi-classification module.
Fig. 2 schematically illustrates a flowchart of a training method of an abnormal audio classification model according to an embodiment of the present disclosure.
As shown in fig. 2, the method 200 includes operations S210 to S240.
In operation S210, each sample audio in the training sample set is processed by the feature extraction module to obtain depth feature data corresponding to each sample audio.
In operation S220, the depth feature data corresponding to each sample audio is processed by the time sequence feature learning module to obtain a time sequence feature sequence corresponding to each sample audio.
In operation S230, the time-series feature sequence corresponding to each sample audio is processed by the multi-classification module to obtain a multi-classification result corresponding to each sample audio.
In operation S240, the feature extraction module, the time sequence feature learning module, and the multi-classification module are jointly trained according to the multi-classification result and the multi-classification label corresponding to each sample audio, to obtain an abnormal audio classification model.
According to embodiments of the present disclosure, the training sample set may include a plurality of sample audio. An original training sample set can be obtained, and the original training sample set is processed to obtain the training sample set. The original training sample set may include a plurality of original sample audio. Processing the original training sample set to obtain a training sample set may include: and extracting the acoustic characteristics of each original sample audio to obtain the sample audio corresponding to the original sample audio. The acoustic features may include at least one of Mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC), fbank, timbre vectors, zero-crossing rate, subband energy entropy, spectral center, spectral spread, spectral entropy, spectral flux, spectral roll-off, and tone bias.
For example, the acoustic feature is the mel-frequency cepstral coefficient. Mel-frequency cepstral coefficients are a linear transformation of the log energy spectrum based on a nonlinear mel scale of sound frequency. The band division of mel-frequency cepstral coefficients is equally spaced on the mel scale, which approximates the human auditory system more closely than the linearly spaced bands used in the normal cepstrum. Accordingly, for specific audio content that appears in the frequency range recognizable by the human ear, for example panting audio, the mel-frequency cepstral coefficients of the audio can be used as its features for identification and analysis. The emphasis of audio analysis is thereby concentrated in the frequency range recognizable by the human ear, which helps improve identification accuracy.
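As an illustration of the acoustic feature extraction described above, the following is a minimal sketch that computes mel-frequency cepstral coefficients for one original sample audio. It assumes the librosa library as the front end; the file name, sample rate and number of coefficients are illustrative assumptions, not values specified by the disclosure.

    import librosa
    import numpy as np

    def extract_mfcc(path: str, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
        """Load an audio file and return its MFCC matrix of shape (n_mfcc, n_frames)."""
        waveform, sr = librosa.load(path, sr=sr)  # resample to a fixed rate
        return librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc)

    # Hypothetical usage: one original sample audio becomes its acoustic features.
    features = extract_mfcc("sample_audio.wav")
    print(features.shape)  # (13, n_frames)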
According to an embodiment of the present disclosure, the feature extraction module may be used to perform depth feature extraction and may include a convolutional neural network model. The convolutional neural network model may be configured according to actual business requirements and is not limited herein. For example, the convolutional neural network model includes a VGGish model. The VGGish model may be a VGG (Visual Geometry Group)-style model pre-trained on the AudioSet dataset. The VGGish model can extract semantically meaningful 128-dimensional feature vectors from audio.
According to an embodiment of the present disclosure, processing each sample audio with the feature extraction module may include: slicing each sample audio with the feature extraction module to obtain a plurality of sliced sample audios, performing depth feature extraction on each of the plurality of sliced sample audios to obtain depth feature data corresponding to each sliced sample audio, and combining the depth feature data corresponding to the plurality of sliced sample audios to obtain the depth feature data corresponding to the sample audio.
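The slice-extract-combine procedure above can be sketched as follows, with a small convolutional network standing in for the pre-trained VGGish model (which, as noted above, outputs a 128-dimensional vector per segment). PyTorch is used for illustration, and the segment length and spectrogram shape are assumptions.

    import torch
    import torch.nn as nn

    class ToyExtractor(nn.Module):
        """Stand-in for VGGish: maps one spectrogram patch to a 128-dim embedding."""
        def __init__(self):
            super().__init__()
            self.conv = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(4))
            self.fc = nn.Linear(16 * 4 * 4, 128)

        def forward(self, patch):  # patch: (n_segments, 1, n_mels, seg_frames)
            return self.fc(self.conv(patch).flatten(1))  # (n_segments, 128)

    def depth_features(sample: torch.Tensor, seg: int = 96) -> torch.Tensor:
        """Slice a (1, n_mels, n_frames) spectrogram into fixed-length segments,
        embed each slice, and combine the embeddings in time order."""
        slices = sample.unfold(-1, seg, seg).permute(2, 0, 1, 3)
        return ToyExtractor()(slices)  # one 128-dim row per slice

    feats = depth_features(torch.randn(1, 64, 96 * 5))
    print(feats.shape)  # torch.Size([5, 128])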
According to the embodiment of the disclosure, sequence data refers to data collected at different points along a time line, reflecting how the state or degree of some object or phenomenon changes over time; its defining characteristic is that data at earlier and later time points are related. Audio is one kind of sequence data, since there is always some relation between its earlier and later segments. The time sequence feature learning module may be used to handle problems related to sequence data. The time sequence feature learning module may include a recurrent neural network (RNN) model. The recurrent neural network model may be a unidirectional or a bidirectional recurrent neural network model. The unidirectional recurrent neural network model may include a Long Short-Term Memory (LSTM) model, and the bidirectional recurrent neural network model may include a Bi-directional Long Short-Term Memory (Bi-LSTM) model. A unidirectional recurrent neural network model memorizes past information but cannot use future information. To better relate earlier and later segments, future information also needs to be used; a bidirectional recurrent neural network model solves this problem by capturing information from both directions. The Bi-LSTM model can be understood as combining a bidirectional recurrent neural network model with the LSTM model.
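A minimal sketch of the time sequence feature learning module as a Bi-LSTM, matching the description above; the input dimension follows the 128-dimensional depth features, while the hidden size is an assumption.

    import torch
    import torch.nn as nn

    class TemporalModule(nn.Module):
        """Bi-LSTM: each output step carries both past and future context."""
        def __init__(self, in_dim: int = 128, hidden: int = 64):
            super().__init__()
            self.lstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)

        def forward(self, x):      # x: (batch, n_segments, in_dim)
            out, _ = self.lstm(x)  # out: (batch, n_segments, 2 * hidden)
            return out             # the time sequence feature sequence

    seq = TemporalModule()(torch.randn(2, 5, 128))
    print(seq.shape)  # torch.Size([2, 5, 128])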
According to embodiments of the present disclosure, a multi-classification module may be used to implement multi-classification of abnormal audio. The multiple categories may include at least three categories. The multi-classification label corresponding to each sample audio may refer to the actual multi-classification result of the sample audio. The multi-classification result corresponding to each sample audio may refer to a predicted multi-classification result of the sample audio.
For example, an abnormal audio classification model is used to classify panting audio. The multiple classifications may include four categories. In the first category, the sample audio is classified as normal audio or abnormal audio. In the second category, the sample audio is classified as breath-class audio or other audio within the abnormal audio. In the third category, the sample audio is classified as pure breath-class audio or other audio within the breath-class audio. In the fourth category, the sample audio is classified, within the pure breath-class audio, as normal audio (i.e., non-panting audio) or abnormal audio (i.e., panting audio).

The multi-classification labels may include a label characterizing whether the sample audio is normal or abnormal audio, a label characterizing whether the sample audio is breath-class audio or other audio within the abnormal audio, a label characterizing whether the sample audio is pure breath-class audio or other audio within the breath-class audio, and a label characterizing whether the sample audio is normal or abnormal audio within the pure breath-class audio.

The multi-classification results may include the corresponding four classification results: whether the sample audio is normal or abnormal audio, whether it is breath-class audio or other audio within the abnormal audio, whether it is pure breath-class audio or other audio within the breath-class audio, and whether it is normal or abnormal audio within the pure breath-class audio.
According to the embodiment of the present disclosure, after the multi-classification result corresponding to each sample audio is obtained, the feature extraction module, the time series feature learning module, and the multi-classification module included in the abnormal audio classification model may be jointly trained, instead of being individually trained, using the multi-classification result and the multi-classification label corresponding to each sample audio, that is, model parameters of the feature extraction module, the time series feature learning module, and the multi-classification module may be adjusted according to the output value of the loss function.
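The joint training described above can be sketched as a single optimizer stepping over the parameters of all modules at once, so the loss gradient also updates the feature extractor. The placeholder modules, loss choice and learning rate below are assumptions for illustration only.

    import torch
    import torch.nn as nn

    # Placeholder sub-modules standing in for the feature extraction,
    # time sequence feature learning and multi-classification modules.
    extractor = nn.Linear(40, 128)
    temporal = nn.LSTM(128, 64, batch_first=True, bidirectional=True)
    heads = nn.Linear(128, 4)  # four binary logits, one per classification unit

    optimizer = torch.optim.Adam(
        list(extractor.parameters()) + list(temporal.parameters())
        + list(heads.parameters()), lr=1e-4)
    criterion = nn.BCEWithLogitsLoss()

    def train_step(batch_feats, batch_labels):
        """One joint update: the loss back-propagates through every module."""
        optimizer.zero_grad()
        depth = extractor(batch_feats)        # depth feature data
        sequence, _ = temporal(depth)         # time sequence feature sequence
        logits = heads(sequence.mean(dim=1))  # pooled multi-classification logits
        loss = criterion(logits, batch_labels)
        loss.backward()
        optimizer.step()
        return loss.item()

    train_step(torch.randn(8, 5, 40), torch.randint(0, 2, (8, 4)).float())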
According to the embodiment of the disclosure, the prediction accuracy of the model can be improved by using an attention mechanism: attention assigns high weight to important information and low weight to unimportant information, and by sharing the important information with the rest of the sequence it propagates what matters. Setting higher weights for important information therefore helps transmit it and improves the prediction accuracy of the model. That is, an attention module may be provided in the abnormal audio classification model, so that the time sequence feature sequence corresponding to each sample audio is processed by the attention module to obtain a weighted time sequence feature sequence corresponding to each sample audio. The weighted time sequence feature sequence corresponding to each sample audio is then processed by the multi-classification module to obtain the multi-classification result corresponding to each sample audio. The feature extraction module, the time sequence feature learning module, the attention module and the multi-classification module included in the abnormal audio classification model are jointly trained using the multi-classification result and the multi-classification label corresponding to each sample audio, to obtain the abnormal audio classification model.
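One possible realization of such an attention module is a simple learned score per time step, sketched below; this generic scoring scheme is an assumption, not necessarily the exact mechanism of the disclosure.

    import torch
    import torch.nn as nn

    class TemporalAttention(nn.Module):
        """Score each time step, softmax the scores, and reweight the sequence so
        high-weight (important) steps dominate."""
        def __init__(self, dim: int = 128):
            super().__init__()
            self.score = nn.Linear(dim, 1)

        def forward(self, seq):  # seq: (batch, T, dim)
            weights = torch.softmax(self.score(seq), dim=1)  # (batch, T, 1)
            return seq * weights  # weighted time sequence feature sequence

    weighted = TemporalAttention()(torch.randn(2, 5, 128))
    print(weighted.shape)  # torch.Size([2, 5, 128])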
According to the embodiment of the disclosure, each sample audio in a training sample set is processed by utilizing a feature extraction module to obtain depth feature data corresponding to each sample audio, the depth feature data corresponding to each sample audio is processed by utilizing a time sequence feature learning module to obtain a time sequence feature sequence corresponding to each sample audio, the time sequence feature sequence corresponding to each sample audio is processed by utilizing a multi-classification module to obtain a multi-classification result corresponding to each sample audio, and the feature extraction module, the time sequence feature learning module and the multi-classification module are jointly trained according to the multi-classification result and the multi-classification label corresponding to each sample audio to obtain an abnormal audio classification model. Because the feature extraction is completed in the training process of the abnormal audio classification model, the end-to-end training of feature extraction and classification is formed, and the extracted depth feature data can better reflect the features of the abnormal audio, so that the prediction accuracy of the model is improved. Through multi-classification, the possibility of classifying normal audio into abnormal audio is reduced, the prediction accuracy of the model is improved, and the classification requirement for the abnormal audio is met.
According to an embodiment of the present disclosure, the multi-classification module includes a plurality of classification units connected in parallel.
Processing the time sequence feature sequence corresponding to each sample audio by using the multi-classification module to obtain a multi-classification result corresponding to each sample audio can comprise the following operations.
And respectively inputting the time sequence characteristic sequences corresponding to each sample audio into a plurality of classification units to obtain classification results corresponding to each sample audio.
According to embodiments of the present disclosure, the multi-classification module may include at least two classification units arranged side by side. Each classification unit may be understood as a classifier. The at least two classification units may be arranged in cascade or in parallel.
According to the embodiment of the disclosure, a time sequence feature sequence corresponding to each sample audio is respectively input into a plurality of classification units, and a classification result corresponding to each classification unit is obtained. And obtaining a classification result corresponding to each sample audio according to the plurality of classification results.
According to an embodiment of the present disclosure, the training method of the abnormal audio classification model may further include the following operations.
And obtaining a training sample set by using a sample equalization strategy.
According to the embodiments of the present disclosure, it was found when training the abnormal audio classification model that the numbers of sample audios of different categories are unbalanced, that is, the number of sample audios of one or several categories is far smaller than that of the other categories; for example, the number of panting audio samples is far smaller than the number of normal audio samples. This makes it difficult for the model to learn the features of the training samples of the minority categories, thereby affecting the prediction accuracy of the model.
To this end, a sample equalization strategy may be utilized. The sample equalization policy may include a data equalization policy and/or an algorithm equalization policy. The data equalization policy may refer to a policy utilized to achieve a relatively uniform number of different classes of sample audio in a training sample set. An algorithmic equalization policy may refer to a policy utilized to achieve sample equalization without changing the number of different classes of sample audio.
For data equalization strategies, for example, a training sample set may be derived using an oversampling strategy. A training sample set may be obtained using an undersampling strategy. The training sample set may be obtained using an oversampling in combination with an undersampling strategy. The training sample set may also be obtained using a strategy according to a preset sampling ratio.
For an algorithm equalization strategy, the number of initial sample audio corresponding to each multi-classification label may be determined, a weighting coefficient corresponding to each multi-classification label may be determined based on that number, and a target loss function may be obtained from the initial loss function and the weighting coefficients corresponding to the multi-classification labels. Each sample audio in the training sample set is then processed by the feature extraction module to obtain the corresponding depth feature data, the depth feature data is processed by the time sequence feature learning module to obtain the corresponding time sequence feature sequence, and the time sequence feature sequence is processed by the multi-classification module to obtain the corresponding multi-classification result. The multi-classification result and the multi-classification label corresponding to each sample audio are input into the target loss function to obtain an output value, and the model parameters of the feature extraction module, the time sequence feature learning module and the multi-classification module are adjusted according to the output value until it converges. The modules obtained when the output value converges are determined as the abnormal audio classification model.
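A sketch of this algorithm equalization strategy: each multi-classification label is weighted inversely to its sample count, and the weights enter the target loss function. The inverse-frequency rule and the sample counts below are illustrative assumptions.

    import torch
    import torch.nn as nn

    def label_weights(counts):
        """Inverse-frequency weighting: scarcer labels get larger coefficients."""
        counts = torch.tensor(counts, dtype=torch.float)
        return counts.sum() / (len(counts) * counts)

    # Hypothetical numbers of initial sample audio for the four labels.
    weights = label_weights([9000, 500, 300, 200])

    # Target loss function: positive (minority) examples are up-weighted per label.
    criterion = nn.BCEWithLogitsLoss(pos_weight=weights)
    loss = criterion(torch.randn(8, 4), torch.randint(0, 2, (8, 4)).float())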
According to an embodiment of the present disclosure, operation S210 may include the following operations.
And processing each sample audio in the training sample set by utilizing a feature extraction module based on a preset processing sequence to obtain depth feature data corresponding to each sample audio.
According to embodiments of the present disclosure, the preset processing order may refer to an order in which individual sample audio in a training sample set is processed. In order to improve the prediction accuracy of the model, a preset processing order may be set as an order in which the sample audio of each category is alternately processed.
According to the embodiment of the disclosure, sample audio of different categories is alternately input into an abnormal audio classification model, and depth feature data corresponding to each sample audio is obtained.
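A sketch of such a preset processing order: per-class lists are interleaved round-robin so consecutive samples alternate between categories. The (audio, label) pair layout is a hypothetical representation of the training sample set.

    from itertools import chain, zip_longest

    def interleave_by_class(samples):
        """Round-robin over per-class lists so consecutive samples alternate classes."""
        by_class = {}
        for audio, label in samples:
            by_class.setdefault(label, []).append((audio, label))
        rounds = zip_longest(*by_class.values())  # one item per class per round
        return [s for s in chain.from_iterable(rounds) if s is not None]

    data = [("a1", 0), ("a2", 0), ("a3", 0), ("b1", 1), ("c1", 2)]
    print(interleave_by_class(data))
    # [('a1', 0), ('b1', 1), ('c1', 2), ('a2', 0), ('a3', 0)]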
The training method of the abnormal audio classification model according to the embodiment of the present disclosure is further described below with reference to fig. 3 to 7 in conjunction with the specific embodiment.
Fig. 3 schematically illustrates a flowchart of inputting a time series feature sequence corresponding to each sample audio into a plurality of classification units, respectively, resulting in classification results corresponding to each sample audio according to an embodiment of the present disclosure.
As shown in fig. 3, the method 300 includes operations S331-S335.
In operation S331, the time sequence feature sequence corresponding to each sample audio is input into the first classification unit to obtain a first classification result corresponding to each sample audio, where the first classification result characterizes whether the sample audio is normal audio or abnormal audio.

In operation S332, the time sequence feature sequence corresponding to each sample audio is input into the second classification unit to obtain a second classification result corresponding to each sample audio, where the second classification result characterizes whether the sample audio is breath-class audio or other audio within the abnormal audio.

In operation S333, the time sequence feature sequence corresponding to each sample audio is input into the third classification unit to obtain a third classification result corresponding to each sample audio, where the third classification result characterizes whether the sample audio is pure breath-class audio or other audio within the breath-class audio.

In operation S334, the time sequence feature sequence corresponding to each sample audio is input into the fourth classification unit to obtain a fourth classification result corresponding to each sample audio, where the fourth classification result characterizes whether the sample audio, within the pure breath-class audio, is normal audio or abnormal audio.
In operation S335, a multi-classification result corresponding to each sample audio is obtained according to the first, second, third, and fourth classification results corresponding to each sample audio.
According to an embodiment of the present disclosure, the multi-classification module may include four classification units. The first classification unit classifies the sample audio as normal audio or abnormal audio. The second classification unit classifies the sample audio as breath-class audio or other audio within the abnormal audio, where the other audio may include unknown audio and normal non-breath audio. The third classification unit classifies the sample audio as pure breath-class audio or other audio within the breath-class audio. The fourth classification unit classifies the sample audio, within the pure breath-class audio, as normal audio or abnormal audio.

For example, an abnormal audio classification model is used to classify panting audio. The breath-class audio may include panting audio, wheezing audio, semi-wheezing audio, and the like. Within the pure breath-class audio, the normal audio may be referred to as non-panting audio and the abnormal audio as panting audio.

For a given sample audio, if the first classification result indicates that it is abnormal audio, the second indicates that it is breath-class audio, the third indicates that it is pure breath-class audio, and the fourth indicates that it is panting audio, then the multi-classification result corresponding to the sample audio can be determined to be abnormal audio, breath-class audio, pure breath-class audio and panting audio. The sample audio can thus be determined to be abnormal panting audio.
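The four side-by-side classification units can be sketched as independent binary heads sharing one pooled time sequence feature sequence; the layer sizes and the mean pooling step are assumptions.

    import torch
    import torch.nn as nn

    class MultiClassificationModule(nn.Module):
        """Four parallel binary classifiers: normal/abnormal, breath-class/other,
        pure-breath/other, and panting/non-panting."""
        def __init__(self, dim: int = 128):
            super().__init__()
            self.heads = nn.ModuleList([nn.Linear(dim, 1) for _ in range(4)])

        def forward(self, seq):       # seq: (batch, T, dim)
            pooled = seq.mean(dim=1)  # simple temporal pooling
            return [torch.sigmoid(h(pooled)) for h in self.heads]

    results = MultiClassificationModule()(torch.randn(2, 5, 128))
    print([r.shape for r in results])  # four tensors of shape (2, 1)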
Fig. 4 schematically illustrates a schematic diagram of inputting a time-series feature sequence corresponding to sample audio into a plurality of classification units, respectively, resulting in classification results corresponding to sample audio according to an embodiment of the present disclosure.
As shown in fig. 4, in the process 400, a time sequence feature sequence 401 corresponding to a sample audio is input into a first classification unit 4020, a second classification unit 4021, a third classification unit 4022 and a fourth classification unit 4023 included in a multi-classification module 402, respectively, to obtain a first classification result 4030, a second classification result 4031, a third classification result 4032 and a fourth classification result 4033, and a multi-classification result 403 corresponding to the sample audio is obtained according to the first classification result 4030, the second classification result 4031, the third classification result 4032 and the fourth classification result 4033.
Fig. 5 schematically illustrates a schematic diagram of a training process of an abnormal audio classification model according to an embodiment of the present disclosure.
As shown in fig. 5, in the process 500, each sample audio included in the training sample set 501 is input to the feature extraction module 5020 to obtain depth feature data 503 corresponding to the sample audio.
The depth feature data 503 corresponding to the sample audio is input to the time series feature learning module 5021, and a time series feature sequence 504 corresponding to the sample audio is obtained.
The time sequence feature sequence 504 corresponding to the sample audio is input into the multi-classification module 5022, and the multi-classification result 505 corresponding to the sample audio is obtained.
And performing joint training on the feature extraction module 5020, the time sequence feature learning module 5021 and the multi-classification module 5022 according to the multi-classification result 505 and the multi-classification label 506 corresponding to the sample audio to obtain an abnormal audio classification model 502 comprising the feature extraction module 5020, the time sequence feature learning module 5021 and the multi-classification module 5022.
Fig. 6 schematically illustrates a flow chart for deriving a training sample set using a sample equalization strategy, according to an embodiment of the disclosure.
As depicted in fig. 6, the method 600 includes operations S641-S644.
In operation S641, the number of initial sample audio corresponding to each multi-classification label is determined.

In operation S642, a target multi-classification label is determined, where the target multi-classification label is a multi-classification label whose number of corresponding initial sample audio is less than a first preset number threshold.

In operation S643, the initial sample audio corresponding to the target multi-classification label is oversampled, so that the number of initial sample audio corresponding to the target multi-classification label is greater than or equal to the first preset number threshold and less than or equal to a second preset number threshold.

In operation S644, the initial sample audio corresponding to the target multi-classification label obtained after the oversampling and the initial sample audio corresponding to the other multi-classification labels are determined as the training sample set, where the other multi-classification labels are multi-classification labels whose number of corresponding initial sample audio is greater than or equal to the first preset number threshold and less than or equal to the second preset number threshold.
According to embodiments of the present disclosure, the first preset number threshold may be used to determine whether a certain class of sample audio is underrepresented, that is, if the number of initial sample audio corresponding to a multi-classification label is less than the first preset number threshold, the class of sample audio characterized by that label has too few samples. The values of the first preset number threshold and the second preset number threshold may be configured according to actual business requirements and are not limited herein. The first preset number threshold is less than the second preset number threshold.
In accordance with an embodiment of the present disclosure, oversampling the initial sample audio corresponding to the target multi-classification label may include: oversampling the initial sample audio corresponding to the target multi-classification label using a random sampling method, or oversampling it using the Synthetic Minority Over-sampling Technique (SMOTE).
According to an embodiment of the present disclosure, operation S643 may include the following operations.
The initial sample audio corresponding to the target multi-classification label is copied.

In accordance with an embodiment of the present disclosure, oversampling the initial sample audio corresponding to the target multi-classification label using a random sampling method may include: randomly copying the initial sample audio corresponding to the target multi-classification label to obtain copied initial sample audio corresponding to the target multi-classification label.
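A minimal sketch of oversampling by copying: minority-label samples are randomly duplicated until the count reaches the first preset number threshold. The file names and threshold value are illustrative.

    import random

    def oversample(samples, floor):
        """Randomly copy items until there are at least `floor` of them."""
        out = list(samples)
        while len(out) < floor:
            out.append(random.choice(samples))  # duplicate a random original item
        return out

    minority = ["pant_01.wav", "pant_02.wav", "pant_03.wav"]
    balanced = oversample(minority, floor=8)  # 8 plays the first preset threshold
    print(len(balanced))  # 8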
Fig. 7 schematically illustrates a flow chart for deriving a training sample set using a sample equalization strategy according to another embodiment of the present disclosure.
As shown in fig. 7, the method 700 includes operations S741-S743.
In operation S741, the initial sample audio corresponding to each multi-classification label is determined.

In operation S742, sample audio is determined from the initial sample audio corresponding to each multi-classification label according to a preset sampling ratio.

In operation S743, a training sample set is obtained from the sample audio corresponding to each multi-classification label.
According to embodiments of the present disclosure, the preset sampling ratio may be used to determine the number of sample audio for each category. Sample audio may be drawn from the initial sample audio corresponding to each multi-classification label according to the preset sampling ratio, so as to keep the numbers of sample audio of different categories balanced.
According to the embodiment of the present disclosure, the value of the preset sampling ratio may be configured according to actual business requirements and is not limited herein. For example, with multi-classification labels of four categories, the preset sampling ratio may be 1:1:1:1, that is, the numbers of sample audio for the four multi-classification labels are made the same.
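A sketch of sampling to a preset ratio: with the 1:1:1:1 ratio of the example, each label is sampled without replacement down to the size of the most constrained class. The helper and pool names are hypothetical.

    import random

    def sample_by_ratio(by_label, ratio):
        """Draw per-label samples so the resulting counts follow `ratio`."""
        unit = min(len(v) // r for v, r in zip(by_label.values(), ratio))
        return {label: random.sample(items, r * unit)
                for (label, items), r in zip(by_label.items(), ratio)}

    pools = {"L1": list(range(100)), "L2": list(range(40)),
             "L3": list(range(40)), "L4": list(range(25))}
    balanced = sample_by_ratio(pools, ratio=[1, 1, 1, 1])
    print({k: len(v) for k, v in balanced.items()})
    # {'L1': 25, 'L2': 25, 'L3': 25, 'L4': 25}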
According to the embodiment of the disclosure, the sample audio is determined from the initial sample audio corresponding to each multi-classification label according to the preset sampling ratio, so that the balance of sample audio quantities across classes is effectively ensured.
Fig. 8 schematically illustrates a flowchart of an abnormal audio classification method according to an embodiment of the present disclosure.
As shown in fig. 8, the method 800 includes operations S810-S820.
In operation S810, a target audio is acquired.
In operation S820, the target audio is input into an abnormal audio classification model, and a multi-classification result corresponding to the target audio is obtained, wherein the abnormal audio classification model is obtained by training the training method of the abnormal audio classification model according to the embodiment of the present disclosure.
According to the embodiment of the disclosure, a multi-classification result corresponding to a target audio is obtained by inputting the target audio into an abnormal audio classification model, the abnormal audio classification model is obtained by processing each sample audio in a training sample set by using a feature extraction module to obtain depth feature data corresponding to each sample audio, processing the depth feature data corresponding to each sample audio by using a time sequence feature learning module to obtain a time sequence feature sequence corresponding to each sample audio, processing the time sequence feature sequence corresponding to each sample audio by using a multi-classification module to obtain a multi-classification result corresponding to each sample audio, and performing joint training on the feature extraction module, the time sequence feature learning module and the multi-classification module according to the multi-classification result and the multi-classification label corresponding to each sample audio. Because the feature extraction is completed in the training process of the abnormal audio classification model, the end-to-end training of feature extraction and classification is formed, and the extracted depth feature data can better reflect the features of the abnormal audio, so that the prediction accuracy of the model is improved. Through multi-classification, the possibility of classifying normal audio into abnormal audio is reduced, the prediction accuracy of the model is improved, and the classification requirement for the abnormal audio is met.
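Operations S810 to S820 can be sketched as follows; the acoustic front end is stubbed out and the model is a trivial stand-in for a trained abnormal audio classification model, so every name here is an assumption for illustration.

    import torch
    import torch.nn as nn

    def extract_features(path: str) -> torch.Tensor:
        """Placeholder front end; a real pipeline would compute e.g. MFCCs."""
        return torch.randn(5, 40)

    # Trivial stand-in for the trained model: maps (1, T, 40) to four probabilities.
    model = nn.Sequential(nn.Flatten(1), nn.Linear(5 * 40, 4), nn.Sigmoid())
    model.eval()

    def classify(path: str) -> bool:
        feats = extract_features(path).unsqueeze(0)  # add batch dimension
        with torch.no_grad():
            p_abnormal, p_breath, p_pure, p_pant = model(feats)[0]
        # Flag the target audio as panting audio only when all four units agree.
        return bool(min(p_abnormal, p_breath, p_pure, p_pant) > 0.5)

    print(classify("target_audio.wav"))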
It should be noted that, in the technical solutions of the embodiments of the present disclosure, the acquisition, storage and application of the sample audio and target audio involved comply with relevant laws and regulations, necessary security measures have been taken, and public order and good custom are not violated.
According to an embodiment of the present disclosure, a training apparatus for an abnormal audio classification model is provided. The abnormal audio classification model may include a feature extraction module, a time series feature learning module, and a multi-classification module.
Fig. 9 schematically illustrates a block diagram of a training apparatus of an abnormal audio classification model according to an embodiment of the present disclosure.
As shown in fig. 9, the training apparatus 900 of the abnormal audio classification model may include a first processing module 910, a second processing module 920, a third processing module 930, and a training module 940.
The first processing module 910 is configured to process each sample audio in the training sample set by using the feature extraction module, so as to obtain depth feature data corresponding to each sample audio.
The second processing module 920 is configured to process the depth feature data corresponding to each sample audio by using the timing feature learning module, so as to obtain a timing feature sequence corresponding to each sample audio.
The third processing module 930 is configured to process the time sequence feature sequence corresponding to each sample audio by using the multi-classification module, so as to obtain a multi-classification result corresponding to each sample audio.
The training module 940 is configured to perform joint training on the feature extraction module, the time sequence feature learning module, and the multi-classification module according to the multi-classification result and the multi-classification label corresponding to each sample audio, so as to obtain an abnormal audio classification model.
According to an embodiment of the present disclosure, the multi-classification module includes a plurality of classification units connected in parallel.
The third processing module 930 may include a first obtaining sub-module.
The first obtaining sub-module is configured to input the time sequence feature sequence corresponding to each sample audio into the plurality of classification units, respectively, to obtain a multi-classification result corresponding to each sample audio.
According to an embodiment of the present disclosure, the multi-classification module includes a first classification unit, a second classification unit, a third classification unit, and a fourth classification unit.
The first obtaining sub-module may include a first obtaining unit, a second obtaining unit, a third obtaining unit, a fourth obtaining unit, and a fifth obtaining unit.
The first obtaining unit is configured to input the time sequence feature sequence corresponding to each sample audio into the first classification unit to obtain a first classification result corresponding to each sample audio, where the first classification result indicates whether the sample audio is normal audio or abnormal audio.
The second obtaining unit is configured to input the time sequence feature sequence corresponding to each sample audio into the second classification unit to obtain a second classification result corresponding to each sample audio, where the second classification result indicates whether the sample audio is breath audio or other audio within the abnormal audio.
The third obtaining unit is configured to input the time sequence feature sequence corresponding to each sample audio into the third classification unit to obtain a third classification result corresponding to each sample audio, where the third classification result indicates whether the sample audio is pure breath audio or other audio within the breath audio.
The fourth obtaining unit is configured to input the time sequence feature sequence corresponding to each sample audio into the fourth classification unit to obtain a fourth classification result corresponding to each sample audio, where the fourth classification result indicates whether the sample audio is normal audio or abnormal audio within the pure breath audio.
The fifth obtaining unit is configured to obtain the multi-classification result corresponding to each sample audio from the first, second, third, and fourth classification results corresponding to that sample audio.
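The disclosure does not spell out how the fifth obtaining unit merges the four results. One plausible reading, shown below for illustration only, is a cascade in which each later head only matters when the earlier heads route to it; the 0/1 encodings and label strings are invented for the example.

def combine_heads(first, second, third, fourth):
    # Each argument is one head's predicted class (0 or 1) for a single clip;
    # the encodings below are assumptions, not fixed by the disclosure.
    if first == 0:
        return "normal"                      # head 1: normal vs. abnormal
    if second == 1:
        return "abnormal-other"              # head 2: breath vs. other abnormal audio
    if third == 1:
        return "breath-other"                # head 3: pure breath vs. other breath audio
    # head 4: within pure breath audio, normal vs. abnormal
    return "pure-breath-normal" if fourth == 0 else "pure-breath-abnormal"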
According to an embodiment of the present disclosure, the training apparatus 900 of the abnormal audio classification model may include a first obtaining module.
The first obtaining module is configured to obtain the training sample set by using a sample equalization strategy.
According to an embodiment of the present disclosure, the first obtaining module may include a first determining sub-module, a second determining sub-module, an oversampling sub-module, and a third determining sub-module.
The first determining sub-module is configured to determine the number of initial sample audio corresponding to each multi-classification label.
The second determining sub-module is configured to determine a target multi-classification label, where the target multi-classification label is a multi-classification label for which the number of corresponding initial sample audio is smaller than a first preset number threshold.
The oversampling sub-module is configured to oversample the initial sample audio corresponding to the target multi-classification label so that the number of initial sample audio corresponding to the target multi-classification label becomes greater than or equal to the first preset number threshold and less than or equal to a second preset number threshold.
The third determining sub-module is configured to determine, as the training sample set, the oversampled initial sample audio corresponding to the target multi-classification label together with the initial sample audio corresponding to the other multi-classification labels, where the other multi-classification labels are multi-classification labels whose number of corresponding initial sample audio is already greater than or equal to the first preset number threshold and less than or equal to the second preset number threshold.
According to an embodiment of the present disclosure, the oversampling sub-module may include a copying unit.
The copying unit is configured to copy the initial sample audio corresponding to the target multi-classification label.
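As a minimal sketch of the copying unit, the following function duplicates randomly chosen clips of a minority label until the first threshold is met; the function name, the random selection policy, and the threshold handling are assumptions for illustration.

import random

def oversample_by_copying(clips, low, high, seed=0):
    # clips: initial sample audio for one target multi-classification label;
    # low/high: the first and second preset number thresholds.
    rng = random.Random(seed)
    out = list(clips)
    while len(out) < low:
        out.append(rng.choice(clips))   # append an exact copy of an existing clip
    return out[:high]                   # never exceed the second threshold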
According to an embodiment of the present disclosure, the first obtaining module may include a fourth determining sub-module, a fifth determining sub-module, and a second obtaining sub-module.
The fourth determining sub-module is configured to determine the initial sample audio corresponding to each multi-classification label.
The fifth determining sub-module is configured to determine sample audio from the initial sample audio corresponding to each multi-classification label according to a preset sampling ratio.
The second obtaining sub-module is configured to obtain the training sample set from the sample audio corresponding to each multi-classification label.
According to an embodiment of the present disclosure, the first processing module 910 may include a processing sub-module.
The processing sub-module is configured to process each sample audio in the training sample set with the feature extraction module, based on a preset processing sequence, to obtain depth feature data corresponding to each sample audio.
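The preset processing sequence itself is left open by the disclosure; a simple illustrative choice is to fix the iteration order before extraction so that runs are reproducible, as in this sketch (the sorted order and the helper name extract_features are assumptions).

def extract_in_order(sample_paths, extract_features):
    # Iterate over the training sample set in a predetermined (here: sorted)
    # order and run the feature extraction module on each clip.
    for path in sorted(sample_paths):
        yield path, extract_features(path)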
Fig. 10 schematically illustrates a block diagram of an abnormal audio classification apparatus according to an embodiment of the present disclosure.
As shown in fig. 10, the abnormal audio classification apparatus 1000 may include an acquisition module 1010 and a second obtaining module 1020.
An acquisition module 1010, configured to acquire target audio.
The second obtaining module 1020 is configured to input the target audio into an abnormal audio classification model to obtain a multi-classification result corresponding to the target audio, where the abnormal audio classification model is trained using the training apparatus for the abnormal audio classification model according to an embodiment of the present disclosure.
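For illustration, a classification call against the model sketched earlier might look as follows; the torchaudio front end, the mono input, and the log-mel features are assumptions, since the disclosure does not fix the input format.

import torch
import torchaudio

def classify_clip(path, model):
    waveform, sample_rate = torchaudio.load(path)   # acquire the target audio (assumed mono)
    mel = torchaudio.transforms.MelSpectrogram(sample_rate, n_mels=64)(waveform)
    spectrogram = mel.log1p().unsqueeze(0)          # (1, 1, 64, frames)
    model.eval()
    with torch.no_grad():
        logits = model(spectrogram)                 # one logit pair per classification unit
    return [int(head.argmax(dim=-1)) for head in logits]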
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method described above.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium stores computer instructions for causing a computer to perform the method described above.
According to an embodiment of the present disclosure, a computer program product includes a computer program that, when executed by a processor, implements the method described above.
Fig. 11 illustrates a block diagram of an electronic device 1100 suitable for implementing the training method of the abnormal audio classification model or the abnormal audio classification method according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 11, the electronic device 1100 includes a computing unit 1101 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data required for the operation of the electronic device 1100 can also be stored. The computing unit 1101, ROM 1102, and RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to bus 1104.
A number of components in the electronic device 1100 are connected to the I/O interface 1105, including: an input unit 1106 such as a keyboard, a mouse, etc.; an output unit 1107 such as various types of displays, speakers, and the like; a storage unit 1108, such as a magnetic disk, optical disk, etc.; and a communication unit 1109 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 1109 allows the device 1100 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1101 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1101 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1101 performs the methods and processes described above, for example, the training method of the abnormal audio classification model or the abnormal audio classification method. For example, in some embodiments, the training method of the abnormal audio classification model or the abnormal audio classification method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1108. In some embodiments, some or all of the computer program may be loaded and/or installed onto the device 1100 via the ROM 1102 and/or the communication unit 1109. When the computer program is loaded into the RAM 1103 and executed by the computing unit 1101, one or more steps of the training method of the abnormal audio classification model or the abnormal audio classification method described above may be performed. Alternatively, in other embodiments, the computing unit 1101 may be configured to perform the training method of the abnormal audio classification model or the abnormal audio classification method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted in the various flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (18)

1. A training method of an abnormal audio classification model, the abnormal audio classification model comprising a feature extraction module, a time sequence feature learning module and a multi-classification module, the method comprising:
processing each sample audio in a training sample set by utilizing the feature extraction module to obtain depth feature data corresponding to each sample audio;
processing depth feature data corresponding to each sample audio by using the time sequence feature learning module to obtain a time sequence feature sequence corresponding to each sample audio;
processing a time sequence feature sequence corresponding to each sample audio by utilizing the multi-classification module to obtain a multi-classification result corresponding to each sample audio; and
according to the multi-classification result and the multi-classification label corresponding to each sample audio, carrying out joint training on the feature extraction module, the time sequence feature learning module and the multi-classification module to obtain the abnormal audio classification model;
wherein the multi-classification module comprises a plurality of classification units connected in parallel;
the processing, by using the multi-classification module, the time sequence feature sequence corresponding to each sample audio to obtain a multi-classification result corresponding to each sample audio, including:
and respectively inputting the time sequence feature sequences corresponding to each sample audio into the plurality of classification units to obtain a multi-classification result corresponding to each sample audio.
2. The method of claim 1, wherein the multi-classification module comprises a first classification unit, a second classification unit, a third classification unit, and a fourth classification unit;
the step of inputting the time sequence feature sequences corresponding to each sample audio into the plurality of classification units to obtain multi-classification results corresponding to each sample audio, comprises the following steps:
inputting a time sequence feature sequence corresponding to each sample audio into the first classification unit to obtain a first classification result corresponding to each sample audio, wherein the first classification result represents a classification result that the sample audio is normal audio or abnormal audio;
inputting the time sequence feature sequence corresponding to each sample audio into the second classification unit to obtain a second classification result corresponding to each sample audio, wherein the second classification result represents that the sample audio is a classification result of breath audio or other audio in the abnormal audio;
inputting the time sequence feature sequence corresponding to each sample audio into the third classification unit to obtain a third classification result corresponding to each sample audio, wherein the third classification result represents that the sample audio is a classification result of pure breath audio or other audio in the breath audio;
inputting the time sequence feature sequence corresponding to each sample audio into a fourth classification unit to obtain a fourth classification result corresponding to each sample audio, wherein the fourth classification result represents the classification result that the sample audio is normal audio or abnormal audio in the pure breath audio; and
and obtaining a multi-classification result corresponding to each sample audio according to the first classification result, the second classification result, the third classification result and the fourth classification result corresponding to each sample audio.
3. The method of claim 1 or 2, further comprising:
and obtaining the training sample set by using a sample equalization strategy.
4. A method according to claim 3, wherein said deriving said training sample set using a sample equalization strategy comprises:
determining a number of initial sample audio corresponding to each of the multi-class tags;
determining a target multi-classification label, wherein the target multi-classification label is a multi-classification label with the number of the corresponding initial sample audio being smaller than a first preset number threshold;
oversampling the initial sample audio corresponding to the target multi-classification tag such that the number of initial sample audio corresponding to the target multi-classification tag is greater than or equal to the first preset number threshold and less than or equal to a second preset number threshold; and
and determining the initial sample audio corresponding to the target multi-classification label and the initial sample audio corresponding to other multi-classification labels obtained after the oversampling as the training sample set, wherein the other multi-classification labels are multi-classification labels with the number of the corresponding initial sample audio being greater than or equal to the first preset number threshold and less than or equal to the second preset number threshold.
5. The method of claim 4, wherein the oversampling of the initial sample audio corresponding to the target multi-class label comprises:
and copying the initial sample audio corresponding to the target multi-classification label.
6. A method according to claim 3, wherein said deriving said training sample set using a sample equalization strategy comprises:
determining initial sample audio corresponding to each of the multi-class labels;
determining sample audio from initial sample audio corresponding to each multi-classification label according to a preset sampling proportion; and
and obtaining the training sample set according to the sample audio corresponding to each multi-classification label.
7. The method of claim 1, wherein the processing each sample audio in a training sample set with the feature extraction module to obtain depth feature data corresponding to the each sample audio comprises:
and processing each sample audio in the training sample set by utilizing the characteristic extraction module based on a preset processing sequence to obtain depth characteristic data corresponding to each sample audio.
8. An abnormal audio classification method, comprising:
acquiring target audio; and
inputting the target audio into the abnormal audio classification model to obtain a multi-classification result corresponding to the target audio, wherein the abnormal audio classification model is trained by the method of any one of claims 1-7.
9. A training device for an abnormal audio classification model, the abnormal audio classification model comprising a feature extraction module, a time series feature learning module, and a multi-classification module, the device comprising:
the first processing module is used for processing each sample audio in the training sample set by utilizing the characteristic extraction module to obtain depth characteristic data corresponding to each sample audio;
the second processing module is used for processing the depth characteristic data corresponding to each sample audio by utilizing the time sequence characteristic learning module to obtain a time sequence characteristic sequence corresponding to each sample audio;
the third processing module is used for processing the time sequence feature sequence corresponding to each sample audio by utilizing the multi-classification module to obtain a multi-classification result corresponding to each sample audio; and
the training module is used for carrying out joint training on the feature extraction module, the time sequence feature learning module and the multi-classification module according to the multi-classification result and the multi-classification label corresponding to each sample audio to obtain the abnormal audio classification model;
wherein the multi-classification module comprises a plurality of classification units connected in parallel;
the third processing module includes:
the first obtaining submodule is used for respectively inputting the time sequence characteristic sequences corresponding to each sample audio into the plurality of classification units to obtain multi-classification results corresponding to each sample audio.
10. The apparatus of claim 9, wherein the multi-classification module comprises a first classification unit, a second classification unit, a third classification unit, and a fourth classification unit;
the first obtaining sub-module includes:
the first obtaining unit is used for inputting the time sequence feature sequence corresponding to each sample audio into the first classification unit to obtain a first classification result corresponding to each sample audio, wherein the first classification result represents the classification result that the sample audio is normal audio or abnormal audio;
the second obtaining unit is used for inputting the time sequence feature sequence corresponding to each sample audio into the second classification unit to obtain a second classification result corresponding to each sample audio, wherein the second classification result represents the classification result that the sample audio is breath audio or other audio in the abnormal audio;
a third obtaining unit, configured to input the time sequence feature sequence corresponding to each sample audio into the third classification unit, to obtain a third classification result corresponding to each sample audio, where the third classification result represents the classification result that the sample audio is pure breath audio or other audio in the breath audio;
a fourth obtaining unit, configured to input the time sequence feature sequence corresponding to each sample audio into the fourth classification unit, to obtain a fourth classification result corresponding to each sample audio, where the fourth classification result represents the classification result that the sample audio is normal audio or abnormal audio in the pure breath audio; and
and a fifth obtaining unit, configured to obtain a multi-classification result corresponding to each sample audio according to the first classification result, the second classification result, the third classification result, and the fourth classification result corresponding to each sample audio.
11. The apparatus of claim 9 or 10, further comprising:
the first obtaining module is used for obtaining the training sample set by using a sample equalization strategy.
12. The apparatus of claim 11, wherein the first obtaining module comprises:
a first determining sub-module for determining the number of initial sample audio corresponding to each of the multi-class labels;
a second determining sub-module configured to determine a target multi-class label, where the target multi-class label is a multi-class label in which the number of initial sample audio is less than a first preset number threshold;
an oversampling submodule for oversampling initial sample audio corresponding to the target multi-classification tag so that the number of initial sample audio corresponding to the target multi-classification tag is greater than or equal to the first preset number threshold and less than or equal to a second preset number threshold; and
and the third determining submodule is used for determining the initial sample audio corresponding to the target multi-classification label and the initial sample audio corresponding to other multi-classification labels obtained after the oversampling as the training sample set, wherein the other multi-classification labels are multi-classification labels with the number of the initial sample audio being greater than or equal to the first preset number threshold and less than or equal to the second preset number threshold.
13. The apparatus of claim 12, wherein the oversampling submodule comprises:
a copying unit for copying the initial sample audio corresponding to the target multi-classification label.
14. The apparatus of claim 11, wherein the first obtaining module comprises:
a fourth determining sub-module for determining initial sample audio corresponding to each of the multi-class labels;
a fifth determining sub-module, configured to determine sample audio from initial sample audio corresponding to each multi-classification label according to a preset sampling proportion; and
and the second obtaining submodule is used for obtaining the training sample set according to the sample audio corresponding to each multi-classification label.
15. An abnormal audio classification apparatus comprising:
the acquisition module is used for acquiring target audio; and
a second obtaining module, configured to input the target audio into the abnormal audio classification model to obtain a multi-classification result corresponding to the target audio, where the abnormal audio classification model is obtained by training using the apparatus according to any one of claims 9 to 14.
16. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 8.
17. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-8.
18. A computer program product comprising a computer program stored on at least one of a readable storage medium and an electronic device, which, when executed by a processor, implements the method according to any one of claims 1-8.
CN202110821915.9A 2021-07-20 2021-07-20 Training method, classification method, device, electronic equipment and storage medium Active CN113408664B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110821915.9A CN113408664B (en) 2021-07-20 2021-07-20 Training method, classification method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113408664A CN113408664A (en) 2021-09-17
CN113408664B true CN113408664B (en) 2024-04-16

Family

ID=77687160

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110821915.9A Active CN113408664B (en) 2021-07-20 2021-07-20 Training method, classification method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113408664B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108630190B (en) * 2018-05-18 2019-12-10 百度在线网络技术(北京)有限公司 Method and apparatus for generating speech synthesis model

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018066800A (en) * 2016-10-18 2018-04-26 日本放送協会 Japanese speech recognition model learning device and program
US10372991B1 (en) * 2018-04-03 2019-08-06 Google Llc Systems and methods that leverage deep learning to selectively store audiovisual content
CN109192222A (en) * 2018-07-23 2019-01-11 浙江大学 A kind of sound abnormality detecting system based on deep learning
WO2021138855A1 (en) * 2020-01-08 2021-07-15 深圳市欢太科技有限公司 Model training method, video processing method and apparatus, storage medium and electronic device
CN111444382A (en) * 2020-03-30 2020-07-24 腾讯科技(深圳)有限公司 Audio processing method and device, computer equipment and storage medium
CN111951560A (en) * 2020-08-30 2020-11-17 北京嘀嘀无限科技发展有限公司 Service anomaly detection method, method for training service anomaly detection model and method for training acoustic model
CN112560912A (en) * 2020-12-03 2021-03-26 北京百度网讯科技有限公司 Method and device for training classification model, electronic equipment and storage medium
CN113053410A (en) * 2021-02-26 2021-06-29 北京国双科技有限公司 Voice recognition method, voice recognition device, computer equipment and storage medium
CN112786058A (en) * 2021-03-08 2021-05-11 北京百度网讯科技有限公司 Voiceprint model training method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on Text Classification Based on Semi-Supervision and Word Vector Weighting; Song Jianguo; Software Guide (09); full text *
Research and Analysis of Deep Learning in Audio Signal Processing; Liu Ming; Huang Jifeng; Information & Communications (03); full text *

Also Published As

Publication number Publication date
CN113408664A (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN113033537B (en) Method, apparatus, device, medium and program product for training a model
CN111950279B (en) Entity relationship processing method, device, equipment and computer readable storage medium
CN110059172B (en) Method and device for recommending answers based on natural language understanding
CN112749300B (en) Method, apparatus, device, storage medium and program product for video classification
CN113657483A (en) Model training method, target detection method, device, equipment and storage medium
US20150255090A1 (en) Method and apparatus for detecting speech segment
Song Sentiment analysis of Japanese text and vocabulary learning based on natural language processing and SVM
CN114692778B (en) Multi-mode sample set generation method, training method and device for intelligent inspection
CN115686908A (en) Data processing method and related equipment
CN113657249B (en) Training method, prediction method, device, electronic equipment and storage medium
CN113052246B (en) Method and related apparatus for training classification model and image classification
CN117253085A (en) Fundus image classification method, device, equipment and storage medium based on multitasking
CN113408664B (en) Training method, classification method, device, electronic equipment and storage medium
CN116204624A (en) Response method, response device, electronic equipment and storage medium
US20210034704A1 (en) Identifying Ambiguity in Semantic Resources
US20200089773A1 (en) Implementing dynamic confidence rescaling with modularity in automatic user intent detection systems
CN116342164A (en) Target user group positioning method and device, electronic equipment and storage medium
WO2023011093A1 (en) Task model training method and apparatus, and electronic device and storage medium
CN114419327B (en) Image detection method and training method and device of image detection model
CN113254578B (en) Method, apparatus, device, medium and product for data clustering
CN115600607A (en) Log detection method and device, electronic equipment and medium
CN113627354B (en) A model training and video processing method, which comprises the following steps, apparatus, device, and storage medium
CN113689866B (en) Training method and device of voice conversion model, electronic equipment and medium
CN116155541A (en) Automatic machine learning platform and method for network security application
CN115312042A (en) Method, apparatus, device and storage medium for processing audio

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant