CN117037847A - End-to-end community noise monitoring method and device and related components - Google Patents

End-to-end community noise monitoring method and device and related components

Info

Publication number
CN117037847A
Authority
CN
China
Prior art keywords
noise
data set
triplet
training
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310950511.9A
Other languages
Chinese (zh)
Other versions
CN117037847B (en)
Inventor
钟桂生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wanwuyun Technology Co ltd
Original Assignee
Shenzhen Wanwuyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Wanwuyun Technology Co ltd filed Critical Shenzhen Wanwuyun Technology Co ltd
Priority to CN202310950511.9A priority Critical patent/CN117037847B/en
Publication of CN117037847A publication Critical patent/CN117037847A/en
Application granted granted Critical
Publication of CN117037847B publication Critical patent/CN117037847B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F18/213: Pattern recognition; feature extraction, e.g. by transforming the feature space; summarisation; mappings, e.g. subspace methods
    • G06F18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G10L25/18: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L25/51: Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention discloses an end-to-end community noise monitoring method, a device and related components, wherein the method comprises the following steps: collecting an audio data set, and preprocessing the audio data set to obtain a training data set and a label set; performing frequency domain feature extraction and format conversion on the training data set to obtain a model training image set; generating a triplet training set and a triplet tag set according to the model training image set and the tag set; extracting features of the triplet training set and the triplet tag set through a deep learning model, and classifying through a multi-layer perceptron to obtain a noise category; and analyzing the noise category to obtain a final monitoring result. The method realizes automatic collection, analysis and identification of community noise, improves the degree of intelligence of property management, reduces the workload and management cost of property management staff, and improves noise recognition efficiency through the combination of a deep learning model and a multi-layer perceptron.

Description

End-to-end community noise monitoring method and device and related components
Technical Field
The present invention relates to the field of computer technologies, and in particular, to an end-to-end community noise monitoring method, apparatus, and related components.
Background
In recent years, more and more cities have put forward the concepts of smart cities and smart communities to raise the level of intelligence of urban management and services. The smart community aims to improve the comfort and satisfaction of residents' living environment, and the frequency of community noise disturbance incidents is one of the important indexes for evaluating the degree of intelligence of a community. As an important pollution source in urban environment monitoring and management, the community noise problem is becoming increasingly complicated and serious, seriously affecting people's quality of life and health. Therefore, how to effectively solve the community noise problem has become a hotspot of general concern for urban authorities and residents.
The traditional community noise monitoring approach mainly depends on methods such as manual timed inspection by property staff and single-point noise source detection, but these methods suffer from high cost, low efficiency, moral hazard and other drawbacks. In a modern society of continuously accelerating urbanization, the traditional noise monitoring methods cannot meet people's requirements on the quality of the living environment, so a more efficient and intelligent community noise monitoring technology is urgently needed, providing a more scientific and feasible solution for urban environmental pollution control.
Disclosure of Invention
The invention aims to provide an end-to-end community noise monitoring method, an end-to-end community noise monitoring device and related components, and aims to solve the problems of high cost, low efficiency and the like of the existing community noise monitoring method.
In a first aspect, an embodiment of the present invention provides an end-to-end community noise monitoring method, including:
collecting an audio data set, and preprocessing the audio data set to obtain a training data set and a label set;
extracting frequency domain characteristics and converting formats of the training data set to obtain a model training image set;
generating a triplet training set and a triplet tag set according to the model training image set and the tag set;
extracting features of the triplet training set and the triplet tag set through a deep learning model, and classifying through a multi-layer perceptron to obtain noise types;
and analyzing the noise category to obtain a final monitoring result.
In a second aspect, an embodiment of the present invention provides an end-to-end community noise monitoring device, including:
the collecting unit is used for collecting the audio data set, preprocessing the audio data set and obtaining a training data set and a label set;
the extraction unit is used for carrying out frequency domain feature extraction and format conversion on the training data set to obtain a model training image set;
the generating unit is used for generating a triplet training set and a triplet label set according to the model training image set and the label set;
the classification unit is used for extracting the characteristics of the triplet training set and the triplet label set through a deep learning model and classifying through a multi-layer perceptron to obtain noise types;
and the analysis unit is used for analyzing the noise category to obtain a final monitoring result.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor implements the end-to-end community noise monitoring method according to the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium, where the computer readable storage medium stores a computer program, where the computer program when executed by a processor implements the end-to-end community noise monitoring method according to the first aspect.
The invention discloses an end-to-end community noise monitoring method, a device and related components, wherein the method comprises the following steps: collecting an audio data set, and preprocessing the audio data set to obtain a training data set and a label set; performing frequency domain feature extraction and format conversion on the training data set to obtain a model training image set; generating a triplet training set and a triplet tag set according to the model training image set and the tag set; extracting features of the triplet training set and the triplet tag set through a deep learning model, and classifying through a multi-layer perceptron to obtain a noise category; and analyzing the noise category to obtain a final monitoring result. The method realizes automatic collection, analysis and identification of community noise, improves the degree of intelligence of property management, reduces the workload and management cost of property management staff, improves the accuracy and performance of noise recognition through the combination of the deep learning model and the multi-layer perceptron, and further improves noise recognition efficiency. The embodiments of the invention also provide an end-to-end community noise monitoring device, a computer readable storage medium and a computer device, which have the above beneficial effects and are not described in detail herein.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an end-to-end community noise monitoring method according to the present embodiment;
fig. 2 is a flow chart of frequency domain feature extraction in the present embodiment;
FIG. 3 is a flow chart of the Swin-Transformer model downsampling of the present embodiment;
FIG. 4 is a flowchart of the triplet function optimization of the present embodiment;
fig. 5 is a schematic structural diagram of an additional network structure of the present embodiment;
FIG. 6 is an analysis chart of a dog-barking example;
FIG. 7 is a schematic block diagram of an end-to-end community noise monitoring device of the present embodiment;
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Referring to fig. 1, the present invention provides an end-to-end community noise monitoring method, which includes:
s101: collecting an audio data set, and preprocessing the audio data set to obtain a training data set and a label set;
in this embodiment, the method for collecting the audio data set includes: and installing auscultation devices in corresponding places, collecting environmental audio data in real time through the auscultation devices, and making the audio data into a data set and transmitting the data set to the model for model training or reasoning.
This embodiment is also provided with a noise decibel diagnosis method, used to judge whether the current audio data should trigger recognition and inference by the model. The specific steps are as follows: the actual decibel value of the environmental audio is calculated and compared with a preset environmental noise decibel threshold, and it is judged whether the duration of the environmental noise exceeds a predetermined time length; if the environmental noise decibel value exceeds the threshold and the environmental noise duration exceeds the predetermined time length, subsequent work such as recognition and inference by the model is triggered; if the noise decibel value does not exceed the threshold, or the environmental noise duration does not exceed the predetermined time length, no action is taken.
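A minimal sketch of this two-condition trigger follows; it is illustrative only and not part of the original disclosure. The frame length, the -30 dB level threshold and the 3 s duration threshold are assumed values, and the level is computed as RMS in dB relative to full scale rather than a calibrated sound pressure level.

```python
import numpy as np

def rms_db(frame: np.ndarray, eps: float = 1e-10) -> float:
    """RMS level of one audio frame in dB relative to full scale."""
    return 20.0 * np.log10(np.sqrt(np.mean(frame ** 2)) + eps)

def should_trigger(audio: np.ndarray, sr: int,
                   db_threshold: float = -30.0,   # assumed level threshold
                   min_duration_s: float = 3.0,   # assumed duration threshold
                   frame_s: float = 0.1) -> bool:
    """True when the level stays above db_threshold for at least
    min_duration_s of consecutive audio, mirroring the two-condition
    diagnosis described above."""
    frame_len = int(sr * frame_s)
    needed = int(min_duration_s / frame_s)
    run = 0
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        if rms_db(audio[start:start + frame_len]) > db_threshold:
            run += 1
            if run >= needed:
                return True
        else:
            run = 0
    return False
```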
In order to meet the training requirements of the model, this embodiment needs to preprocess the collected audio data set through operations such as padding and slicing. Specifically:
The original audio samples $x_i^{ori}$ inside the audio data set $X_{ori}$ are read at a specified audio sampling rate $sr$, where $i$ is the sample index of the audio data set $X_{ori}$;
each original audio sample $x_i^{ori}$ is then split into several sub-audio samples based on the sample duration $d$ and the sample overlap rate $r$, each sub-audio sample being denoted $x_{i,j}$, where $j$ is the maximum segmentable index of the audio sample;
it is then judged, for each of the sub-audio samples $x_{i,j}$, whether it constitutes a complete model training sample; if not, the sub-audio sample $x_{i,j}$ is padded, with the padding data obeying a Gaussian distribution with mean 0 and standard deviation 0.01, i.e. $\mathcal{N}(0,\,0.01^2)$; if yes, no processing is performed;
each sub-audio sample $x_{i,j}$ is then marked according to a noise category index $I$. Community noise sources are of many types, such as street music, renovation drilling sound, car alarm sound and the like; in this embodiment the default no-noise label is "0". If one sub-audio sample contains several noise categories, the marking takes duration and noise intensity as the reference; considering duration and noise intensity allows the noise category to be marked more accurately and better reflects the differences between different noise categories, thereby improving the accuracy and robustness of subsequent model training;
the marked sub-audio samples and their corresponding labels are then stored into a training data set $X$ and a label set $Y$ respectively; the training data set and the label set are mainly used for training and prediction of the model.
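The slicing-and-padding step might look like the following sketch (an illustration under assumed parameter values d = 2 s and r = 0.5, not the original implementation):

```python
import numpy as np

def split_and_pad(x: np.ndarray, sr: int, d: float = 2.0, r: float = 0.5):
    """Split one recording into sub-samples of duration d with overlap
    rate r; pad an incomplete final slice with N(0, 0.01**2) noise."""
    win = int(d * sr)                 # samples per sub-audio slice
    hop = int(win * (1.0 - r))        # stride implied by the overlap rate
    slices = []
    for start in range(0, len(x), hop):
        s = x[start:start + win]
        if len(s) < win:              # not a complete training sample: pad
            pad = np.random.normal(0.0, 0.01, win - len(s))
            s = np.concatenate([s, pad])
        slices.append(s.astype(np.float32))
        if start + win >= len(x):
            break
    return slices
```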
S102: extracting frequency domain characteristics and converting formats of the training data set to obtain a model training image set;
in the embodiment, based on a Swin-transform network structure, a ternary loss function (Triplet Loss Function) is selected for training a model to obtain more representative characteristic representation, so that the accuracy and the robustness of a deep learning algorithm are improved; because the computer cannot directly process the audio data, the training data set needs to be pre-processed and converted into the input requirement of the deep learning algorithm.
Wherein, the preprocessing includes: extracting frequency domain features and constructing a triplet training set, namely, a step S102 and a step S103;
according to the method, the frequency domain characteristics of the training data set are extracted through a frequency spectrum analysis method, so that the energy distribution condition of the audio signal in the time dimension and the frequency spectrum dimension can be obtained, and the specificity of noise signals and the difference among different signals are captured.
Specifically, referring to fig. 2, a training data set is traversed and audio samples within the training data set are read at a specified audio sampling rate; framing and windowing the audio samples based on frame length and frame shift, and performing short-term Fourier transform on each frame of signal to obtain frequency spectrum information of the audio samples; constructing a mel filter bank based on the number of mel filters; carrying out convolution operation and logarithmic operation on the Mel filter bank and the frequency spectrum information to obtain Mel frequency spectrum characteristics; normalizing the Mel spectrum characteristics; converting the normalized Mel spectrum feature into Mel spectrum feature image according to image size; and storing the Mel spectrum characteristic image in the model training image set.
In one embodiment, the model training data set $X$ is traversed, and the audio samples $x_i$ within the training data set $X$ are read at a specified audio sampling rate $sr$; then, based on the frame length $n$ and the frame shift $h$, the audio sample $x_i$ is framed and windowed, and a short-term Fourier transform (STFT) is performed on each frame of the signal to obtain the spectrum information $X_i(m,k)$ of the audio sample $x_i$.
The short-term Fourier transform is formulated as follows:

$$X(m,k)=\sum_{n=0}^{N-1}x(n+mh)\,w(n)\,e^{-j\frac{2\pi kn}{N}}$$

wherein $x(n)$ is the original signal, representing the input time domain signal, and $n$ represents the time index; $X(m,k)$ is a two-dimensional function defined over the time domain $m$ and the frequency domain $k$, representing the spectral coefficients of the STFT; $w(n)$ is the window function; $e$ represents the natural constant; $N$ represents the window length; $m$ represents the time frame index; $k$ represents the frequency index; $j$ represents the imaginary unit.
A mel filter bank is then constructed based on the number $M$ of mel filters; the response function of the $m$-th mel filter is the triangular filter

$$H_m(k)=\begin{cases}0, & k<f(m-1)\\ \dfrac{k-f(m-1)}{f(m)-f(m-1)}, & f(m-1)\le k<f(m)\\ \dfrac{f(m+1)-k}{f(m+1)-f(m)}, & f(m)\le k<f(m+1)\\ 0, & k\ge f(m+1)\end{cases}$$

wherein $f(m)$ represents the center frequency of the $m$-th mel filter ($0\le m<M$), obtained by spacing the filters uniformly on the mel scale:

$$f(m)=f_{mel}^{-1}\!\left(f_{mel}(f_{min})+(m+1)\cdot\frac{f_{mel}(f_{max})-f_{mel}(f_{min})}{M+1}\right)$$

wherein $f_{max}$ represents the highest frequency of the signal, typically half the sampling rate, i.e. $sr/2$, and $f_{min}$ the lowest frequency; $f_{mel}$ is defined as the function converting linear frequency (in Hz) to nonlinear frequency (the mel scale), calculated as follows:

$$f_{mel}(f)=2595\,\log_{10}\!\left(1+\frac{f}{700}\right)$$

The mel filter bank is then combined with the spectrum information $X_i(m,k)$ of the audio sample $x_i$ through convolution and logarithm operations to obtain the corresponding mel spectrum features; the calculation formula for the $m$-th mel filter and the $t$-th frame of spectrum information is as follows:

$$S_m(t)=\log\!\left(\sum_{k=0}^{K}\left|X_t(k)\right|\,H_m(k)\right)$$

wherein $K$ represents the maximum frequency index, corresponding to $f_{max}$; $\left|X_t(k)\right|$ represents the amplitude spectrum of the $t$-th frame of the audio signal at frequency $k$; $S_m(t)$ represents the mel spectrum feature of the $t$-th frame of the audio signal after the action of the $m$-th mel filter.
The mel spectrum features are then normalized; the normalized mel spectrum features are converted into a mel spectrum feature image according to the image size $C\times H\times W$ (in the general case the image has three RGB channels, i.e. $C$ takes the value 3); the mel spectrum feature image is then stored in the model training image set $X_{img}$. After the mel spectrum feature image has been stored in the model training image set $X_{img}$, step S103 is performed.
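As an illustration of this pipeline (a sketch only, not the original code; the sampling rate, FFT size, hop length, filter count and 224x224 image size are assumed values), the log-mel feature image could be produced with the librosa and Pillow libraries:

```python
import numpy as np
import librosa
from PIL import Image

def mel_feature_image(y: np.ndarray, sr: int = 16000, n_fft: int = 1024,
                      hop: int = 256, n_mels: int = 64,
                      size=(224, 224)) -> Image.Image:
    """STFT -> mel filter bank -> log -> min-max normalize -> RGB image."""
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop, window="hann"))
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel = np.log(mel_fb @ spec + 1e-6)                        # S_m(t)
    mel = (mel - mel.min()) / (mel.max() - mel.min() + 1e-9)  # normalize
    img = Image.fromarray((mel * 255).astype(np.uint8), mode="L")
    return img.convert("RGB").resize(size)                    # C=3, H, W
```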
S103: generating a triplet training set and a triplet tag set according to the model training image set and the tag set;
because the present embodiment selects the ternary loss function (TripletLoss Function) for training the model, it is necessary to construct a triplet training dataset with model training.
Specifically, the model training image set $X_{img}$ and the tag set $Y$ are grouped according to label class; then, for each group of data, one sample is randomly selected as the anchor point (Anchor), one sample similar to the anchor (of the same class) is randomly selected as the positive sample (Positive), and one sample dissimilar to the anchor (of a different class) is selected as the negative sample (Negative); the anchor, the positive sample and the negative sample are then combined into one triplet training sample $(x_a, x_p, x_n)$, and its corresponding label $(y_a, y_p, y_n)$ is recorded; the triplet training sample $(x_a, x_p, x_n)$ and its corresponding label $(y_a, y_p, y_n)$ are then stored in the triplet training set $X_{tri}$ and the triplet tag set $Y_{tri}$ respectively; these steps are repeated until a sufficient number of triplet training samples have been generated.
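A sketch of the triplet sampling described above (illustrative; the helper name build_triplets and the triplet count are assumptions):

```python
import random
from collections import defaultdict

def build_triplets(images, labels, n_triplets=10000, seed=0):
    """Group samples by label, then repeatedly draw an anchor and a
    positive from one class and a negative from a different class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for img, lab in zip(images, labels):
        by_label[lab].append(img)
    multi = [l for l, v in by_label.items() if len(v) >= 2]
    x_tri, y_tri = [], []
    for _ in range(n_triplets):
        pos_lab = rng.choice(multi)
        neg_lab = rng.choice([l for l in by_label if l != pos_lab])
        anchor, positive = rng.sample(by_label[pos_lab], 2)
        negative = rng.choice(by_label[neg_lab])
        x_tri.append((anchor, positive, negative))
        y_tri.append((pos_lab, pos_lab, neg_lab))
    return x_tri, y_tri
```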
S104: extracting features of the triplet training set and the triplet tag set through a deep learning model, and classifying through a multi-layer perceptron to obtain noise types;
specifically, inputting a spectrogram of the triplet training set into a deep learning model; dividing an input spectrogram into mutually non-overlapping small blocks, splicing in the channel dimension, and finally flattening all the small blocks into a sequence to obtain sequence characteristics; then, carrying out linear mapping on the sequence features in the channel dimension to generate high-dimension features; then respectively carrying out downsampling for the high-dimensional features by 8 times, 16 times and 32 times, and carrying out feature fusion to obtain feature information of different scales and feature information in a global range; and then compressing the feature information of different scales and the feature information in the global range into a one-dimensional vector serving as a feature vector.
In one embodiment, considering the complexity of actual deployment scenes, differences between device sensors, the diversity of noise and other factors, the triplet data set is encoded using a triplet network based on the Swin-Transformer model; similarity between the encoding results is measured through the triplet loss function, and the model is optimized accordingly according to the measurement result. The trained Swin-Transformer model is better at extracting the characteristics of different noise categories, so the accuracy and performance of noise recognition can be improved. The feature extraction steps in this embodiment are as follows: the spectrogram of the triplet training set is input into the Swin-Transformer model; the input spectrogram is divided into mutually non-overlapping small patches, which are spliced in the channel dimension, and all patches are finally flattened into a sequence to obtain sequence features; the sequence features are then linearly mapped in the channel dimension to generate high-dimensional features; the high-dimensional features are then downsampled by 8, 16 and 32 times respectively (as shown in fig. 3: the left side of the figure is the input spectrogram, the right side the multiple downsamplings of the high-dimensional features), and feature fusion is performed to obtain feature information at different scales as well as feature information over the global range; the feature information at different scales and the global feature information are then compressed into a one-dimensional vector, which serves as the feature vector.
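For illustration, the parameter-shared encoder could be instantiated with the timm library's Swin implementation, which returns the pooled one-dimensional feature vector when num_classes=0 (a sketch under that assumption; the tiny variant and 224x224 input are arbitrary choices):

```python
import torch
import timm

# Shared Swin backbone; num_classes=0 makes the model return the pooled
# one-dimensional feature vector instead of classification logits.
encoder = timm.create_model("swin_tiny_patch4_window7_224",
                            pretrained=False, num_classes=0)
encoder.eval()

with torch.no_grad():
    batch = torch.randn(3, 3, 224, 224)  # e.g. anchor/positive/negative
    feats = encoder(batch)               # shape (3, 768) feature vectors
```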
The triplet network based on the Swin-Transformer model consists overall of the triplet training data set, a parameter-shared Swin-Transformer model, and the triplet loss function used to calculate sample similarity. The Swin Transformer Block in the Swin-Transformer model consists of a window-based multi-head self-attention (W-MSA) module and a shifted-window multi-head self-attention (SW-MSA) module: the former realizes information exchange within a window, i.e. local feature extraction; the latter enables a larger range of information capture, i.e. long-range dependency modeling, through window shifting.
It should be noted that the Swin-Transformer model is a novel deep neural network architecture based on the Transformer structure, capable of handling tasks such as large-scale image classification. Compared with the traditional convolutional neural network (CNN) structure, the Swin Transformer uses an extended Transformer architecture to handle problems such as long-range dependencies in visual scenes, and achieves multi-scale feature fusion and high computational efficiency through its windowing scheme.
Because the noise signal has both temporal and spectral properties, the Swin-Transformer model is selected in this embodiment for the following reasons:
windowed feature extraction mode: the Swin-transducer divides the input spectrogram into a plurality of equal blocks (Patch), calculates the correlation between the blocks by Self-Attention (Self-Attention) in each window, realizes the information exchange in the window, is favorable for extracting the information of different scales and positions, and can flexibly balance the weight relation among the features of different scales.
Modeling long-distance dependency relationship: the Swin-transform uses a movable window structure to establish long-distance connection between pixels on a spectrogram in each movable window, and fully considers the relation between different local parts and global feature extraction.
Interaction of spatial and temporal information: the Swin-transform uses a cascading local window communication mechanism to expand the dimensions and fuse the information of the images, thereby being beneficial to realizing advanced interaction in two dimensions of space and time.
After feature extraction through the Swin-Transformer model, the Swin-Transformer model is optimized through a loss function, and the feature vectors extracted by the Swin-Transformer model are classified.
The optimization of the Swin-Transformer model is described in detail below:
in the embodiment, a triplet loss function is adopted to optimize a Swin-transducer model, the triplet loss function is a common loss function in deep learning and is mainly applied to similar deep network structures such as a twin network and a triplet network, and the training strategy is based on optimization of the distance between the same category and the distance between different categories, so that characteristic representation of a sample is learned; the formula for the triplet loss function is as follows:
wherein,the Swin-transducer model output representing the triplet training sample is one-dimensional vector; margin is a super-parameter representing the threshold of similarity between the same class of samples and different classes of samplesThe value is usually positive;indicating Euclidean distance between Anchor sample and Positive sample, and similarly, < ->The Euclidean distance between the Anchor sample and the Negative sample is represented as follows:
in general, the values of the triplet loss function are three cases:
easy triples:i.e.The similar samples (Anchor and Positive) are very close, while the heterogeneous samples (Anchor and Negative) are very far apart, so that optimization is not needed;
difficult triples (Hard triples):i.e. < ->In the case, contrary to the relaxed triples, the similar samples are far away, and the heterogeneous samples are near away, so that the loss value is maximum and optimization is needed;
general triples (Semi-hard triples):i.e. The distance of the similar samples is closer than that of the heterogeneous samples, the constraint condition is nearly satisfied, but a space exists for continuous optimization.
In this embodiment, a back-propagation algorithm is adopted to update the parameters of the neural network, shortening the distance between samples of the same class and enlarging the distance between samples of different classes, so that hard triplets are optimized into easy triplets, realizing the optimization of the Swin-Transformer model (as shown in fig. 4);
specifically, calculating the feature vector through a triplet loss function to obtain a loss result; then calculating corresponding gradients according to the loss result and through a back propagation algorithm; the parameters of the deep learning model are then updated according to the gradient usage optimizer.
In one embodiment, the feature vector is evaluated by the triplet loss function to obtain a loss result; the corresponding gradients are then calculated from the loss result through the back-propagation algorithm; the parameters of the Swin-Transformer model are then updated with the Adam optimizer according to the gradients; these steps are repeated until a preset number of training rounds is reached or a convergence condition is met. Gradient back-propagation starts from the loss function, with the gradient of each parameter calculated and updated layer by layer; a different batch of the data set is used in each training cycle.
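Put together, the optimization loop might look like this sketch (illustrative only: triplet_loader is an assumed data loader yielding anchor/positive/negative image batches, and the learning rate, margin and epoch count are placeholder values):

```python
import torch
import timm

encoder = timm.create_model("swin_tiny_patch4_window7_224", num_classes=0)
loss_fn = torch.nn.TripletMarginLoss(margin=0.3, p=2.0)  # built-in triplet loss
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)

for epoch in range(20):                  # preset number of training rounds
    for a, p, n in triplet_loader:       # assumed loader; new batch each cycle
        loss = loss_fn(encoder(a), encoder(p), encoder(n))
        optimizer.zero_grad()
        loss.backward()                  # gradients computed layer by layer
        optimizer.step()                 # Adam parameter update
```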
The classification is described in detail below:
after feature extraction is completed based on a Swin-transducer model, fine Tuning (Fine-Tuning) is required to be performed on a specific task so as to improve the recognition accuracy of noise types; the specific operation is to fine tune the network structure based on the pre-training model, and add additional network structure (as shown in fig. 5), wherein the network structure mainly comprises a Multi-Layer Perceptron (MLP), and the training of the model is completed under the structures of a plurality of activation functions, regularization layers and the like; the SoftMax layer is used as an output layer of the network and directly outputs a predictive tag of the noise signal.
Specifically, the feature vector is input into the multi-layer perceptron; the feature vector then passes twice in succession through a fully-connected layer, a regularization layer and an activation function layer and is processed accordingly; the processed feature vector is input into a fully-connected layer and mapped to class probabilities through a SoftMax layer; the maximum of the class probabilities is then taken as the predicted label, yielding the noise category.
In one embodiment, the feature vector is input into the multi-layer perceptron; the feature vector is then processed twice in succession, passing sequentially through a Linear layer, a Dropout layer, a Batch-Norm layer and a GELU layer; the processed feature vector is input into a Linear layer and mapped to class probabilities through the SoftMax layer; the maximum of the class probabilities is then taken as the predicted label, yielding the noise category. Here the Linear layer is a fully-connected layer, the Dropout layer and the Batch-Norm layer are regularization layers, and the GELU layer is an activation function layer; the class probabilities are real numbers.
In this embodiment, the calculation formula of the predicted label is as follows:

$$\hat{y}=\arg\max_{c}\,p_c$$

wherein $\hat{y}$ denotes the category with the highest probability, to which the sample is assigned; $p_c$ is the probability that the output $x_c$ belongs to class $c$, calculated by the SoftMax function as follows:

$$p_c=\frac{e^{x_c}}{\sum_{c'=1}^{C}e^{x_{c'}}}$$
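The classification head described above could be sketched as follows (an illustration, not the original network: the 768 input dimension, 256 hidden width, dropout rate and class count are assumed values):

```python
import torch
import torch.nn as nn

class NoiseHead(nn.Module):
    """Two Linear -> Dropout -> BatchNorm -> GELU stages, then a Linear
    classifier; SoftMax turns the logits into class probabilities."""
    def __init__(self, in_dim=768, hidden=256, n_classes=10, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Dropout(p_drop),
            nn.BatchNorm1d(hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.Dropout(p_drop),
            nn.BatchNorm1d(hidden), nn.GELU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # raw logits x_c

head = NoiseHead().eval()
feats = torch.randn(4, 768)                  # feature vectors from the encoder
probs = torch.softmax(head(feats), dim=-1)   # p_c
pred = probs.argmax(dim=-1)                  # y_hat = argmax_c p_c
```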
s105: and analyzing the noise category to obtain a final monitoring result.
Specifically, the noise categories are aligned along the time dimension, a window length of a certain duration is set, and the noise categories within the window are statistically analyzed to obtain the final monitoring result. By aligning the noise categories along the time dimension and setting a window length of fixed duration, this embodiment can ensure the relative stability of the data volume in each window, avoiding analysis errors caused by data fluctuation, while different data analysis and monitoring requirements can be met flexibly by adjusting the window length; for example, a shorter window length may be selected when finer analysis is desired, and a longer window length when a macroscopic grasp of changes in the noise environment is desired.
Referring to fig. 6, taking a prolonged "dog barking disturbance" as an example: for the environmental audio collected in real time by the sound pickup device, noise decibel diagnosis is first performed to judge whether the decibel value of the environmental audio is higher than the predetermined threshold (the Threshold line in the figure). Before time $t_0$ the threshold is not exceeded; after time $t_0$, the decibel value of the environmental audio exceeds the threshold and the duration reaches the predetermined time length, so the noise recognition model is triggered automatically. Based on a window of audio slices (the small frames in the figure; the window length of the large frame is set to 5 audio slices, i.e. 5 times that of a small frame), the model infers and outputs the recognition result "dog barking" for each audio slice, and the statistical analysis is completed. The analysis method is as follows: the window purity $P_{wind}$ represents the proportion of the noise identified by the model within the window; when $P_{wind}$ is higher than the noise proportion threshold $R_{wind}$, a property manager is notified through the related equipment and goes to the scene for handling. In this example, the proportion of small frames identified as "dog barking" is higher than 80%, so the final analysis result is a "dog barking disturbance" event, and the property is notified to take corresponding measures; it can be understood that the settings of the thresholds (and of $R_{wind}$) may be determined according to the actual conditions of the noise monitoring scene.
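A sketch of this window-purity analysis (illustrative; the window length of 5 slices and the 80% purity threshold follow the example above, while the function and label names are assumptions):

```python
from collections import Counter

def window_alerts(slice_labels, window_len=5, purity_threshold=0.8,
                  quiet_label="0"):
    """Slide a window over per-slice predictions; report the majority
    noise class whenever its window purity P_wind reaches the threshold."""
    alerts = []
    for i in range(0, len(slice_labels) - window_len + 1, window_len):
        window = slice_labels[i:i + window_len]
        label, count = Counter(window).most_common(1)[0]
        p_wind = count / window_len
        if label != quiet_label and p_wind >= purity_threshold:
            alerts.append((i, label, p_wind))  # e.g. notify property staff
    return alerts

# Example: four of five consecutive slices recognized as dog barking
print(window_alerts(["bark", "bark", "bark", "0", "bark"]))
```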
Referring to fig. 7, the embodiment provides an end-to-end community noise monitoring device 700, which includes:
the collecting unit 701 is configured to collect an audio data set, and perform preprocessing on the audio data set to obtain a training data set and a tag set;
an extracting unit 702, configured to perform frequency domain feature extraction and format conversion on the training data set to obtain a model training image set;
a generating unit 703, configured to generate a triplet training set and a triplet tag set according to the model training image set and the tag set;
the classifying unit 704 is configured to perform feature extraction on the triplet training set and the triplet tag set through a deep learning model, and classify the triplet training set and the triplet tag set through a multi-layer perceptron to obtain a noise class;
and the analysis unit 705 is used for analyzing the noise category to obtain a final monitoring result.
Further, the collecting unit 701 includes:
a reading subunit for reading the original audio samples within the audio data set at a specified audio sampling rate;
a segmentation subunit for splitting the original audio samples into a plurality of sub-audio samples based on the sample duration and the sample overlap rate;
a judging subunit for respectively judging whether each of the plurality of sub-audio samples is a complete model training sample;
a padding subunit for performing padding processing on a sub-audio sample if it is not a complete model training sample, and performing no processing if it is;
a marking subunit for marking each sub-audio sample according to the noise category index;
and a storage subunit for respectively storing the marked sub-audio samples and the labels corresponding to the sub-audio samples into the training data set and the label set.
Further, the extracting unit 702 includes:
a traversing subunit, configured to traverse the training data set and read audio samples in the training data set at a specified audio sampling rate;
the framing subunit is used for framing and windowing the audio samples based on frame length and frame movement, and performing short-term Fourier transform on each frame of signal to obtain frequency spectrum information of the audio samples;
a construction subunit for constructing a mel-filter bank based on the number of mel-filters;
the calculating subunit is used for carrying out convolution operation and logarithmic operation on the Mel filter bank and the frequency spectrum information to obtain Mel frequency spectrum characteristics;
a normalization subunit, configured to normalize the mel spectrum feature;
an image conversion subunit, configured to convert the normalized mel-frequency spectrum feature into a mel-frequency spectrum feature image according to the image size;
and the image storage subunit is used for storing the Mel frequency spectrum characteristic image in the model training image set.
Further, the classifying unit 704 includes:
an input subunit, configured to input a spectrogram of the triplet training set into the deep learning model;
the splicing subunit is used for dividing an input spectrogram into small blocks which are not overlapped with each other, splicing the small blocks in the channel dimension, and finally flattening all the small blocks into a sequence to obtain sequence characteristics;
a mapping subunit, configured to linearly map the sequence feature in a channel dimension, and generate a high-dimension feature;
the feature fusion subunit is used for respectively carrying out downsampling for 8 times, 16 times and 32 times on the high-dimensional features, and carrying out feature fusion to obtain feature information of different scales and feature information in a global range;
and the compression subunit is used for compressing the characteristic information of the different scales and the characteristic information in the global range into a one-dimensional vector serving as a characteristic vector.
Further, the classifying unit 704 further includes:
a vector input subunit, configured to input the feature vector into the multi-layer perceptron;
a processing subunit, configured to pass the feature vector twice in succession and sequentially through a fully-connected layer, a regularization layer and an activation function layer for corresponding processing;
a class probability mapping subunit, configured to input the processed feature vector into the fully-connected layer and map the feature vector to class probabilities through a SoftMax layer;
and a noise category obtaining subunit, configured to take the maximum of the class probabilities as the predicted label to obtain the noise category.
Further, the compression subunit further includes:
the vector calculation subunit is used for calculating the feature vector through a triplet loss function to obtain a loss result;
a gradient calculation subunit, configured to calculate a corresponding gradient according to the loss result and through a back propagation algorithm;
and the updating subunit is used for updating parameters of the deep learning model according to the gradient use optimizer.
Further, the analysis unit 705 includes:
and the alignment subunit is used for aligning the noise categories according to the time dimension, setting the window length with a certain duration, and carrying out statistical analysis on the noise categories in the window to obtain a final monitoring result.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working procedures of the apparatus and units described above may refer to the corresponding procedures in the foregoing method embodiments, which are not described herein again.
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed, implements the method provided by the above embodiments. The storage medium may include: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The invention also provides a computer device, which can comprise a memory and a processor, wherein the memory stores a computer program, and the processor can realize the method provided by the embodiment when calling the computer program in the memory. Of course the computer device may also include various network interfaces, power supplies, and the like.
In the description, each embodiment is described in a progressive manner, and each embodiment is mainly described by the differences from other embodiments, so that the same similar parts among the embodiments are mutually referred. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the invention can be made without departing from the principles of the invention and these modifications and adaptations are intended to be within the scope of the invention as defined in the following claims.
It should also be noted that in this specification, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprise", "include", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

Claims (10)

1. An end-to-end community noise monitoring method, comprising:
collecting an audio data set, and preprocessing the audio data set to obtain a training data set and a label set;
extracting frequency domain characteristics and converting formats of the training data set to obtain a model training image set;
generating a triplet training set and a triplet tag set according to the model training image set and the tag set;
extracting features of the triplet training set and the triplet tag set through a deep learning model, and classifying through a multi-layer perceptron to obtain noise types;
and analyzing the noise category to obtain a final monitoring result.
2. The end-to-end community noise monitoring method of claim 1, wherein the collecting the audio data set and preprocessing the audio data set to obtain the training data set and the tag set comprises:
reading original audio samples within the audio data set at a specified audio sample rate;
splitting the original audio sample into a plurality of sub-audio samples based on sample duration and sample overlap ratio;
respectively judging whether a plurality of sub-audio samples are a complete model training sample;
if not, performing padding processing on the sub-audio samples; if yes, performing no processing;
marking each sub-audio sample according to the noise category index;
and respectively storing the marked sub-audio samples and the labels corresponding to the sub-audio samples into the training data set and the label set.
3. The end-to-end community noise monitoring method of claim 1, wherein the performing frequency domain feature extraction and format conversion on the training data set to obtain a model training image set comprises:
traversing the training data set and reading audio samples within the training data set at a specified audio sampling rate;
framing and windowing the audio samples based on frame length and frame shift, and performing short-term Fourier transform on each frame of signal to obtain frequency spectrum information of the audio samples;
constructing a mel filter bank based on the number of mel filters;
performing convolution operation and logarithmic operation on the Mel filter group and the frequency spectrum information to obtain Mel frequency spectrum characteristics;
normalizing the Mel spectrum characteristics;
converting the normalized Mel spectrum feature into Mel spectrum feature image according to image size;
and storing the Mel spectrum characteristic image in the model training image set.
4. The end-to-end community noise monitoring method of claim 3, wherein the feature extraction of the triplet training set and the triplet tag set by a deep learning model comprises:
inputting the spectrogram of the triplet training set into the deep learning model;
dividing an input spectrogram into mutually non-overlapping small blocks, splicing in the channel dimension, and finally flattening all the small blocks into a sequence to obtain sequence characteristics;
linearly mapping the sequence features in the channel dimension to generate high-dimension features;
downsampling the high-dimensional features by 8 times, 16 times and 32 times respectively, and carrying out feature fusion to obtain feature information of different scales and feature information in a global range;
and compressing the characteristic information of the different scales and the characteristic information in the global range into a one-dimensional vector serving as a characteristic vector.
5. The end-to-end community noise monitoring method of claim 4, wherein the classifying by the multi-layer perceptron comprises:
inputting the feature vector into the multi-layer perceptron;
passing the feature vector twice in succession and sequentially through a fully-connected layer, a regularization layer and an activation function layer for corresponding processing;
inputting the processed feature vector into the fully-connected layer, and mapping the feature vector to class probabilities through a SoftMax layer;
and taking the maximum value of the class probability as a prediction label to obtain the noise class.
6. The end-to-end community noise monitoring method according to claim 4, wherein compressing the feature information of the different scales and the feature information in the global scope into a one-dimensional vector, as a feature vector, further comprises:
calculating the feature vector through a triplet loss function to obtain a loss result;
calculating corresponding gradients according to the loss result and through a back propagation algorithm;
the parameters of the deep learning model are updated according to the gradient using an optimizer.
7. The method for end-to-end community noise monitoring according to claim 1, wherein the analyzing the noise class to obtain a final monitoring result comprises: and aligning the noise categories according to time dimension, setting window length with a certain time length, carrying out statistical analysis on the noise categories in the window, and obtaining a final monitoring result.
8. An end-to-end community noise monitoring device, comprising:
the collecting unit is used for collecting the audio data set, preprocessing the audio data set and obtaining a training data set and a label set;
the extraction unit is used for carrying out frequency domain feature extraction and format conversion on the training data set to obtain a model training image set;
the generating unit is used for generating a triplet training set and a triplet label set according to the model training image set and the label set;
the classification unit is used for extracting the characteristics of the triplet training set and the triplet label set through a deep learning model and classifying through a multi-layer perceptron to obtain noise types;
and the analysis unit is used for analyzing the noise category to obtain a final monitoring result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the end-to-end community noise monitoring method of any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the end-to-end community noise monitoring method of any of claims 1 to 7.
CN202310950511.9A 2023-07-31 2023-07-31 End-to-end community noise monitoring method and device and related components Active CN117037847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310950511.9A CN117037847B (en) 2023-07-31 2023-07-31 End-to-end community noise monitoring method and device and related components

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310950511.9A CN117037847B (en) 2023-07-31 2023-07-31 End-to-end community noise monitoring method and device and related components

Publications (2)

Publication Number Publication Date
CN117037847A true CN117037847A (en) 2023-11-10
CN117037847B CN117037847B (en) 2024-05-03

Family

ID=88623716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310950511.9A Active CN117037847B (en) 2023-07-31 2023-07-31 End-to-end community noise monitoring method and device and related components

Country Status (1)

Country Link
CN (1) CN117037847B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690303A (en) * 2024-02-04 2024-03-12 四川三元环境治理股份有限公司 Noise early warning system, device and early warning method based on traffic data acquisition


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107180628A (en) * 2017-05-19 2017-09-19 百度在线网络技术(北京)有限公司 Set up the method, the method for extracting acoustic feature, device of acoustic feature extraction model
CN112912897A (en) * 2018-09-28 2021-06-04 索尼互动娱乐股份有限公司 Sound classification system
CN113645439A (en) * 2021-06-22 2021-11-12 宿迁硅基智能科技有限公司 Event detection method and system, storage medium and electronic device
US11544943B1 (en) * 2022-05-31 2023-01-03 Intuit Inc. Entity extraction with encoder decoder machine learning model
CN115204237A (en) * 2022-07-29 2022-10-18 郑州大学 Swin-transform-based short wave protocol signal automatic identification method
CN116186635A (en) * 2022-09-08 2023-05-30 深圳市万物云科技有限公司 Cross-modal attention mechanism-based event detection method, device and related medium
CN115798515A (en) * 2023-02-06 2023-03-14 北京石油化工学院 Transform-based sound scene classification method
CN116246654A (en) * 2023-02-13 2023-06-09 南昌大学 Breathing sound automatic classification method based on improved Swin-transducer
CN116469153A (en) * 2023-05-26 2023-07-21 西南民族大学 Specific target lip language identification method based on deep learning
CN116884435A (en) * 2023-08-01 2023-10-13 华南师范大学 Voice event detection method and device based on audio prompt learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HOSNA DARVISHI, ET AL.: "Fashion Compatibility Learning Via Triplet-Swin Transformer", 《2023 28TH INTERNATIONAL COMPUTER CONFERENCE, COMPUTER SOCIETY OF IRAN (CSICC)》, vol. 61, 26 April 2024 (2024-04-26) *
郭岚: "Research on Video Semantic Understanding and Description Text Generation Based on Multi-Features and Transformer" [基于多特征和Transformer的视频语义理解与描述文本生成研究], China Masters' Theses Full-text Database (Information Science and Technology), no. 2, 15 February 2023 (2023-02-15)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690303A (en) * 2024-02-04 2024-03-12 四川三元环境治理股份有限公司 Noise early warning system, device and early warning method based on traffic data acquisition
CN117690303B (en) * 2024-02-04 2024-04-26 四川三元环境治理股份有限公司 Noise early warning system, device and early warning method based on traffic data acquisition

Also Published As

Publication number Publication date
CN117037847B (en) 2024-05-03

Similar Documents

Publication Publication Date Title
CN111325095B (en) Intelligent detection method and system for equipment health state based on acoustic wave signals
US20220245945A1 (en) Video anomaly detection method based on human-machine cooperation
CN111048114A (en) Equipment and method for detecting abnormal sound of equipment
CN110310666B (en) Musical instrument identification method and system based on SE convolutional network
CN117037847B (en) End-to-end community noise monitoring method and device and related components
CN110717481A (en) Method for realizing face detection by using cascaded convolutional neural network
CN110751044A (en) Urban noise identification method based on deep network migration characteristics and augmented self-coding
CN109166591B (en) Classification method based on audio characteristic signals
CN113869162A (en) Violation identification method and system based on artificial intelligence
CN111859010A (en) Semi-supervised audio event identification method based on depth mutual information maximization
CN114023354A (en) Guidance type acoustic event detection model training method based on focusing loss function
CN115878832B (en) Ocean remote sensing image audio retrieval method based on fine pair Ji Panbie hash
CN115101076B (en) Speaker clustering method based on multi-scale channel separation convolution feature extraction
CN116741148A (en) Voice recognition system based on digital twinning
CN113707175B (en) Acoustic event detection system based on feature decomposition classifier and adaptive post-processing
CN115376526A (en) Power equipment fault detection method and system based on voiceprint recognition
CN116842460A (en) Cough-related disease identification method and system based on attention mechanism and residual neural network
CN115953666A (en) Transformer substation field progress identification method based on improved Mask-RCNN
CN114974229A (en) Method and system for extracting abnormal behaviors based on audio data of power field operation
CN114169364A (en) Electroencephalogram emotion recognition method based on space-time diagram model
CN112562698B (en) Power equipment defect diagnosis method based on fusion of sound source information and thermal imaging characteristics
CN113539298B (en) Sound big data analysis and calculation imaging system based on cloud edge end
CN116884435A (en) Voice event detection method and device based on audio prompt learning
CN116741159A (en) Audio classification and model training method and device, electronic equipment and storage medium
CN116778963A (en) Automobile whistle recognition method based on self-attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant