CN115101091A - Sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion - Google Patents

Sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion

Info

Publication number
CN115101091A
Authority
CN
China
Prior art keywords
data
features
sound
static
dynamic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210506993.4A
Other languages
Chinese (zh)
Inventor
吕建飞
钱汉望
宋林森
刘华巍
李宝清
袁晓兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Scifine Iot Technology Co ltd
Original Assignee
Shanghai Scifine Iot Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Scifine Iot Technology Co ltd filed Critical Shanghai Scifine Iot Technology Co ltd
Priority to CN202210506993.4A priority Critical patent/CN115101091A/en
Publication of CN115101091A publication Critical patent/CN115101091A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/09 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being zero crossing rates
    • G10L25/12 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion. The method comprises the following steps: S1, preprocessing the input sound data to obtain sound segment data; S2, extracting static and dynamic combined features J from the sound segment data; S3, extracting initial input features K from the sound segment data; S4, processing the static and dynamic combined features J and the initial input features K through a fully-connected neural network to obtain an attention weight vector H; S5, calculating the final input features G, where G = J × H + K × (1 - H); and S6, inputting the final input features G into an LSTM network to obtain the classification category m of the sound data. According to the invention, dynamic difference features are added to link the previous frame with the next frame, and the resources that would otherwise be distributed evenly across the dimensions of the fused features are redistributed according to the attention weights, so that strong feature components are reinforced and weak ones are compensated. The fused features therefore become more effective and more discriminative, and the classification accuracy is improved.

Description

Sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion
Technical Field
The invention relates to the field of sound data processing, in particular to a sound data classification method based on multi-dimensional feature weighted fusion, a terminal and a computer readable storage medium.
Background
Field vehicle targets are identified from vehicle sound data collected by an acoustic array. Classification and identification are generally divided into two steps: first, features of the target sound signal are extracted; then, a classifier is designed to obtain a decision result. Commonly used acoustic signal features are the linear prediction coefficients (LPC), Mel-frequency cepstral coefficients (MFCC) and the zero-crossing rate (ZCR).
On the other hand, with the current rise of deep learning, extracting features through a neural network has also become common. On the basis of these primary features, classification models generally perform multi-feature fusion to obtain fused features with stronger expressive power. The prior art generally uses two fusion modes: (1) concatenating the features into a fused feature, so that the feature dimension becomes twice the original dimension; (2) adding the corresponding dimensions to form the fused feature, so that the feature dimension remains unchanged. Multi-feature fusion is widely used in fields such as language identification, speaker identification and voiceprint recognition, and has in recent years been successfully applied to the classification and identification of vehicle targets. However, both fusion methods treat the contribution of each feature as equal by default; they neither highlight the more effective feature components nor suppress the useless ones, which may cause a dimensionality disaster, fail to reach the intended goal of feature fusion (namely, complementing each feature's weaknesses with the others' strengths), and degrade the fusion result.
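For illustration only, the two conventional fusion modes described above can be sketched as follows in Python; the tensor names and shapes are hypothetical and not taken from the prior art being discussed:

```python
import torch

# Two hypothetical per-frame feature matrices of shape (frames, dims).
feat_a = torch.randn(18, 64)
feat_b = torch.randn(18, 64)

# Mode (1): feature splicing (concatenation) -- the feature dimension doubles (64 -> 128).
fused_concat = torch.cat([feat_a, feat_b], dim=-1)   # shape (18, 128)

# Mode (2): element-wise addition -- the feature dimension is unchanged (64).
fused_sum = feat_a + feat_b                           # shape (18, 64)

# Both modes implicitly give the two feature sets equal contributions, which is the
# limitation that the attention-weighted fusion of the present method addresses.
print(fused_concat.shape, fused_sum.shape)
```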
Disclosure of Invention
The invention aims to provide a sound data classification method based on multi-dimensional feature weighted fusion, a terminal and a storage medium, so as to solve the above problems. To this end, the technical scheme adopted by the invention is as follows:
according to an aspect of the present invention, there is provided a sound data classification method based on multi-dimensional feature weighted fusion, including the following steps:
s1, preprocessing the input sound data to obtain sound segment data;
s2, extracting static and dynamic combined features J from the sound segment data;
s3, extracting initial input features K from the sound segment data;
s4, processing the static dynamic combination characteristics J and the initial input characteristics K through a fully-connected neural network to obtain an attention weight vector H;
s5, calculating a final input feature G, G ═ J × H + K (1-H);
and S6, inputting the final input features G into an LSTM network to obtain the classification category m of the sound data.
Further, the specific process of S1 is:
s11, performing a-time down-sampling processing on the sound data of the p channel, wherein the length of each frame of data is (1/a) × fs, and fs is the sampling frequency of the sound data;
s12, randomly selecting certain channel data as input;
and S13, cutting the channel data into segments of b frames to obtain the sound segment data, wherein the segment shift of the sound segment data is 1 frame.
Further, a is 8 or 16, b is 5-50, and fs is 22 kHz, 44 kHz or 88 kHz.
Further, the specific process of S2 is:
s21, removing the head and tail frames of each cut sound segment data, and extracting (b-2) N-dimensional static features from the sound segment data;
s22, extracting N-order dynamic difference features from the static features, and generating (b-2) N-dimensional dynamic difference features in the same way;
s23, combining the static and dynamic characteristics together, i.e. generating the static and dynamic combined characteristics J of dimension (b-2) × 2 × N.
Further, N = 32; specifically, the 32-dimensional static features comprise 8-dimensional linear prediction coefficients, 23-dimensional Mel-frequency cepstral coefficients and a 1-dimensional zero-crossing rate.
Further, the specific process of S3 is: sending the b-2 frames of the b-frame sound segment data into a deep learning network to generate the initial input features K of dimension (b-2) × 2 × N.
Further, the network structure of the deep learning network is an 8-layer convolutional neural network with convolution kernel size 3 and stride 2 or 1, and batch normalization (BatchNorm1d), a ReLU activation function and max pooling (MaxPool1d) are added after each convolution layer.
Further, the specific process of S4 is:
s41, splicing and combining the static and dynamic combined features J with the initial input features K in the dimensions of (b-2) × 2 × N into combined features in the dimensions of (b-2) × 4 × N through feature splicing;
s42, reducing the dimension of the combined feature from (b-2) × 4 × N to (b-2) × 2 × N through the 3 layers of full-connection neural network;
s43, normalizing the attention weight vector to be between 0 and 1 through the softmax layer to form an attention weight vector H with the dimension of (b-2) x 2 x N;
the number of hidden layer neurons of the first layer of fully-connected network, the second layer of fully-connected network and the third layer of fully-connected network F3 is 4 × N, 3 × N and 2 × N respectively.
According to another aspect of the present invention, there is also provided a terminal comprising a processor and a memory, the memory storing program instructions, wherein the processor executes the program instructions to implement the steps of the method described above.
According to yet another aspect of the present invention, there is also provided a computer readable storage medium, wherein the computer readable storage medium stores program instructions executable by a processor to implement the steps of the method as described above.
According to the invention, dynamic difference features are added to link the previous frame with the next frame, so that the dynamic features supplement the static features and the combination of static and dynamic features is more discriminative. The resources that would otherwise be distributed evenly across the different dimensions of the fused features are redistributed according to the attention weights: more important feature components receive larger weights and thus a larger share of the resources, while less important components receive a smaller share. In this way the strong components are reinforced and the weak ones are compensated, the fused features become more effective and more discriminative, and the classification accuracy is improved.
Drawings
FIG. 1 is a flow chart of a method for classifying sound data based on multi-dimensional feature weighted fusion according to an embodiment of the present invention;
fig. 2 is a schematic diagram of attention weight vector generation of a sound data classification method based on multi-dimensional feature weighted fusion according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings so that the objects, features and advantages of the invention can be more clearly understood. It should be understood that the embodiments shown in the drawings are not intended to limit the scope of the present invention, but are merely intended to illustrate the essential spirit of the technical solution of the present invention.
In the following description, for the purposes of illustrating various disclosed embodiments, certain specific details are set forth in order to provide a thorough understanding of the various disclosed embodiments. One skilled in the relevant art will recognize, however, that the embodiments may be practiced without one or more of the specific details. In other instances, well-known devices, structures and techniques associated with this application may not be shown or described in detail to avoid unnecessarily obscuring the description of the embodiments.
Reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In the following description, for the purposes of clearly illustrating the structure and operation of the present invention, directional terms are used, but the terms "front", "rear", "left", "right", "outer", "inner", "outward", "inward", "upper", "lower", etc. should be interpreted as words of convenience and should not be interpreted as limiting terms.
As shown in fig. 1, a first embodiment of the present invention provides a sound data classification method based on multidimensional feature weighted fusion, including the following steps:
and S1, preprocessing the input voice data to obtain voice segment data. The specific process is as follows:
s11, a-time down-sampling processing is carried out on the sound data of the p channels, wherein the length of each frame of data is (1/a) × fs, p is more than or equal to 1, and fs is the sampling frequency of the sound data. Here, the sound data is, specifically, vehicle sound data collected by using a uniform circular array, and in order to make the vehicle sound data usable for various purposes, the sampling frequency is usually relatively high, for example, the sampling frequency may be 22kHz, 44kHz, or 88 kHz. For the classification purpose of the present application, the sampling frequency is too high, the data amount is too large, and the data processing is not facilitated, so that a-time down-sampling processing needs to be performed in advance to reduce the data amount. The value of a can be selected from 8 or 16 according to actual conditions.
And S12, randomly selecting the data of one channel as input.
And S13, cutting the channel data into segments of b frames to obtain the sound segment data, wherein the segment length is b frames and the segment shift is 1 frame. The value of b is selected according to the implementation situation and can generally be chosen between 5 and 50.
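A minimal sketch of the preprocessing of S11-S13 is given below, assuming the sound data is available as a NumPy array; the function name, the use of scipy.signal.decimate for the a-fold down-sampling and the random channel choice via NumPy are illustrative assumptions rather than requirements of the method:

```python
import numpy as np
from scipy.signal import decimate

def preprocess(sound: np.ndarray, fs: int, a: int = 8, b: int = 20) -> np.ndarray:
    """sound: array of shape (p, samples) with p >= 1 channels."""
    # S11: a-fold down-sampling of every channel; one frame = fs / a samples.
    down = np.stack([decimate(ch, a) for ch in sound])
    frame_len = fs // a

    # S12: randomly select the data of one channel as input.
    channel = down[np.random.randint(down.shape[0])]

    # Split the selected channel into consecutive frames of frame_len samples.
    n_frames = len(channel) // frame_len
    frames = channel[: n_frames * frame_len].reshape(n_frames, frame_len)

    # S13: cut into segments of b frames with a segment shift of 1 frame.
    segments = np.stack([frames[i : i + b] for i in range(n_frames - b + 1)])
    return segments  # shape (num_segments, b, frame_len)

# Example: a 3-channel recording of 30 seconds sampled at 22 kHz, a = 8, b = 20.
segments = preprocess(np.random.randn(3, 30 * 22000), fs=22000, a=8, b=20)
print(segments.shape)  # (11, 20, 2750)
```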
And S2, extracting static and dynamic combined characteristics J from the sound segment data. The specific process is as follows:
and S21, removing the head and the tail frames of each cut sound segment data, and extracting (b-2) N-dimensional static features from the sound segment data. In a specific embodiment, 8-dimensional linear prediction coefficients LPC, 23-dimensional mel-frequency cepstral coefficients MFCC and 1-dimensional zero-crossing rate ZCR are extracted for each cut piece of the sound segment data to obtain a (b-2) × 32-dimensional static feature. Next, N is 32 as an example.
And S22, extracting n-order dynamic difference features from the static features, and generating (b-2) × 32-dimensional dynamic difference features.
And S23, combining the static and dynamic characteristics together, namely generating (b-2) × 64-dimensional static and dynamic combined characteristics J.
Since the first and last frames of the dynamic difference are zero values and have no meaning, the data of the first and last frames in the sound segment data are discarded. Meanwhile, because the segment length segment shift difference is b times, a large amount of repeated data exists, and therefore data cannot be lost. The static characteristic only operates on the data of a certain frame, and does not consider the time sequence of the vehicle signal, so the method adds the dynamic differential characteristic to link the front frame with the rear frame. The dynamic features can supplement the static features, so that the static and dynamic feature combination can have better feature distinguishability.
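The per-frame static features and a first-order frame-to-frame difference (used here as the dynamic part) can be sketched as follows; the use of librosa for LPC and MFCC, the manual zero-crossing rate, and the choice of a first-order difference are assumptions made for illustration and are not the only way to realize S2:

```python
import numpy as np
import librosa

def static_dynamic_features(segment: np.ndarray, sr: int) -> np.ndarray:
    """segment: (b, frame_len) frames of one sound segment; returns J of shape (b-2, 64)."""
    feats = []
    for frame in segment:
        lpc = librosa.lpc(frame, order=8)[1:]          # 8-dim LPC (drop the leading 1)
        mfcc = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=23,
                                    n_fft=len(frame), hop_length=len(frame) + 1)[:, 0]  # 23-dim
        zcr = np.array([np.mean(np.abs(np.diff(np.sign(frame))) > 0)])  # 1-dim zero-crossing rate
        feats.append(np.concatenate([lpc, mfcc, zcr])) # 32-dim static feature per frame
    static = np.stack(feats)                           # (b, 32)

    dynamic = np.diff(static, axis=0)                  # frame-to-frame difference
    # Drop the head and tail frames and pair each kept frame with its difference,
    # giving the (b-2) x 64-dimensional static and dynamic combined features J.
    return np.concatenate([static[1:-1], dynamic[:-1]], axis=1)

J = static_dynamic_features(np.random.randn(20, 2750), sr=22000 // 8)
print(J.shape)  # (18, 64)
```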
And S3, extracting the initial input features K from the sound segment data. Specifically, each frame of (1/a) × fs sampling points is taken as input, the b-2 frames of the b-frame sound segment data are fed into a deep learning network, and (b-2) × 64-dimensional initial input features K are generated through an 8-layer convolutional neural network. The network structure of the deep learning network is an 8-layer convolutional neural network with convolution kernel size 3 and stride 2 or 1, and batch normalization (BatchNorm1d), a ReLU activation function and max pooling (MaxPool1d) are added after each convolution layer.
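The patent does not specify the channel widths of the eight convolution layers or how the final 64-dimensional frame vector is produced, so the PyTorch sketch below fills in those details with assumptions (channel counts growing to 64 and a global max pool over the remaining time axis) purely for illustration:

```python
import torch
import torch.nn as nn

class FrameCNN(nn.Module):
    """8-layer 1-D CNN that maps one raw frame to a 64-dimensional feature vector."""
    def __init__(self):
        super().__init__()
        chans = [1, 8, 8, 16, 16, 32, 32, 64, 64]   # assumed channel widths
        strides = [2, 1, 2, 1, 2, 1, 1, 1]          # stride 2 or 1, as stated in the patent
        layers = []
        for i in range(8):
            layers += [
                nn.Conv1d(chans[i], chans[i + 1], kernel_size=3, stride=strides[i], padding=1),
                nn.BatchNorm1d(chans[i + 1]),        # batch normalization after each convolution
                nn.ReLU(),                           # activation function
                nn.MaxPool1d(kernel_size=2),         # max pooling
            ]
        self.conv = nn.Sequential(*layers)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (b-2, frame_len) raw samples -> add a channel axis -> (b-2, 1, frame_len)
        x = self.conv(frames.unsqueeze(1))
        # Assumed global max pool over the remaining time axis -> (b-2, 64).
        return x.max(dim=-1).values

cnn = FrameCNN()
K = cnn(torch.randn(18, 2750))   # b-2 = 18 frames of 2750 samples each
print(K.shape)                   # torch.Size([18, 64])
```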
And S4, processing the static and dynamic combined features J and the initial input features K through a fully-connected neural network to obtain an attention weight vector H. The specific process is as follows:
s41, combining the static and dynamic combined features J of (b-2) × 64 dimensions and the initial input features K of (b-2) × 64 dimensions into combined features of (b-2) × 128 dimensions through feature splicing;
s42, reducing the dimension of the combined feature from (b-2) to 128 to (b-2) to 64 dimensions through a 3-layer full-connection neural network;
s43, normalized to 0-1 through softmax layer, forming an attention weight vector H with (b-2) × 64 dimensions.
Fig. 2 shows the generation process of the attention weight vector H; the intermediate weight generation takes the following form:
Y1=F1(J,K)
Y2=F2(Y1)
Y3=F3(Y2)
the input of the neural network is static and dynamic combined characteristics J and initial input characteristics K, then the neural network passes through a first layer full-connection network F1, the number of hidden layer neurons is 128, and the output is Y1; the input of the second layer of fully-connected network F2 is Y1, the number of hidden layer neurons is 96, the output is Y2, the input of the third layer of fully-connected network F3 is Y2, the number of hidden layer neurons is 64, the output is Y3, Y3 is transmitted into an output layer with softmax, the normalization is carried out to be between 0 and 1, and then the moment H can be obtained, and the calculation formula is as follows:
H=softmax(Y3)
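For N = 32 (so 2 × N = 64 feature dimensions per frame), the attention weight generation can be sketched in PyTorch as follows; the ReLU activations between the fully-connected layers and the choice to apply the softmax across the 64 feature dimensions of each frame are assumptions, since the patent only specifies the layer sizes and that the output is normalized to between 0 and 1:

```python
import torch
import torch.nn as nn

class AttentionWeights(nn.Module):
    """Three fully-connected layers F1, F2, F3 followed by a softmax (N = 32)."""
    def __init__(self, n: int = 32):
        super().__init__()
        self.f1 = nn.Linear(4 * n, 4 * n)   # 128 -> 128 hidden-layer neurons
        self.f2 = nn.Linear(4 * n, 3 * n)   # 128 -> 96
        self.f3 = nn.Linear(3 * n, 2 * n)   # 96  -> 64
        self.act = nn.ReLU()                # activation between layers (assumed)

    def forward(self, J: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
        # J, K: (b-2, 64) -> concatenated combined features of shape (b-2, 128).
        y1 = self.act(self.f1(torch.cat([J, K], dim=-1)))
        y2 = self.act(self.f2(y1))
        y3 = self.f3(y2)
        # Normalize to between 0 and 1: softmax over the 64 feature dimensions of each frame.
        return torch.softmax(y3, dim=-1)

att = AttentionWeights()
J, K = torch.randn(18, 64), torch.randn(18, 64)
H = att(J, K)
print(H.shape, float(H[0].sum()))   # torch.Size([18, 64]) ~1.0
```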
and S5, calculating a final input feature G, wherein G is J H + K (1-H). The respective weights of the feature components are learned through attention mechanism, and the key is to find the relevance between the feature components based on original data, so that some important feature dimensions in the feature dimensions are highlighted and fused. The method aims to acquire effective detail information as much as possible and reduce the influence of irrelevant useless information; the distinctiveness of the extracted fused features is enhanced, thereby making the fused features more focused on certain feature dimension components.
And S6, inputting the final input features G into an LSTM network to obtain the classification category m of the sound data. Since the dynamic difference features link the previous frame with the next frame and the vehicle sound signal is correlated in time, a deep learning network that models temporal order should be adopted. The method therefore uses a Long Short-Term Memory (LSTM) network as the classification network, specifically a 2-layer LSTM with 32 hidden-layer neurons in the first layer and 16 in the second layer, and outputs the classification category m.
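The weighted fusion of S5 and the LSTM classifier of S6 can be sketched together as follows; the final linear layer mapping the last LSTM output to class scores and the number of classes are assumptions added to make the example runnable, since the patent only states that the network outputs the category m:

```python
import torch
import torch.nn as nn

class FusionLSTMClassifier(nn.Module):
    """Weighted fusion G = J*H + K*(1-H) followed by a 2-layer LSTM classifier."""
    def __init__(self, feat_dim: int = 64, num_classes: int = 4):
        super().__init__()
        self.lstm1 = nn.LSTM(input_size=feat_dim, hidden_size=32, batch_first=True)
        self.lstm2 = nn.LSTM(input_size=32, hidden_size=16, batch_first=True)
        self.head = nn.Linear(16, num_classes)   # assumed classification head

    def forward(self, J, K, H):
        # S5: redistribute the feature "resources" according to the attention weights.
        G = J * H + K * (1 - H)                  # (batch, b-2, 64)
        out, _ = self.lstm1(G)                   # first layer: 32 hidden-layer neurons
        out, _ = self.lstm2(out)                 # second layer: 16 hidden-layer neurons
        logits = self.head(out[:, -1])           # use the output of the last time step
        return logits.argmax(dim=-1)             # predicted category m

model = FusionLSTMClassifier()
J, K = torch.randn(1, 18, 64), torch.randn(1, 18, 64)
H = torch.rand(1, 18, 64)                        # attention weights in (0, 1)
m = model(J, K, H)
print(m)
```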
According to the invention, dynamic difference features are added to link the previous frame with the next frame, so that the dynamic features supplement the static features and the combination of static and dynamic features is more discriminative. The resources that would otherwise be distributed evenly across the different dimensions of the fused features are redistributed according to the attention weights: more important feature components receive larger weights and thus a larger share of the resources, while less important components receive a smaller share. In this way the strong components are reinforced and the weak ones are compensated, the fused features become more effective and more discriminative, and the classification accuracy is improved.
A second embodiment of the invention provides a terminal comprising a processor and a memory, the memory storing program instructions, wherein the processor executes the program instructions to implement steps S1-S6 of the method described above.
Illustratively, the computer program may be partitioned into one or more modules/units, stored in the memory and executed by the processor, to implement the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program in the terminal.
The terminal can be a desktop computer, a notebook, a palm computer, a cloud server and other computing equipment. The terminal may include, but is not limited to, a processor, a memory. For example, it may also include input output devices, network access devices, buses, and the like.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor; the processor is the control center of the terminal device and connects the various parts of the entire terminal device using various interfaces and lines.
The memory may be used for storing the computer programs and/or modules, and the processor implements the various functions of the terminal device by running or executing the computer programs and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data created according to the use of the terminal (such as audio data, a phonebook, etc.). In addition, the memory may include a high-speed random access memory, and may also include a non-volatile memory such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid-state storage device.
A third embodiment of the present invention provides a computer-readable storage medium, wherein the computer-readable storage medium stores program instructions executable by a processor to implement steps S1-S6 of the method as described above.
The respective modules/units of the terminal, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer-readable medium may contain suitable additions or subtractions depending on the requirements of legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer-readable media may not include electrical carrier signals or telecommunication signals in accordance with legislation and patent practice.
While the preferred embodiments of the present invention have been described in detail above, it should be understood that aspects of the embodiments can be modified, if necessary, to employ aspects, features and concepts of the various patents, applications and publications to provide yet further embodiments.
These and other changes can be made to the embodiments in light of the above detailed description. In general, in the claims, the terms used should not be construed to be limited to the specific embodiments disclosed in the specification and the claims, but should be construed to include all possible embodiments along with the full scope of equivalents to which such claims are entitled.

Claims (10)

1. A sound data classification method based on multi-dimensional feature weighted fusion is characterized by comprising the following steps:
s1, preprocessing the input sound data to obtain sound segment data;
s2, extracting static and dynamic combined features J from the sound segment data;
s3, extracting initial input features K from the sound segment data;
s4, processing the static dynamic combination characteristics J and the initial input characteristics K through a fully-connected neural network to obtain an attention weight vector H;
s5, calculating a final input feature G, G ═ J × H + K (1-H);
and S6, inputting the final input characteristics G into an LSTM network to obtain the category m of the sound data classification.
2. The method of claim 1, wherein the specific process of S1 is:
s11, performing a-time down-sampling processing on the p-channel sound data, wherein the length of each frame of data is (1/a) × fs, and fs is the sampling frequency of the sound data;
s12, randomly selecting certain channel data as input;
and S13, cutting the channel data into segments of b frames to obtain the sound segment data, wherein the segment shift of the sound segment data is 1 frame.
3. The method of claim 2, wherein a is 8 or 16, b is 5-50, and fs is 22 kHz, 44 kHz or 88 kHz.
4. The method of claim 2, wherein the specific process of S2 is:
s21, removing the head and tail frames of each cut sound segment data, and extracting (b-2) N-dimensional static features from the sound segment data;
s22, extracting N-order dynamic difference features from the static features, and generating (b-2) N-dimensional dynamic difference features in the same way;
s23, combining the static and dynamic characteristics together, i.e. generating the static and dynamic combined characteristics J of dimension (b-2) × 2 × N.
5. The method according to claim 4, wherein N = 32; specifically, the 32-dimensional static features comprise 8-dimensional linear prediction coefficients, 23-dimensional Mel-frequency cepstral coefficients and a 1-dimensional zero-crossing rate.
6. The method according to claim 4 or 5, wherein the specific process of S3 is: sending the b-2 frames of the b-frame sound segment data into a deep learning network to generate the initial input features K of dimension (b-2) × 2 × N.
7. The method of claim 6, wherein the network structure of the deep learning network is an 8-layer convolutional neural network with convolution kernel size 3 and stride 2 or 1, and batch normalization (BatchNorm1d), a ReLU activation function and max pooling (MaxPool1d) are added after each convolution layer.
8. The method as claimed in claim 6, wherein the specific process of S4 is:
s41, splicing and combining the static and dynamic combined features J with the initial input features K in the dimensions of (b-2) × 2 × N into combined features in the dimensions of (b-2) × 4 × N through feature splicing;
s42, reducing the dimension of the combined feature from (b-2) 4X N to (b-2) 2X N through the 3 layers of fully connected neural networks;
s43, normalizing the attention weight vector to be between 0 and 1 through the softmax layer to form an attention weight vector H with the dimension of (b-2) x 2 x N;
the number of hidden layer neurons of the first layer of fully-connected network, the second layer of fully-connected network and the third layer of fully-connected network F3 is 4 × N, 3 × N and 2 × N respectively.
9. A terminal comprising a processor and a memory, the memory storing program instructions, characterized in that the processor executes the program instructions to implement the steps of the method according to any of claims 1-8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program instructions executable by a processor to implement the steps of the method according to any one of claims 1-8.
CN202210506993.4A 2022-05-11 2022-05-11 Sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion Pending CN115101091A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210506993.4A CN115101091A (en) 2022-05-11 2022-05-11 Sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210506993.4A CN115101091A (en) 2022-05-11 2022-05-11 Sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion

Publications (1)

Publication Number Publication Date
CN115101091A true CN115101091A (en) 2022-09-23

Family

ID=83287274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210506993.4A Pending CN115101091A (en) 2022-05-11 2022-05-11 Sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion

Country Status (1)

Country Link
CN (1) CN115101091A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110189757A (en) * 2019-06-27 2019-08-30 电子科技大学 A kind of giant panda individual discrimination method, equipment and computer readable storage medium
US20220004870A1 (en) * 2019-09-05 2022-01-06 Tencent Technology (Shenzhen) Company Limited Speech recognition method and apparatus, and neural network training method and apparatus
CN110992987A (en) * 2019-10-23 2020-04-10 大连东软信息学院 Parallel feature extraction system and method for general specific voice in voice signal
WO2021208287A1 (en) * 2020-04-14 2021-10-21 深圳壹账通智能科技有限公司 Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
WO2021208719A1 (en) * 2020-11-19 2021-10-21 平安科技(深圳)有限公司 Voice-based emotion recognition method, apparatus and device, and storage medium
CN112541533A (en) * 2020-12-07 2021-03-23 阜阳师范大学 Modified vehicle identification method based on neural network and feature fusion
CN112614492A (en) * 2020-12-09 2021-04-06 通号智慧城市研究设计院有限公司 Voiceprint recognition method, system and storage medium based on time-space information fusion
CN112581979A (en) * 2020-12-10 2021-03-30 重庆邮电大学 Speech emotion recognition method based on spectrogram
CN113889077A (en) * 2021-09-22 2022-01-04 武汉普惠海洋光电技术有限公司 Voice recognition method, voice recognition device, electronic equipment and storage medium
CN114201988A (en) * 2021-11-26 2022-03-18 北京理工大学 Satellite navigation composite interference signal identification method and system
CN114373476A (en) * 2022-01-11 2022-04-19 江西师范大学 Sound scene classification method based on multi-scale residual attention network
CN114387567A (en) * 2022-03-23 2022-04-22 长视科技股份有限公司 Video data processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Espi et al. Exploiting spectro-temporal locality in deep learning based acoustic event detection
Yue et al. The classification of underwater acoustic targets based on deep learning methods
Mukherjee et al. A lazy learning-based language identification from speech using MFCC-2 features
Gaurav et al. Performance of deer hunting optimization based deep learning algorithm for speech emotion recognition
CN112183107B (en) Audio processing method and device
CN112949708A (en) Emotion recognition method and device, computer equipment and storage medium
Liang et al. Channel compression: Rethinking information redundancy among channels in CNN architecture
CN114708857A (en) Speech recognition model training method, speech recognition method and corresponding device
López-Espejo et al. A novel loss function and training strategy for noise-robust keyword spotting
Zhao et al. A survey on automatic emotion recognition using audio big data and deep learning architectures
CN114579743A (en) Attention-based text classification method and device and computer readable medium
CN113705652B (en) Task type dialogue state tracking system and method based on pointer generation network
Qu et al. Acoustic scene classification based on three-dimensional multi-channel feature-correlated deep learning networks
WO2023279691A1 (en) Speech classification method and apparatus, model training method and apparatus, device, medium, and program
Yasmin et al. A rough set theory and deep learning-based predictive system for gender recognition using audio speech
CN114913872A (en) Time-frequency double-domain audio classification method and system based on convolutional neural network
Shah et al. Speech emotion recognition based on SVM using MATLAB
CN113870863A (en) Voiceprint recognition method and device, storage medium and electronic equipment
Sakamoto et al. Stargan-vc+ asr: Stargan-based non-parallel voice conversion regularized by automatic speech recognition
CN115101091A (en) Sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion
Hajihashemi et al. Novel time-frequency based scheme for detecting sound events from sound background in audio segments
Mohammed et al. Spin-Image Descriptors for Text-Independent Speaker Recognition
CN115206297A (en) Variable-length speech emotion recognition method based on space-time multiple fusion network
Chakravarty et al. Feature extraction using GTCC spectrogram and ResNet50 based classification for audio spoof detection
Shao et al. Deep semantic learning for acoustic scene classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination