CN115101091A - Sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion - Google Patents

Sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion

Info

Publication number
CN115101091A
Authority
CN
China
Prior art keywords
data
features
sound
static
dynamic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210506993.4A
Other languages
Chinese (zh)
Inventor
吕建飞
钱汉望
宋林森
刘华巍
李宝清
袁晓兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Scifine Iot Technology Co ltd
Original Assignee
Shanghai Scifine Iot Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Scifine Iot Technology Co ltd filed Critical Shanghai Scifine Iot Technology Co ltd
Priority to CN202210506993.4A priority Critical patent/CN115101091A/en
Publication of CN115101091A publication Critical patent/CN115101091A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/09 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being zero crossing rates
    • G10L25/12 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion. The method comprises the following steps: S1, preprocessing the input sound data to obtain sound segment data; S2, extracting static and dynamic combined features J from the sound segment data; S3, extracting initial input features K from the sound segment data; S4, processing the static and dynamic combined features J and the initial input features K through a fully-connected neural network to obtain an attention weight vector H; S5, calculating the final input features G, where G = J × H + K × (1 - H); and S6, inputting the final input features G into an LSTM network to obtain the classification category m of the sound data. According to the invention, dynamic difference features are added to link the previous frame with the next frame, and the resources that would otherwise be distributed evenly across the dimensions of the fused features are redistributed according to the attention weights, so that strong feature components are reinforced and weak ones are compensated. The fused features therefore become more effective and more discriminative, and the classification accuracy is improved.

Description

Sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion
Technical Field
The invention relates to the field of sound data processing, in particular to a sound data classification method based on multi-dimensional feature weighted fusion, a terminal and a computer readable storage medium.
Background
Field vehicle targets are identified from vehicle sound data collected by an acoustic array. Classification and identification are generally divided into two steps: first, features of the target sound signal are extracted; then, a classifier is designed to obtain a decision result. Commonly used acoustic signal features are the linear prediction coefficients (LPC), Mel-frequency cepstral coefficients (MFCC) and the zero-crossing rate (ZCR).
On the other hand, with the current rise of deep learning, extracting features through a neural network has also become common. On the basis of these primary features, classification models generally perform multi-feature fusion to obtain fused features with stronger expressive power. The prior art generally uses two fusion modes: (1) concatenating the features into a fused feature, so that the feature dimension becomes twice the original dimension; (2) adding the corresponding dimensions to form the fused feature, so that the feature dimension remains unchanged. Multi-feature fusion is widely used in fields such as language identification, speaker identification and voiceprint recognition, and has in recent years been successfully applied to the classification and identification of vehicle targets. However, both fusion methods treat the contribution of each feature as equal by default; they neither highlight the more effective feature components nor suppress the useless ones, which may cause a dimensionality disaster, fail to reach the intended goal of feature fusion (namely, complementing each feature's weaknesses with the others' strengths), and degrade the fusion result.
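For illustration only, the two conventional fusion modes described above can be sketched as follows in Python; the tensor names and shapes are hypothetical and not taken from the prior art being discussed:

```python
import torch

# Two hypothetical per-frame feature matrices of shape (frames, dims).
feat_a = torch.randn(18, 64)
feat_b = torch.randn(18, 64)

# Mode (1): feature splicing (concatenation) -- the feature dimension doubles (64 -> 128).
fused_concat = torch.cat([feat_a, feat_b], dim=-1)   # shape (18, 128)

# Mode (2): element-wise addition -- the feature dimension is unchanged (64).
fused_sum = feat_a + feat_b                           # shape (18, 64)

# Both modes implicitly give the two feature sets equal contributions, which is the
# limitation that the attention-weighted fusion of the present method addresses.
print(fused_concat.shape, fused_sum.shape)
```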
Disclosure of Invention
The invention aims to provide a sound data classification method based on multi-dimensional feature weighted fusion, a terminal and a storage medium, so as to solve the above problems. To this end, the technical scheme adopted by the invention is as follows:
according to an aspect of the present invention, there is provided a sound data classification method based on multi-dimensional feature weighted fusion, including the following steps:
s1, preprocessing the input sound data to obtain sound segment data;
s2, extracting static and dynamic combined features J from the sound segment data;
s3, extracting initial input features K from the sound segment data;
s4, processing the static dynamic combination characteristics J and the initial input characteristics K through a fully-connected neural network to obtain an attention weight vector H;
s5, calculating a final input feature G, G ═ J × H + K (1-H);
and S6, inputting the final input features G into an LSTM network to obtain the classification category m of the sound data.
Further, the specific process of S1 is:
s11, performing a-time down-sampling processing on the sound data of the p channel, wherein the length of each frame of data is (1/a) × fs, and fs is the sampling frequency of the sound data;
s12, randomly selecting certain channel data as input;
and S13, cutting the channel data into segments of b frames to obtain the sound segment data, wherein the segment shift of the sound segment data is 1 frame.
Further, a is 8 or 16, b is 5-50, and fs is 22 kHz, 44 kHz or 88 kHz.
Further, the specific process of S2 is:
s21, removing the head and tail frames of each cut sound segment data, and extracting (b-2) N-dimensional static features from the sound segment data;
s22, extracting N-order dynamic difference features from the static features, and generating (b-2) N-dimensional dynamic difference features in the same way;
s23, combining the static and dynamic characteristics together, i.e. generating the static and dynamic combined characteristics J of dimension (b-2) × 2 × N.
Further, N = 32; specifically, the 32-dimensional static features comprise 8-dimensional linear prediction coefficients, 23-dimensional Mel-frequency cepstral coefficients and a 1-dimensional zero-crossing rate.
Further, the specific process of S3 is: sending the b-2 frames of the b-frame sound segment data into a deep learning network to generate the initial input features K of dimension (b-2) × 2 × N.
Further, the network structure of the deep learning network is an 8-layer convolutional neural network with convolution kernel size 3 and stride 2 or 1, and batch normalization (BatchNorm1d), a ReLU activation function and max pooling (MaxPool1d) are added after each convolution layer.
Further, the specific process of S4 is:
s41, splicing and combining the static and dynamic combined features J with the initial input features K in the dimensions of (b-2) × 2 × N into combined features in the dimensions of (b-2) × 4 × N through feature splicing;
s42, reducing the dimension of the combined feature from (b-2) × 4 × N to (b-2) × 2 × N through the 3 layers of full-connection neural network;
s43, normalizing the attention weight vector to be between 0 and 1 through the softmax layer to form an attention weight vector H with the dimension of (b-2) x 2 x N;
the number of hidden layer neurons of the first layer of fully-connected network, the second layer of fully-connected network and the third layer of fully-connected network F3 is 4 × N, 3 × N and 2 × N respectively.
According to another aspect of the present invention, there is also provided a terminal comprising a processor and a memory, the memory storing program instructions, wherein the processor executes the program instructions to implement the steps of the method described above.
According to yet another aspect of the present invention, there is also provided a computer readable storage medium, wherein the computer readable storage medium stores program instructions executable by a processor to implement the steps of the method as described above.
According to the invention, dynamic difference features are added to link the previous frame with the next frame, so that the dynamic features supplement the static features and the combination of static and dynamic features is more discriminative. The resources that would otherwise be distributed evenly across the different dimensions of the fused features are redistributed according to the attention weights: more important feature components receive larger weights and thus a larger share of the resources, while less important components receive a smaller share. In this way the strong components are reinforced and the weak ones are compensated, the fused features become more effective and more discriminative, and the classification accuracy is improved.
Drawings
FIG. 1 is a flow chart of a method for classifying sound data based on multi-dimensional feature weighted fusion according to an embodiment of the present invention;
fig. 2 is a schematic diagram of attention weight vector generation of a sound data classification method based on multi-dimensional feature weighted fusion according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings so that the objects, features and advantages of the invention can be more clearly understood. It should be understood that the embodiments shown in the drawings are not intended to limit the scope of the present invention, but are merely intended to illustrate the essential spirit of the technical solution of the present invention.
In the following description, for the purposes of illustrating various disclosed embodiments, certain specific details are set forth in order to provide a thorough understanding of the various disclosed embodiments. One skilled in the relevant art will recognize, however, that the embodiments may be practiced without one or more of the specific details. In other instances, well-known devices, structures and techniques associated with this application may not be shown or described in detail to avoid unnecessarily obscuring the description of the embodiments.
Reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In the following description, for the purposes of clearly illustrating the structure and operation of the present invention, directional terms are used, but the terms "front", "rear", "left", "right", "outer", "inner", "outward", "inward", "upper", "lower", etc. should be interpreted as words of convenience and should not be interpreted as limiting terms.
As shown in fig. 1, a first embodiment of the present invention provides a sound data classification method based on multidimensional feature weighted fusion, including the following steps:
and S1, preprocessing the input voice data to obtain voice segment data. The specific process is as follows:
s11, a-time down-sampling processing is carried out on the sound data of the p channels, wherein the length of each frame of data is (1/a) × fs, p is more than or equal to 1, and fs is the sampling frequency of the sound data. Here, the sound data is, specifically, vehicle sound data collected by using a uniform circular array, and in order to make the vehicle sound data usable for various purposes, the sampling frequency is usually relatively high, for example, the sampling frequency may be 22kHz, 44kHz, or 88 kHz. For the classification purpose of the present application, the sampling frequency is too high, the data amount is too large, and the data processing is not facilitated, so that a-time down-sampling processing needs to be performed in advance to reduce the data amount. The value of a can be selected from 8 or 16 according to actual conditions.
And S12, randomly selecting the data of one channel as input.
And S13, cutting the channel data into segments of b frames to obtain the sound segment data, wherein the segment length is b frames and the segment shift is 1 frame. The value of b is selected according to the implementation situation and can generally be chosen between 5 and 50.
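A minimal sketch of the preprocessing of S11-S13 is given below, assuming the sound data is available as a NumPy array; the function name, the use of scipy.signal.decimate for the a-fold down-sampling and the random channel choice via NumPy are illustrative assumptions rather than requirements of the method:

```python
import numpy as np
from scipy.signal import decimate

def preprocess(sound: np.ndarray, fs: int, a: int = 8, b: int = 20) -> np.ndarray:
    """sound: array of shape (p, samples) with p >= 1 channels."""
    # S11: a-fold down-sampling of every channel; one frame = fs / a samples.
    down = np.stack([decimate(ch, a) for ch in sound])
    frame_len = fs // a

    # S12: randomly select the data of one channel as input.
    channel = down[np.random.randint(down.shape[0])]

    # Split the selected channel into consecutive frames of frame_len samples.
    n_frames = len(channel) // frame_len
    frames = channel[: n_frames * frame_len].reshape(n_frames, frame_len)

    # S13: cut into segments of b frames with a segment shift of 1 frame.
    segments = np.stack([frames[i : i + b] for i in range(n_frames - b + 1)])
    return segments  # shape (num_segments, b, frame_len)

# Example: a 3-channel recording of 30 seconds sampled at 22 kHz, a = 8, b = 20.
segments = preprocess(np.random.randn(3, 30 * 22000), fs=22000, a=8, b=20)
print(segments.shape)  # (11, 20, 2750)
```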
And S2, extracting static and dynamic combined characteristics J from the sound segment data. The specific process is as follows:
and S21, removing the head and the tail frames of each cut sound segment data, and extracting (b-2) N-dimensional static features from the sound segment data. In a specific embodiment, 8-dimensional linear prediction coefficients LPC, 23-dimensional mel-frequency cepstral coefficients MFCC and 1-dimensional zero-crossing rate ZCR are extracted for each cut piece of the sound segment data to obtain a (b-2) × 32-dimensional static feature. Next, N is 32 as an example.
And S22, extracting n-order dynamic difference features from the static features, and generating (b-2) × 32-dimensional dynamic difference features.
And S23, combining the static and dynamic characteristics together, namely generating (b-2) × 64-dimensional static and dynamic combined characteristics J.
Since the first and last frames of the dynamic difference are zero values and have no meaning, the data of the first and last frames in the sound segment data are discarded. Meanwhile, because the segment length segment shift difference is b times, a large amount of repeated data exists, and therefore data cannot be lost. The static characteristic only operates on the data of a certain frame, and does not consider the time sequence of the vehicle signal, so the method adds the dynamic differential characteristic to link the front frame with the rear frame. The dynamic features can supplement the static features, so that the static and dynamic feature combination can have better feature distinguishability.
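The per-frame static features and a first-order frame-to-frame difference (used here as the dynamic part) can be sketched as follows; the use of librosa for LPC and MFCC, the manual zero-crossing rate, and the choice of a first-order difference are assumptions made for illustration and are not the only way to realize S2:

```python
import numpy as np
import librosa

def static_dynamic_features(segment: np.ndarray, sr: int) -> np.ndarray:
    """segment: (b, frame_len) frames of one sound segment; returns J of shape (b-2, 64)."""
    feats = []
    for frame in segment:
        lpc = librosa.lpc(frame, order=8)[1:]          # 8-dim LPC (drop the leading 1)
        mfcc = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=23,
                                    n_fft=len(frame), hop_length=len(frame) + 1)[:, 0]  # 23-dim
        zcr = np.array([np.mean(np.abs(np.diff(np.sign(frame))) > 0)])  # 1-dim zero-crossing rate
        feats.append(np.concatenate([lpc, mfcc, zcr])) # 32-dim static feature per frame
    static = np.stack(feats)                           # (b, 32)

    dynamic = np.diff(static, axis=0)                  # frame-to-frame difference
    # Drop the head and tail frames and pair each kept frame with its difference,
    # giving the (b-2) x 64-dimensional static and dynamic combined features J.
    return np.concatenate([static[1:-1], dynamic[:-1]], axis=1)

J = static_dynamic_features(np.random.randn(20, 2750), sr=22000 // 8)
print(J.shape)  # (18, 64)
```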
And S3, extracting the initial input features K from the sound segment data. Specifically, each frame of (1/a) × fs sampling points is taken as input, the b-2 frames of the b-frame sound segment data are fed into a deep learning network, and (b-2) × 64-dimensional initial input features K are generated through an 8-layer convolutional neural network. The network structure of the deep learning network is an 8-layer convolutional neural network with convolution kernel size 3 and stride 2 or 1, and batch normalization (BatchNorm1d), a ReLU activation function and max pooling (MaxPool1d) are added after each convolution layer.
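The patent does not specify the channel widths of the eight convolution layers or how the final 64-dimensional frame vector is produced, so the PyTorch sketch below fills in those details with assumptions (channel counts growing to 64 and a global max pool over the remaining time axis) purely for illustration:

```python
import torch
import torch.nn as nn

class FrameCNN(nn.Module):
    """8-layer 1-D CNN that maps one raw frame to a 64-dimensional feature vector."""
    def __init__(self):
        super().__init__()
        chans = [1, 8, 8, 16, 16, 32, 32, 64, 64]   # assumed channel widths
        strides = [2, 1, 2, 1, 2, 1, 1, 1]          # stride 2 or 1, as stated in the patent
        layers = []
        for i in range(8):
            layers += [
                nn.Conv1d(chans[i], chans[i + 1], kernel_size=3, stride=strides[i], padding=1),
                nn.BatchNorm1d(chans[i + 1]),        # batch normalization after each convolution
                nn.ReLU(),                           # activation function
                nn.MaxPool1d(kernel_size=2),         # max pooling
            ]
        self.conv = nn.Sequential(*layers)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (b-2, frame_len) raw samples -> add a channel axis -> (b-2, 1, frame_len)
        x = self.conv(frames.unsqueeze(1))
        # Assumed global max pool over the remaining time axis -> (b-2, 64).
        return x.max(dim=-1).values

cnn = FrameCNN()
K = cnn(torch.randn(18, 2750))   # b-2 = 18 frames of 2750 samples each
print(K.shape)                   # torch.Size([18, 64])
```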
And S4, processing the static and dynamic combined features J and the initial input features K through a fully-connected neural network to obtain an attention weight vector H. The specific process is as follows:
s41, combining the static and dynamic combined features J of (b-2) × 64 dimensions and the initial input features K of (b-2) × 64 dimensions into combined features of (b-2) × 128 dimensions through feature splicing;
s42, reducing the dimension of the combined feature from (b-2) to 128 to (b-2) to 64 dimensions through a 3-layer full-connection neural network;
s43, normalized to 0-1 through softmax layer, forming an attention weight vector H with (b-2) × 64 dimensions.
Fig. 2 shows the generation process of the attention weight vector H; the intermediate weight generation takes the following form:
Y1=F1(J,K)
Y2=F2(Y1)
Y3=F3(Y2)
the input of the neural network is static and dynamic combined characteristics J and initial input characteristics K, then the neural network passes through a first layer full-connection network F1, the number of hidden layer neurons is 128, and the output is Y1; the input of the second layer of fully-connected network F2 is Y1, the number of hidden layer neurons is 96, the output is Y2, the input of the third layer of fully-connected network F3 is Y2, the number of hidden layer neurons is 64, the output is Y3, Y3 is transmitted into an output layer with softmax, the normalization is carried out to be between 0 and 1, and then the moment H can be obtained, and the calculation formula is as follows:
H=softmax(Y3)
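For N = 32 (so 2 × N = 64 feature dimensions per frame), the attention weight generation can be sketched in PyTorch as follows; the ReLU activations between the fully-connected layers and the choice to apply the softmax across the 64 feature dimensions of each frame are assumptions, since the patent only specifies the layer sizes and that the output is normalized to between 0 and 1:

```python
import torch
import torch.nn as nn

class AttentionWeights(nn.Module):
    """Three fully-connected layers F1, F2, F3 followed by a softmax (N = 32)."""
    def __init__(self, n: int = 32):
        super().__init__()
        self.f1 = nn.Linear(4 * n, 4 * n)   # 128 -> 128 hidden-layer neurons
        self.f2 = nn.Linear(4 * n, 3 * n)   # 128 -> 96
        self.f3 = nn.Linear(3 * n, 2 * n)   # 96  -> 64
        self.act = nn.ReLU()                # activation between layers (assumed)

    def forward(self, J: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
        # J, K: (b-2, 64) -> concatenated combined features of shape (b-2, 128).
        y1 = self.act(self.f1(torch.cat([J, K], dim=-1)))
        y2 = self.act(self.f2(y1))
        y3 = self.f3(y2)
        # Normalize to between 0 and 1: softmax over the 64 feature dimensions of each frame.
        return torch.softmax(y3, dim=-1)

att = AttentionWeights()
J, K = torch.randn(18, 64), torch.randn(18, 64)
H = att(J, K)
print(H.shape, float(H[0].sum()))   # torch.Size([18, 64]) ~1.0
```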
and S5, calculating a final input feature G, wherein G is J H + K (1-H). The respective weights of the feature components are learned through attention mechanism, and the key is to find the relevance between the feature components based on original data, so that some important feature dimensions in the feature dimensions are highlighted and fused. The method aims to acquire effective detail information as much as possible and reduce the influence of irrelevant useless information; the distinctiveness of the extracted fused features is enhanced, thereby making the fused features more focused on certain feature dimension components.
And S6, inputting the final input features G into an LSTM network to obtain the classification category m of the sound data. Since the dynamic difference features link the previous frame with the next frame and the vehicle sound signal is correlated in time, a deep learning network that models temporal order should be adopted. The method therefore uses a Long Short-Term Memory (LSTM) network as the classification network, specifically a 2-layer LSTM with 32 hidden-layer neurons in the first layer and 16 in the second layer, and outputs the classification category m.
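The weighted fusion of S5 and the LSTM classifier of S6 can be sketched together as follows; the final linear layer mapping the last LSTM output to class scores and the number of classes are assumptions added to make the example runnable, since the patent only states that the network outputs the category m:

```python
import torch
import torch.nn as nn

class FusionLSTMClassifier(nn.Module):
    """Weighted fusion G = J*H + K*(1-H) followed by a 2-layer LSTM classifier."""
    def __init__(self, feat_dim: int = 64, num_classes: int = 4):
        super().__init__()
        self.lstm1 = nn.LSTM(input_size=feat_dim, hidden_size=32, batch_first=True)
        self.lstm2 = nn.LSTM(input_size=32, hidden_size=16, batch_first=True)
        self.head = nn.Linear(16, num_classes)   # assumed classification head

    def forward(self, J, K, H):
        # S5: redistribute the feature "resources" according to the attention weights.
        G = J * H + K * (1 - H)                  # (batch, b-2, 64)
        out, _ = self.lstm1(G)                   # first layer: 32 hidden-layer neurons
        out, _ = self.lstm2(out)                 # second layer: 16 hidden-layer neurons
        logits = self.head(out[:, -1])           # use the output of the last time step
        return logits.argmax(dim=-1)             # predicted category m

model = FusionLSTMClassifier()
J, K = torch.randn(1, 18, 64), torch.randn(1, 18, 64)
H = torch.rand(1, 18, 64)                        # attention weights in (0, 1)
m = model(J, K, H)
print(m)
```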
According to the invention, dynamic difference features are added to link the previous frame with the next frame, so that the dynamic features supplement the static features and the combination of static and dynamic features is more discriminative. The resources that would otherwise be distributed evenly across the different dimensions of the fused features are redistributed according to the attention weights: more important feature components receive larger weights and thus a larger share of the resources, while less important components receive a smaller share. In this way the strong components are reinforced and the weak ones are compensated, the fused features become more effective and more discriminative, and the classification accuracy is improved.
A second embodiment of the invention provides a terminal comprising a processor and a memory, the memory storing program instructions, wherein the processor executes the program instructions to implement steps S1-S6 of the method described above.
Illustratively, the computer program may be partitioned into one or more modules/units, stored in the memory and executed by the processor, to implement the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program in the terminal.
The terminal can be a desktop computer, a notebook, a palm computer, a cloud server and other computing equipment. The terminal may include, but is not limited to, a processor, a memory. For example, it may also include input output devices, network access devices, buses, and the like.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor; the processor is the control center of the terminal device and connects the various parts of the entire terminal device using various interfaces and lines.
The memory may be used for storing the computer programs and/or modules, and the processor implements the various functions of the terminal device by running or executing the computer programs and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data created according to the use of the terminal (such as audio data, a phonebook, etc.). In addition, the memory may include a high-speed random access memory, and may also include a non-volatile memory such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other non-volatile solid-state storage device.
A third embodiment of the present invention provides a computer-readable storage medium, wherein the computer-readable storage medium stores program instructions executable by a processor to implement steps S1-S6 of the method as described above.
The respective modules/units of the terminal, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer-readable medium may contain suitable additions or subtractions depending on the requirements of legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer-readable media may not include electrical carrier signals or telecommunication signals in accordance with legislation and patent practice.
While the preferred embodiments of the present invention have been described in detail above, it should be understood that aspects of the embodiments can be modified, if necessary, to employ aspects, features and concepts of the various patents, applications and publications to provide yet further embodiments.
These and other changes can be made to the embodiments in light of the above detailed description. In general, in the claims, the terms used should not be construed to be limited to the specific embodiments disclosed in the specification and the claims, but should be construed to include all possible embodiments along with the full scope of equivalents to which such claims are entitled.

Claims (10)

1. A sound data classification method based on multi-dimensional feature weighted fusion is characterized by comprising the following steps:
s1, preprocessing the input sound data to obtain sound segment data;
s2, extracting static and dynamic combined features J from the sound segment data;
s3, extracting initial input features K from the sound segment data;
s4, processing the static dynamic combination characteristics J and the initial input characteristics K through a fully-connected neural network to obtain an attention weight vector H;
s5, calculating a final input feature G, G ═ J × H + K (1-H);
and S6, inputting the final input characteristics G into an LSTM network to obtain the category m of the sound data classification.
2. The method of claim 1, wherein the specific process of S1 is:
s11, performing a-time down-sampling processing on the p-channel sound data, wherein the length of each frame of data is (1/a) × fs, and fs is the sampling frequency of the sound data;
s12, randomly selecting certain channel data as input;
and S13, cutting the channel data into segments of b frames to obtain the sound segment data, wherein the segment shift of the sound segment data is 1 frame.
3. The method of claim 2, wherein a is 8 or 16, b is 5-50, and fs is 22 kHz, 44 kHz or 88 kHz.
4. The method of claim 2, wherein the specific process of S2 is:
s21, removing the head and tail frames of each cut sound segment data, and extracting (b-2) N-dimensional static features from the sound segment data;
s22, extracting N-order dynamic difference features from the static features, and generating (b-2) N-dimensional dynamic difference features in the same way;
s23, combining the static and dynamic characteristics together, i.e. generating the static and dynamic combined characteristics J of dimension (b-2) × 2 × N.
5. The method according to claim 4, wherein N = 32; specifically, the 32-dimensional static features comprise 8-dimensional linear prediction coefficients, 23-dimensional Mel-frequency cepstral coefficients and a 1-dimensional zero-crossing rate.
6. The method according to claim 4 or 5, wherein the specific process of S3 is: sending the b-2 frames of the b-frame sound segment data into a deep learning network to generate the initial input features K of dimension (b-2) × 2 × N.
7. The method of claim 6, wherein the network structure of the deep learning network is an 8-layer convolutional neural network with convolution kernel size 3 and stride 2 or 1, and batch normalization (BatchNorm1d), a ReLU activation function and max pooling (MaxPool1d) are added after each convolution layer.
8. The method as claimed in claim 6, wherein the specific process of S4 is:
s41, splicing and combining the static and dynamic combined features J with the initial input features K in the dimensions of (b-2) × 2 × N into combined features in the dimensions of (b-2) × 4 × N through feature splicing;
s42, reducing the dimension of the combined feature from (b-2) 4X N to (b-2) 2X N through the 3 layers of fully connected neural networks;
s43, normalizing the attention weight vector to be between 0 and 1 through the softmax layer to form an attention weight vector H with the dimension of (b-2) x 2 x N;
the number of hidden layer neurons of the first layer of fully-connected network, the second layer of fully-connected network and the third layer of fully-connected network F3 is 4 × N, 3 × N and 2 × N respectively.
9. A terminal comprising a processor and a memory, the memory storing program instructions, characterized in that the processor executes the program instructions to implement the steps of the method according to any of claims 1-8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program instructions executable by a processor to implement the steps of the method according to any one of claims 1-8.
CN202210506993.4A 2022-05-11 2022-05-11 Sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion Pending CN115101091A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210506993.4A CN115101091A (en) 2022-05-11 2022-05-11 Sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210506993.4A CN115101091A (en) 2022-05-11 2022-05-11 Sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion

Publications (1)

Publication Number Publication Date
CN115101091A true CN115101091A (en) 2022-09-23

Family

ID=83287274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210506993.4A Pending CN115101091A (en) 2022-05-11 2022-05-11 Sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion

Country Status (1)

Country Link
CN (1) CN115101091A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110189757A (en) * 2019-06-27 2019-08-30 电子科技大学 A kind of giant panda individual discrimination method, equipment and computer readable storage medium
US20220004870A1 (en) * 2019-09-05 2022-01-06 Tencent Technology (Shenzhen) Company Limited Speech recognition method and apparatus, and neural network training method and apparatus
CN110992987A (en) * 2019-10-23 2020-04-10 大连东软信息学院 Parallel feature extraction system and method for general specific voice in voice signal
WO2021208287A1 (en) * 2020-04-14 2021-10-21 深圳壹账通智能科技有限公司 Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
WO2021208719A1 (en) * 2020-11-19 2021-10-21 平安科技(深圳)有限公司 Voice-based emotion recognition method, apparatus and device, and storage medium
CN112541533A (en) * 2020-12-07 2021-03-23 阜阳师范大学 Modified vehicle identification method based on neural network and feature fusion
CN112614492A (en) * 2020-12-09 2021-04-06 通号智慧城市研究设计院有限公司 Voiceprint recognition method, system and storage medium based on time-space information fusion
CN112581979A (en) * 2020-12-10 2021-03-30 重庆邮电大学 Speech emotion recognition method based on spectrogram
CN113889077A (en) * 2021-09-22 2022-01-04 武汉普惠海洋光电技术有限公司 Voice recognition method, voice recognition device, electronic equipment and storage medium
CN114201988A (en) * 2021-11-26 2022-03-18 北京理工大学 Satellite navigation composite interference signal identification method and system
CN114373476A (en) * 2022-01-11 2022-04-19 江西师范大学 Sound scene classification method based on multi-scale residual attention network
CN114387567A (en) * 2022-03-23 2022-04-22 长视科技股份有限公司 Video data processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Espi et al. Exploiting spectro-temporal locality in deep learning based acoustic event detection
Yue et al. The classification of underwater acoustic targets based on deep learning methods
Mukherjee et al. A lazy learning-based language identification from speech using MFCC-2 features
Gaurav et al. Performance of deer hunting optimization based deep learning algorithm for speech emotion recognition
CN112183107B (en) Audio processing method and device
CN112949708A (en) Emotion recognition method and device, computer equipment and storage medium
Liang et al. Channel compression: Rethinking information redundancy among channels in CNN architecture
CN114708857A (en) Speech recognition model training method, speech recognition method and corresponding device
López-Espejo et al. A novel loss function and training strategy for noise-robust keyword spotting
Zhao et al. A survey on automatic emotion recognition using audio big data and deep learning architectures
CN114579743A (en) Attention-based text classification method and device and computer readable medium
CN113705652B (en) Task type dialogue state tracking system and method based on pointer generation network
Qu et al. Acoustic scene classification based on three-dimensional multi-channel feature-correlated deep learning networks
WO2023279691A1 (en) Speech classification method and apparatus, model training method and apparatus, device, medium, and program
Yasmin et al. A rough set theory and deep learning-based predictive system for gender recognition using audio speech
CN114913872A (en) Time-frequency double-domain audio classification method and system based on convolutional neural network
Shah et al. Speech emotion recognition based on SVM using MATLAB
CN113870863A (en) Voiceprint recognition method and device, storage medium and electronic equipment
Sakamoto et al. Stargan-vc+ asr: Stargan-based non-parallel voice conversion regularized by automatic speech recognition
CN115101091A (en) Sound data classification method, terminal and medium based on multi-dimensional feature weighted fusion
Hajihashemi et al. Novel time-frequency based scheme for detecting sound events from sound background in audio segments
Mohammed et al. Spin-Image Descriptors for Text-Independent Speaker Recognition
CN115206297A (en) Variable-length speech emotion recognition method based on space-time multiple fusion network
Chakravarty et al. Feature extraction using GTCC spectrogram and ResNet50 based classification for audio spoof detection
Shao et al. Deep semantic learning for acoustic scene classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination