CN113361592A - Acoustic event identification method based on public subspace representation learning - Google Patents

Acoustic event identification method based on public subspace representation learning Download PDF

Info

Publication number
CN113361592A
CN113361592A CN202110620415.9A CN202110620415A CN113361592A CN 113361592 A CN113361592 A CN 113361592A CN 202110620415 A CN202110620415 A CN 202110620415A CN 113361592 A CN113361592 A CN 113361592A
Authority
CN
China
Prior art keywords
segment
matrix
acoustic event
level
subspace
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110620415.9A
Other languages
Chinese (zh)
Other versions
CN113361592B (en
Inventor
韩纪庆
史秋莹
郑贵滨
郑铁然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202110620415.9A priority Critical patent/CN113361592B/en
Publication of CN113361592A publication Critical patent/CN113361592A/en
Application granted granted Critical
Publication of CN113361592B publication Critical patent/CN113361592B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Complex Calculations (AREA)

Abstract

An acoustic event identification method based on common subspace representation learning relates to an acoustic event identification method. The method aims to solve the problem of low accuracy of an acoustic event recognition task caused by inconsistency of subspaces among different semantic features. Firstly, sampling, quantizing, frame level feature extraction, segment level feature extraction and expansion are carried out on each original acoustic event signal; then, semantic feature representation of the common subspace is obtained through learning; and calculating a kernel matrix of the training set, and training a classifier to obtain a classification model. During testing, sampling, quantizing, frame level feature extraction, segment level feature extraction and expansion are carried out on each original acoustic event signal; obtaining semantic feature representation of the common subspace under the guidance of the learned common subspace; and finally, calculating a kernel matrix of the test set, and performing model matching under the guidance of the classification model to obtain a prediction result. The method is mainly used for the identification of acoustic events.

Description

Acoustic event identification method based on public subspace representation learning
Technical Field
The invention relates to an acoustic event identification method.
Background
Acoustic events are sound signals with well-defined semantics and are also important media for human perception of the surrounding environment. With the development of information technology, it is possible for machines to have the ability to recognize and understand acoustic events by simulating the human auditory mechanisms. The acoustic event recognition technology is also widely applied to various practical fields such as environmental monitoring and smart home, and is receiving attention of more and more researchers.
In a plurality of acoustic event identification technologies, the semantic feature extraction method based on subspace learning can effectively improve the identification performance due to the fact that the content information and the time sequence relation of acoustic events are considered. The method fully describes the semantic features of a single event by independently learning respective subspaces of different acoustic events. However, since it ignores the consistency of the subspace used to characterize semantic features of different acoustic events, the differences between these features come not only from the semantic features implied by the events, but also from the inconsistency of the subspace, thereby affecting the recognition accuracy of the acoustic events.
Disclosure of Invention
The method aims to solve the problem of low accuracy of an acoustic event recognition task caused by inconsistency of subspaces among different semantic features.
An acoustic event recognition method based on common subspace representation learning, comprising the following steps:
extracting logarithmic Mel spectrum characteristics of the audio frame corresponding to the acoustic event sample to be identified to obtain a frame level characteristic matrix; further abstracting the frame-level features into segment-level features by using a convolutional neural network;
utilizing common subspace basis matrices U*Solving the optimal semantic feature matrix of each acoustic event sample
Figure BDA0003099311250000011
Then, calculating a kernel matrix of the acoustic event sample to be identified:
Figure BDA0003099311250000012
Figure BDA0003099311250000013
wherein the content of the first and second substances,
Figure BDA0003099311250000014
to utilize a common subspace basis matrix U*Aiming at an optimal semantic feature matrix obtained by a training set, wherein the training set is a set formed by acoustic event samples used for training; n is a radical ofteNumber of samples, N, of acoustic event samples to be identifiedtrThe total number of training samples in the training set; k (·, ·) is a grassmann kernel function; r represents a real number space;
finally in the classification parameter alpha*Under the guidance of (1), model matching is performed according to the following formula:
P=(α*)TKte
wherein the content of the first and second substances,
Figure BDA0003099311250000021
for the prediction result, each column of the prediction result is a probability score of the acoustic event sample to be identified on each category, and the identification result of the acoustic event sample to be identified is determined by taking the maximum value of each column;
the classification parameter alpha*The classification parameters of the support vector machine classifier are obtained by training with a training set.
Further, the classification parameter α*The acquisition process comprises the following steps:
calculating a kernel matrix according to the optimal semantic feature matrix of the training set:
Figure BDA0003099311250000022
Figure BDA0003099311250000023
wherein, KtrA kernel matrix which is a training set;
Figure BDA0003099311250000024
respectively according to the optimal semantic features of the training set;
by KtrTraining a support vector machine classifier to obtain a classification model, wherein classification parameters of the classification model are expressed as
Figure BDA0003099311250000025
Wherein c is the total number of classes.
Further, the common subspace base matrix U*The determination process of (a) includes the steps of:
step 5.1: randomly initializing a base matrix of the public subspace;
step 5.2: extracting frame level characteristics corresponding to each training sample in the training set, obtaining segment level characteristics according to the frame level characteristic matrix, and recording the segment level characteristics of the ith sample in the segment level characteristic set as the segment level characteristics
Figure BDA0003099311250000026
Wherein N isiD is the nth segment to form the number of segments required for the sample
Figure BDA0003099311250000027
A characteristic dimension of (d);
randomly selecting part of acoustic event samples in the training set of the segment-level features, and respectively solving semantic feature matrixes of the acoustic event samples under the condition of fixing a common subspace basis matrix;
step 5.3: updating the base matrix of the public subspace by using the plurality of semantic feature matrixes obtained in the step 5.2;
step 5.4: repeating the steps 5.2 to 5.3 until convergence to obtain the optimal representation of the common subspace base matrix, namely the common subspace base matrix U*
Further, the segment level feature X of step 5.2iA smoothing process and a length normalization process are required.
Further, the process of randomly selecting some acoustic event samples and solving their semantic feature matrices respectively under the condition of fixing the common subspace basis matrix as described in step 5.2 includes the following steps:
in the training set of the segment-level features, selecting l acoustic event samples at random, fixing U, and respectively obtaining a semantic feature matrix according to the following formula:
Figure BDA0003099311250000028
wherein, U is a standard orthogonal basis matrix of a random initialization public subspace; viFor the ith sample X in the l training samplesiCorresponding semantic feature matrix, constraining the different columns in the matrix to be mutually orthogonal, and dividing ViIs shown as
Figure BDA0003099311250000029
The above elements; q. q.siIs NiThe total combination number of two segments with the occurrence sequence in each segment, lambda is used for reflecting a hyper-parameter of the influence of the characterization time sequence relation on semantic feature quality, and eta is a hyper-parameter reflecting the obvious degree of the characterization time sequence relation; 1p=[1]∈Rp×1Is a p-dimensional whole-column vector,
Figure BDA0003099311250000031
is p × qiAll one matrix of (a);
Figure BDA0003099311250000032
Figure BDA0003099311250000033
respectively represent XiNth, mth segment level features;
solving about
Figure BDA0003099311250000034
To the optimization problem of (2).
Further, said NiTotal combined number q of two segments with successive occurrence order in each segmenti=0.5×Ni×(Ni-1)。
Further, in the process of updating the base matrix of the common subspace by using the plurality of semantic feature matrices obtained in step 5.2, the base matrix of the common subspace is updated according to the following formula:
Figure BDA0003099311250000035
wherein | · | purpleFRepresenting the Frobenius norm.
Furthermore, the acoustic event sample to be identified is obtained after sampling and quantizing the acoustic event signal to be identified.
Further, in the process of further abstracting the frame-level features into the segment-level features by using the convolutional neural network, a plurality of adjacent frame-level features are further abstracted into the segment-level features.
Further, after the frame-level features are further abstracted into segment-level features by using the convolutional neural network, the segment-level features are subjected to smoothing processing and length normalization processing.
Has the advantages that:
the method can well solve the influence of the inconsistency of the subspace among different semantic features on the accuracy rate of the acoustic event recognition task. Meanwhile, the invention can describe the semantic features of the acoustic event from multiple angles by learning the multidimensional public subspace, thereby improving the recognition performance together, and the accuracy of the acoustic event recognition task is as high as 84.1%.
Drawings
Fig. 1 is a schematic diagram of an acoustic event recognition method based on common subspace representation learning.
FIG. 2 is a convolutional neural network architecture diagram for extracting segment-level features.
Fig. 3 is a histogram of the accuracy of acoustic event recognition methods and correlation methods on ESC-50 data sets based on common subspace representation learning.
Detailed Description
The first embodiment is as follows: the present embodiment is described with reference to fig. 1, and fig. 1 is a schematic diagram of an acoustic event recognition method based on common subspace representation learning. In the training stage, firstly, sampling, quantizing, frame level feature extraction, segment level feature extraction and expansion are respectively carried out on original signals from a training set; then, obtaining semantic feature representation of the public subspace by learning the public subspace; and finally, calculating a kernel matrix of the training set, and training a classifier to obtain a classification model. In the testing stage, firstly, sampling, quantizing, frame level feature extracting, segment level feature extracting and expanding are carried out on each original acoustic event signal in a testing set; then, obtaining semantic feature representation under the guidance of the learned public subspace; and finally, calculating a kernel matrix of the test set, and performing model matching under the guidance of the classification model to obtain a prediction result.
The acoustic event identification method based on common subspace representation learning in the embodiment comprises the following steps:
step 1: and respectively sampling and quantizing the original acoustic event signals in the training set and the testing set to obtain processed acoustic event samples. In this embodiment, the sampling rate may be 44100 Hz, and the number of quantization bits may be 16.
Step 2: and (2) dividing each acoustic event sample obtained in the step (1) into a plurality of audio frames, dividing the audio frames into a plurality of audio frames according to a pre-specified frame length and an inter-frame overlapping proportion, and respectively extracting classical logarithmic Mel spectral features in an acoustic event recognition task from the audio frames according to a pre-specified Mel frequency band number to obtain a frame level feature matrix. In the present embodiment, the frame length, the inter-frame overlap, and the number of mel-frequency bands may be set to 23 msec, 50%, and 128, respectively.
And step 3: considering that the audio frame is often too short in duration, there is a limitation that the audio frame contains insufficient semantic information. For this purpose, for each frame-level feature matrix obtained in step 2, according to a pre-specified segment length and an inter-segment overlapping proportion, a plurality of adjacent frame-level features are input into a pre-trained convolutional neural network, and under the condition of fixed network parameters, segment-level features are obtained without supervision, so that a segment-level feature matrix is obtained. FIG. 2 is a convolutional neural network architecture diagram for extracting segment-level features. The network is composed of 13 convolutional layers, 6 maximum pooling layers and an average pooling layer, logarithmic Mel spectrum characteristics of samples in an AudioSet data set are used as input, supervised pre-training is carried out, and then optimal network parameters are obtained.
In order to fully utilize and utilize the capability of the convolutional neural network and effectively control the total time of samples, in this embodiment, each segment-level feature may be set as a further abstraction result of 84 audio frames, the inter-segment overlap may be 20%, and the penultimate layer (last convolutional layer) of the above network may be output as a segment-level feature, the node number of which is 1024.
And 4, step 4: in order to enhance the time sequence correlation among the segment-level features, smoothing is performed on each segment-level feature matrix obtained in the step 3 by using a classical time domain transform averaging method, and then length normalization is performed on the smoothed segment-level feature matrix to obtain the segment-level feature matrix expanded by the operation. For the convenience of the following description, the segment level of the i-th sample after being expanded is characterized as
Figure BDA0003099311250000041
Wherein N isiD is the nth segment to form the number of segments required for the sample
Figure BDA0003099311250000042
The characteristic dimension of (c). Since the above expansion operation does not change the segment-level feature dimension obtained after step 3, d in this step can also be set to 1024.
And 5: in order to depict the overall semantic features of the acoustic event samples, a public subspace is constructed by using a strategy based on public subspace learning, and the method comprises the following steps:
step 5.1: defining a subspace with a dimension p, randomly initializing the orthonormal basis matrix of the subspace, and recording as
Figure BDA0003099311250000051
Wherein the content of the first and second substances,
Figure BDA0003099311250000052
refers to the Stiefel flow pattern, which isA set consisting of orthonormal matrices of size d × p. Wherein p can range from [1,2,3,4,5 ]]Selecting.
Step 5.2: in the training set processed in step 4, randomly selecting l acoustic event samples, fixing U (steps 5.1-5.3 are three main steps in an iterative algorithm, which can be described as alternately updating U and Vi, and here, fixing U updates the other), and obtaining their semantic feature matrices according to the following formula:
Figure BDA0003099311250000053
wherein the content of the first and second substances,
Figure BDA0003099311250000054
is about ViFunction of ViFor the ith sample X in the l training samplesiCorresponding semantic feature matrix, in order to avoid redundancy in semantic features, the invention restrains the mutual orthogonality between different columns in the matrix, and combines ViIs shown as
Figure BDA0003099311250000055
The above elements; q. q.siIs NiThe total combination number of two segments with the successive occurrence sequence in each segment can be determined by qi=0.5×Ni×(Ni-1) performing a calculation; λ is a hyper-parameter of the invention used for reflecting the influence of the characterization time sequence relation on the semantic feature quality, and η is a hyper-parameter reflecting the significance degree of the characterization time sequence relation; 1p=[1]∈Rp×1Is a p-dimensional whole-column vector,
Figure BDA0003099311250000056
is p × qiAll of the matrices of (a).
Figure BDA0003099311250000057
Containing a sample XiThe difference between any two of the segments having a tandem occurrence relationship;
Figure BDA0003099311250000058
respectively represent XiNth, mth segment level features.
To effectively solve about
Figure BDA0003099311250000059
The invention utilizes a classical Riemann gradient descent algorithm which is an iterative solution strategy, and obtains a semantic feature representation V after the preassigned maximum iteration timesi. According to the invention, through learning the multidimensional public subspace, the semantic features of the acoustic events can be described from multiple angles, so that the recognition performance is improved together.
Specifically, the algorithm depends on the iteration of each step
Figure BDA00030993112500000510
About ViMay be represented as:
Figure BDA00030993112500000511
where l may be set to 64, λ may be set to 0.1, η may be set to 0.001, and the maximum number of iterations may be set to 20.
Step 5.3: updating the base matrix of the public subspace according to the following formula by using the plurality of semantic feature matrixes obtained in the step 5.2:
Figure BDA00030993112500000512
Figure BDA00030993112500000513
is a function related to U, and in order to effectively solve the formula, a classical Riemannian gradient descent algorithm is adopted for updating once to obtain updated U. Specifically, the above algorithms all depend on the update process
Figure BDA00030993112500000514
With respect to the gradient of U, it can be expressed as:
Figure BDA0003099311250000061
step 5.4: repeating steps 5.2 to 5.3 until convergence to obtain an optimal representation U of the common subspace base matrix*
Step 6: at U*Respectively solving the optimal semantic feature matrix of each acoustic event sample in the training set and the test set after the step 4 under the guidance of (1), and recording the optimal semantic feature matrix set of the training set as
Figure BDA0003099311250000062
Wherein N istrThe total training sample number is obtained; the optimal semantic feature matrix set of the test set is
Figure BDA0003099311250000063
Wherein N isteIs the total number of test samples.
And 7: calculating a kernel matrix of the training set by using the optimal semantic feature matrix of the training set according to the following formula:
Figure BDA0003099311250000064
Figure BDA0003099311250000065
wherein, KtrTo train the kernel matrix of the set, k (·,) is a pre-specified grassmann kernel function. Further, with KtrTraining a support vector machine classifier to obtain a classification model, wherein classification parameters of the model can be expressed as
Figure BDA0003099311250000066
Wherein c is the total number of classes.
And 8: and calculating a kernel matrix of the test set according to the following formula by using the optimal semantic feature matrix of the training set and the test set:
Figure BDA0003099311250000067
Figure BDA0003099311250000068
wherein, KteIs the above-mentioned kernel matrix. Further, at α*Under the guidance of (1), model matching is performed according to the following formula:
P=(α*)TKte
wherein the content of the first and second substances,
Figure BDA0003099311250000069
for predicting the result, each column can be regarded as a probability score of the corresponding test sample on each category, and the recognition result of the test sample can be determined by taking the maximum value of each column.
The invention is tested on the internationally published data set, and the test result shows that the technology can effectively improve the identification performance.
Fig. 3 is a histogram of the accuracy of acoustic event recognition methods and correlation methods on ESC-50 data sets based on common subspace representation learning. By comparing the accuracy of the method provided by the invention and the acoustic event identification method based on subspace representation learning, the necessity of introducing a common subspace can be verified. Meanwhile, compared with the accuracy achieved by the pre-training convolutional neural network and the human ear, the effectiveness of the method provided by the invention can be verified.
The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.

Claims (10)

1. An acoustic event recognition method based on common subspace representation learning, characterized by comprising the following steps:
extracting logarithmic Mel spectrum characteristics of the audio frame corresponding to the acoustic event sample to be identified to obtain a frame level characteristic matrix; further abstracting the frame-level features into segment-level features by using a convolutional neural network;
utilizing common subspace basis matrices U*Solving the optimal semantic feature matrix of each acoustic event sample
Figure FDA0003099311240000011
Then, calculating a kernel matrix of the acoustic event sample to be identified:
Figure FDA0003099311240000012
Figure FDA0003099311240000013
wherein the content of the first and second substances,
Figure FDA0003099311240000014
to utilize a common subspace basis matrix U*Aiming at an optimal semantic feature matrix obtained by a training set, wherein the training set is a set formed by acoustic event samples used for training; n is a radical ofteNumber of samples, N, of acoustic event samples to be identifiedtrThe total number of training samples in the training set; k (·, ·) is a grassmann kernel function; r represents a real number space;
finally in the classification parameter alpha*Under the guidance of (1), model matching is performed according to the following formula:
P=(α*)TKte
wherein the content of the first and second substances,
Figure FDA0003099311240000015
for predicting the result, each column thereof corresponds to a sample of the acoustic event to be recognizedScoring the probability of each category, and determining the recognition result of the acoustic event sample to be recognized by taking the maximum value of each column;
the classification parameter alpha*The classification parameters of the support vector machine classifier are obtained by training with a training set.
2. The method of claim 1, wherein the classification parameter α is a*The acquisition process comprises the following steps:
calculating a kernel matrix according to the optimal semantic feature matrix of the training set:
Figure FDA0003099311240000016
Figure FDA0003099311240000017
wherein, KtrA kernel matrix which is a training set;
Figure FDA0003099311240000018
respectively according to the optimal semantic features of the training set;
by KtrTraining a support vector machine classifier to obtain a classification model, wherein classification parameters of the classification model are expressed as
Figure FDA0003099311240000019
Wherein c is the total number of classes.
3. The method for recognizing acoustic events based on common subspace representation learning according to claim 1 or 2, wherein the common subspace base matrix U*The determination process of (a) includes the steps of:
step 5.1: randomly initializing a base matrix of the public subspace;
step 5.2: extracting frame level characteristics corresponding to each training sample in the training set, obtaining segment level characteristics according to the frame level characteristic matrix, and recording the segment level characteristics of the ith sample in the segment level characteristic set as the segment level characteristics
Figure FDA0003099311240000021
Wherein N isiD is the nth segment to form the number of segments required for the sample
Figure FDA0003099311240000022
A characteristic dimension of (d);
randomly selecting part of acoustic event samples in the training set of the segment-level features, and respectively solving semantic feature matrixes of the acoustic event samples under the condition of fixing a common subspace basis matrix;
step 5.3: updating the base matrix of the public subspace by using the plurality of semantic feature matrixes obtained in the step 5.2;
step 5.4: repeating the steps 5.2 to 5.3 until convergence to obtain the optimal representation of the common subspace base matrix, namely the common subspace base matrix U*
4. The method of claim 3, wherein the segment-level feature X of step 5.2 is the same as the segment-level feature X of the acoustic event recognition method based on common subspace representation learningiA smoothing process and a length normalization process are required.
5. The method for recognizing acoustic events based on common subspace representation learning according to claim 4, wherein the process of randomly selecting partial acoustic event samples and respectively solving the semantic feature matrices thereof under the condition of fixing the common subspace base matrix in step 5.2 comprises the following steps:
in the training set of the segment-level features, selecting l acoustic event samples at random, fixing U, and respectively obtaining a semantic feature matrix according to the following formula:
Figure FDA0003099311240000023
wherein, U is a standard orthogonal basis matrix of a random initialization public subspace; viFor the ith sample X in the l training samplesiCorresponding semantic feature matrix, constraining the different columns in the matrix to be mutually orthogonal, and dividing ViIs shown as
Figure FDA0003099311240000028
The above elements; q. q.siIs NiThe total combination number of two segments with the occurrence sequence in each segment, lambda is used for reflecting a hyper-parameter of the influence of the characterization time sequence relation on semantic feature quality, and eta is a hyper-parameter reflecting the obvious degree of the characterization time sequence relation; 1p=[1]∈Rp×1Is a p-dimensional whole-column vector,
Figure FDA0003099311240000024
is p × qiAll one matrix of (a);
Figure FDA0003099311240000025
Figure FDA0003099311240000026
respectively represent XiNth, mth segment level features;
solving about
Figure FDA0003099311240000027
To the optimization problem of (2).
6. The method of claim 5, wherein N is the number of acoustic events identified based on the learning of the common subspace representationiTotal combined number q of two segments with successive occurrence order in each segmenti=0.5×Ni×(Ni-1)。
7. The method for recognizing acoustic events based on common subspace representation learning according to claim 6, wherein the updating of the base matrix of the common subspace by using the plurality of semantic feature matrices obtained in step 5.2 is performed by updating the base matrix of the common subspace according to the following formula:
Figure FDA0003099311240000031
wherein | · | purpleFRepresenting the Frobenius norm.
8. The method as claimed in claim 7, wherein the acoustic event samples to be identified are obtained by sampling and quantizing the acoustic event signals to be identified.
9. The method of claim 8, wherein the further abstraction of the frame-level features into the segment-level features by the convolutional neural network is to further abstract a plurality of adjacent frame-level features into the segment-level features.
10. The method of claim 9, wherein after the frame-level features are further abstracted into segment-level features by using a convolutional neural network, the segment-level features are further subjected to smoothing and length normalization.
CN202110620415.9A 2021-06-03 2021-06-03 Acoustic event identification method based on public subspace representation learning Active CN113361592B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110620415.9A CN113361592B (en) 2021-06-03 2021-06-03 Acoustic event identification method based on public subspace representation learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110620415.9A CN113361592B (en) 2021-06-03 2021-06-03 Acoustic event identification method based on public subspace representation learning

Publications (2)

Publication Number Publication Date
CN113361592A true CN113361592A (en) 2021-09-07
CN113361592B CN113361592B (en) 2022-11-08

Family

ID=77531792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110620415.9A Active CN113361592B (en) 2021-06-03 2021-06-03 Acoustic event identification method based on public subspace representation learning

Country Status (1)

Country Link
CN (1) CN113361592B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016042359A (en) * 2014-08-18 2016-03-31 株式会社デンソーアイティーラボラトリ Recognition apparatus, real number matrix decomposition method, and recognition method
CN106250855A (en) * 2016-08-02 2016-12-21 南京邮电大学 A kind of multi-modal emotion identification method based on Multiple Kernel Learning
CN110148428A (en) * 2019-05-27 2019-08-20 哈尔滨工业大学 A kind of acoustic events recognition methods indicating study based on subspace
US20200075040A1 (en) * 2018-08-31 2020-03-05 The Regents Of The University Of Michigan Automatic speech-based longitudinal emotion and mood recognition for mental health treatment
CN112241605A (en) * 2019-07-17 2021-01-19 华北电力大学(保定) Method for identifying state of circuit breaker energy storage process by constructing CNN characteristic matrix through acoustic vibration signals
CN112820071A (en) * 2021-02-25 2021-05-18 泰康保险集团股份有限公司 Behavior identification method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016042359A (en) * 2014-08-18 2016-03-31 株式会社デンソーアイティーラボラトリ Recognition apparatus, real number matrix decomposition method, and recognition method
CN106250855A (en) * 2016-08-02 2016-12-21 南京邮电大学 A kind of multi-modal emotion identification method based on Multiple Kernel Learning
US20200075040A1 (en) * 2018-08-31 2020-03-05 The Regents Of The University Of Michigan Automatic speech-based longitudinal emotion and mood recognition for mental health treatment
CN110148428A (en) * 2019-05-27 2019-08-20 哈尔滨工业大学 A kind of acoustic events recognition methods indicating study based on subspace
CN112241605A (en) * 2019-07-17 2021-01-19 华北电力大学(保定) Method for identifying state of circuit breaker energy storage process by constructing CNN characteristic matrix through acoustic vibration signals
CN112820071A (en) * 2021-02-25 2021-05-18 泰康保险集团股份有限公司 Behavior identification method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIWEN ZHANG等: "《Pyramidal Temporal Pooling With Discriminative Mapping for Audio Classification》", 《IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING》 *
史秋莹 等: "《基于DNN和多模态信息融合的复杂音频场景识别》", 《第十四届全国人机语音通讯学术会议(NCMMSC’2017)论文集》 *
程石磊: "《视频序列中人体行为的特征提取与识别方法研究》", 《中国博士学位论文全文数据库 (信息科技辑)》 *

Also Published As

Publication number Publication date
CN113361592B (en) 2022-11-08

Similar Documents

Publication Publication Date Title
Xie et al. Utterance-level aggregation for speaker recognition in the wild
JP6235938B2 (en) Acoustic event identification model learning device, acoustic event detection device, acoustic event identification model learning method, acoustic event detection method, and program
Shi et al. Few-shot acoustic event detection via meta learning
CN110349597B (en) Voice detection method and device
CN112885372B (en) Intelligent diagnosis method, system, terminal and medium for power equipment fault sound
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN111986699A (en) Sound event detection method based on full convolution network
CN112216287A (en) Environmental sound identification method based on ensemble learning and convolution neural network
Naranjo-Alcazar et al. On the performance of residual block design alternatives in convolutional neural networks for end-to-end audio classification
Mustika et al. Comparison of keras optimizers for earthquake signal classification based on deep neural networks
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
KR102241364B1 (en) Apparatus and method for determining user stress using speech signal
CN113361592B (en) Acoustic event identification method based on public subspace representation learning
Mahanta et al. The brogrammers dicova 2021 challenge system report
Neili et al. Gammatonegram based pulmonary pathologies classification using convolutional neural networks
CN115083433A (en) DNN-based text irrelevant representation tone clustering method
CN115267672A (en) Method for detecting and positioning sound source
CN114898773A (en) Synthetic speech detection method based on deep self-attention neural network classifier
CN112712096A (en) Audio scene classification method and system based on deep recursive non-negative matrix decomposition
Estrebou et al. Voice recognition based on probabilistic SOM
Alex et al. Performance analysis of SOFM based reduced complexity feature extraction methods with back propagation neural network for multilingual digit recognition
Thakur et al. Conv-codes: audio hashing for bird species classification
Long et al. Offline to online speaker adaptation for real-time deep neural network based LVCSR systems
Ashurov et al. Classification of Environmental Sounds Through Spectrogram-Like Images Using Dilation-Based CNN
Nagajyothi et al. Voice Recognition Based on Vector Quantization Using LBG

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant