CN108010514B - Voice classification method based on deep neural network - Google Patents

Publication number
CN108010514B
Authority
CN
China
Prior art keywords
local
global
neural network
information
spectrogram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201711155884.8A
Other languages
Chinese (zh)
Other versions
CN108010514A (en)
Inventor
毛华
章毅
吴雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN201711155884.8A
Publication of CN108010514A
Application granted
Publication of CN108010514B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a speech classification method based on a deep neural network, which aims to solve different speech classification problems with a single unified algorithm model. The method comprises the following steps. S1: convert the speech into its spectrogram, and partition the complete spectrogram along the frequency axis to obtain a set of local frequency-domain inputs. S2: feed both the complete spectrogram and the local frequency-domain inputs to the model, so that the convolutional neural network extracts global and local features from the respective inputs. S3: apply an attention mechanism to fuse the global and local feature expressions into a final feature expression. S4: train the network on labeled data by gradient descent and backpropagation. S5: for unlabeled speech, run the model with the trained parameters and output the class with the highest probability as the prediction. The invention provides a unified algorithm model for different speech classification problems and improves accuracy on several of them.

Description

Voice classification method based on deep neural network
Technical Field
The invention relates to a speech classification method based on a deep neural network. The method handles different speech classification tasks and belongs to the technical fields of speech signal processing and artificial intelligence.
Background
With the rapid development of computer technology, people depend on computers more and expect more from them, and how to interact with computers more naturally has become a research hotspot. Speech is the most common and natural means of everyday communication and carries a great deal of information, such as the speaker's accent and emotional state. A computer's ability to classify and recognize speech is an important part of machine speech processing and a key prerequisite for natural human-computer interfaces, so it has considerable research and application value. Speech classification is an important research direction that plays a significant role in speech recognition, speech content detection, and related areas. It is also the basis for deeper processing of audio: for a given audio clip, the acoustic environment and the speaker's gender, accent, emotion, and so on can be determined in advance by classification, which provides a basis for adapting the speech model. Speech classification methods are therefore crucial.
Speech classification covers many different tasks, such as speech emotion recognition, accent recognition, speaker recognition, and acoustic environment classification. The main challenge is the high dimensionality of speech. Conventional methods typically extract specific audio features for a single problem or database, thereby reducing the dimensionality of the data fed to the classifier. However, feature extraction requires substantial knowledge of speech signal processing, and because it filters the signal, it can discard information. In addition, conventional classification algorithms such as support vector machines are often poorly suited to multi-class tasks. These are the difficulties the present work sets out to overcome.
Deep neural networks are currently one of the most important tools for processing big data, especially high-dimensional data. By constructing multilayer nonlinear mapping functions and training the connection weights, a deep neural network can learn features of audio data and perform classification. Because the network adjusts its parameters according to the error in its output, it learns from feedback. Interest in deep neural networks has spread across many disciplines, and they have been applied successfully in fields including machine translation, speech recognition, and object recognition.
Disclosure of Invention
To address the above shortcomings, the invention provides a speech classification method based on a deep neural network. It solves two problems of the prior art: existing methods target only a single classification task or a single hand-designed feature extraction scheme, and they have difficulty processing high-dimensional data.
To achieve this purpose, the invention adopts the following technical scheme:
A speech classification method based on a deep neural network, characterized by comprising the following steps:
S1: perform a short-time Fourier transform on the speech data to convert it into the corresponding spectrogram; partition the complete spectrogram along the frequency axis to obtain a set of local frequency-domain inputs;
S2: establish an algorithm model based on a convolutional neural network and an attention mechanism, and use the complete spectrogram and the local frequency-domain inputs as separate inputs of the model for feature learning; the convolutional neural network extracts local features from the local inputs and global features from the complete spectrogram;
S3: fuse the global and local feature expressions with an attention mechanism to form a final feature expression, and input it to a softmax classifier to predict the class to which the speech belongs;
S4: train the network on labeled speech data by gradient descent and backpropagation, and store the network parameters;
S5: predict the class of unlabeled speech with the trained model; the model outputs the class with the highest probability as the final prediction.
Further, the distributed spectrogram conversion process in S1 specifically includes the following steps:
Perform a short-time Fourier transform on the original audio: divide the given audio into M short segments, compute the short-time energy of each segment, and take the modulus. This yields the complete spectrogram S:
$$S = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1N} \\ s_{21} & s_{22} & \cdots & s_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ s_{M1} & s_{M2} & \cdots & s_{MN} \end{bmatrix} \tag{1}$$
where N is the length of each short audio segment. Formula (1) shows that the spectrogram is a two-dimensional matrix whose two dimensions represent, respectively, the progression of the speech over time and the frequency axis from low to high frequency; the value at each point is the amplitude.
Partitioning the complete spectrogram along the frequency axis yields a set of local and global spectral inputs, that is, a group of inputs covering different frequency ranges: $\{s_1, s_2, \ldots, s_n, S\}$.
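As an illustration of step S1, the following minimal Python sketch computes a magnitude spectrogram with a short-time Fourier transform and splits it along the frequency axis. The frame length, hop size, Hann window, 16 kHz sample rate, number of bands, and the function names `spectrogram` and `partition_frequency` are assumptions made for this example; the patent does not specify them. Note also that `rfft` returns frame_len/2 + 1 frequency bins per frame rather than frame_len values.

```python
import numpy as np

def spectrogram(audio, frame_len=256, hop=128):
    """Magnitude spectrogram: rows are time frames, columns are frequency bins."""
    n_frames = 1 + (len(audio) - frame_len) // hop
    window = np.hanning(frame_len)  # assumed window; not specified in the patent
    frames = np.stack([audio[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps the non-negative frequencies; abs() takes the modulus.
    return np.abs(np.fft.rfft(frames, axis=1))

def partition_frequency(S, n_bands):
    """Split S along the frequency axis into n_bands local blocks."""
    return np.array_split(S, n_bands, axis=1)

rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)       # one second of noise at an assumed 16 kHz
S = spectrogram(audio)
local_inputs = partition_frequency(S, n_bands=4)
inputs = local_inputs + [S]              # the set {s_1, ..., s_n, S}
```

Splitting with `np.array_split` gives contiguous, non-overlapping bands; overlapping bands would be an equally plausible reading of the text and would need a different slicing rule.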
Further, the feature extraction of the convolutional neural network in S2 specifically includes the following steps:
For the local inputs, a convolutional neural network extracts features from each input, giving a set of local expressions:
$$a_n = f(w_n s_n + b_n) \tag{2}$$
In formula (2), each local input $s_n$ has its own convolution parameters $w_n$ and $b_n$, and f denotes the activation function. The resulting set of local features is $\{a_1, a_2, \ldots, a_n\}$.
For the complete global frequency-domain information, a convolutional neural network extracts the global feature:
$$a = g(wS + b) \tag{3}$$
The global input S has its own convolution weight w and bias b, g denotes its activation function, and a denotes the global feature extracted by the convolutional neural network.
Formulas (2) and (3) mainly involve the convolution and pooling operations of the convolutional neural network. The convolution is computed as:
$$a_{ij} = f\left(\sum_{m=0}^{M-1} \sum_{n=0}^{N-1} w_{mn}\, s_{i+m,\,j+n} + b\right) \tag{4}$$
where M and N define the size of the convolution kernel, m and n index the rows and columns within the kernel, f is the activation function applied to the convolution result, $a_{ij}$ is the feature at row i, column j of the current layer, $s_{ij}$ is the input at row i, column j of the current layer, w denotes the parameters of the convolution kernel, and b is the corresponding bias.
The convolution in formula (4) plays a central role in the convolutional network. Because the weights are shared across positions, the extracted features are robust to small deformations: when the input changes slightly, the features extracted by the network change little.
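The convolution of formula (4) can be sketched in a few lines of NumPy. This is an illustrative, unoptimized version that assumes a single-channel input, a "valid" output region, and no kernel flipping (the cross-correlation form common in deep learning libraries). The function name `conv2d` and the choice of activation are assumptions, not taken from the patent.

```python
import numpy as np

def conv2d(s, w, b, f=np.tanh):
    """Slide kernel w over input s, add bias b, then apply activation f."""
    M, N = w.shape                      # kernel size
    rows = s.shape[0] - M + 1
    cols = s.shape[1] - N + 1
    a = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            # The same weights w are reused at every position (i, j).
            a[i, j] = np.sum(w * s[i:i + M, j:j + N]) + b
    return f(a)

s = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((2, 2)) / 4.0               # averaging kernel, chosen for easy checking
a = conv2d(s, w, b=0.0, f=lambda x: x)  # identity activation to expose raw sums
```

With the averaging kernel and identity activation, each output entry is simply the mean of a 2 x 2 patch of the input, which makes the weight sharing easy to verify by hand.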
The pooling operation is:
$$p = \sigma(a) \tag{5}$$
where σ denotes the pooling function, a is the input of the pooling layer, and p is the output after pooling. The three most common pooling functions take the maximum, the minimum, or the average within the receptive field (the window covered by the kernel).
The pooling in formula (5) shrinks the feature maps, which greatly reduces the number of weights needed in subsequent layers and helps prevent overfitting.
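A minimal sketch of the pooling step of formula (5) follows, assuming non-overlapping square windows. The helper name `pool2d` and the window size are illustrative assumptions; the reduction function is passed in so that the maximum, minimum, and average variants share one implementation.

```python
import numpy as np

def pool2d(a, size=2, reduce=np.max):
    """Apply `reduce` over non-overlapping size x size windows of a."""
    rows, cols = a.shape[0] // size, a.shape[1] // size
    # Trim any ragged border, then expose each window as axes 1 and 3.
    blocks = a[:rows * size, :cols * size].reshape(rows, size, cols, size)
    return reduce(blocks, axis=(1, 3))

a = np.arange(16, dtype=float).reshape(4, 4)
p_max = pool2d(a, reduce=np.max)
p_min = pool2d(a, reduce=np.min)
p_avg = pool2d(a, reduce=np.mean)
```

Each 4 x 4 input becomes a 2 x 2 output, so any layer that follows sees a quarter as many values.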
Further, the fusion of the attention mechanism and the local feature expression in S3 specifically includes the following steps:
Based on the different local features, an attention mechanism is applied to obtain new global feature expressions. First, each component of the global feature is assigned a coefficient:
$$\alpha_i^n = g\left(W_2\, f\left(W_1 [p_i, a_n] + b_1\right) + b_2\right) \tag{6}$$
In formula (6), $p_i$ denotes one component of the global feature a, which has m components in total, and $\alpha_i^n$ denotes the coefficient of component $p_i$ given the current local feature $a_n$, that is, its degree of importance. The attention mechanism learns this coefficient through a two-layer mapping: the first layer uses weight $W_1$, bias $b_1$, and activation function f; the second layer applies weight $W_2$, bias $b_2$, and activation function g to the output of the first layer.
Formula (6) is the core operation of the attention mechanism: guided by the local feature $a_n$, it assigns a weight to each component $p_i$ of the global feature a, and that weight represents the importance of the component. Through network training, the model is expected to find the most representative components.
Each coefficient is then multiplied by its corresponding component to form a new piece of global information:
$$\hat{a}_n = (\alpha_1^n p_1,\ \alpha_2^n p_2,\ \ldots,\ \alpha_m^n p_m) \tag{7}$$
Applying the attention mechanism in this way yields n new pieces of global information, which are added element-wise to the initial global feature a to give the final feature expression:
$$A = a + \sum_{k=1}^{n} \hat{a}_k \tag{8}$$
The final feature expression A is input to a softmax classifier, and the class with the highest probability is the predicted class of the speech data.
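The fusion described in this step can be sketched as follows. This is only one possible reading under stated assumptions: the global feature is treated as a vector of m scalar components, each component is concatenated with the guiding local feature before the two-layer mapping, and tanh and sigmoid stand in for the activation functions f and g. The names `attention_fuse`, `softmax`, `Wc`, and `bc`, and all dimensions, are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - np.max(z))           # subtract the max for numerical stability
    return e / e.sum()

def attention_fuse(a, local_feats, W1, b1, W2, b2):
    """Re-weight the components of global feature a once per local feature, then add."""
    A = a.copy()
    for a_n in local_feats:
        # Pair every global component p_i with the guiding local feature a_n.
        pairs = np.column_stack([a, np.tile(a_n, (a.size, 1))])  # shape (m, 1 + d)
        hidden = np.tanh(pairs @ W1 + b1)                         # first mapping (assumed f = tanh)
        alpha = sigmoid(hidden @ W2 + b2).ravel()                 # second mapping (assumed g = sigmoid)
        A += alpha * a                                            # new global information, added element-wise
    return A

rng = np.random.default_rng(0)
m, d, h, n_local, n_classes = 6, 4, 8, 3, 5
a = rng.standard_normal(m)
local_feats = [rng.standard_normal(d) for _ in range(n_local)]
W1, b1 = rng.standard_normal((1 + d, h)), np.zeros(h)
W2, b2 = rng.standard_normal((h, 1)), 0.0
A = attention_fuse(a, local_feats, W1, b1, W2, b2)

Wc, bc = rng.standard_normal((n_classes, m)), np.zeros(n_classes)  # hypothetical classifier weights
probs = softmax(Wc @ A + bc)
predicted_class = int(np.argmax(probs))
```

Using a sigmoid for the second mapping keeps every coefficient between 0 and 1, so each new piece of global information is a damped copy of the original feature; a softmax over the m components would be another reasonable choice.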
With the above scheme, the invention has the following beneficial effects:
(1) Traditional speech classification methods use a different feature extraction algorithm for each individual problem. The invention instead learns features directly from the speech spectrogram with a deep neural network, so it can autonomously learn different audio features for different tasks.
(2) Training deep neural networks usually requires large amounts of data, but little speech data is publicly available. Building on earlier research on deep neural networks, the invention proposes an algorithm model that fuses a convolutional neural network with an attention mechanism, further improving the recognition rate on multiple tasks.
Taking two groups of voice classification tasks of accent recognition and speaker recognition as examples:
Model                                       Accuracy (%)
i-Vector                                    74.50
Convolutional network and attention model   79.32
VGG-11                                      54.40
ResNet-18                                   61.66
ResNet-34                                   58.47
Table 1 compares the model of the invention with other methods on accent recognition; i-Vector is a classical feature extraction algorithm, and VGG and ResNet are representative convolutional neural network models.
Model                                       Accuracy (%)
MFCC                                        91.00
Convolutional network and attention model   98.04
VGG-11                                      75.21
ResNet-18                                   75.04
ResNet-34                                   66.05
Table 2 compares the model of the invention with other methods on speaker recognition; MFCC is a classical feature extraction algorithm, and VGG and ResNet are representative convolutional neural network models.
The above experimental results prove that:
1) On several speech classification problems, the features learned by the proposed model give better recognition results than traditional feature extraction algorithms.
2) Compared with other neural network methods, the invention applies an attention mechanism within the convolutional neural network, which increases the robustness and generalization ability of the model and improves speech classification accuracy on multiple problems.
Drawings
FIG. 1 is a schematic diagram of an algorithm model according to the present invention;
FIG. 2 is a frequency domain based distributed spectrogram;
FIG. 3 is a basic block diagram of a convolution block employing an attention mechanism;
FIG. 4 is an overall process diagram of the present invention.
Detailed Description of the Preferred Embodiments
The technical solution in the embodiments of the invention is described in detail below with reference to the drawings. The embodiments described here are only some embodiments of the invention, not all of them.
Referring to FIG. 1, the core model for speech classification is a deep neural network composed of several convolution blocks that use an attention mechanism. It has two components. The first is a convolutional neural network, which uses multilayer nonlinear functions to learn the mapping between input data and features; this deep learning algorithm learns the relevant features automatically according to the target. The second is an attention mechanism, which assigns different weights to the local information to obtain a weighted expression of it. By combining the deep learning algorithm with the attention mechanism, the invention effectively improves the accuracy of speech classification.
The speech classification method based on the deep neural network comprises the following steps:
Step S1: Perform a short-time Fourier transform on the original audio: divide the given audio into M short segments, compute the short-time energy of each segment, and take the modulus. This yields the complete spectrogram S:
$$S = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1N} \\ s_{21} & s_{22} & \cdots & s_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ s_{M1} & s_{M2} & \cdots & s_{MN} \end{bmatrix} \tag{1}$$
where N is the length of each short audio segment.
Partitioning the complete spectrogram along the frequency axis yields a set of local and global spectral inputs, that is, a group of inputs covering different frequency ranges: $\{s_1, s_2, \ldots, s_n, S\}$.
FIG. 2 shows the complete spectrogram and the frequency-domain-based distributed spectrogram. The distributed spectrogram is obtained by partitioning the spectrogram into intervals along the frequency axis, giving the information contained in each frequency interval.
Step S2: Establish an algorithm model based on a convolutional neural network and an attention mechanism, and use the complete spectrogram and the local frequency-domain inputs as separate inputs of the model for feature learning. For the local inputs, a convolutional neural network extracts features from each input, giving a set of local expressions:
$$a_n = f(w_n s_n + b_n) \tag{2}$$
In formula (2), each local input $s_n$ has its own convolution parameters $w_n$ and $b_n$, and f denotes the activation function. The resulting set of local features is $\{a_1, a_2, \ldots, a_n\}$.
For the complete global frequency-domain information, a convolutional neural network extracts the global feature:
$$a = g(wS + b) \tag{3}$$
The global input S has its own convolution weight w and bias b, g denotes its activation function, and a denotes the global feature extracted by the convolutional neural network.
Step S3: Starting from the local and global features extracted in step S2, an attention mechanism is applied, based on the different local features, to obtain new global feature expressions. First, each component of the global feature is assigned a coefficient:
$$\alpha_i^n = g\left(W_2\, f\left(W_1 [p_i, a_n] + b_1\right) + b_2\right)$$
In the above formula, $p_i$ denotes one component of the global feature a, which has m components in total, and $\alpha_i^n$ denotes the coefficient of component $p_i$ given the current local feature $a_n$, that is, its degree of importance. The attention mechanism learns this coefficient through a two-layer mapping: the first layer uses weight $W_1$, bias $b_1$, and activation function f; the second layer applies weight $W_2$, bias $b_2$, and activation function g to the output of the first layer.
Each coefficient is then multiplied by its corresponding component to form a new piece of global information:
$$\hat{a}_n = (\alpha_1^n p_1,\ \alpha_2^n p_2,\ \ldots,\ \alpha_m^n p_m)$$
Applying the attention mechanism in this way yields n new pieces of global information, which are added element-wise to the initial global feature a to give the final feature expression:
$$A = a + \sum_{k=1}^{n} \hat{a}_k$$
The final feature expression A is input to a softmax classifier, and the class with the highest probability is the predicted class of the speech data.
FIG. 3 shows the basic structure of a convolution block with the attention mechanism. It comprises feature extraction from the local and global information followed by re-fusion of that information through attention, which yields the final feature expression A.
Step S4: Train the network on labeled speech data by gradient descent and backpropagation, and store the network parameters. The model is first constructed with randomly initialized parameters; the parameters are then updated using the errors produced on the labeled speech data until the network stabilizes, and the best parameters are retained.
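The training loop of step S4 can be illustrated on a deliberately small stand-in. The sketch below trains a single softmax layer by gradient descent on synthetic two-dimensional "features"; it is not the patent's network, and the learning rate, epoch count, synthetic data, and the names `train` and `predict` are all assumptions made for the example. The same pattern (random initialization, forward pass, error gradient, parameter update, stored parameters) carries over to the full model, where backpropagation pushes the gradient through the convolution and attention layers as well.

```python
import numpy as np

def softmax_rows(z):
    z = z - z.max(axis=1, keepdims=True)     # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(X, y, n_classes, lr=0.5, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((X.shape[1], n_classes))  # random initialisation
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                 # one-hot labels
    for _ in range(epochs):
        P = softmax_rows(X @ W + b)          # forward pass
        grad_logits = (P - Y) / len(X)       # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ grad_logits)        # gradient descent update
        b -= lr * grad_logits.sum(axis=0)
    return {"W": W, "b": b}                  # parameters to be stored

def predict(params, X):
    probs = softmax_rows(X @ params["W"] + params["b"])
    return np.argmax(probs, axis=1)          # class with the highest probability

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
params = train(X, y, n_classes=2)
accuracy = np.mean(predict(params, X) == y)
```

The `predict` helper mirrors the inference rule used later in the method: compute class probabilities with the stored parameters and return the most probable class.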
Step S5: Predict the class of unlabeled speech with the trained model and parameters; the model outputs the class with the highest probability as the final prediction.
FIG. 4 shows the complete process of the invention from step S1 to step S5. If audio remains to be classified, steps S1 to S5 are repeated for it, and the classifier outputs the class with the highest probability as the prediction.

Claims (4)

1. A speech classification method based on a deep neural network is characterized in that a distributed spectrogram is combined with a convolutional neural network and an attention mechanism, and the method comprises the following steps:
S1: performing a short-time Fourier transform on the speech data to convert it into the corresponding spectrogram; partitioning the complete spectrogram along the frequency axis to obtain a set of local frequency-domain inputs;
S2: establishing an algorithm model based on a convolutional neural network and an attention mechanism, and using the complete spectrogram and the local frequency-domain inputs as separate inputs of the model for feature learning; extracting local and global features with a convolutional neural network from the local inputs and the complete spectrogram, respectively;
S3: fusing the global and local feature expressions with an attention mechanism to form a final feature expression, and inputting it to a softmax classifier to predict the class to which the speech belongs;
S4: training the network on labeled speech data by gradient descent and backpropagation, and storing the network parameters;
S5: predicting the class of unlabeled speech with the trained model, the model outputting the class with the highest probability as the final prediction.
2. The speech classification method based on a deep neural network according to claim 1, wherein the distributed spectrogram conversion in S1 comprises the following steps:
S11: performing a short-time Fourier transform on the original audio by dividing the given audio into M short segments, computing the short-time energy of each segment, and taking the modulus, to obtain the complete spectrogram S:
$$S = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1N} \\ s_{21} & s_{22} & \cdots & s_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ s_{M1} & s_{M2} & \cdots & s_{MN} \end{bmatrix} \tag{1}$$
where N is the length of each short audio segment;
S12: partitioning the complete spectrogram along the frequency axis, wherein a given local frequency-domain input $s_n$ is expressed as follows:
Figure FDA0003140056820000012
Finally, a set of local and global spectral inputs is obtained, that is, a group of inputs covering different frequency ranges: $\{s_1, s_2, \ldots, s_n, S\}$.
3. The speech classification method based on a deep neural network according to claim 1, wherein the feature extraction by the convolutional neural network in step S2 comprises the following steps:
S21: for the local inputs, extracting features from each input with a convolutional neural network to obtain a set of local expressions:
$$a_n = f(w_n s_n + b_n) \tag{3}$$
In the above formula, each local input $s_n$ has its own convolution parameters $w_n$ and $b_n$, and f denotes the activation function; the resulting set of local features is $\{a_1, a_2, \ldots, a_n\}$;
S22: for the complete global frequency-domain information, extracting the global feature with a convolutional neural network according to:
$$a = g(wS + b) \tag{4}$$
The global input S has its own convolution weight w and bias b, g denotes its activation function, and a denotes the global feature extracted by the convolutional neural network.
4. The speech classification method based on a deep neural network according to claim 1, wherein the fusion of the attention mechanism and the local feature expressions in step S3 comprises the following steps:
based on the different local features, applying an attention mechanism to obtain new global feature expressions, wherein each component of the global feature is first assigned a coefficient:
$$\alpha_i^n = g\left(W_2\, f\left(W_1 [p_i, a_n] + b_1\right) + b_2\right)$$
In the above formula, $p_i$ denotes one component of the global feature a, which has m components in total, and $\alpha_i^n$ denotes the coefficient of component $p_i$ given the current local feature $a_n$, that is, its degree of importance; the attention mechanism learns this coefficient through a two-layer mapping, in which the first layer uses weight $W_1$, bias $b_1$, and activation function f, and the second layer applies weight $W_2$, bias $b_2$, and activation function g to the output of the first layer;
then multiplying each coefficient by its corresponding component to form a new piece of global information:
$$\hat{a}_n = (\alpha_1^n p_1,\ \alpha_2^n p_2,\ \ldots,\ \alpha_m^n p_m)$$
whereby applying the attention mechanism yields n new pieces of global information, which are added element-wise to the initial global feature a to give the final feature expression:
$$A = a + \sum_{k=1}^{n} \hat{a}_k$$
and inputting the final feature expression A to a softmax classifier, the class with the highest probability being the predicted class of the speech data.
CN201711155884.8A 2017-11-20 2017-11-20 Voice classification method based on deep neural network Expired - Fee Related CN108010514B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711155884.8A CN108010514B (en) 2017-11-20 2017-11-20 Voice classification method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711155884.8A CN108010514B (en) 2017-11-20 2017-11-20 Voice classification method based on deep neural network

Publications (2)

Publication Number Publication Date
CN108010514A CN108010514A (en) 2018-05-08
CN108010514B CN108010514B (en) 2021-09-10

Family

ID=62052777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711155884.8A Expired - Fee Related CN108010514B (en) 2017-11-20 2017-11-20 Voice classification method based on deep neural network

Country Status (1)

Country Link
CN (1) CN108010514B (en)

Families Citing this family (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11462209B2 (en) * 2018-05-18 2022-10-04 Baidu Usa Llc Spectrogram to waveform synthesis using convolutional networks
CN108846048A (en) * 2018-05-30 2018-11-20 大连理工大学 Musical genre classification method based on Recognition with Recurrent Neural Network and attention mechanism
CN108877783B (en) * 2018-07-05 2021-08-31 腾讯音乐娱乐科技(深圳)有限公司 Method and apparatus for determining audio type of audio data
CN109256135B (en) * 2018-08-28 2021-05-18 桂林电子科技大学 End-to-end speaker confirmation method, device and storage medium
CN109410914B (en) * 2018-08-28 2022-02-22 江西师范大学 Method for identifying Jiangxi dialect speech and dialect point
CN109243466A (en) * 2018-11-12 2019-01-18 成都傅立叶电子科技有限公司 A kind of vocal print authentication training method and system
CN109599129B (en) * 2018-11-13 2021-09-14 杭州电子科技大学 Voice depression recognition system based on attention mechanism and convolutional neural network
CN109523994A (en) * 2018-11-13 2019-03-26 四川大学 A kind of multitask method of speech classification based on capsule neural network
CN109285539B (en) * 2018-11-28 2022-07-05 中国电子科技集团公司第四十七研究所 Sound recognition method based on neural network
CN111259189B (en) * 2018-11-30 2023-04-18 马上消费金融股份有限公司 Music classification method and device
CN109509475B (en) * 2018-12-28 2021-11-23 出门问问信息科技有限公司 Voice recognition method and device, electronic equipment and computer readable storage medium
CN109817233B (en) * 2019-01-25 2020-12-01 清华大学 Voice stream steganalysis method and system based on hierarchical attention network model
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system
CN110047516A (en) * 2019-03-12 2019-07-23 天津大学 A kind of speech-emotion recognition method based on gender perception
CN110197206B (en) * 2019-05-10 2021-07-13 杭州深睿博联科技有限公司 Image processing method and device
CN110223714B (en) * 2019-06-03 2021-08-03 杭州哲信信息技术有限公司 Emotion recognition method based on voice
CN110459225B (en) * 2019-08-14 2022-03-22 南京邮电大学 Speaker recognition system based on CNN fusion characteristics
CN110534133B (en) * 2019-08-28 2022-03-25 珠海亿智电子科技有限公司 Voice emotion recognition system and voice emotion recognition method
CN110648669B (en) * 2019-09-30 2022-06-07 上海依图信息技术有限公司 Multi-frequency shunt voiceprint recognition method, device and system and computer readable storage medium
CN110782878B (en) * 2019-10-10 2022-04-05 天津大学 Attention mechanism-based multi-scale audio scene recognition method
CN111009262A (en) * 2019-12-24 2020-04-14 携程计算机技术(上海)有限公司 Voice gender identification method and system
CN110992988B (en) * 2019-12-24 2022-03-08 东南大学 Speech emotion recognition method and device based on domain confrontation
CN111223488B (en) * 2019-12-30 2023-01-17 Oppo广东移动通信有限公司 Voice wake-up method, device, equipment and storage medium
CN111340187B (en) * 2020-02-18 2024-02-02 河北工业大学 Network characterization method based on attention countermeasure mechanism
CN111666996B (en) * 2020-05-29 2023-09-19 湖北工业大学 High-precision equipment source identification method based on attention mechanism
CN114141244A (en) * 2020-09-04 2022-03-04 四川大学 Voice recognition technology based on audio media analysis
CN112489687B (en) * 2020-10-28 2024-04-26 深兰人工智能芯片研究院(江苏)有限公司 Voice emotion recognition method and device based on sequence convolution
CN112466298B (en) * 2020-11-24 2023-08-11 杭州网易智企科技有限公司 Voice detection method, device, electronic equipment and storage medium
CN112489689B (en) * 2020-11-30 2024-04-30 东南大学 Cross-database voice emotion recognition method and device based on multi-scale difference countermeasure
CN112992119B (en) * 2021-01-14 2024-05-03 安徽大学 Accent classification method based on deep neural network and model thereof
CN112885372B (en) * 2021-01-15 2022-08-09 国网山东省电力公司威海供电公司 Intelligent diagnosis method, system, terminal and medium for power equipment fault sound
CN113593525B (en) * 2021-01-26 2024-08-06 腾讯科技(深圳)有限公司 Accent classification model training and accent classification method, apparatus and storage medium
CN112967730B (en) * 2021-01-29 2024-07-02 北京达佳互联信息技术有限公司 Voice signal processing method and device, electronic equipment and storage medium
CN113571063B (en) * 2021-02-02 2024-06-04 腾讯科技(深圳)有限公司 Speech signal recognition method and device, electronic equipment and storage medium
CN113035227B (en) * 2021-03-12 2022-02-11 山东大学 Multi-modal voice separation method and system
CN113049084B (en) * 2021-03-16 2022-05-06 电子科技大学 Attention mechanism-based Resnet distributed optical fiber sensing signal identification method
CN113409827B (en) * 2021-06-17 2022-06-17 山东省计算中心(国家超级计算济南中心) Voice endpoint detection method and system based on local convolution block attention network
CN116778951B (en) * 2023-05-25 2024-08-09 上海蜜度科技股份有限公司 Audio classification method, device, equipment and medium based on graph enhancement
CN116504259B (en) * 2023-06-30 2023-08-29 中汇丰(北京)科技有限公司 Semantic recognition method based on natural language processing
CN116825092B (en) * 2023-08-28 2023-12-01 珠海亿智电子科技有限公司 Speech recognition method, training method and device of speech recognition model
CN117275491B (en) * 2023-11-17 2024-01-30 青岛科技大学 Sound classification method based on audio conversion and temporal attention graph neural network

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101706780A (en) * 2009-09-03 2010-05-12 北京交通大学 Image semantic retrieving method based on visual attention model
CN102044254A (en) * 2009-10-10 2011-05-04 北京理工大学 Speech spectrum color enhancement method for speech visualization
CN106652999A (en) * 2015-10-29 2017-05-10 三星Sds株式会社 System and method for voice recognition
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN106952649A (en) * 2017-05-14 2017-07-14 北京工业大学 Speaker recognition method based on convolutional neural networks and spectrograms
CN107145518A (en) * 2017-04-10 2017-09-08 同济大学 A personalized recommendation system based on deep learning in social networks
CN107203999A (en) * 2017-04-28 2017-09-26 北京航空航天大学 An automatic dermoscopy image segmentation method based on fully convolutional neural networks
CN107221326A (en) * 2017-05-16 2017-09-29 百度在线网络技术(北京)有限公司 Voice awakening method, device and computer equipment based on artificial intelligence
CN107239446A (en) * 2017-05-27 2017-10-10 中国矿业大学 An intelligent relation extraction method based on neural network and attention mechanism
CN107256228A (en) * 2017-05-02 2017-10-17 清华大学 Answer selection system and method based on structured attention mechanism
CN107316066A (en) * 2017-07-28 2017-11-03 北京工商大学 Image classification method and system based on multi-path convolutional neural networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10762894B2 (en) * 2015-03-27 2020-09-01 Google Llc Convolutional neural networks
WO2017062610A1 (en) * 2015-10-06 2017-04-13 Evolv Technologies, Inc. Augmented machine decision making

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Deep Convolutional Recurrent neural network with attention mechanism for robust speech emotion recognition"; Che-Wei Huang; ICME 2017; 20170714; entire document *
"Describing Multimedia Content Using Attention-based Encoder-Decoder Networks"; Kyunghyun Cho; IEEE Transactions on Multimedia; 20151231; Vol. 17, No. 11; entire document *
"Time-frequency feature extraction from spectrograms and wavelet packets with application to automatic stress and emotion classification in speech"; L. He, M. Lech; Proceedings of the International Conference on Information, Communications and Signal Processing; 20091231; entire document *
"Transferring Deep Convolutional Neural networks for the scene classification of high-resolution remote sensing imagery"; Fan Hu; MDPI; 20151205; Vol. 7, No. 11; entire document *

Also Published As

Publication number Publication date
CN108010514A (en) 2018-05-08

Similar Documents

Publication Publication Date Title
CN108010514B (en) Voice classification method based on deep neural network
Turkoglu et al. Multi-model LSTM-based convolutional neural networks for detection of apple diseases and pests
Goceri Analysis of deep networks with residual blocks and different activation functions: classification of skin diseases
Wang et al. Research on Web text classification algorithm based on improved CNN and SVM
Thakur et al. Deep metric learning for bioacoustic classification: Overcoming training data scarcity using dynamic triplet loss
Chang et al. Automatic channel pruning via clustering and swarm intelligence optimization for CNN
CN109271522A (en) Comment sentiment classification method and system based on deep hybrid model transfer learning
WO2021127982A1 (en) Speech emotion recognition method, smart device, and computer-readable storage medium
CN106897254B (en) Network representation learning method
CN112818861A (en) Emotion classification method and system based on multi-mode context semantic features
CN111414461A (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN111898703B (en) Multi-label video classification method, model training method, device and medium
EP4198807A1 (en) Audio processing method and device
WO2020151310A1 (en) Text generation method and device, computer apparatus, and medium
WO2021042857A1 (en) Processing method and processing apparatus for image segmentation model
Lian et al. Unsupervised representation learning with future observation prediction for speech emotion recognition
CN111353534B (en) Graph data category prediction method based on adaptive fractional order gradient
CN112418059A (en) Emotion recognition method and device, computer equipment and storage medium
CN115601583A (en) Deep convolution network target identification method of double-channel attention mechanism
CN109522432B (en) Image retrieval method integrating adaptive similarity and Bayes framework
CN107807919A (en) A method for microblog sentiment classification prediction using a recurrent random walk network
CN116386899A (en) Graph learning-based medicine disease association relation prediction method and related equipment
CN109033413B (en) Neural network-based demand document and service document matching method
WO2021059527A1 (en) Learning device, learning method, and recording medium
CN115017900B (en) Conversation emotion recognition method based on multi-mode multi-prejudice

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210910