CN107578775B - Multi-classification voice method based on deep neural network - Google Patents


Info

Publication number
CN107578775B
CN107578775B (application CN201710801016.6A)
Authority
CN
China
Prior art keywords
model
classification
network
neural network
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710801016.6A
Other languages
Chinese (zh)
Other versions
CN107578775A (en)
Inventor
毛华
彭德中
章毅
曾煜妮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN201710801016.6A
Publication of CN107578775A
Application granted
Publication of CN107578775B
Status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-task speech classification method based on deep learning, which relates to the technical field of speech processing and comprises the following steps: S1, performing a time-frequency analysis on the speech data to obtain the corresponding spectrogram; S2, establishing a neural network model based on a convolutional neural network and a residual network, and taking the spectrogram as the network input to extract features; S3, inputting the extracted features into a plurality of different softmax classifiers to obtain an initialized model; S4, digitizing the speech samples and their corresponding labels, and training the initialized model with this data set to obtain the trained network model; S5, predicting unlabeled speech data with the trained model to obtain classification probability values, and selecting the category with the highest probability value as the classification result. The invention addresses the low classification efficiency of existing audio classification methods, which process tasks independently and ignore the correlation between speech tasks.

Description

Multi-classification voice method based on deep neural network
Technical Field
The invention relates to the technical field of sound signal processing, in particular to a voice multi-classification method based on a deep neural network.
Background
Sound provides us with a great deal of information about its source and the surrounding environment. The human auditory system can separate and recognize complex sounds, and it would be useful if a machine could perform similar functions (audio classification and recognition), such as speech recognition in noise. Audio classification is an important field of pattern recognition and has been applied successfully in many areas, such as professional education and entertainment. In recent years, different audio classification tasks, such as accent recognition, speaker recognition, and speech emotion recognition, have achieved considerable success.
However, most audio classification methods handle each task separately and ignore the correlation between tasks. For example, accent recognition and speaker recognition are typically treated as independent classification tasks. In fact, for the same piece of speech data, once the speaker is identified, the speaker's accent is also determined. It is therefore desirable to exploit this relationship to improve the classification performance of both tasks simultaneously.
In recent years, deep learning has driven a new wave of artificial intelligence. Owing to the strong ability of deep neural networks to abstract data, neural network learning methods have been applied successfully in fields such as speech signal processing. In the present invention, a convolutional neural network is used to learn speech features and improve accuracy on multi-task classification.
A spectrogram is a detailed and accurate representation of speech that contains both time and frequency information. A spectrogram generally encodes three dimensions: time, frequency, and amplitude (shown as color intensity).
Disclosure of Invention
The invention aims to: solve the problem that existing audio classification methods process tasks independently and ignore the correlation between speech tasks, resulting in low classification efficiency.
The technical scheme of the invention is as follows:
a multitask speech classification method based on deep learning comprises the following steps:
and S1, performing time-frequency analysis operation on the voice data to obtain a corresponding spectrogram.
And S2, establishing a neural network model based on the convolutional neural network and the residual error network, and taking the spectrogram as network input to extract features.
And S3, inputting the extracted features into a plurality of different softmax classifiers so as to obtain an initialized model.
And S4, digitizing the voice sample and the corresponding marks, and training the initialized model by using the data set to obtain the trained network model.
And S5, predicting the unmarked voice data by the trained model to obtain the classified probability value, and selecting the category with higher probability value as the classification result.
Further, in S2, the basic operations of the convolutional neural network comprise a convolution operation and a pooling operation. The convolution operation can be expressed by the following formula:

a^l_{i,j} = f( Σ_{m=0}^{M} Σ_{n=0}^{N} k^l_{m,n} · a^{l-1}_{i+m,j+n} + b^l )    (1)

where M and N define the size of the convolution kernel, i and j are the row and column indices that locate a pixel, f is the convolution kernel function, l ∈ (1, L) is the layer index of the convolutional neural network, a^l_{i,j} denotes the feature at row i and column j of layer l, k^l_{m,n} denotes the parameter of the convolution kernel at row m and column n of layer l, and b^l is the corresponding bias.

The meaning of formula (1) is: the products of different parts of the input feature map with the convolution kernel yield a new feature map under the action of the kernel function. The formula ensures that feature extraction is independent of position, i.e. the statistical properties of one part of the input feature map are the same as those of the other parts.
The pooling operation of the convolutional neural network can be expressed by the following formula:

a^l = f( β^l · down(a^{l-1}) + b^l )    (2)

where a^{l-1} is the input to the l-th layer, a^l is its output, down denotes the down-sampling operation, and β^l is the corresponding parameter. The meaning of formula (2) is that a pooling operation is applied to the input feature map, i.e. features at different positions of the image are aggregated, so as to reduce the number of parameters in the network.
The basic residual block of the residual network in S2 can be expressed by the following formula:

y = F(x, W) + x    (3)

where F denotes a two-layer convolutional network, W is the parameter of that convolutional network, x is the input of the residual block, and y is the output of the basic residual block.

Formula (3) means that an input x passes through two forward convolutional layers to give an output F(x, W), which is then added to x through a shortcut connection to give the output y.
The formula of the basic architecture model used in S2 is expressed as:

y = F_1(x, W_1) * F_2(x, W_2) + x    (4)

where * denotes element-wise multiplication, F_1 and F_2 are two convolutional layers, x is the input of the basic structure, and W_1 and W_2 are the parameters of the two convolutional layers.

The meaning of formula (4) is that an input x passes through the two convolutional networks to produce the outputs F_1(x, W_1) and F_2(x, W_2); these are multiplied element-wise and then added to x through a shortcut connection to give the output y.
Specifically, step S4 comprises the following steps:
S4: digitizing the speech samples and their corresponding labels, and training the initialized model with this data set to obtain the trained network model.
S41: performing time-domain and frequency-domain analysis on each speech sample, extracting its spectrogram, and digitizing the multiple labels corresponding to the multiple tasks of the sample.
S42: learning the current speech classification tasks on the basis of the initialized multi-task classification model obtained in step S3, to obtain a trained multi-task classification model.
S43: using the trained multi-task classification model for multi-task classification of speech data, giving the probability value of each speech sample in each task, and selecting the category with the highest probability value as the classification result.
After the scheme is adopted, the invention has the beneficial effects that:
(1) Feature extraction from the speech data is a key preprocessing operation. Features are extracted from the speech spectrogram by the neural network; in the specific implementation, the spectrogram is converted into 200-dimensional shared features.
(2) During classification, the neural network is expected to learn the intrinsic characteristics of the speech so as to correctly predict each classification category, and a new neural network structure is therefore proposed to obtain a better speech representation. In particular, compared with models that also perform multi-task classification, such as an SVM or a classical neural network structure, the proposed model performs better; for single-task classification models, the accuracy of carrying out the two tasks independently on the same model is lower than that of the multi-task classification model.
Taking speech emotion recognition on sentences and songs as an example, the main task is speech emotion classification, and the auxiliary task is sentence and song classification.
Model                Accuracy
SVM                  48.01%
Single-task model    56.33%
Multi-task model     62.39%
Table 1 mainly compares the accuracy of the single-task and multi-task models on the main task. SVM is a classic machine-learning classification method; the single-task model performs single-task classification with an emotion classification accuracy of 56.33%, while the emotion recognition accuracy increases by 6.06% when the two tasks are carried out simultaneously on the multi-task model.
Network architecture            Emotion recognition accuracy    Speech and song classification accuracy
Convolutional neural network    53.73%                          92.24%
Residual network                57.21%                          94.62%
Gate-based residual network     62.39%                          93.13%
Table 2 mainly compares the accuracy of multi-task models based on different neural network structures for speech emotion recognition on sentences and songs. The gate-based residual network is the model proposed in this patent.
The above experimental results prove that:
1) Compared with models that also perform multi-task classification, such as an SVM or a classical neural network structure, the proposed model performs better.
2) For single-task classification models, the accuracy of carrying out the two tasks independently on the same model is lower than that of the multi-task classification model.
(3) Compared with models based on other, non-neural-network methods, extracting speech features with a deep neural network allows the multi-task classification model to be initialized well, improves the robustness of the model, and improves the recognition performance of each task. Since the audio signal itself may be affected by noise and similar factors, the neural network method generalizes well to such disturbances. In addition, single-task models, such as emotion classification of audio, are very sensitive to new speakers, while multi-task classification is less affected because speaker characteristics are learned as well.
Drawings
FIG. 1 is a diagram of a multitask model according to the present invention;
FIG. 2 is a spectrogram of speech containing the emotion "angry";
FIG. 3 is a spectrogram of speech containing the emotion "happy";
FIG. 4 is a diagram of the basic structure of the residual network of the present invention;
FIG. 5 is a diagram of the basic structure of the neural network of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the core of the deep neural network-based multi-task speech classification method is a multi-task classification model, which is used to perform two classification tasks simultaneously.
The multitask speech classification method based on deep learning comprises the following steps:
and S1, performing time-frequency analysis operation on the voice data to obtain a corresponding spectrogram.
S2: establishing a neural network model based on a convolutional neural network and a residual network, taking the spectrogram as the network input, and extracting features. In this step, shared features used by multiple tasks are extracted by constructing a two-task network structure. The multi-task setting of the invention targets two pairs of classification tasks: the first simultaneously recognizes the emotion contained in the speech and whether the speech is a song or a sentence; the second simultaneously recognizes the speaker and the speaker's accent.
As shown in fig. 3, the basic operations of the convolutional neural network comprise a convolution operation and a pooling operation. The convolution operation can be expressed by the following formula:

a^l_{i,j} = f( Σ_{m=0}^{M} Σ_{n=0}^{N} k^l_{m,n} · a^{l-1}_{i+m,j+n} + b^l )    (1)

where M and N define the size of the convolution kernel, i and j are the row and column indices that locate a pixel, f is the convolution kernel function, l ∈ (1, L) is the layer index of the convolutional neural network, a^l_{i,j} denotes the feature at row i and column j of layer l, k^l_{m,n} denotes the parameter of the convolution kernel at row m and column n of layer l, and b^l is the corresponding bias.

The meaning of formula (1) is: the products of different parts of the input feature map with the convolution kernel yield a new feature map under the action of the kernel function. The formula ensures that feature extraction is independent of position, i.e. the statistical properties of one part of the input feature map are the same as those of the other parts.

The pooling operation of the convolutional neural network can be expressed by the following formula:

a^l = f( β^l · down(a^{l-1}) + b^l )    (2)

where a^{l-1} is the input to the l-th layer, a^l is its output, down denotes the down-sampling operation, and β^l is the corresponding parameter.

The meaning of formula (2) is that a pooling operation is applied to the input feature map, i.e. features at different positions of the image are aggregated, so as to reduce the number of parameters in the network.
As shown in fig. 4, the basic residual block of the residual network in S2 can be expressed by the following formula:

y = F(x, W) + x    (3)

where F denotes a two-layer convolutional network, W is the parameter of that convolutional network, x is the input of the residual block, and y is the output of the basic residual block.

Formula (3) means that an input x passes through two forward convolutional layers to give an output F(x, W), which is then added to x through a shortcut connection to give the output y.
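The basic residual block of formula (3) can be sketched as a small PyTorch module; the channel count and the ReLU between the two convolutions are illustrative assumptions rather than values specified by the patent.

```python
# Sketch of the basic residual block y = F(x, W) + x of formula (3).
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.f = nn.Sequential(                       # F: two convolutional layers
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.f(x) + x                          # shortcut connection
```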
As shown in fig. 5, the formula of the basic architecture model of the deep neural network used in S2 is expressed as:

y = F_1(x, W_1) * F_2(x, W_2) + x    (4)

where * denotes element-wise multiplication, F_1 and F_2 are two convolutional layers, x is the input of the basic structure, and W_1 and W_2 are the parameters of the two convolutional layers.

The meaning of formula (4) is that an input x passes through the two convolutional networks to produce the outputs F_1(x, W_1) and F_2(x, W_2); these are multiplied element-wise and then added to x through a shortcut connection to give the output y.
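Formula (4) can likewise be sketched as a module with two parallel convolutional branches whose outputs are multiplied element-wise before the shortcut is added. The code below implements the formula literally; in practice the second branch is often passed through a sigmoid so that it behaves as a gate, but that detail is an assumption beyond what the formula states.

```python
# Sketch of the gate-based structure y = F1(x, W1) * F2(x, W2) + x of
# formula (4); channel count is an illustrative assumption.
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.f1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # F1
        self.f2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # F2

    def forward(self, x):
        return self.f1(x) * self.f2(x) + x   # element-wise product, then shortcut
```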
S3: inputting the extracted features into a plurality of different softmax classifiers, thereby obtaining an initialized model.
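One way to realize step S3 is to attach one softmax classifier per task to the shared features produced by the backbone. The 200-dimensional shared feature size follows the description above; the number of tasks and the class counts per task in the sketch are illustrative assumptions.

```python
# Sketch of step S3: task-specific softmax classifiers on top of shared
# features; the class counts per task are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, feature_dim=200, classes_per_task=(5, 2)):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(feature_dim, n) for n in classes_per_task]
        )

    def forward(self, shared_features):
        # One probability distribution per task.
        return [torch.softmax(head(shared_features), dim=-1) for head in self.heads]
```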
S4: digitizing the speech samples and their corresponding labels, and training the initialized model with this data set. S4 comprises the following steps:
S4: training the initialized model with the speech data and the corresponding labels to obtain a trained network model;
S41: performing time-domain and frequency-domain analysis on each speech sample, extracting its spectrogram, and digitizing the multiple labels corresponding to the multiple tasks of the sample;
S42: learning the current speech classification tasks on the basis of the initialized multi-task classification model obtained in step S3, to obtain a trained multi-task classification model;
S43: using the trained multi-task classification model for multi-task classification of speech data, giving the probability value of each speech sample in each task, and selecting the category with the highest probability value as the classification result.
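A minimal training loop corresponding to S4 might sum one cross-entropy loss per task over the shared backbone. Here `backbone`, `heads` (an nn.ModuleList of linear task heads producing logits) and `loader` (yielding a spectrogram batch plus one label tensor per task), as well as the equal loss weighting and optimizer settings, are assumptions for illustration rather than details fixed by the patent.

```python
# Sketch of step S4: joint training with one loss term per task.
import torch
import torch.nn as nn

def train_multitask(backbone, heads, loader, epochs=10, lr=1e-3):
    criterion = nn.CrossEntropyLoss()   # softmax is folded into this loss
    optimizer = torch.optim.Adam(
        list(backbone.parameters()) + list(heads.parameters()), lr=lr
    )
    for _ in range(epochs):
        for spectrograms, labels_task1, labels_task2 in loader:
            shared = backbone(spectrograms)              # 200-d shared features
            logits = [head(shared) for head in heads]    # one output per task
            loss = criterion(logits[0], labels_task1) + criterion(logits[1], labels_task2)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return backbone, heads
```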
S5: predicting unlabeled speech data with the trained model to obtain classification probability values, and selecting the category with the highest probability value as the classification result.
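Step S5 then reduces each task's probability vector to the class with the highest probability. The sketch below reuses the hypothetical `backbone` and `heads` objects from the training sketch above.

```python
# Sketch of step S5: per-task prediction by taking the highest-probability class.
import torch

@torch.no_grad()
def predict(backbone, heads, spectrogram):
    shared = backbone(spectrogram.unsqueeze(0))                    # add batch dimension
    probs = [torch.softmax(head(shared), dim=-1) for head in heads]
    return [int(p.argmax(dim=-1)) for p in probs]                  # one class index per task
```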
Fig. 2 and fig. 3 show spectrograms containing the two emotions "angry" and "happy"; it can be seen that the difference in spectrogram amplitude is obvious in the range of 10 kHz to 15 kHz.
Fig. 4 and 5 show a neural network method proposed by the present invention, which specifically includes:
(1) The basic structure of the two models in fig. 4 and 5 is a convolutional neural network, which involves two operations. The first is the convolution operation, which can be expressed by the following formula:

a^l_{i,j} = f( Σ_{p=0}^{M} Σ_{q=0}^{N} k^l_{p,q} · a^{l-1}_{i+p,j+q} + b^l )

where M and N define the size of the convolution kernel, p and q are the row and column indices within the kernel, f is the convolution kernel function, l ∈ (1, L) is the layer index of the convolutional neural network, a^l_{i,j} denotes the feature at row i and column j of layer l, k defines the parameters of the convolution kernel, and b is the corresponding bias.

The other operation is the pooling operation of the convolutional neural network, which can be expressed by the following formula:

a^l = f( β · down(a^{l-1}) + b^l )

where down denotes the down-sampling operation and β is the corresponding parameter.
(2) Fig. 4 shows a basic residual block of a residual network, which can also be expressed by the following formula:
y=F(x,W)+x
where F is the convolutional layer function, x is the input to a residual block, and W is the parameter.
(3) Fig. 5 shows the basic architecture of the proposed neural network, which can also be expressed by the following formula:

y = F_1(x, W_1) * F_2(x, W_2) + x

where * denotes element-wise multiplication, F_1 and F_2 are two convolutional layers, x is the input of this basic structure, and W_1 and W_2 are the parameters of the two convolutional layers.
Existing audio classification mainly deals with a single sample and a single label, i.e. the trained model performs only one classification task. For example, speech emotion classification as a single task only determines which emotion a piece of audio expresses. However, because different speakers understand emotions differently, different speakers express the same emotion differently. Multi-task classification performs several different tasks at the same time; for example, this project completes the speaker classification problem in addition to the speech emotion classification task. That is, when a piece of speech is input to the trained model, two results are obtained: one is the person who spoke the speech, and the other is the emotion it contains. In other words, the emotional characteristics and the speaker characteristics are learned simultaneously during training.
Taking speech emotion recognition on sentences and songs as an example, the main task is speech emotion classification, and the auxiliary task is sentence and song classification.
Model                Accuracy
SVM                  48.01%
Single-task model    56.33%
Multi-task model     62.39%
Table 1 mainly compares the accuracy of the single-task and multi-task models on the main task. SVM is a classic machine-learning classification method; the single-task model performs single-task classification with an emotion classification accuracy of 56.33%, while the emotion recognition accuracy increases by 6.06% when the two tasks are carried out simultaneously on the multi-task model.
Network architecture            Emotion recognition accuracy    Speech and song classification accuracy
Convolutional neural network    53.73%                          92.24%
Residual network                57.21%                          94.62%
Gate-based residual network     62.39%                          93.13%
Table 2 mainly compares the accuracy of multi-task models based on different neural network structures for speech emotion recognition on sentences and songs. The gate-based residual network is the model proposed in this patent.
The above experimental results prove that:
(1) Compared with models that also perform multi-task classification, such as an SVM or a classical neural network structure, the proposed model performs better.
(2) For single-task classification models, the accuracy of carrying out the two tasks independently on the same model is lower than that of the multi-task classification model.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description is organized into embodiments, each embodiment does not necessarily contain only a single technical solution; this manner of description is for clarity only. Those skilled in the art should take the description as a whole, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.

Claims (2)

1. A multitask speech classification method based on deep learning is characterized in that: the method comprises the following steps:
s1: performing a time-frequency analysis on the speech data to obtain the corresponding spectrogram;
s2: establishing a neural network model based on a convolutional neural network and a residual network, taking the spectrogram as the network input, and extracting features;
in S2, the basic operations of the convolutional neural network comprise a convolution operation and a pooling operation, and the convolution operation can be expressed by the following formula:

a^l_{i,j} = f( Σ_{m=0}^{M} Σ_{n=0}^{N} k^l_{m,n} · a^{l-1}_{i+m,j+n} + b^l )    (1)

where M and N define the size of the convolution kernel, i and j are the row and column indices that locate a pixel, f is the convolution kernel function, l ∈ (1, L) is the layer index of the convolutional neural network, a^l_{i,j} denotes the feature at row i and column j of layer l, k^l_{m,n} denotes the parameter of the convolution kernel at row m and column n of layer l, and b^l is the bias function of layer l;
the pooling operation of the convolutional neural network can be expressed by the following formula:

a^l = f( β^l · down(a^{l-1}) + b^l )    (2)

where a^{l-1} is the input to the l-th layer and a^l its output, f is the pooling layer function, down denotes the down-sampling mode, and β^l is the corresponding parameter;
the basic residual block of the residual network in S2 can be expressed by the following formula:

y = F(x, W) + x    (3)

where F denotes a two-layer convolutional network, W is the parameter of that convolutional network, x is the input of the residual block, and y denotes the output of the basic residual block;

the formula of the basic architecture model used in S2 is expressed as:

y = F_1(x, W_1) * F_2(x, W_2) + x    (4)

where * denotes element-wise multiplication, F_1 and F_2 are two convolutional layers, x is the input of the basic structure, W_1 and W_2 are the parameters of the two convolutional layers, and y denotes the output;
s3: inputting the extracted features into a plurality of different softmax classifiers, thereby obtaining an initialized model;
s4: digitizing the speech samples and the corresponding labels, and training the initialized model with the data set to obtain a trained network model;
s5: predicting unlabeled speech data with the trained model to obtain classification probability values, and selecting the category with the highest probability value as the classification result.
2. The multi-task speech classification method based on deep learning of claim 1, wherein step S4 comprises the following steps:
s4: digitizing the speech samples and the corresponding labels, and training the initialized model with the data set to obtain a trained network model;
s41: performing time-domain and frequency-domain analysis on each speech sample, extracting its spectrogram, and digitizing the multiple labels corresponding to the multiple tasks of the sample;
s42: learning the current speech classification tasks on the basis of the initialized multi-task classification model obtained in step S3, to obtain a trained multi-task classification model;
s43: using the trained multi-task classification model for multi-task classification of the speech data, giving the probability value of each speech sample in each task, and selecting the category with the highest probability value as the classification result.
CN201710801016.6A 2017-09-07 2017-09-07 Multi-classification voice method based on deep neural network Expired - Fee Related CN107578775B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710801016.6A CN107578775B (en) 2017-09-07 2017-09-07 Multi-classification voice method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710801016.6A CN107578775B (en) 2017-09-07 2017-09-07 Multi-classification voice method based on deep neural network

Publications (2)

Publication Number Publication Date
CN107578775A CN107578775A (en) 2018-01-12
CN107578775B true CN107578775B (en) 2021-02-12

Family

ID=61031600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710801016.6A Expired - Fee Related CN107578775B (en) 2017-09-07 2017-09-07 Multi-classification voice method based on deep neural network

Country Status (1)

Country Link
CN (1) CN107578775B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10127927B2 (en) * 2014-07-28 2018-11-13 Sony Interactive Entertainment Inc. Emotional speech processing

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1300831A1 (en) * 2001-10-05 2003-04-09 Sony International (Europe) GmbH Method for detecting emotions involving subspace specialists
CN106847309A (en) * 2017-01-09 2017-06-13 华南理工大学 A kind of speech-emotion recognition method
CN106875007A (en) * 2017-01-25 2017-06-20 上海交通大学 End-to-end deep neural network is remembered based on convolution shot and long term for voice fraud detection
CN106952649A (en) * 2017-05-14 2017-07-14 北京工业大学 Method for distinguishing speek person based on convolutional neural networks and spectrogram

Also Published As

Publication number Publication date
CN107578775A (en) 2018-01-12


Legal Events

Code   Title/Description
PB01   Publication
SE01   Entry into force of request for substantive examination
GR01   Patent grant
CF01   Termination of patent right due to non-payment of annual fee (granted publication date: 20210212; termination date: 20210907)