CN107578775B - Multi-classification voice method based on deep neural network - Google Patents
- Publication number: CN107578775B (application CN201710801016.6A)
- Authority
- CN
- China
- Prior art keywords
- model
- classification
- network
- neural network
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The invention discloses a multi-task speech classification method based on deep learning, relating to the technical field of speech processing and comprising the following steps: S1, perform time-frequency analysis on the speech data to obtain the corresponding spectrogram; S2, build a neural network model based on a convolutional neural network and a residual network, and feed the spectrogram to the network as input to extract features; S3, feed the extracted features into several different softmax classifiers to obtain an initialized model; S4, digitize the speech samples and their corresponding labels, and train the initialized model on this data set to obtain a trained network model; S5, use the trained model to predict unlabeled speech data, obtaining a probability value for each class, and select the class with the highest probability as the classification result. The invention addresses the low classification efficiency of existing audio classification methods, which process tasks independently and ignore the correlations among speech tasks.
Description
Technical Field
The invention relates to the technical field of sound signal processing, in particular to a voice multi-classification method based on a deep neural network.
Background
Sound provides us with a great deal of information about its source and the surrounding environment. The human auditory system can separate and recognize complex sounds, and it would be useful if a machine could perform similar functions (audio classification and recognition), such as speech recognition in noise. Audio classification is an important field of pattern recognition and has been applied successfully in many areas, such as professional education and entertainment. In recent years, different audio classification tasks, such as accent recognition, speaker recognition, and speech emotion recognition, have achieved considerable success.
However, most audio classification methods treat each task separately and ignore the correlations between tasks. For example, the accent recognition task and speaker recognition are typically treated as independent classification problems. In fact, for the same piece of speech data, once the speaker is identified, the speaker's accent is also determined. It is therefore desirable to exploit this relationship to improve the classification performance of both tasks simultaneously.
In recent years, deep learning has driven a surge in artificial intelligence; thanks to the strong abstraction capability of deep neural networks over data, neural network learning methods have been applied successfully in fields such as speech signal processing. In our work, convolutional neural networks are used to learn speech features, improving accuracy on multi-classification tasks.
A spectrogram is a detailed and accurate representation of speech containing time and frequency information. The general form of a spectrogram has three dimensions: time, frequency, and amplitude (encoded as color).
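As a concrete illustration of these three dimensions, the short sketch below computes a log-magnitude spectrogram with plain NumPy; the 16 kHz sample rate, 512-sample frames, and the synthetic 440 Hz tone are illustrative choices, not parameters taken from the patent.

```python
import numpy as np

# A 1-second synthetic tone at 16 kHz stands in for a real speech recording.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440.0 * t)

frame, hop = 512, 256
n_frames = 1 + (len(signal) - frame) // hop
window = np.hanning(frame)

# Short-time Fourier transform: one magnitude spectrum per frame gives the
# three dimensions described above -- frequency (rows), time (columns),
# and amplitude (cell values).
spec = np.stack(
    [np.abs(np.fft.rfft(window * signal[i * hop : i * hop + frame]))
     for i in range(n_frames)],
    axis=1,
)
log_spec = 20 * np.log10(spec + 1e-10)   # log amplitude, as usually plotted
freqs = np.fft.rfftfreq(frame, d=1 / sr)  # frequency axis in Hz
```

The resulting `log_spec` array (frequency bins x time frames) is the kind of two-dimensional image that step S2 feeds to the convolutional network.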
Disclosure of Invention
The invention aims to solve the low classification efficiency of existing audio classification methods, which process tasks independently and ignore the correlations among speech tasks.
The technical scheme of the invention is as follows:
A multi-task speech classification method based on deep learning comprises the following steps:
S1: perform time-frequency analysis on the speech data to obtain the corresponding spectrogram.
S2: build a neural network model based on a convolutional neural network and a residual network, and feed the spectrogram to the network as input to extract features.
S3: feed the extracted features into several different softmax classifiers to obtain an initialized model.
S4: digitize the speech samples and their corresponding labels, and train the initialized model on this data set to obtain a trained network model.
S5: use the trained model to predict unlabeled speech data, obtaining per-class probability values, and select the class with the highest probability as the classification result.
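Steps S3 and S5 amount to attaching several independent softmax heads to one shared feature vector and taking the arg-max per task. A minimal NumPy sketch, in which the 200-dimensional feature, the 7 emotion classes, the 2 speech/song classes, and the random weights are all illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
shared = rng.standard_normal(200)          # 200-dim shared feature from S2

# Two task heads, e.g. 7 emotion classes and 2 speech/song classes.
W_emotion, b_emotion = rng.standard_normal((7, 200)), np.zeros(7)
W_style,   b_style   = rng.standard_normal((2, 200)), np.zeros(2)

# S3/S5: per-task class probabilities from separate softmax classifiers.
p_emotion = softmax(W_emotion @ shared + b_emotion)
p_style   = softmax(W_style @ shared + b_style)

# S5: pick the highest-probability class in each task.
pred_emotion = int(np.argmax(p_emotion))
pred_style   = int(np.argmax(p_style))
```

Because both heads read the same shared feature vector, training them jointly is what lets the correlated tasks reinforce each other.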
Further, in S2, the basic operations of the convolutional neural network comprise a convolution operation and a pooling operation. The convolution operation can be expressed as:

a^l_{i,j} = f( Σ_{m=1}^{M} Σ_{n=1}^{N} k^l_{m,n} · a^{l-1}_{i+m-1, j+n-1} + b^l )   (1)

where M and N define the size of the convolution kernel; i and j index the rows and columns that locate a pixel; f is the activation function; l ∈ (1, L) indexes the layers of the convolutional neural network; a^l_{i,j} denotes the feature at row i, column j of layer l; k^l_{m,n} denotes the parameters of the layer-l convolution kernel; and b^l is the corresponding bias.
the meaning of formula (1) is: the product of different parts of the input feature map and the convolution kernel obtains a new feature map under the action of the convolution kernel function, and the formula ensures that the feature extraction is independent of the position, namely the statistical property of one part of the input feature map is the same as that of other parts.
The pooling operation of the convolutional neural network can be represented by the following equation:

a^l = f(β^l · down(a^{l-1}) + b^l)   (2)

where a^{l-1} is the input feature map, down denotes the down-sampling mode, and β^l is the corresponding parameter. The meaning of equation (2) is that pooling the input feature map, i.e., aggregating features at different positions of the image, reduces the number of parameters in the network.
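Equation (2) can be sketched in a few lines, here with mean-pooling as the `down` operation (the patent does not fix a particular down-sampling mode, so that choice is an assumption):

```python
import numpy as np

def pool(a_prev, beta=1.0, b=0.0, f=np.tanh, size=2):
    """Mean-pool (the 'down' operation), then scale, bias, and activate, per eq. (2)."""
    H, W = a_prev.shape
    H2, W2 = H // size, W // size
    # Aggregate each non-overlapping size x size block into its mean.
    down = a_prev[:H2 * size, :W2 * size].reshape(H2, size, W2, size).mean(axis=(1, 3))
    return f(beta * down + b)

# A 4x4 map pooled 2x2 (identity activation): each output is a block mean.
p = pool(np.arange(16.0).reshape(4, 4), f=lambda v: v)
```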
The basic residual block of the residual network in S2 can be represented by the following formula:
y=F(x,W)+x (3)
where F denotes a two-layer convolutional network, W its parameters, x the input of the residual block, and y the output of the basic residual block.
Equation (3) means that an input x passes through two forward convolutional layers to produce an output F(x, W), which is then added to x through a shortcut connection to give the output y.
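A minimal numeric sketch of equation (3); plain matrix products stand in for the two convolutional layers so the shortcut addition stays shape-compatible. This is an illustration of the residual structure, not the patent's exact layer configuration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = F(x, W) + x, per equation (3): a two-layer transform plus shortcut."""
    h = relu(W1 @ x)   # first layer of F
    Fx = W2 @ h        # second layer of F (no activation before the add)
    return Fx + x      # shortcut connection adds the input back

x = np.array([1.0, -2.0, 3.0])
d = len(x)
# With zero weights F(x, W) vanishes and the block reduces to the identity,
# which is exactly the easy-to-learn default residual learning relies on.
y = residual_block(x, np.zeros((d, d)), np.zeros((d, d)))
```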
The formula of the basic architecture model used in S2 is represented as:
y=F1(x,W1)*F2(x,W2)+x (4)
where * denotes element-wise multiplication, F1 and F2 are two convolutional branches, x is the input of the basic structure, and W1, W2 are the parameters of the two branches.
The meaning of formula (4) is that an input x passes through the two convolutional branches to produce outputs F1(x, W1) and F2(x, W2); the two are multiplied element-wise and the result is added to x through a shortcut to give the output y.
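Equation (4) differs from the plain residual block only in that the two branch outputs are multiplied element-wise before the shortcut is added. A sketch follows; the activation choices (a tanh branch gated by a sigmoid branch) are assumptions borrowed from common gating designs, since the patent does not spell them out:

```python
import numpy as np

def gated_residual_block(x, W1, W2):
    """y = F1(x, W1) * F2(x, W2) + x, per equation (4)."""
    F1 = np.tanh(W1 @ x)                   # candidate features
    F2 = 1.0 / (1.0 + np.exp(-(W2 @ x)))   # gate values in (0, 1)
    return F1 * F2 + x                     # element-wise product + shortcut

x = np.array([0.5, -1.0])
# With zero weights the tanh branch vanishes, so the gate passes nothing
# and the block again reduces to the identity.
y = gated_residual_block(x, np.zeros((2, 2)), np.zeros((2, 2)))
```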
Specifically, step S4 (digitizing the speech samples and their corresponding labels, and training the initialized model on the data set to obtain the trained network model) comprises the following sub-steps:
S41: and analyzing the time domain and the frequency domain of each voice sample, extracting a spectrogram, and digitizing a plurality of marks corresponding to a plurality of tasks of the voice sample.
S42: on the basis of the initialized multi-task classification model obtained in step S3, the current speech classification task is learned to obtain a trained multi-task classification model.
S43: and the trained multi-task classification model is used for multi-task classification of the voice data, the probability value of each voice in each task is given, and the category with the larger probability value is selected as a classification result.
After the scheme is adopted, the invention has the beneficial effects that:
(1) Feature extraction is a key preprocessing operation on the speech data: the spectrogram is extracted from each utterance and, in the concrete implementation, converted by the neural network into 200-dimensional shared features.
(2) During classification, the neural network is expected to learn intrinsic characteristics of the speech so that every classification category is predicted correctly, which motivates the proposed network structure for obtaining a better speech representation. In particular, compared with models that also perform multi-class classification, such as the SVM and classical neural network structures, the proposed model performs better; and for single-task classification models, realizing the two tasks independently on the same model yields lower accuracy than the multi-task classification model.
Taking speech emotion recognition on sentences and songs as an example, the main task is speech emotion classification, and the auxiliary task is sentence and song classification.
Model | Accuracy
---|---
SVM | 48.01%
Single-task model | 56.33%
Multi-task model | 62.39%
Table 1 compares the accuracy of the single-task and multi-task models on the main task. SVM is a classic machine-learning classification method; the single-task model performs single-task classification with an emotion classification accuracy of 56.33%, while realizing the two tasks simultaneously on the multi-task model raises the emotion recognition accuracy by 6.06%.
Network architecture | Emotion recognition accuracy | Speech vs. song classification accuracy
---|---|---
Convolutional neural network | 53.73% | 92.24%
Residual network | 57.21% | 94.62%
Gate-based residual network | 62.39% | 93.13%
Table 2 compares the accuracy of multi-task models built on different neural network structures for speech emotion recognition on sentences and songs. The gate-based residual network is the model proposed by this patent.
The above experimental results prove that:
1) For models that also perform multi-class classification, such as the SVM and classical neural network structures, the proposed model performs better.
2) For single-task classification models, realizing the two tasks independently on the same model yields lower accuracy than the multi-task classification model.
(3) Compared with models based on other, non-neural-network methods, extracting speech features with a deep neural network initializes the multi-task classification model well, improves the model's robustness, and improves the recognition performance of each task. Since the audio signal itself may be affected by noise, the neural network's good generalization ability to noise is an advantage. In addition, single-task models such as audio emotion classifiers are very sensitive to new speakers, whereas multi-task classification is affected less, because speaker characteristics are learned as well.
Drawings
FIG. 1 is a diagram of a multitask model according to the present invention;
FIG. 2 is a spectrogram of speech containing the "angry" emotion;
FIG. 3 is a spectrogram of speech containing the "happy" emotion;
FIG. 4 is a diagram of the basic structure of the residual network of the present invention;
FIG. 5 is a diagram of the basic structure of the neural network in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the core of the deep-neural-network-based multi-task speech classification is a multi-task classification model, which is used to classify two types of tasks.
The multitask speech classification method based on deep learning comprises the following steps:
and S1, performing time-frequency analysis operation on the voice data to obtain a corresponding spectrogram.
S2: build a neural network model based on a convolutional neural network and a residual network, take the spectrogram as the network input, and extract features. In this step, shared features serving multiple tasks are extracted by constructing a two-task network structure. The multi-task setting of the invention targets two pairs of classification tasks: the first simultaneously distinguishes the emotion contained in the speech and whether the speech belongs to a song or a sentence; the second simultaneously distinguishes the speaker and the speaker's accent.
As shown in fig. 3, the basic operations of the convolutional neural network comprise a convolution operation and a pooling operation. The convolution operation can be expressed as:

a^l_{i,j} = f( Σ_{m=1}^{M} Σ_{n=1}^{N} k^l_{m,n} · a^{l-1}_{i+m-1, j+n-1} + b^l )   (1)

where M and N define the size of the convolution kernel; i and j index the rows and columns that locate a pixel; f is the activation function; l ∈ (1, L) indexes the layers of the convolutional neural network; a^l_{i,j} denotes the feature at row i, column j of layer l; k^l_{m,n} denotes the parameters of the layer-l convolution kernel; and b^l is the corresponding bias.

The meaning of formula (1) is: sliding the convolution kernel over different parts of the input feature map produces a new feature map under the activation, and the formula ensures that feature extraction is position-independent, i.e., the statistics of one part of the input feature map are the same as those of the other parts. The pooling operation of the convolutional neural network can be represented by the following equation:
a^l = f(β^l · down(a^{l-1}) + b^l)   (2)
where down denotes the down-sampling mode and β^l is the corresponding parameter.
The meaning of equation (2) is that pooling the input feature map, i.e., aggregating features at different positions of the image, reduces the number of parameters in the network.
As shown in fig. 4, the basic residual block of the residual network in S2 can be represented by the following formula:
y=F(x,W)+x (3)
where F denotes a two-layer convolutional network, W its parameters, x the input of the residual block, and y the output of the basic residual block.
Equation (3) means that an input x passes through two forward convolutional layers to produce an output F(x, W), which is then added to x through a shortcut connection to give the output y.
As shown in fig. 5, the formula of the basic architecture model of the deep neural network used in S2 is expressed as:
y=F1(x,W1)*F2(x,W2)+x (4)
where * denotes element-wise multiplication, F1 and F2 are two convolutional branches, x is the input of the basic structure, and W1, W2 are the parameters of the two branches.
The meaning of formula (4) is that an input x passes through the two convolutional branches to produce outputs F1(x, W1) and F2(x, W2); the two are multiplied element-wise and the result is added to x through a shortcut to give the output y.
S3: feed the extracted features into several different softmax classifiers to obtain an initialized model.
S4: digitize the speech samples and their corresponding labels, and train the initialized model on the data set to obtain the trained network model. S4 comprises the following sub-steps:
s41: perform time-domain and frequency-domain analysis on each speech sample, extract its spectrogram, and digitize the multiple labels corresponding to the sample's multiple tasks;
s42: learn the current speech classification tasks on the basis of the initialized multi-task classification model obtained in step S3 to obtain a trained multi-task classification model;
s43: use the trained multi-task classification model for multi-task classification of speech data: the model gives a probability value for each class of each task for every utterance, and the class with the largest probability is selected as the classification result.
S5: use the trained model to predict unlabeled speech data, obtain per-class probability values, and select the class with the highest probability as the classification result.
Figs. 2 and 3 show spectrograms containing the two emotions "angry" and "happy"; the difference in spectrogram amplitude between them is obvious in the 10 kHz to 15 kHz range.
Fig. 4 and 5 show a neural network method proposed by the present invention, which specifically includes:
(1) The basic structure of the two models in figs. 4 and 5 is a convolutional neural network, which comprises two operations. One is the convolution operation, which can be expressed as:

a^l_{p,q} = f( Σ_{m=1}^{M} Σ_{n=1}^{N} k^l_{m,n} · a^{l-1}_{p+m-1, q+n-1} + b^l )

where M and N define the size of the convolution kernel, p and q index the rows and columns that locate a pixel, f is the activation function, l ∈ (1, L) indexes the layers of the convolutional neural network, k denotes the parameters of the convolution kernel, and b is the corresponding bias.

The other operation is pooling, which can be expressed as:

a^l = f(β^l · down(a^{l-1}) + b^l)

where down denotes the down-sampling operation and β is the corresponding parameter.
(2) Fig. 4 shows a basic residual block of a residual network, which can also be expressed by the following formula:
y=F(x,W)+x
where F is the convolutional layer function, x is the input to a residual block, and W is the parameter.
(3) Fig. 5 shows the basic architecture of our neural network, which can also be expressed by the following formula:
y=F1(x,W1)*F2(x,W2)+x
where * denotes element-wise multiplication, F1 and F2 are two convolutional branches, x is the input to this basic structure, and W1, W2 are the parameters of the two convolutional layers.
The existing audio classification problem mainly addresses a single sample with a single label, i.e., the trained model performs only one classification task. For example, speech emotion classification as a single task only determines which emotion an audio clip expresses. However, different speakers understand emotions differently, so different speakers express the same emotion differently. Multi-task classification realizes several different tasks simultaneously; for example, this work completes the speech emotion classification task together with the speaker classification task. That is, given a trained model and an input utterance, two results are obtained: the person who spoke the utterance and the emotion it contains. In other words, emotional characteristics and speaker characteristics are learned simultaneously when the model is trained.
Taking speech emotion recognition on sentences and songs as an example, the main task is speech emotion classification, and the auxiliary task is sentence and song classification.
Model | Accuracy
---|---
SVM | 48.01%
Single-task model | 56.33%
Multi-task model | 62.39%
Table 1 compares the accuracy of the single-task and multi-task models on the main task. SVM is a classic machine-learning classification method; the single-task model performs single-task classification with an emotion classification accuracy of 56.33%, while realizing the two tasks simultaneously on the multi-task model raises the emotion recognition accuracy by 6.06%.
Network architecture | Emotion recognition accuracy | Speech vs. song classification accuracy
---|---|---
Convolutional neural network | 53.73% | 92.24%
Residual network | 57.21% | 94.62%
Gate-based residual network | 62.39% | 93.13%
Table 2 compares the accuracy of multi-task models built on different neural network structures for speech emotion recognition on sentences and songs. The gate-based residual network is the model proposed by this patent.
The above experimental results prove that:
(1) For models that also perform multi-class classification, such as the SVM and classical neural network structures, the proposed model performs better.
(2) For single-task classification models, realizing the two tasks independently on the same model yields lower accuracy than the multi-task classification model.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description refers to embodiments, not every embodiment may contain only a single embodiment, and such description is for clarity only, and those skilled in the art should integrate the description, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.
Claims (2)
1. A multitask speech classification method based on deep learning is characterized in that: the method comprises the following steps:
s1: performing time-frequency analysis operation on the voice data to obtain a corresponding spectrogram;
s2: establishing a neural network model based on a convolutional neural network and a residual error network, taking a spectrogram as network input, and extracting characteristics;
in S2, the basic operations of the convolutional neural network comprise a convolution operation and a pooling operation, the convolution operation being expressed as:

a^l_{i,j} = f( Σ_{m=1}^{M} Σ_{n=1}^{N} k^l_{m,n} · a^{l-1}_{i+m-1, j+n-1} + b^l )   (1)

where M and N define the size of the convolution kernel; i and j index the rows and columns that locate a pixel; f is the activation function; l ∈ (1, L) indexes the layers of the convolutional neural network; a^l_{i,j} denotes the feature at row i, column j of layer l; k^l_{m,n} denotes the parameters of the layer-l convolution kernel; and b^l is the bias function of layer l;
the pooling operation of the convolutional neural network can be represented by the following equation:
a^l = f(β^l · down(a^{l-1}) + b^l)   (2)

where a^{l-1} is the input of the l-th layer, f is the pooling-layer function, down denotes the down-sampling mode, and β^l is the corresponding parameter;
the basic residual block of the residual network in S2 can be represented by the following formula:
y=F(x,W)+x (3)
wherein F represents a two-layer convolutional network, W is a parameter of the convolutional network, x is an input of a residual block, and y represents a basic residual block output;
the formula of the basic architecture model used in S2 is represented as:
y=F1(x,W1)*F2(x,W2)+x (4)
where * denotes element-wise multiplication, F1 and F2 are two convolutional layers, x is the input of the basic structure, W1 and W2 are the parameters of the two convolutional layers, and y denotes the output;
s3: inputting the extracted features into a plurality of different softmax classifiers, thereby obtaining an initialized model;
s4: digitizing the voice sample and the corresponding marks, and training an initialized model by using the data set to obtain a trained network model;
s5: and predicting the unmarked voice data by the trained model to obtain a classified probability value, and selecting the class with a higher probability value as a classification result.
2. The deep-learning-based multi-task speech classification method of claim 1, characterized in that step S4 (digitizing the speech samples and the corresponding labels, and training the initialized model on the data set to obtain the trained network model) comprises the following steps:
s41: performing time domain and frequency domain analysis on each voice sample, extracting a spectrogram, and digitizing a plurality of marks corresponding to a plurality of tasks of the voice sample;
s42: learning a current speech classification task on the basis of the initialized multi-task classification model obtained in the step S3 to obtain a trained multi-task classification model;
s43: and the trained multi-task classification model is used for multi-task classification of the voice data, the probability value of each voice in each task is given, and the category with the larger probability value is selected as a classification result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710801016.6A CN107578775B (en) | 2017-09-07 | 2017-09-07 | Multi-classification voice method based on deep neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107578775A CN107578775A (en) | 2018-01-12 |
CN107578775B true CN107578775B (en) | 2021-02-12 |
Family
ID=61031600
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1300831A1 (en) * | 2001-10-05 | 2003-04-09 | Sony International (Europe) GmbH | Method for detecting emotions involving subspace specialists |
CN106847309A (en) * | 2017-01-09 | 2017-06-13 | 华南理工大学 | A kind of speech-emotion recognition method |
CN106875007A (en) * | 2017-01-25 | 2017-06-20 | 上海交通大学 | End-to-end deep neural network is remembered based on convolution shot and long term for voice fraud detection |
CN106952649A (en) * | 2017-05-14 | 2017-07-14 | 北京工业大学 | Method for distinguishing speek person based on convolutional neural networks and spectrogram |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10127927B2 (en) * | 2014-07-28 | 2018-11-13 | Sony Interactive Entertainment Inc. | Emotional speech processing |
Legal Events
Code | Title | Description
---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210212; Termination date: 20210907 |