CN109523993A - A speech language classification method based on a fused CNN and GRU deep neural network - Google Patents
A speech language classification method based on a fused CNN and GRU deep neural network (Download PDF, Info)
- Publication number
- CN109523993A (application numbers CN201811303649.5A, CN201811303649A)
- Authority
- CN
- China
- Prior art keywords
- speech
- language
- GRU
- CNN
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
The invention discloses a speech language classification method based on a deep neural network that fuses a CNN with a GRU, comprising the following steps. S1: obtain the source audio data from the server, pre-process the audio, and cut the source audio data. S2: read the audio file information and generate an audio data inventory CSV file. S3: apply a short-time Fourier transform to each audio file; the time-frequency analysis of the speech signal yields a spectrum function that unfolds into a series of two-dimensional speech spectrograms correlated in time and frequency. S4: build the model. S5: input the two-dimensional spectrogram images into the fused CNN-GRU deep neural network model for speech language classification, which outputs the language category data. S6: store the language category data together with the source audio file information. The invention solves the speech language classification problem; it is automatic, accurate, robust, low-cost, and portable, and can conveniently interface with third-party systems.
Description
Technical field
The present invention relates to the field of speech processing technology, and in particular to a speech language classification method based on a deep neural network fusing a CNN with a GRU.
Background art
With the arrival of the multimedia era, computers have become increasingly ubiquitous and human-computer interaction increasingly frequent. People strive to make machines understand spoken human instructions and to operate and control machines by voice. Compared with traditional input devices such as the keyboard and mouse, speech technology, as an important means of human-computer interaction, is gradually becoming a key technology of the man-machine interface for information transmission.
Most traditional speech processing methods adopt a pattern matching strategy; template matching and statistical language models are the main techniques in applied speech processing. But the limitations of acoustic and language models, together with the environmental, personal, and colloquial variability of everyday human language, pose a great challenge to speech processing and make it very difficult.
A speech processing method is therefore needed that can cope with the diversity of speech signal acquisition and the instability of manually extracted features. With the vigorous development of speech processing and deep learning, deep convolutional neural networks, thanks to local connectivity and weight sharing, exhibit good translation invariance and hold an irreplaceable position in the image domain. The GRU network, a variant of the long short-term memory network, provides long-term memory and can model the long-range temporal correlation of speech in the time domain. Both deep CNNs and GRU networks are widely applied in the speech processing field. A solution is urgently needed that fuses the two networks so as to exploit their respective advantages and improve model performance.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and to provide a speech language classification method based on a deep neural network fusing a CNN with a GRU. The method converts speech audio data into speech spectrogram images, then uses the fused CNN-GRU deep neural network to extract features from the spectrogram images and learn autonomously, classifying end-to-end to obtain the speech language category.
The object of the invention is achieved through the following technical solution: a speech language classification method based on a fused CNN and GRU deep neural network, comprising the following steps:
S1: obtain the source audio data from the server, pre-process the audio, and cut the source audio data;
S2: read the audio file information and generate an audio data inventory CSV file;
S3: apply a short-time Fourier transform to each audio file; the time-frequency analysis of the speech signal yields a spectrum function that unfolds into a series of two-dimensional speech spectrograms correlated in time and frequency;
S4: build the model: a CNN automatically extracts high-level time-frequency features through convolution, pooling, and normalization operations, and an attached GRU neural network models the temporal correlation of the speech feature maps;
S5: input the two-dimensional spectrogram images into the fused CNN-GRU deep neural network model for speech language classification, which outputs the language category data;
S6: record the output language category data together with the source audio file information in a result CSV file and store it.
Step S1 comprises the following sub-steps:
S101: to deal with the confusing, hard-to-identify file names encountered when acquiring the source audio data, rename and store the source audio files and establish the corresponding file mapping;
S102: to deal with the diverse formats of the acquired source audio, re-encode the source audio files into a uniform MP3 audio format;
S103: cut the MP3 audio files with FFmpeg and extract the audio clips.
Step S3 comprises the following sub-steps:
S301: convert the MP3 audio to WAV format with the mpg123 library, and at the same time perturb the audio channels to realize data augmentation and improve the language classification accuracy;
S302: apply a short-time Fourier transform to the audio data; the time-frequency analysis yields its spectrum function, from which the spectrogram is drawn and stored as a grayscale speech spectrogram;
S303: for the deep convolutional network language classification model, divide the speech spectrograms into a training data set, a validation data set, and a test data set.
Step S4 comprises the following sub-steps:
S401: pre-process the input speech audio source data into two-dimensional grayscale speech spectrograms correlated in time and frequency;
S402: extract the features of the two-dimensional speech spectrograms with a CNN;
S403: fuse the feature maps extracted by the CNN with the GRU neural network, taking the number of input feature maps as the total channel count of a single GRU network;
S404: read the time-frequency information expressed by each feature map as a time series;
S405: take the frequency information of all channels as the input of a single GRU cell, and chain the GRU cells recurrently to build a recurrent neural network with long-term memory;
S406: after the feature fusion of the GRU network, attach one FC layer that maps the feature space to a vector;
S407: compute the speech language class probabilities with a final fully connected Softmax layer;
S408: after the output end emits the class probabilities, match them against the labelled language categories, select the Top-3 most probable languages, and write the result into a CSV file for permanent storage.
Step S401 comprises the following sub-steps:
S4011: cut the input speech audio source data with FFmpeg;
S4012: transcode the cut speech audio into MP3 format;
S4013: convert the MP3 audio into a WAV waveform file with mpg123 and apply windowing and framing to obtain the two-dimensional grayscale speech spectrogram of the source audio.
The CNN in step S402 must satisfy the following conditions: it contains four convolution-pooling stages and uses ReLU non-linear activations and batch normalization (BN); the first convolutional layer uses 7×7 filters, the second 5×5 filters, and the remaining convolutional layers 3×3 filters; every convolutional layer is followed by a 3×3 pooling layer with a stride of 2.
Step S5 comprises the following sub-steps:
S501: with the conversion tools of the Caffe deep learning framework, convert the spectrogram training and validation data sets into the training-set and validation-set LevelDB databases required by Caffe, train the fused CNN-GRU deep convolutional network, and establish the fused CNN-GRU deep neural network model for speech language classification;
S502: in the training stage of the model, use the Caffe training command to invoke the Prototxt solver, tune the number of network layers and filters to the optimal configuration, and supervise the network training until it reaches its best performance;
S503: in the test stage of the model, load the optimal weight and bias model files obtained after training, input the spectrogram test data set, predict the language of the audio data against the 178 labelled language categories of the pre-trained network, and output the best Top-3 language categories.
Before step S1, the method further comprises the following step:
after a third-party service delivers a speech language classification instruction, the business logic module of the system processes the instruction and forwards it through the internal Redis server down to the server of step S1.
The beneficial effects of the present invention are: it solves the speech language classification problem; for the spectrograms obtained from the time-frequency analysis of the voice data, an optimally designed fused CNN-GRU deep neural network realizes the classification; the method is automatic, accurate, robust, low-cost, and portable, and can conveniently interface with third-party systems.
Description of the drawings
Fig. 1 is a schematic flow diagram of the method steps of the invention;
Fig. 2 is a schematic flow diagram of the data and service flow of the invention.
Specific embodiment
The technical solution of the present invention is described below clearly and completely in conjunction with the embodiments. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative labor fall within the scope protected by the present invention.
Referring to Figs. 1-2, the present invention provides the following technical solution, a speech language classification method based on a fused CNN and GRU deep neural network, which comprises the following procedure: (1) the input end directly takes in time-correlated speech spectrogram images; (2) the model uses a CNN to automatically extract high-level time-frequency features through a sequence of operations such as convolution, pooling, and normalization; (3) an attached GRU neural network models the temporal correlation of the speech feature maps, after which a fully connected layer maps the feature space to the output layer; (4) the output end emits the speech language class probabilities, from which the language category is obtained and the classification result permanently stored.
After a third-party service delivers a speech language classification instruction, the business logic module of the system processes it and forwards it through the internal Redis server down to the speech language classification module, which predicts the language category of the speech audio data and stores the classification result.
Specifically, the business logic of the system is mainly responsible for the interaction between external and internal information. On receiving a speech language classification instruction from a third-party service, it converts it into an internal command by logical processing and communicates it to the internal Redis server. After Redis receives the instruction, it distributes the message: the third-party audio source data is downloaded and stored on an internal server, its storage path is recorded, and the classification instruction is passed down to the speech language classification module.
The speech language classification work is carried out in the speech language classification module; the whole process specifically comprises the following steps:
S1: obtain the source audio data from the server, pre-process the audio, and cut the source audio data.
Step S1 comprises the following sub-steps:
S101: to deal with the confusing, hard-to-identify file names encountered when acquiring the source audio data, rename and store the source audio files and establish the corresponding file mapping;
S102: to deal with the diverse formats of the acquired source audio, re-encode the source audio files into a uniform MP3 audio format;
S103: cut the MP3 audio files with FFmpeg and extract the audio clips.
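Steps S102 and S103 above can be sketched as FFmpeg invocations. The sketch below only builds the command lines (the file names, clip offset, and duration are hypothetical, and the codec flags are common FFmpeg options rather than anything the patent specifies), so it runs without FFmpeg installed; actual transcoding would call subprocess.run on the returned list.

```python
def recode_to_mp3(src, dst):
    """argv for re-encoding any input audio to a uniform MP3 format (step S102).
    Run with subprocess.run(cmd, check=True) when FFmpeg is installed."""
    return ["ffmpeg", "-y", "-i", src, "-codec:a", "libmp3lame", dst]

def cut_clip(src, dst, start_s, dur_s):
    """argv for cutting a fixed-length clip out of an MP3 file (step S103)."""
    return ["ffmpeg", "-y", "-ss", str(start_s), "-t", str(dur_s),
            "-i", src, "-codec:a", "copy", dst]

# Hypothetical file names, for illustration only:
cmd = cut_clip("utt_000001.mp3", "utt_000001_10s.mp3", start_s=0.0, dur_s=10.0)
```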
S2: read the audio file information and generate an audio data inventory CSV file.
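Step S2 can be sketched as a small directory scan. The column names below are assumptions; the patent only says that the inventory CSV lists the audio data files.

```python
import csv
import os

def write_inventory(audio_dir, csv_path):
    """Write the audio inventory CSV of step S2: one row per cut MP3 clip."""
    names = sorted(f for f in os.listdir(audio_dir) if f.endswith(".mp3"))
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file_name", "size_bytes"])  # assumed column layout
        for name in names:
            writer.writerow([name, os.path.getsize(os.path.join(audio_dir, name))])
    return len(names)
```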
S3: apply a short-time Fourier transform to each audio file; the time-frequency analysis of the speech signal yields a spectrum function that unfolds into a series of two-dimensional speech spectrograms correlated in time and frequency.
Step S3 comprises the following sub-steps:
S301: convert the MP3 audio to WAV format with the mpg123 library, and at the same time perturb the audio channels to realize data augmentation and improve the language classification accuracy;
S302: apply a short-time Fourier transform to the audio data; the time-frequency analysis yields its spectrum function, from which the spectrogram is drawn and stored as a grayscale speech spectrogram;
S303: for the deep convolutional network language classification model, divide the speech spectrograms into a training data set, a validation data set, and a test data set.
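Step S302 can be sketched with SciPy's short-time Fourier transform. The FFT size, hop, and dB scaling below are illustrative assumptions; the patent does not specify the analysis window.

```python
import numpy as np
from scipy.signal import stft

def grayscale_spectrogram(samples, sr, n_fft=512, hop=160):
    """STFT of a mono waveform (step S302), log-compressed and rescaled to an
    8-bit grayscale image. n_fft and hop are assumed, not taken from the patent."""
    _, _, Z = stft(samples, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    mag_db = 20.0 * np.log10(np.abs(Z) + 1e-10)   # log magnitude in dB
    lo, hi = mag_db.min(), mag_db.max()
    if hi == lo:                                  # silent input: flat image
        return np.zeros_like(mag_db, dtype=np.uint8)
    return np.rint(255.0 * (mag_db - lo) / (hi - lo)).astype(np.uint8)

# 1 s of a 440 Hz tone at 16 kHz, just to show the output shape
t = np.arange(16000) / 16000.0
img = grayscale_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=16000)
```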
Step S4 comprises the following sub-steps:
S401: pre-process the input speech audio source data into two-dimensional grayscale speech spectrograms correlated in time and frequency.
Step S401 comprises the following sub-steps:
S4011: cut the input speech audio source data with FFmpeg;
S4012: transcode the cut speech audio into MP3 format;
S4013: convert the MP3 audio into a WAV waveform file with mpg123 and apply windowing and framing to obtain the two-dimensional grayscale speech spectrogram of the source audio.
S402: extract the features of the two-dimensional speech spectrograms with a CNN.
The CNN in step S402 must satisfy the following conditions: it contains four convolution-pooling stages and uses ReLU non-linear activations and batch normalization (BN); the first convolutional layer uses 7×7 filters, the second 5×5 filters, and the remaining convolutional layers 3×3 filters; every convolutional layer is followed by a 3×3 pooling layer with a stride of 2.
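Under the stated conditions, the CNN of step S402 can be sketched in PyTorch as follows. The channel widths, the convolution padding, and the pooling padding are illustrative assumptions; the patent fixes only the filter sizes, the pool size, and the stride.

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Four convolution-pooling stages per step S402: filter sizes 7x7, 5x5,
    3x3, 3x3, each followed by batch normalization, ReLU, and 3x3 max pooling
    with stride 2. Channel widths (32, 64, 128, 128) are assumptions."""
    def __init__(self, widths=(32, 64, 128, 128)):
        super().__init__()
        layers, in_ch = [], 1              # one grayscale spectrogram channel
        for out_ch, k in zip(widths, (7, 5, 3, 3)):
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)

    def forward(self, x):                  # x: (N, 1, freq_bins, frames)
        return self.features(x)

# One 257x256 grayscale spectrogram; each pooling stage roughly halves both axes.
fmap = SpectrogramCNN()(torch.zeros(1, 1, 257, 256))
```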
S403: fuse the feature maps extracted by the CNN with the GRU neural network, taking the number of input feature maps as the total channel count of a single GRU network;
S404: read the time-frequency information expressed by each feature map as a time series;
S405: take the frequency information of all channels as the input of a single GRU cell, and chain the GRU cells recurrently to build a recurrent neural network with long-term memory;
S406: after the feature fusion of the GRU network, attach one FC layer that maps the feature space to a vector;
S407: compute the speech language class probabilities with a final fully connected Softmax layer;
S408: after the output end emits the class probabilities, match them against the labelled language categories, select the Top-3 most probable languages, and write the result into a CSV file for permanent storage.
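Steps S403-S408 above can be sketched in PyTorch: the CNN feature maps are read as a time series with one step per time frame, the frequency-by-channel values at each frame feed a GRU, and a fully connected Softmax layer yields the per-language probabilities, from which the Top-3 are taken. The hidden size and the feature-map dimensions are assumptions; the 178 language categories come from the patent.

```python
import torch
import torch.nn as nn

class GRULanguageHead(nn.Module):
    """Sketch of steps S403-S408. Feature maps (N, C, F, T) become a sequence
    of T steps, each of size C*F, fed to a GRU; an FC layer plus Softmax then
    gives the class probabilities. hidden=256 is an assumption."""
    def __init__(self, channels=128, freq_bins=17, hidden=256, n_langs=178):
        super().__init__()
        self.gru = nn.GRU(input_size=channels * freq_bins,
                          hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_langs)

    def forward(self, fmap):                   # fmap: (N, C, F, T)
        n, c, f, t = fmap.shape
        seq = fmap.permute(0, 3, 1, 2).reshape(n, t, c * f)  # (N, T, C*F)
        _, h = self.gru(seq)                   # final hidden state
        return torch.softmax(self.fc(h[-1]), dim=-1)

probs = GRULanguageHead()(torch.zeros(1, 128, 17, 16))
top3 = torch.topk(probs, k=3, dim=-1)          # Top-3 candidate languages (S408)
```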
S4: build the model: a CNN automatically extracts high-level time-frequency features through convolution, pooling, and normalization operations, and an attached GRU neural network models the temporal correlation of the speech feature maps.
S5: input the two-dimensional spectrogram images into the fused CNN-GRU deep neural network model for speech language classification, which outputs the language category data.
Step S5 comprises the following sub-steps:
S501: with the conversion tools of the Caffe deep learning framework, convert the spectrogram training and validation data sets into the training-set and validation-set LevelDB databases required by Caffe, train the fused CNN-GRU deep convolutional network, and establish the fused CNN-GRU deep neural network model for speech language classification;
S502: in the training stage of the model, use the Caffe training command to invoke the Prototxt solver, tune the number of network layers and filters to the optimal configuration, and supervise the network training until it reaches its best performance;
S503: in the test stage of the model, load the optimal weight and bias model files obtained after training, input the spectrogram test data set, predict the language of the audio data against the 178 labelled language categories of the pre-trained network, and output the best Top-3 language categories.
S6: record the output language category data together with the source audio file information in a result CSV file and store it.
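Step S6 can be sketched as follows. The CSV column layout is an assumption; the patent only specifies that the Top-3 results and the source audio information go into a result CSV file.

```python
import csv
import numpy as np

def store_top3(csv_path, file_name, probs, lang_names):
    """Append one result row (step S6): the source file name plus the three
    most probable languages and their scores."""
    order = np.argsort(probs)[::-1][:3]        # indices of the 3 largest probs
    row = [file_name]
    for i in order:
        row += [lang_names[i], "%.4f" % probs[i]]
    with open(csv_path, "a", newline="") as fh:
        csv.writer(fh).writerow(row)
    return [lang_names[i] for i in order]
```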
The working principle of the invention is as follows: after a series of processing steps on the audio source data, its speech language category is finally predicted. First comes the pre-processing of the audio source data: files that are empty or unreadable are screened out and deleted, the necessary file renaming is performed, and the mapping between renamed files and source files is stored; audio in formats other than MP3 is decoded and re-encoded into MP3 with FFmpeg; the source audio segments to be processed are then cut out, and an inventory CSV file of the cut audio data is built. Then, in the spectrogram transformation stage, the audio files are processed one by one according to the inventory file: the audio is converted to WAV format with the mpg123 library, a short-time Fourier transform is applied, and the audio channels are randomly perturbed to augment the data and improve classification accuracy, producing uniformly sized two-dimensional time-frequency speech spectrograms corresponding to the audio. Then, for the stage that predicts the language category from the spectrograms, the category can only be predicted after a model has been trained on labelled data: for training, the uniformly sized grayscale spectrogram training and validation sets, converted from the collected audio labelled with 178 language categories, are turned into the LevelDB databases required by Caffe and fed into the Caffe-based deep convolutional network structure; the training network is debugged, the model continually learns from the extracted features, and the optimal weight and bias Caffe model files are obtained. The target speech spectrograms to be predicted are then input into the prediction network, and the pre-trained, optimized model predicts their language category. Finally, the prediction results, labelled with the source audio attributes, are written into a CSV file, stored, and returned.
The above is only a preferred embodiment of the present invention. It should be understood that the present invention is not limited to the forms described herein, which should not be regarded as excluding other embodiments; it can be used in various other combinations, modifications, and environments, and can be modified within the scope contemplated herein through the above teachings or the technology or knowledge of related fields. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the present invention shall all fall within the protection scope of the appended claims.
Claims (8)
1. A speech language classification method based on a fused CNN and GRU deep neural network, characterized by comprising the following steps:
S1: obtain the source audio data from the server, pre-process the audio, and cut the source audio data;
S2: read the audio file information and generate an audio data inventory CSV file;
S3: apply a short-time Fourier transform to each audio file; the time-frequency analysis of the speech signal yields a spectrum function that unfolds into a series of two-dimensional speech spectrograms correlated in time and frequency;
S4: build the model: a CNN automatically extracts high-level time-frequency features through convolution, pooling, and normalization operations, and an attached GRU neural network models the temporal correlation of the speech feature maps;
S5: input the two-dimensional spectrogram images into the fused CNN-GRU deep neural network model for speech language classification, which outputs the language category data;
S6: record the output language category data together with the source audio file information in a result CSV file and store it.
2. The speech language classification method based on a fused CNN and GRU deep neural network according to claim 1, characterized in that step S1 comprises the following sub-steps:
S101: to deal with the confusing, hard-to-identify file names encountered when acquiring the source audio data, rename and store the source audio files and establish the corresponding file mapping;
S102: to deal with the diverse formats of the acquired source audio, re-encode the source audio files into a uniform MP3 audio format;
S103: cut the MP3 audio files with FFmpeg and extract the audio clips.
3. The speech language classification method based on a fused CNN and GRU deep neural network according to claim 1, characterized in that step S3 comprises the following sub-steps:
S301: convert the MP3 audio to WAV format with the mpg123 library, and at the same time perturb the audio channels to realize data augmentation and improve the language classification accuracy;
S302: apply a short-time Fourier transform to the audio data; the time-frequency analysis yields its spectrum function, from which the spectrogram is drawn and stored as a grayscale speech spectrogram;
S303: for the deep convolutional network language classification model, divide the speech spectrograms into a training data set, a validation data set, and a test data set.
4. The voice language classification method based on a CNN and GRU fused deep neural network according to claim 1, wherein the step S4 comprises the following sub-steps:
S401: pre-processing the input speech audio source data into a two-dimensional grayscale speech spectrogram image relating the time domain and the frequency domain of the speech audio source data;
S402: extracting the features in the two-dimensional speech spectrogram using the CNN;
S403: fusing the feature maps extracted by the CNN with the GRU neural network, taking the number of input feature maps as the total number of channels of a single GRU neural network;
S404: intercepting the time-domain and frequency-domain information expressed by each feature map as a time series;
S405: taking the frequency-domain information of all channels as the input of a single GRU core, and establishing a recurrent neural network with long-term memory by recurrently connecting single GRU cores;
S406: after feature fusion in the GRU neural network, connecting one FC layer to map the feature space to a vector;
S407: computing the class probabilities of the voice languages with a final Softmax layer after the full connection;
S408: after the output end outputs the class probabilities of the voice languages, matching the probabilities against the labeled voice language classes, selecting the Top-3 language classes by probability, and storing the output results in a CSV file for permanent storage.
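Steps S403–S405 treat each CNN feature map as one input channel of a GRU and unroll the feature map's time axis as a sequence. A minimal NumPy sketch of one GRU step under that reading; the dimensions, random weights, and sequence data are illustrative assumptions:

```python
import numpy as np

def gru_step(x, h, W, U, b):
    """One GRU update (sketch of S405): W, U, b hold the update-gate (z),
    reset-gate (r), and candidate parameters stacked along the last axis."""
    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))
    d = h.shape[-1]
    z = sigmoid(x @ W[:, :d] + h @ U[:, :d] + b[:d])            # update gate
    r = sigmoid(x @ W[:, d:2*d] + h @ U[:, d:2*d] + b[d:2*d])   # reset gate
    h_cand = np.tanh(x @ W[:, 2*d:] + (r * h) @ U[:, 2*d:] + b[2*d:])
    return (1.0 - z) * h + z * h_cand                           # new hidden state

rng = np.random.default_rng(0)
n_maps, seq_len, hidden = 8, 12, 16   # feature maps = GRU channel count (S403)
W = rng.normal(size=(n_maps, 3 * hidden))
U = rng.normal(size=(hidden, 3 * hidden))
b = np.zeros(3 * hidden)

# One frequency-domain slice of every feature map per time step (S404-S405)
seq = rng.normal(size=(seq_len, n_maps))
h = np.zeros(hidden)
for x in seq:
    h = gru_step(x, h, W, U, b)
```

The final hidden state `h` plays the role of the fused feature that the FC and Softmax layers of S406–S407 would then map to language-class probabilities.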
5. The voice language classification method based on a CNN and GRU fused deep neural network according to claim 4, wherein the step S401 comprises the following sub-steps:
S4011: intercepting the input speech audio source data using FFmpeg;
S4012: transcoding the intercepted speech audio source data into the MP3 audio format;
S4013: converting the MP3-format speech audio source data into a WAV waveform file using mpg123, and applying windowing and framing to obtain the two-dimensional grayscale speech spectrogram image relating the time domain and the frequency domain of the speech audio source data.
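The interception and transcoding of S4011–S4013 can be sketched as command-line invocations of the two tools the claim names. The file names, the 10-second clip length, and the MP3 codec flag are illustrative assumptions, not values from the patent:

```shell
# S4011-S4012: intercept (clip) the source audio with FFmpeg and
# transcode it to MP3; names and the 10 s duration are assumptions.
ffmpeg -i source_audio.wav -ss 0 -t 10 -codec:a libmp3lame clip.mp3

# S4013: decode the MP3 clip to a WAV waveform file with mpg123;
# windowing and framing then proceed as in the spectrogram step S302.
mpg123 -w clip.wav clip.mp3
```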
6. The voice language classification method based on a CNN and GRU fused deep neural network according to claim 4, wherein the CNN in the step S402 satisfies the following conditions:
the CNN comprises four convolution-pooling layers with ReLU nonlinear activation and BN (batch normalization) regularization; the first convolutional layer uses 7×7 filters, the second convolutional layer uses 5×5 filters, and the remaining convolutional layers all use 3×3 filters; each convolutional layer is paired with 3×3 pooling with a stride of 2.
7. The voice language classification method based on a CNN and GRU fused deep neural network according to claim 1, wherein the step S5 comprises the following sub-steps:
S501: using the Caffe deep learning framework conversion tool, converting the spectrogram training dataset and validation dataset into the training-set and validation-set LevelDB databases required by Caffe for network training; training the CNN and GRU fused deep convolutional neural network, and establishing the voice language classification CNN and GRU fused deep neural network model;
S502: in the training stage of the model, invoking the Prototxt solver with the Caffe training command, optimizing the number of network layers and the number of filters to obtain the best-configured deep neural network structure, and supervising the network training until optimal performance is reached;
S503: in the test stage of the model, inputting the speech spectrogram test dataset according to the optimal weight and bias model files obtained after network training, predicting the language of the audio data against the 178 labeled language classes in the pre-trained network, and outputting the best Top-3 language class results.
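The Top-3 output of S503 (and S408) amounts to a Softmax over class scores followed by a probability sort. A minimal sketch; the raw scores and the 5-label example (standing in for the 178 language classes) are illustrative assumptions:

```python
import numpy as np

def top3_languages(scores, labels):
    """Softmax over raw class scores, then the three most probable
    language labels with their probabilities (sketch of S407-S408)."""
    e = np.exp(scores - scores.max())    # subtract max for numerical stability
    probs = e / e.sum()                  # Softmax class probabilities
    order = np.argsort(probs)[::-1][:3]  # indices of the Top-3 classes
    return [(labels[i], float(probs[i])) for i in order]

# Illustrative stand-in for the 178 labeled language classes
labels = ["Mandarin", "English", "French", "Arabic", "Swahili"]
scores = np.array([2.1, 3.3, 0.4, 1.7, -0.5])
top3 = top3_languages(scores, labels)
```

The returned `(label, probability)` pairs are what S408 would write to the CSV file for permanent storage.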
8. The voice language classification method based on a CNN and GRU fused deep neural network according to claim 1, further comprising, before the step S1:
after a voice language classification instruction conveyed by a third-party service is received, processing the instruction by the system business logic processing module, and conveying it to the server to which the internal Redis publishes in the step S1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811303649.5A CN109523993B (en) | 2018-11-02 | 2018-11-02 | Voice language classification method based on CNN and GRU fusion deep neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811303649.5A CN109523993B (en) | 2018-11-02 | 2018-11-02 | Voice language classification method based on CNN and GRU fusion deep neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109523993A true CN109523993A (en) | 2019-03-26 |
CN109523993B CN109523993B (en) | 2022-02-08 |
Family
ID=65774431
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811303649.5A Active CN109523993B (en) | 2018-11-02 | 2018-11-02 | Voice language classification method based on CNN and GRU fusion deep neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109523993B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103745731A (en) * | 2013-12-31 | 2014-04-23 | 安徽科大讯飞信息科技股份有限公司 | Automatic voice recognition effect testing system and automatic voice recognition effect testing method |
US20150032449A1 (en) * | 2013-07-26 | 2015-01-29 | Nuance Communications, Inc. | Method and Apparatus for Using Convolutional Neural Networks in Speech Recognition |
CN107578775A (en) * | 2017-09-07 | 2018-01-12 | 四川大学 | A kind of multitask method of speech classification based on deep neural network |
US20180032846A1 (en) * | 2016-08-01 | 2018-02-01 | Nvidia Corporation | Fusing multilayer and multimodal deep neural networks for video classification |
CN107797987A (en) * | 2017-10-12 | 2018-03-13 | 北京知道未来信息技术有限公司 | A kind of mixing language material name entity recognition method based on Bi LSTM CNN |
CN107807971A (en) * | 2017-10-18 | 2018-03-16 | 北京信息科技大学 | A kind of automated graphics semantic description method |
CN108399435A (en) * | 2018-03-21 | 2018-08-14 | 南京邮电大学 | A kind of video classification methods based on sound feature |
Non-Patent Citations (3)
Title |
---|
ZHOU, Jin'ao et al.: "Multilingual classification algorithm based on audio feature parameters", Communications Technology *
CUI, Ruilian et al.: "Language identification based on deep neural networks", Pattern Recognition and Artificial Intelligence *
CHEN, Zhuo: "Research and implementation of Chinese speech emotion mining", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110164472A (en) * | 2019-04-19 | 2019-08-23 | 天津大学 | Noise classification method based on convolutional neural networks |
CN110148399A (en) * | 2019-05-06 | 2019-08-20 | 北京猎户星空科技有限公司 | A kind of control method of smart machine, device, equipment and medium |
CN110136690A (en) * | 2019-05-22 | 2019-08-16 | 平安科技(深圳)有限公司 | Phoneme synthesizing method, device and computer readable storage medium |
CN110164417A (en) * | 2019-05-31 | 2019-08-23 | 科大讯飞股份有限公司 | A kind of languages vector obtains, languages know method for distinguishing and relevant apparatus |
CN110322894A (en) * | 2019-06-27 | 2019-10-11 | 电子科技大学 | A kind of waveform diagram generation and giant panda detection method based on sound |
CN110322894B (en) * | 2019-06-27 | 2022-02-11 | 电子科技大学 | Sound-based oscillogram generation and panda detection method |
WO2021000498A1 (en) * | 2019-07-04 | 2021-01-07 | 平安科技(深圳)有限公司 | Composite speech recognition method, device, equipment, and computer-readable storage medium |
CN110379407A (en) * | 2019-07-22 | 2019-10-25 | 出门问问(苏州)信息科技有限公司 | Adaptive voice synthetic method, device, readable storage medium storing program for executing and calculating equipment |
CN110349564B (en) * | 2019-07-22 | 2021-09-24 | 思必驰科技股份有限公司 | Cross-language voice recognition method and device |
CN110349564A (en) * | 2019-07-22 | 2019-10-18 | 苏州思必驰信息科技有限公司 | Across the language voice recognition methods of one kind and device |
CN110807365A (en) * | 2019-09-29 | 2020-02-18 | 浙江大学 | Underwater target identification method based on fusion of GRU and one-dimensional CNN neural network |
CN110807365B (en) * | 2019-09-29 | 2022-02-11 | 浙江大学 | Underwater target identification method based on fusion of GRU and one-dimensional CNN neural network |
CN110827793A (en) * | 2019-10-21 | 2020-02-21 | 成都大公博创信息技术有限公司 | Language identification method |
CN110930978A (en) * | 2019-11-08 | 2020-03-27 | 北京搜狗科技发展有限公司 | Language identification method and device and language identification device |
CN110782872A (en) * | 2019-11-11 | 2020-02-11 | 复旦大学 | Language identification method and device based on deep convolutional recurrent neural network |
CN110841169A (en) * | 2019-11-28 | 2020-02-28 | 中国科学院深圳先进技术研究院 | Deep sound stimulation system and method for sleep regulation |
CN110841169B (en) * | 2019-11-28 | 2020-09-25 | 中国科学院深圳先进技术研究院 | Deep learning sound stimulation system and method for sleep regulation |
CN111312228A (en) * | 2019-12-09 | 2020-06-19 | 中国南方电网有限责任公司 | End-to-end-based voice navigation method applied to electric power enterprise customer service |
CN113077785A (en) * | 2019-12-17 | 2021-07-06 | 中国科学院声学研究所 | End-to-end multi-language continuous voice stream voice content identification method and system |
CN113077785B (en) * | 2019-12-17 | 2022-07-12 | 中国科学院声学研究所 | End-to-end multi-language continuous voice stream voice content identification method and system |
CN111160171A (en) * | 2019-12-19 | 2020-05-15 | 哈尔滨工程大学 | Radiation source signal identification method combining two-domain multi-features |
CN111160171B (en) * | 2019-12-19 | 2022-04-12 | 哈尔滨工程大学 | Radiation source signal identification method combining two-domain multi-features |
CN111341295A (en) * | 2020-03-10 | 2020-06-26 | 成都华日通讯技术股份有限公司 | Offline real-time multilingual broadcast sensitive word monitoring method |
CN111833884A (en) * | 2020-05-27 | 2020-10-27 | 北京三快在线科技有限公司 | Voiceprint feature extraction method and device, electronic equipment and storage medium |
CN113808583B (en) * | 2020-06-16 | 2024-05-28 | 浙江未来精灵人工智能科技有限公司 | Voice recognition method, device and system |
CN113808583A (en) * | 2020-06-16 | 2021-12-17 | 阿里巴巴集团控股有限公司 | Voice recognition method, device and system |
CN111816166A (en) * | 2020-07-17 | 2020-10-23 | 字节跳动有限公司 | Voice recognition method, apparatus, and computer-readable storage medium storing instructions |
CN112053694A (en) * | 2020-07-23 | 2020-12-08 | 哈尔滨理工大学 | Voiceprint recognition method based on CNN and GRU network fusion |
CN111968670A (en) * | 2020-08-19 | 2020-11-20 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio recognition method and device |
CN112325936A (en) * | 2020-10-30 | 2021-02-05 | 北京印刷学院 | Logistics environment detection and identification method and system |
CN112489623A (en) * | 2020-11-17 | 2021-03-12 | 携程计算机技术(上海)有限公司 | Language identification model training method, language identification method and related equipment |
CN113408649A (en) * | 2021-07-09 | 2021-09-17 | 南京工业大学 | Multi-mode child emotion recognition fusion model based on video image facial expressions and voice |
Also Published As
Publication number | Publication date |
---|---|
CN109523993B (en) | 2022-02-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109523993A (en) | A kind of voice languages classification method merging deep neural network with GRU based on CNN | |
WO2021103492A1 (en) | Risk prediction method and system for business operations | |
CN112084327B (en) | Classification of sparsely labeled text documents while preserving semantics | |
CN103077720B (en) | Speaker identification method and system | |
US11321534B2 (en) | Conversation space artifact generation using natural language processing, machine learning, and ontology-based techniques | |
CN104112026A (en) | Short message text classifying method and system | |
CN112966089A (en) | Problem processing method, device, equipment, medium and product based on knowledge base | |
Jiang et al. | Application intelligent search and recommendation system based on speech recognition technology | |
Walton et al. | Landscape analysis for the specimen data refinery | |
CN109637529A (en) | Voice-based functional localization method, apparatus, computer equipment and storage medium | |
Wang et al. | Gated convolutional LSTM for speech commands recognition | |
CN114385694A (en) | Data processing method and device, computer equipment and storage medium | |
CN113591998A (en) | Method, device, equipment and storage medium for training and using classification model | |
US11532174B2 (en) | Product baseline information extraction | |
CN116976321A (en) | Text processing method, apparatus, computer device, storage medium, and program product | |
Lian | Implementation of computer network user behavior forensic analysis system based on speech data system log | |
CN112699245A (en) | Construction method and device and application method and device of budget management knowledge graph | |
CN110399617A (en) | Audit data processing method, system and readable storage medium storing program for executing | |
US11640558B2 (en) | Unbalanced sample classification method and apparatus | |
US11775510B1 (en) | System and method for modeling a search query | |
KR20230059364A (en) | Public opinion poll system using language model and method thereof | |
CN112133308A (en) | Method and device for multi-label classification of voice recognition text | |
Pal et al. | Cross Domain Answering FAQ Chatbot | |
Shen et al. | NS-Hunter: BERT-Cloze based semantic denoising for distantly supervised relation classification | |
Khan et al. | Non Functional Requirements Identification and Classification Using Transfer Learning Model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right |
Effective date of registration: 20220111 Address after: 510000 floor 22, building C, Shenzhen International Innovation Center (Futian science and Technology Plaza), No. 1006, Shennan Avenue, Xintian community, Huafu street, Futian District, Shenzhen, Guangdong Applicant after: Shenzhen wanglian Anrui Network Technology Co.,Ltd. Address before: Floor 4-8, unit 5, building 1, 333 Yunhua Road, high tech Zone, Chengdu, Sichuan 610041 Applicant before: CHENGDU 30KAITIAN COMMUNICATION INDUSTRY Co.,Ltd. |
GR01 | Patent grant | ||