CN109523993A - A speech language classification method based on a fused CNN and GRU deep neural network - Google Patents
A speech language classification method based on a fused CNN and GRU deep neural network (Download PDF, Info)
- Publication number
- CN109523993A (application numbers CN201811303649.5A, CN201811303649A)
- Authority
- CN
- China
- Prior art keywords
- speech
- language
- GRU
- CNN
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
The invention discloses a speech language classification method based on a deep neural network that fuses a CNN with a GRU, comprising the following steps. S1: obtain the source audio data from the server, pre-process the audio, and cut the source audio data. S2: read the audio file information and generate an audio data inventory CSV file. S3: apply a short-time Fourier transform to each audio file; the time-frequency analysis of the speech signal yields a spectrum function that unfolds into a series of two-dimensional speech spectrograms correlated in time and frequency. S4: build the model. S5: input the two-dimensional spectrogram images into the fused CNN-GRU deep neural network model for speech language classification, which outputs the language category data. S6: store the language category data together with the source audio file information. The invention solves the speech language classification problem; it is automatic, accurate, robust, low-cost, and portable, and can conveniently interface with third-party systems.
Description
Technical field
The present invention relates to the field of speech processing technology, and in particular to a speech language classification method based on a deep neural network fusing a CNN with a GRU.
Background art
With the arrival of the multimedia era, computers have become increasingly ubiquitous and human-computer interaction increasingly frequent. People strive to make machines understand spoken human instructions and to operate and control machines by voice. Compared with traditional input devices such as the keyboard and mouse, speech technology, as an important means of human-computer interaction, is gradually becoming a key technology of the man-machine interface for information transmission.
Most traditional speech processing methods adopt a pattern matching strategy; template matching and statistical language models are the main techniques in applied speech processing. But the limitations of acoustic and language models, together with the environmental, personal, and colloquial variability of everyday human language, pose a great challenge to speech processing and make it very difficult.
A speech processing method is therefore needed that can cope with the diversity of speech signal acquisition and the instability of manually extracted features. With the vigorous development of speech processing and deep learning, deep convolutional neural networks, thanks to local connectivity and weight sharing, exhibit good translation invariance and hold an irreplaceable position in the image domain. The GRU network, a variant of the long short-term memory network, provides long-term memory and can model the long-range temporal correlation of speech in the time domain. Both deep CNNs and GRU networks are widely applied in the speech processing field. A solution is urgently needed that fuses the two networks so as to exploit their respective advantages and improve model performance.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and to provide a speech language classification method based on a deep neural network fusing a CNN with a GRU. The method converts speech audio data into speech spectrogram images, then uses the fused CNN-GRU deep neural network to extract features from the spectrogram images and learn autonomously, classifying end-to-end to obtain the speech language category.
The object of the invention is achieved through the following technical solution: a speech language classification method based on a fused CNN and GRU deep neural network, comprising the following steps:
S1: obtain the source audio data from the server, pre-process the audio, and cut the source audio data;
S2: read the audio file information and generate an audio data inventory CSV file;
S3: apply a short-time Fourier transform to each audio file; the time-frequency analysis of the speech signal yields a spectrum function that unfolds into a series of two-dimensional speech spectrograms correlated in time and frequency;
S4: build the model: a CNN automatically extracts high-level time-frequency features through convolution, pooling, and normalization operations, and an attached GRU neural network models the temporal correlation of the speech feature maps;
S5: input the two-dimensional spectrogram images into the fused CNN-GRU deep neural network model for speech language classification, which outputs the language category data;
S6: record the output language category data together with the source audio file information in a result CSV file and store it.
Step S1 comprises the following sub-steps:
S101: to deal with the confusing, hard-to-identify file names encountered when acquiring the source audio data, rename and store the source audio files and establish the corresponding file mapping;
S102: to deal with the diverse formats of the acquired source audio, re-encode the source audio files into a uniform MP3 audio format;
S103: cut the MP3 audio files with FFmpeg and extract the audio clips.
Step S3 comprises the following sub-steps:
S301: convert the MP3 audio to WAV format with the mpg123 library, and at the same time perturb the audio channels to realize data augmentation and improve the language classification accuracy;
S302: apply a short-time Fourier transform to the audio data; the time-frequency analysis yields its spectrum function, from which the spectrogram is drawn and stored as a grayscale speech spectrogram;
S303: for the deep convolutional network language classification model, divide the speech spectrograms into a training data set, a validation data set, and a test data set.
Step S4 comprises the following sub-steps:
S401: pre-process the input speech audio source data into two-dimensional grayscale speech spectrograms correlated in time and frequency;
S402: extract the features of the two-dimensional speech spectrograms with a CNN;
S403: fuse the feature maps extracted by the CNN with the GRU neural network, taking the number of input feature maps as the total channel count of a single GRU network;
S404: read the time-frequency information expressed by each feature map as a time series;
S405: take the frequency information of all channels as the input of a single GRU cell, and chain the GRU cells recurrently to build a recurrent neural network with long-term memory;
S406: after the feature fusion of the GRU network, attach one FC layer that maps the feature space to a vector;
S407: compute the speech language class probabilities with a final fully connected Softmax layer;
S408: after the output end emits the class probabilities, match them against the labelled language categories, select the Top-3 most probable languages, and write the result into a CSV file for permanent storage.
Step S401 comprises the following sub-steps:
S4011: cut the input speech audio source data with FFmpeg;
S4012: transcode the cut speech audio into MP3 format;
S4013: convert the MP3 audio into a WAV waveform file with mpg123 and apply windowing and framing to obtain the two-dimensional grayscale speech spectrogram of the source audio.
The CNN in step S402 must satisfy the following conditions: it contains four convolution-pooling stages and uses ReLU non-linear activations and batch normalization (BN); the first convolutional layer uses 7×7 filters, the second 5×5 filters, and the remaining convolutional layers 3×3 filters; every convolutional layer is followed by a 3×3 pooling layer with a stride of 2.
Step S5 comprises the following sub-steps:
S501: with the conversion tools of the Caffe deep learning framework, convert the spectrogram training and validation data sets into the training-set and validation-set LevelDB databases required by Caffe, train the fused CNN-GRU deep convolutional network, and establish the fused CNN-GRU deep neural network model for speech language classification;
S502: in the training stage of the model, use the Caffe training command to invoke the Prototxt solver, tune the number of network layers and filters to the optimal configuration, and supervise the network training until it reaches its best performance;
S503: in the test stage of the model, load the optimal weight and bias model files obtained after training, input the spectrogram test data set, predict the language of the audio data against the 178 labelled language categories of the pre-trained network, and output the best Top-3 language categories.
Before step S1, the method further comprises the following step:
after a third-party service delivers a speech language classification instruction, the business logic module of the system processes the instruction and forwards it through the internal Redis server down to the server of step S1.
The beneficial effects of the present invention are: it solves the speech language classification problem; for the spectrograms obtained from the time-frequency analysis of the voice data, an optimally designed fused CNN-GRU deep neural network realizes the classification; the method is automatic, accurate, robust, low-cost, and portable, and can conveniently interface with third-party systems.
Description of the drawings
Fig. 1 is a schematic flow diagram of the method steps of the invention;
Fig. 2 is a schematic flow diagram of the data and service flow of the invention.
Specific embodiment
The technical solution of the present invention is described below clearly and completely in conjunction with the embodiments. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative labor fall within the scope protected by the present invention.
Referring to Figs. 1-2, the present invention provides the following technical solution, a speech language classification method based on a fused CNN and GRU deep neural network, which comprises the following procedure: (1) the input end directly takes in time-correlated speech spectrogram images; (2) the model uses a CNN to automatically extract high-level time-frequency features through a sequence of operations such as convolution, pooling, and normalization; (3) an attached GRU neural network models the temporal correlation of the speech feature maps, after which a fully connected layer maps the feature space to the output layer; (4) the output end emits the speech language class probabilities, from which the language category is obtained and the classification result permanently stored.
After a third-party service delivers a speech language classification instruction, the business logic module of the system processes it and forwards it through the internal Redis server down to the speech language classification module, which predicts the language category of the speech audio data and stores the classification result.
Specifically, the business logic of the system is mainly responsible for the interaction between external and internal information. On receiving a speech language classification instruction from a third-party service, it converts it into an internal command by logical processing and communicates it to the internal Redis server. After Redis receives the instruction, it distributes the message: the third-party audio source data is downloaded and stored on an internal server, its storage path is recorded, and the classification instruction is passed down to the speech language classification module.
The speech language classification work is carried out in the speech language classification module; the whole process specifically comprises the following steps:
S1: obtain the source audio data from the server, pre-process the audio, and cut the source audio data.
Step S1 comprises the following sub-steps:
S101: to deal with the confusing, hard-to-identify file names encountered when acquiring the source audio data, rename and store the source audio files and establish the corresponding file mapping;
S102: to deal with the diverse formats of the acquired source audio, re-encode the source audio files into a uniform MP3 audio format;
S103: cut the MP3 audio files with FFmpeg and extract the audio clips.
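Steps S102 and S103 above can be sketched as FFmpeg invocations. The sketch below only builds the command lines (the file names, clip offset, and duration are hypothetical, and the codec flags are common FFmpeg options rather than anything the patent specifies), so it runs without FFmpeg installed; actual transcoding would call subprocess.run on the returned list.

```python
def recode_to_mp3(src, dst):
    """argv for re-encoding any input audio to a uniform MP3 format (step S102).
    Run with subprocess.run(cmd, check=True) when FFmpeg is installed."""
    return ["ffmpeg", "-y", "-i", src, "-codec:a", "libmp3lame", dst]

def cut_clip(src, dst, start_s, dur_s):
    """argv for cutting a fixed-length clip out of an MP3 file (step S103)."""
    return ["ffmpeg", "-y", "-ss", str(start_s), "-t", str(dur_s),
            "-i", src, "-codec:a", "copy", dst]

# Hypothetical file names, for illustration only:
cmd = cut_clip("utt_000001.mp3", "utt_000001_10s.mp3", start_s=0.0, dur_s=10.0)
```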
S2: read the audio file information and generate an audio data inventory CSV file.
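Step S2 can be sketched as a small directory scan. The column names below are assumptions; the patent only says that the inventory CSV lists the audio data files.

```python
import csv
import os

def write_inventory(audio_dir, csv_path):
    """Write the audio inventory CSV of step S2: one row per cut MP3 clip."""
    names = sorted(f for f in os.listdir(audio_dir) if f.endswith(".mp3"))
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file_name", "size_bytes"])  # assumed column layout
        for name in names:
            writer.writerow([name, os.path.getsize(os.path.join(audio_dir, name))])
    return len(names)
```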
S3: apply a short-time Fourier transform to each audio file; the time-frequency analysis of the speech signal yields a spectrum function that unfolds into a series of two-dimensional speech spectrograms correlated in time and frequency.
Step S3 comprises the following sub-steps:
S301: convert the MP3 audio to WAV format with the mpg123 library, and at the same time perturb the audio channels to realize data augmentation and improve the language classification accuracy;
S302: apply a short-time Fourier transform to the audio data; the time-frequency analysis yields its spectrum function, from which the spectrogram is drawn and stored as a grayscale speech spectrogram;
S303: for the deep convolutional network language classification model, divide the speech spectrograms into a training data set, a validation data set, and a test data set.
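Step S302 can be sketched with SciPy's short-time Fourier transform. The FFT size, hop, and dB scaling below are illustrative assumptions; the patent does not specify the analysis window.

```python
import numpy as np
from scipy.signal import stft

def grayscale_spectrogram(samples, sr, n_fft=512, hop=160):
    """STFT of a mono waveform (step S302), log-compressed and rescaled to an
    8-bit grayscale image. n_fft and hop are assumed, not taken from the patent."""
    _, _, Z = stft(samples, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    mag_db = 20.0 * np.log10(np.abs(Z) + 1e-10)   # log magnitude in dB
    lo, hi = mag_db.min(), mag_db.max()
    if hi == lo:                                  # silent input: flat image
        return np.zeros_like(mag_db, dtype=np.uint8)
    return np.rint(255.0 * (mag_db - lo) / (hi - lo)).astype(np.uint8)

# 1 s of a 440 Hz tone at 16 kHz, just to show the output shape
t = np.arange(16000) / 16000.0
img = grayscale_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=16000)
```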
Step S4 comprises the following sub-steps:
S401: pre-process the input speech audio source data into two-dimensional grayscale speech spectrograms correlated in time and frequency.
Step S401 comprises the following sub-steps:
S4011: cut the input speech audio source data with FFmpeg;
S4012: transcode the cut speech audio into MP3 format;
S4013: convert the MP3 audio into a WAV waveform file with mpg123 and apply windowing and framing to obtain the two-dimensional grayscale speech spectrogram of the source audio.
S402: extract the features of the two-dimensional speech spectrograms with a CNN.
The CNN in step S402 must satisfy the following conditions: it contains four convolution-pooling stages and uses ReLU non-linear activations and batch normalization (BN); the first convolutional layer uses 7×7 filters, the second 5×5 filters, and the remaining convolutional layers 3×3 filters; every convolutional layer is followed by a 3×3 pooling layer with a stride of 2.
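Under the stated conditions, the CNN of step S402 can be sketched in PyTorch as follows. The channel widths, the convolution padding, and the pooling padding are illustrative assumptions; the patent fixes only the filter sizes, the pool size, and the stride.

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Four convolution-pooling stages per step S402: filter sizes 7x7, 5x5,
    3x3, 3x3, each followed by batch normalization, ReLU, and 3x3 max pooling
    with stride 2. Channel widths (32, 64, 128, 128) are assumptions."""
    def __init__(self, widths=(32, 64, 128, 128)):
        super().__init__()
        layers, in_ch = [], 1              # one grayscale spectrogram channel
        for out_ch, k in zip(widths, (7, 5, 3, 3)):
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)

    def forward(self, x):                  # x: (N, 1, freq_bins, frames)
        return self.features(x)

# One 257x256 grayscale spectrogram; each pooling stage roughly halves both axes.
fmap = SpectrogramCNN()(torch.zeros(1, 1, 257, 256))
```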
S403: fuse the feature maps extracted by the CNN with the GRU neural network, taking the number of input feature maps as the total channel count of a single GRU network;
S404: read the time-frequency information expressed by each feature map as a time series;
S405: take the frequency information of all channels as the input of a single GRU cell, and chain the GRU cells recurrently to build a recurrent neural network with long-term memory;
S406: after the feature fusion of the GRU network, attach one FC layer that maps the feature space to a vector;
S407: compute the speech language class probabilities with a final fully connected Softmax layer;
S408: after the output end emits the class probabilities, match them against the labelled language categories, select the Top-3 most probable languages, and write the result into a CSV file for permanent storage.
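Steps S403-S408 above can be sketched in PyTorch: the CNN feature maps are read as a time series with one step per time frame, the frequency-by-channel values at each frame feed a GRU, and a fully connected Softmax layer yields the per-language probabilities, from which the Top-3 are taken. The hidden size and the feature-map dimensions are assumptions; the 178 language categories come from the patent.

```python
import torch
import torch.nn as nn

class GRULanguageHead(nn.Module):
    """Sketch of steps S403-S408. Feature maps (N, C, F, T) become a sequence
    of T steps, each of size C*F, fed to a GRU; an FC layer plus Softmax then
    gives the class probabilities. hidden=256 is an assumption."""
    def __init__(self, channels=128, freq_bins=17, hidden=256, n_langs=178):
        super().__init__()
        self.gru = nn.GRU(input_size=channels * freq_bins,
                          hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_langs)

    def forward(self, fmap):                   # fmap: (N, C, F, T)
        n, c, f, t = fmap.shape
        seq = fmap.permute(0, 3, 1, 2).reshape(n, t, c * f)  # (N, T, C*F)
        _, h = self.gru(seq)                   # final hidden state
        return torch.softmax(self.fc(h[-1]), dim=-1)

probs = GRULanguageHead()(torch.zeros(1, 128, 17, 16))
top3 = torch.topk(probs, k=3, dim=-1)          # Top-3 candidate languages (S408)
```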
S4: build the model: a CNN automatically extracts high-level time-frequency features through convolution, pooling, and normalization operations, and an attached GRU neural network models the temporal correlation of the speech feature maps.
S5: input the two-dimensional spectrogram images into the fused CNN-GRU deep neural network model for speech language classification, which outputs the language category data.
Step S5 comprises the following sub-steps:
S501: with the conversion tools of the Caffe deep learning framework, convert the spectrogram training and validation data sets into the training-set and validation-set LevelDB databases required by Caffe, train the fused CNN-GRU deep convolutional network, and establish the fused CNN-GRU deep neural network model for speech language classification;
S502: in the training stage of the model, use the Caffe training command to invoke the Prototxt solver, tune the number of network layers and filters to the optimal configuration, and supervise the network training until it reaches its best performance;
S503: in the test stage of the model, load the optimal weight and bias model files obtained after training, input the spectrogram test data set, predict the language of the audio data against the 178 labelled language categories of the pre-trained network, and output the best Top-3 language categories.
S6: record the output language category data together with the source audio file information in a result CSV file and store it.
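Step S6 can be sketched as follows. The CSV column layout is an assumption; the patent only specifies that the Top-3 results and the source audio information go into a result CSV file.

```python
import csv
import numpy as np

def store_top3(csv_path, file_name, probs, lang_names):
    """Append one result row (step S6): the source file name plus the three
    most probable languages and their scores."""
    order = np.argsort(probs)[::-1][:3]        # indices of the 3 largest probs
    row = [file_name]
    for i in order:
        row += [lang_names[i], "%.4f" % probs[i]]
    with open(csv_path, "a", newline="") as fh:
        csv.writer(fh).writerow(row)
    return [lang_names[i] for i in order]
```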
The working principle of the invention is as follows: after a series of processing steps on the audio source data, its speech language category is finally predicted. First comes the pre-processing of the audio source data: files that are empty or unreadable are screened out and deleted, the necessary file renaming is performed, and the mapping between renamed files and source files is stored; audio in formats other than MP3 is decoded and re-encoded into MP3 with FFmpeg; the source audio segments to be processed are then cut out, and an inventory CSV file of the cut audio data is built. Then, in the spectrogram transformation stage, the audio files are processed one by one according to the inventory file: the audio is converted to WAV format with the mpg123 library, a short-time Fourier transform is applied, and the audio channels are randomly perturbed to augment the data and improve classification accuracy, producing uniformly sized two-dimensional time-frequency speech spectrograms corresponding to the audio. Then, for the stage that predicts the language category from the spectrograms, the category can only be predicted after a model has been trained on labelled data: for training, the uniformly sized grayscale spectrogram training and validation sets, converted from the collected audio labelled with 178 language categories, are turned into the LevelDB databases required by Caffe and fed into the Caffe-based deep convolutional network structure; the training network is debugged, the model continually learns from the extracted features, and the optimal weight and bias Caffe model files are obtained. The target speech spectrograms to be predicted are then input into the prediction network, and the pre-trained, optimized model predicts their language category. Finally, the prediction results, labelled with the source audio attributes, are written into a CSV file, stored, and returned.
The above is only a preferred embodiment of the present invention. It should be understood that the present invention is not limited to the forms described herein, which should not be regarded as excluding other embodiments; it can be used in various other combinations, modifications, and environments, and can be modified within the scope contemplated herein through the above teachings or the technology or knowledge of related fields. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the present invention shall all fall within the protection scope of the appended claims.
Claims (8)
1. A speech language classification method based on a fused CNN and GRU deep neural network, characterized by comprising the following steps:
S1: obtain the source audio data from the server, pre-process the audio, and cut the source audio data;
S2: read the audio file information and generate an audio data inventory CSV file;
S3: apply a short-time Fourier transform to each audio file; the time-frequency analysis of the speech signal yields a spectrum function that unfolds into a series of two-dimensional speech spectrograms correlated in time and frequency;
S4: build the model: a CNN automatically extracts high-level time-frequency features through convolution, pooling, and normalization operations, and an attached GRU neural network models the temporal correlation of the speech feature maps;
S5: input the two-dimensional spectrogram images into the fused CNN-GRU deep neural network model for speech language classification, which outputs the language category data;
S6: record the output language category data together with the source audio file information in a result CSV file and store it.
2. The speech language classification method based on a fused CNN and GRU deep neural network according to claim 1, characterized in that step S1 comprises the following sub-steps:
S101: to deal with the confusing, hard-to-identify file names encountered when acquiring the source audio data, rename and store the source audio files and establish the corresponding file mapping;
S102: to deal with the diverse formats of the acquired source audio, re-encode the source audio files into a uniform MP3 audio format;
S103: cut the MP3 audio files with FFmpeg and extract the audio clips.
3. The speech language classification method based on a fused CNN and GRU deep neural network according to claim 1, characterized in that step S3 comprises the following sub-steps:
S301: convert the MP3 audio to WAV format with the mpg123 library, and at the same time perturb the audio channels to realize data augmentation and improve the language classification accuracy;
S302: apply a short-time Fourier transform to the audio data; the time-frequency analysis yields its spectrum function, from which the spectrogram is drawn and stored as a grayscale speech spectrogram;
S303: for the deep convolutional network language classification model, divide the speech spectrograms into a training data set, a validation data set, and a test data set.
4. The voice language classification method based on a CNN and GRU fused deep neural network according to claim 1, wherein the step S4 comprises the following sub-steps:
S401: pre-processing the input speech audio source data into a two-dimensional grayscale speech spectrogram image relating the time domain and the frequency domain of the speech audio source data;
S402: extracting the features in the two-dimensional speech spectrogram using the CNN;
S403: fusing the feature maps extracted by the CNN with the GRU neural network, taking the number of input feature maps as the total number of channels of a single GRU neural network;
S404: intercepting the time-domain and frequency-domain information expressed by each feature map as a time series;
S405: taking the frequency-domain information of all channels as the input of a single GRU core, and establishing a recurrent neural network with long-term memory by recurrently connecting single GRU cores;
S406: after feature fusion in the GRU neural network, connecting one FC layer to map the feature space to a vector;
S407: computing the class probabilities of the voice languages with a final Softmax layer after the full connection;
S408: after the output end outputs the class probabilities of the voice languages, matching the probabilities against the labeled voice language classes, selecting the Top-3 language classes by probability, and storing the output results in a CSV file for permanent storage.
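Steps S403–S405 treat each CNN feature map as one input channel of a GRU and unroll the feature map's time axis as a sequence. A minimal NumPy sketch of one GRU step under that reading; the dimensions, random weights, and sequence data are illustrative assumptions:

```python
import numpy as np

def gru_step(x, h, W, U, b):
    """One GRU update (sketch of S405): W, U, b hold the update-gate (z),
    reset-gate (r), and candidate parameters stacked along the last axis."""
    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))
    d = h.shape[-1]
    z = sigmoid(x @ W[:, :d] + h @ U[:, :d] + b[:d])            # update gate
    r = sigmoid(x @ W[:, d:2*d] + h @ U[:, d:2*d] + b[d:2*d])   # reset gate
    h_cand = np.tanh(x @ W[:, 2*d:] + (r * h) @ U[:, 2*d:] + b[2*d:])
    return (1.0 - z) * h + z * h_cand                           # new hidden state

rng = np.random.default_rng(0)
n_maps, seq_len, hidden = 8, 12, 16   # feature maps = GRU channel count (S403)
W = rng.normal(size=(n_maps, 3 * hidden))
U = rng.normal(size=(hidden, 3 * hidden))
b = np.zeros(3 * hidden)

# One frequency-domain slice of every feature map per time step (S404-S405)
seq = rng.normal(size=(seq_len, n_maps))
h = np.zeros(hidden)
for x in seq:
    h = gru_step(x, h, W, U, b)
```

The final hidden state `h` plays the role of the fused feature that the FC and Softmax layers of S406–S407 would then map to language-class probabilities.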
5. The voice language classification method based on a CNN and GRU fused deep neural network according to claim 4, wherein the step S401 comprises the following sub-steps:
S4011: intercepting the input speech audio source data using FFmpeg;
S4012: transcoding the intercepted speech audio source data into the MP3 audio format;
S4013: converting the MP3-format speech audio source data into a WAV waveform file using mpg123, and applying windowing and framing to obtain the two-dimensional grayscale speech spectrogram image relating the time domain and the frequency domain of the speech audio source data.
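The interception and transcoding of S4011–S4013 can be sketched as command-line invocations of the two tools the claim names. The file names, the 10-second clip length, and the MP3 codec flag are illustrative assumptions, not values from the patent:

```shell
# S4011-S4012: intercept (clip) the source audio with FFmpeg and
# transcode it to MP3; names and the 10 s duration are assumptions.
ffmpeg -i source_audio.wav -ss 0 -t 10 -codec:a libmp3lame clip.mp3

# S4013: decode the MP3 clip to a WAV waveform file with mpg123;
# windowing and framing then proceed as in the spectrogram step S302.
mpg123 -w clip.wav clip.mp3
```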
6. The voice language classification method based on a CNN and GRU fused deep neural network according to claim 4, wherein the CNN in the step S402 satisfies the following conditions:
the CNN comprises four convolution-pooling layers with ReLU nonlinear activation and BN (batch normalization) regularization; the first convolutional layer uses 7×7 filters, the second convolutional layer uses 5×5 filters, and the remaining convolutional layers all use 3×3 filters; each convolutional layer is paired with 3×3 pooling with a stride of 2.
7. The voice language classification method based on a CNN and GRU fused deep neural network according to claim 1, wherein the step S5 comprises the following sub-steps:
S501: using the Caffe deep learning framework conversion tool, converting the spectrogram training dataset and validation dataset into the training-set and validation-set LevelDB databases required by Caffe for network training; training the CNN and GRU fused deep convolutional neural network, and establishing the voice language classification CNN and GRU fused deep neural network model;
S502: in the training stage of the model, invoking the Prototxt solver with the Caffe training command, optimizing the number of network layers and the number of filters to obtain the best-configured deep neural network structure, and supervising the network training until optimal performance is reached;
S503: in the test stage of the model, inputting the speech spectrogram test dataset according to the optimal weight and bias model files obtained after network training, predicting the language of the audio data against the 178 labeled language classes in the pre-trained network, and outputting the best Top-3 language class results.
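The Top-3 output of S503 (and S408) amounts to a Softmax over class scores followed by a probability sort. A minimal sketch; the raw scores and the 5-label example (standing in for the 178 language classes) are illustrative assumptions:

```python
import numpy as np

def top3_languages(scores, labels):
    """Softmax over raw class scores, then the three most probable
    language labels with their probabilities (sketch of S407-S408)."""
    e = np.exp(scores - scores.max())    # subtract max for numerical stability
    probs = e / e.sum()                  # Softmax class probabilities
    order = np.argsort(probs)[::-1][:3]  # indices of the Top-3 classes
    return [(labels[i], float(probs[i])) for i in order]

# Illustrative stand-in for the 178 labeled language classes
labels = ["Mandarin", "English", "French", "Arabic", "Swahili"]
scores = np.array([2.1, 3.3, 0.4, 1.7, -0.5])
top3 = top3_languages(scores, labels)
```

The returned `(label, probability)` pairs are what S408 would write to the CSV file for permanent storage.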
8. The voice language classification method based on a CNN and GRU fused deep neural network according to claim 1, further comprising, before the step S1:
after a voice language classification instruction conveyed by a third-party service is received, processing the instruction by the system business logic processing module, and conveying it to the server to which the internal Redis publishes in the step S1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811303649.5A CN109523993B (en) | 2018-11-02 | 2018-11-02 | Voice language classification method based on CNN and GRU fusion deep neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811303649.5A CN109523993B (en) | 2018-11-02 | 2018-11-02 | Voice language classification method based on CNN and GRU fusion deep neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109523993A true CN109523993A (en) | 2019-03-26 |
CN109523993B CN109523993B (en) | 2022-02-08 |
Family
ID=65774431
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811303649.5A Active CN109523993B (en) | 2018-11-02 | 2018-11-02 | Voice language classification method based on CNN and GRU fusion deep neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109523993B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103745731A (en) * | 2013-12-31 | 2014-04-23 | 安徽科大讯飞信息科技股份有限公司 | Automatic voice recognition effect testing system and automatic voice recognition effect testing method |
US20150032449A1 (en) * | 2013-07-26 | 2015-01-29 | Nuance Communications, Inc. | Method and Apparatus for Using Convolutional Neural Networks in Speech Recognition |
CN107578775A (en) * | 2017-09-07 | 2018-01-12 | 四川大学 | A kind of multitask method of speech classification based on deep neural network |
US20180032846A1 (en) * | 2016-08-01 | 2018-02-01 | Nvidia Corporation | Fusing multilayer and multimodal deep neural networks for video classification |
CN107797987A (en) * | 2017-10-12 | 2018-03-13 | 北京知道未来信息技术有限公司 | A kind of mixing language material name entity recognition method based on Bi LSTM CNN |
CN107807971A (en) * | 2017-10-18 | 2018-03-16 | 北京信息科技大学 | A kind of automated graphics semantic description method |
CN108399435A (en) * | 2018-03-21 | 2018-08-14 | 南京邮电大学 | A kind of video classification methods based on sound feature |
Non-Patent Citations (3)
Title |
---|
ZHOU, Jin'ao et al.: "Multilingual classification algorithm based on audio feature parameters", Communications Technology *
CUI, Ruilian et al.: "Language identification based on deep neural networks", Pattern Recognition and Artificial Intelligence *
CHEN, Zhuo: "Research and implementation of Chinese speech emotion mining", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110164472A (en) * | 2019-04-19 | 2019-08-23 | 天津大学 | Noise classification method based on convolutional neural networks |
CN110148399A (en) * | 2019-05-06 | 2019-08-20 | 北京猎户星空科技有限公司 | A kind of control method of smart machine, device, equipment and medium |
CN110136690A (en) * | 2019-05-22 | 2019-08-16 | 平安科技(深圳)有限公司 | Phoneme synthesizing method, device and computer readable storage medium |
CN110164417A (en) * | 2019-05-31 | 2019-08-23 | 科大讯飞股份有限公司 | A kind of languages vector obtains, languages know method for distinguishing and relevant apparatus |
CN110322894A (en) * | 2019-06-27 | 2019-10-11 | 电子科技大学 | A kind of waveform diagram generation and giant panda detection method based on sound |
CN110322894B (en) * | 2019-06-27 | 2022-02-11 | 电子科技大学 | Sound-based oscillogram generation and panda detection method |
WO2021000498A1 (en) * | 2019-07-04 | 2021-01-07 | 平安科技(深圳)有限公司 | Composite speech recognition method, device, equipment, and computer-readable storage medium |
CN110379407A (en) * | 2019-07-22 | 2019-10-25 | 出门问问(苏州)信息科技有限公司 | Adaptive voice synthetic method, device, readable storage medium storing program for executing and calculating equipment |
CN110349564B (en) * | 2019-07-22 | 2021-09-24 | 思必驰科技股份有限公司 | Cross-language voice recognition method and device |
CN110349564A (en) * | 2019-07-22 | 2019-10-18 | 苏州思必驰信息科技有限公司 | Across the language voice recognition methods of one kind and device |
CN110807365A (en) * | 2019-09-29 | 2020-02-18 | 浙江大学 | Underwater target identification method based on fusion of GRU and one-dimensional CNN neural network |
CN110807365B (en) * | 2019-09-29 | 2022-02-11 | 浙江大学 | Underwater target identification method based on fusion of GRU and one-dimensional CNN neural network |
CN110827793A (en) * | 2019-10-21 | 2020-02-21 | 成都大公博创信息技术有限公司 | Language identification method |
CN110930978A (en) * | 2019-11-08 | 2020-03-27 | 北京搜狗科技发展有限公司 | Language identification method and device and language identification device |
CN110782872A (en) * | 2019-11-11 | 2020-02-11 | 复旦大学 | Language identification method and device based on deep convolutional recurrent neural network |
CN110841169A (en) * | 2019-11-28 | 2020-02-28 | 中国科学院深圳先进技术研究院 | Deep sound stimulation system and method for sleep regulation |
CN110841169B (en) * | 2019-11-28 | 2020-09-25 | 中国科学院深圳先进技术研究院 | Deep learning sound stimulation system and method for sleep regulation |
CN111312228A (en) * | 2019-12-09 | 2020-06-19 | 中国南方电网有限责任公司 | End-to-end-based voice navigation method applied to electric power enterprise customer service |
CN113077785A (en) * | 2019-12-17 | 2021-07-06 | 中国科学院声学研究所 | End-to-end multi-language continuous voice stream voice content identification method and system |
CN113077785B (en) * | 2019-12-17 | 2022-07-12 | 中国科学院声学研究所 | End-to-end multi-language continuous voice stream voice content identification method and system |
CN111160171A (en) * | 2019-12-19 | 2020-05-15 | 哈尔滨工程大学 | Radiation source signal identification method combining two-domain multi-features |
CN111160171B (en) * | 2019-12-19 | 2022-04-12 | 哈尔滨工程大学 | Radiation source signal identification method combining two-domain multi-features |
CN111341295A (en) * | 2020-03-10 | 2020-06-26 | 成都华日通讯技术股份有限公司 | Offline real-time multilingual broadcast sensitive word monitoring method |
CN111833884A (en) * | 2020-05-27 | 2020-10-27 | 北京三快在线科技有限公司 | Voiceprint feature extraction method and device, electronic equipment and storage medium |
CN113808583B (en) * | 2020-06-16 | 2024-05-28 | 浙江未来精灵人工智能科技有限公司 | Voice recognition method, device and system |
CN113808583A (en) * | 2020-06-16 | 2021-12-17 | 阿里巴巴集团控股有限公司 | Voice recognition method, device and system |
CN111816166A (en) * | 2020-07-17 | 2020-10-23 | 字节跳动有限公司 | Voice recognition method, apparatus, and computer-readable storage medium storing instructions |
CN112053694A (en) * | 2020-07-23 | 2020-12-08 | 哈尔滨理工大学 | Voiceprint recognition method based on CNN and GRU network fusion |
CN111968670A (en) * | 2020-08-19 | 2020-11-20 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio recognition method and device |
CN112325936A (en) * | 2020-10-30 | 2021-02-05 | 北京印刷学院 | Logistics environment detection and identification method and system |
CN112489623A (en) * | 2020-11-17 | 2021-03-12 | 携程计算机技术(上海)有限公司 | Language identification model training method, language identification method and related equipment |
CN113408649A (en) * | 2021-07-09 | 2021-09-17 | 南京工业大学 | Multi-mode child emotion recognition fusion model based on video image facial expressions and voice |
Also Published As
Publication number | Publication date |
---|---|
CN109523993B (en) | 2022-02-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109523993A (en) | A kind of voice languages classification method merging deep neural network with GRU based on CNN | |
WO2021103492A1 (en) | Risk prediction method and system for business operations | |
CN112084327B (en) | Classification of sparsely labeled text documents while preserving semantics | |
CN103077720B (en) | Speaker identification method and system | |
US11321534B2 (en) | Conversation space artifact generation using natural language processing, machine learning, and ontology-based techniques | |
CN104112026A (en) | Short message text classifying method and system | |
CN112966089A (en) | Problem processing method, device, equipment, medium and product based on knowledge base | |
Jiang et al. | Application intelligent search and recommendation system based on speech recognition technology | |
Walton et al. | Landscape analysis for the specimen data refinery | |
CN109637529A (en) | Voice-based functional localization method, apparatus, computer equipment and storage medium | |
Wang et al. | Gated convolutional LSTM for speech commands recognition | |
CN114385694A (en) | Data processing method and device, computer equipment and storage medium | |
CN113591998A (en) | Method, device, equipment and storage medium for training and using classification model | |
US11532174B2 (en) | Product baseline information extraction | |
CN116976321A (en) | Text processing method, apparatus, computer device, storage medium, and program product | |
Lian | Implementation of computer network user behavior forensic analysis system based on speech data system log | |
CN112699245A (en) | Construction method and device and application method and device of budget management knowledge graph | |
CN110399617A (en) | Audit data processing method, system and readable storage medium storing program for executing | |
US11640558B2 (en) | Unbalanced sample classification method and apparatus | |
US11775510B1 (en) | System and method for modeling a search query | |
KR20230059364A (en) | Public opinion poll system using language model and method thereof | |
CN112133308A (en) | Method and device for multi-label classification of voice recognition text | |
Pal et al. | Cross Domain Answering FAQ Chatbot | |
Shen et al. | NS-Hunter: BERT-Cloze based semantic denoising for distantly supervised relation classification | |
Khan et al. | Non Functional Requirements Identification and Classification Using Transfer Learning Model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right |
Effective date of registration: 20220111 Address after: 510000 floor 22, building C, Shenzhen International Innovation Center (Futian science and Technology Plaza), No. 1006, Shennan Avenue, Xintian community, Huafu street, Futian District, Shenzhen, Guangdong Applicant after: Shenzhen wanglian Anrui Network Technology Co.,Ltd. Address before: Floor 4-8, unit 5, building 1, 333 Yunhua Road, high tech Zone, Chengdu, Sichuan 610041 Applicant before: CHENGDU 30KAITIAN COMMUNICATION INDUSTRY Co.,Ltd. |
GR01 | Patent grant | ||