CN109523993A - Voice language classification method based on a CNN-GRU fused deep neural network - Google Patents

Voice language classification method based on a CNN-GRU fused deep neural network

Info

Publication number
CN109523993A
CN109523993A (application number CN201811303649.5A; granted as CN109523993B)
Authority
CN
China
Prior art keywords
voice
languages
gru
cnn
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811303649.5A
Other languages
Chinese (zh)
Other versions
CN109523993B (en)
Inventor
贾宇
沈宜
邹严
张明亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen wanglian Anrui Network Technology Co.,Ltd.
Original Assignee
CHENGDU 30KAITIAN COMMUNICATION INDUSTRY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHENGDU 30KAITIAN COMMUNICATION INDUSTRY Co Ltd
Priority to CN201811303649.5A
Publication of CN109523993A
Application granted
Publication of CN109523993B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/16: Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a voice language classification method based on a CNN-GRU fused deep neural network, comprising the following steps: S1: obtain source audio data from the server, preprocess the audio, and cut the source audio data; S2: read the audio file information and generate an audio data inventory CSV file; S3: apply the short-time Fourier transform to each audio file to obtain, from the series of spectrum functions produced by time-frequency analysis of the speech signal, two-dimensional speech spectrograms correlated in both the time and frequency domains; S4: build the model; S5: feed the two-dimensional speech spectrogram image data into the CNN-GRU fused deep neural network model for voice language classification, which outputs the language category data; S6: store the language category data and the source audio file information. The invention solves the voice language classification problem; it offers automation, a high recognition rate, strong robustness, low cost, and good portability, and it can conveniently interface with third-party services.

Description

Voice language classification method based on a CNN-GRU fused deep neural network
Technical field
The present invention relates to the field of speech processing technology, and more particularly to a voice language classification method based on a deep neural network that fuses a CNN with a GRU.
Background technique
With the arrival of the multimedia era, computers have become increasingly ubiquitous and human-computer interaction ever more frequent. People strive to let machines understand spoken human instructions and to operate and control machines by voice. Compared with traditional input devices such as the keyboard and mouse, speech technology, as an important means of human-computer interaction, is gradually becoming a key technology for human-machine interfaces in information transmission.
Most current traditional speech processing methods adopt pattern-matching strategies; template-matching methods and language-model methods based on statistical probability are the main approaches in speech processing technology. The limitations of acoustic and language models, together with the environmental, personal, and colloquial variability of human language in daily life, pose great challenges to speech processing and make it very difficult.
A speech processing method is therefore needed that can cope with the diversity of speech signal acquisition and the instability of manually extracted features. With the vigorous development of speech processing and deep learning, deep convolutional neural networks, thanks to their local connectivity and weight sharing, exhibit good translation invariance and hold an irreplaceable position in the image domain. The GRU neural network, a variant of the long short-term memory (LSTM) network, can provide long-term memory and model the long-range correlations of speech in the time domain. Both deep convolutional neural networks and GRU neural networks are widely applied in the speech processing field. A solution is urgently needed that fuses the two networks so as to exploit the respective advantages of both and thereby improve model performance.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and to provide a voice language classification method based on a CNN-GRU fused deep neural network, which converts voice audio data into speech spectrogram images and then uses the CNN-GRU fused deep neural network to extract features from the spectrogram image data and learn autonomously, classifying end-to-end to obtain the voice language category.
The object of the present invention is achieved through the following technical solution: a voice language classification method based on a CNN-GRU fused deep neural network, comprising the following steps:
S1: obtain source audio data from the server, preprocess the audio, and cut the source audio data;
S2: read the audio file information and generate an audio data inventory CSV file (see the inventory sketch after this list);
S3: apply the short-time Fourier transform to each audio file to obtain, from the series of spectrum functions produced by time-frequency analysis of the speech signal, two-dimensional speech spectrograms correlated in both the time and frequency domains;
S4: build the model: use a CNN to automatically extract high-level time-frequency features through convolution, pooling, and normalization operations, then attach a GRU neural network to model the temporal correlations of the speech feature maps;
S5: feed the two-dimensional speech spectrogram image data into the CNN-GRU fused deep neural network model for voice language classification, which outputs the language category data;
S6: record the language category data output by the classifier together with the source audio file information in a result CSV file and store them.
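As an illustration of step S2, the following sketch walks a directory of cut audio clips and writes one inventory row per file. The directory layout, the .mp3 filter, and the column choices are assumptions for illustration; the patent does not fix them.

    import csv
    import os

    def build_inventory(audio_dir: str, inventory_path: str) -> None:
        """Scan audio_dir for MP3 clips and record one CSV row per file (step S2)."""
        with open(inventory_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["file_name", "path", "size_bytes"])  # assumed columns
            for root, _dirs, files in os.walk(audio_dir):
                for name in sorted(files):
                    if name.lower().endswith(".mp3"):
                        path = os.path.join(root, name)
                        writer.writerow([name, path, os.path.getsize(path)])

    # Example: build_inventory("clips/", "audio_inventory.csv")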
Step S1 comprises the following steps:
S101: to address the complexity and poor recognizability of acquired source audio data, rename the source audio files, store them, and establish the corresponding file mapping;
S102: to address the diversity of acquired source audio formats, re-encode the source audio files into a uniform format, namely the MP3 audio file format;
S103: cut the MP3 audio files with FFMPEG to intercept the audio data.
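A minimal sketch of S102-S103 invoking the ffmpeg command line through subprocess; the file names and the fixed clip start and length are assumed parameters, and ffmpeg must be installed on the host.

    import subprocess

    def to_mp3(src: str, dst: str) -> None:
        """Re-encode an arbitrary input audio file into MP3 (S102)."""
        subprocess.run(["ffmpeg", "-y", "-i", src, "-codec:a", "libmp3lame", dst], check=True)

    def cut_clip(src: str, dst: str, start: float = 0.0, duration: float = 10.0) -> None:
        """Intercept a segment from an MP3 file without re-encoding it (S103)."""
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start), "-t", str(duration), "-i", src, "-c", "copy", dst],
            check=True,
        )

    # Example: to_mp3("raw.wav", "raw.mp3"); cut_clip("raw.mp3", "clip.mp3", start=0, duration=10)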
Step S3 comprises the following sub-steps:
S301: convert the MP3 audio data into WAV format with the mpg123 library, while perturbing the audio channels to achieve data augmentation and improve language classification accuracy;
S302: apply the short-time Fourier transform to the audio data; time-frequency analysis yields its time-frequency spectrum function, from which the spectrogram is drawn and converted into a grayscale speech spectrogram for storage;
S303: for the deep convolutional neural network voice language classification model, split the speech spectrograms into a training data set, a validation data set, and a test data set.
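A sketch of S301-S302: decode MP3 to WAV with the mpg123 command-line tool (-w writes a WAV file), then compute an STFT magnitude spectrogram and save it as a grayscale image. Calling the mpg123 CLI rather than a library binding, the 512-sample window, and the log scaling are assumptions; the patent's channel-perturbation augmentation is omitted here.

    import subprocess

    import numpy as np
    from PIL import Image
    from scipy.io import wavfile
    from scipy.signal import stft

    def mp3_to_wav(mp3_path: str, wav_path: str) -> None:
        """Decode MP3 to WAV with the mpg123 CLI (S301)."""
        subprocess.run(["mpg123", "-w", wav_path, mp3_path], check=True)

    def wav_to_spectrogram(wav_path: str, png_path: str, nperseg: int = 512) -> None:
        """Short-time Fourier transform -> grayscale speech spectrogram (S302)."""
        rate, samples = wavfile.read(wav_path)
        if samples.ndim > 1:                       # mix stereo down to mono
            samples = samples.mean(axis=1)
        _freqs, _times, z = stft(samples, fs=rate, nperseg=nperseg)
        mag = np.log1p(np.abs(z))                  # log magnitude for visual contrast
        img = (255 * mag / mag.max()).astype(np.uint8)
        Image.fromarray(img[::-1]).save(png_path)  # flip so low frequencies sit at the bottom

    # Example: mp3_to_wav("clip.mp3", "clip.wav"); wav_to_spectrogram("clip.wav", "clip.png")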
Step S4 comprises the following sub-steps:
S401: preprocess the input speech audio source data into the two-dimensional time-frequency-correlated grayscale speech spectrogram image of that data;
S402: extract the features in the two-dimensional speech spectrogram with the CNN;
S403: fuse the feature maps extracted by the CNN with the GRU neural network, taking the number of input feature maps as the total channel count of a single GRU neural network;
S404: intercept the time-frequency information expressed by each feature map along time as a time series;
S405: use the frequency-domain information of all channels as the input of a single GRU core, and build a recurrent neural network with long-term memory by cyclically connecting single GRU cores;
S406: after the feature fusion of the GRU neural network, attach one FC layer to map the feature space to a vector;
S407: compute the class probabilities of the voice languages with a final fully connected Softmax layer;
S408: after the output end emits the class probabilities of the voice languages, match the probabilities against the labeled voice language categories, select the Top-3 language categories, and write the results to a CSV file for permanent storage.
Step S401 comprises the following sub-steps:
S4011: intercept the input speech audio source data with FFMPEG;
S4012: transcode the intercepted speech audio source data into MP3 format;
S4013: convert the MP3-format speech audio data into a WAV waveform file with mpg123, apply windowing and framing, and obtain the two-dimensional time-frequency-correlated grayscale speech spectrogram image of the source data.
In step S402, the CNN must satisfy the following conditions:
the CNN comprises four convolution-pooling layers with ReLU nonlinear activation and BN regularization; the first convolutional layer uses 7x7 filters, the second uses 5x5 filters, and the remaining convolutional layers all use 3x3 filters; every convolutional layer is followed by 3x3 pooling with a stride of 2.
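The patent builds and trains this network in Caffe. Purely as a readable illustration of the described topology, the sketch below expresses the same structure in PyTorch: four conv-BN-ReLU-pool blocks with 7x7, 5x5, 3x3, 3x3 filters and 3x3/stride-2 pooling, feature maps handed to a GRU as channels over time, then an FC layer and Softmax. The channel widths, GRU hidden size, input resolution, and the frequency-axis averaging used to serialize the feature maps are assumptions not specified by the patent.

    import torch
    import torch.nn as nn

    class CnnGruClassifier(nn.Module):
        """CNN front end + GRU temporal modeling + FC/Softmax, per the described topology."""

        def __init__(self, n_languages: int = 178, hidden: int = 256):
            super().__init__()
            specs = [(1, 32, 7), (32, 64, 5), (64, 128, 3), (128, 128, 3)]  # assumed widths
            blocks = []
            for c_in, c_out, k in specs:
                blocks += [
                    nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2),
                    nn.BatchNorm2d(c_out),  # BN regularization
                    nn.ReLU(inplace=True),  # ReLU nonlinear activation
                    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),  # 3x3 pooling, stride 2
                ]
            self.cnn = nn.Sequential(*blocks)
            self.gru = nn.GRU(input_size=128, hidden_size=hidden, batch_first=True)
            self.fc = nn.Linear(hidden, n_languages)

        def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
            # spectrogram: (batch, 1, freq, time)
            feats = self.cnn(spectrogram)         # (batch, 128, freq', time')
            feats = feats.mean(dim=2)             # average the frequency axis -> (batch, 128, time')
            feats = feats.permute(0, 2, 1)        # time steps as the sequence -> (batch, time', 128)
            _, h_n = self.gru(feats)              # final hidden state summarizes the sequence
            logits = self.fc(h_n.squeeze(0))      # (batch, n_languages)
            return torch.softmax(logits, dim=-1)  # voice language class probabilities

    # Example: probs = CnnGruClassifier()(torch.randn(2, 1, 257, 400))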
Step S5 comprises the following sub-steps:
S501: use the Caffe deep learning framework conversion tool to convert the spectrogram training data set and validation data set into the LevelDB training and validation databases required for Caffe network training, train the CNN-GRU fused deep convolutional neural network, and establish the CNN-GRU fused deep neural network model for voice language classification;
S502: in the training stage of the CNN-GRU fused deep neural network model for voice language classification, use the Caffe training command to invoke the Prototxt solver, and supervise the training of the optimally configured deep neural network structure, with an optimized number of layers and filter counts, until it reaches its best performance;
S503: in the test stage of the CNN-GRU fused deep neural network model for voice language classification, load the optimal weight and bias model files obtained from network training, feed in the speech spectrogram test data set, predict the language of the audio data against the 178 labeled language categories in the pre-trained network, and output the best Top-3 language category results.
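A sketch of the Top-3 selection in S503/S408: given the probability vector produced by the model, rank the 178 labeled categories and append the three best to the result CSV. The label list and the CSV row layout are assumptions.

    import csv

    import numpy as np

    def top3_to_csv(probs: np.ndarray, labels: list, audio_file: str, result_csv: str) -> None:
        """Append the three most probable language categories for one clip to the result CSV."""
        order = np.argsort(probs)[::-1][:3]  # indices of the three highest probabilities
        with open(result_csv, "a", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow([audio_file] + [f"{labels[i]}:{probs[i]:.3f}" for i in order])

    # Example: top3_to_csv(model_output, language_labels, "clip.mp3", "results.csv")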
The following step also precedes step S1:
after a voice language classification instruction conveyed by a third-party service is received, the system business logic processing module processes the voice language classification instruction and conveys it to the internal Redis, which in turn issues it to the server of step S1.
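One plausible shape for this dispatch, sketched with the redis-py client; the channel name and message format are assumptions, and a Redis server is assumed to be reachable on localhost.

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Business logic module: forward a classification instruction to the internal Redis.
    r.publish("voice_lang_classify", "classify:/data/audio/source_0001.mp3")  # assumed format

    # Voice language classification module: subscribe and handle incoming instructions.
    sub = r.pubsub()
    sub.subscribe("voice_lang_classify")
    for message in sub.listen():
        if message["type"] == "message":
            print("received instruction:", message["data"])
            break  # this sketch handles a single instruction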
The beneficial effects of the present invention are: the voice language classification problem is solved; for the spectrogram obtained by time-frequency analysis of the voice data, an optimally designed CNN-GRU fused deep neural network realizes voice language classification, with the advantages of automation, a high recognition rate, strong robustness, low cost, and good portability, and the method can conveniently be integrated with third-party systems.
Detailed description of the invention
Fig. 1 is a schematic flow diagram of the method steps of the invention;
Fig. 2 is a schematic flow diagram of the data service flow of the invention.
Specific embodiment
The technical solution of the present invention is described below clearly and completely in conjunction with the embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative labor fall within the scope protected by the present invention.
Referring to Figs. 1-2, the present invention provides the following technical solution: a voice language classification method based on a CNN-GRU fused deep neural network, comprising the following process: (1) the input end directly takes the time-frequency-correlated speech spectrogram images; (2) the model is built: the CNN automatically extracts high-level time-frequency features through a sequence of operations such as convolution, pooling, and normalization; (3) the GRU neural network is attached to model the temporal correlations of the speech feature maps, and a fully connected layer then maps the feature space to the output layer; (4) the output end emits the voice language class probabilities, from which the voice language category is obtained, and the classification result is permanently stored.
After a voice language classification instruction conveyed by a third-party service is received, the system business logic processing module processes the voice language classification instruction and conveys it to the internal Redis, which issues it to the voice language classification module; the module then predicts the language category of the voice audio data and stores the classification results.
Specifically, the business logic of the system is mainly responsible for the interaction between external and internal information. On receiving a voice language classification instruction conveyed by a third-party service, it converts the instruction into an internal command through logical processing and communicates it to the internal server Redis. After Redis receives the instruction, it dispatches the message, downloads the third-party audio source data, stores it on the internal server while recording its storage path, and issues the voice language classification instruction to the voice language classification module.
The voice language classification work is carried out in the voice language classification module; the whole process specifically comprises the following steps:
S1: obtain source audio data from the server, preprocess the audio, and cut the source audio data;
Step S1 comprises the following steps:
S101: to address the complexity and poor recognizability of acquired source audio data, rename the source audio files, store them, and establish the corresponding file mapping;
S102: to address the diversity of acquired source audio formats, re-encode the source audio files into a uniform format, namely the MP3 audio file format;
S103: cut the MP3 audio files with FFMPEG to intercept the audio data.
S2: read the audio file information and generate an audio data inventory CSV file;
S3: apply the short-time Fourier transform to each audio file to obtain, from the series of spectrum functions produced by time-frequency analysis of the speech signal, two-dimensional speech spectrograms correlated in both the time and frequency domains;
Step S3 comprises the following sub-steps:
S301: convert the MP3 audio data into WAV format with the mpg123 library, while perturbing the audio channels to achieve data augmentation and improve language classification accuracy;
S302: apply the short-time Fourier transform to the audio data; time-frequency analysis yields its time-frequency spectrum function, from which the spectrogram is drawn and converted into a grayscale speech spectrogram for storage;
S303: for the deep convolutional neural network voice language classification model, split the speech spectrograms into a training data set, a validation data set, and a test data set.
Step S4 comprises the following sub-steps:
S401: preprocess the input speech audio source data into the two-dimensional time-frequency-correlated grayscale speech spectrogram image of that data;
Step S401 comprises the following sub-steps:
S4011: intercept the input speech audio source data with FFMPEG;
S4012: transcode the intercepted speech audio source data into MP3 format;
S4013: convert the MP3-format speech audio data into a WAV waveform file with mpg123, apply windowing and framing, and obtain the two-dimensional time-frequency-correlated grayscale speech spectrogram image of the source data.
S402: extract the features in the two-dimensional speech spectrogram with the CNN;
In step S402, the CNN must satisfy the following conditions:
the CNN comprises four convolution-pooling layers with ReLU nonlinear activation and BN regularization; the first convolutional layer uses 7x7 filters, the second uses 5x5 filters, and the remaining convolutional layers all use 3x3 filters; every convolutional layer is followed by 3x3 pooling with a stride of 2.
S403: fuse the feature maps extracted by the CNN with the GRU neural network, taking the number of input feature maps as the total channel count of a single GRU neural network;
S404: intercept the time-frequency information expressed by each feature map along time as a time series;
S405: use the frequency-domain information of all channels as the input of a single GRU core, and build a recurrent neural network with long-term memory by cyclically connecting single GRU cores;
S406: after the feature fusion of the GRU neural network, attach one FC layer to map the feature space to a vector;
S407: compute the class probabilities of the voice languages with a final fully connected Softmax layer;
S408: after the output end emits the class probabilities of the voice languages, match the probabilities against the labeled voice language categories, select the Top-3 language categories, and write the results to a CSV file for permanent storage.
S4: build the model: use a CNN to automatically extract high-level time-frequency features through convolution, pooling, and normalization operations, then attach a GRU neural network to model the temporal correlations of the speech feature maps;
S5: feed the two-dimensional speech spectrogram image data into the CNN-GRU fused deep neural network model for voice language classification, which outputs the language category data;
Step S5 comprises the following sub-steps:
S501: use the Caffe deep learning framework conversion tool to convert the spectrogram training data set and validation data set into the LevelDB training and validation databases required for Caffe network training, train the CNN-GRU fused deep convolutional neural network, and establish the CNN-GRU fused deep neural network model for voice language classification;
S502: in the training stage of the CNN-GRU fused deep neural network model for voice language classification, use the Caffe training command to invoke the Prototxt solver, and supervise the training of the optimally configured deep neural network structure, with an optimized number of layers and filter counts, until it reaches its best performance;
S503: in the test stage of the CNN-GRU fused deep neural network model for voice language classification, load the optimal weight and bias model files obtained from network training, feed in the speech spectrogram test data set, predict the language of the audio data against the 178 labeled language categories in the pre-trained network, and output the best Top-3 language category results.
S6: record the language category data output by the classifier together with the source audio file information in a result CSV file and store them.
The operating principle of the present invention is: after a series of processing steps are applied to the audio source data, its voice language category is finally predicted. First comes the preprocessing of the audio source data: files that are empty or unreadable are screened out and deleted, the necessary file renaming is performed, and the mapping between renamed files and source files is retained; audio in formats other than MP3 is decoded and re-encoded into MP3 with FFMPEG; source audio segments are then intercepted for processing, while an inventory CSV file is established for the intercepted audio data. Next, in the spectrogram transformation stage, each audio file listed in the inventory is converted into WAV format with the mpg123 library, the short-time Fourier transform is applied, and the audio channels are randomly perturbed to augment the data and improve classification accuracy, generating two-dimensional time-frequency speech spectrograms of uniform specification corresponding to the voice audio data. Then comes the stage of predicting the voice language category from the spectrogram: the category can only be predicted after the model has been trained on labeled data. When training the network, the uniform-specification grayscale spectrogram training and validation sets, converted from audio data labeled with 178 language categories, are used to generate the LevelDB databases required by Caffe for training the deep convolutional neural network; these are fed into the Caffe-based deep convolutional neural network structure, the training network is tuned, and the model continuously learns from the extracted features until the Caffe model file with the optimal weights and biases is obtained. The target speech spectrogram to be predicted is then fed into the prediction network, and its language category is predicted from the pre-trained, optimized, supervised model file. Finally, the prediction result of the voice language category is written together with the source audio data attributes into a CSV file, labeled and stored, and the result is returned.
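Tying the stages together, a top-level sketch of the described flow. It reuses the hypothetical helpers from the earlier sketches (to_mp3, cut_clip, mp3_to_wav, wav_to_spectrogram, CnnGruClassifier, top3_to_csv), assumed here to be collected in a module named pipeline_helpers; every name and path is illustrative, not the patent's implementation.

    import numpy as np
    import torch
    from PIL import Image

    # Hypothetical module collecting the helper sketches from the sections above.
    from pipeline_helpers import (CnnGruClassifier, cut_clip, mp3_to_wav, to_mp3,
                                  top3_to_csv, wav_to_spectrogram)

    LABELS = ["lang_%03d" % i for i in range(178)]  # placeholder for the 178 labeled categories

    def classify_source(src: str, model: CnnGruClassifier) -> None:
        """Run one source file through the described pipeline and store its Top-3 languages."""
        to_mp3(src, "tmp.mp3")                      # S102: unify the format as MP3
        cut_clip("tmp.mp3", "clip.mp3")             # S103: intercept a segment with FFMPEG
        mp3_to_wav("clip.mp3", "clip.wav")          # S301: decode to WAV with mpg123
        wav_to_spectrogram("clip.wav", "clip.png")  # S302: STFT -> grayscale spectrogram
        img = np.array(Image.open("clip.png"), dtype=np.float32) / 255.0
        spec = torch.from_numpy(img)[None, None]    # shape (1, 1, freq, time)
        probs = model(spec).detach().numpy()[0]     # S5: language class probabilities
        top3_to_csv(probs, LABELS, src, "results.csv")  # S6/S408: persist the Top-3 result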
The above is only a preferred embodiment of the present invention. It should be understood that the present invention is not limited to the forms disclosed herein, which should not be regarded as excluding other embodiments; it may be used in various other combinations, modifications, and environments, and may be altered within the scope contemplated herein through the above teachings or the technology or knowledge of the relevant field. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the present invention shall all fall within the protection scope of the appended claims of the present invention.

Claims (8)

1. A voice language classification method based on a CNN-GRU fused deep neural network, characterized by comprising the following steps:
S1: obtain source audio data from the server, preprocess the audio, and cut the source audio data;
S2: read the audio file information and generate an audio data inventory CSV file;
S3: apply the short-time Fourier transform to each audio file to obtain, from the series of spectrum functions produced by time-frequency analysis of the speech signal, two-dimensional speech spectrograms correlated in both the time and frequency domains;
S4: build the model: use a CNN to automatically extract high-level time-frequency features through convolution, pooling, and normalization operations, then attach a GRU neural network to model the temporal correlations of the speech feature maps;
S5: feed the two-dimensional speech spectrogram image data into the CNN-GRU fused deep neural network model for voice language classification, which outputs the language category data;
S6: record the language category data output by the classifier together with the source audio file information in a result CSV file and store them.
2. The voice language classification method based on a CNN-GRU fused deep neural network according to claim 1, characterized in that step S1 comprises the following steps:
S101: to address the complexity and poor recognizability of acquired source audio data, rename the source audio files, store them, and establish the corresponding file mapping;
S102: to address the diversity of acquired source audio formats, re-encode the source audio files into a uniform format, namely the MP3 audio file format;
S103: cut the MP3 audio files with FFMPEG to intercept the audio data.
3. The voice language classification method based on a CNN-GRU fused deep neural network according to claim 1, characterized in that step S3 comprises the following sub-steps:
S301: convert the MP3 audio data into WAV format with the mpg123 library, while perturbing the audio channels to achieve data augmentation and improve language classification accuracy;
S302: apply the short-time Fourier transform to the audio data; time-frequency analysis yields its time-frequency spectrum function, from which the spectrogram is drawn and converted into a grayscale speech spectrogram for storage;
S303: for the deep convolutional neural network voice language classification model, split the speech spectrograms into a training data set, a validation data set, and a test data set.
4. The voice language classification method based on a CNN-GRU fused deep neural network according to claim 1, characterized in that step S4 comprises the following sub-steps:
S401: preprocess the input speech audio source data into the two-dimensional time-frequency-correlated grayscale speech spectrogram image of that data;
S402: extract the features in the two-dimensional speech spectrogram with the CNN;
S403: fuse the feature maps extracted by the CNN with the GRU neural network, taking the number of input feature maps as the total channel count of a single GRU neural network;
S404: intercept the time-frequency information expressed by each feature map along time as a time series;
S405: use the frequency-domain information of all channels as the input of a single GRU core, and build a recurrent neural network with long-term memory by cyclically connecting single GRU cores;
S406: after the feature fusion of the GRU neural network, attach one FC layer to map the feature space to a vector;
S407: compute the class probabilities of the voice languages with a final fully connected Softmax layer;
S408: after the output end emits the class probabilities of the voice languages, match the probabilities against the labeled voice language categories, select the Top-3 language categories, and write the results to a CSV file for permanent storage.
5. The voice language classification method based on a CNN-GRU fused deep neural network according to claim 4, characterized in that step S401 comprises the following sub-steps:
S4011: intercept the input speech audio source data with FFMPEG;
S4012: transcode the intercepted speech audio source data into MP3 format;
S4013: convert the MP3-format speech audio data into a WAV waveform file with mpg123, apply windowing and framing, and obtain the two-dimensional time-frequency-correlated grayscale speech spectrogram image of the source data.
6. The voice language classification method based on a CNN-GRU fused deep neural network according to claim 4, characterized in that in step S402 the CNN must satisfy the following conditions:
the CNN comprises four convolution-pooling layers with ReLU nonlinear activation and BN regularization; the first convolutional layer uses 7x7 filters, the second uses 5x5 filters, and the remaining convolutional layers all use 3x3 filters; every convolutional layer is followed by 3x3 pooling with a stride of 2.
7. The voice language classification method based on a CNN-GRU fused deep neural network according to claim 1, characterized in that step S5 comprises the following sub-steps:
S501: use the Caffe deep learning framework conversion tool to convert the spectrogram training data set and validation data set into the LevelDB training and validation databases required for Caffe network training, train the CNN-GRU fused deep convolutional neural network, and establish the CNN-GRU fused deep neural network model for voice language classification;
S502: in the training stage of the CNN-GRU fused deep neural network model for voice language classification, use the Caffe training command to invoke the Prototxt solver, and supervise the training of the optimally configured deep neural network structure, with an optimized number of layers and filter counts, until it reaches its best performance;
S503: in the test stage of the CNN-GRU fused deep neural network model for voice language classification, load the optimal weight and bias model files obtained from network training, feed in the speech spectrogram test data set, predict the language of the audio data against the 178 labeled language categories in the pre-trained network, and output the best Top-3 language category results.
8. The voice language classification method based on a CNN-GRU fused deep neural network according to claim 1, characterized in that the following step also precedes step S1:
after a voice language classification instruction conveyed by a third-party service is received, the system business logic processing module processes the voice language classification instruction and conveys it to the internal Redis, which in turn issues it to the server of step S1.
CN201811303649.5A 2018-11-02 2018-11-02 Voice language classification method based on CNN and GRU fusion deep neural network Active CN109523993B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811303649.5A CN109523993B (en) 2018-11-02 2018-11-02 Voice language classification method based on CNN and GRU fusion deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811303649.5A CN109523993B (en) 2018-11-02 2018-11-02 Voice language classification method based on CNN and GRU fusion deep neural network

Publications (2)

Publication Number Publication Date
CN109523993A true CN109523993A (en) 2019-03-26
CN109523993B CN109523993B (en) 2022-02-08

Family

ID=65774431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811303649.5A Active CN109523993B (en) 2018-11-02 2018-11-02 Voice language classification method based on CNN and GRU fusion deep neural network

Country Status (1)

Country Link
CN (1) CN109523993B (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150032449A1 (en) * 2013-07-26 2015-01-29 Nuance Communications, Inc. Method and Apparatus for Using Convolutional Neural Networks in Speech Recognition
CN103745731A (en) * 2013-12-31 2014-04-23 安徽科大讯飞信息科技股份有限公司 Automatic voice recognition effect testing system and automatic voice recognition effect testing method
US20180032846A1 (en) * 2016-08-01 2018-02-01 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
CN107578775A (en) * 2017-09-07 2018-01-12 四川大学 A kind of multitask method of speech classification based on deep neural network
CN107797987A (en) * 2017-10-12 2018-03-13 北京知道未来信息技术有限公司 A kind of mixing language material name entity recognition method based on Bi LSTM CNN
CN107807971A (en) * 2017-10-18 2018-03-16 北京信息科技大学 A kind of automated graphics semantic description method
CN108399435A (en) * 2018-03-21 2018-08-14 南京邮电大学 A kind of video classification methods based on sound feature

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
周金傲 et al., "A multilingual classification algorithm based on audio feature parameters", Communication Technology (《通信技术》) *
崔瑞莲 et al., "Language identification based on deep neural networks", Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》) *
陈琢, "Research and implementation of Chinese speech emotion mining", China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库信息科技辑》) *

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110164472A (en) * 2019-04-19 2019-08-23 天津大学 Noise classification method based on convolutional neural networks
CN110148399A (en) * 2019-05-06 2019-08-20 北京猎户星空科技有限公司 A kind of control method of smart machine, device, equipment and medium
CN110136690A (en) * 2019-05-22 2019-08-16 平安科技(深圳)有限公司 Phoneme synthesizing method, device and computer readable storage medium
CN110164417A (en) * 2019-05-31 2019-08-23 科大讯飞股份有限公司 A kind of languages vector obtains, languages know method for distinguishing and relevant apparatus
CN110322894A (en) * 2019-06-27 2019-10-11 电子科技大学 A kind of waveform diagram generation and giant panda detection method based on sound
CN110322894B (en) * 2019-06-27 2022-02-11 电子科技大学 Sound-based oscillogram generation and panda detection method
WO2021000498A1 (en) * 2019-07-04 2021-01-07 平安科技(深圳)有限公司 Composite speech recognition method, device, equipment, and computer-readable storage medium
CN110379407A (en) * 2019-07-22 2019-10-25 出门问问(苏州)信息科技有限公司 Adaptive voice synthetic method, device, readable storage medium storing program for executing and calculating equipment
CN110349564B (en) * 2019-07-22 2021-09-24 思必驰科技股份有限公司 Cross-language voice recognition method and device
CN110349564A (en) * 2019-07-22 2019-10-18 苏州思必驰信息科技有限公司 Across the language voice recognition methods of one kind and device
CN110807365A (en) * 2019-09-29 2020-02-18 浙江大学 Underwater target identification method based on fusion of GRU and one-dimensional CNN neural network
CN110807365B (en) * 2019-09-29 2022-02-11 浙江大学 Underwater target identification method based on fusion of GRU and one-dimensional CNN neural network
CN110827793A (en) * 2019-10-21 2020-02-21 成都大公博创信息技术有限公司 Language identification method
CN110930978A (en) * 2019-11-08 2020-03-27 北京搜狗科技发展有限公司 Language identification method and device and language identification device
CN110782872A (en) * 2019-11-11 2020-02-11 复旦大学 Language identification method and device based on deep convolutional recurrent neural network
CN110841169A (en) * 2019-11-28 2020-02-28 中国科学院深圳先进技术研究院 Deep sound stimulation system and method for sleep regulation
CN110841169B (en) * 2019-11-28 2020-09-25 中国科学院深圳先进技术研究院 Deep learning sound stimulation system and method for sleep regulation
CN111312228A (en) * 2019-12-09 2020-06-19 中国南方电网有限责任公司 End-to-end-based voice navigation method applied to electric power enterprise customer service
CN113077785A (en) * 2019-12-17 2021-07-06 中国科学院声学研究所 End-to-end multi-language continuous voice stream voice content identification method and system
CN113077785B (en) * 2019-12-17 2022-07-12 中国科学院声学研究所 End-to-end multi-language continuous voice stream voice content identification method and system
CN111160171A (en) * 2019-12-19 2020-05-15 哈尔滨工程大学 Radiation source signal identification method combining two-domain multi-features
CN111160171B (en) * 2019-12-19 2022-04-12 哈尔滨工程大学 Radiation source signal identification method combining two-domain multi-features
CN111341295A (en) * 2020-03-10 2020-06-26 成都华日通讯技术股份有限公司 Offline real-time multilingual broadcast sensitive word monitoring method
CN111833884A (en) * 2020-05-27 2020-10-27 北京三快在线科技有限公司 Voiceprint feature extraction method and device, electronic equipment and storage medium
CN113808583B (en) * 2020-06-16 2024-05-28 浙江未来精灵人工智能科技有限公司 Voice recognition method, device and system
CN113808583A (en) * 2020-06-16 2021-12-17 阿里巴巴集团控股有限公司 Voice recognition method, device and system
CN111816166A (en) * 2020-07-17 2020-10-23 字节跳动有限公司 Voice recognition method, apparatus, and computer-readable storage medium storing instructions
CN112053694A (en) * 2020-07-23 2020-12-08 哈尔滨理工大学 Voiceprint recognition method based on CNN and GRU network fusion
CN111968670A (en) * 2020-08-19 2020-11-20 腾讯音乐娱乐科技(深圳)有限公司 Audio recognition method and device
CN112325936A (en) * 2020-10-30 2021-02-05 北京印刷学院 Logistics environment detection and identification method and system
CN112489623A (en) * 2020-11-17 2021-03-12 携程计算机技术(上海)有限公司 Language identification model training method, language identification method and related equipment
CN113408649A (en) * 2021-07-09 2021-09-17 南京工业大学 Multi-mode child emotion recognition fusion model based on video image facial expressions and voice

Also Published As

Publication number Publication date
CN109523993B (en) 2022-02-08

Similar Documents

Publication Publication Date Title
CN109523993A (en) A kind of voice languages classification method merging deep neural network with GRU based on CNN
WO2021103492A1 (en) Risk prediction method and system for business operations
CN112084327B (en) Classification of sparsely labeled text documents while preserving semantics
CN103077720B (en) Speaker identification method and system
US11321534B2 (en) Conversation space artifact generation using natural language processing, machine learning, and ontology-based techniques
CN104112026A (en) Short message text classifying method and system
CN112966089A (en) Problem processing method, device, equipment, medium and product based on knowledge base
Jiang et al. Application intelligent search and recommendation system based on speech recognition technology
Walton et al. Landscape analysis for the specimen data refinery
CN109637529A (en) Voice-based functional localization method, apparatus, computer equipment and storage medium
Wang et al. Gated convolutional LSTM for speech commands recognition
CN114385694A (en) Data processing method and device, computer equipment and storage medium
CN113591998A (en) Method, device, equipment and storage medium for training and using classification model
US11532174B2 (en) Product baseline information extraction
CN116976321A (en) Text processing method, apparatus, computer device, storage medium, and program product
Lian Implementation of computer network user behavior forensic analysis system based on speech data system log
CN112699245A (en) Construction method and device and application method and device of budget management knowledge graph
CN110399617A (en) Audit data processing method, system and readable storage medium storing program for executing
US11640558B2 (en) Unbalanced sample classification method and apparatus
US11775510B1 (en) System and method for modeling a search query
KR20230059364A (en) Public opinion poll system using language model and method thereof
CN112133308A (en) Method and device for multi-label classification of voice recognition text
Pal et al. Cross Domain Answering FAQ Chatbot
Shen et al. NS-Hunter: BERT-Cloze based semantic denoising for distantly supervised relation classification
Khan et al. Non Functional Requirements Identification and Classification Using Transfer Learning Model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220111

Address after: 510000 floor 22, building C, Shenzhen International Innovation Center (Futian science and Technology Plaza), No. 1006, Shennan Avenue, Xintian community, Huafu street, Futian District, Shenzhen, Guangdong

Applicant after: Shenzhen wanglian Anrui Network Technology Co.,Ltd.

Address before: Floor 4-8, unit 5, building 1, 333 Yunhua Road, high tech Zone, Chengdu, Sichuan 610041

Applicant before: CHENGDU 30KAITIAN COMMUNICATION INDUSTRY Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant