CN110782872A - Language identification method and device based on deep convolutional recurrent neural network

Info

Publication number
CN110782872A
CN110782872A (application number CN201911093837.4A)
Authority
CN
China
Prior art keywords: audio, neural network, language, sequence, recurrent neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911093837.4A
Other languages
Chinese (zh)
Inventor
程颖 (Cheng Ying)
杜姗姗 (Du Shanshan)
冯瑞 (Feng Rui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201911093837.4A
Publication of CN110782872A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 ... using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 ... using orthogonal transformation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a language identification method and device based on a deep convolutional recurrent neural network, used to identify the language of an audio sequence to be tested. The method achieves high-accuracy language identification without requiring expert knowledge of the audio field, and is characterized by comprising the following steps: step S1, dividing the audio sequence to be tested into a number of audio segments 2 s in duration; step S2, converting each audio segment in turn into a corresponding spectrogram by short-time Fourier transform; step S3, feeding the spectrograms in sequence into a pre-trained convolutional recurrent neural network model so as to obtain the audio category judgment probability of each audio segment; step S4, obtaining the language category of each piece of audio data from the audio category judgment probabilities of all of its audio segments.

Description

Language identification method and device based on deep convolutional recurrent neural network
Technical Field
The invention relates to the field of audio recognition, in particular to language identification in everyday scenarios, and more particularly to a language identification method and device based on a deep convolutional recurrent neural network.
Background
In daily life, many intelligent voice assistants require the user to manually specify the system's input language before they can work properly, whereas automatic language identification technology can infer the language the user is speaking. As a preprocessing stage for many speech processing tasks, language identification technology is widely applied in fields such as multilingual speech recognition, cross-language communication, and machine translation.
Most traditional language identification techniques statistically model low-level acoustic features. Commonly used low-level features such as Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) coefficients transform the audio sequence from the time domain to the frequency domain by fast Fourier transform (FFT) and then apply a filter bank that simulates the human ear's perception of hearing, extracting a coefficient vector of fixed dimensionality from each frame as the low-level acoustic feature. Modelling typically uses a Gaussian mixture model (GMM) and its refinements; the GMM-UBM can fit the distribution of real data well, but the resulting mean supervectors carry a large amount of redundant information and are difficult to classify, so the identification accuracy is severely limited.
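For context, such low-level features can be extracted with standard tools. The following is a minimal Python sketch assuming the librosa library; the file name and coefficient count are purely illustrative and are not specified by the patent:

```python
import librosa

# Load a speech file (resampled to 16 kHz) and extract 13 MFCCs per
# frame; "speech.wav" and n_mfcc=13 are illustrative assumptions.
y, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
```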
However, as deep learning has developed in the field of signal processing, language identification methods based on deep learning have multiplied, but these research efforts mainly focus on processing the input audio sequence with recurrent neural networks of various forms. As a result, only the temporal characteristics of the audio are exploited while its spatial characteristics are ignored, so the expected performance is difficult to reach.
Disclosure of Invention
To solve the above problems, the invention provides a language identification method that approaches language identification through the image domain; the method achieves high-accuracy language identification without expert knowledge of the audio field. The invention adopts the following technical scheme:
the invention provides a language identification method based on a deep convolution cyclic neural network, which is used for identifying an audio sequence to be detected so as to identify a corresponding language and is characterized by comprising the following steps: step S1, dividing the audio sequence to be tested into a plurality of audio segments with the time length of 2S; step S2, sequentially carrying out short-time Fourier transform on each audio frequency segment to convert the audio frequency segment into a corresponding spectrogram; step S3, inputting the spectrogram into a pre-trained convolution cyclic neural network model in sequence so as to obtain the audio class judgment probability corresponding to each audio segment; and step S4, obtaining the language category of each corresponding audio sequence to be tested according to the audio category judgment probability of all corresponding audio segments of each audio sequence to be tested, wherein the convolution cyclic neural network model comprises a convolution neural network of a VGG framework and a cyclic neural network of a Bi-LSTM network structure, and the feature vector obtained after the spectrogram is input into the convolution neural network is input into the cyclic neural network after slicing operation is carried out along a time axis.
The language identification method based on the deep convolutional recurrent neural network provided by the invention may further have the technical feature that, when the audio sequence to be tested is segmented in step S1, any segment shorter than 2 s is directly discarded.
The language identification method based on the deep convolutional recurrent neural network provided by the invention may further have the technical feature that, in step S1, while the audio sequence to be tested is segmented, every segmented audio segment is uniformly encoded in the uncompressed lossless WAVE format.
The language identification method based on the deep convolutional recurrent neural network provided by the invention may further have the technical feature that step S4 comprises the following sub-steps: step S4-1, averaging all the audio category judgment probabilities of each audio sequence to be tested to obtain its average judgment probability; and step S4-2, taking the category with the maximum probability in the average judgment probability as the language category of the corresponding audio sequence to be tested.
The language identification method based on the deep convolutional recurrent neural network provided by the invention may further have the technical feature that the convolutional recurrent neural network model is obtained through the following model training steps: step T1, constructing an initial convolutional recurrent neural network model whose model parameters are randomly initialized; step T2, generating spectrograms from the audio sequences in the training set through steps S1 to S2, feeding them in sequence into the initial convolutional recurrent neural network model, and performing one iteration; step T3, calculating the loss error from the output of the last layer of the initial convolutional recurrent neural network model; step T4, backpropagating the loss error to update the model parameters; and step T5, repeating steps T2 to T4 until a training completion condition is reached, giving the trained convolutional recurrent neural network model.
The language identification method based on the deep convolutional recurrent neural network provided by the invention may further have the technical feature that the audio sequences in the training set comprise ordinary audio data and mixed audio data blended with one or more kinds of randomly generated white noise, crackle noise, café noise, Gaussian noise, and impulse noise.
The invention also provides a language identification device based on a convolutional recurrent neural network model, for identifying an audio sequence to be tested so as to identify the corresponding language, characterized by comprising: a preprocessing part, which divides each audio sequence to be tested into a number of audio segments 2 s in duration and converts the segments into corresponding spectrograms by short-time Fourier transform; and a language identification part, which feeds the spectrograms into a pre-trained convolutional recurrent neural network so as to identify the language category of each audio sequence to be tested, wherein the convolutional recurrent neural network model comprises a convolutional neural network with a VGG architecture and a recurrent neural network with a Bi-LSTM structure, the feature vectors obtained by passing the spectrogram through the convolutional neural network are sliced along the time axis before being fed into the recurrent neural network, and the language identification part identifies the language category through the following steps: a feature extraction step, in which the spectrograms are fed in sequence into the pre-trained convolutional recurrent neural network model so as to obtain the feature data of each audio segment; and a category identification step, in which the language category of each audio sequence to be tested is obtained from the audio category judgment probabilities of all of its audio segments.
Action and Effect of the invention
According to the language identification method based on the deep convolutional recurrent neural network, the audio sequence to be tested is divided into audio segments, the segments are converted into spectrograms by short-time Fourier transform, and the spectrograms are identified by a convolutional recurrent neural network model that combines a convolutional neural network with a recurrent neural network. Features are thus extracted from the spectrogram by the neural network model rather than taken directly from acoustic features; in other words, an acoustic identification task is accomplished by an image identification method. This avoids two difficulties of conventional sound identification: information that discriminates between languages is hard to obtain directly from low-level acoustic features, while high-level acoustic features demand additional expert knowledge of the audio field. Furthermore, the invention adopts a Bi-LSTM recurrent neural network to capture temporal features: it captures bidirectional temporal information, automatically forgets unimportant nodes and memorizes important ones when processing long-duration and sequential information, and to some extent overcomes the gradient explosion and vanishing gradient problems of conventional recurrent neural networks. The method and device are therefore suitable for language identification tasks in a range of everyday, noisy scenarios, and offer high language identification accuracy.
Drawings
FIG. 1 is a flowchart of a language identification method based on a deep convolutional recurrent neural network according to an embodiment of the present invention;
FIG. 2 is a flow chart of a model training process in an embodiment of the invention; and
fig. 3 is a schematic structural diagram of a convolutional recurrent neural network model in an embodiment of the present invention.
Detailed Description
To make the technical means, creative features, objectives, and effects of the invention easy to understand, the language identification method based on the deep convolutional recurrent neural network is described in detail below with reference to embodiments and the accompanying drawings.
< example >
The embodiment takes a Uyghur data set, a Chinese data set, and an English data set as examples; the convolutional recurrent neural network model is trained and tested on these data sets.
The Uyghur data set is the THUYG-20 Uyghur speech data set. It contains about 20 hours of training data and 1 hour of test data, recorded in an office environment with the external microphone of a Lenovo desktop computer. The 348 speakers are university students, Uyghur speakers from more than 30 regions of Xinjiang. The recorded content covers everyday material such as novels, newspapers, and various books. The sampling format is 16 kHz, 16-bit, mono, WAV. The Chinese and English data sets come from YouTube news channels: the English channel is CNN and the Chinese channel is VOA China. These recordings are of very high quality and hundreds of hours are available online, but to keep the data balanced, training uses an amount of data comparable to the Uyghur data set.
These news programmes frequently feature guests or remote correspondents, giving a good mixture of different speakers. Moreover, news programmes contain noise from real-world situations: music overlays, non-speech audio from video clips, and transitions between utterances. Meanwhile, to improve the robustness of the model in this embodiment, some audio signals are additionally mixed with randomly generated white noise, crackle noise, or café noise. These noises simulate scenes that may occur in real life; they are clearly audible yet leave the language still recognizable. Adding noise also yields more audio data and so expands the data set, making the data obtained from the audio sequence to be tested richer and allowing more training epochs. In other embodiments, these noises need not be added, or other prior-art data expansion methods (such as Gaussian noise or impulse noise) may be adopted.
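A minimal Python sketch of one such mixing step for white noise follows; the 20 dB signal-to-noise ratio and the function name are assumptions, since the embodiment does not state mixing levels:

```python
import numpy as np

def add_white_noise(audio: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Mix randomly generated white noise into a clip at a given
    signal-to-noise ratio (the 20 dB default is an assumption)."""
    rng = np.random.default_rng()
    signal_power = float(np.mean(audio ** 2))
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise
```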
Fig. 1 is a flowchart of a language identification method based on a deep convolutional recurrent neural network in an embodiment of the present invention.
As shown in fig. 1, the language identification method based on the deep convolutional recurrent neural network includes the following steps:
In step S1, the audio sequence to be tested (i.e. the time series formed by the frames of the audio) is divided into a number of audio segments 2 s in duration.
In step S1 of the present embodiment, all audio files are encoded in the uncompressed lossless WAVE format when the audio segments are divided, which makes subsequent operations on this format convenient without degrading signal quality. Any segment remaining after division that is shorter than 2 s is directly discarded.
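A minimal sketch of this segmentation, assuming the audio has already been decoded from its WAVE file into an array of PCM samples:

```python
import numpy as np

def split_into_segments(audio: np.ndarray, sr: int = 16000,
                        seg_seconds: float = 2.0) -> list:
    """Step S1: cut the audio sequence into consecutive 2 s segments.
    A trailing remainder shorter than 2 s is discarded, as the text
    specifies. Assumes `audio` holds PCM samples at sampling rate `sr`."""
    seg_len = int(seg_seconds * sr)
    n_full = len(audio) // seg_len
    return [audio[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]
```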
In step S2, short-time Fourier transform (STFT) is applied to each audio segment produced in step S1 in turn to obtain the corresponding spectrogram.
In step S2 of this embodiment, the audio segment is converted into a spectrogram by the STFT for model training or audio sequence identification, using a Hanning window. Since conversational speech rarely exceeds 3 kHz, only frequencies up to 5 kHz are included in the spectrogram. The time axis (x-axis) is rendered at 250 pixels per second. The final spectrogram has size 500 × 129 × 1.
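A sketch of this transform using scipy; the FFT length of 256 (which gives 129 one-sided frequency bins) and the 64-sample hop (250 frames per second at 16 kHz) are assumptions chosen to reproduce the 500 × 129 × 1 size quoted above:

```python
import numpy as np
from scipy.signal import stft

def segment_to_spectrogram(segment: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Step S2: turn one 2 s segment into a 500 x 129 x 1 spectrogram.
    FFT length and hop are assumptions matching the sizes in the text."""
    _, _, Z = stft(segment, fs=sr, window="hann", nperseg=256,
                   noverlap=256 - 64)
    spec = np.log1p(np.abs(Z)).T        # (frames, 129), log-magnitude
    spec = spec[:500, :]                # keep exactly 500 time pixels
    return spec[:, :, np.newaxis]       # add channel axis: (500, 129, 1)
```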
In step S3, the spectrograms are fed in sequence into the pre-trained convolutional recurrent neural network model so as to obtain the audio category judgment probability of each audio segment.
In this embodiment, the audio category judgment probability indicates the probability that the corresponding audio segment belongs to each language category. The convolutional recurrent neural network model is divided into two modules: a VGG convolutional neural network and a Bi-LSTM recurrent neural network. The VGG convolutional neural network extracts the convolutional features of the spectrogram, and the Bi-LSTM recurrent neural network captures the temporal features of the sequence. In step S3, the spectrogram passes through the convolutional neural network to yield feature vectors, which are stacked and sliced; the recurrent neural network then captures the temporal features of the audio sequence, and finally softmax classification gives the language category probability (i.e. the audio category judgment probability) of each audio segment.
In step S4, the language category of each audio sequence to be tested is obtained from the audio category judgment probabilities of all of its audio segments. In this embodiment, step S4 specifically comprises the following sub-steps:
step S4-1, averaging all the corresponding audio category judgment probabilities of each audio sequence to be tested to obtain an average judgment probability of each audio sequence to be tested;
and step S4-2, taking the category with the maximum probability in the average judgment probability as the language category of the corresponding audio sequence to be tested.
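These two sub-steps amount to averaging the per-segment softmax outputs and taking an argmax. A minimal sketch, in which the language ordering is an assumption:

```python
import numpy as np

LANGUAGES = ["English", "Mandarin", "Uyghur"]    # ordering is an assumption

def fuse_segment_probabilities(segment_probs: np.ndarray) -> str:
    """Steps S4-1 and S4-2 for one audio sequence to be tested;
    `segment_probs` holds one softmax row per audio segment,
    shape (n_segments, n_languages)."""
    avg = segment_probs.mean(axis=0)         # S4-1: average judgment probability
    return LANGUAGES[int(avg.argmax())]      # S4-2: category with maximum probability
```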
FIG. 2 is a flow chart of a model training process in an embodiment of the invention.
As shown in fig. 2, the training process of the convolutional recurrent neural network model includes the following steps:
and step T1, constructing an initial convolution cyclic neural network model, wherein model parameters contained in the initial convolution cyclic neural network model are randomly set.
In the embodiment, the initial convolutional recurrent neural network model is built with the existing deep learning framework Keras. In the (initial) convolutional recurrent neural network model of this embodiment, the VGG-architecture convolutional neural network comprises five convolutional layers; each uses a ReLU activation function and is followed by a batch normalization operation and a 2 × 2 max pooling operation with stride 2. The Bi-LSTM recurrent neural network consists of two separate LSTMs, each with 256 output units; the outputs of the two LSTMs are concatenated into a 512-dimensional vector and fed to a fully connected softmax layer for classification.
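A minimal Keras sketch of this architecture, matching the layer sizes given below in the description of fig. 3; choices the text leaves open (padding mode, ordering of activation and normalization) are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_crnn(n_languages: int = 3) -> keras.Model:
    """CRNN sketch under the stated assumptions: five VGG-style blocks
    (ReLU convolution, batch normalization, 2x2/stride-2 max pooling),
    a slice along the time axis, and a Bi-LSTM with 256 units per
    direction feeding a softmax layer."""
    inp = keras.Input(shape=(500, 129, 1))
    x = inp
    for filters, k in zip((16, 32, 64, 128, 256), (7, 5, 3, 3, 3)):
        x = layers.Conv2D(filters, k, activation="relu")(x)   # padding 0 ('valid')
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    # Output here is (batch, 13, 1, 256): thirteen 1 x 1 x 256 time slices.
    x = layers.Reshape((13, 256))(x)                  # slicing along the time axis
    x = layers.Bidirectional(layers.LSTM(256))(x)     # concatenated 512-dim vector
    out = layers.Dense(n_languages, activation="softmax")(x)
    return keras.Model(inp, out)
```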
In step T2, the audio sequences in the training set are turned into spectrograms through steps S1 to S2, and the spectrograms are fed in sequence into the initial convolutional recurrent neural network model for one iteration.
In this embodiment, the THUYG-20 Uyghur data set and the YouTube news audio data set are used as training material, and one or more kinds of randomly generated white noise, crackle noise, and café noise are added to the audio signals for data enhancement. Processing the audio data of the data sets as in steps S1 to S2 yields 200936 spectrograms each for Mandarin, English, and Uyghur, and the pictures are divided into a training set, a validation set, and a test set in the proportions 70%, 20%, and 10%.
The images in the training set enter the network model in batches; each training batch contains 128 images, and 120,000 training iterations are performed.
In step T3, after the iteration, the loss error is calculated from the output of the last layer of the initial convolutional recurrent neural network model.
At step T4, the loss error is propagated back to update the model parameters.
In step T3 of this embodiment, after each iteration (i.e. after the training set images have passed through the model), the cross-entropy loss function is used to calculate the loss error between the output result and the true label. The calculated loss error is then backpropagated in step T4 to update the model parameters.
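A corresponding training sketch with Keras, reusing the build_crnn function from the earlier sketch; the optimizer and the placeholder arrays are assumptions, and a single epoch stands in for the 120,000 iterations described above:

```python
import numpy as np

# Hypothetical stand-ins for the real spectrogram data and one-hot labels.
x_train = np.random.rand(256, 500, 129, 1).astype("float32")
y_train = np.eye(3, dtype="float32")[np.random.randint(0, 3, size=256)]

model = build_crnn(n_languages=3)
model.compile(optimizer="adam",                  # optimizer is not named in the text
              loss="categorical_crossentropy",   # cross-entropy loss of step T3
              metrics=["accuracy"])
# Batch size 128 as in the embodiment; backpropagation (step T4) is
# performed automatically by fit().
model.fit(x_train, y_train, batch_size=128, epochs=1)
```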
In step T5, steps T2 to T4 are repeated until the training completion condition is reached, giving the trained convolutional recurrent neural network model.
In this embodiment, the training completion condition is the same as for a conventional convolutional neural network model: training is complete once the model parameters of every layer have converged.
The trained convolutional recurrent neural network model is obtained through the above iterative training, with loss calculation and backpropagation in every iteration. In this embodiment, the trained model is used to execute the language identification method in everyday scenarios.
Fig. 3 is a schematic structural diagram of a convolutional recurrent neural network model in an embodiment of the present invention.
As shown in fig. 3, the convolutional recurrent neural network model of the invention comprises an input layer and a convolutional neural network serving as a feature extractor; the extracted features are stacked and sliced, each slice is fed into the Bi-LSTM recurrent neural network at one time step, and a fully connected layer is placed at the end for classification. The convolutional neural network comprises five convolutional layers, each followed by a max pooling operation. The Bi-LSTM recurrent neural network consists of two separate LSTM networks, each with 256 output units.
Specifically, as shown in fig. 3, the convolutional recurrent neural network model specifically includes the following structure:
(1) an input layer I, which receives each preprocessed spectrogram of size 500 × 129 × 1;
(2) the VGG convolutional neural network, comprising five convolutional layers, each followed by a ReLU activation function, a batch normalization operation, and a 2 × 2 max pooling operation with stride 2:
convolutional layer FC1: kernel size 7 × 7, 16 kernels, stride 1, padding 0; output after convolution 494 × 123 × 16, after max pooling 247 × 61 × 16;
convolutional layer FC2: kernel size 5 × 5, 32 kernels, stride 1, padding 0; output after convolution 243 × 57 × 32, after max pooling 121 × 28 × 32;
convolutional layer FC3: kernel size 3 × 3, 64 kernels, stride 1, padding 0; output after convolution 119 × 26 × 64, after max pooling 59 × 13 × 64;
convolutional layer FC4: kernel size 3 × 3, 128 kernels, stride 1, padding 0; output after convolution 57 × 11 × 128, after max pooling 28 × 5 × 128;
convolutional layer FC5: kernel size 3 × 3, 256 kernels, stride 1, padding 0; output after convolution 26 × 3 × 256, after max pooling 13 × 1 × 256;
(3) a slicing operation: the final output features of the convolutional neural network are sliced along the x-axis into 13 features of size 1 × 1 × 256, each of which serves as the Bi-LSTM input at one time step;
(4) the Bi-LSTM, composed of two separate LSTMs that capture temporal information in both directions; each LSTM has 256 output units, and the outputs of the two LSTM networks are concatenated into a 512-dimensional vector and fed to a fully connected softmax classification layer.
In this embodiment, the trained convolutional recurrent neural network model is tested with the 10% of the data set reserved as audio to be tested.
The specific procedure is as follows: the test set is fed in sequence into the trained convolutional recurrent neural network model, steps S3 to S4 are completed to obtain the language categories, and the identified language categories are compared with the true language category of each audio sequence in the test set to obtain the detection accuracy.
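A sketch of this test loop, reusing the model and the fuse_segment_probabilities helper from the earlier sketches; the test_set structure shown is hypothetical:

```python
import numpy as np

# Hypothetical test set: each entry pairs the stacked spectrograms of
# one audio sequence, shape (n_segments, 500, 129, 1), with its label.
test_set = [(np.random.rand(4, 500, 129, 1).astype("float32"), "English")]

correct = 0
for spectrograms, label in test_set:
    probs = model.predict(spectrograms, verbose=0)    # step S3
    if fuse_segment_probabilities(probs) == label:    # step S4
        correct += 1
accuracy = correct / len(test_set)
print(f"detection accuracy: {accuracy:.2%}")
```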
In this embodiment, the language identification precision (i.e. the detection accuracy) of the trained convolutional recurrent neural network model on the test set is 93.53%. The per-language accuracy of the model for Mandarin, English, and Uyghur is summarized in Table 1 below.
TABLE 1 Accuracy of the model of the inventive method over the different categories

              precision   recall   F1-score   support
English          0.90      0.93       0.91      40192
Mandarin         0.95      0.91       0.93      40192
Uygur            0.97      0.96       0.96      40192
average          0.94      0.93       0.93     120576
In Table 1, English, Mandarin, and Uygur denote the different language categories, and average denotes the mean over the categories. Precision, recall, and F1-score denote the precision, the recall rate, and the F1 score, a standard measure for classification problems; support indicates that 40192 spectrograms were used for each of English, Mandarin, and Uygur. The recognition accuracy of the model exceeds 90% for all three language categories, and for Uyghur it reaches 97%, so the model is well suited to the detection of minority languages.
The above test procedure shows that the language identification method based on the convolutional recurrent neural network model of this embodiment can achieve a high accuracy on the THUYG-20 data set and the YouTube news data set.
In this embodiment, the language identification method based on the deep convolutional recurrent neural network is executed by a language identification device: a computer fitted with an NVIDIA GTX 1080 graphics card (for GPU acceleration) that stores a computer program corresponding to the method. The language identification device comprises a preprocessing part, a language identification part, and a control part that governs the other parts.
The function executed by the preprocessing part corresponds to steps S1 to S2 of the language identification method: the input audio sequence to be tested is divided into a number of audio segments 2 s in duration and converted into corresponding spectrograms by short-time Fourier transform.
The language identification part stores the packaged, trained convolutional recurrent neural network model; its function corresponds to steps S3 to S4 of the language identification method, namely feeding the spectrograms into the pre-trained convolutional recurrent neural network model so as to obtain the audio category judgment probability of each audio segment.
In this embodiment, after the user inputs the audio sequences to be tested and confirms identification, the preprocessing part and the language identification part process the audio sequences in turn, and each audio sequence to be tested is output together with its corresponding language category.
Effects of the Embodiment
According to the language identification method based on the deep convolutional recurrent neural network provided by this embodiment, the audio sequence to be tested is divided into audio segments, the segments are converted into spectrograms by short-time Fourier transform, and the spectrograms are identified by a convolutional recurrent neural network model that combines a convolutional neural network with a recurrent neural network. Features are thus extracted from the spectrogram by the neural network model rather than taken directly from acoustic features; in other words, an acoustic identification task is accomplished by an image identification method. This avoids two difficulties of conventional sound identification: information that discriminates between languages is hard to obtain directly from low-level acoustic features, while high-level acoustic features demand additional expert knowledge of the audio field. Furthermore, the embodiment adopts a Bi-LSTM recurrent neural network to capture temporal features: it captures bidirectional temporal information, automatically forgets unimportant nodes and memorizes important ones when processing long-duration and sequential information, and to some extent overcomes the gradient explosion and vanishing gradient problems of conventional recurrent neural networks. The method and device are therefore suitable for language identification tasks in a range of everyday, noisy scenarios, and offer high language identification accuracy.
In addition, the convolutional recurrent neural network model adopted in this embodiment has a simple structure and needs no model mixing, multi-task training, metric learning, or similar techniques. Compared with existing high-accuracy models, the model of this embodiment is therefore quick and convenient to build, requires no expert knowledge of the audio field, can be trained on a training set of moderate size, completes training quickly, and consumes fewer computing resources.
The above-described embodiments are merely illustrative of specific embodiments of the present invention, and the present invention is not limited to the description of the above-described embodiments.
For example, in the embodiment the languages on which the convolutional recurrent neural network model is trained are Mandarin, English, and Uyghur. In other embodiments, other language data sets may be used; the number of language categories and the language labels are set at training time, so other languages can be trained and recognized accordingly.
For example, in the embodiment the convolutional neural network uses a network model of the VGG architecture, but other convolutional neural network models such as Inception-v3 may also be adopted in the invention. Such a model has a stronger feature extraction capability, but it is deeper, carries about six times the parameters of the original VGG-architecture network, has a more complex structure, and consumes more computing resources, while improving the accuracy only somewhat over this embodiment. With practical application in mind, the VGG-architecture convolutional neural network is chosen as the feature extractor: the model is simple, yet high accuracy is achieved.

Claims (7)

1. A language identification method based on a deep convolutional recurrent neural network, for identifying an audio sequence to be tested so as to identify the corresponding language, characterized by comprising the following steps:
step S1, dividing the audio sequence to be tested into a number of audio segments 2 s in duration;
step S2, converting each audio segment in turn into a corresponding spectrogram by short-time Fourier transform;
step S3, feeding the spectrograms in sequence into a pre-trained convolutional recurrent neural network model so as to obtain the audio category judgment probability of each audio segment;
step S4, obtaining the language category of each audio sequence to be tested from the audio category judgment probabilities of all of its audio segments,
wherein the convolutional recurrent neural network model comprises a convolutional neural network with a VGG architecture and a recurrent neural network with a Bi-LSTM structure, and the feature vectors obtained by passing the spectrogram through the convolutional neural network are sliced along the time axis before being fed into the recurrent neural network.
2. The language identification method based on the deep convolutional recurrent neural network of claim 1, wherein:
when the audio sequence to be tested is segmented in step S1, any segment shorter than 2 s is directly discarded.
3. The language identification method based on the deep convolutional recurrent neural network of claim 1, wherein:
in step S1, while the audio sequence to be tested is segmented, every segmented audio segment is uniformly encoded in the uncompressed lossless WAVE format.
4. The language identification method based on the deep convolutional recurrent neural network of claim 1, wherein:
wherein the step S4 includes the following sub-steps:
step S4-1, averaging all the corresponding audio category judgment probabilities of each audio sequence to be tested to obtain an average judgment probability of each audio sequence to be tested;
and step S4-2, taking the category with the maximum probability in the average judgment probability as the language category of the corresponding audio sequence to be tested.
5. The language identification method based on the deep convolutional recurrent neural network of claim 1, wherein:
the convolutional recurrent neural network model is obtained through the following model training steps:
step T1, constructing an initial convolutional recurrent neural network model whose model parameters are randomly initialized;
step T2, generating spectrograms from the audio sequences in the training set through steps S1 to S2, feeding the spectrograms in sequence into the initial convolutional recurrent neural network model, and performing one iteration;
step T3, calculating the loss error from the output of the last layer of the initial convolutional recurrent neural network model;
step T4, backpropagating the loss error to update the model parameters;
and step T5, repeating steps T2 to T4 until a training completion condition is reached, giving the trained convolutional recurrent neural network model.
6. The language identification method based on the deep convolutional recurrent neural network of claim 5, wherein:
wherein the audio sequences in the training set comprise ordinary audio data and mixed audio data blended with one or more kinds of randomly generated white noise, crackle noise, café noise, Gaussian noise, and impulse noise.
7. A language identification device based on a convolutional recurrent neural network model, for identifying an audio sequence to be tested so as to identify the corresponding language, characterized by comprising:
a preprocessing part for dividing the audio sequence to be tested into a number of audio segments 2 s in duration and converting the audio segments into corresponding spectrograms by short-time Fourier transform; and
a language identification part for feeding the spectrograms into a pre-stored convolutional recurrent neural network so as to identify the language category corresponding to each audio sequence,
wherein the convolutional recurrent neural network model comprises a convolutional neural network with a VGG architecture and a recurrent neural network with a Bi-LSTM structure, the feature vectors obtained by passing the spectrogram through the convolutional neural network are sliced along the time axis before being fed into the recurrent neural network,
and the language identification part completes the identification of the language category through the following steps:
a feature extraction step of feeding the spectrograms in sequence into the pre-trained convolutional recurrent neural network model so as to obtain the feature data of each audio segment;
and a category identification step of obtaining the language category of each audio sequence from the audio category judgment probabilities of all of its audio segments.
CN201911093837.4A (priority and filing date 2019-11-11): Language identification method and device based on deep convolutional recurrent neural network. Status: Pending (published as CN110782872A).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911093837.4A CN110782872A (en) 2019-11-11 2019-11-11 Language identification method and device based on deep convolutional recurrent neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911093837.4A CN110782872A (en) 2019-11-11 2019-11-11 Language identification method and device based on deep convolutional recurrent neural network

Publications (1)

Publication Number Publication Date
CN110782872A (en) 2020-02-11

Family

Family ID: 69391003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911093837.4A Pending CN110782872A (en) 2019-11-11 2019-11-11 Language identification method and device based on deep convolutional recurrent neural network

Country Status (1)

Country Link
CN (1) CN110782872A (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111326139A (en) * 2020-03-10 2020-06-23 科大讯飞股份有限公司 Language identification method, device, equipment and storage medium
CN111444381A (en) * 2020-03-24 2020-07-24 福州瑞芯微电子股份有限公司 Deep learning corpus-based classification method and storage device
CN111540346A (en) * 2020-05-13 2020-08-14 慧言科技(天津)有限公司 Far-field sound classification method and device
CN111613208A (en) * 2020-05-22 2020-09-01 云知声智能科技股份有限公司 Language identification method and equipment
CN111724766A (en) * 2020-06-29 2020-09-29 合肥讯飞数码科技有限公司 Language identification method, related equipment and readable storage medium
CN111782863A (en) * 2020-06-30 2020-10-16 腾讯音乐娱乐科技(深圳)有限公司 Audio segmentation method and device, storage medium and electronic equipment
CN111931690A (en) * 2020-08-28 2020-11-13 Oppo广东移动通信有限公司 Model training method, device, equipment and storage medium
CN112017630A (en) * 2020-08-19 2020-12-01 北京字节跳动网络技术有限公司 Language identification method and device, electronic equipment and storage medium
CN112199548A (en) * 2020-09-28 2021-01-08 华南理工大学 Music audio classification method based on convolution cyclic neural network
CN112270933A (en) * 2020-11-12 2021-01-26 北京猿力未来科技有限公司 Audio identification method and device
CN112420024A (en) * 2020-10-23 2021-02-26 四川大学 Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device
CN112489623A (en) * 2020-11-17 2021-03-12 携程计算机技术(上海)有限公司 Language identification model training method, language identification method and related equipment
CN112863482A (en) * 2020-12-31 2021-05-28 思必驰科技股份有限公司 Speech synthesis method and system with rhythm
CN112992119A (en) * 2021-01-14 2021-06-18 安徽大学 Deep neural network-based accent classification method and model thereof
CN112989108A (en) * 2021-02-24 2021-06-18 腾讯科技(深圳)有限公司 Language detection method and device based on artificial intelligence and electronic equipment
CN113282718A (en) * 2021-07-26 2021-08-20 北京快鱼电子股份公司 Language identification method and system based on self-adaptive center anchor
CN113362851A (en) * 2020-03-06 2021-09-07 上海其高电子科技有限公司 Traffic scene sound classification method and system based on deep learning
CN113539238A (en) * 2020-03-31 2021-10-22 中国科学院声学研究所 End-to-end language identification and classification method based on void convolutional neural network
CN113611285A (en) * 2021-09-03 2021-11-05 哈尔滨理工大学 Language identification method based on stacked bidirectional time sequence pooling
CN115798459A (en) * 2023-02-03 2023-03-14 北京探境科技有限公司 Audio processing method and device, storage medium and electronic equipment
CN116469413A (en) * 2023-04-03 2023-07-21 广州市迪士普音响科技有限公司 Compressed audio silence detection method and device based on artificial intelligence

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN109523993A (en) * 2018-11-02 2019-03-26 成都三零凯天通信实业有限公司 A kind of voice languages classification method merging deep neural network with GRU based on CNN
CN109545198A (en) * 2019-01-04 2019-03-29 北京先声智能科技有限公司 A kind of Oral English Practice mother tongue degree judgment method based on convolutional neural networks
CN110033756A (en) * 2019-04-15 2019-07-19 北京达佳互联信息技术有限公司 Language Identification, device, electronic equipment and storage medium
KR20190110939A (en) * 2018-03-21 2019-10-01 한국과학기술원 Environment sound recognition method based on convolutional neural networks, and system thereof
CN110415728A (en) * 2019-07-29 2019-11-05 内蒙古工业大学 A kind of method and apparatus identifying emotional speech

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190110939A (en) * 2018-03-21 2019-10-01 한국과학기술원 Environment sound recognition method based on convolutional neural networks, and system thereof
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition
CN109523993A (en) * 2018-11-02 2019-03-26 成都三零凯天通信实业有限公司 A kind of voice languages classification method merging deep neural network with GRU based on CNN
CN109545198A (en) * 2019-01-04 2019-03-29 北京先声智能科技有限公司 A kind of Oral English Practice mother tongue degree judgment method based on convolutional neural networks
CN110033756A (en) * 2019-04-15 2019-07-19 北京达佳互联信息技术有限公司 Language Identification, device, electronic equipment and storage medium
CN110415728A (en) * 2019-07-29 2019-11-05 内蒙古工业大学 A kind of method and apparatus identifying emotional speech

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
REENA DHAKAD ET AL.: "Devanagari digit recognition by using artificial neural network", 2017 International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS) *
HE Jingjing (贺菁菁): "Audio language identification based on deep belief networks", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362851A (en) * 2020-03-06 2021-09-07 上海其高电子科技有限公司 Traffic scene sound classification method and system based on deep learning
CN111326139B (en) * 2020-03-10 2024-02-13 科大讯飞股份有限公司 Language identification method, device, equipment and storage medium
CN111326139A (en) * 2020-03-10 2020-06-23 科大讯飞股份有限公司 Language identification method, device, equipment and storage medium
CN111444381A (en) * 2020-03-24 2020-07-24 福州瑞芯微电子股份有限公司 Deep learning corpus-based classification method and storage device
CN111444381B (en) * 2020-03-24 2022-09-30 瑞芯微电子股份有限公司 Deep learning corpus-based classification method and storage device
CN113539238B (en) * 2020-03-31 2023-12-08 中国科学院声学研究所 End-to-end language identification and classification method based on cavity convolutional neural network
CN113539238A (en) * 2020-03-31 2021-10-22 中国科学院声学研究所 End-to-end language identification and classification method based on void convolutional neural network
CN111540346A (en) * 2020-05-13 2020-08-14 慧言科技(天津)有限公司 Far-field sound classification method and device
CN111613208B (en) * 2020-05-22 2023-08-25 云知声智能科技股份有限公司 Language identification method and equipment
CN111613208A (en) * 2020-05-22 2020-09-01 云知声智能科技股份有限公司 Language identification method and equipment
CN111724766B (en) * 2020-06-29 2024-01-05 合肥讯飞数码科技有限公司 Language identification method, related equipment and readable storage medium
CN111724766A (en) * 2020-06-29 2020-09-29 合肥讯飞数码科技有限公司 Language identification method, related equipment and readable storage medium
CN111782863A (en) * 2020-06-30 2020-10-16 腾讯音乐娱乐科技(深圳)有限公司 Audio segmentation method and device, storage medium and electronic equipment
CN112017630B (en) * 2020-08-19 2022-04-01 北京字节跳动网络技术有限公司 Language identification method and device, electronic equipment and storage medium
CN112017630A (en) * 2020-08-19 2020-12-01 北京字节跳动网络技术有限公司 Language identification method and device, electronic equipment and storage medium
CN111931690A (en) * 2020-08-28 2020-11-13 Oppo广东移动通信有限公司 Model training method, device, equipment and storage medium
CN112199548A (en) * 2020-09-28 2021-01-08 华南理工大学 Music audio classification method based on convolution cyclic neural network
CN112420024B (en) * 2020-10-23 2022-09-09 四川大学 Full-end-to-end Chinese and English mixed empty pipe voice recognition method and device
CN112420024A (en) * 2020-10-23 2021-02-26 四川大学 Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device
CN112270933A (en) * 2020-11-12 2021-01-26 北京猿力未来科技有限公司 Audio identification method and device
CN112270933B (en) * 2020-11-12 2024-03-12 北京猿力未来科技有限公司 Audio identification method and device
CN112489623A (en) * 2020-11-17 2021-03-12 携程计算机技术(上海)有限公司 Language identification model training method, language identification method and related equipment
CN112863482A (en) * 2020-12-31 2021-05-28 思必驰科技股份有限公司 Speech synthesis method and system with rhythm
CN112992119B (en) * 2021-01-14 2024-05-03 安徽大学 Accent classification method based on deep neural network and model thereof
CN112992119A (en) * 2021-01-14 2021-06-18 安徽大学 Deep neural network-based accent classification method and model thereof
CN112989108A (en) * 2021-02-24 2021-06-18 腾讯科技(深圳)有限公司 Language detection method and device based on artificial intelligence and electronic equipment
CN113282718A (en) * 2021-07-26 2021-08-20 北京快鱼电子股份公司 Language identification method and system based on self-adaptive center anchor
CN113611285A (en) * 2021-09-03 2021-11-05 哈尔滨理工大学 Language identification method based on stacked bidirectional time sequence pooling
CN113611285B (en) * 2021-09-03 2023-11-24 哈尔滨理工大学 Language identification method based on stacked bidirectional time sequence pooling
CN115798459A (en) * 2023-02-03 2023-03-14 北京探境科技有限公司 Audio processing method and device, storage medium and electronic equipment
CN116469413B (en) * 2023-04-03 2023-12-01 广州市迪士普音响科技有限公司 Compressed audio silence detection method and device based on artificial intelligence
CN116469413A (en) * 2023-04-03 2023-07-21 广州市迪士普音响科技有限公司 Compressed audio silence detection method and device based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
Žmolíková et al. Speakerbeam: Speaker aware neural network for target speaker extraction in speech mixtures
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
Kelly et al. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
Rammo et al. Detecting the speaker language using CNN deep learning algorithm
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
CN111312292A (en) Emotion recognition method and device based on voice, electronic equipment and storage medium
CN111667834B (en) Hearing-aid equipment and hearing-aid method
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
Alashban et al. Speaker gender classification in mono-language and cross-language using BLSTM network
Wang et al. A research on HMM based speech recognition in spoken English
Ling An acoustic model for English speech recognition based on deep learning
CN114125506B (en) Voice auditing method and device
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN114708857A (en) Speech recognition model training method, speech recognition method and corresponding device
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
CN115881156A (en) Multi-scale-based multi-modal time domain voice separation method
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
Islam et al. Bangla dataset and MMFCC in text-dependent speaker identification.
CN113763992A (en) Voice evaluation method and device, computer equipment and storage medium
Jadhav et al. An Emotion Recognition from Speech using LSTM
Pan et al. Application of hidden Markov models in speech command recognition
Kaur et al. Speech based retrieval system for Punjabi language
CN113035247B (en) Audio text alignment method and device, electronic equipment and storage medium
CN113990288B (en) Method for automatically generating and deploying voice synthesis model by voice customer service

Legal Events

Code  Description
PB01  Publication
SE01  Entry into force of request for substantive examination
RJ01  Rejection of invention patent application after publication (application publication date: 20200211)