CN110782872A - Language identification method and device based on deep convolutional recurrent neural network

Info

Publication number
CN110782872A
CN110782872A (application number CN201911093837.4A)
Authority
CN
China
Prior art keywords: audio, neural network, language, sequence, recurrent neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911093837.4A
Other languages
Chinese (zh)
Inventor
程颖 (Cheng Ying)
杜姗姗 (Du Shanshan)
冯瑞 (Feng Rui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201911093837.4A
Publication of CN110782872A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 ... using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 ... using orthogonal transformation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a language identification method and device based on a deep convolutional recurrent neural network, used to identify the language of an audio sequence to be tested. The method achieves high-accuracy language identification without requiring expert knowledge of the audio field, and is characterized by comprising the following steps: step S1, dividing the audio sequence to be tested into a number of audio segments 2 s in duration; step S2, converting each audio segment in turn into a corresponding spectrogram by short-time Fourier transform; step S3, feeding the spectrograms in sequence into a pre-trained convolutional recurrent neural network model so as to obtain the audio category judgment probability of each audio segment; step S4, obtaining the language category of each piece of audio data from the audio category judgment probabilities of all of its audio segments.

Description

Language identification method and device based on deep convolutional recurrent neural network
Technical Field
The invention relates to the field of audio recognition, in particular to language identification in everyday scenarios, and more particularly to a language identification method and device based on a deep convolutional recurrent neural network.
Background
In daily life, many intelligent voice assistants require the user to manually specify the system's input language before they can work properly, whereas automatic language identification technology can infer the language the user is speaking. As a preprocessing stage for many speech processing tasks, language identification technology is widely applied in fields such as multilingual speech recognition, cross-language communication, and machine translation.
Most traditional language identification techniques statistically model low-level acoustic features. Commonly used low-level features such as Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) coefficients transform the audio sequence from the time domain to the frequency domain by fast Fourier transform (FFT) and then apply a filter bank that simulates the human ear's perception of hearing, extracting a coefficient vector of fixed dimensionality from each frame as the low-level acoustic feature. Modelling typically uses a Gaussian mixture model (GMM) and its refinements; the GMM-UBM can fit the distribution of real data well, but the resulting mean supervectors carry a large amount of redundant information and are difficult to classify, so the identification accuracy is severely limited.
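For context, such low-level features can be extracted with standard tools. The following is a minimal Python sketch assuming the librosa library; the file name and coefficient count are purely illustrative and are not specified by the patent:

```python
import librosa

# Load a speech file (resampled to 16 kHz) and extract 13 MFCCs per
# frame; "speech.wav" and n_mfcc=13 are illustrative assumptions.
y, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
```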
However, as deep learning has developed in the field of signal processing, language identification methods based on deep learning have multiplied, but these research efforts mainly focus on processing the input audio sequence with recurrent neural networks of various forms. As a result, only the temporal characteristics of the audio are exploited while its spatial characteristics are ignored, so the expected performance is difficult to reach.
Disclosure of Invention
To solve the above problems, the invention provides a language identification method that approaches language identification through the image domain; the method achieves high-accuracy language identification without expert knowledge of the audio field. The invention adopts the following technical scheme:
the invention provides a language identification method based on a deep convolution cyclic neural network, which is used for identifying an audio sequence to be detected so as to identify a corresponding language and is characterized by comprising the following steps: step S1, dividing the audio sequence to be tested into a plurality of audio segments with the time length of 2S; step S2, sequentially carrying out short-time Fourier transform on each audio frequency segment to convert the audio frequency segment into a corresponding spectrogram; step S3, inputting the spectrogram into a pre-trained convolution cyclic neural network model in sequence so as to obtain the audio class judgment probability corresponding to each audio segment; and step S4, obtaining the language category of each corresponding audio sequence to be tested according to the audio category judgment probability of all corresponding audio segments of each audio sequence to be tested, wherein the convolution cyclic neural network model comprises a convolution neural network of a VGG framework and a cyclic neural network of a Bi-LSTM network structure, and the feature vector obtained after the spectrogram is input into the convolution neural network is input into the cyclic neural network after slicing operation is carried out along a time axis.
The language identification method based on the deep convolutional recurrent neural network provided by the invention may further have the technical feature that, when the audio sequence to be tested is segmented in step S1, any segment shorter than 2 s is directly discarded.
The language identification method based on the deep convolutional recurrent neural network provided by the invention may further have the technical feature that, in step S1, while the audio sequence to be tested is segmented, every segmented audio segment is uniformly encoded in the uncompressed lossless WAVE format.
The language identification method based on the deep convolutional recurrent neural network provided by the invention may further have the technical feature that step S4 comprises the following sub-steps: step S4-1, averaging all the audio category judgment probabilities of each audio sequence to be tested to obtain its average judgment probability; and step S4-2, taking the category with the maximum probability in the average judgment probability as the language category of the corresponding audio sequence to be tested.
The language identification method based on the deep convolutional recurrent neural network provided by the invention may further have the technical feature that the convolutional recurrent neural network model is obtained through the following model training steps: step T1, constructing an initial convolutional recurrent neural network model whose model parameters are randomly initialized; step T2, generating spectrograms from the audio sequences in the training set through steps S1 to S2, feeding them in sequence into the initial convolutional recurrent neural network model, and performing one iteration; step T3, calculating the loss error from the output of the last layer of the initial convolutional recurrent neural network model; step T4, backpropagating the loss error to update the model parameters; and step T5, repeating steps T2 to T4 until a training completion condition is reached, giving the trained convolutional recurrent neural network model.
The language identification method based on the deep convolutional recurrent neural network provided by the invention may further have the technical feature that the audio sequences in the training set comprise ordinary audio data and mixed audio data blended with one or more kinds of randomly generated white noise, crackle noise, café noise, Gaussian noise, and impulse noise.
The invention also provides a language identification device based on a convolutional recurrent neural network model, for identifying an audio sequence to be tested so as to identify the corresponding language, characterized by comprising: a preprocessing part, which divides each audio sequence to be tested into a number of audio segments 2 s in duration and converts the segments into corresponding spectrograms by short-time Fourier transform; and a language identification part, which feeds the spectrograms into a pre-trained convolutional recurrent neural network so as to identify the language category of each audio sequence to be tested, wherein the convolutional recurrent neural network model comprises a convolutional neural network with a VGG architecture and a recurrent neural network with a Bi-LSTM structure, the feature vectors obtained by passing the spectrogram through the convolutional neural network are sliced along the time axis before being fed into the recurrent neural network, and the language identification part identifies the language category through the following steps: a feature extraction step, in which the spectrograms are fed in sequence into the pre-trained convolutional recurrent neural network model so as to obtain the feature data of each audio segment; and a category identification step, in which the language category of each audio sequence to be tested is obtained from the audio category judgment probabilities of all of its audio segments.
Action and Effect of the invention
According to the language identification method based on the deep convolutional recurrent neural network, the audio sequence to be tested is divided into audio segments, the segments are converted into spectrograms by short-time Fourier transform, and the spectrograms are identified by a convolutional recurrent neural network model that combines a convolutional neural network with a recurrent neural network. Features are thus extracted from the spectrogram by the neural network model rather than taken directly from acoustic features; in other words, an acoustic identification task is accomplished by an image identification method. This avoids two difficulties of conventional sound identification: information that discriminates between languages is hard to obtain directly from low-level acoustic features, while high-level acoustic features demand additional expert knowledge of the audio field. Furthermore, the invention adopts a Bi-LSTM recurrent neural network to capture temporal features: it captures bidirectional temporal information, automatically forgets unimportant nodes and memorizes important ones when processing long-duration and sequential information, and to some extent overcomes the gradient explosion and vanishing gradient problems of conventional recurrent neural networks. The method and device are therefore suitable for language identification tasks in a range of everyday, noisy scenarios, and offer high language identification accuracy.
Drawings
FIG. 1 is a flowchart of a language identification method based on a deep convolutional recurrent neural network according to an embodiment of the present invention;
FIG. 2 is a flow chart of a model training process in an embodiment of the invention; and
fig. 3 is a schematic structural diagram of a convolutional recurrent neural network model in an embodiment of the present invention.
Detailed Description
To make the technical means, creative features, objectives, and effects of the invention easy to understand, the language identification method based on the deep convolutional recurrent neural network is described in detail below with reference to embodiments and the accompanying drawings.
< example >
The embodiment takes a Uyghur data set, a Chinese data set, and an English data set as examples; the convolutional recurrent neural network model is trained and tested on these data sets.
The Uyghur data set is the THUYG-20 Uyghur speech data set. It contains about 20 hours of training data and 1 hour of test data, recorded in an office environment with the external microphone of a Lenovo desktop computer. The 348 speakers are university students, Uyghur speakers from more than 30 regions of Xinjiang. The recorded content covers everyday material such as novels, newspapers, and various books. The sampling format is 16 kHz, 16-bit, mono, WAV. The Chinese and English data sets come from YouTube news channels: the English channel is CNN and the Chinese channel is VOA China. These recordings are of very high quality and hundreds of hours are available online, but to keep the data balanced, training uses an amount of data comparable to the Uyghur data set.
These news programmes frequently feature guests or remote correspondents, giving a good mixture of different speakers. Moreover, news programmes contain noise from real-world situations: music overlays, non-speech audio from video clips, and transitions between utterances. Meanwhile, to improve the robustness of the model in this embodiment, some audio signals are additionally mixed with randomly generated white noise, crackle noise, or café noise. These noises simulate scenes that may occur in real life; they are clearly audible yet leave the language still recognizable. Adding noise also yields more audio data and so expands the data set, making the data obtained from the audio sequence to be tested richer and allowing more training epochs. In other embodiments, these noises need not be added, or other prior-art data expansion methods (such as Gaussian noise or impulse noise) may be adopted.
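A minimal Python sketch of one such mixing step for white noise follows; the 20 dB signal-to-noise ratio and the function name are assumptions, since the embodiment does not state mixing levels:

```python
import numpy as np

def add_white_noise(audio: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Mix randomly generated white noise into a clip at a given
    signal-to-noise ratio (the 20 dB default is an assumption)."""
    rng = np.random.default_rng()
    signal_power = float(np.mean(audio ** 2))
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise
```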
Fig. 1 is a flowchart of a language identification method based on a deep convolutional recurrent neural network in an embodiment of the present invention.
As shown in fig. 1, the language identification method based on the deep convolutional recurrent neural network includes the following steps:
In step S1, the audio sequence to be tested (i.e. the time series formed by the frames of the audio) is divided into a number of audio segments 2 s in duration.
In step S1 of the present embodiment, all audio files are encoded in the uncompressed lossless WAVE format when the audio segments are divided, which makes subsequent operations on this format convenient without degrading signal quality. Any segment remaining after division that is shorter than 2 s is directly discarded.
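A minimal sketch of this segmentation, assuming the audio has already been decoded from its WAVE file into an array of PCM samples:

```python
import numpy as np

def split_into_segments(audio: np.ndarray, sr: int = 16000,
                        seg_seconds: float = 2.0) -> list:
    """Step S1: cut the audio sequence into consecutive 2 s segments.
    A trailing remainder shorter than 2 s is discarded, as the text
    specifies. Assumes `audio` holds PCM samples at sampling rate `sr`."""
    seg_len = int(seg_seconds * sr)
    n_full = len(audio) // seg_len
    return [audio[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]
```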
In step S2, short-time Fourier transform (STFT) is applied to each audio segment produced in step S1 in turn to obtain the corresponding spectrogram.
In step S2 of this embodiment, the audio segment is converted into a spectrogram by the STFT for model training or audio sequence identification, using a Hanning window. Since conversational speech rarely exceeds 3 kHz, only frequencies up to 5 kHz are included in the spectrogram. The time axis (x-axis) is rendered at 250 pixels per second. The final spectrogram has size 500 × 129 × 1.
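A sketch of this transform using scipy; the FFT length of 256 (which gives 129 one-sided frequency bins) and the 64-sample hop (250 frames per second at 16 kHz) are assumptions chosen to reproduce the 500 × 129 × 1 size quoted above:

```python
import numpy as np
from scipy.signal import stft

def segment_to_spectrogram(segment: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Step S2: turn one 2 s segment into a 500 x 129 x 1 spectrogram.
    FFT length and hop are assumptions matching the sizes in the text."""
    _, _, Z = stft(segment, fs=sr, window="hann", nperseg=256,
                   noverlap=256 - 64)
    spec = np.log1p(np.abs(Z)).T        # (frames, 129), log-magnitude
    spec = spec[:500, :]                # keep exactly 500 time pixels
    return spec[:, :, np.newaxis]       # add channel axis: (500, 129, 1)
```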
In step S3, the spectrograms are fed in sequence into the pre-trained convolutional recurrent neural network model so as to obtain the audio category judgment probability of each audio segment.
In this embodiment, the audio category judgment probability indicates the probability that the corresponding audio segment belongs to each language category. The convolutional recurrent neural network model is divided into two modules: a VGG convolutional neural network and a Bi-LSTM recurrent neural network. The VGG convolutional neural network extracts the convolutional features of the spectrogram, and the Bi-LSTM recurrent neural network captures the temporal features of the sequence. In step S3, the spectrogram passes through the convolutional neural network to yield feature vectors, which are stacked and sliced; the recurrent neural network then captures the temporal features of the audio sequence, and finally softmax classification gives the language category probability (i.e. the audio category judgment probability) of each audio segment.
In step S4, the language category of each audio sequence to be tested is obtained from the audio category judgment probabilities of all of its audio segments. In this embodiment, step S4 specifically comprises the following sub-steps:
step S4-1, averaging all the corresponding audio category judgment probabilities of each audio sequence to be tested to obtain an average judgment probability of each audio sequence to be tested;
and step S4-2, taking the category with the maximum probability in the average judgment probability as the language category of the corresponding audio sequence to be tested.
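These two sub-steps amount to averaging the per-segment softmax outputs and taking an argmax. A minimal sketch, in which the language ordering is an assumption:

```python
import numpy as np

LANGUAGES = ["English", "Mandarin", "Uyghur"]    # ordering is an assumption

def fuse_segment_probabilities(segment_probs: np.ndarray) -> str:
    """Steps S4-1 and S4-2 for one audio sequence to be tested;
    `segment_probs` holds one softmax row per audio segment,
    shape (n_segments, n_languages)."""
    avg = segment_probs.mean(axis=0)         # S4-1: average judgment probability
    return LANGUAGES[int(avg.argmax())]      # S4-2: category with maximum probability
```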
FIG. 2 is a flow chart of a model training process in an embodiment of the invention.
As shown in fig. 2, the training process of the convolutional recurrent neural network model includes the following steps:
and step T1, constructing an initial convolution cyclic neural network model, wherein model parameters contained in the initial convolution cyclic neural network model are randomly set.
In the embodiment, the initial convolutional recurrent neural network model is built with the existing deep learning framework Keras. In the (initial) convolutional recurrent neural network model of this embodiment, the VGG-architecture convolutional neural network comprises five convolutional layers; each uses a ReLU activation function and is followed by a batch normalization operation and a 2 × 2 max pooling operation with stride 2. The Bi-LSTM recurrent neural network consists of two separate LSTMs, each with 256 output units; the outputs of the two LSTMs are concatenated into a 512-dimensional vector and fed to a fully connected softmax layer for classification.
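A minimal Keras sketch of this architecture, matching the layer sizes given below in the description of fig. 3; choices the text leaves open (padding mode, ordering of activation and normalization) are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_crnn(n_languages: int = 3) -> keras.Model:
    """CRNN sketch under the stated assumptions: five VGG-style blocks
    (ReLU convolution, batch normalization, 2x2/stride-2 max pooling),
    a slice along the time axis, and a Bi-LSTM with 256 units per
    direction feeding a softmax layer."""
    inp = keras.Input(shape=(500, 129, 1))
    x = inp
    for filters, k in zip((16, 32, 64, 128, 256), (7, 5, 3, 3, 3)):
        x = layers.Conv2D(filters, k, activation="relu")(x)   # padding 0 ('valid')
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    # Output here is (batch, 13, 1, 256): thirteen 1 x 1 x 256 time slices.
    x = layers.Reshape((13, 256))(x)                  # slicing along the time axis
    x = layers.Bidirectional(layers.LSTM(256))(x)     # concatenated 512-dim vector
    out = layers.Dense(n_languages, activation="softmax")(x)
    return keras.Model(inp, out)
```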
In step T2, the audio sequences in the training set are turned into spectrograms through steps S1 to S2, and the spectrograms are fed in sequence into the initial convolutional recurrent neural network model for one iteration.
In this embodiment, the THUYG-20 Uyghur data set and the YouTube news audio data set are used as training material, and one or more kinds of randomly generated white noise, crackle noise, and café noise are added to the audio signals for data enhancement. Processing the audio data of the data sets as in steps S1 to S2 yields 200936 spectrograms each for Mandarin, English, and Uyghur, and the pictures are divided into a training set, a validation set, and a test set in the proportions 70%, 20%, and 10%.
The images in the training set enter the network model in batches; each training batch contains 128 images, and 120,000 training iterations are performed.
In step T3, after the iteration, the loss error is calculated from the output of the last layer of the initial convolutional recurrent neural network model.
At step T4, the loss error is propagated back to update the model parameters.
In step T3 of this embodiment, after each iteration (i.e. after the training set images have passed through the model), the cross-entropy loss function is used to calculate the loss error between the output result and the true label. The calculated loss error is then backpropagated in step T4 to update the model parameters.
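A corresponding training sketch with Keras, reusing the build_crnn function from the earlier sketch; the optimizer and the placeholder arrays are assumptions, and a single epoch stands in for the 120,000 iterations described above:

```python
import numpy as np

# Hypothetical stand-ins for the real spectrogram data and one-hot labels.
x_train = np.random.rand(256, 500, 129, 1).astype("float32")
y_train = np.eye(3, dtype="float32")[np.random.randint(0, 3, size=256)]

model = build_crnn(n_languages=3)
model.compile(optimizer="adam",                  # optimizer is not named in the text
              loss="categorical_crossentropy",   # cross-entropy loss of step T3
              metrics=["accuracy"])
# Batch size 128 as in the embodiment; backpropagation (step T4) is
# performed automatically by fit().
model.fit(x_train, y_train, batch_size=128, epochs=1)
```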
In step T5, steps T2 to T4 are repeated until the training completion condition is reached, giving the trained convolutional recurrent neural network model.
In this embodiment, the training completion condition is the same as for a conventional convolutional neural network model: training is complete once the model parameters of every layer have converged.
The trained convolutional recurrent neural network model is obtained through the above iterative training, with loss calculation and backpropagation in every iteration. In this embodiment, the trained model is used to execute the language identification method in everyday scenarios.
Fig. 3 is a schematic structural diagram of a convolutional recurrent neural network model in an embodiment of the present invention.
As shown in fig. 3, the convolutional recurrent neural network model of the invention comprises an input layer and a convolutional neural network serving as a feature extractor; the extracted features are stacked and sliced, each slice is fed into the Bi-LSTM recurrent neural network at one time step, and a fully connected layer is placed at the end for classification. The convolutional neural network comprises five convolutional layers, each followed by a max pooling operation. The Bi-LSTM recurrent neural network consists of two separate LSTM networks, each with 256 output units.
Specifically, as shown in fig. 3, the convolutional recurrent neural network model specifically includes the following structure:
(1) an input layer I, which receives each preprocessed spectrogram of size 500 × 129 × 1;
(2) the VGG convolutional neural network, comprising five convolutional layers, each followed by a ReLU activation function, a batch normalization operation, and a 2 × 2 max pooling operation with stride 2:
convolutional layer FC1: kernel size 7 × 7, 16 kernels, stride 1, padding 0; output after convolution 494 × 123 × 16, after max pooling 247 × 61 × 16;
convolutional layer FC2: kernel size 5 × 5, 32 kernels, stride 1, padding 0; output after convolution 243 × 57 × 32, after max pooling 121 × 28 × 32;
convolutional layer FC3: kernel size 3 × 3, 64 kernels, stride 1, padding 0; output after convolution 119 × 26 × 64, after max pooling 59 × 13 × 64;
convolutional layer FC4: kernel size 3 × 3, 128 kernels, stride 1, padding 0; output after convolution 57 × 11 × 128, after max pooling 28 × 5 × 128;
convolutional layer FC5: kernel size 3 × 3, 256 kernels, stride 1, padding 0; output after convolution 26 × 3 × 256, after max pooling 13 × 1 × 256;
(3) a slicing operation: the final output features of the convolutional neural network are sliced along the x-axis into 13 features of size 1 × 1 × 256, each of which serves as the Bi-LSTM input at one time step;
(4) the Bi-LSTM, composed of two separate LSTMs that capture temporal information in both directions; each LSTM has 256 output units, and the outputs of the two LSTM networks are concatenated into a 512-dimensional vector and fed to a fully connected softmax classification layer.
In this embodiment, the trained convolutional recurrent neural network model is tested with the 10% of the data set reserved as audio to be tested.
The specific procedure is as follows: the test set is fed in sequence into the trained convolutional recurrent neural network model, steps S3 to S4 are completed to obtain the language categories, and the identified language categories are compared with the true language category of each audio sequence in the test set to obtain the detection accuracy.
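A sketch of this test loop, reusing the model and the fuse_segment_probabilities helper from the earlier sketches; the test_set structure shown is hypothetical:

```python
import numpy as np

# Hypothetical test set: each entry pairs the stacked spectrograms of
# one audio sequence, shape (n_segments, 500, 129, 1), with its label.
test_set = [(np.random.rand(4, 500, 129, 1).astype("float32"), "English")]

correct = 0
for spectrograms, label in test_set:
    probs = model.predict(spectrograms, verbose=0)    # step S3
    if fuse_segment_probabilities(probs) == label:    # step S4
        correct += 1
accuracy = correct / len(test_set)
print(f"detection accuracy: {accuracy:.2%}")
```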
In this embodiment, the language identification precision (i.e. the detection accuracy) of the trained convolutional recurrent neural network model on the test set is 93.53%. The per-language accuracy of the model for Mandarin, English, and Uyghur is summarized in Table 1 below.
TABLE 1 Accuracy of the model of the inventive method over the different categories

              precision   recall   F1-score   support
English          0.90      0.93       0.91      40192
Mandarin         0.95      0.91       0.93      40192
Uygur            0.97      0.96       0.96      40192
average          0.94      0.93       0.93     120576
In Table 1, English, Mandarin, and Uygur denote the different language categories, and average denotes the mean over the categories. Precision, recall, and F1-score denote the precision, the recall rate, and the F1 score, a standard measure for classification problems; support indicates that 40192 spectrograms were used for each of English, Mandarin, and Uygur. The recognition accuracy of the model exceeds 90% for all three language categories, and for Uyghur it reaches 97%, so the model is well suited to the detection of minority languages.
The above test procedure shows that the language identification method based on the convolutional recurrent neural network model of this embodiment can achieve a high accuracy on the THUYG-20 data set and the YouTube news data set.
In this embodiment, the language identification method based on the deep convolutional recurrent neural network is executed by a language identification device: a computer fitted with an NVIDIA GTX 1080 graphics card (for GPU acceleration) that stores a computer program corresponding to the method. The language identification device comprises a preprocessing part, a language identification part, and a control part that governs the other parts.
The function executed by the preprocessing part corresponds to steps S1 to S2 of the language identification method: the input audio sequence to be tested is divided into a number of audio segments 2 s in duration and converted into corresponding spectrograms by short-time Fourier transform.
The language identification part stores the packaged, trained convolutional recurrent neural network model; its function corresponds to steps S3 to S4 of the language identification method, namely feeding the spectrograms into the pre-trained convolutional recurrent neural network model so as to obtain the audio category judgment probability of each audio segment.
In this embodiment, after the user inputs the audio sequences to be tested and confirms identification, the preprocessing part and the language identification part process the audio sequences in turn, and each audio sequence to be tested is output together with its corresponding language category.
Effects of the Embodiment
According to the language identification method based on the deep convolutional recurrent neural network provided by this embodiment, the audio sequence to be tested is divided into audio segments, the segments are converted into spectrograms by short-time Fourier transform, and the spectrograms are identified by a convolutional recurrent neural network model that combines a convolutional neural network with a recurrent neural network. Features are thus extracted from the spectrogram by the neural network model rather than taken directly from acoustic features; in other words, an acoustic identification task is accomplished by an image identification method. This avoids two difficulties of conventional sound identification: information that discriminates between languages is hard to obtain directly from low-level acoustic features, while high-level acoustic features demand additional expert knowledge of the audio field. Furthermore, the embodiment adopts a Bi-LSTM recurrent neural network to capture temporal features: it captures bidirectional temporal information, automatically forgets unimportant nodes and memorizes important ones when processing long-duration and sequential information, and to some extent overcomes the gradient explosion and vanishing gradient problems of conventional recurrent neural networks. The method and device are therefore suitable for language identification tasks in a range of everyday, noisy scenarios, and offer high language identification accuracy.
In addition, the convolutional recurrent neural network model adopted in this embodiment has a simple structure and needs no model mixing, multi-task training, metric learning, or similar techniques. Compared with existing high-accuracy models, the model of this embodiment is therefore quick and convenient to build, requires no expert knowledge of the audio field, can be trained on a training set of moderate size, completes training quickly, and consumes fewer computing resources.
The above-described embodiments are merely illustrative of specific embodiments of the present invention, and the present invention is not limited to the description of the above-described embodiments.
For example, in the embodiment the languages on which the convolutional recurrent neural network model is trained are Mandarin, English, and Uyghur. In other embodiments, other language data sets may be used; the number of language categories and the language labels are set at training time, so other languages can be trained and recognized accordingly.
For example, in the embodiment the convolutional neural network uses a network model of the VGG architecture, but other convolutional neural network models such as Inception-v3 may also be adopted in the invention. Such a model has a stronger feature extraction capability, but it is deeper, carries about six times the parameters of the original VGG-architecture network, has a more complex structure, and consumes more computing resources, while improving the accuracy only somewhat over this embodiment. With practical application in mind, the VGG-architecture convolutional neural network is chosen as the feature extractor: the model is simple, yet high accuracy is achieved.

Claims (7)

1. A language identification method based on a deep convolutional recurrent neural network, for identifying an audio sequence to be tested so as to identify the corresponding language, characterized by comprising the following steps:
step S1, dividing the audio sequence to be tested into a number of audio segments 2 s in duration;
step S2, converting each audio segment in turn into a corresponding spectrogram by short-time Fourier transform;
step S3, feeding the spectrograms in sequence into a pre-trained convolutional recurrent neural network model so as to obtain the audio category judgment probability of each audio segment;
step S4, obtaining the language category of each audio sequence to be tested from the audio category judgment probabilities of all of its audio segments,
wherein the convolutional recurrent neural network model comprises a convolutional neural network with a VGG architecture and a recurrent neural network with a Bi-LSTM structure, and the feature vectors obtained by passing the spectrogram through the convolutional neural network are sliced along the time axis before being fed into the recurrent neural network.
2. The language identification method based on the deep convolutional recurrent neural network of claim 1, wherein:
when the audio sequence to be tested is segmented in step S1, any segment shorter than 2 s is directly discarded.
3. The language identification method based on the deep convolutional recurrent neural network of claim 1, wherein:
in step S1, while the audio sequence to be tested is segmented, every segmented audio segment is uniformly encoded in the uncompressed lossless WAVE format.
4. The language identification method based on the deep convolutional recurrent neural network of claim 1, wherein:
wherein the step S4 includes the following sub-steps:
step S4-1, averaging all the corresponding audio category judgment probabilities of each audio sequence to be tested to obtain an average judgment probability of each audio sequence to be tested;
and step S4-2, taking the category with the maximum probability in the average judgment probability as the language category of the corresponding audio sequence to be tested.
5. The language identification method based on the deep convolutional recurrent neural network of claim 1, wherein:
the convolutional recurrent neural network model is obtained through the following model training steps:
step T1, constructing an initial convolutional recurrent neural network model whose model parameters are randomly initialized;
step T2, generating spectrograms from the audio sequences in the training set through steps S1 to S2, feeding the spectrograms in sequence into the initial convolutional recurrent neural network model, and performing one iteration;
step T3, calculating the loss error from the output of the last layer of the initial convolutional recurrent neural network model;
step T4, backpropagating the loss error to update the model parameters;
and step T5, repeating steps T2 to T4 until a training completion condition is reached, giving the trained convolutional recurrent neural network model.
6. The language identification method based on the deep convolutional recurrent neural network of claim 5, wherein:
wherein the audio sequences in the training set comprise ordinary audio data and mixed audio data blended with one or more kinds of randomly generated white noise, crackle noise, café noise, Gaussian noise, and impulse noise.
7. A language identification device based on a convolutional recurrent neural network model, for identifying an audio sequence to be tested so as to identify the corresponding language, characterized by comprising:
a preprocessing part for dividing the audio sequence to be tested into a number of audio segments 2 s in duration and converting the audio segments into corresponding spectrograms by short-time Fourier transform; and
a language identification part for feeding the spectrograms into a pre-stored convolutional recurrent neural network so as to identify the language category corresponding to each audio sequence,
wherein the convolutional recurrent neural network model comprises a convolutional neural network with a VGG architecture and a recurrent neural network with a Bi-LSTM structure, the feature vectors obtained by passing the spectrogram through the convolutional neural network are sliced along the time axis before being fed into the recurrent neural network,
and the language identification part completes the identification of the language category through the following steps:
a feature extraction step of feeding the spectrograms in sequence into the pre-trained convolutional recurrent neural network model so as to obtain the feature data of each audio segment;
and a category identification step of obtaining the language category of each audio sequence from the audio category judgment probabilities of all of its audio segments.
CN201911093837.4A (priority and filing date 2019-11-11): Language identification method and device based on deep convolutional recurrent neural network. Status: Pending (published as CN110782872A).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911093837.4A CN110782872A (en) 2019-11-11 2019-11-11 Language identification method and device based on deep convolutional recurrent neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911093837.4A CN110782872A (en) 2019-11-11 2019-11-11 Language identification method and device based on deep convolutional recurrent neural network

Publications (1)

Publication Number Publication Date
CN110782872A (en) 2020-02-11

Family

Family ID: 69391003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911093837.4A Pending CN110782872A (en) 2019-11-11 2019-11-11 Language identification method and device based on deep convolutional recurrent neural network

Country Status (1)

Country Link
CN (1) CN110782872A (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111326139A (en) * 2020-03-10 2020-06-23 科大讯飞股份有限公司 Language identification method, device, equipment and storage medium
CN111444381A (en) * 2020-03-24 2020-07-24 福州瑞芯微电子股份有限公司 Deep learning corpus-based classification method and storage device
CN111540346A (en) * 2020-05-13 2020-08-14 慧言科技(天津)有限公司 Far-field sound classification method and device
CN111613208A (en) * 2020-05-22 2020-09-01 云知声智能科技股份有限公司 Language identification method and equipment
CN111724766A (en) * 2020-06-29 2020-09-29 合肥讯飞数码科技有限公司 Language identification method, related equipment and readable storage medium
CN111782863A (en) * 2020-06-30 2020-10-16 腾讯音乐娱乐科技(深圳)有限公司 Audio segmentation method and device, storage medium and electronic equipment
CN111931690A (en) * 2020-08-28 2020-11-13 Oppo广东移动通信有限公司 Model training method, device, equipment and storage medium
CN112017630A (en) * 2020-08-19 2020-12-01 北京字节跳动网络技术有限公司 Language identification method and device, electronic equipment and storage medium
CN112199548A (en) * 2020-09-28 2021-01-08 华南理工大学 Music audio classification method based on convolution cyclic neural network
CN112270933A (en) * 2020-11-12 2021-01-26 北京猿力未来科技有限公司 Audio identification method and device
CN112420024A (en) * 2020-10-23 2021-02-26 四川大学 Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device
CN112489623A (en) * 2020-11-17 2021-03-12 携程计算机技术(上海)有限公司 Language identification model training method, language identification method and related equipment
CN112863482A (en) * 2020-12-31 2021-05-28 思必驰科技股份有限公司 Speech synthesis method and system with rhythm
CN112992119A (en) * 2021-01-14 2021-06-18 安徽大学 Deep neural network-based accent classification method and model thereof
CN112989108A (en) * 2021-02-24 2021-06-18 腾讯科技(深圳)有限公司 Language detection method and device based on artificial intelligence and electronic equipment
CN113282718A (en) * 2021-07-26 2021-08-20 北京快鱼电子股份公司 Language identification method and system based on self-adaptive center anchor
CN113362851A (en) * 2020-03-06 2021-09-07 上海其高电子科技有限公司 Traffic scene sound classification method and system based on deep learning
CN113539238A (en) * 2020-03-31 2021-10-22 中国科学院声学研究所 End-to-end language identification and classification method based on void convolutional neural network
CN113611285A (en) * 2021-09-03 2021-11-05 哈尔滨理工大学 Language identification method based on stacked bidirectional time sequence pooling
CN115798459A (en) * 2023-02-03 2023-03-14 北京探境科技有限公司 Audio processing method and device, storage medium and electronic equipment
CN116469413A (en) * 2023-04-03 2023-07-21 广州市迪士普音响科技有限公司 Compressed audio silence detection method and device based on artificial intelligence

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN109523993A (en) * 2018-11-02 2019-03-26 成都三零凯天通信实业有限公司 A kind of voice languages classification method merging deep neural network with GRU based on CNN
CN109545198A (en) * 2019-01-04 2019-03-29 北京先声智能科技有限公司 A kind of Oral English Practice mother tongue degree judgment method based on convolutional neural networks
CN110033756A (en) * 2019-04-15 2019-07-19 北京达佳互联信息技术有限公司 Language Identification, device, electronic equipment and storage medium
KR20190110939A (en) * 2018-03-21 2019-10-01 한국과학기술원 Environment sound recognition method based on convolutional neural networks, and system thereof
CN110415728A (en) * 2019-07-29 2019-11-05 内蒙古工业大学 A kind of method and apparatus identifying emotional speech

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190110939A (en) * 2018-03-21 2019-10-01 한국과학기술원 Environment sound recognition method based on convolutional neural networks, and system thereof
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition
CN109523993A (en) * 2018-11-02 2019-03-26 成都三零凯天通信实业有限公司 A kind of voice languages classification method merging deep neural network with GRU based on CNN
CN109545198A (en) * 2019-01-04 2019-03-29 北京先声智能科技有限公司 A kind of Oral English Practice mother tongue degree judgment method based on convolutional neural networks
CN110033756A (en) * 2019-04-15 2019-07-19 北京达佳互联信息技术有限公司 Language Identification, device, electronic equipment and storage medium
CN110415728A (en) * 2019-07-29 2019-11-05 内蒙古工业大学 A kind of method and apparatus identifying emotional speech

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
REENA DHAKAD ET AL.: "Devanagari digit recognition by using artificial neural network", 2017 International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS) *
HE Jingjing (贺菁菁): "Audio language identification based on deep belief networks", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362851A (en) * 2020-03-06 2021-09-07 上海其高电子科技有限公司 Traffic scene sound classification method and system based on deep learning
CN111326139B (en) * 2020-03-10 2024-02-13 科大讯飞股份有限公司 Language identification method, device, equipment and storage medium
CN111326139A (en) * 2020-03-10 2020-06-23 科大讯飞股份有限公司 Language identification method, device, equipment and storage medium
CN111444381A (en) * 2020-03-24 2020-07-24 福州瑞芯微电子股份有限公司 Deep learning corpus-based classification method and storage device
CN111444381B (en) * 2020-03-24 2022-09-30 瑞芯微电子股份有限公司 Deep learning corpus-based classification method and storage device
CN113539238B (en) * 2020-03-31 2023-12-08 中国科学院声学研究所 End-to-end language identification and classification method based on cavity convolutional neural network
CN113539238A (en) * 2020-03-31 2021-10-22 中国科学院声学研究所 End-to-end language identification and classification method based on void convolutional neural network
CN111540346A (en) * 2020-05-13 2020-08-14 慧言科技(天津)有限公司 Far-field sound classification method and device
CN111613208B (en) * 2020-05-22 2023-08-25 云知声智能科技股份有限公司 Language identification method and equipment
CN111613208A (en) * 2020-05-22 2020-09-01 云知声智能科技股份有限公司 Language identification method and equipment
CN111724766B (en) * 2020-06-29 2024-01-05 合肥讯飞数码科技有限公司 Language identification method, related equipment and readable storage medium
CN111724766A (en) * 2020-06-29 2020-09-29 合肥讯飞数码科技有限公司 Language identification method, related equipment and readable storage medium
CN111782863A (en) * 2020-06-30 2020-10-16 腾讯音乐娱乐科技(深圳)有限公司 Audio segmentation method and device, storage medium and electronic equipment
CN112017630B (en) * 2020-08-19 2022-04-01 北京字节跳动网络技术有限公司 Language identification method and device, electronic equipment and storage medium
CN112017630A (en) * 2020-08-19 2020-12-01 北京字节跳动网络技术有限公司 Language identification method and device, electronic equipment and storage medium
CN111931690A (en) * 2020-08-28 2020-11-13 Oppo广东移动通信有限公司 Model training method, device, equipment and storage medium
CN112199548A (en) * 2020-09-28 2021-01-08 华南理工大学 Music audio classification method based on convolution cyclic neural network
CN112420024B (en) * 2020-10-23 2022-09-09 四川大学 Full-end-to-end Chinese and English mixed empty pipe voice recognition method and device
CN112420024A (en) * 2020-10-23 2021-02-26 四川大学 Full-end-to-end Chinese and English mixed air traffic control voice recognition method and device
CN112270933A (en) * 2020-11-12 2021-01-26 北京猿力未来科技有限公司 Audio identification method and device
CN112270933B (en) * 2020-11-12 2024-03-12 北京猿力未来科技有限公司 Audio identification method and device
CN112489623A (en) * 2020-11-17 2021-03-12 携程计算机技术(上海)有限公司 Language identification model training method, language identification method and related equipment
CN112863482A (en) * 2020-12-31 2021-05-28 思必驰科技股份有限公司 Speech synthesis method and system with rhythm
CN112992119B (en) * 2021-01-14 2024-05-03 安徽大学 Accent classification method based on deep neural network and model thereof
CN112992119A (en) * 2021-01-14 2021-06-18 安徽大学 Deep neural network-based accent classification method and model thereof
CN112989108A (en) * 2021-02-24 2021-06-18 腾讯科技(深圳)有限公司 Language detection method and device based on artificial intelligence and electronic equipment
CN113282718A (en) * 2021-07-26 2021-08-20 北京快鱼电子股份公司 Language identification method and system based on self-adaptive center anchor
CN113611285A (en) * 2021-09-03 2021-11-05 哈尔滨理工大学 Language identification method based on stacked bidirectional time sequence pooling
CN113611285B (en) * 2021-09-03 2023-11-24 哈尔滨理工大学 Language identification method based on stacked bidirectional time sequence pooling
CN115798459A (en) * 2023-02-03 2023-03-14 北京探境科技有限公司 Audio processing method and device, storage medium and electronic equipment
CN116469413B (en) * 2023-04-03 2023-12-01 广州市迪士普音响科技有限公司 Compressed audio silence detection method and device based on artificial intelligence
CN116469413A (en) * 2023-04-03 2023-07-21 广州市迪士普音响科技有限公司 Compressed audio silence detection method and device based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
Žmolíková et al. Speakerbeam: Speaker aware neural network for target speaker extraction in speech mixtures
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
Kelly et al. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
Rammo et al. Detecting the speaker language using CNN deep learning algorithm
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
CN111312292A (en) Emotion recognition method and device based on voice, electronic equipment and storage medium
CN111667834B (en) Hearing-aid equipment and hearing-aid method
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
Alashban et al. Speaker gender classification in mono-language and cross-language using BLSTM network
Wang et al. A research on HMM based speech recognition in spoken English
Ling An acoustic model for English speech recognition based on deep learning
CN114125506B (en) Voice auditing method and device
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN114708857A (en) Speech recognition model training method, speech recognition method and corresponding device
Kamble et al. Emotion recognition for instantaneous Marathi spoken words
CN115881156A (en) Multi-scale-based multi-modal time domain voice separation method
Tailor et al. Deep learning approach for spoken digit recognition in Gujarati language
Islam et al. Bangla dataset and MMFCC in text-dependent speaker identification.
CN113763992A (en) Voice evaluation method and device, computer equipment and storage medium
Jadhav et al. An Emotion Recognition from Speech using LSTM
Pan et al. Application of hidden Markov models in speech command recognition
Kaur et al. Speech based retrieval system for Punjabi language
CN113035247B (en) Audio text alignment method and device, electronic equipment and storage medium
CN113990288B (en) Method for automatically generating and deploying voice synthesis model by voice customer service

Legal Events

Code  Description
PB01  Publication
SE01  Entry into force of request for substantive examination
RJ01  Rejection of invention patent application after publication (application publication date: 20200211)