CN117292675A - Language identification method based on deep time sequence feature representation - Google Patents
- Publication number: CN117292675A (application CN202311388897.5A)
- Authority: CN (China)
- Legal status: Pending
Classifications
- G10L15/005: Language recognition
- G06N3/0464: Convolutional networks [CNN, ConvNet]
- G06N3/0499: Feedforward networks
- G10L15/063: Training of speech recognition systems
- G10L15/16: Speech classification or search using artificial neural networks
- G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
- Y02D10/00: Energy efficient computing
Abstract
The invention relates to a language identification method based on deep time-sequence feature representation, and belongs to the technical field of language identification. The invention aims to solve the low accuracy of existing language identification methods. The process is as follows. Step 1: acquire audio data sets of different languages; apply data enhancement to each audio data set; cut the enhanced audio data sets into audio segments of the same length to form a training set. Step 2: construct a deep learning model and train it on the training set from step 1 until the set maximum number of iterations is reached, yielding a trained deep learning model; the deep learning model comprises, in order, a pre-training model, a time pool, and a fully connected layer. Step 3: input the audio data under test into the trained deep learning model to obtain its language type.
Description
Technical Field
The invention relates to a language identification method based on deep time sequence feature representation, and belongs to the technical field of language identification.
Background
Language identification plays a vital role in modern society. As globalization deepens, we live in an increasingly interconnected world, and language identification technology has become essential: it is not only a technical breakthrough but also a bridge between cultures. By rapidly and accurately identifying the language a speaker uses, it removes obstacles to cross-cultural communication and facilitates international cooperation and exchange. In the internet age, information spreads at unprecedented speed and scale, but this also brings a challenge: large volumes of multilingual content appear on the internet and social media. Language identification lets platforms automatically identify and categorize such content, so users can more easily find the information they need; this improves the efficiency of information acquisition and broadens users' horizons, giving them access to knowledge and culture from around the world. Language identification also plays a key role in search engines, enabling them to understand a user's query language more accurately and return more targeted, relevant results; this is especially important for users seeking specific information and raises the intelligence of search engines. In technical and business applications, language identification underpins multilingual applications, software with multilingual support, and cross-language data analysis, and has driven innovations such as intelligent customer service and multilingual support for multinational enterprises.
Advances in this technology have prompted innovation, providing more intelligent and personalized services for users of different languages; this raises enterprises' service level and expands their international market opportunities. In education, language recognition offers unprecedented convenience for cross-cultural learning, letting learners access knowledge and culture around the globe more easily. In medical care, it supports communication between medical staff and patients, improving the quality and efficiency of medical service, especially in multilingual environments. Language identification thus not only eliminates language barriers but also promotes the exchange and diverse development of global culture: it improves the efficiency of information acquisition and communication and advances technological innovation and cross-cultural understanding. Its applications continue to influence society, economy, and culture, opening far wider possibilities.
Timing information is critical to language identification. A speech signal is time-varying and contains rich timing features, such as variations in the audio spectrum and the cadence of sound; this timing information provides important clues for distinguishing languages. For example, some languages have unique patterns in speech rhythm, while others differ in the pause patterns between syllables. By analyzing these timing characteristics, a speech signal can be classified into a particular language more accurately. Timing information also helps distinguish accents or dialects within the same language: the same language may show subtle pronunciation variations across regions or groups, and these variations are often reflected in timing characteristics. Accurately capturing timing information therefore allows finer discrimination between accents or dialects, improving the accuracy of language identification.
Disclosure of Invention
The invention aims to solve the low accuracy of existing language identification methods, and provides a language identification method based on deep time-sequence feature representation.
The language identification method based on deep time-sequence feature representation comprises the following specific process:
step 1, acquire audio data sets of different languages;
apply data enhancement to each audio data set;
cut the enhanced audio data sets into audio segments of the same length to form the training set;
step 2, construct a deep learning model and train it on the training set from step 1 until the set maximum number of iterations is reached, yielding a trained deep learning model;
the deep learning model comprises, in order, a pre-training model, a time pool, and a fully connected layer;
step 3, input the audio data under test into the trained deep learning model to obtain its language type.
The beneficial effects of the invention are as follows:
The method extracts the timing information between speech frames and uses it for language identification; it is referred to as a language identification method based on deep time-sequence feature representation.
To improve the feature representation capability of the neural network, the invention provides a language identification method based on deep time-sequence feature representation, which effectively improves the performance of a language identification system.
Drawings
FIG. 1 is a flow chart of the invention;
FIG. 2 is a diagram of the network architecture of the invention, in which the feature representation layer may take two different forms, a fully connected layer or a convolution layer; the corresponding methods are abbreviated FCLT and CNNLT, respectively, and NoLT denotes the condition without the timing constraint (LT);
FIG. 3 is an ablation-experiment comparison of the method FCLT against the comparison methods FCNoLT and wav2vec2.0 on the dialect identification task of the OLR2020 database; performance is evaluated with equal error rate (EER) and average loss (Cavg), where FCNoLT is the method FCLT without the timing constraint (LT);
FIG. 4 is an ablation-experiment comparison of the method CNNLT against the comparison methods CNNNoLT and wav2vec2.0 on the dialect identification task of the OLR2020 database; performance is evaluated with equal error rate (EER) and average loss (Cavg), where CNNNoLT is the method CNNLT without the timing constraint (LT);
FIG. 5 compares the equal error rate (EER) of the inventive methods (FCLT and CNNLT) with the comparison methods FCNoLT, CNNNoLT, wav2vec2.0, [Pytorch] x-vector, and [Kaldi] i-vector on the identification task of the OLR2020 database;
FIG. 6 compares the average loss (Cavg) of the inventive methods (FCLT and CNNLT) with the comparison methods FCNoLT, CNNNoLT, wav2vec2.0, [Pytorch] x-vector, and [Kaldi] i-vector on the identification task of the OLR2020 database.
Detailed Description
The first embodiment: the language identification method based on deep time-sequence feature representation of this embodiment comprises the following specific process:
step 1, acquire audio data sets (a plurality of audio segments) of different languages;
apply data enhancement to each audio data set;
cut the enhanced audio data sets into audio segments of the same length to form the training set;
step 2, construct a deep learning model and train it on the training set from step 1 until the set maximum number of iterations is reached, yielding a trained deep learning model;
the deep learning model comprises, in order, a pre-training model, a time pool, and a fully connected layer;
step 3, input the audio data under test into the trained deep learning model to obtain its language type.
The second embodiment: this embodiment differs from the first in that the data enhancement in step 1 is applied to each segment of audio data in the audio data sets of the different languages, yielding the enhanced audio data sets of the different languages;
data enhancement includes adding noise, speed enhancement, volume enhancement, tone (pitch) enhancement, and movement (shift) enhancement.
Tone is the vibration frequency of sound; volume is the vibration amplitude of sound; movement enhancement splits a segment of audio data into pieces and rejoins them in an arbitrary order.
Other steps and parameters are the same as in the first embodiment.
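The augmentation operations above can be sketched with plain NumPy. The function names, SNR value, and piece count below are illustrative assumptions rather than the patent's implementation, and the speed change is approximated by simple linear-interpolation resampling (which also shifts pitch), not a production-quality DSP routine:

```python
import numpy as np

def add_noise(x, snr_db=20.0, rng=None):
    """Additive white noise at an assumed target signal-to-noise ratio (dB)."""
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(len(x))
    # Scale noise so that 10*log10(P_signal / P_noise) == snr_db.
    p_sig = np.mean(x ** 2)
    p_noise = np.mean(noise ** 2)
    noise *= np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return x + noise

def change_volume(x, gain=1.5):
    """Volume enhancement: scale the vibration amplitude."""
    return x * gain

def change_speed(x, rate=1.1):
    """Crude speed enhancement by linear-interpolation resampling
    (simplification: this also shifts the pitch)."""
    n_out = int(len(x) / rate)
    idx = np.linspace(0, len(x) - 1, n_out)
    return np.interp(idx, np.arange(len(x)), x)

def shift_audio(x, n_pieces=4, rng=None):
    """Movement enhancement as described in the text: split the clip
    into pieces and rejoin them in a random order."""
    rng = rng or np.random.default_rng(0)
    pieces = np.array_split(x, n_pieces)
    order = rng.permutation(n_pieces)
    return np.concatenate([pieces[i] for i in order])

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
noisy = add_noise(x)
fast = change_speed(x, rate=1.1)
shifted = shift_audio(x)
```

Each operation returns a new waveform of the same or predictably scaled length, so the augmented segments can still be cut to a uniform training length afterwards.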
A third embodiment: this embodiment differs from the first or second embodiment in that, in step 2, the deep learning model is constructed and trained on the training set from step 1 until the set maximum number of iterations is reached, yielding the trained deep learning model;
the deep learning model comprises, in order, a pre-training model, a time pool, and a fully connected layer;
the specific process is as follows:
step 21, input the training set from step 1 into the pre-training model to obtain the speech feature sequence T_n:
T_n = [t_1, t_2, …, t_i, …, t_n]
where t_i ∈ R^F is the i-th vector of the speech feature sequence T_n, and F is the dimension of the latent speech features in T_n;
step 22, input the speech feature sequence T_n obtained in step 21 into the time pool to obtain the speech feature vector μ;
step 23, input the speech feature vector μ obtained in step 22 into the fully connected layer for prediction; the prediction result is the language type of the audio data;
step 24, repeat steps 21 to 23 until the set maximum number of iterations is reached, yielding the trained deep learning model.
Other steps and parameters are the same as in the first or second embodiment.
The specific embodiment IV is as follows: this embodiment differs from one to three embodiments in that the pre-training model in step 21 is wav2vec2-base.
Other steps and parameters are the same as in one to three embodiments.
The fifth embodiment: this embodiment differs from the first through fourth embodiments in that the time pool in step 22 is CNNLT or FCLT;
the loss function of CNNLT or FCLT is
Loss = -(1/N) Σ_{j=1}^{N} Σ_{c=1}^{M} y_{jc} log2(p_{jc}) + λ·LT
where M is the number of audio data sample categories; y_{jc} is a sign function taking the value 0 or 1: it is 1 if the true category of audio data sample j equals c, and 0 otherwise; p_{jc} is the probability, output by the time pool, that the predicted audio data sample j belongs to category c; N is the total number of audio data samples input to the time pool; LT is the regularization term; λ is a hyper-parameter; and the logarithm is base 2.
The regularization term LT is formulated so that the time pool G satisfies the timing constraint; it involves t_i and t_{i+1}, the i-th and (i+1)-th vectors of the speech feature sequence T_n, the number n of vectors in T_n, the speech feature vector μ, and a tolerance parameter α.
The time pool is a network block that produces a speech feature vector μ satisfying the timing constraint. It comprises, in order, a feature representation layer, a pooling layer, and a network layer, and imposes the timing constraint using the speech feature vector μ and the speech feature sequence T_n.
Other steps and parameters are the same as in one to four embodiments.
Specific embodiment six: the present embodiment is different from one to fifth embodiments in that the CNNLT sequentially includes a feature representation layer, a pooling layer, and a network layer;
the characteristic representation layer is a convolution layer;
the pooling layer is an average pooling layer;
the network layer is a feed-Forward Neural Network (FNN).
Other steps and parameters are the same as in one of the first to fifth embodiments.
Seventh embodiment: this embodiment differs from one to five of the embodiments in that the FCLT includes, in order, a feature representation layer, a pooling layer, and a network layer;
the characteristic representation layer is a full connection layer;
the pooling layer is a mean square difference pooling layer;
the network layer is a feed-Forward Neural Network (FNN).
Other steps and parameters are the same as in one of the first to fifth embodiments.
Eighth embodiment: the difference between this embodiment and the sixth or seventh embodiment is that the tolerance parameter α has a value of 0.0001.
Other steps and parameters are the same as those of the sixth or seventh embodiment.
Examples:
The technical scheme adopted by the invention is a language identification method whose time pool is implemented with deep learning, comprising the following steps:
step 1, acquire audio data sets (a plurality of audio segments) of different languages;
apply data enhancement to each audio data set;
cut the enhanced audio data sets into audio segments of the same length to form the training set;
in this example, the segments are 1 second long;
step 2, construct a deep learning model and train it on the training set from step 1 until the set maximum number of iterations is reached, yielding a trained deep learning model;
the deep learning model comprises, in order, a pre-training model, a time pool, and a fully connected layer;
step 3, compute the speech feature vectors μ of the test set and send them to the fully connected layer for classification; randomly extract 10000 pairs of data as scoring data, and finally score and classify with a cosine distance classifier to verify the performance of the deep learning model; if the performance reaches the standard, execute step 4, otherwise return to step 2;
step 4, input the audio data under test into the trained deep learning model to obtain its language type;
in this embodiment, the specific process of step 2 is:
step 21, input the training set from step 1 into the pre-training model to obtain the speech feature sequence T_n:
T_n = [t_1, t_2, …, t_i, …, t_n]
where t_i ∈ R^F is the i-th vector of the speech feature sequence T_n, and F is the dimension of the latent speech features in T_n;
step 22, input the speech feature sequence T_n obtained in step 21 into the time pool to obtain the speech feature vector μ;
step 23, input the speech feature vector μ obtained in step 22 into the fully connected layer for prediction; the prediction result is the language type of the audio data;
step 24, repeat steps 21 to 23 until the set maximum number of iterations is reached, yielding the trained deep learning model.
In this embodiment, the specific process of step 21 is as follows.
This example uses a pre-trained wav2vec2.0 model.
The audio data of the training set from step 1 are fed into the pre-trained wav2vec2.0 model to obtain the latent speech feature sequence T_n = [t_1, t_2, …, t_i, …, t_n],
where t_i ∈ R^F is the i-th vector of the speech feature sequence T_n. The value of n depends on the length of the input audio and is 49 in this example; F is the dimension of the latent speech features in T_n, which depends on the model and is 768 in this example.
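The figure n = 49 for 1 second of 16 kHz audio follows from the stride pattern of wav2vec 2.0's convolutional feature encoder: seven 1-D conv layers with (kernel, stride) pairs (10,5), (3,2)×4, (2,2)×2, per the wav2vec 2.0 paper. The check below is plain arithmetic on those strides, not a call into the model:

```python
# (kernel, stride) of each conv layer in wav2vec 2.0's feature encoder.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def wav2vec2_num_frames(n_samples):
    """Number of output frames for a raw-waveform input of n_samples
    (valid convolution at each layer)."""
    n = n_samples
    for kernel, stride in CONV_LAYERS:
        n = (n - kernel) // stride + 1
    return n

frames_1s = wav2vec2_num_frames(16000)  # 1 s at 16 kHz -> 49 frames
```

Each output frame thus covers roughly 20 ms of audio; F = 768 is the hidden size of the wav2vec2-base transformer.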
In this embodiment, the specific process of step 22 is:
the time pool is divided into 2 types:
the first type is CNNLT, which sequentially comprises a feature representation layer, a pooling layer and a network layer;
the characteristic representation layer is a convolution layer; the pooling layer is an average pooling layer; the network layer is a feed-Forward Neural Network (FNN);
another 1 is FCLT, comprising a characteristic representation layer, a pooling layer and a network layer in sequence;
the characteristic representation layer is a full connection layer; the pooling layer is a mean square difference pooling layer; the network layer is a feed-Forward Neural Network (FNN).
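A minimal NumPy sketch of the two time-pool variants, under stated assumptions: "mean square difference pooling" is read here as statistics pooling (per-dimension mean and standard deviation, concatenated), the weights are random stand-ins, and the hidden width H and kernel size are illustrative since the patent does not give them. The network layer projects back to dimension F so the output matches the frames of T_n:

```python
import numpy as np

rng = np.random.default_rng(0)
F, H = 768, 256   # F: wav2vec2-base feature dim; H: assumed hidden width

def fnn(x, w1, w2):
    """Network layer: a small feed-forward network (linear -> ReLU -> linear)."""
    return np.maximum(x @ w1, 0) @ w2

def fclt_pool(T, w_fc, w1, w2):
    """FCLT: fully connected feature layer, statistics pooling, FNN."""
    h = np.maximum(T @ w_fc, 0)                             # (n, H) frame-wise transform
    # 'Mean square difference pooling' read as mean + std statistics pooling.
    stats = np.concatenate([h.mean(axis=0), h.std(axis=0)])  # (2H,)
    return fnn(stats, w1, w2)                               # (F,) speech feature vector mu

def conv1d(T, kernel):
    """Valid 1-D convolution over the time axis; kernel shape (k, F, H)."""
    k = kernel.shape[0]
    return np.stack([np.einsum('kf,kfh->h', T[i:i + k], kernel)
                     for i in range(T.shape[0] - k + 1)])

def cnnlt_pool(T, kernel, w1, w2):
    """CNNLT: convolutional feature layer, average pooling, FNN."""
    h = np.maximum(conv1d(T, kernel), 0)                    # (n-k+1, H)
    return fnn(h.mean(axis=0), w1, w2)                      # (F,)

T = rng.standard_normal((49, F))                            # one utterance: n = 49 frames
mu_fc = fclt_pool(T,
                  rng.standard_normal((F, H)) * 0.05,
                  rng.standard_normal((2 * H, H)) * 0.05,
                  rng.standard_normal((H, F)) * 0.05)
mu_cnn = cnnlt_pool(T,
                    rng.standard_normal((3, F, H)) * 0.05,
                    rng.standard_normal((H, H)) * 0.05,
                    rng.standard_normal((H, F)) * 0.05)
```

Both variants reduce the (n, F) sequence to a single F-dimensional vector μ, which is what the timing constraint compares against the individual frames t_i.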
The loss function of the time pool is
Loss = -(1/N) Σ_{j=1}^{N} Σ_{c=1}^{M} y_{jc} log2(p_{jc}) + λ·LT
where M is the number of audio data sample categories; y_{jc} is a sign function taking the value 0 or 1: it is 1 if the true category of audio data sample j equals c, and 0 otherwise; p_{jc} is the probability, output by the time pool, that the predicted audio data sample j belongs to category c; N is the total number of audio data samples input to the time pool; LT is the regularization term; λ is a hyper-parameter; and the logarithm is base 2.
The regularization term LT is formulated so that the time pool G satisfies the timing constraint; it involves t_i and t_{i+1}, the i-th and (i+1)-th vectors of the speech feature sequence T_n, the number n of vectors in T_n, the speech feature vector μ, and the tolerance parameter α.
The tolerance parameter alpha takes a value of 0.0001.
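The categorical part of the loss follows directly from the description (base-2 cross-entropy plus λ·LT). The exact LT formula is not reproduced in the text, so the hinge-style penalty below, which discourages jumps larger than the tolerance α in the distance from μ to consecutive frames t_i, t_{i+1}, is only an illustrative stand-in for the timing constraint, not the patent's formula:

```python
import numpy as np

def time_pool_loss(probs, labels, T, mu, lam=1e-4, alpha=1e-4):
    """Base-2 cross-entropy over M categories plus lam * LT.

    probs:  (N, M) predicted category probabilities from the time pool
    labels: (N,)   true category indices
    T:      (n, F) speech feature sequence T_n
    mu:     (F,)   pooled speech feature vector
    """
    N = probs.shape[0]
    p_true = probs[np.arange(N), labels]   # the sign function y_jc selects these
    ce = -np.mean(np.log2(p_true + 1e-12))

    # ASSUMED form of the timing regularizer LT (the patent's exact formula
    # is not reproduced in the text): penalize changes larger than the
    # tolerance alpha in the distance from mu to consecutive frames.
    d = np.linalg.norm(T - mu, axis=1)               # ||t_i - mu|| for i = 1..n
    lt = np.maximum(np.abs(np.diff(d)) - alpha, 0.0).sum()
    return ce + lam * lt

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
T = np.zeros((5, 4))
mu = np.zeros(4)
loss = time_pool_loss(probs, labels, T, mu)
```

With λ and α set to the document's values (0.0001 each), the cross-entropy term dominates early in training and the regularizer acts as a mild smoothness prior on the pooled representation.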
The feature representation layer of the time pool performs feature transformation. The pooling layer reduces dimensionality and produces the speech feature representation. The network layer transforms the dimension of this representation so that it matches that of the speech feature sequence T_n, which simplifies evaluating the timing-constraint formula.
In this embodiment, the training settings of step 2 are as follows:
the maximum number of iterations is set to 50000, the learning rate of the deep learning model to 0.00005, and the hyper-parameter λ to 0.0001. After iteration, the trained deep learning model is obtained, comprising the wav2vec2.0 model weights and the time pool weights.
In this embodiment, the specific process of step 3 is as follows:
the test set data are processed as in steps 1 and 2 and fed into the trained model to obtain a single-vector speech representation μ. 20000 pairs of data are selected as scoring data, comprising 10000 pairs from the same category and 10000 pairs from different categories, and scoring is finally performed with a cosine distance classifier.
Experimental results:
the invention adopts the dialect recognition task in the eastern language recognition large race 2020 to perform performance verification. The performance evaluation index adopts an Equal Error Rate (EER) and an average loss (Cavg), and the smaller the values of both the error rate (EER) and the average loss (Cavg) are, the better the performance is.
The minimum average loss achieved in the identification task in the method is 0.1323, and the minimum error rate is 13.32%; the performance of dialect recognition is improved to a greater extent than the performance of the i-vector model on Pytorch and the original wav2vec2.0 model given by OLR2020 official. As shown in fig. 3, under the same condition, the lowest average loss under CNNNoLT is 0.1532, and the lowest error rate is 15.41%; compared with CNNNoLT, the method CNNLT has the advantages that the relative average loss and the relative equivalent error rate of the dialect recognition task are reduced by 13.05% and 13.56%, respectively. As shown in fig. 4, under the same conditions, FCNoLT has a minimum average loss of 0.1595 and a minimum error rate of 15.74%; the relative average loss and relative equivalent error rate of the dialect recognition task were reduced by 15.31% and 16.55%, respectively, compared to FCNoLT. As shown in fig. 5, the relative average loss of the dialect recognition task was reduced by 24..49% compared to the performance of the x-vector model on the baseline system Pytorch, respectively; the relative average loss of the dialect recognition task was reduced by 8.76% compared to the performance of the original wav2vec2.0 model, respectively. As shown in fig. 6, the relative error rates of the dialect recognition tasks are respectively reduced by 32.52% compared with the performance of the x-vector model on the baseline system Pytorch; compared with the performance of the original wav2vec2.0 model, the relative equivalent error rate of the dialect recognition task is respectively reduced by 8.26 percent.
The above embodiments merely illustrate the design concept and features of the present invention and are intended to enable those skilled in the art to understand and implement it; the scope of the present invention is not limited to these embodiments. All equivalent changes according to the principles and ideas disclosed herein remain within the scope of the present invention.
Claims (8)
1. A language identification method based on deep time-sequence feature representation, characterized by comprising the following specific process:
step 1, acquiring audio data sets of different languages;
respectively carrying out data enhancement on the audio data sets of different languages;
cutting the audio data sets of different languages after data enhancement into audio data with the same length as a training set;
step 2, constructing a deep learning model and training it on the training set from step 1 until the set maximum number of iterations is reached, to obtain a trained deep learning model;
the deep learning model comprising, in order, a pre-training model, a time pool, and a fully connected layer;
and step 3, inputting the audio data under test into the trained deep learning model to obtain its language type.
2. The language identification method based on deep timing features according to claim 1, wherein: the data enhancement in the step 1 is to conduct data enhancement on each section of audio data in the audio data sets of different languages, and the audio data sets of different languages after the data enhancement are obtained;
data enhancement includes adding noise, speed enhancement, volume enhancement, tone enhancement, movement enhancement.
3. The language identification method based on deep timing features according to claim 2, wherein: the deep learning model is built in the step 2, the training set in the step 1 is input into the deep learning model for training until the set maximum iteration number is reached, and a trained deep learning model is obtained;
the deep learning model sequentially comprises a pre-training model, a time pool and a full-connection layer;
the specific process is as follows:
step 21, inputting the training set in step 1 into a pre-training model to obtain a voice characteristic sequenceT n ;
T n =[t 1 ,t 2 ,…,t i ,…,t n ]
Wherein t is i ∈R F Is a speech feature sequence T n The i-th vector of (a); f is potential speech feature T n Is a dimension of (2);
step 22, input the speech feature sequence T_n obtained in step 21 into the time pool to obtain a speech feature vector μ;
step 23, input the speech feature vector μ obtained in step 22 into the fully connected layer for prediction, obtaining the prediction result, i.e. the language type of the audio data;
and step 24, repeat steps 21 to 23 until the set maximum number of iterations is reached, obtaining the trained deep learning model.
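Steps 21 to 23 can be sketched end to end with toy stand-ins: a frozen random projection plays the role of the pre-training model, mean pooling plays the role of the time pool, and a linear layer plus softmax plays the fully connected layer. The dimensions (F = 8 feature dims, n = 50 frames, M = 4 languages) and the fixed framing scheme are assumptions for illustration, and no parameter update is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

F, n, M = 8, 50, 4                           # assumed toy dimensions
W_enc = rng.normal(0, 0.05, size=(320, F))   # stand-in for the pre-training model
W_fc = rng.normal(0, 0.10, size=(F, M))      # fully connected layer weights
b_fc = np.zeros(M)

def pretrain_encode(waveform):
    """Step 21: waveform -> speech feature sequence T_n of shape (n, F)."""
    frames = waveform[: n * 320].reshape(n, 320)  # crude fixed framing
    return frames @ W_enc

def time_pool(T):
    """Step 22: feature sequence -> single speech feature vector mu."""
    return T.mean(axis=0)

def predict(mu):
    """Step 23: fully connected layer + softmax -> language posteriors."""
    logits = mu @ W_fc + b_fc
    e = np.exp(logits - logits.max())
    return e / e.sum()

wave = rng.normal(size=16000)  # 1 s of fake audio at 16 kHz
probs = predict(time_pool(pretrain_encode(wave)))
```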
4. The language identification method based on deep temporal features according to claim 3, wherein: the pre-training model in step 21 is wav2vec2-base.
5. The language identification method based on deep temporal features according to claim 4, wherein: the time pool in step 22 is a CNNLT or an FCLT;
the loss function of the CNNLT or FCLT is:
L = -(1/N) · Σ_{i=1}^{N} Σ_{c=1}^{M} y_ic · log(p_ic) + λ · LT
wherein M is the number of audio data sample categories; y_ic is a sign function taking the value 0 or 1: it is 1 if the true category of audio data sample i equals c, and 0 otherwise; p_ic is the probability, predicted by the time pool, that audio data sample i belongs to category c; N is the total number of audio data samples input to the time pool; LT is a regularization term; and λ is a hyper-parameter;
the regularization term LT is a function of t_i, t_{i+1}, n, μ, and a tolerance parameter α, wherein t_i and t_{i+1} respectively denote the i-th and (i+1)-th vectors of the speech feature sequence T_n; n is the number of vectors in the speech feature sequence T_n; and μ is the speech feature vector.
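The cross-entropy part of the claim-5 loss follows directly from the symbol definitions above and can be checked numerically. The exact form of LT is given in the source only as a formula image, so the smoothness-style term over adjacent vectors t_i and t_{i+1} with tolerance α below is a labelled placeholder, not the patent's definition.

```python
import numpy as np

def cross_entropy(y, p):
    """-(1/N) * sum_i sum_c y_ic * log(p_ic), for one-hot y and posteriors p."""
    N = y.shape[0]
    return -np.sum(y * np.log(p + 1e-12)) / N

def lt_placeholder(T, alpha=1e-4):
    """ASSUMED smoothness term: penalize adjacent-frame jumps above alpha.
    This is a stand-in; the patent's LT formula is not reproduced in the text."""
    diffs = np.linalg.norm(T[1:] - T[:-1], axis=1)
    return np.mean(np.maximum(diffs - alpha, 0.0))

y = np.array([[1, 0], [0, 1]], dtype=float)   # N=2 samples, M=2 classes, one-hot
p = np.array([[0.9, 0.1], [0.2, 0.8]], dtype=float)
T = np.zeros((5, 3))                           # n=5 frames, F=3 (toy sequence)
lam = 0.01                                     # hyper-parameter lambda (assumed)
loss = cross_entropy(y, p) + lam * lt_placeholder(T)
```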
6. The language identification method based on deep temporal features according to claim 5, wherein: the CNNLT comprises, in sequence, a feature representation layer, a pooling layer, and a network layer;
the feature representation layer is a convolution layer;
the pooling layer is an average pooling layer;
the network layer is a feed-forward neural network.
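A numpy sketch of the CNNLT stack as described: a width-3, same-padded convolution over the time axis as the feature representation layer, average pooling over time, and a two-layer feed-forward network producing class logits. The kernel width and hidden size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, F, H, M = 50, 8, 16, 4                 # assumed frames, dims, hidden, classes
K = rng.normal(0, 0.1, size=(3, F, F))    # 1-D conv kernel of width 3
W1 = rng.normal(0, 0.1, size=(F, H))
W2 = rng.normal(0, 0.1, size=(H, M))

def conv1d(T, K):
    """Feature representation layer: same-padded 1-D convolution over time."""
    pad = np.pad(T, ((1, 1), (0, 0)))
    return np.stack([sum(pad[i + k] @ K[k] for k in range(3))
                     for i in range(T.shape[0])])

def cnnlt(T):
    h = np.maximum(conv1d(T, K), 0.0)     # convolution layer + ReLU
    mu = h.mean(axis=0)                   # average pooling layer over time
    return np.maximum(mu @ W1, 0.0) @ W2  # feed-forward network -> logits

logits = cnnlt(rng.normal(size=(n, F)))
```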
7. The language identification method based on deep temporal features according to claim 5, wherein: the FCLT comprises, in sequence, a feature representation layer, a pooling layer, and a network layer;
the feature representation layer is a fully connected layer;
the pooling layer is a mean-square-deviation pooling layer;
the network layer is a feed-forward neural network.
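A matching numpy sketch of the FCLT stack: a fully connected feature representation layer, then statistics pooling over time (mean and standard deviation concatenated, one plausible reading of the mean-square-deviation pooling layer, stated here as an assumption), then a feed-forward output layer.

```python
import numpy as np

rng = np.random.default_rng(2)
n, F, H, M = 50, 8, 16, 4                    # assumed frames, dims, hidden, classes
W_fc = rng.normal(0, 0.1, size=(F, H))       # fully connected layer weights
W_out = rng.normal(0, 0.1, size=(2 * H, M))  # mean and std are concatenated

def fclt(T):
    h = np.maximum(T @ W_fc, 0.0)            # fully connected layer + ReLU
    stats = np.concatenate([h.mean(axis=0),  # mean over time
                            h.std(axis=0)])  # standard deviation over time
    return stats @ W_out                     # feed-forward layer -> logits

logits = fclt(rng.normal(size=(n, F)))
```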
8. The language identification method based on deep temporal features according to claim 6 or 7, wherein: the tolerance parameter α takes the value 0.0001.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311388897.5A CN117292675A (en) | 2023-10-24 | 2023-10-24 | Language identification method based on deep time sequence feature representation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117292675A true CN117292675A (en) | 2023-12-26 |
Family
ID=89253491
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311388897.5A Pending CN117292675A (en) | 2023-10-24 | 2023-10-24 | Language identification method based on deep time sequence feature representation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117292675A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110782872A (en) * | 2019-11-11 | 2020-02-11 | 复旦大学 | Language identification method and device based on deep convolutional recurrent neural network |
CN113282718A (en) * | 2021-07-26 | 2021-08-20 | 北京快鱼电子股份公司 | Language identification method and system based on self-adaptive center anchor |
CN113611285A (en) * | 2021-09-03 | 2021-11-05 | 哈尔滨理工大学 | Language identification method based on stacked bidirectional time sequence pooling |
CN113823262A (en) * | 2021-11-16 | 2021-12-21 | 腾讯科技(深圳)有限公司 | Voice recognition method and device, electronic equipment and storage medium |
US20220121702A1 (en) * | 2020-10-20 | 2022-04-21 | Adobe Inc. | Generating embeddings in a multimodal embedding space for cross-lingual digital image retrieval |
Non-Patent Citations (1)
Title |
---|
CUI Ruilian; SONG Yan; JIANG Bing; DAI Lirong: "Language Identification Based on Deep Neural Networks", Pattern Recognition and Artificial Intelligence, no. 12, 15 December 2015 (2015-12-15) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||