CN117292675A - Language identification method based on deep time sequence feature representation - Google Patents
- Publication number: CN117292675A (application CN202311388897.5A)
- Authority: CN (China)
- Legal status: Pending
Classifications
- G10L15/005: Language recognition
- G06N3/0464: Convolutional networks [CNN, ConvNet]
- G06N3/0499: Feedforward networks
- G10L15/063: Training of speech recognition systems
- G10L15/16: Speech classification or search using artificial neural networks
- G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
- Y02D10/00: Energy efficient computing
Abstract
The invention relates to a language identification method based on deep time-sequence feature representation, and belongs to the technical field of language identification. The invention aims to solve the low accuracy of existing language identification methods. The process is as follows. Step 1: acquire audio data sets of different languages; apply data enhancement to each audio data set; cut the enhanced audio data sets into audio segments of the same length to form a training set. Step 2: construct a deep learning model and train it on the training set from step 1 until the set maximum number of iterations is reached, yielding a trained deep learning model; the deep learning model comprises, in order, a pre-training model, a time pool, and a fully connected layer. Step 3: input the audio data under test into the trained deep learning model to obtain its language type.
Description
Technical Field
The invention relates to a language identification method based on deep time sequence feature representation, and belongs to the technical field of language identification.
Background
Language identification plays a vital role in modern society. As globalization deepens, we live in an increasingly interconnected world, and language identification technology has become essential: it is not only a technical breakthrough but also a bridge between cultures. By rapidly and accurately identifying the language a speaker uses, it removes obstacles to cross-cultural communication and facilitates international cooperation and exchange. In the internet age, information spreads at unprecedented speed and scale, but this also brings a challenge: large volumes of multilingual content appear on the internet and social media. Language identification lets platforms automatically identify and categorize such content, so users can more easily find the information they need; this improves the efficiency of information acquisition and broadens users' horizons, giving them access to knowledge and culture from around the world. Language identification also plays a key role in search engines, enabling them to understand a user's query language more accurately and return more targeted, relevant results; this is especially important for users seeking specific information and raises the intelligence of search engines. In technical and business applications, language identification underpins multilingual applications, software with multilingual support, and cross-language data analysis, and has driven innovations such as intelligent customer service and multilingual support for multinational enterprises.
Advances in this technology have prompted innovation, providing more intelligent and personalized services for users of different languages; this raises enterprises' service level and expands their international market opportunities. In education, language recognition offers unprecedented convenience for cross-cultural learning, letting learners access knowledge and culture around the globe more easily. In medical care, it supports communication between medical staff and patients, improving the quality and efficiency of medical service, especially in multilingual environments. Language identification thus not only eliminates language barriers but also promotes the exchange and diverse development of global culture: it improves the efficiency of information acquisition and communication and advances technological innovation and cross-cultural understanding. Its applications continue to influence society, economy, and culture, opening far wider possibilities.
Timing information is critical to language identification. A speech signal is time-varying and contains rich timing features, such as variations in the audio spectrum and the cadence of sound; this timing information provides important clues for distinguishing languages. For example, some languages have unique patterns in speech rhythm, while others differ in the pause patterns between syllables. By analyzing these timing characteristics, a speech signal can be classified into a particular language more accurately. Timing information also helps distinguish accents or dialects within the same language: the same language may show subtle pronunciation variations across regions or groups, and these variations are often reflected in timing characteristics. Accurately capturing timing information therefore allows finer discrimination between accents or dialects, improving the accuracy of language identification.
Disclosure of Invention
The invention aims to solve the low accuracy of existing language identification methods, and provides a language identification method based on deep time-sequence feature representation.
The language identification method based on deep time-sequence feature representation comprises the following specific process:
step 1, acquire audio data sets of different languages;
apply data enhancement to each audio data set;
cut the enhanced audio data sets into audio segments of the same length to form the training set;
step 2, construct a deep learning model and train it on the training set from step 1 until the set maximum number of iterations is reached, yielding a trained deep learning model;
the deep learning model comprises, in order, a pre-training model, a time pool, and a fully connected layer;
step 3, input the audio data under test into the trained deep learning model to obtain its language type.
The beneficial effects of the invention are as follows:
The method extracts the timing information between speech frames and uses it for language identification; it is referred to as a language identification method based on deep time-sequence feature representation.
To improve the feature representation capability of the neural network, the invention provides a language identification method based on deep time-sequence feature representation, which effectively improves the performance of a language identification system.
Drawings
FIG. 1 is a flow chart of the invention;
FIG. 2 is a diagram of the network architecture of the invention, in which the feature representation layer may take two different forms, a fully connected layer or a convolution layer; the corresponding methods are abbreviated FCLT and CNNLT, respectively, and NoLT denotes the condition without the timing constraint (LT);
FIG. 3 is an ablation-experiment comparison of the method FCLT against the comparison methods FCNoLT and wav2vec2.0 on the dialect identification task of the OLR2020 database; performance is evaluated with equal error rate (EER) and average loss (Cavg), where FCNoLT is the method FCLT without the timing constraint (LT);
FIG. 4 is an ablation-experiment comparison of the method CNNLT against the comparison methods CNNNoLT and wav2vec2.0 on the dialect identification task of the OLR2020 database; performance is evaluated with equal error rate (EER) and average loss (Cavg), where CNNNoLT is the method CNNLT without the timing constraint (LT);
FIG. 5 compares the equal error rate (EER) of the inventive methods (FCLT and CNNLT) with the comparison methods FCNoLT, CNNNoLT, wav2vec2.0, [Pytorch] x-vector, and [Kaldi] i-vector on the identification task of the OLR2020 database;
FIG. 6 compares the average loss (Cavg) of the inventive methods (FCLT and CNNLT) with the comparison methods FCNoLT, CNNNoLT, wav2vec2.0, [Pytorch] x-vector, and [Kaldi] i-vector on the identification task of the OLR2020 database.
Detailed Description
The first embodiment: the language identification method based on deep time-sequence feature representation of this embodiment comprises the following specific process:
step 1, acquire audio data sets (a plurality of audio segments) of different languages;
apply data enhancement to each audio data set;
cut the enhanced audio data sets into audio segments of the same length to form the training set;
step 2, construct a deep learning model and train it on the training set from step 1 until the set maximum number of iterations is reached, yielding a trained deep learning model;
the deep learning model comprises, in order, a pre-training model, a time pool, and a fully connected layer;
step 3, input the audio data under test into the trained deep learning model to obtain its language type.
The second embodiment: this embodiment differs from the first in that the data enhancement in step 1 is applied to each segment of audio data in the audio data sets of the different languages, yielding the enhanced audio data sets of the different languages;
data enhancement includes adding noise, speed enhancement, volume enhancement, tone (pitch) enhancement, and movement (shift) enhancement.
Tone is the vibration frequency of sound; volume is the vibration amplitude of sound; movement enhancement splits a segment of audio data into pieces and rejoins them in an arbitrary order.
Other steps and parameters are the same as in the first embodiment.
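The augmentation operations above can be sketched with plain NumPy. The function names, SNR value, and piece count below are illustrative assumptions rather than the patent's implementation, and the speed change is approximated by simple linear-interpolation resampling (which also shifts pitch), not a production-quality DSP routine:

```python
import numpy as np

def add_noise(x, snr_db=20.0, rng=None):
    """Additive white noise at an assumed target signal-to-noise ratio (dB)."""
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(len(x))
    # Scale noise so that 10*log10(P_signal / P_noise) == snr_db.
    p_sig = np.mean(x ** 2)
    p_noise = np.mean(noise ** 2)
    noise *= np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return x + noise

def change_volume(x, gain=1.5):
    """Volume enhancement: scale the vibration amplitude."""
    return x * gain

def change_speed(x, rate=1.1):
    """Crude speed enhancement by linear-interpolation resampling
    (simplification: this also shifts the pitch)."""
    n_out = int(len(x) / rate)
    idx = np.linspace(0, len(x) - 1, n_out)
    return np.interp(idx, np.arange(len(x)), x)

def shift_audio(x, n_pieces=4, rng=None):
    """Movement enhancement as described in the text: split the clip
    into pieces and rejoin them in a random order."""
    rng = rng or np.random.default_rng(0)
    pieces = np.array_split(x, n_pieces)
    order = rng.permutation(n_pieces)
    return np.concatenate([pieces[i] for i in order])

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
noisy = add_noise(x)
fast = change_speed(x, rate=1.1)
shifted = shift_audio(x)
```

Each operation returns a new waveform of the same or predictably scaled length, so the augmented segments can still be cut to a uniform training length afterwards.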
A third embodiment: this embodiment differs from the first or second embodiment in that, in step 2, the deep learning model is constructed and trained on the training set from step 1 until the set maximum number of iterations is reached, yielding the trained deep learning model;
the deep learning model comprises, in order, a pre-training model, a time pool, and a fully connected layer;
the specific process is as follows:
step 21, input the training set from step 1 into the pre-training model to obtain the speech feature sequence T_n:
T_n = [t_1, t_2, …, t_i, …, t_n]
where t_i ∈ R^F is the i-th vector of the speech feature sequence T_n, and F is the dimension of the latent speech features in T_n;
step 22, input the speech feature sequence T_n obtained in step 21 into the time pool to obtain the speech feature vector μ;
step 23, input the speech feature vector μ obtained in step 22 into the fully connected layer for prediction; the prediction result is the language type of the audio data;
step 24, repeat steps 21 to 23 until the set maximum number of iterations is reached, yielding the trained deep learning model.
Other steps and parameters are the same as in the first or second embodiment.
The specific embodiment IV is as follows: this embodiment differs from one to three embodiments in that the pre-training model in step 21 is wav2vec2-base.
Other steps and parameters are the same as in one to three embodiments.
The fifth embodiment: this embodiment differs from the first through fourth embodiments in that the time pool in step 22 is CNNLT or FCLT;
the loss function of CNNLT or FCLT is
Loss = -(1/N) Σ_{j=1}^{N} Σ_{c=1}^{M} y_{jc} log2(p_{jc}) + λ·LT
where M is the number of audio data sample categories; y_{jc} is a sign function taking the value 0 or 1: it is 1 if the true category of audio data sample j equals c, and 0 otherwise; p_{jc} is the probability, output by the time pool, that the predicted audio data sample j belongs to category c; N is the total number of audio data samples input to the time pool; LT is the regularization term; λ is a hyper-parameter; and the logarithm is base 2.
The regularization term LT is formulated so that the time pool G satisfies the timing constraint; it involves t_i and t_{i+1}, the i-th and (i+1)-th vectors of the speech feature sequence T_n, the number n of vectors in T_n, the speech feature vector μ, and a tolerance parameter α.
The time pool is a network block that produces a speech feature vector μ satisfying the timing constraint. It comprises, in order, a feature representation layer, a pooling layer, and a network layer, and imposes the timing constraint using the speech feature vector μ and the speech feature sequence T_n.
Other steps and parameters are the same as in one to four embodiments.
Specific embodiment six: the present embodiment is different from one to fifth embodiments in that the CNNLT sequentially includes a feature representation layer, a pooling layer, and a network layer;
the characteristic representation layer is a convolution layer;
the pooling layer is an average pooling layer;
the network layer is a feed-Forward Neural Network (FNN).
Other steps and parameters are the same as in one of the first to fifth embodiments.
Seventh embodiment: this embodiment differs from one to five of the embodiments in that the FCLT includes, in order, a feature representation layer, a pooling layer, and a network layer;
the characteristic representation layer is a full connection layer;
the pooling layer is a mean square difference pooling layer;
the network layer is a feed-Forward Neural Network (FNN).
Other steps and parameters are the same as in one of the first to fifth embodiments.
Eighth embodiment: the difference between this embodiment and the sixth or seventh embodiment is that the tolerance parameter α has a value of 0.0001.
Other steps and parameters are the same as those of the sixth or seventh embodiment.
Examples:
The technical scheme adopted by the invention is a language identification method whose time pool is implemented with deep learning, comprising the following steps:
step 1, acquire audio data sets (a plurality of audio segments) of different languages;
apply data enhancement to each audio data set;
cut the enhanced audio data sets into audio segments of the same length to form the training set;
in this example, the segments are 1 second long;
step 2, construct a deep learning model and train it on the training set from step 1 until the set maximum number of iterations is reached, yielding a trained deep learning model;
the deep learning model comprises, in order, a pre-training model, a time pool, and a fully connected layer;
step 3, compute the speech feature vectors μ of the test set and send them to the fully connected layer for classification; randomly extract 10000 pairs of data as scoring data, and finally score and classify with a cosine distance classifier to verify the performance of the deep learning model; if the performance reaches the standard, execute step 4, otherwise return to step 2;
step 4, input the audio data under test into the trained deep learning model to obtain its language type;
in this embodiment, the specific process of step 2 is:
step 21, input the training set from step 1 into the pre-training model to obtain the speech feature sequence T_n:
T_n = [t_1, t_2, …, t_i, …, t_n]
where t_i ∈ R^F is the i-th vector of the speech feature sequence T_n, and F is the dimension of the latent speech features in T_n;
step 22, input the speech feature sequence T_n obtained in step 21 into the time pool to obtain the speech feature vector μ;
step 23, input the speech feature vector μ obtained in step 22 into the fully connected layer for prediction; the prediction result is the language type of the audio data;
step 24, repeat steps 21 to 23 until the set maximum number of iterations is reached, yielding the trained deep learning model.
In this embodiment, the specific process of step 21 is as follows.
This example uses a pre-trained wav2vec2.0 model.
The audio data of the training set from step 1 are fed into the pre-trained wav2vec2.0 model to obtain the latent speech feature sequence T_n = [t_1, t_2, …, t_i, …, t_n],
where t_i ∈ R^F is the i-th vector of the speech feature sequence T_n. The value of n depends on the length of the input audio and is 49 in this example; F is the dimension of the latent speech features in T_n, which depends on the model and is 768 in this example.
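The figure n = 49 for 1 second of 16 kHz audio follows from the stride pattern of wav2vec 2.0's convolutional feature encoder: seven 1-D conv layers with (kernel, stride) pairs (10,5), (3,2)×4, (2,2)×2, per the wav2vec 2.0 paper. The check below is plain arithmetic on those strides, not a call into the model:

```python
# (kernel, stride) of each conv layer in wav2vec 2.0's feature encoder.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def wav2vec2_num_frames(n_samples):
    """Number of output frames for a raw-waveform input of n_samples
    (valid convolution at each layer)."""
    n = n_samples
    for kernel, stride in CONV_LAYERS:
        n = (n - kernel) // stride + 1
    return n

frames_1s = wav2vec2_num_frames(16000)  # 1 s at 16 kHz -> 49 frames
```

Each output frame thus covers roughly 20 ms of audio; F = 768 is the hidden size of the wav2vec2-base transformer.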
In this embodiment, the specific process of step 22 is:
the time pool is divided into 2 types:
the first type is CNNLT, which sequentially comprises a feature representation layer, a pooling layer and a network layer;
the characteristic representation layer is a convolution layer; the pooling layer is an average pooling layer; the network layer is a feed-Forward Neural Network (FNN);
another 1 is FCLT, comprising a characteristic representation layer, a pooling layer and a network layer in sequence;
the characteristic representation layer is a full connection layer; the pooling layer is a mean square difference pooling layer; the network layer is a feed-Forward Neural Network (FNN).
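A minimal NumPy sketch of the two time-pool variants, under stated assumptions: "mean square difference pooling" is read here as statistics pooling (per-dimension mean and standard deviation, concatenated), the weights are random stand-ins, and the hidden width H and kernel size are illustrative since the patent does not give them. The network layer projects back to dimension F so the output matches the frames of T_n:

```python
import numpy as np

rng = np.random.default_rng(0)
F, H = 768, 256   # F: wav2vec2-base feature dim; H: assumed hidden width

def fnn(x, w1, w2):
    """Network layer: a small feed-forward network (linear -> ReLU -> linear)."""
    return np.maximum(x @ w1, 0) @ w2

def fclt_pool(T, w_fc, w1, w2):
    """FCLT: fully connected feature layer, statistics pooling, FNN."""
    h = np.maximum(T @ w_fc, 0)                             # (n, H) frame-wise transform
    # 'Mean square difference pooling' read as mean + std statistics pooling.
    stats = np.concatenate([h.mean(axis=0), h.std(axis=0)])  # (2H,)
    return fnn(stats, w1, w2)                               # (F,) speech feature vector mu

def conv1d(T, kernel):
    """Valid 1-D convolution over the time axis; kernel shape (k, F, H)."""
    k = kernel.shape[0]
    return np.stack([np.einsum('kf,kfh->h', T[i:i + k], kernel)
                     for i in range(T.shape[0] - k + 1)])

def cnnlt_pool(T, kernel, w1, w2):
    """CNNLT: convolutional feature layer, average pooling, FNN."""
    h = np.maximum(conv1d(T, kernel), 0)                    # (n-k+1, H)
    return fnn(h.mean(axis=0), w1, w2)                      # (F,)

T = rng.standard_normal((49, F))                            # one utterance: n = 49 frames
mu_fc = fclt_pool(T,
                  rng.standard_normal((F, H)) * 0.05,
                  rng.standard_normal((2 * H, H)) * 0.05,
                  rng.standard_normal((H, F)) * 0.05)
mu_cnn = cnnlt_pool(T,
                    rng.standard_normal((3, F, H)) * 0.05,
                    rng.standard_normal((H, H)) * 0.05,
                    rng.standard_normal((H, F)) * 0.05)
```

Both variants reduce the (n, F) sequence to a single F-dimensional vector μ, which is what the timing constraint compares against the individual frames t_i.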
The loss function of the time pool is
Loss = -(1/N) Σ_{j=1}^{N} Σ_{c=1}^{M} y_{jc} log2(p_{jc}) + λ·LT
where M is the number of audio data sample categories; y_{jc} is a sign function taking the value 0 or 1: it is 1 if the true category of audio data sample j equals c, and 0 otherwise; p_{jc} is the probability, output by the time pool, that the predicted audio data sample j belongs to category c; N is the total number of audio data samples input to the time pool; LT is the regularization term; λ is a hyper-parameter; and the logarithm is base 2.
The regularization term LT is formulated so that the time pool G satisfies the timing constraint; it involves t_i and t_{i+1}, the i-th and (i+1)-th vectors of the speech feature sequence T_n, the number n of vectors in T_n, the speech feature vector μ, and the tolerance parameter α.
The tolerance parameter alpha takes a value of 0.0001.
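The categorical part of the loss follows directly from the description (base-2 cross-entropy plus λ·LT). The exact LT formula is not reproduced in the text, so the hinge-style penalty below, which discourages jumps larger than the tolerance α in the distance from μ to consecutive frames t_i, t_{i+1}, is only an illustrative stand-in for the timing constraint, not the patent's formula:

```python
import numpy as np

def time_pool_loss(probs, labels, T, mu, lam=1e-4, alpha=1e-4):
    """Base-2 cross-entropy over M categories plus lam * LT.

    probs:  (N, M) predicted category probabilities from the time pool
    labels: (N,)   true category indices
    T:      (n, F) speech feature sequence T_n
    mu:     (F,)   pooled speech feature vector
    """
    N = probs.shape[0]
    p_true = probs[np.arange(N), labels]   # the sign function y_jc selects these
    ce = -np.mean(np.log2(p_true + 1e-12))

    # ASSUMED form of the timing regularizer LT (the patent's exact formula
    # is not reproduced in the text): penalize changes larger than the
    # tolerance alpha in the distance from mu to consecutive frames.
    d = np.linalg.norm(T - mu, axis=1)               # ||t_i - mu|| for i = 1..n
    lt = np.maximum(np.abs(np.diff(d)) - alpha, 0.0).sum()
    return ce + lam * lt

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
T = np.zeros((5, 4))
mu = np.zeros(4)
loss = time_pool_loss(probs, labels, T, mu)
```

With λ and α set to the document's values (0.0001 each), the cross-entropy term dominates early in training and the regularizer acts as a mild smoothness prior on the pooled representation.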
The feature representation layer of the time pool performs feature transformation. The pooling layer reduces dimensionality and produces the speech feature representation. The network layer transforms the dimension of this representation so that it matches that of the speech feature sequence T_n, which simplifies evaluating the timing-constraint formula.
In this embodiment, the training settings of step 2 are as follows:
the maximum number of iterations is set to 50000, the learning rate of the deep learning model to 0.00005, and the hyper-parameter λ to 0.0001. After iteration, the trained deep learning model is obtained, comprising the wav2vec2.0 model weights and the time pool weights.
In this embodiment, the specific process of step 3 is as follows:
the test set data are processed as in steps 1 and 2 and fed into the trained model to obtain a single-vector speech representation μ. 20000 pairs of data are selected as scoring data, comprising 10000 pairs from the same category and 10000 pairs from different categories, and scoring is finally performed with a cosine distance classifier.
Experimental results:
the invention adopts the dialect recognition task in the eastern language recognition large race 2020 to perform performance verification. The performance evaluation index adopts an Equal Error Rate (EER) and an average loss (Cavg), and the smaller the values of both the error rate (EER) and the average loss (Cavg) are, the better the performance is.
The minimum average loss achieved in the identification task in the method is 0.1323, and the minimum error rate is 13.32%; the performance of dialect recognition is improved to a greater extent than the performance of the i-vector model on Pytorch and the original wav2vec2.0 model given by OLR2020 official. As shown in fig. 3, under the same condition, the lowest average loss under CNNNoLT is 0.1532, and the lowest error rate is 15.41%; compared with CNNNoLT, the method CNNLT has the advantages that the relative average loss and the relative equivalent error rate of the dialect recognition task are reduced by 13.05% and 13.56%, respectively. As shown in fig. 4, under the same conditions, FCNoLT has a minimum average loss of 0.1595 and a minimum error rate of 15.74%; the relative average loss and relative equivalent error rate of the dialect recognition task were reduced by 15.31% and 16.55%, respectively, compared to FCNoLT. As shown in fig. 5, the relative average loss of the dialect recognition task was reduced by 24..49% compared to the performance of the x-vector model on the baseline system Pytorch, respectively; the relative average loss of the dialect recognition task was reduced by 8.76% compared to the performance of the original wav2vec2.0 model, respectively. As shown in fig. 6, the relative error rates of the dialect recognition tasks are respectively reduced by 32.52% compared with the performance of the x-vector model on the baseline system Pytorch; compared with the performance of the original wav2vec2.0 model, the relative equivalent error rate of the dialect recognition task is respectively reduced by 8.26 percent.
The above embodiments merely illustrate the design concept and features of the present invention and are intended to enable those skilled in the art to understand and implement it; the scope of the present invention is not limited to these embodiments. All equivalent changes according to the principles and ideas disclosed herein remain within the scope of the present invention.
Claims (8)
1. A language identification method based on deep time-sequence feature representation, characterized by comprising the following specific process:
step 1, acquiring audio data sets of different languages;
respectively carrying out data enhancement on the audio data sets of different languages;
cutting the audio data sets of different languages after data enhancement into audio data with the same length as a training set;
step 2, constructing a deep learning model and training it on the training set from step 1 until the set maximum number of iterations is reached, to obtain a trained deep learning model;
the deep learning model comprising, in order, a pre-training model, a time pool, and a fully connected layer;
and step 3, inputting the audio data under test into the trained deep learning model to obtain its language type.
2. The language identification method based on deep timing features according to claim 1, wherein: the data enhancement in the step 1 is to conduct data enhancement on each section of audio data in the audio data sets of different languages, and the audio data sets of different languages after the data enhancement are obtained;
data enhancement includes adding noise, speed enhancement, volume enhancement, tone enhancement, movement enhancement.
3. The language identification method based on deep timing features according to claim 2, wherein: the deep learning model is built in the step 2, the training set in the step 1 is input into the deep learning model for training until the set maximum iteration number is reached, and a trained deep learning model is obtained;
the deep learning model sequentially comprises a pre-training model, a time pool and a full-connection layer;
the specific process is as follows:
step 21, inputting the training set in step 1 into a pre-training model to obtain a voice characteristic sequenceT n ;
T n =[t 1 ,t 2 ,…,t i ,…,t n ]
Wherein t is i ∈R F Is a speech feature sequence T n The i-th vector of (a); f is potential speech feature T n Is a dimension of (2);
step 22, input the speech feature sequence T_n obtained in step 21 into the time pool to obtain a speech feature vector μ;
step 23, input the speech feature vector μ obtained in step 22 into the fully connected layer for prediction, obtaining the prediction result, i.e. the language type of the audio data;
and step 24, repeat steps 21 to 23 until the set maximum number of iterations is reached, obtaining the trained deep learning model.
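Steps 21 to 23 can be sketched end to end with toy stand-ins: a frozen random projection plays the role of the pre-training model, mean pooling plays the role of the time pool, and a linear layer plus softmax plays the fully connected layer. The dimensions (F = 8 feature dims, n = 50 frames, M = 4 languages) and the fixed framing scheme are assumptions for illustration, and no parameter update is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

F, n, M = 8, 50, 4                           # assumed toy dimensions
W_enc = rng.normal(0, 0.05, size=(320, F))   # stand-in for the pre-training model
W_fc = rng.normal(0, 0.10, size=(F, M))      # fully connected layer weights
b_fc = np.zeros(M)

def pretrain_encode(waveform):
    """Step 21: waveform -> speech feature sequence T_n of shape (n, F)."""
    frames = waveform[: n * 320].reshape(n, 320)  # crude fixed framing
    return frames @ W_enc

def time_pool(T):
    """Step 22: feature sequence -> single speech feature vector mu."""
    return T.mean(axis=0)

def predict(mu):
    """Step 23: fully connected layer + softmax -> language posteriors."""
    logits = mu @ W_fc + b_fc
    e = np.exp(logits - logits.max())
    return e / e.sum()

wave = rng.normal(size=16000)  # 1 s of fake audio at 16 kHz
probs = predict(time_pool(pretrain_encode(wave)))
```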
4. The language identification method based on deep temporal features according to claim 3, wherein: the pre-training model in step 21 is wav2vec2-base.
5. The language identification method based on deep temporal features according to claim 4, wherein: the time pool in step 22 is a CNNLT or an FCLT;
the loss function of the CNNLT or FCLT is:
L = -(1/N) · Σ_{i=1}^{N} Σ_{c=1}^{M} y_ic · log(p_ic) + λ · LT
wherein M is the number of audio data sample categories; y_ic is a sign function taking the value 0 or 1: it is 1 if the true category of audio data sample i equals c, and 0 otherwise; p_ic is the probability, predicted by the time pool, that audio data sample i belongs to category c; N is the total number of audio data samples input to the time pool; LT is a regularization term; and λ is a hyper-parameter;
the regularization term LT is a function of t_i, t_{i+1}, n, μ, and a tolerance parameter α, wherein t_i and t_{i+1} respectively denote the i-th and (i+1)-th vectors of the speech feature sequence T_n; n is the number of vectors in the speech feature sequence T_n; and μ is the speech feature vector.
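The cross-entropy part of the claim-5 loss follows directly from the symbol definitions above and can be checked numerically. The exact form of LT is given in the source only as a formula image, so the smoothness-style term over adjacent vectors t_i and t_{i+1} with tolerance α below is a labelled placeholder, not the patent's definition.

```python
import numpy as np

def cross_entropy(y, p):
    """-(1/N) * sum_i sum_c y_ic * log(p_ic), for one-hot y and posteriors p."""
    N = y.shape[0]
    return -np.sum(y * np.log(p + 1e-12)) / N

def lt_placeholder(T, alpha=1e-4):
    """ASSUMED smoothness term: penalize adjacent-frame jumps above alpha.
    This is a stand-in; the patent's LT formula is not reproduced in the text."""
    diffs = np.linalg.norm(T[1:] - T[:-1], axis=1)
    return np.mean(np.maximum(diffs - alpha, 0.0))

y = np.array([[1, 0], [0, 1]], dtype=float)   # N=2 samples, M=2 classes, one-hot
p = np.array([[0.9, 0.1], [0.2, 0.8]], dtype=float)
T = np.zeros((5, 3))                           # n=5 frames, F=3 (toy sequence)
lam = 0.01                                     # hyper-parameter lambda (assumed)
loss = cross_entropy(y, p) + lam * lt_placeholder(T)
```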
6. The language identification method based on deep temporal features according to claim 5, wherein: the CNNLT comprises, in sequence, a feature representation layer, a pooling layer, and a network layer;
the feature representation layer is a convolution layer;
the pooling layer is an average pooling layer;
the network layer is a feed-forward neural network.
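A numpy sketch of the CNNLT stack as described: a width-3, same-padded convolution over the time axis as the feature representation layer, average pooling over time, and a two-layer feed-forward network producing class logits. The kernel width and hidden size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, F, H, M = 50, 8, 16, 4                 # assumed frames, dims, hidden, classes
K = rng.normal(0, 0.1, size=(3, F, F))    # 1-D conv kernel of width 3
W1 = rng.normal(0, 0.1, size=(F, H))
W2 = rng.normal(0, 0.1, size=(H, M))

def conv1d(T, K):
    """Feature representation layer: same-padded 1-D convolution over time."""
    pad = np.pad(T, ((1, 1), (0, 0)))
    return np.stack([sum(pad[i + k] @ K[k] for k in range(3))
                     for i in range(T.shape[0])])

def cnnlt(T):
    h = np.maximum(conv1d(T, K), 0.0)     # convolution layer + ReLU
    mu = h.mean(axis=0)                   # average pooling layer over time
    return np.maximum(mu @ W1, 0.0) @ W2  # feed-forward network -> logits

logits = cnnlt(rng.normal(size=(n, F)))
```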
7. The language identification method based on deep temporal features according to claim 5, wherein: the FCLT comprises, in sequence, a feature representation layer, a pooling layer, and a network layer;
the feature representation layer is a fully connected layer;
the pooling layer is a mean-square-deviation pooling layer;
the network layer is a feed-forward neural network.
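A matching numpy sketch of the FCLT stack: a fully connected feature representation layer, then statistics pooling over time (mean and standard deviation concatenated, one plausible reading of the mean-square-deviation pooling layer, stated here as an assumption), then a feed-forward output layer.

```python
import numpy as np

rng = np.random.default_rng(2)
n, F, H, M = 50, 8, 16, 4                    # assumed frames, dims, hidden, classes
W_fc = rng.normal(0, 0.1, size=(F, H))       # fully connected layer weights
W_out = rng.normal(0, 0.1, size=(2 * H, M))  # mean and std are concatenated

def fclt(T):
    h = np.maximum(T @ W_fc, 0.0)            # fully connected layer + ReLU
    stats = np.concatenate([h.mean(axis=0),  # mean over time
                            h.std(axis=0)])  # standard deviation over time
    return stats @ W_out                     # feed-forward layer -> logits

logits = fclt(rng.normal(size=(n, F)))
```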
8. The language identification method based on deep temporal features according to claim 6 or 7, wherein: the tolerance parameter α takes the value 0.0001.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311388897.5A CN117292675A (en) | 2023-10-24 | 2023-10-24 | Language identification method based on deep time sequence feature representation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117292675A true CN117292675A (en) | 2023-12-26 |
Family
ID=89253491
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311388897.5A Pending CN117292675A (en) | 2023-10-24 | 2023-10-24 | Language identification method based on deep time sequence feature representation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117292675A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110782872A (en) * | 2019-11-11 | 2020-02-11 | 复旦大学 | Language identification method and device based on deep convolutional recurrent neural network |
CN113282718A (en) * | 2021-07-26 | 2021-08-20 | 北京快鱼电子股份公司 | Language identification method and system based on self-adaptive center anchor |
CN113611285A (en) * | 2021-09-03 | 2021-11-05 | 哈尔滨理工大学 | Language identification method based on stacked bidirectional time sequence pooling |
CN113823262A (en) * | 2021-11-16 | 2021-12-21 | 腾讯科技(深圳)有限公司 | Voice recognition method and device, electronic equipment and storage medium |
US20220121702A1 (en) * | 2020-10-20 | 2022-04-21 | Adobe Inc. | Generating embeddings in a multimodal embedding space for cross-lingual digital image retrieval |
Non-Patent Citations (1)
Title |
---|
CUI Ruilian; SONG Yan; JIANG Bing; DAI Lirong: "Language Identification Based on Deep Neural Networks", Pattern Recognition and Artificial Intelligence, no. 12, 15 December 2015 (2015-12-15) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||