CN117292675A - Language identification method based on deep time sequence feature representation - Google Patents

Language identification method based on deep time sequence feature representation

Info

Publication number
CN117292675A
CN117292675A (application CN202311388897.5A)
Authority
CN
China
Prior art keywords
audio data
learning model
deep learning
layer
language identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311388897.5A
Other languages
Chinese (zh)
Inventor
陈晨
陈勇
李微微
杨海陆
王莉莉
陈德运
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202311388897.5A priority Critical patent/CN117292675A/en
Publication of CN117292675A publication Critical patent/CN117292675A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0499 - Feedforward networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a language identification method based on deep time sequence feature representation, and belongs to the technical field of language identification. The invention aims to solve the problem that the accuracy of language identification by existing methods is low. The process is as follows: step 1, acquiring audio data sets of different languages; respectively carrying out data enhancement on the audio data sets of different languages; cutting the data-enhanced audio data sets of different languages into audio data segments of the same length to serve as the training set; step 2, constructing a deep learning model, and inputting the training set of step 1 into the deep learning model for training until the set maximum iteration number is reached, so as to obtain a trained deep learning model; the deep learning model sequentially comprises a pre-training model, a time pool and a full-connection layer; and step 3, inputting the audio data to be tested into the trained deep learning model to obtain the language type of the audio data to be tested.

Description

Language identification method based on deep time sequence feature representation
Technical Field
The invention relates to a language identification method based on deep time sequence feature representation, and belongs to the technical field of language identification.
Background
Language identification plays a vital role in modern society. As globalization deepens, we live in an increasingly interconnected world, and language identification technology has become essential: it is not only a technical breakthrough but also a bridge for communication between different cultures. By rapidly and accurately identifying the language a speaker uses, barriers to cross-cultural communication can be removed, which facilitates international cooperation and exchange. In the internet age, information spreads at an unprecedented speed and scale. This also presents a challenge: a large amount of multilingual content appears on the internet and social media, and language identification technology allows platforms to automatically identify and categorize such content so that users can more easily obtain the information they need. This not only improves the efficiency of information acquisition but also broadens users' horizons, enabling them to reach knowledge and culture from all over the world. In addition, language identification plays a key role in the operation of search engines: it enables a search engine to understand a user's query language more accurately and thus provide more pertinent, relevant search results. This is particularly important for users seeking specific information and raises the level of intelligence of search engines. In technical and commercial applications, language identification provides the basis for multilingual applications, software with multilingual support, and cross-language data analysis. In industry, language identification technology has also driven a series of innovations, such as intelligent customer service and multilingual support for multinational enterprises. These advances provide more intelligent and personalized services for users of different languages, improving the service level of enterprises and expanding international market opportunities. In the educational field, language identification offers unprecedented convenience for cross-cultural learning, allowing learners to access knowledge and culture around the globe more easily. In the field of medical health, it provides important support for communication between medical staff and patients and improves the quality and efficiency of medical services, especially in multilingual environments. Language identification therefore not only eliminates language barriers but also promotes the exchange and diversity of global cultures: it improves the efficiency of information acquisition and communication and drives technological innovation and cross-cultural understanding. The application of this technology continues to influence many areas of society, economy and culture, bringing far wider possibilities to our world.
Timing information is critical to language identification. The speech signal is time-varying and contains rich timing features, such as variations in the audio spectrum and the rhythm of the sound. This timing information provides important clues for distinguishing different languages. For example, some languages may have unique patterns in speech rhythm, while other languages may differ in the pause patterns between syllables. By analyzing these timing characteristics, a speech signal can be categorized into a particular language more accurately. In addition, timing information helps to distinguish between different accents or dialects within the same language. The same language may show subtle pronunciation variations across regions or groups, and these variations are often reflected in the timing characteristics. Accurately capturing the timing information therefore makes it possible to distinguish different accents or dialects more finely, which improves the accuracy of language identification.
Disclosure of Invention
The invention aims to solve the problem of the low accuracy of existing language identification methods, and provides a language identification method based on deep time sequence feature representation.
The language identification method based on deep time sequence characteristic representation comprises the following specific processes:
step 1, acquiring audio data sets of different languages;
respectively carrying out data enhancement on the audio data sets of different languages;
cutting the data-enhanced audio data sets of different languages into audio data segments of the same length to serve as the training set;
step 2, constructing a deep learning model, and inputting the training set in the step 1 into the deep learning model for training until the set maximum iteration number is reached, so as to obtain a trained deep learning model;
the deep learning model sequentially comprises a pre-training model, a time pool and a full-connection layer;
and step 3, inputting the audio data to be tested into a trained deep learning model to obtain the language type of the audio data to be tested.
The beneficial effects of the invention are as follows:
the method can extract the time sequence information between the voice frames and perform language identification. The method is called a language identification method based on deep time sequence characteristic representation.
In order to improve the characteristic representation capability of the neural network, the invention provides a language identification method based on deep time sequence characteristic representation, which can effectively improve the performance of a language identification system.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a network architecture according to the present invention, wherein the feature representation layer may take two different forms, namely a full connection layer and a convolution layer, and the methods corresponding to the different forms are abbreviated as FCLT and CNNLT, respectively, where NoLT refers to the condition without LT;
FIG. 3 is a comparison graph of the ablation experiment for the dialect identification task in the OLR2020 database between the proposed method FCLT and the comparison methods FCNoLT and wav2vec2.0, with performance evaluated by the equal error rate (EER) and the average loss (Cavg), where FCNoLT is the FCLT method without the timing constraint (LT);
FIG. 4 is a comparison graph of the ablation experiment for the identification task in the OLR2020 database between the proposed method CNNLT and the comparison methods CNNNoLT and wav2vec2.0, with performance evaluated by the equal error rate (EER) and the average loss (Cavg), where CNNNoLT is the CNNLT method without the timing constraint (LT);
FIG. 5 is a graph comparing the equal error rates (EER) of the inventive methods (FCLT and CNNLT) with the comparison methods FCNoLT, CNNNoLT, wav2vec2.0, [Pytorch] x-vector and [Kaldi] i-vector for the identification task in the OLR2020 database;
FIG. 6 is a graph comparing the average loss (Cavg) of the inventive methods (FCLT and CNNLT) with the comparison methods FCNoLT, CNNNoLT, wav2vec2.0, [Pytorch] x-vector and [Kaldi] i-vector for the identification task in the OLR2020 database.
Detailed Description
The first embodiment is as follows: the language identification method based on deep time sequence characteristic representation in the embodiment comprises the following specific processes:
step 1, acquiring audio data sets (a plurality of audio segments) of different languages;
respectively carrying out data enhancement on the audio data sets of different languages;
cutting the data-enhanced audio data sets of different languages into audio data segments of the same length to serve as the training set;
step 2, constructing a deep learning model, and inputting the training set in the step 1 into the deep learning model for training until the set maximum iteration number is reached, so as to obtain a trained deep learning model;
the deep learning model sequentially comprises a pre-training model, a time pool and a full-connection layer;
and step 3, inputting the audio data to be tested into a trained deep learning model to obtain the language type of the audio data to be tested.
The second embodiment differs from the first embodiment in that the data enhancement in step 1 is performed on each section of audio data in the audio data sets of the different languages, so as to obtain the data-enhanced audio data sets of the different languages;
data enhancement includes adding noise, speed enhancement, volume enhancement, pitch (tone) enhancement, shift (movement) enhancement, and the like.
Pitch is the vibration frequency of sound; volume is the vibration amplitude of sound; shift enhancement splits a section of audio data into pieces and splices them back together in an arbitrary order.
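For illustration, the following Python (NumPy) sketch gives minimal versions of the augmentation types listed above; the function names, default parameter values, and the naive resampling used for speed enhancement are assumptions, not details taken from the patent.

```python
import numpy as np

def add_noise(x: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Add white Gaussian noise at a given signal-to-noise ratio (value is illustrative)."""
    signal_power = np.mean(x ** 2) + 1e-12
    noise_power = signal_power / (10 ** (snr_db / 10))
    return x + np.random.randn(len(x)) * np.sqrt(noise_power)

def change_volume(x: np.ndarray, gain: float = 1.5) -> np.ndarray:
    """Volume enhancement: scale the vibration amplitude."""
    return np.clip(x * gain, -1.0, 1.0)

def change_speed(x: np.ndarray, rate: float = 1.1) -> np.ndarray:
    """Naive speed enhancement by resampling; a production pipeline would use a
    time-stretch that preserves pitch (e.g. a phase vocoder)."""
    n_out = int(len(x) / rate)
    return np.interp(np.linspace(0, len(x) - 1, n_out), np.arange(len(x)), x)

def shift_splice(x: np.ndarray, n_chunks: int = 4) -> np.ndarray:
    """Shift (movement) enhancement: split the clip into chunks and splice them
    back together in a random order, as described in the text."""
    chunks = np.array_split(x, n_chunks)
    order = np.random.permutation(n_chunks)
    return np.concatenate([chunks[i] for i in order])

# Pitch (tone) enhancement would change the vibration frequency without changing
# duration; a dedicated DSP routine (e.g. resampling combined with time-stretching)
# would be used for that step.
```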
Other steps and parameters are the same as in the first embodiment.
And a third specific embodiment: the difference between the embodiment and the first or second embodiment is that the deep learning model is built in the step 2, the training set in the step 1 is input into the deep learning model for training until the set maximum iteration number is reached, and a trained deep learning model is obtained;
the deep learning model sequentially comprises a pre-training model, a time pool and a full-connection layer;
the specific process is as follows:
step 21, inputting the training set in step 1 into a pre-training model to obtain a speech feature sequence T_n:
T_n = [t_1, t_2, …, t_i, …, t_n]
wherein t_i ∈ R^F is the i-th vector of the speech feature sequence T_n, and F is the dimension of the potential speech features T_n;
step 22, inputting the speech feature sequence T_n obtained in step 21 into the time pool to obtain the speech feature vector μ;
step 23, inputting the speech feature vector μ obtained in step 22 into the full-connection layer for prediction to obtain the prediction result, namely the language type of the audio data;
and step 24, repeatedly executing steps 21 to 23 until the set maximum iteration number is reached, so as to obtain the trained deep learning model.
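For illustration, the following PyTorch sketch shows one possible reading of steps 21 to 23, assuming the wav2vec2-base backbone named in the fourth embodiment and a generic time pool module (the CNNLT/FCLT variants are sketched later). The class name LanguageID, the default of six language classes, and the return of (logits, T_n, μ) for later use by the regularizer are illustrative assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model  # Hugging Face implementation of wav2vec 2.0

class LanguageID(nn.Module):
    """Sketch of steps 21-23: pre-training model -> time pool -> full-connection layer."""

    def __init__(self, time_pool: nn.Module, emb_dim: int = 768, num_languages: int = 6):
        super().__init__()
        # Step 21: the pre-trained wav2vec2-base backbone produces the speech feature sequence T_n
        self.backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        # Step 22: the time pool maps T_n (n frames x emb_dim) to a single vector mu
        self.time_pool = time_pool
        # Step 23: the full-connection layer predicts the language type from mu
        self.classifier = nn.Linear(emb_dim, num_languages)

    def forward(self, waveform: torch.Tensor):
        # waveform: (batch, samples) of raw 16 kHz audio
        t_n = self.backbone(waveform).last_hidden_state  # (batch, n, 768), the sequence T_n
        mu = self.time_pool(t_n)                          # (batch, 768), the vector mu
        logits = self.classifier(mu)                      # (batch, num_languages)
        return logits, t_n, mu
```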
Other steps and parameters are the same as in the first or second embodiment.
The specific embodiment IV is as follows: this embodiment differs from embodiments one to three in that the pre-training model in step 21 is wav2vec2-base.
Other steps and parameters are the same as in embodiments one to three.
Fifth embodiment: the difference between the present embodiment and embodiments one to four is that the time pool in step 22 is CNNLT or FCLT;
the loss function expression of the CNNLT or FCLT is as follows:
Loss = -(1/N) · Σ_{j=1..N} Σ_{c=1..M} y_jc · log2(p_jc) + λ · LT
wherein M is the number of audio data sample categories; y_jc is a sign (indicator) function taking the value 0 or 1, equal to 1 if the true category of audio data sample x_j is c and 0 otherwise; p_jc is the probability, output by the time pool, that the predicted audio data sample x_j belongs to category c; N represents the total number of audio data samples input to the time pool; LT represents the regularization term; λ represents a hyper-parameter; and the logarithm is taken to base 2;
the regularization term LT is introduced so that the time pool G satisfies the timing constraint condition;
wherein t_i and t_{i+1} respectively represent the i-th and (i+1)-th vectors of the speech feature sequence T_n; n is the number of vectors in the speech feature sequence T_n; μ is the speech feature vector; and α is the tolerance parameter.
The time pool is a network block that produces the speech feature vector μ satisfying the timing constraint condition. The time pool sequentially comprises a feature representation layer, a pooling layer and a network layer, and the timing constraint is imposed using the speech feature vector μ and the speech feature sequence T_n.
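For illustration, a minimal PyTorch sketch of this loss. The cross-entropy term follows the definitions above (base-2 logarithm, averaged over the N samples); because the exact expression of LT is not reproduced in the text, the timing_regularizer below is only a plausible hinge-style placeholder built from t_i, t_{i+1}, μ and the tolerance α, not the patent's formula.

```python
import math
import torch
import torch.nn.functional as F

def timing_regularizer(t_n: torch.Tensor, mu: torch.Tensor, alpha: float = 1e-4) -> torch.Tensor:
    """Placeholder for the LT term. The text only states that LT is computed from the
    consecutive frames t_i and t_{i+1}, the pooled vector mu, and the tolerance alpha;
    the hinge below is one plausible reading, not the patent's exact formula."""
    d_i = torch.norm(t_n[:, :-1, :] - mu.unsqueeze(1), dim=-1)    # ||t_i - mu|| for i = 1..n-1
    d_next = torch.norm(t_n[:, 1:, :] - mu.unsqueeze(1), dim=-1)  # ||t_{i+1} - mu||
    return torch.clamp(d_next - d_i - alpha, min=0.0).mean()

def total_loss(logits, labels, t_n, mu, lam: float = 1e-4, alpha: float = 1e-4) -> torch.Tensor:
    """Cross-entropy (with base-2 logarithm, as stated) plus lambda times the LT term."""
    ce_nats = F.cross_entropy(logits, labels)  # natural-log cross-entropy averaged over the batch
    ce_bits = ce_nats / math.log(2.0)          # convert log_e to log_2
    return ce_bits + lam * timing_regularizer(t_n, mu, alpha)
```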
Other steps and parameters are the same as in embodiments one to four.
Specific embodiment six: the present embodiment differs from embodiments one to five in that the CNNLT sequentially includes a feature representation layer, a pooling layer, and a network layer;
the characteristic representation layer is a convolution layer;
the pooling layer is an average pooling layer;
the network layer is a feed-Forward Neural Network (FNN).
Other steps and parameters are the same as in one of the first to fifth embodiments.
Seventh embodiment: this embodiment differs from embodiments one to five in that the FCLT includes, in order, a feature representation layer, a pooling layer, and a network layer;
the characteristic representation layer is a full connection layer;
the pooling layer is a mean square difference pooling layer;
the network layer is a feed-Forward Neural Network (FNN).
Other steps and parameters are the same as in one of the first to fifth embodiments.
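For illustration, a PyTorch sketch of the two time pool variants as described in the sixth and seventh embodiments. The kernel size, the hidden width of the feed-forward network, and the reading of the mean square difference pooling layer as mean-plus-standard-deviation (statistics) pooling are assumptions.

```python
import torch
import torch.nn as nn

class CNNLT(nn.Module):
    """Convolutional feature representation layer -> average pooling layer -> feed-forward network."""

    def __init__(self, dim: int = 768, hidden: int = 512, kernel: int = 3):
        super().__init__()
        self.feature = nn.Conv1d(dim, dim, kernel_size=kernel, padding=kernel // 2)          # feature representation layer
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))  # network layer

    def forward(self, t_n: torch.Tensor) -> torch.Tensor:      # t_n: (batch, n, dim)
        h = self.feature(t_n.transpose(1, 2)).transpose(1, 2)  # convolve along the time axis
        mu = h.mean(dim=1)                                     # average pooling over frames
        return self.ffn(mu)                                    # keeps mu in the same dimension as T_n

class FCLT(nn.Module):
    """Full-connection feature representation layer -> mean/standard-deviation pooling -> feed-forward network."""

    def __init__(self, dim: int = 768, hidden: int = 512):
        super().__init__()
        self.feature = nn.Linear(dim, dim)  # feature representation layer (full connection)
        # concatenating mean and standard deviation doubles the width; the FFN maps it back to dim
        self.ffn = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, t_n: torch.Tensor) -> torch.Tensor:      # t_n: (batch, n, dim)
        h = self.feature(t_n)
        mu = torch.cat([h.mean(dim=1), h.std(dim=1)], dim=-1)  # "mean square difference" pooling, read as mean + std
        return self.ffn(mu)
```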
Eighth embodiment: the difference between this embodiment and the sixth or seventh embodiment is that the tolerance parameter α has a value of 0.0001.
Other steps and parameters are the same as those of the sixth or seventh embodiment.
Examples:
the technical scheme adopted by the invention is a language identification method based on a deep learning realization time pool, which comprises the following steps:
step 1, acquiring audio data sets (a plurality of audio segments) of different languages;
respectively carrying out data enhancement on the audio data sets of different languages;
cutting the data-enhanced audio data sets of different languages into audio data segments of the same length to serve as the training set;
the present example is cut to a length of 1 second;
step 2, constructing a deep learning model, and inputting the training set in the step 1 into the deep learning model for training until the set maximum iteration number is reached, so as to obtain a trained deep learning model;
the deep learning model sequentially comprises a pre-training model, a time pool and a full-connection layer;
step 3, calculating the speech feature vector μ for the test set, sending μ into the full-connection layer for classification, randomly extracting 10000 pairs of data as scoring data, and finally scoring and classifying with a cosine distance classifier to verify the performance of the deep learning model; if the performance of the deep learning model reaches the required standard, step 4 is executed, otherwise step 2 continues to be executed;
step 4, inputting the audio data to be tested into a trained deep learning model to obtain the language type of the audio data to be tested;
in this embodiment, the specific process of step 2 is:
step 21, inputting the training set in step 1 into a pre-training model to obtain a speech feature sequence T_n:
T_n = [t_1, t_2, …, t_i, …, t_n]
wherein t_i ∈ R^F is the i-th vector of the speech feature sequence T_n, and F is the dimension of the potential speech features T_n;
step 22, inputting the speech feature sequence T_n obtained in step 21 into the time pool to obtain the speech feature vector μ;
step 23, inputting the speech feature vector μ obtained in step 22 into the full-connection layer for prediction to obtain the prediction result, namely the language type of the audio data;
and step 24, repeatedly executing steps 21 to 23 until the set maximum iteration number is reached, so as to obtain the trained deep learning model.
In this embodiment, the specific process of step 21 is as follows:
the present example uses a pre-trained wav2vec2.0 model.
The audio data of the training set in step 1 is fed into the pre-trained wav2vec2.0 model to obtain the potential speech feature sequence T_n, where T_n = [t_1, t_2, …, t_i, …, t_n];
here t_i ∈ R^F is the i-th vector of the speech feature sequence T_n; the value of n depends on the length of the input audio and is 49 in this example; F is the dimension of the potential speech features T_n, which depends on the model and is 768 in this example.
In this embodiment, the specific process of step 22 is:
The time pool takes one of two forms:
the first is CNNLT, which sequentially comprises a feature representation layer, a pooling layer and a network layer; the feature representation layer is a convolution layer, the pooling layer is an average pooling layer, and the network layer is a feed-forward neural network (FNN);
the second is FCLT, which also sequentially comprises a feature representation layer, a pooling layer and a network layer; the feature representation layer is a full connection layer, the pooling layer is a mean square difference pooling layer, and the network layer is a feed-forward neural network (FNN).
The loss function expression of the time pool is:
Loss = -(1/N) · Σ_{j=1..N} Σ_{c=1..M} y_jc · log2(p_jc) + λ · LT
wherein M is the number of audio data sample categories; y_jc is a sign (indicator) function taking the value 0 or 1, equal to 1 if the true category of audio data sample x_j is c and 0 otherwise; p_jc is the probability, output by the time pool, that the predicted audio data sample x_j belongs to category c; N represents the total number of audio data samples input to the time pool; LT represents the regularization term; λ represents a hyper-parameter; and the logarithm is taken to base 2;
the regularization term LT is introduced so that the time pool G satisfies the timing constraint condition;
wherein t_i and t_{i+1} respectively represent the i-th and (i+1)-th vectors of the speech feature sequence T_n; n is the number of vectors in the speech feature sequence T_n; μ is the speech feature vector; and α is the tolerance parameter.
The tolerance parameter α takes a value of 0.0001.
The feature representation layer of the time pool performs feature transformation. The pooling layer reduces the dimensionality and yields the speech feature representation. The network layer applies a dimension transformation to the speech feature representation so that its dimension is consistent with that of the speech feature sequence T_n, which makes the timing constraint formula convenient to compute.
In this embodiment, the specific process of step 2 is as follows:
the maximum iteration number is set to 50000, the learning rate of the deep learning model is set to 0.00005, and the super-parameter is set to 0.0001. And obtaining a deep learning model after iteration, wherein the deep learning model comprises a wav2vec2.0 model weight and a time pool weight.
In this embodiment, the specific process of step 3 is as follows:
the test set data is processed in the mode of step 1-2 and is sent to a trained model to obtain a single vector speech representation mu. And selecting 20000 pairs of data as scoring data, wherein 10000 pairs of data in the same category and 10000 pairs of data in different categories are respectively included, and finally scoring by using a cosine distance classifier.
Experimental results:
the invention adopts the dialect recognition task in the eastern language recognition large race 2020 to perform performance verification. The performance evaluation index adopts an Equal Error Rate (EER) and an average loss (Cavg), and the smaller the values of both the error rate (EER) and the average loss (Cavg) are, the better the performance is.
The minimum average loss achieved by the method on the identification task is 0.1323, and the minimum equal error rate is 13.32%; the dialect recognition performance is improved to a greater extent than that of the i-vector baseline model and the original wav2vec2.0 model provided officially by OLR2020. As shown in fig. 3, under the same conditions the lowest average loss with CNNNoLT is 0.1532 and the lowest equal error rate is 15.41%; compared with CNNNoLT, the method CNNLT reduces the relative average loss and the relative equal error rate of the dialect recognition task by 13.05% and 13.56%, respectively. As shown in fig. 4, under the same conditions FCNoLT gives a minimum average loss of 0.1595 and a minimum equal error rate of 15.74%; compared with FCNoLT, the relative average loss and the relative equal error rate of the dialect recognition task are reduced by 15.31% and 16.55%, respectively. As shown in fig. 5, the relative average loss of the dialect recognition task is reduced by 24.49% compared with the x-vector model of the Pytorch baseline system, and by 8.76% compared with the original wav2vec2.0 model. As shown in fig. 6, the relative equal error rate of the dialect recognition task is reduced by 32.52% compared with the x-vector model of the Pytorch baseline system, and by 8.26% compared with the original wav2vec2.0 model.
The above embodiments merely illustrate the design concept and features of the present invention and are intended to enable those skilled in the art to understand and implement the invention; the scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes made according to the principles and ideas disclosed in the present invention fall within the scope of protection of the present invention.

Claims (8)

1. The language identification method based on deep time sequence characteristic representation is characterized by comprising the following steps of: the method comprises the following specific processes:
step 1, acquiring audio data sets of different languages;
respectively carrying out data enhancement on the audio data sets of different languages;
cutting the data-enhanced audio data sets of different languages into audio data segments of the same length to serve as the training set;
step 2, constructing a deep learning model, and inputting the training set in the step 1 into the deep learning model for training until the set maximum iteration number is reached, so as to obtain a trained deep learning model;
the deep learning model sequentially comprises a pre-training model, a time pool and a full-connection layer;
and step 3, inputting the audio data to be tested into a trained deep learning model to obtain the language type of the audio data to be tested.
2. The language identification method based on deep timing features according to claim 1, wherein: the data enhancement in the step 1 is to conduct data enhancement on each section of audio data in the audio data sets of different languages, and the audio data sets of different languages after the data enhancement are obtained;
data enhancement includes adding noise, speed enhancement, volume enhancement, tone enhancement, movement enhancement.
3. The language identification method based on deep timing features according to claim 2, wherein: the deep learning model is built in the step 2, the training set in the step 1 is input into the deep learning model for training until the set maximum iteration number is reached, and a trained deep learning model is obtained;
the deep learning model sequentially comprises a pre-training model, a time pool and a full-connection layer;
the specific process is as follows:
step 21, inputting the training set in step 1 into a pre-training model to obtain a speech feature sequence T_n:
T_n = [t_1, t_2, …, t_i, …, t_n]
wherein t_i ∈ R^F is the i-th vector of the speech feature sequence T_n, and F is the dimension of the potential speech features T_n;
step 22, inputting the speech feature sequence T_n obtained in step 21 into the time pool to obtain the speech feature vector μ;
step 23, inputting the speech feature vector μ obtained in step 22 into the full-connection layer for prediction to obtain the prediction result, namely the language type of the audio data;
and step 24, repeatedly executing steps 21 to 23 until the set maximum iteration number is reached, so as to obtain the trained deep learning model.
4. The language identification method based on deep timing features according to claim 3, wherein: the pre-training model in the step 21 is wav2vec2-base.
5. The language identification method based on deep timing features according to claim 4, wherein: the time pool in the step 22 is CNNLT or FCLT;
the loss function expression of the CNNLT or FCLT is as follows:
Loss = -(1/N) · Σ_{j=1..N} Σ_{c=1..M} y_jc · log(p_jc) + λ · LT
wherein M is the number of audio data sample categories; y_jc is a sign (indicator) function taking the value 0 or 1, equal to 1 if the true category of audio data sample x_j is c and 0 otherwise; p_jc is the probability, output by the time pool, that the predicted audio data sample x_j belongs to category c; N represents the total number of audio data samples input to the time pool; LT represents a regularization term; and λ represents a hyper-parameter;
the regularization term LT is defined so that the time pool satisfies the timing constraint condition, wherein t_i and t_{i+1} respectively represent the i-th and (i+1)-th vectors of the speech feature sequence T_n; n is the number of vectors in the speech feature sequence T_n; μ is the speech feature vector; and α is the tolerance parameter.
6. The language identification method based on deep timing features according to claim 5, wherein: the CNNLT sequentially comprises a feature representation layer, a pooling layer and a network layer;
the characteristic representation layer is a convolution layer;
the pooling layer is an average pooling layer;
the network layer is a feedforward neural network.
7. The language identification method based on deep timing features according to claim 5, wherein: the FCLT sequentially comprises a characteristic representation layer, a pooling layer and a network layer;
the characteristic representation layer is a full connection layer;
the pooling layer is a mean square difference pooling layer;
the network layer is a feedforward neural network.
8. The language identification method based on deep timing features according to claim 6 or 7, wherein: the tolerance parameter alpha takes a value of 0.0001.
CN202311388897.5A 2023-10-24 2023-10-24 Language identification method based on deep time sequence feature representation Pending CN117292675A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311388897.5A CN117292675A (en) 2023-10-24 2023-10-24 Language identification method based on deep time sequence feature representation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311388897.5A CN117292675A (en) 2023-10-24 2023-10-24 Language identification method based on deep time sequence feature representation

Publications (1)

Publication Number Publication Date
CN117292675A true CN117292675A (en) 2023-12-26

Family

ID=89253491

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311388897.5A Pending CN117292675A (en) 2023-10-24 2023-10-24 Language identification method based on deep time sequence feature representation

Country Status (1)

Country Link
CN (1) CN117292675A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782872A (en) * 2019-11-11 2020-02-11 复旦大学 Language identification method and device based on deep convolutional recurrent neural network
US20220121702A1 (en) * 2020-10-20 2022-04-21 Adobe Inc. Generating embeddings in a multimodal embedding space for cross-lingual digital image retrieval
CN113282718A (en) * 2021-07-26 2021-08-20 北京快鱼电子股份公司 Language identification method and system based on self-adaptive center anchor
CN113611285A (en) * 2021-09-03 2021-11-05 哈尔滨理工大学 Language identification method based on stacked bidirectional time sequence pooling
CN113823262A (en) * 2021-11-16 2021-12-21 腾讯科技(深圳)有限公司 Voice recognition method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
崔瑞莲; 宋彦; 蒋兵; 戴礼荣: "Language Identification Based on Deep Neural Networks" (基于深度神经网络的语种识别), Pattern Recognition and Artificial Intelligence (模式识别与人工智能), no. 12, 15 December 2015 (2015-12-15) *

Similar Documents

Publication Publication Date Title
CN109829058A (en) A kind of classifying identification method improving accent recognition accuracy rate based on multi-task learning
CN101042868B (en) Clustering system, clustering method, and attribute estimation system using clustering system
CN111145718B (en) Chinese mandarin character-voice conversion method based on self-attention mechanism
CN110717018A (en) Industrial equipment fault maintenance question-answering system based on knowledge graph
JP2003036093A (en) Speech input retrieval system
CN116166782A (en) Intelligent question-answering method based on deep learning
KR20200105057A (en) Apparatus and method for extracting inquiry features for alalysis of inquery sentence
CN114203177A (en) Intelligent voice question-answering method and system based on deep learning and emotion recognition
CN111916064A (en) End-to-end neural network speech recognition model training method
Cao et al. Speaker-independent speech emotion recognition based on random forest feature selection algorithm
CN110348482B (en) Speech emotion recognition system based on depth model integrated architecture
CN112685538B (en) Text vector retrieval method combined with external knowledge
Farooq et al. Mispronunciation detection in articulation points of Arabic letters using machine learning
CN115860015B (en) Translation memory-based transcription text translation method and computer equipment
CN117216008A (en) Knowledge graph-based archive multi-mode intelligent compiling method and system
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
CN117292675A (en) Language identification method based on deep time sequence feature representation
CN112015921B (en) Natural language processing method based on learning auxiliary knowledge graph
CN114238595A (en) Metallurgical knowledge question-answering method and system based on knowledge graph
Alphonso et al. Ranking approach to compact text representation for personal digital assistants
Hacine-Gharbi et al. Automatic Classification of French Spontaneous Oral Speech into Injunction and No-injunction Classes.
CN113763939B (en) Mixed voice recognition system and method based on end-to-end model
CN114780786B (en) Voice keyword retrieval method based on bottleneck characteristics and residual error network
Nekomoto et al. akbl at the NTCIR-15 QA Lab-PoliInfo-2 Tasks
Çolakoğlu et al. Multi-lingual Speech Emotion Recognition System Using Machine Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination