CN111724766A - Language identification method, related equipment and readable storage medium

Info

Publication number: CN111724766A (application); CN111724766B (granted)
Application number: CN202010607693.6A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 杨军, 方磊, 方四安, 唐磊
Applicant and current assignee: Hefei Ustc Iflytek Co ltd
Legal status: Granted; active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/005: Language recognition
    • G10L15/08: Speech classification or search
    • G10L15/28: Constructional details of speech recognition systems
    • G10L15/32: Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems

Abstract

The application discloses a language identification method, related equipment and a readable storage medium. After voice data to be identified are obtained, the language features of the voice data are determined; first recognition is performed on the language features of the voice data by using a pre-established first language identification model to obtain a first language identification result; when the first language identification result is inaccurate, second recognition is performed on the language features of the voice data by using a pre-established second language identification model to obtain a second language identification result, and the language of the voice data is determined based on the first and second language identification results. In this scheme, if the first language identification result is inaccurate, the second language identification model, which has more network layers than the first language identification model, can be used for the second recognition, so that identification accuracy is improved.

Description

Language identification method, related equipment and readable storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a language identification method, a related device, and a readable storage medium.
Background
Language identification is the process by which a computer analyzes and processes voice data to judge its language type, and it is an important research direction of speech recognition. With the continuous acceleration of globalization, language identification has broad application prospects in fields such as multilingual information services, machine translation, and military security. In the prior art, models such as the Gaussian Mixture Model (GMM), the Support Vector Machine (SVM), and the Gaussian Mixture Model Supervector-Support Vector Machine (GSV-SVM) are mostly used for recognizing languages.
However, the accuracy of the language identification results obtained by the prior-art language identification methods is not ideal.
Therefore, it is necessary to optimize the language identification method in the prior art.
Disclosure of Invention
In view of the foregoing, the present application provides a language identification method, a related device and a readable storage medium. The specific scheme is as follows:
a language identification method comprises the following steps:
acquiring voice data to be recognized;
determining language features of the voice data;
performing first recognition on the language features of the voice data by using a pre-established first language recognition model to obtain a first language recognition result;
when the first language identification result is inaccurate, performing second identification on the language characteristics of the voice data by using a pre-established second language identification model to obtain a second language identification result; determining the language of the voice data based on the first language identification result and the second language identification result; the number of network layers of the second language identification model is more than that of the first language identification model.
Optionally, the determining the language characteristic of the voice data includes:
acquiring acoustic features of the voice data;
performing feature conversion on the acoustic features of the voice data by using a feature conversion module of a pre-established language feature extraction model to obtain converted features;
extracting time sequence features from the transformed features by utilizing a time sequence feature extraction module of the language feature extraction model;
and extracting the language features of the voice data from the time sequence features by utilizing a language feature extraction module of the language feature extraction model.
Optionally, the training process of the language feature extraction model includes:
acquiring training voice data;
determining acoustic features of each training speech data, and phoneme information of each training speech data;
training by taking the acoustic feature of each training voice data as a training sample and taking the phoneme information of the training voice data as a sample label to obtain a phoneme recognition model;
and removing an output layer of the phoneme recognition model to obtain the language feature extraction model.
Optionally, the performing first recognition on the language features of the voice data by using a pre-established first language identification model to obtain a first language identification result includes:
processing the language features of the voice data by using a mean value super vector feature extraction module of the first language identification model to obtain mean value super vector features of the language features;
and identifying the mean value super-vector characteristics of the language characteristics by using the language identification module of the first language identification model to obtain a first language identification result.
Optionally, the training process of the first language identification model includes:
acquiring a training voice data set corresponding to at least one language;
labeling the training voice data set corresponding to each language to obtain a labeling result of the training voice data set corresponding to each language, wherein the labeling result of the training voice data corresponding to each language is used for indicating the language of the training voice data set corresponding to the language;
determining the language features of each training voice data in a training voice data set corresponding to each language;
determining a mean value super vector characteristic set of a training voice data set corresponding to each language by using language characteristics of each training voice data;
and training by using the mean value super vector characteristic set of the training voice data set corresponding to each language and the labeling result of the training voice data set corresponding to each language to obtain the first language identification model.
Optionally, the determining, by using the language features of each training speech data, a mean value super vector feature set of a training speech data set corresponding to each language includes:
aiming at a training voice data set corresponding to each language, clustering each training voice data in the training voice data set corresponding to each language by using the language features of each training voice data to obtain a training voice data subset corresponding to each language;
aiming at each training voice data subset in the training voice data subsets corresponding to the languages, combining initial mean value super-vector characteristics of each training voice data in the training voice data subsets to obtain mean value super-vector characteristics of the training voice data subsets; and the mean value super vector characteristics of all the training voice data subsets corresponding to the languages form a mean value super vector characteristic set of the training voice data set corresponding to the languages.
Optionally, the second language identification model is obtained by training a preset end-to-end neural network model with the language features of the training voice data as training samples and the language labeled by the training voice data as sample labels.
A language identification device comprising:
the device comprises an acquisition unit, a recognition unit and a processing unit, wherein the acquisition unit is used for acquiring voice data to be recognized;
a language feature determination unit, configured to determine a language feature of the voice data;
the first language identification unit is used for carrying out first identification on the language characteristics of the voice data by utilizing a pre-established first language identification model to obtain a first language identification result;
the second language identification unit is used for carrying out second identification on the language characteristics of the voice data by utilizing a pre-established second language identification model to obtain a second language identification result when the first language identification result is inaccurate; the number of network layers of the second language identification model is more than that of the first language identification model;
and the language determining unit is used for determining the language of the voice data based on the first language identification result and the second language identification result.
Optionally, the language feature determining unit includes:
an acoustic feature acquisition unit, configured to acquire an acoustic feature of the voice data;
the feature conversion unit is used for performing feature conversion on the acoustic features of the voice data by using a feature conversion module of a pre-established language feature extraction model to obtain converted features;
a time sequence feature extraction unit, configured to extract time sequence features from the transformed features by using a time sequence feature extraction module of the language feature extraction model;
and the language feature extraction unit is used for extracting the language features of the voice data from the time sequence features by utilizing a language feature extraction module of the language feature extraction model.
Optionally, the training process of the language feature extraction model includes:
acquiring training voice data;
determining acoustic features of each training speech data, and phoneme information of each training speech data;
training by taking the acoustic feature of each training voice data as a training sample and taking the phoneme information of the training voice data as a sample label to obtain a phoneme recognition model;
and removing an output layer of the phoneme recognition model to obtain the language feature extraction model.
Optionally, the first language identification unit includes:
the mean value super vector feature determining unit is used for processing the language features of the voice data by using a mean value super vector feature extracting module of the first language identification model to obtain mean value super vector features of the language features;
and the recognition unit is used for recognizing the mean value super-vector feature of the language feature by using the language recognition module of the first language recognition model to obtain a first language recognition result.
Optionally, the training process of the first language identification model includes:
acquiring a training voice data set corresponding to at least one language;
labeling the training voice data set corresponding to each language to obtain a labeling result of the training voice data set corresponding to each language, wherein the labeling result of the training voice data corresponding to each language is used for indicating the language of the training voice data set corresponding to the language;
determining the language features of each training voice data in a training voice data set corresponding to each language;
determining a mean value super vector characteristic set of a training voice data set corresponding to each language by using language characteristics of each training voice data;
and training by using the mean value super vector characteristic set of the training voice data set corresponding to each language and the labeling result of the training voice data set corresponding to each language to obtain the first language identification model.
Optionally, the determining, by using the language features of each training speech data, a mean value super vector feature set of a training speech data set corresponding to each language includes:
aiming at a training voice data set corresponding to each language, clustering each training voice data in the training voice data set corresponding to each language by using the language features of each training voice data to obtain a training voice data subset corresponding to each language;
aiming at each training voice data subset in the training voice data subsets corresponding to the languages, combining initial mean value super-vector characteristics of each training voice data in the training voice data subsets to obtain mean value super-vector characteristics of the training voice data subsets; and the mean value super vector characteristics of all the training voice data subsets corresponding to the languages form a mean value super vector characteristic set of the training voice data set corresponding to the languages.
Optionally, the second language identification model is obtained by training a preset end-to-end neural network model with the language features of the training voice data as training samples and the language labeled by the training voice data as sample labels.
A language identification device comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the language identification method.
A readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the language identification method as described above.
By means of the above technical scheme, the application discloses a language identification method, related equipment and a readable storage medium. After voice data to be identified are obtained, the language features of the voice data are determined; first recognition is performed on the language features of the voice data by using a pre-established first language identification model to obtain a first language identification result; when the first language identification result is inaccurate, second recognition is performed on the language features of the voice data by using a pre-established second language identification model to obtain a second language identification result, and the language of the voice data is determined based on the first and second language identification results. In this scheme, if the first language identification result is inaccurate, the second language identification model, which has more network layers than the first language identification model, can be used for the second recognition, so that identification accuracy is improved.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart illustrating a language identification method disclosed in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a language feature extraction model disclosed in an embodiment of the present application;
FIG. 3 is a schematic diagram of a structure of a phoneme recognition model disclosed in the embodiments of the present application;
FIG. 4 is a schematic structural diagram of a first language identification model according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a language identification device according to an embodiment of the present application;
fig. 6 is a block diagram of a hardware structure of a language identification device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Next, the language identification method provided by the present application is described by the following embodiments.
Referring to fig. 1, fig. 1 is a schematic flow chart of a language identification method disclosed in an embodiment of the present application, where the method may include:
step S101: and acquiring voice data to be recognized.
The voice data to be recognized is voice data spoken by a user according to application requirements, such as voice data input by the user when making a call, or voice data input through a voice input method when the user uses an instant chat tool; this application imposes no limitation here.
Step S102: and determining language features of the voice data.
It should be noted that although acoustic features of the voice data, such as SDC (Shifted Delta Cepstral) features, could be adopted as the language features of the voice data in the present application, the language information contained in acoustic features is often limited and cannot guarantee high recognition accuracy. For example, when the SDC feature of the voice data is used as its language feature, if the voice data is short speech data with an effective duration below a preset duration (e.g., 3 seconds), the SDC feature is short and contains little language information, which may make the language recognition result of the short speech data inaccurate. Therefore, in the present application, the language feature of the voice data may be another feature that contains more language information and is determined based on acoustic features such as the SDC feature.
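For reference, the SDC feature stacks k shifted delta blocks computed from frame-level cepstra. Below is a minimal NumPy sketch of one common SDC parameterization; the d, p, k values are assumptions for illustration, since the patent does not specify the SDC configuration.

```python
import numpy as np

def sdc_features(c, d=1, p=3, k=7):
    """Shifted Delta Cepstral features: at frame t, the i-th block is
    c(t + i*p + d) - c(t + i*p - d); the k blocks are stacked.

    c: (T, N) frame-level cepstra (e.g., MFCCs). Edge frames are clamped.
    Returns a (T, N * k) array.
    """
    t = np.arange(c.shape[0])
    blocks = []
    for i in range(k):
        plus = np.clip(t + i * p + d, 0, c.shape[0] - 1)
        minus = np.clip(t + i * p - d, 0, c.shape[0] - 1)
        blocks.append(c[plus] - c[minus])
    return np.concatenate(blocks, axis=1)
```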
The specific implementation manner of determining the language features of the voice data will be described in detail by the following embodiments.
Step S103: and performing first recognition on the language features of the voice data by using a pre-established first language recognition model to obtain a first language recognition result.
Traditional language identification models, such as Gaussian Mixture Models (GMM), Support Vector Machines (SVM), and Gaussian Mixture Model Supervector-Support Vector Machines (GSV-SVM), are trained on acoustic features such as SDC features, and since such acoustic features contain little language information, the language identification accuracy of traditional models is low.
In the application, the pre-established first language identification model may be a model obtained by retraining a conventional language identification model by using a language feature of training data that includes more language information.
The specific implementation manner of the first language recognition result obtained by performing the first recognition on the language features of the speech data by using the first language recognition model will be described in detail in the following embodiments.
Step S104: judging whether the first language identification result is accurate; when it is not accurate, executing step S105 and step S106; when it is accurate, executing step S107.
In the present application, there are various ways to determine whether the first language identification result is accurate.
As an implementation manner, target languages (for example, Chinese, English, French, and others) may be preset, and the first recognition result may include a first score of the voice data for each target language. A specific implementation of judging whether the first language identification result is accurate may then be as follows: judge whether the difference between the highest first score and the lowest first score meets a preset condition; if so, determine that the first language identification result is accurate, and otherwise determine that it is inaccurate. The preset condition may be being greater than or equal to a preset threshold, falling within a preset interval, and the like; this application imposes no limitation here.
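A minimal sketch of this judgment, assuming the first scores sit in a dictionary and using an illustrative threshold (the patent leaves the preset condition open):

```python
def first_result_is_accurate(first_scores, threshold=0.5):
    """first_scores: dict mapping each target language to its first score.
    threshold is a hypothetical value for the preset condition."""
    ranked = sorted(first_scores.values(), reverse=True)
    # gap between the highest and lowest first scores must meet the condition
    return (ranked[0] - ranked[-1]) >= threshold
```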
Step S105: and performing secondary recognition on the language features of the voice data by using a pre-established second language recognition model to obtain a secondary language recognition result.
In the application, the number of network layers of the second language identification model is greater than that of the first language identification model, so that the language identification accuracy of the second language identification model is higher than that of the first language identification model.
The specific implementation manner of the second language recognition result obtained by performing the second recognition on the language features of the speech data by using the second language recognition model will be described in detail in the following embodiments.
Step S106: and determining the language of the voice data based on the first language identification result and the second language identification result.
In this application, target languages (for example, Chinese, English, French, and others) may be preset, the first recognition result may include a first score of the voice data for each target language, and the second recognition result may include a second score of the voice data for each target language. A specific implementation of determining the language of the voice data based on the first language identification result and the second language identification result may be: determining a final score of the voice data for each target language based on its first score and second score for that target language, and then determining the target language with the highest final score as the language of the voice data.
Based on the first score and the second score of the voice data for each target language, the final score for each target language may be determined as follows: preset a weight for the first recognition result and a weight for the second recognition result, and fuse the first score and the second score for each target language according to these weights to obtain the final score of the voice data for that target language.
For ease of understanding, assume the weight of the first recognition result is α and the weight of the second recognition result is 1 - α. If the first score of the voice data for the target language Chinese is 0.8 and the second score for Chinese is 0.6, the final score of the voice data for Chinese is 0.8 × α + 0.6 × (1 - α).
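The same fusion as a short Python sketch; the weight value and the language set are assumptions for illustration:

```python
def fuse_scores(first, second, alpha=0.7):
    """alpha: preset weight of the first recognition result (assumed value);
    the second recognition result is weighted by 1 - alpha."""
    return {lang: alpha * first[lang] + (1 - alpha) * second[lang]
            for lang in first}

final = fuse_scores({"Chinese": 0.8, "English": 0.1},
                    {"Chinese": 0.6, "English": 0.3})
language = max(final, key=final.get)  # target language with the highest final score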
Step S107: and determining the language of the voice data based on the first language identification result.
In this application, target languages (for example, Chinese, English, French, and others) may be preset, and the first recognition result may include a first score of the voice data for each target language. Determining the language of the voice data based on the first language identification result may then include: determining the target language with the highest first score as the language of the voice data.
The embodiment discloses a language identification method: after voice data to be identified are obtained, the language features of the voice data are determined; first recognition is performed on the language features using a pre-established first language identification model to obtain a first language identification result; when the first language identification result is inaccurate, second recognition is performed on the language features using a pre-established second language identification model to obtain a second language identification result, and the language of the voice data is determined based on the first and second language identification results. In this scheme, if the first language identification result is inaccurate, the second language identification model, which has more network layers than the first language identification model, can be used for the second recognition, so that identification accuracy is improved.
In addition, in the present application, language identification is not performed twice on all voice data; only voice data whose first language identification result is inaccurate undergo the second language identification. When many pieces of voice data need language identification, this improves recognition speed compared with performing language identification twice on every piece of voice data.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a language feature extraction model disclosed in an embodiment of the present application. The language feature extraction model includes a feature transformation module, a time-series feature extraction module, and a language feature extraction module. Since a DNN (Deep Neural Network) is good at performing nonlinear transformation on data, the feature transformation module can be realized based on a DNN in the present application. Since a BiLSTM (Bi-directional Long Short-Term Memory) network is good at analyzing time series, the time-series feature extraction module can be realized based on a BiLSTM. Since a BN (Bottleneck Network) layer can reduce the dimensionality of the features from the preceding network layer and improve the training speed of the model, the language feature extraction module can be realized based on a BN layer.
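For illustration, a minimal PyTorch sketch of such a DNN -> BiLSTM -> bottleneck stack; all layer sizes are assumptions, not values from the patent:

```python
import torch.nn as nn

class LanguageFeatureExtractor(nn.Module):
    def __init__(self, sdc_dim=56, hidden=512, bottleneck=64):
        super().__init__()
        self.dnn = nn.Sequential(                   # feature transformation module
            nn.Linear(sdc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.bilstm = nn.LSTM(hidden, hidden // 2,  # time-series feature module
                              batch_first=True, bidirectional=True)
        self.bottleneck = nn.Linear(hidden, bottleneck)  # BN language feature module

    def forward(self, x):              # x: (batch, frames, sdc_dim)
        h = self.dnn(x)
        h, _ = self.bilstm(h)          # bidirectional outputs, size `hidden`
        return self.bottleneck(h)      # frame-level language features
```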
Based on the language feature extraction model shown in fig. 2, a specific implementation manner of determining the language feature of the voice data in step S102 is described in this application. The method can comprise the following steps:
step S201: and acquiring acoustic features of the voice data.
In the present application, the acoustic feature of the voice data may be an SDC feature of the voice data.
Step S202: and performing feature conversion on the acoustic features of the voice data by using a feature conversion module of a pre-established language feature extraction model to obtain converted features.
The transformed features are nonlinear features corresponding to the acoustic features of the speech data.
Step S203: and extracting time sequence characteristics from the transformed characteristics by utilizing a time sequence characteristic extraction module of the language characteristic extraction model.
Step S204: and extracting the language features of the voice data from the time sequence features by utilizing a language feature extraction module of the language feature extraction model.
In this application, the language feature extraction module of the language feature extraction model can perform dimensionality reduction processing on the time sequence feature to obtain the language feature of the voice data.
For the training of the language feature extraction model, theoretically the acoustic features of training voice data could be used as training samples and language features labeled on the training voice data as sample labels. However, labeling a piece of voice data with its corresponding language features is impractical, whereas mature speech recognition models can already obtain the phoneme information of voice data. Therefore, in the present application, a phoneme recognition model may be preset, and the language feature extraction model may be obtained by training the phoneme recognition model. The details are as follows:
please refer to fig. 3, which is a schematic structural diagram of a phoneme recognition model disclosed in an embodiment of the present application, the phoneme recognition model includes a feature transformation module, a time-series feature extraction module, a language feature extraction module, and an output layer, wherein the feature transformation module, the time-series feature extraction module, and the language feature extraction module may be modules of the language feature extraction model.
In the present application, the language feature extraction model may be obtained by training the phoneme recognition model and then removing an output layer of the phoneme recognition model.
Based on the phoneme recognition model shown in fig. 3, the training process of the language feature extraction model may include:
step S301: training speech data is obtained.
Step S302: acoustic features of each training speech data are determined, and phoneme information of each training speech data is determined.
In the present application, the acoustic feature of each training speech data and the phoneme information of each training speech data may be obtained based on a conventional speech recognition model. In this regard, the present application is not described further.
Step S303: and training to obtain a phoneme recognition model by taking the acoustic features of each training voice data as training samples and the phoneme information of the training voice data as sample labels.
Step S304: and removing an output layer of the phoneme recognition model to obtain the language feature extraction model.
As can be seen from fig. 2 and 3, the language feature extraction model shown in fig. 2 can be obtained by removing the output layer of the phoneme recognition model shown in fig. 3.
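Continuing the PyTorch sketch above, the phoneme recognition model can be viewed as the extractor plus an output layer that is discarded after training; the number of phoneme classes is an assumed value:

```python
import torch.nn as nn

class PhonemeRecognizer(nn.Module):
    def __init__(self, extractor, feat_dim=64, num_phonemes=200):
        super().__init__()
        self.extractor = extractor
        self.output = nn.Linear(feat_dim, num_phonemes)  # discarded after training

    def forward(self, x):
        return self.output(self.extractor(x))  # frame-level phoneme logits

model = PhonemeRecognizer(LanguageFeatureExtractor())
# ... train with acoustic features as samples and phoneme labels as targets ...
language_feature_model = model.extractor  # output layer removed (step S304)
```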
Referring to fig. 4, fig. 4 is a schematic structural diagram of a first language identification model disclosed in an embodiment of the present application. The first language identification model includes a mean value super vector feature extraction module and a language identification module, where the language identification module can adopt a Support Vector Machine (SVM) algorithm to identify languages.
Based on the structure of the first language identification model shown in fig. 4, in another embodiment of the present application, a specific implementation manner of performing the first recognition on the language features of the voice data by using the pre-established first language identification model in step S103 to obtain a first language recognition result is described, which may include the following steps:
step S401: processing the language features of the voice data by using a mean value super vector feature extraction module in a first language identification model to obtain mean value super vector features of the language features;
step S402: and identifying the mean value super-vector characteristics of the language characteristics by utilizing a language identification module in the first language identification model to obtain a first language identification result.
It should be noted that the training process of the first language identification model may include:
step S501: and acquiring a training voice data set corresponding to at least one language.
In the present application, target languages (for example, Chinese, English, French, and others) may be preset, and a training voice data set corresponding to each target language is obtained. It should be noted that, to ensure the model effect, the voice data in the training voice data set corresponding to each target language should include both long voice data with a duration longer than a first preset duration (for example, 3 seconds) and short voice data with a duration not longer than the first preset duration, and the total duration of all the voice data needs to reach a second preset duration (for example, 20 hours).
Step S502: and labeling the training voice data set corresponding to each language to obtain a labeling result of the training voice data set corresponding to each language.
And the labeling result of the training voice data corresponding to each language is used for indicating the language of the training voice data set corresponding to the language.
Step S503: and determining the language features of each training voice data in the training voice data set corresponding to each language.
In the present application, each training speech data may be processed based on the language feature extraction model to obtain the language feature of each training speech data.
Step S504: and determining a mean value super-vector characteristic set of the training voice data set corresponding to each language by using the language characteristics of each training voice data.
In the application, the language features of each training voice data can be utilized to determine a universal background model and a total variability space matrix.
For the training voice data set corresponding to each language, the mean value super vector feature set of that data set is determined by utilizing the language features of each training voice data in the data set, the universal background model, and the total variability space matrix.
As an implementation manner, for each training voice data in the training voice data set corresponding to each language, the language features of that training voice data, the universal background model, and the total variability space matrix are used to determine its initial mean value super vector feature, and the initial mean value super vector features of all the training voice data constitute the mean value super vector feature set.
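One common way to obtain such an initial mean value super vector is relevance-MAP adaptation of the universal background model means; the sketch below is an assumption for illustration, since the patent does not spell out the computation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(rng.normal(size=(2000, 64)))    # UBM over pooled language features

def initial_mean_supervector(frames, r=16.0):
    """Relevance-MAP adapt the UBM means to one utterance and stack them.
    frames: (T, dim) language features; r: relevance factor (assumed)."""
    post = ubm.predict_proba(frames)     # (T, n_components) posteriors
    n = post.sum(axis=0)                 # zeroth-order statistics
    f = post.T @ frames                  # first-order statistics
    alpha = (n / (n + r))[:, None]       # per-component adaptation weights
    adapted = alpha * (f / np.maximum(n, 1e-8)[:, None]) + (1 - alpha) * ubm.means_
    return adapted.reshape(-1)           # concatenated mean super vector

sv = initial_mean_supervector(rng.normal(size=(300, 64)))  # shape (8 * 64,)
```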
However, the training speech data set corresponding to each language includes a large number of training speech data, and if the initial mean value supervector features of all the training speech data are combined into the mean value supervector feature set, the convergence rate of the first language identification model is slow.
Therefore, another embodiment is proposed in the present application, which aims to reduce the number of mean supervector features in the mean supervector feature set of the training speech data set corresponding to each language, and improve the convergence rate of the first language identification model. The method specifically comprises the following steps:
step S5041: and aiming at the training voice data set corresponding to each language, clustering each training voice data in the training voice data set corresponding to each language by using the language characteristics of each training voice data to obtain a training voice data subset corresponding to each language.
In the application, for the training voice data set corresponding to each language, the language features of each training voice data in the data set, the universal background model, and the total variability space matrix are utilized to determine the initial mean value super vector feature and the i-vector feature of each training voice data in the set; each training voice data is then clustered based on its initial mean value super vector feature or i-vector feature to obtain the training voice data subsets corresponding to the language. Each training voice data subset includes at least one training voice data.
It should be noted that, based on the initial mean value super vector feature or the i-vector feature of each training voice data, the clustering may be performed as follows: calculate the similarity between the initial mean value super vector features or the i-vector features of the training voice data, and cluster the training voice data based on the similarity. Specifically, several training voice data whose initial mean value super vector features or i-vector features have high similarity may be clustered into one training voice data subset.
It should be further noted that although all subsets obtained after clustering could be taken as the training voice data subsets corresponding to the language, subsets containing fewer training voice data than a preset threshold (for example, 3) would make the mean value super vectors in the final mean value super vector feature set relatively scattered, which is not conducive to training the model, so such subsets may be discarded.
For ease of understanding, assume the training voice data set corresponding to the target language Chinese includes 5000 pieces of training voice data and 1000 training voice data subsets are obtained through clustering, of which 200 subsets contain fewer training voice data than the preset threshold (for example, 3). The 200 subsets are discarded, and the remaining 800 subsets are the training voice data subsets corresponding to the target language Chinese. The final mean value super vector feature set of the training voice data set corresponding to Chinese then includes only 800 mean value super vector features, far fewer than the original 5000.
Step S5042: aiming at each training voice data subset in the training voice data subsets corresponding to the languages, combining initial mean value super-vector characteristics of each training voice data in the training voice data subsets to obtain mean value super-vector characteristics of the training voice data subsets; and the mean value super vector characteristics of all the training voice data subsets corresponding to the languages form a mean value super vector characteristic set of the training voice data set corresponding to the languages.
For ease of understanding, assume the training voice data set corresponding to the target language Chinese includes 5000 pieces of training voice data. If the training voice data subsets were determined without clustering, the mean value super vector feature set of the training voice data set would include 5000 mean value super vectors, making the convergence rate of the first language identification model too low; with clustering, the number of mean value super vectors in the feature set is greatly reduced, which improves the convergence rate of the first language identification model.
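A minimal scikit-learn (1.2+ API) sketch of steps S5041 and S5042: cluster per-utterance mean value super vectors, discard sparse subsets, and average within each kept subset. The clustering algorithm, cluster count, and threshold are assumptions; the patent does not prescribe a specific method.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def merged_supervector_set(supervectors, n_clusters=1000, min_size=3):
    """supervectors: (n_utterances, dim) initial mean value super vectors
    of one language's training set. Returns the reduced feature set."""
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, metric="cosine", linkage="average"
    ).fit_predict(supervectors)
    merged = []
    for c in range(n_clusters):
        members = supervectors[labels == c]
        if len(members) >= min_size:             # discard sparse subsets
            merged.append(members.mean(axis=0))  # mean super vector of the subset
    return np.stack(merged)
```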
Step S505: and training by using the mean value super vector characteristic set of the training voice data set corresponding to each language and the labeling result of the training voice data set corresponding to each language to obtain the first language identification model.
In another embodiment of the present application, the second language identification model is obtained by training a preset end-to-end neural network model with the language features of training voice data as training samples and the languages labeled on the training voice data as sample labels. As an implementation manner, the preset neural network model can be an end-to-end TDNN (Time-Delay Neural Network).
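For illustration, a minimal PyTorch sketch of an end-to-end TDNN classifier over frame-level language features; context widths, channel sizes, and the pooling scheme are assumptions, as the patent only names the model family:

```python
import torch
import torch.nn as nn

class TDNNLanguageClassifier(nn.Module):
    def __init__(self, feat_dim=64, num_languages=4):
        super().__init__()
        self.tdnn = nn.Sequential(   # dilated 1-D convolutions = time-delay layers
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.classifier = nn.Linear(1024, num_languages)

    def forward(self, x):                         # x: (batch, frames, feat_dim)
        h = self.tdnn(x.transpose(1, 2))          # (batch, 512, frames')
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # utterance pooling
        return self.classifier(stats)             # second score per language
```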
Based on the above scheme, in the present application, the second language identification model has more network layers and therefore higher language identification accuracy. Theoretically, recognizing voice data with the second language identification model alone would yield a highly accurate result. However, because of its larger number of network layers, the second language identification model needs a long time to process input voice data before outputting its language. In this case, for language identification scenarios with high real-time requirements, simply adopting the second language identification model cannot satisfy the real-time requirement. Therefore, in the present application, voice data is first recognized by the first language identification model, which has fewer network layers, to obtain the first language identification result; if that result is inaccurate, the second language identification model is used for recognition. This ensures both the accuracy of the language identification result and the efficiency of language identification.
The following describes the language identification device disclosed in the embodiment of the present application, and the language identification device described below and the language identification method described above may be referred to correspondingly.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a language identification device disclosed in the embodiment of the present application. As shown in fig. 5, the language identification means may include:
an acquisition unit 11 configured to acquire voice data to be recognized;
a language feature determination unit 12, configured to determine a language feature of the voice data;
a first language identification unit 13, configured to perform first identification on language features of the voice data by using a pre-established first language identification model to obtain a first language identification result;
a second language identification unit 14, configured to perform, when the first language identification result is inaccurate, a second language identification on the language features of the speech data by using a second language identification model that is established in advance, so as to obtain a second language identification result; the number of network layers of the second language identification model is more than that of the first language identification model;
a language determining unit 15, configured to determine a language of the voice data based on the first language identification result and the second language identification result.
Optionally, the language feature determining unit includes:
an acoustic feature acquisition unit, configured to acquire an acoustic feature of the voice data;
the feature conversion unit is used for performing feature conversion on the acoustic features of the voice data by using a feature conversion module of a pre-established language feature extraction model to obtain converted features;
a time sequence feature extraction unit, configured to extract time sequence features from the transformed features by using a time sequence feature extraction module of the language feature extraction model;
and the language feature extraction unit is used for extracting the language features of the voice data from the time sequence features by utilizing a language feature extraction module of the language feature extraction model.
Optionally, the training process of the language feature extraction model includes:
acquiring training voice data;
determining acoustic features of each training speech data, and phoneme information of each training speech data;
training by taking the acoustic feature of each training voice data as a training sample and taking the phoneme information of the training voice data as a sample label to obtain a phoneme recognition model;
and removing an output layer of the phoneme recognition model to obtain the language feature extraction model.
Optionally, the first language identification unit includes:
the mean value super vector feature determining unit is used for processing the language features of the voice data by using a mean value super vector feature extracting module of the first language identification model to obtain mean value super vector features of the language features;
and the recognition unit is used for recognizing the mean value super-vector feature of the language feature by using the language recognition module of the first language recognition model to obtain a first language recognition result.
Optionally, the training process of the first language identification model includes:
acquiring a training voice data set corresponding to at least one language;
labeling the training voice data set corresponding to each language to obtain a labeling result of the training voice data set corresponding to each language, wherein the labeling result of the training voice data corresponding to each language is used for indicating the language of the training voice data set corresponding to the language;
determining the language features of each training voice data in a training voice data set corresponding to each language;
determining a mean value super vector characteristic set of a training voice data set corresponding to each language by using language characteristics of each training voice data;
and training by using the mean value super vector characteristic set of the training voice data set corresponding to each language and the labeling result of the training voice data set corresponding to each language to obtain the first language identification model.
Optionally, the determining, by using the language features of each training speech data, a mean value super vector feature set of a training speech data set corresponding to each language includes:
aiming at a training voice data set corresponding to each language, clustering each training voice data in the training voice data set corresponding to each language by using the language features of each training voice data to obtain a training voice data subset corresponding to each language;
aiming at each training voice data subset in the training voice data subsets corresponding to the languages, combining initial mean value super-vector characteristics of each training voice data in the training voice data subsets to obtain mean value super-vector characteristics of the training voice data subsets; and the mean value super vector characteristics of all the training voice data subsets corresponding to the languages form a mean value super vector characteristic set of the training voice data set corresponding to the languages.
Optionally, the second language identification model is obtained by training a preset end-to-end neural network model with the language features of the training voice data as training samples and the language labeled by the training voice data as sample labels.
Referring to fig. 6, fig. 6 is a block diagram of a hardware structure of the language identification device according to the embodiment of the present application, and referring to fig. 6, the hardware structure of the language identification device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit CPU, or an application specific Integrated circuit asic, or one or more Integrated circuits configured to implement embodiments of the present invention, etc.;
the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory (non-volatile memory) or the like, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring voice data to be recognized;
determining language features of the voice data;
performing first recognition on the language features of the voice data by using a pre-established first language recognition model to obtain a first language recognition result;
when the first language identification result is inaccurate, performing second identification on the language characteristics of the voice data by using a pre-established second language identification model to obtain a second language identification result; determining the language of the voice data based on the first language identification result and the second language identification result; the number of network layers of the second language identification model is more than that of the first language identification model.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a readable storage medium, where a program suitable for being executed by a processor may be stored, where the program is configured to:
acquiring voice data to be recognized;
determining language features of the voice data;
performing first recognition on the language features of the voice data by using a pre-established first language recognition model to obtain a first language recognition result;
when the first language identification result is inaccurate, performing second identification on the language characteristics of the voice data by using a pre-established second language identification model to obtain a second language identification result; determining the language of the voice data based on the first language identification result and the second language identification result; the number of network layers of the second language identification model is more than that of the first language identification model.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A language identification method, comprising:
acquiring voice data to be recognized;
determining language features of the voice data;
performing first recognition on the language features of the voice data by using a pre-established first language recognition model to obtain a first language recognition result;
when the first language identification result is inaccurate, performing second identification on the language characteristics of the voice data by using a pre-established second language identification model to obtain a second language identification result; determining the language of the voice data based on the first language identification result and the second language identification result; the number of network layers of the second language identification model is more than that of the first language identification model.
2. The method of claim 1, wherein the determining the language features of the voice data comprises:
acquiring acoustic features of the voice data;
performing feature conversion on the acoustic features of the voice data by using a feature conversion module of a pre-established language feature extraction model to obtain converted features;
extracting time sequence features from the converted features by using a time sequence feature extraction module of the language feature extraction model;
and extracting the language features of the voice data from the time sequence features by using a language feature extraction module of the language feature extraction model.
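A minimal PyTorch sketch of the three-module extractor in claim 2 follows. The concrete layer types and sizes (a linear conversion layer, an LSTM for time sequence features, a linear projection for language features) are assumptions, since the patent does not specify the network structure.

import torch
import torch.nn as nn

class LanguageFeatureExtractor(nn.Module):
    # Three modules mirroring claim 2: feature conversion, time sequence
    # feature extraction, and language feature extraction.
    def __init__(self, acoustic_dim=40, hidden_dim=256, feature_dim=128):
        super().__init__()
        self.feature_conversion = nn.Sequential(
            nn.Linear(acoustic_dim, hidden_dim), nn.ReLU())
        self.time_sequence = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.language_feature = nn.Linear(hidden_dim, feature_dim)

    def forward(self, acoustic):                  # (batch, frames, acoustic_dim)
        converted = self.feature_conversion(acoustic)
        time_sequence_features, _ = self.time_sequence(converted)
        return self.language_feature(time_sequence_features)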
3. The method according to claim 2, wherein the training process of the language feature extraction model comprises:
acquiring training voice data;
determining acoustic features of each piece of training voice data, and phoneme information of each piece of training voice data;
training by taking the acoustic features of each piece of training voice data as a training sample and the phoneme information of that piece of training voice data as a sample label, to obtain a phoneme recognition model;
and removing an output layer of the phoneme recognition model to obtain the language feature extraction model.
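The following sketch illustrates the train-then-truncate idea of claim 3: a phoneme classifier is trained on acoustic features, and dropping its output layer leaves a network whose last hidden layer yields language features. The layer sizes and the phoneme inventory size are assumptions for illustration.

import torch.nn as nn

NUM_PHONEMES = 100   # illustrative phoneme inventory size (an assumption)
ACOUSTIC_DIM = 40    # e.g. filter-bank features per frame (an assumption)

# Phoneme recognition model: hidden layers followed by an output layer
# that predicts phoneme posteriors frame by frame.
phoneme_model = nn.Sequential(
    nn.Linear(ACOUSTIC_DIM, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),      # last hidden layer: language features
    nn.Linear(128, NUM_PHONEMES),        # output layer to be removed
)
# ... train phoneme_model with nn.CrossEntropyLoss() against phoneme labels ...

# Removing the output layer leaves the language feature extraction model.
language_feature_extractor = phoneme_model[:-1]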
4. The method according to claim 1, wherein the performing first identification on the language features of the voice data by using a pre-established first language identification model to obtain a first language identification result comprises:
processing the language features of the voice data by using a mean supervector feature extraction module of the first language identification model to obtain mean supervector features of the language features;
and identifying the mean supervector features of the language features by using a language identification module of the first language identification model to obtain the first language identification result.
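One common realization of a mean supervector is sketched below (an assumption; the patent does not define the module's internals): posterior-weighted component means of the frame-level language features under a background GMM, concatenated into a single long vector.

import numpy as np
from sklearn.mixture import GaussianMixture

def mean_supervector(frame_features, ubm):
    # Posterior-weighted component means of the frame-level language
    # features under a background GMM, concatenated into one vector.
    posteriors = ubm.predict_proba(frame_features)        # (frames, components)
    occupancy = posteriors.sum(axis=0, keepdims=True).T   # (components, 1)
    means = (posteriors.T @ frame_features) / np.maximum(occupancy, 1e-8)
    return means.reshape(-1)                              # (components * dim,)

# Usage sketch with synthetic data; a real UBM would be fitted on pooled
# training language features.
ubm = GaussianMixture(n_components=8, random_state=0).fit(np.random.randn(1000, 128))
supervector = mean_supervector(np.random.randn(300, 128), ubm)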
5. The method according to claim 4, wherein the training process of the first language identification model comprises:
acquiring a training voice data set corresponding to each of at least one language;
labeling the training voice data set corresponding to each language to obtain a labeling result of the training voice data set, wherein the labeling result indicates the language of the corresponding training voice data set;
determining the language features of each piece of training voice data in the training voice data set corresponding to each language;
determining a mean supervector feature set of the training voice data set corresponding to each language by using the language features of each piece of training voice data;
and training with the mean supervector feature set of the training voice data set corresponding to each language and the labeling result of the training voice data set corresponding to each language, to obtain the first language identification model.
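A brief sketch of this training step, assuming the language identification module is a probabilistic SVM (the patent leaves the classifier type open) and using synthetic per-language supervector sets in place of real ones:

import numpy as np
from sklearn.svm import SVC

# Synthetic per-language mean supervector sets (3 languages, 20 supervectors
# each); in practice these come from the clustering step of claim 6.
rng = np.random.default_rng(0)
per_language_sets = {lang: rng.normal(lang, 1.0, size=(20, 1024))
                     for lang in range(3)}

X = np.vstack([sv for sets in per_language_sets.values() for sv in sets])
y = np.concatenate([[lang] * len(sets)
                    for lang, sets in per_language_sets.items()])

# A probabilistic SVM is one plausible stand-in for the language
# identification module; the patent does not fix the classifier type.
first_model = SVC(probability=True).fit(X, y)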
6. The method according to claim 5, wherein the determining the mean supervector feature set of the training voice data set corresponding to each language by using the language features of each piece of training voice data comprises:
for the training voice data set corresponding to each language, clustering the training voice data in the set by using the language features of each piece of training voice data, to obtain training voice data subsets corresponding to the language;
for each of the training voice data subsets corresponding to the language, combining initial mean supervector features of the training voice data in the subset to obtain a mean supervector feature of the subset; the mean supervector features of all the training voice data subsets corresponding to the language form the mean supervector feature set of the training voice data set corresponding to the language.
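A sketch of the per-language clustering and merging, with two assumptions flagged in the comments: utterances are keyed by their time-averaged language features, and "combining" is realized as averaging, which the claim does not mandate.

import numpy as np
from sklearn.cluster import KMeans

def per_language_supervector_set(utterance_features, initial_supervectors,
                                 n_subsets=4):
    # Key each training utterance by its time-averaged language features
    # (an assumption about how utterances are compared for clustering).
    keys = np.array([f.mean(axis=0) for f in utterance_features])
    subset_labels = KMeans(n_clusters=n_subsets, n_init=10).fit_predict(keys)
    # Combine the initial mean supervectors inside each subset by averaging.
    return [np.mean([initial_supervectors[i]
                     for i in np.where(subset_labels == k)[0]], axis=0)
            for k in range(n_subsets)]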
7. The method according to claim 1, wherein the second language identification model is obtained by training a preset end-to-end neural network model, with the language features of training voice data as training samples and the languages labeled for the training voice data as sample labels.
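A minimal end-to-end sketch consistent with claim 7; the multi-layer LSTM encoder, the mean pooling, and all dimensions are illustrative assumptions chosen only so that the second model is deeper than the first.

import torch
import torch.nn as nn

class SecondLanguageModel(nn.Module):
    # End-to-end model mapping language features directly to language
    # logits; greater depth than the first model is the point.
    def __init__(self, feature_dim=128, n_languages=10):
        super().__init__()
        self.encoder = nn.LSTM(feature_dim, 256, num_layers=3, batch_first=True)
        self.classifier = nn.Linear(256, n_languages)

    def forward(self, language_features):     # (batch, frames, feature_dim)
        encoded, _ = self.encoder(language_features)
        return self.classifier(encoded.mean(dim=1))   # utterance-level logits

model = SecondLanguageModel()
loss_fn = nn.CrossEntropyLoss()   # trained against the labeled languages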
8. A language identification device, comprising:
an acquisition unit, configured to acquire voice data to be recognized;
a language feature determination unit, configured to determine language features of the voice data;
a first language identification unit, configured to perform first identification on the language features of the voice data by using a pre-established first language identification model to obtain a first language identification result;
a second language identification unit, configured to perform, when the first language identification result is inaccurate, second identification on the language features of the voice data by using a pre-established second language identification model to obtain a second language identification result, wherein the second language identification model has more network layers than the first language identification model;
and a language determination unit, configured to determine the language of the voice data based on the first language identification result and the second language identification result.
9. A language identification device comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the language identification method according to any one of claims 1 to 7.
10. A readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the language identification method according to any one of claims 1 to 7.
CN202010607693.6A 2020-06-29 2020-06-29 Language identification method, related equipment and readable storage medium Active CN111724766B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010607693.6A CN111724766B (en) 2020-06-29 2020-06-29 Language identification method, related equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010607693.6A CN111724766B (en) 2020-06-29 2020-06-29 Language identification method, related equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111724766A true CN111724766A (en) 2020-09-29
CN111724766B CN111724766B (en) 2024-01-05

Family

ID=72570223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010607693.6A Active CN111724766B (en) 2020-06-29 2020-06-29 Language identification method, related equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111724766B (en)

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101546555A (en) * 2009-04-14 2009-09-30 清华大学 Constraint heteroscedasticity linear discriminant analysis method for language identification
CN101894548A (en) * 2010-06-23 2010-11-24 清华大学 Modeling method and modeling device for language identification
WO2014029099A1 (en) * 2012-08-24 2014-02-27 Microsoft Corporation I-vector based clustering training data in speech recognition
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
CN107564513A (en) * 2016-06-30 2018-01-09 阿里巴巴集团控股有限公司 Audio recognition method and device
US20180174589A1 (en) * 2016-12-19 2018-06-21 Samsung Electronics Co., Ltd. Speech recognition method and apparatus
CN106898355A (en) * 2017-01-17 2017-06-27 清华大学 A kind of method for distinguishing speek person based on two modelings
CN107622770A (en) * 2017-09-30 2018-01-23 百度在线网络技术(北京)有限公司 voice awakening method and device
CN109817220A (en) * 2017-11-17 2019-05-28 阿里巴巴集团控股有限公司 Audio recognition method, apparatus and system
US20190251963A1 (en) * 2018-02-09 2019-08-15 Baidu Online Network Technology (Beijing) Co., Ltd. Voice awakening method and device
CN110211565A (en) * 2019-05-06 2019-09-06 平安科技(深圳)有限公司 Accent recognition method, apparatus and computer readable storage medium
CN110930978A (en) * 2019-11-08 2020-03-27 北京搜狗科技发展有限公司 Language identification method and device and language identification device
CN110782872A (en) * 2019-11-11 2020-02-11 复旦大学 Language identification method and device based on deep convolutional recurrent neural network
CN110853618A (en) * 2019-11-19 2020-02-28 腾讯科技(深圳)有限公司 Language identification method, model training method, device and equipment
CN111312214A (en) * 2020-03-31 2020-06-19 广东美的制冷设备有限公司 Voice recognition method and device for air conditioner, air conditioner and readable storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528682A (en) * 2020-12-23 2021-03-19 北京百度网讯科技有限公司 Language detection method and device, electronic equipment and storage medium
CN113782000A (en) * 2021-09-29 2021-12-10 北京中科智加科技有限公司 Language identification method based on multiple tasks
CN114678029A (en) * 2022-05-27 2022-06-28 深圳市人马互动科技有限公司 Speech processing method, system, computer readable storage medium and program product
CN114678029B (en) * 2022-05-27 2022-09-02 深圳市人马互动科技有限公司 Speech processing method, system, computer readable storage medium and program product

Also Published As

Publication number Publication date
CN111724766B (en) 2024-01-05

Similar Documents

Publication Publication Date Title
CN109918680B (en) Entity identification method and device and computer equipment
CN108509619B (en) Voice interaction method and device
CN109918673B (en) Semantic arbitration method and device, electronic equipment and computer-readable storage medium
CN108304372B (en) Entity extraction method and device, computer equipment and storage medium
KR101259558B1 (en) apparatus and method for detecting sentence boundaries
CN111724766B (en) Language identification method, related equipment and readable storage medium
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
CN111125354A (en) Text classification method and device
CN108763510A (en) Intension recognizing method, device, equipment and storage medium
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN112784696B (en) Lip language identification method, device, equipment and storage medium based on image identification
CN111709242A (en) Chinese punctuation mark adding method based on named entity recognition
CN111445898B (en) Language identification method and device, electronic equipment and storage medium
CN110377695B (en) Public opinion theme data clustering method and device and storage medium
CN108345612A (en) A kind of question processing method and device, a kind of device for issue handling
CN112669842A (en) Man-machine conversation control method, device, computer equipment and storage medium
CN113326360A (en) Natural language understanding method in small sample scene
CN112818086A (en) Multi-label classification method for acquiring client intention label by robot
CN112967710B (en) Low-resource customer dialect point identification method
CN113051384A (en) User portrait extraction method based on conversation and related device
Bigot et al. Combining acoustic name spotting and continuous context models to improve spoken person name recognition in speech
CN117454898A (en) Method and device for realizing legal entity standardized output according to input text
CN115691503A (en) Voice recognition method and device, electronic equipment and storage medium
CN113470617B (en) Speech recognition method, electronic equipment and storage device
CN114974310A (en) Emotion recognition method and device based on artificial intelligence, computer equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant