CN111724766B - Language identification method, related equipment and readable storage medium


Info

Publication number
CN111724766B
Authority
CN
China
Prior art keywords
language
voice data
training
model
features
Prior art date
Legal status
Active
Application number
CN202010607693.6A
Other languages
Chinese (zh)
Other versions
CN111724766A (en)
Inventor
杨军
方磊
方四安
唐磊
Current Assignee
Hefei Ustc Iflytek Co ltd
Original Assignee
Hefei Ustc Iflytek Co ltd
Priority date
Application filed by Hefei Ustc Iflytek Co ltd
Priority to CN202010607693.6A
Publication of CN111724766A
Application granted
Publication of CN111724766B
Status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/005: Language recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 15/32: Multiple recognisers used in sequence or in parallel; score combination systems therefor, e.g. voting systems

Abstract

The application discloses a language identification method, related equipment and a readable storage medium. After voice data to be identified is acquired, the language features of the voice data are determined; a first identification is performed on the language features using a pre-established first language identification model to obtain a first language identification result; when the first language identification result is inaccurate, a second identification is performed on the language features using a pre-established second language identification model to obtain a second language identification result, and the language of the voice data is determined based on the first and second language identification results. In this scheme, if the first language identification result is inaccurate, the second language identification model, which has more network layers than the first, can be used for a second identification, improving identification accuracy.

Description

Language identification method, related equipment and readable storage medium
Technical Field
The present invention relates to the field of natural language processing, and more particularly, to a language identification method, related device, and readable storage medium.
Background
Language identification is the process by which a computer analyzes voice data to determine its language, and it is an important research direction within speech recognition. As globalization accelerates, language identification has broad application prospects in fields such as multilingual information services, machine translation, and military security. In the prior art, language identification is mostly performed with models such as the Gaussian Mixture Model (GMM), the Support Vector Machine (SVM), and the Gaussian Supervector-Support Vector Machine (GSV-SVM).
However, the accuracy of the language identification results obtained with these prior-art methods is not ideal.
It is therefore particularly necessary to optimize the prior-art language identification methods.
Disclosure of Invention
In view of the foregoing, the present application proposes a language identification method, related apparatus, and readable storage medium. The specific scheme is as follows:
a language identification method, comprising:
acquiring voice data to be recognized;
determining language characteristics of the voice data;
performing first recognition on the language features of the voice data by using a pre-established first language recognition model to obtain a first language recognition result;
when the first language identification result is inaccurate, performing a second identification on the language features of the voice data by using a pre-established second language identification model to obtain a second language identification result; and determining the language of the voice data based on the first language identification result and the second language identification result; wherein the second language identification model has more network layers than the first language identification model.
Optionally, the determining the language feature of the voice data includes:
acquiring acoustic characteristics of the voice data;
performing feature conversion on acoustic features of the voice data by using a feature conversion module of a pre-established language feature extraction model to obtain converted features;
extracting time sequence features from the transformed features by using a time sequence feature extraction module of the language feature extraction model;
and extracting the language features of the voice data from the time sequence features by using a language feature extraction module of the language feature extraction model.
Optionally, the training process of the language feature extraction model includes:
acquiring training voice data;
determining acoustic features of each training speech data and phoneme information of each training speech data;
taking acoustic characteristics of each piece of training voice data as a training sample, taking phoneme information of the training voice data as a sample label, and training to obtain a phoneme recognition model;
and removing the output layer of the phoneme recognition model to obtain the language feature extraction model.
Optionally, the performing the first recognition on the language feature of the voice data by using a pre-established first language recognition model to obtain a first language recognition result includes:
processing the language features of the voice data by using a mean value supervector feature extraction module of the first language recognition model to obtain mean value supervector features of the language features;
and identifying the mean value supervector characteristic of the language characteristic by using the language identification module of the first language identification model to obtain a first language identification result.
Optionally, the training process of the first language identification model includes:
acquiring a training voice data set corresponding to at least one language;
labeling the training voice data set corresponding to each language to obtain a labeling result for the training voice data set of each language, wherein the labeling result indicates the language of the corresponding training voice data set;
determining language characteristics of each training voice data in a training voice data set corresponding to each language;
determining a mean value super-vector feature set of the training voice data set corresponding to each language by using the language features of each training voice data;
and training to obtain the first language identification model by using the mean value supervector feature set of the training voice data set corresponding to each language and the labeling result of the training voice data set corresponding to each language.
Optionally, the determining, by using language features of each training voice data, a mean value super-vector feature set of a training voice data set corresponding to each language includes:
clustering each training voice data in the training voice data set corresponding to each language by utilizing language characteristics of each training voice data to obtain a training voice data subset corresponding to the language;
for each training voice data subset among the training voice data subsets corresponding to the language, combining the initial mean supervector features of the training voice data in the subset to obtain the mean supervector feature of the subset; the mean supervector features of all training voice data subsets corresponding to the language form the mean supervector feature set of the training voice data set corresponding to the language.
Optionally, the second language identification model is obtained by training a preset end-to-end neural network model by taking language features of training voice data as training samples and the language marked by the training voice data as a sample label.
A language identification device comprising:
an acquisition unit configured to acquire voice data to be recognized;
the language feature determining unit is used for determining the language feature of the voice data;
the first language identification unit is used for carrying out first identification on the language characteristics of the voice data by utilizing a pre-established first language identification model to obtain a first language identification result;
the second language identification unit is used for carrying out second identification on the language characteristics of the voice data by utilizing a pre-established second language identification model when the first language identification result is inaccurate, so as to obtain a second language identification result; the number of network layers of the second language identification model is more than that of the first language identification model;
and the language determining unit is used for determining the language of the voice data based on the first language identification result and the second language identification result.
Optionally, the language feature determining unit includes:
an acoustic feature acquisition unit configured to acquire acoustic features of the voice data;
the feature conversion unit is used for carrying out feature conversion on the acoustic features of the voice data by utilizing a feature conversion module of a pre-established language feature extraction model to obtain converted features;
the time sequence feature extraction unit is used for extracting time sequence features from the transformed features by utilizing a time sequence feature extraction module of the language feature extraction model;
and the language feature extraction unit is used for extracting the language features of the voice data from the time sequence features by utilizing a language feature extraction module of the language feature extraction model.
Optionally, the training process of the language feature extraction model includes:
acquiring training voice data;
determining acoustic features of each training speech data and phoneme information of each training speech data;
taking acoustic characteristics of each piece of training voice data as a training sample, taking phoneme information of the training voice data as a sample label, and training to obtain a phoneme recognition model;
and removing the output layer of the phoneme recognition model to obtain the language feature extraction model.
Optionally, the first language identification unit includes:
the mean value supervector feature determining unit is used for processing the language features of the voice data by utilizing a mean value supervector feature extracting module of the first language identification model to obtain mean value supervector features of the language features;
and the recognition unit is used for recognizing the mean value supervector characteristic of the language characteristic by utilizing the language recognition module of the first language recognition model to obtain a first language recognition result.
Optionally, the training process of the first language identification model includes:
acquiring a training voice data set corresponding to at least one language;
labeling the training voice data set corresponding to each language to obtain a labeling result for the training voice data set of each language, wherein the labeling result indicates the language of the corresponding training voice data set;
determining language characteristics of each training voice data in a training voice data set corresponding to each language;
determining a mean value super-vector feature set of the training voice data set corresponding to each language by using the language features of each training voice data;
and training to obtain the first language identification model by using the mean supervector feature set of the training voice data set corresponding to each language and the labeling result of the training voice data set corresponding to each language.
Optionally, the determining, by using language features of each training voice data, a mean value super-vector feature set of a training voice data set corresponding to each language includes:
clustering each training voice data in the training voice data set corresponding to each language by utilizing language characteristics of each training voice data to obtain a training voice data subset corresponding to the language;
for each training voice data subset among the training voice data subsets corresponding to the language, combining the initial mean supervector features of the training voice data in the subset to obtain the mean supervector feature of the subset; the mean supervector features of all training voice data subsets corresponding to the language form the mean supervector feature set of the training voice data set corresponding to the language.
Optionally, the second language identification model is obtained by training a preset end-to-end neural network model by taking language features of training voice data as training samples and the language marked by the training voice data as a sample label.
A language identification device comprises a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the language identification method as described above.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a language identification method as described above.
By means of the above technical scheme, the application discloses a language identification method, related equipment and a readable storage medium. After voice data to be identified is acquired, the language features of the voice data are determined; a first identification is performed on the language features using a pre-established first language identification model to obtain a first language identification result; when the first language identification result is inaccurate, a second identification is performed on the language features using a pre-established second language identification model to obtain a second language identification result, and the language of the voice data is determined based on the first and second language identification results. In this scheme, if the first language identification result is inaccurate, the second language identification model, which has more network layers than the first, can be used for a second identification, improving identification accuracy.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a flow chart of a language identification method according to an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a language feature extraction model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a phoneme recognition model according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a first language identification model according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a language identification apparatus according to an embodiment of the present disclosure;
fig. 6 is a block diagram of a hardware structure of a language identification apparatus according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Next, the language identification method provided in the present application will be described by the following examples.
Referring to fig. 1, fig. 1 is a flow chart of a language identification method disclosed in an embodiment of the present application, where the method may include:
step S101: and acquiring voice data to be recognized.
The voice data to be recognized is voice data uttered by the user according to application requirements, such as voice data input when the user calls, voice data input by using a voice input method when the user is based on an instant chat tool, and the like, and the application is not limited in any way.
Step S102: and determining the language characteristics of the voice data.
In this application, although acoustic features such as SDC (english full name: shifted Delta Cepstral, chinese full name: shift differential cepstrum) features of voice data may be used as the language features of voice data, the acoustic features of voice data often contain less language information, and high recognition accuracy cannot be ensured. For example, when the SDC feature of the voice data is used as the language feature of the voice data, if the voice data is phrase voice data with an effective duration being less than a preset duration (for example, 3 seconds), the SDC feature is shorter, and the language information is less, which may cause inaccurate language recognition result of the phrase voice data. Therefore, in the present application, the language features of the voice data may be other features that are determined based on acoustic features such as SDC features and include more language information.
The specific implementation of determining the language features of the speech data will be described in detail by the following examples.
Step S103: and carrying out first recognition on the language features of the voice data by using a pre-established first language recognition model to obtain a first language recognition result.
The traditional language recognition models, such as a mixed Gaussian model (English full name: gaussian Mixture Model, english abbreviated as GMM), a support vector machine (English full name: support Vector Machine, english abbreviated as SVM), a Gaussian mixture model supervector-support vector machine (English full name: gaussian Super Vector-Support Vector Machine, english abbreviated as GSV-SVM) and the like, are mostly obtained based on acoustic feature training of SDC features and the like, and the acoustic features of the SDC features and the like contain less language information, so that the language recognition accuracy of the traditional language recognition model is low.
In this application, the pre-established first language identification model may be a model obtained by retraining a conventional language identification model by using language features of training data, which include more language information.
The training process of the first language recognition model, and the specific way it performs the first recognition of the language features to obtain the first language recognition result, will be described in detail in later embodiments.
Step S104: and judging whether the first language identification result is accurate, and executing step S105 and step S106 when the first language identification result is inaccurate. When the first language identification result is accurate, step S107 is performed.
In the present application, there may be various ways to determine whether the first language identification result is accurate.
As an implementation manner, a target language (for example, chinese, english, french, and others) may be preset, and the first recognition result may include a first score of the language of the voice data for each target language, and a specific implementation manner of determining whether the first recognition result is accurate may be as follows: judging whether the difference value between the highest first score and the lowest first score in each first score meets a preset condition, if so, determining that the first-time language identification result is accurate, and if not, determining that the first-time language identification result is inaccurate. The preset condition may be equal to or greater than a preset threshold, be within a preset interval, and the like, which is not limited in any way.
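A minimal Python sketch of this accuracy check, assuming the score-gap criterion above; the function name and the threshold value are illustrative assumptions, not values fixed by this application:

```python
# A minimal sketch of the accuracy check; GAP_THRESHOLD is a hypothetical
# preset condition ("greater than or equal to a preset threshold").
GAP_THRESHOLD = 0.5

def first_result_is_accurate(first_scores):
    """first_scores: dict mapping each preset target language to its
    first score from the first language recognition model."""
    values = list(first_scores.values())
    # Accurate only if the gap between the highest and lowest first
    # scores meets the preset condition.
    return max(values) - min(values) >= GAP_THRESHOLD

print(first_result_is_accurate({"zh": 0.90, "en": 0.05, "fr": 0.05}))  # True
print(first_result_is_accurate({"zh": 0.40, "en": 0.35, "fr": 0.25}))  # False
```

A large gap suggests that one target language clearly dominates, so the first result can be trusted without invoking the deeper second model.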
Step S105: and carrying out second recognition on the language features of the voice data by using a pre-established second language recognition model to obtain a second language recognition result.
In the application, the number of network layers of the second language identification model is more than that of the first language identification model, so that the accuracy of language identification of the second language identification model is higher than that of the first language identification model.
The training process of the second language identification model, and the second language identification model is used to perform the second identification on the language features of the voice data, so as to obtain the specific implementation manner of the second language identification result, which will be described in detail in the following embodiments.
Step S106: and determining the language of the voice data based on the first language identification result and the second language identification result.
In this application, target languages (for example, chinese, english, french, and others) may be preset, where the first recognition result may include a first score of each target language for a language of the voice data, and the second recognition result may include a second score of each target language for a language of the voice data, and then, based on the first and second recognition results, determining a specific implementation manner of the language of the voice data may be: determining a final score for each of the target languages based on the first score for each of the target languages for the languages of the speech data and the second score for each of the target languages for the languages of the speech data; and determining the language of the voice data as the target language corresponding to the highest score in the final score of each target language, and determining the language of the voice data as the language of the voice data.
The method for determining the final score of the language of the voice data for each target language may be as follows: the method comprises the steps of presetting the weight of a first recognition result and the weight of a second recognition result, and fusing the first score of the language of the voice data for each target language and the second score of the language of the voice data for each target language based on the weight of the first recognition result and the weight of the second recognition result to obtain the final score of the language of the voice data for each target language.
For ease of understanding, assuming that the weight of the first recognition result is α and the weight of the second recognition result is 1- α, the first score of the language of the voice data is 0.8, the second score of the language of the voice data is 0.6, and the final score of the language of the voice data is 0.8×α+0.6×1- α.
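A short Python sketch of this weighted fusion; the weight value and the example scores are illustrative:

```python
# Weighted fusion of the first and second recognition results; alpha is
# an illustrative preset weight for the first recognition result.
def fuse_scores(first_scores, second_scores, alpha=0.6):
    final = {lang: alpha * first_scores[lang]
                   + (1 - alpha) * second_scores[lang]
             for lang in first_scores}
    # The target language with the highest final score is taken as the
    # language of the voice data.
    return max(final, key=final.get), final

lang, final = fuse_scores({"zh": 0.8, "en": 0.2}, {"zh": 0.6, "en": 0.4})
print(lang, final)  # zh {'zh': 0.72, 'en': 0.28}
```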
Step S107: and determining the language of the voice data based on the first language identification result.
In this application, target languages (for example, chinese, english, french, and others) may be preset, and the first recognition result may include a first score of the language of the voice data for each target language, and then determining, based on the first recognition result of the language, the manner of determining the language of the voice data may include: and determining the language of the voice data as the target language corresponding to the highest score in the first scores of the target languages, and determining the language of the voice data as the language of the voice data.
This embodiment discloses a language identification method: after voice data to be identified is acquired, the language features of the voice data are determined; a first identification is performed on the language features using a pre-established first language identification model to obtain a first language identification result; when that result is inaccurate, a second identification is performed using a pre-established second language identification model to obtain a second language identification result, and the language of the voice data is determined based on both results. Since the second language identification model has more network layers than the first, the second identification improves identification accuracy.
In addition, this application does not perform two rounds of language recognition on all voice data; only voice data whose first recognition result is inaccurate undergoes the second recognition. When many pieces of voice data need language recognition, this improves recognition speed compared with performing two recognitions on every piece.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a language feature extraction model disclosed in an embodiment of this application; the model includes a feature transformation module, a time sequence feature extraction module, and a language feature extraction module. Since DNNs (Deep Neural Networks) are good at applying nonlinear transformations to data, the feature transformation module may be implemented based on a DNN. Since BiLSTM (Bi-directional Long Short-Term Memory) networks are good at analyzing time series, the time sequence feature extraction module may be implemented based on a BiLSTM. Because a BN (Bottleneck Network) can reduce the dimension of the features from the previous network layer and speed up model training, the language feature extraction module may be implemented based on a BN.
Based on the language feature extraction model shown in fig. 2, a specific implementation manner of determining the language feature of the voice data in step S102 is described in this application. The method can comprise the following steps:
step S201: and acquiring acoustic characteristics of the voice data.
In this application, the acoustic feature of the voice data may be an SDC feature of the voice data.
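For illustration, here is a minimal numpy sketch of SDC computation under the common N-d-P-k parameterization (for example, 7-1-3-7); this application does not fix these parameters, so the values below are assumptions:

```python
import numpy as np

# A minimal numpy sketch of SDC computation under the common N-d-P-k
# parameterization (e.g. 7-1-3-7). The parameter values here are
# illustrative assumptions.
def sdc(cepstra, d=1, p=3, k=7):
    """cepstra: (num_frames, n_coeffs) cepstral matrix.
    Returns an array of shape (num_valid_frames, n_coeffs * k)."""
    t_max = cepstra.shape[0] - ((k - 1) * p + d)
    feats = []
    for t in range(d, t_max):
        # k shifted delta blocks: deltas of width 2*d at offsets t + i*p
        deltas = [cepstra[t + i * p + d] - cepstra[t + i * p - d]
                  for i in range(k)]
        feats.append(np.concatenate(deltas))
    return np.asarray(feats)

feats = sdc(np.random.randn(300, 7))
print(feats.shape)  # (280, 49)
```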
Step S202: and performing feature conversion on the acoustic features of the voice data by using a feature conversion module of a pre-established language feature extraction model to obtain converted features.
The transformed features are nonlinear features corresponding to acoustic features of the speech data.
Step S203: and extracting time sequence features from the transformed features by using a time sequence feature extraction module of the language feature extraction model.
Step S204: and extracting the language features of the voice data from the time sequence features by using a language feature extraction module of the language feature extraction model.
In the application, the language feature extraction module of the language feature extraction model can perform dimension reduction processing on the time sequence features to obtain the language features of the voice data.
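A minimal PyTorch sketch of this pipeline (DNN transformation, BiLSTM time sequence module, bottleneck language feature module); all layer sizes are illustrative assumptions that the application does not specify:

```python
import torch
import torch.nn as nn

# A minimal sketch of the model in Fig. 2; layer sizes are assumptions.
class LanguageFeatureExtractor(nn.Module):
    def __init__(self, acoustic_dim=49, hidden=256, bottleneck=64):
        super().__init__()
        # Feature transformation module: DNN for nonlinear transformation
        self.transform = nn.Sequential(
            nn.Linear(acoustic_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Time sequence feature extraction module: BiLSTM
        self.bilstm = nn.LSTM(hidden, hidden, batch_first=True,
                              bidirectional=True)
        # Language feature extraction module: bottleneck layer reducing
        # the dimension of the time sequence features
        self.bottleneck = nn.Linear(2 * hidden, bottleneck)

    def forward(self, acoustic):          # (batch, frames, acoustic_dim)
        x = self.transform(acoustic)      # transformed features
        x, _ = self.bilstm(x)             # time sequence features
        return self.bottleneck(x)         # per-frame language features

print(LanguageFeatureExtractor()(torch.randn(2, 100, 49)).shape)
# torch.Size([2, 100, 64])
```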
As for training the language feature extraction model, it could in theory be trained with the acoustic features of training voice data as samples and the language features labeled for that data as sample labels. However, labeling the language features corresponding to the acoustic features of a piece of voice data is impractical, whereas mature speech recognition models can already produce the phoneme information of voice data. Therefore, in this application, a phoneme recognition model may be preset and trained, and the language feature extraction model is obtained from it, as follows:
Referring to fig. 3, a schematic structural diagram of a phoneme recognition model disclosed in an embodiment of this application: the phoneme recognition model includes a feature transformation module, a time sequence feature extraction module, a language feature extraction module, and an output layer, where the first three modules may serve as the corresponding modules of the language feature extraction model.
In the application, after the phoneme recognition model is trained, the output layer of the phoneme recognition model is removed, and the language feature extraction model can be obtained.
Based on the phoneme recognition model shown in fig. 3, the training process for the language feature extraction model may include:
step S301: training speech data is acquired.
Step S302: acoustic features of each training speech data are determined, and phoneme information of each training speech data.
In the present application, acoustic features of each training speech data and phoneme information of each training speech data may be obtained based on a conventional speech recognition model. In this regard, the present application will not be described.
Step S303: and training by taking the acoustic characteristics of each training voice data as a training sample and taking the phoneme information of the training voice data as a sample label to obtain a phoneme recognition model.
Step S304: and removing the output layer of the phoneme recognition model to obtain the language feature extraction model.
As can be seen from fig. 2 and 3, the language feature extraction model shown in fig. 2 can be obtained by removing the output layer of the phoneme recognition model shown in fig. 3.
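Continuing the sketch above, a hedged illustration of this procedure: train a phoneme recognizer on top of the extractor, then discard the output layer (the phoneme inventory size is an assumption):

```python
import torch.nn as nn

# Builds on the LanguageFeatureExtractor sketch above; num_phonemes is
# an illustrative assumption.
class PhonemeRecognizer(nn.Module):
    def __init__(self, num_phonemes=100, bottleneck=64):
        super().__init__()
        self.backbone = LanguageFeatureExtractor(bottleneck=bottleneck)
        # Output layer predicting per-frame phoneme posteriors
        self.output = nn.Linear(bottleneck, num_phonemes)

    def forward(self, acoustic):
        return self.output(self.backbone(acoustic))

model = PhonemeRecognizer()
# ... train with acoustic features as training samples and phoneme
#     labels as sample labels, e.g. using nn.CrossEntropyLoss() ...
feature_extractor = model.backbone   # removing the output layer yields
                                     # the language feature extraction model
```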
Referring to fig. 4, fig. 4 is a schematic structural diagram of a first language identification model disclosed in an embodiment of this application. The first language identification model includes a mean supervector feature extraction module and a language identification module; the language identification module may perform language identification with a Support Vector Machine (SVM) algorithm.
Based on the structure of the first language recognition model shown in fig. 4, in another embodiment of this application, the first recognition of the language features of the voice data in step S103, which yields the first language recognition result, may be implemented by the following steps:
Step S401: processing the language features of the voice data by using the mean supervector feature extraction module in the first language recognition model to obtain the mean supervector feature of the language features;
Step S402: recognizing the mean supervector feature of the language features by using the language recognition module in the first language recognition model to obtain the first language recognition result.
It should be noted that the training process of the first language identification model may include:
Step S501: acquiring a training voice data set corresponding to at least one language.
In this application, target languages (for example, Chinese, English, French, and others) may be preset, and a training voice data set is obtained for each target language. It should be noted that, to ensure the model effect, the training voice data set of each target language should contain both long speech whose duration exceeds a first preset duration (for example, 3 seconds) and short speech whose duration does not exceed it, and the total duration of all the voice data should reach a second preset duration (for example, 20 hours).
Step S502: and labeling the training voice data set corresponding to each language to obtain a labeling result of the training voice data set corresponding to each language.
The labeling result of the training voice data corresponding to each language is used for indicating the language of the training voice data set corresponding to the language.
Step S503: and determining language characteristics of each training voice data in the training voice data set corresponding to each language.
In the application, each training voice data can be processed based on the language feature extraction model, so that language features of each training voice data can be obtained.
Step S504: and determining a mean value supervector feature set of the training voice data set corresponding to each language by using the language features of each training voice data.
In this application, a universal background model and a total variability matrix may be estimated using the language features of all training voice data.
For the training voice data set corresponding to each language, the mean supervector feature set of that set is determined using the language features of each piece of training voice data in the set, the universal background model, and the total variability matrix.
As one implementation, for each piece of training voice data in the set, its initial mean supervector feature can be determined from its language features, the universal background model, and the total variability matrix; the initial mean supervector features of all the training voice data then form the mean supervector feature set.
However, the training voice data set of each language contains many pieces of training voice data, and if the initial mean supervector features of all of them are combined into the mean supervector feature set, the first language recognition model converges slowly.
Therefore, this application provides another embodiment that reduces the number of mean supervector features in each language's mean supervector feature set and thereby improves the convergence speed of the first language recognition model. It includes the following steps:
Step S5041: clustering the training voice data in the training voice data set corresponding to each language by using the language features of each piece of training voice data, to obtain the training voice data subsets corresponding to the language.
In this application, for the training voice data set corresponding to each language, the initial mean supervector feature and the i-vector feature of each piece of training voice data in the set are first determined using its language features, the universal background model, and the total variability matrix; the training voice data are then clustered based on their initial mean supervector features or i-vector features to obtain the training voice data subsets corresponding to the language. Each subset includes at least one piece of training voice data.
It should be noted that clustering based on the initial mean supervector features or i-vector features may proceed as follows: compute the similarity between the initial mean supervector features or i-vector features of the training voice data, and cluster based on that similarity. Specifically, several pieces of training voice data whose features are relatively similar may be clustered into one training voice data subset.
It should further be noted that the training voice data subsets corresponding to a language could be all the subsets obtained by clustering; however, subsets containing fewer pieces of training voice data than a preset threshold (for example, 3) would make the mean supervector features in the final feature set more scattered, which is not conducive to model training. Therefore, in this application, the training voice data subsets corresponding to a language may be only those clustered subsets containing no fewer pieces of training voice data than the preset threshold.
For ease of understanding, assume the training voice data set for the target language Chinese contains 5000 pieces of training voice data and clustering yields 1000 subsets, of which 200 contain fewer pieces than the preset threshold (for example, 3). Those 200 subsets are discarded, and the remaining 800 are the training voice data subsets for Chinese. The final mean supervector feature set for Chinese then contains only 800 mean supervector features, far fewer than the original 5000.
Step S5042: for each training voice data subset among the training voice data subsets corresponding to the language, combining the initial mean supervector features of the training voice data in the subset to obtain the mean supervector feature of the subset; the mean supervector features of all training voice data subsets corresponding to the language form the mean supervector feature set of the training voice data set corresponding to the language.
To see the benefit, again assume the training voice data set for Chinese contains 5000 pieces of training voice data: without clustering, the mean supervector feature set would contain 5000 mean supervectors and the first language recognition model would converge too slowly; with clustering, the number of mean supervectors in the feature set is greatly reduced and convergence improves.
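A minimal numpy sketch of steps S5041 and S5042, using a greedy cosine-similarity clustering; the similarity threshold, minimum cluster size, and function name are illustrative assumptions:

```python
import numpy as np

# Greedy clustering by cosine similarity of per-utterance i-vector (or
# initial supervector) features, discarding clusters smaller than the
# preset threshold, then averaging the initial mean supervectors of
# each kept cluster.
def cluster_supervectors(ivecs, supervecs, sim_thresh=0.8, min_size=3):
    """ivecs: (n, d_i) i-vectors; supervecs: (n, d_s) initial supervectors."""
    ivecs = ivecs / np.linalg.norm(ivecs, axis=1, keepdims=True)
    clusters = []  # each cluster is a list of utterance indices
    for idx, v in enumerate(ivecs):
        for c in clusters:
            centroid = ivecs[c].mean(axis=0)
            if v @ (centroid / np.linalg.norm(centroid)) >= sim_thresh:
                c.append(idx)
                break
        else:
            clusters.append([idx])
    # Keep clusters of at least min_size utterances; merge each kept
    # cluster's initial supervectors into one mean supervector feature.
    return [supervecs[c].mean(axis=0) for c in clusters if len(c) >= min_size]

feature_set = cluster_supervectors(np.random.randn(50, 10),
                                   np.random.randn(50, 80))
print(len(feature_set))  # number of mean supervector features
```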
Step S505: and training to obtain the first language identification model by using the mean value supervector feature set of the training voice data set corresponding to each language and the labeling result of the training voice data set corresponding to each language.
In another embodiment of this application, the second language recognition model is obtained by training a preset end-to-end neural network model with the language features of training voice data as training samples and the languages labeled for that data as sample labels. As one implementation, the preset neural network model may be an end-to-end TDNN (Time-Delay Neural Network).
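A hedged PyTorch sketch of such an end-to-end TDNN-style model (dilated 1-D convolutions with statistics pooling); the layer sizes, context widths, and pooling choice are assumptions, since the application only requires that this model have more network layers than the first:

```python
import torch
import torch.nn as nn

# A sketch of an end-to-end TDNN-style second model: dilated 1-D
# convolutions over the language feature sequence, statistics pooling,
# and per-language output scores. All sizes are assumptions.
class TDNNLanguageID(nn.Module):
    def __init__(self, feat_dim=64, hidden=256, num_languages=4):
        super().__init__()
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.classify = nn.Linear(2 * hidden, num_languages)

    def forward(self, feats):                    # (batch, frames, feat_dim)
        x = self.tdnn(feats.transpose(1, 2))     # (batch, hidden, frames')
        # Statistics pooling: mean and std over time, then classify
        stats = torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)
        return self.classify(stats)              # second scores per language

print(TDNNLanguageID()(torch.randn(2, 100, 64)).shape)  # torch.Size([2, 4])
```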
Based on the above scheme, the second language recognition model in this application has more network layers, so its language recognition accuracy is higher. In theory, recognizing voice data with the second language recognition model alone would yield a highly accurate result. However, because of its greater depth, the second model needs more time to process the voice data and output its language, so for language recognition scenes with strict real-time requirements, using the second model alone cannot meet those requirements. Therefore, in this application, the voice data is first recognized with the first language recognition model, which has fewer network layers, to obtain the first language recognition result; only if that result is inaccurate is the second language recognition model used. This guarantees both the accuracy of the language recognition result and the efficiency of language recognition.
The following describes the language identification apparatus disclosed in the embodiments of this application; the apparatus described below and the method described above may be referred to in correspondence with each other.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a language identification apparatus according to an embodiment of the present disclosure. As shown in fig. 5, the language identification apparatus may include:
an acquisition unit 11 for acquiring voice data to be recognized;
a language feature determining unit 12 for determining a language feature of the voice data;
a first language identification unit 13, configured to perform a first identification on the language feature of the voice data by using a pre-established first language identification model, so as to obtain a first language identification result;
a second language identification unit 14, configured to perform a second recognition on the language feature of the speech data by using a pre-established second language recognition model when the first language recognition result is inaccurate, so as to obtain a second language recognition result; the number of network layers of the second language identification model is more than that of the first language identification model;
the language determining unit 15 is configured to determine the language of the voice data based on the first language identification result and the second language identification result.
Optionally, the language feature determining unit includes:
an acoustic feature acquisition unit configured to acquire acoustic features of the voice data;
the feature conversion unit is used for carrying out feature conversion on the acoustic features of the voice data by utilizing a feature conversion module of a pre-established language feature extraction model to obtain converted features;
the time sequence feature extraction unit is used for extracting time sequence features from the transformed features by utilizing a time sequence feature extraction module of the language feature extraction model;
and the language feature extraction unit is used for extracting the language features of the voice data from the time sequence features by utilizing a language feature extraction module of the language feature extraction model.
Optionally, the training process of the language feature extraction model includes:
acquiring training voice data;
determining acoustic features of each training speech data and phoneme information of each training speech data;
taking acoustic characteristics of each piece of training voice data as a training sample, taking phoneme information of the training voice data as a sample label, and training to obtain a phoneme recognition model;
and removing the output layer of the phoneme recognition model to obtain the language feature extraction model.
Optionally, the first language identification unit includes:
the mean value supervector feature determining unit is used for processing the language features of the voice data by utilizing a mean value supervector feature extracting module of the first language identification model to obtain mean value supervector features of the language features;
and the recognition unit is used for recognizing the mean value supervector characteristic of the language characteristic by utilizing the language recognition module of the first language recognition model to obtain a first language recognition result.
Optionally, the training process of the first language identification model includes:
acquiring a training voice data set corresponding to at least one language;
labeling the training voice data set corresponding to each language to obtain a labeling result for the training voice data set of each language, wherein the labeling result indicates the language of the corresponding training voice data set;
determining language characteristics of each training voice data in a training voice data set corresponding to each language;
determining a mean value super-vector feature set of the training voice data set corresponding to each language by using the language features of each training voice data;
and training to obtain the first language identification model by using the mean supervector feature set of the training voice data set corresponding to each language and the labeling result of the training voice data set corresponding to each language.
Optionally, the determining, by using language features of each training voice data, a mean value super-vector feature set of a training voice data set corresponding to each language includes:
clustering each training voice data in the training voice data set corresponding to each language by utilizing language characteristics of each training voice data to obtain a training voice data subset corresponding to the language;
for each training voice data subset among the training voice data subsets corresponding to the language, combining the initial mean supervector features of the training voice data in the subset to obtain the mean supervector feature of the subset; the mean supervector features of all training voice data subsets corresponding to the language form the mean supervector feature set of the training voice data set corresponding to the language.
Optionally, the second language identification model is obtained by training a preset end-to-end neural network model by taking language features of training voice data as training samples and the language marked by the training voice data as a sample label.
Referring to fig. 6, fig. 6 is a block diagram of a hardware structure of a language identification apparatus according to an embodiment of the present application, and referring to fig. 6, the hardware structure of the language identification apparatus may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete communication with each other through the communication bus 4;
processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention;
the memory 3 may include high-speed RAM and may also include non-volatile memory, such as at least one magnetic disk memory;
wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
acquiring voice data to be recognized;
determining language characteristics of the voice data;
performing first recognition on the language features of the voice data by using a pre-established first language recognition model to obtain a first language recognition result;
When the first language identification result is inaccurate, performing second identification on the language characteristics of the voice data by using a pre-established second language identification model to obtain a second language identification result; determining the language of the voice data based on the first language identification result and the second language identification result; the number of network layers of the second language identification model is more than that of the first language identification model.
Optionally, for the refined and extended functions of the program, refer to the description above.
The embodiment of the application also provides a readable storage medium, which can store a program suitable for being executed by a processor, the program being configured to:
acquiring voice data to be recognized;
determining language characteristics of the voice data;
performing first recognition on the language features of the voice data by using a pre-established first language recognition model to obtain a first language recognition result;
when the first language identification result is inaccurate, performing second identification on the language characteristics of the voice data by using a pre-established second language identification model to obtain a second language identification result; determining the language of the voice data based on the first language identification result and the second language identification result; the number of network layers of the second language identification model is more than that of the first language identification model.
Optionally, for the refined and extended functions of the program, refer to the description above.
Finally, it should also be noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described progressively, each focusing on its differences from the others; for identical or similar parts, the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A language identification method, comprising:
acquiring voice data to be recognized;
determining language characteristics of the voice data;
performing a first recognition on the language features of the voice data by using a pre-established first language recognition model to obtain a first language recognition result, wherein the first recognition result comprises a first score of the voice data for each target language, the target languages being preset languages;
when the first language identification result is inaccurate, performing a second identification on the language features of the voice data by using a pre-established second language identification model to obtain a second language identification result, wherein the second recognition result comprises a second score of the voice data for each target language;
determining a final score for each target language based on the first score and the second score of the voice data for that target language; determining the target language with the highest final score as the language of the voice data; wherein the second language identification model has more network layers than the first language identification model.
2. The method of claim 1, wherein said determining the language characteristics of the speech data comprises:
acquiring acoustic characteristics of the voice data;
performing feature conversion on acoustic features of the voice data by using a feature conversion module of a pre-established language feature extraction model to obtain converted features;
extracting time sequence features from the converted features by using a time sequence feature extraction module of the language feature extraction model;
and extracting the language features of the voice data from the time sequence features by using a language feature extraction module of the language feature extraction model.
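A minimal PyTorch sketch of this three-module extractor follows; the layer types and dimensions (a linear conversion layer, an LSTM for the time sequence features, a linear language-feature head) are assumptions for illustration only.

```python
# Hypothetical sketch of the language feature extraction model of claim 2.
# Module choices and sizes are assumptions, not taken from the patent.
import torch
import torch.nn as nn

class LanguageFeatureExtractor(nn.Module):
    def __init__(self, acoustic_dim: int = 40, hidden_dim: int = 256, lang_dim: int = 128):
        super().__init__()
        self.feature_conversion = nn.Linear(acoustic_dim, hidden_dim)            # feature conversion module
        self.time_sequence = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)   # time sequence module
        self.language_head = nn.Linear(hidden_dim, lang_dim)                     # language feature module

    def forward(self, acoustic_feats: torch.Tensor) -> torch.Tensor:
        # acoustic_feats: (batch, frames, acoustic_dim)
        converted = torch.relu(self.feature_conversion(acoustic_feats))
        time_sequence_feats, _ = self.time_sequence(converted)
        return self.language_head(time_sequence_feats)   # frame-level language features
```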
3. The method of claim 2, wherein the training process of the language feature extraction model comprises:
acquiring training voice data;
determining acoustic features of each piece of training voice data and phoneme information of each piece of training voice data;
taking the acoustic features of each piece of training voice data as a training sample, taking the phoneme information of the training voice data as a sample label, and training to obtain a phoneme recognition model;
and removing the output layer of the phoneme recognition model to obtain the language feature extraction model.
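Continuing the same assumptions, a sketch of this training recipe: attach a phoneme output layer to the extractor, train on (acoustic features, phoneme label) pairs, then keep only the extractor. The frame pooling and training-loop details are toy simplifications.

```python
# Hypothetical sketch of claim 3: train a phoneme recognizer, then drop its
# output layer to obtain the language feature extraction model.
import torch
import torch.nn as nn

def train_language_feature_extractor(extractor: nn.Module, num_phonemes: int,
                                     loader, lang_dim: int = 128, epochs: int = 5) -> nn.Module:
    output_layer = nn.Linear(lang_dim, num_phonemes)   # phoneme classification layer
    model = nn.Sequential(extractor, output_layer)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for acoustic_feats, phoneme_labels in loader:  # samples: acoustic features; labels: phonemes
            logits = model(acoustic_feats).mean(dim=1) # pool frames (toy simplification)
            loss = loss_fn(logits, phoneme_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return extractor   # "removing the output layer": keep only the trained extractor
```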
4. The method of claim 1, wherein performing the first identification on the language features of the voice data by using the pre-established first language identification model to obtain the first language identification result comprises:
processing the language features of the voice data by using a mean supervector feature extraction module of the first language identification model to obtain a mean supervector feature of the language features;
and identifying the mean supervector feature of the language features by using a language identification module of the first language identification model to obtain the first language identification result.
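One way to realize this two-step scoring is sketched below, assuming cosine scoring against per-language reference supervectors; the patent's language identification module is not specified here, so the scoring rule is an assumption.

```python
# Hypothetical sketch of claim 4: collapse frame-level language features into
# a mean supervector, then score it against per-language references.
import numpy as np

def mean_supervector(lang_feats: np.ndarray) -> np.ndarray:
    """lang_feats: (frames, dim) -> one utterance-level mean vector."""
    return lang_feats.mean(axis=0)

def first_pass_scores(lang_feats: np.ndarray, reference_svs: np.ndarray) -> np.ndarray:
    """reference_svs: (num_target_languages, dim); returns a first score per language."""
    sv = mean_supervector(lang_feats)
    sv = sv / np.linalg.norm(sv)
    refs = reference_svs / np.linalg.norm(reference_svs, axis=1, keepdims=True)
    return refs @ sv   # cosine similarity per target language (assumed scoring rule)
```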
5. The method of claim 4, wherein the training process of the first language identification model comprises:
acquiring a training voice data set corresponding to at least one language;
labeling the training voice data set corresponding to each language to obtain a labeling result of the training voice data set corresponding to each language, wherein the labeling result indicates the language of the corresponding training voice data set;
determining the language features of each training voice data in the training voice data set corresponding to each language;
determining a mean supervector feature set of the training voice data set corresponding to each language by using the language features of each training voice data;
and training to obtain the first language identification model by using the mean supervector feature set and the labeling result of the training voice data set corresponding to each language.
6. The method of claim 5, wherein determining the mean supervector feature set of the training voice data set corresponding to each language by using the language features of each training voice data comprises:
clustering the training voice data in the training voice data set corresponding to each language by using the language features of each training voice data, to obtain training voice data subsets corresponding to the language;
for each training voice data subset corresponding to the language, combining the initial mean supervector features of the training voice data in that subset to obtain a mean supervector feature of the subset, wherein the mean supervector features of all the training voice data subsets corresponding to the language form the mean supervector feature set of the training voice data set corresponding to the language.
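A sketch of this per-language clustering step, assuming scikit-learn's KMeans and using the utterance-level mean of the language features both as the clustering input and as the initial mean supervector (both choices are assumptions):

```python
# Hypothetical sketch of claim 6: within one language, cluster the training
# utterances by their language features, then average the initial mean
# supervectors inside each cluster into one mean supervector per subset.
import numpy as np
from sklearn.cluster import KMeans

def mean_supervector_set(utterance_lang_feats: list, n_clusters: int = 4) -> list:
    # one initial mean supervector per training utterance: (frames, dim) -> (dim,)
    initial_svs = np.stack([feats.mean(axis=0) for feats in utterance_lang_feats])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(initial_svs)
    # combine the initial supervectors inside each subset (cluster)
    return [initial_svs[labels == k].mean(axis=0) for k in range(n_clusters)]
```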
7. The method of claim 1, wherein the second language identification model is obtained by training a preset end-to-end neural network model, using the language features of training voice data as training samples and the languages labeled for the training voice data as sample labels.
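And a sketch of training the deeper second model end to end on (language feature, language label) pairs; the architecture below is an arbitrary example of a network with more layers than the first-pass model, not the patent's preset network.

```python
# Hypothetical sketch of claim 7: train a preset end-to-end network on
# language features (samples) and labeled languages (labels).
import torch
import torch.nn as nn

def build_second_model(lang_dim: int = 128, num_languages: int = 3) -> nn.Module:
    return nn.Sequential(                       # deeper than the first-pass model
        nn.Linear(lang_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, num_languages),          # one score per target language
    )

def train_second_model(model: nn.Module, loader, epochs: int = 5) -> nn.Module:
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for lang_feats, lang_labels in loader:      # (batch, frames, lang_dim), (batch,)
            logits = model(lang_feats.mean(dim=1))  # pool frames to utterance level
            loss = loss_fn(logits, lang_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```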
8. A language identification device, comprising:
an acquisition unit, configured to acquire voice data to be recognized;
a language feature determination unit, configured to determine the language features of the voice data;
a first language identification unit, configured to perform a first identification on the language features of the voice data by using a pre-established first language identification model to obtain a first language identification result, the first language identification result comprising a first score of the language of the voice data for each target language, wherein the target languages are preset languages;
a second language identification unit, configured to perform, when the first language identification result is inaccurate, a second identification on the language features of the voice data by using a pre-established second language identification model to obtain a second language identification result, the second language identification result comprising a second score of the language of the voice data for each target language; and
a language determination unit, configured to determine a final score for each target language based on the first score and the second score of the language of the voice data for that target language, and to determine the language of the voice data to be the target language corresponding to the highest final score; wherein the second language identification model has more network layers than the first language identification model.
9. A language identification device comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the language identification method according to any one of claims 1 to 7.
10. A readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the language identification method according to any one of claims 1 to 7.
CN202010607693.6A 2020-06-29 2020-06-29 Language identification method, related equipment and readable storage medium Active CN111724766B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010607693.6A CN111724766B (en) 2020-06-29 2020-06-29 Language identification method, related equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111724766A CN111724766A (en) 2020-09-29
CN111724766B true CN111724766B (en) 2024-01-05

Family

ID=72570223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010607693.6A Active CN111724766B (en) 2020-06-29 2020-06-29 Language identification method, related equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111724766B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528682A (en) * 2020-12-23 2021-03-19 北京百度网讯科技有限公司 Language detection method and device, electronic equipment and storage medium
CN113782000B (en) * 2021-09-29 2022-04-12 北京中科智加科技有限公司 Language identification method based on multiple tasks
CN114678029B (en) * 2022-05-27 2022-09-02 深圳市人马互动科技有限公司 Speech processing method, system, computer readable storage medium and program product

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101546555A (en) * 2009-04-14 2009-09-30 清华大学 Constraint heteroscedasticity linear discriminant analysis method for language identification
CN101894548A (en) * 2010-06-23 2010-11-24 清华大学 Modeling method and modeling device for language identification
WO2014029099A1 (en) * 2012-08-24 2014-02-27 Microsoft Corporation I-vector based clustering training data in speech recognition
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
CN106898355A * 2017-01-17 2017-06-27 清华大学 Speaker recognition method based on two-stage modeling
CN107564513A (en) * 2016-06-30 2018-01-09 阿里巴巴集团控股有限公司 Audio recognition method and device
CN107622770A (en) * 2017-09-30 2018-01-23 百度在线网络技术(北京)有限公司 voice awakening method and device
CN109817220A (en) * 2017-11-17 2019-05-28 阿里巴巴集团控股有限公司 Audio recognition method, apparatus and system
CN110211565A (en) * 2019-05-06 2019-09-06 平安科技(深圳)有限公司 Accent recognition method, apparatus and computer readable storage medium
CN110782872A (en) * 2019-11-11 2020-02-11 复旦大学 Language identification method and device based on deep convolutional recurrent neural network
CN110853618A (en) * 2019-11-19 2020-02-28 腾讯科技(深圳)有限公司 Language identification method, model training method, device and equipment
CN110930978A (en) * 2019-11-08 2020-03-27 北京搜狗科技发展有限公司 Language identification method and device and language identification device
CN111312214A (en) * 2020-03-31 2020-06-19 广东美的制冷设备有限公司 Voice recognition method and device for air conditioner, air conditioner and readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180071029A (en) * 2016-12-19 2018-06-27 삼성전자주식회사 Method and apparatus for speech recognition
CN108335696A (en) * 2018-02-09 2018-07-27 百度在线网络技术(北京)有限公司 Voice awakening method and device

Also Published As

Publication number Publication date
CN111724766A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN111724766B (en) Language identification method, related equipment and readable storage medium
CN109918680B (en) Entity identification method and device and computer equipment
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN107291783B (en) Semantic matching method and intelligent equipment
CN103456297B Method and apparatus for speech recognition matching
CN110210029A (en) Speech text error correction method, system, equipment and medium based on vertical field
CN108986797B (en) Voice theme recognition method and system
CN111445898B (en) Language identification method and device, electronic equipment and storage medium
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN111709242A (en) Chinese punctuation mark adding method based on named entity recognition
CN110377695B (en) Public opinion theme data clustering method and device and storage medium
CN111192572A (en) Semantic recognition method, device and system
CN110164417B (en) Language vector obtaining and language identification method and related device
CN115064154A (en) Method and device for generating mixed language voice recognition model
CN111444720A (en) Named entity recognition method for English text
CN113850291A (en) Text processing and model training method, device, equipment and storage medium
Bigot et al. Combining acoustic name spotting and continuous context models to improve spoken person name recognition in speech
CN113051384A (en) User portrait extraction method based on conversation and related device
CN117454898A (en) Method and device for realizing legal entity standardized output according to input text
CN110570838A (en) Voice stream processing method and device
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN112967710B (en) Low-resource customer dialect point identification method
CN115691503A (en) Voice recognition method and device, electronic equipment and storage medium
CN114863915A (en) Voice awakening method and system based on semantic preservation
CN113111855A (en) Multi-mode emotion recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant