CN111724766A - Language identification method, related equipment and readable storage medium

Info

Publication number: CN111724766A (application); CN111724766B (granted)
Application number: CN202010607693.6A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 杨军, 方磊, 方四安, 唐磊
Applicant and current assignee: Hefei Ustc Iflytek Co ltd
Legal status: Granted; active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/005: Language recognition
    • G10L15/08: Speech classification or search
    • G10L15/28: Constructional details of speech recognition systems
    • G10L15/32: Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems

Abstract

The application discloses a language identification method, related equipment and a readable storage medium. After voice data to be identified are obtained, the language features of the voice data are determined; first recognition is performed on the language features of the voice data by using a pre-established first language identification model to obtain a first language identification result; when the first language identification result is inaccurate, second recognition is performed on the language features of the voice data by using a pre-established second language identification model to obtain a second language identification result, and the language of the voice data is determined based on the first and second language identification results. In this scheme, if the first language identification result is inaccurate, the second language identification model, which has more network layers than the first language identification model, can be used for the second recognition, so that identification accuracy is improved.

Description

Language identification method, related equipment and readable storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a language identification method, a related device, and a readable storage medium.
Background
Language identification is the process by which a computer analyzes and processes voice data to judge its language type, and it is an important research direction of speech recognition. With the continuous acceleration of globalization, language identification has broad application prospects in fields such as multilingual information services, machine translation, and military security. In the prior art, models such as the Gaussian Mixture Model (GMM), the Support Vector Machine (SVM), and the Gaussian Mixture Model Supervector-Support Vector Machine (GSV-SVM) are mostly used for recognizing languages.
However, the accuracy of the language identification results obtained by the prior-art language identification methods is not ideal.
Therefore, it is necessary to optimize the language identification method in the prior art.
Disclosure of Invention
In view of the foregoing, the present application provides a language identification method, a related device and a readable storage medium. The specific scheme is as follows:
a language identification method comprises the following steps:
acquiring voice data to be recognized;
determining language features of the voice data;
performing first recognition on the language features of the voice data by using a pre-established first language recognition model to obtain a first language recognition result;
when the first language identification result is inaccurate, performing second identification on the language characteristics of the voice data by using a pre-established second language identification model to obtain a second language identification result; determining the language of the voice data based on the first language identification result and the second language identification result; the number of network layers of the second language identification model is more than that of the first language identification model.
Optionally, the determining the language characteristic of the voice data includes:
acquiring acoustic features of the voice data;
performing feature conversion on the acoustic features of the voice data by using a feature conversion module of a pre-established language feature extraction model to obtain converted features;
extracting time sequence features from the transformed features by utilizing a time sequence feature extraction module of the language feature extraction model;
and extracting the language features of the voice data from the time sequence features by utilizing a language feature extraction module of the language feature extraction model.
Optionally, the training process of the language feature extraction model includes:
acquiring training voice data;
determining acoustic features of each training speech data, and phoneme information of each training speech data;
training by taking the acoustic feature of each training voice data as a training sample and taking the phoneme information of the training voice data as a sample label to obtain a phoneme recognition model;
and removing an output layer of the phoneme recognition model to obtain the language feature extraction model.
Optionally, the performing first recognition on the language features of the voice data by using a pre-established first language identification model to obtain a first language identification result includes:
processing the language features of the voice data by using a mean value super vector feature extraction module of the first language identification model to obtain mean value super vector features of the language features;
and identifying the mean value super-vector characteristics of the language characteristics by using the language identification module of the first language identification model to obtain a first language identification result.
Optionally, the training process of the first language identification model includes:
acquiring a training voice data set corresponding to at least one language;
labeling the training voice data set corresponding to each language to obtain a labeling result of the training voice data set corresponding to each language, wherein the labeling result of the training voice data corresponding to each language is used for indicating the language of the training voice data set corresponding to the language;
determining the language features of each training voice data in a training voice data set corresponding to each language;
determining a mean value super vector characteristic set of a training voice data set corresponding to each language by using language characteristics of each training voice data;
and training by using the mean value super vector characteristic set of the training voice data set corresponding to each language and the labeling result of the training voice data set corresponding to each language to obtain the first language identification model.
Optionally, the determining, by using the language features of each training speech data, a mean value super vector feature set of a training speech data set corresponding to each language includes:
aiming at a training voice data set corresponding to each language, clustering each training voice data in the training voice data set corresponding to each language by using the language features of each training voice data to obtain a training voice data subset corresponding to each language;
aiming at each training voice data subset in the training voice data subsets corresponding to the languages, combining initial mean value super-vector characteristics of each training voice data in the training voice data subsets to obtain mean value super-vector characteristics of the training voice data subsets; and the mean value super vector characteristics of all the training voice data subsets corresponding to the languages form a mean value super vector characteristic set of the training voice data set corresponding to the languages.
Optionally, the second language identification model is obtained by training a preset end-to-end neural network model with the language features of the training voice data as training samples and the language labeled by the training voice data as sample labels.
A language identification device comprising:
the device comprises an acquisition unit, a recognition unit and a processing unit, wherein the acquisition unit is used for acquiring voice data to be recognized;
a language feature determination unit, configured to determine a language feature of the voice data;
the first language identification unit is used for carrying out first identification on the language characteristics of the voice data by utilizing a pre-established first language identification model to obtain a first language identification result;
the second language identification unit is used for carrying out second identification on the language characteristics of the voice data by utilizing a pre-established second language identification model to obtain a second language identification result when the first language identification result is inaccurate; the number of network layers of the second language identification model is more than that of the first language identification model;
and the language determining unit is used for determining the language of the voice data based on the first language identification result and the second language identification result.
Optionally, the language feature determining unit includes:
an acoustic feature acquisition unit, configured to acquire an acoustic feature of the voice data;
the feature conversion unit is used for performing feature conversion on the acoustic features of the voice data by using a feature conversion module of a pre-established language feature extraction model to obtain converted features;
a time sequence feature extraction unit, configured to extract time sequence features from the transformed features by using a time sequence feature extraction module of the language feature extraction model;
and the language feature extraction unit is used for extracting the language features of the voice data from the time sequence features by utilizing a language feature extraction module of the language feature extraction model.
Optionally, the training process of the language feature extraction model includes:
acquiring training voice data;
determining acoustic features of each training speech data, and phoneme information of each training speech data;
training by taking the acoustic feature of each training voice data as a training sample and taking the phoneme information of the training voice data as a sample label to obtain a phoneme recognition model;
and removing an output layer of the phoneme recognition model to obtain the language feature extraction model.
Optionally, the first language identification unit includes:
the mean value super vector feature determining unit is used for processing the language features of the voice data by using a mean value super vector feature extracting module of the first language identification model to obtain mean value super vector features of the language features;
and the recognition unit is used for recognizing the mean value super-vector feature of the language feature by using the language recognition module of the first language recognition model to obtain a first language recognition result.
Optionally, the training process of the first language identification model includes:
acquiring a training voice data set corresponding to at least one language;
labeling the training voice data set corresponding to each language to obtain a labeling result of the training voice data set corresponding to each language, wherein the labeling result of the training voice data corresponding to each language is used for indicating the language of the training voice data set corresponding to the language;
determining the language features of each training voice data in a training voice data set corresponding to each language;
determining a mean value super vector characteristic set of a training voice data set corresponding to each language by using language characteristics of each training voice data;
and training by using the mean value super vector characteristic set of the training voice data set corresponding to each language and the labeling result of the training voice data set corresponding to each language to obtain the first language identification model.
Optionally, the determining, by using the language features of each training speech data, a mean value super vector feature set of a training speech data set corresponding to each language includes:
aiming at a training voice data set corresponding to each language, clustering each training voice data in the training voice data set corresponding to each language by using the language features of each training voice data to obtain a training voice data subset corresponding to each language;
aiming at each training voice data subset in the training voice data subsets corresponding to the languages, combining initial mean value super-vector characteristics of each training voice data in the training voice data subsets to obtain mean value super-vector characteristics of the training voice data subsets; and the mean value super vector characteristics of all the training voice data subsets corresponding to the languages form a mean value super vector characteristic set of the training voice data set corresponding to the languages.
Optionally, the second language identification model is obtained by training a preset end-to-end neural network model with the language features of the training voice data as training samples and the language labeled by the training voice data as sample labels.
A language identification device comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the language identification method.
A readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the language identification method as described above.
By means of the above technical scheme, the application discloses a language identification method, related equipment and a readable storage medium. After voice data to be identified are obtained, the language features of the voice data are determined; first recognition is performed on the language features of the voice data by using a pre-established first language identification model to obtain a first language identification result; when the first language identification result is inaccurate, second recognition is performed on the language features of the voice data by using a pre-established second language identification model to obtain a second language identification result, and the language of the voice data is determined based on the first and second language identification results. In this scheme, if the first language identification result is inaccurate, the second language identification model, which has more network layers than the first language identification model, can be used for the second recognition, so that identification accuracy is improved.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart illustrating a language identification method disclosed in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a language feature extraction model disclosed in an embodiment of the present application;
FIG. 3 is a schematic diagram of a structure of a phoneme recognition model disclosed in the embodiments of the present application;
FIG. 4 is a schematic structural diagram of a first language identification model according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a language identification device according to an embodiment of the present application;
fig. 6 is a block diagram of a hardware structure of a language identification device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Next, the language identification method provided by the present application is described by the following embodiments.
Referring to fig. 1, fig. 1 is a schematic flow chart of a language identification method disclosed in an embodiment of the present application, where the method may include:
step S101: and acquiring voice data to be recognized.
The voice data to be recognized is voice data spoken by a user according to application requirements, such as voice data input by the user when making a call, or voice data input through a voice input method when the user uses an instant chat tool; this application imposes no limitation here.
Step S102: and determining language features of the voice data.
It should be noted that although acoustic features of the voice data, such as SDC (Shifted Delta Cepstral) features, could be adopted as the language features of the voice data in the present application, the language information contained in acoustic features is often limited and cannot guarantee high recognition accuracy. For example, when the SDC feature of the voice data is used as its language feature, if the voice data is short speech data with an effective duration below a preset duration (e.g., 3 seconds), the SDC feature is short and contains little language information, which may make the language recognition result of the short speech data inaccurate. Therefore, in the present application, the language feature of the voice data may be another feature that contains more language information and is determined based on acoustic features such as the SDC feature.
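For reference, the SDC feature stacks k shifted delta blocks computed from frame-level cepstra. Below is a minimal NumPy sketch of one common SDC parameterization; the d, p, k values are assumptions for illustration, since the patent does not specify the SDC configuration.

```python
import numpy as np

def sdc_features(c, d=1, p=3, k=7):
    """Shifted Delta Cepstral features: at frame t, the i-th block is
    c(t + i*p + d) - c(t + i*p - d); the k blocks are stacked.

    c: (T, N) frame-level cepstra (e.g., MFCCs). Edge frames are clamped.
    Returns a (T, N * k) array.
    """
    t = np.arange(c.shape[0])
    blocks = []
    for i in range(k):
        plus = np.clip(t + i * p + d, 0, c.shape[0] - 1)
        minus = np.clip(t + i * p - d, 0, c.shape[0] - 1)
        blocks.append(c[plus] - c[minus])
    return np.concatenate(blocks, axis=1)
```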
The specific implementation manner of determining the language features of the voice data will be described in detail by the following embodiments.
Step S103: and performing first recognition on the language features of the voice data by using a pre-established first language recognition model to obtain a first language recognition result.
Traditional language identification models, such as Gaussian Mixture Models (GMM), Support Vector Machines (SVM), and Gaussian Mixture Model Supervector-Support Vector Machines (GSV-SVM), are trained on acoustic features such as SDC features, and since such acoustic features contain little language information, the language identification accuracy of traditional models is low.
In the application, the pre-established first language identification model may be a model obtained by retraining a conventional language identification model by using a language feature of training data that includes more language information.
The specific implementation manner of the first language recognition result obtained by performing the first recognition on the language features of the speech data by using the first language recognition model will be described in detail in the following embodiments.
Step S104: judging whether the first language identification result is accurate; when it is not accurate, executing step S105 and step S106; when it is accurate, executing step S107.
In the present application, there are various ways to determine whether the first language identification result is accurate.
As an implementation manner, target languages (for example, Chinese, English, French, and others) may be preset, and the first recognition result may include a first score of the voice data for each target language. A specific implementation of judging whether the first language identification result is accurate may then be as follows: judge whether the difference between the highest first score and the lowest first score meets a preset condition; if so, determine that the first language identification result is accurate, and otherwise determine that it is inaccurate. The preset condition may be being greater than or equal to a preset threshold, falling within a preset interval, and the like; this application imposes no limitation here.
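A minimal sketch of this judgment, assuming the first scores sit in a dictionary and using an illustrative threshold (the patent leaves the preset condition open):

```python
def first_result_is_accurate(first_scores, threshold=0.5):
    """first_scores: dict mapping each target language to its first score.
    threshold is a hypothetical value for the preset condition."""
    ranked = sorted(first_scores.values(), reverse=True)
    # gap between the highest and lowest first scores must meet the condition
    return (ranked[0] - ranked[-1]) >= threshold
```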
Step S105: and performing secondary recognition on the language features of the voice data by using a pre-established second language recognition model to obtain a secondary language recognition result.
In the application, the number of network layers of the second language identification model is greater than that of the first language identification model, so that the language identification accuracy of the second language identification model is higher than that of the first language identification model.
The specific implementation manner of the second language recognition result obtained by performing the second recognition on the language features of the speech data by using the second language recognition model will be described in detail in the following embodiments.
Step S106: and determining the language of the voice data based on the first language identification result and the second language identification result.
In this application, target languages (for example, Chinese, English, French, and others) may be preset, the first recognition result may include a first score of the voice data for each target language, and the second recognition result may include a second score of the voice data for each target language. A specific implementation of determining the language of the voice data based on the first language identification result and the second language identification result may be: determining a final score of the voice data for each target language based on its first score and second score for that target language, and then determining the target language with the highest final score as the language of the voice data.
Based on the first score and the second score of the voice data for each target language, the final score for each target language may be determined as follows: preset a weight for the first recognition result and a weight for the second recognition result, and fuse the first score and the second score for each target language according to these weights to obtain the final score of the voice data for that target language.
For ease of understanding, assume the weight of the first recognition result is α and the weight of the second recognition result is 1 - α. If the first score of the voice data for the target language Chinese is 0.8 and the second score for Chinese is 0.6, the final score of the voice data for Chinese is 0.8 × α + 0.6 × (1 - α).
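The same fusion as a short Python sketch; the weight value and the language set are assumptions for illustration:

```python
def fuse_scores(first, second, alpha=0.7):
    """alpha: preset weight of the first recognition result (assumed value);
    the second recognition result is weighted by 1 - alpha."""
    return {lang: alpha * first[lang] + (1 - alpha) * second[lang]
            for lang in first}

final = fuse_scores({"Chinese": 0.8, "English": 0.1},
                    {"Chinese": 0.6, "English": 0.3})
language = max(final, key=final.get)  # target language with the highest final score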
Step S107: and determining the language of the voice data based on the first language identification result.
In this application, target languages (for example, Chinese, English, French, and others) may be preset, and the first recognition result may include a first score of the voice data for each target language. Determining the language of the voice data based on the first language identification result may then include: determining the target language with the highest first score as the language of the voice data.
The embodiment discloses a language identification method: after voice data to be identified are obtained, the language features of the voice data are determined; first recognition is performed on the language features using a pre-established first language identification model to obtain a first language identification result; when the first language identification result is inaccurate, second recognition is performed on the language features using a pre-established second language identification model to obtain a second language identification result, and the language of the voice data is determined based on the first and second language identification results. In this scheme, if the first language identification result is inaccurate, the second language identification model, which has more network layers than the first language identification model, can be used for the second recognition, so that identification accuracy is improved.
In addition, in the present application, language identification is not performed twice on all voice data; only voice data whose first language identification result is inaccurate undergo the second language identification. When many pieces of voice data need language identification, this improves recognition speed compared with performing language identification twice on every piece of voice data.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a language feature extraction model disclosed in an embodiment of the present application. The language feature extraction model includes a feature transformation module, a time-series feature extraction module, and a language feature extraction module. Since a DNN (Deep Neural Network) is good at performing nonlinear transformation on data, the feature transformation module can be realized based on a DNN in the present application. Since a BiLSTM (Bi-directional Long Short-Term Memory) network is good at analyzing time series, the time-series feature extraction module can be realized based on a BiLSTM. Since a BN (Bottleneck Network) layer can reduce the dimensionality of the features from the preceding network layer and improve the training speed of the model, the language feature extraction module can be realized based on a BN layer.
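For illustration, a minimal PyTorch sketch of such a DNN -> BiLSTM -> bottleneck stack; all layer sizes are assumptions, not values from the patent:

```python
import torch.nn as nn

class LanguageFeatureExtractor(nn.Module):
    def __init__(self, sdc_dim=56, hidden=512, bottleneck=64):
        super().__init__()
        self.dnn = nn.Sequential(                   # feature transformation module
            nn.Linear(sdc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.bilstm = nn.LSTM(hidden, hidden // 2,  # time-series feature module
                              batch_first=True, bidirectional=True)
        self.bottleneck = nn.Linear(hidden, bottleneck)  # BN language feature module

    def forward(self, x):              # x: (batch, frames, sdc_dim)
        h = self.dnn(x)
        h, _ = self.bilstm(h)          # bidirectional outputs, size `hidden`
        return self.bottleneck(h)      # frame-level language features
```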
Based on the language feature extraction model shown in fig. 2, a specific implementation manner of determining the language feature of the voice data in step S102 is described in this application. The method can comprise the following steps:
step S201: and acquiring acoustic features of the voice data.
In the present application, the acoustic feature of the voice data may be an SDC feature of the voice data.
Step S202: and performing feature conversion on the acoustic features of the voice data by using a feature conversion module of a pre-established language feature extraction model to obtain converted features.
The transformed features are nonlinear features corresponding to the acoustic features of the speech data.
Step S203: and extracting time sequence characteristics from the transformed characteristics by utilizing a time sequence characteristic extraction module of the language characteristic extraction model.
Step S204: and extracting the language features of the voice data from the time sequence features by utilizing a language feature extraction module of the language feature extraction model.
In this application, the language feature extraction module of the language feature extraction model can perform dimensionality reduction processing on the time sequence feature to obtain the language feature of the voice data.
For the training of the language feature extraction model, theoretically the acoustic features of training voice data could be used as training samples and language features labeled on the training voice data as sample labels. However, labeling a piece of voice data with its corresponding language features is impractical, whereas mature speech recognition models can already obtain the phoneme information of voice data. Therefore, in the present application, a phoneme recognition model may be preset, and the language feature extraction model may be obtained by training the phoneme recognition model. The details are as follows:
please refer to fig. 3, which is a schematic structural diagram of a phoneme recognition model disclosed in an embodiment of the present application, the phoneme recognition model includes a feature transformation module, a time-series feature extraction module, a language feature extraction module, and an output layer, wherein the feature transformation module, the time-series feature extraction module, and the language feature extraction module may be modules of the language feature extraction model.
In the present application, the language feature extraction model may be obtained by training the phoneme recognition model and then removing an output layer of the phoneme recognition model.
Based on the phoneme recognition model shown in fig. 3, the training process of the language feature extraction model may include:
step S301: training speech data is obtained.
Step S302: acoustic features of each training speech data are determined, and phoneme information of each training speech data is determined.
In the present application, the acoustic feature of each training speech data and the phoneme information of each training speech data may be obtained based on a conventional speech recognition model. In this regard, the present application is not described further.
Step S303: and training to obtain a phoneme recognition model by taking the acoustic features of each training voice data as training samples and the phoneme information of the training voice data as sample labels.
Step S304: and removing an output layer of the phoneme recognition model to obtain the language feature extraction model.
As can be seen from fig. 2 and 3, the language feature extraction model shown in fig. 2 can be obtained by removing the output layer of the phoneme recognition model shown in fig. 3.
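Continuing the PyTorch sketch above, the phoneme recognition model can be viewed as the extractor plus an output layer that is discarded after training; the number of phoneme classes is an assumed value:

```python
import torch.nn as nn

class PhonemeRecognizer(nn.Module):
    def __init__(self, extractor, feat_dim=64, num_phonemes=200):
        super().__init__()
        self.extractor = extractor
        self.output = nn.Linear(feat_dim, num_phonemes)  # discarded after training

    def forward(self, x):
        return self.output(self.extractor(x))  # frame-level phoneme logits

model = PhonemeRecognizer(LanguageFeatureExtractor())
# ... train with acoustic features as samples and phoneme labels as targets ...
language_feature_model = model.extractor  # output layer removed (step S304)
```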
Referring to fig. 4, fig. 4 is a schematic structural diagram of a first language identification model disclosed in an embodiment of the present application. The first language identification model includes a mean value super vector feature extraction module and a language identification module, where the language identification module can adopt a Support Vector Machine (SVM) algorithm to identify languages.
Based on the structure of the first language identification model shown in fig. 4, in another embodiment of the present application, a specific implementation manner of performing the first recognition on the language features of the voice data by using the pre-established first language identification model in step S103 to obtain a first language recognition result is described, which may include the following steps:
step S401: processing the language features of the voice data by using a mean value super vector feature extraction module in a first language identification model to obtain mean value super vector features of the language features;
step S402: and identifying the mean value super-vector characteristics of the language characteristics by utilizing a language identification module in the first language identification model to obtain a first language identification result.
It should be noted that the training process of the first language identification model may include:
step S501: and acquiring a training voice data set corresponding to at least one language.
In the present application, target languages (for example, Chinese, English, French, and others) may be preset, and a training voice data set corresponding to each target language is obtained. It should be noted that, to ensure the model effect, the voice data in the training voice data set corresponding to each target language should include both long voice data with a duration longer than a first preset duration (for example, 3 seconds) and short voice data with a duration not longer than the first preset duration, and the total duration of all the voice data needs to reach a second preset duration (for example, 20 hours).
Step S502: and labeling the training voice data set corresponding to each language to obtain a labeling result of the training voice data set corresponding to each language.
And the labeling result of the training voice data corresponding to each language is used for indicating the language of the training voice data set corresponding to the language.
Step S503: and determining the language features of each training voice data in the training voice data set corresponding to each language.
In the present application, each training speech data may be processed based on the language feature extraction model to obtain the language feature of each training speech data.
Step S504: and determining a mean value super-vector characteristic set of the training voice data set corresponding to each language by using the language characteristics of each training voice data.
In the application, the language features of each training voice data can be utilized to determine a universal background model and a total variability space matrix.
For the training voice data set corresponding to each language, the mean value super vector feature set of that data set is determined by utilizing the language features of each training voice data in the data set, the universal background model, and the total variability space matrix.
As an implementation manner, for each training voice data in the training voice data set corresponding to each language, the language features of that training voice data, the universal background model, and the total variability space matrix are used to determine its initial mean value super vector feature, and the initial mean value super vector features of all the training voice data constitute the mean value super vector feature set.
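One common way to obtain such an initial mean value super vector is relevance-MAP adaptation of the universal background model means; the sketch below is an assumption for illustration, since the patent does not spell out the computation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(rng.normal(size=(2000, 64)))    # UBM over pooled language features

def initial_mean_supervector(frames, r=16.0):
    """Relevance-MAP adapt the UBM means to one utterance and stack them.
    frames: (T, dim) language features; r: relevance factor (assumed)."""
    post = ubm.predict_proba(frames)     # (T, n_components) posteriors
    n = post.sum(axis=0)                 # zeroth-order statistics
    f = post.T @ frames                  # first-order statistics
    alpha = (n / (n + r))[:, None]       # per-component adaptation weights
    adapted = alpha * (f / np.maximum(n, 1e-8)[:, None]) + (1 - alpha) * ubm.means_
    return adapted.reshape(-1)           # concatenated mean super vector

sv = initial_mean_supervector(rng.normal(size=(300, 64)))  # shape (8 * 64,)
```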
However, the training speech data set corresponding to each language includes a large number of training speech data, and if the initial mean value supervector features of all the training speech data are combined into the mean value supervector feature set, the convergence rate of the first language identification model is slow.
Therefore, another embodiment is proposed in the present application, which aims to reduce the number of mean supervector features in the mean supervector feature set of the training speech data set corresponding to each language, and improve the convergence rate of the first language identification model. The method specifically comprises the following steps:
step S5041: and aiming at the training voice data set corresponding to each language, clustering each training voice data in the training voice data set corresponding to each language by using the language characteristics of each training voice data to obtain a training voice data subset corresponding to each language.
In the application, for the training voice data set corresponding to each language, the language features of each training voice data in the data set, the universal background model, and the total variability space matrix are utilized to determine the initial mean value super vector feature and the i-vector feature of each training voice data in the set; each training voice data is then clustered based on its initial mean value super vector feature or i-vector feature to obtain the training voice data subsets corresponding to the language. Each training voice data subset includes at least one training voice data.
It should be noted that, based on the initial mean value super vector feature or the i-vector feature of each training voice data, the clustering may be performed as follows: calculate the similarity between the initial mean value super vector features or the i-vector features of the training voice data, and cluster the training voice data based on the similarity. Specifically, several training voice data whose initial mean value super vector features or i-vector features have high similarity may be clustered into one training voice data subset.
It should be further noted that although all subsets obtained after clustering could be taken as the training voice data subsets corresponding to the language, subsets containing fewer training voice data than a preset threshold (for example, 3) would make the mean value super vectors in the final mean value super vector feature set relatively scattered, which is not conducive to training the model, so such subsets may be discarded.
For ease of understanding, assume the training voice data set corresponding to the target language Chinese includes 5000 pieces of training voice data and 1000 training voice data subsets are obtained through clustering, of which 200 subsets contain fewer training voice data than the preset threshold (for example, 3). The 200 subsets are discarded, and the remaining 800 subsets are the training voice data subsets corresponding to the target language Chinese. The final mean value super vector feature set of the training voice data set corresponding to Chinese then includes only 800 mean value super vector features, far fewer than the original 5000.
Step S5042: aiming at each training voice data subset in the training voice data subsets corresponding to the languages, combining initial mean value super-vector characteristics of each training voice data in the training voice data subsets to obtain mean value super-vector characteristics of the training voice data subsets; and the mean value super vector characteristics of all the training voice data subsets corresponding to the languages form a mean value super vector characteristic set of the training voice data set corresponding to the languages.
For ease of understanding, assume the training voice data set corresponding to the target language Chinese includes 5000 pieces of training voice data. If the training voice data subsets were determined without clustering, the mean value super vector feature set of the training voice data set would include 5000 mean value super vectors, making the convergence rate of the first language identification model too low; with clustering, the number of mean value super vectors in the feature set is greatly reduced, which improves the convergence rate of the first language identification model.
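A minimal scikit-learn (1.2+ API) sketch of steps S5041 and S5042: cluster per-utterance mean value super vectors, discard sparse subsets, and average within each kept subset. The clustering algorithm, cluster count, and threshold are assumptions; the patent does not prescribe a specific method.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def merged_supervector_set(supervectors, n_clusters=1000, min_size=3):
    """supervectors: (n_utterances, dim) initial mean value super vectors
    of one language's training set. Returns the reduced feature set."""
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, metric="cosine", linkage="average"
    ).fit_predict(supervectors)
    merged = []
    for c in range(n_clusters):
        members = supervectors[labels == c]
        if len(members) >= min_size:             # discard sparse subsets
            merged.append(members.mean(axis=0))  # mean super vector of the subset
    return np.stack(merged)
```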
Step S505: and training by using the mean value super vector characteristic set of the training voice data set corresponding to each language and the labeling result of the training voice data set corresponding to each language to obtain the first language identification model.
In another embodiment of the present application, the second language identification model is obtained by training a preset end-to-end neural network model with the language features of training voice data as training samples and the languages labeled on the training voice data as sample labels. As an implementation manner, the preset neural network model can be an end-to-end TDNN (Time-Delay Neural Network).
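For illustration, a minimal PyTorch sketch of an end-to-end TDNN classifier over frame-level language features; context widths, channel sizes, and the pooling scheme are assumptions, as the patent only names the model family:

```python
import torch
import torch.nn as nn

class TDNNLanguageClassifier(nn.Module):
    def __init__(self, feat_dim=64, num_languages=4):
        super().__init__()
        self.tdnn = nn.Sequential(   # dilated 1-D convolutions = time-delay layers
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
        )
        self.classifier = nn.Linear(1024, num_languages)

    def forward(self, x):                         # x: (batch, frames, feat_dim)
        h = self.tdnn(x.transpose(1, 2))          # (batch, 512, frames')
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # utterance pooling
        return self.classifier(stats)             # second score per language
```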
Based on the above scheme, in the present application, the second language identification model has more network layers and therefore higher language identification accuracy. Theoretically, recognizing voice data with the second language identification model alone would yield a highly accurate result. However, because of its larger number of network layers, the second language identification model needs a long time to process input voice data before outputting its language. In this case, for language identification scenarios with high real-time requirements, simply adopting the second language identification model cannot satisfy the real-time requirement. Therefore, in the present application, voice data is first recognized by the first language identification model, which has fewer network layers, to obtain the first language identification result; if that result is inaccurate, the second language identification model is used for recognition. This ensures both the accuracy of the language identification result and the efficiency of language identification.
The following describes the language identification device disclosed in the embodiment of the present application, and the language identification device described below and the language identification method described above may be referred to correspondingly.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a language identification device disclosed in the embodiment of the present application. As shown in fig. 5, the language identification means may include:
an acquisition unit 11 configured to acquire voice data to be recognized;
a language feature determination unit 12, configured to determine a language feature of the voice data;
a first language identification unit 13, configured to perform first identification on language features of the voice data by using a pre-established first language identification model to obtain a first language identification result;
a second language identification unit 14, configured to perform, when the first language identification result is inaccurate, a second language identification on the language features of the speech data by using a second language identification model that is established in advance, so as to obtain a second language identification result; the number of network layers of the second language identification model is more than that of the first language identification model;
a language determining unit 15, configured to determine a language of the voice data based on the first language identification result and the second language identification result.
Optionally, the language feature determining unit includes:
an acoustic feature acquisition unit, configured to acquire an acoustic feature of the voice data;
the feature conversion unit is used for performing feature conversion on the acoustic features of the voice data by using a feature conversion module of a pre-established language feature extraction model to obtain converted features;
a time sequence feature extraction unit, configured to extract time sequence features from the transformed features by using a time sequence feature extraction module of the language feature extraction model;
and the language feature extraction unit is used for extracting the language features of the voice data from the time sequence features by utilizing a language feature extraction module of the language feature extraction model.
Optionally, the training process of the language feature extraction model includes:
acquiring training voice data;
determining acoustic features of each training speech data, and phoneme information of each training speech data;
training by taking the acoustic feature of each training voice data as a training sample and taking the phoneme information of the training voice data as a sample label to obtain a phoneme recognition model;
and removing an output layer of the phoneme recognition model to obtain the language feature extraction model.
Optionally, the first language identification unit includes:
the mean value super vector feature determining unit is used for processing the language features of the voice data by using a mean value super vector feature extracting module of the first language identification model to obtain mean value super vector features of the language features;
and the recognition unit is used for recognizing the mean value super-vector feature of the language feature by using the language recognition module of the first language recognition model to obtain a first language recognition result.
Optionally, the training process of the first language identification model includes:
acquiring a training voice data set corresponding to at least one language;
labeling the training voice data set corresponding to each language to obtain a labeling result of the training voice data set corresponding to each language, wherein the labeling result of the training voice data corresponding to each language is used for indicating the language of the training voice data set corresponding to the language;
determining the language features of each training voice data in a training voice data set corresponding to each language;
determining a mean value super vector characteristic set of a training voice data set corresponding to each language by using language characteristics of each training voice data;
and training by using the mean value super vector characteristic set of the training voice data set corresponding to each language and the labeling result of the training voice data set corresponding to each language to obtain the first language identification model.
Optionally, the determining, by using the language features of each training speech data, a mean value super vector feature set of a training speech data set corresponding to each language includes:
aiming at a training voice data set corresponding to each language, clustering each training voice data in the training voice data set corresponding to each language by using the language features of each training voice data to obtain a training voice data subset corresponding to each language;
aiming at each training voice data subset in the training voice data subsets corresponding to the languages, combining initial mean value super-vector characteristics of each training voice data in the training voice data subsets to obtain mean value super-vector characteristics of the training voice data subsets; and the mean value super vector characteristics of all the training voice data subsets corresponding to the languages form a mean value super vector characteristic set of the training voice data set corresponding to the languages.
Optionally, the second language identification model is obtained by training a preset end-to-end neural network model with the language features of the training voice data as training samples and the language labeled by the training voice data as sample labels.
Referring to fig. 6, fig. 6 is a block diagram of a hardware structure of the language identification device according to the embodiment of the present application, and referring to fig. 6, the hardware structure of the language identification device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit CPU, or an application specific Integrated circuit asic, or one or more Integrated circuits configured to implement embodiments of the present invention, etc.;
the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory (non-volatile memory) or the like, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring voice data to be recognized;
determining language features of the voice data;
performing first recognition on the language features of the voice data by using a pre-established first language recognition model to obtain a first language recognition result;
when the first language identification result is inaccurate, performing second identification on the language characteristics of the voice data by using a pre-established second language identification model to obtain a second language identification result; determining the language of the voice data based on the first language identification result and the second language identification result; the number of network layers of the second language identification model is more than that of the first language identification model.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a readable storage medium, where a program suitable for being executed by a processor may be stored, where the program is configured to:
acquiring voice data to be recognized;
determining language features of the voice data;
performing first recognition on the language features of the voice data by using a pre-established first language recognition model to obtain a first language recognition result;
when the first language identification result is inaccurate, performing second identification on the language characteristics of the voice data by using a pre-established second language identification model to obtain a second language identification result; determining the language of the voice data based on the first language identification result and the second language identification result; the number of network layers of the second language identification model is more than that of the first language identification model.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A language identification method, comprising:
acquiring voice data to be recognized;
determining language features of the voice data;
performing first recognition on the language features of the voice data by using a pre-established first language recognition model to obtain a first language recognition result;
when the first language identification result is inaccurate, performing second identification on the language characteristics of the voice data by using a pre-established second language identification model to obtain a second language identification result; determining the language of the voice data based on the first language identification result and the second language identification result; the number of network layers of the second language identification model is more than that of the first language identification model.
2. The method of claim 1, wherein the determining the language features of the voice data comprises:
acquiring acoustic features of the voice data;
performing feature conversion on the acoustic features of the voice data by using a feature conversion module of a pre-established language feature extraction model to obtain converted features;
extracting time sequence features from the converted features by using a time sequence feature extraction module of the language feature extraction model;
and extracting the language features of the voice data from the time sequence features by using a language feature extraction module of the language feature extraction model.
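A minimal PyTorch sketch of the three-module extractor in claim 2 follows. The concrete layer types and sizes (a linear conversion layer, an LSTM for time sequence features, a linear projection for language features) are assumptions, since the patent does not specify the network structure.

import torch
import torch.nn as nn

class LanguageFeatureExtractor(nn.Module):
    # Three modules mirroring claim 2: feature conversion, time sequence
    # feature extraction, and language feature extraction.
    def __init__(self, acoustic_dim=40, hidden_dim=256, feature_dim=128):
        super().__init__()
        self.feature_conversion = nn.Sequential(
            nn.Linear(acoustic_dim, hidden_dim), nn.ReLU())
        self.time_sequence = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.language_feature = nn.Linear(hidden_dim, feature_dim)

    def forward(self, acoustic):                  # (batch, frames, acoustic_dim)
        converted = self.feature_conversion(acoustic)
        time_sequence_features, _ = self.time_sequence(converted)
        return self.language_feature(time_sequence_features)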
3. The method according to claim 2, wherein the training process of the language feature extraction model comprises:
acquiring training voice data;
determining acoustic features of each piece of training voice data, and phoneme information of each piece of training voice data;
training by taking the acoustic features of each piece of training voice data as a training sample and the phoneme information of that piece of training voice data as a sample label, to obtain a phoneme recognition model;
and removing an output layer of the phoneme recognition model to obtain the language feature extraction model.
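The following sketch illustrates the train-then-truncate idea of claim 3: a phoneme classifier is trained on acoustic features, and dropping its output layer leaves a network whose last hidden layer yields language features. The layer sizes and the phoneme inventory size are assumptions for illustration.

import torch.nn as nn

NUM_PHONEMES = 100   # illustrative phoneme inventory size (an assumption)
ACOUSTIC_DIM = 40    # e.g. filter-bank features per frame (an assumption)

# Phoneme recognition model: hidden layers followed by an output layer
# that predicts phoneme posteriors frame by frame.
phoneme_model = nn.Sequential(
    nn.Linear(ACOUSTIC_DIM, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),      # last hidden layer: language features
    nn.Linear(128, NUM_PHONEMES),        # output layer to be removed
)
# ... train phoneme_model with nn.CrossEntropyLoss() against phoneme labels ...

# Removing the output layer leaves the language feature extraction model.
language_feature_extractor = phoneme_model[:-1]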
4. The method according to claim 1, wherein the performing first identification on the language features of the voice data by using a pre-established first language identification model to obtain a first language identification result comprises:
processing the language features of the voice data by using a mean supervector feature extraction module of the first language identification model to obtain mean supervector features of the language features;
and identifying the mean supervector features of the language features by using a language identification module of the first language identification model to obtain the first language identification result.
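One common realization of a mean supervector is sketched below (an assumption; the patent does not define the module's internals): posterior-weighted component means of the frame-level language features under a background GMM, concatenated into a single long vector.

import numpy as np
from sklearn.mixture import GaussianMixture

def mean_supervector(frame_features, ubm):
    # Posterior-weighted component means of the frame-level language
    # features under a background GMM, concatenated into one vector.
    posteriors = ubm.predict_proba(frame_features)        # (frames, components)
    occupancy = posteriors.sum(axis=0, keepdims=True).T   # (components, 1)
    means = (posteriors.T @ frame_features) / np.maximum(occupancy, 1e-8)
    return means.reshape(-1)                              # (components * dim,)

# Usage sketch with synthetic data; a real UBM would be fitted on pooled
# training language features.
ubm = GaussianMixture(n_components=8, random_state=0).fit(np.random.randn(1000, 128))
supervector = mean_supervector(np.random.randn(300, 128), ubm)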
5. The method according to claim 4, wherein the training process of the first language identification model comprises:
acquiring a training voice data set corresponding to each of at least one language;
labeling the training voice data set corresponding to each language to obtain a labeling result of the training voice data set, wherein the labeling result indicates the language of the corresponding training voice data set;
determining the language features of each piece of training voice data in the training voice data set corresponding to each language;
determining a mean supervector feature set of the training voice data set corresponding to each language by using the language features of each piece of training voice data;
and training with the mean supervector feature set of the training voice data set corresponding to each language and the labeling result of the training voice data set corresponding to each language, to obtain the first language identification model.
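A brief sketch of this training step, assuming the language identification module is a probabilistic SVM (the patent leaves the classifier type open) and using synthetic per-language supervector sets in place of real ones:

import numpy as np
from sklearn.svm import SVC

# Synthetic per-language mean supervector sets (3 languages, 20 supervectors
# each); in practice these come from the clustering step of claim 6.
rng = np.random.default_rng(0)
per_language_sets = {lang: rng.normal(lang, 1.0, size=(20, 1024))
                     for lang in range(3)}

X = np.vstack([sv for sets in per_language_sets.values() for sv in sets])
y = np.concatenate([[lang] * len(sets)
                    for lang, sets in per_language_sets.items()])

# A probabilistic SVM is one plausible stand-in for the language
# identification module; the patent does not fix the classifier type.
first_model = SVC(probability=True).fit(X, y)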
6. The method according to claim 5, wherein the determining the mean supervector feature set of the training voice data set corresponding to each language by using the language features of each piece of training voice data comprises:
for the training voice data set corresponding to each language, clustering the training voice data in the set by using the language features of each piece of training voice data, to obtain training voice data subsets corresponding to the language;
for each of the training voice data subsets corresponding to the language, combining initial mean supervector features of the training voice data in the subset to obtain a mean supervector feature of the subset; the mean supervector features of all the training voice data subsets corresponding to the language form the mean supervector feature set of the training voice data set corresponding to the language.
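A sketch of the per-language clustering and merging, with two assumptions flagged in the comments: utterances are keyed by their time-averaged language features, and "combining" is realized as averaging, which the claim does not mandate.

import numpy as np
from sklearn.cluster import KMeans

def per_language_supervector_set(utterance_features, initial_supervectors,
                                 n_subsets=4):
    # Key each training utterance by its time-averaged language features
    # (an assumption about how utterances are compared for clustering).
    keys = np.array([f.mean(axis=0) for f in utterance_features])
    subset_labels = KMeans(n_clusters=n_subsets, n_init=10).fit_predict(keys)
    # Combine the initial mean supervectors inside each subset by averaging.
    return [np.mean([initial_supervectors[i]
                     for i in np.where(subset_labels == k)[0]], axis=0)
            for k in range(n_subsets)]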
7. The method according to claim 1, wherein the second language identification model is obtained by training a preset end-to-end neural network model, with the language features of training voice data as training samples and the languages labeled for the training voice data as sample labels.
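A minimal end-to-end sketch consistent with claim 7; the multi-layer LSTM encoder, the mean pooling, and all dimensions are illustrative assumptions chosen only so that the second model is deeper than the first.

import torch
import torch.nn as nn

class SecondLanguageModel(nn.Module):
    # End-to-end model mapping language features directly to language
    # logits; greater depth than the first model is the point.
    def __init__(self, feature_dim=128, n_languages=10):
        super().__init__()
        self.encoder = nn.LSTM(feature_dim, 256, num_layers=3, batch_first=True)
        self.classifier = nn.Linear(256, n_languages)

    def forward(self, language_features):     # (batch, frames, feature_dim)
        encoded, _ = self.encoder(language_features)
        return self.classifier(encoded.mean(dim=1))   # utterance-level logits

model = SecondLanguageModel()
loss_fn = nn.CrossEntropyLoss()   # trained against the labeled languages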
8. A language identification device, comprising:
an acquisition unit, configured to acquire voice data to be recognized;
a language feature determination unit, configured to determine language features of the voice data;
a first language identification unit, configured to perform first identification on the language features of the voice data by using a pre-established first language identification model to obtain a first language identification result;
a second language identification unit, configured to perform, when the first language identification result is inaccurate, second identification on the language features of the voice data by using a pre-established second language identification model to obtain a second language identification result, wherein the second language identification model has more network layers than the first language identification model;
and a language determination unit, configured to determine the language of the voice data based on the first language identification result and the second language identification result.
9. A language identification device comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the language identification method according to any one of claims 1 to 7.
10. A readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the language identification method according to any one of claims 1 to 7.
CN202010607693.6A 2020-06-29 2020-06-29 Language identification method, related equipment and readable storage medium Active CN111724766B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010607693.6A CN111724766B (en) 2020-06-29 2020-06-29 Language identification method, related equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010607693.6A CN111724766B (en) 2020-06-29 2020-06-29 Language identification method, related equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111724766A true CN111724766A (en) 2020-09-29
CN111724766B CN111724766B (en) 2024-01-05

Family

ID=72570223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010607693.6A Active CN111724766B (en) 2020-06-29 2020-06-29 Language identification method, related equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111724766B (en)

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101546555A (en) * 2009-04-14 2009-09-30 清华大学 Constraint heteroscedasticity linear discriminant analysis method for language identification
CN101894548A (en) * 2010-06-23 2010-11-24 清华大学 Modeling method and modeling device for language identification
WO2014029099A1 (en) * 2012-08-24 2014-02-27 Microsoft Corporation I-vector based clustering training data in speech recognition
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
CN107564513A (en) * 2016-06-30 2018-01-09 阿里巴巴集团控股有限公司 Audio recognition method and device
US20180174589A1 (en) * 2016-12-19 2018-06-21 Samsung Electronics Co., Ltd. Speech recognition method and apparatus
CN106898355A (en) * 2017-01-17 2017-06-27 清华大学 A kind of method for distinguishing speek person based on two modelings
CN107622770A (en) * 2017-09-30 2018-01-23 百度在线网络技术(北京)有限公司 voice awakening method and device
CN109817220A (en) * 2017-11-17 2019-05-28 阿里巴巴集团控股有限公司 Audio recognition method, apparatus and system
US20190251963A1 (en) * 2018-02-09 2019-08-15 Baidu Online Network Technology (Beijing) Co., Ltd. Voice awakening method and device
CN110211565A (en) * 2019-05-06 2019-09-06 平安科技(深圳)有限公司 Accent recognition method, apparatus and computer readable storage medium
CN110930978A (en) * 2019-11-08 2020-03-27 北京搜狗科技发展有限公司 Language identification method and device and language identification device
CN110782872A (en) * 2019-11-11 2020-02-11 复旦大学 Language identification method and device based on deep convolutional recurrent neural network
CN110853618A (en) * 2019-11-19 2020-02-28 腾讯科技(深圳)有限公司 Language identification method, model training method, device and equipment
CN111312214A (en) * 2020-03-31 2020-06-19 广东美的制冷设备有限公司 Voice recognition method and device for air conditioner, air conditioner and readable storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528682A (en) * 2020-12-23 2021-03-19 北京百度网讯科技有限公司 Language detection method and device, electronic equipment and storage medium
CN113782000A (en) * 2021-09-29 2021-12-10 北京中科智加科技有限公司 Language identification method based on multiple tasks
CN114678029A (en) * 2022-05-27 2022-06-28 深圳市人马互动科技有限公司 Speech processing method, system, computer readable storage medium and program product
CN114678029B (en) * 2022-05-27 2022-09-02 深圳市人马互动科技有限公司 Speech processing method, system, computer readable storage medium and program product

Also Published As

Publication number Publication date
CN111724766B (en) 2024-01-05

Similar Documents

Publication Publication Date Title
CN109918680B (en) Entity identification method and device and computer equipment
CN108509619B (en) Voice interaction method and device
CN109918673B (en) Semantic arbitration method and device, electronic equipment and computer-readable storage medium
CN108304372B (en) Entity extraction method and device, computer equipment and storage medium
KR101259558B1 (en) apparatus and method for detecting sentence boundaries
CN111724766B (en) Language identification method, related equipment and readable storage medium
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
CN111125354A (en) Text classification method and device
CN108763510A (en) Intension recognizing method, device, equipment and storage medium
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN112784696B (en) Lip language identification method, device, equipment and storage medium based on image identification
CN111709242A (en) Chinese punctuation mark adding method based on named entity recognition
CN111445898B (en) Language identification method and device, electronic equipment and storage medium
CN110377695B (en) Public opinion theme data clustering method and device and storage medium
CN108345612A (en) A kind of question processing method and device, a kind of device for issue handling
CN112669842A (en) Man-machine conversation control method, device, computer equipment and storage medium
CN113326360A (en) Natural language understanding method in small sample scene
CN112818086A (en) Multi-label classification method for acquiring client intention label by robot
CN112967710B (en) Low-resource customer dialect point identification method
CN113051384A (en) User portrait extraction method based on conversation and related device
Bigot et al. Combining acoustic name spotting and continuous context models to improve spoken person name recognition in speech
CN117454898A (en) Method and device for realizing legal entity standardized output according to input text
CN115691503A (en) Voice recognition method and device, electronic equipment and storage medium
CN113470617B (en) Speech recognition method, electronic equipment and storage device
CN114974310A (en) Emotion recognition method and device based on artificial intelligence, computer equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant