CN111192570B - Language model training method, system, mobile terminal and storage medium - Google Patents


Info

Publication number: CN111192570B
Application number: CN202010011026.1A
Authority: CN (China)
Prior art keywords: language, module, training, language model, text
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN111192570A
Inventors: 张广学, 肖龙源, 蔡振华, 李稀敏, 刘晓葳
Current Assignee: Xiamen Kuaishangtong Technology Co Ltd
Original Assignee: Xiamen Kuaishangtong Technology Co Ltd
Application filed by Xiamen Kuaishangtong Technology Co Ltd; priority to CN202010011026.1A; published as CN111192570A, granted and published as CN111192570B

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06F: ELECTRIC DIGITAL DATA PROCESSING
                • G06F18/00: Pattern recognition
                    • G06F18/20: Analysing
                        • G06F18/24: Classification techniques
                            • G06F18/243: Classification techniques relating to the number of classes
                                • G06F18/24323: Tree-organised classifiers
        • G10: MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00: Speech recognition
                    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
                        • G10L2015/025: Phonemes, fenemes or fenones being the recognition units
                    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L15/063: Training
                            • G10L2015/0631: Creating reference templates; Clustering
                                • G10L2015/0633: Creating reference templates; Clustering using lexical or orthographic knowledge sources

Abstract

The invention provides a language model training method, system, mobile terminal and storage medium. The method comprises the following steps: acquiring a training text and a training vocabulary, classifying the training text to obtain a plurality of language modules, and constructing a language dictionary for each language module according to the training vocabulary; training a module language model for each language module according to its language dictionary, and training a text language model on the training text; acquiring speech to be recognized and performing phoneme recognition to obtain a phoneme string, and matching the phoneme string with the module language models to obtain a phoneme matching result; and performing probability calculation on the phoneme matching result with the text language model, and outputting the sentence corresponding to the maximum probability value. By classifying the training text and constructing the language dictionaries, the invention improves the training efficiency and accuracy of the language model; by training the module language models together with the training text, it allows the language model to be expanded effectively.

Description

Language model training method, system, mobile terminal and storage medium
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a language model training method, a language model training system, a mobile terminal and a storage medium.
Background
Speech recognition has been studied for decades. Speech recognition technology mainly comprises four parts: acoustic model modeling, language model modeling, pronunciation dictionary construction, and decoding, each of which can be an independent research direction. Because speech data are much harder to collect and label than images or text, building a complete speech model training system is time-consuming and difficult work, which has greatly hindered the development of speech recognition technology.
In the existing language model training process, a language model can only be trained on the vocabulary and sentence patterns pre-stored in a database; new vocabulary and sentence patterns cannot be added during training, so both the efficiency and the expansibility of language model training are low.
Disclosure of Invention
The embodiment of the invention aims to provide a language model training method, a language model training system, a mobile terminal and a storage medium, so as to solve the problems of low efficiency and poor expansibility in existing language model training.
The embodiment of the invention is realized in such a way that a language model training method comprises the following steps:
acquiring a training text and training vocabularies, classifying the training text to obtain a plurality of language modules, and constructing a language dictionary corresponding to the language modules according to the training vocabularies;
performing model training on a module language model in the language module according to the language dictionary, and training the training text to obtain a text language model;
acquiring speech to be recognized and performing phoneme recognition to obtain a phoneme string, and matching the phoneme string with the module language models to obtain a phoneme matching result;
and performing probability calculation on the phoneme matching result through the text language model, and outputting a sentence corresponding to the maximum probability value.
Further, the step of performing model training on the module language model in the language module according to the language dictionary comprises:
extracting a language text corresponding to the language module from the training text according to the language dictionary;
training the module language model by adopting a 3-gram training mode according to the language text;
and acquiring the word frequency of the corresponding word in the language text extracted from the language module, and constructing a Huffman tree model according to the word frequency and the training result of the language model.
Further, the step of matching the phoneme string with the module language model comprises:
matching the phoneme string with sample phonemes in each module language model in sequence;
when the matching number between the phoneme string and the sample phonemes in the module language model is larger than or equal to a preset number, outputting all the successfully matched sample phonemes;
and when the matching number is smaller than the preset number, outputting the result of the language module corresponding to the module language model.
Further, the step of performing probability calculation on the phoneme matching result through the text language model comprises:
combining the sample phonemes output by the language modules to obtain combined information, wherein a plurality of phoneme combined strings are stored in the combined information;
and respectively carrying out probability calculation on the phoneme combined strings according to the text language model to obtain a plurality of probability values.
Further, after the step of sequentially matching the phoneme string with the sample phoneme in each of the module language models, the method further includes:
and when the phoneme string is unsuccessfully matched with the module language model, carrying out error marking on the phoneme string according to the module language model.
Further, after the step of matching the phoneme string with the module language model, the method further includes:
when the phoneme string is successfully matched with the module language model, carrying out vocabulary type marking on the phoneme string;
and performing type matching according to the marking result of the vocabulary type mark on the phoneme string to obtain a sentence type, and performing context marking on the speech to be recognized according to the sentence type.
Another object of an embodiment of the present invention is to provide a language model training system, which includes:
the text classification module is used for acquiring a training text and training vocabularies, classifying the training text to obtain a plurality of language modules, and constructing a language dictionary corresponding to the language modules according to the training vocabularies;
the model training module is used for performing model training on a module language model in the language module according to the language dictionary and training the training text to obtain a text language model;
the phoneme matching module is used for acquiring speech to be recognized and performing phoneme recognition to obtain a phoneme string, and matching the phoneme string with the module language models to obtain a phoneme matching result;
and the probability calculation module is used for performing probability calculation on the phoneme matching result through the text language model and outputting the sentence corresponding to the maximum probability value.
Still further, the model training module is further configured to:
extracting a language text corresponding to the language module from the training text according to the language dictionary;
training the module language model by adopting a 3-gram training mode according to the language text;
and acquiring the word frequency of the corresponding word in the language text extracted from the language module, and constructing a Huffman tree model according to the word frequency and the training result of the language model.
Another object of an embodiment of the present invention is to provide a mobile terminal, including a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal execute the above language model training method.
Another object of an embodiment of the present invention is to provide a storage medium, which stores a computer program used in the above-mentioned mobile terminal, wherein the computer program, when executed by a processor, implements the steps of the above-mentioned language model training method.
According to the embodiment of the invention, classifying the training text and constructing the language dictionaries effectively improves the training efficiency and accuracy of the language model; training the module language models in the language modules together with the training text allows the language model to be expanded effectively; and performing speech recognition based on phoneme recognition effectively improves the recognition efficiency of the speech model.
Drawings
FIG. 1 is a flowchart of a language model training method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a language model training method according to a second embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a language model training system according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a mobile terminal according to a fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Example one
Referring to fig. 1, it is a flowchart of a language model training method according to a first embodiment of the present invention, including the steps of:
step S10, acquiring a training text and training vocabularies, classifying the training text to obtain a plurality of language modules, and constructing a language dictionary corresponding to the language modules according to the training vocabularies;
the word language in the training text can be set according to requirements, for example, the word language can be Chinese, english, korean or Japanese, and the like, and both the training vocabulary and the training text can be obtained based on a database, and the training vocabulary includes noun vocabulary, verb vocabulary, adjective vocabulary, adverb vocabulary, and the like;
specifically, in the step, the training text may be classified by using a classifier, the classifier is configured to classify text characters in the training text according to different word attributes to correspondingly obtain a plurality of language modules, and the language modules may be a noun module, a verb module, an adjective module, an adverb module, and the like;
preferably, in the step, the design of the language dictionary is constructed, so that the subsequent stable execution of the language model training is effectively ensured, the accuracy of the language model training is improved, and the design of the language dictionary corresponding to the language module is constructed according to the training vocabulary, so that a noun dictionary, a verb dictionary, an adjective dictionary, an adverb dictionary and the like are correspondingly obtained;
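The classification-plus-dictionary step above can be sketched as follows. This is a minimal illustration, not the patented classifier: the `POS_TABLE` lookup and all the example words are hypothetical stand-ins for a real word-attribute classifier.

```python
from collections import defaultdict

# Hypothetical part-of-speech lookup standing in for the patent's classifier;
# a real system would classify words by their attributes with a trained model.
POS_TABLE = {
    "cat": "noun", "dog": "noun",
    "run": "verb", "eat": "verb",
    "big": "adjective", "quickly": "adverb",
}

def build_language_dictionaries(training_words):
    """Group the training vocabulary into per-module language dictionaries
    (noun dictionary, verb dictionary, ...), one per language module."""
    dictionaries = defaultdict(set)
    for word in training_words:
        module = POS_TABLE.get(word)
        if module is not None:
            dictionaries[module].add(word)
    return dict(dictionaries)

dicts = build_language_dictionaries(["cat", "run", "big", "dog", "quickly"])
```

Each resulting dictionary then backs the module language model of its language module.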
step S20, performing model training on a module language model in the language module according to the language dictionary, and training the training text to obtain a text language model;
each language module is provided with a module language model, which identifies the vocabulary input to the corresponding language module so as to judge whether that vocabulary belongs to the language module, thereby determining the type of the vocabulary;
preferably, in this step, the training modes for the module language models and the training text can be selected as required; in this embodiment, a 3-gram training mode is adopted for model training to obtain trained module language models and a trained text language model;
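The 3-gram training mode named in this embodiment can be sketched with plain maximum-likelihood counts. The padding symbols and helper names below are illustrative assumptions, and a production model would add smoothing.

```python
from collections import Counter

def train_trigram_model(sentences):
    """Minimal 3-gram (trigram) training: collect trigram and bigram counts
    from padded sentences for maximum-likelihood estimates P(w3 | w1, w2)."""
    trigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(len(tokens) - 2):
            trigrams[tuple(tokens[i:i + 3])] += 1
            bigrams[tuple(tokens[i:i + 2])] += 1
    return trigrams, bigrams

def trigram_prob(trigrams, bigrams, w1, w2, w3):
    """Relative-frequency estimate of P(w3 | w1, w2); 0 for unseen contexts."""
    context = bigrams[(w1, w2)]
    return trigrams[(w1, w2, w3)] / context if context else 0.0

tri, bi = train_trigram_model([["the", "cat", "sat"], ["the", "cat", "ran"]])
```

Here "cat" always follows "&lt;s&gt; the", so its conditional probability is 1, while "sat" and "ran" split the context "the cat" evenly.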
step S30, acquiring a voice to be recognized, performing phoneme recognition to obtain a phoneme string, and matching the phoneme string with the module language model to obtain a phoneme matching result;
the method comprises the steps of inputting the speech to be recognized into a preset acoustic model to output a phoneme string, wherein the phoneme string is composed of a plurality of phonemes, and each phoneme corresponds to a character in the speech to be recognized;
preferably, in this step, matching the phoneme string against the module language models determines the attribute of each phoneme in the phoneme string; for example, when a phoneme successfully matches the module language model in the noun module, the word corresponding to that phoneme is judged to be a noun;
specifically, in this step, the phoneme string is matched in turn against the module language models in the noun module, the verb module, the adjective module and the adverb module, so as to judge the attribute of each phoneme in the phoneme string; this effectively determines whether nouns, verbs, adjectives, adverbs or similar vocabulary exist in the speech to be recognized;
for example, when the phoneme string successfully matches the module language models in the noun module, the verb module, the adjective module and the adverb module, it is judged that a noun, a verb, an adjective and an adverb all exist in the speech to be recognized, and the number of words in the speech to be recognized is determined from the number of successful matches between the phoneme string and the corresponding module language models;
step S40, carrying out probability calculation on the phoneme matching result through the text language model, and outputting a sentence corresponding to the maximum probability value;
the text language model performs probability calculation on the phoneme matching result, computing a probability value for each sentence formed from the output results of the language modules, and the recognition result is judged from these probability values;
for example, suppose the output results of the language modules form sentence A, sentence B and sentence C; the text language model scores each of them, yielding probability A, probability B and probability C; if probability A is greater than probability B and probability B is greater than probability C, sentence A is output as the recognition result for the speech to be recognized;
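Selecting the sentence with the maximum probability value can be sketched as follows; the `toy` scorer stands in for the trained text language model, and all function names are assumptions for illustration.

```python
def score_sentence(trigram_logprob, tokens):
    """Sum log-probabilities of a candidate sentence under the text language
    model; trigram_logprob is any callable (w1, w2, w3) -> log P(w3 | w1, w2)."""
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    return sum(trigram_logprob(padded[i], padded[i + 1], padded[i + 2])
               for i in range(len(padded) - 2))

def pick_best(candidates, trigram_logprob):
    """Return the candidate sentence with the maximum probability value."""
    return max(candidates, key=lambda c: score_sentence(trigram_logprob, c))

# Toy stand-in model: favours trigrams whose context is ("the", "cat").
toy = lambda w1, w2, w3: 0.0 if (w1, w2) == ("the", "cat") else -1.0
best = pick_best([["the", "cat", "sat"], ["a", "dog", "ran"]], toy)
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow on longer sentences.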
according to this embodiment, classifying the training text and constructing the language dictionaries effectively improves the training efficiency and accuracy of the language model; training the module language models in the language modules together with the training text allows the language model to be expanded effectively; and performing speech recognition based on phoneme recognition effectively improves the recognition efficiency of the speech model.
Example two
Please refer to fig. 2, which is a flowchart illustrating a language model training method according to a second embodiment of the present invention, including the steps of:
step S11, acquiring a training text and training vocabularies, classifying the training text to obtain a plurality of language modules, and constructing a language dictionary corresponding to the language modules according to the training vocabularies;
preferably, in other embodiments, the language module can be further divided into a state word module and the like according to different text attributes in the training text;
specifically, in this step, the language modules and the language dictionaries correspond one to one, so that constructing a dictionary for each module from the training vocabulary yields a noun dictionary, a verb dictionary, an adjective dictionary, an adverb dictionary, and the like;
step S21, extracting a language text corresponding to each language module from the training text according to the language dictionary, and training the module language model in a 3-gram training mode on that language text;
the way the language text is extracted can be set as required; for example, it can be extracted via preset audio, that is, the preset audio is edited and matched against the language dictionary, and the language text is then extracted from the training text according to the matching result;
specifically, in this step, the texts corresponding to the noun module, the verb module, the adjective module and the adverb module are extracted from the training text based on the language dictionaries, and each module's language model is trained on its language text, which effectively improves the training efficiency and the accuracy of model training;
step S31, acquiring the word frequency of each word in the language text extracted for each language module, and constructing a Huffman tree model according to the word frequencies and the language model training result;
based on the language text extracted in step S21, the word frequency of each word in each language module is calculated, and a Huffman tree model is constructed from the word-frequency results; by combining the Huffman tree model with the 3-gram training mode, this embodiment can add new words, new sentences and similar content during language model training, which further improves the expansibility of language model training;
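A Huffman tree built from word frequencies, as this step describes, can be sketched with a standard heap-based construction; the example frequencies are hypothetical, and how the patent combines the tree with the 3-gram training result is not specified here, so only the tree itself is shown.

```python
import heapq
from itertools import count

def build_huffman_tree(word_freqs):
    """Build a Huffman tree over (word, frequency) pairs; returns a nested
    (left, right) tuple with words at the leaves. The counter breaks ties
    so heap comparisons never reach the (unorderable) tree nodes."""
    tie = count()
    heap = [(freq, next(tie), word) for word, freq in word_freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    return heap[0][2]

def huffman_codes(node, prefix=""):
    """Walk the tree to assign bit strings; rarer words get longer codes."""
    if isinstance(node, str):
        return {node: prefix or "0"}
    left, right = node
    codes = huffman_codes(left, prefix + "0")
    codes.update(huffman_codes(right, prefix + "1"))
    return codes

codes = huffman_codes(build_huffman_tree({"the": 10, "cat": 3, "sat": 2}))
```

Because new (word, frequency) pairs only require rebuilding the tree, this structure supports the expansibility the embodiment claims.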
step S41, training the training text to obtain a text language model;
step S51, acquiring speech to be recognized and performing phoneme recognition to obtain a phoneme string, and matching the phoneme string with the sample phonemes in each module language model in sequence;
the method comprises the steps of inputting the speech to be recognized into a preset acoustic model to output a phoneme string, wherein the phoneme string is composed of a plurality of phonemes, and each phoneme corresponds to a character in the speech to be recognized;
specifically, in this step, the phoneme string is matched in turn against the module language models in the noun module, the verb module, the adjective module and the adverb module, so as to judge the attribute of each phoneme in the phoneme string; this effectively determines whether nouns, verbs, adjectives, adverbs or similar vocabulary exist in the speech to be recognized;
for example, when the phoneme string successfully matches the language models of the noun module, the verb module, the adjective module and the adverb module, it is judged that a noun, a verb, an adjective and an adverb all exist in the speech to be recognized, and the number of words in the speech to be recognized is determined from the number of successful matches between the phoneme string and the corresponding module language models;
step S61, when the matching of the phoneme string and the module language model fails, carrying out error marking on the phoneme string according to the module language model;
when a phoneme in the phoneme string matches none of the phonemes in the module language model of a language module, the phoneme string is judged not to match that module language model, and the phoneme string is error-marked with the name of the module language model or of the corresponding language module;
specifically, the error mark may be a character, a number or an image; for example, when characters are used, the mark is derived from the name of the language module: when matching between the noun module and the phoneme string fails, the phoneme string is marked "missing noun", and when matching between the verb module and the phoneme string fails, it is marked "missing verb";
when the phoneme string is error-marked with an image, the preset image corresponding to the name of the language module is looked up and applied to the phoneme string; the preset images can be set as required, and each language module has a different preset image;
step S71, when the matching number between the phoneme string and the sample phonemes in the module language model is greater than or equal to a preset number, outputting all the successfully matched sample phonemes;
in this embodiment, the preset number may be set according to requirements; here it is 2, that is, when the matching number between the phoneme string and the sample phonemes in a module language model is judged to be greater than or equal to 2, all the matched sample phonemes are output as the output result of that language module;
for example, when the phoneme string is successfully matched with the sample phoneme a, the sample phoneme B and the sample phoneme C in the module language model in the noun module, the sample phoneme a, the sample phoneme B and the sample phoneme C are used as the output result of the noun module;
step S81, when the matching number is smaller than the preset number, outputting the result of the language module corresponding to the module language model to obtain a phoneme matching result;
when the matching number is judged to be less than 2 and greater than 0, namely the matching number is 1, directly outputting the output result of the language module;
for example, when the matching between the module language model in the verb module and the phoneme string is successful only once, directly outputting the result of the verb module;
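The output rule of steps S71 and S81 (threshold of 2 in this embodiment) can be sketched as follows; the module-level result token is a hypothetical placeholder, since the patent does not specify its form.

```python
PRESET_NUMBER = 2  # threshold used in this embodiment

def module_output(matched_sample_phonemes, module_name):
    """Decide a language module's output from its matched sample phonemes:
    at or above the threshold, emit every matched sample phoneme; below it
    (but non-empty), emit the module-level result instead."""
    if len(matched_sample_phonemes) >= PRESET_NUMBER:
        return list(matched_sample_phonemes)
    if matched_sample_phonemes:
        return [f"{module_name}-result"]  # hypothetical module-level token
    return []  # no match at all: handled by the error-marking step

noun_out = module_output(["A", "B", "C"], "noun")   # three matches: emit all
verb_out = module_output(["C"], "verb")             # single match: module result
```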
step S91, combining the sample phonemes output by the language modules to obtain combined information;
in the step, by designing the combination of the sample phonemes output by each language module, the diversity of output results is effectively improved;
for example, suppose matching the noun module against the phoneme string outputs sample phoneme A and sample phoneme B; matching the verb module outputs sample phoneme C; matching the adjective module outputs sample phoneme D and sample phoneme E; and the adverb module does not match the phoneme string; the combined information obtained by combination then includes:
first phoneme combination string: sample phoneme A, sample phoneme C and sample phoneme D;
second phoneme combination string: sample phoneme B, sample phoneme C and sample phoneme D;
third phoneme combination string: sample phoneme A, sample phoneme C and sample phoneme E;
fourth phoneme combination string: sample phoneme B, sample phoneme C and sample phoneme E;
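The cross-combination of module outputs illustrated above can be sketched with a Cartesian product; skipping non-matching modules follows the adverb-module example, and the function name is an assumption.

```python
from itertools import product

def combine_module_outputs(module_outputs):
    """Cross-combine the sample phonemes emitted by each language module
    (skipping modules that matched nothing) into phoneme combination strings."""
    non_empty = [out for out in module_outputs if out]
    return [list(combo) for combo in product(*non_empty)]

combos = combine_module_outputs([
    ["A", "B"],   # noun module output
    ["C"],        # verb module output
    ["D", "E"],   # adjective module output
    [],           # adverb module: no match, skipped
])
```

This yields the four combination strings of the example (2 x 1 x 2 = 4 candidates), each of which is then scored by the text language model.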
step S101, respectively carrying out probability calculation on the phoneme combination strings according to the text language model to obtain a plurality of probability values, and outputting sentences corresponding to the maximum probability values;
preferably, in this embodiment, when the step of matching between the phoneme string and the module language model is completed, the method further includes:
when the phoneme string is successfully matched with the module language model, carrying out vocabulary type marking on the phoneme string;
performing type matching according to a marking result of the vocabulary type mark on the phoneme string to obtain a sentence type, and performing context marking on the voice to be recognized according to the sentence type;
preferably, by performing type matching on the marking result of the vocabulary type marks on the phoneme string, the phoneme string and the corresponding speech to be recognized can be effectively marked with a sentence type, such as a declarative-sentence mark, a question-sentence mark or a sentence structure mark;
specifically, the sentence structures may be set as required, for example a subject + predicate structure or a subject + predicate + object structure, so that each language module can analyze whether the speech to be recognized lacks the corresponding sentence component; for example, the verb module can be used to analyze whether the speech to be recognized has a subject and a predicate.
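The sentence-type matching described above can be sketched as a lookup from vocabulary-type marks to sentence structures; the `STRUCTURES` table and the "missing predicate" label are hypothetical illustrations of the marking scheme.

```python
# Hypothetical sentence-type matcher: maps the sequence of vocabulary-type
# marks on a phoneme string to a sentence structure, flagging missing parts.
STRUCTURES = {
    ("noun", "verb"): "subject + predicate",
    ("noun", "verb", "noun"): "subject + predicate + object",
}

def mark_sentence(type_marks):
    """Return the matched sentence structure, or a mark for what is missing."""
    structure = STRUCTURES.get(tuple(type_marks))
    if structure is not None:
        return structure
    if "verb" not in type_marks:
        return "missing predicate"  # no verb-module match in the marks
    return "unknown structure"

label = mark_sentence(["noun", "verb"])
```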
In this embodiment, classifying the training text and constructing the language dictionaries effectively improves the training efficiency and accuracy of the language model; training the module language models in the language modules together with the training text allows the language model to be expanded effectively; and performing speech recognition based on phoneme recognition effectively improves the recognition efficiency of the speech model.
EXAMPLE III
Please refer to fig. 3, which is a schematic structural diagram of a language model training system 100 according to a third embodiment of the present invention, including: a text classification module 10, a model training module 11, a phoneme matching module 12 and a probability calculation module 13, wherein:
the text classification module 10 is configured to obtain a training text and a training vocabulary, classify the training text to obtain a plurality of language modules, and construct a language dictionary corresponding to the language modules according to the training vocabulary;
and the model training module 11 is configured to perform model training on a module language model in the language module according to the language dictionary, and train the training text to obtain a text language model.
Wherein the model training module 11 is further configured to: extract a language text corresponding to each language module from the training text according to the language dictionary; train the module language model in a 3-gram training mode on the language text; and acquire the word frequency of each word in the language text extracted for the language module, and construct a Huffman tree model according to the word frequencies and the language model training result.
And the phoneme matching module 12 is configured to obtain a speech to be recognized, perform phoneme recognition to obtain a phoneme string, and match the phoneme string with the module language model to obtain a phoneme matching result.
Wherein the phoneme matching module 12 is further configured to: matching the phoneme string with sample phonemes in each module language model in sequence; when the matching number between the phoneme string and the sample phonemes in the module language model is larger than or equal to a preset number, outputting all the successfully matched sample phonemes; and when the matching number is smaller than the preset number, outputting the result of the language module corresponding to the module language model.
And a probability calculation module 13, configured to perform probability calculation on the phoneme matching result through the text language model, and output a sentence corresponding to the maximum probability value.
Wherein, the probability calculation module 13 is further configured to: combining the sample phonemes output by the language modules to obtain combined information, wherein a plurality of phoneme combined strings are stored in the combined information; and respectively carrying out probability calculation on the phoneme combination strings according to the text language model to obtain a plurality of probability values.
Preferably, the language model training system 100 further comprises:
And the type marking module 14 is configured to perform error marking on the phoneme string according to the module language model when the phoneme string fails to match the module language model.
Furthermore, the type marking module 14 is further configured to: when the phoneme string is successfully matched with the module language model, perform vocabulary type marking on the phoneme string; and perform type matching according to the vocabulary type marking result to obtain a sentence type, and perform context marking on the speech to be recognized according to the sentence type.
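One way to read the vocabulary-type-to-sentence-type step is a two-stage table lookup, sketched below. Both lookup tables (`vocab_types`, `sentence_patterns`) and the fallback labels are hypothetical; the patent does not enumerate the vocabulary or sentence types it uses.

```python
def mark_sentence_type(tokens, vocab_types, sentence_patterns):
    """Label each token with a vocabulary type, then map the resulting
    type sequence to a sentence type used for context marking."""
    type_seq = tuple(vocab_types.get(tok, "unknown") for tok in tokens)
    return sentence_patterns.get(type_seq, "unclassified")
```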
According to the embodiment, classifying the training text and constructing the corresponding language dictionaries effectively improves the training efficiency and accuracy of the language model; performing model training on the module language models within the language modules and training the text language model on the training text allows the language model to be expanded effectively; and performing speech recognition on the basis of phoneme recognition effectively improves the recognition efficiency of the speech model.
Example four
Referring to fig. 4, a mobile terminal 101 according to a fourth embodiment of the present invention includes a storage device and a processor, where the storage device is used to store a computer program, and the processor runs the computer program to make the mobile terminal 101 execute the above language model training method.
The present embodiment also provides a storage medium storing the computer program used in the above-mentioned mobile terminal 101; the computer program, when executed, performs the following steps:
acquiring a training text and training vocabularies, classifying the training text to obtain a plurality of language modules, and constructing a language dictionary corresponding to the language modules according to the training vocabularies;
performing model training on a module language model in the language module according to the language dictionary, and training the training text to obtain a text language model;
acquiring a voice to be recognized, performing phoneme recognition to obtain a phoneme string, and matching the phoneme string with the module language model to obtain a phoneme matching result;
and performing probability calculation on the phoneme matching result through the text language model, and outputting the sentence corresponding to the maximum probability value. The storage medium may be, for example, a ROM/RAM, a magnetic disk, or an optical disc.
It should be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional units and modules is only used for illustration, and in practical applications, the above function distribution may be performed by different functional units or modules as needed, that is, the internal structure of the storage device is divided into different functional units or modules to perform all or part of the above described functions. Each functional unit and module in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit, and the integrated unit may be implemented in a form of hardware, or may be implemented in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application.
Those skilled in the art will appreciate that the component structure shown in FIG. 3 does not limit the language model training system of the present invention, which may include more or fewer components than shown, combine certain components, or arrange the components differently; likewise, the language model training method of FIGS. 1-2 may be implemented with more or fewer components than shown in FIG. 3, with certain components combined, or with a different arrangement of components. The units and modules referred to herein are series of computer programs that can be executed by a processor (not shown) of the target language model training system to perform specific functions; all of these computer programs can be stored in a storage device (not shown) of the target language model training system.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (9)

1. A method for language model training, the method comprising:
acquiring a training text and training vocabularies, classifying the training text to obtain a plurality of language modules, and constructing a language dictionary corresponding to the language modules according to the training vocabularies;
performing model training on a module language model in the language module according to the language dictionary, and training the training text to obtain a text language model;
acquiring a voice to be recognized for phoneme recognition to obtain a phoneme string, and matching the phoneme string with the module language model to obtain a phoneme matching result;
performing probability calculation on the phoneme matching result through the text language model, and outputting a sentence corresponding to the maximum probability value;
the step of matching the phoneme string with the module language model comprises:
matching the phoneme string with sample phonemes in each module language model in sequence;
when the matching number between the phoneme string and the sample phonemes in the module language model is larger than or equal to a preset number, outputting all the successfully matched sample phonemes;
and when the matching number is smaller than the preset number, outputting the result of the language module corresponding to the module language model.
2. The method of claim 1, wherein the step of model training the module language model in the language module according to the language dictionary comprises:
extracting a language text corresponding to the language module from the training text according to the language dictionary;
training the module language model by adopting a 3-gram training mode according to the language text;
and acquiring the word frequency of the corresponding word in the language text extracted from the language module, and constructing a Huffman tree model according to the word frequency and the training result of the language model.
3. The language model training method as claimed in claim 1, wherein the step of performing probability calculation on the phoneme matching result by the text language model comprises:
combining the sample phonemes output by the language modules to obtain combined information, wherein a plurality of phoneme combined strings are stored in the combined information;
and respectively carrying out probability calculation on the phoneme combination strings according to the text language model to obtain a plurality of probability values.
4. The method of language model training as recited in claim 1, wherein after the step of sequentially matching the phone string to the sample phones in each of the modular language models, the method further comprises:
and when the phoneme string is unsuccessfully matched with the module language model, carrying out error marking on the phoneme string according to the module language model.
5. The method of language model training as recited in claim 1, wherein after the step of matching the phoneme string to the module language model, the method further comprises:
when the phoneme string is successfully matched with the module language model, carrying out vocabulary type marking on the phoneme string;
and performing type matching according to the marking result of the vocabulary type mark on the phoneme string to obtain a sentence type, and performing context marking on the speech to be recognized according to the sentence type.
6. A language model training system, the system comprising:
the text classification module is used for acquiring a training text and training vocabularies, classifying the training text to obtain a plurality of language modules, and constructing a language dictionary corresponding to the language modules according to the training vocabularies;
the model training module is used for carrying out model training on a module language model in the language module according to the language dictionary and training the training text to obtain a text language model;
the phoneme matching module is used for acquiring the speech to be recognized for phoneme recognition to obtain a phoneme string and matching the phoneme string with the module language model to obtain a phoneme matching result; the step of matching the phoneme string with the module language model comprises: matching the phoneme string with sample phonemes in each module language model in sequence; when the matching number between the phoneme string and the sample phonemes in the module language model is larger than or equal to a preset number, outputting all the successfully matched sample phonemes; when the matching number is smaller than the preset number, outputting the result of the language module corresponding to the module language model;
and the probability calculation module is used for performing probability calculation on the phoneme matching result through the text language model and outputting a sentence corresponding to the maximum probability value.
7. The language model training system of claim 6, wherein the model training module is further to:
extracting a language text corresponding to the language module from the training text according to the language dictionary;
training the module language model by adopting a 3-gram training mode according to the language text;
and acquiring the word frequency of the corresponding word in the language text extracted from the language module, and constructing a Huffman tree model according to the word frequency and the training result of the language model.
8. A mobile terminal, characterized in that it comprises a storage device for storing a computer program and a processor for executing the computer program to make the mobile terminal execute the language model training method according to any one of claims 1 to 5.
9. A storage medium, characterized in that it stores a computer program for use in a mobile terminal according to claim 8, which computer program, when being executed by a processor, carries out the steps of the language model training method according to any one of claims 1 to 5.
CN202010011026.1A 2020-01-06 2020-01-06 Language model training method, system, mobile terminal and storage medium Active CN111192570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010011026.1A CN111192570B (en) 2020-01-06 2020-01-06 Language model training method, system, mobile terminal and storage medium


Publications (2)

Publication Number Publication Date
CN111192570A CN111192570A (en) 2020-05-22
CN111192570B true CN111192570B (en) 2022-12-06

Family

ID=70710630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010011026.1A Active CN111192570B (en) 2020-01-06 2020-01-06 Language model training method, system, mobile terminal and storage medium

Country Status (1)

Country Link
CN (1) CN111192570B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111782779B (en) * 2020-05-28 2022-08-23 厦门快商通科技股份有限公司 Voice question-answering method, system, mobile terminal and storage medium
CN111933116B (en) * 2020-06-22 2023-02-14 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium
CN112489626B (en) * 2020-11-18 2024-01-16 华为技术有限公司 Information identification method, device and storage medium
CN113870848B (en) * 2021-12-02 2022-04-26 深圳市友杰智新科技有限公司 Method and device for constructing voice modeling unit and computer equipment
CN116108466B (en) * 2022-12-28 2023-10-13 南京邮电大学盐城大数据研究院有限公司 Encryption method based on statistical language model

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US8392190B2 (en) * 2008-12-01 2013-03-05 Educational Testing Service Systems and methods for assessment of non-native spontaneous speech
EP2851896A1 (en) * 2013-09-19 2015-03-25 Maluuba Inc. Speech recognition using phoneme matching
CN105869634B (en) * 2016-03-31 2019-11-19 重庆大学 It is a kind of based on field band feedback speech recognition after text error correction method and system
US20180137109A1 (en) * 2016-11-11 2018-05-17 The Charles Stark Draper Laboratory, Inc. Methodology for automatic multilingual speech recognition
CN107665705B (en) * 2017-09-20 2020-04-21 平安科技(深圳)有限公司 Voice keyword recognition method, device, equipment and computer readable storage medium
CN109346064B (en) * 2018-12-13 2021-07-27 思必驰科技股份有限公司 Training method and system for end-to-end speech recognition model

Also Published As

Publication number Publication date
CN111192570A (en) 2020-05-22

Similar Documents

Publication Publication Date Title
CN111192570B (en) Language model training method, system, mobile terminal and storage medium
CN108847241B (en) Method for recognizing conference voice as text, electronic device and storage medium
US7636657B2 (en) Method and apparatus for automatic grammar generation from data entries
US8185376B2 (en) Identifying language origin of words
JP6909832B2 (en) Methods, devices, equipment and media for recognizing important words in audio
Qiu et al. Fudannlp: A toolkit for chinese natural language processing
KR102191425B1 (en) Apparatus and method for learning foreign language based on interactive character
US7996209B2 (en) Method and system of generating and detecting confusing phones of pronunciation
US20110307252A1 (en) Using Utterance Classification in Telephony and Speech Recognition Applications
CN112417102B (en) Voice query method, device, server and readable storage medium
US6763331B2 (en) Sentence recognition apparatus, sentence recognition method, program, and medium
JP2005084681A (en) Method and system for semantic language modeling and reliability measurement
CN110797010A (en) Question-answer scoring method, device, equipment and storage medium based on artificial intelligence
Hakkinen et al. N-gram and decision tree based language identification for written words
CN111401012B (en) Text error correction method, electronic device and computer readable storage medium
KR20170090127A (en) Apparatus for comprehending speech
CN110826301B (en) Punctuation mark adding method, punctuation mark adding system, mobile terminal and storage medium
CN112562640A (en) Multi-language speech recognition method, device, system and computer readable storage medium
CN107704450B (en) Natural language identification device and natural language identification method
CN111933116A (en) Speech recognition model training method, system, mobile terminal and storage medium
CN116483314A (en) Automatic intelligent activity diagram generation method
CN111782779B (en) Voice question-answering method, system, mobile terminal and storage medium
Lee et al. Grammatical error detection for corrective feedback provision in oral conversations
JP2006031278A (en) Voice retrieval system, method, and program
CN110750967A (en) Pronunciation labeling method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant