CN109710087B - Input method model generation method and device - Google Patents



Publication number: CN109710087B
Authority: CN (China)
Prior art keywords: word, words, sentence, segmentation result, segmentation
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN201811620636.0A
Other languages: Chinese (zh)
Other versions: CN109710087A (en)
Inventor: 许晏铭
Current assignee: Beijing Kingsoft Internet Security Software Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Beijing Kingsoft Internet Security Software Co Ltd
Application filed by Beijing Kingsoft Internet Security Software Co Ltd
Priority to CN201811620636.0A
Publication of CN109710087A
Application granted
Publication of CN109710087B
Status: Active



Abstract

The invention provides an input method model generation method and device. The method includes: acquiring training data and a word segmentation lexicon, where the lexicon contains words related to input method scenarios; for each word in the lexicon, querying each sentence in the training data to obtain the word's frequency and its bigram words (the words that immediately follow it); generating a prefix tree from the lexicon words and their bigram words; for each sentence in the training data, segmenting the sentence with the prefix tree to obtain at least one segmentation result and generating a directed acyclic graph for the sentence; determining the sentence's word segmentation result from the maximum-probability path in the directed acyclic graph; and generating the N-gram model used by the input method application from the segmentation results of all sentences. Because no HMM is used, no sample data has to be manually labeled for HMM training, which reduces the generation cost of the input method model and improves its accuracy.

Description

Input method model generation method and device
Technical Field
The invention relates to the technical field of text processing, in particular to an input method model generation method and device.
Background
At present, when a user types pinyin in an input method application, the application feeds the pinyin into an N-gram model to obtain a candidate word list. The N-gram model is generated mainly by acquiring training data, inputting each sentence of the training data into a trained Hidden Markov Model (HMM) to obtain each sentence's word segmentation result, and then building the N-gram model from those segmentation results. In this scheme the HMM must be trained on a large amount of sample data, and that sample data must be segmented manually, which increases the generation cost of the N-gram model; manual segmentation is also error-prone, which lowers the accuracy of the trained HMM and, in turn, of the N-gram model.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
To this end, a first object of the present invention is to provide an input method model generation method that addresses the poor accuracy and high generation cost of N-gram models in existing input method applications.
A second object of the invention is to provide an input method model generation apparatus.
A third object of the invention is to provide another input method model generation apparatus.
A fourth object of the invention is to propose a non-transitory computer-readable storage medium.
A fifth object of the invention is to propose a computer program product.
In order to achieve the above objects, an embodiment of the first aspect of the present invention provides an input method model generation method, including:
acquiring training data and a word segmentation lexicon, wherein the lexicon includes words exceeding a preset number threshold, the words including words related to input method scenarios;
for each word in the segmentation lexicon, querying each sentence in the training data to obtain the word's frequency and its bigram words, i.e. the words that immediately follow it;
generating a prefix tree from the words in the segmentation lexicon and their bigram words, wherein each node of the prefix tree is a word or a bigram word;
for each sentence in the training data, segmenting the sentence with the prefix tree to obtain at least one segmentation result, and generating a directed acyclic graph for the sentence from the at least one segmentation result;
determining the maximum-probability path in the directed acyclic graph from the word frequencies of the words in the graph, and determining the sentence's word segmentation result from that path;
and generating the N-gram model used by the input method application from the word segmentation results of all sentences in the training data.
Further, querying each sentence in the training data for each word in the segmentation lexicon to obtain the word's frequency and its bigram words includes:
for each word in the segmentation lexicon, querying each sentence in the training data to obtain the word's frequency;
and acquiring the word that follows the word in each sentence, and determining that following word as a bigram word of the word.
Further, determining the maximum-probability path in the directed acyclic graph from the word frequencies of the words in the graph, and determining the sentence's word segmentation result from that path, includes:
determining the maximum-probability path in the directed acyclic graph from the word frequencies of the words in the graph;
judging whether the probability of the maximum-probability path is greater than or equal to a preset probability threshold;
if it is, traversing a user dictionary with the segmentation result of the maximum-probability path and judging whether the segmentation result contains several consecutive words that match a word in the user dictionary;
and, if such consecutive words exist, merging them to obtain the sentence's word segmentation result.
Further, traversing the user dictionary with the segmentation result of the maximum-probability path and judging whether the segmentation result contains consecutive words matching a word in the user dictionary further includes:
acquiring the proportion of single characters in the segmentation result of the maximum-probability path;
judging whether that proportion is greater than or equal to a preset proportion threshold;
if it is, judging whether the segmentation result contains consecutive words matching a word in the user dictionary;
and, if the proportion is below the threshold, taking the segmentation result of the maximum-probability path as the word segmentation result.
Further, determining the maximum-probability path and the corresponding word segmentation result further includes:
inputting the sentence into a trained statistical language model to obtain the sentence's word segmentation result if the probability of the maximum-probability path is below the preset probability threshold.
Further, generating the N-gram model used by the input method application from the word segmentation results of the sentences in the training data includes:
for each segmented word in each sentence's segmentation result, acquiring the word's frequency across the segmentation results;
querying the segmentation results of each sentence for the segmented word to obtain its bigram words;
acquiring the frequency with which the segmented word and each of its bigram words co-occur in the segmentation results;
merging the segmented word and a bigram word into a single segmented word when their co-occurrence frequency exceeds a preset frequency threshold;
and generating the N-gram model from the frequency of each segmented word and of its bigram words.
Further, after generating the N-gram model used by the input method application from the word segmentation results of the sentences in the training data, the method further includes:
obtaining pinyin input by a user;
inputting the pinyin into the N-gram model, and acquiring the words matching the pinyin together with each word's occurrence probability;
and generating a candidate word list ordered by occurrence probability, so that the user can select a word from the list and input it.
In the input method model generation method of the embodiment of the invention, training data and a word segmentation lexicon are acquired, where the lexicon contains words exceeding a preset number threshold, including words related to input method scenarios; for each word in the lexicon, each sentence in the training data is queried to obtain the word's frequency and its bigram words; a prefix tree is generated from the lexicon words and their bigram words, with each node being a word or a bigram word; each sentence is segmented with the prefix tree to obtain at least one segmentation result, from which a directed acyclic graph is generated; the maximum-probability path in the graph is determined from the word frequencies, and the sentence's word segmentation result is determined from that path; and the N-gram model used by the input method application is generated from the segmentation results of all sentences. As a result, no HMM is needed and no sample data has to be manually labeled for HMM training, which reduces the generation cost of the N-gram model and improves its accuracy.
In order to achieve the above objects, an embodiment of the second aspect of the present invention provides an input method model generation apparatus, including:
an acquisition module, configured to acquire training data and a word segmentation lexicon, wherein the lexicon includes words exceeding a preset number threshold, the words including words related to input method scenarios;
a query module, configured to query each sentence in the training data for each word in the segmentation lexicon to obtain the word's frequency and its bigram words;
a generation module, configured to generate a prefix tree from the words in the segmentation lexicon and their bigram words, wherein each node of the prefix tree is a word or a bigram word;
a segmentation module, configured to segment each sentence in the training data with the prefix tree to obtain at least one segmentation result, and to generate a directed acyclic graph for the sentence from the at least one segmentation result;
and a determination module, configured to determine the maximum-probability path in the directed acyclic graph from the word frequencies of the words in the graph, and to determine the sentence's word segmentation result from that path.
The generation module is further configured to generate the N-gram model used by the input method application from the word segmentation results of all sentences in the training data.
Further, the query module is specifically configured to:
query each sentence in the training data for each word in the segmentation lexicon to obtain the word's frequency;
and acquire the word that follows the word in each sentence, determining that following word as a bigram word of the word.
Further, the determination module is specifically configured to:
determine the maximum-probability path in the directed acyclic graph from the word frequencies of the words in the graph;
judge whether the probability of the maximum-probability path is greater than or equal to a preset probability threshold;
if it is, traverse a user dictionary with the segmentation result of the maximum-probability path and judge whether the segmentation result contains consecutive words matching a word in the user dictionary;
and, if such consecutive words exist, merge them to obtain the sentence's word segmentation result.
Further, the determination module is also configured to:
acquire the proportion of single characters in the segmentation result of the maximum-probability path;
judge whether that proportion is greater than or equal to a preset proportion threshold;
if it is, judge whether the segmentation result contains consecutive words matching a word in the user dictionary;
and, if the proportion is below the threshold, take the segmentation result of the maximum-probability path as the word segmentation result.
Further, the determination module is also configured to:
input the sentence into a trained statistical language model to obtain the sentence's word segmentation result when the probability of the maximum-probability path is below the preset probability threshold.
Further, the generation module is specifically configured to:
acquire, for each segmented word in each sentence's segmentation result, the word's frequency across the segmentation results;
query the segmentation results of each sentence for the segmented word to obtain its bigram words;
acquire the frequency with which the segmented word and each of its bigram words co-occur in the segmentation results;
merge the segmented word and a bigram word into a single segmented word when their co-occurrence frequency exceeds a preset frequency threshold;
and generate the N-gram model from the frequency of each segmented word and of its bigram words.
Further, the acquisition module is also configured to obtain pinyin input by a user, to input the pinyin into the N-gram model, and to acquire the words matching the pinyin together with each word's occurrence probability;
and the generation module is further configured to generate a candidate word list ordered by occurrence probability, so that the user can select a word from the list and input it.
The input method model generation apparatus of the embodiment of the invention acquires training data and a word segmentation lexicon containing words exceeding a preset number threshold, including words related to input method scenarios; queries each sentence in the training data for each word in the lexicon to obtain the word's frequency and its bigram words; generates a prefix tree from the lexicon words and their bigram words, with each node being a word or a bigram word; segments each sentence with the prefix tree to obtain at least one segmentation result, from which a directed acyclic graph is generated; determines the maximum-probability path in the graph from the word frequencies, and the sentence's word segmentation result from that path; and generates the N-gram model used by the input method application from the segmentation results of all sentences. As a result, no HMM is needed and no sample data has to be manually labeled for HMM training, which reduces the generation cost of the N-gram model and improves its accuracy.
To achieve the above objects, an embodiment of the third aspect of the present invention provides another input method model generation apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the input method model generation method described above when executing the computer program.
To achieve the above objects, an embodiment of the fourth aspect of the present invention provides a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by a processor, the method described above is implemented.
To achieve the above objects, an embodiment of the fifth aspect of the present invention provides a computer program product; when the instructions in the computer program product are executed by a processor, the method described above is implemented.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flow chart of a method for generating an input method model according to an embodiment of the present invention;
fig. 2 is a schematic diagram of the use of an input method application;
fig. 3 is a schematic flow chart of another input method model generation method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an input method model generation apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of another input method model generation apparatus according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to elements that are the same or similar or have the same or similar functions throughout. The embodiments described below with reference to the drawings are illustrative, intended to explain the invention, and are not to be construed as limiting it.
An input method model generation method and apparatus according to an embodiment of the present invention are described below with reference to the drawings.
Fig. 1 is a schematic flow chart of a method for generating an input method model according to an embodiment of the present invention. As shown in fig. 1, the input method model generation method includes the following steps:
s101, training data and a word segmentation word bank are obtained, wherein the word segmentation word bank comprises: words greater than a preset number threshold; the words include: words associated with the input method scene.
The method provided by the invention is executed by an input method model generation apparatus, which may be a hardware device such as a terminal device or a server, or software installed on a hardware device. The software may be an input method application, and the hardware device may be a background server for the input method application or a terminal device on which the input method application is installed.
In this embodiment, the training data consists of a large number of training sentences acquired from websites and similar sources, for example by crawlers. The word segmentation lexicon is obtained in advance, for example generated from the entries of a preset dictionary. Because the method is applied to an input method application, the candidate word list it offers should contain suitable words when the user types something related to an input method scenario, so words related to such scenarios may be added to the segmentation lexicon, allowing them to be recognized in the training sentences. Words related to input method scenarios are expressions commonly used during input, such as "我是谁" ("who am I") or "在干嘛" ("what are you doing"). Here a "word" means any combination of characters, including words, phrases, and whole expressions.
S102: for each word in the segmentation lexicon, each sentence in the training data is queried to obtain the word's frequency and its bigram words.
In this embodiment, the input method model generation apparatus may perform step 102 as follows: for each word in the segmentation lexicon, query each sentence in the training data to obtain the word's frequency; then acquire the word that follows it in each sentence and record that following word as a bigram word of the word.
A word's frequency is its number of occurrences in the training data, the percentage of occurrences, or a similar measure. The word after a word is the word immediately following it; for example, in the sentence "我是谁" ("who am I"), the word after "我" is "是".
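The counting performed in step S102 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name is invented, the sentences are assumed to be pre-tokenized lists, and raw counts stand in for whichever frequency measure (count or percentage) is actually used.

```python
from collections import Counter

def count_words_and_bigrams(sentences, lexicon):
    """For each lexicon word found in the pre-tokenized training
    sentences, count its occurrences (word frequency) and tally the
    word that immediately follows it (its bigram word)."""
    freq = Counter()
    followers = {}  # word -> Counter of the words observed to follow it
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok in lexicon:
                freq[tok] += 1
                if i + 1 < len(tokens):
                    followers.setdefault(tok, Counter())[tokens[i + 1]] += 1
    return freq, followers
```

For the two sentences "我/是/谁" and "我/是/学生", for instance, the word "我" occurs twice and is followed by "是" both times.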
S103: a prefix tree is generated from the words in the segmentation lexicon and their bigram words; each node of the prefix tree is a word or a bigram word.
In this embodiment, the prefix tree has root nodes, child nodes, and so on. There are multiple root nodes, one for the first word of each sentence in the training data; the children of a root node are the bigram words of the root node's word; and the children of a child node are the bigram words of that child's word, i.e. the trigram words of the root node.
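A word-level prefix tree of the kind just described, whose roots are sentence-initial words and whose children are the following (bigram) words, might be built as in this sketch; the nested-dict representation and the helper name are assumptions for illustration.

```python
def build_prefix_tree(segmented_sentences):
    """Build a word-level prefix tree from already-segmented sentences:
    each root is a sentence's first word, and each node's children are
    the words observed to follow it in some sentence."""
    root = {}
    for words in segmented_sentences:
        node = root
        for w in words:
            node = node.setdefault(w, {})  # descend, creating the node if new
    return root
```

For the sentences "我/是/谁" and "我/是/学生", the tree has the single root "我", whose child "是" in turn has the children "谁" and "学生".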
S104: each sentence in the training data is segmented with the prefix tree to obtain at least one segmentation result, and a directed acyclic graph for the sentence is generated from the at least one segmentation result.
In this embodiment, the apparatus may segment each sentence as follows: compare the sentence with each root node of the prefix tree to find a matching root node, whose word becomes the first word of the sentence; then compare the remainder of the sentence with the children of the matched root node to determine the second word; and so on, producing at least one segmentation result for the sentence, where each segmentation result corresponds to one way of splitting the sentence. For example, the sentence "我是谁" ("who am I") can correspond to four segmentation results: "我"/"是"/"谁"; "我是"/"谁"; "我"/"是谁"; and "我是谁" as a whole.
To generate the directed acyclic graph from the segmentation results, the apparatus may take the first word of each segmentation result as an initial node, connect each second word to its initial node, connect each third word to its second word, and so on. If two second words share the same initial node, that node is connected to both; if one second word follows two different initial nodes, both nodes are connected to the same second-word node.
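One common, compact way to represent all candidate segmentations of a sentence (used by jieba-style segmenters) is a DAG over character positions, with an edge from i to j whenever the substring sentence[i:j] is a known word. The sketch below assumes that representation; it is not necessarily the patent's exact graph layout.

```python
def build_dag(sentence, word_set, max_word_len=8):
    """dag[i] lists every end index j such that sentence[i:j] is a
    known word; a single character is always kept as a fallback edge."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [i + 1]  # fallback: the single character sentence[i]
        for j in range(i + 2, min(n, i + max_word_len) + 1):
            if sentence[i:j] in word_set:
                ends.append(j)
        dag[i] = ends
    return dag
```

For "我是谁" with the known words {"我是", "是谁", "我是谁"}, the resulting graph encodes exactly the four segmentation results of the sentence.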
S105: the maximum-probability path in the directed acyclic graph is determined from the word frequencies of the words in the graph, and the sentence's word segmentation result is determined from that path.
In this embodiment, the apparatus may use a dynamic programming algorithm that combines the word frequencies of the words in the graph to compute the occurrence probability of each path, take the path with the largest probability, and use the words on that path as the sentence's word segmentation result.
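The maximum-probability path search can be sketched with the standard right-to-left dynamic program over a position DAG of the form built above. The conversion of raw frequencies to log-probabilities against a total count, and the unseen-word fallback count of 1, are assumptions of this sketch.

```python
import math

def best_segmentation(sentence, dag, freq, total):
    """route[i] = (best log-probability of segmenting sentence[i:],
    end index of the first word on that best path)."""
    n = len(sentence)
    route = {n: (0.0, n)}
    log_total = math.log(total)
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(freq.get(sentence[i:j], 1)) - log_total + route[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:  # walk the stored path to emit the words
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words
```

With frequencies favoring "我是" over the single characters, the best path through "我是谁" yields the segmentation "我是"/"谁".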
S106: the N-gram model used by the input method application is generated from the word segmentation results of all sentences in the training data.
In this embodiment, the apparatus may perform step 106 as follows: for each segmented word in each sentence's segmentation result, obtain the word's frequency across the segmentation results; query the segmentation results for the word's bigram words; obtain the frequency with which the word and each bigram word co-occur; merge the word and a bigram word into a single segmented word when their co-occurrence frequency exceeds a preset frequency threshold; and generate the N-gram model from the frequencies of the segmented words and their bigram words.
Merging a segmented word with a bigram word that frequently co-occurs with it yields a new segmented word, whose frequency is computed from the frequencies of the two parts before merging. Then, when the user types the pinyin of the merged word, the input method application can use the N-gram model to promptly place the merged word in the candidate word list, so that the user can select and input the desired word. This improves the input efficiency of the input method application and the user's experience with it.
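The merge criterion can be sketched as follows; the strictly-greater-than threshold comparison follows the text, while the function name and the pre-segmented input format are illustrative assumptions.

```python
from collections import Counter

def merge_frequent_bigrams(segmented_sentences, freq_threshold):
    """Count adjacent word pairs across all segmentation results and
    return, as merged single words, the pairs whose co-occurrence
    frequency exceeds the preset threshold."""
    pair_freq = Counter()
    for words in segmented_sentences:
        pair_freq.update(zip(words, words[1:]))  # adjacent pairs
    return {a + b for (a, b), c in pair_freq.items() if c > freq_threshold}
```

For example, if "一起" is followed by "去" in two segmentation results and the threshold is 1, the pair is merged into the single word "一起去".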
Further, on the basis of the above embodiment, the method may include the following steps after step 106: obtain pinyin input by the user; input the pinyin into the N-gram model to obtain the words matching the pinyin and each word's occurrence probability; and generate a candidate word list ordered by occurrence probability, so that the user can select a word from the list and input it. Fig. 2 shows the input method application in use: when the user inputs the pinyin "yiqiqu", the first word in the candidate word list is "一起去" ("go together").
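The candidate-list step might look like the sketch below, where a pinyin-to-words table (pinyin_index) and a per-word probability map stand in for the N-gram model's actual lookup; both names and the data are illustrative assumptions.

```python
def candidate_list(pinyin, pinyin_index, word_prob):
    """Return the words a pinyin string can map to, ranked by the
    model's occurrence probability (highest first)."""
    words = pinyin_index.get(pinyin, [])
    return sorted(words, key=lambda w: word_prob.get(w, 0.0), reverse=True)
```

With hypothetical data, typing "yiqiqu" would rank the more probable "一起去" ahead of the homophone "一齐去" in the candidate list.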
The N-gram is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese it is also called the Chinese Language Model (CLM). Using the collocation information between adjacent words in context, a Chinese language model can automatically convert speech or pinyin into Chinese characters.
In the input method model generation method of the embodiment of the invention, training data and a word segmentation lexicon are acquired, where the lexicon contains words exceeding a preset number threshold, including words related to input method scenarios; for each word in the lexicon, each sentence in the training data is queried to obtain the word's frequency and its bigram words; a prefix tree is generated from the lexicon words and their bigram words, with each node being a word or a bigram word; each sentence is segmented with the prefix tree to obtain at least one segmentation result, from which a directed acyclic graph is generated; the maximum-probability path in the graph is determined from the word frequencies, and the sentence's word segmentation result is determined from that path; and the N-gram model used by the input method application is generated from the segmentation results of all sentences. As a result, no HMM is needed and no sample data has to be manually labeled for HMM training, which reduces the generation cost of the N-gram model and improves its accuracy.
Fig. 3 is a schematic flow chart of another input method model generation method according to an embodiment of the present invention. As shown in fig. 3, step 105 may specifically include the following steps based on the embodiment shown in fig. 1:
S1051, determining a maximum probability path in the directed acyclic graph according to the word frequency of each word in the directed acyclic graph.
In this embodiment, the input method model generation device may specifically use a dynamic programming algorithm, and combine word frequencies of words in the directed acyclic graph to calculate occurrence probabilities of paths in the directed acyclic graph, so as to obtain a maximum probability path therein, and determine words on the maximum probability path as word segmentation results corresponding to the sentences.
Specifically, for each path in the directed acyclic graph, the input method model generation device may first obtain the first word in the path, obtain the word frequency of that word, and determine its occurrence probability; then obtain the second word in the path, which is a binary relation word of the first word, and determine the occurrence probability of the second word following the first word; and so on, to determine the occurrence probability of the whole path. After the occurrence probability of each path is obtained, the path with the maximum occurrence probability is the maximum probability path in the directed acyclic graph.
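The path scoring just described can be sketched with a toy bigram model. The counts below are hypothetical stand-ins for statistics a real system would gather from training data, and the `1e-6` floor is a placeholder for proper smoothing:

```python
from math import log

# Hypothetical toy counts; a real system derives these from the training data.
unigram = {"I": 4, "am": 3, "who": 3, "I am": 2, "am who": 1}
total = sum(unigram.values())
bigram = {("I", "am"): 2, ("am", "who"): 2, ("I am", "who"): 2, ("I", "am who"): 1}

def path_log_prob(path):
    """Score one candidate path: log P(w1) + sum of log P(w_i | w_{i-1})."""
    score = log(unigram[path[0]] / total)
    for prev, cur in zip(path, path[1:]):
        # Conditional probability estimated as count(prev, cur) / count(prev),
        # with a tiny floor standing in for real smoothing.
        score += log(bigram.get((prev, cur), 1e-6) / unigram[prev])
    return score

# Three candidate paths through the DAG for the same sentence.
paths = [("I", "am", "who"), ("I am", "who"), ("I", "am who")]
best = max(paths, key=path_log_prob)
```

With these toy counts the longer lexicon entry wins: the maximum probability path is `("I am", "who")`.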
S1052, judging whether the probability corresponding to the maximum probability path is larger than or equal to a preset probability threshold value.
The preset probability threshold may be, for example, sixty percent, eighty percent, or the like. If the probability corresponding to the maximum probability path is greater than or equal to the preset probability threshold, no unknown word exists in the sentence and the segmentation result corresponding to the maximum probability path is an appropriate one; if the probability is smaller than the threshold, an unknown word may exist in the sentence, the segmentation result corresponding to the maximum probability path is not suitable, and the sentence needs to be segmented again. Unknown words are words that are not included in the word segmentation lexicon but must still be segmented out, including proper nouns, acronyms, newly coined words, and the like.
S1053, if the probability corresponding to the maximum probability path is larger than or equal to the preset probability threshold, traversing the user dictionary according to the segmentation result corresponding to the maximum probability path, and judging whether a plurality of continuous words matched with the words in the user dictionary exist in the segmentation result.
In this embodiment, the user dictionary refers to a dictionary containing professional words and special words. Special words include, for example, place names and building names. The words in the word segmentation lexicon are generally common words with few special or professional words among them, so a professional or special word in a sentence is easily split into several single characters when segmenting with the lexicon. Step S1053 is therefore needed to merge consecutive single characters in the segmentation result back into a professional or special word.
Further, in this embodiment, when a professional or special word has been split apart, the proportion of single characters in the segmentation result is generally high. Therefore, to further improve the generation efficiency of the input method model, once the probability corresponding to the maximum probability path is greater than or equal to the preset probability threshold, the input method model generation device may first obtain the single-character ratio of the segmentation result corresponding to the maximum probability path. If the single-character ratio is greater than or equal to a preset ratio threshold, the device traverses the user dictionary according to the segmentation result and judges whether several consecutive single characters in the segmentation result match a word in the user dictionary; if the single-character ratio is smaller than the preset ratio threshold, the segmentation result corresponding to the maximum probability path is directly taken as the word segmentation result.
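One possible sketch of this ratio check plus dictionary merge in Python. The function name, the greedy longest-match strategy, and the 0.5 default threshold are all illustrative assumptions, not the patent's exact procedure:

```python
def single_char_ratio(seg):
    """Fraction of tokens in a segmentation that are single characters."""
    return sum(1 for w in seg if len(w) == 1) / len(seg)

def merge_with_user_dict(seg, user_dict, ratio_threshold=0.5):
    """If many single characters remain, try to merge runs of consecutive
    single characters into a user-dictionary word (e.g. a place name)."""
    if single_char_ratio(seg) < ratio_threshold:
        return seg  # cheap path: accept the segmentation as-is
    out, i = [], 0
    while i < len(seg):
        merged = False
        if len(seg[i]) == 1:
            # Greedily try the longest candidate run first.
            for j in range(len(seg), i + 1, -1):
                if all(len(w) == 1 for w in seg[i:j]) and "".join(seg[i:j]) in user_dict:
                    out.append("".join(seg[i:j]))
                    i = j
                    merged = True
                    break
        if not merged:
            out.append(seg[i])
            i += 1
    return out
```

For example, with a user dictionary containing "abc", the segmentation `["x", "a", "b", "c", "yy"]` becomes `["x", "abc", "yy"]`, while a result below the ratio threshold is returned unchanged.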
In addition, if the probability corresponding to the maximum probability path is smaller than the preset probability threshold, the sentence can be input into a trained statistical language model to obtain the word segmentation result corresponding to the sentence. The statistical language model may be an HMM (Hidden Markov Model), which can be trained with the sentences and corresponding word segmentation results determined in steps S1051 to S1054. This avoids manually segmenting and labeling sample data before training the HMM, which reduces the training cost of the HMM model and improves its training accuracy.
And S1054, if a plurality of continuous words exist in the segmentation result, integrating the continuous words to obtain a word segmentation result corresponding to the sentence.
In addition, if a plurality of continuous words matched with the words in the user dictionary do not exist in the segmentation result, the segmentation result corresponding to the maximum probability path is directly determined as the word segmentation result.
In the input method model generation method of the embodiment of the invention, training data and a word segmentation lexicon are obtained, where the word segmentation lexicon includes words exceeding a preset number threshold, among them words related to the input method scene; for each word in the lexicon, each sentence in the training data is queried to obtain the word frequency of the word and the binary relation words corresponding to the word; a prefix tree is generated from the lexicon words and their binary relation words, with each node being a word or a binary relation word; each sentence in the training data is segmented with the prefix tree to obtain at least one segmentation result, from which a directed acyclic graph corresponding to the sentence is generated; a maximum probability path in the directed acyclic graph is determined from the word frequencies of the words in the graph; whether the probability corresponding to the maximum probability path is greater than or equal to a preset probability threshold is judged; if so, the user dictionary is traversed according to the segmentation result corresponding to the maximum probability path, and it is judged whether several consecutive words in the segmentation result match a word in the user dictionary; if such consecutive words exist, they are merged to obtain the word segmentation result corresponding to the sentence; and the N-gram model used in the input method application is generated from the word segmentation results of the sentences in the training data. An HMM model is therefore not needed, sample data for an HMM model need not be manually labeled, the generation cost of the N-gram model is reduced, and its accuracy is improved.
Fig. 4 is a schematic structural diagram of an input method model generation apparatus according to an embodiment of the present invention. As shown in fig. 4, includes: an obtaining module 41, a querying module 42, a generating module 43, a segmenting module 44 and a determining module 45.
The obtaining module 41 is configured to obtain training data and a word segmentation lexicon, where the word segmentation lexicon includes: words greater than a preset number threshold; the words include: words related to the input method scene;
a query module 42, configured to query each sentence in the training data for each term in the word segmentation lexicon, to obtain a word frequency of the term and a binary relation word corresponding to the term;
a generating module 43, configured to generate a prefix tree according to each word in the word segmentation word bank and the corresponding binary relation word; the nodes in the prefix tree are words or binary relation words;
a segmentation module 44, configured to segment each sentence in the training data by using the prefix tree to obtain at least one segmentation result, and generate a directed acyclic graph corresponding to the sentence according to the at least one segmentation result;
a determining module 45, configured to determine a maximum probability path in the directed acyclic graph according to a word frequency of each word in the directed acyclic graph, and determine a word segmentation result corresponding to the sentence according to the maximum probability path;
the generating module 43 is further configured to generate the N-gram model in the input method application according to the word segmentation result corresponding to each sentence in the training data.
The input method model generation device provided by the invention can be hardware equipment such as terminal equipment and a server, or software installed on the hardware equipment. The software may be an input method application, and the hardware device may be a background server corresponding to the input method application, or a terminal device installed with the input method application.
In this embodiment, the training data is a large number of training sentences acquired from websites and the like, for example by a crawler. The word segmentation lexicon is obtained in advance, for example generated from the words of a preset dictionary. Because the input method model generation method is applied to an input method application, the application should be able to include a suitable word in the candidate word list provided to the user when the user types a word related to an input method scene; words related to the input method scene may therefore be added to the word segmentation lexicon so that they can be identified in the training sentences. Words related to the input method scene are words commonly used when a user types, such as "who am I" or "what are you doing". Here, "word" refers to a combination of characters and terms, including single words, phrases, and whole expressions.
In this embodiment, the query module 42 is specifically configured to, for each word in the word segmentation lexicon, query each sentence in the training data to obtain the word frequency of the word, and to obtain the word that follows it in each sentence, determining those following words as the binary relation words corresponding to the word.
The word frequency of a word is a measure such as the ratio or frequency of the word's occurrences in the training data. The word after a word refers to the word immediately following it. For example, in the sentence "who am I", the word after "I am" is "who".
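The frequency and following-word collection can be illustrated roughly as below. To keep the sketch short it assumes the sentences are already tokenized into word lists, which simplifies the sentence lookup the text describes:

```python
from collections import Counter, defaultdict

def collect_stats(sentences, lexicon):
    """For each lexicon word, count its occurrences and record the word that
    immediately follows it (its 'binary relation word') with a count."""
    freq = Counter()
    followers = defaultdict(Counter)
    for sent in sentences:            # each sentence: a list of word tokens
        for i, w in enumerate(sent):
            if w in lexicon:
                freq[w] += 1
                if i + 1 < len(sent):
                    followers[w][sent[i + 1]] += 1
    return freq, followers
```

For instance, over the two sentences `["I", "am", "who"]` and `["I", "am", "happy"]`, the word "am" has frequency 2, and its recorded followers are "who" and "happy", once each.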
In this embodiment, the prefix tree includes a root node, a child node, and the like. The number of the root nodes is multiple, and the root nodes are respectively the first words of each sentence in the training data; the child nodes corresponding to the root nodes are binary relation words corresponding to the root node words; the child node corresponding to the child node is a binary relation word corresponding to the child node word, or a ternary relation word corresponding to the root node.
In this embodiment, the process in which the segmentation module 44 segments each sentence in the training data with the prefix tree to obtain at least one segmentation result may specifically be as follows: for each sentence, the sentence is compared with each root node in the prefix tree one by one to find a matching root node, and the word of the matching root node is taken as the first word in the sentence; the part of the sentence after the first word is then compared with the child nodes of the matching root node to determine the second word in the sentence; and so on, until at least one segmentation result corresponding to the sentence is obtained. Each segmentation result corresponds to one way of splitting the sentence. For example, the sentence "who am I" may correspond to four segmentation results: the first is "I", "am", "who"; the second is "I am", "who"; the third is "I", "am who"; and the fourth is the whole "who am I".
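The matching-and-backtracking enumeration in the example can be sketched with a character-level trie. Note this is a simplification: the patent's prefix tree stores words and binary relation words at its nodes, while the sketch below stores characters, which is the more common textbook form:

```python
END = "$end$"  # sentinel key marking that a lexicon entry ends at this node

def build_trie(words):
    """Build a character-level prefix tree as nested dicts."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def all_segmentations(sentence, root):
    """Enumerate every split of `sentence` into trie words by backtracking."""
    results = []
    def walk(i, acc):
        if i == len(sentence):
            results.append(list(acc))
            return
        node, j = root, i
        # Extend the current word as long as the trie has a matching branch.
        while j < len(sentence) and sentence[j] in node:
            node = node[sentence[j]]
            j += 1
            if END in node:               # a lexicon word ends here: recurse
                acc.append(sentence[i:j])
                walk(j, acc)
                acc.pop()
    walk(0, [])
    return results
```

With the lexicon `["i", "am", "iam", "who"]`, the string `"iamwho"` yields two full segmentations, `["i", "am", "who"]` and `["iam", "who"]`, mirroring the multiple segmentation results in the text.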
In this embodiment, the process of generating the directed acyclic graph corresponding to the sentence by the segmentation module 44 according to the at least one segmentation result may specifically be that a first word in the at least one segmentation result is determined as an initial node, a second word is connected to the corresponding initial node, and a third word is connected to the corresponding second word, so as to obtain the directed acyclic graph. If the two second words correspond to the same initial node, one initial node is connected with the two second words; if a second word corresponds to two different initial nodes, the two initial nodes are connected with the same second word.
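A compact way to hold all candidate segmentations at once is the position-indexed DAG used by open-source segmenters such as jieba, where each start index maps to every end index that closes a lexicon word. This is one common construction, offered here as a sketch rather than the patent's exact graph:

```python
def build_dag(sentence, lexicon):
    """dag[i] lists every end index j such that sentence[i:j] is a lexicon
    word; if no word starts at i, fall back to the single character."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i + 1, len(sentence) + 1)
                if sentence[i:j] in lexicon]
        dag[i] = ends or [i + 1]  # fallback keeps the graph connected
    return dag
```

Every path from index 0 to the end of the string in this structure corresponds to one segmentation result, so the maximum probability path search of the next step can run directly over it.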
In this embodiment, the generating module 43 may be specifically configured to: for each segmented word in the word segmentation result corresponding to each sentence, obtain the word frequency of the segmented word in the results; query the word segmentation result of each sentence for that word to obtain its corresponding binary relation words; obtain the frequency with which the segmented word and a corresponding binary relation word occur together in the results; when that frequency is greater than a preset frequency threshold, merge the segmented word and the binary relation word into a single segmented word; and generate the N-gram model according to the word frequency of each segmented word and of its corresponding binary relation words.
A segmented word and a binary relation word that frequently occur together are merged into one segmented word, and the word frequency of the merged word is calculated from the word frequencies of the original segmented word and its binary relation word. Then, when a user types the pinyin of the merged word in the input method application, the application can promptly add the merged word to the candidate word list via the N-gram model, so that the user can select and input the desired word from the list. This improves the input efficiency of the input method application and the user's experience with it.
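A minimal sketch of this merge step follows; the function name, the plain string concatenation, and the simple rebuild pass are illustrative simplifications of the counting-and-merging the text describes:

```python
from collections import Counter

def merge_frequent_bigrams(segmented_sentences, threshold):
    """Count adjacent word pairs across all segmentation results; pairs at or
    above the threshold are fused into one lexicon entry, and every sentence
    is rewritten with the fused entries."""
    pair_counts = Counter()
    for sent in segmented_sentences:
        pair_counts.update(zip(sent, sent[1:]))
    merged = {"".join(p) for p, c in pair_counts.items() if c >= threshold}
    out = []
    for sent in segmented_sentences:
        new, i = [], 0
        while i < len(sent):
            # Fuse the pair when its concatenation was marked as frequent.
            if i + 1 < len(sent) and sent[i] + sent[i + 1] in merged:
                new.append(sent[i] + sent[i + 1])
                i += 2
            else:
                new.append(sent[i])
                i += 1
        out.append(new)
    return out, merged
```

For example, with threshold 2 over `[["a","b","c"], ["a","b","d"]]`, the pair ("a", "b") occurs twice, so both sentences are rewritten with the fused token "ab".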
Further, on the basis of the above embodiment, the obtaining module 41 is further configured to obtain pinyin input by a user;
the obtaining module 41 is further configured to input the pinyin into the N-gram model, and obtain each word corresponding to the pinyin and the occurrence probability of each word;
the generating module 43 is further configured to generate a candidate word list according to the occurrence probability of each word, so that a user can select a word from the candidate word list and input the word.
In the input method model generation device of the embodiment of the invention, training data and a word segmentation lexicon are obtained, where the word segmentation lexicon includes words exceeding a preset number threshold, among them words related to the input method scene; for each word in the lexicon, each sentence in the training data is queried to obtain the word frequency of the word and the binary relation words corresponding to the word; a prefix tree is generated from the lexicon words and their binary relation words, with each node being a word or a binary relation word; each sentence in the training data is segmented with the prefix tree to obtain at least one segmentation result, from which a directed acyclic graph corresponding to the sentence is generated; a maximum probability path in the directed acyclic graph is determined from the word frequencies of the words in the graph, and the word segmentation result corresponding to the sentence is determined from that path; and the N-gram model used in the input method application is generated from the word segmentation results of the sentences in the training data. An HMM model is therefore not needed, sample data for an HMM model need not be manually labeled, the generation cost of the N-gram model is reduced, and its accuracy is improved.
Further, on the basis of the above-mentioned embodiment, the determining module 45 may be specifically configured to,
determining a maximum probability path in the directed acyclic graph according to the word frequency of each word in the directed acyclic graph;
judging whether the probability corresponding to the maximum probability path is greater than or equal to a preset probability threshold value or not;
if the probability corresponding to the maximum probability path is larger than or equal to a preset probability threshold, traversing a user dictionary according to the segmentation result corresponding to the maximum probability path, and judging whether a plurality of continuous words matched with the words in the user dictionary exist in the segmentation result;
and if the continuous words exist in the segmentation result, integrating the continuous words to obtain a word segmentation result corresponding to the sentence.
In this embodiment, the input method model generation device may specifically use a dynamic programming algorithm, and combine word frequencies of words in the directed acyclic graph to calculate occurrence probabilities of paths in the directed acyclic graph, so as to obtain a maximum probability path therein, and determine words on the maximum probability path as word segmentation results corresponding to the sentence.
Specifically, for each path in the directed acyclic graph, the input method model generation device may first obtain the first word in the path, obtain the word frequency of that word, and determine its occurrence probability; then obtain the second word in the path, which is a binary relation word of the first word, and determine the occurrence probability of the second word following the first word; and so on, to determine the occurrence probability of the whole path. After the occurrence probability of each path is obtained, the path with the maximum occurrence probability is the maximum probability path in the directed acyclic graph.
In this embodiment, the preset probability threshold may be, for example, sixty percent, eighty percent, or the like. If the probability corresponding to the maximum probability path is greater than or equal to the preset probability threshold, no unknown word exists in the sentence and the segmentation result corresponding to the maximum probability path is an appropriate one; if the probability is smaller than the threshold, an unknown word may exist in the sentence, the segmentation result corresponding to the maximum probability path is not suitable, and the sentence needs to be segmented again. Unknown words are words that are not included in the word segmentation lexicon but must still be segmented out, including proper nouns, acronyms, newly coined words, and the like.
In this embodiment, the user dictionary refers to a dictionary containing professional words and special words. Special words include, for example, place names and building names. The words in the word segmentation lexicon are generally common words with few special or professional words among them, so a professional or special word in a sentence is easily split into several single characters when segmenting with the lexicon. Consecutive single characters in the segmentation result therefore need to be merged back into a professional or special word.
Further, in this embodiment, when a professional or special word has been split apart, the proportion of single characters in the segmentation result is generally high. Therefore, to further improve the generation efficiency of the input method model, once the probability corresponding to the maximum probability path is greater than or equal to the preset probability threshold, the input method model generation device may first obtain the single-character ratio of the segmentation result corresponding to the maximum probability path. If the single-character ratio is greater than or equal to a preset ratio threshold, the device traverses the user dictionary according to the segmentation result and judges whether several consecutive single characters in the segmentation result match a word in the user dictionary; if the single-character ratio is smaller than the preset ratio threshold, the segmentation result corresponding to the maximum probability path is directly taken as the word segmentation result.
In addition, if the probability corresponding to the maximum probability path is smaller than the preset probability threshold, the sentence can be input into a trained statistical language model to obtain the word segmentation result corresponding to the sentence. The statistical language model may be an HMM (Hidden Markov Model), which can be trained with the sentences and corresponding word segmentation results determined by the determining module 45. This avoids manually segmenting and labeling sample data before training the HMM, which reduces the training cost of the HMM model and improves its training accuracy.
In addition, if a plurality of continuous words matched with the words in the user dictionary do not exist in the segmentation result, the segmentation result corresponding to the maximum probability path is directly determined as the word segmentation result.
In the input method model generation device of the embodiment of the invention, training data and a word segmentation lexicon are obtained, where the word segmentation lexicon includes words exceeding a preset number threshold, among them words related to the input method scene; for each word in the lexicon, each sentence in the training data is queried to obtain the word frequency of the word and the binary relation words corresponding to the word; a prefix tree is generated from the lexicon words and their binary relation words, with each node being a word or a binary relation word; each sentence in the training data is segmented with the prefix tree to obtain at least one segmentation result, from which a directed acyclic graph corresponding to the sentence is generated; a maximum probability path in the directed acyclic graph is determined from the word frequencies of the words in the graph; whether the probability corresponding to the maximum probability path is greater than or equal to a preset probability threshold is judged; if so, the user dictionary is traversed according to the segmentation result corresponding to the maximum probability path, and it is judged whether several consecutive words in the segmentation result match a word in the user dictionary; if such consecutive words exist, they are merged to obtain the word segmentation result corresponding to the sentence; and the N-gram model used in the input method application is generated from the word segmentation results of the sentences in the training data. An HMM model is therefore not needed, sample data for an HMM model need not be manually labeled, the generation cost of the N-gram model is reduced, and its accuracy is improved.
Fig. 5 is a schematic structural diagram of another input method model generation apparatus according to an embodiment of the present invention. The input method model generation device includes:
a memory 1001, a processor 1002, and a computer program stored in the memory 1001 and executable on the processor 1002.
The processor 1002 implements the input method model generation method provided in the above-described embodiment when executing the program.
Further, the input method model generation apparatus further includes:
a communication interface 1003 for communicating between the memory 1001 and the processor 1002.
A memory 1001 for storing computer programs that may be run on the processor 1002.
Memory 1001 may include high-speed RAM memory and may also include non-volatile memory (e.g., at least one disk memory).
The processor 1002 is configured to implement the method for generating an input method model according to the foregoing embodiment when executing the program.
If the memory 1001, the processor 1002, and the communication interface 1003 are implemented independently, the communication interface 1003, the memory 1001, and the processor 1002 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.
Optionally, in a specific implementation, if the memory 1001, the processor 1002, and the communication interface 1003 are integrated on one chip, the memory 1001, the processor 1002, and the communication interface 1003 may complete communication with each other through an internal interface.
The processor 1002 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present invention.
The present embodiment also provides a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the input method model generation method described above.
The present embodiment also provides a computer program product; when instructions in the computer program product are executed by a processor, the input method model generation method described above is performed.
In the description of the specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless explicitly specified otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware, or in the form of a software functional module. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

Although embodiments of the present invention have been shown and described above, it is to be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; variations, modifications, substitutions, and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (16)

1. A method for generating an input method model, comprising:
acquiring training data and a word segmentation lexicon, wherein the word segmentation lexicon comprises: words exceeding a preset number threshold; the words comprise: words related to input method scenarios;
querying each sentence in the training data for each word in the word segmentation lexicon, and acquiring a word frequency of the word and a binary relation word corresponding to the word;
generating a prefix tree according to each word in the word segmentation lexicon and the corresponding binary relation word, wherein nodes in the prefix tree are words or binary relation words, and the prefix tree comprises a plurality of root nodes which are respectively the first words of the sentences in the training data;
for each sentence in the training data, segmenting the sentence with the prefix tree to obtain at least one segmentation result, and generating a directed acyclic graph corresponding to the sentence according to the at least one segmentation result;
determining a maximum probability path in the directed acyclic graph according to the word frequency of each word in the directed acyclic graph, and determining a word segmentation result corresponding to the sentence according to the maximum probability path;
generating an N-gram model in an input method application according to the word segmentation result corresponding to each sentence in the training data;
wherein segmenting each sentence in the training data with the prefix tree to obtain at least one segmentation result comprises:
for each sentence, comparing the sentence with each root node in the prefix tree one by one, determining a root node matching the sentence, and determining the word of the matched root node as a first word in the sentence; then comparing the content of the sentence following the first word with the child nodes of the matched root node to determine a second word in the sentence; and repeating these operations in sequence to obtain the at least one segmentation result corresponding to the sentence.
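The segmentation and maximum-probability-path selection recited in claim 1 can be illustrated with a small sketch (outside the claim language, not the claimed implementation): a toy lexicon with invented frequencies stands in for the word segmentation lexicon, plain dictionary lookups emulate prefix-tree matching, and a dynamic program over the directed acyclic graph picks the most probable segmentation. All names, frequencies, and the sample sentence are illustrative assumptions.

```python
import math

# Toy lexicon with word frequencies (illustrative assumptions; in the
# patent these come from the training data and the word segmentation lexicon).
FREQ = {"i": 50, "like": 30, "apples": 10, "app": 5, "les": 1}
TOTAL = sum(FREQ.values())

def build_dag(chars):
    """For each start index i, list every end index j such that
    chars[i:j] is a lexicon word; dictionary lookups emulate
    prefix-tree matching here."""
    dag = {}
    n = len(chars)
    for i in range(n):
        ends = [j for j in range(i + 1, n + 1) if "".join(chars[i:j]) in FREQ]
        dag[i] = ends or [i + 1]  # unknown character: fall back to itself
    return dag

def max_prob_segmentation(chars):
    """Dynamic programming over the DAG: route[i] stores the best
    (log-probability, end-index) pair for the suffix starting at i."""
    dag = build_dag(chars)
    n = len(chars)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get("".join(chars[i:j]), 1) / TOTAL) + route[j][0], j)
            for j in dag[i]
        )
    # Walk the maximum probability path to recover the segmentation.
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append("".join(chars[i:j]))
        i = j
    return words

print(max_prob_segmentation(list("ilikeapples")))  # ['i', 'like', 'apples']
```

Note the design choice mirrored from the claim: candidate splits come only from lexicon matches, so the graph stays acyclic and the best path can be found in one backward pass.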
2. The method according to claim 1, wherein querying each sentence in the training data for each word in the word segmentation lexicon and acquiring the word frequency of the word and the binary relation word corresponding to the word comprises:
querying each sentence in the training data for each word in the word segmentation lexicon to obtain the word frequency of the word; and
acquiring the word following the word in each sentence, and determining the word following the word in each sentence as the binary relation word corresponding to the word.
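The counting step in claim 2 can be sketched as follows (an illustrative toy, not the claimed implementation): for each lexicon word found in a sentence, increment its frequency and record the immediately following word as its binary relation word. The sample sentences and the helper name are assumptions.

```python
from collections import defaultdict

def binary_relation_words(sentences, lexicon):
    """Count each lexicon word's frequency and record the word that
    follows it in each sentence as its binary relation word."""
    freq = defaultdict(int)
    relations = defaultdict(set)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            if word in lexicon:
                freq[word] += 1
                if i + 1 < len(sentence):
                    relations[word].add(sentence[i + 1])
    return freq, relations

sentences = [["good", "morning"], ["good", "night"]]
freq, rel = binary_relation_words(sentences, {"good"})
print(freq["good"], sorted(rel["good"]))  # 2 ['morning', 'night']
```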
3. The method according to claim 1, wherein determining the maximum probability path in the directed acyclic graph according to the word frequency of each word in the directed acyclic graph, and determining the word segmentation result corresponding to the sentence according to the maximum probability path, comprises:
determining the maximum probability path in the directed acyclic graph according to the word frequency of each word in the directed acyclic graph;
judging whether a probability corresponding to the maximum probability path is greater than or equal to a preset probability threshold;
if the probability corresponding to the maximum probability path is greater than or equal to the preset probability threshold, traversing a user dictionary according to a segmentation result corresponding to the maximum probability path, and judging whether a plurality of consecutive words matching a word in the user dictionary exist in the segmentation result; and
if the plurality of consecutive words exist in the segmentation result, merging the plurality of consecutive words to obtain the word segmentation result corresponding to the sentence.
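The user-dictionary merge in claim 3 can be sketched as follows (illustrative only): consecutive segments whose concatenation appears in the user dictionary are merged into one word, longest run first. The dictionary contents and function name are assumptions.

```python
# Toy user dictionary (illustrative assumption).
USER_DICT = {"newyork"}

def merge_with_user_dict(segments, user_dict=USER_DICT):
    """Merge runs of consecutive segments that jointly match a
    user-dictionary word; leave everything else unchanged."""
    merged, i = [], 0
    while i < len(segments):
        matched = False
        # Try the longest run of consecutive segments first (j > i + 1
        # so only multi-segment runs are merged).
        for j in range(len(segments), i + 1, -1):
            candidate = "".join(segments[i:j])
            if candidate in user_dict:
                merged.append(candidate)
                i = j
                matched = True
                break
        if not matched:
            merged.append(segments[i])
            i += 1
    return merged

print(merge_with_user_dict(["new", "york", "city"]))  # ['newyork', 'city']
```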
4. The method of claim 3, wherein traversing the user dictionary according to the segmentation result corresponding to the maximum probability path, and judging whether a plurality of consecutive words matching a word in the user dictionary exist in the segmentation result, further comprises:
acquiring a proportion of single characters in the segmentation result corresponding to the maximum probability path;
judging whether the proportion of single characters is greater than or equal to a preset proportion threshold;
if the proportion of single characters is greater than or equal to the preset proportion threshold, judging whether the plurality of consecutive words matching a word in the user dictionary exist in the segmentation result; and
if the proportion of single characters is smaller than the preset proportion threshold, determining the segmentation result corresponding to the maximum probability path as the word segmentation result.
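The single-character-ratio gate in claim 4 is a small computation; the sketch below shows it under the assumption of an invented threshold value. The check only triggers the user-dictionary pass when the segmentation looks overly fragmented.

```python
# Illustrative threshold (the patent leaves it as a preset value).
RATIO_THRESHOLD = 0.5

def single_char_ratio(segments):
    """Proportion of single-character segments in a segmentation result."""
    return sum(1 for s in segments if len(s) == 1) / len(segments)

def needs_user_dict_check(segments, threshold=RATIO_THRESHOLD):
    """True when the segmentation is fragmented enough to warrant the
    user-dictionary traversal of claim 3."""
    return single_char_ratio(segments) >= threshold

print(needs_user_dict_check(["a", "b", "word"]))   # True  (2/3 >= 0.5)
print(needs_user_dict_check(["hello", "world"]))   # False (0.0 < 0.5)
```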
5. The method according to claim 3, wherein determining the maximum probability path in the directed acyclic graph according to the word frequency of each word in the directed acyclic graph, and determining the word segmentation result corresponding to the sentence according to the maximum probability path, further comprises:
if the probability corresponding to the maximum probability path is smaller than the preset probability threshold, inputting the sentence into a trained statistical language model to obtain the word segmentation result corresponding to the sentence.
6. The method according to claim 1, wherein generating the N-gram model in the input method application according to the word segmentation result corresponding to each sentence in the training data comprises:
for each segmented word in the word segmentation result corresponding to each sentence, acquiring a word frequency of the segmented word in the word segmentation results;
querying the word segmentation result corresponding to each sentence according to the segmented word, and acquiring a binary relation word corresponding to the segmented word;
acquiring a frequency with which the segmented word and the corresponding binary relation word occur together in the word segmentation results;
merging the segmented word and the corresponding binary relation word into a single segmented word when the frequency is greater than a preset frequency threshold; and
generating the N-gram model according to the word frequency of each segmented word and the word frequency of the corresponding binary relation word.
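The statistics-gathering and merging steps of claim 6 can be sketched as bigram counting over segmented sentences (illustrative only; the corpus, threshold, and function names are assumptions): word pairs that co-occur often enough are fused into a single segmented word before the model counts are finalized.

```python
from collections import Counter

# Illustrative merge threshold (the patent leaves it as a preset value).
FREQ_THRESHOLD = 2

def build_bigram_model(segmented_sentences):
    """Count word frequencies and adjacent-pair (bigram) frequencies
    over all segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in segmented_sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams

def merge_frequent_bigrams(sentence, bigrams, threshold=FREQ_THRESHOLD):
    """Fuse a word and its binary relation word into one segmented word
    when their co-occurrence frequency exceeds the threshold."""
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and bigrams[(sentence[i], sentence[i + 1])] > threshold:
            out.append(sentence[i] + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out

corpus = [["ice", "cream", "melts"], ["ice", "cream", "van"], ["ice", "cream", "cone"]]
unigrams, bigrams = build_bigram_model(corpus)
print(merge_frequent_bigrams(corpus[0], bigrams))  # ['icecream', 'melts']
```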
7. The method according to claim 1, wherein after generating the N-gram model in the input method application according to the word segmentation result corresponding to each sentence in the training data, the method further comprises:
acquiring pinyin input by a user;
inputting the pinyin into the N-gram model, and acquiring words corresponding to the pinyin and an occurrence probability of each word; and
generating a candidate word list according to the occurrence probability of each word, so that the user selects a word from the candidate word list for input.
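The candidate-list generation of claim 7 reduces to ranking the words a pinyin string can map to by their model probability. The sketch below uses a hand-written pinyin table and invented probabilities purely for illustration; a real system would draw both from the trained N-gram model.

```python
# Toy pinyin-to-word table with invented probabilities (assumption; in the
# patent these probabilities come from the N-gram model).
PINYIN_TO_WORDS = {"shi": {"是": 0.6, "时": 0.3, "市": 0.1}}

def candidate_list(pinyin, table=PINYIN_TO_WORDS):
    """Return candidate words for the pinyin, ranked by descending
    occurrence probability."""
    words = table.get(pinyin, {})
    return [w for w, _ in sorted(words.items(), key=lambda kv: kv[1], reverse=True)]

print(candidate_list("shi"))  # ['是', '时', '市']
```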
8. An input method model generation device, comprising:
the acquisition module is configured to acquire training data and a word segmentation lexicon, wherein the word segmentation lexicon comprises: words exceeding a preset number threshold; the words comprise: words related to input method scenarios;
the query module is configured to query each sentence in the training data for each word in the word segmentation lexicon, and to acquire a word frequency of the word and a binary relation word corresponding to the word;
the generating module is configured to generate a prefix tree according to each word in the word segmentation lexicon and the corresponding binary relation word, wherein nodes in the prefix tree are words or binary relation words, and the prefix tree comprises a plurality of root nodes which are respectively the first words of the sentences in the training data;
the segmentation module is configured to segment each sentence in the training data with the prefix tree to obtain at least one segmentation result, and to generate a directed acyclic graph corresponding to the sentence according to the at least one segmentation result;
the determining module is configured to determine a maximum probability path in the directed acyclic graph according to the word frequency of each word in the directed acyclic graph, and to determine a word segmentation result corresponding to the sentence according to the maximum probability path;
the generating module is further configured to generate an N-gram model in an input method application according to the word segmentation result corresponding to each sentence in the training data;
wherein segmenting each sentence in the training data with the prefix tree to obtain at least one segmentation result comprises:
for each sentence, comparing the sentence with each root node in the prefix tree one by one, determining a root node matching the sentence, and determining the word of the matched root node as a first word in the sentence; then comparing the content of the sentence following the first word with the child nodes of the matched root node to determine a second word in the sentence; and repeating these operations in sequence to obtain the at least one segmentation result corresponding to the sentence.
9. The apparatus according to claim 8, wherein the query module is specifically configured to:
query each sentence in the training data for each word in the word segmentation lexicon to obtain the word frequency of the word; and
acquire the word following the word in each sentence, and determine the word following the word in each sentence as the binary relation word corresponding to the word.
10. The apparatus according to claim 8, wherein the determining module is configured to:
determine the maximum probability path in the directed acyclic graph according to the word frequency of each word in the directed acyclic graph;
judge whether a probability corresponding to the maximum probability path is greater than or equal to a preset probability threshold;
if the probability corresponding to the maximum probability path is greater than or equal to the preset probability threshold, traverse a user dictionary according to a segmentation result corresponding to the maximum probability path, and judge whether a plurality of consecutive words matching a word in the user dictionary exist in the segmentation result; and
if the plurality of consecutive words exist in the segmentation result, merge the plurality of consecutive words to obtain the word segmentation result corresponding to the sentence.
11. The apparatus according to claim 10, wherein the determining module is further configured to:
acquire a proportion of single characters in the segmentation result corresponding to the maximum probability path;
judge whether the proportion of single characters is greater than or equal to a preset proportion threshold;
if the proportion of single characters is greater than or equal to the preset proportion threshold, judge whether the plurality of consecutive words matching a word in the user dictionary exist in the segmentation result; and
if the proportion of single characters is smaller than the preset proportion threshold, determine the segmentation result corresponding to the maximum probability path as the word segmentation result.
12. The apparatus according to claim 10, wherein the determining module is further configured to:
if the probability corresponding to the maximum probability path is smaller than the preset probability threshold, input the sentence into a trained statistical language model to obtain the word segmentation result corresponding to the sentence.
13. The apparatus according to claim 8, wherein the generating module is specifically configured to:
for each segmented word in the word segmentation result corresponding to each sentence, acquire a word frequency of the segmented word in the word segmentation results;
query the word segmentation result corresponding to each sentence according to the segmented word, and acquire a binary relation word corresponding to the segmented word;
acquire a frequency with which the segmented word and the corresponding binary relation word occur together in the word segmentation results;
merge the segmented word and the corresponding binary relation word into a single segmented word when the frequency is greater than a preset frequency threshold; and
generate the N-gram model according to the word frequency of each segmented word and the word frequency of the corresponding binary relation word.
14. The apparatus according to claim 8, wherein:
the acquisition module is further configured to acquire pinyin input by a user;
the acquisition module is further configured to input the pinyin into the N-gram model, and to acquire words corresponding to the pinyin and an occurrence probability of each word; and
the generating module is further configured to generate a candidate word list according to the occurrence probability of each word, so that a user selects a word from the candidate word list for input.
15. An input method model generation device, comprising:
a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the input method model generation method according to any one of claims 1 to 7.
16. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the input method model generation method according to any one of claims 1 to 7.
CN201811620636.0A 2018-12-28 2018-12-28 Input method model generation method and device Active CN109710087B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811620636.0A CN109710087B (en) 2018-12-28 2018-12-28 Input method model generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811620636.0A CN109710087B (en) 2018-12-28 2018-12-28 Input method model generation method and device

Publications (2)

Publication Number Publication Date
CN109710087A (en) 2019-05-03
CN109710087B (en) 2023-01-13

Family

ID=66259070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811620636.0A Active CN109710087B (en) 2018-12-28 2018-12-28 Input method model generation method and device

Country Status (1)

Country Link
CN (1) CN109710087B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110879831A (en) * 2019-10-12 2020-03-13 杭州师范大学 Chinese medicine sentence word segmentation method based on entity recognition technology
CN110750993A (en) * 2019-10-15 2020-02-04 成都数联铭品科技有限公司 Word segmentation method, word segmentation device, named entity identification method and system
CN110765773A (en) * 2019-10-31 2020-02-07 北京金堤科技有限公司 Address data acquisition method and device
CN111090338B (en) * 2019-12-11 2021-08-27 心医国际数字医疗系统(大连)有限公司 Training method of HMM (hidden Markov model) input method model of medical document, input method model and input method
CN111241240B (en) * 2020-01-08 2023-08-15 中国联合网络通信集团有限公司 Industry keyword extraction method and device
CN111460790B (en) * 2020-03-30 2023-07-04 中国测绘科学研究院 Method and device for determining English place names and names, translation equipment and storage medium
CN112765963A (en) * 2020-12-31 2021-05-07 北京锐安科技有限公司 Sentence segmentation method and device, computer equipment and storage medium

Citations (6)

Publication number Priority date Publication date Assignee Title
CN101727271A (en) * 2008-10-22 2010-06-09 北京搜狗科技发展有限公司 Method and device for providing error correcting prompt and input method system
CN102135814A (en) * 2011-03-30 2011-07-27 北京搜狗科技发展有限公司 Word input method and system
CN102929902A (en) * 2012-07-05 2013-02-13 江苏新瑞峰信息科技有限公司 Character splitting method and device based on Chinese retrieval
CN105912528A (en) * 2016-04-18 2016-08-31 深圳大学 Question classification method and system
CN106776544A (en) * 2016-11-24 2017-05-31 四川无声信息技术有限公司 Character relation recognition methods and device and segmenting method
CN107451282A (en) * 2017-08-09 2017-12-08 南京审计大学 A kind of multi-source data polymerization Sampling Strategies under the environment based on big data

Family Cites Families (10)

Publication number Priority date Publication date Assignee Title
US5963893A (en) * 1996-06-28 1999-10-05 Microsoft Corporation Identification of words in Japanese text by a computer system
US10037319B2 (en) * 2010-09-29 2018-07-31 Touchtype Limited User input prediction
CN102902362B (en) * 2011-07-25 2017-10-31 深圳市世纪光速信息技术有限公司 Character input method and system
CN103870000B (en) * 2012-12-11 2018-12-14 百度国际科技(深圳)有限公司 The method and device that candidate item caused by a kind of pair of input method is ranked up
US10742577B2 (en) * 2013-03-15 2020-08-11 Disney Enterprises, Inc. Real-time search and validation of phrases using linguistic phrase components
CN106933799A (en) * 2015-12-31 2017-07-07 北京四维图新科技股份有限公司 A kind of Chinese word cutting method and device of point of interest POI titles
CN106156004B (en) * 2016-07-04 2019-03-26 中国传媒大学 The sentiment analysis system and method for film comment information based on term vector
CN108415898B (en) * 2018-01-19 2021-09-24 思必驰科技股份有限公司 Word graph re-scoring method and system for deep learning language model
CN108776653A (en) * 2018-05-25 2018-11-09 南京大学 A kind of text segmenting method of the judgement document based on PageRank and comentropy
CN108986910B (en) * 2018-07-04 2023-09-05 平安科技(深圳)有限公司 On-line question and answer method, device, computer equipment and storage medium


Also Published As

Publication number Publication date
CN109710087A (en) 2019-05-03

Similar Documents

Publication Publication Date Title
CN109710087B (en) Input method model generation method and device
CN110377716B (en) Interaction method and device for conversation and computer readable storage medium
CN110210029B (en) Method, system, device and medium for correcting error of voice text based on vertical field
CN107195295B (en) Voice recognition method and device based on Chinese-English mixed dictionary
US11152007B2 (en) Method, and device for matching speech with text, and computer-readable storage medium
CN107301860B (en) Voice recognition method and device based on Chinese-English mixed dictionary
KR102222317B1 (en) Speech recognition method, electronic device, and computer storage medium
CN108595410B (en) Automatic correction method and device for handwritten composition
CN110413760B (en) Man-machine conversation method, device, storage medium and computer program product
JP5901001B1 (en) Method and device for acoustic language model training
CN111292740B (en) Speech recognition system and method thereof
US8521511B2 (en) Information extraction in a natural language understanding system
CN108897723B (en) Scene conversation text recognition method and device and terminal
KR102348124B1 (en) Apparatus and method for recommending function of vehicle
CN107526826B (en) Voice search processing method and device and server
CN111739514B (en) Voice recognition method, device, equipment and medium
CN109036471B (en) Voice endpoint detection method and device
CN106843523B (en) Character input method and device based on artificial intelligence
CN111401071A (en) Model training method and device, computer equipment and readable storage medium
US5987409A (en) Method of and apparatus for deriving a plurality of sequences of words from a speech signal
CN111462751A (en) Method, apparatus, computer device and storage medium for decoding voice data
CN112489626A (en) Information identification method and device and storage medium
CN107894979B (en) Compound word processing method, device and equipment for semantic mining
CN113569021B (en) Method for classifying users, computer device and readable storage medium
CN114329112A (en) Content auditing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant