CN113378561A - Word prediction template generation method and device - Google Patents


Info

Publication number
CN113378561A
Authority
CN
China
Prior art keywords
word
words
target
prediction template
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110933954.8A
Other languages
Chinese (zh)
Inventor
崔燕红
余金林
宁超
陈益梦
王昊天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Teddy Bear Mobile Technology Co ltd
Original Assignee
Beijing Teddy Bear Mobile Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Teddy Bear Mobile Technology Co ltd filed Critical Beijing Teddy Bear Mobile Technology Co ltd
Priority to CN202110933954.8A priority Critical patent/CN113378561A/en
Publication of CN113378561A publication Critical patent/CN113378561A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A method and a device for generating a word prediction template are disclosed. The method comprises: acquiring a training corpus; performing word segmentation on the training corpus to obtain a plurality of words; determining feature information of each word; for a target word among the plurality of words, generating a candidate prediction template using the Nth word to its left, the Nth word to its right, and the feature information of the target word, where 1 ≤ N ≤ M, M is the smaller of the maximum numbers of words by which the target word can be extended to the left and to the right in the training corpus, and the initial value of N is 1; extracting words from the training corpus using the candidate prediction template; and, if all the extracted words are the same as the target word, determining the candidate prediction template to be the word prediction template corresponding to the target word.

Description

Word prediction template generation method and device
Technical Field
The application relates to the technical field of natural language processing, in particular to a word prediction template generation method and device.
Background
In the field of natural language processing, keyword prediction and extraction are generally performed using word vector templates. Two generation methods are currently common. The first is manual generation: a technician derives word vector templates from experience, experiments, and summarized rules. Its drawback is that templates cannot be generated automatically or in batches, so efficiency is low. The second trains a neural network model to obtain word vector templates, as in the BERT and ALBERT techniques; it generates templates automatically and in batches and is far more efficient than the manual method. However, by the nature of this approach, even when applied to scenes of regular short texts it still requires massive training corpora, so training consumption is large, and the accuracy of word prediction and extraction with the resulting templates is low.
Disclosure of Invention
To solve these problems, the present invention provides a method and a device for generating a word prediction template for predicting and extracting words from short texts; the method requires no massive training corpus, consumes few resources, and achieves high accuracy.
In order to achieve the above object, in a first aspect, an embodiment of the present invention provides a word prediction template generation method, including:
acquiring a training corpus;
performing word segmentation on the training corpus to obtain a plurality of words;
determining characteristic information of each word;
for a target word among the plurality of words, generating a candidate prediction template using the Nth word to the left of the target word, the Nth word to its right, and the feature information of the target word, where 1 ≤ N ≤ M, M is the smaller of the maximum numbers of words by which the target word can be extended to the left and to the right in the training corpus, and the initial value of N is 1;
extracting words from the training corpus by using a candidate prediction template;
and if all the extracted words are the same as the target word, determining the candidate prediction template to be the word prediction template corresponding to the target word.
Preferably, after performing word extraction in the training corpus using the candidate prediction template, the method further includes: if the extracted words include a word different from the target word and N is less than M, incrementing N by 1 and executing again the step of generating a candidate prediction template using the Nth word to the left and the Nth word to the right of the target word and the feature information of the target word.
Preferably, after performing word extraction in the training corpus using the candidate prediction template, the method further includes: if the extracted words include a word different from the target word and N is equal to M, determining the proportion of words different from the target word among the extracted words, and if the proportion is not greater than a preset proportion threshold, determining the candidate prediction template to be the word prediction template corresponding to the target word.
Preferably, before generating, for a target word in the plurality of words, a candidate prediction template using an nth word on the left, an nth word on the right, and feature information of the target word, the method further includes: and replacing the generalizable words in the plurality of words by using identification values according to categories, and recording regular expressions of the words corresponding to the identification values.
Preferably, the method further comprises: and fusing the plurality of word prediction templates to generate a combined word prediction template.
In a second aspect, an embodiment of the present invention provides a word prediction template generation apparatus, including:
the acquisition unit is used for acquiring the training corpus;
the word segmentation unit is used for segmenting the training corpus to obtain a plurality of words;
the first determining unit is used for determining the characteristic information of each word;
a generating unit, configured to generate, for a target word among the multiple words, a candidate prediction template using the Nth word to the left of the target word, the Nth word to its right, and the feature information of the target word, where 1 ≤ N ≤ M, M is the smaller of the maximum numbers of words by which the target word can be extended to the left and to the right in the training corpus, and the initial value of N is 1;
the extraction unit is used for extracting words from the training corpus by using the candidate prediction template;
and the second determining unit is used for determining the candidate prediction template as the word prediction template corresponding to the target word if the extracted words are the same as the target word.
Preferably, the generating unit is further configured to, if the extracted words include a word different from the target word and N is less than M, increment N by 1 and generate a candidate prediction template using the Nth word to the left and the Nth word to the right of the target word and the feature information of the target word.
Preferably, the apparatus further comprises: the judging unit is used for judging the proportion of the words which are different from the target words in the extracted words if the extracted words comprise the words which are different from the target words and N is equal to M; the second determining unit is further configured to determine that the candidate prediction template is a word prediction template corresponding to the target word if the ratio is not greater than a preset ratio threshold.
Preferably, the apparatus further comprises: and the generalization unit is used for replacing the generalizable words in the plurality of words by using the identification values according to the categories and recording the regular expressions of the words corresponding to the identification values.
Preferably, the apparatus further comprises: and the fusion unit is used for fusing the plurality of word prediction templates to generate a combined word prediction template.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, where the storage medium stores a computer program for executing the word prediction template generation method according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides an electronic device, including:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the word prediction template generation method according to the first aspect.
With the word prediction template generation method and device provided by the invention, the training corpus is segmented to obtain a plurality of words; the feature information of each word is determined; for a target word among the plurality of words, a candidate prediction template is generated using the Nth word to its left, the Nth word to its right, and the feature information of the target word; words are then extracted from the training corpus using the candidate prediction template; and if all the extracted words are the same as the target word, the candidate prediction template is determined to be the word prediction template corresponding to the target word. With this method, word prediction and extraction for short texts require no massive training corpus, so consumption is low; and because the word prediction template incorporates the feature information of words, using it can effectively improve the accuracy of word prediction and extraction.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 is a flowchart illustrating a method for generating a word prediction template according to an exemplary embodiment of the present disclosure;
fig. 2 is a block diagram of a word prediction template generating apparatus according to an exemplary embodiment of the present application;
fig. 3 is a block diagram of another word prediction template generation apparatus according to an exemplary embodiment of the present application;
FIG. 4 is a block diagram of another word prediction template generation apparatus provided in an exemplary embodiment of the present application;
fig. 5 is a block diagram of still another word prediction template generation apparatus according to an exemplary embodiment of the present application;
fig. 6 is a block diagram of an electronic device according to an exemplary embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Fig. 1 is a flowchart illustrating a word prediction template generation method according to an embodiment of the present application. The word prediction template generation method can be applied to electronic equipment. As shown in fig. 1, the method for generating a word prediction template provided in this embodiment includes:
step 101, obtaining a corpus.
In one example, the corpus may be determined according to a scene to which the word prediction template needs to be applied, and the corpus that is strongly related to the scene is selected to reduce the usage amount of the corpus and improve the accuracy of the word prediction template.
And 102, segmenting the training corpus to obtain a plurality of words.
In one example, the present invention does not limit the word segmentation technique used; examples include string-matching-based segmentation, word-sense-based segmentation, and statistical segmentation.
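As an illustration only, a string-matching segmenter can be sketched as greedy forward maximum matching; the function name, dictionary, and input below are invented for this example and are not taken from the patent:

```python
def max_match_segment(text, dictionary, max_len=10):
    """Greedy forward maximum matching: at each position take the longest
    dictionary entry that matches, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical toy dictionary; a real segmenter would use a full lexicon.
dictionary = {"word", "prediction", "template"}
print(max_match_segment("wordpredictiontemplate", dictionary))
# → ['word', 'prediction', 'template']
```

Real segmenters also handle ambiguity and out-of-vocabulary words; this sketch only shows the matching principle.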
And 103, determining the characteristic information of each word.
In one example, the feature information of a word may be its part of speech, such as noun, verb, or preposition; its topic, such as sports, entertainment, or literature; or a proper-noun category, such as time, place, or person name.
It should be noted that, for the same template set, the feature information of the word is of the same category. Specifically, each type of feature information corresponds to a recognition rule, and before training, the corresponding recognition rule can be preset according to requirements, and the feature information of each word can be determined according to the preset recognition rule.
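For instance, the preset recognition rules might be kept as an ordered table of (category, rule) pairs; the categories and regular expressions below are hypothetical and serve only to illustrate the idea:

```python
import re

# Ordered (category, rule) pairs; the first matching rule labels the word.
# These categories and patterns are invented examples, not the patent's rules.
RECOGNITION_RULES = [
    ("time",   re.compile(r"^\d{1,2}:\d{2}$")),
    ("number", re.compile(r"^\d+$")),
    ("word",   re.compile(r"^[A-Za-z]+$")),
]

def feature_of(word):
    """Return the feature label of the first rule that matches the word."""
    for label, pattern in RECOGNITION_RULES:
        if pattern.match(word):
            return label
    return "unknown"

print([feature_of(w) for w in ["12:30", "42", "hello"]])
# → ['time', 'number', 'word']
```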
And 104, aiming at a target word in the plurality of words, generating a candidate prediction template by using the Nth word on the left side, the Nth word on the right side and the characteristic information of the target word.
Here, 1 ≤ N ≤ M, where M is the smaller of the maximum numbers of words by which the target word can be extended to the left and to the right in the training corpus, and the initial value of N is 1.
Prior to step 104, the method may further comprise:
and replacing generalizable words in the plurality of words by using identification values according to categories, and recording regular expressions of the words corresponding to the identification values. Wherein, the generalizable words may include: a term of art.
In an example, the target word may be any one of a plurality of words obtained by segmenting the training corpus, or may be an identification value of any one of the words after generalization.
And 105, extracting words in the training corpus by using the candidate prediction template.
If all the extracted words are the same as the target word, step 106 is performed. If the extracted words include a word different from the target word and N is less than M, N is incremented by 1 and step 104 is performed again. If the extracted words include a word different from the target word and N is equal to M, step 107 is performed.
And step 106, determining the candidate prediction template as a word prediction template corresponding to the target word.
In one example, the method may further comprise: and fusing the plurality of word prediction templates to generate a combined word prediction template.
And step 107, judging the proportion of the words which are different from the target words in the extracted words.
If the proportion of words different from the target word among the extracted words is not greater than the preset proportion threshold, step 106 is executed. If the proportion is greater than the preset proportion threshold, the process ends and no word prediction template is generated for the target word.
In one example, the proportion threshold may be set to 5%; the higher the required accuracy, the smaller the threshold should be set.
In a specific example, assume the target word is A, M is 4, and the initial value of N is 1. The 1st through Nth words to the left of A are denoted L1, L2, ..., LN; the 1st through Nth words to the right of A are denoted R1, R2, ..., RN; the feature information of A is denoted S; and the candidate prediction templates for N = 1, 2, ... are denoted F1, F2, ..., FN. On this basis, when N is 1, the candidate prediction template F1 corresponding to the target word may be written as [L1, S, R1].
Words are extracted from the training corpus using the candidate prediction template F1. Assume 20 words are extracted: if all 20 are the same as A, F1 is determined to be the word prediction template corresponding to the word A. If the 20 words include a word different from A, N is incremented to 2 and the candidate prediction template F2 is generated, i.e., [L2, L1, S, R1, R2].
Similarly, words are extracted from the training corpus using F2. Assume 15 words are extracted: if all 15 are the same as A, F2 is determined to be the word prediction template corresponding to the word A. If the 15 words include a word different from A, N is incremented to 3 and the candidate prediction template F3 is generated, i.e., [L3, L2, L1, S, R1, R2, R3].
Similarly, words are extracted from the training corpus using F3. Assume 13 words are extracted: if all 13 are the same as A, F3 is determined to be the word prediction template corresponding to the word A. If the 13 words include a word different from A, N is incremented to 4 and the candidate prediction template F4 is generated, i.e., [L4, L3, L2, L1, S, R1, R2, R3, R4].
Similarly, words are extracted from the training corpus using F4. Assume 10 words are extracted: if all 10 are the same as A, F4 is determined to be the word prediction template corresponding to the word A. If the 10 words include a word different from A, then since M is 4 (that is, N = M), it must be judged whether the proportion of words different from A among the 10 exceeds the preset proportion threshold. Assuming the threshold is 10%: if exactly 1 of the 10 extracted words differs from A, the proportion is 10%, which is not greater than the threshold, so F4 is determined to be the word prediction template corresponding to the word A. If more than 1 of the 10 extracted words differ from A, the proportion necessarily exceeds the 10% threshold, and no word prediction template is generated for the word A.
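The iteration traced above can be sketched in code. This is a hedged illustration only: the corpus representation (a list of tokenized sentences), the helper names `extract` and `find_template`, and the omission of the feature-information check inside `extract` are assumptions made for this sketch, not details fixed by the patent.

```python
def extract(template, corpus):
    """Return every word that occurs between the template's left and right
    context in the corpus. The feature check (S) is elided in this sketch."""
    left, _feature, right = template
    hits = []
    for sent in corpus:
        for i in range(len(left), len(sent) - len(right)):
            if tuple(sent[i - len(left):i]) == left and \
               tuple(sent[i + 1:i + 1 + len(right)]) == right:
                hits.append(sent[i])
    return hits

def find_template(corpus, sentence, idx, feature, max_n, threshold=0.05):
    """Grow a candidate template [LN..L1, S, R1..RN] around sentence[idx]
    (steps 104-107). The caller ensures max_n <= idx and
    idx + max_n < len(sentence), matching the definition of M."""
    target = sentence[idx]
    for n in range(1, max_n + 1):
        left = tuple(sentence[idx - n:idx])            # LN ... L1
        right = tuple(sentence[idx + 1:idx + 1 + n])   # R1 ... RN
        template = (left, feature, right)
        extracted = extract(template, corpus)
        wrong = [w for w in extracted if w != target]
        if not wrong:
            return template                            # step 106
        if n == max_n:
            # step 107: accept if the wrong-word proportion is small enough
            if extracted and len(wrong) / len(extracted) <= threshold:
                return template
            return None                                # no template generated
    return None

corpus = [["the", "price", "is", "low"], ["a", "price", "is", "high"]]
print(find_template(corpus, corpus[0], 1, "NOUN", 1))
# → (('the',), 'NOUN', ('is',))
```

Here the N = 1 template already extracts only the target word, so it is accepted at step 106; with a corpus where the same context also surrounds other words, N would grow toward M and the proportion check of step 107 would decide.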
Further, in another example, to make the word prediction template more accurate, the method provided in this embodiment may further include: acquiring a new training corpus, taking the generated word prediction template as a candidate prediction template, and performing step 105 to verify and calibrate the generated word prediction template. In this way the word prediction template can be updated without complete retraining, and gaps can be filled continuously during use, so the template becomes increasingly accurate and rich.
With the word prediction template generation method provided by this embodiment of the invention, the training corpus is segmented to obtain a plurality of words; the feature information of each word is determined; for a target word among the plurality of words, a candidate prediction template is generated using the Nth word to its left, the Nth word to its right, and the feature information of the target word; words are then extracted from the training corpus using the candidate prediction template; and if all the extracted words are the same as the target word, the candidate prediction template is determined to be the word prediction template corresponding to the target word. With this method, word prediction and extraction for short texts require no massive training corpus, so consumption is low; and because the word prediction template incorporates the feature information of words, using it can effectively improve the accuracy of word prediction and extraction.
An embodiment of the present invention provides a word prediction template generation apparatus, and fig. 2 is a structural diagram of the word prediction template generation apparatus. The apparatus may include:
an obtaining unit 201, configured to obtain a corpus;
a word segmentation unit 202, configured to perform word segmentation on the training corpus to obtain multiple words;
a first determining unit 203, configured to determine feature information of each word;
a generating unit 204, configured to generate, for a target word among the multiple words, a candidate prediction template using the Nth word to the left of the target word, the Nth word to its right, and the feature information of the target word, where 1 ≤ N ≤ M, M is the smaller of the maximum numbers of words by which the target word can be extended to the left and to the right in the training corpus, and the initial value of N is 1;
an extracting unit 205, configured to perform word extraction in the corpus by using a candidate prediction template;
a second determining unit 206, configured to determine that the candidate prediction template is a word prediction template corresponding to the target word if all the extracted words are the same as the target word.
Preferably, the generating unit 204 is further configured to, if the extracted words include a word different from the target word and N is less than M, increment N by 1 and generate a candidate prediction template using the Nth word to the left and the Nth word to the right of the target word and the feature information of the target word.
Preferably, as shown in fig. 3, the apparatus further comprises: a determining unit 207 configured to determine a ratio of a word different from the target word among the extracted words if the extracted words include a word different from the target word and N is equal to M; the second determining unit 206 is further configured to determine that the candidate prediction template is a word prediction template corresponding to the target word if the ratio is not greater than a preset ratio threshold.
Preferably, as shown in fig. 4, the apparatus further includes: and a generalization unit 208, configured to replace, by category, a generalizable term in the plurality of terms with an identification value, and record a regular expression of the term corresponding to the identification value.
Preferably, as shown in fig. 5, the apparatus further includes: a fusion unit 209 is configured to fuse the plurality of word prediction templates to generate a combined word prediction template.
With the word prediction template generation device provided by the invention, the training corpus is segmented to obtain a plurality of words; the feature information of each word is determined; for a target word among the plurality of words, a candidate prediction template is generated using the Nth word to its left, the Nth word to its right, and the feature information of the target word; words are then extracted from the training corpus using the candidate prediction template; and if all the extracted words are the same as the target word, the candidate prediction template is determined to be the word prediction template corresponding to the target word. With this device, word prediction and extraction for short texts require no massive training corpus, so consumption is low; and because the word prediction template incorporates the feature information of words, using it can effectively improve the accuracy of word prediction and extraction.
Next, an electronic apparatus 11 according to an embodiment of the present application is described with reference to fig. 6.
As shown in fig. 6, the electronic device 11 includes one or more processors 111 and memory 112.
The processor 111 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 11 to perform desired functions.
Memory 112 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by processor 111 to implement the word prediction template generation methods of the various embodiments of the present application described above and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 11 may further include: an input device 113 and an output device 114, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input device 113 may include, for example, a keyboard, a mouse, and the like.
The output device 114 may output various information including the determined distance information, direction information, and the like to the outside. The output devices 114 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, among others.
Of course, for the sake of simplicity, only some of the components of the electronic device 11 relevant to the present application are shown in fig. 6, and components such as buses, input/output interfaces, and the like are omitted. In addition, the electronic device 11 may include any other suitable components, depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the methods and apparatus described above, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the word prediction template generation method according to various embodiments of the present application described in the "exemplary methods" section above of this specification.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in a word prediction template generation method according to various embodiments of the present application described in the "exemplary methods" section above of this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words meaning "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the term "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to."
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (12)

1. A method for generating a word prediction template, the method comprising:
acquiring a training corpus;
performing word segmentation on the training corpus to obtain a plurality of words;
determining characteristic information of each word;
for a target word in the plurality of words, generating a candidate prediction template by using the Nth word on the left side of the target word, the Nth word on the right side of the target word, and the characteristic information of the target word, wherein N is greater than or equal to 1 and less than or equal to M, M is the smaller of the maximum numbers of words to which the target word can be extended on the left side and on the right side in the training corpus, and an initial value of N is 1;
extracting words from the training corpus by using a candidate prediction template;
and if the extracted words are the same as the target words, determining the candidate prediction template as the word prediction template corresponding to the target words.
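The steps of claim 1 can be sketched roughly as follows. This is an illustrative reconstruction, not the patented implementation: the whitespace segmenter stands in for a real word segmenter (a Chinese corpus would need a proper tokenizer), the tuple template format and all function names are assumptions, and the characteristic information of the target word (e.g. its part of speech) is omitted for brevity.

```python
def segment(corpus):
    # Stand-in for a real word segmenter; whitespace splitting is an
    # assumption made for this sketch.
    return corpus.split()

def build_template(words, idx, n):
    # Candidate template: the Nth word to the left of the target at
    # position idx and the Nth word to the right, with n-1 wildcard
    # positions between each anchor and the target slot.
    return (words[idx - n], n, words[idx + n])

def extract(words, template):
    # Extract every word whose Nth-left and Nth-right neighbours
    # match the template's anchor words.
    left, n, right = template
    return [words[i] for i in range(n, len(words) - n)
            if words[i - n] == left and words[i + n] == right]
```

If every word the candidate extracts equals the target, the candidate is kept as the target word's word prediction template.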
2. The method according to claim 1, wherein after said extracting words from said corpus using said candidate predictive template, said method further comprises:
and if the extracted words comprise a word different from the target word and N is less than M, incrementing N by 1 and performing again the step of generating a candidate prediction template by using the Nth word on the left side of the target word, the Nth word on the right side of the target word, and the characteristic information of the target word.
3. The method according to claim 1, wherein after said extracting words from said corpus using said candidate predictive template, said method further comprises:
if the extracted words comprise a word different from the target word and N is equal to M, determining the proportion of words different from the target word among the extracted words, and if the proportion is not greater than a preset proportion threshold, determining the candidate prediction template as the word prediction template corresponding to the target word.
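The iterative widening described in claims 2 and 3 can be sketched as a single self-contained loop. The function name, the tuple template format, and the 0.2 threshold are illustrative assumptions, not values taken from the patent.

```python
def find_template(words, target, max_bad_ratio=0.2):
    # Start from the nearest neighbours (N=1) and widen the window
    # until the candidate template extracts only the target word, or,
    # at the widest window N=M, until the share of non-target
    # extractions stays under the threshold.
    idx = words.index(target)
    m = min(idx, len(words) - 1 - idx)  # smaller of left/right room
    for n in range(1, m + 1):
        left, right = words[idx - n], words[idx + n]
        hits = [words[i] for i in range(n, len(words) - n)
                if words[i - n] == left and words[i + n] == right]
        bad = sum(1 for h in hits if h != target)
        if bad == 0:
            return (left, n, right)               # claim 1/2 case
        if n == m and bad / len(hits) <= max_bad_ratio:
            return (left, n, right)               # claim 3 case
    return None
```

For example, in the corpus `a b X c a b Y c` the N=1 anchors `(b, c)` also extract `Y`, so the window widens to N=2, where the anchors `(a, a)` extract only `X`.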
4. The method of claim 1, wherein before generating, for a target word of the plurality of words, a candidate prediction template using an nth word to the left, an nth word to the right of the target word, and feature information of the target word, the method further comprises:
and replacing the generalizable words in the plurality of words by using identification values according to categories, and recording regular expressions of the words corresponding to the identification values.
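The generalization step of claim 4 can be sketched with regular expressions. The category names and the patterns below are illustrative assumptions; the patent does not specify which categories are generalizable.

```python
import re

# Hypothetical category patterns for generalizable words.
CATEGORIES = {
    "<NUM>": re.compile(r"\d+(\.\d+)?"),
    "<DATE>": re.compile(r"\d{4}-\d{2}-\d{2}"),
}

def generalize(words):
    # Replace each generalizable word with a per-category
    # identification value and record the regular expression that
    # each identification value stands for.
    out, id_to_regex = [], {}
    for w in words:
        for ident, pat in CATEGORIES.items():
            if pat.fullmatch(w):
                out.append(ident)
                id_to_regex[ident] = pat.pattern
                break
        else:
            out.append(w)
    return out, id_to_regex
```

Templates built over the generalized sequence then match any word of the category, not only the literal word seen in the training corpus.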
5. The method of claim 1, further comprising:
and fusing the plurality of word prediction templates to generate a combined word prediction template.
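Claim 5 does not specify a fusion algorithm, so the sketch below assumes the simplest reading: a combined template that matches wherever any member template matches. Both function names and the `(left, n, right)` template format are illustrative assumptions.

```python
def fuse(templates):
    # Naive fusion: the combined word prediction template is the
    # collection of member templates, applied as alternatives.
    return tuple(templates)

def combined_extract(words, fused):
    # A position is extracted if any member template's Nth-left and
    # Nth-right anchors match around it.
    hits = []
    for i in range(len(words)):
        for left, n, right in fused:
            if 0 <= i - n and i + n < len(words) \
                    and words[i - n] == left and words[i + n] == right:
                hits.append(words[i])
                break
    return hits
```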
6. An apparatus for generating a word prediction template, the apparatus comprising:
the acquisition unit is used for acquiring the training corpus;
the word segmentation unit is used for segmenting the training corpus to obtain a plurality of words;
the first determining unit is used for determining the characteristic information of each word;
a generating unit, configured to generate, for a target word in the plurality of words, a candidate prediction template using the Nth word on the left side of the target word, the Nth word on the right side of the target word, and the characteristic information of the target word, wherein N is greater than or equal to 1 and less than or equal to M, M is the smaller of the maximum numbers of words to which the target word can be extended on the left side and on the right side in the training corpus, and an initial value of N is 1;
the extraction unit is used for extracting words from the training corpus by using the candidate prediction template;
and the second determining unit is used for determining the candidate prediction template as the word prediction template corresponding to the target word if the extracted words are the same as the target word.
7. The apparatus of claim 6, wherein the generating unit is further configured to, if the extracted words comprise a word different from the target word and N is less than M, increment N by 1 and generate a candidate prediction template again using the Nth word on the left side of the target word, the Nth word on the right side of the target word, and the characteristic information of the target word.
8. The apparatus of claim 6, further comprising:
the judging unit is used for judging the proportion of the words which are different from the target words in the extracted words if the extracted words comprise the words which are different from the target words and N is equal to M;
the second determining unit is further configured to determine that the candidate prediction template is a word prediction template corresponding to the target word if the ratio is not greater than a preset ratio threshold.
9. The apparatus of claim 6, further comprising:
and the generalization unit is used for replacing the generalizable words in the plurality of words by using the identification values according to the categories and recording the regular expressions of the words corresponding to the identification values.
10. The apparatus of claim 6, further comprising:
and the fusion unit is used for fusing the plurality of word prediction templates to generate a combined word prediction template.
11. A computer-readable storage medium storing a computer program for executing the word prediction template generating method according to any one of claims 1 to 5.
12. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the word prediction template generation method according to any one of claims 1 to 5.
CN202110933954.8A 2021-08-16 2021-08-16 Word prediction template generation method and device Pending CN113378561A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110933954.8A CN113378561A (en) 2021-08-16 2021-08-16 Word prediction template generation method and device


Publications (1)

Publication Number Publication Date
CN113378561A true CN113378561A (en) 2021-09-10

Family

ID=77577170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110933954.8A Pending CN113378561A (en) 2021-08-16 2021-08-16 Word prediction template generation method and device

Country Status (1)

Country Link
CN (1) CN113378561A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550200A (en) * 2015-12-02 2016-05-04 北京信息科技大学 Chinese segmentation method oriented to patent abstract
CN105938495A (en) * 2016-04-29 2016-09-14 乐视控股(北京)有限公司 Entity relationship recognition method and apparatus
CN109740159A (en) * 2018-12-29 2019-05-10 北京泰迪熊移动科技有限公司 For naming the processing method and processing device of Entity recognition
WO2020207179A1 (en) * 2019-04-09 2020-10-15 山东科技大学 Method for extracting concept word from video caption
CN112966106A (en) * 2021-03-05 2021-06-15 平安科技(深圳)有限公司 Text emotion recognition method, device and equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: East of 1st floor, No.36 Haidian Street, Haidian District, Beijing, 100000

Applicant after: Beijing Teddy Bear Mobile Technology Co.,Ltd.

Address before: 100085 07a36, block D, 7 / F, No.28, information road, Haidian District, Beijing

Applicant before: BEIJING TEDDY BEAR MOBILE TECHNOLOGY Co.,Ltd.