CN113378561A - Word prediction template generation method and device - Google Patents


Info

Publication number
CN113378561A
Authority
CN
China
Prior art keywords
word
words
target
prediction template
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110933954.8A
Other languages
Chinese (zh)
Inventor
崔燕红
余金林
宁超
陈益梦
王昊天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Teddy Bear Mobile Technology Co ltd
Original Assignee
Beijing Teddy Bear Mobile Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Teddy Bear Mobile Technology Co ltd filed Critical Beijing Teddy Bear Mobile Technology Co ltd
Priority to CN202110933954.8A priority Critical patent/CN113378561A/en
Publication of CN113378561A publication Critical patent/CN113378561A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A method and a device for generating a word prediction template are disclosed. The method comprises: acquiring a training corpus; performing word segmentation on the training corpus to obtain a plurality of words; determining feature information of each word; for a target word among the plurality of words, generating a candidate prediction template using the Nth word to its left, the Nth word to its right, and the feature information of the target word, where 1 ≤ N ≤ M, M is the smaller of the maximum numbers of words by which the target word can be extended to the left and to the right in the training corpus, and the initial value of N is 1; extracting words from the training corpus using the candidate prediction template; and, if all the extracted words are the same as the target word, determining the candidate prediction template to be the word prediction template corresponding to the target word.

Description

Word prediction template generation method and device
Technical Field
The application relates to the technical field of natural language processing, in particular to a word prediction template generation method and device.
Background
In the field of natural language processing, keyword prediction and extraction are generally performed using word vector templates. Two generation methods are currently common. The first is manual generation: a technician derives word vector templates from experience, experiments, and summarized rules. Its drawback is that templates cannot be generated automatically or in batches, so efficiency is low. The second trains a neural network model to obtain word vector templates, as in the BERT and ALBERT techniques; it generates templates automatically and in batches and is far more efficient than the manual method. However, by the nature of this approach, even when applied to scenes of regular short texts it still requires massive training corpora, so training consumption is large, and the accuracy of word prediction and extraction with the resulting templates is low.
Disclosure of Invention
To solve these problems, the present invention provides a method and a device for generating a word prediction template for predicting and extracting words from short texts; the method requires no massive training corpus, consumes few resources, and achieves high accuracy.
In order to achieve the above object, in a first aspect, an embodiment of the present invention provides a word prediction template generation method, including:
acquiring a training corpus;
performing word segmentation on the training corpus to obtain a plurality of words;
determining characteristic information of each word;
for a target word among the plurality of words, generating a candidate prediction template using the Nth word to the left of the target word, the Nth word to its right, and the feature information of the target word, where 1 ≤ N ≤ M, M is the smaller of the maximum numbers of words by which the target word can be extended to the left and to the right in the training corpus, and the initial value of N is 1;
extracting words from the training corpus by using a candidate prediction template;
and if all the extracted words are the same as the target word, determining the candidate prediction template to be the word prediction template corresponding to the target word.
Preferably, after performing word extraction in the training corpus using the candidate prediction template, the method further includes: if the extracted words include a word different from the target word and N is less than M, incrementing N by 1 and executing again the step of generating a candidate prediction template using the Nth word to the left and the Nth word to the right of the target word and the feature information of the target word.
Preferably, after performing word extraction in the training corpus using the candidate prediction template, the method further includes: if the extracted words include a word different from the target word and N is equal to M, determining the proportion of words different from the target word among the extracted words, and if the proportion is not greater than a preset proportion threshold, determining the candidate prediction template to be the word prediction template corresponding to the target word.
Preferably, before generating, for a target word in the plurality of words, a candidate prediction template using an nth word on the left, an nth word on the right, and feature information of the target word, the method further includes: and replacing the generalizable words in the plurality of words by using identification values according to categories, and recording regular expressions of the words corresponding to the identification values.
Preferably, the method further comprises: and fusing the plurality of word prediction templates to generate a combined word prediction template.
In a second aspect, an embodiment of the present invention provides a word prediction template generation apparatus, including:
the acquisition unit is used for acquiring the training corpus;
the word segmentation unit is used for segmenting the training corpus to obtain a plurality of words;
the first determining unit is used for determining the characteristic information of each word;
a generating unit, configured to generate, for a target word among the multiple words, a candidate prediction template using the Nth word to the left of the target word, the Nth word to its right, and the feature information of the target word, where 1 ≤ N ≤ M, M is the smaller of the maximum numbers of words by which the target word can be extended to the left and to the right in the training corpus, and the initial value of N is 1;
the extraction unit is used for extracting words from the training corpus by using the candidate prediction template;
and the second determining unit is used for determining the candidate prediction template as the word prediction template corresponding to the target word if the extracted words are the same as the target word.
Preferably, the generating unit is further configured to, if the extracted words include a word different from the target word and N is less than M, increment N by 1 and generate a candidate prediction template using the Nth word to the left and the Nth word to the right of the target word and the feature information of the target word.
Preferably, the apparatus further comprises: the judging unit is used for judging the proportion of the words which are different from the target words in the extracted words if the extracted words comprise the words which are different from the target words and N is equal to M; the second determining unit is further configured to determine that the candidate prediction template is a word prediction template corresponding to the target word if the ratio is not greater than a preset ratio threshold.
Preferably, the apparatus further comprises: and the generalization unit is used for replacing the generalizable words in the plurality of words by using the identification values according to the categories and recording the regular expressions of the words corresponding to the identification values.
Preferably, the apparatus further comprises: and the fusion unit is used for fusing the plurality of word prediction templates to generate a combined word prediction template.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, where the storage medium stores a computer program for executing the word prediction template generation method according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides an electronic device, including:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the word prediction template generation method according to the first aspect.
With the word prediction template generation method and device provided by the invention, the training corpus is segmented to obtain a plurality of words; the feature information of each word is determined; for a target word among the plurality of words, a candidate prediction template is generated using the Nth word to its left, the Nth word to its right, and the feature information of the target word; words are then extracted from the training corpus using the candidate prediction template; and if all the extracted words are the same as the target word, the candidate prediction template is determined to be the word prediction template corresponding to the target word. With this method, word prediction and extraction for short texts require no massive training corpus, so consumption is low; and because the word prediction template incorporates the feature information of words, using it can effectively improve the accuracy of word prediction and extraction.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 is a flowchart illustrating a method for generating a word prediction template according to an exemplary embodiment of the present disclosure;
fig. 2 is a block diagram of a word prediction template generating apparatus according to an exemplary embodiment of the present application;
fig. 3 is a block diagram of another word prediction template generation apparatus according to an exemplary embodiment of the present application;
FIG. 4 is a block diagram of another word prediction template generation apparatus provided in an exemplary embodiment of the present application;
fig. 5 is a block diagram of still another word prediction template generation apparatus according to an exemplary embodiment of the present application;
fig. 6 is a block diagram of an electronic device according to an exemplary embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Fig. 1 is a flowchart illustrating a word prediction template generation method according to an embodiment of the present application. The word prediction template generation method can be applied to electronic equipment. As shown in fig. 1, the method for generating a word prediction template provided in this embodiment includes:
step 101, obtaining a corpus.
In one example, the corpus may be determined according to a scene to which the word prediction template needs to be applied, and the corpus that is strongly related to the scene is selected to reduce the usage amount of the corpus and improve the accuracy of the word prediction template.
And 102, segmenting the training corpus to obtain a plurality of words.
In one example, the present invention does not limit the word segmentation technique used; examples include string-matching-based segmentation, word-sense-based segmentation, and statistical segmentation.
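As an illustration only, a string-matching segmenter can be sketched as greedy forward maximum matching; the function name, dictionary, and input below are invented for this example and are not taken from the patent:

```python
def max_match_segment(text, dictionary, max_len=10):
    """Greedy forward maximum matching: at each position take the longest
    dictionary entry that matches, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical toy dictionary; a real segmenter would use a full lexicon.
dictionary = {"word", "prediction", "template"}
print(max_match_segment("wordpredictiontemplate", dictionary))
# → ['word', 'prediction', 'template']
```

Real segmenters also handle ambiguity and out-of-vocabulary words; this sketch only shows the matching principle.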
And 103, determining the characteristic information of each word.
In one example, the feature information of a word may be its part of speech, such as noun, verb, or preposition; its topic, such as sports, entertainment, or literature; or a proper-noun category, such as time, place, or person name.
It should be noted that, for the same template set, the feature information of the word is of the same category. Specifically, each type of feature information corresponds to a recognition rule, and before training, the corresponding recognition rule can be preset according to requirements, and the feature information of each word can be determined according to the preset recognition rule.
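For instance, the preset recognition rules might be kept as an ordered table of (category, rule) pairs; the categories and regular expressions below are hypothetical and serve only to illustrate the idea:

```python
import re

# Ordered (category, rule) pairs; the first matching rule labels the word.
# These categories and patterns are invented examples, not the patent's rules.
RECOGNITION_RULES = [
    ("time",   re.compile(r"^\d{1,2}:\d{2}$")),
    ("number", re.compile(r"^\d+$")),
    ("word",   re.compile(r"^[A-Za-z]+$")),
]

def feature_of(word):
    """Return the feature label of the first rule that matches the word."""
    for label, pattern in RECOGNITION_RULES:
        if pattern.match(word):
            return label
    return "unknown"

print([feature_of(w) for w in ["12:30", "42", "hello"]])
# → ['time', 'number', 'word']
```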
And 104, aiming at a target word in the plurality of words, generating a candidate prediction template by using the Nth word on the left side, the Nth word on the right side and the characteristic information of the target word.
Here, 1 ≤ N ≤ M, where M is the smaller of the maximum numbers of words by which the target word can be extended to the left and to the right in the training corpus, and the initial value of N is 1.
Prior to step 104, the method may further comprise:
and replacing generalizable words in the plurality of words by using identification values according to categories, and recording regular expressions of the words corresponding to the identification values. Wherein, the generalizable words may include: a term of art.
In an example, the target word may be any one of a plurality of words obtained by segmenting the training corpus, or may be an identification value of any one of the words after generalization.
And 105, extracting words in the training corpus by using the candidate prediction template.
If all the extracted words are the same as the target word, step 106 is performed. If the extracted words include a word different from the target word and N is less than M, N is incremented by 1 and step 104 is performed again. If the extracted words include a word different from the target word and N is equal to M, step 107 is performed.
And step 106, determining the candidate prediction template as a word prediction template corresponding to the target word.
In one example, the method may further comprise: and fusing the plurality of word prediction templates to generate a combined word prediction template.
And step 107, judging the proportion of the words which are different from the target words in the extracted words.
If the proportion of words different from the target word among the extracted words is not greater than the preset proportion threshold, step 106 is executed. If the proportion is greater than the preset proportion threshold, the process ends and no word prediction template is generated for the target word.
In one example, the proportion threshold may be set to 5%; the higher the required accuracy, the smaller the threshold should be set.
In a specific example, assume the target word is A, M is 4, and the initial value of N is 1. The 1st through Nth words to the left of A are denoted L1, L2, ..., LN; the 1st through Nth words to the right of A are denoted R1, R2, ..., RN; the feature information of A is denoted S; and the candidate prediction templates for N = 1, 2, ... are denoted F1, F2, ..., FN. On this basis, when N is 1, the candidate prediction template F1 corresponding to the target word may be written as [L1, S, R1].
Words are extracted from the training corpus using the candidate prediction template F1. Assume 20 words are extracted: if all 20 are the same as A, F1 is determined to be the word prediction template corresponding to the word A. If the 20 words include a word different from A, N is incremented to 2 and the candidate prediction template F2 is generated, i.e., [L2, L1, S, R1, R2].
Similarly, words are extracted from the training corpus using F2. Assume 15 words are extracted: if all 15 are the same as A, F2 is determined to be the word prediction template corresponding to the word A. If the 15 words include a word different from A, N is incremented to 3 and the candidate prediction template F3 is generated, i.e., [L3, L2, L1, S, R1, R2, R3].
Similarly, words are extracted from the training corpus using F3. Assume 13 words are extracted: if all 13 are the same as A, F3 is determined to be the word prediction template corresponding to the word A. If the 13 words include a word different from A, N is incremented to 4 and the candidate prediction template F4 is generated, i.e., [L4, L3, L2, L1, S, R1, R2, R3, R4].
Similarly, words are extracted from the training corpus using F4. Assume 10 words are extracted: if all 10 are the same as A, F4 is determined to be the word prediction template corresponding to the word A. If the 10 words include a word different from A, then since M is 4 (that is, N = M), it must be judged whether the proportion of words different from A among the 10 exceeds the preset proportion threshold. Assuming the threshold is 10%: if exactly 1 of the 10 extracted words differs from A, the proportion is 10%, which is not greater than the threshold, so F4 is determined to be the word prediction template corresponding to the word A. If more than 1 of the 10 extracted words differ from A, the proportion necessarily exceeds the 10% threshold, and no word prediction template is generated for the word A.
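The iteration traced above can be sketched in code. This is a hedged illustration only: the corpus representation (a list of tokenized sentences), the helper names `extract` and `find_template`, and the omission of the feature-information check inside `extract` are assumptions made for this sketch, not details fixed by the patent.

```python
def extract(template, corpus):
    """Return every word that occurs between the template's left and right
    context in the corpus. The feature check (S) is elided in this sketch."""
    left, _feature, right = template
    hits = []
    for sent in corpus:
        for i in range(len(left), len(sent) - len(right)):
            if tuple(sent[i - len(left):i]) == left and \
               tuple(sent[i + 1:i + 1 + len(right)]) == right:
                hits.append(sent[i])
    return hits

def find_template(corpus, sentence, idx, feature, max_n, threshold=0.05):
    """Grow a candidate template [LN..L1, S, R1..RN] around sentence[idx]
    (steps 104-107). The caller ensures max_n <= idx and
    idx + max_n < len(sentence), matching the definition of M."""
    target = sentence[idx]
    for n in range(1, max_n + 1):
        left = tuple(sentence[idx - n:idx])            # LN ... L1
        right = tuple(sentence[idx + 1:idx + 1 + n])   # R1 ... RN
        template = (left, feature, right)
        extracted = extract(template, corpus)
        wrong = [w for w in extracted if w != target]
        if not wrong:
            return template                            # step 106
        if n == max_n:
            # step 107: accept if the wrong-word proportion is small enough
            if extracted and len(wrong) / len(extracted) <= threshold:
                return template
            return None                                # no template generated
    return None

corpus = [["the", "price", "is", "low"], ["a", "price", "is", "high"]]
print(find_template(corpus, corpus[0], 1, "NOUN", 1))
# → (('the',), 'NOUN', ('is',))
```

Here the N = 1 template already extracts only the target word, so it is accepted at step 106; with a corpus where the same context also surrounds other words, N would grow toward M and the proportion check of step 107 would decide.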
Further, in another example, to make the word prediction template more accurate, the method provided in this embodiment may further include: acquiring a new training corpus, taking the generated word prediction template as a candidate prediction template, and performing step 105 to verify and calibrate the generated word prediction template. In this way the word prediction template can be updated without complete retraining, and gaps can be filled continuously during use, so the template becomes increasingly accurate and rich.
With the word prediction template generation method provided by this embodiment of the invention, the training corpus is segmented to obtain a plurality of words; the feature information of each word is determined; for a target word among the plurality of words, a candidate prediction template is generated using the Nth word to its left, the Nth word to its right, and the feature information of the target word; words are then extracted from the training corpus using the candidate prediction template; and if all the extracted words are the same as the target word, the candidate prediction template is determined to be the word prediction template corresponding to the target word. With this method, word prediction and extraction for short texts require no massive training corpus, so consumption is low; and because the word prediction template incorporates the feature information of words, using it can effectively improve the accuracy of word prediction and extraction.
An embodiment of the present invention provides a word prediction template generation apparatus, and fig. 2 is a structural diagram of the word prediction template generation apparatus. The apparatus may include:
an obtaining unit 201, configured to obtain a corpus;
a word segmentation unit 202, configured to perform word segmentation on the training corpus to obtain multiple words;
a first determining unit 203, configured to determine feature information of each word;
a generating unit 204, configured to generate, for a target word among the multiple words, a candidate prediction template using the Nth word to the left of the target word, the Nth word to its right, and the feature information of the target word, where 1 ≤ N ≤ M, M is the smaller of the maximum numbers of words by which the target word can be extended to the left and to the right in the training corpus, and the initial value of N is 1;
an extracting unit 205, configured to perform word extraction in the corpus by using a candidate prediction template;
a second determining unit 206, configured to determine that the candidate prediction template is a word prediction template corresponding to the target word if all the extracted words are the same as the target word.
Preferably, the generating unit 204 is further configured to, if the extracted words include a word different from the target word and N is less than M, increment N by 1 and generate a candidate prediction template using the Nth word to the left and the Nth word to the right of the target word and the feature information of the target word.
Preferably, as shown in fig. 3, the apparatus further comprises: a determining unit 207 configured to determine a ratio of a word different from the target word among the extracted words if the extracted words include a word different from the target word and N is equal to M; the second determining unit 206 is further configured to determine that the candidate prediction template is a word prediction template corresponding to the target word if the ratio is not greater than a preset ratio threshold.
Preferably, as shown in fig. 4, the apparatus further includes: and a generalization unit 208, configured to replace, by category, a generalizable term in the plurality of terms with an identification value, and record a regular expression of the term corresponding to the identification value.
Preferably, as shown in fig. 5, the apparatus further includes: a fusion unit 209 is configured to fuse the plurality of word prediction templates to generate a combined word prediction template.
With the word prediction template generation device provided by the invention, the training corpus is segmented to obtain a plurality of words; the feature information of each word is determined; for a target word among the plurality of words, a candidate prediction template is generated using the Nth word to its left, the Nth word to its right, and the feature information of the target word; words are then extracted from the training corpus using the candidate prediction template; and if all the extracted words are the same as the target word, the candidate prediction template is determined to be the word prediction template corresponding to the target word. With this device, word prediction and extraction for short texts require no massive training corpus, so consumption is low; and because the word prediction template incorporates the feature information of words, using it can effectively improve the accuracy of word prediction and extraction.
Next, an electronic apparatus 11 according to an embodiment of the present application is described with reference to fig. 6.
As shown in fig. 6, the electronic device 11 includes one or more processors 111 and memory 112.
The processor 111 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 11 to perform desired functions.
Memory 112 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by processor 111 to implement the word prediction template generation methods of the various embodiments of the present application described above and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 11 may further include: an input device 113 and an output device 114, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input device 113 may include, for example, a keyboard, a mouse, and the like.
The output device 114 may output various information including the determined distance information, direction information, and the like to the outside. The output devices 114 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, among others.
Of course, for the sake of simplicity, only some of the components of the electronic device 11 relevant to the present application are shown in fig. 6, and components such as buses, input/output interfaces, and the like are omitted. In addition, the electronic device 11 may include any other suitable components, depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the methods and apparatus described above, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the word prediction template generation method according to various embodiments of the present application described in the "exemplary methods" section above of this specification.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in a word prediction template generation method according to various embodiments of the present application described in the "exemplary methods" section above of this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words meaning "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the term "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to."
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (12)

1. A method for generating a word prediction template, the method comprising:
acquiring a training corpus;
performing word segmentation on the training corpus to obtain a plurality of words;
determining characteristic information of each word;
for a target word in the plurality of words, generating a candidate prediction template by using the Nth word on the left side of the target word, the Nth word on the right side of the target word, and the characteristic information of the target word, wherein N is greater than or equal to 1 and less than or equal to M, M is the smaller of the maximum numbers of words to which the target word can be extended on the left side and on the right side in the training corpus, and an initial value of N is 1;
extracting words from the training corpus by using a candidate prediction template;
and if the extracted words are the same as the target words, determining the candidate prediction template as the word prediction template corresponding to the target words.
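The steps of claim 1 can be sketched roughly as follows. This is an illustrative reconstruction, not the patented implementation: the whitespace segmenter stands in for a real word segmenter (a Chinese corpus would need a proper tokenizer), the tuple template format and all function names are assumptions, and the characteristic information of the target word (e.g. its part of speech) is omitted for brevity.

```python
def segment(corpus):
    # Stand-in for a real word segmenter; whitespace splitting is an
    # assumption made for this sketch.
    return corpus.split()

def build_template(words, idx, n):
    # Candidate template: the Nth word to the left of the target at
    # position idx and the Nth word to the right, with n-1 wildcard
    # positions between each anchor and the target slot.
    return (words[idx - n], n, words[idx + n])

def extract(words, template):
    # Extract every word whose Nth-left and Nth-right neighbours
    # match the template's anchor words.
    left, n, right = template
    return [words[i] for i in range(n, len(words) - n)
            if words[i - n] == left and words[i + n] == right]
```

If every word the candidate extracts equals the target, the candidate is kept as the target word's word prediction template.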
2. The method according to claim 1, wherein after said extracting words from said corpus using said candidate predictive template, said method further comprises:
and if the extracted words comprise a word different from the target word and N is less than M, incrementing N by 1 and performing again the step of generating a candidate prediction template by using the Nth word on the left side of the target word, the Nth word on the right side of the target word, and the characteristic information of the target word.
3. The method according to claim 1, wherein after said extracting words from said corpus using said candidate predictive template, said method further comprises:
if the extracted words comprise a word different from the target word and N is equal to M, determining the proportion of words different from the target word among the extracted words, and if the proportion is not greater than a preset proportion threshold, determining the candidate prediction template as the word prediction template corresponding to the target word.
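The iterative widening described in claims 2 and 3 can be sketched as a single self-contained loop. The function name, the tuple template format, and the 0.2 threshold are illustrative assumptions, not values taken from the patent.

```python
def find_template(words, target, max_bad_ratio=0.2):
    # Start from the nearest neighbours (N=1) and widen the window
    # until the candidate template extracts only the target word, or,
    # at the widest window N=M, until the share of non-target
    # extractions stays under the threshold.
    idx = words.index(target)
    m = min(idx, len(words) - 1 - idx)  # smaller of left/right room
    for n in range(1, m + 1):
        left, right = words[idx - n], words[idx + n]
        hits = [words[i] for i in range(n, len(words) - n)
                if words[i - n] == left and words[i + n] == right]
        bad = sum(1 for h in hits if h != target)
        if bad == 0:
            return (left, n, right)               # claim 1/2 case
        if n == m and bad / len(hits) <= max_bad_ratio:
            return (left, n, right)               # claim 3 case
    return None
```

For example, in the corpus `a b X c a b Y c` the N=1 anchors `(b, c)` also extract `Y`, so the window widens to N=2, where the anchors `(a, a)` extract only `X`.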
4. The method of claim 1, wherein before generating, for a target word of the plurality of words, a candidate prediction template using an nth word to the left, an nth word to the right of the target word, and feature information of the target word, the method further comprises:
and replacing the generalizable words in the plurality of words by using identification values according to categories, and recording regular expressions of the words corresponding to the identification values.
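The generalization step of claim 4 can be sketched with regular expressions. The category names and the patterns below are illustrative assumptions; the patent does not specify which categories are generalizable.

```python
import re

# Hypothetical category patterns for generalizable words.
CATEGORIES = {
    "<NUM>": re.compile(r"\d+(\.\d+)?"),
    "<DATE>": re.compile(r"\d{4}-\d{2}-\d{2}"),
}

def generalize(words):
    # Replace each generalizable word with a per-category
    # identification value and record the regular expression that
    # each identification value stands for.
    out, id_to_regex = [], {}
    for w in words:
        for ident, pat in CATEGORIES.items():
            if pat.fullmatch(w):
                out.append(ident)
                id_to_regex[ident] = pat.pattern
                break
        else:
            out.append(w)
    return out, id_to_regex
```

Templates built over the generalized sequence then match any word of the category, not only the literal word seen in the training corpus.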
5. The method of claim 1, further comprising:
and fusing the plurality of word prediction templates to generate a combined word prediction template.
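Claim 5 does not specify a fusion algorithm, so the sketch below assumes the simplest reading: a combined template that matches wherever any member template matches. Both function names and the `(left, n, right)` template format are illustrative assumptions.

```python
def fuse(templates):
    # Naive fusion: the combined word prediction template is the
    # collection of member templates, applied as alternatives.
    return tuple(templates)

def combined_extract(words, fused):
    # A position is extracted if any member template's Nth-left and
    # Nth-right anchors match around it.
    hits = []
    for i in range(len(words)):
        for left, n, right in fused:
            if 0 <= i - n and i + n < len(words) \
                    and words[i - n] == left and words[i + n] == right:
                hits.append(words[i])
                break
    return hits
```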
6. An apparatus for generating a word prediction template, the apparatus comprising:
the acquisition unit is used for acquiring the training corpus;
the word segmentation unit is used for segmenting the training corpus to obtain a plurality of words;
the first determining unit is used for determining the characteristic information of each word;
a generating unit, configured to generate, for a target word in the plurality of words, a candidate prediction template using the Nth word on the left side of the target word, the Nth word on the right side of the target word, and the characteristic information of the target word, wherein N is greater than or equal to 1 and less than or equal to M, M is the smaller of the maximum numbers of words to which the target word can be extended on the left side and on the right side in the training corpus, and an initial value of N is 1;
the extraction unit is used for extracting words from the training corpus by using the candidate prediction template;
and the second determining unit is used for determining the candidate prediction template as the word prediction template corresponding to the target word if the extracted words are the same as the target word.
7. The apparatus of claim 6, wherein the generating unit is further configured to, if the extracted words comprise a word different from the target word and N is less than M, increment N by 1 and generate a candidate prediction template again using the Nth word on the left side of the target word, the Nth word on the right side of the target word, and the characteristic information of the target word.
8. The apparatus of claim 6, further comprising:
the judging unit is used for judging the proportion of the words which are different from the target words in the extracted words if the extracted words comprise the words which are different from the target words and N is equal to M;
the second determining unit is further configured to determine that the candidate prediction template is a word prediction template corresponding to the target word if the ratio is not greater than a preset ratio threshold.
9. The apparatus of claim 6, further comprising:
and the generalization unit is used for replacing the generalizable words in the plurality of words by using the identification values according to the categories and recording the regular expressions of the words corresponding to the identification values.
10. The apparatus of claim 6, further comprising:
and the fusion unit is used for fusing the plurality of word prediction templates to generate a combined word prediction template.
11. A computer-readable storage medium storing a computer program for executing the word prediction template generating method according to any one of claims 1 to 5.
12. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the word prediction template generation method according to any one of claims 1 to 5.
CN202110933954.8A 2021-08-16 2021-08-16 Word prediction template generation method and device Pending CN113378561A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110933954.8A CN113378561A (en) 2021-08-16 2021-08-16 Word prediction template generation method and device


Publications (1)

Publication Number Publication Date
CN113378561A true CN113378561A (en) 2021-09-10

Family

ID=77577170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110933954.8A Pending CN113378561A (en) 2021-08-16 2021-08-16 Word prediction template generation method and device

Country Status (1)

Country Link
CN (1) CN113378561A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550200A (en) * 2015-12-02 2016-05-04 北京信息科技大学 Chinese segmentation method oriented to patent abstract
CN105938495A (en) * 2016-04-29 2016-09-14 乐视控股(北京)有限公司 Entity relationship recognition method and apparatus
CN109740159A (en) * 2018-12-29 2019-05-10 北京泰迪熊移动科技有限公司 For naming the processing method and processing device of Entity recognition
WO2020207179A1 (en) * 2019-04-09 2020-10-15 山东科技大学 Method for extracting concept word from video caption
CN112966106A (en) * 2021-03-05 2021-06-15 平安科技(深圳)有限公司 Text emotion recognition method, device and equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: East of 1st floor, No.36 Haidian Street, Haidian District, Beijing, 100000

Applicant after: Beijing Teddy Bear Mobile Technology Co.,Ltd.

Address before: 100085 07a36, block D, 7 / F, No.28, information road, Haidian District, Beijing

Applicant before: BEIJING TEDDY BEAR MOBILE TECHNOLOGY Co.,Ltd.