CN109299453B - Method and device for constructing dictionary and computer-readable storage medium

Info

Publication number: CN109299453B
Application number: CN201710607574.9A
Authority: CN
Inventors: 张旸; 王雅圣; 毕舒展; 颜友亮
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2017-07-24
Filing date: 2017-07-24
Publication date: 2021-02-09
Anticipated expiration: 2037-07-24
Also published as: CN109299453A

Abstract

An embodiment of the invention provides a method and a device for constructing a dictionary. The method comprises: acquiring a candidate word and a paraphrase of the candidate word; selecting a feature word of the candidate word from the paraphrase of the candidate word; obtaining an initial judgment result of the candidate word through a preset classifier according to the feature word of the candidate word; obtaining a judgment result of each intermediate word in at least one intermediate word through the classifier according to the feature words selected from the paraphrase of that intermediate word, wherein the at least one intermediate word comprises the N-level feature words of the candidate word; and determining a final judgment result of the candidate word according to the initial judgment result of the candidate word and the judgment result of the at least one intermediate word, wherein the final judgment result indicates whether the candidate word can be added to the dictionary. The accuracy of the dictionary can thereby be improved.

Description

Method and device for constructing dictionary and computer-readable storage medium

Technical Field

The embodiment of the invention relates to the field of natural language processing, in particular to a method and a device for constructing a dictionary.

Background

Dictionaries are a key resource in natural language processing. At present, most dictionaries are built manually, that is, words identified in a corpus are compiled by hand. The drawback of manually built dictionaries is that their coverage is incomplete. This is especially evident for the new Internet words that keep emerging, so manually built dictionaries do not meet practical needs well.

To make dictionaries more complete, automatic dictionary construction has been introduced. In one known method, the paraphrases of the words in an existing dictionary are looked up in a paraphrase knowledge base (e.g., a modern Chinese dictionary, an encyclopedia, etc.), and bag-of-words (BoW) features are extracted from those paraphrases. The BoW features of a word are the feature words extracted from its paraphrase together with the frequency of occurrence of each feature word. A classifier is then built on the BoW features. When a candidate word is to be added to the dictionary, BoW features are extracted from the paraphrase of the candidate word in the same way, and the classifier judges from these features whether the candidate word can be added to the dictionary.
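As a rough illustration of the BoW features described above, the following Python sketch counts how often each feature word occurs in a paraphrase that has already been segmented into words. The function name `bow_features`, the vocabulary and the token list are hypothetical and are not taken from the source.

```python
from collections import Counter


def bow_features(paraphrase_tokens, vocabulary):
    """Count how often each vocabulary word occurs in a segmented paraphrase."""
    counts = Counter(t for t in paraphrase_tokens if t in vocabulary)
    # One count per vocabulary word, in vocabulary order.
    return [counts[word] for word in vocabulary]


# Hypothetical feature-word vocabulary and segmented paraphrase.
vocabulary = ["precious", "painting", "rare"]
tokens = ["precious", "painting", "and", "painting"]
vector = bow_features(tokens, vocabulary)
```

Each position of `vector` only records a frequency, which is why the order and collocation of the words in the paraphrase are lost.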

However, the BoW-based method considers only the frequency of occurrence of each feature word and treats each feature word as an independent unit, so it cannot represent the semantic information in a paraphrase well. For example, the type tendency of some words depends on information such as their habitual usage and common collocations, which the BoW features extracted from a paraphrase cannot capture. Such words can strongly interfere with judging the type tendency of a candidate word, which reduces the accuracy of the dictionary.

Therefore, how to improve the accuracy of the dictionary is a problem that needs to be solved urgently.

Disclosure of Invention

The embodiment of the invention provides a method for constructing a dictionary, which can improve the accuracy of the dictionary.

In a first aspect, a method for constructing a dictionary is provided, the method comprising:

acquiring candidate words;

obtaining paraphrases of the candidate words from a paraphrase knowledge base;

selecting a feature word of the candidate word from the paraphrases of the candidate word, wherein the feature word of the candidate word is a content word (a "real word", as opposed to a function word) in the paraphrases of the candidate word;

obtaining an initial judgment result of the candidate word through a preset classifier according to the feature word of the candidate word, wherein the classifier is used for indicating the probability that a word belongs to the dictionary;

obtaining a judgment result of each intermediate word through the classifier according to the feature word selected from the paraphrase of each intermediate word in the at least one intermediate word, wherein the at least one intermediate word comprises N-level feature words of the candidate words,

the N-level feature words are feature words of the candidate words, where N is 1, or,

a K-th level feature word in the N-level feature words is a feature word selected from paraphrases of a (K-1)-th level feature word in the N-level feature words, wherein both N and K are integers greater than 1, and K is less than or equal to N;

and determining a final judgment result of the candidate word according to the initial judgment result of the candidate word and the judgment result of the at least one intermediate word, wherein the final judgment result of the candidate word is used for indicating whether the candidate word can be added to the dictionary.

Therefore, in the method for constructing a dictionary provided in the embodiment of the present invention, at least one intermediate word is derived from the paraphrase of the candidate word. That is, the at least one intermediate word includes the N-level feature words of the candidate word, where the K-th level feature words are selected from the paraphrases of the (K-1)-th level feature words (in other words, each level of feature words is selected from the paraphrases of the previous level). The final judgment result, which indicates whether the candidate word can be added to the dictionary, is obtained by a comprehensive decision over the initial judgment result generated for the candidate word and the judgment results generated for the intermediate words. For a feature word in the paraphrase of the candidate word whose type tendency (e.g., emotional tendency) is unknown, information such as the type tendency and usage collocations of each of the N-level feature words can be analyzed, which effectively reduces the noise introduced by feature words of ambiguous meaning and improves the accuracy of the dictionary. In addition, because the paraphrase knowledge base is rich in resources and collects many new words, the embodiment of the invention can make multiple judgments by looking up the paraphrases of the candidate word and of its N-level feature words, which helps expand the dictionary in real time.

With reference to the first aspect, in some implementations of the first aspect, the selecting a feature word of the candidate word from the paraphrases of the candidate word includes:

selecting the content words ("real words") in the paraphrase of the candidate word;

and taking the words that are both content words in the paraphrase of the candidate word and words collected in the dictionary as the feature words of the candidate word.

Therefore, by taking the words that appear both in the paraphrase of the candidate word and among the words collected in the dictionary as the feature words of the candidate word, words that do not belong to the dictionary are filtered out. This effectively reduces the interference caused by feature words whose type tendency is not obvious, and further improves the accuracy of the dictionary.
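The selection described above can be sketched as a simple intersection that preserves the order of the paraphrase. This is a minimal sketch; the function name `select_feature_words` and the sample data are hypothetical.

```python
def select_feature_words(paraphrase_words, dictionary):
    """Keep paraphrase words that the dictionary already contains, in order, without duplicates."""
    selected = []
    for word in paraphrase_words:
        if word in dictionary and word not in selected:
            selected.append(word)
    return selected


# Hypothetical dictionary entries and paraphrase words.
dictionary = {"precious", "honorific"}
paraphrase_words = ["precious", "painting", "honorific", "precious"]
features = select_feature_words(paraphrase_words, dictionary)
```

Here "painting" is dropped because it is not in the dictionary, and the repeated "precious" is kept only once.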

With reference to the first aspect, in some implementation manners of the first aspect, the obtaining an initial determination result of the candidate word through a preset classifier according to the feature word of the candidate word includes:

determining the part of speech of the characteristic words of the candidate words and the syntactic structure to which the characteristic words of the candidate words belong according to the characteristic words of the candidate words;

converting the characteristic words of the candidate words, the part of speech of the characteristic words of the candidate words and the syntactic structures to which the characteristic words of the candidate words belong into characteristic vectors;

and taking the feature vector as an input parameter, and obtaining the initial judgment result through the classifier.

Therefore, information such as the part of speech of each feature word of the candidate word and the syntactic structure to which it belongs is extracted from the feature words of the candidate word. This better represents the semantic information expressed by the paraphrase of the candidate word and allows information such as the type tendency and usage collocations of the feature words to be analyzed effectively, which improves the accuracy of the dictionary.

With reference to the first aspect, in some implementations of the first aspect, the K-th level feature word is specifically a feature word selected from paraphrases of feature words used for generating a first determination result in the K-1-th level feature word, and the first determination result is used to indicate that a probability that a word belongs to the dictionary satisfies a preset condition.

Therefore, the first judgment results, i.e., those meeting the preset condition, are screened out from the judgment results generated for the (K-1)-th level feature words, and the feature words used for generating those first judgment results are taken as the K-th level feature words. Feature words used for generating judgment results which indicate that the probability of a word belonging to the dictionary does not meet the preset condition are thereby filtered out, so that an accurate final judgment result can be obtained even with a small number N of judgment groups.
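A minimal sketch of this screening step follows. It assumes, purely for illustration, that a judgment result is a probability and that the preset condition is a threshold of 0.5; the source fixes neither. The names `next_level` and `paraphrase_features` are hypothetical.

```python
def next_level(prev_level_scores, paraphrase_features, threshold=0.5):
    """Derive level K from the level K-1 words whose result meets the preset condition.

    prev_level_scores: {word: probability} for the level K-1 feature words.
    paraphrase_features: {word: feature words selected from that word's paraphrase}.
    """
    kept = [w for w, p in prev_level_scores.items() if p >= threshold]
    level_k = []
    for word in kept:
        for feature in paraphrase_features.get(word, []):
            if feature not in level_k:
                level_k.append(feature)
    return level_k


# Hypothetical level K-1 results and paraphrase lookups.
scores = {"precious": 0.9, "other people": 0.2}
paraphrase_features = {"precious": ["value"], "other people": ["person"]}
level_k = next_level(scores, paraphrase_features)
```

Only "precious" passes the assumed threshold, so only its feature word reaches the next level.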

With reference to the first aspect, in some implementations of the first aspect, the determining a final decision result of the candidate word according to the initial decision result of the candidate word and the decision result of the at least one intermediate word includes:

in the case where the initial judgment result of the candidate word indicates that the probability of the candidate word belonging to the dictionary meets a preset condition, and the judgment result of the at least one intermediate word indicates that the probability of each intermediate word belonging to the dictionary meets the preset condition, determining that the final judgment result of the candidate word is that the candidate word can be added to the dictionary.

converting the initial judgment result of the candidate word and the judgment result of the at least one intermediate word into a judgment vector;

and determining a final judgment result of the candidate word by using the judgment vector as an input parameter through a preset first formula, wherein the first formula is used for indicating whether the candidate word can be added to the dictionary.
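The rule in which every result must meet the preset condition can be sketched as follows. The sketch assumes that judgment results are probabilities and that the preset condition is a threshold of 0.5; both are assumptions. The "first formula" applied to the judgment vector is not specified in this section, so it is not modeled here.

```python
def final_decision(initial_score, intermediate_scores, threshold=0.5):
    """Return True only if the candidate and every intermediate word meet the condition."""
    return initial_score >= threshold and all(
        score >= threshold for score in intermediate_scores
    )


# Hypothetical judgment results.
accepted = final_decision(0.8, [0.7, 0.6])
rejected = final_decision(0.8, [0.7, 0.3])
```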

In a second aspect, an apparatus for constructing a dictionary is provided, which may be used to perform the operations of the first aspect and any possible implementation manner of the first aspect. In particular, the apparatus may comprise means for performing the operations in the first aspect described above or any possible implementation manner of the first aspect.

In a third aspect, an apparatus is provided, comprising: a memory for storing a computer program; a processor configured to execute the computer program stored in the memory to cause the apparatus to perform the operations of the first aspect or any possible implementation manner of the first aspect.

In a fourth aspect, there is provided a computer program product comprising: computer program code which, when executed by a processing unit or a processor in an apparatus, causes the apparatus to perform the method in the first aspect and its embodiments described above.

In a fifth aspect, a computer storage medium is provided, which stores a computer program that, when run on a computer, causes the computer to perform the operations of the first aspect or any possible implementation manner of the first aspect.

In some implementations, the obtaining, by the classifier, a determination result of each intermediate word according to a feature word selected from a paraphrase of each intermediate word in the at least one intermediate word includes:

after a judgment result is generated based on the (i-1)-th level feature words in the N levels of feature words, determining the feature words selected from the paraphrases of the (i-1)-th level feature words as the i-th level feature words in the N levels of feature words;

and obtaining a judgment result of the i-th level feature words through the classifier according to the feature words selected from the paraphrases of the i-th level feature words, wherein i ∈ [2, N].

In some implementations described above, determining, after generating a determination result based on the i-1 th level feature word of the N-level feature words, a feature word selected from paraphrases of the i-1 th level feature word in the N-level feature words as the i-th level feature word in the N-level feature words includes:

determining, from the judgment results of the (i-1)-th level feature words, the feature words whose judgment result is a first judgment result, wherein the first judgment result is used for indicating that the probability of a word belonging to the dictionary meets a preset condition;

and determining the characteristic words selected from the paraphrases of the characteristic words of which the judgment results are the first judgment results as the ith-level characteristic words.

In some implementations of the foregoing, the determining a final decision result of the candidate word according to the initial decision result of the candidate word and the decision result of the at least one intermediate word includes:

Drawings

FIG. 1 is a schematic flow chart diagram of a method for constructing a dictionary in accordance with an embodiment of the present invention.

FIGS. 2 and 3 are schematic block diagrams of the relationship of N sets of intermediate decisions of a method according to an embodiment of the invention.

FIG. 4 is a detailed flow diagram of a method for constructing a dictionary in accordance with an embodiment of the present invention.

FIG. 5 shows an apparatus for constructing a dictionary according to an embodiment of the present invention.

FIG. 6 shows an apparatus according to an embodiment of the invention.

Detailed Description

The technical solution in the embodiments of the present invention will be described below with reference to the accompanying drawings.

The embodiment of the invention can be applied to the construction and expansion of dictionaries in natural language processing, wherein the dictionaries can be dictionaries with special purposes, such as emotion dictionaries, dirty word dictionaries and the like, and can also be dictionaries constructed based on practical purposes.

Natural language processing generally comprises the following steps: inputting language text -> extracting features from the language text -> building a model based on the features -> predicting and classifying the language text. The feature extraction step usually relies on external resources, and a dictionary is one of the main resources of this kind: features are extracted from the language text based on the dictionary, so that the natural language processing can be completed.

For example, when emotion analysis is required for a target language text, an emotion dictionary is needed to determine which emotion words a sentence contains; as another example, when dirty-word filtering is to be applied to the target language text, a dirty-word dictionary is needed to determine whether dirty words appear in a sentence.

Hereinafter, a method for constructing a dictionary according to an embodiment of the present invention will be described in detail with reference to fig. 1 and 4.

FIG. 1 is a schematic flow chart diagram illustrating a method 100 for constructing a dictionary in accordance with an embodiment of the present invention.

In step S110, a candidate word is acquired.

That is, the candidate word is a word for which it is to be determined whether it can be added to the dictionary.

In step S120, paraphrases of the candidate word are obtained from the paraphrase knowledge base.

The paraphrase knowledge base can be all information bases capable of finding paraphrases of words, for example, the paraphrase knowledge base can be a modern Chinese dictionary, an encyclopedia and the like.

In step S130, a feature word of the candidate word is obtained from the paraphrase of the candidate word, the feature word being a content word (a "real word", as opposed to a function word) in the paraphrase of the candidate word.

Specifically, the paraphrase of the candidate word can be split into a plurality of words by a word segmentation tool, and the content words among them are determined as the feature words of the candidate word.

For example, if the candidate word is "ink" and its paraphrase is "refers to precious calligraphy and painting, and is also used as a respectful term for the characters and paintings produced by other people", the feature words of the candidate word can be: precious; calligraphy and painting; honorific; other people; characters; paintings.
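The segmentation-and-filter step above can be sketched as follows, assuming the word segmentation tool returns (word, part-of-speech) pairs. The tag names and the set of parts of speech treated as real (content) words are assumptions made for illustration; the source does not name a segmentation tool or a tag set.

```python
# Assumed set of parts of speech that count as real (content) words.
CONTENT_POS = {"noun", "verb", "adj", "pronoun"}


def content_words(tagged_tokens):
    """Keep only the words whose part of speech marks them as content words."""
    return [word for word, pos in tagged_tokens if pos in CONTENT_POS]


# Hypothetical output of a word segmentation and tagging tool.
tagged = [
    ("precious", "adj"),
    ("and", "conj"),
    ("painting", "noun"),
    ("of", "prep"),
]
feature_words = content_words(tagged)
```

Function words such as "and" and "of" are discarded, leaving only the words that carry meaning on their own.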

In step S140, an initial determination result of the candidate word is obtained through a preset classifier according to the feature word of the candidate word, where the classifier is used to indicate a probability that a word belongs to the dictionary.

That is, the initial determination result, which the classifier produces from the feature words of the candidate word, indicates the probability that the candidate word belongs to the dictionary: if the initial determination result meets the preset condition, the probability that the candidate word belongs to the dictionary is high; otherwise, the probability is low.

In other words, the classifier may also be used to represent the confidence of a word with respect to the dictionary, or in other words, the reliability of a word with respect to the dictionary. In general, the classifier is based on obtaining a decision result from a feature word of a word, the decision result serving as a decision factor for deciding whether a candidate word can be added to the dictionary.

In the embodiment of the invention, the classifier can be obtained by training on words with known type tendencies, or can be built from rules designed manually on the basis of human experience.

From another perspective, the classifier itself is a functional model, which may be a machine learning model trained on data or a functional model designed on the basis of human experience.

Optionally, the parameters of the classifier are parameters trained based on a set of words, the set of words including words belonging to the dictionary and words not belonging to the dictionary.

Specifically, a plurality of words may be extracted from an existing dictionary, some of which belong to the dictionary and some of which do not belong to the dictionary, and in general, a word having a high frequency of appearance (for example, the top 1000 words having a high frequency) may be extracted as data for training the classifier. The specific method for training the classifier is the same as the method for determining whether the candidate word can be added to the dictionary in the embodiment of the present invention, and for brevity, the details are not repeated here.

Specifically, information related to the feature words of the candidate word (referred to as feature word information, for ease of distinction and understanding) is obtained from the feature words. The feature word information may include the part of speech of each feature word, the word length of each feature word, the words preceding and following each feature word, and the like. The feature word information is then converted into a feature vector, which is used as the input parameter of the classifier. The function model designed in advance in the classifier computes a result from the feature vector; this result is the initial determination result of the candidate word and, given the function of the classifier, represents the degree to which the candidate word belongs to the dictionary.
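The conversion from feature word information to a classifier input can be sketched as below. The vector layout (part-of-speech counts plus mean word length), the tag inventory, the weights, and the use of a logistic function are all assumptions chosen for illustration; the source does not specify the classifier's function model.

```python
import math

POS_TAGS = ["noun", "verb", "adj"]  # hypothetical tag inventory


def feature_vector(feature_words):
    """Turn (word, part-of-speech) pairs into POS counts plus the mean word length."""
    pos_counts = [
        sum(1 for _, pos in feature_words if pos == tag) for tag in POS_TAGS
    ]
    if feature_words:
        mean_len = sum(len(word) for word, _ in feature_words) / len(feature_words)
    else:
        mean_len = 0.0
    return pos_counts + [mean_len]


def classify(vector, weights, bias=0.0):
    """Logistic score standing in for the preset classifier (an assumption)."""
    z = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical feature words and hand-picked weights.
feature_words = [("precious", "adj"), ("painting", "noun")]
vector = feature_vector(feature_words)
probability = classify(vector, weights=[0.5, 0.0, 1.0, -0.1])
```

The returned value can then be compared against a preset condition, such as a threshold, to form the initial determination result.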

In step S150, a determination result of each intermediate word is obtained through the classifier according to the feature words extracted from the paraphrases of each intermediate word in the at least one intermediate word, wherein the at least one intermediate word includes N-level feature words of the candidate word,

the N-level feature words are the feature words of the candidate word when N is 1; or, the K-th level feature words in the N-level feature words are feature words selected from the paraphrases of the (K-1)-th level feature words in the N-level feature words, where N and K are both integers greater than 1, and K is less than or equal to N.

Specifically, in the embodiment of the present invention, the at least one intermediate word may be derived based on the paraphrase of the candidate word, each intermediate word is used to generate a determination result, and the determination result of each intermediate word may be used as a determination basis for determining whether the candidate word can be added to the dictionary.

Wherein the at least one intermediate word includes an N-level feature word of the candidate word, that is, the N-level feature word is related to the feature word of the candidate word:

when N is 1, the N-level feature word is the feature word of the candidate word;

when N is greater than 1, the K-th level feature words in the N-level feature words are feature words selected from the paraphrases of the (K-1)-th level feature words in the N-level feature words; that is, each level of feature words is determined from the paraphrases of the previous level, and the K-th level is any level from level 2 to level N. Hereinafter, for convenience of description, the determination based on the candidate word is referred to as the initial determination, and the determination based on the at least one intermediate word derived from the feature words of the candidate word (i.e., the N-level feature words of the candidate word) is referred to as an intermediate determination.

In the embodiment of the present invention, N sets of intermediate decisions are performed by the classifier on the N-level feature words to obtain N sets of decision results. Each set of intermediate decisions contains one or more intermediate decisions, the N sets of intermediate decisions correspond one-to-one to the N levels of feature words, and the K-th set of intermediate decisions generates its decision results from the corresponding K-th level feature words. In other words, the K-th level feature words are the intermediate words used in the K-th set of intermediate decisions, and the feature words of a K-th level feature word are the feature words obtained from the paraphrase of the corresponding intermediate word. Descriptions in terms of intermediate words and in terms of K-th level feature words therefore have the same meaning, and either may be used below depending on the context. The K-th set of intermediate decisions is any one of the N sets of intermediate decisions.

The relationship of the N sets of intermediate decisions in the embodiment of the present invention is briefly described below by fig. 2 and 3.

FIG. 2 is a schematic block diagram of the relationship among the N sets of intermediate decisions in the method according to the embodiment of the present invention. As shown in FIG. 2, the intermediate words in the 1st set of intermediate decisions are the feature words selected from the paraphrase of the candidate word (i.e., the level 1 feature words), the intermediate words in the 2nd set of intermediate decisions are the feature words selected from the paraphrases of the intermediate words of the 1st set (i.e., the level 2 feature words), and so on, until all decision results of the N sets of intermediate decisions are obtained.

It should be noted that, in the embodiment of the present invention, the initial determination yields only one determination result. In contrast, because at least one feature word is selected from a paraphrase, each group of intermediate determinations contains at least one intermediate word, and each intermediate word in a group generates its own determination result.

That is, in the K-th group of intermediate decisions among the N groups of intermediate decisions, at least one decision result is generated based on at least one intermediate word, the at least one decision result being a decision result corresponding to the K-th group of intermediate decisions.

Specifically, FIG. 3 takes the initial determination and the 1st group of intermediate determinations as an example. Two feature words, feature word #01 and feature word #02, are selected from the paraphrase of the candidate word in the initial determination, so two intermediate determinations can be performed in the 1st group: feature word #01 serves as intermediate word #11 and feature word #02 serves as intermediate word #12, and one intermediate determination is performed on each of them, generating determination result #11 for intermediate word #11 and determination result #12 for intermediate word #12.

For the N-level feature words, the feature words of the candidate word are the level 1 feature words (i.e., the intermediate words used in the 1st group of intermediate decisions). A paraphrase is looked up for each intermediate word, and the content words ("real words") in the paraphrase of each intermediate word are taken as the feature words of that intermediate word; these feature words in turn serve as the level 2 feature words. The remaining levels are determined in the same way, so that all N levels of feature words can be determined.

For better understanding of the embodiments of the present invention, the relationship between the N-level feature words is described below by specific examples.

Continuing with the candidate word "ink" described above as an example, the determination process of the N-level feature words (i.e., the at least one intermediate word) of the embodiment of the present invention is described:

the level 1 feature words (namely, the feature words of the candidate word) are: precious; calligraphy and painting; honorific; other people; characters; paintings;

determining the level 2 characteristic words:

the feature words of each of the 6 level 1 feature words are selected from the paraphrase of that word: the paraphrase of "precious" is "valuable and very rare", from which the feature word "value" is selected; the paraphrase of "calligraphy and painting" is "calligraphy and painting", from which the feature words "calligraphy" and "painting" are selected; the feature words of the other words are obtained in the same way, so that the level 2 feature words are: value, calligraphy, painting, ……;

determining the level 3 characteristic words: among the level 2 feature words (i.e., the intermediate words used in the 2 nd group of intermediate decisions), the feature words selected in the paraphrases of each intermediate word are taken as level 3 feature words;

and analogizing in sequence to obtain the N-level characteristic words of the candidate words.
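The level-by-level derivation in the example above can be sketched as a breadth-first expansion. In this sketch the paraphrase knowledge base and feature-word selection are collapsed into a single hypothetical mapping `KB` from a word to the feature words selected from its paraphrase; the entries reuse a subset of the words from the example and are otherwise illustrative.

```python
def n_level_feature_words(candidate, kb, n):
    """Return [level 1, ..., level n], each level derived from the previous one."""
    levels = []
    current = kb.get(candidate, [])  # level 1: feature words of the candidate
    for _ in range(n):
        levels.append(current)
        following = []
        for word in current:
            for feature in kb.get(word, []):
                if feature not in following:
                    following.append(feature)
        current = following
    return levels


# Hypothetical lookup: word -> feature words selected from its paraphrase.
KB = {
    "ink": ["precious", "calligraphy and painting"],
    "precious": ["value"],
    "calligraphy and painting": ["calligraphy", "painting"],
}
levels = n_level_feature_words("ink", KB, n=2)
```

A word with no entry in the mapping simply contributes no feature words to the next level.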

Therefore, in the same manner as in step S140, a feature vector of each intermediate word is obtained by processing the feature word of each intermediate word, and the feature vector of the corresponding intermediate word is used as an input parameter of the classifier, so as to finally obtain a determination result of each intermediate word.

It should be noted that the determination result of the intermediate word (i.e., the N-level feature word) can be understood from two aspects: on one hand, the decision result of the intermediate word is also used to indicate the probability that the intermediate word belongs to the dictionary, or the confidence of the intermediate word relative to the dictionary, purely from the classifier perspective; on the other hand, the determination result of the intermediate word may also be used to indicate the degree of importance of the intermediate word to the result of determining whether the candidate word can be added to the dictionary in view of determining whether the candidate word can be added to the dictionary as a whole.

For example, assume that the determination results are expressed as "yes" and "no". A result of "yes" means that the probability that the intermediate word belongs to the dictionary is high (the intermediate word may even be considered to belong to the dictionary), or that the intermediate word is important for determining whether the candidate word can be added to the dictionary. Conversely, a result of "no" means that the probability that the intermediate word belongs to the dictionary is low (the intermediate word may even be considered not to belong to the dictionary), or that the intermediate word is not important for determining whether the candidate word can be added to the dictionary.

In the embodiment of the present invention, there are a plurality of ways to generate the determination result based on the N-level feature words, and a description is given below of the plurality of ways.

Mode 1

After a judgment result is generated based on the (i-1)-th level feature words in the N levels of feature words, the feature words selected from the paraphrases of the (i-1)-th level feature words are determined as the i-th level feature words in the N levels of feature words;

and a judgment result of the i-th level feature words is obtained by the classifier according to the feature words selected from the paraphrases of the i-th level feature words, wherein i ∈ [2, N].

That is, among the N groups of intermediate judgments, the i-th group of intermediate judgments is performed after the (i-1)-th group has generated its judgment results based on the (i-1)-th level feature words, and yields the judgment results of the i-th group.

Continuing with the example of fig. 2, in the initial determination, an initial determination result is generated according to the definitions of the candidate words, in the 1 st group of intermediate determinations, the 1 st-level feature word is used as the intermediate word in the 1 st group of intermediate determinations, and at the same time, a determination result is generated, and in the 2 nd group of intermediate determinations, the 2 nd-level feature word is used as the intermediate word in the 2 nd group of intermediate determinations, and at the same time, a determination result is generated, and this is repeated, so that all determination results of the N groups of intermediate determinations are obtained.
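By way of illustration only, Mode 1 may be sketched as the following loop, in which each level is judged before the next level is derived from it. The function name `mode1_level_by_level`, the callables `select_features` and `classify`, and the toy data are assumptions of this sketch and are not part of the embodiment; the classifier here is a placeholder rule, not the preset classifier described above.

```python
from typing import Callable, Dict, List


def mode1_level_by_level(
    level1_words: List[str],
    select_features: Callable[[str], List[str]],
    classify: Callable[[List[str]], bool],
    n_levels: int,
) -> Dict[int, Dict[str, bool]]:
    """Judge level i-1 first, then derive level i from the paraphrases of level i-1."""
    results: Dict[int, Dict[str, bool]] = {}
    current = list(level1_words)  # the 1st-level feature words
    for level in range(1, n_levels + 1):
        results[level] = {}
        next_level: List[str] = []
        for word in current:
            # Feature words selected from the paraphrase of this intermediate word.
            features = select_features(word)
            results[level][word] = classify(features)
            for feature in features:
                if feature not in next_level:
                    next_level.append(feature)
        current = next_level  # these become the next level's intermediate words
    return results


# Hypothetical toy data: word -> feature words selected from its paraphrase.
TOY_FEATURES = {
    "precious": ["valuable"],
    "respect": ["respectful"],
    "valuable": ["worth"],
    "respectful": [],
}

results = mode1_level_by_level(
    ["precious", "respect"],
    select_features=lambda word: TOY_FEATURES.get(word, []),
    classify=lambda features: len(features) > 0,  # placeholder rule
    n_levels=2,
)
print(results)
```

In this sketch the paraphrase lookup for level i happens only after every word of level i-1 has been judged, which is the ordering that distinguishes Mode 1.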

Mode 2

All intermediate words required for the intermediate judgments are determined through multiple paraphrase searches, and the feature vectors corresponding to the feature words of all the intermediate words are input into at least one classifier for parallel processing.

Specifically, all intermediate words are first found based on the feature words of the candidate word; feature words are then selected from the paraphrase of each intermediate word, and the judgment results of all the intermediate words are obtained through the classifier. In other words, in this mode all the intermediate words are found first, and the judgment results of all the intermediate words are obtained afterwards. In implementation, the judgment results of all the intermediate words may be obtained through one classifier, or through parallel computation by a plurality of classifiers having the same function; the embodiment of the present invention does not limit how this is implemented.
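By way of illustration only, Mode 2 may be sketched in two phases: a collection phase that gathers the intermediate words of all levels, and a judgment phase that classifies them concurrently. The names `collect_intermediate_words` and `mode2_batch`, the use of a thread pool to stand in for "a plurality of classifiers having the same function", and the toy data are assumptions of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def collect_intermediate_words(
    level1_words: List[str],
    select_features: Callable[[str], List[str]],
    n_levels: int,
) -> List[str]:
    """Repeated paraphrase lookup: gather the feature words of levels 1..N."""
    all_words: List[str] = []
    current = list(level1_words)
    for _ in range(n_levels):
        next_level: List[str] = []
        for word in current:
            if word in all_words:
                continue  # already collected at an earlier level
            all_words.append(word)
            next_level.extend(select_features(word))
        current = next_level
    return all_words


def mode2_batch(
    words: List[str],
    select_features: Callable[[str], List[str]],
    classify: Callable[[List[str]], bool],
    workers: int = 2,
) -> Dict[str, bool]:
    """Judge every collected intermediate word, using several workers in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        verdicts = list(pool.map(lambda w: classify(select_features(w)), words))
    return dict(zip(words, verdicts))


# Hypothetical toy data: word -> feature words selected from its paraphrase.
TOY_FEATURES = {
    "precious": ["valuable"],
    "respect": ["respectful"],
    "valuable": ["worth"],
}


def lookup(word: str) -> List[str]:
    return TOY_FEATURES.get(word, [])


words = collect_intermediate_words(["precious", "respect"], lookup, n_levels=2)
verdicts = mode2_batch(words, lookup, classify=lambda features: bool(features))
print(words)
print(verdicts)
```

Because the judgments in this mode do not depend on one another, they can be dispatched in any order; a single worker gives the one-classifier variant.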

In step S160, a final determination result of the candidate word is determined according to the initial determination result of the candidate word and the determination result of the at least one intermediate word, where the final determination result of the candidate word is used to indicate whether the candidate word can be added to the dictionary.

Specifically, for determining whether the candidate word can be added to the dictionary, a decision synthesis for a plurality of words by the classifier is required, that is, whether the candidate word can be added to the dictionary is determined according to the initial determination result of the candidate word and the determination result of the at least one intermediate word.

Thus, in the prior art, whether a candidate word can be added to the dictionary is determined only once, based on the frequency of occurrence of the feature words of the candidate word. By contrast, the embodiment of the invention performs the searching and determining processes multiple times: the determination is based not only on the candidate word but also on at least one intermediate word derived from the feature words of the candidate word. In other words, the determination is made not only by searching the paraphrase of the candidate word, but also based on at least one intermediate word derived from the words in that paraphrase. Feature words whose type tendency (e.g., emotional tendency) is unclear in the paraphrase of the candidate word are therefore judged multiple times, which helps to analyze information such as the type tendency and usage collocation of the feature words of the candidate word, effectively reduces the noise caused by words with ambiguous meanings, and improves the accuracy of the dictionary.

In step S130, optionally, the selecting a feature word of the candidate word from the paraphrase of the candidate word includes:

taking, as the feature words of the candidate word, the words that are both real words in the paraphrase of the candidate word and words collected in the dictionary.

That is, the feature words selected from the paraphrase of the candidate word are words belonging to the dictionary. For example, the candidate word is "Mobao" ("ink treasure"), and its paraphrase is: precious calligraphy and painting; also used as a respectful term for calligraphy or paintings by others. The words in the paraphrase that belong to the dictionary are "precious" and "respect"; the words that do not belong to the dictionary are "calligraphy and painting", "others", "characters" and "painting". The words belonging to the dictionary, namely "precious" and "respect", are then taken as the intermediate words (or 1st-level feature words) of the next judgment.

In addition, by extracting the feature words belonging to the dictionary, keywords related to the type tendency of the candidate words can be focused quickly, the number of intermediate decisions can be reduced, and the processing speed can be increased.

Based on the same manner, when the feature word of the intermediate word is selected from the paraphrase of the intermediate word, the selected feature word of the intermediate word may also belong to the dictionary, and for brevity, the details are not repeated here.
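By way of illustration only, the selection described above amounts to intersecting the words of a paraphrase with the words already collected in the dictionary. The sketch below assumes that the paraphrase has already been segmented into real words; the function name `select_feature_words` and the English tokens are assumptions of this sketch.

```python
from typing import List, Set


def select_feature_words(paraphrase_tokens: List[str], dictionary: Set[str]) -> List[str]:
    """Keep the paraphrase words that the dictionary already contains, in order, without repeats."""
    seen: Set[str] = set()
    selected: List[str] = []
    for token in paraphrase_tokens:
        if token in dictionary and token not in seen:
            seen.add(token)
            selected.append(token)
    return selected


# Hypothetical English rendering of the segmented paraphrase of "Mobao".
mobao_tokens = ["precious", "calligraphy", "painting", "respect", "others", "calligraphy", "painting"]
# Hypothetical excerpt of the dictionary under construction.
dictionary = {"precious", "respect", "valuable"}

feature_words = select_feature_words(mobao_tokens, dictionary)
print(feature_words)
```

The same function can be applied unchanged to the paraphrase of an intermediate word.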

In step S140, optionally, obtaining an initial determination result of the candidate word by using a preset classifier according to the feature word of the candidate word includes:

determining the part of speech of the characteristic word of the candidate word and the syntactic structure to which the characteristic word of the candidate word belongs according to the characteristic word of the candidate word;

Specifically, after feature words are selected from the paraphrase of the candidate word, information such as the part of speech of each feature word and the syntactic structure to which the feature word belongs is extracted by a relevant tool; this information is converted into a feature vector, and the feature vector is input to the classifier to obtain the initial determination result of the candidate word.

In the following, again taking the candidate word "Mobao" as an example, information such as the part of speech of the feature words of the candidate word and the syntactic structure to which the feature words belong is described, where the words in the paraphrase of the candidate word that belong to the dictionary are taken as the feature words.

Candidate word: "Mobao";

Paraphrase: precious calligraphy and painting; also used as a respectful term for calligraphy or paintings by others;

Feature words: "precious", "respect";

Parts of speech of the feature words: adjective, verb;

Syntactic structures to which the feature words belong: attributive-head structure, verb-object structure.

For the feature word "precious", the part of speech is adjective, and the syntactic structure to which "precious" belongs is an attributive-head structure (i.e., precious calligraphy and painting); for the feature word "respect", the part of speech is verb, and the syntactic structure to which "respect" belongs is a verb-object structure (i.e., respecting others).

Similarly, after feature words are selected from the paraphrase of each of the at least one intermediate word, information such as the part of speech of each feature word and the syntactic structure to which the feature word belongs is extracted by a relevant tool; this information is converted into a feature vector and input to the classifier, and the judgment result of each intermediate word is obtained.
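By way of illustration only, the conversion of part-of-speech and syntactic-structure information into a feature vector may be sketched as a simple count encoding. The embodiment does not specify the encoding; the tag inventories `POS_TAGS` and `STRUCTURES`, the function name `encode`, and the structure labels (this sketch's own English renderings of the two structures in the example) are assumptions.

```python
from typing import List, Tuple

# Hypothetical, deliberately small tag inventories.
POS_TAGS = ["adjective", "verb", "noun", "adverb"]
STRUCTURES = ["attributive-head", "verb-object", "subject-predicate"]


def encode(features: List[Tuple[str, str, str]]) -> List[int]:
    """Count vector: one slot per part-of-speech tag, followed by one slot per structure."""
    vector = [0] * (len(POS_TAGS) + len(STRUCTURES))
    for _word, pos, structure in features:
        vector[POS_TAGS.index(pos)] += 1
        vector[len(POS_TAGS) + STRUCTURES.index(structure)] += 1
    return vector


# (feature word, part of speech, syntactic structure) for the "Mobao" example.
mobao_features = [
    ("precious", "adjective", "attributive-head"),
    ("respect", "verb", "verb-object"),
]

vector = encode(mobao_features)
print(vector)
```

The resulting fixed-length vector is what would be passed to the classifier as its input parameter; in practice the tags would come from a part-of-speech tagger and a parser rather than being written by hand.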

By way of example and not limitation, the BoW feature may also be generated based on the feature words of the candidate words or the intermediate words, and used in combination with the embodiments of the present invention, which is not limited herein.
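By way of illustration only, a bag-of-words (BoW) feature over the feature words may be sketched as follows. The vocabulary and the function name `bow_vector` are assumptions of this sketch.

```python
from collections import Counter
from typing import List


def bow_vector(feature_words: List[str], vocabulary: List[str]) -> List[int]:
    """Count how often each vocabulary word occurs among the feature words."""
    counts = Counter(feature_words)
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary; in practice it could be the words collected in the dictionary.
vocabulary = ["precious", "respect", "valuable", "obvious"]

bow = bow_vector(["precious", "respect"], vocabulary)
print(bow)
```

Such a vector may be concatenated with the part-of-speech and syntactic-structure features before being input to the classifier.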

Optionally, the K-th level feature word is specifically a feature word selected from paraphrases of feature words used for generating a first determination result in the K-1-th level feature word, and the first determination result is used for indicating that the probability of the word belonging to the dictionary meets a preset condition.

That is, in the N-group intermediate determination processes, a determination result satisfying a preset condition, that is, a first determination result, is selected from the determination results generated in the K-1-th group intermediate determination, and a feature word selected from the definitions of the feature words of the K-1-th level feature word used for generating the first determination result is used as a K-th level feature word for generating the determination result of the K-th level feature word in the K-th group intermediate determination.

The preset condition may be a threshold set based on a machine learning model, and when the result obtained by calculating the feature word of the intermediate word by the classifier is greater than the threshold, the determination result output by the classifier is yes, otherwise, the determination result output by the classifier is no; as another example, the preset condition may be an artificial empirical rule, and the artificial empirical rule is embodied in the classifier through a function model.

For example, consider the 1st group intermediate decision shown in fig. 3 and a 2nd group intermediate decision (not shown). Assume that, among the decision results of the two intermediate words (i.e., the 1st-level feature words) generated in the 1st group intermediate decision, the decision result #11 of the intermediate word #11 satisfies the preset condition and the decision result #12 of the intermediate word #12 does not. The decision result #11 is then the first decision result, and the feature words of the intermediate word #11 are taken as the intermediate words (i.e., the 2nd-level feature words) of the 2nd group intermediate decision.
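By way of illustration only, the screening of first decision results may be sketched as a threshold filter over the classifier outputs of one level. The scores, the threshold value 0.5, the feature words "a", "b" and "c", and the function name `next_level_words` are assumptions of this sketch.

```python
from typing import Callable, Dict, List


def next_level_words(
    scores: Dict[str, float],
    select_features: Callable[[str], List[str]],
    threshold: float,
) -> List[str]:
    """Expand only the words whose classifier score is greater than the threshold."""
    selected: List[str] = []
    for word, score in scores.items():
        if score > threshold:  # this is a "first decision result"
            for feature in select_features(word):
                if feature not in selected:
                    selected.append(feature)
    return selected


# Hypothetical classifier scores for the two 1st-level intermediate words.
scores = {"word_11": 0.9, "word_12": 0.2}
# Hypothetical feature words selected from the paraphrase of each intermediate word.
features = {"word_11": ["a", "b"], "word_12": ["c"]}

level2 = next_level_words(scores, lambda word: features[word], threshold=0.5)
print(level2)
```

Here only the feature words of `word_11` are carried forward, so the 2nd group intermediate decision processes two words instead of three.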

The specific process of generating the determination result in the embodiment of the present invention is described in detail through various embodiments, and a brief description is given below to the process of generating the final determination result based on a plurality of determination results.

In step S160, optionally, the determining a final determination result of the candidate word according to the initial determination result of the candidate word and the determination result of the intermediate word includes:

and under the condition that the initial judgment result of the candidate word is used for indicating that the probability that the candidate word belongs to the dictionary meets a preset condition, and the judgment result of the intermediate word is used for indicating that the probability that the intermediate word belongs to the dictionary meets the preset condition, determining that the final judgment result of the candidate word is that the candidate word can be added to the dictionary.

Specifically, when all the determination results satisfy the preset condition, it is determined that the candidate word can be added to the dictionary, and conversely, as long as the determination result of one word does not satisfy the preset condition, it is determined that the candidate word cannot be added to the dictionary.

For example, the predetermined condition may be a manual rule of thumb that specifies: if the paraphrase contains the feature words belonging to the dictionary, the initial determination result or the determination result of the intermediate words is yes, if the paraphrase does not contain the feature words belonging to the dictionary, the initial determination result or the determination result of the intermediate words is no, and when all the determination results are yes, it is determined that the candidate words can be added to the dictionary.
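By way of illustration only, the rule that every determination result must be "yes" may be sketched as a logical conjunction. The function name `final_decision` is an assumption of this sketch.

```python
from typing import List


def final_decision(initial: bool, intermediate: List[bool]) -> bool:
    """The candidate word is accepted only if every determination result is 'yes'."""
    return initial and all(intermediate)


accepted = final_decision(True, [True, True])
rejected = final_decision(True, [True, False])
print(accepted, rejected)
```

A single "no" among the intermediate words is enough to reject the candidate word, regardless of the initial determination result.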

In step S160, optionally, the determining a final decision result of the candidate word according to the initial decision result of the candidate word and the decision result of the at least one intermediate word includes:

and determining a final judgment result of the candidate word by using the judgment vector as an input parameter through a preset first formula, wherein the first formula is used for indicating whether the candidate word can be added to the dictionary or not.

Specifically, the final determination result of the candidate word may be determined by a machine-learning model fusion method. For example, the model fusion method may be constructed as follows: some candidate words with known class attributes, together with their final determination results, are used as a training set; based on the manner of generating determination results according to the embodiment of the present invention, the determination results obtained for each candidate word through the classifier are used as features, and a determination vector is generated from all the determination results; a first formula (or function model) is then generated from these determination vectors, so that the first formula serves as a comprehensive decision model for determining whether a candidate word can be added to the dictionary.

Correspondingly, when a candidate word with unknown type tendency is actually judged, the obtained initial judgment result of the candidate word and the judgment results of a plurality of intermediate words are converted into a judgment vector, the judgment vector is used as an input parameter of the first formula, and a final judgment result is obtained through calculation.
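By way of illustration only, the determination vector and the first formula may be sketched with a logistic function. The embodiment does not specify the form of the first formula; the logistic form, the weights `WEIGHTS`, the bias `BIAS`, the fixed vector length, and the padding of missing intermediate results with 0.0 are all assumptions of this sketch. In practice the weights would be fitted on the training set of candidate words with known class attributes.

```python
import math
from typing import List, Sequence


def decision_vector(initial: float, intermediates: Sequence[float], length: int) -> List[float]:
    """Initial result followed by intermediate results, padded with 0.0 to a fixed length."""
    vector = [initial, *intermediates][:length]
    return vector + [0.0] * (length - len(vector))


def first_formula(vector: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Hypothetical first formula: a logistic function of the weighted determination results."""
    z = bias + sum(w * x for w, x in zip(weights, vector))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters, standing in for values learned from a training set.
WEIGHTS = [2.0, 1.5, 1.5]
BIAS = -3.0

# "yes" is encoded as 1.0 and "no" as 0.0.
score_all_yes = first_formula(decision_vector(1.0, [1.0, 1.0], 3), WEIGHTS, BIAS)
score_one_no = first_formula(decision_vector(1.0, [0.0], 3), WEIGHTS, BIAS)

accept_all_yes = score_all_yes > 0.5
accept_one_no = score_one_no > 0.5
print(accept_all_yes, accept_one_no)
```

Unlike the rule that requires every result to be "yes", a learned formula of this kind can weigh the initial determination result and the intermediate results differently.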

FIG. 4 is a detailed flow diagram of a method 200 for constructing a dictionary according to an embodiment of the present invention.

As shown in fig. 4, the method 200 includes the steps of:

Before step 1: candidate words are obtained;

Step 1: finding the paraphrase of a word (i.e., a candidate word);

Step 2: selecting feature words from the paraphrase obtained in step 1, and further generating a feature vector;

Step 3: obtaining a judgment result through the classifier according to the feature words or the feature vector of the feature words;

after the judgment result is obtained, determining whether the number of judgments has reached a preset value;

if not, performing Step 4: taking the feature words extracted from the previous paraphrase as the intermediate words of the next judgment, and repeating steps 1 to 3 until the number of judgments reaches the preset value;

Step 5: performing a comprehensive judgment based on all the judgment results obtained in the previous 4 steps to determine a final judgment result.

Next, a method for constructing a dictionary according to an embodiment of the present invention is described by way of a specific example with reference to a specific flowchart shown in fig. 4.

In this example, the dictionary to be constructed is an emotion dictionary, the paraphrase knowledge base is the Modern Chinese Dictionary, and the candidate words are "Mobao" ("ink treasure") and "change". The number of decisions is 2, i.e., the initial decision and the 1st group intermediate decision described above. The classifier is generated based on a manual rule of thumb, and the specified preset condition is: if the paraphrase includes feature words belonging to the dictionary, the judgment result is "yes"; otherwise, the judgment result is "no". The comprehensive judgment rule is: if the judgment results obtained in both decisions are "yes", the final judgment result is that the candidate word can be added to the dictionary.

First, the method according to the embodiment of the present invention will be described by taking the candidate word "Mobao" as an example.

Initial judgment:

Step 1: find the paraphrase of "Mobao": precious calligraphy and painting; also used as a respectful term for calligraphy or paintings by others;

Step 2: extract the feature words: "precious" and "respect";

Step 3: both feature words in step 2 belong to the emotion dictionary, so the initial judgment result of "Mobao" is "yes";

Step 4: take "precious" and "respect" respectively as the intermediate words of the 1st group intermediate judgment;

1st group intermediate judgment:

Step 1: find the paraphrase of "precious": extremely valuable, very difficult to obtain; find the paraphrase of "respect": to address respectfully;

Step 2: the feature word extracted for "precious" is "valuable", and the feature word extracted for "respect" is "respect";

Step 3: the feature words of both intermediate words in step 2 belong to the emotion dictionary, so the judgment result for "precious" is "yes" and the judgment result for "respect" is "yes";

Step 5: the results of the initial judgment and the 1st group intermediate judgment are all "yes", so the final judgment result is that "Mobao" can be added to the emotion dictionary.

Next, the method according to the embodiment of the present invention will be described by taking the candidate word "change" as an example.

Initial judgment:

Step 1: find the paraphrase of "change": things become significantly different;

Step 2: extract the feature word: "significant";

Step 3: the feature word in step 2 belongs to the emotion dictionary, so the initial judgment result of "change" is "yes";

Step 4: take "significant" as the intermediate word of the 1st group intermediate judgment;

1st group intermediate judgment:

Step 1: find the paraphrase of "significant": very obvious;

Step 2: the feature word extracted for "significant" is "obvious";

Step 3: "obvious" in step 2 does not belong to the emotion dictionary, so the judgment result for "significant" is "no";

Step 5: the judgment results obtained in the initial judgment and the 1st group intermediate judgment include a "no", so the final judgment result is that "change" cannot be added to the emotion dictionary.

As can be seen from the determination result of the candidate word "change" in this example, although the paraphrase of "change" contains the emotional word "significant", "change" itself carries no emotion. This is because the emotional tendency of "significant" depends on usage and collocation, so it may introduce noisy features into a judgment based on the paraphrase. By performing multiple paraphrase searches on "significant", such noise can be filtered out, and the accuracy of the dictionary is improved while the advantage of the broad resources of the paraphrase knowledge base is retained.
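By way of illustration only, the two worked examples may be reproduced with the following sketch, which uses two decisions (the initial decision and the 1st group intermediate decision) and the manual rule of thumb as the classifier. The tokenised paraphrases are hypothetical English stand-ins for the entries of the paraphrase knowledge base (in particular, "respectful" stands in for the feature word extracted for "respect"), and the contents of `EMOTION_DICTIONARY` and the function names are assumptions of this sketch.

```python
from typing import Dict, List

# Hypothetical, already segmented paraphrases (English stand-ins).
PARAPHRASES: Dict[str, List[str]] = {
    "Mobao": ["precious", "calligraphy", "painting", "respect", "others"],
    "precious": ["extremely", "valuable"],
    "respect": ["respectful", "address"],
    "change": ["things", "significant", "different"],
    "significant": ["very", "obvious"],
}

# Hypothetical excerpt of the emotion dictionary; note that "obvious" is absent.
EMOTION_DICTIONARY = {"precious", "respect", "valuable", "respectful", "significant"}


def feature_words(word: str) -> List[str]:
    """Steps 1 and 2: look up the paraphrase and keep the words found in the dictionary."""
    return [t for t in PARAPHRASES.get(word, []) if t in EMOTION_DICTIONARY]


def judge(word: str) -> bool:
    """Step 3, rule of thumb: 'yes' if the paraphrase contains a dictionary word."""
    return bool(feature_words(word))


def can_add(candidate: str) -> bool:
    """Initial decision plus the 1st group intermediate decision, combined by 'all yes'."""
    initial = judge(candidate)
    intermediate_words = feature_words(candidate)        # step 4
    group1 = [judge(word) for word in intermediate_words]
    return initial and all(group1)                       # step 5


mobao_ok = can_add("Mobao")
change_ok = can_add("change")
print(mobao_ok, change_ok)
```

In this sketch "change" passes the initial decision because "significant" is in the dictionary, and is rejected only at the intermediate decision, which mirrors the noise-filtering effect described above.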

Therefore, the method for constructing a dictionary according to the embodiment of the present invention has the following advantages. First, at least one intermediate word is derived from the paraphrase of the candidate word, that is, the at least one intermediate word includes the N-level feature words of the candidate word, where the K-th level feature word is a feature word selected from the paraphrase of the (K-1)-th level feature word (in other words, a next-level feature word is a feature word selected from the paraphrase of the previous-level feature word). A comprehensive decision is made on the candidate word according to the initial determination result generated based on the candidate word and the determination result generated based on each intermediate word, to obtain a final determination result indicating whether the candidate word can be added to the dictionary. For feature words whose type tendency (e.g., emotional tendency) is unclear in the paraphrase of the candidate word, this helps to analyze information such as the type tendency and usage collocation of each feature word among the N-level feature words of the candidate word, which effectively reduces the noise caused by feature words with ambiguous meanings and improves the accuracy of the dictionary. In addition, because the paraphrase knowledge base is rich in resources and can include more new words, performing multiple judgments by looking up the paraphrases of the candidate word or of the N-level feature words of the candidate word helps to expand the dictionary in real time;

Second, by selecting, as the feature words of the candidate word, the words in the paraphrase of the candidate word that are also collected in the dictionary, words that do not belong to the dictionary can be filtered out, the interference caused by feature words with unclear type tendency can be effectively reduced, and the accuracy of the dictionary is further improved. In addition, by selecting feature words belonging to the dictionary, keywords related to the type tendency of the candidate word can be focused on quickly, the number of intermediate judgments can be reduced, and the processing speed is improved;

Third, by extracting, based on the feature words of the candidate word, information such as the part of speech of the feature words and the syntactic structure to which the feature words belong, the semantic information expressed by the paraphrase of the candidate word can be better represented, and information such as the type tendency and usage collocation of the feature words of the candidate word can be effectively analyzed, thereby improving the accuracy of the dictionary;

Fourth, first judgment results meeting the preset condition are screened from the judgment results generated for the (K-1)-th level feature words, and the feature words used for generating the first judgment results are used as the K-th level feature words. In this way, feature words used for generating judgment results indicating that the probability of a word belonging to the dictionary does not meet the preset condition can be filtered out, and a relatively accurate final judgment result can be obtained with a small number N of judgment groups.

The method for constructing a dictionary according to the embodiment of the present invention is described in detail above with reference to fig. 1 to 4, and the apparatus for constructing a dictionary according to the embodiment of the present invention is described in detail below with reference to fig. 5 and 6, and the technical features described in the method embodiment are also applicable to the following apparatus embodiments.

Fig. 5 shows an apparatus for constructing a dictionary according to an embodiment of the present invention, the apparatus 300 including:

an obtaining unit 310, configured to obtain a candidate word;

the obtaining unit is also used for obtaining paraphrases of the candidate words from the paraphrase knowledge base;

a processing unit 320, configured to select a feature word of the candidate word from the paraphrases of the candidate word obtained by the obtaining unit, where the feature word of the candidate word is a real word in the paraphrases of the candidate word;

the processing unit 320 is further configured to obtain an initial determination result of the candidate word through a preset classifier according to the feature word of the candidate word, where the classifier is configured to indicate a probability that a word belongs to the dictionary;

the processing unit 320 is further configured to obtain a determination result of each intermediate word through the classifier according to a feature word selected from the paraphrase of each intermediate word in at least one intermediate word, where the at least one intermediate word includes N-level feature words of the candidate word,

the N-level feature word is a feature word of the candidate word, where N is 1, or,

the K-th level feature word in the N-level feature words is a feature word selected from the paraphrases of the K-1-th level feature word in the N-level feature words, both N and K are integers greater than 1, and K is less than or equal to N;

the processing unit 320 is further configured to determine a final decision result of the candidate word according to the initial decision result of the candidate word and the decision result of the at least one intermediate word, where the final decision result of the candidate word is used to indicate whether the candidate word can be added to the dictionary.

Therefore, the apparatus for constructing a dictionary according to the embodiment of the present invention derives at least one intermediate word from the paraphrase of the candidate word, that is, the at least one intermediate word includes the N-level feature words of the candidate word, and the K-level feature word is a feature word selected from the paraphrase of the K-1-level feature word (or, the next-level feature word is a feature word selected from the paraphrase of the previous-level feature word), and performs a comprehensive decision on the candidate word according to the initial decision result generated based on the candidate word and the decision result generated based on each intermediate word, to obtain a final decision result for determining whether the candidate word can be added to the dictionary, and for a feature word whose type tendency (e.g., emotional tendency) is not obvious in the paraphrase of the candidate word, it may help to resolve the information such as the type tendency and usage collocation of each feature word in the N-level feature words of the candidate word, furthermore, noise caused by the characteristic words of the fuzzy meaning can be effectively reduced, and the accuracy of the dictionary is improved; in addition, because the paraphrase knowledge base has more resources and can collect more new words, the embodiment of the invention can judge for many times by searching the candidate word or the paraphrase words of the N-level characteristic words of the candidate word, thereby being beneficial to expanding the dictionary in real time.

Optionally, the obtaining unit 310 is specifically configured to:

Therefore, the device can filter the words which do not belong to the dictionary by selecting the words which are in common with the words collected by the dictionary in the paraphrases of the candidate words as the feature words of the candidate words, thereby more effectively reducing the interference caused by the feature words of the candidate words with unobvious type tendency and further improving the accuracy of the dictionary.

Optionally, the processing unit 320 is specifically configured to:

Therefore, the device extracts the information such as the part of speech of the characteristic word of the candidate word, the syntactic structure to which the characteristic word of the candidate word belongs and the like through the characteristic word of the candidate word, can better represent semantic information expressed by the paraphrase of the candidate word, and can further effectively analyze information such as the type tendency, the usage collocation and the like of the characteristic word of the candidate word, thereby improving the accuracy of the dictionary.

Therefore, by screening out the first judgment results meeting the preset condition from the judgment results generated for the (K-1)-th level feature words, and using the feature words used for generating the first judgment results as the K-th level feature words, the device can filter out the feature words used for generating judgment results indicating that the probability of a word belonging to the dictionary does not meet the preset condition, and can obtain a relatively accurate final judgment result with a small number N of judgment groups.

The processing unit 320 is specifically configured to:

Fig. 6 shows an apparatus for constructing a dictionary according to an embodiment of the present invention, which includes an input device 410, an output device 420, a processor 430, and a memory 440, where the input device 410, the output device 420, the processor 430, and the memory 440 are in communication with each other through an internal connection path.

The memory 440 stores programs. In particular, a program may include program code comprising computer operating instructions. The memory 440 may include both read-only memory and random-access memory, and provides instructions and data to the processor 430. The memory 440 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.

The processor 430 executes the program stored in the memory 440, and the memory 440 may be integrated with the processor 430 or may be independent of the processor 430.

Specifically, the processor 430 is configured to:

acquiring candidate words;

obtaining the paraphrase of the candidate word from a paraphrase knowledge base;

selecting a characteristic word of the candidate word from the paraphrases of the candidate word acquired from the acquisition unit, wherein the characteristic word of the candidate word is a real word in the paraphrases of the candidate word;

obtaining a judgment result of each intermediate word through the classifier according to the feature word selected from the paraphrase of each intermediate word in the at least one intermediate word, wherein the at least one intermediate word comprises the N-level feature words of the candidate word,

and determining a final judgment result of the candidate word according to the initial judgment result of the candidate word and the judgment result of the at least one intermediate word, wherein the final judgment result of the candidate word is used for indicating whether the candidate word can be added to the dictionary or not.

Therefore, the apparatus provided in the embodiment of the present invention derives at least one intermediate word from the definitions of the candidate word, that is, the at least one intermediate word includes N-level feature words of the candidate word, and the K-level feature word is a feature word selected from the definitions of the K-1-level feature word (or, the next-level feature word is a feature word selected from the definitions of the previous-level feature word), and performs a comprehensive decision on the candidate word according to an initial decision result generated based on the candidate word and a decision result generated based on each intermediate word to obtain a final decision result for determining whether the candidate word can be added to the dictionary, so as to help to resolve information such as type tendency and usage collocation of each feature word in the N-level feature words of the candidate word for which type tendency (e.g., emotion tendency) is not obvious in the definitions of the candidate word, furthermore, noise caused by the characteristic words of the fuzzy meaning can be effectively reduced, and the accuracy of the dictionary is improved; in addition, because the paraphrase knowledge base has more resources and can collect more new words, the embodiment of the invention can judge for many times by searching the candidate word or the paraphrase words of the N-level characteristic words of the candidate word, thereby being beneficial to expanding the dictionary in real time.

Optionally, the processor 430 is specifically configured to:

The processor 430 is specifically configured to:

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is only one logical division, and other divisions are possible in actual implementation. A plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed across a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention essentially, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The above description is only a specific implementation of the embodiments of the present invention, but the scope of the embodiments of the present invention is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the embodiments of the present invention, and all such changes or substitutions should be covered by the scope of the embodiments of the present invention. Therefore, the protection scope of the embodiments of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for constructing a dictionary, the method comprising:

acquiring candidate words;

obtaining paraphrases of the candidate words from a paraphrase knowledge base;

2. The method of claim 1, wherein selecting a feature word of the candidate word from the paraphrases of the candidate word comprises:

3. The method according to claim 1 or 2, wherein obtaining an initial determination result of the candidate word by a preset classifier according to the feature word of the candidate word comprises:

4. The method according to claim 1 or 2, wherein the K-th level feature words are specifically feature words selected from the paraphrases of those (K-1)-th level feature words that are used for generating a first determination result, and the first determination result is used for indicating that the probability that a word belongs to the dictionary satisfies a preset condition.

5. The method according to claim 1 or 2, wherein determining a final decision result of the candidate word according to the initial decision result of the candidate word and the decision result of the at least one intermediate word comprises:

6. The method according to claim 1 or 2, wherein determining a final decision result of the candidate word according to the initial decision result of the candidate word and the decision result of the at least one intermediate word comprises:

7. An apparatus for constructing a dictionary, the apparatus comprising:

an obtaining unit, configured to obtain candidate words;

the obtaining unit is further configured to obtain paraphrases of the candidate words from a paraphrase knowledge base;

a processing unit, configured to select a feature word of the candidate word from the paraphrases of the candidate word obtained by the obtaining unit, wherein the feature word of the candidate word is an actual word in the paraphrases of the candidate word;

the processing unit is further configured to obtain an initial determination result of the candidate word through a preset classifier according to the feature word of the candidate word, where the classifier is configured to indicate a probability that one word belongs to the dictionary;

the processing unit is further configured to obtain a determination result of each intermediate word through the classifier according to a feature word selected from a paraphrase of each intermediate word in at least one intermediate word, where the at least one intermediate word includes N-level feature words of the candidate word,

the processing unit is further configured to determine a final decision result of the candidate word according to the initial decision result of the candidate word and the decision result of the at least one intermediate word, where the final decision result of the candidate word is used to indicate whether the candidate word can be added to the dictionary.

8. The apparatus according to claim 7, wherein the obtaining unit is specifically configured to:

9. The apparatus according to claim 7 or 8, wherein the processing unit is specifically configured to:

10. The apparatus according to claim 7 or 8, wherein the K-th level feature words are specifically feature words selected from the paraphrases of those (K-1)-th level feature words that are used for generating a first determination result, and the first determination result is used for indicating that the probability that a word belongs to the dictionary satisfies a preset condition.

11. The apparatus according to claim 7 or 8, wherein the processing unit is specifically configured to:

12. The apparatus according to claim 7 or 8, wherein the processing unit is specifically configured to:

13. A computer-readable storage medium characterized in that it stores a computer program that causes a terminal device to execute the method for constructing a dictionary of any one of claims 1 to 6.