CN110569503B - Word statistics and WordNet-based semantic item representation and disambiguation method - Google Patents


Info

Publication number
CN110569503B
CN110569503B (application CN201910803617.XA)
Authority
CN
China
Prior art keywords
word
vector
sense
term
synonym
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910803617.XA
Other languages
Chinese (zh)
Other versions
CN110569503A (en)
Inventor
朱新华
郭青松
温海旭
陈宏朝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yami Technology Guangzhou Co ltd
Original Assignee
Yami Technology Guangzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yami Technology Guangzhou Co ltd filed Critical Yami Technology Guangzhou Co ltd
Priority to CN201910803617.XA priority Critical patent/CN110569503B/en
Publication of CN110569503A publication Critical patent/CN110569503A/en
Application granted granted Critical
Publication of CN110569503B publication Critical patent/CN110569503B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/3331: Query processing
    • G06F 16/334: Query execution
    • G06F 16/3344: Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a sense item representation and disambiguation method based on word statistics and WordNet. It uses the well-organized, internationally accepted word sense sets and synonym sets in WordNet as prior knowledge, and proposes a sense item vector generation method based on Wikipedia word statistics.

Description

Word statistics and WordNet-based semantic item representation and disambiguation method
Technical Field
The invention relates to the field of natural language understanding in artificial intelligence, in particular to a word statistics and WordNet-based meaning item representation and disambiguation method.
Background
Deep learning has developed rapidly in the field of artificial intelligence; it excels in image processing and is widely applied in natural language processing. With the combination of deep neural networks and natural language processing, word vectors were proposed. They address the vector representation of natural language in a neural network by converting words into dense vectors, such that similar words lie close together in the vector space. In natural language processing applications, word vectors serve as the input features of deep learning models, so the effect of the final model largely depends on the quality of the word vectors.
Word vectors based on neural networks are trained on big data and are therefore fairly accurate, but because the weights of the input layer of the training network are used directly as word vectors, the vector dimensions lack semantic interpretation, and sense item vectors cannot be obtained by combining such word vectors. Statistics-based word vectors use words as dimensions, so the dimensions carry rich semantics and sense item vectors can be obtained by combining word vectors. However, since word ambiguity is common in natural language, the correct word sense must be used for a computer to understand natural language accurately. A word sense is the specific meaning a word reflects in a particular language environment; it has more similar and concrete semantic attributes in context and better reflects the relations between words. At present, the various word vector models generally generate a single vector per word and do not train sense item vectors, so in practical applications each word can only use its unique word vector for semantic computation in different language environments, which greatly reduces the accuracy of semantic computation.
Disclosure of Invention
The invention aims to solve the problem that each word can currently use only a single word vector for semantic computation in different language environments, which greatly reduces the accuracy of semantic computation, and provides a sense item representation and disambiguation method based on word statistics and WordNet.
In order to solve the problems, the invention is realized by the following technical scheme:
A word statistics and WordNet based method of sense item representation and disambiguation, comprising the steps of:
Step 1: acquire the offline web page files of Wikipedia and preprocess them to obtain a preprocessed Wikipedia corpus;
Step 2: on the preprocessed Wikipedia corpus, select the K most frequent words as training target words and vector dimension words, and perform word statistics training to obtain a word co-occurrence matrix and word vectors;
Step 3: acquire the sense item set of a word and the synonym set of each sense item from WordNet;
Step 4: generate the sense item vectors of the word by combining the word co-occurrence matrix and word vectors obtained in step 2 with the sense item sets and synonym sets obtained in step 3;
Step 5: acquire the annotation set of each sense item of the word from WordNet;
Step 6: form a list of text pairs to be compared from the sense item annotation sentences and the disambiguation text;
Step 7: perform root reduction on the texts in the list of text pairs to be compared, and extract the nouns and verbs of each text as its core semantic bag, thereby converting the comparison of text pairs into the comparison of core semantic bags consisting of nouns and verbs;
Step 8: calculate the similarity between the annotation set of each sense item of the word and the disambiguation text through the core semantic bags;
Step 9: according to the similarity between the annotation set of each sense item and the disambiguation text, output the sense item whose annotation set has the highest similarity with the disambiguation text as the disambiguation result.
The specific process of step 4 is as follows:
Step 4.1: for the i-th sense item t_i of word t, take the word vector V(t) of word t as the initialization sense item vector SV_0(t_i) of t_i, i.e., let SV_0(t_i) = V(t);
Step 4.2: for each single-sense synonym st in the synonym set, iteratively merge its word vector V(st) with the initialization sense item vector SV_0(t_i) according to the following formula, generating the first-order sense item vector SV_1(t_i) of sense item t_i:

SV_1(t_i) = {(s_i, wt(s_i, SV_0(t_i)) + wt(s_i, V(st))) | s_i ∈ D_1 ∪ D_2}

where wt(s_i, SV_0(t_i)) denotes the weight of dimension word s_i in the initialization sense item vector SV_0(t_i), wt(s_i, V(st)) denotes the weight of dimension word s_i in the word vector V(st), D_1 denotes the set of dimension words with non-zero weight in SV_0(t_i), and D_2 denotes the set of dimension words with non-zero weight in V(st);
Before each iterative merge, the result SV_1(t_i) of the previous merge is used as the initialization sense item vector SV_0(t_i), i.e., let SV_0(t_i) = SV_1(t_i); if the synonym set of sense item t_i contains no single-sense synonym, let SV_1(t_i) = SV_0(t_i);
Step 4.3: for each multi-sense synonym dt in the synonym set, iteratively merge its word vector V(dt) with the first-order sense item vector SV_1(t_i) to generate the second-order sense item vector SV_2(t_i) of sense item t_i, where wt(s_i, SV_1(t_i)) denotes the weight of dimension word s_i in the first-order sense item vector SV_1(t_i), wt_2(s_i, V(dt)) denotes the weight contribution of dimension word s_i from the word vector V(dt), wt(s_j, SV_1(t_i)) denotes the weight of dimension word s_j in SV_1(t_i), D_3 denotes the set of dimension words with non-zero weight in SV_1(t_i), and D_4 denotes the set of dimension words with non-zero weight in V(dt);
Before each iterative merge, the result SV_2(t_i) of the previous merge is used as the first-order sense item vector SV_1(t_i), i.e., let SV_1(t_i) = SV_2(t_i); if the synonym set of sense item t_i contains no multi-sense synonym, let SV_2(t_i) = SV_1(t_i);
Step 4.4: for each combined synonym ct in the synonym set of sense item t_i, take the word vector V(ft) of the first word ft in ct as the initialization combined word vector CV_0(ct) of ct, i.e., let CV_0(ct) = V(ft);
Step 4.5: for each independent word at in the combined synonym ct, iteratively merge its word vector V(at) with the initialization combined word vector CV_0(ct) to generate the first-order combined word vector CV_1(ct) of ct, where wt(s_i, CV_0(ct)) denotes the weight of dimension word s_i in the initialization combined word vector CV_0(ct), wt(s_i, V(at)) denotes the weight of dimension word s_i in the word vector V(at), wt(s_j, CV_0(ct)) denotes the weight of dimension word s_j in CV_0(ct), D_5 denotes the set of dimension words with non-zero weight in CV_0(ct), and D_6 denotes the set of dimension words with non-zero weight in V(at);
Before each iterative merge, the result CV_1(ct) of the previous merge is used as the initialization combined word vector CV_0(ct), i.e., let CV_0(ct) = CV_1(ct);
Step 4.6: iteratively merge the second-order sense item vector SV_2(t_i) obtained in step 4.3 with each first-order combined word vector CV_1(ct) obtained in step 4.5 to generate the final vector SFV(t_i) of sense item t_i, where wt(s_i, SV_2(t_i)) denotes the weight of dimension word s_i in the second-order sense item vector SV_2(t_i), wt(s_i, CV_1(ct)) denotes the weight of dimension word s_i in the first-order combined word vector CV_1(ct), wt(s_j, SV_2(t_i)) denotes the weight of dimension word s_j in SV_2(t_i), D_7 denotes the set of dimension words with non-zero weight in SV_2(t_i), and D_8 denotes the set of dimension words with non-zero weight in CV_1(ct);
Before each iterative merge, the result SFV(t_i) of the previous merge is used as the second-order sense item vector SV_2(t_i), i.e., let SV_2(t_i) = SFV(t_i); if the synonym set of sense item t_i contains no combined synonym, let SFV(t_i) = SV_2(t_i).
In the above scheme, for the generated word vectors and sense item vectors, only the dimension words with non-zero weight and their weights are stored; the weight of any dimension word that is not stored defaults to 0.
In step 8, the similarity sim(gloss(t_i), text_t) between the annotation set gloss(t_i) of sense item t_i of word t and the disambiguation text text_t in which the word t to be disambiguated is located is:

sim(gloss(t_i), text_t) = max{sim(glBag_j, textBag) | j ∈ [1, p_i]}

sim(glBag_j, textBag) = (Σ_{u∈B_1} max_{v∈B_2} sim(u, v) + Σ_{v∈B_2} max_{u∈B_1} sim(u, v)) / (|B_1| + |B_2|)

sim(u, v) = 2 · depth(LCS(u, v)) / (depth(u) + depth(v))

where sim(glBag_j, textBag) denotes the similarity between the core semantic bags glBag_j and textBag, max{·} denotes the maximum value, glBag_j denotes the core semantic bag of nouns and verbs extracted from annotation sentence gl_j, gl_j denotes an annotation sentence (annotation sentences are separated by semicolons) in the annotation set gloss(t_i) of sense item t_i, textBag denotes the core semantic bag of nouns and verbs extracted from the text text_t to be disambiguated, p_i denotes the number of annotation sentences in the annotation set gloss(t_i) of sense item t_i, B_1 denotes the core semantic bag glBag_j, B_2 denotes the core semantic bag textBag, |·| denotes the number of words in a core semantic bag, depth(u) and depth(v) denote the depths of words u and v in the WordNet hierarchy, LCS(u, v) denotes the nearest common parent node of words u and v in the WordNet hierarchy, and depth(LCS(u, v)) denotes the depth of that nearest common parent node in the WordNet hierarchy.
Compared with the prior art, the invention uses the well-organized, internationally accepted word sense sets and synonym sets in WordNet as prior knowledge and proposes a sense item vector generation method based on Wikipedia word statistics.
Drawings
FIG. 1 is a flow chart of a word statistics and WordNet based semantic item representation and disambiguation method.
Detailed Description
The present invention will be further described in detail below with reference to specific examples, in order to make its objects, technical solutions and advantages more apparent.
The word statistics and WordNet based sense item representation and disambiguation method specifically comprises the following steps, as shown in FIG. 1:
First, the offline page files of Wikipedia are obtained; illegal characters in them are converted into spaces; pictures and tables are deleted, keeping only their titles; links keep only their text; and finally plain text containing a-z (with A-Z converted to lower case) and digits is retained. After cleaning, a co-occurrence matrix is generated by the word statistics model, the corresponding word vectors are obtained from the co-occurrence matrix, and the initial word vectors are formed and used as input to the sense item generation model. The sense items of each word in WordNet and their synonym sets are then taken as further input, the aim being to generate the corresponding sense item vector from the synonym set of the word. The model first obtains all words of the synonym set from the initial word vectors by table lookup, takes the input word as the source word, uses its word vector as the baseline, and merges it with the vectors of the other words in the synonym set. The problem of combined words also arises; since there is no established method for generating the vector of a combined word, the solution of the invention is to split the combined word into individual words and then merge the vectors of those words, i.e., weights are added on the intersecting dimensions and kept unchanged on the non-intersecting ones. Finally, the vector of each sense item of the word is output.
1. Generating a Wikipedia-based word co-occurrence matrix and word vectors
The invention trains word statistics vectors on the open Wikipedia corpus. Because Wikipedia contains a very large vocabulary, the 300,000 most frequent words are taken as training target words and dimension words, finally yielding word vectors for these 300,000 words. Each word vector has 300,000 dimensions, and each dimension is a word and therefore has a specific meaning. For example, one word statistics vector is shown below:
V(deckhand) = {(guinean, 0.284611), (trawler, 0.250539), (cowell, 0.247986), ...}
the specific generation steps are as follows:
(1) Download and preprocess the offline page data file of Wikipedia.
The offline page data file of Wikipedia is first obtained through the dump backup database provided by Wikipedia. The invention uses the JWPL (Java Wikipedia Library) tool to parse the Wikipedia download database; JWPL runs on an optimized database created from the Wikipedia download database and allows quick access to Wikipedia page articles, categories, links, redirects, etc. An offline Wikipedia page contains various kinds of data: text, pictures, tables, links, and characters unique to web pages. The invention cleans the Wikipedia page data using formula (1), finally leaving the text data required for training word vectors: uppercase characters in the range A-Z are converted to lowercase a-z, and non-displayable symbols are converted into spaces. The preprocessed Wikipedia page data is input to the word statistics model of step (2).

Page_wiki = {lower(w) | w ∈ S}  (1)

where lower is the function converting characters to lowercase and S is the set of displayable characters and digits.
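As an illustration, the cleaning of formula (1) can be sketched in a few lines of Python. The exact character set S and the JWPL parsing step are not reproduced here, so the regular expression below is an assumption that keeps only lowercase letters, digits and spaces:

import re

def clean_wiki_text(raw: str) -> str:
    # lower(w): convert A-Z to a-z
    text = raw.lower()
    # map symbols outside the assumed displayable set S (letters, digits) to spaces
    text = re.sub(r"[^a-z0-9]+", " ", text)
    # collapse the runs of spaces left by the substitution
    return re.sub(r"\s+", " ", text).strip()

print(clean_wiki_text("Brazil is the largest country in South America!"))
# -> "brazil is the largest country in south america"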
(2) Generate the word co-occurrence matrix and word vectors based on Wikipedia word statistics.
For the preprocessed Wikipedia corpus, since the vocabulary is abundant, and in order to make training convenient and generate more effective word vectors, the invention finally selects the K most frequent words as training target words and vector dimension words for word statistics training, obtaining the word co-occurrence matrix and word vectors shown in formulas (2) and (3).

C = [w_{i,j}], i, j ∈ [1, K]  (2)

Formula (2) represents the Wikipedia-based word co-occurrence matrix, a K × K weight matrix in which each row is the word vector, i.e., the weight vector, of one target word;

V(t_i) = {(t_j, w_{i,j}) | t_j ∈ T_K}  (3)

Formula (3) represents the word vector of target word t_i in the co-occurrence matrix, which consists of K dimension-word/weight pairs of the form (t_j, w_{i,j}), where T_K denotes the set of K dimension words, t_j denotes a dimension word in T_K, and w_{i,j} denotes the weight of target word t_i on dimension word t_j. The weight w_{i,j} is computed as follows: a co-occurrence window centered on target word t_i with L words on each side is defined, with L ∈ [2,5]; the weight w_{i,j} of target word t_i on dimension word t_j is then calculated according to formula (4), where f(t_i, t_j) is the number of co-occurrences of words t_i and t_j measured in the Wikipedia corpus under the specified co-occurrence window, and f(t_i), f(t_j) denote the numbers of occurrences of t_i and t_j in the Wikipedia corpus, respectively.
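As a sketch of this step, the following Python fragment counts window co-occurrences and derives a sparse word vector. Since formula (4) is not reproduced above, the normalization f(t_i, t_j) / (f(t_i) · f(t_j)) used below is only an illustrative assumption, not the patent's exact weighting:

from collections import Counter, defaultdict

def cooccurrence_vectors(tokens, vocab, L=2):
    # f(ti): occurrence counts of target/dimension words
    f = Counter(t for t in tokens if t in vocab)
    # f(ti, tj): co-occurrence counts within a window of L words on each side
    co = defaultdict(Counter)
    for i, ti in enumerate(tokens):
        if ti not in vocab:
            continue
        window = tokens[max(0, i - L):i] + tokens[i + 1:i + 1 + L]
        for tj in window:
            if tj in vocab:
                co[ti][tj] += 1
    # w_ij: assumed normalization standing in for formula (4)
    return {ti: {tj: c / (f[ti] * f[tj]) for tj, c in row.items()}
            for ti, row in co.items()}

tokens = "the trawler deckhand hauled the net while the deckhand sang".split()
V = cooccurrence_vectors(tokens, vocab=set(tokens), L=2)
print(V["deckhand"])   # sparse word vector V(deckhand): {dimension word: weight}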
2. Acquiring the sense item set and synonym sets of a word from WordNet
The invention uses the internationally accepted WordNet as the prior knowledge base for the sense item set and synonym sets of an input word. In WordNet, if a word is ambiguous, multiple sense items are provided; each sense item typically consists of several synonyms forming a synonym set, together with a corresponding annotation. Annotations are usually very simple, easy-to-understand sentences giving a definition and examples. Lexical linguistics starts from an understanding of words: a word is generally divided into word form and word sense, where the word form usually designates the source word or headword, and the word sense is the lexical concept the word form represents. The same source word may express different word senses in different contexts, so in order to distinguish word senses better, the grammatical classification of a word is usually represented by a mapping between word forms and word senses. In WordNet, some word forms correspond to several different word senses, i.e., one word with multiple senses; some word senses can be expressed by different word forms, i.e., one sense with multiple words. The original word vectors based on Wikipedia word statistics come from the Wikipedia text corpus alone and cannot distinguish the synonym sets and multiple senses of a word. Therefore, the WordNet dictionary is used to acquire the sense item set of a word and its synonym sets, with the following steps:
(1) Input the index word t into the WordNet dictionary.
(2) Search the WordNet dictionary to obtain the sense item set SenSet(t) of word t as shown in formula (5), and the synonym set sense(t_i) of each sense item as shown in formula (6):

SenSet(t) = {sense(t_i) | i ∈ [1, n_t]}  (5)

Formula (5) indicates that word t has n_t sense items in WordNet, where n_t is a positive integer;

sense(t_i) = {t, t_j | j ∈ [1, m_i]}  (6)

Formula (6) indicates that the synonym set of the i-th sense item of word t consists of t and m_i words other than t, where m_i is 0 or a positive integer.
For example, the word Brazil has 2 sense items in WordNet:
Brazil sense item 1: {source word: Brazil, synonym: Federative Republic of Brazil, synonym: Brasil}
Brazil sense item 2: {source word: Brazil, synonym: brazil nut}.
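These sets can be reproduced quickly, shown here as a sketch using NLTK's WordNet interface (the patent itself does not prescribe a particular WordNet API):

from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def sen_set(word):
    # SenSet(t) of formula (5): one synonym set sense(ti) per WordNet synset
    return [[lemma.name() for lemma in syn.lemmas()] for syn in wn.synsets(word)]

for i, sense in enumerate(sen_set("Brazil"), start=1):
    print(f"sense item {i}: {sense}")
# Expected output (order may vary with the WordNet version):
# sense item 1: ['Brazil', 'Federative_Republic_of_Brazil', 'Brasil']
# sense item 2: ['brazil_nut', 'brazil']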
3. Generating the sense item vectors of a word by merging the word vectors of its synonyms
The steps of the invention's sense item vector generation method based on Wikipedia word statistics and WordNet are as follows:
(1) Initialize the sense item vector.
For the i-th sense item t_i of word t, the invention initializes the sense item vector SV_0(t_i) with the word vector V(t) generated by formula (3):

SV_0(t_i) = V(t)  (7)
(2) Merge the initial sense item vector with single-sense synonym vectors.
The invention defines a synonym with only one sense item in WordNet as a single-sense synonym. The fewer sense items a source word has, the more definite the meaning it expresses, the less its ambiguity, and the less interference in the resulting word vector; therefore, the invention directly adds the weights of the single-sense synonym vector and the source word vector, so that single-sense synonym vectors are emphasized in the generation of the sense item vector.
If the synonym set sense(t_i) of sense item t_i contains a single-sense synonym st, the invention uses formula (8) to merge the word vector V(st) of st with the initial sense item vector SV_0(t_i), generating the new sense item vector SV_1(t_i):

SV_1(t_i) = {(s_i, wt(s_i, SV_0(t_i)) + wt(s_i, V(st))) | s_i ∈ D_1 ∪ D_2}  (8)

where SV_0(t_i) is generated by formula (7), V(st) is generated by formula (3), D_1 denotes the set of dimension words with non-zero weight in SV_0(t_i), D_2 denotes the set of dimension words with non-zero weight in V(st), and the function wt(s, V) denotes the weight of dimension word s in vector V.
(3) Repeat step (2), merging the word vectors of all single-sense synonyms in sense(t_i) into the sense item vector SV_1(t_i). Before each merge, the result SV_1(t_i) of the previous merge is used as the initialization vector SV_0(t_i), i.e., let SV_0(t_i) = SV_1(t_i); if the synonym set sense(t_i) of sense item t_i contains no single-sense synonym, let SV_1(t_i) = SV_0(t_i).
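Since only non-zero weights are stored (see section 5 below), formula (8) reduces to adding weights over the union of two sparse dictionaries. A minimal Python sketch with invented placeholder weights:

def merge_single_sense(sv0, v_st):
    # Formula (8): add weights over D1 ∪ D2, the union of the non-zero
    # dimension words of SV0(ti) and of V(st)
    return {s: sv0.get(s, 0.0) + v_st.get(s, 0.0) for s in sv0.keys() | v_st.keys()}

sv = {"trawler": 0.25, "guinean": 0.28}            # SV0(ti) = V(t)
for v_st in [{"trawler": 0.10, "net": 0.05}]:      # vectors of single-sense synonyms
    sv = merge_single_sense(sv, v_st)              # SV0(ti) <- SV1(ti) after each merge
print(sv)   # weights: trawler 0.35, guinean 0.28, net 0.05 (key order may vary)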
(4) Merge the sense item vector with multi-sense synonym vectors.
The invention defines a synonym with multiple sense items in WordNet as a multi-sense synonym. In order to reduce the negative effect of multi-sense synonym vectors on the generation of the sense item vector, the invention uses formula (9) to combine the multi-sense synonym vector V(dt) nonlinearly with the sense item vector SV_1(t_i), generating the new sense item vector SV_2(t_i), where SV_1(t_i) is generated by steps (2) and (3), dt denotes a multi-sense synonym in the synonym set sense(t_i) of sense item t_i, V(dt) is the word vector of dt generated by formula (3), D_3 denotes the set of dimension words with non-zero weight in SV_1(t_i), and D_4 denotes the set of dimension words with non-zero weight in V(dt).
(5) Repeat step (4), merging the word vectors of all multi-sense synonyms in sense(t_i) into the sense item vector SV_2(t_i). Before each merge, the result SV_2(t_i) of the previous merge is used as the initialization vector SV_1(t_i), i.e., let SV_1(t_i) = SV_2(t_i); if the synonym set sense(t_i) of sense item t_i contains no multi-sense synonym, let SV_2(t_i) = SV_1(t_i).
(6) Generate combined synonym vectors.
The invention defines a synonym consisting of several independent words in a WordNet synonym set as a combined synonym, e.g. the combined word computed_axial_tomography → word 1: computed + word 2: axial + word 3: tomography. For a combined synonym ct in the synonym set sense(t_i) of sense item t_i, the invention first initializes the combined word vector CV_0(ct) with the word vector of the first word ft in ct, according to formula (10):

CV_0(ct) = V(ft)  (10)

Then, for each independent word at in the combined synonym ct, the invention uses formula (11) to merge the word vector V(at) with the combined word vector CV_0(ct), generating the new combined word vector CV_1(ct), where CV_0(ct) is generated by formula (10), V(at) is generated by formula (3), D_5 denotes the set of dimension words with non-zero weight in CV_0(ct), and D_6 denotes the set of dimension words with non-zero weight in V(at).
Finally, formula (11) is applied repeatedly to merge all independent word vectors of the combined synonym ct into the combined word vector CV_1(ct). Before each merge, the result CV_1(ct) of the previous merge is used as the initialization vector CV_0(ct), i.e., let CV_0(ct) = CV_1(ct).
(7) Merge the sense item vector with combined synonym vectors.
For the combined synonym vector CV_1(ct) generated in step (6), the invention uses formula (12) to merge it with the sense item vector SV_2(t_i), generating the final vector SFV(t_i) of sense item t_i, where SV_2(t_i) is generated by steps (4) and (5), CV_1(ct) is generated by step (6), D_7 denotes the set of dimension words with non-zero weight in SV_2(t_i), and D_8 denotes the set of dimension words with non-zero weight in CV_1(ct).
(8) Repeat step (7), merging all combined synonym vectors in sense(t_i) into SFV(t_i). Before each merge, the result SFV(t_i) of the previous merge is used as the initialization vector SV_2(t_i), i.e., let SV_2(t_i) = SFV(t_i); if the synonym set sense(t_i) of sense item t_i contains no combined synonym, let SFV(t_i) = SV_2(t_i).
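The following sketch illustrates steps (6)-(8) for one combined synonym. Formulas (11) and (12) are not reproduced above, so the simple weight addition over the union of dimensions is an illustrative assumption standing in for the patent's exact merge, and the vectors are invented placeholders:

def merge_combined_synonym(words, vectors):
    # CV0(ct) = V(ft): initialize with the first word's vector (formula (10))
    cv = dict(vectors[words[0]])
    # fold in each remaining independent word 'at' of the combined synonym
    # (assumed additive merge in place of the unreproduced formula (11))
    for at in words[1:]:
        v_at = vectors[at]
        cv = {s: cv.get(s, 0.0) + v_at.get(s, 0.0) for s in cv.keys() | v_at.keys()}
    return cv

vectors = {"computed": {"scan": 0.4},
           "axial": {"axis": 0.3, "scan": 0.1},
           "tomography": {"scan": 0.5, "image": 0.2}}
ct = "computed_axial_tomography"
print(merge_combined_synonym(ct.split("_"), vectors))
# weights: scan 1.0, axis 0.3, image 0.2 (key order may vary)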
4. Method for sense item disambiguation based on WordNet
Since word ambiguity is common in natural language, automatic disambiguation is important for a computer to understand natural language accurately. A word sense is the specific meaning a word reflects in a particular language environment; it has more definite and concrete semantic attributes in context and better reflects the relations between words. Sense item disambiguation means determining the sense item that a word takes in a specific text; it is a precondition for, and the matching method of, applying sense item vectors.
The invention provides a WordNet-based sense item disambiguation method that can be used together with the sense item vectors generated by the invention in practical applications. The WordNet-based sense item disambiguation steps are as follows:
(1) Obtain the annotation set of each sense item of a word from WordNet.
Sense item annotations are extracted as follows. In WordNet, for one sense item of a word, all synonyms are put into a collection as the synonym set of that sense item; in addition, an annotation is attached, usually consisting of a few simple sentences (typically a definition and example sentences), so that WordNet users can distinguish easily confusable word senses. The invention extracts the annotation sentences of a sense item's synonym set from WordNet and defines the annotation set as shown in formula (13):

gloss(t_i) = {gl_j | j ∈ [1, p_i]}  (13)

Formula (13) states that the annotation set of sense item t_i consists of p_i annotation sentences, where gl_j denotes any one of the annotation sentences, which are separated by semicolons.
For example, the entry for Brazil in WordNet is:
Brazil:
1. Brazil, Federative Republic of Brazil, Brasil -- (the largest Latin American country and the largest Portuguese speaking country in the world; located in the central and northeastern part of South America; world's leading coffee exporter)
2. brazil nut, brazil -- (three-sided tropical American nut with white oily meat and hard brown shell)
The annotation sets of the two sense items of Brazil thus obtained are:
gloss(Brazil_1) = {the largest Latin American country and the largest Portuguese speaking country in the world; located in the central and northeastern part of South America; world's leading coffee exporter}
gloss(Brazil_2) = {three-sided tropical American nut with white oily meat and hard brown shell}
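As a sketch, the annotation set of formula (13) can be recovered through NLTK, which splits WordNet's semicolon-separated gloss into a definition and example sentences; reassembling the parts into one list is an assumption about how the gl_j sentences map onto that API:

from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def gloss_set(synset):
    # gloss(ti) of formula (13): semicolon-separated annotation sentences glj
    parts = [p.strip() for p in synset.definition().split(";")]
    return parts + synset.examples()

for syn in wn.synsets("Brazil"):
    print(syn.name(), "->", gloss_set(syn))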
(2) Form the list of text pairs to be compared from sense item annotation sentences and the disambiguation text.
Combine each sense item annotation sentence extracted in step (1) with the disambiguation text in which the word to be disambiguated is located, forming the list of text pairs to be compared shown in formula (14):

TextList(gloss(t_i), text_t) = {(gl_j : text_t) | j ∈ [1, p_i]}  (14)

where gloss(t_i) denotes the annotation set of sense item t_i generated by formula (13), text_t denotes the text in which the word t to be disambiguated is located, and (gl_j : text_t) denotes one text pair to be compared. For example, for the sentence to be disambiguated
text_Brazil = "Unlike in the US where African Americans were united in the civil rights struggle, in Brazil the philosophy of whitening has helped divide blacks from other non-whites and prevented a more active civil rights movement"
the list of comparison text pairs obtained is:
List(gloss(Brazil_2), text_Brazil) = {"three-sided tropical American nut with white oily meat and hard brown shell" : "Unlike in the US where African Americans were united in the civil rights struggle, in Brazil the philosophy of whitening has helped divide blacks from other non-whites and prevented a more active civil rights movement"}
(3) Convert the comparison of text pairs into the comparison of core semantic bags consisting of nouns and verbs.
Perform root reduction on the texts in the list of text pairs to be compared generated in step (2), and extract the nouns and verbs of each text as its core semantic bag, converting the comparison of text pairs into the comparison of core semantic bags as shown in formula (15), and generating the corresponding list of core semantic bag comparison pairs as shown in formula (16):

TextList(gloss(t_i), text_t) = BagList(gloss(t_i), text_t)  (15)

BagList(gloss(t_i), text_t) = {(glBag_j : textBag) | j ∈ [1, p_i]}  (16)

where (glBag_j : textBag) denotes one core semantic comparison pair, glBag_j denotes the core semantic bag of nouns and verbs extracted from annotation sentence gl_j, and textBag denotes the core semantic bag of nouns and verbs extracted from the text text_t to be disambiguated.
For example, using the Stanford lemmatization and part-of-speech tagging tools, the core semantic bags of List(gloss(Brazil_2), text_Brazil) are:

List(gloss(Brazil_2), text_Brazil) = {(nut, oily, meat, shell) : (US, African, Americans, civil, rights, Brazil, philosophy, whitening, blacks, non-whites, active, civil, rights, movement)}
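A minimal sketch of the core semantic bag construction of formulas (15)-(16); the text names Stanford tools, so the NLTK tagger and lemmatizer below are stand-ins, not the patent's toolchain:

import nltk
from nltk.stem import WordNetLemmatizer
# requires nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
# and nltk.download('wordnet')

def core_bag(text):
    lem = WordNetLemmatizer()
    bag = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
        if tag.startswith("NN"):             # nouns, root-reduced as nouns
            bag.append(lem.lemmatize(word.lower(), "n"))
        elif tag.startswith("VB"):           # verbs, root-reduced as verbs
            bag.append(lem.lemmatize(word.lower(), "v"))
    return bag

print(core_bag("three-sided tropical American nut with white oily meat and hard brown shell"))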
(4) Calculate the similarity between the annotation set of each sense item of the word and the disambiguation text through the core semantic bags.
The similarity between the annotation set gloss(t_i) of sense item t_i and the text text_t to be disambiguated is calculated by formula (17):

sim(gloss(t_i), text_t) = max{sim(glBag_j, textBag) | j ∈ [1, p_i]}  (17)

where max{·} denotes the maximum value in the set, and the similarity between the core semantic bags glBag_j and textBag is calculated by formula (18):

sim(glBag_j, textBag) = (Σ_{u∈B_1} max_{v∈B_2} sim(u, v) + Σ_{v∈B_2} max_{u∈B_1} sim(u, v)) / (|B_1| + |B_2|)  (18)

where B_1 denotes the core semantic bag glBag_j, B_2 denotes the core semantic bag textBag, and |B_1|, |B_2| denote the numbers of words in B_1 and B_2, respectively. The similarity sim(u, v) of words u and v is calculated by formula (19):

sim(u, v) = 2 · depth(LCS(u, v)) / (depth(u) + depth(v))  (19)

where the function depth(u) denotes the depth of word u in the WordNet hierarchy and LCS(u, v) denotes the nearest common parent node of words u and v in WordNet.
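Formula (19) has the Wu-Palmer form, which NLTK implements directly, and formula (18) is a bidirectional best-match average; the sketch below implements both under that reading, taking the best score over all synset pairs of the two words (an assumption, since the text does not state how a word is mapped to a synset):

from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def word_sim(u, v):
    # sim(u, v) of formula (19): depth of the nearest common parent (LCS)
    # relative to the depths of u and v, i.e. the Wu-Palmer form
    best = 0.0
    for su in wn.synsets(u):
        for sv in wn.synsets(v):
            best = max(best, su.wup_similarity(sv) or 0.0)
    return best

def bag_sim(b1, b2):
    # sim(glBagj, textBag) of formula (18): average of best matches
    # in both directions over |B1| + |B2| words
    s1 = sum(max((word_sim(u, v) for v in b2), default=0.0) for u in b1)
    s2 = sum(max((word_sim(u, v) for u in b1), default=0.0) for v in b2)
    return (s1 + s2) / (len(b1) + len(b2))

print(bag_sim(["nut", "meat", "shell"], ["country", "coffee", "exporter"]))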
(5) Output the sense item whose annotation set has the highest similarity with the disambiguation text as the disambiguation result.
According to step (4), calculate the similarity between the annotation set of each sense item of word t and the text text_t to be disambiguated, and take the sense item t* with the maximum similarity as the final disambiguation result; that is, the disambiguation result of word t in the disambiguation text text_t is the sense item t* whose annotation set has the highest similarity with text_t, as shown in formula (20):

t* = argmax_{i∈[1,n_t]} sim(gloss(t_i), text_t)  (20)

where n_t denotes the number of sense items of t, t* is the sense item maximizing the value of sim(gloss(t_i), text_t), and sim(gloss(t_i), text_t) is calculated by formula (17).
For example:
By formulas (17), (18) and (19), the similarity between the annotation set of sense item Brazil_1 and the disambiguation text text_Brazil is:
sim(gloss(Brazil_1), text_Brazil) = max(0.627, 0.408, 0.745) = 0.745
Similarly, the similarity between the annotation set of sense item Brazil_2 and text_Brazil is:
sim(gloss(Brazil_2), text_Brazil) = max(0.576) = 0.576
Finally, formula (20) selects the sense item with the maximum similarity as the disambiguation result, i.e., the disambiguation result is sense item Brazil_1, whose annotation set has similarity 0.745 with the disambiguation text.
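Putting steps (1)-(5) together, formula (20) is a simple argmax over the per-sense-item similarities of formula (17). A sketch reusing core_bag and bag_sim from the fragments above (the gloss strings are abbreviated here for illustration):

def disambiguate(context_text, glosses_per_sense):
    # glosses_per_sense: for each sense item ti, its annotation sentences glj
    text_bag = core_bag(context_text)
    scores = [max(bag_sim(core_bag(gl), text_bag) for gl in glosses)
              for glosses in glosses_per_sense]             # formula (17)
    best = max(range(len(scores)), key=scores.__getitem__)  # formula (20)
    return best + 1, scores                                 # 1-based sense item index

sense, scores = disambiguate(
    "the philosophy of whitening has prevented a more active civil rights movement",
    [["the largest Latin American country", "world's leading coffee exporter"],
     ["three-sided tropical American nut with white oily meat"]])
print(sense, scores)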
5. Vector storage structure
The word vectors of the invention are high-dimensional, with up to 300,000 dimension words. For convenience of storage and computation, the invention stores, for each generated word vector and sense item vector, only the dimension words with non-zero weight together with their weights; the weight of any dimension word that is not stored defaults to 0.
For example:
SFV(Brazil_1) = {(impa, 0.042714), (lluvia, 0.036314), (maracana, 0.035894), (petropolis, 0.04243), ..., (in, 0.008653), (to, 0.000161), (and, 0.002992)}
6. Experimental comparison
This example uses the English Wikipedia version published on July 1, 2019 for experimental comparison; it contains 15 GB of page text across 5,895,703 article pages. The embodiment uses the JWPL (Java Wikipedia Library) tool to parse the Wikipedia download database; JWPL runs on an optimized database created from the Wikipedia download database and gives quick access to Wikipedia page articles, categories, links, redirects, etc. WordNet 3.0 is used as the knowledge corpus for sense item vector generation and sense item disambiguation. The offline Wikipedia pages are preprocessed and cleaned with a Perl tool. The original word statistics vectors are then obtained with the DISSECT tool as follows: first, the 300,000 most frequent words are selected as target words and word vector dimensions for word statistics training; then the co-occurrence counts of target words and dimension words are obtained with a context window of L = 2; finally, the word co-occurrence matrix and word vectors are obtained. Finally, the sense item vectors of the words are generated with the method provided by the invention.
This embodiment tests the generated sense item vectors based on Wikipedia word statistics and WordNet, together with the proposed sense item disambiguation method, on two internationally accepted test sets: the general word relatedness test set WordSim-353 and the word relatedness test set with disambiguation texts SCWS-2003 (Stanford's Contextual Word Similarities), as shown in Table 1:
TABLE 1. The two data sets used for experimental comparison
For the two data sets, this embodiment uses the Spearman coefficient to compare the test results of the method of the invention. The Spearman coefficient is calculated as follows:

ρ = 1 - 6 Σ_{i=1}^{n} d_i² / (n(n² - 1))

where n denotes the number of word pairs in the data set and d_i is the rank difference between the variables X_i and Y_i, X_i being the i-th element in the list of manual judgment values and Y_i the i-th element in the list of computed values. The comparison between the method of the invention and the original word statistics vectors on the two data sets is shown in the following table:
TABLE 2. Spearman coefficient comparison of sense item vectors and word vectors on the two experimental data sets
Method | WordSim-353 | SCWS
Original word statistics vector | 0.634 | 0.584
Sense item vector and disambiguation method of the invention | 0.638 | 0.631
As can be seen from this experiment, on the WordSim-353 data set, which has no disambiguation texts, the results of the sense item vectors and disambiguation method are essentially identical to those of the original word statistics vectors, with a slight improvement, showing that the method does not deviate from the main direction of the word statistics vectors and causes no negative effect. On the SCWS-2003 data set with disambiguation texts, the disambiguation method provided by the invention plays a key role: the Spearman coefficient rises markedly from 0.584 to 0.631, which fully shows that the sense item vector generation and disambiguation method based on Wikipedia word statistics and WordNet is feasible and excellent.
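As a sketch of the evaluation, the Spearman coefficient above is available in SciPy; the values below are invented placeholders, not the experiment's data:

from scipy.stats import spearmanr

human = [7.5, 6.3, 9.0, 2.1]      # Xi: manual judgment values (placeholders)
model = [0.62, 0.55, 0.80, 0.15]  # Yi: computed similarity values (placeholders)
rho, _ = spearmanr(human, model)
print(f"Spearman coefficient: {rho:.3f}")   # 1.000 here, since the rankings agree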

Claims (3)

1. A word statistics and WordNet based sense item representation and disambiguation method, characterized by comprising the following steps:
Step 1: acquire the offline web page files of Wikipedia and preprocess them to obtain a preprocessed Wikipedia corpus;
Step 2: on the preprocessed Wikipedia corpus, select the K most frequent words as training target words and vector dimension words, and perform word statistics training to obtain a word co-occurrence matrix and word vectors;
Step 3: acquire the sense item set of a word and the synonym set of each sense item from WordNet;
Step 4: generate the sense item vectors of the word by combining the word co-occurrence matrix and word vectors obtained in step 2 with the sense item sets and synonym sets obtained in step 3; namely:
Step 4.1: for the i-th sense item t_i of word t, take the word vector V(t) of word t as the initialization sense item vector SV_0(t_i) of t_i, i.e., let SV_0(t_i) = V(t);
Step 4.2: for each single-sense synonym st in the synonym set, iteratively merge its word vector V(st) with the initialization sense item vector SV_0(t_i) according to the following formula, generating the first-order sense item vector SV_1(t_i) of sense item t_i:

SV_1(t_i) = {(s_i, wt(s_i, SV_0(t_i)) + wt(s_i, V(st))) | s_i ∈ D_1 ∪ D_2}

where wt(s_i, SV_0(t_i)) denotes the weight of dimension word s_i in the initialization sense item vector SV_0(t_i), wt(s_i, V(st)) denotes the weight of dimension word s_i in the word vector V(st), D_1 denotes the set of dimension words with non-zero weight in SV_0(t_i), and D_2 denotes the set of dimension words with non-zero weight in V(st);
Before each iterative merge, the result SV_1(t_i) of the previous merge is used as the initialization sense item vector SV_0(t_i), i.e., let SV_0(t_i) = SV_1(t_i); if the synonym set of sense item t_i contains no single-sense synonym, let SV_1(t_i) = SV_0(t_i);
Step 4.3: for each multi-sense synonym dt in the synonym set, iteratively merge its word vector V(dt) with the first-order sense item vector SV_1(t_i) to generate the second-order sense item vector SV_2(t_i) of sense item t_i, where wt(s_i, SV_1(t_i)) denotes the weight of dimension word s_i in the first-order sense item vector SV_1(t_i), wt_2(s_i, V(dt)) denotes the weight contribution of dimension word s_i from the word vector V(dt), wt(s_j, SV_1(t_i)) denotes the weight of dimension word s_j in SV_1(t_i), D_3 denotes the set of dimension words with non-zero weight in SV_1(t_i), and D_4 denotes the set of dimension words with non-zero weight in V(dt);
Before each iterative merge, the result SV_2(t_i) of the previous merge is used as the first-order sense item vector SV_1(t_i), i.e., let SV_1(t_i) = SV_2(t_i); if the synonym set of sense item t_i contains no multi-sense synonym, let SV_2(t_i) = SV_1(t_i);
Step 4.4: for each combined synonym ct in the synonym set of sense item t_i, take the word vector V(ft) of the first word ft in ct as the initialization combined word vector CV_0(ct) of ct, i.e., let CV_0(ct) = V(ft);
Step 4.5: for each independent word at in the combined synonym ct, iteratively merge its word vector V(at) with the initialization combined word vector CV_0(ct) to generate the first-order combined word vector CV_1(ct) of ct, where wt(s_i, CV_0(ct)) denotes the weight of dimension word s_i in the initialization combined word vector CV_0(ct), wt(s_i, V(at)) denotes the weight of dimension word s_i in the word vector V(at), wt(s_j, CV_0(ct)) denotes the weight of dimension word s_j in CV_0(ct), D_5 denotes the set of dimension words with non-zero weight in CV_0(ct), and D_6 denotes the set of dimension words with non-zero weight in V(at);
Before each iterative merge, the result CV_1(ct) of the previous merge is used as the initialization combined word vector CV_0(ct), i.e., let CV_0(ct) = CV_1(ct);
Step 4.6: iteratively merge the second-order sense item vector SV_2(t_i) obtained in step 4.3 with each first-order combined word vector CV_1(ct) obtained in step 4.5 to generate the final vector SFV(t_i) of sense item t_i, where wt(s_i, SV_2(t_i)) denotes the weight of dimension word s_i in the second-order sense item vector SV_2(t_i), wt(s_i, CV_1(ct)) denotes the weight of dimension word s_i in the first-order combined word vector CV_1(ct), wt(s_j, SV_2(t_i)) denotes the weight of dimension word s_j in SV_2(t_i), D_7 denotes the set of dimension words with non-zero weight in SV_2(t_i), and D_8 denotes the set of dimension words with non-zero weight in CV_1(ct);
Before each iterative merge, the result SFV(t_i) of the previous merge is used as the second-order sense item vector SV_2(t_i), i.e., let SV_2(t_i) = SFV(t_i); if the synonym set of sense item t_i contains no combined synonym, let SFV(t_i) = SV_2(t_i);
Step 5: acquire the annotation set of each sense item of the word from WordNet;
Step 6: form a list of text pairs to be compared from the sense item annotation sentences and the disambiguation text;
Step 7: perform root reduction on the texts in the list of text pairs to be compared, and extract the nouns and verbs of each text as its core semantic bag, thereby converting the comparison of text pairs into the comparison of core semantic bags consisting of nouns and verbs;
Step 8: calculate the similarity between the annotation set of each sense item of the word and the disambiguation text through the core semantic bags;
Step 9: according to the similarity between the annotation set of each sense item and the disambiguation text, output the sense item whose annotation set has the highest similarity with the disambiguation text as the disambiguation result.
2. The word statistics and WordNet based sense item representation and disambiguation method of claim 1, wherein for the generated word vectors and sense item vectors, only the dimension words with non-zero weight and their weights are stored, and the weight of any dimension word that is not stored defaults to 0.
3. The word statistics and WordNet based sense item representation and disambiguation method of claim 1, wherein in step 8 the similarity sim(gloss(t_i), text_t) between the annotation set gloss(t_i) of sense item t_i of word t and the disambiguation text text_t in which the word t to be disambiguated is located is:

sim(gloss(t_i), text_t) = max{sim(glBag_j, textBag) | j ∈ [1, p_i]}

sim(glBag_j, textBag) = (Σ_{u∈B_1} max_{v∈B_2} sim(u, v) + Σ_{v∈B_2} max_{u∈B_1} sim(u, v)) / (|B_1| + |B_2|)

sim(u, v) = 2 · depth(LCS(u, v)) / (depth(u) + depth(v))

where sim(glBag_j, textBag) denotes the similarity between the core semantic bags glBag_j and textBag, max{·} denotes the maximum value, glBag_j denotes the core semantic bag of nouns and verbs extracted from annotation sentence gl_j, gl_j denotes an annotation sentence (annotation sentences are separated by semicolons) in the annotation set gloss(t_i) of sense item t_i, textBag denotes the core semantic bag of nouns and verbs extracted from the text text_t to be disambiguated, p_i denotes the number of annotation sentences in the annotation set gloss(t_i) of sense item t_i, B_1 denotes the core semantic bag glBag_j, B_2 denotes the core semantic bag textBag, |·| denotes the number of words in a core semantic bag, depth(u) and depth(v) denote the depths of words u and v in the WordNet hierarchy, LCS(u, v) denotes the nearest common parent node of words u and v in the WordNet hierarchy, and depth(LCS(u, v)) denotes the depth of that nearest common parent node in the WordNet hierarchy.
CN201910803617.XA 2019-08-28 2019-08-28 Word statistics and WordNet-based semantic item representation and disambiguation method Active CN110569503B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910803617.XA CN110569503B (en) 2019-08-28 2019-08-28 Word statistics and WordNet-based semantic item representation and disambiguation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910803617.XA CN110569503B (en) 2019-08-28 2019-08-28 Word statistics and WordNet-based semantic item representation and disambiguation method

Publications (2)

Publication Number Publication Date
CN110569503A CN110569503A (en) 2019-12-13
CN110569503B true CN110569503B (en) 2023-12-29

Family

ID=68776561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910803617.XA Active CN110569503B (en) 2019-08-28 2019-08-28 Word statistics and WordNet-based semantic item representation and disambiguation method

Country Status (1)

Country Link
CN (1) CN110569503B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7457531B2 (en) * 2020-02-28 2024-03-28 株式会社Screenホールディングス Similarity calculation device, similarity calculation program, and similarity calculation method
CN113128210A (en) * 2021-03-08 2021-07-16 西安理工大学 Webpage table information analysis method based on synonym discovery
CN114091473B (en) * 2022-01-20 2022-05-03 北京建筑大学 Web service discovery method based on comprehensive semantics
CN117610579B (en) * 2024-01-19 2024-04-16 卓世未来(天津)科技有限公司 Semantic analysis method and system based on long-short-term memory network

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1916887A (en) * 2006-09-06 2007-02-21 哈尔滨工程大学 Method for eliminating ambiguity without directive word meaning based on technique of substitution words
CN101295294A (en) * 2008-06-12 2008-10-29 昆明理工大学 Improved Bayes acceptation disambiguation method based on information gain
CN103729343A (en) * 2013-10-10 2014-04-16 上海交通大学 Semantic ambiguity eliminating method based on encyclopedia link co-occurrence
CN108446269A (en) * 2018-03-05 2018-08-24 昆明理工大学 A kind of Word sense disambiguation method and device based on term vector
CN108647705A (en) * 2018-04-23 2018-10-12 北京交通大学 Image, semantic disambiguation method and device based on image and text semantic similarity
CN108874772A (en) * 2018-05-25 2018-11-23 太原理工大学 A kind of polysemant term vector disambiguation method
CN108932222A (en) * 2017-05-22 2018-12-04 中国移动通信有限公司研究院 A kind of method and device obtaining the word degree of correlation
CN109325230A (en) * 2018-09-21 2019-02-12 广西师范大学 A kind of phrase semantic degree of correlation judgment method based on wikipedia bi-directional chaining

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7899666B2 (en) * 2007-05-04 2011-03-01 Expert System S.P.A. Method and system for automatically extracting relations between concepts included in text

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1916887A (en) * 2006-09-06 2007-02-21 哈尔滨工程大学 Method for eliminating ambiguity without directive word meaning based on technique of substitution words
CN101295294A (en) * 2008-06-12 2008-10-29 昆明理工大学 Improved Bayes acceptation disambiguation method based on information gain
CN103729343A (en) * 2013-10-10 2014-04-16 上海交通大学 Semantic ambiguity eliminating method based on encyclopedia link co-occurrence
CN108932222A (en) * 2017-05-22 2018-12-04 中国移动通信有限公司研究院 A kind of method and device obtaining the word degree of correlation
CN108446269A (en) * 2018-03-05 2018-08-24 昆明理工大学 A kind of Word sense disambiguation method and device based on term vector
CN108647705A (en) * 2018-04-23 2018-10-12 北京交通大学 Image, semantic disambiguation method and device based on image and text semantic similarity
CN108874772A (en) * 2018-05-25 2018-11-23 太原理工大学 A kind of polysemant term vector disambiguation method
CN109325230A (en) * 2018-09-21 2019-02-12 广西师范大学 A kind of phrase semantic degree of correlation judgment method based on wikipedia bi-directional chaining

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
An efficient approach for measuring semantic relatedness using Wikipedia bidirectional links; Xinhua Zhu et al.; Journal of Applied Intelligence; May 2, 2019; pp. 3708-3730 *
Research and design of a Perl-based word sense disambiguation method; Shi Haifeng et al.; Computer Knowledge and Technology; Aug. 2009; vol. 5, no. 24; pp. 6765, 6776 *
Vector representations of words and sense items drawing on a manual knowledge base: HowNet as a case study; Sun Maosong et al.; Journal of Chinese Information Processing; Nov. 2016; vol. 30, no. 6; pp. 1-6, 14 *
Research on multi-feature fusion methods for Chinese named entity linking; Lin Zefei et al.; Journal of the China Society for Scientific and Technical Information; Jan. 2019; vol. 38, no. 1; pp. 68-78 *

Also Published As

Publication number Publication date
CN110569503A (en) 2019-12-13

Similar Documents

Publication Publication Date Title
CN110569503B (en) Word statistics and WordNet-based semantic item representation and disambiguation method
Vougiouklis et al. Neural wikipedian: Generating textual summaries from knowledge base triples
Zouaghi et al. Combination of information retrieval methods with LESK algorithm for Arabic word sense disambiguation
US20180260381A1 (en) Prepositional phrase attachment over word embedding products
CN109783806B (en) Text matching method utilizing semantic parsing structure
Sarkar et al. A practical part-of-speech tagger for Bengali
Stankevičius et al. Testing pre-trained Transformer models for Lithuanian news clustering
pal Singh et al. Naive Bayes classifier for word sense disambiguation of Punjabi language
Han et al. Unsupervised Word Sense Disambiguation based on Word Embedding and Collocation.
Vakare et al. Sentence semantic similarity using dependency parsing
Raj et al. An Artificial Neural Network Approach for Sentence Boundary Disambiguation in Urdu Language Text.
CN113963748A (en) Protein knowledge map vectorization method
Nathani et al. Part of Speech Tagging for a Resource Poor Language: Sindhi in Devanagari Script using HMM and CRF
Pitichotchokphokhin et al. Discover underlying topics in Thai news articles: a comparative study of probabilistic and matrix factorization approaches
Rebala et al. Natural language processing
Bindu et al. Design and development of a named entity based question answering system for Malayalam language
Velasco et al. Automatic wordnet construction using word sense induction through sentence embeddings
Beumer Evaluation of Text Document Clustering using k-Means
Bergsma Large-scale semi-supervised learning for natural language processing
Khedkar et al. A survey of machine translation and parts of speech tagging for indian languages
Angle et al. Kannada morpheme segmentation using machine learning
Alrehaili et al. Extraction of multi-word terms and complex terms from the Classical Arabic text of the Quran
Bhola et al. Text Summarization Based On Ranking Techniques
Naeem et al. Exploiting Transliterated Words for Finding Similarity in Inter-Language News Articles using Machine Learning
Abduljabbar et al. Term Extraction for a Single & Multi-word Based on Islamic Corpus English

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230816

Address after: Room 801, 85 Kefeng Road, Huangpu District, Guangzhou City, Guangdong Province

Applicant after: Yami Technology (Guangzhou) Co.,Ltd.

Address before: 541004 No. 15 Yucai Road, Qixing District, Guilin, the Guangxi Zhuang Autonomous Region

Applicant before: Guangxi Normal University

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant