CN110569503B - Word statistics and WordNet-based semantic item representation and disambiguation method - Google Patents


Info

Publication number
CN110569503B
CN110569503B (application CN201910803617.XA)
Authority
CN
China
Prior art keywords
word
vector
sense
term
synonym
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910803617.XA
Other languages
Chinese (zh)
Other versions
CN110569503A (en)
Inventor
朱新华
郭青松
温海旭
陈宏朝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yami Technology Guangzhou Co ltd
Original Assignee
Yami Technology Guangzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yami Technology Guangzhou Co ltd filed Critical Yami Technology Guangzhou Co ltd
Priority to CN201910803617.XA priority Critical patent/CN110569503B/en
Publication of CN110569503A publication Critical patent/CN110569503A/en
Application granted granted Critical
Publication of CN110569503B publication Critical patent/CN110569503B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/3331: Query processing
    • G06F 16/334: Query execution
    • G06F 16/3344: Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a sense item representation and disambiguation method based on word statistics and WordNet. It uses the well-organized, internationally accepted word sense sets and synonym sets in WordNet as prior knowledge, and proposes a sense item vector generation method based on Wikipedia word statistics.

Description

Word statistics and WordNet-based semantic item representation and disambiguation method
Technical Field
The invention relates to the field of natural language understanding in artificial intelligence, in particular to a word statistics and WordNet-based meaning item representation and disambiguation method.
Background
Deep learning has developed rapidly in the field of artificial intelligence; it excels in image processing and is widely applied in natural language processing. With the combination of deep neural networks and natural language processing, word vectors were proposed. They address the vector representation of natural language in a neural network by converting words into dense vectors, such that similar words lie close together in the vector space. In natural language processing applications, word vectors serve as the input features of deep learning models, so the effect of the final model largely depends on the quality of the word vectors.
Word vectors based on neural networks are trained on big data and are therefore fairly accurate, but because the weights of the input layer of the training network are used directly as word vectors, the vector dimensions lack semantic interpretation, and sense item vectors cannot be obtained by combining such word vectors. Statistics-based word vectors use words as dimensions, so the dimensions carry rich semantics and sense item vectors can be obtained by combining word vectors. However, since word ambiguity is common in natural language, the correct word sense must be used for a computer to understand natural language accurately. A word sense is the specific meaning a word reflects in a particular language environment; it has more similar and concrete semantic attributes in context and better reflects the relations between words. At present, the various word vector models generally generate a single vector per word and do not train sense item vectors, so in practical applications each word can only use its unique word vector for semantic computation in different language environments, which greatly reduces the accuracy of semantic computation.
Disclosure of Invention
The invention aims to solve the problem that each word can currently use only a single word vector for semantic computation in different language environments, which greatly reduces the accuracy of semantic computation, and provides a sense item representation and disambiguation method based on word statistics and WordNet.
In order to solve the problems, the invention is realized by the following technical scheme:
A word statistics and WordNet based method of sense item representation and disambiguation, comprising the steps of:
Step 1: acquire the offline web page files of Wikipedia and preprocess them to obtain a preprocessed Wikipedia corpus;
Step 2: on the preprocessed Wikipedia corpus, select the K most frequent words as training target words and vector dimension words, and perform word statistics training to obtain a word co-occurrence matrix and word vectors;
Step 3: acquire the sense item set of a word and the synonym set of each sense item from WordNet;
Step 4: generate the sense item vectors of the word by combining the word co-occurrence matrix and word vectors obtained in step 2 with the sense item sets and synonym sets obtained in step 3;
Step 5: acquire the annotation set of each sense item of the word from WordNet;
Step 6: form a list of text pairs to be compared from the sense item annotation sentences and the disambiguation text;
Step 7: perform root reduction on the texts in the list of text pairs to be compared, and extract the nouns and verbs of each text as its core semantic bag, thereby converting the comparison of text pairs into the comparison of core semantic bags consisting of nouns and verbs;
Step 8: calculate the similarity between the annotation set of each sense item of the word and the disambiguation text through the core semantic bags;
Step 9: according to the similarity between the annotation set of each sense item and the disambiguation text, output the sense item whose annotation set has the highest similarity with the disambiguation text as the disambiguation result.
The specific process of step 4 is as follows:
Step 4.1: for the i-th sense item t_i of word t, take the word vector V(t) of word t as the initialization sense item vector SV_0(t_i) of t_i, i.e., let SV_0(t_i) = V(t);
Step 4.2: for each single-sense synonym st in the synonym set, iteratively merge its word vector V(st) with the initialization sense item vector SV_0(t_i) according to the following formula, generating the first-order sense item vector SV_1(t_i) of sense item t_i:

SV_1(t_i) = {(s_i, wt(s_i, SV_0(t_i)) + wt(s_i, V(st))) | s_i ∈ D_1 ∪ D_2}

where wt(s_i, SV_0(t_i)) denotes the weight of dimension word s_i in the initialization sense item vector SV_0(t_i), wt(s_i, V(st)) denotes the weight of dimension word s_i in the word vector V(st), D_1 denotes the set of dimension words with non-zero weight in SV_0(t_i), and D_2 denotes the set of dimension words with non-zero weight in V(st);
Before each iterative merge, the result SV_1(t_i) of the previous merge is used as the initialization sense item vector SV_0(t_i), i.e., let SV_0(t_i) = SV_1(t_i); if the synonym set of sense item t_i contains no single-sense synonym, let SV_1(t_i) = SV_0(t_i);
Step 4.3: for each multi-sense synonym dt in the synonym set, iteratively merge its word vector V(dt) with the first-order sense item vector SV_1(t_i) to generate the second-order sense item vector SV_2(t_i) of sense item t_i, where wt(s_i, SV_1(t_i)) denotes the weight of dimension word s_i in the first-order sense item vector SV_1(t_i), wt_2(s_i, V(dt)) denotes the weight contribution of dimension word s_i from the word vector V(dt), wt(s_j, SV_1(t_i)) denotes the weight of dimension word s_j in SV_1(t_i), D_3 denotes the set of dimension words with non-zero weight in SV_1(t_i), and D_4 denotes the set of dimension words with non-zero weight in V(dt);
Before each iterative merge, the result SV_2(t_i) of the previous merge is used as the first-order sense item vector SV_1(t_i), i.e., let SV_1(t_i) = SV_2(t_i); if the synonym set of sense item t_i contains no multi-sense synonym, let SV_2(t_i) = SV_1(t_i);
Step 4.4: for each combined synonym ct in the synonym set of sense item t_i, take the word vector V(ft) of the first word ft in ct as the initialization combined word vector CV_0(ct) of ct, i.e., let CV_0(ct) = V(ft);
Step 4.5: for each independent word at in the combined synonym ct, iteratively merge its word vector V(at) with the initialization combined word vector CV_0(ct) to generate the first-order combined word vector CV_1(ct) of ct, where wt(s_i, CV_0(ct)) denotes the weight of dimension word s_i in the initialization combined word vector CV_0(ct), wt(s_i, V(at)) denotes the weight of dimension word s_i in the word vector V(at), wt(s_j, CV_0(ct)) denotes the weight of dimension word s_j in CV_0(ct), D_5 denotes the set of dimension words with non-zero weight in CV_0(ct), and D_6 denotes the set of dimension words with non-zero weight in V(at);
Before each iterative merge, the result CV_1(ct) of the previous merge is used as the initialization combined word vector CV_0(ct), i.e., let CV_0(ct) = CV_1(ct);
Step 4.6: iteratively merge the second-order sense item vector SV_2(t_i) obtained in step 4.3 with each first-order combined word vector CV_1(ct) obtained in step 4.5 to generate the final vector SFV(t_i) of sense item t_i, where wt(s_i, SV_2(t_i)) denotes the weight of dimension word s_i in the second-order sense item vector SV_2(t_i), wt(s_i, CV_1(ct)) denotes the weight of dimension word s_i in the first-order combined word vector CV_1(ct), wt(s_j, SV_2(t_i)) denotes the weight of dimension word s_j in SV_2(t_i), D_7 denotes the set of dimension words with non-zero weight in SV_2(t_i), and D_8 denotes the set of dimension words with non-zero weight in CV_1(ct);
Before each iterative merge, the result SFV(t_i) of the previous merge is used as the second-order sense item vector SV_2(t_i), i.e., let SV_2(t_i) = SFV(t_i); if the synonym set of sense item t_i contains no combined synonym, let SFV(t_i) = SV_2(t_i).
In the above scheme, for the generated word vectors and sense item vectors, only the dimension words with non-zero weight and their weights are stored; the weight of any dimension word that is not stored defaults to 0.
In step 8, the similarity sim(gloss(t_i), text_t) between the annotation set gloss(t_i) of sense item t_i of word t and the disambiguation text text_t in which the word t to be disambiguated is located is:

sim(gloss(t_i), text_t) = max{sim(glBag_j, textBag) | j ∈ [1, p_i]}

sim(glBag_j, textBag) = (Σ_{u∈B_1} max_{v∈B_2} sim(u, v) + Σ_{v∈B_2} max_{u∈B_1} sim(u, v)) / (|B_1| + |B_2|)

sim(u, v) = 2 · depth(LCS(u, v)) / (depth(u) + depth(v))

where sim(glBag_j, textBag) denotes the similarity between the core semantic bags glBag_j and textBag, max{·} denotes the maximum value, glBag_j denotes the core semantic bag of nouns and verbs extracted from annotation sentence gl_j, gl_j denotes an annotation sentence (annotation sentences are separated by semicolons) in the annotation set gloss(t_i) of sense item t_i, textBag denotes the core semantic bag of nouns and verbs extracted from the text text_t to be disambiguated, p_i denotes the number of annotation sentences in the annotation set gloss(t_i) of sense item t_i, B_1 denotes the core semantic bag glBag_j, B_2 denotes the core semantic bag textBag, |·| denotes the number of words in a core semantic bag, depth(u) and depth(v) denote the depths of words u and v in the WordNet hierarchy, LCS(u, v) denotes the nearest common parent node of words u and v in the WordNet hierarchy, and depth(LCS(u, v)) denotes the depth of that nearest common parent node in the WordNet hierarchy.
Compared with the prior art, the invention uses the well-organized, internationally accepted word sense sets and synonym sets in WordNet as prior knowledge and proposes a sense item vector generation method based on Wikipedia word statistics.
Drawings
FIG. 1 is a flow chart of a word statistics and WordNet based semantic item representation and disambiguation method.
Detailed Description
The present invention will be further described in detail below with reference to specific examples, in order to make its objects, technical solutions and advantages more apparent.
The word statistics and WordNet based sense item representation and disambiguation method specifically comprises the following steps, as shown in FIG. 1:
First, the offline page files of Wikipedia are obtained; illegal characters in them are converted into spaces; pictures and tables are deleted, keeping only their titles; links keep only their text; and finally plain text containing a-z (with A-Z converted to lower case) and digits is retained. After cleaning, a co-occurrence matrix is generated by the word statistics model, the corresponding word vectors are obtained from the co-occurrence matrix, and the initial word vectors are formed and used as input to the sense item generation model. The sense items of each word in WordNet and their synonym sets are then taken as further input, the aim being to generate the corresponding sense item vector from the synonym set of the word. The model first obtains all words of the synonym set from the initial word vectors by table lookup, takes the input word as the source word, uses its word vector as the baseline, and merges it with the vectors of the other words in the synonym set. The problem of combined words also arises; since there is no established method for generating the vector of a combined word, the solution of the invention is to split the combined word into individual words and then merge the vectors of those words, i.e., weights are added on the intersecting dimensions and kept unchanged on the non-intersecting ones. Finally, the vector of each sense item of the word is output.
1. Generating a Wikipedia-based word co-occurrence matrix and word vectors
The invention trains word statistics vectors on the open Wikipedia corpus. Because Wikipedia contains a very large vocabulary, the 300,000 most frequent words are taken as training target words and dimension words, finally yielding word vectors for these 300,000 words. Each word vector has 300,000 dimensions, and each dimension is a word and therefore has a specific meaning. For example, one word statistics vector is shown below:
V(deckhand) = {(guinean, 0.284611), (trawler, 0.250539), (cowell, 0.247986), ...}
the specific generation steps are as follows:
(1) Download and preprocess the offline page data file of Wikipedia.
The offline page data file of Wikipedia is first obtained through the dump backup database provided by Wikipedia. The invention uses the JWPL (Java Wikipedia Library) tool to parse the Wikipedia download database; JWPL runs on an optimized database created from the Wikipedia download database and allows quick access to Wikipedia page articles, categories, links, redirects, etc. An offline Wikipedia page contains various kinds of data: text, pictures, tables, links, and characters unique to web pages. The invention cleans the Wikipedia page data using formula (1), finally leaving the text data required for training word vectors: uppercase characters in the range A-Z are converted to lowercase a-z, and non-displayable symbols are converted into spaces. The preprocessed Wikipedia page data is input to the word statistics model of step (2).

Page_wiki = {lower(w) | w ∈ S}  (1)

where lower is the function converting characters to lowercase and S is the set of displayable characters and digits.
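As an illustration, the cleaning of formula (1) can be sketched in a few lines of Python. The exact character set S and the JWPL parsing step are not reproduced here, so the regular expression below is an assumption that keeps only lowercase letters, digits and spaces:

import re

def clean_wiki_text(raw: str) -> str:
    # lower(w): convert A-Z to a-z
    text = raw.lower()
    # map symbols outside the assumed displayable set S (letters, digits) to spaces
    text = re.sub(r"[^a-z0-9]+", " ", text)
    # collapse the runs of spaces left by the substitution
    return re.sub(r"\s+", " ", text).strip()

print(clean_wiki_text("Brazil is the largest country in South America!"))
# -> "brazil is the largest country in south america"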
(2) Generate the word co-occurrence matrix and word vectors based on Wikipedia word statistics.
For the preprocessed Wikipedia corpus, since the vocabulary is abundant, and in order to make training convenient and generate more effective word vectors, the invention finally selects the K most frequent words as training target words and vector dimension words for word statistics training, obtaining the word co-occurrence matrix and word vectors shown in formulas (2) and (3).

C = [w_{i,j}], i, j ∈ [1, K]  (2)

Formula (2) represents the Wikipedia-based word co-occurrence matrix, a K × K weight matrix in which each row is the word vector, i.e., the weight vector, of one target word;

V(t_i) = {(t_j, w_{i,j}) | t_j ∈ T_K}  (3)

Formula (3) represents the word vector of target word t_i in the co-occurrence matrix, which consists of K dimension-word/weight pairs of the form (t_j, w_{i,j}), where T_K denotes the set of K dimension words, t_j denotes a dimension word in T_K, and w_{i,j} denotes the weight of target word t_i on dimension word t_j. The weight w_{i,j} is computed as follows: a co-occurrence window centered on target word t_i with L words on each side is defined, with L ∈ [2,5]; the weight w_{i,j} of target word t_i on dimension word t_j is then calculated according to formula (4), where f(t_i, t_j) is the number of co-occurrences of words t_i and t_j measured in the Wikipedia corpus under the specified co-occurrence window, and f(t_i), f(t_j) denote the numbers of occurrences of t_i and t_j in the Wikipedia corpus, respectively.
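As a sketch of this step, the following Python fragment counts window co-occurrences and derives a sparse word vector. Since formula (4) is not reproduced above, the normalization f(t_i, t_j) / (f(t_i) · f(t_j)) used below is only an illustrative assumption, not the patent's exact weighting:

from collections import Counter, defaultdict

def cooccurrence_vectors(tokens, vocab, L=2):
    # f(ti): occurrence counts of target/dimension words
    f = Counter(t for t in tokens if t in vocab)
    # f(ti, tj): co-occurrence counts within a window of L words on each side
    co = defaultdict(Counter)
    for i, ti in enumerate(tokens):
        if ti not in vocab:
            continue
        window = tokens[max(0, i - L):i] + tokens[i + 1:i + 1 + L]
        for tj in window:
            if tj in vocab:
                co[ti][tj] += 1
    # w_ij: assumed normalization standing in for formula (4)
    return {ti: {tj: c / (f[ti] * f[tj]) for tj, c in row.items()}
            for ti, row in co.items()}

tokens = "the trawler deckhand hauled the net while the deckhand sang".split()
V = cooccurrence_vectors(tokens, vocab=set(tokens), L=2)
print(V["deckhand"])   # sparse word vector V(deckhand): {dimension word: weight}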
2. Acquiring the sense item set and synonym sets of a word from WordNet
The invention uses the internationally accepted WordNet as the prior knowledge base for the sense item set and synonym sets of an input word. In WordNet, if a word is ambiguous, multiple sense items are provided; each sense item typically consists of several synonyms forming a synonym set, together with a corresponding annotation. Annotations are usually very simple, easy-to-understand sentences giving a definition and examples. Lexical linguistics starts from an understanding of words: a word is generally divided into word form and word sense, where the word form usually designates the source word or headword, and the word sense is the lexical concept the word form represents. The same source word may express different word senses in different contexts, so in order to distinguish word senses better, the grammatical classification of a word is usually represented by a mapping between word forms and word senses. In WordNet, some word forms correspond to several different word senses, i.e., one word with multiple senses; some word senses can be expressed by different word forms, i.e., one sense with multiple words. The original word vectors based on Wikipedia word statistics come from the Wikipedia text corpus alone and cannot distinguish the synonym sets and multiple senses of a word. Therefore, the WordNet dictionary is used to acquire the sense item set of a word and its synonym sets, with the following steps:
(1) Input the index word t into the WordNet dictionary.
(2) Search the WordNet dictionary to obtain the sense item set SenSet(t) of word t as shown in formula (5), and the synonym set sense(t_i) of each sense item as shown in formula (6):

SenSet(t) = {sense(t_i) | i ∈ [1, n_t]}  (5)

Formula (5) indicates that word t has n_t sense items in WordNet, where n_t is a positive integer;

sense(t_i) = {t, t_j | j ∈ [1, m_i]}  (6)

Formula (6) indicates that the synonym set of the i-th sense item of word t consists of t and m_i words other than t, where m_i is 0 or a positive integer.
For example, the word Brazil has 2 sense items in WordNet:
Brazil sense item 1: {source word: Brazil, synonym: Federative Republic of Brazil, synonym: Brasil}
Brazil sense item 2: {source word: Brazil, synonym: brazil nut}.
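These sets can be reproduced quickly, shown here as a sketch using NLTK's WordNet interface (the patent itself does not prescribe a particular WordNet API):

from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def sen_set(word):
    # SenSet(t) of formula (5): one synonym set sense(ti) per WordNet synset
    return [[lemma.name() for lemma in syn.lemmas()] for syn in wn.synsets(word)]

for i, sense in enumerate(sen_set("Brazil"), start=1):
    print(f"sense item {i}: {sense}")
# Expected output (order may vary with the WordNet version):
# sense item 1: ['Brazil', 'Federative_Republic_of_Brazil', 'Brasil']
# sense item 2: ['brazil_nut', 'brazil']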
3. Generating the sense item vectors of a word by merging the word vectors of its synonyms
The steps of the invention's sense item vector generation method based on Wikipedia word statistics and WordNet are as follows:
(1) Initialize the sense item vector.
For the i-th sense item t_i of word t, the invention initializes the sense item vector SV_0(t_i) with the word vector V(t) generated by formula (3):

SV_0(t_i) = V(t)  (7)
(2) Merge the initial sense item vector with single-sense synonym vectors.
The invention defines a synonym with only one sense item in WordNet as a single-sense synonym. The fewer sense items a source word has, the more definite the meaning it expresses, the less its ambiguity, and the less interference in the resulting word vector; therefore, the invention directly adds the weights of the single-sense synonym vector and the source word vector, so that single-sense synonym vectors are emphasized in the generation of the sense item vector.
If the synonym set sense(t_i) of sense item t_i contains a single-sense synonym st, the invention uses formula (8) to merge the word vector V(st) of st with the initial sense item vector SV_0(t_i), generating the new sense item vector SV_1(t_i):

SV_1(t_i) = {(s_i, wt(s_i, SV_0(t_i)) + wt(s_i, V(st))) | s_i ∈ D_1 ∪ D_2}  (8)

where SV_0(t_i) is generated by formula (7), V(st) is generated by formula (3), D_1 denotes the set of dimension words with non-zero weight in SV_0(t_i), D_2 denotes the set of dimension words with non-zero weight in V(st), and the function wt(s, V) denotes the weight of dimension word s in vector V.
(3) Repeat step (2), merging the word vectors of all single-sense synonyms in sense(t_i) into the sense item vector SV_1(t_i). Before each merge, the result SV_1(t_i) of the previous merge is used as the initialization vector SV_0(t_i), i.e., let SV_0(t_i) = SV_1(t_i); if the synonym set sense(t_i) of sense item t_i contains no single-sense synonym, let SV_1(t_i) = SV_0(t_i).
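Since only non-zero weights are stored (see section 5 below), formula (8) reduces to adding weights over the union of two sparse dictionaries. A minimal Python sketch with invented placeholder weights:

def merge_single_sense(sv0, v_st):
    # Formula (8): add weights over D1 ∪ D2, the union of the non-zero
    # dimension words of SV0(ti) and of V(st)
    return {s: sv0.get(s, 0.0) + v_st.get(s, 0.0) for s in sv0.keys() | v_st.keys()}

sv = {"trawler": 0.25, "guinean": 0.28}            # SV0(ti) = V(t)
for v_st in [{"trawler": 0.10, "net": 0.05}]:      # vectors of single-sense synonyms
    sv = merge_single_sense(sv, v_st)              # SV0(ti) <- SV1(ti) after each merge
print(sv)   # weights: trawler 0.35, guinean 0.28, net 0.05 (key order may vary)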
(4) Merge the sense item vector with multi-sense synonym vectors.
The invention defines a synonym with multiple sense items in WordNet as a multi-sense synonym. In order to reduce the negative effect of multi-sense synonym vectors on the generation of the sense item vector, the invention uses formula (9) to combine the multi-sense synonym vector V(dt) nonlinearly with the sense item vector SV_1(t_i), generating the new sense item vector SV_2(t_i), where SV_1(t_i) is generated by steps (2) and (3), dt denotes a multi-sense synonym in the synonym set sense(t_i) of sense item t_i, V(dt) is the word vector of dt generated by formula (3), D_3 denotes the set of dimension words with non-zero weight in SV_1(t_i), and D_4 denotes the set of dimension words with non-zero weight in V(dt).
(5) Repeat step (4), merging the word vectors of all multi-sense synonyms in sense(t_i) into the sense item vector SV_2(t_i). Before each merge, the result SV_2(t_i) of the previous merge is used as the initialization vector SV_1(t_i), i.e., let SV_1(t_i) = SV_2(t_i); if the synonym set sense(t_i) of sense item t_i contains no multi-sense synonym, let SV_2(t_i) = SV_1(t_i).
(6) Generate combined synonym vectors.
The invention defines a synonym consisting of several independent words in a WordNet synonym set as a combined synonym, e.g. the combined word computed_axial_tomography → word 1: computed + word 2: axial + word 3: tomography. For a combined synonym ct in the synonym set sense(t_i) of sense item t_i, the invention first initializes the combined word vector CV_0(ct) with the word vector of the first word ft in ct, according to formula (10):

CV_0(ct) = V(ft)  (10)

Then, for each independent word at in the combined synonym ct, the invention uses formula (11) to merge the word vector V(at) with the combined word vector CV_0(ct), generating the new combined word vector CV_1(ct), where CV_0(ct) is generated by formula (10), V(at) is generated by formula (3), D_5 denotes the set of dimension words with non-zero weight in CV_0(ct), and D_6 denotes the set of dimension words with non-zero weight in V(at).
Finally, formula (11) is applied repeatedly to merge all independent word vectors of the combined synonym ct into the combined word vector CV_1(ct). Before each merge, the result CV_1(ct) of the previous merge is used as the initialization vector CV_0(ct), i.e., let CV_0(ct) = CV_1(ct).
(7) Merge the sense item vector with combined synonym vectors.
For the combined synonym vector CV_1(ct) generated in step (6), the invention uses formula (12) to merge it with the sense item vector SV_2(t_i), generating the final vector SFV(t_i) of sense item t_i, where SV_2(t_i) is generated by steps (4) and (5), CV_1(ct) is generated by step (6), D_7 denotes the set of dimension words with non-zero weight in SV_2(t_i), and D_8 denotes the set of dimension words with non-zero weight in CV_1(ct).
(8) Repeat step (7), merging all combined synonym vectors in sense(t_i) into SFV(t_i). Before each merge, the result SFV(t_i) of the previous merge is used as the initialization vector SV_2(t_i), i.e., let SV_2(t_i) = SFV(t_i); if the synonym set sense(t_i) of sense item t_i contains no combined synonym, let SFV(t_i) = SV_2(t_i).
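The following sketch illustrates steps (6)-(8) for one combined synonym. Formulas (11) and (12) are not reproduced above, so the simple weight addition over the union of dimensions is an illustrative assumption standing in for the patent's exact merge, and the vectors are invented placeholders:

def merge_combined_synonym(words, vectors):
    # CV0(ct) = V(ft): initialize with the first word's vector (formula (10))
    cv = dict(vectors[words[0]])
    # fold in each remaining independent word 'at' of the combined synonym
    # (assumed additive merge in place of the unreproduced formula (11))
    for at in words[1:]:
        v_at = vectors[at]
        cv = {s: cv.get(s, 0.0) + v_at.get(s, 0.0) for s in cv.keys() | v_at.keys()}
    return cv

vectors = {"computed": {"scan": 0.4},
           "axial": {"axis": 0.3, "scan": 0.1},
           "tomography": {"scan": 0.5, "image": 0.2}}
ct = "computed_axial_tomography"
print(merge_combined_synonym(ct.split("_"), vectors))
# weights: scan 1.0, axis 0.3, image 0.2 (key order may vary)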
4. Method for sense item disambiguation based on WordNet
Since word ambiguity is common in natural language, automatic disambiguation is important for a computer to understand natural language accurately. A word sense is the specific meaning a word reflects in a particular language environment; it has more definite and concrete semantic attributes in context and better reflects the relations between words. Sense item disambiguation means determining the sense item that a word takes in a specific text; it is a precondition for, and the matching method of, applying sense item vectors.
The invention provides a WordNet-based sense item disambiguation method that can be used together with the sense item vectors generated by the invention in practical applications. The WordNet-based sense item disambiguation steps are as follows:
(1) Obtain the annotation set of each sense item of a word from WordNet.
Sense item annotations are extracted as follows. In WordNet, for one sense item of a word, all synonyms are put into a collection as the synonym set of that sense item; in addition, an annotation is attached, usually consisting of a few simple sentences (typically a definition and example sentences), so that WordNet users can distinguish easily confusable word senses. The invention extracts the annotation sentences of a sense item's synonym set from WordNet and defines the annotation set as shown in formula (13):

gloss(t_i) = {gl_j | j ∈ [1, p_i]}  (13)

Formula (13) states that the annotation set of sense item t_i consists of p_i annotation sentences, where gl_j denotes any one of the annotation sentences, which are separated by semicolons.
For example, the entry for Brazil in WordNet is:
Brazil:
1. Brazil, Federative Republic of Brazil, Brasil -- (the largest Latin American country and the largest Portuguese speaking country in the world; located in the central and northeastern part of South America; world's leading coffee exporter)
2. brazil nut, brazil -- (three-sided tropical American nut with white oily meat and hard brown shell)
The annotation sets of the two sense items of Brazil thus obtained are:
gloss(Brazil_1) = {the largest Latin American country and the largest Portuguese speaking country in the world; located in the central and northeastern part of South America; world's leading coffee exporter}
gloss(Brazil_2) = {three-sided tropical American nut with white oily meat and hard brown shell}
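As a sketch, the annotation set of formula (13) can be recovered through NLTK, which splits WordNet's semicolon-separated gloss into a definition and example sentences; reassembling the parts into one list is an assumption about how the gl_j sentences map onto that API:

from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def gloss_set(synset):
    # gloss(ti) of formula (13): semicolon-separated annotation sentences glj
    parts = [p.strip() for p in synset.definition().split(";")]
    return parts + synset.examples()

for syn in wn.synsets("Brazil"):
    print(syn.name(), "->", gloss_set(syn))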
(2) Form the list of text pairs to be compared from sense item annotation sentences and the disambiguation text.
Combine each sense item annotation sentence extracted in step (1) with the disambiguation text in which the word to be disambiguated is located, forming the list of text pairs to be compared shown in formula (14):

TextList(gloss(t_i), text_t) = {(gl_j : text_t) | j ∈ [1, p_i]}  (14)

where gloss(t_i) denotes the annotation set of sense item t_i generated by formula (13), text_t denotes the text in which the word t to be disambiguated is located, and (gl_j : text_t) denotes one text pair to be compared. For example, for the sentence to be disambiguated
text_Brazil = "Unlike in the US where African Americans were united in the civil rights struggle, in Brazil the philosophy of whitening has helped divide blacks from other non-whites and prevented a more active civil rights movement"
the list of comparison text pairs obtained is:
List(gloss(Brazil_2), text_Brazil) = {"three-sided tropical American nut with white oily meat and hard brown shell" : "Unlike in the US where African Americans were united in the civil rights struggle, in Brazil the philosophy of whitening has helped divide blacks from other non-whites and prevented a more active civil rights movement"}
(3) Convert the comparison of text pairs into the comparison of core semantic bags consisting of nouns and verbs.
Perform root reduction on the texts in the list of text pairs to be compared generated in step (2), and extract the nouns and verbs of each text as its core semantic bag, converting the comparison of text pairs into the comparison of core semantic bags as shown in formula (15), and generating the corresponding list of core semantic bag comparison pairs as shown in formula (16):

TextList(gloss(t_i), text_t) = BagList(gloss(t_i), text_t)  (15)

BagList(gloss(t_i), text_t) = {(glBag_j : textBag) | j ∈ [1, p_i]}  (16)

where (glBag_j : textBag) denotes one core semantic comparison pair, glBag_j denotes the core semantic bag of nouns and verbs extracted from annotation sentence gl_j, and textBag denotes the core semantic bag of nouns and verbs extracted from the text text_t to be disambiguated.
For example, using the Stanford lemmatization and part-of-speech tagging tools, the core semantic bags of List(gloss(Brazil_2), text_Brazil) are:

List(gloss(Brazil_2), text_Brazil) = {(nut, oily, meat, shell) : (US, African, Americans, civil, rights, Brazil, philosophy, whitening, blacks, non-whites, active, civil, rights, movement)}
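A minimal sketch of the core semantic bag construction of formulas (15)-(16); the text names Stanford tools, so the NLTK tagger and lemmatizer below are stand-ins, not the patent's toolchain:

import nltk
from nltk.stem import WordNetLemmatizer
# requires nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
# and nltk.download('wordnet')

def core_bag(text):
    lem = WordNetLemmatizer()
    bag = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
        if tag.startswith("NN"):             # nouns, root-reduced as nouns
            bag.append(lem.lemmatize(word.lower(), "n"))
        elif tag.startswith("VB"):           # verbs, root-reduced as verbs
            bag.append(lem.lemmatize(word.lower(), "v"))
    return bag

print(core_bag("three-sided tropical American nut with white oily meat and hard brown shell"))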
(4) Calculate the similarity between the annotation set of each sense item of the word and the disambiguation text through the core semantic bags.
The similarity between the annotation set gloss(t_i) of sense item t_i and the text text_t to be disambiguated is calculated by formula (17):

sim(gloss(t_i), text_t) = max{sim(glBag_j, textBag) | j ∈ [1, p_i]}  (17)

where max{·} denotes the maximum value in the set, and the similarity between the core semantic bags glBag_j and textBag is calculated by formula (18):

sim(glBag_j, textBag) = (Σ_{u∈B_1} max_{v∈B_2} sim(u, v) + Σ_{v∈B_2} max_{u∈B_1} sim(u, v)) / (|B_1| + |B_2|)  (18)

where B_1 denotes the core semantic bag glBag_j, B_2 denotes the core semantic bag textBag, and |B_1|, |B_2| denote the numbers of words in B_1 and B_2, respectively. The similarity sim(u, v) of words u and v is calculated by formula (19):

sim(u, v) = 2 · depth(LCS(u, v)) / (depth(u) + depth(v))  (19)

where the function depth(u) denotes the depth of word u in the WordNet hierarchy and LCS(u, v) denotes the nearest common parent node of words u and v in WordNet.
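Formula (19) has the Wu-Palmer form, which NLTK implements directly, and formula (18) is a bidirectional best-match average; the sketch below implements both under that reading, taking the best score over all synset pairs of the two words (an assumption, since the text does not state how a word is mapped to a synset):

from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def word_sim(u, v):
    # sim(u, v) of formula (19): depth of the nearest common parent (LCS)
    # relative to the depths of u and v, i.e. the Wu-Palmer form
    best = 0.0
    for su in wn.synsets(u):
        for sv in wn.synsets(v):
            best = max(best, su.wup_similarity(sv) or 0.0)
    return best

def bag_sim(b1, b2):
    # sim(glBagj, textBag) of formula (18): average of best matches
    # in both directions over |B1| + |B2| words
    s1 = sum(max((word_sim(u, v) for v in b2), default=0.0) for u in b1)
    s2 = sum(max((word_sim(u, v) for u in b1), default=0.0) for v in b2)
    return (s1 + s2) / (len(b1) + len(b2))

print(bag_sim(["nut", "meat", "shell"], ["country", "coffee", "exporter"]))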
(5) Output the sense item whose annotation set has the highest similarity with the disambiguation text as the disambiguation result.
According to step (4), calculate the similarity between the annotation set of each sense item of word t and the text text_t to be disambiguated, and take the sense item t* with the maximum similarity as the final disambiguation result; that is, the disambiguation result of word t in the disambiguation text text_t is the sense item t* whose annotation set has the highest similarity with text_t, as shown in formula (20):

t* = argmax_{i∈[1,n_t]} sim(gloss(t_i), text_t)  (20)

where n_t denotes the number of sense items of t, t* is the sense item maximizing the value of sim(gloss(t_i), text_t), and sim(gloss(t_i), text_t) is calculated by formula (17).
For example:
By formulas (17), (18) and (19), the similarity between the annotation set of sense item Brazil_1 and the disambiguation text text_Brazil is:
sim(gloss(Brazil_1), text_Brazil) = max(0.627, 0.408, 0.745) = 0.745
Similarly, the similarity between the annotation set of sense item Brazil_2 and text_Brazil is:
sim(gloss(Brazil_2), text_Brazil) = max(0.576) = 0.576
Finally, formula (20) selects the sense item with the maximum similarity as the disambiguation result, i.e., the disambiguation result is sense item Brazil_1, whose annotation set has similarity 0.745 with the disambiguation text.
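Putting steps (1)-(5) together, formula (20) is a simple argmax over the per-sense-item similarities of formula (17). A sketch reusing core_bag and bag_sim from the fragments above (the gloss strings are abbreviated here for illustration):

def disambiguate(context_text, glosses_per_sense):
    # glosses_per_sense: for each sense item ti, its annotation sentences glj
    text_bag = core_bag(context_text)
    scores = [max(bag_sim(core_bag(gl), text_bag) for gl in glosses)
              for glosses in glosses_per_sense]             # formula (17)
    best = max(range(len(scores)), key=scores.__getitem__)  # formula (20)
    return best + 1, scores                                 # 1-based sense item index

sense, scores = disambiguate(
    "the philosophy of whitening has prevented a more active civil rights movement",
    [["the largest Latin American country", "world's leading coffee exporter"],
     ["three-sided tropical American nut with white oily meat"]])
print(sense, scores)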
5. Vector storage structure
The word vectors of the invention are high-dimensional, with up to 300,000 dimension words. For convenience of storage and computation, the invention stores, for each generated word vector and sense item vector, only the dimension words with non-zero weight together with their weights; the weight of any dimension word that is not stored defaults to 0.
For example:
SFV(Brazil_1) = {(impa, 0.042714), (lluvia, 0.036314), (maracana, 0.035894), (petropolis, 0.04243), ..., (in, 0.008653), (to, 0.000161), (and, 0.002992)}
6. Experimental comparison
This example uses the English Wikipedia version published on July 1, 2019 for experimental comparison; it contains 15 GB of page text across 5,895,703 article pages. The embodiment uses the JWPL (Java Wikipedia Library) tool to parse the Wikipedia download database; JWPL runs on an optimized database created from the Wikipedia download database and gives quick access to Wikipedia page articles, categories, links, redirects, etc. WordNet 3.0 is used as the knowledge corpus for sense item vector generation and sense item disambiguation. The offline Wikipedia pages are preprocessed and cleaned with a Perl tool. The original word statistics vectors are then obtained with the DISSECT tool as follows: first, the 300,000 most frequent words are selected as target words and word vector dimensions for word statistics training; then the co-occurrence counts of target words and dimension words are obtained with a context window of L = 2; finally, the word co-occurrence matrix and word vectors are obtained. Finally, the sense item vectors of the words are generated with the method provided by the invention.
This embodiment tests the generated sense item vectors based on Wikipedia word statistics and WordNet, together with the proposed sense item disambiguation method, on two internationally accepted test sets: the general word relatedness test set WordSim-353 and the word relatedness test set with disambiguation texts SCWS-2003 (Stanford's Contextual Word Similarities), as shown in Table 1:
TABLE 1. The two data sets used for experimental comparison
For the two data sets, this embodiment uses the Spearman coefficient to compare the test results of the method of the invention. The Spearman coefficient is calculated as follows:

ρ = 1 - 6 Σ_{i=1}^{n} d_i² / (n(n² - 1))

where n denotes the number of word pairs in the data set and d_i is the rank difference between the variables X_i and Y_i, X_i being the i-th element in the list of manual judgment values and Y_i the i-th element in the list of computed values. The comparison between the method of the invention and the original word statistics vectors on the two data sets is shown in the following table:
TABLE 2. Spearman coefficient comparison of sense item vectors and word vectors on the two experimental data sets
Method | WordSim-353 | SCWS
Original word statistics vector | 0.634 | 0.584
Sense item vector and disambiguation method of the invention | 0.638 | 0.631
As can be seen from this experiment, on the WordSim-353 data set, which has no disambiguation texts, the results of the sense item vectors and disambiguation method are essentially identical to those of the original word statistics vectors, with a slight improvement, showing that the method does not deviate from the main direction of the word statistics vectors and causes no negative effect. On the SCWS-2003 data set with disambiguation texts, the disambiguation method provided by the invention plays a key role: the Spearman coefficient rises markedly from 0.584 to 0.631, which fully shows that the sense item vector generation and disambiguation method based on Wikipedia word statistics and WordNet is feasible and excellent.
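As a sketch of the evaluation, the Spearman coefficient above is available in SciPy; the values below are invented placeholders, not the experiment's data:

from scipy.stats import spearmanr

human = [7.5, 6.3, 9.0, 2.1]      # Xi: manual judgment values (placeholders)
model = [0.62, 0.55, 0.80, 0.15]  # Yi: computed similarity values (placeholders)
rho, _ = spearmanr(human, model)
print(f"Spearman coefficient: {rho:.3f}")   # 1.000 here, since the rankings agree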

Claims (3)

1. A word statistics and WordNet based sense item representation and disambiguation method, characterized by comprising the following steps:
Step 1: acquire the offline web page files of Wikipedia and preprocess them to obtain a preprocessed Wikipedia corpus;
Step 2: on the preprocessed Wikipedia corpus, select the K most frequent words as training target words and vector dimension words, and perform word statistics training to obtain a word co-occurrence matrix and word vectors;
Step 3: acquire the sense item set of a word and the synonym set of each sense item from WordNet;
Step 4: generate the sense item vectors of the word by combining the word co-occurrence matrix and word vectors obtained in step 2 with the sense item sets and synonym sets obtained in step 3; namely:
Step 4.1: for the i-th sense item t_i of word t, take the word vector V(t) of word t as the initialization sense item vector SV_0(t_i) of t_i, i.e., let SV_0(t_i) = V(t);
Step 4.2: for each single-sense synonym st in the synonym set, iteratively merge its word vector V(st) with the initialization sense item vector SV_0(t_i) according to the following formula, generating the first-order sense item vector SV_1(t_i) of sense item t_i:

SV_1(t_i) = {(s_i, wt(s_i, SV_0(t_i)) + wt(s_i, V(st))) | s_i ∈ D_1 ∪ D_2}

where wt(s_i, SV_0(t_i)) denotes the weight of dimension word s_i in the initialization sense item vector SV_0(t_i), wt(s_i, V(st)) denotes the weight of dimension word s_i in the word vector V(st), D_1 denotes the set of dimension words with non-zero weight in SV_0(t_i), and D_2 denotes the set of dimension words with non-zero weight in V(st);
Before each iterative merge, the result SV_1(t_i) of the previous merge is used as the initialization sense item vector SV_0(t_i), i.e., let SV_0(t_i) = SV_1(t_i); if the synonym set of sense item t_i contains no single-sense synonym, let SV_1(t_i) = SV_0(t_i);
Step 4.3: for each multi-sense synonym dt in the synonym set, iteratively merge its word vector V(dt) with the first-order sense item vector SV_1(t_i) to generate the second-order sense item vector SV_2(t_i) of sense item t_i, where wt(s_i, SV_1(t_i)) denotes the weight of dimension word s_i in the first-order sense item vector SV_1(t_i), wt_2(s_i, V(dt)) denotes the weight contribution of dimension word s_i from the word vector V(dt), wt(s_j, SV_1(t_i)) denotes the weight of dimension word s_j in SV_1(t_i), D_3 denotes the set of dimension words with non-zero weight in SV_1(t_i), and D_4 denotes the set of dimension words with non-zero weight in V(dt);
Before each iterative merge, the result SV_2(t_i) of the previous merge is used as the first-order sense item vector SV_1(t_i), i.e., let SV_1(t_i) = SV_2(t_i); if the synonym set of sense item t_i contains no multi-sense synonym, let SV_2(t_i) = SV_1(t_i);
Step 4.4: for each combined synonym ct in the synonym set of sense item t_i, take the word vector V(ft) of the first word ft in ct as the initialization combined word vector CV_0(ct) of ct, i.e., let CV_0(ct) = V(ft);
Step 4.5: for each independent word at in the combined synonym ct, iteratively merge its word vector V(at) with the initialization combined word vector CV_0(ct) to generate the first-order combined word vector CV_1(ct) of ct, where wt(s_i, CV_0(ct)) denotes the weight of dimension word s_i in the initialization combined word vector CV_0(ct), wt(s_i, V(at)) denotes the weight of dimension word s_i in the word vector V(at), wt(s_j, CV_0(ct)) denotes the weight of dimension word s_j in CV_0(ct), D_5 denotes the set of dimension words with non-zero weight in CV_0(ct), and D_6 denotes the set of dimension words with non-zero weight in V(at);
Before each iterative merge, the result CV_1(ct) of the previous merge is used as the initialization combined word vector CV_0(ct), i.e., let CV_0(ct) = CV_1(ct);
Step 4.6: iteratively merge the second-order sense item vector SV_2(t_i) obtained in step 4.3 with each first-order combined word vector CV_1(ct) obtained in step 4.5 to generate the final vector SFV(t_i) of sense item t_i, where wt(s_i, SV_2(t_i)) denotes the weight of dimension word s_i in the second-order sense item vector SV_2(t_i), wt(s_i, CV_1(ct)) denotes the weight of dimension word s_i in the first-order combined word vector CV_1(ct), wt(s_j, SV_2(t_i)) denotes the weight of dimension word s_j in SV_2(t_i), D_7 denotes the set of dimension words with non-zero weight in SV_2(t_i), and D_8 denotes the set of dimension words with non-zero weight in CV_1(ct);
Before each iterative merge, the result SFV(t_i) of the previous merge is used as the second-order sense item vector SV_2(t_i), i.e., let SV_2(t_i) = SFV(t_i); if the synonym set of sense item t_i contains no combined synonym, let SFV(t_i) = SV_2(t_i);
Step 5: acquire the annotation set of each sense item of the word from WordNet;
Step 6: form a list of text pairs to be compared from the sense item annotation sentences and the disambiguation text;
Step 7: perform root reduction on the texts in the list of text pairs to be compared, and extract the nouns and verbs of each text as its core semantic bag, thereby converting the comparison of text pairs into the comparison of core semantic bags consisting of nouns and verbs;
Step 8: calculate the similarity between the annotation set of each sense item of the word and the disambiguation text through the core semantic bags;
Step 9: according to the similarity between the annotation set of each sense item and the disambiguation text, output the sense item whose annotation set has the highest similarity with the disambiguation text as the disambiguation result.
2. The word statistics and WordNet based sense item representation and disambiguation method of claim 1, wherein for the generated word vectors and sense item vectors, only the dimension words with non-zero weight and their weights are stored, and the weight of any dimension word that is not stored defaults to 0.
3. The word statistics and WordNet based sense item representation and disambiguation method of claim 1, wherein in step 8 the similarity sim(gloss(t_i), text_t) between the annotation set gloss(t_i) of sense item t_i of word t and the disambiguation text text_t in which the word t to be disambiguated is located is:

sim(gloss(t_i), text_t) = max{sim(glBag_j, textBag) | j ∈ [1, p_i]}

sim(glBag_j, textBag) = (Σ_{u∈B_1} max_{v∈B_2} sim(u, v) + Σ_{v∈B_2} max_{u∈B_1} sim(u, v)) / (|B_1| + |B_2|)

sim(u, v) = 2 · depth(LCS(u, v)) / (depth(u) + depth(v))

where sim(glBag_j, textBag) denotes the similarity between the core semantic bags glBag_j and textBag, max{·} denotes the maximum value, glBag_j denotes the core semantic bag of nouns and verbs extracted from annotation sentence gl_j, gl_j denotes an annotation sentence (annotation sentences are separated by semicolons) in the annotation set gloss(t_i) of sense item t_i, textBag denotes the core semantic bag of nouns and verbs extracted from the text text_t to be disambiguated, p_i denotes the number of annotation sentences in the annotation set gloss(t_i) of sense item t_i, B_1 denotes the core semantic bag glBag_j, B_2 denotes the core semantic bag textBag, |·| denotes the number of words in a core semantic bag, depth(u) and depth(v) denote the depths of words u and v in the WordNet hierarchy, LCS(u, v) denotes the nearest common parent node of words u and v in the WordNet hierarchy, and depth(LCS(u, v)) denotes the depth of that nearest common parent node in the WordNet hierarchy.
CN201910803617.XA 2019-08-28 2019-08-28 Word statistics and WordNet-based semantic item representation and disambiguation method Active CN110569503B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910803617.XA CN110569503B (en) 2019-08-28 2019-08-28 Word statistics and WordNet-based semantic item representation and disambiguation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910803617.XA CN110569503B (en) 2019-08-28 2019-08-28 Word statistics and WordNet-based semantic item representation and disambiguation method

Publications (2)

Publication Number Publication Date
CN110569503A CN110569503A (en) 2019-12-13
CN110569503B true CN110569503B (en) 2023-12-29

Family

ID=68776561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910803617.XA Active CN110569503B (en) 2019-08-28 2019-08-28 Word statistics and WordNet-based semantic item representation and disambiguation method

Country Status (1)

Country Link
CN (1) CN110569503B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7457531B2 (en) * 2020-02-28 2024-03-28 株式会社Screenホールディングス Similarity calculation device, similarity calculation program, and similarity calculation method
CN113128210A (en) * 2021-03-08 2021-07-16 西安理工大学 Webpage table information analysis method based on synonym discovery
CN114091473B (en) * 2022-01-20 2022-05-03 北京建筑大学 Web service discovery method based on comprehensive semantics
CN117610579B (en) * 2024-01-19 2024-04-16 卓世未来(天津)科技有限公司 Semantic analysis method and system based on long-short-term memory network

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1916887A (en) * 2006-09-06 2007-02-21 哈尔滨工程大学 Method for eliminating ambiguity without directive word meaning based on technique of substitution words
CN101295294A (en) * 2008-06-12 2008-10-29 昆明理工大学 Improved Bayes acceptation disambiguation method based on information gain
CN103729343A (en) * 2013-10-10 2014-04-16 上海交通大学 Semantic ambiguity eliminating method based on encyclopedia link co-occurrence
CN108446269A (en) * 2018-03-05 2018-08-24 昆明理工大学 A kind of Word sense disambiguation method and device based on term vector
CN108647705A (en) * 2018-04-23 2018-10-12 北京交通大学 Image, semantic disambiguation method and device based on image and text semantic similarity
CN108874772A (en) * 2018-05-25 2018-11-23 太原理工大学 A kind of polysemant term vector disambiguation method
CN108932222A (en) * 2017-05-22 2018-12-04 中国移动通信有限公司研究院 A kind of method and device obtaining the word degree of correlation
CN109325230A (en) * 2018-09-21 2019-02-12 广西师范大学 A kind of phrase semantic degree of correlation judgment method based on wikipedia bi-directional chaining

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7899666B2 (en) * 2007-05-04 2011-03-01 Expert System S.P.A. Method and system for automatically extracting relations between concepts included in text

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1916887A (en) * 2006-09-06 2007-02-21 哈尔滨工程大学 Method for eliminating ambiguity without directive word meaning based on technique of substitution words
CN101295294A (en) * 2008-06-12 2008-10-29 昆明理工大学 Improved Bayes acceptation disambiguation method based on information gain
CN103729343A (en) * 2013-10-10 2014-04-16 上海交通大学 Semantic ambiguity eliminating method based on encyclopedia link co-occurrence
CN108932222A (en) * 2017-05-22 2018-12-04 中国移动通信有限公司研究院 A kind of method and device obtaining the word degree of correlation
CN108446269A (en) * 2018-03-05 2018-08-24 昆明理工大学 A kind of Word sense disambiguation method and device based on term vector
CN108647705A (en) * 2018-04-23 2018-10-12 北京交通大学 Image, semantic disambiguation method and device based on image and text semantic similarity
CN108874772A (en) * 2018-05-25 2018-11-23 太原理工大学 A kind of polysemant term vector disambiguation method
CN109325230A (en) * 2018-09-21 2019-02-12 广西师范大学 A kind of phrase semantic degree of correlation judgment method based on wikipedia bi-directional chaining

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
An efficient approach for measuring semantic relatedness using Wikipedia bidirectional links; Xinhua Zhu et al.; Journal of Applied Intelligence; May 2, 2019; pp. 3708-3730 *
Research and design of a Perl-based word sense disambiguation method; Shi Haifeng et al.; Computer Knowledge and Technology; Aug. 2009; vol. 5, no. 24; pp. 6765, 6776 *
Vector representations of words and sense items drawing on a manual knowledge base: HowNet as a case study; Sun Maosong et al.; Journal of Chinese Information Processing; Nov. 2016; vol. 30, no. 6; pp. 1-6, 14 *
Research on multi-feature fusion methods for Chinese named entity linking; Lin Zefei et al.; Journal of the China Society for Scientific and Technical Information; Jan. 2019; vol. 38, no. 1; pp. 68-78 *

Also Published As

Publication number Publication date
CN110569503A (en) 2019-12-13

Similar Documents

Publication Publication Date Title
CN110569503B (en) Word statistics and WordNet-based semantic item representation and disambiguation method
Vougiouklis et al. Neural wikipedian: Generating textual summaries from knowledge base triples
Zouaghi et al. Combination of information retrieval methods with LESK algorithm for Arabic word sense disambiguation
US20180260381A1 (en) Prepositional phrase attachment over word embedding products
CN109783806B (en) Text matching method utilizing semantic parsing structure
Sarkar et al. A practical part-of-speech tagger for Bengali
Stankevičius et al. Testing pre-trained Transformer models for Lithuanian news clustering
pal Singh et al. Naive Bayes classifier for word sense disambiguation of Punjabi language
Han et al. Unsupervised Word Sense Disambiguation based on Word Embedding and Collocation.
Vakare et al. Sentence semantic similarity using dependency parsing
Raj et al. An Artificial Neural Network Approach for Sentence Boundary Disambiguation in Urdu Language Text.
CN113963748A (en) Protein knowledge map vectorization method
Nathani et al. Part of Speech Tagging for a Resource Poor Language: Sindhi in Devanagari Script using HMM and CRF
Pitichotchokphokhin et al. Discover underlying topics in Thai news articles: a comparative study of probabilistic and matrix factorization approaches
Rebala et al. Natural language processing
Bindu et al. Design and development of a named entity based question answering system for Malayalam language
Velasco et al. Automatic wordnet construction using word sense induction through sentence embeddings
Beumer Evaluation of Text Document Clustering using k-Means
Bergsma Large-scale semi-supervised learning for natural language processing
Khedkar et al. A survey of machine translation and parts of speech tagging for indian languages
Angle et al. Kannada morpheme segmentation using machine learning
Alrehaili et al. Extraction of multi-word terms and complex terms from the Classical Arabic text of the Quran
Bhola et al. Text Summarization Based On Ranking Techniques
Naeem et al. Exploiting Transliterated Words for Finding Similarity in Inter-Language News Articles using Machine Learning
Abduljabbar et al. Term Extraction for a Single & Multi-word Based on Islamic Corpus English

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230816

Address after: Room 801, 85 Kefeng Road, Huangpu District, Guangzhou City, Guangdong Province

Applicant after: Yami Technology (Guangzhou) Co.,Ltd.

Address before: 541004 No. 15 Yucai Road, Qixing District, Guilin, the Guangxi Zhuang Autonomous Region

Applicant before: Guangxi Normal University

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant