CN113065002A - Chinese semantic disambiguation method based on knowledge graph and context - Google Patents

Chinese semantic disambiguation method based on knowledge graph and context

Info

Publication number
CN113065002A
CN113065002A
Authority
CN
China
Prior art keywords
context
disambiguation
dictionary
vector
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110417960.8A
Other languages
Chinese (zh)
Other versions
CN113065002B (en)
Inventor
刘子宇
张华平
雷玉新
杨耀飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202110417960.8A priority Critical patent/CN113065002B/en
Publication of CN113065002A publication Critical patent/CN113065002A/en
Application granted granted Critical
Publication of CN113065002B publication Critical patent/CN113065002B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Abstract

The invention relates to a Chinese semantic disambiguation method based on a knowledge graph and context, belonging to the technical field of natural language processing. By constructing a disambiguation knowledge graph and performing context-based semantic disambiguation, the invention can extract ambiguous-word entities, disambiguating-word entities, and the relationships between them from an acquired data set that has no explicit semantic annotation and consists of original sentences and their disambiguated revisions, while storing the context as an attribute of the disambiguating-word entity. Disambiguation knowledge is thereby accumulated in the knowledge graph and made available to the semantic disambiguation task. The method can accurately find registered ambiguous words in a new text to be disambiguated. The invention also realizes vector representation of the context and vector-based similarity calculation, so that the software can perceive the context of an ambiguous word more accurately.

Description

Chinese semantic disambiguation method based on knowledge graph and context
Technical Field
The invention relates to a Chinese semantic disambiguation method, in particular to a Chinese semantic disambiguation method based on a knowledge graph and a context, and belongs to the technical field of natural language processing.
Background
Semantic disambiguation is a core and difficult problem in natural language processing, and it affects the performance of almost all tasks, such as search engines, opinion mining, text understanding and generation, and reasoning. Disambiguation is the process of determining the semantics of an object from its context, where the "object" may be a word or a phrase. Current semantic disambiguation methods include dictionary-based methods, supervised methods, and unsupervised or semi-supervised methods. Although these methods have achieved good results in some fields, no existing semantic disambiguation method adapts well to film and television lines composed mainly of spoken Chinese.
The dictionary-based semantic disambiguation method is one of the most basic approaches. Given a word to be disambiguated and its context, the idea is to compute the overlap between the context and the definition of each word sense in a semantic dictionary, and to select the sense with the largest overlap as the correct sense of the word in that context. However, because dictionary sense definitions are usually terse, the overlap with the context of the word to be disambiguated may be zero, resulting in low disambiguation performance.
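For illustration only, the gloss-overlap idea described above can be sketched in a few lines of Python; the function name, the sense_definitions input, and the tokenization are hypothetical placeholders, not part of the invention:

    def gloss_overlap_disambiguate(context_tokens, sense_definitions):
        # sense_definitions: hypothetical dict mapping each candidate sense
        # to the list of tokens in its dictionary definition.
        context = set(context_tokens)
        best_sense, best_overlap = None, -1
        for sense, gloss_tokens in sense_definitions.items():
            overlap = len(context & set(gloss_tokens))  # coverage between gloss and context
            if overlap > best_overlap:
                best_sense, best_overlap = sense, overlap
        return best_sense  # sense with the largest coverage; arbitrary when all overlaps are zero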
Supervised disambiguation methods build a disambiguation model from semantically annotated corpora, and the research focus is on feature representation. Although supervised methods can achieve better disambiguation performance, they require a large amount of manual corpus annotation, which is time-consuming and labor-intensive.
To avoid the need for large-scale annotated corpora, semi-supervised or unsupervised methods require little or no manually labeled data. Some methods use only a small amount of manually labeled corpora as seed data, while others extract seed data from word-aligned bilingual corpora. Following the observation that different senses of a word are often reflected in different syntactic collocations, the semantic preference of a syntactic structure can be obtained automatically from a large-scale corpus by computing semantic preference strength and selectional association, and the resulting preferences are then used for semantic disambiguation.
In general, while semi-supervised or unsupervised methods do not require large amounts of manually labeled data, they rely on a large-scale unlabeled corpus and on the results of syntactic analysis over that corpus. Their coverage of words to be disambiguated may also be limited. For example, some methods only examine a few special syntactic structures and can therefore disambiguate words only at specific syntactic positions, such as the verb, the subject or object of a verb, or a noun modified by an adjective, and cannot cover all ambiguous words.
Common contextual features can be generalized into three types:
(1) Lexical features: generally the words appearing in a window before and after the word to be disambiguated, together with their parts of speech;
(2) Syntactic features: the syntactic relations of the word to be disambiguated in its context, such as the verb-object relation, whether the word takes a subject or object, the type of the subject/object chunk, the head word of the subject/object, and the like;
(3) Semantic features: semantic information added on top of the syntactic relations, such as the sense of the subject/object head word, or even semantic role labeling information.
In recent years, with the application of deep learning in the field of natural language processing, many semantic disambiguation methods based on deep learning have appeared. Deep learning can automatically extract the low-level or high-level features required for classification, reducing the workload of feature extraction.
Disclosure of Invention
The invention aims to solve the technical problems of the prior art that ambiguous words cannot be accurately identified and that adaptability to specific domains is weak, and creatively provides a Chinese semantic disambiguation method based on a knowledge graph and context.
The innovation points of the invention are as follows: construction of a disambiguation knowledge graph and context-based semantic disambiguation. On the basis of an acquired data set consisting of original sentences and their disambiguated revisions, ambiguous words are discovered in the data set, the relationships between ambiguous words (or phrases) and their disambiguating words (or phrases) are extracted, and a disambiguation knowledge graph is constructed that contains the modification method for each ambiguous word and the context in which the modification occurred; the knowledge graph is then used to find ambiguous words in a text to be disambiguated, and disambiguation modification suggestions are recommended according to the context of each ambiguous word.
The invention is realized by the following technical scheme.
First, a disambiguation knowledge-graph is defined:
A knowledge graph in the general sense is a semantic network that reveals relationships between entities. It is composed of several pieces of knowledge, each represented as a triple (entity 1, relationship, entity 2) or (entity, attribute, attribute value). The disambiguation knowledge graph of the present invention is defined as follows:
One piece of knowledge is represented by a triple T = (V_a, R, V_d). V_a is an ambiguous word, i.e., a word or phrase containing multiple semantics, stored as a list consisting of a number of words. R represents the relationship "can be replaced by", i.e., V_a can be replaced by V_d. V_d is the disambiguating word of V_a, i.e., the unambiguous word or phrase that replaces V_a, also stored as a list consisting of a number of words. V_d has two attributes: the frequency with which T appears in the acquired data set, and the full set of contexts in which T is located in the acquired data set.
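As a minimal illustration (not part of the patent text), one such knowledge entry could be held in nested Python dictionaries keyed by V_a, matching the dictionary layout D (or D') described in step 1.2 below; all field names and example strings here are hypothetical:

    # Hypothetical in-memory layout of the disambiguation knowledge graph.
    # Outer key: V_a (the ambiguous word/phrase); inner key: V_d (its replacement).
    disambiguation_graph = {
        "ambiguous_phrase": {
            "replacement_phrase": {
                "frequency": 3,                     # how often T = (V_a, R, V_d) occurs in the data set
                "contexts": ["context sentence 1",  # full set of contexts in which T occurs
                             "context sentence 2"],
            },
        },
    }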
A Chinese semantic disambiguation method based on knowledge graph and context comprises the following steps:
step 1: constructing a disambiguation knowledge graph, comprising the following steps;
step 1.1: the acquired dataset is pre-processed. The acquired data set contains a training set and a validation set.
The training set contains L tuples, each composed of an un-disambiguated sentence and its disambiguated counterpart. Each sentence in each tuple is preprocessed by operations such as symbol removal and word segmentation, yielding a tuple P = (S_a, S_d) composed of 2 token lists. The L tuples P form a set G.
Step 1.2: extracting ambiguous words, a replacing method thereof and a context when replacing occurs, and performing the following processing on each tuple P in the set G obtained in step 1.1:
step 1.2.1: compute the set of words that occur in both S_a and S_d: H = S_a ∩ S_d.
Step 1.2.2: compute I_a = S_a - H and I_d = S_d - H, where I_a denotes the words that appear only in S_a and I_d denotes the words that appear only in S_d; if either I_a or I_d is empty, terminate the processing of this tuple.
Step 1.2.3: take the elements of I_a in the order in which they appear in S_a and merge positionally adjacent elements to form a list I'_a = [V_a1, V_a2, …, V_ax], x ≥ 1.
Step 1.2.4: take the elements of I_d in the order in which they appear in S_d and merge positionally adjacent elements to form a list I'_d = [V_d1, V_d2, …, V_dy], y ≥ 1.
Step 1.2.5: align the elements of I'_a and I'_d (the 1st element of I'_a corresponds to the 1st element of I'_d, and so on), forming z triples T = (V_a, R, V_d), where z = min(x, y). If x and y are not equal, the surplus elements of the longer of the two lists are discarded.
Since R has the same meaning, the relation "can be replaced by", in all triples formed in this step, only V_a and V_d are considered for storage: a dictionary D (or D') is built with V_a as the key. The value corresponding to a given V_a in D (or D') is itself a dictionary whose keys are all the V_d associated with that V_a and whose values are the frequency of occurrence of the triple T = (V_a, R, V_d) and the contexts in which T occurs.
The dictionary whose contexts consist of 3 sentences (the preceding sentence, the current sentence, and the following sentence) is D; the dictionary whose contexts consist of only the current sentence is D'. The number of keys of dictionary D (or D') is N.
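The extraction logic of steps 1.2.1-1.2.5 can be sketched as follows. This is an illustrative reading of the steps written for this description; details such as how positional adjacency is tracked and how phrase tokens are joined are assumptions, not requirements stated in the patent:

    def extract_triples(sa_tokens, sd_tokens):
        # Step 1.2.1: words common to S_a and S_d.
        common = set(sa_tokens) & set(sd_tokens)
        # Step 1.2.2: positions of the words unique to each sentence (I_a and I_d).
        ia = [i for i, w in enumerate(sa_tokens) if w not in common]
        id_ = [i for i, w in enumerate(sd_tokens) if w not in common]
        if not ia or not id_:
            return []

        def merge_adjacent(positions, tokens):
            # Steps 1.2.3 / 1.2.4: merge positionally adjacent unique words into phrases.
            phrases, current = [], [tokens[positions[0]]]
            for prev, cur in zip(positions, positions[1:]):
                if cur == prev + 1:
                    current.append(tokens[cur])
                else:
                    phrases.append("".join(current))
                    current = [tokens[cur]]
            phrases.append("".join(current))
            return phrases

        va_list = merge_adjacent(ia, sa_tokens)   # I'_a
        vd_list = merge_adjacent(id_, sd_tokens)  # I'_d
        z = min(len(va_list), len(vd_list))       # Step 1.2.5: align and drop the surplus
        return [(va_list[k], "can be replaced by", vd_list[k]) for k in range(z)]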
Step 1.3: extracting the context when the ambiguous word is unchanged, for each value V of the dictionary D or D' obtained in step 1.2aAnd the tuple P in the set G obtained in step 1.1 ═ S (S)a,Sd) For each VaThe following operations are carried out:
step 1.3.1: judgment VaWhether or not S of any P is simultaneously presentaAnd SdIf there is no such case, the operation is ended; if yes, skipping to step 3.2;
step 1.3.2: will VaAppears at SaAll context contexts in (1) are stored in dictionary D or D ', i.e. in dictionary D or D' key VaAdding a key V to the corresponding value (for a dictionary)aThe corresponding values are the frequency of occurrence of the context and all context;
step 1.4: the context contexts are expressed in an index form, e in total are numbered from 0 for each context appearing in the dictionary D or D 'obtained in step 1.3 to form an index, that is, one context corresponds to one number, and the context text in the dictionary D or D' obtained in step 1.3 is replaced by the corresponding number.
Step 1.5: using a BERT pre-training model, representing e context contexts originally represented by texts in a dictionary D by using a vector C with a dimension D, and splicing all the vectors C according to a formula (1) to obtain a context matrix C:
C = [c_1, c_2, …, c_e]^T  (1)
where c_i is the vector representing the i-th context and T denotes the matrix transpose.
Similarly, the context matrix obtained from dictionary D' is C'.
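Step 1.5 only states that "a BERT pre-training model" is used. As a sketch under assumptions, the Hugging Face transformers library with the bert-base-chinese checkpoint and mean pooling over the last hidden layer could produce the vectors c and the matrix C; the checkpoint name and the pooling strategy are choices made here for illustration, not specified by the patent:

    import torch
    from transformers import BertModel, BertTokenizer

    # Assumed checkpoint; the patent does not name a specific BERT model.
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertModel.from_pretrained("bert-base-chinese")
    model.eval()

    def embed_context(text):
        # Return one d-dimensional context vector c (mean of the last hidden states).
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        return outputs.last_hidden_state.mean(dim=1).squeeze(0)   # shape (d,)

    def build_context_matrix(contexts):
        # Stack the e context vectors row-wise into the matrix C of formula (1).
        return torch.stack([embed_context(t) for t in contexts])  # shape (e, d)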
And at this point, the construction of the disambiguation knowledge graph is completed.
Step 2: performing context-based semantic disambiguation comprising the steps of:
step 2.1: loading the disambiguation knowledge graph, which comprises the dictionaries D and D' and the matrices C and C' obtained in step 1;
step 2.2: acquiring a list of sentences to be disambiguated, wherein the total number of the sentences is M;
step 2.3: performing word segmentation on the M sentences to be disambiguated obtained in step 2.2, producing M token lists Q;
step 2.4: initializing a variable j = 1, where j denotes the j-th list and 1 ≤ j ≤ M;
step 2.5: find ambiguous words in the j-th list Q_j. For each key V_a in the dictionary D, judge whether V_a appears in Q_j: V_a is considered to appear in Q_j only when every element of V_a is present in Q_j and these elements occur in Q_j in the same order as in V_a. If V_a appears, put V_a into the set U_j. If the set U_j is not empty, i.e., Q_j contains ambiguous words, go to step 2.6; if the set U_j is empty, i.e., Q_j contains no ambiguous word, jump to step 2.7.
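A sketch of the lookup in step 2.5 follows, assuming each key V_a of dictionary D is kept as a tuple of its segmented tokens; whether the matched tokens must also be contiguous in Q_j is not spelled out in the text, so this version only enforces the relative order:

    def contains_in_order(va_tokens, qj_tokens):
        # True if every token of V_a occurs in Q_j, in the same relative order.
        it = iter(qj_tokens)
        return all(token in it for token in va_tokens)

    def find_ambiguous_words(qj_tokens, dictionary_d):
        # Build U_j: all keys V_a of dictionary D found in the segmented sentence Q_j.
        return {va for va in dictionary_d if contains_in_order(va, qj_tokens)}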
Step 2.6: for each element V_a in the set U_j obtained in step 2.5, perform semantic disambiguation and give modification suggestions.
The method comprises the following specific steps:
step 2.6.1: using the BERT pre-trained model, represent the context of the position of V_a in Q_j by a vector f;
step 2.6.2: according to the index numbers of all contexts related to V_a in dictionary D, obtain from the rows of matrix C corresponding to those numbers a multidimensional vector F representing the set of three-sentence contexts related to V_a; likewise, according to the index numbers of all contexts related to V_a in dictionary D', obtain from the corresponding rows of matrix C' a multidimensional vector F' representing the set of single-sentence contexts related to V_a;
step 2.6.3: calculate, according to formula (2), the similarity between the vector f and each vector in the multidimensional vector F, obtaining a similarity vector g between the context of V_a in Q_j and the three-sentence contexts appearing in the disambiguation knowledge graph. Similarly, calculate the similarity between the vector f and each vector in the multidimensional vector F', obtaining a similarity vector g' between the context of V_a in Q_j and the single-sentence contexts appearing in the disambiguation knowledge graph;
similarity = f F^T / (|f| |F|)  (2)
where |f| represents the modulus of f, |F| represents the modulus of F, and F^T represents the transpose of F.
Step 2.6.4, calculating the mixed similarity of g and g' according to the formula (3), and enabling the V corresponding to the context with the highest mixed similaritybAs a modification suggestion output, skipping to step 2.8;
mix_similarity=(1-α)g+αg′ (3)
where α is the preset weight of the single-sentence context similarity in the mixed similarity;
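Formulas (2) and (3) can be sketched as row-wise cosine similarity followed by a weighted mixture. The small epsilon, the example value of α, and the assumption that g and g' are index-aligned with the same knowledge entries are additions made here for the sketch, not stated in the patent:

    import torch

    def similarity_vector(f, F):
        # Formula (2) read as row-wise cosine similarity: component i compares f
        # with the i-th context vector stored in F.
        return (F @ f) / (F.norm(dim=1) * f.norm() + 1e-12)

    def mixed_similarity(g, g_prime, alpha):
        # Formula (3): mix_similarity = (1 - alpha) * g + alpha * g'.
        return (1 - alpha) * g + alpha * g_prime

    # Usage sketch: pick the V_d whose context attains the highest mixed similarity.
    # g = similarity_vector(f, F); g_prime = similarity_vector(f, F_prime)
    # best = int(torch.argmax(mixed_similarity(g, g_prime, alpha=0.5)))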
step 2.7: outputting an 'unambiguous' modification suggestion, and skipping to the step 2.8;
step 2.8: increase the value of j by 1 and judge its value; if 1 ≤ j ≤ M, jump to step 2.5; otherwise (j > M), jump to step 2.9.
Step 2.9: the disambiguation result is saved so that each list Q in step 2.3 has a modification proposal corresponding to it.
The modification suggestion includes whether a modification is suggested and, for each ambiguous word hit in Q (if any), the recommended disambiguating word.
Advantageous effects
Compared with the prior art, the method of the invention has the following advantages:
1. By incorporating knowledge graph technology, the invention can extract ambiguous-word entities, disambiguating-word entities, and the relationships between them from an acquired data set without explicit semantic annotation, consisting of original sentences and their disambiguated revisions, while storing the context as an attribute of the disambiguating-word entity; disambiguation knowledge is thus accumulated in the knowledge graph and made available to the semantic disambiguation task;
2. By means of the disambiguation knowledge graph, the method can accurately find registered ambiguous words in a new text to be disambiguated;
3. By incorporating BERT-related technology, the invention realizes vector representation of the context and vector-based similarity calculation, so that the software can perceive the context of an ambiguous word more accurately.
drawings
FIG. 1 is an overall architecture of a disambiguation knowledge-graph upon which the method of the present invention relies;
FIG. 2 is a flow chart of an embodiment of the method of step 1 of the present invention for constructing a disambiguation knowledgegraph;
FIG. 3 is a flow chart of an embodiment of the method step 2 of the present invention for semantic disambiguation based on context.
Detailed Description
The method of the present invention will be described in further detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Examples
The semantic disambiguation of movie and television drama lines composed of Chinese spoken language is taken as an example.
In a specific implementation of the method, the data set acquired in step 1 consists of the lines of 53 television series; the lines of 1 series serve as the test set and the remainder as the training set. The organization and example content are shown in Table 1:
Table 1. Data set example

Original sentence | Disambiguated sentence
Mother wants to embrace grandson | Mother wants to obtain grandson
He finds and hits yes bar | He is shortbeating and is bar
…… | ……
Based on the method, this embodiment can accumulate disambiguation knowledge in the knowledge graph and enable the semantic disambiguation task; it can accurately find registered ambiguous words in a new text to be disambiguated; and it realizes vector representation of the context and vector-based similarity calculation, so that the software can perceive the context of an ambiguous word more accurately. For the specific domain of semantic disambiguation of film and television lines composed of spoken Chinese, the method is more adaptable and provides a practical tool for real line-disambiguation work.
The proposed disambiguation knowledge-graph structure of the present invention is shown in FIG. 1, where each piece of knowledge is represented in the form of a triple (entity 1, relationship, entity 2);
wherein, the entity 1 is an "ambiguous word", that is, a word or a phrase containing various semantics;
entity 2 is the disambiguating word of entity 1, i.e., the unambiguous word or phrase that substitutes for entity 1; entity 2 has two attributes, namely the frequency with which the triple appears in the acquired data set and the full set of contexts in which the triple occurs in the acquired data set.
The relationship is "replaceable", i.e., entity 1 can be replaced with entity 2;
the disambiguation knowledge map corresponds to the disambiguation knowledge map in step 1;
FIG. 2 is a flow chart of a specific implementation of the method for Chinese semantic disambiguation based on a knowledge graph and context according to step 1 of the present invention to construct a disambiguation knowledge graph;
FIG. 3 is a flow chart of the embodiment of semantic disambiguation based on context in step 2 of the Chinese semantic disambiguation method based on knowledge graph and context according to the present invention.
On the basis of the above-mentioned acquired data set, a disambiguation knowledge graph can be constructed, corresponding to step 1 of the method proposed by the present invention; an example is shown in Table 2;
Table 2. Disambiguation knowledge graph example
Each disambiguating word corresponds to multiple contexts, and a context takes one of two forms: one contains only the sentence itself, and the other contains 3 sentences, namely the preceding sentence, the sentence itself, and the following sentence.
Based on the disambiguation knowledge graph, semantic disambiguation based on context is carried out on the test set, corresponding to step 2 of the method provided by the invention, and the result is shown in table 3;
table 3 test set disambiguation results example
Wherein the interpretation of the states is shown in table 4;
table 4 state interpretation
The threshold of the probability is set to 0.9 in the present embodiment.
The semantic disambiguation result on the test set is evaluated. The evaluation metric, accuracy, is defined as the percentage of all input sentences for which a correct suggestion is given; the baseline is defined as the percentage of all input sentences for which the "no modification" suggestion is the correct one. The comparison between this embodiment and the baseline is shown in Table 5; the accuracy of the method is improved over the baseline.
Table 5. Comparison of results

Method | Accuracy
Baseline | 86.0769%
The method of the invention | 93.6925%
While the foregoing is directed to the preferred embodiment of the present invention, it is not intended that the invention be limited to the embodiment and the drawings disclosed herein. Equivalents and modifications may be made without departing from the spirit of the disclosure, which is to be considered as within the scope of the invention.

Claims (2)

1. A Chinese semantic disambiguation method based on knowledge graph and context is characterized in that:
firstly, a disambiguation knowledge graph is defined: one piece of knowledge is represented by a triple T = (V_a, R, V_d), where V_a is an ambiguous word, i.e., a word or phrase containing multiple semantics, and is a list consisting of a number of words; R represents the relationship "can be replaced by", i.e., V_a can be replaced by V_d; V_d is the disambiguating word of V_a, i.e., the unambiguous word or phrase that replaces V_a, and is a list consisting of a number of words; V_d has two attributes, namely the frequency with which T appears in the acquired data set and the full set of contexts in which T is located in the acquired data set;
the method comprises the following steps:
step 1: constructing a disambiguation knowledge graph, comprising the following steps;
step 1.1: preprocessing the acquired data set; the acquired data set comprises a training set and a verification set;
wherein the training set has L tuples, each composed of an un-disambiguated sentence and its disambiguated sentence; each sentence in each tuple is processed with operations including symbol removal and word segmentation, obtaining a tuple P = (S_a, S_d) composed of 2 lists; the L tuples P form a set G;
step 1.2: extracting ambiguous words, their replacements, and the contexts in which the replacement occurs, and performing the following processing on each tuple P in the set G obtained in step 1.1:
step 1.2.1: calculating the set of words that occur in both S_a and S_d: H = S_a ∩ S_d;
step 1.2.2: calculating I_a = S_a - H and I_d = S_d - H, where I_a denotes the words that appear only in S_a and I_d denotes the words that appear only in S_d; if either I_a or I_d is empty, the operation on this tuple is ended;
step 1.2.3: taking the elements of I_a in the order in which they appear in S_a and merging positionally adjacent elements to form a list I'_a = [V_a1, V_a2, …, V_ax], x ≥ 1;
step 1.2.4: taking the elements of I_d in the order in which they appear in S_d and merging positionally adjacent elements to form a list I'_d = [V_d1, V_d2, …, V_dy], y ≥ 1;
step 1.2.5: aligning the elements of I'_a and I'_d to form z triples T = (V_a, R, V_d), z = min(x, y); if x and y are not equal, discarding the surplus elements of the longer of the two lists;
since R has the same meaning, the relation "can be replaced by", in all triples formed in this step, only V_a and V_d are considered for storage: a dictionary D (or D') is built with V_a as the key; the value corresponding to a given V_a in D (or D') is itself a dictionary whose keys are all the V_d associated with that V_a and whose values are the frequency of occurrence of the triple T = (V_a, R, V_d) and the contexts in which T occurs; the dictionary whose contexts consist of 3 sentences, namely the preceding sentence, the current sentence, and the following sentence, is D, the dictionary whose contexts consist of only the current sentence is D', and the number of keys of dictionary D (or D') is N;
step 1.3: extracting the contexts in which the ambiguous word is left unchanged: for each key V_a of the dictionary D (or D') obtained in step 1.2 and each tuple P = (S_a, S_d) in the set G obtained in step 1.1, performing the following operations:
step 1.3.1: judging whether V_a appears simultaneously in S_a and S_d of any P; if there is no such case, ending the operation; if there is, jumping to step 1.3.2;
step 1.3.2: storing all contexts in which V_a appears in S_a into dictionary D (or D'), i.e., in the value (itself a dictionary) corresponding to key V_a of dictionary D (or D'), adding a key V_a whose corresponding values are the frequency of occurrence and all the contexts;
step 1.4: expressing the contexts in index form: the contexts appearing in the dictionary D (or D') obtained in step 1.3, e in total, are numbered from 0 to form an index, i.e., one context corresponds to one number, and each context text in the dictionary D (or D') obtained in step 1.3 is replaced by its corresponding number;
step 1.5: using a BERT pre-trained model, representing each of the e contexts originally stored as text in dictionary D by a vector c of dimension d, and concatenating all the vectors c according to formula (1) to obtain a context matrix C:
C = [c_1, c_2, …, c_e]^T  (1)
wherein c_i is the vector representing the i-th context and T denotes the matrix transpose;
similarly, the context matrix obtained from dictionary D' is C';
step 2: performing context-based semantic disambiguation comprising the steps of:
step 2.1: loading the disambiguation knowledge graph, which comprises the dictionaries D and D' and the matrices C and C' obtained in step 1;
step 2.2: acquiring a list of sentences to be disambiguated, wherein the total number of the sentences is M;
step 2.3: performing word segmentation on the M sentences to be disambiguated obtained in step 2.2, producing M token lists Q;
step 2.4: initializing a variable j = 1, where j denotes the j-th list and 1 ≤ j ≤ M;
step 2.5: finding ambiguous words in the j-th list Q_j: for each key V_a in dictionary D, judging whether V_a appears in Q_j, where V_a is considered to appear in Q_j only when every element of V_a is present in Q_j and these elements occur in Q_j in the same order as in V_a; if V_a appears, putting V_a into the set U_j; if the set U_j is not empty, i.e., Q_j contains ambiguous words, going to step 2.6; if the set U_j is empty, i.e., Q_j contains no ambiguous word, jumping to step 2.7;
step 2.6: for each element V_a in the set U_j obtained in step 2.5, performing semantic disambiguation and giving modification suggestions;
step 2.7: outputting an 'unambiguous' modification suggestion, and skipping to the step 2.8;
step 2.8: increasing the value of j by 1 and judging its value; if 1 ≤ j ≤ M, jumping to step 2.5; otherwise (j > M), jumping to step 2.9;
step 2.9: saving the disambiguation result so that each list Q in step 2.3 has a modification suggestion corresponding thereto;
wherein the modification suggestion includes whether a modification is suggested and, for each ambiguous word hit in Q, the recommended disambiguating word.
2. The method of Chinese semantic disambiguation based on a knowledge graph and context as recited in claim 1, wherein step 2.6 comprises the steps of:
step 2.6.1: using the BERT pre-trained model, representing the context of the position of V_a in Q_j by a vector f;
step 2.6.2: according to the index numbers of all contexts related to V_a in dictionary D, obtaining from the rows of matrix C corresponding to those numbers a multidimensional vector F representing the set of three-sentence contexts related to V_a; likewise, according to the index numbers of all contexts related to V_a in dictionary D', obtaining from the corresponding rows of matrix C' a multidimensional vector F' representing the set of single-sentence contexts related to V_a;
step 2.6.3: calculating, according to formula (2), the similarity between the vector f and each vector in the multidimensional vector F, to obtain a similarity vector g between the context of V_a in Q_j and the three-sentence contexts appearing in the disambiguation knowledge graph; similarly, calculating the similarity between the vector f and each vector in the multidimensional vector F', to obtain a similarity vector g' between the context of V_a in Q_j and the single-sentence contexts appearing in the disambiguation knowledge graph;
similarity = f F^T / (|f| |F|)  (2)
wherein |f| represents the modulus of f, |F| represents the modulus of F, and F^T represents the transpose of F;
step 2.6.4: calculating the mixed similarity of g and g' according to formula (3), outputting the V_d corresponding to the context with the highest mixed similarity as the modification suggestion, and jumping to step 2.8;
mix_similarity = (1 - α)g + αg′  (3)
wherein α is the preset weight of the single-sentence context similarity in the mixed similarity.
CN202110417960.8A 2021-04-19 2021-04-19 Chinese semantic disambiguation method based on knowledge graph and context Active CN113065002B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110417960.8A CN113065002B (en) 2021-04-19 2021-04-19 Chinese semantic disambiguation method based on knowledge graph and context

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110417960.8A CN113065002B (en) 2021-04-19 2021-04-19 Chinese semantic disambiguation method based on knowledge graph and context

Publications (2)

Publication Number Publication Date
CN113065002A true CN113065002A (en) 2021-07-02
CN113065002B CN113065002B (en) 2022-10-14

Family

ID=76567006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110417960.8A Active CN113065002B (en) 2021-04-19 2021-04-19 Chinese semantic disambiguation method based on knowledge graph and context

Country Status (1)

Country Link
CN (1) CN113065002B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1916887A (en) * 2006-09-06 2007-02-21 哈尔滨工程大学 Method for eliminating ambiguity without directive word meaning based on technique of substitution words
US20110119047A1 (en) * 2009-11-19 2011-05-19 Tatu Ylonen Oy Ltd Joint disambiguation of the meaning of a natural language expression
CN105630770A (en) * 2015-12-23 2016-06-01 华建宇通科技(北京)有限责任公司 Word segmentation phonetic transcription and ligature writing method and device based on SC grammar
CN112214999A (en) * 2020-09-30 2021-01-12 内蒙古科技大学 Word meaning disambiguation method and device based on combination of graph model and word vector

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1916887A (en) * 2006-09-06 2007-02-21 哈尔滨工程大学 Method for eliminating ambiguity without directive word meaning based on technique of substitution words
US20110119047A1 (en) * 2009-11-19 2011-05-19 Tatu Ylonen Oy Ltd Joint disambiguation of the meaning of a natural language expression
CN105630770A (en) * 2015-12-23 2016-06-01 华建宇通科技(北京)有限责任公司 Word segmentation phonetic transcription and ligature writing method and device based on SC grammar
CN112214999A (en) * 2020-09-30 2021-01-12 内蒙古科技大学 Word meaning disambiguation method and device based on combination of graph model and word vector

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Stefan Zwicklbauer et al.: "Search-based entity disambiguation with document-centric knowledge bases", Proceedings of the 15th International Conference on Knowledge Technologies and Data-driven Business *
Lu Wenpeng et al.: "Graph-model word sense disambiguation method based on domain knowledge", Acta Automatica Sinica *

Also Published As

Publication number Publication date
CN113065002B (en) 2022-10-14

Similar Documents

Publication Publication Date Title
US5724593A (en) Machine assisted translation tools
US7478033B2 (en) Systems and methods for translating Chinese pinyin to Chinese characters
US8762358B2 (en) Query language determination using query terms and interface language
US20200226126A1 (en) Vector-based contextual text searching
US20060253273A1 (en) Information extraction using a trainable grammar
Adler et al. An unsupervised morpheme-based HMM for Hebrew morphological disambiguation
WO2008107305A2 (en) Search-based word segmentation method and device for language without word boundary tag
US11068653B2 (en) System and method for context-based abbreviation disambiguation using machine learning on synonyms of abbreviation expansions
CN111061882A (en) Knowledge graph construction method
JP2011118689A (en) Retrieval method and system
CN110991180A (en) Command identification method based on keywords and Word2Vec
CN113312922B (en) Improved chapter-level triple information extraction method
CN112417823B (en) Chinese text word order adjustment and word completion method and system
Sarkar et al. A practical part-of-speech tagger for Bengali
CN100361124C (en) System and method for word analysis
Patil et al. Issues and challenges in marathi named entity recognition
CN109614493B (en) Text abbreviation recognition method and system based on supervision word vector
CN102024026B (en) Method and system for processing query terms
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
Barari et al. CloniZER spell checker adaptive language independent spell checker
CN107168950B (en) Event phrase learning method and device based on bilingual semantic mapping
CN113065002B (en) Chinese semantic disambiguation method based on knowledge graph and context
Li et al. New word discovery algorithm based on n-gram for multi-word internal solidification degree and frequency
CN113963748A (en) Protein knowledge map vectorization method
Tianwen et al. Evaluate the chinese version of machine translation based on perplexity analysis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant