CN113065002A - Chinese semantic disambiguation method based on knowledge graph and context - Google Patents

Chinese semantic disambiguation method based on knowledge graph and context

Info

Publication number
CN113065002A
CN113065002A
Authority
CN
China
Prior art keywords
context
disambiguation
dictionary
vector
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110417960.8A
Other languages
Chinese (zh)
Other versions
CN113065002B (en)
Inventor
刘子宇
张华平
雷玉新
杨耀飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202110417960.8A priority Critical patent/CN113065002B/en
Publication of CN113065002A publication Critical patent/CN113065002A/en
Application granted granted Critical
Publication of CN113065002B publication Critical patent/CN113065002B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Abstract

The invention relates to a Chinese semantic disambiguation method based on a knowledge graph and context, belonging to the technical field of natural language processing. By constructing a disambiguation knowledge graph and performing context-based semantic disambiguation, the invention can extract ambiguous-word entities, disambiguating-word entities, and the relationships between them from an acquired data set that has no explicit semantic annotation and consists of original sentences and their disambiguated revisions, while storing the context as an attribute of the disambiguating-word entity. Disambiguation knowledge is thereby accumulated in the knowledge graph and made available to the semantic disambiguation task. The method can accurately find registered ambiguous words in a new text to be disambiguated. The invention also realizes vector representation of the context and vector-based similarity calculation, so that the software can perceive the context of an ambiguous word more accurately.

Description

Chinese semantic disambiguation method based on knowledge graph and context
Technical Field
The invention relates to a Chinese semantic disambiguation method, in particular to a Chinese semantic disambiguation method based on a knowledge graph and a context, and belongs to the technical field of natural language processing.
Background
Semantic disambiguation is a core and difficult problem in natural language processing, and it affects the performance of almost all tasks, such as search engines, opinion mining, text understanding and generation, and reasoning. Disambiguation is the process of determining the semantics of an object from its context, where the "object" may be a word or a phrase. Current semantic disambiguation methods include dictionary-based methods, supervised methods, and unsupervised or semi-supervised methods. Although these methods have achieved good results in some fields, no existing semantic disambiguation method adapts well to film and television lines composed mainly of spoken Chinese.
The dictionary-based semantic disambiguation method is one of the most basic approaches. Given a word to be disambiguated and its context, the idea is to compute the overlap between the context and the definition of each word sense in a semantic dictionary, and to select the sense with the largest overlap as the correct sense of the word in that context. However, because dictionary sense definitions are usually terse, the overlap with the context of the word to be disambiguated may be zero, resulting in low disambiguation performance.
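For illustration only, the gloss-overlap idea described above can be sketched in a few lines of Python; the function name, the sense_definitions input, and the tokenization are hypothetical placeholders, not part of the invention:

    def gloss_overlap_disambiguate(context_tokens, sense_definitions):
        # sense_definitions: hypothetical dict mapping each candidate sense
        # to the list of tokens in its dictionary definition.
        context = set(context_tokens)
        best_sense, best_overlap = None, -1
        for sense, gloss_tokens in sense_definitions.items():
            overlap = len(context & set(gloss_tokens))  # coverage between gloss and context
            if overlap > best_overlap:
                best_sense, best_overlap = sense, overlap
        return best_sense  # sense with the largest coverage; arbitrary when all overlaps are zero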
Supervised disambiguation methods build a disambiguation model from semantically annotated corpora, and the research focus is on feature representation. Although supervised methods can achieve better disambiguation performance, they require a large amount of manual corpus annotation, which is time-consuming and labor-intensive.
To avoid the need for large-scale annotated corpora, semi-supervised or unsupervised methods require little or no manually labeled data. Some methods use only a small amount of manually labeled corpora as seed data, while others extract seed data from word-aligned bilingual corpora. Following the observation that different senses of a word are often reflected in different syntactic collocations, the semantic preference of a syntactic structure can be obtained automatically from a large-scale corpus by computing semantic preference strength and selectional association, and the resulting preferences are then used for semantic disambiguation.
In general, while semi-supervised or unsupervised methods do not require large amounts of manually labeled data, they rely on a large-scale unlabeled corpus and on the results of syntactic analysis over that corpus. Their coverage of words to be disambiguated may also be limited. For example, some methods only examine a few special syntactic structures and can therefore disambiguate words only at specific syntactic positions, such as the verb, the subject or object of a verb, or a noun modified by an adjective, and cannot cover all ambiguous words.
Common contextual features can be generalized into three types:
(1) Lexical features: generally the words appearing in a window before and after the word to be disambiguated, together with their parts of speech;
(2) Syntactic features: the syntactic relations of the word to be disambiguated in its context, such as the verb-object relation, whether the word takes a subject or object, the type of the subject/object chunk, the head word of the subject/object, and the like;
(3) Semantic features: semantic information added on top of the syntactic relations, such as the sense of the subject/object head word, or even semantic role labeling information.
In recent years, with the application of deep learning in the field of natural language processing, many semantic disambiguation methods based on deep learning have appeared. Deep learning can automatically extract the low-level or high-level features required for classification, reducing the workload of feature extraction.
Disclosure of Invention
The invention aims to solve the technical problems of the prior art that ambiguous words cannot be accurately identified and that adaptability to specific domains is weak, and creatively provides a Chinese semantic disambiguation method based on a knowledge graph and context.
The innovation points of the invention are as follows: construction of a disambiguation knowledge graph and context-based semantic disambiguation. On the basis of an acquired data set consisting of original sentences and their disambiguated revisions, ambiguous words are discovered in the data set, the relationships between ambiguous words (or phrases) and their disambiguating words (or phrases) are extracted, and a disambiguation knowledge graph is constructed that contains the modification method for each ambiguous word and the context in which the modification occurred; the knowledge graph is then used to find ambiguous words in a text to be disambiguated, and disambiguation modification suggestions are recommended according to the context of each ambiguous word.
The invention is realized by the following technical scheme.
First, a disambiguation knowledge-graph is defined:
A knowledge graph in the general sense is a semantic network that reveals relationships between entities. It is composed of several pieces of knowledge, each represented as a triple (entity 1, relationship, entity 2) or (entity, attribute, attribute value). The disambiguation knowledge graph of the present invention is defined as follows:
One piece of knowledge is represented by a triple T = (V_a, R, V_d). V_a is an ambiguous word, i.e., a word or phrase containing multiple semantics, stored as a list consisting of a number of words. R represents the relationship "can be replaced by", i.e., V_a can be replaced by V_d. V_d is the disambiguating word of V_a, i.e., the unambiguous word or phrase that replaces V_a, also stored as a list consisting of a number of words. V_d has two attributes: the frequency with which T appears in the acquired data set, and the full set of contexts in which T is located in the acquired data set.
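As a minimal illustration (not part of the patent text), one such knowledge entry could be held in nested Python dictionaries keyed by V_a, matching the dictionary layout D (or D') described in step 1.2 below; all field names and example strings here are hypothetical:

    # Hypothetical in-memory layout of the disambiguation knowledge graph.
    # Outer key: V_a (the ambiguous word/phrase); inner key: V_d (its replacement).
    disambiguation_graph = {
        "ambiguous_phrase": {
            "replacement_phrase": {
                "frequency": 3,                     # how often T = (V_a, R, V_d) occurs in the data set
                "contexts": ["context sentence 1",  # full set of contexts in which T occurs
                             "context sentence 2"],
            },
        },
    }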
A Chinese semantic disambiguation method based on knowledge graph and context comprises the following steps:
step 1: constructing a disambiguation knowledge graph, comprising the following steps;
step 1.1: the acquired dataset is pre-processed. The acquired data set contains a training set and a validation set.
The training set contains L tuples, each composed of an un-disambiguated sentence and its disambiguated counterpart. Each sentence in each tuple is preprocessed by operations such as symbol removal and word segmentation, yielding a tuple P = (S_a, S_d) composed of 2 token lists. The L tuples P form a set G.
Step 1.2: extracting ambiguous words, a replacing method thereof and a context when replacing occurs, and performing the following processing on each tuple P in the set G obtained in step 1.1:
step 1.2.1: compute the set of words that occur in both S_a and S_d: H = S_a ∩ S_d.
Step 1.2.2: compute I_a = S_a - H and I_d = S_d - H, where I_a denotes the words that appear only in S_a and I_d denotes the words that appear only in S_d; if either I_a or I_d is empty, terminate the processing of this tuple.
Step 1.2.3: take the elements of I_a in the order in which they appear in S_a and merge positionally adjacent elements to form a list I'_a = [V_a1, V_a2, …, V_ax], x ≥ 1.
Step 1.2.4: take the elements of I_d in the order in which they appear in S_d and merge positionally adjacent elements to form a list I'_d = [V_d1, V_d2, …, V_dy], y ≥ 1.
Step 1.2.5: align the elements of I'_a and I'_d (the 1st element of I'_a corresponds to the 1st element of I'_d, and so on), forming z triples T = (V_a, R, V_d), where z = min(x, y). If x and y are not equal, the surplus elements of the longer of the two lists are discarded.
Since R has the same meaning, the relation "can be replaced by", in all triples formed in this step, only V_a and V_d are considered for storage: a dictionary D (or D') is built with V_a as the key. The value corresponding to a given V_a in D (or D') is itself a dictionary whose keys are all the V_d associated with that V_a and whose values are the frequency of occurrence of the triple T = (V_a, R, V_d) and the contexts in which T occurs.
The dictionary whose contexts consist of 3 sentences (the preceding sentence, the current sentence, and the following sentence) is D; the dictionary whose contexts consist of only the current sentence is D'. The number of keys of dictionary D (or D') is N.
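The extraction logic of steps 1.2.1-1.2.5 can be sketched as follows. This is an illustrative reading of the steps written for this description; details such as how positional adjacency is tracked and how phrase tokens are joined are assumptions, not requirements stated in the patent:

    def extract_triples(sa_tokens, sd_tokens):
        # Step 1.2.1: words common to S_a and S_d.
        common = set(sa_tokens) & set(sd_tokens)
        # Step 1.2.2: positions of the words unique to each sentence (I_a and I_d).
        ia = [i for i, w in enumerate(sa_tokens) if w not in common]
        id_ = [i for i, w in enumerate(sd_tokens) if w not in common]
        if not ia or not id_:
            return []

        def merge_adjacent(positions, tokens):
            # Steps 1.2.3 / 1.2.4: merge positionally adjacent unique words into phrases.
            phrases, current = [], [tokens[positions[0]]]
            for prev, cur in zip(positions, positions[1:]):
                if cur == prev + 1:
                    current.append(tokens[cur])
                else:
                    phrases.append("".join(current))
                    current = [tokens[cur]]
            phrases.append("".join(current))
            return phrases

        va_list = merge_adjacent(ia, sa_tokens)   # I'_a
        vd_list = merge_adjacent(id_, sd_tokens)  # I'_d
        z = min(len(va_list), len(vd_list))       # Step 1.2.5: align and drop the surplus
        return [(va_list[k], "can be replaced by", vd_list[k]) for k in range(z)]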
Step 1.3: extracting the context when the ambiguous word is unchanged, for each value V of the dictionary D or D' obtained in step 1.2aAnd the tuple P in the set G obtained in step 1.1 ═ S (S)a,Sd) For each VaThe following operations are carried out:
step 1.3.1: judgment VaWhether or not S of any P is simultaneously presentaAnd SdIf there is no such case, the operation is ended; if yes, skipping to step 3.2;
step 1.3.2: will VaAppears at SaAll context contexts in (1) are stored in dictionary D or D ', i.e. in dictionary D or D' key VaAdding a key V to the corresponding value (for a dictionary)aThe corresponding values are the frequency of occurrence of the context and all context;
step 1.4: the context contexts are expressed in an index form, e in total are numbered from 0 for each context appearing in the dictionary D or D 'obtained in step 1.3 to form an index, that is, one context corresponds to one number, and the context text in the dictionary D or D' obtained in step 1.3 is replaced by the corresponding number.
Step 1.5: using a BERT pre-training model, representing e context contexts originally represented by texts in a dictionary D by using a vector C with a dimension D, and splicing all the vectors C according to a formula (1) to obtain a context matrix C:
C = [c_1, c_2, …, c_e]^T  (1)
where c_i is the vector representing the i-th context and T denotes the matrix transpose.
Similarly, the context matrix obtained from dictionary D' is C'.
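Step 1.5 only states that "a BERT pre-training model" is used. As a sketch under assumptions, the Hugging Face transformers library with the bert-base-chinese checkpoint and mean pooling over the last hidden layer could produce the vectors c and the matrix C; the checkpoint name and the pooling strategy are choices made here for illustration, not specified by the patent:

    import torch
    from transformers import BertModel, BertTokenizer

    # Assumed checkpoint; the patent does not name a specific BERT model.
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertModel.from_pretrained("bert-base-chinese")
    model.eval()

    def embed_context(text):
        # Return one d-dimensional context vector c (mean of the last hidden states).
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        return outputs.last_hidden_state.mean(dim=1).squeeze(0)   # shape (d,)

    def build_context_matrix(contexts):
        # Stack the e context vectors row-wise into the matrix C of formula (1).
        return torch.stack([embed_context(t) for t in contexts])  # shape (e, d)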
And at this point, the construction of the disambiguation knowledge graph is completed.
Step 2: performing context-based semantic disambiguation comprising the steps of:
step 2.1: loading the disambiguation knowledge graph, which comprises the dictionaries D and D' and the matrices C and C' obtained in step 1;
step 2.2: acquiring a list of sentences to be disambiguated, wherein the total number of the sentences is M;
step 2.3: performing word segmentation on the M sentences to be disambiguated obtained in step 2.2, producing M token lists Q;
step 2.4: initializing a variable j = 1, where j denotes the j-th list and 1 ≤ j ≤ M;
step 2.5: find ambiguous words in the j-th list Q_j. For each key V_a in the dictionary D, judge whether V_a appears in Q_j: V_a is considered to appear in Q_j only when every element of V_a is present in Q_j and these elements occur in Q_j in the same order as in V_a. If V_a appears, put V_a into the set U_j. If the set U_j is not empty, i.e., Q_j contains ambiguous words, go to step 2.6; if the set U_j is empty, i.e., Q_j contains no ambiguous word, jump to step 2.7.
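A sketch of the lookup in step 2.5 follows, assuming each key V_a of dictionary D is kept as a tuple of its segmented tokens; whether the matched tokens must also be contiguous in Q_j is not spelled out in the text, so this version only enforces the relative order:

    def contains_in_order(va_tokens, qj_tokens):
        # True if every token of V_a occurs in Q_j, in the same relative order.
        it = iter(qj_tokens)
        return all(token in it for token in va_tokens)

    def find_ambiguous_words(qj_tokens, dictionary_d):
        # Build U_j: all keys V_a of dictionary D found in the segmented sentence Q_j.
        return {va for va in dictionary_d if contains_in_order(va, qj_tokens)}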
Step 2.6: for each element V_a in the set U_j obtained in step 2.5, perform semantic disambiguation and give modification suggestions.
The method comprises the following specific steps:
step 2.6.1: using the BERT pre-trained model, represent the context of the position of V_a in Q_j by a vector f;
step 2.6.2: according to the index numbers of all contexts related to V_a in dictionary D, obtain from the rows of matrix C corresponding to those numbers a multidimensional vector F representing the set of three-sentence contexts related to V_a; likewise, according to the index numbers of all contexts related to V_a in dictionary D', obtain from the corresponding rows of matrix C' a multidimensional vector F' representing the set of single-sentence contexts related to V_a;
step 2.6.3: calculate, according to formula (2), the similarity between the vector f and each vector in the multidimensional vector F, obtaining a similarity vector g between the context of V_a in Q_j and the three-sentence contexts appearing in the disambiguation knowledge graph. Similarly, calculate the similarity between the vector f and each vector in the multidimensional vector F', obtaining a similarity vector g' between the context of V_a in Q_j and the single-sentence contexts appearing in the disambiguation knowledge graph;
similarity = f F^T / (|f| |F|)  (2)
where |f| represents the modulus of f, |F| represents the modulus of F, and F^T represents the transpose of F.
Step 2.6.4, calculating the mixed similarity of g and g' according to the formula (3), and enabling the V corresponding to the context with the highest mixed similaritybAs a modification suggestion output, skipping to step 2.8;
mix_similarity=(1-α)g+αg′ (3)
where α is the preset weight of the single-sentence context similarity in the mixed similarity;
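Formulas (2) and (3) can be sketched as row-wise cosine similarity followed by a weighted mixture. The small epsilon, the example value of α, and the assumption that g and g' are index-aligned with the same knowledge entries are additions made here for the sketch, not stated in the patent:

    import torch

    def similarity_vector(f, F):
        # Formula (2) read as row-wise cosine similarity: component i compares f
        # with the i-th context vector stored in F.
        return (F @ f) / (F.norm(dim=1) * f.norm() + 1e-12)

    def mixed_similarity(g, g_prime, alpha):
        # Formula (3): mix_similarity = (1 - alpha) * g + alpha * g'.
        return (1 - alpha) * g + alpha * g_prime

    # Usage sketch: pick the V_d whose context attains the highest mixed similarity.
    # g = similarity_vector(f, F); g_prime = similarity_vector(f, F_prime)
    # best = int(torch.argmax(mixed_similarity(g, g_prime, alpha=0.5)))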
step 2.7: outputting an 'unambiguous' modification suggestion, and skipping to the step 2.8;
step 2.8: increase the value of j by 1 and judge its value; if 1 ≤ j ≤ M, jump to step 2.5; otherwise (j > M), jump to step 2.9.
Step 2.9: the disambiguation result is saved so that each list Q in step 2.3 has a modification proposal corresponding to it.
The modification suggestion includes whether a modification is suggested and, for each ambiguous word hit in Q (if any), the recommended disambiguating word.
Advantageous effects
Compared with the prior art, the method of the invention has the following advantages:
1. By incorporating knowledge graph technology, the invention can extract ambiguous-word entities, disambiguating-word entities, and the relationships between them from an acquired data set without explicit semantic annotation, consisting of original sentences and their disambiguated revisions, while storing the context as an attribute of the disambiguating-word entity; disambiguation knowledge is thus accumulated in the knowledge graph and made available to the semantic disambiguation task;
2. By means of the disambiguation knowledge graph, the method can accurately find registered ambiguous words in a new text to be disambiguated;
3. By incorporating BERT-related technology, the invention realizes vector representation of the context and vector-based similarity calculation, so that the software can perceive the context of an ambiguous word more accurately.
drawings
FIG. 1 is an overall architecture of a disambiguation knowledge-graph upon which the method of the present invention relies;
FIG. 2 is a flow chart of an embodiment of the method of step 1 of the present invention for constructing a disambiguation knowledgegraph;
FIG. 3 is a flow chart of an embodiment of the method step 2 of the present invention for semantic disambiguation based on context.
Detailed Description
The method of the present invention will be described in further detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Examples
The semantic disambiguation of movie and television drama lines composed of Chinese spoken language is taken as an example.
In a specific implementation of the method, the data set acquired in step 1 consists of the lines of 53 television series; the lines of 1 series serve as the test set and the remainder as the training set. The organization and example content are shown in Table 1:
Table 1. Data set example

Original sentence | Disambiguated sentence
Mother wants to embrace grandson | Mother wants to obtain grandson
He finds and hits yes bar | He is shortbeating and is bar
…… | ……
Based on the method, this embodiment can accumulate disambiguation knowledge in the knowledge graph and enable the semantic disambiguation task; it can accurately find registered ambiguous words in a new text to be disambiguated; and it realizes vector representation of the context and vector-based similarity calculation, so that the software can perceive the context of an ambiguous word more accurately. For the specific domain of semantic disambiguation of film and television lines composed of spoken Chinese, the method is more adaptable and provides a practical tool for real line-disambiguation work.
The proposed disambiguation knowledge-graph structure of the present invention is shown in FIG. 1, where each piece of knowledge is represented in the form of a triple (entity 1, relationship, entity 2);
wherein, the entity 1 is an "ambiguous word", that is, a word or a phrase containing various semantics;
entity 2 is the disambiguating word of entity 1, i.e., the unambiguous word or phrase that substitutes for entity 1; entity 2 has two attributes, namely the frequency with which the triple appears in the acquired data set and the full set of contexts in which the triple occurs in the acquired data set.
The relationship is "replaceable", i.e., entity 1 can be replaced with entity 2;
the disambiguation knowledge map corresponds to the disambiguation knowledge map in step 1;
FIG. 2 is a flow chart of a specific implementation of the method for Chinese semantic disambiguation based on a knowledge graph and context according to step 1 of the present invention to construct a disambiguation knowledge graph;
FIG. 3 is a flow chart of the embodiment of semantic disambiguation based on context in step 2 of the Chinese semantic disambiguation method based on knowledge graph and context according to the present invention.
On the basis of the above-mentioned acquired data set, a disambiguation knowledge graph can be constructed, corresponding to step 1 of the method proposed by the present invention; an example is shown in Table 2;
Table 2. Disambiguation knowledge graph example
Each disambiguating word corresponds to multiple contexts, and a context takes one of two forms: one contains only the sentence itself, and the other contains 3 sentences, namely the preceding sentence, the sentence itself, and the following sentence.
Based on the disambiguation knowledge graph, semantic disambiguation based on context is carried out on the test set, corresponding to step 2 of the method provided by the invention, and the result is shown in table 3;
table 3 test set disambiguation results example
Wherein the interpretation of the states is shown in table 4;
table 4 state interpretation
The threshold of the probability is set to 0.9 in the present embodiment.
The semantic disambiguation result on the test set is evaluated. The evaluation metric, accuracy, is defined as the percentage of all input sentences for which a correct suggestion is given; the baseline is defined as the percentage of all input sentences for which the "no modification" suggestion is the correct one. The comparison between this embodiment and the baseline is shown in Table 5; the accuracy of the method is improved over the baseline.
Table 5. Comparison of results

Method | Accuracy
Baseline | 86.0769%
The method of the invention | 93.6925%
While the foregoing is directed to the preferred embodiment of the present invention, it is not intended that the invention be limited to the embodiment and the drawings disclosed herein. Equivalents and modifications may be made without departing from the spirit of the disclosure, which is to be considered as within the scope of the invention.

Claims (2)

1. A Chinese semantic disambiguation method based on knowledge graph and context is characterized in that:
firstly, a disambiguation knowledge graph is defined: one piece of knowledge is represented by a triple T = (V_a, R, V_d), where V_a is an ambiguous word, i.e., a word or phrase containing multiple semantics, and is a list consisting of a number of words; R represents the relationship "can be replaced by", i.e., V_a can be replaced by V_d; V_d is the disambiguating word of V_a, i.e., the unambiguous word or phrase that replaces V_a, and is a list consisting of a number of words; V_d has two attributes, namely the frequency with which T appears in the acquired data set and the full set of contexts in which T is located in the acquired data set;
the method comprises the following steps:
step 1: constructing a disambiguation knowledge graph, comprising the following steps;
step 1.1: preprocessing the acquired data set; the acquired data set comprises a training set and a verification set;
wherein the training set has L tuples, each composed of an un-disambiguated sentence and its disambiguated sentence; each sentence in each tuple is processed with operations including symbol removal and word segmentation, obtaining a tuple P = (S_a, S_d) composed of 2 lists; the L tuples P form a set G;
step 1.2: extracting ambiguous words, their replacements, and the contexts in which the replacement occurs, and performing the following processing on each tuple P in the set G obtained in step 1.1:
step 1.2.1: calculating the set of words that occur in both S_a and S_d: H = S_a ∩ S_d;
step 1.2.2: calculating I_a = S_a - H and I_d = S_d - H, where I_a denotes the words that appear only in S_a and I_d denotes the words that appear only in S_d; if either I_a or I_d is empty, the operation on this tuple is ended;
step 1.2.3: taking the elements of I_a in the order in which they appear in S_a and merging positionally adjacent elements to form a list I'_a = [V_a1, V_a2, …, V_ax], x ≥ 1;
step 1.2.4: taking the elements of I_d in the order in which they appear in S_d and merging positionally adjacent elements to form a list I'_d = [V_d1, V_d2, …, V_dy], y ≥ 1;
step 1.2.5: aligning the elements of I'_a and I'_d to form z triples T = (V_a, R, V_d), z = min(x, y); if x and y are not equal, discarding the surplus elements of the longer of the two lists;
since R has the same meaning, the relation "can be replaced by", in all triples formed in this step, only V_a and V_d are considered for storage: a dictionary D (or D') is built with V_a as the key; the value corresponding to a given V_a in D (or D') is itself a dictionary whose keys are all the V_d associated with that V_a and whose values are the frequency of occurrence of the triple T = (V_a, R, V_d) and the contexts in which T occurs; the dictionary whose contexts consist of 3 sentences, namely the preceding sentence, the current sentence, and the following sentence, is D, the dictionary whose contexts consist of only the current sentence is D', and the number of keys of dictionary D (or D') is N;
step 1.3: extracting the contexts in which the ambiguous word is left unchanged: for each key V_a of the dictionary D (or D') obtained in step 1.2 and each tuple P = (S_a, S_d) in the set G obtained in step 1.1, performing the following operations:
step 1.3.1: judging whether V_a appears simultaneously in S_a and S_d of any P; if there is no such case, ending the operation; if there is, jumping to step 1.3.2;
step 1.3.2: storing all contexts in which V_a appears in S_a into dictionary D (or D'), i.e., in the value (itself a dictionary) corresponding to key V_a of dictionary D (or D'), adding a key V_a whose corresponding values are the frequency of occurrence and all the contexts;
step 1.4: expressing the contexts in index form: the contexts appearing in the dictionary D (or D') obtained in step 1.3, e in total, are numbered from 0 to form an index, i.e., one context corresponds to one number, and each context text in the dictionary D (or D') obtained in step 1.3 is replaced by its corresponding number;
step 1.5: using a BERT pre-trained model, representing each of the e contexts originally stored as text in dictionary D by a vector c of dimension d, and concatenating all the vectors c according to formula (1) to obtain a context matrix C:
C = [c_1, c_2, …, c_e]^T  (1)
wherein c_i is the vector representing the i-th context and T denotes the matrix transpose;
similarly, the context matrix obtained from dictionary D' is C';
step 2: performing context-based semantic disambiguation comprising the steps of:
step 2.1: loading the disambiguation knowledge graph, which comprises the dictionaries D and D' and the matrices C and C' obtained in step 1;
step 2.2: acquiring a list of sentences to be disambiguated, wherein the total number of the sentences is M;
step 2.3: performing word segmentation on the M sentences to be disambiguated obtained in step 2.2, producing M token lists Q;
step 2.4: initializing a variable j = 1, where j denotes the j-th list and 1 ≤ j ≤ M;
step 2.5: finding ambiguous words in the j-th list Q_j: for each key V_a in dictionary D, judging whether V_a appears in Q_j, where V_a is considered to appear in Q_j only when every element of V_a is present in Q_j and these elements occur in Q_j in the same order as in V_a; if V_a appears, putting V_a into the set U_j; if the set U_j is not empty, i.e., Q_j contains ambiguous words, going to step 2.6; if the set U_j is empty, i.e., Q_j contains no ambiguous word, jumping to step 2.7;
step 2.6: for each element V_a in the set U_j obtained in step 2.5, performing semantic disambiguation and giving modification suggestions;
step 2.7: outputting an 'unambiguous' modification suggestion, and skipping to the step 2.8;
step 2.8: increasing the value of j by 1 and judging its value; if 1 ≤ j ≤ M, jumping to step 2.5; otherwise (j > M), jumping to step 2.9;
step 2.9: saving the disambiguation result so that each list Q in step 2.3 has a modification suggestion corresponding thereto;
wherein the modification suggestion includes whether a modification is suggested and, for each ambiguous word hit in Q, the recommended disambiguating word.
2. The method of Chinese semantic disambiguation based on a knowledge graph and context as recited in claim 1, wherein step 2.6 comprises the steps of:
step 2.6.1: using the BERT pre-trained model, representing the context of the position of V_a in Q_j by a vector f;
step 2.6.2: according to the index numbers of all contexts related to V_a in dictionary D, obtaining from the rows of matrix C corresponding to those numbers a multidimensional vector F representing the set of three-sentence contexts related to V_a; likewise, according to the index numbers of all contexts related to V_a in dictionary D', obtaining from the corresponding rows of matrix C' a multidimensional vector F' representing the set of single-sentence contexts related to V_a;
step 2.6.3: calculating, according to formula (2), the similarity between the vector f and each vector in the multidimensional vector F, to obtain a similarity vector g between the context of V_a in Q_j and the three-sentence contexts appearing in the disambiguation knowledge graph; similarly, calculating the similarity between the vector f and each vector in the multidimensional vector F', to obtain a similarity vector g' between the context of V_a in Q_j and the single-sentence contexts appearing in the disambiguation knowledge graph;
similarity = f F^T / (|f| |F|)  (2)
wherein |f| represents the modulus of f, |F| represents the modulus of F, and F^T represents the transpose of F;
step 2.6.4: calculating the mixed similarity of g and g' according to formula (3), outputting the V_d corresponding to the context with the highest mixed similarity as the modification suggestion, and jumping to step 2.8;
mix_similarity = (1 - α)g + αg′  (3)
wherein α is the preset weight of the single-sentence context similarity in the mixed similarity.
CN202110417960.8A 2021-04-19 2021-04-19 Chinese semantic disambiguation method based on knowledge graph and context Active CN113065002B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110417960.8A CN113065002B (en) 2021-04-19 2021-04-19 Chinese semantic disambiguation method based on knowledge graph and context

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110417960.8A CN113065002B (en) 2021-04-19 2021-04-19 Chinese semantic disambiguation method based on knowledge graph and context

Publications (2)

Publication Number Publication Date
CN113065002A true CN113065002A (en) 2021-07-02
CN113065002B CN113065002B (en) 2022-10-14

Family

ID=76567006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110417960.8A Active CN113065002B (en) 2021-04-19 2021-04-19 Chinese semantic disambiguation method based on knowledge graph and context

Country Status (1)

Country Link
CN (1) CN113065002B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1916887A (en) * 2006-09-06 2007-02-21 哈尔滨工程大学 Method for eliminating ambiguity without directive word meaning based on technique of substitution words
US20110119047A1 (en) * 2009-11-19 2011-05-19 Tatu Ylonen Oy Ltd Joint disambiguation of the meaning of a natural language expression
CN105630770A (en) * 2015-12-23 2016-06-01 华建宇通科技(北京)有限责任公司 Word segmentation phonetic transcription and ligature writing method and device based on SC grammar
CN112214999A (en) * 2020-09-30 2021-01-12 内蒙古科技大学 Word meaning disambiguation method and device based on combination of graph model and word vector

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1916887A (en) * 2006-09-06 2007-02-21 哈尔滨工程大学 Method for eliminating ambiguity without directive word meaning based on technique of substitution words
US20110119047A1 (en) * 2009-11-19 2011-05-19 Tatu Ylonen Oy Ltd Joint disambiguation of the meaning of a natural language expression
CN105630770A (en) * 2015-12-23 2016-06-01 华建宇通科技(北京)有限责任公司 Word segmentation phonetic transcription and ligature writing method and device based on SC grammar
CN112214999A (en) * 2020-09-30 2021-01-12 内蒙古科技大学 Word meaning disambiguation method and device based on combination of graph model and word vector

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Stefan Zwicklbauer et al.: "Search-based entity disambiguation with document-centric knowledge bases", Proceedings of the 15th International Conference on Knowledge Technologies and Data-driven Business *
Lu Wenpeng et al.: "Graph-model word sense disambiguation method based on domain knowledge", Acta Automatica Sinica *

Also Published As

Publication number Publication date
CN113065002B (en) 2022-10-14

Similar Documents

Publication Publication Date Title
US5724593A (en) Machine assisted translation tools
US7478033B2 (en) Systems and methods for translating Chinese pinyin to Chinese characters
US8762358B2 (en) Query language determination using query terms and interface language
US20200226126A1 (en) Vector-based contextual text searching
US20060253273A1 (en) Information extraction using a trainable grammar
Adler et al. An unsupervised morpheme-based HMM for Hebrew morphological disambiguation
WO2008107305A2 (en) Search-based word segmentation method and device for language without word boundary tag
US11068653B2 (en) System and method for context-based abbreviation disambiguation using machine learning on synonyms of abbreviation expansions
CN111061882A (en) Knowledge graph construction method
JP2011118689A (en) Retrieval method and system
CN110991180A (en) Command identification method based on keywords and Word2Vec
CN113312922B (en) Improved chapter-level triple information extraction method
CN112417823B (en) Chinese text word order adjustment and word completion method and system
Sarkar et al. A practical part-of-speech tagger for Bengali
CN100361124C (en) System and method for word analysis
Patil et al. Issues and challenges in marathi named entity recognition
CN109614493B (en) Text abbreviation recognition method and system based on supervision word vector
CN102024026B (en) Method and system for processing query terms
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
Barari et al. CloniZER spell checker adaptive language independent spell checker
CN107168950B (en) Event phrase learning method and device based on bilingual semantic mapping
CN113065002B (en) Chinese semantic disambiguation method based on knowledge graph and context
Li et al. New word discovery algorithm based on n-gram for multi-word internal solidification degree and frequency
CN113963748A (en) Protein knowledge map vectorization method
Tianwen et al. Evaluate the chinese version of machine translation based on perplexity analysis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant