CN113177105A - Word embedding-based multi-source heterogeneous water conservancy field data fusion method

Info

Publication number: CN113177105A
Application number: CN202110490308.9A
Authority: CN
Inventors: 胡伟; 高祥涛; 朱向荣; 陆小明; 高凤宁; 司存友; 曹帅
Original assignee: Jiangsu Province Hydrology And Water Resources Investigation Bureau; Nanjing University
Current assignee: Jiangsu Province Hydrology And Water Resources Investigation Bureau; Nanjing University
Priority date: 2021-05-06
Filing date: 2021-05-06
Publication date: 2021-07-27

Abstract

The invention discloses a word embedding-based multi-source heterogeneous water conservancy field data fusion method. Given multi-source heterogeneous water conservancy field data, the data is first constructed into a water conservancy knowledge graph. Next, a word embedding model is used to generate a vector representation for each entity or attribute in the water conservancy knowledge graph. Then, for every pair of entities or attributes, three similarities are calculated: one from the Chinese part of the literal, one from the English part of the literal, and one from the vector representations. Finally, the three similarities are combined to obtain a similarity score for the two candidate entities or attributes. A preset similarity score threshold and an upper limit on the number of candidate similar entries restrict the number of similar entities or attributes, yielding the entity pairs or attribute pairs that are finally determined to match. The method finds similar entity pairs and similar attribute pairs in multi-source heterogeneous water conservancy field data and reduces the complexity of data retrieval for water conservancy practitioners.

Description

Word embedding-based multi-source heterogeneous water conservancy field data fusion method

Technical Field

The invention relates to the technical field of knowledge graphs, and in particular to a word embedding-based multi-source heterogeneous water conservancy field data fusion method.

Background

In 2012, Google first proposed the concept of the knowledge graph, using it to structure information about search targets and thereby improve search quality. In terms of content, a knowledge graph is mainly composed of interconnected entities and their attributes; in essence, it can be viewed as a knowledge base built on a semantic network, where each piece of knowledge can be represented by a triple. For example, the triple (Yangcheng Lake, location, Suzhou) characterizes a piece of real-world knowledge (a fact): Yangcheng Lake is located in Suzhou. Since many real-world scenarios are well suited to representation by knowledge graphs, the construction and application of knowledge graphs has become a research hotspot in recent years. A large number of high-quality knowledge graphs, such as Freebase, have emerged in industry and are widely used in real-world applications.

Water is formless yet takes ten thousand shapes, and managing and using water has been a problem bearing on people's livelihood for thousands of years. Because of its inherent continuity over time and wide distribution over space, the water conservancy field continuously generates massive amounts of field data, and this data is particularly well suited to management with a knowledge graph. Problems such as flood control and drainage, water environment, water resources, and water ecology require extensive knowledge and complex reasoning, and the knowledge graph can serve as a powerful tool with which experts and ordinary practitioners in the water conservancy field store, manage, and use knowledge.

Traditionally, the water conservancy industry has generally adopted keyword-based search technology, which makes it difficult to retrieve information using the relations between objects. In addition, the same entity or attribute can be expressed with different text in different data sources, and keyword-based search technology struggles with retrieval over such multi-source heterogeneous data.

Disclosure of Invention

The purpose of the invention is as follows: in view of the problems and deficiencies in the prior art, the invention aims to provide a word embedding-based multi-source heterogeneous water conservancy field data fusion method, which can find similar entities and attributes for entities and attributes in multi-source heterogeneous water conservancy field data, assist in linking and fusing the multi-source heterogeneous water conservancy field data, improve recall rate of water conservancy field data retrieval, and improve information retrieval efficiency of water conservancy professional practitioners.

The technical scheme is as follows: in order to achieve the purpose, the technical scheme adopted by the invention is a word embedding-based multi-source heterogeneous water conservancy field data fusion method, which comprises the following steps:

(1.1) for the currently given water conservancy field data, separating the entity from the attribute to generate a candidate entity pair and a candidate attribute pair;

(1.2) for the candidate entity pair and the candidate attribute pair generated in the step (1.1), respectively calculating the similarity of the two entities or attributes at the Chinese literal level, the English literal level, and the vector representation level;

(1.3) calculating the similarity of an entity pair and the similarity of an attribute pair by combining the Chinese literal, English literal, and vector representation similarities calculated in the step (1.2);

(1.4) comparing the similarity calculated in the step (1.3) with a preset threshold, filtering the candidate entity pairs and the candidate attribute pairs with the similarity lower than the threshold, reserving the candidate entity pairs and the candidate attribute pairs with the similarity higher than the threshold, and screening out the matching entity pairs and the matching attribute pairs.

Further, the candidate entity pair consists of two candidate entities, the candidate attribute pair consists of two candidate attributes, the step (1.2) comprises the steps of:

(2.1) calculating the character string similarity of the Chinese names of the two candidate entities or candidate attributes according to the Jaccard index;

(2.2) calculating the character string similarity of the English names of the two candidate entities or candidate attributes according to the edit distance;

and (2.3) calculating the similarity of the two candidate entities or candidate attributes at the embedded vector level according to the cosine distance.

Further, the step (2.3) comprises the steps of:

(3.1) for the candidate entities and the candidate attributes generated in the step (1.1), obtaining vector representations of the candidate entities and the candidate attributes by using a CBoW word vector model;

and (3.2) according to the vector representation of each candidate entity and candidate attribute obtained in the step (3.1), extracting the vector representation of the current candidate entity pair or candidate attribute pair, and calculating the cosine similarity of the vector representation of the candidate entity pair or candidate attribute pair.

Further, the step (1.3) comprises the steps of:

(4.1) determining, according to the characteristics of the water conservancy field data, the weights of the Chinese literal, English literal, and vector representation similarities in the similarity of the entity pair and the attribute pair, and ensuring that the three weights sum to 1;

and (4.2) according to the weights determined in the step (4.1), calculating a weighted average of the Chinese literal, English literal, and vector representation similarities as the similarity of the entity pair and the similarity of the attribute pair.

Advantageous effects: (1) Similar entities and attributes in multi-source heterogeneous water conservancy data are matched, which assists the fusion of multi-source heterogeneous water conservancy field data. (2) The method can be applied as a component in traditional keyword-based water conservancy field data retrieval methods, improving the recall rate of retrieval and thereby the data retrieval efficiency of water conservancy field workers.

Drawings

FIG. 1 is an overall flow chart of the present invention.

Detailed Description

The present invention is further illustrated by the following figures and specific examples, which are to be understood as illustrative only and not as limiting the scope of the invention, which is to be given the full breadth of the appended claims and any and all equivalent modifications thereof which may occur to those skilled in the art upon reading the present specification.

To better exploit the aggregation effect of knowledge, multi-source heterogeneous data needs to be linked and fused. Word embedding technology from the field of machine learning can project entities and attributes from different knowledge graphs into a unified low-dimensional vector space, enabling the linking and fusion of multi-source heterogeneous data.

The water conservancy field data is managed using a knowledge graph, and similar entity pairs and similar attribute pairs in the knowledge graph are matched using word vector technology, thereby fusing the multi-source heterogeneous water conservancy field data. The water conservancy knowledge graph is divided into an entity part and an attribute part, and the fusion of multi-source heterogeneous data and the matching of similar concepts are carried out on each part separately. First, for the candidate entities or candidate attributes in a candidate entity pair or candidate attribute pair, the similarities at the Chinese literal level, the English literal level, and the vector representation level are calculated. Then, a weighted average of the three similarities is calculated to obtain the similarity of the candidate entity pair or candidate attribute pair. Finally, the candidate entity pairs and candidate attribute pairs are filtered using a preset similarity threshold and a preset upper limit on the number of matches to obtain the matching entity pairs and matching attribute pairs.

The overall process of the invention is shown in FIG. 1 and comprises four parts: partitioning the current knowledge graph into candidate entity pairs and candidate attribute pairs; calculating the Chinese literal, English literal, and vector representation similarities for each candidate entity pair or candidate attribute pair; calculating the weighted average of the three similarities as the similarity of the candidate entity pair or candidate attribute pair; and screening out matched entity pairs or matched attribute pairs according to a preset threshold.

The specific implementation methods are respectively described as follows:

1. Partitioning the current knowledge graph into candidate entity pairs and candidate attribute pairs

For a given knowledge graph, the head and tail entities are separated from the attributes in the triples to generate an entity set and an attribute set. For the entity set, the entities are paired two by two, the literal similarity of their names is calculated, and entity pairs whose similarity is below a threshold are filtered out directly. Likewise, for the attribute set, the attributes are paired two by two, the literal similarity of their names is calculated, and attribute pairs whose similarity is below the threshold are filtered out directly. In the invention, the literal similarity is calculated using the Jaccard index, and the similarity threshold is set to 0.4. The Jaccard index, also called intersection over union, measures the similarity of finite sample sets and is defined as the ratio of the size of the intersection of two sets to the size of their union. It is calculated as follows:

J(A, B) = |A ∩ B| ÷ |A ∪ B|

In the formula, A and B represent the two sets whose Jaccard index is to be calculated, J(A, B) represents the Jaccard index of the two sets, |A ∩ B| represents the size of the intersection of the two sets, and |A ∪ B| represents the size of the union of the two sets.

In the invention, the names of the two candidate entities or candidate attributes are each split into a set of Chinese entries, and the Jaccard index of the two entry sets is calculated as the similarity at the Chinese literal level.
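The partitioning and pre-filtering described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not say how a name is split into "entries", so the sketch assumes each Chinese character is one entry. The function names `jaccard` and `candidate_pairs` and the sample triples are hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard index of two names, treating each character as one entry (assumption)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty names: defined as 0 here by convention
    return len(set_a & set_b) / len(union)


def candidate_pairs(names, threshold=0.4):
    """Pair names two by two and drop pairs whose Jaccard index is below the threshold."""
    pairs = []
    for x, y in combinations(sorted(set(names)), 2):
        score = jaccard(x, y)
        if score >= threshold:
            pairs.append((x, y, score))
    return pairs


# Hypothetical triples (head entity, attribute, tail entity) from two data sources.
triples = [
    ("阳澄湖", "位置", "苏州"),
    ("阳澄西湖", "所在地", "苏州市"),
    ("太湖", "位置", "无锡"),
]
entities = {t[0] for t in triples} | {t[2] for t in triples}
attributes = {t[1] for t in triples}

entity_candidates = candidate_pairs(entities)
attribute_candidates = candidate_pairs(attributes)
```

With these sample triples, only the entity pairs that share most of their characters survive the 0.4 pre-filter. Pairs scoring exactly 0.4 are kept here; the text only says pairs below the threshold are filtered out.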

2. Calculating the Chinese literal, English literal, and vector representation similarities for a candidate entity pair or candidate attribute pair

The Chinese literal similarity of a candidate entity pair or candidate attribute pair is calculated using the Jaccard index. The English literal similarity is calculated using the edit distance. The edit distance measures the degree of difference between two character strings as the minimum number of operations required to transform one string into the other. The Levenshtein distance is used in the present invention, whose atomic editing operations are deleting, inserting, and replacing a single character. The edit distance of two candidate entities or candidate attributes is normalized to a similarity measure between 0 and 1 as follows:

S(C, D) = 1 - L(C, D) ÷ max(|C|, |D|)

In the formula, C and D represent the two character strings whose similarity is to be measured, S(C, D) represents the similarity calculated from the edit distance of the two strings, L(C, D) represents the edit distance of the two strings, and max(|C|, |D|) represents the length of the longer of the two strings. Dividing the edit distance by the length of the longer string normalizes the degree of difference to a value between 0 and 1, and subtracting that value from 1 gives the similarity.
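The normalization above can be illustrated with a short sketch. It uses the standard dynamic-programming formulation of the Levenshtein distance; the function names are hypothetical, and returning 1.0 for two empty strings is an assumption, since the formula is undefined in that case.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character deletions, insertions, and replacements."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # delete from s
                               current[j - 1] + 1,       # insert into s
                               previous[j - 1] + cost))  # replace or keep
        previous = current
    return previous[-1]


def edit_similarity(c: str, d: str) -> float:
    """S(C, D) = 1 - L(C, D) / max(|C|, |D|)."""
    longest = max(len(c), len(d))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(c, d) / longest
```

For example, `edit_similarity("kitten", "sitting")` evaluates to 1 - 3/7, since three single-character edits separate the two strings and the longer string has seven characters.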

A word embedding model is trained on the given knowledge graph to obtain vector representations of the entities and attributes. The vector representations of the two entities or attributes in a candidate entity pair or candidate attribute pair are then retrieved, and their similarity is calculated using cosine similarity. Cosine similarity measures the similarity of two vectors by the cosine of the angle between them, which can be obtained from the Euclidean dot product formula:

cos(θ) = (E·F) ÷ (|E|·|F|)

In the formula, E and F represent the two vectors for which the cosine of the included angle is to be calculated, θ represents the angle between the two vectors, E·F represents the dot product of the two vectors, |E|·|F| represents the product of the lengths of the two vectors, and cos(θ) represents the cosine of the angle between the two vectors.
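A minimal sketch of the cosine formula in plain Python follows. The function name is hypothetical, and returning 0.0 for a zero-length vector is an assumption made to avoid division by zero.

```python
import math


def cosine_similarity(e, f):
    """cos(theta) = (E . F) / (|E| * |F|) for two equal-length numeric vectors."""
    if len(e) != len(f):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(e, f))
    norm_e = math.sqrt(sum(x * x for x in e))
    norm_f = math.sqrt(sum(y * y for y in f))
    if norm_e == 0.0 or norm_f == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_e * norm_f)
```

Note that the cosine ranges over [-1, 1], whereas the Jaccard and edit-distance similarities lie in [0, 1]. The text does not state how negative cosine values are handled before the three similarities are combined.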

The representation learning model used in the invention is the CBoW (Continuous Bag-of-Words) model of the word2vec algorithm. The word2vec algorithm is based on the distributional hypothesis, which holds that the word frequencies of a document represent its subject and that two words with similar contexts have similar semantics. The CBoW model predicts the center word from its context: the training input is the word vectors of the context words surrounding a given word, and the output is the word vector of that center word.
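To make the CBoW input/output relationship concrete, the sketch below builds the (context words, center word) training samples that a CBoW model would consume. It does not train a model. The text does not specify how the knowledge graph is serialized into training sequences, so treating each triple as a three-token sequence is an assumption, as are the function name `cbow_samples` and the window size.

```python
def cbow_samples(tokens, window=2):
    """Return (context, center) pairs: the context is the CBoW input, the center its target."""
    samples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip tokens with no neighbours
            samples.append((context, center))
    return samples


# Assumption: one knowledge-graph triple is used as one training sequence.
triple = ["Yangcheng Lake", "location", "Suzhou"]
samples = cbow_samples(triple, window=1)
```

With a window of 1, the attribute "location" is predicted from both the head and the tail entity, while each entity is predicted from the attribute alone.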

3. Calculating the weighted average of the three similarities as the similarity of the candidate entity pair or candidate attribute pair

Step 2 yields the similarities of the candidate entity pair or candidate attribute pair at the Chinese literal level, the English literal level, and the vector representation level. To obtain the overall similarity of the candidate entity pair or candidate attribute pair, a weighted average of the three similarities is calculated. Writing the Chinese literal similarity as a, the English literal similarity as b, the vector representation similarity as c, and the weighted average as d:

d = α*a + β*b + γ*c

In the formula, α, β, and γ are the weights of the three similarities in the final similarity of the candidate entity pair or candidate attribute pair, and are set to 0.6, 0.3, and 0.1 respectively in the present invention.
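The weighted average can be written as a small helper. The defaults below are the weights stated in the text; the function name and the check that the weights sum to 1 are illustrative additions.

```python
def combined_similarity(a, b, c, alpha=0.6, beta=0.3, gamma=0.1):
    """d = alpha*a + beta*b + gamma*c, with the weights required to sum to 1."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return alpha * a + beta * b + gamma * c
```

As a worked example, similarities a = 0.75, b = 0.5, and c = 0.9 give d = 0.45 + 0.15 + 0.09 = 0.69, which is above the 0.6 threshold used in the screening step.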

4. Screening out matched entity pairs or matched attribute pairs according to a preset threshold

Step 3 yields the similarity of each candidate entity pair or candidate attribute pair. Because the invention aims to fuse multi-source heterogeneous water conservancy field data using word embedding technology, candidate entity pairs and candidate attribute pairs with low similarity must be filtered out using a preset threshold to improve accuracy. The preset similarity threshold is 0.6: candidate entity pairs and candidate attribute pairs with similarity below the threshold are filtered out, and those with similarity above the threshold are retained.

Because some entities or attributes have an excessive number of similar matches, an upper limit on the number of matches per entity or attribute is used to restrict the number of matched entities or attributes. The limit is set to 10 in the present invention: for each entity or attribute, only the candidate entities or candidate attributes whose similarity is above the threshold of 0.6 and ranks in the top 10 are retained, forming the final matching entity pairs and matching attribute pairs.

The word vector-based multi-source heterogeneous water conservancy field data fusion method can match similar entities and similar attributes in multi-source heterogeneous water conservancy field data, improving the recall rate of water conservancy field information retrieval, while the double constraint of a threshold and an upper limit improves the accuracy of the retrieval results. Several examples are given in Table 1 below, where the first column is the target entity or attribute and the second column lists the entities or attributes with high similarity to it (in descending order of similarity).
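The double constraint of threshold and upper limit can be sketched as below. The input layout (a list of `(target, candidate, score)` tuples) and the function name `select_matches` are assumptions for illustration. Pairs scoring exactly at the threshold are kept here, although the text only distinguishes "lower" from "higher".

```python
from collections import defaultdict


def select_matches(scored_pairs, threshold=0.6, limit=10):
    """Keep, per target, at most `limit` candidates whose score reaches the threshold."""
    by_target = defaultdict(list)
    for target, candidate, score in scored_pairs:
        if score >= threshold:
            by_target[target].append((candidate, score))
    return {
        target: sorted(candidates, key=lambda item: item[1], reverse=True)[:limit]
        for target, candidates in by_target.items()
    }


# Hypothetical scores: target "A" has 15 candidates, target "B" has one weak candidate.
scored = [("A", f"c{i}", 0.5 + 0.04 * i) for i in range(15)]
scored.append(("B", "c0", 0.3))
matches = select_matches(scored)
```

In this example, twelve candidates for "A" reach the threshold, but only the ten highest-scoring ones are retained, and "B" has no matches at all.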

Table 1: Examples of matching entity pairs or matching attribute pairs in the present invention

Claims

1. A multi-source heterogeneous water conservancy field data fusion method based on word embedding is characterized by comprising the following steps:

2. The word embedding-based multi-source heterogeneous water conservancy field data fusion method according to claim 1, wherein the candidate entity pair consists of two candidate entities, the candidate attribute pair consists of two candidate attributes, and the step (1.2) comprises the following steps:

(2.1) calculating the character string similarity of the Chinese names of the two candidate entities or the candidate attributes according to the Jaccard index;

(2.2) calculating the character string similarity of the English names of the two candidate entities or the candidate attributes according to the edit distance;

3. The word-embedding-based multi-source heterogeneous water conservancy field data fusion method according to claim 2, wherein the step (2.3) comprises the following steps:

4. The word embedding-based multi-source heterogeneous water conservancy field data fusion method according to claim 1, wherein the step (1.3) comprises the following steps: