CN109582759B - Method for measuring similarity of documents - Google Patents

Method for measuring similarity of documents

Info

Publication number
CN109582759B
CN109582759B (application CN201811361247.0A)
Authority
CN
China
Prior art keywords
document
similarity
official
documents
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811361247.0A
Other languages
Chinese (zh)
Other versions
CN109582759A (en)
Inventor
李泽源
方鑫
王鹏
陈达纲
宋亚军
李泽松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC Big Data Research Institute Co Ltd
Original Assignee
CETC Big Data Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC Big Data Research Institute Co Ltd filed Critical CETC Big Data Research Institute Co Ltd
Priority to CN201811361247.0A priority Critical patent/CN109582759B/en
Publication of CN109582759A publication Critical patent/CN109582759A/en
Application granted granted Critical
Publication of CN109582759B publication Critical patent/CN109582759B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for measuring the similarity of official documents, which comprises the following steps: constructing an ontology knowledge base; preprocessing the document text; calculating the similarity of four types of structured information; calculating the similarity of the remaining document content; and combining these into the overall document similarity. The document similarity obtained by the method can be used for retrieval, search and recommendation of official documents and can improve the convenience of officials' daily work. Because the document similarity is calculated with an ontology knowledge base, the calculation accuracy is higher than that of traditional classical algorithms such as doc2vec and LDA.

Description

Method for measuring similarity of documents
Technical Field
The invention relates to a method for measuring the similarity of documents, belonging to the technical field of measuring the similarity of documents.
Background
In the era of information explosion, the growth rate of documents and articles far outpaces the performance of search and navigation algorithms. In China, most public policies concerning various aspects of life, such as education, health care, real estate and finance, are published through official government documents. In some fields, particularly real estate and education, local policies are revised frequently and differ from city to city, so citizens often cannot find the appropriate documents. Public officials need to consult documents from various government organs in order to draft new documents, analyze policies and interpret them for the public. In addition, the public also finds it difficult to locate the required documents accurately, because the popular search engines in China, such as Baidu and Sogou, do not focus on official-document search and indexing. All of this increases the difficulty for citizens and officials of finding relevant documents, and a recommendation system is needed in addition to a search engine if a user wants to discover relevant documents that are unknown to him.
Currently, most studies focus on semantic similarity between words, sentences and paragraphs, and accurately measuring the semantic relatedness or similarity between documents plays an important role in many applications such as search and recommendation. However, comparing the semantic similarity of documents is extremely difficult because of the complexity of natural-language semantics. In general, two documents are considered similar if they have the same meaning or convey the same idea or topic. Measuring document similarity involves two parts: an effective representation of the documents, and a similarity measure over those representations. Many previous studies followed the idea that large text units are composed of small text units, so that word similarity builds up to sentence similarity, and that the words in a document and their order are the two main factors in computing document similarity. In most existing approaches, documents are mapped to fixed-length vectors, the well-known bag-of-words model. In a bag of words, a document is represented as a frequency vector over a vocabulary, each component reflecting how often a word appears in the document. But the bag of words only captures word frequency; it does not account for multiple meanings of the same word, and it ignores word order. Latent topic models such as LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation) or word2vec map words and documents to latent topics and vectors; these require much shorter vectors than the bag of words, but their drawback is that the generated vectors and latent topics are difficult to interpret.
In addition to the methods mentioned above, much work has focused on semantic similarity at the concept level, because concept similarity can be measured by knowledge-based methods that use paths between concepts in a knowledge graph or ontology knowledge base to compute similarity; some researchers build concept vectors from a set of ontology attributes of the concepts rather than directly using distances between entities in a knowledge graph.
Government organ documents, by contrast, are written in a standard, highly structured and canonical format. We can therefore extract some structured information from each document and describe it with a unified structure. This structured information is an important component of an official document and represents its semantics to a certain extent. We therefore treat a document as semi-structured data after data cleansing. Based on this, we propose a method for measuring the similarity of official documents.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a method for measuring the similarity of official documents which calculates document similarity with an ontology knowledge base and achieves higher calculation accuracy than traditional classical algorithms such as doc2vec and LDA.
The invention is realized by the following technical scheme.
The invention provides a method for measuring the similarity of documents, which comprises the following steps:
① constructing an ontology knowledge base: constructing ontology knowledge bases of government organ units and official document topics;
② official document text preprocessing: extracting four types of information from the two documents whose similarity is to be compared: organ unit information, topic information, genre information and issue date information;
③ calculating the similarity of the four types of information: respectively calculating the organ unit similarity, topic similarity, genre similarity and issue date similarity of the two documents;
④ calculating the similarity of the remaining document content: calculating the similarity of the text other than the organ units, document topics, genre and issue date by doc2vec;
⑤ document similarity: weighting and summing the similarities obtained in steps ③ and ④ to obtain the similarity of the two documents.
Step ② comprises the following steps:
(2.1) acquiring organ unit information: extracting the sending organ and the receiving organs from the documents by regular-expression matching;
(2.2) acquiring the document topic information: extracting the title and the first two paragraphs from the document, matching and discarding the organ unit information, and then performing basic text preprocessing;
(2.3) acquiring the genre information: dividing each genre into sub-genres according to its specific function, and determining the document genre information by regular-expression matching;
(2.4) acquiring the issue date information: extracting the issue date by regular-expression matching of the time expressions in the document.
In step (2.2), the text preprocessing is divided into the following steps:
(2.2.1) performing word segmentation and stop-word removal on the remaining text, where the removed words also include pure numbers and words that appear only once in the whole document corpus;
(2.2.2) matching keywords of the topic ontology knowledge base against the remaining words to determine the topic labels of the document.
In step (2.3), the genres are divided into 15 types according to the official document format of Party and government organs, and each genre is further subdivided one level into sub-genres according to its specific function.
In step ③, based on the ontology knowledge base, the organ unit similarity S_dep(e_x, e_y) is calculated as:
S_dep(e_x, e_y) = 1 - d(e_x, e_y);
where e_x and e_y are organ units and d(e_x, e_y) is the distance between e_x and e_y in the ontology knowledge base;
d(e_x, e_y) is calculated as:
d(e_x, e_y) = (d(lca(x, y), x) + d(lca(x, y), y)) / (d(root, x) + d(root, y));
where d(root, x) is the distance from node x to the root node of the ontology knowledge base, d(root, y) is the distance from node y to the root node, d(lca(x, y), x) is the distance from node x to the lowest common ancestor of x and y, and d(lca(x, y), y) is the distance from node y to the lowest common ancestor of x and y;
when a document contains several organ units, the similarity S_dep(i, j) of the organ unit information of two documents is:
S_dep(i, j) = 1 - (Σ_{m=1..M} d(e_{i,m}, ε_{j,m}) + Σ_{n=1..N} d(e_{j,n}, ε_{i,n})) / (M + N);
where e_{i,m} is the m-th organ unit in document i, ε_{j,m} is the organ unit in document j closest to the m-th organ unit of document i, M is the total number of organ unit entities appearing in document i, N is the total number of organ unit entities appearing in document j, d(e_{i,m}, ε_{j,m}) is the distance between e_{i,m} and ε_{j,m} in the ontology knowledge base, and d(e_{j,n}, ε_{i,n}) is the distance between e_{j,n} and ε_{i,n} in the ontology knowledge base.
In step ③, the document topic similarity S_top(e_w, e_z) is calculated based on the ontology knowledge base as:
S_top(e_w, e_z) = 1 - d(e_w, e_z);
where e_w and e_z are document topics and d(e_w, e_z) is the distance between e_w and e_z in the ontology knowledge base;
d(e_w, e_z) is calculated as:
d(e_w, e_z) = (d(lca(w, z), w) + d(lca(w, z), z)) / (d(root, w) + d(root, z));
where d(root, w) is the distance from node w to the root node of the ontology knowledge base, d(root, z) is the distance from node z to the root node, d(lca(w, z), w) is the distance from node w to the lowest common ancestor of w and z, and d(lca(w, z), z) is the distance from node z to the lowest common ancestor of w and z;
when a document contains several document topics, the similarity S_top(d, s) of the topic information of two documents is:
S_top(d, s) = 1 - (Σ_{c=1..F} d(e_{d,c}, ε_{s,c}) + Σ_{g=1..G} d(e_{s,g}, ε_{d,g})) / (F + G);
where e_{d,c} is the c-th document topic in document d, ε_{s,c} is the document topic in document s closest to the c-th topic of document d, F is the total number of document topics appearing in document d, G is the total number of document topics appearing in document s, d(e_{d,c}, ε_{s,c}) is the distance between e_{d,c} and ε_{s,c} in the ontology knowledge base, and d(e_{s,g}, ε_{d,g}) is the distance between e_{s,g} and ε_{d,g} in the ontology knowledge base.
In step ③, the genre similarity is 0, 0.5 or 1: if the genre and the sub-genre of the two documents are identical, the similarity is 1; if the two documents have the same genre but different sub-genres, the similarity is 0.5; if the two documents have different genres, the similarity is 0.
In step ③, the issue date similarity S_date(i, j) is calculated by the following formula:
[formula shown as an image in the original publication]
where date_i is the issue date of the i-th document and date_j is the issue date of the j-th document.
In step ①,
the ontology knowledge base of government organ units is built by sorting out the business and administrative relations of all government committees, offices and bureaus to construct a direct-relation graph of organ units;
the ontology knowledge base of document topics is built based on the official document classification of the State Council General Office website, subdividing the document topics downward to form an ontology knowledge base of document topics.
In step ⑤, the similarity between document i and document j is obtained by weighting the organ unit similarity, document topic similarity, genre similarity, issue date similarity and remaining-content similarity; the specific formula is:
SIM_doc(i, j) = Σ_{k=1..5} w_k · S_k(i, j);
where SIM_doc(i, j) is the similarity between document i and document j, S_k(i, j) is one of the 5 similarities and w_k is its weight;
the specific weight assignment is: the weight of the organ unit similarity is 0.25, the weight of the document topic similarity is 0.25, the weight of the genre similarity is 0.05, the weight of the issue date similarity is 0.05, and the weight of the remaining-content similarity is 0.4.
The invention has the beneficial effects that:
1. the document similarity obtained by the method can be used for retrieval, search and recommendation of official documents and can improve the convenience of officials' daily work;
2. the method calculates document similarity with an ontology knowledge base and achieves higher calculation accuracy than traditional classical algorithms such as doc2vec and LDA.
Drawings
FIG. 1 is a connection diagram of the government organ unit ontology knowledge base of the present invention;
FIG. 2 is a schematic diagram of the genre classes and sub-genres of the present invention;
FIG. 3 is a graph a of the results of comparing the present invention with the prior art;
fig. 4 is a graph b of the results of the present invention compared to the prior art.
Detailed Description
The technical solution of the present invention is further described below, but the scope of the claimed invention is not limited to the described.
A method for measuring document similarity, comprising the steps of:
① constructing an ontology knowledge base: constructing ontology knowledge bases of government organ units and official document topics;
further, the ontology knowledge base of government organ units: a direct-relation graph of organ units is constructed by sorting out the business and administrative relations of all government committees, offices and bureaus, as shown in FIG. 1;
the ontology knowledge base of document topics: based on the official document classification of the State Council General Office website, the document topics are subdivided downward to form an ontology knowledge base of document topics;
preferably, the genre and the issue date do not require an ontology knowledge base;
② official document text preprocessing: four types of information are extracted from the two documents whose similarity is to be compared: organ unit information, topic information (document topic information), genre information and issue date information; the four types of information are processed separately;
Specifically, step ② comprises the following steps:
(2.1) acquiring organ unit information: extracting the sending organ and the receiving organs from the documents by regular-expression matching;
(2.2) acquiring the document topic information: extracting the title and the first two paragraphs from the document, matching and discarding the organ unit information, and then performing basic text preprocessing;
(2.2.1) performing word segmentation and stop-word removal on the remaining text, where the removed words also include pure numbers and words that appear only once in the whole document corpus;
(2.2.2) matching keywords of the topic ontology knowledge base against the remaining words to determine the topic labels of the document;
(2.3) acquiring the genre information: according to the official document format of Party and government organs there are 15 kinds of government document genres in total, and each genre is subdivided one level into sub-genres according to its specific function; for example, the order genre can be subdivided into promulgation orders, administrative orders, appointment-and-removal orders and reward-and-punishment orders, as shown in FIG. 2; the document genre information is determined by regular-expression matching;
further, the 15 genres and their sub-genres are, respectively: resolution (promulgating resolution, approving resolution, declarative resolution), decision (directive decision, change decision, statutory decision, reward-and-punishment decision), order (promulgation order, administrative order, commendation order), bulletin (conference bulletin, event bulletin, joint bulletin), announcement (important-matter announcement, statutory-matter announcement, professional announcement), circular (informative (transactional) circular, regulatory (restrictive) circular), opinion (programmatic opinion, implementation opinion, guiding opinion), notice (issuing notice, approving-and-forwarding notice, forwarding notice, instructive notice, appointment-and-removal notice, transactional notice), circular notice (commendation circular notice, criticism circular notice, situation circular notice), report (routine report, special report, comprehensive report), request for instructions (request for policy instructions, request for assistance, request for approval), reply (reply approving a matter, reply approving regulations, policy reply, affirmative reply, negative reply, answering reply), proposal (legislative proposal, major-matter decision proposal, appointment-and-removal proposal, constructive proposal), letter (business negotiation letter, inquiry letter, request letter, reply letter), minutes (office meeting minutes, special meeting minutes);
(2.4) acquiring the issue date information: extracting the issue date by regular-expression matching of the time expressions in the document (an illustrative sketch of this extraction and of the preprocessing is given below).
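The extraction and preprocessing of step ② can be illustrated with a short Python sketch. This is a minimal sketch only: the regular expressions, the jieba segmenter and all function and variable names are assumptions made for illustration, not part of the patented method.

```python
import re
import jieba  # assumed Chinese word-segmentation library

# Hypothetical pattern; a real system would tune this to the official-document format.
DATE_PATTERN = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")

def extract_issue_date(text):
    """Step (2.4): return the first (year, month, day) found in the document, if any."""
    m = DATE_PATTERN.search(text)
    return tuple(int(g) for g in m.groups()) if m else None

def preprocess_topic_text(title, first_two_paragraphs, stopwords, corpus_word_counts):
    """Step (2.2.1): segment, then drop stop words, pure numbers and corpus-singleton words."""
    words = jieba.lcut(title + first_two_paragraphs)
    return [w for w in words
            if w.strip()
            and w not in stopwords
            and not w.isdigit()
            and corpus_word_counts.get(w, 0) > 1]

def topic_labels(words, topic_keyword_index):
    """Step (2.2.2): map the remaining words to topic labels via the topic ontology keywords."""
    return {topic_keyword_index[w] for w in words if w in topic_keyword_index}
```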
③ calculating the similarity of the four types of information: respectively calculating the organ unit similarity, topic similarity, genre similarity and issue date similarity of the two documents;
④ calculating the similarity of the remaining document content: calculating the similarity of the text other than the organ units, document topics, genre and issue date by doc2vec;
⑤ document similarity: weighting and summing the similarities obtained in steps ③ and ④ to obtain the similarity of the two documents;
Further, in step ⑤, the similarity between document i and document j is obtained by weighting the organ unit similarity, document topic similarity, genre similarity, issue date similarity and remaining-content similarity; the specific formula is:
SIM_doc(i, j) = Σ_{k=1..5} w_k · S_k(i, j);
where SIM_doc(i, j) is the similarity between document i and document j, S_k(i, j) is one of the 5 similarities and w_k is its weight;
the specific weight assignment is: the weight of the organ unit similarity is 0.25, the weight of the document topic similarity is 0.25, the weight of the genre similarity is 0.05, the weight of the issue date similarity is 0.05, and the weight of the remaining-content similarity is 0.4 (a sketch of this weighted combination follows).
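A minimal Python sketch of this weighted combination, assuming the five component similarities are computed by separate functions; the function names and the packaging of the weights are illustrative assumptions.

```python
def document_similarity(doc_i, doc_j,
                        s_dep, s_top, s_genre, s_date, s_text,
                        weights=(0.25, 0.25, 0.05, 0.05, 0.40)):
    """SIM_doc(i, j): weighted sum of the five component similarities of steps ③ and ④."""
    components = (
        s_dep(doc_i, doc_j),    # organ unit similarity (ontology based)
        s_top(doc_i, doc_j),    # document topic similarity (ontology based)
        s_genre(doc_i, doc_j),  # genre similarity (0, 0.5 or 1)
        s_date(doc_i, doc_j),   # issue date similarity
        s_text(doc_i, doc_j),   # remaining-content similarity (doc2vec)
    )
    return sum(w * s for w, s in zip(weights, components))
```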
In step ③, based on the ontology knowledge base, the organ unit similarity S_dep(e_x, e_y) is calculated as:
S_dep(e_x, e_y) = 1 - d(e_x, e_y);
where e_x and e_y are organ units and d(e_x, e_y) is the distance between e_x and e_y in the ontology knowledge base;
d(e_x, e_y) is calculated as:
d(e_x, e_y) = (d(lca(x, y), x) + d(lca(x, y), y)) / (d(root, x) + d(root, y));
where d(root, x) is the distance from node x to the root node of the ontology knowledge base, d(root, y) is the distance from node y to the root node, d(lca(x, y), x) is the distance from node x to the lowest common ancestor of x and y, and d(lca(x, y), y) is the distance from node y to the lowest common ancestor of x and y;
when a document contains several organ units, the similarity S_dep(i, j) of the organ unit information of two documents is:
S_dep(i, j) = 1 - (Σ_{m=1..M} d(e_{i,m}, ε_{j,m}) + Σ_{n=1..N} d(e_{j,n}, ε_{i,n})) / (M + N);
where e_{i,m} is the m-th organ unit in document i, ε_{j,m} is the organ unit in document j closest to the m-th organ unit of document i, M is the total number of organ unit entities appearing in document i, N is the total number of organ unit entities appearing in document j, d(e_{i,m}, ε_{j,m}) is the distance between e_{i,m} and ε_{j,m} in the ontology knowledge base, calculated as d(e_x, e_y) above, and d(e_{j,n}, ε_{i,n}) is the distance between e_{j,n} and ε_{i,n} in the ontology knowledge base (a sketch of this distance computation is given below).
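The LCA-based distance and the single-pair organ unit similarity can be sketched as follows, assuming the ontology knowledge base is stored as a simple parent-pointer tree; all identifiers are illustrative assumptions rather than the patented implementation.

```python
def path_to_root(node, parent_of):
    """Return the list [node, parent, ..., root] in the ontology tree."""
    path = [node]
    while node in parent_of:
        node = parent_of[node]
        path.append(node)
    return path

def ontology_distance(x, y, parent_of):
    """d(e_x, e_y) = (d(lca, x) + d(lca, y)) / (d(root, x) + d(root, y)), in [0, 1]."""
    px, py = path_to_root(x, parent_of), path_to_root(y, parent_of)
    ancestors_x = {node: dist for dist, node in enumerate(px)}
    # The first node on y's path to the root that is also an ancestor of x is the LCA.
    d_lca_x, d_lca_y = next((ancestors_x[node], dist)
                            for dist, node in enumerate(py) if node in ancestors_x)
    d_root_x, d_root_y = len(px) - 1, len(py) - 1
    if d_root_x + d_root_y == 0:   # both nodes are the root itself
        return 0.0
    return (d_lca_x + d_lca_y) / (d_root_x + d_root_y)

def organ_unit_similarity(x, y, parent_of):
    """S_dep(e_x, e_y) = 1 - d(e_x, e_y)."""
    return 1.0 - ontology_distance(x, y, parent_of)
```

The same distance function can be reused for document topics on the topic ontology.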
In step ③, the document topic similarity S_top(e_w, e_z) is calculated based on the ontology knowledge base as:
S_top(e_w, e_z) = 1 - d(e_w, e_z);
where e_w and e_z are document topics and d(e_w, e_z) is the distance between e_w and e_z in the ontology knowledge base;
d(e_w, e_z) is calculated as:
d(e_w, e_z) = (d(lca(w, z), w) + d(lca(w, z), z)) / (d(root, w) + d(root, z));
where d(root, w) is the distance from node w to the root node of the ontology knowledge base, d(root, z) is the distance from node z to the root node, d(lca(w, z), w) is the distance from node w to the lowest common ancestor of w and z, and d(lca(w, z), z) is the distance from node z to the lowest common ancestor of w and z;
when a document contains several document topics, the similarity S_top(d, s) of the topic information of two documents is:
S_top(d, s) = 1 - (Σ_{c=1..F} d(e_{d,c}, ε_{s,c}) + Σ_{g=1..G} d(e_{s,g}, ε_{d,g})) / (F + G);
where e_{d,c} is the c-th document topic in document d, ε_{s,c} is the document topic in document s closest to the c-th topic of document d, F is the total number of document topics appearing in document d, G is the total number of document topics appearing in document s, d(e_{d,c}, ε_{s,c}) is the distance between e_{d,c} and ε_{s,c} in the ontology knowledge base, calculated as d(e_w, e_z) above, and d(e_{s,g}, ε_{d,g}) is the distance between e_{s,g} and ε_{d,g} in the ontology knowledge base (the closest-match aggregation used here and for organ units is sketched below).
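The bidirectional closest-match aggregation used above for organ units and here for document topics can be sketched as follows; the identifiers are illustrative assumptions.

```python
def set_similarity(entities_i, entities_j, distance):
    """S(i, j) = 1 - (closest-match distances summed in both directions) / (M + N)."""
    if not entities_i or not entities_j:
        return 0.0
    total = sum(min(distance(e, f) for f in entities_j) for e in entities_i) \
          + sum(min(distance(f, e) for e in entities_i) for f in entities_j)
    return 1.0 - total / (len(entities_i) + len(entities_j))
```

For organ units the distance argument would be the ontology_distance of the previous sketch over the organ unit ontology; for topics, the same function over the topic ontology.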
In step ③, the genre similarity is 0, 0.5 or 1: if the genre and the sub-genre of the two documents are identical (for example, both are administrative orders within the order genre), the similarity is 1; if the two documents have the same genre but different sub-genres, the similarity is 0.5; and if the two documents have different genres, e.g. an order and a decision, the similarity is 0.
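This three-level genre rule can be written directly as a small function; the field names are illustrative assumptions.

```python
def genre_similarity(doc_i, doc_j):
    """Return 1, 0.5 or 0 according to the genre / sub-genre rule of step ③."""
    if doc_i["genre"] != doc_j["genre"]:
        return 0.0   # different genres, e.g. an order and a decision
    if doc_i["sub_genre"] != doc_j["sub_genre"]:
        return 0.5   # same genre, different sub-genres
    return 1.0       # genre and sub-genre both identical
```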
In step ③, the issue date similarity S_date(i, j) is calculated by the following formula:
[formula shown as an image in the original publication]
where date_i is the issue date of the i-th document and date_j is the issue date of the j-th document.
Examples
As mentioned above, the issue date of a document can be obtained directly during document crawling, the organ and genre information can easily be obtained through regular expressions, and the topic information of a document can be extracted with a supervised or unsupervised algorithm.
For comparison, the similarity of the documents was also calculated directly with the classical LDA and doc2vec algorithms. To obtain reasonable performance for LDA and doc2vec, 13,238 documents were used to train the embedding models in addition to the documents selected for the evaluation dataset.
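As an illustration of the doc2vec step, the remaining-content similarity could be computed with the gensim library roughly as follows; the hyper-parameters shown are placeholders, not the values used in the experiments.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_doc2vec(tokenized_docs, vector_size=100, epochs=20):
    """tokenized_docs: list of token lists (the remaining text after step ②)."""
    corpus = [TaggedDocument(words=tokens, tags=[str(i)])
              for i, tokens in enumerate(tokenized_docs)]
    return Doc2Vec(corpus, vector_size=vector_size, min_count=2, epochs=epochs)

def remaining_content_similarity(model, tokens_i, tokens_j):
    """Cosine similarity between the inferred doc2vec vectors of two documents."""
    vi, vj = model.infer_vector(tokens_i), model.infer_vector(tokens_j)
    return float(np.dot(vi, vj) / (np.linalg.norm(vi) * np.linalg.norm(vj)))
```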
In particular, the similarity SIM_doc(i, j) between document i and document j may be calculated as a weighted sum of the similarities between their elements, for example with the following formula:
SIM_doc(i, j) = Σ_{k=1..5} w_k · S_k(i, j)    (1)
where k indexes the k-th element of the document, w_k is the weight of the k-th element, and S_k(i, j) is one of the 5 similarities.
In formula (1), the weights of the elements, i.e. organ unit, topic information, genre, issue date and remaining text, are set to 0.25, 0.25, 0.05, 0.05 and 0.4, respectively. To evaluate the compared methods, we used the Pearson correlation coefficient to measure their agreement with the 2-level and 5-level ground truth; the results are shown in Table 1, and, as expected, the proposed method is superior to LDA and doc2vec.
Table 1: Comparison of similarity correlations between different methods (the larger the value, the better)
[table shown as an image in the original publication]
Since the 2-level ground truth uses 0 or 1 to represent similarity, we further set a threshold to predict whether two documents are similar: if the predicted similarity of a document pair is greater than the threshold, the pair is judged to be similar. The corresponding F1 values and accuracy results are shown in FIG. 3 and FIG. 4. The results show that the proposed method is superior to the traditional LDA and doc2vec methods.
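The evaluation described above can be reproduced in outline with scipy and scikit-learn; the variable names and the example threshold are placeholders.

```python
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score

def evaluate(predicted_sims, graded_truth, binary_truth, threshold=0.5):
    """predicted_sims, graded_truth, binary_truth: equal-length lists over document pairs."""
    corr, _ = pearsonr(predicted_sims, graded_truth)               # Table 1 style correlation
    labels = [1 if s > threshold else 0 for s in predicted_sims]   # threshold on similarity
    return {
        "pearson": corr,
        "f1": f1_score(binary_truth, labels),                      # FIG. 3 style result
        "accuracy": accuracy_score(binary_truth, labels),          # FIG. 4 style result
    }
```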

Claims (9)

1. A method for measuring the similarity of documents is characterized in that: the method comprises the following steps:
① constructing an ontology knowledge base: constructing ontology knowledge bases of government organ units and Party and government official document topics;
② Party and government official document text preprocessing: extracting four types of information from the two documents whose similarity is to be compared: organ unit information, topic information, genre information and issue date information;
③ calculating the similarity of the four types of information: respectively calculating the organ unit similarity, topic similarity, genre similarity and issue date similarity of the two documents;
④ calculating the similarity of the remaining document content: calculating the similarity of the text other than the organ units, document topics, genre and issue date by doc2vec;
⑤ document similarity: weighting and summing the similarities obtained in steps ③ and ④ to obtain the similarity of the two documents;
in step ③, based on the ontology knowledge base, the organ unit similarity S_dep(e_x, e_y) is calculated as:
S_dep(e_x, e_y) = 1 - d(e_x, e_y);
where e_x and e_y are organ units and d(e_x, e_y) is the distance between e_x and e_y in the ontology knowledge base;
d(e_x, e_y) is calculated as:
d(e_x, e_y) = (d(lca(x, y), x) + d(lca(x, y), y)) / (d(root, x) + d(root, y));
where d(root, x) is the distance from node x to the root node of the ontology knowledge base, d(root, y) is the distance from node y to the root node, d(lca(x, y), x) is the distance from node x to the lowest common ancestor of x and y, and d(lca(x, y), y) is the distance from node y to the lowest common ancestor of x and y;
when a document contains several organ units, the similarity S_dep(i, j) of the organ unit information of two documents is:
S_dep(i, j) = 1 - (Σ_{m=1..M} d(e_{i,m}, ε_{j,m}) + Σ_{n=1..N} d(e_{j,n}, ε_{i,n})) / (M + N);
where e_{i,m} is the m-th organ unit in document i, ε_{j,m} is the organ unit in document j closest to the m-th organ unit of document i, M is the total number of organ unit entities appearing in document i, N is the total number of organ unit entities appearing in document j, d(e_{i,m}, ε_{j,m}) is the distance between e_{i,m} and ε_{j,m} in the ontology knowledge base, and d(e_{j,n}, ε_{i,n}) is the distance between e_{j,n} and ε_{i,n} in the ontology knowledge base.
2. The method of claim 1, wherein: step ② comprises the following steps:
(2.1) acquiring organ unit information: extracting the sending organ and the receiving organs from the documents by regular-expression matching;
(2.2) acquiring the document topic information: extracting the title and the first two paragraphs from the government document, matching and discarding the organ unit information, and then performing basic text preprocessing;
(2.3) acquiring the genre information: dividing each genre into sub-genres according to its specific function, and determining the document genre information by regular-expression matching;
(2.4) acquiring the issue date information: extracting the issue date by regular-expression matching of the time expressions in the document.
3. The method of claim 2, wherein: in step (2.2), the text preprocessing is divided into the following steps:
(2.2.1) performing word segmentation and stop-word removal on the remaining text, where the removed words also include pure numbers and words that appear only once in the whole document corpus;
(2.2.2) matching keywords of the topic ontology knowledge base against the remaining words to determine the topic labels of the government document.
4. The method of claim 2, wherein: in step (2.3), the genres are divided into 15 types according to the official document format of Party and government organs, and each genre is further subdivided one level into sub-genres according to its specific function.
5. The method of claim 1, wherein: in step ③, the document topic similarity S_top(e_w, e_z) is calculated based on the ontology knowledge base as:
S_top(e_w, e_z) = 1 - d(e_w, e_z);
where e_w and e_z are document topics and d(e_w, e_z) is the distance between e_w and e_z in the ontology knowledge base;
d(e_w, e_z) is calculated as:
d(e_w, e_z) = (d(lca(w, z), w) + d(lca(w, z), z)) / (d(root, w) + d(root, z));
where d(root, w) is the distance from node w to the root node of the ontology knowledge base, d(root, z) is the distance from node z to the root node, d(lca(w, z), w) is the distance from node w to the lowest common ancestor of w and z, and d(lca(w, z), z) is the distance from node z to the lowest common ancestor of w and z;
when a document contains several document topics, the similarity S_top(d, s) of the topic information of two documents is:
S_top(d, s) = 1 - (Σ_{c=1..F} d(e_{d,c}, ε_{s,c}) + Σ_{g=1..G} d(e_{s,g}, ε_{d,g})) / (F + G);
where e_{d,c} is the c-th document topic in document d, ε_{s,c} is the document topic in document s closest to the c-th topic of document d, F is the total number of document topics appearing in document d, G is the total number of document topics appearing in document s, d(e_{d,c}, ε_{s,c}) is the distance between e_{d,c} and ε_{s,c} in the ontology knowledge base, and d(e_{s,g}, ε_{d,g}) is the distance between e_{s,g} and ε_{d,g} in the ontology knowledge base.
6. The method of claim 1, wherein: in step ③, the genre similarity is 0, 0.5 or 1: if the genre and the sub-genre of the two documents are identical, the similarity is 1; if the two documents have the same genre but different sub-genres, the similarity is 0.5; and if the two documents have different genres, the similarity is 0.
7. The method of claim 1, wherein: in step ③, the issue date similarity S_date(i, j) is calculated by the following formula:
[formula shown as an image in the original publication]
where date_i is the issue date of the i-th document and date_j is the issue date of the j-th document.
8. The method of claim 1, wherein: in step ①,
the ontology knowledge base of government organ units is built by sorting out the business and administrative relations of all government committees, offices and bureaus to construct a direct-relation graph of organ units;
the ontology knowledge base of Party and government document topics is built based on the official document classification of the State Council General Office website, subdividing the document topics downward to form an ontology knowledge base of document topics.
9. The method of claim 1, wherein: in step ⑤, the similarity between document i and document j is obtained by weighting the organ unit similarity, document topic similarity, genre similarity, issue date similarity and remaining-content similarity; the specific formula is:
SIM_doc(i, j) = Σ_{k=1..5} w_k · S_k(i, j);
where SIM_doc(i, j) is the similarity between document i and document j, S_k(i, j) is one of the 5 similarities and w_k is its weight;
the specific weight assignment is: the weight of the organ unit similarity is 0.25, the weight of the document topic similarity is 0.25, the weight of the genre similarity is 0.05, the weight of the issue date similarity is 0.05, and the weight of the remaining-content similarity is 0.4.
CN201811361247.0A 2018-11-15 2018-11-15 Method for measuring similarity of documents Active CN109582759B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811361247.0A CN109582759B (en) 2018-11-15 2018-11-15 Method for measuring similarity of documents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811361247.0A CN109582759B (en) 2018-11-15 2018-11-15 Method for measuring similarity of documents

Publications (2)

Publication Number Publication Date
CN109582759A CN109582759A (en) 2019-04-05
CN109582759B true CN109582759B (en) 2021-10-22

Family

ID=65922849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811361247.0A Active CN109582759B (en) 2018-11-15 2018-11-15 Method for measuring similarity of documents

Country Status (1)

Country Link
CN (1) CN109582759B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722556A (en) * 2012-05-29 2012-10-10 清华大学 Model comparison method based on similarity measurement
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 A kind of patent document method for measuring similarity of knowledge based body
CN107958061A (en) * 2017-12-01 2018-04-24 厦门快商通信息技术有限公司 The computational methods and computer-readable recording medium of a kind of text similarity
CN108536677A (en) * 2018-04-09 2018-09-14 北京信息科技大学 A kind of patent text similarity calculating method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10642912B2 (en) * 2016-08-17 2020-05-05 Adobe Inc. Control of document similarity determinations by respective nodes of a plurality of computing devices

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722556A (en) * 2012-05-29 2012-10-10 清华大学 Model comparison method based on similarity measurement
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 A kind of patent document method for measuring similarity of knowledge based body
CN107958061A (en) * 2017-12-01 2018-04-24 厦门快商通信息技术有限公司 The computational methods and computer-readable recording medium of a kind of text similarity
CN108536677A (en) * 2018-04-09 2018-09-14 北京信息科技大学 A kind of patent text similarity calculating method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于向量空间模型和专利文献特征的相似专利确定方法 (Method for determining similar patents based on a vector space model and patent document features); 陈芨熙 et al.; 《浙江大学学报》 (Journal of Zhejiang University); 2009-10-31; pp. 1848-1852 *

Also Published As

Publication number Publication date
CN109582759A (en) 2019-04-05

Similar Documents

Publication Publication Date Title
Hammad et al. An approach for detecting spam in Arabic opinion reviews
CN106997382A (en) Innovation intention label automatic marking method and system based on big data
Kumar et al. Exploration of sentiment analysis and legitimate artistry for opinion mining
Avasthi et al. Techniques, applications, and issues in mining large-scale text databases
Feng et al. Extracting common emotions from blogs based on fine-grained sentiment clustering
Huang et al. Expert as a service: Software expert recommendation via knowledge domain embeddings in stack overflow
CN114254201A (en) Recommendation method for science and technology project review experts
Chaudhuri et al. Hidden features identification for designing an efficient research article recommendation system
Dobrovolskyi et al. Collecting the Seminal Scientific Abstracts with Topic Modelling, Snowball Sampling and Citation Analysis.
Mgarbi et al. Towards a new job offers recommendation system based on the candidate resume
Brewster et al. Ontologies, taxonomies, thesauri: Learning from texts
Stylios et al. Using Bio-inspired intelligence for Web opinion Mining
Kawamura et al. Funding map using paragraph embedding based on semantic diversity
Yang et al. EFS: Expert finding system based on Wikipedia link pattern analysis
CN109582759B (en) Method for measuring similarity of documents
Tong et al. A document exploring system on LDA topic model for Wikipedia articles
Mahalakshmi et al. On the expressive power of scientific manuscripts
Scholtes et al. Big data analytics for e-discovery
Li et al. A hybrid approach for measuring similarity between government documents of China
JP2011150603A (en) Category theme phrase extracting device, hierarchical tag attaching device, method, and program, and computer-readable recording medium
Mateen et al. An Analysis on Text Mining Techniques for Smart Literature Review
Wang et al. The big data analysis and visualization of mass messages under “smart government affairs” based on text mining
Bai et al. WHOSe Heritage: Classification of UNESCO World Heritage "Outstanding Universal Value" Documents with Soft Labels
Ojokoh et al. A graph model with integrated pattern and query-based technique for extracting answer to questions in community question answering system
Todkar et al. Recommendation engine feedback session strategy for mapping user search goals (FFS: Recommendation system)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant