CN109582759B - Method for measuring similarity of documents - Google Patents

Method for measuring similarity of documents

Info

Publication number
CN109582759B
CN109582759B (application CN201811361247.0A)
Authority
CN
China
Prior art keywords
document
similarity
official
documents
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811361247.0A
Other languages
Chinese (zh)
Other versions
CN109582759A (en)
Inventor
李泽源
方鑫
王鹏
陈达纲
宋亚军
李泽松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC Big Data Research Institute Co Ltd
Original Assignee
CETC Big Data Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC Big Data Research Institute Co Ltd filed Critical CETC Big Data Research Institute Co Ltd
Priority to CN201811361247.0A priority Critical patent/CN109582759B/en
Publication of CN109582759A publication Critical patent/CN109582759A/en
Application granted granted Critical
Publication of CN109582759B publication Critical patent/CN109582759B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for measuring the similarity of official documents, which comprises the following steps: constructing an ontology knowledge base; preprocessing the document text; calculating the similarity of four types of structured information; calculating the similarity of the remaining document content; and combining these into the overall document similarity. The document similarity obtained by the method can be used for retrieval, search and recommendation of official documents and can improve the convenience of officials' daily work. Because the document similarity is calculated with an ontology knowledge base, the calculation accuracy is higher than that of traditional classical algorithms such as doc2vec and LDA.

Description

Method for measuring similarity of documents
Technical Field
The invention relates to a method for measuring the similarity of documents, belonging to the technical field of measuring the similarity of documents.
Background
In the era of information explosion, the growth rate of documents and articles far outpaces the performance of search and navigation algorithms. In China, most public policies concerning various aspects of life, such as education, health care, real estate and finance, are published through official government documents. In some fields, particularly real estate and education, local policies are revised frequently and differ from city to city, so citizens often cannot find the appropriate documents. Public officials need to consult documents from various government organs in order to draft new documents, analyze policies and interpret them for the public. In addition, the public also finds it difficult to locate the required documents accurately, because the popular search engines in China, such as Baidu and Sogou, do not focus on official-document search and indexing. All of this increases the difficulty for citizens and officials of finding relevant documents, and a recommendation system is needed in addition to a search engine if a user wants to discover relevant documents that are unknown to him.
Currently, most studies focus on semantic similarity between words, sentences and paragraphs, and accurately measuring the semantic relatedness or similarity between documents plays an important role in many applications such as search and recommendation. However, comparing the semantic similarity of documents is extremely difficult because of the complexity of natural-language semantics. In general, two documents are considered similar if they have the same meaning or convey the same idea or topic. Measuring document similarity involves two parts: an effective representation of the documents, and a similarity measure over those representations. Many previous studies followed the idea that large text units are composed of small text units, so that word similarity builds up to sentence similarity, and that the words in a document and their order are the two main factors in computing document similarity. In most existing approaches, documents are mapped to fixed-length vectors, the well-known bag-of-words model. In a bag of words, a document is represented as a frequency vector over a vocabulary, each component reflecting how often a word appears in the document. But the bag of words only captures word frequency; it does not account for multiple meanings of the same word, and it ignores word order. Latent topic models such as LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation) or word2vec map words and documents to latent topics and vectors; these require much shorter vectors than the bag of words, but their drawback is that the generated vectors and latent topics are difficult to interpret.
In addition to the methods mentioned above, much work has focused on semantic similarity at the concept level, because concept similarity can be measured by knowledge-based methods that use paths between concepts in a knowledge graph or ontology knowledge base to compute similarity; some researchers build concept vectors from a set of ontology attributes of the concepts rather than directly using distances between entities in a knowledge graph.
Government organ documents, by contrast, are written in a standard, highly structured and canonical format. We can therefore extract some structured information from each document and describe it with a unified structure. This structured information is an important component of an official document and represents its semantics to a certain extent. We therefore treat a document as semi-structured data after data cleansing. Based on this, we propose a method for measuring the similarity of official documents.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a method for measuring the similarity of official documents which calculates document similarity with an ontology knowledge base and achieves higher calculation accuracy than traditional classical algorithms such as doc2vec and LDA.
The invention is realized by the following technical scheme.
The invention provides a method for measuring the similarity of documents, which comprises the following steps:
① constructing an ontology knowledge base: constructing ontology knowledge bases of government organ units and official document topics;
② official document text preprocessing: extracting four types of information from the two documents whose similarity is to be compared: organ unit information, topic information, genre information and issue date information;
③ calculating the similarity of the four types of information: respectively calculating the organ unit similarity, topic similarity, genre similarity and issue date similarity of the two documents;
④ calculating the similarity of the remaining document content: calculating the similarity of the text other than the organ units, document topics, genre and issue date by doc2vec;
⑤ document similarity: weighting and summing the similarities obtained in steps ③ and ④ to obtain the similarity of the two documents.
Step ② comprises the following steps:
(2.1) acquiring organ unit information: extracting the sending organ and the receiving organs from the documents by regular-expression matching;
(2.2) acquiring the document topic information: extracting the title and the first two paragraphs from the document, matching and discarding the organ unit information, and then performing basic text preprocessing;
(2.3) acquiring the genre information: dividing each genre into sub-genres according to its specific function, and determining the document genre information by regular-expression matching;
(2.4) acquiring the issue date information: extracting the issue date by regular-expression matching of the time expressions in the document.
In step (2.2), the text preprocessing is divided into the following steps:
(2.2.1) performing word segmentation and stop-word removal on the remaining text, where the removed words also include pure numbers and words that appear only once in the whole document corpus;
(2.2.2) matching keywords of the topic ontology knowledge base against the remaining words to determine the topic labels of the document.
In step (2.3), the genres are divided into 15 types according to the official document format of Party and government organs, and each genre is further subdivided one level into sub-genres according to its specific function.
In step ③, based on the ontology knowledge base, the organ unit similarity S_dep(e_x, e_y) is calculated as:
S_dep(e_x, e_y) = 1 - d(e_x, e_y);
where e_x and e_y are organ units and d(e_x, e_y) is the distance between e_x and e_y in the ontology knowledge base;
d(e_x, e_y) is calculated as:
d(e_x, e_y) = (d(lca(x, y), x) + d(lca(x, y), y)) / (d(root, x) + d(root, y));
where d(root, x) is the distance from node x to the root node of the ontology knowledge base, d(root, y) is the distance from node y to the root node, d(lca(x, y), x) is the distance from node x to the lowest common ancestor of x and y, and d(lca(x, y), y) is the distance from node y to the lowest common ancestor of x and y;
when a document contains several organ units, the similarity S_dep(i, j) of the organ unit information of two documents is:
S_dep(i, j) = 1 - (Σ_{m=1..M} d(e_{i,m}, ε_{j,m}) + Σ_{n=1..N} d(e_{j,n}, ε_{i,n})) / (M + N);
where e_{i,m} is the m-th organ unit in document i, ε_{j,m} is the organ unit in document j closest to the m-th organ unit of document i, M is the total number of organ unit entities appearing in document i, N is the total number of organ unit entities appearing in document j, d(e_{i,m}, ε_{j,m}) is the distance between e_{i,m} and ε_{j,m} in the ontology knowledge base, and d(e_{j,n}, ε_{i,n}) is the distance between e_{j,n} and ε_{i,n} in the ontology knowledge base.
In step ③, the document topic similarity S_top(e_w, e_z) is calculated based on the ontology knowledge base as:
S_top(e_w, e_z) = 1 - d(e_w, e_z);
where e_w and e_z are document topics and d(e_w, e_z) is the distance between e_w and e_z in the ontology knowledge base;
d(e_w, e_z) is calculated as:
d(e_w, e_z) = (d(lca(w, z), w) + d(lca(w, z), z)) / (d(root, w) + d(root, z));
where d(root, w) is the distance from node w to the root node of the ontology knowledge base, d(root, z) is the distance from node z to the root node, d(lca(w, z), w) is the distance from node w to the lowest common ancestor of w and z, and d(lca(w, z), z) is the distance from node z to the lowest common ancestor of w and z;
when a document contains several document topics, the similarity S_top(d, s) of the topic information of two documents is:
S_top(d, s) = 1 - (Σ_{c=1..F} d(e_{d,c}, ε_{s,c}) + Σ_{g=1..G} d(e_{s,g}, ε_{d,g})) / (F + G);
where e_{d,c} is the c-th document topic in document d, ε_{s,c} is the document topic in document s closest to the c-th topic of document d, F is the total number of document topics appearing in document d, G is the total number of document topics appearing in document s, d(e_{d,c}, ε_{s,c}) is the distance between e_{d,c} and ε_{s,c} in the ontology knowledge base, and d(e_{s,g}, ε_{d,g}) is the distance between e_{s,g} and ε_{d,g} in the ontology knowledge base.
In step ③, the genre similarity is 0, 0.5 or 1: if the genre and the sub-genre of the two documents are identical, the similarity is 1; if the two documents have the same genre but different sub-genres, the similarity is 0.5; if the two documents have different genres, the similarity is 0.
In step ③, the issue date similarity S_date(i, j) is calculated by the following formula:
[formula shown as an image in the original publication]
where date_i is the issue date of the i-th document and date_j is the issue date of the j-th document.
In step ①,
the ontology knowledge base of government organ units is built by sorting out the business and administrative relations of all government committees, offices and bureaus to construct a direct-relation graph of organ units;
the ontology knowledge base of document topics is built based on the official document classification of the State Council General Office website, subdividing the document topics downward to form an ontology knowledge base of document topics.
In step ⑤, the similarity between document i and document j is obtained by weighting the organ unit similarity, document topic similarity, genre similarity, issue date similarity and remaining-content similarity; the specific formula is:
SIM_doc(i, j) = Σ_{k=1..5} w_k · S_k(i, j);
where SIM_doc(i, j) is the similarity between document i and document j, S_k(i, j) is one of the 5 similarities and w_k is its weight;
the specific weight assignment is: the weight of the organ unit similarity is 0.25, the weight of the document topic similarity is 0.25, the weight of the genre similarity is 0.05, the weight of the issue date similarity is 0.05, and the weight of the remaining-content similarity is 0.4.
The invention has the beneficial effects that:
1. the document similarity obtained by the method can be used for retrieval, search and recommendation of official documents and can improve the convenience of officials' daily work;
2. the method calculates document similarity with an ontology knowledge base and achieves higher calculation accuracy than traditional classical algorithms such as doc2vec and LDA.
Drawings
FIG. 1 is a connection diagram of the government organ unit ontology knowledge base of the present invention;
FIG. 2 is a schematic diagram of the genre classes and sub-genres of the present invention;
FIG. 3 is a graph a of the results of comparing the present invention with the prior art;
fig. 4 is a graph b of the results of the present invention compared to the prior art.
Detailed Description
The technical solution of the present invention is further described below, but the scope of the claimed invention is not limited to the described.
A method for measuring document similarity, comprising the steps of:
① constructing an ontology knowledge base: constructing ontology knowledge bases of government organ units and official document topics;
further, the ontology knowledge base of government organ units: a direct-relation graph of organ units is constructed by sorting out the business and administrative relations of all government committees, offices and bureaus, as shown in FIG. 1;
the ontology knowledge base of document topics: based on the official document classification of the State Council General Office website, the document topics are subdivided downward to form an ontology knowledge base of document topics;
preferably, the genre and the issue date do not require an ontology knowledge base;
② official document text preprocessing: four types of information are extracted from the two documents whose similarity is to be compared: organ unit information, topic information (document topic information), genre information and issue date information; the four types of information are processed separately;
Specifically, step ② comprises the following steps:
(2.1) acquiring organ unit information: extracting the sending organ and the receiving organs from the documents by regular-expression matching;
(2.2) acquiring the document topic information: extracting the title and the first two paragraphs from the document, matching and discarding the organ unit information, and then performing basic text preprocessing;
(2.2.1) performing word segmentation and stop-word removal on the remaining text, where the removed words also include pure numbers and words that appear only once in the whole document corpus;
(2.2.2) matching keywords of the topic ontology knowledge base against the remaining words to determine the topic labels of the document;
(2.3) acquiring the genre information: according to the official document format of Party and government organs there are 15 kinds of government document genres in total, and each genre is subdivided one level into sub-genres according to its specific function; for example, the order genre can be subdivided into promulgation orders, administrative orders, appointment-and-removal orders and reward-and-punishment orders, as shown in FIG. 2; the document genre information is determined by regular-expression matching;
further, the 15 genres and their sub-genres are, respectively: resolution (promulgating resolution, approving resolution, declarative resolution), decision (directive decision, change decision, statutory decision, reward-and-punishment decision), order (promulgation order, administrative order, commendation order), bulletin (conference bulletin, event bulletin, joint bulletin), announcement (important-matter announcement, statutory-matter announcement, professional announcement), circular (informative (transactional) circular, regulatory (restrictive) circular), opinion (programmatic opinion, implementation opinion, guiding opinion), notice (issuing notice, approving-and-forwarding notice, forwarding notice, instructive notice, appointment-and-removal notice, transactional notice), circular notice (commendation circular notice, criticism circular notice, situation circular notice), report (routine report, special report, comprehensive report), request for instructions (request for policy instructions, request for assistance, request for approval), reply (reply approving a matter, reply approving regulations, policy reply, affirmative reply, negative reply, answering reply), proposal (legislative proposal, major-matter decision proposal, appointment-and-removal proposal, constructive proposal), letter (business negotiation letter, inquiry letter, request letter, reply letter), minutes (office meeting minutes, special meeting minutes);
(2.4) acquiring the issue date information: extracting the issue date by regular-expression matching of the time expressions in the document (an illustrative sketch of this extraction and of the preprocessing is given below).
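The extraction and preprocessing of step ② can be illustrated with a short Python sketch. This is a minimal sketch only: the regular expressions, the jieba segmenter and all function and variable names are assumptions made for illustration, not part of the patented method.

```python
import re
import jieba  # assumed Chinese word-segmentation library

# Hypothetical pattern; a real system would tune this to the official-document format.
DATE_PATTERN = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")

def extract_issue_date(text):
    """Step (2.4): return the first (year, month, day) found in the document, if any."""
    m = DATE_PATTERN.search(text)
    return tuple(int(g) for g in m.groups()) if m else None

def preprocess_topic_text(title, first_two_paragraphs, stopwords, corpus_word_counts):
    """Step (2.2.1): segment, then drop stop words, pure numbers and corpus-singleton words."""
    words = jieba.lcut(title + first_two_paragraphs)
    return [w for w in words
            if w.strip()
            and w not in stopwords
            and not w.isdigit()
            and corpus_word_counts.get(w, 0) > 1]

def topic_labels(words, topic_keyword_index):
    """Step (2.2.2): map the remaining words to topic labels via the topic ontology keywords."""
    return {topic_keyword_index[w] for w in words if w in topic_keyword_index}
```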
③ calculating the similarity of the four types of information: respectively calculating the organ unit similarity, topic similarity, genre similarity and issue date similarity of the two documents;
④ calculating the similarity of the remaining document content: calculating the similarity of the text other than the organ units, document topics, genre and issue date by doc2vec;
⑤ document similarity: weighting and summing the similarities obtained in steps ③ and ④ to obtain the similarity of the two documents;
Further, in step ⑤, the similarity between document i and document j is obtained by weighting the organ unit similarity, document topic similarity, genre similarity, issue date similarity and remaining-content similarity; the specific formula is:
SIM_doc(i, j) = Σ_{k=1..5} w_k · S_k(i, j);
where SIM_doc(i, j) is the similarity between document i and document j, S_k(i, j) is one of the 5 similarities and w_k is its weight;
the specific weight assignment is: the weight of the organ unit similarity is 0.25, the weight of the document topic similarity is 0.25, the weight of the genre similarity is 0.05, the weight of the issue date similarity is 0.05, and the weight of the remaining-content similarity is 0.4 (a sketch of this weighted combination follows).
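A minimal Python sketch of this weighted combination, assuming the five component similarities are computed by separate functions; the function names and the packaging of the weights are illustrative assumptions.

```python
def document_similarity(doc_i, doc_j,
                        s_dep, s_top, s_genre, s_date, s_text,
                        weights=(0.25, 0.25, 0.05, 0.05, 0.40)):
    """SIM_doc(i, j): weighted sum of the five component similarities of steps ③ and ④."""
    components = (
        s_dep(doc_i, doc_j),    # organ unit similarity (ontology based)
        s_top(doc_i, doc_j),    # document topic similarity (ontology based)
        s_genre(doc_i, doc_j),  # genre similarity (0, 0.5 or 1)
        s_date(doc_i, doc_j),   # issue date similarity
        s_text(doc_i, doc_j),   # remaining-content similarity (doc2vec)
    )
    return sum(w * s for w, s in zip(weights, components))
```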
In step ③, based on the ontology knowledge base, the organ unit similarity S_dep(e_x, e_y) is calculated as:
S_dep(e_x, e_y) = 1 - d(e_x, e_y);
where e_x and e_y are organ units and d(e_x, e_y) is the distance between e_x and e_y in the ontology knowledge base;
d(e_x, e_y) is calculated as:
d(e_x, e_y) = (d(lca(x, y), x) + d(lca(x, y), y)) / (d(root, x) + d(root, y));
where d(root, x) is the distance from node x to the root node of the ontology knowledge base, d(root, y) is the distance from node y to the root node, d(lca(x, y), x) is the distance from node x to the lowest common ancestor of x and y, and d(lca(x, y), y) is the distance from node y to the lowest common ancestor of x and y;
when a document contains several organ units, the similarity S_dep(i, j) of the organ unit information of two documents is:
S_dep(i, j) = 1 - (Σ_{m=1..M} d(e_{i,m}, ε_{j,m}) + Σ_{n=1..N} d(e_{j,n}, ε_{i,n})) / (M + N);
where e_{i,m} is the m-th organ unit in document i, ε_{j,m} is the organ unit in document j closest to the m-th organ unit of document i, M is the total number of organ unit entities appearing in document i, N is the total number of organ unit entities appearing in document j, d(e_{i,m}, ε_{j,m}) is the distance between e_{i,m} and ε_{j,m} in the ontology knowledge base, calculated as d(e_x, e_y) above, and d(e_{j,n}, ε_{i,n}) is the distance between e_{j,n} and ε_{i,n} in the ontology knowledge base (a sketch of this distance computation is given below).
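The LCA-based distance and the single-pair organ unit similarity can be sketched as follows, assuming the ontology knowledge base is stored as a simple parent-pointer tree; all identifiers are illustrative assumptions rather than the patented implementation.

```python
def path_to_root(node, parent_of):
    """Return the list [node, parent, ..., root] in the ontology tree."""
    path = [node]
    while node in parent_of:
        node = parent_of[node]
        path.append(node)
    return path

def ontology_distance(x, y, parent_of):
    """d(e_x, e_y) = (d(lca, x) + d(lca, y)) / (d(root, x) + d(root, y)), in [0, 1]."""
    px, py = path_to_root(x, parent_of), path_to_root(y, parent_of)
    ancestors_x = {node: dist for dist, node in enumerate(px)}
    # The first node on y's path to the root that is also an ancestor of x is the LCA.
    d_lca_x, d_lca_y = next((ancestors_x[node], dist)
                            for dist, node in enumerate(py) if node in ancestors_x)
    d_root_x, d_root_y = len(px) - 1, len(py) - 1
    if d_root_x + d_root_y == 0:   # both nodes are the root itself
        return 0.0
    return (d_lca_x + d_lca_y) / (d_root_x + d_root_y)

def organ_unit_similarity(x, y, parent_of):
    """S_dep(e_x, e_y) = 1 - d(e_x, e_y)."""
    return 1.0 - ontology_distance(x, y, parent_of)
```

The same distance function can be reused for document topics on the topic ontology.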
In step ③, the document topic similarity S_top(e_w, e_z) is calculated based on the ontology knowledge base as:
S_top(e_w, e_z) = 1 - d(e_w, e_z);
where e_w and e_z are document topics and d(e_w, e_z) is the distance between e_w and e_z in the ontology knowledge base;
d(e_w, e_z) is calculated as:
d(e_w, e_z) = (d(lca(w, z), w) + d(lca(w, z), z)) / (d(root, w) + d(root, z));
where d(root, w) is the distance from node w to the root node of the ontology knowledge base, d(root, z) is the distance from node z to the root node, d(lca(w, z), w) is the distance from node w to the lowest common ancestor of w and z, and d(lca(w, z), z) is the distance from node z to the lowest common ancestor of w and z;
when a document contains several document topics, the similarity S_top(d, s) of the topic information of two documents is:
S_top(d, s) = 1 - (Σ_{c=1..F} d(e_{d,c}, ε_{s,c}) + Σ_{g=1..G} d(e_{s,g}, ε_{d,g})) / (F + G);
where e_{d,c} is the c-th document topic in document d, ε_{s,c} is the document topic in document s closest to the c-th topic of document d, F is the total number of document topics appearing in document d, G is the total number of document topics appearing in document s, d(e_{d,c}, ε_{s,c}) is the distance between e_{d,c} and ε_{s,c} in the ontology knowledge base, calculated as d(e_w, e_z) above, and d(e_{s,g}, ε_{d,g}) is the distance between e_{s,g} and ε_{d,g} in the ontology knowledge base (the closest-match aggregation used here and for organ units is sketched below).
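The bidirectional closest-match aggregation used above for organ units and here for document topics can be sketched as follows; the identifiers are illustrative assumptions.

```python
def set_similarity(entities_i, entities_j, distance):
    """S(i, j) = 1 - (closest-match distances summed in both directions) / (M + N)."""
    if not entities_i or not entities_j:
        return 0.0
    total = sum(min(distance(e, f) for f in entities_j) for e in entities_i) \
          + sum(min(distance(f, e) for e in entities_i) for f in entities_j)
    return 1.0 - total / (len(entities_i) + len(entities_j))
```

For organ units the distance argument would be the ontology_distance of the previous sketch over the organ unit ontology; for topics, the same function over the topic ontology.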
In step ③, the genre similarity is 0, 0.5 or 1: if the genre and the sub-genre of the two documents are identical (for example, both are administrative orders within the order genre), the similarity is 1; if the two documents have the same genre but different sub-genres, the similarity is 0.5; and if the two documents have different genres, e.g. an order and a decision, the similarity is 0.
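This three-level genre rule can be written directly as a small function; the field names are illustrative assumptions.

```python
def genre_similarity(doc_i, doc_j):
    """Return 1, 0.5 or 0 according to the genre / sub-genre rule of step ③."""
    if doc_i["genre"] != doc_j["genre"]:
        return 0.0   # different genres, e.g. an order and a decision
    if doc_i["sub_genre"] != doc_j["sub_genre"]:
        return 0.5   # same genre, different sub-genres
    return 1.0       # genre and sub-genre both identical
```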
In step ③, the issue date similarity S_date(i, j) is calculated by the following formula:
[formula shown as an image in the original publication]
where date_i is the issue date of the i-th document and date_j is the issue date of the j-th document.
Examples
As mentioned above, the issue date of a document can be obtained directly during document crawling, the organ and genre information can easily be obtained through regular expressions, and the topic information of a document can be extracted with a supervised or unsupervised algorithm.
For comparison, the similarity of the documents was also calculated directly with the classical LDA and doc2vec algorithms. To obtain reasonable performance for LDA and doc2vec, 13,238 documents were used to train the embedding models in addition to the documents selected for the evaluation dataset.
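As an illustration of the doc2vec step, the remaining-content similarity could be computed with the gensim library roughly as follows; the hyper-parameters shown are placeholders, not the values used in the experiments.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_doc2vec(tokenized_docs, vector_size=100, epochs=20):
    """tokenized_docs: list of token lists (the remaining text after step ②)."""
    corpus = [TaggedDocument(words=tokens, tags=[str(i)])
              for i, tokens in enumerate(tokenized_docs)]
    return Doc2Vec(corpus, vector_size=vector_size, min_count=2, epochs=epochs)

def remaining_content_similarity(model, tokens_i, tokens_j):
    """Cosine similarity between the inferred doc2vec vectors of two documents."""
    vi, vj = model.infer_vector(tokens_i), model.infer_vector(tokens_j)
    return float(np.dot(vi, vj) / (np.linalg.norm(vi) * np.linalg.norm(vj)))
```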
In particular, the similarity SIM_doc(i, j) between document i and document j may be calculated as a weighted sum of the similarities between their elements, for example with the following formula:
SIM_doc(i, j) = Σ_{k=1..5} w_k · S_k(i, j)    (1)
where k indexes the k-th element of the document, w_k is the weight of the k-th element, and S_k(i, j) is one of the 5 similarities.
In formula (1), the weights of the elements, i.e. organ unit, topic information, genre, issue date and remaining text, are set to 0.25, 0.25, 0.05, 0.05 and 0.4, respectively. To evaluate the compared methods, we used the Pearson correlation coefficient to measure their agreement with the 2-level and 5-level ground truth; the results are shown in Table 1, and, as expected, the proposed method is superior to LDA and doc2vec.
Table 1: Comparison of similarity correlations between different methods (the larger the value, the better)
[table shown as an image in the original publication]
Since the 2-level ground truth uses 0 or 1 to represent similarity, we further set a threshold to predict whether two documents are similar: if the predicted similarity of a document pair is greater than the threshold, the pair is judged to be similar. The corresponding F1 values and accuracy results are shown in FIG. 3 and FIG. 4. The results show that the proposed method is superior to the traditional LDA and doc2vec methods.
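The evaluation described above can be reproduced in outline with scipy and scikit-learn; the variable names and the example threshold are placeholders.

```python
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score

def evaluate(predicted_sims, graded_truth, binary_truth, threshold=0.5):
    """predicted_sims, graded_truth, binary_truth: equal-length lists over document pairs."""
    corr, _ = pearsonr(predicted_sims, graded_truth)               # Table 1 style correlation
    labels = [1 if s > threshold else 0 for s in predicted_sims]   # threshold on similarity
    return {
        "pearson": corr,
        "f1": f1_score(binary_truth, labels),                      # FIG. 3 style result
        "accuracy": accuracy_score(binary_truth, labels),          # FIG. 4 style result
    }
```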

Claims (9)

1. A method for measuring the similarity of documents is characterized in that: the method comprises the following steps:
① constructing an ontology knowledge base: constructing ontology knowledge bases of government organ units and Party and government official document topics;
② Party and government official document text preprocessing: extracting four types of information from the two documents whose similarity is to be compared: organ unit information, topic information, genre information and issue date information;
③ calculating the similarity of the four types of information: respectively calculating the organ unit similarity, topic similarity, genre similarity and issue date similarity of the two documents;
④ calculating the similarity of the remaining document content: calculating the similarity of the text other than the organ units, document topics, genre and issue date by doc2vec;
⑤ document similarity: weighting and summing the similarities obtained in steps ③ and ④ to obtain the similarity of the two documents;
in step ③, based on the ontology knowledge base, the organ unit similarity S_dep(e_x, e_y) is calculated as:
S_dep(e_x, e_y) = 1 - d(e_x, e_y);
where e_x and e_y are organ units and d(e_x, e_y) is the distance between e_x and e_y in the ontology knowledge base;
d(e_x, e_y) is calculated as:
d(e_x, e_y) = (d(lca(x, y), x) + d(lca(x, y), y)) / (d(root, x) + d(root, y));
where d(root, x) is the distance from node x to the root node of the ontology knowledge base, d(root, y) is the distance from node y to the root node, d(lca(x, y), x) is the distance from node x to the lowest common ancestor of x and y, and d(lca(x, y), y) is the distance from node y to the lowest common ancestor of x and y;
when a document contains several organ units, the similarity S_dep(i, j) of the organ unit information of two documents is:
S_dep(i, j) = 1 - (Σ_{m=1..M} d(e_{i,m}, ε_{j,m}) + Σ_{n=1..N} d(e_{j,n}, ε_{i,n})) / (M + N);
where e_{i,m} is the m-th organ unit in document i, ε_{j,m} is the organ unit in document j closest to the m-th organ unit of document i, M is the total number of organ unit entities appearing in document i, N is the total number of organ unit entities appearing in document j, d(e_{i,m}, ε_{j,m}) is the distance between e_{i,m} and ε_{j,m} in the ontology knowledge base, and d(e_{j,n}, ε_{i,n}) is the distance between e_{j,n} and ε_{i,n} in the ontology knowledge base.
2. The method of claim 1, wherein: step ② comprises the following steps:
(2.1) acquiring organ unit information: extracting the sending organ and the receiving organs from the documents by regular-expression matching;
(2.2) acquiring the document topic information: extracting the title and the first two paragraphs from the government document, matching and discarding the organ unit information, and then performing basic text preprocessing;
(2.3) acquiring the genre information: dividing each genre into sub-genres according to its specific function, and determining the document genre information by regular-expression matching;
(2.4) acquiring the issue date information: extracting the issue date by regular-expression matching of the time expressions in the document.
3. The method of claim 2, wherein: in step (2.2), the text preprocessing is divided into the following steps:
(2.2.1) performing word segmentation and stop-word removal on the remaining text, where the removed words also include pure numbers and words that appear only once in the whole document corpus;
(2.2.2) matching keywords of the topic ontology knowledge base against the remaining words to determine the topic labels of the government document.
4. The method of claim 2, wherein: in step (2.3), the genres are divided into 15 types according to the official document format of Party and government organs, and each genre is further subdivided one level into sub-genres according to its specific function.
5. The method of claim 1, wherein: in step ③, the document topic similarity S_top(e_w, e_z) is calculated based on the ontology knowledge base as:
S_top(e_w, e_z) = 1 - d(e_w, e_z);
where e_w and e_z are document topics and d(e_w, e_z) is the distance between e_w and e_z in the ontology knowledge base;
d(e_w, e_z) is calculated as:
d(e_w, e_z) = (d(lca(w, z), w) + d(lca(w, z), z)) / (d(root, w) + d(root, z));
where d(root, w) is the distance from node w to the root node of the ontology knowledge base, d(root, z) is the distance from node z to the root node, d(lca(w, z), w) is the distance from node w to the lowest common ancestor of w and z, and d(lca(w, z), z) is the distance from node z to the lowest common ancestor of w and z;
when a document contains several document topics, the similarity S_top(d, s) of the topic information of two documents is:
S_top(d, s) = 1 - (Σ_{c=1..F} d(e_{d,c}, ε_{s,c}) + Σ_{g=1..G} d(e_{s,g}, ε_{d,g})) / (F + G);
where e_{d,c} is the c-th document topic in document d, ε_{s,c} is the document topic in document s closest to the c-th topic of document d, F is the total number of document topics appearing in document d, G is the total number of document topics appearing in document s, d(e_{d,c}, ε_{s,c}) is the distance between e_{d,c} and ε_{s,c} in the ontology knowledge base, and d(e_{s,g}, ε_{d,g}) is the distance between e_{s,g} and ε_{d,g} in the ontology knowledge base.
6. The method of claim 1, wherein: in step ③, the genre similarity is 0, 0.5 or 1: if the genre and the sub-genre of the two documents are identical, the similarity is 1; if the two documents have the same genre but different sub-genres, the similarity is 0.5; and if the two documents have different genres, the similarity is 0.
7. The method of claim 1, wherein: in step ③, the issue date similarity S_date(i, j) is calculated by the following formula:
[formula shown as an image in the original publication]
where date_i is the issue date of the i-th document and date_j is the issue date of the j-th document.
8. The method of claim 1, wherein: in step ①,
the ontology knowledge base of government organ units is built by sorting out the business and administrative relations of all government committees, offices and bureaus to construct a direct-relation graph of organ units;
the ontology knowledge base of Party and government document topics is built based on the official document classification of the State Council General Office website, subdividing the document topics downward to form an ontology knowledge base of document topics.
9. The method of claim 1, wherein: in step ⑤, the similarity between document i and document j is obtained by weighting the organ unit similarity, document topic similarity, genre similarity, issue date similarity and remaining-content similarity; the specific formula is:
SIM_doc(i, j) = Σ_{k=1..5} w_k · S_k(i, j);
where SIM_doc(i, j) is the similarity between document i and document j, S_k(i, j) is one of the 5 similarities and w_k is its weight;
the specific weight assignment is: the weight of the organ unit similarity is 0.25, the weight of the document topic similarity is 0.25, the weight of the genre similarity is 0.05, the weight of the issue date similarity is 0.05, and the weight of the remaining-content similarity is 0.4.
CN201811361247.0A 2018-11-15 2018-11-15 Method for measuring similarity of documents Active CN109582759B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811361247.0A CN109582759B (en) 2018-11-15 2018-11-15 Method for measuring similarity of documents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811361247.0A CN109582759B (en) 2018-11-15 2018-11-15 Method for measuring similarity of documents

Publications (2)

Publication Number Publication Date
CN109582759A CN109582759A (en) 2019-04-05
CN109582759B true CN109582759B (en) 2021-10-22

Family

ID=65922849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811361247.0A Active CN109582759B (en) 2018-11-15 2018-11-15 Method for measuring similarity of documents

Country Status (1)

Country Link
CN (1) CN109582759B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722556A (en) * 2012-05-29 2012-10-10 清华大学 Model comparison method based on similarity measurement
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 A kind of patent document method for measuring similarity of knowledge based body
CN107958061A (en) * 2017-12-01 2018-04-24 厦门快商通信息技术有限公司 The computational methods and computer-readable recording medium of a kind of text similarity
CN108536677A (en) * 2018-04-09 2018-09-14 北京信息科技大学 A kind of patent text similarity calculating method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10642912B2 (en) * 2016-08-17 2020-05-05 Adobe Inc. Control of document similarity determinations by respective nodes of a plurality of computing devices

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722556A (en) * 2012-05-29 2012-10-10 清华大学 Model comparison method based on similarity measurement
CN107247780A (en) * 2017-06-12 2017-10-13 北京理工大学 A kind of patent document method for measuring similarity of knowledge based body
CN107958061A (en) * 2017-12-01 2018-04-24 厦门快商通信息技术有限公司 The computational methods and computer-readable recording medium of a kind of text similarity
CN108536677A (en) * 2018-04-09 2018-09-14 北京信息科技大学 A kind of patent text similarity calculating method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于向量空间模型和专利文献特征的相似专利确定方法 (Method for determining similar patents based on a vector space model and patent document features); 陈芨熙 et al.; 《浙江大学学报》 (Journal of Zhejiang University); 2009-10-31; pp. 1848-1852 *

Also Published As

Publication number Publication date
CN109582759A (en) 2019-04-05

Similar Documents

Publication Publication Date Title
Hammad et al. An approach for detecting spam in Arabic opinion reviews
CN106997382A (en) Innovation intention label automatic marking method and system based on big data
Kumar et al. Exploration of sentiment analysis and legitimate artistry for opinion mining
Avasthi et al. Techniques, applications, and issues in mining large-scale text databases
Feng et al. Extracting common emotions from blogs based on fine-grained sentiment clustering
Huang et al. Expert as a service: Software expert recommendation via knowledge domain embeddings in stack overflow
CN114254201A (en) Recommendation method for science and technology project review experts
Chaudhuri et al. Hidden features identification for designing an efficient research article recommendation system
Dobrovolskyi et al. Collecting the Seminal Scientific Abstracts with Topic Modelling, Snowball Sampling and Citation Analysis.
Mgarbi et al. Towards a new job offers recommendation system based on the candidate resume
Brewster et al. Ontologies, taxonomies, thesauri: Learning from texts
Stylios et al. Using Bio-inspired intelligence for Web opinion Mining
Kawamura et al. Funding map using paragraph embedding based on semantic diversity
Yang et al. EFS: Expert finding system based on Wikipedia link pattern analysis
CN109582759B (en) Method for measuring similarity of documents
Tong et al. A document exploring system on LDA topic model for Wikipedia articles
Mahalakshmi et al. On the expressive power of scientific manuscripts
Scholtes et al. Big data analytics for e-discovery
Li et al. A hybrid approach for measuring similarity between government documents of China
JP2011150603A (en) Category theme phrase extracting device, hierarchical tag attaching device, method, and program, and computer-readable recording medium
Mateen et al. An Analysis on Text Mining Techniques for Smart Literature Review
Wang et al. The big data analysis and visualization of mass messages under “smart government affairs” based on text mining
Bai et al. WHOSe Heritage: Classification of UNESCO World Heritage "Outstanding Universal Value" Documents with Soft Labels
Ojokoh et al. A graph model with integrated pattern and query-based technique for extracting answer to questions in community question answering system
Todkar et al. Recommendation engine feedback session strategy for mapping user search goals (FFS: Recommendation system)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant