CN113553825A

CN113553825A - Method and system for analyzing context relationship of electronic official document

Info

Publication number: CN113553825A
Application number: CN202110837789.6A
Authority: CN
Inventors: 许建兵; 朱彦欣; 冯伟; 刘伟康; 李强
Original assignee: Anhui Suncn Pap Information Technology Co ltd
Current assignee: Anhui Suncn Pap Information Technology Co ltd
Priority date: 2021-07-23
Filing date: 2021-07-23
Publication date: 2021-10-26
Anticipated expiration: 2041-07-23
Also published as: CN113553825B

Abstract

The invention relates to a method and system for analyzing the context relationship of electronic official documents. The method comprises the following steps. Data storage: predicting the subject words of known texts through a topic model, and storing the subject words to form a database; feature extraction: extracting the feature topics of a target official document; text retrieval: searching the database for the feature topics of the target official document, and selecting a plurality of similar texts to form a candidate set; text vectorization: performing text vectorization on the target official document and the similar texts to obtain their document feature vectors; text calculation: calculating the cosine distance between the target official document and each similar text; text screening: comparing the cosine distances between the target official document and the similar texts, and filtering out the similar texts whose cosine distance is smaller than a threshold; relationship tree generation: generating a relationship tree with the target official document as the root node and the remaining similar texts as parent nodes.

Description

Method and system for analyzing context relationship of electronic official document

Technical Field

The invention belongs to the technical field of text analysis, and particularly relates to an electronic official document context relationship analysis method and system.

Background

A government office system covers functions such as document handling, affairs handling and meetings. Users need to understand how documents, meetings, events and news influence and connect with one another, so an effective method is required to associate these items and form a usable document context.

In the related art, the context relationship of an electronic official document is discovered mainly by rule-based methods, implemented as follows:

1. identifying key entities in the electronic official document to be analyzed, such as the names of policies and regulations and the names of documents, by using rules or algorithms;

2. searching a database with the identified entities to find official documents, news, meetings and the like that contain them; then calculating the similarity between each retrieved item and the electronic official document to be analyzed, filtering out the items with lower similarity, and sorting the remaining news, meetings and the like by time for display.

The related art has the following problems. For government affair information disclosed on the internet, cross-system government affair information and the like, the information is disordered and unorganized, so accurate association is difficult to achieve by simple means; there is also no efficient method for analyzing the social impact of official documents. A method that makes the context of an electronic official document convenient to analyze is therefore needed, so that the official document information can guide follow-up work.

Disclosure of Invention

In order to solve the problems, the invention discloses an electronic official document context relationship analysis method which is used for analyzing and processing an electronic official document and providing data support for subsequent work.

In a first aspect, the invention discloses a method for analyzing the context relationship of electronic official documents, which comprises the following steps:

data storage: predicting the subject words of known texts through a topic model, and storing the subject words to form a database;

feature extraction: extracting the feature topics of the target official document;

text retrieval: searching the database for the feature topics of the target official document, and selecting a plurality of similar texts;

text vectorization: performing text vectorization on the target official document and the similar texts to obtain their document feature vectors;

text calculation: calculating the cosine distance between the target official document and each similar text from their document feature vectors;

text screening: comparing the cosine distances between the target official document and the similar texts, and selecting the similar texts whose cosine distance is greater than or equal to a threshold;

relationship tree generation: generating a relationship tree with the target official document as the root node and the remaining similar texts as parent nodes.

Further, performing text vectorization on the target official document and the similar texts to obtain their document feature vectors specifically includes:

predicting the word vectors of the titles of the target official document and the similar texts and taking a weighted average to obtain the title feature vectors of the target official document and the similar texts;

predicting the word vectors of the body text of the target official document and the similar texts and taking a weighted average to obtain the body feature vectors of the target official document and the similar texts;

and performing a weighted average of the title feature vector and the body feature vector to obtain the document feature vectors of the target official document and the similar texts.

Furthermore, in the weighted average of the title feature vector and the body feature vector, the weight of the title feature vector is greater than the weight of the body feature vector.

Furthermore, the analysis method also comprises secondary screening of the texts:

before the cosine distances between the target official document and the similar texts are compared and the similar texts whose cosine distance is smaller than the threshold are filtered out, time screening is performed to filter out the similar texts whose release time falls outside a specified time interval.

Furthermore, the analysis method further comprises adding child nodes to the relationship tree:

after the relationship tree is generated, the similar text corresponding to each parent node is taken as input, text retrieval, text vectorization, text calculation and text screening are repeated, and the similar texts obtained for that parent node are added to the relationship tree as child nodes.

Furthermore, the depth of the relation tree is a preset value.

In another aspect, the invention also discloses an electronic official document context relationship analysis system, which adopts the following technical scheme.

The analysis system comprises:

a data storage module, used for predicting the subject words of known texts through a topic model and storing the subject words to form a database;

a feature extraction module, used for extracting the feature topics of the target official document;

a text retrieval module, used for searching the database for the feature topics of the target official document and selecting a plurality of similar texts;

a text vectorization module, used for performing text vectorization on the target official document and the similar texts to obtain their document feature vectors;

a text calculation module, used for calculating the cosine distance between the target official document and each similar text from their document feature vectors;

a text screening module, used for comparing the cosine distances between the target official document and the similar texts and selecting the similar texts whose cosine distance is greater than or equal to a threshold;

a relationship tree generation module, used for generating a relationship tree with the target official document as the root node and the remaining similar texts as parent nodes.

Further, the text vectorization module includes a title vectorization unit, a body vectorization unit and a feature vector calculation unit:

the title vectorization unit is used for predicting the word vectors of the titles of the target official document and the similar texts and taking a weighted average to obtain the title feature vectors of the target official document and the similar texts;

the body vectorization unit is used for predicting the word vectors of the body text of the target official document and the similar texts and taking a weighted average to obtain the body feature vectors of the target official document and the similar texts;

the feature vector calculation unit is used for performing a weighted average of the title feature vector and the body feature vector to obtain the document feature vectors of the target official document and the similar texts.

Furthermore, the analysis system further comprises a text secondary screening module. Before the cosine distances between the target official document and the similar texts are compared and the similar texts whose cosine distance is smaller than the threshold are filtered out, this module performs time screening and filters out the similar texts whose release time is not within a specified time interval.

Furthermore, the analysis system further comprises a child node adding module. After the relationship tree is generated, this module repeats text retrieval, text vectorization, text calculation and text screening, and adds the similar texts obtained for each parent node to the relationship tree as child nodes.

The present invention has at least the following advantages:

A database of known texts is established; features are extracted from the target text input by the user and searched to obtain similar texts in the database; the similar texts are screened step by step to obtain the final similar texts; and the official document context relationship between the target text and the similar texts is analyzed and displayed as a relationship tree, which makes it convenient to analyze the sentiment trend of the texts and to guide follow-up work.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a schematic flow diagram of the analysis method of the present application;

FIG. 2 is a schematic diagram of the relationship tree structure in the embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, the embodiment of the present application discloses an electronic official document context relationship analysis method: after a user inputs an official document, subject word extraction, similar text retrieval, text vectorization, text calculation and screening are performed, and a relationship tree is generated and stored.

The embodiment of the present application further discloses an electronic official document context relationship analysis system, which includes a data storage module, a feature extraction module, a text retrieval module, a text vectorization module, a text calculation module, a text screening module, a relationship tree generation module and a child node adding module.

The following describes the steps of the method with reference to the above analysis system, and the analysis method includes the following steps:

S1, data storage: the data storage module predicts the subject words of the known texts through the topic model and stores the subject words to form a database.

The topic model is a BTM (Biterm Topic Model). Known texts are obtained from open sources, and a trained BTM predicts the subject words of each known text; all subject words are collected and stored to form the database, which facilitates subsequent retrieval and use. For example, meetings, news and the like are crawled from several currently popular news portal sites as known texts.

After the known texts are crawled, they can be cleaned, including word segmentation and stop word removal, so that the contents of the database are more accurate. When the subject words of the known texts are predicted, n subject words are extracted from each text and then stored in the database. Extracting the subject words with the BTM model works well for known texts such as official documents, meetings and news, which may be short texts.
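The cleaning step and the word-pair idea behind BTM can be illustrated with a minimal sketch. A biterm is an unordered pair of words that co-occur in the same short text, and BTM models topics over these pairs. The code below only cleans a text and enumerates its biterms; it does not train or run a BTM. The stop-word list, the whitespace tokenization (Chinese text would need a word segmenter) and the function names are assumptions made for illustration, and repeated words are deduplicated for simplicity.

```python
from itertools import combinations

# Hypothetical stop-word list; a real system would load a full list.
STOP_WORDS = {"the", "a", "of", "on", "and", "for"}

def clean_tokens(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def extract_biterms(tokens):
    """Return every unordered pair of distinct words that co-occur in one short text."""
    unique = sorted(set(tokens))
    return list(combinations(unique, 2))

tokens = clean_tokens("Notice on the reform of rural water supply")
biterms = extract_biterms(tokens)
```

Because every pair of words in a short text contributes a biterm, even a title of a few words yields enough co-occurrence evidence to work with, which is consistent with the remark above that BTM suits short texts.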

S2, feature extraction: the target official document is input, and the feature extraction module extracts its feature topics.

The feature topics correspond to the subject words. They are also predicted with the BTM model, which predicts, in addition to each feature topic, the probability with which that feature topic occurs.

S3, text retrieval: the text retrieval module searches the database for the feature topics of the target official document and selects a plurality of similar texts to form a candidate set.

During retrieval, the feature topics of the target official document serve as the search terms and the occurrence probability of each feature topic serves as its weight. The retrieved similar texts need to be ranked, and Elasticsearch itself ranks retrieved data by similarity. The method therefore uses Elasticsearch for both retrieval and ranking: results are ranked by the term frequency of the similar texts, from high to low, and the top n similar texts in the search results form the candidate set for the subsequent analysis.

Searching with Elasticsearch pre-selects the texts that are likely to be similar, which reduces the running complexity of the subsequent algorithm and improves the efficiency of the whole algorithm. Enlarging the candidate set increases the recall rate but also increases the complexity of the algorithm, so a reasonable value should be chosen. In the embodiment of the present application, n is set to 10, so the top 10 similar texts are selected.
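One way the weighted retrieval could be expressed is sketched below as an Elasticsearch-style query body, built as a plain Python dictionary. The source does not give the actual query, so the structure (a `bool` query with one boosted `match` clause per feature topic), the field name `topic_words` and the function name are assumptions; no Elasticsearch client call is made here.

```python
def build_topic_query(topics, field="topic_words", size=10):
    """Build an Elasticsearch-style bool/should query body.

    topics: list of (feature_topic, probability) pairs for the target document.
    field:  hypothetical index field holding the stored subject words.
    size:   number of top-ranked candidates to return (n = 10 in the embodiment).
    """
    should = [
        {"match": {field: {"query": word, "boost": prob}}}
        for word, prob in topics
    ]
    return {"size": size, "query": {"bool": {"should": should}}}

query = build_topic_query([("water supply", 0.5), ("rural reform", 0.25)])
```

Using the topic probability as the boost lets more probable feature topics contribute more to the ranking score, and `size` caps the candidate set at the top n hits.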

S4, text vectorization: the text vectorization module performs text vectorization on the target official document and the similar texts to obtain their document feature vectors.

The text vectorization module comprises a title vectorization unit, a body vectorization unit and a feature vector calculation unit.

In this embodiment, the word vectors of each piece of data are predicted by an ELMo model (deep contextualized word representations). During vectorization, the title vectorization unit predicts the title word vectors of each piece of data (the target text and the similar texts in the candidate set) and obtains the title feature vectors of the target text and the similar texts by weighted averaging. The body vectorization unit predicts the body word vectors of each piece of data (the target text and the similar texts in the candidate set) and obtains the body feature vectors of the target official document and the similar texts by weighted averaging. The feature vector calculation unit then performs a weighted average of the title feature vector and the body feature vector to obtain the document feature vectors of the target official document and the similar texts.

The title feature vector is denoted title_embedding, the body feature vector content_embedding, and the document feature vector news_embedding. The weighted average is calculated as

news_embedding = 0.6 * title_embedding + 0.4 * content_embedding

The title feature vector is thus given the higher weight, reflecting that readers generally pay more attention to the title than to the body when viewing and citing an official document.
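The two averaging stages can be sketched in plain Python as follows. The toy 2-dimensional vectors stand in for ELMo word vectors, and the per-word weights are assumptions, since the source does not say how the words are weighted; only the 0.6 and 0.4 weights come from the embodiment.

```python
def weighted_average(vectors, weights):
    """Weighted average of equal-length vectors; weights need not sum to 1."""
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]

def document_embedding(title_embedding, content_embedding,
                       title_weight=0.6, content_weight=0.4):
    """news_embedding = 0.6 * title_embedding + 0.4 * content_embedding."""
    return [
        title_weight * t + content_weight * c
        for t, c in zip(title_embedding, content_embedding)
    ]

# Toy word vectors and per-word weights (assumptions, not ELMo outputs).
title_embedding = weighted_average([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
content_embedding = weighted_average([[0.0, 2.0], [0.0, 4.0]], [1.0, 3.0])
news_embedding = document_embedding(title_embedding, content_embedding)
```

Here the title averages to [0.5, 0.5] and the body to [0.0, 3.5], so the document vector is approximately [0.3, 1.7].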

S5, text calculation: the text calculation module calculates the cosine distance between the target official document and each similar text from their document feature vectors.

The cosine distances between the target official document and each of the 10 similar texts in the candidate set are calculated, and the cosine distance of each similar text is stored for the subsequent screening process.

Commonly used text similarity calculation methods include Jaccard similarity, edit distance, cosine similarity and so on.

Jaccard similarity compares two sets by the ratio of their shared elements to all of their elements (its complement, the proportion of differing elements, measures how distinct the two sets are). Its drawback is that it is only suitable for sets of binary data.

Edit distance is the minimum number of editing operations required to convert one character string into another. Its drawbacks are high algorithmic complexity and low efficiency.

Both Jaccard similarity and edit distance analyze the words of a document as discrete symbols and cannot take the semantic level into account.
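The limitation described above can be seen in a small sketch of the two measures. The function names and sample inputs are illustrative assumptions; the edit distance shown is the standard Levenshtein dynamic program.

```python
def jaccard_similarity(a, b):
    """Size of the intersection of two sets divided by the size of their union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, O(len(s) * len(t)) time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

# Synonyms share no tokens, so the set-based score is zero despite similar meaning.
print(jaccard_similarity(["buy", "car"], ["purchase", "automobile"]))  # prints 0.0
print(edit_distance("kitten", "sitting"))  # prints 3
```

The quadratic table in `edit_distance` is the source of the cost noted above, and neither function has any notion that "buy" and "purchase" mean the same thing.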

Cosine similarity measures the difference between two individuals by the cosine of the angle between their two vectors in a vector space, and is commonly used to compare two text vectors. Because it operates on the document feature vectors rather than on discrete words, it can reflect similarity at the semantic level.
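A minimal sketch of the calculation over a candidate set follows. The source calls the score a "cosine distance" but keeps the texts with the larger values, so this sketch assumes the score behaves like cosine similarity (larger means more similar); the vectors and document names are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical document feature vectors for the target and two candidates.
target = [1.0, 0.0]
candidates = {"doc_a": [2.0, 0.0], "doc_b": [0.0, 3.0]}
scores = {name: cosine_similarity(target, vec) for name, vec in candidates.items()}
```

A vector pointing the same way as the target scores 1 regardless of its length, and an orthogonal vector scores 0, which is why the measure depends on direction rather than on document length.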

S6, text screening: the text screening module compares the cosine distance between the target official document and each similar text, filters out the similar texts whose cosine distance is smaller than a threshold, and keeps the similar texts whose cosine distance is greater than or equal to the threshold.

In addition, the text secondary screening module can further screen the similar texts. Before the cosine distances are compared and the similar texts below the threshold are filtered out, time screening is performed: similar texts whose release time falls outside a specified time interval are filtered out, and only those released within half a year before or after the release time of the target official document are kept, so that the similar texts are more timely. After screening, the remaining similar texts are those that meet the conditions (namely, the cosine distance is greater than or equal to the threshold).
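The two screening passes can be sketched as a single filter. The threshold value 0.8, the 183-day approximation of half a year and the dictionary keys are assumptions; the source fixes neither the threshold nor how the half-year window is counted.

```python
from datetime import date, timedelta

def screen_candidates(target_date, candidates, threshold=0.8, window_days=183):
    """Keep candidates published within the window and scoring >= threshold.

    candidates: list of dicts with hypothetical keys 'id', 'published', 'score'.
    """
    window = timedelta(days=window_days)
    kept = []
    for cand in candidates:
        # Time screening runs first, before the similarity comparison.
        if abs(cand["published"] - target_date) > window:
            continue
        if cand["score"] >= threshold:
            kept.append(cand)
    return kept

candidates = [
    {"id": "news-1", "published": date(2021, 5, 1), "score": 0.91},
    {"id": "news-2", "published": date(2019, 1, 1), "score": 0.95},
    {"id": "meeting-1", "published": date(2021, 8, 1), "score": 0.40},
]
kept = screen_candidates(date(2021, 7, 23), candidates)
```

In this example only `news-1` survives: `news-2` is highly similar but far outside the time window, and `meeting-1` is recent but below the threshold.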

S7, relationship tree generation: the relationship tree generation module takes the target official document as the root node and the similar texts remaining after screening as parent nodes, generating a relationship tree of depth 2.

FIG. 2 is a schematic diagram of a relationship tree and illustrates how the generated tree is displayed. After the screening, the target official document input by the user is taken as the root node and the remaining similar texts stored in the previous step are taken as parent nodes, forming a relationship tree that is convenient to store and view; the parent nodes in the tree are sorted by the publishing time of the similar texts.
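A depth-2 tree of this kind could be represented as in the sketch below. The `Node` class, its field names and the ascending date order are assumptions; the source states only that the parent nodes are sorted by publishing time.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Node:
    doc_id: str
    published: date
    children: list = field(default_factory=list)

def build_relation_tree(target, similar_texts):
    """Root = target document; second layer = screened texts sorted by publish date."""
    root = Node(*target)
    for doc_id, published in sorted(similar_texts, key=lambda item: item[1]):
        root.children.append(Node(doc_id, published))
    return root

def depth(node):
    """Number of layers in the tree rooted at node."""
    return 1 + max((depth(child) for child in node.children), default=0)

tree = build_relation_tree(
    ("target", date(2021, 7, 23)),
    [("news-2", date(2021, 6, 1)), ("news-1", date(2021, 3, 15))],
)
```

Each parent node keeps its own `children` list, so later growth of the tree only appends to existing nodes and `depth` can be used to stop at the preset value.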

After the above steps, the relationship tree needs to be grown. Once the tree has been generated, the child node adding module repeats steps S3 to S6 with each single similar text (that is, the meeting or news item corresponding to a parent node) as the target text, and the similar texts obtained are taken as candidate child nodes. During this screening, meetings and news that already appear among the parent nodes are filtered out. After the child nodes of all parent nodes have been obtained, only the child nodes that belong to half or more of the parent nodes at the same time are kept and all others are filtered out; each kept child node is then added to the relationship tree only under the parent node with which it has the highest similarity. The same steps are applied to the child nodes in turn, until no new child nodes are generated or the relationship tree reaches the specified depth. In this embodiment, the specified depth of the relationship tree is 3.
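The child selection rule can be sketched as follows, under one reading of the text: a candidate is kept if at least half of the parent nodes retrieved it, and it is then attached to the single parent with which it is most similar. The input layout, identifiers and function name are assumptions made for illustration.

```python
def select_children(parent_ids, retrieved):
    """Assign each surviving child to a single parent.

    parent_ids: ids of the second-layer (parent) nodes.
    retrieved:  {parent_id: {candidate_id: similarity}} after steps S3 to S6
                were rerun with each parent as the target text.
    Returns {child_id: parent_id}.
    """
    parents = set(parent_ids)
    # Collect, per candidate, the similarity to every parent that retrieved it.
    votes = {}
    for parent_id, found in retrieved.items():
        for cand_id, score in found.items():
            if cand_id in parents:
                continue  # already appears in the parent layer
            votes.setdefault(cand_id, {})[parent_id] = score
    assignment = {}
    for cand_id, by_parent in votes.items():
        # Keep only candidates shared by at least half of the parents.
        if 2 * len(by_parent) >= len(parents):
            assignment[cand_id] = max(by_parent, key=by_parent.get)
    return assignment

retrieved = {
    "p1": {"c1": 0.90, "c2": 0.85, "p2": 0.95},
    "p2": {"c1": 0.92},
    "p3": {"c1": 0.88, "c2": 0.93},
    "p4": {"c3": 0.99},
}
assignment = select_children(["p1", "p2", "p3", "p4"], retrieved)
```

With four parents, `c1` (retrieved by three) and `c2` (retrieved by two) are kept and attached to `p2` and `p3` respectively, while `c3` is dropped because only one parent retrieved it and `p2` is ignored as a candidate because it is already a parent node.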

In FIG. 2, for convenience of illustration, only 6 similar texts are selected for each search, that is, n is 6, and the three layers of texts from top to bottom represent the root node, the parent nodes and the child nodes of the relationship tree, respectively. The texts framed by wide dashed lines in the parent and child layers are texts that were filtered out during screening; they are not added to the relationship tree and are shown in FIG. 2 only to make the tree easier to present. The texts in dotted boxes are the retrieved texts before screening, from which the child nodes are selected.

Through the above steps, the target official document and the similar texts are connected and displayed as a relationship tree, which facilitates viewing and subsequent analysis; sorting along the time dimension also facilitates analysis of the sentiment trend.

Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for analyzing context relationship of electronic official document is characterized by comprising the following steps,

text screening: comparing the cosine distances between the target official document and the similar texts, and selecting the similar texts whose cosine distance is greater than or equal to a threshold;

relationship tree generation: generating a relationship tree with the target official document as the root node and the similar texts whose cosine distance is greater than or equal to the threshold as parent nodes.

2. The method according to claim 1, wherein the text vectorization of the target document and the similar text to obtain the document feature vectors of the target document and the similar text specifically comprises,

3. The method according to claim 2, wherein, in the weighted average of the title feature vector and the body feature vector, the weight of the title feature vector is greater than the weight of the body feature vector.

4. The method according to claim 1, further comprising a second text filtering,

5. The method of claim 1, further comprising adding child nodes of a relationship tree,

6. The method according to claim 1 or 5, wherein the depth of the relationship tree is a preset value.

7. An electronic official document context relationship analysis system, characterized in that the analysis system comprises,

a text screening module, used for comparing the cosine distances between the target official document and the similar texts and selecting the similar texts whose cosine distance is greater than or equal to a threshold;

a relationship tree generation module, used for generating a relationship tree with the target official document as the root node and the similar texts whose cosine distance is greater than or equal to the threshold as parent nodes.

8. The electronic official document context and relationship analysis system of claim 7, wherein the text vectorization module includes a title vectorization unit, a body vectorization unit and a feature vector calculation unit,

9. The system according to claim 7, further comprising a secondary text filtering module,

the text secondary screening module is used for performing time screening before the cosine distances between the target official document and the similar texts are compared and the similar texts whose cosine distance is smaller than the threshold are filtered out, so as to filter out the similar texts whose release time is not within a specified time interval.

10. The electronic official document context and relationship analysis system of claim 7, further comprising a child node adding module,

the child node adding module is used for repeating text retrieval, text vectorization, text calculation and text screening after the relationship tree is generated, and adding the similar texts obtained for each parent node to the relationship tree as child nodes.