CN110991163A - Document comparison analysis method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN110991163A
Authority
CN
China
Prior art keywords
document
analyzed
element set
matching degree
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911199360.8A
Other languages
Chinese (zh)
Other versions
CN110991163B (en)
Inventor
王文广
贺梦洁
陈运文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Datagrand Tech Inc
Original Assignee
Datagrand Tech Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Datagrand Tech Inc
Priority to CN201911199360.8A
Publication of CN110991163A
Application granted
Publication of CN110991163B
Legal status: Active
Anticipated expiration

Abstract

The embodiment of the invention discloses a document comparison analysis method and device, electronic equipment and a storage medium. The method comprises the steps of analyzing a first document to be analyzed and a second document to be analyzed respectively to obtain a first element set and a second element set; calculating the matching degree of each element in the first element set and each element in the second element set; respectively determining the elements with the highest matching degree with each element in the first element set in the second element set; if the matching degree of the target element in the first element set and the element with the highest matching degree in the second element set is smaller than a set threshold value, identifying the target element; if the matching degree of the elements in the first element set and the elements with the highest matching degree in the second element set is larger than a set threshold value, forming an element pair; and analyzing the characters in the element pairs through an element comparison module, and respectively identifying the differences in the element pairs in the first document to be analyzed and the second document to be analyzed. The difference between the documents can be quickly and accurately found.

Description

Document comparison analysis method and device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to a natural language processing technology, in particular to a document comparison analysis method, a document comparison analysis device, electronic equipment and a storage medium.
Background
Information today is updated rapidly, and for a management decision maker even a slight change in information can become an important factor in a decision, so those who grasp more comprehensive, timely, and effective information are more likely to come out ahead. However, information changes frequently and the number of documents to be read in a limited time keeps growing, so quickly extracting the valuable information from documents is a challenge for management decision makers. In particular, many types of documents, such as government work reports, laws and regulations, standards, contracts, official documents, and research reports, are released periodically or exist in several revised versions. For these documents, the differences between versions are particularly important, in addition to the content itself.
In the prior art, different revised versions of a document are generally checked manually: a person reads each version, looks for the differences, and then makes the corresponding decisions. This wastes human resources, time, and energy, and differences between the versions are very likely to be missed.
Disclosure of Invention
The invention provides a document comparison analysis method, a document comparison analysis device, electronic equipment, and a storage medium, which can quickly and accurately find the differences among different revised versions of documents, make it convenient for a decision maker to review them and make decisions quickly, and save human resources, time, and energy.
In a first aspect, an embodiment of the present invention provides a document comparison analysis method, where the method includes:
respectively analyzing the first document to be analyzed and the second document to be analyzed to obtain a first element set and a second element set;
calculating, by an element matching module, a degree of matching of each element in the first element set with each element in the second element set;
respectively determining the elements with the highest matching degree with each element in the first element set in the second element set;
if the matching degree of the target element in the first element set and the element with the highest matching degree in the second element set is smaller than a set threshold value, identifying the target element in a first document to be analyzed;
if the matching degree of the elements in the first element set and the elements with the highest matching degree in the second element set is larger than a set threshold value, forming an element pair;
analyzing the characters in the element pairs through an element comparison module, and respectively identifying the differences in the element pairs in a first document to be analyzed and a second document to be analyzed.
In a second aspect, an embodiment of the present invention further provides a document comparison analysis apparatus, where the apparatus includes:
the element set analysis module is used for respectively analyzing the first document to be analyzed and the second document to be analyzed to obtain a first element set and a second element set;
the element matching module is used for calculating the matching degree of each element in the first element set and each element in the second element set;
the element determining module is used for respectively determining the elements with the highest matching degree with each element in the first element set in the second element set;
the first identification module is used for identifying the target element in a first document to be analyzed if the matching degree of the target element in the first element set and the element with the highest matching degree in the second element set is smaller than a set threshold value;
an element pair forming module, configured to form an element pair if a matching degree of an element in the first element set and an element in the second element set with a highest matching degree is greater than a set threshold;
and the second identification module is used for analyzing the characters in the element pairs through the element comparison module and respectively identifying the differences in the element pairs in the first document to be analyzed and the second document to be analyzed.
In a third aspect, an embodiment of the present invention further provides an electronic device for comparing and analyzing documents, including:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the document comparison analysis method according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the document comparison analysis method according to any embodiment of the present invention.
In the embodiments of the invention, a first document to be analyzed and a second document to be analyzed are analyzed respectively to obtain a first element set and a second element set; the matching degree of each element in the first element set with each element in the second element set is calculated by an element matching module; for each element in the first element set, the element with the highest matching degree in the second element set is determined; if the matching degree between a target element in the first element set and its highest-matching element in the second element set is smaller than a set threshold, the target element is identified in the first document to be analyzed; if the matching degree between an element in the first element set and its highest-matching element in the second element set is greater than the set threshold, the two elements form an element pair; and the characters in the element pairs are analyzed by an element comparison module, and the differences in the element pairs are identified in the first document to be analyzed and the second document to be analyzed respectively. This solves the problem of comparing documents and obtaining the differences between versions: the differences between different revised versions of a document can be found quickly and accurately, a decision maker can conveniently review them and make decisions quickly, and human resources, time, and energy are saved.
Drawings
FIG. 1a is a flowchart of a document comparison analysis method according to an embodiment of the present invention;
FIG. 1b is a schematic diagram illustrating an analysis process of a document comparison analysis method according to an embodiment of the present invention;
FIG. 1c is a schematic structural diagram of an element matching module according to an embodiment of the present invention;
FIG. 1d is a diagram illustrating a document comparison analysis method according to a second embodiment of the present invention;
FIG. 1e is a flowchart of a document comparison analysis method according to the second embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a document comparison analysis apparatus according to a third embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device for comparing and analyzing documents according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Embodiment One
Fig. 1a is a flowchart of a document comparison analysis method according to the first embodiment of the present invention. This embodiment is applicable to comparing and analyzing multiple documents to find the differences among them, for example documents of different versions. The method may be executed by a document comparison analysis device, which may be implemented in software and/or hardware and may be integrated in a processor. As shown in Fig. 1a, the method specifically includes:
Step 110: analyze the first document to be analyzed and the second document to be analyzed respectively to obtain a first element set and a second element set.
The first document to be analyzed may be the latest version of a document. The second document to be analyzed may be a different version of the same document, a document of the same type published in a different period, or any document related to the first document to be analyzed, such as a different article on the same topic (e.g., different news reports of the same event). The numbers of first and second documents to be analyzed are not limited: there may be several documents, and several documents may be compared in series. The first and second documents to be analyzed may be documents that are published periodically, for example daily, weekly, monthly, quarterly, yearly, or every several years, such as government work reports on five-year plans or documents that express the same fact differently in different periods, annual work reports of governments at various levels, and periodic industry reports. They may be different versions of one document, such as different amendments of a legal document or different revisions of a contract, paper, or financial report. They may also be different documents that describe or discuss the same topic, event, or fact, such as news reports on the same event from different sources or research reports on the same financial report from different institutions. The formats of the first and second documents to be analyzed are not limited; they may be doc, docx, pdf, html, and the like, or image formats such as jpg, png, and tif. OCR technology may be used to recognize the text of a document in an image format.
The first element set and the second element set may be sets of paragraphs, sentences, or short sentences obtained by parsing the first document to be analyzed and the second document to be analyzed, respectively. Parsing a document means segmenting it at the required granularity according to a given rule or standard, where the granularity may be a paragraph, a sentence, or a short sentence.
In an implementation of the embodiment of the present invention, optionally, analyzing the first document to be analyzed and the second document to be analyzed respectively to obtain a first element set and a second element set includes: segmenting the first document to be analyzed and the second document to be analyzed respectively according to the natural attributes of paragraphs, sentences, or short sentences to obtain the first element set and the second element set.
When the formats of the first document to be analyzed and the second document to be analyzed are regular, they can be segmented according to the natural attributes of paragraphs, sentences, or short sentences. Fig. 1b is a schematic diagram of the analysis process of a document comparison analysis method according to an embodiment of the present invention. As shown in Fig. 1b, the first and second documents to be analyzed may be parsed at the required granularity. If the required granularity is the paragraph, the two documents are segmented according to the natural attributes of paragraphs, and the first and second element sets are the paragraph sets of the first and second documents to be analyzed, respectively. A paragraph may be a natural paragraph, and its natural attribute may be the end marker of a natural paragraph, such as a sentence ending with a period, exclamation mark, or question mark followed by a line break, or a common layout feature that accompanies a line break, such as an empty line, first-line indentation, or a drop cap. If the required granularity is the sentence, the two documents are segmented according to the natural attributes of sentences, and the first and second element sets are the sentence sets of the two documents, respectively. The natural attribute of a sentence may be a period, semicolon, exclamation mark, question mark, or the like. If the required granularity is the short sentence, the two documents are segmented according to the natural attributes of short sentences, and the first and second element sets are the short sentence sets of the two documents, respectively. The natural attribute of a short sentence may be a comma, enumeration comma (the Chinese pause mark "、"), semicolon, colon, or the like.
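The rule-based segmentation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the exact delimiter sets (ASCII plus full-width forms), and the choice to let sentence-ending punctuation also close a short sentence are assumptions of the sketch. An ASCII period inside a number such as "3.5" would be split incorrectly and would need extra handling in practice.

```python
import re

# Delimiters follow the punctuation named in the text; including both ASCII
# and full-width forms is an assumption of this sketch.
SENTENCE_DELIMS = ".;!?。;!?"
# Assumption: sentence-ending punctuation also ends a short sentence.
SHORT_SENTENCE_DELIMS = ",:,、:" + SENTENCE_DELIMS


def split_paragraphs(text: str) -> list[str]:
    """Split on line breaks; empty lines and leading indentation are dropped."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_on(text: str, delimiters: str) -> list[str]:
    """Cut the text after each run of delimiter characters, keeping the delimiters."""
    cls = re.escape(delimiters)
    pattern = "[^" + cls + "]+[" + cls + "]*"
    return [m.strip() for m in re.findall(pattern, text) if m.strip()]


def parse_elements(text: str, granularity: str) -> list[str]:
    """Return the element set of a document at the requested granularity."""
    if granularity == "paragraph":
        return split_paragraphs(text)
    if granularity == "sentence":
        return split_on(text, SENTENCE_DELIMS)
    if granularity == "short_sentence":
        return split_on(text, SHORT_SENTENCE_DELIMS)
    raise ValueError(f"unknown granularity: {granularity}")
```

Calling `parse_elements` on both documents with the same granularity yields the first and second element sets used by the later steps.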
In an implementation of the embodiment of the present invention, optionally, analyzing the first document to be analyzed and the second document to be analyzed respectively to obtain a first element set and a second element set includes: parsing the first document to be analyzed into paragraphs, sentences, or short sentences through a pre-trained sequence labeling model to obtain a first element set consisting of the paragraphs, sentences, or short sentences; and parsing the second document to be analyzed into paragraphs, sentences, or short sentences through the pre-trained sequence labeling model to obtain a second element set consisting of the paragraphs, sentences, or short sentences.
If the first document to be analyzed and the second document to be analyzed are in an irregular format, for example a law, regulation, or government notice published on a website whose original formatting has been destroyed, they may be parsed with a machine learning model: a pre-trained sequence labeling model parses the first or second document to be analyzed into paragraphs, sentences, or short sentences. Algorithms for the sequence labeling model include, but are not limited to, Conditional Random Field (CRF), Hidden Markov Model (HMM), Long Short-Term Memory (LSTM) combined with CRF, and Convolutional Neural Network (CNN) combined with CRF.
In an implementation of the embodiment of the present invention, optionally, pre-labeled training set documents are input into a mathematical model, and the mathematical model is trained to obtain the pre-trained sequence labeling model.
The pre-trained sequence labeling model may be obtained by inputting the original training set documents and the pre-labeled training set documents into a mathematical model and training it. The trained sequence labeling model can then parse the first or second document to be analyzed into paragraphs, sentences, or short sentences. The granularity (paragraph, sentence, or short sentence) can be set in advance as required. If there is little change between different versions of a document, the granularity may be the paragraph; if the versions differ somewhat more, the granularity may be the sentence; and if there is a large gap between the versions, the granularity may be the short sentence.
In an implementation of the embodiment of the present invention, optionally, the first character of each paragraph in the training set documents is marked with a first identifier, the last character of each paragraph is marked with a second identifier, and the characters between the first and last characters of each paragraph are marked with a third identifier; or the first character of each sentence in the training set documents is marked with the first identifier, the last character of each sentence is marked with the second identifier, and the characters between the first and last characters of each sentence are marked with the third identifier; or the first character of each short sentence in the training set documents is marked with the first identifier, the last character of each short sentence is marked with the second identifier, and the characters between the first and last characters of each short sentence are marked with the third identifier.
The beginning, the end, and the content between them of a paragraph, sentence, or short sentence may be marked with different labels, for example the letter B for the beginning, the letter E for the end, and the letter O for the content in between. Other identifiers may of course be used instead of B, E, and O; the invention is not limited in this regard. The pre-labeled training set documents are input into the mathematical model, which is trained to obtain the pre-trained sequence labeling model. With this model, when the formats of the first and second documents to be analyzed are irregular, each document can be parsed into paragraphs, sentences, or short sentences whose beginning, end, and intermediate content are marked with different labels, following the same labeling rules as the pre-labeled training set documents. For example, if training set documents with paragraph labels are used to train the sequence labeling model, then inputting the first or second document to be analyzed into the trained model yields a first or second element set composed of paragraphs, in which the beginning, end, and intermediate content of each paragraph are marked in the same way as in the training set documents. If the training set documents are labeled by sentence, the resulting first or second element set is composed of sentences; if they are labeled by short sentence, it is composed of short sentences. When the formats of the first and second documents to be analyzed are regular, the pre-trained sequence labeling model may of course also be used to parse them into paragraphs, sentences, or short sentences. The first or second document to be analyzed may be parsed several times in order to settle on a reasonable parsing, which makes it easier to find the differences between the two documents.
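The B/E/O labeling scheme can be illustrated with a short sketch that produces per-character labels for training data and turns a predicted label sequence back into an element set. The function names are hypothetical, and the handling of a one-character element (labeled B only) is an assumption, since the text does not cover that case. The sequence labeling model itself (CRF, LSTM combined with CRF, and so on) is not shown.

```python
def label_elements(elements: list[str]) -> list[tuple[str, str]]:
    """Label every character: B = first, E = last, O = anything in between."""
    labeled = []
    for element in elements:
        last = len(element) - 1
        for i, ch in enumerate(element):
            if i == 0:
                tag = "B"  # assumption: a one-character element gets B only
            elif i == last:
                tag = "E"
            else:
                tag = "O"
            labeled.append((ch, tag))
    return labeled


def decode_elements(labeled: list[tuple[str, str]]) -> list[str]:
    """Rebuild elements from (character, label) pairs, e.g. model predictions."""
    elements, current = [], []
    for ch, tag in labeled:
        if tag == "B" and current:  # a new element starts before an E was seen
            elements.append("".join(current))
            current = []
        current.append(ch)
        if tag == "E":  # the current element is complete
            elements.append("".join(current))
            current = []
    if current:  # trailing characters without a closing E
        elements.append("".join(current))
    return elements
```

Labeling a segmented training document with `label_elements` and decoding the result with `decode_elements` returns the original segments, which is the consistency the text requires between training labels and parsing output.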
Step 120: calculate the matching degree of each element in the first element set with each element in the second element set through an element matching module.
The matching degree may be the degree of similarity between an element in the first element set and an element in the second element set.
In an implementation of the embodiment of the present invention, optionally, the element matching module includes a first embedding layer, a second embedding layer, a fully connected layer, and at least one network unit. The first embedding layer receives elements of the first element set as input, and the second embedding layer receives elements of the second element set. The first embedding layer and the second embedding layer are each connected to the first network unit, and the fully connected layer is connected to the last network unit. Each network unit includes a first network structure, a first self-encoder, a second network structure, a second self-encoder, and an attention layer. The first network structure and the second network structure are both connected to the attention layer; the first network structure and the attention layer are both connected to the first self-encoder; and the second network structure and the attention layer are both connected to the second self-encoder.
In a specific implementation of the embodiment of the present invention, optionally, the element matching module includes a first embedding layer, a second embedding layer, a fully connected layer, and at least one network unit. The first embedding layer receives elements of the first element set as input, and the second embedding layer receives elements of the second element set. The first embedding layer and the second embedding layer are each connected to the first network unit, and the fully connected layer is connected to the last network unit. Each network unit includes a first long short-term memory unit, a first self-encoder, a second long short-term memory unit, a second self-encoder, and an attention layer. The first long short-term memory unit and the second long short-term memory unit are both connected to the attention layer; the first long short-term memory unit and the attention layer are both connected to the first self-encoder; and the second long short-term memory unit and the attention layer are both connected to the second self-encoder.
Fig. 1c is a schematic structural diagram of the element matching module according to an embodiment of the present invention. As shown in Fig. 1c, an element of the first element set, such as element a, is input through the first embedding layer into the first long short-term memory unit of a network unit, and an element of the second element set, such as element b, is input through the second embedding layer into the second long short-term memory unit of the network unit. The result obtained for element a by the first long short-term memory unit and the result obtained for element b by the second long short-term memory unit are merged at the attention layer. The result of the attention layer is added to the result of the first long short-term memory unit and to the result of the second long short-term memory unit respectively, and the two sums are input to the first self-encoder and the second self-encoder respectively. The results of the first and second self-encoders are merged into the fully connected layer, which outputs the matching degree of element a and element b. The processing from the outputs of the first and second embedding layers to the outputs of the first and second self-encoders is referred to as the network unit processing procedure. For example, when there are two network units, the outputs of the first and second self-encoders of the first network unit are input to the first and second long short-term memory units of the second network unit respectively, and the network unit processing procedure is performed again. When there are more network units, the results flow from unit to unit in the same way; the outputs of the first and second self-encoders of the last network unit are merged into the fully connected layer, which produces the matching degree.
It should further be noted that, in a specific implementation of the embodiment of the present invention, the first network structure and the second network structure may each be a long short-term memory (LSTM) unit, a Gated Recurrent Unit (GRU), a Recurrent Neural Network (RNN), a Convolutional Neural Network (CNN), or a similar network structure.
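The data flow through the network units can be sketched in plain Python. This is only a wiring illustration under strong assumptions: the inputs are already embedded vectors, the network structures and self-encoders are replaced by identity stand-ins, the attention layer is taken to be dot-product cross-attention, and the fully connected layer is applied to mean-pooled outputs with untrained (zero) weights. None of these choices is specified by the text, and all names are hypothetical. A real implementation would use trained layers from a deep learning framework.

```python
import math

Vector = list[float]
Sequence = list[Vector]


def dot(u: Vector, v: Vector) -> float:
    return sum(x * y for x, y in zip(u, v))


def softmax(scores: list[float]) -> list[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: Sequence, keys: Sequence) -> Sequence:
    """Assumed attention: each query becomes a softmax-weighted mix of the keys."""
    mixed = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])
        mixed.append([sum(w * k[d] for w, k in zip(weights, keys))
                      for d in range(len(q))])
    return mixed


def add(xs: Sequence, ys: Sequence) -> Sequence:
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def identity(seq: Sequence) -> Sequence:
    """Stand-in for a trained network structure (e.g. LSTM) or self-encoder."""
    return seq


def network_unit(a, b, net_a=identity, net_b=identity,
                 enc_a=identity, enc_b=identity):
    """One network unit: two network structures, attention, add, two self-encoders."""
    h_a, h_b = net_a(a), net_b(b)
    att_a, att_b = attend(h_a, h_b), attend(h_b, h_a)
    return enc_a(add(h_a, att_a)), enc_b(add(h_b, att_b))


def mean_pool(seq: Sequence) -> Vector:
    return [sum(col) / len(seq) for col in zip(*seq)]


def matching_degree(a: Sequence, b: Sequence, num_units: int = 2,
                    weights=None, bias: float = 0.0) -> float:
    """Chain the network units, then apply a one-output fully connected layer."""
    for _ in range(num_units):
        a, b = network_unit(a, b)
    features = mean_pool(a) + mean_pool(b)  # merge both branches
    weights = weights or [0.0] * len(features)  # untrained placeholder weights
    return 1.0 / (1.0 + math.exp(-(dot(weights, features) + bias)))
```

With the placeholder weights the output carries no meaning; the sketch only shows how the outputs of one unit feed the next and how the two branches meet at the fully connected layer.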
Step 130: for each element in the first element set, determine the element in the second element set with the highest matching degree.
After the element matching module has calculated the matching degree of each element in the first element set with each element in the second element set, the element with the highest matching degree can be found in the second element set for each element in the first element set. For example, for element a in the first element set, the element b with the highest matching degree can be found in the second element set.
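Selecting the highest-matching element can be sketched as an argmax over the pairwise matching degrees. In this sketch `difflib.SequenceMatcher.ratio()` from the Python standard library stands in for the trained element matching module; that substitution and the function names are assumptions made only so the example runs on its own. The second element set is assumed to be non-empty.

```python
from difflib import SequenceMatcher


def stand_in_degree(a: str, b: str) -> float:
    """Stand-in for the element matching module: a string similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def best_matches(first: list[str], second: list[str], degree=stand_in_degree):
    """For each element of `first`, return (element, best match in `second`, degree)."""
    result = []
    for a in first:
        scores = [degree(a, b) for b in second]
        j = max(range(len(second)), key=scores.__getitem__)  # index of highest degree
        result.append((a, second[j], scores[j]))
    return result
```

Any other scoring function with the same signature, including a trained model, can be passed as `degree`.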
Step 140: if the matching degree between a target element in the first element set and its highest-matching element in the second element set is smaller than a set threshold, identify the target element in the first document to be analyzed.
If the matching degree between element a and element b is smaller than the set threshold, no element in the second element set matches element a; that is, element a is new relative to the second document to be analyzed, and it can be identified in the first document to be analyzed. The identification may use area highlighting, area marking with different line types such as straight, broken, or wavy lines, or marking with shapes drawn in different line types, such as rectangles and ellipses. Highlighting, line types, and shapes may also be combined; the invention is not particularly limited in this regard. The identification clearly alerts a user, such as a decision maker, that the first document to be analyzed contains elements that are new relative to the second document to be analyzed, so that the user can check them in time and make decisions accordingly.
Step 150: if the matching degree between an element in the first element set and its highest-matching element in the second element set is greater than the set threshold, form an element pair.
If the matching degree between element a and element b is greater than the set threshold, element a and element b form an element pair; that is, element a and element b match each other.
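Steps 140 and 150 amount to partitioning the best matches by the threshold. The sketch below assumes the best matches are given as (element a, element b, matching degree) tuples; the function name and the third return value are hypothetical. The text specifies only "smaller than" and "greater than", so a degree exactly equal to the threshold is returned separately rather than assigned to either case.

```python
def split_by_threshold(best, threshold: float):
    """Partition (a, b, degree) tuples into new elements and element pairs."""
    new_elements, pairs, undecided = [], [], []
    for a, b, degree in best:
        if degree < threshold:
            new_elements.append(a)      # step 140: mark a in the first document
        elif degree > threshold:
            pairs.append((a, b))        # step 150: a and b form an element pair
        else:
            undecided.append((a, b))    # equality is not covered by the text
    return new_elements, pairs, undecided
```

The elements in `new_elements` are the ones to highlight as newly added, and the tuples in `pairs` go on to the element comparison module.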
Step 160: analyze the characters in the element pairs through the element comparison module, and identify the differences in the element pairs in the first document to be analyzed and the second document to be analyzed respectively.
If element a in the first element set and element b in the second element set form an element pair, the characters in element a and element b can be analyzed by the element comparison module. The element comparison module may be a mathematical model trained in advance through machine learning that can analyze the differences within an element pair, such as character differences, semantic differences, and emotion differences. Character differences may also be found by directly comparing the character strings in the element pair. The differences in the element pair may be identified in both the first and second documents to be analyzed, or only in one of them as needed.
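The string-comparison route to character differences can be sketched with `difflib.SequenceMatcher.get_opcodes()` from the Python standard library. Using difflib is an assumption of this sketch, since the text does not name a comparison algorithm, and the function name is hypothetical. Semantic and emotion differences would require trained models and are not shown.

```python
from difflib import SequenceMatcher


def character_differences(a: str, b: str) -> list[tuple[str, str, str]]:
    """Return (operation, text in a, text in b) for every differing span."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    diffs = []
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op != "equal":  # op is 'replace', 'delete', or 'insert'
            diffs.append((op, a[a0:a1], b[b0:b1]))
    return diffs


print(character_differences("The budget rises by 5%.", "The budget rises by 6%."))
# [('replace', '5', '6')]
```

The index ranges returned by `get_opcodes()` can also be kept instead of the substrings, so that the differing spans can be marked in place in each document.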
In an implementation manner of the embodiment of the present invention, optionally, identifying differences in the element pairs in the first document to be analyzed and the second document to be analyzed respectively includes: respectively identifying the character differences in the element pairs in a first document to be analyzed and a second document to be analyzed; respectively identifying words with character differences in the element pairs and semantic similarity larger than a set threshold in a first document to be analyzed and a second document to be analyzed; and respectively identifying the characters with emotional differences in the element pair in the first document to be analyzed and the second document to be analyzed.
A character difference means that the words or characters in the element pair differ. Same semantics means that the words or characters in the element pair differ but their semantic similarity is greater than a set threshold. An emotion difference means that, after emotion analysis classifies the words or characters in the element pair as negative, neutral, or positive, the two sides carry different emotions. Character differences, same-semantics differences, and emotion differences may be identified in both the first document to be analyzed and the second document to be analyzed, or only in one of them, as required. For example, they may be identified with different highlight colors and/or different line types and/or different shapes: a red background highlight for character differences, a yellow background highlight for same semantics, and a blue background highlight for emotion differences. The document comparison and analysis result can be displayed on different interfaces, such as a browser, a Windows program or other software, or dedicated software integrated with an enterprise's or organization's internal systems. This lets the user find the document comparison and analysis result conveniently and quickly.
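The three kinds of difference and their example highlight colors could be wired together roughly as below. This is a hedged sketch: the names `labels_for`, `mark`, and `HIGHLIGHT_COLORS`, the HTML `span` output, and the idea that one differing span may carry several labels are assumptions. The semantic similarity and the emotion labels are taken as inputs, since the source obtains them from trained models that are not reproduced here.

```python
from html import escape
from typing import List

# Example colors from the description: red, yellow, and blue backgrounds.
HIGHLIGHT_COLORS = {
    "character": "red",
    "same_semantics": "yellow",
    "emotion": "blue",
}


def labels_for(
    semantic_similarity: float,
    emotion_a: str,
    emotion_b: str,
    semantic_threshold: float,
) -> List[str]:
    """Label a span whose text already differs between the two elements."""
    labels = ["character"]
    if semantic_similarity > semantic_threshold:
        labels.append("same_semantics")
    if emotion_a != emotion_b:  # each is "negative", "neutral" or "positive"
        labels.append("emotion")
    return labels


def mark(text: str, kind: str) -> str:
    """Wrap text in an HTML span with the background color for this kind."""
    color = HIGHLIGHT_COLORS[kind]
    return f'<span style="background-color: {color}">{escape(text)}</span>'
```

How to render a span that carries more than one label (for example, which color takes precedence) is left open here because the source does not specify it.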
In this embodiment, the first document to be analyzed and the second document to be analyzed are analyzed respectively to obtain a first element set and a second element set; the matching degree between each element in the first element set and each element in the second element set is calculated through an element matching module; for each element in the first element set, the element with the highest matching degree in the second element set is determined; if the matching degree between a target element in the first element set and its highest-matching element in the second element set is smaller than a set threshold, the target element is identified in the first document to be analyzed; if that matching degree is greater than the set threshold, an element pair is formed; and the characters in the element pairs are analyzed through the element comparison module, with the differences in the element pairs identified in the first document to be analyzed and the second document to be analyzed respectively. This solves the problem of comparing documents and obtaining the differences between versions: differences between differently modified versions of a document can be found quickly and accurately, a decision maker can review them and decide quickly, and human resources, time, and effort are saved.
Example two
FIG. 1e is a flowchart of a document comparison analysis method according to a second embodiment of the present invention. This embodiment further refines the technical solution. As shown in FIG. 1e, the method specifically includes:
Step 310, analyzing the first document to be analyzed and the second document to be analyzed respectively through an element analysis module to obtain a first element set and a second element set.
The element analysis module parses the first document to be analyzed and the second document to be analyzed at the required element granularity, and the parsing result is an element set. As shown in FIG. 1b, when the granularity is the paragraph, the first element set and the second element set are paragraph sets; when the granularity is the sentence, they are sentence sets; when the granularity is the short sentence, they are short-sentence sets. The parsing can follow the natural attributes of paragraphs, sentences, or short sentences, or it can use a pre-trained sequence labeling model.
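Parsing by natural attributes can be sketched with regular expressions, as below. The concrete delimiters (line breaks for paragraphs, sentence-ending punctuation for sentences, and any clause punctuation for short sentences) and the name `split_elements` are assumptions for illustration; the source does not list the exact delimiters. The sketch relies on `re.split` splitting on zero-width matches, which requires Python 3.7 or later, and it does not handle cases such as decimal points inside numbers.

```python
import re
from typing import List

# Assumed delimiters for each granularity (Western and Chinese punctuation).
_PATTERNS = {
    "paragraph": r"\n+",
    "sentence": r"(?<=[.!?。！？])\s*",
    "short_sentence": r"(?<=[,;:.!?，；：。！？])\s*",
}


def split_elements(text: str, granularity: str) -> List[str]:
    """Split a document into an element set at the requested granularity."""
    if granularity not in _PATTERNS:
        raise ValueError(f"unknown granularity: {granularity!r}")
    parts = re.split(_PATTERNS[granularity], text)
    # Drop empty fragments left over at the boundaries.
    return [part.strip() for part in parts if part.strip()]
```

Calling the function with `"sentence"` on `"First clause, second clause. Next sentence!"` yields two elements, while `"short_sentence"` yields three, which mirrors the coarser and finer element sets described above.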
Step 320, calculating the matching degree of each element in the first element set and each element in the second element set through an element matching module, and finding element pairs.
When the matching degree between an element in the first element set and the element with the highest matching degree in the second element set is greater than the set threshold, the two form an element pair. When that matching degree is smaller than the set threshold, the element in the first element set is a newly added element relative to the second element set.
Step 330, analyzing the element pairs through an element comparison module, and displaying the differences.
The difference can be character difference, semantic difference and emotion difference, and can be displayed through different highlight colors.
In the embodiment of the invention, the first document to be analyzed and the second document to be analyzed are respectively analyzed through an element analysis module to obtain a first element set and a second element set; calculating the matching degree of each element in the first element set and each element in the second element set through an element matching module, and finding out an element pair; the elements are analyzed through the element comparison module, and the difference is displayed, so that the problem of document comparison and difference acquisition between different versions is solved, the difference between the documents of different modified versions can be quickly and accurately found, a decision maker can conveniently and quickly look up and make a decision, and the effects of saving human resources, time and energy are achieved.
For example, FIG. 1d is a design diagram of a document comparison analysis method provided in the second embodiment of the present invention. As shown in FIG. 1d, the method may involve an element analysis module, an element matching module, an element comparison module, and a report generation module. The element analysis module parses documents of the same type but different versions to obtain the element set corresponding to each document. The element matching module matches the elements of these element sets: it obtains the matching degrees and selects, for each element, the element with the highest matching degree. The element comparison module analyzes the matched elements and judges whether the characters differ, whether differing characters have the same semantics, and whether the emotions differ. The report generation module highlights the comparison result.
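The four-module flow can be sketched end to end as follows. This is a toy illustration under stated assumptions, not the method of the embodiment: the matching degree here is `difflib.SequenceMatcher.ratio()`, a simple character-overlap score swapped in for the trained matching model; elements are split on line breaks only; the threshold of 0.6 is arbitrary; and all function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def parse(document: str) -> List[str]:
    """Element analysis (stand-in): one element per non-empty line."""
    return [line.strip() for line in document.split("\n") if line.strip()]


def matching_degree(a: str, b: str) -> float:
    """Element matching (stand-in): character-overlap ratio in [0, 1]."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def compare_documents(doc1: str, doc2: str, threshold: float = 0.6) -> List[Dict[str, str]]:
    """Match each element of doc1 against doc2 and classify the outcome."""
    second = parse(doc2)
    report = []
    for element in parse(doc1):
        best = max(second, key=lambda b: matching_degree(element, b), default=None)
        if best is None or matching_degree(element, best) <= threshold:
            report.append({"element": element, "status": "new"})
        elif element == best:
            report.append({"element": element, "status": "identical", "matched": best})
        else:
            report.append({"element": element, "status": "changed", "matched": best})
    return report


def render(report: List[Dict[str, str]]) -> str:
    """Report generation (stand-in): one plain-text line per element."""
    return "\n".join(f"[{row['status']}] {row['element']}" for row in report)
```

A "changed" entry is where the element comparison module would then look for character, semantic, and emotion differences inside the pair.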
Example three
Fig. 2 is a schematic structural diagram of a document comparison analysis apparatus according to a third embodiment of the present invention, as shown in fig. 2, the apparatus includes: an element set parsing module 210, an element matching module 220, an element determination module 230, a first identification module 240, an element pair formation module 250, and a second identification module 260.
The element set analysis module 210 is configured to analyze the first document to be analyzed and the second document to be analyzed respectively to obtain a first element set and a second element set;
an element matching module 220, configured to calculate a matching degree between each element in the first element set and each element in the second element set;
an element determining module 230, configured to determine, respectively, an element in the second element set that has a highest matching degree with each element in the first element set;
a first identifying module 240, configured to identify a target element in the first document to be analyzed if a matching degree of the target element in the first element set and an element with a highest matching degree in the second element set is smaller than a set threshold;
an element pair forming module 250, configured to form an element pair if a matching degree of an element in the first element set and an element in the second element set with a highest matching degree is greater than a set threshold;
the second identification module 260 is configured to analyze the characters in the element pairs through the element comparison module, and identify differences in the element pairs in the first document to be analyzed and the second document to be analyzed, respectively.
Optionally, the element matching module 220 includes a first embedding layer, a second embedding layer, a fully connected layer, and at least one network unit. The first embedding layer receives the elements of the first element set as input, and the second embedding layer receives the elements of the second element set as input. The first embedding layer and the second embedding layer are each connected with the first network unit, and the fully connected layer is connected with the last network unit. Each network unit comprises a first network structure, a first self-encoder, a second network structure, a second self-encoder, and an attention layer. The first network structure and the second network structure are both connected with the attention layer; the first network structure and the attention layer are both connected with the first self-encoder; the second network structure and the attention layer are both connected with the second self-encoder.
The first network structure and the second network structure may be network structures such as long-short term memory units, GRUs, RNNs, or CNNs.
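The source does not state what the attention layer computes. One common choice for an attention layer that connects two parallel sequence encoders is dot-product soft alignment, in which each position of one sequence receives a weighted average of the other sequence. The pure-Python sketch below shows only that assumed computation, with hypothetical names `softmax` and `soft_align`; it is not the trained matching model and omits the embedding layers, the network structures, and the self-encoders.

```python
import math
from typing import List

Vector = List[float]


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_align(a: List[Vector], b: List[Vector]) -> List[Vector]:
    """For each vector in a, return an attention-weighted average of b.

    Attention scores are dot products between a vector of a and each
    vector of b (an assumed scoring function).
    """
    width = len(b[0])
    aligned = []
    for u in a:
        scores = [sum(x * y for x, y in zip(u, v)) for v in b]
        weights = softmax(scores)
        aligned.append(
            [sum(w * v[k] for w, v in zip(weights, b)) for k in range(width)]
        )
    return aligned
```

In an architecture of the kind described, the aligned vectors from the attention layer and the outputs of the corresponding network structure would be the two inputs that feed each self-encoder.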
Optionally, the element set parsing module 210 includes: and the element set analysis unit is used for respectively segmenting the first document to be analyzed and the second document to be analyzed according to the natural attribute of the paragraph, the natural attribute of the sentence or the natural attribute of the short sentence to obtain a first element set and a second element set.
Optionally, the element set parsing module 210 includes: the pre-trained sequence labeling model is used for analyzing paragraphs, sentences or short sentences of the first document to be analyzed to obtain a first element set consisting of the paragraphs, the sentences or the short sentences; and analyzing paragraphs, sentences or short sentences of the second document to be analyzed to obtain a second element set consisting of the paragraphs, the sentences or the short sentences.
Optionally, the apparatus is further configured to input pre-labeled training set documents into a mathematical model and train the mathematical model to obtain the pre-trained sequence labeling model.
Optionally, the apparatus further includes a training set document identification module, configured to: mark the first character of each paragraph in a training set document as a first identifier, the last character of each paragraph as a second identifier, and each character between the first character and the last character of each paragraph as a third identifier; or mark the first character of each sentence in a training set document as a first identifier, the last character of each sentence as a second identifier, and each character between the first character and the last character of each sentence as a third identifier; or mark the first character of each short sentence in a training set document as a first identifier, the last character of each short sentence as a second identifier, and each character between the first character and the last character of each short sentence as a third identifier.
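The labeling scheme above can be illustrated with a short sketch. The tag names "B", "E", and "M" stand in for the first, second, and third identifiers and are assumptions, as are the function names. The source does not say how a one-character element is labeled, so giving it the first identifier is also an assumption.

```python
from typing import List


def label_characters(
    elements: List[str], first: str = "B", last: str = "E", middle: str = "M"
) -> List[str]:
    """Produce one label per character for a document made of the elements."""
    labels: List[str] = []
    for element in elements:
        if not element:
            continue
        if len(element) == 1:
            # Assumption: a single-character element gets the first identifier.
            labels.append(first)
        else:
            labels.extend([first] + [middle] * (len(element) - 2) + [last])
    return labels


def decode(text: str, labels: List[str], first: str = "B") -> List[str]:
    """Rebuild the element set from per-character labels."""
    elements: List[str] = []
    for char, label in zip(text, labels):
        if label == first or not elements:
            elements.append(char)  # a first identifier starts a new element
        else:
            elements[-1] += char
    return elements
```

`label_characters` shows how training labels could be derived from already segmented documents, and `decode` shows how labels predicted by a sequence labeling model could be turned back into paragraphs, sentences, or short sentences.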
Optionally, the second identifying module 260 is specifically configured to identify the text differences in the element pairs in the first document to be analyzed and the second document to be analyzed respectively; respectively identifying words with character differences in the element pairs and semantic similarity larger than a set threshold in a first document to be analyzed and a second document to be analyzed; and respectively identifying the characters with emotional differences in the element pair in the first document to be analyzed and the second document to be analyzed.
The document comparison analysis device provided by the embodiment of the invention can execute the document comparison analysis method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example four
Fig. 3 is a schematic structural diagram of an electronic device for comparing and analyzing documents according to a fourth embodiment of the present invention, as shown in fig. 3, the device includes:
one or more processors 410, one processor 410 being exemplified in FIG. 3;
a memory 420;
the apparatus may further include: an input device 430 and an output device 440.
The processor 410, the memory 420, the input device 430, and the output device 440 of the apparatus may be connected by a bus or by other means; connection by a bus is taken as an example in FIG. 3.
The memory 420, as a non-transitory computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the document comparison analysis method in the embodiment of the present invention (for example, the element set parsing module 210, the element matching module 220, the element determining module 230, the first identifying module 240, the element pair forming module 250, and the second identifying module 260 shown in FIG. 2). The processor 410 executes the various functional applications and data processing of the computer device by running the software programs, instructions, and modules stored in the memory 420, thereby implementing the document comparison analysis method of the above method embodiment, that is:
respectively analyzing the first document to be analyzed and the second document to be analyzed to obtain a first element set and a second element set;
calculating, by an element matching module, a degree of matching of each element in the first element set with each element in the second element set;
respectively determining the elements with the highest matching degree with each element in the first element set in the second element set;
if the matching degree of the target element in the first element set and the element with the highest matching degree in the second element set is smaller than a set threshold value, identifying the target element in a first document to be analyzed;
if the matching degree of the elements in the first element set and the elements with the highest matching degree in the second element set is larger than a set threshold value, forming an element pair;
analyzing the characters in the element pairs through an element comparison module, and respectively identifying the differences in the element pairs in a first document to be analyzed and a second document to be analyzed.
The memory 420 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 420 may optionally include memory located remotely from processor 410, which may be connected to the terminal device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 430 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer apparatus. The output device 440 may include a display device such as a display screen.
The embodiment of the invention provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements a document comparison analysis method provided by the embodiment of the invention:
respectively analyzing the first document to be analyzed and the second document to be analyzed to obtain a first element set and a second element set;
calculating, by an element matching module, a degree of matching of each element in the first element set with each element in the second element set;
respectively determining the elements with the highest matching degree with each element in the first element set in the second element set;
if the matching degree of the target element in the first element set and the element with the highest matching degree in the second element set is smaller than a set threshold value, identifying the target element in a first document to be analyzed;
if the matching degree of the elements in the first element set and the elements with the highest matching degree in the second element set is larger than a set threshold value, forming an element pair;
analyzing the characters in the element pairs through an element comparison module, and respectively identifying the differences in the element pairs in a first document to be analyzed and a second document to be analyzed.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, Python, Go, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A document comparison analysis method is characterized by comprising the following steps:
respectively analyzing the first document to be analyzed and the second document to be analyzed to obtain a first element set and a second element set;
calculating, by an element matching module, a degree of matching of each element in the first element set with each element in the second element set;
respectively determining the elements with the highest matching degree with each element in the first element set in the second element set;
if the matching degree of the target element in the first element set and the element with the highest matching degree in the second element set is smaller than a set threshold value, identifying the target element in a first document to be analyzed;
if the matching degree of the elements in the first element set and the elements with the highest matching degree in the second element set is larger than a set threshold value, forming an element pair;
analyzing the characters in the element pairs through an element comparison module, and respectively identifying the differences in the element pairs in a first document to be analyzed and a second document to be analyzed.
2. The method of claim 1, wherein the element matching module comprises a first embedding layer, a second embedding layer, a fully connected layer, and at least one network element;
the first embedding layer is for input of elements in the first set of elements; the second embedding layer is used for inputting elements in the second element set;
the first embedded layer and the second embedded layer are respectively connected with a first network unit;
the full connection layer is connected with the last network unit;
the network unit comprises a first network structure, a first self-encoder, a second network structure, a second self-encoder and an attention layer;
the first network structure and the second network structure are both connected with an attention layer;
the first network fabric and the attention layer are both connected with the first self-encoder;
the second network fabric and the attention layer are both connected to the second self-encoder.
3. The method according to claim 1, wherein the analyzing the first document to be analyzed and the second document to be analyzed respectively to obtain a first element set and a second element set, comprises:
and respectively segmenting the first document to be analyzed and the second document to be analyzed according to the natural attribute of the paragraph, the natural attribute of the sentence or the natural attribute of the short sentence to obtain a first element set and a second element set.
4. The method according to claim 1, wherein the analyzing the first document to be analyzed and the second document to be analyzed respectively to obtain a first element set and a second element set, comprises:
analyzing paragraphs, sentences or short sentences of the first document to be analyzed through a pre-trained sequence labeling model to obtain a first element set consisting of the paragraphs, the sentences or the short sentences;
and analyzing paragraphs, sentences or phrases of the second document to be analyzed through the pre-trained sequence marking model to obtain a second element set consisting of the paragraphs, the sentences or the phrases.
5. The method of claim 4, further comprising:
and inputting the pre-marked training set documents into a mathematical model, and training the mathematical model to obtain a pre-trained sequence labeling model.
6. The method of claim 5, wherein:
the first character of each paragraph in the training set document is marked as a first identifier, the last character of each paragraph is marked as a second identifier, and each character between the first character and the last character of each paragraph is marked as a third identifier; or
the first character of each sentence in the training set document is marked as a first identifier, the last character of each sentence is marked as a second identifier, and each character between the first character and the last character of each sentence is marked as a third identifier; or
the first character of each short sentence in the training set document is marked as a first identifier, the last character of each short sentence is marked as a second identifier, and each character between the first character and the last character of each short sentence is marked as a third identifier.
7. The method of claim 1, wherein identifying differences in the pairs of elements in a first document to be analyzed and a second document to be analyzed, respectively, comprises:
respectively identifying the character differences in the element pairs in a first document to be analyzed and a second document to be analyzed;
respectively identifying words with character differences in the element pairs and semantic similarity larger than a set threshold in a first document to be analyzed and a second document to be analyzed;
and respectively identifying the characters with emotion difference in the element pair in the first document to be analyzed and the second document to be analyzed.
8. A document comparison analysis device, comprising:
the element set analysis module is used for respectively analyzing the first document to be analyzed and the second document to be analyzed to obtain a first element set and a second element set;
the element matching module is used for calculating the matching degree of each element in the first element set and each element in the second element set;
the element determining module is used for respectively determining the elements with the highest matching degree with each element in the first element set in the second element set;
the first identification module is used for identifying the target element in a first document to be analyzed if the matching degree of the target element in the first element set and the element with the highest matching degree in the second element set is smaller than a set threshold value;
an element pair forming module, configured to form an element pair if a matching degree of an element in the first element set and an element in the second element set with a highest matching degree is greater than a set threshold;
and the second identification module is used for analyzing the characters in the element pairs through the element comparison module and respectively identifying the differences in the element pairs in the first document to be analyzed and the second document to be analyzed.
9. An electronic device for comparing and analyzing documents, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the document comparison analysis method of any one of claims 1-7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the document comparison analysis method according to any one of claims 1 to 7.
CN201911199360.8A 2019-11-29 2019-11-29 Document comparison and analysis method and device, electronic equipment and storage medium Active CN110991163B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911199360.8A CN110991163B (en) 2019-11-29 2019-11-29 Document comparison and analysis method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911199360.8A CN110991163B (en) 2019-11-29 2019-11-29 Document comparison and analysis method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110991163A true CN110991163A (en) 2020-04-10
CN110991163B CN110991163B (en) 2023-09-19

Family

ID=70088398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911199360.8A Active CN110991163B (en) 2019-11-29 2019-11-29 Document comparison and analysis method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110991163B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395852A (en) * 2020-12-22 2021-02-23 江西金格科技股份有限公司 Comparison method of multi-file format layout document
CN112668899A (en) * 2020-12-31 2021-04-16 无锡软美信息科技有限公司 Contract risk identification method and device based on artificial intelligence
CN113128195A (en) * 2021-04-23 2021-07-16 达而观信息科技(上海)有限公司 Method and device for automatically searching local difference points based on document structure in financial industry
CN113688616A (en) * 2021-10-27 2021-11-23 深圳市明源云科技有限公司 Method, device and equipment for detecting chart report difference and storage medium
CN113807072A (en) * 2020-06-12 2021-12-17 深圳市迪博企业风险管理技术有限公司 Method and system for quickly identifying difference before and after revision of online approval document
CN114170423A (en) * 2022-02-14 2022-03-11 成都数之联科技股份有限公司 Image document layout identification method, device and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8381095B1 (en) * 2011-11-07 2013-02-19 International Business Machines Corporation Automated document revision markup and change control
CN110032736A (en) * 2019-03-22 2019-07-19 深兰科技(上海)有限公司 A kind of text analyzing method, apparatus and storage medium
WO2019214145A1 (en) * 2018-05-10 2019-11-14 平安科技(深圳)有限公司 Text sentiment analyzing method, apparatus and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8381095B1 (en) * 2011-11-07 2013-02-19 International Business Machines Corporation Automated document revision markup and change control
WO2019214145A1 (en) * 2018-05-10 2019-11-14 平安科技(深圳)有限公司 Text sentiment analyzing method, apparatus and storage medium
CN110032736A (en) * 2019-03-22 2019-07-19 深兰科技(上海)有限公司 A kind of text analyzing method, apparatus and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chen Yewang; Li Wen; Peng Xin; Zhao Wenyun: "Improved ontology-based method for semantic annotation of documents" *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807072A (en) * 2020-06-12 2021-12-17 深圳市迪博企业风险管理技术有限公司 Method and system for quickly identifying difference before and after revision of online approval document
CN112395852A (en) * 2020-12-22 2021-02-23 江西金格科技股份有限公司 Comparison method of multi-file format layout document
CN112668899A (en) * 2020-12-31 2021-04-16 无锡软美信息科技有限公司 Contract risk identification method and device based on artificial intelligence
CN112668899B (en) * 2020-12-31 2022-11-01 广东粤禾农业小额贷款股份有限公司 Contract risk identification method and device based on artificial intelligence
CN113128195A (en) * 2021-04-23 2021-07-16 达而观信息科技(上海)有限公司 Method and device for automatically searching local difference points based on document structure in financial industry
CN113688616A (en) * 2021-10-27 2021-11-23 深圳市明源云科技有限公司 Method, device and equipment for detecting chart report difference and storage medium
CN113688616B (en) * 2021-10-27 2022-02-25 深圳市明源云科技有限公司 Method, device and equipment for detecting chart report difference and storage medium
CN114170423A (en) * 2022-02-14 2022-03-11 成都数之联科技股份有限公司 Image document layout identification method, device and system

Also Published As

Publication number Publication date
CN110991163B (en) 2023-09-19

Similar Documents

Publication Publication Date Title
CN110991163B (en) Document comparison and analysis method and device, electronic equipment and storage medium
CN106776581B (en) Subjective text emotion analysis method based on deep learning
CN108829681B (en) Named entity extraction method and device
CN108664474B (en) Resume analysis method based on deep learning
CN111209412A (en) Method for building knowledge graph of periodical literature by cyclic updating iteration
AU2018279013B2 (en) Method and system for extraction of relevant sections from plurality of documents
CN111191275A (en) Sensitive data identification method, system and device
CN111783394A (en) Training method of event extraction model, event extraction method, system and equipment
CN113204967B (en) Resume named entity identification method and system
CN110457585B (en) Negative text pushing method, device and system and computer equipment
CN113191148A (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN113377916B (en) Extraction method of main relations in multiple relations facing legal text
CN113934909A (en) Financial event extraction method based on pre-training language and deep learning model
CN111178080B (en) Named entity identification method and system based on structured information
Li et al. A method for resume information extraction using bert-bilstm-crf
CN112749283A (en) Entity relationship joint extraction method for legal field
CN111782793A (en) Intelligent customer service processing method, system and equipment
US11645457B2 (en) Natural language processing and data set linking
CN111597302B (en) Text event acquisition method and device, electronic equipment and storage medium
CN113505786A (en) Test question photographing and judging method and device and electronic equipment
Chumwatana COMMENT ANALYSIS FOR PRODUCT AND SERVICE SATISFACTION FROM THAI CUSTOMERS' REVIEW IN SOCIAL NETWORK
CN112052424A (en) Content auditing method and device
CN112667819A (en) Entity description reasoning knowledge base construction and reasoning evidence quantitative information acquisition method and device
CN116796726A (en) Resume analysis method, resume analysis device, terminal equipment and medium
CN109165295B (en) Intelligent resume evaluation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 501, 502, 503, No. 66 Boxia Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai, 201203

Applicant after: Daguan Data Co.,Ltd.

Address before: Room 301, 303 and 304, Block B, 112 Liangxiu Road, China (Shanghai) Pilot Free Trade Zone, Pudong New Area, Shanghai, 201203

Applicant before: DATAGRAND INFORMATION TECHNOLOGY (SHANGHAI) Co.,Ltd.

GR01 Patent grant