CN112926299B - Text comparison method, contract review method and auditing system - Google Patents


Info

Publication number
CN112926299B
Authority
CN
China
Prior art keywords
text
comparison
deleted
paragraph
character strings
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110331873.0A
Other languages
Chinese (zh)
Other versions
CN112926299A (en)
Inventor
张玄武
赵贺平
金宏洲
程亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Tiangu Information Technology Co ltd
Original Assignee
Hangzhou Tiangu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Tiangu Information Technology Co ltd
Priority to CN202110331873.0A
Publication of CN112926299A
Application granted
Publication of CN112926299B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/194: Calculation of difference between files
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G06F16/90335: Query processing
    • G06F16/90344: Query processing by using string matching techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/258: Heading extraction; Automatic titling; Numbering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text comparison method, a contract review method and an auditing system in the field of language processing. The method comprises the following steps: dividing each of the texts to be compared into paragraphs; parsing the clause titles to form a clause title list, and determining, from the degree of overlap between the clause title lists of the two texts, whether any clause has been added, deleted or moved; comparing the paragraphs under the same clause title in the two texts, and determining whether there is only a continuously deleted character string, only a continuously added character string, or both a continuously deleted and a continuously added character string; and classifying the modified content as one or more of deletion, addition, replacement and position change according to the continuously modified character strings found. Compared with conventional comparison tools, new comparison functions such as replacement, position change and modification after position change are added. Because the structural information of the text and the way paragraphs and sentences are connected and expressed are taken into account, accuracy is greatly improved over conventional direct full-text comparison.

Description

Text comparison method, contract review method and auditing system
Technical Field
The invention relates to a language processing technology, in particular to an automatic text comparison and automatic contract review technology.
Background
In the normal course of business, enterprises of small and medium size and above run a series of contract circulation and review processes. Typically, sales or business personnel communicate with the client and draft the contract; the draft is then reviewed by legal personnel, then by financial personnel, and then by the relevant managers. Once all reviews pass, the person handling the contract forwards it to the relevant signatories and sealing parties of the enterprise. If any review step fails during this process, the contract is returned to the business personnel for revision. This process is inefficient, and the circulation involves many version-related risks: a single mistake by a reviewer can expose the business and operations of the enterprise to serious risk.
At present, business personnel can compare versions with common document tools, but these tools compare mechanically: for example, they only report additions and deletions. If the person modifying the text makes other structural adjustments, common document tools cannot mark or reflect them.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a text comparison method and a contract review method that add functions for replacement, replacement modification, position change between clauses, and prompting of character position changes within clauses.
The technical problem is solved by the following technical solution:
A text comparison method comprises the following steps:
dividing each of the texts to be compared into paragraphs;
parsing the clause titles to form a clause title list, and determining, from the degree of overlap between the clause title lists of the two texts, whether any clause has been added, deleted or moved;
comparing the paragraphs under the same clause title in the two texts, and determining whether there is only a continuously deleted character string, only a continuously added character string, or both a continuously deleted character string and a continuously added character string;
and classifying the modified content as one or more of deletion, addition, replacement and position change according to the continuously modified character strings found.
Optionally, if a continuously deleted character string and a continuously added character string share the same context, the modified part is marked as a replacement, in which the deleted character string is replaced by the added character string.
Optionally, if there is only a continuously deleted or only a continuously added character string, the number of deleted or added characters is determined; if the number is smaller than a first threshold, the character string is marked as deleted or added;
if the number of deleted or added characters is greater than or equal to the first threshold, it is determined whether another character string similar to the continuously deleted or added character string exists;
if not, the character string is marked as an added or deleted character string;
if so, it is marked as a position change.
Optionally, the method for determining whether another character string similar to the continuously deleted or continuously added character string exists comprises:
forming a comparison list from all the other continuously added or continuously deleted character strings, and comparing each character string in the comparison list for similarity with the continuously deleted or continuously added character string whose length is greater than or equal to the first threshold;
if the similarity is smaller than a second threshold, marking the character string as added or deleted;
and if the similarity is greater than or equal to the second threshold, marking it as a position change.
Optionally, if a position change is marked, the character strings whose position changed are compared between the two texts to determine whether they contain deleted or added characters; if so, the deleted or added characters are marked.
Optionally, it is checked whether the list of overlapping clause titles contains repeated clause titles; for clause titles and clause contents that are not repeated, paragraph comparison under the same clause title starts directly;
for repeated clause titles, the specific paragraphs within the clauses are compared in order, and any surplus clause title is marked as a wholly deleted clause if it is in the reference text, or as a wholly added clause if it is in the comparison text.
Optionally, characters are de-duplicated according to the character and its coordinates; the method further comprises:
setting a threshold according to the abscissa of the whole text, and filtering out characters whose abscissa exceeds the threshold;
and comparing the coordinates of the tables with the coordinates of all characters, and separating out the characters belonging to each table to form independent paragraphs.
Optionally, a regular expression for the paragraph start marker is created and used to extract the position of the start of each paragraph in the text, and the text between two paragraph starts is assigned to the paragraph of the first start;
a regular expression for the paragraph end marker is created and matched against the text after the start of the last paragraph, and the text between the start of the last paragraph and the matched end forms the last body paragraph.
A regular expression for the attachment start is created; the text between two identified attachment starts is assigned to the first of the two attachment paragraphs, and the last attachment paragraph contains all text after the last attachment start.
Text that has not been assigned to any paragraph is formed into paragraphs in order and placed among the other paragraphs according to its position.
Finally, all the grouped paragraphs are sorted by page number and by the ordinate of their first character to obtain the structured, paragraph-divided text.
The invention also provides a contract review method, which uses the text comparison method as claimed in claim 1 and comprises:
calling an entity recognition module to analyze the entities in the modified content and determining whether the modification is a risky one;
if the modified content contains a named entity, outputting a high-risk modification, and if the modified content does not contain a named entity, outputting a general-risk modification.
Optionally, the named entity includes a person name, a company name, an identification card number, a company-related identification number, a time, an amount, and a number.
The invention also provides an auditing system, which comprises a paragraph dividing module for dividing paragraphs;
the clause comparison module is used for calculating whether new added, deleted or position changed clauses exist;
a paragraph comparison module for calculating whether there is a character string added, deleted or changed in position;
and the risk assessment module is used for analyzing the risk level of the modified content.
Optionally, the system further comprises: a display module for displaying the modified content, which displays the reference text document and the comparison text document separately and highlights and labels the modified positions according to their coordinates; a risk detail module, which lists all modification positions and is linked to the display module by coordinates; and a risk statistics module for counting the number of remaining risks in real time.
The beneficial effects of the invention are as follows. Compared with conventional comparison tools, new functions such as replacement, replacement modification and modification after position change are added. Because the structural information of the text and the way paragraphs and sentences are connected and expressed are taken into account, accuracy is greatly improved over conventional direct full-text comparison. The invention assists personnel at every level of the circulation process in reviewing contracts, reduces the review workload, improves review efficiency, and reduces latent errors in contract content caused by improper operation or poor version management during review, thereby reducing risk in the running and operation of the enterprise.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a general flow chart of the text comparison method;
FIG. 2 is a flow chart of paragraph division;
FIG. 3 is a flow chart of clause comparison;
FIG. 4 is a flow chart of paragraph comparison;
FIG. 5 is a flow chart of risk assessment.
Detailed Description
The present invention will be described in further detail with reference to the following examples, which are illustrative of the present invention and are not intended to limit the present invention thereto.
Example 1:
In this embodiment, contract text is taken as the subject of comparison. The texts to be compared comprise a reference text and a comparison text, where the reference text may be a newly input text or an existing, already processed text stored in a system database.
The matching and comparison methods described herein all use the "gestalt pattern matching" method, and in other embodiments, other text similarity calculation methods may be used, without limitation.
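As a point of reference, Python's standard `difflib.SequenceMatcher` is documented as being based on the Ratcliff/Obershelp "gestalt pattern matching" algorithm, so a similarity score of the kind mentioned above can be sketched with it. The helper name `similarity` is an assumption made for illustration; the patent does not name a library or an implementation.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two strings.

    Hypothetical helper: the score is 2*M/T, where M is the number of
    matched characters and T is the total length of both strings.
    """
    # autojunk=False switches off difflib's heuristic that ignores very
    # frequent characters in long inputs, so every character is compared.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


print(similarity("xxxttedfgq", "xxtttedfgq"))
```

Any other text similarity measure with a comparable 0 to 1 range could be substituted, as the embodiment itself allows.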
FIG. 1 shows the text comparison method. This embodiment describes the processing of both the reference text and the comparison text; in other embodiments, only the comparison text may be processed and then compared with an existing, already processed reference text.
The general flow is as follows.
Contract documents carry a large amount of structural information, and the text is usually organized and modified by clauses, sub-clauses and so on. Comparing the full texts directly therefore ignores the modifier's intent, and the result often fails to reflect it: for example, additions, deletions and replacements are misidentified, additions and deletions made after a paragraph has moved are misidentified, and headers and footers included in the comparison cause spurious differences. This technical solution first sorts out the paragraph structure of the text, normalizes the data, removes text that does not need to be compared, and then compares the parts that correspond to the same clause as intended by the contract editor. To achieve this, paragraph division is performed first.
(I) Paragraph division. As shown in FIG. 2, the reference text and the comparison text are each divided into paragraphs; that is, the text extracted from the file is structured by clause, so that unordered text becomes structured text divided into clauses, paragraphs, tables and so on, and unnecessary text is removed.
1) Text preprocessing: characters are de-duplicated according to whether both the character and its coordinates are repeated; a threshold is set according to the abscissa of the whole text, and characters whose abscissa exceeds the threshold are filtered out, which removes information such as headers, footers and page numbers.
2) Table processing: the coordinates of the tables are compared with the coordinates of all characters, and the characters belonging to each table are separated out to form independent paragraphs.
The character coordinates and the table coordinates are traversed in order; characters that fall within a table's coordinates are removed from the overall text and assigned to the corresponding table, forming table paragraphs in sequence, and the tables are numbered in order.
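Steps 1) and 2) can be sketched as follows. This is a minimal illustration under stated assumptions: the `Char` record, the function names, and the representation of a table as a `(page, x0, y0, x1, y1)` bounding box are all hypothetical, since the patent does not specify how characters or tables are stored. The filter follows the text literally and tests the abscissa; an implementation could apply the same kind of test to the ordinate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Char:
    """Hypothetical record for one extracted character."""
    text: str
    page: int
    x: float  # abscissa
    y: float  # ordinate


def preprocess(chars, x_threshold):
    """Drop exact duplicates (same character, same coordinates) and
    characters whose abscissa exceeds the threshold."""
    seen, kept = set(), []
    for ch in chars:
        if ch in seen or ch.x > x_threshold:
            continue
        seen.add(ch)
        kept.append(ch)
    return kept


def split_tables(chars, tables):
    """Move characters that fall inside a table box out of the body text.

    tables: list of (page, x0, y0, x1, y1) boxes; tables are numbered
    from 1 in the order given.
    """
    table_chars = {n: [] for n in range(1, len(tables) + 1)}
    body = []
    for ch in chars:
        for n, (page, x0, y0, x1, y1) in enumerate(tables, start=1):
            if ch.page == page and x0 <= ch.x <= x1 and y0 <= ch.y <= y1:
                table_chars[n].append(ch)
                break
        else:
            # No table box contains this character, so it stays in the body.
            body.append(ch)
    return body, table_chars


raw = [Char("A", 1, 10, 10), Char("A", 1, 10, 10), Char("7", 1, 590, 10),
       Char("B", 1, 20, 50), Char("C", 1, 30, 200)]
clean = preprocess(raw, x_threshold=500)
body, table_paragraphs = split_tables(clean, [(1, 0, 40, 100, 60)])
```

In this example the duplicated "A" and the far-right "7" are dropped, "B" lands in table 1, and "A" and "C" remain in the body text.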
3) The paragraphs of the contract text are divided according to the paragraph markers used in the contract text, such as "Article 1" or "Chapter 1".
A regular expression for the paragraph start marker is created and used to extract the position of the start of each paragraph in the text, and the text between two paragraph starts is assigned to the paragraph of the first start;
a regular expression for the paragraph end marker is created and matched against the text after the start of the last paragraph, and the text between the start of the last paragraph and the matched end forms the last body paragraph.
A regular expression for the attachment start is created; the text between two identified attachment starts is assigned to the first of the two attachment paragraphs, and the last attachment paragraph contains all text after the last attachment start.
Text that has not been assigned to any paragraph is formed into paragraphs in order and placed among the other paragraphs according to its position.
The regular expression for the end marker includes keywords such as "no text below", "Party A's seal" and "attachment";
the regular expression for the paragraph start marker includes keywords such as "Article 1" and "Chapter 1", and the regular expression for the attachment start marker includes keywords such as "attachment" and "risk notice letter". These are only a few examples; other similar markers also fall within the scope of this solution.
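The start-marker and end-marker logic of step 3) might look like the sketch below. The English patterns are assumptions chosen to mirror the translated keywords; a real contract corpus would need its own patterns, and attachment handling is left out for brevity. The function name `split_clauses` is hypothetical.

```python
import re

# Assumed English equivalents of the start and end keywords.
START = re.compile(r"^(Article|Chapter)\s+\d+", re.MULTILINE)
END = re.compile(r"^(No text below|Party A's seal)", re.MULTILINE)


def split_clauses(text):
    """Split text into paragraphs at start markers; the last clause is
    closed by the first end marker found after its start."""
    starts = [m.start() for m in START.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []

    paragraphs = []
    preamble = text[:starts[0]].strip()
    if preamble:
        # Text before the first marker becomes its own paragraph.
        paragraphs.append(preamble)

    # Text between two starts belongs to the paragraph of the first start.
    for begin, nxt in zip(starts, starts[1:]):
        paragraphs.append(text[begin:nxt].strip())

    last = starts[-1]
    end = END.search(text, last)
    stop = end.start() if end else len(text)
    paragraphs.append(text[last:stop].strip())

    tail = text[stop:].strip()
    if tail:
        # Ungrouped trailing text is kept as a separate paragraph.
        paragraphs.append(tail)
    return paragraphs


sample = ("Contract\nArticle 1 Scope\nfoo\n"
          "Article 2 Fees\nbar\nNo text below\nsignature")
parts = split_clauses(sample)
```

Here `parts` contains four paragraphs: the preamble, the two articles, and the trailing block that begins at the end marker.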
4) Finally, all the grouped paragraphs are sorted by page number and by the ordinate of their first character to obtain the structured, paragraph-divided text.
(II) Clause comparison. As shown in FIG. 3, contract text is generally divided by clauses, and a clause title expresses the intent of the whole clause. The clause title can therefore be taken to represent all paragraphs in the clause, and all paragraphs under the same or similar clause titles can be regarded as belonging to the same clause.
1) The clause titles of the reference text and the comparison text are parsed to form two or more clause title lists (more than two lists exist when several reference texts or several comparison texts are analyzed at the same time), and the degree of overlap between the clause title lists of the reference text and the comparison text is used to determine whether any clause paragraph has been added, deleted or moved;
the clause title list of the comparison text is compared with that of the reference text;
if a clause title appears only in the title list of the reference text, the whole clause is marked as a deleted clause;
if a clause title appears only in the title list of the comparison text, the whole clause is marked as an added clause.
If the order of the clause titles differs between the two clause title lists, the changed part is marked as a clause position change. Conventional comparison tools would report this as an addition plus a deletion, whereas it is in fact a change of clause position. In the subsequent comparison, the text is not compared against whatever occupies the original position, as a conventional tool would do; instead, the clauses and paragraphs with the same clause title are compared after the position change has been accounted for.
Among the clause titles that overlap between the two lists, it is checked whether any clause title is repeated. For clause titles and contents that are not repeated, comparison of the specific paragraphs within the clause starts directly. For repeated clause titles, the paragraphs within the clauses are compared in order of appearance; any surplus clause with that title is marked as a wholly deleted clause if it is in the reference text, or as a wholly added clause if it is in the comparison text.
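The title-list comparison can be illustrated with a short sketch. It assumes that clause titles are unique within each text (the repeated-title case described above is not handled) and uses `difflib.SequenceMatcher` over the lists of shared titles to find which ones kept their relative order; titles outside the in-order matches are treated as moved. The function name and the result keys are hypothetical.

```python
from difflib import SequenceMatcher


def compare_titles(ref_titles, cmp_titles):
    """Classify clause titles as deleted, added or moved.

    Assumes each title occurs at most once per list.
    """
    ref_set, cmp_set = set(ref_titles), set(cmp_titles)
    deleted = [t for t in ref_titles if t not in cmp_set]
    added = [t for t in cmp_titles if t not in ref_set]

    # Restrict both lists to shared titles, preserving their own order.
    ref_common = [t for t in ref_titles if t in cmp_set]
    cmp_common = [t for t in cmp_titles if t in ref_set]

    sm = SequenceMatcher(None, ref_common, cmp_common, autojunk=False)
    in_order = set()
    for block in sm.get_matching_blocks():
        in_order.update(ref_common[block.a:block.a + block.size])

    moved = [t for t in ref_common if t not in in_order]
    return {"deleted": deleted, "added": added, "moved": moved}


result = compare_titles(
    ["Scope", "Fees", "Term", "Liability"],
    ["Scope", "Term", "Fees", "Notices"],
)
```

In this example "Liability" is reported as deleted, "Notices" as added, and one of the swapped pair "Fees" and "Term" as moved, so the paragraphs under each shared title can still be compared with each other rather than with whatever now occupies the old position.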
In summary, modifying contract text differs from modifying ordinary text because the paragraph structure is also adjusted. Common text comparison tools currently often cannot recognize the deletion, addition or position change of a whole paragraph, or they easily misinterpret the editor's modification intent. The first step of the overall solution is therefore a paragraph-level comparison that identifies the modification intent more accurately: it identifies whether two paragraphs are the same paragraph, and distinguishes whether a paragraph has been added, deleted or simply moved.
(III) Paragraph comparison. As shown in FIG. 4, this step determines whether the paragraphs under the same clause title have been modified between the reference text and the comparison text. Contract clauses differ from ordinary text in that the content is often divided into sub-clauses, and sub-clauses tend to be changed as a whole. The comparison therefore has to consider addition, deletion and replacement, position changes of sub-clauses, and modification after a position change.
The clause paragraphs to be compared are concatenated in character order to form character strings, and the two character strings are compared.
The paragraphs under the same clause title in the two texts are compared to determine whether there is only a continuously deleted character string, only a continuously added character string, or both a continuously deleted and a continuously added character string;
the modified content is then classified as one or more of deletion, addition, replacement and position change according to the continuously modified character strings found.
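A minimal way to extract the continuous deleted and added runs is to read the edit operations reported by `difflib.SequenceMatcher`, as sketched below. The function name `diff_runs` and the dictionary layout are assumptions for illustration; the patent does not prescribe how the runs are represented.

```python
from difflib import SequenceMatcher


def diff_runs(ref: str, cmp: str):
    """List the continuous modified runs between two strings.

    Each run has type 'delete' (only deleted text), 'insert' (only added
    text) or 'replace' (deleted and added text at the same position).
    """
    sm = SequenceMatcher(None, ref, cmp, autojunk=False)
    runs = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            continue
        runs.append({"type": tag, "deleted": ref[i1:i2], "added": cmp[j1:j2]})
    return runs


print(diff_runs("The fee is 100 yuan.", "The fee is 200 yuan."))
print(diff_runs("abcdef", "abef"))
```

A `replace` run corresponds to the case where a deleted string and an added string share the same context, while `delete` and `insert` runs are the inputs to the length and similarity checks that follow.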
In the following, "added" and "deleted" always describe the comparison text relative to the reference text.
1) Specifically, if a continuously deleted character string and a continuously added character string share the same context, the modified part is marked as a replacement: the deleted content is replaced by the added content. For example, the characters "abc" are continuously deleted and new characters "abc" are continuously added.
If there is only a continuously deleted character string, the number of deleted characters is determined; if it is smaller than the first threshold, the character string is marked as a deleted character string;
if the number of deleted characters is greater than or equal to the first threshold, it is determined whether another character string similar to the continuously deleted character string exists; if not, the character string is marked as a deleted character string;
if so, it is marked as a position change.
2) If there is only a continuously added character string, the number of added characters is determined; if it is smaller than the first threshold, the character string is marked as an added character string;
if the number of added characters is greater than or equal to the first threshold, it is determined whether another character string similar to the continuously added character string exists; if not, the character string is marked as an added character string;
if so, it is marked as a position change.
In this embodiment, the first threshold, i.e. the character-count threshold, is 10 characters. A continuously deleted or continuously added character string of 10 or more characters must be checked for a possible position change; shorter character strings are not checked and are marked directly as deleted or added character strings.
3) The method for determining whether another character string similar to the continuously deleted or continuously added character string exists is as follows:
a comparison list is formed from all the continuously added (or, respectively, continuously deleted) character strings that may be similar to the continuously deleted (or continuously added) character string whose length is greater than or equal to the first threshold, and each character string in the comparison list is compared with it for similarity;
Example: the character string xxxttedfgq has been deleted from (is missing in) the comparison text, and several character strings have been added to it, such as xxtttedfgq, xxeetedfgqq and xaatteofgqq. All such continuously added character strings in the comparison text form the comparison list, and xxxttedfgq is compared with each entry in turn.
If the similarity is smaller than the second threshold, the character string (the character string in the comparison text) is marked as an added or deleted character string;
if the similarity is greater than or equal to the second threshold, it is marked as a position change, and the deleted or added parts of the character string are marked.
In this embodiment, the second threshold, i.e. the similarity threshold, is 60%: two character strings whose content is at least 60% similar are judged to be similar.
Example: the comparison finds the character string xxtttedfgq with a similarity greater than or equal to 60%. The character string that appeared to be deleted (missing) relative to the reference text thus has a similar counterpart, xxtttedfgq, among the added character strings, so the change is a position change.
Adjusting the second threshold appropriately tunes the precision and fault tolerance of the comparison.
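The decision logic of steps 1) to 3) can be sketched with the two threshold values given in this embodiment (10 characters and 60%). The function names, the return shape and the choice of the best-scoring candidate are assumptions for illustration; similarity is computed with `difflib.SequenceMatcher` as a stand-in for the gestalt pattern matching mentioned earlier.

```python
from difflib import SequenceMatcher

FIRST_THRESHOLD = 10    # character count, value from this embodiment
SECOND_THRESHOLD = 0.6  # similarity, value from this embodiment


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def classify_run(run: str, kind: str, comparison_list):
    """Classify one continuous run.

    kind is 'deleted' or 'added'; comparison_list holds the runs of the
    opposite kind. Returns (label, matched_string_or_None).
    """
    if len(run) < FIRST_THRESHOLD:
        # Short runs are never checked for a position change.
        return kind, None
    best = max(comparison_list, key=lambda s: similarity(run, s), default=None)
    if best is not None and similarity(run, best) >= SECOND_THRESHOLD:
        return "moved", best
    return kind, None


added_runs = ["xxtttedfgq", "xxeetedfgqq", "xaatteofgqq"]
print(classify_run("xxxttedfgq", "deleted", added_runs))
print(classify_run("abc", "deleted", added_runs))
```

With the example strings from the text, the deleted run xxxttedfgq is paired with xxtttedfgq and labeled as moved, while the three-character run stays a plain deletion.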
4) Although step 3) has established that the modification is a position change, the similarity is not necessarily 100%. The character strings whose position changed are therefore compared between the two texts to determine whether they contain deleted, added or replaced characters; if so, those characters are marked.
Example: comparing the deleted character string xxxttedfgq with the added character string xxtttedfgq shows that the third x of the deleted character string has been replaced by t;
after the automatic review, the system annotates this example with two modifications: 1. a position-change modification; 2. a replacement of x by t after the position change.
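The two annotations in the example can be reproduced with a small sketch that first records the position change and then diffs the moved pair character by character. The function name `annotate_moved` and the annotation fields are hypothetical.

```python
from difflib import SequenceMatcher


def annotate_moved(deleted: str, added: str):
    """Return the position-change note plus any edits inside the moved text."""
    notes = [{"type": "position_change", "from": deleted, "to": added}]
    sm = SequenceMatcher(None, deleted, added, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            notes.append({
                "type": tag,             # 'replace', 'delete' or 'insert'
                "old": deleted[i1:i2],
                "new": added[j1:j2],
                "offset": i1,            # 0-based index in the deleted string
            })
    return notes


notes = annotate_moved("xxxttedfgq", "xxtttedfgq")
for note in notes:
    print(note)
```

For the example strings this yields one position-change note and one replacement of x by t at offset 2, that is, the third character.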
5) Further, to build the comparison list in step 3), all character strings that meet the conditions must be found. They can be selected by comparing character counts or by comparing similarity, for example:
determining whether the difference between the character counts of the two character strings is within a third threshold; or
determining whether the content similarity of the character strings is greater than a fourth threshold.
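The candidate selection of step 5) might be sketched as below. The patent gives no values for the third and fourth thresholds, so `max_len_diff` and `min_similarity` are hypothetical parameters, and either criterion alone is enough to admit a candidate.

```python
from difflib import SequenceMatcher


def build_comparison_list(target, candidates, max_len_diff=None, min_similarity=None):
    """Select candidates by character-count difference or by similarity.

    max_len_diff plays the role of the third threshold and min_similarity
    that of the fourth; both values are assumptions supplied by the caller.
    """
    selected = []
    for cand in candidates:
        close_in_length = (
            max_len_diff is not None
            and abs(len(cand) - len(target)) <= max_len_diff
        )
        similar_enough = (
            min_similarity is not None
            and SequenceMatcher(None, target, cand, autojunk=False).ratio() > min_similarity
        )
        if close_in_length or similar_enough:
            selected.append(cand)
    return selected


candidates = ["xxtttedfgq", "ab", "xxeetedfgqq"]
print(build_comparison_list("xxxttedfgq", candidates, max_len_diff=1))
```

Filtering by length first is cheap and keeps the more expensive similarity computation for a shorter list.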
Example 2:
A contract review method which, on the basis of the text comparison method described in Embodiment 1, further determines whether the modified content of the contract text carries risk.
As shown in FIG. 5, an entity recognition module is called to analyze the entities in the modified content and to determine whether the modification is a risky one;
if the modified content contains a named entity, a high-risk modification is output; if the modified content does not contain a named entity, a general-risk modification is output.
Named entities include person names, company names, identity card numbers, company-related identification numbers, times, amounts and numbers.
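The risk rule itself is simple and can be sketched independently of the entity recognition module, which the patent does not specify. The regular expressions below are a hypothetical stand-in that only detects amounts, dates and numbers; recognizing person or company names would require a real named-entity recognizer passed in as `recognizer`.

```python
import re

# Hypothetical stand-in patterns for a few entity types.
ENTITY_PATTERNS = {
    "amount": re.compile(r"\d[\d,]*(\.\d+)?\s*(yuan|RMB|USD)", re.IGNORECASE),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "number": re.compile(r"\d+"),
}


def find_entities(text: str):
    """Return the names of the entity types found in the text."""
    return [name for name, pattern in ENTITY_PATTERNS.items() if pattern.search(text)]


def risk_level(modified_text: str, recognizer=find_entities) -> str:
    """High risk if the modified content contains any named entity."""
    return "high" if recognizer(modified_text) else "general"


print(risk_level("100 yuan"))
print(risk_level("shall promptly notify"))
```

Keeping the recognizer as a parameter lets the same rule run unchanged when the stand-in is replaced by a proper entity recognition module.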
Example 3:
An auditing system for implementing the text comparison method described in Embodiment 1, comprising:
the paragraph dividing module is used for dividing paragraphs;
the clause and title comparison module is used for calculating whether the clause of the new addition, deletion or position change exists;
a paragraph comparison module for calculating whether there is a character string added, deleted or changed in position;
and the risk analysis module is used for analyzing the risk level of the modified content.
In addition to the functional modules, the system includes a UI, namely the review interface, which specifically comprises:
a display module for displaying the modified content: the reference text document and the comparison text document are displayed separately, and the modified positions are highlighted and labeled according to their coordinates.
A risk detail module, which lists all modification positions and is linked to the display module by coordinates. For each modification, the reviewer is given "pass" and "fail" buttons. When the reviewer clicks "pass", the position is marked as risk-free; when the reviewer clicks "fail", the original risk assessment is kept and a window for entering a review comment is provided.
A risk statistics module for counting the number of remaining risks in real time.
A traditional comparison tool is often just a stand-alone piece of comparison software; when comparison work needs to be embedded in a company's actual operating workflow, such a tool cannot cooperate effectively at the software level. Reviewers also risk missed audits and erroneous audits because of fatigue, defects in the comparison tool, and the like. The comparison tool therefore needs well-designed pages, software architecture and interfaces so that it integrates at the software level with the company's existing workflow, avoiding risks and improving efficiency.
It should further be noted that, to be embedded in a user's digital system, the comparison results need a general-purpose data interchange format. The comparison result is output in JSON format and includes information on the text paragraphs, information on the paragraph modifications, the coordinates of the modified positions, the risk level, and so on. The software and its interfaces follow RESTful conventions.
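As an illustration of such a result, the sketch below builds one possible JSON payload in Python. Every field name and value here is an assumption made for the example; the patent only lists the kinds of information the result carries.

```python
import json

# All field names and values are hypothetical. The patent only states that the
# result carries paragraph information, modification information, the
# coordinates of modified positions, and the risk level.
comparison_result = {
    "paragraphs": [
        {"id": 0, "clause_title": "Article 1 Payment", "page": 1},
    ],
    "modifications": [
        {
            "paragraph_id": 0,
            "type": "replacement",
            "deleted": "30 days",
            "added": "60 days",
            "coordinates": {"page": 1, "x": 120.5, "y": 342.0,
                            "width": 48.0, "height": 12.0},
            "risk_level": "high",
        },
    ],
}

# ensure_ascii=False keeps Chinese contract text readable in the output.
payload = json.dumps(comparison_result, ensure_ascii=False, indent=2)
```

A payload of this kind could be returned from a RESTful endpoint so that the display module and a customer's own systems consume the same structure.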
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted, or not performed.
The units may or may not be physically separate, and the components shown as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a device (which may be a single-chip microcomputer, a chip or the like) or a processor to perform all or part of the steps of the method described in the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the present invention is not limited thereto, but any changes or substitutions within the technical scope of the present invention should be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (11)

1. A text comparison method, characterized by comprising the following steps:
dividing the reference text and the comparison text into paragraphs respectively;
parsing the clause titles to form a clause title list, and determining, according to the degree of overlap between the clause title lists of the two texts, whether any clause has been added, deleted or changed in position;
comparing the paragraphs under the same clause title in the two texts, and determining whether there is a continuously deleted character string, or a continuously added character string, or both a continuously deleted character string and a continuously added character string;
judging the modified content to be one or more of deletion, addition, replacement and position change according to the content of the continuously modified character strings found;
if there is both a continuously deleted character string and a continuously added character string, and the two have the same context, the modified part is marked as a replacement in which the deleted character string is replaced by the added character string; the terms deletion, addition, replacement and position change all describe the comparison text relative to the reference text.
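For illustration only, the paragraph-level comparison in claim 1 can be sketched with Python's standard `difflib` module. The patent does not name a diff algorithm, so `SequenceMatcher` is a swapped-in technique, and `diff_paragraph` is a hypothetical helper. A `replace` opcode corresponds to a deleted run and an added run that share the same surrounding context.

```python
from difflib import SequenceMatcher

# Map difflib opcode tags to the change types used in the claim.
CHANGE_TYPES = {"delete": "deletion", "insert": "addition", "replace": "replacement"}


def diff_paragraph(reference: str, comparison: str) -> list[dict]:
    """List contiguous character-level changes of comparison against reference."""
    matcher = SequenceMatcher(None, reference, comparison, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged run, nothing to report
        changes.append({
            "type": CHANGE_TYPES[tag],
            "deleted": reference[i1:i2],   # empty for a pure addition
            "added": comparison[j1:j2],    # empty for a pure deletion
        })
    return changes
```

Position changes are not detected by this step alone; they require comparing long deleted and added runs against each other, as claims 2 and 3 describe.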
2. The text comparison method according to claim 1, wherein if there is a continuously deleted or a continuously added character string, the number of deleted or added characters is determined, and if that number is smaller than a first threshold, the character string is marked as deleted or added;
if the number of deleted or added characters is greater than or equal to the first threshold, it is determined whether another character string similar to the continuously deleted or continuously added character string exists;
if not, the character string is marked as an added character string or a deleted character string;
if so, it is marked as a position change.
3. The text comparison method according to claim 2, wherein the method of determining whether another character string similar to the continuously deleted or continuously added character string exists comprises:
forming a comparison list from all other continuously added or continuously deleted character strings, and comparing the similarity of each character string in the comparison list with the continuously deleted or continuously added character string whose length is greater than or equal to the first threshold;
if the similarity is smaller than a second threshold, marking the character string as added or deleted;
and if the similarity is greater than or equal to the second threshold, marking it as a position change.
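The two-threshold logic of claims 2 and 3 can be sketched as follows. The threshold values, the use of `difflib`'s ratio as the similarity measure, and the function names are all assumptions for illustration; the claims do not fix any of them.

```python
from difflib import SequenceMatcher

FIRST_THRESHOLD = 10     # hypothetical minimum run length for move detection
SECOND_THRESHOLD = 0.8   # hypothetical minimum similarity for a position change


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; difflib's ratio is an assumed stand-in measure."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def label_run(run: str, kind: str, comparison_list: list[str]) -> str:
    """Label one changed run.

    kind is "deleted" or "added"; comparison_list holds the other changed runs
    that this run is compared against.
    """
    if len(run) < FIRST_THRESHOLD:
        return kind  # short runs are never treated as moved text
    best = max((similarity(run, other) for other in comparison_list), default=0.0)
    return "position change" if best >= SECOND_THRESHOLD else kind
```

The first threshold keeps short edits such as a changed number from being matched against unrelated short strings elsewhere in the document.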
4. A text comparison method according to claim 2 or 3, wherein if a character string is marked as a position change, the position-changed character strings of the two texts are compared to determine whether deleted or added characters exist within them, and if so, the deleted or added characters are marked.
5. The text comparison method according to claim 1, wherein the clause title lists of the two texts are compared to determine which clause titles are repeated in both lists;
for the repeated clause titles, the specific paragraphs within the clauses are compared in turn, in order, under the same clause title; for a non-repeated clause title, the clause is marked as a wholly deleted clause if the title is in the reference text, and as a wholly added clause if the title is in the comparison text.
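A minimal sketch of the clause title comparison follows. It assumes exact title matching, whereas the claims speak more generally of a degree of overlap; `compare_clause_titles` and the keys of its result are hypothetical names.

```python
def compare_clause_titles(ref_titles: list[str], cmp_titles: list[str]) -> dict:
    """Split clause titles into shared, wholly deleted and wholly added.

    Assumes titles match exactly; a real system might use fuzzy matching.
    """
    ref_set, cmp_set = set(ref_titles), set(cmp_titles)
    shared_ref_order = [t for t in ref_titles if t in cmp_set]
    shared_cmp_order = [t for t in cmp_titles if t in ref_set]
    return {
        # Titles in both texts: their paragraphs go on to paragraph comparison.
        "shared": shared_ref_order,
        # Titles only in the reference text: wholly deleted clauses.
        "deleted": [t for t in ref_titles if t not in cmp_set],
        # Titles only in the comparison text: wholly added clauses.
        "added": [t for t in cmp_titles if t not in ref_set],
        # True when the shared clauses appear in a different order.
        "reordered": shared_ref_order != shared_cmp_order,
    }
```

For example, comparing `["Payment", "Delivery", "Liability"]` with `["Delivery", "Payment", "Confidentiality"]` reports "Liability" as deleted, "Confidentiality" as added, and the shared clauses as reordered.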
6. The text comparison method of claim 1, further comprising text preprocessing, which comprises:
de-duplicating characters according to each character and its coordinates;
setting a threshold according to the overall abscissa of the text, and filtering out text characters whose abscissa exceeds the threshold;
and comparing the coordinates of the tables with the coordinates of all the characters, and separately marking the characters belonging to each table to form independent paragraphs.
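The preprocessing steps can be sketched as below, assuming the first step of the translated claim means removing duplicated characters (for example, characters drawn twice at the same position in a PDF). The `Char` record, the tolerance, and the bounding-box layout are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Char:
    text: str
    page: int
    x: float   # abscissa
    y: float   # ordinate


def deduplicate(chars: list[Char], tolerance: float = 0.5) -> list[Char]:
    """Drop characters repeated at nearly the same position, keeping the first.

    Positions are bucketed by `tolerance`; duplicates that straddle a bucket
    boundary are not caught by this simplification.
    """
    seen, unique = set(), []
    for c in chars:
        key = (c.text, c.page, round(c.x / tolerance), round(c.y / tolerance))
        if key not in seen:
            seen.add(key)
            unique.append(c)
    return unique


def filter_by_abscissa(chars: list[Char], x_min: float, x_max: float) -> list[Char]:
    """Keep only characters whose abscissa lies within the text body."""
    return [c for c in chars if x_min <= c.x <= x_max]


def split_table_chars(chars: list[Char], table_box: tuple) -> tuple[list[Char], list[Char]]:
    """Separate characters inside a table box (page, x0, y0, x1, y1) from the rest."""
    page, x0, y0, x1, y1 = table_box
    inside, outside = [], []
    for c in chars:
        in_box = c.page == page and x0 <= c.x <= x1 and y0 <= c.y <= y1
        (inside if in_box else outside).append(c)
    return inside, outside
```

The characters returned as `inside` would form an independent paragraph for that table, so that table cells are not mixed into the surrounding body text.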
7. A text comparison method according to claim 1 or 6, wherein the method of paragraph division comprises:
creating a regular expression for the text paragraph start marker, using it to extract the position of the start of each paragraph in the text, and assigning the text between two paragraph starts to the paragraph to which the first start belongs;
creating a regular expression for the text paragraph end marker, matching the text after the start of the last paragraph against this ending regular expression, and forming the text between the start of the last paragraph and the matched ending into the last text paragraph;
creating a regular expression for the attachment start, and assigning the text between two identified attachment starts to the attachment paragraph of the first start, the last attachment paragraph containing all text after the start of the last attachment;
creating, in order of position, paragraphs for any text lying between the paragraphs that has not been assigned to a paragraph;
and finally, sorting all the divided paragraphs by page number and by the ordinate of their first character, to obtain structured text divided into paragraphs.
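The first step of this division, splitting on paragraph start markers, can be sketched as below. The start pattern is a hypothetical example for English-style numbering ("Article 1", "1.", "1.2"); the patent does not give its actual expressions, and the ending, attachment and sorting steps are omitted from this sketch.

```python
import re

# Hypothetical start marker: "Article 1", "1. ", or "1.2 " at the start of a line.
PARAGRAPH_START = re.compile(r"^(?:Article\s+\d+|\d+(?:\.\d+)*\.?\s)", re.MULTILINE)


def split_paragraphs(text: str) -> list[str]:
    """Assign the text between two start markers to the first marker's paragraph."""
    starts = [m.start() for m in PARAGRAPH_START.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []
    paragraphs = []
    leading = text[:starts[0]].strip()
    if leading:
        paragraphs.append(leading)  # ungrouped text before the first marker
    # Each paragraph runs from its own start to the next start (or end of text).
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        paragraphs.append(text[begin:end].strip())
    return paragraphs
```

Text before the first marker, such as a contract title, becomes its own paragraph here, which mirrors the step of creating paragraphs for text that was not otherwise grouped.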
8. A contract review method, characterized in that the text comparison method according to any one of claims 1 to 7 is used to identify the modified content of a file;
an entity identification module is then called to analyze the entity content of the modified content and to judge whether the modified content contains a risky modification;
if the modified content contains a named entity, a high-risk modification is output; if the modified content does not contain a named entity, a general-risk modification is output.
9. The contract review method as claimed in claim 8, wherein the named entities include person names, company names, identity card numbers, company-related identification numbers, times, amounts and digits.
10. An auditing system for implementing the text comparison method of any one of claims 1-7 and the contract review method of claim 8 or 9, comprising: a paragraph dividing module for dividing paragraphs;
a clause comparison module for determining whether any clause has been added, deleted or changed in position;
a paragraph comparison module for determining whether any character string has been added, deleted or changed in position;
and a risk assessment module for analyzing the risk level of the modified content.
11. The auditing system of claim 10, further comprising:
a display module for displaying the modified content, showing the reference text and the comparison text documents separately, and highlighting and labeling the modified positions according to their coordinates;
a risk detail module, which lists all modified positions and is linked to the display module by coordinates;
and a risk statistics module for counting the number of remaining risks in real time.
CN202110331873.0A 2021-03-29 2021-03-29 Text comparison method, contract review method and auditing system Active CN112926299B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110331873.0A CN112926299B (en) 2021-03-29 2021-03-29 Text comparison method, contract review method and auditing system

Publications (2)

Publication Number Publication Date
CN112926299A CN112926299A (en) 2021-06-08
CN112926299B CN112926299B (en) 2024-04-09

Family

ID=76176328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110331873.0A Active CN112926299B (en) 2021-03-29 2021-03-29 Text comparison method, contract review method and auditing system

Country Status (1)

Country Link
CN (1) CN112926299B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant