CN108984593A

CN108984593A - Method for multi-format document entry and comparison

Info

Publication number: CN108984593A
Application number: CN201810549599.2A
Authority: CN
Inventors: 鞠非; 华凯; 顾梅; 吴国奇; 汤丹
Original assignee: Changzhou Power Supply Branch Jiangsu Electric Power Co Ltd; State Grid Corp of China SGCC; State Grid Jiangsu Electric Power Co Ltd
Current assignee: Changzhou Power Supply Branch Jiangsu Electric Power Co Ltd; State Grid Corp of China SGCC; State Grid Jiangsu Electric Power Co Ltd
Priority date: 2013-12-18
Filing date: 2013-12-18
Publication date: 2018-12-11
Also published as: CN108804624A; CN103823838A; CN108959203A; CN103823838B

Abstract

The present invention relates to a method for entering and comparing multi-format documents. The method first determines whether a document to be entered is a paper document. If it is, front-end equipment automatically scans the paper document into an original-format document library; if it is an electronic document, it is stored directly in the original-format document library. All documents in the original-format document library are then converted into documents of a unified format, after which key attributes are marked and basic management is performed on the documents. Finally, content-based document comparison is carried out using the Nakatsu algorithm and a word segmentation system, and documents are associated according to the degree of similarity found by the comparison and recorded in a database. The invention enables automatic entry, unified classification, intelligent management and comparison with existing files for documents of all types and formats, which improves document utilization, saves comparison time and raises document management efficiency.

Description

Method for multi-format document entry and comparison

This application is a divisional application of the invention patent application with application number 201310696955.0, entitled "A method for multi-format document entry and comparison", filed on December 18, 2013.

Technical field

The present invention relates to the field of document processing and management, and more particularly to a method for entering electronic documents or paper documents and comparing them.

Background art

Typical applications of document comparison technology at present include: (1) Intelligent information retrieval: a search engine responds to the keywords a user enters by listing all information that matches them. (2) Automatic question answering systems: the questions in such systems are highly varied and very numerous, and many of them are very similar. Answering them manually takes a great deal of time and labor; if text similarity technology is applied so that highly similar questions are grouped into one class and the system answers that class automatically, a great deal of time is saved. (3) Text duplicate checking: in certain fields, considerations of privacy and originality require that texts not be duplicated. Applying text similarity technology to compute the similarity of such texts reveals which texts occur repeatedly. As the above shows, document comparison technology is being applied in more and more fields.

At present, research on comparing, analyzing and managing documents is concentrated mainly on text similarity computation, and that computation focuses on string similarity. Relatively mature clustering algorithms have been developed, but these algorithms do not take the semantics of the text or characters into account during comparison, so the similarity values they produce are of limited use as a reference in practical applications. Text similarity can also be computed through word segmentation: a Chinese word segmentation algorithm segments the text from a semantic perspective, and the similarity between texts is then computed by combining the segmentation with an alignment algorithm, so that similarity between documents is obtained at the word level. However, these comparison methods support only plain TXT or Word files. Documents in multiple formats cannot be compared directly and must first be converted manually, which greatly reduces working efficiency.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method that can enter documents of multiple formats and compare them.

The technical solution that achieves the object of the invention is a method for multi-format document entry and comparison, comprising the following steps:

1. Determine whether the document to be entered is a paper file. If it is, stack the paper files neatly in order and place them in a scanning device; the scanning device scans the files into electronic documents in PDF format and stores them in the original-format document library on the storage device of a computer electrically connected to the scanning device;

If the document is an electronic document in one of multiple formats, including PDF, Word or TXT, it is stored directly in the original-format document library on the storage device of the computer;

2. The computer converts each electronic document in the original-format document library into a document of a unified format and stores it in the unified-format document library on the storage device of the computer. The target file format can be set as required; the preferred file format is Word or TXT. If the file format of an original electronic document already matches the configured target format, the document is copied directly from the original-format document library to the unified-format document library;

3. For the content of each electronic document that has been converted to the unified Word or TXT format, a word segmentation system extracts the content of the document into a set of sentences, which is stored in a sentence data table as entries corresponding to that document;

4. Each electronic document that has been converted to the unified Word or TXT format is marked with key attributes, including category, title, source, keywords and creation time, and these are stored in the sentence data table as entries corresponding to that document;

5. Select the most recently entered document in the unified-format document library, or any other document in that library, as the document to be compared against all other documents in the library. First, using the sentence data table, compare and match the documents on their key attributes (category, title, source, keywords, creation time), and thereby screen out from the unified-format document library every document in which any one of these key attributes matches any one of the five key attributes (category, title, source, keywords, creation time) of the document to be compared;

6. Each document screened out in step 5 is taken in turn as a reference document and compared with the document to be compared, using the entry information for each document obtained in step 3 and held in the sentence data table. The two documents are compared sentence by sentence: sentences are compared one by one according to the Nakatsu algorithm to compute the similarity between sentences, and the overall similarity of the two documents is then computed from the sentence similarities by arithmetic mean;

7. Record the overall similarity between the document to be compared and each reference document, as obtained in step 6, in the corresponding database.
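The attribute screening of step 5 can be sketched as follows. This is only an illustration under stated assumptions: documents are represented as plain dictionaries, the field names in `KEY_ATTRIBUTES` are hypothetical, and a "match" is taken to mean exact equality of the same-named attribute, which the text does not specify.

```python
from typing import Dict, List

# Hypothetical field names for the five key attributes named in step 4.
KEY_ATTRIBUTES = ("category", "title", "source", "keywords", "creation_time")


def matches_any_attribute(candidate: Dict[str, str], target: Dict[str, str]) -> bool:
    """True if at least one key attribute is present and equal in both documents."""
    return any(
        candidate.get(name) is not None and candidate.get(name) == target.get(name)
        for name in KEY_ATTRIBUTES
    )


def screen_documents(library: List[Dict[str, str]], target: Dict[str, str]) -> List[Dict[str, str]]:
    """Return every other document in the library that shares a key attribute with target."""
    return [
        doc for doc in library
        if doc is not target and matches_any_attribute(doc, target)
    ]


doc_a = {"category": "report", "title": "Grid inspection 2013"}
doc_b = {"category": "report", "title": "Transformer maintenance"}
doc_c = {"category": "memo", "title": "Holiday schedule"}
reference_documents = screen_documents([doc_a, doc_b, doc_c], doc_a)
```

Here only `doc_b` is kept as a reference document, because it shares the category of `doc_a`. A looser notion of matching (for example overlapping keywords) would change only `matches_any_attribute`.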

In step 3, the specific process by which the word segmentation system extracts the content of each document into a set of sentences is as follows: each document is decomposed to form a document decomposition tree. The decomposition tree of each document contains n (n ≥ 1) sentences, and the sentences are stored in matrix form. Each sentence consists of a row number, column number, length, content and similarity, so the matrix entry for the n-th sentence consists of row number n, column number n, length n, content n and similarity n.
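One possible in-memory form of the per-sentence entry described above is sketched below. The class name `SentenceRecord`, its field names and the helper `build_records` are hypothetical; the text only lists the five components and states that the n-th sentence carries row number n and column number n.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SentenceRecord:
    row: int                             # row number of the sentence
    column: int                          # column number of the sentence
    length: int                          # number of characters in the sentence
    content: str                         # sentence text
    similarity: Optional[float] = None   # filled in later, during comparison


def build_records(sentences: List[str]) -> List[SentenceRecord]:
    """Number the sentences from 1, giving the n-th sentence row n and column n."""
    return [
        SentenceRecord(row=n, column=n, length=len(text), content=text)
        for n, text in enumerate(sentences, start=1)
    ]


records = build_records(["First sentence.", "Second one."])
```

The similarity field is left empty at extraction time on the assumption that it is only known once a comparison has been run.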

In step 6, the specific method of comparing sentences one by one according to the Nakatsu algorithm to compute the similarity between sentences is as follows. Let the two sentences to be compared be sentence A and sentence B. First compute the length of the longest common subsequence of sentence A and sentence B, denoted MaxLen(A, B). Specifically, let M = Len(A) and N = Len(B), i.e. M is the length of string A and N is the length of string B; without loss of generality, assume M ≤ N;

Let A = a1a2…aM, meaning that A consists of the M characters a1, a2, …, aM;

B = b1b2…bN, meaning that B consists of the N characters b1, b2, …, bN;

Then MaxLen(i, j) = MaxLen(a1a2…ai, b1b2…bj), where 1 ≤ i ≤ M and 1 ≤ j ≤ N;

Let L(k, i) denote the minimum j such that the string a1a2…ai and the string b1b2…bj have an LCS (Longest Common Subsequence) of length k. As a formula: L(k, i) = Min{ j } where LCS(i, j) = k;

Step one: initialize the arrays LL() and P();

LL(0) = 0

LL(i) = V, 1 ≤ i ≤ M

P(i) = V, 1 ≤ i ≤ M

At this point, LL(0) represents L(0, 0); LL(1) represents L(1, 0); LL(2) represents L(2, 1); …

Step two: compute the elements on the first diagonal in turn, using a temporary variable T to compute L(1, 1): T = F(L(0, 0), L(1, 0)) = F(LL(0), LL(1)). Note: F denotes the minimum operation. Assign the value of T to LL(1); LL(1) now represents L(1, 1), and LL(2) represents L(2, 1). Repeat the above computation until the whole diagonal has been computed. If a value is the first in row k that is not V, assign it to P(k);

After the first diagonal has been computed, LL(0) represents L(0, 1); LL(1) represents L(1, 1); LL(2) represents L(2, 2); …;

If this diagonal is not a solution, repeat step two and compute the next diagonal until a solution is found. Note, however, that the i-th diagonal has only m-i+1 elements, so the computation only needs to run up to LL(m-i+1).

If all the elements of some diagonal are V, then all elements after that diagonal are also V and need not be computed.
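As a cross-check for the quantity MaxLen(A, B), the sketch below computes the longest-common-subsequence length with the textbook dynamic-programming table. It is deliberately not the diagonal-by-diagonal Nakatsu scheme described above; it is a simpler substitute that should yield the same length. The function name `lcs_length` is hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b, i.e. MaxLen(A, B)."""
    m, n = len(a), len(b)
    # table[i][j] holds MaxLen(i, j) for the prefixes a[:i] and b[:j];
    # row 0 and column 0 stay 0 because an empty prefix shares nothing.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                # Matching characters extend the best subsequence of the shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise drop one character from either string and keep the better result.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]


print(lcs_length("kitten", "sitting"))  # common subsequence "ittn"
```

The table costs O(M·N) time and memory; Nakatsu's method is generally regarded as attractive when the two strings are very similar, which is the case of interest for duplicate detection.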

Next, compute the edit distance between sentence A and sentence B, denoted LD(A, B). Clearly, LD(A, B) = 0 means that sentence A and sentence B are identical.

A = a1a2…aN, meaning that A consists of the N characters a1, a2, …, aN, with Len(A) = N;

B = b1b2…bM, meaning that B consists of the M characters b1, b2, …, bM, with Len(B) = M;

Define LD(i, j) = LD(a1a2…ai, b1b2…bj), where 0 ≤ i ≤ N and 0 ≤ j ≤ M;

Initialize the LD matrix: given LD(N, M) = LD(A, B), set the initial values LD(0, 0) = 0, LD(0, j) = j and LD(i, 0) = i;

Compute the remaining rows of the LD matrix according to the formula: if ai = bj, then LD(i, j) = LD(i-1, j-1); if ai ≠ bj, then LD(i, j) = Min(LD(i-1, j-1), LD(i-1, j), LD(i, j-1)) + 1. This finally yields the value of LD(A, B);
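The LD recurrence above translates almost line for line into code. The following is a minimal sketch; the function name `edit_distance` is hypothetical, and indices are shifted by one because Python strings are zero-based.

```python
def edit_distance(a: str, b: str) -> int:
    """Edit distance LD(A, B) computed with the recurrence given in the text."""
    n, m = len(a), len(b)
    ld = [[0] * (m + 1) for _ in range(n + 1)]
    # Initial values: LD(i, 0) = i and LD(0, j) = j (so LD(0, 0) = 0).
    for i in range(n + 1):
        ld[i][0] = i
    for j in range(m + 1):
        ld[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                # Equal characters cost nothing.
                ld[i][j] = ld[i - 1][j - 1]
            else:
                # Substitution, deletion or insertion, whichever is cheapest, plus one.
                ld[i][j] = min(ld[i - 1][j - 1], ld[i - 1][j], ld[i][j - 1]) + 1
    return ld[n][m]


print(edit_distance("kitten", "sitting"))
```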

Compute the similarity of sentence A and sentence B: SIM(A, B) = LCS(A, B) / (LD(A, B) + LCS(A, B)), where LCS(A, B) is the length of the longest common subsequence of sentence A and sentence B.
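A small worked instance of the SIM formula may help. The sketch below takes the two already-computed quantities as plain numbers; the function name `sentence_similarity` is hypothetical, and returning 1.0 when both values are zero (two empty sentences) is an assumption, since the formula itself is undefined in that case.

```python
def sentence_similarity(lcs_len: int, ld: int) -> float:
    """SIM = LCS / (LD + LCS), given the LCS length and the edit distance."""
    denominator = ld + lcs_len
    if denominator == 0:
        # Assumption: both sentences are empty, so they are treated as identical.
        return 1.0
    return lcs_len / denominator


# For "kitten" and "sitting" the LCS length is 4 and the edit distance is 3.
print(round(sentence_similarity(4, 3), 3))  # 4 / 7
```

Identical non-empty sentences give LD = 0 and therefore SIM = 1, while sentences with no common characters give LCS = 0 and SIM = 0, so the value always lies between 0 and 1.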

In step 2, an electronic document in PDF format is converted as follows: first extract the content stream of each page in the PDF document, then decrypt the extracted content streams, then decode the decrypted content streams with the Filter decoding algorithm, and finally extract the text content and its related information from the decoded content streams and store them as a document of the configured unified format.
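The decode-then-extract part of this conversion can be illustrated on a toy content stream. The sketch assumes the stream is already decrypted and uses the FlateDecode filter (zlib/deflate), which is one common PDF stream filter; the text does not say which filters are handled. The helper names are hypothetical, and the regular expression only recognizes simple `(text) Tj` operators, so it is not a substitute for a real PDF parser.

```python
import re
import zlib


def decode_flate_stream(raw: bytes) -> bytes:
    """Decode a FlateDecode-compressed content stream."""
    return zlib.decompress(raw)


def extract_text(content: bytes) -> str:
    """Collect the literal strings shown by simple '(text) Tj' operators.

    Escaped parentheses, TJ arrays and font encodings are ignored here.
    """
    pieces = re.findall(rb"\(([^()\\]*)\)\s*Tj", content)
    return " ".join(piece.decode("latin-1") for piece in pieces)


# Build a small compressed stream in place of one read from a PDF file.
sample_stream = zlib.compress(b"BT /F1 12 Tf (Hello) Tj (World) Tj ET")
text = extract_text(decode_flate_stream(sample_stream))
print(text)
```

Locating the page content streams and decrypting them are left out, since both depend on the file structure and the security handler of the particular PDF.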

In step 1, the scanning device is preferably a scanner.

The present invention has the following positive effects: (1) The method for multi-format document entry and comparison of the invention can enter paper documents or electronic documents of all types into a document library and unify their format to facilitate management and comparison, which improves document utilization, saves comparison time and raises document management efficiency.

(2) The method for multi-format document entry and comparison of the invention compares sentences one by one using the Nakatsu algorithm to compute the similarity between sentences, and then computes the overall similarity of two documents from the sentence similarities by arithmetic mean. The similarity of the two documents is computed more accurately, and the comparison effect is better.

(3) The method for multi-format document entry and comparison of the invention extracts the content of each document into a set of sentences through the word segmentation system. Each document is decomposed to form a document decomposition tree; the decomposition tree of each document contains n (n ≥ 1) sentences, and the sentences are stored in matrix form. Each sentence consists of a row number, column number, length, content and similarity, so the matrix entry for the n-th sentence consists of row number n, column number n, length n, content n and similarity n. The document tree that the word segmentation system produces for each document is fine-grained and detailed, which improves the precision of the subsequent comparison and raises document management efficiency.

Brief description of the drawings

Fig. 1 is a flow diagram of the method for multi-format document entry and comparison of the present invention;

Fig. 2 is a schematic diagram of the detailed process of the word segmentation system in step 3 of the present invention.

Specific embodiment

(Embodiment 1)

Referring to Fig. 1, the method for multi-format document entry and comparison of this embodiment comprises the following steps:

1. Determine whether the document to be entered is a paper file. If it is, stack the paper files neatly in order and place them in a scanning device; the scanning device scans the files into electronic documents in PDF format and stores them in the original-format document library on the storage device of a computer electrically connected to the scanning device. The scanning device is preferably a scanner;

2. The computer converts each electronic document in the original-format document library into a document of a unified format and stores it in the unified-format document library on the storage device of the computer. The target file format can be set as required; the preferred file format is Word or TXT. If the file format of an original electronic document already matches the configured target format, the document is copied directly from the original-format document library to the unified-format document library. In addition, an electronic document in PDF format is converted as follows: first extract the content stream of each page in the PDF document, then decrypt the extracted content streams, then decode the decrypted content streams with the Filter decoding algorithm, and finally extract the text content and its related information from the decoded content streams and store them as a document of the configured unified format.

3. For the content of each electronic document that has been converted to the unified Word or TXT format, the word segmentation system extracts the content of the document into a set of sentences, which is stored in the sentence data table as entries corresponding to that document. The word segmentation system cuts a sequence of Chinese characters into individual words; that is, it recombines a continuous character sequence into a word sequence according to a certain specification. For example, the word segmentation system decomposes the phrase "increase supervision intensity" into the three words "increase", "supervision" and "intensity";
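The segmentation example in step 3 can be reproduced with a very small dictionary-based segmenter. The sketch uses forward maximum matching, which is a common baseline and not necessarily the method used by the word segmentation system of this embodiment. The function name, the tiny vocabulary and the assumption that the original Chinese phrase was 加大监管力度 are all illustrative.

```python
from typing import List, Set


def forward_max_match(text: str, vocabulary: Set[str], max_len: int = 4) -> List[str]:
    """Greedy segmentation: at each position take the longest word found in the vocabulary."""
    words: List[str] = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until a known word is found.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                pos += size
                break
    return words


vocabulary = {"加大", "监管", "力度"}  # "increase", "supervision", "intensity"
print(forward_max_match("加大监管力度", vocabulary))
# → ['加大', '监管', '力度']
```

Characters that are not covered by the vocabulary fall out as single-character words, which is the usual fallback for this kind of segmenter.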

Referring to Fig. 2, the specific process by which the word segmentation system extracts the content of each document into a set of sentences is as follows: each document is decomposed to form a document decomposition tree. The decomposition tree of each document contains n (n ≥ 1) sentences, and the sentences are stored in matrix form. Each sentence consists of a row number, column number, length, content and similarity, so the matrix entry for the n-th sentence consists of row number n, column number n, length n, content n and similarity n.

4. Each electronic document that has been converted to the unified Word or TXT format is marked manually with key attributes, including category, title, source, keywords and creation time, and these are stored in the sentence data table as entries corresponding to that document.

5. Select the most recently entered document in the unified-format document library, or any other document in that library, as the document to be compared against all other documents in the library. First, using the sentence data table, compare and match the documents on their key attributes (category, title, source, keywords, creation time), and thereby screen out from the unified-format document library every document in which any one of these key attributes matches any one of the five key attributes (category, title, source, keywords, creation time) of the document to be compared.

6. Each document screened out in step 5 is taken in turn as a reference document and compared with the document to be compared, using the entry information for each document obtained in step 3 and held in the sentence data table. The two documents are compared sentence by sentence: sentences are compared one by one according to the Nakatsu algorithm to compute the similarity between sentences, and the overall similarity of the two documents (any one reference document and the document to be compared) is then computed from the sentence similarities by arithmetic mean.

The specific method of comparing sentences one by one according to the Nakatsu algorithm to compute the similarity between sentences is as follows. Let the two sentences to be compared be sentence A and sentence B. First compute the length of the longest common subsequence of sentence A and sentence B, denoted MaxLen(A, B). Specifically, let M = Len(A) and N = Len(B), i.e. M is the length of string A and N is the length of string B; without loss of generality, assume M ≤ N;

Let A = a1a2…aM, meaning that A consists of the M characters a1, a2, …, aM;

Step one: initialize the arrays LL() and P();

LL(0) = 0

LL(i) = V, 1 ≤ i ≤ M

P(i) = V, 1 ≤ i ≤ M

7. Record the overall similarity between the document to be compared and each reference document, as obtained in step 6, in the corresponding database.
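The arithmetic-mean step of step 6, whose result is recorded in step 7, can be sketched as follows. The text does not say how sentences of the two documents are paired, so the sketch assumes that each sentence of the document to be compared is scored against its best-matching sentence in the reference document. The function names are hypothetical, and the sentence similarity is passed in as a callable so that any SIM implementation can be plugged in.

```python
from typing import Callable, List


def document_similarity(
    target: List[str],
    reference: List[str],
    sim: Callable[[str, str], float],
) -> float:
    """Arithmetic mean of the best per-sentence similarities (pairing rule is an assumption)."""
    if not target or not reference:
        return 0.0
    best_scores = [max(sim(sentence, other) for other in reference) for sentence in target]
    return sum(best_scores) / len(best_scores)


def exact_match(a: str, b: str) -> float:
    """Stand-in sentence similarity used only for this demonstration."""
    return 1.0 if a == b else 0.0


score = document_similarity(["a b c", "d e f"], ["d e f", "x y z"], exact_match)
print(score)
```

With the stand-in similarity, one of the two sentences has an exact counterpart, so the overall score is 0.5. Under this pairing rule the score is not symmetric in the two documents; averaging both directions would be one way to make it so.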

Claims

1. A method for multi-format document entry and comparison, comprising the following steps:

3. For the content of each electronic document that has been converted to the unified Word or TXT format, a word segmentation system extracts the content of the document into a set of sentences, which is stored in a sentence data table as entries corresponding to that document; the specific process by which the word segmentation system extracts the content of each document into a set of sentences is that each document is decomposed to form a document decomposition tree, the decomposition tree of each document contains n (n ≥ 1) sentences, the sentences are stored in matrix form, each sentence consists of a row number, column number, length, content and similarity, and the matrix entry for the n-th sentence consists of row number n, column number n, length n, content n and similarity n;

6. Each document screened out in step 5 is taken in turn as a reference document and compared with the document to be compared, using the entry information for each document obtained in step 3 and held in the sentence data table; the two documents are compared sentence by sentence, sentences being compared one by one according to the Nakatsu algorithm to compute the similarity between sentences, and the overall similarity of the two documents is then computed from the sentence similarities by arithmetic mean: let the two sentences to be compared be sentence A and sentence B; first compute the length of the longest common subsequence of sentence A and sentence B, denoted MaxLen(A, B); specifically, let M = Len(A) and N = Len(B), i.e. M is the length of string A and N is the length of string B; without loss of generality, assume M ≤ N;

Let A = a1a2…aM, meaning that A consists of the M characters a1, a2, …, aM;

Step one: initialize the arrays LL() and P();

LL(0) = 0

LL(i) = V, 1 ≤ i ≤ M

P(i) = V, 1 ≤ i ≤ M

Step two: compute the elements on the first diagonal in turn, using a temporary variable T to compute L(1, 1): T = F(L(0, 0), L(1, 0)) = F(LL(0), LL(1));

F denotes the minimum operation; the value of T is assigned to LL(1);

LL(1) now represents L(1, 1), and LL(2) represents L(2, 1); repeat the above computation until the whole diagonal has been computed; if a value is the first in row k that is not V, assign it to P(k);

If this diagonal is not a solution, repeat step two and compute the next diagonal until a solution is found;

Note, however, that the i-th diagonal has only m-i+1 elements, so the computation only needs to run up to LL(m-i+1);

If all the elements of some diagonal are V, then all elements after that diagonal are also V and need not be computed;

Next, compute the edit distance between sentence A and sentence B, denoted LD(A, B); clearly, LD(A, B) = 0 means that sentence A and sentence B are identical;

A = a1a2…aN, meaning that A consists of the N characters a1, a2, …, aN, with Len(A) = N;

B = b1b2…bM, meaning that B consists of the M characters b1, b2, …, bM, with Len(B) = M;

Define LD(i, j) = LD(a1a2…ai, b1b2…bj), where 0 ≤ i ≤ N and 0 ≤ j ≤ M;

Compute the remaining rows of the LD matrix according to the formula: if ai = bj, then LD(i, j) = LD(i-1, j-1); if ai ≠ bj, then LD(i, j) = Min(LD(i-1, j-1), LD(i-1, j), LD(i, j-1)) + 1; this finally yields the value of LD(A, B);

Compute the similarity of sentence A and sentence B: SIM(A, B) = LCS(A, B) / (LD(A, B) + LCS(A, B));

2. The method for multi-format document entry and comparison according to claim 1, characterized in that: in step 2, an electronic document in PDF format is converted by first extracting the content stream of each page in the PDF document, then decrypting the extracted content streams, then decoding the decrypted content streams with the Filter decoding algorithm, and finally extracting the text content and its related information from the decoded content streams and storing them as a document of the configured unified format.