CN108984593A - Method for entering and comparing multi-format documents - Google Patents

Publication number
CN108984593A
Authority
CN
China
Prior art keywords
document
sentence
format
library
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810549599.2A
Other languages
Chinese (zh)
Inventor
鞠非
华凯
顾梅
吴国奇
汤丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Power Supply Branch Jiangsu Electric Power Co Ltd
State Grid Corp of China SGCC
State Grid Jiangsu Electric Power Co Ltd
Original Assignee
Changzhou Power Supply Branch Jiangsu Electric Power Co Ltd
State Grid Corp of China SGCC
State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Power Supply Branch Jiangsu Electric Power Co Ltd, State Grid Corp of China SGCC, State Grid Jiangsu Electric Power Co Ltd
Priority to CN201810549599.2A
Publication of CN108984593A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/93: Document management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method for entering and comparing multi-format documents. The method first determines whether a document to be entered is a paper document. A paper document is automatically scanned by a front-end device into an original-format document library, whereas an electronic document is stored in the original-format document library directly. All documents in the original-format document library are then converted into a unified format, key attributes are annotated and basic management is applied to each document, and finally a content-based comparison is carried out using the Nakatsu algorithm and a word segmentation system, with documents associated according to their similarity and the results recorded in a database. The invention supports automatic entry, unified classification and intelligent management of documents of all types and formats, as well as comparison against existing files, thereby improving document utilization, reducing comparison time and improving document management efficiency.

Description

Method for entering and comparing multi-format documents
This application is a divisional application of invention patent application No. 201310696955.0, entitled "A method for entering and comparing multi-format documents", filed on December 18, 2013.
Technical field
The present invention relates to the field of document processing and management, and more particularly to a method for entering electronic or paper documents and comparing them.
Background art
Typical applications of document comparison technology at present include: (1) Intelligent information retrieval: a search engine responds to the keywords entered by a user by listing all information that matches them. (2) Automatic question answering systems: in such systems the questions are varied and very numerous, and some are very similar to one another. Answering them manually takes a great deal of time and labor. If text similarity techniques are applied, highly similar questions can be grouped into one class and answered automatically by the system, saving a great deal of time. (3) Text duplicate checking: in certain fields, privacy and originality considerations require that texts not be duplicated. Computing the similarity of such texts reveals which texts occur repeatedly. As the above shows, document comparison technology is being applied in an increasing number of fields.
At present, research on document comparison, analysis and management concentrates on text similarity computation, and that computation focuses on string similarity. Fairly mature clustering algorithms have emerged, but they do not take the semantics of the text or characters into account during comparison, so the computed similarity has limited reference value for users in practical applications. Text similarity can also be computed via word segmentation: a Chinese word segmentation algorithm segments the text from a semantic point of view, and the similarity between texts is then computed by combining the segmentation with a comparison algorithm, so that the similarity between documents is obtained at the word level. However, such document comparison supports only a single format, either TXT text or Word files. Documents in multiple formats cannot be compared directly and must first be converted manually, which greatly reduces working efficiency.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method capable of entering and comparing documents in multiple formats.
The technical solution for achieving the object of the invention is a method for entering and comparing multi-format documents, comprising the following steps:
1. Determine whether the document to be entered is a paper document. If so, stack the paper pages neatly in order and place them in a scanning device, which scans them into an electronic document in PDF format and stores it in the original-format document library on the storage device of a computer electrically connected to the scanning device;
If it is an electronic document in one of multiple formats, including PDF, Word or TXT, store it directly in the original-format document library on the storage device of the computer;
2. Using the computer, convert each electronic document in the original-format document library into a document of a unified format and store it in the unified-format document library on the storage device of the computer. The target file format can be set as required and is preferably Word format or TXT text. If the file format of an original electronic document is the same as the configured target format, the document is copied directly from the original-format document library to the unified-format document library;
3. For the content of each electronic document converted to the unified Word or TXT format, use a word segmentation system to extract the content of the document into a set of sentences, and store the set in a sentence data table as an entry corresponding to the document;
4. Annotate each electronic document converted to the unified Word or TXT format with key attributes including category, title, source, keywords and creation time, and store them in the sentence data table as an entry corresponding to the document;
5. Select the most recently entered document in the unified-format document library, or any other document in that library, as the document to be compared against all other documents in the library. First, using the sentence data table, compare and match the documents on their key attributes (category, title, source, keywords and creation time), thereby screening out from the unified-format document library all documents for which at least one of these five key attributes matches that of the document to be compared;
6. Take each document screened out in step 5 in turn as a reference document and compare it with the document to be compared, using the entries obtained for each document in step 3 from the sentence data table. The two documents are compared sentence by sentence: the similarity between sentences is calculated with the Nakatsu algorithm, and the overall similarity of the two documents is then calculated as the arithmetic mean of the sentence similarities;
7. Record the overall similarity between the document to be compared and each reference document, as obtained in step 6, in the corresponding database.
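The attribute screening of step 5 can be sketched in Python as follows. This is only an illustration under stated assumptions: the `DocMeta` record, its field names and the matching rules (exact equality for category, title, source and creation time; at least one shared keyword for keywords) are hypothetical and are not specified in the patent text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocMeta:
    """Hypothetical record holding the five key attributes of one document."""
    doc_id: str
    category: str
    title: str
    source: str
    keywords: frozenset
    created: str  # creation time, e.g. "2013-12-18"


def attributes_match(a: DocMeta, b: DocMeta) -> bool:
    """Return True if at least one of the five key attributes matches.

    Assumption: keywords match when the two documents share a keyword;
    the other attributes match on exact equality.
    """
    return (
        a.category == b.category
        or a.title == b.title
        or a.source == b.source
        or bool(a.keywords & b.keywords)
        or a.created == b.created
    )


def screen_candidates(target: DocMeta, library: list) -> list:
    """Select every other document in the library that matches the target."""
    return [
        doc for doc in library
        if doc.doc_id != target.doc_id and attributes_match(target, doc)
    ]
```

Only the documents returned by `screen_candidates` would go on to the sentence-level comparison of step 6, which keeps the more expensive comparison off unrelated documents.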
In step 3, the content of each document is extracted into a set of sentences by the word segmentation system as follows: each document is decomposed to form a document decomposition tree containing n (n ≥ 1) sentences. Sentences are stored in matrix form; each sentence consists of a row number, a column number, a length, content and similarity information, so the matrix of the n-th sentence consists of row number n, column number n, length n, content n and similarity n.
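One possible way to hold the per-sentence fields listed above is sketched below in Python. The `SentenceRecord` type and the `split_sentences` helper are hypothetical names, and two things are assumptions rather than statements from the patent: that row and column numbers refer to the position of the sentence in the text, and that sentences end at common Chinese or Western sentence punctuation.

```python
import re
from dataclasses import dataclass


@dataclass
class SentenceRecord:
    row: int                 # line of the text in which the sentence starts (1-based)
    col: int                 # column at which the sentence starts (1-based)
    length: int              # number of characters in the sentence
    content: str             # the sentence text
    similarity: float = 0.0  # filled in later by the comparison step


# A run of non-terminator characters, optionally followed by one terminator.
# Simplification: a period inside a number such as "3.14" also ends a sentence.
_SENTENCE = re.compile(r"[^。！？.!?\n]+[。！？.!?]?")


def split_sentences(text: str) -> list:
    """Split text into SentenceRecord entries, line by line."""
    records = []
    for row, line in enumerate(text.splitlines(), start=1):
        for match in _SENTENCE.finditer(line):
            raw = match.group()
            content = raw.strip()
            if not content:
                continue
            leading = len(raw) - len(raw.lstrip())
            records.append(
                SentenceRecord(row, match.start() + leading + 1, len(content), content)
            )
    return records
```

Each record corresponds to one entry of the sentence set for a document; the `similarity` field stays at its default until the document is compared.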
In step 6, the similarity between sentences is calculated with the Nakatsu algorithm as follows. Let the two sentences to be compared be sentence A and sentence B. First compute the length of the longest common subsequence of A and B, denoted MaxLen(A, B). Let M = Len(A) and N = Len(B), i.e. M is the length of string A and N is the length of string B; without loss of generality, assume M ≤ N;
Let A = a1a2…aM, meaning that A consists of the M characters a1, a2, …, aM;
B = b1b2…bN, meaning that B consists of the N characters b1, b2, …, bN;
Then MaxLen(i, j) = MaxLen(a1a2…ai, b1b2…bj), where 1 ≤ i ≤ M and 1 ≤ j ≤ N;
Let L(k, i) denote the minimum value of j for which the strings a1a2…ai and b1b2…bj have an LCS (Longest Common Subsequence) of length k. As a formula: L(k, i) = Min{ j } where LCS(i, j) = k;
First step: initialize the arrays LL() and P();
LL(0) = 0
LL(i) = V, 1 ≤ i ≤ M
P(i) = V, 1 ≤ i ≤ M
At this point, LL(0) represents L(0,0); LL(1) represents L(1,0); LL(2) represents L(2,1); …
Second step: compute the elements on the first diagonal in turn. L(1,1) is computed with a temporary variable T: T = F(L(0,0), L(1,0)) = F(LL(0), LL(1)), where F denotes the take-minimum operation, and the value of T is assigned to LL(1). LL(1) now represents L(1,1) and LL(2) represents L(2,1). Repeat this calculation until the diagonal has been completed; if an element is the first value in row k that is not V, assign it to P(k);
After the first diagonal has been computed, LL(0) represents L(0,1); LL(1) represents L(1,1); LL(2) represents L(2,2); …;
If this diagonal does not yield the solution, repeat the second step for the next diagonal until the solution is found. Note, however, that the i-th diagonal has only M-i+1 elements, so the calculation only needs to go as far as LL(M-i+1).
If every element of some diagonal is V, the elements of all later diagonals are also V and need not be computed.
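To make the quantity MaxLen(A, B) concrete, the following Python sketch computes the length of the longest common subsequence. It deliberately uses the textbook row-by-row dynamic-programming recurrence rather than the diagonal traversal of the Nakatsu algorithm described above; the two approaches differ in how much work they do but return the same length. The function name `lcs_length` is hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Uses the standard dynamic-programming recurrence (not Nakatsu's
    diagonal scheme), keeping only the previous row in memory.
    """
    if len(a) > len(b):
        a, b = b, a  # mirror the text's assumption M <= N
    prev = [0] * (len(b) + 1)
    for ai in a:
        cur = [0]
        for j, bj in enumerate(b, start=1):
            if ai == bj:
                cur.append(prev[j - 1] + 1)             # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))    # drop one character
        prev = cur
    return prev[-1]
```

For example, `lcs_length("ABCBDAB", "BDCABA")` evaluates to 4, since "BCBA" is one longest common subsequence of the two strings.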
Next, compute the edit distance between sentence A and sentence B, denoted LD(A, B). Clearly, LD(A, B) = 0 means that sentence A and sentence B are identical.
A = a1a2…aN, meaning that A consists of the N characters a1, a2, …, aN, with Len(A) = N (note that in this step N denotes the length of A and M the length of B);
B = b1b2…bM, meaning that B consists of the M characters b1, b2, …, bM, with Len(B) = M;
Define LD(i, j) = LD(a1a2…ai, b1b2…bj), where 0 ≤ i ≤ N and 0 ≤ j ≤ M;
Initialize the LD matrix: with LD(N, M) = LD(A, B), set the initial values LD(0, 0) = 0, LD(0, j) = j and LD(i, 0) = i;
Compute the remaining rows of the LD matrix using the formula: if ai = bj, then LD(i, j) = LD(i-1, j-1); if ai ≠ bj, then LD(i, j) = Min(LD(i-1, j-1), LD(i-1, j), LD(i, j-1)) + 1. This finally yields the value of LD(A, B);
Compute the similarity of sentence A and sentence B as SIM(A, B) = LCS(A, B) / (LD(A, B) + LCS(A, B)), where LCS(A, B) is the length of the longest common subsequence, MaxLen(A, B), obtained above.
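The edit distance recurrence and the similarity formula can be sketched in Python as below. The function names are hypothetical, the LCS length is computed with a plain dynamic-programming helper rather than the Nakatsu traversal, and the handling of two empty sentences (similarity 1.0, to avoid dividing by zero) is an assumption, since the text does not cover that case.

```python
def lcs_length(a: str, b: str) -> int:
    """LCS length by the standard dynamic-programming recurrence."""
    prev = [0] * (len(b) + 1)
    for ai in a:
        cur = [0]
        for j, bj in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ai == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """LD(A, B) computed row by row with the recurrence given in the text."""
    prev = list(range(len(b) + 1))            # LD(0, j) = j
    for i, ai in enumerate(a, start=1):
        cur = [i]                             # LD(i, 0) = i
        for j, bj in enumerate(b, start=1):
            if ai == bj:
                cur.append(prev[j - 1])       # LD(i-1, j-1)
            else:
                cur.append(min(prev[j - 1], prev[j], cur[j - 1]) + 1)
        prev = cur
    return prev[-1]


def sentence_similarity(a: str, b: str) -> float:
    """SIM(A, B) = LCS(A, B) / (LD(A, B) + LCS(A, B))."""
    lcs = lcs_length(a, b)
    ld = edit_distance(a, b)
    if lcs + ld == 0:
        return 1.0  # assumption: two empty sentences count as identical
    return lcs / (ld + lcs)
```

As a worked example, for "kitten" and "sitting" the edit distance is 3 and the longest common subsequence ("ittn") has length 4, so SIM = 4 / (3 + 4) ≈ 0.571. Identical sentences give LD = 0 and therefore SIM = 1, while sentences with no character in common give LCS = 0 and SIM = 0.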
In step 2, an electronic document in PDF format is converted as follows: first extract the content stream of each page of the PDF document, then decrypt the extracted content streams, then decode the decrypted content streams with the Filter decoding algorithm, and finally extract the text content and its related information from the decoded content streams and store them as a document in the configured unified format.
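The decode-then-extract part of this conversion can be illustrated with a small Python sketch. It assumes the content stream uses the FlateDecode filter (zlib/deflate compression, one of the standard PDF stream filters) and that text is shown with simple `(…) Tj` operators. It is not a PDF parser: locating page objects, decryption, font encodings, escaped parentheses and `TJ` arrays are all outside its scope, and the function names are hypothetical.

```python
import re
import zlib


def decode_flate_stream(raw: bytes) -> bytes:
    """Decode a content stream compressed with the FlateDecode filter."""
    return zlib.decompress(raw)


# Matches a literal string operand followed by the Tj (show text) operator.
_SHOW_TEXT = re.compile(rb"\((.*?)\)\s*Tj")


def extract_text(content: bytes) -> str:
    """Collect the string operands of all Tj operators in a decoded stream."""
    return "".join(m.group(1).decode("latin-1") for m in _SHOW_TEXT.finditer(content))


# Usage: build a compressed stream by hand, then decode it and extract its text.
sample = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ( world) Tj ET"
compressed = zlib.compress(sample)
text = extract_text(decode_flate_stream(compressed))
```

In practice an established PDF library would normally perform these steps; the sketch only shows where filter decoding sits between the raw stream and the extracted text.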
In step 1, the scanning device is preferably a scanner.
The present invention has the following positive effects: (1) The method for entering and comparing multi-format documents of the invention can enter paper documents and electronic documents of all types into a document library and unify their format to facilitate management and comparison, which improves document utilization, reduces comparison time and improves document management efficiency.
(2) The method of the invention uses the Nakatsu algorithm to compare sentences one by one and calculate the similarity between sentences, and then calculates the overall similarity of two documents as the arithmetic mean of the sentence similarities. The similarity of the two documents is thus calculated more accurately and the comparison works better.
(3) The method of the invention extracts the content of each document into a set of sentences by means of a word segmentation system. Each document is decomposed to form a document decomposition tree containing n (n ≥ 1) sentences; sentences are stored in matrix form, and each sentence consists of a row number, a column number, a length, content and similarity information, so the matrix of the n-th sentence consists of row number n, column number n, length n, content n and similarity n. The document tree produced by the word segmentation system is finely detailed, which improves the precision of the subsequent comparison and improves document management efficiency.
Brief description of the drawings
Fig. 1 is a flow diagram of the method for entering and comparing multi-format documents of the invention;
Fig. 2 is a schematic diagram of the detailed workflow of the word segmentation system in step 3 of the invention.
Detailed description of the embodiments
(Embodiment 1)
Referring to Fig. 1, the method for entering and comparing multi-format documents of this embodiment comprises the following steps:
1. Determine whether the document to be entered is a paper document. If so, stack the paper pages neatly in order and place them in a scanning device, which scans them into an electronic document in PDF format and stores it in the original-format document library on the storage device of a computer electrically connected to the scanning device. The scanning device is preferably a scanner;
If it is an electronic document in one of multiple formats, including PDF, Word or TXT, store it directly in the original-format document library on the storage device of the computer;
2. Using the computer, convert each electronic document in the original-format document library into a document of a unified format and store it in the unified-format document library on the storage device of the computer. The target file format can be set as required and is preferably Word format or TXT text. If the file format of an original electronic document is the same as the configured target format, the document is copied directly from the original-format document library to the unified-format document library. An electronic document in PDF format is converted as follows: first extract the content stream of each page of the PDF document, then decrypt the extracted content streams, then decode the decrypted content streams with the Filter decoding algorithm, and finally extract the text content and its related information from the decoded content streams and store them as a document in the configured unified format.
3. For the content of each electronic document converted to the unified Word or TXT format, use a word segmentation system to extract the content of the document into a set of sentences, and store the set in a sentence data table as an entry corresponding to the document. Word segmentation is the process of cutting a sequence of Chinese characters into individual words, i.e. recombining a continuous character sequence into a word sequence according to a given specification. For example, the word segmentation system decomposes the phrase meaning "strengthen regulatory efforts" into three words meaning "strengthen", "supervision" and "intensity";
Referring to Fig. 2, the content of each document is extracted into a set of sentences by the word segmentation system as follows: each document is decomposed to form a document decomposition tree containing n (n ≥ 1) sentences. Sentences are stored in matrix form; each sentence consists of a row number, a column number, a length, content and similarity information, so the matrix of the n-th sentence consists of row number n, column number n, length n, content n and similarity n.
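The segmentation example in step 3 can be reproduced with a small dictionary-based sketch. The patent does not say which segmentation algorithm its word segmentation system uses, so forward maximum matching is swapped in here purely as a common, simple technique. The toy vocabulary and the Chinese phrase "加大监管力度" are assumptions: the phrase is a reconstruction of the original example from its English rendering.

```python
def forward_max_match(text: str, vocab: set, max_len: int = 4) -> list:
    """Segment text by repeatedly taking the longest dictionary word.

    At each position the longest candidate (up to max_len characters) found
    in vocab is taken; an unknown single character is emitted on its own.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


# Toy dictionary (hypothetical) covering the example phrase.
vocab = {"加大", "监管", "力度"}
segments = forward_max_match("加大监管力度", vocab)
```

With this vocabulary the phrase is cut into "加大", "监管" and "力度", matching the three words named in the example. A production segmenter would use a much larger dictionary and usually statistical disambiguation as well.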
4. Each electronic document converted to the unified Word or TXT format is manually annotated with key attributes including category, title, source, keywords and creation time, which are stored in the sentence data table as an entry corresponding to the document.
5. Select the most recently entered document in the unified-format document library, or any other document in that library, as the document to be compared against all other documents in the library. First, using the sentence data table, compare and match the documents on their key attributes (category, title, source, keywords and creation time), thereby screening out from the unified-format document library all documents for which at least one of these five key attributes matches that of the document to be compared.
6. Take each document screened out in step 5 in turn as a reference document and compare it with the document to be compared, using the entries obtained for each document in step 3 from the sentence data table. The two documents are compared sentence by sentence: the similarity between sentences is calculated with the Nakatsu algorithm, and the overall similarity of the two documents (any one reference document and the document to be compared) is then calculated as the arithmetic mean of the sentence similarities.
The similarity between sentences is calculated with the Nakatsu algorithm as follows. Let the two sentences to be compared be sentence A and sentence B. First compute the length of the longest common subsequence of A and B, denoted MaxLen(A, B). Let M = Len(A) and N = Len(B), i.e. M is the length of string A and N is the length of string B; without loss of generality, assume M ≤ N;
Let A = a1a2…aM, meaning that A consists of the M characters a1, a2, …, aM;
B = b1b2…bN, meaning that B consists of the N characters b1, b2, …, bN;
Then MaxLen(i, j) = MaxLen(a1a2…ai, b1b2…bj), where 1 ≤ i ≤ M and 1 ≤ j ≤ N;
Let L(k, i) denote the minimum value of j for which the strings a1a2…ai and b1b2…bj have an LCS (Longest Common Subsequence) of length k. As a formula: L(k, i) = Min{ j } where LCS(i, j) = k;
First step: initialize the arrays LL() and P();
LL(0) = 0
LL(i) = V, 1 ≤ i ≤ M
P(i) = V, 1 ≤ i ≤ M
At this point, LL(0) represents L(0,0); LL(1) represents L(1,0); LL(2) represents L(2,1); …
Second step: compute the elements on the first diagonal in turn. L(1,1) is computed with a temporary variable T: T = F(L(0,0), L(1,0)) = F(LL(0), LL(1)), where F denotes the take-minimum operation, and the value of T is assigned to LL(1). LL(1) now represents L(1,1) and LL(2) represents L(2,1). Repeat this calculation until the diagonal has been completed; if an element is the first value in row k that is not V, assign it to P(k);
After the first diagonal has been computed, LL(0) represents L(0,1); LL(1) represents L(1,1); LL(2) represents L(2,2); …;
If this diagonal does not yield the solution, repeat the second step for the next diagonal until the solution is found. Note, however, that the i-th diagonal has only M-i+1 elements, so the calculation only needs to go as far as LL(M-i+1).
If every element of some diagonal is V, the elements of all later diagonals are also V and need not be computed.
Next, compute the edit distance between sentence A and sentence B, denoted LD(A, B). Clearly, LD(A, B) = 0 means that sentence A and sentence B are identical.
A = a1a2…aN, meaning that A consists of the N characters a1, a2, …, aN, with Len(A) = N (note that in this step N denotes the length of A and M the length of B);
B = b1b2…bM, meaning that B consists of the M characters b1, b2, …, bM, with Len(B) = M;
Define LD(i, j) = LD(a1a2…ai, b1b2…bj), where 0 ≤ i ≤ N and 0 ≤ j ≤ M;
Initialize the LD matrix: with LD(N, M) = LD(A, B), set the initial values LD(0, 0) = 0, LD(0, j) = j and LD(i, 0) = i;
Compute the remaining rows of the LD matrix using the formula: if ai = bj, then LD(i, j) = LD(i-1, j-1); if ai ≠ bj, then LD(i, j) = Min(LD(i-1, j-1), LD(i-1, j), LD(i, j-1)) + 1. This finally yields the value of LD(A, B);
Compute the similarity of sentence A and sentence B as SIM(A, B) = LCS(A, B) / (LD(A, B) + LCS(A, B)), where LCS(A, B) is the length of the longest common subsequence, MaxLen(A, B), obtained above.
7. Record the overall similarity between the document to be compared and each reference document, as obtained in step 6, in the corresponding database.
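Steps 6 and 7 at the document level can be sketched in Python as follows. Two points are assumptions, because the text does not specify them: how sentences of the two documents are paired (here each sentence of the document to be compared is paired with its best-scoring sentence in the reference document before the arithmetic mean is taken), and the database layout (an SQLite table with the hypothetical name `doc_similarity`). The sentence-level similarity is passed in as a function so that any implementation of SIM(A, B) can be used.

```python
import sqlite3
from statistics import fmean


def document_similarity(target: list, reference: list, sim) -> float:
    """Arithmetic mean of per-sentence similarities between two documents.

    Assumption: each target sentence is scored against its best match
    among the reference sentences.
    """
    if not target or not reference:
        return 0.0
    return fmean([max(sim(s, r) for r in reference) for s in target])


def record_similarity(conn, target_id: str, reference_id: str, score: float) -> None:
    """Store one overall similarity result (hypothetical table layout)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_similarity ("
        "target_id TEXT, reference_id TEXT, score REAL, "
        "PRIMARY KEY (target_id, reference_id))"
    )
    conn.execute(
        "INSERT OR REPLACE INTO doc_similarity VALUES (?, ?, ?)",
        (target_id, reference_id, score),
    )
    conn.commit()


# Usage with a trivial stand-in similarity: 1.0 for equal sentences, else 0.0.
exact = lambda a, b: 1.0 if a == b else 0.0
score = document_similarity(["a", "b"], ["b", "c"], exact)
conn = sqlite3.connect(":memory:")
record_similarity(conn, "d0", "d1", score)
```

Keying the table on the document pair means that re-running a comparison overwrites the earlier score instead of adding a duplicate row.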

Claims (2)

1. A method for entering and comparing multi-format documents, comprising the following steps:
1. Determine whether the document to be entered is a paper document. If so, stack the paper pages neatly in order and place them in a scanning device, which scans them into an electronic document in PDF format and stores it in the original-format document library on the storage device of a computer electrically connected to the scanning device;
If it is an electronic document in one of multiple formats, including PDF, Word or TXT, store it directly in the original-format document library on the storage device of the computer;
2. Using the computer, convert each electronic document in the original-format document library into a document of a unified format and store it in the unified-format document library on the storage device of the computer. The target file format can be set as required and is preferably Word format or TXT text. If the file format of an original electronic document is the same as the configured target format, the document is copied directly from the original-format document library to the unified-format document library;
3. For the content of each electronic document converted to the unified Word or TXT format, use a word segmentation system to extract the content of the document into a set of sentences, and store the set in a sentence data table as an entry corresponding to the document. The content of each document is extracted into a set of sentences by the word segmentation system as follows: each document is decomposed to form a document decomposition tree containing n (n ≥ 1) sentences; sentences are stored in matrix form, and each sentence consists of a row number, a column number, a length, content and similarity information, so the matrix of the n-th sentence consists of row number n, column number n, length n, content n and similarity n;
4. Annotate each electronic document converted to the unified Word or TXT format with key attributes including category, title, source, keywords and creation time, and store them in the sentence data table as an entry corresponding to the document;
5. Select the most recently entered document in the unified-format document library, or any other document in that library, as the document to be compared against all other documents in the library. First, using the sentence data table, compare and match the documents on their key attributes (category, title, source, keywords and creation time), thereby screening out from the unified-format document library all documents for which at least one of these five key attributes matches that of the document to be compared;
6. Take each document screened out in step 5 in turn as a reference document and compare it with the document to be compared, using the entries obtained for each document in step 3 from the sentence data table. The two documents are compared sentence by sentence: the similarity between sentences is calculated with the Nakatsu algorithm, and the overall similarity of the two documents is then calculated as the arithmetic mean of the sentence similarities, as follows. Let the two sentences to be compared be sentence A and sentence B. First compute the length of the longest common subsequence of A and B, denoted MaxLen(A, B). Let M = Len(A) and N = Len(B), i.e. M is the length of string A and N is the length of string B; without loss of generality, assume M ≤ N;
Let A = a1a2…aM, meaning that A consists of the M characters a1, a2, …, aM;
B = b1b2…bN, meaning that B consists of the N characters b1, b2, …, bN;
Then MaxLen(i, j) = MaxLen(a1a2…ai, b1b2…bj), where 1 ≤ i ≤ M and 1 ≤ j ≤ N;
Let L(k, i) denote the minimum value of j for which the strings a1a2…ai and b1b2…bj have an LCS (Longest Common Subsequence) of length k. As a formula: L(k, i) = Min{ j } where LCS(i, j) = k;
First step: initialize the arrays LL() and P();
LL(0) = 0
LL(i) = V, 1 ≤ i ≤ M
P(i) = V, 1 ≤ i ≤ M
At this point, LL(0) represents L(0,0); LL(1) represents L(1,0); LL(2) represents L(2,1); …
Second step: compute the elements on the first diagonal in turn. L(1,1) is computed with a temporary variable T: T = F(L(0,0), L(1,0)) = F(LL(0), LL(1));
F denotes the take-minimum operation, and the value of T is assigned to LL(1);
LL(1) now represents L(1,1) and LL(2) represents L(2,1). Repeat this calculation until the diagonal has been completed; if an element is the first value in row k that is not V, assign it to P(k);
After the first diagonal has been computed, LL(0) represents L(0,1); LL(1) represents L(1,1); LL(2) represents L(2,2); …;
If this diagonal does not yield the solution, repeat the second step for the next diagonal until the solution is found;
Note, however, that the i-th diagonal has only M-i+1 elements, so the calculation only needs to go as far as LL(M-i+1);
If every element of some diagonal is V, the elements of all later diagonals are also V and need not be computed;
Next, compute the edit distance between sentence A and sentence B, denoted LD(A, B). Clearly, LD(A, B) = 0 means that sentence A and sentence B are identical;
A = a1a2…aN, meaning that A consists of the N characters a1, a2, …, aN, with Len(A) = N;
B = b1b2…bM, meaning that B consists of the M characters b1, b2, …, bM, with Len(B) = M;
Define LD(i, j) = LD(a1a2…ai, b1b2…bj), where 0 ≤ i ≤ N and 0 ≤ j ≤ M;
Initialize the LD matrix: with LD(N, M) = LD(A, B), set the initial values LD(0, 0) = 0, LD(0, j) = j and LD(i, 0) = i;
Compute the remaining rows of the LD matrix using the formula: if ai = bj, then LD(i, j) = LD(i-1, j-1); if ai ≠ bj, then LD(i, j) = Min(LD(i-1, j-1), LD(i-1, j), LD(i, j-1)) + 1. This finally yields the value of LD(A, B);
Compute the similarity of sentence A and sentence B as SIM(A, B) = LCS(A, B) / (LD(A, B) + LCS(A, B));
7. Record the overall similarity between the document to be compared and each reference document, as obtained in step 6, in the corresponding database.
2. The method for entering and comparing multi-format documents according to claim 1, characterized in that: in step 2, an electronic document in PDF format is converted by first extracting the content stream of each page of the PDF document, then decrypting the extracted content streams, then decoding the decrypted content streams with the Filter decoding algorithm, and finally extracting the text content and its related information from the decoded content streams and storing them as a document in the configured unified format.
CN201810549599.2A 2013-12-18 2013-12-18 Method for entering and comparing multi-format documents Pending CN108984593A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810549599.2A CN108984593A (en) 2013-12-18 2013-12-18 Method for entering and comparing multi-format documents

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810549599.2A CN108984593A (en) 2013-12-18 2013-12-18 Method for entering and comparing multi-format documents
CN201310696955.0A CN103823838B (en) 2013-12-18 2013-12-18 A method for entering and comparing multi-format documents

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201310696955.0A Division CN103823838B (en) 2013-12-18 2013-12-18 A kind of method of multi-format document typing and comparison

Publications (1)

Publication Number Publication Date
CN108984593A true CN108984593A (en) 2018-12-11

Family

ID=50758902

Family Applications (4)

Application Number Title Priority Date Filing Date
CN201810549597.3A Pending CN108804624A (en) 2013-12-18 2013-12-18 The method of text gear typing and comparison
CN201810549599.2A Pending CN108984593A (en) 2013-12-18 2013-12-18 The method that multi-format text keeps off typing and compares
CN201810549598.8A Pending CN108959203A (en) 2013-12-18 2013-12-18 A kind of method text gear typing and compared
CN201310696955.0A Active CN103823838B (en) 2013-12-18 2013-12-18 A kind of method of multi-format document typing and comparison

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201810549597.3A Pending CN108804624A (en) 2013-12-18 2013-12-18 The method of text gear typing and comparison

Family Applications After (2)

Application Number Title Priority Date Filing Date
CN201810549598.8A Pending CN108959203A (en) 2013-12-18 2013-12-18 A kind of method text gear typing and compared
CN201310696955.0A Active CN103823838B (en) 2013-12-18 2013-12-18 A kind of method of multi-format document typing and comparison

Country Status (1)

Country Link
CN (4) CN108804624A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135264A (en) * 2019-04-16 2019-08-16 深圳壹账通智能科技有限公司 Data entry method, device, computer equipment and storage medium
CN112487781A (en) * 2020-12-10 2021-03-12 成都海光微电子技术有限公司 File comparison method and device, storage medium and equipment

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105701256A (en) * 2016-03-23 2016-06-22 南京南瑞继保电气有限公司 Communication point table file comparison method
CN106033475A (en) * 2016-05-18 2016-10-19 苏州奖多多科技有限公司 Information matching method and device and electronic equipment
CN105912883A (en) * 2016-06-30 2016-08-31 广州市皓轩软件科技有限公司 Structural data extraction method for ICD pacemaker
CN107169011B (en) * 2017-03-31 2021-06-11 百度在线网络技术(北京)有限公司 Webpage originality identification method and device based on artificial intelligence and storage medium
CN107368472B (en) * 2017-07-26 2021-01-05 成都科来软件有限公司 Storage method of document analysis result capable of being iteratively optimized
CN109062872B (en) * 2018-07-13 2023-04-18 上海溱云科技有限公司 Method for uniformly processing customs files with different formats
CN109271641B (en) * 2018-11-20 2023-09-08 广西三方大供应链技术服务有限公司 Text similarity calculation method and device and electronic equipment
CN112948574A (en) * 2019-12-11 2021-06-11 上海交通大学 System and method for uploading and classifying batch files
CN111026718A (en) * 2019-12-11 2020-04-17 广州地铁集团有限公司 Technical method for analyzing excel file of rail transit engineering cost achievement
CN110955638A (en) * 2019-12-17 2020-04-03 江苏扬子易联智能软件有限公司 File comparison display method and system
CN111382562B (en) * 2020-03-05 2024-03-01 百度在线网络技术(北京)有限公司 Text similarity determination method and device, electronic equipment and storage medium
CN111563372B (en) * 2020-05-11 2021-04-13 世纪金榜集团股份有限公司 Typesetting document content self-duplication checking method based on teaching book publishing
CN114939532B (en) * 2022-07-11 2022-11-08 河北汇金集团股份有限公司 Sorting method for disordered documents

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1687926A (en) * 2005-04-18 2005-10-26 福州大学 Method of PDF file information extraction system based on XML
CN101630321A (en) * 2009-08-26 2010-01-20 中山大学 On-line article screening method based on data mining (DM)
CN101763343A (en) * 2008-12-23 2010-06-30 上海晨鸟信息科技有限公司 Document editor principle supporting format comparison and plagiarism check and method
CN101957809A (en) * 2010-10-14 2011-01-26 传神联合(北京)信息技术有限公司 Anti-plagiarism method
CN102622338A (en) * 2012-02-24 2012-08-01 北京工业大学 Computer-assisted computing method of semantic distance between short texts

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4038717B2 (en) * 2002-09-13 2008-01-30 富士ゼロックス株式会社 Text sentence comparison device
CN100412869C (en) * 2006-04-13 2008-08-20 北大方正集团有限公司 Improved file similarity measure method based on file structure
CN102004779B (en) * 2010-11-19 2012-11-28 百度在线网络技术(北京)有限公司 Document sharing platform and document processing method
CN102799647B (en) * 2012-06-30 2015-01-21 华为技术有限公司 Method and device for webpage reduplication deletion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEIXIN_34417814: "文本比较算法Ⅷ——再议Nakatsu算法", 《HTTPS://BLOG.CSDN.NET/WEIXIN_34417814/ARTICLE/DETAILS/85478665》 *
王森,王宇: "基于文本结构树的论文复制检测算法", 《现代图书情报技术》 *

Also Published As

Publication number Publication date
CN108804624A (en) 2018-11-13
CN103823838A (en) 2014-05-28
CN108959203A (en) 2018-12-07
CN103823838B (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN103823838B (en) A kind of method of multi-format document typing and comparison
US11907244B2 (en) Modifying field definitions to include post-processing instructions
WO2019227585A1 (en) Index-based resume data processing method, device, apparatus, and storage medium
Rao et al. PRIX: Indexing and querying XML using prufer sequences
US20150026556A1 (en) Systems and Methods for Extracting Table Information from Documents
US8140267B2 (en) System and method for identifying similar molecules
US8725781B2 (en) Sentiment cube
CN104572849A (en) Automatic standardized filing method based on text semantic mining
CN108573045A (en) A kind of alignment matrix similarity retrieval method based on multistage fingerprint
US20130086035A1 (en) Method and apparatus for generating extended page snippet of search result
Zu et al. Resume information extraction with a novel text block segmentation algorithm
CN109446410A (en) Knowledge point method for pushing, device and computer readable storage medium
CN105404677A (en) Tree structure based retrieval method
Sangati et al. Multiword expression identification with recurring tree fragments and association measures
CN105426490A (en) Tree structure based indexing method
CN111091003A (en) Parallel extraction method based on knowledge graph query
CN107451168A (en) File Classification System and Method Based on Vocabulary Statistics
Tian A mathematical indexing method based on the hierarchical features of operators in formulae
TWI534640B (en) Chinese network information monitoring and analysis system and its method
Subercaze et al. Mining user-generated comments
Luo et al. Biotable: A tool to extract semantic structure of table in biology literature
CN110674254B (en) Intelligent contract information extraction method based on deep learning and statistical extraction model
Wang et al. A new model of document structure analysis
Flynn Document classification in support of automated metadata extraction form heterogeneous collections
Nguyen et al. Py_ape: Text Data Acquiring, Extracting, Cleaning and Schema Matching in Python

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181211