CN107656909A - Document similarity judgment method and device based on mixed document features - Google Patents

Document similarity judgment method and device based on mixed document features

Info

Publication number
CN107656909A
Authority
CN
China
Prior art keywords
similarity
feature
document
sequence
characteristic value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711041146.0A
Other languages
Chinese (zh)
Other versions
CN107656909B (en)
Inventor
魏效征
王志海
喻波
安鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wondersoft Technology Co Ltd
Original Assignee
Beijing Wondersoft Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wondersoft Technology Co Ltd
Priority to CN201711041146.0A
Publication of CN107656909A
Application granted
Publication of CN107656909B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a document similarity judgment method and device based on mixed document features. The method comprises the following steps: performing regular expression matching on an input file or data stream; ending if matching fails, or, if matching succeeds, reprocessing the multiple feature strings output by the regular expression matching; managing the results of the feature reprocessing in linked lists to form multiple feature lists; traversing the feature lists and merging the features; and outputting the similarity judgment result. The scheme greatly improves the recognition of tabular data in structured documents and substantially improves similarity judgment for Excel spreadsheet-type documents. It is faster, easy to understand, suited to actual business requirements, and provides a solid technical capability for data management and control.

Description

Document similarity judgment method and device based on mixed document features
Technical field
The present invention relates to the field of computer search, and in particular to a document similarity judgment method and device based on mixed document features.
Background
Document similarity judgment is widely used in applications such as Internet search, public opinion reporting, and enterprise classification. As a result, many text similarity recognition methods exist, both for structured, table-type documents and for unstructured, character-type documents.
However, documents containing tables are a common format in everyday enterprise work and often carry more business information or sensitive data. In a financial report, for example, apart from the descriptive text, the tables may contain more of the sensitive information, such as the company's various financial indicators. Such an unstructured document containing many tables differs from both structured and unstructured documents; it is a mixed-type document. When judging the similarity of documents of this type, the methods commonly used for unstructured or structured documents do not give good results. Designing a method that judges the similarity of mixed-type documents well is therefore very necessary for data leakage prevention.
Document similarity judgment in the prior art is an important technique in the field of text information processing, for example:
Document 1, application number: CN201210491145.7, title of invention: A text similarity calculation method;
Document 2, application number: CN201410491458.1, title of invention: A text feature extraction system and method.
The above prior art has the following shortcomings:
(1) The influence of structured data within unstructured documents is not considered. Numbers in a document, such as ID card numbers, bank card numbers, credit card verification codes, and mobile phone numbers, are very important numeric information; in data leakage prevention in particular, these features matter far more than keywords.
(2) Document attribute features are not considered. Attributes such as a document's header, footer, author, and remarks are important factors in judging document similarity.
(3) The relationship among keyword features, regular-expression features, and document attributes in determining document similarity is not considered.
Summary of the invention
To solve the above technical problems, the invention provides a document similarity judgment method based on mixed document features, comprising the following steps:
1) perform regular expression matching on the input file or data stream;
2) if matching fails, jump to step 7); if matching succeeds, obtain multiple kinds of features and jump to step 3);
3) manage the feature values of each kind of feature in a linked list, forming multiple feature lists;
4) form multiple feature sequences from the feature values in the feature lists and their positions in the linked lists;
5) calculate the similarity between the sequences;
6) output the similarity judgment result;
7) end.
According to an embodiment of the invention, preferably, if matching succeeds in step 2), the feature values are reprocessed to remove pseudo feature values.
According to an embodiment of the invention, preferably, in step 5) the similarity between sequences is judged by calculating the K-D distance or the Hamming distance between them.
According to an embodiment of the invention, preferably, in step 6), before the similarity judgment result is output, the similarity between sequences is judged in combination with document attributes.
According to an embodiment of the invention, preferably, after step 6) the judgment result is also input to a deep learning or SVM module to obtain a judgment model.
To solve the above technical problems, the invention provides a document similarity judgment device based on mixed document features, comprising:
a regular expression matching module, which performs regular expression matching on the input file or data stream to obtain multiple kinds of features;
a linked list management module, which manages the feature values of each kind of feature in a linked list, forming multiple feature lists;
a feature sequence generation module, which forms multiple feature sequences from the feature values in the feature lists and their positions in the linked lists;
a similarity calculation module, which calculates the similarity between the sequences;
a result output module, which outputs the similarity judgment result.
According to an embodiment of the invention, preferably, the device further comprises a feature reprocessing module, which reprocesses the feature values to remove pseudo feature values.
According to an embodiment of the invention, preferably, before outputting the similarity judgment result, the result output module judges the similarity between sequences in combination with document attributes.
According to an embodiment of the invention, preferably, the device further comprises a judgment model forming module, which inputs the judgment result to a deep learning or SVM module to obtain a judgment model.
To solve the above technical problems, the invention provides a computer-readable storage medium comprising computer program instructions which, when executed, implement the method according to any one of the above claims.
The technical solution of the invention achieves the following technical effects:
The proposed scheme greatly improves the recognition of tabular data in structured documents and substantially improves similarity judgment for Excel spreadsheet-type documents. Compared with the traditional hash method for structured documents, the scheme is faster, easier to understand, and suited to actual business requirements, and it provides a solid technical capability for data management and control.
Brief description of the drawings
Fig. 1 is an information processing flowchart of the source code data detection method based on regular-expression post-processing according to the present invention.
Detailed description
<Judgment method>
The invention discloses a document similarity judgment method based on mixed document features, comprising the following steps:
1) perform regular expression matching on the input file or data stream;
2) if matching fails, jump to step 7); if matching succeeds, obtain multiple kinds of features and jump to step 3);
3) manage the feature values of each kind of feature in a linked list, forming multiple feature lists;
4) form multiple feature sequences from the feature values in the feature lists and their positions in the linked lists;
5) calculate the similarity between the sequences;
6) output the similarity judgment result;
7) end.
If matching succeeds in step 2), the feature values are reprocessed to remove pseudo feature values.
In step 5), the similarity between sequences is judged by calculating the K-D distance or the Hamming distance between them.
In step 6), before the similarity judgment result is output, the similarity between sequences is judged in combination with document attributes.
After step 6), the judgment result is also input to a deep learning or SVM module to obtain a judgment model.
The document attributes include the document's author, title, abstract, header, footer, and so on.
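Steps 1) to 3) above can be sketched as follows. This is a minimal illustration, not the patented implementation: the three patterns in `PATTERNS` and the function name `build_feature_lists` are assumptions (the patent does not publish its regular expressions), and an ordinary Python list stands in for the linked list.

```python
import re
from collections import defaultdict

# Hypothetical patterns for illustration only; the patent does not list its own.
PATTERNS = {
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    "bank_card": re.compile(r"(?<!\d)(?:\d{16}|\d{19})(?!\d)"),
    "mobile": re.compile(r"(?<!\d)1\d{10}(?!\d)"),
}


def build_feature_lists(text):
    """Scan the text with every pattern and record (offset, value) per feature.

    An empty result means that matching failed, in which case the caller
    would go straight to the final step and end.
    """
    lists = defaultdict(list)
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            lists[name].append((match.start(), match.group()))
    return dict(lists)


if __name__ == "__main__":
    sample = "Zhang 11010519491231002X 6222020200112233 13800138000\n"
    for name, entries in build_feature_lists(sample).items():
        print(name, entries)
```

Storing the offset next to each value keeps the position information that step 4) needs when the lists are turned into sequences.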
As shown in Fig. 1, office documents are converted into plain text (txt) before entering the processing flow. Tabular data in semi-structured documents often consists of rows of similar data; here x, y, and z denote three kinds of feature data.
The main flow scans the full text with the multiple regular expressions in the matching template, which yields sequences such as $x_i * y_i * z_i$ (where * denotes any characters). For each kind of feature a feature list is built, recording the document offset at which each feature value occurs and the value itself. The length of the shortest feature list determines the number of sequences: if the shortest of the lists has length 50 and the longest has length 100, 50 sequences are formed. For the sequence values $x_i * y_i * z_i$ and $x_{i+1} * y_{i+1} * z_{i+1}$, the similarity between them can be measured by the K-D distance or the Hamming distance. This completes the judgment of structured content similarity. Judging the similarity of semi-structured documents also requires judging the similarity of the unstructured content. For the unstructured content, similarity analysis is performed on the document attributes and the unstructured content using common methods such as SVM.
Combining the structured and unstructured similarity judgments is an effective scheme for judging the similarity of semi-structured documents.
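The sequence construction and the Hamming-distance comparison described above might look like the sketch below. The patent does not say how a row is encoded before distances are taken, so the character-by-character comparison of the concatenated fields is an assumption, as are the names `build_sequences`, `hamming`, and `row_similarity`. The K-D distance is not shown.

```python
def build_sequences(feature_lists):
    """Align the feature lists position by position.

    The shortest list sets the number of sequences, so lists of length
    50 and 100 give 50 sequences.
    """
    names = sorted(feature_lists)
    count = min(len(feature_lists[name]) for name in names)
    return [tuple(feature_lists[name][i] for name in names) for i in range(count)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def row_similarity(row_a, row_b):
    """Similarity in [0, 1] between two rows (assumed encoding: joined fields)."""
    a, b = "".join(row_a), "".join(row_b)
    if not a:
        raise ValueError("rows must not be empty")
    return 1.0 - hamming(a, b) / len(a)


if __name__ == "__main__":
    lists = {"x": ["1111", "2222", "3333"], "y": ["aa", "bb"]}
    rows = build_sequences(lists)
    print(rows)
    print(row_similarity(rows[0], rows[1]))
```

Because the Hamming distance is only defined for equal-length inputs, the sketch raises an error rather than padding; a real system would need its own rule for rows of different length.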
The present invention proposes a document similarity judgment method based on regular-expression features, word features, and document attribute features. Targeting the need to detect unstructured data in enterprise data security management and control, it solves the matching of structured content within unstructured text as well as similarity judgment for structured documents and numeric, table-type documents, forming a new document similarity judgment method.
(1) Precise post-processing verification is applied to the regular-expression features, ensuring the accuracy of the feature matching results.
Many kinds of numeric content, such as ID card numbers, bank card numbers, and mobile phone numbers, easily produce false positives if matched by the regular-expression engine alone. A regular-expression post-processing script (i.e., the feature value reprocessing program) is therefore introduced to verify the engine's matching results and improve accuracy. For example, a text may contain numbers that resemble ID card numbers; the digits are checked, for instance whether the digits starting at the 7th position form a date of birth, and if not, the number is judged not to be an ID card number. Other feature values are checked similarly.
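The date-of-birth check mentioned above can be sketched as a small post-processing function. It assumes the 18-character ID format in which characters 7 to 14 hold the birth date as YYYYMMDD; the function name and the accepted date range (1900 up to today) are illustrative assumptions, not taken from the patent.

```python
from datetime import date


def has_valid_birth_date(candidate):
    """Return True if characters 7-14 of an 18-character candidate form a real date."""
    body = candidate[:17]
    if len(candidate) != 18 or not (body.isascii() and body.isdigit()):
        return False
    year = int(candidate[6:10])
    month = int(candidate[10:12])
    day = int(candidate[12:14])
    try:
        born = date(year, month, day)
    except ValueError:
        # Impossible calendar dates such as month 13 or 30 February.
        return False
    # Assumed plausibility window; a real script may use a different one.
    return date(1900, 1, 1) <= born <= date.today()


if __name__ == "__main__":
    print(has_valid_birth_date("11010519491231002X"))
    print(has_valid_birth_date("110105194913310021"))
```

A candidate that fails this check would be dropped from its feature list as a pseudo feature value before sequences are formed.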
(2) Using document attributes as features is critical for judging text similarity. Information such as a document's abstract, remarks, header, and footer differs from ordinary word-level text features: it reflects the author's identity, the subject type, and the document type, and so contributes a large amount of information to the similarity judgment.
(3) A deep learning method is used, which is easy to implement. Three kinds of information, namely text, regular-expression features, and document attributes, serve as input; training on a large number of documents determines the weights of the three kinds of information and the deep learning model that depends on them. The resulting model can also accept feedback during actual matching and be optimized.
The invention discloses a method for judging the similarity of mixed documents based on regular-expression and keyword features. The method considers regular-expression features in addition to keywords, so it can judge the similarity of both unstructured and structured documents. In addition, to improve the hit accuracy of the features identified by regular expressions, a post-processing function is applied to the regular-expression recognition results. The technique uses regular-expression and keyword features to build a normalized vector for each document, and finally judges document similarity.
<Judgment device>
The invention discloses a document similarity judgment device based on mixed document features, comprising:
a regular expression matching module, which performs regular expression matching on the input file or data stream to obtain multiple kinds of features;
a linked list management module, which manages the feature values of each kind of feature in a linked list, forming multiple feature lists;
a feature sequence generation module, which forms multiple feature sequences from the feature values in the feature lists and their positions in the linked lists;
a similarity calculation module, which calculates the similarity between the sequences;
a result output module, which outputs the similarity judgment result.
The device further comprises a feature reprocessing module, which reprocesses the feature values to remove pseudo feature values.
Before outputting the similarity judgment result, the result output module judges the similarity between sequences in combination with document attributes.
The device further comprises a judgment model forming module, which inputs the judgment result to a deep learning or SVM module to obtain a judgment model.
<Specific embodiment>
An enterprise performs similarity judgment on documents containing user salary information. The salary information in the documents includes users' names, ID card numbers, bank card numbers, mobile phone numbers, and so on. Matching rules are established as follows:
Feature 1    Feature 2    Feature 3
ID card number    UnionPay card number    Mobile phone number    ...
1. Determine the regular expressions for ID card numbers, bank card numbers, mobile phone numbers, etc. Determine the post-processing script: for ID card numbers, check the province code, the date of birth, and whether the final check digit is correct; for UnionPay card numbers, check the card BIN of the issuing bank at the start of the number and apply the Luhn check to the account number. Finally, form the sequence xyz of the three features above.
2. Extract the keywords related to salary information in the document, such as job grade, department information, performance, and allowances.
3. Extract document attribute information such as the author, title, abstract, header, and footer.
4. Feed the collected information as input to the deep learning or SVM module to obtain the judgment model.
5. Take a new document as input, judge its similarity, and obtain the result.
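Two of the post-processing checks named in item 1 can be sketched as follows: the Luhn check for card numbers and the final check character of an 18-character ID number (weighted sum modulo 11). The function names are illustrative, and the province-code and card BIN lookups are left out because the patent gives no tables for them.

```python
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CHARS = "10X98765432"  # indexed by (weighted sum mod 11)


def luhn_ok(number):
    """Luhn check: double every second digit from the right and sum the digits."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def id_check_char_ok(candidate):
    """Verify the last character of an 18-character ID number candidate."""
    body = candidate[:17]
    if len(candidate) != 18 or not (body.isascii() and body.isdigit()):
        return False
    total = sum(int(char) * weight for char, weight in zip(body, ID_WEIGHTS))
    return ID_CHECK_CHARS[total % 11] == candidate[17].upper()


if __name__ == "__main__":
    print(luhn_ok("79927398713"))
    print(id_check_char_ok("11010519491231002X"))
```

Checks like these run after the regular expression match, so that only candidates passing every applicable test stay in the feature lists.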
The invention greatly improves the recognition of tabular data in structured documents and substantially improves similarity judgment for Excel spreadsheet-type documents. Compared with the traditional hash method for structured documents, the scheme is faster, easier to understand, and suited to actual business requirements, and it provides a solid technical capability for data management and control.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A document similarity judgment method based on mixed document features, comprising the following steps:
1) perform regular expression matching on the input file or data stream;
2) if matching fails, jump to step 7); if matching succeeds, obtain multiple kinds of features and jump to step 3);
3) manage the feature values of each kind of feature in a linked list, forming multiple feature lists;
4) form multiple feature sequences from the feature values in the feature lists and their positions in the linked lists;
5) calculate the similarity between the sequences;
6) output the similarity judgment result;
7) end.
2. The method according to claim 1, wherein, if matching succeeds in step 2), the feature values are reprocessed to remove pseudo feature values.
3. The method according to claim 1, wherein in step 5) the similarity between sequences is judged by calculating the K-D distance or the Hamming distance between them.
4. The method according to claim 1, wherein in step 6), before the similarity judgment result is output, the similarity between sequences is judged in combination with document attributes.
5. The method according to claim 1, wherein after step 6) the judgment result is also input to a deep learning or SVM module to obtain a judgment model.
6. A document similarity judgment device based on mixed document features, comprising:
a regular expression matching module, which performs regular expression matching on the input file or data stream to obtain multiple kinds of features;
a linked list management module, which manages the feature values of each kind of feature in a linked list, forming multiple feature lists;
a feature sequence generation module, which forms multiple feature sequences from the feature values in the feature lists and their positions in the linked lists;
a similarity calculation module, which calculates the similarity between the sequences;
a result output module, which outputs the similarity judgment result.
7. The device according to claim 6, further comprising a feature reprocessing module, which reprocesses the feature values to remove pseudo feature values.
8. The device according to claim 6, wherein, before outputting the similarity judgment result, the result output module judges the similarity between sequences in combination with document attributes.
9. The device according to claim 6, further comprising a judgment model forming module, which inputs the judgment result to a deep learning or SVM module to obtain a judgment model.
10. A computer-readable storage medium comprising computer program instructions which, when executed, implement the method according to any one of claims 1-5.
CN201711041146.0A 2017-10-30 2017-10-30 Document similarity judgment method and device based on document mixing characteristics Active CN107656909B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711041146.0A CN107656909B (en) 2017-10-30 2017-10-30 Document similarity judgment method and device based on document mixing characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711041146.0A CN107656909B (en) 2017-10-30 2017-10-30 Document similarity judgment method and device based on document mixing characteristics

Publications (2)

Publication Number Publication Date
CN107656909A true CN107656909A (en) 2018-02-02
CN107656909B CN107656909B (en) 2021-06-01

Family

ID=61096204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711041146.0A Active CN107656909B (en) 2017-10-30 2017-10-30 Document similarity judgment method and device based on document mixing characteristics

Country Status (1)

Country Link
CN (1) CN107656909B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472209A (en) * 2019-07-04 2019-11-19 重庆金融资产交易所有限责任公司 Table generation method, device and computer equipment based on deep learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894143A (en) * 2010-06-28 2010-11-24 北京用友政务软件有限公司 Federated search and search result integrated display method and system
US20130054612A1 (en) * 2006-10-10 2013-02-28 Abbyy Software Ltd. Universal Document Similarity
CN104318340A (en) * 2014-09-25 2015-01-28 中国科学院软件研究所 Information visualization method and intelligent visual analysis system based on text curriculum vitae information
CN105573971A (en) * 2014-10-10 2016-05-11 富士通株式会社 Table reconstruction apparatus and method
CN105894253A (en) * 2016-05-09 2016-08-24 陈包容 Method and device for automatic pushing of job application demand

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130054612A1 (en) * 2006-10-10 2013-02-28 Abbyy Software Ltd. Universal Document Similarity
CN101894143A (en) * 2010-06-28 2010-11-24 北京用友政务软件有限公司 Federated search and search result integrated display method and system
CN104318340A (en) * 2014-09-25 2015-01-28 中国科学院软件研究所 Information visualization method and intelligent visual analysis system based on text curriculum vitae information
CN105573971A (en) * 2014-10-10 2016-05-11 富士通株式会社 Table reconstruction apparatus and method
CN105894253A (en) * 2016-05-09 2016-08-24 陈包容 Method and device for automatic pushing of job application demand

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472209A (en) * 2019-07-04 2019-11-19 重庆金融资产交易所有限责任公司 Table generation method, device and computer equipment based on deep learning
CN110472209B (en) * 2019-07-04 2024-02-06 深圳同奈信息科技有限公司 Deep learning-based table generation method and device and computer equipment

Also Published As

Publication number Publication date
CN107656909B (en) 2021-06-01

Similar Documents

Publication Publication Date Title
AU2005201758B2 (en) Method of learning associations between documents and data sets
WO2019184217A1 (en) Hotspot event classification method and apparatus, and storage medium
CN103336766B (en) Short text garbage identification and modeling method and device
Xu et al. Using deep linguistic features for finding deceptive opinion spam
US10997366B2 (en) Methods, devices and systems for data augmentation to improve fraud detection
CN109960727B (en) Personal privacy information automatic detection method and system for unstructured text
US20110029303A1 (en) Word classification system, method, and program
WO2019179010A1 (en) Data set acquisition method, classification method and device, apparatus, and storage medium
CN112184145A (en) AI-based unmanned intervention approval system
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN107294834A (en) A kind of method and apparatus for recognizing spam
CN112257444B (en) Financial information negative entity discovery method, device, electronic equipment and storage medium
CN110610003B (en) Method and system for assisting text annotation
CN112084308A (en) Method, system and storage medium for text type data recognition
CN112463922A (en) Risk user identification method and storage medium
CN107656909A (en) A kind of Documents Similarity decision method and device based on document composite character
CN110321557A (en) A kind of file classification method, device, electronic equipment and storage medium
CN115994531A (en) Multi-dimensional text comprehensive identification method
CN116029280A (en) Method, device, computing equipment and storage medium for extracting key information of document
CN108171589A (en) Verification method and device
Souza et al. ARCTIC: metadata extraction from scientific papers in pdf using two-layer CRF
CN112100336A (en) Method and device for identifying preservation time of file and storage medium
CN111680513B (en) Feature information identification method and device and computer readable storage medium
Kini Term frequency tokenization for fake news detection
CN117332084B (en) Machine learning method suitable for detecting malicious comments and false news simultaneously

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant