CN107656909A - Document similarity determination method and device based on mixed document features
- Publication number
- CN107656909A (application CN201711041146.0A)
- Authority
- CN
- China
- Prior art keywords
- similarity
- feature
- document
- sequence
- characteristic value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a document similarity determination method and device based on mixed document features. The method comprises the following steps: performing regular-expression matching on an input file or data stream; terminating if the match fails, and, if the match succeeds, reprocessing the multiple feature strings output by the regular-expression matching; managing the reprocessing results in separate linked lists to form multiple feature lists; traversing the feature lists and merging the features; and outputting the similarity determination result. The scheme greatly improves recognition of tabular data in structured documents and substantially improves similarity determination for Excel-type spreadsheet documents. It is faster, easy to understand, suited to actual business requirements, and provides a solid technical capability for data management and control.
Description
Technical field
The present invention relates to the field of computer search, and in particular to a document similarity determination method and device based on mixed document features.
Background
Document similarity determination is widely used in applications such as Internet search, public-opinion reporting, and enterprise classification. Consequently, many text-similarity methods exist both for structured, table-type documents and for unstructured, text-type documents.
However, documents containing tables are a common format in daily enterprise work and often carry a large amount of business information or sensitive data. In a financial report, for example, apart from the descriptive text, the tables may contain much of the sensitive information, such as the company's financial indicators. An unstructured document that contains many tables differs from both a structured document and an unstructured document; it is a mixed-type document. When the similarity of such documents is judged, the methods commonly used for unstructured or structured documents do not give good results. Designing a method that judges the similarity of mixed-type documents well is therefore very necessary for data-leakage-prevention engineering.
Document similarity determination is an important technique in the field of text information processing. The prior art includes, for example:
Document 1, application number CN201210491145.7, title of invention: A text similarity calculation method;
Document 2, application number CN201410491458.1, title of invention: A text feature extraction system and method.
The above prior art has the following shortcomings:
(1) It does not consider the influence of structured data in unstructured documents. Numbers in a document, such as ID card numbers, bank card numbers, credit card verification codes, and mobile phone numbers, are very important digital information. Especially in data-leakage prevention, these features are far more important than keywords.
(2) It does not consider document attribute features. Attributes such as the header, footer, author, and remarks of a document are important factors in judging document similarity.
(3) It does not consider how keyword features, regular-expression features, and document attributes jointly relate to document similarity.
Summary of the invention
To solve the above technical problems, the invention provides a document similarity determination method based on mixed document features, comprising the following steps:
1) performing regular-expression matching on the input file or data stream;
2) if the match fails, jumping to step 7); if the match succeeds, obtaining multiple features and jumping to step 3);
3) managing the feature values of each feature in a linked list, forming multiple feature lists;
4) forming multiple feature sequences from the feature values in the feature lists and their positions in the lists;
5) calculating the similarity between the sequences;
6) outputting the similarity determination result;
7) ending.
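Steps 1) to 3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the three patterns and the names `PATTERNS` and `build_feature_lists` are hypothetical, and a Python list of (offset, value) pairs stands in for each linked list.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately simple patterns; a production template would be stricter.
PATTERNS = {
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    "bank_card": re.compile(r"(?<!\d)\d{16}(?!\d)"),
    "mobile": re.compile(r"(?<!\d)1\d{10}(?!\d)"),
}


def build_feature_lists(text):
    """Scan the text with every pattern and record (offset, value) per feature.

    An empty result corresponds to the "match fails" branch of step 2).
    """
    lists = defaultdict(list)
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            lists[name].append((match.start(), match.group()))
    return dict(lists)


sample = "Zhang 11010519900307123X 6222020000000000 13800000000"
feature_lists = build_feature_lists(sample)
```

The digit lookarounds keep a shorter pattern from matching inside a longer number, so the 16-digit pattern does not fire inside the 18-character ID number.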
According to an embodiment of the invention, preferably, if the match succeeds in step 2), the feature values are reprocessed to remove false feature values.
According to an embodiment of the invention, preferably, in step 5) the similarity between sequences is judged by calculating the K-D distance or the Hamming distance between them.
According to an embodiment of the invention, preferably, before outputting the similarity determination result, step 6) combines document attributes to judge the similarity between sequences.
According to an embodiment of the invention, preferably, after step 6), the determination result is further input to a deep learning or SVM module to obtain a decision model.
To solve the above technical problems, the invention provides a document similarity determination device based on mixed document features, comprising:
a regular-expression matching module, which performs regular-expression matching on the input file or data stream to obtain multiple features;
a linked-list management module, which manages the feature values of each feature in a linked list, forming multiple feature lists;
a feature sequence generation module, which forms multiple feature sequences from the feature values in the feature lists and their positions in the lists;
a similarity calculation module, which calculates the similarity between the sequences;
a result output module, which outputs the similarity determination result.
According to an embodiment of the invention, preferably, the device further comprises a feature reprocessing module, which reprocesses the feature values to remove false feature values.
According to an embodiment of the invention, preferably, the result output module combines document attributes to judge the similarity between sequences before outputting the similarity determination result.
According to an embodiment of the invention, preferably, the device further comprises a decision model forming module, which inputs the determination result to a deep learning or SVM module to obtain a decision model.
To solve the above technical problems, the invention provides a computer-readable storage medium comprising computer program instructions; executing the computer program instructions implements the method of any one of the above claims.
The technical solution achieves the following technical effects:
The proposed scheme greatly improves recognition of tabular data in structured documents and substantially improves similarity determination for Excel-type spreadsheet documents. Compared with the traditional hash method for structured documents, the scheme is faster, easy to understand, suited to actual business requirements, and provides a solid technical capability for data management and control.
Brief description of the drawings
Fig. 1 is an information processing flowchart of the source-code data detection method based on regular-expression post-processing according to the invention.
Detailed description
<Determination method>
The invention discloses a document similarity determination method based on mixed document features, comprising the following steps:
1) performing regular-expression matching on the input file or data stream;
2) if the match fails, jumping to step 7); if the match succeeds, obtaining multiple features and jumping to step 3);
3) managing the feature values of each feature in a linked list, forming multiple feature lists;
4) forming multiple feature sequences from the feature values in the feature lists and their positions in the lists;
5) calculating the similarity between the sequences;
6) outputting the similarity determination result;
7) ending.
If the match succeeds in step 2), the feature values are reprocessed to remove false feature values.
In step 5), the similarity between sequences is judged by calculating the K-D distance or the Hamming distance between them.
Before outputting the similarity determination result, step 6) combines document attributes to judge the similarity between sequences.
After step 6), the determination result is further input to a deep learning or SVM module to obtain a decision model.
The document attributes include the document author, title, abstract, header, footer, and so on.
As shown in Fig. 1, Office documents enter the processing flow after being converted to txt text. Tabular data in semi-structured documents often resembles the following kind of data, where x, y, and z denote three kinds of features.
The main flow scans the full text with the multiple regular expressions in the matching template. Scanning yields sequences such as x_i*y_i*z_i (where * denotes any characters). For each specific feature class, a feature list is built that records the document offset at which each feature value appears and its character value. The minimum length among the feature lists determines the number of sequences: if the shortest of the lists has length 50 and the longest has length 100, 50 sequences are finally formed. The similarity between the sequence values
x_i*y_i*z_i, x_{i+1}*y_{i+1}*z_{i+1}
can be measured by the K-D distance or the Hamming distance. This completes the judgment of structured-content similarity. Judging the similarity of a semi-structured document must also take into account the similarity of the unstructured content, which is analyzed from the document attributes and the unstructured content using common methods such as SVM.
Combining the structured and unstructured similarity judgments is a very good scheme for judging the similarity of semi-structured documents.
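The sequence building and comparison described above can be sketched as follows. This is an illustration under stated assumptions: feature lists are plain Python lists of values, all function names are hypothetical, and only the Hamming distance is shown because the text does not define the K-D distance. Turning the distance into a 0..1 similarity is likewise an assumption.

```python
def build_sequences(feature_lists):
    """Combine the i-th value of every feature list into one sequence.

    The shortest feature list bounds the number of sequences.
    """
    names = sorted(feature_lists)
    if not names:
        return []
    count = min(len(feature_lists[name]) for name in names)
    return [tuple(feature_lists[name][i] for name in names) for i in range(count)]


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def sequence_similarity(s, t):
    """1.0 for identical sequences, 0.0 when every character differs."""
    a, b = "".join(s), "".join(t)
    return 1 - hamming_distance(a, b) / len(a)


lists = {"x": ["a1", "a2", "a3"], "y": ["b1", "b2"], "z": ["c1", "c2", "c3", "c4"]}
sequences = build_sequences(lists)  # two sequences, bounded by the list "y"
```

The equal-length requirement fits fixed-width values such as ID card or mobile numbers; variable-length features would need padding or a different distance.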
The invention proposes a document similarity determination method based on regular-expression features, word features, and document attribute features. Aimed at the need to detect unstructured data in enterprise data security management and control, it solves the matching of structured content within unstructured text as well as the similarity determination of structured and numeric spreadsheet documents, forming a new document similarity determination method.
(1) Accurate post-processing verification is applied to regular-expression features to ensure the accuracy of the feature-matching result.
Much numeric content, such as ID card numbers, bank card numbers, and mobile phone numbers, easily produces false positives if it is matched by the regex engine alone. A regex post-processing script (that is, the feature-value reprocessing program) is therefore introduced to verify the regex engine's matches, which improves accuracy. For example, a text may contain numbers that merely resemble ID card numbers. The digits of each candidate are checked: the digits starting at the 7th position should be a date of birth, and if they are not, the candidate is determined not to be an ID card number. Other feature values are verified similarly.
(2) Using document attributes as features is critical to judging text similarity. Information such as the abstract, remarks, header, and footer of a document differs from ordinary word features: it reflects the author's identity, the topic type, the document type, and so on, and contributes a great deal of information to the similarity judgment.
(3) A deep learning method is used, which is easy to implement. The three kinds of information (text, regular-expression features, and document attributes) serve as input. Deep training on a large number of documents determines the weights of the three kinds of information and the deep learning model they rely on. The resulting model can also receive feedback and be optimized during actual matching.
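The date-of-birth check mentioned in point (1) might look like the sketch below. It assumes an 18-character ID number whose characters 7 to 14 encode the birth date as YYYYMMDD; the function name, the `today` parameter, and the 1900 lower bound are illustrative choices that do not come from the source.

```python
from datetime import date


def has_valid_birth_date(candidate, today=None):
    """Return True if characters 7-14 (1-based) form a plausible birth date."""
    today = today or date.today()
    digits = candidate[6:14]
    if len(candidate) != 18 or not digits.isdigit():
        return False
    try:
        birth = date(int(digits[:4]), int(digits[4:6]), int(digits[6:8]))
    except ValueError:
        # Impossible calendar dates such as month 13 or 30 February.
        return False
    return date(1900, 1, 1) <= birth <= today


# A regex hit that fails this check would be dropped as a false feature value.
candidates = ["11010519900307123X", "110105199013071234"]
kept = [c for c in candidates if has_valid_birth_date(c)]
```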
The invention discloses a method for determining the similarity of mixed documents based on regular-expression and keyword features. The method considers regular-expression features in addition to keywords, so it can determine the similarity of both unstructured and structured documents. In addition, to improve the hit accuracy of features identified by regular expressions, a post-processing function is applied to the regex identification results. The technique builds a normalized vector for each document from regular-expression and keyword features and finally determines the similarity of the documents.
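One way to read the normalized-vector idea is sketched below. The text does not say how the vectors are built or compared, so counting feature occurrences, scaling to unit length, and comparing with cosine similarity are all assumptions, as are the function names.

```python
import math
from collections import Counter


def normalized_vector(features):
    """Count feature occurrences and scale the counts to unit length."""
    counts = Counter(features)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {name: c / norm for name, c in counts.items()}


def cosine_similarity(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(weight * v.get(name, 0.0) for name, weight in u.items())


# Regex feature types and keywords found in two hypothetical documents.
doc_a = normalized_vector(["id_card", "id_card", "salary"])
doc_b = normalized_vector(["id_card", "salary", "subsidy"])
score = cosine_similarity(doc_a, doc_b)
```

Because both vectors have unit length, the score falls between 0 (no shared features) and 1 (identical feature proportions).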
<Determination device>
The invention discloses a document similarity determination device based on mixed document features, comprising:
a regular-expression matching module, which performs regular-expression matching on the input file or data stream to obtain multiple features;
a linked-list management module, which manages the feature values of each feature in a linked list, forming multiple feature lists;
a feature sequence generation module, which forms multiple feature sequences from the feature values in the feature lists and their positions in the lists;
a similarity calculation module, which calculates the similarity between the sequences;
a result output module, which outputs the similarity determination result.
The device further comprises a feature reprocessing module, which reprocesses the feature values to remove false feature values.
The result output module combines document attributes to judge the similarity between sequences before outputting the similarity determination result.
The device further comprises a decision model forming module, which inputs the determination result to a deep learning or SVM module to obtain a decision model.
<Specific embodiment>
An enterprise performs similarity determination on documents containing users' salary information. The salary information in a document includes the user's name, ID card number, bank card number, mobile phone number, and so on. Matching rules are established as follows:
Feature 1 | Feature 2 | Feature 3 | ...
---|---|---|---
ID card | UnionPay card number | Mobile phone number | ...
1. Determine the regular expressions for ID card numbers, bank card numbers, mobile phone numbers, and so on. Determine the post-processing script: for ID card numbers, check the province, the date of birth, and whether the final check digit is correct; for UnionPay card numbers, check the card BIN at the start of the number and the Luhn checksum of the account number. This finally forms the sequence xyz of the three features above.
2. Extract the keywords related to salary information in the document, such as job grade, department information, performance, and subsidies.
3. Extract document attribute information such as author, title, abstract, header, and footer.
4. Input the collected information to the deep learning or SVM module to obtain the decision model.
5. Take a new document as input, judge its similarity, and obtain the result.
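Two of the checks named in step 1 can be sketched as follows. The embodiment does not spell out the algorithms, so this uses the standard rules as an assumption: the ISO 7064 MOD 11-2 check character defined for 18-character resident ID numbers, and the Luhn checksum for card numbers. The function and constant names are hypothetical, and the province and card BIN checks are omitted because they need lookup tables.

```python
# Position weights and check characters for 18-character ID numbers.
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CHARS = "10X98765432"


def id_check_digit_ok(candidate):
    """Verify the last character of an 18-character ID number."""
    body, last = candidate[:17], candidate[17:].upper()
    if len(candidate) != 18 or not body.isdigit():
        return False
    total = sum(w * int(d) for w, d in zip(ID_WEIGHTS, body))
    return ID_CHECK_CHARS[total % 11] == last


def luhn_ok(number):
    """Verify a card number with the Luhn checksum."""
    if not number.isdigit():
        return False
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            # Double every second digit from the right; fold two-digit results.
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0


# Candidates that fail either check would be removed as false feature values.
passes_id = id_check_digit_ok("11010519491231002X")
passes_luhn = luhn_ok("79927398713")
```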
The invention greatly improves recognition of tabular data in structured documents and substantially improves similarity determination for Excel-type spreadsheet documents. Compared with the traditional hash method for structured documents, the scheme is faster, easy to understand, suited to actual business requirements, and provides a solid technical capability for data management and control.
The foregoing describes only preferred embodiments of the invention and is not intended to limit its scope. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.
Claims (10)
1. A document similarity determination method based on mixed document features, comprising the following steps:
1) performing regular-expression matching on the input file or data stream;
2) if the match fails, jumping to step 7); if the match succeeds, obtaining multiple features and jumping to step 3);
3) managing the feature values of each feature in a linked list, forming multiple feature lists;
4) forming multiple feature sequences from the feature values in the feature lists and their positions in the lists;
5) calculating the similarity between the sequences;
6) outputting the similarity determination result;
7) ending.
2. The method according to claim 1, wherein, if the match succeeds in step 2), the feature values are reprocessed to remove false feature values.
3. The method according to claim 1, wherein in step 5) the similarity between sequences is judged by calculating the K-D distance or the Hamming distance between them.
4. The method according to claim 1, wherein, before outputting the similarity determination result, step 6) combines document attributes to judge the similarity between sequences.
5. The method according to claim 1, wherein, after step 6), the determination result is further input to a deep learning or SVM module to obtain a decision model.
6. A document similarity determination device based on mixed document features, comprising:
a regular-expression matching module, which performs regular-expression matching on the input file or data stream to obtain multiple features;
a linked-list management module, which manages the feature values of each feature in a linked list, forming multiple feature lists;
a feature sequence generation module, which forms multiple feature sequences from the feature values in the feature lists and their positions in the lists;
a similarity calculation module, which calculates the similarity between the sequences;
a result output module, which outputs the similarity determination result.
7. The device according to claim 6, further comprising a feature reprocessing module, which reprocesses the feature values to remove false feature values.
8. The device according to claim 6, wherein the result output module combines document attributes to judge the similarity between sequences before outputting the similarity determination result.
9. The method according to claim 6, further comprising a decision model forming module, which inputs the determination result to a deep learning or SVM module to obtain a decision model.
10. A computer-readable storage medium comprising computer program instructions, wherein executing the computer program instructions implements the method of any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711041146.0A CN107656909B (en) | 2017-10-30 | 2017-10-30 | Document similarity judgment method and device based on document mixing characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711041146.0A CN107656909B (en) | 2017-10-30 | 2017-10-30 | Document similarity judgment method and device based on document mixing characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107656909A (en) | 2018-02-02 |
CN107656909B (en) | 2021-06-01 |
Family
ID=61096204
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711041146.0A Active CN107656909B (en) | 2017-10-30 | 2017-10-30 | Document similarity judgment method and device based on document mixing characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107656909B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130054612A1 (en) * | 2006-10-10 | 2013-02-28 | Abbyy Software Ltd. | Universal Document Similarity |
CN101894143A (en) * | 2010-06-28 | 2010-11-24 | 北京用友政务软件有限公司 | Federated search and search result integrated display method and system |
CN104318340A (en) * | 2014-09-25 | 2015-01-28 | 中国科学院软件研究所 | Information visualization method and intelligent visual analysis system based on text curriculum vitae information |
CN105573971A (en) * | 2014-10-10 | 2016-05-11 | 富士通株式会社 | Table reconstruction apparatus and method |
CN105894253A (en) * | 2016-05-09 | 2016-08-24 | 陈包容 | Method and device for automatic pushing of job application demand |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110472209A (en) * | 2019-07-04 | 2019-11-19 | 重庆金融资产交易所有限责任公司 | Table generation method, device and computer equipment based on deep learning |
CN110472209B (en) * | 2019-07-04 | 2024-02-06 | 深圳同奈信息科技有限公司 | Deep learning-based table generation method and device and computer equipment |
Also Published As
Publication number | Publication date |
---|---|
CN107656909B (en) | 2021-06-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2005201758B2 (en) | Method of learning associations between documents and data sets | |
WO2019184217A1 (en) | Hotspot event classification method and apparatus, and storage medium | |
CN103336766B (en) | Short text garbage identification and modeling method and device | |
Xu et al. | Using deep linguistic features for finding deceptive opinion spam | |
US10997366B2 (en) | Methods, devices and systems for data augmentation to improve fraud detection | |
CN109960727B (en) | Personal privacy information automatic detection method and system for unstructured text | |
US20110029303A1 (en) | Word classification system, method, and program | |
WO2019179010A1 (en) | Data set acquisition method, classification method and device, apparatus, and storage medium | |
CN112184145A (en) | AI-based unmanned intervention approval system | |
CN110083832B (en) | Article reprint relation identification method, device, equipment and readable storage medium | |
CN107294834A (en) | A kind of method and apparatus for recognizing spam | |
CN112257444B (en) | Financial information negative entity discovery method, device, electronic equipment and storage medium | |
CN110610003B (en) | Method and system for assisting text annotation | |
CN112084308A (en) | Method, system and storage medium for text type data recognition | |
CN112463922A (en) | Risk user identification method and storage medium | |
CN107656909A (en) | A kind of Documents Similarity decision method and device based on document composite character | |
CN110321557A (en) | A kind of file classification method, device, electronic equipment and storage medium | |
CN115994531A (en) | Multi-dimensional text comprehensive identification method | |
CN116029280A (en) | Method, device, computing equipment and storage medium for extracting key information of document | |
CN108171589A (en) | Verification method and device | |
Souza et al. | ARCTIC: metadata extraction from scientific papers in pdf using two-layer CRF | |
CN112100336A (en) | Method and device for identifying preservation time of file and storage medium | |
CN111680513B (en) | Feature information identification method and device and computer readable storage medium | |
Kini | Term frequency tokenization for fake news detection | |
CN117332084B (en) | Machine learning method suitable for detecting malicious comments and false news simultaneously |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||