CN107656909A - Document similarity determination method and device based on mixed document features

Info

Publication number: CN107656909A
Application number: CN201711041146.0A
Authority: CN
Inventors: 魏效征; 王志海; 喻波; 安鹏
Original assignee: Beijing Wondersoft Technology Co Ltd
Current assignee: Beijing Wondersoft Technology Co Ltd
Priority date: 2017-10-30
Filing date: 2017-10-30
Publication date: 2018-02-02
Anticipated expiration: 2037-10-30
Also published as: CN107656909B

Abstract

The invention discloses a document similarity determination method and device based on mixed document features. The method comprises the following steps: performing regular-expression matching on an input file or data stream; terminating if matching fails, and, if matching succeeds, reprocessing the multiple feature strings output by the regular-expression matching; managing the reprocessed results in linked lists to form multiple feature lists; traversing the feature lists and merging their features; and outputting the similarity determination result. The scheme greatly improves recognition of tabular data in structured documents and substantially improves similarity determination for Excel-style spreadsheet documents. It is fast, easy to understand and suited to real business requirements, and provides a solid technical basis for data management and control.

Description

Document similarity determination method and device based on mixed document features

Technical field

The present invention relates to the field of computer search, and in particular to a document similarity determination method and device based on mixed document features.

Background art

Document similarity determination is widely used in applications such as Internet search, public opinion reporting and enterprise classification. Consequently, many text similarity recognition methods exist, both for structured, table-type documents and for unstructured, text-type documents.

However, documents containing tables are a common format in everyday enterprise work and often hold a large share of an enterprise's business information or sensitive data. In a financial report, for example, the tables may contain more sensitive information than the descriptive text, such as the company's various financial indicators. An unstructured document containing many tables of this kind is neither a structured document nor an unstructured document, but a mixed-type document. When the similarity of such documents is determined, the methods commonly used for unstructured or structured documents do not give good results. Designing a method that determines the similarity of mixed-type documents well is therefore very important for data leakage prevention.

Document similarity determination is an important technique in text information processing, and the prior art includes, for example:

Document 1, application number: CN201210491145.7, title of invention: A text similarity calculation method;

Document 2, application number: CN201410491458.1, title of invention: A text feature extraction system and method.

The above prior art has the following shortcomings:

(1) It does not consider the influence of structured data within unstructured documents. Numbers in a document, such as ID card numbers, bank card numbers, credit card verification codes and mobile phone numbers, are very important information; in data leakage prevention in particular, these features are far more important than keywords.

(2) It does not consider the document attribute features. Attributes such as the header, footer, author and remarks of a document are important factors in determining document similarity.

(3) It does not consider how keyword features, regular-expression features and document attributes relate to one another in determining document similarity.

Summary of the invention

To solve the above technical problems, the invention provides a document similarity determination method based on mixed document features, comprising the following steps:

1) performing regular-expression matching on the input file or data stream;

2) if matching fails, jumping to step 7); if matching succeeds, obtaining multiple kinds of features and proceeding to step 3);

3) managing the feature values of each kind of feature in a linked list, forming multiple feature lists;

4) forming multiple feature sequences from the feature values in the feature lists and their positions in the linked lists;

5) calculating the similarity between the sequences;

6) outputting the similarity determination result;

7) ending.
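
Steps 1) to 4) can be sketched in Python as follows. This is only an illustration under stated assumptions: the regular expressions, the feature names and the helper names (`PATTERNS`, `build_feature_lists`, `build_sequences`) are hypothetical and do not come from the patent, and an ordinary Python list stands in for each linked list.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical matching template; a production template would be stricter.
PATTERNS: Dict[str, re.Pattern] = {
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    "bank_card": re.compile(r"(?<!\d)\d{16}(?!\d)"),
    "mobile": re.compile(r"(?<!\d)1\d{10}(?!\d)"),
}

Entry = Tuple[int, str]  # (offset in the document, matched feature value)


def build_feature_lists(text: str) -> Dict[str, List[Entry]]:
    """Steps 1-3: scan the full text and keep one ordered list per feature kind."""
    return {name: [(m.start(), m.group()) for m in rx.finditer(text)]
            for name, rx in PATTERNS.items()}


def build_sequences(lists: Dict[str, List[Entry]]) -> List[Tuple[str, ...]]:
    """Step 4: combine the i-th value of every list into one sequence.

    The shortest list limits how many sequences are formed; if any list is
    empty, no sequence is formed and the caller can stop (step 7).
    """
    n = min((len(entries) for entries in lists.values()), default=0)
    return [tuple(lists[name][i][1] for name in PATTERNS) for i in range(n)]


sample = ("Zhang 110101199003074258 6222020200112233 13800138000\n"
          "Li 110101198001010010 6222020200445566 13900139000 13700137000")
feature_lists = build_feature_lists(sample)
sequences = build_sequences(feature_lists)
```

In this sample the mobile list holds three values and the other two lists hold two, so two sequences are formed.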

According to an embodiment of the invention, preferably, if matching succeeds in step 2), the feature values are reprocessed to remove false feature values.

According to an embodiment of the invention, preferably, in step 5) the similarity between sequences is determined by calculating the K-D distance or the Hamming distance between them.

According to an embodiment of the invention, preferably, in step 6) the document attributes are combined in determining the similarity between sequences before the similarity determination result is output.

According to an embodiment of the invention, preferably, after step 6) the determination result is further input to a deep learning or SVM module to obtain a determination model.

To solve the above technical problems, the invention provides a document similarity determination device based on mixed document features, comprising:

a regular-expression matching module, which performs regular-expression matching on the input file or data stream to obtain multiple kinds of features;

a linked-list management module, which manages the feature values of each kind of feature in a linked list, forming multiple feature lists;

a feature sequence generation module, which forms multiple feature sequences from the feature values in the feature lists and their positions in the linked lists;

a similarity calculation module, which calculates the similarity between the sequences;

a result output module, which outputs the similarity determination result.

According to an embodiment of the invention, preferably, the device further comprises a feature reprocessing module, which reprocesses the feature values to remove false feature values.

According to an embodiment of the invention, preferably, the result output module combines the document attributes in determining the similarity between sequences before outputting the similarity determination result.

According to an embodiment of the invention, preferably, the device further comprises a determination model forming module, which inputs the determination result to a deep learning or SVM module to obtain a determination model.

To solve the above technical problems, the invention provides a computer-readable storage medium containing computer program instructions which, when executed, implement the method of any one of the above claims.

The technical solution of the invention achieves the following technical effects:

The proposed scheme greatly improves recognition of tabular data in structured documents and substantially improves similarity determination for Excel-style spreadsheet documents. Compared with the traditional hash method for structured documents, the scheme is faster and easier to understand, suits real business requirements, and provides a solid technical basis for data management and control.

Brief description of the drawings

Fig. 1 is an information processing flow chart of the source code data detection method based on regular-expression post-processing according to the present invention.

Detailed description of the embodiments

The invention discloses a document similarity determination method based on mixed document features, comprising the following steps:

5) calculating the similarity between the sequences;

6) outputting the similarity determination result;

7) ending.

If matching succeeds in step 2), the feature values are reprocessed to remove false feature values.

In step 5), the similarity between sequences is determined by calculating the K-D distance or the Hamming distance between them.

In step 6), the document attributes are combined in determining the similarity between sequences before the similarity determination result is output.

After step 6), the determination result is further input to a deep learning or SVM module to obtain a determination model.

The document attributes include the document author, title, abstract, header, footer and so on.

As shown in Fig. 1, Office documents are first converted to plain text (txt) and then enter the processing flow. Tabular data in semi-structured documents often resembles the following, where x, y and z each denote one of three kinds of feature data.

The main flow scans the full text with the several regular expressions in the matching template. The scan yields sequences such as x_i*y_i*z_i (where * stands for arbitrary characters). For each specific kind of feature, a feature list is built that records the document offset at which each feature value of that kind occurs and its character value. The length of the shortest feature list determines the number of sequences: if the shortest of the linked lists has length 50 and the longest has length 100, 50 sequences are formed in the end. For the sequence values

x_i*y_i*z_i, x_{i+1}*y_{i+1}*z_{i+1}

the similarity between them can be measured by the K-D distance or the Hamming distance. This completes the determination of structured-content similarity. Determining the similarity of a semi-structured document must also take the similarity of its unstructured content into account. For unstructured content, similarity is analyzed from the document attributes and the unstructured content using common methods such as SVM.
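
The Hamming-distance comparison can be sketched as below. The pairing rule is an assumption: the sketch compares the i-th sequence of one document with the i-th sequence of another and treats each feature value as one symbol. The function names are hypothetical. The K-D distance is not sketched because the text does not define it.

```python
from typing import Sequence, Tuple

Row = Tuple[str, ...]  # one sequence x_i*y_i*z_i held as a tuple of feature values


def hamming(a: Sequence, b: Sequence) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def sequence_similarity(rows_a: Sequence[Row], rows_b: Sequence[Row]) -> float:
    """Compare two documents row by row over the shorter row count.

    Returns the share of feature values that agree (1.0 means identical).
    """
    n = min(len(rows_a), len(rows_b))
    if n == 0:
        return 0.0
    width = len(rows_a[0])
    mismatches = sum(hamming(rows_a[i], rows_b[i]) for i in range(n))
    return 1.0 - mismatches / (n * width)


doc_a = [("id1", "card1", "tel1"), ("id2", "card2", "tel2")]
doc_b = [("id1", "card1", "tel9"), ("id2", "card2", "tel2")]
score = sequence_similarity(doc_a, doc_b)  # one of six values differs
```

Dividing by the total number of compared values keeps the score in the range 0 to 1, which makes it easier to combine with other similarity scores later.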

Combining the structured and unstructured similarity determinations is a very good way to determine the similarity of semi-structured documents.

The invention proposes a document similarity determination method based on regular-expression features, word features and document attribute features. Aimed at the unstructured data detection needs of enterprise data security management and control, it solves the matching of structured content within unstructured text as well as similarity determination for structured documents and numeric table documents, forming a new document similarity determination method.

(1) Regular-expression features undergo precise post-processing verification, which ensures the accuracy of the feature matching results.

Much numeric content, such as ID card numbers, bank card numbers and mobile phone numbers, easily produces false positives if matched by the regular-expression engine alone. A regular-expression post-processing script (that is, the feature value reprocessing program) is therefore introduced to verify the engine's matching results and improve accuracy. For example, a text may contain numbers that merely resemble ID card numbers. These can be ruled out by examining the digits they contain: the digits starting at the 7th position should be a date of birth, and if they are not, the number can be determined not to be an ID card number. Other feature values are checked in a similar way.
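
A minimal sketch of this date-of-birth check follows. The candidate pattern, the function names and the plausibility bounds (year 1900 or later, not in the future) are assumptions made for illustration, not details given in the text.

```python
import re
from datetime import datetime
from typing import List

# Hypothetical candidate pattern: 17 digits plus a digit or X, not embedded in a longer number.
ID_CANDIDATE = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")


def has_valid_birth_date(candidate: str) -> bool:
    """Digits 7 to 14 (1-based) must form a real calendar date, YYYYMMDD."""
    try:
        born = datetime.strptime(candidate[6:14], "%Y%m%d")
    except ValueError:
        return False  # e.g. month 13 or day 34
    # Assumed plausibility bounds: born in 1900 or later and not in the future.
    return born.year >= 1900 and born <= datetime.now()


def filter_id_candidates(text: str) -> List[str]:
    """Keep only the regex hits that survive the post-processing check."""
    return [m.group() for m in ID_CANDIDATE.finditer(text)
            if has_valid_birth_date(m.group())]


hits = filter_id_candidates("a 110101199003074258 b 123456789012345678")
```

The second number in the sample has the right shape but fails the check, because its digits 7 to 14 do not form a calendar date.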

(2) Using document attributes as features is critical to determining text similarity. Information such as the abstract, remarks, header and footer of a document differs from ordinary word features: it reflects the author's identity, the subject type, the document type and similar information, and contributes a great deal of information to similarity determination.

(3) A deep learning method is used, which is easy to implement. Three kinds of information (text, regular-expression features and document attributes) serve as input, and deep training on a large number of documents determines the weights of the three kinds of information and the deep learning model relied upon. The resulting model can also receive feedback during actual matching and be optimized.

The invention discloses a method for determining the similarity of mixed documents based on regular-expression and keyword features. Because the method considers regular-expression features in addition to keywords, it can determine the similarity of structured documents as well as unstructured documents. In addition, to improve the hit accuracy of feature recognition by regular expressions, post-processing of the regular-expression recognition results is introduced. The technique builds a normalized vector for each document from its regular-expression and keyword features and finally determines the similarity of documents.
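
One possible reading of the normalized-vector step is sketched below. The text does not say how the vectors are built or compared, so the choices here are assumptions: features are counted by label, counts are scaled to unit length, and cosine similarity is used for the comparison. The `kw:` and `re:` label prefixes and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def normalized_vector(features: Iterable[str]) -> Dict[str, float]:
    """Count each feature label and scale the counts to unit (L2) length."""
    counts = Counter(features)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {k: c / norm for k, c in counts.items()} if norm else {}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Dot product of two unit vectors stored as sparse dicts."""
    return sum(w * v.get(k, 0.0) for k, w in u.items())


# Labels mix keyword hits and regular-expression hits in one vector.
doc_1 = normalized_vector(["kw:salary", "kw:salary", "re:id_card"])
doc_2 = normalized_vector(["kw:salary", "re:id_card", "re:id_card"])
doc_3 = normalized_vector(["kw:bonus"])
```

Because the vectors are already unit length, the dot product alone gives the cosine, and documents with no shared features score 0.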

The invention discloses a document similarity determination device based on mixed document features, comprising:

a result output module, which outputs the similarity determination result.

The device further comprises a feature reprocessing module, which reprocesses the feature values to remove false feature values.

The result output module combines the document attributes in determining the similarity between sequences before outputting the similarity determination result.

The device further comprises a determination model forming module, which inputs the determination result to a deep learning or SVM module to obtain a determination model.

An enterprise performs similarity determination on documents containing users' salary information. The salary information in such a document includes the user's name, ID card number, bank card number, mobile phone number and so on. In addition, the following matching rules are established:

Feature 1 | Feature 2 | Feature 3

ID card number | UnionPay card number | Mobile phone number | ...

1. Determine the regular expressions for ID card numbers, bank card numbers, mobile phone numbers and so on. Determine the post-processing scripts: for ID card numbers, check the province code, the date of birth and whether the final check digit is correct; for UnionPay card numbers, check the card BIN at the start of the number and the Luhn checksum of the account number. Finally, form the sequence xyz of the three features above.

2. Extract the keywords related to salary information in the document, such as job grade, department information, performance and allowances.

3. Extract attribute information such as the document's author, title, abstract, header and footer.

4. Input the collected information to the deep learning or SVM module to obtain the determination model;

5. Take a new document as input, determine its similarity and obtain the result.
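
Two of the post-processing checks named in step 1 can be sketched as follows. The Luhn checksum is the standard algorithm. The ID check-digit weights and check characters are those of the national standard GB 11643-1999 as recalled here, not values given in the patent, and the function names are hypothetical. The province-code and card-BIN checks are omitted because they need lookup tables that the text does not supply.

```python
def luhn_ok(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:      # every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9      # same as adding the two digits of the product
        total += d
    return total % 10 == 0


# Assumed from GB 11643-1999: weights for the first 17 digits and the
# check character selected by (weighted sum mod 11).
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CHARS = "10X98765432"


def id_check_digit_ok(number: str) -> bool:
    """Verify the 18th character of an ID number against the first 17 digits."""
    body = number[:17]
    if len(number) != 18 or not (body.isascii() and body.isdigit()):
        return False
    total = sum(int(d) * w for d, w in zip(body, ID_WEIGHTS))
    return ID_CHECK_CHARS[total % 11] == number[17].upper()
```

A reprocessing script could run these checks on every regular-expression hit and drop the hits that fail, so that only verified values enter the feature lists.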

The invention greatly improves recognition of tabular data in structured documents and substantially improves similarity determination for Excel-style spreadsheet documents. Compared with the traditional hash method for structured documents, the scheme is faster and easier to understand, suits real business requirements, and provides a solid technical basis for data management and control.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A document similarity determination method based on mixed document features, comprising the following steps:

5) calculating the similarity between the sequences;

6) outputting the similarity determination result;

7) ending.

2. The method according to claim 1, wherein, if matching succeeds in step 2), the feature values are reprocessed to remove false feature values.

3. The method according to claim 1, wherein in step 5) the similarity between sequences is determined by calculating the K-D distance or the Hamming distance between them.

4. The method according to claim 1, wherein in step 6) the document attributes are combined in determining the similarity between sequences before the similarity determination result is output.

5. The method according to claim 1, wherein after step 6) the determination result is further input to a deep learning or SVM module to obtain a determination model.

6. A document similarity determination device based on mixed document features, comprising:

a regular-expression matching module, which performs regular-expression matching on the input file or data stream to obtain multiple kinds of features;

a feature sequence generation module, which forms multiple feature sequences from the feature values in the feature lists and their positions in the linked lists;

a result output module, which outputs the similarity determination result.

7. The device according to claim 6, further comprising a feature reprocessing module, which reprocesses the feature values to remove false feature values.

8. The device according to claim 6, wherein the result output module combines the document attributes in determining the similarity between sequences before outputting the similarity determination result.

9. The device according to claim 6, further comprising a determination model forming module, which inputs the determination result to a deep learning or SVM module to obtain a determination model.

10. A computer-readable storage medium containing computer program instructions which, when executed, implement the method of any one of claims 1 to 5.