CN108021951A - Document detection method, server, and computer-readable storage medium - Google Patents

Info

Publication number
CN108021951A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711468430.6A
Other languages
Chinese (zh)
Inventor
宋鹏举
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-12-29
Filing date
2017-12-29
Publication date
2018-05-11
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Storage Device Security (AREA)

Abstract

The invention discloses a document detection method, a server, and a computer-readable storage medium. The invention builds a fingerprint library of preset documents and then, according to the feature data in the fingerprint library, performs similarity detection and document repetition detection on a document to be detected. Both kinds of detection are thereby carried out on the document to be detected, which solves the problem of inaccurate document detection in the prior art.

Description

Document detection method, server, and computer-readable storage medium
Technical field
The present invention relates to the field of computer technology, and in particular to a document detection method, a server, and a computer-readable storage medium.
Background
Documents are a common type of data in enterprises and institutions. Much of an organization's important data exists in document form, so effectively protecting the content of important documents is an important part of enterprise security.
The prior art takes two approaches. The first detects how similar two documents are, but it cannot effectively detect repeated text blocks. The second detects the length of identical content blocks, but it cannot effectively detect the overall similarity of documents. In other words, neither prior-art approach detects documents accurately.
Summary of the invention
The present invention provides a document detection method, a server, and a computer-readable storage medium to solve the problem of inaccurate document detection in the prior art.
In one aspect, the present invention provides a document detection method comprising: building a fingerprint library of preset documents, the fingerprint library being used to store feature data of the preset documents; and performing similarity detection and document repetition detection on a document to be detected according to the feature data in the fingerprint library.
Further, building the fingerprint library of preset documents specifically comprises: extracting the content of a preset document, generating the feature words and feature characters of the preset document using a locality-sensitive hashing algorithm, obtaining the sentence blocks in the preset document, and storing the feature words, the feature characters, and the sentence blocks in the fingerprint library.
Further, performing similarity detection and document repetition detection on the document to be detected according to the feature data in the fingerprint library specifically comprises: according to the fingerprint library and the feature data of the document to be detected, performing similarity detection on the document to be detected and the preset document, and performing document repetition detection on the document to be detected and the preset document.
Further, the method also comprises: extracting the feature data of the document to be detected.
Further, extracting the feature data of the document to be detected specifically comprises: extracting the feature words and feature characters of the document to be detected using the locality-sensitive hashing algorithm, and obtaining the sentence blocks in the preset document.
Further, performing document repetition detection on the document to be detected and the preset document specifically comprises:
judging whether a sentence block of the preset document is present in the document to be detected; if so, recording the number of such sentence blocks present in the document to be detected; otherwise, determining that the document to be detected does not repeat the preset document.
Further, judging whether a sentence block of the preset document is present in the document to be detected specifically comprises: judging whether the document to be detected contains a sentence block that has the same meaning as a sentence block in the preset document and whose character count differs from that of the sentence block by an amount within a preset range; if so, determining that the sentence block is present in the document to be detected.
Further, the feature data includes one or more of the following: feature characters, feature words, and sentence blocks.
In another aspect, the present invention also provides a server comprising a processor, a memory, and a communication bus;
the communication bus is used to implement connection and communication between the processor and the memory;
the processor is used to execute computer instructions stored in the memory to implement any of the document detection methods described above.
In yet another aspect, the present invention also provides a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement any of the document detection methods provided by the present invention.
The beneficial effects of the present invention are as follows:
The present invention builds a fingerprint library of preset documents and then, according to the feature data in the fingerprint library, performs similarity detection and document repetition detection on a document to be detected. Both kinds of detection are thereby carried out on the document to be detected, which solves the problem of inaccurate document detection in the prior art.
Brief description of the drawings
Fig. 1 is a flowchart of a document detection method according to an embodiment of the present invention;
Fig. 2 is a flowchart of another document detection method according to an embodiment of the present invention;
Fig. 3 is a structural diagram of a server according to an embodiment of the present invention.
Detailed description
To solve the problem of inaccurate document detection in the prior art, the present invention provides a document detection method. The method builds a fingerprint library of preset documents and then, according to the feature data in the fingerprint library, performs similarity detection and document repetition detection on a document to be detected, so that both kinds of detection are carried out on that document. The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.
An embodiment of the present invention provides a document detection method. Referring to Fig. 1, the method includes:
S101: build a fingerprint library of preset documents, the fingerprint library being used to store the feature data of the preset documents;
S102: according to the feature data in the fingerprint library, perform similarity detection and document repetition detection on the document to be detected.
That is, the present invention builds a fingerprint library of preset documents and then uses the feature data in that library to perform both similarity detection and document repetition detection on the document to be detected, which solves the problem of inaccurate document detection in the prior art.
It should be noted that the feature data in this embodiment includes feature characters, feature words, and sentence blocks. Of course, those skilled in the art may also define other feature data as actually needed.
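The three kinds of feature data named above could be held in a structure like the following. This is a minimal sketch for illustration only: the class names, field names, and the choice to key the library by a document identifier are assumptions, not details given in the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DocumentFeatures:
    """Feature data for one document (field names are hypothetical)."""
    feature_words: Set[str] = field(default_factory=set)
    feature_chars: Set[str] = field(default_factory=set)
    sentence_blocks: List[str] = field(default_factory=list)


@dataclass
class FingerprintLibrary:
    """Maps an identifier of a preset document to its stored feature data."""
    entries: Dict[str, DocumentFeatures] = field(default_factory=dict)

    def add(self, doc_id: str, features: DocumentFeatures) -> None:
        # Step S101: store the feature data of a preset document.
        self.entries[doc_id] = features

    def get(self, doc_id: str) -> DocumentFeatures:
        return self.entries[doc_id]
```

Keeping sentence blocks as an ordered list while words and characters are sets reflects how they are used later: blocks are compared one by one for repetition, while words and characters feed the similarity comparison.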
That is, this embodiment addresses the prior art's inability to perform similarity detection and document repetition detection on a document at the same time. The acquired feature data is stored in the fingerprint library, and the server is programmed to use the feature data in the fingerprint library to perform both similarity detection and document repetition detection on the document to be detected. This reduces the complexity of document detection and improves the system's performance when processing file detection.
In a specific implementation, this embodiment performs similarity detection on the document to be detected using the feature characters and feature words, and performs document repetition detection using the sentence blocks, thereby achieving comprehensive and accurate detection of the document to be detected.
In a specific implementation, building the fingerprint library of preset documents in this embodiment specifically comprises: extracting the content of a preset document, generating the feature words and feature characters of the preset document using a locality-sensitive hashing algorithm, obtaining the sentence blocks in the preset document, and storing the feature words, the feature characters, and the sentence blocks in the fingerprint library.
That is, this embodiment generates the feature words and feature characters of the preset document with a locality-sensitive hashing algorithm, obtains the sentence blocks of the preset document according to a configured rule, and stores the resulting feature words, feature characters, and sentence blocks in the fingerprint library.
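The patent does not say which locality-sensitive hashing algorithm is used. The sketch below assumes SimHash, one well-known member of that family, and computes one fingerprint over words and one over characters. The function names, the 64-bit width, the use of SHA-256 as the per-token hash, and the punctuation-based sentence splitting are all illustrative assumptions.

```python
import hashlib
import re


def simhash(tokens, bits=64):
    """Return a SimHash fingerprint (an int) for an iterable of string tokens.

    Assumes bits <= 64 because only the first 8 digest bytes are used.
    """
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def build_entry(text):
    """Build one fingerprint-library entry from extracted document text."""
    words = re.findall(r"\w+", text.lower())
    chars = [c for c in text if not c.isspace()]
    # Hypothetical rule: a sentence block ends at sentence punctuation.
    blocks = [s.strip() for s in re.split(r"[.!?。！？]+", text) if s.strip()]
    return {
        "word_fingerprint": simhash(words),
        "char_fingerprint": simhash(chars),
        "sentence_blocks": blocks,
    }
```

Because similar token sets vote similarly on each bit, documents with mostly the same words end up with fingerprints that differ in only a few bits, which is the property the similarity step relies on.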
In a specific implementation, performing similarity detection and document repetition detection on the document to be detected according to the feature data in the fingerprint library specifically comprises: according to the fingerprint library and the feature data of the document to be detected, performing similarity detection on the document to be detected and the preset document, and performing document repetition detection on the document to be detected and the preset document.
Specifically, this embodiment performs similarity detection and document repetition detection on the document to be detected according to the fingerprint library.
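The patent does not define how the similarity score is computed. One common choice with hash-based fingerprints, shown here purely as an assumption, is to compare fixed-width integer fingerprints by Hamming distance. The 64-bit width, the 0.9 threshold, and the function names are hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a: int, b: int, bits: int = 64) -> float:
    """Map Hamming distance to a score in [0, 1]; 1.0 means identical."""
    return 1.0 - hamming_distance(a, b) / bits


def find_similar(query_fp, library, threshold=0.9, bits=64):
    """Return (doc_id, score) pairs for preset documents scoring at or
    above the threshold, best match first.

    `library` maps a preset-document id to its integer fingerprint.
    """
    hits = [(doc_id, similarity(query_fp, fp, bits))
            for doc_id, fp in library.items()]
    hits = [h for h in hits if h[1] >= threshold]
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

A linear scan like this is adequate for a small library; a larger one would typically index fingerprints by bit bands so that only near candidates are compared.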
In a specific implementation, this embodiment needs to extract the feature data of the document to be detected. The extraction method is the same as described above, that is, the feature words and feature characters of the document to be detected are extracted with the locality-sensitive hashing algorithm.
In a specific implementation, performing document repetition detection on the document to be detected and the preset document specifically comprises: judging whether a sentence block of the preset document is present in the document to be detected; if so, recording the number of such sentence blocks present in the document to be detected; otherwise, determining that the document to be detected does not repeat the preset document.
Specifically, judging whether a sentence block of the preset document is present in the document to be detected comprises: judging whether the document to be detected contains a sentence block that has the same meaning as a sentence block in the preset document and whose character count differs from that of the sentence block by an amount within a preset range; if so, determining that the sentence block is present in the document to be detected.
That is, to obtain a more accurate document repetition result, the repetition detection of this embodiment allows a certain difference in character count, provided that the meaning is the same.
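The two conditions above (same meaning, and a character-count difference within a preset range) can be sketched as a predicate. The patent does not explain how "same meaning" is decided, so this sketch substitutes a deliberately crude stand-in: two blocks are treated as equivalent when they are equal after punctuation, whitespace, and letter case are ignored. The `max_len_diff` default and all function names are hypothetical.

```python
import re


def normalize(block: str) -> str:
    """Crude stand-in for 'same meaning': drop punctuation and whitespace,
    and ignore letter case."""
    return re.sub(r"[\W_]+", "", block).casefold()


def blocks_match(candidate: str, preset: str, max_len_diff: int = 2) -> bool:
    """True if the character counts differ by at most max_len_diff and the
    two blocks normalize to the same text."""
    if abs(len(candidate) - len(preset)) > max_len_diff:
        return False
    return normalize(candidate) == normalize(preset)


def contains_block(doc_blocks, preset_block, max_len_diff=2):
    """True if any sentence block of the document matches the preset block."""
    return any(blocks_match(b, preset_block, max_len_diff) for b in doc_blocks)
```

Checking the length difference first is a cheap filter; a real system would replace `normalize` with whatever semantic comparison it actually uses.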
In a specific implementation, this embodiment determines the degree of repetition between the document to be detected and the preset document by recording the number of the preset document's sentence blocks that are present in the document to be detected.
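Counting the matching sentence blocks can be sketched as follows. The matching rule is passed in as a parameter and defaults to exact equality, which is an assumption made to keep the example small; the function names and the shape of the returned report are also hypothetical.

```python
def repetition_count(doc_blocks, preset_blocks, matches=lambda a, b: a == b):
    """Count the preset sentence blocks that appear in the document under test."""
    return sum(
        1 for preset in preset_blocks
        if any(matches(block, preset) for block in doc_blocks)
    )


def repetition_report(doc_blocks, preset_blocks):
    """Summarize the result: a count of zero means no repetition was found."""
    count = repetition_count(doc_blocks, preset_blocks)
    return {"repeated": count > 0, "count": count}
```

For example, a document containing two of a preset document's three sentence blocks yields a count of 2, and a document containing none is reported as not repeating the preset document.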
Fig. 2 is the flow diagram of the method for another document detection of the embodiment of the present invention, below in conjunction with Fig. 2 to this The invention method carries out detailed explanation and illustration:
The document content of important documents (i.e. above-mentioned default document) is extracted first;
The characteristic of document content is obtained, and is stored in fingerprint base;
Detection model is built according to fingerprint base, specifically, structure fingerprint library initialization detecting system, and construct corresponding Detection model;
After being pre-processed for document to be detected, similarity detection is carried out to document to be detected and document multiplicity is examined Survey.
The step specifically includes:Content characteristic is extracted using local sensitivity hash algorithm, according to the content characteristic pair of extraction Document to be checked carries out similarity detection and document block repeats to detect.
The present invention is by merging two kinds of detection functions to document, so that reduce the complexity to document detection, so that Improve the process performance that system detects file.
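The flow of Fig. 2, in which one pass over the library produces both results, might be organized as below. This is a hedged sketch of the orchestration only: the class name, method names, and the three plug-in functions are hypothetical, and the toy plug-ins (word-set Jaccard overlap and exact sentence matching) stand in for the locality-sensitive hashing and sentence-block rules that the patent leaves unspecified.

```python
class DocumentDetector:
    """Runs similarity and repetition detection from one feature extraction."""

    def __init__(self, extract_features, similarity, count_repeats):
        self.extract_features = extract_features
        self.similarity = similarity
        self.count_repeats = count_repeats
        self.library = {}  # preset-document id -> feature data

    def add_preset(self, doc_id, text):
        # Extract content, obtain feature data, store it in the library.
        self.library[doc_id] = self.extract_features(text)

    def detect(self, text):
        # Extract features once, then run both detections per preset document.
        query = self.extract_features(text)
        return {
            doc_id: {
                "similarity": self.similarity(query, feats),
                "repeated_blocks": self.count_repeats(query, feats),
            }
            for doc_id, feats in self.library.items()
        }


# Toy plug-ins, for illustration only.
def toy_features(text):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    words = set(text.lower().replace(".", " ").split())
    return {"words": words, "blocks": sentences}


def jaccard(a, b):
    union = a["words"] | b["words"]
    return len(a["words"] & b["words"]) / len(union) if union else 1.0


def shared_blocks(a, b):
    return len(set(a["blocks"]) & set(b["blocks"]))
```

A detector built with these plug-ins, given the preset text "Keep the key safe. Rotate it monthly." and the input "Keep the key safe. Never share it.", reports one repeated block and a partial similarity score for that preset document.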
Correspondingly, as shown in Fig. 3, an embodiment of the present invention also provides a server comprising a processor, a memory, and a communication bus;
The communication bus is used to implement connection and communication between the processor and the memory;
The memory is used to store computer instructions, and the processor is used to run the computer instructions stored in the memory to implement the steps of any of the document detection methods in the method embodiments and to achieve the corresponding technical effects. For details, refer to the method embodiments; they are not repeated here.
Correspondingly, an embodiment of the present invention also provides a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement any of the document detection methods provided by the foregoing embodiments, and therefore also achieving the corresponding technical effects. For details, refer to the method embodiments; they are not repeated here.
Although preferred embodiments of the present invention have been disclosed for the purpose of example, those skilled in the art will recognize that various improvements, additions, and substitutions are also possible. Therefore, the scope of the present invention should not be limited to the above embodiments.

Claims (10)

  1. A document detection method, characterized by comprising:
    building a fingerprint library of preset documents, the fingerprint library being used to store the feature data of the preset documents;
    performing similarity detection and document repetition detection on a document to be detected according to the feature data in the fingerprint library.
  2. The method according to claim 1, characterized in that building the fingerprint library of preset documents specifically comprises:
    extracting the content of a preset document, generating the feature words and feature characters of the preset document by a locality-sensitive hashing algorithm, obtaining the sentence blocks in the preset document, and storing the feature words, the feature characters, and the sentence blocks in the fingerprint library.
  3. The method according to claim 1, characterized in that performing similarity detection and document repetition detection on the document to be detected according to the feature data in the fingerprint library specifically comprises:
    according to the fingerprint library and the feature data of the document to be detected, performing similarity detection on the document to be detected and the preset document, and performing document repetition detection on the document to be detected and the preset document.
  4. The method according to claim 3, characterized by further comprising:
    extracting the feature data of the document to be detected.
  5. The method according to claim 4, characterized in that extracting the feature data of the document to be detected specifically comprises:
    extracting the feature words and feature characters of the document to be detected by the locality-sensitive hashing algorithm.
  6. The method according to claim 3, characterized in that performing document repetition detection on the document to be detected and the preset document specifically comprises:
    judging whether a sentence block of the preset document is present in the document to be detected; if so, recording the number of such sentence blocks present in the document to be detected; otherwise, determining that the document to be detected does not repeat the preset document.
  7. The method according to claim 6, characterized in that judging whether a sentence block of the preset document is present in the document to be detected specifically comprises:
    judging whether the document to be detected contains a sentence block that has the same meaning as a sentence block in the preset document and whose character count differs from that of the sentence block by an amount within a preset range; if so, determining that the sentence block is present in the document to be detected.
  8. The method according to claim 1, characterized in that
    the feature data includes one or more of the following: feature characters, feature words, and sentence blocks.
  9. A server, characterized in that the server comprises a processor, a memory, and a communication bus;
    the communication bus is used to implement connection and communication between the processor and the memory;
    the processor is used to execute computer instructions stored in the memory to implement the document detection method according to any one of claims 1 to 8.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs, the one or more programs being executable by one or more processors to implement the document detection method according to any one of claims 1 to 8.
CN201711468430.6A 2017-12-29 2017-12-29 Document detection method, server, and computer-readable storage medium Pending CN108021951A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711468430.6A CN108021951A (en) 2017-12-29 2017-12-29 Document detection method, server, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711468430.6A CN108021951A (en) 2017-12-29 2017-12-29 Document detection method, server, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN108021951A (en) 2018-05-11

Family

ID=62071831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711468430.6A Pending CN108021951A (en) 2017-12-29 2017-12-29 Document detection method, server, and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN108021951A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101458708A (en) * 2008-12-05 2009-06-17 北京大学 Searching result clustering method and device
US8015168B2 (en) * 2007-11-12 2011-09-06 Sap Ag String pooling
CN102937994A (en) * 2012-11-15 2013-02-20 北京锐安科技有限公司 Similar document query method based on stop words
CN103207864A (en) * 2012-01-13 2013-07-17 北京中文在线数字出版股份有限公司 Online novel content similarity comparison method
CN106844314A (en) * 2017-02-21 2017-06-13 北京焦点新干线信息技术有限公司 A kind of duplicate checking method and device of article

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344407A (en) * 2018-10-29 2019-02-15 北京天融信网络安全技术有限公司 Semantic-based document fingerprint construction method, storage medium and computer equipment
CN109344407B (en) * 2018-10-29 2024-02-09 天融信雄安网络安全技术有限公司 Semantic-based document fingerprint construction method, storage medium and computer equipment
CN112861505A (en) * 2021-02-04 2021-05-28 北京百度网讯科技有限公司 Method and device for detecting repeatability and electronic equipment

Similar Documents

Publication Publication Date Title
US11126720B2 (en) System and method for automated machine-learning, zero-day malware detection
US11188650B2 (en) Detection of malware using feature hashing
US11481492B2 (en) Method and system for static behavior-predictive malware detection
US10049096B2 (en) System and method of template creation for a data extraction tool
KR101337874B1 (en) System and method for detecting malwares in a file based on genetic map of the file
US11783034B2 (en) Apparatus and method for detecting malicious script
US11373065B2 (en) Dictionary based deduplication of training set samples for machine learning based computer threat analysis
US11650579B2 (en) Information processing device, production facility monitoring method, and computer-readable recording medium recording production facility monitoring program
CN109858248A (en) Malice Word document detection method and device
KR101638535B1 (en) Method of detecting issue patten associated with user search word, server performing the same and storage medium storing the same
CN106649221A (en) Method and device for detecting duplicated texts
CN110929110B (en) Electronic document detection method, device, equipment and storage medium
CN108153728B (en) Keyword determination method and device
CN107786529B (en) Website detection method, device and system
CN108021951A (en) A kind of method of document detection, server and computer-readable recording medium
CN112231696B (en) Malicious sample identification method, device, computing equipment and medium
CN112395866A (en) Customs declaration data matching method and device
US11308208B2 (en) Classifying ransom notes in received files for ransomware process detection and prevention
CN114254069A (en) Domain name similarity detection method and device and storage medium
Carpineto et al. Automatic assessment of website compliance to the European cookie law with CooLCheck
CN109657472B (en) SQL injection vulnerability detection method, device, equipment and readable storage medium
WO2016127858A1 (en) Method and device for identifying webpage intrusion script features
US20200334353A1 (en) Method and system for detecting and classifying malware based on families
US9990339B1 (en) Systems and methods for detecting character encodings of text streams
CN108804916A (en) Detection method, device, electronic equipment and the storage medium of malicious file

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180511