CN108021951A - Document detection method, server, and computer-readable storage medium
- Publication number
- CN108021951A (application number CN201711468430.6A)
- Authority
- CN
- China
- Prior art keywords
- document
- checked
- detection
- default
- characteristic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Storage Device Security (AREA)
Abstract
The invention discloses a document detection method, a server, and a computer-readable storage medium. A fingerprint library of preset documents is built, and similarity detection and document repetition detection are then performed on a document to be checked according to the feature data in the fingerprint library. Both kinds of detection are thereby carried out on the same document, which addresses the inaccurate document detection of the prior art.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a document detection method, a server, and a computer-readable storage medium.
Background
Document data is a common data type in enterprises and institutions. Much of an organization's important data exists in the form of documents, so effectively protecting the content of important documents is an important part of enterprise security.
The prior art follows two directions. One detects how similar two documents are, but cannot effectively detect repeated text blocks. The other detects the length of identical content blocks, but cannot effectively detect how similar the documents are. In other words, neither prior-art approach detects documents accurately.
Summary of the invention
The present invention provides a document detection method, a server, and a computer-readable storage medium, to solve the problem of inaccurate document detection in the prior art.
In one aspect, the present invention provides a document detection method, comprising: building a fingerprint library of preset documents, the fingerprint library being used to store feature data of the preset documents; and performing similarity detection and document repetition detection on a document to be checked according to the feature data in the fingerprint library.
Further, building the fingerprint library of preset documents specifically comprises: extracting the content of a preset document, generating feature words and feature characters of the preset document using a locality-sensitive hashing (LSH) algorithm, obtaining the sentence blocks in the preset document, and storing the feature words, the feature characters, and the sentence blocks in the fingerprint library.
Further, performing similarity detection and document repetition detection on the document to be checked according to the feature data in the fingerprint library specifically comprises: according to the feature data in the fingerprint library and the feature data of the document to be checked, performing similarity detection between the document to be checked and the preset document, and performing document repetition detection between the document to be checked and the preset document.
Further, the method also comprises: extracting the feature data of the document to be checked.
Further, extracting the feature data of the document to be checked specifically comprises: extracting the feature words and feature characters of the document to be checked using the locality-sensitive hashing (LSH) algorithm, and obtaining the sentence blocks in the preset document.
Further, performing document repetition detection between the document to be checked and the preset document specifically comprises:
Determining whether a sentence block of the preset document is present in the document to be checked. If so, the number of such sentence blocks present in the document to be checked is recorded; otherwise, the document to be checked is determined not to repeat the preset document.
Further, determining whether a sentence block of the preset document is present in the document to be checked specifically comprises: determining whether the document to be checked contains a sentence block that has the same meaning as a sentence block in the preset document and whose word count differs from that sentence block's by an amount within a preset range. If so, the document to be checked is determined to contain the sentence block.
Further, the feature data includes one or more of the following: feature characters, feature words, and sentence blocks.
In another aspect, the present invention also provides a server, the server comprising a processor, a memory, and a communication bus;
The communication bus is used to implement connection and communication between the processor and the memory;
The processor is used to execute computer instructions stored in the memory, to implement any of the document detection methods described above.
In yet another aspect, the present invention also provides a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement any of the document detection methods provided by the present invention.
The beneficial effects of the present invention are as follows:
The present invention builds a fingerprint library of preset documents and then, according to the feature data in the fingerprint library, performs similarity detection and document repetition detection on the document to be checked. Both kinds of detection are thereby carried out on the same document, which solves the problem of inaccurate document detection in the prior art.
Brief description of the drawings
Fig. 1 is a flow diagram of a document detection method according to an embodiment of the present invention;
Fig. 2 is a flow diagram of another document detection method according to an embodiment of the present invention;
Fig. 3 is a structural diagram of the server according to an embodiment of the present invention.
Detailed description
To solve the problem of inaccurate document detection in the prior art, the present invention provides a document detection method. The method builds a fingerprint library of preset documents and then, according to the feature data in the fingerprint library, performs similarity detection and document repetition detection on the document to be checked, so that both kinds of detection are carried out on the same document. The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.
An embodiment of the present invention provides a document detection method. Referring to Fig. 1, the method comprises:
S101: building a fingerprint library of preset documents, the fingerprint library being used to store feature data of the preset documents;
S102: performing similarity detection and document repetition detection on a document to be checked according to the feature data in the fingerprint library.
That is, both detections run against a single fingerprint library built in advance from the preset documents, so one pass over the document to be checked yields both a similarity result and a repetition result, which solves the problem of inaccurate document detection in the prior art.
It should be noted that the feature data described in the embodiments of the present invention includes feature characters, feature words, and sentence blocks. Of course, those skilled in the art may also define other feature data according to actual needs.
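As an illustration of what such a fingerprint library might hold, the following is a minimal sketch. The class names `DocumentFeatures` and `FingerprintLibrary`, the field names, and the in-memory dictionary are all assumptions made for this example; the patent does not specify a storage format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class DocumentFeatures:
    """Feature data of one document (hypothetical layout, not from the patent)."""
    feature_words: frozenset[str]       # word-level features
    feature_chars: frozenset[str]       # character-level features
    sentence_blocks: tuple[str, ...]    # sentence blocks in document order


@dataclass
class FingerprintLibrary:
    """In-memory store mapping a document id to its feature data."""
    entries: dict[str, DocumentFeatures] = field(default_factory=dict)

    def add(self, doc_id: str, features: DocumentFeatures) -> None:
        self.entries[doc_id] = features

    def get(self, doc_id: str) -> DocumentFeatures:
        return self.entries[doc_id]


library = FingerprintLibrary()
library.add(
    "contract-001",
    DocumentFeatures(
        feature_words=frozenset({"payment", "due"}),
        feature_chars=frozenset({"p", "d"}),
        sentence_blocks=("Payment is due in 30 days",),
    ),
)
```

Keeping the three kinds of feature data side by side means both detections can read from the same entry, which matches the idea of serving both from one library.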
That is, the embodiment of the present invention addresses the prior-art problem that similarity detection and document repetition detection cannot be performed on a document at the same time. The acquired feature data is stored in the fingerprint library, and a program is configured so that the server can perform both similarity detection and document repetition detection on the document to be checked according to the feature data in the fingerprint library. This reduces the complexity of document detection and improves the system's performance when checking files.
In a specific implementation, the embodiment of the present invention performs similarity detection on the document to be checked using the feature characters and feature words, and performs document repetition detection on it using the sentence blocks, thereby achieving comprehensive and accurate detection of the document to be checked.
In a specific implementation, building the fingerprint library of preset documents in the embodiment of the present invention specifically comprises: extracting the content of a preset document, generating feature words and feature characters of the preset document using a locality-sensitive hashing (LSH) algorithm, obtaining the sentence blocks in the preset document, and storing the feature words, the feature characters, and the sentence blocks in the fingerprint library.
That is, the embodiment of the present invention generates the feature words and feature characters of the preset document with a locality-sensitive hashing algorithm, obtains the sentence blocks of the preset document according to a configured rule, and stores the resulting feature words, feature characters, and sentence blocks in the fingerprint library.
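The build step can be sketched as follows. The patent does not name a particular LSH scheme, so this example assumes SimHash, a well-known locality-sensitive hash, computed once over words and once over characters. The function names, the punctuation-based sentence splitting rule, and the dictionary layout are likewise assumptions for illustration.

```python
import hashlib
import re

BITS = 64  # fingerprint width; 64 bits are taken from each MD5 digest


def simhash(tokens):
    """SimHash: similar token lists tend to give fingerprints that differ in few bits."""
    weights = [0] * BITS
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def split_sentence_blocks(text):
    """Assumed rule: a sentence block ends at sentence punctuation or a newline."""
    parts = re.split(r"[.!?。！？\n]+", text)
    return [part.strip() for part in parts if part.strip()]


def build_entry(text):
    """Extract word-level and character-level fingerprints plus sentence blocks."""
    lowered = text.lower()
    words = re.findall(r"\w+", lowered)
    chars = [c for c in lowered if not c.isspace()]
    return {
        "word_fingerprint": simhash(words),
        "char_fingerprint": simhash(chars),
        "sentence_blocks": split_sentence_blocks(text),
    }


# Hypothetical fingerprint library keyed by document id.
fingerprint_library = {
    "contract-001": build_entry("Payment is due in 30 days. Late fees apply."),
}
```

The same `build_entry` function could be applied to a document to be checked, so that both sides are described by the same kind of feature data.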
In a specific implementation, performing similarity detection and document repetition detection on the document to be checked according to the feature data in the fingerprint library specifically comprises: according to the feature data in the fingerprint library and the feature data of the document to be checked, performing similarity detection between the document to be checked and the preset document, and performing document repetition detection between the document to be checked and the preset document.
Specifically, the embodiment of the present invention performs similarity detection and document repetition detection on the document to be checked according to the fingerprint library.
In a specific implementation, the embodiment of the present invention needs to extract the feature data of the document to be checked. The extraction method is the same as the one described above: the feature words and feature characters of the document to be checked are extracted by the locality-sensitive hashing algorithm.
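One common way to compare LSH fingerprints is by Hamming distance, sketched below. This assumes the feature data of each document is available as a fixed-width integer fingerprint; the 64-bit width, the function names, and the 0.9 threshold are assumptions chosen for the example and are not taken from the patent.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def fingerprint_similarity(a: int, b: int, bits: int = 64) -> float:
    """Map Hamming distance to a score in [0, 1]; 1.0 means identical fingerprints."""
    return 1.0 - hamming_distance(a, b) / bits


def is_similar(a: int, b: int, threshold: float = 0.9, bits: int = 64) -> bool:
    """Flag the document to be checked as similar when the score reaches the threshold."""
    return fingerprint_similarity(a, b, bits) >= threshold
```

A lower threshold catches more loosely related documents at the cost of more false positives, so the value would need tuning on real data.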
In a specific implementation, performing document repetition detection between the document to be checked and the preset document specifically comprises: determining whether a sentence block of the preset document is present in the document to be checked. If so, the number of such sentence blocks present in the document to be checked is recorded; otherwise, the document to be checked is determined not to repeat the preset document.
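The counting step can be sketched as below. This simplified version treats two sentence blocks as the same when they are equal after lowercasing and whitespace normalization; that matching rule and the function names are assumptions made for the example.

```python
def normalize(block: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(block.lower().split())


def count_repeated_blocks(preset_blocks, candidate_blocks) -> int:
    """Count sentence blocks of the document to be checked that also occur in the preset document.

    A result of 0 corresponds to the 'no repetition' outcome.
    """
    preset = {normalize(block) for block in preset_blocks}
    return sum(1 for block in candidate_blocks if normalize(block) in preset)


count = count_repeated_blocks(
    ["Payment is due in 30 days", "Late fees apply"],
    ["late  fees APPLY", "Something else entirely"],
)
```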
Specifically, determining whether a sentence block of the preset document is present in the document to be checked comprises: determining whether the document to be checked contains a sentence block that has the same meaning as a sentence block in the preset document and whose word count differs from that sentence block's by an amount within a preset range. If so, the document to be checked is determined to contain the sentence block.
That is, to obtain a more accurate document repetition result, the repetition detection of the embodiment of the present invention allows a certain difference in word count, provided that the meaning is the same.
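A tolerant match of this kind can be sketched as follows. The patent does not say how "same meaning" is decided, so this example substitutes a simple stand-in: Jaccard overlap of the word sets above a threshold. The whitespace tokenization, the `max_word_diff` and `min_overlap` parameters, and their default values are all assumptions; unsegmented text such as Chinese would need character counts or a tokenizer instead.

```python
def blocks_match(preset_block: str, candidate_block: str,
                 max_word_diff: int = 2, min_overlap: float = 0.8) -> bool:
    """Return True when two sentence blocks are close in length and in vocabulary."""
    preset_words = preset_block.lower().split()
    candidate_words = candidate_block.lower().split()

    # Condition 1: the word-count difference must fall within the preset range.
    if abs(len(preset_words) - len(candidate_words)) > max_word_diff:
        return False

    # Condition 2 (stand-in for 'same meaning'): high Jaccard overlap of word sets.
    preset_set, candidate_set = set(preset_words), set(candidate_words)
    if not preset_set or not candidate_set:
        return False
    overlap = len(preset_set & candidate_set) / len(preset_set | candidate_set)
    return overlap >= min_overlap
```

For instance, "payment is due within thirty days" and "payment is due within thirty calendar days" differ by one word and share six of seven distinct words, so they match under the defaults.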
In a specific implementation, the embodiment of the present invention records the number of matching sentence blocks present in the document to be checked in order to determine the degree of repetition between the document to be checked and the preset document.
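The text only states that the recorded count determines the degree of repetition. One possible way to turn the count into a degree, shown here purely as an assumption, is to divide it by the total number of sentence blocks in the document to be checked.

```python
def repetition_degree(matched_blocks: int, total_blocks: int) -> float:
    """Fraction of sentence blocks in the document to be checked that repeat the preset document.

    The ratio is an assumed definition; a raw count or another normalization may be intended.
    """
    if total_blocks <= 0:
        return 0.0
    return matched_blocks / total_blocks


degree = repetition_degree(3, 12)  # 3 of 12 blocks repeat
```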
Fig. 2 is a flow diagram of another document detection method according to an embodiment of the present invention. The method of the present invention is explained in detail below with reference to Fig. 2:
First, the document content of the important documents (that is, the preset documents described above) is extracted;
The feature data of the document content is obtained and stored in the fingerprint library;
A detection model is built according to the fingerprint library; specifically, the detection system is initialized with the built fingerprint library and the corresponding detection model is constructed;
After the document to be checked is preprocessed, similarity detection and document repetition detection are performed on it.
This step specifically comprises: extracting content features using the locality-sensitive hashing algorithm, and performing similarity detection and repeated-block detection on the document to be checked according to the extracted content features.
By merging the two detection functions, the present invention reduces the complexity of document detection and thereby improves the system's performance when checking files.
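The overall flow of Fig. 2 can be sketched as a single function that runs both detections against every entry of the fingerprint library. The function `detect`, its pluggable callables, and the toy feature functions below are hypothetical and deliberately simplistic; they stand in for the real extraction and comparison steps only to show how the two detections share one extraction pass.

```python
from typing import Any, Callable


def detect(candidate_text: str,
           library: dict[str, Any],
           extract: Callable[[str], Any],
           similarity: Callable[[Any, Any], float],
           count_repeats: Callable[[Any, Any], int]) -> list[dict]:
    """Extract features once, then run both detections against each preset document."""
    features = extract(candidate_text)
    return [
        {
            "doc_id": doc_id,
            "similarity": similarity(preset, features),
            "repeated_blocks": count_repeats(preset, features),
        }
        for doc_id, preset in library.items()
    ]


# Toy stand-ins so the sketch runs end to end.
def toy_extract(text: str) -> dict:
    return {
        "words": set(text.lower().split()),
        "blocks": {s.strip().lower() for s in text.split(".") if s.strip()},
    }


def toy_similarity(preset: dict, candidate: dict) -> float:
    union = preset["words"] | candidate["words"]
    return len(preset["words"] & candidate["words"]) / len(union) if union else 0.0


def toy_count(preset: dict, candidate: dict) -> int:
    return len(preset["blocks"] & candidate["blocks"])


library = {"d1": toy_extract("Alpha beta. Gamma delta.")}
results = detect("Alpha beta. Epsilon.", library, toy_extract, toy_similarity, toy_count)
```

Passing the three steps in as callables keeps the orchestration independent of whichever hashing and matching rules are actually chosen.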
Correspondingly, as shown in Fig. 3, an embodiment of the present invention also provides a server, comprising a processor, a memory, and a communication bus;
The communication bus is used to implement connection and communication between the processor and the memory;
The memory is used to store computer instructions, and the processor is used to run the computer instructions stored in the memory to implement the steps of any of the document detection methods in the method embodiments and to achieve the corresponding technical effects. For details, refer to the method embodiments; they are not repeated here.
Correspondingly, an embodiment of the present invention also provides a computer-readable storage medium storing one or more programs. The one or more programs can be executed by one or more processors to implement any of the document detection methods provided by the foregoing embodiments, and can therefore also achieve the corresponding technical effects. For details, refer to the method embodiments; they are not repeated here.
Although preferred embodiments of the present invention have been disclosed for purposes of illustration, those skilled in the art will recognize that various improvements, additions, and substitutions are possible; therefore, the scope of the present invention should not be limited to the above embodiments.
Claims (10)
- 1. A document detection method, characterized by comprising: building a fingerprint library of preset documents, the fingerprint library being used to store feature data of the preset documents; and performing similarity detection and document repetition detection on a document to be checked according to the feature data in the fingerprint library.
- 2. The method according to claim 1, characterized in that building the fingerprint library of preset documents specifically comprises: extracting the content of a preset document, generating feature words and feature characters of the preset document by a locality-sensitive hashing algorithm, obtaining the sentence blocks in the preset document, and storing the feature words, the feature characters, and the sentence blocks in the fingerprint library.
- 3. The method according to claim 1, characterized in that performing similarity detection and document repetition detection on the document to be checked according to the feature data in the fingerprint library specifically comprises: according to the feature data in the fingerprint library and the feature data of the document to be checked, performing similarity detection between the document to be checked and the preset document, and performing document repetition detection between the document to be checked and the preset document.
- 4. The method according to claim 3, characterized by further comprising: extracting the feature data of the document to be checked.
- 5. The method according to claim 4, characterized in that extracting the feature data of the document to be checked specifically comprises: extracting the feature words and feature characters of the document to be checked by a locality-sensitive hashing algorithm.
- 6. The method according to claim 3, characterized in that performing document repetition detection between the document to be checked and the preset document specifically comprises: determining whether a sentence block of the preset document is present in the document to be checked; if so, recording the number of such sentence blocks present in the document to be checked; otherwise, determining that the document to be checked does not repeat the preset document.
- 7. The method according to claim 6, characterized in that determining whether a sentence block of the preset document is present in the document to be checked specifically comprises: determining whether the document to be checked contains a sentence block that has the same meaning as a sentence block in the preset document and whose word count differs from that sentence block's by an amount within a preset range; if so, determining that the document to be checked contains the sentence block.
- 8. The method according to claim 1, characterized in that the feature data includes one or more of the following: feature characters, feature words, and sentence blocks.
- 9. A server, characterized in that the server comprises a processor, a memory, and a communication bus; the communication bus is used to implement connection and communication between the processor and the memory; and the processor is used to execute computer instructions stored in the memory to implement the document detection method according to any one of claims 1 to 8.
- 10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs, the one or more programs being executable by one or more processors to implement the document detection method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711468430.6A CN108021951A (en) | 2017-12-29 | 2017-12-29 | Document detection method, server, and computer-readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711468430.6A CN108021951A (en) | 2017-12-29 | 2017-12-29 | Document detection method, server, and computer-readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108021951A (en) | 2018-05-11 |
Family
ID=62071831
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711468430.6A Pending CN108021951A (en) | 2017-12-29 | 2017-12-29 | A kind of method of document detection, server and computer-readable recording medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108021951A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109344407A (en) * | 2018-10-29 | 2019-02-15 | 北京天融信网络安全技术有限公司 | Semantic-based document fingerprint construction method, storage medium and computer equipment |
CN112861505A (en) * | 2021-02-04 | 2021-05-28 | 北京百度网讯科技有限公司 | Method and device for detecting repeatability and electronic equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101458708A (en) * | 2008-12-05 | 2009-06-17 | 北京大学 | Searching result clustering method and device |
US8015168B2 (en) * | 2007-11-12 | 2011-09-06 | Sap Ag | String pooling |
CN102937994A (en) * | 2012-11-15 | 2013-02-20 | 北京锐安科技有限公司 | Similar document query method based on stop words |
CN103207864A (en) * | 2012-01-13 | 2013-07-17 | 北京中文在线数字出版股份有限公司 | Online novel content similarity comparison method |
CN106844314A (en) * | 2017-02-21 | 2017-06-13 | 北京焦点新干线信息技术有限公司 | A kind of duplicate checking method and device of article |
-
2017
- 2017-12-29 CN CN201711468430.6A patent/CN108021951A/en active Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109344407A (en) * | 2018-10-29 | 2019-02-15 | 北京天融信网络安全技术有限公司 | Semantic-based document fingerprint construction method, storage medium and computer equipment |
CN109344407B (en) * | 2018-10-29 | 2024-02-09 | 天融信雄安网络安全技术有限公司 | Semantic-based document fingerprint construction method, storage medium and computer equipment |
CN112861505A (en) * | 2021-02-04 | 2021-05-28 | 北京百度网讯科技有限公司 | Method and device for detecting repeatability and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180511 |