CN114239544A

CN114239544A - Text detection method and system based on document fingerprints

Info

Publication number: CN114239544A
Application number: CN202111629638.8A
Authority: CN
Inventors: 杨竣
Original assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd; Hubei Topsec Network Security Technology Co Ltd
Current assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd; Hubei Topsec Network Security Technology Co Ltd
Priority date: 2021-12-28
Filing date: 2021-12-28
Publication date: 2022-03-25

Abstract

The embodiment of the application provides a text detection method and system based on document fingerprints, and relates to the technical field of network security. The text detection method based on the document fingerprint comprises the following steps: acquiring text data of a document to be detected; processing the text data in a unified coding format to obtain unified text data; dividing the unified text data through a preset symbol to obtain fingerprint database data of the document to be detected; and detecting the fingerprint database data of the document to be detected through the template fingerprint database data to obtain a detection result. The text detection method based on the document fingerprint can prevent the leakage of confidential documents and realize the technical effect of improving the document detection capability.

Description

Text detection method and system based on document fingerprints

Technical Field

The application relates to the technical field of network security, in particular to a text detection method and system based on document fingerprints.

Background

Electronic documents and digital products make office work, teaching, and similar tasks more convenient, but they also carry significant security risks. Existing digital copyright protection technology is based mainly on modern cryptography, such as encryption schemes and digital signature schemes, and chiefly addresses the security of digital products during storage and transmission. Once decrypted, however, the content of these digital products can be copied, distributed, and leaked, so it must also be protected by Data Leakage Prevention (DLP) technology based on content identification.

In the prior art, traditional data leakage prevention technology relies mainly on keyword matching and regular expression matching, and these methods have significant limitations. For example, once the data to be identified has undergone simple additions, deletions, or modifications, these conventional matching methods fail, and the sensitive data can no longer be protected. "Document fingerprint" matching can accurately detect unstructured data stored in the form of documents, in file formats including Microsoft Word files, PowerPoint files, PDF documents, and so forth. Protected documents include financial documents, merger-and-acquisition documents, and other sensitive or proprietary information. A DLP system can use a fingerprinting algorithm to create fingerprint features for a document and then match excerpts, drafts, or different versions of the protected document against them.

Disclosure of Invention

An object of the embodiments of the present application is to provide a document fingerprint-based text detection method, system, electronic device, and computer-readable storage medium, which can prevent leakage of confidential documents and achieve the technical effect of improving document detection capability.

In a first aspect, an embodiment of the present application provides a text detection method based on a document fingerprint, including:

acquiring text data of a document to be detected;

processing the text data in a unified coding format to obtain unified text data;

dividing the unified text data through a preset symbol to obtain fingerprint database data of the document to be detected;

and detecting the fingerprint database data of the document to be detected through the template fingerprint database data to obtain a detection result.

In this implementation, the document fingerprint-based text detection method extracts the text of documents in common formats, converts it into a unified coding format, and segments it at special symbols in the text content to obtain the fingerprint database data of the document to be detected, which keeps the fingerprints reliable. The fingerprint database data of the document to be detected is then checked against the template fingerprint database data, so the document can be detected even after it has been modified, and the detection result indicates whether the document violates the rules. The method can therefore prevent leakage of confidential documents and improves document detection capability.

Further, before the step of detecting the fingerprint database data of the document to be detected through the template fingerprint database data and obtaining a detection result, the method further includes:

acquiring template text data of a template document;

processing the template text data in a uniform coding format to obtain uniform template text data;

and segmenting the unified template text data through a preset symbol to obtain the template fingerprint database data.

In the implementation process, the processing process of the template document is the same as that of the document to be detected, and the template fingerprint database data and the fingerprint database data of the document to be detected are respectively obtained; and comparing the two fingerprint database data, and further judging whether the document to be detected violates rules.

Further, the step of obtaining the fingerprint database data of the document to be detected by segmenting the unified text data by preset symbols includes:

the unified text data is segmented through one or more of line breaks, commas and periods to obtain characteristic value data;

and storing the first characters of the characteristic value data in a structural body, calculating the MD5 value of the characteristic value data, and generating fingerprint database data of the document to be detected.

In the implementation process, after the text data of the document to be detected is analyzed and processed and converted into a uniform coding format, the uniform text data is divided through special symbols such as line breaks, commas, periods and the like, characteristic values are extracted, the first characters of the characteristic values are stored in a structural body, and meanwhile, the MD5 value is calculated and further stored into fingerprint database data.

Further, the step of acquiring the text data of the document to be detected includes:

and acquiring text data of the document to be detected through a TIKA algorithm.

In a second aspect, an embodiment of the present application provides a document fingerprint-based text detection system, including:

the acquisition module is used for acquiring text data of a document to be detected;

the uniform coding module is used for carrying out uniform coding format processing on the text data to obtain uniform text data;

the document fingerprint module is used for segmenting the unified text data through preset symbols to obtain fingerprint database data of the document to be detected;

and the detection module is used for detecting the fingerprint database data of the document to be detected through the template fingerprint database data to obtain a detection result.

Further, the document fingerprint-based text detection system further includes:

the template acquisition module is used for acquiring template text data of the template document;

the uniform coding module is also used for carrying out uniform coding format processing on the template text data to obtain uniform template text data;

the document fingerprint module is further used for segmenting the unified template text data through preset symbols to obtain the template fingerprint database data.

Further, the document fingerprinting module includes:

a dividing unit, configured to divide the unified text data by one or more of a line break, a comma, and a period to obtain feature value data;

and the document fingerprint generating unit is used for storing the first character of the characteristic value data in a structural body, calculating the MD5 value of the characteristic value data and generating fingerprint database data of the document to be detected.

Further, the obtaining module is specifically configured to obtain the text data of the document to be detected through a TIKA algorithm.

In a third aspect, an electronic device provided in an embodiment of the present application includes: memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method according to any of the first aspect when executing the computer program.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having instructions stored thereon, which, when executed on a computer, cause the computer to perform the method according to any one of the first aspect.

In a fifth aspect, embodiments of the present application provide a computer program product, which when run on a computer, causes the computer to perform the method according to any one of the first aspect.

Additional features and advantages of the disclosure will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the above-described techniques.

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.

FIG. 1 is a flowchart illustrating a document fingerprint-based text detection method according to an embodiment of the present disclosure;

FIG. 2 is a flowchart illustrating another method for detecting text based on a document fingerprint according to an embodiment of the present disclosure;

FIG. 3 is a block diagram of a document fingerprint-based text detection system according to an embodiment of the present application;

FIG. 4 is a block diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.

The embodiments of the present application provide a document fingerprint-based text detection method, system, electronic device, and computer-readable storage medium, which can be applied to document leakage prevention, for example in scenarios where confidential documents must be kept from leaking. The method extracts the text of documents in common formats (such as Word, Excel, PPT, and PDF), converts it into a unified coding format, and segments it at special symbols in the text content to obtain the fingerprint database data of the document to be detected, which keeps the fingerprints reliable. The fingerprint database data of the document to be detected is then checked against the template fingerprint database data, so the document can be detected even after it has been modified, and the detection result indicates whether the document violates the rules. The method can therefore prevent leakage of confidential documents and improves document detection capability.

Referring to fig. 1, fig. 1 is a schematic flowchart of a text detection method based on a document fingerprint according to an embodiment of the present application, where the text detection method based on a document fingerprint includes the following steps:

S100: Acquire text data of the document to be detected.

Illustratively, the document to be detected can be in various formats such as word, excel, ppt, pdf, etc., and is not limited herein.

S200: Perform unified coding format processing on the text data to obtain unified text data.

Exemplarily, the text data is processed in a unified coding format to obtain unified text data, so that the next processing can be conveniently performed according to unified processing steps, and whether the document to be detected violates the rule or not is judged according to a unified detection standard.

Optionally, the unified coding format processing may convert the text data into Unicode; for example, the text data may be converted into the UCS-2 encoding format.
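As a minimal sketch of this normalization step, assuming the extracted text arrives as bytes in a known source encoding: the function name `to_ucs2`, the `source_encoding` parameter, and the little-endian byte order are illustrative assumptions and are not specified in the application. Because UCS-2 covers only the Basic Multilingual Plane, the sketch rejects characters above U+FFFF rather than emitting surrogate pairs.

```python
def to_ucs2(raw: bytes, source_encoding: str = "utf-8") -> bytes:
    """Decode bytes from source_encoding and re-encode them as UCS-2 (little-endian)."""
    text = raw.decode(source_encoding)
    for ch in text:
        # UCS-2 can only represent code points in the Basic Multilingual Plane.
        if ord(ch) > 0xFFFF:
            raise ValueError(f"U+{ord(ch):04X} is outside the BMP and has no UCS-2 form")
    # For BMP-only text, UTF-16 and UCS-2 produce the same two-byte units.
    return text.encode("utf-16-le")


# The same characters yield the same UCS-2 bytes regardless of the source encoding.
from_gbk = to_ucs2("A中".encode("gbk"), "gbk")
from_utf8 = to_ucs2("A中".encode("utf-8"), "utf-8")
```

Normalizing both the template document and the document to be detected in this way means that identical text produces identical bytes before any hashing takes place.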

Illustratively, Unicode is an industry standard in the field of computer science that comprises a character set, encoding schemes, and related specifications. Unicode was created to overcome the limitations of traditional character encoding schemes: it assigns a single, unique code to each character in each language, so that text can be converted and processed across languages and platforms. It is maintained by an international organization and is intended to accommodate all of the world's characters and symbols. Unicode maps characters to the numbers 0 to 0x10FFFF, giving a code space of 1,114,112 code points; a code point is a number that can be assigned to a character. UTF-8, UTF-16, and UTF-32 are encoding schemes that convert these code points into byte sequences.
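The relationship between a code point and its encoded forms can be illustrated with a short Python sketch. The helper name `encodings_of` and the sample character are chosen here for illustration only.

```python
def encodings_of(ch: str) -> dict:
    """Return the code point of one character and its bytes under three UTF schemes."""
    return {
        "code_point": ord(ch),
        "utf-8": ch.encode("utf-8"),
        "utf-16": ch.encode("utf-16-be"),  # big-endian, no byte-order mark
        "utf-32": ch.encode("utf-32-be"),  # big-endian, no byte-order mark
    }


# Size of the Unicode code space: 0 through 0x10FFFF inclusive.
CODE_SPACE_SIZE = 0x10FFFF + 1

info = encodings_of("中")
print(hex(info["code_point"]), info["utf-8"].hex(), info["utf-16"].hex(), info["utf-32"].hex())
# prints: 0x4e2d e4b8ad 4e2d 00004e2d
```

The same code point occupies three, two, and four bytes under UTF-8, UTF-16, and UTF-32 respectively, which is why a single encoding has to be fixed before fingerprints are computed.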

The Universal Character Set is abbreviated UCS. Early Unicode standards were described in terms of UCS-2 and UCS-4: UCS-2 encodes each character in two bytes, and UCS-4 encodes each character in four bytes. With the most significant bit of the highest byte fixed at 0, UCS-4 is divided into 2^7 = 128 groups according to the highest byte. Each group is divided into 256 planes according to the second-highest byte. Each plane is divided into 256 rows according to the third byte, and each row contains 256 cells (code points). Plane 0 of group 0 is called the Basic Multilingual Plane (BMP). Removing the two leading zero bytes from the UCS-4 encoding of a BMP character yields its UCS-2 encoding.
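The group, plane, row, and cell layout described above amounts to reading the four bytes of a UCS-4 value from most significant to least significant. A minimal Python sketch follows; the function names `ucs4_position` and `is_bmp` are illustrative.

```python
def ucs4_position(code_point: int) -> tuple:
    """Split a UCS-4 value into (group, plane, row, cell)."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("UCS-4 values use 31 bits; the top bit must be 0")
    group = (code_point >> 24) & 0x7F  # highest byte, 7 usable bits: 128 groups
    plane = (code_point >> 16) & 0xFF  # second-highest byte: 256 planes per group
    row = (code_point >> 8) & 0xFF     # third byte: 256 rows per plane
    cell = code_point & 0xFF           # lowest byte: 256 cells per row
    return group, plane, row, cell


def is_bmp(code_point: int) -> bool:
    """True if the value lies in plane 0 of group 0, so it fits in UCS-2."""
    group, plane, _, _ = ucs4_position(code_point)
    return group == 0 and plane == 0
```

For example, U+4E2D decomposes to group 0, plane 0, row 0x4E, cell 0x2D, so it lies in the BMP, whereas U+1F600 lies in plane 1 and has no UCS-2 representation.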

S300: Segment the unified text data through the preset symbols to obtain the fingerprint database data of the document to be detected.

The preset symbol may be a special symbol such as a line break symbol, a comma, a period, etc., and the unified text data is segmented by the above symbol, so as to obtain a feature value of the document to be detected, that is, fingerprint database data.
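The segmentation can be sketched in Python as follows. The exact delimiter set is an assumption: the application names line breaks, commas, and periods, and this sketch additionally treats the full-width comma and the ideographic full stop as delimiters because the documents may contain Chinese text. The names `PRESET_SYMBOLS` and `split_features` are illustrative.

```python
import re
from typing import List

# Assumed delimiter set: line breaks, ASCII comma and period,
# full-width comma (U+FF0C) and ideographic full stop (U+3002).
PRESET_SYMBOLS = re.compile(r"[\r\n,.\uFF0C\u3002]+")


def split_features(text: str) -> List[str]:
    """Split unified text at the preset symbols and drop empty segments."""
    parts = (part.strip() for part in PRESET_SYMBOLS.split(text))
    return [part for part in parts if part]


features = split_features("First line,second part. Third\nFourth。第五，第六")
# features == ["First line", "second part", "Third", "Fourth", "第五", "第六"]
```

Splitting at punctuation rather than at fixed offsets means that inserting or deleting one sentence changes only the segments it touches, which may be why a modified document can still be detected.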

S400: Detect the fingerprint database data of the document to be detected through the template fingerprint database data to obtain a detection result.

Illustratively, the template fingerprint database data is data obtained after a preset template document passes through the processing steps of S100-S300; comparing the template fingerprint database data with the fingerprint database data of the document to be detected to obtain a corresponding detection result, and detecting whether the document to be detected violates rules; for example, the template fingerprint database data and the fingerprint database data of the document to be detected are compared, the proportion of the characteristic value fingerprint in the template document appearing in the fingerprint database data of the document to be detected can be obtained, and if the proportion exceeds a set value, the document to be detected can be judged to be an illegal file.
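The proportion-based comparison can be sketched as a set intersection. This is a minimal illustration, assuming that each fingerprint library is held as a set of hash strings; the 0.8 default threshold is a placeholder, since the application does not state the set value.

```python
def match_ratio(template_fps: set, document_fps: set) -> float:
    """Fraction of template fingerprints that also occur in the document."""
    if not template_fps:
        return 0.0
    return len(template_fps & document_fps) / len(template_fps)


def is_violation(template_fps: set, document_fps: set, threshold: float = 0.8) -> bool:
    """Flag the document when the matched proportion exceeds the set value."""
    return match_ratio(template_fps, document_fps) > threshold


template = {"a", "b", "c", "d"}
document = {"a", "b", "x"}
ratio = match_ratio(template, document)  # 2 of 4 template fingerprints match
```

The ratio is taken relative to the template rather than the document, so padding a leaked document with unrelated text does not lower it.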

In some implementation scenarios, the document fingerprint-based text detection method extracts the text of documents in common formats, converts it into a unified coding format, and segments it at special symbols in the text content to obtain the fingerprint database data of the document to be detected, which keeps the fingerprints reliable. The fingerprint database data of the document to be detected is then checked against the template fingerprint database data, so the document can be detected even after it has been modified, and the detection result indicates whether the document violates the rules. The method can therefore prevent leakage of confidential documents and improves document detection capability.

Referring to fig. 2, fig. 2 is a schematic flowchart of another text detection method based on document fingerprints according to an embodiment of the present application.

Illustratively, before step S400 (detecting the fingerprint database data of the document to be detected through the template fingerprint database data to obtain a detection result), the method further includes the following steps:

S401: Acquire template text data of a template document;

S402: Perform unified coding format processing on the template text data to obtain unified template text data;

S403: Segment the unified template text data through the preset symbols to obtain the template fingerprint database data.

Exemplarily, the processing procedures of S401-S403 are the same as the processing procedures of S100-S300, except that one is to process the template document to obtain the template fingerprint database data; one is to process the document to be detected to obtain the fingerprint database data of the document to be detected; and comparing the two fingerprint database data, and further judging whether the document to be detected violates rules.

Illustratively, step S300 (segmenting the unified text data through the preset symbols to obtain the fingerprint database data of the document to be detected) includes:

S310: Segment the unified text data through one or more of line breaks, commas, and periods to obtain characteristic value data;

S320: Store the first character of the characteristic value data in a structure, calculate the MD5 value of the characteristic value data, and generate the fingerprint database data of the document to be detected.

Illustratively, after the text data of the document to be detected is analyzed and processed and converted into a uniform coding format, the uniform text data is divided through special symbols such as line breaks, commas, periods and the like, characteristic values are extracted, the first characters of the characteristic values are stored in a structural body, and meanwhile, the MD5 value is calculated and further stored as fingerprint database data.
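One way to picture the structure and the MD5 calculation is the following Python sketch. The record layout (`first_char` plus a hexadecimal `md5` field) and the choice to hash the UTF-16-LE bytes of each characteristic value are assumptions made for illustration; the application does not specify the structure's fields beyond the first character and the MD5 value.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Fingerprint:
    first_char: str  # first character of the characteristic value
    md5: str         # hex MD5 digest of the characteristic value


def make_fingerprint(feature: str) -> Fingerprint:
    """Build one fingerprint record from a non-empty characteristic value."""
    # Assumption: hash the two-byte (UCS-2 style) encoding of the value.
    digest = hashlib.md5(feature.encode("utf-16-le")).hexdigest()
    return Fingerprint(first_char=feature[0], md5=digest)


def build_library(features: Iterable[str]) -> Set[Fingerprint]:
    """Collect the fingerprints of all non-empty characteristic values."""
    return {make_fingerprint(f) for f in features if f}


library = build_library(["Name", "Gender", "Name", ""])
```

Because the records are frozen dataclasses, they are hashable, so duplicates collapse automatically and two libraries can be compared with ordinary set operations.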

Illustratively, step S100 (acquiring the text data of the document to be detected) includes:

S110: Acquire the text data of the document to be detected through a TIKA algorithm.

Illustratively, the TIKA algorithm is based on Apache Tika, a library for document type detection and content extraction from various file formats. Internally, it uses existing document parsers and document type detection techniques to detect and extract data. With Tika, a general-purpose type detector and content extractor can be built that extracts structured text and metadata from different types of documents (for example, spreadsheets, text documents, images, PDFs, and even multimedia formats). Tika provides a single generic API for parsing different file formats and relies on 83 existing specialized parser libraries for the individual document types. All of these parser libraries are encapsulated behind a single interface named the Parser interface.
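A hedged sketch of the extraction step in Python follows. It assumes the third-party `tika` package (tika-python), whose `parser.from_file` call returns a dictionary with `content` and `metadata` keys and which needs a Java runtime to run a Tika server in the background; the application itself does not say which binding is used. The wrapper name `extract_text` and its `parse_file` parameter are illustrative, the latter allowing another extractor to be substituted.

```python
from typing import Callable, Optional


def extract_text(path: str, parse_file: Optional[Callable[[str], dict]] = None) -> str:
    """Return the plain text of a document, or an empty string if none was found."""
    if parse_file is None:
        # Imported lazily so the wrapper can be used without tika installed.
        from tika import parser  # third-party package: tika-python
        parse_file = parser.from_file
    parsed = parse_file(path)
    # Tika reports None as content for documents that contain no text.
    return (parsed.get("content") or "").strip()


# Usage with a stand-in parser, for example in a unit test:
def fake_parser(path: str) -> dict:
    return {"content": "\n  Name: XX\n", "metadata": {"resourceName": path}}


text = extract_text("form.docx", parse_file=fake_parser)
```

Keeping the extractor behind a single callable mirrors the single Parser interface described above: the rest of the pipeline does not need to know which file format was parsed.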

In some implementation scenarios, as shown in fig. 1 and fig. 2, the text detection method based on document fingerprints provided in the embodiment of the present application is applied to a certain personal information collection template form, where the content of the personal information collection template form is:

"Name: Gender: Ethnicity: Age: Address:";

The specific processing steps are as follows:

Step one: extract the text of the personal information collection template, obtaining text content such as Name, Gender, Ethnicity, Age, and Address;

Step two: transcode the text content, calculate the MD5 values, and store them in a fingerprint library file containing 5 fingerprint features;

Step three: extract the text data of the document to be detected, for example a filled-in copy of the template form:

Name: XX Gender: X Ethnicity: X Age: XX Address: XXXXXXX;

then transcode the text content, calculate the MD5 values, and compare them with the fingerprint library of the template document; if all five fingerprint features are present, the document to be detected is judged to be a violating document.
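The form example can be traced with a short Python sketch. It rests on one loud assumption: for the field labels to survive as separate characteristic values once the form is filled in, the colon is treated here as an additional preset symbol and each field is placed on its own line. The application does not state how labels are separated from values, so this is illustrative only, as are the names `FIELD_DELIMITERS` and `fingerprints`.

```python
import hashlib
import re

# Assumption: colons (ASCII and full-width) split labels from values,
# in addition to line breaks, commas, and periods.
FIELD_DELIMITERS = re.compile(r"[:\uFF1A\r\n,.]+")


def fingerprints(text: str) -> set:
    """MD5 fingerprints of every non-empty segment of the text."""
    parts = (part.strip() for part in FIELD_DELIMITERS.split(text))
    return {hashlib.md5(p.encode("utf-16-le")).hexdigest() for p in parts if p}


template = "Name:\nGender:\nEthnicity:\nAge:\nAddress:"
filled = "Name: XX\nGender: X\nEthnicity: X\nAge: XX\nAddress: XXXXXXX"

template_fps = fingerprints(template)
hits = template_fps & fingerprints(filled)
print(len(template_fps), len(hits))
# prints: 5 5
```

All five template fingerprints appear in the filled-in form even though every field now carries a value, which corresponds to the case in which the document to be detected is judged to be a violating document.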

Referring to fig. 3, fig. 3 is a block diagram of a document fingerprint-based text detection system according to an embodiment of the present application, where the document fingerprint-based text detection system includes:

the acquiring module 100 is used for acquiring text data of a document to be detected;

the unified coding module 200 is configured to perform unified coding format processing on the text data to obtain unified text data;

the document fingerprint module 300 is configured to segment the unified text data through a preset symbol to obtain fingerprint database data of a document to be detected;

the detection module 400 is configured to detect the fingerprint database data of the document to be detected through the template fingerprint database data, and obtain a detection result.

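Illustratively, the document fingerprint-based text detection system further comprises:

a template acquisition module, configured to acquire template text data of a template document;

the unified coding module 200 is further configured to perform unified coding format processing on the template text data to obtain unified template text data;

the document fingerprint module 300 is further configured to segment the unified template text data through the preset symbols to obtain the template fingerprint database data.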

Illustratively, the document fingerprinting module 300 includes:

the segmentation unit is used for segmenting the unified text data through one or more of line breaks, commas and periods to obtain characteristic value data;

and the document fingerprint generating unit is used for storing the first character of the characteristic value data in the structural body, calculating the MD5 value of the characteristic value data and generating fingerprint database data of the document to be detected.

Illustratively, the obtaining module 100 is specifically configured to obtain text data of a document to be detected through a TIKA algorithm.

It should be noted that the block diagram of the document fingerprint-based text detection system shown in fig. 3 corresponds to the method embodiments shown in fig. 1 and fig. 2, and is not repeated here to avoid repetition.
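The module structure of fig. 3 can be pictured as a small composition of pluggable callables. The following Python sketch is an illustration under assumptions: the class name `TextDetectionSystem`, the callable signatures, and the 0.8 threshold are invented here and are not taken from the application.

```python
from dataclasses import dataclass
from typing import Callable, Set


@dataclass
class TextDetectionSystem:
    acquire: Callable[[str], str]           # acquisition module 100
    unify: Callable[[str], str]             # unified coding module 200
    fingerprint: Callable[[str], Set[str]]  # document fingerprint module 300
    threshold: float = 0.8                  # assumed set value used by detection module 400

    def library_for(self, source: str) -> Set[str]:
        """Run modules 100 to 300 on one document and return its fingerprint library."""
        return self.fingerprint(self.unify(self.acquire(source)))

    def detect(self, template_source: str, candidate_source: str) -> bool:
        """Detection module 400: compare the candidate against the template."""
        template = self.library_for(template_source)
        candidate = self.library_for(candidate_source)
        if not template:
            return False
        return len(template & candidate) / len(template) > self.threshold


# Stand-in modules: the "documents" are plain strings split at commas.
system = TextDetectionSystem(
    acquire=lambda source: source,
    unify=lambda text: text.strip(),
    fingerprint=lambda text: set(text.split(",")),
)
flagged = system.detect("a,b,c", "a,b,c,d")
```

The template document and the document to be detected pass through the same three modules, matching the statement that the unified coding module and the document fingerprint module are reused for the template.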

Fig. 4 shows a block diagram of an electronic device according to an embodiment of the present application. The electronic device may include a processor 510, a communication interface 520, a memory 530, and at least one communication bus 540. The communication bus 540 provides direct connection communication among these components. In this embodiment, the communication interface 520 of the electronic device is used for signaling or data communication with other node devices. The processor 510 may be an integrated circuit chip having signal processing capabilities.

The processor 510 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. It can implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor 510 may be any conventional processor.

The memory 530 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 530 stores computer-readable instructions which, when executed by the processor 510, enable the electronic device to perform the steps of the method embodiments of fig. 1 and fig. 2 described above.

Optionally, the electronic device may further include a memory controller, an input output unit.

The memory 530, the memory controller, the processor 510, the peripheral interface, and the input/output unit are electrically connected to each other directly or indirectly, so as to implement data transmission or interaction. For example, these elements may be electrically coupled to each other via one or more communication buses 540. The processor 510 is used to execute executable modules stored in the memory 530, such as software functional modules or computer programs included in the electronic device.

The input/output unit allows a user to create a task and to set an optional start time period or a preset execution time for it, thereby enabling interaction between the user and the server. The input/output unit may be, but is not limited to, a mouse, a keyboard, and the like.

It will be appreciated that the configuration shown in fig. 4 is merely illustrative and that the electronic device may include more or fewer components than shown in fig. 4 or may have a different configuration than shown in fig. 4. The components shown in fig. 4 may be implemented in hardware, software, or a combination thereof.

The embodiments of the present application further provide a storage medium storing instructions which, when run on a computer and executed by a processor, implement the method of the method embodiments; to avoid repetition, details are not repeated here.

The present application also provides a computer program product which, when run on a computer, causes the computer to perform the method of the method embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A text detection method based on document fingerprints is characterized by comprising the following steps:

acquiring text data of a document to be detected;

2. The method for detecting text based on document fingerprint according to claim 1, wherein before the step of detecting the fingerprint database data of the document to be detected through the template fingerprint database data to obtain the detection result, the method further comprises:

acquiring template text data of a template document;

3. The document fingerprint-based text detection method according to claim 1, wherein the step of obtaining the fingerprint database data of the document to be detected by segmenting the unified text data by preset symbols comprises:

4. The document fingerprint-based text detection method according to claim 1, wherein the step of obtaining the text data of the document to be detected comprises:

5. A document fingerprint-based text detection system, comprising:

6. The document fingerprint based text detection system of claim 5, further comprising:

7. The document fingerprint-based text detection system of claim 5, wherein the document fingerprint module comprises:

8. The document fingerprint-based text detection system of claim 5, wherein the obtaining module is specifically configured to obtain the text data of the document to be detected through a TIKA algorithm.

9. An electronic device, comprising: memory, processor and computer program stored in the memory and executable on the processor, the processor implementing the steps of the document fingerprint based text detection method according to any one of claims 1 to 4 when executing the computer program.

10. A computer-readable storage medium having stored thereon instructions which, when run on a computer, cause the computer to perform the document fingerprint based text detection method of any one of claims 1 to 4.