CN111488556A - Nested document extraction method and device, electronic equipment and storage medium

Info

Publication number: CN111488556A
Application number: CN202010273216.0A
Authority: CN
Inventors: 蔡家坡; 关守兵
Original assignee: Sangfor Technologies Co Ltd
Current assignee: Sangfor Technologies Co Ltd
Priority date: 2020-04-09
Filing date: 2020-04-09
Publication date: 2020-08-04

Abstract

The application discloses a nested document extraction method, a nested document extraction device, an electronic device and a computer readable storage medium, wherein the method comprises the following steps: acquiring a target document to be extracted, and reading a document directory corresponding to the target document; extracting all sub-documents nested in the target document from the document directory; and respectively identifying the document content of each subdocument to identify whether confidential information exists in the subdocument. According to the method and the device, all the sub-documents nested in the target document can be extracted based on the document directory of the target document to be extracted, so that the extraction of the nested document is realized, the content of the sub-document is identified, whether confidential information exists in the sub-documents is determined, and information leakage can be effectively avoided.

Description

Nested document extraction method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for extracting a nested document, an electronic device, and a computer-readable storage medium.

Background

DLP (data leakage prevention) refers to a system for preventing data leakage. When extracting documents, most DLP products only support extracting the content of the current document and do not support extracting nested documents, let alone the contents of documents nested in multiple layers. As a result, if a user embeds a file containing confidential information into a document, the leakage prevention system cannot analyze the embedded file, and the confidential information leaks. How to solve this problem is therefore an issue that those skilled in the art need to focus on.

Disclosure of Invention

The application aims to provide a nested document extraction method and device, an electronic device and a computer readable storage medium, which can realize the extraction of nested documents and effectively avoid information leakage.

In order to achieve the above object, the present application provides a method for extracting a nested document, including:

acquiring a target document to be extracted, and reading a document directory corresponding to the target document;

extracting all sub-documents nested in the target document from the document directory;

and respectively identifying the document content of each subdocument to identify whether confidential information exists in the subdocument.

Optionally, after the target document to be extracted is obtained, the method further includes:

determining a document type corresponding to the target document;

the reading of the document directory corresponding to the target document includes:

and reading a document directory corresponding to the target document according to the document type.

Optionally, the reading the document directory corresponding to the target document according to the document type includes:

if the document type is a document type in a first type of version format, reading a compound document directory corresponding to the target document; the first type of version format comprises any one of a doc format, an xls format and a ppt format;

if the document type is a document type in a second type of version format, decompressing the target document, and reading a multi-level document directory corresponding to the target document after decompression; the second type of version format comprises any one of a docx format, an xlsx format and a pptx format.

Optionally, extracting all the sub-documents nested in the target document from the document directory includes:

extracting all the sub-documents nested in the target document by reading all the folders of the preset file in the compound document directory.

Optionally, extracting all the sub-documents nested in the target document from the document directory includes:

reading preset subdirectories under a multi-level document directory corresponding to the target document;

and extracting all the subdocuments stored in the preset subdirectory.

Optionally, before the identifying the document contents of each sub-document, the method further includes:

judging whether the subdocuments are single documents or nested documents;

and if the subdocuments are nested documents, taking the subdocuments as the target documents, and performing iterative extraction by the step of extracting all the subdocuments nested in the target documents from the document directory.

Optionally, the respectively identifying the document contents of each sub-document includes:

and if the subdocuments are single documents, directly extracting and identifying the contents of the subdocuments according to the document formats corresponding to the subdocuments.

In order to achieve the above object, the present application provides a nested document extraction device, including:

the catalog reading module is used for acquiring a target document to be extracted and reading a document catalog corresponding to the target document;

the document extraction module is used for extracting all the sub-documents nested in the target document from the document directory;

and the content identification module is used for respectively identifying the document content of each subdocument so as to identify whether confidential information exists in the subdocument.

To achieve the above object, the present application provides an electronic device including:

a memory for storing a computer program;

a processor for implementing the steps of any one of the nested document extraction methods disclosed above when executing the computer program.

To achieve the above object, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any one of the nested document extraction methods disclosed in the foregoing.

According to the scheme, the nested document extraction method provided by the application comprises the following steps: acquiring a target document to be extracted, and reading a document directory corresponding to the target document; extracting all sub-documents nested in the target document from the document directory; and respectively identifying the document content of each subdocument to identify whether confidential information exists in the subdocument. According to the method and the device, all the sub-documents nested in the target document can be extracted based on the document directory of the target document to be extracted, so that the extraction of the nested document is realized, the content of the sub-document is identified, whether confidential information exists in the sub-documents is determined, and information leakage can be effectively avoided.

The application also discloses a nested document extraction device, an electronic device and a computer readable storage medium, which can also realize the technical effects.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a block diagram of a nested document extraction system according to an embodiment of the present disclosure;

FIG. 2 is a flowchart of a method for extracting a nested document according to an embodiment of the present disclosure;

FIG. 3 is a flowchart of another nested document extraction method disclosed in an embodiment of the present application;

fig. 4, fig. 5, and fig. 6 are schematic diagrams of directories for different versions of documents in another nested document extraction method disclosed in the embodiment of the present application, respectively;

FIG. 7 is a flowchart of another method for extracting a nested document according to an embodiment of the present application;

fig. 8, 9, and 10 are schematic diagrams of directories for different versions of documents in another nested document extraction method disclosed in the embodiment of the present application, respectively;

fig. 11 is a block diagram of a nested document extraction apparatus disclosed in an embodiment of the present application;

fig. 12 is a block diagram of an electronic device disclosed in an embodiment of the present application;

fig. 13 is a block diagram of another electronic device disclosed in the embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In the prior art, current DLP products for leakage prevention generally need to extract document contents. However, when extracting documents, most DLP products only support extracting the content of the current document and do not support extracting nested documents, let alone the contents of documents nested in multiple layers. Therefore, if a user embeds a file containing confidential information into a document, the leakage prevention system cannot analyze the embedded file, which may cause the confidential information to leak.

Therefore, the embodiment of the application discloses a method for extracting the nested document, which can realize the extraction of the nested document, thereby effectively avoiding information leakage.

For ease of understanding, a system architecture to which the technical solution of the present application is applicable is described below. Referring to fig. 1, the architecture of a nested document extraction system of the present application is shown. As shown in fig. 1, the nested document extraction system of the present application may specifically include a user terminal 11 and a server 12, where the user terminal 11 and the server 12 are connected through a network 13. The user terminal 11 and the server 12 may each further include a processor, a memory, a communication interface, an input unit, a display, and a communication bus, and the processor, the memory, the communication interface, the input unit, and the display all communicate with each other through the communication bus.

In an implementation, the user may perform file transmission through the user terminal 11, for example, the file may be uploaded to the server 12, or the file may be transmitted to another communication terminal through the server 12. In particular, the user terminal 11 may specifically include, but is not limited to, a data processing device such as a smartphone, a tablet computer, a wearable device, and a desktop computer.

It can be understood that the server 12 is specifically configured to identify the content of the file transmitted by the user after acquiring the file, extract the nested file if the file is the nested file to identify whether confidential information exists in the transmitted file, and send a risk prompt when the confidential information exists to avoid information leakage. The server 12 may include, but is not limited to, a cloud server, a physical server, a virtual server, and the like.

It should be noted that the network 13 in the present application may be determined according to the network condition and the application requirement in the actual application process, and may be a wireless communication network, such as a mobile communication network or a WiFi network, or a wired communication network; either a wide area network or a local area network may be used as circumstances warrant.

Referring to fig. 2, a method for extracting a nested document disclosed in the embodiment of the present application includes:

S101: acquiring a target document to be extracted, and reading a document directory corresponding to the target document;

In the embodiment of the application, the target document to be extracted can be obtained through the import interface, and the document directory corresponding to the target document is read.

As a feasible implementation manner, after the target document to be extracted is obtained, the document type corresponding to the target document may be further determined, so that the document directory corresponding to the target document is read according to the document type.

S102: extracting all sub-documents nested in the target document from the document directory;

In this step, the sub-documents nested in the target document may be determined based on the document directory, and all the sub-documents may be extracted from the document directory.

S103: and respectively identifying the document content of each subdocument to identify whether confidential information exists in the subdocument.

In a specific implementation, after all the subdocuments are extracted, the document contents of each subdocument can be respectively identified, so as to judge whether confidential information exists in the subdocuments. If confidential information exists, corresponding risk prompt information can be returned to avoid information leakage.

It should be noted that the confidential information may specifically include, but is not limited to, sensitive information such as financial data, customer profile information, technical data, source code, office letter data, business scheme, and the like. When judging whether confidential information exists in the document, content searching and matching can be carried out by specifically adopting a regular expression, keywords and other basic detection methods, and clear sensitive information content can be detected by adopting the basic detection methods. In addition, an accurate data comparison detection method, a fingerprint document comparison detection method, a vector classification comparison detection method and the like can be adopted to further improve the detection accuracy. The precise data comparison and detection method is used for comparing and detecting the structured data, such as the name, the identity card number, the bank account number and the like of a user; the fingerprint document comparison detection method is used for detecting unstructured data, such as office letter data, business schemes and the like; the vector classification comparison detection method is suitable for detecting data with unique characteristics, such as financial data, source codes and the like.
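The basic detection methods mentioned above (keywords and regular expressions) can be sketched as follows. This is only an illustrative sketch: the keyword list, the rule names, and the `find_confidential` function are hypothetical and do not come from the application, and the identity card pattern checks the shape of the number only.

```python
import re

# Hypothetical rules; a real DLP deployment would load these from its policy configuration.
KEYWORDS = ("confidential", "internal use only")
PATTERNS = {
    # 18-character identity card number: 17 digits plus a digit or X (shape check only).
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}


def find_confidential(text: str) -> list:
    """Return the names of all rules that match the text (empty list if none)."""
    hits = []
    lowered = text.lower()
    for keyword in KEYWORDS:
        if keyword in lowered:
            hits.append(f"keyword:{keyword}")
    for name, pattern in PATTERNS.items():
        if pattern.search(text):
            hits.append(f"pattern:{name}")
    return hits
```

A non-empty result would correspond to the case in which risk prompt information is returned. The more accurate methods named above (exact data comparison, fingerprint comparison, vector classification) are not covered by this sketch.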

It will be appreciated that before the document contents of each subdocument are separately identified, it may first be determined whether the subdocuments are single documents or nested documents. If the subdocuments are single documents, content extraction and identification can be directly carried out on the subdocuments according to document formats corresponding to the subdocuments; and if the subdocuments are nested documents, taking the subdocuments as new target documents, and extracting the current target documents, namely, performing iterative extraction by the step of extracting all the subdocuments nested in the target documents from the document directory.
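The iterative extraction described above can be sketched as a depth-first walk. The function name `walk_documents`, the `extract_children` callback, the `max_depth` guard, and the toy dictionary layout are all assumptions made for illustration; the sketch also assumes that the content of a nested (container) document is itself yielded for identification, in addition to its subdocuments.

```python
from typing import Callable, Iterator, List, TypeVar

T = TypeVar("T")


def walk_documents(
    doc: T,
    extract_children: Callable[[T], List[T]],
    max_depth: int = 10,
) -> Iterator[T]:
    """Yield doc and every subdocument nested inside it, depth first."""
    yield doc
    children = extract_children(doc)
    if children and max_depth == 0:
        # Guard against pathological or malicious nesting (an added assumption).
        raise ValueError("nesting deeper than the allowed maximum")
    for child in children:
        # A nested subdocument becomes the new target document.
        yield from walk_documents(child, extract_children, max_depth - 1)


# Toy stand-in for real documents: each dict lists the documents embedded in it.
report = {
    "text": "report",
    "embedded": [
        {"text": "budget", "embedded": [{"text": "salaries", "embedded": []}]},
        {"text": "notes", "embedded": []},
    ],
}
texts = [d["text"] for d in walk_documents(report, lambda d: d["embedded"])]
```

A document with no children plays the role of a single document; anything else is treated as a nested document and extracted again.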

The embodiment of the application discloses another nested document extraction method, and compared with the previous embodiment, the embodiment further describes and optimizes the technical scheme. Referring to fig. 3, specifically:

S201: acquiring a target document to be extracted, and determining a document type corresponding to the target document;

S202: if the document type is a document type in a first type of version format, reading a compound document directory corresponding to the target document; the first type of version format comprises any one of a doc format, an xls format and a ppt format;

In the embodiment of the application, after the target document to be extracted is obtained, the document type corresponding to the target document is determined first. If the document type is a document type in the first type of version format, namely an Office 2003 document, the compound document directory corresponding to the target document is read. The first type of version format may include, but is not limited to, the doc format, the xls format and the ppt format.
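One possible way to determine the document type is to inspect the leading bytes of the file. The application does not state how the type is determined, so the sketch below is an assumption: it relies on the well-known signatures of compound files (used by doc, xls and ppt) and of ZIP archives (used by docx, xlsx and pptx), and the function name `detect_container` is hypothetical.

```python
# Signature at the start of every compound (OLE2) file.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header signature at the start of a non-empty ZIP archive.
ZIP_MAGIC = b"PK\x03\x04"


def detect_container(data: bytes) -> str:
    """Classify a document by its container format based on its first bytes."""
    if data.startswith(OLE2_MAGIC):
        return "compound"  # first type of version format: doc, xls, ppt
    if data.startswith(ZIP_MAGIC):
        return "zip"  # second type of version format: docx, xlsx, pptx
    return "other"
```

The signature only identifies the container; distinguishing doc from xls, or docx from xlsx, still requires looking at the directory contents.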

Note that an Office 2003 document is stored in the compound document format. A compound document is a document that contains not only text but also graphics, spreadsheet data, sound, video and other information. A compound document divides its data into a number of streams, which are stored in different storages. All streams are in turn divided into smaller data blocks called sectors. The whole file consists of a file header structure followed by all the sectors; the sector size is specified in the header structure, and all sectors are of the same size. The directory is an internal control stream consisting of a series of directory entries, each of which refers to a storage or a stream of the compound document; the directory entries are enumerated in the order in which they appear in the directory stream. A nested document in an Office 2003 file also exists in the form of a directory entry.
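The statement that the sector size is specified in the header can be illustrated with a short sketch. It assumes the published compound file header layout, in which a 16-bit little-endian sector shift is stored at byte offset 30 and the sector size is 2 raised to that shift; the function names are hypothetical.

```python
import struct

CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def sector_size(header: bytes) -> int:
    """Read the sector size from the start of a compound file header."""
    if len(header) < 34 or header[:8] != CFB_SIGNATURE:
        raise ValueError("not a compound file header")
    # The sector shift is a little-endian uint16 at byte offset 30.
    (sector_shift,) = struct.unpack_from("<H", header, 30)
    return 1 << sector_shift


def sector_offset(sector_number: int, size: int) -> int:
    """File offset of a sector; the header occupies the first sector-sized slot."""
    return (sector_number + 1) * size
```

For a typical Office 2003 file the shift is 9, giving 512-byte sectors. Reading the directory itself additionally requires following the sector chains, which this sketch does not attempt.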

S203: extracting all sub-documents nested in the target document by reading all folders of a preset file in the compound document directory;

In this step, all folders of the preset file in the compound document directory can be read, so as to extract all sub-files stored in each folder.

Specifically, as shown in fig. 4, in a Word file in doc format, the nested files are stored under the ObjectPool file obtained after decompression, in folders each named with an underscore followed by 10 digits. Each folder stores one nested subdocument, such as an embedded subdocument in docx, xlsx or pptx format, and each subdocument is stored in a package file in OLE format. In addition, redundant extracted data outside these folders needs further processing. As shown in fig. 5, in an Excel file in xls format, the nested files are stored after decompression in folders each named MBD followed by 8 hexadecimal digits, and each folder stores one nested subdocument. As shown in fig. 6, in a file in ppt format, the nested files are stored in the PowerPoint Document file and can be extracted by looking up the record type value; the type of a nested file is RT_ExternalOleObjectStg.
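The folder naming rules for doc and xls files can be expressed as patterns over the directory entry paths. In this sketch the paths are assumed to be given as strings with components joined by `/`, which is a representation chosen for illustration; the function name `embedded_storages` is hypothetical, and the ppt case (lookup by record type) is not covered.

```python
import re
from typing import Iterable, List

# doc: a folder under ObjectPool named with an underscore followed by 10 digits.
DOC_EMBED = re.compile(r"^ObjectPool/_\d{10}$")
# xls: a top-level folder named MBD followed by 8 hexadecimal digits.
XLS_EMBED = re.compile(r"^MBD[0-9A-Fa-f]{8}$")


def embedded_storages(storage_paths: Iterable[str]) -> List[str]:
    """Return the paths that look like folders holding a nested subdocument."""
    return [
        path
        for path in storage_paths
        if DOC_EMBED.match(path) or XLS_EMBED.match(path)
    ]
```

Each returned path would then be opened and its contents handed on as a subdocument.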

S204: and respectively identifying the document content of each subdocument to identify whether confidential information exists in the subdocument.

It is understood that, when each subdocument is identified, if the subdocument is a single document, its content can be extracted according to its specific document format. For example, if the directory entry object name is WordDocument, the content is extracted in the doc document format; if the directory entry object name is Workbook, the content is extracted in the xls document format; and if the directory entry object name is PowerPoint Document, the content is extracted in the ppt document format. During document extraction, the start and the length specified in the directory entry structure indicate where the specific content of the document begins and how long it is, respectively.
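The mapping from directory entry object name to document format can be sketched as a lookup table. The entry names below are written with the capitalization that Office uses for these streams, and `legacy_format` is a hypothetical helper name.

```python
from typing import Iterable, Optional

# Main stream name -> document format of a single (non-nested) document.
MAIN_STREAMS = {
    "WordDocument": "doc",
    "Workbook": "xls",
    "PowerPoint Document": "ppt",
}


def legacy_format(entry_names: Iterable[str]) -> Optional[str]:
    """Return the format implied by the first recognized directory entry name."""
    for name in entry_names:
        fmt = MAIN_STREAMS.get(name)
        if fmt is not None:
            return fmt
    return None
```

A result of `None` means that no known main stream was found, in which case the object would need other handling.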

The embodiment of the application discloses another nested document extraction method, and compared with the previous embodiment, the embodiment further describes and optimizes the technical scheme. Referring to fig. 7, specifically:

S301: acquiring a target document to be extracted, and determining a document type corresponding to the target document;

S302: if the document type is a document type in a second type of version format, decompressing the target document, and reading a multi-level document directory corresponding to the target document after decompression; the second type of version format comprises any one of a docx format, an xlsx format and a pptx format;

In the embodiment of the application, after the target document to be extracted is obtained, the document type corresponding to the target document is determined first. If the document type is a document type in the second type of version format, namely an Office 2007 document, the target document is decompressed, and the multi-level document directory corresponding to the target document is read after decompression. The second type of version format may include, but is not limited to, the docx format, the xlsx format and the pptx format.

S303: reading preset subdirectories under a multilevel document directory corresponding to the target document;

S304: extracting all the subdocuments stored in the preset subdirectory;

It should be noted that after the multi-level document directory corresponding to the target document is obtained, the preset subdirectory may be read, and all the subdocuments stored in the preset subdirectory are extracted.

In a specific implementation, as shown in fig. 8, a document in docx format is stored as a compressed package. After decompression, the nested documents can be seen in the embeddings directory under the word directory; for example, if the docx document on the left contains two nested documents, the corresponding embeddings directory contains two files, oleobject1.bin and oleobject2.bin. A document in xlsx format is likewise stored as a compressed package, and after decompression the nested documents can be seen in the embeddings directory under the xl directory; as shown in fig. 9, if the document on the left contains two nested documents, the corresponding embeddings directory contains two files, oleobject1.bin and oleobject2.bin. As shown in fig. 10, nested documents in a pptx document can be extracted in the same way.
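Reading the embeddings directory of a decompressed package can be sketched with Python's standard `zipfile` module. The `extract_embedded` function is a hypothetical name, and the `ppt/embeddings/` prefix is assumed by analogy with the word and xl directories described above.

```python
import io
import zipfile
from typing import Dict

# Preset subdirectories that hold nested documents, one per package type.
EMBED_DIRS = ("word/embeddings/", "xl/embeddings/", "ppt/embeddings/")


def extract_embedded(package: bytes) -> Dict[str, bytes]:
    """Map each file stored in an embeddings directory to its raw bytes."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            # Skip directory entries, which end with a slash.
            if name.startswith(EMBED_DIRS) and not name.endswith("/")
        }
```

Each value in the returned mapping is itself a candidate target document and can be passed through type determination again.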

S305: and respectively identifying the document content of each subdocument to identify whether confidential information exists in the subdocument.

It can be appreciated that, since Office 2007 series documents are stored in the ZIP compression format, the ZIP package can be decompressed cyclically to read the nested objects. Specifically, if a word/document.xml file exists after decompression, the docx document content is extracted; if an xl/sharedStrings.xml file exists after decompression, the xlsx document content is extracted; if a ppt/slides/slideX.xml file exists after decompression, the pptx document content is extracted; and if the decompressed file is any other file, it is taken as the target document to be extracted, and the process returns to step S301 for document type identification.
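The marker-file check described above can be sketched as a function over the list of file names in the decompressed package. The name `classify_package` is hypothetical, and the xlsx marker is assumed to be the standard `xl/sharedStrings.xml` part; the X in slideX.xml is treated as a slide number.

```python
import re
from typing import Iterable

# ppt/slides/slideX.xml, where X is the slide number.
SLIDE_PART = re.compile(r"^ppt/slides/slide\d+\.xml$")


def classify_package(names: Iterable[str]) -> str:
    """Decide which extractor applies to a decompressed package."""
    members = set(names)
    if "word/document.xml" in members:
        return "docx"
    if "xl/sharedStrings.xml" in members:
        return "xlsx"
    if any(SLIDE_PART.match(name) for name in members):
        return "pptx"
    # Anything else goes back to type identification as a new target document.
    return "other"
```

A result of `"other"` corresponds to returning to step S301 with the file as the new target document.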

In the following, a nested document extracting apparatus provided by an embodiment of the present application is introduced, and a nested document extracting apparatus described below and a nested document extracting method described above may be referred to each other.

Referring to fig. 11, a nested document extraction apparatus provided in an embodiment of the present application includes:

a directory reading module 401, configured to obtain a target document to be extracted, and read a document directory corresponding to the target document;

a document extracting module 402, configured to extract all sub-documents nested in the target document from the document directory;

the content identification module 403 is configured to identify document contents of each sub-document respectively to identify whether confidential information exists in the sub-document.

For the specific implementation process of the modules 401 to 403, reference may be made to the corresponding content disclosed in the foregoing embodiments, and details are not repeated here.
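The division of the apparatus into three modules can be sketched as a small class that composes three callables. The class name, field names, and the toy "documents" in the usage lines are assumptions made purely for illustration; they only show how modules 401 to 403 hand data to one another.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NestedDocumentExtractor:
    read_directory: Callable[[bytes], List[str]]  # directory reading module 401
    extract_subdocuments: Callable[[bytes, List[str]], List[bytes]]  # module 402
    identify_content: Callable[[bytes], bool]  # content identification module 403

    def has_confidential(self, target: bytes) -> bool:
        """Run the three modules in order on one target document."""
        directory = self.read_directory(target)
        subdocuments = self.extract_subdocuments(target, directory)
        return any(self.identify_content(doc) for doc in subdocuments)


# Toy wiring: a "document" is a byte string whose parts are separated by "|".
extractor = NestedDocumentExtractor(
    read_directory=lambda t: [str(i) for i in range(len(t.split(b"|")))],
    extract_subdocuments=lambda t, d: [t.split(b"|")[int(i)] for i in d],
    identify_content=lambda doc: b"secret" in doc,
)
```

In a real device each callable would be replaced by the format-specific logic described in the method embodiments.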

The present application further provides an electronic device, and referring to fig. 12, an electronic device provided in an embodiment of the present application includes:

a memory 100 for storing a computer program;

a processor 200, configured to implement the steps provided by the above embodiments when executing the computer program.

Specifically, the memory 100 includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer-readable instructions, and the internal memory provides an environment for the operating system and the computer-readable instructions in the non-volatile storage medium to run. The processor 200 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or other data processing chip in some embodiments, and provides computing and controlling capability for the electronic device, and when executing the computer program stored in the memory 100, the steps of the nested document extraction method disclosed in any of the foregoing embodiments may be implemented.

On the basis of the above embodiment, as a preferred implementation, referring to fig. 13, the electronic device further includes:

and an input interface 300 connected to the processor 200, for acquiring computer programs, parameters and instructions imported from the outside, and storing the computer programs, parameters and instructions into the memory 100 under the control of the processor 200. The input interface 300 may be connected to an input device for receiving parameters or instructions manually input by a user. The input device may be a touch layer covered on a display screen, or a button, a track ball or a touch pad arranged on a terminal shell, or a keyboard, a touch pad or a mouse, etc.

The display unit 400 is connected to the processor 200 and is used for displaying data processed by the processor 200 and for displaying a visual user interface. The display unit 400 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch screen, and the like.

The communication technology adopted by the communication connection can be a wired communication technology or a wireless communication technology, such as mobile high-definition link (MHL) technology, Universal Serial Bus (USB), high-definition multimedia interface (HDMI), wireless fidelity (WiFi), Bluetooth communication technology, Bluetooth low energy communication technology, IEEE 802.11s-based communication technology, and the like.

While fig. 13 illustrates only an electronic device having the components 100-500, those skilled in the art will appreciate that the configuration illustrated in fig. 13 does not limit the electronic device, which may include fewer or more components than those illustrated, may combine some components, or may use a different arrangement of components.

The present application also provides a computer-readable storage medium, which may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk. The storage medium has a computer program stored thereon, which when executed by a processor implements the steps of the nested document extraction method disclosed in any of the foregoing embodiments.

According to the method and the device, all the sub-documents nested in the target document can be extracted based on the document directory of the target document to be extracted, so that the nested document can be extracted, the content of the sub-documents is identified, whether confidential information exists in the sub-documents or not is determined, and information leakage can be effectively avoided.

The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims

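1. A nested document extraction method, comprising:

acquiring a target document to be extracted, and reading a document directory corresponding to the target document;

extracting all sub-documents nested in the target document from the document directory;

and respectively identifying the document content of each subdocument to identify whether confidential information exists in the subdocument.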

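2. The method for extracting the nested document according to claim 1, wherein after the target document to be extracted is obtained, the method further comprises:

determining a document type corresponding to the target document;

the reading of the document directory corresponding to the target document includes:

reading a document directory corresponding to the target document according to the document type.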

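3. The method for extracting the nested document according to claim 2, wherein the reading the document directory corresponding to the target document according to the document type includes:

if the document type is a document type in a first type of version format, reading a compound document directory corresponding to the target document; the first type of version format comprises any one of a doc format, an xls format and a ppt format;

if the document type is a document type in a second type of version format, decompressing the target document, and reading a multi-level document directory corresponding to the target document after decompression; the second type of version format comprises any one of a docx format, an xlsx format and a pptx format.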

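4. A nested document extraction method according to claim 3, wherein extracting all sub-documents nested in the target document from the document directory comprises:

extracting all the sub-documents nested in the target document by reading all the folders of the preset file in the compound document directory.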

5. A nested document extraction method according to claim 3, wherein extracting all sub-documents nested in the target document from the document directory comprises:

reading preset subdirectories under a multi-level document directory corresponding to the target document;

and extracting all the subdocuments stored in the preset subdirectory.

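6. A nested document extraction method according to any one of claims 1 to 5, wherein before identifying the document content of each subdocument separately, the method further comprises:

judging whether the subdocuments are single documents or nested documents;

and if the subdocuments are nested documents, taking the subdocuments as the target documents, and performing iterative extraction by the step of extracting all the subdocuments nested in the target documents from the document directory.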

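7. A nested document extraction method according to claim 6, wherein the identifying the document content of each subdocument separately comprises:

if the subdocuments are single documents, directly extracting and identifying the contents of the subdocuments according to the document formats corresponding to the subdocuments.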

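8. A nested document extraction apparatus, comprising:

the catalog reading module is used for acquiring a target document to be extracted and reading a document catalog corresponding to the target document;

the document extraction module is used for extracting all the sub-documents nested in the target document from the document directory;

and the content identification module is used for respectively identifying the document content of each subdocument so as to identify whether confidential information exists in the subdocument.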

9. An electronic device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the nested document extraction method of any one of claims 1 to 7 when executing said computer program.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the nested document extraction method of any one of claims 1 to 7.