CN112001164A

CN112001164A - Document content streaming analysis method and system

Info

Publication number: CN112001164A
Application number: CN202011159801.4A
Authority: CN
Inventors: 殷博; 潘飚; 冯静
Original assignee: BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD; Nanjing Zhongfu Information Technology Co Ltd; Zhongfu Information Co Ltd; Zhongfu Safety Technology Co Ltd
Current assignee: BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD; Nanjing Zhongfu Information Technology Co Ltd; Zhongfu Information Co Ltd; Zhongfu Safety Technology Co Ltd
Priority date: 2020-10-27
Filing date: 2020-10-27
Publication date: 2020-11-27
Anticipated expiration: 2040-10-27
Also published as: CN112001164B

Abstract

The invention discloses a document content streaming parsing method and system. The method comprises the following steps: S1, reading file data and completing directory scanning; S2, determining the file type and classifying files of different types; and S3, calling the corresponding parser according to the file type to parse the file. The beneficial effects of the invention are as follows: files are classified by file structure into structured files, text class files and compressed files; only part of the data is loaded for processing at a time, a different processing method is used for each file type, and a state machine controls the whole process; the internal streaming logic differs for each file type, but the overall parsing flow is the same.

Description

Document content streaming analysis method and system

Technical Field

The invention relates to the technical field of document content analysis, in particular to a document content streaming analysis method and system.

Background

With the advent of the big data era, the number of files transmitted over the internet has increased greatly, and the internet is flooded with text files, video files, audio files and the like. Among text files, there are a large number of electronic official documents in addition to ordinary documents. Some of these electronic official documents may be confidential; because official documents are the main vehicle for day-to-day official work, they are the most important source of confidential files. Confidential files may also turn up on non-confidential devices. To safeguard national security work, detecting confidential official documents among the massive number of files on networks and host devices is an urgent task. File parsing is the first step of the file inspection process.

The current file parsing approach reads the whole file into memory, determines the file type, and then processes the file with the matching parser. This has three disadvantages. First, a file transmitted over the network must be stored on the device's disk and then read into memory, so file inspection lags behind transmission. Second, large files are handled poorly: if a large file is loaded into memory in one pass, parsing occupies too much memory and too much CPU, so the device stalls and the user's other operations are affected. Third, compressed files are handled poorly: a compressed file is a special kind of file that may contain multiple files or folders and may itself contain compressed files, forming a nested multi-layer archive; if there are too many nested layers, loading everything at once not only occupies a large amount of memory but also degrades parsing performance.

An effective solution to the problems in the related art has not been proposed yet.

Disclosure of Invention

To address the problems in the related art, the invention provides a document content streaming parsing method and system that overcome the technical problems described above.

Therefore, the invention adopts the following specific technical scheme:

According to one aspect of the present invention, there is provided a document content streaming parsing method, including the following steps:

S1, reading file data and completing directory scanning;

S2, determining the file type and classifying files of different types;

and S3, calling the corresponding parser according to the file type to parse the corresponding file.

Further, reading the file data and completing the directory scan includes the following steps:

S11, configuring length rules in the configuration file in advance;

and S12, reading file data blocks.

Further, determining the file type and classifying files of different types includes the following steps:

S21, checking the file feature string;

and S22, detecting the file type.

Further, the file types include structured files, text class files, and compressed files.

Further, calling the corresponding parser according to the file type to parse a structured file includes the following steps:

S31, parsing the file header, the header information being parsed by data offset according to the header structure definition;

S32, continuing to read data;

S33, parsing the main sector allocation table through cyclic reading and data processing;

S34, parsing the sector allocation table, the directory stream and the table stream through cyclic reading and data processing;

and S35, extracting the text in the file according to the starting position and length of the directory stream and the starting position and length of the table stream.

Optionally, calling the corresponding parser according to the file type to parse a text class file includes the following steps:

S31', reading in file data;

S32', extracting text;

S33', continuing to read in file data;

and S34', repeating steps S32' and S33' until parsing is complete.

Optionally, calling the corresponding parser according to the file type to parse a compressed file includes the following steps:

S31', creating a temporary directory;

S32', reading file data;

S33', decompressing the data by using a compression algorithm;

S34', continuing to read data;

S35', repeating steps S33' and S34' until decompression of the file is complete;

S36', scanning the temporary directory and caching all file paths;

and S37', merging the parsing results.

Further, a structured file is a file with a hierarchical structure; structured files include, but are not limited to, Word files and PDF files; text class files include, but are not limited to, plain text files, Extensible Markup Language (XML) files and HyperText Markup Language (HTML) files; compressed files include, but are not limited to, zip files, rar files and tar files.

According to another aspect of the present invention, there is also provided a document content streaming parsing system, including:

the file IO, used for reading file data and completing directory scanning;

the classifier, used for determining the file type and classifying files of different types;

and the parser, used for calling the corresponding parser according to the file type to parse the corresponding file.

The invention has the beneficial effects that:

1. Files are classified by file structure into structured files, text class files and compressed files. The invention provides a document content streaming parsing method that loads only part of the data for processing at a time, uses a different processing method for each file type, and controls the whole process with a state machine. The internal streaming logic differs for each file type, but the overall parsing flow is the same.

2. The invention removes the need to write a network-transmitted file to disk before inspection, because the data can be passed directly into the streaming file parser. It also solves the problem of excessive resource usage when a file is loaded in one pass, because files can be read in blocks and processed in blocks.

3. The data length passed in each time and the number of decompression layers for compressed files can be set through the configuration file, so that different usage scenarios can be handled flexibly.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow diagram of document content streaming parsing according to an embodiment of the invention;

FIG. 2 is a diagram of a structured document structure according to an embodiment of the present invention;

FIG. 3 is a flow diagram of structured document processing according to an embodiment of the present invention;

FIG. 4 is a flow diagram of text class file processing according to an embodiment of the present invention;

FIG. 5 is a flow diagram of compressed file processing according to an embodiment of the present invention;

FIG. 6 is a document content streaming parsing architecture diagram according to an embodiment of the present invention.

Detailed Description

To further explain the embodiments, the invention is provided with accompanying drawings. The drawings form part of the disclosure; they mainly illustrate the embodiments and, together with the description, serve to explain how the embodiments operate. With reference to them, those of ordinary skill in the art will understand other possible embodiments and the advantages of the invention. Elements in the drawings are not drawn to scale, and like reference numerals generally denote like elements.

The embodiment of the invention provides a document content streaming analysis method and a document content streaming analysis system.

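The present invention is further explained below with reference to the accompanying drawings and the detailed description. As shown in FIG. 1, a document content streaming parsing method according to an embodiment of the present invention includes the following steps:

S1, reading file data and completing directory scanning;

S2, determining the file type and classifying files of different types;

and S3, calling the corresponding parser according to the file type to parse the corresponding file.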

In one embodiment, reading the file data and completing the directory scan includes the following steps:

S11, configuring length rules in the configuration file in advance;

and S12, reading file data blocks.
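By way of non-limiting illustration, steps S11 and S12 can be sketched in Python as follows. The configuration section name `reader`, the key `chunk_length`, and the helper names `load_chunk_length` and `read_blocks` are assumptions made for this sketch; the embodiment does not specify a configuration format.

```python
import configparser
import io

# Hypothetical configuration text; section and key names are assumptions.
CONFIG_TEXT = """
[reader]
chunk_length = 4096
"""


def load_chunk_length(config_text):
    """S11: return the block length (in bytes) configured in advance."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser.getint("reader", "chunk_length")


def read_blocks(stream, chunk_length):
    """S12: yield successive blocks of at most chunk_length bytes."""
    while True:
        block = stream.read(chunk_length)
        if not block:
            break  # end of the stream
        yield block


if __name__ == "__main__":
    length = load_chunk_length(CONFIG_TEXT)
    sizes = [len(b) for b in read_blocks(io.BytesIO(b"x" * 10000), length)]
    print(sizes)
```

Because `read_blocks` accepts any object with a `read` method, the same loop could be fed from a file on disk or from a network stream, which matches the stated aim of not requiring the file to be stored first.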

In one embodiment, determining the file type and classifying files of different types includes the following steps:

S21, checking the file feature string;

and S22, detecting the file type.

In one embodiment, the file types include structured files, text class files, and compressed files.
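By way of non-limiting illustration, steps S21 and S22 can be sketched as a check of the leading bytes of the first data block against well-known format signatures. The signature table and the function name `classify` are assumptions of this sketch, and treating every unrecognized input as a text class file is a simplification rather than a rule stated in the embodiment.

```python
# Well-known leading signatures for a few formats (S21: feature strings).
SIGNATURES = [
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "structured"),  # compound file, e.g. Word 97-2003
    (b"%PDF-", "structured"),                              # PDF
    (b"PK\x03\x04", "compressed"),                         # zip
    (b"Rar!\x1a\x07", "compressed"),                       # rar
]
TAR_MAGIC_OFFSET = 257  # POSIX tar headers carry "ustar" at this offset


def classify(head):
    """S22: map the first bytes of a file to one of the three categories."""
    for magic, category in SIGNATURES:
        if head.startswith(magic):
            return category
    if head[TAR_MAGIC_OFFSET:TAR_MAGIC_OFFSET + 5] == b"ustar":
        return "compressed"
    return "text"  # simplification: anything unrecognized is treated as text


if __name__ == "__main__":
    print(classify(b"%PDF-1.7\n"))
    print(classify(b"<html><body>hi</body></html>"))
```

Only the first block is needed for this check, so classification does not require the whole file to be in memory.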

Specifically, for the structured document processing flow:

Structured documents have a distinct structural hierarchy. Taking the Word 97-2003 format as an example, a Word document is divided into a header and a body. The body defines a series of regions, including a main sector allocation table (MSAT), a sector allocation table (SAT), a short sector allocation table (SSAT), a directory stream and a table stream. The main sector allocation table records the IDs of the sectors that hold the sector allocation table, and the sector allocation table records the IDs of the sectors that hold the various streams; the position of the required data in the file can be computed from the sector ID and the sector length. A structured file can thus be viewed as a hierarchy, as shown in FIG. 2: the main sector allocation table is the root node; the sector allocation table and the short sector allocation table are storage nodes below it; the file's data streams sit below the storage nodes; and the text or picture content is inside the data streams.

In one embodiment, as shown in FIGS. 2 and 3, calling the corresponding parser according to the file type to parse a structured file includes the following steps:

S31, parsing the file header, the header information being parsed by data offset according to the header structure definition;

Specifically, the state machine is set to "header processing". If the length of the incoming data is smaller than the length of the file header, the header cannot yet be fully parsed, and the partial data is cached.

S32, continuing to read data;

Specifically, because the state machine is still in "header processing", the newly read data is merged with the previously cached data and enters the header processing flow; if the header is still incomplete, steps S31 and S32 are repeated. Once the header has been processed, the state machine is set to "main sector processing" and the data is passed to the main sector processing flow.

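S33, parsing the main sector allocation table through cyclic reading and data processing;

Specifically, the position of the MSAT is calculated from the MSAT ID value in the header information, and the length of the MSAT is calculated from the MSAT count; the start of the data is located by position offset, and the required data is read at a fixed length.

S34, parsing the sector allocation table, the directory stream and the table stream through cyclic reading and data processing;

and S35, extracting the text in the file according to the starting position and length of the directory stream and the starting position and length of the table stream.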
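By way of non-limiting illustration, the header buffering controlled by the state machine and the calculation of the MSAT position can be sketched as follows. The class and function names are hypothetical. The field offsets (30, 68 and 72) and the 512-byte header are taken from the publicly documented compound file header layout and are an assumption of this sketch; the embodiment itself does not state them. Later states are omitted.

```python
import struct
from enum import Enum, auto

HEADER_LENGTH = 512  # assumed header size for Word 97-2003 compound files


class State(Enum):
    HEADER = auto()       # "header processing"
    MAIN_SECTOR = auto()  # "main sector processing"


class StructuredStreamParser:
    """Accepts data block by block and caches it until the header is complete."""

    def __init__(self):
        self.state = State.HEADER
        self.buffer = b""
        self.header = None

    def feed(self, data):
        # Merge newly read data with whatever was cached earlier (S32).
        self.buffer += data
        if self.state is State.HEADER:
            if len(self.buffer) < HEADER_LENGTH:
                return  # header incomplete: keep the cache and wait for more
            self.header = parse_header(self.buffer[:HEADER_LENGTH])  # S31
            self.buffer = self.buffer[HEADER_LENGTH:]
            self.state = State.MAIN_SECTOR
        # Main sector, sector table and stream states are not sketched here.


def parse_header(raw):
    """Read the fields needed to locate the MSAT by fixed data offsets."""
    (sector_shift,) = struct.unpack_from("<H", raw, 30)
    first_msat_id, msat_count = struct.unpack_from("<II", raw, 68)
    return {
        "sector_size": 1 << sector_shift,
        "first_msat_id": first_msat_id,
        "msat_count": msat_count,
    }


def sector_offset(sector_id, sector_size):
    """Byte position of a sector: sector 0 starts one sector into the file."""
    return (sector_id + 1) * sector_size
```

With a 512-byte sector size, `sector_offset(3, 512)` gives 2048, and the total MSAT length in this sketch would be the MSAT count multiplied by the sector size.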

Specifically, for the text file processing flow:

In addition to plain text files, text class files include XML, HTML and JSON files and the like. Compared with structured files, the processing flow for text class files is simple: the data is processed sequentially, and no state machine is needed for control. XML and JSON content does not need to be converted into corresponding objects for processing.

In one embodiment, as shown in FIG. 4, calling the corresponding parser according to the file type to parse a text class file includes the following steps:

S31', reading in file data;

S32', extracting text;

S33', continuing to read in file data;

and S34', repeating steps S32' and S33' until parsing is complete.
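By way of non-limiting illustration, the read-and-extract loop of steps S31' to S34' can be sketched with an incremental decoder, so that a multi-byte character split across two blocks is still decoded correctly. The function name `extract_text` and the default UTF-8 encoding are assumptions of this sketch; no markup is stripped here.

```python
import codecs


def extract_text(blocks, encoding="utf-8"):
    """Yield decoded text for each incoming block of bytes, in order."""
    # The incremental decoder keeps incomplete byte sequences between calls.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for block in blocks:            # S31' / S33': read in data
        text = decoder.decode(block)  # S32': extract text from this block
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # flush anything still pending
    if tail:
        yield tail


if __name__ == "__main__":
    data = "héllo wörld".encode("utf-8")
    blocks = [data[i:i + 2] for i in range(0, len(data), 2)]
    print("".join(extract_text(blocks)))
```

Only one block is held at a time, so memory use is bounded by the configured block length rather than by the file size.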

Specifically, for the compressed file processing flow:

The principle of compressed file processing is as follows: a temporary directory with the same name as the compressed file is created, the file is decompressed into that directory, the files in the directory are parsed, and the parsing results are merged; after processing, the temporary directory and its files are deleted. Whether nested archives are decompressed further is determined by configuration, which sets the maximum number of decompression layers.

In one embodiment, as shown in FIG. 5, calling the corresponding parser according to the file type to parse a compressed file includes the following steps:

S31', creating a temporary directory;

S32', reading file data;

S33', decompressing the data by using a compression algorithm;

S34', continuing to read data;

S35', repeating steps S33' and S34' until decompression of the file is complete;

S36', scanning the temporary directory and caching all file paths;

Specifically, each file found is then processed as a structured file or a text class file;

and S37', merging the parsing results.
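By way of non-limiting illustration, the compressed file flow with a configurable maximum number of decompression layers can be sketched for zip archives as follows. The names `parse_archive`, `parse_plain_file` and `max_depth` are hypothetical, `parse_plain_file` is only a stand-in for the structured and text class parsers, and the temporary directory name is merely derived from the archive name rather than being identical to it.

```python
import os
import shutil
import tempfile
import zipfile


def parse_plain_file(path):
    """Stand-in for the real parsers: return the file content as text."""
    with open(path, "rb") as handle:
        return handle.read().decode("utf-8", errors="replace")


def parse_archive(path, max_depth, depth=1, chunk_length=4096):
    """Decompress block by block, parse the extracted files, merge results."""
    results = []
    temp_dir = tempfile.mkdtemp(prefix=os.path.basename(path) + "_")  # S31'
    try:
        with zipfile.ZipFile(path) as archive:
            for index, info in enumerate(archive.infolist()):
                if info.is_dir():
                    continue
                # Flat, index-prefixed names keep members inside temp_dir.
                name = f"{index}_{os.path.basename(info.filename)}"
                target = os.path.join(temp_dir, name)
                with archive.open(info) as source, open(target, "wb") as sink:
                    # S32' to S35': read and decompress chunk_length at a time.
                    shutil.copyfileobj(source, sink, chunk_length)
        # S36': scan the temporary directory and cache all file paths.
        paths = sorted(
            os.path.join(root, name)
            for root, _dirs, names in os.walk(temp_dir)
            for name in names
        )
        for item in paths:
            if zipfile.is_zipfile(item):
                if depth < max_depth:  # stop at the configured layer limit
                    results.extend(
                        parse_archive(item, max_depth, depth + 1, chunk_length)
                    )
            else:
                results.append(parse_plain_file(item))
    finally:
        shutil.rmtree(temp_dir)  # delete the temporary directory and files
    return results  # S37': merged parsing results
```

In this sketch a nested archive beyond the limit is simply skipped; whether to skip, report or otherwise handle such archives is a design choice the embodiment leaves open.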

In one embodiment, a structured file is a file with a hierarchical structure; structured files include, but are not limited to, Word files and PDF files; text class files include, but are not limited to, plain text files, Extensible Markup Language (XML) files and HyperText Markup Language (HTML) files; compressed files include, but are not limited to, zip files, rar files and tar files.

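According to another embodiment of the present invention, as shown in FIG. 6, there is also provided a document content streaming parsing system, including:

the file IO, used for reading file data and completing directory scanning;

the classifier, used for determining the file type and classifying files of different types;

and the parser, used for calling the corresponding parser according to the file type to parse the corresponding file.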

In summary, with the above technical solution, the invention classifies files by file structure into structured files, text class files and compressed files. The invention provides a document content streaming parsing method that loads only part of the data for processing at a time, uses a different processing method for each file type, and controls the whole process with a state machine; the internal streaming logic differs for each file type, but the overall parsing flow is the same. The invention removes the need to write a network-transmitted file to disk before inspection, because the data can be passed directly into the streaming file parser; it also solves the problem of excessive resource usage when a file is loaded in one pass, because files can be read in blocks and processed in blocks. The data length passed in each time and the number of decompression layers for compressed files can be set through the configuration file, so that different usage scenarios can be handled flexibly.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A document content streaming parsing method, characterized in that the method comprises the following steps:

S1, reading file data and completing directory scanning;

S2, determining the file type and classifying files of different types;

S3, calling the corresponding parser according to the file type to parse the corresponding file;

the file types comprise a structured file, a text file and a compressed file;

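calling the corresponding parser according to the file type to parse the structured file comprises the following steps:

S31, parsing the file header, the header information being parsed by data offset according to the header structure definition;

S32, continuing to read data;

S33, parsing the main sector allocation table through cyclic reading and data processing;

S34, parsing the sector allocation table, the directory stream and the table stream through cyclic reading and data processing;

S35, extracting the text in the file according to the starting position and length of the directory stream and the starting position and length of the table stream;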

calling the corresponding parser according to the file type to parse the text class file comprises the following steps:

S31', reading in file data;

S32', extracting text;

S33', continuing to read in file data;

S34', repeating steps S32' and S33' until parsing is complete;

calling the corresponding parser according to the file type to parse the compressed file comprises the following steps:

S31', creating a temporary directory;

S32', reading file data;

S33', decompressing the data by using a compression algorithm;

S34', continuing to read data;

S35', repeating steps S33' and S34' until decompression of the file is complete;

S36', scanning the temporary directory and caching all file paths;

and S37', merging the parsing results.

2. The document content streaming parsing method according to claim 1, wherein reading the file data and completing the directory scan further comprises the following steps:

S11, configuring length rules in the configuration file in advance;

and S12, reading file data blocks.

3. The document content streaming parsing method according to claim 1, wherein determining the file type and classifying files of different types further comprises the following steps:

S21, checking the file feature string;

and S22, detecting the file type.

4. The document content streaming parsing method according to claim 1, wherein the structured file is a file with a hierarchical structure, and structured files include, but are not limited to, Word files and PDF files; text class files include, but are not limited to, plain text files, Extensible Markup Language (XML) files and HyperText Markup Language (HTML) files; compressed files include, but are not limited to, zip files, rar files and tar files.

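5. A document content streaming parsing system for implementing the steps of the document content streaming parsing method of any one of claims 1-4, the system comprising:

the file IO, used for reading file data and completing directory scanning;

the classifier, used for determining the file type and classifying files of different types;

and the parser, used for calling the corresponding parser according to the file type to parse the corresponding file.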

6. The system of claim 5, wherein the file types include structured files, text class files, and compressed files.