CN112001164A - Document content streaming analysis method and system - Google Patents

Document content streaming analysis method and system

Info

Publication number
CN112001164A
Authority
CN
China
Prior art keywords
file
files
data
reading
steps
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011159801.4A
Other languages
Chinese (zh)
Other versions
CN112001164B (en)
Inventor
殷博
潘飚
冯静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD
Nanjing Zhongfu Information Technology Co Ltd
Zhongfu Information Co Ltd
Zhongfu Safety Technology Co Ltd
Original Assignee
BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD
Nanjing Zhongfu Information Technology Co Ltd
Zhongfu Information Co Ltd
Zhongfu Safety Technology Co Ltd
Application filed by BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD, Nanjing Zhongfu Information Technology Co Ltd, Zhongfu Information Co Ltd, Zhongfu Safety Technology Co Ltd
Priority to CN202011159801.4A
Publication of CN112001164A
Application granted
Publication of CN112001164B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a document content streaming parsing method and system. The method comprises the following steps: S1, reading file data and completing directory scanning; S2, determining the file type and classifying files of different types; and S3, calling the parser corresponding to the file type to parse the file. The beneficial effects are as follows: files are classified by structure into structured files, text files, and compressed files; only part of the data is loaded for processing at a time, a different processing method is used for each file type, and a state machine controls the overall process. The internal streaming process differs for each file type, but the overall parsing flow is the same.

Description

Document content streaming analysis method and system
Technical Field
The invention relates to the technical field of document content analysis, in particular to a document content streaming analysis method and system.
Background
With the advent of the big data era, the number of files transmitted over the internet has grown enormously, and the internet carries all kinds of text, video, and audio files. Among text files, there are many electronic official documents in addition to ordinary documents. Some of these electronic official documents may be confidential; as the main vehicle for day-to-day work, official documents are the most important source of confidential files. Confidential files may also turn up on non-confidential devices. To support national secrecy work, confidential official documents must be detected among the massive number of files on networks and host devices, and cannot be cached. File parsing is the first step in the file inspection process.
Current document parsing methods read the whole file into memory, determine the file type, and then process the file with the corresponding parser. The first disadvantage concerns files transmitted over the network: they must first be stored on the device's disk and then read into memory, so file inspection lags behind transmission. The second disadvantage concerns large files: if a large file is loaded into memory all at once, parsing occupies too much memory and too much CPU, so the device stalls and the user's other operations are affected. The third disadvantage concerns compressed files: a compressed file is a special file that may contain multiple files or folders and may also contain other compressed files, forming a nested multi-layer archive. If there are too many nesting layers, loading everything at once not only occupies a large amount of memory but also reduces parsing performance.
An effective solution to the problems in the related art has not been proposed yet.
Disclosure of Invention
To address the problems in the related art, the invention provides a document content streaming parsing method and system that overcome the technical problems of the existing related art.
To this end, the invention adopts the following specific technical scheme:
According to one aspect of the present invention, there is provided a document content streaming parsing method, including the following steps:
S1, reading file data and completing directory scanning;
S2, determining the file type and classifying files of different types;
S3, calling the parser corresponding to the file type to parse the file.
Further, reading the file data and completing the directory scan further includes the following steps:
S11, configuring a length rule in the configuration file in advance;
S12, reading a block of file data.
Further, determining the file type and classifying files of different types further includes the following steps:
S21, checking the file's feature string;
S22, detecting the file type.
Further, the file types include structured files, text class files, and compressed files.
Further, calling the corresponding parser according to the file type to parse a structured file includes the following steps:
S31, parsing the file header: the header information is parsed by data offset according to the header structure definition;
S32, continuing to read data;
S33, parsing the master sector allocation table through repeated reading and data processing;
S34, parsing the sector allocation table, the directory stream, and the table stream through repeated reading and data processing;
S35, extracting the text in the file according to the starting position and length of the directory stream and the starting position and length of the table stream.
Optionally, calling the corresponding parser according to the file type to parse a text file includes the following steps:
S31', reading in file data;
S32', extracting text;
S33', continuing to read in file data;
S34', repeating steps S32', S33' and S34' until parsing is complete.
Optionally, calling the corresponding parser according to the file type to parse a compressed file includes the following steps:
S31', creating a temporary directory;
S32', reading file data;
S33', decompressing the data using the compression algorithm;
S34', continuing to read data;
S35', repeating steps S33' and S34' until decompression of the file is complete;
S36', scanning the temporary directory and caching all file paths;
S37', merging the parsing results.
Further, the structured file is a file with a hierarchical structure feature, and the structured file includes, but is not limited to, a word file and a pdf file; the text class files include but are not limited to text files, extensible markup language and hypertext markup language; the compressed files include, but are not limited to, zip files, rar files, and tar files.
According to another aspect of the present invention, there is also provided a document content streaming parsing system, including:
a file IO module, used for reading file data and completing directory scanning;
a classifier, used for determining the file type and classifying files of different types;
and a parser module, used for calling the parser corresponding to the file type to parse the file.
Further, the file types include structured files, text class files, and compressed files.
The invention has the following beneficial effects:
1. Files are classified by structure into structured files, text files, and compressed files. The method loads only part of the data for processing at a time, uses a different processing method for each file type, and controls the overall process with a state machine. The internal streaming process differs for each file type, but the overall parsing flow is the same.
2. The invention removes the need to write network-transmitted files to disk first, since the data can be passed directly into the streaming file parser. It also solves the excessive resource use caused by loading a file all at once, since files can be read and processed block by block.
3. Through the configuration file, the invention can set the length of the data passed in each time and the number of decompression layers for compressed files, so that it adapts flexibly to different usage scenarios.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow diagram of document content streaming parsing according to an embodiment of the invention;
FIG. 2 is a diagram of a structured document structure according to an embodiment of the present invention;
FIG. 3 is a flow diagram of structured document processing according to an embodiment of the present invention;
FIG. 4 is a flow diagram of text class file processing according to an embodiment of the present invention;
FIG. 5 is a flow diagram of compressed file processing according to an embodiment of the present invention;
FIG. 6 is a document content streaming parsing architecture diagram according to an embodiment of the present invention.
Detailed Description
To further explain the embodiments, the invention is provided with drawings. These drawings form part of the disclosure; they mainly illustrate the embodiments and, together with the description, serve to explain the principles of operation of the embodiments. With reference to them, those of ordinary skill in the art will understand other possible embodiments and the advantages of the invention. The drawings are not to scale, and like reference numerals generally refer to like elements.
The embodiment of the invention provides a document content streaming analysis method and a document content streaming analysis system.
The present invention is further explained below with reference to the accompanying drawings and the detailed description. As shown in FIG. 1, a document content streaming parsing method according to an embodiment of the present invention includes the following steps:
S1, reading file data and completing directory scanning;
S2, determining the file type and classifying files of different types;
S3, calling the parser corresponding to the file type to parse the file.
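The three steps can be sketched as a small driver loop. This is only an illustration under stated assumptions: the names `run_pipeline`, `CountingParser`, `feed`, and `finish` are hypothetical and do not come from the patent, and the toy parser merely counts bytes so that the control flow stays visible.

```python
from typing import Callable, Dict, Iterable


class CountingParser:
    """Toy stand-in for a per-type parser (hypothetical interface)."""

    def __init__(self, label: str) -> None:
        self.label = label
        self.size = 0

    def feed(self, chunk: bytes) -> None:
        # A real parser would extract text here; this one only counts bytes.
        self.size += len(chunk)

    def finish(self) -> str:
        return f"{self.label}:{self.size}"


def run_pipeline(
    chunks: Iterable[bytes],
    classify: Callable[[bytes], str],
    parsers: Dict[str, Callable[[], CountingParser]],
) -> str:
    """Drive S1-S3 over data that arrives block by block."""
    parser = None
    for chunk in chunks:                      # S1: data is read in blocks
        if parser is None:
            file_type = classify(chunk)       # S2: classify on the first block
            parser = parsers[file_type]()     # S3: select the matching parser
        parser.feed(chunk)                    # same outer flow for every type
    if parser is None:
        raise ValueError("no data was supplied")
    return parser.finish()
```

The outer loop is identical for every file type; only the object returned from `parsers` differs, which mirrors the statement that the overall flow is shared while the internal streaming process is type-specific.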
In one embodiment, reading the file data and completing the directory scan further comprises the following steps:
S11, configuring a length rule in the configuration file in advance;
S12, reading a block of file data.
In one embodiment, determining the file type and classifying files of different types further includes:
S21, checking the file's feature string;
S22, detecting the file type.
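A minimal sketch of block-wise reading with a configured length follows. The function name `read_blocks` and the idea of passing the length as a plain integer are assumptions made for illustration; the patent only states that a length rule is configured in advance.

```python
import io
from typing import BinaryIO, Iterator


def read_blocks(stream: BinaryIO, block_size: int) -> Iterator[bytes]:
    """Yield successive blocks of at most block_size bytes from stream."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    while True:
        block = stream.read(block_size)
        if not block:          # an empty read marks the end of the data
            break
        yield block


# Hypothetical usage: the block length would come from the configuration file.
blocks = list(read_blocks(io.BytesIO(b"abcdefghij"), block_size=4))
```

Because the function accepts any binary stream, the same loop can serve a file on disk or data arriving from a network socket wrapper, so nothing has to be fully loaded first.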
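One plausible reading of the feature string check is a comparison against well-known leading signatures. The sketch below is an assumption in that sense: the patent does not list the signatures it checks, and treating everything unmatched as a text file is a simplification chosen here.

```python
from typing import List, Tuple

STRUCTURED = "structured"
TEXT = "text"
COMPRESSED = "compressed"

# (offset, signature, category); the byte values are widely published magic numbers.
SIGNATURES: List[Tuple[int, bytes, str]] = [
    (0, b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", STRUCTURED),  # OLE2 (Word 97-2003)
    (0, b"%PDF-", STRUCTURED),                              # PDF
    (0, b"PK\x03\x04", COMPRESSED),                         # zip
    (0, b"Rar!\x1a\x07", COMPRESSED),                       # rar
    (257, b"ustar", COMPRESSED),                            # tar (POSIX header)
]


def detect_type(head: bytes) -> str:
    """Classify a file from its first block of data."""
    for offset, magic, kind in SIGNATURES:
        if head[offset:offset + len(magic)] == magic:
            return kind
    return TEXT  # assumption: anything unrecognised is handled as text
```

Only the first block is inspected, so classification can happen before the rest of the file has been read. The tar check needs a first block of at least 262 bytes.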
In one embodiment, the file types include structured files, text class files, and compressed files.
Specifically, for the structured file processing flow:
Structured files have a distinct structural hierarchy. Taking the Word 97-2003 format as an example, a Word document is divided into a header and a body, and a series of areas are defined in the body, including a master sector allocation table (MSAT), a sector allocation table (SAT), a short sector allocation table (SSAT), a directory stream, and a table stream. The master sector allocation table records the IDs of the sectors that hold the sector allocation table, and the sector allocation table records the IDs of the sectors that hold the various streams; the position of the required data in the file can be calculated from the sector ID and the sector length. A structured file can therefore be viewed as a hierarchy, as shown in FIG. 2: the master sector allocation table is the root node, the sector allocation table and the short sector allocation table are storage nodes below it, the data streams of the file lie below the storage nodes, and the text or picture content is in the data streams.
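The position calculation from sector ID and sector length can be illustrated as follows. The patent does not spell out the formula; this sketch assumes the published compound file layout used by Word 97-2003, in which the header occupies the space of one sector and the sector size is stored as a power-of-two exponent at byte offset 30 of the header. The function names are hypothetical.

```python
import struct

OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def sector_size_from_header(header: bytes) -> int:
    """Read the sector size from a compound file header."""
    if header[:8] != OLE_MAGIC:
        raise ValueError("not a compound file header")
    (shift,) = struct.unpack_from("<H", header, 30)  # little-endian uint16
    return 1 << shift


def sector_offset(sector_id: int, sector_size: int) -> int:
    """Byte offset of a sector, assuming the header fills the first sector slot."""
    return (sector_id + 1) * sector_size


# Build a minimal synthetic header to exercise the functions.
_header = bytearray(512)
_header[:8] = OLE_MAGIC
struct.pack_into("<H", _header, 30, 9)            # 2**9 = 512-byte sectors
size = sector_size_from_header(bytes(_header))
```

With 512-byte sectors, sector 0 starts at byte 512 and sector 3 at byte 2048, so a streaming parser can decide from the running byte count whether a required sector has arrived yet.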
In one embodiment, as shown in FIGS. 2 to 3, calling the corresponding parser according to the file type to parse a structured file further includes the following steps:
S31, parsing the file header: the header information is parsed by data offset according to the header structure definition;
Specifically, the state machine is set to "header processing". If the length of the incoming data is smaller than the length of the header, parsing of the header cannot be completed, and the partial data is cached.
S32, continuing to read data;
Specifically, because the state machine is still in "header processing", the newly read data is merged with the previously cached data and enters the header processing flow. If the header still has not been fully processed, steps S31 and S32 are repeated. Once the header has been processed, the state machine is set to "master sector processing" and the data is passed into the master sector processing flow.
S33, parsing the master sector allocation table through repeated reading and data processing;
Specifically, the position of the MSAT (master sector allocation table) is calculated from the MSAT ID value in the header information, and its length is calculated from the MSAT count; the start of the data is located by position offset, and the required data is read at a fixed length.
S34, parsing the sector allocation table, the directory stream, and the table stream through repeated reading and data processing;
S35, extracting the text in the file according to the starting position and length of the directory stream and the starting position and length of the table stream.
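The header stage of the state machine described in these steps might look like the sketch below. It covers only the transition from header processing to master sector processing; the class name, the state names, and the fixed header length of 512 bytes are assumptions made for illustration, and the later stages are left out.

```python
from enum import Enum, auto
from typing import Optional

HEADER_LEN = 512  # assumed header length for this sketch


class State(Enum):
    HEADER = auto()         # "header processing"
    MASTER_SECTOR = auto()  # "master sector processing"


class StructuredParser:
    """Buffers incoming blocks until a whole header is available."""

    def __init__(self) -> None:
        self.state = State.HEADER
        self.buffer = b""
        self.header: Optional[bytes] = None

    def feed(self, chunk: bytes) -> None:
        # Merge the new block with whatever was cached from earlier calls.
        self.buffer += chunk
        if self.state is State.HEADER:
            if len(self.buffer) < HEADER_LEN:
                return  # header incomplete: keep the partial data cached
            self.header = self.buffer[:HEADER_LEN]
            self.buffer = self.buffer[HEADER_LEN:]
            self.state = State.MASTER_SECTOR
        # Bytes left in self.buffer belong to the next stage, which is
        # not implemented in this sketch.
```

Feeding two 300-byte blocks leaves the parser in the header state after the first call and moves it to the master sector state after the second, with the 88 surplus bytes retained for the next stage.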
Specifically, for the text file processing flow:
In addition to plain text files, text-type files include XML, HTML, JSON files, and the like. Compared with structured files, the text file processing flow is simple: the data is processed sequentially and no state machine is needed for control, and XML and JSON do not need to be converted into corresponding objects for processing.
In one embodiment, as shown in FIG. 4, calling the corresponding parser according to the file type to parse a text file further includes the following steps:
S31', reading in file data;
S32', extracting text;
S33', continuing to read in file data;
S34', repeating steps S32', S33' and S34' until parsing is complete.
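The read-and-extract loop for text files can be sketched with an incremental decoder, which copes with a multi-byte character being split across two blocks. UTF-8 as the default encoding and the function name `extract_text` are assumptions; the patent does not say how encodings are handled.

```python
import codecs
from typing import Iterable, Iterator


def extract_text(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Decode blocks sequentially, yielding text as soon as it is complete."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)   # incomplete trailing bytes stay buffered
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # flush anything still buffered
    if tail:
        yield tail
```

No state machine is involved: each block is decoded in order and handed on, which matches the sequential processing described for text files.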
Specifically, for the compressed file processing flow:
The principle of compressed file processing is to decompress the file: a temporary directory with the same name as the compressed file is created, the contents are decompressed into it, the files in that directory are parsed, and the parsing results of the files are then merged. After processing, the temporary directory and its files are deleted. The number of nested decompression layers is read from the configuration, which determines the maximum decompression depth.
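A sketch of nested decompression with a configured maximum depth is given below, restricted to zip archives for brevity. Several points are assumptions rather than details from the patent: the function name `parse_archive`, the use of `tempfile.TemporaryDirectory` instead of a directory named after the archive, and the treatment of every non-archive member as UTF-8 text. `extractall` also unpacks a whole archive at once, so this shows the depth limit and cleanup only, not block-wise decompression.

```python
import os
import tempfile
import zipfile
from typing import List


def parse_archive(path: str, max_depth: int, depth: int = 1) -> List[str]:
    """Unpack an archive, parse its members, and merge the results."""
    results: List[str] = []
    if depth > max_depth:
        return results                      # configured layer limit reached
    # The temporary directory and its files are deleted when the block exits.
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(path) as archive:
            archive.extractall(tmp)
        for root, _dirs, files in os.walk(tmp):
            for name in sorted(files):
                full = os.path.join(root, name)
                if zipfile.is_zipfile(full):
                    # Nested archive: descend one layer.
                    results.extend(parse_archive(full, max_depth, depth + 1))
                else:
                    with open(full, "rb") as handle:
                        results.append(handle.read().decode("utf-8", errors="replace"))
    return results
```

With `max_depth=1` only the outer archive is unpacked and any inner archive is skipped; raising the limit lets the same function descend further.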
In one embodiment, as shown in FIG. 5, calling the corresponding parser according to the file type to parse a compressed file further includes the following steps:
S31', creating a temporary directory;
S32', reading file data;
S33', decompressing the data using the compression algorithm;
S34', continuing to read data;
S35', repeating steps S33' and S34' until decompression of the file is complete;
S36', scanning the temporary directory and caching all file paths;
Specifically, each file is then processed according to the structured file or text file steps;
S37', merging the parsing results.
In one embodiment, the structured file is a file with a hierarchical structure feature, and the structured file includes, but is not limited to, a word file and a pdf file; the text class files include but are not limited to text files, extensible markup language and hypertext markup language; the compressed files include, but are not limited to, zip files, rar files, and tar files.
According to another embodiment of the present invention, as shown in FIG. 6, there is also provided a document content streaming parsing system, including:
a file IO module, used for reading file data and completing directory scanning;
a classifier, used for determining the file type and classifying files of different types;
and a parser module, used for calling the parser corresponding to the file type to parse the file.
In one embodiment, the file types include structured files, text class files, and compressed files.
In summary, with the above technical solution, the invention classifies files by structure into structured files, text files, and compressed files. The method loads only part of the data for processing at a time, uses a different processing method for each file type, and controls the overall process with a state machine. The internal streaming process differs for each file type, but the overall parsing flow is the same. The invention removes the need to write network-transmitted files to disk first, since the data can be passed directly into the streaming file parser. It also solves the excessive resource use caused by loading a file all at once, since files can be read and processed block by block. Through the configuration file, the invention can set the length of the data passed in each time and the number of decompression layers for compressed files, so that it adapts flexibly to different usage scenarios.
The above describes only preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (6)

1. A document content streaming parsing method, characterized in that the method comprises the following steps:
S1, reading file data and completing directory scanning;
S2, determining the file type and classifying files of different types;
S3, calling the parser corresponding to the file type to parse the file;
the file types comprise structured files, text files, and compressed files;
calling the corresponding parser according to the file type to parse a structured file comprises the following steps:
S31, parsing the file header: the header information is parsed by data offset according to the header structure definition;
S32, continuing to read data;
S33, parsing the master sector allocation table through repeated reading and data processing;
S34, parsing the sector allocation table, the directory stream, and the table stream through repeated reading and data processing;
S35, extracting the text in the file according to the starting position and length of the directory stream and the starting position and length of the table stream;
calling the corresponding parser according to the file type to parse a text file comprises the following steps:
S31', reading in file data;
S32', extracting text;
S33', continuing to read in file data;
S34', repeating steps S32', S33' and S34' until parsing is complete;
calling the corresponding parser according to the file type to parse a compressed file comprises the following steps:
S31', creating a temporary directory;
S32', reading file data;
S33', decompressing the data using the compression algorithm;
S34', continuing to read data;
S35', repeating steps S33' and S34' until decompression of the file is complete;
S36', scanning the temporary directory and caching all file paths;
S37', merging the parsing results.
2. The document content streaming parsing method according to claim 1, wherein reading the file data and completing the directory scan further comprises the following steps:
S11, configuring a length rule in the configuration file in advance;
S12, reading a block of file data.
3. The document content streaming parsing method according to claim 1, wherein determining the file type and classifying files of different types further comprises the following steps:
S21, checking the file's feature string;
S22, detecting the file type.
4. The document content streaming parsing method according to claim 1, wherein the structured file is a file with hierarchical structure characteristics, and the structured file includes but is not limited to a word file and a pdf file; the text class files include but are not limited to text files, extensible markup language and hypertext markup language; the compressed files include, but are not limited to, zip files, rar files, and tar files.
5. A document content streaming parsing system for implementing the steps of the document content streaming parsing method of any one of claims 1-4, the system comprising:
a file IO module, used for reading file data and completing directory scanning;
a classifier, used for determining the file type and classifying files of different types;
and a parser module, used for calling the parser corresponding to the file type to parse the file.
6. The system of claim 5, wherein the file types include structured files, text class files, and compressed files.
CN202011159801.4A 2020-10-27 2020-10-27 Document content streaming analysis method and system Active CN112001164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011159801.4A CN112001164B (en) 2020-10-27 2020-10-27 Document content streaming analysis method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011159801.4A CN112001164B (en) 2020-10-27 2020-10-27 Document content streaming analysis method and system

Publications (2)

Publication Number Publication Date
CN112001164A (en) 2020-11-27
CN112001164B (en) 2021-01-08

Family

ID=73475244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011159801.4A Active CN112001164B (en) 2020-10-27 2020-10-27 Document content streaming analysis method and system

Country Status (1)

Country Link
CN (1) CN112001164B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613276A (en) * 2020-12-28 2021-04-06 南京中孚信息技术有限公司 Parallel execution method and system for streaming document analysis

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107979783A (en) * 2016-10-25 2018-05-01 杭州海康威视数字技术股份有限公司 A kind of stream data analytic method, device and electronic equipment
CN109948120A (en) * 2019-04-02 2019-06-28 深圳市前海欢雀科技有限公司 A kind of resume analytic method based on dualization
CN111062187A (en) * 2019-11-27 2020-04-24 北京计算机技术及应用研究所 Structured parsing method and system for docx format document
CN111651514A (en) * 2020-07-09 2020-09-11 中国银行股份有限公司 Data import method and device
CN111694797A (en) * 2020-06-04 2020-09-22 中国建设银行股份有限公司 File uploading and analyzing method, device, server and medium
CN111797063A (en) * 2020-06-28 2020-10-20 中孚信息股份有限公司 Streaming data processing method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
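曹秀丽 (Cao Xiuli): "XML解析技术研究" (Research on XML Parsing Technology), 《福建电脑》 (Fujian Computer) *
肖克辉 等 (Xiao Kehui et al.): "文件系统备份的流式处理算法设计与实现" (Design and Implementation of a Streaming Processing Algorithm for File System Backup), 《研究与开发》 (Research and Development) *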

Also Published As

Publication number Publication date
CN112001164B (en) 2021-01-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant