CN111104527A - Rich media file parsing method

Info

Publication number: CN111104527A
Application number: CN201911309803.4A
Authority: CN
Inventors: 程俊; 李文飞
Original assignee: Write Easy Network Technology Shanghai Co Ltd
Current assignee: Write Easy Network Technology Shanghai Co Ltd
Priority date: 2019-12-18
Filing date: 2019-12-18
Publication date: 2020-05-05
Anticipated expiration: 2039-12-18
Also published as: CN111104527B

Abstract

A rich media file parsing method comprises five main stages: data screening and classification, resource factory allocation, Spark multi-concurrency parsing, multi-node cluster indexing, and big data visual analysis. The invention first screens and classifies massive rich media file data, sorting complex-structured data into relatively regular categories so that each data format can be processed by a parser specific to that format. A resource factory then automatically allocates the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats. Spark parallel computation, run in a multi-threaded, highly concurrent mode, maximizes parsing speed, and a distributed full-text indexing technology improves data security and overall query speed. Finally, big data visual analysis presents intuitive, accurate, and efficient processing results to the user.

Description

Rich media file parsing method

Technical Field

The invention relates to the technical field of big data processing, in particular to a rich media file parsing method.

Background

With the continued growth of internet data, both the variety and the volume of data are increasing explosively at an unprecedented rate.

Ordinary enterprises and related organizations handle many common file formats: mail data, document data (including Office documents, PDF documents and the like), web data, call detail records, fund data, mobile phone backup and forensic investigation data, computer backup and forensic investigation data, structured database data (MySQL, Oracle, SQL Server, Access, MongoDB, Redis) and so on. How to comprehensively store, use, and analyze this data, and how to support queries and data mining for common business needs, is therefore a technically demanding problem.

Disclosure of Invention

The invention provides a rich media file parsing method that addresses the problem of comprehensively storing, using, and analyzing data in many file formats, and of querying and mining that data for common business needs.

To achieve this purpose, the invention is realized by the following technical scheme:

A rich media file parsing method, characterized by comprising the following steps:

screening and classifying massive rich media files by file format;

allocating, through a resource factory, the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats to the screened and classified rich media files;

performing highly concurrent parsing through the allocated data parsing interfaces on each node, using the Spark parallel computing framework;

performing multi-node cluster indexing on the parsed results; and

performing visual analysis of the big data based on the index query interface.

According to another embodiment of the present invention, the rich media files include ZIP archives, RAR archives, HAR archives, PST/OST mail archive files, and integrated document folders.

According to another embodiment of the present invention, the step of screening and classifying the massive rich media files by file format comprises:

decompressing the massive rich media files, using a traversal algorithm to extract nested archives layer by layer;

sorting the decompressed files with a built-in screening and distribution engine, which distinguishes and classifies file formats by file name suffix and temporarily stores the files in classified folders named after the different data formats.

According to another embodiment of the present invention, the classified files include Word documents, Excel documents, PPT documents/PDF documents, picture files, Eml files, mobile phone backup/forensic investigation data, and hard disk backup/forensic investigation data.

According to another embodiment of the present invention, the step of allocating, through the resource factory, the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats comprises:

allocating parsing interfaces according to the data format of the files: when the input is Word, Excel, PPT, or PDF documents, the resource factory allocates the document parsing interface; when the input is Eml files, audio files, or video files, the resource factory automatically allocates the media file parsing interface; when the input is mobile phone or hard disk forensic investigation data, the forensic investigation parsing interface is allocated;

allocating different hardware resources according to the data volume handled by each parsing interface, to obtain the hardware resources of each data node.

According to another embodiment of the present invention, the step of performing highly concurrent parsing through the allocated data parsing interfaces on each node, using the Spark parallel computing framework, comprises:

aggregating the hardware resources of each node into the Spark framework;

dividing the overall task into multiple small tasks through the Spark computing framework, allocating concurrent threads and computing according to the resources each single task requires, and then aggregating and persisting the results of the single tasks.

According to another embodiment of the present invention, the parsing result is subjected to multi-node cluster indexing by a distributed full-text indexing technique.

According to another embodiment of the invention, the visualization analysis of the big data adopts a relational object query technology.

The invention provides a rich media file parsing method with the following beneficial effects. Massive rich media file data is first screened and classified, sorting complex-structured data into relatively regular categories so that each data format can be processed by a parser specific to that format. A resource factory automatically allocates the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats. Spark parallel computation, run in a multi-threaded, highly concurrent mode, then maximizes parsing speed, and a distributed full-text indexing technology improves data security and overall query speed. Finally, big data visual analysis presents intuitive, accurate, and efficient processing results to the user.

Drawings

In order to illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings needed in the description are briefly introduced below.

FIG. 1 is a flow chart illustrating an embodiment of a rich media file parsing method according to the present invention;

FIG. 2 is a flowchart illustrating an embodiment of step 100 of a rich media file parsing method according to the present invention;

FIG. 3 is a functional block diagram illustrating step 100 of a rich media file parsing method of the present invention;

FIG. 4 is a flowchart illustrating one embodiment of step 200 of a rich media file parsing method according to the present invention;

FIG. 5 is a functional block diagram of step 200 of a rich media file parsing method of the present invention;

FIG. 6 is a flowchart illustrating an embodiment of step 300 of a rich media file parsing method according to the present invention;

FIG. 7 is a functional block diagram of step 300 of a rich media file parsing method of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings.

As shown in FIG. 1, a rich media file parsing method includes:

Step 100: screening and classifying massive rich media files by file format;

Step 200: allocating, through a resource factory, the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats to the screened and classified rich media files;

Step 300: performing highly concurrent parsing through the allocated data parsing interfaces on each node, using the Spark parallel computing framework;

Step 400: performing multi-node cluster indexing on the parsed results;

Step 500: performing visual analysis of the big data based on the index query interface.

The rich media file parsing method first screens and classifies massive rich media file data, sorting complex-structured data into relatively regular categories so that each data format can be processed by a parser specific to that format. A resource factory automatically allocates the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats. Spark parallel computation, run in a multi-threaded, highly concurrent mode, then maximizes parsing speed, and a distributed full-text indexing technology improves data security and overall query speed. Finally, big data visual analysis presents intuitive, accurate, and efficient processing results to the user.

Optionally, the rich media files in the embodiment of the present invention include ZIP archives, RAR archives, HAR archives, PST/OST mail archive files, and integrated document folders.

In some embodiments, referring to FIGS. 2-3, step 100 of the rich media file parsing method of the present invention comprises:

Step 101: decompressing the massive rich media files, using a traversal algorithm to extract nested archives layer by layer, and saving the extracted physical files in a temporary directory of the distributed file storage.

Step 102: sorting the decompressed files with a built-in screening and distribution engine, which distinguishes and classifies file formats by file name suffix and temporarily stores the files in classified folders named after the different data formats, to facilitate the subsequent parsing.

In this step, the classified folders include Word documents, Excel documents, PPT documents/PDF documents, picture files, Eml files, mobile phone backup/forensic investigation data, and hard disk backup/forensic investigation data.
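Steps 101 and 102 can be illustrated with a minimal sketch. It assumes ZIP archives only (RAR, HAR, and PST/OST would need other libraries), a local directory standing in for the distributed file storage, and a hypothetical suffix table; the function names `extract_nested` and `classify` are illustrative and do not come from the patent.

```python
import shutil
import zipfile
from pathlib import Path

# Hypothetical suffix-to-folder table. The folder categories follow the text;
# the exact suffix lists are assumptions made for illustration.
CATEGORIES = {
    ".doc": "word", ".docx": "word",
    ".xls": "excel", ".xlsx": "excel",
    ".ppt": "ppt_pdf", ".pptx": "ppt_pdf", ".pdf": "ppt_pdf",
    ".jpg": "picture", ".png": "picture",
    ".eml": "eml",
}


def extract_nested(archive: Path, dest: Path, max_depth: int = 10) -> list[Path]:
    """Extract a ZIP archive and every ZIP archive nested inside it.

    A work stack replaces recursion: each nested archive found during the
    traversal is pushed and unpacked into its own sub-directory.
    """
    files: list[Path] = []
    stack = [(archive, dest, 0)]
    while stack:
        current, out_dir, depth = stack.pop()
        out_dir.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(current) as zf:
            zf.extractall(out_dir)
        for path in sorted(p for p in out_dir.rglob("*") if p.is_file()):
            # Decide by suffix, not by content: .docx and .xlsx files are
            # themselves ZIP containers and must not be unpacked.
            if path.suffix.lower() == ".zip" and depth < max_depth:
                nested_dir = path.parent / (path.name + ".extracted")
                stack.append((path, nested_dir, depth + 1))
            else:
                files.append(path)
    return files


def classify(files: list[Path], root: Path) -> dict[str, list[str]]:
    """Copy each file into a folder named after its format category."""
    placed: dict[str, list[str]] = {}
    for f in files:
        category = CATEGORIES.get(f.suffix.lower(), "other")
        target_dir = root / category
        target_dir.mkdir(parents=True, exist_ok=True)
        # Files with the same name overwrite each other in this sketch.
        shutil.copy2(f, target_dir / f.name)
        placed.setdefault(category, []).append(f.name)
    return placed
```

Calling `classify(extract_nested(Path("batch.zip"), Path("tmp")), Path("sorted"))` would leave one folder per format under `sorted`, which is the input the later allocation step expects.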

In some embodiments, referring to FIGS. 4-5, step 200 of the rich media file parsing method of the present invention comprises:

Step 201: allocating parsing interfaces according to the data format of the files.

In this step, when the input is Word, Excel, PPT, or PDF documents, the resource factory allocates the document parsing interface; when the input is Eml files, audio files, or video files, the resource factory automatically allocates the media file parsing interface; when the input is mobile phone or hard disk forensic investigation data, the forensic investigation parsing interface is allocated.
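The dispatch in step 201 resembles a conventional factory pattern. The sketch below is one possible reading, not the patented implementation: the class name `ResourceFactory`, the category keys, and the three placeholder parser functions are all hypothetical, and the parsers only return a tag instead of doing real parsing.

```python
from typing import Callable, Dict

ParseFn = Callable[[str], dict]


def parse_document(path: str) -> dict:
    """Placeholder for the document parsing interface."""
    return {"interface": "document", "path": path}


def parse_media(path: str) -> dict:
    """Placeholder for the media file parsing interface."""
    return {"interface": "media", "path": path}


def parse_forensic(path: str) -> dict:
    """Placeholder for the forensic investigation parsing interface."""
    return {"interface": "forensic", "path": path}


class ResourceFactory:
    """Maps a classified data format to the parsing interface that handles it."""

    # Grouping follows the text; the key names are assumptions.
    _BY_CATEGORY: Dict[str, ParseFn] = {
        "word": parse_document,
        "excel": parse_document,
        "ppt": parse_document,
        "pdf": parse_document,
        "eml": parse_media,
        "audio": parse_media,
        "video": parse_media,
        "phone_forensics": parse_forensic,
        "disk_forensics": parse_forensic,
    }

    def allocate_interface(self, category: str) -> ParseFn:
        """Return the parsing interface for a category, or fail loudly."""
        try:
            return self._BY_CATEGORY[category.lower()]
        except KeyError:
            raise ValueError(f"no parsing interface for category {category!r}") from None
```

Keeping the mapping in one table means a new format only needs one new entry, which is the usual motivation for routing allocation through a factory.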

Step 202: allocating different hardware resources according to the data volume handled by each parsing interface, to obtain the hardware resources of each data node. For example, suppose the platform supports up to 10 TB per batch. When the total size of the input files is 2 TB, 8 cores and 32 GB of memory are automatically allocated to each node for the subsequent parsing; when the total size of the input files is 10 TB, 32 cores and 128 GB of memory are automatically allocated according to the parsing requirements.
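The size-based allocation in step 202 could be sketched as a tier lookup. Only the two data points (2 TB and 10 TB) and the 10 TB batch limit come from the text; treating them as upper bounds of two tiers, and the names `NodeResources` and `allocate_node_resources`, are assumptions made for illustration.

```python
from typing import NamedTuple


class NodeResources(NamedTuple):
    cpu_cores: int
    memory_gb: int


# Batch limit stated in the text.
MAX_BATCH_TB = 10

# (upper bound in TB, resources). The two entries reproduce the example in the
# text; how sizes between 2 TB and 10 TB are handled is an assumption.
TIERS = [
    (2, NodeResources(cpu_cores=8, memory_gb=32)),
    (10, NodeResources(cpu_cores=32, memory_gb=128)),
]


def allocate_node_resources(total_input_tb: float) -> NodeResources:
    """Pick the smallest tier whose upper bound covers the input size."""
    if total_input_tb <= 0 or total_input_tb > MAX_BATCH_TB:
        raise ValueError(f"input size must be in (0, {MAX_BATCH_TB}] TB")
    for upper_bound_tb, resources in TIERS:
        if total_input_tb <= upper_bound_tb:
            return resources
    raise AssertionError("unreachable: TIERS covers the full batch range")
```

A real scheduler would probably scale more smoothly than two fixed tiers; the table form is used here only because it matches the two cases the text gives.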

In some embodiments, referring to FIGS. 6-7, step 300 of the rich media file parsing method of the present invention comprises:

Step 301: aggregating the hardware resources of each node.

In this embodiment, when each node is configured with a 32-core CPU and 128 GB of memory, aggregation yields a pool of 128 CPU cores and 512 GB of memory (that is, the combined resources of four such nodes). Because the parsing computes directly in memory, parsing efficiency is greatly improved and intermediate data need not be written to disk, which avoids the disk I/O bottleneck.

Step 302: dividing the overall task into multiple small tasks through the Spark computing framework, allocating concurrent threads and computing according to the resources each single task requires, and then aggregating and persisting the results of the single tasks.

In this embodiment, when a single task needs 4 GB of memory and one CPU core, about 100 threads can typically be allocated for concurrent computation, greatly improving running speed and execution efficiency.
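The arithmetic behind steps 301 and 302 can be checked with a short sketch. It uses Python's `concurrent.futures` thread pool as a stand-in for Spark so that it runs without a cluster; the helper names are hypothetical, and the four-node layout is inferred from the 128-core, 512 GB total rather than stated in the text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple


def aggregate(nodes: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Sum (cpu_cores, memory_gb) over all nodes."""
    total_cores = 0
    total_memory_gb = 0
    for cores, memory_gb in nodes:
        total_cores += cores
        total_memory_gb += memory_gb
    return total_cores, total_memory_gb


def max_concurrent_tasks(total_cores: int, total_memory_gb: int,
                         task_cores: int, task_memory_gb: int) -> int:
    """Upper bound on simultaneous tasks: the scarcer resource decides."""
    return min(total_cores // task_cores, total_memory_gb // task_memory_gb)


def run_concurrently(items: Iterable, parse_one: Callable, workers: int) -> List:
    """Split the work into one small task per item and gather the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order when collecting the results.
        return list(pool.map(parse_one, items))


# Four nodes of 32 cores and 128 GB each (assumed layout).
cores, memory_gb = aggregate([(32, 128)] * 4)
ceiling = max_concurrent_tasks(cores, memory_gb, task_cores=1, task_memory_gb=4)
# The text allocates about 100 threads, which stays under this ceiling.
workers = min(100, ceiling)
```

With these figures the pool is 128 cores and 512 GB, so the ceiling is 128 concurrent tasks. In Spark itself the equivalent of `run_concurrently` would be distributing the items as a partitioned dataset and mapping the parse function over it, with results collected or persisted at the end.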

Preferably, in step 400 of the rich media file parsing method of the present invention, multi-node cluster indexing is performed on the parsed results using a distributed full-text indexing technology.

A distributed full-text index differs from ordinary database query technology: it is a search engine built specifically for querying massive data, with search responses at the millisecond level. Through data sharding, the content of the original files is split into index files in the Lucene format, which both speeds up search and helps compress the data.

The distributed full-text indexing technology also stores multiple backup copies of each shard, scattered across different racks, different nodes, and different data blocks. This effectively prevents data loss caused by disk damage or by an unexpected failure in the machine room. The health value of the index cluster gives the user an intuitive indication of its state.
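The sharding, backup placement, and health indication described above can be sketched in a few functions. This is a toy model under stated assumptions: documents are routed to shards by a CRC32 hash, copies are placed round-robin on distinct nodes (rack awareness is omitted), and the green/yellow/red health labels are borrowed from common search-cluster conventions rather than taken from the patent.

```python
import zlib
from typing import Dict, List, Set


def shard_for(doc_id: str, num_shards: int) -> int:
    """Route a document to a shard with a stable hash of its identifier."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def place_copies(num_shards: int, nodes: List[str], copies: int) -> Dict[int, List[str]]:
    """Put `copies` copies of every shard on distinct nodes, round-robin."""
    if copies > len(nodes):
        raise ValueError("cannot place more copies than there are nodes")
    return {
        shard: [nodes[(shard + i) % len(nodes)] for i in range(copies)]
        for shard in range(num_shards)
    }


def cluster_health(placement: Dict[int, List[str]], live_nodes: Set[str]) -> str:
    """Summarize shard availability as a single health value.

    green: every copy is reachable; yellow: every shard still has at least
    one reachable copy; red: at least one shard has no reachable copy.
    """
    status = "green"
    for shard_nodes in placement.values():
        alive = [node for node in shard_nodes if node in live_nodes]
        if not alive:
            return "red"
        if len(alive) < len(shard_nodes):
            status = "yellow"
    return status
```

Losing a single node in a three-node, two-copy layout degrades the health value but leaves every shard readable, which is the protection against disk and machine room failures that the text describes.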

Preferably, in step 500 of the rich media file parsing method of the present invention, the visual analysis of the big data adopts a relational object query technology.

On the one hand, the back end relies on the query interface of the distributed full-text indexing technology, which keeps query response times low and allows large amounts of JSON data to be requested in batches within a short time.

On the other hand, the big data visual analysis adopts a relational object query technology, such as the Neo4j database query interface. For one-to-one and many-to-many relational queries, the visual analysis can then present results more accurately, efficiently, and quickly.
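To make the idea of a relational object query concrete, the sketch below models a many-to-many relation (senders, mails, recipients) as an in-memory graph and follows two relations in sequence. It is plain Python rather than Neo4j, and the entity names, the `SENT` and `TO` relation labels, and the `RelationGraph` class are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


class RelationGraph:
    """Directed, labeled edges between named objects."""

    def __init__(self) -> None:
        self._out: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def relate(self, source: str, relation: str, target: str) -> None:
        """Record that `source` is linked to `target` by `relation`."""
        self._out[(source, relation)].add(target)

    def targets(self, source: str, relation: str) -> Set[str]:
        """Return every object reachable from `source` over `relation`."""
        return set(self._out.get((source, relation), set()))


def recipients_of(graph: RelationGraph, sender: str) -> Set[str]:
    """Two-hop query: sender -SENT-> mail -TO-> recipient.

    In a graph database such as Neo4j the same question would be a single
    pattern match, for example (with hypothetical labels):
    MATCH (a:Person)-[:SENT]->(m:Mail)-[:TO]->(b:Person) RETURN b.name
    """
    found: Set[str] = set()
    for mail in graph.targets(sender, "SENT"):
        found |= graph.targets(mail, "TO")
    return found
```

Because one sender can have many mails and one mail many recipients, the result is naturally a set, and it is this kind of relationship data that a visualization front end would draw as a network.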

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

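1. A rich media file parsing method, characterized by comprising the following steps:

screening and classifying massive rich media files by file format;

allocating, through a resource factory, the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats to the screened and classified rich media files;

performing highly concurrent parsing through the allocated data parsing interfaces on each node, using the Spark parallel computing framework;

performing multi-node cluster indexing on the parsed results; and

performing visual analysis of the big data based on the index query interface.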

2. The rich media file parsing method of claim 1, wherein: the rich media files include ZIP archives, RAR archives, HAR archives, PST/OST mail archive files, and integrated document folders.

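3. The rich media file parsing method of claim 1, wherein the step of screening and classifying the massive rich media files by file format comprises:

decompressing the massive rich media files, using a traversal algorithm to extract nested archives layer by layer;

sorting the decompressed files with a built-in screening and distribution engine, which distinguishes and classifies file formats by file name suffix and temporarily stores the files in classified folders named after the different data formats.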

4. The rich media file parsing method of claim 3, wherein: the classified files comprise Word documents, Excel documents, PPT documents/PDF documents, picture files, Eml files, mobile phone backup/forensic investigation data, and hard disk backup/forensic investigation data.

5. The rich media file parsing method according to claim 3, wherein the step of allocating, through the resource factory, the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats comprises:

allocating parsing interfaces according to the data format of the files: when the input is Word, Excel, PPT, or PDF documents, the resource factory allocates the document parsing interface; when the input is Eml files, audio files, or video files, the resource factory automatically allocates the media file parsing interface; when the input is mobile phone or hard disk forensic investigation data, the forensic investigation parsing interface is allocated;

allocating different hardware resources according to the data volume handled by each parsing interface, to obtain the hardware resources of each data node.

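6. The rich media file parsing method according to claim 5, wherein the step of performing highly concurrent parsing through the allocated data parsing interfaces on each node, using the Spark parallel computing framework, comprises:

aggregating the hardware resources of each node into the Spark framework;

dividing the overall task into multiple small tasks through the Spark computing framework, allocating concurrent threads and computing according to the resources each single task requires, and then aggregating and persisting the results of the single tasks.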

7. The rich media file parsing method as claimed in claim 1, wherein multi-node cluster indexing is performed on the parsed results using a distributed full-text indexing technology.

8. The rich media file parsing method as claimed in claim 1, wherein the visual analysis of the big data adopts a relational object query technology.