CN111104527B - Rich media file analysis method

Info

Publication number: CN111104527B
Application number: CN201911309803.4A
Authority: CN
Inventors: 程俊; 李文飞
Original assignee: Write Easy Network Technology Shanghai Co ltd
Current assignee: Write Easy Network Technology Shanghai Co ltd
Priority date: 2019-12-18
Filing date: 2019-12-18
Publication date: 2023-06-23
Anticipated expiration: 2039-12-18
Also published as: CN111104527A

Abstract

A rich media file parsing method comprises five main stages: data screening and classification, resource factory allocation, Spark multi-concurrency parsing, multi-node cluster indexing, and big data visual analysis. The invention first screens and classifies massive rich media file data, reducing structurally complex data to relatively regular classified data so that each individual data format can be processed accurately. A resource factory automatically allocates the hardware resources for processing and the data parsing interfaces required by each file format. Spark parallel computing, run in a multi-threaded, highly concurrent mode, maximizes parsing speed. A distributed full-text indexing technique improves data safety and overall query speed. Finally, big data visual analysis presents intuitive, accurate, and efficient processing results to the user.

Description

Rich media file analysis method

Technical Field

The invention relates to the technical field of big data processing, in particular to a rich media file analysis method.

Background

With the continued growth of internet data, the variety and volume of data are increasing explosively at unprecedented speed.

Ordinary enterprises and related organizations hold data in a wide variety of common file formats, such as mail data, document data (including Office documents, PDF documents, etc.), web page data, ticket data, fund data, mobile phone backup and forensic investigation data, computer backup and forensic investigation data, and structured database data (MySQL, Oracle, SQL Server, Access, MongoDB, Redis). How to comprehensively store, use, and analyze this data, and how to support queries and data mining for common business needs, are difficult problems that place high demands on technical capability.

Disclosure of Invention

The invention provides a rich media file parsing method that addresses how to comprehensively store, use, and analyze data in various file formats, and how to support queries and data mining for common business needs.

To achieve the above purpose, the invention is implemented through the following technical solution:

a rich media file parsing method, comprising:

screening and classifying massive rich media files by file format;

allocating, through a resource factory, the hardware resources for processing and the data parsing interfaces required by the corresponding file formats to the screened and classified rich media files;

using the Spark parallel computing framework to perform high-concurrency parsing on the data parsing interfaces allocated to each node;

performing multi-node cluster indexing on the parsed results;

and performing big data visual analysis based on the index query interface.

According to another embodiment of the present invention, the rich media files include ZIP archives, RAR archives, HAR archives, PST/OST mail archive files, and composite document folders.

According to another embodiment of the present invention, the step of screening and classifying the massive rich media files by file format includes:

decompressing the massive rich media files, using a traversal algorithm to decompress and extract nested archives layer by layer;

sorting the decompressed files through a built-in screening and distribution engine, distinguishing and classifying file formats by file name suffix, and temporarily storing the files in classified folders named after the different data formats.
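The layer-by-layer decompression described above can be sketched as a recursive traversal. This is a minimal illustration, not the patented implementation: it handles only ZIP archives (the Python standard library has no RAR or PST/OST reader), works in memory rather than on distributed storage, and the names `extract_nested` and `make_zip` are hypothetical.

```python
import io
import zipfile
from pathlib import PurePosixPath


def extract_nested(data: bytes, prefix: str = "", max_depth: int = 8) -> dict[str, bytes]:
    """Recursively unpack a ZIP archive held in memory.

    Returns a mapping from a slash-joined path to the bytes of each
    non-archive ("entity") file. Nested ZIPs are expanded in place.
    """
    found: dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            payload = archive.read(info)
            path = f"{prefix}{info.filename}"
            looks_like_zip = PurePosixPath(info.filename).suffix.lower() == ".zip"
            # max_depth guards against pathologically deep nesting.
            if looks_like_zip and max_depth > 0 and zipfile.is_zipfile(io.BytesIO(payload)):
                found.update(extract_nested(payload, path + "/", max_depth - 1))
            else:
                found[path] = payload
    return found


def make_zip(members: dict[str, bytes]) -> bytes:
    """Demo helper: build a ZIP archive in memory from name -> bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


if __name__ == "__main__":
    inner = make_zip({"report.docx": b"doc"})
    outer = make_zip({"mail.eml": b"mail", "nested/inner.zip": inner})
    for path in sorted(extract_nested(outer)):
        print(path)
```

The depth limit is a design choice added for the sketch; the source text does not state how deeply nested archives are bounded.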

According to another embodiment of the present invention, the classified files include Word documents, Excel documents, PPT/PDF documents, picture files, EML files, mobile phone backup/forensic investigation data, and hard disk backup/forensic investigation data.

According to another embodiment of the present invention, the step of allocating, through the resource factory, the hardware resources for processing and the data parsing interfaces required by the corresponding file formats includes:

allocating parsing interfaces according to the data format of the files: when the input is a Word, Excel, PPT, or PDF document, the resource factory allocates the document parsing interface; when the input is an EML file, audio file, or video file, the resource factory automatically allocates the media file parsing interface; when the input is mobile phone forensic investigation data or hard disk forensic investigation data, the forensic investigation parsing interface is allocated;

and allocating different hardware resources according to the data size handled by each parsing interface, to obtain the hardware resources of each data node.
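The interface allocation above follows a factory pattern: a format category goes in, the matching parsing interface comes out. The sketch below assumes hypothetical category labels and a placeholder `ParseInterface` class; the patent names the three interface families but does not define their API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParseInterface:
    """Stand-in for a parsing interface; a real one would wrap an actual parser."""
    name: str

    def parse(self, path: str) -> dict:
        # Placeholder result: records only which interface handled the file.
        return {"path": path, "interface": self.name}


class ResourceFactory:
    """Maps a classified data format to the parsing interface that handles it."""

    def __init__(self) -> None:
        document = ParseInterface("document")
        media = ParseInterface("media")
        forensic = ParseInterface("forensic")
        # Category keys are hypothetical labels for the formats the text lists.
        self._by_category = {
            "word": document, "excel": document, "ppt": document, "pdf": document,
            "eml": media, "audio": media, "video": media,
            "phone_forensics": forensic, "disk_forensics": forensic,
        }

    def interface_for(self, category: str) -> ParseInterface:
        try:
            return self._by_category[category.lower()]
        except KeyError:
            raise ValueError(f"no parsing interface for category {category!r}") from None


if __name__ == "__main__":
    factory = ResourceFactory()
    print(factory.interface_for("pdf").parse("/tmp/contract.pdf"))
```

Keeping the mapping in one table means a new format only needs one new entry, which is the usual motivation for a factory of this kind.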

According to another embodiment of the present invention, the step of using the Spark parallel computing framework to perform high-concurrency parsing on the data parsing interfaces allocated to each node includes:

pooling the hardware resources of each node into the Spark framework;

and dividing the overall task into a number of small tasks through the Spark computing framework, allocating concurrent threads and computing according to the resources each single task requires, and then aggregating and persisting the results of the individual tasks.

According to another embodiment of the present invention, the multi-node cluster indexing is performed on the parsed results using a distributed full-text indexing technique.
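As a rough illustration of what a sharded full-text index does, the sketch below builds a toy inverted index split across several in-process shards and fans a query out to all of them. It is an assumption-laden stand-in: the class `ShardedIndex`, the CRC-based routing, and the simple word tokenizer are all hypothetical, and a production system would use a real search engine rather than Python dictionaries.

```python
import re
import zlib
from collections import defaultdict


class ShardedIndex:
    """Toy inverted index: term -> set of document ids, split across shards."""

    def __init__(self, n_shards: int) -> None:
        self.shards = [defaultdict(set) for _ in range(n_shards)]

    def _shard_for(self, doc_id: str) -> int:
        # crc32 gives a routing hash that is stable across runs,
        # unlike the built-in hash() for strings.
        return zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)

    def add(self, doc_id: str, text: str) -> None:
        postings = self.shards[self._shard_for(doc_id)]
        for term in re.findall(r"\w+", text.lower()):
            postings[term].add(doc_id)

    def search(self, term: str) -> set[str]:
        # Fan out to every shard and merge the partial hit sets.
        hits: set[str] = set()
        for postings in self.shards:
            hits |= postings.get(term.lower(), set())
        return hits


if __name__ == "__main__":
    index = ShardedIndex(n_shards=3)
    index.add("mail-1", "Invoice attached for March")
    index.add("doc-7", "Quarterly invoice summary")
    index.add("doc-9", "Meeting notes")
    print(sorted(index.search("invoice")))
```

Because each document lives on exactly one shard, a query has to visit every shard, but each shard stays small enough to search quickly.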

According to another embodiment of the present invention, the big data visual analysis uses a relational object query technique.

The invention provides a rich media file parsing method with the following beneficial effects. Massive rich media file data is first screened and classified, reducing structurally complex data to relatively regular classified data so that each individual data format can be processed accurately. A resource factory automatically allocates the hardware resources for processing and the data parsing interfaces required by each file format. Spark parallel computing, run in a multi-threaded, highly concurrent mode, maximizes parsing speed. A distributed full-text indexing technique improves data safety and overall query speed. Finally, big data visual analysis presents intuitive, accurate, and efficient processing results to the user.

Drawings

To illustrate the technical solutions of the invention or of the prior art more clearly, the drawings used in the description are briefly introduced below.

FIG. 1 is a flow chart of one embodiment of the rich media file parsing method of the present invention;

FIG. 2 is a flow chart of one embodiment of step 100 of the rich media file parsing method of the present invention;

FIG. 3 is a functional block diagram of step 100 of the rich media file parsing method of the present invention;

FIG. 4 is a flow chart of one embodiment of step 200 of the rich media file parsing method of the present invention;

FIG. 5 is a functional block diagram of step 200 of the rich media file parsing method of the present invention;

FIG. 6 is a flow chart of one embodiment of step 300 of the rich media file parsing method of the present invention;

FIG. 7 is a functional block diagram of step 300 of the rich media file parsing method of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings.

As shown in FIG. 1, a rich media file parsing method includes:

Step 100: screening and classifying massive rich media files by file format;

Step 200: allocating, through a resource factory, the hardware resources for processing and the data parsing interfaces required by the corresponding file formats to the screened and classified rich media files;

Step 300: using the Spark parallel computing framework to perform high-concurrency parsing on the data parsing interfaces allocated to each node;

Step 400: performing multi-node cluster indexing on the parsed results;

Step 500: performing big data visual analysis based on the index query interface.

The rich media file parsing method of this embodiment first screens and classifies massive rich media file data, reducing structurally complex data to relatively regular classified data so that each individual data format can be processed accurately. A resource factory automatically allocates the hardware resources for processing and the data parsing interfaces required by each file format. Spark parallel computing, run in a multi-threaded, highly concurrent mode, maximizes parsing speed. A distributed full-text indexing technique improves data safety and overall query speed. Finally, big data visual analysis presents intuitive, accurate, and efficient processing results to the user.

Optionally, the rich media files in this embodiment include ZIP archives, RAR archives, HAR archives, PST/OST mail archive files, and composite document folders.

In some embodiments, referring to FIGS. 2-3, step 100 of the rich media file parsing method of the present invention includes:

Step 101: decompressing the massive rich media files, using a traversal algorithm to decompress and extract nested archives layer by layer, and saving the resulting entity files in a temporary directory of the distributed file storage.

Step 102: sorting the decompressed files through a built-in screening and distribution engine, distinguishing and classifying file formats by file name suffix, and temporarily storing the files in classified folders named after the different data formats, which facilitates the subsequent parsing operations.

In this step, the classified folders include Word documents, Excel documents, PPT/PDF documents, picture files, EML files, mobile phone backup/forensic investigation data, and hard disk backup/forensic investigation data.
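Suffix-based sorting into format folders, as in Step 102, can be sketched as a lookup from file extension to folder name. The suffix table and folder labels below are illustrative assumptions; the text lists the categories but not the exact suffixes, and it does not say which suffixes identify phone or hard disk forensic data, so those are left out of the table.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical suffix table; extend it to match the formats actually present.
SUFFIX_TO_FOLDER = {
    ".doc": "word", ".docx": "word",
    ".xls": "excel", ".xlsx": "excel",
    ".ppt": "ppt_pdf", ".pptx": "ppt_pdf", ".pdf": "ppt_pdf",
    ".jpg": "picture", ".jpeg": "picture", ".png": "picture",
    ".eml": "eml",
}


def classify_by_suffix(paths: list[str]) -> dict[str, list[str]]:
    """Group file paths into named folders according to their suffix."""
    folders: dict[str, list[str]] = defaultdict(list)
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()  # case-insensitive match
        folders[SUFFIX_TO_FOLDER.get(suffix, "unclassified")].append(path)
    return dict(folders)


if __name__ == "__main__":
    sample = ["a/Report.DOCX", "b/scan.pdf", "c/note.eml", "d/image.PNG", "e/blob.bin"]
    for folder, members in classify_by_suffix(sample).items():
        print(folder, members)
```

Files with unknown suffixes land in an `unclassified` bucket here so that nothing is silently dropped; how the actual engine treats such files is not stated in the source.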

In some embodiments, referring to FIGS. 4-5, step 200 of the rich media file parsing method of the present invention includes:

Step 201: allocating parsing interfaces according to the data format of the files.

In this step, when the input is a Word, Excel, PPT, or PDF document, the resource factory allocates the document parsing interface; when the input is an EML file, audio file, or video file, the resource factory automatically allocates the media file parsing interface; when the input is mobile phone forensic investigation data or hard disk forensic investigation data, the forensic investigation parsing interface is allocated.

Step 202: allocating different hardware resources according to the data size handled by each parsing interface, to obtain the hardware resources of each data node. For example, when the platform supports up to 10 TB and the input files total 2 TB, each node is automatically allocated 8 cores and 32 GB of memory for the subsequent parsing; when the input files total 10 TB, each node is automatically allocated 32 cores and 128 GB of memory.
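The text gives only two data points for Step 202 (2 TB maps to 8 cores and 32 GB; 10 TB maps to 32 cores and 128 GB) and does not state the rule between them. The sketch below assumes one rule that reproduces both points: scale cores in proportion to input size, round up to a multiple of 8, and pair each core with 4 GB of memory. The function name and the rounding step are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeResources:
    cores: int
    memory_gb: int


def allocate_per_node(input_tb: float, capacity_tb: float = 10.0,
                      max_cores: int = 32, gb_per_core: int = 4,
                      core_step: int = 8) -> NodeResources:
    """Assumed sizing rule: proportional cores, rounded up to core_step."""
    if not 0 < input_tb <= capacity_tb:
        raise ValueError("input size must be positive and within platform capacity")
    proportional = max_cores * input_tb / capacity_tb
    cores = min(max_cores, math.ceil(proportional / core_step) * core_step)
    return NodeResources(cores=cores, memory_gb=cores * gb_per_core)


if __name__ == "__main__":
    print(allocate_per_node(2))   # the 2 TB example from the text
    print(allocate_per_node(10))  # the 10 TB example from the text
```

Any monotone rule passing through the same two points would be equally consistent with the source; this one is chosen only because it is simple.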

In some embodiments, referring to FIGS. 6-7, step 300 of the rich media file parsing method of the present invention includes:

Step 301: pooling the hardware resources of each node.

In this embodiment, when each node is configured with a 32-core CPU and 128 GB of memory, the pooled hardware resources amount to 128 CPU cores and 512 GB of memory, which corresponds to four such nodes. Because parsing is computed directly in memory, parsing efficiency is greatly improved and the cost of writing intermediate data to disk and of disk I/O is avoided.
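The pooling arithmetic in this embodiment can be checked with a few lines. The node count of four is inferred from the stated totals (128 / 32 = 4) rather than given explicitly, and `NodeSpec` and `pool_resources` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    cores: int
    memory_gb: int


def pool_resources(nodes: list[NodeSpec]) -> NodeSpec:
    """Sum the cores and memory of every node into one cluster-wide total."""
    return NodeSpec(
        cores=sum(node.cores for node in nodes),
        memory_gb=sum(node.memory_gb for node in nodes),
    )


if __name__ == "__main__":
    # Four nodes is an inference from the totals quoted in the text.
    cluster = pool_resources([NodeSpec(cores=32, memory_gb=128)] * 4)
    print(cluster)
```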

Step 302: dividing the overall task into a number of small tasks through the Spark computing framework, allocating concurrent threads and computing according to the resources each single task requires, and then aggregating and persisting the results of the individual tasks.

In this embodiment, when a single task requires 4 GB of memory and one CPU core, about 100 threads can typically be allocated for concurrent computation, which greatly improves running speed and execution efficiency.
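The split, compute, and aggregate pattern of Step 302 can be sketched without a cluster. In the example below a standard-library thread pool stands in for Spark, which is a deliberate substitution so the code runs anywhere; the word-counting task is a toy placeholder for real parsing, persistence of results is omitted, and all function names are hypothetical. The first function also shows the capacity ceiling: with 128 cores and 512 GB pooled, at most 128 tasks of 1 core and 4 GB fit at once, so the figure of about 100 threads sits below that ceiling (one plausible reading is that it leaves headroom, though the text does not say so).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def max_concurrent_tasks(total_cores: int, total_memory_gb: int,
                         task_cores: int = 1, task_memory_gb: int = 4) -> int:
    """Upper bound on tasks that fit in the pooled resources at once."""
    return min(total_cores // task_cores, total_memory_gb // task_memory_gb)


def split(items: Sequence[str], n_chunks: int) -> list[list[str]]:
    """Divide one overall task into up to n_chunks small tasks."""
    chunks = [list(items[i::n_chunks]) for i in range(n_chunks)]
    return [chunk for chunk in chunks if chunk]


def count_words(chunk: list[str]) -> int:
    """Toy parsing task: count whitespace-separated words in a chunk."""
    return sum(len(text.split()) for text in chunk)


def run_job(items: Sequence[str], task: Callable[[list[str]], int], threads: int) -> int:
    """Run the small tasks concurrently, then aggregate their results."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partial_results = list(pool.map(task, split(items, threads)))
    return sum(partial_results)


if __name__ == "__main__":
    print(max_concurrent_tasks(total_cores=128, total_memory_gb=512))
    documents = ["a b c", "d e", "f"] * 10
    print(run_job(documents, count_words, threads=4))
```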

Preferably, in step 400 of the rich media file parsing method of the present invention, multi-node cluster indexing is performed on the parsed results using a distributed full-text indexing technique.

Unlike ordinary database query techniques, a distributed full-text index is a search engine purpose-built for querying massive data, with search responses at the millisecond level. Through data sharding, the content of the original files is split into index files in the Lucene format, which facilitates fast search and compresses the data size.

The distributed full-text indexing technique also stores multiple backup copies of each shard and distributes them across different data blocks on different racks and different nodes. This effectively prevents data loss caused by data disk damage or unexpected machine-room failures. The health value of the index cluster gives the user an intuitive indication of the cluster's state.
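Rack-aware placement of shard copies, and a health value derived from it, can be sketched as follows. This is a hypothetical model: the placement strategy, the `IndexNode` type, and the green/yellow/red labels are assumptions chosen for illustration (the colour scheme mirrors a convention used by some search clusters), since the text says only that copies go to different racks and nodes and that a health value is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexNode:
    name: str
    rack: str


def place_shards(n_shards: int, nodes: list[IndexNode],
                 copies: int = 2) -> dict[int, list[IndexNode]]:
    """Assign each shard `copies` nodes, preferring nodes on distinct racks."""
    if copies > len(nodes):
        raise ValueError("not enough nodes for the requested number of copies")
    placement: dict[int, list[IndexNode]] = {}
    for shard in range(n_shards):
        start = shard % len(nodes)            # rotate to spread load
        rotated = nodes[start:] + nodes[:start]
        chosen: list[IndexNode] = []
        used_racks: set[str] = set()
        for node in rotated:                  # first pass: one copy per rack
            if len(chosen) < copies and node.rack not in used_racks:
                chosen.append(node)
                used_racks.add(node.rack)
        for node in rotated:                  # second pass: fill if racks ran out
            if len(chosen) < copies and node not in chosen:
                chosen.append(node)
        placement[shard] = chosen
    return placement


def cluster_health(placement: dict[int, list[IndexNode]], failed: set[str]) -> str:
    """green: every copy live; yellow: every shard has a live copy; red: a shard is lost."""
    status = "green"
    for holders in placement.values():
        live = sum(node.name not in failed for node in holders)
        if live == 0:
            return "red"
        if live < len(holders):
            status = "yellow"
    return status


if __name__ == "__main__":
    cluster = [IndexNode("n1", "r1"), IndexNode("n2", "r1"),
               IndexNode("n3", "r2"), IndexNode("n4", "r2")]
    layout = place_shards(4, cluster)
    print(cluster_health(layout, failed={"n1", "n2"}))  # a whole rack is down
```

With copies spread over two racks, losing an entire rack degrades the cluster but loses no shard, which is the failure mode the paragraph above is guarding against.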

Preferably, in step 500 of the rich media file parsing method of the present invention, the big data visual analysis uses a relational object query technique.

On the one hand, the back end is built on the query interface of the distributed full-text indexing technique, which guarantees the response time of data queries and allows large amounts of JSON data to be requested in batches within a short time.

On the other hand, the big data visual analysis uses a relational object query technique, such as the Neo4j database query interface, so that the visual analysis can present more accurate, efficient, and rapid results for one-to-one and many-to-many relationship queries.
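The kind of relationship query meant here can be illustrated with a small in-memory graph. The sketch does not use Neo4j or its query language; `RelationGraph`, the `EMAILED` relation, and the sample people are hypothetical, and a graph database would answer the same questions at much larger scale.

```python
from collections import defaultdict


class RelationGraph:
    """Minimal directed graph of (source, relation, target) facts."""

    def __init__(self) -> None:
        self._out: dict[str, dict[str, set[str]]] = defaultdict(lambda: defaultdict(set))

    def relate(self, source: str, relation: str, target: str) -> None:
        self._out[source][relation].add(target)

    def targets(self, source: str, relation: str) -> set[str]:
        """One-to-many lookup: everything `source` is related to."""
        return set(self._out.get(source, {}).get(relation, set()))

    def common_targets(self, sources: list[str], relation: str) -> set[str]:
        """Many-to-many lookup: targets shared by all the given sources."""
        target_sets = [self.targets(source, relation) for source in sources]
        return set.intersection(*target_sets) if target_sets else set()


if __name__ == "__main__":
    graph = RelationGraph()
    graph.relate("alice", "EMAILED", "bob")
    graph.relate("alice", "EMAILED", "carol")
    graph.relate("dave", "EMAILED", "carol")
    print(sorted(graph.targets("alice", "EMAILED")))
    print(sorted(graph.common_targets(["alice", "dave"], "EMAILED")))
```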

The above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in those embodiments can still be modified, and that some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A rich media file parsing method, comprising:

screening and classifying massive rich media files by file format;

allocating, through a resource factory, the hardware resources for processing and the data parsing interfaces required by the corresponding file formats to the screened and classified rich media files, which specifically comprises: allocating parsing interfaces according to the data format of the files, wherein when the input is a Word, Excel, PPT, or PDF document, the resource factory allocates the document parsing interface; when the input is an EML file, audio file, or video file, the resource factory automatically allocates the media file parsing interface; and when the input is mobile phone forensic investigation data or hard disk forensic investigation data, the forensic investigation parsing interface is allocated; and allocating different hardware resources according to the data size handled by each parsing interface, to obtain the hardware resources of each data node;

performing multi-node cluster indexing on the analyzed result;

2. The method for parsing a rich media file according to claim 1, wherein: the rich media files comprise ZIP archives, RAR archives, HAR archives, PST/OST mail archive files, and composite document folders.

3. The method for parsing a rich media file according to claim 1, wherein: the step of screening and classifying the massive rich media files by file format comprises the following steps:

4. A rich media file parsing method according to claim 3, wherein: the classified files comprise Word documents, Excel documents, PPT/PDF documents, picture files, EML files, mobile phone backup/forensic investigation data, and hard disk backup/forensic investigation data.

5. The method for parsing a rich media file according to claim 1, wherein the step of using the Spark parallel computing framework to perform high-concurrency parsing on the data parsing interfaces allocated to each node includes:

pooling the hardware resources of each node into the Spark framework;

6. The method of claim 1, wherein the multi-node cluster indexing is performed on the result of the parsing process by a distributed full-text indexing technique.

7. The method of claim 1, wherein the visual analysis of the big data employs a relational object query technique.