CN111104527B - Rich media file analysis method - Google Patents

Info

Publication number
CN111104527B
Authority
CN
China
Prior art keywords
data
analysis
rich media
file
files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911309803.4A
Other languages
Chinese (zh)
Other versions
CN111104527A (en)
Inventor
程俊
李文飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Write Easy Network Technology Shanghai Co ltd
Original Assignee
Write Easy Network Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Write Easy Network Technology Shanghai Co ltd
Priority to CN201911309803.4A
Publication of CN111104527A
Application granted
Publication of CN111104527B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/41 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/44 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/45 Clustering; Classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A rich media file parsing method comprises five main stages: data screening and classification, resource factory allocation, highly concurrent parsing with Spark, multi-node cluster indexing, and big data visual analysis. The invention first screens and classifies massive rich media file data, reducing data of complex structure to relatively regular classified data so that each individual data format can be processed accurately. A resource factory automatically allocates the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats. Spark parallel computing is used in a multi-threaded, highly concurrent mode to maximize parsing speed. Distributed full-text indexing improves data safety and overall query speed. Finally, intuitive, accurate and efficient processing results are presented to the user through big data visual analysis.

Description

Rich media file analysis method
Technical Field
The invention relates to the technical field of big data processing, in particular to a rich media file analysis method.
Background
With the continued growth of internet data, the variety and volume of data are increasing explosively, at unprecedented speed.
Ordinary enterprises and related organizations hold data in many common file formats: mail data, document data (including Office documents, PDF documents, etc.), web page data, ticket data, fund data, mobile phone backup and investigation data, computer backup and investigation data, structured database data (MySQL, Oracle, SqlServer, Access, MongoDB, Redis) and so on. How to store, use and analyze this data comprehensively, and how to support the queries and data mining of common business operations, are difficult problems that place high demands on technical capability.
Disclosure of Invention
The invention provides a rich media file parsing method that addresses how to comprehensively store, use and analyze data in various file formats, and how to perform the queries and data mining of common business operations.
To achieve the above purpose, the invention is realized by the following technical scheme:
A rich media file parsing method, comprising:
screening and classifying massive rich media files by file format;
allocating, through a resource factory, the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats to the screened and classified rich media files;
using the Spark parallel computing framework to perform highly concurrent parsing through the allocated data parsing interfaces on each node;
performing multi-node cluster indexing on the parsed results;
and performing big data visual analysis based on the index query interface.
According to another embodiment of the present invention, the rich media files include ZIP archives, RAR archives, HAR archives, PST/OST mail archive files, and integrated document folders.
According to another embodiment of the present invention, the step of screening and classifying the massive rich media files by file format includes:
decompressing the massive rich media files, using a traversal algorithm to perform multi-layer decompression and extraction;
sorting the decompressed files through a built-in screening and distribution engine, distinguishing and classifying them by file format according to their file-name suffixes, and temporarily storing them in classified folders named after the different data formats.
According to another embodiment of the present invention, the classified files include Word documents, Excel documents, PPT documents/PDF documents, picture files, eml files, mobile phone backup/investigation data, and hard disk backup/investigation data.
According to another embodiment of the present invention, the step of allocating, through the resource factory, the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats includes:
allocating parsing interfaces according to files of different data formats: when the input is a Word document, Excel document, PPT document or PDF document, the resource factory allocates a document parsing interface; when the input is an Eml file, audio file or video file, the resource factory automatically allocates a media file parsing interface; when the input is mobile phone forensic investigation data or hard disk forensic investigation data, a forensic investigation parsing interface is allocated;
and allocating different hardware resources according to the data size handled by each parsing interface, to obtain the hardware resources of each data node.
According to another embodiment of the present invention, the step of using the Spark parallel computing framework to perform highly concurrent parsing through the allocated data parsing interfaces on each node includes:
pooling the hardware resources of each node into the Spark framework;
and dividing the overall task into a number of small tasks through the Spark computing framework, allocating concurrent threads and computing according to the resources that a single task needs, and then aggregating and persisting the results of the individual tasks.
According to another embodiment of the present invention, the multi-node cluster indexing of the parsed results is performed through a distributed full-text indexing technique.
According to another embodiment of the present invention, the big data visual analysis employs a relational object query technique.
The invention provides a rich media file parsing method with the following beneficial effects. Massive rich media file data is first screened and classified, reducing data of complex structure to relatively regular classified data so that each individual data format can be processed accurately. A resource factory automatically allocates the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats. Spark parallel computing is used in a multi-threaded, highly concurrent mode to maximize parsing speed. Distributed full-text indexing improves data safety and overall query speed. Intuitive, accurate and efficient processing results are presented to the user through big data visual analysis.
Drawings
To illustrate the technical solutions of the invention or of the prior art more clearly, the drawings used in the description are briefly introduced below.
FIG. 1 is a flow chart of one embodiment of the rich media file parsing method of the present invention;
FIG. 2 is a flow chart of one embodiment of step 100 of the rich media file parsing method of the present invention;
FIG. 3 is a functional block diagram of step 100 of the rich media file parsing method of the present invention;
FIG. 4 is a flow chart of one embodiment of step 200 of the rich media file parsing method of the present invention;
FIG. 5 is a functional block diagram of step 200 of the rich media file parsing method of the present invention;
FIG. 6 is a flow chart of one embodiment of step 300 of the rich media file parsing method of the present invention;
FIG. 7 is a functional block diagram of step 300 of the rich media file parsing method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings.
As shown in FIG. 1, a rich media file parsing method includes:
Step 100: screening and classifying massive rich media files by file format;
Step 200: allocating, through a resource factory, the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats to the screened and classified rich media files;
Step 300: using the Spark parallel computing framework to perform highly concurrent parsing through the allocated data parsing interfaces on each node;
Step 400: performing multi-node cluster indexing on the parsed results;
Step 500: performing big data visual analysis based on the index query interface.
The rich media file parsing method of the embodiment of the invention first screens and classifies massive rich media file data, reducing data of complex structure to relatively regular classified data so that each individual data format can be processed accurately. A resource factory automatically allocates the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats. Spark parallel computing is used in a multi-threaded, highly concurrent mode to maximize parsing speed. Distributed full-text indexing improves data safety and overall query speed. Intuitive, accurate and efficient processing results are presented to the user through big data visual analysis.
Optionally, the rich media files in the embodiment of the present invention include ZIP archives, RAR archives, HAR archives, PST/OST mail archive files, and integrated document folders.
In some embodiments, referring to FIGS. 2-3, step 100 of the rich media file parsing method of the present invention includes:
Step 101: decompress the massive rich media files, using a traversal algorithm to perform multi-layer decompression and extraction, and save the resulting actual files in a temporary directory of the distributed file storage.
Step 102: using the built-in screening and distribution engine, sort the decompressed files, distinguish and classify them by file format according to their file-name suffixes, and temporarily store them in classified folders named after the different data formats, to facilitate the subsequent parsing operations.
In this step, the classified folders include Word documents, Excel documents, PPT documents/PDF documents, picture files, eml files, mobile phone backup/investigation data, and hard disk backup/investigation data.
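Steps 101 and 102 can be illustrated with a minimal Python sketch. It is an assumption-laden illustration rather than the patented implementation: it handles only nested ZIP archives (using the standard `zipfile` module), works in memory instead of on distributed file storage, and the suffix table `CATEGORY_BY_SUFFIX` and the function names are hypothetical, since the patent names the categories but not the exact suffixes.

```python
import io
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical suffix-to-category table; the patent lists the categories
# but does not enumerate the suffixes that map to them.
CATEGORY_BY_SUFFIX = {
    ".doc": "word", ".docx": "word",
    ".xls": "excel", ".xlsx": "excel",
    ".ppt": "ppt", ".pptx": "ppt",
    ".pdf": "pdf",
    ".jpg": "picture", ".png": "picture",
    ".eml": "eml",
}


def extract_nested(data, prefix=""):
    """Step 101 (sketch): yield (path, content) for every non-archive file,
    descending recursively into ZIP archives nested inside the archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            content = archive.read(info)
            path = prefix + info.filename
            is_inner_zip = path.lower().endswith(".zip") and zipfile.is_zipfile(
                io.BytesIO(content)
            )
            if is_inner_zip:
                # Multi-layer decompression: traverse the inner archive too.
                yield from extract_nested(content, prefix=path + "/")
            else:
                yield path, content


def classify(paths):
    """Step 102 (sketch): group paths into category folders by suffix."""
    folders = defaultdict(list)
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        folders[CATEGORY_BY_SUFFIX.get(suffix, "other")].append(path)
    return dict(folders)


def make_zip(entries):
    """Test helper: build an in-memory ZIP from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in entries.items():
            archive.writestr(name, content)
    return buffer.getvalue()


# Usage: an outer archive containing a document, a nested archive and a text file.
inner = make_zip({"mail/a.eml": b"x"})
outer = make_zip({"report.docx": b"w", "inner.zip": inner, "notes.txt": b"t"})
files = dict(extract_nested(outer))
folders = classify(files)
```

In this sketch unknown suffixes fall into an `other` folder; the patent does not say how unrecognized formats are handled, so that choice is also an assumption.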
In some embodiments, referring to FIGS. 4-5, step 200 of the rich media file parsing method of the present invention includes:
Step 201: allocate parsing interfaces according to files of different data formats.
In this step, when the input is a Word document, Excel document, PPT document or PDF document, the resource factory allocates a document parsing interface; when the input is an Eml file, audio file or video file, the resource factory automatically allocates a media file parsing interface; when the input is mobile phone forensic investigation data or hard disk forensic investigation data, a forensic investigation parsing interface is allocated.
Step 202: allocate different hardware resources according to the data size handled by each parsing interface, yielding the hardware resources of each data node. For example, when the platform supports up to 10 TB and the total size of the input files is 2 TB, each node is automatically allocated 8 cores and 32 GB of memory for the subsequent parsing; when the total size of the input files is 10 TB, 32 cores and 128 GB of memory are automatically allocated according to the parsing requirements.
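The two allocations in steps 201 and 202 can be sketched as a small lookup-based factory. The interface mapping follows the text above; the category keys, the function names and the tier boundary (everything above 2 TB receives the larger tier) are assumptions, because the patent gives only two data points for sizing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeResources:
    cores: int
    memory_gb: int


# Step 201: which parsing interface each file category receives.
# The category keys are hypothetical names for the formats in the text.
PARSER_BY_CATEGORY = {
    "word": "document", "excel": "document", "ppt": "document", "pdf": "document",
    "eml": "media", "audio": "media", "video": "media",
    "phone_forensics": "forensics", "disk_forensics": "forensics",
}


def allocate_parser(category):
    """Return the name of the parsing interface for a file category."""
    try:
        return PARSER_BY_CATEGORY[category]
    except KeyError:
        raise ValueError(f"no parsing interface registered for {category!r}") from None


def allocate_node_resources(total_input_tb, platform_capacity_tb=10.0):
    """Step 202 (sketch): choose per-node resources from the total input size.

    The patent states 2 TB -> 8 cores / 32 GB and 10 TB -> 32 cores / 128 GB
    on a 10 TB platform. Treating 2 TB as the tier boundary is an assumption.
    """
    if not 0 < total_input_tb <= platform_capacity_tb:
        raise ValueError("input size must be positive and within platform capacity")
    if total_input_tb <= 2.0:
        return NodeResources(cores=8, memory_gb=32)
    return NodeResources(cores=32, memory_gb=128)


# Usage
pdf_parser = allocate_parser("pdf")
small_job = allocate_node_resources(2.0)
large_job = allocate_node_resources(10.0)
```

A table-driven factory keeps the format-to-interface mapping in one place, so supporting a new format means adding one entry rather than another branch.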
In some embodiments, referring to FIGS. 6-7, step 300 of the rich media file parsing method of the present invention comprises:
Step 301: pool the hardware resources of each node.
In this embodiment, when each node is configured with a 32-core CPU and 128 GB of memory, the pooled hardware resources amount to 128 CPU cores and 512 GB of memory (which corresponds to four such nodes). Because the parsing process computes directly in memory, parsing efficiency is greatly improved, and intermediate data does not have to be written to disk, which avoids the associated disk I/O.
Step 302: divide the overall task into a number of small tasks through the Spark computing framework, allocate concurrent threads and compute according to the resources that a single task needs, and then aggregate and persist the results of the individual tasks.
In this embodiment, when a single task needs 4 GB of memory and 1 CPU core to run, about 100 threads can generally be allocated for concurrent computation, which greatly improves running speed and execution results.
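The arithmetic behind steps 301 and 302 can be checked with a short sketch. It uses Python's `concurrent.futures.ThreadPoolExecutor` as a stand-in for Spark, so it illustrates the split, compute and aggregate pattern rather than Spark's own scheduler; the four-node cluster is inferred from the 128-core / 512 GB totals, and `parse_chunk` is a hypothetical placeholder task.

```python
from concurrent.futures import ThreadPoolExecutor


def pool_resources(nodes):
    """Step 301 (sketch): sum (cores, memory_gb) pairs across all nodes."""
    total_cores = sum(cores for cores, _ in nodes)
    total_memory_gb = sum(memory for _, memory in nodes)
    return total_cores, total_memory_gb


def max_concurrent_tasks(total_cores, total_memory_gb, task_cores=1, task_memory_gb=4):
    """Upper bound on tasks that fit at once: the scarcer resource decides."""
    return min(total_cores // task_cores, total_memory_gb // task_memory_gb)


def parse_chunk(chunk):
    """Hypothetical small task: here it only counts the bytes in a chunk."""
    return sum(len(item) for item in chunk)


def run_parallel(items, chunk_size, workers):
    """Step 302 (sketch): split into small tasks, run concurrently, aggregate."""
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor(max_workers=max(1, workers)) as pool:
        partial_results = list(pool.map(parse_chunk, chunks))
    return sum(partial_results)


# Usage: four nodes of 32 cores / 128 GB (the node count is an inference).
nodes = [(32, 128)] * 4
total_cores, total_memory_gb = pool_resources(nodes)
slots = max_concurrent_tasks(total_cores, total_memory_gb)
total_bytes = run_parallel([b"ab", b"cde", b"f"], chunk_size=2, workers=min(slots, 8))
```

Under these assumptions the bound works out to 128 concurrent tasks, which is consistent with the "about 100 threads" in the text if some capacity is held back for the framework itself; the patent does not state how the figure of 100 was derived.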
Preferably, in step 400 of the rich media file parsing method of the present invention, multi-node cluster indexing is performed on the parsed results through a distributed full-text indexing technique.
Unlike ordinary database query technology, a distributed full-text index is a search engine purpose-built for querying massive data, and its searches respond within milliseconds. Through data sharding, the content of the original files is split into index files in the Lucene format, which allows fast searching and convenient compression of the data size.
The distributed full-text indexing technique also stores several backup copies of each shard and distributes them across different data blocks on different racks and different nodes. This effectively prevents data loss caused by damaged data disks or unexpected machine-room failures. The health status of the index cluster gives the user an intuitive indication of its state.
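The sharding and replica placement described above can be sketched in plain Python. This is an illustrative placement rule only, not the algorithm of Lucene or of any particular search engine: documents are routed to shards by hash, and each shard's copies are placed on distinct nodes, preferring distinct racks. All names and the node list are hypothetical.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Route a document to a shard with a stable hash of its identifier."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def place_replicas(num_shards, nodes, copies):
    """Assign each shard's copies to distinct nodes, spreading across racks.

    nodes is a list of (node_name, rack_name) pairs.
    """
    if copies > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    placement = {}
    for shard in range(num_shards):
        start = shard % len(nodes)
        rotated = nodes[start:] + nodes[:start]  # vary the starting node per shard
        chosen, used_racks = [], set()
        # First pass: at most one copy per rack.
        for name, rack in rotated:
            if len(chosen) < copies and rack not in used_racks:
                chosen.append(name)
                used_racks.add(rack)
        # Second pass: if racks ran out, fill with any node not yet used.
        for name, _ in rotated:
            if len(chosen) < copies and name not in chosen:
                chosen.append(name)
        placement[shard] = chosen
    return placement


# Usage: four nodes on two racks, three shards, two copies of each shard.
NODES = [("n1", "r1"), ("n2", "r1"), ("n3", "r2"), ("n4", "r2")]
placement = place_replicas(num_shards=3, nodes=NODES, copies=2)
```

With this layout, losing a single disk, node or rack still leaves one copy of every shard available, which is the property the paragraph above relies on.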
Preferably, in step 500 of the rich media file parsing method of the present invention, the big data visual analysis uses a relational object query technique.
On the one hand, the back end is built on the query interface of the distributed full-text index, which keeps data query response times short and allows large amounts of JSON data to be requested in batches within a short time.
On the other hand, the big data visual analysis uses a relational object query technique, such as the Neo4j database query interface, so that queries over one-to-one and many-to-many relationships can present results more accurately, efficiently and quickly.
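To make the many-to-many relationship query concrete, the following sketch uses a small in-memory graph as a stand-in for a graph database such as Neo4j. It does not use the Neo4j API, and the entities (senders, mails, recipients) and relation names are hypothetical examples chosen because mail data is one of the formats the method parses.

```python
from collections import defaultdict


class RelationGraph:
    """Minimal directed, labeled graph with forward and reverse lookups."""

    def __init__(self):
        self._forward = defaultdict(set)
        self._reverse = defaultdict(set)

    def relate(self, source, relation, target):
        """Record that source is linked to target by the named relation."""
        self._forward[(source, relation)].add(target)
        self._reverse[(target, relation)].add(source)

    def targets(self, source, relation):
        return sorted(self._forward.get((source, relation), ()))

    def sources(self, target, relation):
        return sorted(self._reverse.get((target, relation), ()))


def recipients_of(graph, sender):
    """Two-hop query: everyone who received any mail sent by `sender`."""
    people = set()
    for mail in graph.targets(sender, "SENT"):
        people.update(graph.targets(mail, "TO"))
    return sorted(people)


# Usage: one sender, two mails, overlapping recipients (many-to-many).
graph = RelationGraph()
graph.relate("alice", "SENT", "mail-1")
graph.relate("alice", "SENT", "mail-2")
graph.relate("mail-1", "TO", "bob")
graph.relate("mail-1", "TO", "carol")
graph.relate("mail-2", "TO", "bob")
alice_recipients = recipients_of(graph, "alice")
mails_to_bob = graph.sources("bob", "TO")
```

A graph database answers such multi-hop questions by following stored links directly, which is why the text prefers it to repeated table joins for relationship-heavy visualizations.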
The above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in those embodiments may still be modified, or some of their technical features replaced by equivalents, and that such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A rich media file parsing method, comprising:
screening and classifying massive rich media files by file format;
allocating, through a resource factory, the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats to the screened and classified rich media files, which specifically comprises: allocating parsing interfaces according to files of different data formats, wherein when the input is a Word document, Excel document, PPT document or PDF document, the resource factory allocates a document parsing interface; when the input is an Eml file, audio file or video file, the resource factory automatically allocates a media file parsing interface; and when the input is mobile phone forensic investigation data or hard disk forensic investigation data, a forensic investigation parsing interface is allocated; and allocating different hardware resources according to the data size handled by each parsing interface, to obtain the hardware resources of each data node;
using the Spark parallel computing framework to perform highly concurrent parsing through the allocated data parsing interfaces on each node;
performing multi-node cluster indexing on the parsed results;
and performing big data visual analysis based on the index query interface.
2. The rich media file parsing method according to claim 1, wherein: the rich media files comprise ZIP archives, RAR archives, HAR archives, PST/OST mail archive files, and integrated document folders.
3. The rich media file parsing method according to claim 1, wherein the step of screening and classifying the massive rich media files by file format comprises:
decompressing the massive rich media files, using a traversal algorithm to perform multi-layer decompression and extraction;
sorting the decompressed files through a built-in screening and distribution engine, distinguishing and classifying them by file format according to their file-name suffixes, and temporarily storing them in classified folders named after the different data formats.
4. The rich media file parsing method according to claim 3, wherein: the classified files comprise Word documents, Excel documents, PPT documents/PDF documents, picture files, eml files, mobile phone backup/investigation data, and hard disk backup/investigation data.
5. The rich media file parsing method according to claim 1, wherein the step of using the Spark parallel computing framework to perform highly concurrent parsing through the allocated data parsing interfaces on each node comprises:
pooling the hardware resources of each node into the Spark framework;
and dividing the overall task into a number of small tasks through the Spark computing framework, allocating concurrent threads and computing according to the resources that a single task needs, and then aggregating and persisting the results of the individual tasks.
6. The rich media file parsing method according to claim 1, wherein the multi-node cluster indexing of the parsed results is performed through a distributed full-text indexing technique.
7. The rich media file parsing method according to claim 1, wherein the big data visual analysis employs a relational object query technique.
CN201911309803.4A 2019-12-18 2019-12-18 Rich media file analysis method Active CN111104527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911309803.4A CN111104527B (en) 2019-12-18 2019-12-18 Rich media file analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911309803.4A CN111104527B (en) 2019-12-18 2019-12-18 Rich media file analysis method

Publications (2)

Publication Number Publication Date
CN111104527A CN111104527A (en) 2020-05-05
CN111104527B CN111104527B (en) 2023-06-23

Family

ID=70423627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911309803.4A Active CN111104527B (en) 2019-12-18 2019-12-18 Rich media file analysis method

Country Status (1)

Country Link
CN (1) CN111104527B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114090505B (en) * 2021-11-23 2024-09-10 成都锋卫科技有限公司 Intelligent resource scheduling and efficient concurrency data classification method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045856A (en) * 2015-07-09 2015-11-11 中国资源卫星应用中心 Hadoop-based data processing system for big-data remote sensing satellite
CN107528763A (en) * 2016-06-22 2017-12-29 北京易讯通信息技术股份有限公司 A kind of Mail Contents analysis method based on Spark and YARN
CN109151078A (en) * 2018-10-31 2019-01-04 厦门市美亚柏科信息股份有限公司 A kind of distributed intelligence e-mail analysis filter method, system and storage medium
CN110059138A (en) * 2019-03-12 2019-07-26 国网辽宁省电力有限公司信息通信分公司 One kind being based on big data platform data analysis domain architecting method
CN110209662A (en) * 2018-02-13 2019-09-06 北京京东尚科信息技术有限公司 A kind of method and apparatus of automation load data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
詹利群 ; 任晓炜 ; 黄志 ; 李涛 ; .广西气象业务内网功能设计与实现.气象研究与应用.2019,(第01期),全文. *
陈小云 ; .浅议电子数据在检察实践中的应用.电脑迷.2016,(第10期),全文. *

Also Published As

Publication number Publication date
CN111104527A (en) 2020-05-05

Similar Documents

Publication Publication Date Title
Zaharia et al. Fast and interactive analytics over Hadoop data with Spark
US20170083573A1 (en) Multi-query optimization
US20160275178A1 (en) Method and apparatus for search
CN111913955A (en) Data sorting processing device, method and storage medium
US20090043792A1 (en) Partial Compression of a Database Table Based on Historical Information
US20130227194A1 (en) Active non-volatile memory post-processing
WO2019148713A1 (en) Sql statement processing method and apparatus, computer device, and storage medium
WO2017028394A1 (en) Example-based distributed data recovery method and apparatus
CN106126601A (en) A kind of social security distributed preprocess method of big data and system
US20090043793A1 (en) Parallel Uncompression of a Partially Compressed Database Table
US10185743B2 (en) Method and system for optimizing reduce-side join operation in a map-reduce framework
WO2022083197A1 (en) Data processing method and apparatus, electronic device, and storage medium
US10776401B2 (en) Efficient database query aggregation of variable length data
US9830369B1 (en) Processor for database analytics processing
US20230359647A1 (en) Read-Write Separation and Automatic Scaling-Based Cloud Arrangement System and Method
CN113485999A (en) Data cleaning method and device and server
CN111428140B (en) High concurrency data retrieval method, device, equipment and storage medium
US7890705B2 (en) Shared-memory multiprocessor system and information processing method
WO2022253131A1 (en) Data parsing method and apparatus, computer device, and storage medium
CN111104527B (en) Rich media file analysis method
CN111767287A (en) Data import method, device, equipment and computer storage medium
CN108052535B (en) Visual feature parallel rapid matching method and system based on multiprocessor platform
US20150149498A1 (en) Method and System for Performing an Operation Using Map Reduce
CN113918532A (en) Portrait label aggregation method, electronic device and storage medium
Kang et al. Reducing i/o cost in olap query processing with mapreduce

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20200505

Assignee: Shanghai Suiyu Enterprise Management Consulting Partnership (L.P.)

Assignor: Write easy network technology (Shanghai) Co.,Ltd.

Contract record no.: X2023980042559

Denomination of invention: A Method for Parsing Rich Media Files

Granted publication date: 20230623

License type: Common License

Record date: 20230923

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A method for parsing rich media files

Granted publication date: 20230623

Pledgee: Shanghai Rural Commercial Bank Co.,Ltd. Jiading sub branch

Pledgor: Write easy network technology (Shanghai) Co.,Ltd.

Registration number: Y2024310000896