CN111104527A - Rich media file parsing method - Google Patents

Rich media file parsing method

Info

Publication number
CN111104527A
Authority
CN
China
Prior art keywords
data
rich media
analysis
files
media file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911309803.4A
Other languages
Chinese (zh)
Other versions
CN111104527B (en)
Inventor
程俊
李文飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Write Easy Network Technology Shanghai Co Ltd
Original Assignee
Write Easy Network Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Write Easy Network Technology Shanghai Co Ltd
Priority to CN201911309803.4A
Publication of CN111104527A
Application granted
Publication of CN111104527B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/41 Indexing; Data structures therefor; Storage structures
    • G06F16/44 Browsing; Visualisation therefor
    • G06F16/45 Clustering; Classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A rich media file parsing method comprises five main stages: data screening and classification, resource factory allocation, Spark multi-concurrency parsing, multi-node cluster indexing, and big data visualization analysis. The invention first screens and classifies massive rich media file data, sorting data of complex structure into relatively regular classified data so that each individual data format can be processed accurately. A resource factory then automatically allocates the hardware resources needed for processing and the data parsing interface required by each file format. Next, Spark parallel computation in a multi-threaded, multi-concurrency mode maximizes parsing speed, and a distributed full-text indexing technique improves data security and overall query speed. Finally, big data visualization analysis presents intuitive, accurate and efficient processing results to the user.

Description

Rich media file parsing method
Technical Field
The invention relates to the technical field of big data processing, in particular to a rich media file parsing method.
Background
As internet data keeps growing, both the variety and the volume of data are increasing explosively at an unprecedented rate.
Ordinary enterprises and related organizations deal with many common file formats, such as mail data, document data (including Office documents, PDF documents and the like), web data, call record data, fund data, mobile phone backup and forensic examination data, computer backup and forensic examination data, and structured database data (MySQL, Oracle, SQL Server, Access, MongoDB, Redis). How to comprehensively store, use and analyze such data, and how to support querying and data mining for common business needs, is therefore a technically demanding problem.
Disclosure of Invention
The invention provides a rich media file parsing method that addresses the problems of comprehensively storing, using and analyzing data in various file formats, and of querying and data mining for common business needs.
To achieve this purpose, the invention is realized by the following technical solution:
A rich media file parsing method, characterized by comprising the following steps:
screening and classifying the file formats of massive rich media files;
allocating, through a resource factory, the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats to the screened and classified rich media files;
performing high-concurrency parsing on the allocated node data parsing interfaces using the Spark parallel computing framework;
performing multi-node cluster indexing on the parsed results;
and performing big data visualization analysis based on the index query interface.
According to another embodiment of the present invention, the rich media files include ZIP compressed packages, RAR compressed packages, HAR compressed packages, PST/OST compressed mail files, and integrated document folders.
According to another embodiment of the present invention, the step of screening and classifying the file formats of the massive rich media files comprises:
decompressing the massive rich media files, using a traversal algorithm to decompress and extract nested archives layer by layer;
sorting the decompressed files with a built-in screening and distribution engine, distinguishing and classifying the file formats by file name suffix, and temporarily storing the files in classified folders named after the different data formats.
According to another embodiment of the present invention, the classified files include Word documents, Excel documents, PPT documents/PDF documents, picture files, Eml files, mobile phone backup/forensic examination data, and hard disk backup/forensic examination data.
According to another embodiment of the present invention, the step of allocating, through the resource factory, the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats comprises:
allocating parsing interfaces according to the data format of the files: when the input is Word, Excel, PPT or PDF documents, the resource factory allocates the document parsing interface; when the input is Eml files, audio files or video files, the resource factory automatically allocates the media file parsing interface; when the input is mobile phone forensic examination data or hard disk forensic examination data, the forensic examination parsing interface is allocated;
and allocating different hardware resources according to the data size handled by each parsing interface, to obtain the hardware resources of each data node.
According to another embodiment of the present invention, the step of performing high-concurrency parsing on the allocated node data parsing interfaces using the Spark parallel computing framework comprises:
aggregating the hardware resources of each node into the Spark framework;
dividing the overall task into a number of small tasks through the Spark computing framework, allocating concurrent threads and computing according to the resources each single task requires, and aggregating and persisting the results of the single tasks.
According to another embodiment of the present invention, multi-node cluster indexing is performed on the parsed results by a distributed full-text indexing technique.
According to another embodiment of the present invention, the big data visualization analysis uses a relational object query technique.
The invention provides a rich media file parsing method with the following beneficial effects. Massive rich media file data is first screened and classified, sorting data of complex structure into relatively regular classified data so that each individual data format can be processed accurately. A resource factory automatically allocates the hardware resources needed for processing and the data parsing interface required by each file format. Spark parallel computation in a multi-threaded, multi-concurrency mode then maximizes parsing speed, and the distributed full-text indexing technique improves data security and overall query speed. Finally, big data visualization analysis presents intuitive, accurate and efficient processing results to the user.
Drawings
In order to illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings needed for the description are briefly introduced below.
FIG. 1 is a flowchart of an embodiment of a rich media file parsing method according to the present invention;
FIG. 2 is a flowchart of an embodiment of step 100 of the rich media file parsing method according to the present invention;
FIG. 3 is a functional block diagram of step 100 of the rich media file parsing method according to the present invention;
FIG. 4 is a flowchart of an embodiment of step 200 of the rich media file parsing method according to the present invention;
FIG. 5 is a functional block diagram of step 200 of the rich media file parsing method according to the present invention;
FIG. 6 is a flowchart of an embodiment of step 300 of the rich media file parsing method according to the present invention;
FIG. 7 is a functional block diagram of step 300 of the rich media file parsing method according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings.
As shown in FIG. 1, a rich media file parsing method includes:
Step 100: screening and classifying the file formats of massive rich media files;
Step 200: allocating, through a resource factory, the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats to the screened and classified rich media files;
Step 300: performing high-concurrency parsing on the allocated node data parsing interfaces using the Spark parallel computing framework;
Step 400: performing multi-node cluster indexing on the parsed results;
Step 500: performing big data visualization analysis based on the index query interface.
The rich media file parsing method first screens and classifies massive rich media file data, sorting data of complex structure into relatively regular classified data so that each individual data format can be processed accurately. A resource factory automatically allocates the hardware resources needed for processing and the data parsing interface required by each file format. Spark parallel computation in a multi-threaded, multi-concurrency mode then maximizes parsing speed, and the distributed full-text indexing technique improves data security and overall query speed. Finally, big data visualization analysis presents intuitive, accurate and efficient processing results to the user.
Optionally, the rich media files in the embodiment of the present invention include ZIP compressed packages, RAR compressed packages, HAR compressed packages, PST/OST compressed mail files, and integrated document folders.
In some embodiments, referring to fig. 2-3, the step 100 of the rich media file parsing method of the present invention comprises:
Step 101: decompressing the massive rich media files, using a traversal algorithm to decompress and extract nested archives layer by layer, and saving the extracted files in a temporary directory of the distributed file storage.
Step 102: sorting the decompressed files with a built-in screening and distribution engine, distinguishing and classifying the file formats by file name suffix, and temporarily storing the files in classified folders named after the different data formats, which simplifies the subsequent parsing operations.
In this step, the classified folders include Word documents, Excel documents, PPT documents/PDF documents, picture files, Eml files, mobile phone backup/forensic examination data, and hard disk backup/forensic examination data.
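Steps 101 and 102 can be sketched roughly as follows. This is only an illustration under stated assumptions: it handles ZIP archives with the Python standard library (RAR, PST/OST and the other container formats would need other tools), the suffix table `CATEGORY_BY_SUFFIX` and the names `extract_nested`, `classify` and `make_zip` are hypothetical, and the result is an in-memory mapping rather than folders on a distributed file store.

```python
import io
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical suffix table: the text names the categories but not the suffixes.
CATEGORY_BY_SUFFIX = {
    ".doc": "word", ".docx": "word",
    ".xls": "excel", ".xlsx": "excel",
    ".ppt": "ppt_pdf", ".pptx": "ppt_pdf", ".pdf": "ppt_pdf",
    ".jpg": "picture", ".png": "picture",
    ".eml": "eml",
}


def extract_nested(data, prefix=""):
    """Yield (path, payload) for every leaf file, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            payload = archive.read(info)
            path = prefix + info.filename
            # Decide by suffix: .docx/.xlsx are ZIP containers too, but they
            # must be treated as documents, not unpacked further.
            if PurePosixPath(info.filename).suffix.lower() == ".zip":
                yield from extract_nested(payload, prefix=path + "/")
            else:
                yield path, payload


def classify(paths):
    """Group paths into categories by file name suffix."""
    buckets = defaultdict(list)
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        buckets[CATEGORY_BY_SUFFIX.get(suffix, "unclassified")].append(path)
    return dict(buckets)


def make_zip(members):
    """Build a ZIP archive in memory from a {name: bytes} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()


inner = make_zip({"mail/a.eml": b"x"})
outer = make_zip({"report.DOCX": b"y", "nested/inner.zip": inner, "notes.txt": b"z"})
paths = [path for path, _ in extract_nested(outer)]
groups = classify(paths)
```

Recursing on the archive payload keeps the traversal independent of nesting depth, and classifying by suffix alone mirrors the text; a production system would probably also inspect file signatures.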
In some embodiments, referring to FIGS. 4-5, the step 200 of the rich media file parsing method of the present invention comprises:
Step 201: allocating parsing interfaces according to the data format of the files.
In this step, when the input is Word, Excel, PPT or PDF documents, the resource factory allocates the document parsing interface; when the input is Eml files, audio files or video files, the resource factory automatically allocates the media file parsing interface; when the input is mobile phone forensic examination data or hard disk forensic examination data, the forensic examination parsing interface is allocated.
Step 202: allocating different hardware resources according to the data size handled by each parsing interface, to obtain the hardware resources of each data node. For example, if the platform supports up to 10 TB per batch, then when the total size of the input files is 2 TB, 8 cores and 32 GB of memory are automatically allocated to each node for the subsequent parsing; when the total size of the input files is 10 TB, 32 cores and 128 GB of memory are automatically allocated according to the parsing requirements.
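A minimal sketch of how such a resource factory might look is given below. The category keys, the interface labels and the function names are hypothetical. Only the 2 TB and 10 TB data points come from the example above; the rule used for sizes in between (anything above 2 TB gets the larger allocation) is an assumption made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeResources:
    cores: int
    memory_gb: int


# Hypothetical category keys mapped to the three parsing interfaces in the text.
INTERFACE_BY_CATEGORY = {
    "word": "document", "excel": "document", "ppt": "document", "pdf": "document",
    "eml": "media", "audio": "media", "video": "media",
    "phone_forensics": "forensics", "disk_forensics": "forensics",
}


def allocate_interface(category):
    """Return the parsing interface label for a classified file category."""
    try:
        return INTERFACE_BY_CATEGORY[category]
    except KeyError:
        raise ValueError(f"no parsing interface for category {category!r}") from None


def allocate_node_resources(total_tb, max_batch_tb=10.0):
    """Pick per-node resources from the total input size of one batch."""
    if not 0 < total_tb <= max_batch_tb:
        raise ValueError(f"batch size must be in (0, {max_batch_tb}] TB")
    # Assumption: a simple threshold between the two documented data points.
    if total_tb <= 2:
        return NodeResources(cores=8, memory_gb=32)
    return NodeResources(cores=32, memory_gb=128)


small = allocate_node_resources(2)
large = allocate_node_resources(10)
```

Keeping the format-to-interface mapping in a table means a new file format only needs a new entry, which is the usual motivation for a factory of this kind.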
In some embodiments, referring to FIGS. 6-7, the step 300 of the rich media file parsing method of the present invention comprises:
Step 301: aggregating the hardware resources of each node.
In this embodiment, when each node is configured with a 32-core CPU and 128 GB of memory, aggregation yields a total of 128 CPU cores and 512 GB of memory (which corresponds to four such nodes). Because computation is done directly in memory during parsing, parsing efficiency is greatly improved and the overhead of spilling data to disk and of disk I/O is avoided.
Step 302: dividing the overall task into a number of small tasks through the Spark computing framework, allocating concurrent threads and computing according to the resources each single task requires, and aggregating and persisting the results of the single tasks.
In this embodiment, when a single task requires 4 GB of memory and one CPU core, about 100 threads can typically be allocated for concurrent computation, which greatly improves running speed and execution efficiency.
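The arithmetic behind these figures, and the split/compute/aggregate pattern of step 302, can be sketched as follows. This is not Spark code: a standard-library thread pool stands in for the Spark scheduler, `parse_chunk` is a placeholder parser, all function names are hypothetical, and persistence of the result is omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def max_concurrent_tasks(total_cores, total_memory_gb, cores_per_task, memory_gb_per_task):
    """Upper bound on tasks that fit in the pooled resources at the same time."""
    return min(total_cores // cores_per_task, total_memory_gb // memory_gb_per_task)


def parse_chunk(chunk):
    """Placeholder for a real parsing interface: counts words in a chunk of lines."""
    return sum(len(line.split()) for line in chunk)


def run_job(lines, chunk_size, workers):
    """Split the job into small tasks, run them concurrently, aggregate the results."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(parse_chunk, chunks))
    return sum(partials)


# 128 pooled cores and 512 GB, with 1 core and 4 GB per task.
ceiling = max_concurrent_tasks(128, 512, 1, 4)
workers = min(100, ceiling)
total_words = run_job(["a b", "c", "d e f"], chunk_size=2, workers=workers)
```

Under these numbers both the core limit and the memory limit allow 128 simultaneous tasks, so the roughly 100 threads mentioned in the text sit below that ceiling and leave some headroom.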
Preferably, in step 400 of the rich media file parsing method of the present invention, multi-node cluster indexing is performed on the parsed results using a distributed full-text indexing technique.
Distributed full-text indexing differs from ordinary database query technology: it is a search-engine technique designed specifically for querying massive data, with search responses at the millisecond level. Through data sharding, the content of the original files is split into index files in the Lucene format, which both speeds up searching and reduces the data size.
In addition, the distributed full-text indexing technique stores several backup copies of each shard, scattered across different data blocks on different racks and different nodes. This effectively prevents data loss caused by data disk damage or unexpected machine-room failures. The health value of the index cluster gives the user an intuitive status indication.
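The two ideas in this step, sharded inverted indexes and replica placement on distinct nodes, can be illustrated with a toy sketch. It does not use Lucene or any real search engine; the class `ShardedIndex`, the functions `shard_for` and `place_replicas`, and the round-robin placement rule are all assumptions for illustration.

```python
import zlib
from collections import defaultdict


def shard_for(doc_id, num_shards):
    """Route a document to a shard; crc32 is stable across runs, unlike hash()."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def place_replicas(num_shards, nodes, copies):
    """Assign each shard to `copies` distinct nodes, round-robin."""
    if copies > len(nodes):
        raise ValueError("cannot place more copies than there are nodes")
    return {
        shard: [nodes[(shard + k) % len(nodes)] for k in range(copies)]
        for shard in range(num_shards)
    }


class ShardedIndex:
    """Toy inverted index: one term -> document-id mapping per shard."""

    def __init__(self, num_shards):
        self.num_shards = num_shards
        self.shards = [defaultdict(set) for _ in range(num_shards)]

    def add(self, doc_id, text):
        postings = self.shards[shard_for(doc_id, self.num_shards)]
        for term in text.lower().split():
            postings[term].add(doc_id)

    def search(self, term):
        """Query every shard and merge the hits."""
        term = term.lower()
        hits = set()
        for postings in self.shards:
            hits |= postings.get(term, set())
        return sorted(hits)


index = ShardedIndex(num_shards=3)
index.add("doc-1", "Spark parses rich media files")
index.add("doc-2", "Index the parsed files")
index.add("doc-3", "Spark cluster resources")
placement = place_replicas(3, ["node-a", "node-b", "node-c"], copies=2)
```

Because each shard's copies land on different nodes, losing any single node in this sketch still leaves one copy of every shard available.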
Preferably, in step 500 of the rich media file parsing method of the present invention, the big data visualization analysis uses a relational object query technique.
On the one hand, the back end is built on the query interface of the distributed full-text indexing technique, which keeps data query response times short and allows large amounts of JSON data to be requested in batches within a short time.
On the other hand, the big data visualization analysis uses a relational object query technique, such as the Neo4J database query interface. For one-to-one and many-to-many relational queries, the visual analysis can then present results more accurately, efficiently and quickly.
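What a relational query over parsed data might look like can be sketched with a small in-memory graph. This is a stand-in for a graph database and does not use the Neo4J API; the class `RelationGraph`, the relation name `SENT_TO` and the sample people are hypothetical.

```python
from collections import defaultdict


class RelationGraph:
    """Toy directed graph with typed edges, indexed in both directions."""

    def __init__(self):
        self._outgoing = defaultdict(set)
        self._incoming = defaultdict(set)

    def relate(self, source, relation, target):
        self._outgoing[(source, relation)].add(target)
        self._incoming[(target, relation)].add(source)

    def targets(self, source, relation):
        """One-to-many lookup: everything `source` points to via `relation`."""
        return sorted(self._outgoing.get((source, relation), ()))

    def sources(self, target, relation):
        """Reverse lookup: everything that points to `target` via `relation`."""
        return sorted(self._incoming.get((target, relation), ()))


graph = RelationGraph()
# Edges as they might be derived from parsed Eml files (sender to recipient).
graph.relate("alice", "SENT_TO", "bob")
graph.relate("alice", "SENT_TO", "carol")
graph.relate("dave", "SENT_TO", "bob")
recipients_of_alice = graph.targets("alice", "SENT_TO")
senders_to_bob = graph.sources("bob", "SENT_TO")
```

Indexing edges in both directions makes either side of a many-to-many relation a direct lookup, which is the kind of query a visualization front end would issue repeatedly.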
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A rich media file parsing method, characterized by comprising the following steps:
screening and classifying the file formats of massive rich media files;
allocating, through a resource factory, the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats to the screened and classified rich media files;
performing high-concurrency parsing on the allocated node data parsing interfaces using the Spark parallel computing framework;
performing multi-node cluster indexing on the parsed results;
and performing big data visualization analysis based on the index query interface.
2. The rich media file parsing method of claim 1, wherein: the rich media files include ZIP compressed packages, RAR compressed packages, HAR compressed packages, PST/OST compressed mail files, and integrated document folders.
3. The rich media file parsing method of claim 1, wherein the step of screening and classifying the file formats of the massive rich media files comprises:
decompressing the massive rich media files, using a traversal algorithm to decompress and extract nested archives layer by layer;
sorting the decompressed files with a built-in screening and distribution engine, distinguishing and classifying the file formats by file name suffix, and temporarily storing the files in classified folders named after the different data formats.
4. The rich media file parsing method of claim 3, wherein: the classified files comprise Word documents, Excel documents, PPT documents/PDF documents, picture files, Eml files, mobile phone backup/forensic examination data and hard disk backup/forensic examination data.
5. The rich media file parsing method according to claim 3, wherein the step of allocating, through the resource factory, the hardware resources needed for processing and the data parsing interfaces required by the corresponding file formats comprises:
allocating parsing interfaces according to the data format of the files: when the input is Word, Excel, PPT or PDF documents, the resource factory allocates the document parsing interface; when the input is Eml files, audio files or video files, the resource factory automatically allocates the media file parsing interface; when the input is mobile phone forensic examination data or hard disk forensic examination data, the forensic examination parsing interface is allocated;
and allocating different hardware resources according to the data size handled by each parsing interface, to obtain the hardware resources of each data node.
6. The rich media file parsing method according to claim 5, wherein the step of performing high-concurrency parsing on the allocated node data parsing interfaces using the Spark parallel computing framework comprises:
aggregating the hardware resources of each node into the Spark framework;
dividing the overall task into a number of small tasks through the Spark computing framework, allocating concurrent threads and computing according to the resources each single task requires, and aggregating and persisting the results of the single tasks.
7. The rich media file parsing method as claimed in claim 1, wherein multi-node cluster indexing is performed on the parsed results by a distributed full-text indexing technique.
8. The rich media file parsing method as claimed in claim 1, wherein the big data visualization analysis uses a relational object query technique.
CN201911309803.4A 2019-12-18 2019-12-18 Rich media file analysis method Active CN111104527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911309803.4A CN111104527B (en) 2019-12-18 2019-12-18 Rich media file analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911309803.4A CN111104527B (en) 2019-12-18 2019-12-18 Rich media file analysis method

Publications (2)

Publication Number Publication Date
CN111104527A (en) 2020-05-05
CN111104527B (en) 2023-06-23

Family

ID=70423627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911309803.4A Active CN111104527B (en) 2019-12-18 2019-12-18 Rich media file analysis method

Country Status (1)

Country Link
CN (1) CN111104527B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045856A (en) * 2015-07-09 2015-11-11 中国资源卫星应用中心 Hadoop-based data processing system for big-data remote sensing satellite
CN107528763A (en) * 2016-06-22 2017-12-29 北京易讯通信息技术股份有限公司 A kind of Mail Contents analysis method based on Spark and YARN
CN110209662A (en) * 2018-02-13 2019-09-06 北京京东尚科信息技术有限公司 A kind of method and apparatus of automation load data
CN109151078A (en) * 2018-10-31 2019-01-04 厦门市美亚柏科信息股份有限公司 A kind of distributed intelligence e-mail analysis filter method, system and storage medium
CN110059138A (en) * 2019-03-12 2019-07-26 国网辽宁省电力有限公司信息通信分公司 One kind being based on big data platform data analysis domain architecting method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
詹利群;任晓炜;黄志;李涛;: "广西气象业务内网功能设计与实现" *
陈小云;: "浅议电子数据在检察实践中的应用" *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114090505A (en) * 2021-11-23 2022-02-25 成都深思科技有限公司 Intelligent resource scheduling and efficient concurrent data classification method

Also Published As

Publication number Publication date
CN111104527B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
US20220156289A1 (en) Generating a multi-column index for relational databases by interleaving data bits for selectivity
US10467293B2 (en) Scalable distributed computing system for determining exact median and other quantiles in big data applications
CN111913955A (en) Data sorting processing device, method and storage medium
US9489411B2 (en) High performance index creation
US20130227194A1 (en) Active non-volatile memory post-processing
US20090043792A1 (en) Partial Compression of a Database Table Based on Historical Information
CN109815283B (en) Heterogeneous data source visual query method
CN110690984A (en) Spark-based big data weblog acquisition, analysis and early warning method and system
WO2019148713A1 (en) Sql statement processing method and apparatus, computer device, and storage medium
WO2022083197A1 (en) Data processing method and apparatus, electronic device, and storage medium
US20150149437A1 (en) Method and System for Optimizing Reduce-Side Join Operation in a Map-Reduce Framework
US20230359647A1 (en) Read-Write Separation and Automatic Scaling-Based Cloud Arrangement System and Method
US20160210228A1 (en) Asynchronous garbage collection in a distributed database system
CN113485999A (en) Data cleaning method and device and server
CN108052535B (en) Visual feature parallel rapid matching method and system based on multiprocessor platform
CN111104527B (en) Rich media file analysis method
US10552419B2 (en) Method and system for performing an operation using map reduce
CN116982035A (en) Measurement and improvement of index quality in distributed data systems
CN113918532A (en) Portrait label aggregation method, electronic device and storage medium
US10996855B2 (en) Memory allocation in a data analytics system
CN112631754A (en) Data processing method, data processing device, storage medium and electronic device
CN116700917A (en) Data decision platform and use method
CN114168557A (en) Processing method and device for access log, computer equipment and storage medium
CN113282568B (en) IOT big data real-time sequence flow analysis application technical method
CN115033616A (en) Data screening rule verification method and device based on multi-round sampling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20200505

Assignee: Shanghai Suiyu Enterprise Management Consulting Partnership (L.P.)

Assignor: Write easy network technology (Shanghai) Co.,Ltd.

Contract record no.: X2023980042559

Denomination of invention: A Method for Parsing Rich Media Files

Granted publication date: 20230623

License type: Common License

Record date: 20230923
