CN111797063A - Streaming data processing method and system

Publication number
CN111797063A
Authority
CN
China
Prior art keywords
data
file
content
memory
streaming
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010599344.4A
Other languages
Chinese (zh)
Inventor
刘洋洋
麻宇航
李兴国
苗功勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD
Nanjing Zhongfu Information Technology Co Ltd
Zhongfu Information Co Ltd
Zhongfu Safety Technology Co Ltd
Original Assignee
BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD
Nanjing Zhongfu Information Technology Co Ltd
Zhongfu Information Co Ltd
Zhongfu Safety Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-06-28
Filing date
2020-06-28
Publication date
2020-10-20
Application filed by BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD, Nanjing Zhongfu Information Technology Co Ltd, Zhongfu Information Co Ltd, and Zhongfu Safety Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/113 Details of archiving
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/122 File system administration, e.g. details of archiving or snapshots using management policies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems

Abstract

The invention provides a streaming data processing method and system that keep large-scale files on disk, which saves a large amount of memory and allows large-scale data content to be analyzed on a lower-performance server. Throughout the analysis flow, a large file is represented by a data record in memory, which improves analysis efficiency. In addition, file content can be loaded dynamically on demand, which reduces the number of file reads and writes, reduces disk I/O operations, and improves analysis efficiency.

Description

Streaming data processing method and system
Technical Field
The present invention relates to the field of streaming data technologies, and in particular, to a streaming data processing method and system.
Background
Streaming data is data generated continuously by multiple data sources and is generally transmitted in the form of data records. In most scenarios where new data is generated continuously, stream processing yields analysis conclusions faster than batch processing.
The mainstream stream processing tools today are Apache Spark Streaming, Apache Storm, Apache Flink, and the like. In stream processing, the data content is loaded into memory, and the data is generally small-scale structured data such as sensor data, application log data, user click data, and transaction data. To process large-scale unstructured data, the data must be structured in advance, and either the physical memory of the server must be increased or the system must be deployed as a cluster. The processing approach of current tools is therefore inefficient when analyzing larger-scale structured data or batches of small-scale data.
Disclosure of Invention
The invention aims to provide a streaming data processing method and system that address two problems in the prior art: stream processing requires unstructured data to be structured in advance, and analysis efficiency is low for large-scale structured data. The invention thereby saves a large amount of memory and improves analysis efficiency.
To achieve the above technical object, the present invention provides a streaming data processing method, including the following operations:
reading the data to be processed and storing the read data in a storage component;
parsing the content of the read data file and extracting the content to be analyzed into memory to form a data record;
analyzing the data record by matching its content against preset keywords, and marking the data record if a keyword is hit;
for a hit data record, if more detailed content needs to be extracted, reading that content from the storage component on demand and loading it into the data record in memory.
Preferably, the data to be processed is a batch of data records or a large-scale unstructured document, the document being any one of an Office file, a PDF file, a video file, and an audio file.
Preferably, the data record includes metadata comprising the original file storage path, the file format, and the file size.
The present invention also provides a streaming data processing system, the system comprising:
a data reading module, configured to read the data to be processed and store the read data in a storage component;
a data analysis module, configured to parse the content of the read data file and extract the content to be analyzed into memory to form a data record;
an analysis module, configured to analyze the data record by matching its content against preset keywords, and to mark the data record if a keyword is hit;
and a data re-extraction module, configured, for a hit data record, to read more detailed content from the storage component on demand and load it into the data record in memory if such content needs to be extracted.
Preferably, the data to be processed is a batch of data records or a large-scale unstructured document, the document being any one of an Office file, a PDF file, a video file, and an audio file.
Preferably, the data record includes metadata comprising the original file storage path, the file format, and the file size.
The present invention also provides a streaming data processing apparatus, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the streaming data processing method.
The invention also provides a readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the streaming data processing method.
The effects described in this summary are only the effects of the embodiments, not all the effects of the invention. One of the above technical solutions has the following advantages or beneficial effects:
compared with the prior art, the invention keeps larger-scale files on disk, which saves a large amount of memory and allows large-scale data content to be analyzed on a lower-performance server; throughout the analysis flow, a large file is represented by a data record in memory, which improves analysis efficiency; in addition, file content can be loaded dynamically on demand, which reduces the number of file reads and writes, reduces disk I/O operations, and improves analysis efficiency.
Drawings
Fig. 1 is a flow chart of a streaming data processing method provided in an embodiment of the present invention;
Fig. 2 is a block diagram of a streaming data processing system provided in an embodiment of the present invention.
Detailed Description
In order to clearly explain the technical features of the present invention, the following detailed description of the present invention is provided with reference to the accompanying drawings. The following disclosure provides many different embodiments, or examples, for implementing different features of the invention. To simplify the disclosure of the present invention, the components and arrangements of specific examples are described below. Furthermore, the present invention may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. It should be noted that the components illustrated in the figures are not necessarily drawn to scale. Descriptions of well-known components and processing techniques and procedures are omitted so as to not unnecessarily limit the invention.
A streaming data processing method and system provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
As shown in fig. 1, an embodiment of the present invention discloses a streaming data processing method, where the method includes the following operations:
reading the data to be processed and storing the read data in a storage component;
parsing the content of the read data file and extracting the content to be analyzed into memory to form a data record;
analyzing the data record by matching its content against preset keywords, and marking the data record if a keyword is hit;
for a hit data record, if more detailed content needs to be extracted, reading that content from the storage component on demand and loading it into the data record in memory.
In this embodiment, the original file content is stored on a disk or another storage component, and only the attributes to be analyzed are read to form a small data record that is held in memory, so that the analysis flow completes quickly. When an analysis scenario needs additional attributes, more file content is read dynamically and filled into the data record in memory.
The data to be processed is read, including batches of data records or large-scale unstructured documents such as Office files, PDF files, video files, and audio files, and the read data files are saved to disk.
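As a minimal sketch of this reading step, the following Python function copies an incoming byte stream to disk in fixed-size chunks, so that the whole file never has to sit in memory. The function name `save_to_storage`, the chunk size, and the directory layout are illustrative assumptions and are not specified by the patent.

```python
import shutil
from pathlib import Path
from typing import BinaryIO

# Illustrative chunk size (1 MiB); the patent does not specify one.
CHUNK_SIZE = 1024 * 1024


def save_to_storage(source: BinaryIO, storage_dir: Path, name: str) -> Path:
    """Copy a byte stream to a file under storage_dir, one chunk at a time."""
    storage_dir.mkdir(parents=True, exist_ok=True)
    target = storage_dir / name
    with target.open("wb") as out:
        # copyfileobj reads and writes CHUNK_SIZE bytes per iteration,
        # so memory use stays bounded regardless of the file size.
        shutil.copyfileobj(source, out, CHUNK_SIZE)
    return target
```

The returned path is what a later step would place in the data record as the original file storage path.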
The file content is parsed and the content to be analyzed is extracted into memory to form a data record, which includes metadata such as the original file storage path, the file format, and the file size. The extracted content consists of the attributes needed in the subsequent analysis flow, such as a text summary.
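One possible shape for such a data record is sketched below in Python. The class name `DataRecord`, the field names, and the rule of using the first 200 characters as the text summary are assumptions made for illustration. The sketch also assumes a plain-text file; Office, PDF, video, or audio files would need a format-specific parser, which the patent does not describe.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict

# Hypothetical cap on the summary length; not taken from the patent.
SUMMARY_CHARS = 200


@dataclass
class DataRecord:
    storage_path: str   # where the original file lives on disk
    file_format: str    # derived here from the file extension
    file_size: int      # size in bytes
    attributes: Dict[str, str] = field(default_factory=dict)
    hit: bool = False   # set later by keyword matching


def build_record(path: Path) -> DataRecord:
    """Build a small in-memory record; only a short prefix of the file is read."""
    with path.open("r", encoding="utf-8", errors="replace") as fh:
        summary = fh.read(SUMMARY_CHARS)
    return DataRecord(
        storage_path=str(path),
        file_format=path.suffix.lstrip(".").lower(),
        file_size=path.stat().st_size,
        attributes={"summary": summary},
    )
```

Only the record travels through the analysis flow; the file itself stays on disk and can be reopened through `storage_path` when more content is needed.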
The subsequent analysis flow then begins: the file content extracted into memory is matched against preset keywords, and the data record is marked if a keyword is hit. When the subsequent analysis flow needs more detailed content, part of the file content is read on demand and loaded into the data record in memory, for example the text paragraphs that contain the hit keywords, video key frames, or speech recognition results for specific recording segments. Taking file content analysis as an example: when a keyword is hit, the sections in which the summary is located are read into memory and their semantics are analyzed to determine whether the file content is related to the keyword; for files in which no keyword is hit, the article content is not read.
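The matching and on-demand loading described above might look like the following Python sketch. It assumes plain-text files with blank-line-separated paragraphs, case-insensitive substring matching, and a record represented as a dictionary with `path` and `summary` keys; all of these, along with the function names, are illustrative choices rather than details from the patent. The semantic analysis step is not shown.

```python
from pathlib import Path
from typing import Dict, Iterator, List


def match_keywords(summary: str, keywords: List[str]) -> List[str]:
    """Return the preset keywords that occur in the in-memory summary."""
    lowered = summary.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


def iter_paragraphs(path: Path) -> Iterator[str]:
    """Yield blank-line-separated paragraphs, reading the file line by line."""
    buffer: List[str] = []
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                buffer.append(line.strip())
            elif buffer:
                yield " ".join(buffer)
                buffer = []
    if buffer:
        yield " ".join(buffer)


def load_hit_paragraphs(path: Path, keywords: List[str]) -> List[str]:
    """Read from disk only the paragraphs that contain a hit keyword."""
    lowered = [kw.lower() for kw in keywords]
    return [p for p in iter_paragraphs(path)
            if any(kw in p.lower() for kw in lowered)]


def analyze(record: Dict, keywords: List[str]) -> Dict:
    """Mark the record on a hit and enrich it; files without hits are not reopened."""
    hits = match_keywords(record["summary"], keywords)
    record["hit"] = bool(hits)
    if hits:
        record["paragraphs"] = load_hit_paragraphs(Path(record["path"]), hits)
    return record
```

Because `analyze` touches the disk only for hit records, files whose summaries match no keyword cost no further I/O after the initial extraction.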
This embodiment keeps larger-scale files on disk, which saves a large amount of memory and allows large-scale data content to be analyzed on a lower-performance server. Throughout the analysis flow, a large file is represented by a data record in memory, which improves analysis efficiency. In addition, file content can be loaded dynamically on demand, which reduces the number of file reads and writes, reduces disk I/O operations, and improves analysis efficiency.
As shown in fig. 2, an embodiment of the present invention further discloses a streaming data processing system, where the system includes:
a data reading module, configured to read the data to be processed and store the read data in a storage component;
a data analysis module, configured to parse the content of the read data file and extract the content to be analyzed into memory to form a data record;
an analysis module, configured to analyze the data record by matching its content against preset keywords, and to mark the data record if a keyword is hit;
and a data re-extraction module, configured, for a hit data record, to read more detailed content from the storage component on demand and load it into the data record in memory if such content needs to be extracted.
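To illustrate how the four modules could be wired together, the Python sketch below gives each module one small class and chains them in a single function. The class and method names, the use of the first line of a text file as the extracted summary, and loading the full text on a hit are simplifying assumptions for illustration only; the patent does not prescribe an implementation.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class Record:
    path: Path
    summary: str
    hit: bool = False
    details: Dict[str, str] = field(default_factory=dict)


class DataReadingModule:
    """Stores the incoming data in the storage component (a directory here)."""
    def __init__(self, storage_dir: Path):
        self.storage_dir = storage_dir

    def read(self, name: str, payload: bytes) -> Path:
        self.storage_dir.mkdir(parents=True, exist_ok=True)
        target = self.storage_dir / name
        target.write_bytes(payload)
        return target


class DataAnalysisModule:
    """Parses the stored file and forms the in-memory record (first line only)."""
    def parse(self, path: Path) -> Record:
        with path.open("r", encoding="utf-8") as fh:
            return Record(path=path, summary=fh.readline().strip())


class AnalysisModule:
    """Matches the record content against preset keywords and marks hits."""
    def __init__(self, keywords: List[str]):
        self.keywords = [k.lower() for k in keywords]

    def analyze(self, record: Record) -> Record:
        record.hit = any(k in record.summary.lower() for k in self.keywords)
        return record


class DataReExtractionModule:
    """Loads more content from storage, but only for hit records."""
    def enrich(self, record: Record) -> Record:
        if record.hit:
            record.details["full_text"] = record.path.read_text(encoding="utf-8")
        return record


def run_pipeline(name: str, payload: bytes, storage_dir: Path,
                 keywords: List[str]) -> Record:
    path = DataReadingModule(storage_dir).read(name, payload)
    record = DataAnalysisModule().parse(path)
    record = AnalysisModule(keywords).analyze(record)
    return DataReExtractionModule().enrich(record)
```

Keeping each module behind a single method makes it straightforward to swap in, for example, a PDF parser for the data analysis module without touching the other stages.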
The data reading module reads the data to be processed, including batches of data records or large-scale unstructured documents such as Office files, PDF files, video files, and audio files, and saves the read data files to disk.
The data analysis module parses the file content and extracts the content to be analyzed into memory to form a data record, which includes metadata such as the original file storage path, the file format, and the file size. The extracted content consists of the attributes needed in the subsequent analysis flow, such as a text summary.
The analysis module matches the file content extracted into memory against preset keywords and marks the data record if a keyword is hit. When the subsequent analysis flow needs more detailed content, the data re-extraction module reads part of the file content on demand and loads it into the data record in memory, for example the text paragraphs that contain the hit keywords, video key frames, or speech recognition results for specific recording segments. Taking file content analysis as an example: when a keyword is hit, the sections in which the summary is located are read into memory and their semantics are analyzed to determine whether the file content is related to the keyword; for files in which no keyword is hit, the article content is not read.
The original file content is stored on a disk or another storage component, and only the attributes to be analyzed are read to form a small data record that is held in memory, so that the analysis flow completes quickly. When an analysis scenario needs additional attributes, more file content is read dynamically and filled into the data record in memory.
The embodiment of the invention also discloses a streaming data processing device, which comprises:
a memory for storing a computer program;
a processor for executing the computer program to implement the streaming data processing method.
The embodiment of the invention also discloses a readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the streaming data processing method.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (8)

1. A method of streaming data processing, the method comprising the operations of:
reading the data to be processed and storing the read data in a storage component;
parsing the content of the read data file and extracting the content to be analyzed into memory to form a data record;
analyzing the data record by matching its content against preset keywords, and marking the data record if a keyword is hit;
for a hit data record, if more detailed content needs to be extracted, reading that content from the storage component on demand and loading it into the data record in memory.
2. The streaming data processing method according to claim 1, wherein the data to be processed is a batch of data records or a large-scale unstructured document, the document being any one of an Office file, a PDF file, a video file, and an audio file.
3. The streaming data processing method according to claim 1, wherein the data record includes metadata comprising an original file storage path, a file format, and a file size.
4. A streaming data processing system, the system comprising:
a data reading module, configured to read the data to be processed and store the read data in a storage component;
a data analysis module, configured to parse the content of the read data file and extract the content to be analyzed into memory to form a data record;
an analysis module, configured to analyze the data record by matching its content against preset keywords, and to mark the data record if a keyword is hit;
and a data re-extraction module, configured, for a hit data record, to read more detailed content from the storage component on demand and load it into the data record in memory if such content needs to be extracted.
5. The streaming data processing system according to claim 4, wherein the data to be processed is a batch of data records or a large-scale unstructured document, the document being any one of an Office file, a PDF file, a video file, and an audio file.
6. The streaming data processing system according to claim 4, wherein the data record includes metadata comprising an original file storage path, a file format, and a file size.
7. A streaming data processing device, comprising:
a memory for storing a computer program;
a processor for executing said computer program for implementing the streaming data processing method according to any of claims 1-3.
8. A readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the streaming data processing method according to any one of claims 1 to 3.
CN202010599344.4A 2020-06-28 2020-06-28 Streaming data processing method and system Withdrawn CN111797063A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010599344.4A CN111797063A (en) 2020-06-28 2020-06-28 Streaming data processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010599344.4A CN111797063A (en) 2020-06-28 2020-06-28 Streaming data processing method and system

Publications (1)

Publication Number Publication Date
CN111797063A true CN111797063A (en) 2020-10-20

Family

ID=72804367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010599344.4A Withdrawn CN111797063A (en) 2020-06-28 2020-06-28 Streaming data processing method and system

Country Status (1)

Country Link
CN (1) CN111797063A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001164A (en) * 2020-10-27 2020-11-27 南京中孚信息技术有限公司 Document content streaming analysis method and system

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030115190A1 (en) * 2001-12-19 2003-06-19 Rick Soderstrom System and method for retrieving data from a database system
CN101686209A (en) * 2008-09-24 2010-03-31 阿里巴巴集团控股有限公司 Method and device for storing message in message retransmission system
CN103281213A (en) * 2013-04-18 2013-09-04 西安交通大学 Method for extracting, analyzing and searching network flow and content
CN103294413A (en) * 2013-05-08 2013-09-11 山东地纬计算机软件有限公司 Mass data acquisition terminal supported distributed-memory real-time storage device and storage method
CN105183670A (en) * 2015-10-27 2015-12-23 北京百度网讯科技有限公司 Data processing method and device used for distributed cache system
CN107015982A (en) * 2016-01-27 2017-08-04 阿里巴巴集团控股有限公司 A kind of method, device and the equipment of monitoring system file integrality
CN106250320A (en) * 2016-07-19 2016-12-21 诸葛晴凤 A kind of memory file system management method of data consistency and abrasion equilibrium
CN106452868A (en) * 2016-10-12 2017-02-22 中国电子科技集团公司第三十研究所 Network traffic statistics implement method supporting multi-dimensional aggregation classification
CN106648988A (en) * 2016-12-28 2017-05-10 四川秘无痕信息安全技术有限责任公司 Method for extracting data in monitoring equipment
CN109002444A (en) * 2017-06-07 2018-12-14 北大方正集团有限公司 Text searching method and full-text search device
CN107943846A (en) * 2017-11-01 2018-04-20 内蒙古科电数据服务有限公司 Data processing method, device and electronic equipment
CN109144760A (en) * 2018-06-29 2019-01-04 清华大学 For obtaining the method, apparatus, system and medium of internal storage state
CN109086410A (en) * 2018-08-02 2018-12-25 中国联合网络通信集团有限公司 The processing method and system of streaming mass data
CN110287189A (en) * 2019-06-25 2019-09-27 浪潮卓数大数据产业发展有限公司 A kind of method and system based on spark streaming processing mobile cart data
CN110377563A (en) * 2019-07-23 2019-10-25 中国工商银行股份有限公司 Document handling method and device and electronic equipment and readable storage medium storing program for executing
CN111078705A (en) * 2019-12-20 2020-04-28 南京聚力云成电子科技有限公司 Spark platform based data index establishing method and data query method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
潘英杰等: "针对混叠采集现场质量监控的高性能解决方案", 《地球物理学进展》, no. 02, pages 803 - 809 *

Similar Documents

Publication Publication Date Title
CN108319668B (en) Method and equipment for generating text abstract
CN107491518B (en) Search recall method and device, server and storage medium
US20130124796A1 (en) Storage method and apparatus which are based on data content identification
CN113032362B (en) Data lineage analysis method, device, electronic equipment and storage medium
BRPI0415045A (en) Information storage media for multimedia data storage, information storage media, text subtitle processing apparatus, text subtitle processing method, and computer readable recording media
US20070133067A1 (en) Forming a master page for an electronic document
US20150071542A1 (en) Automated redaction
US9058335B2 (en) System, method and computer program product for protecting derived metadata when updating records within a search engine
CN107391544B (en) Processing method, device and equipment of column type storage data and computer storage medium
CN110516203B (en) Dispute focus analysis method, device, electronic equipment and computer-readable medium
CN112231407B (en) DDL synchronization method, device, equipment and medium of PostgreSQL database
US8037403B2 (en) Apparatus, method, and computer program product for extracting structured document
CN111797063A (en) Streaming data processing method and system
CN102270238A (en) Method and device for establishing continuation of Chinese knowledge points
CN111930708B (en) Ceph object storage-based object tag expansion system and method
CN101021851A (en) Text search device, text search method, recording medium for recording text search program
CN112346659B (en) Storage method, equipment and storage medium for distributed object storage metadata
WO2015154680A1 (en) File processing method, device, and network system
US20200257724A1 (en) Methods, devices, and storage media for content retrieval
KR20180059112A (en) Apparatus for classifying contents and method for using the same
CN106909623A (en) A kind of data set and date storage method of supporting efficient mass data to analyze and retrieve
CN102360381A (en) Device and method for performing lossless compression on embedded program
CN112765110B (en) PDF annotation data generation method, device, equipment and storage medium
CN114218347A (en) Method for quickly searching index of multiple file contents
EP3273365B1 (en) Method for generating search index and server utilizing the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201020