CN111797063A - Streaming data processing method and system - Google Patents
Streaming data processing method and system
- Publication number
- CN111797063A (application CN202010599344.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- file
- content
- memory
- streaming
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/172—Caching, prefetching or hoarding of files
- G06F16/11—File system administration, e.g. details of archiving or snapshots
- G06F16/113—Details of archiving
- G06F16/122—File system administration, e.g. details of archiving or snapshots using management policies
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
Abstract
The invention provides a streaming data processing method and system that keep large files on disk rather than in memory, which saves a large amount of memory and allows a lower-performance server to analyze large-scale data content. During the analysis flow, each large file is represented by an in-memory data record, which improves analysis efficiency. In addition, file content can be loaded dynamically on demand, which reduces the number of file reads and writes and the disk I/O, further improving analysis efficiency.
Description
Technical Field
The present invention relates to the field of streaming data technologies, and in particular, to a streaming data processing method and system.
Background
Streaming data is data generated continuously by multiple data sources and is generally transmitted as data records. For most scenarios that continuously generate new data, stream processing reaches an analysis conclusion faster than batch processing.
The current mainstream stream processing tools include Apache Spark Streaming, Apache Storm, and Apache Flink. In stream processing, data content is loaded into memory, and the data is generally small-scale structured data such as sensor data, application log data, user click data, and transaction data. To process large-scale unstructured data, the data must be structured in advance, and either the physical memory of the server must be increased or the system must be deployed as a cluster. As a result, current tools analyze inefficiently when processing larger-scale structured data or batches of small-scale data.
Disclosure of Invention
The invention aims to provide a streaming data processing method and system that solve two problems of the prior art: stream processing requires unstructured data to be structured in advance, and analysis is inefficient for large-scale structured data. The invention thereby saves a large amount of memory and improves analysis efficiency.
To achieve the above technical object, the present invention provides a streaming data processing method, including the following operations:
reading the data to be processed and storing the read data in a storage component;
parsing the content of the read data file and extracting the content to be analyzed into memory to form a data record;
analyzing the data record by matching its content against preset keywords, and marking the data record if a keyword is hit;
for a hit data record, if more detailed content needs to be extracted, reading the content from the storage component on demand and loading it into the data record in memory.
Preferably, the data to be processed is a batch data record or a large-scale unstructured document, and is any one of an Office file, a PDF file, a video file, and an audio file.
Preferably, the data record is metadata including an original file storage path, a file format, and a file size.
The present invention also provides a streaming data processing system, the system comprising:
the data reading module is used for reading data to be processed and storing the read data in the storage component;
the data analysis module is used for analyzing the read data file content and extracting the content to be analyzed into the memory to form a data record;
the analysis module is used for analyzing and processing the data records, matching the content of the data records with preset keywords, and marking the data records if the data records are hit;
and the data re-extraction module is used for reading the content in the storage component as required and loading the content into the data record in the memory for the hit data record if more detailed content needs to be extracted.
Preferably, the data to be processed is a batch data record or a large-scale unstructured document, and is any one of an Office file, a PDF file, a video file, and an audio file.
Preferably, the data record is metadata including an original file storage path, a file format, and a file size.
The present invention also provides a streaming data processing apparatus, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the streaming data processing method.
The invention also provides a readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the streaming data processing method.
The effects described in this summary are only those of the embodiments, not all of the effects of the invention. One of the above technical solutions has the following advantages or beneficial effects:
compared with the prior art, the invention keeps large files on disk rather than in memory, which saves a large amount of memory and allows a lower-performance server to analyze large-scale data content. During the analysis flow, each large file is represented by an in-memory data record, which improves analysis efficiency. In addition, file content can be loaded dynamically on demand, which reduces the number of file reads and writes and the disk I/O, further improving analysis efficiency.
Drawings
Fig. 1 is a flow chart of a streaming data processing method provided in an embodiment of the present invention;
Fig. 2 is a block diagram of a streaming data processing system provided in an embodiment of the present invention.
Detailed Description
In order to clearly explain the technical features of the present invention, the following detailed description of the present invention is provided with reference to the accompanying drawings. The following disclosure provides many different embodiments, or examples, for implementing different features of the invention. To simplify the disclosure of the present invention, the components and arrangements of specific examples are described below. Furthermore, the present invention may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. It should be noted that the components illustrated in the figures are not necessarily drawn to scale. Descriptions of well-known components and processing techniques and procedures are omitted so as to not unnecessarily limit the invention.
A streaming data processing method and system provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings.
As shown in fig. 1, an embodiment of the present invention discloses a streaming data processing method, where the method includes the following operations:
reading the data to be processed and storing the read data in a storage component;
parsing the content of the read data file and extracting the content to be analyzed into memory to form a data record;
analyzing the data record by matching its content against preset keywords, and marking the data record if a keyword is hit;
for a hit data record, if more detailed content needs to be extracted, reading the content from the storage component on demand and loading it into the data record in memory.
In this embodiment, the original file content is stored on a disk or other storage component, and only the attributes to be analyzed are read to form a small data record held in memory, so the analysis flow completes quickly. When an analysis scenario requires additional attributes, more file content is read dynamically and filled into the in-memory data record.
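As a non-limiting illustration, the overall flow can be sketched in Python as a chain of generators. Everything named here (`extract`, `match`, `enrich`, `SUMMARY_BYTES`, the dictionary-based record layout) is a hypothetical choice made for the sketch and is not specified by the embodiment; plain-text files stand in for the Office, PDF, video, and audio files mentioned in the description.

```python
import os
import tempfile
from typing import Dict, Iterable, Iterator, List

SUMMARY_BYTES = 256  # assumption: size of the file prefix kept in memory

Record = Dict[str, object]


def extract(paths: Iterable[str]) -> Iterator[Record]:
    """Form a small in-memory record per file; the file body stays on disk."""
    for path in paths:
        with open(path, "rb") as fh:
            head = fh.read(SUMMARY_BYTES)
        yield {
            "path": path,
            "size": os.path.getsize(path),
            "summary": head.decode("utf-8", errors="replace"),
            "hit": False,
        }


def match(records: Iterable[Record], keywords: List[str]) -> Iterator[Record]:
    """Mark records whose in-memory summary contains any preset keyword."""
    for rec in records:
        text = str(rec["summary"]).lower()
        rec["hit"] = any(k.lower() in text for k in keywords)
        yield rec


def enrich(records: Iterable[Record]) -> Iterator[Record]:
    """Read further content from disk only for records that were hit."""
    for rec in records:
        if rec["hit"]:
            with open(str(rec["path"]), "r", encoding="utf-8",
                      errors="replace") as fh:
                rec["detail"] = fh.read()
        yield rec


def run_demo() -> List[Record]:
    """Write two sample files and push them through the three stages."""
    samples = {
        "a.txt": "quarterly budget report\n" + "x" * 1000,
        "b.txt": "holiday photos index\n" + "y" * 1000,
    }
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for name, body in samples.items():
            path = os.path.join(tmp, name)
            with open(path, "w", encoding="utf-8") as fh:
                fh.write(body)
            paths.append(path)
        return list(enrich(match(extract(paths), ["budget"])))
```

In this sketch only the file whose summary mentions "budget" is read a second time; the other file is represented throughout by its short record.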
The data to be processed, which includes batch data records or large-scale unstructured documents such as Office files, PDF files, video files, and audio files, is read, and the read data files are saved to disk.
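One possible way to save incoming data to disk without holding a whole file in memory is a chunked copy, sketched below. The function name `save_stream`, the 64 KiB chunk size, and the directory layout are assumptions made for illustration; the embodiment does not prescribe how the storage component is written.

```python
import io
import os
import tempfile
from typing import BinaryIO

CHUNK_SIZE = 64 * 1024  # assumption: 64 KiB per read


def save_stream(source: BinaryIO, store_dir: str, name: str) -> str:
    """Copy an incoming byte stream into the storage directory chunk by chunk.

    Only one chunk is held in memory at a time, so memory use does not grow
    with the file size. Returns the path of the stored file.
    """
    os.makedirs(store_dir, exist_ok=True)
    dest = os.path.join(store_dir, name)
    with open(dest, "wb") as out:
        while True:
            chunk = source.read(CHUNK_SIZE)
            if not chunk:
                break
            out.write(chunk)
    return dest


def demo() -> int:
    """Store a 200 KiB in-memory payload and return the stored size in bytes."""
    payload = b"A" * (200 * 1024)
    with tempfile.TemporaryDirectory() as store:
        path = save_stream(io.BytesIO(payload), store, "report.pdf")
        return os.path.getsize(path)
```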
The file content is parsed and the content to be analyzed is extracted into memory to form a data record, which includes metadata such as the original file storage path, the file format, and the file size. The extracted content consists of the attributes needed in the subsequent analysis, such as a text abstract.
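A data record of this kind might look like the following sketch. The class `DataRecord`, its field names, and the helper `build_record` are hypothetical. The abstract is approximated here by the first characters of a plain-text file; an Office, PDF, video, or audio file would need a format-specific parser, which is outside the scope of the sketch.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DataRecord:
    """Small in-memory stand-in for a file that remains on disk."""
    storage_path: str
    file_format: str
    file_size: int
    attributes: Dict[str, str] = field(default_factory=dict)
    hit: bool = False


def build_record(path: str, abstract_chars: int = 200) -> DataRecord:
    """Collect metadata and read only the leading characters as the abstract."""
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        abstract = fh.read(abstract_chars)
    return DataRecord(
        storage_path=path,
        file_format=os.path.splitext(path)[1].lstrip(".").lower(),
        file_size=os.path.getsize(path),
        attributes={"abstract": abstract.strip()},
    )


def demo() -> DataRecord:
    """Build a record for a sample text file of 2535 bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "notes.TXT")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("Streaming analysis of large files. " + "body " * 500)
        return build_record(path, abstract_chars=34)
```

The record carries the storage path so that later stages can return to the file on disk when more content is needed.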
In the subsequent analysis flow, the file content extracted into memory is matched against preset keywords, and the data record is marked if a keyword is hit. When that flow needs more detailed content, part of the file content is read on demand and loaded into the in-memory data record, for example the text paragraphs containing the hit keywords, video key frames, or speech recognition results for specific recording segments. Taking file content analysis as an example: when a keyword is hit, the sections where the abstract is located are read into memory and their semantics are analyzed to determine whether the file content is actually related to the keyword; for files that contain no keyword, the article content is not read.
The embodiment of the invention keeps large files on disk rather than in memory, which saves a large amount of memory and allows a lower-performance server to analyze large-scale data content. During the analysis flow, each large file is represented by an in-memory data record, which improves analysis efficiency. In addition, file content can be loaded dynamically on demand, which reduces the number of file reads and writes and the disk I/O, further improving analysis efficiency.
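The keyword match and the on-demand read can be sketched as follows for the text case. The names `iter_paragraphs` and `load_on_demand`, the blank-line paragraph convention, and the case-insensitive substring match are assumptions; the semantic analysis step mentioned above is not modeled.

```python
import os
import tempfile
from typing import Dict, Iterator, List


def iter_paragraphs(path: str) -> Iterator[str]:
    """Yield blank-line-separated paragraphs, reading the file line by line."""
    buf: List[str] = []
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if line.strip():
                buf.append(line.strip())
            elif buf:
                yield " ".join(buf)
                buf = []
    if buf:
        yield " ".join(buf)


def load_on_demand(record: Dict[str, object],
                   keywords: List[str]) -> Dict[str, object]:
    """Mark the record on a keyword hit and load only the matching paragraphs."""
    summary = str(record["summary"]).lower()
    hits = [k for k in keywords if k.lower() in summary]
    record["hit"] = bool(hits)
    if hits:
        # The file on disk is opened only for hit records.
        record["paragraphs"] = [
            p for p in iter_paragraphs(str(record["path"]))
            if any(k.lower() in p.lower() for k in hits)
        ]
    return record


def demo() -> List[Dict[str, object]]:
    """Run one hit record and one miss record against the same sample file."""
    text = "Budget overview.\n\nThe budget grew this year.\n\nWeather was mild.\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "doc.txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(text)
        hit = load_on_demand({"path": path, "summary": "Budget overview."},
                             ["budget"])
        miss = load_on_demand({"path": path, "summary": "Weather notes"},
                              ["audit"])
    return [hit, miss]
```

For the miss record the file is never opened, which corresponds to the statement that article content is not read for files without a keyword hit.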
As shown in fig. 2, an embodiment of the present invention further discloses a streaming data processing system, where the system includes:
the data reading module is used for reading data to be processed and storing the read data in the storage component;
the data analysis module is used for analyzing the read data file content and extracting the content to be analyzed into the memory to form a data record;
the analysis module is used for analyzing and processing the data records, matching the content of the data records with preset keywords, and marking the data records if the data records are hit;
and the data re-extraction module is used for reading the content in the storage component as required and loading the content into the data record in the memory for the hit data record if more detailed content needs to be extracted.
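The division into four modules could be expressed as interfaces, as in the sketch below. The `Protocol` classes, the method names, and the `StreamingSystem` wiring are hypothetical, and `StubModules` is a toy stand-in that keeps "files" in a dictionary instead of a real storage component.

```python
from typing import Dict, List, Protocol

Record = Dict[str, object]


class ReadingModule(Protocol):
    def read(self, source: str) -> str: ...  # returns a storage path


class ParsingModule(Protocol):
    def extract(self, path: str) -> Record: ...


class AnalysisModule(Protocol):
    def analyze(self, record: Record) -> Record: ...


class ReExtractionModule(Protocol):
    def reload(self, record: Record) -> Record: ...


class StreamingSystem:
    """Wires the four modules in the order given in the description."""

    def __init__(self, reader: ReadingModule, parser: ParsingModule,
                 analyzer: AnalysisModule, reextractor: ReExtractionModule):
        self.reader = reader
        self.parser = parser
        self.analyzer = analyzer
        self.reextractor = reextractor

    def process(self, source: str) -> Record:
        path = self.reader.read(source)
        record = self.analyzer.analyze(self.parser.extract(path))
        if record.get("hit"):
            record = self.reextractor.reload(record)
        return record


class StubModules:
    """Toy implementation of all four roles, backed by a dict, not a disk."""

    def __init__(self, keywords: List[str]):
        self.store: Dict[str, str] = {}
        self.keywords = keywords

    def read(self, source: str) -> str:
        path = f"store/{len(self.store)}"
        self.store[path] = source  # the "file" content is the source string
        return path

    def extract(self, path: str) -> Record:
        return {"path": path, "summary": self.store[path][:20], "hit": False}

    def analyze(self, record: Record) -> Record:
        record["hit"] = any(k in str(record["summary"]) for k in self.keywords)
        return record

    def reload(self, record: Record) -> Record:
        record["detail"] = self.store[str(record["path"])]
        return record
```

Keeping the re-extraction step behind its own interface means the storage backend can change without touching the analysis logic.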
The data to be processed, which includes batch data records or large-scale unstructured documents such as Office files, PDF files, video files, and audio files, is read, and the read data files are saved to disk.
The file content is parsed and the content to be analyzed is extracted into memory to form a data record, which includes metadata such as the original file storage path, the file format, and the file size. The extracted content consists of the attributes needed in the subsequent analysis, such as a text abstract.
In the subsequent analysis flow, the file content extracted into memory is matched against preset keywords, and the data record is marked if a keyword is hit. When that flow needs more detailed content, part of the file content is read on demand and loaded into the in-memory data record, for example the text paragraphs containing the hit keywords, video key frames, or speech recognition results for specific recording segments. Taking file content analysis as an example: when a keyword is hit, the sections where the abstract is located are read into memory and their semantics are analyzed to determine whether the file content is actually related to the keyword; for files that contain no keyword, the article content is not read.
The original file content is stored on a disk or other storage component, and only the attributes to be analyzed are read to form a small data record held in memory, so the analysis flow completes quickly. When an analysis scenario requires additional attributes, more file content is read dynamically and filled into the in-memory data record.
The embodiment of the invention also discloses a streaming data processing device, which comprises:
a memory for storing a computer program;
a processor for executing the computer program to implement the streaming data processing method.
The embodiment of the invention also discloses a readable storage medium for storing a computer program, wherein the computer program realizes the streaming data processing method when being executed by a processor.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (8)
1. A method of streaming data processing, the method comprising the operations of:
reading data required to be processed and storing the read data in a storage component;
analyzing the read data file content, and extracting the content to be analyzed into an internal memory to form a data record;
analyzing the data records, matching the content of the data records with preset keywords, and marking the data records if the data records are hit;
for the hit data record, if more detailed content needs to be extracted, the content in the storage component is read as required and loaded to the data record in the memory.
2. The streaming data processing method according to claim 1, wherein the data to be processed is a batch data record or a large-scale unstructured document, and is any one of an Office file, a PDF file, a video file, and an audio file.
3. The streaming data processing method of claim 1, wherein the data record is metadata including an original file storage path, a file format, and a file size.
4. A streaming data processing system, the system comprising:
the data reading module is used for reading data to be processed and storing the read data in the storage component;
the data analysis module is used for analyzing the read data file content and extracting the content to be analyzed into the memory to form a data record;
the analysis module is used for analyzing and processing the data records, matching the content of the data records with preset keywords, and marking the data records if the data records are hit;
and the data re-extraction module is used for reading the content in the storage component as required and loading the content into the data record in the memory for the hit data record if more detailed content needs to be extracted.
5. The streaming data processing system of claim 4, wherein the data to be processed is a batch data record or a large-scale unstructured document, and is any one of an Office file, a PDF file, a video file and an audio file.
6. The streaming data processing system of claim 4, wherein the data record is metadata comprising an original file storage path, a file format, and a file size.
7. A streaming data processing device, comprising:
a memory for storing a computer program;
a processor for executing said computer program for implementing the streaming data processing method according to any of claims 1-3.
8. A readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the streaming data processing method according to any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010599344.4A CN111797063A (en) | 2020-06-28 | 2020-06-28 | Streaming data processing method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111797063A (en) | 2020-10-20 |
Family
ID=72804367
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010599344.4A Withdrawn CN111797063A (en) | 2020-06-28 | 2020-06-28 | Streaming data processing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111797063A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112001164A (en) * | 2020-10-27 | 2020-11-27 | 南京中孚信息技术有限公司 | Document content streaming analysis method and system |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030115190A1 (en) * | 2001-12-19 | 2003-06-19 | Rick Soderstrom | System and method for retrieving data from a database system |
CN101686209A (en) * | 2008-09-24 | 2010-03-31 | 阿里巴巴集团控股有限公司 | Method and device for storing message in message retransmission system |
CN103281213A (en) * | 2013-04-18 | 2013-09-04 | 西安交通大学 | Method for extracting, analyzing and searching network flow and content |
CN103294413A (en) * | 2013-05-08 | 2013-09-11 | 山东地纬计算机软件有限公司 | Mass data acquisition terminal supported distributed-memory real-time storage device and storage method |
CN105183670A (en) * | 2015-10-27 | 2015-12-23 | 北京百度网讯科技有限公司 | Data processing method and device used for distributed cache system |
CN106250320A (en) * | 2016-07-19 | 2016-12-21 | 诸葛晴凤 | A kind of memory file system management method of data consistency and abrasion equilibrium |
CN106452868A (en) * | 2016-10-12 | 2017-02-22 | 中国电子科技集团公司第三十研究所 | Network traffic statistics implement method supporting multi-dimensional aggregation classification |
CN106648988A (en) * | 2016-12-28 | 2017-05-10 | 四川秘无痕信息安全技术有限责任公司 | Method for extracting data in monitoring equipment |
CN107015982A (en) * | 2016-01-27 | 2017-08-04 | 阿里巴巴集团控股有限公司 | A kind of method, device and the equipment of monitoring system file integrality |
CN107943846A (en) * | 2017-11-01 | 2018-04-20 | 内蒙古科电数据服务有限公司 | Data processing method, device and electronic equipment |
CN109002444A (en) * | 2017-06-07 | 2018-12-14 | 北大方正集团有限公司 | Text searching method and full-text search device |
CN109086410A (en) * | 2018-08-02 | 2018-12-25 | 中国联合网络通信集团有限公司 | The processing method and system of streaming mass data |
CN109144760A (en) * | 2018-06-29 | 2019-01-04 | 清华大学 | For obtaining the method, apparatus, system and medium of internal storage state |
CN110287189A (en) * | 2019-06-25 | 2019-09-27 | 浪潮卓数大数据产业发展有限公司 | A kind of method and system based on spark streaming processing mobile cart data |
CN110377563A (en) * | 2019-07-23 | 2019-10-25 | 中国工商银行股份有限公司 | Document handling method and device and electronic equipment and readable storage medium storing program for executing |
CN111078705A (en) * | 2019-12-20 | 2020-04-28 | 南京聚力云成电子科技有限公司 | Spark platform based data index establishing method and data query method |
Non-Patent Citations (1)
Title |
---|
Pan Yingjie et al.: "High-performance solution for on-site quality monitoring of blended acquisition", Progress in Geophysics (《地球物理学进展》), no. 02, pages 803-809 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108319668B (en) | Method and equipment for generating text abstract | |
CN107491518B (en) | Search recall method and device, server and storage medium | |
US20130124796A1 (en) | Storage method and apparatus which are based on data content identification | |
CN113032362B (en) | Data blood edge analysis method, device, electronic equipment and storage medium | |
BRPI0415045A (en) | Information storage media for multimedia data storage, information storage media, text subtitle processing apparatus, text subtitle processing method, and computer readable recording media | |
US20070133067A1 (en) | Forming a master page for an electronic document | |
US20150071542A1 (en) | Automated redaction | |
US9058335B2 (en) | System, method and computer program product for protecting derived metadata when updating records within a search engine | |
CN107391544B (en) | Processing method, device and equipment of column type storage data and computer storage medium | |
CN110516203B (en) | Dispute focus analysis method, device, electronic equipment and computer-readable medium | |
CN112231407B (en) | DDL synchronization method, device, equipment and medium of PostgreSQL database | |
US8037403B2 (en) | Apparatus, method, and computer program product for extracting structured document | |
CN111797063A (en) | Streaming data processing method and system | |
CN102270238A (en) | Method and device for establishing continuation of Chinese knowledge points | |
CN111930708B (en) | Ceph object storage-based object tag expansion system and method | |
CN101021851A (en) | Text search device, text search method, recording medium for recording text search program | |
CN112346659B (en) | Storage method, equipment and storage medium for distributed object storage metadata | |
WO2015154680A1 (en) | File processing method, device, and network system | |
US20200257724A1 (en) | Methods, devices, and storage media for content retrieval | |
KR20180059112A (en) | Apparatus for classifying contents and method for using the same | |
CN106909623A (en) | A kind of data set and date storage method of supporting efficient mass data to analyze and retrieve | |
CN102360381A (en) | Device and method for performing lossless compression on embedded program | |
CN112765110B (en) | PDF annotation data generation method, device, equipment and storage medium | |
CN114218347A (en) | Method for quickly searching index of multiple file contents | |
EP3273365B1 (en) | Method for generating search index and server utilizing the same |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20201020 |