CN111797063A - Streaming data processing method and system

Info

Publication number: CN111797063A
Application number: CN202010599344.4A
Authority: CN
Inventors: 刘洋洋; 麻宇航; 李兴国; 苗功勋
Original assignee: BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD; Nanjing Zhongfu Information Technology Co Ltd; Zhongfu Information Co Ltd; Zhongfu Safety Technology Co Ltd
Current assignee: BEIJING ZHONGFU TAIHE TECHNOLOGY DEVELOPMENT CO LTD; Nanjing Zhongfu Information Technology Co Ltd; Zhongfu Information Co Ltd; Zhongfu Safety Technology Co Ltd
Priority date: 2020-06-28
Filing date: 2020-06-28
Publication date: 2020-10-20

Abstract

The invention provides a streaming data processing method and system. Large files are kept on disk, which saves a large amount of memory and allows a lower-performance server to analyze large-scale data content. Within the analysis pipeline, each large file is represented by a small in-memory data record, which improves analysis efficiency. In addition, file content is loaded dynamically on demand, which reduces the number of file reads and writes and the amount of disk I/O, further improving analysis efficiency.

Description

Streaming data processing method and system

Technical Field

The present invention relates to the field of streaming data technologies, and in particular, to a streaming data processing method and system.

Background

Streaming data is data generated continuously by multiple data sources and is generally transmitted as data records. In most scenarios where new data is produced continuously, streaming processing reaches an analysis conclusion faster than batch processing.

The current mainstream stream processing tools include Apache Spark Streaming, Apache Storm, and Flink. In stream processing, the data content is loaded into memory, and the data is generally small-scale structured data such as sensor data, application log data, user click data, and transaction data. To process large-scale unstructured data, the data must be structured in advance, and either the physical memory of the server must be increased or the tool must be deployed as a cluster. As a result, current tools are inefficient when analyzing structured data on a larger scale or batches of small-scale data.

Disclosure of Invention

The invention aims to provide a streaming data processing method and system that solve two problems of the prior art: stream processing requires unstructured data to be structured in advance, and analysis efficiency is low for large-scale structured data. The invention thereby saves a large amount of memory and improves analysis efficiency.

To achieve the above technical object, the present invention provides a streaming data processing method, including the following operations:

reading the data to be processed and storing the read data in a storage component;

parsing the content of the read data file, and extracting the content to be analyzed into memory to form a data record;

analyzing the data record by matching its content against preset keywords, and marking the data record if a keyword is hit;

for a hit data record, if more detailed content needs to be extracted, reading the required content from the storage component on demand and loading it into the data record in memory.

Preferably, the data to be processed is a batch of data records or a large-scale unstructured document, the document being any one of an Office file, a PDF file, a video file, and an audio file.

Preferably, the data record includes metadata comprising the original file storage path, the file format, and the file size.

The present invention also provides a streaming data processing system, the system comprising:

a data reading module, configured to read the data to be processed and store the read data in the storage component;

a data analysis module, configured to parse the content of the read data file and extract the content to be analyzed into memory to form a data record;

an analysis module, configured to analyze the data record by matching its content against preset keywords, and to mark the data record if a keyword is hit;

and a data re-extraction module, configured, for a hit data record that needs more detailed content, to read the required content from the storage component on demand and load it into the data record in memory.

The present invention also provides a streaming data processing apparatus, comprising:

a memory for storing a computer program;

a processor for executing the computer program to implement the streaming data processing method.

The invention also provides a readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the streaming data processing method.

The effects described in this summary are only those of the embodiments, not all effects of the invention. One of the above technical solutions has the following advantages or beneficial effects:

Compared with the prior art, the invention keeps large files on disk, which saves a large amount of memory and allows a lower-performance server to analyze large-scale data content. Within the analysis pipeline, each large file is represented by a small in-memory data record, which improves analysis efficiency. In addition, file content is loaded dynamically on demand, which reduces the number of file reads and writes and the amount of disk I/O, further improving analysis efficiency.

Drawings

Fig. 1 is a flow chart of a streaming data processing method provided in an embodiment of the present invention;

Fig. 2 is a block diagram of a streaming data processing system provided in an embodiment of the present invention.

Detailed Description

In order to clearly explain the technical features of the present invention, the following detailed description of the present invention is provided with reference to the accompanying drawings. The following disclosure provides many different embodiments, or examples, for implementing different features of the invention. To simplify the disclosure of the present invention, the components and arrangements of specific examples are described below. Furthermore, the present invention may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. It should be noted that the components illustrated in the figures are not necessarily drawn to scale. Descriptions of well-known components and processing techniques and procedures are omitted so as to not unnecessarily limit the invention.

A streaming data processing method and system provided by the embodiments of the present invention are described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, an embodiment of the present invention discloses a streaming data processing method, whose operations are described below.

In this embodiment, the original content of each file is stored on disk or in another storage component. Only the attributes to be analyzed are read, forming a small data record that is kept in memory so that the analysis pipeline completes quickly. When an analysis scenario needs additional attributes, more file content is read dynamically and filled into the in-memory data record.

The data to be processed is read, and the read data files are saved to disk. The data includes batch data records or large-scale unstructured documents such as Office files, PDF files, video files, and audio files.
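The reading step can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `spool_to_disk`, the `storage_dir` layout, and the 1 MiB chunk size are assumptions chosen for illustration. The point it shows is that an incoming file can be written to the storage component in bounded chunks, so its full content never has to sit in memory.

```python
import shutil
from pathlib import Path
from typing import BinaryIO


def spool_to_disk(source: BinaryIO, storage_dir: Path, name: str,
                  chunk_size: int = 1 << 20) -> Path:
    """Copy an incoming byte stream into storage_dir (hypothetical layout)."""
    storage_dir.mkdir(parents=True, exist_ok=True)
    target = storage_dir / name
    with open(target, "wb") as out:
        # copyfileobj reads at most chunk_size bytes at a time, so memory use
        # stays bounded regardless of how large the incoming file is.
        shutil.copyfileobj(source, out, chunk_size)
    return target
```

The returned path is what a later step would keep in the data record as the original file storage path.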

The file content is parsed, and the content to be analyzed is extracted into memory to form a data record. The data record includes metadata such as the original file storage path, the file format, and the file size. The extracted content consists of the attributes needed in the subsequent analysis, such as a text abstract.
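A data record of this kind might look like the following sketch. The class and field names (`DataRecord`, `attributes`, `hit`) are hypothetical, and the "abstract" is approximated by the first few hundred characters of a plain-text file; a real Office, PDF, audio, or video file would need a format-specific parser, which is outside the scope of this example.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict


@dataclass
class DataRecord:
    """Small in-memory stand-in for a large file kept on disk."""
    storage_path: Path          # original file storage path
    file_format: str            # file format, derived here from the suffix
    file_size: int              # file size in bytes
    attributes: Dict[str, str] = field(default_factory=dict)
    hit: bool = False           # set later by the keyword-matching step


def build_record(path: Path, summary_chars: int = 200) -> DataRecord:
    record = DataRecord(
        storage_path=path,
        file_format=path.suffix.lstrip(".").lower(),
        file_size=path.stat().st_size,
    )
    # Assumption: the file is plain text. Only a bounded prefix is read,
    # as a stand-in for a text abstract; the rest stays on disk.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        record.attributes["summary"] = f.read(summary_chars)
    return record
```

Because only the metadata and a short summary are held, the memory cost per file is roughly constant and independent of the file size.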

In the subsequent analysis flow, the file content extracted into memory is matched against preset keywords, and the data record is marked if a keyword is hit. When the subsequent analysis requires more detailed content, part of the file content is read on demand and loaded into the in-memory data record, for example the text paragraphs containing the hit keywords, video key frames, or the speech recognition results of specific recording segments. Taking file content analysis as an example: when a keyword is hit, the section in which the abstract is located is read into memory and its semantics are analyzed to determine whether the file content is related to the keyword; for files that do not hit any keyword, the article content is not read.
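The matching and on-demand loading described above can be sketched as follows. This is an illustrative assumption, not the patented implementation: the record is a plain dict with hypothetical keys (`path`, `summary`, `hit`, `hit_keywords`, `paragraphs`), matching is a case-insensitive substring test, and "more detailed content" is taken to be the blank-line-separated paragraphs of a plain-text file that contain a hit keyword.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def match_keywords(record: Dict, keywords: Iterable[str]) -> bool:
    """Mark the record if its in-memory summary contains any preset keyword."""
    summary = record.get("summary", "").lower()
    hits = [k for k in keywords if k.lower() in summary]
    record["hit"] = bool(hits)
    record["hit_keywords"] = hits
    return record["hit"]


def _flush(buffer: List[str], wanted: List[str], out: List[str]) -> None:
    """Close the current paragraph and keep it only if it contains a keyword."""
    if buffer:
        text = " ".join(buffer)
        if any(k in text.lower() for k in wanted):
            out.append(text)
        buffer.clear()


def load_details(record: Dict) -> None:
    """For a hit record, stream the file and load only the matching paragraphs."""
    if not record.get("hit"):
        return  # files that missed every keyword are never reopened
    wanted = [k.lower() for k in record["hit_keywords"]]
    paragraphs: List[str] = []
    buffer: List[str] = []
    with open(record["path"], "r", encoding="utf-8", errors="replace") as f:
        for line in f:  # line by line, so the whole file is never in memory
            if line.strip():
                buffer.append(line.strip())
            else:
                _flush(buffer, wanted, paragraphs)
        _flush(buffer, wanted, paragraphs)
    record["paragraphs"] = paragraphs
```

The disk is touched a second time only for records that were marked, which is where the reduction in read operations comes from.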

The embodiment of the invention keeps large files on disk, which saves a large amount of memory and allows a lower-performance server to analyze large-scale data content. Within the analysis pipeline, each large file is represented by a small in-memory data record, which improves analysis efficiency. In addition, file content is loaded dynamically on demand, which reduces the number of file reads and writes and the amount of disk I/O, further improving analysis efficiency.

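As shown in Fig. 2, an embodiment of the present invention further discloses a streaming data processing system, which includes the data reading module, the data analysis module, the analysis module, and the data re-extraction module described above.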

The original content of each file is stored on a magnetic disk or in another storage component. Only the attributes to be analyzed are read, forming a small data record that is kept in memory so that the analysis pipeline completes quickly. When an analysis scenario needs additional attributes, more file content is read dynamically and filled into the in-memory data record.
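One way to picture how the four modules fit together is the sketch below. The class `StreamingSystem` and its field names are hypothetical; each module is modelled as a plain callable so that the wiring, rather than any particular storage or parsing technique, is what the example shows. Records are yielded one at a time, so only the record currently being processed needs to be in memory.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator


@dataclass
class StreamingSystem:
    read: Callable[[Any], Any]         # data reading module: store input, return a handle
    parse: Callable[[Any], Any]        # data analysis module: handle -> small data record
    analyze: Callable[[Any], bool]     # analysis module: mark the record, return True on a hit
    re_extract: Callable[[Any], None]  # data re-extraction module: load more detail on demand

    def run(self, inputs: Iterable[Any]) -> Iterator[Any]:
        for item in inputs:
            record = self.parse(self.read(item))
            if self.analyze(record):
                # Only hit records trigger a second read from storage.
                self.re_extract(record)
            yield record
```

Passing the modules in as callables keeps the sketch independent of the storage component, which the text allows to be a disk or any other storage.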

The embodiment of the invention also discloses a streaming data processing device, which comprises:

a memory for storing a computer program;

a processor for executing the computer program to implement the streaming data processing method.

The embodiment of the invention also discloses a readable storage medium for storing a computer program, wherein the computer program realizes the streaming data processing method when being executed by a processor.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

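1. A method of streaming data processing, the method comprising the operations of:

reading data required to be processed and storing the read data in a storage component;

analyzing the read data file content, and extracting the content to be analyzed into an internal memory to form a data record;

analyzing the data records, matching the content of the data records with preset keywords, and marking the data records if the data records are hit;

for the hit data record, if more detailed content needs to be extracted, the content in the storage component is read as required and loaded to the data record in the memory.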

2. The streaming data processing method according to claim 1, wherein the data to be processed is a batch data record or a large-scale unstructured document, and is any one of an Office file, a PDF file, a video file, and an audio file.

3. The streaming data processing method of claim 1, wherein the data record is metadata including an original file storage path, a file format, and a file size.

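4. A streaming data processing system, the system comprising:

the data reading module is used for reading data to be processed and storing the read data in the storage component;

the data analysis module is used for analyzing the read data file content and extracting the content to be analyzed into the memory to form a data record;

the analysis module is used for analyzing and processing the data records, matching the content of the data records with preset keywords, and marking the data records if the data records are hit;

and the data re-extraction module is used for reading the content in the storage component as required and loading the content into the data record in the memory for the hit data record if more detailed content needs to be extracted.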

5. The streaming data processing system of claim 4, wherein the data to be processed is a batch data record or a large-scale unstructured document, and is any one of an Office file, a PDF file, a video file and an audio file.

6. The streaming data processing system of claim 4, wherein the data record is metadata comprising an original file storage path, a file format, and a file size.

7. A streaming data processing device, comprising:

a memory for storing a computer program;

a processor for executing said computer program for implementing the streaming data processing method according to any of claims 1-3.

8. A readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the streaming data processing method according to any one of claims 1 to 3.