CN118152383A - Big data real-time analysis processing method

Info

Publication number: CN118152383A
Application number: CN202410202029.1A
Authority: CN
Inventors: 陈星栋; 郭浩哲; 蒙圣光
Original assignee: Guangdong Fastersoft Software Co ltd
Current assignee: Guangdong Fastersoft Software Co ltd
Priority date: 2024-02-23
Filing date: 2024-02-23
Publication date: 2024-06-07

Abstract

The invention relates to the technical field of data processing and discloses a real-time analysis and processing method for big data. The method collects a large amount of structured and unstructured data from various data sources and cleans and preprocesses it, including removing duplicate data, processing missing values, and converting data formats. The preprocessed data is stored in a data storage system and is transmitted from the data source to a data processing engine by a stream processing engine. Real-time analysis and mining algorithms are applied in the data processing engine to extract useful information from the data. Finally, data visualization tools present the analysis results to the relevant personnel in a form that is easy to understand and act on, thereby supporting real-time decision-making and execution. The technical scheme offers real-time performance, accuracy, visual display, comprehensive application, and scalability; it provides efficient, accurate, and visual big data analysis and mining support for users and helps them make better-informed decisions and optimize business processes.

Description

Big data real-time analysis processing method

Technical Field

The invention relates to the technical field of data analysis and processing, in particular to a real-time big data analysis and processing method.

Background

With the continuous development of society and the economy, big data applications are becoming more and more common. Big data analysis and processing technology allows businesses and organizations to extract useful information from big data to aid decision-making, optimize business processes, and the like. Existing big data analysis and processing methods comprise batch processing and real-time processing.

Batch processing has the following disadvantages:

Time delay: batch processing must wait until a certain amount of data has accumulated before processing it, so analysis results are delayed and real-time feedback on real-time data cannot be provided.

Data transmission and storage overhead: batch processing requires transferring large amounts of data from a data source to a data processing engine and storing the processing results, which burdens the network and wastes storage resources.

Poor fit for dynamic scenarios: batch processing has difficulty handling dynamically changing data and cannot serve application scenarios with strict real-time requirements.

Real-time processing has the following disadvantages:

High processing complexity: real-time processing requires computation and mining directly on the data stream, which places high demands on computing resources and algorithms.

Data quality problems: real-time processing depends on timely and accurate data, but in practice sensor data, text data and the like often contain noise and missing values, so the required data quality is hard to guarantee.

Results are difficult to visualize: the results of real-time processing are generally themselves real-time data streams, and displaying them intuitively to users and communicating them effectively is challenging.

In summary, the method for real-time analysis and processing of big data overcomes the defects of the prior art by comprehensively utilizing various technical means, so that real-time analysis and mining can be efficiently performed in a big data environment facing complex changes, and results are visually displayed to users, thereby providing more timely, accurate and visual data analysis support for the users.

Disclosure of Invention

In order to achieve the above purpose, the present invention provides the following technical solutions:

A real-time analysis processing method for big data comprises the following steps:

S1: collecting a large amount of data from various data sources, including structured data and unstructured data;

S2: cleaning and preprocessing the collected data, including removing duplicate data, processing missing values, and converting data formats;

S3: storing the preprocessed data using a data storage system;

S4: transmitting data from the data source to the data processing engine using a stream processing engine;

S5: data analysis and mining: applying appropriate real-time analysis and mining algorithms in the data processing engine to extract useful information from the data;

S6: creating an interactive chart or dashboard using a data visualization tool to visually present the analysis results to the user.

As a preferable technical scheme of the invention, the structured data in the S1 comprises data in a database and sensor data, and the unstructured data comprises text data, image data and audio data.

As a preferable technical scheme of the invention, the data storage system in the S3 adopts a distributed file system or a distributed database.

As a preferable technical scheme of the invention, the S4 stream processing engine adopts one or more of Apache Kafka, Apache Flink or Apache Storm.

As a preferable technical scheme of the invention, the S5 real-time analysis and mining algorithm adopts a machine learning algorithm, an image processing algorithm or a text analysis algorithm.

As a preferable technical scheme of the invention, the S6 data visualization tool adopts one or more of Tableau or Power BI.

As a preferable technical scheme of the invention, in S2:

Removing duplicate data:

the formula is used: data = data.drop_duplicates();

Processing missing values:

deleting the records/rows that contain a missing value uses the formula: data = data.dropna();

replacing missing values with a specified value uses the formula: data = data.fillna(value);

filling missing values by interpolation, which estimates them from the trend of the known data, uses the formula: data = data.interpolate();

Converting the data format:

converting the data type of the designated column into a new type new_type uses the formula: data[column] = data[column].astype(new_type).

As a preferable technical scheme of the invention, the machine learning algorithm comprises:

Linear regression: the formula is $y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$, where $y$ is the predicted variable, $x_1, x_2, \dots, x_n$ are the input variables, $w_1, w_2, \dots, w_n$ are the weights, and $b$ is the bias;

Decision tree: constructing a decision tree model according to the information gain or the Gini index of the features, and using the decision tree model for classification and regression tasks;

Support Vector Machine (SVM): the formula is $y = \operatorname{sign}(w^{T} x + b)$, where $y$ is the prediction result, $x$ is the input sample, $w$ is the weight vector, and $b$ is the bias;

K-means clustering: iteratively assigning data points to K clusters so that the distance between each data point and the centroid of the cluster to which it belongs is minimized;

Deep learning algorithm: combining linear combinations with activation functions, including one or more of ReLU, Sigmoid, or Softmax;

The image processing algorithm comprises:

Image filtering: convolving the image with a filter;

Image segmentation: one or more of threshold segmentation, edge detection, or region growing are employed;

feature extraction: extracting image features using one or more of SIFT, HOG, CNN methods;

image recognition and classification: classifying the extracted image features by using a classifier, wherein the algorithm comprises one or more of a support vector machine or a convolutional neural network;

the text analysis algorithm includes one or more of a bag of words model, TF-IDF, topic model, or text classification algorithm.

As a preferable technical scheme of the invention, the filter comprises one or more of a Gaussian filter and a median filter.

Advantageous effects

Compared with the prior art, the invention provides a real-time analysis and processing method for big data, whose beneficial effects can be summarized as follows:

1. Real-time performance: the technical scheme can process a large amount of structured and unstructured data in real time, realizes the real-time analysis and mining of the real-time data, greatly shortens the response time, and improves the efficiency of decision making and service optimization.

2. Accuracy: through the steps of data cleaning, preprocessing, real-time analysis and the like, the technical scheme can process the data quality problems such as noise and missing values, improves the accuracy and the reliability of data, and ensures the reliability of analysis results.

3. Visual display: by utilizing data visualization tools, such as interactive charts or dashboards, the technical scheme can display analysis results to users in an intuitive and easy-to-understand manner, and help the users to better understand and apply the analysis results, thereby supporting decision making and business optimization.

4. Comprehensive application: according to the technical scheme, the stream processing engine, the big data storage system and various analysis and mining algorithms are comprehensively utilized, so that data analysis and mining can be efficiently realized in a big data environment with complex changes, and comprehensive and deep data insight is provided.

5. Scalability: the technical scheme adopts an open architecture and popular technical components, such as Apache Kafka, Apache Flink, and machine learning algorithms, has good scalability, and can support ever-increasing data volumes and application requirements.

The technical scheme offers real-time performance, accuracy, visual display, comprehensive application, and scalability; it provides efficient, accurate, and visual big data analysis and mining support for users and helps them make better-informed decisions and optimize business processes.

Drawings

Fig. 1 is a flow chart of a real-time analysis processing method for big data according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to fig. 1, a real-time analysis processing method for big data involves the following specific steps:

① Data collection: a large amount of data is collected from various data sources, including structured data (e.g., databases, sensor data, etc.) and unstructured data (e.g., text, images, audio, etc.).

② Data cleaning and preprocessing: the collected data is cleaned and preprocessed, including removing duplicate data, processing missing values, converting data formats, etc., to ensure data quality and consistency of data formats.

Data cleansing and preprocessing are very important steps in real-time analysis processing of large data, and the following are examples of some common data cleansing and preprocessing operations:

Removing duplicate data:

The formula is used: data = data.drop_duplicates()

This operation removes duplicate records from the dataset, leaving only unique records.

Processing missing values:

The formula is used: data = data.dropna()

This operation deletes the records/rows that contain a missing value.

The formula is used: data = data.fillna(value)

This operation replaces each missing value with the specified value (value).

The formula is used: data = data.interpolate()

This operation fills in missing values by interpolation, estimating them from the trend of the known data.

Converting the data format:

The formula is used: data[column] = data[column].astype(new_type)

This operation converts the data type of the specified column (column) into a new type (new_type), for example, converting a character string into an integer, or converting a date into a time stamp.

Data formatting and normalization:

The formula is used: data[column] = data[column].apply(function)

This operation formats or normalizes the data in the specified column (column) using a custom function (function), such as converting text to lower case or removing special characters.

The formula is used: data[column] = (data[column] - mean) / std

This operation standardizes the column (column) by subtracting its mean (mean) and dividing by its standard deviation (std), so that the resulting values have zero mean and unit standard deviation. It rescales the data but does not by itself make the distribution normal.
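The formulas in this step match pandas DataFrame methods, so the cleaning operations can be chained into one short function. The following is a minimal sketch under that assumption; the sample rows, the column names (sensor_id, reading, ts) and the function name clean are hypothetical and are not part of the invention.

```python
import pandas as pd

# Hypothetical sample: the second row duplicates the first,
# and one reading is missing.
raw = pd.DataFrame({
    "sensor_id": ["a", "a", "b", "b", "b"],
    "reading": [1.0, 1.0, 2.0, None, 4.0],
    "ts": ["2024-02-23 10:00:00", "2024-02-23 10:00:00",
           "2024-02-23 10:00:01", "2024-02-23 10:00:02",
           "2024-02-23 10:00:03"],
})


def clean(data: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above, in order."""
    # Remove duplicate records and renumber the remaining rows.
    data = data.drop_duplicates().reset_index(drop=True)
    # Fill the missing reading by linear interpolation between neighbours.
    data["reading"] = data["reading"].interpolate()
    # Convert data formats: string -> timestamp, string -> category.
    data["ts"] = pd.to_datetime(data["ts"])
    data["sensor_id"] = data["sensor_id"].astype("category")
    # Standardize: subtract the mean, divide by the standard deviation.
    mean = data["reading"].mean()
    std = data["reading"].std()
    data["reading_z"] = (data["reading"] - mean) / std
    return data


cleaned = clean(raw)
print(cleaned)
```

Interpolation is used here instead of dropna() or fillna(value) only because the sample is an ordered sensor series; for unordered records, one of the other two operations is usually the safer choice.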

③ Data storage: a data storage system suitable for real-time processing, such as a distributed file system (e.g., HDFS) or a distributed database (e.g., HBase, Cassandra), is selected to accommodate large-scale data.

④ Real-time data stream processing: data is transferred from the data source to the data processing engine using a stream processing engine (e.g., Apache Kafka, Apache Flink, Apache Storm, etc.).
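To illustrate what a stream processing engine does with the transferred data, the following sketch aggregates events in fixed-length (tumbling) time windows. It deliberately uses no Kafka, Flink or Storm API: an in-memory list stands in for the stream, and the event layout (timestamp, key, value) and the function name tumbling_window_mean are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event layout: (timestamp in seconds, key, value).
Event = Tuple[float, str, float]


def tumbling_window_mean(
    events: Iterable[Event], window_s: float
) -> Iterator[Tuple[float, Dict[str, float]]]:
    """Yield (window_start, {key: mean value}) whenever a window closes.

    Assumes events arrive in timestamp order; real engines also handle
    late and out-of-order events, which this sketch does not.
    """
    current_start = None
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for ts, key, value in events:
        start = (ts // window_s) * window_s
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The previous window is complete: emit it and reset state.
            yield current_start, {k: sums[k] / counts[k] for k in sums}
            sums.clear()
            counts.clear()
            current_start = start
        sums[key] += value
        counts[key] += 1
    if current_start is not None:
        # Flush the last, still-open window when the stream ends.
        yield current_start, {k: sums[k] / counts[k] for k in sums}


stream = [(0.2, "s1", 10.0), (0.7, "s1", 14.0), (0.9, "s2", 3.0),
          (1.1, "s1", 20.0), (2.5, "s2", 5.0)]
windows = list(tumbling_window_mean(stream, window_s=1.0))
for window_start, means in windows:
    print(window_start, means)
```

Because the function is a generator, each window result becomes available as soon as the first event of the next window arrives, which is the property that distinguishes this from batch processing.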

⑤ Data analysis and mining: appropriate real-time analysis and mining algorithms, such as machine learning algorithms, image processing algorithms, text analysis algorithms, etc., are applied in the data processing engine to extract useful information from the data. Machine learning algorithms, image processing algorithms, and text analysis algorithms are very broad fields, and specific formulas and algorithms may vary from case to case. The following is a brief description of some examples of algorithms common in these fields and their associated formulas:

Machine learning algorithm:

Linear regression: the formula is $y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$, where $y$ is the predicted variable, $x_1, x_2, \dots, x_n$ are the input variables, $w_1, w_2, \dots, w_n$ are the weights, and $b$ is the bias (intercept).

Decision tree: a decision tree model is constructed according to the information gain or the Gini index of the features and is used for classification and regression tasks.

Support Vector Machine (SVM): the formula is $y = \operatorname{sign}(w^{T} x + b)$, where $y$ is the prediction result, $x$ is the input sample, $w$ is the weight vector, and $b$ is the bias (intercept).

K-means clustering: the data points are iteratively assigned to K clusters such that the distance of each data point from the centroid of the cluster to which it belongs is minimized.

Deep learning algorithms (such as neural networks): multiple layers of neurons and weights are involved, where the output of each neuron is obtained by passing the weighted sum of its inputs plus a bias through an activation function. The formulas combine linear combinations with activation functions such as ReLU, Sigmoid, Softmax, etc.
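The formulas above can be written out directly. The following is a minimal pure-Python sketch of the prediction side only (no training); all function names are hypothetical, and in practice a machine learning library would normally be used instead.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def linear_predict(w: Sequence[float], x: Sequence[float], b: float) -> float:
    """Linear regression prediction: y = w1*x1 + ... + wn*xn + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def svm_decide(w: Sequence[float], x: Sequence[float], b: float) -> int:
    """Linear SVM decision: y = sign(w^T x + b).

    Assumption: a score of exactly 0 is mapped to +1.
    """
    return 1 if linear_predict(w, x, b) >= 0 else -1


def relu(z: float) -> float:
    return max(0.0, z)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs: Sequence[float]) -> List[float]:
    # Subtract the maximum first for numerical stability.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def kmeans(points: Sequence[Point], centroids: Sequence[Point],
           iterations: int = 10) -> List[Point]:
    """K-means: alternate assignment and centroid update steps."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters: List[List[Point]] = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep the old
        # centroid if the cluster is empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids


print(linear_predict([2.0, -1.0], [3.0, 4.0], 0.5))
print(svm_decide([1.0, 1.0], [-2.0, 0.5], 0.0))
print(kmeans([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (10, 10)]))
```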

Image processing algorithm:

Image filtering: the image is filtered, for example by convolving it with a Gaussian kernel or by applying a median filter.

Image segmentation: common algorithms include threshold segmentation, edge detection, region growing, etc., to divide an image into different regions or objects.

Feature extraction: features in the image are extracted for further analysis and classification, such as extracting image features using SIFT, HOG, CNN or the like.

Image recognition and classification: the extracted image features are classified using a classifier, and common algorithms include a Support Vector Machine (SVM), a Convolutional Neural Network (CNN), and the like.
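As a small illustration of the filtering and threshold segmentation steps, the sketch below applies a 3×3 Gaussian filter, a 3×3 median filter and a fixed threshold to a grayscale image stored as a list of rows. The kernel weights, the edge handling (border pixels are replicated) and all function names are assumptions chosen for brevity; an image processing library would normally be used in practice.

```python
from statistics import median
from typing import List

Image = List[List[float]]

# 3x3 Gaussian approximation in row-major order; the weights sum to 16.
GAUSS_3X3 = [1, 2, 1,
             2, 4, 2,
             1, 2, 1]


def _neighbourhood(img: Image, r: int, c: int) -> List[float]:
    """Return the 3x3 neighbourhood of (r, c), replicating border pixels."""
    h, w = len(img), len(img[0])
    return [img[min(max(r + dr, 0), h - 1)][min(max(c + dc, 0), w - 1)]
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def gaussian_filter(img: Image) -> Image:
    """Linear filter: weighted average of each neighbourhood."""
    return [[sum(k * v for k, v in zip(GAUSS_3X3, _neighbourhood(img, r, c))) / 16
             for c in range(len(img[0]))]
            for r in range(len(img))]


def median_filter(img: Image) -> Image:
    """Non-linear filter: median of each neighbourhood."""
    return [[median(_neighbourhood(img, r, c))
             for c in range(len(img[0]))]
            for r in range(len(img))]


def threshold(img: Image, t: float) -> List[List[int]]:
    """Threshold segmentation: 1 where the pixel is >= t, else 0."""
    return [[1 if v >= t else 0 for v in row] for row in img]


# A flat image with a single bright noise pixel in the centre.
noisy = [[10, 10, 10],
         [10, 100, 10],
         [10, 10, 10]]
print(median_filter(noisy))    # the outlier is removed
print(gaussian_filter(noisy))  # the outlier is smoothed, not removed
print(threshold(noisy, 50))
```

The example shows why both filters are listed: the median filter removes an isolated outlier entirely, while the Gaussian filter only spreads it over its neighbours.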

Text analysis algorithm:

Bag-of-words model: text is represented as a vector or matrix of the number of occurrences of each word; it is commonly used for text classification and sentiment analysis.

TF-IDF (term frequency-inverse document frequency): the importance of each word in a document is quantified by combining its term frequency in that document with its inverse document frequency across the corpus.

Topic model (e.g. LDA): text data is considered to be made up of a plurality of topics, and the topic distribution of the document and the word distribution of the topics are inferred by statistical methods.

Text classification algorithms (e.g., naive bayes, support vector machines): training is performed based on the text features and class labels for classifying new text into predefined classes.
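The bag-of-words and TF-IDF representations can be sketched in a few lines. The version below assumes whitespace tokenization and the plain definition tf × log(N / df); text analysis libraries typically apply smoothed variants, so their numbers will differ. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def bag_of_words(doc: str) -> Counter:
    """Count the occurrences of each lower-cased, whitespace-split token."""
    return Counter(doc.lower().split())


def tf_idf(docs: List[str]) -> List[Dict[str, float]]:
    """Return one {term: tf * idf} mapping per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    bows = [bag_of_words(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for bow in bows for term in bow)
    scores = []
    for bow in bows:
        total = sum(bow.values())
        scores.append({term: (count / total) * math.log(n_docs / df[term])
                       for term, count in bow.items()})
    return scores


docs = ["real time data", "batch data", "sensor data stream"]
scores = tf_idf(docs)
for doc, score in zip(docs, scores):
    print(doc, score)
```

A term that occurs in every document, such as "data" here, receives a score of zero, which is how TF-IDF suppresses words that do not help distinguish documents.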

It should be noted that these algorithms and formulas are just some common examples in these fields, more complex algorithms and formulas may be used in practice, and each algorithm may be subject to different variations and modifications. Specific algorithm selections and formulas should be determined based on specific questions and data, in conjunction with corresponding machine learning, image processing, or text analysis libraries.

⑥ Visualization and reporting: the analysis results are visually presented to the user so that the user can understand and utilize the results. Interactive charts, dashboards, etc. may be created using data visualization tools (e.g., Tableau, Power BI).

⑦ Real-time feedback and decision support: and feeding back the analysis result to related personnel in real time, and supporting real-time decision making. For example, in the e-commerce field, real-time analysis results may be utilized to implement personalized recommendations, anti-fraud measures, and the like.

To achieve real-time decisions, the following steps may be considered:

Real-time data acquisition: the timeliness of the data sources is ensured, including data collected from sensors, system logs, user interactions, and the like. Data may be collected and processed in real time using streaming data processing technologies such as Apache Kafka, Apache Flink, and the like.

Real-time data processing and analysis: real-time data processing techniques, such as streaming data processing tools and Complex Event Processing (CEP) engines, are used to analyze the collected data in real time. These techniques may apply various machine learning, image processing, and text analysis algorithms and provide real-time data processing and insight.

Constructing a decision model: based on the real-time data and the analysis results, a corresponding decision model is constructed. The model can be a prediction model based on a machine learning algorithm, a real-time monitoring system, an intelligent recommendation system, and the like. The accuracy and timeliness of the decision model must be ensured.

Decision feedback and real-time push: the real-time analysis results and the output of the decision model are fed back to the relevant personnel. This may be accomplished in various ways, such as real-time reports, dashboards, mobile applications, etc. Notification and push technologies can also deliver the real-time decision results to the relevant personnel by e-mail, short message, instant message, and the like.

Automated execution and integration: decisions that can be executed automatically can be integrated into a real-time decision system or workflow so that they are carried out in real time. This may be achieved by automation tools, process management software, Robotic Process Automation (RPA), etc.
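The last three steps (decision model, feedback, automated execution) can be sketched with a simple rule-based model for the anti-fraud example mentioned earlier. Everything here is hypothetical: the thresholds, the action names and the notify/execute callbacks stand in for whatever model, push channel and workflow integration a real deployment would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Decision:
    action: str  # "allow", "review" or "block"
    reason: str


def decide(fraud_score: float, review_at: float = 0.6,
           block_at: float = 0.9) -> Decision:
    """Rule-based decision model over a score in [0, 1]."""
    if fraud_score >= block_at:
        return Decision("block", f"score {fraud_score:.2f} >= {block_at}")
    if fraud_score >= review_at:
        return Decision("review", f"score {fraud_score:.2f} >= {review_at}")
    return Decision("allow", "score below review threshold")


def dispatch(decision: Decision,
             notify: Callable[[Decision], None],
             execute: Callable[[Decision], None]) -> None:
    """Execute automatable decisions and push the rest to people."""
    if decision.action == "block":
        execute(decision)   # automated execution
    if decision.action in ("block", "review"):
        notify(decision)    # feedback to the relevant personnel


# Lists stand in for a push channel and an automation workflow.
notified: List[Decision] = []
executed: List[Decision] = []
for score in (0.2, 0.7, 0.95):
    dispatch(decide(score), notify=notified.append, execute=executed.append)

print([d.action for d in notified])
print([d.action for d in executed])
```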

⑧ Monitoring and optimizing: the performance and effect of the data processing flow are continuously monitored, and optimization and adjustment are carried out according to the requirement, so that a better real-time analysis processing effect is achieved.
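One way to monitor the processing flow is to track end-to-end latency over a sliding window of recent records and flag when a high percentile exceeds a budget. The sketch below is an assumption about how such monitoring could look; the class name LatencyMonitor, the window size and the 200 ms budget are illustrative values, not figures from the invention.

```python
from collections import deque


class LatencyMonitor:
    """Track recent processing latencies and flag budget violations."""

    def __init__(self, window: int = 100, budget_ms: float = 200.0) -> None:
        # Only the most recent `window` samples are kept.
        self.samples = deque(maxlen=window)
        self.budget_ms = budget_ms

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def p95(self) -> float:
        """95th percentile by the nearest-rank method."""
        if not self.samples:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples)
        rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
        return ordered[rank - 1]

    def needs_tuning(self) -> bool:
        """True when the recent p95 latency exceeds the budget."""
        return bool(self.samples) and self.p95() > self.budget_ms


monitor = LatencyMonitor(window=20, budget_ms=150.0)
for latency in range(10, 210, 10):  # 10, 20, ..., 200 ms
    monitor.record(float(latency))
print(monitor.p95(), monitor.needs_tuning())
```

A percentile is used rather than the mean because a small number of slow records is exactly what degrades a real-time pipeline, and the mean tends to hide them.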

It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

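1. A real-time analysis processing method for big data, characterized in that the method comprises the following steps:

S1: collecting a large amount of data from various data sources, including structured data and unstructured data;

S2: cleaning and preprocessing the collected data, including removing duplicate data, processing missing values, and converting data formats;

S3: storing the preprocessed data using a data storage system;

S4: transmitting data from the data source to the data processing engine using a stream processing engine;

S5: data analysis and mining: applying real-time analysis and mining algorithms in the data processing engine to extract useful information from the data;

S6: creating an interactive chart or dashboard using a data visualization tool to visually present the analysis results to the user.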

2. The real-time big data analyzing and processing method according to claim 1, wherein the structured data in S1 includes data in a database and sensor data, and the unstructured data includes text data, image data and audio data.

3. The real-time big data analyzing and processing method according to claim 1, wherein the data storage system in S3 adopts a distributed file system or a distributed database.

4. The real-time big data analyzing and processing method according to claim 1, wherein the S4 stream processing engine adopts one or more of Apache Kafka, Apache Flink or Apache Storm.

5. The real-time big data analyzing and processing method according to claim 1, wherein the S5 real-time analysis and mining algorithm adopts a machine learning algorithm, an image processing algorithm or a text analysis algorithm.

6. The real-time big data analyzing and processing method according to claim 1, wherein the S6 data visualization tool adopts one or more of Tableau or Power BI.

7. The real-time big data analyzing and processing method according to claim 1, wherein the method comprises the following steps: in the S2