CN111212038A

CN111212038A - Open data API gateway system based on big data artificial intelligence

Info

Publication number: CN111212038A
Application number: CN201911337598.2A
Authority: CN
Inventors: 李长智; 周彪; 孙园; 滕飞; 张磊; 郑耀杰
Original assignee: Digital Guangxi Group Co Ltd; Jiangsu Guotai Xindian Software Co ltd; Huawei Technologies Co Ltd
Current assignee: Digital Guangxi Group Co Ltd; Jiangsu Guotai Xindian Software Co ltd; Huawei Technologies Co Ltd
Priority date: 2019-12-23
Filing date: 2019-12-23
Publication date: 2020-05-29
Anticipated expiration: 2039-12-23
Also published as: CN111212038B

Abstract

The invention discloses an open data API gateway system based on big data artificial intelligence, which mainly comprises an API gateway, a Kafka distributed publish-subscribe message unit, an HBase distributed storage unit, an offline analysis service unit, a real-time analysis service unit and a monitoring management platform. The monitoring and analysis services are independent of the gateway core. Under intelligent management, intelligent analysis results are obtained by real-time and offline analysis of the gateway logs without directly intervening in the operation of the gateway, so that the gateway is controlled indirectly and intelligently. The accuracy, robustness and security of the gateway can thus be ensured, and the gateway can still be used normally even if the monitoring and analysis service is down.

Description

Open data API gateway system based on big data artificial intelligence

Technical Field

The invention relates to the technical field of API gateways, in particular to an open data API gateway system based on big data artificial intelligence.

Background

Monitoring and operation and maintenance (O&M) are important components of a gateway system: they are used to learn the system state in real time, record logs for locating problems, and provide 7 × 24 protection. By analyzing the statistical reports generated by the monitoring system, O&M personnel can allocate resources and adjust service weights more reasonably and, after manually configuring the relevant early-warning thresholds, can have alarms raised automatically and pushed to them. As more and more services are connected, the accuracy of the gateway (ensuring that the platform provides stable and accurate data resource content), its robustness (coping with various emergencies and intelligently coordinating resources) and its security (uncovering hidden security risks and various attack behaviors) must be improved, and the problems of existing gateway monitoring and O&M systems, such as unreliable manual rules, difficult fault localization and untimely alarm response, must be solved.

By analyzing the logs of the open data API (Application Programming Interface) gateway, alarm indexes can be configured to monitor the log data and to perform anomaly detection, prediction and alarming, raising the intelligence level of the open data API gateway. Conventional log analysis platforms, however, usually rely on manual experience and business rules for analysis and monitoring, which has the following disadvantages: (1) the large amount of manual work is time-consuming and labor-intensive, and real-time monitoring and prediction of the running condition of the gateway are difficult to achieve; (2) attack behavior processing (analysis of abnormal targeted behavior), that is, resisting an attack and sending alarm information when a user's attack behavior is found, is lacking; (3) a user behavior analysis model, which comprehensively analyzes different features of different behaviors to determine whether a user's behavior is an attack, is lacking.

Disclosure of Invention

The invention aims to solve the problem that existing open data API gateways, which are analyzed and monitored by means of manual experience and business rules, can hardly achieve real-time monitoring and prediction of the running condition of the gateway, and provides an open data API gateway system based on big data artificial intelligence.

In order to solve the above problems, the invention is realized by the following technical scheme:

The open data API gateway system based on big data artificial intelligence consists of an API gateway, a Kafka distributed publish-subscribe message unit, an HBase distributed storage unit, an offline analysis service unit, a real-time analysis service unit and a monitoring management platform.

The API gateway collects log data through an Nginx plug-in and sends the log data into a Kafka message queue of the Kafka distributed publish-subscribe message unit.

The Kafka distributed publish-subscribe message unit splits the Kafka message queue carrying the log data into two Kafka message streams: one stream is stored directly in the HBase distributed storage unit as backup offline historical data, and the other is sent to the real-time analysis service unit as real-time data for real-time calculation.

The HBase distributed storage unit uses Logstash to receive the Kafka stream sent by the Kafka distributed publish-subscribe message unit, obtains the log data, and processes and persists it.

The offline analysis service unit uses the Spark calculation engine to obtain the historical log data stored in the HBase distributed storage unit, determines the normal range threshold of each service key performance index, generates a flow prediction model, an emergency classification model, a user behavior analysis model and a statistical report, and takes the normal threshold range, the flow prediction model, the emergency classification model, the user behavior analysis model and the statistical report as the offline statistical analysis result.

The real-time analysis service unit uses the Spark Streaming stream calculation engine to receive the Kafka stream sent by the Kafka distributed publish-subscribe message unit, obtains real-time log data and performs stream calculation. Combined with the offline statistical analysis result of the offline analysis service unit, it performs emergency identification, abnormal access behavior analysis and attack behavior detection, and takes these as the real-time analysis result.

The monitoring management platform receives the offline statistical analysis result sent by the offline analysis service unit and the real-time analysis result sent by the real-time analysis service unit, and sends corresponding alarm messages according to the offline statistical analysis result and the real-time analysis result.
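The two-way split of the log stream can be pictured with a small in-memory model. The sketch below is an illustration only: the source does not say how the split is implemented, so the use of two independent consumer groups (a standard Kafka mechanism by which every group receives each message of a topic) is an assumption, as are the names `Topic`, `hbase-sink` and `realtime-analysis`. The class is a stand-in and not the Kafka client API.

```python
class Topic:
    """In-memory stand-in for one Kafka topic (illustration only)."""

    def __init__(self):
        self.log = []        # append-only message log
        self.offsets = {}    # next unread position per consumer group

    def publish(self, message):
        self.log.append(message)

    def poll(self, group):
        """Return every message the given group has not read yet."""
        start = self.offsets.get(group, 0)
        batch = self.log[start:]
        self.offsets[group] = len(self.log)
        return batch


gateway_logs = Topic()
gateway_logs.publish({"api": "/token", "status": 200, "elapsed_ms": 35})
gateway_logs.publish({"api": "/token", "status": 500, "elapsed_ms": 910})

# Each group keeps its own offset, so both paths see the full stream.
offline_batch = gateway_logs.poll("hbase-sink")           # stored as history
realtime_batch = gateway_logs.poll("realtime-analysis")   # fed to stream analysis
```

Because the offsets are tracked per group, a slow or stopped analysis consumer does not affect what the storage path receives, which matches the independence between analysis and gateway that the scheme relies on.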

In the above scheme, the service key performance indexes include the total request flow and number of requests in a time period, the total response flow in a time period, the number of successful requests, the number of failed requests, the total number of requests, the request accuracy, the average response time, the number of accessing applications, and the number of accessed services.

In the above scheme, the offline analysis service unit calculates the normal range threshold of each service key performance index as follows: first, historical data of the service key performance index items are obtained from the historical log data; second, the data are classified by service state, and the abnormal threshold is calculated separately for each extracted service index; then, the mean, standard deviation, variance, dispersion coefficient and correlation coefficient of each service key performance index in the normal state and the abnormal state are computed by a data exploration method; then, noise data are removed by an outlier analysis method; then, value range distribution analysis is performed on the valid data to determine how closely it follows a normal distribution; finally, the threshold boundary is calculated from the service scenario and the normal distribution analysis result, and the upper and lower limits of the recommended threshold are returned.

Compared with the prior art, the monitoring and analysis services of the invention are independent of the gateway core. Under intelligent management, intelligent analysis results are obtained by real-time and offline analysis of the gateway logs without directly interfering with the operation of the gateway, so that the gateway is controlled indirectly and intelligently. The accuracy, robustness and security of the gateway can thus be ensured, and the gateway can still be used normally even if the monitoring and analysis service is down.

Drawings

FIG. 1 is an overall architecture diagram of an open data API gateway system based on big data artificial intelligence.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings in conjunction with specific examples.

An open data API gateway system based on big data artificial intelligence mainly consists of an API gateway, a Kafka distributed publish-subscribe message unit, an HBase distributed storage unit, an offline analysis service unit, a real-time analysis service unit and a monitoring management platform, as shown in fig. 1.

(1) API gateway

The API gateway collects log data (as shown in table 1) using an Nginx plug-in, and feeds the log data into the Kafka message queue of the Kafka distributed publish-subscribe message unit.

TABLE 1

(2) Kafka distributed publish-subscribe message unit

(3) HBase distributed storage unit

Logstash is used to receive the Kafka stream, and the HBase distributed storage unit obtains the log records, processes them and persists them.

(4) Offline analysis service unit

The Spark calculation engine is used to obtain the historical log data stored in the HBase distributed storage unit, determine the normal range threshold of each service key performance index, and generate a flow prediction model, an emergency classification model, a user behavior analysis model and a statistical report. The normal threshold range, the flow prediction model, the emergency classification model, the user behavior analysis model and the statistical report are sent to the monitoring management platform as the offline statistical analysis result.

The normal range threshold is calculated in the offline analysis as follows: obtain the historical data of the service key performance index items (from the log data); classify the data by service state (the abnormal threshold is calculated separately for each extracted service index); compute the mean, standard deviation, variance, dispersion coefficient and correlation coefficient of each index in the normal state and the abnormal state by a data exploration method; remove noise data by an outlier analysis method; perform value range distribution analysis on the valid data to determine how closely it follows a normal distribution; and calculate the threshold boundary from the service scenario and the normal distribution analysis result, returning the upper and lower limits of the recommended threshold.
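The threshold procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: the source does not specify the outlier rule or the boundary formula, so the three-sigma outlier filter, the mean ± 3σ boundary and the function names `describe` and `recommend_threshold` are all assumptions. The correlation coefficient is omitted because it needs a second index.

```python
import statistics


def describe(values):
    """Exploratory statistics for one index (correlation omitted)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return {
        "mean": mean,
        "std": std,
        "variance": std ** 2,
        # Dispersion coefficient = std / mean (left undefined for mean 0).
        "dispersion": std / mean if mean else None,
    }


def recommend_threshold(values, z_outlier=3.0, z_bound=3.0):
    """Return (lower, upper) recommended bounds for one index.

    Assumed rule: drop points more than z_outlier standard deviations
    from the mean, then return mean +/- z_bound standard deviations of
    the cleaned data.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Outlier analysis: remove noise far from the bulk of the data.
    if std > 0:
        cleaned = [v for v in values if abs(v - mean) <= z_outlier * std]
    else:
        cleaned = list(values)
    # Normal-distribution summary of the cleaned data.
    mu = statistics.fmean(cleaned)
    sigma = statistics.pstdev(cleaned)
    return mu - z_bound * sigma, mu + z_bound * sigma
```

For example, `recommend_threshold([10, 12, 11, 9, 10, 11, 10, 9, 12, 10, 11, 10, 500])` discards the value 500 as noise before fitting, so the returned bounds stay close to the typical values. In practice the multipliers would be tuned per service scenario, as the text indicates.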

The offline analysis service unit mainly generates a flow prediction model, an emergency classification model, a user behavior analysis model and a statistical report.

1) Flow prediction model (see table 3):

A regression prediction model is adopted, for example to predict the flow value in a future time period.

TABLE 3

2) Emergency classification model (see table 4):

This is the core model of intelligent operation and maintenance and is used to detect and classify anomalies. A random forest model is planned as the test model.

TABLE 4

3) User behavior analysis model (see table 5):

Big data analysis generally combines several models. Typically, behavioral feature data are drilled into and analyzed according to the requirements of the actual scenario, and a corresponding model is established.

TABLE 5

4) Statistical report (see table 6):

Data results are statistically analyzed and displayed to form a statistical analysis report.

TABLE 6

(5) Real-time analysis service unit

The Spark Streaming calculation engine is used to receive the Kafka stream and obtain the real-time log message data stream for stream calculation. Combined with the abnormal thresholds of the service key performance index items, each index of the log data in the current time window is calculated and compared with the abnormal threshold of that index item to obtain the real-time analysis result, which is stored in an early warning library. If any index in the current time window falls outside its normal threshold range, the result for that index is pushed to the monitoring management platform of the system for real-time alarm processing.

To facilitate in-depth analysis, the invention also adds some calculated indexes on top of the collected basic log data, as shown in table 7. The indexes in table 7 are all time-series features divided by time period (5 s, 10 s, 1 min and 5 min). Spark Streaming computes the key indexes from the collected basic log data, and the real-time indexes are monitored in time-series windows.

Index name	Calculation rule
Total request flow / number of requests in a time period	Sum of request traffic / count of requests over the specified time period
Total response flow in a time period	Sum of response traffic over the specified time period
Number of successful requests	Count of successful requests
Number of failed requests	Count of failed requests
Total number of requests	Number of successful requests + number of failed requests
Request accuracy	Number of successful requests / total number of requests
Average response time	Sum of response times of successful requests / number of successful requests
Number of accessing applications	Count of distinct applications requesting a service
Number of accessed services	Count of distinct services accessed by an application

TABLE 7
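The rules in table 7 amount to simple per-window aggregations. The sketch below computes them in plain Python rather than Spark Streaming so that it stays self-contained. The record fields (`ts`, `app`, `service`, `req_bytes`, `resp_bytes`, `elapsed_ms`, `ok`) and the function name `window_kpis` are hypothetical, since the log schema of table 1 is not reproduced in this text.

```python
from collections import defaultdict


def window_kpis(records, window_s=5):
    """Aggregate log records into per-window indexes following table 7.

    Each record is a dict with the assumed keys: ts (epoch seconds),
    app, service, req_bytes, resp_bytes, elapsed_ms, ok (bool).
    """
    buckets = defaultdict(list)
    for r in records:
        # Tumbling window: key each record by its window start time.
        buckets[int(r["ts"] // window_s) * window_s].append(r)

    result = {}
    for start, rows in sorted(buckets.items()):
        ok_rows = [r for r in rows if r["ok"]]
        total = len(rows)
        successes = len(ok_rows)
        result[start] = {
            "request_traffic": sum(r["req_bytes"] for r in rows),
            "response_traffic": sum(r["resp_bytes"] for r in rows),
            "successes": successes,
            "failures": total - successes,
            "total": total,
            "accuracy": successes / total,
            # Average over successful requests only, as in table 7.
            "avg_response_ms": (
                sum(r["elapsed_ms"] for r in ok_rows) / successes
                if successes else None
            ),
            "apps": len({r["app"] for r in rows}),
            "services": len({r["service"] for r in rows}),
        }
    return result
```

Calling the function with `window_s` set to 5, 10, 60 or 300 gives the four granularities named in the text. A streaming engine would maintain the same sums incrementally instead of re-reading the records.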

The index calculation is illustrated with the request accuracy: to monitor service accuracy in real time, an abnormal threshold is calculated by offline analysis and used as the judgment standard. The real-time calculation engine then monitors the index against this threshold, and once the index reaches the abnormal threshold, the event is recorded in the abnormal result table.
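The comparison step can be sketched as follows. This is an illustration under assumptions: the thresholds are taken to be (lower, upper) pairs as returned by the offline analysis, the index names are hypothetical, and the returned list stands in for rows of the abnormal result table.

```python
def check_window(kpis, thresholds):
    """Return one anomaly row per index outside its [lower, upper] range.

    kpis maps index name to the value for the current time window;
    thresholds maps index name to an assumed (lower, upper) pair.
    """
    anomalies = []
    for name, (lower, upper) in thresholds.items():
        value = kpis.get(name)
        if value is None:
            continue  # index not available in this window
        if value < lower or value > upper:
            anomalies.append(
                {"index": name, "value": value, "lower": lower, "upper": upper}
            )
    return anomalies


rows = check_window(
    {"accuracy": 0.80, "avg_response_ms": 50},
    {"accuracy": (0.95, 1.0), "avg_response_ms": (0, 156)},
)
```

Here only the request accuracy is reported, because 0.80 is below the assumed lower limit of 0.95 while the response time is within range.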

The real-time analysis service unit mainly realizes the functions of emergency identification, abnormal access behavior analysis and attack behavior detection.

1) Emergency identification (see table 8):

TABLE 8

2) Abnormal access behavior analysis (unsupervised approach) (see table 9):

TABLE 9

3) Attack behavior detection (supervised approach) (see table 10):

A supervised learning method is adopted: attack behavior features in the collected data are labeled manually, and a machine learning model is trained to identify them.

TABLE 10

(6) Monitoring management platform

Robustness is reflected in the ability to handle emergencies outside normal specifications, such as an instantaneous traffic surge or a sudden slowdown in response time: the system can respond immediately and take reasonable measures to keep the gateway running normally. Data such as service flow and response time are counted in real time and used as input to the emergency model, which determines the type of emergency and pushes the related information and alarm instructions. The management platform receives the message and handles it accordingly. For example, an instantaneous traffic surge is handled by proactively loading the service into a cache to reduce the actual number of requests reaching the back end; if response time is slow, the weight of the slow service cluster is reduced and results are obtained preferentially from the faster service cluster. In addition, historical data can be used for predictive analysis of likely future traffic, so that the corresponding service resources can be arranged in advance to keep applications running smoothly. If any index in the current time window falls outside its normal threshold range, the result for that index is pushed to the monitoring management platform of the system for real-time alarm processing. The monitoring management platform captures sudden abnormal conditions in time and sends alarm information so that the relevant personnel can deal with them at the first opportunity and keep the system running stably.
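The two handling examples above can be expressed as a small dispatch step. The sketch is purely illustrative: the event names, the action names, the default alert action and the halving factor in `rebalance` are assumptions, since the source describes the behavior only in prose.

```python
def plan_actions(event_type):
    """Map an emergency type to the handling described in the text."""
    actions = {
        "traffic_surge": ["load_service_into_cache"],
        "slow_response": ["lower_weight_of_slow_cluster"],
    }
    # Unknown event types fall back to alerting operators (assumption).
    return actions.get(event_type, ["alert_operators"])


def rebalance(weights, avg_response_ms, slow_ms, factor=0.5, min_weight=1):
    """Lower the routing weight of clusters whose response time is slow.

    weights and avg_response_ms map cluster name to routing weight and
    to average response time in ms; clusters slower than slow_ms have
    their weight scaled by factor, but never below min_weight.
    """
    adjusted = {}
    for cluster, weight in weights.items():
        if avg_response_ms.get(cluster, 0) > slow_ms:
            adjusted[cluster] = max(min_weight, int(weight * factor))
        else:
            adjusted[cluster] = weight
    return adjusted
```

With `rebalance({"a": 10, "b": 10}, {"a": 50, "b": 900}, slow_ms=500)` the slow cluster `b` drops to weight 5 while `a` keeps weight 10, so more requests are served by the faster cluster.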

The performance of the open data API gateway system based on big data artificial intelligence designed by the invention is analyzed as follows:

1. Analysis of predictive models

Data prediction is an expectation of the future based on historical data. In a stable system it helps to reveal the regularities of the system and to solve some system-related problems. When a sudden external disturbance breaks the stability of the system, however, prediction accuracy at that moment drops sharply. External influences must therefore be taken into account when data prediction is used.

Access logs from 2019.3.3 to 2019.4.15 were collected. To evaluate model accuracy, five days of data were fed into the prediction interface, which returned a predicted value for the sixth day. The predicted value was compared with the actual value against a 70% accuracy criterion, and the accuracy of each prediction method was recorded.
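The evaluation loop can be sketched with the simplest of the methods compared, ordinary least squares on the day index. Two assumptions are made loudly here: the source does not state which features were fed to the prediction interface, so only the daily values themselves are used, and the 70% criterion is read as "relative error of at most 30%", which is one possible interpretation rather than the documented rule.

```python
def predict_next(values):
    """Fit y = a + b*x by least squares on x = 0..n-1 and extrapolate to x = n."""
    n = len(values)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept + slope * n


def within_tolerance(predicted, actual, accuracy=0.7):
    """Assumed scoring rule: correct if relative error <= 1 - accuracy."""
    if actual == 0:
        return predicted == 0
    return abs(predicted - actual) / abs(actual) <= 1 - accuracy


def hit_rate(series, window=5):
    """Slide a five-day window over the series and score each forecast."""
    hits = [
        within_tolerance(predict_next(series[i:i + window]), series[i + window])
        for i in range(len(series) - window)
    ]
    return sum(hits) / len(hits) if hits else None
```

`hit_rate` returns the fraction of days whose forecast met the criterion, which is the kind of per-method accuracy figure the comparison records.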

Algorithm selection:

(1) Linear regression: a statistical analysis method that uses regression analysis from mathematical statistics to determine the quantitative interdependence between two or more variables. It is also a relatively simple and easy-to-use prediction approach.

(2) Polynomial regression: regression in which the regression function is a polynomial in the regression variables. The polynomial regression model is a kind of linear regression model, since the regression function is linear in the regression coefficients.

(3) Ridge regression: a biased estimation regression method dedicated to the analysis of collinear data. It is essentially an improved least squares method: by giving up the unbiasedness of least squares, at the cost of losing some information and precision, it obtains regression coefficients that are more realistic and reliable, and it fits ill-conditioned data better than ordinary least squares.

(4) Lasso regression: LASSO was first proposed by Robert Tibshirani in 1996. It is a shrinkage estimator that obtains a more parsimonious model by constructing a penalty function that shrinks some regression coefficients, i.e. forces the sum of the absolute values of the coefficients to be less than a fixed value, while setting some regression coefficients to zero. It thus retains the advantage of subset shrinkage and is a biased estimation method for data with complex collinearity.

(5) Kernel ridge regression: kernel ridge regression (KRR) combines ridge regression with the kernel method. It thus learns a linear function in the space induced by the respective kernel and the data; for a non-linear kernel, this corresponds to a non-linear function in the original space. The model learned by KernelRidge has the same form as support vector regression (SVR).
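For reference, the standard textbook forms of three of these estimators are given below. They are not taken from the patent; the symbols are the usual ones, with X the design matrix, y the response vector, β the coefficient vector, λ > 0 the ridge penalty weight and t the lasso bound.

```latex
% Polynomial regression of degree d: non-linear in x, linear in the coefficients
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d + \varepsilon

% Ridge regression: least squares with an L2 penalty, \lambda > 0
\hat{\beta}^{\mathrm{ridge}}
  = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2
  = (X^\top X + \lambda I)^{-1} X^\top y

% Lasso: least squares with the sum of absolute coefficients bounded by t
\hat{\beta}^{\mathrm{lasso}}
  = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2
  \quad \text{subject to} \quad \sum_{j} \lvert \beta_j \rvert \le t
```

The L1 constraint in the lasso is what drives some coefficients exactly to zero, whereas the ridge penalty only shrinks them toward zero.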

Cross-validation was performed on multiple sets of algorithms, combined with multiple indices of the data, as shown in table 11:

TABLE 11

It can be seen that lasso performed relatively well among the five regression methods, but the accuracy of all five was low, below 30%. The poor prediction results have roughly the following causes: 1) the number of accesses, request traffic, response traffic and response time are strongly correlated, so a low prediction accuracy for one index implies a low prediction accuracy for the others; 2) data quality is poor: little data is available, there are many gaps, and the loss of a given day's data affects subsequent analysis; 3) the cycle of the system may not be 5 days (this needs to be verified with data); 4) comparing the lasso predictions of the number of accesses with the actual data shows that the data anomaly of 2019.3.28 caused the lasso predictions to lag, i.e. an external influence disturbed the stability of the system and reduced prediction accuracy.

In summary, although the prediction process shows some fluctuation because little data is available, predictability is demonstrated on the whole.

2. Response time threshold analysis

Access logs from 2019.3.3 to 2019.4.15 were collected.

First, the distribution of response times in all historical logs is counted, and outliers with excessively large response times are eliminated. With the existing historical log information as the data source, the maximum response time is 599995 milliseconds. The calculated coverage is shown in table 12:

Coverage of the full data set	Response time (ms)
99.7% of data	26913
99.5% of data	4415
99.2% of data	2373

TABLE 12

Since 99.5% of the response times in the full data set are within 4415 milliseconds, records above 4415 milliseconds are removed as outliers and the remaining data are analyzed as a new data set.
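A coverage cutoff of this kind can be computed as an empirical percentile. The sketch below is one simple way to do it and is not necessarily the method used to produce table 12; the function names are hypothetical, and the small epsilon only guards against floating-point rounding when `coverage * n` should be a whole number.

```python
import math


def percentile_cutoff(values, coverage):
    """Smallest observed value v such that at least `coverage` of the data is <= v."""
    ordered = sorted(values)
    # Number of points that must lie at or below the cutoff.
    k = max(1, math.ceil(coverage * len(ordered) - 1e-9))
    return ordered[k - 1]


def drop_above(values, cutoff):
    """Keep only the records at or below the cutoff."""
    return [v for v in values if v <= cutoff]


response_ms = [10, 20, 30, 40, 50, 60, 70, 9999]
cutoff = percentile_cutoff(response_ms, 0.75)
cleaned = drop_above(response_ms, cutoff)
```

In this toy data set the 75% cutoff is 60 ms, so the two largest values, including the extreme 9999 ms record, are excluded from the new data set. With the gateway logs the same call would use a coverage of 0.995.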

2.1 Overall value range analysis

For the new data set with outliers removed, the distribution of response times approximates the right half of a normal distribution (there is no left half because response time is always greater than 0). Using the properties of the normal distribution, the response time that covers a given probability of the new data set, and the number of requests exceeding that response time, are calculated as shown in table 13:

Probability of covering the data set	Response time (ms)	Number of requests exceeding the response time
99.9%	3312	16
99.8%	3054	32
99.7%	2377	48
99.5%	2000	80
99%	1531	159

TABLE 13

It can be seen that a normal request has a 99.9% probability of responding within 3312 ms. 99.9% can therefore be used as the probability threshold for the response time of every API (i.e. for each API it is assumed that 99.9% of its response times are normal), and the abnormal response time threshold of each API is calculated as the value below which the normal response time of that API falls with 99.9% probability. Response times exceeding this threshold are treated as abnormal responses and trigger an early warning.
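A per-API threshold of this kind can be sketched as follows. The formula is an assumption: the source says only that the properties of the normal distribution are used, so the sketch fits a plain normal distribution to the response times of each API and takes its 99.9% quantile, mean + z·σ. The half-normal shape mentioned in the text is not modelled, and the function name and sample API names are hypothetical.

```python
from collections import defaultdict
from statistics import NormalDist, fmean, pstdev


def api_thresholds(samples, probability=0.999):
    """Map each API name to an abnormal response-time threshold in ms.

    samples is an iterable of (api_name, response_ms) pairs from which
    outliers have already been removed.
    """
    # z-score below which `probability` of a standard normal lies.
    z = NormalDist().inv_cdf(probability)
    by_api = defaultdict(list)
    for api, ms in samples:
        by_api[api].append(ms)
    return {api: fmean(v) + z * pstdev(v) for api, v in by_api.items()}


thresholds = api_thresholds(
    [("token", 50), ("token", 60), ("token", 70), ("auth", 30), ("auth", 30)]
)
```

An API with perfectly constant response times gets its own mean as the threshold, while an API with more spread gets a proportionally looser one, which is why thresholds are computed per API rather than globally.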

2.2 Single API analysis

The abnormal response time thresholds of two APIs are calculated separately. For the access token interface of the query docking interface of the transaction link docking platform, the calculated abnormal response time threshold is 156 milliseconds. For the external network unified authentication token interface, the calculated abnormal response time threshold is 87 milliseconds.

In summary, the value range analysis shows that the thresholds calculated for individual APIs are reasonable and can be applied in practice for early warning.

The tasks of the open data API gateway system based on big data artificial intelligence designed by the invention comprise:

1. Real-time monitoring and prediction of server performance indexes: monitoring the Nginx service, monitoring machine performance, and relating these to machine room resource requirements. Machine learning algorithms are used to predict service resource requirements in real time and adjust service resources dynamically, saving resources.

2. Service flow anomaly detection, server system security analysis, data leakage detection and suspicious request detection.

3. Detection of historical outliers of specified analysis indexes. By configuring the analysis indexes, historical outliers of the indexes are detected and detailed information on each abnormal point is given. Outliers include abrupt changes, data beyond the fluctuation range, periodically fluctuating data and the like. By studying the abnormal data, abnormal fluctuations in the number of service requests, number of rejections, response time, traffic, orders and so on are found in time, ensuring service stability.

4. Full-link tracing: analyzing data resources and request links in combination with the data resources and application scenarios.

5. Log text analysis, log topic analysis, log cluster analysis, automatic classification of log information, automatic extraction of log alarm and monitoring information, and generation of automated log analysis reports.

6. Service performance extreme value and jitter analysis, DDoS analysis of service logs, network attack detection, and real-time, dynamic discovery of attack behavior in the network.

7. Data resource popularity analysis: analyzing the request behavior of users based on the log records of data resource requests, to understand the popularity of data resources and the data needs of users. Data preference analysis and data recommendation are performed on user data based on machine learning and recommendation algorithms.

8. Log classification analysis: automatically classifying request log data and resource types based on log request records and user preference data, analyzing and summarizing general log types, and automatically marking important log request information.

9. User profiling: based on user request behavior and log records, and in combination with a profiling system, user data profiles are built, user behavior habits and label features are analyzed, and user data preferences are understood, forming a data-driven log analysis system. Hacker behavior and characteristics are analyzed and network attack behavior is mined based on a collaborative filtering algorithm and a risk analysis model.

Intelligent management mainly means that, without directly interfering with the operation of the API gateway, the collected logs are mined with artificial intelligence analysis techniques to output intelligent analysis results, so that the gateway is managed and controlled indirectly and intelligently. This is also the key to achieving the accuracy, robustness and security of the gateway. Historical time-series data are learned automatically by machine learning and deep learning algorithms, and the optimal algorithm is selected automatically for the user to perform anomaly detection, prediction and alarming; users guide their own business with the analysis results obtained. Because the system automatically selects the optimal algorithm for the indexes chosen by the user, the user can complete anomaly detection and prediction and obtain analysis results without any algorithm-related knowledge, which greatly lowers the barrier to use. Machine learning and deep learning algorithms such as clustering, regression and neural networks are provided for accurate anomaly detection and prediction, with much higher accuracy than traditional machine learning algorithms. The underlying algorithms are encapsulated and a visual operating environment is provided: the user can create an anomaly detection task by configuring task indexes visually on a page, without any programming, which reduces the learning cost and makes data mining simple.

It should be noted that the above embodiments of the present invention are illustrative, and the present invention is not limited to them. Other embodiments obtained by those skilled in the art in light of the teachings of the present invention, without departing from its principles, are considered to be within the scope of the present invention.

Claims

1. An open data API gateway system based on big data artificial intelligence, characterized by consisting of an API gateway, a Kafka distributed publish-subscribe message unit, an HBase distributed storage unit, an offline analysis service unit, a real-time analysis service unit and a monitoring management platform;

the API gateway collects log data through an Nginx plug-in and sends the log data into a Kafka message queue of the Kafka distributed publish-subscribe message unit;

the Kafka distributed publish-subscribe message unit splits the Kafka message queue carrying the log data into two Kafka message streams, one of which is stored directly in the HBase distributed storage unit as backup offline historical data, and the other of which is sent to the real-time analysis service unit as real-time data for real-time calculation;

the HBase distributed storage unit uses Logstash to receive the Kafka stream sent by the Kafka distributed publish-subscribe message unit, obtains the log data, and processes and persists the log data;

the offline analysis service unit uses a Spark calculation engine to obtain the historical log data stored in the HBase distributed storage unit, determines the normal range threshold of each service key performance index, generates a flow prediction model, an emergency classification model, a user behavior analysis model and a statistical report, and takes the normal threshold range, the flow prediction model, the emergency classification model, the user behavior analysis model and the statistical report as the offline statistical analysis result;

the real-time analysis service unit uses a Spark Streaming stream calculation engine to receive the Kafka stream sent by the Kafka distributed publish-subscribe message unit, obtains real-time log data and performs stream calculation, performs emergency identification, abnormal access behavior analysis and attack behavior detection in combination with the offline statistical analysis result of the offline analysis service unit, and takes the emergency identification, the abnormal access behavior analysis and the attack behavior detection as the real-time analysis result;

2. The open data API gateway system based on big data artificial intelligence of claim 1, wherein the service key performance indexes include the total request flow and number of requests in a time period, the total response flow in a time period, the number of successful requests, the number of failed requests, the total number of requests, the request accuracy, the average response time, the number of accessing applications, and the number of accessed services.

3. The open data API gateway system based on big data artificial intelligence of claim 1, wherein the offline analysis service unit calculates the normal range threshold of each service key performance index as follows: first, historical data of the service key performance index items are obtained from the historical log data; second, the data are classified by service state, and the abnormal threshold is calculated separately for each extracted service index; then, the mean, standard deviation, variance, dispersion coefficient and correlation coefficient of each service key performance index in the normal state and the abnormal state are computed by a data exploration method; then, noise data are removed by an outlier analysis method; then, value range distribution analysis is performed on the valid data to determine how closely it follows a normal distribution; finally, the threshold boundary is calculated from the service scenario and the normal distribution analysis result, and the upper and lower limits of the recommended threshold are returned.