CN106649527B

CN106649527B - Advertisement click abnormity detection system and detection method based on Spark Streaming

Info

Publication number: CN106649527B
Application number: CN201610915505.XA
Authority: CN
Inventors: 刘群; 谭敢锋; 戴大祥
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2016-10-20
Filing date: 2016-10-20
Publication date: 2021-02-09
Anticipated expiration: 2036-10-20
Also published as: CN106649527A

Abstract

The invention relates to a Spark Streaming based advertisement click anomaly detection system and detection method, in the field of computer technology applications. Abnormal data and normal data are stored in a database, and suspect data is sent to the Kafka message system. A naive Bayes classifier is then trained on the abnormal data and used to classify the suspect data, and the classification results are stored in the database. Finally, advertisers can be charged fairly on the basis of the volume of normal data; in addition, the popularity of each advertisement can be analyzed, giving advertisers guidance on industry trends and information such as the nationwide distribution of users.

Description

Advertisement click abnormity detection system and detection method based on Spark Streaming

Technical Field

The invention relates to the field of computer technology application, in particular to a system and a method for detecting advertisement click abnormity based on Spark Streaming.

Background

With the explosive growth of data, the era of big data has arrived. Secure, fast, real-time and efficient data processing not only lets enterprises avoid risks in advance, but also supplies timely data that provides a real and effective basis for enterprise development and for product production and development.

However, the openness of the network brings not only convenience to the public but also false information, malicious access, malicious attacks and the like. Every open website faces these problems, and research focuses on how to prevent them, how to extract real and effective data, and how to reduce the load that malicious traffic places on servers. Malicious advertisement clicking is a typical example. Identifying abnormal data promptly makes it possible to block malicious clicks and obtain valid advertisement click data, which gives open websites a basis for fair charging, effectively reduces server load, and supports sound commercial planning and business guidance for merchants. Current processing techniques are generally based on offline batch processing, which cannot solve online problems in real time and cannot provide a basis for decisions that must be made quickly. Real-time systems such as Storm can process data in real time, but are weaker than Spark Streaming in data security and in processing massive data. Spark is a distributed computing framework similar to MapReduce. Its core is the resilient distributed dataset (RDD), which offers a richer model than MapReduce and can iterate over a dataset in memory many times quickly, supporting complex data mining and graph computing algorithms. Spark Streaming is a real-time computing framework built on Spark that extends Spark's ability to process large-scale streaming data.

The advantages of Spark Streaming are:

It can run on 100+ nodes and achieve millisecond-level latency.

It uses the memory-based Spark as its execution engine, which is efficient and fault tolerant.

It can integrate Spark's batch processing and interactive queries.

It provides a simple interface, similar to batch processing, for implementing complex algorithms.

Therefore, to address these problems, the existing Spark big data computing framework, strong hardware support and a suitable machine learning algorithm are combined so that the problems can be solved quickly, efficiently and accurately.

The invention aims to provide a Spark Streaming based advertisement click anomaly detection system that analyzes and filters abnormal clicks on advertisements served to users, tracks valid advertisement clicks in a timely manner, and allows advertisement placement to be charged reasonably and effectively. Analyzing the behavior and characteristics of abnormal data also makes it easier to analyze user behavior and interests, to provide commercial planning for advertisers, to provide a practical basis for judging product suitability, and to predict future market behavior.

Disclosure of Invention

The present invention is directed to solving the above problems of the prior art. It provides a Spark Streaming based advertisement click anomaly detection system and detection method that can quickly, efficiently and accurately provide advertisers with a basis for commercial planning and product suitability, and can predict future market trends. The technical scheme of the invention is as follows:

a Spark Streaming based advertisement click abnormity detection system comprises a data acquisition unit, a data cleaning unit, a distributed data message system, a first abnormal data detection unit, a suspected data extraction unit, a normal data and abnormal data classifier and a classified data database unit; wherein

The data acquisition unit is used for acquiring the log information of the advertisement clicked by the user;

the data cleaning unit is used for cleaning and standardizing the logs acquired by the data acquisition unit, and finally sending the standardized data to the distributed data message system to wait for consumption;

the distributed data message system is mainly used for storing the standardized data and the suspect data sent by the suspect data extraction unit, and for generating the topic data to be consumed by Spark Streaming, with different data generating their own Topics;

the first abnormal data detection unit adopts a KNN algorithm to perform quasi-real-time processing in Spark Streaming on data from the distributed data message system to obtain suspect data, abnormal data and normal data;

the suspect data extraction unit is mainly used for sending the suspect data generated by the first abnormal data detection unit back to the distributed data message system;

the normal data and abnormal data classifier adopts a naive Bayes classification method to classify the suspect data stored in the distributed message system to obtain abnormal data and normal data;

the classification data database unit comprises a MySQL database and a Redis in-memory database, wherein the MySQL database stores the normal data and abnormal data generated by the normal data and abnormal data classifier, and the abnormal data is mapped to the Redis in-memory database so that the naive Bayes classifier can be trained quickly. Redis is an in-memory database used only as a mapping of the MySQL database, which speeds up queries and modifications; the data is written to MySQL at a set period for permanent storage. In short, Redis serves as an intermediate layer whose purpose is to increase speed.
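The Redis and MySQL roles described above resemble a write-behind cache: records land in memory first and are flushed to persistent storage at a set period. The following is a minimal sketch of that pattern only, under loud assumptions: a Python dict stands in for Redis, the standard library's sqlite3 stands in for MySQL, and the class name, table layout and `flush` method are hypothetical and not taken from the patent.

```python
import sqlite3


class WriteBehindStore:
    """In-memory buffer (stand-in for Redis) flushed periodically to SQL (stand-in for MySQL)."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}      # click_id -> (ip, label); serves fast reads and updates
        self.dirty = set()   # ids changed since the last flush
        conn.execute(
            "CREATE TABLE IF NOT EXISTS clicks "
            "(click_id TEXT PRIMARY KEY, ip TEXT, label TEXT)"
        )

    def put(self, click_id, ip, label):
        self.cache[click_id] = (ip, label)
        self.dirty.add(click_id)

    def abnormal(self):
        # Training samples are read from memory, not from the SQL store.
        return [cid for cid, (_, label) in self.cache.items() if label == "abnormal"]

    def flush(self):
        # Called once per period; persists only the records that changed.
        rows = [(cid, *self.cache[cid]) for cid in sorted(self.dirty)]
        with self.conn:
            self.conn.executemany("INSERT OR REPLACE INTO clicks VALUES (?, ?, ?)", rows)
        self.dirty.clear()
        return len(rows)


store = WriteBehindStore(sqlite3.connect(":memory:"))
store.put("c1", "10.0.0.1", "normal")
store.put("c2", "10.0.0.2", "abnormal")
flushed = store.flush()
persisted = store.conn.execute("SELECT COUNT(*) FROM clicks").fetchone()[0]
```

In a deployment the flush would be driven by a timer or scheduler; how the patent's system triggers the periodic write is not specified in the text.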

Further, the abnormal data stored in the Redis in-memory database is used to train the naive Bayes classifier.

Further, the data acquisition unit acquires the log information of the advertisement clicked by the user with the log collector Flume (a distributed log collection system), and the distributed data message system is Kafka.

Further, in the KNN function of the KNN algorithm adopted by the first abnormal data detection unit (4), x is the vector representation of a log to be classified, d_i is the vector representation of an example log in the training set, and c_j is a category; the similarity between the log to be classified and an example log is the cosine similarity.
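The formulas themselves are not reproduced in this text. A standard similarity-weighted KNN scoring function that is consistent with the symbols defined above is given here as an assumption, not as the patent's exact notation; kNN(x) denotes the k training logs most similar to x, and the indicator y(d_i, c_j) is 1 when d_i belongs to c_j and 0 otherwise:

$$
y(x, c_j) = \sum_{d_i \in \mathrm{kNN}(x)} \mathrm{sim}(x, d_i)\, y(d_i, c_j),
\qquad
\mathrm{sim}(x, d_i) = \frac{x \cdot d_i}{\lVert x \rVert \, \lVert d_i \rVert}
$$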

Further, in the KNN algorithm, the KNN classifier judges click validity on five features: first, click count anomaly, where the same IP produces a large number of clicks within a period of time; second, dwell time anomaly, where the clicking IP stays on the advertisement page for an almost negligible time; third, access time anomaly, where the clicking IP accesses advertisements at times that differ from normal human activity hours; fourth, synchronized access anomaly, where different addresses in the same IP segment repeatedly show similar access at the same time; and fifth, behavior and interest anomaly, where the IP's behavior and the advertisements it attends to differ from that IP's past behavior and interests, which makes it suspect. The KNN classifier is obtained by training on sample data of these kinds.
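To illustrate how such indicators could become a numeric vector per IP, the sketch below computes three of the five (click count, mean dwell time, and share of clicks outside normal activity hours). It is a simplified illustration under assumptions: the record layout `(ip, iso_timestamp, dwell_seconds)`, the function name, and the 01:00 to 05:59 "off hours" window are all hypothetical. The synchrony and interest-history indicators need data the text does not describe and are omitted.

```python
from collections import defaultdict
from datetime import datetime


def ip_features(clicks, night_hours=range(1, 6)):
    """Return {ip: (click_count, mean_dwell_seconds, off_hours_share)}.

    `clicks` is an iterable of (ip, iso_timestamp, dwell_seconds) tuples.
    """
    grouped = defaultdict(list)
    for ip, ts, dwell in clicks:
        grouped[ip].append((datetime.fromisoformat(ts), dwell))

    features = {}
    for ip, rows in grouped.items():
        count = len(rows)
        mean_dwell = sum(d for _, d in rows) / count
        # Share of this IP's clicks whose hour falls in the assumed off-hours window.
        off_hours = sum(1 for t, _ in rows if t.hour in night_hours) / count
        features[ip] = (count, mean_dwell, off_hours)
    return features


sample = [
    ("10.0.0.1", "2016-10-20T03:00:01", 0.5),
    ("10.0.0.1", "2016-10-20T03:00:02", 0.5),
    ("10.0.0.1", "2016-10-20T03:00:03", 0.5),
    ("10.0.0.1", "2016-10-20T03:00:04", 0.5),
    ("10.0.0.2", "2016-10-20T14:30:00", 42.0),
]
features = ip_features(sample)
```

Here the first IP shows many short night-time clicks and the second a single long daytime visit, which is the kind of contrast the five indicators are meant to capture.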

Further, in the naive Bayes function, d is the number of attributes and x_i is the value of x on the i-th attribute.
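The formula is not reproduced in this text. The usual naive Bayes decision rule, which matches the definitions of d and x_i above, is stated here as an assumption about the intended form; c ranges over the classes (normal and abnormal), P(c) is the class prior and P(x_i | c) the per-attribute likelihood:

$$
h_{nb}(x) = \arg\max_{c} \; P(c) \prod_{i=1}^{d} P(x_i \mid c)
$$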

The classifier is trained using the abnormal data mapped to Redis as samples; in each period, for example one week, the naive Bayes classifier is retrained with 20% of the abnormal data, extracted at random.
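The periodic retraining step can be sketched as follows: draw a random 20% of the abnormal records, then fit a categorical naive Bayes model. This is an illustration under assumptions, not the patented implementation: the function names are hypothetical, feature values are treated as discrete, Laplace smoothing is added, and labeled normal records are also supplied because a two-class classifier needs examples of both classes (the text only mentions abnormal samples).

```python
import math
import random
from collections import Counter, defaultdict


def sample_fraction(records, fraction=0.2, seed=0):
    """Randomly draw `fraction` of the records without replacement."""
    k = max(1, round(len(records) * fraction))
    return random.Random(seed).sample(records, k)


def train_naive_bayes(labeled):
    """Fit categorical naive Bayes counts from (feature_tuple, label) pairs."""
    class_counts = Counter(label for _, label in labeled)
    value_counts = defaultdict(Counter)   # (label, attribute index) -> value counts
    values_seen = defaultdict(set)        # attribute index -> distinct values
    for features, label in labeled:
        for i, value in enumerate(features):
            value_counts[(label, i)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def predict(model, features):
    """Return the class maximising log P(c) + sum_i log P(x_i | c)."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)
        for i, value in enumerate(features):
            # Laplace smoothing keeps unseen values from zeroing the product.
            num = value_counts[(label, i)][value] + 1
            score += math.log(num / (n + len(values_seen[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


abnormal = [(("many", "short"), "abnormal")] * 50
normal = [(("few", "long"), "normal")] * 10
model = train_naive_bayes(sample_fraction(abnormal) + normal)
```

Sampling a fraction rather than the full set keeps each retraining run cheap; the source does not say why 20% was chosen.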

A Spark Streaming based advertisement click abnormity detection method comprises the following steps:

1) collecting advertisement click logs of website users using Flume (a distributed log collection system);

2) standardizing the data collected by Flume in step 1), after which Flume sends the standardized data to the Kafka message system, where it is defined as Topic1; Topic1 represents the data waiting to be consumed, that is, the Topic serves as the address of the data;

3) classifying the data to be consumed in Topic1 from step 2) with the KNN algorithm in the Spark Streaming quasi-real-time computing framework;

4) of the suspect data, abnormal data and normal data generated in step 3), sending the suspect data back to Kafka, where it is defined as Topic2, storing the remaining data in the Redis in-memory database, and writing it to the MySQL database, thereby separating MySQL reads from writes;

5) training a naive Bayes classifier on 20% of the abnormal data from step 4), extracted at random, and then classifying Topic2 in Kafka with the naive Bayes algorithm in the Spark Streaming quasi-real-time computing framework.
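The five steps form a two-stage pipeline, which the following sketch simulates in plain Python. It shows only the routing logic, under assumptions: deques stand in for the Kafka topics, a list stands in for the database, the two classifiers are passed in as ordinary functions, and every name is hypothetical. A real deployment would consume the topics through Spark Streaming instead.

```python
from collections import deque


def run_pipeline(records, first_stage, second_stage):
    """Route records through Topic1 -> first stage -> Topic2 -> second stage.

    first_stage(record) returns "normal", "abnormal" or "suspect";
    second_stage(record) returns "normal" or "abnormal".
    """
    topic1 = deque(records)   # standardized data waiting to be consumed
    topic2 = deque()          # suspect data sent back for reclassification
    stored = []               # (record, label) pairs kept in the database

    while topic1:
        record = topic1.popleft()
        label = first_stage(record)
        if label == "suspect":
            topic2.append(record)
        else:
            stored.append((record, label))

    while topic2:
        record = topic2.popleft()
        stored.append((record, second_stage(record)))
    return stored


# Toy classifiers over clicks per minute, chosen only to exercise the routing.
def toy_first(clicks_per_minute):
    if clicks_per_minute >= 100:
        return "abnormal"
    if clicks_per_minute <= 5:
        return "normal"
    return "suspect"


def toy_second(clicks_per_minute):
    return "abnormal" if clicks_per_minute > 30 else "normal"


result = run_pipeline([2, 250, 40, 12], toy_first, toy_second)
```

Records that the first stage labels confidently never reach the second stage, so the more expensive reclassification runs only on the suspect subset.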

Further, the KNN algorithm in step 3) is as follows: the training samples are taken as reference points, the distance between the test sample and each training sample is calculated using the Euclidean distance, and the nearest training samples are used as the basis for classification.
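A minimal sketch of this nearest-neighbour rule with Euclidean distance follows. The two-dimensional feature vectors (clicks per minute, dwell seconds), the choice of k = 3 and the majority vote are assumptions made for illustration; the text does not fix k or the voting rule.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_classify(train, query, k=3):
    """Majority label among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = [
    ((1.0, 30.0), "normal"),
    ((2.0, 45.0), "normal"),
    ((3.0, 25.0), "normal"),
    ((80.0, 0.5), "abnormal"),
    ((95.0, 0.2), "abnormal"),
    ((120.0, 0.1), "abnormal"),
]
label = knn_classify(train, (90.0, 1.0))
```

A three-way variant could return "suspect" when the vote among the neighbours is not decisive; how that boundary is set is not specified in the source.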

Further, the Euclidean distance used by the KNN algorithm in step 3) is computed between x and y, which denote two different individuals, each with n-dimensional features.
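The formula is not reproduced in this text. The standard Euclidean distance between two n-dimensional vectors, with x_i and y_i denoting the i-th features of x and y, is:

$$
\mathrm{dist}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
$$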

The invention has the following advantages and beneficial effects:

In the method, Flume collects click data for advertisements served to the user side, and the data is cleaned and standardized. Flume sends the standardized data to the distributed message system Kafka, where it is published as Topic1 for subscription and consumption. The Spark Streaming quasi-real-time stream computing framework, combined with the KNN classification algorithm, classifies the data into suspect data, abnormal data and normal data. The suspect data is then sent back to Kafka as Topic2, and Spark Streaming, combined with the naive Bayes classification algorithm, classifies Topic2 into abnormal data and normal data. The classification results are stored first in Redis and then in the MySQL database, which separates database reads from writes and increases read-write speed.

Drawings

FIG. 1 is a schematic structural view of a preferred embodiment of the present invention;

FIG. 2 is a KNN classification flow chart under Spark Streaming;

FIG. 3 is a naive Bayes classification flow chart under Spark Streaming.

Detailed Description

The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.

The technical scheme of the invention is as follows:

as shown in fig. 1, an advertisement click anomaly detection system based on Spark Streaming is characterized by comprising a data acquisition unit 1, a data cleaning unit 2, a distributed data message system 3, a first anomaly data detection unit 4, a suspect data extraction unit 5, a normal data and anomaly data classifier 6 and a classified data database unit; wherein

The data acquisition unit 1 is used for acquiring the log information of the advertisement clicked by the user;

the data cleaning unit 2 is used for cleaning and standardizing the logs acquired by the data acquisition unit 1, and finally sending the standardized data to the distributed data message system 3 to wait for consumption;

the distributed data message system 3 is mainly used for storing the standardized data and the suspect data sent by the suspect data extraction unit 5, and for generating the topic data to be consumed by Spark Streaming, with different data generating their own Topics;

the first abnormal data detection unit 4 adopts a KNN algorithm to perform quasi real-time processing on data from the distributed message system 3 in Spark Streaming to obtain suspected data, abnormal data and normal data;

the suspect data extracting unit 5 is mainly used for sending the suspect data generated by the first abnormal data detecting unit 4 back to the distributed data message system 3;

the normal data and abnormal data classifier 6 classifies the suspect data stored in the distributed message system 3 by adopting a naive Bayesian classification method to obtain abnormal data and normal data;

the classification data database unit comprises a MySQL database 7 and a Redis in-memory database 8, wherein the MySQL database 7 stores the normal data and abnormal data generated by the normal data and abnormal data classifier 6, and the abnormal data is mapped to the Redis in-memory database 8 so that the naive Bayes classifier can be trained quickly. Redis is an in-memory database used only as a mapping of the MySQL database, which speeds up queries and modifications; the data is written to MySQL at a set period for permanent storage. In short, Redis serves as an intermediate layer whose purpose is to increase speed.

Fig. 2 is a KNN classification flowchart under Spark Streaming.

Fig. 3 is a naive bayes classification flow chart under Spark Streaming.

The KNN classifier classifies the Topic1 data stored in Kafka after standardization, producing suspect data (data that KNN cannot classify), normal data and abnormal data. The normal data and abnormal data are stored in the database, and the suspect data is sent back to Kafka as Topic2 to await classification by the naive Bayes classifier, which is trained on the abnormal data classified by KNN. Combined with the computing power of Spark Streaming, computation is faster and results are more accurate, and the classified data is finally stored.

With this method, abnormal data is filtered in real time after web users click advertisements, and the characteristics and behavior of the abnormal data are analyzed and extracted. Normal data is collected to calculate total advertising cost, analyze user behavior and interests, prepare business plans for advertisers, predict future market trends, and so on. The first classification by KNN produces three classes, namely suspect data, abnormal data and normal data; naive Bayes is then trained on the abnormal data and used to divide the suspect data accurately, so that the data is sound. The resulting abnormal and normal data, and relevant and irrelevant data, provide a strong guarantee for accurate data mining and predictive analysis.
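One way the first stage could produce a "suspect" class is to treat a close vote as unclassifiable. The sketch below does this with a cosine-similarity-weighted KNN vote. It is only one possible reading: the text says suspect data is data that KNN cannot classify but does not say how that is decided, so the `margin` threshold, the function names and the sample vectors are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(x, y):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def knn_three_way(train, query, k=4, margin=0.2):
    """Similarity-weighted kNN vote; returns "suspect" when the vote is close.

    `margin` is an assumed threshold on the normalised gap between the two
    best class scores.
    """
    nearest = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)[:k]
    scores = defaultdict(float)
    for vector, label in nearest:
        scores[label] += cosine(vector, query)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(scores.values())
    top = ranked[0][1]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if total <= 0 or (top - runner_up) / total < margin:
        return "suspect"
    return ranked[0][0]


# Vectors are (clicks per minute, dwell seconds); values are illustrative only.
train = [
    ((1.0, 10.0), "normal"),
    ((2.0, 9.0), "normal"),
    ((10.0, 1.0), "abnormal"),
    ((9.0, 2.0), "abnormal"),
]
clear_case = knn_three_way(train, (10.0, 0.5))
unclear_case = knn_three_way(train, (5.0, 5.0))
```

A query that lies equally close to both classes falls under the margin and would be forwarded to Topic2 for the naive Bayes stage.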

The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims

1. A Spark Streaming based advertisement click abnormity detection system is characterized by comprising a data acquisition unit (1), a data cleaning unit (2), a distributed data message system (3), a first abnormal data detection unit (4), a suspect data extraction unit (5), a normal data and abnormal data classifier (6) and a classified data database unit; wherein

The data acquisition unit (1) is used for acquiring log information of the advertisement clicked by the user;

the data cleaning unit (2) is used for cleaning and standardizing the logs acquired by the data acquisition unit (1), and finally sending the standardized data to the distributed data message system (3) to wait for consumption;

the distributed data message system (3) is used for storing the standardized data and the suspect data sent by the suspect data extraction unit (5), and for generating the topic data to be consumed by Spark Streaming, with different data generating their own Topics;

the first abnormal data detection unit (4) adopts a KNN algorithm to perform quasi-real-time processing in Spark Streaming on data from the distributed data message system (3) to obtain suspect data, abnormal data and normal data; in the KNN function of the KNN algorithm adopted by the first abnormal data detection unit (4), x is the vector representation of a log to be classified, d_i is the vector representation of an example log in the training set, and c_j is a category; the similarity between the log to be classified and the example log is the cosine similarity; the class indicator takes the value 1 when the example log vector d belongs to c_j, and 0 otherwise; the distance metric uses the Euclidean distance;

in the KNN algorithm, the KNN classifier judges click validity on five kinds of sample data: the first is click count anomaly, where the same IP produces a large number of clicks within a period of time; the second is dwell time anomaly, where the clicking IP stays on the advertisement page for an almost negligible time; the third is access time anomaly, where the clicking IP accesses advertisements at times that differ from normal human activity hours; the fourth is synchronized access anomaly, where different addresses in the same IP segment repeatedly show similar access at the same time; the fifth is behavior and interest anomaly, where the IP's behavior and the advertisements it attends to differ from that IP's past behavior and interests, which makes it suspect; and the sample data is used as representative KNN data to obtain the KNN classifier;

the suspect data extraction unit (5) is used for sending the suspect data generated by the first abnormal data detection unit (4) back to the distributed data message system (3);

the normal data and abnormal data classifier (6) classifies the suspected data stored in the distributed data message system (3) by adopting a naive Bayesian classification method to obtain abnormal data and normal data;

the classification data database unit comprises a MySQL database (7) and a Redis in-memory database (8), wherein the MySQL database (7) stores the normal data and abnormal data generated by the normal data and abnormal data classifier (6), the abnormal data is mapped to the Redis in-memory database (8) and used to train a naive Bayes classifier, Redis serves as an in-memory database used only as a mapping of the MySQL database to speed up queries and modifications, and the data is written to MySQL at a set period for permanent storage.

2. The Spark Streaming based advertisement click anomaly detection system as claimed in claim 1, wherein the abnormal data stored in said Redis in-memory database is further used to train a naive Bayes classifier.

3. The Spark Streaming based advertisement click anomaly detection system according to claim 1, wherein the data acquisition unit (1) collects the log information of user advertisement clicks with the Flume distributed log collection system, and the distributed data message system (3) is Kafka.

4. The Spark Streaming based advertisement click anomaly detection system according to claim 3, wherein in said naive Bayes function d is the number of attributes and x_i is the value of x on the i-th attribute; the classifier is trained using the abnormal data mapped to Redis as samples, and in each period the naive Bayes classifier is retrained and updated using 20% of the abnormal data, extracted at random.

5. A Spark Streaming based advertisement click abnormity detection method is characterized by comprising the following steps:

1) collecting advertisement click logs of website users by using a distributed log collection system Flume;

4) sending the suspected data back to Kafka according to the suspected data, the abnormal data and the normal data generated in the step 3) to be defined as Topic2, storing the abnormal data and the normal data in a Redis memory database, and writing the abnormal data and the normal data into a MySQL database to realize read-write separation of MySQL;

in the KNN function of the KNN algorithm, d represents an example log vector in the training set, and the class indicator takes the value 1 when d belongs to the category concerned, and 0 otherwise; the distance metric uses the Euclidean distance;

5) 20% of abnormal data in the MySQL database is randomly extracted to train a naive Bayes classifier, and then Topic2 in Kafka is classified under a naive Bayes algorithm through a Spark Streaming quasi-real-time computing framework.

6. The method for detecting abnormal advertisement clicks according to claim 5, wherein the KNN algorithm in step 3) is as follows: the training samples are taken as reference points, the distance between the test sample and each training sample is calculated using the Euclidean distance, and the nearest training samples are used as the basis for classification.

7. The method for detecting abnormal advertisement clicks based on Spark Streaming according to claim 6, wherein dist(x, y) denotes the Euclidean distance of the KNN algorithm in step 3), and x and y denote different individuals, each with n-dimensional features.