CN114663121A - Method and device for detecting abnormal traffic of advertisement

Info

Publication number: CN114663121A
Application number: CN202011536419.0A
Authority: CN
Inventors: 苏同; 郭田奇; 李�诚; 刘崴; 李响
Original assignee: Hylink Digital Technology Co ltd
Current assignee: Hylink Digital Technology Co ltd
Priority date: 2020-12-23
Filing date: 2020-12-23
Publication date: 2022-06-24

Abstract

Embodiments of the present specification provide methods, apparatuses, computing devices, and machine-readable storage media for advertisement abnormal traffic detection. The method may comprise the following steps: acquiring historical advertisement log data in a historical time period, wherein the historical advertisement log data comprises data obtained by at least one device participating in advertisement in the historical time period; generating feature data based on the historical advertisement log data, wherein the feature data is used for representing advertisement features and user features participating in the advertisement; encoding the feature data to generate a target training sample; and performing machine learning model training based on the target training sample for advertisement abnormal traffic detection.

Description

Method and device for detecting abnormal traffic of advertisement

Technical Field

Embodiments of the present description relate to the field of information technology, and more particularly, to a method, apparatus, computing device, and machine-readable storage medium for advertisement abnormal traffic detection.

Background

With the explosive growth of the internet, the digital advertising market is also expanding. Compared with traditional advertisements, digital advertisements can create greater value for advertisers through audience targeting, traffic monitoring, personalized services, and the like. At the same time, however, digital advertisements face a growing threat from increasingly sophisticated advertising cheating organizations. The abnormal traffic generated by advertisement cheating continuously erodes the value and credibility of digital advertising and seriously harms the security of current internet services.

To address this situation, the industry generally sets statistical rules based on access frequency, behavior frequency, regional targeting, delivery time interval, and the like, and then performs advertisement abnormal traffic detection based on these rules. However, with this approach an attack typically occurs first, is discovered later, and only then are the rules supplemented. A long lag may occur in between, so the approach cannot keep pace with the attack-and-defense rhythm of current advertising cheating organizations. In addition, practice shows that this approach generally suffers from both a high miss rate for abnormal traffic and a high misjudgment rate for normal traffic. For example, this approach may fail to identify cheating traffic generated using the shuffle technique.

Disclosure of Invention

In view of the above-identified problems of the prior art, embodiments of the present specification provide a method, apparatus, computing device, and machine-readable storage medium for advertisement anomaly traffic detection.

In one aspect, an embodiment of the present specification provides a method for advertisement abnormal traffic detection, including: obtaining historical advertisement log data in a historical time period, wherein the historical advertisement log data comprises data obtained by at least one device participating in advertisement in the historical time period; generating feature data based on the historical advertisement log data, wherein the feature data is used for representing advertisement features and user features participating in the advertisement; encoding the feature data to generate a target training sample; and performing machine learning model training based on the target training sample for advertisement abnormal traffic detection.

In another aspect, embodiments of the present specification provide a method for advertisement abnormal traffic detection, including: acquiring current advertisement log data in a specified time period, wherein the current advertisement log data comprises data obtained by at least one device participating in advertisement in the specified time period; generating current feature data based on the current advertisement log data, wherein the current feature data is used for representing advertisement features and user features participating in the advertisement; encoding the current feature data to generate target feature data; and processing the target feature data by using a trained machine learning model to obtain a prediction result, wherein the prediction result is used for indicating whether abnormal traffic exists in the current advertisement log data.

In another aspect, an embodiment of the present specification provides an apparatus for advertisement abnormal traffic detection, including: an acquisition unit, configured to acquire historical advertisement log data in a historical time period, wherein the historical advertisement log data includes data obtained by at least one device participating in advertisement in the historical time period; a generating unit, configured to generate feature data based on the historical advertisement log data, wherein the feature data is used for representing advertisement features and user features participating in the advertisement; an encoding unit, configured to encode the feature data to generate a target training sample; and a training unit, configured to perform machine learning model training based on the target training sample for advertisement abnormal traffic detection.

In another aspect, an embodiment of the present specification provides an apparatus for advertisement abnormal traffic detection, including: an acquisition unit, configured to acquire current advertisement log data in a specified time period, wherein the current advertisement log data comprises data obtained by at least one device participating in advertisement in the specified time period; a generating unit, configured to generate current feature data based on the current advertisement log data, wherein the current feature data is used for representing advertisement features and user features participating in the advertisement; an encoding unit, configured to encode the current feature data to generate target feature data; and a prediction unit, configured to process the target feature data by using a trained machine learning model to obtain a prediction result, wherein the prediction result is used for indicating whether abnormal traffic exists in the current advertisement log data.

In another aspect, embodiments of the present description provide a computing device comprising: at least one processor; a memory in communication with the at least one processor having executable code stored thereon, which when executed by the at least one processor causes the at least one processor to implement the method provided by the first aspect described above.

In another aspect, embodiments of the present specification provide a computing device comprising: at least one processor; a memory in communication with the at least one processor having executable code stored thereon, which when executed by the at least one processor causes the at least one processor to implement the method provided by the second aspect above.

In another aspect, embodiments of the present description provide a machine-readable storage medium storing executable code that, when executed, causes a machine to perform the method provided by the first aspect described above.

In another aspect, embodiments of the present description provide a machine-readable storage medium storing executable code that, when executed, causes a machine to perform the method provided by the second aspect described above.

Drawings

The foregoing and other objects, features and advantages of the embodiments of the present specification will become more apparent from the following more particular description of the embodiments of the present specification, as illustrated in the accompanying drawings in which like reference characters generally represent like elements throughout.

FIG. 1 is a schematic flow diagram of an overall process for ad anomaly traffic detection based on machine learning techniques, according to some embodiments.

FIG. 2 is a schematic flow diagram of a method for ad anomaly traffic detection, according to some embodiments.

FIG. 3 is a schematic flow chart diagram of a method for ad anomaly traffic detection, according to some embodiments.

FIG. 4 is a schematic flow chart diagram of one example of a process for processing historical advertisement log data.

FIG. 5 is a schematic flow chart diagram of one example of a process for machine learning model training.

FIG. 6 is a schematic block diagram of an apparatus for ad anomaly traffic detection, according to some embodiments.

FIG. 7 is a schematic block diagram of an apparatus for ad anomaly traffic detection, according to some embodiments.

FIG. 8 is a hardware block diagram of a computing device for advertisement anomaly traffic detection, according to some embodiments.

FIG. 9 is a hardware block diagram of a computing device for advertisement anomaly traffic detection, according to some embodiments.

Detailed Description

The subject matter described herein will now be discussed with reference to various embodiments. It should be understood that these examples are discussed only to enable those skilled in the art to better understand and implement the subject matter described herein, and are not intended to limit the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the scope of the claims. Various embodiments may omit, replace, or add various procedures or components as desired.

As used herein, the term "include" and its variants are open-ended terms meaning "including, but not limited to". The term "based on" means "based at least in part on". The terms "one embodiment" and "an embodiment" mean "at least one embodiment". The term "another embodiment" means "at least one other embodiment". The terms "first," "second," and the like may refer to different or the same object. Other definitions, whether explicit or implicit, may be included below, and a definition of a term is consistent throughout the specification unless the context clearly dictates otherwise.

At present, the abnormal traffic generated by advertisement cheating has caused great harm to the digital advertising business. Abnormal traffic generally refers to non-genuine traffic generated for an advertisement (e.g., forged or fabricated traffic, etc.). However, the conventional method of detecting abnormal traffic based on statistical rules may lag seriously and has difficulty keeping up with rapidly changing abnormal traffic. In addition, both the miss rate and the misjudgment rate of that method are high.

With the rapid development of artificial intelligence, machine learning, as a branch of artificial intelligence, has also made breakthrough progress and shows superior prediction performance in various fields. Machine learning can discover and learn underlying patterns by analyzing existing data, and can use what it has learned to make judgments and predictions about unknown data.

In view of this, embodiments herein provide a technical solution for advertisement anomaly traffic detection based on machine learning techniques. In the technical solution, a training sample may be obtained by performing a series of appropriate processing on historical advertisement log data, and then a machine learning model is trained based on the training sample. Thereafter, current ad log data can be predicted using the trained machine learning model. The whole process can be completed quickly and efficiently, and reliable prediction results can be provided, so that favorable decision basis is provided for advertisement cheating prevention.

The technical solutions herein will be described below with reference to specific embodiments.

As shown in fig. 1, the process may mainly include four parts, namely data access, data processing, model training, and model prediction.

In the data access section, raw data may be acquired, such as historical advertisement log data, user tag data from a Data Management Platform (DMP), data from a Distributed Invalid Traffic Filter (DIF) (e.g., an invalid traffic filter list), and the like, each of which is described in detail below.

In the data processing portion, the raw data may be processed to obtain training samples suitable for training a machine learning model. For example, in this section, corresponding feature information may be extracted from the historical advertisement log data, encoded, tagged with DIF-based data, and so forth.

In the model training portion, machine learning model training may be performed based on training samples. For example, one or more machine learning models may be trained based on training samples, resulting in a target machine learning model. In the case where a plurality of machine learning models are trained, an optimal machine learning model may be selected as a target machine learning model from the trained plurality of machine learning models.

In the model prediction part, the target machine learning model can be used to process the advertisement log data to be analyzed to obtain a prediction result. For example, the prediction result may indicate whether there is abnormal traffic in the advertisement log data. Therefore, based on the prediction result, subsequent abnormal flow early warning or filtering and the like can be carried out.

It should be understood that each part described with respect to fig. 1 may be performed in one or more loop iterations according to the actual situation, which should be determined according to the specific implementation and is not limited herein. For example, when the prediction result of the model is found not to meet expectations, the data access, the data processing, the model training and the model prediction can be performed again.

Therefore, in this technical solution, the historical advertisement log data can be converted into structured data usable for modeling by combining and processing the historical advertisement log data, the DMP user tag data, the DIF data and the like, so that effective and reliable training samples are provided for a machine learning model applicable to the advertising business. Further, advertisement abnormal traffic can be detected efficiently and accurately based on the trained machine learning model. Compared with the existing mode of detection based on manually set statistical rules, this technical solution can effectively mitigate problems such as missed detection and misjudgment.

Furthermore, in existing statistical rule-based approaches, there is a severe lag between being attacked and discovering the attack, so the time before an advertisement placement decision can be made may be long (e.g., currently it can take at least one month). With the technical solution herein, the machine learning model can quickly provide a prediction result about abnormal traffic, so that traffic early warning and/or traffic filtering and the like can be carried out quickly on the basis of the prediction result, and timeliness is significantly improved. For example, the machine learning model may determine whether there is abnormal traffic based on the advertisement log data within the last several hours, thereby providing a useful basis for the next advertisement placement decision (e.g., traffic early warning and/or filtering, etc.). Thus, to some extent, the technical solution herein enables an effective transition from advertising cheating monitoring to advertising cheating prevention.

Therefore, the technical scheme can improve the value of digital advertisement putting, and helps advertisers gradually get rid of the threat of advertisement cheating.

FIG. 2 is a schematic flow diagram of a method for ad anomaly traffic detection, according to some embodiments. In the method of fig. 2, emphasis will be placed on how raw data is processed for training a machine learning model.

As shown in FIG. 2, in step 202, historical advertisement log data over a historical period of time may be obtained. For example, historical advertisement log data may include data resulting from at least one device engaging in advertisements over a historical period of time.

In step 204, feature data may be generated based on the historical advertisement log data. For example, the feature data may be used to represent advertisement features as well as user features participating in the advertisement.

In step 206, the feature data may be subjected to an encoding process to generate a target training sample.

In step 208, machine learning model training may be performed based on the target training samples for ad anomaly traffic detection.

In this embodiment, the historical advertisement log data is processed to obtain feature data representing advertisement features and user features, and then the feature data is encoded to obtain reliable target training samples, so that machine learning model training is performed based on the target training samples, and thus a reliable machine learning model for advertisement abnormal traffic detection can be provided.

In embodiments herein, the historical time period may be determined based on various factors such as the specific implementation, the business requirements, and the like. For example, the historical time period may be the past week, two weeks, one month, three months, six months, etc., without limitation.

In embodiments herein, historical advertisement log data may include various data generated by various devices during stages such as advertisement bidding, exposure, and clicking within the historical time period. For example, such data may include various information such as device type, device screen size, user identification, network connection mode, Internet Protocol (IP) address, advertisement plan, advertisement spot, advertisement material, advertisement media, advertisement type, log type, access time, number of exposures, number of clicks, and the like. It should be understood that these specific details are merely illustrative and that historical advertisement log data may also include other information in different embodiments, which is not limited herein. Further, in some embodiments, the device may include devices running the iOS or Android operating system, among others.

In some embodiments, in step 204, the historical advertisement log data may be organized by device identification of at least one device recorded in the historical advertisement log data to determine valid log data. Feature data may then be generated based on the valid log data.

For convenience of description, data obtained after organizing the historical advertisement log data according to the device identification of at least one device may be referred to as base log data. Typically, the historical advertisement log data may be a plurality of records ordered by time. In embodiments herein, the plurality of records may be grouped by a device identification of at least one device, resulting in base log data. For example, the base log data may include a plurality of sets of records, each of which may correspond to a different device identification.
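The grouping step can be illustrated with a minimal Python sketch. It is not the Hive-based implementation mentioned in this description; the record layout and the field names `device_id`, `log_type`, and `ts` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_by_device(records):
    """Group time-ordered log records into one record set per device identification."""
    groups = defaultdict(list)
    for record in records:
        # Records keep their original (time) order inside each group.
        groups[record["device_id"]].append(record)
    return dict(groups)


# Hypothetical records; field names are assumptions, not taken from the source.
logs = [
    {"device_id": "dev-a", "log_type": "bid", "ts": 1},
    {"device_id": "dev-b", "log_type": "bid", "ts": 2},
    {"device_id": "dev-a", "log_type": "exposure", "ts": 3},
]
base_log_data = group_by_device(logs)
```

Here `base_log_data` maps each device identification to its own set of records, which corresponds to the "plurality of sets of records" described above.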

In one implementation, a Hive tool may be used on a Hadoop big data platform to organize historical ad log data into sets of records grouped by device identification.

As is known, Hadoop is a framework for distributed data storage and computation. It excels at storing large sets of semi-structured data and enables distributed processing of large amounts of data.

Hive is a data warehouse analysis system built on Hadoop. It provides rich SQL query capabilities for analyzing data stored on the Hadoop platform, and it can map structured data files to database tables. Thus, in one implementation herein, historical ad log data may be stored on a Hadoop big data platform and then grouped using a Hive tool, resulting in base log data grouped by device identification.

It can be understood that the device identifier generally has stability, and therefore in this embodiment, grouping the historical log data according to the device identifier is more beneficial to subsequent abnormal traffic determination. In contrast, if historical ad log data is organized using some varying factor, such as an IP address, which may change frequently between normal traffic situations and abnormal traffic situations, this would be severely detrimental to abnormal traffic determinations.

In embodiments herein, the device identification may include an International Mobile Equipment Identity (IMEI), an Identifier for Advertisers (IDFA), and so on. For example, for a device whose operating system is iOS, its device identification may be an IDFA. For a device whose operating system is Android, the device identification may be an IMEI.

In some embodiments, the base log data may be used directly as valid log data as referred to herein.

In some embodiments, some invalid log data may be included in the base log data. For example, invalid log data may indicate that a device participated in an advertisement bid but did not generate any exposure or click operation. In this case, in order to ensure the accuracy and reliability of the data, invalid log data can be removed from the base log data, thereby obtaining valid log data.
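As a rough illustration of this filtering, the sketch below drops devices whose record sets contain bids only. It assumes that validity is decided per device and that each record carries a hypothetical `log_type` field with values such as `bid`, `exposure`, and `click`; neither assumption is stated in this description.

```python
def remove_invalid(base_log_data):
    """Keep only devices that produced at least one exposure or click record."""
    return {
        device_id: records
        for device_id, records in base_log_data.items()
        if any(r["log_type"] in ("exposure", "click") for r in records)
    }


# Hypothetical base log data grouped by device identification.
base = {
    "dev-a": [{"log_type": "bid"}, {"log_type": "exposure"}],
    "dev-b": [{"log_type": "bid"}],  # bid only: treated as invalid here
}
valid_log_data = remove_invalid(base)
```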

In either case, the feature data may then be generated based on the valid log data.

In some embodiments, the valid log data may include user identification information associated with the at least one device. The characteristic data may include user characteristic information. The user characteristic information may include a user characteristic associated with the at least one device.

In this case, the user tag data may be acquired from the DMP. The user tag data may then be associated with user identification information to obtain user characteristic information.

In general, the DMP can integrate scattered multi-party data and standardize and subdivide the data to provide various corresponding data tags.

In one implementation, the user tag data provided by the DMP may itself include user identification information, which can be matched against the user identification information in the valid log data to obtain the corresponding user characteristic information. In effect, a user profile is constructed from the user tag data of the DMP.

In embodiments herein, user tag data may include a variety of tag information, such as tags for demographic attributes, device characteristics, location characteristics, interest preferences, behavioral preferences (e.g., browsing, clicking, purchasing, etc.), consumption preferences, and so forth. Accordingly, the user characteristic information may include characteristics such as demographic attributes, device characteristics (such as screen size), location characteristics, interest preferences, behavioral preferences, consumption preferences, and the like.
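The association between DMP tags and log data amounts to a join on user identification. The following is a minimal sketch under assumed data shapes: `dmp_tags` maps a user identification to a dictionary of tags, and every log record has a hypothetical `user_id` field.

```python
def attach_user_tags(valid_log_data, dmp_tags):
    """Join DMP tag data to each device's records via user identification."""
    features = {}
    for device_id, records in valid_log_data.items():
        merged = {}
        # A device may be associated with more than one user identification.
        for user_id in sorted({r["user_id"] for r in records}):
            merged.update(dmp_tags.get(user_id, {}))
        features[device_id] = merged
    return features


# Hypothetical inputs; tag names are illustrative only.
valid_logs = {
    "dev-a": [{"user_id": "u1"}, {"user_id": "u1"}],
    "dev-b": [{"user_id": "u2"}],
}
dmp_tags = {"u1": {"interest": "sports", "age_band": "25-34"}}
user_features = attach_user_tags(valid_logs, dmp_tags)
```

Devices whose users have no DMP tags receive an empty feature dictionary in this sketch; how such cases are handled in practice is not specified here.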

It should be noted that the user tag data and the user characteristic information are only examples, and the embodiments herein are not limited to these contents.

In some embodiments, the feature data may include advertisement data and/or context information. The advertisement data may include advertisement attributes and the context information may include advertisement operating characteristics.

For example, the advertisement data may include various attribute information such as advertiser information, advertisement plans, advertisement types, ad spots, advertisement material, advertisement media, advertisement campaigns, and so forth. The context information may include IP address, real-time location, network type, number of exposures (e.g., at time granularities of minutes, hours, days, etc.), number of clicks (e.g., at time granularities of minutes, hours, days, etc.), time difference between ad exposure and bid, time difference between click and exposure, activity level (e.g., with respect to different ad spots, materials, media, ad campaigns, etc.), and so on. For example, an activity level may indicate inactivity, high activity, low activity, and the like.

It should be understood that the advertising data and contextual information are merely illustrative and may include different content in different implementations. This is not a limitation herein.

It can be seen that in embodiments herein, the feature data may be multidimensional. For example, it may include user characteristic information, advertisement data, and context information, which provides a strong data basis for machine learning model training. Of course, the feature data may also include information in other dimensions, depending on the particular implementation, which is not limited herein.

Thereafter, in step 206, the feature data may be subjected to an encoding process to generate a target training sample.

For example, in some embodiments, the plurality of features represented by the feature data may be divided into two classes of features, where a first class of features may be suitable for statistical analysis and a second class of features may be the remaining features that are not suitable for statistical analysis.

In this case, in order to further effectively characterize the first class of features, a statistical analysis may be performed on the first class of features to obtain a set of statistical features. For example, the statistical characteristics may include various applicable statistical indicators such as minimum, maximum, average, standard deviation, ranking, and the like, which are not limited herein. The statistical features may be calculated based on various applicable time granularities. For example, the time granularity may include minutes, hours, days, months, etc., which are not limited herein.
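As one possible illustration of such statistical features, the sketch below computes the minimum, maximum, average, and standard deviation of a device's exposure counts at hourly granularity. The input format (exposure times as integer seconds) and the function name are assumptions made for this example.

```python
from collections import Counter
from statistics import mean, pstdev


def hourly_exposure_stats(timestamps):
    """Summarize one device's exposures per hour (timestamps in integer seconds)."""
    per_hour = Counter(ts // 3600 for ts in timestamps)
    # Only hours with at least one exposure are counted in this sketch.
    counts = list(per_hour.values())
    if not counts:
        return {"min": 0, "max": 0, "mean": 0.0, "std": 0.0}
    return {
        "min": min(counts),
        "max": max(counts),
        "mean": mean(counts),
        "std": pstdev(counts),  # population standard deviation
    }


# Three exposures in hour 0, one in hour 1, two in hour 2.
stats = hourly_exposure_stats([0, 10, 20, 3600, 7200, 7300])
```

Other granularities (minutes, days, months) would follow by changing the bucket size.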

The set of statistical features and the second class of features may then be encoded to obtain the original training sample. For example, the statistical features and the second class of features may be encoded into a sample format suitable for training of the machine learning model using various suitable encoding methods (e.g., one-hot encoding, etc.).
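One-hot encoding of a categorical feature can be sketched as follows. The helper names `fit_vocab` and `one_hot` are hypothetical, and mapping unseen values to an all-zero vector is a design choice of this example, not something stated in this description.

```python
def fit_vocab(values):
    """Assign a stable index to every distinct category value."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


def one_hot(value, vocab):
    """Encode a category as a 0/1 vector; unseen values yield all zeros."""
    vector = [0] * len(vocab)
    if value in vocab:
        vector[vocab[value]] = 1
    return vector


# Hypothetical network-type feature.
vocab = fit_vocab(["wifi", "4g", "wifi"])
encoded = one_hot("wifi", vocab)
```

Numeric statistical features could be concatenated with such vectors to form one sample per device.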

Thereafter, at least a portion of the samples may be extracted from the original training samples as target training samples.

In some implementations, for example where the sample size of the original training samples is large, an appropriate number of samples may be drawn as the target training samples. In other implementations, for example where the sample size of the original training samples is small, all of the original training samples may be taken as target training samples.

In some embodiments, the original training samples may include positive samples and negative samples. The positive and negative samples may be distinguished based on an invalid traffic filter list obtained from the DIF.

DIF may be a technology platform that makes and manages invalid traffic filtering lists. The DIF may be public to all its members, and each member may report device information that results in invalid traffic. This information is recorded in the invalid traffic filter list. For example, the device information may include a device identification.

In this way, the original training samples may be divided into positive and negative samples based on the invalid traffic filtering list provided by the DIF. The device identification associated with the positive sample may not be recorded in the invalid traffic filtering list and the device identification associated with the negative sample may be recorded in the invalid traffic filtering list. Thus, it can be appreciated that positive samples can represent samples of normal flow, while negative samples can represent samples of abnormal flow.

Further, in some embodiments, to ensure that negative samples are indeed associated with abnormal traffic, a sample may be treated as negative only if its associated device has been marked in the invalid traffic filtering list as producing invalid traffic by at least two members.

For example, in one implementation, the valid log data may be organized by device identification, as previously described. Thus, the feature data and the resulting raw training samples may also be organized according to device identification. For example, each sample may correspond to a device identification. In this way, the original training samples can be divided into positive and negative samples by looking up in an invalid traffic filter list.
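The labeling described above can be sketched as a lookup against the filter list. The representation of the list as a mapping from device identification to the set of reporting members is an assumption, as is the choice to leave devices reported by only one member unlabeled; this description does not say how such devices are treated.

```python
def label_samples(samples, filter_list, min_reporters=2):
    """Split per-device samples into positive (normal) and negative (abnormal)."""
    positives, negatives = [], []
    for device_id, features in samples.items():
        reporters = filter_list.get(device_id, set())
        if len(reporters) >= min_reporters:
            negatives.append((device_id, features))
        elif not reporters:
            positives.append((device_id, features))
        # Assumption: devices with too few reports are skipped as ambiguous.
    return positives, negatives


# Hypothetical samples and filter list.
samples = {"dev-a": [0, 1], "dev-b": [1, 0], "dev-c": [1, 1]}
filter_list = {"dev-b": {"member-1", "member-2"}, "dev-c": {"member-1"}}
positives, negatives = label_samples(samples, filter_list)
```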

Thereafter, target positive samples and target negative samples may be extracted from the positive and negative samples in the original training samples. Here, the ratio between the target positive samples and the target negative samples may be predefined.

For example, in some cases, the number of positive samples may be much greater than the number of negative samples. In this case, all negative samples may be taken as target negative samples, and the number of target positive samples to be extracted from the positive samples may then be determined based on the above-mentioned predefined ratio. Thereafter, that number of positive samples may be randomly drawn from the positive samples as the target positive samples.

Of course, in some cases, the number of negative samples may be greater than the number of positive samples. A corresponding number of negative samples may then be drawn as target negative samples in a similar manner.

In addition, although a random decimation principle is mentioned here, different decimation principles may be employed in different implementations, which is not limited herein.
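The sampling for the case where positive samples far outnumber negative samples might look like the sketch below. The parameter `pos_per_neg` stands in for the predefined ratio, and the fixed `seed` is only there to make the example reproducible; both names are hypothetical.

```python
import random


def sample_targets(positives, negatives, pos_per_neg=1.0, seed=0):
    """Keep all negatives and randomly draw positives at a predefined ratio."""
    rng = random.Random(seed)
    wanted = int(round(len(negatives) * pos_per_neg))
    n_pos = min(len(positives), wanted)
    target_positives = rng.sample(positives, n_pos)
    return target_positives, list(negatives)


# 100 hypothetical positive samples, 5 negative samples, ratio 3:1.
target_pos, target_neg = sample_targets(
    list(range(100)), ["n1", "n2", "n3", "n4", "n5"], pos_per_neg=3.0
)
```

The opposite case, with more negatives than positives, would swap the roles of the two lists.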

After the target training samples are obtained, machine learning model training may be performed.

In one implementation, at least one initial machine learning model may be trained, resulting in at least one available machine learning model. Then, at least one available machine learning model can be evaluated, and an available machine learning model with the optimal evaluation result is selected as the target machine learning model.

In a specific implementation, each initial machine learning model can undergo multiple iterations and parameter tuning, so that an available machine learning model is obtained.

In one implementation, the machine learning model may be implemented using Python. Python is an object-oriented, interpreted programming language that is widely used in the field of machine learning. Of course, in other implementations, the machine learning model may be implemented in other programming languages, which is not limited herein.

In embodiments herein, the machine learning model may be implemented using various suitable algorithms that predict an output as a probability score, such as a multi-layered perceptron algorithm, a naive bayes algorithm, a logistic regression algorithm, and so forth. Of course, the embodiments herein are not limited to these algorithms, and other suitable algorithms may also be employed.
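To show what "predicting an output as a probability score" can mean, here is a small logistic regression trained by batch gradient descent in plain Python. It is a toy sketch, not the training procedure of this description: the learning rate, epoch count, and the convention that label 1 denotes abnormal traffic are all assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(model, x):
    """Return the estimated probability that sample x is abnormal (label 1)."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy one-feature data set: label 1 = abnormal (assumed convention).
model = train_logistic([[0.0], [0.1], [0.9], [1.0]], [0, 0, 1, 1])
```

In practice an established library would normally be used instead of a hand-written trainer.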

In addition, various applicable methods may be adopted to evaluate the plurality of available machine learning models, such as the commonly used Receiver Operating Characteristic (ROC) curve, the Area Under the Curve (AUC) value, and the like, which are not limited herein.

For example, where the AUC values are employed for evaluation, the target machine learning model may be the model with the highest AUC value of a plurality of available machine learning models, thereby enabling the target machine learning model to provide reliable and accurate results.
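By way of a hedged example, AUC-based selection may be sketched as below. The AUC is computed here as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The names `auc` and `select_target_model`, and the dictionary of per-model scores, are hypothetical.

```python
def auc(labels, scores):
    """AUC as the probability that a randomly chosen positive sample
    is scored higher than a randomly chosen negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def select_target_model(labels, scores_by_model):
    """Return the name of the available model with the highest AUC.

    scores_by_model maps a model name to that model's scores on the
    same evaluation samples, in the same order as labels.
    """
    return max(scores_by_model, key=lambda name: auc(labels, scores_by_model[name]))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; a rank-based computation would be preferable for large evaluation sets.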

The above focuses on the whole process of machine learning model training. The process of how to use the trained machine learning model for ad anomaly traffic detection will be described below.

It should first be noted that, to help distinguish the training process from the prediction process, some terms below are qualified with the word "current".

In addition, it can be understood that, when the machine learning model is used for prediction, some of the processing applied to the current advertisement log data is similar to the processing applied to the historical advertisement log data during training. Therefore, for brevity of description, some specific operations in fig. 3 will not be described in detail.

As shown in FIG. 3, in step 302, current ad log data over a specified time period may be obtained. For example, the current ad log data may include data resulting from at least one device participating in an ad for a specified period of time.

The specified time period may be determined according to various factors such as the specific implementation, service requirements, and the like. For example, the specified time period may be four hours, eight hours, twenty-four hours, and the like. In one case, the specified time period may be the period immediately preceding the time at which an ad placement decision is to be made, such as the previous four hours, so as to provide a timely and effective basis for the next ad placement decision. This is by way of example only and is not intended to be limiting.
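As a simple sketch of selecting the log records that fall within such a specified time period, the following assumes that each record is a dictionary with a `ts` timestamp field; the field name and the function name `select_current_logs` are assumptions for illustration.

```python
from datetime import datetime, timedelta


def select_current_logs(records, decision_time, window_hours=4):
    """Keep records whose timestamp lies in the window_hours immediately
    preceding decision_time (start inclusive, decision_time exclusive)."""
    start = decision_time - timedelta(hours=window_hours)
    return [r for r in records if start <= r["ts"] < decision_time]
```

The default of four hours mirrors the example above; eight or twenty-four hours could be passed instead.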

The current ad log data may include the same or similar information fields as the historical ad log data described previously.

In step 304, current feature data may be generated based on the current advertisement log data. For example, the current feature data may be used to represent advertising features and user features participating in the advertisement.

The specific implementation of this step can refer to step 204 in fig. 2, and therefore, the detailed description thereof is omitted here.

For example, in some embodiments, current advertisement log data may be organized by device identification of at least one device to determine current valid log data.

In one case, the current advertisement log data may be organized according to a device identification of at least one device, generating current base log data. Then, the current invalid log data may be removed from the current base log data, resulting in current valid log data. For example, the current invalid log data may indicate that the device is engaged in an advertisement bid but has not generated an exposure and click operation.

Thereafter, current feature data may be generated based on the current valid log data.

In some embodiments, the currently valid log data may include user identification information associated with the at least one device. The current feature data may include current user feature information, wherein the current user feature information may include user features associated with the at least one device.

In this case, the user tag data may be acquired from the DMP. Then, the user tag data may be associated with the user identification information in the currently valid log data to generate current user characteristic information.

In some embodiments, the current feature data may include current advertisement data and current context information. The current advertisement data may include advertisement attributes and the current context information may include advertisement operating characteristics. In this case, the current advertisement data and current context information may be extracted from the current valid log data.

In step 306, the current feature data may be subjected to an encoding process to generate target feature data.

In some embodiments, the current feature data may be used to represent a plurality of features. The plurality of features may include a first class of features suitable for statistical analysis and a second class of features, other than the first class of features, not suitable for statistical analysis. Then a statistical analysis may be performed on the first class of features to generate a set of statistical features. The set of statistical features and the second class of features may then be encoded to generate target feature data.

In step 308, the target feature data may be processed using the trained machine learning model to obtain a prediction result. For example, the machine learning model herein may be the aforementioned target machine learning model. The prediction result may indicate whether there is abnormal traffic in the current advertisement log data.

For example, in one implementation, the prediction results may include device identifications and corresponding probability scores. The probability score may be positively correlated with the likelihood that the device is producing abnormal traffic. For example, if the probability score is high, the probability that the device corresponding to the device identification generates abnormal traffic may be considered high; and if the probability score is low, the probability that the device corresponding to the device identification generates abnormal traffic may be considered low.

The prediction result can be used as a basis for subsequent traffic early warning, traffic filtering, and the like. For example, a score threshold may be set according to actual needs. Devices with a probability score above the score threshold may be treated as devices for which traffic early warning or traffic filtering is to be performed. This is by way of example only; in various implementations, the prediction results may be used in various forms.

Therefore, as can be seen from the above description, the target feature data is obtained by performing feature extraction and encoding processing on the current advertisement log data, and the target feature data is then processed by the machine learning model, so that a reliable prediction result can be provided quickly, giving a real-time and accurate basis for subsequent traffic early warning, traffic filtering, and the like.
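A minimal sketch of applying such a score threshold is given below. The function name `flag_devices` and the default threshold of 0.8 are assumptions for illustration; the embodiments leave the threshold to actual needs.

```python
def flag_devices(predictions, score_threshold=0.8):
    """Return the device identifications whose probability score exceeds
    the threshold, sorted for stable output.

    predictions maps a device identification to its probability score.
    """
    return sorted(dev for dev, score in predictions.items() if score > score_threshold)
```

The returned list could then be passed to whatever early warning or filtering step a given implementation uses.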

The technical solutions herein will be described below with reference to specific examples. It should be understood that the following examples are intended only to help those skilled in the art better understand the technical solutions herein, and are not intended to limit the scope thereof.

As shown in fig. 4, this example may mainly include three parts, namely data preprocessing, data association, and feature construction.

In the data preprocessing section, historical advertisement log data may be obtained. Typically, the historical advertisement log data is a plurality of records ordered by time. To facilitate subsequent analysis, the historical advertisement log data may be grouped by device identification recorded in the historical advertisement log data, and the resulting data may be referred to herein as base log data. For example, the base log data may include multiple sets of records, each set of records may correspond to a device identification.

For example, in this example, historical ad log data may be stored in Hadoop and the plurality of records sorted by time are converted into a plurality of sets of records grouped by device identification using Hive tools.

In addition, in order to ensure the accuracy of subsequent training samples, invalid log data can be removed from the basic log data to obtain valid log data.
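The preprocessing above may be illustrated with a small in-memory Python stand-in for the Hadoop and Hive processing. The record fields `device_id`, `ts`, and `event`, the event names, and the rule that a device is kept only if it has at least one exposure or click record are assumptions made for this sketch.

```python
from collections import defaultdict


def build_valid_log_data(records):
    """Group time-ordered records by device identification (base log data),
    then drop devices that only took part in bidding (valid log data)."""
    base = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["ts"]):
        base[rec["device_id"]].append(rec)
    # Assumed rule: a device with no exposure and no click record is invalid.
    valid = {
        dev: recs
        for dev, recs in base.items()
        if any(r["event"] in ("exposure", "click") for r in recs)
    }
    return dict(base), valid
```

Each key of the returned dictionaries corresponds to one device identification, and each value to that device's set of records.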

In the data association portion, the valid log data may be labeled.

For example, an invalid traffic filter list may be obtained from the DIF, thereby adding flags "abnormal traffic" and "normal traffic" to the valid log data. For example, in one implementation, valid log data may include multiple sets of records, each set of records corresponding to a device identification. If the device identifier corresponding to a certain group of records can be found in the invalid traffic filtering list, the group of records can be marked as 'abnormal traffic'; if the device identifier corresponding to a group of records is not found in the invalid traffic filtering list, the group of records can be marked as "normal traffic".

In some cases, a device identification labeled "abnormal traffic" may be one that has been marked as invalid traffic by at least two members in the invalid traffic filtering list. Thus, the reliability of the marking can be ensured.
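The marking described above may be sketched as follows. The layout of the invalid traffic filtering list, a mapping from device identification to the set of members that reported it, is an assumption for illustration, as are the names `label_devices` and `min_members`.

```python
def label_devices(valid_log_data, dif_list, min_members=2):
    """Label each device as "abnormal traffic" or "normal traffic".

    dif_list maps a device identification to the set of members that
    marked it as producing invalid traffic (hypothetical layout).
    A device is labeled abnormal only if at least min_members reported it.
    """
    labels = {}
    for device_id in valid_log_data:
        reporters = dif_list.get(device_id, set())
        if len(reporters) >= min_members:
            labels[device_id] = "abnormal traffic"
        else:
            labels[device_id] = "normal traffic"
    return labels
```

Setting `min_members=1` corresponds to the simpler rule in which any appearance in the list is sufficient.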

These DIF-based flags may provide a basis for subsequent separation of the original training samples into positive and negative examples.

However, it should be understood that although the marking process is described herein as being performed in the data association portion, in practice the marking process may be performed after the original training sample is obtained, thereby separating the original training sample into positive and negative samples. For example, each of the original training samples may correspond to a device identification since the log data has been previously grouped by device identification. In this way, individual samples may also be marked as positive or negative based on the invalid traffic filter list.

In addition, in the data association section, user tag data may be acquired from the DMP, and then the user tag data may be associated with the valid log data to obtain user characteristic information. In addition, advertisement data and context information may also be extracted from the valid log data.
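As a hedged illustration of associating user tag data with the valid log data, the following sketch joins the two on a user identifier. The `user_id` field, the shape of the tag data, and the function name `attach_user_tags` are assumptions; the actual DMP interface is not specified in the embodiments.

```python
def attach_user_tags(valid_log_data, dmp_tags):
    """Build user characteristic information per device.

    valid_log_data maps a device identification to its records, each
    carrying a "user_id" field; dmp_tags maps a user identifier to a
    dictionary of tags. Devices without matching tags get an empty dict.
    """
    user_features = {}
    for device_id, records in valid_log_data.items():
        user_id = records[0]["user_id"] if records else None
        user_features[device_id] = dict(dmp_tags.get(user_id, {}))
    return user_features
```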

In the feature construction section, feature construction may be performed based on user feature information, advertisement data, and context information.

For example, for the features capable of statistical analysis in the user feature information, advertisement data, and context information, the corresponding statistical features, such as minimum value, maximum value, average value, standard deviation, ranking, and the like, may be determined.
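For instance, the statistical features of one numeric feature across a device's records may be computed with the Python standard library as sketched below. The function name is hypothetical, the population standard deviation is an assumed choice, and ranking is omitted because it requires comparison across devices.

```python
import statistics


def statistical_features(values):
    """Summarize a non-empty list of numeric values for one device."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # population standard deviation
    }
```

The input could be, for example, the intervals between a device's consecutive click records.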

The statistical features and the remaining features not suitable for statistical analysis may then be encoded, which may also be understood herein as converting the features into a format suitable for machine learning training, resulting in the original training samples. The encoding may be performed in a variety of suitable ways, such as one-hot encoding, and the like.
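One-hot encoding of the remaining categorical features, alongside numeric statistical features that are passed through unchanged, may be sketched as follows. The function name and the split of keys into `numeric_keys` and `categorical_keys` are assumptions for illustration.

```python
def one_hot_encode(rows, numeric_keys, categorical_keys):
    """Convert feature dictionaries into numeric vectors.

    Numeric features are copied as floats; each categorical feature is
    expanded into one 0/1 column per distinct value seen in rows.
    Returns the column names and the encoded vectors.
    """
    vocab = {k: sorted({row[k] for row in rows}) for k in categorical_keys}
    columns = list(numeric_keys) + [
        f"{k}={v}" for k in categorical_keys for v in vocab[k]
    ]
    vectors = []
    for row in rows:
        vec = [float(row[k]) for k in numeric_keys]
        for k in categorical_keys:
            vec.extend(1.0 if row[k] == v else 0.0 for v in vocab[k])
        vectors.append(vec)
    return columns, vectors
```

In a prediction setting, the vocabulary built during training would need to be reused so that the columns line up; this sketch rebuilds it from the rows it is given.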

It should be understood that the specific processes in the above sections are not necessarily performed in the order described above. For example, in some cases, some processes may be performed simultaneously, some processes may be performed in a reversed order, and so on. This should be decided according to the specific implementation. This is not a limitation herein.

As shown in FIG. 5, the process may consist essentially of two parts, sample extraction and model training.

After the feature construction portion of fig. 4, the original training sample may be obtained. In the sample extraction section of fig. 5, at least a part of the training samples may be extracted from the original training samples as target training samples.

For example, as previously described, the original training samples may be divided into positive and negative samples based on the labeling result of DIF. In some cases, the ratio of positive and negative examples in the original training samples may not be suitable for direct machine learning model training, and thus, a corresponding number of positive and negative examples may be drawn from the original training samples based on a predetermined ratio between the target positive and negative examples to form the target training samples.

Machine learning model training may then be performed based on the target training samples. In the example of fig. 5, it is assumed that there are N initial machine learning models, which are shown in fig. 5 as initial machine learning model 1, initial machine learning model 2, ..., initial machine learning model N for ease of explanation.

These N initial machine learning models may be trained (e.g., iterated and tuned multiple times) based on the target training samples, resulting in N available machine learning models, which are shown in fig. 5 as available machine learning model 1, available machine learning model 2, ..., available machine learning model N.

The N available machine learning models may then be evaluated. For example, the ROC curve or AUC value may be used to evaluate the N available machine learning models. Then, the available machine learning model with the best evaluation result can be selected as the target machine learning model. Advertisement abnormal traffic detection may subsequently be performed using the target machine learning model.

As shown in fig. 6, the apparatus 600 may include an acquisition unit 602, a generating unit 604, an encoding unit 606, and a training unit 608.

The acquisition unit 602 may acquire historical advertisement log data for a historical period of time. The historical advertisement log data may include data resulting from at least one device engaging in advertisements over a historical period of time.

The generation unit 604 may generate feature data based on the historical advertisement log data. The characteristic data may be used to represent characteristics of the advertisement as well as characteristics of users participating in the advertisement.

The encoding unit 606 may perform an encoding process on the feature data to generate a target training sample.

The training unit 608 may perform machine learning model training based on the target training samples for ad anomaly traffic detection.

In some embodiments, the generating unit 604 may organize the historical advertisement log data by a device identification of at least one device to determine valid log data. The generating unit 604 may generate the feature data based on the valid log data.

In some embodiments, the generating unit 604 may organize the historical advertisement log data according to a device identification of at least one device, generating base log data. The generating unit 604 may remove invalid log data from the base log data, resulting in valid log data. The invalid log data may indicate that the device is engaged in an advertisement bid but has not generated an exposure and click operation.

The generating unit 604 may obtain user tag data from the data management platform. The generating unit 604 may associate the user tag data with the user identification information to generate user characteristic information.

In some embodiments, the characteristic data may include advertising data and contextual information. The advertisement data may include advertisement attributes and the contextual information may include advertisement operating characteristics.

The generating unit 604 may extract advertisement data and context information from the valid log data.

In some embodiments, the feature data may be used to represent a plurality of features including a first class of features suitable for statistical analysis and a second class of features other than the first class of features.

The encoding unit 606 may perform statistical analysis on the first class of features to generate a set of statistical features, and encode the set of statistical features and the second class of features to obtain the original training sample.

The training unit 608 may extract at least a portion of the samples from the original training samples as target training samples.

In some embodiments, the original training samples may include positive samples and negative samples. The positive and negative examples may be distinguished based on an invalid traffic filter list obtained from the DIF. The invalid traffic filter list may include device information that is marked by members of the DIF as producing invalid traffic.

The training unit 608 may extract target positive and negative examples from the positive and negative examples. The target training samples may include target positive samples and target negative samples, and a ratio between the target positive samples and the target negative samples may be predefined.

In some embodiments, devices associated with negative examples are marked in the invalid traffic filtering list by at least two members as producing invalid traffic.

In some embodiments, the training unit 608 may train the at least one initial machine learning model based on the target training samples, resulting in at least one available machine learning model. The training unit 608 may evaluate at least one available machine learning model. The training unit 608 may select an available machine learning model having an optimal evaluation result from the at least one available machine learning model as a target machine learning model for advertisement abnormal traffic detection.

As shown in fig. 7, the apparatus 700 may include an acquisition unit 702, a generating unit 704, an encoding unit 706, and a predicting unit 708.

The acquisition unit 702 may acquire current advertisement log data within a specified period of time. The current ad log data may include data resulting from at least one device engaging in an ad for a specified period of time.

The generation unit 704 may generate current feature data based on the current advertisement log data. The current characteristic data may be used to represent advertising characteristics and user characteristics participating in the advertisement.

The encoding unit 706 may perform encoding processing on the current feature data to generate target feature data.

The prediction unit 708 may process the target feature data using the trained machine learning model to obtain a prediction result. The prediction result may be used to indicate whether there is abnormal traffic in the current advertisement log data.

In some embodiments, the generating unit 704 may organize the current advertisement log data by a device identification of at least one device to determine current valid log data. The generation unit 704 may generate current feature data based on the current valid log data.

In some embodiments, the generating unit 704 may organize the current advertisement log data according to a device identification of at least one device, generating current base log data. The generating unit 704 may remove the currently invalid log data from the current base log data to obtain currently valid log data. The current invalid log data may indicate that the device is engaged in an advertisement bid but has not generated an exposure and click operation.

In some embodiments, the currently valid log data may include user identification information associated with the at least one device. The current characteristic data may include current user characteristic information. The current user characteristic information may include a user characteristic associated with the at least one device.

The generation unit 704 may obtain user tag data from the data management platform. The generating unit 704 may associate the user tag data with the user identification information to generate current user characteristic information.

In some embodiments, the current feature data may include current advertising data and current context information. The current advertisement data may include advertisement attributes and the current context information may include advertisement operating characteristics.

The generation unit 704 may extract current advertisement data and current context information from the current valid log data.

In some embodiments, the current feature data may be used to represent a plurality of features including a first class of features suitable for statistical analysis and a second class of features other than the first class of features.

The encoding unit 706 may perform a statistical analysis on the first type of features to generate a set of statistical features. The encoding unit 706 may encode the set of statistical features and the second class of features to generate target feature data.

The respective units of the apparatuses 600 and 700 may perform corresponding steps in the method embodiments of fig. 1-4, and therefore, for brevity of description, specific operations and functions of the respective units of the apparatuses 600 and 700 are not described herein again.

The apparatuses 600 and 700 described above may be implemented by hardware, by software, or by a combination of hardware and software. For example, when implemented in software, the apparatus 600 or 700 may be formed by a processor of the device in which it resides reading corresponding executable code from storage (e.g., a non-volatile memory) into memory and executing it.

FIG. 8 is a hardware block diagram of a computing device for advertisement anomaly traffic detection, according to some embodiments. As shown in fig. 8, computing device 800 may include at least one processor 802, storage 804, memory 806, and communication interface 808, with the at least one processor 802, storage 804, memory 806, and communication interface 808 being coupled together via a bus 810. The at least one processor 802 executes at least one executable code (i.e., the elements described above as being implemented in software) stored or encoded in the storage 804.

In some embodiments, the executable code stored in the storage 804, when executed by the at least one processor 802, causes the computing device to implement the respective processes described above in connection with fig. 1-2 and 4-5.

As shown in fig. 9, computing device 900 may include at least one processor 902, storage 904, memory 906, and a communication interface 908, and the at least one processor 902, storage 904, memory 906, and communication interface 908 are connected together via a bus 910. The at least one processor 902 executes at least one executable code (i.e., the elements described above as being implemented in software) stored or encoded in the storage 904.

In some embodiments, the executable code stored in the storage 904, when executed by the at least one processor 902, causes the computing device to implement the respective processes described above in connection with fig. 1 and 3.

Computing devices 800 and 900 may be implemented in any suitable form known in the art, including, for example, but not limited to, desktop computers, laptop computers, smart phones, tablet computers, consumer electronics devices, wearable smart devices, and the like.

Embodiments of the present specification also provide a machine-readable storage medium. The machine-readable storage medium may store executable code that, when executed by a machine, causes the machine to implement the respective processes of the method embodiments described above with reference to fig. 1-2 and 4-5.

Embodiments of the present specification also provide a machine-readable storage medium. The machine-readable storage medium may store executable code that, when executed by a machine, causes the machine to implement the respective processes of the method embodiments described above with reference to fig. 1 and 3.

For example, the machine-readable storage medium may include, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Static Random Access Memory (SRAM), a hard disk, a flash Memory, and the like.

It should be understood that the embodiments in this specification are described in a progressive manner, and that the same or similar parts between the embodiments may be referred to each other, and each embodiment is described with emphasis on differences from the other embodiments. For example, as for the embodiments related to the apparatus, the computing device, and the machine-readable storage medium, since they are substantially similar to the method embodiments, the description is simple, and the relevant points can be referred to the partial description of the method embodiments.

Specific embodiments of this specification have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Not all steps and elements in the above flows and system structure diagrams are necessary, and some steps or elements may be omitted according to actual needs. The apparatus structures described in the above embodiments may be physical structures or logical structures, that is, some units may be implemented by the same physical entity, or some units may be implemented by a plurality of physical entities respectively, or some units may be implemented by some components in a plurality of independent devices together.

The term "exemplary" used throughout this specification means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous" over other embodiments. The detailed description includes specific details for the purpose of providing an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.

Although the embodiments of the present disclosure have been described in detail with reference to the accompanying drawings, the embodiments of the present disclosure are not limited to the specific details of the embodiments, and various modifications may be made within the technical spirit of the embodiments of the present disclosure, which belong to the scope of the embodiments of the present disclosure.

The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for ad anomaly traffic detection, comprising:

obtaining historical advertisement log data in a historical time period, wherein the historical advertisement log data comprises data obtained by at least one device participating in advertisement in the historical time period;

generating feature data based on the historical advertisement log data, wherein the feature data is used for representing advertisement features and user features participating in the advertisement;

encoding the feature data to generate a target training sample;

and performing machine learning model training based on the target training sample for advertisement abnormal traffic detection.

2. The method of claim 1, wherein generating the feature data comprises:

organizing the historical advertisement log data according to the device identification of the at least one device to determine valid log data;

generating the feature data based on the valid log data.

3. The method of claim 2, wherein determining the valid log data comprises:

organizing the historical advertisement log data according to the device identification of the at least one device to generate base log data;

and removing invalid log data from the base log data to obtain the valid log data, wherein the invalid log data indicates that the device participates in advertisement bidding but does not generate exposure and click operations.

4. The method of claim 2, wherein the valid log data comprises user identification information associated with the at least one device, the feature data comprises user feature information, wherein the user feature information comprises a user feature associated with the at least one device;

generating the feature data includes:

acquiring user tag data from a data management platform;

and associating the user tag data with the user identification information to generate the user feature information.

5. The method of claim 2, wherein the feature data comprises advertisement data and context information, wherein the advertisement data comprises advertisement attributes and the context information comprises advertisement operating characteristics;

generating the feature data includes:

extracting the advertisement data and the context information from the valid log data.

6. The method of claim 1, wherein the feature data is used to represent a plurality of features, the plurality of features including a first class of features suitable for statistical analysis and a second class of features other than the first class of features;

performing encoding processing on the feature data to generate a target training sample, including:

performing statistical analysis on the first type of features to generate a group of statistical features;

coding the group of statistical characteristics and the second type of characteristics to obtain an original training sample;

and extracting at least a portion of samples from the original training samples as the target training samples.

7. The method of claim 6, wherein the original training samples comprise positive and negative samples, wherein the positive and negative samples are distinguished based on an invalid traffic filtering list obtained from a distributed invalid traffic filter, the invalid traffic filtering list comprising device information marked by members of the distributed invalid traffic filter as producing invalid traffic;

extracting at least a portion of samples from the original training samples as the target training samples, including:

extracting target positive and negative examples from the positive and negative examples, wherein the target training samples comprise the target positive and negative examples, and a ratio between the target positive and negative examples is predefined.

8. The method of claim 7, wherein the device associated with the negative examples is marked in the invalid traffic filtering list by at least two members as generating invalid traffic.

9. The method of claim 1, wherein performing machine learning model training based on the target training samples comprises:

training at least one initial machine learning model based on the target training sample to obtain at least one available machine learning model;

evaluating the at least one available machine learning model;

and selecting the available machine learning model with the optimal evaluation result from the at least one available machine learning model as a target machine learning model for detecting the abnormal advertisement traffic.

10. A method for ad anomaly traffic detection, comprising:

acquiring current advertisement log data in a specified time period, wherein the current advertisement log data comprises data obtained by at least one device participating in advertisement in the specified time period;

generating current feature data based on the current advertisement log data, wherein the current feature data is used for representing advertisement features and user features participating in the advertisement;

encoding the current feature data to generate target feature data;

and processing the target feature data by using a trained machine learning model to obtain a prediction result, wherein the prediction result is used for indicating whether abnormal traffic exists in the current advertisement log data.
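
The final prediction step of claim 10 might look like the following minimal sketch. It assumes that the trained model is a callable returning an abnormality score per encoded device vector and that a fixed threshold turns scores into a decision; the claim specifies neither the model interface nor a threshold.

```python
def detect_abnormal_traffic(encoded_by_device, model, threshold=0.5):
    """Score each device's encoded features and summarize the result.

    encoded_by_device: dict device_id -> encoded feature vector (hypothetical layout).
    model: callable mapping a feature vector to a score in [0, 1] (assumption).
    """
    flagged = sorted(
        device for device, vector in encoded_by_device.items()
        if model(vector) >= threshold
    )
    return {"abnormal_traffic_present": bool(flagged), "flagged_devices": flagged}
```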

11. The method of claim 10, wherein generating the current feature data comprises:

organizing the current advertisement log data according to the device identification of the at least one device to determine current valid log data;

generating the current feature data based on the current valid log data.

12. The method of claim 11, wherein determining the current valid log data comprises:

organizing the current advertisement log data according to the device identification of the at least one device to generate current basic log data;

and removing current invalid log data from the current basic log data to obtain the current valid log data, wherein the current invalid log data indicates that a device participates in advertisement bidding but does not generate any exposure or click operation.
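
The grouping and filtering of claim 12 can be sketched as follows. The record fields `device_id` and `event`, and the event values `"bid"`, `"exposure"`, and `"click"`, are hypothetical; the actual log schema is not given in the claims.

```python
from collections import defaultdict

def valid_logs_by_device(log_records):
    """Group records by device, then drop devices that only ever bid."""
    # Basic log data: records organized per device identification.
    basic = defaultdict(list)
    for record in log_records:
        basic[record["device_id"]].append(record)
    # Valid log data: keep devices with at least one exposure or click.
    return {
        device: records
        for device, records in basic.items()
        if any(r["event"] in ("exposure", "click") for r in records)
    }

logs = [
    {"device_id": "a", "event": "bid"},
    {"device_id": "a", "event": "exposure"},
    {"device_id": "b", "event": "bid"},
    {"device_id": "c", "event": "bid"},
    {"device_id": "c", "event": "click"},
]
valid = valid_logs_by_device(logs)  # device "b" is removed
```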

13. The method of claim 11, wherein the current valid log data includes user identification information associated with the at least one device, the current feature data includes current user feature information, wherein the current user feature information includes user features associated with the at least one device;

generating the current feature data comprises:

acquiring user tag data from a data management platform;

and associating the user tag data with the user identification information to generate the current user characteristic information.
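
Associating user tag data with user identification information, as in claim 13, amounts to a keyed join. In the minimal sketch below, the `user_id` field and the shape of the tag data (a mapping from user identifier to a list of tags) are assumptions; a real data management platform would expose its own interface.

```python
def attach_user_tags(valid_logs, user_tags):
    """Join per-device log records with user tags via the user identifier.

    valid_logs: dict device_id -> list of records, each with a "user_id" key.
    user_tags: dict user_id -> list of tag strings (hypothetical platform export).
    Returns dict device_id -> sorted list of distinct tags.
    """
    features = {}
    for device_id, records in valid_logs.items():
        tags = set()
        for record in records:
            # Users unknown to the platform contribute no tags.
            tags.update(user_tags.get(record.get("user_id"), ()))
        features[device_id] = sorted(tags)
    return features
```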

14. The method of claim 11, wherein the current feature data comprises current advertisement data and current context information, wherein the current advertisement data comprises advertisement attributes and the current context information comprises advertisement operating features;

generating the current feature data comprises:

extracting the current advertisement data and the current context information from the current valid log data.

15. The method of claim 10, wherein the current feature data is used to represent a plurality of features, the plurality of features including a first class of features suitable for statistical analysis and a second class of features other than the first class of features;

encoding the current feature data to generate the target feature data comprises:

and encoding the group of statistical features and the second class of features to generate the target feature data.
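
Claim 15 refers to a group of statistical features alongside the second class of features. One possible reading is sketched below: numeric values of the first class are summarized by simple statistics, and the second class is one-hot encoded. The chosen statistics (count, sum, maximum, mean), the one-hot scheme, and all names are assumptions for illustration; the claim does not fix any of them.

```python
from statistics import mean

def encode_device(numeric_values, categorical, vocab):
    """Build one feature vector from both classes of features.

    numeric_values: non-empty list of numbers (first class, e.g. hourly click counts).
    categorical: dict field -> value (second class).
    vocab: dict field -> ordered list of known values for that field.
    """
    # Group of statistical features derived from the first class.
    stats = [
        float(len(numeric_values)),
        float(sum(numeric_values)),
        float(max(numeric_values)),
        float(mean(numeric_values)),
    ]
    # One-hot encoding of the second class, in a stable field order.
    one_hot = []
    for field in sorted(vocab):
        value = categorical.get(field)
        one_hot.extend(1.0 if value == known else 0.0 for known in vocab[field])
    return stats + one_hot

vector = encode_device([1, 2, 3], {"os": "android"}, {"os": ["android", "ios"]})
```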

16. An apparatus for ad anomaly traffic detection, comprising:

an acquisition unit, configured to acquire historical advertisement log data in a historical time period, where the historical advertisement log data includes data obtained by participation of at least one device in an advertisement in the historical time period;

a generating unit, configured to generate feature data based on the historical advertisement log data, wherein the feature data is used for representing advertisement features and user features participating in advertisements;

an encoding unit, configured to encode the feature data to generate a target training sample;

and a training unit, configured to perform machine learning model training based on the target training sample for advertisement abnormal traffic detection.

17. The apparatus of claim 16, wherein the generating unit is further configured to:

generating the feature data based on the valid log data.

18. The apparatus of claim 17, wherein the generating unit is further configured to:

19. The apparatus of claim 17, wherein the valid log data comprises user identification information associated with the at least one device, the feature data comprises user feature information, wherein the user feature information comprises a user feature associated with the at least one device;

the generation unit is further configured to:

acquiring user tag data from a data management platform;

20. The apparatus of claim 17, wherein the feature data comprises advertisement data and context information, wherein the advertisement data comprises advertisement attributes and the context information comprises advertisement operating features;

the generating unit is further configured to:

21. The apparatus of claim 16, wherein the feature data is indicative of a plurality of features, the plurality of features including a first class of features suitable for statistical analysis and a second class of features other than the first class of features;

the encoding unit is further configured to:

the training unit is further configured to:

22. The apparatus of claim 21, wherein the original training samples comprise positive and negative samples, wherein the positive and negative samples are distinguished based on an invalid traffic filtering list obtained from a distributed invalid traffic filter, the invalid traffic filtering list comprising device information marked by members of the distributed invalid traffic filter as producing invalid traffic;

the training unit is further configured to:

23. The apparatus of claim 16, wherein the training unit is further configured to:

evaluating the at least one available machine learning model;

24. An apparatus for ad anomaly traffic detection, comprising:

an acquisition unit, configured to acquire current advertisement log data in a specified time period, wherein the current advertisement log data comprises data obtained by at least one device participating in advertisement in the specified time period;

a generating unit, configured to generate current feature data based on the current advertisement log data, where the current feature data is used to represent advertisement features and user features participating in advertisements;

an encoding unit, configured to encode the current feature data to generate target feature data;

and a prediction unit, configured to process the target feature data by using a trained machine learning model to obtain a prediction result, wherein the prediction result is used for indicating whether abnormal traffic exists in the current advertisement log data.

25. The apparatus of claim 24, wherein the generating unit is further configured to:

generating the current feature data based on the current valid log data.

26. The apparatus of claim 25, wherein the generating unit is further configured to:

and removing current invalid log data from the current basic log data to obtain the current valid log data, wherein the current invalid log data indicates that a device participates in advertisement bidding but does not generate any exposure or click operation.

27. The apparatus of claim 25, wherein the current valid log data comprises user identification information associated with the at least one device, the current feature data comprises current user feature information, wherein the current user feature information comprises a user feature associated with the at least one device;

the generation unit is further configured to:

acquiring user tag data from a data management platform;

28. The apparatus of claim 25, wherein the current feature data comprises current advertisement data and current context information, wherein the current advertisement data comprises advertisement attributes and the current context information comprises advertisement operating features;

the generation unit is further configured to:

29. The apparatus of claim 24, wherein the current feature data is used to represent a plurality of features, the plurality of features comprising a first class of features suitable for statistical analysis and a second class of features other than the first class of features;

the encoding unit is further configured to:

and encoding the group of statistical features and the second class of features to generate the target feature data.

30. A computing device, comprising:

at least one processor;

a memory in communication with the at least one processor having executable code stored thereon, which when executed by the at least one processor causes the at least one processor to implement the method of any one of claims 1 to 9.

31. A computing device, comprising:

at least one processor;

a memory in communication with the at least one processor having executable code stored thereon, which when executed by the at least one processor causes the at least one processor to implement the method of any one of claims 10 to 15.

32. A machine readable storage medium storing executable code that when executed causes a machine to perform a method according to any one of claims 1 to 9.

33. A machine readable storage medium storing executable code that when executed causes a machine to perform the method of any of claims 10 to 15.