WO2021027362A1 - Information pushing method and apparatus based on data analysis, computer device, and storage medium - Google Patents

Info

Publication number
WO2021027362A1
WO2021027362A1 PCT/CN2020/092856 CN2020092856W WO2021027362A1 WO 2021027362 A1 WO2021027362 A1 WO 2021027362A1 CN 2020092856 W CN2020092856 W CN 2020092856W WO 2021027362 A1 WO2021027362 A1 WO 2021027362A1
Authority
WO
WIPO (PCT)
Prior art keywords
potential user
user
potential
data
target
Prior art date
Application number
PCT/CN2020/092856
Other languages
French (fr)
Chinese (zh)
Inventor
卢显锋
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021027362A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0251 Targeted advertisements
    • G06Q30/0269 Targeted advertisements based on user profile or attribute
    • G06Q30/0271 Personalized advertisement
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0277 Online advertisement
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/55 Push-based network services

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to an information push method, device, computer equipment and storage medium based on data analysis.
  • the embodiments of the present application provide an information push method, device, computer equipment, and storage medium based on data analysis, aiming to solve the problem of low accuracy in mining online customers willing to apply for insurance.
  • An embodiment of the present application provides an information push method based on data analysis, which includes: collecting user behavior data through a web crawler; performing feature engineering processing on the behavior data by means of one-hot encoding and normalization to obtain target data; inputting the target data into a pre-trained potential user mining model to output a potential user predicted value, the potential user predicted value being used to characterize the possibility that the user is a potential user; and comparing the potential user predicted value with a preset threshold to determine the potential user and pushing information to the potential user.
  • An embodiment of the present application also provides an information push device based on data analysis, which includes: a crawler unit for collecting user behavior data through a web crawler; a feature engineering unit for performing feature engineering processing on the behavior data by means of one-hot encoding and normalization to obtain target data; a prediction unit for inputting the target data into a pre-trained potential user mining model to output a potential user predicted value, the potential user predicted value being used to characterize the possibility that the user is a potential user; and a pushing unit for comparing the potential user predicted value with a preset threshold to determine the potential user and pushing information to the potential user.
  • An embodiment of the present application also provides a computer device, which includes a memory and a processor. A computer program is stored on the memory, and when the processor executes the computer program, it implements: collecting user behavior data through a web crawler; performing feature engineering processing on the behavior data by means of one-hot encoding and normalization to obtain target data; inputting the target data into a pre-trained potential user mining model to output a potential user predicted value, the potential user predicted value being used to characterize the possibility that the user is a potential user; and comparing the potential user predicted value with a preset threshold to determine the potential user and pushing information to the potential user.
  • An embodiment of the present application also provides a computer-readable storage medium. The storage medium stores a computer program which, when executed by a processor, implements: collecting user behavior data through a web crawler; performing feature engineering processing on the behavior data by means of one-hot encoding and normalization to obtain target data; inputting the target data into a pre-trained potential user mining model to output a potential user predicted value, the potential user predicted value being used to characterize the possibility that the user is a potential user; and comparing the potential user predicted value with a preset threshold to determine the potential user and pushing information to the potential user.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • The embodiments of the application collect user behavior data, process the data through feature engineering, and then use the potential user mining model to make predictions on the processed data, so as to mine potential users and push advertisements to them. This can improve the accuracy of mining potential insurance applicants, push advertisements effectively, and reduce the cost of obtaining user information.
  • FIG. 1 is a schematic diagram of an application scenario of an information push method based on data analysis provided by an embodiment of the application;
  • FIG. 2 is a schematic flowchart of an information push method based on data analysis provided by an embodiment of the application;
  • FIG. 3 is a schematic diagram of a sub-flow of an information push method based on data analysis provided by an embodiment of the application;
  • FIG. 4 is a schematic diagram of a sub-flow of the method for pushing information based on data analysis provided by an embodiment of the application;
  • FIG. 5 is a schematic diagram of a sub-flow of an information pushing method based on data analysis provided by an embodiment of the application;
  • FIG. 6 is a schematic flowchart of an information push method based on data analysis provided by another embodiment of the application.
  • FIG. 7 is a schematic block diagram of an information pushing device based on data analysis provided by an embodiment of the application.
  • FIG. 8 is a schematic block diagram of specific units of an information push device based on data analysis provided by an embodiment of the application.
  • FIG. 9 is a schematic block diagram of an information pushing device based on data analysis provided by another embodiment of the application.
  • FIG. 10 is a schematic block diagram of a computer device provided by an embodiment of the application.
  • the potential user mining is applied to the terminal 10 and is realized through the interaction between the terminal 10 and the server 20.
  • FIG. 2 is a schematic flowchart of an information push method based on data analysis provided by an embodiment of the present application. As shown in the figure, the method includes the following steps S110-S140.
  • The user's behavior data refers to records that the network keeps of actions the user performs online, for example, the user searching for compulsory traffic insurance on Taobao.
  • A web crawler is a program or script that automatically crawls information on the World Wide Web according to certain rules. Specifically, some specific webpages are first selected as start pages; webpages are crawled from the start pages by means of the web crawler; and after crawling is complete, the large number of crawled webpages is filtered to obtain the target webpage. The target webpage is a webpage that users are likely to browse. Finally, the behavior data of users browsing the target webpage is obtained from the preset database of the target webpage.
  • the step S110 may include steps: S111-S113.
  • A web crawler is a program that automatically captures information on the World Wide Web according to certain rules; it mainly comprises three parts: collection, storage, and processing. Specifically, the URL of a representative webpage is first selected as the initial URL, and data is fetched from the server starting from it. The preset webpage is this representative webpage.
  • The initial URL is chosen from the customer's point of view. Because customers typically use a search engine to look for car insurance information, the search result page for compulsory traffic insurance on Baidu, or the search result page for compulsory traffic insurance on Taobao, can for example be used as the initial URL. The crawled webpages are then stored, parsed, and filtered.
  • The page crawled from the initial URL contains new URLs. The page is parsed to extract these new URLs, and those related to insurance are selected; for example, the URL of an insurance FAQ page is placed in the queue of URLs waiting to be crawled, while the remaining irrelevant URLs are discarded. Finally, the URL of the next webpage to crawl is taken from the queue, and the above process is repeated until the entire network has been traversed.
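  • The queue-based crawl described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the in-memory `FAKE_WEB` link graph, the `example.com` URLs, and the keyword rule in `is_insurance_related` are all assumptions standing in for real page fetching, parsing, and relevance filtering.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs found on that page.
# A real crawler would fetch and parse HTML instead.
FAKE_WEB = {
    "https://example.com/search?q=insurance": [
        "https://example.com/insurance/faq",
        "https://example.com/sports/news",
    ],
    "https://example.com/insurance/faq": [
        "https://example.com/insurance/quote",
        "https://example.com/search?q=insurance",
    ],
    "https://example.com/insurance/quote": [],
    "https://example.com/sports/news": ["https://example.com/insurance/hidden"],
}


def is_insurance_related(url):
    # Assumed relevance rule: keep URLs that mention "insurance".
    return "insurance" in url


def crawl(initial_url, fetch_links):
    """Breadth-first crawl that only enqueues insurance-related URLs."""
    queue = deque([initial_url])  # URLs waiting to be crawled
    seen = {initial_url}
    crawled = []
    while queue:
        url = queue.popleft()
        crawled.append(url)  # stands in for storing the crawled page
        for link in fetch_links(url):
            if link not in seen and is_insurance_related(link):
                seen.add(link)
                queue.append(link)
            # irrelevant or already-seen URLs are discarded
    return crawled


pages = crawl(
    "https://example.com/search?q=insurance",
    lambda url: FAKE_WEB.get(url, []),
)
```

  • Because irrelevant pages are discarded rather than crawled, links that appear only on those pages (here the hypothetical `/insurance/hidden`) are never reached; this is a trade-off of filtering at enqueue time.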
  • The preset webpage index refers to the webpage index provided by a data sharing platform based on the massive search and browsing behavior data of major search engines. Specifically, the webpage index is a value obtained through a series of evaluations of a website's browsing data (view volume, browsing duration, and number of views). For example, the official website of Insurance Company A has a webpage index of 89.
  • the preset database refers to a database storing the target webpage, and the preset database stores all data related to the target webpage.
  • The target webpage interface is called according to the URL of the target webpage; the interface is provided with the consent of the operator of the target webpage. Through the interface, the webpage log of the target webpage is retrieved from the preset database. After the webpage log is obtained, it is parsed to finally obtain the user's behavior data, which includes: user information, user browsing records, and user IP address.
  • S120 Perform feature engineering processing on the behavior data by means of one-hot encoding and normalization to obtain target data.
  • Feature engineering refers to the process of transforming original data into target data that can be input to a model. Commonly used feature engineering methods include: timestamp processing, decomposition of category attributes, binning/partitioning, feature crossing, feature selection, feature scaling, and feature extraction.
  • Behavior data is mainly divided into two categories. One is numerical behavior data, such as car age, browsing time, and annual income; the other is non-numerical behavior data, such as adding to favorites, commenting, following, and adding to the shopping cart. Specifically, the non-numerical behavior data is converted into target data for model input by decomposing category attributes, and the numerical behavior data is converted into target data for model input by feature scaling.
  • the step S120 may include steps S121-S122.
  • S121 Perform one-hot encoding on the non-numerical behavior data to obtain target data.
  • Feature engineering is performed by decomposing category attributes, that is, by encoding the behavior data with one-hot encoding. One-hot encoding uses an N-bit status register to encode N states: each state has its own independent register bit, and at any time only one bit is valid.
  • For example, the gender attribute includes male and female: the target data of "male" is [1,0], and the target data of "female" is [0,1]. Likewise, the target data of "favorite" is [1,0], and the target data of "not favorite" is [0,1].
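  • A minimal sketch of this encoding, reproducing the male/female and favorite/not favorite examples above. The function name `one_hot` and the category lists are illustrative assumptions; the order of each list fixes which bit represents which state.

```python
def one_hot(value, categories):
    """Return a list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# Category orderings are assumptions chosen to match the examples in the text.
GENDER = ["male", "female"]
FAVORITE = ["favorite", "not favorite"]

male = one_hot("male", GENDER)
female = one_hot("female", GENDER)
not_favorite = one_hot("not favorite", FAVORITE)
```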
  • S122 Normalize the numerical behavior data according to a preset formula to obtain target data.
  • Feature scaling is used for feature engineering. Some numerical features span a much wider range of values than others, for example annual income compared with age. To prevent features from differing greatly in magnitude, the feature values need to be scaled to the same range.
  • A preset formula is used to normalize the numerical behavior data, and the preset formula is specifically as follows:
  • $$X' = \frac{X - \min X}{\max X - \min X}$$
  • where X′ is the normalized feature value, X is the current user's feature value, minX is the minimum value of that feature, and maxX is the maximum value of that feature. For example, if the maximum annual income is 500,000, the minimum annual income is 60,000, and the current user's annual income is 100,000, then normalization gives (100,000 − 60,000) / (500,000 − 60,000) ≈ 0.09, a normalized feature value in the range of 0 to 1.
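  • The annual income example can be checked with a short sketch of min-max normalization, which is the scaling implied by the definitions of X′, X, minX, and maxX. The function name `min_max_normalize` is an assumption for illustration.

```python
def min_max_normalize(x, min_x, max_x):
    """Scale x into [0, 1] given the feature's minimum and maximum."""
    if max_x == min_x:
        raise ValueError("max_x and min_x must differ")
    return (x - min_x) / (max_x - min_x)


# Values from the text: income 100,000 with range 60,000 to 500,000.
income_feature = min_max_normalize(100_000, 60_000, 500_000)
```

  • Rounded to two decimal places, `income_feature` is 0.09, matching the value given in the text.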
  • The potential user mining model is constructed using the gradient boosting decision tree (GBDT) algorithm. A gradient boosting decision tree is an ensemble of decision trees connected in series: each subsequent decision tree learns from the residual of the previous decision tree, the residual is obtained from the gradient, and all the decision trees are combined to form the gradient boosting decision tree.
  • For example, suppose potential users are predicted from features that include user age and user annual income. The ages of users A, B, C, and D are 18, 26, 36, and 41 respectively, and their annual incomes are 0, 300,000, 100,000, and 50, respectively.
  • The first decision tree splits on the age label (with 30 years old as the split point): users A and B fall into the category below 30 years old, and users C and D fall into the category above 30 years old.
  • Suppose the predicted values of A, B, C, and D as potential users are 0.1, 0.3, 0.6, and 0.8 respectively. The residual within a category is the difference between each user's predicted value and the average predicted value of the category. The average predicted value of A and B is 0.2, so the residuals of A and B are -0.1 and 0.1 respectively; the average predicted value of C and D is 0.7, so the residuals of C and D are -0.1 and 0.1 respectively.
  • The next decision tree then predicts on the basis of the residuals of the previous decision tree, splitting on the annual income label (with 150,000 as the split point): A and C fall below 150,000, and B and D fall above 150,000.
  • The core of the prediction is that each tree learns the residual of the sum of the conclusions of all previous trees.
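  • The residual arithmetic in the A/B/C/D example above can be checked with a short sketch. The function name `group_residuals` and the dictionary layout are assumptions; the ages, predicted values, and the split at 30 years old come from the text.

```python
# Values taken from the worked example in the text.
AGES = {"A": 18, "B": 26, "C": 36, "D": 41}
PREDICTIONS = {"A": 0.1, "B": 0.3, "C": 0.6, "D": 0.8}


def group_residuals(predictions, ages, age_split=30):
    """Residual of each user relative to the mean prediction of its age group."""
    groups = {"below": [], "above": []}
    for user, age in ages.items():
        groups["below" if age < age_split else "above"].append(user)

    residuals = {}
    for members in groups.values():
        if not members:
            continue
        mean = sum(predictions[user] for user in members) / len(members)
        for user in members:
            # Rounded only to hide floating-point noise in the illustration.
            residuals[user] = round(predictions[user] - mean, 10)
    return residuals


residuals = group_residuals(PREDICTIONS, AGES)
```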
  • The potential user mining model has been pre-trained, and it is run on the Spark platform to make predictions on the target data. Spark is a fast, general-purpose computing engine designed for large-scale data processing.
  • The Spark platform includes the algorithm component Spark MLlib (Machine Learning Library). Spark MLlib includes an algorithm library that contains a gradient boosting decision tree algorithm, and it provides an algorithm interface through which the gradient boosting decision tree algorithm makes predictions on the target data.
  • the step S130 may include steps: S131-S132.
  • A target sample is a sample for model input that is composed of target data and a label. Target samples are divided into positive samples and negative samples: the label value of a positive sample is 1, and the label value of a negative sample is 0.
  • For example, a positive sample may be one whose annual income is greater than or equal to 100,000, and a negative sample may be one for which no car has been purchased. If the customer's annual income is 100,000, the target sample is (0.09, 1), where 0.09 is the feature value and 1 is the label value; if the customer has not purchased a car, the target sample is (0, 0).
  • The potential user mining model adopts the gradient boosting decision tree algorithm, which proceeds through multiple rounds of iteration. Each round produces one decision tree, fitted on the basis of the loss function of the decision trees from the previous rounds; finally, the conclusions of all decision trees are added up to obtain the predicted value.
  • The formula of the gradient boosting decision tree algorithm is as follows:
  • $$F_M(x) = \sum_{m=1}^{M} T(x; \Theta_m), \qquad \hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i} L\big(y_i,\, F_{m-1}(x_i) + T(x_i; \Theta_m)\big)$$
  • where F_M(x) represents the model; T(x; Θ_m) represents a decision tree; Θ_m is the decision tree parameter; m is the index of the decision tree; L is the loss function; x is the sample feature; and y is the sample label. The sample feature and the sample label constitute the target sample, and the label value is 0 or 1; i is the index of the sample. T uses the CART decision tree, a typical binary decision tree that can be used for classification or regression.
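  • The additive scheme described above, in which each round fits a new tree to what the previous rounds left unexplained and the tree outputs are summed, can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the patent's Spark MLlib model: it uses squared loss (for which the negative gradient equals the residual), depth-1 regression trees on a single feature, and hypothetical names (`fit_stump`, `fit_gbdt`, `gbdt_predict`) and toy data.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree: (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        if not right:
            continue  # a split must leave samples on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((t - left_mean) ** 2 for t in left)
        error += sum((t - right_mean) ** 2 for t in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all feature values identical: constant prediction
        mean = sum(targets) / len(targets)
        return (xs[0], mean, mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbdt(xs, ys, n_trees=3):
    """Each round fits a stump to the residuals of the current ensemble."""
    trees = []
    current = [0.0] * len(xs)  # initial model predicts 0 everywhere
    for _ in range(n_trees):
        # For squared loss 0.5 * (y - F)^2 the negative gradient is y - F.
        residuals = [y - f for y, f in zip(ys, current)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        current = [f + stump_predict(stump, x) for f, x in zip(current, xs)]
    return trees


def gbdt_predict(trees, x):
    # The ensemble prediction is the sum of all tree outputs.
    return sum(stump_predict(tree, x) for tree in trees)


# Toy data (assumed): one normalized feature, labels 0 or 1.
features = [0.0, 0.09, 0.5, 1.0]
labels = [0, 0, 1, 1]
model = fit_gbdt(features, labels, n_trees=3)
low_score = gbdt_predict(model, 0.05)
high_score = gbdt_predict(model, 0.8)
```

  • On this toy data the first stump already separates the two label groups, so later rounds fit residuals of zero. A production model would typically add a learning rate, deeper trees, and a classification loss.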
  • S140 Compare the predicted value of the potential user with a preset threshold to determine the potential user and push information about the potential user.
  • After the predicted value of the potential user is obtained, it is compared with a preset threshold. If the predicted value is greater than the preset threshold, the user is determined to be a potential user; if the predicted value is less than the preset threshold, the user is determined to be a non-potential user. For example, if the preset threshold is 0.6 and the predicted value of the user is 0.8, the predicted value is greater than the preset threshold, so the user is determined to be a potential user.
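  • The threshold comparison can be sketched as below. The function name `select_potential_users` and the sample predictions are assumptions; the threshold of 0.6 comes from the text. The text does not say how a predicted value exactly equal to the threshold is handled, so this sketch assumes such users are treated as non-potential.

```python
def select_potential_users(predicted_values, threshold=0.6):
    """Return users whose predicted value is strictly above the threshold."""
    # Assumption: a value equal to the threshold is NOT a potential user.
    return [user for user, value in predicted_values.items() if value > threshold]


# Hypothetical predicted values for four users.
predicted = {"A": 0.1, "B": 0.3, "C": 0.6, "D": 0.8}
potential_users = select_potential_users(predicted, threshold=0.6)
```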
  • the advertisements pushed can be insurance information, auto insurance product information, insurance links, etc. Specifically, the list of potential users and the advertisement link are sent to the operator of the target webpage, and the operator pushes the advertisement link according to the user's IP address when the potential user logs in and browses the webpage.
  • After step S140, the method further includes steps S150-S160.
  • The feedback result refers to whether the potential user has opened the advertisement link pushed by the target webpage. If the user opens the pushed advertisement link, the feedback is positive; if the user does not open the pushed advertisement link, the feedback is negative.
  • The feedback result is obtained from the target webpage, where it is stored in the preset database of the target webpage operator in the form of webpage logs. Therefore, the interface is called to obtain the webpage log from the preset database of the target webpage, and the log is parsed. A regular expression is then built with the URL of the pushed advertisement link as the rule string, and the browsing records of visits to the advertisement link are filtered from the webpage log. These browsing records are the feedback result.
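  • A minimal sketch of filtering webpage logs with a regular expression built from the advertisement URL. The log line format, the IP addresses, and the `ads.example.com` URL are all hypothetical, since the text does not specify the log layout; `re.escape` is used so that characters such as `?` and `.` in the URL are matched literally.

```python
import re

# Hypothetical pushed advertisement link.
AD_URL = "https://ads.example.com/auto-insurance?id=42"

# Hypothetical log lines: "<ip> <method> <url>".
LOG_LINES = [
    "203.0.113.5 GET https://ads.example.com/auto-insurance?id=42",
    "203.0.113.9 GET https://shop.example.com/cart",
    "198.51.100.7 GET https://ads.example.com/auto-insurance?id=42",
]


def filter_ad_views(log_lines, ad_url):
    """Keep only the log lines that record a visit to the advertisement URL."""
    pattern = re.compile(re.escape(ad_url))  # URL used as the rule string
    return [line for line in log_lines if pattern.search(line)]


ad_views = filter_ad_views(LOG_LINES, AD_URL)
```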
  • Whether the potential user mining model needs to be optimized is judged mainly by the conversion rate. The conversion rate is the ratio of the number of potential users who viewed the pushed advertisement link to the number of all potential users. The more potential users who view the advertisement link, the higher the conversion rate.
  • The actual conversion rate is compared with the expected conversion rate. If the actual conversion rate is greater than the expected conversion rate, the potential user mining model has a good conversion effect and does not need to be optimized; if the actual conversion rate is less than the expected conversion rate, the conversion effect of the model is poor and the model needs to be optimized. In that case, a reminder email is generated according to the feedback results and sent to the email address of the model manager as a reminder to optimize the model.
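  • The conversion rate check can be sketched as follows. The function names and the sample counts (30 viewers out of 200 potential users, expected rate 0.2) are assumptions for illustration. The text does not cover the case where the actual and expected rates are equal; this sketch assumes no optimization is needed then.

```python
def conversion_rate(num_viewed, num_potential):
    """Share of potential users who viewed the pushed advertisement link."""
    if num_potential <= 0:
        raise ValueError("num_potential must be positive")
    return num_viewed / num_potential


def needs_optimization(actual_rate, expected_rate):
    # Assumption: equal rates do not trigger optimization.
    return actual_rate < expected_rate


# Hypothetical counts: 30 of 200 potential users opened the link.
actual = conversion_rate(30, 200)
should_optimize = needs_optimization(actual, expected_rate=0.2)
```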
  • FIG. 7 is a schematic block diagram of an information push device 200 based on data analysis provided by an embodiment of the present application.
  • the data analysis-based information pushing device 200 includes a unit for executing the above-mentioned data analysis-based information pushing method, and the device can be configured in a desktop computer, a tablet computer, a laptop computer, and other terminals.
  • the information pushing device 200 based on data analysis includes: a crawler unit 210, a feature engineering unit 220, a prediction unit 230, and a pushing unit 240.
  • the crawler unit 210 is used to collect user behavior data by way of web crawlers.
  • The user's behavior data refers to records that the network keeps of actions the user performs online, for example, the user searching for compulsory traffic insurance on Taobao.
  • A web crawler is a program or script that automatically crawls information on the World Wide Web according to certain rules. Specifically, some specific webpages are first selected as start pages; webpages are crawled from the start pages by means of the web crawler; and after crawling is complete, the large number of crawled webpages is filtered to obtain the target webpage. The target webpage is a webpage that users are likely to browse. Finally, the behavior data of users browsing the target webpage is obtained from the preset database of the target webpage.
  • the crawler unit 210 includes: a crawler subunit 211, a screening unit 212 and an acquisition subunit 213.
  • the crawler subunit 211 is used for crawling a preset webpage by way of a web crawler.
  • A web crawler is a program that automatically captures information on the World Wide Web according to certain rules; it mainly comprises three parts: collection, storage, and processing. Specifically, the URL of a representative webpage is first selected as the initial URL, and data is fetched from the server starting from it. The preset webpage is this representative webpage.
  • The initial URL is chosen from the customer's point of view. Because customers typically use a search engine to look for car insurance information, the search result page for compulsory traffic insurance on Baidu, or the search result page for compulsory traffic insurance on Taobao, can for example be used as the initial URL. The crawled webpages are then stored, parsed, and filtered.
  • The page crawled from the initial URL contains new URLs. The page is parsed to extract these new URLs, and those related to insurance are selected; for example, the URL of an insurance FAQ page is placed in the queue of URLs waiting to be crawled, while the remaining irrelevant URLs are discarded. Finally, the URL of the next webpage to crawl is taken from the queue, and the above process is repeated until the entire network has been traversed.
  • the screening unit 212 is used for screening the crawled webpages according to a preset webpage index to obtain a target webpage.
  • Because the crawled webpages contain a large number of worthless webpages, they need to be filtered further, and some valuable webpages, that is, webpages that users are likely to browse, are selected as target webpages. The crawled webpages are evaluated and filtered according to the preset webpage index to obtain the target webpages.
  • The preset webpage index refers to the webpage index provided by a data sharing platform based on the massive Internet user search behavior data of major search engines. The preset webpage index of each crawled webpage is obtained, the crawled webpages are sorted from high to low by this index, and the top ten webpages are selected as target webpages. Of course, a different number of webpages may also be selected as target webpages.
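  • The ranking step can be sketched as below. The function name `select_target_pages`, the `.example` URLs, and all index values other than 89 (the index given in the text for the official website of Insurance Company A) are assumptions for illustration.

```python
def select_target_pages(index_by_url, top_n=10):
    """Sort crawled pages by webpage index, high to low, and keep the top N."""
    ranked = sorted(index_by_url.items(), key=lambda item: item[1], reverse=True)
    return [url for url, _ in ranked[:top_n]]


# Hypothetical crawled pages with their webpage index.
crawled_index = {
    "https://insurer-a.example": 89,
    "https://forum.example": 12,
    "https://broker.example": 57,
}
target_pages = select_target_pages(crawled_index, top_n=2)
```

  • The default `top_n=10` mirrors the top ten mentioned in the text; passing another value corresponds to choosing a different number of target webpages.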
  • The acquisition subunit 213 is configured to obtain user behavior data from a preset database according to the target webpage.
  • The preset database refers to a database storing the target webpage, and the preset database stores all data related to the target webpage.
  • The target webpage interface is called according to the URL of the target webpage; the interface is provided with the consent of the operator of the target webpage. Through the interface, the webpage log of the target webpage is retrieved from the preset database. After the webpage log is obtained, it is parsed to finally obtain the user's behavior data, which includes: user information, user browsing records, and user IP address.
  • the feature engineering unit 220 is configured to perform feature engineering processing on the behavior data through one-hot encoding and normalization to obtain target data.
  • Feature engineering refers to the process of transforming original data into target data that can be input to a model. Commonly used feature engineering methods include: timestamp processing, decomposition of category attributes, binning/partitioning, feature crossing, feature selection, feature scaling, and feature extraction.
  • Behavior data is mainly divided into two categories. One is numerical behavior data, such as car age, browsing time, and annual income; the other is non-numerical behavior data, such as adding to favorites, commenting, following, and adding to the shopping cart. Specifically, the non-numerical behavior data is converted into target data for model input by decomposing category attributes, and the numerical behavior data is converted into target data for model input by feature scaling.
  • the feature engineering unit 220 includes: an encoding unit 221 and a normalization unit 222.
  • the encoding unit 221 is configured to perform one-hot encoding on the non-numerical behavior data to obtain target data.
  • Feature engineering is performed by decomposing category attributes, that is, by encoding the behavior data with one-hot encoding. One-hot encoding uses an N-bit status register to encode N states: each state has its own independent register bit, and at any time only one bit is valid.
  • For example, the gender attribute includes male and female: the target data of "male" is [1,0], and the target data of "female" is [0,1]. Likewise, the target data of "favorite" is [1,0], and the target data of "not favorite" is [0,1].
  • the normalization unit 222 is configured to normalize the numerical behavior data according to a preset formula to obtain target data.
  • Feature scaling is used for feature engineering. Some numerical features span a much wider range of values than others, for example annual income compared with age. To prevent features from differing greatly in magnitude, the feature values need to be scaled to the same range.
  • A preset formula is used to normalize the numerical behavior data, and the preset formula is specifically as follows:
  • $$X' = \frac{X - \min X}{\max X - \min X}$$
  • where X′ is the normalized feature value, X is the current user's feature value, minX is the minimum value of that feature, and maxX is the maximum value of that feature. For example, if the maximum annual income is 500,000, the minimum annual income is 60,000, and the current user's annual income is 100,000, then normalization gives (100,000 − 60,000) / (500,000 − 60,000) ≈ 0.09, a normalized feature value in the range of 0 to 1.
  • the prediction unit 230 is configured to input the target data into a pre-trained potential user mining model to output a potential user prediction value, and the potential user prediction value is used to characterize the possibility that the user belongs to a potential user.
  • The potential user mining model is constructed using the gradient boosting decision tree (GBDT) algorithm. A gradient boosting decision tree is an ensemble of decision trees connected in series: each subsequent decision tree learns from the residual of the previous decision tree, the residual is obtained from the gradient, and all the decision trees are combined to form the gradient boosting decision tree.
  • For example, suppose potential users are predicted from features that include user age and user annual income. The ages of users A, B, C, and D are 18, 26, 36, and 41 respectively, and their annual incomes are 0, 300,000, 100,000, and 50, respectively.
  • The first decision tree splits on the age label (with 30 years old as the split point): users A and B fall into the category below 30 years old, and users C and D fall into the category above 30 years old.
  • Suppose the predicted values of A, B, C, and D as potential users are 0.1, 0.3, 0.6, and 0.8 respectively. The residual within a category is the difference between each user's predicted value and the average predicted value of the category. The average predicted value of A and B is 0.2, so the residuals of A and B are -0.1 and 0.1 respectively; the average predicted value of C and D is 0.7, so the residuals of C and D are -0.1 and 0.1 respectively.
  • The next decision tree then predicts on the basis of the residuals of the previous decision tree, splitting on the annual income label (with 150,000 as the split point): A and C fall below 150,000, and B and D fall above 150,000.
  • The core of the prediction is that each tree learns the residual of the sum of the conclusions of all previous trees.
  • The potential user mining model is pre-trained and runs on the Spark platform to predict on the target data. Spark is a fast, general-purpose computing engine designed for large-scale data processing.
  • The Spark platform includes the algorithm component Spark MLlib (Machine Learning Library), which contains an algorithm library.
  • The algorithm library includes a gradient boosting decision tree algorithm.
  • Spark MLlib provides an interface to the gradient boosting decision tree algorithm, which is used to predict on the target data.
  • The prediction unit 230 includes: a construction unit 231 and a prediction subunit 232.
  • The construction unit 231 is configured to construct a target sample according to the target data.
  • A target sample is a sample for model input composed of target data and a label. Target samples are divided into positive and negative samples: the label value of a positive sample is 1, and the label value of a negative sample is 0.
  • For example, a positive sample may be a customer whose annual income is greater than or equal to 100,000, and a negative sample may be a customer who has not purchased a car. If the customer's annual income is 100,000, the target sample is (0.09, 1); if the customer has not purchased a car, the target sample is (0, 0).
  • The prediction subunit 232 is configured to input the target sample into the gradient boosting decision tree model for iterative updating and to output the potential user prediction value.
  • The potential user mining model adopts the gradient boosting decision tree algorithm.
  • The gradient boosting decision tree algorithm runs for multiple rounds of iteration, and each round produces one decision tree. The tree of each round is fitted on the basis of the loss function of the trees from the previous rounds, and finally the conclusions of all decision trees are summed to obtain the predicted value.
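  • The round-by-round fitting can be illustrated with a minimal pure-Python sketch. It is not the Spark MLlib implementation and rests on several assumptions that are not in the application: a squared-error loss (so the negative gradient equals the residual), a single input feature, depth-1 regression trees (stumps) in place of full CART trees, a learning rate of 0.5, and made-up training data. All function names are illustrative.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree: one threshold and two leaf means.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest value cannot split
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((t - left_mean) ** 2 for t in left)
        error += sum((t - right_mean) ** 2 for t in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals left by the previous rounds."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return base, learning_rate, stumps


def predict_boosted(model, x):
    """Sum the base value and the scaled conclusions of all stumps."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Hypothetical normalized feature values with labels 0 (negative) and 1 (positive).
XS = [0.0, 0.09, 0.3, 0.55, 0.8, 1.0]
YS = [0, 0, 0, 1, 1, 1]
MODEL = fit_boosted_stumps(XS, YS)
print(predict_boosted(MODEL, 0.09), predict_boosted(MODEL, 0.8))
```

  • On this toy data each round halves the remaining residual, so the summed prediction moves toward 0 for the negative samples and toward 1 for the positive samples.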
  • The formula of the gradient boosting decision tree algorithm is as follows:
  • F_M(x) represents the model
  • T(x; Θ_m) represents the decision tree
  • Θ_m is the decision tree parameter
  • m is the number of decision trees
  • L is the loss function
  • x is the sample feature
  • y is the sample label
  • the sample feature and sample label constitute the target sample
  • the label value is 0 or 1
  • i is the number of samples
  • T uses the CART decision tree, which is a typical binary decision tree that can be used for classification or regression.
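  • The symbols defined above match the usual additive formulation of gradient boosting. As a sketch, and as an assumption based only on those definitions (with m and i read as running indices, and M for the total number of trees and N for the total number of samples, neither of which is named in the text), the model and the per-round fit can be written as:

```latex
F_M(x) = \sum_{m=1}^{M} T(x; \Theta_m)

\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N}
    L\bigl(y_i,\; F_{m-1}(x_i) + T(x_i; \Theta_m)\bigr)
```

  • Here F_{m-1} is the sum of the first m - 1 trees, so each new tree T(x; Θ_m) is chosen to reduce the loss left by the trees before it.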
  • The pushing unit 240 is configured to compare the potential user prediction value with a preset threshold to determine potential users and to push information to the potential users.
  • The potential user prediction value is compared with the preset threshold. If the prediction value is greater than the preset threshold, the user is determined to be a potential user; if the prediction value is less than the preset threshold, the user is determined to be a non-potential user. For example, if the preset threshold is 0.6 and the user's prediction value is 0.8, the prediction value is greater than the preset threshold, so the user is determined to be a potential user.
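  • The comparison can be sketched as follows. The text does not say how a prediction value exactly equal to the threshold is handled; this sketch assumes such a user is treated as a non-potential user. The function names and the sample values are illustrative.

```python
def is_potential_user(predicted_value, threshold=0.6):
    """Return True when the prediction value is strictly above the threshold."""
    return predicted_value > threshold


def select_potential_users(predictions, threshold=0.6):
    """Keep the user IDs whose prediction value exceeds the threshold."""
    return [
        user
        for user, value in predictions.items()
        if is_potential_user(value, threshold)
    ]


# Illustrative prediction values for four users.
PREDICTIONS = {"A": 0.1, "B": 0.3, "C": 0.6, "D": 0.8}
POTENTIAL_USERS = select_potential_users(PREDICTIONS)
print(POTENTIAL_USERS)
```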
  • Advertisements are pushed to these potential users. An advertisement may be insurance information, auto insurance product information, or an insurance link. Specifically, the list of potential users and the advertisement link are sent to the operator of the target webpage, and when a potential user logs in and browses the webpage, the operator pushes the advertisement link according to the user's IP address.
  • The information pushing device 200 based on data analysis further includes: an obtaining unit 250 and a prompting unit 260.
  • The obtaining unit 250 is configured to obtain the feedback result of the advertisement push.
  • The feedback result indicates whether the potential user opened the advertisement link pushed on the target webpage. If the user opened the pushed advertisement link, the feedback is positive; if the user did not open it, the feedback is negative.
  • The feedback result is obtained from the target webpage, where it is stored as webpage logs in the preset database of the target webpage operator. Therefore, the interface of the target webpage is called to obtain the webpage log from the preset database, and the log is parsed. A regular expression that uses the URL of the pushed advertisement link as its rule string then filters the browsing records for the advertisement link out of the webpage log; these browsing records are the feedback result.
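  • The log filtering step can be sketched with Python's `re` module. The log line format, the example URLs, and the function name `filter_ad_views` are assumptions made for illustration; the application does not specify them. The URL is escaped so that characters such as `?` and `.` are matched literally.

```python
import re


def filter_ad_views(log_lines, ad_url):
    """Return the log lines that record a visit to the pushed advertisement URL."""
    pattern = re.compile(re.escape(ad_url))
    return [line for line in log_lines if pattern.search(line)]


# Hypothetical webpage log: "<client IP> <method> <URL>".
LOG_LINES = [
    "203.0.113.5 GET https://ads.example.com/auto-insurance?id=1",
    "203.0.113.9 GET https://www.example.com/faq",
    "198.51.100.7 GET https://ads.example.com/auto-insurance?id=1",
]
AD_URL = "https://ads.example.com/auto-insurance?id=1"
AD_VIEWS = filter_ad_views(LOG_LINES, AD_URL)
print(len(AD_VIEWS))
```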
  • The prompting unit 260 is configured to send, by email, a prompt to optimize the potential user mining model according to the feedback result.
  • Whether the potential user mining model needs to be optimized is judged mainly by the conversion rate.
  • The conversion rate is the ratio of the number of potential users who viewed the pushed advertisement link to the total number of potential users. The more potential users who view the advertisement link, the higher the conversion rate.
  • The actual conversion rate is compared with the expected conversion rate. If the actual conversion rate is greater than the expected conversion rate, the potential user mining model converts well and does not need to be optimized; if the actual conversion rate is less than the expected conversion rate, the model converts poorly and needs to be optimized.
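  • The conversion rate check can be sketched as follows. The user IDs and the expected rate of 0.3 are made up for illustration. The text does not cover the case where the actual rate equals the expected rate; this sketch assumes that no optimization is prompted in that case.

```python
def conversion_rate(viewers, potential_users):
    """Share of potential users who viewed the pushed advertisement link."""
    potential = set(potential_users)
    if not potential:
        raise ValueError("the list of potential users is empty")
    return len(potential & set(viewers)) / len(potential)


def needs_optimization(actual_rate, expected_rate):
    """Return True when the actual conversion rate falls below the expected one."""
    return actual_rate < expected_rate


# Hypothetical data: four potential users, one of whom viewed the link.
ACTUAL_RATE = conversion_rate(["u1", "u9"], ["u1", "u2", "u3", "u4"])
print(ACTUAL_RATE, needs_optimization(ACTUAL_RATE, expected_rate=0.3))
```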
  • The embodiment of the application provides an information pushing device based on data analysis, which collects user behavior data by means of a web crawler; performs feature engineering processing on the behavior data through one-hot encoding and normalization to obtain target data; inputs the target data into a pre-trained potential user mining model to output a potential user prediction value, which characterizes the likelihood that the user is a potential user; and compares the potential user prediction value with a preset threshold to determine potential users and push information to them. In this way, potential insurance users can be mined, advertisements can be pushed effectively, and the cost of obtaining user information can be reduced.
  • The above information pushing device based on data analysis may be implemented as a computer program, which can run on the computer device shown in FIG. 10.
  • The computer device 500 may be a terminal, where the terminal may be an electronic device with communication functions, such as a smartphone, tablet computer, notebook computer, desktop computer, personal digital assistant, or wearable device.
  • The computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
  • The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032.
  • The computer program 5032 includes program instructions.
  • When the computer program 5032 is executed, the processor 502 can execute an information push method based on data analysis.
  • The processor 502 provides calculation and control capabilities to support the operation of the entire computer device 500.
  • The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503.
  • The network interface 505 is used for network communication with other devices.
  • The structure shown in FIG. 10 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device 500 to which the solution is applied.
  • A specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different component arrangement.
  • The processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps: collect user behavior data by means of a web crawler; perform feature engineering processing on the behavior data through one-hot encoding and normalization to obtain target data; input the target data into a pre-trained potential user mining model to output a potential user prediction value, which characterizes the likelihood that the user is a potential user; and compare the potential user prediction value with a preset threshold to determine potential users and push information to the potential users.
  • When the processor 502 implements the step of collecting user behavior data by means of a web crawler, it specifically implements the following steps: crawl preset webpages by means of a web crawler; filter the crawled webpages according to a preset webpage index to obtain target webpages; and obtain the user behavior data from a preset database according to the target webpages.
  • When the processor 502 implements the step of performing feature engineering processing on the behavior data through one-hot encoding and normalization to obtain target data, it specifically implements the following steps:
  • perform one-hot encoding on the non-numeric behavior data to obtain the target data;
  • normalize the numeric behavior data according to a preset formula to obtain the target data.
  • When the processor 502 implements the step of inputting the target data into a pre-trained potential user mining model to output a potential user prediction value, the potential user prediction value being used to characterize the likelihood that the user is a potential user, it specifically implements the following steps: construct a target sample according to the target data; and input the target sample into the gradient boosting decision tree model for iterative updating to output the potential user prediction value.
  • After the processor 502 implements the step of comparing the potential user prediction value with a preset threshold to determine potential users and push information to the potential users, it further implements the following steps: obtain the feedback result of the advertisement push; and send, by email, a prompt to optimize the potential user mining model according to the feedback result.
  • The processor 502 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on.
  • The general-purpose processor may be a microprocessor or any conventional processor.
  • The computer program includes program instructions and can be stored in a storage medium, which is a computer-readable storage medium.
  • The program instructions are executed by at least one processor in the computer system to implement the process steps of the foregoing method embodiments.
  • The storage medium may be a computer-readable storage medium.
  • The storage medium stores a computer program, where the computer program includes program instructions.
  • When the program instructions are executed by a processor, the processor executes the following steps: collect user behavior data by means of a web crawler; perform feature engineering processing on the behavior data through one-hot encoding and normalization to obtain target data; input the target data into a pre-trained potential user mining model to output a potential user prediction value, which characterizes the likelihood that the user is a potential user; and compare the potential user prediction value with a preset threshold to determine potential users and push information to the potential users.
  • The computer-readable storage medium may be a non-volatile storage medium or a volatile storage medium.
  • When the processor executes the program instructions to implement the step of collecting user behavior data by means of a web crawler, it specifically implements the following steps: crawl preset webpages by means of a web crawler; filter the crawled webpages according to a preset webpage index to obtain target webpages; and obtain user behavior data from a preset database according to the target webpages.
  • When the processor executes the program instructions to implement the step of performing feature engineering processing on the behavior data through one-hot encoding and normalization to obtain the target data, it specifically implements the following steps: perform one-hot encoding on the non-numeric behavior data to obtain the target data; and normalize the numeric behavior data according to a preset formula to obtain the target data.
  • When the processor executes the program instructions to implement the step of inputting the target data into a pre-trained potential user mining model to output a potential user prediction value, the potential user prediction value being used to characterize the likelihood that the user is a potential user, it specifically implements the following steps: construct a target sample according to the target data; and input the target sample into the gradient boosting decision tree model for iterative updating to output the potential user prediction value.
  • After the processor executes the program instructions to implement the step of comparing the potential user prediction value with a preset threshold to determine potential users and push information to the potential users, it further implements the following steps: obtain the feedback result of the advertisement push; and send, by email, a prompt to optimize the potential user mining model according to the feedback result.
  • The storage medium may be any computer-readable storage medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
  • The disclosed device and method may be implemented in other ways.
  • The device embodiments described above are only illustrative.
  • The division into units is only a division by logical function, and other divisions are possible in an actual implementation.
  • Multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • The steps in the methods of the embodiments of the present application may be reordered, merged, and deleted according to actual needs.
  • The units in the devices of the embodiments of the present application may be combined, divided, and deleted according to actual needs.
  • The functional units in the various embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium.
  • The technical solution of this application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a terminal, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Development Economics (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An information pushing method and apparatus based on data analysis, a computer device, and a storage medium. The method relates to an artificial intelligence technology, and is applied to the field of prediction models in intelligent decisions. The method comprises: collecting behavior data of a user in a web crawler manner (S110); performing feature engineering processing on the behavior data in a one-hot coding and normalization manner to obtain target data (S120); inputting the target data into a pre-trained potential user mining model to output a potential user prediction value, the potential user prediction value being used for representing the possibility that the user belongs to a potential user (S130); and comparing the potential user prediction value with a preset threshold to determine the potential user and push information to the potential user (S140). By implementing the method, the accuracy of mining potential users to be insured can be improved, advertisements can be effectively pushed, and the cost of obtaining user information by an enterprise is reduced.

Description

Information pushing method and apparatus based on data analysis, computer device, and storage medium
This application claims priority to Chinese patent application No. 201910745385.7, filed with the Chinese Patent Office on August 13, 2019 and entitled "Information pushing method and apparatus based on data analysis, computer device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of artificial intelligence technology, and in particular to an information push method and device based on data analysis, a computer device, and a storage medium.
Background
With the development of technology and the economy, living standards keep improving and people pursue an ever higher quality of life. Cars have gradually become an indispensable part of daily life, and car insurance protects both cars and their owners. Existing car insurance customers usually learn about car insurance through channels such as 4S dealerships or car maintenance shops and then purchase it. However, the inventor realized that this way of acquiring customers relies on a single channel and usually reaches only customers who must buy car insurance anyway, so information about potential customers cannot be obtained. Customers who show an intention to purchase insurance online are usually mined from their browsing records, but this approach has low accuracy and high cost, and it is difficult to identify genuine potential users.
Summary of the invention
The embodiments of the present application provide an information push method and device based on data analysis, a computer device, and a storage medium, aiming to solve the problem of low accuracy in mining customers who intend to purchase insurance online.
In a first aspect, an embodiment of the present application provides an information push method based on data analysis, which includes: collecting user behavior data by means of a web crawler; performing feature engineering processing on the behavior data through one-hot encoding and normalization to obtain target data; inputting the target data into a pre-trained potential user mining model to output a potential user prediction value, the potential user prediction value being used to characterize the likelihood that the user is a potential user; and comparing the potential user prediction value with a preset threshold to determine potential users and push information to the potential users.
In a second aspect, an embodiment of the present application further provides an information pushing device based on data analysis, which includes: a crawler unit, configured to collect user behavior data by means of a web crawler; a feature engineering unit, configured to perform feature engineering processing on the behavior data through one-hot encoding and normalization to obtain target data; a prediction unit, configured to input the target data into a pre-trained potential user mining model to output a potential user prediction value, the potential user prediction value being used to characterize the likelihood that the user is a potential user; and a pushing unit, configured to compare the potential user prediction value with a preset threshold to determine potential users and push information to the potential users.
In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory and a processor, where a computer program is stored in the memory and the processor, when executing the computer program, implements: collecting user behavior data by means of a web crawler; performing feature engineering processing on the behavior data through one-hot encoding and normalization to obtain target data; inputting the target data into a pre-trained potential user mining model to output a potential user prediction value, the potential user prediction value being used to characterize the likelihood that the user is a potential user; and comparing the potential user prediction value with a preset threshold to determine potential users and push information to the potential users.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements: collecting user behavior data by means of a web crawler; performing feature engineering processing on the behavior data through one-hot encoding and normalization to obtain target data; inputting the target data into a pre-trained potential user mining model to output a potential user prediction value, the potential user prediction value being used to characterize the likelihood that the user is a potential user; and comparing the potential user prediction value with a preset threshold to determine potential users and push information to the potential users. Optionally, the computer-readable storage medium may be a non-volatile computer-readable storage medium.
The embodiments of the present application collect user behavior data, process the data through feature engineering, and then use the potential user mining model to make predictions on the behavior data and mine potential users, to whom advertisements are then pushed. This improves the accuracy of mining potential insurance users, pushes advertisements effectively, and reduces the cost to the enterprise of obtaining user information.
Description of the drawings
FIG. 1 is a schematic diagram of an application scenario of the information push method based on data analysis provided by an embodiment of the application;
FIG. 2 is a schematic flowchart of the information push method based on data analysis provided by an embodiment of the application;
FIG. 3 is a schematic diagram of a sub-flow of the information push method based on data analysis provided by an embodiment of the application;
FIG. 4 is a schematic diagram of a sub-flow of the information push method based on data analysis provided by an embodiment of the application;
FIG. 5 is a schematic diagram of a sub-flow of the information push method based on data analysis provided by an embodiment of the application;
FIG. 6 is a schematic flowchart of the information push method based on data analysis provided by another embodiment of the application;
FIG. 7 is a schematic block diagram of the information pushing device based on data analysis provided by an embodiment of the application;
FIG. 8 is a schematic block diagram of specific units of the information pushing device based on data analysis provided by an embodiment of the application;
FIG. 9 is a schematic block diagram of the information pushing device based on data analysis provided by another embodiment of the application; and
FIG. 10 is a schematic block diagram of the computer device provided by an embodiment of the application.
Detailed description
The technical solutions in the embodiments of the present application are described below with reference to the drawings in the embodiments of the present application.
It should be understood that the term "and/or" used in the specification of this application and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
Please refer to FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram of an application scenario of the information push method based on data analysis provided by an embodiment of the application. FIG. 2 is a schematic flowchart of the information push method based on data analysis provided by an embodiment of the application. The potential user mining is applied in a terminal 10 and is implemented through interaction between the terminal 10 and a server 20.
FIG. 2 is a schematic flowchart of the information push method based on data analysis provided by an embodiment of the present application. As shown in the figure, the method includes the following steps S110-S140.
S110. Collect user behavior data by means of a web crawler.
In an embodiment, user behavior data refers to data recorded by the network about actions the user performs online, for example, a user searching for compulsory traffic insurance on Taobao. A web crawler is a program or script that automatically fetches information from the World Wide Web according to certain rules. Specifically, some specific webpages are first selected as start pages, and the crawler crawls webpages beginning from the start pages. After crawling finishes, the large number of crawled webpages are filtered to obtain target webpages, which are webpages that users are likely to browse. Finally, the behavior data of users browsing the target webpages is obtained from the preset database of the target webpages.
In an embodiment, as shown in FIG. 3, step S110 may include steps S111-S113.
S111. Crawl preset webpages by means of a web crawler.
Specifically, a web crawler is a program that automatically fetches information from the World Wide Web according to certain rules, and it mainly comprises three parts: collection, storage, and processing. First, the URLs of representative webpages are selected as initial URLs from which to start fetching data from servers; the preset webpages are these representative webpages. The initial URLs are chosen from the customer's perspective. Customers usually look for car insurance information through a search engine, so, for example, the results page of a Baidu search for compulsory traffic insurance, or the results page of a Taobao search for compulsory traffic insurance, can serve as an initial URL. Next, the fetched webpages are stored and then parsed and filtered. The page fetched from an initial URL contains new URLs; the page is parsed and the new URLs are filtered to select those related to insurance. For example, the URL of an insurance FAQ page is placed in the queue of URLs waiting to be fetched, while unrelated URLs are discarded. Finally, the next webpage URL to fetch is selected from the queue, and the process is repeated until the entire network has been traversed.
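The queue-based crawl described above can be sketched as a breadth-first traversal. This is a simplified illustration rather than the crawler of the application: an in-memory dictionary of made-up URLs stands in for HTTP fetching and HTML parsing, and relevance is judged by a naive keyword check. The names `crawl`, `fetch_links`, and `is_relevant` are hypothetical.

```python
from collections import deque


def crawl(seed_urls, fetch_links, is_relevant):
    """Visit pages breadth-first, queueing only links judged relevant."""
    queue = deque(seed_urls)
    visited = set()
    crawled = []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        crawled.append(url)  # "storage" step: keep the fetched page
        for link in fetch_links(url):  # "processing" step: parse out new URLs
            if link not in visited and is_relevant(link):
                queue.append(link)  # unrelated URLs are simply dropped
    return crawled


# Hypothetical link graph standing in for real pages.
LINKS = {
    "https://search.example/q=compulsory-insurance": [
        "https://insurer-a.example/insurance/faq",
        "https://news.example/sports",
    ],
    "https://insurer-a.example/insurance/faq": [
        "https://insurer-a.example/insurance/quote",
        "https://search.example/q=compulsory-insurance",
    ],
}

PAGES = crawl(
    ["https://search.example/q=compulsory-insurance"],
    fetch_links=lambda url: LINKS.get(url, []),
    is_relevant=lambda url: "insurance" in url,
)
print(PAGES)
```

The `visited` set keeps the crawl from looping when pages link back to each other, and the traversal stops once the queue is empty.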
S112. Filter the crawled webpages according to a preset webpage index to obtain target webpages.
Specifically, because the crawled webpages include a large number of worthless webpages, they need to be filtered further, and the valuable webpages, that is, the webpages users are likely to browse, are selected as target webpages. The crawled webpages are evaluated and filtered according to the preset webpage index to obtain the target webpages. The preset webpage index is the webpage index provided by a data sharing platform built on the search and browsing behavior data of the massive user bases of the major search engines. The webpage index is a value obtained through a series of evaluations of a website's browsing data (page views, browsing duration, and number of visits); for example, the official website of insurance company A has a webpage index of 89. The preset webpage index of each crawled webpage is obtained, the crawled webpages are sorted by preset webpage index from high to low, and the ten highest-ranked webpages are selected as target webpages. It will be understood that a different number of webpages may also be selected as target webpages.
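The ranking step can be sketched as a sort followed by a cut-off. Apart from the index of 89 for insurance company A, the site names and index values below are made up, and `select_target_pages` is an illustrative name.

```python
def select_target_pages(page_index, top_n=10):
    """Sort pages by webpage index, highest first, and keep the top_n."""
    ranked = sorted(page_index.items(), key=lambda item: item[1], reverse=True)
    return [url for url, _ in ranked[:top_n]]


# Hypothetical webpage index values for four crawled sites.
PAGE_INDEX = {
    "insurer-a.example": 89,
    "forum.example": 42,
    "insurer-b.example": 75,
    "blog.example": 12,
}
TARGET_PAGES = select_target_pages(PAGE_INDEX, top_n=2)
print(TARGET_PAGES)
```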
S113. Obtain the user's behavior data from a preset database according to the target webpage.
Specifically, the preset database is the database that stores the target webpage; it stores all data related to the target webpage. After the target webpage has been selected, its interface is called according to the URL of the target webpage; this interface is provided with the consent of the operator of the target webpage. By calling the interface, the web log of the target webpage is obtained from the preset database, and the web log is then parsed to obtain the user's behavior data, which includes user information, the user's browsing records, the user's IP address, and so on.
S120. Perform feature engineering on the behavior data by means of one-hot encoding and normalization to obtain target data.
In one embodiment, feature engineering refers to the process of transforming raw data into the target data of a model. Commonly used feature engineering methods include timestamp processing, decomposing categorical attributes, binning/partitioning, feature crossing, feature selection, feature scaling, and feature extraction. The behavior data falls into two main categories: numerical behavior data, such as vehicle age, browsing duration, and annual income, and non-numerical behavior data, such as favoriting, commenting, following, and adding to a shopping cart. Specifically, non-numerical behavior data is converted into target data suitable as model input by decomposing categorical attributes, and numerical behavior data is converted into target data suitable as model input by feature scaling.
In one embodiment, as shown in FIG. 4, step S120 may include steps S121-S122.
S121. Perform one-hot encoding on the non-numerical behavior data to obtain target data.
Specifically, non-numerical features are engineered by decomposing categorical attributes, that is, the behavior data is encoded with one-hot encoding. One-hot encoding uses an N-bit state register to encode N states: each state has its own register bit, and at any time exactly one bit is active. For example, the gender attribute takes the values male and female; after one-hot encoding, the target data for "male" is [1,0] and the target data for "female" is [0,1]. As another example, for whether the user has favorited the webpage, the target data for "favorited" is [1,0] and the target data for "not favorited" is [0,1].
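A minimal sketch of the encoding just described, assuming the list of categories for each attribute is known in advance; the function name is hypothetical.

```python
def one_hot(value, categories):
    """Encode value as a list with a single 1 at its category position."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


print(one_hot("male", ["male", "female"]))                    # prints [1, 0]
print(one_hot("not favorited", ["favorited", "not favorited"]))  # prints [0, 1]
```

The order of `categories` fixes which bit stands for which state, so the same order has to be used for every sample.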
S122. Normalize the numerical behavior data according to a preset formula to obtain target data.
Specifically, numerical features are engineered by feature scaling. Some numerical features span a much wider range than others, for example annual income compared with age, so to prevent features from differing greatly in magnitude, the feature values need to be scaled into the same range. A preset formula is used to normalize the numerical data, as follows:
$X' = (X - \min X) / (\max X - \min X)$
where $X'$ is the normalized feature value, $X$ is the value of the feature for the current user, $\min X$ is the minimum value of that feature, and $\max X$ is the maximum value of that feature. For example, if the maximum annual income is 500,000, the minimum annual income is 60,000, and the current user's annual income is 100,000, then normalization gives (100000 - 60000) / (500000 - 60000) ≈ 0.09, a normalized feature value in the range 0 to 1.
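The annual income example can be checked with a few lines of code. This is a sketch of the min-max formula only; the function name is hypothetical, and the guard against a zero range is an addition not discussed in the text.

```python
def min_max_normalize(x, min_x, max_x):
    """Scale x into [0, 1] using X' = (X - minX) / (maxX - minX)."""
    if max_x == min_x:
        raise ValueError("max_x and min_x must differ")
    return (x - min_x) / (max_x - min_x)


# Annual income example: min 60,000, max 500,000, current user 100,000.
print(round(min_max_normalize(100_000, 60_000, 500_000), 2))  # prints 0.09
```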
S130. Input the target data into a pre-trained potential user mining model to output a potential user prediction value, where the potential user prediction value characterizes the likelihood that the user is a potential user.
In one embodiment, the potential user mining model is built with the gradient boosting decision tree (GBDT) algorithm. A gradient boosting decision tree is an ensemble decision tree algorithm in which multiple decision trees are chained in series: each tree learns from the residual of the previous tree, the residual is obtained from the gradient, and all the trees combined form the gradient boosting decision tree. For example, suppose potential users are predicted from two features, user age and user annual income. Users A, B, C, and D are aged 18, 26, 36, and 41, with annual incomes of 0, 300,000, 100,000, and 500,000 respectively. The first decision tree splits on age (with 30 as the threshold), placing A and B in the under-30 class and C and D in the over-30 class; the prediction values of A, B, C, and D as potential users are 0.1, 0.3, 0.6, and 0.8 respectively. The residual for A and B is the difference between each prediction value and the average of their prediction values: the average for A and B is 0.2, so their residuals are -0.1 and 0.1 respectively; the average for C and D is 0.7, so their residuals are likewise -0.1 and 0.1. The next decision tree then predicts from the residuals of the previous tree and splits on annual income (with 150,000 as the threshold), placing A and C below 150,000 and B and D above 150,000. Solving the next tree from the residuals of the previous tree gives a residual of 0 for A and C, that is, (-0.1+0.1)/2=0, and likewise a residual of 0 for B and D, so that the residuals of all users are finally 0. The final prediction values of A, B, C, and D are thus 0, 0.4, 0.5, and 0.9 respectively, the final prediction value being the sum of the prediction value and the residual. The core of the method is that each tree learns the residual of the sum of the conclusions of all previous trees. The potential user mining model is trained in advance and is run on the Spark platform to predict on the target data. Spark is a fast, general-purpose computing engine designed for large-scale data processing. The Spark platform includes the component Spark MLlib (Machine Learning Library), whose algorithm library contains the gradient boosting decision tree algorithm; Spark MLlib provides the algorithm interface through which the gradient boosting decision tree algorithm predicts on the target data.
In one embodiment, as shown in FIG. 5, step S130 may include steps S131-S132.
S131. Construct target samples from the target data.
Specifically, a target sample is a sample suitable as model input, composed of target data and a label. Target samples are divided into positive samples, whose label value is 1, and negative samples, whose label value is 0. A positive sample is, for example, a customer whose annual income is at least 100,000; a negative sample is, for example, a customer who has not purchased a car. If the customer's annual income is 100,000, the target sample is (0.09, 1), where 0.09 is the feature value and 1 is the label value; if the customer has not purchased a car, the target sample is (0, 0).
S132. Input the target samples into the gradient boosting decision tree model, which is updated iteratively, to output the prediction value of the potential user.
Specifically, the potential user mining model uses the gradient boosting decision tree algorithm, which runs for multiple rounds of iteration. Each round produces one decision tree, fitted on the basis of the loss function of the previous round's decision tree, and the conclusions of all decision trees are finally added up to give the prediction value. The formulas of the gradient boosting decision tree algorithm are as follows:
Figure PCTCN2020092856-appb-000001
$F_m(x) = F_{m-1}(x) + T(x; \theta_m)$
Figure PCTCN2020092856-appb-000002
$L[y, F(x)] = [y - F(x)]^2$
where $F_M(x)$ denotes the model, $T(x; \theta_m)$ denotes a decision tree, $\theta_m$ is the parameters of the decision tree, $m$ is the number of decision trees, $L$ is the loss function, $x$ is the sample feature, and $y$ is the sample label. A sample feature and a sample label together make up a target sample, the label value is 0 or 1, and $i$ is the number of samples. $T$ is a CART decision tree, a typical binary decision tree that can be used for classification or regression. Specifically, the model is first initialized by setting $F_0(x) = 0$; the loss function is then computed on the target samples and the model is updated according to the loss function; iteration continues until it ends, which gives the final model; finally, the prediction values of the decision trees in the model are summed and averaged to obtain the prediction value of the potential user.
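The iteration described above, starting from $F_0(x) = 0$ and adding one tree per round under the squared loss, can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the Spark MLlib implementation the text refers to: each tree is a depth-1 regression stump rather than a full CART tree, the `learning_rate` parameter and all function names are hypothetical, and the sketch sums the tree outputs as in the update formula.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals.

    Returns (feature, threshold, left_value, right_value); samples with
    x[feature] <= threshold take left_value, the others right_value.
    """
    best = None
    for f in range(len(xs[0])):
        for t in sorted({x[f] for x in xs}):
            left = [r for x, r in zip(xs, residuals) if x[f] <= t]
            right = [r for x, r in zip(xs, residuals) if x[f] > t]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lv) ** 2 for r in left)
                   + sum((r - rv) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, (f, t, lv, rv))
    if best is None:                  # no valid split: constant leaf
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1]


def stump_predict(stump, x):
    f, t, lv, rv = stump
    return lv if x[f] <= t else rv


def fit_gbdt(xs, ys, n_trees=10, learning_rate=1.0):
    """F_0(x) = 0, then F_m = F_{m-1} + learning_rate * T(x; theta_m)."""
    trees, preds = [], [0.0] * len(xs)
    for _ in range(n_trees):
        # For the squared loss, the negative gradient is proportional
        # to the residual y - F(x), so each tree is fitted to residuals.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return trees


def predict_gbdt(trees, x, learning_rate=1.0):
    """Sum the contributions of all trees for one sample."""
    return sum(learning_rate * stump_predict(t, x) for t in trees)
```

A call such as `fit_gbdt([[0.0], [0.2], [0.8], [1.0]], [0, 0, 1, 1], n_trees=3)` fits normalized one-feature samples with 0/1 labels; `predict_gbdt` then returns a score that can be compared with a threshold.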
S140. Compare the potential user prediction value with a preset threshold to determine the potential users, and push information to the potential users.
In one embodiment, after the prediction value of the potential user is obtained, it is compared with the preset threshold. If the prediction value is greater than the preset threshold, the user is determined to be a potential user; if the prediction value is less than the preset threshold, the user is determined to be a non-potential user. For example, with a preset threshold of 0.6 and a prediction value of 0.8, the prediction value exceeds the threshold, so the user is determined to be a potential user. Once the potential users have been identified, advertisements are pushed to them; the pushed advertisements may be insurance application information, car insurance product information, insurance application links, and the like. Specifically, the list of potential users and the advertisement link are sent to the operator of the target webpage, and the operator pushes the advertisement link, based on the user's IP address, when a potential user logs in and browses the webpage.
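The threshold comparison amounts to a simple filter. In the sketch below the function name and the user identifiers are hypothetical, and a value exactly equal to the threshold is treated as non-potential, which is an assumption because the text does not cover that case.

```python
def select_potential_users(predictions, threshold=0.6):
    """Return the user IDs whose prediction value exceeds threshold.

    predictions maps user ID -> potential user prediction value.
    """
    return [uid for uid, score in predictions.items() if score > threshold]


print(select_potential_users({"u1": 0.8, "u2": 0.4}))  # prints ['u1']
```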
In one embodiment, as shown in FIG. 6, steps S150-S160 are further included after step S140.
S150. Obtain the feedback result of the advertisement push.
In one embodiment, the feedback result indicates whether a potential user opened the advertisement link pushed on the target webpage: if the user opened the link, the feedback is positive; if the user did not open it, the feedback is negative. Specifically, the feedback result is obtained from the target webpage, where it is stored in the form of web logs in the preset database of the target webpage's operator. The interface is therefore called to obtain the web logs from the preset database of the target webpage and parse them. A regular expression is then set with the URL of the pushed advertisement link as the rule string, and the browsing records for that advertisement link are filtered out of the web logs; these browsing records are the feedback result.
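The regular expression filter might look like the following sketch. The log layout (one visit per line, with the visited URL as a whitespace-delimited field) is an assumption, since the text does not specify the web log format, and the function name is hypothetical.

```python
import re


def filter_ad_visits(log_lines, ad_url):
    """Return the log lines that record a visit to ad_url."""
    # re.escape makes URL characters such as '.' and '?' match literally;
    # the lookahead keeps ad_url from matching as a prefix of a longer URL.
    pattern = re.compile(re.escape(ad_url) + r"(?=\s|$)")
    return [line for line in log_lines if pattern.search(line)]
```

Each returned line corresponds to one browsing record of the advertisement link, that is, one piece of positive feedback.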
S160. According to the feedback result, send an email prompting optimization of the potential user mining model.
In one embodiment, whether the potential user mining model needs to be optimized is judged mainly by the conversion rate, which is the proportion of potential users who viewed the pushed advertisement link among all potential users; the more potential users view the pushed advertisement link, the higher the conversion rate. Specifically, the actual conversion rate is compared with the expected conversion rate. If the actual conversion rate is greater than the expected conversion rate, the potential user mining model converts well and does not need to be optimized; if the actual conversion rate is less than the expected conversion rate, the model converts poorly and needs to be optimized. A prompt email is generated from the feedback result and sent to the email address of the model's administrator, indicating that the model needs to be optimized.
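The conversion rate check can be sketched as below. The function names are hypothetical, and sending the prompt email is left out because the text gives no details of the mail setup.

```python
def conversion_rate(clicked_users, pushed_users):
    """Share of pushed potential users who opened the advertisement link."""
    pushed = set(pushed_users)
    if not pushed:
        raise ValueError("no potential users were pushed")
    return len(set(clicked_users) & pushed) / len(pushed)


def needs_optimization(actual_rate, expected_rate):
    """True when the actual conversion rate falls below the expected rate."""
    return actual_rate < expected_rate
```

With two of four pushed users opening the link, `conversion_rate` returns 0.5; against an expected rate of 0.6, `needs_optimization` returns `True`, which is the case in which the prompt email would be sent.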
The embodiments of this application present an information pushing method based on data analysis: user behavior data is collected by means of a web crawler; feature engineering is performed on the behavior data by one-hot encoding and normalization to obtain target data; the target data is input into a pre-trained potential user mining model to output a potential user prediction value, which characterizes the likelihood that the user is a potential user; and the potential user prediction value is compared with a preset threshold to determine the potential users, to whom information is pushed. In this way potential insurance customers can be identified, advertisements can be pushed effectively, and the cost to the enterprise of acquiring user information is reduced.
FIG. 7 is a schematic block diagram of an information pushing apparatus 200 based on data analysis provided by an embodiment of this application. As shown in FIG. 7, corresponding to the above information pushing method based on data analysis, this application further provides an information pushing apparatus 200 based on data analysis. The apparatus 200 includes units for executing the above method and may be configured in a terminal such as a desktop computer, a tablet computer, or a laptop computer. Specifically, referring to FIG. 7, the apparatus 200 includes a crawler unit 210, a feature engineering unit 220, a prediction unit 230, and a pushing unit 240.
The crawler unit 210 is configured to collect user behavior data by means of a web crawler.
In one embodiment, user behavior data is data about actions that a user performs on the network and that the network records, for example a user searching for compulsory traffic insurance on Taobao. A web crawler is a program or script that automatically fetches information from the World Wide Web according to certain rules. Specifically, certain webpages are first selected as start pages, and the crawler crawls webpages starting from them. After crawling ends, the large number of crawled webpages is filtered to obtain the target webpages, which are the pages that users will browse. Finally, the behavior data of users browsing the target webpages is obtained from the preset database of the target webpages.
In one embodiment, as shown in FIG. 8, the crawler unit 210 includes a crawler subunit 211, a screening unit 212, and an acquisition subunit 213.
The crawler subunit 211 is configured to crawl the preset webpages by means of a web crawler.
Specifically, a web crawler is a program that automatically fetches information from the World Wide Web according to certain rules; it mainly comprises three parts: collection, storage, and processing. First, the URL of a representative webpage is selected as the initial URL from which to start fetching data from the server; the preset webpage is this representative webpage. The initial URL is chosen from the customer's point of view: customers usually look up car insurance information through a search engine, so the results page of a Baidu search for compulsory traffic insurance, or the results page of a Taobao search for compulsory traffic insurance, can serve as the initial URL. The fetched webpage is then stored, parsed, and filtered. The page fetched from the initial URL contains new URLs; these are filtered to select the insurance-related ones, for example the URL of an insurance FAQ page, which are placed in the queue of URLs waiting to be crawled, while the remaining irrelevant URLs are discarded. Finally, the next URL to crawl is taken from the queue, and the above process is repeated until the entire network has been traversed.
The screening unit 212 is configured to filter the crawled webpages according to a preset webpage index to obtain the target webpages.
Specifically, because the crawled webpages include a large number of worthless pages, they need to be filtered further, and some valuable pages, namely pages that users are likely to browse, are selected as the target webpages. The crawled webpages are evaluated and filtered according to the preset webpage index to obtain the target webpages. The preset webpage index is the webpage index provided by a data sharing platform built on the search behavior data of the large number of users of the major search engines. The preset webpage index of each crawled webpage is obtained, the crawled webpages are sorted by index from high to low, and the top ten are selected as the target webpages. It will be understood that a different number of webpages may also be selected as the target webpages.
The acquisition subunit 213 is configured to obtain the user's behavior data from a preset database according to the target webpage.
Specifically, the preset database is the database that stores the target webpage; it stores all data related to the target webpage. After the target webpage has been selected, its interface is called according to the URL of the target webpage; this interface is provided with the consent of the operator of the target webpage. By calling the interface, the web log of the target webpage is obtained from the preset database, and the web log is then parsed to obtain the user's behavior data, which includes user information, the user's browsing records, the user's IP address, and so on.
The feature engineering unit 220 is configured to perform feature engineering on the behavior data by means of one-hot encoding and normalization to obtain target data.
In one embodiment, feature engineering refers to the process of transforming raw data into the target data of a model. Commonly used feature engineering methods include timestamp processing, decomposing categorical attributes, binning/partitioning, feature crossing, feature selection, feature scaling, and feature extraction. The behavior data falls into two main categories: numerical behavior data, such as vehicle age, browsing duration, and annual income, and non-numerical behavior data, such as favoriting, commenting, following, and adding to a shopping cart. Specifically, non-numerical behavior data is converted into target data suitable as model input by decomposing categorical attributes, and numerical behavior data is converted into target data suitable as model input by feature scaling.
In one embodiment, as shown in FIG. 8, the feature engineering unit 220 includes an encoding unit 221 and a normalization unit 222.
The encoding unit 221 is configured to perform one-hot encoding on the non-numerical behavior data to obtain target data.
Specifically, non-numerical features are engineered by decomposing categorical attributes, that is, the behavior data is encoded with one-hot encoding. One-hot encoding uses an N-bit state register to encode N states: each state has its own register bit, and at any time exactly one bit is active. For example, the gender attribute takes the values male and female; after one-hot encoding, the target data for "male" is [1,0] and the target data for "female" is [0,1]. As another example, for whether the user has favorited the webpage, the target data for "favorited" is [1,0] and the target data for "not favorited" is [0,1].
The normalization unit 222 is configured to normalize the numerical behavior data according to a preset formula to obtain target data.
Specifically, numerical features are engineered by feature scaling. Some numerical features span a much wider range than others, for example annual income compared with age, so to prevent features from differing greatly in magnitude, the feature values need to be scaled into the same range. A preset formula is used to normalize the numerical data, as follows:
$X' = (X - \min X) / (\max X - \min X)$
where $X'$ is the normalized feature value, $X$ is the value of the feature for the current user, $\min X$ is the minimum value of that feature, and $\max X$ is the maximum value of that feature. For example, if the maximum annual income is 500,000, the minimum annual income is 60,000, and the current user's annual income is 100,000, then normalization gives (100000 - 60000) / (500000 - 60000) ≈ 0.09, a normalized feature value in the range 0 to 1.
The prediction unit 230 is configured to input the target data into a pre-trained potential user mining model to output a potential user prediction value, where the potential user prediction value characterizes the likelihood that the user is a potential user.
In one embodiment, the potential user mining model is built with the gradient boosting decision tree (GBDT) algorithm. A gradient boosting decision tree is an ensemble decision tree algorithm in which multiple decision trees are chained in series: each tree learns from the residual of the previous tree, the residual is obtained from the gradient, and all the trees combined form the gradient boosting decision tree. For example, suppose potential users are predicted from two features, user age and user annual income. Users A, B, C, and D are aged 18, 26, 36, and 41, with annual incomes of 0, 300,000, 100,000, and 500,000 respectively. The first decision tree splits on age (with 30 as the threshold), placing A and B in the under-30 class and C and D in the over-30 class; the prediction values of A, B, C, and D as potential users are 0.1, 0.3, 0.6, and 0.8 respectively. The residual for A and B is the difference between each prediction value and the average of their prediction values: the average for A and B is 0.2, so their residuals are -0.1 and 0.1 respectively; the average for C and D is 0.7, so their residuals are likewise -0.1 and 0.1. The next decision tree then predicts from the residuals of the previous tree and splits on annual income (with 150,000 as the threshold), placing A and C below 150,000 and B and D above 150,000. Solving the next tree from the residuals of the previous tree gives a residual of 0 for A and C, that is, (-0.1+0.1)/2=0, and likewise a residual of 0 for B and D, so that the residuals of all users are finally 0. The final prediction values of A, B, C, and D are thus 0, 0.4, 0.5, and 0.9 respectively, the final prediction value being the sum of the prediction value and the residual. The core of the method is that each tree learns the residual of the sum of the conclusions of all previous trees. The potential user mining model is trained in advance and is run on the Spark platform to predict on the target data. Spark is a fast, general-purpose computing engine designed for large-scale data processing. The Spark platform includes the component Spark MLlib (Machine Learning Library), whose algorithm library contains the gradient boosting decision tree algorithm; Spark MLlib provides the algorithm interface through which the gradient boosting decision tree algorithm predicts on the target data.
在一实施例中,如图8所示,所述特征工程单元220包括:构建单元231以及预测子单元232。In an embodiment, as shown in FIG. 8, the feature engineering unit 220 includes: a construction unit 231 and a prediction subunit 232.
构建单元231,用于根据所述目标数据构建目标样本。The construction unit 231 is configured to construct a target sample according to the target data.
具体地,目标样本指的是由目标数据和标签(label)构成的可供模型输入的样本,其中,目标样本分为正样本以及负样本,正样本的标签值为1,负样本的标签值为0。正样本例如为年收入大于等于10万,负样本例如为没有购买车,若客户的年收入10万为那么该目标样本为(0.09,1),若客户没有购买车那么该目标样本为(0,0)。Specifically, a target sample refers to a sample composed of target data and a label (label) for model input, where the target sample is divided into a positive sample and a negative sample, the label value of the positive sample is 1, and the label value of the negative sample Is 0. For example, the positive sample is annual income greater than or equal to 100,000, and the negative sample is for example not buying a car. If the customer’s annual income is 100,000, the target sample is (0.09, 1), and if the customer does not purchase a car, the target sample is (0 , 0).
The prediction subunit 232 is configured to input the target sample into the gradient boosting decision tree model for iterative updating and to output the predicted value of the potential user.
Specifically, the potential user mining model uses the gradient boosting decision tree algorithm. The algorithm runs multiple rounds of iteration, and each round produces one decision tree; the decision tree of each round is fitted on the basis of the loss function of the previous round's decision tree, and finally the conclusions of all decision trees are added up to obtain the predicted value. The formulas of the gradient boosting decision tree algorithm are as follows:
Figure PCTCN2020092856-appb-000003
$F_m(x) = F_{m-1}(x) + T(x; \theta_m)$
Figure PCTCN2020092856-appb-000004
$L[y, F(x)] = [y - F(x)]^2$
Here, $F_M(x)$ denotes the model, $T(x; \theta_m)$ denotes a decision tree, $\theta_m$ is the decision tree parameter, m is the number of decision trees, L is the loss function, x is the sample feature, and y is the sample label. The sample feature and the sample label together form the target sample, the label value is 0 or 1, and i is the number of samples. T is a CART decision tree, a typical binary decision tree that can be used for classification or regression. Specifically, the decision tree is first initialized by setting $F_0(x) = 0$; the loss function is then calculated from the target samples; the model is then updated according to the loss function, and iteration continues until it ends, yielding the final model; finally, the predicted values of every decision tree in the model are summed and averaged to obtain the predicted value of the potential user.
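The update rule $F_m(x) = F_{m-1}(x) + T(x; \theta_m)$ with squared loss can be illustrated by a small pure-Python sketch. It rests on several assumptions that the application does not state: each tree is reduced to a depth-1 regression stump rather than a full CART tree, each stump is fitted to the current residuals $y - F_{m-1}(x)$ (which is what minimizing squared loss amounts to), and the prediction is the additive sum of all trees as in the update rule. All function names are hypothetical.

```python
def fit_stump(X, r):
    """Fit a depth-1 regression tree to residuals r by minimizing squared error.

    Returns (feature index, threshold, left value, right value).
    """
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [ri for row, ri in zip(X, r) if row[j] <= t]
            right = [ri for row, ri in zip(X, r) if row[j] > t]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((ri - lv) ** 2 for ri in left) + sum((ri - rv) ** 2 for ri in right)
            if err < best_err:
                best_err, best = err, (j, t, lv, rv)
    if best is None:  # no split possible: fall back to a constant prediction
        mean = sum(r) / len(r)
        best = (0, float("inf"), mean, mean)
    return best


def stump_predict(stump, row):
    j, t, lv, rv = stump
    return lv if row[j] <= t else rv


def fit_gbdt(X, y, n_trees):
    """Build n_trees stumps, each fitted to the residuals of the current model."""
    trees = []
    F = [0.0] * len(y)  # F_0(x) = 0
    for _ in range(n_trees):
        residuals = [yi - fi for yi, fi in zip(y, F)]
        tree = fit_stump(X, residuals)
        trees.append(tree)
        # F_m(x) = F_{m-1}(x) + T(x; theta_m)
        F = [fi + stump_predict(tree, row) for fi, row in zip(F, X)]
    return trees


def predict(trees, row):
    """Additive prediction F_M(x): the sum of all tree outputs."""
    return sum(stump_predict(t, row) for t in trees)


# Toy data: one normalized feature, labels 0 or 1.
X = [[0.0], [0.1], [0.8], [0.9]]
y = [0, 0, 1, 1]
model = fit_gbdt(X, y, n_trees=3)
```

On this toy data the first stump already separates the two labels, so the later stumps fit residuals of zero and leave the prediction unchanged.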
The pushing unit 240 is configured to compare the potential user predicted value with a preset threshold to determine potential users and to push information to the potential users.
In an embodiment, after the predicted value of a potential user is obtained, it is compared with the preset threshold. If the predicted value is greater than the preset threshold, the user is determined to be a potential user; if the predicted value is less than the preset threshold, the user is determined to be a non-potential user. For example, if the preset threshold is 0.6 and the predicted value is 0.8, the predicted value is greater than the preset threshold and the user is determined to be a potential user. Once the potential users have been determined, advertisements are pushed to them; the pushed advertisements may be insurance application information, auto insurance product information, insurance application links, and the like. Specifically, the list of potential users and the advertisement link are sent to the operator of the target webpage, and the operator pushes the advertisement link, according to the user's IP address, when a potential user logs in and browses the webpage.
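The threshold comparison can be sketched as a simple filter over predicted values. The function name and the dictionary of user identifiers are hypothetical; the text does not say how a predicted value exactly equal to the threshold is handled, so this sketch assumes such a user is not treated as a potential user.

```python
def select_potential_users(predicted_values, threshold):
    """Return the users whose predicted value is strictly greater than the threshold."""
    return [user for user, value in predicted_values.items() if value > threshold]


# With the threshold 0.6 from the text, a predicted value of 0.8 qualifies.
potential = select_potential_users({"u1": 0.8, "u2": 0.3, "u3": 0.6}, threshold=0.6)
```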
In an embodiment, as shown in FIG. 9, the information pushing apparatus 200 based on data analysis further includes an obtaining unit 250 and a prompting unit 260.
The obtaining unit 250 is configured to obtain the feedback result of the advertisement push.
In an embodiment, the feedback result indicates whether a potential user has opened the advertisement link pushed by the target webpage: if the user opened the pushed advertisement link, the feedback is positive; if the user did not open it, the feedback is negative. Specifically, the feedback result is obtained from the target webpage, where it is stored as webpage logs in the preset database of the target webpage's operator. An interface is therefore called to obtain the webpage logs from the preset database of the target webpage and parse them; a regular expression is then set with the URL of the pushed advertisement link as the rule string, and the browsing records for that advertisement link are filtered out of the webpage logs. These browsing records are the feedback result.
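Filtering log lines with the advertisement URL as the rule string might look like the sketch below. The log line layout (IP address, timestamp, method, URL) and the example URLs are invented for illustration, since the application does not describe the log format. The URL is escaped so that characters such as `.` are matched literally, and a lookahead keeps a longer URL that merely starts with the advertisement URL from matching.

```python
import re


def filter_ad_visits(log_lines, ad_url):
    """Return the log lines that record a visit to ad_url."""
    # After the URL, allow only end of line, whitespace, a quote,
    # or the start of a query string or fragment.
    pattern = re.compile(re.escape(ad_url) + r'(?=[\s"?#]|$)')
    return [line for line in log_lines if pattern.search(line)]


# Hypothetical log lines: "<ip> <timestamp> <method> <url>"
logs = [
    "10.0.0.1 2020-05-28T10:00:00 GET https://ads.example.com/auto-insurance",
    "10.0.0.2 2020-05-28T10:01:00 GET https://www.example.com/news",
    "10.0.0.3 2020-05-28T10:02:00 GET https://ads.example.com/auto-insurance?src=push",
    "10.0.0.4 2020-05-28T10:03:00 GET https://ads.example.com/auto-insurance-old",
]
visits = filter_ad_visits(logs, "https://ads.example.com/auto-insurance")
```

Here the first and third lines are kept; the second is a different page and the fourth is a different URL that only shares a prefix.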
The prompting unit 260 is configured to prompt, by email and according to the feedback result, optimization of the potential user mining model.
In an embodiment, whether the potential user mining model needs to be optimized is judged mainly by the conversion rate. The conversion rate is the proportion of potential users who browsed the pushed advertisement link among all potential users; the more potential users browse the pushed advertisement link, the higher the conversion rate. Specifically, the actual conversion rate is compared with the expected conversion rate. If the actual conversion rate is greater than the expected conversion rate, the potential user mining model converts well and does not need to be optimized; if the actual conversion rate is less than the expected conversion rate, the model converts poorly and needs to be optimized.
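The conversion-rate check can be sketched as follows. The function names are hypothetical, and two details are assumptions because the text leaves them open: an empty list of potential users yields a rate of 0.0, and an actual rate exactly equal to the expected rate is treated as not requiring optimization.

```python
def conversion_rate(clicked_users, potential_users):
    """Share of potential users who browsed the pushed advertisement link."""
    potential = set(potential_users)
    if not potential:
        return 0.0
    return len(potential & set(clicked_users)) / len(potential)


def needs_optimization(actual_rate, expected_rate):
    """The model is flagged for optimization when it converts below expectation."""
    return actual_rate < expected_rate


rate = conversion_rate(["a", "b"], ["a", "b", "c", "d"])  # 2 of 4 potential users
flag = needs_optimization(rate, expected_rate=0.6)
```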
The embodiment of the present application presents an information pushing apparatus based on data analysis, which: collects behavior data of users by means of a web crawler; performs feature engineering processing on the behavior data by means of one-hot encoding and normalization to obtain target data; inputs the target data into a pre-trained potential user mining model to output a potential user predicted value, the potential user predicted value characterizing the likelihood that the user is a potential user; and compares the potential user predicted value with a preset threshold to determine potential users and pushes information to them. The apparatus can thereby mine potential insurance customers, push advertisements effectively, and reduce the enterprise's cost of obtaining user information.
It should be noted that those skilled in the art can clearly understand that, for the specific implementation of the above information pushing apparatus 200 based on data analysis and of each of its units, reference may be made to the corresponding description in the foregoing method embodiment; for convenience and brevity, it is not repeated here.
The above information pushing apparatus based on data analysis may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in FIG. 10.
Please refer to FIG. 10, which is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 500 may be a terminal, where the terminal may be an electronic device with a communication function such as a smartphone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, or a wearable device. Referring to FIG. 10, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. The computer program 5032 includes program instructions which, when executed, cause the processor 502 to perform an information pushing method based on data analysis.
The processor 502 provides computing and control capabilities to support the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, it causes the processor 502 to perform an information pushing method based on data analysis.
The network interface 505 is used for network communication with other devices. Those skilled in the art can understand that the structure shown in FIG. 10 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown in the figure, may combine certain components, or may have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps: collecting behavior data of users by means of a web crawler; performing feature engineering processing on the behavior data by means of one-hot encoding and normalization to obtain target data; inputting the target data into a pre-trained potential user mining model to output a potential user predicted value, the potential user predicted value characterizing the likelihood that the user is a potential user; and comparing the potential user predicted value with a preset threshold to determine potential users and pushing information to the potential users.
In an embodiment, when implementing the step of collecting behavior data of users by means of a web crawler, the processor 502 specifically implements the following steps: crawling preset webpages by means of a web crawler; filtering the crawled webpages according to a preset webpage index to obtain a target webpage; and obtaining the behavior data of users from a preset database according to the target webpage.
In an embodiment, when implementing the step of performing feature engineering processing on the behavior data by means of one-hot encoding and normalization to obtain target data, the processor 502 specifically implements the following steps: performing one-hot encoding on the non-numeric behavior data to obtain target data; and normalizing the numeric behavior data according to a preset formula to obtain target data.
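The two feature engineering steps can be illustrated with the short sketch below. The preset normalization formula is not reproduced in this passage, so min-max scaling is used here purely as an assumption; the function names, the category list, and the value range are likewise hypothetical.

```python
def one_hot(value, categories):
    """Encode a non-numeric value as a 0/1 vector with one position per category."""
    return [1 if value == category else 0 for category in categories]


def min_max(value, lowest, highest):
    """Scale a numeric value into [0, 1]; min-max scaling is an assumed formula."""
    if highest == lowest:
        return 0.0
    return (value - lowest) / (highest - lowest)


# Hypothetical behavior record: a categorical field and a numeric field.
gender_vector = one_hot("female", ["male", "female"])
scaled_income = min_max(50, 0, 200)
target_data = gender_vector + [scaled_income]
```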
In an embodiment, when implementing the step of inputting the target data into a pre-trained potential user mining model to output a potential user predicted value, the potential user predicted value characterizing the likelihood that the user is a potential user, the processor 502 specifically implements the following steps: constructing a target sample according to the target data; and inputting the target sample into a gradient boosting decision tree model for iterative updating to output the predicted value of the potential user.
In an embodiment, after implementing the step of comparing the potential user predicted value with a preset threshold to determine potential users and pushing information to the potential users, the processor 502 further implements the following steps: obtaining the feedback result of the advertisement push; and prompting, by email and according to the feedback result, optimization of the potential user mining model.
It should be understood that, in the embodiments of the present application, the processor 502 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments may be completed by a computer program instructing the relevant hardware. The computer program includes program instructions and may be stored in a storage medium, the storage medium being a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the process steps of the above method embodiments.
Accordingly, the present application also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program, and the computer program includes program instructions. When executed by a processor, the program instructions cause the processor to perform the following steps: collecting behavior data of users by means of a web crawler; performing feature engineering processing on the behavior data by means of one-hot encoding and normalization to obtain target data; inputting the target data into a pre-trained potential user mining model to output a potential user predicted value, the potential user predicted value characterizing the likelihood that the user is a potential user; and comparing the potential user predicted value with a preset threshold to determine potential users and pushing information to the potential users. Optionally, the computer-readable storage medium may be a non-volatile storage medium or a volatile storage medium.
In an embodiment, when executing the program instructions to implement the step of collecting behavior data of users by means of a web crawler, the processor specifically implements the following steps: crawling preset webpages by means of a web crawler; filtering the crawled webpages according to a preset webpage index to obtain a target webpage; and obtaining the behavior data of users from a preset database according to the target webpage.
In an embodiment, when executing the program instructions to implement the step of performing feature engineering processing on the behavior data by means of one-hot encoding and normalization to obtain target data, the processor specifically implements the following steps: performing one-hot encoding on the non-numeric behavior data to obtain target data; and normalizing the numeric behavior data according to a preset formula to obtain target data.
In an embodiment, when executing the program instructions to implement the step of inputting the target data into a pre-trained potential user mining model to output a potential user predicted value, the potential user predicted value characterizing the likelihood that the user is a potential user, the processor specifically implements the following steps: constructing a target sample according to the target data; and inputting the target sample into a gradient boosting decision tree model for iterative updating to output the predicted value of the potential user.
In an embodiment, after executing the program instructions to implement the step of comparing the potential user predicted value with a preset threshold to determine potential users and pushing information to the potential users, the processor further implements the following steps: obtaining the feedback result of the advertisement push; and prompting, by email and according to the feedback result, optimization of the potential user mining model.
The storage medium may be any computer-readable storage medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed.
The steps in the methods of the embodiments of the present application may be reordered, merged, or deleted according to actual needs. The units in the apparatuses of the embodiments of the present application may be merged, divided, or deleted according to actual needs. In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a terminal, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and these modifications or replacements shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. An information pushing method based on data analysis, comprising:
    collecting behavior data of users by means of a web crawler;
    performing feature engineering processing on the behavior data by means of one-hot encoding and normalization to obtain target data;
    inputting the target data into a pre-trained potential user mining model to output a potential user predicted value, the potential user predicted value characterizing the likelihood that the user is a potential user;
    comparing the potential user predicted value with a preset threshold to determine potential users and pushing information to the potential users.
  2. The information pushing method based on data analysis according to claim 1, wherein said collecting behavior data of users by means of a web crawler comprises:
    crawling preset webpages by means of a web crawler;
    filtering the crawled webpages according to a preset webpage index to obtain a target webpage;
    obtaining the behavior data of users from a preset database according to the target webpage.
  3. The information pushing method based on data analysis according to claim 1, wherein said performing feature engineering processing on the behavior data by means of one-hot encoding and normalization to obtain target data comprises:
    performing one-hot encoding on the non-numeric behavior data to obtain target data;
    normalizing the numeric behavior data according to a preset formula to obtain target data.
  4. The information pushing method based on data analysis according to claim 1, wherein said inputting the target data into a pre-trained potential user mining model to output a potential user predicted value, the potential user predicted value characterizing the likelihood that the user is a potential user, comprises:
    constructing a target sample according to the target data;
    inputting the target sample into a gradient boosting decision tree model for iterative updating to output the predicted value of the potential user.
  5. The information pushing method based on data analysis according to claim 1, wherein after said comparing the potential user predicted value with a preset threshold to determine potential users and pushing information to the potential users, the method further comprises:
    obtaining the feedback result of the advertisement push;
    prompting, by email and according to the feedback result, optimization of the potential user mining model.
  6. The information pushing method based on data analysis according to claim 4, wherein said inputting the target sample into a gradient boosting decision tree model for iterative updating to output the predicted value of the potential user comprises:
    initializing a decision tree model, and calculating a loss function according to the target sample;
    updating the decision tree model according to the loss function, and continuing to iterate the decision tree model until the iteration ends to obtain the final decision tree model;
    summing and averaging the predicted values of every decision tree in the decision tree model to obtain the predicted value of the potential user.
  7. The information pushing method based on data analysis according to claim 5, wherein the feedback result indicates whether a potential user has opened the advertisement link pushed by the target webpage;
    wherein the feedback result includes positive feedback or negative feedback, the positive feedback indicating that the user has opened the advertisement link pushed by the target webpage, and the negative feedback indicating that the user has not opened the advertisement link pushed by the target webpage.
  8. An information pushing apparatus based on data analysis, comprising:
    a crawler unit, configured to collect behavior data of users by means of a web crawler;
    a feature engineering unit, configured to perform feature engineering processing on the behavior data by means of one-hot encoding and normalization to obtain target data;
    a prediction unit, configured to input the target data into a pre-trained potential user mining model to output a potential user predicted value, the potential user predicted value characterizing the likelihood that the user is a potential user;
    a pushing unit, configured to compare the potential user predicted value with a preset threshold to determine potential users and to push information to the potential users.
  9. A computer device, wherein the computer device comprises a memory and a processor, a computer program is stored in the memory, and the processor implements the following steps when executing the computer program:
    collecting behavior data of users by means of a web crawler;
    performing feature engineering processing on the behavior data by means of one-hot encoding and normalization to obtain target data;
    inputting the target data into a pre-trained potential user mining model to output a potential user predicted value, the potential user predicted value characterizing the likelihood that the user is a potential user;
    comparing the potential user predicted value with a preset threshold to determine potential users and pushing information to the potential users.
  10. The computer device according to claim 9, wherein when performing said collecting behavior data of users by means of a web crawler, the processor specifically performs the following steps:
    crawling preset webpages by means of a web crawler;
    filtering the crawled webpages according to a preset webpage index to obtain a target webpage;
    obtaining the behavior data of users from a preset database according to the target webpage.
  11. The computer device according to claim 9, wherein when performing said feature engineering processing on the behavior data by means of one-hot encoding and normalization to obtain target data, the processor specifically performs the following steps:
    performing one-hot encoding on the non-numeric behavior data to obtain target data;
    normalizing the numeric behavior data according to a preset formula to obtain target data.
  12. 根据权利要求9所述的计算机设备,其中,所述处理器执行所述将所述目标数据输入至预先训练好的潜在用户挖掘模型中以输出潜在用户预测值,所述潜在用户预测值用于表征所述用户属于潜在用户的可能性时,具体执行以下步骤:The computer device according to claim 9, wherein the processor executes the input of the target data into a pre-trained potential user mining model to output a potential user prediction value, and the potential user prediction value is used for When characterizing the possibility that the user is a potential user, the following steps are specifically performed:
    根据所述目标数据构建目标样本;Construct a target sample according to the target data;
    将所述目标样本输入至梯度提升决策树模型中进行迭代更新输出潜在用户的预测值。The target sample is input into the gradient boosting decision tree model to iteratively update the predicted value of the potential user.
  13. The computer device according to claim 9, wherein, after comparing the potential user prediction value with the preset threshold to determine the potential user and pushing information to the potential user, the processor further performs the following steps:
    obtaining a feedback result of the advertisement push;
    prompting, by email and according to the feedback result, optimization of the potential user mining model.
  14. The computer device according to claim 12, wherein, when inputting the target sample into the gradient boosting decision tree model for iterative updating and outputting the prediction value of the potential user, the processor specifically performs the following steps:
    initializing a decision tree model, and calculating a loss function according to the target sample;
    updating the decision tree model according to the loss function, and continuing to iterate the decision tree model until the iteration ends, so as to obtain the final decision tree model;
    summing and averaging the prediction values of each decision tree in the decision tree model to obtain the prediction value of the potential user.
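The iterate-and-update loop of claim 14 can be sketched with a toy gradient boosting model. This is an illustration under stated assumptions, not the claimed model: it uses squared loss, depth-1 trees (stumps) on a single feature, and the conventional additive combination with a learning rate `lr`, whereas the claim speaks of summing and averaging the tree outputs. All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree on one feature by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, threshold, lmean, rmean)
    if best is None:  # no valid split: predict the mean residual everywhere
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, lmean, rmean = stump
    return lmean if x <= threshold else rmean


def fit_gbdt(xs, ys, rounds=20, lr=0.5):
    """Initialize with the label mean, then repeatedly fit stumps to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        # For squared loss, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, lr


def predict(model, x):
    """Combine the initial value with the scaled sum of all stump outputs."""
    base, stumps, lr = model
    return base + lr * sum(stump_predict(s, x) for s in stumps)


# Toy target sample: one feature, label 1 means "potential user".
model = fit_gbdt([1, 2, 3, 4], [0, 0, 1, 1])
low, high = predict(model, 1.5), predict(model, 3.5)
```

On this toy sample the prediction approaches 0 for small feature values and 1 for large ones, which is the score later compared against the preset threshold.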
  15. The computer device according to claim 13, wherein the feedback result is used to indicate whether the potential user has opened the advertisement link pushed by the target webpage;
    wherein the feedback result comprises positive feedback or negative feedback, the positive feedback indicating that the user has opened the advertisement link pushed by the target webpage, and the negative feedback indicating that the user has not opened the advertisement link pushed by the target webpage.
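The feedback handling of claims 13 and 15 can be sketched as follows. The claims do not say when an optimization prompt should be sent, so this sketch assumes a hypothetical rule: prompt when the share of positive feedback (opened links) falls below `min_rate`. The message is only constructed with the standard library's `EmailMessage`, not sent; the recipient address is a placeholder.

```python
from email.message import EmailMessage


def click_rate(feedback):
    """feedback: list of booleans, True = positive feedback (ad link opened)."""
    return sum(feedback) / len(feedback) if feedback else 0.0


def build_optimization_prompt(feedback, min_rate, recipient):
    """Return an email prompting model optimization, or None if the rate is acceptable."""
    rate = click_rate(feedback)
    if rate >= min_rate:
        return None
    msg = EmailMessage()
    msg["To"] = recipient
    msg["Subject"] = "Potential user mining model: optimization suggested"
    msg.set_content(f"Ad link open rate {rate:.0%} is below the target {min_rate:.0%}.")
    return msg


# One positive and three negative feedback results.
feedback = [True, False, False, False]
msg = build_optimization_prompt(feedback, min_rate=0.5, recipient="ml-team@example.com")
```

Sending the message (for example with `smtplib`) is left out, since the claims do not describe the mail infrastructure.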
  16. A computer-readable storage medium, wherein the storage medium stores a computer program which, when executed by a processor, implements the following steps:
    collecting a user's behavior data by means of a web crawler;
    performing feature engineering processing on the behavior data by means of one-hot encoding and normalization to obtain target data;
    inputting the target data into a pre-trained potential user mining model to output a potential user prediction value, the potential user prediction value being used to characterize the likelihood that the user is a potential user;
    comparing the potential user prediction value with a preset threshold to determine a potential user, and pushing information to the potential user.
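The final comparison step of claim 16 can be sketched in a few lines. Whether the comparison is strict or inclusive is not stated in the claim, so `>=` is an assumption, and the push itself is simulated by returning (user, message) pairs rather than calling a real delivery service.

```python
def select_potential_users(predictions, threshold):
    """Return users whose prediction value reaches the preset threshold (assumed >=)."""
    return [user for user, score in predictions.items() if score >= threshold]


def push_information(users, message):
    """Simulated push: pair each selected user with the message to deliver."""
    return [(user, message) for user in users]


# Hypothetical model outputs keyed by user ID.
predictions = {"u1": 0.82, "u2": 0.31, "u3": 0.67}

potential = select_potential_users(predictions, threshold=0.6)
pushed = push_information(potential, "New product offer")
# potential == ["u1", "u3"]
```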
  17. The computer-readable storage medium according to claim 16, wherein, when the user's behavior data is collected by means of a web crawler, the computer program is executed by the processor to implement the following steps:
    crawling preset webpages by means of a web crawler;
    filtering the crawled webpages according to a preset webpage index to obtain target webpages;
    obtaining the user's behavior data from a preset database according to the target webpages.
  18. The computer-readable storage medium according to claim 16, wherein, when feature engineering processing is performed on the behavior data by means of one-hot encoding and normalization to obtain the target data, the computer program is executed by the processor to implement the following steps:
    performing one-hot encoding on the non-numerical behavior data to obtain target data;
    normalizing the numerical behavior data according to a preset formula to obtain target data.
  19. The computer-readable storage medium according to claim 16, wherein, when the target data is input into the pre-trained potential user mining model to output the potential user prediction value, the potential user prediction value being used to characterize the likelihood that the user is a potential user, the computer program is executed by the processor to implement the following steps:
    constructing a target sample according to the target data;
    inputting the target sample into a gradient boosting decision tree model for iterative updating, and outputting the prediction value of the potential user.
  20. The computer-readable storage medium according to claim 16, wherein, after the potential user prediction value is compared with the preset threshold to determine the potential user and information is pushed to the potential user, the computer program is further executed by the processor to implement the following steps:
    obtaining a feedback result of the advertisement push;
    prompting, by email and according to the feedback result, optimization of the potential user mining model.
PCT/CN2020/092856 2019-08-13 2020-05-28 Information pushing method and apparatus based on data analysis, computer device, and storage medium WO2021027362A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910745385.7 2019-08-13
CN201910745385.7A CN110688553A (en) 2019-08-13 2019-08-13 Information pushing method and device based on data analysis, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021027362A1 (en) 2021-02-18

Family

ID=69108252

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/092856 WO2021027362A1 (en) 2019-08-13 2020-05-28 Information pushing method and apparatus based on data analysis, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN110688553A (en)
WO (1) WO2021027362A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925982A (en) * 2021-03-12 2021-06-08 上海意略明数字科技股份有限公司 User redirection method and device, storage medium and computer equipment
CN113344626A (en) * 2021-06-03 2021-09-03 上海冰鉴信息科技有限公司 Data feature optimization method and device based on advertisement push
CN113987018A (en) * 2021-10-27 2022-01-28 平安国际智慧城市科技股份有限公司 Character feature mining method, device, equipment and storage medium
CN115860836A (en) * 2022-12-07 2023-03-28 广东南粤分享汇控股有限公司 E-commerce service pushing method and system based on user behavior big data analysis
CN117976124A (en) * 2024-03-29 2024-05-03 四川省肿瘤医院 Disease prevention information pushing system and pushing method

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688553A (en) * 2019-08-13 2020-01-14 平安科技(深圳)有限公司 Information pushing method and device based on data analysis, computer equipment and storage medium
CN111459993B (en) * 2020-02-17 2023-06-06 平安科技(深圳)有限公司 Configuration updating method, device, equipment and storage medium based on behavior analysis
CN111475671B (en) * 2020-03-12 2023-09-26 支付宝(杭州)信息技术有限公司 Voice document processing method and device and server
CN111507849A (en) * 2020-03-25 2020-08-07 上海商汤智能科技有限公司 Authority guaranteeing method and related device and equipment
CN111507768B (en) * 2020-04-17 2023-04-07 腾讯科技(深圳)有限公司 Potential user determination method and related device
CN111931809A (en) * 2020-06-29 2020-11-13 北京大米科技有限公司 Data processing method and device, storage medium and electronic equipment
CN112001760B (en) * 2020-08-28 2021-10-12 贝壳找房(北京)科技有限公司 Potential user mining method and device, electronic equipment and storage medium
CN112100237B (en) * 2020-09-04 2023-08-15 北京百度网讯科技有限公司 User data processing method, device, equipment and storage medium
CN112308635A (en) * 2020-11-25 2021-02-02 拉扎斯网络科技(上海)有限公司 Data processing method and device and resource providing method and device
CN113177148B (en) * 2021-05-21 2022-06-24 滨州职业学院 Data pushing method and device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229986A (en) * 2016-12-14 2018-06-29 腾讯科技(深圳)有限公司 Feature construction method, information distribution method and device in Information prediction
CN108520442A (en) * 2018-04-10 2018-09-11 电子科技大学 A kind of displaying ad click rate prediction technique based on fusion structure
CN109167816A (en) * 2018-08-03 2019-01-08 广州虎牙信息科技有限公司 Information-pushing method, device, equipment and storage medium
CN109684554A (en) * 2018-12-26 2019-04-26 腾讯科技(深圳)有限公司 The determination method and news push method of the potential user of news
CN110688553A (en) * 2019-08-13 2020-01-14 平安科技(深圳)有限公司 Information pushing method and device based on data analysis, computer equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9947028B1 (en) * 2014-02-27 2018-04-17 Intuit Inc. System and method for increasing online conversion rate of potential users
CN105005918B (en) * 2015-07-24 2018-07-17 金鹃传媒科技股份有限公司 A kind of online advertisement push appraisal procedure analyzed based on user behavior data and potential user's influence power
CN106803190A (en) * 2017-01-03 2017-06-06 北京掌阔移动传媒科技有限公司 A kind of ad personalization supplying system and method
CN109636430A (en) * 2017-10-09 2019-04-16 北京京东尚科信息技术有限公司 Object identifying method and its system
CN108256052B (en) * 2018-01-15 2023-07-11 成都达拓智通科技有限公司 Tri-tracking-based potential customer identification method for automobile industry
CN109509040A (en) * 2019-01-03 2019-03-22 广发证券股份有限公司 Predict modeling method, marketing method and the device of fund potential customers

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925982A (en) * 2021-03-12 2021-06-08 上海意略明数字科技股份有限公司 User redirection method and device, storage medium and computer equipment
CN112925982B (en) * 2021-03-12 2023-04-07 上海意略明数字科技股份有限公司 User redirection method and device, storage medium and computer equipment
CN113344626A (en) * 2021-06-03 2021-09-03 上海冰鉴信息科技有限公司 Data feature optimization method and device based on advertisement push
CN113987018A (en) * 2021-10-27 2022-01-28 平安国际智慧城市科技股份有限公司 Character feature mining method, device, equipment and storage medium
CN113987018B (en) * 2021-10-27 2024-05-07 平安国际智慧城市科技股份有限公司 Character feature mining method, device, equipment and storage medium
CN115860836A (en) * 2022-12-07 2023-03-28 广东南粤分享汇控股有限公司 E-commerce service pushing method and system based on user behavior big data analysis
CN115860836B (en) * 2022-12-07 2023-09-26 广东南粤分享汇控股有限公司 E-commerce service pushing method and system based on user behavior big data analysis
CN117976124A (en) * 2024-03-29 2024-05-03 四川省肿瘤医院 Disease prevention information pushing system and pushing method

Also Published As

Publication number Publication date
CN110688553A (en) 2020-01-14

Similar Documents

Publication Publication Date Title
WO2021027362A1 (en) Information pushing method and apparatus based on data analysis, computer device, and storage medium
Peng et al. Reinforced, incremental and cross-lingual event detection from social messages
Jia et al. A practical approach to constructing a knowledge graph for cybersecurity
CN107818344B (en) Method and system for classifying and predicting user behaviors
CN109345399B (en) Method, device, computer equipment and storage medium for evaluating risk of claim settlement
US11080340B2 (en) Systems and methods for classifying electronic information using advanced active learning techniques
US20200050968A1 (en) Interactive interfaces for machine learning model evaluations
US9390176B2 (en) System and method for recursively traversing the internet and other sources to identify, gather, curate, adjudicate, and qualify business identity and related data
US20100077301A1 (en) Systems and methods for electronic document review
US20120303661A1 (en) Systems and methods for information extraction using contextual pattern discovery
CN111581983A (en) Method for predicting social concern hotspots in network public opinion events based on group analysis
US8078642B1 (en) Concurrent traversal of multiple binary trees
US8825620B1 (en) Behavioral word segmentation for use in processing search queries
Fang et al. TAP: A static analysis model for PHP vulnerabilities based on token and deep learning technology
US11803600B2 (en) Systems and methods for intelligent content filtering and persistence
CN113139134B (en) Method and device for predicting popularity of user-generated content in social network
US20200320153A1 (en) Method for accessing data records of a master data management system
CN111190968A (en) Data preprocessing and content recommendation method based on knowledge graph
CN114118192A (en) Training method, prediction method, device and storage medium of user prediction model
CN111259220A (en) Data acquisition method and system based on big data
CN111444424A (en) Information recommendation method and information recommendation system
Sadesh et al. Automatic Clustering of User Behaviour Profiles for Web Recommendation System.
Shayegh et al. Automated approach to improve iot privacy policies
Duan et al. Multi-feature fused collaborative attention network for sequential recommendation with semantic-enriched contrastive learning
CN111242519B (en) User characteristic data generation method and device and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 20852301; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: PCT application non-entry in European phase. Ref document number: 20852301; Country of ref document: EP; Kind code of ref document: A1