CN106909663B

CN106909663B - Label user brand preference behavior prediction method and device

Info

Publication number: CN106909663B
Application number: CN201710110119.8A
Authority: CN
Inventors: 江有归; 封雷; 马嵩; 徐焕根
Original assignee: Hangzhou Adtime Technology Co ltd
Current assignee: Hangzhou Adtime Technology Co ltd
Priority date: 2017-02-27
Filing date: 2017-02-27
Publication date: 2020-07-28
Anticipated expiration: 2037-02-27
Also published as: CN106909663A

Abstract

The invention relates to a label-based method and device for predicting user brand preference behavior. The method comprises: obtaining URL data reflecting a user's internet browsing behavior; parsing the URL data, extracting search keywords from the parsing result, and storing them in a user search behavior table; extracting e-commerce commodity codes from the parsing result; obtaining the e-commerce browsing data corresponding to the commodity codes from a crawler database and storing it in a user e-commerce browsing behavior table; applying intelligent text word segmentation and semantic analysis to the data stored in the user search behavior table and the user e-commerce browsing behavior table, and deleting data inconsistent with brand information to form a first data set; and performing cluster analysis on the first data set to obtain the user's degree of preference for brand information and calculate the user's brand preference. The keyword extraction technique greatly improves keyword extraction efficiency.

Description

Label user brand preference behavior prediction method and device

Technical Field

The invention relates to the field of information technology, and in particular to a label-based method and device for predicting user brand preference behavior.

Background

Existing brand preference behavior prediction schemes on the market generally follow three stages: data normalization, keyword identification and matching, and brand weight statistics.

During data normalization, the data sources are complex, especially when data is collected directly from the internet by machine, so character types, lengths, and similar properties cannot be fully unified, and uniform data formatting is required. Normalization effectively removes dirty data, reduces the influence of invalid data, and improves the efficiency and accuracy of later analysis.

During keyword identification and matching, a text word dimension table library, which requires long-term manual maintenance, is used to segment the normalized text sentences so that the core words are segmented accurately. The segmented data is then matched against a brand dimension table library to obtain the brand information described in the text, and a preliminary weight is calculated from indicators such as text similarity, matching rate, and occurrence frequency to obtain a brand weight score for the text. Because brands on the market change frequently and Chinese text semantics are diverse, the brand dimension table library usually needs frequent or irregular maintenance to preserve the brand matching rate and accuracy.

During brand weight statistics, the final weight of each brand preference is obtained by clustering, based on the word segmentation results combined with features such as the frequency of each brand and the similarity between brands.

The existing problems are: most data screening still involves a large amount of manual intervention, with low efficiency and long execution time; and there is no good technical means for handling data analysis errors caused by semantics, so the error rate is high and the authenticity of the data is questionable.

Disclosure of Invention

In the prior art, most data screening still involves a large amount of manual intervention, with low efficiency and long execution time, and there is no good technical means for handling data analysis errors caused by semantics, so the error rate is high and the authenticity of the data is questionable. To address these defects, the invention provides a label-based method and device for predicting user brand preference behavior.

The method comprises: obtaining URL data reflecting a user's internet browsing behavior; parsing the URL data, extracting search keywords from the parsing result, and storing them in a user search behavior table; extracting e-commerce commodity codes from the parsing result; obtaining the e-commerce browsing data corresponding to the commodity codes from a crawler database and storing it in a user e-commerce browsing behavior table; applying intelligent text word segmentation and semantic analysis to the data stored in the user search behavior table and the user e-commerce browsing behavior table, and deleting data inconsistent with brand information to form a first data set; and performing cluster analysis on the first data set to obtain the user's degree of preference for brand information and calculate the user's brand preference.

Optionally, the method further comprises filtering the URL data through a preset blacklist and whitelist.

Optionally, obtaining the brand preference of the user by using the brand preference data model specifically comprises:

calculating the brand preference of the user using the following formula:

where α_platform_j is the calculated platform weight, N_i is the number of e-commerce merchants selling brand i, α_action is the calculated behavior weight, and α_t is the calculated time weight and frequency weight.

Optionally, the semantic analysis is specifically completed by the semantic similarity algorithm of Word2vec.

Optionally, extracting search keywords from the parsing result specifically comprises:

extracting brand keywords from the parsing result based on the average mutual information;

the average mutual information is calculated by the following equation:

where I(x_i; y_j) is the probability of x and y co-occurring; p(x_i y_j) is the probability of x and y occurring simultaneously; p(x_i | y_j) is the probability that x occurs when y occurs; p(x_i) is the probability that x occurs; and x and y are any two words.

The invention also provides a label-based user brand preference behavior prediction device, comprising a URL data acquisition module, a keyword extraction module, a commodity code extraction module, an e-commerce browsing data acquisition module, a first data set generation module, and a brand preference degree generation module. The URL data acquisition module is used for acquiring URL data reflecting the user's internet browsing behavior. The keyword extraction module is used for parsing the URL data, extracting search keywords from the parsing result, and storing them in a user search behavior table. The commodity code extraction module is used for extracting e-commerce commodity codes from the parsing result. The e-commerce browsing data acquisition module is used for acquiring the e-commerce browsing data corresponding to the commodity codes from a crawler database and storing it in the user e-commerce browsing behavior table. The first data set generation module is used for applying intelligent text word segmentation and semantic analysis to the data stored in the two tables and deleting data inconsistent with brand information to form a first data set. The brand preference degree generation module is used for performing cluster analysis on the first data set, obtaining the user's degree of preference for brand information, and calculating the user's brand preference.

Optionally, the URL data acquisition module is further configured to filter the URL data through a preset blacklist and whitelist.

Optionally, the brand preference degree generation module is specifically configured to calculate the brand preference of the user using the following formula:

Optionally, the first data set generation module is specifically configured to complete the semantic analysis through the semantic similarity algorithm of Word2vec.

Optionally, the keyword extraction module is specifically configured to extract brand keywords from the parsing result based on the average mutual information;

the average mutual information is calculated by the following equation:

Through the keyword extraction technique, the method greatly improves keyword extraction efficiency, reduces labor cost, and reduces the error rate of manual output. Through the semantic correction technique, it further improves the accuracy of URL text information extraction, ensuring the authenticity and reliability of subsequent brand preference analysis results. By establishing a brand preference model, it accurately analyzes the user's brand preference, dynamically adjusts the assignment and division of labels, predicts user behavior based on the labels, and enables enterprises to make accurate recommendations and provide personalized services.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow chart of the label-based user brand preference behavior prediction method of the present invention.

FIG. 2 is a flow chart of a specific algorithm of the present invention.

FIG. 3 is a partial flow diagram of the present invention.

FIG. 4 is a partial schematic view of the present invention.

FIG. 5 is a schematic structural diagram of the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples, which are illustrative of the present invention and are not to be construed as being limited thereto.

An exemplary method:

The label-based user brand preference behavior prediction method of the invention comprises the following steps:

S1: Acquire URL data reflecting the user's internet browsing behavior.

Optionally, because URL data is sparse, some preprocessing may be performed before keyword extraction. For example, the URL data may be filtered through a preset blacklist and whitelist, which reduces the amount of URL data to be processed and makes the remaining data more representative.
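The optional filtering step can be sketched as follows. This is a minimal illustration only: the host names in `BLACKLIST` and `WHITELIST` and the function name `filter_urls` are hypothetical, since the text does not specify what the preset lists contain or how they are matched.

```python
from urllib.parse import urlsplit

# Hypothetical preset lists; the actual list contents are not given in the text.
BLACKLIST = {"ads.example.com", "tracker.example.net"}
WHITELIST = {"search.example.com", "shop.example.com"}


def filter_urls(urls, whitelist=WHITELIST, blacklist=BLACKLIST):
    """Keep URLs whose host is on the whitelist and not on the blacklist."""
    kept = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        if host in blacklist:
            continue  # explicitly excluded source
        if whitelist and host not in whitelist:
            continue  # not a source of interest
        kept.append(url)
    return kept
```

Matching on the host name is one possible design choice; matching on URL patterns or path prefixes would fit the same step equally well.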

S2: Parse the URL data, extract search keywords from the parsing result, and store them in a user search behavior table.

A URL (Uniform Resource Locator) is the access address of a site or resource on the network. Parsing is the process of obtaining the content corresponding to the URL, and the parsing result may include text, image, or other types of information.

Specifically, the URLs accessed by the user may be parsed, and keywords may be extracted from the parsed URL data according to preset brand information and stored in the user search behavior table.
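As a rough sketch of step S2, the snippet below pulls search terms out of URL query strings and keeps those that mention a preset brand. The parameter names in `SEARCH_PARAMS`, the row layout, and both function names are assumptions made for illustration; real search engines use varying parameter names, and the text does not define the table schema.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical query parameter names that carry the search term.
SEARCH_PARAMS = ("q", "wd", "keyword")


def extract_search_keywords(url):
    """Return the decoded search terms found in the URL query string."""
    query = parse_qs(urlsplit(url).query)
    keywords = []
    for name in SEARCH_PARAMS:
        keywords.extend(query.get(name, []))
    return keywords


def build_search_behavior_table(user_id, urls, brands):
    """Build rows of a (hypothetical) user search behavior table."""
    rows = []
    for url in urls:
        for keyword in extract_search_keywords(url):
            for brand in brands:
                # Simple substring match against the preset brand information.
                if brand.lower() in keyword.lower():
                    rows.append(
                        {"user_id": user_id, "keyword": keyword, "brand": brand}
                    )
    return rows
```

For example, `extract_search_keywords("https://search.example.com/s?q=Jiuyang+soymilk+maker")` yields the single term `"Jiuyang soymilk maker"`, because `parse_qs` decodes `+` as a space.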

More specifically, the keyword search may be performed based on average mutual information, i.e., the degree of association between two words. The average mutual information is the statistical average of the mutual information I(x_i; y_j) over the joint probability space p(xy).

I(x_i; y_j) can be calculated by the following equation:

I(x_i; y_j) = \log \frac{p(x_i \mid y_j)}{p(x_i)}

and the average mutual information is its expectation over the joint distribution:

I(X; Y) = \sum_i \sum_j p(x_i y_j) \, I(x_i; y_j)

In practice, the probability ratio of x and y is obtained as p(x_i | y_j) / p(x_i), and a logarithm is then applied to this ratio; the base of the logarithm can be ignored. The affinity of x and y is given by p(x_i y_j). The log ratios, weighted by these affinities, are then summed to obtain the final statistical average. This technique of finding keywords in a large amount of information through average mutual information is common in the prior art and well known to those skilled in the art, so it is not described further here.
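The calculation described above (log of the ratio p(x|y)/p(x), weighted by the joint probability and summed) can be sketched as below. The input format, a list of observed (x, y) word pairs, and the use of relative frequencies as probability estimates are assumptions for illustration; base 2 is chosen arbitrarily, since the text notes the base can be ignored.

```python
import math
from collections import Counter


def average_mutual_information(pairs):
    """Estimate I(X; Y) from a non-empty list of observed (x, y) word pairs."""
    n = len(pairs)
    joint = Counter(pairs)                 # counts of (x, y) together
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)

    total = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n                       # p(x_i y_j)
        p_x_given_y = c / count_y[y]       # p(x_i | y_j)
        p_x = count_x[x] / n               # p(x_i)
        total += p_xy * math.log2(p_x_given_y / p_x)
    return total
```

With two perfectly dependent words the estimate is 1 bit, and with independent words it is 0, which matches the usual reading of mutual information as a degree of association.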

S3: Parse the URL data and extract the e-commerce commodity code from the parsing result.

As in step S2, parsing refers to obtaining the content corresponding to the URL, which may include text, image, or other types of information. The URL accessed by the user is parsed, the e-commerce commodity code is extracted from the parsed information, and the e-commerce browsing data corresponding to the commodity code is acquired and stored in the user e-commerce browsing behavior table. An e-commerce commodity code is generally a number or a combination of numbers and letters, appears within a specific character string, and can be identified and extracted directly through predetermined rules.
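A rule-based extraction of this kind might look like the sketch below. The path pattern `/item/<code>.html` and the parameter names `id` and `item_id` are hypothetical rules chosen for the example; each e-commerce site would need its own predetermined rules.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical predetermined rules for locating the commodity code.
ITEM_PATH_RULE = re.compile(r"/item/([0-9A-Za-z]+)\.html$")
ITEM_QUERY_PARAMS = ("id", "item_id")
CODE_RULE = re.compile(r"[0-9A-Za-z]+")


def extract_commodity_code(url):
    """Return the commodity code embedded in the URL, or None if absent."""
    parts = urlsplit(url)

    # Rule 1: code embedded in the path, e.g. /item/12345.html
    match = ITEM_PATH_RULE.search(parts.path)
    if match:
        return match.group(1)

    # Rule 2: code carried as a query parameter, e.g. ?id=AB12
    query = parse_qs(parts.query)
    for name in ITEM_QUERY_PARAMS:
        values = query.get(name)
        if values and CODE_RULE.fullmatch(values[0]):
            return values[0]
    return None
```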

S4: Acquire the e-commerce browsing data corresponding to the commodity code from a crawler database and store it in the user e-commerce browsing behavior table.

Massive internet data is collected daily in a recurring cycle, and commodity IDs and their associated information are extracted through crawler technology. The commodity IDs extracted from the URL information in the user's internet browsing behavior data are compared with the crawler data, so that brand preference is identified by direct matching.
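The direct matching in step S4 amounts to a lookup of each extracted commodity ID in the crawled data. In the sketch below the crawler database is modeled as an in-memory dictionary; the record fields (`title`, `brand`) and the function name are assumptions, as the text does not describe the database schema.

```python
def build_browsing_behavior_table(user_id, commodity_codes, crawler_db):
    """Join extracted commodity codes against crawled commodity records."""
    rows = []
    for code in commodity_codes:
        item = crawler_db.get(code)
        if item is None:
            continue  # no crawled record for this code, so nothing to match
        rows.append({"user_id": user_id, "commodity_code": code, **item})
    return rows


# Hypothetical crawled records keyed by commodity ID.
crawler_db = {
    "12345": {"title": "Jiuyang soymilk maker", "brand": "Jiuyang"},
}
table = build_browsing_behavior_table("u1", ["12345", "99999"], crawler_db)
```

Here only the code `12345` produces a row, because `99999` has no counterpart in the crawled data.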

S5: Apply intelligent text word segmentation and semantic analysis to the data stored in the user search behavior table and the user e-commerce browsing behavior table, and delete data inconsistent with brand information to form a first data set. The brand information indicates which brand a commodity belongs to.
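One way to picture the deletion in step S5 is as a similarity filter over word vectors. The sketch below uses hand-written toy vectors and a cosine similarity threshold; in practice the vectors would come from a trained Word2vec model, and the threshold value, the vector table, and the function names are all assumptions made for this illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def keep_brand_consistent(terms, vectors, brand, threshold=0.5):
    """Drop terms whose vector is not close enough to the brand vector."""
    brand_vec = vectors[brand]
    return [
        term for term in terms
        if term in vectors
        and cosine_similarity(vectors[term], brand_vec) >= threshold
    ]


# Toy two-dimensional vectors, for illustration only.
vectors = {
    "Jiuyang": [1.0, 0.0],
    "soymilk maker": [0.9, 0.1],
    "sunshine": [0.0, 1.0],
}
first_data_set = keep_brand_consistent(["soymilk maker", "sunshine"], vectors, "Jiuyang")
```

With these toy vectors, "soymilk maker" is kept and "sunshine" is removed, mirroring the removal of terms that merely resemble the brand name.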

Intelligent word segmentation and semantic analysis aim to eliminate, as far as possible, the erroneous information extraction caused by simple keyword matching. A user's general shopping tendency includes particular preferences for certain brands, so data whose brand information is wrong or implausible can be deleted first to form a first data set that better supports behavior prediction. Specifically, the semantic similarity algorithm of Word2vec is used for semantic correction, with the following model:

FIG. 3 is a schematic diagram of the CBOW (Continuous Bag-of-Words) model, in which the current word w_t is predicted given its known context w_(t-2), w_(t-1), w_(t+1), w_(t+2).

As shown in FIG. 3 and FIG. 4, the model comprises:

Input layer: the word vectors v(Context(w)_1), v(Context(w)_2), …, v(Context(w)_2c) ∈ R^m of the 2c words in Context(w), where m is the length of the word vectors.

Projection layer: the 2c vectors of the input layer are summed, i.e. x_w = \sum_{i=1}^{2c} v(Context(w)_i) ∈ R^m.

Output layer: the output layer corresponds to a binary tree, namely a Huffman tree built with the words appearing in the corpus as leaf nodes and the number of occurrences of each word in the corpus as weights. The Huffman tree has N (= |D|) leaf nodes, one for each word in the dictionary D, and N-1 non-leaf nodes.
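The projection and output layers can be illustrated with a small sketch. It shows only the vector summation and the construction of a Huffman tree from word counts, not Word2vec training itself; the function names and the tuple representation of tree nodes are assumptions for this example.

```python
import heapq
from itertools import count


def project(context_vectors):
    """Projection layer: element-wise sum of the 2c context word vectors."""
    return [sum(components) for components in zip(*context_vectors)]


def build_huffman_tree(word_counts):
    """Build a Huffman tree from a non-empty {word: count} mapping.

    Leaves are words, internal nodes are (left, right) tuples.
    Returns (root, number_of_internal_nodes).
    """
    tiebreak = count()  # avoids comparing nodes when frequencies are equal
    heap = [(freq, next(tiebreak), word) for word, freq in word_counts.items()]
    heapq.heapify(heap)

    internal_nodes = 0
    while len(heap) > 1:
        freq_a, _, node_a = heapq.heappop(heap)
        freq_b, _, node_b = heapq.heappop(heap)
        merged = (node_a, node_b)
        heapq.heappush(heap, (freq_a + freq_b, next(tiebreak), merged))
        internal_nodes += 1
    return heap[0][2], internal_nodes
```

For a dictionary of N words, the loop performs N-1 merges, which is why the tree has N-1 non-leaf nodes.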

S6: Perform cluster analysis on the first data set to obtain the user's degree of preference for brand information, and calculate the user's brand preference.

In this embodiment, the user's degree of preference for a given product or brand can be obtained from the first data set through statistical analysis. The weighting can be performed along three dimensions: behavior weight α_action, platform weight α_platform, and time-and-frequency weight α_t. The behavior weighting may follow the order purchase > add to shopping cart > favorite > search > browse.

Correspondingly, the following brand preference data model formula can be used to calculate the user's preference value for a given brand:

In the model, α_platform_j is the calculated platform weight, N_i is the number of e-commerce merchants selling brand i, α_action is the calculated behavior weight, and α_t is the calculated time weight and frequency weight. The three weights α_platform_j, α_action, and α_t are added successively to obtain the final brand preference value; t is time and can be set according to the actual situation.

Preferably, the database is a crawler database, and the data in the crawler database is the full data of the e-commerce website.

FIG. 5 shows a label-based user brand preference behavior prediction apparatus according to an embodiment of the present invention. As shown in FIG. 5, the apparatus includes: a URL data acquisition module 100 configured to acquire URL data reflecting the user's internet browsing behavior; a keyword extraction module 200 configured to parse the URL data, extract search keywords from the parsing result, and store them in a user search behavior table; a commodity code extraction module 300 configured to extract e-commerce commodity codes from the parsing result; an e-commerce browsing data acquisition module 400 configured to acquire the e-commerce browsing data corresponding to the commodity codes from a crawler database and store it in the user e-commerce browsing behavior table; a first data set generation module 500 configured to apply intelligent text word segmentation and semantic analysis to the data stored in the user search behavior table and the user e-commerce browsing behavior table and delete data inconsistent with brand information to form a first data set; and a brand preference degree generation module 600 configured to obtain the user's brand preference using the brand preference data model.

The functional modules 100 to 600 of the apparatus may further execute the corresponding steps of the above method embodiments to implement the corresponding functions. The device embodiment and the method embodiment are based on the same inventive concept, so the details are not repeated here.

An application example of the method is detailed below.

Suppose a user often searches the internet for terms such as 'soybean milk maker', 'household appliance', and 'Jiuyang'. First, the user's internet records are obtained, the keyword 'Jiuyang' is selected, and 'Jiuyang' is stored in the search behavior table. The records may also contain words that are pronounced the same as 'Jiuyang' or that merely contain 'Jiuyang', so text word segmentation and semantic correction must be applied to all data in the search behavior table and the user e-commerce browsing behavior table. E-commerce commodity codes are extracted from the remaining words and matched against the corresponding commodity codes in the database. If the match succeeds, the user's e-commerce browsing data for the corresponding commodity is obtained and stored in the user e-commerce browsing behavior table. All data in the search behavior table and the user e-commerce browsing behavior table is then processed, entries that clearly do not correspond to the brand being mined are removed, and the basic data for analysis is obtained.

A brand preference data model is then established from the basic data to obtain the final brand preference value.

The keywords are extracted as follows: the URL data of the internet browsing behavior is filtered to obtain the search keyword data, and the average mutual information of the keywords is computed. The quantities in the calculation are as follows:

I(X; Y) is the statistical average of the mutual information I(x_i; y_j) over the joint probability space p(xy);

I(x_i; y_j) is computed from the co-occurrence of x and y and gives the degree of association between the two words;

p(x_i y_j) is the probability that x and y occur simultaneously;

p(x_i | y_j) is the probability that x occurs given that y occurs;

p(x_i) is the probability that x occurs. The higher this value, the more likely x is to occur.

Preferably, establishing the brand preference data model means establishing the following brand preference data model:

where α_platform_j is the calculated platform weight, N_i is the number of e-commerce merchants selling brand i, α_action is the calculated behavior weight, and α_t is the calculated time weight and frequency weight.

From the calculation formula, the brand preference degree is determined along four dimensions. Behavior weight: purchase > add to shopping cart > favorite > search > browse; for example, the user buys 'Jiuyang' more often than other brands. Platform weight: e-commerce platform > other platforms. Time weight: the preference decays over time. Frequency weight: the higher the access frequency, the stronger the preference.
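The four dimensions can be illustrated with a toy scoring function. The model formula itself is not reproduced in this text, so everything below is an assumption and not the patented model: the numeric weights, the exponential half-life decay, and the choice to multiply the weights per event and sum over events. The sketch only respects the stated orderings (purchase > shopping cart > favorite > search > browse; e-commerce platform > other platforms; decay over time; more frequent access gives a stronger preference).

```python
# Hypothetical weights that respect the orderings stated in the text.
ACTION_WEIGHT = {
    "purchase": 5.0, "cart": 4.0, "favorite": 3.0, "search": 2.0, "browse": 1.0,
}
PLATFORM_WEIGHT = {"ecommerce": 1.0, "other": 0.5}


def time_weight(days_ago, half_life_days=30.0):
    """Assumed exponential decay: the weight halves every half_life_days."""
    return 0.5 ** (days_ago / half_life_days)


def brand_preference(events, brand):
    """Sum the weighted events for one brand (assumed combination rule)."""
    score = 0.0
    for event in events:
        if event["brand"] != brand:
            continue
        score += (
            ACTION_WEIGHT[event["action"]]
            * PLATFORM_WEIGHT[event["platform"]]
            * time_weight(event["days_ago"])
        )
    return score


events = [
    {"brand": "Jiuyang", "action": "purchase", "platform": "ecommerce", "days_ago": 0},
    {"brand": "Jiuyang", "action": "browse", "platform": "other", "days_ago": 30},
    {"brand": "OtherBrand", "action": "search", "platform": "other", "days_ago": 0},
]
score = brand_preference(events, "Jiuyang")
```

Because every matching event adds to the score, a brand that is accessed more often accumulates a higher value, which is how this sketch models the frequency weight.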

In addition, it should be noted that the specific embodiments described in the present specification may differ in the shape of the components, the names of the components, and the like. All equivalent or simple changes of the structure, the characteristics and the principle of the invention which are described in the patent conception of the invention are included in the protection scope of the patent of the invention. Various modifications, additions and substitutions for the specific embodiments described may be made by those skilled in the art without departing from the scope of the invention as defined in the accompanying claims.

Claims

1. A label-based user brand preference behavior prediction method, characterized by comprising the following steps:

acquiring URL data reflecting the internet browsing behavior of a user;

parsing the URL data, extracting search keywords from the parsing result, and storing the search keywords in a user search behavior table;

extracting e-commerce commodity codes from the parsing result;

acquiring e-commerce browsing data corresponding to the commodity codes from a crawler database and storing the e-commerce browsing data in a user e-commerce browsing behavior table;

applying intelligent text word segmentation and semantic analysis to the data stored in the user search behavior table and the user e-commerce browsing behavior table, and deleting data inconsistent with brand information to form a first data set;

and performing cluster analysis on the first data set to obtain the user's degree of preference for brand information, and calculating the brand preference of the user.

2. The method of claim 1, wherein after acquiring the URL data reflecting the internet browsing behavior of the user, the method further comprises:

filtering the URL data through a preset blacklist and whitelist.

3. The method according to claim 1, wherein the calculating of the brand preference of the user specifically comprises:

calculating the brand preference of the user using the following formula:

wherein α_platform_j is the calculated platform weight; N_i is the number of e-commerce merchants selling brand i; α_action is the calculated behavior weight; and α_t is the calculated time weight and frequency weight.

4. The method according to claim 1, characterized in that the semantic analysis is specifically completed by the semantic similarity algorithm of Word2vec.

5. The method according to claim 1, wherein the extracting of search keywords from the parsing result specifically comprises: extracting brand keywords from the parsing result based on the average mutual information; the average mutual information is calculated by the following equation:

wherein I(x_i; y_j) is the probability of x and y co-occurring; p(x_i y_j) is the probability of x and y occurring simultaneously; p(x_i | y_j) is the probability that x occurs when y occurs; p(x_i) is the probability that x occurs; and x and y are any two words.

6. A label-based user brand preference behavior prediction apparatus, comprising:

a URL data acquisition module, used for acquiring URL data reflecting the internet browsing behavior of a user;

a keyword extraction module, used for parsing the URL data, extracting search keywords from the parsing result, and storing the search keywords in a user search behavior table;

a commodity code extraction module, used for extracting e-commerce commodity codes from the parsing result;

an e-commerce browsing data acquisition module, used for acquiring e-commerce browsing data corresponding to the commodity codes from a crawler database and storing the e-commerce browsing data in a user e-commerce browsing behavior table;

a first data set generation module, used for applying intelligent text word segmentation and semantic analysis to the data stored in the user search behavior table and the user e-commerce browsing behavior table, and deleting data inconsistent with brand information to form a first data set;

and a brand preference degree generation module, used for performing cluster analysis on the first data set, obtaining the user's degree of preference for brand information, and calculating the brand preference of the user.

7. The apparatus of claim 6, wherein the URL data acquisition module is further configured to filter the URL data through a preset blacklist and whitelist.

8. The apparatus of claim 6, wherein the brand preference degree generation module is specifically configured to calculate the brand preference of the user using the following formula:

9. The apparatus of claim 6, wherein the first data set generation module is specifically configured to complete the semantic analysis through the semantic similarity algorithm of Word2vec.

10. The apparatus of claim 6, wherein the keyword extraction module is specifically configured to extract brand keywords from the parsing result based on the average mutual information; the average mutual information is calculated by the following equation: