Detailed Description
For the purpose of promoting a clear understanding of the objects, aspects and advantages of the embodiments of the present application, reference will now be made to the accompanying drawings and detailed description, wherein like reference numerals refer to like elements throughout.
The illustrative embodiments and descriptions of the present application are provided to explain the present application and not to limit the present application. Additionally, the same or similar numbered elements/components used in the drawings and the embodiments are used to represent the same or similar parts.
As used herein, "first," "second," and the like do not denote a sequential or chronological order, nor are they intended to limit the application; they merely distinguish elements or operations described in the same technical terms.
Directional terminology used herein, such as up, down, left, right, front, or rear, refers only to directions in the drawings. Accordingly, the directional terminology is illustrative and is not intended to limit the present teachings.
As used herein, the terms "comprising," "including," "having," "containing," and the like are open-ended terms that mean including, but not limited to.
As used herein, "and/or" includes any and all combinations of the described items.
As used herein, the terms "substantially", "about" and the like modify any quantity or error that may vary slightly without altering its essential nature. Generally, the range of such slight variation or error may be 20% in some embodiments, 10% in some embodiments, 5% in some embodiments, or other values. Those skilled in the art will understand that these values can be adjusted according to actual needs and are not limiting.
Certain words used to describe the present application are discussed below or elsewhere in this specification to provide additional guidance to those skilled in the art in describing the present application.
Fig. 1 is a flowchart of a method for calculating a user attribute value based on user browsing behavior according to an embodiment of the present application. The method analyzes the log of websites a user visits daily and performs data mining on the visited web pages to infer the user's demand (e.g., an investment demand or a loan demand). The demand can thus be understood without the user submitting any application, and information can be pushed, or a service provided, to the user in a targeted manner.
The figure shows an embodiment comprising:
Step 101: Retrieve the historical web pages visited daily by the user from the user information database. In an embodiment of the present application, the web pages browsed by a user in each service scenario are collected. The web pages visited by user A are denoted $U(A) = \{u_1, u_2, \ldots, u_n\}$, where $u_i$ is the URL of the $i$-th web page visited by user A and $n$ is the number of web pages visited by user A in a preset history period. The preset history period may be one month, half a year, one year, three years, and so on. The historical web pages include the various web pages the user visits daily, such as news, science and technology, financial and loan, and political and entertainment web pages. The user information database may hold browsing information recorded by one large website or recorded jointly by a plurality of websites, and is stored in a shared database that a plurality of service systems can call; this is not limited in this application.
Step 102: Filter out the historical web pages that are irrelevant to the target attribute. In an embodiment of the present application, to save subsequent processing overhead, the web pages to be analyzed may be specified according to a business requirement or the target attribute. For example, historical web pages that do not belong to a specified site are filtered out, so that only web pages of sites related to wealth management and loans are retained, such as land and gold houses, personal credits, your own credits, and the like. The list of specified URLs to be analyzed is $F = \{f_1, f_2, \ldots, f_m\}$, where $f_i$ is the $i$-th specified URL and $m$ is the number of specified URLs.
In an embodiment of the present application, step 102 further includes:
Step 1021: Collect the web pages of sites related to the target attribute. If the target attribute of the present application is to obtain the user's fund demand information, the sites related to wealth management and loans include land and gold houses, personal credits, your own credits, and other sites related to investment, wealth management, and lending.
Step 1022: Filter out, according to their URLs, the historical web pages that do not belong to these sites. Because each web page has a unique URL, historical web pages that do not belong to the specified sites can be easily filtered out by URL, which improves the accuracy of the user attribute value.
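The URL-based filtering of steps 1021 and 1022 can be sketched as follows. This is only an illustration: the host names in `SPECIFIED_SITES` are hypothetical placeholders for the specified sites, and matching on the host name is an assumed way of deciding whether a URL belongs to a site.

```python
from urllib.parse import urlsplit

# Hypothetical host names standing in for the specified sites (assumption).
SPECIFIED_SITES = {"invest.example.com", "loan.example.org"}

def filter_by_site(history_urls, specified_sites=SPECIFIED_SITES):
    """Keep only the URLs whose host belongs to one of the specified sites."""
    kept = []
    for url in history_urls:
        host = (urlsplit(url).hostname or "").lower()
        if host in specified_sites:
            kept.append(url)
    return kept

history = [
    "https://news.example.net/politics/1",
    "https://invest.example.com/funds/42",
    "https://loan.example.org/apply",
]
filtered = filter_by_site(history)
```

Here the news page is dropped and the two pages on the specified sites are kept in their original order.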
Step 103: Obtain the page attribute value corresponding to each historical web page from a page attribute value database. The page attribute values of the historical web pages are data in the page attribute value database; the method for generating that database is described in detail below. A page attribute value represents the tendency of the page. For web pages related to wealth management and loans, for example, if -1 represents loans and +1 represents wealth management, the page attribute value lies between -1 and +1: the closer the value is to -1, the more the page content relates to loans, and the closer it is to +1, the more the page content relates to wealth management.
Step 104: Obtain the user attribute value of the corresponding user according to the page attribute values. The user attribute value represents the target tendency of the user. For example, if the user attribute value lies in the interval [-1, +1], where -1 represents a loan demand and +1 represents a wealth-management demand, then the closer the value is to -1, the stronger the user's loan demand, and the closer it is to +1, the stronger the user's investment demand; this is not limited by the present application.
In the embodiment of the present application, $P(A)$ denotes the user attribute value determined for user A, where $-1 \le P(A) \le 1$. The closer $P(A)$ is to 1, the more user A tends toward investment and wealth management; the closer $P(A)$ is to -1, the more user A tends toward loans.
In a specific embodiment of the present application, obtaining a user attribute value of a corresponding user according to the page attribute value specifically includes:
Step 1041: Obtain the access time at which the user visited each historical web page. The access time is recorded whenever a visit to a historical web page is recorded.
Step 1042: Assign a weight value to the corresponding page attribute value according to the access time. In general, the farther the access time is from the current time, the smaller the weight value assigned to the visited historical page. This reflects that a user's target attribute differs between periods and that the activity closest to the current time best reflects the user's current target attribute.
Step 1043: Obtain the user attribute value of the corresponding user according to the page attribute values and the weight values. To obtain the user attribute value at the current time accurately, the application takes into account that the user attribute value differs between time periods. For example, a user may have had an investment and wealth-management demand last year but have a loan demand this year. An access record (a visited web page) farther from the current time contributes less to determining the current user attribute value, and an access record closer to the current time is more reliable for that purpose. The attribute values of the web pages visited by the user are therefore attenuated over time, generally by exponential decay. In a specific embodiment of this application, the decay function is defined as $\mathrm{decay}(t_i) = \exp(-\delta(t_i))$ with $\delta(t_i) > 0$, where $t_i$ is the time at which user A visited the historical web page, $\delta(t_i)$ is the time elapsed between $t_i$ and the current time, and $0 \le \mathrm{decay}(t_i) \le 1$.
In a specific embodiment of the present application, the user attribute value $P(A)$ may be calculated as:

$$P(A) = \frac{\sum_{H(A,t_i) \in H(A)} \mathrm{decay}(t_i)\, P(H(A,t_i))}{\sum_{H(A,t_i) \in H(A)} \mathrm{decay}(t_i)}$$

where $P(A)$ is the current user attribute value of user A, $-1 \le P(A) \le 1$; $\mathrm{decay}(t_i) = \exp(-\delta(t_i))$ is the decay function, with $\delta(t_i) > 0$ the time elapsed between $t_i$ and the current time, $t_i$ the time at which user A visited the historical web page, and $0 \le \mathrm{decay}(t_i) \le 1$; $H(A)$ denotes the historical web pages visited by user A; $H(A, t_i)$ denotes the historical web page visited by user A at time $t_i$; $P(H(A, t_i))$ is the page attribute value of that page; and the subscript $H(A, t_i) \in H(A)$ indicates that the page visited at time $t_i$ belongs to the historical web pages visited by user A.
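Steps 1041 to 1043 can be illustrated with a short sketch. It assumes that the user attribute value is the decay-weighted average of the page attribute values, which keeps the result within [-1, 1], and that $\delta(t_i)$ is the elapsed time multiplied by a rate. The `rate` parameter, the function names, and the neutral value returned for an empty history are assumptions made for illustration only.

```python
import math

def decay(elapsed, rate=0.1):
    """Exponential decay weight exp(-delta), with delta = rate * elapsed (assumed form)."""
    return math.exp(-rate * elapsed)

def user_attribute_value(visits, rate=0.1):
    """Decay-weighted mean of page attribute values.

    visits: iterable of (page_attribute_value, elapsed_time) pairs.
    Returns 0.0 for an empty history (a neutral default, assumed here).
    """
    visits = list(visits)
    weights = [decay(elapsed, rate) for _, elapsed in visits]
    total = sum(weights)
    if total == 0:
        return 0.0
    weighted = sum(w * p for w, (p, _) in zip(weights, visits))
    return weighted / total

# A recent visit to an investment-leaning page outweighs an old visit to a loan page.
value = user_attribute_value([(0.8, 1), (-0.6, 30)])
```

With these inputs the recent page dominates, so `value` is positive and close to the attribute value of the recently visited page.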
Step 105: Push specific information to the corresponding user according to the user attribute value. After the user attribute value is obtained, corresponding service information or service consultation information may be pushed to the user in a targeted manner, or specific information may be pushed through a third-party channel, for example to the user's mobile terminal through a mobile communication network.
Fig. 2 is a flowchart for generating a page attribute value database based on user browsing behavior according to an embodiment of the present disclosure. As shown in Fig. 2, site web pages related to the target attribute are collected (the historical web pages visited daily by a user are a subset of these site web pages). A processing frequency is determined for each site web page from the aggregated frequency with which all users visit it, and the web pages are processed at that frequency. Normalization, word segmentation, and word filtering are then performed in sequence. The words of all site web pages, or of a randomly selected part of the site web pages, are selected to form a dictionary vocabulary; the attribute value of each word in the dictionary vocabulary is calculated; and finally the page attribute value of each site web page is computed from the words in the dictionary vocabulary.
The figure shows an embodiment comprising:
Step 100: Collect the web pages of sites related to the target attribute. If the target attribute specified in the present application is to obtain the user's fund demand information, the site web pages related to the target attribute are web pages related to financial transactions. For example, the sites related to wealth management and loans include land and gold houses, personal credits, your own credits, and other sites related to investment, wealth management, and lending. In other specific embodiments of the present application, the site web pages related to the target attribute may not be specified during early collection; instead, web pages that do not belong to the specified sites (sites related to wealth management and loans) may be filtered out by URL during post-processing, which saves post-processing overhead.
Step 200: Process the site web pages to obtain the word list corresponding to each site web page.
In a specific embodiment of the present application, step 200 may specifically include:
Step 2001: Obtain the access popularity of the site web pages, so that the information of the site web pages can be acquired according to the access popularity. The frequency $v_i$ with which each user visits each site web page $f_i$ is counted, so that each user's visits are vectorized. For example, the visit frequencies of user A may be expressed as $V(A) = \{V(A_1), V(A_2), \ldots, V(A_m)\}$; if the user has not visited a given site web page, the corresponding count is set to 0. Once every user's visit frequencies for the site web pages are available, the aggregated access frequency of each site web page can be obtained. The aggregated access frequency $V(f_i)$ of site web page $f_i$ can be expressed as:

$$V(f_i) = \sum_{A \in \mathrm{user}} V(A_i)$$

where $V(A_i)$ is the frequency with which a user A visits site web page $f_i$, and $A \in \mathrm{user}$ ranges over all users who visit the specified sites. In a specific embodiment of the application, the aggregated access frequency vector of the site web pages to be analyzed is $V(F) = \{V(f_1), V(f_2), \ldots, V(f_m)\}$, which gives the access popularity of the site web pages.
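The aggregation in step 2001 amounts to an element-wise sum of the per-user visit vectors. The sketch below uses the three visit vectors from the worked example given later in this description; the function and variable names are illustrative assumptions.

```python
def aggregate_access_frequency(user_vectors):
    """Sum per-user visit-count vectors element-wise to obtain V(F)."""
    lengths = {len(vector) for vector in user_vectors.values()}
    if len(lengths) > 1:
        raise ValueError("all visit vectors must cover the same m pages")
    return [sum(counts) for counts in zip(*user_vectors.values())]

visit_vectors = {
    "A": [1, 1, 1, 1, 0],
    "B": [1, 0, 1, 1, 1],
    "C": [0, 0, 1, 0, 1],
}
popularity = aggregate_access_frequency(visit_vectors)
```

The third page receives the largest aggregated count and the second page the smallest.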
Step 2002: Acquire the information of the site web pages. In a specific implementation of the application, the information may be acquired with web crawler technology. The crawl frequency of each site web page is determined from its aggregated access frequency (access popularity), and the site web pages to be analyzed are crawled accordingly. Because crawling consumes a large amount of system resources, different crawl strategies can be designed for site web pages with different access popularity. Once the crawl frequencies are determined, a crawler program can be designed to crawl all the web pages at those frequencies and obtain the information corresponding to each site web page. In this embodiment, the crawled information is denoted $C = \{c_1, c_2, \ldots, c_m\}$, where $c_i$ is the information obtained by crawling site web page $f_i$.
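One possible crawl strategy, given purely as an assumption since the description does not fix a policy, is to shorten the re-crawl interval in inverse proportion to a page's aggregated access frequency. The parameters `base_hours` and `min_hours` are hypothetical.

```python
def crawl_intervals(popularity, base_hours=24.0, min_hours=1.0):
    """Return a re-crawl interval in hours for each page.

    Pages with a higher aggregated access frequency get a shorter interval;
    pages with a count of 0 fall back to the base interval.
    """
    intervals = []
    for count in popularity:
        interval = base_hours / max(count, 1)
        intervals.append(max(interval, min_hours))
    return intervals

intervals = crawl_intervals([2, 1, 3, 2, 2])
```

Under this policy the most visited page is crawled most often and the least visited page least often, which matches the intent of spending crawler resources where access popularity is highest.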
Step 2003: Normalize the information to obtain standard information. The normalization includes: converting all uppercase letters to lowercase; converting traditional Chinese characters to simplified Chinese characters; converting half-width symbol characters to full-width symbol characters; and replacing synonyms in the short text using a synonym processing algorithm, which completes the normalization of the information $C$.
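A minimal sketch of three of these normalization operations follows. It assumes that only ASCII punctuation is widened to its full-width form, and it uses a small hypothetical synonym table applied to space-separated words. Traditional-to-simplified conversion needs a character mapping table and is left out of the sketch.

```python
import string

# Hypothetical synonym table: variant -> canonical term (assumption).
SYNONYMS = {"borrowing": "loan", "lending": "loan"}

# Half-width ASCII punctuation -> full-width counterpart (code point + 0xFEE0).
_FULL_WIDTH = {ord(ch): ord(ch) + 0xFEE0 for ch in string.punctuation}

def normalize(text, synonyms=SYNONYMS):
    """Lowercase, replace synonyms word by word, then widen punctuation."""
    text = text.lower()
    words = [synonyms.get(word, word) for word in text.split(" ")]
    text = " ".join(words)
    return text.translate(_FULL_WIDTH)

standard = normalize("Cheap BORROWING rates, apply now!")
```

The result is lowercase, has "borrowing" replaced by "loan", and carries a full-width comma and exclamation mark.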
Step 2004: Perform word segmentation on the standard information to obtain the word list corresponding to each site web page; that is, the standard information is split into a plurality of words. In the embodiments of the present application, stop words in the word list may also be filtered out based on a stop word list. In natural language processing, characters, words, and punctuation marks that carry no semantic meaning or are irrelevant to the service are generally collected in a table, and the entries of that table are skipped in subsequent analysis; the table is generally called a stop word list. In a specific embodiment of the present application, the stop words in the stop word list are denoted $S = \{s_1, s_2, \ldots, s_t\}$, where $s_i$ is the $i$-th stop word and $t$ is the number of stop words.
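Segmentation followed by stop-word filtering can be sketched as below. The regular-expression tokenizer is only a stand-in for a real word segmenter (Chinese text in particular needs a dedicated one), and the stop word list is a hypothetical example.

```python
import re

# Hypothetical stop word list S (assumption).
STOP_WORDS = {"the", "a", "of", "and", "to"}

def segment(standard_text, stop_words=STOP_WORDS):
    """Split normalized text into words and drop stop words.

    Punctuation, including full-width punctuation, is discarded because
    the pattern only matches runs of word characters.
    """
    tokens = re.findall(r"\w+", standard_text)
    return [token for token in tokens if token not in stop_words]

words = segment("the cost of a loan\uff0c and the rate")
```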
Step 300: Randomly select, in units of web pages, words in a predetermined proportion from the word list, so that the attribute values of the selected words can be calculated. In the embodiment of the present application, to improve the accuracy of the user attribute value, the words (i.e., word lists) of all site web pages may be selected to form a dictionary vocabulary; to save processing resources, words in a predetermined proportion may instead be randomly selected from the word list in units of web pages. Assume that the number of selected site web pages is $z$. After this step, the selected site web pages need to be labeled, that is, each site web page is assigned one of two classes, $Y = \{+1, -1\}$. If the target attribute of the present application is to obtain the user's fund demand information, +1 and -1 indicate the user's investment demand and loan demand, respectively. The selected site web pages may be represented as $WY = \{(W(c_i), y_i) \mid 1 \le i \le z,\ y_i \in Y,\ i \text{ a positive integer}\}$, where $W(c_i)$ is the word segmentation result of the selected site web page $c_i$ and $y_i$ is the label (+1 or -1) assigned to it. In the embodiment of the application, assume that the selected site web pages contain $s$ different words, the $i$-th of which is denoted $d_i$; all these words form a dictionary vocabulary $D = \{d_1, d_2, \ldots, d_s\}$.
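Selecting pages at random and building the dictionary vocabulary from their words might look like the sketch below. Rounding the proportion to a page count, picking at least one page, and the `seed` parameter are assumptions made for illustration.

```python
import random

def build_dictionary(page_words, proportion, seed=None):
    """Randomly pick a proportion of pages and collect their distinct words.

    page_words: dict mapping a page id to its list of words.
    Returns (selected_page_ids, dictionary), with the dictionary sorted.
    """
    page_ids = sorted(page_words)
    if not page_ids:
        return [], []
    rng = random.Random(seed)
    z = min(len(page_ids), max(1, round(proportion * len(page_ids))))
    selected = rng.sample(page_ids, z)
    dictionary = sorted({word for pid in selected for word in page_words[pid]})
    return selected, dictionary

pages = {
    "X1": ["fund", "yield"],
    "X2": ["fund", "bonus"],
    "X3": ["stock"],
    "Y1": ["loan", "rate"],
    "Y2": ["loan", "credit"],
}
some_pages, some_words = build_dictionary(pages, 0.4, seed=7)
all_pages, all_words = build_dictionary(pages, 1.0)
```

With a proportion of 1.0 every page is selected and the dictionary equals the full word list; with a smaller proportion it is a subset.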
Step 400: Calculate the attribute value of each selected word. If the words (i.e., word lists) of all site web pages were selected to form the dictionary vocabulary, the attribute value of each word in the word list is calculated (the dictionary vocabulary is then identical to the word list). If words in a predetermined proportion were randomly selected in units of web pages, so that only the words of some site web pages form the dictionary vocabulary, the attribute value of each word in the dictionary vocabulary is calculated (the dictionary vocabulary is then a subset of the word list). In the embodiment of the application, after the selected site web pages are labeled, the number of occurrences $|d_{ij}|$ of the $i$-th word $d_i$ of the dictionary vocabulary $D$ in the $j$-th randomly selected site web page $c_j$ is counted, and the attribute value of each word is calculated. The attribute value of the $i$-th word $d_i$ in $D$ is:

$$P(d_i) = \frac{\sum_{j=1}^{z} |d_{ij}|\, y_j}{\sum_{j=1}^{z} |d_{ij}|}$$

where $|d_{ij}|$ is the number of times the $i$-th dictionary word $d_i$ occurs in the $j$-th randomly selected site web page $c_j$, $i \le s$, and $s$ is the number of words in the dictionary vocabulary; $y_j$ is the label of the $j$-th site web page, with +1 indicating an investment demand and -1 indicating a loan demand; $z$ is the number of randomly selected site web pages, $z \le m$, $j \le z$, and $m$ is the number of all crawled site web pages.
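The word attribute value can be sketched as a label-weighted occurrence count. The sketch assumes the value is normalized by the word's total number of occurrences, so that it stays within [-1, 1]; the function name and input layout are illustrative.

```python
from collections import Counter, defaultdict

def word_attribute_values(labelled_pages):
    """Compute P(d_i) = sum_j |d_ij| * y_j / sum_j |d_ij| for every word.

    labelled_pages: list of (words, label) pairs with label +1 or -1.
    """
    signed = defaultdict(int)
    total = defaultdict(int)
    for words, label in labelled_pages:
        if label not in (1, -1):
            raise ValueError("label must be +1 or -1")
        for word, count in Counter(words).items():
            signed[word] += count * label
            total[word] += count
    return {word: signed[word] / total[word] for word in total}

word_values = word_attribute_values([
    (["fund", "fund", "rate"], +1),   # page labelled as investment
    (["loan", "rate"], -1),           # page labelled as loan
])
```

A word seen only on +1 pages scores 1.0, a word seen only on -1 pages scores -1.0, and a word spread evenly across both scores 0.0.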
Step 500: Obtain the page attribute value of each site web page from the word attribute values. In the embodiment of the present application, the page attribute value can be obtained for every site web page, including both the randomly selected and the unselected site web pages, that is, all crawled site web pages. The page attribute value $P(c_i)$ of each site web page $c_i$ is calculated as:

$$P(c_i) = \frac{1}{|c_i|} \sum_{d_k \in D,\; d_k \in W(c_i)} P(d_k)$$

where $P(d_k)$ is the attribute value of a word $d_k$ that appears both in the dictionary vocabulary and in site web page $c_i$; $D$ denotes the dictionary vocabulary; $W(c_i)$ denotes the words in site web page $c_i$; the condition $d_k \in D,\ d_k \in W(c_i)$ restricts the sum to words that appear in site web page $c_i$ and belong to the dictionary vocabulary; the sum accumulates the attribute values of all such words in site web page $c_i$; and $|c_i|$ is the number of words in site web page $c_i$.
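The page attribute value can be sketched as the mean word attribute value over a page. The sketch assumes that every word occurrence on the page is counted, that words missing from the dictionary contribute 0, and that an empty page is given the neutral value 0.0; these choices are assumptions where the description leaves room for interpretation.

```python
def page_attribute_value(words, word_values):
    """Sum the attribute values of the page's dictionary words, divided by |c_i|."""
    if not words:
        return 0.0
    total = sum(word_values.get(word, 0.0) for word in words)
    return total / len(words)

# Hypothetical word attribute values for illustration.
example_values = {"fund": 1.0, "loan": -1.0, "rate": 0.0}
page_value = page_attribute_value(["fund", "rate", "bonus", "fund"], example_values)
```

Here "bonus" is not in the dictionary, so only the two occurrences of "fund" contribute to the sum, which is then divided by the four words on the page.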
Step 600: Generate the page attribute value database from the page attribute values of the site web pages. The page attribute values of all site web pages are stored in a database for use in calculating user attribute values.
Fig. 3 is a comprehensive flowchart of a user attribute value calculation method based on user browsing behavior according to an embodiment of the present application. As shown in Fig. 3, the access popularity of all site web pages of the specified sites is counted, and the site web pages are crawled, normalized, segmented, and filtered according to the access popularity to obtain the word list of each site web page. The words of some or all site web pages are selected to form a dictionary vocabulary, and the attribute value of each word in the dictionary vocabulary is computed; these word attribute values are then used to obtain the page attribute value of each site web page of the specified sites. For a specific user A, the historical access records of user A are collected, the site web pages of the specified sites visited by user A are identified, and the user attribute value of user A at the current time is obtained from the page attribute values of those site web pages and the times at which user A visited them.
For example, two sites related to wealth management and loans are specified. Site X, related to wealth management, has three web pages X1, X2, and X3; site Y, related to loans, has two web pages Y1 and Y2. Web pages X1, X2, and X3 are labeled +1, indicating that a browsing user has a demand to invest funds; web pages Y1 and Y2 are labeled -1, indicating that a browsing user has a demand to borrow funds. The list of specified URLs to be analyzed is $F = \{f_1, f_2, f_3, f_4, f_5\}$, where $f_1$, $f_2$, $f_3$ correspond to X1, X2, X3 and $f_4$, $f_5$ correspond to Y1, Y2. Assume that three users A, B, and C in total have visited sites X and Y, with visit frequencies $V(A) = \{1, 1, 1, 1, 0\}$, $V(B) = \{1, 0, 1, 1, 1\}$, and $V(C) = \{0, 0, 1, 0, 1\}$. The aggregated access frequency of the web pages is then $V(F) = \{2, 1, 3, 2, 2\}$. Since web page X3 has the highest access frequency, that is, the highest access popularity, X3 should be crawled most intensively when the crawler policy is formulated; conversely, the crawl frequency of web page X2 can be reduced somewhat.
The text obtained by the web crawler is denoted $C = \{c_1, c_2, c_3, c_4, c_5\}$. Normalization, word segmentation, and word filtering are applied to $C$ to obtain the words of web pages X1, X2, X3, Y1, and Y2, denoted $W(c_i) = \{w_{ij} \mid 1 \le j \le |c_i|\}$, where $w_{ij}$ is the $j$-th word of the $i$-th web page and $|c_i|$ is the number of words finally obtained from the $i$-th web page. Assume the word counts $|c_1|, |c_2|, |c_3|, |c_4|, |c_5|$ total 80, with $|c_1| + |c_4| = 45$. If all web pages are selected to form the dictionary vocabulary $D$, the number of words in $D$ is at most 80 (the same word may occur more than once within one page after segmentation, and different pages may share words, so the dictionary may contain fewer words than the sum of the per-page counts); if only web pages X1 and Y1 are selected, the number of words in $D$ is at most 45. For convenience of description, assume that the words obtained from each web page are all different, that words from different web pages are also different, and that the words of all web pages are selected to form the dictionary vocabulary $D = \{d_1, d_2, \ldots, d_{80}\}$. The attribute value $P(d_i)$ of each word in $D$ is then calculated; for the $i$-th word $d_i$:

$$P(d_i) = \frac{\sum_{j=1}^{z} |d_{ij}|\, y_j}{\sum_{j=1}^{z} |d_{ij}|}$$

where $|d_{ij}|$ is the number of times the $i$-th dictionary word $d_i$ occurs in the $j$-th web page $c_j$, $i \le s$, and $s$ is the number of words in the dictionary vocabulary; $z$ is the number of selected web pages (here, all five); and $y_j$ is the label of the $j$-th web page, with +1 indicating an investment demand and -1 indicating a loan demand.
From the attribute value $P(d_i)$ of each word, the web page attribute value $P(c_i)$ of each web page can be obtained: the attribute values of all words in web page $c_i$ are summed and the sum is divided by the number of words in $c_i$. The web page attribute value $P(c_i)$ of each web page $c_i$ is calculated as:

$$P(c_i) = \frac{1}{|c_i|} \sum_{d_k \in D,\; d_k \in W(c_i)} P(d_k)$$

where $P(d_k)$ is the attribute value of a word $d_k$ that appears both in the dictionary vocabulary and in web page $c_i$; $D$ denotes the dictionary vocabulary; $W(c_i)$ denotes the words in web page $c_i$; the condition $d_k \in D,\ d_k \in W(c_i)$ restricts the sum to words that appear in web page $c_i$ and belong to the dictionary vocabulary; the sum accumulates the attribute values of all such words in web page $c_i$; and $|c_i|$ is the number of words in web page $c_i$.
To obtain the user attribute value $P(A)$ of user A at the current time, the web page attribute value of each web page visited by user A and the time of each visit are needed. The visit frequencies of user A are given as $V(A) = \{1, 1, 1, 1, 0\}$, that is, user A has not visited web page Y2. Since the time elapsed between user A's visits to web pages X1, X2, X3, and Y1 and the current time can also be obtained, the user attribute value $P(A)$ of user A is calculated as:

$$P(A) = \frac{\sum_{H(A,t_i) \in H(A)} \mathrm{decay}(t_i)\, P(H(A,t_i))}{\sum_{H(A,t_i) \in H(A)} \mathrm{decay}(t_i)}$$

where $P(A)$ is the user attribute value of user A at the current time, $-1 \le P(A) \le 1$; $\mathrm{decay}(t_i) = \exp(-\delta(t_i))$ is the decay function, with $\delta(t_i) > 0$ the time elapsed between $t_i$ and the current time (the time unit may be hours, days, weeks, months, or years), $t_i$ the time at which user A visited the web page, and $0 \le \mathrm{decay}(t_i) \le 1$; $H(A)$ denotes the web pages visited by user A; $H(A, t_i)$ denotes the web page visited by user A at time $t_i$; $P(H(A, t_i))$ is the web page attribute value of that page; and the subscript $H(A, t_i) \in H(A)$ indicates that the page visited at time $t_i$ belongs to the web pages visited by user A. The user attribute values $P(B)$ and $P(C)$ of users B and C can be found in the same way.
Assume the web page attribute values $P(c_i)$ of the web pages $c_i$ are $\{0.8, 0.7, 0.5, -0.6, -0.9\}$, and the decay-function values for user A's visits to the web pages are $\mathrm{decay}(t_i) = \{0.4, 0.5, 0.8, 0.2, 0\}$. Substituting these values into the formula for $P(A)$ shows that user A has a relatively strong investment demand, and a small amount of wealth-management information can be pushed to user A. Similarly, the visit frequencies of user B are $V(B) = \{1, 0, 1, 1, 1\}$; assuming the decay-function values for user B's visits are $\mathrm{decay}(t_i) = \{0.4, 0, 0.8, 0.2, 0.9\}$, the resulting user attribute value $P(B)$ shows that user B has a slight loan demand, and a small amount of loan-related information can be pushed to user B. The visit frequencies of user C are $V(C) = \{0, 0, 1, 0, 1\}$; assuming the decay-function values for user C's visits are $\mathrm{decay}(t_i) = \{0, 0, 0.2, 0, 0.9\}$, the resulting user attribute value $P(C)$ shows that user C has a strong loan demand, and a large amount of loan-related information can be pushed to user C.
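The three cases can be checked numerically. The sketch below assumes that the user attribute value is the decay-weighted average of the page attribute values; the numbers are the page attribute values and decay values stated in the example, and the function name is illustrative.

```python
page_values = [0.8, 0.7, 0.5, -0.6, -0.9]   # X1, X2, X3, Y1, Y2
decays = {
    "A": [0.4, 0.5, 0.8, 0.2, 0.0],
    "B": [0.4, 0.0, 0.8, 0.2, 0.9],
    "C": [0.0, 0.0, 0.2, 0.0, 0.9],
}

def weighted_value(values, weights):
    """Decay-weighted average; 0.0 if all weights are zero (assumed default)."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * p for w, p in zip(weights, values)) / total

results = {user: weighted_value(page_values, w) for user, w in decays.items()}
```

Under this assumption the values come out at about 0.5 for user A, about -0.09 for user B, and about -0.65 for user C, which agrees with the qualitative reading of a relatively strong investment demand, a slight loan demand, and a strong loan demand.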
Fig. 4 is a block diagram of a user attribute value calculation apparatus based on user browsing behavior according to an embodiment of the present application. As shown in Fig. 4, the log of websites a user visits daily is analyzed and text data mining is performed on the visited web pages to obtain the user's fund demand (an investment demand or a loan demand). The fund demand can thus be understood without the user submitting any application, which helps a lender carry out precise marketing and lending, and helps a wealth-management provider promote wealth-management products and raise funds in a more targeted manner.
In the specific embodiment shown in the figure, the user attribute value calculation apparatus includes a scheduling device 10, a filtering device 20, an obtaining device 30, an obtaining device 40, and an information pushing device 50, where the scheduling device 10 is configured to invoke a history web page daily visited by a user from a user information database; the filtering device 20 is configured to filter out web pages in the historical web pages that are not related to the target attribute, so as to calculate a page attribute value of the filtered historical web pages; the obtaining device 30 is configured to obtain a page attribute value corresponding to the historical webpage according to a page attribute value database; the obtaining device 40 is configured to obtain a user attribute value of a corresponding user according to the page attribute value; the information pushing device 50 is used for pushing specific information to the corresponding user according to the user attribute value. The user information database may be user browsing information recorded by a large website, or user browsing information recorded by multiple websites in a combined manner, and is stored in a shared database for being called by multiple service systems, which is not limited in this application.
Referring to Fig. 4 again, the obtaining device 40 specifically includes an obtaining unit 401, a weight value allocating unit 402, and a calculating unit 403. The obtaining unit 401 is configured to obtain the access time at which a user visited each historical web page; the weight value allocating unit 402 is configured to assign a weight value to the corresponding page attribute value according to the access time; and the calculating unit 403 is configured to obtain the user attribute value of the corresponding user according to the page attribute values and the weight values.
Fig. 5 is a block diagram of a unit for computing web page attribute values based on user browsing behavior according to an embodiment of the present application. As shown in Fig. 5, the crawl frequency of each site web page is determined from the aggregated frequency with which all users visit it, and the site web pages are crawled at that frequency. Normalization, word segmentation, and word filtering are then performed in sequence. The words of all site web pages, or of a randomly selected part of the site web pages, are selected to form a dictionary vocabulary; the attribute value of each word in the dictionary vocabulary is calculated; the web page attribute value of each web page is computed from the words in the dictionary vocabulary; and finally a web page attribute value database is generated from the web page attribute values of the site web pages.
In the specific embodiment shown in the figure, the page attribute value database generating unit 1 specifically includes a collecting module 11, a processing module 12, a word selecting module 13, a calculating module 14, an obtaining module 15, and a generating module 16. The collecting module 11 is configured to collect site web pages related to the target attribute; the processing module 12 is configured to process the site web pages to obtain the vocabulary corresponding to the site web pages; the word selecting module 13 is configured to randomly select, in units of web pages, a predetermined proportion of the words from the vocabulary, so that attribute values are calculated for the randomly selected words; the calculating module 14 is configured to calculate the attribute value of each word in the vocabulary; the obtaining module 15 is configured to obtain the page attribute value of each site web page according to the attribute values of the words; and the generating module 16 is configured to generate the page attribute value database according to the page attribute values corresponding to the site web pages.
In a specific embodiment of the present application, the processing module 12 further includes an obtaining sub-module 121, an obtaining sub-module 122, a normalization sub-module 123, and a word segmentation sub-module 124. The obtaining sub-module 121 is configured to obtain the access popularity of the site web pages, so that the information of the site web pages is obtained according to the access popularity; the obtaining sub-module 122 is configured to obtain the information of the site web pages; the normalization sub-module 123 is configured to normalize the information to obtain standard information; and the word segmentation sub-module 124 is configured to perform word segmentation on the standard information to obtain the vocabulary corresponding to the site web pages.
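For illustration only, the normalization and word segmentation performed by the sub-modules 123 and 124, together with the word filtering described in connection with FIG. 5, may be sketched as follows. The synonym table, the stop word list and the regular-expression tokenizer are assumptions chosen for an English-language example; for languages written without spaces a dedicated word segmenter would be used instead.

```python
import re
import unicodedata

# Hypothetical tables; a real system would load much larger ones.
SYNONYMS = {"automobile": "car", "autos": "cars"}
STOP_WORDS = {"the", "a", "is", "of"}


def normalize(text):
    """Unify character forms and case, then map synonyms to one form."""
    text = unicodedata.normalize("NFKC", text).lower()
    for variant, canonical in SYNONYMS.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text)
    return text


def segment(text):
    """Assumed segmentation: runs of lowercase letters and digits are words."""
    return re.findall(r"[a-z0-9]+", text)


def page_vocabulary(raw_text):
    """Normalize, segment, then drop stop words; repeated words are kept."""
    return [w for w in segment(normalize(raw_text)) if w not in STOP_WORDS]


if __name__ == "__main__":
    print(page_vocabulary("The Automobile is FAST, the car is cheap."))
    # prints ['car', 'fast', 'car', 'cheap']
```

Punctuation is discarded by the tokenizer itself, and the same word may appear several times in the vocabulary of one web page.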
FIG. 6 is a general block diagram of a user attribute value calculation apparatus based on user browsing behavior according to an embodiment of the present application. As shown in FIG. 6, the collecting module 11 collects the site web pages related to the target attribute. The processing module 12 performs crawling, normalization, word segmentation and filtering on the site web pages to obtain the vocabulary corresponding to each site web page. The vocabulary of a site web page may contain the same word several times, but because of the normalization it contains no synonyms or near-synonyms; the filtering mainly refers to using a stop word list to remove characters, words and punctuation marks that carry no semantic meaning or are irrelevant to the service. The word selecting module 13 randomly selects a predetermined proportion of the words from the vocabulary in units of web pages; that is, site web pages are randomly selected, and all words in each selected site web page are then marked as selected. The calculating module 14 calculates the attribute value of each word in the vocabulary. If all site web pages are selected, the vocabulary includes the words of all site web pages; if only some site web pages are selected in units of web pages, the vocabulary includes only the words of the selected site web pages, which reduces the amount of data to be processed. In big data processing the number of site web pages is huge, so a random selection of some site web pages can essentially cover the words appearing in all site web pages. The obtaining module 15 obtains the page attribute value of each site web page according to the attribute value of each word in that site web page.
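The page-level random selection and the derivation of a page attribute value from word attribute values may be sketched as follows. This is a minimal example under stated assumptions: the word attribute values are taken as given, since their calculation is not reproduced here, and averaging them over the words of a web page is a hypothetical aggregation rule; the function names and the `ratio` and `seed` parameters are likewise illustrative.

```python
import random


def select_pages(page_words, ratio, seed=None):
    """Randomly pick a proportion of pages; the unit of selection is a page."""
    rng = random.Random(seed)
    urls = sorted(page_words)
    k = max(1, round(len(urls) * ratio)) if urls else 0
    return set(rng.sample(urls, k))


def build_dictionary(page_words, selected):
    """Every word of a selected page enters the dictionary vocabulary."""
    vocab = set()
    for url in selected:
        vocab.update(page_words[url])
    return vocab


def page_attribute_value(words, word_values):
    """Assumed rule: mean attribute value of the page's dictionary words."""
    scored = [word_values[w] for w in words if w in word_values]
    return sum(scored) / len(scored) if scored else 0.0


if __name__ == "__main__":
    pages = {"p1": ["car", "fast"], "p2": ["car", "cheap"], "p3": ["loan"]}
    chosen = select_pages(pages, ratio=0.67, seed=0)
    print(sorted(build_dictionary(pages, chosen)))
    print(page_attribute_value(["car", "fast", "unknown"], {"car": 0.8, "fast": 0.4}))
```

Words of a web page that are absent from the dictionary vocabulary are simply skipped in this sketch, which reflects the observation that a random subset of web pages covers most, but not necessarily all, words.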
For example, for a specific user A, the collecting module 11 collects the site web pages related to the target attribute, the processing module 12 processes the site web pages to obtain the vocabulary corresponding to the site web pages, and the calculating module 14 calculates the attribute value of each word in the vocabulary. The obtaining module 15 obtains, according to these attribute values, the page attribute value of each site web page accessed by the user, from which the user attribute value of user A at the current moment is obtained.
The embodiments of the present application provide a user attribute value calculation method and calculation apparatus based on user browsing behavior. Big data processing technology is used to calculate the web page attribute values of all web pages of the sites related to the target attribute, and cloud technology is used to collect the historical browsing information of all users into a database. The user attribute value of a corresponding user can then be obtained according to the historical web pages browsed by that user and the times at which they were browsed, and operations such as information pushing or service provision can be performed in a targeted manner according to the user attribute value, thereby promoting the development of the network big data era and of the national economy.
The embodiments of the present application described above may be implemented in hardware, in software code, or in a combination of both. For example, an embodiment of the present application may be program code executed in a Digital Signal Processor (DSP) to perform the above-described method. The present application may also relate to a variety of functions performed by a computer processor, a digital signal processor, a microprocessor, or a Field Programmable Gate Array (FPGA). Such a processor may be configured in accordance with the present application to perform particular tasks by executing machine-readable software code or firmware code that defines the particular methods disclosed herein. The software code or firmware code may be developed in different programming languages and in different formats or forms, and the software code may also be compiled for different target platforms. However, different code styles, types and languages of software code, and other types of configuration code for performing tasks according to the present application, do not depart from the spirit and scope of the present application.
The foregoing is merely an illustrative embodiment of the present application, and any equivalent changes and modifications made by those skilled in the art without departing from the spirit and principles of the present application shall fall within the protection scope of the present application.