WO2016045567A1 - Web page data analysis method and apparatus - Google Patents

Web page data analysis method and apparatus

Info

Publication number
WO2016045567A1
Authority
WO
WIPO (PCT)
Prior art keywords
keywords
keyword
cluster
user
module
Prior art date
Application number
PCT/CN2015/090185
Other languages
English (en)
French (fr)
Inventor
何鑫
Original Assignee
北京国双科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京国双科技有限公司
Priority to US15/513,501 (US10621245B2)
Publication of WO2016045567A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Definitions

  • The present invention relates to the field of data analysis, and in particular to a web page data analysis method and apparatus.
  • A website usually classifies its visitors by modeling the behavior track of users browsing the site and training a classifier, or describes user needs through the popularity of queries in on-site search.
  • On-site search is a behavior in which the user actively looks for information, so it can describe the user's needs to some extent.
  • Traditional on-site search term clustering relies on the search terms themselves and is computed from the literal overlap between terms.
  • The implementation is generally as follows. Step 1: split each keyword literally (into characters or segmented words), so that each keyword can be expressed as a sequence of characters or words.
  • Step 2: compute the similarity (Jaccard, edit distance, etc.) of each pair of keywords one by one, that is, compare the degree of overlap of the token sequences of the two search terms and return a similarity measure. Step 3: cluster with a clustering algorithm such as k-means or hierarchical clustering; different clustering algorithms differ in implementation but not in substance. Because the traditional technique links keywords only by their literal overlap, it does not match reality: it merely constructs a rigid dependency relationship and therefore cannot accurately explain user needs.
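  • As a rough illustration of the literal-overlap approach described above, the following minimal sketch computes Jaccard similarity over word tokens. The function names and the whitespace tokenizer are assumptions made for this example, not part of the source; it only shows why keywords that express the same need but share no words score zero.

```python
def word_tokens(keyword):
    """Split a keyword into a set of lowercase words (illustrative tokenizer)."""
    return set(keyword.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two keywords: |A & B| / |A | B| over word tokens."""
    ta, tb = word_tokens(a), word_tokens(b)
    if not ta and not tb:
        return 1.0  # two empty keywords are treated as identical
    return len(ta & tb) / len(ta | tb)


# Keywords sharing a word get a non-zero score ...
overlap = jaccard("electronic eye", "electronic observation")
# ... while related keywords with no shared word score zero.
no_overlap = jaccard("violation", "electronic eye")
```

  Under this measure, "violation" and "electronic eye" can never fall into one cluster on literal grounds, which is the limitation the text describes.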
  • The present invention addresses the problem that existing web page data analysis methods rely only on the literal overlap of search terms, so that the analysis results cannot accurately explain user needs.
  • The main purpose of the present invention is to provide a web page data analysis method and apparatus that solve the above problem.
  • A web page data analysis method includes: obtaining m keywords input by a user on a webpage; acquiring the keywords that have a dependency relationship among the m keywords, wherein keywords corresponding to the same user need have a dependency relationship; and dividing the keywords that have a dependency relationship among the m keywords into the same class of keywords.
  • Obtaining the m keywords input by the user on the webpage includes: loading script file code on the webpage; receiving the user's input behavior on the webpage; and reading, via the script file code, the m keywords carried by the input behavior on the webpage.
  • Acquiring the keywords that have a dependency relationship among the m keywords includes: determining a hypothesis condition, wherein the hypothesis condition is a hypothesized logical relationship contained in the input behavior for the m keywords; creating a graph model {G, S} according to the hypothesis condition, wherein G represents the set of m keywords and S represents the set of dependency relationships between the m keywords; and obtaining the keywords that have a dependency relationship among the m keywords by means of the graph model.
  • Obtaining the keywords that have a dependency relationship among the m keywords by means of the graph model includes: calculating a transition probability according to the strength of the dependency relationships between the m keywords, wherein the transition probability is the probability that a keyword belongs to a keyword with which it has a dependency relationship; and iterating over the m keywords according to the transition probability to acquire the keywords that have a dependency relationship among the m keywords.
  • After the keywords that have a dependency relationship are divided into the same class, the method further includes: naming each of the multiple keyword classes; and sorting the named keyword classes according to the number of keywords contained in each class.
  • A web page data analysis apparatus includes: a first obtaining unit, configured to obtain m keywords input by a user on a webpage; a second obtaining unit, configured to acquire the keywords that have a dependency relationship among the m keywords, wherein keywords corresponding to the same user need have a dependency relationship; and a dividing unit, configured to divide the keywords that have a dependency relationship among the m keywords into the same class of keywords.
  • The first obtaining unit includes: a loading module, configured to load script file code on the webpage; a receiving module, configured to receive the user's input behavior on the webpage; and a reading module, configured to read, via the script file code, the m keywords carried by the input behavior on the webpage.
  • The second obtaining unit includes: a first determining module, configured to determine a hypothesis condition, wherein the hypothesis condition is a hypothesized logical relationship contained in the input behavior for the m keywords; a creating module, configured to create a graph model {G, S} according to the hypothesis condition, wherein G represents the set of m keywords and S represents the set of dependency relationships between the m keywords; and a first acquisition module, configured to obtain the keywords that have a dependency relationship among the m keywords by means of the graph model.
  • The first acquisition module includes: a calculating module, configured to calculate a transition probability according to the strength of the dependency relationships between the m keywords, wherein the transition probability is the probability that a keyword belongs to a keyword with which it has a dependency relationship; and a second acquisition module, configured to iterate over the m keywords according to the transition probability and acquire the keywords that have a dependency relationship among the m keywords.
  • The apparatus further includes: a naming unit, configured to name each of the multiple keyword classes; and a sorting unit, configured to sort the named keyword classes according to the number of keywords contained in each class.
  • The invention adopts a method comprising the following steps: obtaining m keywords input by a user on a webpage; acquiring the keywords that have a dependency relationship among the m keywords, wherein keywords corresponding to the same user need have a dependency relationship; and dividing the keywords that have a dependency relationship among the m keywords into the same class of keywords. This solves the problem that existing web page data analysis methods rely only on the literal overlap of search terms, so that the analysis results cannot accurately explain user needs.
  • Web page data is thereby clustered using the dependency relationships between keywords that are determined by user needs, so that the clustering result accurately reflects those needs.
  • FIG. 1 is a flow chart of a first embodiment of a data analysis method in accordance with the present invention.
  • FIG. 2 is a flow chart of a second embodiment of a data analysis method in accordance with the present invention.
  • FIG. 3 is a flow chart of a third embodiment of a data analysis method in accordance with the present invention.
  • FIG. 4 is a flow chart of a fourth embodiment of a data analysis method in accordance with the present invention.
  • FIG. 5 is a flow chart of a fifth embodiment of a data analysis method in accordance with the present invention.
  • FIG. 6 is a flow chart of a sixth embodiment of a data analysis method in accordance with the present invention.
  • FIG. 7 is a block diagram showing the structure of a first embodiment of a data analyzing apparatus according to the present invention.
  • FIG. 8 is a block diagram showing the structure of a second embodiment of the data analyzing apparatus according to the present invention.
  • FIG. 9 is a block diagram showing the structure of a third embodiment of the data analyzing apparatus according to the present invention.
  • FIG. 10 is a block diagram showing the structure of a sixth embodiment of the data analyzing apparatus according to the present invention.
  • FIG. 1 is a flow chart of a first embodiment of a web page data analysis method in accordance with the present invention. As shown in FIG. 1, the method includes the following steps:
  • Step S102: obtain m keywords input by the user on the webpage, where m is a natural number greater than 1.
  • Each user need can be expressed by different keywords entered by the user, and each keyword can also express several different user needs or intentions.
  • The method defines a one-to-many dependency between a user need and the keywords entered by the user. User needs can be identified by clustering the keywords that users enter on the website.
  • Step S104: acquire the keywords that have a dependency relationship among the m keywords, wherein keywords corresponding to the same user need have a dependency relationship.
  • In a user's web page search behavior, the keywords searched are often related. This relationship is not the literal similarity of the keywords; rather, the keywords correspond to the same user need.
  • Step S106: divide the keywords that have a dependency relationship among the m keywords into the same class of keywords.
  • The keywords input by the user can thus be divided into several classes according to their dependency relationships.
  • With this clustering approach, deep keyword aggregation relationships can be mined so that user needs are represented accurately. For example, the relationship between "violation", "electronic eye", "electronic jin" and "electronic observation" can be found.
  • The following steps are taken: obtaining m keywords input by the user on the webpage; acquiring the keywords that have a dependency relationship among the m keywords; and dividing the keywords that have a dependency relationship among the m keywords into the same class of keywords.
  • This method breaks through the limitation of traditional query aggregation, which rests on the assumption of literal query matching.
  • User behavior data is used for data mining to construct a mathematical model that better matches user needs.
  • FIG. 2 is a flow chart of a second embodiment of a web page data analysis method in accordance with the present invention. This embodiment can be used as a preferred embodiment of the embodiment shown in FIG. 1. As shown in FIG. 2, the webpage data analysis method includes:
  • Step S201: load script file code on the webpage.
  • A script file is similar to a batch file in the DOS operating system: it combines different commands and executes them automatically and continuously in a determined order.
  • A script program is closer to natural language than a conventionally developed program and can be interpreted without being compiled.
  • There are many types of scripting languages.
  • The execution of a scripting language generally depends only on its interpreter, so a script can run cross-platform as long as the system has the corresponding language interpreter.
  • JavaScript code added to the website can be used to obtain the user's behavior data during web browsing.
  • Step S202: receive the user's input behavior on the webpage.
  • Step S204: read, via the script file code, the m keywords carried by the input behavior on the webpage.
  • The on-site searches performed by a user in one session form a sequence of on-site searches, expressed as [Keyword1, Keyword2, Keyword3, ...]. Each session is identified by a unique key, so the data can be organized as one record per session: the session key and that session's keyword sequence.
  • The data includes, but is not limited to, the two columns of sessions and keywords; it may also include further dimensions, such as search time and number of searches, to improve clustering performance.
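  • The session-keyed data described above might be assembled as in the following minimal sketch. The row layout (session key, timestamp, keyword) and the function name are assumptions for illustration; the source only states that each session has a unique key and an ordered keyword sequence.

```python
from collections import defaultdict


def build_session_sequences(rows):
    """Group raw search events into one ordered keyword list per session.

    rows: iterable of (session_key, timestamp, keyword) tuples (assumed layout).
    Returns a dict mapping session_key -> [Keyword1, Keyword2, ...] in time order.
    """
    sessions = defaultdict(list)
    # Sort by session first, then by time, so each sequence keeps search order.
    for session_key, _timestamp, keyword in sorted(rows, key=lambda r: (r[0], r[1])):
        sessions[session_key].append(keyword)
    return dict(sessions)


example_rows = [
    ("s1", 2, "electronic eye"),
    ("s1", 1, "violation"),
    ("s2", 1, "electronic observation"),
]
sequences = build_session_sequences(example_rows)
```

  Extra columns such as search counts could be carried along in the same rows if they are needed later as clustering features.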
  • Step S206: acquire the keywords that have a dependency relationship among the m keywords, wherein keywords corresponding to the same user need have a dependency relationship.
  • This step is equivalent to S104 and will not be described again here.
  • Step S207: divide the keywords that have a dependency relationship among the m keywords into the same class of keywords.
  • This step is equivalent to S106 and will not be described again here.
  • FIG. 3 is a flow chart of a third embodiment of a web page data analysis method in accordance with the present invention. This embodiment can be used as a preferred embodiment of the embodiment shown in FIG. 1. As shown in FIG. 3, the webpage data analysis method includes:
  • Step S301: obtain m keywords input by the user on the webpage.
  • This step is equivalent to S102 and will not be described again here.
  • This step may be implemented by steps S201, S202, and S204 of the second embodiment above; details are not repeated here.
  • Step S302: determine a hypothesis condition, wherein the hypothesis condition is a hypothesized logical relationship contained in the input behavior for the m keywords.
  • A user's search behavior necessarily reflects a user need. Reasonable assumptions can be made according to the actual business needs of the web page analyst (that is, which kinds of user needs are of interest), and the dependency relationships between keywords can be derived from those assumptions.
  • For example, suppose the keyword sequence of a session is A-B-C-D.
  • The hypothesis of the method may then be to establish the dependency relationships {AD, BD, CD, DD}.
  • A, B, and C each establish a dependency relationship with D; that is, A and D correspond to the same user need (a first user need), B and D correspond to the same user need (a second user need), and C and D correspond to the same user need (a third user need).
  • Different dependency relationships can be established under other assumptions, such as {AB, BC, CD} or {AB, AC, AD, BC, BD, CD}.
  • The following assumptions can be made: 1. When a user browses the website, the purpose of access within one session is unique. 2. The on-site search keywords generated by the user in the same session are semantically related. 3. In the course of achieving the purpose of access, the user may perform several on-site searches, but these searches are self-correcting. From these three assumptions it follows that the keyword used in the last on-site search of a session is the attribution of all keywords in that session, which makes the dependency relationships between keywords explicit.
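  • The three dependency schemes mentioned for the session A-B-C-D can be sketched as small functions that turn one keyword sequence into a list of (source, target) pairs. The function names are hypothetical and chosen only for this illustration.

```python
from itertools import combinations


def depend_on_last(seq):
    """Every keyword (including the last) points to the last keyword: {AD, BD, CD, DD}."""
    return [(kw, seq[-1]) for kw in seq]


def depend_on_next(seq):
    """Each keyword points to the keyword searched right after it: {AB, BC, CD}."""
    return list(zip(seq, seq[1:]))


def depend_on_all_later(seq):
    """Each keyword points to every later keyword: {AB, AC, AD, BC, BD, CD}."""
    return list(combinations(seq, 2))


session = ["A", "B", "C", "D"]
last_edges = depend_on_last(session)
chain_edges = depend_on_next(session)
pair_edges = depend_on_all_later(session)
```

  Which scheme is appropriate depends on the analyst's assumptions; the "last keyword" scheme corresponds to the three assumptions listed above.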
  • Step S303: create a graph model {G, S} according to the hypothesis condition, wherein G represents the set of m keywords and S represents the set of dependency relationships between the m keywords.
  • A graph model is a graph made up of points (nodes) and lines (edges) that describes a system; it is used to describe the relationship between one thing (a node) and another thing (another node) in the system.
  • If every edge in a graph model has a direction, the graph model is called a directed graph; the graph model used here is a directed graph.
  • Each node in the graph model represents a keyword, and each edge represents a dependency relationship between one keyword and another.
  • Based on the dependency relationships, a directed graph {G, S} of the m keywords is constructed, where G is the set of m keywords in the graph, each represented as a node, and S is the set of keyword dependency relationships, each represented as an edge connecting two nodes. The direction of an edge is determined by the dependency relationship between its two nodes, and the strength of an edge is determined by the number of times that dependency relationship occurs.
  • Under the hypothesis above, every keyword has an edge pointing to the last keyword of its session.
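  • A minimal sketch of building such a directed graph from session sequences follows, assuming the "every keyword points to the last keyword of its session" hypothesis. The representation (a node set plus a `Counter` of edge strengths) and the function name are choices made for this example only.

```python
from collections import Counter


def build_graph(sessions):
    """Build the graph model {G, S} from an iterable of keyword sequences.

    Returns (nodes, edges): nodes is the keyword set G; edges maps a directed
    pair (source, target) to its strength, i.e. how often that dependency occurred.
    """
    nodes = set()
    edges = Counter()
    for seq in sessions:
        if not seq:
            continue  # skip sessions without any on-site search
        target = seq[-1]  # last keyword of the session
        for keyword in seq:
            nodes.add(keyword)
            edges[(keyword, target)] += 1
    return nodes, edges


nodes, edges = build_graph([["A", "B", "D"], ["B", "D"], ["C"]])
```

  In this example the edge from B to D has strength 2 because the dependency was observed in two sessions.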
  • Step S304: obtain the keywords that have a dependency relationship among the m keywords by means of the graph model.
  • The graph model gives the set of all keywords and the set of keyword dependency relationships. According to the actual business needs of the web page analyst, multiple keyword groups, each representing the same user need, can be identified.
  • A simple community-finding algorithm on the graph model is used for query clustering, which avoids traditional clustering algorithms and reduces complexity.
  • Step S305: divide the keywords that have a dependency relationship among the m keywords into the same class of keywords.
  • This step is equivalent to S106, and will not be described again here.
  • FIG. 4 is a flow chart of a fourth embodiment of a web page data analysis method in accordance with the present invention. This embodiment can be used as a preferred embodiment of the embodiment shown in FIG. 3. As shown in FIG. 4, the webpage data analysis method includes:
  • Step S401: obtain m keywords input by the user on the webpage.
  • This step is equivalent to step S301 and will not be described again here.
  • Step S403: determine a hypothesis condition, wherein the hypothesis condition is a hypothesized logical relationship contained in the input behavior for the m keywords.
  • This step is the same as step S302; details are not repeated here.
  • Step S404: create a graph model {G, S} according to the hypothesis condition, wherein G represents the set of m keywords and S represents the set of dependency relationships between the m keywords.
  • This step is the same as step S303; details are not repeated here.
  • Step S405: calculate a transition probability according to the strength of the dependency relationships between the m keywords, wherein the transition probability is the probability that a keyword belongs to a keyword with which it has a dependency relationship.
  • The strength of a dependency relationship can be determined by the number of times it occurs: the more times a dependency relationship is established between two nodes (in this embodiment, two keywords), the greater its strength. From the strengths of the dependency relationships between keywords, the probability that a node depends on each other node, that is, the transition probability, can be calculated.
  • The transition probability is defined as $c(n_i, n_j) / c(n_j)$, where $c(n_i, n_j)$ is the strength of the dependency relationship between the i-th web page data and the j-th web page data, and $c(n_j)$ is the sum of the strengths of all dependency relationships of the j-th web page data, with $i, j \in \{1, 2, \ldots, m\}$ and $i \neq j$.
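  • The definition above can be sketched as follows. This example makes two assumptions that the source does not spell out: the total for a node is taken as the summed strength of the edges leaving that node, and self-loops are excluded because the definition requires i ≠ j. The function name and edge representation are likewise illustrative.

```python
from collections import defaultdict


def transition_probabilities(edges):
    """Normalize edge strengths into per-node transition probabilities.

    edges: dict mapping (source, target) -> strength (count of the dependency).
    Returns dict mapping (source, target) -> strength / total strength of the
    source's outgoing edges (assumed normalization; self-loops are skipped).
    """
    out_total = defaultdict(float)
    for (src, dst), weight in edges.items():
        if src != dst:
            out_total[src] += weight
    return {
        (src, dst): weight / out_total[src]
        for (src, dst), weight in edges.items()
        if src != dst
    }


probs = transition_probabilities(
    {("A", "D"): 3, ("A", "C"): 1, ("B", "D"): 2, ("D", "D"): 5}
)
```

  Here keyword A moves to D with probability 0.75 and to C with probability 0.25, so the outgoing probabilities of each node sum to 1.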
  • Step S406: iterate over the m keywords according to the transition probability, and acquire the keywords that have a dependency relationship among the m keywords.
  • The nodes (keywords) are iterated according to the transition probability: each node (keyword) randomly moves to one of the nodes (keywords) it depends on with the corresponding transition probability. Multiple iterations are performed in this way to compute the keyword group, representing one user need, to which each node (keyword) belongs after the final iteration.
  • A label propagation algorithm can be employed. Note that the details of the label propagation algorithm are not the focus of this application; any algorithm that can cluster a graph falls within the scope of protection. Without loss of generality, the present application provides the following label propagation algorithm for clustering the nodes of the graph.
  • Initially, each node has a unique label, which can be the keyword of the last search in the session in which the search keyword occurs.
  • For each node, calculate the contribution of every neighbor node pointing to that node toward replacing the node's label.
  • The contributions computed for several candidate labels may be equal. In that case, if the node's current label is one of the tied candidates, the node keeps its label; otherwise one of the tied candidates is selected at random and replaces the label.
  • All nodes in the graph are updated synchronously during one round of label propagation: all nodes simultaneously compute the distribution of contributions received at that instant and then update their labels. There is no ordering among the label changes within a round.
  • The above steps are iterated until the labels of all nodes no longer change, at which point the calculation terminates.
  • Alternatively, the iterative process does not wait until it finally stops (which may require too many iterations) but uses a preset number of iterations; after that many iterations, the current result is taken as an approximate clustering result.
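  • The label propagation steps above could look roughly like the following sketch. It is not the patent's reference implementation: the initial label is assumed to be the target of each node's strongest outgoing edge (one reading of "the last keyword of its session"), contributions are assumed to be the edge strengths of incoming edges, and all names are hypothetical.

```python
import random
from collections import defaultdict


def initial_labels(nodes, edges):
    """Assumed initialization: label = target of the node's strongest outgoing edge."""
    labels = {}
    for node in nodes:
        out = {dst: w for (src, dst), w in edges.items() if src == node}
        labels[node] = max(sorted(out), key=out.get) if out else node
    return labels


def propagate_labels(nodes, edges, max_iters=20, seed=0):
    """Synchronous label propagation with a preset iteration cap."""
    rng = random.Random(seed)
    labels = initial_labels(nodes, edges)
    incoming = defaultdict(list)  # node -> [(neighbor pointing to it, strength)]
    for (src, dst), weight in edges.items():
        incoming[dst].append((src, weight))
    for _ in range(max_iters):
        new_labels = {}
        for node in sorted(nodes):
            score = defaultdict(float)
            for src, weight in incoming[node]:
                score[labels[src]] += weight  # contribution of each in-neighbor's label
            if not score:
                new_labels[node] = labels[node]  # no in-neighbors: keep label
                continue
            top = max(score.values())
            tied = sorted(lab for lab, s in score.items() if s == top)
            # Keep the current label on a tie that includes it, else pick at random.
            new_labels[node] = labels[node] if labels[node] in tied else rng.choice(tied)
        if new_labels == labels:
            break  # labels are stable
        labels = new_labels
    return labels


example_edges = {("A", "D"): 2, ("B", "D"): 1, ("D", "D"): 3,
                 ("D", "E"): 1, ("E", "E"): 1, ("F", "E"): 2}
clusters = propagate_labels({"A", "B", "D", "E", "F"}, example_edges)
```

  All new labels are computed from the previous round's labels before any is applied, which mirrors the synchronous update described in the text.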
  • The random walk described above is repeated many times, and the final keyword group (keyword cluster) to which each node (keyword) belongs is decided according to the law of large numbers.
  • This repetition is necessary because the directed graph constructed at the start of the model is a directed cyclic graph, so a node following the transition probabilities may enter a cycle and end at a locally optimal solution. Repeating the steps effectively reduces such errors and improves clustering accuracy.
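  • One way to picture the repeated random walk is the sketch below: each keyword walks along the transition probabilities for a fixed number of steps, the walk is repeated many times, and the most frequent end point is taken as the keyword's cluster. The step count, repeat count, majority vote, and function name are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def random_walk_clusters(nodes, probs, steps=10, repeats=200, seed=0):
    """Assign each node to the end point it reaches most often over many walks.

    probs: dict mapping (source, target) -> transition probability.
    """
    rng = random.Random(seed)
    successors = defaultdict(list)
    for (src, dst), p in probs.items():
        successors[src].append((dst, p))
    assignment = {}
    for start in sorted(nodes):
        end_points = Counter()
        for _run in range(repeats):
            current = start
            for _step in range(steps):
                choices = successors.get(current)
                if not choices:
                    break  # no outgoing dependency: the walk stops here
                targets, weights = zip(*choices)
                current = rng.choices(targets, weights=weights, k=1)[0]
            end_points[current] += 1
        # Majority vote over repeated walks (law-of-large-numbers style decision).
        assignment[start] = end_points.most_common(1)[0][0]
    return assignment


walk_result = random_walk_clusters(
    {"A", "B", "C", "D", "E"},
    {("A", "D"): 1.0, ("B", "D"): 1.0, ("C", "E"): 1.0},
)
```

  With cycles in the graph a single walk can stop anywhere on the cycle, which is why the vote over many repetitions is used instead of one run.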
  • Step S407: divide the keywords that have a dependency relationship among the m keywords into the same class of keywords.
  • This step is the same as step S106 and will not be described again here.
  • FIG. 5 is a flow chart of a fifth embodiment of a web page data analysis method in accordance with the present invention. This embodiment can be used as a preferred embodiment of the embodiment shown in FIG. 4. As shown in FIG. 5, the webpage data analysis method includes:
  • Step S501: obtain m keywords input by the user on the webpage.
  • This step is equivalent to step S301 and will not be described again here.
  • Step S503: determine a hypothesis condition, wherein the hypothesis condition is a hypothesized logical relationship contained in the input behavior for the m keywords.
  • This step is the same as step S302; details are not repeated here.
  • Step S504: create a graph model {G, S} according to the hypothesis condition, wherein G represents the set of m keywords and S represents the set of dependency relationships between the m keywords.
  • This step is the same as step S303; details are not repeated here.
  • Step S505: calculate a transition probability according to the strength of the dependency relationships between the m keywords, wherein the transition probability is the probability that a keyword belongs to a keyword with which it has a dependency relationship.
  • This step is the same as step S405; details are not repeated here.
  • Step S507: iterate on the i-th keyword according to the transition probability, and compute the k-th keyword cluster to which the i-th keyword belongs after the iteration, where $k \in \{1, 2, \ldots, i-1, i+1, \ldots, m\}$.
  • In each iteration, each node randomly moves to a node it depends on with the transition probability.
  • As the iterations proceed, keywords that embody the same user need gather together more and more, until the keyword cluster covers all keywords among the m keywords that have the dependency relationship.
  • Step S508: determine whether the difference between the i-th keyword cluster and the k-th keyword cluster is less than a preset value, wherein the preset value is the preset allowable error of a keyword cluster.
  • The preset value can be set according to the needs of different data analysts.
  • Through repeated iteration, the i-th keyword approaches the keyword cluster to which it belongs step by step.
  • Step S509: if the difference between the i-th keyword cluster and the k-th keyword cluster is greater than the preset value, continue iterating.
  • That is, step S507 is repeated.
  • If the difference between the i-th keyword cluster and the k-th keyword cluster is greater than the preset value, the keyword cluster does not yet fully cover the keywords reflecting the same user need, and iteration must continue.
  • Step S510: if the difference between the i-th keyword cluster and the k-th keyword cluster is less than or equal to the preset value, stop iterating and acquire all keywords in the keyword cluster to which the i-th keyword belongs.
  • If the difference between the i-th keyword cluster and the k-th keyword cluster is less than or equal to the preset value, the keyword cluster can be considered to fully contain the keywords reflecting the same user need.
  • Alternatively, the number of iterations may be set according to the analysis needs of the data analyst.
  • When the preset number of iterations has been completed, all keywords in the keyword cluster to which the i-th keyword belongs are acquired.
  • Step S511: divide the keywords that have a dependency relationship among the m keywords into the same class of keywords.
  • This step is the same as step S106 and will not be described again here.
  • The keywords that have a dependency relationship are thereby divided into the same class of keywords.
  • The preset value can be set according to the user's requirements, that is, the error range of a keyword cluster is configurable, so the needs of different data analysts can be met and the method's range of application is widened.
  • This repeated iteration makes the clustering results more accurate.
  • Figure 6 is a flow chart of a sixth embodiment of a web page data analysis method in accordance with the present invention. This embodiment can be used as a preferred embodiment of the embodiment shown in FIG. 1. As shown in FIG. 6, the webpage data analysis method includes:
  • Step S601: obtain m keywords input by the user on the webpage.
  • This step is equivalent to step S102 and will not be described again here.
  • This step may be implemented by steps S201, S202, and S204 of the second embodiment above; details are not repeated here.
  • Step S602: acquire the keywords that have a dependency relationship among the m keywords, wherein keywords corresponding to the same user need have a dependency relationship.
  • This step is the same as step S104; details are not repeated here.
  • This step may be implemented by steps S503-S510 of the fifth embodiment above; details are not repeated here.
  • Step S603: divide the keywords that have a dependency relationship among the m keywords into the same class of keywords.
  • This step is the same as step S106; details are not repeated here.
  • Step S604: name each of the multiple keyword classes.
  • Since each keyword class reflects a different user need, each class can be named in order to describe that need.
  • The naming method may be rule-based, statistics-based, or a combination of the two, that is, a hybrid naming method.
  • Naming methods for a keyword class include, but are not limited to: ranking the keywords by the number of user searches or by the number of clicks after searching and selecting a higher-ranked keyword as the name; or performing maximum likelihood estimation on the aggregation points at which the graph model converges and naming the class after the keyword on which it concentrates.
  • Step S605: sort the named keyword classes according to the number of keywords contained in each class.
  • Sorting means ordering by a statistic of each keyword class. The higher the statistic, the stronger the user need corresponding to that class (keyword cluster).
  • Commonly used statistics include the number of searches for the keywords in the cluster and the number of sessions to which the keywords in the cluster belong.
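  • The naming and sorting steps could be sketched as follows, assuming a statistics-based naming rule (the most-searched keyword in a class becomes its name) and sorting by the number of keywords per class. The input shapes and the function name are hypothetical.

```python
from collections import defaultdict


def name_and_rank(labels, search_counts):
    """Name each keyword class and sort classes by size, largest first.

    labels: dict mapping keyword -> cluster id (e.g. a propagated label).
    search_counts: dict mapping keyword -> number of searches (assumed statistic).
    Returns a list of (class_name, sorted_member_keywords).
    """
    clusters = defaultdict(list)
    for keyword, cluster_id in labels.items():
        clusters[cluster_id].append(keyword)
    named = []
    for members in clusters.values():
        members = sorted(members)
        # Name the class after its most-searched keyword.
        name = max(members, key=lambda kw: search_counts.get(kw, 0))
        named.append((name, members))
    # Larger classes first; ties are broken alphabetically by name.
    named.sort(key=lambda item: (-len(item[1]), item[0]))
    return named


ranking = name_and_rank(
    {"A": "D", "B": "D", "D": "D", "E": "E", "F": "E"},
    {"A": 5, "B": 1, "D": 3, "E": 2, "F": 7},
)
```

  Swapping the sort key for total searches or session counts would implement the other statistics mentioned above.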
  • This embodiment gives specific steps for analyzing web page data: obtain m keywords input by the user on the webpage; acquire the keywords that have a dependency relationship among the m keywords; divide those keywords into the same class of keywords; name each keyword class; and sort the named keyword classes according to the number of keywords contained in each class.
  • Figure 7 is a block diagram showing the structure of a first embodiment of a web page data analyzing apparatus according to the present invention. As shown in FIG. 7, the device structure includes:
  • the first obtaining unit 22 is configured to acquire m keywords input by the user on the webpage.
  • each user requirement can be represented by different keywords input by the user, and each keyword can also represent a plurality of different user requirements. intention.
  • the device can identify the user's needs by clustering the keywords entered by the user on the website.
  • the second obtaining unit 24 is configured to obtain a keyword having a dependency relationship among the m keywords, wherein the corresponding keyword having the same user demand has a dependency relationship.
  • Keywords searched For a user's web page data search behavior, there is often a relationship between the keywords searched. This relationship is not the similarity of the keywords on the literal side, but the user requirements of the keywords are the same.
  • the dividing unit 26 is configured to divide the keywords having the dependent relationships among the m keywords into the same type of keywords.
  • the keywords input by the user can be divided into several categories according to the dependency relationship.
  • the clustering method implemented by the device deep keyword aggregation relationships can be mined to accurately represent user requirements. For example, the device can find the relationship between "violation”, “electronic eye”, “electronic jin” and "electronic observation”.
  • the webpage data analyzing apparatus includes: a first obtaining unit 22, a second obtaining unit 24, and a dividing unit 26.
  • the webpage data analysis is based on the dependency relationship between the keywords determined by the user's needs, and is no longer one-sidedly dependent on the degree of literal overlap between the keywords.
  • the device overcomes the limitation of the traditional query aggregation process, which rests on the assumption that queries must match literally.
  • the user behavior data is used for data mining, and the obtained clustering result can more accurately reflect the user's demand.
  • Figure 8 is a block diagram showing the structure of a second embodiment of a web page data analyzing apparatus in accordance with the present invention. This embodiment can be used as a preferred embodiment of the embodiment shown in FIG. 7. As shown in FIG. 8, the webpage data analyzing apparatus includes:
  • the first obtaining unit 22 may further include:
  • the loading module 32 is configured to load the script file code on the webpage.
  • a script file is similar to a batch file in the DOS operating system. It combines different commands and executes them automatically and continuously in a determined order. Script programs are closer to natural language than general program development, and can be interpreted without compiling.
  • There are many types of scripting languages.
  • the execution of a general scripting language depends only on the specific interpreter, so as long as there is a corresponding language interpreter on the system, it can be cross-platform.
  • Preferably, the module can use JavaScript: by adding JavaScript code to the website, it obtains behavior data of the user while browsing the webpage.
  • the receiving module 34 is configured to receive a user input behavior on a webpage.
  • When the user searches within the website, the receiving module 34 can receive the data the user inputs; the input is monitored and read dynamically through the JavaScript code.
  • the reading module 36 is configured to read the m keywords carried by the input behavior of the webpage through the script file code.
  • the in-site searches performed by the user in one session constitute a sequence of in-site searches, expressed as [Keyword1, Keyword2, Keyword3, ...]. Each session is represented by a unique key, which yields data as (session, keyword) rows, for example (1, Keyword1), (1, Keyword2), (2, Keyword2).
  • the data includes but is not limited to the two columns session and keyword, and may also include more dimensions such as search time and number of searches to improve the performance of the clustering.
  • the first obtaining unit 22 of the webpage data analyzing apparatus may further include the following modules: a loading module 32, a receiving module 34, and a reading module 36.
  • Figure 9 is a block diagram showing the structure of a third embodiment of the web page data analyzing apparatus according to the present invention. This embodiment can be used as a preferred embodiment of the embodiment shown in FIG. 7. As shown in FIG. 9, the webpage data analyzing apparatus includes:
  • the second obtaining unit 24 may further include:
  • a first determining module 42 configured to determine a hypothesis condition, wherein the hypothesis condition is an assumed logical relationship contained in the input behavior of the m keywords.
  • A user who performs a search necessarily has some requirement, and the first determining module 42 can obtain reasonable hypothesis conditions that the web page data analyst proposes according to his or her own business needs. It should be noted that the hypothesis conditions reflect the dependency relationships between the keywords.
  • For example, if the keyword sequence of a session is A-B-C-D, the hypothesis condition can be to establish the dependency relationships {AD, BD, CD, DD}.
  • Other hypothesis conditions can establish different dependency relationships, such as {AB, BC, CD} or {AB, AC, AD, BC, BD, CD}.
  • Preferably, the hypothesis conditions may be as follows: 1. when the user browses the website, the access purpose of a single session is unique; 2. the in-site keywords generated by the user in the same session are semantically related; 3. in the process of achieving the access purpose, the user may perform multiple in-site searches, but these searches are self-correcting. Based on these three hypotheses, it can be concluded that the keyword used in the last in-site search of a session is the attribution of all keywords in that session. On this basis, the dependency relationships between keywords can be made explicit.
  • a creation module 44 is configured to create a graph model {G, S} according to the hypothesis condition, wherein G represents the set of m keywords, and S represents the set of dependency relationships between the m keywords.
  • the creating module 44 can construct a directed graph {G, S} of the m keywords, where G represents the set of m keywords in the graph, and each keyword can be represented as a node in the graph.
  • S represents the set of keyword dependency relationships in the graph, each being an edge connecting two nodes, wherein the direction of the edge is determined by the dependency relationship of the two nodes, and the strength of the edge is determined by the number of times the dependency relationship occurs.
  • Under the preferred hypothesis conditions described above, within one session all keywords have an edge that points to the last keyword of the session.
  • the first obtaining module 46 is configured to obtain, by using the graph model, a keyword having a dependency relationship among the m keywords.
  • a set of all keywords and keyword dependencies is given in the graph model. According to the actual business needs of the web page analyst, the first obtaining module 46 can identify a plurality of keyword groups representing the same user's needs.
  • the first obtaining module 46 clusters the queries with a simple algorithm that finds communities in the graph model, which avoids traditional clustering algorithms; the stated complexity is O(n lg n).
  • the second obtaining unit 24 in the webpage data analyzing apparatus may further include the following modules: a first determining module 42, a creating module 44, and a first obtaining module 46.
  • the first determining module 42 can formulate the hypothesis conditions according to the different needs of the user, so that the device applies to a wide range of web page data analysis and can meet the needs of various web page data analysts.
  • Because the device builds the relationships between keywords from the logical relationships contained in the input behavior on the webpage, the user's requirements can be accurately reflected.
  • the webpage data analysis device includes:
  • the first obtaining module 46 may further include:
  • the calculation module is configured to calculate a transition probability according to the strength of the dependency relationship between the m keywords, wherein the transition probability is a probability that each keyword belongs to a keyword having a dependency relationship.
  • the calculation module can calculate the probability that the node depends on each node, that is, the transition probability, according to the strength of the dependency relationship between the keywords.
  • the transition probability can be defined as $c(n_i, n_j)/c(n_j)$, where $c(n_i, n_j)$ is the strength of the dependency relationship between the i-th web page data item and the j-th web page data item, and $c(n_j)$ is the sum of the strengths of all dependency relationships of the j-th web page data item, with $i, j \in \{1, 2, \dots, m\}$ and $i \neq j$.
  • the second obtaining module iterates the m keywords according to the transition probability, and obtains keywords with dependent relationships among the m keywords.
  • the module iterates over the nodes (keywords) according to the transition probability: each node (keyword) moves randomly, with the transition probability, to one of the nodes (keywords) it depends on. After iterating in this way multiple times, the module can output the keyword group, representing a single user requirement, to which each node (keyword) belongs after the final iteration.
  • the above-described random walk is repeated many times, and the decision as to which final keyword group (keyword cluster) each node (keyword) belongs to is obtained according to the law of large numbers.
  • This repetition is necessary because the directed graph constructed at the beginning of the model is a directed cyclic graph, so a node may enter a cycle through the transition probabilities and arrive at a local optimum. Repeating the steps effectively reduces such errors, so that the accuracy of the clustering results obtained by the module is improved.
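The repeated random walk can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the fixed step limit, the repeat count and the toy transition table `probs` are all assumptions introduced here. Each walk follows the transition probabilities from a start keyword, and the most frequent end point over many walks is taken as the keyword the start node is attributed to.

```python
import random
from collections import Counter


def random_walk(start, probs, steps, rng):
    """Follow transition probabilities from `start` for at most `steps` moves."""
    node = start
    for _ in range(steps):
        targets = probs.get(node)
        if not targets:  # no outgoing dependency: the walk rests here
            break
        node = rng.choices(list(targets), weights=list(targets.values()))[0]
    return node


def most_frequent_endpoint(start, probs, steps=10, repeats=200, seed=0):
    """Repeat the walk many times and return the most common end node."""
    rng = random.Random(seed)
    ends = Counter(random_walk(start, probs, steps, rng) for _ in range(repeats))
    return ends.most_common(1)[0][0]


# Hypothetical transition table: A depends mostly on D, B depends only on D.
probs = {"A": {"D": 0.75, "B": 0.25}, "B": {"D": 1.0}}
cluster_of_a = most_frequent_endpoint("A", probs)
print(cluster_of_a)
```

In this toy graph every path ends at D, so the vote is unanimous; on a graph with cycles the repeats are what smooth out walks that get caught in a loop.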
  • the first obtaining module 46 of the webpage data analyzing apparatus may further include the following modules: a calculation module and a second obtaining module. Because the second obtaining module, through iteration, gradually merges the keywords that belong to the same user requirement into one class, the clustering is more in line with the real needs of the user, and the clustering result has greater analytical value.
  • the webpage data analysis device includes:
  • the second obtaining unit 24 includes a first determining module 42, a creating module 44, and a first obtaining module 46, the first obtaining module 46 further comprising a calculation module and the second obtaining module.
  • the second obtaining module may further include:
  • the calculation sub-module is configured to perform one iteration on the i-th keyword according to the transition probability, and to calculate the k-th keyword cluster to which the i-th keyword belongs after the iteration, where $k \in \{1, 2, \dots, i-1, i+1, \dots, m\}$.
  • each node is randomly moved to its dependent node with a transition probability.
  • the keywords that embody the same user needs will gather more and more until the keyword cluster covers all the keywords that have the dependency in m keywords.
  • the determining sub-module is configured to determine whether the difference between the i-th cluster and the k-th cluster is less than a preset value, wherein the preset value is an error value allowed by the preset keyword cluster.
  • the preset value can be set according to the needs of different data analysts, and the data is input into the sub-module.
  • If the difference between the i-th cluster and the k-th cluster is greater than the preset value, the iteration sub-module continues the iteration.
  • When the determining sub-module judges that the difference between the i-th keyword cluster and the k-th keyword cluster is less than or equal to the preset value, it can be considered that the keywords embodying the same user requirement have been completely included in the keyword cluster.
  • the determining sub-module may also determine the number of iterations according to the analysis requirements of the data analyst.
  • the acquisition sub-module acquires all the keywords in the keyword cluster to which the i-th keyword belongs.
  • the second obtaining module in the webpage data analyzing apparatus may further include the following modules: an assumption sub-module, a calculation sub-module, a determining sub-module, an iteration sub-module, and an obtaining sub-module. Since the determining sub-module can set the preset value, that is, the permitted error range of a cluster, according to the user's requirements, the device can meet the needs of different data analysts and its range of application becomes wider. At the same time, the iteration sub-module repeats the iteration several times, which also makes the final clustering result more accurate.
  • Figure 10 is a block diagram showing the structure of a sixth embodiment of the web page data analyzing apparatus according to the present invention. This embodiment can be used as a preferred embodiment of the embodiment shown in FIG. 7. As shown in FIG. 10, the webpage data analyzing apparatus includes:
  • the first obtaining unit 22, the second obtaining unit 24, and the dividing unit 26 are the same as those in FIG. 7, and are not described herein again.
  • the naming unit 28 and the sorting unit 30 are specifically:
  • the naming unit 28 is configured to respectively name a plurality of keywords of the same type.
  • the naming unit 28 can be used to name the obtained keywords of the same class.
  • the naming method may be a rule-based naming method and a statistic-based naming method, or a combination of the two methods, that is, a mixed naming method.
  • the naming methods for a class of keywords include, but are not limited to: naming according to behavior such as the number of user searches or the number of clicks on search results, selecting the highest-ranked keyword as the name; and performing maximum likelihood estimation on the aggregation points reached when the graph model converges, taking the keyword on which the nodes converge as the name.
  • the sorting unit 210 is configured to sort the named multiple keywords of the same type according to the number of keywords included in each type of keyword.
  • Sorting refers to sorting according to the statistics of the same type of keywords. The higher the statistic, the stronger the user demand corresponding to the same type of keyword (keyword cluster).
  • the commonly used statistics include: the number of keyword searches in the cluster and the number of sessions to which the keywords in the cluster belong.
  • in this embodiment, the webpage data analyzing device can further include the following units: a naming unit 28 and a sorting unit 210.
  • the clusters are each named by the naming unit 28, and the sorting unit 210 sorts the clusters according to the number of keywords included in each cluster, so that the search popularity of each class of data is shown more clearly and the clustering results are presented more intuitively to the web page data analyst.
  • each unit and module may run as a part of the apparatus in a mobile terminal, a computer terminal or a similar computing device; the functions implemented by the above-described units and modules may be performed by a processor in the mobile terminal, computer terminal or similar computing device, or the units and modules may be stored as part of a storage medium.
  • the above mobile terminal, computer terminal or similar computing device may be a smart phone (such as an Android mobile phone, an iOS mobile phone, etc.), a tablet computer, a palm computer, and a mobile Internet device (MID), a PAD, and the like.
  • an embodiment of the present invention may provide a computer terminal, which may be any computer terminal device in a computer terminal group.
  • the foregoing computer terminal may also be replaced with a terminal device such as a mobile terminal.
  • the computer terminal may be located in at least one network device of the plurality of network devices of the computer network.
  • the computer terminal may execute the program code of the following steps in the webpage data analysis method: acquiring m keywords input by the user on the webpage; acquiring the keywords having dependency relationships among the m keywords, wherein keywords that correspond to the same user requirement have a dependency relationship; and dividing the keywords having dependency relationships among the m keywords into the same class of keywords.
  • the computer terminal can include: one or more processors, memory, and transmission means.
  • the memory can be used to store software programs and modules, such as the program instructions/modules corresponding to the webpage data analysis method and device in the embodiments of the present invention; by running the software programs and modules stored in the memory, the processor executes various functional applications and data processing, that is, implements the above-described web page data analysis method.
  • the memory may include a high speed random access memory, and may also include non-volatile memory such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory.
  • the memory can further include memory remotely located relative to the processor, which can be connected to the terminal over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the above transmission device is for receiving or transmitting data via a network.
  • Specific examples of the above network may include a wired network and a wireless network.
  • In one example, the transmission device includes a Network Interface Controller (NIC) that can be connected to other network devices and routers via a network cable to communicate with the Internet or a local area network.
  • In another example, the transmission device is a Radio Frequency (RF) module for communicating with the Internet wirelessly.
  • the memory is used to store preset action conditions and information of the preset rights user, and an application.
  • the processor can call the memory stored information and the application by the transmitting device to execute the program code of the method steps of each of the alternative or preferred embodiments of the above method embodiments.
  • the computer terminal can also be a smart phone (such as an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a mobile Internet device (MID), a PAD, and the like.
  • the embodiment of the invention further provides a storage medium.
  • the foregoing storage medium may be used to save the program code executed by the webpage data analysis method provided by the foregoing method embodiment and the device embodiment.
  • the foregoing storage medium may be located in any one of the computer terminal groups in the computer network, or in any one of the mobile terminal groups.
  • the storage medium is configured to store program code for performing the following steps: acquiring m keywords input by the user on the webpage; acquiring the keywords having dependency relationships among the m keywords, wherein keywords that correspond to the same user requirement have a dependency relationship; and dividing the keywords having dependency relationships among the m keywords into the same class of keywords.
  • the storage medium may also be configured to store program code of various preferred or optional method steps provided by the web page data analysis method.
  • modules or steps of the present invention described above can be implemented by a general-purpose computing device that can be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by the computing device, such that they may be stored in a storage device by a computing device, or they may be fabricated into individual integrated circuit modules, or Multiple modules or steps are made into a single integrated circuit module. Thus, the invention is not limited to any specific combination of hardware and software.

Abstract

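A web page data analysis method and apparatus. The web page data analysis method includes: obtaining m keywords input by a user on a web page (S102); obtaining the keywords that have a dependency relationship among the m keywords, wherein keywords corresponding to the same user requirement have a dependency relationship (S104); and dividing the keywords that have a dependency relationship among the m keywords into the same class of keywords (S106). Web page data are thereby clustered using the dependency relationships between keywords that are determined by user requirements, so that the clustering result accurately reflects user requirements.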

Description

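Web Page Data Analysis Method and Apparatus
Technical Field
The present invention relates to the field of data analysis, and in particular to a web page data analysis method and apparatus.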
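Background
Users usually browse a website with a certain purpose and intention. For a website, understanding the real intention behind a visit is very important. Websites usually classify their visitors by building a model of the user's browsing trajectory and training a classifier, or describe user requirements through the popularity of the in-site search terms (queries).
In-site search is a behavior in which the user actively looks for information, so it can describe user requirements to some extent. Traditional in-site search term clustering depends on the search terms themselves and computes the literal overlap between terms. It is generally implemented as follows. Step 1: split each keyword literally (character by character or by word segmentation), so that the keyword is represented as a sequence of words (characters). Step 2: compute the similarity of every pair of keywords (Jaccard, edit distance, and so on), that is, compare the degree of overlap between the token sequences of two search terms and return a similarity measure. Step 3: cluster with a clustering algorithm such as k-means or hierarchical clustering; different clustering algorithms are implemented differently but do not differ in substance. Because the traditional technique links keywords through their degree of literal overlap, it does not match reality and merely constructs a dependency relationship mechanically, so it cannot explain user requirements accurately. For example, "三星" (Samsung) and "苹果" (Apple) share no literal match, yet they should be highly related, whereas "本田" (Honda) and "本源" are two entirely unrelated terms that nevertheless appear related literally. Moreover, existing in-site search term clustering has to compute the similarity between every two keywords, which is computationally expensive and unsuitable for large-scale data mining.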
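The weakness of literal overlap can be seen in a minimal sketch of the traditional similarity step. The function name `char_jaccard` and the choice of character-level sets are assumptions made for illustration; the source only names Jaccard and edit distance as possible measures.

```python
def char_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the sets of characters of two search terms."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# The two keyword pairs discussed in the background section.
honda_vs_benyuan = char_jaccard("本田", "本源")   # unrelated terms sharing one character
samsung_vs_apple = char_jaccard("三星", "苹果")   # related brands sharing no character
print(honda_vs_benyuan, samsung_vs_apple)
```

The unrelated pair scores higher than the related pair, which is the mismatch the invention sets out to avoid. Computing this for every pair of keywords also grows quadratically with the number of keywords.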
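No effective solution has yet been proposed for the problem that web page data analysis methods in the related art depend solely on the degree of literal overlap between search terms, so that the analysis results cannot accurately explain user requirements.
Summary of the Invention
The present invention is proposed in view of the problem that existing web page data analysis methods depend solely on the degree of literal overlap between search terms, so that the analysis results cannot accurately explain user requirements. The main object of the present invention is therefore to provide a web page data analysis method and apparatus that solve this problem.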
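To achieve the above object, according to one aspect of the present invention, a web page data analysis method is provided. The method includes: obtaining m keywords input by a user on a web page; obtaining the keywords that have a dependency relationship among the m keywords, wherein keywords corresponding to the same user requirement have a dependency relationship; and dividing the keywords that have a dependency relationship among the m keywords into the same class of keywords.
Further, obtaining the m keywords input by the user on the web page includes: loading script file code on the web page; receiving the user's input behavior on the web page; and reading, through the script file code, the m keywords carried by the input behavior on the web page.
Further, obtaining the keywords that have a dependency relationship among the m keywords includes: determining a hypothesis condition, wherein the hypothesis condition is an assumed logical relationship contained in the input behavior of the m keywords; creating a graph model {G, S} according to the hypothesis condition, wherein G represents the set of the m keywords and S represents the set of dependency relationships between the m keywords; and obtaining, through the graph model, the keywords that have a dependency relationship among the m keywords.
Further, obtaining, through the graph model, the keywords that have a dependency relationship among the m keywords includes: calculating transition probabilities according to the strength of the dependency relationships between the m keywords, wherein a transition probability is the probability that a keyword belongs to a keyword with which it has a dependency relationship; and iterating over the m keywords according to the transition probabilities to obtain the keywords that have a dependency relationship among the m keywords.
Further, iterating over the m keywords according to the transition probabilities to obtain the keywords that have a dependency relationship includes: assuming that before the iteration the i-th keyword belongs to the i-th keyword cluster, wherein a cluster is a set of keywords of one class and i = 1, 2, ..., m; performing one iteration on the i-th keyword according to the transition probabilities and calculating the k-th keyword cluster to which the i-th keyword belongs after the iteration, wherein $k \in \{1, 2, \dots, i-1, i+1, \dots, m\}$; judging whether the difference between the i-th cluster and the k-th cluster is less than a preset value, wherein the preset value is a preset permitted error of a keyword cluster; if the difference between the i-th cluster and the k-th cluster is greater than the preset value, continuing the iteration; and if the difference between the i-th cluster and the k-th cluster is less than or equal to the preset value, stopping the iteration and obtaining all keywords in the keyword cluster to which the i-th keyword belongs.
Further, after dividing the keywords that have a dependency relationship among the m keywords into the same class of keywords, the method also includes: naming each of the multiple classes of keywords; and sorting the named classes of keywords according to the number of keywords each class contains.
To achieve the above object, according to another aspect of the present invention, a web page data analysis apparatus is provided. The apparatus includes: a first obtaining unit for obtaining m keywords input by a user on a web page; a second obtaining unit for obtaining the keywords that have a dependency relationship among the m keywords, wherein keywords corresponding to the same user requirement have a dependency relationship; and a dividing unit for dividing the keywords that have a dependency relationship among the m keywords into the same class of keywords.
Further, the first obtaining unit includes: a loading module for loading script file code on the web page; a receiving module for receiving the user's input behavior on the web page; and a reading module for reading, through the script file code, the m keywords carried by the input behavior on the web page.
Further, the second obtaining unit includes: a first determining module for determining a hypothesis condition, wherein the hypothesis condition is an assumed logical relationship contained in the input behavior of the m keywords; a creating module for creating a graph model {G, S} according to the hypothesis condition, wherein G represents the set of the m keywords and S represents the set of dependency relationships between the m keywords; and a first obtaining module for obtaining, through the graph model, the keywords that have a dependency relationship among the m keywords.
Further, the first obtaining module includes: a calculation module for calculating transition probabilities according to the strength of the dependency relationships between the m keywords, wherein a transition probability is the probability that a keyword belongs to a keyword with which it has a dependency relationship; and a second obtaining module for iterating over the m keywords according to the transition probabilities to obtain the keywords that have a dependency relationship among the m keywords.
Further, the second obtaining module includes: an assumption sub-module for assuming that before the iteration the i-th keyword belongs to the i-th keyword cluster, wherein a cluster is a set of keywords of one class and i = 1, 2, ..., m; a calculation sub-module for performing one iteration on the i-th keyword according to the transition probabilities and calculating the k-th keyword cluster to which the i-th keyword belongs after the iteration, wherein $k \in \{1, 2, \dots, i-1, i+1, \dots, m\}$; a determining sub-module for judging whether the difference between the i-th cluster and the k-th cluster is less than a preset value, wherein the preset value is a preset permitted error of a keyword cluster; an iteration sub-module for continuing the iteration if the difference between the i-th cluster and the k-th cluster is greater than the preset value; and an obtaining sub-module for stopping the iteration, if the difference between the i-th cluster and the k-th cluster is less than or equal to the preset value, and obtaining all keywords in the keyword cluster to which the i-th keyword belongs.
Further, the apparatus also includes: a naming unit for naming each of the multiple classes of keywords; and a sorting unit for sorting the named classes of keywords according to the number of keywords each class contains.
The present invention adopts a method including the following steps: obtaining m keywords input by a user on a web page; obtaining the keywords that have a dependency relationship among the m keywords, wherein keywords corresponding to the same user requirement have a dependency relationship; and dividing the keywords that have a dependency relationship among the m keywords into the same class of keywords. This solves the problem that existing web page data analysis methods depend solely on the degree of literal overlap between search terms and therefore cannot accurately explain user requirements, and achieves the effect of clustering web page data using the dependency relationships between keywords determined by user requirements, so that the clustering result accurately reflects user requirements.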
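Brief Description of the Drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The illustrative embodiments of the present invention and their description serve to explain the present invention and do not unduly limit it. In the drawings:
FIG. 1 is a flow chart of a first embodiment of the data analysis method according to the present invention;
FIG. 2 is a flow chart of a second embodiment of the data analysis method according to the present invention;
FIG. 3 is a flow chart of a third embodiment of the data analysis method according to the present invention;
FIG. 4 is a flow chart of a fourth embodiment of the data analysis method according to the present invention;
FIG. 5 is a flow chart of a fifth embodiment of the data analysis method according to the present invention;
FIG. 6 is a flow chart of a sixth embodiment of the data analysis method according to the present invention;
FIG. 7 is a structural block diagram of a first embodiment of the data analysis apparatus according to the present invention;
FIG. 8 is a structural block diagram of a second embodiment of the data analysis apparatus according to the present invention;
FIG. 9 is a structural block diagram of a third embodiment of the data analysis apparatus according to the present invention; and
FIG. 10 is a structural block diagram of a sixth embodiment of the data analysis apparatus according to the present invention.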
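Detailed Description
It should be noted that, provided there is no conflict, the embodiments in this application and the features in the embodiments may be combined with one another. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
FIG. 1 is a flow chart of a first embodiment of the web page data analysis method according to the present invention. As shown in FIG. 1, the method includes the following steps:
Step S102: obtain m keywords input by a user on a web page, where m is a natural number greater than 1.
In principle there is a many-to-many dependency relationship between user requirements and the keywords users input: each user requirement can be expressed through different keywords input by the user, and each keyword can express several different user requirements. To simplify the problem, the method defines the relationship between user requirements and input keywords as one-to-many. User requirements can then be identified by clustering the keywords that users input on the website.
Step S104: obtain the keywords that have a dependency relationship among the m keywords, wherein keywords corresponding to the same user requirement have a dependency relationship.
Within a single web page data search by a user, the keywords searched are often related. This relationship is not literal similarity between the keywords, but the fact that the keywords express the same user requirement. For example, when a user searches web page data, the following dependency relationships may exist between the keywords searched: a keyword is the attribution of the preceding keyword ($k_i = f(k_{i-1})$), or a later keyword is the attribution of all preceding keywords ($k_i = f(k_{i-1}, k_{i-2}, k_{i-3}, \dots, k_1)$), and so on.
Step S106: divide the keywords that have a dependency relationship among the m keywords into the same class of keywords.
Because a class of keywords with a dependency relationship corresponds to the same user requirement, the keywords input by users can be divided into several classes according to the dependency relationships. Clustering in this way can mine deep keyword aggregation relationships and thus represent user requirements accurately. For example, it can discover the relationship between "违章" (violation), "电子眼" (electronic eye), "电子jin" and "电子敬察".
This embodiment takes the following steps: obtaining m keywords input by a user on a web page; obtaining the keywords that have a dependency relationship among the m keywords; and dividing the keywords that have a dependency relationship among the m keywords into the same class of keywords. As a result, the web page data analysis is based on the dependency relationships between keywords determined by user requirements and no longer depends one-sidedly on the degree of literal overlap between keywords. The method overcomes the limitation of the traditional query aggregation process, which rests on the assumption that queries must match literally, and mines user behavior data to build a mathematical model that better matches user requirements.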
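FIG. 2 is a flow chart of a second embodiment of the web page data analysis method according to the present invention. This embodiment can serve as a preferred implementation of the embodiment shown in FIG. 1. As shown in FIG. 2, the web page data analysis method includes:
Step S201: load script file code on the web page.
A script file is similar to a batch file in the DOS operating system: it combines different commands and executes them automatically and continuously in a determined order. Compared with general program development, script programs are closer to natural language and can be interpreted without being compiled.
There are many scripting languages. The execution of a general scripting language depends only on the specific interpreter, so a script can be cross-platform as long as the system has an interpreter for the language. Preferably, JavaScript may be used in this embodiment: JavaScript code is added to the website to obtain behavior data while the user browses web pages.
Step S202: receive the user's input behavior on the web page.
When the user searches within the website, the data input can be monitored and read dynamically through the JavaScript code.
Step S204: read, through the script file code, the m keywords carried by the input behavior on the web page.
The in-site searches a user performs in one session form a sequence of in-site searches, expressed as [Keyword1, Keyword2, Keyword3, ...]. Representing each session by a unique key yields data in the following format:
Session Keyword
1 Keyword1
1 Keyword2
2 Keyword2
The data includes but is not limited to the two columns session and keyword, and may also contain more dimensions such as search time and number of searches, in order to improve the performance of the clustering.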
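A minimal sketch of turning rows in this format back into one keyword sequence per session is shown below. The sample rows mirror the table above; the function name `group_by_session` and the assumption that rows arrive in search order are illustrative and not part of the source.

```python
from collections import defaultdict

# Rows in the (session, keyword) format shown above, assumed to be
# listed in the order in which the searches were made.
rows = [(1, "Keyword1"), (1, "Keyword2"), (2, "Keyword2")]


def group_by_session(rows):
    """Collect the keywords of each session into an ordered sequence."""
    sessions = defaultdict(list)
    for session_id, keyword in rows:
        sessions[session_id].append(keyword)
    return dict(sessions)


sequences = group_by_session(rows)
print(sequences)
```

If a search-time column is collected as well, sorting the rows by that column before grouping would make the ordering assumption explicit.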
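Step S206: obtain the keywords that have a dependency relationship among the m keywords, wherein keywords corresponding to the same user requirement have a dependency relationship.
This step is equivalent to S104 and is not described again here.
Step S207: divide the keywords that have a dependency relationship among the m keywords into the same class of keywords.
This step is equivalent to S106 and is not described again here.
This embodiment gives the specific steps of the web page data analysis: loading script file code on the web page; receiving the user's input behavior on the web page; reading, through the script file code, the m keywords carried by the input behavior on the web page; obtaining the keywords that have a dependency relationship among the m keywords; and dividing the keywords that have a dependency relationship among the m keywords into the same class of keywords. Through these steps, the keywords input by the user can be read dynamically and the web page data to be analyzed can be obtained accurately and efficiently, which facilitates efficient cluster analysis of user data.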
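FIG. 3 is a flow chart of a third embodiment of the web page data analysis method according to the present invention. This embodiment can serve as a preferred implementation of the embodiment shown in FIG. 1. As shown in FIG. 3, the web page data analysis method includes:
Step S301: obtain m keywords input by a user on a web page.
This step is equivalent to S102 and is not described again here.
In one implementation, this step can be carried out through steps S201, S202 and S204 of the second embodiment above; the details are not repeated.
Step S302: determine a hypothesis condition, wherein the hypothesis condition is an assumed logical relationship contained in the input behavior of the m keywords.
A user who performs a search necessarily has some requirement. Reasonable hypothesis conditions can be proposed according to the actual business needs of the web page data analyst (that is, which aspects of user requirements are of interest). The dependency relationships between keywords are derived from the hypothesis conditions.
For example, if the keyword sequence of a session is A-B-C-D, the hypothesis condition of the method can be to establish the dependency relationships {AD, BD, CD, DD}. Here A, B and C each establish a dependency relationship with D; that is, A and D correspond to the same user requirement (a first user requirement), B and D correspond to the same user requirement (a second user requirement), and C and D correspond to the same user requirement (a third user requirement). Other hypothesis conditions can establish different dependency relationships, such as {AB, BC, CD} or {AB, AC, AD, BC, BD, CD}.
Preferably, the following hypotheses may be proposed: 1. when a user browses the website, the access purpose of a single session is unique; 2. the in-site keywords a user generates within the same session are semantically related; 3. in the process of achieving the access purpose, the user may perform multiple in-site searches, but these searches are self-correcting. Based on these three hypotheses, the following conclusion can be drawn: the keyword used in the last in-site search of a session is the attribution of all keywords in that session. On this basis the dependency relationships between keywords can be made explicit.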
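The three dependency sets mentioned for the session A-B-C-D can be generated by small helper functions, sketched below. The function names are hypothetical; each function corresponds to one possible hypothesis condition and returns (keyword, attribution) pairs.

```python
from itertools import combinations


def last_keyword_attribution(seq):
    """Every keyword in the session depends on the session's last keyword."""
    return [(kw, seq[-1]) for kw in seq]


def adjacent_attribution(seq):
    """Each keyword depends on the keyword searched immediately after it."""
    return list(zip(seq, seq[1:]))


def all_pairs_attribution(seq):
    """Each keyword depends on every keyword searched after it."""
    return list(combinations(seq, 2))


session = ["A", "B", "C", "D"]
print(last_keyword_attribution(session))  # pairs AD, BD, CD, DD
print(adjacent_attribution(session))      # pairs AB, BC, CD
print(all_pairs_attribution(session))     # pairs AB, AC, AD, BC, BD, CD
```

Keeping the hypothesis as a separate function makes it easy for an analyst to swap in a different rule without touching the rest of the pipeline.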
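Step S303: create a graph model {G, S} according to the hypothesis condition, wherein G represents the set of the m keywords and S represents the set of dependency relationships between the m keywords.
A graph model is a figure made up of points (nodes) and lines (edges) that describes a system, namely the relationship between one thing (a node) and another thing (another node) in the system. Optionally, the graph model is a directed graph; a graph model in which every edge has a direction is called a directed graph. In this embodiment, each node of the graph model represents a keyword and each edge represents the dependency relationship between one keyword and another.
According to the hypothesis of step S302, a directed graph {G, S} of the m keywords is constructed, where G is the set of the m keywords in the graph and each keyword can be represented as a node of the graph; S is the set of keyword dependency relationships in the graph, each being an edge connecting two nodes, where the direction of the edge is determined by the dependency relationship of the two nodes and the strength of the edge is determined by the number of times the dependency relationship occurs. Under the preferred hypothesis conditions of step S302, within one session every keyword has an edge pointing to the last keyword of that session.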
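A minimal sketch of building {G, S} under the preferred hypothesis (every keyword points to the last keyword of its session) is given below. The representation of S as a counter keyed by (source, target) pairs, the function name `build_graph` and the sample sessions are assumptions made for illustration.

```python
from collections import Counter


def build_graph(sessions):
    """Return (G, S): the keyword set and edge strengths keyed by (source, target)."""
    nodes = set()
    edges = Counter()
    for seq in sessions:
        if not seq:
            continue
        last = seq[-1]
        for keyword in seq:
            nodes.add(keyword)
            # Edge strength is the number of times the dependency occurs.
            edges[(keyword, last)] += 1
    return nodes, edges


# Hypothetical sessions, each an ordered list of in-site search keywords.
sessions = [["A", "B", "D"], ["B", "D"], ["C", "D"]]
G, S = build_graph(sessions)
print(sorted(G), dict(S))
```

Because each session contributes one edge per keyword, building the graph is linear in the number of logged searches rather than quadratic in the number of keywords.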
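Step S304: obtain, through the graph model, the keywords that have a dependency relationship among the m keywords.
The graph model gives the set of all keywords and of all keyword dependency relationships. According to the actual business needs of the web page data analyst, the multiple keyword groups that each represent the same user requirement can be identified.
The queries are clustered with a simple algorithm that finds communities in the graph model, which avoids traditional clustering algorithms and reduces complexity.
Step S305: divide the keywords that have a dependency relationship among the m keywords into the same class of keywords.
This step is equivalent to S106 and is not described again here.
This embodiment gives the specific steps of the web page data analysis: obtaining m keywords input by a user on a web page; determining a hypothesis condition; creating a graph model {G, S} according to the hypothesis condition; obtaining, through the graph model, the keywords that have a dependency relationship among the m keywords; and dividing the keywords that have a dependency relationship among the m keywords into the same class of keywords. In these steps, because the hypothesis conditions can be formulated according to the different needs of the user, the method applies to a wider range of web page data analysis and can meet the needs of various web page data analysts. At the same time, because the method builds the relationships between keywords from the logical relationships contained in the input behavior on the web page, it can reflect user requirements accurately.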
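FIG. 4 is a flow chart of a fourth embodiment of the web page data analysis method according to the present invention. This embodiment can serve as a preferred implementation of the embodiment shown in FIG. 3. As shown in FIG. 4, the web page data analysis method includes:
Step S401: obtain m keywords input by a user on a web page.
This step is equivalent to step S301 and is not described again here.
Step S403: determine a hypothesis condition, wherein the hypothesis condition is an assumed logical relationship contained in the input behavior of the m keywords.
This is the same as step S302 and is not described again here.
Step S404: create a graph model {G, S} according to the hypothesis condition, wherein G represents the set of the m keywords and S represents the set of dependency relationships between the m keywords.
This is the same as step S303 and is not described again here.
Step S405: calculate transition probabilities according to the strength of the dependency relationships between the m keywords, wherein a transition probability is the probability that a keyword belongs to a keyword with which it has a dependency relationship.
In the graph model, the strength of a dependency relationship can be determined from the number of times it occurs: the more often a dependency relationship is established between two nodes, the stronger the dependency relationship between them. In this embodiment, the more often a dependency relationship is established between two keywords, the stronger the dependency relationship between the two keywords is considered to be. From the strength of the dependency relationships between keywords, the probability that a node depends on each other node, that is, the transition probability, can be calculated. The transition probability is defined as $c(n_i, n_j)/c(n_j)$, where $c(n_i, n_j)$ is the strength of the dependency relationship between the i-th web page data item and the j-th web page data item, and $c(n_j)$ is the sum of the strengths of all dependency relationships of the j-th web page data item, with $i, j \in \{1, 2, \dots, m\}$ and $i \neq j$.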
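The normalization $c(n_i, n_j)/c(n_j)$ can be sketched as below. The reading adopted here is an assumption: $n_j$ is taken to be the node that moves, and $c(n_j)$ is taken to be the total strength of the edges leaving it, so that the probabilities of each node's outgoing edges sum to 1. The source does not spell this out, and other readings of "all dependency relationships" are possible. The function name and the sample edge strengths are hypothetical.

```python
from collections import defaultdict


def transition_probabilities(edges):
    """Map (source, target) -> strength into source -> {target: probability}."""
    totals = defaultdict(int)
    for (source, _target), strength in edges.items():
        totals[source] += strength  # assumed meaning of c(n_j): outgoing total
    probs = defaultdict(dict)
    for (source, target), strength in edges.items():
        probs[source][target] = strength / totals[source]
    return dict(probs)


# Hypothetical edge strengths (number of times each dependency was observed).
edges = {("A", "D"): 3, ("A", "B"): 1, ("B", "D"): 2}
P = transition_probabilities(edges)
print(P)
```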
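Step S406: iterate over the m keywords according to the transition probabilities to obtain the keywords that have a dependency relationship among the m keywords.
The nodes (keywords) are iterated according to the transition probabilities: each node (keyword) moves randomly, with the transition probability, to one of the nodes (keywords) it depends on. After iterating in this way multiple times, the keyword group, representing a single user requirement, to which each node (keyword) belongs after the final iteration is calculated.
Specifically, a label propagation algorithm may be used. It should be noted that the details of the label propagation algorithm are not the focus of this application; that is, any algorithm capable of clustering a graph falls within the scope of protection. Without loss of generality, this application provides the following label propagation algorithm for clustering the nodes of the graph.
In the initial state each node has a unique label; the label may be the keyword of the last in-site search of the session in which the search keyword occurs. For each node, the contribution of every neighbor node pointing to that node toward changing the node's label is calculated. The calculation takes the transition probabilities between nodes as weights and computes a weighted sum over the labels of the neighbor nodes. For example, if node A has neighbor nodes B, C and D with labels x, x and y, whose transition probabilities toward A are 0.2, 0.2 and 0.5 respectively, then the candidate changes node A receives are x (0.4 = 0.2 + 0.2) or y (0.5), and the label of node A is changed to y. When the candidate changes for a node are tied, the node keeps its label if its current label is one of the tied candidates; otherwise one of the tied candidates is chosen at random and the label is replaced.
It should also be noted that, in the method this application may adopt, all nodes of the graph are updated synchronously within one round of label propagation: in one iteration all nodes simultaneously compute the distribution of contributions they receive at that instant and then update their labels. There is no ordering among the label changes of the nodes.
Finally, the above iteration is repeated many times until the labels of all nodes no longer change, at which point the calculation terminates. In practice, however, because the number of nodes in the graph is very large, the iteration is often not run until it finally stops (which would require too many iterations); instead, a number of iterations is preset, and after that many iterations the result at that point is taken as the approximate clustering result.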
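The label propagation procedure described above can be sketched as follows, using the worked example with node A and neighbors B, C and D. This is an illustrative sketch only: the function names, the data layout (`incoming` maps a node to a list of neighbor and weight pairs), the tolerance used to detect ties and the default iteration cap are assumptions.

```python
import random
from collections import defaultdict


def propose_label(current, incoming, labels, rng=random):
    """Pick the label with the largest weighted vote among incoming neighbors."""
    scores = defaultdict(float)
    for neighbor, weight in incoming:
        scores[labels[neighbor]] += weight
    if not scores:
        return current
    best = max(scores.values())
    tied = [label for label, score in scores.items() if abs(score - best) < 1e-12]
    if current in tied:
        return current  # on a tie, keep the current label if it is a candidate
    return rng.choice(tied)


def propagate(labels, incoming_edges, max_iters=20):
    """Synchronous label propagation with a preset cap on the number of rounds."""
    labels = dict(labels)
    for _ in range(max_iters):
        # All nodes are updated from the labels of the previous round.
        updated = {
            node: propose_label(labels[node], incoming_edges.get(node, []), labels)
            for node in labels
        }
        if updated == labels:
            break
        labels = updated
    return labels


labels = {"A": "a", "B": "x", "C": "x", "D": "y"}
incoming = {"A": [("B", 0.2), ("C", 0.2), ("D", 0.5)]}
result = propagate(labels, incoming)
print(result["A"])
```

Here A receives 0.4 for label x and 0.5 for label y, so it adopts y, matching the example in the text; the other nodes have no incoming edges in this toy input and keep their labels.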
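The above random walk is repeated many times, and the decision as to which final keyword group (keyword cluster) each node (keyword) belongs to is obtained according to the law of large numbers. This repetition is necessary because the directed graph constructed at the beginning of the model is a directed cyclic graph, so a node may enter a cycle through the transition probabilities and arrive at a local optimum. Repeating the steps effectively reduces such errors and improves the accuracy of the clustering.
Step S407: divide the keywords that have a dependency relationship among the m keywords into the same class of keywords.
This is the same as step S106 and is not described again here.
This embodiment gives the specific steps of the web page data analysis: obtaining m keywords input by a user on a web page; determining a hypothesis condition; creating a graph model {G, S} according to the hypothesis condition; calculating transition probabilities according to the strength of the dependency relationships between the m keywords; iterating over the m keywords according to the transition probabilities to obtain the keywords that have a dependency relationship among the m keywords; and dividing the keywords that have a dependency relationship among the m keywords into the same class of keywords. In these steps, through iteration, the keywords that belong to the same user requirement gradually gather into one class. This way of clustering is more in line with the real requirements of users, and the clustering result has greater analytical value.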
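FIG. 5 is a flow chart of a fifth embodiment of the web page data analysis method according to the present invention. This embodiment can serve as a preferred implementation of the embodiment shown in FIG. 4. As shown in FIG. 5, the web page data analysis method includes:
Step S501: obtain m keywords input by a user on a web page.
This step is equivalent to step S301 and is not described again here.
Step S503: determine a hypothesis condition, wherein the hypothesis condition is an assumed logical relationship contained in the input behavior of the m keywords.
This is the same as step S302 and is not described again here.
Step S504: create a graph model {G, S} according to the hypothesis condition, wherein G represents the set of the m keywords and S represents the set of dependency relationships between the m keywords.
This is the same as step S303 and is not described again here.
Step S505: calculate transition probabilities according to the strength of the dependency relationships between the m keywords, wherein a transition probability is the probability that a keyword belongs to a keyword with which it has a dependency relationship.
This is the same as step S405 and is not described again here.
Step S506: assume that before the iteration the i-th keyword belongs to the i-th keyword cluster, wherein a cluster is a set of keywords of one class and i = 1, 2, ..., m.
At initialization, all nodes (keywords) in the graph are assumed to belong each to its own keyword cluster (each holds its own single vote), and the iteration starts from every node.
Step S507: perform one iteration on the i-th keyword according to the transition probabilities, and calculate the k-th keyword cluster to which the i-th keyword belongs after the iteration, wherein $k \in \{1, 2, \dots, i-1, i+1, \dots, m\}$.
During the iteration, each node moves randomly, with the transition probability, to one of the nodes it depends on. As the iteration proceeds, more and more keywords embodying the same user requirement gather together, until the keyword cluster covers all of the m keywords that share that dependency relationship.
Step S508: judge whether the difference between the i-th cluster and the k-th cluster is less than a preset value, wherein the preset value is a preset permitted error of a keyword cluster.
The preset value can be set according to the needs of the individual data analyst.
Each time an iteration yields the keyword cluster to which a node belongs, that cluster is compared with the keyword cluster the node belonged to before the iteration, and the difference between the current keyword cluster and the previous keyword cluster is judged. This difference is defined as: difference = number of nodes whose keyword cluster changed in this iteration / total number of nodes ($\mathit{diff} = n_{\mathrm{change}}/N$). Through this judgment step, the i-th keyword is driven step by step toward the keyword cluster it belongs to.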
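The difference measure and the stopping rule can be sketched as below. The function names, the `step` callback (which stands in for one iteration of whichever propagation rule is used) and the iteration cap are assumptions made for illustration.

```python
def cluster_difference(previous, current):
    """Share of nodes whose cluster changed in this iteration: n_change / N."""
    if not previous:
        return 0.0
    n_change = sum(1 for node in previous if previous[node] != current[node])
    return n_change / len(previous)


def iterate_until_stable(assignment, step, threshold, max_iters=100):
    """Apply `step` until the difference is at most `threshold` or the cap is hit."""
    for _ in range(max_iters):
        updated = step(assignment)
        diff = cluster_difference(assignment, updated)
        assignment = updated
        if diff <= threshold:  # less than or equal to the preset value: stop
            break
    return assignment


before = {"A": "A", "B": "B", "C": "C", "D": "D"}
after = {"A": "D", "B": "D", "C": "C", "D": "D"}
diff = cluster_difference(before, after)  # two of four nodes changed cluster
print(diff)

# Toy step that sends every node to cluster "D", used only to exercise the loop.
final = iterate_until_stable(before, lambda a: {n: "D" for n in a}, threshold=0.1)
print(final)
```

A larger threshold stops earlier with a coarser result, while a threshold of 0 runs until no node changes cluster or the iteration cap is reached.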
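Step S509: if the difference between the i-th cluster and the k-th cluster is greater than the preset value, continue the iteration.
This step repeats step S507. When the difference between the i-th keyword cluster and the k-th keyword cluster is greater than the preset value, the keywords embodying the same user requirement have not yet been completely covered by the keyword cluster, and the iteration needs to continue.
Step S510: if the difference between the i-th keyword cluster and the k-th keyword cluster is less than or equal to the preset value, stop the iteration and obtain all keywords in the keyword cluster to which the i-th keyword belongs.
When the difference between the i-th keyword cluster and the k-th keyword cluster is less than or equal to the preset value, the keywords embodying the same user requirement can be considered completely covered by the keyword cluster.
Optionally, the number of iterations can also be set according to the analysis needs of the data analyst. Once the preset number of iterations is completed, all keywords in the keyword cluster to which the i-th keyword belongs are obtained.
Step S511: divide the keywords that have a dependency relationship among the m keywords into the same class of keywords.
This is the same as step S106 and is not described again here.
This embodiment gives the specific steps of the web page data analysis: obtaining m keywords input by a user on a web page; determining a hypothesis condition; creating a graph model {G, S} according to the hypothesis condition; calculating transition probabilities according to the strength of the dependency relationships between the m keywords; assuming that before the iteration the i-th keyword belongs to the i-th keyword cluster; performing one iteration on the i-th keyword according to the transition probabilities and calculating the k-th keyword cluster to which the i-th keyword belongs after the iteration; judging whether the difference between the i-th cluster and the k-th cluster is less than a preset value; if the difference is greater than the preset value, continuing the iteration; if the difference is less than or equal to the preset value, stopping the iteration and obtaining all keywords in the keyword cluster to which the i-th keyword belongs; and dividing the keywords that have a dependency relationship among the m keywords into the same class of keywords. In these steps, because the preset value, that is, the permitted error range of a keyword cluster, can be set according to the user's needs, the method can meet the needs of different data analysts and its range of application becomes wider. At the same time, repeating the iteration many times also makes the clustering result more accurate.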
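FIG. 6 is a flow chart of a sixth embodiment of the web page data analysis method according to the present invention. This embodiment can serve as a preferred implementation of the embodiment shown in FIG. 1. As shown in FIG. 6, the web page data analysis method includes:
Step S601: obtain m keywords input by a user on a web page.
This step is equivalent to step S102 and is not described again here.
In one implementation, this step can be carried out through steps S201, S202 and S204 of the second embodiment above; the details are not repeated.
Step S602: obtain the keywords that have a dependency relationship among the m keywords, wherein keywords corresponding to the same user requirement have a dependency relationship.
This step is the same as step S104 and is not described again here.
In one implementation, this step can be carried out through steps S503 to S510 of the fifth embodiment above; the details are not repeated.
Step S603: divide the keywords that have a dependency relationship among the m keywords into the same class of keywords.
This step is the same as step S106 and is not described again here.
Step S604: name each of the multiple classes of keywords.
Because each class of keywords reflects a different user requirement, the classes of keywords obtained can be named in order to describe the user requirements.
Preferably, the naming method may be a rule-based naming method or a statistics-based naming method, or the two may be combined into a hybrid naming method. Naming methods for a class of keywords include, but are not limited to: naming according to behavior such as the number of user searches or the number of clicks on search results, selecting the highest-ranked keyword as the name; and performing maximum likelihood estimation on the aggregation points reached when the graph model converges, taking the keyword on which the nodes converge as the name.
Step S605: sort the named classes of keywords according to the number of keywords each class contains.
Sorting means ordering the classes of keywords by a statistic: the higher the statistic of a class of keywords (keyword cluster), the stronger the corresponding user requirement. Preferably, commonly used statistics include the number of searches for the keywords in the cluster and the number of sessions to which the keywords in the cluster belong.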
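A minimal sketch of one naming-and-sorting combination is shown below: each cluster is named after its most-searched keyword (a statistics-based rule), and the clusters are ordered by the number of keywords they contain. The function name, the sample clusters and the search counts are hypothetical.

```python
def name_and_rank(clusters, search_counts):
    """Name each cluster by its most-searched keyword, then sort by cluster size."""
    named = []
    for keywords in clusters.values():
        name = max(keywords, key=lambda kw: search_counts.get(kw, 0))
        named.append((name, sorted(keywords)))
    # Larger clusters first, as in step S605.
    named.sort(key=lambda item: len(item[1]), reverse=True)
    return named


# Hypothetical clustering output and per-keyword search counts.
clusters = {
    1: ["refund", "return policy"],
    2: ["violation", "electronic eye", "speed camera"],
}
search_counts = {
    "violation": 40, "electronic eye": 25, "speed camera": 5,
    "refund": 30, "return policy": 10,
}
ranking = name_and_rank(clusters, search_counts)
print(ranking)
```

Swapping the sort key for the total search count or the number of sessions per cluster would give the other statistics mentioned in the text.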
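This embodiment gives the specific steps of the web page data analysis: obtaining m keywords input by a user on a web page; obtaining the keywords that have a dependency relationship among the m keywords; dividing the keywords that have a dependency relationship among the m keywords into the same class of keywords; naming each of the multiple classes of keywords; and sorting the named classes of keywords according to the number of keywords each class contains. Through these steps, each class of keywords obtained by clustering is named and the classes are sorted by the number of keywords they contain, which shows the search popularity of each class of data more clearly and presents the result more intuitively to the web page data analyst.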
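FIG. 7 is a structural block diagram of a first embodiment of the web page data analysis apparatus according to the present invention. As shown in FIG. 7, the apparatus includes:
A first obtaining unit 22 for obtaining m keywords input by a user on a web page.
In principle there is a many-to-many dependency relationship between user requirements and the keywords users input: each user requirement can be expressed through different keywords input by the user, and each keyword can express several different user requirements. To simplify the problem, the relationship between user requirements and input keywords is defined as one-to-many. The apparatus can identify user requirements by clustering the keywords that users input on the website.
A second obtaining unit 24 for obtaining the keywords that have a dependency relationship among the m keywords, wherein keywords corresponding to the same user requirement have a dependency relationship.
Within a single web page data search by a user, the keywords searched are often related. This relationship is not literal similarity between the keywords, but the fact that the keywords express the same user requirement. For example, when a user searches web page data, the following dependency relationships may exist between the keywords searched: a keyword is the attribution of the preceding keyword ($k_i = f(k_{i-1})$), or a later keyword is the attribution of all preceding keywords ($k_i = f(k_{i-1}, k_{i-2}, k_{i-3}, \dots, k_1)$), and so on. This unit obtains the keywords that have a dependency relationship among the m keywords.
A dividing unit 26 for dividing the keywords that have a dependency relationship among the m keywords into the same class of keywords.
Because a class of keywords with a dependency relationship corresponds to the same user requirement, the keywords input by users can be divided into several classes according to the dependency relationships. The clustering implemented by the apparatus can mine deep keyword aggregation relationships and thus represent user requirements accurately. For example, the apparatus can discover the relationship between "违章" (violation), "电子眼" (electronic eye), "电子jin" and "电子敬察".
The web page data analysis apparatus provided by this embodiment includes the first obtaining unit 22, the second obtaining unit 24 and the dividing unit 26. With this apparatus, the web page data analysis is based on the dependency relationships between keywords determined by user requirements and no longer depends one-sidedly on the degree of literal overlap between keywords. The apparatus overcomes the limitation of the traditional query aggregation process, which rests on the assumption that queries must match literally, and mines user behavior data so that the clustering result reflects user requirements more accurately.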
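FIG. 8 is a structural block diagram of a second embodiment of the web page data analysis apparatus according to the present invention. This embodiment can serve as a preferred implementation of the embodiment shown in FIG. 7. As shown in FIG. 8, the web page data analysis apparatus includes:
The first obtaining unit 22, the second obtaining unit 24 and the dividing unit 26, wherein the second obtaining unit 24 and the dividing unit 26 are the same as described for FIG. 7 and are not described again here. The first obtaining unit 22 may further include:
A loading module 32 for loading script file code on the web page.
A script file is similar to a batch file in the DOS operating system: it combines different commands and executes them automatically and continuously in a determined order. Compared with general program development, script programs are closer to natural language and are interpreted rather than compiled.
There are many scripting languages. The execution of a general scripting language depends only on the specific interpreter, so a script can be cross-platform as long as the system has an interpreter for the language. Preferably, the module may use JavaScript: JavaScript code is added to the website to obtain behavior data while the user browses web pages.
A receiving module 34 for receiving the user's input behavior on the web page.
When the user searches within the website, the receiving module 34 can receive the data input, which is monitored and read dynamically through the JavaScript code.
A reading module 36 for reading, through the script file code, the m keywords carried by the input behavior on the web page.
The in-site searches a user performs in one session form a sequence of in-site searches, expressed as [Keyword1, Keyword2, Keyword3, ...]. Representing each session by a unique key yields data in the following format:
Session Keyword
1 Keyword1
1 Keyword2
2 Keyword2
The data includes but is not limited to the two columns session and keyword, and may also contain more dimensions such as search time and number of searches, in order to improve the performance of the clustering.
In the web page data analysis apparatus provided by this embodiment, the first obtaining unit 22 may further include the following modules: the loading module 32, the receiving module 34 and the reading module 36. Through these modules, the keywords input by the user can be read dynamically and the web page data to be analyzed can be obtained accurately and efficiently, which facilitates efficient cluster analysis of user data.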
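FIG. 9 is a structural block diagram of a third embodiment of the web page data analysis apparatus according to the present invention. This embodiment may serve as a preferred implementation of the embodiment shown in FIG. 7. As shown in FIG. 9, the web page data analysis apparatus includes:
The first acquisition unit 22, the second acquisition unit 24 and the classification unit 26, where the first acquisition unit 22 and the classification unit 26 are the same as in FIG. 7 and are not described again here. The second acquisition unit 24 may further include:
A first determination module 42, configured to determine assumption conditions, where the assumption conditions are the logical relationships assumed to be contained in the input behaviors of the m keywords.
A user's search behavior necessarily stems from a user need. The first determination module 42 obtains reasonable assumption conditions proposed by the web page data analyst according to the analyst's own business needs. Note that the assumption conditions express the dependency relationships between keywords.
For example, if the keyword sequence of a session is A-B-C-D, an assumption condition may establish the dependency relationships {AD, BD, CD, DD}. Other assumption conditions may establish different dependency relationships, such as {AB, BC, CD} or {AB, AC, AD, BC, BD, CD}.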
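The three example dependency sets for the sequence A-B-C-D can be reproduced with a few lines of code. The sketch below is illustrative only; the function names are hypothetical and each function encodes one possible assumption condition:

```python
from itertools import combinations


def last_keyword_pairs(seq):
    """Every keyword depends on the session's last keyword: {AD, BD, CD, DD}."""
    return [(kw, seq[-1]) for kw in seq]


def adjacent_pairs(seq):
    """Each keyword depends on the keyword that follows it: {AB, BC, CD}."""
    return list(zip(seq, seq[1:]))


def all_later_pairs(seq):
    """Each keyword depends on every later keyword: {AB, AC, AD, BC, BD, CD}."""
    return list(combinations(seq, 2))


session = ["A", "B", "C", "D"]
pairs_last = last_keyword_pairs(session)
pairs_adjacent = adjacent_pairs(session)
pairs_all = all_later_pairs(session)
```

Each pair is written as (dependent keyword, keyword it depends on), matching the order of the letters in the text.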
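Preferably, the assumption conditions may be the following: 1. when a user browses a website, a single session has a single visit purpose; 2. the on-site keywords a user produces within the same session are semantically related; 3. in the course of achieving the visit purpose, the user may perform several on-site searches, but these searches are self-correcting. From these three assumptions it follows that the keyword used in the last on-site search of a session is the keyword to which all keywords of that session belong. On this basis the dependency relationships between keywords can be determined.
A creation module 44, configured to create a graph model {G,S} according to the assumption conditions, where G represents the set of the m keywords and S represents the set of dependency relationships between the m keywords.
According to the assumption conditions determined by the first determination module 42, the creation module 44 builds a directed graph {G,S} of the m keywords. G is the set of the m keywords in the graph, each keyword being one node. S is the set of keyword dependency relationships, each being an edge connecting two nodes; the direction of an edge is determined by the dependency between the two nodes, and the strength of an edge is determined by the number of times that dependency occurs. Under the preferred assumption conditions described above, every keyword in a session has an edge pointing to the last keyword of that session.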
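A minimal sketch of building such a graph under the preferred assumption (every keyword of a session points to the session's last keyword) might look as follows. The name `build_graph` and the representation of S as a counter keyed by (source, target) pairs are assumptions made for illustration:

```python
from collections import Counter


def build_graph(sessions):
    """Return (G, S): G is the set of keywords, S maps (src, dst) to edge strength."""
    nodes = set()
    edges = Counter()
    for seq in sessions:
        if not seq:
            continue  # an empty session contributes nothing
        nodes.update(seq)
        last = seq[-1]
        for kw in seq:
            # Edge strength counts how often the dependency was observed.
            edges[(kw, last)] += 1
    return nodes, edges


sessions = [["A", "B", "D"], ["B", "D"], ["C", "D"]]
G, S = build_graph(sessions)
```

Here the dependency B→D is observed in two sessions, so its strength is 2, while A→D and C→D each have strength 1.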
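A first acquisition module 46, configured to acquire, through the graph model, the keywords among the m keywords that have a dependency relationship.
The graph model gives the set of all keywords and the set of keyword dependency relationships. According to the analyst's actual business needs, the first acquisition module 46 can identify the keyword groups, each of which represents one user need.
The first acquisition module 46 clusters queries with a simple community-finding algorithm on the graph model, avoiding conventional clustering algorithms, with a complexity of O(n lg n).
In the web page data analysis apparatus provided by this embodiment, the second acquisition unit 24 may further include the first determination module 42, the creation module 44 and the first acquisition module 46. Because the first determination module 42 allows assumption conditions to be drawn up freely according to different user needs, the apparatus is applicable to an extremely wide range of web page data analysis and can meet the needs of all kinds of web page data analysis users. Moreover, because the apparatus establishes the links between keywords from the logical relationships contained in the input behaviors on the web page, it reflects user needs accurately.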
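The following is a fourth embodiment of the web page data analysis apparatus according to the present invention. This embodiment may serve as a preferred implementation of the third embodiment of the apparatus. The web page data analysis apparatus includes:
The first acquisition unit 22, the second acquisition unit 24 and the classification unit 26, where the second acquisition unit 24 includes the first determination module 42, the creation module 44 and the first acquisition module 46. Except for the first acquisition module 46, the units and modules are the same as described in the preceding embodiments and are not described again here. The first acquisition module 46 may further include:
A calculation module, configured to calculate transition probabilities according to the strengths of the dependency relationships between the m keywords, where a transition probability is the probability that a keyword belongs to a keyword with which it has a dependency relationship.
From the strengths of the dependency relationships between keywords, the calculation module calculates the probability that a node depends on each of the other nodes, i.e. the transition probability. The transition probability can be defined as c(ni,nj)/c(nj), where c(ni,nj) is the strength of the dependency relationship between the i-th and the j-th web page data items (keywords), c(nj) is the sum of the strengths of all dependency relationships of the j-th web page data item, i,j∈{1,2…m} and i≠j.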
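The normalization can be sketched as follows. This reading is an assumption: the node that moves plays the role of nj, and c(nj) is taken as the total strength of its outgoing edges, so that the probabilities leaving each node sum to 1. The function name and the (source, target) edge keys are illustrative:

```python
from collections import defaultdict


def transition_probabilities(edges):
    """Normalize edge strengths so each node's outgoing probabilities sum to 1."""
    totals = defaultdict(float)
    for (src, _dst), strength in edges.items():
        totals[src] += strength  # c(n_j): total strength leaving src (assumed reading)
    probs = defaultdict(dict)
    for (src, dst), strength in edges.items():
        probs[src][dst] = strength / totals[src]
    return dict(probs)


# Hypothetical edge strengths: A depends on D three times and on B once.
edges = {("A", "D"): 3, ("A", "B"): 1, ("B", "D"): 2}
P = transition_probabilities(edges)
```

With these strengths, a walk leaving A goes to D with probability 0.75 and to B with probability 0.25.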
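A second acquisition module, configured to iterate over the m keywords according to the transition probabilities and to acquire the keywords among the m keywords that have a dependency relationship.
The module iterates over the nodes (keywords) according to the transition probabilities: each node (keyword) moves at random, with the transition probability, to a node (keyword) on which it depends. After multiple iterations, the module outputs, for each node (keyword), the keyword group of the same user need to which it finally belongs.
In this module the random walk described above is repeated many times, and by the law of large numbers a final decision is obtained as to which keyword group (keyword cluster) each node (keyword) belongs. The repetition is necessary because the directed graph built at the start of the model may contain cycles, so a node may enter a cycle through the transition probabilities and end in a local optimum. Repeating the procedure effectively reduces such errors and improves the accuracy of the clustering result obtained by the module.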
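One way to picture the repeated random walk is the sketch below: each keyword is walked along its dependency edges many times, and the most frequent end point is taken as its cluster. This is an illustrative reading rather than the exact procedure of the disclosure; the parameters `steps`, `runs` and `seed` and both function names are hypothetical:

```python
import random
from collections import Counter


def walk(start, probs, rng, steps=10):
    """Follow dependency edges at random for a fixed number of steps."""
    node = start
    for _ in range(steps):
        targets = probs.get(node)
        if not targets:
            break  # no outgoing dependency: the walk stops here
        names = list(targets)
        weights = [targets[name] for name in names]
        node = rng.choices(names, weights=weights, k=1)[0]
    return node


def assign_clusters(probs, nodes, runs=200, seed=0):
    """Repeat the walk and assign each node to its most frequent end point."""
    rng = random.Random(seed)
    result = {}
    for node in nodes:
        ends = Counter(walk(node, probs, rng) for _ in range(runs))
        result[node] = ends.most_common(1)[0][0]
    return result


# Deterministic toy graph: A and B depend on D; X depends on Y.
probs = {"A": {"D": 1.0}, "B": {"D": 1.0}, "D": {"D": 1.0},
         "X": {"Y": 1.0}, "Y": {"Y": 1.0}}
clusters = assign_clusters(probs, ["A", "B", "D", "X", "Y"])
```

Taking the majority over many runs is what damps the effect of a single walk that happens to get caught in a cycle.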
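In the web page data analysis apparatus provided by this embodiment, the first acquisition module 46 may further include the calculation module and the second acquisition module. Because the second acquisition module works iteratively, the keywords under analysis that belong to the same user need gradually gather into one class. This way of clustering matches users' real needs better, and the clustering result has more analytical value.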
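The following is a fifth embodiment of the web page data analysis apparatus according to the present invention. This embodiment may serve as a preferred implementation of the fourth embodiment of the apparatus. The web page data analysis apparatus includes:
The first acquisition unit 22, the second acquisition unit 24 and the classification unit 26, where the second acquisition unit 24 includes the first determination module 42, the creation module 44 and the first acquisition module 46, and the first acquisition module 46 further includes the calculation module and the second acquisition module. Except for the second acquisition module, the units and modules are the same as described in the preceding embodiments and are not described again here. The second acquisition module may further include:
An assumption submodule, configured to assume that, before the iteration, the i-th keyword belongs to the i-th keyword cluster, where a cluster is a set of keywords of one class and i=1,2...m.
At initialization, the assumption submodule assumes that every node (keyword) in the graph belongs to a keyword cluster of its own (each node holds one vote of its own).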
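A calculation submodule, configured to perform one iteration on the i-th keyword according to the transition probabilities and to calculate the k-th keyword cluster to which the i-th keyword belongs after the iteration, where k∈{1,2...i-1,i+1...m}.
While this submodule iterates, each node moves at random, with the transition probability, to a node on which it depends. As the iteration proceeds, more and more keywords reflecting the same user need gather together, until the keyword cluster covers all of the m keywords that have that dependency relationship.
A judgment submodule, configured to judge whether the difference between the i-th cluster and the k-th cluster is less than a preset value, where the preset value is a preset permissible error for the keyword clusters.
The preset value can be set according to the needs of each data analyst and is input into this submodule.
Each time an iteration yields the keyword cluster to which a node belongs, that cluster is compared with the cluster to which the node belonged before the iteration. The submodule then judges the difference between the current keyword cluster and the previous one, where the difference is defined as: difference = number of nodes whose keyword cluster changed in this iteration / total number of nodes (diff=nchange/N). Through this judgment, the i-th keyword approaches, step by step, the keyword cluster to which it belongs.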
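The difference measure diff=nchange/N can be written down directly. In this sketch the cluster assignments before and after an iteration are assumed to be dictionaries mapping each keyword to its cluster identifier; the function name is illustrative:

```python
def cluster_diff(before, after):
    """Fraction of nodes whose cluster changed in this iteration (n_change / N)."""
    if not before:
        return 0.0  # no nodes, so nothing can have changed
    changed = sum(1 for node in before if before[node] != after[node])
    return changed / len(before)


# Four keywords start in their own clusters; A and B then move to D's cluster.
before = {"A": "A", "B": "B", "C": "C", "D": "D"}
after = {"A": "D", "B": "D", "C": "C", "D": "D"}
diff = cluster_diff(before, after)
```

Two of the four nodes changed cluster here, so the difference is 0.5.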
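An iteration submodule, configured to continue iterating if the difference between the i-th cluster and the k-th cluster is greater than the preset value.
When the judgment submodule determines that the difference between the i-th keyword cluster and the k-th keyword cluster is greater than the preset value, the keywords reflecting the same user need have not yet all been included in the keyword cluster, and the iteration submodule continues iterating.
An acquisition submodule, configured to stop iterating if the difference between the i-th cluster and the k-th cluster is less than or equal to the preset value, and to acquire all keywords in the keyword cluster to which the i-th keyword belongs.
When the judgment submodule determines that the difference between the i-th keyword cluster and the k-th keyword cluster is less than or equal to the preset value, the keywords reflecting the same user need can be considered fully included in the keyword cluster.
Optionally, the judgment submodule may instead judge the number of iterations, according to the data analyst's needs. When it determines that the preset number of iterations has been completed, the acquisition submodule acquires all keywords in the keyword cluster to which the i-th keyword belongs.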
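Taken together, the submodules describe a loop with two stopping rules: the difference falling to the preset value, or a preset number of iterations being reached. The sketch below is one possible arrangement under stated assumptions: `probs[a][b]` is an assumed transition probability from keyword a to keyword b, and `threshold`, `max_iter` and `seed` are hypothetical parameter names:

```python
import random


def iterate_clusters(probs, nodes, threshold=0.0, max_iter=50, seed=0):
    """Move every keyword along its dependencies until few assignments change.

    Returns (cluster, iterations), where cluster maps keyword -> cluster id.
    """
    rng = random.Random(seed)
    cluster = {n: n for n in nodes}  # initially keyword i is in its own cluster i
    iterations = 0
    if not nodes:
        return cluster, iterations
    while iterations < max_iter:  # optional stop on a preset iteration count
        iterations += 1
        updated = {}
        for n in nodes:
            targets = probs.get(cluster[n])
            if targets:
                names = list(targets)
                weights = [targets[t] for t in names]
                updated[n] = rng.choices(names, weights=weights, k=1)[0]
            else:
                updated[n] = cluster[n]  # no dependency: stay in place
        diff = sum(updated[n] != cluster[n] for n in nodes) / len(nodes)
        cluster = updated
        if diff <= threshold:  # stop once the difference reaches the preset value
            break
    return cluster, iterations


probs = {"A": {"B": 1.0}, "B": {"D": 1.0}, "D": {"D": 1.0}, "X": {"X": 1.0}}
clusters, used = iterate_clusters(probs, ["A", "B", "D", "X"])
```

In this toy graph A reaches D through B, so the loop needs three iterations before no assignment changes; a smaller `max_iter` cuts the process short, which corresponds to the optional iteration-count rule.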
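In the web page data analysis apparatus provided by this embodiment, the second acquisition module may further include the assumption submodule, the calculation submodule, the judgment submodule, the iteration submodule and the acquisition submodule. Because the judgment submodule allows the preset value, i.e. the error range of the clusters, to be set according to user needs, the apparatus can meet the needs of different data analysts and its range of application is widened. The repeated iterations performed by the iteration submodule also make the final clustering result more accurate.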
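FIG. 10 is a structural block diagram of a sixth embodiment of the web page data analysis apparatus according to the present invention. This embodiment may serve as a preferred implementation of the embodiment shown in FIG. 7. As shown in FIG. 10, the web page data analysis apparatus includes:
The first acquisition unit 22, the second acquisition unit 24, the classification unit 26, a naming unit 28 and a ranking unit 30, where the first acquisition unit 22, the second acquisition unit 24 and the classification unit 26 are the same as in FIG. 7 and are not described again here. The naming unit 28 and the ranking unit 30 are as follows:
The naming unit 28, configured to name each of the multiple classes of keywords.
Because each class of keywords reflects a different user need, the naming unit 28 can name the resulting classes of keywords in order to describe the user needs.
Preferably, the naming method may be rule-based or statistics-based, or the two may be combined into a hybrid naming method. Methods for naming a class of keywords include, but are not limited to: naming according to behaviors such as the number of user searches or the number of clicks on search results, selecting a highly ranked keyword as the name; and performing maximum likelihood estimation on the aggregation points at which the graph model converges, taking the keyword on which convergence concentrates as the name.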
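As an illustration of the statistics-based variant, the sketch below names a cluster after its most-searched keyword. The alphabetical tie-break, the function name and the sample counts are assumptions made for the example:

```python
def name_cluster(keywords, search_counts):
    """Name a cluster after its most-searched keyword (ties broken alphabetically)."""
    return min(keywords, key=lambda kw: (-search_counts.get(kw, 0), kw))


# Hypothetical cluster and search counts.
cluster = ["speed camera", "traffic violation", "red light"]
search_counts = {"traffic violation": 10, "speed camera": 5}
name = name_cluster(cluster, search_counts)
```

A rule-based method, or the convergence-point method mentioned above, could replace the key function without changing the rest of the pipeline.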
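The ranking unit 30, configured to rank the named classes of keywords according to the number of keywords each class contains.
Ranking means sorting the classes of keywords by a statistic computed for each class; the higher the statistic of a class of keywords (a keyword cluster), the stronger the corresponding user need. Preferably, commonly used statistics include the number of searches for the keywords in a cluster and the number of sessions to which the keywords in a cluster belong.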
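Ranking by a per-cluster statistic can be sketched with a pluggable scoring function, so that the same code covers ranking by total search count and ranking by the number of keywords in the cluster. The names and sample data below are hypothetical:

```python
def rank_clusters(clusters, statistic):
    """Return cluster names sorted by statistic(keywords), highest first."""
    scores = {name: statistic(kws) for name, kws in clusters.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))


clusters = {
    "traffic violation": ["traffic violation", "speed camera", "red light"],
    "license": ["license renewal", "license"],
}
search_counts = {"traffic violation": 10, "speed camera": 5, "red light": 1,
                 "license renewal": 30, "license": 2}

# Rank by total searches within the cluster, then by cluster size.
by_searches = rank_clusters(clusters, lambda kws: sum(search_counts.get(k, 0) for k in kws))
by_size = rank_clusters(clusters, len)
```

The two statistics can disagree, as they do here, which is why the choice of statistic is left to the analyst.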
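This embodiment provides units that the web page data analysis apparatus may further include: the naming unit 28 and the ranking unit 30. The naming unit 28 names each cluster obtained by clustering, and the ranking unit 30 ranks the clusters by the number of keywords each contains, so that the clusters show the search popularity of each class of data more clearly and the clustering result is presented more intuitively to the web page data analyst.
It should be noted that, in the above embodiments of the web page data analysis apparatus of the present invention, each unit and module may run as part of the apparatus in a mobile terminal, a computer terminal or a similar computing device. The functions implemented by the units and modules may be executed by a processor in such a device, or the units and modules may be stored as part of a storage medium. The mobile terminal, computer terminal or similar computing device may be a terminal device such as a smartphone (e.g. an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (Mobile Internet Devices, MID) or a PAD.
Accordingly, an embodiment of the present invention may provide a computer terminal, which may be any computer terminal device in a group of computer terminals. Optionally, in this embodiment of the present invention, the computer terminal may also be replaced by a terminal device such as a mobile terminal.
Optionally, in this embodiment of the present invention, the computer terminal may be located in at least one of multiple network devices of a computer network.
In this embodiment of the present invention, the computer terminal may execute program code for the following steps of the web page data analysis method: acquiring m keywords entered by a user on a web page; acquiring the keywords among the m keywords that have a dependency relationship, where keywords corresponding to the same user need have the dependency relationship; and classifying the keywords among the m keywords that have a dependency relationship into the same class of keywords.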
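Optionally, the computer terminal may include one or more processors, a memory and a transmission device.
The memory may be used to store software programs and modules, such as the program instructions/modules corresponding to the web page data analysis method and apparatus in the embodiments of the present invention. By running the software programs and modules stored in the memory, the processor executes various functional applications and data processing, i.e. implements the web page data analysis method described above. The memory may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, and such remote memory may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
The transmission device is used to receive or send data via a network. Specific examples of the network include wired networks and wireless networks. In one example, the transmission device includes a network adapter (Network Interface Controller, NIC), which can be connected by a network cable to other network devices and a router so as to communicate with the Internet or a local area network. In another example, the transmission device is a radio frequency (Radio Frequency, RF) module, which communicates with the Internet wirelessly.
Specifically, the memory is used to store information on preset action conditions and preset authorized users, as well as application programs.
The processor may call, through the transmission device, the information and application programs stored in the memory, so as to execute the program code of the method steps of each optional or preferred embodiment among the above method embodiments.
A person of ordinary skill in the art will understand that the computer terminal may also be a terminal device such as a smartphone (e.g. an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (Mobile Internet Devices, MID) or a PAD.
A person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments can be completed by a program instructing hardware associated with a terminal device. The program may be stored in a computer-readable storage medium, which may include a flash drive, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, or the like.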
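An embodiment of the present invention further provides a storage medium. Optionally, in this embodiment of the present invention, the storage medium may be used to store the program code executed by the web page data analysis method provided in the above method embodiments and apparatus embodiments.
Optionally, in this embodiment of the present invention, the storage medium may be located in any computer terminal of a group of computer terminals in a computer network, or in any mobile terminal of a group of mobile terminals.
Optionally, in this embodiment of the present invention, the storage medium is configured to store program code for executing the following steps: acquiring m keywords entered by a user on a web page; acquiring the keywords among the m keywords that have a dependency relationship, where keywords corresponding to the same user need have the dependency relationship; and classifying the keywords among the m keywords that have a dependency relationship into the same class of keywords.
Optionally, in this embodiment, the storage medium may further be configured to store program code for the various preferred or optional method steps provided by the web page data analysis method.
The web page data analysis method and apparatus according to the present invention have been described above by way of example with reference to the accompanying drawings. However, those skilled in the art should understand that various improvements can be made to the web page data analysis method and apparatus proposed by the present invention without departing from the content of the present invention. Therefore, the scope of protection of the present invention should be determined by the content of the appended claims.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented by a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into an individual integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the system embodiments are basically similar to the method embodiments, they are described relatively briefly, and the relevant parts of the description of the method embodiments may be consulted.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.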

Claims (12)

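  1. A web page data analysis method, characterized by comprising:
    acquiring m keywords entered by a user on a web page;
    acquiring the keywords among the m keywords that have a dependency relationship, wherein keywords corresponding to the same user need have the dependency relationship; and
    classifying the keywords among the m keywords that have a dependency relationship into the same class of keywords.
  2. The method according to claim 1, characterized in that acquiring the m keywords entered by the user on the web page comprises:
    loading script file code in the web page;
    receiving an input behavior of the user on the web page; and
    reading, through the script file code, the m keywords carried by the input behavior on the web page.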
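  3. The method according to claim 1, characterized in that acquiring the keywords among the m keywords that have a dependency relationship comprises:
    determining assumption conditions, wherein the assumption conditions are the logical relationships assumed to be contained in the input behaviors of the m keywords;
    creating a graph model {G,S} according to the assumption conditions, wherein G represents the set of the m keywords and S represents the set of dependency relationships between the m keywords; and
    acquiring, through the graph model, the keywords among the m keywords that have a dependency relationship.
  4. The method according to claim 3, characterized in that acquiring, through the graph model, the keywords among the m keywords that have a dependency relationship comprises:
    calculating transition probabilities according to the strengths of the dependency relationships between the m keywords, wherein a transition probability is the probability that a keyword belongs to a keyword with which it has the dependency relationship; and
    iterating over the m keywords according to the transition probabilities, and acquiring the keywords among the m keywords that have a dependency relationship.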
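  5. The method according to claim 4, characterized in that iterating over the m keywords according to the transition probabilities and acquiring the keywords among the m keywords that have a dependency relationship comprises:
    assuming that, before the iteration, the i-th keyword belongs to the i-th keyword cluster, wherein a cluster is a set of keywords of one class and i=1,2...m;
    performing one iteration on the i-th keyword according to the transition probabilities, and calculating the k-th keyword cluster to which the i-th keyword belongs after the iteration, wherein k∈{1,2...i-1,i+1...m};
    judging whether the difference between the i-th cluster and the k-th cluster is less than a preset value, wherein the preset value is a preset permissible error for the keyword clusters;
    if the difference between the i-th cluster and the k-th cluster is greater than the preset value, continuing to iterate; and
    if the difference between the i-th cluster and the k-th cluster is less than or equal to the preset value, stopping the iteration and acquiring all keywords in the keyword cluster to which the i-th keyword belongs.
  6. The method according to claim 1, characterized in that, after classifying the keywords among the m keywords that have a dependency relationship into the same class of keywords, the method further comprises:
    naming each of the multiple classes of keywords; and
    ranking the named classes of keywords according to the number of keywords each class contains.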
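  7. A web page data analysis apparatus, characterized by comprising:
    a first acquisition unit, configured to acquire m keywords entered by a user on a web page;
    a second acquisition unit, configured to acquire the keywords among the m keywords that have a dependency relationship, wherein keywords corresponding to the same user need have the dependency relationship; and
    a classification unit, configured to classify the keywords among the m keywords that have a dependency relationship into the same class of keywords.
  8. The apparatus according to claim 7, characterized in that the first acquisition unit comprises:
    a loading module, configured to load script file code in the web page;
    a receiving module, configured to receive an input behavior of the user on the web page; and
    a reading module, configured to read, through the script file code, the m keywords carried by the input behavior on the web page.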
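  9. The apparatus according to claim 7, characterized in that the second acquisition unit comprises:
    a first determination module, configured to determine assumption conditions, wherein the assumption conditions are the logical relationships assumed to be contained in the input behaviors of the m keywords;
    a creation module, configured to create a graph model {G,S} according to the assumption conditions, wherein G represents the set of the m keywords and S represents the set of dependency relationships between the m keywords; and
    a first acquisition module, configured to acquire, through the graph model, the keywords among the m keywords that have a dependency relationship.
  10. The apparatus according to claim 9, characterized in that the first acquisition module comprises:
    a calculation module, configured to calculate transition probabilities according to the strengths of the dependency relationships between the m keywords, wherein a transition probability is the probability that a keyword belongs to a keyword with which it has the dependency relationship; and
    a second acquisition module, configured to iterate over the m keywords according to the transition probabilities and to acquire the keywords among the m keywords that have a dependency relationship.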
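  11. The apparatus according to claim 10, wherein the second acquisition module comprises:
    an assumption submodule, configured to assume that, before the iteration, the i-th keyword belongs to the i-th keyword cluster, wherein a cluster is a set of keywords of one class and i=1,2...m;
    a calculation submodule, configured to perform one iteration on the i-th keyword according to the transition probabilities and to calculate the k-th keyword cluster to which the i-th keyword belongs after the iteration, wherein k∈{1,2...i-1,i+1...m};
    a judgment submodule, configured to judge whether the difference between the i-th cluster and the k-th cluster is less than a preset value, wherein the preset value is a preset permissible error for the keyword clusters;
    an iteration submodule, configured to continue iterating if the difference between the i-th cluster and the k-th cluster is greater than the preset value; and
    an acquisition submodule, configured to stop iterating if the difference between the i-th cluster and the k-th cluster is less than or equal to the preset value, and to acquire all keywords in the keyword cluster to which the i-th keyword belongs.
  12. The apparatus according to claim 7, characterized in that the apparatus further comprises:
    a naming unit, configured to name each of the multiple classes of keywords; and
    a ranking unit, configured to rank the named classes of keywords according to the number of keywords each class contains.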
PCT/CN2015/090185 2014-09-22 2015-09-21 网页数据分析方法及装置 WO2016045567A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/513,501 US10621245B2 (en) 2014-09-22 2015-09-21 Webpage data analysis method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410487202.3 2014-09-22
CN201410487202.3A CN104199969B (zh) 2014-09-22 2014-09-22 网页数据分析方法及装置

Publications (1)

Publication Number Publication Date
WO2016045567A1 true WO2016045567A1 (zh) 2016-03-31

Family

ID=52085262

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/090185 WO2016045567A1 (zh) 2014-09-22 2015-09-21 网页数据分析方法及装置

Country Status (3)

Country Link
US (1) US10621245B2 (zh)
CN (1) CN104199969B (zh)
WO (1) WO2016045567A1 (zh)

Also Published As

Publication number Publication date
US20170300573A1 (en) 2017-10-19
US10621245B2 (en) 2020-04-14
CN104199969A (zh) 2014-12-10
CN104199969B (zh) 2017-10-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15843171

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 15513501

Country of ref document: US

122 Ep: pct application non-entry in european phase

Ref document number: 15843171

Country of ref document: EP

Kind code of ref document: A1