CN107133321B

CN107133321B - Method and device for analyzing search characteristics of page

Info

Publication number: CN107133321B
Application number: CN201710308061.8A
Authority: CN
Inventors: 尹文科; 徐健; 刘高强; 闫彬
Original assignee: Guangzhou Shenma Mobile Information Technology Co Ltd
Current assignee: Alibaba China Co Ltd
Priority date: 2017-05-04
Filing date: 2017-05-04
Publication date: 2020-06-12
Anticipated expiration: 2037-05-04
Also published as: CN107133321A

Abstract

The invention discloses a method and a device for analyzing search characteristics of a page. The analysis method comprises the following steps: calculating a first similarity between the historical query requests in a query set and the pages in a page set; regarding the historical query requests and the pages whose first similarity exceeds a first preset threshold as matching each other; and analyzing the page according to the matching information of the page to determine the search characteristics of the page. All steps of the analysis method can therefore be performed offline, and the search characteristics of the page are determined based on the matching information between the page and the historical query requests. Compared with existing page analysis schemes, the search characteristics so determined not only better accord with the user's search intention, but also make it possible to mine pages that meet users' unpopular ("cold") needs as well as new pages.

Description

Method and device for analyzing search characteristics of page

Technical Field

The present invention relates to the field of search technologies, and in particular, to a method and an apparatus for analyzing search characteristics of a page.

Background

Existing commercial search engines basically adopt the overall architecture shown in Fig. 1: crawlers periodically capture web pages on the internet, feature calculation and index construction for the web pages are completed through offline analysis, and finally an online retrieval system provides retrieval services to users. However, it is estimated that there are about 100 trillion web pages on the Chinese internet alone, and about 100 trillion new web pages are added every day; this huge scale poses a major challenge to capture, storage, indexing, retrieval, and so on.

The main solution at present is to select, from the complete set of web pages, a subset considered to have "value" and to process it preferentially. The well-known web page value analysis methods are mainly PageRank and HITS (Hyperlink-Induced Topic Search, a link analysis algorithm).

The calculation of PageRank is based on the following two basic assumptions:

1. Quantity: in the Web graph model, the more inbound links a page node receives from other web pages, the more important the page is.

2. Quality: the inbound links pointing to page A differ in quality, and a high-quality page passes more weight to other pages through its links. So the more high-quality pages point to page A, the more important page A is.

The PageRank algorithm has the advantage of being a static, query-independent algorithm, so the PageRank values of all web pages can be obtained through offline calculation. However, the algorithm also has some disadvantages: first, popular pages often rank higher than long-tail, unpopular ("long cold") pages, which hinders mining pages that meet users' long-tail, unpopular needs; second, old pages rank higher than new pages, because even a very good new page does not have many inbound links, which hinders the discovery of new pages.
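The two assumptions above, and the disadvantage for new pages, can be illustrated with a minimal power-iteration sketch. The toy link graph, the damping factor of 0.85, the iteration count, and the function name `pagerank` are illustrative assumptions and are not taken from the patent.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its weight evenly through its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Toy graph (assumed): "hub" is linked by every other page, "new" has no inbound links.
toy_links = {"a": ["hub"], "b": ["hub"], "new": ["hub"], "hub": ["a"]}
ranks = pagerank(toy_links)
```

In this toy graph the page without inbound links ends up with the minimum possible rank regardless of its content, which is the new-page weakness described above.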

The HITS algorithm is a typical algorithm for mining the link structure of the Web. Its core idea is based on the link relations between pages: it uses the citation chains between pages to mine the useful information hidden in them. The HITS algorithm has two important concepts:

Hub page: a web page containing many links to high-quality "Authority" pages;

Authority page: a high-quality web page related to a certain field or topic.

HITS is based on the following basic assumptions:

Assumption 1: a good "Authority" page will be pointed to by many good "Hub" pages;

Assumption 2: a good "Hub" page will point to many good "Authority" pages.

The HITS algorithm has the advantage that it describes the organizational characteristics of the internet well. However, it also has disadvantages. For example, it is inefficient: HITS is a query-dependent algorithm, so the computation must be performed in real time after a user query is received. It also suffers from insufficient ability to mine long-tail, unpopular ("long cold") links and to discover new links.
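The mutual reinforcement expressed by the two HITS assumptions can be sketched as an iterative update of hub and authority scores. The toy link graph, the iteration count, and the function name `hits` are illustrative assumptions; in practice the graph would be built per query, which is the efficiency problem noted above.

```python
import math


def hits(links, iterations=50):
    """Iterative HITS over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 0.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors so they do not grow without bound.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}
    return hub, auth


# Toy graph (assumed): "portal" links to two pages, "blog" links to one.
toy_links = {"portal": ["wiki", "docs"], "blog": ["wiki"]}
hub_scores, auth_scores = hits(toy_links)
```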

Thus, there is a need for an analysis scheme that can more accurately mine out valuable pages.

Disclosure of Invention

The invention mainly aims to provide a method and a device for analyzing the searching characteristics of a page, which can more accurately dig out the page meeting the searching requirements of a user.

According to an aspect of the present invention, there is provided a method for analyzing search characteristics of a page, including: calculating a first similarity between the historical query requests in the query set and the pages in the page set; considering the historical query requests and the pages with the first similarity exceeding a first preset threshold as matching with each other; and analyzing the page according to the matching information of the page to determine the searching characteristics of the page.

Therefore, the search characteristics of the page can be determined according to the matching information between the page and the historical query requests. Compared with existing page analysis schemes, the search characteristics so determined not only better accord with the user's search intention, but also make it possible to mine pages that meet users' unpopular ("cold") needs as well as new pages.

Preferably, analyzing the page according to the matching information of the page to determine the search characteristics of the page may include: determining the query popularity of the page according to the number of historical query requests that match the page; and/or determining the resource scarcity of the page according to the number of pages matched by the historical query requests that match the page.

Therefore, for a page A matched by a large number of historical query requests, the query popularity of page A can be considered high; and when the historical query requests matching page A match only a small number of pages, the resource scarcity of page A can be considered high.

Preferably, calculating the first similarity between the historical query requests in the query set and the pages in the page set may include: performing word segmentation on the historical query requests in the query set and calculating weights to obtain a plurality of first participles and the weight corresponding to each first participle; performing word segmentation on the text information corresponding to the pages in the page set and calculating weights to obtain a plurality of second participles and the weight corresponding to each second participle; and determining the first similarity between the historical query request and the page by calculating the similarity between the first participles corresponding to the historical query request and the second participles corresponding to the page.

Therefore, the historical query requests in the query set and the pages in the page set can each be segmented into participles with calculated weights, and the first similarity between a historical query request and a page can be determined by calculating the similarity between their participles.

Preferably, only a first similarity between historical query requests and pages having at least one same valid participle may be calculated.

Therefore, the similarity need only be calculated for those pairs of historical query request and page that share at least one valid participle, instead of for every pairing of all historical query requests in the query set with all pages in the page set, so the amount of calculation can be greatly reduced without affecting the calculation accuracy.
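One common way to restrict the calculation to pairs sharing at least one participle is an inverted index from participle to pages. The sketch below is illustrative only: the function name `candidate_pairs`, the input layout, and the sample data are assumptions, not part of the patent.

```python
from collections import defaultdict


def candidate_pairs(query_terms, page_terms):
    """Return the (query, page) pairs that share at least one participle.

    Both arguments map an identifier to the set of its valid participles.
    """
    # Inverted index: participle -> pages containing it.
    index = defaultdict(set)
    for page, terms in page_terms.items():
        for term in terms:
            index[term].add(page)

    pairs = set()
    for query, terms in query_terms.items():
        for term in terms:
            for page in index.get(term, ()):
                pairs.add((query, page))
    return pairs


# Hypothetical sample data.
queries = {"q1": {"china", "good", "song"}, "q2": {"weather"}}
pages = {"u1": {"china", "good", "sound"}, "u2": {"recipe"}}
pairs = candidate_pairs(queries, pages)
```

Only `("q1", "u1")` survives here; the other three pairings share no participle and their similarity, which would be zero anyway, is never computed.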

Preferably, the same participles have the same weight, and the first similarity S(q, u) between the historical query request and the page is calculated according to the following similarity calculation formula:

$$S(q, u) = \frac{\sum_{k_j \in q \cap u} w_{k_j}}{\sum_{k_i \in q \cup u} w_{k_i}}$$

wherein q represents a historical query request, u represents a page, k_j represents a participle belonging to the intersection of the first participles corresponding to the historical query request and the second participles corresponding to the page, k_i represents a participle belonging to the union of the first participles corresponding to the historical query request and the second participles corresponding to the page, w_{k_j} represents the weight of the participle k_j, and w_{k_i} represents the weight of the participle k_i.

Thus, a first similarity S (q, u) between the historical query request and the page may preferably be determined using a Jaccard similarity calculation.
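A minimal sketch of the weighted Jaccard calculation follows, assuming each participle has a single global weight as stated above. The function name `weighted_jaccard` and the weight values are hypothetical and chosen only for illustration.

```python
def weighted_jaccard(query_terms, page_terms, weights):
    """Weight of the shared participles divided by the weight of all participles."""
    shared = query_terms & page_terms
    union = query_terms | page_terms
    union_weight = sum(weights[t] for t in union)
    if union_weight == 0:
        return 0.0
    return sum(weights[t] for t in shared) / union_weight


# Hypothetical global weight table: the same participle always has the same weight.
weights = {"china": 0.5, "good": 0.25, "sound": 0.25, "song": 0.25}
score = weighted_jaccard({"china", "good", "song"}, {"china", "good", "sound"}, weights)
```

With these weights the shared participles weigh 0.75 and the union weighs 1.25, so the score is 0.6.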

Preferably, the process of calculating the first similarity S(q, u) may include: calculating partial similarities S_j'(q, u) between the historical query request and the page, wherein, with w_k denoting the weight of participle k,

$$S_j'(q, u) = \frac{w_{k_j}}{\sum_{k_i \in q \cup u} w_{k_i}}$$

and accumulating the partial similarities S_j'(q, u) corresponding to the same pair of historical query request and page to obtain the first similarity between the historical query request and the page.

Therefore, the calculation process can be decomposed into a plurality of partial similarity calculations according to the characteristics of the similarity calculation formula, and the partial similarities S_j'(q, u) for the same pair of historical query request and page are accumulated to obtain the first similarity S(q, u).
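The decomposition can be checked on a small case: each shared participle contributes its own weight divided by the total weight of the union, and the contributions sum to the full similarity. The function name `partial_similarities` and the weights are hypothetical.

```python
def partial_similarities(query_terms, page_terms, weights):
    """Return {shared participle: its partial similarity} for one (query, page) pair."""
    union_weight = sum(weights[t] for t in query_terms | page_terms)
    if union_weight == 0:
        return {}
    return {t: weights[t] / union_weight for t in query_terms & page_terms}


# Hypothetical weights, with one global weight per participle.
weights = {"china": 0.5, "good": 0.25, "sound": 0.25, "song": 0.25}
parts = partial_similarities({"china", "good", "song"}, {"china", "good", "sound"}, weights)
first_similarity = sum(parts.values())
```

Because each partial similarity depends on only one shared participle plus the pair's own participle lists, the partial terms can be computed independently, for instance on different workers, and summed afterwards.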

Preferably, calculating the partial similarity S_j'(q, u) may include: generating a plurality of pieces of first record data, wherein each piece of first record data comprises a first participle, the weight corresponding to the first participle, the historical query request corresponding to the first participle, and all first participles corresponding to that historical query request together with their weights, and the plurality of pieces of first record data are arranged according to the hash value of the first participle; generating a plurality of pieces of second record data, wherein each piece of second record data comprises a second participle, the weight corresponding to the second participle, the page corresponding to the second participle, the weight of the page, and all second participles corresponding to that page together with their weights, and the plurality of pieces of second record data are arranged according to the hash value of the second participle; and selecting, from the plurality of pieces of first record data and second record data, first record data and second record data having the same hash value as calculation data, and calculating the partial similarity S_j'(q, u).

Therefore, the similarity between the pages in the page set and the historical query requests in the query set can be calculated by using the parallel computing model Map-Reduce.
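The record layout and the join on equal hash values can be simulated in a single process as follows. This is a sketch under stated assumptions, not a real Map-Reduce job: `zlib.crc32` stands in for the unspecified hash function, the page weight field is omitted for brevity, and the names `term_key`, `map_records`, and `reduce_similarity` are hypothetical.

```python
from collections import defaultdict
from zlib import crc32


def term_key(term):
    """Stable hash of a participle, used as the shuffle key (crc32 is an assumption)."""
    return crc32(term.encode("utf-8"))


def map_records(items):
    """Emit one record per (item, participle), grouped by the participle's hash.

    items maps a query or page identifier to its {participle: weight} dict.
    """
    records = defaultdict(list)
    for item_id, term_weights in items.items():
        for term, weight in term_weights.items():
            records[term_key(term)].append((term, weight, item_id, term_weights))
    return records


def reduce_similarity(query_records, page_records):
    """Join records with equal keys and accumulate partial similarities per pair."""
    scores = defaultdict(float)
    for key in query_records.keys() & page_records.keys():
        for q_term, weight, query, q_terms in query_records[key]:
            for u_term, _, page, u_terms in page_records[key]:
                if q_term != u_term:
                    continue  # guard against hash collisions
                union_weight = sum({**q_terms, **u_terms}.values())
                scores[(query, page)] += weight / union_weight
    return dict(scores)


# Hypothetical sample data with one global weight per participle.
queries = {"q1": {"china": 0.5, "good": 0.25, "song": 0.25}}
pages = {"u1": {"china": 0.5, "good": 0.25, "sound": 0.25}, "u2": {"recipe": 0.4}}
scores = reduce_similarity(map_records(queries), map_records(pages))
```

Each record carries the complete participle list of its query or page, so the reducer for one key can compute the union weight without looking at any other key.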

Preferably, for the first record data with the same hash value, the first record data is sorted according to the character sequence of the historical query request for which the first record data is directed, and/or for the second record data with the same hash value, the second record data is sorted according to the weight of the page for which the second record data is directed.

Preferably, the method may further comprise: for a plurality of pieces of first record data with the same hash value, retaining only a number of pieces of first record data up to a first quantity threshold to participate in the calculation of the partial similarity; and/or for a plurality of pieces of second record data with the same hash value, retaining only a number of pieces of second record data up to a second quantity threshold to participate in the calculation of the partial similarity.

Therefore, the first record data and/or the second record data can be screened, and the calculation amount can be further reduced while the long tail of the data is avoided.
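Sorting and truncating the records that share one hash value might look like the sketch below. The record tuples, the limits, and the choice of descending order for page weight are assumptions made for illustration; the patent states only the sort fields and that a threshold number of records is retained.

```python
def truncate_bucket(records, limit, sort_key, reverse=False):
    """Sort the records of one hash bucket and keep at most `limit` of them."""
    return sorted(records, key=sort_key, reverse=reverse)[:limit]


# Hypothetical bucket for the participle "china".
query_bucket = [("china", "q_b"), ("china", "q_a"), ("china", "q_c")]
page_bucket = [("china", "u1", 0.2), ("china", "u2", 0.9), ("china", "u3", 0.5)]

# First record data: ordered by the character sequence of the query.
kept_queries = truncate_bucket(query_bucket, 2, sort_key=lambda r: r[1])
# Second record data: ordered by page weight, highest first (assumed direction).
kept_pages = truncate_bucket(page_bucket, 2, sort_key=lambda r: r[2], reverse=True)
```

Capping each bucket bounds the work per participle at the product of the two limits, so a very common participle cannot dominate the running time.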

Preferably, before calculating the similarity between the first participles corresponding to the historical query request and the second participles corresponding to the page, the method may further include: removing stop words from the first participles and/or the second participles; and/or selecting the first participles and/or the second participles whose weight is larger than a second preset threshold to participate in the calculation of the similarity; and/or eliminating, from the query set and/or the page set, historical query requests and/or pages whose ratio of the number of stop words to the number of non-stop words exceeds a third preset threshold; and/or selecting, from the participles corresponding to each historical query request and/or page, the participles whose weight ranks within the top third quantity threshold to participate in the calculation of the similarity.

Therefore, before formal calculation, historical query requests and word segmentation of pages can be screened, so that the calculation accuracy can be guaranteed, and the calculation amount is reduced.

Preferably, the method may further comprise: for the historical query requests and the pages whose first similarity exceeds the first preset threshold, selecting, from the participles corresponding to the historical query request and to the page respectively, the participles whose weight ranks within the top fourth quantity threshold, and calculating a second similarity between the historical query request and the page by using the similarity calculation formula, wherein the fourth quantity threshold is larger than the third quantity threshold; and when the second similarity exceeds a fourth preset threshold, determining that the historical query request and the page match each other, wherein the fourth preset threshold is larger than the first preset threshold.
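The coarse-then-fine matching described above can be sketched as follows. The participle counts (2 and 4), the thresholds (0.7 and 0.8), and all function names and sample data are hypothetical; the only constraints taken from the text are that the second pass uses more participles and a higher threshold than the first.

```python
def top_terms(term_weights, n):
    """Keep the n highest-weighted participles (ties broken alphabetically)."""
    ranked = sorted(term_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n])


def jaccard(a, b):
    """Weighted Jaccard similarity of two {participle: weight} dicts."""
    total = sum({**a, **b}.values())
    shared = sum(w for t, w in a.items() if t in b)
    return shared / total if total else 0.0


def two_stage_match(query, page, n_coarse=2, n_fine=4, t_coarse=0.7, t_fine=0.8):
    """First pass on few participles, second pass on more with a stricter threshold."""
    first = jaccard(top_terms(query, n_coarse), top_terms(page, n_coarse))
    if first <= t_coarse:
        return False
    second = jaccard(top_terms(query, n_fine), top_terms(page, n_fine))
    return second > t_fine


# Hypothetical sample data with one global weight per participle.
query = {"china": 0.5, "good": 0.3, "song": 0.1}
page_a = {"china": 0.5, "good": 0.3, "song": 0.1, "lyrics": 0.05}
page_c = {"china": 0.5, "good": 0.3, "sound": 0.25, "show": 0.2}
```

Here `page_c` passes the first stage because its two heaviest participles coincide with the query's, but it is rejected in the second stage once the remaining participles are taken into account.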

According to another aspect of the present invention, there is also provided an apparatus for analyzing a search characteristic of a page, including: the first similarity calculation unit is used for calculating first similarity between historical query requests in the query set and pages in the page set; the matching determination unit is used for regarding the historical query requests and the pages with the first similarity exceeding a preset threshold as matching with each other; and the searching characteristic determining unit is used for analyzing the page according to the matching information of the page so as to determine the searching characteristic of the page.

Preferably, the search characteristic determination unit may include: the query popularity determining module is used for determining the query popularity of the page according to the number of the historical query requests of the matched page; and/or the resource scarcity determining module is used for determining the resource scarcity of the page according to the number of the pages matched with the historical query requests of the matched page.

Preferably, the first similarity calculation unit may include: a first word segmentation and weight calculation module, used for performing word segmentation on the historical query requests in the query set and calculating weights so as to obtain a plurality of first participles and the weight corresponding to each first participle; a second word segmentation and weight calculation module, used for performing word segmentation on the text information corresponding to the pages in the page set and calculating weights so as to obtain a plurality of second participles and the weight corresponding to each second participle; and a similarity determining module, used for determining the first similarity between the historical query request and the page by calculating the similarity between the first participles corresponding to the historical query request and the second participles corresponding to the page.

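Preferably, the same participles have the same weight, and the similarity determination module calculates a first similarity S(q, u) between the historical query request and the page according to the following similarity calculation formula:

$$S(q, u) = \frac{\sum_{k_j \in q \cap u} w_{k_j}}{\sum_{k_i \in q \cup u} w_{k_i}}$$

wherein q represents a historical query request, u represents a page, k_j represents a participle belonging to the intersection of the first participles corresponding to the historical query request and the second participles corresponding to the page, k_i represents a participle belonging to the union of those first and second participles, w_{k_j} represents the weight of the participle k_j, and w_{k_i} represents the weight of the participle k_i.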

Preferably, the similarity determination module calculates partial similarities S_j'(q, u) between the historical query request and the page, wherein, with w_k denoting the weight of participle k,

$$S_j'(q, u) = \frac{w_{k_j}}{\sum_{k_i \in q \cup u} w_{k_i}}$$

and accumulates the partial similarities S_j'(q, u) corresponding to the same pair of historical query request and page to obtain the first similarity between the historical query request and the page.

Preferably, the similarity determination module may include: a first generation module, used for generating a plurality of pieces of first record data, wherein each piece of first record data comprises a first participle, the weight corresponding to the first participle, the historical query request corresponding to the first participle, and all first participles corresponding to that historical query request together with their weights, and the plurality of pieces of first record data are arranged according to the hash value of the first participle; a second generation module, used for generating a plurality of pieces of second record data, wherein each piece of second record data comprises a second participle, the weight corresponding to the second participle, the page corresponding to the second participle, the weight of the page, and all second participles corresponding to that page together with their weights, and the plurality of pieces of second record data are arranged according to the hash value of the second participle; and a calculation module, used for selecting, from the plurality of pieces of first record data and second record data, first record data and second record data having the same hash value as calculation data, and calculating the partial similarity S_j'(q, u).

Preferably, the analysis apparatus may further include a screening unit, configured to perform one or more of the following operations before the similarity determination module calculates the similarity between the first participles corresponding to the historical query request and the second participles corresponding to the page: removing stop words from the first participles and/or the second participles; selecting the first participles and/or the second participles whose weight is larger than a second preset threshold to participate in the calculation of the similarity; removing, from the query set and/or page set, historical query requests and/or pages whose ratio of the number of stop words to the number of non-stop words exceeds a third preset threshold; and selecting, from the participles corresponding to each historical query request and/or page, the participles whose weight ranks within the top third quantity threshold to participate in the calculation of the similarity.

Preferably, the analysis apparatus may further include: the second similarity calculation unit is used for respectively selecting the participles with the weight rank higher than a fourth quantity threshold from the historical query requests and the participles corresponding to the pages aiming at the historical query requests and the pages with the first similarity exceeding a first preset threshold, and further calculating the second similarity between the historical query requests and the pages by using a similarity calculation formula, wherein the fourth quantity threshold is larger than the third quantity threshold; and the matching determining unit is used for determining that the historical query requests and the pages are matched with each other when the second similarity exceeds a fourth preset threshold, wherein the fourth preset threshold is larger than the first preset threshold.

Compared with existing page analysis schemes, the search characteristics determined for a page not only better accord with the user's search intention, but also make it possible to mine pages that meet users' unpopular ("cold") needs as well as new pages; moreover, the similarity calculation can be carried out offline.

Drawings

The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.

FIG. 1 is a schematic diagram illustrating the general architecture employed by existing commercial search engines.

FIG. 2 is a schematic flow chart diagram illustrating a method of analyzing search characteristics of a page in accordance with an embodiment of the present invention.

Fig. 3 is a schematic flowchart showing a first similarity calculation process according to an embodiment of the present invention.

FIG. 4 illustrates a flow diagram for implementing matching and optimization between pages in a set of pages and historical query requests in a set of queries based on a Map-Reduce distributed computing model.

Fig. 5 is a schematic block diagram showing the structure of an analysis apparatus of a search characteristic of a page according to an embodiment of the present invention.

Fig. 6 is a schematic block diagram showing functional blocks that the similarity determination module in fig. 5 may have.

Detailed Description

Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

As described above, the existing page value analysis schemes either determine the value of a page solely from the characteristics of the page itself, by analyzing the link relationships between pages, which hinders mining pages that meet users' unpopular ("cold") needs and new pages; or perform the calculation only after a user query is received, which makes the calculation inefficient. To address these defects, the present invention provides a new page analysis scheme.

Briefly, the page analysis scheme of the present invention may calculate the similarity between pages and historical query requests offline, and regard a page and a historical query request whose similarity exceeds a certain threshold as matching each other. A page matching a historical query request may be regarded as a target page of that query request; that is, the calculated matching information of a page reflects the search needs of users to a certain extent. Therefore, the search characteristics of the page can be determined by analyzing the calculated matching information of the page.

A detailed description will be given below of a specific implementation process of the page analysis scheme of the present invention with reference to the accompanying drawings, and fig. 2 is a schematic flow chart illustrating an analysis method of a search characteristic of a page according to an embodiment of the present invention.

Referring to fig. 2, step S110 may be performed first to calculate a first similarity between the historical query requests in the query set and the pages in the page set.

The historical query requests in the query set may be query requests of a large number of users within a certain period of time counted in advance. The pages in the page set may be newly added pages within a statistical period of time (e.g., one day, one week), or may be a large number of pages stored in a web page library in advance.

After the query set and the page set are obtained, the similarity between the historical query requests in the query set and the pages in the page set can be calculated based on a certain similarity calculation rule. For ease of distinction, this similarity is referred to herein as the first similarity. The specific calculation process of the first similarity is described in detail below and is not elaborated here.

After the first similarity is calculated, the query request and the page with the first similarity exceeding a first predetermined threshold may be regarded as matching each other (step S120). The first predetermined threshold may be set as required, for example, the first predetermined threshold may be set between 0.7 and 0.75.

The page matching the historical query request may be regarded as a target page for the historical query request, and thus the page may be analyzed according to the matching information of the page to determine the search characteristics of the page (step S130).

For example, if the number of historical query requests matching page A is found to be large after calculation, the query popularity of page A may be considered high; conversely, if the number of historical query requests matching page B is found to be small after calculation, the query popularity of page B may be considered low. Therefore, the query popularity of a page can be determined according to the number of historical query requests that match the page.

For another example, if it is found after calculation that the historical query request(s) matching page A generally match only a small number of pages, the resource scarcity of page A may be considered high; conversely, if the historical query request(s) matching page A generally match a large number of pages, the resource scarcity of page A may be considered low. That is, the resource scarcity of a page may be determined according to the number of pages matched by the historical query requests that match the page. When determining the resource scarcity of page A, the N historical query requests with the fewest matched pages may be selected from all the historical query requests matching page A, and the resource scarcity of page A may be determined according to the numbers of pages matched by the selected historical query requests.
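Given the list of matched (query, page) pairs, both characteristics reduce to counting. In the sketch below, expressing scarcity as the reciprocal of the mean page count over the N selected queries is an assumption made for illustration; the patent states only that scarcity is determined from those page counts. The function name `search_characteristics` and the sample matches are likewise hypothetical.

```python
from collections import defaultdict


def search_characteristics(matches, top_n=2):
    """Derive query popularity and resource scarcity from matched (query, page) pairs."""
    queries_of_page = defaultdict(set)
    pages_of_query = defaultdict(set)
    for query, page in matches:
        queries_of_page[page].add(query)
        pages_of_query[query].add(page)

    result = {}
    for page, queries in queries_of_page.items():
        # Popularity: how many historical queries match this page.
        popularity = len(queries)
        # Page counts of the top_n matching queries that match the fewest pages.
        counts = sorted(len(pages_of_query[q]) for q in queries)[:top_n]
        # Assumed scoring: fewer competing pages means higher scarcity.
        scarcity = 1.0 / (sum(counts) / len(counts))
        result[page] = {"popularity": popularity, "scarcity": scarcity}
    return result


# Hypothetical matches: q1 and q2 are answered only by A, q3 by A, B, and C.
matches = [("q1", "A"), ("q2", "A"), ("q3", "A"), ("q3", "B"), ("q3", "C")]
result = search_characteristics(matches)
```

Page A is both popular (three matching queries) and scarce (two of those queries have no other matching page), while B and C are matched only by a query that three pages can answer.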

Of course, other search characteristics of the page may also be determined according to the calculated matching information of the page, and will not be described here.

In summary, the above steps of the method for analyzing the search characteristics of a page according to the present invention can all be implemented offline, and the search characteristics of the page are determined based on the matching information between the page and the historical query requests. Compared with existing page analysis schemes, the search characteristics so determined better conform to the user's search intention, and pages that meet users' unpopular ("cold") needs as well as new pages can be mined.

The flow of the method for analyzing the search characteristics of a page according to the present invention has been briefly described above with reference to Fig. 2. As can be seen from that description, the key point of the page analysis method of the present invention is to calculate the first similarity between the pages in the page set and the historical query requests in the query set. The process of calculating the first similarity between a page and a historical query request is described below with reference to Fig. 3.

First, first similarity calculation process

Referring to fig. 3, step S210 and step S220 may be executed first, to perform word segmentation on the historical query requests in the query set and the text information corresponding to the pages in the page set, and calculate weights of the word segmentation. The text information corresponding to the page may be link text, anchor text, title, and other text information of the page. In addition, for convenience of description, the participle corresponding to the historical query may be referred to as a first participle, and the participle corresponding to the page may be referred to as a second participle.

The segmentation and weight calculation processes mentioned in step S210 and step S220 may be implemented by calling the word segmentation technique of an NLP (Natural Language Processing) service. The specific implementation principle of NLP services is well known to those skilled in the art and is not described here.

It should be noted that step S210 and step S220 may use the same NLP service, and the weight corresponding to each participle in the NLP service may be preset, that is, the same participle corresponds to the same weight. For example, assume that the word segmentation result of page A is {China, good, sound} and the word segmentation result of historical query request B is {China, good, song}; the weight of the second participle "China" in page A and the weight of the first participle "China" in historical query request B may then be the same.

Step S230 may then be performed to determine a first similarity between the historical query request and the page by calculating a similarity between a first participle corresponding to the historical query request and a second participle corresponding to the page.

The similarity between the first participles and the second participles can be calculated in various ways, such as Jaccard similarity, cosine similarity, CtrCtn similarity, and other similarity measures. Among Jaccard similarity, cosine similarity, and CtrCtn similarity, cosine similarity is the loosest, CtrCtn is the strictest, and Jaccard lies between them. For example, for a = {0,1,1,1} and q = {1,1,1,0}, the weighted Jaccard similarity is 1/2, the cosine similarity is 2/3, and the CtrCtn similarity is 4/9.

The Jaccard similarity measure balances precision and recall. If precision matters more than recall, the CtrCtn similarity measure can be selected; if recall matters more than precision, the cosine similarity measure can be selected.
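The three values in the example can be reproduced with the sketch below, using a = {0,1,1,1} and q = {1,1,1,0}. The patent does not define CtrCtn; the code assumes it is the product of the two containment ratios (shared weight over each vector's own total weight), an assumption chosen because it reproduces the quoted 4/9. Non-zero input vectors are assumed.

```python
import math


def jaccard(a, q):
    """Weighted Jaccard: sum of element-wise minima over sum of element-wise maxima."""
    return sum(min(x, y) for x, y in zip(a, q)) / sum(max(x, y) for x, y in zip(a, q))


def cosine(a, q):
    """Cosine of the angle between the two weight vectors."""
    dot = sum(x * y for x, y in zip(a, q))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in q)))


def ctrctn(a, q):
    """Assumed definition: product of the two containment ratios."""
    shared = sum(min(x, y) for x, y in zip(a, q))
    return (shared / sum(a)) * (shared / sum(q))


a = [0, 1, 1, 1]
q = [1, 1, 1, 0]
values = (jaccard(a, q), cosine(a, q), ctrctn(a, q))
```

On this input the ordering is CtrCtn < Jaccard < cosine, consistent with CtrCtn being the strictest and cosine the loosest of the three.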

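As an example, the present invention may employ a weighted Jaccard similarity calculation method, in which case the first similarity S(q, u) is

$$S(q, u) = \frac{\sum_{k_j \in q \cap u} w_{k_j}}{\sum_{k_i \in q \cup u} w_{k_i}} \tag{1}$$

wherein q represents a historical query request, u represents a page, k_j represents a participle belonging to the intersection of the first participles corresponding to the historical query request and the second participles corresponding to the page, k_i represents a participle belonging to the union of those first and second participles, w_{k_j} represents the weight of the participle k_j, and w_{k_i} represents the weight of the participle k_i.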

Second, optimization of similarity calculation process

According to the similarity calculation formula (1), the time complexity of calculating the first similarity between the historical query requests in the query set and the pages in the page set is

O(|Q| × |U| × |W|) (2)

where |Q| represents the number of historical query requests in the query set, |U| represents the number of pages in the page set, and |W| represents the number of participles. Taking a typical data scale as an example, the time complexity can reach O(10^8 × 10^11 × 10^1) = O(10^20). It can be seen that, unless the above algorithm is improved and optimized, the calculation result cannot be obtained in a reasonable time.
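To put the 10^20 figure in perspective, the arithmetic can be worked through directly. The throughput of 10^9 elementary operations per second is an assumed round number for a single machine, not a figure from the patent.

```python
num_queries = 10**8       # |Q| from the example above
num_pages = 10**11        # |U| from the example above
terms_per_pair = 10**1    # |W| from the example above

operations = num_queries * num_pages * terms_per_pair

ops_per_second = 10**9    # assumption: one machine, one billion operations per second
seconds = operations // ops_per_second
years = seconds / (365 * 24 * 3600)
```

Under that assumption the brute-force calculation would take on the order of three thousand years on one machine, which is why the screening and decomposition steps are needed.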

For this reason, after intensive research, the inventors found that the complexity of calculation can be reduced by screening data before calculation and further optimizing the data during calculation.

1. Pre-calculation screening

After the participles and participle weights of the historical query requests and the pages have been calculated, the historical query requests, the pages, the first participles corresponding to the historical query requests, and the second participles corresponding to the pages can be screened based on preset screening rules, so that only the historical query requests, pages, and participles that meet the requirements participate in the similarity calculation. Specifically, one or more of the following screening rules may be employed.

① Remove stop words from the first participles and/or the second participles.

② Select the first participles and/or second participles whose weight is greater than a second predetermined threshold to participate in the similarity calculation.

③ From the participles corresponding to each historical query request and/or page, select the top-ranked participles by weight, up to a third number threshold, to participate in the similarity calculation.

By eliminating stop words, keeping only participles whose weights exceed a preset threshold, and keeping only the top-ranked participles by weight, |W| in the time complexity formula (2) can be effectively reduced. This speeds up the calculation and can also improve the accuracy of the similarity calculation.

④ Remove from the query set and/or page set the historical query requests and/or pages whose ratio of the number of stop words to the number of non-stop words exceeds a third predetermined threshold.

Historical query requests and/or pages for which the ratio of the number of stop words to the number of non-stop words exceeds a third predetermined threshold may be considered spam historical query requests and/or spam pages. Therefore, | Q | and | U | in the time complexity formula (2) can be reduced by filtering spam history query requests and spam pages.

⑤ Only calculate the first similarity between historical query requests and pages that have at least one valid participle in common.

According to the similarity calculation formula (1), the first similarity between a historical query request and a page that share no valid participle is zero. Such pairs can therefore be excluded, and the first similarity is calculated only for historical query requests and pages that have at least one valid participle in common.
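The five screening rules can be sketched together in Python as follows. This is an illustrative sketch only: the stop-word list, the function names, and all threshold values are hypothetical, and an inverted index from participle to pages is one assumed way of finding the pairs that share a participle.

```python
STOP_WORDS = {"the", "of", "a"}  # hypothetical stop-word list

def is_spam(tokens, max_stop_ratio):
    """Rule 4: flag items whose stop/non-stop word ratio is too high."""
    stop = sum(1 for t in tokens if t in STOP_WORDS)
    non_stop = len(tokens) - stop
    if non_stop == 0:
        return True  # nothing useful left to match on
    return stop / non_stop > max_stop_ratio

def screen_terms(terms, min_weight, top_n):
    """Rules 1-3: drop stop words and low weights, keep top_n by weight."""
    kept = {t: w for t, w in terms.items()
            if t not in STOP_WORDS and w > min_weight}
    ranked = sorted(kept.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_n])

def candidate_pairs(queries, pages):
    """Rule 5: yield only (query, page) pairs sharing a participle."""
    index = {}
    for page_id, terms in pages.items():
        for term in terms:
            index.setdefault(term, set()).add(page_id)
    pairs = set()
    for query_id, terms in queries.items():
        for term in terms:
            for page_id in index.get(term, ()):
                pairs.add((query_id, page_id))
    return pairs

raw = {"the": 0.1, "China": 2.0, "good": 1.0, "sound": 0.6}
print(screen_terms(raw, min_weight=0.5, top_n=2))
```

Rules 1 to 3 shrink |W|, rule 4 shrinks |Q| and |U|, and rule 5 avoids enumerating pairs whose similarity is known to be zero.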

2. Optimization of a computing process

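The above-mentioned similarity calculation formula (1) can be expressed as:

$$S(q,u)=\sum_{k_j\in q\cap u}\frac{w_{k_j}}{\sum_{k_i\in q\cup u} w_{k_i}}$$

where w_{k} denotes the weight of the participle k, k_j ranges over the participles shared by q and u, and k_i ranges over the participles in the union of q and u. Therefore, the partial similarity S_j'(q, u) between the historical query request and the page can be calculated first, wherein

$$S_j'(q,u)=\frac{w_{k_j}}{\sum_{k_i\in q\cup u} w_{k_i}}$$

and then the partial similarities S_j'(q, u) corresponding to the same pair of historical query request and page are accumulated to obtain the first similarity between the historical query request and the page. The partial similarity represents the contribution of a participle to the overall similarity.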
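A minimal sketch of this decomposition follows, assuming the partial similarity of a shared participle is its weight divided by the total weight of the union of participles (the reading consistent with accumulating partial similarities into the first similarity). The function name and sample data are hypothetical.

```python
def partial_similarities(query_terms, page_terms):
    """Return {shared participle: its contribution to the similarity}."""
    union = {**page_terms, **query_terms}
    total = sum(union.values())
    if total == 0:
        return {}
    shared = query_terms.keys() & page_terms.keys()
    return {k: union[k] / total for k in shared}

# Hypothetical sample data
q1 = {"China": 2.0, "good": 1.0, "sound": 0.6}
u1 = {"China": 2.0, "good": 1.0, "show": 0.4}
parts = partial_similarities(q1, u1)
first_similarity = sum(parts.values())  # accumulate the contributions
print(parts, first_similarity)
```

Because each contribution depends on a single shared participle, the contributions can be computed independently per participle and summed afterwards, which is what makes the calculation easy to parallelize.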

The similarity calculation process may be implemented in parallel using a distributed computing model. For example, a Map-Reduce distributed computing model can be used to simulate an inverted index recall to implement parallelized computation, and the specific computation process is as follows.

FIG. 4 illustrates a schematic computing flow diagram for computing a first similarity between a page and historical query requests based on a Map-Reduce distributed computing model.

Referring to fig. 4, step S410 may be performed first to perform a data preprocessing operation on the query set and the page set. Here, the following operation may be performed in step S410.

1. Segment the historical query requests in the query set and the pages in the page set, respectively, and calculate the weights.

The process of word segmentation and weight calculation can be referred to the above description of step S210 and step S220 in fig. 3.

2. Calculate the weight uWeight of each page in the page set. The calculation can be performed using the following formula: uWeight = a × (PR × b + HR × c + d × (LinkFollow/UrlDepth)),

wherein PR, HR, LinkFollow, and UrlDepth respectively represent the page score (PageRank), the host score (HostRank), the number of incoming links, and the link depth of a page, and a, b, c, and d are weights, which may be 0.015254, 9, 5, and 45, respectively. The larger the PR, HR, and LinkFollow values and the smaller the UrlDepth value, the larger the weight of the page; a page with a large weight can be considered to be of good quality.
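The page weight formula can be written directly as a small function. This sketch uses the example coefficients given above as defaults; the function and parameter names are hypothetical, and the link depth is assumed to be at least 1 so that the division is defined.

```python
def page_weight(pr, hr, link_follow, url_depth,
                a=0.015254, b=9, c=5, d=45):
    """uWeight = a * (PR*b + HR*c + d * (LinkFollow / UrlDepth)).

    Assumes url_depth >= 1; a depth of 0 would divide by zero.
    """
    return a * (pr * b + hr * c + d * (link_follow / url_depth))

shallow = page_weight(pr=1, hr=1, link_follow=10, url_depth=2)
deep = page_weight(pr=1, hr=1, link_follow=10, url_depth=5)
print(shallow, deep)  # the shallower page receives the larger weight
```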

3. Screen the data.

The specific screening operations may be as described above for pre-calculation screening. For example, the participles corresponding to each historical query request and each page may be truncated by retaining only the top n participles with the largest weights. Here n is usually a small integer, such as 2 or 3. As can be seen from formula (1), the similarity is mainly affected by the key participles with large weights, so adding truncation reduces the computational complexity of the similarity calculation (|W| in formula (2)) without affecting the recall rate.

After the data preprocessing, first record data for the historical query requests and second record data for the pages may be formed, respectively (step S420, step S425). The data structure and arrangement of the first record data and the second record data are further described below.

4.1 Structure and arrangement of the first record data

As described above, after a historical query request is segmented, one or more first participles of the historical query request are obtained. Thus, for each first participle of each historical query request, first record data for that first participle can be generated. The first record data may include the first participle, the weight corresponding to the first participle, the historical query request corresponding to the first participle, and all the first participles corresponding to the historical query request together with their weights.

For example, assume that the historical query request q1 is segmented to obtain q1 = {China, good, sound}, where China has weight 2, good has weight 1, and sound has weight 0.6. Then, for q1, three pieces of first record data may be formed: {China, 2, q1 (China 2; good 1; sound 0.6)}, {good, 1, q1 (China 2; good 1; sound 0.6)}, and {sound, 0.6, q1 (China 2; good 1; sound 0.6)}.

Thus, a plurality of pieces of first record data can be generated for different historical query requests. The generated pieces of first record data may be sorted by their first participle, so that the first record data corresponding to the same first participle are arranged together. To facilitate this arrangement, a predetermined hash algorithm may be used to assign a hash value to each first participle, so that the first record data having the same hash value are arranged together. The first record data with the same hash value may then be sorted according to the character order of the historical query requests to which they correspond.
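The generation and arrangement of the first record data can be sketched as follows. The tuple layout mirrors the four fields described above; the use of SHA-256 as the hash algorithm, the 32-bit truncation, and all names are assumptions made for illustration, since the text only requires "a predetermined hash algorithm".

```python
import hashlib
from collections import defaultdict

def term_hash(term):
    """Stable 32-bit hash of a participle (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def build_first_records(queries):
    """Group (participle, weight, query, all participles) by hash value."""
    buckets = defaultdict(list)
    for query_id, terms in queries.items():
        for term, weight in terms.items():
            buckets[term_hash(term)].append(
                (term, weight, query_id, dict(terms)))
    for records in buckets.values():
        # Within a bucket, order by the character order of the query.
        records.sort(key=lambda record: record[2])
    return dict(buckets)

# Hypothetical sample data
queries = {
    "q2": {"China": 2.0, "map": 1.2},
    "q1": {"China": 2.0, "good": 1.0, "sound": 0.6},
}
buckets = build_first_records(queries)
for record in buckets[term_hash("China")]:
    print(record)
```

Second record data for pages could be built the same way with the same hash function, with the page weight added to each record and used as the sort key within a bucket.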

4.2 Structure and arrangement of the second record data

The structure and arrangement of the second record data are in principle the same as those of the first record data. That is, second record data may be generated for each second participle of each page, where the second record data may include the second participle, the weight corresponding to the second participle, the page corresponding to the second participle, the weight of the page, and all the second participles corresponding to the page together with their weights.

In arranging the second record data, the second record data corresponding to the same second participle may be arranged together. Here, each second participle may be given a hash value using the same hash algorithm as in section 4.1, so that the second record data with the same hash value are arranged together. The second record data with the same hash value may be sorted according to the weight of the pages to which they correspond.

After the first record data and the second record data are formed, step S430 may be performed to select, from the plurality of pieces of first record data and the plurality of pieces of second record data respectively, the first record data and the second record data having the same hash value as the calculation data, and to calculate the partial similarity S_j'(q, u). The partial similarity S_j'(q, u) is calculated as described above for the optimization of the computing process.

Before the partial similarity S_j'(q, u) is calculated, the first record data and the second record data may also be truncated to avoid a long tail of data.

Specifically, for a plurality of pieces of first record data having the same hash value, only a first number threshold of pieces of first record data may be retained to participate in the calculation of the partial similarity. The first number threshold may be set according to the actual situation; for example, it may be selected from 1000 to 10000.

For a plurality of pieces of second record data having the same hash value, only a second number threshold of pieces of second record data may be retained to participate in the calculation of the partial similarity. For example, the top Y pages by weight (uWeight) can be picked from the pieces of second record data with the same hash value, and then X of those pages can be randomly retained. Here Y may be 100000 to 500000, and X may be 50000 to 100000.
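The two truncation steps can be sketched as follows. The record layout (a dict with a "uweight" key), the function names, and the fixed random seed are hypothetical choices for illustration; the small limits in the usage lines stand in for the much larger values of the first number threshold, Y, and X mentioned above.

```python
import random

def truncate_first(records, first_limit):
    """Keep at most first_limit query records from one hash bucket."""
    return records[:first_limit]

def truncate_second(records, top_y, keep_x, seed=0):
    """Keep the top_y pages by uWeight, then randomly retain keep_x."""
    ranked = sorted(records, key=lambda r: r["uweight"], reverse=True)
    ranked = ranked[:top_y]
    if len(ranked) <= keep_x:
        return ranked
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return rng.sample(ranked, keep_x)

# Hypothetical bucket of ten page records
pages = [{"page": f"u{i}", "uweight": float(i)} for i in range(10)]
kept = truncate_second(pages, top_y=5, keep_x=3)
print([r["page"] for r in kept])
```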

After the partial similarity S_j'(q, u) has been calculated, a plurality of pieces of third record data may be formed, each of which may be a data structure of the form (k, q, u, s'), wherein q represents a historical query request, u represents a page, k represents a participle belonging to both q and u, and s' represents the partial similarity calculated based on the participle k.

Then, step S440 may be performed to calculate the first similarity. Here, by adding up s' over the pieces of third record data (k, q, u, s') having the same q and u, the first similarity between q and u can be obtained.
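The two phases described above can be simulated in a single process as a sketch: a map-like phase that joins queries and pages on a shared participle and emits (k, q, u, s') records, and a reduce-like phase that sums s' per (q, u). This is not a real Map-Reduce job; the function names and sample data are hypothetical, participles are used as grouping keys in place of hash values, and the partial similarity is assumed to be the participle weight divided by the total weight of the union.

```python
from collections import defaultdict

def map_partial(queries, pages):
    """Join on shared participles and emit (k, q, u, s') records."""
    queries_by_term = defaultdict(list)
    pages_by_term = defaultdict(list)
    for q, terms in queries.items():
        for k in terms:
            queries_by_term[k].append(q)
    for u, terms in pages.items():
        for k in terms:
            pages_by_term[k].append(u)
    for k in queries_by_term.keys() & pages_by_term.keys():
        for q in queries_by_term[k]:
            for u in pages_by_term[k]:
                union = {**pages[u], **queries[q]}
                yield (k, q, u, union[k] / sum(union.values()))

def reduce_similarity(third_records):
    """Sum s' over records that share the same (q, u)."""
    totals = defaultdict(float)
    for _k, q, u, s in third_records:
        totals[(q, u)] += s
    return dict(totals)

# Hypothetical sample data
queries = {"q1": {"China": 2.0, "good": 1.0, "sound": 0.6}}
pages = {
    "u1": {"China": 2.0, "good": 1.0, "show": 0.4},
    "u2": {"sound": 0.6, "music": 1.4},
}
sims = reduce_similarity(map_partial(queries, pages))
print(sims)
```

Pairs that share no participle never appear in the output, which mirrors how an inverted index only recalls documents containing a query term.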

After the first similarity is obtained, the historical query requests and pages may first be coarsely screened: if the calculated first similarity s is not greater than the first predetermined threshold, the pair is discarded; otherwise it proceeds to the next step. The first predetermined threshold is generally preferably 0.7 to 0.75.

After coarse screening, fourth record data may be formed, with the following data structure: (q, u, uWeight, s). The meaning of each field is the same as above and is not repeated here. The pieces of fourth record data may be sorted first by the character order of q; records with the same q may be sorted by s from large to small, and records with the same q and the same s may be sorted by uWeight from large to small.
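The coarse screening and the ordering of the fourth record data can be sketched as follows. The threshold default of 0.7 is taken from the range given above; the function name and sample values are hypothetical, and using uWeight as the tie-breaker after s is an assumed reading of the sort order.

```python
def coarse_screen(sims, page_weights, threshold=0.7):
    """Drop pairs with s <= threshold; return sorted (q, u, uWeight, s)."""
    rows = [(q, u, page_weights[u], s)
            for (q, u), s in sims.items() if s > threshold]
    # Order by q ascending, then s descending, then uWeight descending.
    rows.sort(key=lambda row: (row[0], -row[3], -row[2]))
    return rows

# Hypothetical sample data
sims = {("q1", "u1"): 0.75, ("q1", "u2"): 0.12,
        ("q1", "u3"): 0.9, ("q0", "u1"): 0.8}
page_weights = {"u1": 0.9, "u2": 0.5, "u3": 0.3}
for row in coarse_screen(sims, page_weights):
    print(row)
```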

The historical query requests and pages that pass the coarse screening may then be finely screened (step S450). Specifically, for a historical query request and a page whose first similarity exceeds the first predetermined threshold, the top-ranked participles by weight, up to a fourth number threshold, may be selected from the participles corresponding to the historical query request and to the page, respectively, and a second similarity between the historical query request and the page is then calculated using the above-mentioned similarity calculation formula (1), where the fourth number threshold may be greater than the third number threshold. When the second similarity exceeds a fourth predetermined threshold, the historical query request and the page may be determined to match each other, wherein the fourth predetermined threshold is greater than the first predetermined threshold.
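The coarse-then-fine matching can be sketched as a two-stage check: a cheap similarity over a few top participles, followed by a stricter similarity over more participles for the pairs that survive. All names, the sample data, and the threshold values 0.7 and 0.75 are assumptions for illustration; the text only requires that the fine stage use more participles and a higher threshold than the coarse stage.

```python
def top_terms(terms, n):
    """Keep the n participles with the largest weights."""
    ranked = sorted(terms.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])

def weighted_jaccard(a, b):
    """Shared weight over union weight; same participle, same weight."""
    union = {**b, **a}
    total = sum(union.values())
    if total == 0:
        return 0.0
    return sum(union[k] for k in a.keys() & b.keys()) / total

def two_stage_match(q_terms, u_terms, n_coarse=2, n_fine=4,
                    t_coarse=0.7, t_fine=0.75):
    """Return True only if the pair passes both screening stages."""
    s1 = weighted_jaccard(top_terms(q_terms, n_coarse),
                          top_terms(u_terms, n_coarse))
    if s1 <= t_coarse:
        return False
    s2 = weighted_jaccard(top_terms(q_terms, n_fine),
                          top_terms(u_terms, n_fine))
    return s2 > t_fine

# Hypothetical sample data
q = {"China": 2.0, "good": 1.0, "sound": 0.6, "season": 0.5}
u_close = {"China": 2.0, "good": 1.0, "sound": 0.6, "season": 0.5}
u_loose = {"China": 2.0, "good": 1.0, "sound": 0.6, "recipe": 0.9}
print(two_stage_match(q, u_close), two_stage_match(q, u_loose))
```

In this example u_loose passes the coarse stage, because its two heaviest participles coincide with those of q, but fails the fine stage once the lower-weight participles are taken into account.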

After the fine screening, the obtained matching information between the pages and the historical query requests can be written into the corresponding database (step S460). When writing to the database, for each historical query request only the first K pages matched with it may be written; generally K may take a value of 2000 to 3000.

Finally, the search characteristics of the page can be calculated according to the matching information between the page and the historical query requests stored in the database.

For example, the query popularity (access popularity) Pop(u) of a page u can be determined as the number of historical query requests that match the page u and that each match fewer than K pages (K may be the same as the value of K mentioned above), that is,

Pop(u) = |{q | S(q, u) > γ and |{u′ | S(q, u′) > γ}| < K}|

where γ is the similarity threshold above which a historical query request and a page are considered to match.

For another example, among all the historical query requests matching the page u, the number of pages matched by the historical query request that matches the fewest pages can be used as the resource scarcity Scarcity(u) of the page u, that is,

Scarcity(u) = min_{q: S(q, u) > γ} |{u′ | S(q, u′) > γ}|
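Both search characteristics can be derived from the stored matching information, as in the following sketch. It assumes the matches are available as a mapping from each historical query request to the set of pages it matches; the function name, the sample data, and the small value of K are hypothetical.

```python
def search_characteristics(matches, k):
    """Compute (popularity, scarcity) per page from query->pages matches.

    popularity[u]: number of matching queries that match fewer than k pages.
    scarcity[u]:   smallest matched-page count among queries matching u.
    """
    popularity, scarcity = {}, {}
    for _query, pages in matches.items():
        count = len(pages)
        for page in pages:
            popularity.setdefault(page, 0)
            if count < k:
                popularity[page] += 1
            scarcity[page] = min(scarcity.get(page, count), count)
    return popularity, scarcity

# Hypothetical matching information
matches = {
    "q1": {"u1", "u2", "u3"},
    "q2": {"u1"},
    "q3": {"u1", "u2"},
}
popularity, scarcity = search_characteristics(matches, k=3)
print(popularity, scarcity)
```

A small scarcity value indicates that at least one query matching the page has few alternative pages, so the page is a scarce resource for that query.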

So far, the matching between the pages in the page set and the historical query requests in the query set based on the Map-Reduce distributed computing model, and its optimization, have been described in detail with reference to fig. 4. Based on the above description, similarity calculation over massive numbers of pages and historical query requests can be realized by using the Map-Reduce computing model to simulate the recall process of an inverted index, and the calculation process can scale elastically, so that computing resources can be saved.

In addition, the present invention also provides an analysis apparatus for page search characteristics, which can be used to execute the analysis method of the present invention. The following therefore mainly describes the structure that the analysis apparatus may have; for details, refer to the related description above, which is not repeated here.

Fig. 5 is a schematic block diagram showing the structure of an analysis apparatus of a page search characteristic according to an embodiment of the present invention.

Referring to fig. 5, the analysis apparatus 500 includes a first similarity calculation unit 510, a matching determination unit 520, and a search characteristic determination unit 530.

The first similarity calculation unit 510 is configured to calculate a first similarity between the historical query requests in the query set and the pages in the page set.

The matching determination unit 520 is configured to consider the historical query requests and the pages with the first similarity exceeding a predetermined threshold as matching with each other.

The search characteristic determining unit 530 is configured to analyze the page according to the matching information of the page to determine the search characteristic of the page.

As an alternative embodiment of the present invention, as shown in fig. 5, the search characteristic determining unit 530 may include a query popularity determining module 531 and a resource scarcity determining module 533.

The query popularity determination module 531 is configured to determine the query popularity of the page according to the number of the historical query requests matching the page.

The resource scarcity determining module 533 is configured to determine the resource scarcity of the page according to the number of the pages matched by the historical query request of the matched page.

As another alternative embodiment of the present invention, as shown in fig. 5, the first similarity calculation unit 510 may include a first participle and weight calculation module 511, a second participle and weight calculation module 513, and a similarity determination module 515.

The first segmentation and weight calculation module 511 is configured to perform segmentation and weight calculation on historical query requests in the query set to obtain a plurality of first segments and a weight corresponding to each first segment.

The second segmentation and weight calculation module 513 is configured to segment the text information corresponding to the pages in the page set and calculate a weight to obtain a plurality of second segments and a weight corresponding to each second segment.

The similarity determining module 515 is configured to determine a first similarity between the historical query request and the page by calculating a similarity between a first participle corresponding to the historical query request and a second participle corresponding to the page.

The same participles may have the same weight, and the similarity determination module 515 may calculate the first similarity S(q, u) between the historical query request and the page according to the following similarity calculation formula:

$$S(q,u)=\frac{\sum_{k_j\in q\cap u} w_{k_j}}{\sum_{k_i\in q\cup u} w_{k_i}}$$

wherein q represents a historical query request, u represents a page, k_j represents a participle in the intersection of the first participles corresponding to the historical query request and the second participles corresponding to the page, k_i represents a participle in the union of the first participles corresponding to the historical query request and the second participles corresponding to the page, w_{k_j} represents the weight of the participle k_j, and w_{k_i} represents the weight of the participle k_i.

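The similarity determination module 515 may first calculate the partial similarity S_j'(q, u) between the historical query request and the page, wherein

$$S_j'(q,u)=\frac{w_{k_j}}{\sum_{k_i\in q\cup u} w_{k_i}}$$

with w_{k} denoting the weight of the participle k, k_j a participle shared by q and u, and k_i ranging over the union of the participles of q and u. Then, the first similarity between the historical query request and the page is obtained by accumulating the partial similarities corresponding to the same historical query request and page.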

Fig. 6 shows a schematic block diagram of functional modules that the similarity determination module 515 may have.

Referring to fig. 6, the similarity determining module 515 may include a first generating module 5151, a second generating module 5153, and a calculating module 5155.

The first generating module 5151 is configured to generate a plurality of pieces of first record data, where each piece of first record data includes a first participle, a weight corresponding to the first participle, a historical query request corresponding to the first participle, and all first participles and weights thereof corresponding to the historical query request, and the plurality of pieces of first record data are arranged according to a hash value of the first participle.

The second generating module 5153 is configured to generate a plurality of pieces of second record data, where each piece of second record data includes a second participle, a weight corresponding to the second participle, a page corresponding to the second participle, a weight of the page, and all second participles and weights thereof corresponding to the page, and the plurality of pieces of second record data are arranged according to a hash value of the second participle.

The calculating module 5155 is configured to select, from the plurality of pieces of first record data and the plurality of pieces of second record data respectively, the first record data and the second record data having the same hash value as the data for calculation, and to calculate the partial similarity S_j'(q, u).

Returning to fig. 5, as another alternative embodiment of the present invention, the analysis device 500 may further include a screening unit 540. The screening unit 540 may perform one or more of the following operations before the similarity determination module 515 calculates the similarity between the first participle corresponding to the historical query request and the second participle corresponding to the page:

removing stop words in the first participle and/or the second participle;

selecting a first participle and/or a second participle with the weight larger than a second preset threshold value to participate in the calculation of the similarity;

removing historical query requests and/or pages with the ratio of the number of stop words to the number of non-stop words exceeding a third preset threshold value from the query set and/or page set;

and selecting, from the participles corresponding to each historical query request and/or page, the top-ranked participles by weight, up to a third number threshold, to participate in the calculation of the similarity.

As another alternative embodiment of the present invention, the analysis apparatus 500 may further include a second similarity calculation unit 550 and a matching determination unit 560.

The second similarity calculation unit 550 is configured to, for a historical query request and a page whose first similarity exceeds a first predetermined threshold, select the top-ranked participles by weight, up to a fourth number threshold, from the participles corresponding to the historical query request and to the page, respectively, and further calculate a second similarity between the historical query request and the page by using the similarity calculation formula, where the fourth number threshold is greater than the third number threshold.

The matching determination unit 560 is configured to determine that the historical query request and the page match each other when the second similarity exceeds a fourth predetermined threshold, where the fourth predetermined threshold is greater than the first predetermined threshold.

The page search characteristic analysis method and the analysis apparatus according to the present invention have been described in detail above with reference to the accompanying drawings.

Furthermore, the method according to the invention may also be implemented as a computer program comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention. Alternatively, the method according to the present invention may also be implemented as a computer program product comprising a computer readable medium having stored thereon a computer program for executing the above-mentioned functions defined in the above-mentioned method of the present invention. Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A method for analyzing search characteristics of a page, comprising:

calculating a first similarity between the historical query requests in the query set and the pages in the page set based on a plurality of first participles of the historical query requests in the query set and a plurality of second participles of the pages in the page set;

considering the historical query requests and the pages with the first similarity exceeding a first preset threshold as matching with each other;

and analyzing the page according to the matching information of the page to determine the search characteristics of the page.

2. The analysis method of claim 1, wherein analyzing the page to determine the search characteristics of the page according to the page's matching information comprises:

determining the query popularity of the page according to the number of the historical query requests matched with the page; and/or

determining the resource scarcity of the page according to the number of the pages matched with the historical query requests that match the page.

3. The method of claim 1, wherein calculating a first similarity between historical query requests in the query set and pages in the page set comprises:

performing word segmentation on the historical query requests in the query set and calculating weights to obtain a plurality of first words and the weight corresponding to each first word;

performing word segmentation on the text information corresponding to the pages in the page set and calculating the weight to obtain a plurality of second words and the weight corresponding to each second word;

determining the first similarity between the historical query requests and the pages by calculating the similarity between the first participles corresponding to the historical query requests and the second participles corresponding to the pages.

4. The method of claim 3, wherein,

only a first similarity between historical query requests and pages having at least one same valid participle is calculated.

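5. The method of claim 3, wherein the same participles have the same weight, and the first similarity S(q, u) between the historical query request and the page is calculated according to the following similarity calculation formula:

$$S(q,u)=\frac{\sum_{k_j\in q\cap u} w_{k_j}}{\sum_{k_i\in q\cup u} w_{k_i}}$$

wherein q represents a historical query request, u represents a page, k_j represents a participle in the intersection of the first participles corresponding to the historical query request and the second participles corresponding to the page, k_i represents a participle in the union of the first participles and the second participles, w_{k_j} represents the weight of the participle k_j, and w_{k_i} represents the weight of the participle k_i.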

6. The method of claim 5, wherein the process of calculating the first similarity S (q, u) comprises:

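calculating the partial similarity S_j'(q, u) between historical query requests and pages, wherein

$$S_j'(q,u)=\frac{w_{k_j}}{\sum_{k_i\in q\cup u} w_{k_i}}$$

with w_{k} denoting the weight of the participle k, k_j a participle shared by q and u, and k_i ranging over the union of the participles of q and u; and

obtaining the first similarity between the historical query request and the page by accumulating the partial similarities S_j'(q, u) corresponding to the same pair of historical query request and page.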

7. The method of claim 6, wherein partial similarity S is calculated_j' (q, u) includes:

generating a plurality of pieces of first record data, wherein each piece of first record data comprises a first word segmentation, a weight corresponding to the first word segmentation, a historical query request corresponding to the first word segmentation, all first words corresponding to the historical query request and the weights of the first words, and the plurality of pieces of first record data are arranged according to a hash value of the first word segmentation;

generating a plurality of pieces of second record data, wherein each piece of second record data comprises second participles, weights corresponding to the second participles, pages corresponding to the second participles, weights of the pages, all second participles corresponding to the pages and the weights of the second participles, and the plurality of pieces of second record data are arranged according to hash values of the second participles;

selecting first record data and second record data with the same hash value from the first record data and the second record data as calculation data, and calculating the partial similarity S_j'(q, u).

8. The method of claim 7, wherein,

for the first record data with the same hash value, sorting according to the character sequence of the historical query request for which the first record data is directed, and/or,

and sorting the second record data with the same hash value according to the weight of the page to which the second record data aims.

9. The method of claim 8, further comprising:

for a plurality of pieces of first record data with the same hash value, retaining only a first number threshold of pieces of first record data to participate in the calculation of the partial similarity; and/or

for a plurality of pieces of second record data with the same hash value, retaining only a second number threshold of pieces of second record data to participate in the calculation of the partial similarity.

10. The method of claim 6, wherein prior to calculating the similarity between the first participle corresponding to the historical query request and the second participle corresponding to the page, the method further comprises:

removing stop words in the first participle and/or the second participle; and/or

Selecting a first participle and/or a second participle with the weight larger than a second preset threshold value to participate in the calculation of the similarity; and/or

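removing historical query requests and/or pages with the ratio of the number of stop words to the number of non-stop words exceeding a third preset threshold value from the query set and/or the page set; and/or

selecting, from the participles corresponding to each historical query request and/or page, the top-ranked participles by weight, up to a third number threshold, to participate in the calculation of the similarity.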

11. The method of claim 10, further comprising:

for the historical query requests and the pages with the first similarity exceeding a first preset threshold, selecting the top-ranked participles by weight, up to a fourth number threshold, from the participles corresponding to the historical query requests and to the pages respectively, and further calculating a second similarity between the historical query requests and the pages by using the similarity calculation formula, wherein the fourth number threshold is larger than the third number threshold;

when the second similarity exceeds a fourth predetermined threshold, determining that the historical query requests and the page are matched with each other, wherein the fourth predetermined threshold is greater than the first predetermined threshold.

12. An apparatus for analyzing search characteristics of a page, comprising:

the first similarity calculation unit is used for calculating first similarity between the historical query requests in the query set and the pages in the page set based on a plurality of first participles of the historical query requests in the query set and a plurality of second participles of the pages in the page set;

the matching determination unit is used for regarding the historical query requests and the pages with the first similarity exceeding a preset threshold as matching with each other;

and the search characteristic determining unit is used for analyzing the page according to the matching information of the page so as to determine the search characteristic of the page.

13. The analysis device according to claim 12, wherein the search characteristic determination unit includes:

the query popularity determining module is used for determining the query popularity of the page according to the number of the historical query requests matched with the page; and/or

the resource scarcity determining module is used for determining the resource scarcity of the page according to the number of pages matched with the historical query requests that match the page.
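As an illustration of claim 13, both quantities can be derived from the list of matched (query, page) pairs. The sketch below is an assumption-laden example: the claim only states which counts are used, so the function name `search_characteristics` and the choice to report the average number of pages per matching query as the scarcity input are hypothetical.

```python
from collections import defaultdict


def search_characteristics(matches):
    """matches: iterable of (query, page) pairs judged to match each other."""
    queries_of = defaultdict(set)  # page  -> queries matched with it
    pages_of = defaultdict(set)    # query -> pages matched with it
    for query, page in matches:
        queries_of[page].add(query)
        pages_of[query].add(page)

    result = {}
    for page, queries in queries_of.items():
        # Query popularity: number of historical queries matched with the page.
        popularity = len(queries)
        # Assumption: summarize "number of pages matched with those queries"
        # as an average; a lower value suggests a scarcer resource.
        avg_pages = sum(len(pages_of[q]) for q in queries) / len(queries)
        result[page] = {
            "query_popularity": popularity,
            "avg_pages_per_query": avg_pages,
        }
    return result
```

A page that is the only match for its queries ends up with `avg_pages_per_query` equal to 1.0, which under this reading marks it as a scarce resource.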

14. The analysis device according to claim 12, wherein the first similarity calculation unit includes:

the first word segmentation and weight calculation module is used for segmenting the historical query requests in the query set and calculating weights, so as to obtain a plurality of first participles and the weight corresponding to each first participle;

the second word segmentation and weight calculation module is used for segmenting the text information corresponding to the pages in the page set and calculating weights, so as to obtain a plurality of second participles and the weight corresponding to each second participle;

and the similarity determining module is used for determining the first similarity between the historical query request and the page by calculating the similarity between the first participle corresponding to the historical query request and the second participle corresponding to the page.
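The segmentation and weighting modules of claim 14 can be sketched as below. The claims do not specify the segmenter or the weighting scheme, so this example assumes whitespace tokenization and a smoothed inverse-document-frequency weight; `participle_weights` and `segment_with_weights` are hypothetical names. A single global weight per participle is consistent with the requirement in claim 15 that identical participles have identical weights.

```python
import math
from collections import Counter


def participle_weights(texts):
    """Assign one global weight per participle (assumed: smoothed IDF)."""
    doc_freq = Counter()
    for text in texts:
        # Count each participle once per text.
        doc_freq.update(set(text.lower().split()))
    n = len(texts)
    return {term: math.log((1 + n) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def segment_with_weights(text, weights):
    """Segment one query or page text and attach the global weights."""
    return {term: weights[term]
            for term in set(text.lower().split())
            if term in weights}
```

With this scheme, a participle that occurs in few texts receives a higher weight than one that occurs in many, and a query and a page that contain the same participle always see the same weight for it.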

15. The analysis device of claim 14, wherein identical participles have identical weights, and the similarity determination module calculates a first similarity S(q, u) between the historical query request and the page according to the following similarity calculation formula:

[formula not preserved in the source text]

wherein the terms of the formula represent the weight of a participle k_j and the weight of a participle k_i, respectively.

16. The analysis device of claim 15, wherein the similarity determination module calculates a partial similarity S_j'(q, u) between the historical query request and the page, wherein [formula not preserved in the source text].

17. The analysis device of claim 16, wherein the similarity determination module comprises:

the first generation module is used for generating a plurality of pieces of first record data, each piece of first record data comprises a first participle, the weight corresponding to the first participle, the historical query request corresponding to the first participle, all first participles corresponding to the historical query request and the weights of all the first participles, and the plurality of pieces of first record data are arranged according to a hash value of the first participle;

the second generation module is used for generating a plurality of pieces of second record data, each piece of second record data comprises a second participle, a weight corresponding to the second participle, a page corresponding to the second participle, a weight of the page, all second participles corresponding to the page and the weights of all the second participles, and the plurality of pieces of second record data are arranged according to a hash value of the second participle;

a calculation module for selecting, from the first record data and the second record data, first record data and second record data having the same hash value as calculation data, and calculating the partial similarity S_j'(q, u).
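The record layout of claim 17 lends itself to a hash-partitioned join: records for the same participle land in the same bucket, and each record carries enough data to compute its contribution locally. The sketch below is a simplified illustration under stated assumptions: the partial similarity formula is not reproduced in this text, so each shared participle is assumed to contribute its cosine term; the separate page weight mentioned in the claim is omitted; all weights are assumed positive; and `bucket`, `make_records` and `accumulate_similarity` are hypothetical names.

```python
import zlib
from collections import defaultdict
from math import sqrt


def bucket(term, n_buckets=8):
    # Stable hash so records for the same participle share a bucket.
    return zlib.crc32(term.encode("utf-8")) % n_buckets


def make_records(items):
    """items: {owner_id: {participle: weight}}.
    Returns bucket -> list of (participle, weight, owner_id, all_weights)."""
    records = defaultdict(list)
    for owner, weights in items.items():
        for term, w in weights.items():
            records[bucket(term)].append((term, w, owner, weights))
    return records


def norm(weights):
    return sqrt(sum(w * w for w in weights.values()))


def accumulate_similarity(queries, pages):
    """Sum assumed partial similarities per (query, page) pair."""
    first, second = make_records(queries), make_records(pages)
    totals = defaultdict(float)
    for b in first.keys() & second.keys():
        for term_q, w_q, q, all_q in first[b]:
            for term_u, w_u, u, all_u in second[b]:
                if term_q != term_u:
                    continue  # same bucket, different participle
                # Assumed partial similarity: one cosine term per shared participle.
                totals[(q, u)] += (w_q * w_u) / (norm(all_q) * norm(all_u))
    return dict(totals)
```

Because only buckets present on both sides are visited, query and page pairs with no participle in common never generate any work, which is the practical benefit of arranging the records by hash value.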

18. The analysis device of claim 16, further comprising a filtering unit configured to perform one or more of the following operations before the similarity determination module calculates the similarity between the first participle corresponding to the historical query request and the second participle corresponding to the page:

removing stop words in the first participle and/or the second participle;

removing historical query requests and/or pages with the ratio of the number of stop words to the number of non-stop words exceeding a third preset threshold value from the query set and/or the page set;
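The two filtering operations listed above can be sketched as follows. This is a minimal example under stated assumptions: the stop word list, the function names and the handling of items that consist only of stop words are hypothetical and are not specified in the claims.

```python
STOP_WORDS = {"the", "a", "of", "to", "in"}  # hypothetical stop word list


def strip_stop_words(participles):
    """Remove stop words from a list of participles."""
    return [p for p in participles if p not in STOP_WORDS]


def stop_ratio_exceeds(participles, third_threshold):
    """True if (#stop words / #non-stop words) exceeds the threshold."""
    stops = sum(1 for p in participles if p in STOP_WORDS)
    non_stops = len(participles) - stops
    if non_stops == 0:
        # Assumption: an item made only of stop words counts as exceeding.
        return stops > 0
    return stops / non_stops > third_threshold


def filter_items(items, third_threshold):
    """items: {id: [participle, ...]} for queries or pages.
    Drops items with too many stop words, then strips stop words from the rest."""
    return {
        key: strip_stop_words(parts)
        for key, parts in items.items()
        if not stop_ratio_exceeds(parts, third_threshold)
    }
```

The ratio is computed on the original participles before stripping, so an item is judged by how much of it was stop words rather than by what remains afterwards.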

19. The analysis device of claim 18, further comprising:

the second similarity calculation unit is used for, for the historical query requests and the pages whose first similarity exceeds the first preset threshold, selecting, from the participles of the historical query request and the participles of the corresponding page respectively, the participles whose weight is higher than a fourth quantity threshold, and calculating a second similarity between the historical query request and the page by using the similarity calculation formula, wherein the fourth quantity threshold is larger than the third quantity threshold;

a matching determination unit, configured to determine that the historical query request and the page match each other when the second similarity exceeds a fourth predetermined threshold, where the fourth predetermined threshold is greater than the first predetermined threshold.