TWI601024B

TWI601024B - Sampling methods, systems and equipment

Info

Publication number: TWI601024B
Application number: TW098122780A
Authority: TW
Inventors: jun-lin Zhang; Jian Sun; Lei Hou; Qin Zhang
Original assignee: Alibaba Group Holding Ltd
Priority date: 2009-07-06
Filing date: 2009-07-06
Publication date: 2017-10-01
Also published as: TW201102846A

Description

Sampling analysis methods, systems and equipment

The present invention relates to the field of computer network technology, and in particular to a sampling analysis method, system, and device.

Search engines generally record users' queries. For a large search engine, the query records accumulated over a given period amount to a very large volume of data, and a large proportion of the query keywords are repeated; for a recent popular event, for example, the queries issued by different users are similar or even identical. To provide better service, a search engine service provider processes these query records, and one basic processing step is to merge identical query keywords, which greatly reduces the memory or disk space occupied by the stored data. For example, if the query keyword "Alibaba" was recently submitted 2000 times, the merged record takes the form "Alibaba 2000", where "Alibaba" is the user query keyword and 2000 is the number of times that keyword appears in the Query Log for the period. For statistical data that has been pre-aggregated in this way, however, how to sample query keywords so that the sample stays close to the true distribution of the query keywords becomes a problem to be solved.

In the prior art, for statistical data in the "query keyword PV (Page View)" format, where PV is the count of how many times a query keyword appears on the search platform, the first step is to calculate the proportion of each query keyword among all query keywords. Take the record "Alibaba 2000": the PV values of all query keywords in the query keyword set are first summed. Suppose this total PV is 1 million, meaning that users submitted 1 million query keywords in total. The proportion of the query keyword "Alibaba" among all query keywords is then 2000/1000000=0.002, which means that the probability of randomly drawing "Alibaba" from all query keywords is 0.002. Once the extraction probabilities of all query keywords have been calculated, query keywords can be sampled from the set of all query keywords according to each keyword's extraction probability, yielding the final sample data; analyzing the sample data reveals the distribution of user query keywords.

For example, suppose 10,000 query records are to be extracted as a sample from a query keyword set whose total PV is 1 million. The sampling proceeds as follows: the number of samples for a query keyword is determined from its extraction probability, that is, [number of samples of a query keyword] = [expected number of samples] * [extraction probability of that query keyword], where the number of samples of a query keyword and the expected number of samples are both positive integers. For example, since the probability of randomly drawing "Alibaba" is 0.002, 10000*0.002=20 "Alibaba" query keywords are extracted from the record "Alibaba 2000" as samples. The sample counts of the other query keywords are obtained from the same formula, and the sample counts of all query keywords sum to 10,000. Compared with handling 1 million query records, analyzing 10,000 sampled records greatly reduces the data analyst's workload and the number of computation steps, which improves efficiency.
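
The prior-art allocation rule can be sketched in a few lines of Python. This is only an illustration: the function name per_keyword_sample_counts and its parameters are hypothetical, and rounding to the nearest integer is an assumption, since the text only requires the counts to be integers. With a PV of 2000 out of a total of 1,000,000, the ratio evaluates to 0.002, which gives 20 samples out of 10,000.

```python
def per_keyword_sample_counts(pv_by_keyword, total_pv, expected_samples):
    """Prior-art rule: samples for a keyword = expected_samples * (PV / total PV)."""
    counts = {}
    for keyword, pv in pv_by_keyword.items():
        # Multiply before dividing so that exact cases stay exact in floating point.
        counts[keyword] = round(expected_samples * pv / total_pv)
    return counts


# Only one merged record is listed here; total_pv stands for the whole keyword set.
counts = per_keyword_sample_counts(
    {"Alibaba": 2000}, total_pv=1_000_000, expected_samples=10_000
)
print(counts)  # {'Alibaba': 20}
```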

In the course of implementing the present invention, the inventors found that the prior art has at least the following problem. When the number of items to be extracted is large, the prior-art sampling analysis method can approximate the true data distribution to some extent; when the number to be extracted is medium or small, however, the extracted result deviates substantially from the true distribution. The reason is that the statistical distributions of many data sets have a long tail: a very large number of entities or data items each occur with very low frequency. For keywords that users submit to a search engine, this means that many query keywords occur only a few times, some only once or twice. Although the probability of any one such keyword occurring is very low, these low-frequency query keywords together account for a large proportion of all query keywords. With such a long-tail distribution, the prior-art sampling analysis method fails to extract low-frequency query keywords. For example, suppose an application needs to extract 2000 query keywords from a set whose total PV is 1 million. For a query keyword such as "e-commerce 1", the probability of being drawn is only one in a million, so its sample count under the formula above is 2000*0.000001=0.002, which rounds to zero; low-frequency query keywords therefore cannot be extracted by this method.

Data extracted by the existing sampling analysis method thus differs greatly from the distribution of the real data. As a result, user demand and market dynamics cannot be accurately understood from a sampling analysis of search engine query keywords, and users cannot be well served with convenient, fast online e-commerce transaction services.
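
The long-tail effect can be reproduced with made-up data. The sketch below assumes a hypothetical keyword set in which one head keyword has a PV of 500,000 and 500,000 tail keywords each have a PV of 1, so that the total PV is 1 million; the names and the rounding rule are illustrative assumptions, not taken from the patent.

```python
def per_keyword_sample_counts(pv_by_keyword, expected_samples):
    """Per-keyword allocation: expected_samples * (PV / total PV), rounded."""
    total_pv = sum(pv_by_keyword.values())
    return {
        keyword: round(expected_samples * pv / total_pv)
        for keyword, pv in pv_by_keyword.items()
    }


# Hypothetical data: one head keyword plus 500,000 tail keywords seen once each.
pv_by_keyword = {"head": 500_000}
pv_by_keyword.update({f"tail-{n}": 1 for n in range(500_000)})

counts = per_keyword_sample_counts(pv_by_keyword, expected_samples=2000)

# Share of all queries that come from the tail, and samples the tail receives.
tail_share = 500_000 / sum(pv_by_keyword.values())
tail_samples = sum(c for keyword, c in counts.items() if keyword != "head")
print(tail_share, tail_samples)  # 0.5 0
```

In this constructed case the tail accounts for half of all queries, yet every tail keyword is allocated 2000/1000000 = 0.002 samples, which rounds to zero, so the tail is absent from the sample.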

Embodiments of the present invention provide a sampling analysis method, system, and device for analyzing large-scale search engine query data, so as to obtain a realistic data sample while using as little storage space as possible, to accurately understand user demand and market dynamics, and to improve service quality.

To achieve the above objective, in one aspect an embodiment of the present invention provides a sampling analysis method for analyzing large-scale search engine query data, comprising the following steps: dividing the query keywords into at least one query keyword subset according to the PV values in the query records of the different query keywords; calculating the number of samples for the query keyword subset; and extracting query data from the query keyword subset according to the number of samples.

In another aspect, an embodiment of the present invention provides a sampling analysis device for analyzing large-scale search engine query data, comprising: a dividing module, configured to divide the query keywords into at least one query keyword subset according to the PV values in the query records of the different query keywords; a calculation module, configured to calculate the number of samples for the query keyword subset divided by the dividing module; and a sampling module, configured to extract query data from the query keyword subset divided by the dividing module according to the number of samples obtained by the calculation module.

In a further aspect, an embodiment of the present invention provides a sampling analysis system for analyzing large-scale search engine query data, comprising: a search platform, configured to provide a search service for user queries and to record the PV values of the different query keywords; and a sampling analysis device, configured to divide the query keywords into at least one query keyword subset according to the PV values of the different query keywords recorded by the search platform, to calculate the number of samples for the query keyword subset, and to extract query data from the query keyword subset according to the number of samples.

Compared with the prior art, the embodiments of the present invention have the following advantages. The required query records can be randomly extracted from a large, pre-aggregated query keyword set, which both reduces the storage needed for subsequent calculation and addresses the risk, present in existing sampling analysis methods, that low-probability, low-frequency query keywords are underestimated. Query records are thus effectively drawn at random, so that medium-scale or small-scale samples more closely approximate the true distribution of the data. From the sample data, a search engine service provider can build a sound mathematical model, obtain real and valid information about the data distribution, accurately understand user needs and market dynamics, and adjust the content of the search engine service accordingly, thereby offering users a more convenient and faster online e-commerce trading platform and improving service quality.

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Embodiment 1 of the present invention provides a sampling analysis method for analyzing large-scale search engine query data, which includes the following steps:

Step S101: Divide the query keywords into at least one query keyword subset according to the PV values of the different query keywords.

Here the PV value is the number of times at least one query keyword appears on the search platform within a preset time period. Before the sampling analysis, the PV values of all user query keywords recorded by the search platform during a time period are stored and then sorted, in either ascending or descending order; query keywords with the same PV value are then grouped into one query keyword subset.

Step S102: Calculate the number of samples of query keywords for the query keyword subset.

Before query keywords are sampled from the set of all query keywords, the number of query keywords to be sampled and analyzed is first determined according to the needs of the application. Specifically, the sum of the PV values in each query keyword subset is calculated first; this is the SPV (Set Page View) value, that is, the total PV of a given query keyword subset. The SPV values of all query keyword subsets are then summed to obtain the TPV (Total Page View) value of the query keyword set, that is, the total number of times all user query keywords appear on the search platform within the preset time period. The ratio of a subset's SPV value to the TPV value gives the probability that the query keyword subset is drawn. Finally, the number of samples for a given query keyword subset is calculated from the predetermined number of query keywords to be extracted and the extraction probability of that subset.
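
The quantities of step S102 can be written compactly as follows. The symbols Q_i (the subset of query keywords whose PV value is i), M (the predetermined number of query keywords to extract), and N_i (the number of samples for subset Q_i) are notation assumed here for illustration; in practice N_i would be rounded to an integer.

```latex
\begin{aligned}
\mathrm{SPV}_i &= \sum_{q \in Q_i} \mathrm{PV}(q) = i \cdot |Q_i|, \\
\mathrm{TPV}   &= \sum_i \mathrm{SPV}_i, \\
P_i            &= \frac{\mathrm{SPV}_i}{\mathrm{TPV}}, \\
N_i            &= M \cdot P_i .
\end{aligned}
```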

Step S103: Extract query data of query keywords from the query keyword subset according to the number of samples.

Query data is extracted from the query keyword subset by random sampling, where the random sampling method includes the lottery method and/or the random number method. The randomly extracted query data can be used to analyze how the different query keywords searched by users are distributed over a period of time, and thus to understand user demand.

With this embodiment of the present invention, the required query data can be randomly extracted from a large, pre-aggregated query keyword set. This both reduces the storage needed for subsequent calculation and addresses the risk, present in many methods, that low-probability, low-frequency queries are underestimated. Query records are effectively drawn at random, so that medium-scale or small-scale samples more closely approximate the true distribution of the data, giving the search engine service provider accurate information on user demand and market dynamics and improving service quality.

Many search engine service providers need to offer a "what users are searching for right now" feature, whose purpose is to display in real time the query requests that users send to the search engine. Large search engines in particular are used by many users, and each user typically searches for several query keywords on the platform, so even over a very short period the recorded user queries amount to a very large volume of data, for example hundreds of millions of search requests in one day. A large proportion of the recorded user query keywords are repeated; that is, the query keywords sent by different users are similar or even identical. For a recent popular event, for example, millions of users may issue the same query about that event within a very short period. The search engine service provider needs to process this large number of query requests within a certain time in order to serve users better, and one basic processing step is to merge identical user query keywords, which greatly reduces the memory or disk space occupied by the stored data.

To understand user needs and provide better, more convenient service, the user query keywords from a given period must be sampled, analyzed, and investigated. The number of query keywords chosen for the sample is, of course, a very small fraction of all user query keywords. If the aggregated query records are sampled directly, the high-volume query keywords that were queried frequently during the period have a relatively high probability of being drawn, whereas the low-frequency, low-volume query keywords have a very low probability, so the original purpose of sampling is not achieved. In addition, because the display space on a web page is limited, it is impossible to show all users' real-time query keywords; the query keywords must instead be sampled and a mathematical model built from a small number of query records for display. To reflect users' query needs accurately, this sample must closely approximate the keyword distribution of the full set of user queries.

Embodiment 2 of the present invention addresses the problem of drawing a realistic sample from large-scale search engine queries that carry partial statistical information. It provides another sampling analysis method, which uses two-stage sampling to solve the problems in the prior art so that the sample data is close to the true distribution of the queries. The overall flow of the method is shown in FIG. 2. Suppose the number M of queries to be extracted has already been determined for a given application. In the first stage, the query keywords are classified according to the PV value of each search engine query keyword and the extraction probability of each class is calculated, from which the number of query keywords to be drawn from each class is obtained. In the second stage, the final query data is extracted by random sampling within a given class of query keywords.

In the following embodiment, the flows of the first-stage and second-stage sampling are described in further detail. In the first stage, the number of samples for each query keyword subset is calculated; the flow is shown in FIG. 3 and includes the following steps:

Step S301: Sort the search engine query keyword set, stored in the "query keyword PV" format, by PV value. The sort may be in descending or ascending order of PV value; the order does not affect the subsequent steps. For example, suppose that within a time period the total number TPV of all user query keywords recorded by the search engine is 100,000. Of these, 2000 records are queries for "Alibaba", stored as "Alibaba 2000"; 1800 are queries for "e-commerce", stored as "e-commerce 1800"; 500 are queries for "computer", stored as "computer 500"; 500 are queries for "clothing", stored as "clothing 500"; and so on. There are also 60 queries for "water cup", 60 for "pencil", 60 for "notepad", and so on, all stored in the same format. The query set is then arranged in ascending order: "water cup 60", "pencil 60", "notepad 60", ..., "computer 500", "clothing 500", "e-commerce 1800", "Alibaba 2000", ....

Step S302: Merge the query keywords that have the same PV value.

All query keywords with the same PV value can be regarded as one subset, QuerySet, of the query keyword set; the property shared by the queries in a QuerySet is that every query keyword has the same PV value. Different PV values thus yield different QuerySets. If the PV values range from 1 to K (K being a natural number greater than 1), the query keyword subsets QuerySet1, QuerySet2, ..., QuerySetK are obtained. In practice, of course, the PV values that the search engine counts for the query keywords within a time period may not be consecutive. Merging the query keywords with the same PV value from step S301 yields, in order, several query keyword subsets, such as QuerySet60, QuerySet500, QuerySet1800, QuerySet2000, and so on.
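
Steps S301 and S302 can be sketched as a sort followed by a merge of adjacent records. The sketch below uses only the records named in the example; the function name group_by_pv and the representation of a record as a (keyword, PV) tuple are assumptions made for illustration.

```python
from itertools import groupby

# Merged records in "query keyword PV" form, taken from the example in step S301.
records = [
    ("Alibaba", 2000), ("e-commerce", 1800), ("computer", 500), ("clothing", 500),
    ("water cup", 60), ("pencil", 60), ("notepad", 60),
]


def group_by_pv(records):
    """Sort by PV (step S301), then merge keywords with equal PV (step S302)."""
    ordered = sorted(records, key=lambda rec: rec[1])
    return {
        pv: [keyword for keyword, _ in group]
        for pv, group in groupby(ordered, key=lambda rec: rec[1])
    }


query_sets = group_by_pv(records)
print(list(query_sets))  # [60, 500, 1800, 2000]
print(query_sets[60])    # ['water cup', 'pencil', 'notepad']
```

Because groupby only merges adjacent items, the sort has to come first; sorting in descending order instead would produce the same subsets in the opposite order.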

Step S303: Calculate the extraction probability of each query keyword subset.

For the query keyword subset QuerySet_i made up of the queries whose PV value is i, the total PV of the subset, that is, its SPV value, is calculated as SPV_i = i * |QuerySet_i|, where i is the PV value and |QuerySet_i| is the size of the subset, that is, the number of query keywords with PV value i that belong to it. For the subset QuerySet60 from step S302, suppose it contains 30 query records of the form "query keyword 60"; then |QuerySet60| equals 30, meaning the subset contains 30 different queries, and its SPV value is 60*30=1800. For data with a long-tail distribution, the smaller the PV value, the more query keywords the corresponding subset generally contains. So although the PV value of a single query keyword is small, the SPV, as an aggregate statistic, is not held down by the small PV of any single keyword. For example, the PV value of the record "water cup 60" is very small relative to the TPV of 100,000 for all queries, and even relative to the PV value of 2000 for "Alibaba 2000"; yet the subset QuerySet60 that contains it has an SPV value of 1800, very close to the SPV value of 2000 for the subset containing "Alibaba 2000" (assuming there is only one query record with a PV of 2000).

To calculate the extraction probability of each query keyword subset, the SPV values of all query keyword subsets are summed to obtain the total PV of all query keywords, called TPV. With TPV known, the probability that each query keyword subset is drawn during sampling can be calculated; for the subset with PV value i, this probability is P_i = SPV_i / TPV.

After the above steps, the probability P_i of being drawn is available for every query keyword subset. This probability is central to the sampling method, because for the many query keywords that occur with low frequency, the probability that an individual low-frequency query is drawn is very small. In a subset made up of keywords with the same PV value, however, the number of low-frequency keywords is usually large, so the SPV of a subset of low-frequency query keywords is still sizeable. When these low-frequency query keywords are sampled as a whole, their probability of being drawn is effectively amplified, and the final sample conforms more closely to the true distribution of the data. For example, the probability that the subset QuerySet60 is drawn is P_60 = 1800/100000 = 0.018, and the probability that QuerySet2000 is drawn is P_2000 = 2000/100000 = 0.020; the two probabilities are very close.
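
The SPV and extraction-probability calculation of step S303 can be checked with a short sketch. The function name extraction_probabilities and the mapping from PV value to subset size are illustrative assumptions; the figures are those given in the example.

```python
def extraction_probabilities(subset_sizes, total_pv):
    """Map each PV value i to (SPV_i, P_i), with SPV_i = i * |QuerySet_i|
    and P_i = SPV_i / TPV."""
    result = {}
    for pv, size in subset_sizes.items():
        spv = pv * size
        result[pv] = (spv, spv / total_pv)
    return result


# Figures from the example: 30 keywords with PV 60, one keyword with PV 2000,
# and TPV = 100,000. The subsets making up the rest of the TPV are not listed.
probs = extraction_probabilities({60: 30, 2000: 1}, total_pv=100_000)
print(probs)  # {60: (1800, 0.018), 2000: (2000, 0.02)}
```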

In the first stage of the present invention, suppose a particular application has fixed the number of samples at K (here K denotes the total number of samples, not the largest PV value). The number of queries to be drawn from each query keyword subset can then be calculated from the probability that the subset is drawn. For example, if K is 5000 and the extraction probability of the subset with PV=60 is 0.018, the number of queries to be drawn from QuerySet60 is 5000*0.018=90; if the extraction probability of the subset with PV=2 is 0.010, the number of queries to be drawn from QuerySet2 is 5000*0.010=50.
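
The first-stage allocation amounts to multiplying each subset's extraction probability by the total number of samples. A minimal sketch follows, in which the function name allocate is hypothetical and rounding to the nearest integer is an assumption, since the text does not say how fractional counts are handled.

```python
def allocate(probabilities, total_samples):
    """Number of queries to draw from each subset: total_samples * P_i, rounded."""
    return {pv: round(total_samples * p) for pv, p in probabilities.items()}


# Probabilities from the example: 0.018 for the PV=60 subset, 0.010 for PV=2.
sample_counts = allocate({60: 0.018, 2: 0.010}, total_samples=5000)
print(sample_counts)  # {60: 90, 2: 50}
```

With independent rounding, the per-subset counts are not guaranteed to add up exactly to the total number of samples; any correction step for that is outside what the text describes.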

In the second stage of sampling, the final query keywords are extracted from each query keyword subset. The classification and statistics of the first stage determine how many queries must be drawn from a given query keyword subset; the second stage then draws individual query keywords at random from that subset, as shown in FIG. 4. Because the first stage has already determined the number N of queries (N being a natural number) to be drawn from a given subset, the second stage samples that subset N times in succession, each time drawing one query record at random from the subset, until N records have been obtained. For example, for the subset QuerySet60, the first stage has already calculated from the subset's extraction probability that 90 queries are to be drawn from it; in the final keyword sampling, the subset is therefore sampled at random 90 times in succession to obtain 90 query records.

When a search engine query record is drawn from a query keyword subset, note that the first-stage classification placed query keywords with the same PV value in the same class. Each query keyword in the subset should therefore be equally likely to be selected in any single draw. For example, the 90 records drawn from QuerySet60 might include 2 "water cup" query records, 3 "pencil" query records, and so on. In this way, the sampling result for low-frequency queries can approximate the true distribution of the data, which achieves the original purpose of the sampling analysis.

Sampling within a query keyword subset at this stage can use any common random sampling method, such as the lottery method or the random number method. In this embodiment, the random number method is used to sample the query keywords in the subset; the specific calculation flow is shown in Algorithm 1.

Algorithm 1: Extracting any query record from the query subset using the random number method
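
The body of Algorithm 1 is not reproduced in this text. The sketch below is therefore only one plausible reading of the random number method, not the patent's own algorithm: draw a uniformly distributed random index into the subset so that every record is equally likely, and repeat the draw N times. Drawing with replacement is an assumption, suggested by the example in which 90 records are drawn from a subset of 30 keywords; the function names and the fixed seed are illustrative.

```python
import random


def draw_one(query_set, rng):
    """Pick one record by drawing a random index in [0, len(query_set) - 1];
    every record in the subset is equally likely."""
    index = rng.randrange(len(query_set))
    return query_set[index]


def draw_n(query_set, n, rng):
    """Repeat the single draw n times (with replacement) to collect n records."""
    return [draw_one(query_set, rng) for _ in range(n)]


rng = random.Random(0)  # fixed seed only to make the illustration repeatable
query_set_60 = ["water cup", "pencil", "notepad"]  # a few members of QuerySet60
sample = draw_n(query_set_60, 90, rng)
print(len(sample))  # 90
```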

With this embodiment of the present invention, a two-stage sampling analysis method randomly extracts the required queries from a large set of search engine query keywords that has already undergone preliminary statistical aggregation, and the sampling result approximates the distribution of the real data. This sampling approach retains and uses the preliminary statistics, which reduces the storage needed for subsequent calculation, and it also addresses the risk, present in many methods, that low-probability, low-frequency queries are underestimated, so the goal of random extraction for sampling analysis is achieved effectively. From the information obtained through the sampled data, a search engine service provider can understand user needs and market dynamics accurately, discover business opportunities, and adjust the service content of the search engine appropriately, thereby offering users a more convenient and faster online e-commerce trading platform and improving service quality.

Embodiment 3 of the present invention provides a sampling analysis system for analyzing data of large-scale search engine queries. Its structure is shown in FIG. 5 and includes: a search platform 1, configured to provide a search service for user queries and to record the PV values of different query keywords; and a sampling analysis device 2, configured to divide the query keyword set into at least one query keyword subset according to the PV values of the different query keywords recorded by the search platform 1, to calculate the sample number of each query keyword subset, and to extract query data from the query keyword subset according to that sample number.

The structure of the sampling analysis device 2 is shown in FIG. 6 and includes: a partitioning module 21, configured to divide the query keywords into at least one query keyword subset according to the PV values of the query records of different query keywords; a calculation module 22, configured to calculate the sample number of each query keyword subset produced by the partitioning module 21; and a sampling module 23, configured to extract query data from the query keyword subsets produced by the partitioning module 21 according to the sample numbers obtained by the calculation module 22.
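To show how the three modules could fit together, the sketch below chains partitioning, sample-number calculation, and sampling in one class. It is a sketch under stated assumptions, not the structure of the device in FIG. 6: the class and method names, the rounding of sample numbers to the nearest integer, and the toy keywords are all hypothetical.

```python
import random
from collections import defaultdict


class SamplingAnalysisDevice:
    """Toy composition of the partitioning, calculation and sampling modules."""

    def __init__(self, pv_by_keyword, rng=None):
        self.pv_by_keyword = dict(pv_by_keyword)
        self.rng = rng or random.Random()

    def partition(self):
        # Partitioning module: keywords with the same PV value share a subset.
        subsets = defaultdict(list)
        for keyword, pv in self.pv_by_keyword.items():
            subsets[pv].append(keyword)
        return dict(subsets)

    def sample_numbers(self, subsets, total):
        # Calculation module: quota = total * SPV / TPV (rounding is an assumption).
        tpv = sum(pv * len(kws) for pv, kws in subsets.items())
        return {pv: round(total * pv * len(kws) / tpv) for pv, kws in subsets.items()}

    def sample(self, total):
        # Sampling module: uniform draws inside each subset until its quota is met.
        subsets = self.partition()
        numbers = self.sample_numbers(subsets, total)
        return {
            pv: [self.rng.choice(subsets[pv]) for _ in range(n)]
            for pv, n in numbers.items()
        }


# Hypothetical PV counts: four keywords seen once, two seen 3 times, one seen 10 times.
pv_counts = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 3, "f": 3, "g": 10}
device = SamplingAnalysisDevice(pv_counts, random.Random(0))
result = device.sample(10)
```

With these toy counts the total page views are 20, so the subsets with PV values 1, 3 and 10 receive quotas of 2, 3 and 5 records respectively.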

In addition, the sampling analysis device 2 may further include a storage module 24, configured to store the PV values, where a PV value is specifically the number of times at least one query keyword appears within a preset time period.

The partitioning module 21 may further include: a sorting sub-module 211, configured to sort the PV values of the different query keywords stored by the storage module 24; and a classification sub-module 212, configured to classify, following the order produced by the sorting sub-module 211, the query keywords with the same PV value into one query keyword subset.
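The sort-then-classify behaviour of sub-modules 211 and 212 can be illustrated with a short sketch. The function name `partition_by_pv` and the PV counts for "water cup", "pencil" and "rare query" are assumptions chosen for the example; only the "Alibaba" count of 2000 appears in the description.

```python
from itertools import groupby


def partition_by_pv(pv_by_keyword, descending=False):
    """Sort keywords by PV value, then group equal PV values into subsets."""
    # Sorting step: order (keyword, pv) pairs by PV value.
    ordered = sorted(pv_by_keyword.items(), key=lambda kv: kv[1], reverse=descending)
    # Classification step: adjacent pairs with the same PV value form one subset.
    return {
        pv: [keyword for keyword, _ in group]
        for pv, group in groupby(ordered, key=lambda kv: kv[1])
    }


pv_counts = {"Alibaba": 2000, "water cup": 60, "pencil": 60, "rare query": 1}
subsets = partition_by_pv(pv_counts)
# subsets == {1: ["rare query"], 60: ["water cup", "pencil"], 2000: ["Alibaba"]}
```

Sorting first means a single linear pass is enough to form the subsets, since all keywords with the same PV value end up adjacent.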

The calculation module 22 may further include: a probability calculation sub-module 221, configured to calculate the extraction probability of each query keyword subset; and a sampling calculation sub-module 222, configured to calculate the sample number of the query keyword subset from the determined number of queries to be extracted and the extraction probability obtained by the probability calculation sub-module 221.
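A small sketch of the two calculation steps follows. It takes the extraction probability of a subset as the ratio of SPV (the sum of the PV values in the subset) to TPV (the total over the whole keyword set), as stated in the claims. Rounding the sample number to the nearest integer is an assumption, since the text does not say how fractional results are handled; the function names and toy data are hypothetical.

```python
def extraction_probabilities(subsets):
    """Map each PV value to SPV / TPV for its subset.

    `subsets` maps a PV value to the list of keywords having that PV value,
    so SPV for a subset is pv * len(keywords).
    """
    spv = {pv: pv * len(keywords) for pv, keywords in subsets.items()}
    tpv = sum(spv.values())
    return {pv: value / tpv for pv, value in spv.items()}


def sample_numbers(subsets, total_to_extract):
    """Sample number per subset: total_to_extract * probability (rounded, by assumption)."""
    spv = {pv: pv * len(keywords) for pv, keywords in subsets.items()}
    tpv = sum(spv.values())
    return {pv: round(total_to_extract * value / tpv) for pv, value in spv.items()}


# Hypothetical subsets: SPV values are 4, 6 and 10, so TPV is 20.
subsets = {1: ["a", "b", "c", "d"], 3: ["e", "f"], 10: ["g"]}
probabilities = extraction_probabilities(subsets)
numbers = sample_numbers(subsets, 10)
```

Because each subset is rounded independently, the sample numbers may not always add up to the requested total; a real system would need a rule for distributing the remainder.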

With the sampling analysis system and device provided by the embodiments of the present invention, the required query data can be extracted at random from a large set of query keywords that has already undergone preliminary statistical aggregation. This reduces the storage needed for subsequent calculation and addresses the risk, present in many methods, that low-probability, low-frequency query keywords are underestimated. The goal of randomly extracting query records is achieved effectively, so that medium-scale or small-scale samples come closer to the true distribution of the data, which gives search engine service providers accurate information on user needs and market dynamics and improves service quality.

For convenience of description, the parts of the system above are described separately as modules or devices divided by function. Of course, when the present invention is implemented, the functions of the modules or devices may be realized in one or more pieces of software or hardware.

The above modules may be located in one apparatus or distributed across multiple apparatuses. They may be combined into one module or further split into multiple sub-modules.

Those skilled in the art will understand that the drawings are only schematic diagrams of a preferred embodiment, and that the modules or flows in the drawings are not necessarily required to implement the present invention.

Those skilled in the art will understand that the modules of an apparatus in an embodiment may be distributed in that apparatus as described in the embodiment, or may, with corresponding changes, be located in one or more apparatuses different from that of the embodiment. The modules of the above embodiments may be combined into one module or further split into multiple sub-modules.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

From the description of the above embodiments, those skilled in the art will clearly understand that the present invention may be implemented in hardware, or in software together with a necessary general-purpose hardware platform. On this understanding, the technical solution of the present invention may be embodied as a software product. The software product may be stored on a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods described in the embodiments of the present invention.

The above discloses only several specific embodiments of the present invention. The present invention is not limited to them, and any variation that a person skilled in the art can conceive of shall fall within the scope of protection of the present invention.

1: Search platform

2: Sampling analysis device

21: Partitioning module

22: Calculation module

23: Sampling module

24: Storage module

211: Sorting sub-module

212: Classification sub-module

221: Probability calculation sub-module

222: Sampling calculation sub-module

To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a flow chart of a sampling analysis method in Embodiment 1 of the present invention;

FIG. 2 is a flow chart of a two-stage sampling analysis method in Embodiment 2 of the present invention;

FIG. 3 is a flow chart of the first-stage sampling method in Embodiment 2 of the present invention;

FIG. 4 is a flow chart of the second-stage sampling method in Embodiment 2 of the present invention;

FIG. 5 is a schematic structural diagram of a sampling analysis system in Embodiment 3 of the present invention;

FIG. 6 is a schematic structural diagram of the sampling analysis device in Embodiment 3 of the present invention.

Claims

A sampling analysis method for analyzing data of large-scale search engine queries, comprising: dividing query keywords into at least one query keyword subset according to the PV (Page View) values of the query records of different query keywords, wherein query keywords having the same PV value are classified into one query keyword subset; calculating a sample number of the query keyword subset, wherein, before the sample number of the query keyword subset is calculated, the method further comprises determining the number of query keywords to be extracted for the present sampling analysis, and wherein calculating the sample number of the query keyword subset comprises: calculating an extraction probability of the query keyword subset, and calculating the sample number of the query keyword subset according to the determined number of queries to be extracted and the extraction probability; and extracting query data from the query keyword subset according to the sample number.

The sampling analysis method of claim 1, wherein, before the query keywords are divided into at least one query keyword subset according to the PV values of the query records of different query keywords, the method further comprises: storing the PV values, wherein a PV value is specifically the number of times the query keyword appears on the search platform within a preset time period.

The sampling analysis method of claim 1, wherein dividing the query keywords into at least one query keyword subset according to the PV values of the different query keywords comprises: sorting the PV values; and classifying the query keywords having the same PV value into one query keyword subset.

The sampling analysis method of claim 3, wherein sorting the PV values comprises: sorting the PV values in ascending order; or sorting the PV values in descending order.

The sampling analysis method of claim 1, wherein calculating the extraction probability of the query keyword subset comprises: calculating the sum SPV of the PV values in the query keyword subset; obtaining, from the SPV values, the total number TPV (Total Page View) of query records in the query keyword set; and obtaining the extraction probability of the query keyword subset from the ratio of the SPV value to the TPV value.

The sampling analysis method of claim 1, wherein the query data is extracted from the query keyword subset according to the sample number by a random sampling method.

The sampling analysis method according to claim 6, wherein the random sampling method comprises: a lottery method and/or a random number method.

A sampling analysis device for analyzing data of large-scale search engine queries, comprising: a partitioning module, configured to divide query keywords into at least one query keyword subset according to the PV (Page View) values of the query records of different query keywords, wherein query keywords having the same PV value are classified into one query keyword subset; a calculation module, configured to calculate a sample number of the query keyword subset produced by the partitioning module, wherein, before the sample number of the query keyword subset is calculated, the number of query keywords to be extracted for the present sampling analysis is determined, and wherein calculating the sample number of the query keyword subset comprises: calculating an extraction probability of the query keyword subset, and calculating the sample number of the query keyword subset according to the determined number of queries to be extracted and the extraction probability; and a sampling module, configured to extract query data from the query keyword subset produced by the partitioning module according to the sample number obtained by the calculation module.

The sampling analysis device of claim 8, further comprising: a storage module, configured to store the PV values of the different query keywords, wherein a PV value is specifically the number of times at least one query keyword appears within a preset time period.

The sampling analysis device of claim 9, wherein the partitioning module comprises: a sorting sub-module, configured to sort the PV values of the different query keywords stored in the storage module; and a classification sub-module, configured to classify the query keywords having the same PV value into one query keyword subset according to the order produced by the sorting sub-module.

The sampling analysis device of claim 8, wherein the calculation module comprises: a probability calculation sub-module, configured to calculate the extraction probability of the query keyword subset; and a sampling calculation sub-module, configured to calculate the sample number of the query keyword subset according to the determined number of queries to be extracted and the extraction probability obtained by the probability calculation sub-module, wherein, before the sample number of the query keyword subset is calculated, the number of query keywords to be extracted for the present sampling analysis is determined.

A sampling analysis system for analyzing data of large-scale search engine queries, comprising: a search platform, configured to provide a search service for user queries and to record the PV (Page View) values of different query keywords; and a sampling analysis device, configured to divide query keywords into at least one query keyword subset according to the PV values of the different query keywords recorded by the search platform, and to calculate a sample number of the query keyword subset, wherein, before the sample number of the query keyword subset is calculated, the number of query keywords to be extracted for the present sampling analysis is determined, and wherein calculating the sample number of the query keyword subset comprises: calculating an extraction probability of the query keyword subset, and calculating the sample number of the query keyword subset according to the determined number of queries to be extracted and the extraction probability; the sampling analysis device being further configured to extract query data from the query keyword subset according to the sample number.