CN103136219B - Requirement mining method and device based on timeliness - Google Patents

Info

Publication number
CN103136219B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110379120.3A
Other languages
Chinese (zh)
Other versions
CN103136219A (en)
Inventor
黄际洲
钟华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110379120.3A priority Critical patent/CN103136219B/en
Publication of CN103136219A publication Critical patent/CN103136219A/en
Application granted granted Critical
Publication of CN103136219B publication Critical patent/CN103136219B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a timeliness-based requirement mining method and device. Click data for a selected time period is obtained from a search log, the click data at least including the user's search term (query) and the titles of the clicked webpages in the search results corresponding to the query. Queries whose corresponding webpage titles match a pattern of a preset requirement type are obtained from the click data. The click rate of each obtained query is calculated, and the queries whose click rate is greater than or equal to a preset ratio threshold are selected, where the click rate of a query is the ratio of the number of clicks of the query on webpage titles matching a pattern of the preset requirement type to the number of clicks of the query on all search results. Timeliness queries with the preset requirement type are obtained from the selected queries. The invention can improve the accuracy and recall of requirement query mining, thereby improving the effect of requirement identification.

Description

Requirement mining method and device based on timeliness
[ technical field ]
The invention relates to the technical field of computers, in particular to a demand mining method and device based on timeliness.
[ background of the invention ]
With the rapid development and maturity of the internet in the global scope, the information resources on the network are continuously abundant, the information data volume is rapidly expanding, and the acquisition of information through a search engine has become the main way for modern people to acquire information. To provide users with more convenient and accurate query services is the development direction of search engine technology in the present and future.
In search engine technology, identifying a user's search requirement is an important link in improving the accuracy and effectiveness of search, and it plays a particularly significant role in structured search (i.e., vertical search). One approach is to mine queries with a certain type of requirement from the search log in advance; when a user inputs a query, it is matched directly against the pre-mined queries to identify the requirement of the user's query.
When mining the queries of each requirement type, a keyword-based or template-based approach is generally adopted: a query containing a keyword of a certain requirement type, or conforming to a template of that type, is identified as having that requirement. However, this approach often cannot mine the timeliness queries of each requirement type. Take the query "family dish" as an example. Prior knowledge suggests that this query has only a recipe requirement, so a keyword-based or template-based method would identify it as having a recipe requirement even while a TV drama titled "Family Dish" is airing. During that period, however, the main requirement of a user who inputs "family dish" is likely to be video rather than recipes. Such timeliness queries cannot be recalled during requirement mining, which in turn reduces accuracy during requirement identification.
[ summary of the invention ]
The invention provides a method and a device for identifying timeliness query, which are used for improving the accuracy and recall rate of demand mining and further improving the effect of demand identification.
The specific technical scheme is as follows:
A requirement mining method based on timeliness, the method comprising:
S1, obtaining click data of a selected time period from a search log, wherein the click data at least comprises the user's search term (query) and the titles of the clicked webpages in the search results corresponding to the query;
S2, obtaining, from the click data, the queries whose corresponding webpage titles match a pattern of a preset requirement type;
S3, calculating the click rate of each query obtained in step S2, and selecting the queries whose click rate is greater than or equal to a preset ratio threshold, wherein the click rate of a query is the ratio of the number of clicks of the query on webpage titles matching a pattern of the preset requirement type to the number of clicks of the query on all search results;
S4, obtaining the timeliness queries with the preset requirement type from the selected queries.
According to a preferred embodiment of the present invention, the pattern of the requirement type is composed of one or any combination of phrases, words, attribute identifiers and segmentation symbols.
According to a preferred embodiment of the present invention, the mining of the pattern of the demand type specifically includes:
A1, acquiring webpage titles with timeliness as a corpus;
A2, clustering the acquired corpus;
A3, performing steps A31 to A33 on the webpage titles in each category of the clustering result:
A31, segmenting the webpage title into words;
A32, replacing the named entities in the word segmentation result with the corresponding named entity type identifiers;
A33, determining the n-grams (n-word phrases) of the word segmentation result, wherein n is one or more preset positive integers, counting the number of occurrences of each n-gram in the category to which the webpage title belongs, and extracting the n-grams whose number of occurrences meets a preset word selection requirement as patterns of the requirement type.
According to a preferred embodiment of the present invention, the step a1 specifically includes:
acquiring a webpage title from a news website with a preset demand type as a corpus; or,
and acquiring a clicked webpage title corresponding to the timeliness seed query of the preset requirement type from the search log as a corpus.
According to a preferred embodiment of the present invention, between the step a32 and the step a33, the method further comprises:
and searching a synonym word table, and normalizing the words in the word segmentation result into synonym roots.
According to a preferred embodiment of the present invention, after the step a33, the method further includes:
a34, verifying the mode of the requirement type, and reserving the mode which passes the verification, wherein the verification process specifically comprises the following steps: taking the webpage titles with timeliness as a positive example set, and taking the webpage titles with non-timeliness as a negative example set; respectively calculating the ratio of the number of samples in the positive example set matched with the pattern to the sum of the number of samples in the positive example set matched with the pattern and the number of samples in the negative example set, and taking the calculated ratio as the score of the pattern; if the score is greater than a preset score threshold, the verification passes.
According to a preferred embodiment of the present invention, the step S4 includes:
counting, for each query, the number of source websites of the corresponding webpage titles that match a pattern of the preset requirement type, and filtering out of the selected queries those whose number of source websites is lower than a preset number threshold; and/or,
and filtering the query containing the preset blacklist words from the selected query.
According to a preferred embodiment of the present invention, the step S4 includes:
obtaining the query with the searching times larger than a preset time threshold value in the vertical searching log of the preset requirement type, and obtaining an intersection of the obtained query and the selected query to obtain a timeliness query with the preset requirement type; or,
and respectively acquiring the search times of the selected query in the vertical search logs of the preset requirement type, and reserving the query with the search times larger than a preset time threshold as the timeliness query with the preset requirement type.
According to a preferred embodiment of the present invention, after the step S4, the method further includes:
and combining the timeliness query with the preset requirement type and the query with the preset requirement type mined in other modes to obtain the final mined query with the preset requirement type.
A requirement mining apparatus based on timeliness, the apparatus comprising:
the data acquisition unit is used for acquiring click data of a selected time period from a search log, wherein the click data at least comprises search terms query of a user and a clicked webpage title in a search result corresponding to the query;
the query acquisition unit is used for acquiring, from the click data, the queries whose corresponding webpage titles match a pattern of a preset requirement type;
the query selection unit is used for respectively calculating the click rate of each query acquired by the query acquisition unit, and selecting the query with the click rate larger than or equal to a preset ratio threshold, wherein the click rate of the query is as follows: the click number of the query on the webpage title with the preset requirement type mode accounts for the proportion of the click number of the query on all search results;
and the query determining unit is used for obtaining the timeliness query with the preset requirement type from the query selected by the query selecting unit.
According to a preferred embodiment of the present invention, the pattern of the requirement type is composed of one or any combination of phrases, words, attribute identifiers and segmentation symbols.
According to a preferred embodiment of the present invention, the apparatus further comprises: a pattern mining unit;
the pattern mining unit specifically includes:
the corpus acquiring subunit is used for acquiring a webpage title with timeliness as a corpus;
the clustering subunit is used for clustering the obtained linguistic data;
and the pattern extraction subunit is used for respectively executing the following steps on the webpage titles in each category of the clustering result: the method comprises the steps of segmenting words of a webpage title, replacing named entities in a word segmentation result with corresponding named entity type marks, determining n-element word groups n-grams of the word segmentation result, wherein n is one or more preset positive integers, counting the occurrence frequency of each n-gram in the category where the webpage title is located, and extracting n-grams with the occurrence frequency meeting the preset word selection requirements as a mode of the required type.
According to a preferred embodiment of the present invention, the corpus acquiring subunit acquires a web page title from a news website of a preset demand type as a corpus; or,
and acquiring a clicked webpage title corresponding to the timeliness seed query of the preset requirement type from the search log as a corpus.
According to a preferred embodiment of the present invention, the pattern extraction subunit is further configured to search the synonym table and normalize the words in the word segmentation result into the synonym root after replacing the named entity in the word segmentation result with the corresponding named entity type tag and before determining the n-gram of the word segmentation result.
According to a preferred embodiment of the present invention, the pattern extraction subunit is further configured to verify the pattern of the demand type, and reserve a pattern that passes verification, where the verification process specifically includes: taking the webpage titles with timeliness as a positive example set, and taking the webpage titles with non-timeliness as a negative example set; respectively calculating the ratio of the number of samples in the positive example set matched with the pattern to the sum of the number of samples in the positive example set matched with the pattern and the number of samples in the negative example set, and taking the calculated ratio as the score of the pattern; if the score is greater than a preset score threshold, the verification passes.
According to a preferred embodiment of the present invention, the query determining unit includes: a first filtering subunit and/or a second filtering subunit;
the first filtering subunit is configured to respectively count the number of websites from which the webpage titles having the pattern of the preset demand type corresponding to each query are sourced, and filter, from the selected query, the query corresponding to the webpage title whose source website number is lower than a preset number threshold;
and the second filtering subunit is used for filtering the query containing the preset blacklist words from the selected query.
According to a preferred embodiment of the present invention, the query determining unit includes: a first determining subunit or a second determining subunit;
the first determining subunit is configured to acquire a query of which the search time in the vertical search log of the preset demand type is greater than a preset time threshold, and obtain an intersection between the acquired query and the selected query to obtain a timeliness query with the preset demand type;
and the second determining subunit is configured to obtain the search times of the selected query in the vertical search log of the preset requirement type, and reserve the query with the search time greater than a preset time threshold as the timeliness query with the preset requirement type.
According to a preferred embodiment of the present invention, the apparatus further comprises: a mining merging unit, configured to merge the timeliness queries with the preset requirement type and the queries with the preset requirement type mined by other requirement mining devices, to obtain the finally mined queries with the preset requirement type.
According to the technical scheme, the query with the webpage title meeting the mode of the preset requirement type is obtained from the click data, and the query is selected based on the click rate of the query on the webpage title with the mode of the preset requirement type to obtain the timeliness query with the preset requirement type. Namely, by the method and the device, the query with timeliness can be automatically mined in the requirement mining, so that the requirement identification of the timeliness query can be realized during requirement identification, and the accuracy and the recall rate of the requirement identification are improved.
[ description of the drawings ]
FIG. 1 is a flowchart of a method provided in accordance with an embodiment of the present invention;
FIG. 2 is a flowchart of the mining of pattern short strings according to a second embodiment of the present invention;
fig. 3 is a structural diagram of a demand excavation apparatus according to a third embodiment of the present invention;
fig. 4 is a structural diagram of a mode mining unit according to a fourth embodiment of the present invention.
[ detailed description ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
It has been observed that the titles of timeliness webpages often conform to certain special patterns. If, during user searches, the number of clicks on such webpages is large while the number of clicks on other webpages is small, the query being searched has a timeliness requirement. For example, when users search for the query "family dish" while a TV drama named "Family Dish" is airing, they tend to click news-type webpages, whose titles typically share similar patterns. If the number of clicks on webpages with these patterns is much larger than on other webpages, the query has a timeliness requirement. Based on this consideration, the method and apparatus for identifying timeliness queries provided by the present invention are described in detail below with reference to embodiments.
Embodiment 1
Fig. 1 is a flowchart of a method according to an embodiment of the present invention, as shown in fig. 1, the method may include:
step 101: click data for a selected time period is obtained from the search log.
To analyze whether a query has a timeliness requirement in a certain time period, the click data of that time period is obtained. The selected time period may be the most recent period, for example the click data of the latest day. Because requirement mining is usually performed periodically, the time period in this step may also be kept consistent with the period of ordinary requirement mining; for example, if ordinary requirement mining is performed once a month, the time period in this step may also be one month, and the click data of the most recent month is obtained.
The acquired click data at least comprises: the query searched by the user and the title of the clicked webpage in the search result corresponding to the query.
Step 102: obtain, from the click data, the queries whose corresponding webpage titles match a pattern of the preset requirement type.
A pattern of the preset requirement type can be represented by a pattern short string: if a webpage title contains the pattern short string, the title matches the pattern of the preset requirement type. A pattern short string may be composed of, but is not limited to, phrases, words, and attribute identifiers, and may also include separator symbols such as 【】, [ ], ( ), { }, -, !, @, #, %, ……, &, and _.
For example, for the picture requirement type, pattern short strings may include "earthquake felt (picture)", "[number] people killed in accident", and "[number] people injured in accident (picture)", where [number] is the attribute identifier for a number.
The above-mentioned short pattern strings can be manually summarized by observing the titles of the search results, or can be automatically mined in a machine learning manner, and the automatic mining of the short pattern strings will be described in detail in the second embodiment.
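The containment test of Step 102 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes that an attribute identifier such as [number] should match any literal number in the title, and the names `ATTRIBUTE_REGEX`, `compile_pattern`, and `title_matches` are hypothetical.

```python
import re

# Assumption: each attribute identifier maps to a regular expression
# describing the literal values it stands for. Only [number] is covered here.
ATTRIBUTE_REGEX = {"[number]": r"\d+(?:\.\d+)?"}

# Split a pattern short string on attribute identifiers, keeping them as parts.
_SPLIT_RE = re.compile("(" + "|".join(re.escape(k) for k in ATTRIBUTE_REGEX) + ")")


def compile_pattern(pattern_string):
    """Turn a pattern short string into a compiled regular expression."""
    parts = _SPLIT_RE.split(pattern_string)
    # Attribute identifiers become their regex; all other text is matched literally.
    regex = "".join(ATTRIBUTE_REGEX.get(p, re.escape(p)) for p in parts)
    return re.compile(regex)


def title_matches(title, compiled_pattern):
    """A title matches if it contains the pattern short string anywhere."""
    return compiled_pattern.search(title) is not None


injured = compile_pattern("[number] people injured in accident (picture)")
print(title_matches("5 people injured in accident (picture)", injured))
print(title_matches("people injured in accident", injured))
```

Separator symbols such as parentheses are escaped, so they are matched literally rather than interpreted as regular expression syntax.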
Step 103: for each query, calculate the ratio of the number of clicks of the query on webpage titles matching a pattern of the preset requirement type to the number of clicks of the query on all search results, and select the queries whose ratio is greater than or equal to a preset ratio threshold.
The preset proportion threshold in this step may be set according to a specific requirement type and an actual situation, and may be set to 0.3 in one embodiment.
Assume that the query, corresponding web page title, and number of clicks obtained after step 102 are shown in table 1.
TABLE 1
After the proportion is calculated in step 103, it is determined that the proportions calculated by the "Shanghai train exhibition" and the "Baidu corporation" are both less than 0.3, and the proportion calculated by the "supernova outbreak" is greater than 0.3, so the "supernova outbreak" is selected.
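The click-rate calculation of Steps 102 and 103 can be sketched as below. It is a minimal illustration under stated assumptions: click data is given as (query, title, clicks) tuples, pattern matching is plain substring containment, and the function name and the sample click counts are invented for illustration.

```python
from collections import defaultdict


def select_queries_by_click_rate(click_records, patterns, ratio_threshold=0.3):
    """Return {query: click_rate} for queries at or above the threshold.

    click_records: iterable of (query, clicked_title, click_count) tuples.
    patterns: pattern short strings of the preset requirement type.
    """
    pattern_clicks = defaultdict(int)  # clicks on titles matching a pattern
    total_clicks = defaultdict(int)    # clicks on all search results
    for query, title, clicks in click_records:
        total_clicks[query] += clicks
        if any(p in title for p in patterns):
            pattern_clicks[query] += clicks

    selected = {}
    # Only queries with at least one matching title appear in pattern_clicks,
    # which corresponds to the query set obtained in Step 102.
    for query, matched in pattern_clicks.items():
        if total_clicks[query] == 0:
            continue  # avoid division by zero on empty click counts
        rate = matched / total_clicks[query]
        if rate >= ratio_threshold:
            selected[query] = rate
    return selected


# Invented sample data, not taken from the patent.
records = [
    ("supernova outbreak", "supernova outbreak (picture)", 40),
    ("supernova outbreak", "supernova encyclopedia entry", 60),
    ("Baidu corporation", "Baidu corporation office (picture)", 1),
    ("Baidu corporation", "Baidu corporation home page", 99),
]
print(select_queries_by_click_rate(records, ["(picture)"]))
```

With the threshold of 0.3 mentioned above, only the first query in this invented data is retained.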
Step 104: for each query, count the number of source websites of the corresponding webpage titles that match a pattern of the preset requirement type, and filter out the queries whose number of source websites is lower than a preset number threshold.
If the timeliness webpage titles of a query come from only a small number of websites, their appearance and clicks may result from the isolated behavior of those websites and do not reflect the timeliness of the query. This step therefore limits the number of source websites: only queries whose corresponding webpage titles come from a number of source websites greater than or equal to the preset number threshold are considered to have timeliness.
For example, the query obtained after step 103 and the corresponding number of source websites are shown in table 2.
TABLE 2
query | Number of source websites
supernova outbreak | 8
American crash stealth helicopter | 12
Libya street battle | 2
American satellite hitting the earth | 4
Assuming that the preset number threshold for source websites is 3, "Libya street battle" is filtered out, and the remaining queries are "supernova outbreak", "American crash stealth helicopter", and "American satellite hitting the earth".
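The source-website filter of Step 104 can be sketched as follows, using the counts from Table 2. The function names and the (query, site) input format are assumptions made for illustration.

```python
from collections import defaultdict


def count_source_sites(matched_records):
    """Count distinct source websites per query.

    matched_records: iterable of (query, source_site) pairs, one per clicked
    title that matches a pattern of the preset requirement type.
    """
    sites = defaultdict(set)
    for query, site in matched_records:
        sites[query].add(site)
    return {query: len(s) for query, s in sites.items()}


def filter_by_site_count(site_counts, min_sites=3):
    """Keep queries whose source-website count reaches the threshold."""
    return [q for q, n in site_counts.items() if n >= min_sites]


# Counts from Table 2.
table_2 = {
    "supernova outbreak": 8,
    "American crash stealth helicopter": 12,
    "Libya street battle": 2,
    "American satellite hitting the earth": 4,
}
print(filter_by_site_count(table_2, min_sites=3))
```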
Step 105: filter out the queries containing preset blacklist words.
This step uses blacklist rules to remove some obvious errors. The blacklist words involved include, but are not limited to, interrogative words, pornographic words, and the like.
It should be noted that the filtering operations in step 104 and step 105 are not essential, and the two steps may be executed alternatively or sequentially in any order.
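The blacklist filter of Step 105 reduces to a containment check. The sketch below assumes lowercase blacklist words and plain substring matching; the function name and the sample words are hypothetical.

```python
def filter_blacklisted(queries, blacklist_words):
    """Drop queries that contain any blacklist word.

    Assumption: blacklist words are lowercase and matched as substrings,
    so a short word may also match inside a longer, unrelated word.
    """
    return [
        q for q in queries
        if not any(word in q.lower() for word in blacklist_words)
    ]


# Hypothetical interrogative words used as the blacklist.
blacklist = ["how to", "why"]
print(filter_blacklisted(["supernova outbreak", "how to see a supernova"], blacklist))
```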
Step 106: obtain the queries whose number of searches in the vertical search log of the preset requirement type is greater than a preset count threshold, and take the intersection of these queries with the queries remaining after the filtering in step 105 to obtain the timeliness queries with the preset requirement type.
This step further ensures that the mined queries have a significant requirement of the preset requirement type, and it is not a necessary step of the present invention. If the number of searches of a query in the vertical search log of the preset requirement type is greater than the preset count threshold, the query has a strong requirement of that type.
Besides the above implementation, another approach may be adopted: obtain the number of searches, in the vertical search log of the preset requirement type, of each query remaining after the filtering in step 105, and retain the queries whose number of searches is greater than the preset count threshold as the timeliness queries with the preset requirement type.
Continuing the example, assume that the numbers of searches of "supernova outbreak", "American crash stealth helicopter", and "American satellite hitting the earth" in the picture vertical search log are 612, 5630, and 126 respectively. If the preset count threshold is 200, "supernova outbreak" and "American crash stealth helicopter" are finally determined to be timeliness queries with a picture requirement.
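Both variants of Step 106 can be sketched as below, using the search counts from this example. The function names and the dictionary representation of the vertical search log are assumptions.

```python
def select_by_intersection(candidates, vertical_counts, min_count=200):
    """Variant 1: intersect the candidates with the frequent vertical queries."""
    frequent = {q for q, count in vertical_counts.items() if count > min_count}
    return [q for q in candidates if q in frequent]


def select_by_lookup(candidates, vertical_counts, min_count=200):
    """Variant 2: look up each candidate's search count in the vertical log."""
    return [q for q in candidates if vertical_counts.get(q, 0) > min_count]


candidates = [
    "supernova outbreak",
    "American crash stealth helicopter",
    "American satellite hitting the earth",
]
# Search counts in the picture vertical search log, from the example above.
picture_log_counts = {
    "supernova outbreak": 612,
    "American crash stealth helicopter": 5630,
    "American satellite hitting the earth": 126,
}
print(select_by_intersection(candidates, picture_log_counts))
print(select_by_lookup(candidates, picture_log_counts))
```

The two variants return the same queries; the lookup variant avoids scanning the whole vertical search log when the candidate list is short.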
After the step, the obtained timeliness query with the preset requirement type can be combined with the query with the preset requirement type excavated by other methods to obtain the finally excavated query with the preset requirement type, so that the timeliness query which cannot be recalled by the existing requirement excavating method is made up.
Embodiment 2
Fig. 2 is a mining flow chart of a pattern short string provided in the second embodiment of the present invention, and as shown in fig. 2, the mining process specifically includes:
Step 201: acquire webpage titles with timeliness as a corpus.
In this step, the webpage titles with timeliness may be acquired in ways including, but not limited to, the following two:
Way 1: acquire webpage titles from news websites of the preset requirement type. Most of what news websites report is timely news, and to attract readers' attention, timely news headlines usually describe the core of an event in a uniform way and use wording that highlights the specific nature of the news. In the webpage titles of picture news websites, for example, uniform descriptions include "X people injured in accident" and "X people killed in accident", and wording that highlights the nature of the news includes "[HD large picture]", "(picture)", "photo gallery", and the like. Such titles are well suited to mining pattern short strings.
Way 2: acquire, from the search log, the clicked webpage titles corresponding to timeliness seed queries of the preset requirement type. Here, a number of timeliness seed queries can be set manually. The clicked webpage titles corresponding to these seed queries usually also show timeliness and share similar patterns, so they are also suitable for mining pattern short strings.
Step 202: cluster the acquired corpus.
In this step, clustering may be performed based on events, sites, channels, or the like. The purpose of clustering the webpage titles is that the titles within each resulting cluster use the same or similar expressions: webpage titles describing the same or similar events are more likely to share a pattern, as are webpage titles from the same site and webpage titles from the same channel. The clustering algorithm may be any existing text clustering method and is not described further here.
For example, assume that the news headlines obtained from the photo-like news website in step 201 are shown in table 3.
TABLE 3
Numbering | News headline | Source website
1 | Continuous occurrence of 6.0-grade and 4.1-grade earthquakes within 2 minutes in Ili, Xinjiang | Baidu News
2 | 5.4-grade earthquake occurs at the boundary of Gansu and Sichuan, earthquake felt (picture) | Baidu News
3 | Two people injured when a train and a truck collide in Anqing city | Baidu News
4 | 5 people injured in collision accident of train and truck in Japan (picture) | Sina News
5 | 5 carriages derailed in collision accident of train and truck in Kunming | Sina News
6 | 5.4-grade earthquake at the boundary of Gansu and Sichuan causes no casualties or house collapse | Sina News
7 | 5.4-grade earthquake occurs at the boundary of Longnan, Gansu and Qingchuan county, Sichuan (picture) | Sina News
8 | Indonesia accident of collision between train and car 8 people | Tencent News
9 | Greek train and truck collide, 5 people killed and dozens injured | Tencent News
10 | Explosion in Fuquan, Guizhou causes 4 deaths, cause under active investigation | Tencent News
11 | Nearly a hundred officers and soldiers join rescue after explosion at toll station in Fuquan, Guizhou | Tencent News
12 | Photo gallery: explosion in Fuquan, Guizhou causes many casualties | Tencent News
13 | [ personal Rapid news ] Yili, Xinjiang has 2 earthquakes of 6.0 grade and 4.1 grade within 2 minutes | People net
14 | 5.4-grade earthquake at the junction of Gansu and Sichuan shakes residents of Longnan | People net
15 | 8.8-grade earthquake occurs in Chile in 2010 | People net
16 | 7.1-grade earthquake occurs in Yushu county, Qinghai province | People net
17 | Tsunami and typhoon caused by 7.3-grade earthquake in Indonesia | People net
18 | 6.1-grade earthquake in Zhaosu county, Xinjiang | People net
After clustering by event, 3 types of clustering results were obtained as shown in table 4.
TABLE 4
Cluster description | News headline numbers
Earthquake | 1, 2, 6, 7, 13, 14, 15, 16, 17, 18
Traffic accident | 3, 4, 5, 8, 9
Explosion | 10, 11, 12
Step 203: perform steps 203_1 to 203_5 on each webpage title in each category of the clustering result.
Step 203_1: segment the webpage title into words.
Taking the webpage title "tsunami and typhoon caused by 7.3-level earthquake in Indonesia" as an example, the segmentation result is as follows: "Indonesia/occurrence/7.3/level/earthquake/initiation/tsunami/and/typhoon".
Step 203_2: replace the named entities in the word segmentation result with the corresponding named entity type identifiers.
Named entities herein include, but are not limited to: name of person, place name, organization name, number, date, currency, address, etc.
After this step is performed on the segmentation result of the previous example, "[place]/occurrence/[number]/level/earthquake/initiation/tsunami/and/typhoon" is obtained.
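Step 203_2 can be sketched on the tokens of this example as follows. This is a simplified illustration: a real system would use a named entity recognizer, whereas the sketch assumes a hypothetical place-name list and a regular expression for numbers, and covers only those two entity types.

```python
import re

# Assumption: a token made only of digits with an optional decimal part is a number.
NUMBER_RE = re.compile(r"^\d+(?:\.\d+)?$")


def replace_named_entities(tokens, place_names):
    """Replace number and place tokens with their entity type identifiers."""
    result = []
    for token in tokens:
        if NUMBER_RE.match(token):
            result.append("[number]")
        elif token in place_names:
            result.append("[place]")
        else:
            result.append(token)
    return result


tokens = ["Indonesia", "occurrence", "7.3", "level", "earthquake",
          "initiation", "tsunami", "and", "typhoon"]
# Hypothetical gazetteer of place names.
places = {"Indonesia", "Chile", "Xinjiang"}
print("/".join(replace_named_entities(tokens, places)))
```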
Step 203_3: look up a synonym table and normalize the words in the word segmentation result to their synonym root words.
In the synonym table, there is a root word for each group of synonyms, and the group of synonyms can be represented by the root word. The purpose of this step is to convert more synonymous expressions into a unified template, although this step is not a necessary operation.
Assuming that the root of the synonym of "and" is "and", the above example performs this step to obtain "[ place ]/occurrence/[ number ]/level/earthquake/initiation/tsunami/and/typhoon".
Step 203_ 4: and determining n-gram phrases (n-grams) of word segmentation results, counting the occurrence times of the n-grams in the category where the webpage title is located, and extracting the n-grams with the occurrence times meeting the preset word selection requirements as mode short strings.
The n-gram is a combination of n words with the minimum granularity, wherein n is preset one or more positive integers, and the n words sequentially appear. In embodiments of the present invention, n is typically selected to be one or more positive integers greater than or equal to 3, since there is a greater risk of escape from 1-grams and 2-grams.
The preset vote requirement can be that the occurrence number is ranked in the top N1, N1 is a preset positive integer, or the occurrence number is greater than a preset threshold.
Assuming that n is selected to be 3 and 4, the 3-gram obtained in the previous example has: [ Place ] occurrence [ number ], occurrence [ number ] level, [ number ] level earthquake, level earthquake initiation, tsunami initiation by earthquake, tsunami and typhoon initiation; the 4-gram has: [ Place ] occurrence [ number ] level, occurrence [ number ] level earthquake, [ number ] level earthquake initiation, tsunami initiation caused by level earthquake, tsunami and tsunami initiated by earthquake, tsunami and typhoon initiation.
Counting the n-grams over 10 earthquake-related web page titles yields the three n-grams with the most occurrences, as shown in Table 5.
TABLE 5
n-gram | Number of occurrences
[number] level earthquake | 10
occurrence [number] level earthquake | 7
[place] occurrence [number] level earthquake | 7
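The counting and selection described in Step 203_4 might look like the following sketch. It assumes that an n-gram is counted at most once per title, which the text does not specify; the function names and the three toy titles are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Optional, Sequence, Tuple

Gram = Tuple[str, ...]


def count_ngrams(titles: Iterable[Sequence[str]], sizes: Sequence[int]) -> Counter:
    """Count, within one category, how many titles contain each n-gram."""
    counts: Counter = Counter()
    for tokens in titles:
        seen = set()  # assumption: an n-gram counts once per title
        for n in sizes:
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
        counts.update(seen)
    return counts


def select_patterns(counts: Counter, top_n: Optional[int] = None,
                    min_count: Optional[int] = None) -> List[Gram]:
    """Keep the top_n most frequent n-grams, or those seen more than min_count times."""
    if top_n is not None:
        return [gram for gram, _ in counts.most_common(top_n)]
    if min_count is not None:
        return [gram for gram, c in counts.items() if c > min_count]
    raise ValueError("set either top_n or min_count")


# Toy category of already segmented and normalized titles.
titles = [
    ["[place]", "occurrence", "[number]", "level", "earthquake"],
    ["occurrence", "[number]", "level", "earthquake"],
    ["[number]", "level", "earthquake", "initiation", "tsunami"],
]
counts = count_ngrams(titles, sizes=(3,))
top = select_patterns(counts, top_n=1)
frequent = select_patterns(counts, min_count=1)
```

The two branches of `select_patterns` correspond to the two forms of the word selection requirement: ranking in the top N1, or exceeding a preset threshold.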
The extracted n-grams can be used directly as pattern short strings, which ends the pattern short string mining process. To make the mined pattern short strings more accurate, the following step can additionally be performed.
Step 203_5: verify the extracted pattern short strings and retain those that pass verification.
The specific verification process is as follows. Time-sensitive web page titles are taken as a positive example set and non-time-sensitive web page titles as a negative example set; for example, web page titles from news websites can serve as the positive example set, and web page titles from community websites (such as Baidu Tieba and Baidu Zhidao) as the negative example set. For each pattern short string, the ratio of the number of positive samples it matches to the sum of the number of positive samples and the number of negative samples it matches is calculated and taken as the score of the pattern short string. If the score is greater than a preset score threshold, the verification passes; otherwise it fails. A pattern short string matches a sample if the web page title contains the pattern short string; otherwise there is no match.
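A minimal sketch of the verification score follows. It assumes that titles have already been normalized in the same way as the pattern short strings, and it reads the denominator as counting only the negative samples that the pattern matches; the names, the toy titles and the threshold 0.6 are hypothetical.

```python
from typing import Iterable


def pattern_score(pattern: str, positive_titles: Iterable[str],
                  negative_titles: Iterable[str]) -> float:
    """Share of matched titles that come from the positive example set."""
    pos = sum(1 for title in positive_titles if pattern in title)
    neg = sum(1 for title in negative_titles if pattern in title)
    total = pos + neg
    return pos / total if total else 0.0  # unmatched patterns score 0 (assumption)


positive = [
    "[place] occurrence [number] level earthquake",
    "[number] level earthquake initiation tsunami",
    "weather forecast for the weekend",
]
negative = [
    "how to survive a [number] level earthquake",
    "cooking tips",
]
score = pattern_score("[number] level earthquake", positive, negative)
passed = score > 0.6  # hypothetical preset score threshold
```

Here the pattern matches two positive titles and one negative title, so the score is 2/3 and the pattern passes the assumed threshold.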
The above is a detailed description of the method provided by the present invention, and the following is a detailed description of the apparatus provided by the present invention with reference to the third embodiment.
Example III,
Fig. 3 is a structural diagram of a demand mining apparatus according to the third embodiment of the present invention. As shown in Fig. 3, the apparatus includes: a data acquisition unit 300, a query acquisition unit 310, a query selection unit 320, and a query determination unit 330.
The data acquisition unit 300 obtains click data of a selected time period from the search log, where the click data at least includes the search term (query) of the user and the clicked web page titles in the search results corresponding to the query.
The selected time period may be the most recent time period, for example the click data of the most recent day. Because demand mining is usually performed periodically, the selected time period may also match the period of regular demand mining; for example, if regular demand mining is performed once a month, the selected time period may also be one month, and the click data of the most recent month is acquired.
The query acquisition unit 310 obtains, from the click data, the queries whose corresponding web page titles match a pattern of a preset demand type.
The pattern of a demand type is composed of one or any combination of phrases, words, attribute marks and segmentation symbols. The pattern of the preset demand type can be represented by a pattern short string; if a web page title contains the pattern short string, the web page title matches the pattern of the preset demand type.
The pattern of the preset demand type may be summarized manually by observing the titles of search results, or mined automatically by machine learning. To mine the pattern of the preset demand type automatically, the apparatus further includes a pattern mining unit 340, whose specific structure is described in detail in the fourth embodiment.
The query selection unit 320 calculates the click rate of each query acquired by the query acquisition unit 310 and selects the queries whose click rate is greater than or equal to a preset ratio threshold, where the click rate of a query is the ratio of the number of clicks of the query on web page titles matching the pattern of the preset demand type to the number of clicks of the query on all search results.
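The click rate computation can be illustrated with the sketch below. It assumes the click data is available as (query, clicked title) pairs with titles normalized like the pattern short strings; the function name, the sample clicks and the threshold 0.5 are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def click_rates(click_data: Iterable[Tuple[str, str]],
                patterns: List[str]) -> Dict[str, float]:
    """Per query: clicks on pattern-matching titles divided by all clicks.

    Only queries with at least one matching clicked title are returned.
    """
    total: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for query, title in click_data:
        total[query] += 1
        if any(pattern in title for pattern in patterns):
            hits[query] += 1
    return {query: hits[query] / total[query] for query in hits}


clicks = [
    ("japan earthquake", "[place] occurrence [number] level earthquake"),
    ("japan earthquake", "[number] level earthquake initiation tsunami"),
    ("japan earthquake", "earthquake safety forum"),
    ("earthquake", "what is an earthquake"),
    ("earthquake", "[number] level earthquake drill"),
    ("earthquake", "earthquake encyclopedia"),
    ("earthquake", "earthquake games"),
]
rates = click_rates(clicks, patterns=["[number] level earthquake"])
threshold = 0.5  # hypothetical preset ratio threshold
selected = [query for query, rate in rates.items() if rate >= threshold]
```

In this toy data "japan earthquake" has a click rate of 2/3 and is selected, while "earthquake" has a click rate of 1/4 and is not.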
The query determination unit 330 obtains the time-sensitive queries with the preset demand type from the queries selected by the query selection unit 320.
To further improve the accuracy of mining time-sensitive queries, the query determination unit 330 may include a first filtering subunit 331 and/or a second filtering subunit 332. Fig. 3 shows the case where both subunits are included; in that case they may process the queries selected by the query selection unit 320 in either order.
The first filtering subunit 331 counts, for each query, the number of websites from which the web page titles matching the pattern of the preset demand type originate, and filters out of the selected queries those whose number of source websites is lower than a preset number threshold.
The second filtering subunit 332 filters out of the selected queries those containing preset blacklist words. Blacklist words include, but are not limited to, interrogative words, pornographic words, and the like.
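The two filters might be combined as in the following sketch. The mapping from each query to the set of source websites of its matching titles is assumed to have been built upstream; all names, hosts and thresholds are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def filter_queries(selected: Iterable[str],
                   sites_by_query: Dict[str, Set[str]],
                   min_sites: int,
                   blacklist: Iterable[str]) -> List[str]:
    """Drop queries backed by too few source websites or containing blacklist words."""
    blacklist = list(blacklist)
    kept = []
    for query in selected:
        if len(sites_by_query.get(query, set())) < min_sites:
            continue  # first filter: too few distinct source websites
        if any(term in query for term in blacklist):
            continue  # second filter: query contains a blacklist word
        kept.append(query)
    return kept


sites = {
    "japan earthquake": {"news.example.com", "daily.example.org"},
    "why earthquake": {"news.example.com", "daily.example.org"},
    "local rumor": {"forum.example.net"},
}
kept = filter_queries(["japan earthquake", "why earthquake", "local rumor"],
                      sites, min_sites=2, blacklist=["why"])
```

Because each filter only removes queries, applying them in either order gives the same result.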
In addition, to further ensure that the mined queries have a significant demand for the preset demand type, the query determination unit 330 may include a first determining subunit 333 or a second determining subunit (not shown in Fig. 3).
The first determining subunit 333 obtains the queries whose number of searches in the vertical search log of the preset demand type is greater than a preset count threshold, and takes the intersection of these queries and the selected queries to obtain the time-sensitive queries with the preset demand type.
The second determining subunit obtains the number of searches of each selected query in the vertical search log of the preset demand type, and retains the queries whose number of searches is greater than the preset count threshold as the time-sensitive queries with the preset demand type.
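The two alternatives can be sketched as follows, assuming the vertical search log is available as a plain list of searched queries; the function names, the sample log and the threshold are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def determine_by_intersection(selected: Iterable[str], vertical_log: List[str],
                              min_searches: int) -> Set[str]:
    """First alternative: collect frequent vertical queries, then intersect."""
    counts = Counter(vertical_log)
    frequent = {query for query, c in counts.items() if c > min_searches}
    return set(selected) & frequent


def determine_by_lookup(selected: Iterable[str], vertical_log: List[str],
                        min_searches: int) -> Set[str]:
    """Second alternative: look up the search count of each selected query."""
    counts = Counter(vertical_log)
    return {query for query in selected if counts[query] > min_searches}


vertical_log = ["japan earthquake"] * 5 + ["stock prices"] * 9 + ["earthquake drill"]
selected = ["japan earthquake", "earthquake drill"]
by_intersection = determine_by_intersection(selected, vertical_log, min_searches=3)
by_lookup = determine_by_lookup(selected, vertical_log, min_searches=3)
```

Both alternatives return the same set; which one is cheaper in practice would depend on the relative sizes of the vertical search log and the selected query list.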
Furthermore, the apparatus may include a mining merging unit 350, configured to merge the time-sensitive queries with the preset demand type obtained above with the queries of the preset demand type mined by other demand mining apparatuses, to obtain the final mined queries with the preset demand type.
Example four,
Fig. 4 is a structural diagram of the pattern mining unit according to the fourth embodiment of the present invention. As shown in Fig. 4, the pattern mining unit may specifically include: a corpus acquisition subunit 341, a clustering subunit 342, and a pattern extraction subunit 343.
The corpus acquisition subunit 341 acquires time-sensitive web page titles as a corpus.
Methods for obtaining the corpus include, but are not limited to: acquiring web page titles from news websites of the preset demand type as the corpus; or acquiring, from the search log, the clicked web page titles corresponding to time-sensitive seed queries of the preset demand type as the corpus.
The clustering subunit 342 clusters the acquired corpus.
Clustering may be based on events, websites, channels, and so on. Any existing text clustering method may be used as the clustering algorithm, and details are not repeated here.
The pattern extraction subunit 343 performs the following for each web page title in each category of the clustering result: segmenting the web page title into words; replacing the named entities in the word segmentation result with the corresponding named entity type marks; determining the n-gram phrases (n-grams) of the word segmentation result, where n is one or more preset positive integers; counting the number of occurrences of each n-gram in the category to which the web page title belongs; and extracting the n-grams whose number of occurrences meets the preset word selection requirement as patterns of the demand type.
The named entities referred to above include, but are not limited to: person names, place names, organization names, numbers, dates, currencies, addresses, and the like.
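The replacement of named entities by type marks can be illustrated as below. The sketch assumes that an upstream word segmenter and named entity recognizer (not shown, and not specified by the patent) has already labeled each token with an entity type or None; the function name and sample tokens are hypothetical.

```python
from typing import List, Optional, Tuple


def replace_entities(tagged_tokens: List[Tuple[str, Optional[str]]]) -> List[str]:
    """Replace each labeled token by its bracketed entity type mark."""
    return [f"[{etype}]" if etype else token for token, etype in tagged_tokens]


tagged = [
    ("Japan", "place"),
    ("occurrence", None),
    ("9", "number"),
    ("level", None),
    ("earthquake", None),
]
marked = replace_entities(tagged)
print("/".join(marked))
```

Replacing concrete entities with type marks lets titles about different places or magnitudes contribute to the same n-gram counts.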
In the synonym table, each group of synonyms usually has one root. To map more synonymous expressions onto a unified template, the pattern extraction subunit 343 is further configured to look up the synonym table, after replacing the named entities in the word segmentation result with the corresponding named entity type marks and before determining the n-grams of the word segmentation result, and to normalize the words in the word segmentation result to their synonym roots.
To further improve the accuracy of pattern mining, the pattern extraction subunit 343 is further configured to verify the patterns of the demand type and retain the patterns that pass verification.
The verification process is specifically as follows: time-sensitive web page titles are taken as a positive example set and non-time-sensitive web page titles as a negative example set; for each pattern, the ratio of the number of positive samples it matches to the sum of the number of positive samples and the number of negative samples it matches is calculated and taken as the score of the pattern; if the score is greater than a preset score threshold, the verification passes. A pattern matches a sample if the web page title contains the pattern; otherwise there is no match.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (16)

1. A timeliness-based demand mining method, characterized by comprising the following steps:
S1, obtaining click data of a selected time period from a search log, wherein the click data at least comprises the search terms (queries) of users and the clicked web page titles in the search results corresponding to the queries;
S2, obtaining, from the click data, the queries whose corresponding web page titles match a pattern of a preset demand type;
S3, calculating the click rate of each query obtained in step S2, and selecting the queries whose click rate is greater than or equal to a preset ratio threshold, wherein the click rate of a query is the ratio of the number of clicks of the query on web page titles matching the pattern of the preset demand type to the number of clicks of the query on all search results;
S4, obtaining the time-sensitive queries with the preset demand type from the selected queries.
2. The method according to claim 1, wherein the pattern of the demand type is composed of one or any combination of phrases, words, attribute identifiers and segmentation symbols.
3. The method according to claim 1 or 2, characterized in that mining the pattern of the demand type specifically comprises:
A1, acquiring time-sensitive web page titles as a corpus;
A2, clustering the acquired corpus;
A3, performing steps A31 to A33 for each web page title in each category of the clustering result:
A31, segmenting the web page title into words;
A32, replacing the named entities in the word segmentation result with the corresponding named entity type marks;
A33, determining the n-gram phrases (n-grams) of the word segmentation result, wherein n is one or more preset positive integers, counting the number of occurrences of each n-gram in the category to which the web page title belongs, and extracting the n-grams whose number of occurrences meets a preset word selection requirement as the pattern of the demand type.
4. The method according to claim 3, wherein step A1 specifically comprises:
acquiring web page titles from news websites of the preset demand type as the corpus; or,
acquiring, from the search log, the clicked web page titles corresponding to time-sensitive seed queries of the preset demand type as the corpus.
5. The method of claim 3, further comprising, between step A32 and step A33:
looking up a synonym table, and normalizing the words in the word segmentation result to their synonym roots.
6. The method of claim 3, further comprising, after step A33:
A34, verifying the pattern of the demand type and retaining the pattern that passes verification, wherein the verification process specifically comprises: taking time-sensitive web page titles as a positive example set and non-time-sensitive web page titles as a negative example set; calculating, for each pattern, the ratio of the number of samples in the positive example set matched by the pattern to the sum of the number of samples in the positive example set matched by the pattern and the number of samples in the negative example set, and taking the calculated ratio as the score of the pattern; if the score is greater than a preset score threshold, the verification passes.
7. The method according to claim 1, wherein step S4 includes:
counting, for each query, the number of websites from which the web page titles matching the pattern of the preset demand type originate, and filtering out of the selected queries those whose number of source websites is lower than a preset number threshold; and/or,
filtering out of the selected queries those containing preset blacklist words.
8. The method according to claim 1, wherein step S4 includes:
obtaining the queries whose number of searches in the vertical search log of the preset demand type is greater than a preset count threshold, and taking the intersection of the obtained queries and the selected queries to obtain the time-sensitive queries with the preset demand type; or,
obtaining the number of searches of each selected query in the vertical search log of the preset demand type, and retaining the queries whose number of searches is greater than a preset count threshold as the time-sensitive queries with the preset demand type.
9. A timeliness-based demand mining apparatus, characterized in that the apparatus comprises:
a data acquisition unit, configured to obtain click data of a selected time period from a search log, wherein the click data at least comprises the search terms (queries) of users and the clicked web page titles in the search results corresponding to the queries;
a query acquisition unit, configured to obtain, from the click data, the queries whose corresponding web page titles match a pattern of a preset demand type;
a query selection unit, configured to calculate the click rate of each query acquired by the query acquisition unit and select the queries whose click rate is greater than or equal to a preset ratio threshold, wherein the click rate of a query is the ratio of the number of clicks of the query on web page titles matching the pattern of the preset demand type to the number of clicks of the query on all search results;
a query determination unit, configured to obtain the time-sensitive queries with the preset demand type from the queries selected by the query selection unit.
10. The apparatus of claim 9, wherein the pattern of the demand type is composed of one or any combination of phrases, words, attribute identifiers and segmentation symbols.
11. The apparatus of claim 9 or 10, further comprising: a pattern mining unit;
the pattern mining unit specifically includes:
a corpus acquisition subunit, configured to acquire time-sensitive web page titles as a corpus;
a clustering subunit, configured to cluster the acquired corpus;
a pattern extraction subunit, configured to perform the following for each web page title in each category of the clustering result: segmenting the web page title into words, replacing the named entities in the word segmentation result with the corresponding named entity type marks, determining the n-gram phrases (n-grams) of the word segmentation result, wherein n is one or more preset positive integers, counting the number of occurrences of each n-gram in the category to which the web page title belongs, and extracting the n-grams whose number of occurrences meets a preset word selection requirement as the pattern of the demand type.
12. The apparatus according to claim 11, wherein the corpus acquisition subunit acquires web page titles from news websites of the preset demand type as the corpus; or,
acquires, from the search log, the clicked web page titles corresponding to time-sensitive seed queries of the preset demand type as the corpus.
13. The apparatus of claim 11, wherein the pattern extraction subunit is further configured to look up a synonym table and normalize the words in the word segmentation result to their synonym roots, after replacing the named entities in the word segmentation result with the corresponding named entity type marks and before determining the n-grams of the word segmentation result.
14. The apparatus according to claim 11, wherein the pattern extraction subunit is further configured to verify the pattern of the demand type and retain the pattern that passes verification, wherein the verification specifically includes: taking time-sensitive web page titles as a positive example set and non-time-sensitive web page titles as a negative example set; calculating, for each pattern, the ratio of the number of samples in the positive example set matched by the pattern to the sum of the number of samples in the positive example set matched by the pattern and the number of samples in the negative example set, and taking the calculated ratio as the score of the pattern; if the score is greater than a preset score threshold, the verification passes.
15. The apparatus of claim 9, wherein the query determination unit comprises: a first filtering subunit and/or a second filtering subunit;
the first filtering subunit is configured to count, for each query, the number of websites from which the web page titles matching the pattern of the preset demand type originate, and to filter out of the selected queries those whose number of source websites is lower than a preset number threshold;
the second filtering subunit is configured to filter out of the selected queries those containing preset blacklist words.
16. The apparatus of claim 9, wherein the query determination unit comprises: a first determining subunit or a second determining subunit;
the first determining subunit is configured to obtain the queries whose number of searches in the vertical search log of the preset demand type is greater than a preset count threshold, and to take the intersection of the obtained queries and the selected queries to obtain the time-sensitive queries with the preset demand type;
the second determining subunit is configured to obtain the number of searches of each selected query in the vertical search log of the preset demand type, and to retain the queries whose number of searches is greater than a preset count threshold as the time-sensitive queries with the preset demand type.
CN201110379120.3A 2011-11-24 2011-11-24 A kind of based on ageing demand method for digging and device Active CN103136219B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201110379120.3A | 2011-11-24 | 2011-11-24 | A kind of based on ageing demand method for digging and device

Publications (2)

Publication Number | Publication Date
CN103136219A | 2013-06-05
CN103136219B | 2016-08-17

Family

ID=48496055

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201110379120.3A | A kind of based on ageing demand method for digging and device (Active, CN103136219B (en)) | 2011-11-24 | 2011-11-24

Country Status (1)

Country Link
CN (1) CN103136219B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN104462259B (en) * | 2014-11-21 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | It is a kind of for providing the method and apparatus of timeliness picture search result
US10127322B2 (en) * | 2015-02-25 | 2018-11-13 | Microsoft Technology Licensing, Llc | Efficient retrieval of fresh internet content
CN105095434B (en) * | 2015-07-23 | 2019-03-29 | 百度在线网络技术(北京)有限公司 | The recognition methods of timeliness demand and device
CN105468782B (en) * | 2015-12-21 | 2019-05-17 | 北京奇虎科技有限公司 | A kind of method and device of the resource matched degree judgement of inquiry-
CN106919603B (en) * | 2015-12-25 | 2020-12-04 | 北京奇虎科技有限公司 | Method and device for calculating word segmentation weight in query word mode
CN107870913B (en) * | 2016-09-23 | 2021-12-14 | 腾讯科技(深圳)有限公司 | Efficient time high expectation weight item set mining method and device and processing equipment
CN108268552B (en) * | 2016-12-30 | 2020-08-11 | 北京国双科技有限公司 | Website information processing method and device
CN109582874B (en) * | 2018-12-10 | 2020-12-01 | 北京搜狐新媒体信息技术有限公司 | Bidirectional LSTM-based related news mining method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101055587A (en) * | 2007-05-25 | 2007-10-17 | 清华大学 | Search engine retrieving result reordering method based on user behavior information
CN101398856A (en) * | 2008-11-12 | 2009-04-01 | 北京搜狗科技发展有限公司 | Method for acquiring navigation enquiry words, device and method for displaying searching result
CN102004792A (en) * | 2010-12-07 | 2011-04-06 | 百度在线网络技术(北京)有限公司 | Method and system for generating hot-searching word
CN102129422A (en) * | 2010-01-14 | 2011-07-20 | 富士通株式会社 | Template extraction method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US8271516B2 (en) * | 2008-06-12 | 2012-09-18 | Microsoft Corporation | Social networks service

Legal Events

Code | Title
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
C14 | Grant of patent or utility model
GR01 | Patent grant