CN103902552B

CN103902552B - Stop word mining method and device, search method and device, and evaluation method and device

Info

Publication number: CN103902552B
Application number: CN201210572702.8A
Authority: CN
Inventors: 赵耀; 胡熠; 刘磊; 程佳
Original assignee: Shenzhen Shiji Guangsu Information Technology Co Ltd
Current assignee: Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority date: 2012-12-25
Filing date: 2012-12-25
Publication date: 2019-03-26
Anticipated expiration: 2032-12-25
Also published as: CN103902552A

Abstract

A stop word mining method comprises the following steps: obtaining a query log; obtaining at least one kind of attribute information from among the inverse document frequency of the query words in the query strings recorded in the query log, the relative word weight of the query words, the query word set generated by query-string modification behavior, and the set of correspondences between query strings and web page addresses generated by triggering behavior; and generating a stop word set according to the attribute information. A stop word mining device, a search method and search device, and an evaluation method and device for stop word mining algorithms are also provided. The stop word mining method and device improve the accuracy of the mined stop words. The search method and device simplify the original query string by removing stop words, so that more relevant web pages can be found and search accuracy is improved. The evaluation method and device evaluate stop word mining algorithms by cross-validation and identify the best algorithm by comparison.

Description

Stop word mining method and device, search method and device, and evaluation method and device

Technical field

The present invention relates to Internet technologies, and in particular to a stop word mining method and device, a search method and device, and an evaluation method and device for stop word mining algorithms.

Background art

A stop word is a query word that a search engine automatically ignores when indexing web pages or processing query requests. Stop words usually occur very frequently and carry no practical meaning, such as "the" and "a". Removing such words helps reduce the scale of web search and improves the accuracy of search results.

Traditionally, stop words are mined in two main ways: manual selection according to certain criteria, and automatic mining from web documents and search engine logs. Manual selection consumes a great deal of labor and is inefficient. Automatic mining from web documents and search engine logs takes two forms. In the first, a sample set is generated by random sampling, a weight is computed for each word in the sample set, and the words with the smallest weights are chosen to form the stop word set; the stop word set obtained in this way has low accuracy. In the second, the word at the left end of a query string is treated as a stop word; this approach has low accuracy when mining stop words from shorter queries.

Summary of the invention

In view of this, and to address the low accuracy of traditional stop word mining, it is necessary to provide a stop word mining method that can improve accuracy.

To address the same problem, it is also necessary to provide a stop word mining device that can improve accuracy.

It is also necessary to provide a search method that can improve accuracy.

It is also necessary to provide a search device that can improve accuracy.

It is also necessary to provide an evaluation method for stop word mining algorithms that can improve accuracy.

It is also necessary to provide an evaluation device for stop word mining algorithms that can improve accuracy.

A stop word mining method, comprising the following steps:

obtaining a query log;

obtaining at least one kind of attribute information from among the inverse document frequency of the query words in the query strings recorded in the query log, the relative word weight of the query words, the query word set generated by query-string modification behavior, and the set of correspondences between query strings and web page addresses generated by triggering behavior, and generating a stop word set according to the attribute information.

A stop word mining device, comprising:

an obtaining module, configured to obtain a query log;

a generation module, configured to obtain at least one kind of attribute information from among the inverse document frequency of the query words in the query strings recorded in the query log, the relative word weight of the query words, the query word set generated by query-string modification behavior, and the set of correspondences between query strings and web page addresses generated by triggering behavior, and to generate a stop word set according to the attribute information.

A search method, comprising the following steps:

obtaining a query string;

processing the query string using the stop word set generated by the above stop word mining method;

searching according to the processed query string.

A search device, comprising:

a query string obtaining module, configured to obtain a query string;

a processing module, configured to process the query string using the stop word set generated by the above stop word mining device;

a search module, configured to search according to the processed query string.

An evaluation method for stop word mining algorithms, comprising the following steps:

obtaining the stop word set produced by each of a plurality of mining algorithms;

for each stop word set, counting the number of its stop words that also appear in all of the remaining stop word sets, then the number that also appear in one fewer of the remaining stop word sets, and so on recursively, until the number of stop words that appear only in that stop word set itself is obtained;

computing a weighted sum of the counted numbers of stop words that appear in the remaining stop word sets, using the corresponding preset weights, to obtain a weighted evaluation value for each mining algorithm.
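The counting and weighting steps above can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: the function name `weighted_scores`, the representation of each stop word set as a Python set, the toy data, and the weight values are all hypothetical. Here `weights[j]` is the assumed preset weight for a stop word that also appears in exactly j of the other sets.

```python
from collections import Counter


def weighted_scores(stop_sets, weights):
    """Score each algorithm's stop word set by its agreement with the others.

    stop_sets: dict mapping an algorithm name to its set of stop words.
    weights:   list where weights[j] is the (assumed) preset weight for a
               word that also appears in exactly j of the other sets.
    """
    scores = {}
    for name, words in stop_sets.items():
        others = [s for other, s in stop_sets.items() if other != name]
        # counts[j] = number of this set's words found in exactly j other sets
        counts = Counter(sum(word in s for s in others) for word in words)
        scores[name] = sum(weights[j] * counts[j] for j in range(len(others) + 1))
    return scores


# Hypothetical toy data: three algorithms, weights chosen only for illustration.
stop_sets = {
    "idf": {"the", "a", "of"},
    "word_weight": {"the", "a", "official", "download"},
    "click": {"the", "a", "of"},
}
scores = weighted_scores(stop_sets, weights=[0.0, 0.5, 1.0])
```

With these assumed weights, a set whose words are confirmed by more of the other algorithms receives a higher value, which is the cross-validation idea the summary describes.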

An evaluation device for stop word mining algorithms, comprising:

an extraction module, configured to obtain the stop word set produced by each of a plurality of mining algorithms;

a statistics module, configured to count, for each stop word set, the number of its stop words that also appear in all of the remaining stop word sets, then the number that also appear in one fewer of the remaining stop word sets, and so on recursively, until the number of stop words that appear only in that stop word set itself is obtained;

a weighting module, configured to compute a weighted sum of the counted numbers of stop words that appear in the remaining stop word sets, using the corresponding preset weights, to obtain a weighted evaluation value for each mining algorithm.

The above stop word mining method and device generate the stop word set according to the inverse document frequency of query words, the relative word weight of query words, the query word set generated by query-string modification behavior, or the correspondence between query strings and web page addresses. Because the stop word set is generated by combining several kinds of real data, such as the user behavior and triggering behavior of users and the features of the query words, the accuracy of the stop words is improved.

The above search method and device remove stop words from the query string, which saves the large amount of storage space that indexing stop words would occupy. Removing stop words also simplifies the original query string, so that more relevant web pages can be found and search accuracy is improved. In addition, when search results are ranked, lowering the weight of the stop words in the query string moves web pages with actual semantic relevance to the front, which saves the user's browsing time.

The above evaluation method and device for stop word mining algorithms evaluate the algorithms by cross-validation and identify the best algorithm by comparison. The evaluation method and device are also applicable to other scenarios in which several algorithms perform the same task.

Brief description of the drawings

Fig. 1 is a flow diagram of the stop word mining method in one embodiment;

Fig. 2 is a flow diagram, in one embodiment, of obtaining the attribute information of the inverse document frequency of the query words in the query strings recorded in the query log and generating a stop word set according to the attribute information;

Fig. 3 is a flow diagram, in one embodiment, of obtaining the attribute information of the relative word weight of the query words recorded in the query log and generating a stop word set according to the attribute information;

Fig. 4 is a flow diagram of obtaining training data in one embodiment;

Fig. 5 is a flow diagram, in one embodiment, of obtaining the attribute information of the query word set generated by query-string modification behavior recorded in the query log and generating a stop word set according to the attribute information;

Fig. 6 is a schematic diagram of a session recorded in the query log in one embodiment;

Fig. 7 is a schematic diagram of part of the data recorded in the redundant collocation word set in one embodiment;

Fig. 8 is a flow diagram, in one embodiment, of obtaining the attribute information of the set of correspondences between query strings and web page addresses generated by triggering behavior recorded in the query log and generating a stop word set according to the attribute information;

Fig. 9 is a schematic diagram of the relationship between query strings and the corresponding triggered web pages;

Fig. 10 is a schematic diagram of part of the stop word sets obtained by the four mining methods;

Fig. 11 is a flow diagram of the search method in one embodiment;

Fig. 12 is a structural diagram of the stop word mining device in one embodiment;

Fig. 13 is a diagram of the internal structure of the generation module in one embodiment;

Fig. 14 is a diagram of the internal structure of the generation module in another embodiment;

Fig. 15 is a diagram of the internal structure of the training data obtaining unit in Fig. 14;

Fig. 16 is a diagram of the internal structure of the generation module in another embodiment;

Fig. 17 is a diagram of the internal structure of the generation module in another embodiment;

Fig. 18 is a structural diagram of the search device in one embodiment;

Fig. 19 is a flow diagram of the evaluation method for stop word mining algorithms in one embodiment;

Fig. 20 is a diagram of the internal structure of the evaluation device for stop word mining algorithms in one embodiment.

Detailed description of the embodiments

The technical solutions of the stop word mining method and device and of the evaluation method and device for stop word mining algorithms are described in detail below with reference to specific embodiments and the accompanying drawings, so that they become clearer.

As shown in Fig. 1, in one embodiment, a stop word mining method comprises the following steps:

Step S102: obtain a query log.

Specifically, the query log records the information generated when users enter query strings, perform user behavior, and trigger query results. The query log includes query strings, the web page addresses returned by queries, query-string modification behavior, web-page-address triggering behavior, the correspondence between query strings and web page addresses, and so on.

Step S104: obtain at least one kind of attribute information from among the inverse document frequency of the query words in the query strings recorded in the query log, the relative word weight of the query words, the query word set generated by query-string modification behavior, and the set of correspondences between query strings and web page addresses generated by triggering behavior, and generate a stop word set according to the attribute information.

The above stop word mining method generates the stop word set according to the inverse document frequency of query words, the relative word weight of query words, the query word set generated by query-string modification behavior, or the correspondence between query strings and web page addresses. Because the stop word set is generated by combining several kinds of real data, such as the user behavior and triggering behavior of users and the features of the query words, the accuracy of the stop words is improved.

As shown in Fig. 2, in one embodiment, the step of obtaining the attribute information of the inverse document frequency of the query words in the query strings recorded in the query log and generating a stop word set according to the attribute information comprises:

Step S202: obtain the inverse document frequency of all query words in a document set.

Specifically, IDF (Inverse Document Frequency) is obtained by dividing the total number of documents in a document collection by the number of documents that contain a given word and then taking the logarithm of the quotient. IDF is commonly used to describe the importance of a word. A large IDF value means that the word occurs in only a few documents and that its occurrence tends to provide important information. A small IDF value means that the word occurs in a large number of documents, usually has no clear semantics, and cannot provide important information. There are many specific formulas for computing IDF; this embodiment uses the following formula:

$$\mathrm{idf}_t = \log\frac{N}{D_t} \tag{1}$$

In formula (1), idf_t denotes the IDF value of word t, N denotes the number of documents in the entire document collection, and D_t denotes the number of documents in the collection that contain word t.

In this embodiment, the IDF values of all words in the documents of a predetermined number of web pages are computed.

Step S204: sort the inverse document frequencies.

Specifically, the words are sorted by IDF value in ascending or descending order.

In addition, denoising may be performed after the words are sorted, mainly by removing all nouns, personal names, place names, and the like. A word segmenter may be used to determine the part of speech of each word.

Step S206: select from the sorted result a predetermined number of query words with the smallest inverse document frequency to generate the stop word set.

Specifically, after sorting, the predetermined number of query words with the smallest inverse document frequency are selected. For example, if the words are sorted in ascending order, the predetermined number of top-ranked query words are chosen as stop words. In this embodiment the predetermined number is 500, but it is not limited to this value.
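Steps S202 to S206 can be sketched as follows. This is a minimal illustration, assuming the plain logarithm-of-quotient form of IDF described in the text and documents that have already been segmented into words; the function name `idf_stop_words` and the toy documents are hypothetical, and the optional part-of-speech denoising is omitted because it depends on a word segmenter.

```python
import math


def idf_stop_words(documents, k=500):
    """Return the k words with the smallest IDF, plus the IDF table.

    documents: list of documents, each already segmented into a list of words.
    IDF is computed as log(N / D_t): N documents in total, D_t containing word t.
    """
    n_docs = len(documents)
    doc_freq = {}
    for words in documents:
        for word in set(words):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}
    # Ascending sort: the smallest IDF values come first (ties broken by word).
    ranked = sorted(idf, key=lambda w: (idf[w], w))
    return ranked[:k], idf


# Hypothetical toy corpus of three segmented documents.
docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"]]
stop_words, idf = idf_stop_words(docs, k=2)
```

In this toy corpus "the" occurs in every document, so its IDF is log(3/3) = 0 and it ranks first; the embodiment uses k = 500 on a much larger document set.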

As shown in Fig. 3, in one embodiment, the step of obtaining the attribute information of the relative word weight of the query words recorded in the query log and generating a stop word set according to the attribute information comprises:

Step S302: obtain training data and extract the features of the query words in the training data.

Specifically, the features of a query word include its part of speech, the number of Chinese characters it contains, its IDF, its word frequency in the query log, its average mutual information, and so on. Average mutual information describes the correlation between two events, that is, how they co-occur, and is measured by the mathematical expectation of the mutual information between the two events.

Step S304: train on the features of the query words to construct a relative word weight evaluation model for query words.

Specifically, the AdaBoost training method is used to construct the relative word weight model of query words. The model evaluates features such as the part of speech of each query word, the number of Chinese characters it contains, its IDF, its word frequency, and its average mutual information, and combines them to obtain the relative word weight of the query word. AdaBoost is an iterative algorithm that trains different classifiers (weak classifiers) on the same training set and then combines these weak classifiers into a stronger final classifier (strong classifier). The algorithm works by changing the data distribution: it determines the weight of each sample according to whether that sample was classified correctly in the current round of training and according to the overall classification accuracy of the previous round. The data set with the modified weights is passed to the next classifier for training, and the classifiers obtained in all rounds are finally merged into the final decision classifier.

Step S306: analyze, according to the relative word weight evaluation model, the query words in the query strings collected within a first predetermined time, to obtain a low-weight word set.

Specifically, the first predetermined time can be set as needed, for example one day or one week. The relative word weight of each query word in each query string is computed, and the query word with the lowest relative word weight in each query string is extracted and added to the low-weight word set. A low-weight word list may also be used for recording.

Step S308: count the word frequency of each query word in the low-weight word set.

Specifically, the low-weight word set records the low-weight query words of all query strings within the first predetermined time. The same word may occur many times, so the word frequency of each query word is counted.

Step S310: sort the query words by word frequency.

Specifically, the query words may be sorted by word frequency in descending or ascending order.

Step S312: select a predetermined number of query words with the highest word frequency to generate the stop word set.

Specifically, the predetermined number can be set as needed; in this embodiment it is 500. The higher the word frequency, the more likely the query word is a stop word.
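Steps S306 to S312 can be sketched as follows, with the trained model abstracted away. The callable `weight_of` stands in for the relative word weight evaluation model and is hypothetical, as are the function name `low_weight_stop_words` and the toy weights; queries are assumed to be already segmented into query words.

```python
from collections import Counter


def low_weight_stop_words(queries, weight_of, k=500):
    """Pick the lowest-weight word of each query, then keep the k most frequent.

    queries:   list of query strings, each segmented into a list of query words.
    weight_of: callable (word, query_words) -> relative word weight; a
               hypothetical stand-in for the trained evaluation model.
    """
    low_weight_words = Counter()
    for words in queries:
        if not words:
            continue
        lowest = min(words, key=lambda w: weight_of(w, words))
        low_weight_words[lowest] += 1  # word frequency in the low-weight set
    return [word for word, _ in low_weight_words.most_common(k)]


# Hypothetical fixed weights used in place of a trained model.
toy_weights = {"the": 0.1, "download": 0.3, "game": 0.9, "map": 0.8}
queries = [["the", "game"], ["download", "game"], ["the", "map"]]
result = low_weight_stop_words(queries, lambda w, q: toy_weights[w], k=2)
```

Counting only the single lowest-weight word per query follows the description of step S306; a variant that keeps several low-weight words per query would need a threshold that the text does not specify.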

As shown in Fig. 4, in one embodiment, the step of obtaining training data comprises:

Step S402: obtain the web page content and the query words in the query string, respectively, according to the correspondence between query strings and web page addresses recorded in the query log.

Specifically, the query log records the correspondence between query strings and URLs (Uniform/Universal Resource Locator). The query words are obtained from the query string, and the web page content is obtained according to the web page address.

Step S404: determine whether a query word in the query string appears in the web page content; if so, go to step S406; if not, go to step S408.

Step S406: the query word is a high-weight word.

Step S408: the query word is a low-weight word.

Step S410: use the high-weight words and low-weight words as training data.
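The labelling rule of steps S404 to S410 can be sketched as follows. This is a minimal illustration: the function name `label_training_words`, the 1/0 label encoding, and the use of a plain substring test to decide whether a word "appears in" the page are assumptions, and fetching the page content from its URL is left out.

```python
def label_training_words(query_words, page_content):
    """Label each query word 1 (high weight) if it occurs in the page, else 0.

    query_words:  the query string segmented into words.
    page_content: text of the web page the query corresponds to in the log.
    """
    return [(word, 1 if word in page_content else 0) for word in query_words]


# Hypothetical example: "mssj" does not occur in the page text, "warcraft" does.
labels = label_training_words(
    ["mssj", "warcraft"], "world of warcraft official site"
)
```

The labelled words, together with features such as part of speech and IDF, would then form the training data for the weight model.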

As shown in Fig. 5, in one embodiment, the step of obtaining the attribute information of the query word set generated by query-string modification behavior recorded in the query log and generating a stop word set according to the attribute information comprises:

Step S502: collect the user behavior recorded in the query log within a second predetermined time, and generate query word sets according to the changes of the query string in the user behavior.

In this embodiment, the sequence of behaviors of one querying user within a preset time is first predefined as one session. For example, as shown in Fig. 6, within the preset time a user first enters query string A "mssj World of Warcraft" in a search engine and clicks the 1st result, then modifies query string A into query string B "World of Warcraft" and clicks the 2nd result. This process is one session.

Next, two sets are predefined:

The first set is the query word set Set(t), which records each query word t that is in query string A but not in query string B.

The second set is the query word relation set Set(<a, t>), which records each query word t that is in query string A but not in query string B, together with a query word a that is adjacent to t in query string A.

The query word set stores only the query words themselves, whereas the query word relation set stores not only the query words but also the context in which they occur, that is, query word pairs. For example, if the user first enters query string A "mssj World of Warcraft official website" and then modifies it into query string B "World of Warcraft", the elements of Set(t) are "mssj" and "official website", and the elements of Set(<a, t>) are "<mssj, Warcraft>" and "<World, official website>".

The above two sets are then generated for the user's query-string modification behavior in each session.

Step S504: take the union of the query word sets and select the predetermined number of query words that occur most frequently in the union to generate the stop word set.

Specifically, the predetermined number is set as needed, for example 500.

In this embodiment, the stop word mining method further comprises the steps of: generating query word relation sets according to the changes of the query string in the user behavior; and taking the union of the query word relation sets and selecting the predetermined number of query word pairs that occur most frequently in the union to generate a redundant collocation word set. Here a query word pair consists of a query word and a query word adjacent to it, such as "World" and "official website". Fig. 7 shows some of the redundant collocations recorded in the redundant collocation word set, such as "(official) download" and "(big) finale", where the word in parentheses is the redundant word.
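The construction of Set(t) and Set(<a, t>) and the frequency-based selection can be sketched as follows. This is an illustration under assumptions: queries are already segmented into words, each removed word is paired with every neighbour it has in query string A, pairs are written in the order the words appear in A (which matches the example in the text), and the "union" is treated as a multiset so that frequencies can be counted. The function names are hypothetical.

```python
from collections import Counter


def removed_words_and_pairs(query_a, query_b):
    """Return Set(t) and Set(<a, t>) for one modification of A into B."""
    kept = set(query_b)
    removed = {w for w in query_a if w not in kept}
    pairs = set()
    for i, word in enumerate(query_a):
        if word in kept:
            continue
        # Pair the removed word with each neighbour in A, in query order.
        if i > 0:
            pairs.add((query_a[i - 1], word))
        if i + 1 < len(query_a):
            pairs.add((word, query_a[i + 1]))
    return removed, pairs


def most_frequent(modifications, k=500):
    """Aggregate over all (A, B) modifications and keep the k most frequent."""
    word_counts, pair_counts = Counter(), Counter()
    for query_a, query_b in modifications:
        words, pairs = removed_words_and_pairs(query_a, query_b)
        word_counts.update(words)
        pair_counts.update(pairs)
    stop_words = [w for w, _ in word_counts.most_common(k)]
    collocations = [p for p, _ in pair_counts.most_common(k)]
    return stop_words, collocations


# The example from the text, with an assumed segmentation of query string A.
query_a = ["mssj", "Warcraft", "World", "official website"]
query_b = ["Warcraft", "World"]
words, pairs = removed_words_and_pairs(query_a, query_b)
```

For this example the sketch yields the same members as the text: "mssj" and "official website" in Set(t), and the pairs <mssj, Warcraft> and <World, official website> in Set(<a, t>).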

As shown in Fig. 8, in one embodiment, the step of obtaining the attribute information of the set of correspondences between query strings and web page addresses generated by triggering behavior recorded in the query log and generating a stop word set according to the attribute information comprises:

Step S602: obtain the set of correspondences between query strings and web page addresses generated by the triggering behavior recorded in the query log.

Specifically, the query strings entered by users and the web page results they triggered are obtained. The triggering behavior may be produced by clicking a web page with an input device such as a keyboard, mouse, or touch screen. Different query strings may trigger the same web page address. As shown in Fig. 9, for different query strings A, B, and C, the user clicks web page 1 and web page 2 when querying A, clicks web page 2 when querying B, and clicks web page 2 and web page 3 when querying C. Because query strings A, B, and C all lead to clicks on the same web page 2, it can be assumed that the three query strings may be semantically close. The query log records the user's triggering behavior (such as click behavior) in the form of <Query, URL> pairs, where Query is the query string and URL is the web page address triggered for that query string.

Step S604: look up in the set of correspondences all query strings that correspond to the same web page address.

Specifically, all query strings corresponding to the same web page address are found and denoted <URL, QuerySet>, where QuerySet is the query string set.

In addition, each QuerySet may be denoised: assuming the shortest query string in the QuerySet contains n query words, all query strings whose length is greater than n+2 are removed.

Step S606: obtain the redundancy of each query word in all query strings corresponding to each web page address.

Specifically, assume the size of the denoised QuerySet is m, that is, the QuerySet contains m query strings. For each word t in the QuerySet, the redundancy of t is computed as r(t) = (m - df(t)) / m, where df(t) is the number of query strings in the QuerySet in which word t occurs.

The redundancy of all words in all <URL, QuerySet> entries is computed.

Step S608: sort the query words by redundancy.

Specifically, the query words are sorted by redundancy in descending or ascending order.

Step S610: select the predetermined number of query words with the greatest redundancy to generate the stop word set.

Specifically, the predetermined number can be set as needed, for example 500. The predetermined number of query words with the greatest redundancy are taken as stop words to generate the stop word set. A stop word list may also be used to record the stop words.
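Steps S602 to S610 can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `redundancy_stop_words` and the toy click log are hypothetical, query length is assumed to be measured in query words, and, because the text does not say how to combine the redundancy of a word that occurs under several URLs, the sketch assumes the mean is used.

```python
from collections import defaultdict


def redundancy_stop_words(click_log, k=500):
    """Rank words by redundancy r(t) = (m - df(t)) / m over clicked query sets.

    click_log: iterable of (query, url) pairs, each query a sequence of words.
    """
    query_sets = defaultdict(set)  # <URL, QuerySet>
    for query, url in click_log:
        query_sets[url].add(tuple(query))

    per_word = defaultdict(list)
    for queries in query_sets.values():
        shortest = min(len(q) for q in queries)
        # Denoising: drop query strings longer than n + 2 words.
        kept = [q for q in queries if len(q) <= shortest + 2]
        m = len(kept)
        df = defaultdict(int)
        for q in kept:
            for word in set(q):
                df[word] += 1
        for word, count in df.items():
            per_word[word].append((m - count) / m)

    # Assumption: a word seen under several URLs gets its mean redundancy.
    redundancy = {w: sum(v) / len(v) for w, v in per_word.items()}
    ranked = sorted(redundancy, key=lambda w: (-redundancy[w], w))
    return ranked[:k], redundancy


# Hypothetical toy log: three queries that all led to a click on the same URL.
click_log = [
    (("warcraft",), "u2"),
    (("warcraft", "official"), "u2"),
    (("warcraft", "download"), "u2"),
]
top, redundancy = redundancy_stop_words(click_log, k=2)
```

In the toy log "warcraft" occurs in all three queries, so its redundancy is 0, while "official" and "download" each occur in one of three queries and get redundancy 2/3, which places them at the top of the ranking.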

Fig. 10 shows the stop word sets generated using the inverse document frequency of query words, the relative word weight, the query word set generated by query-string modification behavior, and the set of correspondences between query strings and web page addresses generated by triggering behavior. In Fig. 10, "IDF" denotes the stop word set generated according to the attribute information of the inverse document frequency of the query words in the query strings recorded in the query log; "Word weight" denotes the stop word set generated according to the attribute information of the relative word weight of the query words recorded in the query log; "User behavior" denotes the stop word set generated according to the attribute information of the query word set generated by query-string modification behavior; and "Triggering behavior" denotes the stop word set generated according to the attribute information of the set of correspondences between query strings and web page addresses generated by triggering behavior.

In addition, a search method is also provided. As shown in Fig. 11, it comprises the following steps:

Step S702: obtain a query string.

Step S704: process the query string using the stop word set generated by the above stop word mining method.

Specifically, the stop word set is generated according to the inverse document frequency of query words, the relative word weight, the query word set generated by query-string modification behavior, or the set of correspondences between query strings and web page addresses generated by triggering behavior.

Step S706: search according to the processed query string.

The above search method removes stop words from the query string, which saves the large amount of storage space that indexing stop words would occupy. Removing stop words also simplifies the original query string, so that more relevant web pages can be found and search accuracy is improved. In addition, when search results are ranked, lowering the weight of the stop words in the query string moves web pages with actual semantic relevance to the front, which saves the user's browsing time.
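The processing in step S704 can be sketched as a simple filter over a segmented query. The function name `strip_stop_words` is hypothetical, and the fallback of keeping the original query when every word is a stop word is an assumption added for the sketch; the text does not describe that case.

```python
def strip_stop_words(query_words, stop_words):
    """Remove stop words from a segmented query string.

    Assumption: if every word is a stop word, the original query is kept so
    that the search is not run on an empty query.
    """
    kept = [w for w in query_words if w not in stop_words]
    return kept or list(query_words)


# Hypothetical stop word set and queries.
stop_words = {"the", "official"}
cleaned = strip_stop_words(["the", "official", "warcraft"], stop_words)
unchanged = strip_stop_words(["the"], stop_words)
```

The processed word list would then be passed to the search step S706; lowering the ranking weight of stop words, as mentioned above, is an alternative to removing them outright.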

As shown in Fig. 12, in one embodiment, a stop word mining device includes an obtaining module 10 and a generation module 20, wherein:

The obtaining module 10 is configured to obtain a query log.

The generation module 20 is configured to obtain at least one kind of attribute information from among the inverse document frequency of the query words in the query strings recorded in the query log, the relative word weight of the query words, the query word set generated by query-string modification behavior, and the set of correspondences between query strings and web page addresses generated by triggering behavior, and to generate a stop word set according to the attribute information.

The above stop word mining device generates the stop word set according to the inverse document frequency of query words, the relative word weight of query words, the query word set generated by query-string modification behavior, or the correspondence between query strings and web page addresses. Because the stop word set is generated by combining several kinds of real data, such as the user behavior and triggering behavior of users and the features of the query words, the accuracy of the stop words is improved.

As shown in Fig. 13, in one embodiment, the generation module 20 includes an inverse document frequency obtaining unit 202, a first sorting unit 204, and a first generation unit 206, wherein:

The inverse document frequency obtaining unit 202 is configured to obtain the inverse document frequency of all query words in a document set.

The first sorting unit 204 is configured to sort the inverse document frequencies.

In addition, the generation module 20 may further include a first denoising unit configured to denoise the words after sorting, mainly by removing all nouns, personal names, place names, and the like. A word segmenter may be used to determine the part of speech of each word.

The first generation unit 206 is configured to select from the sorted result a predetermined number of query words with the smallest inverse document frequency to generate the stop word set.

As shown in Fig. 14, in one embodiment, the generation module 20 includes a training data obtaining unit 212, an evaluation model construction unit 214, a word weight analysis unit 216, a statistics unit 218, a second sorting unit 220, and a second generation unit 222, wherein:

The training data obtaining unit 212 is configured to obtain training data and extract the features of the query words in the training data.

The evaluation model construction unit 214 is configured to train on the features of the query words to construct a relative word weight evaluation model for query words.

The word weight analysis unit 216 is configured to analyze, according to the relative word weight evaluation model, the query words in the query strings collected within the first predetermined time, to obtain a low-weight word set.

The statistics unit 218 is configured to count the word frequency of each query word in the low-weight word set.

The second sorting unit 220 is configured to sort the query words by word frequency.

The second generation unit 222 is configured to select a predetermined number of query words with the highest word frequency to generate the stop word set.

As shown in Fig. 15, in one embodiment, the training data obtaining unit 212 includes a source data obtaining subunit 2122, a judgment subunit 2124, and a training data obtaining subunit 2126, wherein:

The source data obtaining subunit 2122 is configured to obtain the web page content and the query words in the query string, respectively, according to the correspondence between query strings and web page addresses recorded in the query log.

The judgment subunit 2124 is configured to determine whether a query word in the query string appears in the web page content; if so, the query word is a high-weight word; if not, the query word is a low-weight word.

The training data obtaining subunit 2126 is configured to use the high-weight words and low-weight words as training data.

As shown in FIG. 16, in one embodiment, the generation module 20 includes an acquisition unit 232 and a third generation unit 234, wherein:

The acquisition unit 232 is configured to collect the user behavior recorded in the query log within a second predetermined time, and to generate query word sets according to the changes of the query strings in the user behavior.

In the present embodiment, the generation module 20 further includes an initialization unit. The initialization unit first predefines a session as the sequence of user behaviors of one querying user within a preset time. For example, as shown in FIG. 6, within the preset time a user first enters query string A "mssj World of Warcraft" in a search engine and clicks the 1st result, then modifies query string A into query string B "World of Warcraft" and clicks the 2nd result; this process is one session.

Secondly, the initialization unit predefines two sets:

The third generation unit 234 is configured to take the union of the query word sets, select the predetermined number of query words that occur most frequently in the union, and generate the stop word set. Specifically, the predetermined number is set as needed, for example 500.

The acquisition unit 232 is also configured to generate query word relation sets according to the changes of the query strings in the user behavior.

The third generation unit 234 is also configured to take the union of the query word relation sets, select the predetermined number of query word pairs that occur most frequently in the union, and generate a redundant collocation table. A query word pair refers to a query word and the query word adjacent to it, such as "world" and "official website". FIG. 7 shows some of the redundant collocations recorded in the table, such as "(official) download" and "(grand) finale", where the word in brackets is the redundant word.
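One possible reading of "generating query word sets from the changes of the query strings" is sketched below. It rests on a loudly stated assumption: the words a user deletes when rewriting a query within a session are treated as the candidate words. The source does not define the sets precisely, so the function names and the sample sessions are hypothetical.

```python
from collections import Counter


def dropped_words(session_queries):
    """Yield words removed between consecutive query strings in one session.

    Assumption: a word present in the earlier query but absent from the
    rewritten query is a candidate redundant word.
    """
    for before, after in zip(session_queries, session_queries[1:]):
        kept = set(after.split())
        for word in before.split():
            if word not in kept:
                yield word


def most_dropped(sessions, predetermined_number):
    """Merge candidates from all sessions and keep the most frequent ones."""
    counts = Counter()
    for queries in sessions:
        counts.update(dropped_words(queries))
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:predetermined_number]


# Hypothetical sessions, the first modeled on the FIG. 6 example
sessions = [
    ["mssj world of warcraft", "world of warcraft"],
    ["warcraft official download", "warcraft download"],
    ["mssj patch", "patch"],
]
candidates = most_dropped(sessions, 1)
```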

As shown in FIG. 17, in one embodiment, the generation module 20 includes a relation set acquiring unit 242, a searching unit 244, a redundancy acquiring unit 246, a third sorting unit 248, and a fourth generation unit 250, wherein:

The relation set acquiring unit 242 is configured to obtain the set of correspondences between query strings and web page addresses generated by the triggering behavior recorded in the query log.

The searching unit 244 is configured to look up, in the correspondence set, all query strings corresponding to the same web page address.

In addition, the generation module 20 may also include a second denoising unit configured to denoise each QuerySet: assuming the shortest query string in a QuerySet contains n query words, all query strings whose length is greater than n+2 are removed.
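The lookup and denoising steps can be sketched as follows. The function names and sample click records are hypothetical, and length is assumed to be measured in whitespace-separated query words.

```python
from collections import defaultdict


def group_queries_by_url(click_pairs):
    """Build {url: QuerySet} from (query_string, url) click records."""
    query_sets = defaultdict(set)
    for query, url in click_pairs:
        query_sets[url].add(query)
    return dict(query_sets)


def denoise(query_set):
    """Drop query strings longer than n + 2 words, n being the shortest length."""
    if not query_set:
        return set()
    lengths = {q: len(q.split()) for q in query_set}
    n = min(lengths.values())
    return {q for q, length in lengths.items() if length <= n + 2}


# Hypothetical click records
pairs = [
    ("warcraft", "u1"),
    ("warcraft download", "u1"),
    ("warcraft official site", "u1"),
    ("how to download warcraft for free", "u1"),
    ("world cup", "u2"),
]
query_sets = group_queries_by_url(pairs)
cleaned = denoise(query_sets["u1"])
```

Here the shortest query for "u1" has one word, so the six-word query exceeds the n + 2 = 3 limit and is discarded.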

The redundancy acquiring unit 246 is configured to obtain the redundancy of each query word in all the query strings corresponding to each web page address.

The redundancy acquiring unit 246 calculates the redundancy of all words in all <URL, QuerySet> pairs.

The third sorting unit 248 is configured to sort the query words by redundancy.

The fourth generation unit 250 is configured to select the predetermined number of query words with the largest redundancy and to generate the stop word set. Specifically, the predetermined number can be set as needed, for example 500. The selected query words are taken as stop words to form the stop word set. A stop word table may also be used to record the stop words.

In addition, a search device is provided. As shown in FIG. 18, it includes a query string obtaining module 30, a processing module 40, and a search module 50, wherein:

The query string obtaining module 30 is configured to obtain a query string.

The processing module 40 is configured to process the query string using the stop word set generated by the stop word mining device described above.

The search module 50 is configured to search according to the processed query string.

The above search device removes stop words from the query string, which saves the large amount of storage space that an index built for stop words would otherwise occupy. Simplifying the original query string by removing stop words also allows more relevant web pages to be retrieved, improving search accuracy. In addition, when search results are ranked, lowering the weight of the stop words in the query string allows web pages with actual semantic relevance to be ranked first, saving the user browsing time.
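The processing step of the search device can be sketched as follows. The function name is hypothetical, tokenization by whitespace is an assumption (Chinese queries would need word segmentation first), and falling back to the original query when every word is a stop word is a design choice of this example rather than something the source specifies.

```python
def simplify_query(query_string, stop_words):
    """Remove stop words from a query string before it is sent to search.

    If every word would be removed, the original query is returned
    unchanged (an assumption of this sketch).
    """
    words = query_string.split()
    kept = [w for w in words if w.lower() not in stop_words]
    return " ".join(kept) if kept else query_string


# Hypothetical mined stop word set
mined = {"mssj", "official"}
simplified = simplify_query("mssj World of Warcraft", mined)
```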

As shown in FIG. 19, in one embodiment, a method for evaluating stop word mining algorithms includes the following steps:

Step S802: obtain the respective stop word sets of multiple mining algorithms.

In one embodiment, the step of obtaining the respective stop word sets of multiple mining algorithms includes: obtaining a stop word set generated according to the inverse document frequency of the query words in the query strings recorded in the query log; obtaining a stop word set generated according to the relative term weight of the query words recorded in the query log; obtaining a stop word set generated according to the query word sets produced by query string modification behavior; and obtaining a stop word set generated according to the set of correspondences between query strings and web page addresses produced by triggering behavior. The specific generation of each stop word set is as described in the stop word mining method and is not repeated here.

In the present embodiment, the stop word set generated by each mining algorithm contains 500 stop words.

Step S804: for each stop word set, count the number of its stop words that also appear in all of the remaining stop word sets, then the number that also appear in all but one of the remaining stop word sets, and so on recursively, until the number of stop words that appear only in the stop word set itself is obtained.

Specifically, taking four mining algorithms as an example, the values n₃, n₂, n₁, and n₀ are counted in turn. n₃ is the number of words that appear in the stop word set of a given mining algorithm and also in the stop word sets of the other three mining algorithms; n₂ is the number of words that appear in the stop word set of the given mining algorithm and also in the stop word sets of two other mining algorithms; n₁ is the number of words that appear in the stop word set of the given mining algorithm and also in the stop word set of one other mining algorithm; n₀ is the number of words that appear only in the stop word set of the given algorithm.

Step S806: compute a weighted sum of the counted numbers of stop words that also appear in the remaining stop word sets, using the corresponding preset weights, to obtain the weighted estimate of each mining algorithm.

Specifically, taking four mining algorithms as an example, the weighted estimate is calculated as follows:

V = 0.6 × n₂ + 0.3 × n₁ + 0.1 × n₀    (2)

In formula (2), n₂ is the number of words that appear in the Top-500 stop word set obtained by a given mining algorithm and also in the Top-500 stop word sets of two other algorithms. n₁ is the number of words that appear in the Top-500 stop word set obtained by the given mining algorithm and also in the Top-500 stop word set of one other mining algorithm. n₀ is the number of words that appear only in the Top-500 stop word set of the given mining algorithm itself. The values 0.6, 0.3, and 0.1 are the weights corresponding to n₂, n₁, and n₀ respectively. n₃ counts the stop words that appear in the stop word sets of all four mining algorithms; because it is the same for every algorithm, it can be left out of the calculation. The weighted estimates of the four mining algorithms obtained in this way are shown in Table 1.

Table 1

Algorithm	n₃	n₂	n₁	n₀	V
IDF	179	47	57	217	67
Word weight	179	100	87	134	99.5
User behavior	179	110	137	74	114.5
Triggering behavior	179	121	129	71	118.4

In Table 1, IDF denotes the algorithm that generates the stop word set from the inverse document frequency of the query words in the query strings recorded in the query log; Word weight denotes the algorithm that generates the stop word set from the relative term weight of the query words recorded in the query log; User behavior denotes the algorithm that generates the stop word set from the query word sets produced by query string modification behavior; Triggering behavior denotes the algorithm that generates the stop word set from the set of correspondences between query strings and web page addresses produced by triggering behavior. The larger V is, the better the algorithm.
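The counting of step S804 and the weighted sum of formula (2) can be sketched as follows. The function names and the small example sets are hypothetical; the weights 0.6, 0.3, and 0.1 and the counts used in the final check are taken from formula (2) and Table 1.

```python
def overlap_counts(stop_word_sets):
    """For each algorithm, count its words by how many OTHER sets contain them.

    Returns {name: counts} where counts[k] is the number of words shared
    with exactly k other sets, so counts = [n0, n1, n2, n3] for four sets.
    """
    results = {}
    for name, words in stop_word_sets.items():
        others = [s for other, s in stop_word_sets.items() if other != name]
        counts = [0] * (len(others) + 1)
        for word in words:
            counts[sum(word in s for s in others)] += 1
        results[name] = counts
    return results


def weighted_estimate(counts, weights=(0.1, 0.3, 0.6)):
    """V = 0.6*n2 + 0.3*n1 + 0.1*n0; any count beyond n2 (here n3) is ignored."""
    return sum(w * n for w, n in zip(weights, counts))


# Hypothetical miniature stop word sets for four algorithms
sets = {
    "a": {"x", "y", "z"},
    "b": {"x", "y"},
    "c": {"x"},
    "d": {"x", "w"},
}
counts_a = overlap_counts(sets)["a"]  # [n0, n1, n2, n3] for algorithm "a"

# Reproduce the V column of Table 1 from its n0, n1, n2, n3 columns
table_1 = {
    "IDF": [217, 57, 47, 179],
    "Word weight": [134, 87, 100, 179],
    "User behavior": [74, 137, 110, 179],
    "Triggering behavior": [71, 129, 121, 179],
}
v_values = {name: round(weighted_estimate(c), 1) for name, c in table_1.items()}
```

For every row of Table 1, n₃ + n₂ + n₁ + n₀ equals 500, the size of each stop word set, which is a quick consistency check on the counts.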

As shown in FIG. 20, in one embodiment, a device for evaluating stop word mining algorithms includes an extraction module 60, a statistical module 70, and a weighting module 80, wherein:

The extraction module 60 is configured to obtain the respective stop word sets of multiple mining algorithms. Specifically, the extraction module 60 is also configured to obtain a stop word set generated according to the inverse document frequency of the query words in the query strings recorded in the query log; to obtain a stop word set generated according to the relative term weight of the query words recorded in the query log; to obtain a stop word set generated according to the query word sets produced by query string modification behavior; and to obtain a stop word set generated according to the set of correspondences between query strings and web page addresses produced by triggering behavior.

The statistical module 70 is configured to count, for each stop word set, the number of its stop words that also appear in all of the remaining stop word sets, then the number that also appear in all but one of the remaining stop word sets, and so on recursively, until the number of stop words that appear only in the stop word set itself is obtained.

The weighting module 80 is configured to compute a weighted sum of the counted numbers of stop words that also appear in the remaining stop word sets, using the corresponding preset weights, to obtain the weighted estimate of each mining algorithm.

Those of ordinary skill in the art will understand that all or part of the processes in the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

The embodiments described above express only several implementations of the present invention. Although their description is relatively specific and detailed, they should not be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the invention, and all of these fall within the protection scope of the invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims

1. A stop word mining method, comprising the following steps:

obtaining a query log, the query log recording information generated when a user enters a query string to perform a query and when the user triggers a query result;

obtaining attribute information recorded in the query log, and generating a stop word set according to the attribute information, the attribute information including the inverse document frequency of the query words in the query strings, the relative term weight of the query words, the query word sets produced by query string modification behavior, and the set of correspondences between query strings and web page addresses produced by triggering behavior.

2. The stop word mining method according to claim 1, wherein the attribute information includes the inverse document frequency of the query words in the query strings recorded in the query log;

the step of generating the stop word set according to the attribute information includes:

obtaining the inverse document frequency of all query words in a document set;

sorting by the inverse document frequency;

selecting, from the ranking results, the predetermined number of query words with the smallest inverse document frequency, and generating the stop word set.

3. The stop word mining method according to claim 1, wherein the attribute information includes the relative term weight of the query words recorded in the query log;

obtaining training data, and extracting the features of the query words in the training data;

training on the features of the query words to construct a relative term weight estimation model for query words;

analyzing, according to the relative term weight estimation model, the query words in the query strings collected within a first predetermined time, to obtain a low-weight word set;

counting the word frequency of each query word in the low-weight word set;

sorting the query words by the word frequency;

selecting the predetermined number of query words with the highest word frequency, and generating the stop word set.

4. The stop word mining method according to claim 3, wherein the step of obtaining training data includes:

obtaining the web page content and the query words in the query string, respectively, according to the correspondence between the query strings and the web page addresses recorded in the query log;

judging whether each query word in the query string appears in the web page content; if so, the query word is a high-weight word, and if not, the query word is a low-weight word; and using the high-weight words and low-weight words as training data.

5. The stop word mining method according to claim 1, wherein the attribute information includes the query word sets recorded in the query log that are produced by query string modification behavior;

collecting the user behavior recorded in the query log within a second predetermined time, and generating query word sets according to the changes of the query strings in the user behavior;

taking the union of the query word sets, selecting the predetermined number of query words that occur most frequently in the union, and generating the stop word set.

6. The stop word mining method according to claim 5, further comprising the steps of:

generating query word relation sets according to the changes of the query strings in the user behavior;

taking the union of the query word relation sets, selecting the predetermined number of query word pairs that occur most frequently in the union, and generating a redundant collocation table.

7. The stop word mining method according to claim 1, wherein the attribute information includes the set of correspondences, recorded in the query log, between query strings and web page addresses produced by triggering behavior;

obtaining the set of correspondences between query strings and web page addresses produced by the triggering behavior recorded in the query log;

looking up, in the correspondence set, all query strings corresponding to the same web page address;

obtaining the redundancy of each query word in all the query strings corresponding to each web page address;

sorting the query words by redundancy;

selecting the predetermined number of query words with the largest redundancy, and generating the stop word set.

8. A stop word mining device, comprising:

an obtaining module, configured to obtain a query log, the query log recording information generated when a user enters a query string to perform a query and when the user triggers a query result;

a generation module, configured to obtain attribute information recorded in the query log and to generate a stop word set according to the attribute information, the attribute information including the inverse document frequency of the query words in the query strings, the relative term weight of the query words, the query word sets produced by query string modification behavior, and the set of correspondences between query strings and web page addresses produced by triggering behavior.

9. The stop word mining device according to claim 8, wherein the generation module includes:

an inverse document frequency acquiring unit, configured to obtain the inverse document frequency of all query words in a document set;

a first sorting unit, configured to sort by the inverse document frequency;

a first generation unit, configured to select, from the ranking results, the predetermined number of query words with the smallest inverse document frequency and to generate the stop word set.

10. The stop word mining device according to claim 8, wherein the generation module includes:

a training data acquiring unit, configured to obtain training data and to extract the features of the query words in the training data;

an estimation model construction unit, configured to train on the features of the query words and to construct a relative term weight estimation model for query words;

a word weight analysis unit, configured to analyze, according to the relative term weight estimation model, the query words in the query strings collected within a first predetermined time, and to obtain a low-weight word set;

a statistic unit, configured to count the word frequency of each query word in the low-weight word set;

a second sorting unit, configured to sort the query words by the word frequency;

a second generation unit, configured to select the predetermined number of query words with the highest word frequency and to generate the stop word set.

11. The stop word mining device according to claim 10, wherein the training data acquiring unit includes:

a source data obtaining subunit, configured to obtain the web page content and the query words in the query string, respectively, according to the correspondence between the query strings and the web page addresses recorded in the query log;

a judging subunit, configured to judge whether each query word in the query string appears in the web page content; if so, the query word is a high-weight word, and if not, the query word is a low-weight word;

a training data obtaining subunit, configured to use the high-weight words and low-weight words as training data.

12. The stop word mining device according to claim 8, wherein the generation module includes:

an acquisition unit, configured to collect the user behavior recorded in the query log within a second predetermined time, and to generate query word sets according to the changes of the query strings in the user behavior;

a third generation unit, configured to take the union of the query word sets, select the predetermined number of query words that occur most frequently in the union, and generate the stop word set.

13. The stop word mining device according to claim 12, wherein the acquisition unit is also configured to generate query word relation sets according to the changes of the query strings in the user behavior;

the third generation unit is also configured to take the union of the query word relation sets, select the predetermined number of query word pairs that occur most frequently in the union, and generate a redundant collocation table.

14. The stop word mining device according to claim 8, wherein the generation module includes:

a relation set acquiring unit, configured to obtain the set of correspondences between query strings and web page addresses produced by the triggering behavior recorded in the query log;

a searching unit, configured to look up, in the correspondence set, all query strings corresponding to the same web page address;

a redundancy acquiring unit, configured to obtain the redundancy of each query word in all the query strings corresponding to each web page address;

a third sorting unit, configured to sort the query words by redundancy;

a fourth generation unit, configured to select the predetermined number of query words with the largest redundancy and to generate the stop word set.

15. A search method, comprising the following steps:

obtaining a query string;

processing the query string using the stop word set generated by the stop word mining method according to any one of claims 1 to 7;

searching according to the processed query string.

16. A search device, comprising:

a query string obtaining module, configured to obtain a query string;

a processing module, configured to process the query string using the stop word set generated by the stop word mining device according to any one of claims 8 to 14;

a search module, configured to search according to the processed query string.

17. A method for evaluating stop word mining algorithms, comprising the following steps:

obtaining the respective stop word sets of multiple mining algorithms;

counting, for each stop word set, the number of its stop words that also appear in all of the remaining stop word sets, then the number that also appear in all but one of the remaining stop word sets, and so on recursively, until the number of stop words that appear only in the stop word set itself is obtained;

computing a weighted sum of the counted numbers of stop words that also appear in the remaining stop word sets, using the corresponding preset weights, to obtain the weighted estimate of each mining algorithm.

18. The method for evaluating stop word mining algorithms according to claim 17, wherein the step of obtaining the respective stop word sets of multiple mining algorithms includes:

obtaining a stop word set generated according to the inverse document frequency of the query words in the query strings recorded in the query log;

obtaining a stop word set generated according to the relative term weight of the query words recorded in the query log;

obtaining a stop word set generated according to the query word sets produced by query string modification behavior;

obtaining a stop word set generated according to the set of correspondences between query strings and web page addresses produced by triggering behavior.

19. A device for evaluating stop word mining algorithms, comprising:

a statistical module, configured to count, for each stop word set, the number of its stop words that also appear in all of the remaining stop word sets, then the number that also appear in all but one of the remaining stop word sets, and so on recursively, until the number of stop words that appear only in the stop word set itself is obtained;

a weighting module, configured to compute a weighted sum of the counted numbers of stop words that also appear in the remaining stop word sets, using the corresponding preset weights, to obtain the weighted estimate of each mining algorithm.

20. The device for evaluating stop word mining algorithms according to claim 19, wherein the extraction module is also configured to obtain a stop word set generated according to the inverse document frequency of the query words in the query strings recorded in the query log;

and to obtain a stop word set generated according to the relative term weight of the query words recorded in the query log;

and to obtain a stop word set generated according to the query word sets produced by query string modification behavior;

and to obtain a stop word set generated according to the set of correspondences between query strings and web page addresses produced by triggering behavior.