CN102737021B - Search engine and realization method thereof - Google Patents

Search engine and realization method thereof

Info

Publication number
CN102737021B
Authority
CN
China
Prior art keywords
synonym
query
linguistic context
word
search
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110081259.XA
Other languages
Chinese (zh)
Other versions
CN102737021A (en
Inventor
呼大为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110081259.XA
Publication of CN102737021A
Publication of CN102737021B

Abstract

The invention provides a search engine and a method for implementing it. The method comprises: receiving an original query from a current user; identifying an original word contained in the original query, a potential synonym pair comprising the original word and a synonym of the original word, and the synonym context of the potential synonym pair; judging whether the synonym context matches the original query and, if it does, substituting the synonym for the original word in the original query to obtain a synonym query; and searching according to the original query and the synonym query to obtain a set of result web pages. By judging the semantic environment, the search engine provides accurate synonym query expansion, thereby giving the user comprehensive and accurate search results while ensuring a good user experience.

Description

Search engine and its implementation
Technical field
The present invention relates to search engine technology, and in particular to a search engine that expands search queries with synonyms and to a method for implementing it.
Background art
The rapid development of the Internet has given people a brand-new carrier for storing, processing, transmitting, and using information, and online information has quickly become one of the main channels through which people acquire knowledge and information. Information resources on this scale encompass nearly all human knowledge, but they also confront users with the problem of how to exploit them fully. Search engines arose to meet this need by helping network users find information on the Internet. Specifically, a search engine follows a certain strategy and uses dedicated computer programs to gather information from the Internet, organizes and processes that information, provides a search service to users, and presents the information relevant to a user's search.
The online search service that a search engine provides is usually keyword based: the user enters a query expression in the search engine's input box, and the search engine looks up and returns the result pages that contain those keywords. Because users differ in background knowledge and habits, they may use different keywords to search for the same thing, and natural language itself contains many synonyms and near-synonyms, so searching only on the keywords the user supplies is not enough. Many search engines therefore offer query expansion, such as synonym query expansion. After receiving the original query expression entered by the user, the search engine segments it into terms and identifies whether the resulting term set contains a potential synonym pair. Specifically, the search engine matches the segmented terms against a predetermined synonym dictionary to judge whether any of them has a synonym; if so, it expands the search query on the basis of the synonym, merges the results of the expanded query with those of the original query, and returns the merged results to the user. In this way the user receives more relevant search results.
However, the same word can carry different meanings in different semantic environments, so its synonym is synonymous or nearly synonymous only in a particular semantic environment; once the semantic environment changes, the synonym no longer applies. In that case the results obtained through synonym query expansion may not be what the user wants and can instead give the user a poor experience. For example, suppose the original query entered by the user is "how to make fish-flavoured shredded pork". The search engine segments the original query, matches it against the synonym dictionary to obtain the potential synonym pair {"how to make", "menu"} for "how to make", replaces "how to make" with "menu" to run an expanded synonym query, and obtains the corresponding results. But if the original query is "how to make a bedside cupboard", the user clearly wants to learn how to build furniture. If the search engine still replaces "how to make" with "menu" to expand the query, it returns off-topic results that the user does not want, and the user will doubt the accuracy of the search.
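The problem described above can be made concrete with a minimal sketch of context-free synonym expansion. The table contents and the English renderings of the example queries are assumptions made for illustration; the function name `naive_expand` is hypothetical and is not part of the patent.

```python
# Hypothetical synonym table holding the single pair from the example.
SYNONYMS = {"how to make": "menu"}


def naive_expand(query: str) -> list:
    """Return the original query plus every context-free synonym rewrite."""
    expanded = [query]
    for original_word, synonym in SYNONYMS.items():
        if original_word in query:
            # The replacement ignores the rest of the query entirely.
            expanded.append(query.replace(original_word, synonym))
    return expanded


print(naive_expand("how to make fish-flavoured shredded pork"))
print(naive_expand("how to make a bedside cupboard"))
```

Both queries are rewritten, although only the first one concerns cooking; the second rewrite is the kind of off-topic expansion the invention sets out to avoid.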
In view of this, existing search engines need to be improved to solve the above problem.
Summary of the invention
One object of the present invention is to provide a search engine that, by judging the semantic environment, provides accurate synonym query expansion, thereby giving users comprehensive and accurate search results while ensuring a good user experience.
Another object of the present invention is to provide a method for implementing the above search engine.
To achieve one of the above objects, a method for implementing a search engine according to the present invention comprises the following steps:
Receiving an original query of a current user's search;
Identifying an original word contained in the original query, a potential synonym pair comprising the original word and its synonym, and the synonym context of the potential synonym pair;
Judging whether the synonym context matches the original query and, when the two match, substituting the synonym for the original word in the original query to obtain a synonym query, and searching according to the original query and the synonym query to obtain a set of result web pages.
As a further improvement of the present invention, the step of judging whether the synonym context matches the original query comprises: calculating the matching degree between the synonym context and the original query; and, when the value of the matching degree falls within a predetermined matching degree interval, determining that the synonym context matches the original query.
As a further improvement of the present invention, the matching degree is calculated from the length of the original query after the original word is removed and the length of the synonym context.
As a further improvement of the present invention, the method further comprises: before the step of judging the matching degree between the synonym context and the original query, performing forward maximum matching segmentation on the original query based on the term fragments contained in the synonym context, thereby obtaining the set of segmented terms.
As a further improvement of the present invention, the method further comprises the following steps:
Obtaining historical user query click data, the data comprising historical queries and the query result pages that were returned in response to those queries and clicked;
Identifying a synonym pair, the synonym pair comprising an original word present in the historical query and the corresponding synonym present in the query result page;
Recording at least the historical query and determining it to be a synonym context of the synonym pair.
As a further improvement of the present invention, the step of determining the synonym context further comprises recording the words adjacent to the original word in the historical query and determining them to be a synonym context.
As a further improvement of the present invention, the adjacent words comprise the terms located before the original word and the terms located after the original word in the historical query.
As a further improvement of the present invention, the adjacent words include an empty term.
As a further improvement of the present invention, the method further comprises, before the step of determining the synonym context, judging whether the title of the result web page contains the synonym but not the original word; if so, the step of determining the synonym context is performed; if not, it is not performed.
As a further improvement of the present invention, the step of determining the synonym context further comprises counting the frequency with which the synonym context is recorded and, when the frequency is greater than or equal to a predetermined frequency threshold, determining that the synonym context is a synonym context of the synonym pair.
As a further improvement of the present invention, the synonym context is determined according to the anchor text of web pages.
As a further improvement of the present invention, the synonym context is determined according to parallel segments in web page titles.
To achieve the other object above, a search engine according to the present invention comprises a search component, the search component comprising a query analysis module and a search module;
The query analysis module is configured to:
Receive an original query of a current user's search;
Identify an original word contained in the original query, a potential synonym pair comprising the original word and its synonym, and the synonym context of the potential synonym pair;
Judge whether the synonym context matches the original query and, when the two match, substitute the synonym for the original word in the original query to obtain a synonym query;
The search module is configured to search, when the synonym context matches the original query, according to the original query and the synonym query to obtain a set of result web pages.
As a further improvement of the present invention, when judging whether the synonym context matches the original query, the query analysis module is further configured to calculate the matching degree between the synonym context and the original query and, when the value of the matching degree falls within a predetermined matching degree interval, to determine that the synonym context matches the original query.
As a further improvement of the present invention, the matching degree is calculated from the length of the original query after the original word is removed and the length of the synonym context.
As a further improvement of the present invention, the query analysis module is further configured to: before judging the matching degree between the synonym context and the original query, perform forward maximum matching segmentation on the original query based on the term fragments contained in the synonym context, thereby obtaining the set of segmented terms.
As a further improvement of the present invention, the search engine further comprises a user query log analyzer configured to:
Obtain historical user query click data, the data comprising historical queries and the query result pages that were returned in response to those queries and clicked;
Identify a synonym pair, the synonym pair comprising an original word present in the historical query and the corresponding synonym present in the query result page;
Record at least the historical query and determine it to be a synonym context of the synonym pair.
As a further improvement of the present invention, in determining the synonym context the log analyzer also records the words adjacent to the original word in the historical query and determines them to be a synonym context.
As a further improvement of the present invention, the adjacent words comprise the terms located before the original word and the terms located after the original word in the historical query.
As a further improvement of the present invention, the adjacent words include an empty term.
As a further improvement of the present invention, the log analyzer is further configured to: before determining the synonym context, judge whether the title of the result web page contains the synonym but not the original word; if so, the step of determining the synonym context is performed; if not, it is not performed.
As a further improvement of the present invention, the log analyzer is further configured to: count the frequency with which the synonym context is recorded and, when the frequency is greater than or equal to a predetermined frequency threshold, determine that the synonym context is a synonym context of the synonym pair.
As a further improvement of the present invention, the synonym context is determined according to the anchor text of web pages.
As a further improvement of the present invention, the synonym context is determined according to parallel segments in web page titles.
Compared with the prior art, the beneficial effect of the present invention is as follows: the search engine analyzes the semantic environment of the current user's query to determine whether a synonym substitution is appropriate for synonym query expansion. This safeguards the accuracy of synonym query expansion, makes the expanded query match the user's need as closely as possible, and in turn ensures a good user experience.
Brief description of the drawings
Fig. 1 is a functional block diagram of the first embodiment of the search engine of the present invention;
Fig. 2 is a flowchart of the search engine shown in Fig. 1 mining synonym contexts;
Fig. 3 is a flowchart of the search engine shown in Fig. 1 performing synonym query expansion;
Fig. 4 is a functional block diagram of the second embodiment of the search engine of the present invention;
Fig. 5 is a flowchart of the search engine shown in Fig. 4 performing synonym query expansion;
Fig. 6 is a functional block diagram of the third embodiment of the search engine of the present invention;
Fig. 7 is a flowchart of the search engine shown in Fig. 6 performing synonym query expansion;
Fig. 8 is a functional block diagram of the fourth embodiment of the search engine of the present invention;
Fig. 9 is a flowchart of the search engine shown in Fig. 8 performing synonym query expansion;
Fig. 10 is a flowchart of one embodiment in which the search engine shown in Fig. 8 judges the synonym similarity level and labels the synonym accordingly.
Detailed description of the embodiments
The present invention is described below with reference to the embodiments shown in the drawings. These embodiments do not limit the invention; changes in structure, method, or function that a person of ordinary skill in the art makes on the basis of these embodiments all fall within the scope of protection of the present invention.
Fig. 1 is a functional block diagram of the first embodiment of the search engine 100 of the present invention. In this embodiment, the search engine 100 collects web pages from the Internet according to a certain strategy and, after organizing and processing them, responds to requests from the browser 21 of the client 20 and provides a search query service. The search engine 100 may comprise one or more web server entities that store and manage data and respond to search requests. The client 20 may comprise one or more user terminal devices, such as personal computers, notebook computers, wireless telephones, personal digital assistants (PDAs), or other computing and communication devices.
These servers and terminal devices all include some basic architectural components, such as a bus, a processing device, a storage device, one or more input/output devices, and a communication interface. The bus may comprise one or more conductors and enables communication among the components of the server or terminal device. The processing device comprises any type of processor or microprocessor for executing instructions and handling processes or threads. The storage device may include dynamic memory such as random access memory (RAM) for storing dynamic information, static memory such as read-only memory (ROM) for storing static information, and mass storage comprising magnetic or optical recording media and the corresponding drives. The input device lets the user enter information into the server or terminal device, for example a keyboard, mouse, stylus, voice recognition device, or biometric device. The output device includes a display, printer, loudspeaker, or the like for outputting information. The communication interface enables the server or terminal device to communicate with other systems or devices. The communication interfaces can be connected to a network through wired, wireless, or optical connections, so that the search engine 100 and the client 20 communicate with each other over the network. The network may include a local area network (LAN), a wide area network (WAN), a telephone network such as the public switched telephone network (PSTN), an intranet, the Internet, or a combination of these networks. The servers and terminal devices all include operating system software for managing system resources and controlling the execution of other programs, as well as application software or program instructions for implementing the functions of particular functional modules.
As shown in Fig. 1, the search engine 100 can perform synonym query expansion and can be divided, as a whole, into an offline part and an online part. In the offline part, the search engine 100 comprises a data repository 12 that can store web page data and synonym pair information, an indexer 13, a web crawler 14, a user query log database 16 that records user query information, and a log analyzer 17 that analyzes the user query logs.
The web crawler 14 is a program that fetches web pages one by one, according to a certain strategy, by following the hyperlinks between pages. In a specific embodiment, the web crawler 14 selects a URL (Uniform Resource Locator) to crawl from an initial URL library according to a scheduling strategy, resolves the web server address indicated in the URL, establishes a connection, sends a request, and receives data; it stores the retrieved page data in the web page library 122 of the data repository 12 to build a local document collection, then extracts links from it for the next round of crawling, and repeats this cycle until all URLs have been crawled. The scheduling strategy by which the web crawler 14 selects URLs may be a breadth-first strategy, a depth-first strategy, a backlink count strategy, or the like; the crawl mode may be cumulative or incremental. The indexer 13 analyzes the local document collection and builds an index for it. For example, it extracts terms from the full text of each document by word segmentation, filters out high-frequency or low-frequency words to obtain the set of index terms, and finally converts the mapping from web pages to index terms into a mapping from index terms to web pages, forming an inverted file that comprises an index lexicon and inverted lists and is stored in the index database 121 of the data repository 12. Methods for segmenting web documents include dictionary-based segmentation, understanding-based segmentation, and statistics-based segmentation. Common dictionary-based methods in turn include forward maximum matching, reverse maximum matching, and minimum segmentation.
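The indexing step described above (terms extracted per document, very frequent terms dropped, then the page-to-term mapping inverted) can be sketched as follows. This is a minimal illustration, not the patent's implementation: whitespace splitting stands in for word segmentation, only high-frequency terms are filtered, and the name `build_inverted_index` and the `max_doc_ratio` parameter are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs: dict, max_doc_ratio: float = 0.9) -> dict:
    """Map each retained index term to the sorted ids of the documents containing it."""
    term_to_docs = defaultdict(set)
    for doc_id, text in docs.items():
        # Whitespace splitting is an assumed stand-in for real word segmentation.
        for term in text.lower().split():
            term_to_docs[term].add(doc_id)
    # Drop terms that occur in more than max_doc_ratio of all documents.
    limit = max_doc_ratio * len(docs)
    return {
        term: sorted(doc_ids)
        for term, doc_ids in term_to_docs.items()
        if len(doc_ids) <= limit
    }


docs = {1: "pork menu", 2: "pork price", 3: "pork quotation"}
index = build_inverted_index(docs)
print(index)
```

In this toy collection "pork" occurs in every document, so it is filtered out as a high-frequency term and only the discriminating terms remain in the index.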
In the present invention, synonyms are terms that differ in form but express the same or a similar meaning; when several terms express the same or a similar meaning, they are synonyms of one another. In this embodiment, the thesaurus 123 comprises a synonym correspondence table 1231 and a synonym context bank 1232. The synonym correspondence table 1231 specifies in advance the correspondence between words and their synonyms, for example as a mapping table between original words and their synonyms obtained beforehand through statistics. The table can also be updated continually by analyzing users' historical query click data. For example, when the title of a clicked query result page contains a synonym of some original word but not the original word itself, and this happens with high frequency, the original word and the synonym are determined to be a synonym pair and added to the synonym correspondence table 1231.
Fig. 2 shows the workflow of one embodiment in which the search engine 100 mines the synonym context of a synonym pair. In the present invention, a synonym context is the semantic environment in which the original word of a synonym pair occurs; it indicates the semantic environment to which the synonym pair applies, that is, the environment in which the synonym is suitable for replacing the original word in synonym query expansion. In this embodiment, synonym contexts are obtained by analyzing user query logs. After each search ends, the user query log database 16 records the user's query click data, such as the query expression, the search time, the returned result list, and the result pages that were clicked. Referring to Fig. 2 together with Fig. 1, the log analyzer 17 analyzes the historical user queries and click data contained in the user query log database 16 (step 411), including the historical queries and the query result pages that were returned in response to a particular query and clicked. Next, the log analyzer 17 identifies whether these data contain a synonym context of some synonym pair and, if so, records it and stores it in the synonym context bank 1232.
Specifically, the log analyzer 17 first judges, based on the synonym correspondence table 1231, whether a given historical query contains an original word; if so, it obtains the synonym pair comprising that original word and the corresponding synonym. For example, for the historical query "how to make fish-flavoured shredded pork", the log analyzer 17 determines from the synonym correspondence table 1231 that the query contains the original word "how to make" (the query is segmented into the two terms "fish-flavoured shredded pork" and "how to make", which are then matched against the original words in the table, yielding the original word "how to make") and obtains the corresponding synonym pair {"how to make", "menu"}. The log analyzer 17 then judges whether, for this query, the title of a web page the user clicked contains the synonym but not the original word; if so, it records the synonym context of the synonym pair. For example, if for the query "how to make fish-flavoured shredded pork" the user clicked a page titled "fish-flavoured shredded pork menu", the log analyzer 17 performs the operation of recording the synonym context. The synonym context comprises at least the historical query itself, e.g. "how to make fish-flavoured shredded pork"; it may also comprise the words adjacent to the original word in the historical query, e.g. "fish-flavoured shredded pork"; or both may be recorded as synonym contexts of the synonym pair {"how to make", "menu"}. An adjacent word may be located before or after the original word; it may also be an empty term, which is the case when the query contains only the original word and has no adjacent word.
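The mining rule just described can be sketched in a few lines. This is an illustrative reading, not the patent's code: the query is assumed to arrive already segmented, the table holds only the example pair, containment in the title is tested by plain substring search, and `mine_contexts` is a hypothetical name.

```python
# Hypothetical synonym correspondence table with the example pair.
SYNONYM_TABLE = {"how to make": "menu"}


def mine_contexts(query_terms: list, clicked_title: str) -> list:
    """Return (synonym pair, context) records for one clicked result."""
    records = []
    for i, term in enumerate(query_terms):
        synonym = SYNONYM_TABLE.get(term)
        # Record only when the title has the synonym but not the original word.
        if synonym is None or synonym not in clicked_title or term in clicked_title:
            continue
        pair = (term, synonym)
        # The whole historical query is always recorded as a context.
        records.append((pair, " ".join(query_terms)))
        # Adjacent words before and after the original word; an empty term
        # is recorded when the query consists of the original word alone.
        neighbours = [
            query_terms[j] for j in (i - 1, i + 1) if 0 <= j < len(query_terms)
        ]
        for word in neighbours or [""]:
            records.append((pair, word))
    return records


print(mine_contexts(
    ["how to make", "fish-flavoured shredded pork"],
    "fish-flavoured shredded pork menu",
))
```

Each record pairs the synonym pair with one context string, so the same structure can later be counted and filtered by frequency.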
In the above embodiment, synonym contexts are obtained from historical user behavior, but in other embodiments they can also be determined from the anchor text of web pages. Anchor text is the text contained in a hyperlink to a web page. For example, if the anchor texts of the links that point to the page www.sina.com.cn include "Sina website homepage", "Sina homepage", and "sina homepage", these text fragments can be recorded and used as synonym contexts of the synonym pair {"Sina website", "Sina"}. Synonym contexts can also be determined from parallel segments in a web page title. For example, the title of the page at price.mycar168.com/search.asp?factoryid=135 is "quotation of Huachen BMW, automobile big world, Huachen BMW price Shenzhen net". Using its separators, this title can be split into several parallel term fragments: "quotation of Huachen BMW", "Huachen BMW price", and "automobile big world, Shenzhen net". The first two fragments contain "price" and "quotation" from the synonym pair {"price", "quotation"}, so these two fragments can also be used as synonym contexts of that pair.
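The title-based variant can be sketched as a split on separator characters followed by a check that both members of the pair occur in some segment. The separator set, the sample title string (an assumed reconstruction that uses "_" as the separator), and the function names are all hypothetical; the patent does not specify them.

```python
import re

# Assumed separator characters between parallel title segments.
SEPARATORS = re.compile(r"[_|,]+")


def parallel_segments(title: str) -> list:
    """Split a page title into its non-empty parallel segments."""
    return [part.strip() for part in SEPARATORS.split(title) if part.strip()]


def contexts_from_title(title: str, pair: tuple) -> list:
    """Return the segments usable as contexts for the synonym pair, if any."""
    original_word, synonym = pair
    segments = parallel_segments(title)
    with_original = [s for s in segments if original_word in s]
    with_synonym = [s for s in segments if synonym in s]
    # Both members of the pair must occur in some segment of the same title.
    if with_original and with_synonym:
        return with_original + with_synonym
    return []


title = "Huachen BMW quotation_Huachen BMW price_Shenzhen automobile net"
print(contexts_from_title(title, ("price", "quotation")))
```

A stricter variant might additionally require that the two segments differ only in the synonym pair itself; the text does not state such a condition, so the sketch leaves it out.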
As shown in Fig. 2, during synonym context mining, user clicks are not always entirely reasonable: a user may inadvertently click irrelevant results while browsing the search results, in which case the recorded synonym context is inaccurate. To eliminate the negative effect of such cases, the log analyzer 17 can also count the frequency with which each synonym context is recorded; only when the frequency is greater than or equal to a predetermined frequency threshold is the synonym context retained and determined to be a synonym context of the corresponding synonym pair. In other words, low-frequency synonym contexts are filtered out (step 413).
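The frequency filter amounts to counting how often each (synonym pair, context) record was logged and keeping those at or above the threshold. A minimal sketch follows; the record layout, the function name `filter_contexts`, and the threshold value are assumptions for illustration.

```python
from collections import Counter


def filter_contexts(records: list, min_frequency: int = 2) -> set:
    """Keep the (synonym pair, context) records logged at least min_frequency times."""
    counts = Counter(records)
    return {record for record, count in counts.items() if count >= min_frequency}


pair = ("how to make", "menu")
records = [
    (pair, "fish-flavoured shredded pork"),
    (pair, "fish-flavoured shredded pork"),
    (pair, "bedside cupboard"),  # a single stray click
]
print(filter_contexts(records))
```

The context that was recorded only once, standing in for an accidental click, is dropped, while the repeated one is retained.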
As shown in Fig. 1, the online part of the search engine 100 mainly comprises a search component 11 and a user interface 15. The user interface 15 is presented through the browser 21 of the client 20; it lets the user enter a query and displays the search result list in a predetermined presentation; after a search ends, it also records the user's query information and stores it in the user query log database 16. The search component 11 responds to search requests from the client 20 and returns the search results to the client 20. In this embodiment, the search component 11 comprises a search module 111, a query analysis module 112, and a result synthesis module 113. For an ordinary original query (without query expansion), the query analysis module 112 segments the currently received original query to obtain a set of query words and generates a query word list. After receiving the query word list, the search module 111 matches it against the index lexicon in the index database 121, finds the corresponding index terms and the inverted list of each index term, and thereby obtains the set of web documents relevant to the query words. The result synthesis module 113 ranks the retrieved web documents by predetermined relevance weights between each document and the query words, and then returns the result list to the client through the user interface 15.
The detailed steps by which the search engine 100 performs online synonym query expansion according to synonym contexts are described below with reference to the workflow shown in Fig. 3. The query analysis module 112 receives the original query of the current user's search through the user interface 15 (step 421) and then analyzes the query (step 422), which includes segmenting the original query. Note that the segmentation method in this embodiment is dictionary-based forward maximum matching, and the dictionary is built from the term fragments contained in the synonym contexts. As mentioned above, a historical query can be recorded as a synonym context, and a historical query fragment is longer than the terms into which that query is split. Forward maximum matching therefore guarantees that, whenever the current original query contains a fragment equal to a historical query, that fragment is cut out first, which improves the accuracy of the subsequent calculation. For example, suppose that in the synonym context mining stage the historical query was "today Nokia how much"; when the synonym contexts of the synonym pair {"how much", "price"} were recorded, both the historical query "today Nokia how much" and the adjacent word "Nokia" were recorded as synonym contexts. Now suppose the current original query is "who knows today Nokia how much". Under forward maximum matching, the longest fragment in the synonym context dictionary, "today Nokia how much", has length 8 (a count that evidently refers to the characters of the original Chinese text). The query analysis module 112 scans the current original query from left to right and checks whether a phrase of length 8 appears in the synonym context dictionary; when it finds the match "today Nokia how much", that fragment is cut out first, so "Nokia" is not cut out as a separate keyword. In step 422, the query analysis module 112 also matches the set of terms obtained by segmenting the original query against the thesaurus 123 to obtain a potential synonym pair and the synonym context of that pair; the potential synonym pair comprises an original word contained in the original query and the synonym corresponding to that original word.
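Forward maximum matching against a dictionary of synonym context fragments can be sketched as below. The sketch works on whitespace-separated words of the English glosses rather than on Chinese characters, which is an assumption made so that the example stays readable; the dictionary contents and the name `forward_max_match` are likewise hypothetical.

```python
def forward_max_match(tokens: list, dictionary: set) -> list:
    """Greedy left-to-right segmentation that prefers the longest dictionary entry."""
    max_len = max((len(entry) for entry in dictionary), default=1)
    segments = []
    i = 0
    while i < len(tokens):
        # Try the longest window first and shrink it until something matches.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + size])
            # A single token is always accepted so the scan cannot stall.
            if size == 1 or candidate in dictionary:
                segments.append(" ".join(candidate))
                i += size
                break
    return segments


# Context dictionary built from the recorded synonym contexts of the example.
context_dictionary = {
    ("today", "Nokia", "how", "much"),
    ("Nokia",),
    ("how", "much"),
}
query = "who knows today Nokia how much".split()
print(forward_max_match(query, context_dictionary))
```

Because the longest entry is tried first, the whole historical query is cut out as one fragment and "Nokia" never appears as a separate segment, which mirrors the behaviour described in the example.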
Next, query analysis module 112 judges whether the synonym context matches the original query (step 423). In this embodiment, query analysis module 112 calculates the matching degree between the synonym context and the original query. When the matching degree falls within a predetermined matching-degree interval, the synonym context is determined to match the original query, which indicates that the semantic environment of the current original query is suitable for replacing the original word with the synonym in an expanded query. The matching degree can be determined from the length of the original query after the original word is removed and from the length of the synonym context. In this embodiment, when the original query is longer than the original word (that is, q ≠ orig), the matching degree M is calculated as follows:
$$M(\mathrm{orig}, \mathrm{syn}) = \frac{\sum_{i=1}^{n} \mathrm{TermCount}(p_i)}{\mathrm{TermCount}(q) - \mathrm{TermCount}(\mathrm{orig})}, \quad q \neq \mathrm{orig}$$
Here TermCount(q) denotes the length of the original query, TermCount(orig) denotes the length of the original word within the original query, and TermCount(p_i) denotes the length of the i-th synonym context. Because the original query may in this case contain words that do not belong to any synonym context, M takes a value in [0, 1]. A matching-degree threshold θ is preset. When M lies in [θ, 1], the synonym context matches the original query, and the original word in the original query is replaced with the synonym to obtain the synonym query. Search module 111 then searches according to the original query and the synonym query to obtain the set of web pages for the original query results and the set of web pages for the synonym query results (step 424), and result synthesis module 113 merges the results of the original query and the synonym query according to a predetermined merging strategy (step 425). The merging strategy is described in detail below. When M lies in [0, θ), the synonym context does not match the original query, and replacing the original word with the synonym is not appropriate in this semantic environment. Search module 111 then searches according to the original query only and obtains the set of web pages for the original query results (step 426), and result synthesis module 113 generates the search result list according to the predetermined relevance weight between each web page and the original query (step 425). When the original query contains only the original word (that is, q = orig), the matching degree is M = 1, the synonym directly replaces the original query, and steps 424 and 425 are then performed.
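The matching-degree test can be illustrated with a short sketch. It assumes that lengths are counted in terms, that the query has already been segmented, and that the original word occurs in the query; the function names, the tuple representation, and the threshold value 0.6 are hypothetical and serve only as an example.

```python
def matching_degree(segments, orig, contexts):
    """M = (total length of segments that are synonym contexts)
         / (length of the query - length of the original word)."""
    if segments == [orig]:  # q == orig: the query is just the original word
        return 1.0
    query_len = sum(len(s) for s in segments)
    matched_len = sum(len(s) for s in segments if s != orig and s in contexts)
    return matched_len / (query_len - len(orig))


def synonym_query(segments, orig, syn, contexts, theta):
    """Return the synonym query if M >= theta, otherwise None."""
    if matching_degree(segments, orig, contexts) >= theta:
        return [syn if s == orig else s for s in segments]
    return None


contexts = {("Nokia",)}          # assumed synonym contexts of {"how much", "price"}
orig, syn = ("how", "much"), ("price",)
short_query = [("Nokia",), orig]
long_query = [("who",), ("knows",), ("Nokia",), orig]
print(matching_degree(short_query, orig, contexts))   # 1 / (3 - 2)
print(matching_degree(long_query, orig, contexts))    # 1 / (5 - 2)
print(synonym_query(short_query, orig, syn, contexts, theta=0.6))
```

With the assumed threshold, the short query is expanded while the long query, whose extra words are not covered by any synonym context, is searched in its original form only.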
By analyzing the semantic environment of the current user's query, the search engine determines whether synonym substitution is appropriate before it performs a synonym-expanded query. This ensures the accuracy of the synonym-expanded query, keeps the expanded query as close as possible to the user's needs, and in turn gives the user a good experience.
Fig. 4 and Fig. 5 disclose the second embodiment of the search engine of the present invention. Compared with the first embodiment, search engine 200 of this embodiment mainly judges the probability that the synonym query results drift away from the intended meaning (semantic drift), and adjusts the position of the synonym query results in the search result list finally presented to the user. As shown in Fig. 4, search engine 200 comprises search component 11, data repository 12, index 13, grabber 14 and user interface 15. Functional modules such as data repository 12, index 13, grabber 14 and user interface 15 are essentially the same as in the above embodiment and are not described again here. In this embodiment, search component 11 comprises search module 111, query analysis module 112, and overlap calculation and result merging module 114.
The synonym-expanded query performed by the search engine of this embodiment is described in detail below with reference to Fig. 5. First, query analysis module 112 receives the user's original query (step 431). Next, it analyzes the query (step 432): it segments the original query to obtain a set of query words, identifies the original word in the original query based on thesaurus 123, obtains the synonym pair comprising the original word and its synonym, and directly replaces the original word with the synonym to obtain the synonym query. Search module 111 searches according to the original query and the synonym query to obtain the set of web pages for the original query results and the set of web pages for the synonym query results (step 433). Next, overlap calculation and result merging module 114 calculates the overlap degree between the web pages in the original query results and those in the synonym query results (step 434). The overlap degree mainly reflects the number of identical web pages in the two result sets. If the number of identical web pages is sufficiently large, the synonym query results are close to the original query results, and the probability of semantic drift in the synonym query results is small. Otherwise, the probability of semantic drift is large, and the synonym query results need to be suppressed so that results that do not meet the user's search needs do not appear at the top of the result list.
The overlap degree can be calculated in various ways. For example, only the number of web pages common to the original query results and the synonym query results, |U1 ∩ U2|, may be calculated, that is, the number of identical URLs; or the number of common web pages among the top 100 results of each of the two result sets may be calculated and then compared with a predetermined threshold. In a preferred mode, the calculation also determines the smaller of the number of web pages in the original query results and the number of web pages in the synonym query results, Min(|U1|, |U2|), and the overlap degree is then I(U1, U2) = |U1 ∩ U2| / Min(|U1|, |U2|). Alternatively, in other embodiments, the calculation also determines the total number of distinct web pages in the original query results and the synonym query results, |U1 ∪ U2|, and the overlap degree is then I(U1, U2) = |U1 ∩ U2| / |U1 ∪ U2|. After the overlap degree is calculated, it is checked against a predetermined overlap interval to determine whether to suppress the synonym query results; the position of the synonym query results in the search result list is then determined and the merged results are output (step 435). Taking I(U1, U2) = |U1 ∩ U2| / Min(|U1|, |U2|) as an example, the overlap degree I is a floating-point number in [0, 1]. An overlap threshold σ is preset. When I lies in [σ, 1], the overlap between the original query results and the synonym query results is high. In this case the synonym query results do not need to be suppressed, and the results of the original query and the synonym query are simply merged according to the predetermined relevance weight of each web page. When I lies in [0, σ), the overlap between the original query results and the synonym query results is low and the probability of semantic drift in the synonym query results is high, so the synonym query results need to be suppressed. One way to suppress them is to reduce the relevance weights of the web pages in the synonym query results, so that the synonym query results are positioned lower in the merged search result list. Another way is to insert the synonym query results after a specific page of the search result list, for example by moving them to the second page. The synonym query results can also be placed after all of the original query results, so that they appear at the very end of the search result list.
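The two overlap formulas and the weight-reduction form of suppression can be sketched as follows. This is a minimal sketch under stated assumptions: results are represented as dictionaries mapping a URL to a relevance weight, and the function names and the `penalty` factor of 0.5 are hypothetical choices for illustration, not values taken from the embodiment.

```python
def overlap_min(u1, u2):
    """I(U1, U2) = |U1 ∩ U2| / Min(|U1|, |U2|)."""
    u1, u2 = set(u1), set(u2)
    smaller = min(len(u1), len(u2))
    return len(u1 & u2) / smaller if smaller else 0.0


def overlap_union(u1, u2):
    """I(U1, U2) = |U1 ∩ U2| / |U1 ∪ U2|."""
    u1, u2 = set(u1), set(u2)
    union = u1 | u2
    return len(u1 & u2) / len(union) if union else 0.0


def merge_results(orig_scores, syn_scores, sigma, penalty=0.5):
    """Merge two {url: relevance weight} mappings into a ranked URL list.
    Synonym results are down-weighted when the overlap is below sigma."""
    factor = 1.0 if overlap_min(orig_scores, syn_scores) >= sigma else penalty
    merged = dict(orig_scores)
    for url, weight in syn_scores.items():
        merged[url] = max(merged.get(url, 0.0), weight * factor)
    return sorted(merged, key=lambda url: merged[url], reverse=True)


original = {"a": 0.9, "b": 0.8}
synonym = {"b": 0.7, "x": 0.85}
print(merge_results(original, synonym, sigma=0.4))  # overlap 0.5 >= 0.4: no suppression
print(merge_results(original, synonym, sigma=0.6))  # overlap 0.5 < 0.6: "x" is demoted
```

Moving the synonym results to a later page or to the end of the list, as the paragraph above also allows, would replace the weight multiplication with a simple concatenation of the two ranked lists.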
By judging the overlap degree between the original query results and the synonym query results, the search engine determines the probability of semantic drift in the synonym query results and suppresses those results when the probability is high. This prevents results that do not meet the user's search needs from appearing at the top of the search result list and thereby gives the user a good experience. In this embodiment, it is not mandatory to apply the synonym-context judgment of the first embodiment before the original word is replaced with the synonym for the expanded query. However, those of ordinary skill in the art will readily appreciate that this embodiment can be combined with the first embodiment: the synonym context is judged before the synonym replacement, and once the synonym query results are available, the search results are merged according to the overlap degree between the original and synonym query results. This combination clearly yields more accurate search results and further improves the user experience.
Fig. 6 and Fig. 7 disclose the third embodiment of the search engine of the present invention. Based on the synonym query results, this embodiment further judges the probability of semantic drift in those results by analyzing the semantic topic distribution of the synonym query result web pages, and then adjusts the position of the synonym query results in the search result list finally presented to the user. As shown in Fig. 6, and similarly to the first embodiment, search engine 300 comprises search component 11, data repository 12, index 13, grabber 14, user interface 15, user query log database 16 and log analyzer 17. Functional modules such as index 13, grabber 14, user interface 15, user query log database 16 and log analyzer 17 are the same as in the first embodiment and are not described again here. In this embodiment, search component 11 comprises search module 111, query analysis module 112, result synthesis module 113 and semantic drift determination module 115. Data repository 12 includes index database 121, web page library 122, thesaurus 123 and web page semantic topic library 124. Index database 121, web page library 122 and thesaurus 123 are the same as in the first embodiment and are not described again here. Search engine 300 also comprises a topic analysis module 18, which in this embodiment comprises a Probabilistic Latent Semantic Analysis (hereinafter PLSA) model.
The PLSA model is a natural language processing tool that is mainly used to analyze the latent semantics of documents. A document can be represented as a set of words. Because of synonyms, however, words are not the most basic constituent of a document, so a latent semantic level, namely the topic, can be considered to exist between words and documents. For example, suppose the query entered by the user is "the green color of Swiss Army Knife". Because {"green color", "green"} is a synonym pair, "green" can replace "green color" in a synonym-expanded query, but the recalled results may then include a web page titled "System Swiss Army Knife - Perfect Uninstall V2007 green edition". This happens because the topic corresponding to "the green color of Swiss Army Knife" is "goods", whereas the topic corresponding to "System Swiss Army Knife - Perfect Uninstall V2007 green edition" is "software"; a search engine clearly cannot understand these implicit topics on its own. The PLSA model is a topic model that analyzes latent semantic topics by calculating the distribution of co-occurring words in documents. It introduces a latent semantic layer between documents and words, and this layer is composed of n latent semantic topics. Documents and words are assumed to be mutually independent, so the probability that a document and a word occur together is determined by their probabilistic relationships with the topics. The PLSA model can therefore be used to calculate the relationship between a document or a word and the latent semantic topics. On this basis, this embodiment uses the PLSA model to obtain the semantic topic distributions of the synonym context and of the synonym query result web pages, and calculates how well the two match in order to determine the probability of semantic drift in the synonym query results. This is described in detail below.
As shown in Fig. 6, topic analysis module 18 obtains a web page from web page library 122, removes noise such as frame advertisements from the page, and then extracts a keyword set that represents the page. Topic analysis module 18 then uses the PLSA model to calculate the web page-latent semantic topic vector S2 = {s21, s22, ..., s2n}, which represents the semantic topic distribution of the web page, where s2n is the probability score of the web page on the n-th semantic topic. In this embodiment, the semantic topic distributions of web pages are obtained offline: topic analysis module 18 analyzes all crawled web pages, obtains their semantic topic distributions, and stores them in web page semantic topic library 124. This process can of course also be carried out during online search: after the synonym query results are obtained, topic analysis module 18 analyzes only the web pages in the query results and then passes the semantic topic distributions of these pages to semantic drift determination module 115 for judgment. In this embodiment, the semantic topic distribution of the synonym context is obtained online. After query analysis module 112 segments the original query and obtains the keyword set, topic analysis module 18 obtains this keyword set and also obtains, from synonym context bank 1232, the entry set contained in the corresponding synonym context. The keyword set and the entry set of the synonym context are then combined and passed to the PLSA model, which calculates the synonym context-latent semantic topic vector S1 = {s11, s12, ..., s1n}, which represents the semantic topic distribution of the synonym context, where s1n is the probability value of the synonym context on the n-th semantic topic. After vector S1 is obtained, topic analysis module 18 passes it to semantic drift determination module 115, which judges the similarity between S1 and S2. The judgment steps are described in detail later.
The detailed steps by which search engine 300 of this embodiment performs a synonym-expanded query are described next with reference to Fig. 7. First, query analysis module 112 receives the original query of the user's search (step 441) and then analyzes the original query (step 442). Query analysis module 112 segments the original query and, as in the first embodiment, the segmentation is forward maximum segmentation based on the dictionary built from the synonym contexts. The segmentation yields the original keyword set. On the one hand, query analysis module 112 passes the original keyword set to search module 111, which performs the original query (step 449) and obtains the original query results (step 450). On the other hand, query analysis module 112 identifies the original word contained in the original query based on thesaurus 123 and obtains the corresponding potential synonym pair and the synonym context of that pair. After obtaining these data, query analysis module 112 can directly replace the original word with the synonym to obtain the synonym query and pass it to search module 111 to perform the synonym-expanded query (step 443). In a preferred embodiment, before the synonym replacement is performed, it is first judged whether the synonym context of the original word is satisfied, and the replacement is performed only if it is, which further improves the accuracy of the synonym query results. The replacement of synonyms according to the matching degree of the synonym context has been described in detail in the first embodiment and is not repeated here. In addition, query analysis module 112 also passes the original keyword set to topic analysis module 18, which uses the PLSA model to calculate the semantic topic distribution of the synonym context (step 447) and passes the result to semantic drift determination module 115.
After search module 111 performs the synonym query and obtains the synonym query results (step 444), semantic drift determination module 115 obtains from the web page semantic topic library the semantic topic distribution of each result web page in the synonym query results, that is, the web page-latent semantic topic vector S2 = {s21, s22, ..., s2n} (step 445). Semantic drift determination module 115 has also obtained from the topic analysis module the semantic topic distribution of the synonym context, that is, the synonym context-latent semantic topic vector S1 = {s11, s12, ..., s1n}. Next, semantic drift determination module 115 judges how well the two semantic topic distributions match by calculating the similarity between the vectors S1 and S2 (step 446). It then filters the synonym query results according to the matching degree (step 448), determines how the synonym query results are to be suppressed, and accordingly merges the results of the original query and the synonym query to generate the search result list (step 451). The similarity between two vectors can be calculated in several ways, such as inner-product similarity or cosine similarity. The following is an example formula that uses cosine similarity to calculate the similarity between vectors S1 and S2.
$$\mathrm{sim}(S_1, S_2) = \frac{\sum_{i=1}^{n} s_{1i} \, s_{2i}}{\sqrt{\sum_{i=1}^{n} s_{1i}^{2}} \, \sqrt{\sum_{i=1}^{n} s_{2i}^{2}}}$$
If the calculated similarity is high, the web page and the synonym context place their probability on the same semantic topics, so the two semantic topic distributions can be judged to match well and the probability of semantic drift for this web page is small. Conversely, if the calculated similarity is low, the probability of semantic drift for this web page is large and the result needs to be suppressed. Specifically, the similarity sim(S1, S2) is a floating-point number in [0, 1]. A threshold α can be preset. When sim(S1, S2) lies in [α, 1], the two semantic topic distributions match well. In this case the synonym query results do not need to be suppressed, and the results of the original query and the synonym query are simply merged according to the predetermined relevance weights of the web pages. When sim(S1, S2) lies in [0, α), the two semantic topic distributions match poorly and the probability of semantic drift in the synonym query results is high, so the synonym query results need to be suppressed. One way to suppress them is to reduce the relevance weights of the synonym query result web pages, so that the synonym query results are positioned lower in the merged search result list. Another way is to insert the synonym query results after a specific page of the search result list, for example by moving them to the second page. The synonym query results can also be placed after all of the original query results, so that they appear at the very end of the search result list.
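The cosine-similarity judgment can be sketched in a few lines. The topic vectors below are made-up three-topic distributions (for instance "goods", "software", "other") and the threshold 0.5 is a hypothetical value; in the embodiment the vectors would come from the PLSA model, which this sketch does not implement.

```python
import math


def cosine_similarity(s1, s2):
    """sim(S1, S2) = sum(s1i * s2i) / (sqrt(sum(s1i^2)) * sqrt(sum(s2i^2)))."""
    dot = sum(a * b for a, b in zip(s1, s2))
    norm = math.sqrt(sum(a * a for a in s1)) * math.sqrt(sum(b * b for b in s2))
    return dot / norm if norm else 0.0


def pages_to_suppress(context_vector, page_vectors, alpha):
    """Return the URLs whose topic distribution matches the synonym context poorly."""
    return [url for url, vec in page_vectors.items()
            if cosine_similarity(context_vector, vec) < alpha]


# Assumed topic distributions; the numbers are illustrative only.
context = [0.8, 0.1, 0.1]                      # synonym context, mostly "goods"
pages = {
    "knife-shop": [0.7, 0.2, 0.1],             # also mostly "goods"
    "uninstall-tool": [0.1, 0.8, 0.1],         # mostly "software"
}
print(pages_to_suppress(context, pages, alpha=0.5))
```

Only the software page falls below the assumed threshold here, so only that result would be demoted when the lists are merged.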
By comparing how well the semantic topic distributions of the synonym context and of the synonym query result web pages match, the search engine can judge whether the synonym query results meet the user's underlying needs. It can then control the ranking of the synonym query results within the overall search result list accordingly, so that drifted results do not appear at the top of the search results and the user has a good experience. Besides the PLSA model introduced in the above embodiment, other topic models can also be used to analyze the latent semantic topics of the synonym context and of the synonym query result web pages, such as the Latent Semantic Analysis (LSA) model or the Latent Dirichlet Allocation (LDA) model.
Fig. 8 to Fig. 10 disclose the fourth embodiment of the search engine of the present invention. This embodiment mainly describes how synonyms are presented in the search results. The functional block diagram of search engine 400 is shown in Fig. 8. It comprises search component 11, data repository 12, index 13, grabber 14 and user interface 15. Functional modules such as data repository 12, index 13, grabber 14 and user interface 15 are essentially the same as in the above embodiments and are not described again here. In this embodiment, search component 11 comprises search module 111, query analysis module 112, result synthesis module 113, similarity grade analysis module 116 for analyzing the similarity grade between the synonym and the original word, and labeling module 117 for determining how the synonym is presented.
The synonym-expanded query performed by the search engine of this embodiment is described in detail below with reference to Fig. 9. First, query analysis module 112 receives the original query of the user's search (step 461) and then analyzes the original query (step 462). Query analysis module 112 segments the original query to obtain the original keyword set. It identifies the original word contained in the original query based on thesaurus 123 and obtains the synonym pair comprising this original word and its synonym. On the one hand, query analysis module 112 replaces the original word with the synonym to obtain the synonym query, and search module 111 then performs the original query and the synonym-expanded query according to the original query and the synonym query (step 463). After obtaining the original query results and the synonym query results, search module 111 passes them to result synthesis module 113, which merges them and generates the search result list (step 464). The methods for merging original and synonym query results have been described in detail in the above embodiments and are not repeated here. On the other hand, query analysis module 112 passes the synonym pair to similarity grade analysis module 116, which judges the similarity grade between the synonym and the original word (step 465) and passes the result to labeling module 117. Next, labeling module 117 determines how the synonym is to be displayed according to the similarity grade, and the marked search result list is finally presented to the user through user interface 15 (step 466).
The judgment of the similarity grade between the synonym and the original word, and the corresponding display methods, are further explained below with reference to Fig. 10. Similarity grade analysis module 116 obtains the synonym pair from query analysis module 112 (step 471) and first judges whether the synonym and the original word in the pair belong to the high similarity grade (that is, the first grade, with the highest similarity) (step 472). In this embodiment, a synonym and an original word belong to the high similarity grade in cases such as proper-noun abbreviation (for example "Peking University" and "Beijing University", or "Sina website" and "sina"), numeral conversion (for example two forms of "the 5th collection" that differ only in how the numeral is written), or regional name conversion (for example two forms of the place name "Beijing"). If the pair belongs to the high similarity grade, the synonym is marked with a specific color (step 473), usually a more eye-catching color such as the red used in this embodiment. If it does not, it is next judged whether the synonym and the original word in the pair belong to the middle similarity grade (that is, the second grade, with lower similarity) (step 474). In this embodiment, the judgment of the middle similarity grade between the original word and the synonym comprises a judgment of semantic similarity or of morphological similarity.
A concrete example of the semantic similarity formula is given below:
$$\mathrm{SSim}(\mathrm{orig}, \mathrm{syn}) = \frac{\mathrm{ClickQueryCount}(\mathrm{orig}, \mathrm{syn})}{\mathrm{QueryCount}(\mathrm{orig})}$$
Here ClickQueryCount(orig, syn) denotes the number of historical queries in which the query contains the original word orig and the title of the web page clicked and visited does not contain orig but does contain the synonym syn. QueryCount(orig) denotes the number of historical queries in which the query contains the original word orig. For example, if the historical query entered by a user is "Beijing University where" and the user then clicked the search result titled "Peking University where", this query is counted in both ClickQueryCount(orig, syn) and QueryCount(orig). If, for the historical query "Beijing University where", the user only clicked the search result titled "Beijing University where", this query is counted in QueryCount(orig) only. The semantic similarity is clearly a floating-point number in [0, 1]. A threshold β can be preset. When the semantic similarity lies in [β, 1], the original word and the synonym belong to the middle similarity grade. When the semantic similarity lies in [0, β), the morphological similarity is judged next. If the synonym pair is determined to belong to the middle similarity grade, the synonym is marked with a specific font style (step 475), such as bold or italic; in this embodiment it is bold.
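The two counters can be accumulated from a click log as in the sketch below. It assumes a simplified log with one (query, clicked title) record per historical query and uses plain substring tests to decide whether a word is contained; both simplifications, and the function name, are assumptions made for illustration.

```python
def semantic_similarity(click_log, orig, syn):
    """SSim(orig, syn) = ClickQueryCount(orig, syn) / QueryCount(orig).

    click_log: iterable of (query, clicked_title) pairs (assumed format).
    """
    query_count = 0
    click_query_count = 0
    for query, title in click_log:
        if orig not in query:
            continue
        query_count += 1
        # Count the click only if the title has the synonym but not the original word.
        if syn in title and orig not in title:
            click_query_count += 1
    return click_query_count / query_count if query_count else 0.0


log = [
    ("Beijing University where", "Peking University where"),   # counts toward both
    ("Beijing University where", "Beijing University where"),  # QueryCount only
    ("weather today", "weather forecast"),                     # does not contain orig
]
print(semantic_similarity(log, "Beijing University", "Peking University"))
```

On this toy log one of the two queries containing the original word led to a click on a title with the synonym, so the ratio is one half; whether that reaches the middle grade depends on the chosen β.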
A concrete example of the morphological similarity formula is given below:
$$\mathrm{WSim}(\mathrm{orig}, \mathrm{syn}) = \frac{\mathrm{CoocAlphaCount}(\mathrm{orig}, \mathrm{syn})}{\mathrm{AllAlphaCount}(\mathrm{orig}, \mathrm{syn})}$$
Here CoocAlphaCount(orig, syn) denotes how many characters the original word orig and the synonym syn have in common, and AllAlphaCount(orig, syn) denotes the total number of distinct characters contained in orig and syn. For example, for the synonym pair {"how", "how"} (two written forms of the same word), CoocAlphaCount("how", "how") = 2 because two characters appear in both the original word and the synonym, and AllAlphaCount(orig, syn) = 3 because the pair contains three distinct characters in total. For English, letters are counted. For example, for the synonym pair {"man", "men"}, CoocAlphaCount("man", "men") = 2 and AllAlphaCount("man", "men") = 4. The morphological similarity is clearly also a floating-point number in [0, 1]. A threshold γ can be preset. When the morphological similarity lies in [γ, 1], the original word and the synonym belong to the middle similarity grade, and labeling module 117 marks the synonym in bold. When the morphological similarity lies in [0, γ), the synonym and the original word in the pair belong to the low similarity grade (that is, the third grade, whose similarity is lower than that of the second grade), and the synonym is not marked at all (step 476). Compared with a specific color, a specific font style is less conspicuous but still draws the user's attention, so it suits synonyms of the middle similarity grade, whose meaning or form has changed but remains fairly close to the original word. Synonyms of the low similarity grade differ more from the original word in meaning or form, and marking them would seem abrupt to the user, so the preferred approach is to leave them unmarked.
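The character-level ratio and the three-grade marking decision can be sketched as follows. The set-based counting reproduces the "man"/"men" figures given above; the HTML-style tags, the function names and the threshold values are hypothetical, since the embodiment only specifies a color for the high grade and bold for the middle grade.

```python
def morphological_similarity(orig, syn):
    """WSim = (characters shared by both words) / (distinct characters overall)."""
    shared = set(orig) & set(syn)
    distinct = set(orig) | set(syn)
    return len(shared) / len(distinct) if distinct else 0.0


def mark_synonym(syn, is_high_grade, ssim, wsim, beta, gamma):
    """Return the synonym with the markup of its similarity grade (assumed HTML)."""
    if is_high_grade:                  # abbreviation, numeral or regional conversion
        return f'<font color="red">{syn}</font>'
    if ssim >= beta or wsim >= gamma:  # middle grade: semantic or morphological
        return f"<b>{syn}</b>"
    return syn                         # low grade: left unmarked


wsim = morphological_similarity("man", "men")   # 2 shared letters / 4 distinct letters
print(wsim)
print(mark_synonym("men", False, ssim=0.1, wsim=wsim, beta=0.3, gamma=0.4))
```

The order of the checks mirrors the flow of Fig. 10 as described: the high grade is tested first, then semantic similarity, and morphological similarity only decides the grade when the semantic test fails.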
By distinguishing the similarity grade between the synonym and the original word, the search engine marks synonyms in the search results in a manner suited to that grade. This helps the user locate the needed information quickly without the marking seeming abrupt, and thereby improves the user experience.
Those skilled in the art will readily appreciate that the way the synonym similarity grade is judged, the way synonyms are displayed, and the correspondence between similarity grades and display methods are not limited to what is described in the above embodiment. For example, the similarity grade can also be judged by edit distance, or synonyms can be marked by highlighting. In addition, more similarity grades can be defined, for example by splitting semantic similarity and morphological similarity into two different grades. The number of grades can of course also be reduced, so that all synonyms are classified only as high similarity grade or low similarity grade. For example, when the synonym and the original word are related by proper-noun abbreviation, numeral conversion or regional name conversion, or when the semantic similarity, morphological similarity or edit distance between the original word and the synonym satisfies a specified threshold, the pair can be regarded as high similarity grade, and all other pairs as low similarity grade.
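As an illustration of the edit-distance alternative mentioned above, the sketch below computes the Levenshtein distance and converts it into a score in [0, 1] where larger means more similar, so that it can be compared with a threshold in the same way as the other similarity measures. The normalization by the longer word's length is an assumption of this example, not something the embodiment specifies.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


print(edit_distance("man", "men"))
print(edit_similarity("man", "men"))
```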
It should be understood that, although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way only for the sake of clarity. Those skilled in the art should treat the specification as a whole, and the technical solutions in the individual embodiments can also be combined as appropriate to form other embodiments that those skilled in the art can understand.
The detailed descriptions listed above are only specific illustrations of feasible embodiments of the present invention and are not intended to limit its scope of protection. All equivalent embodiments or modifications that do not depart from the technical spirit of the present invention shall fall within the scope of protection of the present invention.

Claims (22)

1. An implementation method of a search engine, characterized in that the method comprises the following steps:
receiving an original query of a current user's search;
identifying an original word contained in the original query, a potential synonym pair comprising the original word and its synonym, and the synonym context of the potential synonym pair;
judging whether the synonym context matches the original query and, when they match, substituting the synonym for the original word in the original query to obtain a synonym query;
searching according to the original query and the synonym query to obtain a set of result web pages;
wherein the potential synonym pair and the synonym context of the potential synonym pair are mined by the following steps:
obtaining historical user query click data, the data comprising a historical query and a query result web page that was returned in response to the query and was clicked and visited;
identifying a synonym pair, the synonym pair comprising an original word present in the historical query and a corresponding synonym present in the query result web page;
recording and determining at least the historical query as the synonym context of the synonym pair.
2. the implementation method of search engine according to claim 1, is characterized in that, the described step that judges whether described synonym linguistic context and original query mate comprises: the matching degree of calculating synonym linguistic context and original query; In the time that the value of described matching degree is in predetermined matching degree interval, determine synonym linguistic context and original query coupling.
3. the implementation method of search engine according to claim 2, is characterized in that, the calculating of described matching degree is removed the length after former word according to original query, and the length of synonym linguistic context is determined.
4. according to the implementation method of the search engine described in any one in claims 1 to 3, it is characterized in that, the method also comprises: judging before the step whether described synonym linguistic context and original query mate, the entry segment that also can comprise based on synonym linguistic context is done the maximum cutting of forward to original query, thereby obtains the entry set after cutting.
5. The method for implementing a search engine according to claim 1, characterized in that the step of determining the synonym context further comprises recording the words immediately adjacent to the original word in the historical query and determining them as the synonym context.
6. The method for implementing a search engine according to claim 5, characterized in that the adjacent words comprise the term positioned before the original word and the term positioned after the original word in the historical query.
7. The method for implementing a search engine according to claim 6, characterized in that the adjacent words include an empty term.
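Claims 5 to 7 can be sketched as a small helper that returns the term before and the term after the original word, using an empty term when the word sits at either end of the query. The function name and the choice of `""` as the empty term are assumptions.

```python
def adjacent_context(query_terms, original_word, empty=""):
    """Return (term before, term after) the original word in a segmented query.

    Raises ValueError if the original word is not in query_terms.
    """
    i = query_terms.index(original_word)
    before = query_terms[i - 1] if i > 0 else empty
    after = query_terms[i + 1] if i + 1 < len(query_terms) else empty
    return before, after
```

Representing a missing neighbour by an explicit empty term keeps every context the same shape, so a word at the start of a query can still be compared with one in the middle.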
8. The method for implementing a search engine according to claim 1, characterized in that the method further comprises, before the step of determining the synonym context, judging whether the title of the result web page contains the synonym and does not contain the original word; if so, performing the step of determining the synonym context; if not, not performing the step of determining the synonym context.
9. The method for implementing a search engine according to claim 1, characterized in that the step of determining the synonym context further comprises counting the frequency with which the synonym context is recorded and, when the frequency is greater than or equal to a predetermined frequency threshold, determining that the synonym context is the synonym context of the synonym pair.
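The title filter of claim 8 and the frequency threshold of claim 9 combine naturally, as in the sketch below. The function names, the `(context, title)` record layout, whitespace tokenization, and the default threshold of 2 are assumptions, not values from the patent.

```python
from collections import Counter


def title_qualifies(title_terms, original_word, synonym):
    """Claim 8 filter: the title has the synonym and lacks the original word."""
    return synonym in title_terms and original_word not in title_terms


def confirmed_contexts(records, original_word, synonym, min_count=2):
    """Claim 9 filter: keep contexts recorded at least min_count times.

    records is an iterable of (context, clicked page title) pairs.
    """
    counts = Counter(
        context
        for context, title in records
        if title_qualifies(title.lower().split(), original_word, synonym)
    )
    return {context for context, n in counts.items() if n >= min_count}
```

Filtering on the title first means a page that mentions both words never counts toward the threshold.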
10. The method for implementing a search engine according to claim 1, characterized in that the synonym context is determined from the anchor text of web pages.
11. The method for implementing a search engine according to claim 1, characterized in that the synonym context is determined from parallel segments in web page titles.
12. A search engine, characterized in that the search engine comprises a search component, the search component comprising a query analysis module and a search module;
wherein the query analysis module is configured to:
receive an original query submitted by an active user;
identify an original word contained in the original query, a potential synonym pair comprising the original word and a synonym of it, and the synonym context of the potential synonym pair;
determine whether the synonym context matches the original query and, when they match, substitute the synonym for the original word in the original query to obtain a synonym query;
the search module is configured to, when the synonym context matches the original query, search on the basis of the original query and the synonym query to obtain a set of result web pages;
the search engine further comprises a user query log analyzer configured to:
obtain historical user query click data, the data comprising a historical query and the result web pages that were returned in response to that query and clicked;
identify a synonym pair, the synonym pair comprising an original word present in the historical query and a corresponding synonym present in the result web page;
record at least the historical query and determine it as the synonym context of the synonym pair.
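The division of labour in claim 12 between the query analysis module and the search module can be sketched as two small classes. All names here (`SynonymRule`, `QueryAnalyzer`, `Searcher`) are hypothetical. The context test is deliberately simplified to "the context term co-occurs in the query", an assumption standing in for the matching-degree test of the dependent claims, and the index is a plain dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynonymRule:
    original: str
    synonym: str
    context: str  # a term that must co-occur with the original word


class QueryAnalyzer:
    """Stands in for the query analysis module."""

    def __init__(self, rules):
        self.rules = list(rules)

    def expand(self, query):
        """Return the original query plus any context-approved synonym queries."""
        terms = query.split()
        queries = [query]
        for rule in self.rules:
            if rule.original in terms and rule.context in terms:
                rewritten = " ".join(
                    rule.synonym if term == rule.original else term for term in terms
                )
                if rewritten not in queries:
                    queries.append(rewritten)
        return queries


class Searcher:
    """Stands in for the search module; index maps a query to its result URLs."""

    def __init__(self, index):
        self.index = index

    def search(self, queries):
        pages = []
        for query in queries:
            for url in self.index.get(query, []):
                if url not in pages:  # merge results, keeping first-seen order
                    pages.append(url)
        return pages
```

Keeping the rewrite decision in the analyzer and the retrieval in the searcher mirrors the claim: the searcher only ever sees the queries it should run.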
13. The search engine according to claim 12, characterized in that the query analysis module, when determining whether the synonym context matches the original query, is configured to calculate a matching degree between the synonym context and the original query and, when the value of the matching degree falls within a predetermined matching-degree interval, to determine that the synonym context matches the original query.
14. The search engine according to claim 13, characterized in that the matching degree is calculated from the length of the original query after the original word has been removed and the length of the synonym context.
15. The search engine according to any one of claims 12 to 14, characterized in that the query analysis module is further configured to: before determining whether the synonym context matches the original query, perform forward maximum segmentation on the original query based on the term segments contained in the synonym context, thereby obtaining a set of segmented terms.
16. The search engine according to claim 12, characterized in that the log analyzer, in determining the synonym context, further records the words immediately adjacent to the original word in the historical query and determines them as the synonym context.
17. The search engine according to claim 16, characterized in that the adjacent words comprise the term positioned before the original word and the term positioned after the original word in the historical query.
18. The search engine according to claim 17, characterized in that the adjacent words include an empty term.
19. The search engine according to claim 12, characterized in that the log analyzer is further configured to: before determining the synonym context, judge whether the title of the result web page contains the synonym and does not contain the original word; if so, perform the step of determining the synonym context; if not, not perform the step of determining the synonym context.
20. The search engine according to claim 12, characterized in that the log analyzer is further configured to: count the frequency with which the synonym context is recorded and, when the frequency is greater than or equal to a predetermined frequency threshold, determine that the synonym context is the synonym context of the synonym pair.
21. The search engine according to claim 12, characterized in that the synonym context is determined from the anchor text of web pages.
22. The search engine according to claim 12, characterized in that the synonym context is determined from parallel segments in web page titles.
CN201110081259.XA 2011-03-31 2011-03-31 Search engine and realization method thereof Active CN102737021B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110081259.XA CN102737021B (en) 2011-03-31 2011-03-31 Search engine and realization method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110081259.XA CN102737021B (en) 2011-03-31 2011-03-31 Search engine and realization method thereof

Publications (2)

Publication Number Publication Date
CN102737021A (en) 2012-10-17
CN102737021B (en) 2014-10-22

Family

ID=46992544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110081259.XA Active CN102737021B (en) 2011-03-31 2011-03-31 Search engine and realization method thereof

Country Status (1)

Country Link
CN (1) CN102737021B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103873601B (en) * 2012-12-11 2019-03-08 百度在线网络技术(北京)有限公司 A kind of method for digging and system addressing class query word
CN105653553B (en) * 2014-11-14 2020-04-03 腾讯科技(深圳)有限公司 Word weight generation method and device
CN105659235A (en) * 2016-01-08 2016-06-08 马岩 A term searching method for network information and a system thereof
CN107562713A (en) * 2016-06-30 2018-01-09 北京智能管家科技有限公司 The method for digging and device of synonymous text
WO2018023481A1 (en) * 2016-08-03 2018-02-08 王晓光 Method and system for applying synonym in big data search
CN106250516A (en) * 2016-08-03 2016-12-21 王晓光 Synonym application process in big data search and system
CN106528644B (en) * 2016-10-14 2020-07-31 航天恒星科技有限公司 Remote sensing data retrieval method and device
CN107025215A (en) * 2017-02-13 2017-08-08 阿里巴巴集团控股有限公司 A kind of picture and text composition method and device
CN107844596A (en) * 2017-11-22 2018-03-27 福建中金在线信息科技有限公司 A kind of article search method and system
CN111160007B (en) * 2019-12-13 2023-04-07 中国平安财产保险股份有限公司 Search method and device based on BERT language model, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101241512A (en) * 2008-03-10 2008-08-13 北京搜狗科技发展有限公司 Search method for redefining enquiry word and device therefor
CN101878476A (en) * 2007-06-22 2010-11-03 谷歌公司 Machine translation for query expansion

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7627548B2 (en) * 2005-11-22 2009-12-01 Google Inc. Inferring search category synonyms from user logs
CN101443759B (en) * 2006-05-12 2010-08-11 北京乐图在线科技有限公司 Multi-lingual information retrieval
CN101872351B (en) * 2009-04-27 2012-10-10 阿里巴巴集团控股有限公司 Method, device for identifying synonyms, and method and device for searching by using same
EP2629211A1 (en) * 2009-08-21 2013-08-21 Mikko Kalervo Väänänen Method and means for data searching and language translation

Also Published As

Publication number Publication date
CN102737021A (en) 2012-10-17

Similar Documents

Publication Publication Date Title
CN102722498B (en) Search engine and implementation method thereof
CN102737021B (en) Search engine and realization method thereof
CN102722501B (en) Search engine and realization method thereof
CN102722499B (en) Search engine and implementation method thereof
CN102073725B (en) Method for searching structured data and search engine system for implementing same
CN102073726B (en) Structured data import method and device for search engine system
CN100524307C (en) Method and device for establishing coupled relation between documents
US8051080B2 (en) Contextual ranking of keywords using click data
CN100530180C (en) Method and system for suggesting search engine keywords
US20090070322A1 (en) Browsing knowledge on the basis of semantic relations
CN101118560A (en) Keyword outputting apparatus, keyword outputting method, and keyword outputting computer program product
CN107729336A (en) Data processing method, equipment and system
JP5329540B2 (en) User-centric information search method, computer-readable recording medium, and user-centric information search system
CN101692223A (en) Refining a search space inresponse to user input
US8359307B2 (en) Method and apparatus for building sales tools by mining data from websites
CN103365924A (en) Method, device and terminal for searching information
CN107918644A (en) News subject under discussion analysis method and implementation system in reputation Governance framework
US8234584B2 (en) Computer system, information collection support device, and method for supporting information collection
CN103942268A (en) Method and device for combining search and application and application interface
Gasparetti et al. Exploiting web browsing activities for user needs identification
Sivakumar Effectual web content mining using noise removal from web pages
US20130031075A1 (en) Action-based deeplinks for search results
CN102063454A (en) Method and equipment combining search and application
US11941073B2 (en) Generating and implementing keyword clusters
KR102107474B1 (en) Social issue deduction system and method using crawling

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant