CN102542022A - Theme search algorithm based on body - Google Patents


Info

Publication number
CN102542022A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011104317036A
Other languages
Chinese (zh)
Inventor
闫俊英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Dianji University
Original Assignee
Shanghai Dianji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2011-12-20
Filing date
2011-12-20
Publication date
2012-07-04
Application filed by Shanghai Dianji University
Priority to CN2011104317036A
Publication of CN102542022A
Legal status: Pending

Abstract

The invention provides an ontology-based topic search algorithm comprising the steps of: establishing an ontology-based topic model; matching suitable member search engines according to the different topic models; and processing the search results. The algorithm performs well, effectively satisfies different users' search needs for different topics, and achieves high precision while maintaining recall.

Description

An ontology-based topic search algorithm
Technical field
The present invention relates to the field of personalized information search algorithms, and in particular to an ontology-based topic search algorithm.
Background art
Many current search services include personalized information search services aimed at different users, such as personalized search based on user behavior analysis. The results returned for the same query from different users differ to some extent; that is, the system can recognize, to a certain degree, differences in the individual information needs of different users.
However, the topic of a user's query cannot yet be determined and described accurately. How to carry out topic-based meta search according to users' different search needs has therefore become a research focus for many scholars in the field of information retrieval.
Some personalized information services track user behavior to build a user interest model, and use that model to determine the fields and topics the user is interested in. However, user interest behavior is highly variable: once a user's new search behavior is inconsistent with the earlier interest model, the accuracy of the search results is greatly affected.
An ontology is an explicit, formal specification of a shared conceptual model. Its goal is to analyze the knowledge of a given domain so as to provide a common understanding of that domain knowledge, identify the concepts (terms) commonly accepted in the domain, give clear definitions of the relations between these concepts at different levels, and describe the terms and their relations in a standard formal language. Introducing an ontology therefore allows each topic concept to be expressed more accurately.
Summary of the invention
The present invention proposes an ontology-based topic search algorithm that performs well: while maintaining recall, it more effectively satisfies different users' search needs for different topics and achieves higher precision.
To achieve the above object, the present invention proposes an ontology-based topic search algorithm comprising the following steps:
establishing an ontology-based topic model;
matching suitable member search engines according to the different topic models;
processing the search results.
Further, the ontology-based topic model is represented by the triple Topic(C, P, S) and forms a topic tree structure, wherein: C denotes the set of concept classes, formed from the noun concepts of the topic domain, that share the same attributes and behavioral structure; P describes the attributes and relations of the concepts; S denotes the structural relations between topic classes.
Further, C is represented using the vector space model as pairs Ci(Keyi, Weighti), where Keyi denotes a keyword and Weighti denotes the weight of that keyword.
Further, the step of matching suitable member search engines is preset with recommended member search engines, and member search engines can be added or removed.
Further, processing the search results comprises pre-processing the search results, extracting a feature word set, and topic matching.
Further, pre-processing the search results consists of merging the results retrieved from each member search engine, removing duplicates, and then performing word segmentation.
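The pre-processing step can be sketched as follows. This is a minimal illustration, not the patented implementation: the result fields `url` and `title`, the choice of the URL as the deduplication key, and the regex tokenizer are all assumptions, since the source does not say how duplicates are detected or which word segmenter is used. Chinese text would need a dedicated segmenter in place of `segment`.

```python
import re


def merge_results(result_lists):
    """Merge per-engine result lists, keeping the first copy of each URL.

    Assumption: two results are duplicates when their URLs match after
    stripping a trailing slash.
    """
    seen = set()
    merged = []
    for results in result_lists:
        for item in results:
            key = item["url"].rstrip("/")
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return merged


def segment(text):
    """Stand-in word segmentation: lowercase alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())
```

Keeping the first copy preserves the order in which the member search engines were queried, which is one simple way to make the merged list deterministic.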
Further, extracting the feature word set consists of extracting the feature words that express the content of a web page and assigning each a weight according to the position in which it appears; the weights of identical feature words are summed to form the web page feature word set.
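A minimal sketch of this position-based weighting is shown below. The position names and the numeric weights in `POSITION_WEIGHTS` are hypothetical; the source states only that weights depend on position and that the weights of identical feature words are added together.

```python
from collections import defaultdict

# Hypothetical weights per position; the source gives no concrete values.
POSITION_WEIGHTS = {"title": 3.0, "summary": 2.0, "body": 1.0}


def feature_word_set(fields):
    """Build a {word: weight} feature set for one web page.

    `fields` maps a position name to the feature words found there.
    Words that occur in several positions accumulate their weights.
    """
    weights = defaultdict(float)
    for position, words in fields.items():
        for word in words:
            # Unknown positions fall back to the lowest weight (assumption).
            weights[word] += POSITION_WEIGHTS.get(position, 1.0)
    return dict(weights)
```

For example, a word that appears once in the title and once in the summary ends up with weight 5.0 under these assumed values.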
Further, each web page in the search results is represented by a feature vector, and the concept of each topic subcategory is likewise a feature vector; according to the vector space model, the cosine of the angle between two feature vectors represents their relevance.
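The cosine measure can be sketched over sparse vectors stored as `{word: weight}` dictionaries. The dictionary representation and the convention of returning 0.0 for an empty vector are assumptions made for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {word: weight} vectors."""
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an empty vector is unrelated to everything
    return dot / (norm_a * norm_b)
```

Because only words present in both vectors contribute to the dot product, a page that shares no feature words with a topic concept scores 0.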
Further, the relevance of each web page to the topic is calculated and, according to a preset threshold, the web pages with the highest relevance are returned to the user in order of relevance.
Further, if the relevance between a web page and every attribute of the concept in the ontology falls below the minimum relevance set in the threshold strategy, the web page is identified as not belonging to the domain chosen by the user and is removed from the result set.
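The threshold strategy and the ranking step might be combined as in the sketch below. It assumes the per-attribute relevance scores have already been computed for each page; ranking by the best score and the parameter names `min_relevance` and `top_k` are illustrative choices, not taken from the source.

```python
def filter_and_rank(page_scores, min_relevance, top_k):
    """Drop off-topic pages, then return the top_k most relevant ones.

    `page_scores` maps a page identifier to its relevance scores, one per
    attribute of the ontology concept. A page is rejected only when every
    score is below `min_relevance`.
    """
    kept = []
    for page, scores in page_scores.items():
        best = max(scores, default=0.0)
        if best >= min_relevance:
            kept.append((page, best))
    # Highest relevance first (assumption: pages are ranked by best score).
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]
```

A page with no scores at all is treated as having relevance 0.0 and is rejected for any positive threshold.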
The ontology-based topic search algorithm proposed by the present invention builds its topic model on an ontology, that is, on explicit definitions of domain concepts and of the relations between them, so the topic model can be determined fairly accurately. When searching, the user selects the topic to be searched; the best topic-related member search engines are matched according to each topic model, and the user can add or remove member search engines according to preference. For the search results returned by each member search engine, the vector space model is used to calculate the similarity of each result to the topic, and the results that satisfy the conditions are returned to the user. Because an ontology is used, the user's topic is expressed more precisely. This solves the problem of inaccurate search results caused by an ill-defined topic of interest, and so improves the accuracy of the search results. During the search, the web pages in the results are ranked by relevance against the fairly accurate topic model already established, so that pages with higher relevance are obtained. The method thus both reflects the user's individual needs and improves the accuracy of topic search.
Description of drawings
Fig. 1 is a flowchart of the ontology-based topic search algorithm according to a preferred embodiment of the present invention.
Embodiment
For a better understanding of the technical content of the present invention, specific embodiments are described below with reference to the accompanying drawing.
Referring to Fig. 1, which is a flowchart of the ontology-based topic search algorithm according to a preferred embodiment of the present invention, the invention proposes an ontology-based topic search algorithm comprising the following steps:
Step S100: establishing an ontology-based topic model;
Step S200: matching suitable member search engines according to the different topic models;
Step S300: processing the search results.
According to a preferred embodiment of the present invention, the ontology-based topic model is represented by the triple Topic(C, P, S) and forms a topic tree structure, wherein: C denotes the set of concept classes, formed from the noun concepts of the topic domain, that share the same attributes and behavioral structure; P describes the attributes and relations of the concepts; S denotes the structural relations between topic classes, such as parent class and subclass. C is represented using the vector space model (VSM) as pairs Ci(Keyi, Weighti), where Keyi denotes a keyword and Weighti denotes the weight of that keyword.
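One way to picture the triple Topic(C, P, S) is as a small data structure. The sketch below is an assumption-laden illustration: the class and field names (`Concept`, `keywords`, `attributes`, `parent_of`) are hypothetical, and storing S as a child-to-parent map is only one possible encoding of the parent class and subclass relations.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    """One concept class C_i with its (Key_i, Weight_i) pairs."""
    name: str
    keywords: Dict[str, float]  # Key_i -> Weight_i
    attributes: Dict[str, str] = field(default_factory=dict)  # P


@dataclass
class Topic:
    """Topic(C, P, S): concepts, their attributes, and a tree structure."""
    concepts: Dict[str, Concept] = field(default_factory=dict)  # C, with P per concept
    parent_of: Dict[str, str] = field(default_factory=dict)  # S: child -> parent

    def add_concept(self, concept: Concept, parent: Optional[str] = None) -> None:
        if parent is not None and parent not in self.concepts:
            raise KeyError(f"unknown parent concept: {parent}")
        self.concepts[concept.name] = concept
        if parent is not None:
            self.parent_of[concept.name] = parent

    def children(self, name: str) -> List[str]:
        """Names of the direct subclasses of `name`, sorted alphabetically."""
        return sorted(c for c, p in self.parent_of.items() if p == name)
```

Requiring the parent to exist before a child is added keeps the structure a tree rooted at the concepts that have no parent.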
The suitable member search engines differ from topic to topic. The step of matching suitable member search engines is preset with recommended member search engines, and member search engines can be added or removed. For each topic, a number of recommended member search engines are configured in advance to guide the user, and the user can add or remove member search engines when selecting the topic to search.
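The engine matching step can be sketched as a lookup of preset recommendations followed by the user's adjustments. The class name `EngineSelector`, its parameters, and the engine names used in the usage note are hypothetical.

```python
class EngineSelector:
    """Preset recommended member search engines per topic, user-adjustable."""

    def __init__(self, recommended):
        # recommended: topic name -> list of preset engine names
        self._recommended = {t: list(e) for t, e in recommended.items()}

    def select(self, topic, add=(), remove=()):
        """Return the engines for `topic` after the user's additions and removals."""
        engines = list(self._recommended.get(topic, []))
        for name in add:
            if name not in engines:
                engines.append(name)
        removed = set(remove)
        return [name for name in engines if name not in removed]
```

For instance, with the presets `{"medicine": ["engine_a", "engine_b"]}`, calling `select("medicine", add=["engine_c"], remove=["engine_a"])` yields `["engine_b", "engine_c"]`. The presets themselves are never modified, so each search starts again from the recommended list.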
Processing the search results comprises pre-processing the search results, extracting a feature word set, and topic matching. The detailed procedure is as follows.
(1) In the search result pre-processing module, the results retrieved from each member search engine are merged and deduplicated and then segmented into words. The feature words that express the content of each web page are extracted, and each is assigned a weight according to where it appears (for example, in the page title, in the page summary, or in the same sentence as the query concept). The weights of identical feature words are summed, forming the web page feature word set Ti={(Word1k, Weight1k)}. Each web page in the search results is thus represented by a feature vector, and the concept of each topic subcategory is likewise a feature vector; according to the vector space model, the cosine of the angle between two feature vectors can represent their relevance. The relevance Simj of a web page to the topic can be calculated in this way and, according to a preset threshold, the web pages with the highest relevance are returned to the user in order of relevance.
(2) Some web pages in the query results contain concepts that match the query words but do not belong to the domain chosen by the user; the relevance between the feature word sets of these pages and the concept terms and feature word sets in the ontology will be very low. If the relevance between a web page and every attribute of the concept in the ontology falls below the minimum relevance set in the threshold strategy, the web page can be identified as not belonging to this subject area and is removed from the result set.
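For reference, the relevance in step (1) can be written out as the standard vector space cosine. The source names Simj and Ti but does not spell out the formula, so the symbols below are assumptions: w_{ik} is the weight of the k-th feature word in the web page feature word set T_i, c_{jk} is the weight of the same word in the feature vector of topic concept j, and n is the size of the combined vocabulary.

```latex
\mathrm{Sim}_j(T_i) = \cos\theta_{ij}
  = \frac{\sum_{k=1}^{n} w_{ik}\, c_{jk}}
         {\sqrt{\sum_{k=1}^{n} w_{ik}^{2}}\;\sqrt{\sum_{k=1}^{n} c_{jk}^{2}}}
```

With non-negative weights the value lies between 0 and 1, and a word missing from either vector contributes 0 to the numerator.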
In summary, the ontology-based topic search algorithm proposed by the present invention builds its topic model on an ontology, that is, on explicit definitions of domain concepts and of the relations between them, so the topic model can be determined fairly accurately. When searching, the user selects the topic to be searched; the best topic-related member search engines are matched according to each topic model, and the user can add or remove member search engines according to preference. For the search results returned by each member search engine, the vector space model is used to calculate the similarity of each result to the topic, and the results that satisfy the conditions are returned to the user. Because an ontology is used, the user's topic is expressed more precisely. This solves the problem of inaccurate search results caused by an ill-defined topic of interest, and so improves the accuracy of the search results. During the search, the web pages in the results are ranked by relevance against the fairly accurate topic model already established, so that pages with higher relevance are obtained. The method thus both reflects the user's individual needs and improves the accuracy of topic search.
Although the present invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. A person of ordinary skill in the art to which the invention belongs may make various changes and refinements without departing from its spirit and scope. The scope of protection of the invention is therefore defined by the claims.

Claims (10)

1. An ontology-based topic search algorithm, characterized in that it comprises the following steps:
establishing an ontology-based topic model;
matching suitable member search engines according to the different topic models;
processing the search results.
2. The ontology-based topic search algorithm according to claim 1, characterized in that the ontology-based topic model is represented by the triple Topic(C, P, S) and forms a topic tree structure, wherein: C denotes the set of concept classes, formed from the noun concepts of the topic domain, that share the same attributes and behavioral structure; P describes the attributes and relations of the concepts; S denotes the structural relations between topic classes.
3. The ontology-based topic search algorithm according to claim 2, characterized in that C is represented using the vector space model as pairs Ci(Keyi, Weighti), where Keyi denotes a keyword and Weighti denotes the weight of that keyword.
4. The ontology-based topic search algorithm according to claim 1, characterized in that the step of matching suitable member search engines is preset with recommended member search engines, and member search engines can be added or removed.
5. The ontology-based topic search algorithm according to claim 1, characterized in that processing the search results comprises pre-processing the search results, extracting a feature word set, and topic matching.
6. The ontology-based topic search algorithm according to claim 5, characterized in that pre-processing the search results consists of merging the results retrieved from each member search engine, removing duplicates, and then performing word segmentation.
7. The ontology-based topic search algorithm according to claim 5, characterized in that extracting the feature word set consists of extracting the feature words that express the content of a web page and assigning each a weight according to the position in which it appears, the weights of identical feature words being summed to form the web page feature word set.
8. The ontology-based topic search algorithm according to claim 1, characterized in that each web page in the search results is represented by a feature vector and the concept of each topic subcategory is likewise a feature vector, and that, according to the vector space model, the cosine of the angle between two feature vectors represents their relevance.
9. The ontology-based topic search algorithm according to claim 8, characterized in that the relevance of each web page to the topic is calculated and, according to a preset threshold, the web pages with the highest relevance are returned to the user in order of relevance.
10. The ontology-based topic search algorithm according to claim 9, characterized in that, if the relevance between a web page and every attribute of the concept in the ontology falls below the minimum relevance set in the threshold strategy, the web page is identified as not belonging to the domain chosen by the user and is removed from the result set.
CN2011104317036A 2011-12-20 2011-12-20 Theme search algorithm based on body Pending CN102542022A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011104317036A CN102542022A (en) 2011-12-20 2011-12-20 Theme search algorithm based on body

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011104317036A CN102542022A (en) 2011-12-20 2011-12-20 Theme search algorithm based on body

Publications (1)

Publication Number Publication Date
CN102542022A true CN102542022A (en) 2012-07-04

Family

ID=46348905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011104317036A Pending CN102542022A (en) 2011-12-20 2011-12-20 Theme search algorithm based on body

Country Status (1)

Country Link
CN (1) CN102542022A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103064907A (en) * 2012-12-18 2013-04-24 上海电机学院 System and method for topic meta search based on unsupervised entity relation extraction
CN103593413B (en) * 2013-10-27 2016-11-09 西安电子科技大学 META Search Engine personalized method based on Agent
CN105095229A (en) * 2014-04-29 2015-11-25 国际商业机器公司 Method for training topic model, method for comparing document content and corresponding device
CN108920484A (en) * 2018-04-28 2018-11-30 广州市百果园网络科技有限公司 Search for content processing method, device and storage equipment, computer equipment
CN108920484B (en) * 2018-04-28 2022-06-10 广州市百果园网络科技有限公司 Search content processing method and device, storage device and computer device

Similar Documents

Publication Publication Date Title
CN103593425B (en) Preference-based intelligent retrieval method and system
US20100318537A1 (en) Providing knowledge content to users
CN101840397A (en) Word sense disambiguation method and system
CN102087669A (en) Intelligent search engine system based on semantic association
CN103577416A (en) Query expansion method and system
CN101097570A (en) Advertisement classification method capable of automatic recognizing classified advertisement type
CN103823893A (en) User comment-based product search method and system
CN103853738A (en) Identification method for webpage information related region
CN106372118B (en) Online semantic understanding search system and method towards mass media text data
US20240119047A1 (en) Answer facts from structured content
PH12015502104B1 (en) System for non-deterministic disambiguation and qualitative entity matching of geographical locale data for business entities
CN104572631A (en) Training method and system for language model
CN105677695A (en) Method for calculating similarity of mobile applications based on content
CN103020074A (en) Object-level search technique based on main body
CN103064907A (en) System and method for topic meta search based on unsupervised entity relation extraction
CN117290489B (en) Method and system for quickly constructing industry question-answer knowledge base
Moya et al. Integrating web feed opinions into a corporate data warehouse
CN103136213A (en) Method and device for providing related words
CN102542022A (en) Theme search algorithm based on body
CN108984711A (en) A kind of personalized APP recommended method based on layering insertion
CN111274366A (en) Search recommendation method and device, equipment and storage medium
CN104077327A (en) Core word importance recognition method and equipment and search result sorting method and equipment
CN106227762A (en) A kind of method for vertical search assisted based on user and system
CN107665442B (en) Method and device for acquiring target user
CN105550282A (en) User interest forecasting method by utilizing multidimensional data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120704