CN102542022A - Theme search algorithm based on body - Google Patents
Theme search algorithm based on body
- Publication number: CN102542022A
- Application number: CN2011104317036A
- Authority: CN (China)
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention provides an ontology-based topic search algorithm comprising the steps of: establishing ontology-based topic models; matching suitable member search engines to each topic model; and processing the search results. The algorithm performs well, effectively meets different users' search needs across different topics, and achieves high precision while maintaining recall.
Description
Technical field
The present invention relates to the field of personalized information search algorithms, and in particular to an ontology-based topic search algorithm.
Background
Many current search services include personalized information search services aimed at different users, such as personalized search based on user behavior analysis. For the same query submitted by different users, the returned results differ to some extent; that is, the system can, to a certain degree, recognize differences in individual users' information needs.
However, because the user's query topic cannot be determined and described accurately enough, how to perform topic-based meta search according to users' different search needs has become a research focus for many scholars in information retrieval.
Some personalized information services track user behavior to build a user interest model, and use that model to determine the user's fields and topics of interest. User interest is highly variable, however: once a user's new search behavior is inconsistent with the existing interest model, the accuracy of the search results suffers greatly.
An ontology is an explicit, formal specification of a shared conceptual model. Its goal is to analyze the knowledge of a domain in order to provide a common understanding of that knowledge, determine the commonly accepted concepts (terms) of the domain, give clear definitions of the relationships among these concepts at different levels, and describe the terms and their relationships in a standardized formal language. Introducing an ontology therefore makes it possible to express each topic concept more accurately.
Summary of the invention
The present invention proposes an ontology-based topic search algorithm that performs well, more effectively meets different users' search needs for different topics while maintaining recall, and achieves higher precision.
To achieve the above object, the present invention proposes an ontology-based topic search algorithm comprising the following steps:
establishing an ontology-based topic model;
matching suitable member search engines according to the different topic models;
processing the search results.
Further, the ontology-based topic model is represented by a triple Topic(C, P, S), forming a topic tree structure, wherein: C denotes the set of concept classes, expressed by noun concepts in the subject field, that share the same attributes and behavioral structure; P describes the attributes of the concepts and their relations; and S denotes the structural relations between topic classes.
Further, C is represented using the vector space model as 2-tuples Ci(Keyi, Weighti), where Keyi denotes a keyword and Weighti denotes the weight of that keyword.
Further, in the step of matching suitable member search engines, recommended member search engines are preset, and member search engines can be added or removed.
Further, processing the search results comprises preprocessing the search results, extracting a feature word set, and topic matching.
Further, preprocessing the search results consists of merging and deduplicating the retrieval results from the member search engines and then performing word segmentation.
Further, extracting the feature word set consists of extracting the feature words that express the web page content and assigning each a weight according to the position in which it appears; the weights of identical feature words are summed to form the web page feature word set.
Further, each web page in the search results is represented by a feature vector, and the concept of each subcategory of the topic is also a feature vector; according to the vector space model, the cosine of the angle between two feature vectors represents their degree of correlation.
Further, the degree of correlation between a web page and the topic is calculated and, according to a preset threshold, the web pages with the highest correlation are returned to the user in order of correlation.
Further, if the correlation between a web page and every attribute of the concept in the ontology fails to reach the minimum correlation set in the threshold policy, the web page is identified as not belonging to the domain specified by the user and is removed from the result set.
The ontology-based topic search algorithm proposed by the present invention builds topic models from the explicit definitions of domain concepts, and of the relationships between concepts, that an ontology provides, so topic models can be determined fairly accurately. When searching, the user selects the topic to be searched; the best member search engines for that topic are matched according to its topic model, and the user can add or remove member search engines according to preference. For the search results returned by each member search engine, the vector space model is used to calculate similarity to the topic, and the results that satisfy the conditions are returned to the user. Because an ontology is used, the user's topic is expressed more accurately, which solves the problem of inaccurate search results caused by an ill-defined topic of interest, so the accuracy of the search results improves. During the search, web page results are ranked by correlation against the fairly accurate topic model already established, so as to obtain the most relevant web pages. The method thus both reflects the user's personalization and improves the accuracy of topic search.
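The similarity calculation mentioned above corresponds to the standard cosine measure of the vector space model. As a sketch, write the feature vector of a web page as d with weights w_{d,k} and a topic concept vector as c with weights w_{c,k} over a shared vocabulary of n feature words; these symbols are introduced here for illustration and do not appear in the patent text.

```latex
\mathrm{Sim}(d, c) = \cos\theta
  = \frac{\sum_{k=1}^{n} w_{d,k}\, w_{c,k}}
         {\sqrt{\sum_{k=1}^{n} w_{d,k}^{2}}\;\sqrt{\sum_{k=1}^{n} w_{c,k}^{2}}}
```

With non-negative weights the value lies between 0 and 1: it is 0 when the page and the concept share no feature words, and 1 when the two weight vectors point in the same direction.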
Description of the drawings
Fig. 1 is a flowchart of the ontology-based topic search algorithm according to a preferred embodiment of the present invention.
Embodiment
To make the technical content of the present invention easier to understand, specific embodiments are described below with reference to the accompanying drawing.
Referring to Fig. 1, which is a flowchart of the ontology-based topic search algorithm according to a preferred embodiment of the present invention, the present invention proposes an ontology-based topic search algorithm comprising the following steps:
Step S100: establish an ontology-based topic model;
Step S200: match suitable member search engines according to the different topic models;
Step S300: process the search results.
According to a preferred embodiment of the present invention, the ontology-based topic model is represented by a triple Topic(C, P, S), forming a topic tree structure, wherein: C denotes the set of concept classes, expressed by noun concepts in the subject field, that share the same attributes and behavioral structure; P describes the attributes of the concepts and their relations; and S denotes the structural relations between topic classes, such as parent class and subclass. C is represented using the vector space model (VSM) as 2-tuples Ci(Keyi, Weighti), where Keyi denotes a keyword and Weighti denotes the weight of that keyword.
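The Topic(C, P, S) triple can be pictured as a small data structure. The following is a minimal sketch, not the patent's implementation: the class names, the choice to store each concept class as a keyword-to-weight mapping, and the child-to-parent map used for S are all assumptions made for illustration, as are the example concepts.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One concept class Ci, stored as keyword -> weight (the (Keyi, Weighti) pairs)."""
    name: str
    keywords: dict[str, float] = field(default_factory=dict)


@dataclass
class Topic:
    """Hypothetical container for Topic(C, P, S)."""
    concepts: dict[str, Concept] = field(default_factory=dict)            # C
    properties: dict[str, dict[str, str]] = field(default_factory=dict)   # P
    parent_of: dict[str, str] = field(default_factory=dict)               # S: child -> parent

    def add_concept(self, concept: Concept, parent: str | None = None, **props: str) -> None:
        """Register a concept, its attributes, and (optionally) its parent class."""
        if parent is not None and parent not in self.concepts:
            raise KeyError(f"unknown parent concept: {parent}")
        self.concepts[concept.name] = concept
        self.properties[concept.name] = dict(props)
        if parent is not None:
            self.parent_of[concept.name] = parent

    def children(self, name: str) -> list[str]:
        """Direct subclasses of a concept, in alphabetical order."""
        return sorted(c for c, p in self.parent_of.items() if p == name)

    def path_to_root(self, name: str) -> list[str]:
        """Walk the topic tree upward from a concept to its root."""
        path = [name]
        while path[-1] in self.parent_of:
            path.append(self.parent_of[path[-1]])
        return path


# Illustrative topic tree (concept names and weights are made up).
topic = Topic()
topic.add_concept(Concept("vehicle", {"vehicle": 1.0}))
topic.add_concept(Concept("car", {"car": 1.0, "engine": 0.5}),
                  parent="vehicle", domain="transport")
```

Keeping S as a child-to-parent map makes the tree easy to traverse in both directions while leaving C and P as plain lookups by concept name.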
The suitable member search engines differ from topic to topic. In the step of matching suitable member search engines, recommended member search engines are preset, and member search engines can be added or removed. That is, a number of recommended member search engines are configured in advance for each topic to guide the user, and the user can add or remove member search engines when selecting the topic to search.
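The preset-plus-adjustment behavior described above can be sketched as a small selector. This is only an illustration under assumed names: `EngineSelector`, the topic label `medicine`, and the engine identifiers are hypothetical and are not taken from the patent.

```python
class EngineSelector:
    """Holds recommended member search engines per topic and applies user changes."""

    def __init__(self, recommended):
        # topic name -> ordered list of recommended engine names (copied defensively)
        self._recommended = {topic: list(engines) for topic, engines in recommended.items()}

    def select(self, topic, add=(), remove=()):
        """Return the engines for a topic after the user's additions and removals."""
        engines = list(self._recommended.get(topic, []))
        for name in add:
            if name not in engines:
                engines.append(name)
        removed = set(remove)
        return [name for name in engines if name not in removed]


# Hypothetical presets and one user adjustment.
selector = EngineSelector({"medicine": ["engine_a", "engine_b"]})
default_engines = selector.select("medicine")
custom_engines = selector.select("medicine", add=["engine_c"], remove=["engine_a"])
```

Because `select` works on a copy, a user's adjustments for one search do not alter the recommended presets for later searches.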
Processing the search results comprises preprocessing the search results, extracting a feature word set, and topic matching. The detailed process is as follows.
(1) In the retrieval result preprocessing module, the retrieval results from the member search engines are merged and deduplicated and then segmented into words. The feature words that express the web page content are extracted, and each is given a weight according to where it appears (for example, in the web page title, in the web page summary, or in the same sentence as a query concept). The weights of identical feature words are summed to form the web page feature word set Ti = {(Wordik, Weightik)}. Each web page in the search results is thus represented by a feature vector; the concept of each subcategory of the topic is also a feature vector, and, according to the vector space model, the cosine of the angle between two feature vectors represents their degree of correlation. The degree of correlation Simj between a web page and the topic can therefore be calculated and, according to a preset threshold, the web pages with the highest correlation are returned to the user in order of correlation.
(2) Some web pages in the query results contain concepts that match the query words but do not belong to the domain specified by the user; the correlation between the feature word sets of these pages and the concept terms and feature word sets in the ontology will be very low. If the correlation between a web page and every attribute of the concept in the ontology fails to reach the minimum correlation set in the threshold policy, the web page can be identified as falling outside the subject field and is removed from the result set.
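The two stages above can be sketched end to end as follows. This is a rough illustration under several assumptions that the patent does not specify: pages are dictionaries with `url`, `title`, `summary`, and `body` fields; the position weights in `POSITION_WEIGHTS` are invented values; whitespace splitting stands in for real word segmentation; and a page's score is taken as its best cosine match over the topic's concept vectors, so a page is rejected only when every concept vector falls below the threshold.

```python
import math
from collections import defaultdict

# Hypothetical position weights; the patent gives no concrete values.
POSITION_WEIGHTS = {"title": 3.0, "summary": 2.0, "body": 1.0}


def deduplicate(results):
    """Merge results from several engines, keeping the first hit per URL."""
    seen, merged = set(), []
    for page in results:
        if page["url"] not in seen:
            seen.add(page["url"])
            merged.append(page)
    return merged


def feature_vector(page):
    """Sum position weights per word to form the page's feature word set."""
    vector = defaultdict(float)
    for position, weight in POSITION_WEIGHTS.items():
        # Whitespace split is a stand-in for a real word segmenter.
        for word in page.get(position, "").lower().split():
            vector[word] += weight
    return dict(vector)


def cosine(a, b):
    """Cosine of the angle between two sparse word -> weight vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(results, concept_vectors, threshold, top_n):
    """Score each page against the topic, drop low scorers, return the top pages."""
    scored = []
    for page in deduplicate(results):
        vector = feature_vector(page)
        score = max((cosine(vector, c) for c in concept_vectors), default=0.0)
        if score >= threshold:  # pages below the minimum correlation are rejected
            scored.append((score, page["url"]))
    scored.sort(key=lambda item: -item[0])
    return scored[:top_n]


# Made-up results from two engines; the first page is returned by both.
results = [
    {"url": "u1", "title": "jaguar car review", "body": "engine speed"},
    {"url": "u1", "title": "jaguar car review", "body": "engine speed"},
    {"url": "u2", "title": "jaguar habitat", "body": "rainforest cat"},
]
concepts = [{"car": 1.0, "engine": 0.5}]
ranked = rank(results, concepts, threshold=0.2, top_n=5)
```

In this example the page `u2` matches the query word "jaguar" but shares no feature words with the car concept, so its correlation is zero and it is removed from the result set, which mirrors the rejection rule in stage (2).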
In summary, the ontology-based topic search algorithm proposed by the present invention builds topic models from the explicit definitions of domain concepts, and of the relationships between concepts, that an ontology provides, so topic models can be determined fairly accurately. When searching, the user selects the topic to be searched; the best member search engines for that topic are matched according to its topic model, and the user can add or remove member search engines according to preference. For the search results returned by each member search engine, the vector space model is used to calculate similarity to the topic, and the results that satisfy the conditions are returned to the user. Because an ontology is used, the user's topic is expressed more accurately, which solves the problem of inaccurate search results caused by an ill-defined topic of interest, so the accuracy of the search results improves. During the search, web page results are ranked by correlation against the fairly accurate topic model already established, so as to obtain the most relevant web pages. The method thus both reflects the user's personalization and improves the accuracy of topic search.
Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. A person of ordinary skill in the technical field of the invention may make various changes and modifications without departing from its spirit and scope. The scope of protection of the present invention is therefore defined by the claims.
Claims (10)
1. An ontology-based topic search algorithm, characterized by comprising the following steps:
establishing an ontology-based topic model;
matching suitable member search engines according to the different topic models;
processing the search results.
2. The ontology-based topic search algorithm according to claim 1, characterized in that the ontology-based topic model is represented by a triple Topic(C, P, S), forming a topic tree structure, wherein: C denotes the set of concept classes, expressed by noun concepts in the subject field, that share the same attributes and behavioral structure; P describes the attributes of the concepts and their relations; and S denotes the structural relations between topic classes.
3. The ontology-based topic search algorithm according to claim 2, characterized in that C is represented using the vector space model as 2-tuples Ci(Keyi, Weighti), where Keyi denotes a keyword and Weighti denotes the weight of that keyword.
4. The ontology-based topic search algorithm according to claim 1, characterized in that, in the step of matching suitable member search engines, recommended member search engines are preset, and member search engines can be added or removed.
5. The ontology-based topic search algorithm according to claim 1, characterized in that processing the search results comprises preprocessing the search results, extracting a feature word set, and topic matching.
6. The ontology-based topic search algorithm according to claim 5, characterized in that preprocessing the search results consists of merging and deduplicating the retrieval results from the member search engines and then performing word segmentation.
7. The ontology-based topic search algorithm according to claim 5, characterized in that extracting the feature word set consists of extracting the feature words that express the web page content and assigning each a weight according to the position in which it appears; the weights of identical feature words are summed to form the web page feature word set.
8. The ontology-based topic search algorithm according to claim 1, characterized in that each web page in the search results is represented by a feature vector, and the concept of each subcategory of the topic is also a feature vector; according to the vector space model, the cosine of the angle between two feature vectors represents their degree of correlation.
9. The ontology-based topic search algorithm according to claim 8, characterized in that the degree of correlation between a web page and the topic is calculated and, according to a preset threshold, the web pages with the highest correlation are returned to the user in order of correlation.
10. The ontology-based topic search algorithm according to claim 9, characterized in that, if the correlation between a web page and every attribute of the concept in the ontology fails to reach the minimum correlation set in the threshold policy, the web page is identified as not belonging to the domain specified by the user and is removed from the result set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011104317036A CN102542022A (en) | 2011-12-20 | 2011-12-20 | Theme search algorithm based on body |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102542022A true CN102542022A (en) | 2012-07-04 |
Family
ID=46348905
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2011104317036A Pending CN102542022A (en) | 2011-12-20 | 2011-12-20 | Theme search algorithm based on body |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102542022A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103064907A (en) * | 2012-12-18 | 2013-04-24 | 上海电机学院 | System and method for topic meta search based on unsupervised entity relation extraction |
CN105095229A (en) * | 2014-04-29 | 2015-11-25 | 国际商业机器公司 | Method for training topic model, method for comparing document content and corresponding device |
CN103593413B (en) * | 2013-10-27 | 2016-11-09 | 西安电子科技大学 | META Search Engine personalized method based on Agent |
CN108920484A (en) * | 2018-04-28 | 2018-11-30 | 广州市百果园网络科技有限公司 | Search for content processing method, device and storage equipment, computer equipment |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103064907A (en) * | 2012-12-18 | 2013-04-24 | 上海电机学院 | System and method for topic meta search based on unsupervised entity relation extraction |
CN103593413B (en) * | 2013-10-27 | 2016-11-09 | 西安电子科技大学 | META Search Engine personalized method based on Agent |
CN105095229A (en) * | 2014-04-29 | 2015-11-25 | 国际商业机器公司 | Method for training topic model, method for comparing document content and corresponding device |
CN108920484A (en) * | 2018-04-28 | 2018-11-30 | 广州市百果园网络科技有限公司 | Search for content processing method, device and storage equipment, computer equipment |
CN108920484B (en) * | 2018-04-28 | 2022-06-10 | 广州市百果园网络科技有限公司 | Search content processing method and device, storage device and computer device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103593425B (en) | Preference-based intelligent retrieval method and system | |
US20100318537A1 (en) | Providing knowledge content to users | |
CN101840397A (en) | Word sense disambiguation method and system | |
CN102087669A (en) | Intelligent search engine system based on semantic association | |
CN103577416A (en) | Query expansion method and system | |
CN101097570A (en) | Advertisement classification method capable of automatic recognizing classified advertisement type | |
CN103823893A (en) | User comment-based product search method and system | |
CN103853738A (en) | Identification method for webpage information related region | |
CN106372118B (en) | Online semantic understanding search system and method towards mass media text data | |
US20240119047A1 (en) | Answer facts from structured content | |
PH12015502104B1 (en) | System for non-deterministic disambiguation and qualitative entity matching of geographical locale data for business entities | |
CN104572631A (en) | Training method and system for language model | |
CN105677695A (en) | Method for calculating similarity of mobile applications based on content | |
CN103020074A (en) | Object-level search technique based on main body | |
CN103064907A (en) | System and method for topic meta search based on unsupervised entity relation extraction | |
CN117290489B (en) | Method and system for quickly constructing industry question-answer knowledge base | |
Moya et al. | Integrating web feed opinions into a corporate data warehouse | |
CN103136213A (en) | Method and device for providing related words | |
CN102542022A (en) | Theme search algorithm based on body | |
CN108984711A (en) | A kind of personalized APP recommended method based on layering insertion | |
CN111274366A (en) | Search recommendation method and device, equipment and storage medium | |
CN104077327A (en) | Core word importance recognition method and equipment and search result sorting method and equipment | |
CN106227762A (en) | A kind of method for vertical search assisted based on user and system | |
CN107665442B (en) | Method and device for acquiring target user | |
CN105550282A (en) | User interest forecasting method by utilizing multidimensional data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20120704 |