CN102279894B - Method for searching, integrating and providing comment information based on semantics and searching system - Google Patents

Info

Publication number
CN102279894B
Authority
CN
China
Prior art keywords
information
review information
module
integrated
review
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201110278049
Other languages
Chinese (zh)
Other versions
CN102279894A (en)
Inventor
周诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUANGZHOU IN-DEPTH DATA TECHNOLOGY CO., LTD.
Original Assignee
JIAXING YIYANTANG INFORMATION TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JIAXING YIYANTANG INFORMATION TECHNOLOGY CO LTD filed Critical JIAXING YIYANTANG INFORMATION TECHNOLOGY CO LTD
Priority to CN 201110278049 priority Critical patent/CN102279894B/en
Publication of CN102279894A publication Critical patent/CN102279894A/en
Application granted granted Critical
Publication of CN102279894B publication Critical patent/CN102279894B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to Internet search technology and aims to provide a semantics-based method and search system for searching, integrating and providing review information. The method comprises the following steps: discovering and integrating text review information by means of a search engine; and classifying and summarizing review texts by topic. In particular, the method further comprises: extracting non-text review information; mining the semantic relations between the non-text review information and the text review information; and integrating the two types of information on that basis to meet search service needs. The method realizes the evaluation, integration and summarization of heterogeneous information (namely text information and non-text information) and of hierarchically structured review content. The search engine is given new functions for discovering, creating and managing review information related to the review object, which gives users a richer and more considerate experience and serves them better.

Description

Semantics-based method and search system for searching, integrating and providing review information
Technical field
The present invention relates to Internet search technology, and in particular to a method and a search engine system that realize the evaluation, integration and summarization of heterogeneous and hierarchically structured review content.
Background art
It is very common for people to use search engines to look for review information about products, services, activities, persons and the like. Obviously, the credibility of review information directly affects how users use it. In the present invention the word "review" refers to, but is not limited to, comments, evaluations, suggestions, remarks, judgments, assessments and the like scattered across web pages. It also broadly covers text review information and non-text review information, the latter including digital multimedia files of any type such as still images, moving images, animations and video.
In reality, all search engines only return to the user links to pages containing review information, and leave it to the user to judge whether the information is true. A very small number of search engines add manually verified information next to the returned results, such as "reviewer identity verified". In these cases the search engine framework lacks a module for evaluating the authenticity of information and cannot adequately meet users' needs.
In addition, much review information contains heterogeneous information (that is, text information and non-text information; the same applies below). For example, many reviewers use emotion icons (emoticons) and GIF pictures to express attitudes and opinions in forums, blogs and e-mail. Another typical example is that product reviews on websites such as cnet and tigerdirect contain a large amount of image information. As iPhones, digital video cameras and webcams become more popular, it can be predicted that non-text review information will spread ever more widely on the Internet. From the user's point of view, non-text information has the advantage of being intuitive and easy to understand. More importantly, it is an indivisible part of user reviews, and ignoring it prevents users from obtaining complete information. In the existing search engine framework (as shown in Fig. 1), the processing of non-text information is ignored because functional modules for extracting non-text information, effectively mapping non-text information to text information, and integrating non-text information are missing.
Another problem worth noting is that review objects are often not isolated, and related objects can also provide valuable information to the user. For example, when buying a digital camera (such as the Powershot 4500IS), a consumer often pays attention first to the brand of the camera (such as Canon). When a user searches for review information about a camera, it is very meaningful for the search engine to automatically return information related to that camera (such as reviews of the brand). In other words, giving the search engine framework new functions for discovering, creating and managing review information related to the review object would serve users better.
In short, reasonably evaluating the authenticity of information while also managing heterogeneous information should not be regarded as an optional feature of existing search engines, but as a function essential to the further development of search engine technology. In addition, as the Canon camera example shows, search engines need new systems and methods for automatically mining, integrating, summarizing and managing the hierarchical relations of review objects.
Summary of the invention
The problem to be solved by the present invention is to overcome the deficiencies of the prior art and to provide a semantics-based search method and search engine system for searching, integrating and providing review information. To solve this technical problem, the solution of the present invention is as follows:
A semantics-based method for searching, integrating and providing review information is provided, which comprises using a search engine to discover and integrate text review information, and classifying and summarizing review texts by topic. The method further comprises extracting non-text review information, mining the semantic relations between the non-text review information and the text review information, and integrating the two types of information on that basis to meet search service needs. It is realized by the following steps:
(1) actively identifying data sources that provide review information, or passively receiving link requests from information sources containing review information, establishing a link with the data source, and saving the data containing review information to the crawl server;
(2) analyzing the data containing review information, and extracting meta-information to build semantic annotation labels for text review information and non-text review information;
(3) using the semantic annotation labels to extract text review information and non-text review information from the data containing review information;
(4) standardizing the text review information and non-text review information, evaluating it and filtering out inappropriate review information by semantic analysis, and performing abnormality handling;
(5) integrating the evaluated information according to the review objects and their inherent semantic relations;
(6) building an index for the integrated information and the raw data;
(7) using the index information to process search requests and return matching content.
In this description, meta-information broadly means a description of the characteristics of information. The meta-information in step (2) refers specifically to descriptions and explanations of review information, such as the reviewer and the review time. Meta-information can be used to build the annotation labels of review information.
In step (1) of the present invention, the crawl server can actively crawl data sources or automatically receive data from data sources, identify whether a data source contains review information, and establish links with the data sources that do.
Step (2) of the present invention includes determining the category to which the review information belongs, specifically as follows:
(A) use the key attribute of a key-value table to look up the meta-information of the data source and of the review information; if a key matches a piece of meta-information, return the value of the corresponding value attribute as the information category; if no key matches the meta-information, go to the next step;
(B) search the tags in the source file of the review information; if a tag attribute contains a specified category word or phrase, return that word or phrase as the information category; if no tag attribute contains a specified category word or phrase, go to the next step;
(C) scan the review information text and calculate how often each category word or phrase occurs; return the category word or phrase with the highest frequency as the information category; if the frequencies sum to zero, set the information category to NULL.
Step (3) of the present invention also includes extracting, from the saved data, the non-text review information related to the text review information.
The filtering in step (4) of the present invention includes filtering or blocking junk data, data with duplicate or similar content, data that contradicts the review object or the review content, and content that maliciously attacks the review object. The abnormality handling in step (4) comprises:
(A) classifying the abnormal information according to the reason it was filtered or blocked;
(B) storing the abnormal information and its abnormality class in the statistics database, and updating the related statistical parameters; the updated parameters are used to analyze whether new review information falls, in some respect, within an abnormal statistical interval;
(C) updating the values of the detection flags to mark the cause of the abnormality and to indicate the direction of further detection;
(D) storing the abnormal information in the log database.
The integration in step (5) of the present invention includes integrating text review information and non-text review information from the same data source and from different data sources respectively, and integrating review information according to its inherent semantic relations. The latter means connecting review information that is initially in a discrete state according to the semantic relations of the review objects: each piece of review information is mapped onto a tree structure with one or more levels so as to identify its relation to the other review information, and the mapped review information is then integrated on the basis of this tree structure.
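As an illustration only, mapping discrete reviews onto a multi-level tree and then integrating them along the hierarchy could be sketched as follows. The `ReviewNode` class, the `parent_of` table and the sample review texts are assumptions made for this sketch; the patent does not specify a concrete data structure.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewNode:
    """One review object (e.g. a brand or a product) in the tree."""
    name: str
    reviews: list = field(default_factory=list)
    children: dict = field(default_factory=dict)


def build_tree(parent_of, reviews):
    """Map discrete (object, text) reviews onto a tree of review objects.

    parent_of: hypothetical table giving each object's parent (None for a root).
    reviews:   iterable of (object_name, review_text) pairs.
    """
    nodes = {name: ReviewNode(name) for name in parent_of}
    root = ReviewNode("ROOT")
    for name, parent in parent_of.items():
        holder = root if parent is None else nodes[parent]
        holder.children[name] = nodes[name]
    for obj, text in reviews:
        nodes[obj].reviews.append(text)
    return root, nodes


def collect(node):
    """Integrate along the hierarchy: a node's own reviews plus its descendants'."""
    out = list(node.reviews)
    for child in node.children.values():
        out.extend(collect(child))
    return out


# Sample data (assumed): a brand with one product beneath it.
parent_of = {"Canon": None, "Powershot 4500IS": "Canon"}
reviews = [
    ("Powershot 4500IS", "Sharp pictures."),
    ("Canon", "Reliable brand."),
]
root, nodes = build_tree(parent_of, reviews)
brand_view = collect(nodes["Canon"])
```

Here a query about the brand gathers the brand-level review together with the reviews of the products beneath it, which is one possible reading of integration across levels.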
The present invention also provides a search engine system for realizing the foregoing method, comprising a web spider module, a parser module, a retriever module and a display module. The system further comprises: an analyzer module for analyzing and extracting web page information to build semantic annotation labels; an evaluator module for creating data templates, loading information and handling errors; and an integrator module that applies semantic analysis to integrate information. The web spider module, analyzer module, parser module, evaluator module, integrator module, retriever module and display module are arranged in that order.
The analyzer module of the present invention includes a classification identifier module, which can search and scan the information received by the analyzer module and classify it according to the correspondences in a key-value list or according to the frequency of specified words or phrases.
The evaluator module of the present invention comprises two components: a content evaluator module for text information, which standardizes text content, builds the text information template file and handles abnormalities; and a content evaluator module for non-text information, which builds the non-text information template file and performs content recognition.
The integrator module of the present invention applies semantic analysis. It can integrate text review information and non-text review information from the same data source or from different data sources, and it can also organize review information that is in a discrete state into a tree structure according to the semantic relations between review objects, integrating the review information belonging to one topic across levels.
Compared with the prior art, the beneficial effects of the present invention are as follows:
Non-text information is intuitive and easy to understand, and is moreover an indivisible part of user reviews. The semantic search engine system of the present invention comprises a number of functional modules and realizes the evaluation, integration and summarization of heterogeneous information (namely text information and non-text information) and of hierarchically structured review content. It gives the search engine new functions for discovering, creating and managing review information related to the review object, which provides a more considerate experience and serves users better.
Brief description of the drawings
Fig. 1 is the framework of a prior-art search engine used to discover, integrate and provide review information.
Fig. 2 is the framework of the novel search engine described in the present application for discovering, integrating and providing review information.
Fig. 3 shows the framework and processing flow of the analyzer module in Fig. 2.
Fig. 4 illustrates the structure of the evaluator module in Fig. 2.
Fig. 5 illustrates the framework and processing flow of the content evaluator module for text information in Fig. 4.
Fig. 6 illustrates the framework and processing flow of the content evaluator module for non-text information in Fig. 4.
Fig. 7 is the data structure file applicable to heterogeneous review information.
Fig. 8 is the framework for integrating review information, which is suitable for integrating heterogeneous information from a single website and from multiple websites.
Fig. 9 is the framework structure of the retriever module in Fig. 2.
Detailed description of the embodiments
It should first be noted that the present invention concerns an application of search engine technology, that is, an application of computer technology in the Internet field. Implementing the invention involves a number of software functional modules. The applicant believes that, after carefully reading the application documents and accurately understanding the implementation principle and purpose of the invention, a person skilled in the art can fully realize the invention by combining existing known techniques with their own software programming skills. The aforementioned software functional modules include, but are not limited to: the web spider module, analyzer module, parser module, evaluator module, integrator module, retriever module, display module, abnormality detection module, text information template file and non-text information template file. Everything mentioned in the present application documents falls within this category, and the applicant does not list the items one by one.
1. Framework of existing search engines
Fig. 1 depicts the framework that existing search engines use to discover and integrate review information. This framework comprises a web spider module 100, a parser module 102, an integrator module 104 and a retriever module 106. Apart from the integrator module 104, which classifies and summarizes review text by topic, this framework is almost identical to that of general-purpose search engines such as Google: it provides the user with hyperlinks to review information rather than evaluating the content of the review information. In addition, the processing of non-text review information is excluded from the existing framework, because the framework only implements discovery, extraction and summarization for text information. Existing search engines are also unable to process the hierarchical structure of review objects.
2. Framework of the semantic search engine
Fig. 2 shows the search engine framework used in the present invention to search for and summarize review information. The functions of this framework can be divided into three major blocks:
The first functional block is the web spider module 200. It can be deployed on one or more servers (the crawl servers). It can selectively crawl web pages containing review information and save them in the memory or file system of the crawl server. It can also automatically receive data sent by a data source, identify whether the data contains review information, then actively establish a link with the data sources that do, and save the data containing review information in the memory or file system of the crawl server.
The second functional block is the integrated indexing module 210. It comprises the following submodules:
Analyzer module 201: this module analyzes the review information web pages stored on the crawl server and extracts web page meta-information, such as the domain name and URL, from the pages in order to build the semantic annotation labels (annotators) of the heterogeneous information. A semantic annotation label is a specific file, program or data structure produced by semantic analysis techniques such as ontologies and machine learning. A simple case is an XML file that stores product information; the file contains information such as the product name and description, together with the positions at which this information appears on the web page. A more complex semantic annotation label can be a piece of JScript code that obtains specific information from a web page. Because the purpose of generating semantic annotation labels is to extract structured information from the data source, these scripts must understand the inherent meaning of the extracted information rather than its literal meaning. In other words, semantic annotation labels do not rely on literal keyword matching. An analyzer module based on semantic annotation labels gives the search engine the ability to analyze the relevance of information and to understand natural language;
Parser module 202: this module uses the semantic annotation labels created by the analyzer module 201 to extract text information and non-text information from the crawled and saved data. If the current data does not contain the target information indicated by a semantic annotation label, the module automatically links to the target data source indicated by that label in order to obtain the target information content;
Evaluator module 203: this module evaluates the heterogeneous information extracted by the parser module 202 and performs information filtering and abnormality detection. The first step carried out by this module is to create separate data template files for text information and non-text information, and to load the information content to be evaluated into these template files after standardizing it. Standardization includes converting information such as the review time, the reviewer's address and the reviewer's experience value into a unified format. The second step is to filter out inappropriate content: using semantic analysis tools, the module filters or blocks junk data, data with duplicate or similar content, data that contradicts the review object or the review content, content that maliciously attacks the review object, and so on. The evaluator also uses the predefined abnormality detection module 20313 to perform abnormality detection on the loaded content. The abnormality detection module 20313 analyzes the reason and error type for which information was filtered or blocked, and saves the information and error type in the statistics database and the log database as a basis for further analysis and processing. Review data that violates no error rule, together with its template, is saved in the memory of the analysis server to await integration;
Integrator module 204: the information integration performed by this module comprises three steps: first, integrating heterogeneous information from the same data source; second, integrating heterogeneous information from different data sources; and finally, integrating heterogeneous data that are related at the semantic level. The first step uses the information template 2034 to identify information that has the same or a similar review topic and comes from the same data source, and updates its meta-information on that basis, such as the number of reviews, the distribution of review times and the tendency of the review content. The second step uses the information template 2034 to identify information that has the same or a similar review topic but comes from different data sources, and updates its meta-information on that basis. The third step uses the information template 2034 and the semantic annotation labels to mine the semantic associations between review data, builds a tree structure with one or more levels from them, and finally maps the review data onto this tree structure for integration;
Retriever module 205: this module maps words, phrases and semantic annotation labels to the integrated data and to the original data set crawled by the crawl server. It also stores these mappings in a database or file system as the index of the review information. This index information is used to process users' information queries.
The third functional block is the display module 220. This module receives and processes end users' queries and uses the index information produced by the retriever module 205 to return matching content to the user.
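The division of labour between the retriever module (building an index) and the display module (answering queries from it) can be illustrated with a minimal inverted index. This is a sketch under simplifying assumptions: it indexes single words only, whereas the text also mentions phrases and semantic annotation labels, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lower-case word tokens; a stand-in for real linguistic processing."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Sample integrated reviews (assumed data).
documents = {
    "r1": "Canon camera with sharp pictures",
    "r2": "Canon printer, noisy",
    "r3": "Great bicycle",
}
index = build_index(documents)
```

With this index, `search(index, "canon camera")` narrows the result to the one review that contains both words, while `search(index, "canon")` returns both Canon reviews.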
3. Differences between the two frameworks
The first difference between the two frameworks is this: the designers of the existing search engine framework believe that adding a data integration module to the existing framework is enough to process review information and meet users' needs. The designers of the latter framework believe that, because review information is rich in natural language features (such as personalized vocabulary and semantic rules), review information cannot be processed effectively by merely adding a data integration module without treating semantic analysis as an indispensable part of the whole search engine framework.
In addition, the designers of the latter framework believe that processing review information alone cannot meet users' needs well. For users, much review information has a particular scope of application and an obvious hierarchical structure. As in the Canon camera example above, what a user needs when making a purchase decision is not only review information about the camera itself but also an understanding of the brand. From this point of view, it is very important for a search engine to be able to analyze the hierarchical structure of review information. Obviously, existing search engines have no such ability.
Finally, as noted above, using heterogeneous data to express users' opinions has become a trend. The designers of the semantic search engine therefore believe that, to process users' review information better, a search engine must be able to process heterogeneous data. Obviously, the designers of traditional search engines have not yet recognized this.
4. Building the semantic annotation labels
Fig. 3 depicts the framework structure of the analyzer module 201. The inputs of this module are: the domain name 2011, the URL 2012, the HTML text information 2013 and the HTML non-text information 2014. The outputs of this module are the semantic parsing label 201B identifying text information and the semantic parsing label 201C identifying non-text information.
The whole analysis process begins by passing the input information, such as the domain name and URL, to the memory buffer 2015 and then on to the classification identifier module 2016. This module determines the category to which the review information belongs. The word "category" here covers both a single top-level category and a top-level category together with several of its subcategories. It is worth pointing out here that this category information is useful not only in this module but is used repeatedly in later modules and processes. For example, in the evaluator module 203 the category information is used to build the text information template 20311 and the non-text information template 20321. These two templates are then used in the integrator 204 to integrate review information. The classification process is as follows:
1) The classification identifier module 2016 first looks up the input domain name in a key-value list. In this list the "key" attribute holds domain name information and the "value" attribute holds the category to which the domain name belongs. If the "key" attribute of the list contains the input domain name, the value of the corresponding "value" attribute is returned as the category of the input domain name. If it does not, the classification identifier module proceeds to step 2);
2) Search the <title> and <description> tags in the source code of the HTML page. If an attribute of these tags contains a specified category word or phrase, that word or phrase is returned as the category of the input domain name. For example, if an attribute of the <title> tag contains the keyword "HDTV" and "HDTV" is a predefined category, the input domain name is classified as "HDTV". If the classification identifier module cannot obtain a category word or phrase from the <title> and <description> tags, it proceeds to step 3);
3) Scan the source code of the HTML page and calculate how often each specified category word or phrase occurs in it. After sorting these frequencies from high to low, take the category word or phrase with the highest frequency as the category of the domain name. If the frequencies sum to zero, the classification identifier module sets the category of the domain name to NULL.
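The three-stage fallback above can be sketched as a single function. The domain table, the category word list and the regular-expression tag matching are assumptions made for illustration; the patent does not prescribe how the tags are parsed, and a production system would likely use a proper HTML parser.

```python
import re
from collections import Counter

# Hypothetical lookup data; a real system would load these from configuration.
DOMAIN_CATEGORIES = {"tv-reviews.example": "HDTV"}
CATEGORY_WORDS = ["HDTV", "camera", "laptop"]


def classify(domain, html):
    """Return a category for the page, or None (the NULL case)."""
    # Step 1: key-value lookup on the domain name.
    if domain in DOMAIN_CATEGORIES:
        return DOMAIN_CATEGORIES[domain]

    lowered = html.lower()

    # Step 2: look for a category word inside <title> or <description>,
    # including the tag's own attributes.
    for tag in ("title", "description"):
        match = re.search(rf"<{tag}[^>]*>(.*?)</{tag}>", lowered, re.S)
        if match:
            for word in CATEGORY_WORDS:
                if word.lower() in match.group(0):
                    return word

    # Step 3: fall back to the most frequent category word in the page source.
    counts = Counter({w: lowered.count(w.lower()) for w in CATEGORY_WORDS})
    word, freq = counts.most_common(1)[0]
    return word if freq > 0 else None
```

Each stage is tried only when the previous one yields nothing, so a cheap table lookup short-circuits the more expensive page scan.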
Once classification is complete, the category information is used to select suitable data analysis modules for creating the semantic parsing labels. These data analysis modules include regular expressions 2017, data mining 2018, multimedia data analysis 2019 and machine learning 201A.
In general, the analyzer module uses the regular expression module and the data mining module to analyze text information and create semantic parsing labels. For non-text information, the multimedia data analysis module 2019 is the main tool for creating the semantic parsing labels, and the creation process is based not only on the attributes of the non-text information itself (such as file format and relative address) but also on the text information related to it in the same data file.
5. Evaluation of review content
Fig. 4 depicts the two components of the evaluator module 203: the content evaluator module 2031 for text information and the content evaluator module 2032 for non-text information. It should be pointed out that the evaluation process is in essence the interaction 2033 of these two modules. To see why the two modules need to interact, consider an example: a user writes "What?" in a blog and then adds a series of crying-face icons after it. Sentiment analysis based only on the text "What?" is insufficient, but once the crying-face icons are also analyzed, the content evaluation module can judge more accurately that the user is expressing negative emotions such as surprise, confusion or anger. Conversely, it is sometimes difficult to reach a judgment from the non-text information alone, and supplementing it with the text information may then improve the accuracy of the judgment.
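The "What?" example can be made concrete with a toy scoring scheme in which each channel supplies a polarity and the evaluator falls back to whichever channel carries a signal. The lexicons, the score values and the equal weighting are all invented for this sketch; the patent does not describe how the two evaluators combine their results.

```python
# Hypothetical polarity scores in [-1, 1]; real scores would come from
# trained text and image/emoticon analyzers.
TEXT_LEXICON = {"great": 0.8, "terrible": -0.8, "what?": 0.0}
ICON_LEXICON = {"crying_face": -0.9, "smile": 0.7}


def text_score(text):
    """Polarity of the text channel; 0.0 means no usable signal."""
    return TEXT_LEXICON.get(text.strip().lower(), 0.0)


def icon_score(icons):
    """Mean polarity of the known icons; 0.0 when there are none."""
    scores = [ICON_LEXICON[i] for i in icons if i in ICON_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def combined_sentiment(text, icons, text_weight=0.5):
    """Blend both channels; fall back to whichever channel has a signal."""
    t, i = text_score(text), icon_score(icons)
    if t == 0.0:
        return i
    if i == 0.0:
        return t
    return text_weight * t + (1 - text_weight) * i
```

In this sketch the text "What?" alone scores as neutral, but `combined_sentiment("What?", ["crying_face"] * 3)` is negative because the icons supply the missing signal.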
Fig. 5 depicts the framework and composition of the content evaluator module 2031 for text information. This module first builds the text information template file 20311 from the text information 20310. The template file contains both a topic template (used to describe the review object, for example its category information) and a content template (used to load the original review data and the mapping from the content template to the topic template).
After the text information template file 20311 has been built, the evaluator module first initializes the detection flags 20312 and then performs abnormality detection. The abnormality detection module 20313 uses the template file and the statistics database 20314 to perform abnormality detection. Before formal detection begins, the abnormality detection module 20313 initializes several detection flags, which are used to indicate abnormal conditions and the state of the detection process.
The anomaly detection module handles the following anomaly categories:
1) Mismatch 20315 (the comment object is a particular notebook computer, but the comment content discusses a bicycle);
2) Conflict 20316 (a self-contradiction within a single comment);
3) Spam 20317 (a user ID comments on the same comment object repeatedly within a certain period);
4) Misleading 20318 (a particular comment is at odds with most other comments and has no factual basis);
5) Other 20319 (e.g. missing classification information or lost comment text).
Once the anomaly category is determined, the anomaly detection module 20313 proceeds as follows:
1) It stores the anomaly category in the statistical database 20314 as a new record and updates the relevant statistical parameters, for example the ratio of the number of anomalies of a given category to the total number of anomalies. The updated statistical parameters are used to detect whether new review information falls within an abnormal statistical interval in some respect;
2) It assigns a value to the flag (2031A) to mark the cause of the anomaly, and writes the cause to the statistical database 20314;
3) It stores the error information in the log database 2031B.
Data in which anomaly detection finds no anomaly is passed by the anomaly detection module 20313 to the integrator module 204.
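The handling steps above can be sketched as follows. This is an illustrative outline only: the class and method names are hypothetical, and in-memory containers stand in for the statistical database 20314 and the log database 2031B.

```python
from collections import Counter
from enum import Enum


class Anomaly(Enum):
    MISMATCH = "mismatch"
    CONFLICT = "conflict"
    SPAM = "spam"
    MISLEADING = "misleading"
    OTHER = "other"


class AnomalyHandler:
    def __init__(self):
        self.counts = Counter()  # stand-in for the statistical database 20314
        self.flags = {}          # detection flag per comment (cause of the anomaly)
        self.log = []            # stand-in for the log database 2031B

    def ratio(self, category):
        """Share of all recorded anomalies that belong to the given category."""
        total = sum(self.counts.values())
        return self.counts[category] / total if total else 0.0

    def handle(self, comment_id, category, reason=""):
        """Record an anomaly; return True only if the comment is clean
        and may be passed on to the integrator."""
        if category is None:
            return True
        self.counts[category] += 1                             # step 1: update statistics
        self.flags[comment_id] = reason                        # step 2: mark the cause
        self.log.append((comment_id, category.value, reason))  # step 3: log the error
        return False
```

The `ratio` method corresponds to the example statistic in step 1; a real system would compare such ratios against thresholds to decide whether new review information falls in an abnormal interval.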
Fig. 6 shows the architecture and composition of the content evaluator module 2032 for non-text information. The module extracts attribute information of the non-text review information 20320, such as file name, author, creation time, modification time, and file format, and builds a non-text information template file 20321 from it. The evaluator module then uses these attributes to look up whether the review information 20320 already exists in the non-text information content database 20323.
If the record exists, the template update process 20326 is executed. This process copies the content of the database record into the template file 20321. The updated information template is then passed to the integrator 204 as an input parameter.
If the record does not exist, the non-text content analysis process 20325 is executed. This process first extracts the attribute information of the non-text information, including file size, dimensions, resolution, pixels, ISO speed, creator, creation time, last update time, frame information, compression ratio, and so on. It then uses these attributes to carry out cross-analysis, including file type confirmation, character extraction, action recognition, image segmentation, and content classification. Finally, the analysis results are written to the non-text information template 20321 and also to the non-text information content database 20323. Once the non-text information template has been updated, it is passed to the integrator 204 as an input parameter.
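The two branches (record found, record not found) amount to a look-up-or-analyze pattern. The sketch below assumes a dictionary as the content database 20323, a lookup key built from file name and creation time, and a caller-supplied `analyze` function; all three are illustrative choices, not details given in the patent.

```python
def evaluate_non_text(attrs, content_db, analyze):
    """Build the non-text template for one item.

    attrs      -- extracted attributes (must include 'file_name' and 'created')
    content_db -- dict standing in for the content database 20323
    analyze    -- hypothetical function performing the content analysis 20325
    """
    template = dict(attrs)                       # initial template from attributes
    key = (attrs["file_name"], attrs["created"])
    record = content_db.get(key)
    if record is None:
        record = analyze(attrs)                  # record missing: run content analysis
        content_db[key] = record                 # and store the result for next time
    template.update(record)                      # template update in both branches
    return template                              # handed to the integrator as input
```

Because the analysis result is stored, a second item with the same key is served from the database and the potentially expensive analysis runs only once.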
Fig. 7 shows a template file 2034 that can handle text information and non-text information together. The template comprises a theme template and a content template: the theme template contains descriptive information about the comment object, and the content template contains the comment data and the meta-information describing that data.
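One possible in-memory shape for such a template file is sketched below. The field names are assumptions chosen for illustration; the patent only states that the theme template describes the comment object and the content template carries the comment data plus its meta-information.

```python
from dataclasses import dataclass, field


@dataclass
class ThemeTemplate:
    object_id: str                                   # ID of the comment object
    category: str                                    # classification information
    attributes: dict = field(default_factory=dict)   # further descriptive key-value pairs


@dataclass
class ContentTemplate:
    comment_data: str                                # raw comment text, or a path to a media file
    is_text: bool = True                             # False for non-text information
    meta: dict = field(default_factory=dict)         # meta-information about the comment data


@dataclass
class TemplateFile:
    theme: ThemeTemplate
    contents: list = field(default_factory=list)     # ContentTemplate entries mapped to the theme
```

Keeping text and non-text entries in the same `contents` list reflects the statement that one template serves both kinds of information.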
6. Integration of comment content
Fig. 8 shows the composition of the integrator module 204. The module handles integration of comment objects within one website (same-website integration 2041), integration of comment objects across websites (cross-website integration 2042), and integration of comment objects that stand in a hierarchical relationship (hierarchical integration 2048).
If the comment data share the same domain name and the same comment object ID, the content integration is same-website integration. In this case, the same-website text information 2043 and non-text information 2044 are first integrated separately. Integration between the heterogeneous information then follows: the integrated text information and non-text information are further integrated with each other, ensuring that no contradiction in content arises between the two and that the attribute fields common to their theme templates hold identical values.
Similarly, if the comment data correspond to different domain names but the comment object ID is the same (entity association ensures that the same comment object has the same ID), the content integration is cross-website integration 2042. Cross-website integration 2042 follows the same process as same-website integration 2041: it comprises integration of the cross-website text information 2046 and non-text information 2047, followed by integration of the heterogeneous information 2046 and 2047.
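The distinction between the two integration types comes down to how comments are grouped. A minimal sketch, assuming each comment is a dictionary with hypothetical `domain` and `object_id` keys:

```python
from collections import defaultdict


def group_for_integration(comments):
    """Return (same_site, cross_site) groupings.

    same_site  -- {(domain, object_id): [comments]}
    cross_site -- {object_id: [(domain, [comments]), ...]}, only for objects
                  that are reviewed on more than one domain
    """
    same_site = defaultdict(list)
    for c in comments:
        same_site[(c["domain"], c["object_id"])].append(c)

    by_object = defaultdict(list)
    for (domain, object_id), group in same_site.items():
        by_object[object_id].append((domain, group))

    cross_site = {oid: groups for oid, groups in by_object.items() if len(groups) > 1}
    return dict(same_site), cross_site
```

The sketch only covers the grouping; the integration of text with non-text information inside each group, and the consistency checks on shared theme attributes, are not shown.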
Once same-website and cross-website integration are complete, hierarchical integration 2048 is carried out. Hierarchical integration can take place within a single website or across websites. Its purpose is to organize theme templates that appear discrete but are related in content, mapping them sensibly onto a tree structure.
For example, theme template A has the value "Canon brand", theme template B has the value "Canon camera", and theme template C has the value "Canon 450d". Among these, A is first identified as the parent node of the tree, because the word "brand" has a wider semantic scope than the other two keywords, "camera" and "450d". Next, the semantic similarity between "brand" and "camera" is compared with that between "brand" and "450d"; the former is higher, so B (rather than C) becomes the direct child of A. By the same reasoning, because B has a wider semantic scope than C (conceptually, C is a special case of B), C can only be a child of B, not a sibling of B. In this way the three discrete theme templates are organized into a tree-shaped hierarchy based on their hierarchical relationships.
Functionally, the hierarchical integration process takes the text information template 204A and the non-text information template 204B as input and extracts from them a theme template set 204C, containing theme 1, theme 2, and so on. These themes start in a discrete state; when the integration process completes, they are organized into the tree structure 204D, which is built according to the semantic relationships between the theme templates.
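The Canon example can be reproduced with a small sketch. Instead of computing semantic scope and similarity, which the patent does not specify in detail, the sketch assumes a precomputed, hypothetical `BROADER` table listing every strictly broader theme for each theme; the direct parent is then the most specific of those broader themes.

```python
# Hypothetical semantic-scope table: theme -> all strictly broader themes.
BROADER = {
    "Canon brand": set(),
    "Canon camera": {"Canon brand"},
    "Canon 450d": {"Canon brand", "Canon camera"},
}


def build_tree(themes, broader):
    """Map each theme to its direct parent (None for a root)."""
    parent = {}
    for theme in themes:
        ancestors = [a for a in broader[theme] if a in themes]
        # The most specific ancestor is the one that itself has the most ancestors.
        parent[theme] = max(ancestors, key=lambda a: len(broader[a]), default=None)
    return parent
```

Applied to the three themes, this places "Canon camera" under "Canon brand" and "Canon 450d" under "Canon camera", matching the tree described in the example.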
7. Retrieval of comment content
Fig. 9 shows the architecture and composition of the retriever module 205, which consists of a theme index file 2051 and a content index file 2052. The theme index file 2051 maps the main information describing the comment object onto the key-value pairs of the theme template. The content index file 2052 maps the specific review information onto the key-value pairs of the content template. The mapping process maps the text content 2053 and the non-text content 2054 to the corresponding theme template 2055 and content template 2056, while the text and non-text content are also preserved in source file format 2057. When indexing is finished, the index data is stored in the index repository 2058.
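A key-value index of this kind can be sketched as two inverted indexes plus a store for the source content. The class below is an assumption-laden illustration: the names are hypothetical, the index lives in memory rather than in files, and template values are assumed to be hashable.

```python
from collections import defaultdict


class IndexRepository:
    """In-memory stand-in for the index repository 2058."""

    def __init__(self):
        self.theme_index = defaultdict(set)    # (key, value) -> comment object IDs
        self.content_index = defaultdict(set)  # (key, value) -> comment IDs
        self.sources = {}                      # comment ID -> content in source format

    def add(self, comment_id, object_id, theme, content, raw):
        for pair in theme.items():             # theme template key-value pairs
            self.theme_index[pair].add(object_id)
        for pair in content.items():           # content template key-value pairs
            self.content_index[pair].add(comment_id)
        self.sources[comment_id] = raw         # keep the original alongside the index

    def find_objects(self, key, value):
        return set(self.theme_index.get((key, value), set()))

    def find_comments(self, key, value):
        return set(self.content_index.get((key, value), set()))
```

A search request can then be answered by looking up the requested key-value pair in either index and returning the stored source content for the matching comment IDs.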

Claims (10)

1. A semantics-based method for searching, integrating, and providing review information, comprising using a search engine to find and integrate text review information, and classifying and aggregating comment text by subject; characterized in that the method further comprises extracting non-text review information, mining the semantic relationships between the non-text review information and the text review information, and on that basis integrating the two kinds of information for use by the search service; the method comprises the following steps:
(1) actively identifying a data source that provides review information, or passively receiving a link request from an information source that contains review information, establishing a link with that data source, and saving the data containing the review information to a crawl server;
(2) analyzing the data containing the review information and extracting meta-information to build semantic annotation labels for the text review information and the non-text review information;
(3) using the semantic annotation labels to extract the text review information and the non-text review information from the data containing the review information;
(4) standardizing the text review information and the non-text review information, evaluating them and filtering out inappropriate review information by semantic analysis, and performing exception handling;
(5) integrating the evaluated information according to the evaluation object and its inherent semantic relationships;
(6) building an index for the integrated information and the raw data;
(7) using the index information to process search requests and return matching content.
2. The method according to claim 1, characterized in that, in step (1), the crawl server can actively crawl data sources or automatically receive data sources, identifies whether a data source contains review information, and establishes a link with each data source that contains review information.
3. The method according to claim 1, characterized in that step (2) comprises determining the category to which the review information belongs, as follows:
(A) using the key attribute of a key-value table to search the meta-information of the data source and of the review information; if the key attribute matches some meta-information, the value of the corresponding value attribute is returned as the information category; if the key attribute matches no meta-information, proceeding to the next step;
(B) searching the tags in the source file of the review information; if a tag's attributes contain a specified category word or phrase, that word or phrase is returned as the information category; if no tag attribute contains a specified category word or phrase, proceeding to the next step;
(C) scanning the review information text and computing the frequency with which each category word or phrase occurs; the category word or phrase with the highest frequency is returned as the information category; if the frequencies sum to zero, the information category is set to NULL.
4. The method according to claim 1, characterized in that step (3) further comprises extracting, from the saved data, non-text review information related to the text review information.
5. The method according to claim 1, characterized in that the filtering in step (4) comprises filtering and blocking spam data, data with duplicate or similar content, data in which the comment object and the content conflict, and content that maliciously attacks the comment object;
the exception handling in step (4) comprises:
(A) classifying the abnormal information by the reason it was filtered or blocked;
(B) storing the abnormal information and its anomaly category in a statistical database and updating the relevant statistical parameters; the updated statistical parameters are used to analyze whether new review information falls within an abnormal statistical interval in some respect;
(C) updating the values of the detection flags to mark the cause of the anomaly and to indicate the direction of further detection;
(D) storing the abnormal information in a log database.
6. The method according to claim 1, characterized in that the integration in step (5) comprises separately integrating text review information and non-text review information from the same data source and from different data sources, and integrating the review information according to its inherent semantic relationships; the latter integration connects review information that is initially in a discrete state according to the semantic relationships of the evaluation objects, that is, each piece of review information is mapped onto a tree structure with one or more levels so as to identify its relationship to other review information; the mapped review information is then integrated on the basis of this tree structure.
7. A semantics-based search engine system for searching, integrating, and providing review information, comprising a web spider module, a parser module, a retriever module, and a display module; characterized in that the system further comprises: an analyzer module for analyzing and extracting web page information to build semantic annotation labels; an evaluator module for creating data templates, hosting information, and performing error handling; and an integrator module that applies semantic analysis to integrate information;
the web spider module, analyzer module, parser module, evaluator module, integrator module, retriever module, and display module are arranged in that order; the modules implement the semantics-based searching, integrating, and providing of review information as follows:
(1) the web spider module actively identifies a data source that provides review information, or passively receives a link request from an information source that contains review information, establishes a link with that data source, and saves the data containing the review information to a crawl server;
(2) the analyzer module analyzes the data containing the review information and extracts meta-information to build semantic annotation labels for the text review information and the non-text review information;
(3) the parser module uses the semantic annotation labels to extract the text review information and the non-text review information from the data containing the review information;
(4) the evaluator module standardizes the text review information and the non-text review information, evaluates them and filters out inappropriate review information by semantic analysis, and performs exception handling;
(5) the integrator module integrates the evaluated information according to the evaluation object and its inherent semantic relationships;
(6) the retriever module builds an index for the integrated information and the raw data;
(7) the display module uses the index information to process search requests and return matching content.
8. The system according to claim 7, characterized in that the analyzer module comprises a classification recognizer module, which can search and scan the information received by the analyzer module and classify it according to the correspondences in a key-value list or the frequency with which specified words or phrases occur.
9. The system according to claim 7, characterized in that the evaluator module comprises two components: a content evaluator module for text information, having the functions of standardizing text content, building the text information template file, and handling anomalies; and a content evaluator module for non-text information, having the functions of building the non-text information template file and recognizing content.
10. The system according to claim 7, characterized in that the integrator module applies semantic analysis and can both integrate text review information and non-text review information from the same data source or from different data sources, and organize review information that is in a discrete state into a tree structure according to the semantic relationships between comment objects, integrating hierarchically the review information that belongs to one theme.
CN 201110278049 2011-09-19 2011-09-19 Method for searching, integrating and providing comment information based on semantics and searching system Expired - Fee Related CN102279894B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110278049 CN102279894B (en) 2011-09-19 2011-09-19 Method for searching, integrating and providing comment information based on semantics and searching system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110278049 CN102279894B (en) 2011-09-19 2011-09-19 Method for searching, integrating and providing comment information based on semantics and searching system

Publications (2)

Publication Number Publication Date
CN102279894A CN102279894A (en) 2011-12-14
CN102279894B true CN102279894B (en) 2013-01-09

Family

ID=45105336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110278049 Expired - Fee Related CN102279894B (en) 2011-09-19 2011-09-19 Method for searching, integrating and providing comment information based on semantics and searching system

Country Status (1)

Country Link
CN (1) CN102279894B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934643A (en) * 2017-12-15 2019-06-25 西安比卓电子科技有限公司 A kind of review record method

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164474B (en) * 2011-12-15 2016-03-30 中国移动通信集团贵州有限公司 A kind of method that data service is analyzed
CN102810110B (en) * 2012-05-07 2015-08-05 北京京东世纪贸易有限公司 Obtain the method and system of network text data
CN102708202B (en) * 2012-05-17 2014-11-26 厦门游家网络有限公司 Method for sharing player thoughts of Flash game in batches
CN103631791B (en) * 2012-08-22 2017-04-12 腾讯科技(深圳)有限公司 Information fusion classification display method and system
CN103150331A (en) * 2013-01-24 2013-06-12 北京京东世纪贸易有限公司 Method and device for providing search engine tags
CN105095181B (en) * 2014-05-19 2017-12-29 株式会社理光 Review spam detection method and equipment
CN104008289A (en) * 2014-05-26 2014-08-27 沈苹 Method and device for evaluating artistic works
CN106415533A (en) * 2014-06-12 2017-02-15 诺基亚技术有限公司 Method, apparatus, computer program product and system for reputation generation
CN104298754B (en) * 2014-10-17 2017-08-25 梁忠伟 Information excavating transmission method, social network device and system by trunk of sequence of pictures
CN104866468B (en) * 2015-04-08 2017-09-29 清华大学深圳研究生院 A kind of false customer's comment recognition methods of Chinese
WO2017120739A1 (en) * 2016-01-11 2017-07-20 程强 Method and system for analyzing restaurant reviews
CN107133239A (en) * 2016-02-29 2017-09-05 上海普兰金融服务有限公司 instant information processing method and device
US10540383B2 (en) * 2016-12-21 2020-01-21 International Business Machines Corporation Automatic ontology generation
CN109213920A (en) * 2017-06-29 2019-01-15 阿里巴巴集团控股有限公司 searching method, client, server and storage medium
CN108564103A (en) * 2018-01-09 2018-09-21 众安信息技术服务有限公司 Data processing method and device
CN109241402A (en) * 2018-07-31 2019-01-18 成都华栖云科技有限公司 A kind of virtual comment machine introduction method based on news content
CN109446512B (en) * 2018-09-12 2023-08-01 阿里巴巴(中国)有限公司 Data processing method, device, terminal equipment and computer storage medium
CN109583958A (en) * 2018-12-01 2019-04-05 深圳市润隆实业有限公司 It is a kind of for integrating the comment system in store
CN111831878B (en) * 2019-04-22 2023-09-15 百度在线网络技术(北京)有限公司 Method for constructing value index relationship, index system and index device
CN110390061B (en) * 2019-07-29 2020-07-21 电子科技大学 Space theme query method based on social media
CN110688451A (en) * 2019-08-15 2020-01-14 中国平安人寿保险股份有限公司 Evaluation information processing method, evaluation information processing device, computer device, and storage medium
CN110991218B (en) * 2019-10-10 2024-01-12 北京邮电大学 Image-based network public opinion early warning system and method
CN113342221A (en) * 2021-05-13 2021-09-03 北京字节跳动网络技术有限公司 Comment information guiding method and device, storage medium and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102087648A (en) * 2009-12-03 2011-06-08 北京大学 Method and system for fetching news comment page

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8756244B2 (en) * 2009-07-29 2014-06-17 Teradata Us, Inc. Metadata as comments for search problem determination and analysis


Also Published As

Publication number Publication date
CN102279894A (en) 2011-12-14

Similar Documents

Publication Publication Date Title
CN102279894B (en) Method for searching, integrating and providing comment information based on semantics and searching system
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
CN103136360A (en) Internet behavior markup engine and behavior markup method corresponding to same
US10740406B2 (en) Matching of an input document to documents in a document collection
CN102622453A (en) Body-based food security event semantic retrieval system
CN104281702A (en) Power keyword segmentation based data retrieval method and device
CN101393565A (en) Facing virtual museum searching method based on noumenon
Zhao et al. Topic-centric and semantic-aware retrieval system for internet of things
Holzinger et al. Using ontologies for extracting product features from web pages
CN103425740A (en) IOT (Internet Of Things) faced material information retrieval method based on semantic clustering
CN104679783A (en) Network searching method and device
Roopak et al. OntoKnowNHS: ontology driven knowledge centric novel hybridised semantic scheme for image recommendation using knowledge graph
KR101696499B1 (en) Apparatus and method for interpreting korean keyword search phrase
CN103914488A (en) Document collection, identification, association, search and display system
CN103744987B (en) Video website media asset integrating method and system based on DOM tree matching
Greenberg Metadata and digital information
CN109948015B (en) Meta search list result extraction method and system
Katz et al. Data system design alters meaning in ecological data: salmon habitat restoration across the US Pacific Northwest
CN115168401A (en) Data grading processing method and device, electronic equipment and computer readable medium
Rogushina et al. Use of ontologies for metadata records analysis in big data
Yu et al. Friend recommendation mechanism for social media based on content matching
CN113821718A (en) Article information pushing method and device
Dumrewal et al. Citicafe: conversation-based intelligent platform for citizen engagement
Jalal et al. A web content mining application for detecting relevant pages using Jaccard similarity
Singh et al. User specific context construction for personalized multimedia retrieval

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20150918

Address after: 510000 Guangdong city of Guangzhou province Tianhe District Yanling Road No. 268 building 18 room 107

Patentee after: Zhou Cheng

Address before: 213, room 1369, science and technology building, 314000 Chengnan Road, Zhejiang, Jiaxing

Patentee before: Jiaxing Yiyantang Information Technology Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20170329

Address after: Forty new road in Whampoa District of Guangzhou City, Guangdong province 510000 No. 680 Room 401 by

Patentee after: GUANGZHOU IN-DEPTH DATA TECHNOLOGY CO., LTD.

Address before: 510000 Guangdong city of Guangzhou province Tianhe District Yanling Road No. 268 building 18 room 107

Patentee before: Zhou Cheng

TR01 Transfer of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130109

Termination date: 20200919

CF01 Termination of patent right due to non-payment of annual fee