CN102073641A - Method, device and program for processing consumer-generated media information - Google Patents

Method, device and program for processing consumer-generated media information

Publication number
CN102073641A
Authority
CN
China
Prior art keywords
information
consumer
media information
cgm
evaluation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2009102218861A
Other languages
Chinese (zh)
Inventor
何楠
王主龙
贾文杰
葛付江
贾晓建
王新文
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN2009102218861A
Publication of CN102073641A
Legal status: Pending

Abstract

The invention provides a method for processing consumer-generated media (CGM) information. The method comprises the steps of: collecting and extracting CGM information from different information supply sources; filtering the CGM information according to a filtering policy corresponding to the extracted CGM information, so as to obtain CGM information related to a predetermined topic; and integrating the filtered CGM information based on user-customized rules, so as to obtain customized CGM information. The invention also provides a device and a program for implementing the method. The method and the device provide customized CGM information according to the specific needs of users, which significantly improves the efficiency of processing and using CGM information without imposing any additional operating burden on users.

Description

Method, device and program for processing consumer-generated media information
Technical field
The present invention relates generally to the technical field of information processing and, more specifically, to a method, device and program for processing consumer-generated media (CGM) information.
Background technology
Consumer-generated media (CGM) refers to content that anyone (not necessarily a professional media worker) can create online and that other consumers can use by means of digital technology. CGM can include web logs or "blogs", mobile-phone blogs or "mo-blogs", forums (BBS), electronic discussion messages, newsgroups, message boards, BBS emulating services, product review and discussion websites, online retail websites that support consumer opinions, social networks, media libraries, digital libraries and the like. CGM information thus generally means the various content contained on CGM websites or web pages, for example blog articles, consumer messages and consumer posts. CGM information is usually text, but it also includes audio files and streaming video files (MP3, webcasts, etc.), animations (Flash, etc.) and multimedia in any other form. Blog articles, consumer messages and consumer posts are typical examples of CGM information; of course, a CGM website or web page itself can also be regarded as a kind of CGM information. In a broad sense, therefore, CGM information includes all content and information related to CGM. In addition, in the context of this specification, "consumer" refers broadly to consumers and users of the network as a tool for creating and spreading information, and not only to consumers of a particular commodity in the ordinary sense.
With the rapid development of computer and network technology, factors such as the demand for personal space, the ease of creating websites, and the speed and convenience of interacting over the network have driven enormous growth in both the types and the quantity of CGM, and with it a massive amount of CGM information. Faced with CGM of ever-increasing variety and quantity, how to make full and effective use of the acquired CGM information according to actual needs is a problem worth studying.
For example, a user who needs to learn about the technical performance of a certain commodity can enter words or phrases related to that commodity into a search engine and obtain a series of websites or web pages as search results, or can log in to product review and discussion websites related to the commodity and discuss and exchange views with other users of the commodity, online or offline. Experience shows, however, that the massive CGM information obtained in this way is all-encompassing: besides the desired technical performance information, it may also include the commodity's price, appearance, operating instructions and production process, other users' evaluations and various other information, and even information whose content has nothing to do with the commodity apart from containing the keyword. Screening the information that is finally wanted out of all this often costs the user a great deal of time and effort, which reduces both the efficiency of processing CGM information and the usefulness of the CGM information itself.
The prior art provides some methods that can help the user with this problem. For example, general search engines have an advanced search function, with which the user can narrow the search scope and obtain relatively accurate results by entering multi-level keywords related to the topic to be searched. Sometimes, however, the user knows very little about the topic to be searched, so this method solves the problem only to a limited degree. More accurate results can also be obtained by constructing professional search expressions and using professional search tools. But the vast majority of ordinary users do not have such professional skills, so this approach in fact imposes an extra operating burden on the user, cannot be popularized, and thus can hardly improve the efficiency of processing CGM information fundamentally.
The above is only one example. In practice there are many situations in which CGM information needs to be reused effectively, but no ideal tool or technique for processing CGM information effectively has been found so far.
Summary of the invention
In view of the above, there is a need for a method and device that can process CGM information effectively so as to provide users with customized CGM information suited to their particular needs.
An embodiment of the present invention relates to a method for processing consumer-generated media information, the method comprising the steps of:
collecting and extracting consumer-generated media information from different information supply sources;
filtering the consumer-generated media information according to a filtering policy corresponding to the extracted consumer-generated media information, so as to obtain consumer-generated media information related to a predetermined topic; and
integrating the filtered consumer-generated media information based on customized rules, so as to obtain customized consumer-generated media information.
An embodiment of the present invention also relates to a device for processing consumer-generated media (CGM) information, the device comprising:
a collection and extraction unit configured to collect and extract consumer-generated media information from different information supply sources;
a filtering unit configured to filter the consumer-generated media information according to a filtering policy corresponding to the consumer-generated media information obtained by the collection and extraction unit, so as to obtain consumer-generated media information related to a predetermined topic; and
an integration unit configured to integrate, based on customized rules, the consumer-generated media information obtained by the filtering unit, so as to obtain customized consumer-generated media information.
An embodiment of the present invention further relates to a program product storing machine-readable instruction code which, when read and executed by a machine, can carry out the method for processing CGM information described above.
The method and device for processing CGM information according to embodiments of the present invention can provide customized CGM information according to the particular needs of the user, thereby significantly improving the efficiency of processing and using CGM information while avoiding any extra operating burden on the user.
Description of drawings
The above and other objects, features and advantages of the present invention will be more readily understood with reference to the following description of embodiments of the invention in conjunction with the accompanying drawings. The components in the drawings are not drawn to scale and serve only to illustrate the principles of the invention. To make some parts of the invention easier to illustrate and describe, corresponding parts in the drawings may be exaggerated, that is, made larger relative to other components than in an exemplary device actually manufactured according to the invention. In the drawings, the same or similar technical features or components are denoted by the same or similar reference numerals.
Fig. 1 is a general flowchart of a method for processing consumer-generated media (CGM) information according to an embodiment of the present invention;
Fig. 2 is a general flowchart of a specific implementation of the CGM information processing method of the embodiment shown in Fig. 1;
Fig. 3 is a general flowchart of a specific example of the information collection and extraction processing in the method shown in Fig. 2;
Fig. 4 is a general flowchart of a specific example of the information filtering processing in the method shown in Fig. 2;
Fig. 5 is a general flowchart of a specific example of the sentiment analysis and integration processing in the method shown in Fig. 2;
Fig. 6 is a screenshot schematically showing a bar chart of the individual CGM information items obtained after the sentiment analysis and integration processing shown in Fig. 5, together with their corresponding evaluation values;
Fig. 7 is a screenshot schematically showing a line graph of the quantity of customized CGM information, measured by time period, obtained after the integration processing shown in Fig. 5;
Fig. 8 is a screenshot schematically showing a line graph and a bar chart of the quantity of customized CGM information, measured by time period, obtained after the integration processing shown in Fig. 5;
Figs. 9A and 9B are screenshots schematically showing pie charts of the quantity of customized CGM information, measured by time period, obtained after the integration processing shown in Fig. 5;
Fig. 10 is a screenshot schematically showing a time span chart of the customized CGM information obtained after the integration processing shown in Fig. 5;
Figs. 11A and 11B are schematic diagrams of the customized CGM information obtained after the sentiment analysis and integration processing shown in Fig. 5;
Fig. 12 is an example of a critical event obtained by the critical-event determination processing shown in Fig. 5;
Fig. 13 is a simplified block diagram of a device for processing CGM information according to an embodiment of the present invention; and
Fig. 14 is a schematic block diagram of a computer that can be used to implement the method and device according to embodiments of the present invention.
Detailed description of embodiments
Embodiments of the invention are described below with reference to the drawings. Elements and features described in one drawing or embodiment of the invention can be combined with elements and features shown in one or more other drawings or embodiments. It should be noted that, for clarity, representations and descriptions of components and processes that are unrelated to the invention or known to those of ordinary skill in the art are omitted from the drawings and the description.
Fig. 1 is a general flowchart of a method 100 for processing CGM information according to an embodiment of the present invention. As shown in Fig. 1, method 100 starts at step S110. At step S120, CGM information is collected and extracted from different information supply sources. Then, at step S130, the CGM information is filtered according to a filtering policy corresponding to the extracted CGM information, so as to obtain CGM information related to a predetermined topic. Then, at step S140, the filtered CGM information is integrated based on customized rules, so as to obtain customized CGM information.
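The three steps S120 to S140 can be pictured as a small pipeline. The following Python sketch is only an illustration under stated assumptions: the `CgmItem` record, the keyword-based filter and the "group by attribute" integration rule are all hypothetical stand-ins, not the filtering policies or customized rules of the embodiment.

```python
from dataclasses import dataclass


@dataclass
class CgmItem:
    source: str  # hypothetical label of the information supply source
    title: str
    body: str


def collect_and_extract(sources):
    """S120: turn raw records from every supply source into CgmItem objects."""
    items = []
    for source_name, records in sources.items():
        for rec in records:
            items.append(CgmItem(source_name, rec["title"], rec["body"]))
    return items


def filter_by_topic(items, topic_keywords):
    """S130: keep only items that mention the predetermined topic.
    A plain keyword test is assumed here purely for illustration."""
    keywords = [k.lower() for k in topic_keywords]
    return [it for it in items
            if any(k in (it.title + " " + it.body).lower() for k in keywords)]


def integrate(items, group_key):
    """S140: integrate the filtered items by a customized rule; the assumed
    rule is simply 'group by an attribute chosen by the user'."""
    grouped = {}
    for it in items:
        grouped.setdefault(getattr(it, group_key), []).append(it)
    return grouped


sources = {
    "forum": [{"title": "Notebook battery life", "body": "Lasts six hours."},
              {"title": "Holiday photos", "body": "Nothing about computers."}],
    "feed": [{"title": "Review", "body": "This notebook runs hot."}],
}
customized = integrate(
    filter_by_topic(collect_and_extract(sources), ["notebook"]), "source")
print({name: [it.title for it in group] for name, group in customized.items()})
```

Keeping each step as a separate function mirrors the flowchart: any one stage can be replaced (for example by a per-page-type filtering policy) without touching the others.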
Fig. 2 is a general flowchart of a specific implementation of the CGM information processing method of the embodiment shown in Fig. 1. As shown in Fig. 2, the method 200 according to this implementation includes collecting, at 212, various CGM information from the Internet 210 serving as the information supply source, and performing, at 214, extraction processing on the collected CGM information. At 216, the extracted CGM information is filtered according to a corresponding filtering policy. At 218, the filtered CGM information is then analyzed and integrated based on customized rules. As a result of the analysis and integration processing, the customized CGM information can be presented to the user at 220, for example in a visual manner. In addition, if a predefined critical event has occurred, the critical event can also be determined at 222 and, optionally, reported to the user (not shown) to draw the user's attention to it. The determined critical event can be reported in response to a specific request from the user, which avoids reporting to the user every time a critical event is found and thereby annoying the user, and thus makes the CGM information processing more user-friendly (described in detail below).
The processing of each step of the CGM information processing method 200 shown in Fig. 2 is described in detail below by way of example with reference to Figs. 3-12.
Fig. 3 is a general flowchart of a specific example of the information collection and extraction processing 212, 214 of the method shown in Fig. 2 (denoted by 304 in Fig. 3).
As shown in Fig. 3, CGM information can be collected from various information supply sources 302, which include but are not limited to RSS/ATOM feeds 310, forums 320, search engines 330 and user-defined URLs (uniform resource locators)/sites 340. The processing for collecting and extracting CGM information from each of these information supply sources is described below in turn.
RSS is the abbreviation of Really Simple Syndication, a simple means by which a website shares content with other websites, also called content aggregation. In its original sense, RSS converts website content such as titles, links, partial text or even the full text into Extensible Markup Language (XML) format for syndication to other websites. An RSS feed is in fact an XML file that contains update information for articles (the update information provided by the supplier of the feed). ATOM is the successor of RSS and is designed to make it easier to handle all of the content provided in a feed; to this end, the description tag of RSS is split in ATOM into two elements, summary and content. Since RSS/ATOM feeds are a concept well known to those skilled in the art, they are not described in further detail here.
Blogs, news and the like usually come from RSS/ATOM feeds 310. As described above, such information is well organized, for example generally in XML format, so in the collection and extraction processing 304 it can be collected by a tool such as Feedfetcher. Feedfetcher is the robot, or crawler, of the Google Reader and Google personalized homepage subscription services. When a user subscribes to someone's blog in Google Reader or on the Google homepage, Google's Feedfetcher periodically visits that blog's website at its RSS address and crawls the feed. Data in XML format are then extracted from the crawled content and stored in the database 360. Of course, those of ordinary skill in the art will readily appreciate that Feedfetcher is only an example, and that any other suitable information acquisition tool can be used to collect XML-format data from the RSS/ATOM feeds 310.
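As a rough illustration of why feed content is easy to collect, the sketch below parses a small RSS 2.0 document with Python's standard `xml.etree.ElementTree` module. The sample feed, its URLs and the `parse_rss` helper are hypothetical; fetching the feed over the network and handling ATOM are deliberately left out.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 document standing in for a crawled feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>Battery life of notebook computers</title>
      <link>http://example.com/posts/1</link>
      <description>My notebook lasts about five hours.</description>
    </item>
    <item>
      <title>Keyboard review</title>
      <link>http://example.com/posts/2</link>
      <description>The keys feel shallow.</description>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return one dict per <item>, using the standard RSS 2.0 element names."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        entries.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return entries


for entry in parse_rss(SAMPLE_RSS):
    print(entry["title"], "->", entry["link"])
```

Because every entry already carries named elements, no template matching is needed at this stage; the fields can be copied into the store directly.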
As noted above, the CGM information collected from the RSS/ATOM feeds 310 using Feedfetcher usually has a well-organized form. Taking collected comments in the form of posts as an example, information extraction processing can readily extract specific information organized by poster, posting time, post title, post content and so on. For example, suppose a web page on topic 1, "notebook computer", is downloaded (i.e. collected) from a website; the page contains N posts, and the full text of these posts conforms to XML format. For instance, post 1 of the N posts is a post titled "battery life of notebook computers" created by poster Mike on January 1, 2001. The specific information contained in post 1 can then be extracted: the poster is "Mike", the posting time is "January 1, 2001", the post title is "battery life of notebook computers", and the specific content of the post is contained in the body of the post. The collected data and the extracted specific information can then be stored in the database 360 in XML format, for example "<poster>Mike</poster>", "<title>battery life of notebook computers</title>", and so on. Storing the data in XML format thus helps to organize the extracted CGM information according to its composition and the details of each component, which facilitates subsequent processing (described in detail later).
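The stored record for post 1 might be built as follows. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the element names (`post`, `poster`, `posted`, `title`, `body`) and the date format are assumptions chosen for illustration, since the specification does not fix a schema.

```python
import xml.etree.ElementTree as ET


def post_to_xml(poster, posted, title, body):
    """Serialize one extracted post as an XML string (hypothetical schema)."""
    post = ET.Element("post")
    ET.SubElement(post, "poster").text = poster
    ET.SubElement(post, "posted").text = posted
    ET.SubElement(post, "title").text = title
    ET.SubElement(post, "body").text = body
    return ET.tostring(post, encoding="unicode")


record = post_to_xml(
    poster="Mike",
    posted="2001-01-01",
    title="battery life of notebook computers",
    body="Post text goes here.",
)
print(record)
```

Each component of the post keeps its own element, so later stages can address the poster, time, title or body individually.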
In addition, where the website content is not fully in XML format but only the titles, links, partial text and the like conform to XML format, the content of the web page needs to be obtained via the corresponding URL of the website. For example, the web page content can be obtained from the URL using at least one of the following tools: Gecko (see http://en.wikipedia.org/wiki/Gecko_%28layout_engine%29) and other layout engines (see http://en.wikipedia.org/wiki/List_of_layout_engines). Extraction processing on the obtained web page content can be performed, for example, by a predefined wrapper and/or by a wrapper generated on the basis of template detection (not shown), as described in further detail below.
A forum 320 generally comprises a series of URLs. The content of the corresponding web pages can therefore be collected via the URLs using a tool such as Gecko mentioned above. Most of the CGM information obtained from the forum 320 (for example forum pages) is organized on the basis of various templates and has a regular structure. A number of templates can therefore be predefined by analyzing in advance the collected web pages that constitute the CGM information; during the extraction processing 326, the predefined templates contained in a predefined wrapper are matched against the collected CGM information, so that data in a predetermined format, for example XML, are extracted according to the structure of the template. Templates are diverse and change frequently, however, so the predefined templates may not cover every template that can occur. For this reason, the collected CGM information is first matched using the predefined wrapper; if the matching fails, template detection is performed at 322 and a new wrapper is generated at 324 from the detected template, and the matching is then completed with the newly generated wrapper so that the corresponding information is extracted from the collected CGM information.
For example, the template detection and wrapper generation of 322 and 324 in Fig. 3, and the extraction of data from example pages using the detected template, can be carried out with the method disclosed in the reference "Automatic Data Extraction from Template-Generated Web Pages" by Yang Shaohua et al., Journal of Software, Vol. 19, No. 2, February 2008, pp. 209-223. Alternatively, information extraction with a wrapper generated when the predefined templates fail to match the collected CGM information can be implemented with the method disclosed in "A method for fully automatic generation of Web page information extraction wrappers" by Mei Xue, Cheng Xueqi et al., Journal of Chinese Information Processing, 2008, Vol. 22, No. 1, pp. 22-29. In this method, for CGM information that is not organized on the basis of a known template, template detection exploits the structured, hierarchical nature of web page design templates and applies a web page link classification algorithm, a web page structure separation algorithm and the like to extract each information unit in the web page and output a corresponding wrapper. The output wrappers are then used to extract information from similar web pages, so as to obtain data in a predetermined format, for example XML.
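The "try the predefined wrapper first, generate a new one on failure" control flow can be sketched as below. This is a toy under loud assumptions: wrappers are modeled as regular expressions with named groups, pages are plain text rather than HTML, and `generate_wrapper` is a deliberately naive stand-in for steps 322/324 that only recognizes "label = value" or "label | value" header lines. It does not implement the algorithms of the cited papers.

```python
import re


def regex_wrapper(pattern):
    """Build a wrapper from a regex template with named groups."""
    compiled = re.compile(pattern, re.S)

    def wrapper(page):
        match = compiled.fullmatch(page)
        return match.groupdict() if match else None

    return wrapper


# Predefined wrapper for one known (hypothetical) forum layout.
predefined = [regex_wrapper(
    r"Title: (?P<title>[^\n]+)\nPoster: (?P<poster>[^\n]+)\n\n(?P<body>.+)")]


def generate_wrapper(sample_page):
    """Toy template detection: learn the labels and separator of the header
    lines and return a wrapper for pages laid out the same way."""
    header, _, _ = sample_page.partition("\n\n")
    parsed = [re.match(r"(\w+)\s*([=|])\s*", line) for line in header.splitlines()]
    if not parsed or not all(parsed):
        return None  # layout not recognized by this toy detector
    parts = [
        rf"{re.escape(m.group(1))}\s*{re.escape(m.group(2))}\s*(?P<{m.group(1).lower()}>[^\n]+)"
        for m in parsed
    ]
    return regex_wrapper(r"\n".join(parts) + r"\n\n(?P<body>.+)")


def extract(page, wrappers):
    """Match with known wrappers first; on failure, generate and keep a new one."""
    for wrapper in wrappers:
        fields = wrapper(page)
        if fields is not None:
            return fields
    new_wrapper = generate_wrapper(page)
    if new_wrapper is None:
        return None
    wrappers.append(new_wrapper)  # reused for later pages with the same layout
    return new_wrapper(page)


page_a = "Title: Battery life\nPoster: Mike\n\nIt lasts five hours."
page_b = "Subject = Screen\nAuthor = Ann\n\nVery bright."
page_c = "Subject = Fan noise\nAuthor = Bob\n\nQuite loud."

known = list(predefined)
for page in (page_a, page_b, page_c):
    print(extract(page, known))
```

The point of the design is that the generated wrapper is appended to the list, so the second page with the new layout (`page_c`) is handled by ordinary matching without a second round of template detection.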
A wrapper is briefly explained here. A wrapper is a tool for web information extraction, which can take the form of a software component; it extracts the information contained in HTML documents and converts it into data stored in a data structure that can be processed further. For example, a wrapper can be constructed by machine learning. In the construction process, samples labeled in advance are provided, and extraction knowledge is learned automatically by a machine learning algorithm and stored as a suitable model (i.e. a template); when a new web page is encountered, the stored extraction model is matched against the page, and if it matches, the corresponding information fragments are extracted from the page. In information extraction with a machine-learning-based wrapper, a preprocessing step performs text feature extraction. The learning algorithm learns an extraction model on the basis of the text features and saves it in the wrapper. When information is extracted, the wrapper checks whether the object to be extracted matches an extraction model already learned and, if so, obtains the target information on the basis of that model. The basic concepts and functions of wrappers are known to those skilled in the art and are not repeated here.
For CGM information from the forum 320 that is organized on the basis of a known template, the predefined wrapper (which contains the predefined templates) is used directly for matching at 326 to determine the structure of the template used to organize the CGM information, and the various items of specific information in the CGM information can then be extracted on the basis of the determined template. Taking post 1 above as an example again, matching by the predefined wrapper determines that the template used by post 1 has the structure: field 1: title; field 2: poster; field 3: posting time; field 4: post body. Each field in post 1 can then be identified according to this template structure, the corresponding specific information extracted from each field, and the result stored in the database 360 as XML-format data. For CGM information from the forum 320 that is not organized on the basis of a known template, as described above, the corresponding specific information can be extracted from the collected CGM information by template detection, wrapper generation and the like, and stored in the database 360 as XML-format data. As also described above, if the CGM information collected from the RSS/ATOM feeds 310 is not entirely in XML format, information extraction can likewise be performed by the predefined wrapper and/or by a wrapper generated on the basis of template detection.
For user-defined URLs/sites 340, a tool such as a "spider" can be used, for example, to download dynamic web pages at 342 and obtain the content of the pages of the site via the specified URLs. A spider is an automatic program of a search engine; its function is to visit HTML pages on the Internet and build an index database so that users can find the pages of a particular website in the search engine. For example, Google's spider reads the text content of pages as it fetches web data and follows the links in each page level by level, thereby crawling the content of the whole site. For example, the CGM information can be collected at 342 from the user-defined URLs/sites 340, for instance by obtaining the content of dynamic web pages, using the technique disclosed in the reference "Application of the JavaScript engine in dynamic web page acquisition" by Wang Ying et al., Computer Applications, Vol. 24, No. 2, February 2004, pp. 34-36, or using SpiderMonkey (see https://developer.mozilla.org/en/SpiderMonkey). After the web page content, i.e. the CGM information, has been obtained, information extraction is performed at 344 using any suitable tool, such as a predefined wrapper. Although not shown in Fig. 3, it is readily understood that if the predefined wrapper cannot match the page to be extracted during information extraction, template detection and wrapper generation similar to 322 and 324 above can also be used to extract the information. Likewise, the XML-format data obtained by the collection and extraction processing are stored in the database 360 for subsequent processing.
A search engine 330 serving as an information supply source generally involves a list of search engines and keywords, where the keywords can be user-defined or system defaults. For the query results obtained by submitting the keywords to a search engine, the content of the result pages can be obtained with a tool such as Gecko mentioned above. It should be noted that the CGM information returned from the search engine 330 has certain peculiarities. Because the content obtained from the search engine 330 is limited in data volume and varied in form, it can first be judged, according to actual needs, whether there is information that does not need to be collected. For example, if text comments on a certain commodity are to be collected, content such as pictures and music returned by the search engine 330 can generally be judged irrelevant and excluded from collection. For the collected web page content (i.e. the CGM information), information extraction can be performed at 332, for example using a predefined wrapper, similarly to the processing at 326. Likewise, although not shown in Fig. 3, it is readily understood that if the predefined wrapper cannot match the page to be extracted during information extraction, template detection and wrapper generation similar to 322 and 324 above can also be used to extract the information. Most of the results returned by search engines are generated from templates and are highly organized, so once the required information has been collected from the search engine, the subsequent extraction processing is similar to that for CGM information from the forum 320. The XML-format data obtained by the collection and extraction processing are likewise stored in the database 360.
Existing CGM information processing methods generally do not treat the content returned by search engines as an object of information collection and extraction because, as mentioned above, the information obtained from search engines takes many forms and cannot be handled as simply as grabbing the required information directly from a BBS or blog page. The CGM information processing method according to embodiments of the present invention also includes search engines among the information supply sources, thereby broadening the scope of CGM information processing, improving its efficiency and widening the usefulness of CGM information.
It should be noted that, although the specific example of information collection and extraction illustrated in Fig. 3 obtains XML-format data through the information extraction processing, those skilled in the art will understand that XML-format data are merely one concrete example of a structured representation of CGM information. Any other data format capable of identifying the structure and content of the individual parts of the collected CGM information can also be used; JSON (JavaScript Object Notation), for example, is another option. JSON is a lightweight data interchange format that is easy for people to read and write and easy for machines to parse and generate. An introduction to the JSON data format can be found, for example, at http://json.org/xml.html, and is not repeated here. In addition to storing the XML-format data obtained by the information extraction processing, the database 360 can also store the various CGM information obtained by the collection processing. It is readily understood that, in the example of Fig. 3, the XML-format data obtained by collecting and extracting CGM information from the different information supply sources 302 are all stored in the same database 360, but these data can of course also be stored in separate databases, one for each source. Furthermore, the database that stores the XML-format data and the database that stores the collected CGM information can be different databases.
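For comparison, an extracted post could be held in JSON instead of XML. The sketch below uses Python's standard `json` module; the field names and values are hypothetical and simply mirror the kind of information (poster, time, title, body) discussed for posts.

```python
import json

# Hypothetical extracted post; the keys are illustrative, not a fixed schema.
post = {
    "poster": "Mike",
    "posted": "2001-01-01",
    "title": "battery life of notebook computers",
    "body": "Post text goes here.",
}

encoded = json.dumps(post, ensure_ascii=False)  # text form suitable for storage
decoded = json.loads(encoded)                   # parsed back into a dict

print(encoded)
print(decoded["title"])
```

Either representation preserves the same named parts of the post; the choice mainly affects which parsers and storage back ends are convenient downstream.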
A simplified flow of a specific example of the information filtering processing in the method shown in Fig. 2 is described next with reference to Fig. 4. As shown in Fig. 4, the processing at 410-450, which starts with web page type judgment, takes the data stored in the database 360 as input. Web pages of different types differ not only in how information is published and displayed, but also in content. In the context of this specification, "web page type" includes, but is not limited to, BBS, blog, news, SNS (Social Network Site), newsgroup, product preview and discussion website, online retail website that supports consumer opinions, and so on. Web page type judgment may be performed so that different filtering strategies can be applied to different types of web pages.
For example, the web page type may be judged by the method disclosed in Chinese patent application No. 200910133695X, entitled "Method and apparatus for judging web page type", by inventors He Nan et al. The method disclosed in that application comprises: performing rule matching in a pre-stored rule list based on the URL of the web page to be judged, the rule list containing a plurality of rule records for determining web page type; if rule matching succeeds, obtaining the web page type of the page from the matched rule; and if rule matching fails, extracting predetermined features from the URL and/or HTML source code of the page and, based on a feature vector composed of features selected from the extracted predetermined features, classifying the page with a classifier to obtain its web page type. This method combines the advantages of rule-based identification and identification based on statistical learning, and can judge many web page types, including blog, forum, news and so on. Alternatively, the SVM (support vector machine) based blog identification method proposed by Pranam Kolari, Tim Finin and Anupam Joshi in their 2006 paper "SVMs for the Blogosphere: Blog Identification and Splog Detection" may be used; its features mainly include the words in the web page, the uniform resource locator (URL) of the web page, the anchor text in the web page and so on, and good identification results are achieved by combining different features. The machine-learning-based blog page judgment method proposed in U.S. patent application US2007/0294252A1, entitled "Identifying a web page as belonging to a blog" (published December 20, 2007), may also be used. It is similar to the method of Pranam Kolari et al., except that the U.S. application introduces a decision threshold T: if the probability P that a web page is a blog page is less than the threshold T, additional features are extracted from the web page and the judgment is made again. In addition, the blog identification method proposed by Tomoyuki Nanno et al. in their 2004 paper "Automatic Collection and Monitoring of Japanese Weblogs" may be used for web page type judgment. This method does not use statistical machine learning; instead, it analyzes features of the HTML (HyperText Markup Language) page and judges a page to be a blog page if it contains article entries meeting certain characteristics. These characteristics include: each entry must contain a date expression in its header, the dates have a consistent format, and the dates are sorted in ascending or descending order.
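The two-stage judgment described above (rule matching on the URL first, a classifier only when no rule matches) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the cited application: the URL patterns are hypothetical, and the fallback is a simple keyword counter standing in for a trained classifier.

```python
import re

# Hypothetical rule list of (URL pattern, web page type); assumed for illustration.
RULES = [
    (re.compile(r"/(bbs|forum)/"), "bbs"),
    (re.compile(r"//blog\.|/blog/"), "blog"),
    (re.compile(r"/news/"), "news"),
]

def match_rules(url):
    """Return the type given by the first matching rule, or None if none match."""
    for pattern, page_type in RULES:
        if pattern.search(url):
            return page_type
    return None

def fallback_classify(html):
    """Stand-in for a trained classifier: counts a few assumed cue words."""
    text = html.lower()
    scores = {
        "blog": text.count("posted by") + text.count("permalink"),
        "bbs": text.count("reply") + text.count("thread"),
        "news": text.count("reporter") + text.count("editor"),
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

def judge_page_type(url, html=""):
    """Use rule matching first; fall back to the classifier if it fails."""
    return match_rules(url) or fallback_classify(html)
```

In a real system the fallback would be a classifier trained on a feature vector extracted from the URL and HTML source, as the text describes; the keyword counter only marks where that classifier would plug in.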
In the web page type judgment processing, if the web page type is judged at 410 to be BBS/blog/other, spam filtering processing 420 is then performed. "Spam" refers to unsolicited information and is essentially a kind of junk information, so it needs to be filtered out. For example, spam may be filtered by the method disclosed in the reference "Detecting spam web pages through content analysis" by Alexandros Ntoulas et al., published in Proceedings of the 15th International Conference on World Wide Web (2006), pp. 83-92.
Subsequently, relevance judgment processing is performed at 430 and 450 on the information that has passed spam filtering. Relevance judgment means determining the relevance between a web page and a given topic. As a precondition, one or more topics need to be set, each topic comprising a description and one or more keywords and key phrases. At 430, each web page is checked against each topic, and each topic/web page pair is given a score according to the degree of relevance between the web page and the topic. Then, at 450, the score obtained by relevance scoring is compared with a predetermined threshold; if the score exceeds the threshold, the web page is determined to be relevant to the topic ("Yes" branch) and is stored in the database 460. This relevance judgment can be implemented by various suitable methods, for example the method disclosed in the reference "Improved Algorithms for Topic Distillation in a Hyperlinked Environment" by Krishna Bharat and Monika R. Henzinger, published in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1998), pp. 104-111. If, on the other hand, the web page type is judged at 410 to be news, relevance scoring processing 440 is performed directly. The resulting score is then likewise compared at 450 with the predetermined threshold, and if it exceeds the threshold the web page is determined to be relevant to the topic ("Yes" branch) and stored in the database 460. This processing can equally be implemented by the method disclosed by Krishna Bharat et al. mentioned above.
By performing web page type judgment and applying different filtering strategies to different web page types, the efficiency and accuracy of the filtering processing can be significantly improved.
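The type-dependent filtering flow of Fig. 4 can be sketched in Python as follows. The spam check and the relevance score are deliberately simple placeholders chosen for this example (a repeated-word ratio and a keyword hit ratio); they are assumptions and are not the methods of the cited references.

```python
def is_spam(text):
    """Placeholder spam check (assumption): one word makes up over half the text."""
    words = text.lower().split()
    if not words:
        return True
    most_common = max(set(words), key=words.count)
    return words.count(most_common) / len(words) > 0.5

def relevance_score(text, keywords):
    """Fraction of the topic's keywords and key phrases that occur in the text."""
    lowered = text.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in lowered)
    return hits / len(keywords)

def filter_page(page_type, text, keywords, threshold=0.5):
    """Return True if the page should be stored as relevant to the topic."""
    # BBS/blog/other pages pass through spam filtering (420) first;
    # news pages go directly to relevance scoring (440).
    if page_type != "news" and is_spam(text):
        return False
    # Compare the score with the predetermined threshold (450).
    return relevance_score(text, keywords) > threshold
```

For example, with the hypothetical topic keywords "laptop", "battery" and "recall", the text "buy buy buy laptop battery" is rejected as spam when it comes from a blog but is kept when it comes from a news page, because the news path skips the spam check.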
A simplified flowchart of a specific example of the sentiment analysis and integration processing in the method shown in Fig. 2 is described next with reference to Fig. 5. As shown in Fig. 5, sentiment analysis is performed at 510 on the web pages stored in the database 460. Through sentiment analysis, each web page in the database 460 is given an evaluation value that represents the polarity of the sentiment and its degree. For example, the sign of the evaluation value may indicate whether the opinion is positive or negative, and the larger the magnitude of a positive/negative evaluation value, the stronger the positive/negative sentiment. The sentiment analysis may be performed, for example, by the method disclosed in the reference "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales" by Bo Pang and Lillian Lee, published in Proceedings of ACL (2005), pp. 115-124. At 540, the evaluation value obtained by sentiment analysis may be compared with a predetermined threshold. If the evaluation value exceeds the threshold ("Yes"), it is determined at 550 that the web page given this evaluation value constitutes a key event, and whether to report the key event to the user can be decided according to actual needs. Here, the predetermined threshold may also be a predetermined threshold range, in which case a key event is determined to have occurred when the evaluation value falls within the range. A "key event" is an event of particular concern to the user; it may relate to information with a negative sentiment or, depending on actual needs, to information with a positive sentiment.
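The threshold-range variant of the key event check at 540 and 550 can be sketched as follows. The class name ScoredPage, the page identifiers and the range of -10 to -5 are hypothetical values chosen for this example only.

```python
from dataclasses import dataclass

@dataclass
class ScoredPage:
    url: str          # hypothetical identifier of the web page
    evaluation: int   # sign gives the polarity, magnitude gives the degree

def find_key_events(pages, lower, upper):
    """Return the pages whose evaluation value lies in the range [lower, upper]."""
    return [page for page in pages if lower <= page.evaluation <= upper]

pages = [
    ScoredPage("page-a", -7),
    ScoredPage("page-b", 0),
    ScoredPage("page-c", 3),
    ScoredPage("page-d", -5),
]

# Assumed rule: strongly negative pages (from -10 to -5) are key events.
negative_events = find_key_events(pages, lower=-10, upper=-5)
```

A user interested in positive feedback instead could pass a positive range, which matches the statement that key events may relate to either polarity.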
Sentiment analysis can be performed according to different sentiment evaluation rules. For example, for a post on a web page, sentiment analysis may be based on the positive or negative nature of the body of the post, on the importance level of the poster, or on the posting time. Evaluation values obtained under different sentiment evaluation rules differ in value and meaning, which can be set according to actual needs. In one embodiment, various sentiment evaluation rules may be specified in advance, including but not limited to the following: the sentiment evaluation object, that is, which part of the CGM information (author, body, title, creation time, and so on) the sentiment analysis is based on; the scoring criteria, that is, what content in the sentiment evaluation object is given what evaluation value; and the sentiment influence degree, that is, the correspondence between the evaluation value and the sentiment polarity (positive or negative influence). Although it is usually specified that a higher positive evaluation value means a stronger positive influence, and vice versa, any other correspondence may be set as needed. The scoring criteria may likewise be varied according to actual needs; for example, it may be specified that the evaluation value of a piece of CGM information is set high or low whenever certain sensitive words appear in it. For example, taking a post as the object of sentiment analysis, during a period when a major national holiday such as National Day is being celebrated, if wording such as "60th anniversary of the founding of New China" or "celebrating National Day" appears in the title, the body, or even the author name of the post, the post may directly be given a high positive evaluation value, indicating that it has a strong positive sentiment or degree of influence. Such predefined sentiment evaluation rules can be set according to the user's actual requirements and may include any content needed for sentiment analysis, not limited to the items listed above. In an alternative embodiment, the predefined sentiment evaluation rules may also be stored as history information for reference in subsequent sentiment analysis. It will be readily understood that, because sentiment analysis can be performed according to sentiment evaluation rules predefined by the user, the sentiment analysis is more flexible and better fits the user's actual needs, which increases the value of CGM information processing.
After sentiment analysis, integration processing may be performed at 520 on the CGM information that has undergone sentiment analysis. One example of integration processing is to gather web pages with similar content together, that is, clustering. An example implementation of such web page clustering is given below in pseudocode form:
[Figure B2009102218861D0000131: pseudocode of the web page clustering process]
The clustering process in the pseudocode above is as follows:
Create category C_1
Assign web page P_1 to category C_1
For each web page P_i from P_2 to P_n
    For each category C_j from C_1 to C_m
        Compute the similarity S_{i,j} between P_i and C_j
    Select the largest similarity S_{i,k} among S_{i,1} to S_{i,m}
    If S_{i,k} > predetermined threshold T
        Assign web page P_i to category C_k
    Else
        Create a new category C_{m+1}
        Assign web page P_i to category C_{m+1}
    End
End
The integration processing above is a clustering process based on web page content. In this clustering process, i, j, k, n and m are positive-integer indices of the quantities P, C and S. The clustering can be implemented by a program written in any of various programming languages, by hardware or firmware having this clustering function, or by any combination of software, hardware and firmware. In the clustering process, it is judged for each web page whether it belongs to the same category as web pages already seen or to a new category. In the former case, the web page is placed in the existing cluster of web pages of that category; in the latter case, a new web page category is created for it. If CGM information processing is performed online, the CGM information that has passed through the preceding collection and extraction, filtering and sentiment analysis arrives continuously, so the clustering is preferably performed incrementally: it is only judged whether a newly arrived web page belongs to one of the existing web page clusters, rather than re-clustering all web pages each time a new page arrives. This improves the efficiency of the integration processing. Of course, if CGM information processing is performed offline, incremental clustering need not be used; instead, clustering may be performed on the information obtained after all the CGM information buffered offline has gone through the preceding series of processing. The clustering described above is so-called "one pass" clustering; of course, any known clustering method may be used instead.
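The one-pass clustering described above can be sketched in Python as follows. The specification does not fix how the similarity between a web page and a category is computed; this sketch assumes, purely for illustration, the Jaccard similarity of word sets and takes the largest similarity to any member of the cluster.

```python
def jaccard(text_a, text_b):
    """Jaccard similarity of the word sets of two texts (assumed measure)."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity(page, cluster):
    """Similarity between a page and a cluster: best match to any member."""
    return max(jaccard(page, member) for member in cluster)

def one_pass_cluster(pages, threshold):
    """Assign each page to its most similar cluster, or open a new cluster."""
    clusters = []
    for page in pages:
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = similarity(page, cluster)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim > threshold:
            best.append(page)          # join the existing cluster C_k
        else:
            clusters.append([page])    # create a new cluster C_{m+1}
    return clusters

pages = [
    "new phone battery life",
    "phone battery life review",
    "hotel breakfast service",
    "new phone battery life",
]
loose = one_pass_cluster(pages, threshold=0.5)
strict = one_pass_cluster(pages, threshold=0.9)
```

Because each page is compared only with the clusters that already exist, the same loop body can serve the incremental, online case: a newly arrived page is handled by one more iteration rather than by re-clustering everything. With the sample pages, the stricter threshold yields three clusters where the looser one yields two.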
In the example above, the integration processing is performed on web pages and is a clustering based on web page content. Those skilled in the art will understand that integration processing may also be performed on other types of CGM information (for example, websites, videos and so on), and may be based on attributes other than content (for example creation time, author, source, evaluation value and so on). In an alternative embodiment, integration processing may also be based on any combination of attributes. The attributes, conditions and so on on which integration is based can be regarded as integration rules, and these rules can be customized, that is, set freely according to the user's actual needs. For example, if the combination of the creation time, author and topic of CGM information is taken as the integration rule, integration yields the CGM information related to topic ZZ that author A published between month XX of year XX and month YY of year YY. As can be seen, integration processing associates CGM information that has some commonality or consistency, and this commonality is determined by the customized integration rule. For example, still taking web pages as the CGM information to be integrated: if creation time is the integration rule, web pages with the same or similar creation times are considered to have commonality or consistency; similarly, if author is the integration rule, web pages by the same author are considered to have commonality or consistency; and if topic is the integration rule, web pages with the same or similar topics are considered to have commonality or consistency; and so on. The user can customize the integration rule, for example, by setting and adjusting parameters such as the similarity S_{i,k} and the similarity judgment threshold T in the clustering process described above. For example, the higher the similarity judgment threshold T, the more clusters the clustering produces.
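A customized integration rule that combines creation time, author and topic, as in the example above, can be sketched as a simple selection over CGM items. The dictionary keys (id, author, created, topics) and the sample values are assumptions made for this illustration.

```python
from datetime import date

def integrate(items, author, start, end, topic):
    """Select items by `author`, created within [start, end], that cover `topic`."""
    return [
        item for item in items
        if item["author"] == author
        and start <= item["created"] <= end
        and topic in item["topics"]
    ]

# Hypothetical CGM items; the field names are assumptions.
items = [
    {"id": 1, "author": "A", "created": date(2009, 3, 2), "topics": {"ZZ"}},
    {"id": 2, "author": "A", "created": date(2009, 8, 9), "topics": {"ZZ"}},
    {"id": 3, "author": "B", "created": date(2009, 3, 5), "topics": {"ZZ"}},
    {"id": 4, "author": "A", "created": date(2009, 4, 1), "topics": {"other"}},
]

# CGM information on topic ZZ published by author A in the first half of 2009.
selected = integrate(items, "A", date(2009, 1, 1), date(2009, 6, 30), "ZZ")
```

Changing any of the three arguments changes the commonality that the integrated items share, which is the sense in which the rule is customizable.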
It should be noted that, although in Fig. 5 the integration processing is performed after the sentiment analysis, those skilled in the art will understand that various alternatives are possible. In one alternative, if sentiment analysis and its evaluation values are not needed, the processing at 510, 540 and 550 can be omitted. In another alternative, the processing at 510, 540 and 550 can be performed in parallel with the integration processing at 520. Furthermore, because integration processing organizes the CGM information according to a specific integration rule, the integration processing 520 can also be performed before the processing at 510, 540 and 550.
The customized CGM information obtained after integration processing can be presented to the user, for example, visually. Visual presentation can be realized by various suitable display means, for example a display device such as a display screen. For the case in which the customized CGM information is presented on a display screen, Figs. 6-12 show screenshots of the customized CGM information displayed to the user after the sentiment analysis and/or integration processing shown in Fig. 5. Of course, those skilled in the art will understand that the customized CGM information obtained after integration processing may also be stored in a suitable storage device for other purposes without being displayed to the user. Alternatively, it may be presented to the user in other suitable ways, for example as audio, as a text description, or as any combination of audio, video and text description.
Fig. 6 shows a bar chart of the information sources of the CGM information obtained after the sentiment analysis and integration processing (clustering in this example) shown in Fig. 5, together with their evaluation values. Here, an "information source" is a source of CGM information that, after integration processing such as clustering, shares some commonality (for example, the same author or the same creation time). One information source may contain several concrete pieces of CGM information that were clustered together because of that commonality; these are referred to below as the concrete information corresponding to, or belonging to, the information source. As shown in the figure, the abscissa represents the information source ID (identifier), that is, the unique ID assigned to each information source (each cluster obtained by clustering) after the CGM information has been clustered, denoted EID. The ordinate represents the evaluation value obtained for the corresponding information source after sentiment analysis. In Fig. 6, the information sources labeled "I" have an evaluation value of 0, those labeled "II" have positive evaluation values, and those labeled "III" have negative evaluation values. In this example, it may be specified that an evaluation value of 0 represents a neutral evaluation, a positive value a positive evaluation and a negative value a negative evaluation; the higher a positive evaluation value, the more positive the influence, and the lower a negative evaluation value, the more negative the influence. This is of course only an example, and various evaluation criteria can be set according to actual conditions. Clicking a bar in the chart links directly to the corresponding information source and retrieves the concrete information belonging to it, such as blog articles, consumer messages, consumer posts, and so on.
One method of determining the evaluation value of an information source is illustrated below. Suppose web pages P1, P2 and P3 are clustered into information source A, and their evaluation values are -2, +1 and 0 respectively. The evaluation value of information source A can then be the arithmetic mean of the three, rounded to the nearest integer, that is, (-2 + 1 + 0)/3 ≈ 0. Of course, any other suitable method of determining the evaluation value of an information source can be adopted according to actual needs, for example a weighted mean of the evaluation values of the concrete information belonging to the information source. In addition, the evaluation value of each piece of CGM information (for example, each web page) before clustering can be displayed in a chart similar to Fig. 6.
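The rounded arithmetic mean and the weighted-mean alternative can be sketched together as follows. The rounding convention (halves round upward) and the example weights are assumptions; the specification only says that the mean is rounded.

```python
import math

def round_half_up(value):
    """Round to the nearest integer, with halves rounded upward (assumed rule)."""
    return math.floor(value + 0.5)

def source_evaluation(values, weights=None):
    """Evaluation value of an information source from its members' values.

    With no weights this is the rounded arithmetic mean; with weights it is
    the rounded weighted mean.
    """
    if weights is None:
        weights = [1] * len(values)
    mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)
    return round_half_up(mean)

# Web pages P1, P2, P3 with evaluation values -2, +1 and 0.
plain = source_evaluation([-2, 1, 0])
# Hypothetical weights that give P1 three times the influence of the others.
weighted = source_evaluation([-2, 1, 0], weights=[3, 1, 1])
```

Here the unweighted mean is about -0.33, which rounds to 0 as in the text, while the assumed weights pull the weighted mean to -1.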
Fig. 7 shows a screenshot of a line chart of the quantity of customized CGM information, measured by time period, obtained after the integration processing (clustering in this example) shown in Fig. 5. The abscissa represents time in weeks, and the ordinate represents the amount of CGM information. Suppose in this example that the obtained CGM information is clustered by author. As shown in the figure, at the end of the first week, the intersection "(0, 105)" of curve 1 with the ordinate indicates that the customized CGM information obtained after clustering contains content relating to 105 authors, that is, 100 "information sources" are obtained by clustering. The intersection "(0, 210)" of curve 2 with the ordinate indicates that the quantity of CGM information relating to these 100 authors (for example, the comments and blog articles they published) is 210; in other words, the concrete information corresponding to these 100 information sources numbers 210 items. Thus curve 1 represents the number of information sources in the customized CGM information, and curve 2 represents the number of pieces of concrete information corresponding to those information sources. In addition, clicking links such as "past 12 months", "past 10 weeks" or "past 40 days" changes the time span covered by the clustering result. Note that in this specific example, clicking these links does not trigger re-clustering; it only changes the time range to which the horizontal axis scale corresponds. For example, after clicking "past 12 months", the point with abscissa 1 corresponds to the first month; after clicking "past 10 weeks", the point with abscissa 1 corresponds to the first week.
It will be readily understood that clustering according to different clustering rules (that is, integration rules) yields different configurations of information sources and their concrete information. In addition, in the example of Fig. 7, a separate curve of the quantity of concrete information may also be drawn for each information source (that is, each CGM information author).
Fig. 8 shows a screenshot of a combined line and bar chart of the quantity of customized CGM information, measured by time period, obtained after the integration processing (clustering in this example) shown in Fig. 5. The abscissa represents time in months, the left ordinate represents the number of information sources in the customized CGM information, and the right ordinate represents the number of pieces of concrete information (for example, posts) corresponding to (that is, belonging to) the information sources. Curve 1 represents the number of information sources, and curve 2 represents the number of pieces of concrete information belonging to them. In the bar chart, the shorter bar on the left of each pair represents the number of information sources, and the taller bar on the right represents the number of pieces of concrete information belonging to them. The figure shows the quantities of customized CGM information (information sources and concrete information) for January, February and March 2008 and allows them to be compared visually.
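The two series plotted per time period (number of information sources and number of pieces of concrete information) can be derived as in the following sketch. The input layout, a list of (creation date, source ID) pairs with ISO-formatted dates, is an assumption made for this example.

```python
from collections import defaultdict

def counts_by_month(items):
    """Map each month to (number of distinct sources, number of items)."""
    sources = defaultdict(set)
    totals = defaultdict(int)
    for created, source_id in items:
        month = created[:7]            # "YYYY-MM" prefix of an ISO date string
        sources[month].add(source_id)
        totals[month] += 1
    return {month: (len(sources[month]), totals[month]) for month in sorted(totals)}

# Hypothetical concrete information: (creation date, ID of its information source).
items = [
    ("2008-01-05", "a"),
    ("2008-01-20", "a"),
    ("2008-02-01", "b"),
    ("2008-02-03", "c"),
    ("2008-02-09", "b"),
]
monthly = counts_by_month(items)
```

The first number of each pair would feed the information source series and the second the concrete information series; changing the prefix length or grouping key would give weekly or daily periods instead.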
Figs. 9A and 9B show screenshots of pie charts of the quantity of customized CGM information, measured by time period, obtained after the integration processing (clustering in this example) shown in Fig. 5. The pie chart of Fig. 9A represents the number of information sources in the customized CGM information; for example, "2008.02, 243" indicates that 243 information sources were obtained in February 2008. The pie chart of Fig. 9B represents the number of pieces of concrete information (for example, posts) belonging to the information sources; for example, "2008.02, 540" indicates that 540 pieces of concrete information were obtained in February 2008.
Fig. 10 shows a screenshot of a time span chart of the customized CGM information obtained after the integration processing (clustering in this example) shown in Fig. 5. The chart gives, for each information source (Resource), the time range from its start to its end, showing the start (Start) and finish (Finish) time points together with a bar representing the time period. Clicking a bar links to the details of the corresponding information source.
Fig. 11A is a schematic diagram of customized CGM information, tabulated in list form, obtained after the sentiment analysis and integration processing shown in Fig. 5. Note that in the example of Fig. 11A, classification of the CGM information according to customized rules is used as another alternative form of integration processing. That is, a number of categories are specified in advance by the user or the system, and each web page (that is, each piece of CGM information) to be integrated is then assigned one or more of these categories. As shown in Fig. 11A, seven content-based categories of CGM information are predetermined according to the user's needs, namely "corporate image", "potential brand misappropriation", "agents", "complaints and suggestions", "product related", "personnel matters" and "other"; the CGM information to be integrated is then divided among these seven categories based on its content, yielding the customized CGM information. The total number of pieces of relevant CGM information in each category can be given, as shown by the numbers in the "information content" column of the chart in Fig. 11A, together with the number of pieces of CGM information having a particular evaluation value (from 0 to -10), given by the numbers in brackets to the right of "details" in the chart. Classification is thus equivalent to attaching to every piece of CGM information a category "label" indicating its particular nature. Those skilled in the art will understand that the clustering described earlier can also be performed after classification. In that case, a chart similar to Fig. 11A can be displayed, with the "information content" column showing the number of relevant information sources after classification and clustering. Of course, the chart in Fig. 11A is only a schematic example; the results of integrating the CGM information can be displayed in various forms according to actual needs, and the details are not repeated here.
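Classification into predefined categories, with the per-category totals of the kind shown in Fig. 11A, can be sketched as follows. The keyword lists are hypothetical and only three of the seven categories are given keywords here; a practical system would more likely use a trained classifier.

```python
from collections import Counter

# Hypothetical keywords per predefined category; assumed for illustration only.
CATEGORY_KEYWORDS = {
    "corporate image": ["reputation", "image"],
    "complaints and suggestions": ["complaint", "suggest"],
    "product related": ["product", "launch"],
}

def classify(text):
    """Return every category whose keywords occur in the text, else 'other'."""
    lowered = text.lower()
    labels = [
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]
    return labels or ["other"]

def category_counts(texts):
    """Total number of pieces of CGM information assigned to each category."""
    counts = Counter()
    for text in texts:
        counts.update(classify(text))
    return counts

texts = [
    "New product launch announced",
    "Complaint about product quality",
    "Nice weather today",
]
totals = category_counts(texts)
```

Because a page may receive more than one category, the per-category totals can add up to more than the number of pages, which is consistent with assigning "one or more categories" to each piece of CGM information.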
Fig. 11B gives an example of customized CGM information obtained after both classification and clustering of the CGM information. As shown in Fig. 11B, all the customized CGM information is displayed as a list. For each item, the sequence number, title, number of related articles, evaluation value, category, information number and time of occurrence are given. An example is shown in the figure: in "1.xxxx company xxxx launch event (61; 0; product related; EID:7118) (2009-09-01)", "1" is the sequence number, "xxxx company xxxx launch event" is the title, "product related" is the category name, "61" is the number of related articles (that is, the number of pieces of concrete information belonging to the information source associated with the "product related" category), "0" is the evaluation value, "7118" is the information number (that is, the unique ID, or EID, assigned to each information source (cluster) obtained by clustering), and "2009-09-01" is the time of occurrence. Clicking the title links to the details of the CGM information, and clicking the buttons "sort by number of related articles", "sort by start time", "sort by end time", "sort by event ID" and "sort by evaluation value" sorts the customized CGM information accordingly to facilitate queries. The web page classification described above can be implemented, for example, using the technique in "Web classification using support vector machine", published in Workshop on Web Information and Data Management, 2002, pp. 96-99. Likewise, the presentation in Fig. 11B is only a schematic example; the customized CGM information obtained after classification and clustering can be displayed in various forms according to actual needs, and the details are not repeated here.
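The list entries and sort buttons of Fig. 11B can be sketched as follows. The dictionary keys and the second sample entry are hypothetical; the output format follows the example entry quoted in the text.

```python
# Hypothetical list entries; the keys are assumptions, the first entry mirrors the text.
entries = [
    {"title": "xxxx company xxxx launch event", "related": 61, "evaluation": 0,
     "category": "product related", "eid": 7118, "date": "2009-09-01"},
    {"title": "yyyy service feedback", "related": 12, "evaluation": -3,
     "category": "complaints and suggestions", "eid": 7042, "date": "2009-08-15"},
]

def sort_entries(items, key, descending=True):
    """Sort by one field, e.g. 'related', 'evaluation', 'eid' or 'date'."""
    return sorted(items, key=lambda entry: entry[key], reverse=descending)

def format_entry(number, entry):
    """Render one entry as: N.title (related; evaluation; category; EID:id) (date)."""
    return (
        f"{number}.{entry['title']} "
        f"({entry['related']}; {entry['evaluation']}; {entry['category']}; "
        f"EID:{entry['eid']}) ({entry['date']})"
    )

by_related = sort_entries(entries, "related")
by_evaluation = sort_entries(entries, "evaluation", descending=False)
lines = [format_entry(i, entry) for i, entry in enumerate(by_related, start=1)]
```

Each sort button would correspond to one call of sort_entries with a different key, after which the list is renumbered and rendered again.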
The classification-based integration processing described above with reference to Fig. 11A differs from the clustering-based integration processing described with reference to Figs. 6-10, whose integration results are obtained by clustering. Clustering gathers web pages that have no category of their own (that is, unlabeled samples) into different groups, and each such set of web pages is called a web page cluster. Classification, by contrast, defines the categories in advance and then assigns each web page to the appropriate category according to certain rules.
Although classification is performed on web pages in the specific example above, those skilled in the art will understand that classification can also be performed on other types of CGM information (for example, websites, videos, and so on), and can also be based on attributes other than the content of the CGM information (for example, creation time, author, source, or evaluation value). In an alternative embodiment, classification can also be based on any combination of these attributes. The attributes and conditions on which classification is based can be regarded as classification rules, that is, integration rules, and these integration rules can be customized, that is, set freely according to the user's actual needs.
The two kinds of integration processing described above, clustering and classification, can be performed individually, or both can be performed. When both are performed, the order is arbitrary: clustering can be performed first and classification second, or the reverse, or the two can run in parallel. Continuing with the example of Figures 11A-11B, suppose classification is based on the content of the web pages and, after classification, 100 web pages belong to the category "corporate image", of which 50 may have identical content and be suspected of plagiarism. Suppose clustering is also based on web page content. Then, with suitably chosen clustering parameters, clustering these 100 web pages "refines" the category. For example, the 100 web pages in the "corporate image" category may be clustered into web page clusters such as "speeches by company executives", "public opinion about the company", and "the company's image in public welfare", which makes it easier for the user to query related information. Moreover, because clustering brings web pages with similar content together, it acts as a "filter" for the plagiarized pages mentioned above: the user need not read a large number of redundant pages with repeated content, and only has to read one or a few pages in a web page cluster to understand what all pages in that cluster cover. Because clustering and classification can integrate CGM information from different angles according to different integration rules, the two complement each other and further improve the efficiency and value of CGM information processing. In addition, although both clustering and classification are based on web page content in the example above, the two can also be based on different attributes of the web page; for example, classification can be based on web page content while clustering is based on web page creation time. The specific integration rules can be customized by the user as required.
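The combination of classification followed by clustering can be sketched as below. This is only an illustration under stated assumptions: the keyword table, the Jaccard word-overlap similarity, the greedy single-pass clustering, and the 0.6 threshold are all stand-ins chosen for brevity; the source does not prescribe a particular classifier or clustering algorithm.

```python
from collections import defaultdict

# Hypothetical predefined categories and keyword rules (assumption, not from the source).
CATEGORY_KEYWORDS = {
    "corporate image": {"executive", "charity", "reputation"},
    "product-related": {"launch", "price", "defect"},
}


def tokens(text: str) -> set:
    """Lower-cased whitespace tokens of a page's text."""
    return set(text.lower().split())


def classify(page: str) -> str:
    """Classification: assign the page to the first predefined category it matches."""
    words = tokens(page)
    for category, keywords in CATEGORY_KEYWORDS.items():
        if words & keywords:
            return category
    return "other"


def jaccard(a: set, b: set) -> float:
    """Word-overlap similarity between two token sets, in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(pages, threshold: float = 0.6):
    """Clustering: greedy single pass, joining the first cluster whose seed page is similar enough."""
    clusters = []
    for page in pages:
        for members in clusters:
            if jaccard(tokens(page), tokens(members[0])) >= threshold:
                members.append(page)
                break
        else:
            clusters.append([page])
    return clusters


def classify_then_cluster(pages, threshold: float = 0.6):
    """Group pages by category, then refine each category into web page clusters."""
    by_category = defaultdict(list)
    for page in pages:
        by_category[classify(page)].append(page)
    return {cat: cluster(members, threshold) for cat, members in by_category.items()}


def representatives(clusters):
    """One page per cluster, so duplicated content only needs to be read once."""
    return [members[0] for members in clusters]
```

Reading only `representatives(...)` for a category corresponds to the "filter" effect described above: copied pages fall into the same cluster as the page they duplicate.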
Figure 12 is an example of the critical event shown in Figure 5. The figure lists the report date, title, start time, and end time of the critical event, the evaluation value given by emotion analysis, an event summary, the source, and so on. The evaluation value "3" indicates that this event is a negative event, and because this evaluation value "3" falls within the predetermined threshold range "10 to 10", the event is determined to be a critical event. The event title and evaluation value can be marked in a different font or color to draw the user's attention. Clicking "reprint or report" publishes the critical event to other users. Clicking "source" links to the website where the critical event appeared, to obtain the details of the event. A critical event generally means an event the user pays particular attention to, so the criterion for determining critical events can be adjusted to the user's focus. For example, the xxxx company in Figure 12 pays particular attention to product quality and service problems, so whether the content concerns the products or services of xxxx company can be used as the criterion for determining critical events; the critical events so determined can prompt the company to investigate further and look for solutions. If the user cares about events occurring near a specific point in time, time can be used as the criterion. For example, CGM information created closer to that point in time is given a higher positive evaluation value, and CGM information whose evaluation value falls within the threshold range (for example, a higher positive endpoint value or numerical range) is determined to be a critical event. The user can thus obtain critical events for corresponding subsequent processing. The ease of locating critical events also improves the efficiency of CGM information processing.
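The two criteria just described, an evaluation value falling within a threshold range and an evaluation value that grows as the creation time approaches a point of interest, can be sketched as follows. All numbers here are hypothetical: the threshold bounds are passed in by the caller, and the linear decay over a 30-day window is an assumed scoring function, since the source only says that closer items receive higher positive values.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CgmItem:
    title: str
    evaluation: float
    created: date


def is_critical(evaluation: float, low: float, high: float) -> bool:
    """True if the evaluation value falls within the inclusive threshold range [low, high]."""
    return low <= evaluation <= high


def critical_events(items, low: float, high: float):
    """Select the items whose evaluation value falls within the threshold range."""
    return [item for item in items if is_critical(item.evaluation, low, high)]


def proximity_score(created: date, target: date,
                    max_score: float = 10.0, window_days: int = 30) -> float:
    """Assumed linear decay: max_score on the target day, 0 at window_days away or more."""
    distance = abs((created - target).days)
    return max(0.0, max_score * (1 - distance / window_days))
```

With a time-based criterion, one would first set each item's `evaluation` to `proximity_score(item.created, target)` and then call `critical_events` with a range near the top of the scale.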
As the analysis above shows, CGM information that has been collected, extracted, and filtered can be integrated according to the user's customized rules to obtain customized CGM information. For example, integration can be based on the subject matter of the CGM information (for example, posts), so that customized CGM information is provided by topic. As another example, integration can be based on the author of the CGM information, so that the customized CGM information is arranged by author. Integration can also be based on temporal information such as the creation time or duration of the CGM information, so that the customized CGM information is provided in chronological order. Any combination of the integration rules listed above can also be used as the customized integration rule. For example, CGM information can be integrated by both time and author; the resulting customized CGM information then gives, for each time period, the CGM information related to each author, or, for each author, the related CGM information in each time period. Of course, any other suitable integration rule can be set according to actual needs, for example one based on the source or evaluation value of the CGM information; the details are not repeated here. The earlier collection, extraction, and filtering steps have already obtained the CGM information together with its associated characteristics, such as the composition of the CGM information and the details of each component (for example, as embodied in XML data), the relevance of the CGM information to a particular topic, and the emotion analysis value of the CGM information. This makes it easy for the integration step to integrate the information appropriately according to the customized rules, which significantly improves the efficiency and flexibility of CGM information processing and increases the usefulness of the CGM information. Moreover, the user only needs to supply a customized integration rule to obtain customized CGM information matching their needs, without performing any additional specialized operation, which makes operation more convenient. Those skilled in the art will understand that the customized integration rule can be supplied in real time while the CGM information is being processed, or it can be set in advance for use during integration. In addition, historical integration data can be stored for reference in subsequent integration processing. This flexibility in setting customized integration rules also makes CGM information processing more flexible and a closer fit to actual needs.
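A customized integration rule built from a combination of attributes can be sketched as a list of grouping keys. The record fields, the rule names in `RULES`, and the choice of a calendar month as the "time period" are assumptions made for this illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CgmItem:
    title: str
    author: str
    created: date
    source: str


# Hypothetical rule table: each rule maps an item to one component of its group key.
RULES = {
    "author": lambda item: item.author,
    "month": lambda item: item.created.strftime("%Y-%m"),
    "source": lambda item: item.source,
}


def integrate(items, rule_names):
    """Group item titles by the tuple of keys produced by the selected rules, in order."""
    groups = defaultdict(list)
    for item in items:
        key = tuple(RULES[name](item) for name in rule_names)
        groups[key].append(item.title)
    return dict(groups)
```

Passing `["month", "author"]` yields, for each time period, the items of each author; passing `["author", "month"]` yields the per-author view over time, matching the two readings given in the text.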
In addition, further embodiments of the present invention provide a device for processing consumer-generated media (CGM) information. Figure 13 shows this device 1300, which comprises: a collection and extraction unit 1310, which collects and extracts consumer-generated media information from different information sources; a filtering unit 1320, which filters the consumer-generated media information according to a filtering policy corresponding to the consumer-generated media information obtained by the collection and extraction unit, to obtain consumer-generated media information related to a predetermined topic; and an integration unit 1330, which integrates the consumer-generated media information obtained by the filtering unit based on customized rules, to obtain customized consumer-generated media information.
The device 1300 and its units 1310, 1320, and 1330 can be configured to perform the various operations described above with reference to Figures 1-5. For further details of these operations, refer to the embodiments, implementations, and examples described above; they are not described in detail here.
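The three-unit structure of the device 1300 can be pictured as a pipeline of three pluggable callables. The sketch below is only a software analogy under stated assumptions: the class `CgmDevice` and the helper functions are hypothetical, whitespace trimming stands in for real extraction, a substring test stands in for a real filtering policy, and sorted de-duplication stands in for a real integration rule.

```python
from typing import Callable, Dict, List


class CgmDevice:
    """Software analogy of device 1300: collect/extract (1310), filter (1320), integrate (1330)."""

    def __init__(self,
                 collect: Callable[[str, List[str]], List[str]],
                 keep: Callable[[str], bool],
                 integrate: Callable[[List[str]], List[str]]):
        self.collect = collect
        self.keep = keep
        self.integrate = integrate

    def run(self, sources: Dict[str, List[str]]) -> List[str]:
        # Unit 1310: gather items from every information source.
        collected = [item for name, raw in sources.items() for item in self.collect(name, raw)]
        # Unit 1320: keep only the items related to the predetermined topic.
        filtered = [item for item in collected if self.keep(item)]
        # Unit 1330: apply the customized integration rule.
        return self.integrate(filtered)


def collect(source_name: str, raw_items: List[str]) -> List[str]:
    """Stand-in for collection and extraction: trim whitespace and drop empty records."""
    return [text.strip() for text in raw_items if text.strip()]


def make_topic_filter(topic: str) -> Callable[[str], bool]:
    """Stand-in filtering policy: case-insensitive substring match on the topic."""
    return lambda text: topic.lower() in text.lower()


def integrate_sorted_unique(items: List[str]) -> List[str]:
    """Stand-in integration rule: remove exact duplicates and sort."""
    return sorted(set(items))
```

Because each unit is passed in as a function, a different filtering policy or integration rule can be substituted without changing the pipeline itself.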
The detailed description above has set out various embodiments of the device and/or method according to embodiments of the present invention by means of block diagrams, flowcharts, and/or examples. Where these block diagrams, flowcharts, and/or examples contain one or more functions and/or operations, it will be obvious to those skilled in the art that each function and/or operation can be implemented, individually and/or jointly, by a wide range of hardware, software, firmware, or virtually any combination thereof. In one embodiment, several parts of the subject matter described in this specification can be implemented by an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), or another integrated form. However, those skilled in the art will recognize that some aspects of the embodiments described in this specification can be equivalently implemented, in whole or in part, in integrated circuits, as one or more computer programs running on one or more computers (for example, as one or more programs running on one or more computer systems), as one or more programs running on one or more processors (for example, as one or more programs running on one or more microprocessors), as firmware, or as virtually any combination thereof, and that, in light of the content disclosed in this specification, designing the circuitry and/or writing the code for the software and/or firmware of the present disclosure is well within the ability of those skilled in the art.
For example, each component module, unit, and subunit of the device 1300 can be configured by software, firmware, hardware, or any combination thereof. When implemented by software or firmware, the program constituting the software can be installed from a storage medium or a network onto a computer with a dedicated hardware structure (for example, the general-purpose computer 1400 shown in Figure 14), and the computer can perform various functions once the various programs are installed.
Figure 14 is a schematic block diagram of a computer that can be used to implement the method and device according to embodiments of the present invention.
In Figure 14, a central processing unit (CPU) 1401 performs various processing according to a program stored in a read-only memory (ROM) 1402 or a program loaded from a storage section 1408 into a random-access memory (RAM) 1403. The RAM 1403 also stores, as needed, the data required when the CPU 1401 performs the various processing. The CPU 1401, ROM 1402, and RAM 1403 are connected to one another via a bus 1404. An input/output interface 1405 is also connected to the bus 1404.
The following components are also connected to the input/output interface 1405: an input section 1406 (including a keyboard, a mouse, and so on), an output section 1407 (including a display, such as a cathode ray tube (CRT) or liquid crystal display (LCD), and a speaker), a storage section 1408 (including a hard disk and so on), and a communication section 1409 (including a network interface card such as a LAN card, a modem, and so on). The communication section 1409 performs communication processing via a network such as the Internet. A drive 1410 can also be connected to the input/output interface 1405 as needed. A removable medium 1411, such as a magnetic disk, optical disc, magneto-optical disc, or semiconductor memory, is mounted on the drive 1410 as needed, so that a computer program read from it is installed into the storage section 1408 as needed.
When the above series of processing is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 1411.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 1411 shown in Figure 14, which stores the program and is distributed separately from the device to provide the program to the user. Examples of the removable medium 1411 include a magnetic disk (including a floppy disk), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disc (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium can be the ROM 1402, a hard disk contained in the storage section 1408, or the like, in which the program is stored and which is distributed to the user together with the device containing it.
Accordingly, the present invention also proposes a program product storing machine-readable instruction code. When the instruction code is read and executed by a machine, it can carry out the method of processing CGM information according to the embodiments of the present invention described above. Correspondingly, the various storage media listed above for carrying this program product are also included in the disclosure of the present invention.
Each list of references of mentioning in the superincumbent description for brevity, is incorporated into this with them by reference, this quoting as in this manual these lists of references having been carried out detailed description.
In the above in the description to the specific embodiment of the invention, can in one or more other embodiment, use in identical or similar mode at the feature that a kind of embodiment is described and/or illustrated, combined with the feature in other embodiment, or the feature in alternative other embodiment.
Should emphasize that term " comprises/comprise " existence that refers to feature, key element, step or assembly when this paper uses, but not get rid of the existence of one or more further feature, key element, step or assembly or additional.
In addition, the time sequencing of describing during method of the present invention is not limited to is to specifications carried out, also can according to other time sequencing ground, carry out concurrently or independently.Therefore, the execution sequence of the method for describing in this instructions is not construed as limiting technical scope of the present invention.
By top description to embodiments of the invention as can be known, the technical scheme that the present invention is contained includes but not limited to the described content of following remarks:
Remark 1. A method for processing consumer-generated media information, the method comprising the steps of:
collecting and extracting consumer-generated media information from different information sources;
filtering the consumer-generated media information according to a filtering policy corresponding to the extracted consumer-generated media information, to obtain consumer-generated media information related to a predetermined topic; and
integrating the filtered consumer-generated media information based on customized rules, to obtain customized consumer-generated media information.
Remark 2. The method for processing consumer-generated media information of remark 1, wherein the integrating step comprises clustering and/or classifying the filtered consumer-generated media information based on at least one of the following attributes of the consumer-generated media information: content, creation time, author, source, and evaluation value.
Remark 3. The method for processing consumer-generated media information of remark 1 or 2, further comprising performing emotion analysis, before the integrating step, on the consumer-generated media information related to the predetermined topic obtained by the filtering step, or performing emotion analysis, after the integrating step, on the customized consumer-generated media information obtained by integration, so as to assign a corresponding evaluation value to the consumer-generated media information subjected to emotion analysis.
Remark 4. The method for processing consumer-generated media information of remark 3, wherein the emotion analysis is performed according to a predetermined emotion evaluation rule, the emotion evaluation rule comprising at least: an emotion evaluation object; a scoring criterion; and a degree of emotional influence.
Remark 5. The method for processing consumer-generated media information of remark 3 or 4, further comprising comparing the evaluation value of the consumer-generated media information with a predetermined threshold range and, if the evaluation value falls within the predetermined threshold range, determining that the consumer-generated media information having that evaluation value is a critical event.
Remark 6. The method for processing consumer-generated media information of remark 5, wherein the critical event is reported to the user in response to the user's request.
Remark 7. The method for processing consumer-generated media information of any one of remarks 1-6, wherein the information sources include a search engine, and the collecting and extracting step performs extraction on the consumer-generated media information obtained from the search engine by means of a predefined wrapper.
Remark 8. The method for processing consumer-generated media information of any one of remarks 1-7, further comprising, before the filtering step, determining the type of the web pages in the extracted consumer-generated media information, so that the filtering step applies, to web pages of different types, the filtering policy corresponding to the web page type.
Remark 9. A device for processing consumer-generated media (CGM) information, the device comprising:
a collection and extraction unit, configured to collect and extract consumer-generated media information from different information sources;
a filtering unit, configured to filter the consumer-generated media information according to a filtering policy corresponding to the consumer-generated media information obtained by the collection and extraction unit, to obtain consumer-generated media information related to a predetermined topic; and
an integration unit, configured to integrate the consumer-generated media information obtained by the filtering unit based on customized rules, to obtain customized consumer-generated media information.
Remark 10. The device of remark 9, wherein the integration unit is configured to cluster and/or classify the filtered consumer-generated media information based on at least one of the following attributes of the consumer-generated media information: content, creation time, author, source, and evaluation value.
Remark 11. The device of remark 9 or 10, further comprising an emotion analysis unit configured to perform emotion analysis on the consumer-generated media information related to the predetermined topic obtained by the filtering unit, or on the customized consumer-generated media information obtained by the integration unit, so as to assign a corresponding evaluation value to the consumer-generated media information subjected to emotion analysis.
Remark 12. The device of remark 11, wherein the emotion analysis unit is configured to perform the emotion analysis according to a predetermined emotion evaluation rule, the emotion evaluation rule comprising at least: an emotion evaluation object; a scoring criterion; and a degree of emotional influence.
Remark 13. The device of remark 11 or 12, wherein the emotion analysis unit is further configured to compare the evaluation value of the consumer-generated media information with a predetermined threshold range and, if the evaluation value falls within the predetermined threshold range, to determine that the consumer-generated media information having that evaluation value is a critical event.
Remark 14. The device of remark 13, wherein the device reports the critical event to the user in response to the user's request.
Remark 15. The device of any one of remarks 9-14, wherein the information sources include a search engine, and the collection and extraction unit is configured to perform extraction on the consumer-generated media information obtained from the search engine by means of a predefined wrapper.
Remark 16. The device of any one of remarks 9-15, further comprising a web page type determination unit configured to determine the type of the web pages in the consumer-generated media information extracted by the collection and extraction unit, wherein the filtering unit is configured to apply, to web pages of different types, the filtering policy corresponding to the web page type.
Remark 17. A program product storing machine-readable instruction code,
wherein, when the instruction code is read and executed by a machine, the method for processing consumer-generated media information of any one of remarks 1-8 can be carried out.
Remark 18. A storage medium carrying the program product of remark 17.
Although the present invention has been disclosed above through the description of specific embodiments, it should be understood that those skilled in the art can devise various modifications, improvements, or equivalents of the present invention within the spirit and scope of the appended claims. Such modifications, improvements, or equivalents should also be considered to fall within the scope of protection of the present invention.

Claims (10)

1. A method for processing consumer-generated media information, the method comprising the steps of:
collecting and extracting consumer-generated media information from different information sources;
filtering the consumer-generated media information according to a filtering policy corresponding to the extracted consumer-generated media information, to obtain consumer-generated media information related to a predetermined topic; and
integrating the filtered consumer-generated media information based on customized rules, to obtain customized consumer-generated media information.
2. The method for processing consumer-generated media information as claimed in claim 1, wherein the integrating step comprises clustering and/or classifying the filtered consumer-generated media information based on at least one of the following attributes of the consumer-generated media information: content, creation time, author, source, and evaluation value.
3. The method for processing consumer-generated media information as claimed in claim 1 or 2, further comprising performing emotion analysis, before the integrating step, on the consumer-generated media information related to the predetermined topic obtained by the filtering step, or performing emotion analysis, after the integrating step, on the customized consumer-generated media information obtained by integration, so as to assign a corresponding evaluation value to the consumer-generated media information subjected to emotion analysis.
4. The method for processing consumer-generated media information as claimed in claim 3, wherein the emotion analysis is performed according to a predetermined emotion evaluation rule, the emotion evaluation rule comprising at least: an emotion evaluation object; a scoring criterion; and a degree of emotional influence.
5. The method for processing consumer-generated media information as claimed in claim 3 or 4, further comprising comparing the evaluation value of the consumer-generated media information with a predetermined threshold range and, if the evaluation value falls within the predetermined threshold range, determining that the consumer-generated media information having that evaluation value is a critical event.
6. The method for processing consumer-generated media information as claimed in claim 5, wherein the critical event is reported to the user in response to the user's request.
7. A device for processing consumer-generated media (CGM) information, the device comprising:
a collection and extraction unit, configured to collect and extract consumer-generated media information from different information sources;
a filtering unit, configured to filter the consumer-generated media information according to a filtering policy corresponding to the consumer-generated media information obtained by the collection and extraction unit, to obtain consumer-generated media information related to a predetermined topic; and
an integration unit, configured to integrate the consumer-generated media information obtained by the filtering unit based on customized rules, to obtain customized consumer-generated media information.
8. The device as claimed in claim 7, further comprising an emotion analysis unit configured to perform emotion analysis on the consumer-generated media information related to the predetermined topic obtained by the filtering unit, or on the customized consumer-generated media information obtained by the integration unit, so as to assign a corresponding evaluation value to the consumer-generated media information subjected to emotion analysis.
9. The device as claimed in claim 8, wherein the emotion analysis unit is configured to perform the emotion analysis according to a predetermined emotion evaluation rule, the emotion evaluation rule comprising at least: an emotion evaluation object; a scoring criterion; and a degree of emotional influence.
10. The device as claimed in claim 8 or 9, wherein the emotion analysis unit is further configured to compare the evaluation value of the consumer-generated media information with a predetermined threshold range and, if the evaluation value falls within the predetermined threshold range, to determine that the consumer-generated media information having that evaluation value is a critical event.
CN2009102218861A 2009-11-19 2009-11-19 Method, device and program for processing consumer-generated media information Pending CN102073641A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009102218861A CN102073641A (en) 2009-11-19 2009-11-19 Method, device and program for processing consumer-generated media information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009102218861A CN102073641A (en) 2009-11-19 2009-11-19 Method, device and program for processing consumer-generated media information

Publications (1)

Publication Number Publication Date
CN102073641A true CN102073641A (en) 2011-05-25

Family

ID=44032185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009102218861A Pending CN102073641A (en) 2009-11-19 2009-11-19 Method, device and program for processing consumer-generated media information

Country Status (1)

Country Link
CN (1) CN102073641A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102799599A (en) * 2011-05-27 2012-11-28 富士通株式会社 Method and device for processing customer-generated media information
CN103246676A (en) * 2012-02-10 2013-08-14 富士通株式会社 Method and device for clustering messages
CN102646134A (en) * 2012-03-29 2012-08-22 百度在线网络技术(北京)有限公司 Method and device for determining message session in message record
CN106033578A (en) * 2015-03-13 2016-10-19 阿里巴巴集团控股有限公司 Information prompting method and device
CN104899309A (en) * 2015-06-12 2015-09-09 百度在线网络技术(北京)有限公司 Method and device for displaying event review opinions
CN104899309B (en) * 2015-06-12 2019-04-30 百度在线网络技术(北京)有限公司 Method and device for displaying event review opinions
CN106294530A (en) * 2015-06-29 2017-01-04 阿里巴巴集团控股有限公司 Method and system for rule matching
CN108416642A (en) * 2017-12-05 2018-08-17 青岛海尔工业智能研究院有限公司 Product customization method, apparatus and server
CN109558499A (en) * 2018-10-12 2019-04-02 苏州佳世达光电有限公司 Method, apparatus and system for automatically combining multimedia information
CN110781371A (en) * 2019-10-16 2020-02-11 维沃移动通信有限公司 Content processing method and electronic equipment
CN110781371B (en) * 2019-10-16 2021-11-30 维沃移动通信有限公司 Content processing method and electronic equipment
CN111737455A (en) * 2019-12-02 2020-10-02 北京京东尚科信息技术有限公司 Text recognition method and device, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN102073641A (en) Method, device and program for processing consumer-generated media information
Rehm Towards automatic Web genre identification: a corpus-based approach in the domain of academia by example of the Academic's Personal Homepage
JP5879260B2 (en) Method and apparatus for analyzing content of microblog message
Johnson et al. Web content mining techniques: a survey
US8135669B2 (en) Information access with usage-driven metadata feedback
Kontostathis et al. A survey of emerging trend detection in textual data mining
CN110597981B (en) Network news summary system for automatically generating summary by adopting multiple strategies
JP4489994B2 (en) Topic extraction apparatus, method, program, and recording medium for recording the program
TWI493367B (en) Progressive filtering search results
CN103577579A (en) Resource recommendation method and system based on potential demands of users
CN110362740B (en) Water conservancy portal information hybrid recommendation method
CN109165367B (en) News recommendation method based on RSS subscription
EP2580726A1 (en) Method, apparatus and system of intelligent navigation
CN105426514A (en) Personalized mobile APP recommendation method
CN102855282A (en) Document recommendation method and device
Rabiei et al. Using text mining techniques for identifying research gaps and priorities: a case study of the environmental science in Iran
CN116384889A (en) Intelligent analysis method for information big data based on natural language processing technology
Wang et al. Seeft: Planned social event discovery and attribute extraction by fusing twitter and web content
CN110334112B (en) Resume information retrieval method and device
CN111859108A (en) Public opinion system search word recommendation system
TW201421265A (en) Intellectual news analyzing system
KR101544142B1 (en) Searching method and system based on topic
Oudshoff et al. Knowledge discovery in virtual community texts: Clustering virtual communities
KR102041915B1 (en) Database module using artificial intelligence, economic data providing system and method using the same
Scharl et al. Extraction and interactive exploration of knowledge from aggregated news and social media content

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 2011-05-25