CN102402563A - Network information screening method and device - Google Patents

Info

Publication number
CN102402563A
Authority
CN
China
Prior art keywords
information
event
sub-event
storage
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010102894956A
Other languages
Chinese (zh)
Inventor
王北斗
陈章义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN2010102894956A
Publication of CN102402563A

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The embodiment of the invention discloses a network information screening method and device. The method comprises the steps of: analyzing the content of a webpage related to a first event to obtain information of a second event, wherein the information comprises content abstract and time information of the second event; judging whether information identical to the information of the second event exists in the information of all sub-events of the stored first event, if not, storing the information of the second event as the information of a new sub-event of the first event; selecting at least one sub-event from all sub-events of the stored first event; and sequencing the information of the selected at least one sub-event according to the time information in the information and then providing the sequenced information to the user. According to the method and device disclosed by the invention, by extracting the event information from webpage content, only storing different events, sequencing the events according to a time sequence and providing the sequenced events to the user, the time for searching and screening information is saved for the user, and the information acquisition efficiency of the user is increased.

Description

Network information screening method and device
Technical field
The embodiments of the invention relate to the field of Internet technology, and in particular to a network information screening method and device.
Background
Every day a great variety of events take place around the world, and each event may develop further from day to day. On the Internet, the same event may be reported by different information providers, and the same report may in turn be reprinted by multiple providers. Simply by opening a browser, a user can find a virtually unlimited amount of information.
A user cannot browse such a mass of information item by item. In particular, screening out pages with identical content and selecting the pages worth reading costs the user a great deal of time and effort.
Summary of the invention
In view of this, an embodiment of the invention provides a network information screening method. With this method, information that reflects the development course of an event can be screened out of the mass of continuously emerging information and provided to the user, saving the user's time and improving the efficiency with which the user obtains information.
An embodiment of the invention also provides a network information screening device, which can likewise screen information that reflects the development course of an event out of the mass of continuously emerging information and provide it to the user, saving the user's time and improving the efficiency with which the user obtains information.
A network information screening method provided by an embodiment of the invention comprises:
analyzing the content of a webpage related to a first event to obtain information of a second event, wherein the information comprises a content summary and time information of the second event;
judging whether the stored information of all sub-events of the first event contains information identical to the information of the second event and, if not, storing the information of the second event as the information of a new sub-event of the first event;
selecting at least one sub-event from all stored sub-events of the first event;
sorting the information of the selected at least one sub-event according to the time information in the information, and then providing it to the user.
A network information screening device provided by an embodiment of the invention comprises:
an information extraction module, configured to analyze the content of a webpage related to a first event to obtain information of a second event, wherein the information comprises a content summary and time information of the second event;
an information storage module, configured to judge whether the stored information of all sub-events of the first event contains information identical to the information of the second event and, if not, to store the information of the second event as the information of a new sub-event of the first event;
an information selection module, configured to select at least one sub-event from all stored sub-events of the first event and to sort the information of the selected at least one sub-event according to the time information in the information;
an information output module, configured to provide the information supplied by the information selection module to the user.
As can be seen from the above technical solution, the network information screening method and device provided by the embodiments of the invention extract the time information and content summary of events from webpage content and compare them, store only the information of distinct events, arrange related events in chronological order to reflect the development course of the event, and provide them to the user. The user can thus understand the development of an event clearly and quickly, which saves the time the user spends searching for and screening information and improves the efficiency with which the user obtains information.
Description of the drawings
Fig. 1 is a flowchart of a network information extraction method according to an embodiment of the invention.
Fig. 2 is a flowchart of a network information providing method according to an embodiment of the invention.
Fig. 3 is a detailed flowchart of a network information extraction method according to an embodiment of the invention.
Fig. 4 is a detailed flowchart of a network information providing method according to an embodiment of the invention.
Fig. 5 is a schematic diagram of the display of sub-event information according to an embodiment of the invention.
Fig. 6 is a structural diagram of a network information screening device according to an embodiment of the invention.
Detailed description of the embodiments
To make the purpose, technical solutions and advantages of the embodiments of the invention clearer, the embodiments are described in further detail below with reference to the accompanying drawings.
Every day a large number of webpages reporting on different events are produced on the network. In the invention, presenting the development course of a given event clearly to the user relies on two processes: analyzing all webpages related to the event, extracting the valuable information in them and storing it; and selecting, from the stored information, the information the user needs and providing it to the user. The two processes are independent of each other and can run concurrently. Fig. 1 and Fig. 2 show the basic flow of these two processes respectively.
Fig. 1 is a flowchart of a network information extraction method according to an embodiment of the invention. As shown in Fig. 1, the method mainly comprises the following steps.
Step 101: analyze the content of a webpage related to a first event to obtain information of a second event, wherein the information comprises a content summary and time information of the second event.
Step 102: judge whether the stored information of all sub-events of the first event contains information identical to the information of the second event.
Step 103: if no identical information is stored, store the second event as a new sub-event of the first event, that is, store the information of the second event as the information of a new sub-event of the first event; if the information has already been stored, return to step 101 and extract information from another webpage.
The terms "first", "second" and so on above are used only to distinguish two objects of the same name and carry no other meaning; the same applies below.
A sub-event is an event in the development course of the first event; any reported event related to the first event is a sub-event of the first event. For example, if the 2008 Olympic Games is taken as an event, the opening ceremony on August 8, 2008 is a sub-event of that event, and each competition held on each day may also be a sub-event of it.
Through the above steps, the information of the sub-events of the first event is extracted.
The webpage related to the first event may be a webpage obtained by a web crawler, or a webpage obtained by searching for keywords of the first event with a search engine. When a web crawler is used, the webpages obtained may report on a variety of events; the webpages related to the first event can be identified by comparing the content summary extracted from each webpage with the stored summary of the first event. The comparison strategy can be decided by the technician according to actual needs. For example, the summary of an event may be at least one keyword, while a content summary may be a sentence; when matching, it can be judged whether all keywords of the first event are contained in the extracted content summary of the second event, and if so, the second event is judged to be related to the first event. The invention does not limit the specific matching method.
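The keyword-containment example above can be sketched as follows. This is only an illustration under assumptions: the function names, the event identifiers and the case-insensitive substring test are hypothetical and are not specified by the invention, which leaves the matching strategy open.

```python
def is_related(event_keywords, content_summary):
    """Return True when every keyword of the stored event occurs in the summary.

    Assumption: a case-insensitive substring test stands in for whatever
    matching strategy the technician actually chooses.
    """
    text = content_summary.lower()
    return bool(event_keywords) and all(kw.lower() in text for kw in event_keywords)


def find_related_event(content_summary, stored_events):
    """Return the id of the first stored event whose keywords all match, else None."""
    for event_id, keywords in stored_events.items():
        if is_related(keywords, content_summary):
            return event_id
    return None


# Hypothetical stored event summaries, each a list of keywords.
STORED_EVENTS = {
    "olympics-2008": ["Olympic", "Beijing"],
    "expo-2010": ["Expo", "Shanghai"],
}
```

A result of None corresponds to the case in which the extracted event is unrelated to every stored event.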
To ensure the reliability of the event information obtained, the obtained webpages can first be denoised to remove pages that contain only advertisements, personal comments and the like. Alternatively, a whitelist or a blacklist can be set so that the web crawler obtains webpages only from websites in the preset whitelist, or avoids webpages provided by websites in the blacklist. For example, when the method is used to extract news, the whitelist can be set to include authoritative news sources such as major news agencies and portal websites.
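A minimal sketch of the whitelist and blacklist check, assuming the lists hold host names. The host names and the function name are placeholders, and treating an empty whitelist as "no whitelist configured" is an assumption of this sketch rather than something the text prescribes.

```python
from urllib.parse import urlparse

# Hypothetical lists; real deployments would load these from configuration.
WHITELIST = {"news.example.com", "portal.example.org"}
BLACKLIST = {"ads.example.net"}


def should_fetch(url, whitelist=WHITELIST, blacklist=BLACKLIST):
    """Decide whether the crawler may fetch a URL, judged by its host name."""
    host = urlparse(url).hostname or ""
    if host in blacklist:
        return False
    # Assumption: an empty whitelist means only the blacklist applies.
    return not whitelist or host in whitelist
```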
After the content summary of the second event has been compared with the summary of each stored event, if the second event is judged to be unrelated to all stored events, a new event is created. The content summary of the second event is processed with a predetermined strategy to obtain a second summary, the second summary is saved as the summary of the new event, and the information of the second event is stored as the information of a sub-event of the new event. The method of deriving the second summary from the content summary of the second event can also be designed by the technician according to actual needs. For example, the content summary of the second event can be analyzed by part of speech and the modal particles, interjections, auxiliary words and the like removed; alternatively, only the nouns may be kept. The invention does not limit the strategy for generating event summaries.
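One possible way to derive an event summary from a content summary is sketched below. It assumes the content summary has already been segmented and part-of-speech tagged by some external tool; the tag letters ("n" for noun, "v" for verb, "d" for adverb, "y" for modal particle) and the function name are assumptions of this example.

```python
def derive_event_summary(tagged_tokens, keep_tags=("n",)):
    """Keep only the words whose part-of-speech tag is in keep_tags.

    tagged_tokens is a list of (word, tag) pairs; by default only nouns survive,
    which is one of the strategies the text mentions.
    """
    return [word for word, tag in tagged_tokens if tag in keep_tags]
```

Passing `keep_tags=("n", "v")` would keep verbs as well, which illustrates that the filtering strategy is a free design choice.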
The content summary and time information of the second event can be obtained from the webpage content as follows: the words representing time are identified in the body text of the webpage, and adjacent time words are merged to form the time information. The sentence containing the time words is then analyzed, and at least two keywords are extracted from it as the content summary. For example, the nouns, verbs, adjectives and the like in the sentence can be extracted to form a brief phrase that serves as the content summary.
Considering that the titles of some articles summarize their content well, it can also be judged whether the title of the webpage text contains the keywords extracted from the sentence; if it does, the title is used as the content summary.
To avoid storing duplicate event information, before the extracted second event is stored as a sub-event of the first event, it is first judged whether the stored information of all sub-events of the first event contains information identical to the information of the second event. Different strategies can be designed for deciding whether information is identical: only the content summaries or only the time information may be compared, or both may be compared. For example, in some cases an event is considered to have only one sub-event at any given time; the time information of the second event can then be compared with the stored time information of each sub-event of the first event, and if the time information of some sub-event of the first event is identical to that of the second event, the information of the second event is judged to have been stored already.
In addition to the time information and content summary, details of the event can also be extracted and saved, for example the first paragraph of the webpage text, the paragraph of the webpage text that contains the time words, and the address of the webpage.
When the information of a sub-event of the first event is judged to be identical to the information of the second event, the repetition count of that information can be recorded; that is, the repetition count of the sub-event is incremented by 1.
When the information of a sub-event of the first event is judged to be identical to the information of the second event, it can further be judged whether the content summary of the sub-event is derived from the title of a webpage text. If the content summary of the sub-event is derived from extracted sentence keywords while the content summary of the second event is the title of a webpage text, then, since a title often summarizes the full text well, the content summary of the sub-event can be replaced with that of the second event, so that it is derived from a title.
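The storage rules described above (duplicate check, repetition count, and replacement of a keyword-based summary by a title) can be sketched as follows. The sketch assumes the time-only comparison strategy, stores sub-events in a dictionary keyed by time, and starts the repetition count at 1; the class, the field names and the initial value are all assumptions of this example.

```python
from dataclasses import dataclass


@dataclass
class SubEvent:
    time: str            # normalized time information, e.g. "2008-08-08"
    summary: str         # content summary
    from_title: bool     # True if the summary is a webpage title
    repeat_count: int = 1  # assumption: the first report counts as 1


def store_sub_event(sub_events, time, summary, from_title):
    """Store a new sub-event or update an existing one; return True if new.

    sub_events maps time information to SubEvent, so two reports with the same
    time are treated as the same sub-event (the time-only strategy).
    """
    existing = sub_events.get(time)
    if existing is None:
        sub_events[time] = SubEvent(time, summary, from_title)
        return True
    existing.repeat_count += 1
    # Prefer a title over a keyword-derived summary.
    if from_title and not existing.from_title:
        existing.summary = summary
        existing.from_title = True
    return False
```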
The above is the information extraction process, the first of the two processes mentioned earlier.
Fig. 2 is a flowchart of a network information providing method according to an embodiment of the invention. As shown in Fig. 2, the method mainly comprises the following steps.
Step 201: select at least one sub-event from all stored sub-events of the first event.
Step 202: sort the information of the selected at least one sub-event according to the time information in the information.
Step 203: provide the sorted information to the user.
Through the above process, the required sub-events can be selected according to various strategies and provided to the user in chronological order, so that the development course of the event is presented clearly to the user.
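Steps 201 to 203 can be sketched as a small pipeline. The sketch assumes each sub-event is a dictionary whose "time" field is an ISO-style date string, so that string order equals chronological order; the function name and the pluggable `select` argument are hypothetical.

```python
def provide_sub_events(stored_sub_events, select):
    """Select sub-events with a caller-supplied strategy, then order them by time."""
    chosen = select(stored_sub_events)                  # step 201
    ordered = sorted(chosen, key=lambda s: s["time"])   # step 202
    return ordered                                      # step 203: handed to the display layer
```

Keeping the selection strategy as a parameter mirrors the text, which leaves the choice of strategy open.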
Step 201 may be triggered by the user or performed automatically. For example, when a keyword entered by the user is received through a page, the keyword can be compared with the summaries of the stored events to find the event that matches it, and the information of at least one sub-event of that event is then provided to the user. The matching can be performed by comparing the keyword entered by the user with the summary of each stored event and judging, according to a predetermined strategy, whether each summary matches the keyword, thereby determining the event to be provided to the user.
After the event is determined, when selecting the sub-event information to be provided to the user, all sub-events may be provided, or only some of them. Preferably, the number of sub-events provided can be determined according to the user's selection.
The number of sub-events provided to the user may be a preset number, or the total number of sub-events of the event multiplied by a preset ratio, or may be determined by some strategy from selection information entered by the user.
Preferably, at least two options can be offered to the user through a user interface such as a webpage. The ratio corresponding to the option selected by the user is determined from a preset correspondence between options and ratios, and the number of sub-events to provide (the first number) is obtained from that ratio and the total number of sub-events of the first event. For example, three options may be offered, corresponding to 10%, 50% and 100% respectively. After the user's selection is received, a number of sub-events are chosen from all sub-events of the event according to the selected ratio.
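The mapping from a user option to a number of sub-events can be sketched as follows. The option names are hypothetical, the percentages are the ones given in the example above, and rounding up (so that a non-empty event always yields at least one sub-event) is an assumption of this sketch.

```python
# Hypothetical option names mapped to the example percentages 10%, 50% and 100%.
OPTION_PERCENT = {"overview": 10, "standard": 50, "full": 100}


def sub_event_count(option, total):
    """Number of sub-events to provide for the chosen option.

    Integer arithmetic is used for the ceiling to avoid floating-point surprises.
    """
    percent = OPTION_PERCENT[option]
    return (total * percent + 99) // 100
```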
Sub-events can also be chosen in several ways. For example, at least one sub-event closest to the current time can be chosen according to the time information of the sub-events, or at least one sub-event can be selected from all stored sub-events of the first event according to their repetition counts and presented to the user. Preferably, since an event that is reported or reprinted more often is generally considered more important, when the repetition count of each sub-event has been saved, the sub-events with higher repetition counts can be selected and provided to the user, giving the user a rough understanding of the overall development of the event in a short time.
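The two selection strategies mentioned above can be sketched as follows, assuming each sub-event is a dictionary with an ISO-style "time" string and a "repeat_count" integer. Both field names and both function names are assumptions of this example.

```python
def most_repeated(sub_events, n):
    """Return the n sub-events with the highest repetition counts."""
    return sorted(sub_events, key=lambda s: s["repeat_count"], reverse=True)[:n]


def most_recent(sub_events, n):
    """Return the n sub-events whose time is closest to now (assumes past events)."""
    return sorted(sub_events, key=lambda s: s["time"], reverse=True)[:n]
```

Either result would still be put in chronological order before being shown to the user.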
To present the overall development of the event clearly, only the time information and content summary of each sub-event may be provided to the user. Since the user may want to know more about an individual sub-event, the stored details of the event (for example the first paragraph of the webpage text, the paragraph that contains the time words, and the address of the webpage) can also be provided as the user requires. For example, when the user moves the mouse over a sub-event, a floating window can display these details; after reading the key paragraph shown in the floating window, the user can visit the displayed webpage address to learn about the sub-event in more detail.
Preferably, to further help the user understand aspects such as the influence of the event, some statistics on the event can be computed. For example, a first index of the first event is obtained from the number of sub-events of the first event and the numbers of sub-events of the other stored events; a second index of the first event is obtained from the current time and the time information of the sub-event of the first event whose time information is farthest from the current time. It is generally considered that the more related events an event has, the greater its influence relative to other events, so the first index can be regarded as characterizing the influence of the first event among all current events. An event closer to the current time is fresher, so the second index can be regarded as characterizing the freshness of the first event. Therefore, when the sub-event information of an event is provided to the user, statistics on the event such as the first index and the second index can be provided at the same time; from these data the user can judge whether the event is influential and whether it is fresh, and so decide whether to look into it in detail.
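The text does not give formulas for the two indices, so the following is only one possible sketch. It assumes the first index is the event's sub-event count relative to the largest count among all stored events, and the second index decays with the age in days of the sub-event farthest from the current time; both formulas and both function names are assumptions of this example.

```python
from datetime import date


def influence_index(event_count, all_counts):
    """First index: sub-event count relative to the largest count of any stored event.

    all_counts is assumed to include the count of the event itself.
    """
    top = max(all_counts)
    return event_count / top if top else 0.0


def freshness_index(sub_event_dates, today):
    """Second index: 1 / (1 + age in days of the sub-event farthest from today)."""
    farthest_days = max(abs((today - d).days) for d in sub_event_dates)
    return 1.0 / (1.0 + farthest_days)
```

Under these assumptions both indices fall between 0 and 1, with larger values meaning more influential or fresher.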
In addition, the correspondence between the repetition count and the time information of each sub-event of the first event can be provided to the user, for example as a chart whose horizontal and vertical axes represent time and repetition count respectively, with the time information and repetition count of each sub-event plotted on it. From this the user can see at which points in time the important sub-events occurred and choose the sub-event information of those points in time to read.
A concrete example is given below to describe each of the two processes in detail.
Fig. 3 is a detailed flowchart of a network information extraction method according to an embodiment of the invention. This embodiment takes the extraction of Internet news information as an example, to help the technician understand the method of the invention. As shown in Fig. 3, the method may comprise the following steps.
Step 301: obtain webpages using web crawler technology.
Because the target of extraction is news information, a website whitelist can be set in which the addresses of the websites of major news media, portal websites and the like are listed, so that the web crawler crawls webpages only from websites in the whitelist, ensuring the authority and reliability of the news information obtained.
The web crawler can run continuously in the background on a server, constantly obtaining the latest webpages.
Step 302: analyze the body text of the webpage.
In this step, the specific analysis method can be set by the technician according to the circumstances. For example, the first three paragraphs of the body text can be extracted, and word segmentation and part-of-speech tagging performed on them, marking for example the time words, nouns, verbs and adjectives.
The first three paragraphs are extracted because the first paragraph or first few paragraphs of a report generally summarize the gist of the whole text; this is determined by the characteristics of the object from which this embodiment extracts information, namely news reports. The technician can select suitable rules and strategies according to the characteristics of the object from which information is extracted.
Step 303: extract time information from the body text of the webpage.
For example, adjacent words tagged as time words can be merged: the segmentation result "May/t 6th/t afternoon/t" is merged into a single piece of time information, "the afternoon of May 6". This is only an illustration; the time information may also include the year, hour, minute and so on, and may also be an expression of a span of time or a relative time, such as "January to June" or "three days ago". In addition, the extracted time information can be converted into a standard time format, for example yyyy-mm-dd (year-month-day).
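Merging adjacent time words can be sketched as follows, assuming the segmenter returns (word, tag) pairs and marks time words with the tag "t" as in the example above. The function name and the space used to join the words are assumptions of this sketch; conversion to a standard date format is not shown.

```python
def merge_time_tokens(tagged_tokens, time_tag="t"):
    """Merge each run of adjacent time-tagged words into one time expression."""
    merged, run = [], []
    for word, tag in tagged_tokens:
        if tag == time_tag:
            run.append(word)
        elif run:
            # A non-time word ends the current run of time words.
            merged.append(" ".join(run))
            run = []
    if run:
        merged.append(" ".join(run))
    return merged
```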
Step 304: extract the content summary from the body text of the webpage.
For example, the core words (such as verbs, nouns and adjectives) of the sentence containing the time tags are extracted and formed into a short sentence. The core words of the title are then extracted according to a preset strategy. If the core words of the title all appear in that short sentence, the title is considered to summarize well the event corresponding to the time words (hereinafter called event A), and the title is used as the content summary of event A. If the core words of the title do not all appear in the sentence, the sentence is used as the content summary of event A.
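The choice between the title and the short sentence can be sketched as follows, assuming the core words of both have already been extracted. The function name and the returned flag (which records whether the summary came from the title) are assumptions of this example.

```python
def choose_summary(title, title_core_words, sentence_core_words):
    """Return (content_summary, from_title) for event A.

    The title is used only when all of its core words appear among the core
    words of the sentence that contains the time words.
    """
    sentence_words = set(sentence_core_words)
    if title_core_words and all(w in sentence_words for w in title_core_words):
        return title, True
    return " ".join(sentence_core_words), False
```

The flag is useful later, when a stored keyword-based summary may be replaced by a title-based one.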
Step 305: judge, according to the extracted event information, whether a related event has been stored.
In this step it must be judged whether event A is a new event unrelated to any previously stored event. Specifically, the content summary of event A can be compared with the summary of each stored event to look for a matching event. If no matching event is found, event A is judged to be a new event and step 306 is executed; if a matching event is found, event A is judged to be a sub-event of that event (hereinafter called event B), that is, a follow-up related event, and step 307 is executed.
Step 306: create a new event. Event A is stored as a sub-event of the new event, and keywords extracted from the content summary of event A are used as the summary of the new event. Existing keyword extraction methods can be used and are not described in detail here.
Step 307: judge, according to the extracted event information, whether a related sub-event has been stored.
This step judges whether event A is identical to an existing sub-event of event B, that is, whether a report of the same event A has previously been analyzed, extracted and stored as a sub-event of event B. The information of event A can be compared with the information of each sub-event of event B. If it is assumed that an event has only one follow-up development at any given point in time, only the time information of event A need be compared with that of each sub-event of event B, that is, event B is searched for a sub-event whose time information is identical to that of event A.
If no sub-event with identical time information is found, step 308 is executed; if a sub-event with identical time information is found, step 309 is executed.
Step 308: create a new sub-event entry for event B and store the information of event A in it.
Each stored sub-event can be indexed by its time information, which makes lookup and sorting convenient.
A sub-event entry can hold not only the time information and content summary but also the first paragraph or first few paragraphs of the body text as a news brief, the address of the source page of the news report, and the repetition count of the sub-event, which characterizes the importance of the event at that point in time.
Step 309: update the sub-event information.
The update may include incrementing the repetition count of the sub-event by 1. It may also include judging whether the stored content summary of the sub-event is derived from a title or from body-text keywords; if it is derived from body-text keywords and the content summary of event A is derived from a title, the content summary of the sub-event is replaced with that of event A.
At this point the extraction and storage of information from this webpage is complete, and the process can return to step 301 to extract information from another webpage obtained by the web crawler.
Through the above steps, the information of the major news media's reports on each event is stored on the server by event and by time. The extraction process also screens the reports by time and discards duplicates, keeping only one piece of event information per point in time, so that the development thread of the event is well organized, ready to be provided to the user.
Fig. 4 is a detailed flowchart of a network information providing method according to an embodiment of the invention.
Step 401: receive a keyword entered by the user.
In this embodiment, a dedicated information-providing interface can be offered to the user, which may be a webpage or a client program. The method of the invention can also be combined with other network services, for example with a search engine, to provide an information screening service for search engine users.
Step 402: determine the event according to the keyword entered by the user, and provide the information of a preset number of sub-events to the user.
In this embodiment, the information of a few sub-events, for example 5, can be provided first. The sub-events that occurred most recently can be selected according to their time information, or the sub-events with higher repetition counts can be selected. The time information and content summary of the sub-events can be displayed on a webpage.
Step 403: provide the statistical information of the event to the user.
The statistical information of the event may include an influence index, a freshness index, and a time versus repetition-count distribution chart of the sub-events.
The influence index characterizes the overall influence of the event and can be expressed as a score. It can be obtained by evaluating the number of sub-events of the event; for example, the more sub-events, the higher the score.
The freshness index reflects the freshness of the event and can be expressed as a score. It can be obtained by evaluating the time information of the sub-events of the event. For example, take the sub-event whose time information is farthest from the current time: the closer its time is to the current time, the higher the freshness of the event.
The time versus repetition-count distribution chart of the sub-events takes time and repetition count as the horizontal and vertical axes respectively and plots the time information and repetition count of each sub-event, showing how the event developed at each stage. From this chart the user can find a point in time of interest and read about the sub-events that occurred then.
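The data behind such a chart can be prepared as two aligned series, as in the sketch below. It assumes each sub-event is a dictionary with an ISO-style "time" string and a "repeat_count" integer; the field names and the function name are assumptions, and the actual drawing is left to whatever charting component the interface uses.

```python
def time_repetition_series(sub_events):
    """Return (times, counts) sorted by time, ready for the chart's two axes."""
    points = sorted((s["time"], s["repeat_count"]) for s in sub_events)
    times = [t for t, _ in points]
    counts = [c for _, c in points]
    return times, counts
```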
This statistical information can be provided to the user proactively, or provided after an instruction from the user is received.
For example, an option is provided, which may be an icon or a text prompt; when the user clicks the option, the above information is displayed in a pop-up window, or when the user moves the mouse over the option, it is displayed in a floating window.
Step 404: provide the details of a sub-event to the user.
After the time information and content summary of a sub-event have been provided, the user may want to know more about that sub-event. When the user moves the mouse over the sub-event or clicks it, a pop-up window or floating window can display the stored news brief of the sub-event (that is, the first paragraph or first few paragraphs of the original webpage text), the address of the source page of the news report, the repetition count of the sub-event, and so on.
Step 405: change the number of sub-events provided according to the user's instruction.
Some users prefer a global view, surveying the whole event and reading only the most important sub-events in its development; other users need to understand the event in depth and may read all sub-events related to it.
Therefore, according to an embodiment of the invention, a function for switching between a global view and an in-depth view is provided. It can be implemented by offering the user several icons or text prompts and changing the number of sub-events provided according to the icon or text the user selects.
When there are many sub-events to display, the information of some of them can be displayed and the rest provided through a link; by clicking the link the user can access the information of the other sub-events.
It should be noted that in the above embodiments the execution order of the steps can be adjusted according to actual conditions, some steps can be performed simultaneously, and in some cases some steps can be omitted as required.
Fig. 5 is a schematic diagram of the display of sub-event information according to an embodiment of the invention. In the figure, 501 is the time information of a sub-event; 502 is the summary of a sub-event; 504 is the hover window shown when the user moves the mouse over sub-event 4; 503 is the brief of sub-event 4, that is, the first paragraph or first few paragraphs of the original text; 505 is the address link of the source page of sub-event 4; 506 is the importance index of sub-event 4, which can be obtained by evaluating the repetition count; 507 is the link to the event statistics; 513 is the hover window shown when the user clicks 507; 509 is the influence index of the event; 510 is the freshness index of the event; 511 is the chart relating the time information and repetition count of each sub-event of the event; 508 comprises a plurality of icons, and the user can click different icons to change the number of sub-events accordingly; 512 is the link for showing the remaining sub-events. For example, if 20 sub-event entries are provided to the user but only 5 are shown because page space is limited, the user can access the remaining 15 entries by clicking 512.
The above information can also be presented to the user in other forms; Fig. 5 is only an example.
In addition, a function for searching sub-events by time range can be provided to the user, for example offering the sub-events of the event that occurred in the past week, on a certain day, or before a certain day. In this case, it is only necessary to search the sub-events of the event for those whose time information satisfies the condition given by the user and to provide them to the user.
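The time-range lookup described here amounts to a filter over the stored sub-events. A minimal sketch follows, assuming each sub-event is a dictionary with a `date` key; the function name `find_subevents` and the record layout are hypothetical and not taken from the patent.

```python
from datetime import date, timedelta


def find_subevents(subevents, start=None, end=None):
    """Return the sub-events dated within [start, end], sorted by date.

    Either bound may be None, which leaves that side of the range open.
    """
    hits = [
        s for s in subevents
        if (start is None or s["date"] >= start)
        and (end is None or s["date"] <= end)
    ]
    return sorted(hits, key=lambda s: s["date"])
```

Under these assumptions, "the past week" is `find_subevents(subs, start=today - timedelta(days=7))`, "on a certain day" passes the same day as both `start` and `end`, and "before a certain day" passes `end=day - timedelta(days=1)`.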
Based on the stored information of each event and its sub-events, other information retrieval functions can also be provided; they are not enumerated here.
The present invention also provides a network information screening device.
Fig. 6 is a structural diagram of the network information screening device of an embodiment of the invention. As shown in the figure, the device mainly comprises an information extraction module 601, an information storage module 602, an information selection module 603 and an information output module 604.
The information extraction module 601 is configured to analyze the content of a web page related to a first event to obtain information of a second event, wherein the information comprises the summary and the time information of the second event.
The information storage module 602 is configured to judge whether the stored information of all sub-events of the first event contains information identical to the information of the second event and, if not, to store the information of the second event as the information of a new sub-event of the first event.
The information selection module 603 is configured to select at least one sub-event from all stored sub-events of the first event, sort the information of the selected sub-events according to the time information in that information, and provide it to the information output module 604.
The information output module 604 is configured to provide the user with the information supplied by the information selection module 603.
According to an embodiment of the invention, the device may further comprise a web page acquisition module configured to obtain the web page using a web crawler.
The information extraction module 601 may comprise:
a time extraction unit, configured to determine words representing time from the body content of the web page to constitute the time information;
a keyword extraction unit, configured to extract at least two sentence keywords, as the summary, from the sentence containing the words representing time;
a summary determination unit, configured to judge whether the title of the web page text contains the sentence keywords and, if so, to use the title as the summary.
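The three units above can be sketched together as one extraction function. This is only an illustrative approximation: the ISO-style date pattern, the English stop-word list, the "two longest words" keyword heuristic and all names are assumptions introduced here, since the text does not specify how time words or sentence keywords are recognized.

```python
import re

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")  # assumed form of a "word representing time"
STOPWORDS = {"the", "a", "an", "on", "in", "of", "and", "was", "were", "to", "by"}


def extract_event_info(title, body):
    """Return (time_info, summary, summary_source) for the first dated sentence.

    summary_source is "title" when the title contains every sentence keyword,
    otherwise "keywords". Returns None when no usable sentence is found.
    """
    for sentence in re.split(r"(?<=[.!?])\s+", body):
        match = DATE_RE.search(sentence)
        if not match:
            continue
        words = re.findall(r"[A-Za-z]+", sentence.lower())
        candidates = {w for w in words if w not in STOPWORDS}
        # Crude stand-in for keyword extraction: keep the two longest words.
        keywords = sorted(candidates, key=lambda w: (-len(w), w))[:2]
        if len(keywords) < 2:
            continue
        if all(k in title.lower() for k in keywords):
            return match.group(), title, "title"
        return match.group(), " ".join(keywords), "keywords"
    return None
```

Returning the source of the summary alongside the summary itself keeps enough information for a later step to prefer a title-derived summary over a keyword-derived one.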
The information storage module 602 can compare the summary of the second event extracted from the web page with the summary of each stored event. When it judges that the second event is unrelated to all stored events, it creates a new event, processes the summary of the second event with a predetermined strategy to obtain a second summary, saves the second summary as the summary of the new event, and stores the information of the second event as the information of a sub-event of the new event.
The information storage module 602 can comprise a time comparison unit, configured to compare the time information of the second event with the time information of at least one stored sub-event of the first event and to judge whether identical time information has already been stored for the first event; if so, it judges that identical information has already been stored for the first event.
The information storage module 602 can also store the details of the second event, wherein the details comprise at least one of the following: the content of the first paragraph of the web page text; the content of the paragraph in the web page text that contains the words representing time; and the address of the web page. The information selection module 603 then provides the stored details of the second event to the information output module 604 upon receiving an instruction from the user to query the second event.
The information storage module 602 can also be configured so that, when the summary in the information derives from a title and it is judged that information identical to the information has already been stored, it judges whether the summary in the stored information derives from sentence keywords and, if so, replaces the stored summary derived from sentence keywords with the summary derived from the title.
The information storage module 602 can also store the repetition count of each sub-event; if it judges that the information has already been stored, it increments by one the repetition count of the sub-event corresponding to the stored information.
Information Selection module 603 can be selected at least one subevent according to the temporal information or the multiplicity of subevent.The number of subevent can be a number preset in the Information Selection module 603, perhaps the number that obtains of the information of Information Selection module 603 analysis user input.For example, Information Selection module 603 can offer at least two options of user; Option according to preset is confirmed the corresponding ratio of option that the user selects with the corresponding relation of different proportion, draws the number of the subevent of needs selection according to the total quantity of the subevent of said ratio and said first incident.
The information selection module 603 can also provide the repetition count of the at least one selected sub-event to the information output module 604.
The information selection module 603 can derive a first index of the first event from the number of sub-events of the first event and the numbers of sub-events of the other stored events; derive a second index of the first event from the current time and the time information of the sub-event in the first event whose time information is earliest relative to the current time; extract the correspondence between the repetition count and the time information of each sub-event in the first event; and provide the first index, the second index and the correspondence to the information output module 604.
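One possible reading of the first index is a relative size measure: how many sub-events this event has compared with the other stored events. The normalization below (divide by the largest sub-event count among all events) is purely an assumption for illustration; the text does not give a formula.

```python
def influence_index(event_count, other_counts):
    """Return the event's sub-event count relative to the largest stored event.

    The result lies in [0, 1]; 1.0 means no stored event has more sub-events.
    This normalization is an assumed example, not the patent's formula.
    """
    largest = max([event_count, *other_counts])
    return event_count / largest if largest else 0.0
```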
In summary, the above are only some embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, improvement and the like made within the scope of the present invention shall fall within its scope of protection.

Claims (18)

1. A network information screening method, characterized by comprising:
analyzing the content of a web page related to a first event to obtain information of a second event, wherein the information comprises a summary and time information of the second event;
judging whether the stored information of all sub-events of the first event contains information identical to the information of the second event and, if not, storing the information of the second event as the information of a new sub-event of the first event;
selecting at least one sub-event from all stored sub-events of the first event;
sorting the information of the at least one selected sub-event according to the time information in the information and then providing it to a user.
2. The method according to claim 1, characterized in that
judging whether the stored information of all sub-events of the first event contains information identical to the information of the second event comprises: comparing the time information of the second event with the time information of at least one stored sub-event of the first event, and judging whether identical time information has already been stored for the first event; if so, judging that the information has already been stored.
3. The method according to claim 1, characterized by further comprising:
storing the repetition count of each sub-event;
if it is judged that the information has already been stored, incrementing by one the repetition count of the sub-event corresponding to the stored information;
selecting, according to the repetition counts of all the sub-events, at least one sub-event from all stored sub-events of the first event to present to the user.
4. The method according to claim 3, characterized in that selecting at least one sub-event from all stored sub-events of the first event comprises:
selecting, from all stored sub-events of the first event, a first number of sub-events having the highest repetition counts;
wherein the first number is obtained by analyzing information input by the user; or by providing the user with at least two options, determining the ratio corresponding to the option selected by the user according to a preset correspondence between the at least two options and different ratios, and deriving the first number from the ratio and the total number of sub-events of the first event.
5. The method according to claim 1, characterized by further comprising:
storing the repetition count of each sub-event;
if it is judged that identical information has already been stored, incrementing by one the repetition count of the sub-event corresponding to the stored information;
when presenting the sub-event, presenting the repetition count of the sub-event to the user.
6. The method according to claim 1, characterized by further comprising:
obtaining the web page using a web crawler;
comparing the summary of the second event with the summary of each stored event;
when it is judged that the second event is unrelated to all stored events, creating a new event, processing the summary of the second event with a predetermined strategy to obtain a second summary, saving the second summary as the summary of the new event, and storing the information of the second event as the information of a sub-event of the new event.
7. The method according to claim 1, characterized in that analyzing the content of the web page related to the first event to obtain the information of the second event comprises:
determining words representing time from the body content of the web page to constitute the time information;
extracting at least two sentence keywords, as the summary, from the sentence containing the words representing time;
the method further comprising:
judging whether the title of the web page text contains the sentence keywords and, if so, using the title as the summary;
when the summary in the information derives from a title and it is judged that information identical to the information has already been stored, judging whether the summary in the stored information derives from sentence keywords and, if so, replacing the stored summary derived from sentence keywords with the summary derived from the title.
8. The method according to claim 1, characterized in that storing the information of the second event further comprises storing details of the second event; and, when an instruction from the user to query the second event is received, the stored details of the second event are provided to the user;
wherein the details comprise at least one of the following:
the content of the first paragraph of the web page text;
the content of the paragraph in the web page text that contains the words representing time; and
the address of the web page.
9. The method according to claim 1, characterized by further comprising providing the user with at least one of the following:
a first index of the first event, derived from the number of sub-events of the first event and the numbers of sub-events of the other stored events;
a second index of the first event, derived from the current time and the time information of the sub-event in the first event whose time information is earliest relative to the current time;
the correspondence between the repetition count and the time information of each sub-event in the first event.
10. A network information screening device, characterized by comprising:
an information extraction module, configured to analyze the content of a web page related to a first event to obtain information of a second event, wherein the information comprises a summary and time information of the second event;
an information storage module, configured to judge whether the stored information of all sub-events of the first event contains information identical to the information of the second event and, if not, to store the information of the second event as the information of a new sub-event of the first event;
an information selection module, configured to select at least one sub-event from all stored sub-events of the first event and to sort the information of the at least one selected sub-event according to the time information in the information;
an information output module, configured to provide the user with the sorted information supplied by the information selection module.
11. The device according to claim 10, characterized in that
the information storage module is further configured to store the repetition count of each sub-event; to compare the time information of the second event with the time information of at least one stored sub-event of the first event and judge whether identical time information has already been stored for the first event; and, if so, to judge that the information has already been stored for the first event and increment by one the repetition count of the sub-event corresponding to the stored information;
the information selection module is configured to select, according to the repetition counts of all the sub-events, at least one sub-event from all sub-events of the first event stored by the information storage module.
12. The device according to claim 11, characterized in that
the information selection module is configured to select, from all sub-events of the first event stored by the information storage module, a first number of sub-events having the highest repetition counts;
wherein the first number is obtained by the information selection module analyzing information input by the user; or by the information selection module providing the user with at least two options, determining the ratio corresponding to the option selected by the user according to a preset correspondence between the at least two options and different ratios, and deriving the first number from the ratio and the total number of sub-events of the first event.
13. The device according to claim 10, characterized in that
the information storage module is further configured to store the repetition count of each sub-event and, if it judges that identical information has already been stored, to increment by one the repetition count of the sub-event corresponding to the stored information;
the information selection module is further configured to provide the repetition count of the at least one selected sub-event to the information output module.
14. The device according to claim 10, characterized by further comprising:
a web page acquisition module, configured to obtain the web page using a web crawler;
wherein the information storage module is configured to compare the summary of the second event with the summary of each stored event and, when it judges that the second event is unrelated to all stored events, to create a new event, process the summary of the second event with a predetermined strategy to obtain a second summary, save the second summary as the summary of the new event, and store the information of the second event as the information of a sub-event of the new event.
15. The device according to claim 10, characterized in that the information extraction module comprises:
a time extraction unit, configured to determine words representing time from the body content of the web page to constitute the time information;
a keyword extraction unit, configured to extract at least two sentence keywords, as the summary, from the sentence containing the words representing time.
16. The device according to claim 15, characterized in that the information extraction module further comprises:
a summary determination unit, configured to judge whether the title of the web page text contains the sentence keywords and, if so, to use the title as the summary;
the information storage module is further configured so that, when the summary in the information derives from a title and it is judged that information identical to the information has already been stored, it judges whether the summary in the stored information derives from sentence keywords and, if so, replaces the stored summary derived from sentence keywords with the summary derived from the title.
17. The device according to claim 10, characterized in that the information storage module is further configured to store details of the second event, wherein the details comprise at least one of the following: the content of the first paragraph of the web page text; the content of the paragraph in the web page text that contains the words representing time; and the address of the web page;
the information selection module is configured to provide the stored details of the second event to the information output module upon receiving an instruction from the user to query the second event.
18. The device according to claim 10, characterized in that the information selection module is further configured to provide the information output module with at least one of the following:
a first index of the first event, derived from the number of sub-events of the first event and the numbers of sub-events of the other stored events;
a second index of the first event, derived from the current time and the time information of the sub-event in the first event whose time information is earliest relative to the current time;
the correspondence between the repetition count and the time information of each sub-event in the first event.
CN2010102894956A 2010-09-19 2010-09-19 Network information screening method and device Pending CN102402563A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102894956A CN102402563A (en) 2010-09-19 2010-09-19 Network information screening method and device

Publications (1)

Publication Number Publication Date
CN102402563A true CN102402563A (en) 2012-04-04

Family

ID=45884774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102894956A Pending CN102402563A (en) 2010-09-19 2010-09-19 Network information screening method and device

Country Status (1)

Country Link
CN (1) CN102402563A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20120404