CN102831248B - Network focus method for digging and device - Google Patents

Publication number
CN102831248B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210346827.9A
Other languages
Chinese (zh)
Other versions
CN102831248A (en)
Inventor
林英杰
马良
陈强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201210346827.9A (patent CN102831248B)
Priority to CN201610225018.0A (patent CN105912670A)
Publication of CN102831248A
Application granted
Publication of CN102831248B
Legal status: Expired - Fee Related

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation

Abstract

The invention discloses a network focus (hot topic) mining method and device. The device comprises: a classification storage module, adapted to collect network data, classify it, and store it by category; a filtering and extraction module, adapted to filter the network data in each category according to preset filtering rules and to extract centre words from the filtered network data in each category; a sorting and combination module, adapted to sort the centre words extracted from the same piece of network data and combine the sorted centre words, obtaining the center phrases of each piece of network data in each category; and a focus statistics module, adapted to count the occurrences of each center phrase within its category, obtain the network focus phrases of each category, and display them by category. With this technical scheme, network focuses can be mined from a more macroscopic perspective, so that the mining results better reflect the objective state of Internet public opinion and reflect the focuses of a particular field in a more targeted way.

Description

Network focus mining method and device
Technical field
The present invention relates to the field of Internet communication, and in particular to a network focus mining method and device.
Background art
With the development of the Internet, more and more websites have introduced user-generated content (User Generated Content, UGC) features. Large numbers of netizens pour into forums, blogs, and microblogs to post their opinions and disclose all kinds of news, and thousands of topics are generated on the Internet every day. Obtaining network focuses quickly from this mass of Internet information plays a guiding role in understanding social developments and grasping trends in public opinion.
At present, the focus mining method generally used in the prior art computes a heat value for each text by a weighted calculation, under predetermined conditions, over the text's forwarding count, click count, and reply count within a specific time period, and obtains the hottest texts by sorting on heat value. However, this technical scheme has the following problems: 1. Because statistics are computed only on the attributes of a single text, the hot topics obtained reflect only the popularity of a particular article at the micro level and cannot reflect the popularity of a topic of netizen attention at the macro level. 2. Because the statistical sample is the full data set, with no statistical analysis based on text content, the results are not targeted and cannot reflect, field by field, the focuses of each field. 3. The prior-art scheme can only count texts with identical features and identical content, so the results are highly repetitive and poorly readable.
Summary of the invention
The invention provides a network focus mining method and device, to solve the problems in the prior art that network focus mining results are not macroscopic, cannot reflect the focuses of each field separately, and are highly repetitive and poorly readable.
The invention provides a network focus mining method, comprising: collecting network data, classifying the network data, and storing it by category; filtering the network data in each category according to preset filtering rules, and extracting centre words from the filtered network data in each category; sorting the centre words extracted from the same piece of network data and combining the sorted centre words, to obtain the center phrases of each piece of network data in each category; and counting the occurrences of each center phrase within its category, to obtain the network focus phrases of each category and display them by category.
Optionally, the network data comprises: a text title, the article content corresponding to the text title, and the text attributes corresponding to the text title.
Optionally, the text attributes further comprise at least one of the following: the Uniform Resource Locator (URL) of the text, the source forum or blog of the text, the source board of the text, the publication time of the text, the author of the text, the reply count of the text, and the view count of the text.
Optionally, classifying the network data and storing it by category further comprises: using automatic text classification to classify the network data according to the article content, obtaining a classification label for the network data, and storing the corresponding text title, classification label, and text attributes in an engine; and, at predetermined intervals, collecting network data from the engine once and storing the collected network data, according to classification label, in different XML files on a designated server.
Optionally, the filtering rules further comprise at least one of the following: deleting network data whose text title does not meet a predetermined character count; deleting network data whose publication time does not meet the requirement; deleting network data whose URL contains a predetermined domain name, where the predetermined domain name is a domain name in a preset domain-name blacklist, or alternatively retaining network data whose URL contains a predetermined domain name; deleting network data whose source board is a predetermined board, where the predetermined board is a board in a preset board blacklist, or alternatively retaining network data whose source board is a predetermined board; deleting network data whose source does not meet the requirement, where the source comprises forums, blogs, or all posts; deleting network data whose reply count does not meet the requirement; deleting network data whose view count does not meet the requirement; deleting network data whose author does not meet the requirement; and deduplicating the network data.
Optionally, before word segmentation is used to extract centre words from the filtered network data in each category, the method further comprises: filtering prefixes out of the text title according to a preset prefix dictionary.
Optionally, using word segmentation to extract centre words from the filtered network data in each category further comprises: segmenting the filtered text titles in each category, obtaining the segmentation result, and using the segmentation result as the centre words.
Optionally, before the centre words extracted from the same piece of network data are sorted, the method further comprises: filtering common words out of the extracted centre words according to a preset common-word dictionary.
Optionally, combining the sorted centre words of the same piece of network data further comprises: combining the sorted centre words belonging to the same text title according to the combination formula $C_n^r = \frac{n!}{r!\,(n-r)!}$, where n is the total number of centre words belonging to the same text title, r ≤ n, and 2 ≤ r ≤ 5.
Optionally, after the sorted centre words of the same piece of network data are combined to obtain the center phrases of each piece of network data in each category, the method further comprises: filtering junk phrases out of the center phrases according to a preset junk dictionary.
Optionally, counting the occurrences of each center phrase within its category to obtain the network focus phrases of each category further comprises: counting the number of different text titles within the category in which the center phrase occurs, and arranging the center phrases whose occurrence count exceeds a predetermined threshold in a predetermined order, to obtain the network focus phrases of each category.
Optionally, after the network focus phrases of each category are obtained, the method further comprises: merging identical network focus phrases within the same category; calculating the heat value of each network focus phrase in each category; and searching for links to the focus event corresponding to each network focus phrase in each category.
Optionally, displaying by category further comprises: presenting a focus report to the user, where the focus report comprises: the category of each network focus phrase, the network focus phrases of each category within a predetermined time period, the heat value of each network focus phrase in each category, and the links to the focus event corresponding to each network focus phrase in each category. The predetermined time period comprises at least one of the following: hourly, daily, weekly, and monthly.
The present invention also provides a network focus mining device, comprising: a classification storage module, adapted to collect network data, classify it, and store it by category; a filtering and extraction module, adapted to filter the network data in each category according to preset filtering rules and to extract centre words from the filtered network data in each category; a sorting and combination module, adapted to sort the centre words extracted from the same piece of network data and combine the sorted centre words, obtaining the center phrases of each piece of network data in each category; and a focus statistics module, adapted to count the occurrences of each center phrase within its category, obtain the network focus phrases of each category, and display them by category.
Optionally, the network data comprises: a text title, the article content corresponding to the text title, and the text attributes corresponding to the text title.
Optionally, the text attributes further comprise at least one of the following: the Uniform Resource Locator (URL) of the text, the source forum or blog of the text, the source board of the text, the publication time of the text, the author of the text, the reply count of the text, and the view count of the text.
Optionally, the classification storage module is further adapted to: use automatic text classification to classify the network data according to the article content, obtain a classification label for the network data, and store the corresponding text title, classification label, and text attributes in an engine; and, at predetermined intervals, collect network data from the engine once and store the collected network data, according to classification label, in different XML files on a designated server.
Optionally, the filtering rules further comprise at least one of the following: deleting network data whose text title does not meet a predetermined character count; deleting network data whose publication time does not meet the requirement; deleting network data whose URL contains a predetermined domain name, where the predetermined domain name is a domain name in a preset domain-name blacklist, or alternatively retaining network data whose URL contains a predetermined domain name; deleting network data whose source board is a predetermined board, where the predetermined board is a board in a preset board blacklist, or alternatively retaining network data whose source board is a predetermined board; deleting network data whose source does not meet the requirement, where the source comprises forums, blogs, or all posts; deleting network data whose reply count does not meet the requirement; deleting network data whose view count does not meet the requirement; deleting network data whose author does not meet the requirement; and deduplicating the network data.
Optionally, the filtering and extraction module is further adapted to: before word segmentation is used to extract centre words from the filtered network data in each category, filter prefixes out of the text title according to a preset prefix dictionary.
Optionally, the filtering and extraction module is further adapted to: segment the filtered text titles in each category, obtain the segmentation result, and use the segmentation result as the centre words.
Optionally, the sorting and combination module is further adapted to: before the centre words extracted from the same piece of network data are sorted, filter common words out of the extracted centre words according to a preset common-word dictionary.
Optionally, the sorting and combination module is further adapted to: combine the sorted centre words belonging to the same text title according to the combination formula $C_n^r = \frac{n!}{r!\,(n-r)!}$, where n is the total number of centre words belonging to the same text title, r ≤ n, and 2 ≤ r ≤ 5.
Optionally, the sorting and combination module is further adapted to: after the sorted centre words of the same piece of network data are combined to obtain the center phrases of each piece of network data in each category, filter junk phrases out of the center phrases according to a preset junk dictionary.
Optionally, the focus statistics module is further adapted to: count the number of different text titles within the category in which the center phrase occurs, and arrange the center phrases whose occurrence count exceeds a predetermined threshold in a predetermined order, to obtain the network focus phrases of each category.
Optionally, the focus statistics module is further adapted to: merge identical network focus phrases within the same category; calculate the heat value of each network focus phrase in each category; and search for links to the focus event corresponding to each network focus phrase in each category.
Optionally, the focus statistics module is further adapted to: present a focus report to the user, where the focus report comprises: the category of each network focus phrase, the network focus phrases of each category within a predetermined time period, the heat value of each network focus phrase in each category, and the links to the focus event corresponding to each network focus phrase in each category. The predetermined time period comprises at least one of the following: hourly, daily, weekly, and monthly.
The beneficial effects of the present invention are as follows:
Focus mining is implemented using the hot-word computation principle, and text classification is combined with focus mining. This solves the problems in the prior art that network focus mining results are not macroscopic, cannot reflect the focuses of each field separately, and are highly repetitive and poorly readable. The invention can mine network focuses from a more macroscopic perspective, reflect at the macro level the popularity of a topic of netizen attention, make the mining results better reflect the objective state of Internet public opinion, more easily merge repeated articles with identical content, and reflect the focuses of a particular field in a more targeted way.
The above is only an overview of the technical scheme of the invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the description, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 is a flow chart of the network focus mining method of an embodiment of the present invention;
Fig. 2 is a schematic diagram of the filtering rules of an embodiment of the present invention;
Fig. 3 is a detailed processing diagram of the network focus mining method of an embodiment of the present invention;
Fig. 4 is a structural diagram of the network focus mining device of an embodiment of the present invention.
Detailed description of the invention
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
To solve the problems in the prior art that network focus mining results are not macroscopic, cannot reflect the focuses of each field separately, and are highly repetitive and poorly readable, the invention provides a network focus mining method and device. The method and device of the embodiments are implemented using automatic text classification and hot-word computation. The invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.
According to an embodiment of the invention, a network focus mining method is provided. Fig. 1 is a flow chart of the network focus mining method of the embodiment. As shown in Fig. 1, the method comprises the following processing:
Step 101: collect network data, classify the network data, and store it by category.
The network data in step 101 comprises: a text title, the article content corresponding to the text title, and the text attributes corresponding to the text title. The text attributes comprise at least one of the following: the Uniform Resource Locator (URL) of the text, the source forum or blog of the text, the source board of the text, the publication time of the text, the author of the text, the reply count of the text, and the view count of the text.
In step 101, classifying the network data and storing it by category comprises:
Step 1: use automatic text classification to classify the network data according to the article content, obtain a classification label for the network data, and store the corresponding text title, classification label, and text attributes in an engine. Here, automatic text classification means using machine learning, with model parameters learned from a small sample, to automatically label a set of texts (or other entities or objects) according to a given taxonomy or standard.
Step 2: at predetermined intervals, collect network data from the engine once, and store the collected network data, according to classification label, in different XML files on a designated server. The predetermined interval may be 1 hour, 6 hours, or 1 day; in the embodiments it can be set flexibly according to the characteristics of the collected data (for example, its update rate).
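The storage half of step 101 can be illustrated with a minimal Python sketch. It assumes each collected record is a dictionary that already carries its classification label under a hypothetical "category" key; the function name, the field names, and the one-file-per-category layout are illustrative assumptions, not details given in the patent.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path


def store_by_category(records, out_dir):
    """Write one XML file per classification label; return {label: file path}."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["category"]].append(record)

    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = {}
    for category, items in grouped.items():
        root = ET.Element("posts", category=category)
        for item in items:
            post = ET.SubElement(root, "post")
            # Every attribute except the label becomes a child element.
            for field, value in item.items():
                if field != "category":
                    ET.SubElement(post, field).text = str(value)
        # Assumes the label is safe to use as a file name.
        target = str(out_path / f"{category}.xml")
        ET.ElementTree(root).write(target, encoding="utf-8", xml_declaration=True)
        written[category] = target
    return written
```

A scheduler (hourly, every 6 hours, or daily, as the text suggests) would call such a function once per collection run.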
Step 102: filter the network data in each category according to preset filtering rules, and extract centre words from the filtered network data in each category.
Preferably, Fig. 2 is a schematic diagram of the filtering rules of the embodiment. As shown in Fig. 2, the filtering rules comprise at least one of the following:
1. deleting network data whose text title does not meet a predetermined character count;
2. deleting network data whose publication time does not meet the requirement;
3. deleting network data whose URL contains a predetermined domain name, where the predetermined domain name is a domain name in a preset domain-name blacklist; or retaining network data whose URL contains a predetermined domain name;
4. deleting network data whose source board is a predetermined board, where the predetermined board is a board in a preset board blacklist; or retaining network data whose source board is a predetermined board;
5. deleting network data whose source does not meet the requirement, where the source comprises forums, blogs, or all posts;
6. deleting network data whose reply count does not meet the requirement;
7. deleting network data whose view count does not meet the requirement;
8. deleting network data whose author does not meet the requirement;
9. deduplicating the network data.
It should be noted that the filtering rules of the embodiment are not limited to the 9 rules listed above. Filtering rules can be configured as needed; for example, a rule may delete network data whose article length does not exceed a predetermined character-count threshold.
In addition, in step 102, to extract the required centre words more accurately, prefixes can be filtered out of the text title according to a preset prefix dictionary before the centre words are extracted. For example, unwanted prefixes such as the board names "Mop University Student Base" or "Tianya Gossip" are filtered out and take no part in centre-word extraction. In the embodiment, word segmentation can be used to extract centre words from the filtered network data in each category: the filtered text titles in each category are segmented, and the segmentation result is used as the centre words. Word segmentation is a mature centre-word extraction technique in the prior art; the embodiment may also use other techniques to extract centre words.
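The prefix filtering and segmentation just described can be sketched in Python as follows. The patent does not name a segmenter, so this sketch takes the segmenter as a parameter and defaults to whitespace splitting purely as a stand-in; a real deployment on Chinese titles would plug in a proper word-segmentation function. The function names and the example prefix strings are hypothetical.

```python
def strip_prefix(title, prefix_dictionary):
    """Remove the longest matching dictionary prefix from the title, if any."""
    for prefix in sorted(prefix_dictionary, key=len, reverse=True):
        if title.startswith(prefix):
            # Also drop separators that commonly follow a board-name prefix.
            return title[len(prefix):].lstrip(" :-")
    return title


def extract_centre_words(title, prefix_dictionary, segment=str.split):
    """Strip unwanted prefixes, then segment the title into centre words.

    `segment` is a placeholder for a real word segmenter (an assumption).
    """
    return segment(strip_prefix(title, prefix_dictionary))
```

For example, with the prefix set {"Tianya Gossip"}, the title "Tianya Gossip: fuel prices rise" is reduced to "fuel prices rise" before segmentation, so the board name never contributes centre words.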
Step 103: sort the centre words extracted from the same piece of network data and combine the sorted centre words, to obtain the center phrases of each piece of network data in each category.
Step 103 is implemented with hot-word computation, which means: automatically segmenting the web page text collected in real time, grouping and merging the results, computing high-frequency hot keywords, filtering them according to predefined dictionaries and preset rules, and outputting a real-time list of Internet hot words.
In step 103, before the centre words extracted from the same piece of network data are sorted, common words can be filtered out of the extracted centre words according to a preset common-word dictionary. Common words are words such as "original", "repost", and "photo set", which need to be filtered out.
Also in step 103, combining the centre words means: combining the sorted centre words belonging to the same text title according to the combination formula $C_n^r = \frac{n!}{r!\,(n-r)!}$, where n is the total number of centre words belonging to the same text title, r ≤ n, and 2 ≤ r ≤ 5.
After step 103 has been carried out, junk phrases can preferably be filtered out of the center phrases according to a preset junk dictionary.
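Step 103 as a whole (common-word filtering, sorting, combining into phrases of 2 to 5 words, junk-phrase filtering) can be sketched with Python's standard `itertools.combinations`. The function and parameter names are hypothetical, and removing duplicate centre words within one title is an assumption of this sketch rather than something the text states.

```python
from itertools import combinations


def centre_phrases(centre_words, common_words=frozenset(),
                   junk_phrases=frozenset(), min_len=2, max_len=5):
    """Return the sorted r-word combinations (min_len <= r <= max_len) of one title."""
    # Drop common words, de-duplicate (assumption), then sort.
    words = sorted(set(centre_words) - set(common_words))
    phrases = []
    for r in range(min_len, min(max_len, len(words)) + 1):
        for combo in combinations(words, r):
            if combo not in junk_phrases:  # junk dictionary holds word tuples
                phrases.append(combo)
    return phrases
```

Because the words are sorted before they are combined, two titles that contain the same centre words in a different order produce identical phrase tuples, which is what makes the later per-category counting meaningful.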
Step 104: count the occurrences of each center phrase within its category, to obtain the network focus phrases of each category and display them by category.
Step 104 comprises the following processing: count the number of different text titles within the category in which the center phrase occurs, and arrange the center phrases whose occurrence count exceeds a predetermined threshold in a predetermined order, to obtain the network focus phrases of each category. The predetermined order may be descending order of occurrence count.
After the network focus phrases of each category have been obtained, identical network focus phrases within the same category can be merged, the heat value of each network focus phrase in each category can be calculated, and links to the focus event corresponding to each network focus phrase can be searched for, so as to provide users with more comprehensive focus information.
In step 104, displaying by category means presenting a focus report to the user, where the focus report comprises: the category of each network focus phrase, the network focus phrases of each category within a predetermined time period, the heat value of each network focus phrase in each category, and the links to the focus event corresponding to each network focus phrase in each category. The predetermined time period comprises at least one of the following: hourly, daily, weekly, and monthly.
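The counting in step 104 can be sketched with `collections.Counter`. The input shape (one `(category, phrases)` pair per title), the function name, and the choice to count a phrase at most once per title are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def hot_phrases(phrases_per_title, threshold):
    """Count, per category, in how many titles each center phrase occurs.

    phrases_per_title: iterable of (category, phrases) pairs, one per title.
    Returns {category: [(phrase, count), ...]} with counts above the
    threshold, sorted from most to least frequent.
    """
    counts = defaultdict(Counter)
    for category, phrases in phrases_per_title:
        counts[category].update(set(phrases))  # once per title
    return {
        category: [(p, n) for p, n in counter.most_common() if n > threshold]
        for category, counter in counts.items()
    }
```

Keeping a separate counter per category is what lets the same phrase be a focus in one field without affecting the ranking in another.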
The technical scheme of the embodiment is illustrated below with reference to the drawings.
Fig. 3 is a detailed processing diagram of the network focus mining method of the embodiment. As shown in Fig. 3, the method comprises the following processing:
Step 301: use a custom corpus to generate a classification model through a machine learning module, classify the collected network data with the classification model, and store the classification label together with the text attributes in the engine.
Step 302: collect data from the engine once per hour, and store the data by category in different Extensible Markup Language (XML) files on the designated server.
Step 303: filter the data with the following filtering rules and save the filtered data in a database. The user can manage the filtering rules through a data-filtering rule management back end.
Specifically, the filtering rules of the embodiment comprise:
1. Title filtering: data whose title is between 5 and 30 characters long is kept;
2. Posting time filtering: posts whose posting time is the current day are kept;
3. Domain name filtering: (1) fuzzy matching is used, so that posts whose URL contains a corresponding domain name or word are kept; or (2) by domain name, posts from 30 current-affairs forums and 20 automobile forums, and posts whose URL contains "auto", are kept; or posts satisfying either of rules (1) and (2) are all kept.
4. Column filtering: filtering is performed according to the URL of the board seed; posts whose column title contains certain Chinese characters may also be kept; for example, posts whose column title contains the words "entertainment" or "gossip" are selected;
5. Domain name blacklist filtering: a deletion operation is performed on the results kept above, removing posts whose second-level domain name or second-level URL contains a certain word; for example, among the results whose top-level domain is xinhuanet.com, those whose domain name is 120ask.xinhuanet.com are removed;
6. Column blacklist: a deletion operation is performed on the results kept above, removing posts whose seed or column name contains a certain word; for example, posts whose column name is "newcomer check-in" are removed;
7. Source filtering: data matching the filter source is kept, where the filter source is: forum, blog, or all posts;
8. Reply count and click count filtering: data with a reply count between 0 and 1000 is kept; data with a click count between 0 and 10000 is kept;
9. Deduplication: posts are deduplicated according to their URL; posts with the same top-level domain are all counted as one post;
10. The filtered fields include: title, URL, source forum, source board, posting time, author, reply count, view count, etc.
11. Filter logic order: filtering rule 3 and filtering rule 4 above are in an "or" relationship; the other filtering rules are in an "and" relationship.
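The logic order in rule 11 can be illustrated with a minimal Python sketch. The `Post` record, the keyword tuples, and the function name are hypothetical and chosen only for illustration; title length is measured in characters, and the blacklist, source, and deduplication rules are left out for brevity.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Post:
    title: str
    url: str
    column: str
    posted: date
    replies: int
    clicks: int


# Hypothetical rule parameters; the values mirror the examples in the text.
DOMAIN_KEYWORDS = ("auto",)                      # rule 3: word expected in the URL
COLUMN_KEYWORDS = ("entertainment", "gossip")    # rule 4: words in the column name


def passes_filters(post: Post, today: date) -> bool:
    """Rules 3 and 4 are OR-ed together; all other rules are AND-ed."""
    title_ok = 5 <= len(post.title) <= 30                       # rule 1
    time_ok = post.posted == today                              # rule 2
    domain_ok = any(k in post.url for k in DOMAIN_KEYWORDS)     # rule 3
    column_ok = any(k in post.column for k in COLUMN_KEYWORDS)  # rule 4
    counts_ok = 0 <= post.replies <= 1000 and 0 <= post.clicks <= 10000  # rule 8
    return title_ok and time_ok and (domain_ok or column_ok) and counts_ok
```

A post is kept when it satisfies either the domain rule or the column rule, provided that it also satisfies every other rule.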
Step 304, centre words are extracted from all text headers. One title may have several centre words; the title is segmented by word segmentation technology, and the segmentation result is the set of centre words of the title. Preferably, prefix filtering is first applied to the title before segmentation, and these prefixes do not take part in segmentation, for example forum-name prefixes such as "Mop University Student Base" and "Tianya Gossip". The user can manage the prefixes to be filtered through a prefix management backend;
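Prefix filtering followed by segmentation might be sketched as follows. The prefix strings, the function names, and the use of whitespace splitting as a stand-in for a real word segmenter are all assumptions made for illustration; the text does not name a particular segmenter.

```python
# Hypothetical prefix dictionary; a real deployment would load it from the
# prefix management backend mentioned in the text.
PREFIXES = ("[Mop University Student Base]", "[Tianya Gossip]")


def strip_prefix(title, prefixes=PREFIXES):
    """Remove the first matching prefix so it does not take part in segmentation."""
    for prefix in prefixes:
        if title.startswith(prefix):
            return title[len(prefix):].lstrip()
    return title


def extract_centre_words(title, segment=str.split):
    """Segment the prefix-free title; whitespace splitting is only a stand-in."""
    return segment(strip_prefix(title))
```

The `segment` parameter is left pluggable so that a proper word segmenter can replace the whitespace split.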
Step 305, the focus phrases are calculated:
Step 1, the common words in the segmentation result (for example, words such as "original", "repost", and "photo set") are filtered out; the user can manage the common words to be filtered through a common-word management backend;
Step 2, the centre words remaining after filtering are sorted (for example, if the centre words extracted from a title are b, c, a, they become a, b, c after sorting);
Step 3, the centre words of each title are combined according to the combination formula $C_n^r = \frac{n!}{r!\,(n-r)!}$, where n is the total number of centre words of the title and r is the number of centre words in a phrase; only phrases of 2 to 5 words are retained;
The sorting and combination of centre words into phrases is illustrated below with an example.
Title one yields the centre words b, a, c; after sorting they are a, b, c, forming the phrases ab, bc, ac, abc.
Title two yields the centre words c, b, d; after sorting they are b, c, d, forming the phrases bc, cd, bd, bcd.
Title three yields the centre words b, c, forming the phrase bc.
The phrase ranking formed by these three titles is therefore: bc(3), ab(1), ac(1), cd(1), bd(1), abc(1), bcd(1).
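The three-title example above can be reproduced with a short Python sketch. The function names are hypothetical, and removing repeated centre words within one title (via `set`) is an assumption the text does not state.

```python
from collections import Counter
from itertools import combinations


def phrases_for_title(centre_words, min_len=2, max_len=5):
    """Sort the centre words of one title and form every 2- to 5-word combination."""
    words = sorted(set(centre_words))  # assumption: repeats within a title are dropped
    phrases = []
    for r in range(min_len, min(max_len, len(words)) + 1):
        phrases.extend("".join(combo) for combo in combinations(words, r))
    return phrases


def phrase_counts(titles):
    """Count, for each phrase, the number of titles that produce it."""
    counts = Counter()
    for centre_words in titles:
        counts.update(phrases_for_title(centre_words))
    return counts


counts = phrase_counts([["b", "a", "c"], ["c", "b", "d"], ["b", "c"]])
print(counts.most_common(1))  # [('bc', 3)]
```

Sorting before combining is what lets the titles "b, a, c" and "c, b, d" both contribute to the same phrase bc.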
Step 4, junk phrases are filtered out, removing junk phrases such as "query ### prize", "### phone", "### consulting", and "mobile phone prize"; the user can manage the junk phrases to be filtered through a junk phrase management backend;
Step 306, the focus phrase ranking list is formed: the number of titles behind each focus phrase is counted, the phrases are arranged in descending order of title count, and only phrases with a title count of more than 2 are retained; this parameter can be adjusted according to the actual data;
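A minimal sketch of the ranking step follows. The function name and the alphabetical tie-break are assumptions, and "more than 2" is read here as a strict comparison against an adjustable parameter.

```python
def rank_focus_phrases(title_counts, min_titles=2):
    """Keep phrases backed by more than `min_titles` titles, most frequent first."""
    kept = [(phrase, n) for phrase, n in title_counts.items() if n > min_titles]
    # Descending by title count; ties are broken alphabetically (an assumption).
    return sorted(kept, key=lambda item: (-item[1], item[0]))


ranking = rank_focus_phrases({"bc": 3, "ab": 1, "abc": 1})
print(ranking)  # [('bc', 3)]
```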
In summary, by means of the technical solution of the embodiment of the present invention, focus mining is realized using the hot-word computing principle, and text classification technology is combined with focus mining technology. This solves the problems in the prior art that network focus mining results are not macroscopic, cannot reflect the focus of a particular field on a per-field basis, are highly repetitive, and are poorly readable. Network focuses can be mined more macroscopically, reflecting at the macro level how hot a given point of netizen attention is, so that the mining results better reflect the objective facts of Internet public opinion; repeated articles with identical content are merged more easily; and the focuses of a particular field can be reflected in a more targeted manner.
According to an embodiment of the present invention, a network focus mining device is provided. Fig. 4 is a structural schematic diagram of the network focus mining device of the embodiment of the present invention. As shown in Fig. 4, the network focus mining device according to the embodiment of the present invention comprises: a classification memory module 40, a filter extraction module 42, an ordered set compound module 44, and a focus statistics module 46. Each module of the embodiment of the present invention is described in detail below.
The classification memory module 40 is adapted to collect network data, and to classify the network data and store it by category;
The network data specifically comprises: a text header, the article content corresponding to the text header, and the text attributes corresponding to the text header. The text attributes specifically comprise at least one of the following: the URL corresponding to the text, the source forum/blog of the text, the source column of the text, the publication time of the text, the text author, the reply count of the text, and the view count of the text.
The classification memory module 40 is specifically adapted to: 1. perform text classification on the network data according to the article content using automatic text classification technology, obtain the classification label corresponding to the network data, and store the corresponding text header, the corresponding classification label, and the corresponding text attributes in an engine; here, automatic text classification technology means using the principles of machine learning, relying on model parameters learned from a small sample, to automatically label a set of texts (or other entities or objects) according to a given classification system or standard. 2. collect network data from the engine once every predetermined time, and store the collected network data by category, according to the classification labels, in different XML files on a designated server. The predetermined time may be 1 hour, 6 hours, or 1 day; in the embodiment of the present invention, the predetermined time can be set flexibly according to the characteristics of the collected data (for example, its update speed).
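Storing collected records in one XML document per classification label might look like the following sketch. The record fields, the element names (`posts`, `post`, `title`), and the example URLs are hypothetical; the text does not specify an XML schema. The documents are returned as strings rather than written to a server.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def group_by_label(records):
    """Group collected records by their classification label."""
    groups = defaultdict(list)
    for record in records:
        groups[record["label"]].append(record)
    return groups


def to_xml(label, records):
    """Serialise all records of one category into a single XML document."""
    root = ET.Element("posts", {"category": label})
    for record in records:
        post = ET.SubElement(root, "post", {"url": record["url"]})
        ET.SubElement(post, "title").text = record["title"]
    return ET.tostring(root, encoding="unicode")


records = [
    {"label": "auto", "url": "http://example.com/1", "title": "t1"},
    {"label": "news", "url": "http://example.com/2", "title": "t2"},
    {"label": "auto", "url": "http://example.com/3", "title": "t3"},
]
xml_by_label = {label: to_xml(label, recs) for label, recs in group_by_label(records).items()}
```

Each value in `xml_by_label` could then be written to its own file, one file per category.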
The filter extraction module 42 is adapted to filter the network data under each category according to preset filtering rules, and to extract centre words from the filtered network data under each category;
In the embodiment of the present invention, Fig. 2 is a schematic diagram of the filtering rules of the embodiment of the present invention. As shown in Fig. 2, the filtering rules specifically comprise at least one of the following: 1. deleting network data whose text header does not meet a predetermined character count; 2. deleting network data whose publication time does not meet the requirement; 3. deleting network data whose URL contains a predetermined domain name, where the predetermined domain name is a domain name in a preset domain name blacklist; or retaining network data whose URL contains a predetermined domain name; 4. deleting network data whose source column is a predetermined column, where the predetermined column is a column in a preset column blacklist; or retaining network data whose source column is a predetermined column; 5. deleting network data whose source does not meet the requirement, where the source comprises: forum, blog, or all posts; 6. deleting network data whose reply count does not meet the requirement; 7. deleting network data whose view count does not meet the requirement; 8. deleting network data whose author does not meet the requirement; 9. deduplicating the network data.
It should be noted that the filtering rules in the embodiment of the present invention are not limited to the 9 rules listed above; the filtering rules can be set as required, for example: deleting network data whose article character count does not exceed a predetermined threshold.
In addition, before the centre words are extracted, in order to better extract the needed centre words, the filter extraction module 42 is further adapted to perform prefix filtering on the text header according to a preset prefix dictionary, for example filtering out unneeded prefixes such as "Mop University Student Base" and "Tianya Gossip". These prefixes do not take part in centre word extraction. In the embodiment of the present invention, the filter extraction module 42 may use word segmentation technology to extract centre words from the filtered network data under each category; specifically, the filter extraction module 42 may segment the filtered text headers under each category, obtain the segmentation result, and use the segmentation result as the centre words. It should be noted that the above word segmentation technology is a mature centre word extraction technique in the prior art; the embodiment of the present invention may also use other techniques to extract centre words.
The ordered set compound module 44 is adapted to sort the centre words extracted from the same item of network data, and to combine the sorted centre words of the same item of network data to obtain the center phrases of each item of network data under each category;
The ordered set compound module 44 realizes the above processing through hot-word computing technology. Hot-word computing technology means: automatically segmenting web page text collected in real time, grouping and merging it, computing high-frequency hot keywords, filtering according to a predefined dictionary and preset rules, and outputting real-time Internet hot vocabulary.
Before sorting the centre words extracted from the same item of network data, the ordered set compound module 44 may filter out the common words among the extracted centre words according to a preset common-word dictionary. Common words are words such as "original", "repost", and "photo set", which need to be filtered out.
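Common-word filtering amounts to a dictionary lookup, as in this minimal sketch. The dictionary contents are English stand-ins for the examples in the text, and the names are hypothetical.

```python
# Hypothetical common-word dictionary; the text has it managed through a backend.
COMMON_WORDS = frozenset({"original", "repost", "photo set"})


def drop_common_words(centre_words, common=COMMON_WORDS):
    """Return the centre words that are not in the common-word dictionary."""
    return [word for word in centre_words if word not in common]
```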
The centre word combination performed by the ordered set compound module 44 means: the ordered set compound module 44 combines the sorted centre words belonging to the same text header according to $C_n^r = \frac{n!}{r!\,(n-r)!}$, where n is the total number of centre words belonging to the same text header, r ≤ n, and 2 ≤ r ≤ 5.
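As a worked illustration of the constraint 2 ≤ r ≤ 5, the number of center phrases obtained from one text header with n distinct centre words can be written as a sum of combination counts. The notation N(n) is introduced here for illustration only and does not appear in the original text.

```latex
N(n) = \sum_{r=2}^{\min(n,5)} \binom{n}{r},
\qquad \binom{n}{r} = \frac{n!}{r!\,(n-r)!}

N(3) = \binom{3}{2} + \binom{3}{3} = 3 + 1 = 4,
\qquad
N(4) = \binom{4}{2} + \binom{4}{3} + \binom{4}{4} = 6 + 4 + 1 = 11
```

The value N(3) = 4 agrees with the four phrases ab, bc, ac, abc formed from the centre words a, b, c in the example of the method embodiment.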
Preferably, after the sorted centre words of the same item of network data have been combined and the center phrases of each item of network data under each category have been obtained, the ordered set compound module 44 is further adapted to filter out the junk phrases among the center phrases according to a preset junk dictionary.
The focus statistics module 46 is adapted to count the number of occurrences of each center phrase under its category, obtain the network focus phrases under each category, and display them by category.
The focus statistics module 46 is specifically adapted to: count the number of occurrences of each center phrase in different text headers under its category, and arrange the center phrases whose occurrence count is greater than a predetermined threshold in a predetermined order, thereby obtaining the network focus phrases under each category.
After the network focus phrases under each category have been obtained, the focus statistics module 46 is further adapted to: merge identical network focus phrases under the same category; calculate the temperature value corresponding to the network focus phrases under each category; and search for links to the focus events corresponding to the network focus phrases under each category.
The display by category performed by the focus statistics module 46 means presenting a hot spot report to the user, where the hot spot report comprises: the category to which a network focus phrase belongs, the network focus phrases under each category within a predetermined time period, the temperature value corresponding to the network focus phrases under each category, and links to the focus events corresponding to the network focus phrases under each category; the predetermined time period comprises at least one of the following: hourly, daily, weekly, and monthly.
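The contents of the hot spot report could be modelled with two small record types, as in the sketch below. The class and field names are hypothetical, and because the text does not say how the temperature value is computed, it is treated here as an externally supplied number.

```python
from dataclasses import dataclass, field


@dataclass
class FocusEntry:
    phrase: str                 # network focus phrase
    temperature: float          # temperature value, supplied by the caller
    links: list = field(default_factory=list)  # links to the focus event


@dataclass
class FocusReport:
    category: str               # category the phrases belong to
    period: str                 # "hourly", "daily", "weekly" or "monthly"
    entries: list = field(default_factory=list)

    def top(self, k):
        """Return the k hottest entries, hottest first."""
        return sorted(self.entries, key=lambda e: e.temperature, reverse=True)[:k]
```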
The technical solution of the embodiment of the present invention is illustrated below with reference to the accompanying drawings.
Fig. 3 is a schematic diagram of the detailed processing of the network focus mining method of the embodiment of the present invention. As shown in Fig. 3, the network focus mining method according to the embodiment of the present invention specifically comprises the following processing:
Step 301, a classification model is generated from a custom corpus by a machine learning module; the classification memory module 40 performs text classification on the collected network data using the classification model, and stores the classification labels together with the text attributes in the engine.
Step 302, the classification memory module 40 collects data from the engine once per hour and stores the data, by category, in different Extensible Markup Language (XML) files on a designated server.
Step 303, the filter extraction module 42 filters the data by the following filtering rules and keeps the filtered data in a database; the user can manage the filtering rules through a data filtering rule management backend.
Specifically, Fig. 3 is a preferred schematic diagram of the filtering rules of the embodiment of the present invention. As shown in Fig. 3, the filtering rules according to the embodiment of the present invention comprise:
1. Title filtering: data whose title is between 5 and 30 characters long is kept;
2. Posting time filtering: posts whose posting time is the current day are kept;
3. Domain name filtering: (1) fuzzy matching is used, so that posts whose URL contains a corresponding domain name or word are kept; or (2) by domain name, posts from 30 current-affairs forums and 20 automobile forums, and posts whose URL contains "auto", are kept; or posts satisfying either of rules (1) and (2) are all kept.
4. Column filtering: filtering is performed according to the URL of the board seed; posts whose column title contains certain Chinese characters may also be kept; for example, posts whose column title contains the words "entertainment" or "gossip" are selected;
5. Domain name blacklist filtering: a deletion operation is performed on the results kept above, removing posts whose second-level domain name or second-level URL contains a certain word; for example, among the results whose top-level domain is xinhuanet.com, those whose domain name is 120ask.xinhuanet.com are removed;
6. Column blacklist: a deletion operation is performed on the results kept above, removing posts whose seed or column name contains a certain word; for example, posts whose column name is "newcomer check-in" are removed;
7. Source filtering: data matching the filter source is kept, where the filter source is: forum, blog, or all posts;
8. Reply count and click count filtering: data with a reply count between 0 and 1000 is kept; data with a click count between 0 and 10000 is kept;
9. Deduplication: posts are deduplicated according to their URL; posts with the same top-level domain are all counted as one post;
10. The filtered fields include: title, URL, source forum, source board, posting time, author, reply count, view count, etc.
11. Filter logic order: filtering rule 3 and filtering rule 4 above are in an "or" relationship; the other filtering rules are in an "and" relationship.
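URL-based deduplication (rule 9) could be sketched as below. Using the lower-cased host plus the path as the deduplication key is an assumption made for illustration, and the sketch does not attempt the coarser domain-level merging that rule 9 also mentions.

```python
from urllib.parse import urlsplit


def url_key(url):
    """Normalise a URL to (host, path); query strings and trailing slashes are ignored."""
    parts = urlsplit(url)
    return (parts.netloc.lower(), parts.path.rstrip("/"))


def deduplicate(posts):
    """Keep the first post seen for each normalised URL, preserving input order."""
    seen = set()
    unique = []
    for post in posts:
        key = url_key(post["url"])
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique
```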
Step 304, the filter extraction module 42 extracts centre words from all text headers. One title may have several centre words; the title is segmented by word segmentation technology, and the segmentation result is the set of centre words of the title. Preferably, prefix filtering is first applied to the title before segmentation, and these prefixes do not take part in segmentation, for example forum-name prefixes such as "Mop University Student Base" and "Tianya Gossip". The user can manage the prefixes to be filtered through a prefix management backend;
Step 305, the ordered set compound module 44 calculates the focus phrases:
Step 1, the common words in the segmentation result (for example, words such as "original", "repost", and "photo set") are filtered out; the user can manage the common words to be filtered through a common-word management backend;
Step 2, the centre words remaining after filtering are sorted (for example, if the centre words extracted from a title are b, c, a, they become a, b, c after sorting);
Step 3, the centre words of each title are combined according to the combination formula $C_n^r = \frac{n!}{r!\,(n-r)!}$, where n is the total number of centre words of the title and r is the number of centre words in a phrase; only phrases of 2 to 5 words are retained;
The sorting and combination of centre words into phrases is illustrated below with an example.
Title one yields the centre words b, a, c; after sorting they are a, b, c, forming the phrases ab, bc, ac, abc.
Title two yields the centre words c, b, d; after sorting they are b, c, d, forming the phrases bc, cd, bd, bcd.
Title three yields the centre words b, c, forming the phrase bc.
The phrase ranking formed by these three titles is therefore: bc(3), ab(1), ac(1), cd(1), bd(1), abc(1), bcd(1).
Step 4, junk phrases are filtered out, removing junk phrases such as "query ### prize", "### phone", "### consulting", and "mobile phone prize"; the user can manage the junk phrases to be filtered through a junk phrase management backend;
Step 306, the focus statistics module 46 forms the focus phrase ranking list: the number of titles behind each focus phrase is counted, the phrases are arranged in descending order of title count, and only phrases with a title count of more than 2 are retained; this parameter can be adjusted according to the actual data;
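Junk phrase filtering can be sketched with patterns in which `###` stands for arbitrary text. That reading of `###` is an assumption, since the text does not define the marker, and the patterns and names below are illustrative English stand-ins.

```python
import re

# Hypothetical junk dictionary; "###" is treated as a wildcard (an assumption).
JUNK_PATTERNS = ("query ### prize", "### phone", "### consulting")


def compile_junk(pattern):
    """Turn a '###' pattern into a regex where each '###' matches any text."""
    literal_parts = [re.escape(part) for part in pattern.split("###")]
    return re.compile(".*".join(literal_parts))


JUNK_REGEXES = [compile_junk(p) for p in JUNK_PATTERNS]


def is_junk(phrase):
    """True if the whole phrase matches any junk pattern."""
    return any(rx.fullmatch(phrase) for rx in JUNK_REGEXES)


def drop_junk(phrases):
    return [p for p in phrases if not is_junk(p)]
```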
In summary, by means of the technical solution of the embodiment of the present invention, focus mining is realized using the hot-word computing principle, and text classification technology is combined with focus mining technology. This solves the problems in the prior art that network focus mining results are not macroscopic, cannot reflect the focus of a particular field on a per-field basis, are highly repetitive, and are poorly readable. Network focuses can be mined more macroscopically, reflecting at the macro level how hot a given point of netizen attention is, so that the mining results better reflect the objective facts of Internet public opinion; repeated articles with identical content are merged more easily; and the focuses of a particular field can be reflected in a more targeted manner.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein can be implemented in a variety of programming languages, and that the description of a specific language above is given in order to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Therefore, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and they may furthermore be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the network focus mining device according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any order; these words may be interpreted as names.

Claims (22)

1. A network focus mining device based on text classification of network data, characterized by comprising:
a classification memory module, adapted to collect network data, and to classify the network data and store it by category;
a filter extraction module, adapted to filter the network data under each category according to preset filtering rules, and to extract centre words from the filtered network data under each category;
an ordered set compound module, adapted to sort the centre words extracted from the same item of network data, and to combine the sorted centre words of the same item of network data to obtain the center phrases of each item of network data under each category;
a focus statistics module, adapted to count the number of occurrences of the center phrases under their category, obtain the network focus phrases under each category, and display them by category;
wherein the classification memory module is further adapted to:
analyze the text content in the network data and perform text classification on the network data accordingly, obtain the classification label corresponding to the network data, and store the corresponding text header, the corresponding classification label, and the corresponding text attributes in an engine;
collect network data from the engine once every predetermined time, and store the collected network data by category, according to the classification labels, in different XML files on a designated server;
and wherein the filtering rules further comprise at least one of the following:
deleting network data whose text header does not meet a predetermined character count;
deleting network data whose publication time does not meet the requirement;
deleting network data whose URL contains a predetermined domain name, wherein the predetermined domain name is a domain name in a preset domain name blacklist; or retaining network data whose URL contains a predetermined domain name;
deleting network data whose source column is a predetermined column, wherein the predetermined column is a column in a preset column blacklist; or retaining network data whose source column is a predetermined column;
deleting network data whose source does not meet the requirement, wherein the source comprises: forum, blog, or all posts;
deleting network data whose reply count does not meet the requirement;
deleting network data whose view count does not meet the requirement;
deleting network data whose author does not meet the requirement; and
deduplicating the network data.
2. The device as claimed in claim 1, characterized in that the network data further comprises: a text header, the article content corresponding to the text header, and the text attributes corresponding to the text header.
3. The device as claimed in claim 1 or 2, characterized in that the text attributes further comprise at least one of the following: the uniform resource locator (URL) corresponding to the text, the source forum/blog of the text, the source column of the text, the publication time of the text, the text author, the reply count of the text, and the view count of the text.
4. The device as claimed in claim 1 or 2, characterized in that the filter extraction module is further adapted to: perform prefix filtering on the text header according to a preset prefix dictionary before using word segmentation technology to extract centre words from the filtered network data under each category.
5. The device as claimed in claim 1 or 2, characterized in that the filter extraction module is further adapted to: use word segmentation technology to segment the filtered text headers under each category, obtain the segmentation result, and use the segmentation result as the centre words.
6. The device as claimed in claim 1 or 2, characterized in that the ordered set compound module is further adapted to: before sorting the centre words extracted from the same item of network data, filter out the common words among the extracted centre words according to a preset common-word dictionary.
7. The device as claimed in claim 1 or 2, characterized in that the ordered set compound module is further adapted to: combine the sorted centre words belonging to the same text header according to $C_n^r = \frac{n!}{r!\,(n-r)!}$, wherein n is the total number of centre words belonging to the same text header, r ≤ n, and 2 ≤ r ≤ 5.
8. The device as claimed in claim 1 or 2, characterized in that the ordered set compound module is further adapted to: after the sorted centre words of the same item of network data have been combined and the center phrases of each item of network data under each category have been obtained, filter out the junk phrases among the center phrases according to a preset junk dictionary.
9. The device as claimed in claim 1 or 2, characterized in that the focus statistics module is further adapted to: count the number of occurrences of the center phrases in different text headers under their category, and arrange the center phrases whose occurrence count is greater than a predetermined threshold in a predetermined order, thereby obtaining the network focus phrases under each category.
10. The device as claimed in claim 1 or 2, characterized in that the focus statistics module is further adapted to: merge identical network focus phrases under the same category; calculate the temperature value corresponding to the network focus phrases under each category; and search for links to the focus events corresponding to the network focus phrases under each category.
11. The device as claimed in claim 1 or 2, characterized in that the focus statistics module is further adapted to: present a hot spot report to the user, wherein the hot spot report comprises: the category to which a network focus phrase belongs, the network focus phrases under each category within a predetermined time period, the temperature value corresponding to the network focus phrases under each category, and links to the focus events corresponding to the network focus phrases under each category; and the predetermined time period comprises at least one of the following: hourly, daily, weekly, and monthly.
12. A network hotspot mining method based on text classification of network data, characterized by comprising:
Collecting network data, analyzing the text content of the network data and classifying the network data accordingly to obtain a classification tag corresponding to the network data, and storing the corresponding text title, classification tag and text attributes in an engine;
Acquiring network data from the engine once every predetermined time, and storing the acquired network data, by category according to the classification tags, in different XML files on a designated server;
Filtering the network data under each category according to preset filtering rules, and extracting center words from the filtered network data under each category;
Sorting the center words extracted from the same item of network data and combining the sorted center words of that item, to obtain the center phrases of each item of network data under each category;
Counting the number of occurrences of each center phrase under its category, obtaining the network hotspot phrases under each category, and displaying them by category;
Wherein the filtering rules further comprise at least one of the following:
Deleting network data whose text title does not meet a predetermined word count;
Deleting network data whose publication time does not meet the requirements;
Deleting network data whose URL contains a predetermined domain name, wherein the predetermined domain name is a domain name in a preset domain name blacklist; or retaining network data whose URL contains a predetermined domain name;
Deleting network data whose source column is a predetermined column, wherein the predetermined column is a column in a preset column blacklist; or retaining network data whose source column is a predetermined column;
Deleting network data whose source does not meet the requirements, wherein the source comprises: forum, blog or all types;
Deleting network data whose reply count does not meet the requirements;
Deleting network data whose view count does not meet the requirements;
Deleting network data whose author does not meet the requirements; and
De-duplicating the network data.
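The filtering rules of claim 12 can be illustrated with a minimal Python sketch. It covers only four of the listed rules plus de-duplication. The `Post` record, its field names, the thresholds and the choice to de-duplicate by exact title are all assumptions made for illustration; the claim does not specify them.

```python
from dataclasses import dataclass
from datetime import datetime
from urllib.parse import urlparse


@dataclass(frozen=True)
class Post:
    """One item of network data; the field names are illustrative assumptions."""
    title: str
    url: str
    published: datetime
    replies: int


def filter_posts(posts, min_title_len, earliest, domain_blacklist, min_replies):
    """Apply a subset of the claimed filtering rules, then de-duplicate by title."""
    kept, seen_titles = [], set()
    for post in posts:
        if len(post.title) < min_title_len:      # title length rule
            continue
        if post.published < earliest:            # publication time rule
            continue
        host = urlparse(post.url).hostname or ""
        if host in domain_blacklist:             # domain name blacklist rule
            continue
        if post.replies < min_replies:           # reply count rule
            continue
        if post.title in seen_titles:            # de-duplication (assumed: exact title match)
            continue
        seen_titles.add(post.title)
        kept.append(post)
    return kept
```

The alternative branch of the domain rule (retaining only items from listed domains) would invert the blacklist test into a whitelist test.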
13. The method as claimed in claim 12, characterized in that the network data comprises: a text title, the article content corresponding to the text title, and the text attributes corresponding to the text title.
14. The method as claimed in claim 12 or 13, characterized in that the text attributes further comprise at least one of the following: the uniform resource locator (URL) of the text, the source forum/blog of the text, the source column of the text, the publication time of the text, the author of the text, the reply count of the text, and the view count of the text.
15. The method as claimed in claim 12 or 13, characterized in that, before extracting center words from the filtered network data under each category, the method further comprises:
Performing prefix filtering on the text title according to a preset prefix dictionary.
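As a hedged illustration of prefix filtering, the sketch below assumes that it means stripping leading markers (for example reply or repost tags) found in the prefix dictionary from the title. The function name `strip_prefixes` and the sample prefixes are hypothetical and do not come from the patent.

```python
def strip_prefixes(title, prefixes):
    """Repeatedly remove any leading prefix that appears in the prefix dictionary."""
    changed = True
    while changed:
        changed = False
        title = title.lstrip()
        for prefix in prefixes:
            if prefix and title.startswith(prefix):
                title = title[len(prefix):]
                changed = True
    return title.strip()
```

For example, `strip_prefixes("Re: [Repost] Flood warning issued", ["Re:", "[Repost]"])` leaves only the substantive part of the title, so that reposts and replies contribute the same center words as the original post.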
16. The method as claimed in claim 12 or 13, characterized in that extracting center words from the filtered network data under each category further comprises:
Segmenting the filtered text titles under each category using a word segmentation technique to obtain a segmentation result, and taking the segmentation result as the center words.
17. The method as claimed in claim 12 or 13, characterized in that, before sorting the center words extracted from the same item of network data, the method further comprises:
Filtering out common words from the extracted center words according to a preset common-word dictionary.
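Claims 16 and 17 together amount to tokenizing each filtered title and then dropping common words. The sketch below is only an approximation: it uses whitespace splitting as a stand-in for a real word segmenter (Chinese titles would need a dedicated segmentation library), and the function name and the sample common-word set are assumptions.

```python
def extract_center_words(title, common_words):
    """Tokenize a title and drop tokens listed in the common-word dictionary.

    Whitespace splitting is a placeholder for the word segmentation
    technique named in the claim.
    """
    tokens = title.lower().split()
    return [token for token in tokens if token not in common_words]
```

With `common_words = {"the"}`, the title "The flood hits the coast" yields the center words `flood`, `hits` and `coast`.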
18. The method as claimed in claim 12 or 13, characterized in that combining the sorted center words of the same item of network data further comprises:
Combining, according to the combination formula C(n, r), the sorted center words belonging to the same text title, wherein n is the total number of center words belonging to the same text title, r ≤ n and 2 ≤ r ≤ 5.
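A minimal sketch of the combination step, assuming it enumerates every r-element subset of a title's center words for 2 ≤ r ≤ 5 and r ≤ n. The lexicographic sort order and the removal of repeated words are assumptions; the claim only states that the words are sorted before being combined.

```python
from itertools import combinations


def center_phrases(center_words, min_r=2, max_r=5):
    """Return all r-word combinations (min_r <= r <= max_r, r <= n) of one title's center words."""
    # Sorting first means the same set of words always yields the same phrase,
    # whatever order the words had in the title.
    words = sorted(set(center_words))
    phrases = []
    for r in range(min_r, min(max_r, len(words)) + 1):
        phrases.extend(combinations(words, r))
    return phrases
```

For three center words this produces C(3, 2) + C(3, 3) = 4 phrases; a title with a single center word produces none.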
19. The method as claimed in claim 12 or 13, characterized in that, after combining the sorted center words of the same item of network data to obtain the center phrases of each item of network data under each category, the method further comprises:
Filtering out junk phrases from the center phrases according to a preset junk dictionary.
20. The method as claimed in claim 12 or 13, characterized in that counting the number of occurrences of each center phrase under its category and obtaining the network hotspot phrases under each category further comprises:
Counting the number of occurrences of each center phrase in different text titles under its category, and arranging the center phrases whose occurrence count is greater than a predetermined threshold in a predetermined order, to obtain the network hotspot phrases under each category.
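The counting step can be sketched as follows for a single category. The sketch assumes that each phrase is counted at most once per title and that the "predetermined order" is descending occurrence count with ties broken alphabetically; neither detail is fixed by the claim, and the function name is hypothetical.

```python
from collections import Counter


def hot_phrases(phrases_per_title, threshold):
    """Count in how many titles each center phrase occurs and keep those above the threshold."""
    counts = Counter()
    for phrases in phrases_per_title:
        counts.update(set(phrases))  # count each phrase once per title
    hot = [(phrase, n) for phrase, n in counts.items() if n > threshold]
    # Assumed ordering: most frequent first, ties broken alphabetically.
    hot.sort(key=lambda item: (-item[1], item[0]))
    return hot
```

Running this once per category, on the phrase lists produced for that category's titles, gives the per-category hotspot phrase lists that the method goes on to display.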
21. The method as claimed in claim 12 or 13, characterized in that, after obtaining the network hotspot phrases under each category, the method further comprises:
Merging identical network hotspot phrases under the same category;
Calculating the heat value of each network hotspot phrase under each category;
Searching for links to the hotspot events corresponding to the network hotspot phrases under each category.
22. The method as claimed in claim 12 or 13, characterized in that displaying by category further comprises:
Displaying a hotspot report to a user, wherein the hotspot report comprises: the category to which each network hotspot phrase belongs, the network hotspot phrases under each category within a predetermined time period, the heat value of each network hotspot phrase under each category, and links to the hotspot events corresponding to the network hotspot phrases under each category; and the predetermined time period comprises at least one of the following: hourly, daily, weekly and monthly.
CN201210346827.9A 2012-09-18 2012-09-18 Network focus method for digging and device Expired - Fee Related CN102831248B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201210346827.9A CN102831248B (en) 2012-09-18 2012-09-18 Network focus method for digging and device
CN201610225018.0A CN105912670A (en) 2012-09-18 2012-09-18 Method and device for network hotspot excavation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210346827.9A CN102831248B (en) 2012-09-18 2012-09-18 Network focus method for digging and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201610225018.0A Division CN105912670A (en) 2012-09-18 2012-09-18 Method and device for network hotspot excavation

Publications (2)

Publication Number Publication Date
CN102831248A CN102831248A (en) 2012-12-19
CN102831248B true CN102831248B (en) 2016-05-11

Family

ID=47334383

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201210346827.9A Expired - Fee Related CN102831248B (en) 2012-09-18 2012-09-18 Network focus method for digging and device
CN201610225018.0A Pending CN105912670A (en) 2012-09-18 2012-09-18 Method and device for network hotspot excavation

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201610225018.0A Pending CN105912670A (en) 2012-09-18 2012-09-18 Method and device for network hotspot excavation

Country Status (1)

Country Link
CN (2) CN102831248B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902596B (en) * 2012-12-28 2017-10-20 中国电信股份有限公司 High frequency content of pages clustering method and system
CN103324718B (en) * 2013-06-25 2016-08-10 百度在线网络技术(北京)有限公司 Method and system based on humongous search Web log mining topic venation
CN103761234A (en) * 2013-10-29 2014-04-30 北京奇虎科技有限公司 Method and device for optimizing search ranking of network resource point
CN103544294B (en) * 2013-10-30 2017-02-01 北京京东尚科信息技术有限公司 Keyword popularity automatic control method
CN103580997B (en) * 2013-11-19 2017-09-29 湖南蚁坊软件有限公司 The extracting method and its device of a kind of popular microblogging in vertical field
CN104714820A (en) * 2013-12-17 2015-06-17 青岛龙泰天翔通信科技有限公司 Cloud on-line updating method
CN105095175B (en) * 2014-04-18 2019-04-30 北京搜狗科技发展有限公司 Obtain the method and device of truncated web page title
CN105095318B (en) * 2014-05-22 2019-02-26 北京启明星辰信息安全技术有限公司 A kind of method and apparatus for realizing analysis of central issue
CN105373551A (en) * 2014-08-25 2016-03-02 阿里巴巴集团控股有限公司 Method for determining sensitive resource processing policy and server
CN105989176A (en) * 2015-03-05 2016-10-05 北大方正集团有限公司 Data processing method and device
CN108108346B (en) * 2016-11-25 2021-12-24 广东亿迅科技有限公司 Method and device for extracting theme characteristic words of document
CN108182191B (en) * 2016-12-08 2022-01-18 腾讯科技(深圳)有限公司 Hotspot data processing method and device
CN107133201B (en) * 2017-04-21 2021-03-16 东莞中国科学院云计算产业技术创新与育成中心 Hot spot information acquisition method and device based on text code recognition
CN108881968B (en) * 2017-05-15 2020-10-30 北京国双科技有限公司 Network video advertisement putting method and system
CN107315838A (en) * 2017-07-17 2017-11-03 深圳源广安智能科技有限公司 A kind of efficient network hotspot digging system
CN107423444B (en) * 2017-08-10 2020-05-19 世纪龙信息网络有限责任公司 Hot word phrase extraction method and system
CN107967299B (en) * 2017-11-03 2020-05-12 中国农业大学 Agricultural public opinion-oriented automatic hot word extraction method and system
CN108712403B (en) * 2018-05-04 2020-08-04 哈尔滨工业大学(威海) Illegal domain name mining method based on domain name construction similarity
CN110516066B (en) * 2019-07-23 2022-04-15 同盾控股有限公司 Text content safety protection method and device
CN110765115A (en) * 2019-09-27 2020-02-07 上海麦克风文化传媒有限公司 Method for combining multiple sorting categories
CN110929160A (en) * 2019-12-02 2020-03-27 上海麦克风文化传媒有限公司 Method for optimizing system sequencing result
CN110888986B (en) * 2019-12-06 2023-05-30 北京明略软件系统有限公司 Information pushing method, device, electronic equipment and computer readable storage medium
CN111580921B (en) * 2020-05-15 2021-10-22 北京字节跳动网络技术有限公司 Content creation method and device
CN112380339A (en) * 2020-11-23 2021-02-19 北京达佳互联信息技术有限公司 Hot event mining method and device and server

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101420356A (en) * 2008-05-30 2009-04-29 北京天腾时空信息科技有限公司 Network content classified processing method and apparatus
CN101923544A (en) * 2009-06-15 2010-12-22 北京百分通联传媒技术有限公司 Method for monitoring and displaying Internet hot spots
CN102004792A (en) * 2010-12-07 2011-04-06 百度在线网络技术(北京)有限公司 Method and system for generating hot-searching word

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8046361B2 (en) * 2008-04-18 2011-10-25 Yahoo! Inc. System and method for classifying tags of content using a hyperlinked corpus of classified web pages
CN101788988B (en) * 2009-01-22 2012-06-27 蔡亮华 Information extraction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
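Research on Internet Public Opinion Discovery and Opinion Mining Technology; Luo Yin; Master's Thesis, University of Electronic Science and Technology of China; 20110415; Chapter 2, Sections 2.1.2 and 2.2.2; Chapter 4, Section 4.2; Chapter 5 *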

Also Published As

Publication number Publication date
CN105912670A (en) 2016-08-31
CN102831248A (en) 2012-12-19

Similar Documents

Publication Publication Date Title
CN102831248B (en) Network focus method for digging and device
CN102945290B (en) Hot microblog topic excavating gear and method
CN102708096B (en) Network intelligence public sentiment monitoring system based on semantics and work method thereof
CN102208992B (en) The malicious information filtering system of Internet and method thereof
JP4489994B2 (en) Topic extraction apparatus, method, program, and recording medium for recording the program
CN103617169B (en) A kind of hot microblog topic extracting method based on Hadoop
CN103365924B (en) A kind of method of internet information search, device and terminal
CN102982157A (en) Device and method used for mining microblog hot topics
CN104281607A (en) Microblog hot topic analyzing method
CN104077402B (en) Data processing method and data handling system
CN106383887A (en) Environment-friendly news data acquisition and recommendation display method and system
CN104933093A (en) Regional public opinion monitoring and decision-making auxiliary system and method based on big data
CN103136358B (en) A kind of method of Automatic Extraction forum data
CN103235827B (en) A kind of method of scientific and technical information automatic classification screening
CN102945246B (en) The disposal route of network information data and device
CN108984667A (en) A kind of public sentiment monitoring system
CN103177076A (en) Public sentiment monitoring system and method based on fixed point websites
Kim et al. Event diffusion patterns in social media
CN105589953A (en) Unexpected public health event internet text extraction method
CN102811207A (en) Network information pushing method and system
CN103064880A (en) Method, device and system based on searching information for providing users with website choice
CN110232126A (en) Hot spot method for digging and server and computer readable storage medium
CN106021418A (en) News event clustering method and device
CN103577504A (en) Method and device for putting personalized contents
CN107220745A (en) A kind of recognition methods, system and equipment for being intended to behavioral data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160511

Termination date: 20210918