CN108197259A - Online topic big data detection method for network - Google Patents

Online topic big data detection method for network

Info

Publication number
CN108197259A
Authority
CN
China
Prior art keywords
text
topic
big data
online
detection method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711489608.5A
Other languages
Chinese (zh)
Other versions
CN108197259B (en)
Inventor
马永军
柴梦瑶
刘洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Science and Technology
Original Assignee
Tianjin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Science and Technology
Priority to CN201711489608.5A
Publication of CN108197259A
Application granted
Publication of CN108197259B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to an online topic big data detection method for networks. Its main technical features are: crawling big data network text online; extracting and representing text features; and performing topic detection with the Single-Pass clustering algorithm using multiple similarity factors. The invention is reasonably designed. Building on the existing Single-Pass algorithm, it analyzes text features, applies several similarity measures, assigns each a different weight factor, and obtains the text similarity as their weighted combination. This lowers the miss rate, the false alarm rate, and the cost function value, and noticeably improves the clustering effect.

Description

Online topic big data detection method for network
Technical field
The invention belongs to the technical field of computer data modeling, and in particular relates to an online topic big data detection method for networks.
Background art
Compared with traditional information channels, the network is more open and more virtual. With the arrival of the we-media era, differing viewpoints, remarks, and emotions ferment and amplify in cyberspace, and some develop into online public opinion events. As China works to build itself into a cyber power, online public opinion analysis receives close attention. Research on public opinion began abroad in the mid-19th century, while research on online public opinion in China started somewhat later. Online public opinion refers to the differing emotions, attitudes, and opinions that the public expresses and spreads over the Internet.
Topic discovery is a key link in online public opinion analysis. Current research concentrates on the choice of text clustering algorithm, such as partition-based, hierarchy-based, density-based, and grid-based clustering algorithms, of which the most commonly used is Single-Pass clustering. Because the Single-Pass clustering algorithm uses a single similarity measure and does not consider the structural features of the text, clustering accuracy suffers, and the miss rate and error rate are high.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing an online topic big data detection method for networks that addresses the high miss rate and error rate of existing clustering algorithms.
The invention solves its technical problem with the following technical solution:
An online topic big data detection method for networks, comprising the following steps:
Step 1: crawl big data network text online;
Step 2: extract and represent text features;
Step 3: perform topic detection with the Single-Pass clustering algorithm, using multiple similarity factors.
Further, step 1 is implemented as follows: build a Hadoop distributed cluster, install CentOS and a distributed topical crawler on each machine, and thereby build a big data acquisition platform.
Further, step 2 is implemented through the following steps:
(1) Text preprocessing: perform feature selection on the text using the document frequency technique, selecting from the candidate set of feature words those that characterize the text most strongly as feature items, and assign weights to the selected feature items with the classical TF-IDF weighting method;
(2) Text representation model construction: represent the text with a vector space model, whose mathematical expression is:
$$VSM(d) = \langle (t_1, \omega_1); (t_2, \omega_2); (t_3, \omega_3); \ldots; (t_n, \omega_n) \rangle$$
where n is the number of feature items, $t_i$ (1 ≤ i ≤ n) is a feature item of the text, and $\omega_i$ (1 ≤ i ≤ n) is the weight corresponding to feature item $t_i$. Establishing the VSM represents the text as a vector in an n-dimensional space.
Further, the multiple similarity factors chosen in step 3 include a time factor, a place factor, and a source factor.
Further, the time factor is:
In the formula, $Sim_{time}(d_i, d_j)$ denotes the time distance between documents $d_i$ and $d_j$, $t = |t_i - t_j|$, and m is the automatically set time interval.
Further, the place factor is:
In the formula, $Sim_{place}(p_i, p_j)$ is the similarity of two place names, $deep(p_i \cap p_j)$ is the common depth of place names $p_i$ and $p_j$ from the root node in the geography tree, $deep(p_i)$ is the depth of place name $p_i$ from the root node, and $deep(p_j)$ is the depth of place name $p_j$ from the root node.
Further, the source factor is:
where $PR(p)$ is the PR value of website p, d is the damping coefficient, a is the weighting factor that reflects whether an out-linking site is an off-site link, $V_1$ is the set of out-linking pages that are not on the same website as page p, $C_i$ is the total number of outgoing links of page i, $V_2$ is the set of out-linking pages that belong to the same website as page p, and $C_j$ is the total number of outgoing links of page j.
Further, the damping coefficient d is 0.85 and the weighting factor a is 0.75.
Further, topic detection in step 3 comprises the following steps:
(1) input a news document d;
(2) determine whether d is the first news report; if so, go to step (3), otherwise go to step (4);
(3) create a new topic, add document d to it, and go to step (7);
(4) preprocess document d and build its vector space model;
(5) compute the similarity between document d and each text of the existing topics, record the maximum similarity $S_{max}$, and find the corresponding topic class T;
(6) if the maximum similarity $S_{max}$ exceeds the preset threshold $T_c$, cluster document d into topic class T; otherwise go to step (3);
(7) the clustering pass ends.
The advantages and positive effects of the invention are:
Building on the existing Single-Pass algorithm, the invention analyzes text features, applies several similarity measures, assigns each a different weight factor, and obtains the text similarity as their weighted combination. This lowers the miss rate, the false alarm rate, and the cost function value, and noticeably improves the clustering effect.
Brief description of the drawings
Fig. 1 is the detection flow chart of the invention;
Fig. 2 shows the miss rate versus false alarm rate curves of the invention and of several detection methods;
Fig. 3 compares the evaluation metrics of the invention and of several detection methods.
Detailed description of the embodiments
The embodiments of the invention are further described below with reference to the drawings.
An online topic big data detection method for networks, as shown in Fig. 1, comprises the following steps:
Step 1: crawl big data network text online.
In this step, a Hadoop distributed cluster is built, and 64-bit CentOS and a distributed topical crawler are installed on each machine, forming a big data acquisition platform.
Step 2: extract and represent text features.
In this step, the segmented data undergo feature selection and weighting, and a vector space model is then built to convert the text into structured data that can be computed on. The specific method is:
(1) Text preprocessing: perform feature selection on the text using the document frequency technique, selecting from the candidate set of feature words those that characterize the text most strongly as feature items, and assign weights to the selected feature items with the classical TF-IDF weighting method.
(2) Text representation model construction: represent the text with a vector space model (VSM), whose mathematical expression is:
$$VSM(d) = \langle (t_1, \omega_1); (t_2, \omega_2); (t_3, \omega_3); \ldots; (t_n, \omega_n) \rangle$$
where n is the number of feature items, $t_i$ (1 ≤ i ≤ n) is a feature item of the text, and $\omega_i$ (1 ≤ i ≤ n) is the weight corresponding to feature item $t_i$. Establishing the VSM represents the text as a vector in an n-dimensional space.
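The preprocessing and representation steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes documents are already segmented into token lists, uses a simple document-frequency cutoff (the hypothetical parameter `min_df`) for feature selection, and uses the common tf × log(N/df) form of TF-IDF. The function name `build_vsm` is likewise hypothetical.

```python
import math
from collections import Counter


def build_vsm(docs, min_df=2):
    """Build sparse TF-IDF vectors for a list of token lists.

    min_df is an assumed cutoff: terms that occur in fewer than
    min_df documents are dropped (document-frequency selection).
    Returns (vocab, vectors); each vector is a {term: weight} dict.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, count in df.items() if count >= min_df)
    keep = set(vocab)

    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in keep)
        total = sum(tf.values()) or 1  # avoid division by zero
        # weight = relative term frequency * inverse document frequency
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vocab, vectors
```

Each returned dictionary plays the role of one VSM(d): its keys are the feature items $t_i$ and its values are the weights $\omega_i$. A sparse dictionary is used instead of a dense n-dimensional array because most weights are zero for news text.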
Step 3: perform topic detection with the Single-Pass clustering algorithm, using multiple similarity factors.
The invention uses the Single-Pass clustering algorithm and chooses time, place, and source as similarity factors, where:
(1) Time factor: the concept of time distance is introduced.
$Sim_{time}(d_i, d_j)$ denotes the time distance between documents $d_i$ and $d_j$, where $t = |t_i - t_j|$ is the difference between the times of the two documents and m is the automatically set time interval.
(2) Place factor: using the data of the "Chinese Place Names Record" and the "National Place Name and Feature Dictionary" provided by Data Hall, a geography tree is built with China as the root node, and the subordination relations between place names are used to represent each place name as a node of the tree.
Computing the similarity of two place names in the geography tree requires considering that each child node is a branch of its parent node, together with three quantities: the distance between the two child nodes, the common depth of the two child nodes, and the depth of each node from the root node. Taking the influence of these three factors on place name similarity into account, the calculation formula is defined as follows:
$Sim_{place}(p_i, p_j)$ is the similarity of the two place names, where $deep(p_i \cap p_j)$ is the common depth of place names $p_i$ and $p_j$ from the root node in the geography tree, $deep(p_i)$ is the depth of place name $p_i$ from the root node, and $deep(p_j)$ is the depth of place name $p_j$ from the root node.
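The three depth quantities named above can be read off a geography tree as in the sketch below. It is an illustration under stated assumptions: the tree is stored as a hypothetical child-to-parent dictionary, the root has depth 0, and the common depth is taken to be the depth of the deepest shared ancestor. The sketch only computes $deep(p_i \cap p_j)$, $deep(p_i)$, and $deep(p_j)$; how they are combined into $Sim_{place}$ is left to the formula of the invention.

```python
def path_to_root(parent, name):
    """Return the node names from the root down to `name`."""
    path = [name]
    while parent.get(name) is not None:
        name = parent[name]
        path.append(name)
    return path[::-1]


def depth_terms(parent, p_i, p_j):
    """Return (deep(p_i ∩ p_j), deep(p_i), deep(p_j)).

    Assumes the root has depth 0 and that both place names hang
    off the same root, so they share at least one ancestor.
    """
    path_i = path_to_root(parent, p_i)
    path_j = path_to_root(parent, p_j)
    shared = 0
    for a, b in zip(path_i, path_j):
        if a != b:
            break
        shared += 1
    # `shared` counts nodes on the common prefix; depth is one less.
    return shared - 1, len(path_i) - 1, len(path_j) - 1


# Tiny hypothetical tree: each key maps a place name to its parent.
geo_parent = {
    "China": None,
    "Tianjin": "China",
    "Binhai": "Tianjin",
    "Hebei": "China",
    "Shijiazhuang": "Hebei",
}
```

With this toy tree, two places in different provinces share only the root, while a district and its own city share the city node, which is the behavior the common-depth term is meant to capture.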
(3) Source factor: the PR value of a page is computed with an improved PageRank algorithm, using the following formula:
$PR(p)$ is the PR value of website p; d is the damping coefficient, usually set to 0.85; a is the weighting factor that reflects whether an out-linking site is an off-site link and is set to 0.75, because off-site pages reflect the importance of the website a page belongs to better than on-site pages do; $V_1$ is the set of out-linking pages that are not on the same website as page p; $C_i$ is the total number of outgoing links of page i; $V_2$ is the set of out-linking pages that belong to the same website as page p; and $C_j$ is the total number of outgoing links of page j.
In this step, after the time factor, place factor, and source factor are chosen as the multiple similarity factors, online topic detection is performed. The input is the set of news report documents and the similarity threshold $T_c$; the output is a set of topic classes. The procedure is as follows:
(1) Input a news document d.
(2) Determine whether d is the first news report; if so, go to step (3), otherwise go to step (4).
(3) Create a new topic, add document d to it, and go to step (7).
(4) Preprocess document d and build its vector space model.
(5) Compute the similarity between document d and each text of the existing topics, record the maximum similarity $S_{max}$, and find the corresponding topic class T.
(6) If the maximum similarity $S_{max}$ exceeds the preset threshold $T_c$, cluster document d into topic class T; otherwise go to step (3).
(7) The clustering pass ends.
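The flow in steps (1) to (7) can be sketched as a short loop. This is a minimal illustration, assuming documents are already sparse {term: weight} vectors and using plain cosine similarity as a stand-in for the combined similarity of the invention; the names `cosine` and `single_pass` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def single_pass(vectors, threshold, similarity=cosine):
    """Cluster documents in arrival order; returns lists of indices."""
    topics = []  # each topic is a list of document indices
    for idx, vec in enumerate(vectors):
        # Step (5): find the most similar text among existing topics.
        best_sim, best_topic = -1.0, None
        for topic in topics:
            for member in topic:
                s = similarity(vec, vectors[member])
                if s > best_sim:
                    best_sim, best_topic = s, topic
        # Step (6): join topic T if S_max exceeds the threshold T_c.
        if best_topic is not None and best_sim > threshold:
            best_topic.append(idx)
        else:
            # Steps (2)-(3): first report, or nothing similar enough.
            topics.append([idx])
    return topics
```

Because each document is compared once against what has already been seen and is never reassigned, the result depends on arrival order. Passing a different `similarity` callable is the natural place to plug in a measure that also accounts for time, place, and source.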
The invention integrates similarity measures for time, place name, and source, assigns each a different weight factor, and obtains the total similarity of two texts as their weighted combination. Fig. 2 and Fig. 3 show, respectively, the miss rate versus false alarm rate curves and the evaluation metric comparison of the invention and the Single-Pass clustering method. As they show, the invention lowers the miss rate, the false alarm rate, and the cost function value, and achieves better detection.
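A weighted combination of this kind can be sketched as below. The sketch is illustrative only: the factor names (including a "content" entry for the text vector similarity), the example weights, and the normalization by the weight sum are all assumptions, since the text does not state the weight values or the exact combination formula.

```python
def combined_similarity(sims, weights):
    """Weighted average of per-factor similarities.

    sims and weights are dicts keyed by factor name. Dividing by the
    weight sum is an assumed convention that keeps the result in the
    same range as the individual similarities.
    """
    total_weight = sum(weights[name] for name in sims)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * sims[name] for name in sims) / total_weight


# Hypothetical per-factor similarities for one pair of documents.
example_sims = {"content": 0.8, "time": 0.5, "place": 1.0, "source": 0.2}
# Hypothetical weights; the source does not give concrete values.
example_weights = {"content": 0.4, "time": 0.2, "place": 0.2, "source": 0.2}
total = combined_similarity(example_sims, example_weights)
```

The resulting total is what would be compared against the threshold $T_c$ when deciding whether a document joins an existing topic.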
It should be emphasized that the embodiments described here are illustrative rather than restrictive. The invention is therefore not limited to the embodiments described in the detailed description; any other embodiment that a person skilled in the art derives from the technical solution of the invention likewise falls within its scope of protection.

Claims (9)

1. An online topic big data detection method for networks, characterized by comprising the following steps:
Step 1: crawl big data network text online;
Step 2: extract and represent text features;
Step 3: perform topic detection with the Single-Pass clustering algorithm, using multiple similarity factors.
2. The online topic big data detection method for networks according to claim 1, characterized in that step 1 is implemented as follows: build a Hadoop distributed cluster, install CentOS and a distributed topical crawler on each machine, and thereby build a big data acquisition platform.
3. The online topic big data detection method for networks according to claim 1, characterized in that step 2 is implemented through the following steps:
(1) Text preprocessing: perform feature selection on the text using the document frequency technique, selecting from the candidate set of feature words those that characterize the text most strongly as feature items, and assign weights to the selected feature items with the classical TF-IDF weighting method;
(2) Text representation model construction: represent the text with a vector space model, whose mathematical expression is:
$$VSM(d) = \langle (t_1, \omega_1); (t_2, \omega_2); (t_3, \omega_3); \ldots; (t_n, \omega_n) \rangle$$
where n is the number of feature items, $t_i$ (1 ≤ i ≤ n) is a feature item of the text, and $\omega_i$ (1 ≤ i ≤ n) is the weight corresponding to feature item $t_i$. Establishing the VSM represents the text as a vector in an n-dimensional space.
4. The online topic big data detection method for networks according to claim 1, characterized in that the multiple similarity factors chosen in step 3 include a time factor, a place factor, and a source factor.
5. The online topic big data detection method for networks according to claim 4, characterized in that the time factor is:
In the formula, $Sim_{time}(d_i, d_j)$ denotes the time distance between documents $d_i$ and $d_j$, $t = |t_i - t_j|$, and m is the automatically set time interval.
6. The online topic big data detection method for networks according to claim 4, characterized in that the place factor is:
In the formula, $Sim_{place}(p_i, p_j)$ is the similarity of two place names, $deep(p_i \cap p_j)$ is the common depth of place names $p_i$ and $p_j$ from the root node in the geography tree, $deep(p_i)$ is the depth of place name $p_i$ from the root node, and $deep(p_j)$ is the depth of place name $p_j$ from the root node.
7. The online topic big data detection method for networks according to claim 4, characterized in that the source factor is:
where $PR(p)$ is the PR value of website p, d is the damping coefficient, a is the weighting factor that reflects whether an out-linking site is an off-site link, $V_1$ is the set of out-linking pages that are not on the same website as page p, $C_i$ is the total number of outgoing links of page i, $V_2$ is the set of out-linking pages that belong to the same website as page p, and $C_j$ is the total number of outgoing links of page j.
8. The online topic big data detection method for networks according to claim 7, characterized in that the damping coefficient d is 0.85 and the weighting factor a is 0.75.
9. The online topic big data detection method for networks according to claim 1, characterized in that topic detection in step 3 comprises the following steps:
(1) input a news document d;
(2) determine whether d is the first news report; if so, go to step (3), otherwise go to step (4);
(3) create a new topic, add document d to it, and go to step (7);
(4) preprocess document d and build its vector space model;
(5) compute the similarity between document d and each text of the existing topics, record the maximum similarity $S_{max}$, and find the corresponding topic class T;
(6) if the maximum similarity $S_{max}$ exceeds the preset threshold $T_c$, cluster document d into topic class T; otherwise go to step (3);
(7) the clustering pass ends.
CN201711489608.5A 2017-12-30 2017-12-30 Online topic big data detection method for network Active CN108197259B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711489608.5A CN108197259B (en) 2017-12-30 2017-12-30 Online topic big data detection method for network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711489608.5A CN108197259B (en) 2017-12-30 2017-12-30 Online topic big data detection method for network

Publications (2)

Publication Number Publication Date
CN108197259A (en) 2018-06-22
CN108197259B (en) 2024-03-05

Family

ID=62587443

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711489608.5A Active CN108197259B (en) 2017-12-30 2017-12-30 Online topic big data detection method for network

Country Status (1)

Country Link
CN (1) CN108197259B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111982393A (en) * 2020-08-27 2020-11-24 天津科技大学 Real-time monitoring vacuum instrument

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102937960A (en) * 2012-09-06 2013-02-20 北京邮电大学 Device and method for identifying and evaluating emergency hot topic
CN105488092A (en) * 2015-07-13 2016-04-13 中国科学院信息工程研究所 Time-sensitive self-adaptive on-line subtopic detecting method and system
CN105718598A (en) * 2016-03-07 2016-06-29 天津大学 AT based time model construction method and network emergency early warning method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111982393A (en) * 2020-08-27 2020-11-24 天津科技大学 Real-time monitoring vacuum instrument
CN111982393B (en) * 2020-08-27 2021-11-19 天津科技大学 Real-time monitoring vacuum instrument

Also Published As

Publication number Publication date
CN108197259B (en) 2024-03-05

Similar Documents

Publication Publication Date Title
Rousseau et al. Main core retention on graph-of-words for single-document keyword extraction
CN108052593A (en) A kind of subject key words extracting method based on descriptor vector sum network structure
JP5092165B2 (en) Data construction method and system
CN106156372B (en) A kind of classification method and device of internet site
CN111708740A (en) Mass search query log calculation analysis system based on cloud platform
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
CN110020189A (en) A kind of article recommended method based on Chinese Similarity measures
CN101593200A (en) Chinese Web page classification method based on the keyword frequency analysis
CN107992542A (en) A kind of similar article based on topic model recommends method
CN108763348B (en) Classification improvement method for feature vectors of extended short text words
CN111767725B (en) Data processing method and device based on emotion polarity analysis model
CN109670039A (en) Sentiment analysis method is commented on based on the semi-supervised electric business of tripartite graph and clustering
CN104484343A (en) Topic detection and tracking method for microblog
CN103268350A (en) Internet public opinion information monitoring system and monitoring method
CN105653518A (en) Specific group discovery and expansion method based on microblog data
CN110502640A (en) A kind of extracting method of the concept meaning of a word development grain based on construction
CN104978332B (en) User-generated content label data generation method, device and correlation technique and device
CN106980651B (en) Crawling seed list updating method and device based on knowledge graph
CN105138558A (en) User access content-based real-time personalized information collection method
CN103530402A (en) Method for identifying microblog key users based on improved Page Rank
CN103246732A (en) Online Web news content extracting method and system
CN110990718A (en) Social network model building module of company image improving system
CN109992784A (en) A kind of heterogeneous network building and distance metric method for merging multi-modal information
CN107832467A (en) A kind of microblog topic detecting method based on improved Single pass clustering algorithms
CN110110220A (en) Merge the recommended models of social networks and user's evaluation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant