CN108197259A - Online topic big data detection method for network - Google Patents
Online topic big data detection method for network
- Publication number
- CN108197259A (application number CN201711489608.5A)
- Authority
- CN
- China
- Prior art keywords
- text
- topic
- big data
- online
- detection method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to an online topic big data detection method for networks, whose main technical features are: crawling big data network text online; extracting and representing text features; and, using the Single-Pass clustering algorithm, selecting multiple similarity factors to perform topic detection. The design is sound: building on the existing Single-Pass algorithm, the invention analyzes text features and applies multiple similarity calculation methods, assigning different weight factors and obtaining the text similarity as a weighted combination. The miss rate, false-detection rate and cost function value all decrease, and the clustering effect is markedly improved.
Description
Technical Field
The invention belongs to the technical field of computer data modeling, and in particular relates to an online topic big data detection method for networks.
Background
Compared with traditional channels of information dissemination, the network is more open and more virtual. With the arrival of the self-media era, differing viewpoints, remarks and emotions ferment and amplify in cyberspace, and some of them develop into network public opinion events. Against the background of China's current effort to build itself into a strong network nation, network public opinion analysis receives close attention. As for the history of public opinion research, work abroad began in the mid-19th century, while research on network public opinion in China started somewhat later; there, network public opinion is defined as the differing emotions, attitudes and opinions that the public expresses and spreads through the Internet.
Topic discovery is a key part of network public opinion analysis. Current research concentrates on the choice of text clustering algorithm, such as partition-based, hierarchy-based, density-based and grid-based clustering algorithms, among which Single-Pass clustering is the most commonly used. Because the Single-Pass clustering algorithm uses a single similarity calculation method and does not take the structural features of the text into account, clustering accuracy suffers, and the miss rate and false-detection rate are relatively high.
Summary of the Invention
The object of the invention is to overcome the deficiencies of the prior art by proposing an online topic big data detection method for networks that addresses the high miss rate and false-detection rate of existing clustering algorithms.
The invention solves this technical problem with the following technical scheme:
An online topic big data detection method for networks, comprising the following steps:
Step 1: crawl big data network text online;
Step 2: extract and represent text features;
Step 3: using the Single-Pass clustering algorithm, select multiple similarity factors and perform topic detection.
Further, step 1 is implemented as follows: build a Hadoop distributed cluster, install CentOS and a distributed topic crawler on every machine, and thereby build a big data acquisition platform.
Further, step 2 is implemented through the following steps:
(1) Text preprocessing: perform feature selection on the text using the document frequency technique, selecting from the candidate set of feature words those that characterize the text most strongly as feature items, and assign weights to the selected feature items using the classical weighting method TF-IDF;
(2) Building the text representation model: represent the text using the vector space model, whose mathematical expression is:
VSM(d) = <(t_1, ω_1); (t_2, ω_2); (t_3, ω_3); …; (t_n, ω_n)>
where n is the number of feature items, t_i (1 ≤ i ≤ n) is a feature item of the text, and ω_i (1 ≤ i ≤ n) is the weight corresponding to feature item t_i. Establishing the VSM represents the text as a vector in an n-dimensional space.
Further, the multiple similarity factors selected in step 3 include a time factor, a place factor and a source factor.
Further, the time factor is:
[formula image missing]
where Sim_time(d_i, d_j) denotes the time distance between documents d_i and d_j, t = |t_i − t_j|, and m is an automatically set time interval.
Further, the place factor is:
[formula image missing]
where Sim_place(p_i, p_j) is the similarity of two place names, deep(p_i ∩ p_j) is the common depth of place names p_i and p_j from the root node in the geographic tree, deep(p_i) is the depth of p_i from the root node, and deep(p_j) is the depth of p_j from the root node.
Further, the source factor is:
[formula image missing]
where PR(p) denotes the PR value of website p, d is the damping coefficient, a is a proportion factor reflecting whether a linking page is an external link, V1 is the set of linking pages that are not on the same website as page p, C_i is the total number of outgoing links of page i, V2 is the set of linking pages that belong to the same website as page p, and C_j is the total number of outgoing links of page j.
Further, the damping coefficient d is 0.85 and the proportion factor a is 0.75.
Further, topic detection in step 3 comprises the following steps:
(1) input a news document d;
(2) determine whether d is the first news report; if so, go to step (3), otherwise go to step (4);
(3) create a new topic, add document d to it, and go to step (7);
(4) preprocess document d and build its vector space model;
(5) compute the similarity between document d and each text of the existing topics, record the maximum similarity S_max and find the corresponding topic class T;
(6) if the maximum similarity S_max exceeds the preset threshold T_c, cluster document d into topic class T; otherwise go to step (3);
(7) this clustering pass ends.
The advantages and positive effects of the invention are:
Building on the existing Single-Pass algorithm, the invention analyzes text features and applies multiple similarity calculation methods; by assigning different weight factors, it obtains the text similarity as a weighted combination. The miss rate, false-detection rate and cost function value all decrease, and the clustering effect is markedly improved.
Brief Description of the Drawings
Fig. 1 is the detection flow chart of the invention;
Fig. 2 shows the false-detection rate versus miss rate curves of the invention and of several detection methods;
Fig. 3 compares the evaluation metrics of the invention and of several detection methods.
Detailed Description
Embodiments of the invention are described further below with reference to the drawings.
An online topic big data detection method for networks, as shown in Fig. 1, comprises the following steps:
Step 1: crawl big data network text online.
In this step, a Hadoop distributed cluster is built, 64-bit CentOS and a distributed topic crawler are installed on every machine, and a big data acquisition platform is built.
Step 2: extract and represent text features.
In this step, the word-segmented data undergo feature selection and weighting, and the text is then converted into computable structured data by building a vector space model. The specific method is:
(1) Text preprocessing: perform feature selection on the text using the document frequency technique, selecting from the candidate set of feature words those that characterize the text most strongly as feature items, and assign weights to the selected feature items using the classical weighting method TF-IDF.
(2) Building the text representation model: represent the text using the vector space model (VSM), whose mathematical expression is:
VSM(d) = <(t_1, ω_1); (t_2, ω_2); (t_3, ω_3); …; (t_n, ω_n)>
where n is the number of feature items, t_i (1 ≤ i ≤ n) is a feature item of the text, and ω_i (1 ≤ i ≤ n) is the weight corresponding to feature item t_i. Establishing the VSM represents the text as a vector in an n-dimensional space.
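The preprocessing and representation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names, the `min_df` cutoff and the unsmoothed IDF formula `log(N / df)` are assumptions, and documents are assumed to be already word-segmented into token lists.

```python
import math
from collections import Counter

def select_features(docs, min_df=2, max_features=1000):
    """Document-frequency feature selection (min_df is a hypothetical cutoff)."""
    # Count in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    kept = [t for t, n in df.most_common() if n >= min_df]
    return kept[:max_features], df

def tfidf_vector(doc, features, df, n_docs):
    """Return the VSM form [(t_1, w_1), ..., (t_n, w_n)] of one tokenized document."""
    tf = Counter(doc)
    length = len(doc) or 1  # guard against empty documents
    vec = []
    for t in features:
        idf = math.log(n_docs / df[t])  # plain IDF; smoothed variants are common
        vec.append((t, tf[t] / length * idf))
    return vec

docs = [
    ["flood", "rescue", "city"],
    ["flood", "city", "traffic"],
    ["match", "goal", "city"],
]
features, df = select_features(docs, min_df=2)
vsm = tfidf_vector(docs[0], features, df, len(docs))
```

In this toy corpus "city" occurs in every document, so its IDF and therefore its weight are zero, which shows why TF-IDF favours terms that discriminate between texts.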
Step 3: using the Single-Pass clustering algorithm, select multiple similarity factors and perform topic detection.
The invention uses the Single-Pass clustering algorithm and selects time, place and source as similarity factors, where:
(1) Time factor: the concept of time distance is introduced.
[formula image missing]
Sim_time(d_i, d_j) denotes the time distance between documents d_i and d_j, where t = |t_i − t_j| is the difference between the times of the two documents and m is an automatically set time interval.
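The text defines t and m but does not reproduce the formula itself, so the following Python sketch uses one plausible form purely as an assumption: similarity decays linearly from 1 to 0 as the gap t approaches the window m. The function name `sim_time` and the seven-day default window are hypothetical.

```python
from datetime import datetime

def sim_time(t_i, t_j, m_days=7.0):
    """Hypothetical time similarity: 1 - t/m inside the window m, 0 outside it."""
    # t = |t_i - t_j| expressed in days.
    t = abs((t_i - t_j).total_seconds()) / 86400.0
    return max(0.0, 1.0 - t / m_days)

same_day = sim_time(datetime(2017, 12, 30), datetime(2017, 12, 30))
half_window = sim_time(datetime(2017, 12, 30, 12), datetime(2017, 12, 27))
outside = sim_time(datetime(2017, 12, 30), datetime(2017, 12, 20))
```

Any monotonically decreasing function of t/m, such as an exponential decay, would fit the description equally well.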
(2) Place factor: using the data of the "Chinese Gazetteer of Place Names" and the "National Place Name and Feature Dictionary" provided by Datatang, a geographic tree rooted at China is built, and each place name is represented as a node of the tree according to the subordination relations between place names.
Each child node in the geographic tree is a branch of its parent node. Computing the similarity of two place names in the tree must take into account the distance between the two nodes, their common depth, and the depth of each node from the root node. Considering the influence of these three factors on the similarity of two place names, the calculation formula is defined as follows:
[formula image missing]
Sim_place(p_i, p_j) is the similarity of two place names, where deep(p_i ∩ p_j) is the common depth of place names p_i and p_j from the root node in the geographic tree, deep(p_i) is the depth of p_i from the root node, and deep(p_j) is the depth of p_j from the root node.
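Since the place formula is not reproduced in the text, the sketch below assumes a common depth-ratio form, 2·deep(p_i ∩ p_j) / (deep(p_i) + deep(p_j)), built only from the three quantities the text defines. The toy tree, the `parent` mapping and the convention that the root has depth 1 are illustrative assumptions.

```python
def path_from_root(parent, place):
    """Return the list of nodes from the root down to `place`."""
    path = []
    while place is not None:
        path.append(place)
        place = parent[place]
    return path[::-1]

def sim_place(parent, p_i, p_j):
    """Assumed form: 2 * common depth / (depth of p_i + depth of p_j)."""
    path_i = path_from_root(parent, p_i)
    path_j = path_from_root(parent, p_j)
    common = 0
    for a, b in zip(path_i, path_j):
        if a != b:
            break
        common += 1  # shared ancestors, counting the root as depth 1
    return 2.0 * common / (len(path_i) + len(path_j))

# Toy geographic tree rooted at China (child -> parent).
parent = {
    "China": None,
    "Tianjin": "China",
    "Hebei": "China",
    "Shijiazhuang": "Hebei",
    "Tangshan": "Hebei",
}
same_province = sim_place(parent, "Shijiazhuang", "Tangshan")
different_province = sim_place(parent, "Tianjin", "Shijiazhuang")
```

Under this assumed form, two cities in the same province score higher than places that share only the root, and a place compared with itself scores 1.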
(3) Source factor: the PR value of a page is computed with an improved PageRank algorithm, using the following formula:
[formula image missing]
PR(p) denotes the PR value of website p; d is the damping coefficient, usually set to 0.85; a is the proportion factor reflecting whether a linking page is an external link. Compared with pages inside the site, external pages better reflect the importance of the website to which the page belongs, so a is set to 0.75. V1 is the set of linking pages that are not on the same website as page p, C_i is the total number of outgoing links of page i, V2 is the set of linking pages that belong to the same website as page p, and C_j is the total number of outgoing links of page j.
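The improved PageRank formula is missing from the text, so the sketch below assumes the natural extension of the classic recurrence: PR(p) = (1 − d) + d·(a·Σ_{i∈V1} PR(i)/C_i + (1 − a)·Σ_{j∈V2} PR(j)/C_j). The function name, the data layout (`site_of`, `links`) and the fixed iteration count are hypothetical; only d = 0.85 and a = 0.75 come from the source.

```python
def improved_pagerank(site_of, links, d=0.85, a=0.75, iterations=50):
    """Assumed recurrence weighting external in-links by a and internal ones by 1 - a.

    site_of: page -> website it belongs to (every linked page must be listed)
    links:   page -> list of pages it links to
    """
    pr = {p: 1.0 for p in site_of}
    inlinks = {p: [] for p in site_of}
    for src, targets in links.items():
        for dst in targets:
            inlinks[dst].append(src)
    for _ in range(iterations):
        new = {}
        for p in site_of:
            # V1: linking pages on a different website than p.
            external = sum(pr[i] / len(links[i])
                           for i in inlinks[p] if site_of[i] != site_of[p])
            # V2: linking pages on the same website as p.
            internal = sum(pr[j] / len(links[j])
                           for j in inlinks[p] if site_of[j] == site_of[p])
            new[p] = (1 - d) + d * (a * external + (1 - a) * internal)
        pr = new
    return pr

site_of = {"E": "y.example", "T": "x.example", "F": "z.example", "U": "z.example"}
links = {"E": ["T"], "F": ["U"]}  # E -> T is external, F -> U is internal
pr = improved_pagerank(site_of, links)
```

With a = 0.75, page T, which is linked from another site, ends up with a higher PR value than page U, which is linked only from its own site, matching the stated intent that external links carry more weight.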
In this step, after the time factor, place factor and source factor have been selected as the multiple similarity factors, online topic detection is carried out. The input is the set of news report documents and the similarity threshold T_c; the output is a number of topic classes. The procedure is as follows:
(1) Input a news document d.
(2) Determine whether d is the first news report; if so, go to step (3), otherwise go to step (4).
(3) Create a new topic, add document d to it, and go to step (7).
(4) Preprocess document d and build its vector space model.
(5) Compute the similarity between document d and each text of the existing topics, record the maximum similarity S_max and find the corresponding topic class T.
(6) If the maximum similarity S_max exceeds the preset threshold T_c, cluster document d into topic class T; otherwise go to step (3).
(7) This clustering pass ends.
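The seven steps above amount to a standard Single-Pass loop, sketched here in Python. The cosine similarity on sparse weight dictionaries is only a stand-in assumption for the similarity measure; the function names and the `similarity` parameter are hypothetical, and any combined measure could be passed in instead.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def single_pass(docs, threshold, similarity=cosine):
    """Assign each incoming document to a topic in one pass; return topic labels."""
    topics = []   # each topic is a list of member vectors
    labels = []
    for d in docs:
        best_sim, best_topic = -1.0, None
        # Step (5): compare d with every text of every existing topic.
        for k, topic in enumerate(topics):
            for member in topic:
                s = similarity(d, member)
                if s > best_sim:
                    best_sim, best_topic = s, k
        # Step (6): join the best topic only if S_max exceeds the threshold T_c.
        if best_topic is not None and best_sim > threshold:
            topics[best_topic].append(d)
            labels.append(best_topic)
        else:
            # Step (3): the first report, or one below threshold, opens a new topic.
            topics.append([d])
            labels.append(len(topics) - 1)
    return labels

docs = [
    {"flood": 1.0, "city": 0.5},
    {"flood": 0.9, "city": 0.6},
    {"match": 1.0, "goal": 0.8},
]
labels = single_pass(docs, threshold=0.5)
```

Because each document is compared only with what has already arrived, the result depends on input order, which is the usual trade-off of Single-Pass clustering for online streams.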
The invention combines multiple similarity calculation methods covering time, place name and source; by assigning different weight factors, the total similarity of two texts is obtained as a weighted combination. Fig. 2 and Fig. 3 show, respectively, the false-detection rate versus miss rate curves and the comparison of evaluation metrics for the invention and the Single-Pass clustering method. As can be seen, the invention achieves a lower miss rate, false-detection rate and cost function value, and its detection performance is better.
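The weighted combination can be illustrated with a short Python sketch. The source does not state the weight values or whether a content similarity from the vector space model enters the sum, so the four factor names and all weights below are assumptions chosen only for illustration.

```python
def combined_similarity(sims, weights):
    """Weighted sum of per-factor similarities, normalized by the total weight."""
    total = sum(weights.values())
    return sum(weights[name] * sims[name] for name in weights) / total

# Hypothetical per-factor similarities of two texts and hypothetical weights.
sims = {"content": 0.8, "time": 0.5, "place": 1.0, "source": 0.2}
weights = {"content": 0.4, "time": 0.2, "place": 0.2, "source": 0.2}
total_sim = combined_similarity(sims, weights)
```

The resulting score is what would be compared against the threshold T_c during clustering; in practice the weights would be tuned on labelled data.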
It should be emphasized that the embodiments described here are illustrative rather than restrictive. The invention is therefore not limited to the embodiments described in the detailed description; any other embodiment derived by those skilled in the art from the technical scheme of the invention also falls within the scope of protection of the invention.
Claims (9)
1. An online topic big data detection method for networks, characterized by comprising the following steps:
Step 1: crawl big data network text online;
Step 2: extract and represent text features;
Step 3: using the Single-Pass clustering algorithm, select multiple similarity factors and perform topic detection.
2. The online topic big data detection method for networks according to claim 1, characterized in that step 1 is implemented as follows: build a Hadoop distributed cluster, install CentOS and a distributed topic crawler on every machine, and build a big data acquisition platform.
3. The online topic big data detection method for networks according to claim 1, characterized in that step 2 is implemented through the following steps:
(1) Text preprocessing: perform feature selection on the text using the document frequency technique, selecting from the candidate set of feature words those that characterize the text most strongly as feature items, and assign weights to the selected feature items using the classical weighting method TF-IDF;
(2) Building the text representation model: represent the text using the vector space model, whose mathematical expression is:
VSM(d) = <(t_1, ω_1); (t_2, ω_2); (t_3, ω_3); …; (t_n, ω_n)>
where n is the number of feature items, t_i (1 ≤ i ≤ n) is a feature item of the text, and ω_i (1 ≤ i ≤ n) is the weight corresponding to feature item t_i. Establishing the VSM represents the text as a vector in an n-dimensional space.
4. The online topic big data detection method for networks according to claim 1, characterized in that the multiple similarity factors selected in step 3 include a time factor, a place factor and a source factor.
5. The online topic big data detection method for networks according to claim 4, characterized in that the time factor is:
[formula image missing]
where Sim_time(d_i, d_j) denotes the time distance between documents d_i and d_j, t = |t_i − t_j|, and m is an automatically set time interval.
6. The online topic big data detection method for networks according to claim 4, characterized in that the place factor is:
[formula image missing]
where Sim_place(p_i, p_j) is the similarity of two place names, deep(p_i ∩ p_j) is the common depth of place names p_i and p_j from the root node in the geographic tree, deep(p_i) is the depth of p_i from the root node, and deep(p_j) is the depth of p_j from the root node.
7. The online topic big data detection method for networks according to claim 4, characterized in that the source factor is:
[formula image missing]
where PR(p) denotes the PR value of website p, d is the damping coefficient, a is a proportion factor reflecting whether a linking page is an external link, V1 is the set of linking pages that are not on the same website as page p, C_i is the total number of outgoing links of page i, V2 is the set of linking pages that belong to the same website as page p, and C_j is the total number of outgoing links of page j.
8. The online topic big data detection method for networks according to claim 7, characterized in that the damping coefficient d is 0.85 and the proportion factor a is 0.75.
9. The online topic big data detection method for networks according to claim 1, characterized in that topic detection in step 3 comprises the following steps:
(1) input a news document d;
(2) determine whether d is the first news report; if so, go to step (3), otherwise go to step (4);
(3) create a new topic, add document d to it, and go to step (7);
(4) preprocess document d and build its vector space model;
(5) compute the similarity between document d and each text of the existing topics, record the maximum similarity S_max and find the corresponding topic class T;
(6) if the maximum similarity S_max exceeds the preset threshold T_c, cluster document d into topic class T; otherwise go to step (3);
(7) this clustering pass ends.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711489608.5A CN108197259B (en) | 2017-12-30 | 2017-12-30 | Online topic big data detection method for network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711489608.5A CN108197259B (en) | 2017-12-30 | 2017-12-30 | Online topic big data detection method for network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108197259A true CN108197259A (en) | 2018-06-22 |
CN108197259B CN108197259B (en) | 2024-03-05 |
Family
ID=62587443
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711489608.5A Active CN108197259B (en) | 2017-12-30 | 2017-12-30 | Online topic big data detection method for network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108197259B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102937960A (en) * | 2012-09-06 | 2013-02-20 | 北京邮电大学 | Device and method for identifying and evaluating emergency hot topic |
CN105488092A (en) * | 2015-07-13 | 2016-04-13 | 中国科学院信息工程研究所 | Time-sensitive self-adaptive on-line subtopic detecting method and system |
CN105718598A (en) * | 2016-03-07 | 2016-06-29 | 天津大学 | AT based time model construction method and network emergency early warning method |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111982393A (en) * | 2020-08-27 | 2020-11-24 | 天津科技大学 | Real-time monitoring vacuum instrument |
CN111982393B (en) * | 2020-08-27 | 2021-11-19 | 天津科技大学 | Real-time monitoring vacuum instrument |
Also Published As
Publication number | Publication date |
---|---|
CN108197259B (en) | 2024-03-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||