CN110147439A - News event detection method and system based on big data processing technology
- Publication number: CN110147439A
- Application number: CN201810792930.3A
- Authority: CN (China)
- Prior art keywords: news, event, topic, URL, web page
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06F16/335: Filtering based on additional data, e.g. user or group profiles
- G06F16/345: Summarisation for human users
- G06F16/35: Clustering; Classification
- G06F16/951: Indexing; Web crawling techniques
- G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295: Named entity recognition
Abstract
The present invention relates to a news event detection method based on big data processing technology, comprising the following steps: S1. crawling news data from news portal websites using static web page crawler technology and dynamic web page crawler technology respectively; S2. filtering the noise in the news data, then performing text deduplication, named entity recognition, and automatic text summarization on the news data; S3. detecting news events from the news data through word segmentation, feature extraction, feature dimensionality reduction, and text clustering, and tracking the events to form news topics; S4. displaying the finally detected news topic information through an interface.
Description
Technical field
The present invention relates to the technical field of topic detection and tracking, and more particularly to a news event detection method and system based on big data processing technology.
Background
In recent years, online news has flourished and has become an indispensable part of people's lives. With its fast dissemination, multimedia content, global reach, and interactivity, online news is gradually replacing traditional mass media such as newspapers and broadcasting, and has become an important way for people to obtain the latest information.
At the same time, because information on the internet is growing explosively, the data held by enterprise web platforms is becoming ever larger, which makes it difficult to process such massive data effectively with conventional software frameworks. Big data processing technology emerged to cope with this growth and has developed rapidly in recent years. Among these technologies, Spark is a distributed cluster computing system that supports high-speed computation. It uses resilient distributed datasets (RDDs) to store collections of objects and provides a distributed machine learning library, MLlib, for running machine learning algorithms in parallel. This gives important support to large-scale user access and to big data mining and analysis on enterprise platforms.
In addition, with the arrival of the big data era, traditional relational databases struggle to cope with the storage of massive data and with highly concurrent data access. NoSQL (non-relational) databases were proposed to solve these problems. Among them, Couchbase is an open-source, document-oriented, distributed NoSQL database. It offers a flexible data model, elastic and easy scaling, and high availability, which makes it well suited to storing large volumes of news documents.
Today, people can usually browse only the news of the current day or of the last few days on each portal website, and it is difficult to obtain news event information that discusses the same topic over a longer period. For a specific event, it is also hard for users to obtain comprehensive news coverage of the event and to trace how the event has developed. To solve this problem, many researchers have studied methods and systems for news event detection.
Patent publication CN103198078A proposes a method and system for analyzing the reporting trends of internet news events. The system first collects and screens news information according to the characteristic information of news events, then analyzes the news data to obtain the themes of the news events, derives transition themes from the themes and associated information in different periods, and finally displays the themes in chronological order of development according to the number of related reports. However, that patent does not describe the method or the detailed process by which news information is analyzed to generate news themes.
Patent publication CN107145568A proposes a fast news event clustering system and method. The system comprises a news crawling module, a news text preliminary processing module, a news text event clustering module, and a data storage module. During clustering, the system permutes and combines the word segmentation results and maps each document to a first-layer cluster, then calculates the distance between the document and the sub-clusters, and finally, according to the result, either determines the cluster to which the document belongs or creates a new sub-cluster. However, that patent describes only the clustering procedure; it does not explain the design of the system's modules or their specific processing flows.
Neither of the methods and systems disclosed in the above patents considers how to perform event detection quickly and effectively on massive news data. In a big data environment, an event detection system must be efficient, stable, easy to scale, and highly available. It must be able to detect news events and topic information efficiently from a large amount of news data and present this information to users in a friendly way through web pages.
Summary of the invention
To address the deficiencies of the prior art, the invention proposes a news event detection method based on big data processing technology. Using the big data processing framework Spark and the NoSQL database Couchbase, the method groups related news reports into news events and tracks how the events develop, so that users can grasp comprehensive information about a news event and follow its development.
To achieve the above object, the following technical solution is adopted:
A news event detection method based on big data processing technology, comprising the following steps:
S1. Crawl news data from news portal websites using static web page crawler technology and dynamic web page crawler technology respectively;
S2. Filter the noise in the news data, then perform text deduplication, named entity recognition, and automatic text summarization on the news data;
S3. Detect news events from the news data through word segmentation, feature extraction, feature dimensionality reduction, and text clustering, and track the events to form news topics;
S4. Display the finally detected news topic information through an interface.
Preferably, in step S1, the static web page crawler technology uses Scrapy to crawl static web pages: regular expression rules for the target URLs are defined first, seed URLs are then generated according to certain rules, and crawling starts from the seed URLs; when a crawled page URL matches a predefined regular expression rule and has not been crawled before, the URL is added to the URL queue. The dynamic web page crawler technology uses HTTP requests and responses to crawl dynamic web pages: the HTTP request parameters of the target page URL are analyzed first, an HTTP request message is then constructed from the analysis result, with parameters such as its request line and request headers set, and sent to the target host; finally, the message body of the HTTP response is parsed, and page URLs are extracted from it and added to the URL queue. For the page URLs extracted by both the static and the dynamic crawler technology, XPath or jsoup is used to parse the pages and extract news data: a URL is taken from the URL queue and its page is accessed, then the HTML DOM structure of the page is parsed, and news data such as the headline, publication time, category, and body text are extracted.
Preferably, in step S2, noise in the news body is first filtered using regular expression rules; duplicate texts are then detected in the news data and removed; next, the named entities of the news body are extracted using the named entity recognition module of FNLP, and a summary of the news body is generated automatically using TextRank4ZH; finally, the filtered news data, together with the named entities and summary of the news body, is stored in the Couchbase database.
Preferably, in step S3, news data of the specified category and publication time is first queried from Couchbase, and the news reports are sorted in ascending order of publication time; the news body is then segmented using the word segmentation module of FNLP, and stop words are removed from the segmentation result according to Chinese and English stop word lists; next, the body of each news document is converted into a high-dimensional feature vector using TF-IDF, and the feature vectors are reduced in dimension using principal component analysis (PCA); the news documents are then clustered using the Single-Pass algorithm with a time window to obtain news events, and the events are tracked using the Single-Pass algorithm to form news topics; finally, the news event and topic information is stored in the Couchbase database.
Preferably, in step S4, when displaying the summary information of a topic, the topic information for the specified algorithm type and parameters is queried first; the latest event in the topic is then taken as the representative event of the topic; the earliest published news report in that latest event is then taken as the representative news report of the topic; finally, the title, publication time, and body summary of the representative news report are displayed on the web page as the summary information of the topic. When displaying the details of a topic, the event list of the topic is first obtained by topic ID, and the title and publication time of each event are read from the list to form the event tracking information; the news report list of the latest event in the event list is then obtained, and the title, publication time, source, and URL of each report are read from it to form the event detection information; next, the earliest news report in the latest event is obtained, and its title and body summary are used as the title and summary of the topic; finally, the above information is displayed on the web page as the details of the topic.
The present invention also provides a system that applies the above method. The specific scheme is as follows:
The system comprises a web crawler module, a data preprocessing module, an event detection module, and an event display module, where the web crawler module executes step S1, the data preprocessing module executes step S2, the event detection module executes step S3, and the event display module executes step S4.
Preferably, the web crawler module includes a static web page crawler submodule, a dynamic web page crawler submodule, and a web page parsing submodule.
Compared with the prior art, the beneficial effects of the invention are:
1. Through the complete pipeline of data collection, data preprocessing, event detection, and event display, the invention converts raw internet news data into news event and topic information and presents it to users in a friendly way through a web interface, so that users can grasp comprehensive information on each news topic and follow the development of news events.
2. The invention performs event detection on the distributed cluster computing framework Spark and the NoSQL database Couchbase, which effectively supports the mining and analysis of large volumes of news data and thereby improves the efficiency of event detection. The resulting system is stable, easy to scale horizontally, and highly available.
Brief description of the drawings
Fig. 1: Overall system flowchart
Fig. 2: Web crawler module flowchart
Fig. 3: Data preprocessing module flowchart
Fig. 4: Event detection module flowchart
Fig. 5: Event display module flowchart
Fig. 6: Flowchart of the Single-Pass algorithm with a time window
Fig. 7: Number of events obtained by event detection
Fig. 8: Clustering time of event detection
Fig. 9: Number of topics obtained by event tracking
Fig. 10: Clustering time of event tracking
Fig. 11: Effectiveness measures of event detection
Fig. 12: Total time of event detection
Fig. 13: Topic summary web page
Fig. 14: Topic details web page
Detailed description of the embodiments
The accompanying drawings are for illustrative purposes only and shall not be construed as limiting this patent.
The invention is further described below with reference to the drawings and embodiments.
Embodiment 1
Fig. 1 is the flowchart of the method provided by the invention. In practice, the system built with the method comprises four modules: a web crawler module, a data preprocessing module, an event detection module, and an event display module. The web crawler module uses static and dynamic web page crawler technology respectively to crawl the news data the system needs from news portal websites on the internet. The data preprocessing module first filters the noise in the raw news data, then performs text deduplication, named entity recognition, and automatic text summarization on the news data. The event detection module detects news events in the news data set through word segmentation, feature extraction, feature dimensionality reduction, and text clustering, and tracks the events to form news topics. The event display module displays the finally detected news topic information through a web interface, so that users can see at a glance the content of each news topic and how its events have developed.
The web crawler module uses static web page crawler technology to crawl news data from Phoenix News (ifeng) and China News Service, and dynamic web page crawler technology to crawl news data from Tencent News and Sina News. The specific processing flow of the web crawler module is shown in Fig. 2. The static web page crawler submodule uses Scrapy to crawl static web pages. It first defines the regular expression rules for the target URLs, then generates seed URLs according to certain rules, and starts crawling from the seed URLs; when a crawled page URL matches a predefined regular expression and has not been crawled before, the URL is added to the URL queue. The dynamic web page crawler submodule uses HTTP requests and responses to crawl dynamic web pages. It first analyzes the HTTP request parameters of the target page URL, then constructs an HTTP request message from the analysis result, sets parameters such as the request line and request headers of the message, and sends it to the target host; finally, it parses the message body of the HTTP response, extracts page URLs from it, and adds them to the URL queue. The web page parsing submodule uses XPath or jsoup to parse pages and extract news data. It first takes a URL from the URL queue and accesses the corresponding page, then parses the HTML DOM structure of the page and extracts news data such as the headline, publication time, category, and body text.
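The URL-queue rule described above (accept a URL only if it matches a target rule and has not been seen) can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the patent's Scrapy code; the class name `UrlFrontier`, the example domain, and the URL pattern are all hypothetical, and "already crawled" is approximated here by "already enqueued".

```python
import re
from collections import deque


class UrlFrontier:
    """URL queue that accepts a URL once, and only if it matches a target rule."""

    def __init__(self, patterns):
        # Hypothetical target-URL rules; real rules depend on each news site.
        self._rules = [re.compile(p) for p in patterns]
        self._seen = set()
        self._queue = deque()

    def offer(self, url):
        """Enqueue url if it matches a rule and was not seen before."""
        if url in self._seen:
            return False
        if not any(rule.fullmatch(url) for rule in self._rules):
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def take(self):
        """Return the next URL to visit, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


frontier = UrlFrontier([r"https://news\.example\.com/a/\d{8}/\d+\.html"])
added = [
    frontier.offer("https://news.example.com/a/20180718/1001.html"),
    frontier.offer("https://news.example.com/a/20180718/1001.html"),  # duplicate
    frontier.offer("https://news.example.com/about.html"),  # matches no rule
]
```

Both crawler submodules can feed the same queue, and the parsing submodule consumes it with `take()`.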
The data preprocessing module preprocesses the raw news data obtained by the web crawler module. Its specific processing flow is shown in Fig. 3. The module first filters the noise in the news body using regular expressions, then detects duplicate texts in the news data set and removes them, then extracts the named entities of the news body using the named entity recognition module of FNLP and generates a summary of the news body automatically using TextRank4ZH, and finally stores the cleaned news data, together with the named entities and summary of the news body, in the Couchbase database.
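The first two preprocessing steps can be sketched as follows. The patent does not list its regular expressions or say how duplicates are detected, so both are assumptions here: the two noise rules are invented examples, and deduplication is done by exact match on a hash of the cleaned text.

```python
import hashlib
import re

# Hypothetical noise rules; the patent does not list its regular expressions.
NOISE_RULES = [
    re.compile(r"<[^>]+>"),                     # leftover HTML tags
    re.compile(r"\(Editor in charge:[^)]*\)"),  # editor credit line
]


def clean_body(text):
    """Strip noise patterns and collapse whitespace."""
    for rule in NOISE_RULES:
        text = rule.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def drop_duplicates(bodies):
    """Keep the first article for each distinct cleaned body (exact-match dedup)."""
    seen = set()
    unique = []
    for body in bodies:
        cleaned = clean_body(body)
        digest = hashlib.sha1(cleaned.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(cleaned)
    return unique


articles = [
    "<p>Flood warning issued in the south.</p> (Editor in charge: Li)",
    "Flood warning issued   in the south.",
    "New high-speed rail line opens.",
]
result = drop_duplicates(articles)
```

Exact matching only catches reprints that are identical after cleaning; near-duplicate detection would need a similarity measure instead of a hash.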
The event detection module performs event detection using the Single-Pass algorithm, and the entire detection flow runs on Spark. The specific processing flow of the event detection module is shown in Fig. 4. In the Single-Pass event detection flow, news data of the specified category and publication time is first queried from Couchbase, and the news reports are sorted in ascending order of publication time. The news body is then segmented using the word segmentation module of FNLP, and stop words are removed from the segmentation result according to Chinese and English stop word lists. Next, the body of each news document is converted into a high-dimensional feature vector using TF-IDF, and the feature vectors are reduced in dimension using principal component analysis (PCA). The news documents are then clustered using the Single-Pass algorithm with a time window to obtain news events, and the events are tracked using the Single-Pass algorithm to form news topics. Finally, the news event and topic information is stored in the Couchbase database.
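The stop-word removal and TF-IDF steps can be illustrated with a small pure-Python sketch. It is not the Spark MLlib code the system uses: the stop word list is a tiny invented sample, the smoothed IDF formula log((N + 1) / (df + 1)) is an assumption, tokens are taken as already segmented, and the PCA step is omitted.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "in", "a", "of"}  # tiny illustrative list, not the real one


def tfidf_vectors(token_lists):
    """Return one sparse {term: weight} TF-IDF vector per tokenized document."""
    docs = [[t for t in tokens if t not in STOP_WORDS] for tokens in token_lists]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (an assumed formula).
    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    ["flood", "warning", "in", "the", "south"],
    ["flood", "rescue", "in", "the", "south"],
    ["rail", "line", "opens"],
]
vectors = tfidf_vectors(docs)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
```

Reports that share vocabulary score above zero, while reports with no terms in common score exactly zero; this similarity is what a clustering threshold is later compared against.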
The event display module builds a web system using the Spring framework to display the summary information and details of news topics. The specific processing flow of the event display module is shown in Fig. 5. When displaying the summary information of a topic, the module first queries the topic information for the specified algorithm type and parameters, then takes the latest event in the topic as the representative event of the topic, then takes the earliest published news report in that latest event as the representative news report of the topic, and finally displays the title, publication time, and body summary of the representative news report on the web page as the summary information of the topic. When displaying the details of a topic, the module first obtains the event list of the topic by topic ID and reads the title and publication time of each event from the list to form the event tracking information. It then obtains the news report list of the latest event in the event list and reads the title, publication time, source, and URL of each report from it to form the event detection information. Next, it obtains the earliest news report in the latest event and uses its title and body summary as the title and summary of the topic. Finally, it displays the above information on the web page as the details of the topic.
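The selection rule for the topic summary (latest event, then its earliest report) can be sketched as follows. The class and field names are hypothetical, and the sketch assumes that an event's time of occurrence is the publication time of its earliest report, which the text does not state explicitly.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Report:
    title: str
    published: datetime
    summary: str


@dataclass
class Event:
    reports: list  # list of Report

    @property
    def start(self):
        # Assumption: an event "occurs" when its earliest report is published.
        return min(r.published for r in self.reports)


def topic_summary(events):
    """Pick the earliest report of the latest event as the topic's summary."""
    latest_event = max(events, key=lambda e: e.start)
    representative = min(latest_event.reports, key=lambda r: r.published)
    return {
        "title": representative.title,
        "published": representative.published,
        "summary": representative.summary,
    }


topic = [
    Event([Report("Storm approaches coast", datetime(2018, 7, 10, 8), "Storm nears.")]),
    Event([
        Report("Rescue continues", datetime(2018, 7, 12, 15), "Second day."),
        Report("Rescue teams deployed", datetime(2018, 7, 12, 9), "Teams sent."),
    ]),
]
summary = topic_summary(topic)
```

The details view reuses the same representative report for the topic title and summary, and adds the per-event and per-report lists.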
In the event detection of step S3, the event detection flow based on Spark and Couchbase is implemented in the following steps:
1. Build the SparkConf configuration from information such as the URL of the Spark cluster manager, the Spark application name, and the Couchbase bucket name, and construct a JavaSparkContext from the SparkConf.
2. From the JavaSparkContext, construct a CouchbaseSparkContext for interacting with the Couchbase database under Spark.
3. Create the Couchbase cluster from the server host name, and open the corresponding bucket in the cluster by the specified bucket name.
4. Under Spark, build JSON documents from the cleaned news data and algorithm information and convert them into an RDD, then use the couchbaseDocumentRDD() function of couchbase-spark-connector to store the data in the Couchbase database in parallel.
5. Under Spark, use the couchbaseQuery() function of CouchbaseSparkContext to query, in parallel, the news data of the specified time interval and category from the Couchbase database, and sort it from earliest to latest by publication time.
6. Use Spark's map function to apply FNLP word segmentation in parallel to each element of the news body RDD, obtaining an RDD of the segmented word lists.
7. Use HashingTF from the TF-IDF module of Spark MLlib to convert the word list RDD into a TF feature vector RDD and cache it, then use IDFModel to convert the TF feature vector RDD into a TF-IDF feature vector RDD.
8. Use Spark MLlib to convert the TF-IDF feature vector RDD into a RowMatrix, compute the principal component matrix of the RowMatrix, and multiply the RowMatrix by the principal component matrix to obtain the reduced matrix, which is then converted into an RDD of reduced feature vectors.
9. Represent each news report by its reduced feature vector, and apply the Single-Pass algorithm with a time window to the news reports for incremental clustering. This detects the news events, whose information is stored in the Couchbase database. Fig. 6 shows the flowchart of the Single-Pass algorithm with a time window.
10. Sort the news events from earliest to latest by start time, then cluster the events using the Single-Pass algorithm to obtain news topics, and store the topic information in the Couchbase database.
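The incremental clustering in step 9 can be sketched as follows. Fig. 6 is not reproduced in this text, so the details are assumptions rather than the patent's exact procedure: an event is a candidate only if its newest report lies within the time window, similarity is the cosine between the report vector and the event centroid, and the centroid is the running mean of member vectors. All names are hypothetical.

```python
import math
from dataclasses import dataclass


def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


@dataclass
class Event:
    centroid: list    # mean feature vector of the member reports
    last_time: float  # publication time (hours) of the newest member
    reports: list     # ids of the member reports


def single_pass(reports, threshold, window_hours):
    """Cluster (id, time_in_hours, vector) tuples, sorted by time, into events."""
    events = []
    for report_id, time, vector in reports:
        best, best_sim = None, -1.0
        for event in events:
            if time - event.last_time > window_hours:
                continue  # event fell out of the time window
            sim = cosine(vector, event.centroid)
            if sim > best_sim:
                best, best_sim = event, sim
        if best is not None and best_sim >= threshold:
            n = len(best.reports)
            best.centroid = [(c * n + x) / (n + 1) for c, x in zip(best.centroid, vector)]
            best.reports.append(report_id)
            best.last_time = time
        else:
            events.append(Event(list(vector), time, [report_id]))
    return events


reports = [
    ("a", 0.0, [1.0, 0.0]),
    ("b", 1.0, [0.9, 0.1]),
    ("c", 2.0, [0.0, 1.0]),
    ("d", 30.0, [1.0, 0.0]),
]
narrow = [e.reports for e in single_pass(reports, threshold=0.8, window_hours=24)]
wide = [e.reports for e in single_pass(reports, threshold=0.8, window_hours=48)]
```

With the 24-hour window, report "d" starts a new event because every earlier event has aged out; with the 48-hour window it joins the first event. Step 10 could reuse the same routine with events in place of reports, presumably without the window.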
Embodiment 2
This embodiment describes a specific experiment on the method provided by the invention. The software environment used for system development and deployment in the experiment is shown in Table 1:
Table 1: System development and deployment software environment
The system uses 2 servers to build the Spark distributed computing cluster and the Couchbase database cluster, and the web system is deployed on the servers. The hardware configuration used for development and deployment is shown in Table 2:
Table 2: System development and deployment hardware configuration
Server | CPU | Memory |
Server 1 | Intel(R) Core(TM) i5-4570 CPU @ | 8GB |
Server 2 | Intel(R) Core(TM) i5-2450M CPU @ | 6GB |
Experimental results:
This experiment tests the system using news data in the "domestic" category obtained by the system's web crawler. The details of the experimental data are shown in Table 3.
Table 3: Details of the experimental data
This experiment tests and analyzes the results of event detection and event tracking. The following 6 items are tested:
(1) The number of events obtained by event detection
(2) The time spent on clustering in event detection
(3) The number of topics obtained by event tracking
(4) The time spent on clustering in event tracking
(5) The effectiveness measures of event detection
(6) The total time of event detection
The experiment first observes how changes in the similarity threshold and time window parameters of the Single-Pass algorithm with a time window, used in event detection, affect (1) and (2). It then observes how changes in the similarity threshold of the Single-Pass algorithm used in event tracking, and in the number of events, affect (3) and (4). Next, it uses three metrics, recall, precision, and F-measure, to evaluate the effectiveness of event detection. Finally, it compares the results on (6) of traditional single-machine processing and distributed cluster processing.
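This section of the patent does not give formulas for the three metrics. Assuming the standard definitions, with $D$ the set of reports assigned to a detected event and $G$ the set of reports that truly belong to the corresponding reference event (both symbols are introduced here for illustration), precision $P$, recall $R$, and the balanced F-measure $F$ are:

```latex
P = \frac{|D \cap G|}{|D|}, \qquad
R = \frac{|D \cap G|}{|G|}, \qquad
F = \frac{2PR}{P + R}
```

How detected events are matched to reference events, and how per-event scores are averaged, is not specified in the text and would need to be fixed before comparing results.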
Fig. 7 shows the experimental results for the number of events obtained during event detection. As Fig. 7 shows, for the same time window, the number of events obtained by event detection increases as the Single-Pass similarity threshold increases. The reason is that the larger the similarity threshold, the more likely the similarity between a news report and the existing events falls below the threshold, and the more likely the report is judged to be a new event, so more events are obtained in the end. For the same Single-Pass similarity threshold, the number of events gradually decreases as the time window increases. The reason is that the larger the time window, the more events each news report must be compared with, and the more likely the report is added to one of them, so fewer events are obtained in the end.
Fig. 8 shows the experimental results for the time spent on clustering with the time-windowed Single-Pass algorithm during event detection. As Fig. 8 shows, for the same time window, the clustering time increases as the Single-Pass similarity threshold increases. The main reason is that the larger the similarity threshold, the more news reports are judged to be new events and the more events the time window contains, so each news report must be compared with more events, which increases the clustering time. For the same Single-Pass similarity threshold, the clustering time also increases as the time window grows. The main reason is that the larger the time window, the more events it contains, so each news report must be compared with more events, which again increases the clustering time.
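The behaviour described for Figs. 7 and 8 can be reproduced with a minimal sketch of Single-Pass clustering with a time window. The sketch assumes that the window is measured from the most recent report of each event, that an event is represented by the running mean of its report vectors, and that similarity is cosine similarity; these choices, the `Event` class and the toy data are assumptions for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Event:
    centroid: list              # running mean of the member report vectors
    last_time: float            # publish time (hours) of the newest member
    reports: list = field(default_factory=list)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def single_pass(reports, threshold, window_hours):
    """Cluster (report_id, publish_time_hours, vector) tuples sorted by time."""
    events = []
    for report_id, time, vector in reports:
        best, best_sim = None, -1.0
        for event in events:
            if time - event.last_time > window_hours:
                continue  # event lies outside the time window, skip comparison
            sim = cosine(vector, event.centroid)
            if sim > best_sim:
                best, best_sim = event, sim
        if best is not None and best_sim >= threshold:
            # Join the most similar event and update its centroid.
            n = len(best.reports)
            best.centroid = [(c * n + v) / (n + 1) for c, v in zip(best.centroid, vector)]
            best.reports.append(report_id)
            best.last_time = time
        else:
            # No sufficiently similar event in the window: start a new event.
            events.append(Event(list(vector), time, [report_id]))
    return events


reports = [("a", 0, [1.0, 0.0]), ("b", 1, [0.9, 0.1]),
           ("c", 2, [0.0, 1.0]), ("d", 30, [1.0, 0.0])]
for threshold, window in [(0.8, 24), (0.8, 48), (0.999, 48)]:
    print(threshold, window, len(single_pass(reports, threshold, window)))
```

On this toy input, widening the window from 24 to 48 hours lets report "d" rejoin the first event, and raising the threshold splits "b" off into its own event, matching the trends reported above.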
Fig. 9 shows the experimental results for the number of topics obtained during event tracking. The experiment sets the event detection time window to 24 hours and the Single-Pass similarity threshold of event detection to 0.5, 0.6, 0.7 and 0.8, forming 3495, 4441, 5236 and 6062 events respectively. As Fig. 9 shows, for the same number of events, the number of topics obtained by event tracking increases as the Single-Pass similarity threshold of event tracking increases. The reason is that the larger the similarity threshold, the more likely it is that the similarity between an event and every existing topic falls below the threshold, so the event is more likely to be judged a new topic, and the final number of topics is larger. For the same Single-Pass similarity threshold, the number of topics also increases as the number of events increases, and the rate of increase is faster at higher similarity thresholds. The reason is that the more events there are, the more topics they involve, so event tracking yields more topics. However, at a low similarity threshold the topics formed by event tracking are broad and each topic contains many events, so the number of events has little influence on the number of topics; at a high similarity threshold the topics are narrow and each topic contains few events, so the number of events has a large influence on the number of topics.
Fig. 10 shows the experimental results for the time spent on clustering with the Single-Pass algorithm during event tracking. The experiment likewise sets the event detection time window to 24 hours and the Single-Pass similarity threshold of event detection to 0.5, 0.6, 0.7 and 0.8, forming 3495, 4441, 5236 and 6062 events respectively. Compared with Fig. 8, clustering during event tracking takes longer than clustering during event detection. This is because the clustering in event tracking has no time window and clusters all events, so it requires more time. As Fig. 10 shows, for the same number of events, the clustering time of event tracking increases as the Single-Pass similarity threshold of event tracking increases. The main reason is that the larger the similarity threshold, the more topics event tracking produces, so each event must be compared with more topics, which increases the clustering time. For the same event tracking Single-Pass similarity threshold, the clustering time of event tracking also increases as the number of events increases. The reason is that the more events there are, the more events must be compared with topics during clustering, so the clustering in event tracking takes longer.
Fig. 11 shows the experimental results for measuring the effect of event detection. The experiment uses as experimental data the time interval from Table 3 covering 1 to 2 November 2017, which contains 231 news reports. After manual annotation, these reports were determined to comprise 87 events, of which the event with the most news reports has 23 reports and the smallest events have only 1 report. The experiment sets the event detection time window to 48 hours. It uses three metrics for measuring clustering quality, recall, precision and F value (F-Measure), as the measures of event detection. Unlike recall, precision and F value in the conventional information retrieval (IR) field, the three metrics used in this experiment are defined as follows:
where n is the total number of news reports, n_i is the number of news reports in the i-th true cluster, n_j is the number of news reports in the j-th cluster obtained by the event detection algorithm, and n_ij is the number of news reports shared by the i-th true cluster and the j-th cluster obtained by the event detection algorithm. k is the number of true clusters and k' is the number of clusters obtained by the event detection algorithm. As Fig. 11 shows, as the Single-Pass similarity threshold of event detection increases, recall shows a rising trend, precision shows a slow downward trend, and the F value is almost unchanged. The reason is that as the similarity threshold increases, n_j and n_ij both increase while n_i stays constant, and n_j grows faster than n_ij, so recall rises and precision falls. The experimental results show that recall, precision and F value are all above 90%, which demonstrates the validity of the event detection.
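As a rough illustration of how counts such as n, n_i, n_j and n_ij can be turned into recall, precision and F value for clusters, the sketch below uses a common best-match cluster-evaluation scheme: each true cluster is paired with the detected cluster that gives the highest F value, and the scores are averaged with weights n_i / n. This scheme, the function name `cluster_scores` and the toy data are assumptions for illustration and are not claimed to be the exact definitions used in the experiment.

```python
def cluster_scores(true_clusters, detected_clusters):
    """Return (recall, precision, f_value) for detected clusters vs. true clusters."""
    n = sum(len(cluster) for cluster in true_clusters)
    if n == 0:
        return 0.0, 0.0, 0.0
    recall = precision = f_value = 0.0
    for true in true_clusters:
        true_set = set(true)
        best = (0.0, 0.0, 0.0)  # (F, recall, precision) of the best-matching detected cluster
        for detected in detected_clusters:
            detected_set = set(detected)
            n_ij = len(true_set & detected_set)   # reports shared by both clusters
            if n_ij == 0:
                continue
            r = n_ij / len(true_set)              # n_ij / n_i
            p = n_ij / len(detected_set)          # n_ij / n_j
            f = 2 * p * r / (p + r)
            best = max(best, (f, r, p))
        weight = len(true_set) / n                # n_i / n
        f_value += weight * best[0]
        recall += weight * best[1]
        precision += weight * best[2]
    return recall, precision, f_value


true_clusters = [["a", "b", "c"], ["d"]]
detected_clusters = [["a", "b"], ["c", "d"]]
print(cluster_scores(true_clusters, detected_clusters))
```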
Fig. 12 shows the experimental results for the time spent on the whole event detection process using the traditional single-machine processing technique and the distributed cluster processing technique respectively. In this experiment, the event detection time window is set to 24 hours and the Single-Pass similarity threshold of event tracking is set to 0.7. The single-machine processing technique uses server 1 and server 2 separately, and the distributed cluster processing technique uses a cluster built from server 1 and server 2. As Fig. 12 shows, for the same event detection Single-Pass similarity threshold, server 2 takes the longest, server 1 comes second, and the distributed cluster takes the least time: about half the time of server 2 and about 2/3 the time of server 1. The main reason is that the distributed cluster computes with distributed processing technology, which speeds up the computation, whereas server 1 and server 2 both use the traditional single-machine processing technique, so their computational efficiency is limited. Because the hardware configuration of server 1 is higher than that of server 2, server 1 takes less time than server 2. The experimental results show that, compared with the traditional single-machine processing technique, the big data processing technique used by this system is clearly faster, which demonstrates the efficiency of the system designed by the present invention in event detection.
System web interface display:
Fig. 13 shows the topic summary page of the system. This page is the home page of the system and mainly displays summary information for hot topics, including the topic title, the topic time and the topic summary. For this demonstration of the system web interface, the present invention selects the 2 topics containing the most events, namely the Luban Prize and the Selenga River.
Fig. 14 shows the topic details page of the system, for which the present invention selects the topic "Xi'an to Chengdu turnaround". The topic details page first shows the details of the most recent event of the topic, and then shows the event tracking information. The event details mainly include the event title, the event summary and the related news reports that the event contains. The event tracking information mainly includes the list of events related to the event content, and shows the occurrence time and title of each event ordered by occurrence time from latest to earliest.
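The ordering and selection logic of the topic details page can be sketched as follows. The sketch follows the description above and claim 5 (the latest event is shown in detail, and its earliest report supplies the title and summary); the `Report` and `Event` classes, the ISO timestamp strings and the function name `topic_details` are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Report:
    title: str
    published: str   # ISO 8601 timestamp; sorts chronologically as a string


@dataclass
class Event:
    title: str
    occurred: str    # ISO 8601 timestamp of the event
    reports: List[Report]


def topic_details(events):
    """Return (latest event, its earliest report, tracking list latest-to-earliest)."""
    if not events:
        raise ValueError("a topic must contain at least one event")
    ordered = sorted(events, key=lambda e: e.occurred, reverse=True)
    latest = ordered[0]
    # The earliest report of the latest event provides the title and summary.
    representative = min(latest.reports, key=lambda r: r.published)
    tracking = [(e.occurred, e.title) for e in ordered]
    return latest, representative, tracking


events = [
    Event("Event A", "2017-11-01T08:00", [Report("A1", "2017-11-01T09:00")]),
    Event("Event B", "2017-11-02T10:00", [Report("B2", "2017-11-02T12:00"),
                                          Report("B1", "2017-11-02T11:00")]),
]
latest, representative, tracking = topic_details(events)
print(latest.title, representative.title, tracking)
```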
Obviously, the above embodiments of the present invention are merely examples given to illustrate the present invention clearly, and are not a limitation on the embodiments of the present invention. For those of ordinary skill in the art, other variations or changes in different forms may be made on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included within the protection scope of the claims of the present invention.
Claims (7)
1. A news event detection method based on big data processing technology, characterized in that it comprises the following steps:
S1. crawling news data from news portal websites using static web page crawler technology and dynamic web page crawler technology respectively;
S2. filtering the noise in the news data, and then performing text deduplication, named entity recognition and automatic text summarization on the news data;
S3. detecting news events from the news data through word segmentation, feature extraction, feature dimensionality reduction and text clustering, and tracking the events to form news topics;
S4. displaying the finally detected news topic information through an interface.
2. The news event detection method based on big data processing technology according to claim 1, characterized in that: in step S1, the static web page crawler technology uses Scrapy to crawl static web pages: a regular expression rule for the target URLs to be crawled is defined first, seed URLs are then generated according to certain rules, and web pages are crawled starting from the seed URLs; when the URL of a crawled web page correctly matches the predefined regular expression rule and the URL has not been crawled before, the URL is added to the URL queue. The dynamic web page crawler technology uses HTTP request and response technology to crawl dynamic web pages: the HTTP request parameters of the target web page URL are analysed first, an HTTP request message is then constructed according to the analysis result, with parameters such as the request line and request headers set, and sent to the target host; finally the message body of the HTTP response message is parsed, and web page URLs are extracted from it and added to the URL queue. For the web page URLs extracted by the static web page crawler technology and the dynamic web page crawler technology, XPath or jsoup is used to parse the web pages and extract news data from them: a URL is first taken from the URL queue and its corresponding web page is accessed, the HTML DOM structure of the web page is then parsed, and news data such as the news title, publication time, category and body text are extracted from it.
3. The news event detection method based on big data processing technology according to claim 2, characterized in that: in step S2, noise in the news body text is first filtered using regular expression rules; duplicate texts are then detected in the news data and removed; the named entities of the news body text are then extracted using the named entity recognition module of FNLP, and the summary of the news body text is generated automatically using TextRank4ZH; finally the filtered news data, together with the named entities and summary information of the news body text, is stored in a Couchbase database.
4. The news event detection method based on big data processing technology according to claim 3, characterized in that: in step S3, news data of a specified category and publication time is first queried from Couchbase, and the news reports are sorted in ascending order of publication time; the news body text is then segmented using the word segmentation module of FNLP, and stop words are removed from the segmentation result according to Chinese and English stop word lists; the text of each news document is then converted into a high-dimensional feature vector using TF-IDF, and the dimensionality of the feature vector is reduced using PCA (principal component analysis); the Single-Pass algorithm with a time window is then used to cluster the news documents to obtain news events, and the events are tracked using the Single-Pass algorithm to form news topics; finally the news event and topic information is stored in the Couchbase database.
5. The news event detection method based on big data processing technology according to claim 4, characterized in that: in step S4, when displaying the summary information of a topic, the topic information for the specified algorithm type and parameters is queried first; the event that occurred latest in the topic is then obtained as the representative event of the topic; the news report published earliest in that latest event is then obtained as the representative news report of the topic; finally the title, publication time and body summary of the representative news report are displayed on the web page as the summary information of the topic. When displaying the details of a topic, the event list of the topic is first obtained according to the topic ID, and the title and publication time of each event are obtained from the event list to form the event tracking information; the news report list of the latest event in the event list is then obtained, and the title, publication time, source and URL of each news item are obtained from the news report list to form the event detection information; the earliest news report in the latest event is then obtained, and the title and body summary of that news report are taken as the title and summary of the topic; finally the above information is displayed on the web page as the details of the topic.
6. A system for the detection method according to any one of claims 1 to 5, characterized in that it comprises a web crawler module, a data preprocessing module, an event detection module and an event display module, wherein the web crawler module is used to execute step S1, the data preprocessing module is used to execute step S2, the event detection module is used to execute step S3, and the event display module is used to execute step S4.
7. The system according to claim 6, characterized in that: the web crawler module comprises a static web page crawler submodule, a dynamic web page crawler submodule and a web page parsing submodule.
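The URL handling recited in claim 2 (match each discovered URL against a predefined regular expression, skip URLs already seen, and append the rest to a URL queue) can be sketched in plain Python. This is a stand-in and does not use the Scrapy API; the function `collect_news_urls`, the `fetch_links` callback, the example.com URLs and the choice to keep crawling from matched pages are all assumptions for illustration.

```python
import re
from collections import deque


def collect_news_urls(seed_urls, target_pattern, fetch_links, max_pages=100):
    """Crawl from the seeds and return the queue of URLs matching the target rule."""
    rule = re.compile(target_pattern)
    frontier = deque(seed_urls)      # pages still to be visited
    seen = set(seed_urls)            # URLs that were already crawled or queued
    url_queue = deque()              # matched news URLs awaiting parsing
    visited = 0
    while frontier and visited < max_pages:
        page = frontier.popleft()
        visited += 1
        for link in fetch_links(page):
            # Keep a link only if it matches the rule and was not seen before.
            if link not in seen and rule.fullmatch(link):
                seen.add(link)
                url_queue.append(link)
                frontier.append(link)
    return url_queue


# Hypothetical link graph standing in for real HTTP requests.
site = {
    "https://news.example.com/": [
        "https://news.example.com/a/1.html",
        "https://news.example.com/about",
        "https://news.example.com/a/1.html",
    ],
    "https://news.example.com/a/1.html": ["https://news.example.com/a/2.html"],
}
queue = collect_news_urls(
    ["https://news.example.com/"],
    r"https://news\.example\.com/a/\d+\.html",
    lambda url: site.get(url, []),
)
print(list(queue))
```

Passing the link fetcher in as a callback keeps the sketch runnable without network access; in a real crawler that role would be played by the downloader and link extraction.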
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810792930.3A CN110147439A (en) | 2018-07-18 | 2018-07-18 | A kind of news event detecting method and system based on big data processing technique |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110147439A (en) | 2019-08-20 |
Family
ID=67589149
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810792930.3A Pending CN110147439A (en) | 2018-07-18 | 2018-07-18 | A kind of news event detecting method and system based on big data processing technique |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110147439A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102262635A (en) * | 2010-05-25 | 2011-11-30 | 北京启明星辰信息技术股份有限公司 | Page crawler system and page crawler method |
CN102831220A (en) * | 2012-08-23 | 2012-12-19 | 江苏物联网研究发展中心 | Subject-oriented customized news information extraction system |
CN103092936A (en) * | 2013-01-08 | 2013-05-08 | 华北电力大学(保定) | Real-time information acquisition method of dynamic page of Internet of Things |
CN104462253A (en) * | 2014-11-20 | 2015-03-25 | 武汉数为科技有限公司 | Topic detection or tracking method for network text big data |
CN107862039A (en) * | 2017-11-06 | 2018-03-30 | 工业和信息化部电子第五研究所 | Web data acquisition methods, system and Data Matching method for pushing |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110597981A (en) * | 2019-09-16 | 2019-12-20 | 西华大学 | Network news summary system for automatically generating summary by adopting multiple strategies |
CN110990705A (en) * | 2019-12-06 | 2020-04-10 | 腾讯科技(深圳)有限公司 | News processing method, device, equipment and medium |
CN110990705B (en) * | 2019-12-06 | 2024-04-12 | 深圳市雅阅科技有限公司 | News processing method, device, equipment and medium |
CN111324753B (en) * | 2020-01-22 | 2021-09-03 | 天窗智库文化传播(苏州)有限公司 | Media information publishing management method and system |
CN111291299A (en) * | 2020-01-22 | 2020-06-16 | 北京飞漫软件技术有限公司 | Method for directly obtaining local command execution result and local server |
CN111324753A (en) * | 2020-01-22 | 2020-06-23 | 天窗智库文化传播(苏州)有限公司 | Media information publishing management method and system |
CN111291299B (en) * | 2020-01-22 | 2023-08-15 | 北京飞漫软件技术有限公司 | Method for directly obtaining local command execution result and local server |
CN111460160A (en) * | 2020-04-02 | 2020-07-28 | 复旦大学 | Event clustering method for streaming text data based on reinforcement learning |
CN111460160B (en) * | 2020-04-02 | 2023-08-18 | 复旦大学 | Event clustering method of stream text data based on reinforcement learning |
CN111581480B (en) * | 2020-05-12 | 2023-09-08 | 杭州风远科技有限公司 | News information aggregation analysis method and system, terminal and storage medium |
CN111581480A (en) * | 2020-05-12 | 2020-08-25 | 杭州风远科技有限公司 | News information aggregation analysis method and system, terminal and storage medium |
CN111930936A (en) * | 2020-06-28 | 2020-11-13 | 山东师范大学 | Method and system for excavating platform message text |
CN112287254A (en) * | 2020-11-23 | 2021-01-29 | 武汉虹旭信息技术有限责任公司 | Webpage structured information extraction method and device, electronic equipment and storage medium |
CN112287254B (en) * | 2020-11-23 | 2023-10-27 | 武汉虹旭信息技术有限责任公司 | Webpage structured information extraction method and device, electronic equipment and storage medium |
CN112597269A (en) * | 2020-12-25 | 2021-04-02 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Stream data event text topic and detection system |
CN112818200A (en) * | 2021-01-28 | 2021-05-18 | 平安普惠企业管理有限公司 | Data crawling and event analyzing method and system based on static website |
CN113554538A (en) * | 2021-05-28 | 2021-10-26 | 四川社智雲科技有限公司 | Digital information integrated system for urban and rural community management |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190820 |