CN110147439A - News event detection method and system based on big data processing technology


Info

Publication number
CN110147439A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810792930.3A
Other languages
Chinese (zh)
Inventor
刘玉葆
吴杰锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University


Classifications

    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/345 Summarisation for human users
    • G06F16/35 Clustering; Classification
    • G06F16/951 Indexing; Web crawling techniques
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition


Abstract

The present invention relates to a news event detection method based on big data processing technology, comprising the following steps: S1. crawling news data from news portal websites using static web page crawling and dynamic web page crawling, respectively; S2. filtering noise out of the news data, then performing text deduplication, named entity recognition, and automatic summary generation on the news data; S3. detecting news events in the news data through word segmentation, feature extraction, feature dimensionality reduction, and text clustering, and tracking the events to form news topics; S4. displaying the finally detected news topic information through an interface.

Description

News event detection method and system based on big data processing technology
Technical field
The present invention relates to the technical field of topic detection and tracking, and more particularly to a news event detection method and system based on big data processing technology.
Background
In recent years, online news has flourished and has become an essential part of people's lives. With its fast dissemination, multimedia content, global reach, and interactivity, online news is gradually replacing traditional mass media such as newspapers and broadcasting, and has become an important way for people to obtain the latest information.
At the same time, because information on the Internet is growing explosively, the volume of data held by enterprise website platforms keeps increasing, and conventional software frameworks struggle to process such massive data effectively. Big data processing technology emerged to cope with this growth and has developed rapidly in recent years. Among these technologies, Spark is a distributed cluster computing system that supports high-speed computation. It stores collections of objects in resilient distributed datasets (RDDs) and provides a distributed machine learning library, MLlib, for running machine learning algorithms in parallel, which gives important support to mining and analyzing large-scale user-access data and other big data on enterprise platforms.
In addition, with the arrival of the big data era, traditional relational databases can no longer cope with the storage of massive data and with highly concurrent data access. NoSQL (non-relational) databases were proposed to solve these problems. Among them, Couchbase is an open-source, document-oriented, distributed NoSQL database. It has a flexible data model, scales elastically, and is highly available, which makes it well suited to storing large numbers of news documents.
Today, people can usually browse only news from the current day or the last few days on each portal website, and it is difficult to obtain news events that discuss the same topic over a longer period. For a specific event, it is also difficult for users to obtain comprehensive news coverage of that event or to follow how it has developed over time. To solve this problem, many scholars have studied methods and systems for news event detection.
The patent document with publication number CN103198078A proposes a method and system for analyzing reporting trends of Internet news events. The system first collects and filters news information according to the characteristic information of news events, then analyzes the news data to obtain the themes of the news events, measures theme transitions from the themes and associated information in different periods, and finally displays the number of reports related to each theme in chronological order. However, that patent does not describe the method or detailed process by which news information is analyzed to produce the themes.
The patent document with publication number CN107145568A proposes a fast news event clustering system and method. The system comprises a news crawling module, a news text preliminary processing module, a news text event clustering module, and a data storage module. During clustering, the system permutes and combines the word segmentation results, maps documents into first-layer clusters, computes the distance between each document and the sub-clusters, and finally uses the result to decide which cluster the document belongs to or to create a new sub-cluster. However, that patent describes only the clustering procedure itself; it does not explain the design or the specific processing flow of the individual system modules.
Neither of the methods and systems disclosed in the above patents considers how to perform event detection quickly and effectively on massive news data. In a big data environment, an event detection system must be efficient, stable, easy to scale, and highly available. It should detect news events and topic information efficiently from large volumes of news data and present this information to users in a friendly way through Web pages.
Summary of the invention
To address the deficiencies of the prior art, the invention proposes a news event detection method based on big data processing technology. Using the big data processing framework Spark and the NoSQL database Couchbase, the method groups related news reports into news events and tracks how those events develop, so that users can obtain a comprehensive view of a news event and follow its development.
To achieve the above object, the following technical solution is adopted:
A news event detection method based on big data processing technology, comprising the following steps:
S1. crawling news data from news portal websites using static web page crawling and dynamic web page crawling, respectively;
S2. filtering noise out of the news data, then performing text deduplication, named entity recognition, and automatic summary generation on the news data;
S3. detecting news events in the news data through word segmentation, feature extraction, feature dimensionality reduction, and text clustering, and tracking the events to form news topics;
S4. displaying the finally detected news topic information through an interface.
Preferably, in step S1, the static web page crawler uses Scrapy to crawl static pages. It first defines the regular expression rules for target URLs, then generates seed URLs according to certain rules, and then crawls pages starting from the seed URLs; when a crawled page URL matches the predefined regular expression rules and has not been crawled before, the URL is added to the URL queue. The dynamic web page crawler uses HTTP requests and responses to crawl dynamic pages. It first analyzes the HTTP request parameters of the target page URL, then constructs an HTTP request message according to the analysis, sets parameters such as the request line and request headers of the message, and sends it to the target host; finally it parses the message body of the HTTP response, extracts page URLs from it, and adds them to the URL queue. For the page URLs extracted by both the static and the dynamic crawler, XPath or jsoup is used to parse the pages and extract news data: a URL is first taken from the URL queue and its page is accessed, then the HTML DOM structure of the page is analyzed, and news data such as the headline, publication time, category, and body text are extracted.
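The URL-queue rule described in step S1 (enqueue a URL only if it matches the target rule and has not been handled before) can be illustrated with a minimal Python sketch. The pattern `TARGET_URL_RULE`, the host `news.example.com`, and the class `UrlFrontier` are hypothetical names chosen for illustration; the patent does not disclose its actual regular expressions, and the sketch treats "already enqueued" as a stand-in for "already crawled".

```python
import re
from collections import deque

# Hypothetical target-URL rule; the patent does not give the real expressions.
TARGET_URL_RULE = re.compile(
    r"^https?://news\.example\.com/\d{4}/\d{2}/\d{2}/\w+\.html$"
)


class UrlFrontier:
    """URL queue that accepts only unseen URLs matching the target rule."""

    def __init__(self, rule):
        self.rule = rule
        self.seen = set()     # URLs that were already enqueued
        self.queue = deque()  # URLs waiting to be fetched and parsed

    def offer(self, url):
        """Enqueue url if it matches the rule and was not seen before."""
        if url in self.seen or not self.rule.match(url):
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def take(self):
        """Return the next URL to parse, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None
```

Both crawlers could feed the same frontier: the static crawler would call `offer` for links found while following seed URLs, and the dynamic crawler would call it for URLs extracted from HTTP response bodies.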
Preferably, in step S2, noise in the news body text is first filtered out using regular expression rules; duplicate texts are then detected in the news data and removed; next, the named entity recognition module of FNLP is used to extract the named entities of the body text, and TextRank4ZH is used to generate a summary of the body text automatically; finally, the filtered news data, together with the named entities and summary of the body text, are stored in the Couchbase database.
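The first two operations of step S2 can be sketched as follows. This is only an illustration under stated assumptions: the noise patterns in `NOISE_PATTERNS` are hypothetical, and duplicates are detected by hashing the cleaned body text, which catches exact duplicates only. The patent does not say which rules or which duplicate-detection technique it uses.

```python
import hashlib
import re

# Hypothetical noise rules; the patent does not list the actual expressions.
NOISE_PATTERNS = [
    re.compile(r"<[^>]+>"),           # leftover HTML tags
    re.compile(r"\(Editor:[^)]*\)"),  # editor credit lines
]
WHITESPACE = re.compile(r"\s+")


def clean_text(text):
    """Strip noise patterns and collapse runs of whitespace."""
    for pattern in NOISE_PATTERNS:
        text = pattern.sub("", text)
    return WHITESPACE.sub(" ", text).strip()


def deduplicate(articles):
    """Keep the first article for each distinct cleaned body (exact match only)."""
    seen = set()
    unique = []
    for article in articles:
        body = clean_text(article["text"])
        digest = hashlib.sha1(body.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append({**article, "text": body})
    return unique
```

A near-duplicate detector (for example one based on shingling or SimHash) would be a natural replacement for the exact hash if reposted articles differ slightly; that choice is outside what the source specifies.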
Preferably, in step S3, news data of a specified category and publication time is first queried from Couchbase, and the news reports are sorted in ascending order of publication time; the body text is then segmented using the word segmentation module of FNLP, and stop words are removed from the segmentation result according to Chinese and English stop-word lists; next, TF-IDF is used to convert the body text of each news document into a high-dimensional feature vector, and PCA (principal component analysis) is used to reduce the dimensionality of the feature vectors; the Single-Pass algorithm with a time window is then used to cluster the news documents into news events, and the Single-Pass algorithm is used to track the events and form news topics; finally, the news event and topic information is stored in the Couchbase database.
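The TF-IDF step can be illustrated with a small pure-Python sketch. It mimics the hashing-based term frequency and the smoothed inverse document frequency idf = log((m + 1) / (df + 1)) used by Spark MLlib, but it is not the Spark implementation: the bucket hash (`zlib.crc32`), the bucket count, and all function names are assumptions made for illustration, and word segmentation, stop-word removal, and PCA are left out.

```python
import math
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=1 << 10):
    """Map a token list to a sparse term-frequency vector {bucket: count}."""
    counts = Counter()
    for token in tokens:
        # Deterministic hash chosen for the sketch, not Spark's hash function.
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        counts[bucket] += 1
    return dict(counts)


def fit_idf(tf_vectors):
    """Compute idf = log((m + 1) / (df + 1)) per bucket over m documents."""
    m = len(tf_vectors)
    df = Counter()
    for vec in tf_vectors:
        df.update(vec.keys())  # each document counts once per bucket
    return {bucket: math.log((m + 1) / (n + 1)) for bucket, n in df.items()}


def tf_idf(tf_vector, idf):
    """Weight each term frequency by the IDF of its bucket."""
    return {bucket: tf * idf.get(bucket, 0.0) for bucket, tf in tf_vector.items()}
```

With this weighting, a term that occurs in every document receives a weight of zero, so it does not contribute to the similarity between news reports.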
Preferably, in step S4, when the summary information of a topic is displayed, the topic information for the specified algorithm type and parameters is first queried; the latest event in the topic is then obtained as the representative event of the topic; next, the earliest published news report in that latest event is obtained as the representative news report of the topic; finally, the title, publication time, and body summary of the representative news report are displayed on the web page as the summary information of the topic. When the details of a topic are displayed, the event list of the topic is first obtained by topic ID, and the title and publication time of each event are obtained from the event list to form the event tracking information; the news report list of the latest event in the event list is then obtained, and the title, publication time, source, and URL of each news report are obtained from that list to form the event detection information; next, the earliest news report in the latest event is obtained, and its title and body summary are used as the title and summary of the topic; finally, the above information is displayed on the web page as the details of the topic.
The present invention also provides a system that applies the above method. The specific scheme is as follows:
The system comprises a web crawler module, a data preprocessing module, an event detection module, and an event display module, wherein the web crawler module executes step S1, the data preprocessing module executes step S2, the event detection module executes step S3, and the event display module executes step S4.
Preferably, the web crawler module comprises a static web page crawler submodule, a dynamic web page crawler submodule, and a web page parsing submodule.
Compared with the prior art, the beneficial effects of the present invention are:
1. Through the complete pipeline of data acquisition, data preprocessing, event detection, and event display, the invention converts raw Internet news data into news event and topic information and presents it to users in a friendly way through a Web interface, so that users can obtain a comprehensive view of each news topic and follow the development of news events.
2. The invention performs event detection on the distributed cluster computing framework Spark and the NoSQL database Couchbase, which effectively supports mining and analysis of large volumes of news data and improves the efficiency of event detection. The resulting system is stable, easy to scale horizontally, and highly available.
Brief description of the drawings
Fig. 1: Overall system flowchart
Fig. 2: Web crawler module flowchart
Fig. 3: Data preprocessing module flowchart
Fig. 4: Event detection module flowchart
Fig. 5: Event display module flowchart
Fig. 6: Flowchart of the Single-Pass algorithm with a time window
Fig. 7: Number of events in event detection
Fig. 8: Clustering time of event detection
Fig. 9: Number of topics in event tracking
Fig. 10: Clustering time of event tracking
Fig. 11: Effectiveness measures of event detection
Fig. 12: Total time of event detection
Fig. 13: Topic summary Web page
Fig. 14: Topic details Web page
Detailed description of the embodiments
The drawings are for illustration only and shall not be construed as limiting this patent.
The present invention is further described below with reference to the drawings and embodiments.
Embodiment 1
Fig. 1 is a flow diagram of the method provided by the invention. In use, the system built on this method comprises four modules: a web crawler module, a data preprocessing module, an event detection module, and an event display module. The web crawler module uses static and dynamic web page crawling, respectively, to crawl the news data the system needs from news portal websites on the Internet. The data preprocessing module first filters noise out of the raw news data and then performs text deduplication, named entity recognition, and automatic summary generation on the news data. The event detection module detects news events in the news data set through word segmentation, feature extraction, feature dimensionality reduction, text clustering, and related steps, and tracks the events to form news topics. The event display module displays the finally detected news topic information through a web interface, giving users a direct view of topic content and event development.
In the web crawler module, news data from Phoenix News and China News Service is crawled with the static web page crawler, and news data from Tencent News and Sina News is crawled with the dynamic web page crawler. The processing flow of the web crawler module is shown in Fig. 2. The static web page crawler submodule uses Scrapy to crawl static pages. It first defines the regular expression rules for target URLs, then generates seed URLs according to certain rules, and then crawls pages starting from the seed URLs; when a crawled page URL matches the predefined regular expression and has not been crawled before, the URL is added to the URL queue. The dynamic web page crawler submodule uses HTTP requests and responses to crawl dynamic pages. It first analyzes the HTTP request parameters of the target page URL, then constructs an HTTP request message according to the analysis, sets parameters such as the request line and request headers of the message, and sends it to the target host; finally it parses the message body of the HTTP response, extracts page URLs from it, and adds them to the URL queue. The web page parsing submodule uses XPath or jsoup to parse pages and extract news data. It first takes a URL from the URL queue and accesses its page, then analyzes the HTML DOM structure of the page and extracts news data such as the headline, publication time, category, and body text.
The data preprocessing module preprocesses the raw news data obtained by the web crawler module. Its processing flow is shown in Fig. 3. The module first uses regular expressions to filter noise out of the news body text, then detects duplicate texts in the news data set and removes them, then uses the named entity recognition module of FNLP to extract the named entities of the body text and TextRank4ZH to generate a summary of the body text automatically, and finally stores the cleaned news data, together with the named entities and summary of the body text, in the Couchbase database.
The event detection module performs event detection with the Single-Pass algorithm, and the whole detection process runs on Spark. Its processing flow is shown in Fig. 4. In the Single-Pass event detection process, news data of a specified category and publication time is first queried from Couchbase, and the news reports are sorted in ascending order of publication time. The body text is then segmented using the word segmentation module of FNLP, and stop words are removed from the segmentation result according to Chinese and English stop-word lists. Next, TF-IDF is used to convert the body text of each news document into a high-dimensional feature vector, and PCA (principal component analysis) is used to reduce the dimensionality of the feature vectors. The Single-Pass algorithm with a time window is then used to cluster the news documents into news events, and the Single-Pass algorithm is used to track the events and form news topics. Finally, the news event and topic information is stored in the Couchbase database.
The event display module builds a Web system with the Spring framework to display the summary information and details of news topics. Its processing flow is shown in Fig. 5. When displaying the summary information of a topic, the module first queries the topic information for the specified algorithm type and parameters, then obtains the latest event in the topic as the representative event of the topic, then obtains the earliest published news report in that latest event as the representative news report of the topic, and finally displays the title, publication time, and body summary of the representative news report on the web page as the summary information of the topic. When displaying the details of a topic, the module first obtains the event list of the topic by topic ID and obtains the title and publication time of each event from the event list to form the event tracking information. It then obtains the news report list of the latest event in the event list and obtains the title, publication time, source, and URL of each news report from that list to form the event detection information. Next, it obtains the earliest news report in the latest event and uses its title and body summary as the title and summary of the topic. Finally, the above information is displayed on the web page as the details of the topic.
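The selection rule for the topic summary (latest event first, then the earliest report inside that event) can be written down in a few lines. The sketch below is in Python for brevity, although the display module itself is built on Spring; the dictionary layout and the field names `events`, `start_time`, `reports`, `published`, `title`, and `summary` are hypothetical, and ordering events by their start time is an assumption.

```python
def topic_summary(topic):
    """Pick the latest event, then its earliest report, as the topic summary."""
    # Representative event: the event that started most recently.
    latest_event = max(topic["events"], key=lambda e: e["start_time"])
    # Representative report: the earliest published report of that event.
    first_report = min(latest_event["reports"], key=lambda r: r["published"])
    return {
        "title": first_report["title"],
        "published": first_report["published"],
        "summary": first_report["summary"],
    }
```

Timestamps are compared directly, so they need a sortable representation such as ISO 8601 strings or `datetime` objects.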
In the event detection of step S3, the event detection process based on Spark and Couchbase is implemented as follows:
1. Build the SparkConf configuration from information such as the URL of the Spark cluster manager, the Spark application name, and the Couchbase bucket name, and build a JavaSparkContext from the SparkConf.
2. From the JavaSparkContext, build a CouchbaseSparkContext, which is used to interact with the Couchbase database from Spark.
3. Create the Couchbase cluster from the server host name, and open the corresponding bucket in the cluster by the specified bucket name.
4. In Spark, build JSON documents from the cleaned news data and the algorithm information, convert them into an RDD, and then use the couchbaseDocumentRDD() function of couchbase-spark-connector to store the data in the Couchbase database in parallel.
5. In Spark, use the couchbaseQuery() function of CouchbaseSparkContext to query, in parallel, the news data of the specified time interval and category from the Couchbase database, and sort it by publication time from earliest to latest.
6. Use the map function of Spark to apply FNLP word segmentation in parallel to each element of the body text RDD, which yields an RDD of word lists.
7. Use the HashingTF of the TF-IDF module in Spark MLlib to convert the word list RDD into a TF feature vector RDD and cache it, then use IDFModel to convert the TF feature vector RDD into a TF-IDF feature vector RDD.
8. Use Spark MLlib to convert the TF-IDF feature vector RDD into a RowMatrix, compute the principal component matrix of the RowMatrix, multiply the RowMatrix by the principal component matrix to obtain the reduced matrix, and convert it into an RDD of reduced feature vectors.
9. Represent each news report by its reduced feature vector, and use the Single-Pass algorithm with a time window to cluster the news reports incrementally; this detects the news events, whose information is stored in the Couchbase database. Fig. 6 shows the flowchart of the Single-Pass algorithm with a time window.
10. Sort the news events by start time from earliest to latest, then use the Single-Pass algorithm to cluster the events into news topics, and store the topic information in the Couchbase database.
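The incremental clustering in steps 9 and 10 can be sketched in plain Python, without Spark. Because Fig. 6 is not reproduced in this text, several details are assumptions rather than the patented procedure: similarity is cosine similarity, an event is represented by the mean vector of its members, and an event counts as inside the time window while its most recent report is no more than `window_hours` older than the incoming report. The function names and the dictionary layout are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def single_pass(reports, threshold, window_hours):
    """Cluster (time_in_hours, vector) pairs given in ascending time order."""
    events = []  # each event: {"centroid", "members", "last_time"}
    for index, (time, vector) in enumerate(reports):
        best, best_sim = None, -1.0
        for event in events:
            if time - event["last_time"] > window_hours:
                continue  # event has fallen out of the time window
            sim = cosine(vector, event["centroid"])
            if sim > best_sim:
                best, best_sim = event, sim
        if best is not None and best_sim >= threshold:
            # Join the most similar event and update its running mean.
            n = len(best["members"])
            best["centroid"] = [
                (c * n + v) / (n + 1) for c, v in zip(best["centroid"], vector)
            ]
            best["members"].append(index)
            best["last_time"] = time
        else:
            # No sufficiently similar event in the window: start a new one.
            events.append(
                {"centroid": list(vector), "members": [index], "last_time": time}
            )
    return events
```

Under these assumptions, event tracking in step 10 could reuse the same function on event vectors with `window_hours=float("inf")`, which disables the window so that every event is compared with all existing topics.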
Embodiment 2
This embodiment reports an experiment on the method provided by the invention. The software environment used for system development and deployment in the experiment is shown in Table 1:
Table 1: System development and deployment software environment
The system uses two servers to build the Spark distributed computing cluster and the Couchbase database cluster, and the Web system is deployed on these servers. The hardware configuration for development and deployment is shown in Table 2:
Table 2: System development and deployment hardware configuration
Server      CPU                                 Memory
Server 1    Intel(R) Core(TM) i5-4570 CPU @     8 GB
Server 2    Intel(R) Core(TM) i5-2450M CPU @    6 GB
Experimental results:
The experiment tests the system on news data of the "domestic" category obtained by the system's web crawler. The details of the experimental data are shown in Table 3.
Table 3: Details of the experimental data
The experiment tests and analyzes the results of event detection and event tracking. The following six items are tested:
(1) the number of events obtained by event detection
(2) the time spent on clustering in event detection
(3) the number of topics obtained by event tracking
(4) the time spent on clustering in event tracking
(5) the effectiveness measures of event detection
(6) the total time of event detection
The experiment first observes how changes in the similarity threshold and the time window of the Single-Pass algorithm with a time window, used for event detection, affect (1) and (2). It then observes how changes in the similarity threshold of the Single-Pass algorithm used for event tracking and in the number of events affect (3) and (4). Next, it measures the effectiveness of event detection with three indicators: recall, precision, and F-value. Finally, it compares the results on (6) for traditional single-machine processing and distributed cluster processing.
Fig. 7 shows the number of events obtained in event detection. For a given time window, the number of detected events increases as the Single-Pass similarity threshold increases. The reason is that the larger the threshold, the more likely the similarity between a news report and the existing events falls below it, so the report is more likely to be judged a new event, and the final number of events is larger. For a given Single-Pass similarity threshold, the number of events gradually decreases as the time window grows. The reason is that the larger the time window, the more events each news report is compared with, so the report is more likely to be added to one of them, and the final number of events is smaller.
Fig. 8 shows the time spent on clustering with the Single-Pass algorithm with a time window in event detection. For a given time window, the clustering time increases as the Single-Pass similarity threshold increases. The main reason is that the larger the threshold, the more news reports are judged to be new events and the more events the time window contains, so each news report must be compared with more events and the clustering time increases. For a given Single-Pass similarity threshold, the clustering time also increases as the time window grows. The main reason is that the larger the time window, the more events it contains, so each news report must be compared with more events and the clustering time increases.
Fig. 9 shows the number of topics obtained in event tracking. In this experiment the time window of event detection is set to 24 hours, and the Single-Pass similarity threshold of event detection is set to 0.5, 0.6, 0.7, and 0.8, which yields 3495, 4441, 5236, and 6062 events, respectively. For a given number of events, the number of topics obtained by event tracking increases as the Single-Pass similarity threshold of event tracking increases. The reason is that the larger the threshold, the more likely the similarity between an event and the existing topics falls below it, so the event is more likely to be judged a new topic, and the final number of topics is larger. For a given Single-Pass similarity threshold, the number of topics also increases with the number of events, and it increases faster at higher thresholds. The reason is that the more events there are, the more topics they involve, so event tracking yields more topics. At a low similarity threshold, however, the topics formed by event tracking are broad and each topic contains many events, so the number of events has little effect on the number of topics; at a high similarity threshold, the topics are narrower and each topic contains fewer events, so the number of events has a larger effect on the number of topics.
Fig. 10 shows the time spent on clustering with the Single-Pass algorithm in event tracking. As before, the time window of event detection is set to 24 hours, and the Single-Pass similarity threshold of event detection is set to 0.5, 0.6, 0.7, and 0.8, which yields 3495, 4441, 5236, and 6062 events, respectively. Compared with Fig. 8, clustering in event tracking takes longer than clustering in event detection. This is because event tracking sets no time window and clusters over all events, which takes more time. For a given number of events, the clustering time of event tracking increases as the Single-Pass similarity threshold of event tracking increases. The main reason is that the larger the threshold, the more topics event tracking produces, so each event must be compared with more topics and clustering takes longer. For a given Single-Pass similarity threshold of event tracking, the clustering time also increases with the number of events. The reason is that the more events there are, the more events must be compared with topics during clustering, so clustering takes longer.
Fig. 11 shows the experimental results for measuring the effect of event detection. The experiment uses the time interval given in Table 3 as experimental data: after manual annotation, the 231 news reports from 1 to 2 November 2017 were determined to comprise 87 events, of which the event with the most news reports contains 23 news reports and the event with the fewest contains only 1 news report. The time window of event detection is set to 48 hours. The experiment uses three metrics for measuring clustering quality, namely recall (Recall), precision (Precision) and F value (F-Measure), as the metrics of event detection. Unlike the recall, precision and F value of the conventional information retrieval (IR) field, the three metrics used in this experiment are defined as follows:
Here, n is the total number of news reports, n_i is the number of news reports contained in the i-th true cluster, n_j is the number of news reports contained in the j-th cluster obtained by the event detection algorithm, and n_ij is the number of news reports shared by the i-th true cluster and the j-th cluster obtained by the event detection algorithm. k is the number of true clusters, and k' is the number of clusters obtained by the event detection algorithm. As can be seen from Fig. 11, as the Single-Pass similarity threshold of event detection increases, recall shows a rising trend, precision shows a slowly falling trend, and the F value is almost unchanged. The reason is that as the similarity threshold increases, n_j and n_ij both increase while n_i remains unchanged, and n_j grows faster than n_ij, so recall rises and precision falls. The experimental results show that recall, precision and F value are all above 90%, which demonstrates the validity of the event detection.
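For illustration only, one common way of computing clustering recall, precision and F value from the quantities n, n_i, n_j and n_ij is sketched below. The formulas of the experiment are not reproduced in this text, so the definitions used here are an assumption: for each true cluster i, the detected cluster j with the highest pairwise F value is selected, with recall n_ij / n_i and precision n_ij / n_j, and the per-cluster scores are averaged with weights n_i / n. The function name `cluster_scores` is hypothetical.

```python
from typing import List, Set, Tuple


def cluster_scores(true_clusters: List[Set[str]],
                   found_clusters: List[Set[str]]) -> Tuple[float, float, float]:
    """Return (recall, precision, f_value) under the assumed definitions.

    true_clusters:  manually annotated events, each a set of report IDs.
    found_clusters: events produced by the detection algorithm.
    """
    n = sum(len(c) for c in true_clusters)  # total number of news reports
    recall = precision = f_value = 0.0
    for true_i in true_clusters:
        n_i = len(true_i)
        best = (0.0, 0.0, 0.0)  # (F, recall, precision) of the best-matching cluster
        for found_j in found_clusters:
            n_ij = len(true_i & found_j)  # reports shared by both clusters
            if n_ij == 0:
                continue
            r = n_ij / n_i
            p = n_ij / len(found_j)
            f = 2 * p * r / (p + r)
            if f > best[0]:
                best = (f, r, p)
        weight = n_i / n  # larger true clusters contribute more
        f_value += weight * best[0]
        recall += weight * best[1]
        precision += weight * best[2]
    return recall, precision, f_value


if __name__ == "__main__":
    truth = [{"a", "b", "c"}, {"d"}]
    found = [{"a", "b"}, {"c", "d"}]
    print(cluster_scores(truth, found))
```

Under these assumed definitions a perfect clustering scores 1.0 on all three metrics, and splitting or merging true clusters lowers recall or precision respectively.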
Fig. 12 shows the experimental results for the time spent on the whole event detection process using a traditional single-machine processing technique and a distributed cluster processing technique respectively. In this experiment, the time window of event detection is set to 24 hours and the Single-Pass similarity threshold of event tracking is set to 0.7; the single-machine processing technique uses server 1 and server 2 respectively, and the distributed cluster processing technique uses a cluster built from server 1 and server 2. As can be seen from Fig. 12, for the same event-detection Single-Pass similarity threshold, server 2 takes the longest time, server 1 takes the second longest, and the distributed cluster takes the least; the time taken by the distributed cluster is about half that of server 2 and about 2/3 that of server 1. The main reason is that the distributed cluster computes with distributed processing technology, which speeds up the computation, whereas server 1 and server 2 both use the traditional single-machine processing technique, so their computational efficiency is limited; and because the hardware configuration of server 1 is higher than that of server 2, server 1 takes less time than server 2. The experimental results show that, relative to the traditional single-machine processing technique, the big data processing technique used by this system clearly improves processing speed, which demonstrates the high efficiency of the system designed in the present invention for event detection.
System web interface display:
Fig. 13 shows the topic summary page of the system. This page is the home page of the system and mainly presents the summary information of hot topics, which includes the topic title, the topic time and the summary of the topic. For this display of the system web interface, the present invention selects the 2 topics containing the most events, namely the Luban Prize and the Selenga River.
Fig. 14 shows the topic details page of the system, for which the present invention selects the "Xi'an to Chengdu turnaround" topic for display. The topic details page first shows the details of the most recent event of the topic and then shows the event tracking information. The details of an event mainly include the event title, the summary of the event and the related news reports that the event contains. The event tracking information mainly includes the list of events related to the content of that event, showing the occurrence time and title of each event ordered by occurrence time from latest to earliest.
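As a non-limiting illustration of how the data behind such a topic details page might be assembled, the following minimal sketch orders the events of a topic from latest to earliest, takes the first as the most recent event, and builds the tracking list of occurrence times and titles. The class names, field names and the dictionary layout are hypothetical and are not taken from the embodiment; timestamps are assumed to be ISO 8601 strings so that string order matches time order.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Report:
    title: str
    published: str  # ISO 8601 timestamp (assumed format)


@dataclass
class Event:
    title: str
    occurred: str  # ISO 8601 timestamp (assumed format)
    summary: str = ""
    reports: List[Report] = field(default_factory=list)


def topic_details(events: List[Event]) -> Dict[str, object]:
    """Build the details-page data for one topic from its non-empty event list."""
    # Order events by occurrence time, latest first.
    ordered = sorted(events, key=lambda e: e.occurred, reverse=True)
    latest = ordered[0]
    return {
        # Details of the most recent event: title, summary and related reports.
        "event": {
            "title": latest.title,
            "summary": latest.summary,
            "reports": [r.title for r in latest.reports],
        },
        # Event tracking information: (occurrence time, title), latest to earliest.
        "tracking": [(e.occurred, e.title) for e in ordered],
    }


if __name__ == "__main__":
    demo = [
        Event("Event A", "2017-11-01T08:00:00", "first", [Report("A1", "2017-11-01T08:05:00")]),
        Event("Event B", "2017-11-02T09:00:00", "second", [Report("B1", "2017-11-02T09:10:00")]),
    ]
    print(topic_details(demo))
```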
Obviously, the above embodiments of the present invention are merely examples given to illustrate the present invention clearly, and are not a limitation on the embodiments of the present invention. Those of ordinary skill in the art may make other variations or changes in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the claims of the present invention.

Claims (7)

1. A news event detection method based on big data processing technology, characterized by comprising the following steps:
S1. crawling news data from news portal websites using a static web page crawler technique and a dynamic web page crawler technique respectively;
S2. filtering noise from the news data, and then performing text deduplication, named entity recognition and automatic text summary generation on the news data;
S3. detecting news events from the news data through the steps of word segmentation, feature extraction, feature dimensionality reduction and text clustering, and tracking the events to form news topics;
S4. displaying the finally detected news topic information through an interface.
2. The news event detection method based on big data processing technology according to claim 1, characterized in that: in step S1, the static web page crawler technique crawls static web pages using Scrapy: a regular expression rule for the target URLs to be crawled is first defined, seed URLs are then generated according to certain rules, and web pages are then crawled starting from the seed URLs; when the URL of a crawled web page correctly matches the predefined regular expression rule and the URL has not already been crawled, the URL is added to the URL queue; the dynamic web page crawler technique crawls dynamic web pages using HTTP request and response techniques: the HTTP request parameters of the target web page URL are first analyzed, an HTTP request message is then constructed according to the analysis result, with parameters such as the request line and request headers of the message being set, and sent to the target host, and finally the message body of the HTTP response message is parsed, and web page URLs are extracted from it and added to the URL queue; for the web page URLs extracted by the static web page crawler technique and the dynamic web page crawler technique, XPath or jsoup is used to parse the web pages and extract news data from them: a URL is first taken from the URL queue and its corresponding web page is accessed, and the HTML DOM structure of the web page is then parsed to extract the news data of news title, publication time, category and body text.
3. The news event detection method based on big data processing technology according to claim 2, characterized in that: in step S2, noise in the news body text is first filtered using regular expression rules; duplicate texts are then detected in the news data and removed; the named entities of the news body text are then extracted using the named entity recognition module of FNLP, and a summary of the news body text is automatically generated using TextRank4ZH; finally, the filtered news data, together with the named entities and summary information of the news body text, are stored in a Couchbase database.
4. The news event detection method based on big data processing technology according to claim 3, characterized in that: in step S3, news data of a specified category and publication time are first queried from Couchbase, and the news reports are sorted in ascending order of publication time; the news body text is then segmented using the word segmentation module of FNLP, and stop words are removed from the segmentation result according to Chinese and English stop word lists; the text of each news document is then converted into a high-dimensional feature vector using TF-IDF, and the dimensionality of the feature vectors is reduced using PCA (principal component analysis); the news documents are then clustered using a Single-Pass algorithm with a time window to obtain news events, and the events are tracked using the Single-Pass algorithm to form news topics; finally, the news event and topic information are stored in the Couchbase database.
5. The news event detection method based on big data processing technology according to claim 4, characterized in that: in step S4, when the summary information of a topic is displayed, the topic information for the specified algorithm type and parameters is first queried; the event that occurred latest in the topic is then obtained as the representative event of the topic; the news report published earliest in that latest event is then obtained as the representative news report of the topic; finally, the title, publication time and body summary of the representative news report are displayed on the web page as the summary information of the topic; when the details of a topic are displayed, the event list of the topic is first obtained according to the topic ID, and the title and publication time of each event are obtained from the event list to form the event tracking information; the news report list of the event that occurred latest in the event list is then obtained, and the title, publication time, source and URL information of each news item are obtained from the news report list to form the event detection information; the news report that appeared earliest in the latest event is then obtained, and the title and body summary of that news report are obtained as the title and summary of the topic; finally, the above information is displayed on the web page as the details of the topic.
6. A system for the detection method according to any one of claims 1 to 5, characterized by comprising a web crawler module, a data preprocessing module, an event detection module and an event display module, wherein the web crawler module is used to execute step S1, the data preprocessing module is used to execute step S2, the event detection module is used to execute step S3, and the event display module is used to execute step S4.
7. The system according to claim 6, characterized in that: the web crawler module comprises a static web page crawler submodule, a dynamic web page crawler submodule and a web page parsing submodule.
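For illustration only, the URL queue handling recited in claim 2, in which a discovered URL is enqueued only when it matches the predefined regular expression rule and has not already been crawled, might be sketched as follows. This is a minimal standalone sketch under stated assumptions and does not use or represent the Scrapy API: the class name `UrlFrontier`, its methods, and the `news.example.com` URL pattern are hypothetical, and seed URLs are assumed to be enqueued without being checked against the rule.

```python
import re
from collections import deque
from typing import Deque, Iterable, Optional, Set


class UrlFrontier:
    """URL queue that accepts only unseen URLs matching a regular expression rule."""

    def __init__(self, pattern: str, seeds: Iterable[str]) -> None:
        self._rule = re.compile(pattern)
        self._seen: Set[str] = set()
        self._queue: Deque[str] = deque()
        for seed in seeds:
            # Assumption: seed URLs start the crawl and bypass the rule.
            if seed not in self._seen:
                self._seen.add(seed)
                self._queue.append(seed)

    def offer(self, url: str) -> bool:
        """Enqueue url if it fully matches the rule and was not seen before."""
        if url in self._seen or not self._rule.fullmatch(url):
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def next_url(self) -> Optional[str]:
        """Take the next URL to crawl, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


if __name__ == "__main__":
    frontier = UrlFrontier(r"https://news\.example\.com/\d{4}/\d+\.html",
                           ["https://news.example.com/"])
    frontier.offer("https://news.example.com/2017/1101.html")  # matches, enqueued
    frontier.offer("https://news.example.com/about.html")      # rejected by the rule
    print(frontier.next_url())
```

Keeping the set of seen URLs separate from the queue means a URL is rejected as a duplicate even after it has been taken from the queue and crawled.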

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810792930.3A CN110147439A (en) 2018-07-18 2018-07-18 A kind of news event detecting method and system based on big data processing technique

Publications (1)

Publication Number Publication Date
CN110147439A true CN110147439A (en) 2019-08-20

Family

ID=67589149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810792930.3A Pending CN110147439A (en) 2018-07-18 2018-07-18 A kind of news event detecting method and system based on big data processing technique

Country Status (1)

Country Link
CN (1) CN110147439A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102262635A (en) * 2010-05-25 2011-11-30 北京启明星辰信息技术股份有限公司 Page crawler system and page crawler method
CN102831220A (en) * 2012-08-23 2012-12-19 江苏物联网研究发展中心 Subject-oriented customized news information extraction system
CN103092936A (en) * 2013-01-08 2013-05-08 华北电力大学(保定) Real-time information acquisition method of dynamic page of Internet of Things
CN104462253A (en) * 2014-11-20 2015-03-25 武汉数为科技有限公司 Topic detection or tracking method for network text big data
CN107862039A (en) * 2017-11-06 2018-03-30 工业和信息化部电子第五研究所 Web data acquisition methods, system and Data Matching method for pushing

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110597981A (en) * 2019-09-16 2019-12-20 西华大学 Network news summary system for automatically generating summary by adopting multiple strategies
CN110990705A (en) * 2019-12-06 2020-04-10 腾讯科技(深圳)有限公司 News processing method, device, equipment and medium
CN110990705B (en) * 2019-12-06 2024-04-12 深圳市雅阅科技有限公司 News processing method, device, equipment and medium
CN111324753B (en) * 2020-01-22 2021-09-03 天窗智库文化传播(苏州)有限公司 Media information publishing management method and system
CN111291299A (en) * 2020-01-22 2020-06-16 北京飞漫软件技术有限公司 Method for directly obtaining local command execution result and local server
CN111324753A (en) * 2020-01-22 2020-06-23 天窗智库文化传播(苏州)有限公司 Media information publishing management method and system
CN111291299B (en) * 2020-01-22 2023-08-15 北京飞漫软件技术有限公司 Method for directly obtaining local command execution result and local server
CN111460160A (en) * 2020-04-02 2020-07-28 复旦大学 Event clustering method for streaming text data based on reinforcement learning
CN111460160B (en) * 2020-04-02 2023-08-18 复旦大学 Event clustering method of stream text data based on reinforcement learning
CN111581480B (en) * 2020-05-12 2023-09-08 杭州风远科技有限公司 News information aggregation analysis method and system, terminal and storage medium
CN111581480A (en) * 2020-05-12 2020-08-25 杭州风远科技有限公司 News information aggregation analysis method and system, terminal and storage medium
CN111930936A (en) * 2020-06-28 2020-11-13 山东师范大学 Method and system for excavating platform message text
CN112287254A (en) * 2020-11-23 2021-01-29 武汉虹旭信息技术有限责任公司 Webpage structured information extraction method and device, electronic equipment and storage medium
CN112287254B (en) * 2020-11-23 2023-10-27 武汉虹旭信息技术有限责任公司 Webpage structured information extraction method and device, electronic equipment and storage medium
CN112597269A (en) * 2020-12-25 2021-04-02 西南电子技术研究所(中国电子科技集团公司第十研究所) Stream data event text topic and detection system
CN112818200A (en) * 2021-01-28 2021-05-18 平安普惠企业管理有限公司 Data crawling and event analyzing method and system based on static website
CN113554538A (en) * 2021-05-28 2021-10-26 四川社智雲科技有限公司 Digital information integrated system for urban and rural community management

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190820