CN108197112A - A method for extracting events from news - Google Patents

A method for extracting events from news

Info

Publication number
CN108197112A
Authority
CN
China
Prior art keywords
news
similarity
event
case
affiliated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810054183.3A
Other languages
Chinese (zh)
Inventor
范艳艳
李源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Digital Peak Technology Co Ltd
Chengdu Rui Code Technology Co Ltd
Original Assignee
Hangzhou Digital Peak Technology Co Ltd
Chengdu Rui Code Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Digital Peak Technology Co Ltd and Chengdu Rui Code Technology Co Ltd
Priority to CN201810054183.3A
Publication of CN108197112A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for extracting events from news. A summary extracted from each news item serves as the event to which that item belongs, and the news text is converted numerically into a vector representation. The similarity between news items is then computed with a clustering method, and news items are quickly assigned to their events based on that similarity. In this way, news items that belong to the same event are clustered together simply and effectively, and the popularity (heat) of the news is obtained, which facilitates subsequent public opinion monitoring. The method can classify massive volumes of news simply, quickly, and effectively, provides guidance for public opinion analysis, further improves the control and monitoring of public opinion, and supports timely decision-making and public opinion guidance.

Description

A method for extracting events from news
Technical field
The present invention relates to the field of computer network communication technology, and in particular to a method for extracting events from news.
Background
With the continuous development of computer network technology, obtaining information from the network has become one of the main ways people learn about events, and news is a principal form of network information resources. Domestic and foreign news portals generate large amounts of news at every moment, and people are often left in a predicament: on the one hand, they cannot select from or digest the mass of information they receive and are submerged in it; on the other hand, information gets lost, and people have difficulty finding what they actually need. Obtaining the required information quickly and efficiently has therefore become an urgent demand. Under these circumstances, automatically and effectively clustering large amounts of information is necessary.
In addition, with the rapid development of the Internet, online public opinion has a growing influence on society. Whether for government monitoring of online public opinion or for enterprises' brand communication and public relations, the problem is how to obtain hot events quickly and effectively from massive volumes of public opinion data, so that the sentiment orientation of public opinion can be analyzed and the personnel concerned can make timely and reliable decisions, guide public opinion, and respond to a rapidly changing public opinion environment.
Summary of the invention
To solve the problems of the prior art, the present invention proposes a method for extracting events from news, which enables monitoring of the popularity (heat) of news events and provides effective guidance for public opinion monitoring and decision-making.
The method for extracting events from news according to the present invention includes the following steps:
Step 1: obtain an original news data set related to a target topic;
Step 2: extract the summary of a news item as the event to which it belongs, and convert the news text into numerical form;
Step 3: set up a news case and determine whether it already contains news; if not, add the event of the news item to the news case as a new event and place the news item under that event; otherwise, perform Step 4;
Step 4: calculate the similarity between the news item and the news already in the news case, and determine from the similarity the event in the news case to which the news item belongs;
Step 5: determine whether the news case contains all news items and their events; if so, end; otherwise, return to Step 1.
Further, the original news data set in Step 1 includes the news ID, news title, and news content of each news item.
Further, the numerical conversion in Step 2 includes:
Step 2.1: segment the news title and news content into characters;
Step 2.2: pass the segmented text to a doc2vec model to obtain the vector representation of the text.
Further, the similarity in Step 4 is calculated using cosine similarity, specifically:
$$\mathrm{sim}(x, y) = 0.3\cos(x_1, y_1) + 0.7\cos(x_2, y_2)$$
where $x_1$ and $y_1$ are the title vectors of the two news items, $x_2$ and $y_2$ are their content vectors, $\cos(x_1, y_1)$ is the similarity between the title vectors, $\cos(x_2, y_2)$ is the similarity between the content vectors, and $\mathrm{sim}(x, y)$ is the final similarity of the two news items; the larger the value, the higher the similarity.
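For reference, the cosine term is conventionally defined as shown below. The patent's own definition appears to have been lost in extraction, so this standard form is an assumption rather than a quotation, and the numbers in the worked line are hypothetical values chosen only to illustrate the weighting.

```latex
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
           = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2}\,\sqrt{\sum_{i} b_i^2}}

% Hypothetical values: \cos(x_1, y_1) = 0.8, \cos(x_2, y_2) = 0.6
\mathrm{sim}(x, y) = 0.3 \times 0.8 + 0.7 \times 0.6 = 0.24 + 0.42 = 0.66
```

Because the content term carries the larger weight, two items with near-identical bodies but different headlines still score high.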
Further, determining from the similarity the event in the news case to which the news item belongs in Step 4 is specifically: if the similarity is greater than a threshold, the news item is placed under the event of the news item in the news case with the highest similarity; otherwise, the event of the news item is added to the news case as a new event, and the news item is placed under that event.
Further, the method also includes Step 6: calculate the popularity (heat) of the events in the news case to obtain the current hot events.
The present invention takes the summary extracted from the news content as the news event and clusters all news into events according to news similarity, so that the popularity of the target topic can be obtained and hot events can be identified simply and quickly. This provides effective data for public opinion guidance and decision-making.
Brief description of the drawings
The drawings described here provide a further understanding of the embodiments of the present invention and form a part of this application; they do not limit the embodiments of the present invention. In the drawings:
Fig. 1 is a flow chart of the first embodiment of the method for extracting events from news according to the present invention;
Fig. 2 is a flow chart of the second embodiment of the method for extracting events from news according to the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the embodiments and drawings. The exemplary embodiments and their descriptions serve only to explain the present invention and do not limit it.
Embodiment 1
As shown in Fig. 1, this embodiment provides a method for extracting events from news, which specifically includes the following steps:
S01: obtain an original news data set related to a target topic, including the news ID, news title, and news content.
S02: extract the summary of a news item as the event to which it belongs, and convert the news text into numerical form.
The numerical conversion includes:
Step 2.1, train the doc2vec models: segment the news title and news content into characters; for example, "今天天气真好" ("The weather is really nice today") is segmented into "今", "天", "天", "气", "真", "好". Using the segmented titles and contents, train one doc2vec model for titles and one for contents, and save both locally.
Step 2.2, convert text into vectors: for any new news item, first segment its title and content into characters, then use the trained doc2vec models to convert the title and the content each into a 300-dimensional numerical vector.
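The character segmentation in Step 2.1 can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function name `segment_chars` is hypothetical, dropping whitespace is an assumption the text does not state, and the sample sentence assumes the example in the text corresponds to the Chinese "今天天气真好".

```python
def segment_chars(text):
    """Split text into single characters.

    Assumption: whitespace is dropped; the patent text does not say how
    spaces or punctuation are handled.
    """
    return [ch for ch in text if not ch.isspace()]


# Each character becomes one token, ready to be fed to a doc2vec model.
tokens = segment_chars("今天天气真好")
print(tokens)
```

The resulting token lists for titles and for contents would then be used to train the two separate doc2vec models described above.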
S03: set up a news case and determine whether it already contains news; if not, add the event of the news item to the news case as a new event and place the news item under that event; otherwise, perform S04.
S04: calculate the similarity between the news item and the news already in the news case, and determine from the similarity the event in the news case to which the news item belongs.
Cosine similarity measures direction rather than length. When the cosine similarity of two vectors is 1, the two vectors point in the same direction, as with (1,1) and (3,3), even though their specific values differ. Cosine similarity is therefore commonly used to measure the similarity between two vectors. Here the similarity is calculated as follows:
$$\mathrm{sim}(x, y) = 0.3\cos(x_1, y_1) + 0.7\cos(x_2, y_2)$$
where $x_1$ and $y_1$ are the title vectors of the two news items, $x_2$ and $y_2$ are their content vectors, $\cos(x_1, y_1)$ is the similarity between the title vectors, $\cos(x_2, y_2)$ is the similarity between the content vectors, and $\mathrm{sim}(x, y)$ is the final similarity of the two news items; the larger the value, the higher the similarity.
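The weighted similarity can be sketched as follows. The 0.3 and 0.7 weights come from the text; the function names, the argument order, and the choice to return 0.0 for an all-zero vector are assumptions made for this illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)


def news_similarity(title_x, content_x, title_y, content_y,
                    w_title=0.3, w_content=0.7):
    """Weighted sum of title similarity and content similarity."""
    return (w_title * cosine(title_x, title_y)
            + w_content * cosine(content_x, content_y))


# (1, 1) and (3, 3) differ in length but point the same way.
print(cosine((1, 1), (3, 3)))
```

In practice the vectors would be the 300-dimensional doc2vec outputs for the title and the content; two-dimensional vectors are used here only to keep the example readable.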
S05: determine whether the news case contains all news items and their events; if so, end; otherwise, return to S01.
The method further includes step S06: calculate the popularity (heat) of the events in the news case to obtain the current hot events.
The popularity of a news item is jointly determined by its read count, comment count, and number of participants. An event is a set of similar news items, and its popularity is the sum of the popularity of its news items. By calculating event popularity, the current hot events are obtained, so that users can respond according to their own needs.
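A short sketch of the popularity calculation follows. Summing item popularity into event popularity is stated in the text; how read count, comment count, and participant count combine into one item's popularity is not, so the plain sum in `news_heat` is a hypothetical placeholder, as are all names and sample numbers.

```python
def news_heat(reads, comments, participants):
    """Popularity of one news item.

    Hypothetical formula: the text names the three inputs but gives no
    weights, so an unweighted sum is assumed here.
    """
    return reads + comments + participants


def event_heat(items):
    """Popularity of an event: the sum over its news items (per the text)."""
    return sum(news_heat(**item) for item in items)


def hottest_events(events, top_n=1):
    """Return the names of the top_n events ranked by popularity."""
    ranked = sorted(events.items(),
                    key=lambda kv: event_heat(kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_n]]


events = {
    "A": [{"reads": 100, "comments": 10, "participants": 5}],
    "B": [{"reads": 50, "comments": 5, "participants": 2},
          {"reads": 80, "comments": 8, "participants": 4}],
}
print(hottest_events(events))
```

Event "B" outranks "A" here because two moderately read items together outweigh one popular item, which is the behavior the additive definition implies.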
Embodiment 2
As shown in Fig. 2, this embodiment provides a method for extracting events from news. On the basis of the above embodiment, it further provides a specific method for determining, from the similarity, the event in the news case to which a news item belongs. Accordingly, the method specifically includes:
S11: obtain an original news data set related to a target topic, including the news ID, news title, and news content;
S12: extract the summary of a news item as the event to which it belongs, and convert the news text into numerical form;
The numerical conversion includes:
Step 2.1, train the doc2vec models: segment the news title and news content into characters; for example, "今天天气真好" ("The weather is really nice today") is segmented into "今", "天", "天", "气", "真", "好". Using the segmented titles and contents, train one doc2vec model for titles and one for contents, and save both locally.
Step 2.2, convert text into vectors: for any new news item, first segment its title and content into characters, then use the trained doc2vec models to convert the title and the content each into a 300-dimensional numerical vector.
S13: set up a news case and determine whether it already contains news; if not, add the event of the news item to the news case as a new event and place the news item under that event; otherwise, perform S14.
S14: calculate the similarity between the news item and the news already in the news case, and determine from the similarity the event in the news case to which the news item belongs;
Cosine similarity measures direction rather than length. When the cosine similarity of two vectors is 1, the two vectors point in the same direction, as with (1,1) and (3,3), even though their specific values differ. Cosine similarity is therefore commonly used to measure the similarity between two vectors. Here the similarity is calculated as follows:
$$\mathrm{sim}(x, y) = 0.3\cos(x_1, y_1) + 0.7\cos(x_2, y_2)$$
where $x_1$ and $y_1$ are the title vectors of the two news items, $x_2$ and $y_2$ are their content vectors, $\cos(x_1, y_1)$ is the similarity between the title vectors, $\cos(x_2, y_2)$ is the similarity between the content vectors, and $\mathrm{sim}(x, y)$ is the final similarity of the two news items; the larger the value, the higher the similarity.
The event in the news case to which the news item belongs is determined from the similarity as follows: if the similarity is greater than a threshold, the news item is placed under the event of the news item in the news case with the highest similarity; otherwise, the event of the news item is added to the news case as a new event, and the news item is placed under that event.
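The assignment rule amounts to a single-pass incremental clustering, sketched below under stated assumptions. The dictionary layout, the names `cluster_news` and `news_case`, and the threshold value 0.8 are all hypothetical; the text specifies only that a threshold exists. The vectors stand in for doc2vec outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0


def similarity(x, y):
    """Weighted title/content similarity with the 0.3 / 0.7 weights from the text."""
    return (0.3 * cosine(x["title_vec"], y["title_vec"])
            + 0.7 * cosine(x["content_vec"], y["content_vec"]))


def cluster_news(items, threshold=0.8):
    """Assign each news item to an event in the news case, one item at a time.

    threshold=0.8 is a hypothetical value; the text does not specify one.
    """
    news_case = []  # each entry: {"event": summary, "news": [item, ...]}
    for item in items:
        best_event, best_sim = None, float("-inf")
        # Find the most similar news item already stored in the case.
        for event in news_case:
            for member in event["news"]:
                s = similarity(item, member)
                if s > best_sim:
                    best_event, best_sim = event, s
        if best_event is not None and best_sim > threshold:
            best_event["news"].append(item)  # join the closest item's event
        else:
            # Empty case or nothing similar enough: open a new event.
            news_case.append({"event": item["summary"], "news": [item]})
    return news_case


items = [
    {"id": 1, "summary": "flood", "title_vec": [1.0, 0.0], "content_vec": [1.0, 0.0]},
    {"id": 2, "summary": "flood update", "title_vec": [1.0, 0.1], "content_vec": [1.0, 0.1]},
    {"id": 3, "summary": "election", "title_vec": [0.0, 1.0], "content_vec": [0.0, 1.0]},
]
for event in cluster_news(items):
    print(event["event"], [n["id"] for n in event["news"]])
```

Comparing against every stored item keeps the sketch close to the wording of the text; a larger deployment would likely compare against one representative vector per event instead, which is a design choice the patent does not discuss.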
S15: determine whether the news case contains all news items and their events; if so, end; otherwise, return to S11.
The method further includes step S16: calculate the popularity (heat) of the events in the news case to obtain the current hot events.
The popularity of a news item is jointly determined by its read count, comment count, and number of participants. An event is a set of similar news items, and its popularity is the sum of the popularity of its news items. By calculating event popularity, the current hot events are obtained, so that users can respond according to their own needs.
The present invention takes the summary extracted from the news content as the news event and clusters all news into events according to news similarity, so that the popularity of the target topic can be obtained. The method can classify massive volumes of information simply, quickly, and effectively, and provides effective guidance for controlling and guiding public opinion.
The specific embodiments described above further explain the objectives, technical solutions, and advantageous effects of the present invention in detail. It should be understood that the foregoing are only specific embodiments of the present invention and do not limit its scope of protection; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (6)

  1. A method for extracting events from news, characterized by comprising:
    Step 1: obtaining an original news data set related to a target topic;
    Step 2: extracting the summary of a news item as the event to which it belongs, and converting the news text into numerical form;
    Step 3: setting up a news case and determining whether it already contains news; if not, adding the event of the news item to the news case as a new event and placing the news item under that event; otherwise, performing Step 4;
    Step 4: calculating the similarity between the news item and the news already in the news case, and determining from the similarity the event in the news case to which the news item belongs;
    Step 5: determining whether the news case contains all news items and their events; if so, ending; otherwise, returning to Step 1.
  2. The method according to claim 1, characterized in that the original news data set in Step 1 includes the news ID, news title, and news content of each news item.
  3. The method according to claim 2, characterized in that the numerical conversion in Step 2 further includes:
    Step 2.1: segmenting the news title and news content into characters;
    Step 2.2: passing the segmented text to a doc2vec model to obtain the vector representation of the text.
  4. The method according to claim 3, characterized in that the similarity in Step 4 is calculated using cosine similarity, specifically: $\mathrm{sim}(x, y) = 0.3\cos(x_1, y_1) + 0.7\cos(x_2, y_2)$, where $x_1$ and $y_1$ are the title vectors of the two news items, $x_2$ and $y_2$ are their content vectors, $\cos(x_1, y_1)$ is the similarity between the title vectors, $\cos(x_2, y_2)$ is the similarity between the content vectors, and $\mathrm{sim}(x, y)$ is the final similarity of the two news items; the larger the value, the higher the similarity.
  5. The method according to claim 4, characterized in that determining from the similarity the event in the news case to which the news item belongs in Step 4 is specifically: if the similarity is greater than a threshold, placing the news item under the event of the news item in the news case with the highest similarity; otherwise, adding the event of the news item to the news case as a new event and placing the news item under that event.
  6. The method according to claim 1, characterized by further comprising Step 6: calculating the popularity (heat) of the events in the news case to obtain the current hot events.
CN201810054183.3A 2018-01-19 2018-01-19 A method for extracting events from news Pending CN108197112A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810054183.3A CN108197112A (en) 2018-01-19 2018-01-19 A method for extracting events from news

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810054183.3A CN108197112A (en) 2018-01-19 2018-01-19 A method for extracting events from news

Publications (1)

Publication Number Publication Date
CN108197112A true CN108197112A (en) 2018-06-22

Family

ID=62590346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810054183.3A Pending CN108197112A (en) 2018-01-19 2018-01-19 A method for extracting events from news

Country Status (1)

Country Link
CN (1) CN108197112A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080183665A1 (en) * 2007-01-29 2008-07-31 Klaus Brinker Method and apparatus for incorprating metadata in datas clustering
CN103870474A (en) * 2012-12-11 2014-06-18 北京百度网讯科技有限公司 News topic organizing method and device
CN104035960A (en) * 2014-05-08 2014-09-10 东莞市巨细信息科技有限公司 Internet information hotspot predicting method
CN105320646A (en) * 2015-11-17 2016-02-10 天津大学 Incremental clustering based news topic mining method and apparatus thereof
CN106844786A (en) * 2016-12-08 2017-06-13 中国电子科技网络信息安全有限公司 A kind of public sentiment region focus based on text similarity finds method

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263254A (en) * 2019-06-20 2019-09-20 北京百度网讯科技有限公司 Event stage division, device, equipment and medium
CN110489541A (en) * 2019-07-26 2019-11-22 昆明理工大学 Case-involving public sentiment newsletter archive method of abstracting based on case element and BiGRU
CN110489541B (en) * 2019-07-26 2021-02-05 昆明理工大学 Case element and BiGRU-based text summarization method for case public opinion related news
CN111950199A (en) * 2020-08-11 2020-11-17 杭州叙简科技股份有限公司 Earthquake data structured automation method based on earthquake news event
CN113515624A (en) * 2021-04-28 2021-10-19 乐山师范学院 Text classification method for emergency news

Similar Documents

Publication Publication Date Title
CN106250513B (en) Event modeling-based event personalized classification method and system
Gharge et al. An integrated approach for malicious tweets detection using NLP
CN108197112A (en) A method for extracting events from news
CN103795612A (en) Method for detecting junk and illegal messages in instant messaging
CN109145180B (en) Enterprise hot event mining method based on incremental clustering
US20150113651A1 (en) Spammer group extraction apparatus and method
CN104040963A (en) System and methods for spam detection using frequency spectra of character strings
CN107944032B (en) Method and apparatus for generating information
CN105939359A (en) Method and device for detecting privacy leakage of mobile terminal
CN105408894B (en) A kind of user identity classification determines method and device
Kim et al. SMS spam filterinig using keyword frequency ratio
Wang et al. TextDroid: Semantics-based detection of mobile malware using network flows
CN106097113B (en) Social network user dynamic and static interest mining method
EP3009942A1 (en) Social contact message monitoring method and device
Unankard et al. Location-based emerging event detection in social networks
US11438346B2 (en) Restrict transmission of manipulated content in a networked environment
CN116738369A (en) Traffic data classification method, device, equipment and storage medium
CN104268214A (en) Micro-blog user relationship based user gender identification method and system
CN106503198A (en) A kind of cold data recognition methodss and system based on hadoop metadata
Eichinger et al. Affinity: A system for latent user similarity comparison on texting data
Nagdeve et al. Spam detection by designing machine learning approach in Twitter stream
CN107391695A (en) A kind of information extracting method based on big data
US11356474B2 (en) Restrict transmission of manipulated content in a networked environment
Nampoothiri et al. Email forensic analysis based on k-means clustering
Lu et al. A method of SNS topic models extraction based on self-adaptively LDA modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180622