CN108197112A - A method for extracting events from news - Google Patents
- Publication number
- CN108197112A (application CN201810054183.3A)
- Authority
- CN
- China
- Prior art keywords
- news
- similarity
- event
- case
- affiliated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for extracting events from news. Summary information extracted from each news item serves as the event to which that item belongs, and the news text is converted numerically to obtain a vector representation of the text. A clustering method computes the similarity between news items, and the items are quickly classified by event on the basis of that similarity, so that news belonging to the same event is clustered together simply and effectively and the popularity of the news is obtained, which facilitates subsequent public-opinion monitoring. The method can classify massive volumes of news simply, quickly and effectively, provides guidance for public-opinion analysis, further improves the control and monitoring of public opinion, and supports timely decision-making and public-opinion guidance.
Description
Technical field
The present invention relates to the field of computer network communication technology, and in particular to a method for extracting events from news.
Background
With the continuous development of computer network technology, obtaining information from the network has become one of the main ways people learn about events, and news is a principal form of network information resources. News portals at home and abroad generate large volumes of news at every moment, and people often find themselves in a predicament: on the one hand, they cannot select or digest the massive amount of information they receive and are submerged in it; on the other hand, they become lost in the information and struggle to find what they really need. Obtaining the needed information quickly and efficiently is therefore an urgent demand that people place on network information. Under these circumstances, automatically and effectively clustering large amounts of information is clearly necessary.
In addition, with the rapid development of the internet, online public opinion has a growing influence on society. Whether for a government that needs to monitor online public opinion or for an enterprise that needs to manage brand communication and brand public relations, the question is how to obtain hot events quickly and effectively from massive public-opinion data and analyze the sentiment of public opinion, so that the relevant personnel can make timely and reliable decisions, guide public opinion, and respond to a rapidly changing public-opinion environment.
Summary of the invention
To solve the problems of the prior art, the present invention proposes a method for extracting events from news. The method enables monitoring of the popularity of news events and can provide effective guidance for public-opinion monitoring and decision-making.
The method for extracting events from news according to the present invention comprises the following steps:
Step 1: obtain an original news data set related to the target topic;
Step 2: extract the summary of each news item as the event to which it belongs, and convert the news text numerically;
Step 3: set up a news case and determine whether it already contains any news; if not, add the event of the current news item to the news case as a new event and place the item under that event; otherwise, perform step 4;
Step 4: calculate the similarity between the current news item and the news already in the news case, and determine from the similarity which event in the news case the item belongs to;
Step 5: determine whether the news case contains all news items and their events; if so, end; otherwise, return to step 1.
Further, the original news data set in step 1 includes the news ID, news title and news content of each news item.
Further, the numerical conversion in step 2 includes:
Step 2.1: segment the news title and news content into words;
Step 2.2: pass the segmented text to a doc2vec model to obtain the vector representation of the text.
Further, the similarity in step 4 is calculated using cosine similarity, specifically:
$\mathrm{sim}(x, y) = 0.3\cos(x_1, y_1) + 0.7\cos(x_2, y_2)$,
where $x_1$ and $y_1$ are the title vectors of the two news items, $x_2$ and $y_2$ are their content vectors, $\cos(x_1, y_1)$ is the similarity between the title vectors, $\cos(x_2, y_2)$ is the similarity between the content vectors, and $\mathrm{sim}(x, y)$ is the final similarity of the two news items; the larger the value, the higher the similarity.
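The weighted similarity can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names `cosine` and `news_similarity` are hypothetical, the cosine is assumed to be the standard dot product divided by the product of the norms, and returning 0.0 for a zero vector is an assumption the text does not address.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

def news_similarity(title_x, content_x, title_y, content_y,
                    title_weight=0.3, content_weight=0.7):
    """sim(x, y) = 0.3 * cos(title vectors) + 0.7 * cos(content vectors)."""
    return (title_weight * cosine(title_x, title_y)
            + content_weight * cosine(content_x, content_y))
```

The default weights are the 0.3 and 0.7 from the formula; exposing them as parameters is only a convenience for experimentation.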
Further, determining in step 4 the event in the news case to which each remaining news item belongs according to the similarity is specifically: if the similarity is greater than a threshold, place the news item under the event of the news in the news case with the greatest similarity; otherwise, add the event of the news item to the news case as a new event and place the item under that event.
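The threshold rule can be sketched as follows. The names `assign_to_event`, `news_case` and `similarity` are hypothetical, integer event ids stand in for the extracted summaries, and the threshold is left to the caller because the text does not state a value.

```python
def assign_to_event(item, news_case, similarity, threshold):
    """Place `item` under the best-matching event, or open a new event.

    news_case maps an event id to the list of items already under it;
    similarity is any callable returning a score for two items.
    """
    best_event, best_score = None, float("-inf")
    for event_id, members in news_case.items():
        for member in members:
            score = similarity(item, member)
            if score > best_score:
                best_event, best_score = event_id, score
    if best_event is not None and best_score > threshold:
        news_case[best_event].append(item)  # join the event of the most similar news
        return best_event
    new_event = len(news_case)  # hypothetical id for the new event
    news_case[new_event] = [item]
    return new_event
```

An empty `news_case` falls through to the new-event branch, which matches the behavior described for the first news item in step 3.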
Further, the method also includes step 6: calculate the popularity of the events in the news case to obtain the current hot events.
The present invention takes the summary extracted from the news content as the news event and clusters all news into events according to news similarity, so that the popularity of the target topic can be obtained and hot events can be identified simply and quickly; it can provide effective data guidance for public-opinion guidance and decision-making.
Description of the drawings
The drawings described here provide a further understanding of the embodiments of the present invention and form part of this application; they do not limit the embodiments of the present invention. In the drawings:
Fig. 1 is a flowchart of the first embodiment of the method for extracting events from news according to the present invention;
Fig. 2 is a flowchart of the second embodiment of the method for extracting events from news according to the present invention.
Specific embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the embodiments and the drawings. The exemplary embodiments of the present invention and their descriptions serve only to explain the present invention and do not limit it.
Embodiment 1
As shown in Fig. 1, this embodiment provides a method for extracting events from news, which specifically includes the following steps:
S01: obtain an original news data set related to the target topic, including the news ID, news title and news content.
S02: extract the summary of each news item as the event to which it belongs, and convert the news text numerically.
The numerical conversion includes:
Step 2.1, train the doc2vec models: segment the news titles and news content; for example, "今天天气真好" ("the weather is really nice today") is segmented into "今", "天", "天", "气", "真", "好". Using the segmented titles and content, train a doc2vec model for titles and a doc2vec model for content separately, and save them locally;
Step 2.2, convert text to vectors: for any new news item, first segment its title and content, then use the trained doc2vec models to convert the title and the content each into a 300-dimensional numeric vector.
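The segmentation example in step 2.1 splits the sentence into single characters. A minimal sketch of that split, assuming the example sentence is 今天天气真好 and that whitespace is dropped (the text does not say how whitespace or punctuation is handled); `segment_chars` is a hypothetical name:

```python
def segment_chars(text):
    """Split text into single characters, skipping whitespace (an assumption)."""
    return [ch for ch in text if not ch.isspace()]

tokens = segment_chars("今天天气真好")
```

The resulting token list is what would then be passed to the title or content doc2vec model for training or inference.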
S03: set up a news case and determine whether it already contains any news; if not, add the event of the current news item to the news case as a new event and place the item under that event; otherwise, perform S04.
S04: calculate the similarity between the current news item and the news already in the news case, and determine from the similarity which event in the news case the item belongs to.
Cosine distance measures direction rather than length. When the similarity of two vectors is 1, the two vectors point in the same direction, as with (1, 1) and (3, 3), even though their specific values differ. Cosine distance is therefore commonly used to measure the similarity between two vectors. The similarity is calculated using cosine similarity, specifically:
$\mathrm{sim}(x, y) = 0.3\cos(x_1, y_1) + 0.7\cos(x_2, y_2)$,
where $x_1$ and $y_1$ are the title vectors of the two news items, $x_2$ and $y_2$ are their content vectors, $\cos(x_1, y_1)$ is the similarity between the title vectors, $\cos(x_2, y_2)$ is the similarity between the content vectors, and $\mathrm{sim}(x, y)$ is the final similarity of the two news items; the larger the value, the higher the similarity.
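As a worked check of the (1, 1) and (3, 3) example, assuming the standard definition of the cosine of the angle between two vectors (the text itself does not spell the definition out):

```latex
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
\quad\Rightarrow\quad
\cos\big((1,1),(3,3)\big)
  = \frac{1 \cdot 3 + 1 \cdot 3}{\sqrt{2}\,\sqrt{18}}
  = \frac{6}{6}
  = 1
```

The two vectors differ in length by a factor of 3, yet the cosine is exactly 1 because only the direction enters the calculation.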
S05: determine whether the news case contains all news items and their events; if so, end; otherwise, return to S01.
The method further includes step S06: calculate the popularity of the events in the news case to obtain the current hot events.
The popularity of a news item is jointly determined by its read count, comment count and number of participants. An event is a set of similar news items, and its popularity is the sum of the popularity of its news items. Calculating event popularity yields the current hot events, so that a response can be made according to one's own needs.
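The popularity calculation can be sketched as below. The text says only that read count, comment count and number of participants jointly determine the popularity of a news item; the unweighted sum used here is an assumption, and the names `news_popularity`, `rank_events` and the dictionary keys are hypothetical.

```python
def news_popularity(reads, comments, participants):
    """Popularity of one news item (assumption: unweighted sum of the three counts)."""
    return reads + comments + participants

def rank_events(news_case):
    """Return (event, popularity) pairs, hottest first.

    news_case maps an event to a list of dicts with the hypothetical keys
    'reads', 'comments' and 'participants'.
    """
    totals = {
        event: sum(news_popularity(n["reads"], n["comments"], n["participants"])
                   for n in items)
        for event, items in news_case.items()
    }
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
```

Summing over the items of an event follows the statement that event popularity is the sum of the popularity of its news items; only the per-item formula is assumed.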
Embodiment 2
As shown in Fig. 2, this embodiment provides a method for extracting events from news that, building on the embodiment above, further specifies how the event in the news case to which a news item belongs is determined according to the similarity. The method specifically includes:
S11: obtain an original news data set related to the target topic, including the news ID, news title and news content;
S12: extract the summary of each news item as the event to which it belongs, and convert the news text numerically;
The numerical conversion includes:
Step 2.1, train the doc2vec models: segment the news titles and news content; for example, "今天天气真好" ("the weather is really nice today") is segmented into "今", "天", "天", "气", "真", "好". Using the segmented titles and content, train a doc2vec model for titles and a doc2vec model for content separately, and save them locally;
Step 2.2, convert text to vectors: for any new news item, first segment its title and content, then use the trained doc2vec models to convert the title and the content each into a 300-dimensional numeric vector.
S13: set up a news case and determine whether it already contains any news; if not, add the event of the current news item to the news case as a new event and place the item under that event; otherwise, perform S14.
S14: calculate the similarity between the current news item and the news already in the news case, and determine from the similarity which event in the news case the item belongs to;
Cosine distance measures direction rather than length. When the similarity of two vectors is 1, the two vectors point in the same direction, as with (1, 1) and (3, 3), even though their specific values differ. Cosine distance is therefore commonly used to measure the similarity between two vectors. The similarity is calculated using cosine similarity, specifically:
$\mathrm{sim}(x, y) = 0.3\cos(x_1, y_1) + 0.7\cos(x_2, y_2)$,
where $x_1$ and $y_1$ are the title vectors of the two news items, $x_2$ and $y_2$ are their content vectors, $\cos(x_1, y_1)$ is the similarity between the title vectors, $\cos(x_2, y_2)$ is the similarity between the content vectors, and $\mathrm{sim}(x, y)$ is the final similarity of the two news items; the larger the value, the higher the similarity.
The event in the news case to which the news item belongs is determined according to the similarity, specifically: if the similarity is greater than a threshold, place the news item under the event of the news in the news case with the greatest similarity; otherwise, add the event of the news item to the news case as a new event and place the item under that event.
S15: determine whether the news case contains all news items and their events; if so, end; otherwise, return to S11.
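Steps S11 to S15 amount to a single pass over the news items. The sketch below strings them together under stated assumptions: vectors are supplied ready-made in place of doc2vec output, the dictionary keys `id`, `summary`, `title_vec` and `content_vec` are hypothetical, the threshold of 0.8 is an arbitrary illustrative value, and items with identical summaries are assumed to share one event.

```python
import math

def _cos(a, b):
    """Cosine similarity; returns 0.0 for a zero vector (an assumption)."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster_news(news_items, threshold=0.8):
    """Single pass over the items; returns {event summary: [news id, ...]}."""
    news_case = {}  # event summary -> items placed under that event
    for item in news_items:
        best_event, best_score = None, float("-inf")
        for event, members in news_case.items():
            for other in members:
                score = (0.3 * _cos(item["title_vec"], other["title_vec"])
                         + 0.7 * _cos(item["content_vec"], other["content_vec"]))
                if score > best_score:
                    best_event, best_score = event, score
        if best_event is not None and best_score > threshold:
            news_case[best_event].append(item)
        else:
            # Empty case or nothing similar enough: the item's summary opens a new event.
            news_case.setdefault(item["summary"], []).append(item)
    return {event: [m["id"] for m in members] for event, members in news_case.items()}
```

Each new item is compared against every item already placed, so the cost grows quadratically with the number of news items; the text does not describe any indexing to avoid this.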
The method further includes step S16: calculate the popularity of the events in the news case to obtain the current hot events.
The popularity of a news item is jointly determined by its read count, comment count and number of participants. An event is a set of similar news items, and its popularity is the sum of the popularity of its news items. Calculating event popularity yields the current hot events, so that a response can be made according to one's own needs.
The present invention takes the summary extracted from the news content as the news event and clusters all news into events according to news similarity, so that the popularity of the target topic can be obtained. The method can classify massive volumes of information simply, quickly and effectively, and provides effective guidance for controlling and guiding public opinion.
The specific embodiments described above explain the objectives, technical solutions and beneficial effects of the present invention in further detail. It should be understood that the foregoing is merely specific embodiments of the present invention and is not intended to limit its scope of protection; any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (6)
- 1. A method for extracting events from news, characterized by comprising: Step 1: obtain an original news data set related to the target topic; Step 2: extract the summary of each news item as the event to which it belongs, and convert the news text numerically; Step 3: set up a news case and determine whether it already contains any news; if not, add the event of the current news item to the news case as a new event and place the item under that event; otherwise, perform step 4; Step 4: calculate the similarity between the current news item and the news already in the news case, and determine from the similarity which event in the news case the item belongs to; Step 5: determine whether the news case contains all news items and their events; if so, end; otherwise, return to step 1.
- 2. The method according to claim 1, characterized in that the original news data set in step 1 includes the news ID, news title and news content of each news item.
- 3. The method according to claim 2, characterized in that the numerical conversion in step 2 includes: Step 2.1: segment the news title and news content into words; Step 2.2: pass the segmented text to a doc2vec model to obtain the vector representation of the text.
- 4. The method according to claim 3, characterized in that the similarity in step 4 is calculated using cosine similarity, specifically: $\mathrm{sim}(x, y) = 0.3\cos(x_1, y_1) + 0.7\cos(x_2, y_2)$, where $x_1$ and $y_1$ are the title vectors of the two news items, $x_2$ and $y_2$ are their content vectors, $\cos(x_1, y_1)$ is the similarity between the title vectors, $\cos(x_2, y_2)$ is the similarity between the content vectors, and $\mathrm{sim}(x, y)$ is the final similarity of the two news items; the larger the value, the higher the similarity.
- 5. The method according to claim 4, characterized in that determining in step 4 the event in the news case to which the news item belongs according to the similarity is specifically: if the similarity is greater than a threshold, place the news item under the event of the news in the news case with the greatest similarity; otherwise, add the event of the news item to the news case as a new event and place the item under that event.
- 6. The method according to claim 1, characterized by further including step 6: calculate the popularity of the events in the news case to obtain the current hot events.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810054183.3A CN108197112A (en) | 2018-01-19 | 2018-01-19 | A method for extracting events from news |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108197112A true CN108197112A (en) | 2018-06-22 |
Family
ID=62590346
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810054183.3A Pending CN108197112A (en) | 2018-01-19 | 2018-01-19 | A method for extracting events from news |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108197112A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110263254A (en) * | 2019-06-20 | 2019-09-20 | 北京百度网讯科技有限公司 | Event stage division, device, equipment and medium |
CN110489541A (en) * | 2019-07-26 | 2019-11-22 | 昆明理工大学 | Case-involving public sentiment newsletter archive method of abstracting based on case element and BiGRU |
CN111950199A (en) * | 2020-08-11 | 2020-11-17 | 杭州叙简科技股份有限公司 | Earthquake data structured automation method based on earthquake news event |
CN113515624A (en) * | 2021-04-28 | 2021-10-19 | 乐山师范学院 | Text classification method for emergency news |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080183665A1 (en) * | 2007-01-29 | 2008-07-31 | Klaus Brinker | Method and apparatus for incorprating metadata in datas clustering |
CN103870474A (en) * | 2012-12-11 | 2014-06-18 | 北京百度网讯科技有限公司 | News topic organizing method and device |
CN104035960A (en) * | 2014-05-08 | 2014-09-10 | 东莞市巨细信息科技有限公司 | Internet information hotspot predicting method |
CN105320646A (en) * | 2015-11-17 | 2016-02-10 | 天津大学 | Incremental clustering based news topic mining method and apparatus thereof |
CN106844786A (en) * | 2016-12-08 | 2017-06-13 | 中国电子科技网络信息安全有限公司 | A kind of public sentiment region focus based on text similarity finds method |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110263254A (en) * | 2019-06-20 | 2019-09-20 | 北京百度网讯科技有限公司 | Event stage division, device, equipment and medium |
CN110489541A (en) * | 2019-07-26 | 2019-11-22 | 昆明理工大学 | Case-involving public sentiment newsletter archive method of abstracting based on case element and BiGRU |
CN110489541B (en) * | 2019-07-26 | 2021-02-05 | 昆明理工大学 | Case element and BiGRU-based text summarization method for case public opinion related news |
CN111950199A (en) * | 2020-08-11 | 2020-11-17 | 杭州叙简科技股份有限公司 | Earthquake data structured automation method based on earthquake news event |
CN113515624A (en) * | 2021-04-28 | 2021-10-19 | 乐山师范学院 | Text classification method for emergency news |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106250513B (en) | Event modeling-based event personalized classification method and system | |
Gharge et al. | An integrated approach for malicious tweets detection using NLP | |
CN108197112A (en) | A method for extracting events from news | |
CN103795612A (en) | Method for detecting junk and illegal messages in instant messaging | |
CN109145180B (en) | Enterprise hot event mining method based on incremental clustering | |
US20150113651A1 (en) | Spammer group extraction apparatus and method | |
CN104040963A (en) | System and methods for spam detection using frequency spectra of character strings | |
CN107944032B (en) | Method and apparatus for generating information | |
CN105939359A (en) | Method and device for detecting privacy leakage of mobile terminal | |
CN105408894B (en) | A kind of user identity classification determines method and device | |
Kim et al. | SMS spam filterinig using keyword frequency ratio | |
Wang et al. | TextDroid: Semantics-based detection of mobile malware using network flows | |
CN106097113B (en) | Social network user dynamic and static interest mining method | |
EP3009942A1 (en) | Social contact message monitoring method and device | |
Unankard et al. | Location-based emerging event detection in social networks | |
US11438346B2 (en) | Restrict transmission of manipulated content in a networked environment | |
CN116738369A (en) | Traffic data classification method, device, equipment and storage medium | |
CN104268214A (en) | Micro-blog user relationship based user gender identification method and system | |
CN106503198A (en) | A kind of cold data recognition methodss and system based on hadoop metadata | |
Eichinger et al. | Affinity: A system for latent user similarity comparison on texting data | |
Nagdeve et al. | Spam detection by designing machine learning approach in Twitter stream | |
CN107391695A (en) | A kind of information extracting method based on big data | |
US11356474B2 (en) | Restrict transmission of manipulated content in a networked environment | |
Nampoothiri et al. | Email forensic analysis based on k-means clustering | |
Lu et al. | A method of SNS topic models extraction based on self-adaptively LDA modeling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180622 |