CN109241438B - Element-based cross-channel hot event discovery method and device and storage medium - Google Patents

Info

Publication number
CN109241438B
Authority
CN
China
Prior art keywords
data
similarity
preset
news
news data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811128658.5A
Other languages
Chinese (zh)
Other versions
CN109241438A (en)
Inventor
段东圣
杜翠兰
李鹏霄
刘晓辉
李扬曦
佟玲玲
程光
张琳
井雅琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center
Priority to CN201811128658.5A
Publication of CN109241438A
Application granted
Publication of CN109241438B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention discloses a method, a device and a storage medium for discovering a cross-channel hot event based on elements.

Description

Element-based cross-channel hot event discovery method and device and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a cross-channel hot event discovery method and device based on elements and a computer readable storage medium.
Background
Network hot events are the events that the public pays the most attention to and discusses most frequently within a certain time and scope. Hot event discovery automatically identifies hot content in a propagating data stream and links other information associated with that content. In fields such as sports, finance, politics, and entertainment, discovering hot events promptly makes it possible to capture public focus more quickly and to better grasp how the field is developing, which is of great significance for guiding topics.
Traditional research on hot topics tends to lack detailed analysis of the event sets within a topic. Moreover, compared with texts in social networks, long texts such as news reports tend to attract less public attention and to be less timely, so hot event discovery based on long texts lacks timeliness and sensitivity. Traditional research is therefore not well suited to the hot event discovery task.
Disclosure of Invention
The invention provides an element-based cross-channel hot event discovery method and device and a computer-readable storage medium, which aim to solve the problem in the prior art that hot events are discovered from news with poor timeliness.
In one aspect, the present invention provides a method for discovering a cross-channel hot spot event based on elements, the method comprising:
preprocessing the collected news data to obtain news data with irrelevant information filtered out, and further processing the news data with irrelevant information filtered out;
performing joint analysis on the news data with irrelevant information filtered out and the further processed news data;
the joint analysis comprises: extracting elements from the news data with irrelevant information filtered out and from the further processed news data, respectively, by combining a CRF (conditional random field) model with rules; performing element similarity calculation and semantic similarity calculation on the elements obtained from the two channels according to preset weight coefficients; performing a weighted summation of the element similarity and the semantic similarity based on a preset cross-channel event similarity calculation function to calculate the similarity between the texts in the two channels; and putting the texts whose similarity value is greater than a preset similarity threshold into the same set to describe an event.
Preferably, the elements include one or more of: time, place, people, event description keywords.
Preferably, the news data with the irrelevant information filtered out is further processed, including:
marking the extracted keywords, and reserving a keyword set capable of representing preset domain knowledge;
training a word2vec model on Wikipedia data to obtain word vectors, performing similar-word expansion on the keyword set obtained from the news data with irrelevant information filtered out, and adding the words in the Wikipedia dictionary whose similarity to words in the keyword set is higher than a preset similarity threshold, to form an initial keyword library for the field;
continuously collecting Sina microblog data, filtering irrelevant characters from the collected data, removing repeated text data, segmenting words according to the data obtained by removing the repeated text data, searching the Sina microblog data after segmenting words according to the initial keyword library, and extracting preset microblog data in the field.
Preferably, the method further comprises: updating the initial keyword library.
Preferably, the preset similarity threshold is 0.7.
In another aspect, the present invention provides an element-based cross-channel hot spot event discovery apparatus, including:
the preprocessing unit is used for preprocessing the collected news data to obtain the news data with irrelevant information filtered out and further processing the news data with irrelevant information filtered out;
the analysis unit is used for performing joint analysis on the news data with irrelevant information filtered out and the further processed news data; the joint analysis comprises: extracting elements from the news data with irrelevant information filtered out and from the further processed news data, respectively, by combining a CRF (conditional random field) model with rules; performing element similarity calculation and semantic similarity calculation on the elements obtained from the two channels according to preset weight coefficients; performing a weighted summation of the element similarity and the semantic similarity based on a preset cross-channel event similarity calculation function to calculate the similarity between the texts in the two channels; and putting the texts whose similarity value is greater than a preset similarity threshold into the same set to describe an event.
Preferably, the elements include one or more of: time, place, person, event description keyword.
Preferably, the preprocessing unit is further configured to label the extracted keywords and retain a keyword set capable of representing preset domain knowledge; train a word2vec model on Wikipedia data to obtain word vectors, perform similar-word expansion on the keyword set obtained from the news data with irrelevant information filtered out, and add the words in the Wikipedia dictionary whose similarity to words in the keyword set is higher than a preset similarity threshold, to form an initial keyword library for the field; and continuously collect Sina microblog data, filter irrelevant characters from the collected data, remove repeated text data, segment the de-duplicated data into words, search the segmented Sina microblog data against the initial keyword library, and extract the microblog data in the preset field.
Preferably, the apparatus further comprises: an updating unit for updating the initial keyword library.
In yet another aspect, the present invention further provides a computer-readable storage medium storing a computer program for signal mapping, where the computer program is executed by at least one processor to implement any one of the above element-based cross-channel hotspot event discovery methods.
The invention has the following beneficial effects:
The method fuses news report data and microblog data in a given field, and discovers hot events in the field by combining the elements extracted from the two channels with semantic similarity analysis of the texts, so that hot events can be understood more comprehensively and in more detail.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic flowchart of a cross-channel hot event discovery method based on elements according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of data preprocessing according to an embodiment of the present invention;
FIG. 3 is a schematic flow diagram of a joint analysis method according to an embodiment of the invention;
FIG. 4 is a schematic flow chart of updating a keyword library according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an element-based cross-channel hot spot event discovery apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
A first embodiment of the present invention provides a method for discovering a cross-channel hot spot event based on elements, which is shown in FIG. 1 and includes:
preprocessing the collected news data to obtain news data with irrelevant information filtered out, and further processing the news data with irrelevant information filtered out;
performing joint analysis on the news data with irrelevant information filtered out and the further processed news data;
the joint analysis comprises: extracting elements from the news data with irrelevant information filtered out and from the further processed news data, respectively, by combining a CRF (conditional random field) model with rules; performing element similarity calculation and semantic similarity calculation on the elements obtained from the two channels according to preset weight coefficients; performing a weighted summation of the element similarity and the semantic similarity based on a preset cross-channel event similarity calculation function to calculate the similarity between the texts in the two channels; and putting the texts whose similarity value is greater than a preset similarity threshold into the same set to describe an event.
The method fuses news report data and microblog data in a given field, and discovers hot events in the field by combining the elements extracted from the two channels with semantic similarity analysis of the texts, so that hot events can be understood more comprehensively and in more detail.
It should be noted that the elements described in the embodiments of the present invention include one or more of the following: time, place, people, event description keywords.
In the embodiment of the invention, the news data with the filtered irrelevant information is further processed, and the processing comprises the following steps:
marking the extracted keywords, and reserving a keyword set capable of representing preset domain knowledge;
training a word2vec model on Wikipedia data to obtain word vectors, performing similar-word expansion on the keyword set obtained from the news data with irrelevant information filtered out, and adding the words in the Wikipedia dictionary whose similarity to words in the keyword set is higher than a preset similarity threshold, to form an initial keyword library for the field;
continuously collecting the Sina microblog data, filtering irrelevant characters from the collected data, removing repeated text data, segmenting words according to the data obtained by removing the repeated text data, searching the segmented Sina microblog data according to the initial keyword library, and extracting preset microblog data in the field.
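The similar-word expansion described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the three-dimensional `TOY_VECTORS` stand in for word2vec vectors trained on Wikipedia, and the names `cosine` and `expand_keywords` are hypothetical. Only the 0.7 threshold comes from the description.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_keywords(seed_keywords, word_vectors, threshold=0.7):
    """Return the seeds plus every vocabulary word whose cosine similarity
    to at least one seed keyword is greater than the threshold."""
    library = set(seed_keywords)
    for word, vec in word_vectors.items():
        if word in library:
            continue
        for seed in seed_keywords:
            seed_vec = word_vectors.get(seed)
            if seed_vec is not None and cosine(vec, seed_vec) > threshold:
                library.add(word)
                break
    return library


# Toy vectors (an assumption): real vectors would come from a trained model.
TOY_VECTORS = {
    "football": [0.9, 0.1, 0.0],
    "soccer": [0.85, 0.15, 0.05],
    "stock": [0.0, 0.1, 0.95],
}

initial_library = expand_keywords(["football"], TOY_VECTORS)
print(sorted(initial_library))
```

In a real system the dictionary lookup would be replaced by a query against the trained word vector model; the thresholding logic stays the same.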
As shown in FIG. 2, the preprocessing in the embodiment of the present invention specifically includes: first, collecting news reports in a specific field, where the reports come from news websites dedicated to the field and from the field-specific columns of some general news websites; then extracting keywords from the collected data by using a deep learning method, and representing the domain knowledge with the extracted keyword set. Microblog data is collected over a period of time, irrelevant information is filtered out of it, and the domain knowledge extracted from news is used to identify and extract the microblog data belonging to the specific field. The specific steps are as follows:
1. Select a news website in a specific field, or the field-specific column of some large news websites, as the collection object, and filter out irrelevant information from the collected news reports.
2. Extract a keyword set from the data processed in step 1 by using the TF-IDF method, manually label the extracted keywords, and retain only the keywords that can represent knowledge in the field.
3. Train a word2vec model on Wikipedia data to obtain word vectors, perform similar-word expansion on the keyword set obtained in step 2, and add the words in the Wikipedia dictionary whose similarity to a word in the keyword set is greater than 0.7, forming an initial keyword library for the field.
4. Continuously collect Sina microblog data, filter irrelevant characters from the collected data, and remove repeated text data.
5. Segment the data obtained in step 4 into words, search the segmented Sina microblog data against the initial keyword library obtained in step 3, and extract the Sina microblog data in the field.
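Steps 4 and 5 can be sketched roughly as below. The patent does not say which characters count as irrelevant or which word segmenter is used, so the URL and @mention patterns and the regex-based `tokenize` are assumptions made purely for illustration (Chinese text would need a real word segmenter). All function names are hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_post(text):
    """Strip URLs, @mentions, and redundant whitespace from one post."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())


def tokenize(text):
    """Placeholder segmenter: lowercase word tokens (assumption)."""
    return re.findall(r"\w+", text.lower())


def select_domain_posts(posts, keyword_library):
    """Clean, de-duplicate, segment, and keep posts that hit the library."""
    keywords = set(keyword_library)
    seen = set()
    selected = []
    for raw in posts:
        cleaned = clean_post(raw)
        if not cleaned or cleaned in seen:
            continue  # skip empty posts and exact duplicates
        seen.add(cleaned)
        if keywords & set(tokenize(cleaned)):
            selected.append(cleaned)
    return selected


posts = [
    "Great football match tonight! http://example.com/a",
    "Great football match tonight! http://example.com/b",
    "@friend stock prices fell again",
    "Late soccer goal wins the cup",
]
domain_posts = select_domain_posts(posts, {"football", "soccer"})
print(domain_posts)
```

De-duplication here is exact-match after cleaning; near-duplicate detection would be a separate design choice that the description does not cover.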
As shown in FIG. 3, the joint analysis in the embodiment of the present invention specifically includes: extracting event elements from the preprocessed news data and from the microblog data in the specific field, respectively, by combining a CRF (conditional random field) model with rules, where four elements are extracted: time, place, people, and event description keywords. Weight coefficients are defined for the four extracted elements, element matching is performed between hot events in microblogs and in news in the field, and a Wikipedia knowledge base is then incorporated through a deep learning method to calculate the semantic similarity between microblog and news events. The similarity of the event element information and the similarity of the event context semantic information are weighted to obtain a final similarity score, and similar texts from the two channels are combined to jointly describe the hot event. The specific steps are as follows:
1. Extract elements from the data obtained in preprocessing step 1 by combining a CRF model with rules, extracting: time, place, person name, and event description key information.
2. Extract elements from the data obtained in preprocessing step 5 by combining a CRF model with rules, extracting: time, place, person name, and event description key information.
3. Calculate the similarity of the elements in the two channels with the function a·T + b·Pl + c·Pe + d·E, where a, b, c, and d are weight coefficients; T, Pl, and Pe indicate whether the events described by the two pieces of information have the same time, place, and person name, respectively (1 if the same, 0 if different); and E is the number of similar words in the two pieces of information, where the similarity threshold for treating two words as similar is set to 0.85.
4. Calculate the semantic similarity of the information in the two channels: train on a Wikipedia corpus by using a deep learning method to obtain a word vector model, represent the text sentences in the two channels as vectors, and calculate the similarity of the vectorized sentences by cosine similarity.
5. Define the cross-channel event similarity function as the weighted sum of the results of steps 3 and 4.
6. Calculate the similarity between the texts in the two channels according to the function defined in step 5, and put the texts whose similarity score is greater than a certain threshold into the same set to describe one type of event.
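Steps 3, 5, and 6 can be illustrated with the sketch below. The patent gives no values for the weights a, b, c, d, for the weights of the final sum, or for the grouping threshold, so every number here is an assumption, as are the `Event` record, the function names, and the reading of "threshold of 0.85" as "similarity of at least 0.85". The sentence vectors are supplied directly rather than computed.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Event:
    time: str
    place: str
    person: str
    keywords: list
    vector: list = field(default_factory=list)  # precomputed sentence vector


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def element_similarity(x, y, word_sim, a=1.0, b=1.0, c=1.0, d=0.5,
                       word_threshold=0.85):
    """a*T + b*Pl + c*Pe + d*E from step 3 (weights are assumptions)."""
    t = 1 if x.time == y.time else 0
    pl = 1 if x.place == y.place else 0
    pe = 1 if x.person == y.person else 0
    # E: keywords of x that have a sufficiently similar keyword in y.
    e = sum(1 for w in x.keywords
            if any(word_sim(w, v) >= word_threshold for v in y.keywords))
    return a * t + b * pl + c * pe + d * e


def event_similarity(x, y, word_sim, alpha=0.5, beta=0.5):
    """Step 5: weighted sum of element and semantic similarity."""
    return (alpha * element_similarity(x, y, word_sim)
            + beta * cosine(x.vector, y.vector))


def group_events(news, posts, word_sim, threshold):
    """Step 6: attach each microblog text that scores above the threshold."""
    groups = []
    for n in news:
        matched = [p for p in posts
                   if event_similarity(n, p, word_sim) > threshold]
        if matched:
            groups.append((n, matched))
    return groups


def exact_match(w1, w2):
    """Toy word similarity (assumption): 1.0 for identical words, else 0.0."""
    return 1.0 if w1 == w2 else 0.0


news = Event("2018-09-27", "Beijing", "Player A", ["match", "final"], [1.0, 0.0])
post_same = Event("2018-09-27", "Beijing", "Player A", ["final", "win"], [0.8, 0.6])
post_other = Event("2018-09-20", "Shanghai", "Trader B", ["stock"], [0.0, 1.0])

score = event_similarity(news, post_same, exact_match)
groups = group_events([news], [post_same, post_other], exact_match, threshold=1.0)
```

Because the element score is unbounded (E is a count) while cosine similarity lies in [-1, 1], the weights and the threshold have to be tuned together; the description leaves that choice open.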
The method of the embodiment of the present invention further includes: updating the initial keyword library.
As shown in FIG. 4, updating the keywords of the domain knowledge base specifically includes: extracting keywords from the Sina microblog data in the field obtained after preprocessing, and adding them, as domain knowledge of the field, to the domain knowledge base built during preprocessing. The specific steps are as follows:
1. Extract keywords from the preprocessed data by using a TF-IDF model.
2. Compare each keyword obtained in step 1 with the words in the most recently obtained keyword library of the field; if the similarity between the keyword and some word in the library exceeds 0.7 and the keyword is not already in the library, add the keyword to the keyword library.
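The update procedure can be sketched as follows, assuming already-segmented documents. The TF-IDF variant (term frequency times log inverse document frequency, keeping each term's best score across documents), the `top_k` cut-off, and the toy similarity table are illustrative assumptions; only the 0.7 threshold is taken from the text.

```python
import math
from collections import Counter


def tfidf_keywords(documents, top_k=5):
    """Return the top_k terms by their best TF-IDF score in any document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    scores = Counter()
    for tokens in documents:
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            scores[term] = max(scores[term], tf * idf)
    return [term for term, _ in scores.most_common(top_k)]


def update_library(library, candidates, similarity, threshold=0.7):
    """Add candidates that are new and similar enough to a known keyword."""
    updated = set(library)
    for word in candidates:
        if word in library:
            continue
        if any(similarity(word, known) > threshold for known in library):
            updated.add(word)
    return updated


# Toy pairwise similarities (assumption); a word vector model would supply these.
TOY_SIM = {("transfer", "football"): 0.8, ("stock", "football"): 0.1}


def toy_similarity(w1, w2):
    return TOY_SIM.get((w1, w2), 0.0)


docs = [
    ["football", "transfer", "transfer", "deal"],
    ["football", "stock", "market"],
    ["football", "transfer", "rumor"],
]
candidates = tfidf_keywords(docs)
library = update_library({"football"}, candidates, toy_similarity)
print(sorted(library))
```

Note that a term appearing in every document (here `football`) gets an IDF of zero and therefore never surfaces as a new candidate, which is the usual TF-IDF behavior.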
In this embodiment the preset similarity threshold is 0.7; a person skilled in the art may choose another value according to actual needs, and the present invention does not specifically limit it.
Generally, the embodiment of the invention is based on a cross-channel hot spot event discovery technology, and the hot spot event is discovered by collecting data of different channels and fusing the data characteristics of each channel.
Moreover, the embodiment of the invention can also realize the automatic expansion of the domain knowledge: the new words with high similarity are continuously added into the domain knowledge word bank by continuously extracting the keywords from the new data and judging the similarity of the new data and the original domain knowledge word bank, so that the domain knowledge word bank can be dynamically updated.
In addition, the method and the device enable the description of the event similarity of different channels to be more accurate by matching the event elements of different channels and comparing the semantic similarity of different channels.
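As an illustration of the semantic comparison, the sketch below turns a segmented sentence into a vector and compares two sentences by cosine similarity. The description only states that sentences are vectorized with a word vector model; averaging the word vectors is one common choice and is an assumption here, as are the toy vectors and function names.

```python
import math


def sentence_vector(tokens, word_vectors):
    """Average the vectors of known tokens; return None if none is known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy two-dimensional vectors standing in for a trained model (assumption).
TOY = {"football": [1.0, 0.0], "match": [0.8, 0.2], "stock": [0.0, 1.0]}

news_vec = sentence_vector(["football", "match", "report"], TOY)
post_vec = sentence_vector(["match"], TOY)
other_vec = sentence_vector(["stock"], TOY)

print(cosine(news_vec, post_vec) > cosine(news_vec, other_vec))
```

Out-of-vocabulary tokens such as `report` are simply skipped, so very short microblog texts with no known words yield no vector and need separate handling.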
The method of the embodiment of the invention can at least obtain the following beneficial effects:
First, by combining data from different channels, the invention improves the timeliness and sensitivity of hot event discovery and the completeness of event description; by combining domain knowledge from different channels, it can expand the semantic expression of key information in the field, which is of great significance for improving the accuracy of event discovery.
Second, event elements extracted from news reports supplement the event elements of Sina microblogs, so that events can be better described. The event element similarity score and the text context semantic similarity score are weighted to obtain the final event similarity function, so that a set of similar hot events can be obtained more accurately and comprehensively. Extracting event elements also makes it possible to focus on events with a specific element, for example events at a specific time or in a specific place.
In addition, by regularly updating the domain knowledge base, the invention makes the discovery of hot events in the field more accurate, more comprehensive, and more timely, and allows the whole system to run automatically without additional manual intervention.
A second embodiment of the present invention provides an element-based cross-channel hot spot event discovery apparatus, as shown in FIG. 5, including:
the preprocessing unit is used for preprocessing the collected news data to obtain the news data with irrelevant information filtered out and further processing the news data with irrelevant information filtered out;
the analysis unit is used for performing joint analysis on the news data with irrelevant information filtered out and the further processed news data; the joint analysis comprises: extracting elements from the news data with irrelevant information filtered out and from the further processed news data, respectively, by combining a CRF (conditional random field) model with rules; performing element similarity calculation and semantic similarity calculation on the elements obtained from the two channels according to preset weight coefficients; performing a weighted summation of the element similarity and the semantic similarity based on a preset cross-channel event similarity calculation function to calculate the similarity between the texts in the two channels; and putting the texts whose similarity value is greater than a preset similarity threshold into the same set to describe an event.
The apparatus fuses news report data and microblog data in a given field, and discovers hot events in the field by combining the elements extracted from the two channels with semantic similarity analysis of the texts, so that hot events can be understood more comprehensively and in more detail.
It should be noted that the elements described in the embodiments of the present invention include one or more of the following: time, place, people, event description keywords.
Furthermore, the preprocessing unit in the embodiment of the present invention is further configured to label the extracted keywords and retain a keyword set capable of representing preset domain knowledge; train a word2vec model on Wikipedia data to obtain word vectors, perform similar-word expansion on the keyword set obtained from the news data with irrelevant information filtered out, and add the words in the Wikipedia dictionary whose similarity to words in the keyword set is greater than a preset similarity threshold, to form an initial keyword library for the field; and continuously collect Sina microblog data, filter irrelevant characters from the collected data, remove repeated text data, segment the de-duplicated data into words, search the segmented Sina microblog data against the initial keyword library, and extract the microblog data in the preset field.
In specific implementation, the apparatus according to the embodiment of the present invention further includes: an updating unit for updating the initial keyword library.
Relevant parts of the embodiments of the present invention can be understood by referring to the method embodiments, and detailed description is omitted here.
In a third embodiment of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, realizes the following method steps:
preprocessing the collected news data to obtain news data with irrelevant information filtered out, and further processing the news data with irrelevant information filtered out;
performing joint analysis on the news data with irrelevant information filtered out and the further processed news data;
the joint analysis comprises: extracting elements from the news data with irrelevant information filtered out and from the further processed news data, respectively, by combining a CRF (conditional random field) model with rules; performing element similarity calculation and semantic similarity calculation on the elements obtained from the two channels according to preset weight coefficients; performing a weighted summation of the element similarity and the semantic similarity based on a preset cross-channel event similarity calculation function to calculate the similarity between the texts in the two channels; and putting the texts whose similarity value is greater than a preset similarity threshold into the same set to describe an event.
Relevant parts of the embodiments of the present invention can be understood by referring to the method embodiments, and detailed description is omitted here.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system is apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in an element-based cross-channel hot event discovery apparatus according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Claims (8)

1. An element-based cross-channel hot event discovery method, characterized by comprising the following steps:
preprocessing the collected news data to obtain news data with irrelevant information filtered out, and further processing the news data with irrelevant information filtered out;
performing joint analysis on the news data with irrelevant information filtered out and the further processed news data;
the joint analysis comprises: extracting elements from the news data with irrelevant information filtered out and from the further processed news data, respectively, using a method that combines a CRF (conditional random field) model with rules; performing similarity calculation and semantic similarity calculation on the elements obtained from the two channels according to preset weight coefficients; performing a weighted summation of the element similarity and the semantic similarity of the two channels based on a preset cross-channel event similarity calculation function to calculate the similarity between the Chinese texts of the two channels; and placing texts whose similarity value is greater than a preset similarity threshold into the same set to describe an event;
wherein further processing the news data with irrelevant information filtered out comprises:
marking the extracted keywords, and reserving a keyword set capable of representing preset domain knowledge;
training a word2vec model on Wikipedia data to obtain word vectors, performing similar-word expansion on the keyword set obtained from the news data with irrelevant information filtered out, and adding to the keyword set those words in the Wikipedia dictionary whose similarity to a word in the keyword set is greater than a preset similarity threshold, to form an initial keyword library for the field;
continuously collecting Sina microblog data, filtering irrelevant characters from the collected data, removing duplicate text data, performing word segmentation on the deduplicated data, searching the segmented Sina microblog data against the initial keyword library, and extracting the microblog data of the preset field.
2. The method of claim 1,
the elements include one or more of the following: time, place, people, event description keywords.
3. The method of claim 1, further comprising:
and updating the initial keyword library.
4. The method of claim 1,
the preset similarity threshold is 0.7.
5. An element-based cross-channel hot spot event discovery device, comprising:
the preprocessing unit is used for preprocessing the collected news data to obtain the news data with irrelevant information filtered out and further processing the news data with irrelevant information filtered out;
the analysis unit is used for performing joint analysis on the news data with irrelevant information filtered out and the further processed news data; the joint analysis comprises: extracting elements from the news data with irrelevant information filtered out and from the further processed news data, respectively, using a method that combines a CRF (conditional random field) model with rules; performing similarity calculation and semantic similarity calculation on the elements obtained from the two channels according to preset weight coefficients; performing a weighted summation of the element similarity and the semantic similarity of the two channels based on a preset cross-channel event similarity calculation function to calculate the similarity between the Chinese texts of the two channels; and placing texts whose similarity value is greater than a preset similarity threshold into the same set to describe an event;
the preprocessing unit is also used for labeling the extracted keywords and retaining a keyword set capable of representing preset domain knowledge; training a word2vec model on Wikipedia data to obtain word vectors, performing similar-word expansion on the keyword set obtained from the news data with irrelevant information filtered out, and adding to the keyword set those words in the Wikipedia dictionary whose similarity to a word in the keyword set is greater than a preset similarity threshold, to form an initial keyword library for the field; and continuously collecting Sina microblog data, filtering irrelevant characters from the collected data, removing duplicate text data, performing word segmentation on the deduplicated data, searching the segmented Sina microblog data against the initial keyword library, and extracting the microblog data of the preset field.
6. The apparatus of claim 5,
the elements include one or more of the following: time, place, people, event description keywords.
7. The apparatus of claim 5, further comprising:
and the updating unit is used for updating the initial keyword library.
8. A computer-readable storage medium storing a signal-mapped computer program which, when executed by at least one processor, implements the element-based cross-channel hotspot event discovery method of any one of claims 1-4.
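
By way of illustration only, the weighted summation and threshold grouping recited in claims 1, 2 and 4 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `EventElements`, `WEIGHTS`, `jaccard`, `event_similarity` and `group_events` are hypothetical, the weight values are invented for the example, and plain Jaccard overlap stands in for both the element similarity and the semantic similarity, which the claims leave to a preset function. Only the four element types and the 0.7 threshold come from the claims.

```python
from dataclasses import dataclass

# Element types listed in claim 2. The weight values are assumptions made
# for this sketch; the claims only say the coefficients are preset.
WEIGHTS = {"time": 0.2, "place": 0.2, "people": 0.2, "keywords": 0.4}
THRESHOLD = 0.7  # preset similarity threshold from claim 4


@dataclass(frozen=True)
class EventElements:
    """Elements extracted from one text (hypothetical container)."""
    time: frozenset
    place: frozenset
    people: frozenset
    keywords: frozenset


def jaccard(a, b):
    """Set overlap in [0, 1]; a stand-in for the unspecified similarity."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def event_similarity(x, y, weights=WEIGHTS):
    """Weighted sum of per-element similarities between two texts."""
    return sum(
        w * jaccard(getattr(x, name), getattr(y, name))
        for name, w in weights.items()
    )


def group_events(news, microblogs, threshold=THRESHOLD):
    """For each news text, collect microblog texts above the threshold."""
    groups = []
    for n in news:
        members = [m for m in microblogs if event_similarity(n, m) > threshold]
        groups.append((n, members))
    return groups
```

In this sketch each news text anchors one set, and a microblog text joins the set only when its weighted similarity strictly exceeds the threshold, mirroring the "greater than" wording of claim 1. A semantic term based on word vectors could be added as a further weighted component without changing the grouping step.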
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811128658.5A CN109241438B (en) 2018-09-27 2018-09-27 Element-based cross-channel hot event discovery method and device and storage medium

Publications (2)

Publication Number Publication Date
CN109241438A CN109241438A (en) 2019-01-18
CN109241438B true CN109241438B (en) 2022-06-24

Family

ID=65057152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811128658.5A Active CN109241438B (en) 2018-09-27 2018-09-27 Element-based cross-channel hot event discovery method and device and storage medium

Country Status (1)

Country Link
CN (1) CN109241438B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287338B (en) * 2019-06-21 2022-04-29 北京百度网讯科技有限公司 Industry hotspot determination method, device, equipment and medium
CN111753197B (en) * 2020-06-18 2024-04-05 达观数据有限公司 News element extraction method, device, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250513A (en) * 2016-08-02 2016-12-21 西南石油大学 A kind of event personalization sorting technique based on event modeling and system
CN106294619A (en) * 2016-08-01 2017-01-04 上海交通大学 Public sentiment intelligent supervision method
CN106484767A (en) * 2016-09-08 2017-03-08 中国科学院信息工程研究所 A kind of event extraction method across media
CN106886567A (en) * 2017-01-12 2017-06-23 北京航空航天大学 Microblogging incident detection method and device based on semantic extension
CN108334628A (en) * 2018-02-23 2018-07-27 北京东润环能科技股份有限公司 A kind of method, apparatus, equipment and the storage medium of media event cluster

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354186A (en) * 2015-11-05 2016-02-24 同济大学 News event extraction method and system
US10909140B2 (en) * 2016-09-26 2021-02-02 Splunk Inc. Clustering events based on extraction rules

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant