CN109241438B

CN109241438B - Element-based cross-channel hot event discovery method and device and storage medium

Info

Publication number: CN109241438B
Application number: CN201811128658.5A
Authority: CN
Inventors: 段东圣; 杜翠兰; 李鹏霄; 刘晓辉; 李扬曦; 佟玲玲; 程光; 张琳; 井雅琪
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2018-09-27
Filing date: 2018-09-27
Publication date: 2022-06-24
Anticipated expiration: 2038-09-27
Also published as: CN109241438A

Abstract

The invention discloses a method, a device and a storage medium for discovering a cross-channel hot event based on elements.

Description

Element-based cross-channel hot event discovery method and device and storage medium

Technical Field

The invention relates to the technical field of computers, in particular to a cross-channel hot event discovery method and device based on elements and a computer readable storage medium.

Background

Network hot events are the events that attract the most public attention and the most frequent discussion at a given time and within a given scope. Hot event discovery automatically detects hot content in a propagating data stream and links other information associated with that content. In fields such as sports, finance, politics, and entertainment, discovering hot events promptly makes it possible to capture the public's focus more quickly and to better grasp how the field is developing, which is of great significance for guiding topics.

Traditional research on hot topics tends to lack detailed analysis of the event sets within a topic. Moreover, long texts such as news reports tend to attract less public attention and to be less timely than texts from social networks, so hot event discovery based on long texts lacks timeliness and sensitivity. Traditional research is therefore not well suited to the hot event discovery task.

Disclosure of Invention

The invention provides an element-based cross-channel hot event discovery method and device and a computer-readable storage medium, which aim to solve the problem of poor timeliness in discovering news hot events in the prior art.

In one aspect, the present invention provides a method for discovering a cross-channel hot spot event based on elements, the method comprising:

preprocessing the collected news data to obtain news data with irrelevant information filtered out, and further processing the news data with irrelevant information filtered out;

performing joint analysis on the news data with irrelevant information filtered out and the further processed news data;

the joint analysis comprises: extracting elements from the news data with irrelevant information filtered out and from the further processed news data, respectively, by using a method that combines a CRF (conditional random field) model with rules; performing element similarity calculation and semantic similarity calculation on the elements obtained from the two channels according to preset weight coefficients; performing a weighted summation of the element similarity and the semantic similarity of the two channels based on a preset cross-channel event similarity calculation function to calculate the similarity between the texts in the two channels; and putting the texts whose similarity value is greater than a preset similarity threshold into the same set to describe the event.

Preferably, the elements include one or more of: time, place, people, event description keywords.

Preferably, the news data with the irrelevant information filtered out is further processed, including:

marking the extracted keywords, and reserving a keyword set capable of representing preset domain knowledge;

training a word2vec model on Wikipedia data to obtain word vectors, performing similar-word expansion on the keyword set obtained from the news data with irrelevant information filtered out, and adding the words in the Wikipedia dictionary whose similarity to the keywords is higher than a preset similarity threshold to form an initial keyword library of the field;

continuously collecting Sina microblog data, filtering irrelevant characters out of the collected data, removing duplicate text, performing word segmentation on the deduplicated data, searching the segmented Sina microblog data against the initial keyword library, and extracting the microblog data of the preset field.

Preferably, the method further comprises: updating the initial keyword library.

Preferably, the preset similarity threshold is 0.7.

In another aspect, the present invention provides an element-based cross-channel hot spot event discovery apparatus, including:

the preprocessing unit is used for preprocessing the collected news data to obtain the news data with irrelevant information filtered out and further processing the news data with irrelevant information filtered out;

the analysis unit is used for performing joint analysis on the news data with irrelevant information filtered out and the further processed news data; the joint analysis comprises: extracting elements from the news data with irrelevant information filtered out and from the further processed news data, respectively, by using a method that combines a CRF (conditional random field) model with rules; performing element similarity calculation and semantic similarity calculation on the elements obtained from the two channels according to preset weight coefficients; performing a weighted summation of the element similarity and the semantic similarity of the two channels based on a preset cross-channel event similarity calculation function to calculate the similarity between the texts in the two channels; and putting the texts whose similarity value is greater than a preset similarity threshold into the same set to describe the event.

Preferably, the elements include one or more of: time, place, person, event description keyword.

Preferably, the preprocessing unit is further configured to: label the extracted keywords and retain a keyword set capable of representing preset domain knowledge; train a word2vec model on Wikipedia data to obtain word vectors, perform similar-word expansion on the keyword set obtained from the news data with irrelevant information filtered out, and add the words in the Wikipedia dictionary whose similarity to the keywords is higher than a preset similarity threshold to form an initial keyword library of the field; and continuously collect Sina microblog data, filter irrelevant characters out of the collected data, remove duplicate text, perform word segmentation on the deduplicated data, search the segmented Sina microblog data against the initial keyword library, and extract the microblog data of the preset field.

Preferably, the apparatus further comprises: an updating unit used for updating the initial keyword library.

In yet another aspect, the present invention further provides a computer-readable storage medium storing a computer program for signal mapping, where the computer program is executed by at least one processor to implement any one of the above element-based cross-channel hotspot event discovery methods.

The invention has the following beneficial effects:

According to the invention, news report data and microblog data of a given field are fused, and the hot events of the field are discovered through joint analysis of the elements extracted from the two channels and the semantic similarity of their texts, so that hot events can be understood more comprehensively and in greater detail.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a schematic flowchart of a cross-channel hot event discovery method based on elements according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of data preprocessing according to an embodiment of the present invention;

FIG. 3 is a schematic flow diagram of a joint analysis method according to an embodiment of the invention;

FIG. 4 is a schematic flow chart of updating a keyword library according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an element-based cross-channel hot spot event discovery apparatus according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

A first embodiment of the present invention provides a method for discovering a cross-channel hot spot event based on elements, which is shown in fig. 1 and includes:

It should be noted that the elements described in the embodiments of the present invention include one or more of the following: time, place, people, event description keywords.

In the embodiment of the invention, the news data with the filtered irrelevant information is further processed, and the processing comprises the following steps:

continuously collecting Sina microblog data, filtering irrelevant characters out of the collected data, removing duplicate text, performing word segmentation on the deduplicated data, searching the segmented Sina microblog data against the initial keyword library, and extracting the microblog data of the preset field.

As shown in FIG. 2, the preprocessing in the embodiment of the present invention specifically includes the following. First, news reports in a specific field are collected, including data from news websites dedicated to the field and from the field-specific columns of some general news websites. Keywords are extracted from the collected data by using a deep learning method, and the extracted keyword set is used to represent the domain knowledge. Microblog data is then collected over a period of time, some irrelevant information is filtered out, and the domain knowledge extracted from the news is used to identify and extract the microblog data that belongs to the specific field. The specific steps are as follows:

1. Select a news website in a specific field, or the field-specific column of a large news website, as the collection object, and filter out irrelevant information from the collected news reports.

2. Extract a keyword set from the data processed in step 1 by using the TF-IDF method, manually label the extracted keywords, and retain only the keywords that can represent the knowledge of the field.

3. Train a word2vec model on Wikipedia data to obtain word vectors, perform similar-word expansion on the keyword set obtained in step 2, and add the words in the Wikipedia dictionary whose similarity to the keywords is greater than 0.7 to form the initial keyword library of the field.

4. Continuously collect Sina microblog data, filter irrelevant characters out of the collected data, and remove duplicate text.

5. Perform word segmentation on the data obtained in step 4, search the segmented Sina microblog data against the initial keyword library obtained in step 3, and extract the Sina microblog data belonging to the field.
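Steps 3 to 5 above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the tiny `word_vectors` table stands in for vectors that would come from a word2vec model trained on Wikipedia, whitespace splitting stands in for a Chinese word segmenter, URLs and @mentions are assumed to be the "irrelevant characters", and all function names are hypothetical.

```python
import math
import re

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_keywords(seed_keywords, word_vectors, threshold=0.7):
    """Step 3: add vocabulary words whose similarity to any seed keyword exceeds the threshold."""
    library = set(seed_keywords)
    for word, vec in word_vectors.items():
        for seed in seed_keywords:
            if seed in word_vectors and cosine(vec, word_vectors[seed]) > threshold:
                library.add(word)
                break
    return library

def clean_and_dedupe(posts):
    """Step 4: strip URLs and @mentions (an assumed notion of noise) and drop exact duplicates."""
    seen, result = set(), []
    for post in posts:
        text = re.sub(r"https?://\S+|@\S+", "", post)
        text = " ".join(text.split())
        if text and text not in seen:
            seen.add(text)
            result.append(text)
    return result

def filter_by_library(posts, library, tokenize=str.split):
    """Step 5: keep posts that contain at least one library keyword after segmentation."""
    return [p for p in posts if library & set(tokenize(p))]

# Toy vectors standing in for word2vec output (assumption, not real data).
word_vectors = {
    "football": [1.0, 0.1, 0.0],
    "soccer": [0.9, 0.2, 0.0],
    "stock": [0.0, 0.1, 1.0],
}
library = expand_keywords({"football"}, word_vectors)
posts = [
    "soccer final tonight http://t.cn/x",
    "soccer final tonight",
    "stock prices fall",
]
domain_posts = filter_by_library(clean_and_dedupe(posts), library)
print(sorted(library))
print(domain_posts)
```

In this toy run, "soccer" joins the library because its vector is close to that of "football", the two near-identical posts collapse into one after cleaning, and only the sports post survives the keyword filter.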

As shown in FIG. 3, the joint analysis in the embodiment of the present invention specifically includes the following. Event elements are extracted from the preprocessed news data and from the microblog data of the specific field, respectively, by using a method that combines a CRF (conditional random field) model with rules; four elements are extracted: time, place, people, and event description keywords. Weight coefficients are defined for the four extracted elements, and element matching is performed between the hot events in the microblogs and in the news of the field. A Wikipedia knowledge base is then incorporated through a deep learning method to calculate the semantic similarity between microblog events and news events. The similarity of the event element information and the similarity of the event context semantic information are weighted to obtain the final similarity score, and the similar texts of the two channels are combined to jointly describe the hot event. The specific steps are as follows:

1. Perform element extraction on the data obtained in step 1 of the preprocessing by using a method that combines a CRF model with rules, extracting time, place, person name, and event description key information.

2. Perform element extraction on the data obtained in step 5 of the preprocessing by using the same method, extracting time, place, person name, and event description key information.

3. Calculate the similarity of the elements in the two channels with the function a·T + b·Pl + c·Pe + d·E, where a, b, c, and d are weight coefficients; T, Pl, and Pe indicate whether the events described by the two pieces of information agree in time, place, and person name, respectively (1 if the same, 0 if different); and E is the number of similar words in the two pieces of information, with the word similarity threshold set to 0.85.

4. Calculate the semantic similarity of the information in the two channels: train a word vector model on a Wikipedia corpus by using a deep learning method, represent the text sentences of the two channels as vectors, and calculate the similarity of the vectorized sentences by cosine similarity.

5. Define the cross-channel event similarity function as the weighted sum of the results of steps 3 and 4.

6. Calculate the similarity between the texts of the two channels with the similarity function defined in step 5, and put the texts whose similarity score is greater than a certain threshold into the same set to describe one type of event.
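Steps 3 to 6 of the joint analysis can be illustrated with a small Python sketch. It is only a sketch under explicit assumptions: the `Item` record, the equal element weights, the 0.5/0.5 split between element and semantic similarity, the grouping threshold of 0.7, and the exact-match word similarity are all hypothetical choices, since the source leaves the coefficients and the word-similarity function unspecified. Sentence vectors are supplied by hand here; in practice they would come from the trained word vector model.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    """One text plus its extracted elements; field names are illustrative."""
    text: str
    time: str
    place: str
    person: str
    keywords: frozenset
    vector: tuple  # sentence vector, e.g. derived from word vectors

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def exact_match(w1, w2):
    """Stand-in word similarity: 1.0 for identical words, otherwise 0.0."""
    return 1.0 if w1 == w2 else 0.0

def element_similarity(x, y, weights=(0.25, 0.25, 0.25, 0.25),
                       word_sim=exact_match, word_threshold=0.85):
    """Step 3: a*T + b*Pl + c*Pe + d*E with assumed weights."""
    a, b, c, d = weights
    T = int(x.time == y.time)
    Pl = int(x.place == y.place)
    Pe = int(x.person == y.person)
    # E counts keywords of x that have a sufficiently similar keyword in y.
    E = sum(1 for w in x.keywords
            if any(word_sim(w, v) >= word_threshold for v in y.keywords))
    return a * T + b * Pl + c * Pe + d * E

def event_similarity(x, y, alpha=0.5, beta=0.5):
    """Step 5: weighted sum of element similarity (step 3) and semantic similarity (step 4)."""
    return alpha * element_similarity(x, y) + beta * cosine(x.vector, y.vector)

def group_events(news_items, posts, threshold=0.7):
    """Step 6: attach to each news item the posts whose similarity exceeds the threshold."""
    groups = []
    for n in news_items:
        matched = [p.text for p in posts if event_similarity(n, p) > threshold]
        if matched:
            groups.append((n.text, matched))
    return groups

news = Item("Team A wins cup final in Beijing", "2018-09-27", "Beijing",
            "Coach Li", frozenset({"cup", "final", "wins"}), (1.0, 0.0))
post1 = Item("Team A lifted the cup!", "2018-09-27", "Beijing",
             "Coach Li", frozenset({"cup", "final"}), (0.9, 0.1))
post2 = Item("Markets dip", "2018-09-26", "Shanghai",
             "", frozenset({"stocks"}), (0.0, 1.0))

groups = group_events([news], [post1, post2])
print(groups)
```

Note that E is a raw count, as in the formula above, so the combined score is not bounded by 1; a real system would need to choose the weights and the threshold with that in mind.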

The method of the embodiment of the invention further comprises: updating the initial keyword library.

As shown in FIG. 4, updating the keywords of the domain knowledge base specifically includes: extracting keywords from the Sina microblog data of the field obtained after preprocessing, and adding them, as domain knowledge of the field, to the domain knowledge base built in the preprocessing steps. The specific steps are as follows:

Extract keywords from the preprocessed data by using the TF-IDF model.

Compare each extracted keyword with the words in the field's initial keyword library obtained last time; if the similarity between the keyword and some word in the initial keyword library exceeds 0.7 and the keyword is not already in the library, add the keyword to the keyword library.

In a specific implementation, the preset similarity threshold is 0.7; a person skilled in the art may choose other settings according to actual needs, and the present invention does not specifically limit this.
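The update procedure can be sketched as follows. This is a minimal sketch under stated assumptions rather than the implementation of the invention: documents are assumed to be already segmented into token lists, TF-IDF is computed in its plain textbook form (term frequency times the logarithm of inverse document frequency), the `word_vectors` table stands in for the trained word vector model, and the function names and the `top_k` parameter are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def tfidf_keywords(docs, top_k=2):
    """Return the top_k terms of each tokenized document ranked by TF-IDF."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        scores = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_k])
    return result

def update_library(library, candidates, word_vectors, threshold=0.7):
    """Add candidates not yet in the library whose similarity to some library word exceeds the threshold."""
    updated = set(library)
    for cand in candidates:
        if cand in library or cand not in word_vectors:
            continue
        if any(w in word_vectors and
               cosine(word_vectors[cand], word_vectors[w]) > threshold
               for w in library):
            updated.add(cand)
    return updated

# Toy segmented microblog posts and toy vectors (assumptions, not real data).
docs = [["derby", "goal", "goal", "fans"], ["fans", "queue", "tickets"]]
word_vectors = {
    "football": [1.0, 0.0],
    "goal": [0.9, 0.3],
    "derby": [0.8, 0.2],
    "queue": [0.0, 1.0],
}
keywords = tfidf_keywords(docs)
candidates = [w for doc_keywords in keywords for w in doc_keywords]
updated = update_library({"football"}, candidates, word_vectors)
print(keywords)
print(sorted(updated))
```

Candidates are compared against the library as it stood before the update, so the result does not depend on the order in which keywords are examined; words without a vector (here "tickets") are simply skipped.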

Generally, the embodiment of the invention is based on a cross-channel hot spot event discovery technology, and the hot spot event is discovered by collecting data of different channels and fusing the data characteristics of each channel.

Moreover, the embodiment of the invention can also realize the automatic expansion of the domain knowledge: the new words with high similarity are continuously added into the domain knowledge word bank by continuously extracting the keywords from the new data and judging the similarity of the new data and the original domain knowledge word bank, so that the domain knowledge word bank can be dynamically updated.

In addition, the method and the device enable the description of the event similarity of different channels to be more accurate by matching the event elements of different channels and comparing the semantic similarity of different channels.

The method of the embodiment of the invention can at least obtain the following beneficial effects:

First, by combining data from different channels, the invention improves the timeliness and sensitivity of hot event discovery and the completeness of event description; by combining the domain knowledge of different channels, it can expand the semantic expression of key information in the field, which is of great significance for improving the accuracy of event discovery.

Second, the event elements extracted from news reports supplement the event elements of Sina microblogs, so that events can be described better. The event element similarity score and the text context semantic similarity score are weighted to obtain the final event similarity function, so that a set of similar hot events can be obtained more accurately and comprehensively. Extracting event elements also makes it possible to focus on events under a specific element, for example events at a specific time or in a specific place.
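As a brief illustration of focusing on events under a specific element, the sketch below filters a list of events by their extracted elements. The dictionary layout, the field names, and the `focus_on` helper are hypothetical and only serve to show the idea.

```python
def focus_on(events, **criteria):
    """Keep the events whose extracted elements equal every given criterion."""
    return [e for e in events
            if all(e.get(key) == value for key, value in criteria.items())]

# Toy events with already-extracted elements (illustrative data only).
events = [
    {"time": "2018-09-27", "place": "Beijing", "person": "Coach Li",
     "text": "Team A wins the cup"},
    {"time": "2018-09-27", "place": "Shanghai", "person": "",
     "text": "Markets dip"},
    {"time": "2018-09-28", "place": "Beijing", "person": "",
     "text": "Victory parade"},
]
in_beijing = focus_on(events, place="Beijing")
print([e["text"] for e in in_beijing])
```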

In addition, by regularly updating the domain knowledge base, the invention makes the discovery of hot events in the field more accurate, more comprehensive, and more timely, and allows the whole system to run automatically without additional manual intervention.

A second aspect of the present invention provides an element-based cross-channel hot spot event discovery apparatus, as shown in fig. 5, including:

the analysis unit is used for performing joint analysis on the news data with irrelevant information filtered out and the further processed news data; the joint analysis comprises: extracting elements from the news data with irrelevant information filtered out and from the further processed news data, respectively, by using a method that combines a CRF (conditional random field) model with rules; performing element similarity calculation and semantic similarity calculation on the elements obtained from the two channels according to preset weight coefficients; performing a weighted summation of the element similarity and the semantic similarity of the two channels based on a preset cross-channel event similarity calculation function to calculate the similarity between the texts in the two channels; and putting the texts whose similarity value is greater than a preset similarity threshold into the same set to describe the event.

Furthermore, the preprocessing unit in the embodiment of the present invention is further configured to: label the extracted keywords and retain a keyword set capable of representing preset domain knowledge; train a word2vec model on Wikipedia data to obtain word vectors, perform similar-word expansion on the keyword set obtained from the news data with irrelevant information filtered out, and add the words in the Wikipedia dictionary whose similarity to the keyword set is greater than a preset similarity threshold to form an initial keyword library of the field; and continuously collect Sina microblog data, filter irrelevant characters out of the collected data, remove duplicate text, perform word segmentation on the deduplicated data, search the segmented Sina microblog data against the initial keyword library, and extract the microblog data of the preset field.

In a specific implementation, the apparatus according to the embodiment of the present invention further includes: an updating unit used for updating the initial keyword library.

Relevant parts of the embodiments of the present invention can be understood by referring to the method embodiments, and detailed description is omitted here.

In a third embodiment of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, realizes the following method steps:

The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system is apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in a distributed file system data import apparatus according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Claims

1. A cross-channel hot spot event discovery method based on elements is characterized by comprising the following steps:

the joint analysis comprises: extracting elements from the news data with irrelevant information filtered out and from the further processed news data, respectively, by using a method that combines a CRF (conditional random field) model with rules; performing element similarity calculation and semantic similarity calculation on the elements obtained from the two channels according to preset weight coefficients; performing a weighted summation of the element similarity and the semantic similarity of the two channels based on a preset cross-channel event similarity calculation function to calculate the similarity between the texts in the two channels; and putting the texts whose similarity value is greater than a preset similarity threshold into the same set to describe the event;

and further processing the news data with the filtered irrelevant information, including:

training Wikipedia data through a word2vec model to obtain a word vector, performing similar word expansion on a keyword set obtained by filtering news data with irrelevant information, and adding words with the similarity larger than a preset similarity threshold value with words in the Wikipedia dictionary and the keyword set to form an initial keyword library in the field;

2. The method of claim 1,

the elements include one or more of the following: time, place, people, event description keywords.

3. The method of claim 1, further comprising:

and updating the initial keyword library.

4. The method of claim 1,

the preset similarity threshold is 0.7.

5. An element-based cross-channel hot spot event discovery device, comprising:

the analysis unit is used for performing joint analysis on the news data with irrelevant information filtered out and the further processed news data; the joint analysis comprises: extracting elements from the news data with irrelevant information filtered out and from the further processed news data, respectively, by using a method that combines a CRF (conditional random field) model with rules; performing element similarity calculation and semantic similarity calculation on the elements obtained from the two channels according to preset weight coefficients; performing a weighted summation of the element similarity and the semantic similarity of the two channels based on a preset cross-channel event similarity calculation function to calculate the similarity between the texts in the two channels; and putting the texts whose similarity value is greater than a preset similarity threshold into the same set to describe the event;

the preprocessing unit is further used for: labeling the extracted keywords and retaining a keyword set capable of representing preset domain knowledge; training a word2vec model on Wikipedia data to obtain word vectors, performing similar-word expansion on the keyword set obtained from the news data with irrelevant information filtered out, and adding the words in the Wikipedia dictionary whose similarity to the keywords is higher than a preset similarity threshold to form an initial keyword library of the field; and continuously collecting Sina microblog data, filtering irrelevant characters out of the collected data, removing duplicate text, performing word segmentation on the deduplicated data, searching the segmented Sina microblog data against the initial keyword library, and extracting the microblog data of the preset field.

6. The apparatus of claim 5,

7. The apparatus of claim 5, further comprising:

and the updating unit is used for updating the initial keyword library.

8. A computer-readable storage medium storing a signal-mapped computer program which, when executed by at least one processor, implements the element-based cross-channel hotspot event discovery method of any one of claims 1-4.