CN113934910A

CN113934910A - Automatic optimization and updating theme library construction method and hot event real-time updating method

Info

Publication number: CN113934910A
Application number: CN202111188831.2A
Authority: CN
Inventors: 周洁琴; 周金明
Original assignee: Nanjing Inspector Intelligent Technology Co Ltd
Current assignee: Nanjing Inspector Intelligent Technology Co Ltd
Priority date: 2021-10-12
Filing date: 2021-10-12
Publication date: 2022-01-14

Abstract

The invention discloses a method for constructing an automatically optimized and updated topic library and a method for updating hot events in real time. The topic library construction method comprises the following steps: acquiring historical document data for the topic library by determining the internet information sources of the historical document data and crawling, with a crawler tool, the historical document data of the determined source websites over a period of time; performing text preprocessing on the acquired historical document data, and merging the title and body text of each preprocessed document; calculating the weight of the words appearing in each document with the TF-IDF algorithm, and clustering the documents with a combination of clustering algorithms; and setting a crawling cycle for the crawler tool so as to update the keyword dictionary of each topic. Keyword dictionaries for the different topics are generated by crawling historical webpage data, and the topic library is then updated automatically by directionally crawling webpage data in real time according to the keyword dictionaries.

Description

Automatic optimization and updating theme library construction method and hot event real-time updating method

Technical Field

The invention relates to the field of natural language processing research, and in particular to a method for constructing an automatically optimized and updated topic library and a method for updating hot events in real time.

Background

With the rapid development of the internet, every internet platform publishes a large number of hot events each day. Because the volume of information is huge and difficult to supervise, hot-event overload is severe and the quality of topic content is uneven: the same content is repeated under different titles, and clickbait headlines (the so-called 'title party') are common, so users cannot accurately learn what the current hot events are. Automatically discovering hot events from online news is therefore very important.

In the course of implementing the invention, the inventors found at least the following problems in the prior art. Current automatic hot-event mining techniques in China either require the topics to be determined manually or cannot update the topics automatically, and therefore lag behind events. Their coverage is insufficient: because different groups pay different degrees of attention to hot events under different topics, hot events need to be presented per topic. In the 'title party' case, a title written to attract attention may not match the content. An automatic hot-event mining technique is therefore needed that mines hot events directly and automatically, improves mining efficiency, provides automatically updated topic libraries, and improves the accuracy of hot-topic identification.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a method for constructing an automatically optimized and updated topic library and a method for updating hot events in real time, which together extract hot events automatically and in real time from massive internet text information. The technical scheme is as follows:

In a first aspect, a method for constructing an automatically optimized and updated topic library is provided, comprising the following steps:

acquiring historical document data for the topic library: determining the internet information sources of the historical document data, and crawling, with a crawler tool, the historical document data of the determined source websites over a period of time;

performing text preprocessing on the acquired historical document data, including webpage title extraction and body text extraction: analyzing the page structure of each website to which the historical document data belong and, in a multi-process manner, extracting the title and body text of each webpage according to regular expression rules specific to each website; and removing noise data and filtering out redundant, meaningless information in the webpages;

merging the title and body text of each preprocessed historical document, and performing word segmentation and stop-word removal on the merged text;

calculating the weight of the words appearing in each document with the TF-IDF algorithm, the calculation formula being as follows:

$$\mathrm{TF\text{-}IDF}(w, D_i) = \frac{\mathrm{count}(w)}{|D_i|} \times \log \frac{n}{\sum_{j=1}^{n} I(w, D_j)}$$

where $\mathrm{count}(w)$ is the number of occurrences of the word $w$ in document $D_i$, $|D_i|$ is the number of all words in document $D_i$, $n$ is the total number of documents, and $I(w, D_i)$ indicates whether document $D_i$ contains the word $w$: it is 1 if so and 0 otherwise;

clustering the documents with a combination of the LDA and k-means algorithms: obtaining the document-topic distribution from an LDA topic model, using the document-topic distribution as the input of k-means, clustering all documents into c categories, and using the several keywords with the largest weights in each category as the topic name of that category.
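The k-means stage of this step can be sketched as follows. This is only an illustration under stated assumptions: the document-topic distributions are taken as given (the LDA model that would produce them is not shown), the `doc_topic` values are made up, and seeding the centroids with the first c vectors is a simplification chosen to keep the example deterministic.

```python
def kmeans(points, c, max_iter=100):
    """Cluster vectors into c groups; returns one cluster label per vector."""
    # Simplistic deterministic start: the first c vectors act as initial centroids.
    centroids = [list(p) for p in points[:c]]
    labels = None
    for _ in range(max_iter):
        # Assign every vector to its nearest centroid (squared Euclidean distance).
        new_labels = [
            min(range(c), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(c):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical document-topic distributions, as an LDA model might output them.
doc_topic = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = kmeans(doc_topic, c=2)
```

Clustering on the low-dimensional topic distribution rather than on raw word counts keeps the k-means input compact, which is presumably why the two algorithms are combined.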

processing the news data of each topic and extracting keywords, including selecting verbs and nouns and removing duplicates, to obtain all distinct keywords under each topic as the keyword dictionary of that topic, so that the c categories yield a topic library containing c topics;

setting a crawling cycle for the crawler tool, directionally crawling news data with the keyword dictionary of each topic in the topic library, and preprocessing the news data (preferably including word segmentation, deduplication and the like) to update the keyword dictionary of each topic.

Preferably, the internet information sources include news websites and portal websites with high popularity as corpus sources.

Preferably, the historical document data of the determined internet information source website over a period of time is crawled through a crawler tool, specifically, historical data of a recent year is crawled.

Preferably, the top multiple keywords with the largest weight in each category are used as the subject names of the category, and specifically, the top 3 keywords with the largest weight in each category are used as the subject names of the category.

Preferably, the number of categories c is determined by the silhouette coefficient method.

Preferably, the first keywords with the largest weight in each category are used as the subject names of the category, and specifically, the first three keywords with the largest weight in each category are used as the subject names of the category.

In a second aspect, the present invention provides a method for updating a hotspot event in real time, which includes the following steps:

obtaining the keyword dictionary of each topic in the topic library according to the topic library construction method of any of the possible implementations above; directionally crawling the latest news webpages according to the keyword dictionary of each topic; applying automatic text summarization to the news text content under each topic to generate the core content as a title (which avoids 'title party' clickbait headlines); and performing similarity calculation on the summary results under each topic, that is, computing $similarity_{X,Y}$: if the similarity is greater than a given threshold, the two texts are regarded as the same text $i$, and the number $Count_i$ of texts $i$ and the total number of documents under the topic are counted;

calculating the heat value of each news item according to a heat value calculation formula, in which $H_0$ is the initial heat value; $a$ is the count weight coefficient; $Count_i$ is the number of texts $i$ and $N$ is the total number of documents; $b$ is the time weight coefficient; $k$ is the time coefficient; $T_1$ is the news release time; and $T_0$ is the current time;

and sorting the news items by heat value in descending order, setting the number p of hot events to display, and updating the top p hot events under each topic periodically (for example, every day or every week).
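The counting, scoring and ranking steps of the second aspect can be sketched as below. The exact heat formula is not reproduced in this text, so the form used here, H = H0 + a * Count_i / N + b * exp(-k * (T0 - T1)), is purely an assumption that combines the defined symbols in one plausible way. The function names, the default coefficients, the `similarity` callable and the greedy grouping strategy are likewise hypothetical.

```python
import math


def count_duplicates(summaries, similarity, threshold):
    """Greedy grouping: each summary joins the first group whose representative
    it resembles above the threshold. Returns [[representative, Count_i], ...]."""
    groups = []
    for s in summaries:
        for g in groups:
            if similarity(g[0], s) > threshold:
                g[1] += 1
                break
        else:
            groups.append([s, 1])
    return groups


def heat(count_i, n_docs, t_release, t_now, h0=0.0, a=1.0, b=1.0, k=0.1):
    # ASSUMED form, not taken from the source:
    #   H = H0 + a * Count_i / N + b * exp(-k * (T0 - T1))
    return h0 + a * count_i / n_docs + b * math.exp(-k * (t_now - t_release))


def top_events(events, t_now, p):
    """events: list of (title, Count_i, N, T1); returns the p hottest titles."""
    ranked = sorted(events, key=lambda e: heat(e[1], e[2], e[3], t_now), reverse=True)
    return [e[0] for e in ranked[:p]]
```

Under this assumed form, an item gains heat when many near-duplicate reports exist (large Count_i / N) and loses heat as the gap between the current time and its release time grows.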

Preferably, the automatic text summarization technology is adopted for the news text content under each topic, and the core content is generated as a title, specifically as follows:

(1) segmenting document $D_i$ into complete sentences to obtain a number of sentences;

(2) preprocessing each sentence with a word segmentation tool, including word segmentation, stop-word removal and part-of-speech tagging;

(3) first calculating the weight $\mathrm{weight}_w$ of each word $w$ with the TF-IDF algorithm, then adjusting the weight by a title factor and a part-of-speech factor to obtain a new weight $\mathrm{weight}'_w$, calculated as follows:

$$\mathrm{weight}'_w = \mathrm{weight}_w \times T_w \times P_w$$

where $T_w$ is the title factor: if the word $w$ appears in the title of the document, $T_w > 1$, and its specific value is tuned according to the quality of the resulting text summary (the larger $T_w$ is, the more important the title is, so $T_w$ is adjusted according to the importance of the title); if the word $w$ does not appear in the title, $T_w = 1$. $P_w$ is the part-of-speech coefficient: if the word $w$ is an entity name, $P_w > 1$; otherwise $P_w = 1$.

From the new weights $\mathrm{weight}'_w$, the sentence vector $d_i$ is generated, expressed as:

$$d_i = (\mathrm{weight}'_{w_1}, \mathrm{weight}'_{w_2}, \ldots, \mathrm{weight}'_{w_m})$$

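(4) calculating the similarity between sentences with the cosine similarity and constructing a similarity matrix:

$$similarity_{X,Y} = \frac{\sum_i X_i Y_i}{\sqrt{\sum_i X_i^2}\,\sqrt{\sum_i Y_i^2}}$$

where $X_i$ is the weight $\mathrm{weight}'_{w,X}$ of the $i$-th word of sentence vector $X$, and $Y_i$ is the weight $\mathrm{weight}'_{w,Y}$ of the $i$-th word of sentence vector $Y$.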

(5) performing iterative calculation with TextRank to obtain a score for each sentence;

(6) ranking the sentences by importance and selecting the sentences whose importance exceeds a preset value as candidate summary sentences;

(7) extracting sentences from the candidate summary sentences to form the summary result according to the configured limit on the number of words or sentences.
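Steps (3) and (4) above can be sketched as follows. The sketch assumes that the adjusted weight is the product of the TF-IDF weight, the title factor and the part-of-speech coefficient (the natural reading, since both factors default to 1), and that every sentence vector is laid out over a shared vocabulary so that any two sentences can be compared. The function names and the example values 1.5 and 1.2 are hypothetical.

```python
import math


def adjusted_weights(tfidf, title_words, entity_words, t_w=1.5, p_w=1.2):
    """Scale each TF-IDF weight by the title factor and the part-of-speech coefficient."""
    return {
        w: weight
        * (t_w if w in title_words else 1.0)
        * (p_w if w in entity_words else 1.0)
        for w, weight in tfidf.items()
    }


def sentence_vector(sentence_words, weights, vocabulary):
    """One component per vocabulary word: its adjusted weight if present, else 0."""
    present = set(sentence_words)
    return [weights.get(w, 0.0) if w in present else 0.0 for w in vocabulary]


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0  # an all-zero vector is similar to nothing


def similarity_matrix(vectors):
    """Pairwise cosine similarity between all sentence vectors."""
    return [[cosine(x, y) for y in vectors] for x in vectors]
```

The resulting matrix is symmetric, and sentences that share no weighted words get similarity 0.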

Compared with the prior art, one of the above technical schemes has the following beneficial effects: keyword dictionaries for the different topics are generated by crawling historical webpage data, and the topic library is then updated by directionally crawling webpage data in real time according to the keyword dictionaries. For the latest webpage data, a text summary is automatically extracted as the title by the automatic summarization technique, which effectively avoids 'title party' clickbait content, and the hot events of each topic are ranked by heat value.

Detailed Description

In order to clarify the technical solution and the working principle of the present invention, the embodiments of the present disclosure will be described in further detail below. All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.

The numerical descriptions of the terms "(1)", "(2)", "(3)" etc. in the description and claims of this application are for distinguishing between similar items and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that the embodiments of the application described herein may be practiced in sequences other than those described herein.

In a first aspect: the embodiment of the disclosure provides a method for constructing an automatically optimized and updated theme library, which comprises the following steps:

acquiring historical document data for the topic library: determining the internet information sources of the historical document data, where the internet information sources include portal websites, news websites, Wikipedia, Baidu Baike, forums, blogs and the like. Preferably, in consideration of the accuracy and authority of the historical document data, well-known news websites and portal websites are used as corpus sources, such as People's Daily Online, ifeng.com, Sohu, Tencent, Sina, NetEase and Xinhuanet. The historical document data of the determined source websites over a period of time are crawled with a crawler tool, preferably the historical data of the most recent year.

Text preprocessing is performed on the acquired historical document data, including webpage title extraction and body text extraction: the page structure of each website to which the historical document data belong is analyzed and, in a multi-process manner, the title and body text of each webpage are extracted according to regular expression rules specific to each website. Noise data such as advertisements, registration prompts and copyright notices are removed and redundant, meaningless information in the webpages is filtered out, which improves the effectiveness of information extraction.
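The per-site extraction described above might look like the following sketch. The site key `example-news`, the two patterns and the function names are hypothetical, since the actual regular expression rules are not given; real pages (for instance with nested `div` elements) would need more careful rules than these.

```python
import re
from multiprocessing import Pool

# Hypothetical per-site rules; each website gets its own title and body patterns.
SITE_RULES = {
    "example-news": {
        "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
        "body": re.compile(r'<div class="article">(.*?)</div>', re.S),
    },
}
TAG = re.compile(r"<[^>]+>")  # any remaining HTML tag


def extract(page):
    """page is (site, html); returns (title, body) as plain text."""
    site, html = page
    rules = SITE_RULES[site]
    fields = []
    for key in ("title", "body"):
        m = rules[key].search(html)
        text = TAG.sub("", m.group(1)) if m else ""
        fields.append(" ".join(text.split()))  # collapse runs of whitespace
    return tuple(fields)


def extract_all(pages, workers=2):
    """Extract many pages in parallel worker processes."""
    with Pool(workers) as pool:
        return pool.map(extract, pages)
```

On platforms that start worker processes by spawning, `extract_all` should be called from under an `if __name__ == "__main__":` guard.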

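The title and body text of each preprocessed historical document are merged, and word segmentation and stop-word removal are performed on the merged text.

The weight of the words appearing in each document is calculated with the TF-IDF algorithm:

$$\mathrm{TF\text{-}IDF}(w, D_i) = \frac{\mathrm{count}(w)}{|D_i|} \times \log \frac{n}{\sum_{j=1}^{n} I(w, D_j)}$$

where $\mathrm{count}(w)$ is the number of occurrences of the word $w$ in document $D_i$, $|D_i|$ is the number of all words in document $D_i$, $n$ is the total number of documents, and $I(w, D_i)$ indicates whether document $D_i$ contains the word $w$: it is 1 if so and 0 otherwise.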
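The TF-IDF weighting can be sketched as below, using the symbols just defined: term frequency count(w) / |D_i| multiplied by the logarithm of n over the number of documents that contain w. The unsmoothed logarithm is an assumption (some variants add 1 to the denominator), and the documents are assumed to be already segmented into token lists.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n = len(docs)
    # Document frequency of w, i.e. the sum over all documents of I(w, D_i).
    df = Counter(w for doc in docs for w in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            w: (c / len(doc)) * math.log(n / df[w])
            for w, c in counts.items()
        })
    return weights
```

With this unsmoothed form, a word that occurs in every document gets weight 0, so ubiquitous words drop out of the keyword ranking.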

The documents are clustered with a combination of the LDA and k-means algorithms: the document-topic distribution is obtained from an LDA topic model and used as the input of k-means, all documents are clustered into c categories, and the several (for example, three) keywords with the largest weights in each category are used as the topic name of that category. Preferably, the number of categories c is determined by the silhouette coefficient method, selecting the value of c that gives the best document clustering.
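Choosing c by the silhouette coefficient can be sketched as follows: for each candidate c the documents are clustered, the mean silhouette is computed, and the c with the highest value wins. The `cluster_fn` argument is a hypothetical stand-in for whatever clustering routine is in use, and the function names are illustrative. The sketch assumes at least two clusters per candidate and scores singleton clusters as 0 by convention.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient of a clustering with at least two clusters."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for k, other in clusters.items() if k != l
        )
        m = max(a, b)
        total += (b - a) / m if m else 0.0
    return total / len(points)


def choose_c(points, cluster_fn, candidates):
    """Return the candidate c whose clustering has the highest mean silhouette."""
    return max(candidates, key=lambda c: silhouette(points, cluster_fn(points, c)))
```

Values close to 1 indicate compact, well separated clusters, while negative values suggest that documents sit closer to another cluster than to their own.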

The news data of each topic are processed and keywords are extracted, including selecting verbs and nouns, removing duplicates and the like, to obtain all distinct keywords under each topic as the keyword dictionary of that topic, so that the c categories yield a topic library containing c topics; the name of each topic consists of its several (for example, three) highest-weight keywords.
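Building the per-topic keyword dictionary can be sketched as below, assuming the documents have already been segmented and part-of-speech tagged. The tag prefixes `n` (noun) and `v` (verb) are an assumption about the tagger's tag set, and the function name is hypothetical.

```python
def build_keyword_dictionary(tagged_docs, labels, keep=("n", "v")):
    """tagged_docs: list of [(word, pos), ...]; labels: cluster id per document.
    Returns {cluster id: sorted list of distinct noun/verb keywords}."""
    dictionary = {}
    for doc, label in zip(tagged_docs, labels):
        words = dictionary.setdefault(label, set())  # a set removes duplicates
        words.update(w for w, pos in doc if pos.startswith(keep))
    return {label: sorted(words) for label, words in dictionary.items()}
```

Each cluster label then maps to one topic of the topic library, with the keyword list serving as that topic's dictionary.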

A crawling cycle (for example, every day or every week) is set for the crawler tool, news data are directionally crawled with the keyword dictionary of each topic in the topic library, and the news data are preprocessed (including word segmentation, deduplication and similar processing) to update the keyword dictionary of each topic.
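One update cycle might be sketched as follows. The sketch covers only the dictionary merge, under the assumption that the crawled documents have already been matched to a topic, segmented and reduced to candidate keywords; the crawler itself and its scheduling (daily or weekly) are outside the example, and the function name is hypothetical.

```python
def update_dictionary(dictionary, crawled_docs):
    """dictionary: {topic: set of keywords}; crawled_docs: {topic: [token lists]}.
    Adds unseen tokens from each topic's new documents; returns the words added."""
    added = {}
    for topic, docs in crawled_docs.items():
        known = dictionary.setdefault(topic, set())
        new_words = {w for doc in docs for w in doc} - known  # deduplicated
        known |= new_words
        added[topic] = new_words
    return added
```

Returning the newly added words makes it easy to log how much each topic's dictionary changes from one cycle to the next.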

In a second aspect, an embodiment of the present disclosure provides a method for updating a hotspot event in real time, where the method includes the following steps:

The keyword dictionary of each topic in the topic library is obtained according to the topic library construction method of any of the possible implementations above. The latest news webpages are directionally crawled according to the keyword dictionary of each topic, and automatic text summarization is applied to the news text content under each topic to generate the core content as a title (which avoids 'title party' clickbait headlines). Similarity calculation is then performed on the summary results under each topic, that is, $similarity_{X,Y}$ is computed: if the similarity is greater than a given threshold, the two texts are regarded as the same text $i$, and the number $Count_i$ of texts $i$ and the total number of documents under the topic are counted.

The heat value of each news item is calculated according to the heat value formula, in which $H_0$ is the initial heat value; $a$ is the count weight coefficient; $Count_i$ is the number of texts $i$ and $N$ is the total number of documents; $b$ is the time weight coefficient; $k$ is the time coefficient; $T_1$ is the news release time; and $T_0$ is the current time.

Preferably, automatic text summarization is applied to the news text content under each topic to generate the core content as a title, as follows:

1. Document $D_i$ is segmented into complete sentences to obtain a number of sentences.

2. Each sentence is preprocessed with a word segmentation tool, including word segmentation, stop-word removal and part-of-speech tagging. Stop-word removal filters out high-frequency words that carry little meaning, such as 'and' and 'are'. Part-of-speech tagging determines the part of speech of each word (noun, verb, adverb, adjective and so on) and identifies the entity names in the text, including names of people, places and organizations.

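3. The weight $\mathrm{weight}_w$ of each word is first calculated with the TF-IDF algorithm and is then adjusted to a new weight $\mathrm{weight}'_w$. This takes into account not only the probability of a word appearing in a single document and the weight of that word in the entire document set, but also the importance of the word itself in the title and as an entity name. The calculation is as follows:

$$\mathrm{weight}'_w = \mathrm{weight}_w \times T_w \times P_w$$

where $T_w$ is the title factor and $P_w$ is the part-of-speech coefficient.

From the new weights $\mathrm{weight}'_w$, the sentence vector $d_i$ is generated, expressed as:

$$d_i = (\mathrm{weight}'_{w_1}, \mathrm{weight}'_{w_2}, \ldots, \mathrm{weight}'_{w_m})$$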

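4. The similarity between sentences is calculated with the cosine similarity, and a similarity matrix is constructed:

$$similarity_{X,Y} = \frac{\sum_i X_i Y_i}{\sqrt{\sum_i X_i^2}\,\sqrt{\sum_i Y_i^2}}$$

where $X_i$ is the weight $\mathrm{weight}'_{w,X}$ of the $i$-th word of sentence vector $X$, and $Y_i$ is the weight $\mathrm{weight}'_{w,Y}$ of the $i$-th word of sentence vector $Y$.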

5. Iterative calculation is performed with TextRank to obtain a score for each sentence.

6. The sentences are ranked by importance, and those whose importance exceeds a preset value are selected as candidate summary sentences.

7. Sentences are extracted from the candidate summary sentences to form the summary result according to the configured limit on the number of words or sentences.
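Steps 5 to 7 above can be sketched as follows, taking a precomputed sentence similarity matrix as input. The damping factor 0.85 is the conventional TextRank default rather than a value stated here, the length limit is expressed in characters, and emitting the chosen sentences in their original document order is an assumption. All function names are hypothetical.

```python
def textrank(sim, d=0.85, max_iter=100, tol=1e-6):
    """sim: symmetric sentence-similarity matrix. Returns one score per sentence."""
    n = len(sim)
    if n == 0:
        return []
    # Total outgoing edge weight of each sentence, ignoring self-similarity.
    out = [sum(sim[j][k] for k in range(n) if k != j) for j in range(n)]
    scores = [1.0] * n
    for _ in range(max_iter):
        new = [
            (1 - d) + d * sum(
                sim[j][i] / out[j] * scores[j]
                for j in range(n) if j != i and out[j] > 0
            )
            for i in range(n)
        ]
        converged = max(abs(a - b) for a, b in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    return scores


def build_summary(sentences, scores, min_score, max_chars):
    """Take sentences scoring above min_score, best first, while they fit within
    max_chars, then emit the chosen ones in their original document order."""
    candidates = sorted(
        (i for i, s in enumerate(scores) if s > min_score),
        key=lambda i: scores[i], reverse=True,
    )
    chosen, used = [], 0
    for i in candidates:
        if used + len(sentences[i]) <= max_chars:
            chosen.append(i)
            used += len(sentences[i])
    return "".join(sentences[i] for i in sorted(chosen))
```

A sentence that is similar to many other sentences accumulates score from all of them, which is why it tends to be selected as representative of the document.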

The invention has been described above by way of example. Obviously, the specific implementation of the invention is not limited to the manner described above: various insubstantial modifications made using the method concepts and technical solutions of the invention, and direct application of the concepts and technical solutions of the invention to other occasions without modification, all fall within the protection scope of the invention.

Claims

1. A method for constructing an automatically optimized and updated theme library is characterized by comprising the following steps:

clustering each document by adopting a method of combining LDA and k-means clustering algorithms, obtaining document-subject distribution through an LDA subject model, using the document-subject distribution as input of k-means, clustering all documents into c categories, and using a plurality of keywords with maximum weight under each category as subject names of the categories;

setting a crawling cycle of the crawler tool, performing directional crawling on news data through the keyword dictionary under each topic library, and preprocessing the news data to update the keyword dictionary under each topic library.

2. The method as claimed in claim 1, wherein the internet information sources include news websites and portals with high popularity as corpus sources.

3. The method for automatically optimizing and updating the theme base according to claim 1, wherein the historical document data of the determined internet information source website over a period of time is crawled by a crawler tool, specifically, the historical data of the last year is crawled.

4. The method as claimed in claim 1, wherein the top keywords with the highest weight in each category are used as the subject names of the categories, and specifically, the top 3 keywords with the highest weight in each category are used as the subject names of the categories.

5. The method for automatically optimizing and updating the theme base according to claim 1, wherein the category number c is determined by a silhouette coefficient method.

6. The method as claimed in claim 1, wherein the first keywords with the highest weight in each category are used as the subject names of the categories, and specifically, the first three keywords with the highest weight in each category are used as the subject names of the categories.

7. The method for automatically optimizing and updating the theme library according to any one of claims 1 to 6, wherein the preprocessing of the news data comprises word segmentation and duplicate removal.

8. A method for updating a hot spot event in real time is characterized by comprising the following steps:

obtaining a keyword dictionary of each topic in the topic library by the method for constructing an automatically optimized and updated topic library according to any one of claims 1 to 7; directionally crawling the latest news webpages according to the keyword dictionary of each topic; applying automatic text summarization to the news text content under each topic to generate the core content as a title; and performing similarity calculation on the summary results under each topic, that is, computing $similarity_{X,Y}$, wherein if the similarity is greater than a given threshold, the two texts are regarded as the same text $i$, and the number $Count_i$ of texts $i$ and the total number of documents under the topic are counted;

and sequencing according to the heat degree value from large to small, setting a hot event display number p value, and updating the first p hot events under each theme at regular time.

9. The method according to claim 8, wherein a text automatic abstraction technology is adopted for news text content under each topic to generate core content as a title, and the method specifically comprises the following steps:

(2) for each sentence, preprocessing the sentence through a word segmentation tool, wherein the preprocessing comprises word segmentation, stop word removal and part of speech tagging;

(3) first calculating the weight $\mathrm{weight}_w$ of each word $w$ with the TF-IDF algorithm, then adjusting it to a new weight $\mathrm{weight}'_w$, calculated as follows:

$$\mathrm{weight}'_w = \mathrm{weight}_w \times T_w \times P_w$$

wherein $T_w$ is the title factor: if the word $w$ appears in the title of the document, $T_w > 1$, and its specific value is tuned according to the quality of the resulting text summary (the larger $T_w$ is, the more important the title is, so $T_w$ is adjusted according to the importance of the title); if the word $w$ does not appear in the title, $T_w = 1$; $P_w$ is the part-of-speech coefficient: if the word $w$ is an entity name, $P_w > 1$; otherwise $P_w = 1$;

from the new weights $\mathrm{weight}'_w$, the sentence vector $d_i$ is generated, expressed as:

$$d_i = (\mathrm{weight}'_{w_1}, \mathrm{weight}'_{w_2}, \ldots, \mathrm{weight}'_{w_m})$$

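(4) calculating the similarity between sentences by using the cosine similarity, and constructing a similarity matrix:

$$similarity_{X,Y} = \frac{\sum_i X_i Y_i}{\sqrt{\sum_i X_i^2}\,\sqrt{\sum_i Y_i^2}}$$

wherein $X_i$ is the weight $\mathrm{weight}'_{w,X}$ of the $i$-th word of sentence vector $X$, and $Y_i$ is the weight $\mathrm{weight}'_{w,Y}$ of the $i$-th word of sentence vector $Y$;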

(5) Carrying out iterative calculation by using the TextRank to obtain the score of each sentence;

(6) sorting the sentences according to the importance degree, and selecting a plurality of sentences with the importance degree larger than a preset value as candidate abstract sentences;