CN117391071B - News topic data mining method, device and storage medium - Google Patents

News topic data mining method, device and storage medium

Info

Publication number
CN117391071B
CN117391071B CN202311639781.4A
Authority
CN
China
Prior art keywords
event
news
vector
trend
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311639781.4A
Other languages
Chinese (zh)
Other versions
CN117391071A (en)
Inventor
谢红韬
袁公萍
陈林翠
张瑶
严增勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC Big Data Research Institute Co Ltd
Original Assignee
CETC Big Data Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC Big Data Research Institute Co Ltd filed Critical CETC Big Data Research Institute Co Ltd
Priority to CN202311639781.4A priority Critical patent/CN117391071B/en
Publication of CN117391071A publication Critical patent/CN117391071A/en
Application granted granted Critical
Publication of CN117391071B publication Critical patent/CN117391071B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/268Morphological analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03Data mining
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a news topic data mining method, device and storage medium, wherein the news topic data mining method comprises the following steps: collecting time series data of the news posting volume, and dividing the time series data through a preconfigured time window; converting the time series data into a one-dimensional vector based on the time scale of the time window; calculating a first-order difference vector of the one-dimensional vector; traversing the first-order difference vector through a sign function to generate a trend vector; traversing from the tail of the trend vector, and correcting zero values in the trend vector according to a preconfigured correction rule; performing a first-order difference calculation on the corrected trend vector to obtain second-order difference values; dividing the time series data into a plurality of independent event groups according to the second-order difference values; acquiring the text data of all news in each event group; converting the text data into TF-IDF vectors; performing density-based text clustering on the TF-IDF vectors to obtain a plurality of event news groups; and analyzing word frequency and part of speech through an NLP tool to generate a corresponding event title for each group.

Description

News topic data mining method, device and storage medium
Technical Field
The present disclosure relates to the field of big data technologies, and in particular, to a news topic data mining method, a news topic data mining device, and a storage medium.
Background
In the information age today, public opinion analysis and news event discovery are increasingly important. With the rapid development of big data technology, processing huge news data and extracting valuable information therefrom has become a complex and vital task. This not only helps to understand social dynamics and public concerns, but also provides rapid and accurate decision support for governments, businesses and individuals.
However, conventional news event discovery methods suffer from a series of problems that limit their effectiveness and practicality in today's complex information environment. First, these methods often fail to grasp the context of an event's development and cannot capture its complete course from initial occurrence through escalation to dissipation. Second, redundant events are a serious problem: when processing massive news data, traditional methods easily generate large numbers of near-duplicate events, which greatly hampers users' screening and acquisition of key information. Finally, timeliness is a significant weakness of conventional approaches, which tend to lag, particularly where real-time feedback and decision making are required.
Conventional news event discovery methods are based on manual summarization or simple clustering algorithms; because of their low accuracy and high time cost, they cannot cope well with the diversity and rapid change of modern social information. When processing massive amounts of news data, these methods often fail to meet users' precise needs for critical information, and rarely provide an effective means for a deep understanding of how an event develops.
Therefore, an innovative news event discovery method is needed to solve the above problems: improving accuracy, reducing redundant events, and enhancing timeliness, so as to better adapt to the needs of the information age. The present application aims to provide a new and efficient news event discovery algorithm by integrating key steps such as time series data analysis and text clustering, so as to meet modern society's urgent demand for accurate and real-time news information.
Disclosure of Invention
In order to solve the technical problems, the application provides a news topic data mining method, a news topic data mining device and a storage medium.
The following describes the technical solutions provided in the present application:
the first aspect of the application provides a news topic data mining method, which comprises the following steps:
collecting time sequence data of news manuscript sending quantity, and dividing the time sequence data through a preconfigured time window;
converting the time sequence data into a one-dimensional vector based on the time scale of the time window;
calculating a first-order differential vector of the one-dimensional vector;
traversing the first-order differential vector through a symbol function to generate a trend vector;
traversing from the tail of the trend vector, and correcting zero values in the trend vector according to a pre-configured correction rule;
Performing first-order differential calculation on the corrected trend vector to obtain a second-order differential value;
dividing the time sequence data into a plurality of independent event groups according to the second-order differential value;
for each event group, acquiring text data of all news in the event group;
converting the text data into TF-IDF vectors by a feature extraction method;
performing text clustering on the TF-IDF vector based on density to obtain a plurality of event news groups;
and performing word frequency and part-of-speech analysis on each event news group through an NLP tool to generate a corresponding event title.
Optionally, the converting the text data into TF-IDF vectors by the feature extraction method includes:
for text data in any event group, the following algorithm is performed:
word segmentation processing is carried out on the text data;
calculating word frequencies of various words in the text data by the following formula:
TF(t,d)=count(t,d)/total_terms(d);
wherein d represents any text data, t represents a given word, count (t, d) represents the number of times the given word t appears in the text data d, total_terms (d) represents the total number of words in the text data d, and TF (t, d) represents the word frequency of the given word t;
the inverse document frequency in the text data is calculated by:
IDF(t,D)=log(total_documents(D)/documents_containing_term(t,D));
Wherein D represents the set of text data in the event group, total_documents(D) represents the number of text data in the event group, documents_containing_term(t,D) represents the number of text data in the set that contain the given term t, log represents the natural logarithm, and IDF(t,D) represents the inverse document frequency of the given term t;
the TF-IDF vector is calculated by the following equation:
TF-IDF(t,d,D)=TF(t,d)×IDF(t,D);
where TF-IDF(t,d,D) represents the TF-IDF weight of the given word t in text data d; the weights of all words together form the TF-IDF vector of d.
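The three formulas above can be sketched in a few lines of plain Python. This is an illustrative sketch only; the function name `compute_tfidf` and the dictionary-based vector representation are our assumptions, not part of the patent:

```python
import math
from collections import Counter

def compute_tfidf(documents):
    """Compute TF-IDF weights for each term of each pre-tokenized document."""
    n_docs = len(documents)
    # documents_containing_term(t, D): number of documents in which t appears
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)          # count(t, d)
        total_terms = len(doc)         # total_terms(d)
        vec = {}
        for term, count in counts.items():
            tf = count / total_terms                 # TF(t, d)
            idf = math.log(n_docs / doc_freq[term])  # IDF(t, D), natural log
            vec[term] = tf * idf                     # TF-IDF(t, d, D)
        vectors.append(vec)
    return vectors
```

Note that a term occurring in every document of the group gets IDF = log(1) = 0, so its TF-IDF weight vanishes, as the formula implies.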
Optionally, the performing text clustering on the TF-IDF vector based on density to obtain a plurality of event news groups includes:
step one: determine the sample set D = (x1, x2, ..., xm), the neighborhood parameters (ε, MinPts) and a sample distance measure, wherein ε represents the neighborhood distance threshold and MinPts represents the minimum number of sample points that the neighborhood of a point should contain; the sample set is the set of TF-IDF vectors, and the unvisited sample set is initialized as Γ = D, the cluster partition as C = ∅ and the class number as k = 0;
step two: compute the ε-neighborhood sub-sample set Nε(xj) of each sample xj under the distance measure, wherein the ε-neighborhood is the region centered on xj with radius ε, and Nε(xj) contains all other samples whose distance to xj does not exceed ε;
step three: compare the cardinality |Nε(xj)| of each ε-neighborhood sub-sample set with MinPts, and add each sample xj with |Nε(xj)| ≥ MinPts to the core object sample set Ω;
step four: while the core object sample set Ω is not empty, randomly select a core object o from Ω, and execute the following algorithm:
initialize the current cluster core object queue Ωcur = {o};
update the class number k = k + 1;
initialize the current cluster sample set Ck = {o};
update the unvisited sample set Γ = Γ − {o};
step five: if the current cluster core object queue Ωcur is empty, the generation of the current cluster Ck is finished; after generating Ck, update the cluster partition C = C ∪ {Ck} and update the core object sample set Ω = Ω − Ck;
step six: if the current cluster core object queue Ωcur is not empty, execute the following algorithm:
take a core object o' out of the current cluster core object queue Ωcur;
determine its ε-neighborhood sub-sample set Nε(o') using the neighborhood distance threshold ε;
let Δ = Nε(o') ∩ Γ;
update the current cluster sample set Ck = Ck ∪ Δ, and update the unvisited sample set Γ = Γ − Δ;
update Ωcur = Ωcur ∪ (Δ ∩ Ω) − {o'};
repeat from step five;
step seven: output the cluster partition C = {C1, C2, ..., Ck}, obtaining the plurality of event news groups.
Optionally, dividing the time series data into a plurality of independent event groups according to the second-order differential value includes:
identifying peaks and troughs in the time sequence data according to the second-order differential value;
the time series data is divided into a plurality of independent event clusters based on the peaks and valleys.
Optionally, the traversing from the tail of the trend vector, and correcting the zero value in the trend vector according to a preconfigured correction rule includes:
traversing the trend vector from the tail and making the following corrections:
Trend(i) = 1, if Trend(i) = 0 and Trend(i+1) > 0;
Trend(i) = −1, if Trend(i) = 0 and Trend(i+1) < 0;
where Trend(i) denotes the i-th element of the trend vector; the traversal proceeds from the tail, so Trend(i+1) has already been fixed when Trend(i) is corrected.
Optionally, the performing word frequency part of speech analysis on each event news group through the NLP tool, and generating the corresponding event title includes:
counting the word frequencies of the nouns, prepositions and verbs contained in the titles of each event news group, and determining the noun, preposition and verb with the highest word frequency;
the maximum-class word frequency sum is calculated by the following formula:
Cnvp=Cn+Cv+Cp;
wherein Cnvp represents the maximum-class word frequency sum, Cn represents the word frequency of the most frequent noun, Cv the word frequency of the most frequent verb, and Cp the word frequency of the most frequent preposition;
the keyword word frequency threshold is calculated by the following formula:
C_threshold=((1-Cnvp)/(Csum))×(Cnvp/3);
wherein C_threshold represents the keyword word frequency threshold, and Csum represents the sum of the word frequencies of all words;
determining all keywords with word frequencies greater than the word frequency threshold value of the keywords, and forming a keyword array;
traversing the title of each event news group and generating event titles based on the keyword arrays.
Optionally, the traversing the title of each event news group and generating the event title based on the keyword array includes:
calculating the inclusion degree and word number of each event news group on the keyword array;
the title with the largest inclusion and the smallest word number is taken as the event title.
Optionally, the calculating the inclusion degree of each event news group to the keyword array includes:
for each title, the ratio of the number of keywords contained therein to the total number of keyword arrays is calculated.
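The two optional steps above (maximal keyword inclusion, ties broken by fewest words) can be illustrated in Python. Whitespace tokenization and the helper name `pick_event_title` are our assumptions; in practice a Chinese word segmenter would supply the tokens:

```python
def pick_event_title(titles, keywords):
    """Choose the title with maximal keyword inclusion; break ties by fewest words."""
    def inclusion(title_words):
        # ratio of keywords present in the title to the size of the keyword array
        return sum(1 for k in keywords if k in title_words) / len(keywords)
    # maximize inclusion first, then minimize word count
    return max(titles, key=lambda t: (inclusion(set(t.split())), -len(t.split())))
```

The tuple key makes the comparison lexicographic: only among titles with equal inclusion does the (negated) word count decide.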
A second aspect of the present application provides a news topic data mining apparatus, including:
the acquisition unit is used for acquiring time sequence data of news manuscript sending quantity and dividing the time sequence data through a preconfigured time window;
A conversion unit for converting the time series data into a one-dimensional vector based on the time scale of the time window;
a first-order calculation unit for calculating a first-order difference vector of the one-dimensional vector;
the trend vector generation unit is used for traversing the first-order difference vector through a symbol function to generate a trend vector;
the correcting unit is used for traversing from the tail part of the trend vector and correcting zero values in the trend vector according to a preset correcting rule;
the second-order computing unit is used for carrying out first-order difference computation on the corrected trend vector to obtain a second-order difference value;
the event group dividing unit is used for dividing the time sequence data into a plurality of independent event groups according to the second-order differential value;
a text data obtaining unit, configured to obtain, for each event group, text data of all news in the event group;
a vector conversion unit for converting the text data into TF-IDF vectors by a feature extraction method;
the clustering unit is used for carrying out text clustering on the TF-IDF vector based on density to obtain a plurality of event news groups;
the event title generation unit is used for carrying out word frequency part of speech analysis on each event news group through the NLP tool to generate a corresponding event title.
A third aspect of the present application provides a news topic data mining apparatus, the apparatus comprising:
a processor, a memory, an input-output unit, and a bus;
the processor is connected with the memory, the input/output unit and the bus;
the memory holds a program, and the processor invokes the program to perform the method of the first aspect or any optional implementation of the first aspect.
A fourth aspect of the present application provides a computer readable storage medium having a program stored thereon, which, when executed on a computer, performs the method of the first aspect or any optional implementation of the first aspect.
From the above technical scheme, the application has the following advantages:
1. By adopting time series data mining, the change trend of the news posting volume can be captured better; guided by the posting pattern of events, event groups can be extracted rapidly and accurately, so that the evolution of news topics can be understood more comprehensively.
2. The trend vector is generated through the first-order difference and the symbol function, so that the change trend of the news manuscript quantity can be reflected more clearly, and key time points such as occurrence, burst and dissipation can be identified.
3. A pre-configured correction rule is introduced to correct zero values in the trend vector. This helps to more accurately identify the start and end of the trend, improving the accuracy of event discovery.
4. The time sequence data is divided into a plurality of independent event groups by utilizing the second-order differential value, so that different events are distinguished more carefully, and the discovered events are more specific and targeted.
5. Through TF-IDF vector and text clustering based on density, text data in the event group is effectively extracted and clustered, and events are better organized and understood.
6. Word frequency and part-of-speech analysis is performed with an NLP tool, so that key information is extracted from the text and a representative event title is generated.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of one embodiment of a method for news topic data mining provided in the present application;
FIG. 2 is a schematic diagram of timing data in the present application;
FIG. 3 is a schematic diagram of identifying peaks and troughs of time series data in the present application;
FIG. 4 is a schematic diagram of event clusters based on peak-trough partitioning in the present application;
FIG. 5 is a schematic structural diagram of one embodiment of a news topic data mining apparatus provided in the present application;
fig. 6 is a schematic structural diagram of another embodiment of the news topic data mining apparatus provided in the present application.
Detailed Description
It should be noted that the method provided in the present application may be applied to a terminal, a system, or a server; the terminal may be, for example, a mobile terminal such as a smart phone, tablet computer, smart television, smart watch or portable computer, or a fixed terminal such as a desktop computer. For convenience of explanation, the terminal is used as the execution body in the following description.
Referring to fig. 1, the present application first provides an embodiment of a news topic data mining method, which includes:
s101, collecting time sequence data of news manuscript quantity, and dividing the time sequence data through a preconfigured time window;
in this step, first, time series data of news release amounts, that is, the number of news releases recorded in time series are collected. Then, the time series data is divided by a preset time window, and the whole time series is cut into a plurality of small time window segments for subsequent processing.
In this step, time series data of the news posting volume is acquired with a data collection tool or platform. This may include retrieving relevant information from news websites, social media platforms or other data sources; the collected data should contain the timestamp of each news release and the corresponding posting volume. The size of the time window and the sliding step are preset: the window size is the time range covered by each window, and the sliding step is the time interval between windows. In this embodiment the interval may be hours.
Dividing the whole time sequence data according to a preset time window. The entire time series data can be segmented by moving one step at a time in a sliding window manner. The data within each time window forms a sub-sequence representing the news posting conditions within that window.
For each time window, the news manuscript amount in the window can be counted, and the data are arranged according to the time sequence to form a one-dimensional vector. Each vector element represents a contribution amount within a respective time window. And traversing all time windows to obtain a one-dimensional vector of the whole time sequence data. This vector may reflect the trend of news manuscript amount over time.
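A minimal sketch of this collection-and-windowing step in Python, assuming non-overlapping windows (sliding step equal to the window size) and already-collected timestamps; the function name and signature are our illustration, not the patent's:

```python
from collections import Counter
from datetime import datetime, timedelta

def posting_volume_vector(timestamps, start, end, window_hours=1):
    """Bucket news release timestamps into fixed time windows and count
    the posting volume per window, in time order."""
    window = timedelta(hours=window_hours)
    n_windows = int((end - start) / window)
    # (ts - start) // window is the 0-based index of the window containing ts
    counts = Counter((ts - start) // window for ts in timestamps if start <= ts < end)
    # one vector element per window; windows with no postings count as 0
    return [counts[i] for i in range(n_windows)]
```

With a sliding step smaller than the window, overlapping windows would instead be enumerated explicitly and each timestamp counted in every window covering it.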
Fig. 2 is a schematic diagram of a timing data.
S102, converting the time sequence data into a one-dimensional vector based on the time scale of the time window;
at this step, the time series data is converted into a one-dimensional vector by sorting the news manuscript amount in each time window. Each element of the one-dimensional vector corresponds to a time window whose value represents the amount of news contribution within the time window.
In this step, the goal of converting the time series data into a one-dimensional vector is to provide a more convenient data form for subsequent analysis and processing. Each time window of the whole series is traversed in turn, and the news posting volume within it is counted; this may be the number of news items, the click volume, or another suitable indicator of news activity. The posting volume of each window becomes one element of the one-dimensional vector, and the elements are ordered window by window, i.e. in the time order of the series.
S103, calculating a first-order differential vector of the one-dimensional vector;
in this step, the first-order difference refers to a difference between adjacent elements. In this embodiment, the first order difference operation will be applied to the one-dimensional vector, and the variation of the news posting volume in the adjacent time window is calculated. This can be achieved by traversing one-dimensional vectors and calculating the differences between adjacent elements.
The following examples illustrate:
given a one-dimensional vector X, its elements are x_1, x_2, …, x_n, where n is the length of the vector.
The first order differential vector Diff is defined as:
Diff_i=X_{i+1}-X_i
wherein Diff is the first-order difference vector and i is the vector index, 1 ≤ i ≤ n−1. Diff_i represents the difference between the (i+1)-th element and the i-th element of the original vector.
Can be expressed in mathematical notation as:
Diff=[Diff_1,Diff_2,…,Diff_{n-1}]
this vector Diff is the first order difference vector of the one-dimensional vector X.
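As a minimal illustration, the definition above can be written directly in Python (the helper name `first_order_diff` is ours):

```python
def first_order_diff(x):
    """Diff_i = x_{i+1} - x_i, producing a vector one element shorter than x."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]
```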
S104, traversing the first-order differential vector through a symbol function to generate a trend vector;
in the step, signs in the first-order differential vector are extracted through a sign function, and a trend vector is generated. The trend vector reflects the trend of the news manuscript amount, namely occurrence, burst, dissipation and the like.
And extracting signs of the first-order differential vectors to generate trend vectors.
Definition of the sign function:
sign(x)={
1,if x>0
0,if x=0
-1,if x<0
}
first order differential vector:
Diff=[Diff_1,Diff_2,...,Diff_{n-1}]
trend vector:
TrenDiff=[sign(Diff_1),sign(Diff_2),...,sign(Diff_{n-1})]
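The sign-function traversal above can likewise be sketched in Python (helper names are ours):

```python
def sign(x):
    """The sign function defined above: maps x onto {-1, 0, 1}."""
    return 1 if x > 0 else (-1 if x < 0 else 0)

def trend_vector(diff):
    """Apply the sign function element-wise to a first-order difference vector."""
    return [sign(d) for d in diff]
```

A run of 1s thus marks rising posting volume, −1s falling volume, and 0s a plateau to be corrected in the next step.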
s105, traversing from the tail of the trend vector, and correcting zero values in the trend vector according to a preset correction rule;
the goal of this step is to correct the trend vector to increase its accuracy, traversing from the tail of the vector, correcting by pre-configured correction rules. In the time-series data of news manuscript amount, the trend vector is corrected to more accurately capture the trend of the event development.
In some cases, the raw data may contain some noise or fluctuations, resulting in transient zero values in the trend vector. By modifying these zero values, the trend vector can be smoothed to more closely match the actual contribution change trend.
The correction may increase the sensitivity of the trend vector to changes in the manuscript amount. If there are consecutive zero values in the trend vector, some subtle changes may be missed. By correction, the actual manuscript quantity change can be better captured, and the sensitivity of the algorithm is improved.
In some cases, the raw data may cause zero values to occur due to acquisition or processing uncertainties. Correction helps to reduce such errors and make the trend vector more accurately reflect the actual situation.
A specific modified embodiment is provided below:
traversing the trend vector from the tail and making the following corrections:
Trend(i) = 1, if Trend(i) = 0 and Trend(i+1) > 0;
Trend(i) = −1, if Trend(i) = 0 and Trend(i+1) < 0;
where Trend(i) denotes the i-th element of the trend vector; the traversal proceeds from the tail, so Trend(i+1) has already been fixed when Trend(i) is corrected.
In this embodiment, correction rules are pre-configured, primarily for zero values in the trend vector. The correction rules may include some of the following possible scenarios:
If a certain zero value in the trend vector is followed by a positive value, the zero value is corrected to a positive value.
If a certain zero value in the trend vector is followed by a negative value, the zero value is corrected to a negative value.
Depending on the actual situation, other correction rules may also need to be considered to ensure the rationality and validity of the correction.
For each zero value, the correction is performed according to a pre-configured correction rule. The corrected trend vector can more accurately reflect the change trend of the news manuscript quantity.
The following is an example of pseudo code for this step:
for i from n-1 down to 1:
    if Trend(i) == 0:
        if Trend(i+1) > 0:
            Trend(i) = 1
        elif Trend(i+1) < 0:
            Trend(i) = -1
in this example, we assume that when the zero value in the trend vector is followed by a positive value, the zero value is corrected to a positive value, and if it is a negative value, the correction is negative.
S106, performing first-order difference calculation on the corrected trend vector to obtain a second-order difference value;
in this step, the present embodiment performs a first order difference operation on the corrected trend vector to calculate a second order difference value. The first-order differential operation has been used in the previous step, which represents the trend of variation between adjacent time points. The second order difference represents the trend of the first order difference.
An example of pseudo code is provided below:
corrected trend vector:
Trend = [Trend_1, Trend_2, ..., Trend_n]
first-order difference operation:
FirstDiff = [Trend[i] - Trend[i-1] for i in range(1, n)]
second-order difference operation:
SecondDiff = [FirstDiff[i] - FirstDiff[i-1] for i in range(1, n-1)]
s107, dividing the time sequence data into a plurality of independent event groups according to the second-order differential value;
in this step, the time series data is divided into a plurality of independent event groups according to the second-order differential value. The specific implementation can be carried out according to the following steps:
first, a partitioning rule for defining a second order difference value is required to determine when to partition time series data into a new event group. This may be achieved by setting a threshold, detecting peaks and valleys, etc.
Traversing the calculated second order difference value from beginning to end. And according to the dividing rule, finding the positions of the second-order differential values meeting the dividing condition, and dividing the time sequence data into different event groups at the positions. For each event group, the time points at which it starts and ends are determined, resulting in a time range for each individual event group.
The following is an example of pseudo code for this process:
# second-order difference values
SecondDiff = [SecondDiff_1, SecondDiff_2, ..., SecondDiff_{n-1}]

# division rule (example: split into a new event when the second-order
# difference value is greater than a threshold)
threshold = 0.5

# initialize the event-group list
event_groups = []

# traverse the second-order difference values
for i in range(n - 1):
    if SecondDiff[i] > threshold:
        # according to the division rule, split off a new event
        event_group = {
            'start_time': i,          # event start time
            'end_time': i + 1,        # event end time
            'data': series[i:i + 1],  # event data (series is the windowed time series)
        }
        event_groups.append(event_group)

# obtain the time range and data of each event group
for event_group in event_groups:
    start_time = event_group['start_time']
    end_time = event_group['end_time']
    event_data = event_group['data']
    # the data of each event group can be further processed, stored, analyzed, etc.
In another alternative implementation, the peaks and troughs of the second-order difference values can be identified, and the division performed according to them.
The method comprises the following specific steps: identifying peaks and troughs in the time sequence data according to the second-order differential value; the time series data is divided into a plurality of independent event clusters based on the peaks and valleys.
In this alternative implementation, the time series data is divided based on peaks and troughs by identifying the peaks and troughs of the second order differential values. For the second order differential value, peaks and valleys are identified by some algorithm or rule. An alternative approach is to find turning points from positive to negative or from negative to positive in the second order differential value or to find extreme points (maxima or minima) which may represent peaks or troughs. The time series data is divided into a plurality of independent event clusters according to the identified peaks and valleys. The time period between each peak and trough may be considered as an event cluster that includes an occurrence-burst-dissipation change in the amount of manuscript.
Determining the peaks and troughs by the extremum points includes finding local maxima in the time series data as peaks and local minima as troughs. The following is one possible implementation:
First, traverse the second-order differential values and find the positions whose value is larger than that of both adjacent points; these positions are peaks.
Then traverse the second-order differential values again and find the positions whose value is smaller than that of both adjacent points; these positions are troughs.
Finally, merge the peak and trough positions and sort them in time order to obtain a peak-trough sequence.
Referring to fig. 3 and 4, fig. 3 is a schematic diagram of an identified peak and trough, and fig. 4 is a schematic diagram of dividing a plurality of event clusters based on the peak and trough.
The time sequence data is divided into a plurality of independent event groups according to the positions of the wave crests and the wave troughs. Each event group corresponds to a peak-to-valley time period and includes an occurrence-burst-dissipation change in the amount of manuscript.
The following is an example of pseudo code for this process:
# second-order differential values
SecondDiff = [SecondDiff_1, SecondDiff_2, ..., SecondDiff_{n-1}]
# find peaks
peaks = [i for i in range(1, n - 1) if SecondDiff[i - 1] < SecondDiff[i] > SecondDiff[i + 1]]
# find troughs
valleys = [i for i in range(1, n - 1) if SecondDiff[i - 1] > SecondDiff[i] < SecondDiff[i + 1]]
# merge peaks and troughs and sort them in time order
peaks_and_valleys = sorted(peaks + valleys)
# divide event clusters according to peaks and troughs
event_groups = []
for i in range(1, len(peaks_and_valleys)):
    start_index = peaks_and_valleys[i - 1]
    end_index = peaks_and_valleys[i]
    event_group = {
        'start_time': start_index,
        'end_time': end_index,
        'data': timing[start_index:end_index],  # event data
    }
    event_groups.append(event_group)
# obtain the time range and data of each event group
for event_group in event_groups:
    start_time = event_group['start_time']
    end_time = event_group['end_time']
    event_data = event_group['data']
    # the data of each event group can be further processed, stored, analyzed, etc.
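A runnable Python version of this peak/trough pseudocode might look as follows; the `second_diff` and `timing` values are made up for illustration:

```python
def find_peaks_and_valleys(second_diff):
    """Return sorted indices of local maxima (peaks) and local minima (valleys)."""
    n = len(second_diff)
    peaks = [i for i in range(1, n - 1)
             if second_diff[i - 1] < second_diff[i] > second_diff[i + 1]]
    valleys = [i for i in range(1, n - 1)
               if second_diff[i - 1] > second_diff[i] < second_diff[i + 1]]
    return sorted(peaks + valleys)

def split_into_event_groups(timing, second_diff):
    """Divide the series into event groups between consecutive peak/valley positions."""
    points = find_peaks_and_valleys(second_diff)
    return [{'start_time': points[i - 1],
             'end_time': points[i],
             'data': timing[points[i - 1]:points[i]]}
            for i in range(1, len(points))]

# Made-up sample data for illustration
second_diff = [0.0, 1.0, -0.5, 0.8, -0.2, 0.1]
timing = [5, 20, 8, 18, 7, 9]
groups = split_into_event_groups(timing, second_diff)
```

Here peaks are found at indices 1 and 3 and valleys at 2 and 4, yielding three event groups between consecutive turning positions.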
Determining the peaks and troughs by extreme points and by turning points each has advantages:
Advantages of determining peaks and troughs by extreme points:
Intuitiveness: extreme points generally correspond to significant changes in the data and are relatively easy to understand and interpret. A peak usually indicates that the data reaches a high point within a certain period, while a trough indicates that the data bottoms out.
Stability: extreme points reflect the overall trend of the data to some extent, so peaks and troughs determined this way are generally more stable with respect to the overall characteristics of the data.
Advantages of determining peaks and troughs by turning points:
Flexibility: compared with relying only on extrema, turning points cope more flexibly with transient changes in the data. In some cases, short, sharp fluctuations may not be evident at extreme points but are more easily identified at turning points.
Adaptability: turning-point detection adapts better to data of different shapes and distributions. In some cases the data may have no significant extrema, yet the manuscript amount changes rapidly, and turning points capture such changes better.
S108, for each event group, acquiring text data of all news in the event group;
in this step, for each divided event group, the text data of all news in the event group needs to be acquired. The previously divided event groups are traversed, and the time range recorded for each event group determines which news belong to it. According to the time range of the event group, the corresponding database or data source is queried to obtain the news published within that range, and text data, including headlines, body text, etc., is extracted from the queried news.
S109, converting the text data into TF-IDF vectors through a feature extraction method;
in the step, feature extraction is carried out on the acquired text data, and a TF-IDF method is adopted to convert the text data into TF-IDF vectors. The TF-IDF vector is used to represent the characteristics of the text data reflecting the importance of each word in the text.
In implementing the conversion of text data into TF-IDF vectors, some common Natural Language Processing (NLP) libraries and tools may be used to simplify the task. The following is a specific implementation:
The acquired text data is first preprocessed, including removing stop words, punctuation marks and special characters, and converting the text to lowercase. The text data is then segmented into words or lexical units; a word segmentation tool such as NLTK (Natural Language Toolkit) or spaCy may be used. The TF-IDF value of each word is calculated using the TF-IDF algorithm: TF (Term Frequency) indicates the frequency of a word in the text, IDF (Inverse Document Frequency) indicates the inverse document frequency of the word, and TF-IDF is the product of the two. Finally, the calculated TF-IDF values are combined into a vector, in which each word corresponds to one dimension.
The TF-IDF conversion can be implemented in this step using the TfidfVectorizer class of the scikit-learn library, which encapsulates the TF-IDF calculation and vectorization process.
One specific TF-IDF vector generation implementation is provided below:
for text data in any event group, the following algorithm is performed:
word segmentation processing is carried out on the text data;
calculating word frequencies of various words in the text data by the following formula:
TF(t,d)=count(t,d)/total_terms(d);
wherein d represents any text data, t represents a given word, count (t, d) represents the number of times the given word t appears in the text data d, total_terms (d) represents the total number of words in the text data d, and TF (t, d) represents the word frequency of the given word t;
Calculating an inverse document frequency in the text data by:
IDF(t,D)=log(total_documents(D)/documents_containing_term(t,D));
wherein D represents the set of text data in the event group, total_documents(D) represents the number of text data in the event group, documents_containing_term(t, D) represents the number of text data in the set that contain the given word t, log represents the natural logarithm operation, and IDF(t, D) represents the inverse document frequency of the given word t;
the TF-IDF vector is calculated by the following equation:
TF-IDF(t,d,D)=TF(t,d)×IDF(t,D);
where TF-IDF(t, d, D) represents the TF-IDF value of the given word t in the text data d.
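The three formulas above can be sketched directly in Python. The toy corpus is an assumption for illustration, word segmentation is taken as already done, and each term is assumed to occur in at least one document:

```python
import math

def tf(term, doc):
    # TF(t, d) = count(t, d) / total_terms(d); doc is a list of tokens
    return doc.count(term) / len(doc)

def idf(term, docs):
    # IDF(t, D) = log(total_documents(D) / documents_containing_term(t, D))
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)  # assumes term occurs in >= 1 document

def tf_idf(term, doc, docs):
    # TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)
    return tf(term, doc) * idf(term, docs)

# Toy pre-tokenized corpus (an assumption; segmentation happens upstream)
docs = [["news", "event", "news"], ["event", "report"], ["sports", "report"]]
weight = tf_idf("news", docs[0], docs)
```

For the word "news" in the first document, TF = 2/3 and IDF = log(3/1), so the weight is (2/3)·log 3.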
In the news data mining processing scene of the scheme, the text data is converted into the TF-IDF vector, and the method has the following advantages:
the TF-IDF vector can effectively convert text information into numerical characteristics, and the important information of words in the text is reserved. This helps the machine learning algorithm understand and process the text data. TF-IDF takes into account the importance of each word in the text and helps emphasize key words by calculating the relative importance of each word in the text collection. In news data mining, this enables better capture of key information in news headlines and content. The TF-IDF vector is a sparse vector in which the vast majority of elements are zero. This sparsity helps reduce the need for storage and computing resources when processing large-scale text data. TF-IDF vectors are commonly used for text clustering and classification tasks. By using the TF-IDF vector, patterns, trends and topics can be found in mining news data, and effective classification and clustering of news can be achieved. For some words that frequently appear in the entire set of text but are not important in the particular text, the TF-IDF suppresses its effect by reducing its weight so that the model is more focused on those words that are more critical in the particular text. The use of TF-IDF vectors can be conveniently integrated with a variety of machine learning algorithms, including clustering, classification, and other text analysis tasks.
S110, performing text clustering on the TF-IDF vector based on density to obtain a plurality of event news groups;
in this step, density-based text clustering is performed on the TF-IDF vectors to aggregate text data into a plurality of event news groups. Each event news group contains news that is similar in text feature space.
In the step of implementing density-based text clustering, it may be implemented as follows:
the TF-IDF method is used for each news text, which is represented as a high-dimensional numerical vector, where each dimension corresponds to a vocabulary item.
Here, density-based text clustering may be used, and the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm may be selected. The algorithm can automatically find clusters of arbitrary shape based on the density of data points.
The algorithm has two key parameters, namely a density radius (e) and a minimum data point number (MinPts). The density radius determines the neighborhood of a core point, while the minimum number of data points refers to at least how many data points are needed in the neighborhood to form a cluster.
Using TF-IDF vectors as input, text clustering is performed using an algorithm. The algorithm will aggregate similar text into one cluster according to the density of the text in TF-IDF space and identify outliers (noise).
After execution of the algorithm, a plurality of text clusters will be obtained, each cluster representing an event news group. The text in these clusters is highly similar and may represent the same event or topic. Each event news group is analyzed for specific news content contained therein, for example, meaning of each cluster may be analyzed by keywords, topics, etc.
One specific example of obtaining multiple event news clusters through density-based clustering is provided below:
in this embodiment, performing text clustering on the TF-IDF vector based on density to obtain a plurality of event news groups includes:
step one: determining a sample set D = (x1, x2, ..., xm), neighborhood parameters (e, MinPts) and a sample distance measure, wherein e represents the neighborhood distance threshold, MinPts represents the number of sample points that the neighborhood of a point should at least contain, and the sample set is a set of a plurality of TF-IDF vectors;
step two: calculating an e-neighborhood sub-sample set Ne (Xj) of each sample Xj based on the sample distance measurement mode;
step three: comparing the number of elements |Ne(Xj)| of the e-neighborhood sub-sample set Ne(Xj) with MinPts, and adding each sample Xj whose |Ne(Xj)| is larger than MinPts into a core object sample set Ω, wherein the e-neighborhood represents a circular area with radius e centered at the sample Xj, the sample distance measure is used to determine the distance between samples, and the e-neighborhood sub-sample set Ne(Xj) comprises all other samples whose distance from the sample Xj does not exceed e;
The method comprises the following specific steps:
for each sample Xj, the distance between it and all other samples in the dataset is calculated.
Taking the sample Xj as a center and the radius as e, finding all samples with the distance not exceeding e from the Xj, and putting the samples into Ne (Xj).
Step four: when the core object sample set Ω is not empty, randomly selecting a core object o from the core object sample set Ω, and executing the following algorithm:
initializing a current cluster core object queue omega cur= { o };
initializing a class sequence number k=k+1;
initializing a current cluster sample set Ck= { o };
updating the unvisited sample set Γ = Γ - { o };
step five: if the current cluster core object queue Ωcur is empty, the generation of the current cluster Ck is finished; after generating the cluster Ck, updating the cluster partition C = C ∪ {Ck}, and updating the core object sample set Ω = Ω − Ck;
step six: if the current cluster core object queue Ω cur is not empty, then the following algorithm is performed:
taking out a core object o' from a current cluster core object queue omega cur;
determining all e-neighborhood subsampled sets Ne (o') by a neighborhood distance threshold e;
let Δ = Ne(o') ∩ Γ;
updating the current cluster sample set Ck = Ck ∪ Δ, and updating the unvisited sample set Γ = Γ − Δ;
updating Ωcur = Ωcur ∪ (Δ ∩ Ω) − {o'};
repeating the fifth step;
step seven: outputting the cluster partition C = {C1, C2, ..., Ck}, resulting in a plurality of event news groups.
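Steps one to seven can be sketched as a minimal pure-Python DBSCAN. Euclidean distance is assumed here as the sample distance measure (for TF-IDF vectors, cosine distance would be a common alternative), and the sample points are illustrative:

```python
import math

def dbscan(samples, eps, min_pts):
    """Minimal DBSCAN following steps one to seven; returns clusters of sample indices."""
    def neighborhood(j):
        # e-neighborhood Ne(Xj): indices of all samples within eps of sample j
        return {k for k in range(len(samples))
                if math.dist(samples[j], samples[k]) <= eps}

    # step three: core objects are samples with at least min_pts neighbors
    core = {j for j in range(len(samples)) if len(neighborhood(j)) >= min_pts}
    unvisited = set(range(len(samples)))   # Γ
    clusters = []                          # cluster partition C
    remaining_core = set(core)             # Ω
    while remaining_core:                  # step four
        o = next(iter(remaining_core))
        queue = {o}                        # Ωcur
        cluster = {o}                      # Ck
        unvisited.discard(o)
        while queue:                       # step six
            o2 = queue.pop()                        # take a core object o'
            delta = neighborhood(o2) & unvisited    # Δ = Ne(o') ∩ Γ
            cluster |= delta                        # Ck = Ck ∪ Δ
            unvisited -= delta                      # Γ = Γ − Δ
            queue |= delta & core                   # Ωcur = Ωcur ∪ (Δ ∩ Ω) − {o'}
        clusters.append(sorted(cluster))   # step five: Ck finished, C = C ∪ {Ck}
        remaining_core -= cluster          # Ω = Ω − Ck
    return clusters

# Two well-separated toy groups of points (illustrative)
samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = dbscan(samples, eps=1.5, min_pts=2)
```

With these parameters, the two point groups come out as two clusters; points belonging to no cluster would simply never be reached from a core object and could be reported as noise.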
In this embodiment, the density-based clustering algorithm can adapt the distribution of data without pre-specifying the number of clusters. This allows for more flexibility in the algorithm for event news groups of different sizes and densities. The algorithm has robustness to noisy data, enabling outliers (points not belonging to any cluster) to be marked as noise. In news data, there may be some uncorrelated or abnormal news, and these noise points do not interfere with the generation of normal event news clusters.
Unlike traditional K-means and other algorithms, density-based clustering can form clusters of arbitrary shape, and is suitable for complex shapes and distribution of event news groups.
The algorithm is capable of processing data with widely varying densities. In event news, there may be a higher posting density for some time periods and a lower posting density for other time periods, and this change can be captured well by the algorithm. The algorithm is different from K-means algorithm and the like, the number of clusters is not required to be specified in advance in clustering based on density, and priori knowledge of a data structure is avoided. The clustering results are relatively easy to explain, each cluster represents an event news group, and news in the clusters are relatively similar, so that understanding and analysis are facilitated.
S111, analyzing word frequency parts of speech of each event news group through an NLP tool, and generating corresponding event titles.
The news in each event news group is subjected to word frequency and part-of-speech analysis with an NLP (natural language processing) tool to generate a corresponding event title. When performing this step, an appropriate natural language processing tool, such as NLTK (Natural Language Toolkit), spaCy or Stanford NLP, is first selected to support word frequency and part-of-speech analysis of the text.
Before using the NLP tool, each news text is preprocessed, including removing stop words and punctuation marks and performing stemming or lemmatization, to reduce noise and redundant information in the text.
And analyzing news texts in each event news group by using an NLP tool, counting word frequencies, and finding out keywords with higher occurrence frequency. Word frequency information may be obtained by counting the number of occurrences of each word in the event news group. And performing part-of-speech analysis by using an NLP tool, and determining the part-of-speech of each word in the text. This helps understand the grammatical roles of words and extracts key information such as nouns, verbs, etc.
According to the word frequency and the part-of-speech analysis result, keywords such as nouns, verbs and the like with higher occurrence frequency are selected, and a representative event title can be constructed by combining the context. The title may be generated according to a certain rule, such as selecting several words with highest word frequency to form the title, or according to a certain algorithm to generate the title with generalization. And associating the generated event titles with corresponding event news groups to form final results for subsequent analysis and display.
A specific embodiment for generating event titles is provided below, which includes:
counting the word frequency of nouns, prepositions and verbs contained in the titles of each event news group, and determining the nouns, prepositions and verbs with the highest word frequency;
the maximum class word frequency sum is calculated by the following equation:
Cnvp=Cn+Cv+Cp;
wherein Cnvp represents the maximum-class word frequency sum, Cn represents the word frequency of the most frequent noun, Cv represents the word frequency of the most frequent verb, and Cp represents the word frequency of the most frequent preposition;
the keyword word frequency threshold is calculated by the following formula:
C_threshold=((1-Cnvp)/(Csum))×(Cnvp/3);
wherein C_threshold represents the keyword word frequency threshold, and Csum represents the sum of the word frequencies of all words;
determining all keywords with word frequencies greater than the word frequency threshold value of the keywords, and forming a keyword array;
Traversing the title of each event news group and generating event titles based on the keyword arrays.
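A Python sketch of this threshold rule is given below. The tokenized titles and the simplified part-of-speech map are illustrative assumptions; note also that with raw counts Cnvp typically exceeds 1, which makes the threshold negative so that every word qualifies — the embodiment does not state whether word frequencies are absolute or normalized:

```python
from collections import Counter

def keyword_threshold(titles_tokens, pos_tags):
    """Word-frequency threshold rule of the embodiment (raw counts assumed)."""
    freq = Counter(w for title in titles_tokens for w in title)
    c_sum = sum(freq.values())  # Csum: sum of word frequencies of all words

    def max_freq(tag):
        # highest word frequency among words with the given part of speech
        return max((c for w, c in freq.items() if pos_tags.get(w) == tag), default=0)

    cnvp = max_freq('n') + max_freq('v') + max_freq('p')  # Cnvp = Cn + Cv + Cp
    c_threshold = ((1 - cnvp) / c_sum) * (cnvp / 3)       # C_threshold formula
    keywords = [w for w, c in freq.items() if c > c_threshold]
    return c_threshold, keywords

# Illustrative tokenized titles and a toy part-of-speech map ('n'/'v'/'p')
titles = [["china", "launches", "satellite"],
          ["satellite", "launches", "into", "orbit"]]
pos = {"china": 'n', "satellite": 'n', "orbit": 'n',
       "launches": 'v', "into": 'p'}
c_threshold, keywords = keyword_threshold(titles, pos)
```

For this toy input, Cn = 2, Cv = 2, Cp = 1, Csum = 7, so C_threshold = ((1 − 5)/7) × (5/3) = −20/21, and all five words pass the threshold.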
According to the embodiment, automatic word frequency statistics and threshold calculation are carried out on the text data, so that automatic processing of event titles is realized, and the burden of manual operation is reduced. By considering the word frequency of keywords such as nouns, verbs, prepositions and the like and combining the calculation of the sum of the maximum word frequency, the key information in the title can be effectively extracted, and the representative event title can be generated. The threshold mode is adopted, so that the keyword selection has certain flexibility. The threshold value can be adjusted according to actual conditions so as to meet the requirements in different scenes.
The word frequency part of speech analysis is carried out by utilizing a Natural Language Processing (NLP) tool, so that the deep understanding of text data is enhanced, and the accuracy of title generation is improved. The method based on statistics and threshold is suitable for news events in different fields and topics, and has certain universality and adaptability.
The foregoing describes in detail an embodiment of a news topic data mining method provided in the present application, and the following describes in detail an embodiment of a news topic data mining apparatus provided in the present application:
Referring to fig. 4, the present application first provides an embodiment of a news topic data mining apparatus, including:
the acquisition unit 401 is configured to acquire time sequence data of a news manuscript amount, and divide the time sequence data through a preconfigured time window;
a conversion unit 402, configured to convert the time series data into a one-dimensional vector based on a time scale of the time window;
a first-order calculation unit 403 for calculating a first-order differential vector of the one-dimensional vector;
a trend vector generating unit 404, configured to traverse the first-order difference vector through a sign function, and generate a trend vector;
a correction unit 405, configured to traverse from the tail of the trend vector, and correct zero values in the trend vector according to a pre-configured correction rule;
a second-order computing unit 406, configured to perform a first-order difference computation on the corrected trend vector to obtain a second-order difference value;
an event group dividing unit 407 configured to divide the time-series data into a plurality of independent event groups according to the second-order differential value;
a text data obtaining unit 408, configured to obtain, for each event group, text data of all news in the event group;
A vector conversion unit 409 for converting the text data into TF-IDF vectors by a feature extraction method;
a clustering unit 410, configured to perform text clustering on the TF-IDF vector based on density, so as to obtain a plurality of event news groups;
the event title generating unit 411 is configured to perform word frequency part of speech analysis on each event news group through the NLP tool, and generate a corresponding event title.
Optionally, the vector conversion unit 409 is specifically configured to:
for text data in any event group, the following algorithm is performed:
word segmentation processing is carried out on the text data;
calculating word frequencies of various words in the text data by the following formula:
TF(t,d)=count(t,d)/total_terms(d);
wherein d represents any text data, t represents a given word, count (t, d) represents the number of times the given word t appears in the text data d, total_terms (d) represents the total number of words in the text data d, and TF (t, d) represents the word frequency of the given word t;
the inverse document frequency in the text data is calculated by:
IDF(t,D)=log(total_documents(D)/documents_containing_term(t,D));
wherein D represents the set of text data in the event group, total_documents(D) represents the number of text data in the event group, documents_containing_term(t, D) represents the number of text data in the set that contain the given word t, log represents the natural logarithm operation, and IDF(t, D) represents the inverse document frequency of the given word t;
The TF-IDF vector is calculated by the following equation:
TF-IDF(t,d,D)=TF(t,d)×IDF(t,D);
where TF-IDF(t, d, D) represents the TF-IDF value of the given word t in the text data d.
Optionally, the event group dividing unit 407 is specifically configured to:
traversing the trend vector from the tail and making the following corrections:
Trend(i) = 1 if Trend(i) = 0 and Trend(i+1) > 0;
Trend(i) = -1 if Trend(i) = 0 and Trend(i+1) < 0;
where Trend(i) represents the i-th element of the trend vector counted from the tail.
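One possible reading of this correction rule as Python is sketched below; ordinary head-to-tail indices are used, with the traversal running from the tail so that corrected signs propagate backwards through runs of zeros (this indexing interpretation is an assumption, since the rule phrases its index from the tail):

```python
def correct_trend(trend):
    """Correct zero values by traversing from the tail, per the rule above."""
    t = list(trend)
    # the last element has no successor, so start at the second-to-last
    for i in range(len(t) - 2, -1, -1):
        if t[i] == 0:
            if t[i + 1] > 0:
                t[i] = 1
            elif t[i + 1] < 0:
                t[i] = -1
    return t

corrected = correct_trend([1, 0, 0, -1, 0, 1])
```

Because the traversal starts at the tail, a run of zeros takes the sign of the first non-zero value that follows it in time.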
Optionally, the event title generating unit 411 is specifically configured to:
counting the word frequency of nouns, prepositions and verbs contained in the titles of each event news group, and determining the nouns, prepositions and verbs with the highest word frequency;
the maximum class word frequency sum is calculated by the following equation:
Cnvp=Cn+Cv+Cp;
wherein Cnvp represents the maximum-class word frequency sum, Cn represents the word frequency of the most frequent noun, Cv represents the word frequency of the most frequent verb, and Cp represents the word frequency of the most frequent preposition;
the keyword word frequency threshold is calculated by the following formula:
C_threshold=((1-Cnvp)/(Csum))×(Cnvp/3);
wherein C_threshold represents the keyword word frequency threshold, and Csum represents the sum of the word frequencies of all words;
determining all keywords with word frequencies greater than the word frequency threshold value of the keywords, and forming a keyword array;
Traversing the title of each event news group and generating event titles based on the keyword arrays.
Optionally, the event title generating unit 411 is specifically configured to:
calculating, for the title of each event news group, the inclusion degree of the keyword array and the word count;
taking the title with the largest inclusion degree and the smallest word count as the event title.
Optionally, the event title generating unit 411 is specifically configured to:
for each title, the ratio of the number of keywords contained therein to the total number of keyword arrays is calculated.
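A minimal sketch of this inclusion-degree selection is shown below; the titles are assumed pre-tokenized, and the keyword array and titles are illustrative:

```python
def inclusion_ratio(title_tokens, keywords):
    """Ratio of keywords that appear in the title to the total number of keywords."""
    kw = set(keywords)
    return len(kw & set(title_tokens)) / len(kw)

def pick_event_title(titles_tokens, keywords):
    """Title with the largest inclusion ratio; ties broken by the fewest words."""
    return min(titles_tokens,
               key=lambda t: (-inclusion_ratio(t, keywords), len(t)))

# Illustrative keyword array and candidate titles
keywords = ["satellite", "launch", "orbit"]
titles = [["satellite", "launch"],
          ["satellite", "launch", "into", "orbit", "today"],
          ["weather", "report"]]
best = pick_event_title(titles, keywords)
```

Here the second title contains all three keywords (inclusion ratio 1.0) and is selected even though it is longer; the word count only decides between titles with equal inclusion.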
Referring to fig. 6, the present application further provides a news topic data mining apparatus, including:
a processor 601, a memory 602, an input/output unit 603, and a bus 604;
the processor 601 is connected to the memory 602, the input-output unit 603, and the bus 604;
the memory 602 holds a program, which the processor 601 invokes to perform any of the methods described above.
The present application also relates to a computer readable storage medium having a program stored thereon, characterized in that the program, when run on a computer, causes the computer to perform any of the methods as above.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM, random access memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

Claims (10)

1. A news topic data mining method, the method comprising:
collecting time sequence data of news manuscript sending quantity, and dividing the time sequence data through a preconfigured time window;
converting the time sequence data into a one-dimensional vector based on the time scale of the time window;
Calculating a first-order differential vector of the one-dimensional vector;
traversing the first-order differential vector through a symbol function to generate a trend vector;
traversing from the tail of the trend vector, and correcting zero values in the trend vector according to a pre-configured correction rule;
performing first-order differential calculation on the corrected trend vector to obtain a second-order differential value;
dividing the time sequence data into a plurality of independent event groups according to the second-order differential value;
for each event group, acquiring text data of all news in the event group;
converting the text data into TF-IDF vectors by a feature extraction method;
performing text clustering on the TF-IDF vector based on density to obtain a plurality of event news groups;
for each event news group, performing word frequency part-of-speech analysis through an NLP tool to generate a corresponding event title;
performing text clustering on the TF-IDF vector based on density to obtain a plurality of event news groups, wherein the steps comprise:
step one: determining a sample set d= (x 1, x2,) xm, a neighborhood parameter (e, minPts), and a sample distance measure, wherein e represents a neighborhood distance threshold, minPts represents a number of sample points that should be included in at least a neighborhood of one point, the sample set being a set of a plurality of TF-IDF vectors;
Step two: calculating an e-neighborhood sub-sample set Ne (Xj) of each sample Xj based on the sample distance measurement mode, wherein the e-neighborhood represents a circular area with the sample Xj as a center and the radius being e, the sample distance measurement mode is used for determining the distance between samples, and the e-neighborhood sub-sample set Ne (Xj) contains all other samples with the distance not exceeding e from the sample Xj;
step three: comparing the number of elements |Ne(Xj)| of the e-neighborhood sub-sample set Ne(Xj) with MinPts, and adding each sample Xj whose |Ne(Xj)| is larger than MinPts into a core object sample set Ω;
step four: when the core object sample set Ω is not empty, randomly selecting a core object o from the core object sample set Ω, and executing the following algorithm:
initializing a current cluster core object queue omega cur= { o };
initializing a class sequence number k=k+1;
initializing a current cluster sample set Ck= { o };
updating the unvisited sample set Γ = Γ - { o };
step five: if the current cluster core object queue Ωcur is empty, the generation of the current cluster Ck is finished; after generating the cluster Ck, updating the cluster partition C = C ∪ {Ck}, and updating the core object sample set Ω = Ω − Ck;
step six: if the current cluster core object queue Ω cur is not empty, then the following algorithm is performed:
Taking out a core object o' from a current cluster core object queue omega cur;
determining all e-neighborhood subsampled sets Ne (o') by a neighborhood distance threshold e;
let Δ = Ne(o') ∩ Γ;
updating the current cluster sample set Ck = Ck ∪ Δ, and updating the unvisited sample set Γ = Γ − Δ;
updating Ωcur = Ωcur ∪ (Δ ∩ Ω) − {o'};
repeating the fifth step;
step seven: outputting the cluster partition C = {C1, C2, ..., Ck}, resulting in a plurality of event news groups.
2. The news topic data mining method of claim 1, wherein the converting the text data into TF-IDF vectors by the feature extraction method includes:
for text data in any event group, the following algorithm is performed:
word segmentation processing is carried out on the text data;
calculating word frequencies of various words in the text data by the following formula:
TF(t,d)=count(t,d)/total_terms(d);
wherein d represents any text data, t represents a given word, count (t, d) represents the number of times the given word t appears in the text data d, total_terms (d) represents the total number of words in the text data d, and TF (t, d) represents the word frequency of the given word t;
the inverse document frequency in the text data is calculated by:
IDF(t,D)=log(total_documents(D)/documents_containing_term(t,D));
wherein D represents the set of text data in the event group, total_documents(D) represents the number of text data in the event group, documents_containing_term(t, D) represents the number of text data in the set that contain the given word t, log represents the natural logarithm operation, and IDF(t, D) represents the inverse document frequency of the given word t;
The TF-IDF vector is calculated by the following equation:
TF-IDF(t,d,D)=TF(t,d)×IDF(t,D);
where TF-IDF(t, d, D) represents the TF-IDF value of the given word t in the text data d.
3. The news topic data mining method of claim 1, wherein dividing the time series data into a plurality of independent event groups based on the second order differential value includes:
identifying peaks and troughs in the time sequence data according to the second-order differential value;
dividing the time series data into a plurality of independent event groups based on the peaks and troughs.
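A minimal sketch of this segmentation, assuming the trend pipeline of claim 1 (sign of the first-order difference, then a second difference in which +2 marks a trough and −2 marks a peak). Cutting the series at troughs is one plausible reading of "divided based on the peaks and troughs", and the zero-value correction of claim 4 is omitted for brevity:

```python
import numpy as np

def segment_by_troughs(series):
    # Trend vector: sign of the first-order difference (+1 rising, -1 falling).
    trend = np.sign(np.diff(series))
    # Second-order value: a trough (fall -> rise) shows up as +2.
    second = np.diff(trend)
    troughs = [i + 1 for i, v in enumerate(second) if v == 2]
    # Cut the series into independent event groups at each trough.
    bounds = [0] + troughs + [len(series)]
    return [series[a:b] for a, b in zip(bounds[:-1], bounds[1:])]

daily_counts = np.array([1, 3, 5, 2, 1, 4, 6, 3])  # hypothetical posting volumes
groups = segment_by_troughs(daily_counts)          # two rise-and-fall event groups
```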
4. The news topic data mining method of claim 1, wherein traversing from the tail of the trend vector, correcting zero values in the trend vector according to a pre-configured correction rule includes:
traversing the trend vector from the tail and making the following corrections:
Trend(i) = 1 if Trend(i) = 0 and Trend(i+1) > 0;
Trend(i) = -1 if Trend(i) = 0 and Trend(i+1) < 0;
wherein Trend(i) represents the i-th element of the trend vector, counted from the tail.
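The correction rule reads naturally as a tail-to-head sweep in which each zero inherits the sign of the element after it; a short sketch under that reading:

```python
def correct_trend(trend):
    # Traverse from the tail; Trend(i+1) is already corrected when Trend(i) is
    # visited, so a whole flat stretch inherits the sign of the movement after it.
    fixed = list(trend)
    for i in range(len(fixed) - 2, -1, -1):
        if fixed[i] == 0 and fixed[i + 1] > 0:
            fixed[i] = 1
        elif fixed[i] == 0 and fixed[i + 1] < 0:
            fixed[i] = -1
    return fixed

corrected = correct_trend([1, 0, 0, -1, 0, 1])  # -> [1, -1, -1, -1, 1, 1]
```

Sweeping backwards is what makes runs of zeros collapse in one pass; a forward sweep would leave interior zeros unresolved.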
5. The news topic data mining method of claim 1, wherein the performing word frequency part of speech analysis on each event news group through the NLP tool to generate the corresponding event title includes:
counting the word frequencies of the nouns, prepositions and verbs contained in the titles of each event news group, and determining the noun, preposition and verb with the highest word frequency;
the maximum-class word frequency sum is calculated by the following formula:
Cnvp=Cn+Cv+Cp;
wherein Cnvp represents the maximum-class word frequency sum, Cn represents the word frequency of the most frequent noun, Cv represents the word frequency of the most frequent verb, and Cp represents the word frequency of the most frequent preposition;
the keyword word frequency threshold is calculated by the following formula:
C_threshold=((1-Cnvp)/(Csum))×(Cnvp/3);
wherein C_threshold represents the keyword word frequency threshold, and Csum represents the sum of the word frequencies of all words;
determining all keywords with word frequencies greater than the word frequency threshold value of the keywords, and forming a keyword array;
traversing the title of each event news group and generating event titles based on the keyword arrays.
6. The news topic data mining method of claim 5 wherein traversing each event news group and generating event titles based on the keyword array includes:
calculating, for the title of each event news group, its inclusion degree of the keyword array and its word count;
taking the title with the largest inclusion degree and the smallest word count as the event title.
7. The news topic data mining method of claim 6, wherein the calculating the inclusion degree of the keyword array for each event news group includes:
for each title, calculating the ratio of the number of keywords it contains to the total number of keywords in the keyword array.
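Claims 5 to 7 together describe a simple scoring rule for picking the event title; a sketch assuming titles are already tokenized (the sample keywords and titles are hypothetical):

```python
def inclusion_degree(title, keywords):
    # Claim 7: ratio of keywords the title contains to the total keyword count.
    return sum(1 for k in keywords if k in title) / len(keywords)

def pick_event_title(titles, keywords):
    # Claim 6: largest inclusion degree wins; ties go to the shortest title.
    return max(titles, key=lambda t: (inclusion_degree(t, keywords), -len(t)))

keywords = ["flood", "rescue", "city"]
titles = [
    ["flood", "hits", "city"],
    ["flood", "rescue", "in", "city", "today"],
    ["rescue", "city", "flood"],
]
best = pick_event_title(titles, keywords)  # -> ["rescue", "city", "flood"]
```

The second and third titles both contain all three keywords, so the shorter third title is selected, matching the "largest inclusion degree, smallest word count" rule.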
8. A news topic data mining apparatus, comprising:
the acquisition unit is used for acquiring time sequence data of news manuscript sending quantity and dividing the time sequence data through a preconfigured time window;
a conversion unit for converting the time series data into a one-dimensional vector based on the time scale of the time window;
a first-order calculation unit for calculating a first-order difference vector of the one-dimensional vector;
the trend vector generation unit is used for traversing the first-order difference vector through a symbol function to generate a trend vector;
the correcting unit is used for traversing from the tail part of the trend vector and correcting zero values in the trend vector according to a preset correcting rule;
the second-order computing unit is used for carrying out first-order difference computation on the corrected trend vector to obtain a second-order difference value;
the event group dividing unit is used for dividing the time sequence data into a plurality of independent event groups according to the second-order differential value;
a text data obtaining unit, configured to obtain, for each event group, text data of all news in the event group;
a vector conversion unit for converting the text data into TF-IDF vectors by a feature extraction method;
the clustering unit is used for carrying out text clustering on the TF-IDF vector based on density to obtain a plurality of event news groups;
the event title generation unit is used for carrying out word frequency part-of-speech analysis on each event news group through the NLP tool to generate a corresponding event title;
the clustering unit is specifically configured to perform the following steps:
step one: determining a sample set D = {x1, x2, ..., xm}, neighborhood parameters (ε, MinPts), and a sample distance measure, wherein ε represents the neighborhood distance threshold, MinPts represents the minimum number of sample points that the neighborhood of a point should contain, and the sample set is a set of a plurality of TF-IDF vectors;
step two: calculating the ε-neighborhood sub-sample set Nε(xj) of each sample xj based on the sample distance measure, wherein the ε-neighborhood represents the circular area centered on the sample xj with radius ε, the sample distance measure is used for determining the distance between samples, and the ε-neighborhood sub-sample set Nε(xj) contains all other samples whose distance from the sample xj does not exceed ε;
step three: comparing the cardinality |Nε(xj)| of the ε-neighborhood sub-sample set Nε(xj) with MinPts, and adding each sample xj for which |Nε(xj)| is greater than MinPts to the core object sample set Ω;
step four: when the core object sample set Ω is not empty, randomly selecting a core object o from Ω, and executing the following algorithm:
initializing the current cluster core object queue Ωcur = {o};
initializing the class sequence number k = k + 1;
initializing the current cluster sample set Ck = {o};
updating the unvisited sample set Γ = Γ − {o};
step five: if the current cluster core object queue Ωcur is empty, the generation of the current cluster Ck is finished; after generating the cluster Ck, updating the cluster partition C = C ∪ {Ck}, and updating the core object sample set Ω = Ω − Ck;
step six: if the current cluster core object queue Ωcur is not empty, executing the following algorithm:
taking a core object o' out of the current cluster core object queue Ωcur;
determining the ε-neighborhood sub-sample set Nε(o') by the neighborhood distance threshold ε;
letting Δ = Nε(o') ∩ Γ;
updating the current cluster sample set Ck = Ck ∪ Δ, and updating the unvisited sample set Γ = Γ − Δ;
updating Ωcur = Ωcur ∪ (Δ ∩ Ω) − {o'};
repeating step five;
step seven: outputting the cluster partition C = {C1, C2, ..., Ck}, resulting in a plurality of event news groups.
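Steps one to seven are the standard DBSCAN procedure. A compact, dependency-light sketch follows; Euclidean distance is an assumption (the claim leaves the sample distance measure open), and the claim's strict "greater than MinPts" core test is relaxed to the conventional ">=" here:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    # Step two: eps-neighborhood of every sample (Euclidean distance assumed).
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [set(np.flatnonzero(dist[j] <= eps)) for j in range(n)]
    # Step three: core objects have at least min_pts samples in their neighborhood.
    core = {j for j in range(n) if len(neighbors[j]) >= min_pts}
    unvisited = set(range(n))
    clusters = []
    # Steps four to six: grow one cluster from each remaining core object.
    while core:
        o = next(iter(core))
        queue, cluster = {o}, {o}
        unvisited -= {o}
        while queue:                 # step six, repeated until the queue empties
            o2 = queue.pop()
            delta = neighbors[o2] & unvisited
            cluster |= delta         # Ck = Ck U Delta
            unvisited -= delta       # Gamma = Gamma - Delta
            queue |= delta & core    # Omega_cur = Omega_cur U (Delta n Omega)
        core -= cluster              # step five: Omega = Omega - Ck
        clusters.append(sorted(cluster))
    return clusters                  # step seven: C = {C1, ..., Ck}

# Hypothetical 2-D stand-ins for TF-IDF vectors: two well-separated groups.
X = np.array([[0, 0], [0, 0.1], [0.1, 0], [5, 5], [5, 5.1], [5.1, 5]])
groups = dbscan(X, eps=0.5, min_pts=2)
```

Samples reachable from no core object are simply left unassigned, which mirrors the claim: only the cluster partition is output.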
9. A news topic data mining apparatus, the apparatus comprising:
A processor, a memory, an input-output unit, and a bus;
the processor is connected with the memory, the input/output unit and the bus;
the memory holds a program which the processor invokes to perform the method of any one of claims 1 to 7.
10. A computer readable storage medium having a program stored thereon, which when executed on a computer performs the method of any of claims 1 to 7.
CN202311639781.4A 2023-12-04 2023-12-04 News topic data mining method, device and storage medium Active CN117391071B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311639781.4A CN117391071B (en) 2023-12-04 2023-12-04 News topic data mining method, device and storage medium

Publications (2)

Publication Number Publication Date
CN117391071A CN117391071A (en) 2024-01-12
CN117391071B true CN117391071B (en) 2024-02-27

Family

ID=89465162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311639781.4A Active CN117391071B (en) 2023-12-04 2023-12-04 News topic data mining method, device and storage medium

Country Status (1)

Country Link
CN (1) CN117391071B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101196904A (en) * 2007-11-09 2008-06-11 清华大学 News keyword abstraction method based on word frequency and multi-component grammar
CN105335349A (en) * 2015-08-26 2016-02-17 天津大学 Time window based LDA microblog topic trend detection method and apparatus
CN113009906A (en) * 2021-03-04 2021-06-22 青岛弯弓信息技术有限公司 Big data prediction analysis method and system based on industrial Internet
CN113627788A (en) * 2021-08-10 2021-11-09 中国电信股份有限公司 Service policy determination method and device, electronic equipment and storage medium
CN113742464A (en) * 2021-07-28 2021-12-03 北京智谱华章科技有限公司 News event discovery algorithm and device based on heterogeneous information network
CN114648388A (en) * 2022-04-01 2022-06-21 左黎明 Big data analysis method and system for dealing with personalized service customization
CN115730589A (en) * 2022-11-04 2023-03-03 中电科大数据研究院有限公司 News propagation path generation method based on word vector and related device
WO2023134075A1 (en) * 2022-01-12 2023-07-20 平安科技(深圳)有限公司 Text topic generation method and apparatus based on artificial intelligence, device, and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10157223B2 (en) * 2016-03-15 2018-12-18 Accenture Global Solutions Limited Identifying trends associated with topics from natural language text

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Multi-dimension topic mining based on hierarchical semantic graph model; Zhang Tingting et al.; IEEE Access; 2020-03-30; vol. 8; pp. 64820-64835 *
The detection of low-rate DoS attacks using the SADBSCAN algorithm; Tang Dan et al.; Information Sciences; 2021-07-01; vol. 565; pp. 229-247 *
Research on key technologies of news authentication and prototype implementation; Deng Yuandan; China Masters' Theses Full-text Database, Information Science and Technology; 2022-01-15 (no. 01); I138-2753 *
Research on density clustering algorithms combining sampling and grouping; Zheng Yanxi; China Masters' Theses Full-text Database, Information Science and Technology; 2022-04-15 (no. 04); I138-410 *

Similar Documents

Publication Publication Date Title
US11573996B2 (en) System and method for hierarchically organizing documents based on document portions
CN107229668B (en) Text extraction method based on keyword matching
US8526735B2 (en) Time-series analysis of keywords
US6826576B2 (en) Very-large-scale automatic categorizer for web content
JP4885842B2 (en) Search method for content, especially extracted parts common to two computer files
CN110825877A (en) Semantic similarity analysis method based on text clustering
Selvakuberan et al. Feature selection for web page classification
CN107463548B (en) Phrase mining method and device
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN114676346A (en) News event processing method and device, computer equipment and storage medium
CN113032573A (en) Large-scale text classification method and system combining theme semantics and TF-IDF algorithm
CN117391071B (en) News topic data mining method, device and storage medium
Mohemad et al. Performance analysis in text clustering using k-means and k-medoids algorithms for Malay crime documents
JPH06282587A (en) Automatic classifying method and device for document and dictionary preparing method and device for classification
CN116108181A (en) Client information processing method and device and electronic equipment
CN115329173A (en) Method and device for determining enterprise credit based on public opinion monitoring
Mustapha et al. Automatic textual aggregation approach of scientific articles in OLAP context
Triwijoyo et al. Analysis of Document Clustering based on Cosine Similarity and K-Main Algorithms
Williams Results of classifying documents with multiple discriminant functions
Selivanov et al. Package ‘text2vec’
CN111737469A (en) Data mining method and device, terminal equipment and readable storage medium
Luo et al. A comparison of som based document categorization systems
CN117235137B (en) Professional information query method and device based on vector database
CN116414939B (en) Article generation method based on multidimensional data
CN109977269B (en) Data self-adaptive fusion method for XML file

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant