CN113822048A - Social media text denoising method based on space-time burst characteristics

Info

Publication number: CN113822048A
Application number: CN202111086719.8A
Authority: CN
Inventors: 费高雷; 程勇; 胡光岷
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2021-09-16
Filing date: 2021-09-16
Publication date: 2021-12-21
Anticipated expiration: 2041-09-16
Also published as: CN113822048B

Abstract

The invention discloses a social media text denoising method based on space-time burst characteristics, belongs to the field of data processing, and aims to solve the problem of poor text classification effect in the prior art. Meanwhile, in order to reduce the influence of the Ripley's K function threshold l of the words on the result and reduce word misjudgment, a graph regularization algorithm is introduced, and the accuracy of word validity judgment is improved by fusing relevance information among the words.

Description

Social media text denoising method based on space-time burst characteristics

Technical Field

The invention belongs to the field of data processing, and particularly relates to a text denoising technology.

Background

With the widespread use of social media, billions of text messages are published every day on platforms such as Twitter, Facebook, Instagram and microblogs, and together they form one of the most comprehensive and timely sources of information available. Extracting and analyzing this information enables many valuable applications.

Text is one of the important forms for content expression in social media by users, so social media text data contains a great deal of valuable information, and the data is also input for many social media data mining tasks. However, due to the openness of social media, most of the text information in the social media is description of personal life and personal emotion, and the text usually does not contain valuable information.

Social media text denoising aims to identify and retain texts related to events, topics and the like. Such texts are valuable and can serve as input to various social media mining tasks; all other texts are useless (called "noise texts") and need to be removed. In social media, the vast majority of texts are noise texts. Text denoising is therefore required before information from social media is used, so as to reduce the influence of noise texts on subsequent tasks. Removing useless data reduces the influence of noise on downstream models, greatly reduces the amount of data to be processed, and greatly improves the processing speed of social-media-based tasks without degrading their results.

The existing social media text denoising problem is essentially a text classification problem: deciding whether a text is valuable. There are generally two types of approaches. The first is supervised learning based on model training: a tweet data set is built, each tweet is manually labeled according to whether it contains valuable information, and a machine learning model (SVM, GBDT, XGBoost and the like) or a deep learning model (C-LSTM, TextCNN and the like) is trained on the labeled data to classify new input. Such methods require a large amount of manual labeling, and they cannot keep pace with time: a kind of text that is noise now is not necessarily noise at all later times.

The second is the unsupervised approach, which measures the value of words by analyzing their time-series characteristics in the text stream and then judges whether a text contains useful information. Whether a word has value at the current moment can be measured by judging whether the word belongs to some event or topic in the current time period, that is, whether the word is part of the expression of a current event. If the word belongs to an event at the current time, the word is valuable. When an event occurs, the frequency of the words corresponding to the event increases rapidly, so the frequency of a word in each time period can be counted in real time to judge whether its count rises sharply in the current period. The question of whether a word is valuable thus becomes the question of whether the time series of its occurrences is bursty at the current time, which is typically decided using the z-score. This approach needs no labeling and keeps pace with time, but it is not accurate enough: it considers only the time-series characteristics of words and ignores both their geographic position characteristics and the associations between words.
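The z-score check mentioned above can be sketched as follows. This is a minimal illustration, not the method of any particular prior-art system: the function name `is_bursty`, the cut-off of 3.0 and the handling of a flat history are all assumptions made for the example.

```python
from statistics import mean, pstdev


def is_bursty(counts, z_threshold=3.0):
    """Return True if the last count is a burst relative to the earlier ones.

    counts: per-period occurrence counts of one word, oldest first.
    z_threshold: assumed cut-off; the source does not specify a value.
    """
    history, current = counts[:-1], counts[-1]
    if len(history) < 2:
        return False  # not enough history to estimate a baseline
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Flat history: treat any increase as a burst (an assumption).
        return current > mu
    return (current - mu) / sigma > z_threshold
```

A word whose count jumps from about 5 per period to 40 is flagged, while a word that stays near its baseline is not.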

The existing social media text denoising method generally comprises the steps of giving a training set by a user, extracting text features, constructing a classification model by a supervised learning method, and classifying texts. The effectiveness of these methods depends strongly on the number and quality of training sets, while scalability is poor (classification for one task is difficult to extend to other tasks). In practice, the labeling of the training set generally needs to be completed manually, so that it is very difficult to obtain a large number of training sets with high quality; meanwhile, since social media text data has the characteristics of strong sparsity, irregular expression and the like, the classification features of a plurality of texts are difficult to accurately extract, and the classification effect of the existing method is poor.

Disclosure of Invention

In order to solve the above technical problems, the invention provides a social media text denoising method based on space-time burst characteristics. The method measures the value of each word comprehensively by extracting the time and geographic position information of the word, and corrects misjudgments by building an association graph between words and exploiting the relevance between them. This reduces the sensitivity of noise word recognition to the threshold and improves the accuracy of value word recognition. The method is an optimized extension of the second (unsupervised) approach described above.

The technical scheme adopted by the invention is as follows: a social media text denoising method based on space-time burst characteristics comprises the following steps:

S1, counting the spatiotemporal information of each word according to the sending time and sending place of the social media texts, and representing the spatiotemporal information of the word as points in a three-dimensional space;

S2, measuring the aggregation degree of the points in the three-dimensional space through the Ripley's K function;

S3, introducing the conditional probabilities among words to establish a word association graph, and judging whether a word is a noise word by combining the aggregation degree through a graph regularization method.

In step S1, the spatiotemporal information is expressed as [time_i, longitude_i, latitude_i], where time_i represents the time information of the i-th tweet, longitude_i represents the longitude information of the i-th tweet, and latitude_i represents the latitude information of the i-th tweet.

In step S1, the geographic entity words in a text are recognized, and the latitude and longitude of the words in the text are obtained from the latitude and longitude of those geographic entities.

In step S2, the aggregation degree of the points in the three-dimensional space is measured through the Ripley's K function, which is expressed as:

$$K_w(t, h) = \lambda_w^{-1} E[N_w(t, h)]$$

wherein K_w(t, h) represents the Ripley's K function; λ_w represents the density of the word w in the three-dimensional space; N_w(t, h) represents the number of other points of the word w in the region around a given three-dimensional point of the word w, with h the spatial radius and t the temporal radius; and E[N_w(t, h)] represents the expectation of N_w(t, h).

In step S2, the value of the Ripley's K function of the word w is calculated by the following formula:

$$\hat{K}_w(t, h) = \frac{RT}{N_w^2}\, N_w\big(h_{i,j} < h,\ |t_i - t_j| < t\big)$$

wherein K̂_w(t, h) represents the estimated value of the Ripley's K function; R is the area of the space in which the tweets appear; T is the time span over which the tweets were collected; N_w is the number of occurrences of the word w in the space-time region RT; and N_w(h_{i,j} < h, |t_i − t_j| < t) represents the number of point pairs, among the three-dimensional coordinate points corresponding to the space-time positions where the word w appears, whose spatial distance is less than h and whose time interval is less than t.

Step S2 further includes normalizing K̂_w(t, h), wherein L_w denotes the normalized K̂_w(t, h).

Step S3 specifically includes the following substeps:

S31, setting two thresholds l₁ and l₂ with l₁ < l₂; if L_w < l₁, the word w is determined to be a noise word; if L_w > l₂, the word w is determined to be valuable; otherwise, step S32 is executed;

S32, calculating the conditional probability that the word w_j occurs given that the word w_i occurs, and then executing step S33;

S33, taking each word as a vertex W of the graph and the conditional probabilities between words as directed edges E, constructing an inter-word association graph G = (W, E);

S34, using the word association graph G = (W, E), judging the noise words by a graph regularization method based on the L_w value of the word itself and the L_v values of its neighboring words v.

Step S34 is implemented as follows: for each word w with l₁ ≤ L_w ≤ l₂, the word's own L_w value is updated according to the L_v values of its neighboring words v, using the normalized conditional probability of each neighboring word v; the updated L'_w is then assigned to L_w. Through iteration, words with L_w < l₁ or L_w > l₂ are obtained, and the words with L_w > l₂ are finally output.

Specifically, the conditional probability P(w | v) of the neighboring word v is divided by the sum of the conditional probabilities of all neighboring words of the word w to obtain the normalized result.

The method further comprises setting a condition for stopping the iteration, specifically: a threshold θ₀ is set and θ is initialized to 0; for each word updated in an iteration, θ is increased by the square of the difference between the L_w value before the iteration and the L_w value updated in the iteration; if the θ obtained after the iteration is less than or equal to θ₀, the iteration stops.

The invention has the following beneficial effects: the tweet denoising method based on spatiotemporal information can quickly and accurately identify noise tweets without data labeling. The method exploits the fact that valuable text information is aggregated in space and time, and constructs a space-time Ripley's K function to describe the space-time burstiness of the words in the text, so that words carrying valuable information are identified and texts containing no such words are removed. The method provided by the invention is an unsupervised noise text recognition method that can effectively recognize noise information in text without depending on a training set, and it has the following advantages:

1) The Ripley's K function based on spatiotemporal information can measure whether data points are aggregated in space-time, and it is applicable to similar problems of evaluating the space-time aggregation of data points;

2) An association graph of the words is established from the conditional probabilities of the words, and a graph regularization algorithm mines the association information among the words, which reduces the sensitivity of value word recognition to the threshold setting and improves the accuracy of noise text recognition;

3) The method filters text based on value words and is an unsupervised noise text recognition method that does not require a manually labeled training set, and therefore has better robustness and usability.

Drawings

FIG. 1 is a block diagram of an embodiment of the present invention;

FIG. 2 is a flow chart of a graph regularization algorithm of the present invention;

FIG. 3 shows the coverage of the 54 events achieved by different numbers of words, without and with regularization, according to an embodiment of the present invention.

Detailed Description

The invention provides a social media text denoising method based on space-time burst characteristics. The method aims at the characteristic that value information (texts related to events, subjects and the like) in social media has aggregative property in time and space, and models the space-time distribution of each word in the texts from the perspective of space-time burstiness, so that words related to the events and the subjects are identified from mass social media data. The value text and the noise text are distinguished according to whether the words in the text have space-time aggregation, and usability and effectiveness of the social media text denoising method can be improved.

The method aims at judging whether words in the text in the current time window have aggregative property in space-time distribution or not, and identifies the words with aggregative property, namely the words with value. And if a certain text does not contain any value word, judging the text to be a noise text and directly removing the noise text. Therefore, the technical scheme of the invention is mainly divided into three parts:

First, according to the sending time and sending place of the social media texts, the spatiotemporal information of each word is counted and represented as points in a three-dimensional space.

Second, the aggregation degree of the points in the three-dimensional space is measured through the Ripley's K function and used as the evaluation criterion for the value of the word.

Third, a word association graph is established by introducing the conditional probabilities among words, and a graph regularization method is designed to reduce the sensitivity of value word recognition to the decision threshold and to improve the accuracy and robustness of value word judgment. As shown in FIG. 1, the method of the present invention comprises the following steps:

1. spatiotemporal information extraction of words

For a word w, the sending time and the latitude and longitude of the geographic position of each text containing the word are collected. Suppose the word w appears in n texts and the spatiotemporal information of the i-th text is [time_i, longitude_i, latitude_i]; the spatiotemporal information of the word can then be expressed as w_info = {[time_i, longitude_i, latitude_i]}, 1 ≤ i ≤ n. The time information time_i can be viewed as one dimension. The latitude and longitude (longitude_i, latitude_i) correspond to a position on the spherical earth, but only the earth's surface is considered, so they can be regarded as the coordinates of a point on a two-dimensional plane. The spatiotemporal information of each word can therefore be regarded as a set of points in a three-dimensional space.
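The construction of w_info can be sketched in a few lines. The sketch below assumes each tweet is given as a (text, time, longitude, latitude) tuple and uses lowercase whitespace tokenization; both the input layout and the function name `collect_word_points` are illustrative assumptions rather than part of the described method.

```python
from collections import defaultdict


def collect_word_points(tweets):
    """Map each word to the (time, longitude, latitude) points of the tweets containing it.

    tweets: iterable of (text, time, longitude, latitude) tuples (assumed layout).
    """
    word_info = defaultdict(list)
    for text, time, lon, lat in tweets:
        # One point per tweet and word: repeated words in one tweet count once.
        # Whitespace tokenization is a simplification for illustration.
        for word in set(text.lower().split()):
            word_info[word].append((time, lon, lat))
    return dict(word_info)
```

Each word then owns a list of three-dimensional points, which is the input expected by the aggregation measure in the next section.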

To measure the distance between points in the three-dimensional space corresponding to a word, the distance in the time dimension and the distance in the space dimension need to be calculated separately. The distance in the time dimension is the time difference, and the distance in the space dimension is the distance between two points on the earth's surface obtained from their latitudes and longitudes. Given the latitude and longitude coordinates A(LatA, LonA) and B(LatB, LonB) of two points on the earth's surface, the distance between them is:

$$d(A, B) = 2 R_1 \arcsin\sqrt{\sin^2\alpha + \cos(\mathrm{LatA})\cos(\mathrm{LatB})\sin^2\beta}$$

wherein α = (LatA − LatB)/2, β = (LonA − LonB)/2, and R₁ = 6371 km is the mean radius of the earth.
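A surface distance of this kind is commonly computed with the haversine formula, which uses exactly the half-differences α and β defined above. The following sketch assumes that form and takes its inputs in degrees; the function name `surface_distance_km` is chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius of the earth, R1 in the text


def surface_distance_km(lat_a, lon_a, lat_b, lon_b):
    """Great-circle distance in km between two points given in degrees (haversine)."""
    alpha = math.radians(lat_a - lat_b) / 2  # half the latitude difference
    beta = math.radians(lon_a - lon_b) / 2   # half the longitude difference
    s = (math.sin(alpha) ** 2
         + math.cos(math.radians(lat_a)) * math.cos(math.radians(lat_b))
         * math.sin(beta) ** 2)
    # Clamp to guard against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, s)))
```

For two points on the equator 90 degrees of longitude apart, the function returns a quarter of the earth's circumference, about 10,007.5 km.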

In practice, every text on social media carries its sending time, but only a small fraction carries latitude and longitude information. It is therefore necessary to use the toolkits provided by NLTK, StanfordNLP and spaCy together to identify which words are entity words representing geographical locations, to resolve these geographic entity words into latitudes and longitudes through the geopy library, and to use the latitude and longitude of the geographic entity words in a text as the latitude and longitude of that text. In a concrete implementation, a table of geographic entities and their latitudes and longitudes is generally maintained: for a new text, the geographic entity words are identified first; if the latitude and longitude of an entity word is already stored in the table, it is used directly; otherwise the entity word is resolved and the result is added to the table. In a specific experiment on 5 million (500w) tweets, 2 million (200w) contained one or more geographic entity words, which means that the latitude and longitude of about 40% of the tweets can be inferred in this way.
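The lookup table described above behaves like a cache in front of the geocoder. The sketch below shows only that caching logic; `resolve` is a hypothetical callable standing in for a real geocoding call (for example one built on geopy), and no actual geocoding service is invoked here.

```python
def make_cached_geocoder(resolve):
    """Wrap a geocoding function with an in-memory entity table.

    resolve: hypothetical callable mapping an entity word to (latitude, longitude).
    Returns the lookup function and the table it fills.
    """
    table = {}

    def lookup(entity):
        if entity not in table:
            # Unknown entity: resolve it once and remember the result.
            table[entity] = resolve(entity)
        return table[entity]

    return lookup, table
```

Because place names repeat heavily across tweets, most lookups are served from the table and the expensive resolution step runs once per distinct entity.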

2. Word space-time burst feature extraction based on Ripley's K function

At a particular time and space, things that have a large number of people discussing are considered valuable, as are keywords to which they correspond. Thus, it can be modeled as: when a word has aggregations in a space-time three-dimensional space, the word is proved to be discussed in a large amount at a certain place at the current moment, so that the word corresponds to a certain event and has value.

Ripley's K function can be used to measure whether a set of points on a two-dimensional plane is clustered (bursty), and the two-dimensional Ripley's K function can be extended to three dimensions. For a word w, the Ripley's K function based on three-dimensional spatiotemporal information is defined as follows:

$$K_w(t, h) = \lambda_w^{-1} E[N_w(t, h)]$$

wherein λ_w is the average number of occurrences of the word per space-time unit, which is equivalent to the density of the word w in the three-dimensional space; N_w(t, h) represents the number of other points of the word w within the region around a given three-dimensional point of the word w with spatial radius h and temporal radius t (the region is a cylinder whose base radius is h and whose height is t); and E[N_w(t, h)] is its expectation.

For a given spatial distance h and time interval t, the value of the Ripley's K function can be estimated by the following equation:

$$\hat{K}_w(t, h) = \frac{RT}{N_w^2}\, N_w\big(h_{i,j} < h,\ |t_i - t_j| < t\big)$$

wherein R is the area of the space in which the tweets appear (if the collected tweets are all from New York, R is the area of New York); T is the time span over which the tweets were collected; N_w is the number of occurrences of the word w in the space-time region RT; and N_w(h_{i,j} < h, |t_i − t_j| < t) represents the number of point pairs, among the three-dimensional coordinate points corresponding to the space-time positions where the word w appears, whose spatial distance is less than h and whose time interval is less than t. Intuitively, if the occurrences of a word are concentrated in a small region of space-time, the Ripley's K function value of the word is large.
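An estimator of this kind can be sketched as below. The sketch assumes the form RT / N_w² times the number of qualifying point pairs, counts ordered pairs (each point counts its neighbors), and defaults to planar Euclidean distance for brevity; all three choices, along with the function name `ripley_k_estimate`, are assumptions for illustration.

```python
import math


def ripley_k_estimate(points, area, time_span, h, t, dist=None):
    """Estimate a space-time Ripley's K value for one word.

    points: list of (time, x, y) tuples; area is R, time_span is T.
    dist: optional spatial distance function on two points; defaults to
          planar Euclidean distance (a great-circle distance could be passed).
    """
    if dist is None:
        def dist(p, q):
            return math.hypot(p[1] - q[1], p[2] - q[2])
    n = len(points)
    if n < 2:
        return 0.0  # no pairs can be formed
    pairs = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            close_in_space = dist(points[i], points[j]) < h
            close_in_time = abs(points[i][0] - points[j][0]) < t
            if close_in_space and close_in_time:
                pairs += 1  # ordered pair (i, j)
    return area * time_span * pairs / (n * n)
```

The double loop is quadratic in the number of occurrences; for frequent words a spatial index would be the natural optimization, which is outside the scope of this sketch.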

As an example of the time span: if the earliest collected tweet was sent at 00:00 on 2020-03-20 and the latest at 00:00 on 2020-03-23, then T = 72 h.
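The time span can be computed directly from the earliest and latest timestamps. The helper name `time_span_hours` is illustrative.

```python
from datetime import datetime


def time_span_hours(timestamps):
    """Return the span T in hours between the earliest and latest timestamps."""
    return (max(timestamps) - min(timestamps)).total_seconds() / 3600


# The example from the text: 2020-03-20 00:00 to 2020-03-23 00:00.
example_span = time_span_hours([
    datetime(2020, 3, 20),
    datetime(2020, 3, 21, 12),
    datetime(2020, 3, 23),
])
```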

Noise words are randomly distributed in both the time dimension and the space dimension, so the expectation of N_w(t, h) is the word density λ_w multiplied by the volume of the cylinder corresponding to the spatiotemporal range:

E[N_w(t,h)]＝λ_wπh²t (4)

It follows that when a word is randomly distributed in space-time, its Ripley's K function value is K_w(t, h) = πh²t, and the word is a noise word. If the word is valuable, it is aggregated in space-time, and its Ripley's K function value satisfies K_w(t, h) > πh²t. Thus, the Ripley's K function value K̂_w(t, h) of a word can be calculated and compared with πh²t to determine whether the word is valuable. More generally, to eliminate the influence of the chosen spatiotemporal parameters t and h on the value of K̂_w(t, h), the invention normalizes K̂_w(t, h) to obtain the normalized value L_w. For a noise word, the normalized value therefore no longer depends on the choice of t and h.

3. optimizing value word recognition using graph regularization

After the normalized Ripley's K function value L_w of a word has been calculated, a threshold l can be set: when the L_w of a word is less than the threshold l, the word is considered a noise word; otherwise it is a value word. The difficulty is that a suitable threshold l is hard to find: if l is too large, many valuable words are misjudged as noise words; if l is too small, many noise words go unidentified, which degrades the text denoising result.

In order to solve the above problem, the present invention uses graph regularization to reduce the effect of the threshold setting on the result. First, two thresholds l₁ and l₂ (l₁ < l₂) are set, with l₁ set to a smaller value and l₂ set to a larger value. If L_w < l₁, the word w is determined to be a noise word; if L_w > l₂, the word w is determined to be valuable. When l₁ ≤ L_w ≤ l₂, it is considered temporarily impossible to determine whether the word w is valuable, because the probability of misjudging a word in this interval is high. When a valuable event occurs, the words related to the event tend to appear together; that is, when most of the words strongly associated with a word are valuable, the word itself is very likely valuable as well. Thus, a word whose value cannot yet be determined can be judged based on the L_v values of the other words associated with it.
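The two-threshold decision can be written as a small three-way classifier. The label strings and the function name `initial_label` are illustrative choices; the comparison rules follow the text.

```python
def initial_label(l_value, l1, l2):
    """Classify a word from its normalized Ripley's K value and thresholds l1 < l2."""
    if l_value < l1:
        return "noise"
    if l_value > l2:
        return "valuable"
    # Between the thresholds the decision is deferred to graph regularization.
    return "undetermined"
```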

First, the conditional probabilities between words are calculated as the association strength between words. When two words often appear together in the same tweet, their conditional probabilities with respect to each other are high, and so is their association strength. The conditional probability that the word w_j occurs given that the word w_i occurs is calculated as follows:

$$P(w_j \mid w_i) = \frac{N(w_i, w_j)}{N(w_i)}$$

wherein N(w_i, w_j) is the number of tweets in which the two words appear together and N(w_i) is the number of tweets in which the word w_i appears. Both counts are taken per tweet: a word or a word pair is counted once for each tweet in which it appears.
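The per-tweet counting can be sketched as follows, assuming each tweet is already tokenized into a list of words. The function name `conditional_probabilities` and the dictionary layout of the result are assumptions made for this example.

```python
from collections import Counter
from itertools import permutations


def conditional_probabilities(tweets):
    """Return {(w_i, w_j): P(w_j | w_i)} estimated from tokenized tweets.

    tweets: iterable of word lists. Counts are per tweet, so a word repeated
    inside one tweet contributes once.
    """
    single = Counter()  # number of tweets containing each word
    joint = Counter()   # number of tweets containing each ordered word pair
    for words in tweets:
        unique = sorted(set(words))
        single.update(unique)
        joint.update(permutations(unique, 2))
    return {(wi, wj): count / single[wi] for (wi, wj), count in joint.items()}
```

Note that the probabilities are not symmetric: a rare word that always accompanies a frequent one yields a probability of 1 in one direction and a small value in the other, which is why the association graph uses directed edges.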

After the conditional probabilities among the words are calculated, the words are taken as the vertices W of a graph and the conditional probabilities between words are taken as the directed edges E, so that an inter-word association graph G = (W, E) can be constructed. Using the word association graph, the words that cannot yet be judged can be judged by a graph regularization method based on the L_w value of the word itself and the L_v values of its neighbors.

Combining the normalized Ripley's K function with graph regularization, FIG. 2 is a flow chart of the complete algorithm for determining whether a word is valuable. In the algorithm, θ₀ is the threshold that determines when the program stops iterating and is typically set to a small value, e.g., 0.01; θ is the sum of the squared differences between the L_w values of the words after and before an iteration. When θ is small, e.g. θ < θ₀ = 0.01, the L_w values of the words are considered to have stabilized and the iteration can stop.

In each iteration, for each word whose attribute is still undetermined, the L_v values of its neighboring words are used to correct its own L_w value. If most of the words strongly associated with it are valuable (their L_v values are large), the update formula in the flow chart increases the word's L_w value, and the word is judged to be a value word (i.e., the updated L_w > l₂); conversely, if most of the words associated with it are noise words, its L_w value decreases and the word is judged to be a noise word (i.e., the updated L_w < l₁). After several iterations, all words with L_w > l₂ are output; these are the valuable words. Through the graph regularization method, the relevance among the words can be effectively exploited, and the accuracy of value word recognition is improved.

The update formula uses the normalized conditional probability of each neighboring word v, which is obtained by dividing the conditional probability P(w | v) of the neighboring word v by the sum of the conditional probabilities of all neighboring words of the word w.
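The normalization step can be sketched as below. The sketch assumes conditional probabilities are stored as a dictionary keyed by (v, w) with value P(w | v), and that the neighbors of w are the words v with such an entry; both the layout and the function name `normalized_neighbor_weights` are assumptions.

```python
def normalized_neighbor_weights(cond_prob, w):
    """Return {v: normalized P(w | v)} over the neighbors v of word w.

    cond_prob: dict mapping (v, w) to P(w | v) (assumed layout).
    """
    incoming = {v: p for (v, target), p in cond_prob.items() if target == w}
    total = sum(incoming.values())
    if total == 0:
        return {}  # w has no neighbors with positive probability
    return {v: p / total for v, p in incoming.items()}
```

After normalization the weights of a word's neighbors sum to 1, so a weighted combination of neighbor values stays on the same scale as the values themselves.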

The algorithm shown in FIG. 2 requires two thresholds l₁ and l₂ to be set. It is worth noting that the algorithm of the present invention is insensitive to these two thresholds. That is, l₁ can be set to a very small value (e.g., such that only 1% of words have a normalized Ripley's K function value less than l₁) and l₂ to a very large value (e.g., such that only 1% of words have a normalized Ripley's K function value greater than l₂), and the results differ little from those obtained with carefully chosen l₁ and l₂. The reason is the robustness of graph regularization: only a small set of high-value words is needed at the start. Through continued iteration, words lying between l₁ and l₂ that are strongly associated with them are identified and then serve as high-value words in the next iteration, forming a chain reaction, so that all high-value words are identified after several iterations.
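The iteration can be sketched as follows. The exact update formula is not reproduced in this text, so the sketch assumes one plausible rule: an undetermined word's value is blended with the weighted average of its neighbors' values, with blending factor `alpha`. That rule, `alpha`, `max_iter`, the synchronous update order and the function name `graph_regularize` are all assumptions; only the thresholds, the θ stopping criterion and the final output rule follow the description.

```python
def graph_regularize(values, weights, l1, l2, theta0=0.01, alpha=0.5, max_iter=100):
    """Iteratively refine undetermined words and return the set of valuable words.

    values: {word: normalized Ripley's K value}.
    weights: {word: {neighbor: normalized conditional probability}}; every
             neighbor is expected to have an entry in values.
    """
    values = dict(values)  # work on a copy
    for _ in range(max_iter):
        theta = 0.0
        updated = {}
        for w, lw in values.items():
            if lw < l1 or lw > l2:
                continue  # already decided, no longer updated
            nbrs = weights.get(w, {})
            if not nbrs:
                continue  # no neighbors to learn from
            neighbor_avg = sum(p * values[v] for v, p in nbrs.items())
            new_value = (1 - alpha) * lw + alpha * neighbor_avg  # assumed rule
            theta += (lw - new_value) ** 2
            updated[w] = new_value
        values.update(updated)  # synchronous update after the full pass
        if theta <= theta0:
            break
    return {w for w, lw in values.items() if lw > l2}
```

In a small example where "tokyo" (undetermined) is linked only to the high-value word "quake", its value is pulled above l₂ and it joins the output, while a word linked only to a noise word drifts below l₁ and is dropped.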

Through the graph regularization method, the errors caused by setting a single threshold l can be avoided and word misjudgment is reduced. When a tweet does not contain any valuable word, the tweet is considered a noise tweet. In practice, the judgment of whether a word is valuable must keep pace with time: a word may not have been valuable at an earlier time but may be valuable now, so the algorithm shown in FIG. 2 needs to be run once in every time period.
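The final filtering rule is a simple membership test. The sketch assumes tokenized tweets and a set of valuable words for the current time period; the function name `remove_noise_tweets` is illustrative.

```python
def remove_noise_tweets(tweets, valuable_words):
    """Keep only tweets that contain at least one valuable word.

    tweets: list of word lists; valuable_words: set of words judged valuable
    for the current time period.
    """
    return [words for words in tweets if any(w in valuable_words for w in words)]
```

Since the set of valuable words changes from one time period to the next, the filter would be reapplied with a fresh set each period.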

In order to verify the accuracy of measuring word value with the normalized Ripley's K function and the effectiveness of correcting word misjudgment with graph regularization, 20 million (2000w) tweets published from 2020-03-26 to 2020-03-31 were collected. The tweets were clustered by text, keywords of the clusters were entered into Google News for retrieval, the clusters belonging to an event were identified and labeled manually, and 54 different events were obtained. With the 54 valuable events contained in the 20 million tweets, the denoising capability of the tweet denoising algorithm can be measured under the premise that all event tweets are retained.

The 20 million tweets contain 150,000 (15w) distinct words. Words that occur rarely are often misspellings or irregular expressions, so words occurring fewer than 100 times are removed and directly judged to be noise words. For the remaining 11355 words, the normalized Ripley's K function value of each word is calculated and the values are sorted from small to large; the values corresponding to the first 30% and the last 30% of the words are selected as the limits l₁ and l₂. With the words sorted by Ripley's K function value from large to small, FIG. 3 shows the coverage of the 54 events achieved by different numbers of words, without and with regularization.
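Choosing l₁ and l₂ from the sorted values can be sketched as below. The exact index convention is not stated in the text, so the sketch assumes that l₁ is the smallest value above the lowest `fraction` of words and l₂ the largest value below the highest `fraction`; the function name `percentile_thresholds` is illustrative.

```python
def percentile_thresholds(values, fraction=0.3):
    """Pick (l1, l2) so that roughly `fraction` of values lie below l1 and above l2."""
    if not 0 < fraction < 0.5:
        raise ValueError("fraction must be between 0 and 0.5")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of words in each tail
    if k == 0:
        raise ValueError("too few values for the requested fraction")
    l1 = ordered[k]        # the k smallest values fall below l1 (if distinct)
    l2 = ordered[-k - 1]   # the k largest values fall above l2 (if distinct)
    return l1, l2
```

With 20 distinct values and a fraction of 0.25, five values fall below l₁ and five above l₂, leaving the middle ten undetermined.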

FIG. 3 shows that without graph regularization, the 8000 words with the largest normalized Ripley's K function values are required to cover all events in the current tweets; filtering the 20 million tweets with these 8000 words removes 53.63% of the tweets. With graph regularization, only 3980 words are needed to cover all events, and 74.2% of the tweets can then be filtered out. The experimental results show that the normalized Ripley's K function can denoise tweets while retaining all valuable information, and that combining it with the graph regularization method further improves the denoising rate.

It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims

1. A social media text denoising method based on space-time burst characteristics is characterized by comprising the following steps:

2. The method for denoising social media text based on spatiotemporal burst characteristics as claimed in claim 1, wherein in step S1 the spatiotemporal information is expressed as [time_i, longitude_i, latitude_i], where time_i represents the time information of the i-th tweet, longitude_i represents the longitude information of the i-th tweet, and latitude_i represents the latitude information of the i-th tweet.

3. The method for denoising social media text based on spatiotemporal burst characteristics according to claim 2, wherein step S1 is to obtain the latitude and longitude information of the words in the text according to the latitude and longitude information of the geographic entities by identifying the geographic entities words in the text.

4. The method for denoising social media text based on spatiotemporal burst characteristics as claimed in claim 2, wherein step S2 measures the aggregation degree of the points in the three-dimensional space through the Ripley's K function, which is expressed as:

$$K_w(t, h) = \lambda_w^{-1} E[N_w(t, h)]$$

wherein K_w(t, h) represents the Ripley's K function; λ_w represents the density of the word w in the three-dimensional space; N_w(t, h) represents the number of other points of the word w in the region around a given three-dimensional point of the word w, with h the spatial radius and t the temporal radius; and E[N_w(t, h)] represents the expectation of N_w(t, h).

5. The method for denoising social media text based on spatiotemporal burst characteristics as claimed in claim 4, wherein step S2 calculates the value of the Ripley's K function of the word w by the following formula:

$$\hat{K}_w(t, h) = \frac{RT}{N_w^2}\, N_w\big(h_{i,j} < h,\ |t_i - t_j| < t\big)$$

wherein K̂_w(t, h) represents the estimated value of the Ripley's K function; R is the area of the space in which the tweets appear; T is the time span over which the tweets were collected; N_w is the number of occurrences of the word w in the space-time region RT; and N_w(h_{i,j} < h, |t_i − t_j| < t) represents the number of point pairs, among the three-dimensional coordinate points corresponding to the space-time positions where the word w appears, whose spatial distance is less than h and whose time interval is less than t.

6. The method for denoising social media text based on spatiotemporal burst characteristics as claimed in claim 5, wherein step S2 further comprises normalizing K̂_w(t, h), wherein L_w denotes the normalized K̂_w(t, h).

7. The method for denoising social media text based on spatiotemporal burst feature of claim 6, wherein step S3 specifically comprises the following sub-steps:

S31, setting two thresholds l₁ and l₂ with l₁ < l₂; if L_w < l₁, the word w is determined to be a noise word; if L_w > l₂, the word w is determined to be valuable; otherwise, step S32 is executed;

S34, using the word association graph G = (W, E), judging the noise words by a graph regularization method based on the L_w value of the word itself and the L_v values of its neighboring words v.

8. The method for denoising social media text based on spatiotemporal burst characteristics as claimed in claim 7, wherein step S34 is implemented as follows: for each word w with l₁ ≤ L_w ≤ l₂, the word's own L_w value is updated according to the L_v values of its neighboring words v, using the normalized conditional probability of each neighboring word v; the updated L'_w is then assigned to L_w; through iteration, words with L_w < l₁ or L_w > l₂ are obtained, and the words with L_w > l₂ are finally output.

9. The method for denoising social media text based on spatiotemporal burst characteristics as claimed in claim 8, wherein the conditional probability is normalized as follows: the conditional probability P(w | v) of the neighboring word v is divided by the sum of the conditional probabilities of all neighboring words of the word w to obtain the normalized result.

10. The method for denoising social media text based on spatiotemporal burst features as claimed in claim 8, further comprising setting a condition for stopping iteration, specifically: setting a threshold value theta₀Setting the initial value of theta to be 0, and updating the value of theta after each iteration to be the value before the iteration

Updated after the iteration