CN113822048A - Social media text denoising method based on space-time burst characteristics - Google Patents

Info

Publication number
CN113822048A
Authority
CN
China
Prior art keywords
word
words
time
space
social media
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111086719.8A
Other languages
Chinese (zh)
Other versions
CN113822048B (en
Inventor
费高雷
程勇
胡光岷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202111086719.8A priority Critical patent/CN113822048B/en
Publication of CN113822048A publication Critical patent/CN113822048A/en
Application granted granted Critical
Publication of CN113822048B publication Critical patent/CN113822048B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a social media text denoising method based on space-time burst characteristics, belongs to the field of data processing, and aims to solve the problem of poor text classification effect in the prior art. Meanwhile, in order to reduce the influence of the Ripley's K function threshold l of the words on the result and reduce word misjudgment, a graph regularization algorithm is introduced, and the accuracy of word validity judgment is improved by fusing relevance information among the words.

Description

Social media text denoising method based on space-time burst characteristics
Technical Field
The invention belongs to the field of data processing, and particularly relates to a text denoising technology.
Background
With the widespread use of social media, billions of text messages are published every day on platforms such as Twitter, Facebook, Instagram and microblogs, and these messages contain the most comprehensive and most timely information. Extracting and analyzing this information enables many valuable applications.
Text is one of the important forms for content expression in social media by users, so social media text data contains a great deal of valuable information, and the data is also input for many social media data mining tasks. However, due to the openness of social media, most of the text information in the social media is description of personal life and personal emotion, and the text usually does not contain valuable information.
The social media text denoising aims to identify and retain texts related to events, topics and the like from the texts, and the texts can be used as input of various social media mining tasks and are valuable text information; otherwise, the text is useless text (called as 'noise text') and needs to be removed. In social media, the vast majority of textual information is noisy text. Therefore, text denoising is required before using information on social media, and the influence of noise texts on subsequent tasks is reduced. By removing useless data in the social media, the influence of noise on various models can be reduced, the data processing amount can be greatly reduced, and the data processing rate of each task based on the social media is greatly improved under the condition that the result is not deteriorated.
The existing social media text denoising problem is essentially a text classification problem, namely judging whether a text is valuable. There are generally two types of approaches to this problem. The first is supervised learning based on model training: a tweet data set is built, each tweet is manually labeled according to whether it contains valuable information, and a machine learning model (SVM, GBDT, XGBoost, etc.) or a deep learning model (C-LSTM, TextCNN, etc.) is trained on the labeled data to classify new input. Such methods require a large amount of manual labeling, and they cannot keep pace with time: a text that is noise now will not necessarily remain noise later.
The second is an unsupervised approach, which measures the value of words by analyzing their time-series characteristics in the text stream, and then judges whether a text contains useful information. Whether a word has value at the current moment can be measured by judging whether the word belongs to some event or topic in the current time period, i.e. whether the word is part of the expression of a current event. If the word belongs to an event at the current time, the word is valuable. When an event occurs, the frequency of the words corresponding to the event increases rapidly, so the frequency of a word in each time period can be counted in real time to judge whether its count has surged in the current time period. Thus, the question of whether a word is valuable becomes the question of whether the time series of the word's occurrences is bursty at the current time. Whether a sequence is bursty at a given time is typically determined using the z-score. This approach needs no labeling and keeps pace with time, but it is not accurate enough: it considers only the time-series characteristics of words and ignores both their geographic location characteristics and the associations between words.
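The z-score test mentioned above can be sketched as follows. This is a minimal illustration, not the prior-art implementation: the function name `is_bursty`, the threshold of 3.0 and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def is_bursty(counts, z_threshold=3.0):
    """Return True if the last count is a burst relative to the earlier counts.

    counts: per-period occurrence counts of one word, oldest first.
    z_threshold: hypothetical cut-off; the source does not give a value.
    """
    history, current = counts[:-1], counts[-1]
    if len(history) < 2:
        return False  # not enough history to estimate a baseline
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Flat history: any increase counts as a burst in this sketch.
        return current > mu
    return (current - mu) / sigma > z_threshold


hourly_counts = [4, 5, 3, 6, 4, 5, 40]
print(is_bursty(hourly_counts))
```

A word whose count in the current period sits far above its historical mean is flagged, which is exactly the time-only criterion whose limits the paragraph above describes.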
The existing social media text denoising method generally comprises the steps of giving a training set by a user, extracting text features, constructing a classification model by a supervised learning method, and classifying texts. The effectiveness of these methods depends strongly on the number and quality of training sets, while scalability is poor (classification for one task is difficult to extend to other tasks). In practice, the labeling of the training set generally needs to be completed manually, so that it is very difficult to obtain a large number of training sets with high quality; meanwhile, since social media text data has the characteristics of strong sparsity, irregular expression and the like, the classification features of a plurality of texts are difficult to accurately extract, and the classification effect of the existing method is poor.
Disclosure of Invention
To solve the above technical problems, the invention provides a social media text denoising method based on space-time burst characteristics. The method measures the value of words comprehensively by extracting the time and geographic location information of each word, and corrects misjudgments through the associations between words by building an inter-word association graph. This reduces the sensitivity of noise word recognition to the threshold and improves the accuracy of value word recognition. The method is an optimized extension of the second (unsupervised) approach described above.
The technical scheme adopted by the invention is as follows: a social media text denoising method based on space-time burst characteristics comprises the following steps:
S1. According to the sending time and sending place of the social media text, collect the space-time information of each word and represent it as points in a three-dimensional space;
S2. Measure the aggregation degree of the points in the three-dimensional space through the Ripley's K function;
S3. Introduce the conditional probability between words to build a word association graph, and judge whether each word is a noise word by combining the aggregation degree with a graph regularization method.
In step S1, the space-time information is expressed as $[time_i, longitude_i, latitude_i]$, where $time_i$ denotes the time information of the i-th tweet, $longitude_i$ denotes its longitude information, and $latitude_i$ denotes its latitude information.
In step S1, the geographic entity words in the text are recognized, and the longitude and latitude information of the words in the text is obtained from the longitude and latitude of those geographic entities.
In step S2, the aggregation degree of the points in the three-dimensional space is measured through the Ripley's K function, which is expressed as:
$$K_w(t,h) = \lambda_w^{-1} E[N_w(t,h)]$$
where $K_w(t,h)$ denotes the Ripley's K function; $\lambda_w$ denotes the density of the word w in the three-dimensional space; $N_w(t,h)$ denotes the number of other points of the word w in the region around a given three-dimensional point of w, with h the spatial radius and t the temporal radius; and $E[N_w(t,h)]$ denotes the expectation of $N_w(t,h)$.
In step S2, the value of the Ripley's K function of the word w is calculated by the following formula:
$$\hat{K}_w(t,h) = \frac{RT}{N_w^2} N_w(h_{i,j} < h, |t_i - t_j| < t)$$
where $\hat{K}_w(t,h)$ denotes the value of the Ripley's K function, R is the area of the space in which the tweets appear, T is the time span over which the tweets were collected, and $N_w$ is the number of occurrences of the word w in the space-time RT; $N_w(h_{i,j} < h, |t_i - t_j| < t)$ denotes the number of point pairs, among the three-dimensional coordinate points corresponding to the space-time positions where the word w appears, whose spatial distance is less than h and whose time interval is less than t.
Step S2 further includes normalizing $\hat{K}_w(t,h)$:
[Equation image RE-GDA0003302743900000035]
where $L_w$ denotes the normalized $\hat{K}_w(t,h)$.
Step S3 specifically includes the following substeps:
S31. Set two thresholds $l_1$ and $l_2$ with $l_1 < l_2$. If $L_w < l_1$, the word w is determined to be a noise word; if $L_w > l_2$, the word w is determined to be valuable; otherwise, execute step S32;
S32. Calculate the conditional probability that word $w_j$ occurs when word $w_i$ occurs, then execute step S33;
S33. Take each word as a vertex W of the graph and the conditional probabilities of the words' co-occurrence as directed edges E, and construct the inter-word association graph G = (W, E);
S34. Using the word association graph G = (W, E), judge the noise words by a graph regularization method based on each word's own $L_w$ value and the $L_v$ values of its neighbours.
Step S34 is implemented as follows: for each word with $l_1 \le L_w \le l_2$, the word's own $L_w$ value is updated according to the $L_v$ values of its neighbours:
[Equation image RE-GDA00033027439000000315]
The updated $L'_w$ is then assigned to $L_w$. The iteration continues until each word satisfies $L_w < l_1$ or $L_w > l_2$, and the words with $L_w > l_2$ are finally output.
Here $\tilde{P}(w|v)$ denotes the normalized conditional probability of the neighbour word v. Specifically, the conditional probability $P(w|v)$ of the neighbour word v is divided by the sum of the conditional probabilities of all neighbour words of the word w, giving the normalized result $\tilde{P}(w|v) = P(w|v) / \sum_{u} P(w|u)$, where the sum runs over all neighbours u of w.
The method further includes setting a condition for stopping the iteration. Specifically, a threshold $\theta_0$ is set and θ is initialized to 0. After each iteration, θ is updated to the square of the difference between the value $L_w$ before the iteration and the value $L'_w$ after the iteration, plus the value of θ before the update, i.e. $\theta \leftarrow \theta + (L_w - L'_w)^2$. If the θ obtained after the iteration is less than or equal to $\theta_0$, the iteration stops.
The invention has the following beneficial effects: the tweet denoising method based on spatio-temporal information can identify noise tweets quickly and accurately without data labeling. The method exploits the fact that valuable text information is aggregated in space and time, and constructs a space-time Ripley's K function to describe the space-time burstiness of the words in the text, so that words carrying valuable information are identified and texts containing no such words are removed. The method provided by the invention is an unsupervised noise text recognition method, which can effectively recognize the noise information in the text without depending on a training set, and has the following advantages:
1) the Ripley's K function based on spatio-temporal information can measure whether data points are aggregated in space-time, and is suitable for similar problems of evaluating the spatio-temporal aggregation of data points;
2) establishing an association graph of the words by using the conditional probability of the words, mining association information among the words by using a graph regularization algorithm, reducing the sensitivity of value word recognition to threshold setting, and improving the accuracy of noise text recognition;
3) the method filters the text information based on the value words, is an unsupervised noise text recognition method, does not need to manually mark a training set, and therefore has better robustness and usability.
Drawings
FIG. 1 is a block diagram of an embodiment of the present invention;
FIG. 2 is a flow chart of a graph regularization algorithm of the present invention;
FIG. 3 shows the coverage of the 54 events achieved with different numbers of words, without and with graph regularization, according to an embodiment of the present invention.
Detailed Description
The invention provides a social media text denoising method based on space-time burst characteristics. The method aims at the characteristic that value information (texts related to events, subjects and the like) in social media has aggregative property in time and space, and models the space-time distribution of each word in the texts from the perspective of space-time burstiness, so that words related to the events and the subjects are identified from mass social media data. The value text and the noise text are distinguished according to whether the words in the text have space-time aggregation, and usability and effectiveness of the social media text denoising method can be improved.
The method aims at judging whether words in the text in the current time window have aggregative property in space-time distribution or not, and identifies the words with aggregative property, namely the words with value. And if a certain text does not contain any value word, judging the text to be a noise text and directly removing the noise text. Therefore, the technical scheme of the invention is mainly divided into three parts:
firstly, according to the sending time and the sending place of the social media text, counting the time-space information of each word, and representing the time-space information of the word as a point in a three-dimensional space;
secondly, measuring the aggregation degree of points in the three-dimensional space through a Ripley's K function, and taking the aggregation degree as an evaluation criterion of the value of the word;
then, a word association graph is established by introducing the conditional probability among words, a graph regularization method is designed to reduce the sensitivity of value word recognition to a decision threshold and improve the accuracy and robustness of value word judgment. As shown in fig. 1, the method of the present invention comprises the steps of:
1. spatiotemporal information extraction of words
For a word w, the sending time and the longitude and latitude of the geographic position of each text containing the word are collected. Suppose the word w appears in n texts and the spatio-temporal information of the i-th text is $[time_i, longitude_i, latitude_i]$; then the spatio-temporal information of the word can be expressed as $w_{info} = \{[time_i, longitude_i, latitude_i] \mid 1 \le i \le n\}$. The time information $time_i$ can be regarded as one dimension. The longitude and latitude information ($longitude_i$, $latitude_i$) corresponds to a position on the spherical surface of the earth; since only the earth's surface is considered, the longitude and latitude can be regarded as the coordinates of a point on a two-dimensional plane. The spatio-temporal information of each word can therefore be regarded as a set of points in a three-dimensional space.
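The collection of per-word space-time points can be sketched as below. The input layout (a tuple of time in hours, longitude, latitude and a token list) and the function name `build_word_points` are assumptions made for illustration; the patent does not prescribe a data format.

```python
from collections import defaultdict


def build_word_points(tweets):
    """Collect the space-time points of every word.

    tweets: iterable of (time_hours, longitude, latitude, tokens).
    Returns {word: [(time_hours, longitude, latitude), ...]}.
    """
    points = defaultdict(list)
    for time_i, lon_i, lat_i, tokens in tweets:
        # A word contributes one point per tweet, however often it repeats there.
        for word in set(tokens):
            points[word].append((time_i, lon_i, lat_i))
    return dict(points)


sample = [
    (0.0, -74.0, 40.7, ["fire", "fire", "brooklyn"]),
    (1.5, -73.9, 40.6, ["fire"]),
]
word_points = build_word_points(sample)
print(word_points["fire"])
```

Each list in the result is the set $w_{info}$ of one word, ready to be treated as points in the three-dimensional space.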
To measure the distance between points in the three-dimensional space corresponding to a word, the distance in the time dimension and the distance in the space dimension are calculated separately. The distance in the time dimension is the time difference, and the distance in the space dimension is the distance between two points on the earth's surface computed from their longitude and latitude. Given the latitude and longitude coordinates A(LatA, LonA) and B(LatB, LonB) of two points on the earth's surface, the distance between them is:
$$d(A,B) = 2 R_1 \arcsin\left(\sqrt{\sin^2\alpha + \cos(\mathrm{LatA})\cos(\mathrm{LatB})\sin^2\beta}\right)$$
where $\alpha = (\mathrm{LatA} - \mathrm{LatB})/2$, $\beta = (\mathrm{LonA} - \mathrm{LonB})/2$, and $R_1$ denotes the mean radius of the earth, $R_1 = 6371$ km.
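The surface distance can be computed with the haversine form, using the same α, β and $R_1$ as in the text. The function name and the degree-based interface are assumptions made for this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # R_1, the mean radius of the earth


def surface_distance_km(lat_a, lon_a, lat_b, lon_b):
    """Great-circle distance in km between A and B, given in degrees."""
    lat_a, lon_a, lat_b, lon_b = map(math.radians, (lat_a, lon_a, lat_b, lon_b))
    alpha = (lat_a - lat_b) / 2
    beta = (lon_a - lon_b) / 2
    inner = (math.sin(alpha) ** 2
             + math.cos(lat_a) * math.cos(lat_b) * math.sin(beta) ** 2)
    # Clamp against rounding slightly above 1 before taking the arcsine.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, inner)))


print(surface_distance_km(40.7128, -74.0060, 34.0522, -118.2437))
```

The time-dimension distance needs no helper: it is simply the absolute difference of the two timestamps.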
In practice, every text on social media carries its sending time, but only a small fraction carries longitude and latitude information. It is therefore necessary to use the toolkits provided by NLTK, StanfordNLP and spaCy together to identify which words are entity words representing geographic locations, resolve these geographic entity words through the geopy library to obtain their longitude and latitude, and use the longitude and latitude of the geographic entity words in a text as the longitude and latitude information of that text. In a concrete implementation, a table of geographic entities and their longitude and latitude is generally maintained. For a new text, the geographic entity words are identified first; if the longitude and latitude of an entity word are already stored in the table, they are used directly; otherwise the entity word is resolved and the result is stored in the table. In a specific experiment on 5 million tweets, 2 million tweets contained one or more geographic entity words, which means that the longitude and latitude coordinates of more than 40% of the tweets can be inferred in this way.
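The lookup table described above can be sketched as a small cache in front of a geocoder. The class `GazetteerCache`, the function `locate_text` and the `resolver` callable are hypothetical names; in a real system the resolver would wrap a geocoding service such as one from geopy, and the entity words would come from a named entity recognizer. Taking the first resolvable entity when a text has several is also an assumption, since the source does not say how multiple entities are combined.

```python
class GazetteerCache:
    """Table of geographic entity words and their (latitude, longitude)."""

    def __init__(self, resolver):
        # resolver: callable mapping a place name to (lat, lon) or None.
        self._resolver = resolver
        self._table = {}

    def lookup(self, place_name):
        key = place_name.strip().lower()
        if key not in self._table:
            # Not in the table yet: resolve once and store the result.
            self._table[key] = self._resolver(place_name)
        return self._table[key]


def locate_text(entity_words, cache):
    """Return the coordinates of the first resolvable entity word, else None."""
    for word in entity_words:
        coords = cache.lookup(word)
        if coords is not None:
            return coords
    return None


def demo_resolver(name):
    # Stand-in for a real geocoder, used only for this example.
    return {"new york": (40.7128, -74.0060)}.get(name.lower())


cache = GazetteerCache(demo_resolver)
print(locate_text(["Gotham", "New York"], cache))
```

Caching failed lookups as well as successful ones keeps repeated unknown names from hitting the geocoder again.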
2. Word space-time burst feature extraction based on Ripley's K function
At a particular time and place, things that a large number of people are discussing are considered valuable, as are the keywords corresponding to them. This can be modeled as follows: when a word is aggregated in the three-dimensional space-time, the word is being discussed heavily at some place at the current moment, so it corresponds to some event and has value.
The Ripley's K function can be used to measure whether a set of points on a two-dimensional plane is clustered, and the two-dimensional Ripley's K function can be extended to three dimensions. For a word w, the Ripley's K function based on three-dimensional spatio-temporal information is defined as follows:
$$K_w(t,h) = \lambda_w^{-1} E[N_w(t,h)]$$
where $\lambda_w$ is the average number of occurrences of the word in each space-time unit, equivalent to the density of the word w in the three-dimensional space; $N_w(t,h)$ denotes the number of other points of the word w in the region around a given three-dimensional point of w with spatial radius h and temporal radius t (the region is a cylinder with base radius h and height t); and $E[N_w(t,h)]$ is its expectation.
For a given spatial distance h and time interval t, the value of the Ripley's K function can be estimated by the following equation:
$$\hat{K}_w(t,h) = \frac{RT}{N_w^2} N_w(h_{i,j} < h, |t_i - t_j| < t)$$
where R is the area of the space in which the tweets appear (if the collected tweets all come from New York, R is the area of New York); T is the time span over which the tweets were collected; $N_w$ is the number of occurrences of the word w in the space-time RT; and $N_w(h_{i,j} < h, |t_i - t_j| < t)$ denotes the number of point pairs, among the three-dimensional coordinate points corresponding to the space-time positions where the word w appears, whose spatial distance is less than h and whose time interval is less than t. Intuitively, if a word appears repeatedly within a small region of space-time, its Ripley's K function value is large.
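A direct way to evaluate this estimate is a double loop over the points of one word. The sketch below makes several assumptions that the source does not state: pairs are counted as ordered pairs (i, j) with i different from j, time is in hours, distances are in kilometres, and the function names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0


def surface_distance_km(lat_a, lon_a, lat_b, lon_b):
    """Haversine distance in km between two points given in degrees."""
    lat_a, lon_a, lat_b, lon_b = map(math.radians, (lat_a, lon_a, lat_b, lon_b))
    inner = (math.sin((lat_a - lat_b) / 2) ** 2
             + math.cos(lat_a) * math.cos(lat_b) * math.sin((lon_a - lon_b) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, inner)))


def ripley_k_estimate(points, h_km, t_hours, area_km2, span_hours):
    """Estimate the space-time Ripley's K value of one word.

    points: list of (time_hours, longitude, latitude) for the word.
    area_km2, span_hours: R and T of the collection window.
    """
    n = len(points)
    if n < 2:
        return 0.0
    close_pairs = 0
    for i, (time_i, lon_i, lat_i) in enumerate(points):
        for j, (time_j, lon_j, lat_j) in enumerate(points):
            if i == j:
                continue
            # Check the cheap time condition first, then the spatial one.
            if (abs(time_i - time_j) < t_hours
                    and surface_distance_km(lat_i, lon_i, lat_j, lon_j) < h_km):
                close_pairs += 1
    return area_km2 * span_hours * close_pairs / (n * n)


pts = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (50.0, 10.0, 10.0)]
print(ripley_k_estimate(pts, h_km=1.0, t_hours=1.0, area_km2=100.0, span_hours=72.0))
```

The loop is quadratic in the number of occurrences of the word; for frequent words a spatial index or time-sorted sweep would be a natural optimization, though the source does not discuss one.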
As an example of the time span: if the earliest collected tweet was sent at 00:00 on 2020-03-20 and the latest at 00:00 on 2020-03-23, then T = 72 h.
Noise words are randomly distributed in the time and space dimensions, so the expectation of $N_w(t,h)$ is the word density $\lambda_w$ multiplied by the volume of the cylinder corresponding to the spatio-temporal range:
$$E[N_w(t,h)] = \lambda_w \pi h^2 t \quad (4)$$
It follows that when a word is randomly distributed in space-time, its Ripley's K function value is $K_w(t,h) = \pi h^2 t$, and the word is a noise word. If the word is valuable, it is aggregated in space-time and its Ripley's K function value satisfies $K_w(t,h) > \pi h^2 t$. Thus the Ripley's K function value $\hat{K}_w(t,h)$ of a word can be calculated, and whether the word is valuable can be determined by comparing $\hat{K}_w(t,h)$ with $\pi h^2 t$. More generally, to eliminate the influence of the chosen spatio-temporal threshold parameters t and h on the value of $\hat{K}_w(t,h)$, the invention normalizes $\hat{K}_w(t,h)$ as follows:
[Equation image RE-GDA0003302743900000075]
Thus, if a word is a noise word, [Equation image RE-GDA0003302743900000076].
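The normalization formula itself is an equation image that is not reproduced in this text. Purely as an illustrative assumption, the sketch below divides the estimated value by $\pi h^2 t$, the value expected for a randomly distributed word, so that a noise word scores about 1 and an aggregated word scores above 1. The patent's actual normalization may differ.

```python
import math


def normalized_k(k_hat, h_km, t_hours):
    """Assumed normalization: estimated K divided by its value under randomness.

    This ratio is NOT taken from the patent text; it is one simple way to
    remove the dependence on the chosen h and t.
    """
    if h_km <= 0 or t_hours <= 0:
        raise ValueError("h_km and t_hours must be positive")
    return k_hat / (math.pi * h_km ** 2 * t_hours)


# A word whose estimate equals pi * h^2 * t looks random under this assumption.
print(normalized_k(math.pi * 2.0 ** 2 * 3.0, h_km=2.0, t_hours=3.0))
```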
3. Optimizing value word recognition using graph regularization
After the normalized Ripley's K function value $L_w$ of a word is calculated, a threshold l is set: when the $L_w$ value of a word is less than l, the word is considered a noise word; otherwise it is a value word. The difficulty lies in finding a suitable threshold l: if l is too large, many valuable words are misjudged as noise words, and if l is too small, many noise words go unidentified, which degrades the text denoising effect.
To solve this problem, the invention uses graph regularization to reduce the influence of the threshold l on the result. First, two thresholds $l_1$ and $l_2$ ($l_1 < l_2$) are set, with $l_1$ set to a smaller value and $l_2$ to a larger value. If $L_w < l_1$, the word w is determined to be a noise word; if $L_w > l_2$, the word w is determined to be valuable. When $l_1 \le L_w \le l_2$, whether the word w is valuable is considered temporarily undeterminable, because the probability of misjudging a word in this interval is high. When a valuable event occurs, the words related to it tend to appear together; that is, when most of the words strongly associated with a word are valuable, that word is very likely valuable too. Thus, a word whose value cannot yet be determined can be judged from the $L_v$ values of the other words associated with it.
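The two-threshold decision can be written as a three-way classification. The function name and the string labels are illustrative choices, and treating the boundary values themselves as undetermined is an assumption.

```python
def classify_word(l_value, l1, l2):
    """Classify a word by its normalized Ripley's K value.

    Returns "noise" below l1, "value" above l2, and "undetermined" otherwise
    (boundary values fall in the undetermined band in this sketch).
    """
    if l1 >= l2:
        raise ValueError("l1 must be smaller than l2")
    if l_value < l1:
        return "noise"
    if l_value > l2:
        return "value"
    return "undetermined"


for score in (0.4, 1.8, 5.2):
    print(score, classify_word(score, l1=1.0, l2=3.0))
```

Only the words labelled undetermined need the graph-based correction; the others are decided immediately.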
First, the conditional probability of co-occurrence is calculated for each pair of words as the association strength between them. When two words often appear together in the same tweet, the conditional probability of one appearing given the other is high, and so is the association strength. The conditional probability that word $w_j$ occurs when word $w_i$ occurs is calculated as:
$$P(w_j \mid w_i) = \frac{\mathrm{count}(w_i, w_j)}{\mathrm{count}(w_i)}$$
where $\mathrm{count}(w_i, w_j)$ is the number of tweets in which both words appear and $\mathrm{count}(w_i)$ is the number of tweets in which the word $w_i$ appears. When the conditional probability for a pair of words is calculated, only these two counts are needed.
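The counting described above can be sketched with two counters over tokenized tweets. The function name and the dictionary layout keyed by `(w_i, w_j)` are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def conditional_probabilities(tweets):
    """Compute P(w_j | w_i) from tweet-level co-occurrence.

    tweets: iterable of token lists.
    Returns {(w_i, w_j): P(w_j | w_i)} for every ordered pair that co-occurs.
    """
    word_count = Counter()
    pair_count = Counter()
    for tokens in tweets:
        vocab = set(tokens)  # presence per tweet, not repetitions within it
        word_count.update(vocab)
        pair_count.update(permutations(sorted(vocab), 2))
    return {(wi, wj): c / word_count[wi] for (wi, wj), c in pair_count.items()}


probs = conditional_probabilities([
    ["fire", "smoke"],
    ["fire"],
    ["fire", "smoke", "rain"],
])
print(probs[("fire", "smoke")], probs[("smoke", "fire")])
```

The result is asymmetric, which is why the association graph built from it has directed edges.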
After the conditional probabilities between words are calculated, the words are taken as the vertices W of a graph and the conditional probabilities of co-occurrence as the directed edges E, so that the inter-word association graph G = (W, E) can be constructed. Using the word association graph, the words that cannot be judged directly can be judged by a graph regularization method based on each word's own $L_w$ value and the $L_v$ values of its neighbours.
Combining the normalized Ripley's K function with graph regularization, FIG. 2 is a flow chart of the complete algorithm for determining whether a word is valuable. In the algorithm, $\theta_0$ is the threshold that determines when the program stops iterating and is typically set to a small value, e.g. 0.01; θ is the sum of the squared differences between the words' $L_w$ values after and before an iteration. When θ is small, e.g. $\theta < \theta_0 = 0.01$, the words' $L_w$ values are considered to have stabilized and the iteration can stop.
In each iteration, for each word whose attribute is still undetermined, the $L_v$ values of its neighbouring words are used to correct its own $L_w$ value. If most of the words strongly associated with the word are valuable (their $L_v$ values are large), the update formula in the flow chart, [Equation image RE-GDA0003302743900000086], increases the word's $L_w$ value, and the word is judged to be a value word (i.e. $L_w > l_2$ after the update). Conversely, if most of the words associated with the word are noise words, the word's $L_w$ value decreases and it is judged to be a noise word (i.e. $L_w < l_1$ after the update). After several iterations, all words with $L_w > l_2$ are output; these are the valuable words. The graph regularization method makes effective use of the associations among words and improves the accuracy of value word recognition.
Here $\tilde{P}(w|v)$ denotes the normalized conditional probability of the neighbouring word v. Specifically, the conditional probability $P(w|v)$ of the neighbouring word v is divided by the sum of the conditional probabilities of all neighbouring words of the word w, giving the normalized result $\tilde{P}(w|v) = P(w|v) / \sum_{u} P(w|u)$, where the sum runs over all neighbours u of w.
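The iteration can be sketched as follows. The exact update formula is an equation image that is not reproduced in this text, so the rule used here is an assumption: the new value is a blend, controlled by a hypothetical parameter `alpha`, of the word's own value and the average of its neighbours' values weighted by the normalized conditional probabilities. The stopping test follows the description: θ accumulates the squared changes within one pass and the loop stops once θ is no greater than $\theta_0$.

```python
def graph_regularize(l_values, cond_prob, l1, l2, alpha=0.5, theta0=0.01, max_iter=100):
    """Refine undetermined words using their neighbours in the association graph.

    l_values: {word: normalized Ripley's K value}.
    cond_prob: {(w, v): P(w | v)} for each neighbour v of w.
    alpha, max_iter: hypothetical parameters, not from the source.
    Returns (set of value words, final values).
    """
    values = dict(l_values)
    neighbours = {}
    for (w, v), p in cond_prob.items():
        if w in values and v in values and p > 0:
            neighbours.setdefault(w, {})[v] = p
    for _ in range(max_iter):
        theta = 0.0
        updated = dict(values)
        for w, l_w in values.items():
            if l_w < l1 or l_w > l2 or w not in neighbours:
                continue  # already decided, or no neighbours to learn from
            total = sum(neighbours[w].values())
            # Weights p / total are the normalized conditional probabilities.
            neighbour_avg = sum(p / total * values[v] for v, p in neighbours[w].items())
            new_l = alpha * l_w + (1 - alpha) * neighbour_avg
            theta += (l_w - new_l) ** 2
            updated[w] = new_l
        values = updated
        if theta <= theta0:
            break
    return {w for w, l_w in values.items() if l_w > l2}, values


scores = {"fire": 5.0, "smoke": 4.5, "evacuate": 2.0, "lunch": 0.2}
edges = {("evacuate", "fire"): 0.6, ("evacuate", "smoke"): 0.4}
value_words, final_scores = graph_regularize(scores, edges, l1=1.0, l2=3.0)
print(sorted(value_words))
```

In the example, "evacuate" starts in the undetermined band and is pulled above $l_2$ by its strongly associated, high-scoring neighbours, which is the chain-reaction behaviour the description relies on.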
In the algorithm shown in FIG. 2, two thresholds $l_1$ and $l_2$ need to be set. It should be noted that the algorithm of the invention is insensitive to these two thresholds. That is, $l_1$ can be set to a very small value (e.g. only 1% of words have a normalized Ripley's K function value less than $l_1$) and $l_2$ to a very large value (e.g. only 1% of words have a normalized Ripley's K function value greater than $l_2$), and the results differ little from those obtained with carefully chosen $l_1$ and $l_2$. The reason is the robustness of graph regularization: only a small set of high-value words is needed, and through continued iteration the words between $l_1$ and $l_2$ that are strongly associated with them are identified; these then serve as high-value words in the next iteration, forming a chain reaction, so that after several iterations all high-value words are identified.
The graph regularization method avoids the errors caused by setting a single threshold l and reduces word misjudgment. When a tweet contains no valuable word, the tweet can be considered a noise tweet. In practice, the judgment of whether a word is valuable must keep pace with time: a word may have had no value at an earlier time but be valuable at the current time, so the algorithm shown in FIG. 2 needs to be run once in every time period.
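The final filtering step follows directly from this rule. The function name and the token-list representation of tweets are assumptions made for the sketch.

```python
def remove_noise_tweets(tweets, value_words):
    """Keep only tweets that contain at least one valuable word.

    tweets: list of token lists; value_words: set of valuable words
    for the current time period.
    """
    value_words = set(value_words)
    return [tokens for tokens in tweets if value_words.intersection(tokens)]


kept = remove_noise_tweets(
    [["big", "fire", "downtown"], ["had", "lunch"], ["smoke", "everywhere"]],
    {"fire", "smoke"},
)
print(kept)
```

Because the set of valuable words changes from one time period to the next, the same tweet text may be kept in one period and removed in another.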
In order to verify the accuracy of measuring word value by using a standardized Ripley's K function and the effectiveness of correcting word misjudgment by using graph regularization, 2000 pieces of published tweets from 26/2020/03 to 31/2020/03 are collected, keywords are input into GoogleNews for retrieval after the tweets are clustered through texts, classes belonging to a certain event are identified, the classes are labeled manually, and 54 different events can be obtained. With the 54 valuable events contained in the 2000w pieces of tweets, we can measure the denoising capability of the tweet denoising algorithm under the premise of reserving all event tweets.
The 20 million tweets contain 150,000 distinct words. Words that occur only a few times are often misspellings or irregular expressions, so words with fewer than 100 occurrences are removed and directly judged to be noise words. For the remaining 11355 words, the normalized Ripley's K function value of each word is calculated, the values are sorted in ascending order, and the values at the boundaries of the lowest 30% and the highest 30% of the words are selected as the thresholds l1 and l2. With the words sorted by Ripley's K function value in descending order, FIG. 3 shows the coverage of the 54 events achieved by different numbers of words, without and with graph regularization.
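The preprocessing described above can be sketched in a few lines. The exact index convention for picking l1 and l2 at the 30% boundaries is an assumption, as are the function names and the integer `percent` parameter.

```python
from collections import Counter


def prune_vocabulary(tokenized_tweets, min_count=100):
    """Keep words that occur at least min_count times; the rest are noise."""
    counts = Counter(tok for tweet in tokenized_tweets for tok in tweet)
    return {w for w, c in counts.items() if c >= min_count}


def quantile_thresholds(scores, percent=30):
    """Pick l1 and l2 at the lowest and highest `percent` of the scores.

    scores: word -> normalized Ripley's K value.
    Assumed convention: l1 is the largest value in the lowest group,
    l2 is the smallest value in the highest group.
    """
    ordered = sorted(scores.values())
    k = len(ordered) * percent // 100
    if k == 0:
        raise ValueError("too few words for the requested percentage")
    return ordered[k - 1], ordered[-k]
```

With ten words scored 0.0, 0.1, ..., 0.9 and `percent=30`, the three lowest and three highest words fall outside the thresholds and the remaining four are left for graph regularization.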
As FIG. 3 shows, without graph regularization, the 8000 words with the largest normalized Ripley's K function values are needed to cover all events; filtering the 20 million tweets with these 8000 words removes 53.63% of the tweets. With graph regularization, only 3980 words are needed to cover all events, and 74.2% of the tweets can be filtered out. The experimental results show that tweet denoising can be accomplished while retaining all valuable information by using the improved Ripley's K function, and that the denoising rate can be further improved by combining it with the graph regularization method.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (10)

1. A social media text denoising method based on space-time burst characteristics is characterized by comprising the following steps:
S1, according to the sending time and the sending place of the social media text, counting the space-time information of each word, and representing the space-time information of the word as a point in a three-dimensional space;
S2, measuring the aggregation degree of the points in the three-dimensional space through a Ripley's K function;
S3, introducing the conditional probability among words to establish a word association graph, and judging whether the words are noise words by combining the aggregation degree through a graph regularization method.
2. The method for denoising social media text based on spatiotemporal burst characteristics as claimed in claim 1, wherein in step S1 the space-time information is expressed as [time_i, longitude_i, latitude_i], where time_i represents the time information of the i-th tweet, longitude_i represents the longitude information of the i-th tweet, and latitude_i represents the latitude information of the i-th tweet.
3. The method for denoising social media text based on spatiotemporal burst characteristics according to claim 2, wherein in step S1, geographic entity words in the text are identified, and the latitude and longitude information of the words in the text is obtained from the latitude and longitude information of the geographic entities.
4. The method for denoising social media text based on spatiotemporal burst characteristics as claimed in claim 2, wherein step S2 measures the aggregation degree of points in the three-dimensional space through the Ripley's K function, and the expression of the aggregation degree is:
K_w(t, h) = λ_w^{-1} E[N_w(t, h)]
wherein K_w(t, h) represents the Ripley's K function; λ_w represents the density of the word w in the three-dimensional space; N_w(t, h) represents the number of other points of the word within a region around a given three-dimensional point of the word w, where h is the spatial radius and t is the time radius; and E[N_w(t, h)] represents the expectation of N_w(t, h).
5. The method for denoising social media text based on spatiotemporal burst characteristics as claimed in claim 4, wherein step S2 calculates the numerical value of the Ripley's K function of the word w by the following formula:
Figure FDA0003265819820000012
wherein Figure FDA0003265819820000013 represents the value of the Ripley's K function; R is the area of the space in which the tweets appear; T is the time span over which the tweets were collected; N_w is the number of occurrences of the word w in the space-time region RT; and N_w(h_{i,j} < h, abs(t_i, t_j) < t) represents the number of pairs of points, among the three-dimensional coordinate points corresponding to the space-time positions where the word w appears, whose spatial distance is less than h and whose time interval is less than t.
6. The method for denoising social media text based on spatiotemporal burst feature of claim 5, wherein step S2 further comprises normalizing Figure FDA0003265819820000021:
Figure FDA0003265819820000022
wherein Figure FDA0003265819820000023 represents the normalized Figure FDA0003265819820000024.
7. The method for denoising social media text based on spatiotemporal burst feature of claim 6, wherein step S3 specifically comprises the following sub-steps:
S31, setting two thresholds l1 and l2, with l1 < l2; if Figure FDA0003265819820000025, the word w is determined to be a noise word; if Figure FDA0003265819820000026, the word w is determined to be valuable; otherwise, executing step S32;
S32, calculating the conditional probability that word w_j occurs given that word w_i occurs, and then executing step S33;
S33, taking each word as a vertex W of the graph and the conditional probabilities of co-occurrence between words as directed edges E, constructing an inter-word association graph G = (W, E);
S34, using the inter-word association graph G = (W, E), determining the noise words by a graph regularization method based on the Figure FDA0003265819820000027 value of the word itself and the Figure FDA0003265819820000028 values of its neighbors.
8. The method for denoising social media text based on spatiotemporal burst feature of claim 7, wherein the step S34 is implemented as follows: for Figure FDA0003265819820000029, according to the Figure FDA00032658198200000210 values of the word's neighbors, the Figure FDA00032658198200000211 value of the word itself is updated:
Figure FDA00032658198200000212
then the updated L'_w is assigned to Figure FDA00032658198200000213; through iteration, the words reach Figure FDA00032658198200000214 or Figure FDA00032658198200000215; and the words with Figure FDA00032658198200000216 are finally output; wherein Figure FDA00032658198200000217 represents the normalized conditional probability of the neighbor word v.
9. The method for denoising social media text based on spatiotemporal burst features as claimed in claim 8, wherein the normalization of the conditional probability is: the conditional probability P(w|v) of the neighbor word v is divided by the sum of the conditional probabilities of all neighbor words of the word w, yielding the normalized result Figure FDA00032658198200000218.
10. The method for denoising social media text based on spatiotemporal burst features as claimed in claim 8, further comprising setting a condition for stopping the iteration, specifically: setting a threshold θ0 and setting the initial value of θ to 0; after each iteration, updating θ to the square of the difference between the Figure FDA00032658198200000219 value before the iteration and the Figure FDA00032658198200000220 value updated after the iteration, plus the value of θ before the iteration; and if the θ obtained after the iteration is less than or equal to θ0, stopping the iteration.
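The pair counting described in claims 4 and 5 can be sketched as follows. This is a hedged illustration, not the claimed formula: the estimator form (space-time volume times the number of ordered close pairs, divided by the squared number of occurrences) is an assumption borrowed from the common Ripley's K estimator, longitude and latitude are treated as planar coordinates for simplicity, and all names are hypothetical.

```python
import math
from itertools import combinations


def ripley_k_space_time(points, h, t, area, span):
    """Estimate a space-time Ripley's K value for one word.

    points: list of (time, longitude, latitude) occurrences of the word
    h, t:   spatial radius and time radius
    area:   area R of the region; span: length T of the collection period
    Assumed estimator: area * span * (ordered close pairs) / n**2.
    """
    n = len(points)
    if n < 2:
        return 0.0
    close_pairs = 0
    for (t1, x1, y1), (t2, x2, y2) in combinations(points, 2):
        # A pair counts only if it is close in space AND close in time.
        if math.hypot(x1 - x2, y1 - y2) < h and abs(t1 - t2) < t:
            close_pairs += 1
    # Each unordered pair corresponds to two ordered pairs (i, j) and (j, i).
    return area * span * (2 * close_pairs) / (n * n)
```

A word whose occurrences cluster in space and time yields a larger value than a word with the same number of occurrences scattered over the region, which is the burst signal the method relies on.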
CN202111086719.8A 2021-09-16 2021-09-16 Social media text denoising method based on space-time burst characteristics Active CN113822048B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111086719.8A CN113822048B (en) 2021-09-16 2021-09-16 Social media text denoising method based on space-time burst characteristics

Publications (2)

Publication Number Publication Date
CN113822048A true CN113822048A (en) 2021-12-21
CN113822048B CN113822048B (en) 2023-03-21

Family

ID=78914748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111086719.8A Active CN113822048B (en) 2021-09-16 2021-09-16 Social media text denoising method based on space-time burst characteristics

Country Status (1)

Country Link
CN (1) CN113822048B (en)

Also Published As

Publication number Publication date
CN113822048B (en) 2023-03-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant