CN113822048A - Social media text denoising method based on space-time burst characteristics - Google Patents
- Publication number
- CN113822048A CN202111086719.8A
- Authority
- CN
- China
- Prior art keywords
- word
- words
- time
- space
- social media
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a social media text denoising method based on space-time burst characteristics. It belongs to the field of data processing and aims to solve the problem of poor text classification performance in the prior art. In addition, to reduce the influence on the result of the threshold l applied to a word's Ripley's K function value, and to reduce word misjudgment, a graph regularization algorithm is introduced, which improves the accuracy of word validity judgment by fusing relevance information among words.
Description
Technical Field
The invention belongs to the field of data processing, and particularly relates to a text denoising technology.
Background
With the widespread use of social media, billions of text messages are published every day on platforms such as Twitter, Facebook, Instagram and microblogging services, and these messages carry some of the most complete and most timely information available. Extracting and analyzing this information enables many valuable applications.
Text is one of the main forms in which users express content on social media, so social media text data contains a great deal of valuable information and serves as the input for many social media data mining tasks. However, because social media is open to everyone, most of its text describes personal life and personal emotion, and such text usually contains no valuable information.
Social media text denoising aims to identify and retain texts related to events, topics and the like. These texts are valuable and can be used as input to various social media mining tasks; all other text is useless (called "noise text") and needs to be removed. In social media, the vast majority of text is noise text. Text denoising is therefore required before information from social media is used, in order to reduce the influence of noise text on subsequent tasks. Removing useless data reduces the influence of noise on downstream models and greatly reduces the amount of data to be processed, so the processing rate of each social-media-based task improves considerably without degrading the results.
The social media text denoising problem is essentially a text classification problem, namely judging whether a text is valuable. There are generally two types of approaches. The first is supervised learning based on model training: a tweet data set is built, each tweet is manually labeled as containing valuable information or not, and a machine learning model (SVM, GBDT, XGBoost, etc.) or a deep learning model (C-LSTM, TextCNN, etc.) is trained on this data to classify new input. Such methods require a large amount of manual labeling, and they cannot "keep pace with the times": a text that is noise now will not necessarily remain noise in the future.
The second is the unsupervised approach, which measures the value of words by analyzing their time-series characteristics in the text stream and then judges whether a text contains useful information. Whether a word has value at the current moment can be measured by judging whether it belongs to some event or topic in the current time period, that is, whether it is part of the expression of a current event. If the word belongs to an event at the current time, it is valuable. When an event occurs, the frequency of the words corresponding to the event rises sharply, so the frequency of each word in each time period can be counted in real time to judge whether its count has surged in the current period. The question of whether a word is valuable thus becomes the question of whether the time series of the word's occurrences is bursty at the current time, which is typically decided with a z-score. This approach needs no labeling and can keep pace with the times, but it is not accurate enough: it considers only the time-series characteristics of words and ignores both their geographic position characteristics and the associations between words.
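The z-score test mentioned above can be sketched as follows. This is a minimal illustration rather than the procedure of any particular prior-art system: the function name `is_bursty`, the use of the population standard deviation, and the threshold of 3.0 are all assumptions made for the example.

```python
from statistics import mean, pstdev

def is_bursty(history, current, z_threshold=3.0):
    """Return True if `current` is unusually high relative to `history`.

    history:     word counts in earlier time periods
    current:     word count in the current time period
    z_threshold: assumed cut-off; not specified in the text
    """
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Flat history: treat any increase over the constant level as a burst.
        return current > mu
    return (current - mu) / sigma > z_threshold
```

For example, with a history of `[10, 12, 11, 9, 10]` a current count of 40 is flagged as a burst, while a count of 11 is not.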
Existing social media text denoising methods generally take a user-supplied training set, extract text features, build a classification model by supervised learning, and classify the texts. Their effectiveness depends strongly on the size and quality of the training set, and they scale poorly (a classifier built for one task is difficult to extend to other tasks). In practice the training set usually has to be labeled manually, so it is very difficult to obtain a large, high-quality training set. Moreover, because social media text is highly sparse and irregularly expressed, the classification features of many texts are difficult to extract accurately, and the classification performance of existing methods is poor.
Disclosure of Invention
To solve the above technical problems, the invention provides a social media text denoising method based on space-time burst characteristics. It measures the value of words comprehensively by extracting the time and geographic position information of each word, and it corrects misjudgments by building an association graph between words and exploiting the relevance between them. This reduces the sensitivity of noise word recognition to the threshold and improves the accuracy of value word recognition. The method is an optimized extension of the second (unsupervised) type of approach.
The technical scheme adopted by the invention is as follows: a social media text denoising method based on space-time burst characteristics comprises the following steps:
S1, according to the sending time and the sending place of the social media text, collecting the spatio-temporal information of each word, and representing the spatio-temporal information of the word as points in a three-dimensional space;
S2, measuring the aggregation degree of the points in the three-dimensional space through a Ripley's K function;
S3, introducing the conditional probability among words to establish a word association graph, and judging whether the words are noise words by combining the aggregation degree through a graph regularization method.
In step S1, the spatio-temporal information is expressed as $[time_i, longitude_i, latitude_i]$, where $time_i$ is the time information of the $i$-th tweet, $longitude_i$ is the longitude information of the $i$-th tweet, and $latitude_i$ is the latitude information of the $i$-th tweet.
In step S1, the geographic entity words in a text are recognized, and the longitude and latitude of these geographic entities are used as the longitude and latitude information of the words in the text.
In step S2, the aggregation degree of the points in the three-dimensional space is measured through the Ripley's K function, and the expression of the aggregation degree is:
$$K_w(t,h) = \lambda_w^{-1} E[N_w(t,h)]$$
where $K_w(t,h)$ is the Ripley's K function; $\lambda_w$ is the density of the word $w$ in the three-dimensional space; $N_w(t,h)$ is the number of other points of the word within the region around a three-dimensional point of the word $w$, where $h$ is the spatial radius and $t$ is the time radius; and $E[N_w(t,h)]$ is the expectation of $N_w(t,h)$.
In step S2, the value of the Ripley's K function of the word $w$ is calculated by the following formula:
$$\hat{K}_w(t,h) = \frac{R\,T}{N_w^2}\, N_w\big(h_{i,j} < h,\ |t_i - t_j| < t\big)$$
where $\hat{K}_w(t,h)$ is the value of the Ripley's K function, $R$ is the area of the space in which the tweets appear, $T$ is the time span over which the tweets were collected, and $N_w$ is the number of occurrences of the word $w$ in the space-time $RT$; $N_w(h_{i,j} < h, |t_i - t_j| < t)$ is the number of point pairs, among the three-dimensional coordinate points corresponding to the space-time positions where the word $w$ appears, whose spatial distance is less than $h$ and whose time interval is less than $t$.
Step S3 specifically includes the following substeps:
S31, setting two thresholds $l_1$ and $l_2$ with $l_1 < l_2$, where $L_w$ denotes the normalized Ripley's K function value of the word $w$: if $L_w < l_1$, the word $w$ is determined to be a noise word; if $L_w > l_2$, the word $w$ is judged to be valuable; otherwise, executing step S32;
S32, calculating the conditional probability that the word $w_j$ occurs when the word $w_i$ occurs, then executing step S33;
S33, taking each word as a vertex in $W$ and the conditional probabilities of the words occurring together as directed edges $E$, and constructing an inter-word association graph $G = (W, E)$;
S34, using the word association graph $G = (W, E)$, judging the noise words by a graph regularization method based on each word's own $L_w$ value and the $L_v$ values of its neighbours.
Step S34 is implemented as follows: for each word $w$ with $l_1 \le L_w \le l_2$, the word's own $L_w$ value is updated according to the $L_v$ values of its neighbours $v$ to give $L'_w$ (the update formula is not reproduced in this text); the updated $L'_w$ is then assigned to $L_w$. Through iteration, the words end up with $L_w < l_1$ or $L_w > l_2$, and the words with $L_w > l_2$ are finally output.
Here $\tilde{P}(w \mid v)$ denotes the normalized conditional probability of the neighbouring word $v$.
Specifically, the conditional probability $P(w \mid v)$ of the neighbouring word $v$ is divided by the sum of the conditional probabilities of all neighbouring words of the word $w$, giving the normalized result
$$\tilde{P}(w \mid v) = \frac{P(w \mid v)}{\sum_{u \in \mathcal{N}(w)} P(w \mid u)}$$
where $\mathcal{N}(w)$ is the set of neighbouring words of $w$.
The method further comprises setting a condition for stopping the iteration. Specifically, a threshold $\theta_0$ is set and $\theta$ is initialized to 0; after each iteration, $\theta$ is updated by adding to its previous value the square of the difference between the $L_w$ value before the iteration and the $L'_w$ value updated after the iteration. If the $\theta$ obtained after the iteration is less than or equal to $\theta_0$, the iteration stops.
The beneficial effects of the invention are as follows. The tweet denoising method based on spatio-temporal information can identify noise tweets quickly and accurately without data labeling. It exploits the fact that valuable text information is aggregated in space and time: a space-time Ripley's K function is constructed to describe the space-time burstiness of the words in a text, words carrying valuable information are identified, and texts containing no such words are removed. The method is an unsupervised noise text recognition method that can effectively recognize noise in text without depending on a training set, and has the following advantages:
1) the Ripley's K function based on spatio-temporal information can measure whether data points are aggregated in space-time, and is applicable to similar problems of evaluating the spatio-temporal aggregation of data points;
2) an association graph of the words is built from their conditional probabilities, and a graph regularization algorithm mines the association information among words, which reduces the sensitivity of value word recognition to the threshold setting and improves the accuracy of noise text recognition;
3) the method filters text based on value words and is an unsupervised noise text recognition method that needs no manually labeled training set, so it has better robustness and usability.
Drawings
FIG. 1 is a block diagram of an embodiment of the present invention;
FIG. 2 is a flow chart of a graph regularization algorithm of the present invention;
FIG. 3 shows the coverage of the 54 events achieved by different numbers of words, without regularization and with regularization, according to an embodiment of the present invention.
Detailed Description
The invention provides a social media text denoising method based on space-time burst characteristics. The method aims at the characteristic that value information (texts related to events, subjects and the like) in social media has aggregative property in time and space, and models the space-time distribution of each word in the texts from the perspective of space-time burstiness, so that words related to the events and the subjects are identified from mass social media data. The value text and the noise text are distinguished according to whether the words in the text have space-time aggregation, and usability and effectiveness of the social media text denoising method can be improved.
The method aims at judging whether words in the text in the current time window have aggregative property in space-time distribution or not, and identifies the words with aggregative property, namely the words with value. And if a certain text does not contain any value word, judging the text to be a noise text and directly removing the noise text. Therefore, the technical scheme of the invention is mainly divided into three parts:
First, according to the sending time and the sending place of the social media text, the spatio-temporal information of each word is collected and represented as points in a three-dimensional space;
Second, the aggregation degree of the points in the three-dimensional space is measured through a Ripley's K function and used as the evaluation criterion for the value of the word;
Third, a word association graph is established by introducing the conditional probability among words, and a graph regularization method is designed to reduce the sensitivity of value word recognition to the decision threshold and to improve the accuracy and robustness of value word judgment. As shown in FIG. 1, the method of the present invention comprises the following steps:
1. Spatiotemporal information extraction of words
For a word $w$, the sending time and the latitude and longitude of the geographic position of each text containing the word are collected. Suppose the word $w$ appears in $n$ texts and the spatiotemporal information of the $i$-th text is $[time_i, longitude_i, latitude_i]$; then the spatiotemporal information of the word can be expressed as $w_{info} = \{[time_i, longitude_i, latitude_i] \mid 1 \le i \le n\}$. The time information $time_i$ can be viewed as one dimension. The latitude and longitude $(longitude_i, latitude_i)$ correspond to a position on the sphere of the earth; since only the earth's surface is considered, they can be regarded as the coordinates of a point on a two-dimensional plane. The spatiotemporal information of each word can therefore be regarded as a set of points in a three-dimensional space.
To measure the distance between points in the three-dimensional space corresponding to a word, the distance in the time dimension and the distance in the space dimension are calculated separately. The distance in the time dimension is the time difference, and the distance in the space dimension is the distance between two points on the earth's surface obtained from their longitude and latitude. Given the coordinates A(LatA, LonA) and B(LatB, LonB) of two points on the earth's surface, the distance between them is:
$$d(A,B) = 2 R_1 \arcsin\sqrt{\sin^2\alpha + \cos(\mathrm{LatA})\cos(\mathrm{LatB})\sin^2\beta} \qquad (1)$$
where $\alpha = (\mathrm{LatA} - \mathrm{LatB})/2$, $\beta = (\mathrm{LonA} - \mathrm{LonB})/2$, and $R_1$ is the mean radius of the earth, $R_1 = 6371$ km.
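The surface distance between two latitude/longitude points can be sketched in Python as below. The function name `surface_distance_km` is illustrative, and the inputs are assumed to be in degrees; the formula itself is the standard haversine form using the half-differences α and β and the 6371 km mean radius given above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius of the earth, R1 in the text

def surface_distance_km(lat_a, lon_a, lat_b, lon_b):
    """Great-circle distance in km between two points given in degrees."""
    alpha = math.radians(lat_a - lat_b) / 2  # half the latitude difference
    beta = math.radians(lon_a - lon_b) / 2   # half the longitude difference
    s = (math.sin(alpha) ** 2
         + math.cos(math.radians(lat_a)) * math.cos(math.radians(lat_b))
         * math.sin(beta) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, s)))
```

As a sanity check, two points on the equator 90 degrees of longitude apart are a quarter of the circumference away, roughly 10,007.5 km.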
In practice, every text on social media carries its sending time, but only a small fraction carries latitude and longitude information. The toolkits provided by NLTK, StanfordNLP and spaCy are therefore used together to identify which words are entity words representing geographic locations; these geographic entity words are resolved through the geopy library to obtain their longitude and latitude, and the longitude and latitude of the geographic entity words in a text are used as the longitude and latitude information of that text. In a concrete implementation, a table of geographic entities and their longitude and latitude is usually maintained. For a new text, the geographic entity words are identified first; if the longitude and latitude of an entity word is already stored in the table, it is used directly, otherwise the entity word is resolved and the result is added to the table. In one experiment on 5 million tweets, 2 million contained one or more geographic entity words, which means that the longitude and latitude of roughly 40% of tweets can be inferred in this way.
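The lookup table described above behaves like a cache in front of a geocoder. The sketch below is one possible shape for it, not the patent's implementation: `GazetteerCache` and its `resolve` argument are hypothetical names, and `resolve` stands in for whatever geocoding call is actually used (for example a wrapper around a geopy geocoder).

```python
from typing import Callable, Dict, Optional, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees

class GazetteerCache:
    """Table of geographic entity words and their coordinates."""

    def __init__(self, resolve: Callable[[str], Optional[Coord]]):
        # `resolve` is a hypothetical geocoding function supplied by the caller.
        self._resolve = resolve
        self._table: Dict[str, Optional[Coord]] = {}

    def lookup(self, place: str) -> Optional[Coord]:
        key = place.strip().lower()
        if key not in self._table:
            # First time this entity is seen: resolve it and store the result,
            # including None for names that cannot be resolved.
            self._table[key] = self._resolve(place)
        return self._table[key]
```

Caching failed lookups as well as successful ones avoids repeatedly querying the geocoder for the same unresolvable name.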
2. Word space-time burst feature extraction based on Ripley's K function
Something that many people discuss at a particular time and place is considered valuable, and so are the keywords corresponding to it. This can be modeled as follows: when a word is aggregated in the three-dimensional space-time, the word is being discussed heavily at some place at the current moment, so it corresponds to some event and has value.
The Ripley's K function can be used to measure whether a set of points on a two-dimensional plane is aggregated, and the two-dimensional Ripley's K function can be extended to three dimensions. For a word $w$, the Ripley's K function based on three-dimensional spatio-temporal information is defined as:
$$K_w(t,h) = \lambda_w^{-1} E[N_w(t,h)] \qquad (2)$$
where $\lambda_w$ is the average number of occurrences of the word per unit of space-time, which is equivalent to the density of the word $w$ in the three-dimensional space; $N_w(t,h)$ is the number of other points of the word within the region centred on a three-dimensional point of the word $w$ with spatial radius $h$ and time radius $t$ (the region is a cylinder, with $h$ the base radius and $t$ the height); and $E[N_w(t,h)]$ is the expectation of this number.
For a given spatial distance $h$ and time interval $t$, the value of the Ripley's K function can be estimated by the following equation:
$$\hat{K}_w(t,h) = \frac{R\,T}{N_w^2}\, N_w\big(h_{i,j} < h,\ |t_i - t_j| < t\big) \qquad (3)$$
where $R$ is the area of the space in which the tweets appear (if the collected tweets all come from New York, $R$ is the area of New York); $T$ is the time span over which the tweets were collected; $N_w$ is the number of occurrences of the word $w$ in the space-time $RT$; and $N_w(h_{i,j} < h, |t_i - t_j| < t)$ is the number of point pairs, among the three-dimensional coordinate points corresponding to the space-time positions where the word $w$ appears, whose spatial distance is less than $h$ and whose time interval is less than $t$. Intuitively, if the occurrences of a word are concentrated in a small region of space-time, the Ripley's K function value of the word is large.
As an example of the time span: if the earliest collected tweet is from 00:00 on 2020-03-20 and the latest is from 00:00 on 2020-03-23, then T = 72 h.
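The estimate can be sketched as a pair count scaled by $RT/N_w^2$. The code below is an illustration under stated assumptions: points are `(time_hours, lat, lon)` tuples, each close pair is counted in both directions (once per point it surrounds), and the function names are invented for the example. The quadratic pair loop is fine for a sketch but would need spatial indexing for large vocabularies.

```python
import math
from itertools import combinations

def haversine_km(lat_a, lon_a, lat_b, lon_b, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    alpha = math.radians(lat_a - lat_b) / 2
    beta = math.radians(lon_a - lon_b) / 2
    s = (math.sin(alpha) ** 2
         + math.cos(math.radians(lat_a)) * math.cos(math.radians(lat_b))
         * math.sin(beta) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, s)))

def ripley_k_estimate(points, h_km, t_hours, area_km2, span_hours):
    """Estimate the space-time Ripley's K value of one word.

    points: list of (time_hours, lat, lon) occurrences of the word
    h_km, t_hours: spatial radius h and time radius t
    area_km2, span_hours: R and T of the collection window
    """
    n = len(points)
    if n < 2:
        return 0.0
    close = 0
    for (t1, lat1, lon1), (t2, lat2, lon2) in combinations(points, 2):
        if abs(t1 - t2) < t_hours and haversine_km(lat1, lon1, lat2, lon2) < h_km:
            close += 2  # assumption: count the pair as (i, j) and (j, i)
    return area_km2 * span_hours * close / (n * n)
```

A word whose occurrences cluster within a few kilometres and hours yields a value far above $\pi h^2 t$, while occurrences scattered across the window yield a value near or below it.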
Noise words are randomly distributed in the time and space dimensions, so the expectation of $N_w(t,h)$ is the word density $\lambda_w$ multiplied by the volume of the cylinder corresponding to the spatio-temporal range:
$$E[N_w(t,h)] = \lambda_w \pi h^2 t \qquad (4)$$
Thus, when a word is randomly distributed in space-time, its Ripley's K function value is $K_w(t,h) = \pi h^2 t$, and the word is a noise word. If the word is valuable, it is aggregated in space-time and $K_w(t,h) > \pi h^2 t$. The estimate $\hat{K}_w(t,h)$ of a word can therefore be calculated and compared with $\pi h^2 t$ to determine whether the word is valuable. More generally, to eliminate the influence of the chosen spatio-temporal parameters $t$ and $h$ on the value of $\hat{K}_w(t,h)$, the invention normalizes $\hat{K}_w(t,h)$ to obtain the normalized value $L_w$ (equation (5), which is not reproduced in this text).
3. Optimizing value word recognition using graph regularization
Once the normalized Ripley's K function value $L_w$ of a word has been calculated, a threshold $l$ can be set: when the $L_w$ of a word is less than $l$, the word is considered a noise word, otherwise it is a value word. The difficulty is that a suitable threshold $l$ is hard to find. If $l$ is too large, many valuable words are misjudged as noise words; if $l$ is too small, many noise words go unidentified, and the text denoising effect suffers.
To solve this problem, the invention uses graph regularization to reduce the effect of the threshold setting on the result. First, two thresholds $l_1$ and $l_2$ ($l_1 < l_2$) are set, with $l_1$ a smaller value and $l_2$ a larger value. If $L_w < l_1$, the word $w$ is determined to be a noise word; if $L_w > l_2$, the word $w$ is determined to be valuable. When $l_1 \le L_w \le l_2$, it is considered temporarily impossible to determine whether the word $w$ is valuable, because the probability of misjudging a word in this interval is high. When a valuable event occurs, the words related to it tend to appear together; that is, when most of the words strongly associated with a given word are valuable, that word is very likely valuable too. For a word whose value cannot yet be determined, its value can therefore be judged from the $L$ values of the other words associated with it.
First, the conditional probability of co-occurrence is calculated for each pair of words and used as the association strength between them. When two words often appear in the same tweet, their mutual conditional probability is high and so is their association strength. The conditional probability that the word $w_j$ occurs when the word $w_i$ occurs is calculated as:
$$P(w_j \mid w_i) = \frac{\mathrm{count}(w_i, w_j)}{\mathrm{count}(w_i)} \qquad (6)$$
where $\mathrm{count}(w_i, w_j)$ is the number of tweets in which both words appear and $\mathrm{count}(w_i)$ is the number of tweets in which $w_i$ appears. When the conditional probability for a pair of words is calculated, only these two quantities are counted: the number of tweets containing both words and the number of tweets containing a given word.
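The counting described above can be sketched as follows. The function name and the representation of tweets as token lists are assumptions for illustration; a word is counted at most once per tweet, in line with counting the number of tweets rather than the number of occurrences.

```python
from collections import Counter
from itertools import permutations

def conditional_probabilities(tweets):
    """Return {(w_i, w_j): P(w_j | w_i)} from an iterable of token lists."""
    single = Counter()  # number of tweets containing each word
    pair = Counter()    # number of tweets containing each ordered word pair
    for tokens in tweets:
        vocab = set(tokens)  # count each word once per tweet
        single.update(vocab)
        pair.update(permutations(sorted(vocab), 2))
    return {(wi, wj): c / single[wi] for (wi, wj), c in pair.items()}
```

Pairs that never co-occur are simply absent from the result, which keeps the resulting graph sparse.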
After the conditional probabilities among the words have been calculated, the words are taken as the vertices $W$ of a graph and the conditional probabilities of the words occurring together as the directed edges $E$, giving the inter-word association graph $G = (W, E)$. Using this graph, the words that could not be judged are decided by a graph regularization method based on each word's own $L_w$ value and the $L_v$ values of its neighbours.
Combining the normalized Ripley's K function with graph regularization, FIG. 2 is a flow chart of the complete algorithm for determining whether a word is valuable. In the algorithm, $\theta_0$ is the threshold that decides when the program stops iterating and is typically set to a small value, e.g. 0.01; $\theta$ is the sum of the squared differences between the $L$ values of the words after and before an iteration. When $\theta$ is small, e.g. $\theta < \theta_0 = 0.01$, the $L$ values are considered to have stabilized and the iteration can stop.
In each iteration, for every word whose status is still undecided, the $L$ values of its neighbouring words are used to correct its own $L_w$ value. If most of the words strongly associated with it are valuable (large $L$ values), the update formula in the flow chart increases the word's $L_w$ value, and the word is judged to be a value word (i.e., the updated $L_w > l_2$). Conversely, if most of the words associated with it are noise words, its $L_w$ value decreases and it is judged to be a noise word (i.e., the updated $L_w < l_1$). After several iterations, all words with $L_w > l_2$ are output; these are the valuable words. The graph regularization method thus makes effective use of the relevance among words and improves the accuracy of value word recognition.
Here $\tilde{P}(w \mid v)$ denotes the normalized conditional probability of the neighbouring word $v$. Specifically, the conditional probability $P(w \mid v)$ of the neighbouring word $v$ is divided by the sum of the conditional probabilities of all neighbouring words of the word $w$, giving the normalized result
$$\tilde{P}(w \mid v) = \frac{P(w \mid v)}{\sum_{u \in \mathcal{N}(w)} P(w \mid u)}$$
where $\mathcal{N}(w)$ is the set of neighbouring words of $w$.
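The iteration can be sketched as below. The exact update formula from the flow chart is not available in this text, so the rule used here is an assumption: the new value blends the word's own value with the average of its neighbours' values weighted by the normalized conditional probabilities, controlled by a hypothetical `mix` parameter. The function name, `max_iter`, and the choice to recompute θ from zero in every iteration are also assumptions.

```python
def regularize(values, cond_prob, l1, l2, theta0=0.01, mix=0.5, max_iter=100):
    """Return the set of words judged valuable after graph regularization.

    values:    {word: normalized Ripley's K value L_w}
    cond_prob: {(v, w): P(w | v)}, the directed edges of the word graph
    l1, l2:    lower and upper thresholds, l1 < l2
    """
    L = dict(values)
    # For each word w, collect its neighbours v together with P(w | v).
    neighbours = {}
    for (v, w), p in cond_prob.items():
        if v in L and w in L and p > 0:
            neighbours.setdefault(w, []).append((v, p))

    for _ in range(max_iter):
        theta = 0.0
        updated = {}
        for w, own in L.items():
            if not (l1 <= own <= l2) or w not in neighbours:
                continue  # already decided, or no neighbours to learn from
            total = sum(p for _, p in neighbours[w])
            # Neighbour values weighted by normalized conditional probability.
            avg = sum(p / total * L[v] for v, p in neighbours[w])
            new = (1 - mix) * own + mix * avg  # assumed update rule
            theta += (new - own) ** 2
            updated[w] = new
        L.update(updated)
        if theta < theta0:
            break  # values have stabilized
    return {w for w, val in L.items() if val > l2}
```

With this rule, an undecided word surrounded by high-valued neighbours is pulled above $l_2$, while one surrounded by noise words is pulled below $l_1$, which matches the behaviour described for the flow chart.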
The algorithm shown in FIG. 2 requires two thresholds $l_1$ and $l_2$, but it is insensitive to both. That is, $l_1$ can be set to a very small value (e.g., such that only 1% of words have a normalized Ripley's K function value below $l_1$) and $l_2$ to a very large value (e.g., such that only 1% of words have a normalized Ripley's K function value above $l_2$), and the results differ little from those obtained with carefully chosen $l_1$ and $l_2$. The reason is the robustness of graph regularization: starting from only a small set of high-value words, each iteration identifies the words between $l_1$ and $l_2$ that are strongly associated with them, and these words serve as high-value words in the next iteration. This forms a chain reaction, and after several iterations all high-value words are identified.
The graph regularization method avoids the errors caused by setting a single threshold $l$ and reduces word misjudgment. When a tweet contains no valuable word, it is considered a noise tweet. In practice, the judgment of whether a word is valuable must keep pace with the times: a word that was not valuable earlier may be valuable now, so the algorithm shown in FIG. 2 needs to be run in every time period.
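The final filtering step is a simple membership test. A minimal sketch, assuming tweets are already tokenized and `value_words` is the set produced for the current time period (both names are illustrative):

```python
def drop_noise_tweets(tweets, value_words):
    """Keep only the tweets that contain at least one value word."""
    value_words = set(value_words)
    return [tokens for tokens in tweets if value_words.intersection(tokens)]
```

Because the set of value words changes over time, this filter would be re-applied with a fresh set in each time period.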
To verify the accuracy of measuring word value with the normalized Ripley's K function and the effectiveness of correcting word misjudgment with graph regularization, 20 million tweets published from March 26, 2020 to March 31, 2020 were collected. After the tweets were clustered by text, keywords were entered into Google News for retrieval, the clusters belonging to an event were identified and manually labeled, and 54 different events were obtained. With the 54 valuable events contained in these 20 million tweets, the denoising capability of a tweet denoising algorithm can be measured under the requirement that all event tweets are retained.
The 20 million tweets contain 150,000 different words. Words that occur rarely are often misspellings or irregular expressions, so words occurring fewer than 100 times are removed and directly judged to be noise words. For the remaining 11355 words, the normalized Ripley's K function value of each word is calculated and the values are sorted from small to large; the values at the boundaries of the first 30% and the last 30% of the words are taken as the limits $l_1$ and $l_2$. The words are then sorted by Ripley's K function value from large to small, and FIG. 3 shows the coverage of the 54 events achieved by different numbers of top-ranked words, without regularization and with regularization.
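Choosing the two limits from the sorted values amounts to taking the 30th and 70th percentiles. The sketch below uses `statistics.quantiles` from the Python standard library; the function name, the decile resolution, and the library's default interpolation are assumptions of this example, so the limits may differ slightly from those obtained by indexing the sorted list directly.

```python
from statistics import quantiles

def percentile_thresholds(values):
    """Return (l1, l2) at the 30th and 70th percentiles of `values`."""
    deciles = quantiles(values, n=10)  # nine cut points: 10%, 20%, ..., 90%
    return deciles[2], deciles[6]
```

Words below `l1` are treated as noise, words above `l2` as valuable, and the remaining 40% are left for graph regularization to decide.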
FIG. 3 shows that, without graph regularization, the 8000 words with the largest normalized Ripley's K function values are required to cover all events; using these 8000 words to filter the 20 million tweets removes 53.63% of them. With graph regularization, only 3980 words are needed to cover all events, and 74.2% of the tweets can be filtered out. The experimental results show that the improved Ripley's K function can denoise tweets while retaining all valuable information, and that combining it with the graph regularization method further improves the denoising rate.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.
Claims (10)
1. A social media text denoising method based on space-time burst characteristics is characterized by comprising the following steps:
S1, according to the sending time and the sending place of the social media text, collecting the spatio-temporal information of each word, and representing the spatio-temporal information of the word as points in a three-dimensional space;
S2, measuring the aggregation degree of the points in the three-dimensional space through a Ripley's K function;
S3, introducing the conditional probability among words to establish a word association graph, and judging whether the words are noise words by combining the aggregation degree through a graph regularization method.
2. The method for denoising social media text based on spatiotemporal burst characteristics according to claim 1, wherein in step S1 the spatio-temporal information is expressed as $[time_i, longitude_i, latitude_i]$, where $time_i$ is the time information of the $i$-th tweet, $longitude_i$ is the longitude information of the $i$-th tweet, and $latitude_i$ is the latitude information of the $i$-th tweet.
3. The method for denoising social media text based on spatiotemporal burst characteristics according to claim 2, wherein step S1 is to obtain the latitude and longitude information of the words in the text according to the latitude and longitude information of the geographic entities by identifying the geographic entities words in the text.
4. The method for denoising social media text based on spatiotemporal burst characteristics as claimed in claim 2, wherein step S2 measures the aggregation degree of points in the three-dimensional space through the Ripley's K function, and the expression of the aggregation degree is:
$$K_w(t, h) = \lambda_w^{-1} E[N_w(t, h)]$$
wherein $K_w(t, h)$ represents the Ripley's K function; $\lambda_w$ represents the density of the word w in the three-dimensional space; $N_w(t, h)$ represents the number of other points of the word in a region around a given three-dimensional point of the word w, where h is the space radius and t is the time radius; and $E[N_w(t, h)]$ represents the expectation of $N_w(t, h)$.
5. The method for denoising social media text based on spatiotemporal burst characteristics as claimed in claim 4, wherein step S2 calculates the numerical value of the Ripley's K function of the word w by a formula in which R is the area of the space in which the tweets appear, T is the time span over which the tweets were collected, $N_w$ is the number of occurrences of the word w in the space-time region RT, and $N_w(h_{i,j} < h, \mathrm{abs}(t_i, t_j) < t)$ is the number of pairs of points, among the three-dimensional coordinate points corresponding to the space-time positions where the word w appears, whose spatial distance is smaller than h and whose time interval is smaller than t.
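The quantities named in claim 5 can be combined in a minimal sketch of a space-time Ripley's K estimate. The normalization used here, $R \cdot T / N_w^2$ times the number of ordered close pairs, is an assumed standard form and is not confirmed by the text above; the function name `ripley_k` and the use of plain Euclidean distance on longitude and latitude are likewise assumptions.

```python
import math


def ripley_k(points, h, t, area, duration):
    """Estimate a space-time Ripley's K value for one word.

    points   -- list of (time, longitude, latitude) tuples for the word
    h, t     -- space radius and time radius
    area     -- R, the area of the region in which the tweets appear
    duration -- T, the time span over which the tweets were collected

    Assumed normalization: K = R * T / N_w**2 * (number of ordered pairs
    closer than h in space and closer than t in time).
    """
    n = len(points)
    if n < 2:
        return 0.0
    close_pairs = 0
    for i in range(n):
        ti, xi, yi = points[i]
        for j in range(i + 1, n):
            tj, xj, yj = points[j]
            # Euclidean distance on raw coordinates is a simplification.
            if math.hypot(xi - xj, yi - yj) < h and abs(ti - tj) < t:
                close_pairs += 2  # count both (i, j) and (j, i)
    return area * duration * close_pairs / (n * n)


# Toy data: a word that bursts in one place versus a word spread out evenly.
clustered = [(0, 0.0, 0.0), (1, 0.1, 0.0), (2, 0.0, 0.1)]
spread = [(0, 0.0, 0.0), (50, 5.0, 5.0), (99, 9.0, 9.0)]
print(ripley_k(clustered, h=1, t=5, area=100, duration=100))
print(ripley_k(spread, h=1, t=5, area=100, duration=100))
```

Under these assumptions, a word whose occurrences cluster in space and time receives a larger value than a word whose occurrences are scattered, which matches the role of the aggregation degree in step S2.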
6. The method for denoising social media text based on spatiotemporal burst feature of claim 5, wherein step S2 further comprises standardizing the calculated value of the Ripley's K function.
7. The method for denoising social media text based on spatiotemporal burst feature of claim 6, wherein step S3 specifically comprises the following sub-steps:
S31, setting two thresholds $l_1$ and $l_2$ with $l_1 < l_2$; if the standardized value of the word w is below $l_1$, the word w is determined to be a noise word; if it is above $l_2$, the word w is judged to be valuable; otherwise, executing step S32;
S32, calculating the conditional probability that the word $w_j$ occurs when the word $w_i$ occurs, and then executing step S33;
S33, taking each word as a vertex W of the graph and the conditional probabilities of the words appearing together as directed edges E, and constructing an inter-word association graph $G = (W, E)$;
8. The method for denoising social media text based on spatiotemporal burst feature of claim 7, wherein the step S34 is implemented as follows: for each word w whose standardized value lies between $l_1$ and $l_2$, the value of the word itself is updated according to the values of its neighbor words, each neighbor word v being weighted by its normalized conditional probability; the updated value $L'_w$ is then assigned back to the word; through iteration, the value of each such word becomes either lower than $l_1$ or higher than $l_2$; and the words whose value is higher than $l_2$ are finally output.
9. The method for denoising social media text based on spatiotemporal burst features as claimed in claim 8, wherein the conditional probability is normalized as follows: the conditional probability $P(w \mid v)$ of the neighbor word v is divided by the sum of the conditional probabilities of all neighbor words of the word w to obtain the normalized result.
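The conditional probabilities of step S32 and the normalization recited in claim 9 can be sketched as follows. Estimating $P(w \mid v)$ from tweet co-occurrence counts is an assumption of this example, as are the function names `conditional_probabilities` and `normalize_for_word` and the whitespace tokenization.

```python
from collections import Counter
from itertools import permutations


def conditional_probabilities(tweets):
    """Return {(w, v): P(w | v)} estimated from tweet co-occurrence.

    Assumption: P(w | v) = (tweets containing both w and v) / (tweets containing v).
    """
    word_count = Counter()
    pair_count = Counter()
    for text in tweets:
        words = set(text.lower().split())
        word_count.update(words)
        # Ordered pairs of distinct words that appear in the same tweet.
        pair_count.update(permutations(words, 2))
    return {(w, v): c / word_count[v] for (w, v), c in pair_count.items()}


def normalize_for_word(cond_prob, w):
    """Divide each P(w | v) by the sum of P(w | v) over all neighbors v of w."""
    neighbors = {v: p for (word, v), p in cond_prob.items() if word == w}
    total = sum(neighbors.values())
    return {v: p / total for v, p in neighbors.items()} if total else {}
```

After normalization the weights of the neighbors of a given word sum to 1, so they can be used directly as the edge weights of the association graph when a word's value is updated from its neighbors.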
10. The method for denoising social media text based on spatiotemporal burst features as claimed in claim 8, further comprising setting a condition for stopping the iteration, specifically: setting a threshold $\theta_0$ and setting the initial value of $\theta$ to 0; after each iteration, updating $\theta$ to the square of the difference between the value before the iteration and the value updated after the iteration, plus the value of $\theta$ before the iteration; if the $\theta$ obtained after the iteration is less than or equal to $\theta_0$, the iteration is stopped.
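The iteration described in claims 8 and 10 can be sketched as below. The exact update formula is not reproduced in the text above, so this sketch assumes that an undecided word takes the weighted average of its neighbors' current values; it also assumes that $\theta$ is reset at the start of every sweep and that a `max_iter` cap is used. The function name `graph_regularize` and all parameter names other than the thresholds are hypothetical.

```python
def graph_regularize(values, neighbors, l1, l2, theta0=1e-6, max_iter=100):
    """Iteratively update undecided words from their neighbors' values.

    values    -- {word: standardized Ripley's K value}
    neighbors -- {word: {neighbor: normalized conditional probability}}
    l1, l2    -- thresholds with l1 < l2

    Assumed update rule: a word whose value lies in [l1, l2] takes the
    weighted average of its neighbors' current values.
    """
    values = dict(values)
    for _ in range(max_iter):
        theta = 0.0  # accumulated squared change in this sweep (assumed reset)
        updated = dict(values)
        for w, value in values.items():
            weights = neighbors.get(w)
            # Words already below l1 or above l2 are decided and left unchanged.
            if not (l1 <= value <= l2) or not weights:
                continue
            new_value = sum(p * values[v] for v, p in weights.items())
            theta += (value - new_value) ** 2
            updated[w] = new_value
        values = updated
        if theta <= theta0:
            break
    # Words whose final value exceeds l2 are kept as valuable.
    return {w for w, value in values.items() if value > l2}


# Toy data for illustration only.
values = {"quake": 0.9, "rescue": 0.5, "lol": 0.1, "haha": 0.4}
neighbors = {
    "rescue": {"quake": 0.8, "lol": 0.2},
    "haha": {"lol": 0.9, "quake": 0.1},
}
print(graph_regularize(values, neighbors, l1=0.3, l2=0.7))
```

In the toy data, "rescue" is pulled above $l_2$ by its strong association with "quake", while "haha" is pulled below $l_1$ by its association with "lol", so only "quake" and "rescue" are retained.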
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111086719.8A CN113822048B (en) | 2021-09-16 | 2021-09-16 | Social media text denoising method based on space-time burst characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113822048A (en) | 2021-12-21 |
CN113822048B (en) | 2023-03-21 |
Family
ID=78914748
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202111086719.8A | Active | CN113822048B (en) | 2021-09-16 | 2021-09-16 | Social media text denoising method based on space-time burst characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113822048B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170132288A1 (en) * | 2015-11-06 | 2017-05-11 | International Business Machines Corporation | Extracting and Denoising Concept Mentions Using Distributed Representations of Concepts |
CN107609103A (en) * | 2017-09-12 | 2018-01-19 | 电子科技大学 | A Twitter-based event detection method |
CN108038734A (en) * | 2017-12-25 | 2018-05-15 | 武汉大学 | City commercial facility spatial distribution detection method and system based on comment data |
CN108537274A (en) * | 2018-04-08 | 2018-09-14 | 武汉大学 | A grid-based multi-scale fast clustering method |
CN108846384A (en) * | 2018-07-09 | 2018-11-20 | 北京邮电大学 | Multi-task collaborative recognition method and system fusing video perception |
CN110457711A (en) * | 2019-08-20 | 2019-11-15 | 电子科技大学 | A social media event topic recognition method based on topic words |
Non-Patent Citations (4)
Title |
---|
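Gaolei Fei et al.: "Real-Time Detection of COVID-19 Events From Twitter: A Spatial-Temporally Bursty-Aware Method" * |
Gaolei Fei et al.: "Twitter Event Detection Under Spatio-Temporal Constraints" * |
伏家云; 靖常峰; 杜明义; 付艳丽; 戴培培: "Cluster analysis of urban management cases using a parameter-optimized DBSCAN algorithm" * |
程勇: "Research on online event detection and analysis methods in social networks" * |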
Also Published As
Publication number | Publication date |
---|---|
CN113822048B (en) | 2023-03-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111833172A (en) | Consumption credit fraud detection method and system based on isolated forest | |
AU2018101946A4 (en) | Geographical multivariate flow data spatio-temporal autocorrelation analysis method based on cellular automaton | |
CN109189767B (en) | Data processing method and device, electronic equipment and storage medium | |
CN104573130B (en) | The entity resolution method and device calculated based on colony | |
CN110991657A (en) | Abnormal sample detection method based on machine learning | |
CN109635010B (en) | User characteristic and characteristic factor extraction and query method and system | |
CN112149758B (en) | Hyperspectral open set classification method based on Euclidean distance and deep learning | |
CN113516228B (en) | Network anomaly detection method based on deep neural network | |
CN108171119B (en) | SAR image change detection method based on residual error network | |
CN110942099A (en) | Abnormal data identification and detection method of DBSCAN based on core point reservation | |
CN112188532A (en) | Training method of network anomaly detection model, network detection method and device | |
CN109871805B (en) | Electromagnetic signal open set identification method | |
CN111008337A (en) | Deep attention rumor identification method and device based on ternary characteristics | |
Lawrence et al. | Explaining neural matrix factorization with gradient rollback | |
Kadavankandy et al. | The power of side-information in subgraph detection | |
CN111738319A (en) | Clustering result evaluation method and device based on large-scale samples | |
CN115062186A (en) | Video content retrieval method, device, equipment and storage medium | |
CN113822048B (en) | Social media text denoising method based on space-time burst characteristics | |
CN114710344B (en) | Intrusion detection method based on traceability graph | |
CN116597197A (en) | Long-tail target detection method capable of adaptively eliminating negative gradient of classification | |
CN109739840A (en) | Data processing empty value method, apparatus and terminal device | |
CN111798237B (en) | Abnormal transaction diagnosis method and system based on application log | |
CN113792105A (en) | Geospatial point data sampling method based on half-variogram | |
CN111027515A (en) | Face library photo updating method | |
CN114818883B (en) | CART decision tree fire disaster image recognition method based on optimal combination of color features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||