CN106503256B

CN106503256B - Hot information mining method based on social network documents

Info

Publication number: CN106503256B
Application number: CN201611005521.1A
Authority: CN
Inventors: 李静远; 郝晓波; 南军啸; 刘悦; 程学旗; 王凤
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2016-11-11
Filing date: 2016-11-11
Publication date: 2019-05-07
Anticipated expiration: 2036-11-11
Also published as: CN106503256A

Abstract

The present invention provides a hot information mining method based on social network documents, comprising the following steps: 1) obtaining the heat of a term in a heat-statistics window according to the degree to which the term's weight in that window fluctuates relative to the term's baseline weight in the corpus; 2) ranking the terms by heat to obtain the hot terms in the current heat-statistics window. The invention can improve the accuracy of candidate-word mining in social networks and can yield semantics that express social network hot events more accurately.

Description

A hot information mining method based on social network documents

Technical field

The present invention relates to the field of data mining technology, and in particular to a hot information mining method based on social network documents.

Background art

With the arrival of Web 2.0, the way netizens take part in the Internet has shifted from browsing and passively receiving information to producing and actively spreading it, which has profoundly changed both the network environment and the way information propagates online. Social network applications such as microblogs, blogs, and WeChat have become an important part of the lives of hundreds of millions of netizens.

At present, many major news stories and their latest developments first appear on social networks, so the Internet has become the place of origin and the channel of dissemination for many hot social events. Recent examples include "Liu Xiang retires" and "Wang Jianlin overtakes Li Jiacheng to become Asia's new richest man", which became hot events because of the high attention netizens paid to them. The Internet is also a channel through which government departments learn about public opinion, and an important public-opinion front to which they attach importance in the new era. For example, netizens have exposed corrupt officials online, and online anti-corruption has become a new mode of public supervision in the Internet era. Only by discovering, following up on, and handling such matters promptly can the government fully grasp public opinion and public sentiment and guide the healthy development of the Internet.

Unlike traditional news media, Internet hot events break out rapidly, involve massive amounts of data, and have non-standard content. How to capture hot events as quickly as possible from massive, irregular information, and how to express their core information efficiently in language elements that are easy to understand, has therefore been a popular research problem in academia in recent years.

Current mainstream hot-spot mining algorithms mainly rely on a set of core words (hot words) in a hot event, together with rules over them, to represent the event. Hot-word mining is an important part of hot-event discovery; it reflects the issues and things of common concern to people within a certain time and space. The accuracy of hot-word mining has an important influence on hot-event discovery.

On the other hand, the expressive power of a single word is limited: a word is often only an abstraction of one aspect of an event and cannot effectively express the real meaning of the hot event behind it. For example, "Liu Xiang" or "Wang Jianlin" alone hardly lets people know precisely which event lies behind the word. To discover hot events, it is therefore sometimes also necessary to perform word-sense extension (also called semantic extension) on hot words.

For word-sense extension, one existing scheme extracts affixes such as prefixes and then identifies numerals and proper nouns according to grammatical rules. This method is rather limited, and its effect is concentrated on identifying obvious entity nouns. Another existing scheme constructs a three-character word-formation template, trains a neural network on the three-character strings filtered out, and then filters by statistical features to obtain hot words. This method recognizes quickly but only handles words that follow specific rules.

Although the above prior-art schemes can perform semantic extension to some extent, they are not fully suitable for analyzing massive social network data, because such data has the following characteristics: 1) the grammar is non-standard and there is a lot of junk noise; 2) effective editing is lacking, and there is no central summary of the event. Traditional word-sense extension is mostly based on news corpora. Because the "headline" of a news item is itself highly expressive, news corpora have a natural advantage and relatively high quality. For social network corpora, which are not news collections, have lower text quality, and lack an effective "central event headline", the effectiveness of the aforementioned methods is greatly reduced.

Summary of the invention

Therefore, an object of the present invention is to provide a solution that can mine hot information from social network documents more accurately.

According to one aspect of the present invention, a hot information mining method based on social network documents is provided, comprising the following steps:

1) obtaining the heat of a term in a heat-statistics window according to the degree to which the term's weight in the heat-statistics window fluctuates relative to the term's baseline weight in the total corpus;

2) ranking the terms by heat to obtain the hot terms in the current heat-statistics window.

In step 1), the weight of a term in the heat-statistics window is the term's TFIDF weight in that window, and the baseline weight of the term in the total corpus is the term's TFIDF weight in the total corpus.

In step 1), the baseline weight of each term is the current baseline weight obtained by dynamic updating at a certain statistics period.

In step 1), for any term k, the heat F(k) is calculated as follows:

where N is the number of documents in the current heat-statistics time window, c_k is the number of documents in that window that contain term k, f_tk is the frequency of term k in document t, D is the update period of the term weights, W is the length of the heat-statistics time window, and base(k) is the baseline weight of term k.

In step 1), the current baseline weight is obtained from a dynamically updated baseline weight table, and the current baseline weight of each term k in the table is calculated as follows:

where D is the unit time of term weight table statistics, W is the length of the heat-statistics time window, base is the baseline term value, the subscript d is the index of the current round of term weight table statistics, and D_i is the time interval between the i-th round of term weight table statistics and the preceding round.

The hot information mining method based on social network documents further comprises the steps of:

3) merging same-topic terms among the hot terms in the current heat-statistics window;

4) taking the merged hot terms as center words and performing semantic extension on them to obtain hot information that expresses more content.

In step 3), same-topic merging is performed as follows: based on a word information vector matrix, same-topic words are merged according to the cosine similarity of their word vectors.

In step 4), the semantic extension of a term proceeds as follows: the context of the term is semantically tagged; keywords of the context are extracted according to preset grammar templates and combined into valid semantics based on the term; the frequency of each valid semantic based on the term is counted over the corpus; and, according to these frequencies, the valid semantics to be used as the semantic extension of the term are selected.

The preset grammar templates include the "subject-copula-predicative", "subject-predicate-object", "subject-predicate", and "subject-object" templates.

Compared with the prior art, the present invention has the following technical effects:

1. The invention can improve the accuracy of candidate-word mining in social networks.

2. The invention can yield semantics that express social network hot events more accurately.

3. The invention is better at discovering hot events in social networks that involve relatively few documents.

4. The invention can discover, earlier and more promptly, hot events that are still in the early stage of their outbreak in public opinion on social networks.

Brief description of the drawings

Embodiments of the present invention are described in detail below with reference to the accompanying drawing, in which:

Fig. 1 shows a flow chart of the hot information mining method according to one embodiment of the present invention.

Detailed description of the embodiments

As noted above, many major news stories and their latest developments now first appear on social networks, and the Internet has become the place of origin and the channel of dissemination for many hot social events. One object of the present invention is to discover the hot information (such as hot events) contained in massive social network documents by mining that data. The invention is further described below with reference to embodiments.

According to one embodiment of the present invention, a hot information mining method for massive social network documents based on self-feedback semantic extension (Self-Feedback Semantic Extension, hereinafter SSE) is proposed. Fig. 1 shows a flow chart of this method. With reference to Fig. 1, the method comprises the following steps:

Step 1: segment the historical corpus into words and build a term weight table. In this step, the massive social network documents to be analyzed are segmented to obtain each term and its corresponding term weight, which represents the weight of the word in the whole document set. In this embodiment, the TFIDF value is used as the term weight. TFIDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and data mining; it assesses how important a word is to a document set, or to one document in a corpus. The importance of a word increases in proportion to the number of times it appears in the document, and decreases in inverse proportion to the frequency with which it appears across the corpus. On the one hand, the TFIDF value filters out noise words to some extent; on the other hand, it can be obtained with a single pass over the corpus, which makes it well suited to computation on large-scale corpora.

In one embodiment, the long-text documents among the massive social network documents form a document set, and only the documents in this set are segmented and used to build the term weight table. The term weight table records each term and its term weight, i.e., its TFIDF value.
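Step 1 can be illustrated with a minimal sketch, assuming the documents have already been segmented into lists of terms (lexical items). The function name `build_term_weight_table` and the particular TFIDF variant (total occurrences of the term multiplied by log(N / document frequency)) are illustrative assumptions; the text above does not fix the exact variant.

```python
import math
from collections import Counter

def build_term_weight_table(docs):
    """Return {term: TFIDF weight} over a corpus of pre-segmented documents.

    docs: list of documents, each a list of terms (segmentation is assumed
    to have been done already by some word-segmentation tool).
    Assumed variant: weight = (total occurrences of the term) * log(N / df).
    """
    n_docs = len(docs)
    term_count = Counter()   # total occurrences of each term in the corpus
    doc_freq = Counter()     # number of documents containing each term
    for doc in docs:         # a single pass over the corpus
        term_count.update(doc)
        doc_freq.update(set(doc))
    return {
        term: term_count[term] * math.log(n_docs / doc_freq[term])
        for term in term_count
    }

history = [
    ["fire", "rescue", "news"],
    ["fire", "news"],
    ["tobacco", "news"],
]
table = build_term_weight_table(history)
```

In this toy corpus the term "news" occurs in every document and so receives weight zero, which is the noise-filtering behavior described above.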

Step 2: calculate the heat of each term in the current time window and select the several terms with the highest heat. If the importance of a term changes in the current corpus, its degree of fluctuation relative to the historical corpus can be detected, which in turn indicates the heat of the term in the corpus. A rise in a term's TFIDF value may mean either that the term is being used more frequently or that events related to the term are breaking out in a concentrated way. The fluctuation of the TFIDF value can therefore reflect the heat of the term. Moreover, because verbs, adjectives, and the like are not strongly coupled to hot events, their weights do not change much. Measuring term heat in this way therefore also further eliminates noise terms that are unrelated to hot events.

In one embodiment, the term heat F(k) is calculated according to formula (1).

where k is the term currently being counted, N is the number of documents in the current heat-statistics time window, c_k is the number of documents in that window that contain term k, and f_tk is the frequency of term k in document t. D is the update period of the term weight table, i.e., the unit time of term weight table statistics; W is the length of the heat-statistics time window; and base is the baseline weight of the term, i.e., the weight of the corresponding term in the current term weight table. One part of formula (1) can be regarded as the TFIDF value of term k in the current heat-statistics time window, and the other part is a correction factor related to the statistics time.

Thus, the baseline weight of a term describes the term's baseline activity on the network, and the activity of a term that belongs to current hot information is bound to be significantly higher than its baseline. This embodiment therefore proposes the SSE algorithm, which mines current hot words from the difference between a term's current activity, given by its weight in the statistics time window, and its baseline activity, and which also dynamically updates the baseline activity of each term to improve the accuracy of hot-word statistics (see Step 3).
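The idea of Step 2 can be sketched as follows. Formula (1) itself is not reproduced in this text, so the ratio form used here (window TFIDF weight, scaled by D/W, divided by the baseline weight) is an assumption standing in for it, not the patented formula. The function names and the `floor` parameter are likewise hypothetical.

```python
import math
from collections import Counter

def window_tfidf(window_docs):
    """TFIDF-style weight of each term inside one heat-statistics window."""
    n = len(window_docs)                 # N: documents in the window
    freq = Counter()                     # sum over documents t of f_tk
    doc_count = Counter()                # c_k: documents containing term k
    for doc in window_docs:
        freq.update(doc)
        doc_count.update(set(doc))
    return {k: freq[k] * math.log(n / doc_count[k]) for k in freq}

def term_heat(window_docs, base, d_period, w_length, floor=1e-9):
    """Assumed heat: window weight, time-scaled, relative to the baseline."""
    scale = d_period / w_length          # assumed time correction factor
    weights = window_tfidf(window_docs)
    # Terms absent from the baseline table fall back to `floor`,
    # so a previously unseen term gets a very large heat.
    return {
        k: weights[k] * scale / max(base.get(k, 0.0), floor)
        for k in weights
    }

def top_hot_terms(heat, n):
    """Rank terms by heat and keep the n hottest."""
    return sorted(heat, key=heat.get, reverse=True)[:n]

window = [
    ["quake", "news"],
    ["quake", "rescue", "news"],
    ["weather", "news"],
    ["news", "sports"],
]
base = {"quake": 0.1, "rescue": 1.0, "weather": 2.0, "sports": 2.0, "news": 0.5}
heat = term_heat(window, base, d_period=1.0, w_length=1.0)
hot = top_hot_terms(heat, 2)
```

Here "quake" has a window weight far above its baseline and ranks first, while "news", which appears in every document of the window, gets zero heat.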

Step 3: each time a statistics period elapses, update the term weight table once.

Term heat has a time attribute: over time, a persistently hot word can become a common term. For example, some Internet words (such as "gei li", literally "giving power") may go viral and become hot words within a short time, and after a while may be widely accepted by netizens and become commonly used terms. Such a term is then no longer current hot information, but because it has been widely accepted, its activity in the corpus may have risen markedly, and its original baseline weight is no longer suitable for describing its normal activity. Dynamically updating the baseline weight (i.e., baseline activity) of terms therefore helps obtain more accurate hot information.

In this embodiment, the TFIDF values of terms are computed over a given corpus to form the baseline term weight table, and the table is then updated with statistics on the current terms.

The update formula is shown in (2):

where D is the unit time of term weight table statistics, W is the length of the heat-statistics time window, base is the baseline term value, and d is the index of the round of term weight table statistics, so that d denotes the d-th round and d-1 the (d-1)-th round. D_i is the time interval between the i-th round of term weight table statistics and the preceding round. In this embodiment, the time interval between every two consecutive rounds is equal, and is therefore called the unit time of term weight table statistics. In other embodiments of the invention, the intervals between rounds may of course be unequal.

The above update method forms a closed loop in which the term table learns by itself, and since it does not depend on the word segmentation or on a specific domain, it has good scalability and portability.
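The self-updating baseline of Step 3 can be sketched as below. Formula (2) is not reproduced in this text, so a time-weighted running average is swapped in here purely as an assumption: each round contributes in proportion to its interval D_i. The function name `update_baseline` and its parameters are hypothetical.

```python
def update_baseline(base_prev, current, elapsed, d_interval):
    """Time-weighted running average of baseline weights (an assumed rule).

    base_prev:  {term: baseline weight} after the previous statistics round
    current:    {term: weight} measured in the round that just finished
    elapsed:    total time covered by earlier rounds (sum of earlier D_i)
    d_interval: length D_i of the round that just finished
    """
    total = elapsed + d_interval
    updated = {}
    for term in set(base_prev) | set(current):
        old = base_prev.get(term, 0.0)
        new = current.get(term, 0.0)
        updated[term] = (elapsed * old + d_interval * new) / total
    return updated

base = {"buzzword": 1.0, "news": 4.0}
base = update_baseline(
    base, {"buzzword": 5.0, "news": 4.0}, elapsed=3.0, d_interval=1.0
)
```

Under this assumed rule, a buzzword that stays active over several rounds sees its baseline drift upward, so it gradually stops registering as hot, which is the behavior the paragraph above describes.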

Step 4: merge same-topic terms.

Although the above process can generate candidate terms efficiently, a drawback of TFIDF is that it tends to produce several terms on the same topic. For example, for "Wang Jianlin overtakes Li Jiacheng to become Asia's new richest man", "Wang Jianlin" and "Li Jiacheng" belong to the same topic and are likely to become candidate terms at the same time. This embodiment merges same-topic words by means of the word information vector matrix V(w_i).

V(w_i) = (d_1(w_i), d_2(w_i), d_3(w_i), ..., d_n(w_i))    (3)

where n is the total number of texts obtained from text segmentation, w_i denotes word i, and d_j(w_i) is the weight of the j-th text in word vector i. The TFIDF weight can be used directly as this weight, but this is not the only option: in other embodiments of the invention, other types of weight can be applied depending on the usage scenario, such as 0-1 weights or TextRank weights. That is, d_j(w_i) here abstractly stands for a word vector weight.
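A minimal sketch of building V(w_i) with the 0-1 weights mentioned above, assuming pre-segmented documents. The function name `term_vectors` is hypothetical, and TFIDF or TextRank weights could be substituted for the 0-1 entries.

```python
def term_vectors(docs, terms):
    """Build V(w_i) for each term using 0-1 weights.

    docs:  list of pre-segmented documents (lists of terms)
    terms: candidate terms whose vectors are needed
    Entry j of a vector is 1.0 if the term occurs in document j, else 0.0.
    """
    doc_sets = [set(doc) for doc in docs]
    return {w: [1.0 if w in d else 0.0 for d in doc_sets] for w in terms}

docs = [
    ["Wang Jianlin", "Li Jiacheng", "richest"],
    ["Wang Jianlin", "richest"],
    ["tobacco", "regulations"],
]
vectors = term_vectors(docs, ["Wang Jianlin", "Li Jiacheng", "tobacco"])
```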

Cosine similarity is chosen as the distance formula, as follows:

where w_i and w_j denote word vectors i and j, and w_ik denotes the weight of the k-th document in word vector i. By calculating the similarity between candidate terms, terms whose similarity exceeds a certain threshold can be merged.
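The merging in Step 4 can be sketched with the standard cosine similarity, i.e. the dot product of two word vectors divided by the product of their Euclidean norms. The greedy grouping strategy, the threshold value 0.8, and the function names are assumptions made for illustration; the text only states that terms above a certain threshold are merged.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def merge_same_topic(vectors, threshold):
    """Greedily group terms whose vectors are similar to a group's first term.

    vectors: {term: [weight in document 1, ..., weight in document n]}
    Returns a list of groups (lists of terms), in input order.
    """
    groups = []
    for term, vec in vectors.items():
        for group in groups:
            if cosine(vectors[group[0]], vec) > threshold:
                group.append(term)
                break
        else:
            groups.append([term])   # no similar group found: start a new one
    return groups

vectors = {
    "Wang Jianlin": [0.9, 0.8, 0.0, 0.0],
    "Li Jiacheng":  [0.8, 0.9, 0.0, 0.0],
    "tobacco":      [0.0, 0.0, 0.7, 0.6],
}
groups = merge_same_topic(vectors, threshold=0.8)
```

The two names co-occur in the same documents, so their vectors are nearly parallel and they end up in one group, while "tobacco" stays separate.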

Step 5: perform semantic extension on the selected terms.

Because the grammar of social network corpora is non-standard, and the wording tends to be attention-grabbing rather than descriptive of the event, the present invention performs semantic matching on the content of the social network corpus, rather than on headlines as is done for traditional news corpora. The method of the invention extracts the contextual semantic words that extend a term by means of grammar templates. Although rule-based approaches are prone to template failure, this embodiment selects four fairly representative grammar templates according to how language is expressed: "subject-copula-predicative", "subject-predicate-object", "subject-predicate", and "subject-object". Because these are based on grammatical modes of expression rather than constructed for the expression characteristics of a specific domain, the method is more general and can be applied to most domains. The problem with simple general templates is that they generate many junk items, and some unexpected semantic extensions will be extracted. This embodiment therefore relies on a large text corpus, counts the frequency of each word-sense extension, and finally selects the most frequent expression among the extended semantics. The SSE algorithm first semantically tags the context of a candidate word, then extracts the keywords of the context by template matching and forms a valid semantic. By counting the valid semantics of the candidate term over the corpus, the frequent word-sense extensions are selected as the semantic extension of the term. Since the corpus is of a certain scale, by the law of large numbers the statistics can be considered very close to the true values.
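The template-then-frequency procedure of Step 5 can be sketched as follows, assuming the context sentences have already been tagged with grammatical roles by some external tagger. The role names ("subj", "cop", "predicative", "pred", "obj"), the in-order matching rule, and the function names are hypothetical choices for illustration, not the tagging scheme of the embodiment.

```python
from collections import Counter

# The four templates, written as ordered role sequences (role names assumed).
TEMPLATES = [
    ("subj", "cop", "predicative"),   # subject-copula-predicative
    ("subj", "pred", "obj"),          # subject-predicate-object
    ("subj", "pred"),                 # subject-predicate
    ("subj", "obj"),                  # subject-object
]

def match_template(tagged, template):
    """Return the words filling the template roles in order, or None."""
    words, pos = [], 0
    for word, role in tagged:
        if pos < len(template) and role == template[pos]:
            words.append(word)
            pos += 1
    return tuple(words) if pos == len(template) else None

def extend_term(term, tagged_sentences):
    """Pick the most frequent template match that contains the term."""
    counts = Counter()
    for tagged in tagged_sentences:
        for template in TEMPLATES:
            phrase = match_template(tagged, template)
            if phrase and term in phrase:
                counts[phrase] += 1
                break                 # first matching template wins
    return counts.most_common(1)[0][0] if counts else None

sentences = [
    [("Liu Xiang", "subj"), ("announces", "pred"), ("retirement", "obj")],
    [("Liu Xiang", "subj"), ("announces", "pred"), ("retirement", "obj"),
     ("today", "other")],
    [("Liu Xiang", "subj"), ("is", "cop"), ("legend", "predicative")],
]
extension = extend_term("Liu Xiang", sentences)
```

Two of the three sentences yield the same subject-predicate-object phrase, so that phrase is selected as the extension; the single copular sentence is outvoted, which mirrors how frequency counting suppresses junk matches.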

After semantic extension is complete, semantics that express social network hot events more accurately are obtained.

Further, to verify the technical effect of the present invention, the inventors carried out a comparative test, which is described below in three parts: the data set, the screening of hot terms, and word-sense extension.

Data set:

The data set used in the test is public WeChat data, specifically messages published by WeChat official accounts. About 20 million texts are collected per day, and the data of the most recent 20 days is stored. Each day there are about 8 million long texts and about 12 million short texts.

Candidate terms are calculated on the WeChat data within a given time window. The time window of the experiment is 24 hours, containing about 20 million texts in total, and candidate hot words are calculated on the 8 million long documents. In addition, the historical data of the preceding 15 days is used for baseline training of the term weights, serving as the term baseline data.

Nine topics were randomly selected from the 24 hours of data: "judicial protection of minors", "China's military strategy white paper", "Song Jianguo suspected of corruption", "Henan nursing home fire", "Yang Kezhang rescues people from a raging fire", "Doomsday Collapse", "Beijing tobacco control regulations", "South China Sea dispute", and "judicial handling of drug-related crimes by courts nationwide". The numbers of documents are 1244, 898, 66, 800, 227, 666, 661, 2000, and 251, respectively. To make the experiment realistic, 4348 documents were randomly selected from the data set as interference data. The final total number of documents is 10759. Before the experiment, the ICTCLASS2015 word segmentation tool was used to segment the document set, producing 55387 distinct terms in total. To ensure the accuracy of the test results, 5 evaluators were invited to manually annotate 100 randomly selected documents for each topic. In the end, 116 terms were determined to be candidate hot words, and the remaining 55271 terms are junk words.

Screening of hot terms:

The experimental data has 9 topics, and under normal circumstances 10 keywords can fully describe a topic, so the number of candidate words closely related to the topics is about 90. In production applications, the number of candidate words is generally kept within 30, because 1) too many terms more easily introduce junk, and 2) users care more about the higher-ranked terms than about all terms. In the experiment, the Top10, Top20, Top30, Top40, Top50, and Top60 terms by weight are selected as candidate words. Candidate terms are extracted with the SSE algorithm proposed in the foregoing embodiment and with the algorithm of the prior-art ICTCLASS keyword extraction module (ICT for short), and the accuracy of each is calculated. Here, accuracy is the percentage of candidate terms that are candidate hot words. Table 1 shows the candidate-word accuracy in the comparative experiment.

Table 1

The experiment shows that the average accuracy of the candidate words mined by the SSE algorithm is 10.2% higher than that of the candidate words given by ICT.
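The accuracy measure used in this comparison, the share of the Top-N candidate terms that are labeled candidate hot words, can be sketched as below. The function name and the toy data are illustrative only and are not taken from the experiment.

```python
def top_n_accuracy(ranked_terms, hot_words, n):
    """Share of the top-n ranked terms that are labeled candidate hot words."""
    top = ranked_terms[:n]
    if not top:
        return 0.0
    hits = sum(1 for term in top if term in hot_words)
    return hits / len(top)

ranked = ["tobacco control", "today", "white paper", "Song Jianguo", "share"]
labeled = {"tobacco control", "white paper", "Song Jianguo"}
acc = top_n_accuracy(ranked, labeled, 4)
```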

Next, the quality of the candidate words of the two algorithms is analyzed through the distribution of topic words. A topic word is the subject of each topic sentence; it not only describes the event more precisely but also helps the subsequent word-sense extension restore the topic. Topic words can therefore reflect candidate-word quality. The 9 topic words are "minors", "white paper", "Song Jianguo", "nursing home", "Yang Kezhang", "doomsday", "tobacco control", "South China Sea", and "drugs". Table 2 shows the number of topic words hit in the comparative experiment.

Table 2

According to the experimental results, the SSE algorithm mines all 9 topic words within its Top50 candidate words, whereas ICT mines only 8 topic words within its Top60 candidate words and does not find them all. Moreover, at every stage the number of topic words mined by the SSE algorithm is greater than or equal to that of ICT. This shows that the quality of the SSE candidate words is better than that of ICT.

The ICT algorithm does not mine the candidate word "Song Jianguo". This topic has only 66 documents, 0.6% of the whole document set, making it a small-document-set topic. "Song Jianguo" nevertheless appears in the Top50 of the SSE candidate words, which shows that the SSE algorithm can mine hot words more effectively. Especially on massive data sets, mining the data in depth to uncover potential hot spots makes it possible to track hot events effectively.

Word-sense extension:

The Top30 candidate words of the SSE algorithm are selected for word-sense extension. Related words are first merged by means of word vectors, and then a representative term is chosen for each group of merged words by semantic tagging and term weight. The representative term is usually a noun subject; if several noun subjects are present, the one with the highest weight is chosen. The representative terms selected are "minors", "white paper", "Song Jianguo", "nursing home", "Yang Kezhang", "doomsday", "tobacco control", "South China Sea", and "drugs". Next, extended word senses are computed with the part-of-speech-based semantic extension method of the present invention and with the Nagao string-frequency method, the topic sentences of the 9 topics are used as reference answers, and similarity is calculated by edit distance.

Calculation formula

In formula (5), T is the topic sentence, S is the candidate sentence, L(T) is the length of the topic sentence, and Distance(S, T) is the edit distance between S and T, which is normalized by dividing by L(T).
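The edit-distance comparison can be sketched as below. Formula (5) is not reproduced in this text, so the similarity form 1 - Distance(S, T) / L(T) is an assumption consistent with the normalization described above. The edit distance itself is the standard Levenshtein distance computed by dynamic programming.

```python
def edit_distance(s, t):
    """Levenshtein distance between sequences s and t (dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(s, t):
    """Assumed form: 1 - Distance(S, T) / L(T), with T the topic sentence."""
    return 1.0 - edit_distance(s, t) / len(t)

same = similarity("abcd", "abcd")
one_off = similarity("abxd", "abcd")
```

Under this assumed form, an exact match scores 1.0, and a candidate much longer than the topic sentence can score below zero, since the distance is normalized only by L(T).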

Table 3 shows the edit distance between the extended semantics based on the SSE algorithm of the present invention and the topic sentences. Table 4 shows the edit distance between the extended semantics based on the existing Nagao algorithm and the topic sentences.

Table 3

Table 4

Comparing the two experimental results, the average similarity between the topic sentences and the semantics extended by the present algorithm is 36.2% higher than that of the Nagao algorithm. In terms of time complexity, the present algorithm is based on semantic tagging, so its time complexity is O(N²); Nagao counts string frequencies, and every Chinese character can serve as a terminating item, so its time complexity is also O(N²). At the same complexity, the present algorithm is therefore better suited to semantic extension on social network data.

In the specific extensions, the f(S, T) values of the Nagao algorithm for the semantic extensions of the terms "white paper" and "nursing home" are relatively low, resulting in poor readability. Analysis of the WeChat data related to "white paper" shows that the topics of many articles are worded as a call to action; "China's military white paper is released, come and take a look" is a fairly typical example. Extending semantics through high-frequency strings therefore introduces many non-standard expressions. Since such non-standard expressions are the norm in the data set, the approach of finding high-frequency strings is not suitable for social data such as WeChat. The algorithm of the present invention sets grammatical rules and extracts semantics by means of templates, which ensures that the extracted semantic results are of higher quality, so performing high-frequency statistics afterwards is more reasonable.

Finally, it should be noted that the above embodiments are intended only to describe the technical solution of the present invention and not to limit it. The invention can be extended in application to other modifications, variations, applications, and embodiments, and all such modifications, variations, applications, and embodiments are therefore considered to fall within the spirit and teaching of the invention.

Claims

1. A hot information mining method based on social network documents, characterized by comprising the following steps:

1) obtaining the heat of a term in a heat-statistics window according to the degree to which the term's weight in the heat-statistics window fluctuates relative to the term's baseline weight in the corpus;

wherein, in step 1), the weight of the term in the heat-statistics window is the term's TFIDF weight in the heat-statistics window, the baseline weight of the term in the total corpus is the term's TFIDF weight in the total corpus, and the baseline weight of each term is the current baseline weight obtained by dynamic updating at a certain statistics period;

for any term k, the heat F(k) is calculated as follows:

where N is the number of documents in the current heat-statistics time window, c_k is the number of documents in that window that contain term k, f_tk is the frequency of term k in document t, D is the update period of the term weights, W is the length of the heat-statistics time window, and base(k) is the baseline weight of term k;

2. The hot information mining method based on social network documents according to claim 1, characterized in that, in step 1), the current baseline weight is obtained from a dynamically updated baseline weight table, and the current baseline weight of each term k in the baseline weight table is calculated as follows:

where D is the unit time of term weight table statistics, W is the length of the heat-statistics time window, base is the baseline term value, the subscript d is the index of the current round of term weight table statistics, and D_i is the time interval between the i-th round of term weight table statistics and the preceding round.

3. The hot information mining method based on social network documents according to claim 1, characterized by further comprising the step of:

3) taking the merged hot terms as center words and performing semantic extension on them to obtain hot information that expresses more content.

4. The hot information mining method based on social network documents according to claim 3, characterized by further comprising, between step 2) and step 3), the step of:

30) merging same-topic terms among the hot terms in the current heat-statistics window.

5. The hot information mining method based on social network documents according to claim 4, characterized in that, in step 30), same-topic merging is performed as follows: based on a word information vector matrix, same-topic words are merged according to the cosine similarity of their word vectors.

6. The hot information mining method based on social network documents according to claim 3, characterized in that, in step 3), the semantic extension of a term proceeds as follows: the context of the term is semantically tagged; keywords of the context are extracted according to preset grammar templates and combined into valid semantics based on the term; the frequency of each valid semantic based on the term is counted over the corpus; and, according to these frequencies, the valid semantics to be used as the semantic extension of the term are selected.

7. The hot information mining method based on social network documents according to claim 6, characterized in that the preset grammar templates include the "subject-copula-predicative", "subject-predicate-object", "subject-predicate", and "subject-object" templates.