CN106503256B - A hot information mining method based on social network documents - Google Patents
A hot information mining method based on social network documents
Info
- Publication number
- CN106503256B CN106503256B CN201611005521.1A CN201611005521A CN106503256B CN 106503256 B CN106503256 B CN 106503256B CN 201611005521 A CN201611005521 A CN 201611005521A CN 106503256 B CN106503256 B CN 106503256B
- Authority
- CN
- China
- Prior art keywords
- lexical item
- hot
- weight
- statistics
- social networks
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Economics (AREA)
- Marketing (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Human Resources & Organizations (AREA)
- Computing Systems (AREA)
- Primary Health Care (AREA)
- Strategic Management (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a hot information mining method based on social network documents, comprising the following steps: 1) obtaining the heat of a lexical item within a heat statistics window according to the degree of fluctuation of the item's weight in that window relative to its baseline weight in the corpus; 2) obtaining the hot lexical items in the current heat statistics window based on the heat ranking of the lexical items. The invention improves the accuracy of candidate-word mining in social networks and yields semantics that more accurately express hot events in social networks.
Description
Technical field
The present invention relates to the field of data mining, and in particular to a hot information mining method based on social network documents.
Background art
With the arrival of Web 2.0, the way netizens participate in the Internet has shifted from passively browsing and receiving information to actively producing and spreading it, and this has profoundly changed both the network environment and the way information propagates online. Social applications such as microblogs, blogs, and WeChat have become an important part of the lives of hundreds of millions of netizens.
Today the publication and latest developments of much major news originate from social networks, so the Internet has become the birthplace and communication channel of many social hot events; recent examples such as "Liu Xiang retires" and "Wang Jianlin overtakes Li Ka-shing as Asia's new richest man" became hot events because of the high attention they drew from netizens. The Internet is also a channel through which government departments learn about public opinion, and an important public-opinion arena that governments attach importance to in the new era. For example, netizens have exposed corrupt officials through the Internet, a new form of mass anti-corruption supervision in the Internet age. Only by discovering, following up, and handling such matters in time can the government fully grasp public opinion and the people's condition and guide the healthy development of the Internet.
Unlike traditional news media, Internet hot events are characterized by rapid outbreaks, massive data volumes, and unstandardized content. How to capture hot events as quickly as possible from massive, irregular information and express their core content efficiently with easily understandable language elements has therefore become a popular research problem in recent years.
Current mainstream hot spot mining algorithms mainly rely on the set of core vocabulary (hot words) of a hot event, together with rules, to represent the event. Hot word mining is the crucial part of hot event discovery: it reflects the issues and things that people commonly care about within a certain time and space, and its accuracy has a significant impact on hot event discovery.
On the other hand, because the expressive power of a single word is limited, one word often abstracts only one aspect of an event and cannot effectively express the real meaning of the hot event behind the hot word. For example, "Liu Xiang" or "Wang Jianlin" alone hardly allows people to accurately recover the real event behind them. Therefore, to discover hot events, it is sometimes also necessary to perform word-sense extension (also called semantic extension) on the hot words.
Regarding word-sense extension, one existing scheme extracts affixes such as prefixes and then identifies numbers and proper nouns according to syntactic rules, but this method is rather limited and its effect is concentrated on recognizing obvious concrete nouns. Another existing scheme builds a three-character word-formation template, trains on the filtered three-character strings with a neural network algorithm, and then filters with statistical features to obtain hot words. This method is fast, but it only works for words that follow specific patterns.
Although the above prior-art schemes can perform semantic extension to a certain extent, they are not fully suitable for analyzing massive social network data, because social network data has the following characteristics: 1) its grammar is unstandardized and it contains a great deal of junk noise; 2) it lacks effective editing, so there is no summary of the central event. Traditional word-sense extension is mostly based on news corpora; since news "titles" themselves have strong expressive power, such corpora enjoy a natural advantage and relatively high quality. When facing a non-news social network corpus whose text quality is poor and which has no effective "central event title", the effect of the aforementioned methods is considerably impaired.
Summary of the invention
Therefore, an object of the present invention is to provide a solution that can more accurately mine the hot information of social network documents.
According to one aspect of the invention, a hot information mining method based on social network documents is provided, comprising the following steps:
1) obtaining the heat of a lexical item within a heat statistics window according to the degree of fluctuation of the item's weight in the heat statistics window relative to its baseline weight in the overall corpus;
2) obtaining the hot lexical items in the current heat statistics window based on the heat ranking of the lexical items.
In step 1), the weight of a lexical item in the heat statistics window is its TFIDF weight within that window, and its baseline weight in the overall corpus is its TFIDF weight over the overall corpus.
In step 1), the baseline weight of each lexical item is a current baseline weight that is dynamically updated according to a certain statistics period.
In step 1), for any lexical item k, the heat F(k) is computed as follows, where N denotes the number of documents in the current heat statistics time window, c_k denotes the number of documents in the current window that contain lexical item k, f_tk denotes the frequency of lexical item k in document t, D denotes the update period of the lexical item weights, W denotes the length of the heat statistics time window, and base(k) denotes the baseline weight of lexical item k.
In step 1), the current baseline weight is obtained from a dynamically updated baseline weight table. The current baseline weight of each lexical item k in the baseline weight table is computed as follows, where D denotes the unit time of the lexical item weight table statistics, W denotes the length of the heat statistics time window, base denotes the baseline lexical item value, the subscript d denotes the current round of lexical item weight table statistics, and D_i denotes the time interval between the i-th round of weight table statistics and the previous round.
The hot information mining method based on social network documents may further comprise the step of:
3) merging the hot lexical items in the current heat statistics window by topic.
The method may further comprise the step of:
4) performing semantic extension with the merged hot lexical items as central words, to obtain hot information that expresses richer content.
In step 3), the topic merging method is: based on a word-document vector matrix, merging lexical items of the same topic according to the cosine similarity of their word vectors.
In step 4), the semantic extension method for a lexical item is as follows: perform semantic tagging on the context of the lexical item, extract the keywords of the context according to preset grammar templates, and form effective semantics based on the lexical item; count the frequency of the various effective semantics based on the lexical item in the corpus, and select, according to the frequency, which effective semantics is taken as the semantic extension of the lexical item.
The preset grammar templates include: "subject-copula-predicative", "subject-verb-object", "subject-verb" and "subject-object" templates.
Compared with the prior art, the present invention has the following technical effects:
1. The invention improves the accuracy of candidate-word mining in social networks.
2. The invention obtains semantics that more accurately express hot events in social networks.
3. The invention can better discover hot events in social networks that involve relatively few documents.
4. The invention can discover, earlier and more timely, hot events that are still in the early outbreak stage of public opinion in social networks.
Brief description of the drawings
Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings, in which:
Fig. 1 shows the flowchart of the hot information mining method of one embodiment of the invention.
Detailed description of embodiments
As noted above, the publication and latest developments of much major news now originate from social networks, and the Internet has become the birthplace and communication channel of many social hot events. One object of the present invention is to discover the hot information (for example hot events) contained in massive social network documents through data mining. The invention is further described below with reference to embodiments.
According to one embodiment of the invention, a hot information mining method for massive social network documents with self-feedback semantic extension (Self-Feedback Semantic Extension, hereinafter SSE) is proposed. Fig. 1 shows the flowchart of this hot information mining method. With reference to Fig. 1, the method comprises the following steps:
Step 1: Segment the historical corpus and build a lexical item weight table. In this step, the massive social network documents to be analyzed are segmented into words, and each lexical item and its corresponding weight are obtained. The lexical item weight indicates the weight of the word over the entire document set. In the present embodiment, the TFIDF value is used as the lexical item weight. TFIDF (term frequency - inverse document frequency) is a common weighting technique in information retrieval and data mining; it evaluates how important a word is to a document in a document set or corpus. The importance of a word increases proportionally with the number of times it appears in a document, but decreases inversely with its frequency in the corpus. On one hand, the TFIDF value can filter out noise words to a certain extent; on the other hand, it can be obtained with a single pass over the corpus, which makes it suitable for computation over large-scale corpora.
In one embodiment, only the long-text documents among the massive social network documents are put into the document set, and only the documents in this set are segmented and used to build the lexical item weight table. The lexical item weight table records each lexical item and its weight, i.e. its TFIDF value.
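The embodiment does not give code for building the weight table; the following is a minimal sketch of how such a TFIDF-based lexical item weight table could be computed, assuming token lists stand in for the output of the Chinese word segmenter mentioned above (function and variable names are illustrative, not taken from the patent).

```python
import math
from collections import Counter

def build_weight_table(documents):
    """Build a lexical item weight table: term -> TFIDF value over the corpus.

    `documents` is a list of token lists (output of a word segmenter).
    TF here is the total frequency of the term in the corpus; IDF is
    log(N / document frequency). Other TF/IDF variants would also fit
    the description in the text.
    """
    n_docs = len(documents)
    term_freq = Counter()   # total occurrences of each term
    doc_freq = Counter()    # number of documents containing each term
    for tokens in documents:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    return {
        term: term_freq[term] * math.log(n_docs / doc_freq[term])
        for term in term_freq
    }

# Usage: weights = build_weight_table([doc1_tokens, doc2_tokens, ...])
```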
Step 2: Compute the heat of each lexical item within the current time window and select the top-ranked items. If the importance of a lexical item changes under the current corpus, the degree of its fluctuation relative to the historical corpus can be detected, which in turn indicates the heat of the item in the corpus. A rise in a lexical item's TFIDF value may mean that its usage frequency has increased, or that events related to the item have broken out in a concentrated way; the fluctuation of the TFIDF value therefore reflects the heat of the item. Moreover, because verbs, adjectives and the like are not tightly coupled with any particular hot event, their weights do not change much, so noise words unrelated to hot events can be further eliminated while counting lexical item heat in this way.
In one embodiment, the lexical item heat F(k) is computed according to formula (1), where k is the lexical item currently being counted, N is the number of documents in the current heat statistics time window, c_k is the number of documents in the current window that contain lexical item k, f_tk is the frequency of lexical item k in document t, D is the update period of the lexical item weight table (i.e. the unit time of the weight table statistics), W is the length of the heat statistics time window, and base(k) is the baseline weight of lexical item k, i.e. the weight of the item in the current lexical item weight table. One factor of the formula can be regarded as the TFIDF value of lexical item k in the current heat statistics time window, and another factor is a correction factor related to the statistics time.
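Formula (1) itself is not reproduced in this text. Based on the variable definitions above, and on the later statement that the heat is the difference between the item's current activity and its baseline activity, one plausible form is the sketch below; it is an assumption for illustration, not the exact claimed formula.

```latex
% Assumed reconstruction of formula (1): window TFIDF minus a time-scaled baseline.
F(k) \;=\; \underbrace{\Bigl(\sum_{t=1}^{N} f_{tk}\Bigr)\,\log\frac{N}{c_k}}_{\text{TFIDF of } k \text{ in the current window}}
\;-\; \underbrace{\frac{W}{D}}_{\text{time correction}}\,\mathrm{base}(k)
```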
In this way, the baseline weight of a lexical item describes its baseline activity level in the network, and a lexical item belonging to current hot information is necessarily significantly more active than its baseline value. The present embodiment therefore proposes the SSE algorithm, which mines current hot words from the difference between an item's current activity (its weight within the statistics time window) and its baseline activity, while also dynamically updating the baseline activity of each item to improve the accuracy of hot word statistics (see step 3).
Step 3: Update the lexical item weight table once every statistics period.
Lexical item heat has a time attribute: as time passes, a persistent hot word may become an ordinary lexical item. For example, some Internet buzzwords (such as "geili") may go viral and become hot words in a short time, and after a while the word may be generally accepted by netizens and become a commonly used ordinary item. At that point the item is no longer current hot information, but because it has been widely accepted, its activity in the corpus may rise markedly, and the original baseline weight is no longer suitable for describing the item's normal level of activity. Dynamically updating the baseline weight (i.e. baseline activity) of lexical items therefore helps to obtain more accurate hot information.
In the present embodiment, the TFIDF values of lexical items counted over a certain corpus serve as the baseline lexical item weight table, and the baseline table is updated by counting the current lexical items.
The update formula is shown in (2), where D denotes the unit time of the lexical item weight table statistics, W denotes the length of the heat statistics time window, base denotes the baseline lexical item value, d denotes the round of weight table statistics (i.e. d denotes the d-th round and d-1 the (d-1)-th round), and D_i denotes the time interval between the i-th round of weight table statistics and the previous round. In the present embodiment the time intervals between successive rounds of weight table statistics are equal, and this interval is therefore called the unit time of the weight table statistics; in other embodiments of the invention the intervals between rounds may also be unequal.
This update method forms a closed self-learning loop for the lexical item table, and because it does not depend on the word segmenter or on a specific domain, it has good scalability and portability.
Step 4: Merge lexical items by topic.
Although the above process can efficiently generate candidate lexical items, a drawback of TFIDF is that it tends to produce several items of the same topic. For example, for "Wang Jianlin overtakes Li Ka-shing as Asia's new richest man", "Wang Jianlin" and "Li Ka-shing" are likely to become candidate items of the same topic at the same time. The present embodiment merges items of the same topic through the word-document vector matrix V(w_i):
V(w_i) = (d_1(w_i), d_2(w_i), d_3(w_i), ..., d_n(w_i))    (3)
where n is the total number of texts extracted in the text segmentation, w_i denotes word i, and d_j(w_i) denotes the weight of word i in the j-th text. The TFIDF weight can be used directly here, but this is not the only option; in other embodiments of the invention different kinds of weights, such as 0-1 weights or TextRank weights, can be applied depending on the usage scenario. In other words, d_j(w_i) here abstractly represents the weight in the word vector.
The cosine similarity is chosen as the distance measure. Here w_i and w_j denote the word vectors of items i and j, and w_ik denotes the weight of the k-th document in word vector i. By computing the similarity between candidate lexical items, items whose similarity exceeds a certain threshold are merged.
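The following is a minimal sketch of this topic-merging step, using plain cosine similarity over the word-document vectors; the threshold value and function names are illustrative assumptions, not taken from the patent.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors (rows of V)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_same_topic(word_vectors, threshold=0.7):
    """Group candidate lexical items whose document vectors are similar.

    `word_vectors` maps each candidate word to its vector
    V(w_i) = (d_1(w_i), ..., d_n(w_i)); the 0.7 threshold is an assumed value.
    Returns a list of merged groups, each group sharing one topic.
    """
    groups = []
    for w in word_vectors:
        for group in groups:
            if cosine(word_vectors[w], word_vectors[group[0]]) >= threshold:
                group.append(w)
                break
        else:
            groups.append([w])
    return groups
```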
Step 5: Perform semantic extension on the selected lexical items.
Because the grammar of social network corpora is unstandardized and their expression is strongly attention-seeking rather than event-descriptive, the present invention performs semantic matching on the content of the social network corpus, rather than on titles as with a traditional news media corpus. The method of the invention extracts the context words that extend a lexical item by means of grammar templates. Although rule-based approaches are prone to template failure, the present embodiment selects four fairly representative grammar templates according to common modes of linguistic expression, namely "subject-copula-predicative", "subject-verb-object", "subject-verb" and "subject-object". Because these templates follow general grammatical expression patterns rather than being constructed for the expression characteristics of a specific domain, the approach is more general and can be applied in most domains. The problem with simple general-purpose templates, however, is that they generate many junk items, and some unintended semantic extensions are extracted. The present embodiment therefore relies on the support of a large text corpus: it counts the frequency of each word-sense extension and finally selects the most frequent expression from among the extended semantics. The SSE algorithm first performs semantic tagging on the context of a candidate word, then extracts the keywords of the context according to template matching and forms an effective semantics; by counting the effective semantics of the candidate item over the corpus, the frequent word sense is selected as the semantic extension of the item. Because the corpus has a certain scale, by the law of large numbers the statistics can be considered very close to the true values.
After semantic extension is completed, semantics that more accurately express the hot events of the social network are obtained.
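The embodiment describes the template-matching and frequency-counting logic only in prose. The sketch below illustrates the idea under simplifying assumptions: the part-of-speech tags stand in for the semantic tagging step, and the "templates" are reduced to short tag patterns around the candidate word; none of these names, tags or patterns come from the patent.

```python
from collections import Counter

# Hypothetical POS patterns standing in for the four grammar templates
# ("subject-copula-predicative", "subject-verb-object",
#  "subject-verb", "subject-object").
TEMPLATES = [("n", "v", "n"), ("n", "v"), ("n", "vlink", "a"), ("n", "n")]

def match_template(tagged_window, templates=TEMPLATES):
    """Return the window's words as an effective semantics if its POS
    sequence matches one of the (assumed) templates, otherwise None."""
    tags = tuple(tag for _, tag in tagged_window)
    return tuple(word for word, _ in tagged_window) if tags in templates else None

def semantic_extension(candidate, tagged_sentences, window=3):
    """Count effective semantics around `candidate` and return the most frequent one.

    `tagged_sentences` is a list of [(word, pos_tag), ...] produced by some
    tagger (assumed); each template match around the candidate word is
    counted as one effective semantics.
    """
    counts = Counter()
    for sent in tagged_sentences:
        for i, (word, _) in enumerate(sent):
            if word == candidate:
                span = sent[max(0, i - 1): i - 1 + window]  # small context window
                sem = match_template(span)
                if sem:
                    counts[sem] += 1
    return counts.most_common(1)[0][0] if counts else None
```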
Further, in order to verify the technical effects of the present invention, the inventors carried out comparative experiments, described below in three parts: the data set, the screening of hot lexical items, and word-sense extension.
Data set:
The data set selected for the experiments is WeChat public data, specifically messages published by WeChat public accounts. The daily collection volume is about 20 million texts, and the data of the most recent 20 days is stored; the daily long-text volume is about 8 million texts and the short-text volume about 12 million.
Candidate lexical items were computed for the WeChat data within a certain time window. The time window of the experiment was 24 hours, containing about 20 million texts in total, of which the candidate hot words were computed from the 8 million long documents. In addition, the historical data of the previous 15 days was used to train the baseline lexical item weights as the reference data.
Nine topics were randomly selected from the 24-hour data: "judicial protection of minors", "Chinese military strategy white paper", "Song Jianguo suspected of corruption", "Henan nursing home fire", "Yang Kezhang rescues people from a raging fire", "doomsday collapse", "Beijing tobacco-control regulations", "South China Sea dispute" and "judicial administration of drug crimes by courts nationwide". The respective numbers of documents are 1244, 898, 66, 800, 227, 666, 661, 2000 and 251. To make the experiment realistic, 4348 documents were randomly drawn from the data set as interference data, giving a final total of 10759 documents. Before the experiment, the document set was segmented with the ICTCLASS2015 segmentation tool, producing 55387 distinct lexical items. To ensure the accuracy of the test results, five evaluators were invited to manually annotate 100 randomly selected documents for each topic; 116 lexical items were finally determined to be candidate hot words, and the remaining 55271 items are junk words.
Screening of hot lexical items:
The experimental data has 9 topics, and under normal circumstances 10 keywords can fully describe one topic, so the number of candidate words closely related to the topics is around 90. Moreover, in production applications the number of candidate words is generally kept within 30, because 1) too many lexical items more easily introduce junk, and 2) users care more about the higher-ranked items than about all items. In the experiment, the Top10, Top20, Top30, Top40, Top50 and Top60 lexical items by weight were selected as candidate words, extracted respectively by the SSE algorithm proposed in the foregoing embodiment and by the keyword extraction module of the prior-art ICTCLASS tool (referred to as ICT for short), and the accuracy (Accuracy) was computed, where accuracy is the percentage of candidate hot words among the candidate lexical items. Table 1 shows the candidate-word accuracy in the comparative experiment.
Table 1
The experiments show that the average accuracy of the candidate words mined by the SSE algorithm is 10.2% higher than that of the candidate words given by ICT.
Next, the quality of the candidate words of the two algorithms is analysed through the distribution of topic words. A topic word is the subject of a topic sentence; topic words not only describe an event more accurately but also help the subsequent word-sense extension to restore the topic, so topic words reflect candidate-word quality. The 9 topic words are "minors", "white paper", "Song Jianguo", "nursing home", "Yang Kezhang", "doomsday", "tobacco control", "South China Sea" and "drugs". Table 2 shows the number of topic words hit in the comparative experiment.
Table 2
From the experimental results, with Top50 candidate words the SSE algorithm mines all 9 topic words, whereas with Top60 candidate words ICT mines only 8 topic words and does not find them all. Moreover, the number of topic words mined by the SSE algorithm at every stage is greater than or equal to that of ICT, which shows that the quality of the SSE candidate words is better than that of ICT.
In particular, the ICT algorithm does not mine the candidate word "Song Jianguo". That topic has only 66 documents, 0.6% of the whole document set, i.e. a small-document-set topic, yet "Song Jianguo" appears in the Top50 of the SSE candidate words. This proves that the SSE algorithm can mine hot words more effectively; especially on massive data sets, mining the data in depth and uncovering latent hot spots enables effective tracking of hot events.
Word-sense extension:
The Top30 candidate words of the SSE algorithm are used for word-sense extension. Related words are first merged through word vectors, and the representative lexical item of the merged words is then selected by semantic tagging and lexical item weight. The representative item is usually a noun subject; if several noun subjects coexist, the item with the highest weight is chosen as the representative. The representative items screened out are "minors", "white paper", "Song Jianguo", "nursing home", "Yang Kezhang", "doomsday", "tobacco control", "South China Sea" and "drugs". Next, the extended word senses are computed both with the part-of-speech-based semantic extension method of the present invention and with the Nagao string-frequency method, the topic sentences of the 9 topics are used as reference answers, and the similarity is computed by edit distance according to formula (5), where T denotes the topic sentence, S denotes the candidate sentence, L(T) denotes the length of the topic sentence, and Distance(S, T) computes the edit distance between S and T, normalized by dividing by L(T).
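Formula (5) itself is not reproduced here. A minimal sketch of the described evaluation, assuming the similarity is simply one minus the length-normalized edit distance (an assumption; the exact formula is not shown in this text):

```python
def edit_distance(s, t):
    """Standard Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

def similarity(candidate, topic_sentence):
    """Assumed reading of formula (5): 1 - Distance(S, T) / L(T)."""
    return 1.0 - edit_distance(candidate, topic_sentence) / len(topic_sentence)
```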
Table 3 shows the edit distances between the topic sentences and the semantics extended by the SSE algorithm of the invention. Table 4 shows the edit distances between the topic sentences and the semantics extended by the existing Nagao algorithm.
Table 3
Table 4
Comparing the two experimental results, the average similarity between the extended semantics of this algorithm and the topic sentences is 36.2% higher than that of the Nagao algorithm. In terms of time complexity, this algorithm is based on semantic tagging, so its time complexity is O(N²); Nagao counts string frequencies and every Chinese character can serve as a terminating item, so its time complexity is also O(N²). At the same complexity, this algorithm is therefore better suited to semantic extension on social network data.
In the concrete extensions, the f(s, t) values of the Nagao algorithm for the semantic extension of the items "white paper" and "nursing home" are relatively low, resulting in poor readability. Analysis of the WeChat data related to the white paper shows that the expressions of many article topics all contain a certain call to action; for example, "China's military white paper is released, everybody come and have a look" is a fairly typical expression. Extending semantics through high-frequency strings therefore introduces many non-standard expressions, and since such non-standard expressions are the norm in the data set, the method of finding high-frequency strings is not suitable for social data such as WeChat. The algorithm of the invention extracts semantics in a template-based way by setting grammar rules, which ensures that the extracted semantic results are of higher quality, so that the subsequent high-frequency statistics are more reasonable.
Finally, it should be noted that the above embodiments are only used to describe the technical solution of the present invention and not to limit it; the present invention can be extended to other modifications, variations, applications and embodiments, and all such modifications, variations, applications and embodiments are therefore considered to be within the spirit and teaching scope of the present invention.
Claims (7)
1. A hot information mining method based on social network documents, characterized in that it comprises the following steps:
1) obtaining the heat of a lexical item within a heat statistics window according to the degree of fluctuation of the item's weight in the heat statistics window relative to its baseline weight in the corpus;
in step 1), the weight of the lexical item in the heat statistics window is its TFIDF weight within the heat statistics window, the baseline weight of the lexical item in the overall corpus is its TFIDF weight over the overall corpus, and the baseline weight of each lexical item is a current baseline weight dynamically updated according to a certain statistics period;
for any lexical item k, the heat F(k) is computed as follows, wherein N denotes the number of documents in the current heat statistics time window, c_k denotes the number of documents in the current window that contain lexical item k, f_tk denotes the frequency of lexical item k in document t, D denotes the update period of the lexical item weights, W denotes the length of the heat statistics time window, and base(k) denotes the baseline weight of lexical item k;
2) obtaining the hot lexical items in the current heat statistics window based on the heat ranking of the lexical items.
2. The hot information mining method based on social network documents according to claim 1, characterized in that in step 1) the current baseline weight is obtained from a dynamically updated baseline weight table, and the current baseline weight of each lexical item k in the baseline weight table is computed as follows, wherein D denotes the unit time of the lexical item weight table statistics, W denotes the length of the heat statistics time window, base denotes the baseline lexical item value, the subscript d denotes the current round of weight table statistics, and D_i denotes the time interval between the i-th round of weight table statistics and the previous round.
3. The hot information mining method based on social network documents according to claim 1, characterized in that it further comprises the step of:
3) performing semantic extension with the merged hot lexical items as central words, to obtain hot information that expresses richer content.
4. The hot information mining method based on social network documents according to claim 3, characterized in that it further comprises, between step 2) and step 3), the step of:
30) merging the hot lexical items in the current heat statistics window by topic.
5. The hot information mining method based on social network documents according to claim 4, characterized in that in step 30) the topic merging method is: based on a word-document vector matrix, merging lexical items of the same topic according to the cosine similarity of their word vectors.
6. The hot information mining method based on social network documents according to claim 3, characterized in that in step 3) the semantic extension method for a lexical item is as follows: perform semantic tagging on the context of the lexical item, then extract the keywords of the context according to preset grammar templates and form effective semantics based on the lexical item, count the frequency of the various effective semantics based on the lexical item in the corpus, and select, according to the frequency, which effective semantics is taken as the semantic extension of the lexical item.
7. The hot information mining method based on social network documents according to claim 6, characterized in that the preset grammar templates include: "subject-copula-predicative", "subject-verb-object", "subject-verb" and "subject-object" templates.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611005521.1A CN106503256B (en) | 2016-11-11 | 2016-11-11 | A hot information mining method based on social network documents
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611005521.1A CN106503256B (en) | 2016-11-11 | 2016-11-11 | A hot information mining method based on social network documents
Publications (2)
Publication Number | Publication Date |
---|---|
CN106503256A CN106503256A (en) | 2017-03-15 |
CN106503256B true CN106503256B (en) | 2019-05-07 |
Family
ID=58324577
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611005521.1A Active CN106503256B (en) | A hot information mining method based on social network documents | |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106503256B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107122413B (en) * | 2017-03-31 | 2020-04-10 | 北京奇艺世纪科技有限公司 | Keyword extraction method and device based on graph model |
CN110019771B (en) * | 2017-07-28 | 2021-08-13 | 北京国双科技有限公司 | Text processing method and device |
CN109542545B (en) * | 2017-09-22 | 2022-07-29 | 北京国双科技有限公司 | Hot word display method and device |
CN110750682B (en) * | 2018-07-06 | 2022-08-16 | 武汉斗鱼网络科技有限公司 | Title hot word automatic metering method, storage medium, electronic equipment and system |
CN109800431B (en) * | 2019-01-23 | 2020-07-28 | 中国科学院自动化研究所 | Event information keyword extracting and monitoring method and system and storage and processing device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101296128A (en) * | 2007-04-24 | 2008-10-29 | 北京大学 | Method for monitoring abnormal state of internet information |
CN102609436A (en) * | 2011-12-22 | 2012-07-25 | 北京大学 | System and method for mining hot words and events in social network |
CN103699663A (en) * | 2013-12-27 | 2014-04-02 | 中国科学院自动化研究所 | Hot event mining method based on large-scale knowledge base |
CN104679738A (en) * | 2013-11-27 | 2015-06-03 | 北京拓尔思信息技术股份有限公司 | Method and device for mining Internet hot words |
CN105718598A (en) * | 2016-03-07 | 2016-06-29 | 天津大学 | AT based time model construction method and network emergency early warning method |
CN105975459A (en) * | 2016-05-24 | 2016-09-28 | 北京奇艺世纪科技有限公司 | Lexical item weight labeling method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090119281A1 (en) * | 2007-11-03 | 2009-05-07 | Andrew Chien-Chung Wang | Granular knowledge based search engine |
- 2016
  - 2016-11-11 CN CN201611005521.1A patent/CN106503256B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN106503256A (en) | 2017-03-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||