CN107688596A - Burst topic detection method and burst topic detection device - Google Patents
- Publication number
- CN107688596A (application CN201710433359.1A)
- Authority
- CN
- China
- Prior art keywords
- topic
- participle
- word
- keyword
- word segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2358—Change logging, detection, and notification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a burst topic detection method and device, applicable to the field of Internet technology. The method includes: continuously acquiring topic data from an information sharing platform; each time an item of topic data is acquired, matching the topic data against the words in a preset dictionary to output multiple segmentation results; outputting the segments contained in the segmentation result with the highest matching degree as the keywords of the topic data; updating, according to the keywords, the summary information associated with the topic data; and displaying the keywords and the summary information so that the user learns the burst topic at the current moment. The invention can determine the keywords of the topic data and update the summary information based on them, so that the user can quickly identify burst topics on the information sharing platform from the displayed keywords and summary information.
Description
Technical field
The invention belongs to the field of Internet technology, and in particular relates to a burst topic detection method and a burst topic detection device.
Background
On information sharing platforms such as microblogs, Twitter, and forums, the openness of the platform lets users share and forward all kinds of information anytime and anywhere. If a large number of users share or forward the same information within a short period, the topic behind that information can develop into a burst topic with high popularity. If such a burst topic concerns a particular enterprise, it may have a large impact on public opinion about that enterprise. An enterprise that cannot promptly discover and track the burst topic events related to it will miss the best window for countering negative public opinion, which weakens its soft power.
However, in the prior art it is difficult to quickly identify burst topics on an information sharing platform by technical means, and difficult to determine whether each burst topic is related to the enterprise itself.
Summary of the invention
In view of this, the embodiments of the invention provide a burst topic detection method and a burst topic detection device, to solve the prior-art problems that it is difficult to quickly identify burst topics on an information sharing platform by technical means and difficult to determine whether each burst topic is related to the enterprise itself.
A first aspect of the embodiments of the invention provides a burst topic detection method, including:
continuously acquiring topic data from an information sharing platform;
each time an item of topic data is acquired, matching the topic data against the words in a preset dictionary to output multiple segmentation results;
outputting the segments contained in the segmentation result with the highest matching degree as the keywords of the topic data;
updating, according to the keywords, the summary information associated with the topic data;
displaying the keywords and the summary information so that the user learns the burst topic at the current moment.
A second aspect of the embodiments of the invention provides a burst topic detection device. The device includes a memory, a processor, and a burst topic detection program that is stored in the memory and can run on the processor. When the processor executes the burst topic detection program, the following steps are implemented:
continuously acquiring topic data from an information sharing platform;
each time an item of topic data is acquired, matching the topic data against the words in a preset dictionary to output multiple segmentation results;
outputting the segments contained in the segmentation result with the highest matching degree as the keywords of the topic data;
updating, according to the keywords, the summary information associated with the topic data;
displaying the keywords and the summary information so that the user learns the burst topic at the current moment.
A third aspect of the embodiments of the invention provides a computer-readable storage medium that stores a burst topic detection program. When the burst topic detection program is executed by at least one processor, the following steps are implemented:
continuously acquiring topic data from an information sharing platform;
each time an item of topic data is acquired, matching the topic data against the words in a preset dictionary to output multiple segmentation results;
outputting the segments contained in the segmentation result with the highest matching degree as the keywords of the topic data;
updating, according to the keywords, the summary information associated with the topic data;
displaying the keywords and the summary information so that the user learns the burst topic at the current moment.
In the embodiments of the invention, each time topic data is acquired from the information sharing platform, the keywords of that topic data are determined and the summary information is updated in real time based on them. From the displayed keywords and summary information, the user can learn at once roughly what the burst topic on the platform is about and quickly judge whether it is related to the enterprise itself. Burst topic events related to the enterprise can thus be discovered, tracked, and handled effectively, which improves the enterprise's soft power.
Brief description of the drawings
To illustrate the technical solutions of the embodiments more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the burst topic detection method provided by an embodiment of the invention;
Fig. 2 is a detailed flowchart of step S103 of the burst topic detection method provided by an embodiment of the invention;
Fig. 3 is a detailed flowchart of step S104 of the burst topic detection method provided by an embodiment of the invention;
Fig. 4 is a detailed flowchart of step S303 of the burst topic detection method provided by an embodiment of the invention;
Fig. 5 is a detailed flowchart of step S305 of the burst topic detection method provided by an embodiment of the invention;
Fig. 6 is a schematic diagram of the burst topic detection apparatus provided by an embodiment of the invention;
Fig. 7 is a schematic diagram of the burst topic detection device provided by an embodiment of the invention.
Detailed description
In the following description, specific details such as particular system structures and techniques are set out for purposes of illustration rather than limitation, so that the embodiments of the invention can be thoroughly understood. It will be clear to those skilled in the art, however, that the invention can also be practised in other embodiments without these details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary detail does not obscure the description.
The technical solutions of the invention are illustrated below through specific embodiments.
Fig. 1 shows the flow of the burst topic detection method provided by an embodiment of the invention. The flow includes steps S101 to S105, which work as follows:
S101: Continuously acquire topic data from an information sharing platform.
In the embodiments of the invention, information sharing platforms include, but are not limited to, microblogs, Twitter, Facebook, and major BBS forums. Each item of topic data is a piece of text that a user has published and that can be displayed on the platform; it may be associated with one or more emergencies. This text includes, but is not limited to, original posts, reposts, and the user comments attached to original posts or reposts.
Topic data can be acquired from an information sharing platform in two ways. In the first, an application that can interact with the platform's API (Application Programming Interface) is created in advance; using an account key obtained beforehand, the application calls the API provided by the platform and receives the topic data that the platform returns. In the second, a crawler program continuously crawls topic data from the platform.
Because the topic data on an information sharing platform is constantly updated and keeps growing, the embodiments acquire it in real time, that is, continuously. This ensures that the system has the latest topic data at every moment, so that burst topics are detected accurately and promptly.
S102: Each time an item of topic data is acquired, match the topic data against the words in a preset dictionary to output multiple segmentation results.
Whenever a new item of topic data is received, the system performs word matching on it. Specifically, starting from the first character of the topic data, the system checks whether the topic data contains words from the preset dictionary. When a run of consecutive characters in the topic data forms a word identical to a dictionary word, that run is taken as one segment, and the matching process starts again from the first character after the segment. Once every segment in the topic data has been determined, one pass of word matching is complete; the pass outputs one segmentation result, which contains multiple segments. Each segment contains at least two characters.
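The matching pass described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `forward_max_match` and the choice to try the longest candidate first are assumptions, and Latin letters stand in for Chinese characters.

```python
def forward_max_match(text, dictionary):
    """Scan left to right and collect dictionary words as segments.

    Assumption: at each position the longest dictionary word wins.
    Segments shorter than two characters are never produced.
    """
    max_len = max((len(word) for word in dictionary), default=0)
    segments = []
    i = 0
    while i < len(text):
        match = None
        # Candidate lengths run from the longest possible down to 2.
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        if match:
            segments.append(match)
            i += len(match)  # restart right after the matched segment
        else:
            i += 1           # no dictionary word starts here
    return segments


print(forward_max_match("abcde", {"ab", "abc", "cd", "de"}))
```

Characters that belong to no dictionary word are simply skipped, which matches the description that only dictionary words become segments.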
In practice, a character in the topic data may form a segment with one or more characters to its left, or with one or more characters to its right. The same topic data can therefore yield different segmentation results under different segmentation rules. In the embodiments of the invention, one segmentation result is output for each pre-stored segmentation rule. Different segmentation results may have different matching degrees, where the matching degree expresses how well a user can grasp the actual meaning of the topic data from the segments in the segmentation result.
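To show how two rules can split the same text differently, the sketch below uses forward and backward maximum matching as the two rules. The text does not name its segmentation rules, so both rules and the names `max_match` and `all_segmentations` are assumptions made for illustration.

```python
def max_match(text, dictionary, backward=False):
    """Dictionary matching, scanning left-to-right or right-to-left."""
    max_len = max((len(word) for word in dictionary), default=0)
    segments = []
    if not backward:
        i = 0
        while i < len(text):
            for size in range(min(max_len, len(text) - i), 1, -1):
                if text[i:i + size] in dictionary:
                    segments.append(text[i:i + size])
                    i += size
                    break
            else:
                i += 1  # nothing matched at this position
    else:
        j = len(text)
        while j > 0:
            for size in range(min(max_len, j), 1, -1):
                if text[j - size:j] in dictionary:
                    segments.insert(0, text[j - size:j])
                    j -= size
                    break
            else:
                j -= 1  # nothing matched ending at this position
    return segments


def all_segmentations(text, dictionary):
    """One segmentation result per (assumed) rule."""
    return {
        "forward": max_match(text, dictionary),
        "backward": max_match(text, dictionary, backward=True),
    }


print(all_segmentations("abcd", {"ab", "bcd"}))
```

Here the character `b` joins its left neighbour under the forward rule and its right neighbours under the backward rule, so the two results differ.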
S103: Output the segments contained in the segmentation result with the highest matching degree as the keywords of the topic data.
In the embodiments of the invention, the matching degree of each segmentation result may be determined from the average number of characters per segment, or from the variance of the character counts of the segments; no limitation is imposed here.
Preferably, because a user can determine the actual meaning of the topic data more easily from longer segments, the matching degree of each segmentation result is measured according to the longest-match principle. After the matching degrees have been compared, the segments of the segmentation result with the highest matching degree are output as the keywords of the topic data.
For example, suppose the topic data consists only of the three Chinese characters translated here as "data cable". Both "data cable" and "data" can form a segment, but "data cable" has the higher matching degree, so the segment of the best segmentation result is determined to be "data cable" and it is output as the keyword.
As one embodiment of the present of invention, the calculation of word segmentation result matching degree is further qualified.Such as Fig. 2 institutes
Show, above-mentioned S103 is specifically included:
S201:It is corresponding according to character sum and each word segmentation result corresponding to each participle in each word segmentation result
Participle sum, calculate the participle character average of each word segmentation result.
Multiple participles are included in each word segmentation result, each participle forms by least two characters.The present invention
In embodiment, the sum of participle is identified, and identifies that the character sum each segmented (judges the included character of each participle
Quantity).It is above-mentioned participle character average by character sum corresponding to each participle and with participle sum ratio output.
For example, a kind of word segmentation result obtained by after if word segmentation processing is carried out to topic data is { group/data everyday
Line/yield }, then three participles in the word segmentation result are respectively " group everyday ", " data wire ", " yield ", and these three divide
The character sum of word is respectively 4,3,3, and the participle sum of the word segmentation result is 3, and participle character average is (4+3+3)/3=
3.33。
S202: Weight the average number of characters per segment and the total number of segments of each segmentation result to output the matching degree of that result.
In the embodiments of the invention, the average number of characters per segment A_1 has a preset weight coefficient a_1, the total number of segments A_2 has a preset weight coefficient a_2, and a_1 + a_2 = 1. The matching degree of each segmentation result is C = A_1 × a_1 + A_2 × a_2.
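The two quantities and their weighted sum can be computed as in the sketch below. The weights 0.7 and 0.3 are arbitrary illustrative values (the text only requires a_1 + a_2 = 1), and the function names are hypothetical. The placeholder strings reproduce the character counts 4, 3, and 3 of the earlier example.

```python
def average_chars(segments):
    """A_1: mean number of characters per segment."""
    if not segments:
        return 0.0
    return sum(len(s) for s in segments) / len(segments)


def matching_degree(segments, a1=0.7, a2=0.3):
    """C = A_1 * a1 + A_2 * a2, with a1 + a2 = 1 (weights are assumed values)."""
    if abs(a1 + a2 - 1.0) > 1e-9:
        raise ValueError("weight coefficients must sum to 1")
    A1 = average_chars(segments)
    A2 = len(segments)  # total number of segments
    return A1 * a1 + A2 * a2


# Placeholder segments with 4, 3 and 3 characters.
example = ["AAAA", "BBB", "CCC"]
print(round(average_chars(example), 2))
print(matching_degree(example))
```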
S203: Output the segments contained in the segmentation result with the highest matching degree as the keywords of the topic data.
If segmenting the topic data yields m segmentation results with matching degrees C_1, C_2, …, C_m, the largest value C_i is selected, and each segment of the segmentation result corresponding to C_i is output as a keyword of the topic data. Here m is an integer greater than 1 and i ≤ m.
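Selecting the result with the largest C_i is a plain arg-max. The sketch below assumes each segmentation result is a list of strings and reuses the weighted formula with illustrative weights; `select_keywords` is a hypothetical name.

```python
def select_keywords(results, a1=0.7, a2=0.3):
    """Return the segments of the segmentation result with the highest matching degree."""
    def degree(segments):
        if not segments:
            return 0.0
        avg = sum(len(s) for s in segments) / len(segments)
        return avg * a1 + len(segments) * a2

    # max() keeps the first result when several share the top score.
    return max(results, key=degree)


candidates = [["ab", "cd", "ef"], ["abcd", "ef"]]
print(select_keywords(candidates))
```

With these weights the first candidate scores 2 × 0.7 + 3 × 0.3 = 2.3 and the second scores 3 × 0.7 + 2 × 0.3 = 2.7, so the second is chosen.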
In the embodiments of the invention, both the average number of characters per segment and the total number of segments strongly affect a segmentation result and determine whether the user can work out the actual meaning of the topic data. Weighting the two and using the weighted value as the matching degree for choosing keywords therefore improves the accuracy and validity of keyword selection, so that the content of the burst topic event can be pinpointed.
S104: According to the keywords, update the summary information associated with the topic data.
At any moment the system has received many items of topic data in total. After the keywords of each item have been determined, the system regenerates the summary information describing all topic data received so far, so that the user can see the general content of the current burst topic from it.
Keywords carry the key attributes of the topic data. To generate summary information associated with all topic data received so far, the cumulative word frequency of each keyword across the topic data can be counted, and the summary information generated from the keywords whose cumulative word frequency exceeds a threshold. The summary information, which is associated with both the topic data and the keywords, can be generated with the TextRank algorithm, the summary generation tool in Word, or similar tools.
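Counting cumulative word frequency and keeping only keywords above a threshold might look like the sketch below. The class name `KeywordFrequency` and the threshold value are assumptions; the summary generation itself (TextRank or similar) is left out.

```python
from collections import Counter


class KeywordFrequency:
    """Running count of how often each keyword has appeared so far."""

    def __init__(self):
        self.counts = Counter()

    def add(self, keywords):
        """Record the keywords of one newly received item of topic data."""
        self.counts.update(keywords)

    def above(self, threshold):
        """Keywords whose cumulative word frequency exceeds the threshold."""
        return sorted(k for k, c in self.counts.items() if c > threshold)


freq = KeywordFrequency()
freq.add(["recall", "battery"])
freq.add(["recall"])
print(freq.above(1))
```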
Preferably, as an embodiment of the invention, as shown in Fig. 3, S104 specifically includes:
S301: Obtain the cumulative word frequency of each keyword and calculate the growth acceleration of that cumulative word frequency, where the cumulative word frequency of a keyword is the cumulative number of times the keyword occurs in all topic data acquired up to the current moment.
In the embodiments of the invention, the cumulative word frequency of a keyword is the number of its occurrences in all topic data received so far. Because the system keeps acquiring topic data, the cumulative word frequency of a given keyword keeps growing. If the system detects that the cumulative word frequency of keyword A increases by ΔS within a period ΔT, the growth rate of that frequency is V = ΔS/ΔT, and its growth acceleration a is the derivative of V with respect to time, a = V′(t). The larger the growth acceleration, the faster the keyword's appearances in the topic data are increasing per unit time, and the more bursty the topic.
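Since the cumulative word frequency is only observed at discrete times, the derivative has to be approximated. The sketch below uses finite differences over the last three samples; this discretization and the function name `growth_acceleration` are assumptions, as the text does not say how a = V′(t) is evaluated.

```python
def growth_acceleration(samples):
    """Estimate a = dV/dt from the last three (time, cumulative_frequency) samples.

    V is estimated on each of the two latest intervals (V = dS / dT), and the
    change in V is divided by the distance between the interval midpoints.
    """
    if len(samples) < 3:
        raise ValueError("need at least three samples")
    (t0, s0), (t1, s1), (t2, s2) = samples[-3:]
    v1 = (s1 - s0) / (t1 - t0)  # growth rate on the earlier interval
    v2 = (s2 - s1) / (t2 - t1)  # growth rate on the later interval
    return (v2 - v1) / ((t2 - t0) / 2.0)


# Cumulative frequency S(t) = 2 * t**2 has constant acceleration 4.
print(growth_acceleration([(0, 0), (1, 2), (2, 8)]))
```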
S302: Add the growth acceleration of each keyword to a previously generated matrix.
Each time new topic data is received, the system determines its keywords and the growth acceleration of each keyword's cumulative word frequency. If the topic data has K keywords, K growth accelerations are obtained. If the system has accumulated P growth accelerations in total (P ≥ K, P ∈ Z), the matrix is extended to P × P and the K growth accelerations just obtained are added to it. Besides the P growth accelerations, the P × P matrix also contains null values.
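The text does not say where the P growth accelerations sit inside the P × P matrix. One layout consistent with "P values plus null entries" is a diagonal matrix, and the sketch below assumes exactly that, with zeros standing in for the null values. The class name `AccelerationMatrix` is hypothetical.

```python
class AccelerationMatrix:
    """Keeps one growth acceleration per keyword on the diagonal (assumed layout)."""

    def __init__(self):
        self.index = {}   # keyword -> position on the diagonal
        self.values = []  # diagonal entries, one per keyword

    def update(self, accelerations):
        """Add or refresh the growth accelerations of the latest keywords."""
        for keyword, value in accelerations.items():
            if keyword not in self.index:
                self.index[keyword] = len(self.values)
                self.values.append(0.0)  # matrix grows by one row and column
            self.values[self.index[keyword]] = value

    def to_matrix(self):
        """Materialize the P x P matrix; off-diagonal entries are 0.0 (null)."""
        p = len(self.values)
        return [[self.values[i] if i == j else 0.0 for j in range(p)]
                for i in range(p)]


m = AccelerationMatrix()
m.update({"recall": 1.5, "battery": 2.0})
m.update({"battery": 3.0, "refund": 0.5})
print(m.to_matrix())
```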
S303: Calculate the eigenvalue of the matrix at the current moment; when the eigenvalue exceeds a first threshold, determine the growth accelerations in the matrix that exceed a second threshold.
The system monitors the growth accelerations in the matrix so that the eigenvalue of the matrix is detected in real time. As more and more topic data is acquired, the size of the matrix and the number of growth accelerations it contains keep changing, and the eigenvalue of the matrix grows accordingly. When the eigenvalue exceeds the preset first threshold, the system locates, among the growth accelerations in the matrix, the one or more whose values exceed the second threshold.
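A matrix generally has several eigenvalues, and the text does not say which one is compared with the first threshold. The sketch below assumes the dominant (largest-magnitude) eigenvalue and estimates it with power iteration, a standard method that can fail when two eigenvalues share the largest magnitude or when the start vector is orthogonal to the dominant eigenvector. The function names and threshold values are illustrative.

```python
def dominant_eigenvalue(matrix, iterations=200):
    """Estimate the largest-magnitude eigenvalue by power iteration."""
    n = len(matrix)
    if n == 0:
        return 0.0
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        scale = max(abs(x) for x in w)
        if scale == 0.0:
            return 0.0  # v was mapped to zero, e.g. an all-zero matrix
        v = [x / scale for x in w]
    # Rayleigh quotient of the converged vector.
    w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, w)) / sum(a * a for a in v)


def accelerations_over(matrix, first_threshold, second_threshold):
    """If the eigenvalue passes the first threshold, list entries above the second."""
    if dominant_eigenvalue(matrix) <= first_threshold:
        return []
    return [x for row in matrix for x in row if x > second_threshold]


matrix = [[1.5, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.5]]
print(accelerations_over(matrix, first_threshold=2.0, second_threshold=1.0))
```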
As one embodiment of the present of invention, as shown in figure 4, above-mentioned S303 is specifically included:
S401:Each growth acceleration in matrix described in current time is divided into N number of group, and by the increasing of each group
Long acceleration is mapped in a submatrix.
Because the quantity for increasing acceleration in matrix is more, in order to improve numerical value more than the growth acceleration of Second Threshold
Locating speed, matrix is subjected to dimension-reduction treatment.
Specifically, according to default rule, all growth acceleration in the presence of matrix are divided into N number of group so that
Each group includes multiple growth acceleration of negligible amounts.Wherein, the quantity for increasing acceleration in each group can be with identical
Can also be different.Multiple growth acceleration that each group is included are mapped in a submatrix.Therefore when the quantity of group
For B when, the quantity of submatrix is also B.In the case where topic data gradually increases, obtained each growth is updated every time
Acceleration will be also respectively mapped in the B submatrix.
S402: Calculate the eigenvalue of each submatrix; when the eigenvalue of a submatrix exceeds a fourth threshold, select from that submatrix the growth accelerations that exceed the second threshold.
The eigenvalue of each submatrix is calculated. If the eigenvalues of any of the N submatrices exceed the preset fourth threshold, the growth accelerations exceeding the second threshold are selected from each of those submatrices.
In the embodiments of the invention, a submatrix contains far fewer growth accelerations than the full matrix. By calculating the eigenvalue of each submatrix separately, the growth accelerations exceeding the second threshold can be located quickly in the submatrices whose eigenvalue exceeds the fourth threshold, which improves the efficiency of burst topic detection.
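The grouping step can be sketched as follows. Two assumptions are made because the text leaves them open: the "preset rule" is taken to be round-robin assignment, and each submatrix is taken to be diagonal, in which case its largest eigenvalue equals its largest diagonal entry and no iterative solver is needed. All names and thresholds are illustrative.

```python
def split_into_groups(accelerations, n_groups):
    """Assign keyword -> acceleration pairs to N groups in round-robin order."""
    groups = [dict() for _ in range(n_groups)]
    for position, (keyword, value) in enumerate(accelerations.items()):
        groups[position % n_groups][keyword] = value
    return groups


def screen_groups(accelerations, n_groups, fourth_threshold, second_threshold):
    """Keep accelerations above the second threshold, but only from groups
    whose submatrix eigenvalue exceeds the fourth threshold."""
    selected = {}
    for group in split_into_groups(accelerations, n_groups):
        if not group:
            continue
        # Diagonal submatrix (assumed): largest eigenvalue = largest entry.
        if max(group.values()) > fourth_threshold:
            selected.update(
                {k: v for k, v in group.items() if v > second_threshold}
            )
    return selected


acc = {"delay": 0.2, "recall": 5.0, "price": 0.1, "battery": 1.5}
print(screen_groups(acc, n_groups=2, fourth_threshold=2.0, second_threshold=1.0))
```

Groups whose eigenvalue stays below the fourth threshold are skipped without inspecting their individual entries, which is where the speed-up described above comes from.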
S304: According to the segments corresponding to the determined growth accelerations, select from all acquired topic data the topic data that contains those segments.
Each growth acceleration in the matrix or a submatrix corresponds to one keyword, and each keyword is a segment of the segmentation result with the highest matching degree for some topic data. Using a pre-stored mapping table between growth accelerations and segments, the system can therefore look up the segment corresponding to each growth acceleration that exceeds the second threshold. If L growth accelerations exceed the second threshold, L segments are found.
The system screens, one by one, every item of topic data acquired up to the current moment, checking whether it contains the L segments. If an item contains the L segments, the system selects it and performs step S305 on it.
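Screening the stored topic data is a substring check per item. The description can be read as requiring all L segments or any of them, so the sketch exposes that choice as a hypothetical `require_all` flag; topic data is assumed to be plain strings.

```python
def filter_topics(topics, segments, require_all=True):
    """Select the items of topic data that contain the given segments.

    require_all=True keeps items containing every segment;
    require_all=False keeps items containing at least one.
    """
    check = all if require_all else any
    return [t for t in topics if check(s in t for s in segments)]


topics = [
    "battery recall announced today",
    "new phone battery review",
    "weather update",
]
print(filter_topics(topics, ["battery", "recall"]))
print(filter_topics(topics, ["battery", "recall"], require_all=False))
```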
S305: Segment the topic data containing the segments again, and calculate the word frequency feature value of each segment obtained.
The system segments each selected item of topic data again. Any existing segmentation algorithm may be used, including but not limited to string-matching-based and statistics-based algorithms. Segmentation yields a new set of segments for the item. To tell them apart, the segments obtained in S102 are called first segments and those obtained in S305 are called second segments. A first segment and a second segment may be identical or different. To further single out the second segments that most influence the summary information, the word frequency feature value of each second segment is calculated from its word frequency features. These features include, but are not limited to, term frequency (TF) and inverse document frequency (IDF).
As one embodiment of the present of invention, as shown in figure 5, above-mentioned S305 is specifically included:
S501:Word segmentation processing is carried out again to the topic data comprising the participle, obtains multiple participles.
S502:In all topic datas accessed by current time, obtained respectively after calculating word segmentation processing each
Statistics word frequency and reverse document-frequency corresponding to participle.
In the embodiment of the present invention, number of each second participle appeared in a plurality of topic data filtered out is calculated,
The occurrence number for then counting to obtain is the statistics word frequency F of the second participleTF.If the sum of the topic data filtered out is X bars, wherein
Topic data comprising a certain second participle is X ' (X '≤X, N ∈ Z) bar, then the reverse document-frequency F of second participleIDFFor
S503: Weight the statistical word frequency and the inverse document frequency of each segment to output the word frequency feature value of the segment.
The statistical word frequency F_TF has a preset weight coefficient a_3, the inverse document frequency F_IDF has a preset weight coefficient a_4, and a_3 + a_4 = 1. The word frequency feature value of each second segment is F = F_TF × a_3 + F_IDF × a_4.
In the embodiments of the invention, the word frequency feature value of each second segment is calculated from its TF and IDF values with user-defined weight coefficients. Taking both values into account allows the importance of each second segment to be compared quantitatively across the selected topic data.
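The weighted feature value can be sketched as below. The exact expression for F_IDF is not reproduced in this text, so the sketch assumes the conventional form log(X / X′); the equal weights and the function name are likewise illustrative, and topic data is assumed to be plain strings.

```python
import math


def word_frequency_feature(segment, topics, a3=0.5, a4=0.5):
    """F = F_TF * a3 + F_IDF * a4 for one second segment.

    F_TF  : total occurrences of the segment in the selected topic data.
    F_IDF : assumed to be log(X / X'), where X is the number of selected
            items and X' the number of items containing the segment.
    """
    f_tf = sum(t.count(segment) for t in topics)
    containing = sum(1 for t in topics if segment in t)
    f_idf = math.log(len(topics) / containing) if containing else 0.0
    return f_tf * a3 + f_idf * a4


topics = ["recall notice recall", "battery", "refund"]
print(word_frequency_feature("recall", topics))
```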
S306: Output the segments whose word frequency feature value exceeds a third threshold as high-frequency words, and connect the high-frequency words with a preset algorithm to obtain summary information containing each of them.
The second segments whose word frequency feature value F exceeds the preset third threshold are the high-frequency words of the topic data. The high-frequency words are connected using the TextRank algorithm, the summary generation tool in Word, or another custom algorithm, to obtain summary information associated with both the topic data and the high-frequency words.
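The threshold step is sketched below. The joining function is only a placeholder for the connection algorithm (TextRank or similar), which is beyond a short example; both function names and the threshold are assumptions.

```python
def high_frequency_words(features, third_threshold):
    """Segments whose feature value F exceeds the third threshold, highest first."""
    ranked = sorted(features.items(), key=lambda item: item[1], reverse=True)
    return [word for word, value in ranked if value > third_threshold]


def naive_summary(words, separator=" "):
    """Placeholder for the connection step: simply joins the words in rank order."""
    return separator.join(words)


features = {"delay": 0.2, "recall": 3.0, "battery": 1.5}
words = high_frequency_words(features, third_threshold=1.0)
print(naive_summary(words))
```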
S105: Display the keywords and the summary information so that the user learns the burst topic at the current moment.
The system displays the keywords obtained in real time and the updated summary information. In practice, only when the topic data belongs to a burst topic does the growth acceleration of a keyword's cumulative word frequency exceed the threshold and the summary information get updated. The text displayed in real time is therefore highly similar to the true content of the burst topic event and has reference value.
In the embodiments of the invention, each time topic data is acquired from the information sharing platform, the keywords of that topic data are determined and the summary information is updated in real time based on them. From the displayed keywords and summary information, the user can learn at once roughly what the burst topic on the platform is about and quickly judge whether it is related to the enterprise itself. Burst topic events related to the enterprise can thus be discovered, tracked, and handled effectively, which improves the enterprise's soft power.
It should be understood that the step numbers in the above embodiments do not imply an order of execution. The order in which the processes are executed is determined by their functions and internal logic, and the numbering does not limit the implementation of the embodiments of the invention in any way.
Corresponding to the burst topic detection method described in the foregoing embodiments, Fig. 6 shows a schematic diagram of the burst topic detection apparatus provided by an embodiment of the invention. For ease of description, only the parts related to the embodiment are shown.
Referring to Fig. 6, the apparatus includes:
An acquisition module 61, configured to continuously obtain topic data from an information sharing platform.
A matching module 62, configured to, each time a piece of topic data is obtained, match the topic data against each word in a preset dictionary so as to output a plurality of word segmentation results.
An output module 63, configured to output the segmented words contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data.
An update module 64, configured to update, according to the keywords, the summary information associated with the topic data.
A display module 65, configured to display the keywords and the summary information so that a user learns of the burst topic at the current moment.
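As a non-limiting illustration, the matching performed by the matching module 62 could be sketched as follows. This passage does not name a matching strategy, so the sketch assumes forward and backward maximum matching against the preset dictionary, a common way to obtain more than one word segmentation result for the same text. The function names, the `max_len` parameter and the sample dictionary are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, taking the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.insert(0, piece)
                j -= size
                break
    return tokens


def candidate_segmentations(text, dictionary, max_len=4):
    """Return the distinct segmentation results produced by the two scans."""
    forward = forward_max_match(text, dictionary, max_len)
    backward = backward_max_match(text, dictionary, max_len)
    return [forward] if forward == backward else [forward, backward]


if __name__ == "__main__":
    # Hypothetical dictionary; Latin letters stand in for characters of a post.
    print(candidate_segmentations("abcd", {"ab", "abc", "cd"}, max_len=3))
```

The two scans disagree whenever the text is ambiguous with respect to the dictionary, which is what gives the output module several candidates to rank.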
Optionally, the update module 64 includes:
A first calculation submodule, configured to obtain the cumulative word frequency of each keyword and to calculate the growth acceleration of the cumulative word frequency, wherein the cumulative word frequency of a keyword is the cumulative number of times the keyword occurs in all topic data obtained up to the current moment.
An adding submodule, configured to add the growth acceleration corresponding to each keyword to a previously generated matrix.
A determination submodule, configured to calculate the eigenvalue of the matrix at the current moment and, when the eigenvalue is greater than a first threshold, to determine from the matrix the growth accelerations that are greater than a second threshold.
A screening submodule, configured to screen out, from all the topic data obtained, the topic data containing the segmented word corresponding to each determined growth acceleration.
A word segmentation submodule, configured to perform word segmentation again on the topic data containing the segmented word, and to calculate the word-frequency feature value of each segmented word obtained after the word segmentation.
A first output submodule, configured to output the segmented words whose word-frequency feature value is greater than a third threshold as high-frequency words, and to connect the high-frequency words by means of a prediction algorithm so as to obtain summary information containing each of the high-frequency words.
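As a non-limiting illustration, the growth acceleration used by the first calculation submodule and the second-threshold filter could be sketched as follows. This passage does not give a formula, so the sketch assumes that the growth acceleration is the discrete second difference of the cumulative word frequency over the three most recent sampling moments. The matrix and its eigenvalue check against the first threshold are omitted because the matrix layout is not specified here. All names are hypothetical.

```python
def growth_acceleration(cumulative_counts):
    """Second difference of the last three cumulative counts (assumed definition).

    With c[t] the cumulative word frequency at sampling moment t:
        growth rate   v[t] = c[t] - c[t-1]
        acceleration  a[t] = v[t] - v[t-1] = c[t] - 2*c[t-1] + c[t-2]
    """
    if len(cumulative_counts) < 3:
        return 0.0  # not enough history to estimate an acceleration
    c_prev2, c_prev1, c_now = cumulative_counts[-3:]
    return float(c_now - 2 * c_prev1 + c_prev2)


def accelerating_keywords(history, second_threshold):
    """Keywords whose growth acceleration exceeds the second threshold."""
    return sorted(
        word
        for word, counts in history.items()
        if growth_acceleration(counts) > second_threshold
    )


if __name__ == "__main__":
    # Hypothetical cumulative counts sampled at four successive moments.
    history = {"quake": [1, 2, 4, 8], "weather": [3, 6, 9, 12]}
    print(accelerating_keywords(history, second_threshold=1.0))
```

A keyword that grows steadily, such as "weather" in the sample data, has zero acceleration under this definition and is not selected; only a keyword whose growth is itself speeding up passes the filter.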
Optionally, the determination submodule is specifically configured to:
divide the growth accelerations in the matrix at the current moment into N groups, and map the growth accelerations of each group into one submatrix;
calculate the eigenvalue of each submatrix and, when the eigenvalue of a submatrix is greater than a fourth threshold, screen out from that submatrix the growth accelerations that are greater than the second threshold;
wherein N is an integer greater than 1.
Optionally, the word segmentation submodule is specifically configured to:
perform word segmentation again on the topic data containing the segmented word to obtain a plurality of segmented words;
calculate, over all the topic data obtained up to the current moment, the statistical word frequency and the inverse document frequency corresponding to each segmented word obtained after the word segmentation;
weight the statistical word frequency and the inverse document frequency of each segmented word so as to output the word-frequency feature value of that segmented word.
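As a non-limiting illustration, the word-frequency feature value could be sketched as a TF-IDF score. This passage only states that the statistical word frequency and the inverse document frequency are weighted, so the sketch assumes the common product form with a smoothed logarithmic inverse document frequency. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def word_frequency_features(docs):
    """Map each token to tf * idf over a list of tokenised topic data.

    Assumed definitions:
        tf(w)  = occurrences of w / total number of tokens
        idf(w) = ln((1 + N) / (1 + df(w))) + 1, with N the number of documents
                 and df(w) the number of documents containing w
    """
    total_tokens = sum(len(doc) for doc in docs)
    if total_tokens == 0:
        return {}
    n_docs = len(docs)
    term_counts = Counter(tok for doc in docs for tok in doc)
    doc_counts = Counter(tok for doc in docs for tok in set(doc))
    return {
        tok: (count / total_tokens)
        * (math.log((1 + n_docs) / (1 + doc_counts[tok])) + 1.0)
        for tok, count in term_counts.items()
    }


def high_frequency_words(docs, third_threshold):
    """Tokens whose feature value is greater than the third threshold."""
    scores = word_frequency_features(docs)
    return sorted(tok for tok, score in scores.items() if score > third_threshold)


if __name__ == "__main__":
    # Hypothetical re-segmented topic data: one token list per post.
    docs = [["quake", "city"], ["quake", "rescue"], ["quake"]]
    print(high_frequency_words(docs, third_threshold=0.5))
```

The tokens returned by `high_frequency_words` correspond to the high-frequency words that the first output submodule would then connect into summary information.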
Optionally, the output module 63 includes:
A second calculation submodule, configured to calculate the average number of characters per segmented word for each word segmentation result, according to the total number of characters of the segmented words in that word segmentation result and the total number of segmented words in that word segmentation result.
A weighting submodule, configured to weight the average number of characters per segmented word and the total number of segmented words corresponding to each word segmentation result so as to output the matching degree of each word segmentation result.
A second output submodule, configured to output the segmented words contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data.
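As a non-limiting illustration, the matching degree computed by the output module 63 could be sketched as follows. This passage does not disclose the weights or their signs, so the sketch assumes a linear combination that rewards a larger average number of characters per segmented word and penalises a larger number of segmented words. The weight values and function names are hypothetical.

```python
def matching_degree(tokens, w_avg=1.0, w_count=-0.1):
    """Weighted sum of the average token length and the token count (assumed form)."""
    if not tokens:
        return float("-inf")  # an empty segmentation can never be selected
    total_chars = sum(len(tok) for tok in tokens)
    avg_chars = total_chars / len(tokens)
    return w_avg * avg_chars + w_count * len(tokens)


def pick_keywords(candidates):
    """Return the tokens of the segmentation result with the highest matching degree."""
    return max(candidates, key=matching_degree)


if __name__ == "__main__":
    # Two hypothetical segmentation results for the same four-character text.
    candidates = [["a", "b", "cd"], ["ab", "cd"]]
    print(pick_keywords(candidates))
```

With the assumed weights, a result made of fewer and longer segmented words scores higher, reflecting the intuition that it matched more complete dictionary words.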
Fig. 7 is a schematic diagram of the burst topic detection equipment provided by an embodiment of the present invention. As shown in Fig. 7, the burst topic detection equipment 7 of this embodiment includes a processor 70, a memory 71, and a computer program 72, such as a burst topic detection program, that is stored in the memory 71 and can run on the processor 70. When executing the computer program 72, the processor 70 implements the steps of the burst topic detection method embodiments described above, for example steps 101 to 105 shown in Fig. 1. Alternatively, when executing the computer program 72, the processor 70 implements the functions of the modules/units in the apparatus embodiments described above, for example the functions of modules 61 to 65 shown in Fig. 6.
By way of example, the computer program 72 may be divided into one or more modules/units, which are stored in the memory 71 and executed by the processor 70 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments being used to describe the execution of the computer program 72 in the burst topic detection equipment 7. For example, the computer program 72 may be divided into an acquisition module, a matching module, an output module, an update module and a display module, whose specific functions are as follows:
The acquisition module is configured to continuously obtain topic data from an information sharing platform.
The matching module is configured to, each time a piece of topic data is obtained, match the topic data against each word in a preset dictionary so as to output a plurality of word segmentation results.
The output module is configured to output the segmented words contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data.
The update module is configured to update, according to the keywords, the summary information associated with the topic data.
The display module is configured to display the keywords and the summary information so that a user learns of the burst topic at the current moment.
The burst topic detection equipment 7 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud server. Those skilled in the art will understand that Fig. 7 is only an example of the burst topic detection equipment 7 and does not limit it; the equipment may include more or fewer components than shown, may combine certain components, or may use different components. For example, the burst topic detection equipment may also include input/output devices, network access devices, a bus, and the like.
The processor 70 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 71 may be an internal storage unit of the burst topic detection equipment 7, such as a hard disk or internal memory of the burst topic detection equipment 7. The memory 71 may also be an external storage device of the burst topic detection equipment 7, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card fitted to the burst topic detection equipment 7. Further, the memory 71 may include both an internal storage unit and an external storage device of the burst topic detection equipment 7. The memory 71 is used to store the computer program and the other programs and data required by the burst topic detection equipment. The memory 71 may also be used to temporarily store data that has been output or is about to be output.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the functional units and modules above is given only as an example. In practical applications, the above functions may be allocated to different functional units or modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only intended to distinguish them from one another and do not limit the protection scope of the present application. For the specific working process of the units and modules in the above system, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated here.
In the above embodiments, the description of each embodiment has its own emphasis. For parts that are not detailed or recorded in one embodiment, reference may be made to the related descriptions of the other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the apparatus/terminal device embodiments described above are only illustrative. The division into modules or units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, apparatuses or units, and may be electrical, mechanical or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the above method embodiments of the present invention may also be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, and so on. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content covered by the computer-readable medium may be expanded or restricted as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electric carrier signals and telecommunication signals.
The embodiments described above are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.
Claims (10)
1. A burst topic detection method, characterized by comprising:
continuously obtaining topic data from an information sharing platform;
each time a piece of topic data is obtained, matching the topic data against each word in a preset dictionary so as to output a plurality of word segmentation results;
outputting the segmented words contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data;
updating, according to the keywords, the summary information associated with the topic data;
displaying the keywords and the summary information so that a user learns of the burst topic at the current moment.
2. The burst topic detection method as claimed in claim 1, characterized in that updating, according to the keywords, the summary information associated with the topic data comprises:
obtaining the cumulative word frequency of each keyword and calculating the growth acceleration of the cumulative word frequency, wherein the cumulative word frequency of a keyword is the cumulative number of times the keyword occurs in all topic data obtained up to the current moment;
adding the growth acceleration corresponding to each keyword to a previously generated matrix;
calculating the eigenvalue of the matrix at the current moment and, when the eigenvalue is greater than a first threshold, determining from the matrix the growth accelerations that are greater than a second threshold;
screening out, from all the topic data obtained, the topic data containing the segmented word corresponding to each determined growth acceleration;
performing word segmentation again on the topic data containing the segmented word, and calculating the word-frequency feature value of each segmented word obtained after the word segmentation;
outputting the segmented words whose word-frequency feature value is greater than a third threshold as high-frequency words, and connecting the high-frequency words by means of a prediction algorithm so as to obtain summary information containing each of the high-frequency words.
3. The burst topic detection method as claimed in claim 2, characterized in that calculating the eigenvalue of the matrix at the current moment and, when the eigenvalue is greater than a first threshold, determining from the matrix the growth accelerations that are greater than a second threshold comprises:
dividing the growth accelerations in the matrix at the current moment into N groups, and mapping the growth accelerations of each group into one submatrix;
calculating the eigenvalue of each submatrix and, when the eigenvalue of a submatrix is greater than a fourth threshold, screening out from that submatrix the growth accelerations that are greater than the second threshold;
wherein N is an integer greater than 1.
4. The burst topic detection method as claimed in claim 2, characterized in that performing word segmentation again on the topic data containing the segmented word, and calculating the word-frequency feature value of each segmented word obtained after the word segmentation comprises:
performing word segmentation again on the topic data containing the segmented word to obtain a plurality of segmented words;
calculating, over all the topic data obtained up to the current moment, the statistical word frequency and the inverse document frequency corresponding to each segmented word obtained after the word segmentation;
weighting the statistical word frequency and the inverse document frequency of each segmented word so as to output the word-frequency feature value of that segmented word.
5. The burst topic detection method as claimed in claim 1, characterized in that outputting the segmented words contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data comprises:
calculating the average number of characters per segmented word for each word segmentation result, according to the total number of characters of the segmented words in that word segmentation result and the total number of segmented words in that word segmentation result;
weighting the average number of characters per segmented word and the total number of segmented words corresponding to each word segmentation result so as to output the matching degree of each word segmentation result;
outputting the segmented words contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data.
6. A computer-readable storage medium storing a burst topic detection program, characterized in that, when the burst topic detection program is executed by at least one processor, the steps of the burst topic detection method as claimed in any one of claims 1 to 5 are implemented.
7. Burst topic detection equipment, characterized in that the burst topic detection equipment comprises a memory, a processor, and a burst topic detection program that is stored on the memory and can run on the processor, wherein the processor implements the following steps when executing the burst topic detection program:
continuously obtaining topic data from an information sharing platform;
each time a piece of topic data is obtained, matching the topic data against each word in a preset dictionary so as to output a plurality of word segmentation results;
outputting the segmented words contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data;
updating, according to the keywords, the summary information associated with the topic data;
displaying the keywords and the summary information so that a user learns of the burst topic at the current moment.
8. The burst topic detection equipment as claimed in claim 7, characterized in that the step of updating, according to the keywords, the summary information associated with the topic data specifically comprises:
obtaining the cumulative word frequency of each keyword and calculating the growth acceleration of the cumulative word frequency, wherein the cumulative word frequency of a keyword is the cumulative number of times the keyword occurs in all topic data obtained up to the current moment;
adding the growth acceleration corresponding to each keyword to a previously generated matrix;
calculating the eigenvalue of the matrix at the current moment and, when the eigenvalue is greater than a first threshold, determining from the matrix the growth accelerations that are greater than a second threshold;
screening out, from all the topic data obtained, the topic data containing the segmented word corresponding to each determined growth acceleration;
performing word segmentation again on the topic data containing the segmented word, and calculating the word-frequency feature value of each segmented word obtained after the word segmentation;
outputting the segmented words whose word-frequency feature value is greater than a third threshold as high-frequency words, and connecting the high-frequency words by means of a prediction algorithm so as to obtain summary information containing each of the high-frequency words.
9. The burst topic detection equipment as claimed in claim 8, characterized in that the step of calculating the eigenvalue of the matrix at the current moment and, when the eigenvalue is greater than a first threshold, determining from the matrix the growth accelerations that are greater than a second threshold specifically comprises:
dividing the growth accelerations in the matrix at the current moment into N groups, and mapping the growth accelerations of each group into one submatrix;
calculating the eigenvalue of each submatrix and, when the eigenvalue of a submatrix is greater than a fourth threshold, screening out from that submatrix the growth accelerations that are greater than the second threshold;
wherein N is an integer greater than 1.
10. The burst topic detection equipment as claimed in claim 8, characterized in that the step of performing word segmentation on the topic data containing the first segmented word to obtain a plurality of second segmented words, and calculating the word-frequency feature value of each second segmented word, specifically comprises:
performing word segmentation again on the topic data containing the segmented word to obtain a plurality of segmented words;
calculating, over all the topic data obtained up to the current moment, the statistical word frequency and the inverse document frequency corresponding to each segmented word obtained after the word segmentation;
weighting the statistical word frequency and the inverse document frequency of each segmented word so as to output the word-frequency feature value of that segmented word.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710433359.1A CN107688596B (en) | 2017-06-09 | 2017-06-09 | Burst topic detection method and burst topic detection equipment |
PCT/CN2018/074870 WO2018223718A1 (en) | 2017-06-09 | 2018-01-31 | Trending topic detection method, apparatus and device, and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710433359.1A CN107688596B (en) | 2017-06-09 | 2017-06-09 | Burst topic detection method and burst topic detection equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107688596A true CN107688596A (en) | 2018-02-13 |
CN107688596B CN107688596B (en) | 2020-02-21 |
Family
ID=61152644
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710433359.1A Active CN107688596B (en) | 2017-06-09 | 2017-06-09 | Burst topic detection method and burst topic detection equipment |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107688596B (en) |
WO (1) | WO2018223718A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111897958A (en) * | 2020-07-16 | 2020-11-06 | 邓桦 | Ancient poetry classification method based on natural language processing |
CN113204638A (en) * | 2021-04-23 | 2021-08-03 | 上海明略人工智能(集团)有限公司 | Recommendation method, system, computer and storage medium based on work session unit |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102289487A (en) * | 2011-08-09 | 2011-12-21 | 浙江大学 | Network burst hotspot event detection method based on topic model |
CN102346766A (en) * | 2011-09-20 | 2012-02-08 | 北京邮电大学 | Method and device for detecting network hot topics found based on maximal clique |
CN102971762A (en) * | 2010-07-01 | 2013-03-13 | 费斯布克公司 | Facilitating interaction among users of a social network |
CN104615593A (en) * | 2013-11-01 | 2015-05-13 | 北大方正集团有限公司 | Method and device for automatic detection of microblog hot topics |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102646114A (en) * | 2012-02-17 | 2012-08-22 | 清华大学 | News topic timeline abstract generating method based on breakthrough point |
CN105022827B (en) * | 2015-07-23 | 2016-06-15 | 合肥工业大学 | A kind of Web news dynamic aggregation method of domain-oriented theme |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102971762A (en) * | 2010-07-01 | 2013-03-13 | 费斯布克公司 | Facilitating interaction among users of a social network |
CN102289487A (en) * | 2011-08-09 | 2011-12-21 | 浙江大学 | Network burst hotspot event detection method based on topic model |
CN102346766A (en) * | 2011-09-20 | 2012-02-08 | 北京邮电大学 | Method and device for detecting network hot topics found based on maximal clique |
CN104615593A (en) * | 2013-11-01 | 2015-05-13 | 北大方正集团有限公司 | Method and device for automatic detection of microblog hot topics |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111897958A (en) * | 2020-07-16 | 2020-11-06 | 邓桦 | Ancient poetry classification method based on natural language processing |
CN111897958B (en) * | 2020-07-16 | 2024-03-12 | 邓桦 | Ancient poetry classification method based on natural language processing |
CN113204638A (en) * | 2021-04-23 | 2021-08-03 | 上海明略人工智能(集团)有限公司 | Recommendation method, system, computer and storage medium based on work session unit |
CN113204638B (en) * | 2021-04-23 | 2024-02-23 | 上海明略人工智能(集团)有限公司 | Recommendation method, system, computer and storage medium based on working session unit |
Also Published As
Publication number | Publication date |
---|---|
CN107688596B (en) | 2020-02-21 |
WO2018223718A1 (en) | 2018-12-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105894372B (en) | The method and apparatus for predicting colony's credit | |
CN104081392A (en) | Influence scores for social media profiles | |
CN106874253A (en) | Recognize the method and device of sensitive information | |
CN107507036A (en) | The method and terminal of a kind of data prediction | |
CN116601626A (en) | Personal knowledge graph construction method and device and related equipment | |
CN103827895A (en) | Entity fingerprints | |
CN108804617A (en) | Field term abstracting method, device, terminal device and storage medium | |
CN109376287B (en) | House property map construction method, device, computer equipment and storage medium | |
CN108319377A (en) | Method and system, terminal and the computer readable storage medium of displaying word input | |
CN114625973B (en) | Anonymous information cross-domain recommendation method and device, electronic equipment and storage medium | |
CN110473073A (en) | The method and device that linear weighted function is recommended | |
CN114357184B (en) | Item recommendation method and related device, electronic equipment and storage medium | |
CN107688596A (en) | Happen suddenly topic detecting method and burst topic detection equipment | |
CN109885747A (en) | Industry public sentiment monitoring method, device, computer equipment and storage medium | |
CN110390011A (en) | The method and apparatus of data classification | |
CN111401959B (en) | Risk group prediction method, apparatus, computer device and storage medium | |
CN112766995B (en) | Article recommendation method, device, terminal equipment and storage medium | |
CN111209403A (en) | Data processing method, device, medium and electronic equipment | |
CN107798249B (en) | Method for releasing behavior pattern data and terminal equipment | |
CN116304251A (en) | Label processing method, device, computer equipment and storage medium | |
CN111708821B (en) | Method, device and storage medium for determining personnel intimacy | |
CN116868207A (en) | Decision tree of original graph database | |
CN114925275A (en) | Product recommendation method and device, computer equipment and storage medium | |
CN113868373A (en) | Word cloud generation method and device, electronic equipment and storage medium | |
CN107622129B (en) | Method and device for organizing knowledge base and computer storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||