CN107688596A - Burst topic detection method and burst topic detection device - Google Patents

Burst topic detection method and burst topic detection device

Info

Publication number
CN107688596A
Authority
CN
China
Prior art keywords
topic
participle
word
keyword
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710433359.1A
Other languages
Chinese (zh)
Other versions
CN107688596B (en
Inventor
王健宗
黄章成
吴天博
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201710433359.1A (patent/CN107688596B/en)
Priority to PCT/CN2018/074870 (patent/WO2018223718A1/en)
Publication of CN107688596A
Application granted
Publication of CN107688596B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2358 Change logging, detection, and notification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a burst topic detection method and device, applicable to the field of Internet technology. The method includes: continuously obtaining topic data from an information sharing platform; each time an item of topic data is obtained, matching the topic data against the words in a preset dictionary to output multiple segmentation results; outputting the segmented words contained in the segmentation result with the highest matching degree as the keywords of the topic data; updating, according to the keywords, the summary information associated with the topic data; and displaying the keywords and the summary information so that the user learns of the burst topic at the current moment. The invention can determine the keywords of topic data and update the summary information based on those keywords, so that the user can quickly identify burst topics on the information sharing platform from the displayed keywords and summary information.

Description

Burst topic detection method and burst topic detection device
Technical field
The invention belongs to the field of Internet technology, and in particular relates to a burst topic detection method and a burst topic detection device.
Background
On information sharing platforms such as Weibo, Twitter and forums, the openness of the platform lets users share and forward all kinds of information anytime and anywhere. If a large number of users share or forward the same information within a short period, the topic corresponding to that information can develop into a highly popular burst topic. If such a burst topic relates to a particular enterprise, it may have a large impact on public opinion about that enterprise. If the enterprise cannot discover and track burst topic events related to it in time, it will miss the best window for eliminating the negative public opinion, which weakens the enterprise's soft power.
However, in the prior art it is difficult to identify burst topics on an information sharing platform quickly by technical means, and difficult to determine whether each burst topic is related to the enterprise itself.
Summary of the invention
In view of this, embodiments of the present invention provide a burst topic detection method and a burst topic detection device, to solve the problems that, in the prior art, it is difficult to identify burst topics on an information sharing platform quickly by technical means and difficult to determine whether each burst topic is related to the enterprise itself.
A first aspect of the embodiments of the present invention provides a burst topic detection method, including:
Continuously obtaining topic data from an information sharing platform;
Each time an item of topic data is obtained, matching the topic data against each word in a preset dictionary to output multiple segmentation results;
Outputting the segmented words contained in the segmentation result with the highest matching degree as the keywords of the topic data;
Updating, according to the keywords, the summary information associated with the topic data;
Displaying the keywords and the summary information so that the user learns of the burst topic at the current moment.
A second aspect of the embodiments of the present invention provides a burst topic detection device. The device includes a memory, a processor, and a burst topic detection program that is stored in the memory and can run on the processor. When the processor executes the burst topic detection program, the following steps are implemented:
Continuously obtaining topic data from an information sharing platform;
Each time an item of topic data is obtained, matching the topic data against each word in a preset dictionary to output multiple segmentation results;
Outputting the segmented words contained in the segmentation result with the highest matching degree as the keywords of the topic data;
Updating, according to the keywords, the summary information associated with the topic data;
Displaying the keywords and the summary information so that the user learns of the burst topic at the current moment.
A third aspect of the embodiments of the present invention provides a computer-readable storage medium storing a burst topic detection program. When the burst topic detection program is executed by at least one processor, the following steps are implemented:
Continuously obtaining topic data from an information sharing platform;
Each time an item of topic data is obtained, matching the topic data against each word in a preset dictionary to output multiple segmentation results;
Outputting the segmented words contained in the segmentation result with the highest matching degree as the keywords of the topic data;
Updating, according to the keywords, the summary information associated with the topic data;
Displaying the keywords and the summary information so that the user learns of the burst topic at the current moment.
In the embodiments of the present invention, each time topic data is obtained from the information sharing platform, the keywords of that topic data are determined and the summary information is updated in real time based on them. From the displayed keywords and summary information the user can see at once roughly what the burst topic on the platform is about, and can quickly decide from the summary information whether the burst topic is related to the enterprise itself. Burst topic events related to the enterprise can thus be discovered, tracked and handled effectively, which improves the enterprise's soft power.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of the burst topic detection method provided by an embodiment of the present invention;
Fig. 2 is a detailed flow chart of step S103 of the burst topic detection method provided by an embodiment of the present invention;
Fig. 3 is a detailed flow chart of step S104 of the burst topic detection method provided by an embodiment of the present invention;
Fig. 4 is a detailed flow chart of step S303 of the burst topic detection method provided by an embodiment of the present invention;
Fig. 5 is a detailed flow chart of step S305 of the burst topic detection method provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of the burst topic detection apparatus provided by an embodiment of the present invention;
Fig. 7 is a schematic diagram of the burst topic detection device provided by an embodiment of the present invention.
Detailed description
In the following description, specific details such as particular system structures and techniques are set forth for the purpose of illustration rather than limitation, so that the embodiments of the invention can be thoroughly understood. However, it will be clear to those skilled in the art that the invention can also be practised in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits and methods are omitted so that unnecessary detail does not obscure the description of the invention.
The technical solutions of the invention are illustrated below through specific embodiments.
Fig. 1 shows the flow of the burst topic detection method provided by an embodiment of the present invention. The method includes steps S101 to S105, and the principle of each step is as follows:
S101: Continuously obtain topic data from an information sharing platform.
In the embodiments of the present invention, information sharing platforms include but are not limited to Weibo, Twitter, Facebook and the major BBS forums. Each item of topic data is a piece of text that is published by a user and can be displayed on the information sharing platform; it may be associated with one or more unexpected events. Such text includes but is not limited to original posts, reposts, and the user comments on original posts or reposts.
Topic data can be obtained from the information sharing platform in either of two ways. In the first, an application that can interact with the platform's API (Application Programming Interface) is created in advance, and, using an account key obtained in advance, the application calls the API provided by the platform to obtain the topic data the platform returns. In the second, a crawler program continuously crawls the topic data on the platform.
Because topic data on an information sharing platform is continually updated and keeps growing, the embodiments of the present invention obtain it in real time, that is, continuously, so that the system has the latest topic data at every moment and can detect burst topics accurately and promptly.
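The continuous acquisition in S101 can be pictured as a polling loop. The following is only a minimal sketch under stated assumptions: `fetch_batch` is a hypothetical callable standing in for either the platform API call or the crawler, and `poll_seconds` and `max_polls` are illustrative parameters that do not appear in the patent.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Set


def stream_topic_data(fetch_batch: Callable[[], Iterable[str]],
                      poll_seconds: float = 1.0,
                      max_polls: Optional[int] = None) -> Iterator[str]:
    """Yield each new item of topic data once, polling the source repeatedly.

    fetch_batch is a hypothetical stand-in for the API call or crawler.
    """
    seen: Set[str] = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for text in fetch_batch():
            if text not in seen:      # skip items that were already delivered
                seen.add(text)
                yield text
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(poll_seconds)  # wait before asking the platform again
```

Each yielded item would then be passed to the matching step S102; de-duplicating by exact text is a simplification chosen for the sketch.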
S102: Each time an item of topic data is obtained, match the topic data against each word in a preset dictionary to output multiple segmentation results.
Each time a new item of topic data is received, the system performs word matching on it. Specifically, starting from the first character of the topic data, the system checks whether the topic data contains words from the preset dictionary. When a run of consecutive characters in the topic data is identical to a word in the preset dictionary, that run of characters is taken as one segmented word, and the word matching process is repeated starting from the first character after that segmented word. Once every segmented word in the topic data has been determined, one word matching pass is complete, and that pass outputs one segmentation result containing multiple segmented words. In particular, every segmented word has two or more characters.
In practice, a character in the topic data may form one first segmented word with one or more characters adjacent on its left, or a different first segmented word with one or more characters adjacent on its right. The same topic data can therefore yield different segmentation results under different segmentation rules. In the embodiments of the present invention, for one item of topic data, one segmentation result is output for each of the pre-stored segmentation rules. Different segmentation results may have different matching degrees, where the matching degree indicates how well a user can grasp the actual meaning of the topic data from the segmented words in the segmentation result.
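As an illustration of how two segmentation rules can split the same text differently, the sketch below uses forward and backward maximum matching against a dictionary. These two rules are an assumption chosen for the example; the patent does not name its pre-stored segmentation rules. Latin letters stand in for Chinese characters, and the function name `max_match` is hypothetical.

```python
def max_match(text, dictionary, backward=False):
    """Dictionary-based segmentation; segmented words have at least 2 characters.

    Forward mode scans left to right, backward mode right to left, each time
    taking the longest dictionary word available at the current position.
    """
    longest = max((len(w) for w in dictionary), default=0)
    tokens = []
    if not backward:
        i = 0
        while i < len(text):
            for size in range(min(longest, len(text) - i), 1, -1):
                if text[i:i + size] in dictionary:
                    tokens.append(text[i:i + size])
                    i += size
                    break
            else:
                i += 1  # no dictionary word starts here; skip one character
    else:
        j = len(text)
        while j > 0:
            for size in range(min(longest, j), 1, -1):
                if text[j - size:j] in dictionary:
                    tokens.insert(0, text[j - size:j])
                    j -= size
                    break
            else:
                j -= 1  # no dictionary word ends here; skip one character
    return tokens


words = {"ab", "abc", "de", "cde"}
print(max_match("abcde", words))                 # ['abc', 'de']
print(max_match("abcde", words, backward=True))  # ['ab', 'cde']
```

The two calls return different segmentation results for the same input, which is the situation the matching degree in S103 is meant to resolve.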
S103: Output the segmented words contained in the segmentation result with the highest matching degree as the keywords of the topic data.
In the embodiments of the present invention, the matching degree of each segmentation result can be determined from the average number of characters per segmented word, or from the variance of the character counts of the segmented words; no limitation is imposed here.
Preferably, because a segmented word with more characters makes it easier for a user to determine the actual meaning of the topic data, the matching degree of each segmentation result is measured on the longest-match principle. After the matching degrees of the segmentation results are compared, each first segmented word contained in the segmentation result with the highest matching degree is output as a keyword of the topic data.
For example, when only the three Chinese characters for "data cable" appear in the topic data, both "data cable" and "data" can form a segmented word, but "data cable" has the higher matching degree. The segmented word contained in the segmentation result with the highest matching degree is therefore "data cable", and "data cable" is output as the keyword.
As one embodiment of the present invention, the way the matching degree of a segmentation result is calculated is further specified. As shown in Fig. 2, S103 specifically includes:
S201: For each segmentation result, calculate its average character count per segmented word from the character count of each segmented word and the total number of segmented words in that segmentation result.
Each segmentation result contains multiple segmented words, and each segmented word consists of at least two characters. In the embodiments of the present invention, the total number of segmented words is determined, along with the number of characters in each segmented word. The sum of the character counts of all segmented words divided by the total number of segmented words is output as the average character count.
For example, if one segmentation result obtained after segmenting the topic data is {everyday group / data cable / yield}, its three segmented words are "everyday group", "data cable" and "yield". Their character counts in the original Chinese are 4, 3 and 3 respectively, the total number of segmented words is 3, and the average character count is (4+3+3)/3 = 3.33.
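The arithmetic of S201 can be checked with a few lines of code. This is a sketch only; the function name `mean_token_length` is hypothetical, and placeholder strings of 4, 3 and 3 characters stand in for the Chinese segmented words of the example.

```python
def mean_token_length(tokens):
    """Average number of characters per segmented word in one segmentation result."""
    return sum(len(t) for t in tokens) / len(tokens)


# Placeholders with the same character counts as the example (4, 3, 3).
result = ["AAAA", "BBB", "CCC"]
print(round(mean_token_length(result), 2))  # 3.33
```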
S202: Weight the average character count and the total number of segmented words of each segmentation result to output the matching degree of that segmentation result.
In the embodiments of the present invention, the weight coefficient of the average character count $A_1$ is a preset value $a_1$, the weight coefficient of the total number of segmented words $A_2$ is a preset value $a_2$, and $a_1 + a_2 = 1$. The matching degree of each segmentation result is $C = A_1 \times a_1 + A_2 \times a_2$.
S203: Output the segmented words contained in the segmentation result with the highest matching degree as the keywords of the topic data.
If $m$ segmentation results are obtained after segmenting the topic data, with matching degrees $C_1, C_2, \ldots, C_m$, the largest value $C_i$ among $C_1, C_2, \ldots, C_m$ is chosen, and each segmented word in the segmentation result corresponding to $C_i$ is output as a keyword of the topic data. Here $m$ is an integer greater than 1 and $i \le m$.
In the embodiments of the present invention, both the average character count and the total number of segmented words strongly affect a segmentation result and determine whether a user can work out the actual meaning of the topic data. Weighting these two factors and using the weighted value as the matching degree when choosing keywords therefore improves the accuracy and validity of keyword selection, so that the event behind a burst topic can be pinpointed.
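Steps S201 to S203 can be sketched together as follows. The weight `a1 = 0.7` is an arbitrary illustrative value (the patent only requires preset weights with a1 + a2 = 1), and the function names are hypothetical.

```python
def matching_degree(tokens, a1=0.7):
    """C = A1 * a1 + A2 * a2, with A1 the mean length, A2 the count, a1 + a2 = 1."""
    a2 = 1.0 - a1
    mean_len = sum(len(t) for t in tokens) / len(tokens)  # A1
    return mean_len * a1 + len(tokens) * a2               # A2 = len(tokens)


def best_segmentation(results, a1=0.7):
    """Return the segmentation result with the highest matching degree."""
    return max(results, key=lambda tokens: matching_degree(tokens, a1))


candidates = [["AAAA", "BBB", "CCC"], ["AA", "AA", "BBB", "CCC"]]
keywords = best_segmentation(candidates)  # the first candidate under a1 = 0.7
```

With a1 = 0.7 the first candidate scores about 3.23 and the second 2.95, so its three segmented words become the keywords. A different weight can change the winner, which is why the choice of a1 and a2 matters in practice.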
S104: According to the keywords, update the summary information associated with the topic data.
At any moment the system has cumulatively received multiple items of topic data. After determining the keywords of each item, the system regenerates the summary information describing all topic data received so far, so that the user can clearly see from it the general content of the burst topic at the current moment.
Keywords carry the key attributes of the topic data. To generate summary information associated with all topic data received so far, the cumulative word frequency of each keyword across the items of topic data can be counted, and the summary information generated from the keywords whose cumulative word frequency exceeds a threshold. The summary information associated with the topic data and the keywords can be generated with, for example, the TextRank algorithm or the summary generation tool in Word.
Preferably, as one embodiment of the present invention, as shown in Fig. 3, S104 specifically includes:
S301: Obtain the cumulative word frequency of each keyword and calculate the growth acceleration of that cumulative word frequency, where the cumulative word frequency of a keyword is the cumulative number of times the keyword occurs in all topic data obtained up to the current moment.
In the embodiments of the present invention, the cumulative word frequency of a keyword is the number of times the keyword occurs in all topic data received so far. Because the system is continuously obtaining topic data, the cumulative word frequency of any given keyword keeps increasing. If, within a period $\Delta T$, the system detects that the cumulative word frequency of keyword A has increased by $\Delta S$, the growth rate of keyword A's cumulative word frequency is $V = \Delta S / \Delta T$, and the growth acceleration $a$ of its cumulative word frequency is the derivative of the growth rate $V$ with respect to time, that is, $a = V'(t)$. The larger the growth acceleration, the more often the keyword appears in topic data per unit time, and the more sudden the topic.
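Because the cumulative word frequency is only observed at discrete moments, the derivative $a = V'(t)$ has to be approximated in practice. The sketch below uses finite differences over the last three observations; this discretisation, and the function name `growth_acceleration`, are assumptions made for illustration.

```python
def growth_acceleration(samples):
    """Approximate a = V'(t) from the last three (time, cumulative_count) samples.

    V is estimated on each of the two intervals as delta_S / delta_T, and the
    acceleration as the change in V divided by the time between the interval
    midpoints.
    """
    (t0, s0), (t1, s1), (t2, s2) = samples[-3:]
    v1 = (s1 - s0) / (t1 - t0)   # growth rate on the earlier interval
    v2 = (s2 - s1) / (t2 - t1)   # growth rate on the later interval
    return (v2 - v1) / ((t2 - t0) / 2)


# Cumulative counts 0, 2, 8 at times 0, 1, 2: rates 2 then 6, acceleration 4.
print(growth_acceleration([(0, 0), (1, 2), (2, 8)]))  # 4.0
```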
S302: Add the growth acceleration corresponding to each keyword to a previously generated matrix.
Each time new topic data is received, the system determines its keywords and the growth acceleration of each keyword's cumulative word frequency. If the topic data has K keywords, K growth accelerations are obtained. If the system has accumulated P growth accelerations in total (P ≥ K, P ∈ Z), the matrix is extended to a P × P matrix and the K growth accelerations just obtained are added to it. Besides the P growth accelerations, the P × P matrix also contains null values.
S303: Calculate the eigenvalue of the matrix at the current moment, and when the eigenvalue is greater than a first threshold, determine from the matrix the growth accelerations that are greater than a second threshold.
The system monitors the growth accelerations in the matrix so as to detect the matrix's eigenvalue in real time. As more topic data is accumulated, the size of the matrix and the number of growth accelerations it contains keep changing, and the eigenvalue of the matrix increases accordingly. When the eigenvalue exceeds the preset first threshold, the system locates, among the growth accelerations in the matrix, the one or more whose value exceeds the second threshold.
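The patent does not state where in the P × P matrix the growth accelerations are placed or which eigenvalue is compared with the first threshold. The sketch below therefore makes two explicit assumptions: the accelerations sit on the diagonal with zeros elsewhere, and the largest eigenvalue is the one tested. For a diagonal matrix the eigenvalues are exactly the diagonal entries, so no numerical solver is needed. All function names are hypothetical.

```python
def build_matrix(accelerations):
    """P x P matrix with the accelerations on the diagonal (assumed layout)."""
    p = len(accelerations)
    return [[accelerations[i] if i == j else 0.0 for j in range(p)]
            for i in range(p)]


def largest_eigenvalue_diagonal(matrix):
    """For a diagonal matrix, the eigenvalues are its diagonal entries."""
    return max(matrix[i][i] for i in range(len(matrix)))


def bursting_keywords(acc_by_keyword, first_threshold, second_threshold):
    """Keywords whose acceleration exceeds the second threshold, but only once
    the matrix eigenvalue has exceeded the first threshold."""
    keywords = list(acc_by_keyword)
    matrix = build_matrix([acc_by_keyword[k] for k in keywords])
    if not keywords or largest_eigenvalue_diagonal(matrix) <= first_threshold:
        return []
    return [k for i, k in enumerate(keywords) if matrix[i][i] > second_threshold]


acc = {"recall": 0.5, "outage": 3.0, "refund": 1.2}
print(bursting_keywords(acc, first_threshold=2.0, second_threshold=1.0))
# ['outage', 'refund']
```

With a non-diagonal layout a general eigenvalue routine would be needed instead; the two-stage thresholding would stay the same.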
As one embodiment of the present invention, as shown in Fig. 4, S303 specifically includes:
S401: Divide the growth accelerations in the matrix at the current moment into N groups, and map the growth accelerations of each group into one submatrix.
Because the matrix contains a large number of growth accelerations, the matrix is reduced in dimension so that growth accelerations exceeding the second threshold can be located faster.
Specifically, according to a preset rule, all growth accelerations in the matrix are divided into N groups, so that each group contains a smaller number of growth accelerations. The groups may contain the same or different numbers of growth accelerations. The growth accelerations of each group are mapped into one submatrix, so when there are N groups there are also N submatrices. As topic data accumulates, the growth accelerations obtained at each update are likewise mapped into these N submatrices.
S402: Calculate the eigenvalue of each submatrix, and when the eigenvalue of a submatrix is greater than a fourth threshold, select from that submatrix the growth accelerations greater than the second threshold.
The eigenvalue of each submatrix is calculated. If the eigenvalues of any of the N submatrices exceed the preset fourth threshold, the growth accelerations greater than the second threshold are selected from each submatrix whose eigenvalue exceeds the fourth threshold.
In the embodiments of the present invention, each submatrix contains far fewer growth accelerations than the full matrix. By calculating the eigenvalue of each submatrix separately, the growth accelerations exceeding the second threshold can be located quickly within the submatrices whose eigenvalue exceeds the fourth threshold, which improves the efficiency of burst topic detection.
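The grouped screening of S401 and S402 might look like the following sketch. The round-robin grouping rule and the diagonal layout of each submatrix (so that its largest eigenvalue equals its largest entry) are assumptions made for illustration; the patent only refers to a preset rule. The function names are hypothetical.

```python
def split_into_groups(items, n_groups):
    """Round-robin split into n_groups lists (an assumed grouping rule)."""
    return [items[i::n_groups] for i in range(n_groups)]


def screen_by_groups(acc_by_keyword, n_groups, fourth_threshold, second_threshold):
    """Inspect only the groups whose submatrix eigenvalue exceeds the fourth
    threshold, and return the keywords there that exceed the second threshold."""
    hits = []
    for group in split_into_groups(list(acc_by_keyword.items()), n_groups):
        if not group:
            continue
        # Diagonal submatrix assumed: largest eigenvalue == largest entry.
        if max(a for _, a in group) > fourth_threshold:
            hits.extend(k for k, a in group if a > second_threshold)
    return hits


acc = {"recall": 0.5, "outage": 3.0, "refund": 1.2, "delay": 0.1}
print(screen_by_groups(acc, n_groups=2, fourth_threshold=2.0, second_threshold=1.0))
# ['outage']
```

In this example "refund" exceeds the second threshold but is not reported, because its group's eigenvalue stays below the fourth threshold. This trade-off between speed and coverage follows directly from the per-submatrix test.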
S304: According to the segmented words corresponding to the growth accelerations so determined, select from all topic data obtained the topic data that contains those segmented words.
Each growth acceleration in the matrix or submatrix corresponds to one keyword, and each keyword is a segmented word in the segmentation result with the highest matching degree for some topic data. The system can therefore use a pre-stored mapping table between growth accelerations and segmented words to look up the segmented word corresponding to each growth acceleration that exceeds the second threshold. If L growth accelerations exceed the second threshold, L segmented words are found.
The system then screens each item of topic data obtained up to the current moment in turn, checking whether it contains the L segmented words. If an item of topic data contains the L segmented words, the system selects it and performs step S305 on it.
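The screening in S304 amounts to a containment test. The sketch below keeps an item of topic data only if it contains all L segmented words, which is one reading of the text above; requiring any one of them instead would be a one-word change (`any` for `all`). The function name is hypothetical.

```python
def select_topic_data(topic_texts, tokens):
    """Keep the items of topic data that contain every one of the given tokens."""
    return [text for text in topic_texts if all(tok in text for tok in tokens)]


posts = ["outage refund now", "outage again", "refund policy"]
print(select_topic_data(posts, ["outage", "refund"]))  # ['outage refund now']
```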
S305: Segment the topic data containing the segmented words again, and calculate the word frequency feature value of each segmented word obtained from this segmentation.
The system segments each selected item of topic data again. Any existing segmentation algorithm can be used, including but not limited to string-matching-based and statistics-based segmentation algorithms. After segmentation, a new set of segmented words is obtained for the topic data. To distinguish the segmented words obtained in S102 from those obtained in S305, the former are called first segmented words and the latter second segmented words. A first segmented word and a second segmented word may be identical or different. To further select the second segmented words that most affect the summary information, the word frequency feature value of each second segmented word is calculated from its word frequency features. These features include but are not limited to term frequency (TF) and inverse document frequency (IDF).
As one embodiment of the present invention, as shown in Fig. 5, S305 specifically includes:
S501: Segment the topic data containing the segmented words again to obtain multiple segmented words.
S502: Within all topic data obtained up to the current moment, calculate the statistical word frequency and the inverse document frequency of each segmented word obtained from the segmentation.
In the embodiments of the present invention, the number of times each second segmented word appears in the selected items of topic data is counted, and that count is the statistical word frequency $F_{TF}$ of the second segmented word. If X items of topic data have been selected in total, of which X′ items (X′ ≤ X, X′ ∈ Z) contain a given second segmented word, the inverse document frequency $F_{IDF}$ of that second segmented word is calculated from X and X′.
S503: Weight the statistical word frequency and the inverse document frequency of each segmented word to output the word frequency feature value of that segmented word.
The weight coefficient of the statistical word frequency $F_{TF}$ is a preset value $a_3$, the weight coefficient of the inverse document frequency $F_{IDF}$ is a preset value $a_4$, and $a_3 + a_4 = 1$. The word frequency feature value of each second segmented word is $F = F_{TF} \times a_3 + F_{IDF} \times a_4$.
In the embodiments of the present invention, the word frequency feature value of each second segmented word is calculated from its TF and IDF values using customisable weight coefficients. By considering both the TF and IDF values of the second segmented words, the importance of each second segmented word across the selected items of topic data can be compared quantitatively.
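The weighted feature value of S502 and S503 can be sketched as follows. Since the exact expression for $F_{IDF}$ is not reproduced in this text, the conventional form log(X / X′) is used here as an assumption. The weight `a3 = 0.5` and the function name are likewise illustrative.

```python
import math


def word_frequency_features(segmented_texts, a3=0.5):
    """F = F_TF * a3 + F_IDF * a4 for every token, with a3 + a4 = 1.

    segmented_texts: one token list per selected item of topic data.
    F_IDF is assumed to be log(X / X'), the conventional definition.
    """
    a4 = 1.0 - a3
    x = len(segmented_texts)
    tf, df = {}, {}
    for tokens in segmented_texts:
        for tok in tokens:                 # total occurrences -> F_TF
            tf[tok] = tf.get(tok, 0) + 1
        for tok in set(tokens):            # items containing the token -> X'
            df[tok] = df.get(tok, 0) + 1
    return {tok: tf[tok] * a3 + math.log(x / df[tok]) * a4 for tok in tf}


docs = [["fire", "city"], ["fire", "fire"], ["city", "rain"]]
features = word_frequency_features(docs)
top = max(features, key=features.get)  # 'fire'
```

Note that this weighted sum differs from the more common TF × IDF product; the sum is what the formula for F above specifies.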
S306: Output the segmented words whose word frequency feature value is greater than a third threshold as high-frequency words, and connect the high-frequency words by a preset algorithm to obtain summary information containing each high-frequency word.
The second segmented words whose word frequency feature value F exceeds the preset third threshold are determined; these are the high-frequency words appearing in the topic data. The high-frequency words are then connected using the TextRank algorithm mentioned above, the summary generation tool in Word, or another custom algorithm, to obtain summary information associated with the topic data and the high-frequency words.
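The thresholding part of S306 is straightforward to sketch. How the high-frequency words are connected is left to the preset algorithm (for example TextRank); the plain join in descending order of feature value used below is only a placeholder for that step, not the patent's method. The function name is hypothetical.

```python
def summarize(features, third_threshold, separator=" "):
    """Select tokens with F above the third threshold and join them.

    The join in descending order of F is a stand-in for the real connection
    algorithm.
    """
    ranked = sorted(features.items(), key=lambda kv: kv[1], reverse=True)
    high_frequency = [tok for tok, f in ranked if f > third_threshold]
    return separator.join(high_frequency)


print(summarize({"fire": 1.7, "city": 1.2, "rain": 1.0}, third_threshold=1.1))
# fire city
```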
S105: Display the keywords and the summary information so that the user learns of the burst topic at the current moment.
The system displays the keywords obtained in real time together with the updated summary information. In practice, only when the topic data belongs to a burst topic does the growth acceleration of a keyword's cumulative word frequency exceed the threshold, and only then is the summary information updated. The text displayed by the system in real time is therefore highly similar to the true content of the burst topic event and has reference value.
In the embodiments of the present invention, each time topic data is obtained from the information sharing platform, the keywords of that topic data are determined and the summary information is updated in real time based on them. From the displayed keywords and summary information the user can see at once roughly what the burst topic on the platform is about, and can quickly decide from the summary information whether the burst topic is related to the enterprise itself. Burst topic events related to the enterprise can thus be discovered, tracked and handled effectively, which improves the enterprise's soft power.
It should be understood that the step numbers in the above embodiments do not imply an order of execution. The order in which the processes execute is determined by their functions and internal logic, and does not limit the implementation of the embodiments of the present invention in any way.
Corresponding to the burst topic detection method described in the foregoing embodiments, Fig. 6 shows a schematic diagram of the burst topic detection apparatus provided by an embodiment of the present invention. For ease of description, only the parts related to the embodiment of the present invention are shown.
Referring to Fig. 6, the apparatus includes:
An acquisition module 61, configured to continuously acquire topic data from an information sharing platform.
A matching module 62, configured to match, each time a piece of topic data is acquired, the topic data against each word in a preset dictionary to output multiple word segmentation results.
An output module 63, configured to output the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data.
An update module 64, configured to update, according to the keywords, the summary information associated with the topic data.
A display module 65, configured to display the keywords and the summary information so that the user learns of the burst topic at the current moment.
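The matching module produces several candidate segmentations of the same text from one dictionary. The sketch below shows one possible way to do so, assuming forward and backward maximum matching; this passage does not name the matching strategy, so both functions and the `max_len` parameter are illustrative assumptions.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word available."""
    result, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                result.append(piece)
                i += size
                break
    return result


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word available."""
    result, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                result.append(piece)
                j -= size
                break
    result.reverse()
    return result


def candidate_segmentations(text, dictionary, max_len=4):
    """Collect the distinct segmentations produced by both scan directions."""
    candidates = []
    for seg in (forward_max_match(text, dictionary, max_len),
                backward_max_match(text, dictionary, max_len)):
        if seg not in candidates:
            candidates.append(seg)
    return candidates
```

With the dictionary `{"研究", "研究生", "生命", "起源"}` and the text `"研究生命起源"`, the two scan directions disagree, which yields two candidate segmentations for the output module to rank.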
Optionally, the update module 64 includes:
A first calculation submodule, configured to obtain the cumulative word frequency of each keyword and calculate the growth acceleration of the cumulative word frequency, where the cumulative word frequency of a keyword is the cumulative number of times the keyword has occurred in all topic data acquired up to the current moment.
An adding submodule, configured to add the growth acceleration corresponding to each keyword to a pre-generated matrix.
A determination submodule, configured to calculate the eigenvalue of the matrix at the current moment and, when the eigenvalue is greater than a first threshold, determine from the matrix the growth accelerations that are greater than a second threshold.
A screening submodule, configured to filter out, from all acquired topic data, the topic data containing the word segments that correspond to the determined growth accelerations.
A word segmentation submodule, configured to perform word segmentation again on the topic data containing those word segments, and calculate the word-frequency feature value of each word segment obtained after the word segmentation.
A first output submodule, configured to output the word segments whose word-frequency feature value is greater than a third threshold as high-frequency words, and connect the high-frequency words by means of a preset algorithm to obtain the summary information containing each of the high-frequency words.
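The growth acceleration used by the first calculation submodule can be sketched as a second difference of the cumulative word frequency. This assumes the cumulative counts are sampled at equally spaced moments, which this passage does not state; the function names are hypothetical.

```python
def growth_acceleration(cumulative_counts):
    """Second difference of the last three cumulative counts.

    With counts c0, c1, c2 the growth rates are (c1 - c0) and (c2 - c1),
    so the acceleration is (c2 - c1) - (c1 - c0).
    """
    if len(cumulative_counts) < 3:
        return 0  # not enough history to estimate an acceleration
    c0, c1, c2 = cumulative_counts[-3:]
    return c2 - 2 * c1 + c0


def accelerating_keywords(history, second_threshold):
    """Map each keyword whose acceleration exceeds the threshold to that acceleration."""
    result = {}
    for keyword, counts in history.items():
        acceleration = growth_acceleration(counts)
        if acceleration > second_threshold:
            result[keyword] = acceleration
    return result
```

A keyword whose count grows steadily (for example 5, 6, 7) has zero acceleration, whereas a keyword whose count jumps (1, 3, 9) has a positive one, which is what distinguishes a burst from ordinary popularity.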
Optionally, the determination submodule is specifically configured to:
Divide the growth accelerations in the matrix at the current moment into N groups, and map the growth accelerations of each group into a submatrix;
Calculate the eigenvalue of each submatrix and, when the eigenvalue of a submatrix is greater than a fourth threshold, filter out from that submatrix the growth accelerations that are greater than the second threshold;
Where N is an integer greater than 1.
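The grouping and eigenvalue test can be sketched as follows. How a group is mapped into a square submatrix is not specified in this passage, so the sketch assumes each keyword carries a short vector of recent growth accelerations and uses the Gram matrix of those vectors, whose largest eigenvalue is estimated by power iteration. All function names, and the Gram-matrix mapping itself, are assumptions for illustration only.

```python
def split_into_groups(rows, n_groups):
    """Split rows into n_groups contiguous groups of near-equal size."""
    size, extra = divmod(len(rows), n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        end = start + size + (1 if g < extra else 0)
        groups.append(rows[start:end])
        start = end
    return groups


def gram_matrix(rows):
    """Symmetric matrix of pairwise dot products between the rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def dominant_eigenvalue(matrix, iterations=100):
    """Estimate the largest eigenvalue of a symmetric positive semidefinite matrix.

    Power iteration is started from every unit vector so that at least one
    start is not orthogonal to the dominant eigenvector.
    """
    n = len(matrix)
    best = 0.0
    for start in range(n):
        vec = [1.0 if i == start else 0.0 for i in range(n)]
        value = 0.0
        for _ in range(iterations):
            nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
            value = max(abs(x) for x in nxt)
            if value == 0.0:
                break
            vec = [x / value for x in nxt]
        best = max(best, value)
    return best


def burst_keywords(accel_rows, n_groups, fourth_threshold, second_threshold):
    """accel_rows: list of (keyword, [recent growth accelerations]) pairs."""
    selected = []
    for group in split_into_groups(accel_rows, n_groups):
        rows = [values for _, values in group]
        if dominant_eigenvalue(gram_matrix(rows)) > fourth_threshold:
            # Only the most recent acceleration is compared with the second threshold.
            selected += [kw for kw, values in group if values[-1] > second_threshold]
    return selected
```

Splitting into N groups keeps each submatrix small, so the eigenvalue check per group is cheap and groups with no unusual activity are skipped without inspecting their individual entries.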
Optionally, the word segmentation submodule is specifically configured to:
Perform word segmentation again on the topic data containing the word segments, to obtain multiple word segments;
Calculate, over all topic data acquired up to the current moment, the statistical word frequency and the inverse document frequency of each word segment obtained after the word segmentation;
Weight the statistical word frequency and the inverse document frequency of each word segment to output the word-frequency feature value of that word segment.
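The weighting of word frequency and inverse document frequency can be sketched with the conventional TF-IDF product. The exact weighting used in the patent is not given in this passage, so the product form and the unsmoothed logarithm below are assumptions; `word_frequency_features` is a hypothetical name.

```python
import math


def word_frequency_features(documents):
    """documents: list of token lists, one list per piece of topic data.

    Returns {token: tf * idf}, where tf is the token's share of all tokens
    and idf is log(number of documents / documents containing the token).
    """
    total_tokens = sum(len(doc) for doc in documents)
    counts, doc_counts = {}, {}
    for doc in documents:
        for token in doc:
            counts[token] = counts.get(token, 0) + 1
        for token in set(doc):
            doc_counts[token] = doc_counts.get(token, 0) + 1
    features = {}
    for token, count in counts.items():
        tf = count / total_tokens
        idf = math.log(len(documents) / doc_counts[token])
        features[token] = tf * idf
    return features
```

Under this form, a token that appears in every piece of topic data receives a feature value of zero, so common function words fall below the third threshold without a separate stop-word list.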
Optionally, the output module 63 includes:
A second calculation submodule, configured to calculate the average number of characters per word segment for each word segmentation result, according to the total number of characters of the word segments in that result and the total number of word segments in that result.
A weighting submodule, configured to weight the average number of characters per word segment and the total number of word segments of each word segmentation result, to output the matching degree of that result.
A second output submodule, configured to output the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data.
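The matching degree can be sketched as a weighted sum of the two quantities named above. The weights and their signs are not given in this passage; the defaults below, which favor longer segments and penalize a larger segment count, are assumptions, and `matching_degree` and `select_keywords` are hypothetical names.

```python
def matching_degree(segmentation, mean_weight=1.0, count_weight=-0.1):
    """Weighted sum of the average characters per segment and the segment count."""
    if not segmentation:
        return float("-inf")  # an empty segmentation can never win
    total_chars = sum(len(segment) for segment in segmentation)
    count = len(segmentation)
    mean_chars = total_chars / count
    return mean_weight * mean_chars + count_weight * count


def select_keywords(candidates, **weights):
    """Return the candidate segmentation with the highest matching degree."""
    return max(candidates, key=lambda seg: matching_degree(seg, **weights))
```

For the candidates `[["ab", "cd", "e"], ["abc", "de"]]`, the second has both a higher character average and fewer segments, so `select_keywords` returns it under the default weights.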
Fig. 7 is a schematic diagram of the burst topic detection device provided by an embodiment of the present invention. As shown in Fig. 7, the burst topic detection device 7 of this embodiment includes a processor 70, a memory 71, and a computer program 72, such as a burst topic detection program, that is stored in the memory 71 and can run on the processor 70. When executing the computer program 72, the processor 70 implements the steps of each of the above burst topic detection method embodiments, for example steps 101 to 105 shown in Fig. 1. Alternatively, when executing the computer program 72, the processor 70 implements the functions of the modules/units in each of the above apparatus embodiments, for example the functions of modules 61 to 65 shown in Fig. 6.
Exemplarily, the computer program 72 may be divided into one or more modules/units, which are stored in the memory 71 and executed by the processor 70 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments being used to describe the execution of the computer program 72 in the burst topic detection device 7. For example, the computer program 72 may be divided into an acquisition module, a matching module, an output module, an update module, and a display module, whose specific functions are as follows:
The acquisition module is configured to continuously acquire topic data from an information sharing platform.
The matching module is configured to match, each time a piece of topic data is acquired, the topic data against each word in a preset dictionary to output multiple word segmentation results.
The output module is configured to output the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data.
The update module is configured to update, according to the keywords, the summary information associated with the topic data.
The display module is configured to display the keywords and the summary information so that the user learns of the burst topic at the current moment.
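The division of the program into five modules can be sketched as a small class in which each module is a pluggable callable. The class name `BurstTopicDetector` and the callable signatures are hypothetical; the sketch only shows the order in which the modules hand data to one another.

```python
class BurstTopicDetector:
    """Wires the five modules together; each module is supplied as a callable."""

    def __init__(self, segment, score, update_summary, display):
        self.segment = segment                # text -> list of candidate segmentations
        self.score = score                    # segmentation -> matching degree
        self.update_summary = update_summary  # (summary, keywords, text) -> new summary
        self.display = display                # (keywords, summary) -> None
        self.summary = ""

    def on_topic_data(self, text):
        candidates = self.segment(text)             # matching module
        keywords = max(candidates, key=self.score)  # output module
        self.summary = self.update_summary(self.summary, keywords, text)  # update module
        self.display(keywords, self.summary)        # display module
        return keywords

    def run(self, topic_stream):
        for text in topic_stream:                   # acquisition module
            self.on_topic_data(text)
```

Passing the modules in as callables mirrors the statement that the division into modules is only one logical partition: any module can be swapped out without touching the others.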
The burst topic detection device 7 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. Those skilled in the art will understand that Fig. 7 is merely an example of the burst topic detection device 7 and does not constitute a limitation on it. The device may include more or fewer components than shown, combine certain components, or use different components; for example, the burst topic detection device may further include input/output devices, network access devices, a bus, and the like.
The processor 70 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 71 may be an internal storage unit of the burst topic detection device 7, such as a hard disk or internal memory of the burst topic detection device 7. The memory 71 may also be an external storage device of the burst topic detection device 7, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card fitted to the burst topic detection device 7. Further, the memory 71 may include both an internal storage unit and an external storage device of the burst topic detection device 7. The memory 71 is used to store the computer program and other programs and data required by the burst topic detection device. The memory 71 may also be used to temporarily store data that has been output or is to be output.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the functional units and modules described above is given only as an example. In practical applications, the above functions may be assigned to different functional units or modules as required; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules serve only to distinguish them from one another and do not limit the protection scope of the present application. For the specific working process of the units and modules in the above system, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated here.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described or recorded in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the apparatus/terminal device embodiments described above are merely illustrative. The division into modules or units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present invention may also be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, the computer program can implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately added to or reduced according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
The embodiments described above are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.

Claims (10)

1. A burst topic detection method, characterized by comprising:
continuously acquiring topic data from an information sharing platform;
each time a piece of topic data is acquired, matching the topic data against each word in a preset dictionary to output multiple word segmentation results;
outputting the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data;
updating, according to the keywords, the summary information associated with the topic data;
displaying the keywords and the summary information so that the user learns of the burst topic at the current moment.
2. The burst topic detection method according to claim 1, characterized in that updating, according to the keywords, the summary information associated with the topic data comprises:
obtaining the cumulative word frequency of each keyword and calculating the growth acceleration of the cumulative word frequency, wherein the cumulative word frequency of a keyword is the cumulative number of times the keyword has occurred in all topic data acquired up to the current moment;
adding the growth acceleration corresponding to each keyword to a pre-generated matrix;
calculating the eigenvalue of the matrix at the current moment and, when the eigenvalue is greater than a first threshold, determining from the matrix the growth accelerations that are greater than a second threshold;
filtering out, from all acquired topic data, the topic data containing the word segments that correspond to the determined growth accelerations;
performing word segmentation again on the topic data containing the word segments, and calculating the word-frequency feature value of each word segment obtained after the word segmentation;
outputting the word segments whose word-frequency feature value is greater than a third threshold as high-frequency words, and connecting the high-frequency words by means of a preset algorithm to obtain the summary information containing each of the high-frequency words.
3. The burst topic detection method according to claim 2, characterized in that calculating the eigenvalue of the matrix at the current moment and, when the eigenvalue is greater than a first threshold, determining from the matrix the growth accelerations that are greater than a second threshold comprises:
dividing the growth accelerations in the matrix at the current moment into N groups, and mapping the growth accelerations of each group into a submatrix;
calculating the eigenvalue of each submatrix and, when the eigenvalue of a submatrix is greater than a fourth threshold, filtering out from that submatrix the growth accelerations that are greater than the second threshold;
wherein N is an integer greater than 1.
4. The burst topic detection method according to claim 2, characterized in that performing word segmentation again on the topic data containing the word segments and calculating the word-frequency feature value of each word segment obtained after the word segmentation comprises:
performing word segmentation again on the topic data containing the word segments, to obtain multiple word segments;
calculating, over all topic data acquired up to the current moment, the statistical word frequency and the inverse document frequency of each word segment obtained after the word segmentation;
weighting the statistical word frequency and the inverse document frequency of each word segment to output the word-frequency feature value of that word segment.
5. The burst topic detection method according to claim 1, characterized in that outputting the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data comprises:
calculating the average number of characters per word segment for each word segmentation result, according to the total number of characters of the word segments in that result and the total number of word segments in that result;
weighting the average number of characters per word segment and the total number of word segments of each word segmentation result, to output the matching degree of that result;
outputting the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data.
6. A computer-readable storage medium storing a burst topic detection program, characterized in that, when the burst topic detection program is executed by at least one processor, the steps of the burst topic detection method according to any one of claims 1 to 5 are implemented.
7. A burst topic detection device, characterized in that the burst topic detection device comprises a memory, a processor, and a burst topic detection program stored in the memory and executable on the processor, and the processor implements the following steps when executing the burst topic detection program:
continuously acquiring topic data from an information sharing platform;
each time a piece of topic data is acquired, matching the topic data against each word in a preset dictionary to output multiple word segmentation results;
outputting the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data;
updating, according to the keywords, the summary information associated with the topic data;
displaying the keywords and the summary information so that the user learns of the burst topic at the current moment.
8. The burst topic detection device according to claim 7, characterized in that the step of updating, according to the keywords, the summary information associated with the topic data specifically comprises:
obtaining the cumulative word frequency of each keyword and calculating the growth acceleration of the cumulative word frequency, wherein the cumulative word frequency of a keyword is the cumulative number of times the keyword has occurred in all topic data acquired up to the current moment;
adding the growth acceleration corresponding to each keyword to a pre-generated matrix;
calculating the eigenvalue of the matrix at the current moment and, when the eigenvalue is greater than a first threshold, determining from the matrix the growth accelerations that are greater than a second threshold;
filtering out, from all acquired topic data, the topic data containing the word segments that correspond to the determined growth accelerations;
performing word segmentation again on the topic data containing the word segments, and calculating the word-frequency feature value of each word segment obtained after the word segmentation;
outputting the word segments whose word-frequency feature value is greater than a third threshold as high-frequency words, and connecting the high-frequency words by means of a preset algorithm to obtain the summary information containing each of the high-frequency words.
9. The burst topic detection device according to claim 8, characterized in that the step of calculating the eigenvalue of the matrix at the current moment and, when the eigenvalue is greater than a first threshold, determining from the matrix the growth accelerations that are greater than a second threshold specifically comprises:
dividing the growth accelerations in the matrix at the current moment into N groups, and mapping the growth accelerations of each group into a submatrix;
calculating the eigenvalue of each submatrix and, when the eigenvalue of a submatrix is greater than a fourth threshold, filtering out from that submatrix the growth accelerations that are greater than the second threshold;
wherein N is an integer greater than 1.
10. The burst topic detection device according to claim 8, characterized in that the step of performing word segmentation on the topic data containing the first word segment to obtain multiple second word segments, and calculating the word-frequency feature value of each second word segment, specifically comprises:
performing word segmentation again on the topic data containing the word segments, to obtain multiple word segments;
calculating, over all topic data acquired up to the current moment, the statistical word frequency and the inverse document frequency of each word segment obtained after the word segmentation;
weighting the statistical word frequency and the inverse document frequency of each word segment to output the word-frequency feature value of that word segment.
CN201710433359.1A 2017-06-09 2017-06-09 Burst topic detection method and burst topic detection equipment Active CN107688596B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710433359.1A CN107688596B (en) 2017-06-09 2017-06-09 Burst topic detection method and burst topic detection equipment
PCT/CN2018/074870 WO2018223718A1 (en) 2017-06-09 2018-01-31 Trending topic detection method, apparatus and device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710433359.1A CN107688596B (en) 2017-06-09 2017-06-09 Burst topic detection method and burst topic detection equipment

Publications (2)

Publication Number Publication Date
CN107688596A true CN107688596A (en) 2018-02-13
CN107688596B CN107688596B (en) 2020-02-21

Family

ID=61152644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710433359.1A Active CN107688596B (en) 2017-06-09 2017-06-09 Burst topic detection method and burst topic detection equipment

Country Status (2)

Country Link
CN (1) CN107688596B (en)
WO (1) WO2018223718A1 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102646114A (en) * 2012-02-17 2012-08-22 清华大学 News topic timeline abstract generating method based on breakthrough point
CN105022827B (en) * 2015-07-23 2016-06-15 合肥工业大学 A kind of Web news dynamic aggregation method of domain-oriented theme

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102971762A (en) * 2010-07-01 2013-03-13 费斯布克公司 Facilitating interaction among users of a social network
CN102289487A (en) * 2011-08-09 2011-12-21 浙江大学 Network burst hotspot event detection method based on topic model
CN102346766A (en) * 2011-09-20 2012-02-08 北京邮电大学 Method and device for detecting network hot topics found based on maximal clique
CN104615593A (en) * 2013-11-01 2015-05-13 北大方正集团有限公司 Method and device for automatic detection of microblog hot topics

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111897958A (en) * 2020-07-16 2020-11-06 邓桦 Ancient poetry classification method based on natural language processing
CN111897958B (en) * 2020-07-16 2024-03-12 邓桦 Ancient poetry classification method based on natural language processing
CN113204638A (en) * 2021-04-23 2021-08-03 上海明略人工智能(集团)有限公司 Recommendation method, system, computer and storage medium based on work session unit
CN113204638B (en) * 2021-04-23 2024-02-23 上海明略人工智能(集团)有限公司 Recommendation method, system, computer and storage medium based on working session unit

Also Published As

Publication number Publication date
CN107688596B (en) 2020-02-21
WO2018223718A1 (en) 2018-12-13

Similar Documents

Publication Publication Date Title
CN105894372B (en) The method and apparatus for predicting colony's credit
CN104081392A (en) Influence scores for social media profiles
CN106874253A (en) Recognize the method and device of sensitive information
CN107507036A (en) The method and terminal of a kind of data prediction
CN116601626A (en) Personal knowledge graph construction method and device and related equipment
CN103827895A (en) Entity fingerprints
CN108804617A (en) Field term abstracting method, device, terminal device and storage medium
CN109376287B (en) House property map construction method, device, computer equipment and storage medium
CN108319377A (en) Method and system, terminal and the computer readable storage medium of displaying word input
CN114625973B (en) Anonymous information cross-domain recommendation method and device, electronic equipment and storage medium
CN110473073A (en) The method and device that linear weighted function is recommended
CN114357184B (en) Item recommendation method and related device, electronic equipment and storage medium
CN107688596A (en) Happen suddenly topic detecting method and burst topic detection equipment
CN109885747A (en) Industry public sentiment monitoring method, device, computer equipment and storage medium
CN110390011A (en) The method and apparatus of data classification
CN111401959B (en) Risk group prediction method, apparatus, computer device and storage medium
CN112766995B (en) Article recommendation method, device, terminal equipment and storage medium
CN111209403A (en) Data processing method, device, medium and electronic equipment
CN107798249B (en) Method for releasing behavior pattern data and terminal equipment
CN116304251A (en) Label processing method, device, computer equipment and storage medium
CN111708821B (en) Method, device and storage medium for determining personnel intimacy
CN116868207A (en) Decision tree of original graph database
CN114925275A (en) Product recommendation method and device, computer equipment and storage medium
CN113868373A (en) Word cloud generation method and device, electronic equipment and storage medium
CN107622129B (en) Method and device for organizing knowledge base and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant