WO2018223718A1 - Burst topic detection method, apparatus, device, and medium - Google Patents

Burst topic detection method, apparatus, device, and medium

Info

Publication number
WO2018223718A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
word segmentation
topic data
topic
frequency
Prior art date
Application number
PCT/CN2018/074870
Other languages
English (en)
French (fr)
Inventor
王健宗
黄章成
吴天博
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2018223718A1

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/23: Updating
    • G06F 16/2358: Change logging, detection, and notification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/95: Retrieval from the web
    • G06F 16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/258: Heading extraction; Automatic titling; Numbering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates

Definitions

  • The present application belongs to Internet technology, and in particular relates to a burst topic detection method, apparatus, device, and medium.
  • The embodiments of the present invention provide a burst topic detection method and a hot-event detection device, to solve the prior-art problems that it is difficult to quickly learn, by technical means, of the burst topics on an information sharing platform, and difficult to determine whether each burst topic is related to the enterprise itself.
  • A first aspect of the embodiments of the present application provides a burst topic detection method, including:
  • continuously obtaining topic data from an information sharing platform;
  • each time a piece of the topic data is obtained, matching the topic data against the words in a preset lexicon to output multiple word segmentation results;
  • outputting the multiple segments contained in the segmentation result with the highest matching degree as the keywords corresponding to the topic data;
  • updating, according to the keywords, the summary information associated with the topic data; and
  • displaying the keywords and the summary information, so that the user learns of the burst topic at the current moment.
  • A second aspect of the embodiments of the present application provides a burst topic detection apparatus, including:
  • an acquisition module, configured to continuously obtain topic data from an information sharing platform;
  • a matching module, configured to, each time a piece of the topic data is obtained, match the topic data against the words in a preset lexicon to output multiple word segmentation results;
  • an output module, configured to output the multiple segments contained in the segmentation result with the highest matching degree as the keywords corresponding to the topic data;
  • an update module, configured to update, according to the keywords, the summary information associated with the topic data; and
  • a display module, configured to display the keywords and the summary information, so that the user learns of the burst topic at the current moment.
  • A third aspect of the embodiments of the present application provides a burst topic detection device, including a memory and a processor, where the memory stores computer-readable instructions executable on the processor, and the processor, when executing the computer-readable instructions, implements the steps of the burst topic detection method according to the first aspect.
  • A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of the burst topic detection method according to the first aspect.
  • In the embodiments of the present application, each time topic data is acquired from the information sharing platform, the keywords corresponding to that topic data are determined and the summary information is updated in real time based on those keywords. From the output keywords and summary information, the user learns at the first moment roughly what the burst topic on the information sharing platform is about, and can quickly determine from the summary information whether the burst topic is related to the enterprise itself, thereby effectively discovering and tracking the burst topic events related to the enterprise and improving the enterprise's soft power.
  • FIG. 1 is a flowchart of an implementation of the burst topic detection method according to an embodiment of the present application;
  • FIG. 2 is a flowchart of a specific implementation of step S103 of the burst topic detection method according to an embodiment of the present application;
  • FIG. 3 is a flowchart of a specific implementation of step S104 of the burst topic detection method according to an embodiment of the present application;
  • FIG. 4 is a flowchart of a specific implementation of step S303 of the burst topic detection method according to an embodiment of the present application;
  • FIG. 5 is a flowchart of a specific implementation of step S305 of the burst topic detection method according to an embodiment of the present application;
  • FIG. 6 is a schematic diagram of a burst topic detection apparatus according to an embodiment of the present application;
  • FIG. 7 is a schematic diagram of a burst topic detection device according to an embodiment of the present application.
  • FIG. 1 shows the implementation flow of the burst topic detection method provided by an embodiment of the present application; the flow includes steps S101 to S105. The specific implementation principle of each step is as follows:
  • S101: Continuously obtain topic data from the information sharing platform.
  • The information sharing platform includes, but is not limited to, Weibo, Twitter, Facebook, and the major BBS forums.
  • Each piece of topic data is a piece of text, published by a user and displayable on the information sharing platform, that can be associated with one or more emergencies.
  • The text includes, but is not limited to, original posts on the information sharing platform, reposts, and the user comment data corresponding to an original post or repost.
  • Topic data in the information sharing platform can be obtained in two ways. In the first way, an application is created in advance that can interact with the API (Application Programming Interface) of the information sharing platform; using a pre-acquired account key, the application calls the API provided by the platform to obtain the topic data the platform returns. In the second way, a crawler program continuously crawls the topic data in the information sharing platform.
  • Because the topic data in the information sharing platform is continuously updated and continuously growing, in the embodiments of the present application the topic data is acquired in real time, that is, continuously, ensuring that the system obtains the latest topic data at every moment so that burst topics can be detected accurately, promptly, and quickly.
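The continuous acquisition described above can be sketched in Python. This is an illustrative model only: `poll_topics`, the `fetch` callback, and the `id`/`text` post fields are hypothetical stand-ins for a real platform API client or crawler, not names from the patent.

```python
import time
from typing import Callable, Dict, Iterator, List

def poll_topics(fetch: Callable[[], List[Dict]], interval: float = 0.0,
                max_rounds: int = 3) -> Iterator[Dict]:
    """Continuously obtain topic data, yielding each post only once.

    `fetch` stands in for either an authenticated API call to the
    sharing platform or one pass of a crawler; both return a list of
    posts, each a dict with a unique "id" and a "text" field.
    """
    seen = set()
    for _ in range(max_rounds):          # in production: while True
        for post in fetch():
            if post["id"] not in seen:   # skip topic data already acquired
                seen.add(post["id"])
                yield post
        time.sleep(interval)

# Simulated platform returning overlapping batches on successive polls.
batches = iter([
    [{"id": 1, "text": "post A"}],
    [{"id": 1, "text": "post A"}, {"id": 2, "text": "post B"}],
    [{"id": 2, "text": "post B"}, {"id": 3, "text": "post C"}],
])
new_posts = list(poll_topics(lambda: next(batches)))
print([p["id"] for p in new_posts])  # -> [1, 2, 3]
```

Deduplicating on a post identifier keeps the downstream steps from counting the same topic data twice across polls.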
  • S102: Each time a piece of the topic data is obtained, match the topic data against the words in the preset lexicon to output multiple word segmentation results.
  • Each time a new piece of topic data is received, the system performs word matching on it. Specifically, starting from the first character of the topic data, the system checks whether the topic data contains words from the preset lexicon. When a run of consecutive characters in the topic data matches a word in the lexicon, that run is taken as one segment, and the matching process restarts from the first character after that segment. Once every segment in the topic data has been determined, one pass of word matching is complete; that pass outputs one segmentation result, which contains multiple segments. In particular, each segment is two or more characters long.
  • A character in the topic data can form a segment either with one or more characters to its left or with one or more characters to its right, so segmenting the same topic data can yield different segmentation results.
  • One segmentation result is output for each of the pre-stored segmentation rules.
  • Different segmentation results may have different matching degrees. The matching degree indicates to what extent a user can recover the actual semantics of the topic data from the segments in that segmentation result.
  • S103: Output the multiple segments contained in the segmentation result with the highest matching degree as the keywords corresponding to the topic data.
  • The matching degree of each segmentation result may be determined from the average number of characters per segment, or from the variance of the segments' character counts; this is not limited here. In this embodiment, the matching degree of each segmentation result is measured on the longest-match principle: after comparing the matching degrees, the first segments contained in the result with the highest matching degree are output as the keywords corresponding to the topic data.
  • For example, suppose the topic data consists only of the three Chinese characters "数据线" ("data line"). Since both "数据" ("data") and "数据线" ("data line") can form a segment, the segmentation containing "数据线" has the higher matching degree under the longest-match principle, so the segmentation result containing "数据线" is determined to have the highest matching degree and "数据线" is output as a keyword.
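The lexicon matching described above can be sketched as a greedy longest-match scan. The function name `forward_max_match`, the `max_len` cap, and the fallback that keeps an unknown single character as its own segment are illustrative assumptions; running the scan under different rules (here, different maximum match lengths) yields the multiple segmentation results the text mentions.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy forward matching against a preset lexicon (sketch of S102).

    Starting at the first character, the longest run of consecutive
    characters found in the lexicon becomes one segment, then matching
    restarts at the first character after that segment.  A single
    character not in the lexicon is kept as a one-character segment.
    """
    segments, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in lexicon:
                segments.append(piece)
                i += length
                break
    return segments

lexicon = {"数据", "数据线"}
print(forward_max_match("数据线", lexicon))             # -> ['数据线']
print(forward_max_match("数据线", lexicon, max_len=2))  # -> ['数据', '线']
```

Under the longest-match principle the first result wins, matching the "数据线" (data line) example above.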
  • In an embodiment, the manner of calculating the matching degree of the segmentation results is further defined, and the foregoing S103 specifically includes:
  • S201: Calculate the average number of characters per segment of each segmentation result from the total number of characters of each segment in that result and the total number of segments in that result.
  • Each segmentation result contains multiple segments, each consisting of at least two characters. For each segmentation result, the total number of segments is counted and the number of characters in each segment is determined (that is, the number of characters each segment contains). The ratio of the sum of the segments' character counts to the total number of segments is output as the above average number of characters per segment.
  • For example, a segmentation result obtained by segmenting a piece of topic data might be {Tiantian Group / data line / yield}.
  • S202: Weight the average number of characters per segment and the total number of segments of each segmentation result to output the matching degree of that segmentation result.
  • The weighting coefficient for the average segment length A1 is a preset value a1, and the weighting coefficient for the total number of segments A2 is a preset value a2, where a1 + a2 = 1; the matching degree is the weighted sum C = a1 * A1 + a2 * A2.
  • S203: Output the multiple segments contained in the segmentation result with the highest matching degree as the keywords corresponding to the topic data.
  • If M segmentation results are obtained after segmenting the topic data, and the matching degrees of the M results are C1, C2, ..., CM, then the largest value Ci is selected from C1, C2, ..., CM, and the segments of the segmentation result corresponding to Ci are output as the keywords corresponding to the topic data, where M is an integer greater than 1 and i ≤ M.
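Steps S201 to S203 can be condensed into a few lines. The weights a1 = 0.8 and a2 = 0.2 are illustrative assumptions; the text only requires preset coefficients with a1 + a2 = 1.

```python
def matching_degree(segments, a1=0.8, a2=0.2):
    """Matching degree of one segmentation result (S201-S202 sketch).

    A1 = average number of characters per segment, A2 = total number of
    segments; the score is the weighted sum a1*A1 + a2*A2 with
    a1 + a2 = 1.  The 0.8/0.2 split is an assumed example value.
    """
    avg_chars = sum(len(s) for s in segments) / len(segments)
    return a1 * avg_chars + a2 * len(segments)

def keywords(candidate_results):
    """S203: return the segments of the highest-scoring segmentation."""
    return max(candidate_results, key=matching_degree)

results = [["数据线"], ["数据", "线"]]
print(keywords(results))  # -> ['数据线']
```

With these weights the single-segment longest match scores 0.8*3 + 0.2*1 = 2.6 against 1.6 for the two-segment split, so the longest-match result is selected as the keyword source.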
  • Because the average number of characters per segment and the total number of segments strongly influence whether a user can determine the actual semantics of the topic data from a segmentation result, weighting these two factors and using the weighted value as the result's matching degree improves the accuracy and effectiveness of keyword selection, thereby accurately locating the event content of the burst topic.
  • S104: Update the summary information associated with the topic data according to the keywords.
  • Over time, the system accumulates multiple pieces of topic data. After determining the keywords of each piece, the system regenerates summary information describing all topic data received so far, so that the user can clearly understand, from that summary, the general content of the current burst topic.
  • The keywords carry the decisive features of the topic data.
  • The cumulative word frequency of each keyword across the topic data may be counted, and the summary generated from the keywords whose cumulative word frequency exceeds a threshold.
  • The summary information associated with the keywords may be generated using the TextRank algorithm or a summary generation tool.
  • the foregoing S104 specifically includes:
  • S301: Acquire the cumulative word frequency of each of the keywords, and calculate the growth acceleration of the cumulative word frequency, where the cumulative word frequency of a keyword is the cumulative number of times the keyword has appeared in all topic data acquired up to the current moment.
  • S302: Add the growth acceleration corresponding to each of the keywords to a pre-generated matrix.
  • Each time a piece of topic data is received, the system determines its keywords and the growth acceleration of each keyword's cumulative word frequency. If the topic data has K keywords, K growth accelerations are obtained. If the matrix currently holds P growth accelerations (P ≥ K), it is expanded to a P × P matrix, and the K newly obtained growth accelerations are added to it; besides the P growth accelerations, the P × P matrix also contains empty entries.
  • S303: Calculate the eigenvalue of the matrix at the current moment, and when the eigenvalue is greater than a first threshold, determine from the matrix the growth accelerations greater than a second threshold.
  • The system monitors each growth acceleration in the matrix and computes the matrix's eigenvalues in real time. As the cumulatively acquired topic data grows, the size of the matrix and the total number of accelerations it contains keep changing, and the matrix's eigenvalues change accordingly. When an eigenvalue exceeds the preset first threshold, the system locates, among the growth accelerations in the matrix, the one or more whose value is greater than the second threshold.
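A minimal sketch of S301 to S303 follows, with one simplifying assumption the patent leaves open: the growth accelerations are placed on the diagonal of a zero-padded matrix, so its eigenvalues are exactly the diagonal entries and the eigenvalue test reduces to comparing the largest acceleration with the first threshold. The keyword series and thresholds are hypothetical example values.

```python
def growth_acceleration(cum_freq):
    """Latest second difference of a keyword's cumulative word frequency
    (sketch of S301): how fast its growth rate is itself growing."""
    if len(cum_freq) < 3:
        return 0.0
    return (cum_freq[-1] - cum_freq[-2]) - (cum_freq[-2] - cum_freq[-3])

def burst_keywords(series, first_threshold, second_threshold):
    """Sketch of S302-S303.  Assuming the accelerations sit on the
    diagonal of a zero-padded square matrix, the matrix's largest
    eigenvalue equals the largest acceleration, so the eigenvalue test
    is a single comparison before individual accelerations above the
    second threshold are located."""
    acc = {w: growth_acceleration(s) for w, s in series.items()}
    largest_eigenvalue = max(acc.values(), default=0.0)
    if largest_eigenvalue <= first_threshold:
        return []          # no burst detected at the current moment
    return [w for w, a in acc.items() if a > second_threshold]

series = {
    "保险": [3, 6, 9, 12, 15],   # steady growth: acceleration 0
    "理赔": [1, 2, 4, 8, 16],    # accelerating: latest acceleration 4
}
print(burst_keywords(series, first_threshold=2, second_threshold=2))  # -> ['理赔']
```

A general (non-diagonal) layout would need a real eigenvalue routine, but the thresholding logic stays the same.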
  • the foregoing S303 specifically includes:
  • S401 Divide each growth acceleration in the matrix at the current moment into N groups, and map each group of growth accelerations into one sub-matrix.
  • Because the matrix contains many growth accelerations, it is reduced in dimensionality to speed up locating the accelerations whose value is greater than the second threshold.
  • All the growth accelerations in the matrix are divided into N groups, so that each group contains a small number of accelerations.
  • The groups may contain the same or different numbers of accelerations. The accelerations in each group are mapped into one sub-matrix, so with N groups there are N sub-matrices. As the topic data gradually increases, each newly obtained growth acceleration is likewise mapped into one of the N sub-matrices.
  • S402: Calculate the eigenvalue of each sub-matrix, and when a sub-matrix's eigenvalue is greater than a fourth threshold, select from that sub-matrix the growth accelerations greater than the second threshold.
  • Because a sub-matrix contains far fewer growth accelerations than the full matrix, computing the sub-matrices' eigenvalues separately allows the accelerations greater than the second threshold to be located quickly within any sub-matrix whose eigenvalue exceeds the fourth threshold, improving the detection efficiency for burst topics.
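S401 and S402 can be sketched the same way. Splitting into near-equal groups and mapping each group onto a diagonal sub-matrix (whose largest eigenvalue is then the group's largest acceleration) are illustrative assumptions; the patent does not fix the grouping or the mapping.

```python
def submatrix_scan(accelerations, n_groups, fourth_threshold, second_threshold):
    """Sketch of S401-S402: split the accelerations into n_groups and
    only search a group for values above the second threshold when that
    group's sub-matrix passes the eigenvalue test.  Assuming a diagonal
    layout, a sub-matrix's largest eigenvalue is simply the group's
    largest acceleration, so most groups are skipped cheaply."""
    k, r = divmod(len(accelerations), n_groups)
    hits, start = [], 0
    for g in range(n_groups):
        size = k + (1 if g < r else 0)       # near-equal group sizes
        group = accelerations[start:start + size]
        start += size
        if group and max(group) > fourth_threshold:
            hits.extend(a for a in group if a > second_threshold)
    return hits

print(submatrix_scan([0.1, 0.2, 5.0, 0.3, 0.1, 4.2], 3, 1.0, 2.0))  # -> [5.0, 4.2]
```

In the example, the first group fails the eigenvalue test and is never scanned element by element, which is the dimensionality-reduction speedup the text describes.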
  • S304: According to the segments corresponding to the determined growth accelerations, filter out, from all topic data acquired so far, the topic data containing those segments.
  • Each growth acceleration in the matrix or a sub-matrix corresponds to one keyword, and each keyword is a segment of the highest-matching-degree segmentation result of some piece of topic data. The system can query a pre-stored mapping table between growth accelerations and segments to find the segment corresponding to each acceleration whose value is greater than the second threshold; if there are L such accelerations, L segments are found. The system then goes through each piece of topic data acquired up to the current moment and checks whether it contains the L segments; each piece of topic data that does is filtered out, and step S305 is performed on it.
  • S305: Perform word segmentation on the topic data containing the segments, and calculate the word-frequency feature value of each segment obtained from the segmentation.
  • The word segmentation here can use any existing segmentation algorithm, including but not limited to dictionary-based and statistics-based algorithms. When segmentation finishes, multiple segments of the topic data are obtained.
  • To distinguish them, the segments obtained in S102 are called first segments and the segments obtained in S305 are called second segments; a first segment and a second segment may be the same or different.
  • The word-frequency feature value of each second segment is calculated from its word-frequency feature quantities, which include, but are not limited to, term frequency (TF), inverse document frequency (IDF), and the like.
  • the foregoing S305 specifically includes:
  • S501: Perform word segmentation on the topic data containing the segments to obtain multiple segments.
  • S502: For each piece of topic data acquired at the current moment, calculate the statistical word frequency and inverse document frequency of each segment obtained from the segmentation.
  • S503: Weight each segment's statistical word frequency and inverse document frequency to output the segment's word-frequency feature value.
  • The weighting coefficient for the statistical word frequency F_TF is a preset value a3, and the weighting coefficient for the inverse document frequency F_IDF is a preset value a4, where a3 + a4 = 1; the word-frequency feature value is the weighted sum F = a3 * F_TF + a4 * F_IDF.
  • In this embodiment, the word-frequency feature value of a second segment is calculated with customizable weighting coefficients, taking the segment's TF-IDF values into account, so that the importance of each second segment within the selected topic data can be compared quantitatively.
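S501 to S503 amount to a weighted TF/IDF score per segment. The equal weights a3 = a4 = 0.5, the +1 smoothing in the IDF denominator, and the example corpus below are illustrative assumptions; the text only requires preset coefficients with a3 + a4 = 1.

```python
import math
from collections import Counter

def word_frequency_feature(term, doc_tokens, corpus, a3=0.5, a4=0.5):
    """Sketch of S502-S503: F = a3*TF + a4*IDF with a3 + a4 = 1.

    TF is the term's relative frequency in the selected piece of topic
    data; IDF is the log-scaled inverse document frequency over all
    topic data acquired so far (with +1 smoothing, an assumption).
    """
    counts = Counter(doc_tokens)
    tf = counts[term] / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)       # document frequency
    idf = math.log(len(corpus) / (1 + df))
    return a3 * tf + a4 * idf

# Hypothetical segmented topic data (three posts).
corpus = [["理赔", "延迟"], ["理赔", "快"], ["天气", "好"]]
doc = corpus[0]
print(round(word_frequency_feature("理赔", doc, corpus), 3))  # -> 0.25
print(round(word_frequency_feature("延迟", doc, corpus), 3))  # -> 0.453
```

A segment that is frequent in the selected post but rare across the corpus ("延迟") scores higher than one that appears everywhere ("理赔"), which is what the third-threshold cut in S306 then exploits.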
  • S306: Output the segments whose word-frequency feature value is greater than a third threshold as high-frequency words, and connect the high-frequency words by a preset algorithm to obtain the summary information containing those high-frequency words.
  • The second segments whose word-frequency feature value F is greater than the preset third threshold are the high-frequency words appearing in the topic data.
  • The high-frequency words are connected using the TextRank algorithm, a summary generation tool, or another custom algorithm, to obtain the summary information associated with the topic data and the high-frequency words.
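The connection step can be sketched with a minimal TextRank-style ranking of the high-frequency words. The patent names TextRank but leaves the details open, so the co-occurrence edges, damping factor, and example words here are illustrative assumptions.

```python
def textrank(words, cooccur, d=0.85, iters=50):
    """Minimal TextRank-style scoring used to connect high-frequency
    words (sketch of S306).  `cooccur` lists undirected co-occurrence
    edges between words; scores follow the standard PageRank recursion
    score(w) = (1-d) + d * sum(score(v)/degree(v)) over neighbors v."""
    nbrs = {w: set() for w in words}
    for a, b in cooccur:
        nbrs[a].add(b)
        nbrs[b].add(a)
    score = {w: 1.0 for w in words}
    for _ in range(iters):
        score = {
            w: (1 - d) + d * sum(score[v] / len(nbrs[v]) for v in nbrs[w])
            for w in words
        }
    return sorted(words, key=score.get, reverse=True)

# Hypothetical high-frequency words and which of them co-occur
# in the filtered topic data.
ranked = textrank(["延迟", "理赔", "客服"],
                  [("理赔", "延迟"), ("理赔", "客服")])
print(ranked[0])  # -> 理赔
```

The top-ranked words, taken in order, give a crude connected summary; a production system would rank sentences rather than isolated words.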
  • S105: Display the keywords and the summary information, so that the user learns of the burst topic at the current moment.
  • The system displays the keywords obtained in real time together with the updated summary information. The summary information is updated only when the growth acceleration of the keywords' cumulative word frequency exceeds the threshold, so the text content the system displays in real time has a high degree of similarity to the real content of the burst topic event and has definite reference value.
  • In summary, in the embodiments of the present application, each time topic data is acquired from the information sharing platform, the keywords corresponding to that topic data are determined and the summary information is updated in real time based on those keywords. From the output keywords and summary information, the user learns at the first moment roughly what the burst topic on the information sharing platform is about, and can quickly determine from the summary information whether the burst topic is related to the enterprise itself, thereby effectively discovering and tracking the burst topic events related to the enterprise and improving the enterprise's soft power.
  • FIG. 6 is a schematic diagram of a burst topic detection apparatus provided by an embodiment of the present application. For convenience of description, only the parts related to this embodiment are shown.
  • the apparatus includes:
  • the obtaining module 61 is configured to continuously obtain topic data in the information sharing platform.
  • the matching module 62 is configured to perform matching processing on the topic data and each word in the preset vocabulary to obtain a plurality of word segmentation results when each of the topic data is acquired.
  • the output module 63 is configured to output a plurality of word segments included in the word segmentation result with the highest matching degree as keywords corresponding to the topic data.
  • the updating module 64 is configured to update the summary information associated with the topic data according to the keyword.
  • the display module 65 is configured to display the keyword and the summary information, so that the user knows the sudden topic at the current moment.
  • the update module 64 includes:
  • a first calculation sub-module, configured to acquire the cumulative word frequency of each of the keywords and calculate the growth acceleration of the cumulative word frequency, where the cumulative word frequency of a keyword is the cumulative number of times the keyword has appeared in all topic data acquired up to the current moment;
  • an adding sub-module, configured to add the growth acceleration corresponding to each of the keywords to a pre-generated matrix;
  • a determining sub-module, configured to calculate the eigenvalue of the matrix at the current moment and, when the eigenvalue is greater than the first threshold, determine from the matrix the growth accelerations greater than the second threshold;
  • a screening sub-module, configured to filter, from all topic data acquired so far, the topic data containing the segments corresponding to the determined growth accelerations;
  • a word segmentation sub-module, configured to perform word segmentation on the topic data containing the segments and calculate the word-frequency feature value of each resulting segment;
  • a first output sub-module, configured to output the segments whose word-frequency feature value is greater than the third threshold as high-frequency words and connect the high-frequency words by a preset algorithm to obtain the summary information containing those high-frequency words.
  • The determining sub-module is specifically configured to: divide the growth accelerations in the matrix at the current moment into N groups, map each group of growth accelerations into one sub-matrix, calculate the eigenvalue of each sub-matrix, and, when a sub-matrix's eigenvalue is greater than the fourth threshold, select from that sub-matrix the growth accelerations greater than the second threshold, where N is an integer greater than 1.
  • The word segmentation sub-module is specifically configured to: perform word segmentation on the topic data containing the segments to obtain multiple segments; calculate, over the topic data acquired at the current moment, the statistical word frequency and inverse document frequency of each segment; and weight each segment's statistical word frequency and inverse document frequency to output the segment's word-frequency feature value.
  • the output module 63 includes:
  • the second calculation sub-module is configured to calculate the average number of word segmentation characters of each word segmentation result according to the total number of characters corresponding to each word segment in each word segmentation result and the total number of word segments corresponding to each word segmentation result.
  • the weighting sub-module is configured to perform weighting processing on the average number of the word segmentation characters corresponding to each word segmentation result and the total number of the word segmentation to output a matching degree of each word segmentation result.
  • a second output sub-module configured to output the plurality of word segments included in the segmentation result with the highest matching degree as keywords corresponding to the topic data.
  • FIG. 7 is a schematic diagram of a burst topic detection device according to an embodiment of the present application.
  • In this embodiment, the burst topic detection device 7 includes a processor 70 and a memory 71, in which computer-readable instructions 72 executable on the processor 70, for example a burst topic detection program, are stored.
  • When the processor 70 executes the computer-readable instructions 72, the steps in the foregoing embodiments of the burst topic detection methods are implemented, such as steps S101 to S105 shown in FIG. 1.
  • Alternatively, when executing the computer-readable instructions 72, the processor 70 implements the functions of the modules/units in the foregoing apparatus embodiments, such as the functions of modules 61 to 65 shown in FIG. 6.
  • The computer-readable instructions 72 may be partitioned into one or more modules/units, which are stored in the memory 71 and executed by the processor 70 to complete the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing particular functions; the instruction segments describe the execution of the computer-readable instructions 72 in the burst topic detection device 7.
  • The burst topic detection device 7 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. Those skilled in the art will understand that FIG. 7 is only an example of the burst topic detection device 7 and does not limit it; the device may include more or fewer components than those illustrated, combine certain components, or use different components. For example, the burst topic detection device may also include input and output devices, network access devices, buses, and the like.
  • The processor 70 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
  • The memory 71 may be an internal storage unit of the burst topic detection device 7, such as a hard disk or memory of the device. The memory 71 may also be an external storage device of the burst topic detection device 7, for example a plug-in hard disk, a smart memory card (SMC), a secure digital (SD) card, or a flash card equipped on the device. The memory 71 may also include both an internal storage unit and an external storage device of the burst topic detection device 7. The memory 71 is configured to store the computer-readable instructions and the other programs and data required by the device, and may also be used to temporarily store data that has been or is about to be output.
  • The functional units in the embodiments of the present application may be integrated into one processing unit, may exist physically separately, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware or of a software functional unit.
  • The integrated unit, if implemented as a software functional unit and sold or used as a standalone product, may be stored in a computer-readable storage medium. The storage medium includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Abstract

This solution provides a burst topic detection method, apparatus, device, and medium applicable to the field of Internet technology. The method includes: continuously obtaining topic data from an information sharing platform; each time a piece of topic data is obtained, matching the topic data against the words in a preset lexicon to output multiple word segmentation results; outputting the segments contained in the segmentation result with the highest matching degree as the keywords corresponding to the topic data; updating, according to the keywords, the summary information associated with the topic data; and displaying the keywords and the summary information so that the user learns of the burst topic at the current moment. This solution can determine the keywords corresponding to the topic data and update the summary information based on those keywords, so that the user can quickly learn, from the output keywords and summary information, of the burst topics on the information sharing platform.

Description

Burst topic detection method, apparatus, device, and medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on June 9, 2017, with application number 201710433359.1 and invention title "Burst topic detection method and burst topic detection device", the entire contents of which are incorporated herein by reference.
Technical field
The present application belongs to Internet technology, and in particular relates to a burst topic detection method, apparatus, device, and medium.
Background
On information sharing platforms such as Weibo, Twitter, and forums, thanks to the openness of the platforms, users can share and repost all kinds of information anytime and anywhere. If, within a short period of time, a large number of users share or repost the same information, the specific topic corresponding to that information evolves into a highly popular burst topic. If such burst topics are related to a particular enterprise, they may have an enormous impact on public opinion about that enterprise. If the enterprise cannot promptly discover and track the burst topic events related to it, it will miss the best time to neutralize negative public opinion, which reduces the enterprise's own soft power.
However, in the prior art it is difficult to quickly learn of the burst topics on an information sharing platform by technical means, and difficult to determine whether each burst topic is related to the enterprise itself.
技术问题
有鉴于此,本发明实施例提供了一种突发话题检测方法及热度事件检测设20备,以解决现有技术中难以通过技术手段迅速了解到信息分享平台上的突发话题以及难以确定各个突发话题是否与企业自身相关的问题。
Technical Solution
A first aspect of the embodiments of the present application provides a burst topic detection method, comprising:
continuously acquiring topic data from an information sharing platform;
each time a piece of the topic data is acquired, matching the topic data against the words in a preset lexicon, to output a plurality of word segmentation results;
outputting the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data;
updating, according to the keywords, the summary information associated with the topic data; and
displaying the keywords and the summary information, so that a user learns of the burst topic at the current moment.
A second aspect of the embodiments of the present application provides a burst topic detection apparatus, comprising:
an acquisition module, configured to continuously acquire topic data from an information sharing platform;
a matching module, configured to, each time a piece of the topic data is acquired, match the topic data against the words in a preset lexicon, to output a plurality of word segmentation results;
an output module, configured to output the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data;
an updating module, configured to update, according to the keywords, the summary information associated with the topic data; and
a display module, configured to display the keywords and the summary information, so that a user learns of the burst topic at the current moment.
A third aspect of the embodiments of the present application provides a burst topic detection device, comprising a memory and a processor, the memory storing computer readable instructions executable on the processor, wherein the processor, when executing the computer readable instructions, implements the steps of the burst topic detection method of the first aspect.
A fourth aspect of the embodiments of the present application provides a computer readable storage medium storing computer readable instructions which, when executed by a processor, implement the steps of the burst topic detection method of the first aspect.
Beneficial Effects
In the embodiments of the present application, each time topic data is acquired from the information sharing platform, the keywords corresponding to the topic data are determined and the summary information is updated in real time based on those keywords, so that from the output keywords and summary information the user immediately learns roughly what the burst topic on the information sharing platform is about, and can quickly determine from the summary information whether the burst topic is related to the enterprise itself. Burst topic events related to the enterprise can thereby be effectively discovered, tracked, and handled, improving the enterprise's soft power.
Brief Description of the Drawings
Fig. 1 is a flowchart of an implementation of the burst topic detection method provided by an embodiment of the present application;
Fig. 2 is a flowchart of a specific implementation of step S103 of the burst topic detection method provided by an embodiment of the present application;
Fig. 3 is a flowchart of a specific implementation of step S104 of the burst topic detection method provided by an embodiment of the present application;
Fig. 4 is a flowchart of a specific implementation of step S303 of the burst topic detection method provided by an embodiment of the present application;
Fig. 5 is a flowchart of a specific implementation of step S305 of the burst topic detection method provided by an embodiment of the present application;
Fig. 6 is a schematic diagram of the burst topic detection apparatus provided by an embodiment of the present application;
Fig. 7 is a schematic diagram of the burst topic detection device provided by an embodiment of the present application.
Embodiments of the Invention
In the following description, for purposes of explanation rather than limitation, specific details such as particular system architectures and techniques are set forth to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to those skilled in the art that the present application may also be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods are omitted so that unnecessary detail does not obscure the description of the present application.
To illustrate the technical solutions described in the present application, specific embodiments are described below.
Fig. 1 shows the implementation flow of the burst topic detection method provided by an embodiment of the present application. The flow comprises steps S101 to S105, whose specific implementation principles are as follows:
S101: continuously acquire topic data from an information sharing platform.
In this embodiment, information sharing platforms include, but are not limited to, Weibo, Twitter, Facebook, and major BBS forums. Each piece of topic data is a piece of text published by a user that can be displayed on the information sharing platform, and it may be associated with one or more burst events. Such text includes, but is not limited to, original posts, reposts, and the user comment data corresponding to original posts or reposts on the information sharing platform.
Topic data can be acquired from the information sharing platform in either of two ways. In the first way, using a pre-created application capable of interacting with the platform's API (Application Programming Interface) and a pre-obtained account key, the application calls the API provided by the information sharing platform to obtain the topic data the platform returns. In the second way, a crawler program continuously crawls topic data from the information sharing platform.
Since topic data on an information sharing platform is constantly updated and growing, this embodiment acquires topic data from the platform in real time, i.e., continuously, ensuring that the system obtains the latest topic data at every moment and can thus detect burst topics accurately, promptly, and quickly.
S102: each time a piece of the topic data is acquired, match the topic data against the words in a preset lexicon, to output a plurality of word segmentation results.
Each time a new piece of topic data is received, the system performs word matching on it. Specifically, starting from the first character of the topic data, the system determines whether the topic data contains words from the preset lexicon. When a word formed by consecutive characters in the topic data is found to be identical to a word in the preset lexicon, those consecutive characters are determined to be one word segment, and the matching process restarts from the first character after that segment. Once all segments of the topic data have been determined, one pass of the word matching process is complete; that pass outputs one segmentation result containing a plurality of word segments. In particular, each word segment consists of at least two characters.
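The matching pass above can be sketched as forward maximum matching against a preset lexicon. This is only an illustrative reading of the step, not the patent's exact procedure, and the lexicon contents below are invented for the example:

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right dictionary matching: at each position, take the
    longest lexicon word starting there; otherwise skip one character."""
    segments, i = [], 0
    while i < len(text):
        match = None
        # try the longest candidate first (longest-match principle)
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = candidate
                break
        if match:
            segments.append(match)
            i += len(match)
        else:
            i += 1  # unmatched single characters are passed over (segments need >= 2 chars)
    return segments

lexicon = {"天天集团", "数据", "数据线", "产量"}  # invented example lexicon
print(forward_max_match("天天集团数据线产量", lexicon))
```

Trying the longest candidate first already embodies the longest-match preference that S103 below formalizes; other segmentation rules (e.g. shortest-match, backward matching) would yield the alternative segmentation results the text mentions.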
In fact, a character in the topic data can form a word segment not only with one or more characters to its left but also with one or more characters to its right. Under different segmentation rules, the same topic data therefore yields different segmentation results. In this embodiment, for each piece of topic data, one segmentation result is output for each pre-stored segmentation rule. Different segmentation results may have different matching degrees, where the matching degree denotes the extent to which a user can grasp the actual semantics of the topic data from the word segments of the result.
S103: output the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data.
In this embodiment, the matching degree of each segmentation result may be determined from the average character count of its word segments, or from the variance of their total character counts; this is not limited here.
Preferably, since the more characters a word segment contains, the more easily a user can infer the actual semantics of the topic data from it, the matching degree of each segmentation result is measured on the longest-match principle. After the matching degrees of all segmentation results are compared, the word segments contained in the result with the highest matching degree are output as the keywords corresponding to the topic data.
For example, when the topic data contains only the three Chinese characters "数据线" (data cable), both "数据线" and "数据" can form a word segment, but "数据线" has the higher matching degree; the segmentation result with the highest matching degree is therefore determined to contain the segment "数据线", which is output as the keyword.
As an embodiment of the present application, the computation of the matching degree of a segmentation result is further specified. As shown in Fig. 2, the above S103 specifically comprises:
S201: calculate the average number of characters per word segment of each segmentation result, according to the total number of characters of each word segment in the result and the total number of word segments in the result.
Each segmentation result contains a plurality of word segments, and each word segment consists of at least two characters. In this embodiment, the total number of segments is identified, together with the character count of each segment (i.e., how many characters each segment contains). The ratio of the sum of the segments' character counts to the total number of segments is output as the average number of characters per segment.
For example, if segmenting a piece of topic data yields the segmentation result {天天集团/数据线/产量}, its three segments are "天天集团", "数据线", and "产量", whose character counts are 4, 3, and 2 respectively; the result has 3 segments, so its average number of characters per segment is (4 + 3 + 2) / 3 = 3.
S202: weight the average number of characters per segment and the total number of segments of each segmentation result, to output the matching degree of that result.
In this embodiment, the average character count A1 has a preset weighting coefficient a1 and the total segment count A2 has a preset weighting coefficient a2, with a1 + a2 = 1. The matching degree of each segmentation result is C = A1 × a1 + A2 × a2.
S203: output the word segments contained in the segmentation result with the highest matching degree as the keywords corresponding to the topic data.
If segmenting the topic data yields M segmentation results with matching degrees C1, C2, …, CM, the largest value Ci among C1, C2, …, CM is selected, and every word segment of the segmentation result corresponding to Ci is output as a keyword of the topic data, where M is an integer greater than 1 and i ≤ M.
In this embodiment, both factors, the average number of characters per segment and the total number of segments, strongly influence the segmentation result and determine whether a user can work out the actual semantics of the topic data. Weighting these two factors and using the weighted value as the result's matching degree for keyword selection therefore improves the accuracy and effectiveness of keyword extraction, so that the event content of the burst topic can be located precisely.
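Under the assumption of concrete weighting coefficients (the text only requires a1 + a2 = 1; the 0.7/0.3 split below is invented), the S201–S203 scoring might look like this sketch:

```python
def match_degree(segments, a1=0.7, a2=0.3):
    """Weighted matching degree C = A1*a1 + A2*a2, where A1 is the average
    number of characters per segment and A2 the number of segments."""
    avg_chars = sum(len(s) for s in segments) / len(segments)   # A1
    num_segments = len(segments)                                # A2
    return avg_chars * a1 + num_segments * a2

results = [
    ["天天集团", "数据线", "产量"],          # longest-match segmentation
    ["天天", "集团", "数据线", "产量"],      # a finer-grained alternative
]
best = max(results, key=match_degree)  # its segments become the keywords
```

For the first result the average is (4 + 3 + 2) / 3 = 3, giving C = 3 × 0.7 + 3 × 0.3 = 3.0, which beats the finer-grained alternative, matching the worked example above.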
S104: update, according to the keywords, the summary information associated with the topic data.
At any moment, the system will have accumulated multiple pieces of topic data. After the keywords of each piece are determined, the system regenerates the summary information describing all topic data accumulated so far, so that a user can clearly grasp from that summary the rough content of the burst topic at the current moment.
Keywords carry the decisive features of the topic data. To generate summary information associated with all topic data accumulated so far, the cumulative word frequency of each keyword across the pieces of topic data can be counted, and the summary generated from the keywords whose cumulative frequency exceeds a threshold. The summary associated with the topic data and with the keywords can be generated with, for example, the TextRank algorithm or the summary-generation tool of the Word software.
Preferably, as an embodiment of the present application, as shown in Fig. 3, the above S104 specifically comprises:
S301: separately acquire the cumulative word frequency of each keyword, and calculate the growth acceleration of the cumulative word frequency, where the cumulative word frequency of a keyword denotes the cumulative number of times the keyword appears in all topic data acquired up to the current moment.
In this embodiment, the cumulative word frequency of a keyword denotes its number of occurrences in all topic data accumulated so far. Because the system continuously acquires topic data, the cumulative frequency of any given keyword keeps growing. If within a time interval ΔT the system detects that the cumulative frequency of keyword A has grown by ΔS, the growth rate of A's cumulative frequency is V = ΔS / ΔT, and the growth acceleration a of the cumulative frequency is the derivative of the growth rate V with respect to time, i.e. a = V'(t). The larger the growth acceleration, the more often the keyword appears in topic data per unit time, and the more bursty the topic.
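A minimal sketch of estimating the growth acceleration from sampled cumulative frequencies, approximating a = V'(t) by finite differences over adjacent windows (the sampling times and counts are invented):

```python
def growth_acceleration(times, counts):
    """Finite-difference estimate of a = V'(t): V is the growth rate of the
    cumulative frequency between samples, and a is the change of V between
    adjacent windows."""
    rates = [(counts[i + 1] - counts[i]) / (times[i + 1] - times[i])
             for i in range(len(counts) - 1)]
    accels = [(rates[i + 1] - rates[i]) / (times[i + 2] - times[i + 1])
              for i in range(len(rates) - 1)]
    return accels

# cumulative frequency of one keyword sampled at t = 0, 1, 2, 3 (invented data)
print(growth_acceleration([0, 1, 2, 3], [10, 12, 20, 40]))
```

A cumulative count growing 10 → 12 → 20 → 40 gives rates 2, 8, 20 and hence positive, increasing accelerations, the signature of a bursty keyword.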
S302: add the growth accelerations corresponding to the respective keywords to a pre-generated matrix.
Each time a new piece of topic data is received, the system determines its keywords and the growth acceleration of each keyword's cumulative frequency. If the topic data has K keywords, K growth accelerations are obtained. If the total number of growth accelerations accumulated by the system is P (P ≥ K, P ∈ Z), the matrix is expanded to a P × P matrix, and the K newly obtained growth accelerations are added to it. Besides the P growth accelerations, the P × P matrix also contains empty entries.
S303: calculate the eigenvalue of the matrix at the current moment; when the eigenvalue is greater than a first threshold, determine from the matrix the growth accelerations greater than a second threshold.
The system monitors the growth accelerations in the matrix so as to detect its eigenvalue in real time. As more and more topic data is accumulated, the size of the matrix and the total number of growth accelerations it contains keep changing, and the matrix's eigenvalue grows accordingly. When the eigenvalue exceeds the preset first threshold, the system locates, among the growth accelerations in the matrix, the one or more whose value exceeds the second threshold.
As an embodiment of the present application, as shown in Fig. 4, the above S303 specifically comprises:
S401: divide the growth accelerations in the matrix at the current moment into N groups, and map the growth accelerations of each group into a sub-matrix.
Because the matrix contains a large number of growth accelerations, it is reduced in dimension to speed up locating the accelerations whose value exceeds the second threshold.
Specifically, all growth accelerations in the matrix are divided into N groups by a preset rule, so that each group contains a smaller number of accelerations; the groups may contain equal or unequal numbers of accelerations. The accelerations of each group are mapped into one sub-matrix, so that with N groups there are also N sub-matrices. As the topic data keeps growing, each newly updated acceleration is likewise mapped into one of the N sub-matrices.
S402: calculate the eigenvalue of each sub-matrix; when the eigenvalue of a sub-matrix is greater than a fourth threshold, filter out from the sub-matrix the growth accelerations greater than the second threshold.
The eigenvalue of each sub-matrix is computed. If the eigenvalues of any of the N sub-matrices exceed the preset fourth threshold, the growth accelerations greater than the second threshold are filtered out of each such sub-matrix.
In this embodiment, since a sub-matrix contains far fewer growth accelerations than the full matrix, computing the sub-matrices' eigenvalues separately makes it possible, whenever a sub-matrix's eigenvalue exceeds the fourth threshold, to rapidly locate in it the accelerations greater than the second threshold, improving the efficiency of burst topic detection.
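A rough numerical sketch of this grouping step. The use of the largest absolute eigenvalue as "the eigenvalue", the zero-padding of the sub-matrices, the contiguous grouping rule, and all threshold values are assumptions, since the text does not pin them down:

```python
import numpy as np

def bursty_accelerations(accels, n_groups=2, eig_threshold=5.0, accel_threshold=4.0):
    """Split the accelerations into n_groups square sub-matrices; where a
    sub-matrix's largest absolute eigenvalue exceeds eig_threshold, keep the
    accelerations in that group that exceed accel_threshold."""
    found = []
    for group in np.array_split(np.asarray(accels, dtype=float), n_groups):
        size = int(np.ceil(np.sqrt(group.size)))
        sub = np.zeros((size, size))
        sub.flat[:group.size] = group            # remaining entries stay empty (zero)
        eig = np.abs(np.linalg.eigvals(sub)).max()
        if eig > eig_threshold:
            found.extend(group[group > accel_threshold].tolist())
    return found

print(bursty_accelerations([0.5, 1.2, 6.0, 9.5, 0.1, 2.0], n_groups=2))
```

Only the second group's sub-matrix crosses the eigenvalue threshold here, so the acceleration 6.0 in the first group is never examined — which is exactly the pruning effect the embodiment describes.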
S304: according to the word segment corresponding to each determined growth acceleration, filter out, from all acquired topic data, the topic data containing that segment.
Since every growth acceleration in the matrix or sub-matrices corresponds to a keyword, and every keyword is a word segment from the highest-matching segmentation result of a piece of topic data, the system can look up, in a pre-stored mapping table between growth accelerations and word segments, the segment corresponding to each acceleration whose value exceeds the second threshold. If there are L such accelerations, L segments are looked up.
The system then filters every piece of topic data acquired up to the current moment, judging whether it contains the above L segments. If a piece of topic data contains them, the system filters it out and performs step S305 on it.
S305: perform word segmentation again on the topic data containing that segment, and calculate the word frequency feature value of each resulting segment.
The system re-segments each filtered piece of topic data. The segmentation may use any existing algorithm, including but not limited to string-matching-based and statistics-based algorithms; it yields a new set of segments of the topic data. To distinguish the segments obtained in S102 from those obtained in S305, the former are called first segments and the latter second segments; a first segment and a second segment may or may not coincide. To further filter out the second segments that influence the summary information most, a word frequency feature value is computed for each second segment from its word frequency feature quantities, which include but are not limited to term frequency (TF) and inverse document frequency (IDF).
As an embodiment of the present application, as shown in Fig. 5, the above S305 specifically comprises:
S501: perform word segmentation again on the topic data containing that segment, to obtain a plurality of segments.
S502: in all topic data acquired at the current moment, calculate the statistical word frequency and the inverse document frequency of each resulting segment.
In this embodiment, the number of occurrences of each second segment in the filtered pieces of topic data is counted; this count is the segment's statistical word frequency F_TF. If X pieces of topic data have been filtered out, of which X' (X' ≤ X, X ∈ Z) contain a given second segment, that segment's inverse document frequency is F_IDF = lg(X / X').
S503: weight the statistical word frequency and the inverse document frequency of each segment, to output the segment's word frequency feature value.
The statistical word frequency F_TF has a preset weighting coefficient a3 and the inverse document frequency F_IDF has a preset weighting coefficient a4, with a3 + a4 = 1. The word frequency feature value of each second segment is F = F_TF × a3 + F_IDF × a4.
In this embodiment, from the TF and IDF values of each second segment, the segment's word frequency feature value can be computed with self-defined weighting coefficients. By jointly considering the TF-IDF values, the importance of each second segment across the filtered pieces of topic data can be quantified and compared.
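Assuming concrete weights a3/a4 (only their sum of 1 is specified; 0.6/0.4 is invented) and reading lg as the base-10 logarithm, S501–S503 over the filtered topic data could be sketched as:

```python
import math

def freq_feature(docs, a3=0.6, a4=0.4):
    """F = F_TF*a3 + F_IDF*a4 per second segment, over the filtered topic
    data; F_TF is the total occurrence count, F_IDF = lg(X / X')."""
    x = len(docs)  # X: number of filtered pieces of topic data
    features = {}
    for seg in {s for doc in docs for s in doc}:
        f_tf = sum(doc.count(seg) for doc in docs)
        x_prime = sum(1 for doc in docs if seg in doc)  # X'
        f_idf = math.log10(x / x_prime)
        features[seg] = f_tf * a3 + f_idf * a4
    return features

# each inner list is one re-segmented piece of filtered topic data (invented)
docs = [["数据线", "产量"], ["数据线", "召回"], ["数据线", "产量", "数据线"]]
feats = freq_feature(docs)
high_freq = [s for s, f in feats.items() if f > 1.5]  # the third-threshold filter of S306
```

With these invented weights, "数据线" scores 4 × 0.6 + 0 × 0.4 = 2.4 and is the only segment to clear the (also invented) third threshold.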
S306: output the segments whose word frequency feature value is greater than a third threshold as high-frequency words, and connect the high-frequency words by a preset algorithm, to obtain the summary information containing them.
Every second segment whose word frequency feature value F exceeds the preset third threshold is determined; these second segments are the high-frequency words occurring in the topic data. Using the TextRank algorithm, the summary-generation tool of the Word software, or another self-defined algorithm, the high-frequency words are connected to obtain summary information associated with the topic data and with the high-frequency words.
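The text leaves the connection algorithm open (TextRank, the Word summarization tool, or a self-defined algorithm). One simple self-defined stand-in, shown only as a sketch, scores each candidate sentence of the filtered topic data by the high-frequency words it contains and keeps the best ones in their original order:

```python
def summarize(sentences, high_freq_words, top_n=2):
    """Score each sentence by how many high-frequency words it contains and
    keep the top_n best, in original order, joined into one summary string."""
    scored = [(sum(w in s for w in high_freq_words), i, s)
              for i, s in enumerate(sentences)]
    kept = sorted(scored, reverse=True)[:top_n]
    return "。".join(s for _, _, s in sorted(kept, key=lambda t: t[1]))

sents = ["天天集团数据线产量大涨", "今日天气晴", "数据线质量引发产量讨论"]
print(summarize(sents, ["数据线", "产量"]))
```

This keeps the two sentences mentioning both high-frequency words and drops the unrelated one, giving a crude but recognizable summary of the burst event.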
S105: display the keywords and the summary information, so that the user learns of the burst topic at the current moment.
The system displays the keywords obtained in real time together with the updated summary information. In practice, only when the topic data constitutes a burst topic do the growth accelerations of the keywords' cumulative frequencies exceed the threshold and the summary information get updated; the text the system displays in real time therefore closely resembles the real content of the burst topic event and has definite reference value.
In the embodiments of the present application, each time topic data is acquired from the information sharing platform, the keywords corresponding to the topic data are determined and the summary information is updated in real time based on those keywords, so that from the output keywords and summary information the user immediately learns roughly what the burst topic on the information sharing platform is about and can quickly determine from the summary whether the burst topic is related to the enterprise itself. Burst topic events related to the enterprise can thereby be effectively discovered, tracked, and handled, improving the enterprise's soft power.
It should be understood that the numbering of the steps in the above embodiments does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and does not constitute any limitation on the implementation of the embodiments of the present application.
Corresponding to the burst topic detection method described in the above embodiments, Fig. 6 shows a schematic diagram of the burst topic detection apparatus provided by an embodiment of the present application; for ease of description, only the parts relevant to this embodiment are shown.
Referring to Fig. 6, the apparatus comprises:
an acquisition module 61, configured to continuously acquire topic data from an information sharing platform;
a matching module 62, configured to, each time a piece of the topic data is acquired, match the topic data against the words in a preset lexicon, to output a plurality of word segmentation results;
an output module 63, configured to output the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data;
an updating module 64, configured to update, according to the keywords, the summary information associated with the topic data; and
a display module 65, configured to display the keywords and the summary information, so that a user learns of the burst topic at the current moment.
Optionally, the updating module 64 comprises:
a first calculation sub-module, configured to separately acquire the cumulative word frequency of each keyword and calculate the growth acceleration of the cumulative word frequency, where the cumulative word frequency of a keyword denotes the cumulative number of times the keyword appears in all topic data acquired up to the current moment;
an adding sub-module, configured to add the growth accelerations corresponding to the respective keywords to a pre-generated matrix;
a determination sub-module, configured to calculate the eigenvalue of the matrix at the current moment and, when the eigenvalue is greater than a first threshold, determine from the matrix the growth accelerations greater than a second threshold;
a filtering sub-module, configured to filter out, from all acquired topic data and according to the word segment corresponding to each determined growth acceleration, the topic data containing that segment;
a segmentation sub-module, configured to perform word segmentation again on the topic data containing that segment and calculate the word frequency feature value of each resulting segment; and
a first output sub-module, configured to output the segments whose word frequency feature value is greater than a third threshold as high-frequency words, and connect the high-frequency words by a preset algorithm, to obtain the summary information containing them.
Optionally, the determination sub-module is specifically configured to:
divide the growth accelerations in the matrix at the current moment into N groups, and map the growth accelerations of each group into a sub-matrix; and
calculate the eigenvalue of each sub-matrix and, when the eigenvalue of a sub-matrix is greater than a fourth threshold, filter out from the sub-matrix the growth accelerations greater than the second threshold;
where N is an integer greater than 1.
Optionally, the segmentation sub-module is specifically configured to:
perform word segmentation again on the topic data containing that segment, to obtain a plurality of segments;
calculate, in all topic data acquired at the current moment, the statistical word frequency and the inverse document frequency of each resulting segment; and
weight the statistical word frequency and the inverse document frequency of each segment, to output the segment's word frequency feature value.
Optionally, the output module 63 comprises:
a second calculation sub-module, configured to calculate the average number of characters per word segment of each segmentation result, according to the total number of characters of each word segment in the result and the total number of word segments in the result;
a weighting sub-module, configured to weight the average number of characters per segment and the total number of segments of each segmentation result, to output the matching degree of that result; and
a second output sub-module, configured to output the word segments contained in the segmentation result with the highest matching degree as the keywords corresponding to the topic data.
Fig. 7 is a schematic diagram of the burst topic detection device provided by an embodiment of the present application. As shown in Fig. 7, the burst topic detection device 7 of this embodiment comprises a processor 70 and a memory 71, the memory 71 storing computer readable instructions 72 executable on the processor 70, for example a burst topic detection program. When executing the computer readable instructions 72, the processor 70 implements the steps of the burst topic detection method embodiments above, for example steps 101 to 105 shown in Fig. 1; alternatively, when executing the computer readable instructions 72, the processor 70 implements the functions of the modules/units of the apparatus embodiments above, for example the functions of modules 61 to 65 shown in Fig. 6.
Exemplarily, the computer readable instructions 72 may be divided into one or more modules/units, which are stored in the memory 71 and executed by the processor 70 to accomplish the present application. The one or more modules/units may be a series of computer readable instruction segments capable of accomplishing specific functions, the segments describing the execution of the computer readable instructions 72 in the burst topic detection device 7.
The burst topic detection device 7 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. Those skilled in the art will understand that Fig. 7 is merely an example of the burst topic detection device 7 and does not limit it; the device may include more or fewer components than shown, or combine certain components, or use different components; for example, it may further include input/output devices, network access devices, a bus, and the like.
The processor 70 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or any conventional processor.
The memory 71 may be an internal storage unit of the burst topic detection device 7, such as its hard disk or memory. The memory 71 may also be an external storage device of the burst topic detection device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the device 7. Further, the memory 71 may include both an internal storage unit and an external storage device of the device 7. The memory 71 is used to store the computer readable instructions and the other programs and data required by the burst topic detection device, and may also be used to temporarily store data that has been output or is about to be output.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If implemented in the form of a software functional unit and sold or used as a standalone product, the integrated unit may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a U disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent substitutions of some of their technical features, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. A burst topic detection method, comprising:
    continuously acquiring topic data from an information sharing platform;
    each time a piece of the topic data is acquired, matching the topic data against the words in a preset lexicon, to output a plurality of word segmentation results;
    outputting the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data;
    updating, according to the keywords, the summary information associated with the topic data; and
    displaying the keywords and the summary information, so that a user learns of the burst topic at the current moment.
  2. The burst topic detection method according to claim 1, wherein the updating, according to the keywords, the summary information associated with the topic data comprises:
    separately acquiring the cumulative word frequency of each of the keywords, and calculating the growth acceleration of the cumulative word frequency, wherein the cumulative word frequency of a keyword denotes the cumulative number of times the keyword appears in all topic data acquired at the current moment;
    adding the growth accelerations corresponding to the respective keywords to a pre-generated matrix;
    calculating the eigenvalue of the matrix at the current moment, and, when the eigenvalue is greater than a first threshold, determining from the matrix the growth accelerations greater than a second threshold;
    filtering out, from all acquired topic data and according to the word segment corresponding to each determined growth acceleration, the topic data containing that word segment;
    performing word segmentation again on the topic data containing that word segment, and calculating the word frequency feature value of each word segment obtained by the segmentation; and
    outputting the word segments whose word frequency feature value is greater than a third threshold as high-frequency words, and connecting the high-frequency words by a preset algorithm, to obtain the summary information containing the high-frequency words.
  3. The burst topic detection method according to claim 2, wherein the calculating the eigenvalue of the matrix at the current moment, and, when the eigenvalue is greater than the first threshold, determining from the matrix the growth accelerations greater than the second threshold comprises:
    dividing the growth accelerations in the matrix at the current moment into N groups, and mapping the growth accelerations of each group into a sub-matrix; and
    calculating the eigenvalue of each sub-matrix, and, when the eigenvalue of a sub-matrix is greater than a fourth threshold, filtering out from the sub-matrix the growth accelerations greater than the second threshold;
    wherein N is an integer greater than 1.
  4. The burst topic detection method according to claim 2, wherein the performing word segmentation again on the topic data containing that word segment, and calculating the word frequency feature value of each word segment obtained by the segmentation comprises:
    performing word segmentation again on the topic data containing that word segment, to obtain a plurality of word segments;
    calculating, in all topic data acquired at the current moment, the statistical word frequency and the inverse document frequency of each word segment obtained by the segmentation; and
    weighting the statistical word frequency and the inverse document frequency of each word segment, to output the word frequency feature value of that word segment.
  5. The burst topic detection method according to claim 1, wherein the outputting the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data comprises:
    calculating the average number of characters per word segment of each word segmentation result, according to the total number of characters of each word segment in that result and the total number of word segments in that result;
    weighting the average number of characters per word segment and the total number of word segments of each word segmentation result, to output the matching degree of that result; and
    outputting the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data.
  6. A burst topic detection apparatus, comprising:
    an acquisition module, configured to continuously acquire topic data from an information sharing platform;
    a matching module, configured to, each time a piece of the topic data is acquired, match the topic data against the words in a preset lexicon, to output a plurality of word segmentation results;
    an output module, configured to output the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data;
    an updating module, configured to update, according to the keywords, the summary information associated with the topic data; and
    a display module, configured to display the keywords and the summary information, so that a user learns of the burst topic at the current moment.
  7. The burst topic detection apparatus according to claim 6, wherein the updating module comprises:
    a first calculation sub-module, configured to separately acquire the cumulative word frequency of each of the keywords and calculate the growth acceleration of the cumulative word frequency, wherein the cumulative word frequency of a keyword denotes the cumulative number of times the keyword appears in all topic data acquired at the current moment;
    an adding sub-module, configured to add the growth accelerations corresponding to the respective keywords to a pre-generated matrix;
    a determination sub-module, configured to calculate the eigenvalue of the matrix at the current moment and, when the eigenvalue is greater than a first threshold, determine from the matrix the growth accelerations greater than a second threshold;
    a filtering sub-module, configured to filter out, from all acquired topic data and according to the word segment corresponding to each determined growth acceleration, the topic data containing that word segment;
    a segmentation sub-module, configured to perform word segmentation again on the topic data containing that word segment and calculate the word frequency feature value of each word segment obtained by the segmentation; and
    a first output sub-module, configured to output the word segments whose word frequency feature value is greater than a third threshold as high-frequency words, and connect the high-frequency words by a preset algorithm, to obtain the summary information containing the high-frequency words.
  8. The burst topic detection apparatus according to claim 7, wherein the determination sub-module is specifically configured to:
    divide the growth accelerations in the matrix at the current moment into N groups, and map the growth accelerations of each group into a sub-matrix; and
    calculate the eigenvalue of each sub-matrix and, when the eigenvalue of a sub-matrix is greater than a fourth threshold, filter out from the sub-matrix the growth accelerations greater than the second threshold;
    wherein N is an integer greater than 1.
  9. The burst topic detection apparatus according to claim 7, wherein the segmentation sub-module is specifically configured to:
    perform word segmentation again on the topic data containing that word segment, to obtain a plurality of word segments;
    calculate, in all topic data acquired at the current moment, the statistical word frequency and the inverse document frequency of each word segment obtained by the segmentation; and
    weight the statistical word frequency and the inverse document frequency of each word segment, to output the word frequency feature value of that word segment.
  10. The burst topic detection apparatus according to claim 6, wherein the output module comprises:
    a second calculation sub-module, configured to calculate the average number of characters per word segment of each word segmentation result, according to the total number of characters of each word segment in that result and the total number of word segments in that result;
    a weighting sub-module, configured to weight the average number of characters per word segment and the total number of word segments of each word segmentation result, to output the matching degree of that result; and
    a second output sub-module, configured to output the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data.
  11. A burst topic detection device, comprising a memory and a processor, the memory storing computer readable instructions executable on the processor, wherein the processor, when executing the computer readable instructions, implements the following steps:
    continuously acquiring topic data from an information sharing platform;
    each time a piece of the topic data is acquired, matching the topic data against the words in a preset lexicon, to output a plurality of word segmentation results;
    outputting the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data;
    updating, according to the keywords, the summary information associated with the topic data; and
    displaying the keywords and the summary information, so that a user learns of the burst topic at the current moment.
  12. The burst topic detection device according to claim 11, wherein the updating, according to the keywords, the summary information associated with the topic data comprises:
    separately acquiring the cumulative word frequency of each of the keywords, and calculating the growth acceleration of the cumulative word frequency, wherein the cumulative word frequency of a keyword denotes the cumulative number of times the keyword appears in all topic data acquired at the current moment;
    adding the growth accelerations corresponding to the respective keywords to a pre-generated matrix;
    calculating the eigenvalue of the matrix at the current moment, and, when the eigenvalue is greater than a first threshold, determining from the matrix the growth accelerations greater than a second threshold;
    filtering out, from all acquired topic data and according to the word segment corresponding to each determined growth acceleration, the topic data containing that word segment;
    performing word segmentation again on the topic data containing that word segment, and calculating the word frequency feature value of each word segment obtained by the segmentation; and
    outputting the word segments whose word frequency feature value is greater than a third threshold as high-frequency words, and connecting the high-frequency words by a preset algorithm, to obtain the summary information containing the high-frequency words.
  13. The burst topic detection device according to claim 12, wherein the calculating the eigenvalue of the matrix at the current moment, and, when the eigenvalue is greater than the first threshold, determining from the matrix the growth accelerations greater than the second threshold comprises:
    dividing the growth accelerations in the matrix at the current moment into N groups, and mapping the growth accelerations of each group into a sub-matrix; and
    calculating the eigenvalue of each sub-matrix, and, when the eigenvalue of a sub-matrix is greater than a fourth threshold, filtering out from the sub-matrix the growth accelerations greater than the second threshold;
    wherein N is an integer greater than 1.
  14. The burst topic detection device according to claim 12, wherein the performing word segmentation again on the topic data containing that word segment, and calculating the word frequency feature value of each word segment obtained by the segmentation comprises:
    performing word segmentation again on the topic data containing that word segment, to obtain a plurality of word segments;
    calculating, in all topic data acquired at the current moment, the statistical word frequency and the inverse document frequency of each word segment obtained by the segmentation; and
    weighting the statistical word frequency and the inverse document frequency of each word segment, to output the word frequency feature value of that word segment.
  15. The burst topic detection device according to claim 11, wherein the outputting the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data comprises:
    calculating the average number of characters per word segment of each word segmentation result, according to the total number of characters of each word segment in that result and the total number of word segments in that result;
    weighting the average number of characters per word segment and the total number of word segments of each word segmentation result, to output the matching degree of that result; and
    outputting the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data.
  16. A computer readable storage medium storing computer readable instructions, wherein the computer readable instructions, when executed by at least one processor, implement the following steps:
    continuously acquiring topic data from an information sharing platform;
    each time a piece of the topic data is acquired, matching the topic data against the words in a preset lexicon, to output a plurality of word segmentation results;
    outputting the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data;
    updating, according to the keywords, the summary information associated with the topic data; and
    displaying the keywords and the summary information, so that a user learns of the burst topic at the current moment.
  17. The computer readable storage medium according to claim 16, wherein the updating, according to the keywords, the summary information associated with the topic data comprises:
    separately acquiring the cumulative word frequency of each of the keywords, and calculating the growth acceleration of the cumulative word frequency, wherein the cumulative word frequency of a keyword denotes the cumulative number of times the keyword appears in all topic data acquired at the current moment;
    adding the growth accelerations corresponding to the respective keywords to a pre-generated matrix;
    calculating the eigenvalue of the matrix at the current moment, and, when the eigenvalue is greater than a first threshold, determining from the matrix the growth accelerations greater than a second threshold;
    filtering out, from all acquired topic data and according to the word segment corresponding to each determined growth acceleration, the topic data containing that word segment;
    performing word segmentation again on the topic data containing that word segment, and calculating the word frequency feature value of each word segment obtained by the segmentation; and
    outputting the word segments whose word frequency feature value is greater than a third threshold as high-frequency words, and connecting the high-frequency words by a preset algorithm, to obtain the summary information containing the high-frequency words.
  18. The computer readable storage medium according to claim 17, wherein the calculating the eigenvalue of the matrix at the current moment, and, when the eigenvalue is greater than the first threshold, determining from the matrix the growth accelerations greater than the second threshold comprises:
    dividing the growth accelerations in the matrix at the current moment into N groups, and mapping the growth accelerations of each group into a sub-matrix; and
    calculating the eigenvalue of each sub-matrix, and, when the eigenvalue of a sub-matrix is greater than a fourth threshold, filtering out from the sub-matrix the growth accelerations greater than the second threshold;
    wherein N is an integer greater than 1.
  19. The computer readable storage medium according to claim 17, wherein the performing word segmentation again on the topic data containing that word segment, and calculating the word frequency feature value of each word segment obtained by the segmentation comprises:
    performing word segmentation again on the topic data containing that word segment, to obtain a plurality of word segments;
    calculating, in all topic data acquired at the current moment, the statistical word frequency and the inverse document frequency of each word segment obtained by the segmentation; and
    weighting the statistical word frequency and the inverse document frequency of each word segment, to output the word frequency feature value of that word segment.
  20. The computer readable storage medium according to claim 16, wherein the outputting the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data comprises:
    calculating the average number of characters per word segment of each word segmentation result, according to the total number of characters of each word segment in that result and the total number of word segments in that result;
    weighting the average number of characters per word segment and the total number of word segments of each word segmentation result, to output the matching degree of that result; and
    outputting the word segments contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data.
PCT/CN2018/074870 2017-06-09 2018-01-31 Burst topic detection method, apparatus, device, and medium WO2018223718A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710433359.1A CN107688596B (zh) 2017-06-09 2017-06-09 Burst topic detection method and burst topic detection device
CN201710433359.1 2017-06-09

Publications (1)

Publication Number Publication Date
WO2018223718A1 true WO2018223718A1 (zh) 2018-12-13

Family

ID=61152644

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/074870 WO2018223718A1 (zh) 2017-06-09 2018-01-31 突发话题检测方法、装置、设备及介质

Country Status (2)

Country Link
CN (1) CN107688596B (zh)
WO (1) WO2018223718A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111897958B (zh) * 2020-07-16 2024-03-12 邓桦 Classification method for classical Chinese poetry based on natural language processing
CN113204638B (zh) * 2021-04-23 2024-02-23 上海明略人工智能(集团)有限公司 Recommendation method, system, computer, and storage medium based on work session units

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102646114A * 2012-02-17 2012-08-22 清华大学 Method for generating news topic timeline summaries based on breakout points
CN102971762A * 2010-07-01 2013-03-13 费斯布克公司 Facilitating interactions between users of a social network
CN105022827A * 2015-07-23 2015-11-04 合肥工业大学 Domain-topic-oriented method for dynamic aggregation of Web news

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102289487B * 2011-08-09 2013-09-04 浙江大学 Topic-model-based detection method for network burst hotspot events
CN102346766A * 2011-09-20 2012-02-08 北京邮电大学 Method and apparatus for detecting network hot topics based on maximal clique discovery
CN104615593B * 2013-11-01 2017-09-29 北大方正集团有限公司 Method and apparatus for automatic detection of Weibo hot topics


Also Published As

Publication number Publication date
CN107688596B (zh) 2020-02-21
CN107688596A (zh) 2018-02-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18813985; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 11/03/2020))
122 Ep: pct application non-entry in european phase (Ref document number: 18813985; Country of ref document: EP; Kind code of ref document: A1)