WO2018086518A1 - Method and device for real-time detection of new subject - Google Patents

Method and device for real-time detection of new subject

Publication number
WO2018086518A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
topic
subject
new
theme
Prior art date
Application number
PCT/CN2017/109840
Other languages
French (fr)
Chinese (zh)
Inventor
徐文斌
Original Assignee
北京国双科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京国双科技有限公司 filed Critical 北京国双科技有限公司
Publication of WO2018086518A1 publication Critical patent/WO2018086518A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Definitions

  • the present invention relates to the field of Internet technologies, and in particular, to a real-time detection method and apparatus for a new theme.
  • Topic Detection and Tracking (TDT) is a highly practical technology in the fields of natural language processing and information retrieval, intended to discover and process topics that appear in text. It is also a practical technique for effectively discovering and extracting useful information in the context of big data. Hot-topic discovery and tracking typically means finding a topic within a particular domain or for a specific event and following how that topic develops.
  • The main steps of existing hot-topic detection techniques, both domestic and international, are: text acquisition, in which news reports are collected from the Internet; document vectorization, in which a vectorized representation of each document is obtained; document clustering, in which the vectorized documents are clustered and the high-frequency words or the cluster-center document of each cluster are taken to represent a topic; and finally ranking, in which the topics found within a given period are sorted by heat and output in a specific order.
  • This approach applies only to the discovery and tracking of specified topics. It does not detect emerging topics in real time and cannot effectively express how the same topic evolves across different periods.
  • the present invention provides a real-time detection method and apparatus for a new theme, the main purpose of which is to detect a new topic in a text in real time and use the topic as an option to classify subsequent texts, thereby improving the accuracy of text classification.
  • the present invention mainly provides the following technical solutions:
  • the present invention provides a real-time detection method for a new theme, the method comprising:
  • the determining whether the subject of the document can be attributed to an existing topic classification includes:
  • The similarity value is judged against the set similarity threshold, and when the similarity value is less than the threshold, the topic of the document is determined to be a new topic.
  • the calculating the theme of the document according to the distribution of the keywords in the document comprises:
  • Creating a topic model, which is a dynamic model for calculating the topic of the document, obtained by adding variational Bayesian inference to an LDA topic model; the topic model is capable of dynamically adjusting for the change in topic probability caused by changes in the keyword distribution across the documents already processed;
  • the document is entered into the topic model to calculate the subject of the document.
  • the creating a new topic and dividing the document into the classification of the new topic comprises:
  • the name of the new topic is determined according to the title name of the document and the distribution of the keyword.
  • Acquiring the vectorized representation of the document in real time according to the specified domain comprises:
  • the keywords in the document are filtered using the TF-IDF model.
  • the method further includes:
  • the data information of the document is recorded, and the data information includes a name of the new topic, a correspondence between the topic and the keyword.
  • the method further includes:
  • The new topic is added to the existing topic classification, so as to extend the existing topic classification and to perform attribution determination on the topics of subsequent documents.
  • the present invention also provides a new subject real-time detecting device, the device comprising:
  • An obtaining unit configured to acquire a vectorized representation of the document in real time according to the specified domain
  • a calculating unit configured to calculate a theme of the document according to a distribution of the keyword in the document acquired by the acquiring unit
  • a determining unit configured to determine whether the subject of the document obtained by the calculating unit can be attributed to an existing topic classification
  • a creating unit configured to: when the determining unit determines that the topic of the document cannot be attributed to the existing topic classification, create a new topic, and divide the document into the classification of the new topic.
  • the determining unit comprises:
  • a calculation module configured to calculate, by using cosine similarity, a similarity value between a theme of the document and an existing theme
  • the determining module is configured to determine, according to the set similarity threshold, a similarity value obtained by the calculating module, and when the similarity value is less than the threshold, determine that the subject of the document is a new topic.
  • The calculating unit is further configured to calculate the topic of the document by using a topic model, where the topic model is a dynamic model for calculating the topic of the document, obtained by adding variational Bayesian inference to an LDA topic model, and is capable of dynamically adjusting for the change in topic probability caused by changes in the keyword distribution across the documents already processed.
  • the creating unit comprises:
  • a determining module configured to determine a name of the new topic according to a title name of the document obtained by the statistical module and a distribution of the keyword.
  • the obtaining unit comprises:
  • An obtaining module configured to obtain a multi-source document according to the specified domain
  • a vectorization module configured to use a word frequency to obtain a vectorized representation of the document acquired by the acquiring module
  • a screening module configured to filter a keyword in the document represented by the vectorization module by using a TF-IDF model.
  • the device further comprises:
  • a sorting unit for sorting existing topics according to the reference quantity, duration, and novelty of the theme
  • a recording unit for recording data information of the document, the data information including the name of the new topic and the correspondence between topics and keywords.
  • the device further comprises:
  • an adding unit configured to add the new topic to the existing topic classification after the creating unit creates the new topic, so as to extend the existing topic classification and perform attribution determination on the topics of subsequent documents.
  • With the above technical solution, document data of the same domain acquired from the Internet can be processed in real time: after a document is vectorized, its topic is calculated according to the distribution of the keywords in the document, and its topic attribution is determined. When the topic of the document does not belong to the existing topic classification, a new topic is created for the document.
  • The present invention thus performs topic detection incrementally and uses each new topic in the analysis of subsequent documents, thereby improving the accuracy of the topic classification of documents.
  • The topic detection method adopted by the present invention can dynamically adjust the distribution probability of the keywords within a topic as the number of detected documents increases. This makes it possible to follow both the evolution of a single topic and the process by which new topics evolve out of it.
  • FIG. 1 is a flowchart of a real-time detection method for a new theme according to an embodiment of the present invention
  • FIG. 3 is a block diagram showing the composition of a real-time detecting device of a new theme according to an embodiment of the present invention
  • FIG. 4 is a block diagram showing the composition of a real-time detecting apparatus of another new theme proposed by the embodiment of the present invention.
  • An embodiment of the present invention provides a real-time detection method for a new topic. As shown in FIG. 1, the method is applied to real-time topic detection on online documents: it discovers new topics and automatically assigns documents to the classification of a new topic. The specific steps are as follows:
  • the embodiment of the present invention is also directed to subject detection and attribution calculation for a document in a specified domain. Therefore, the acquired documents are related corpus information collected according to the domain, and the documents may be obtained from a plurality of sources, such as a forum, a news portal, a WeChat public account, and the like.
  • Documents of the same domain obtained from different sources differ in topic because of their different perspectives, so the possibility of new topics arising is relatively large.
  • After the document is obtained, it needs to be vectorized.
  • The most commonly used document vectorization is based on word features: the document is vectorized by word frequency, that is, the TF-IDF (term frequency-inverse document frequency) model is used to provide a vectorized representation of the document.
  • The specific manner of vectorization is not limited in the embodiment of the present invention; the purpose of vectorizing the document is to facilitate its subsequent detailed analysis.
  • The keywords in the document are then filtered out. A keyword is a word capable of expressing the core idea of the document. In most cases such words occur with high frequency in the document, so a high-frequency word is very likely to express the topic of the document.
  • Among high-frequency words, however, it is also necessary to exclude words that have no practical meaning, such as "yes".
  • The word frequency of these words tends to be higher than that of the keywords. In the screening operation, such words can therefore be excluded by setting a filtering interval, for example by keeping only the words whose inverse document frequency (IDF) value lies between 0.5 and 0.8.
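  • The interval-based screening described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `idf_scores` and `filter_keywords` are hypothetical, and the division by log(N), used here to map IDF values into the range 0 to 1 so that the interval 0.5 to 0.8 is meaningful, is an assumption that the source does not state.

```python
import math
from collections import Counter

def idf_scores(docs):
    """Return a normalised IDF in [0, 1] for every word in a tokenised corpus."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    # Assumption: log(N / df) is divided by log(N) to map values into [0, 1].
    scale = math.log(n_docs) if n_docs > 1 else 1.0
    return {w: math.log(n_docs / df) / scale for w, df in doc_freq.items()}

def filter_keywords(doc, idf, low=0.5, high=0.8):
    """Keep the distinct words of `doc` whose IDF lies inside [low, high]."""
    return sorted({w for w in doc if low <= idf.get(w, 0.0) <= high})

docs = [
    ["the", "olympics", "opening"],
    ["the", "olympics", "medal"],
    ["the", "summit", "trade"],
    ["the", "summit", "policy"],
    ["the", "weather"],
]
idf = idf_scores(docs)
keywords = filter_keywords(docs[0], idf)
```

  • In this toy corpus a word found in every document scores 0 and is dropped as meaningless, while a word found in only one document scores 1 and falls above the interval.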
  • the subject of the document is determined based on the distribution of the subject word in the document.
  • The mainstream way of extracting the topic of a document is to use an LDA (Latent Dirichlet Allocation) topic model, in which a topic represents a concept, an aspect, or, in the embodiment of the present invention, a subject, and can be described by a series of related words.
  • The distribution of words in a topic gives the probability of each of these words occurring; accordingly, the probabilities that the document belongs to different topics are obtained from the distribution of the keywords in the document.
  • In the embodiment of the present invention, the topics contained in the document are extracted by a created topic model, whose main function is to determine the topic to which the document belongs according to the distribution of the keywords in the document. During this determination, the topic model can automatically update the correspondence between keyword distributions and topics according to the documents already processed. That is, the topic model applies the results of previously processed documents to the analysis of subsequent documents, adjusting the probability of the attributed topic according to the keyword distributions of the documents processed earlier.
  • The topic model in the embodiment of the present invention is a dynamic model whose calculation refers to the processing results of previous documents. Therefore, by recording the changes in the keyword distribution of a topic, the evolution of that topic across the acquired documents can be obtained.
  • the embodiment of the present invention does not limit the specific algorithm used by the topic model.
  • step 104 is performed to determine the topic as a new topic.
  • The specific judgment manner in the embodiment of the present invention is to perform cluster analysis on the document and the documents in the existing topic classification, that is, to compare the keywords of the document's topic with those of each existing topic in order to judge the similarity between the two topics. When the similarity is lower than a certain value, it can be determined that a new topic has been generated.
  • When it is determined in step 103 that the topic of the document cannot be attributed to the existing topic classification, a new topic is created based on the document, and the name of the new topic can be extracted and generated from the keywords of the document.
  • When a new topic is created, it is added to the existing topic classification so that subsequent documents can be checked for attribution to the new topic.
  • When a further document is assigned to the new topic, the name of the new topic is re-extracted. The above steps are thus executed cyclically: the topic of each processed document is detected and classified, and the topic name is updated, so that new topics are discovered and extracted while existing topics continue to be tracked.
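  • The cyclic detect, classify and extend procedure described above might be sketched as below. This is an illustrative outline only: `process_stream` and `jaccard` are hypothetical names, keyword-set overlap stands in for whatever similarity the topic model supplies, the threshold of 0.3 is an arbitrary example value, and the re-extraction of the topic name is omitted for brevity.

```python
def jaccard(a, b):
    """Keyword-set overlap; a stand-in for the similarity given by the topic model."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def process_stream(documents, threshold=0.3):
    """Assign each (title, keywords) pair to a topic, creating topics on the fly."""
    topics = []  # each topic: {"name": str, "keywords": set, "docs": [titles]}
    for title, keywords in documents:
        # Find the most similar existing topic, if any exist yet.
        best = max(topics, key=lambda t: jaccard(keywords, t["keywords"]), default=None)
        if best is None or jaccard(keywords, best["keywords"]) < threshold:
            # No existing topic is similar enough: create one named after the document
            # and add it to the classification used for all subsequent documents.
            best = {"name": title, "keywords": set(), "docs": []}
            topics.append(best)
        best["keywords"] |= set(keywords)
        best["docs"].append(title)
    return topics

stream = [
    ("Rio Olympics opens", ["olympics", "rio", "opening"]),
    ("G20 summit begins", ["g20", "summit", "trade"]),
    ("Rio medal table", ["olympics", "rio", "medal"]),
]
topics = process_stream(stream)
```

  • The third document is filed under the topic created by the first one, which shows how a newly created topic immediately becomes an option for later documents.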
  • The real-time detection method for a new topic adopted by the embodiment of the present invention can be used as a document processing framework when processing online documents.
  • Each document processed by the framework is labeled with the topic it belongs to, and the framework is able to detect and add new topics. As the number of processed documents increases, the number of topics held by the framework also increases, and topic names continue to evolve.
  • When the document processing framework is applied, it can switch between documents of different domains by saving the processing information; for a new domain, only the model information in the framework needs to be initialized.
  • The processing framework adopted by the embodiment of the present invention implements online real-time detection, which makes fuller use of the processing resources of the system. At the same time, the framework processes topics incrementally online: a new topic is added in real time and applied to the processing of new documents. The processing is thus adjusted dynamically, which makes the topic classification of documents more accurate.
  • the evolution of the theme can be effectively tracked by recording changes in the distribution of subject terms and subject name changes.
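  • One way to record such changes is an append-only history per topic, sketched below. The class `TopicHistory` and its methods are hypothetical and are not taken from the patent; the sketch only illustrates that storing the name and keyword distribution after each update is enough to compare any two points in a topic's life.

```python
from dataclasses import dataclass, field

@dataclass
class TopicHistory:
    """Append-only log of a topic's name and keyword distribution after each update."""
    snapshots: list = field(default_factory=list)

    def record(self, name, keyword_probs):
        # Copy the distribution so that later in-place updates do not alter the log.
        self.snapshots.append((name, dict(keyword_probs)))

    def names(self):
        """Sequence of names the topic has carried, in order."""
        return [name for name, _ in self.snapshots]

    def drift(self):
        """Per-keyword probability change between the first and the latest snapshot."""
        if len(self.snapshots) < 2:
            return {}
        first, last = self.snapshots[0][1], self.snapshots[-1][1]
        return {k: last.get(k, 0.0) - first.get(k, 0.0) for k in set(first) | set(last)}

history = TopicHistory()
history.record("Rio Olympics opens", {"olympics": 0.6, "opening": 0.4})
history.record("Rio medal table", {"olympics": 0.5, "medal": 0.5})
change = history.drift()
```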
  • A plurality of documents in a specified domain are first acquired from a plurality of information sources, and each document is vectorized according to word frequency; that is, the term frequency (TF) and inverse document frequency (IDF) of the words in the document are calculated by using the TF-IDF model. The TF-IDF model is applied after word segmentation of the document. After word segmentation, high-frequency words that are obviously not keywords, such as "yes", are filtered out.
  • The TF-IDF model calculates the inverse document frequency of each word in the document. In the embodiment of the present invention, a word whose IDF value is less than 0.8 is determined to be a keyword of the document.
  • The TF-IDF model is applied in the embodiment of the present invention as follows:
  • For the term frequency (TF), the numerator is the number of occurrences of word i in document j, and the denominator is the sum of the occurrences of all the words in that document;
  • For the inverse document frequency (IDF), the numerator is the total number of documents in the corpus,
  • and the denominator is the number of documents in which word i appears.
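  • The formulas themselves did not survive in this text. Based on the numerator and denominator descriptions above, they appear to correspond to the standard TF-IDF definition, which can be written as below. The symbols are assumptions chosen for illustration: n_{i,j} is the number of occurrences of word t_i in document d_j, |D| is the number of documents in the corpus, and the final product is the usual TF-IDF weight rather than something the text states.

```latex
\[
\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \qquad
\mathrm{idf}_{i} = \log \frac{|D|}{\left|\{\, j : t_i \in d_j \,\}\right|}, \qquad
\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_{i}
\]
```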
  • the obtained vectorized document can be calculated to determine the keyword in the document.
  • Topic extraction and categorization can be realized by a custom-built topic model, which is a dynamic model for calculating the topic of the document and can dynamically adjust the keyword distribution according to the documents already processed.
  • The topic model is an LDA dynamic topic model based on variational Bayesian inference; that is, variational Bayesian inference is introduced on top of the LDA topic model to calculate the impact of each document on the evolution of the topic.
  • Variational Bayesian inference seeks the approximate distribution that is closest to the true posterior distribution of the hidden variables, and uses this approximation in place of the posterior.
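  • In the usual formulation of variational inference, "closest" is measured by the Kullback-Leibler divergence, which gives the objective below. The notation is an assumption for illustration: z denotes the hidden variables, w the observed words, and Q the family of candidate approximate distributions. The second identity shows why minimizing the divergence is equivalent to maximizing the evidence lower bound (ELBO).

```latex
\[
q^{*}(z) = \operatorname*{arg\,min}_{q \in \mathcal{Q}} \; \mathrm{KL}\!\left( q(z) \,\|\, p(z \mid w) \right),
\qquad
\log p(w) = \underbrace{\mathbb{E}_{q}\!\left[ \log p(w, z) - \log q(z) \right]}_{\mathrm{ELBO}(q)}
          + \mathrm{KL}\!\left( q(z) \,\|\, p(z \mid w) \right)
\]
```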
  • Variational Bayesian inference is used to help the LDA topic model find, for each document, a topic distribution closer to that of the preceding document. Since the LDA topic model is widely used in existing topic recognition technology, its calculation principle and process are not described in detail here. After variational Bayesian inference is added, the calculation process of LDA is as follows:
  • where Mult() is a multinomial distribution and Dir() is a Dirichlet distribution.
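  • The patent's own formula is missing from this text and is not reconstructed here. For orientation only, the generative process of the standard (static) LDA model, which uses the same Mult() and Dir() notation, is given below; the symbols α, β, θ, φ, z and w follow common convention and are assumptions, and the dynamic variant described in this embodiment may differ.

```latex
\begin{align*}
\theta_d    &\sim \mathrm{Dir}(\alpha)             && \text{topic proportions of document } d \\
\varphi_k   &\sim \mathrm{Dir}(\beta)              && \text{word distribution of topic } k \\
z_{d,n}     &\sim \mathrm{Mult}(\theta_d)          && \text{topic of the } n\text{-th word in } d \\
w_{d,n}     &\sim \mathrm{Mult}(\varphi_{z_{d,n}}) && \text{the observed word}
\end{align*}
```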
  • Using the LDA dynamic topic model based on variational Bayesian inference, the keywords in the vectorized document are processed, and the probability of each corresponding topic is obtained from the specific distribution of the keywords. Because variational Bayesian inference is introduced into the topic model, the model can also calculate the evolution of different topics as it processes a large number of documents.
  • Each document is set to belong to only one topic. Therefore, when determining the topic of the document, the topic with the highest probability value in the document may be taken as the topic of the document. As documents continue to be processed, the topic probabilities also change constantly, and when the change reaches a certain threshold, the keyword distribution gives rise to a new topic.
  • Specifically, cosine similarity can be used to calculate the similarity value between the keywords in the topic of the current document and all the keywords of the documents belonging to an existing topic. Whether the topic of the current document is the same as the existing topic is determined against a set threshold: when the similarity value is less than the threshold, the topic of the current document is determined to be a new topic. The threshold is an empirical value that can be customized.
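  • The threshold test described above can be sketched as follows. The representation of a topic as a dictionary mapping keywords to weights, the function names, and the example threshold of 0.5 are all assumptions made for illustration; the text only says that the threshold is an empirical, customizable value.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse keyword-weight vectors (dicts)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def is_new_topic(doc_topic, existing_topics, threshold=0.5):
    """True when the document's topic is below `threshold` against every existing topic."""
    return all(cosine_similarity(doc_topic, t) < threshold for t in existing_topics)

existing = [
    {"olympics": 0.5, "rio": 0.5},
    {"g20": 0.7, "summit": 0.3},
]
known = is_new_topic({"olympics": 0.6, "rio": 0.4}, existing)
novel = is_new_topic({"election": 1.0}, existing)
```

  • With no existing topics at all, `is_new_topic` returns True, so the first document processed always founds the first topic.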
  • the document is marked and classified according to the belonging topic.
  • the name of the topic needs to be re-extracted.
  • The specific determination manner is: extract the keywords corresponding to the topic and the titles of all documents belonging to the topic, and calculate the weight of each title according to the keywords and the frequency or distribution of the title in these documents;
  • the title with the highest weight is determined as the name of the topic. That is, the weight of a title is obtained as the sum of the weights of all keywords in the title, multiplied by the number of mentions of the title, divided by the number of keywords mentioned.
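  • One possible reading of this weighting is sketched below. The wording of the formula is ambiguous, so this is an interpretation rather than the patent's definition: "the number of keywords mentioned" is taken to mean the number of topic keywords that occur in the title, titles are split on whitespace, and the names `title_weight` and `topic_name` are hypothetical.

```python
def title_weight(title_words, keyword_weights, mentions):
    """Sum of matched keyword weights * mentions of the title / number of matched keywords."""
    hits = [keyword_weights[w] for w in title_words if w in keyword_weights]
    if not hits:
        return 0.0
    return sum(hits) * mentions / len(hits)

def topic_name(titles, keyword_weights):
    """Pick the highest-weighted title; `titles` maps each title to its mention count."""
    return max(
        titles,
        key=lambda t: title_weight(t.lower().split(), keyword_weights, titles[t]),
        default=None,
    )

keyword_weights = {"olympics": 0.5, "rio": 0.3, "medal": 0.2}
titles = {"Rio Olympics opens": 3, "Medal table update": 5}
name = topic_name(titles, keyword_weights)
```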
  • After a new topic is generated, the evolution result of the topic is updated: using the keywords of the new topic, the documents in the existing topic classification from which the new topic evolved are recalculated to determine whether any of them belong to the new topic.
  • The name of the new topic can be determined in the manner described above: when the topic contains only the current document, the title of the current document is taken as its name; when a further document is assigned to the new topic, the name is determined again by the above process.
  • For determining the topic name, a keyword-based phrase mining method, a document content summarization method based on the Rank idea, and the like may also be used.
  • the specific manner of the name extraction is not limited in the embodiment of the present invention.
  • This step is a further application of the results achieved by the above steps. Through the above steps, topic detection and classification are performed on a large number of documents, and new topics in the documents are recognized so as to extend the topic classification of the documents.
  • Based on the topic labels of the documents, the embodiment of the present invention can also produce a heat ranking of all current topics. For this purpose, a simple heat model is provided in this embodiment to calculate the heat ranking of each topic.
  • The heat model takes into account the number of mentions of the topic, the duration of the topic, and the novelty of the topic to determine the final heat, which is output per time point.
  • the heat calculation method is as follows:
  • Number of mentions: the number of times the topic appears within a given period. In general, a topic mentioned more frequently in the recent period has a higher heat. For example, after a topic appears in the corpus, a large number of related reports may appear across the Internet; such topics, for example "G20" and "Rio Olympics", receive many mentions shortly after they emerge and should have a high heat.
  • Novelty: a new topic that has only just appeared may not yet have many mentions at the current time, but it may be on its way to becoming a hot topic.
  • To avoid missing such topics, and thus losing information, novelty is introduced as a heat evaluation parameter.
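  • The heat formula itself is missing from this text. The sketch below is therefore not the patent's model; it only shows one way the three named factors (number of mentions, duration, novelty) could be combined into a single score. The weights, the logarithmic damping, the half-life decay used for novelty, and the names `heat` and `rank_topics` are all hypothetical choices.

```python
import math

def heat(mentions, duration_days, age_days,
         w_mentions=1.0, w_duration=0.5, w_novelty=2.0, half_life_days=3.0):
    """Combine mentions, duration and novelty into one score (hypothetical weights)."""
    # Novelty is 1.0 for a brand-new topic and halves every `half_life_days`.
    novelty = 0.5 ** (age_days / half_life_days)
    return (w_mentions * math.log1p(mentions)
            + w_duration * math.log1p(duration_days)
            + w_novelty * novelty)

def rank_topics(stats):
    """stats maps name -> (mentions, duration_days, age_days); hottest topic first."""
    return sorted(stats, key=lambda name: heat(*stats[name]), reverse=True)

stats = {
    "G20": (500, 10, 10),
    "Rio Olympics": (800, 15, 15),
    "Local fair": (5, 1, 30),
}
ranking = rank_topics(stats)
```

  • The novelty term lets a topic that has just appeared outrank an older topic with the same number of mentions, which is the effect the text asks of the novelty parameter.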
  • the influence on the calculation mode in the theme model is more prominent.
  • The calculation results for each document and the data information that affects the topic model, such as the name of the topic and the correspondence between topics and keywords, are recorded. The recorded history can be used to trace the topic and to analyze its evolution.
  • an embodiment of the present invention provides a real-time detection device for a new theme, and the device embodiment corresponds to the foregoing method embodiment.
  • For ease of reading, the device embodiment does not repeat the details of the foregoing method embodiment one by one, but it should be clear that the device in this embodiment can implement all the contents of the foregoing method embodiment.
  • the device is used for extracting the topic in the document in real time and tracking the evolution process of the recorded topic. As shown in FIG. 3, the device includes:
  • An obtaining unit 31 configured to acquire a document of the vectorized representation in real time according to the specified domain
  • the calculating unit 32 is configured to calculate the topic of the document according to the distribution of the keywords in the document acquired by the obtaining unit 31
  • the determining unit 33 is configured to determine whether the theme of the document obtained by the calculating unit 32 can be attributed to an existing topic classification
  • the creating unit 34 is configured to create a new topic when the determining unit 33 determines that the topic of the document cannot be attributed to the existing topic classification, and divide the document into the classification of the new topic.
  • the determining unit 33 includes:
  • a calculation module 331, configured to calculate, by using cosine similarity, a similarity value between a theme of the document and an existing topic;
  • the determining module 332 is configured to determine, according to the set similarity threshold, a similarity value obtained by the calculating module 331, and when the similarity value is less than the threshold, determine that the subject of the document is a new topic.
  • the calculating unit 32 is further configured to calculate the topic of the document by using a topic model, where the topic model is a dynamic model for calculating the topic of the document, obtained by adding variational Bayesian inference to an LDA topic model, and is capable of dynamically adjusting for the change in topic probability caused by changes in the keyword distribution across the documents already processed.
  • the creating unit 34 includes:
  • a statistics module 341, configured to collect a document that belongs to the new topic
  • the determining module 342 is configured to determine a name of the new topic according to a title name of the document obtained by the statistics module 341 and a distribution of the keyword.
  • the obtaining unit 31 includes:
  • the obtaining module 311 is configured to obtain a multi-source document according to the specified domain
  • a vectorization module 312, configured to use a word frequency to obtain a vectorized representation of the document acquired by the obtaining module 311;
  • the screening module 313 is configured to filter the keywords in the document represented by the vectorization module 312 by using the TF-IDF model.
  • the device further includes:
  • a sorting unit 36 configured to hot-sort the existing topics and the topics created by the creating unit 34 according to the reference quantity, duration, and novelty of the theme
  • the recording unit 37 is configured to record the data information of the documents calculated by the calculating unit 32, where the data information includes the name of a new topic and the correspondence between topics and keywords.
  • the device further includes:
  • the adding unit 35 is configured to add the new topic to the existing topic classification after the creating unit 34 creates a new topic, so as to extend the existing topic classification and perform attribution determination on the topics of subsequent documents.
  • The real-time detection method and apparatus for a new topic adopted by the embodiments of the present invention can process same-domain document data acquired from the Internet in real time: after a document is vectorized, its topic is calculated according to the distribution of the keywords in the document, and its topic attribution is determined. When the topic of the document does not belong to the existing topic classification, a new topic is created for the document.
  • The embodiment of the present invention thus performs topic detection incrementally and uses each new topic in the analysis of subsequent documents, thereby improving the accuracy of the topic classification of documents.
  • The topic detection method adopted by the embodiment of the present invention can dynamically adjust the distribution probability of the keywords within a topic as the number of detected documents increases, which makes it possible to follow both the evolution of a single topic and the process by which new topics evolve out of it.
  • The real-time detecting device for a new topic includes a processor and a memory. The obtaining unit, the calculating unit, the determining unit, the creating unit and the like are all stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
  • The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory.
  • One or more kernels may be provided. By adjusting the kernel parameters, newly appearing topics in text are detected in real time and used as options for classifying subsequent text, which improves the accuracy of text classification.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory (flash RAM), the memory including at least one Memory chip.
  • The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: acquiring a vectorized representation of the document in real time according to a specified domain; calculating the topic of the document according to the distribution of the keywords in the document; determining whether the topic of the document can be attributed to an existing topic classification; and, if not, creating a new topic and assigning the document to the classification of the new topic.
  • Embodiments of the present application can be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • The computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising an instruction device, which implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer readable media includes both permanent and non-persistent, removable and non-removable media.
  • Information storage can be implemented by any method or technology.
  • the information can be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • Computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

Abstract

A method and device for real-time detection of a new subject, relating to the technical field of the Internet, mainly aimed at detecting in real time a subject that newly appears in a text and using that subject as a candidate category for subsequent text classification, so as to improve the accuracy of text classification. The method comprises: obtaining a vectorized document according to a designated field in real time (101); calculating a subject of the document according to the distribution of subject terms in the document (102); determining whether the subject of the document belongs to an existing subject classification (103); and if not, creating a new subject and placing the document in the classification of the new subject (104). The method and device are mainly used for online real-time detection of a new subject in a document.

Description

Real-time detection method and device for a new topic
The present application is based on Chinese Patent Application No. 201610980540.X, filed on November 8, 2016, and claims priority to that Chinese patent application, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of Internet technologies, and in particular to a real-time detection method and device for a new topic.
Background
Topic Detection and Tracking is a highly practical technology in the fields of natural language processing and information retrieval, intended to discover and process the topics that appear in text. It is also a practical technique for effectively discovering and extracting useful information in the context of big data. Typically, hot-topic discovery and tracking targets a particular domain or a particular event: it discovers a topic and then tracks that topic's subsequent development.
At present, the main steps of hot-topic detection techniques in China and abroad are: text acquisition, in which news reports are collected from the Internet; vectorized representation of the collected documents; document clustering, in which the vectorized documents are clustered and each topic is represented by its high-frequency words or by the document at the cluster center; and finally ranking the topics found within a given period by popularity and outputting them in that order. However, this approach is only suitable for discovering and tracking specified topics; it cannot detect newly emerging topics in real time, nor can it effectively express how the same topic evolves over different periods.
Summary of the Invention
In view of this, the present invention provides a real-time detection method and device for a new topic, whose main purpose is to detect newly appearing topics in text in real time and to use each such topic as a candidate category for subsequent text classification, thereby improving the accuracy of text classification.
To achieve the above purpose, the present invention mainly provides the following technical solutions:
In one aspect, the present invention provides a real-time detection method for a new topic, the method comprising:
acquiring a vectorized document in real time according to a specified domain;
calculating the topic of the document according to the distribution of keywords in the document;
determining whether the topic of the document can be assigned to an existing topic category;
if not, creating a new topic and placing the document in the category of the new topic.
Preferably, determining whether the topic of the document can be assigned to an existing topic category comprises:
calculating a similarity value between the topic of the document and an existing topic by using cosine similarity;
evaluating the similarity value against a set similarity threshold and, when the similarity value is less than the threshold, determining that the topic of the document is a new topic.
Preferably, calculating the topic of the document according to the distribution of keywords in the document comprises:
creating a topic model, the topic model being a dynamic model for calculating the topic of the document that is obtained by adding variational Bayesian inference to the LDA topic model, the topic model being able to dynamically adjust, according to the documents it has processed, the changes in topic probability caused by changes in the distribution of keywords;
inputting the document into the topic model to calculate the topic of the document.
Preferably, creating a new topic and placing the document in the category of the new topic comprises:
collecting the documents that belong to the new topic;
determining the name of the new topic according to the titles of those documents and the distribution of their keywords.
Preferably, acquiring a vectorized document in real time according to a specified domain comprises:
acquiring multi-source documents according to the specified domain;
vectorizing the documents by word frequency;
filtering out the keywords in the documents by using the TF-IDF model.
Preferably, the method further comprises:
ranking the existing topics by popularity according to each topic's mention count, duration, and novelty;
recording data information from processing the documents, the data information including the name of the new topic and the correspondence between topics and keywords.
Preferably, after the new topic is created, the method further comprises:
adding the new topic to the existing topic categories, so as to extend the existing topic categories and use them when determining the topic assignment of subsequent documents.
In another aspect, the present invention further provides a real-time detection device for a new topic, the device comprising:
an acquiring unit, configured to acquire a vectorized document in real time according to a specified domain;
a calculating unit, configured to calculate the topic of the document according to the distribution of keywords in the document acquired by the acquiring unit;
a determining unit, configured to determine whether the topic of the document obtained by the calculating unit can be assigned to an existing topic category;
a creating unit, configured to, when the determining unit determines that the topic of the document cannot be assigned to an existing topic category, create a new topic and place the document in the category of the new topic.
Preferably, the determining unit comprises:
a calculation module, configured to calculate a similarity value between the topic of the document and an existing topic by using cosine similarity;
a determination module, configured to evaluate the similarity value obtained by the calculation module against a set similarity threshold and, when the similarity value is less than the threshold, determine that the topic of the document is a new topic.
Preferably, the calculating unit is further configured to calculate the topic of the document by using a topic model, the topic model being a dynamic model for calculating the topic of the document that is obtained by adding variational Bayesian inference to the LDA topic model, the topic model being able to dynamically adjust, according to the documents it has processed, the changes in topic probability caused by changes in the distribution of keywords.
Preferably, the creating unit comprises:
a statistics module, configured to collect the documents that belong to the new topic;
a determining module, configured to determine the name of the new topic according to the titles of the documents collected by the statistics module and the distribution of their keywords.
Preferably, the acquiring unit comprises:
an acquiring module, configured to acquire multi-source documents according to the specified domain;
a vectorization module, configured to vectorize, by word frequency, the documents acquired by the acquiring module;
a filtering module, configured to filter out, by using the TF-IDF model, the keywords in the documents vectorized by the vectorization module.
Preferably, the device further comprises:
a ranking unit, configured to rank the existing topics by popularity according to each topic's mention count, duration, and novelty;
a recording unit, configured to record data information from processing the documents, the data information including the name of the new topic and the correspondence between topics and keywords.
Preferably, the device further comprises:
an adding unit, configured to, after the creating unit creates the new topic, add the new topic to the existing topic categories, so as to extend the existing topic categories and use them when determining the topic assignment of subsequent documents.
With the real-time detection method and device for a new topic proposed by the present invention, document data of the same domain acquired from the Internet can be processed in real time. After a document is vectorized, its topic is calculated from the distribution of keywords in the document and the topic's assignment is determined; when the topic does not belong to any existing topic category, the document is assigned to a new topic. Compared with existing approaches to topic detection and assignment, the present invention handles topics incrementally during detection and uses each new topic in the analysis of subsequent documents, which improves the accuracy with which documents are classified by topic. Moreover, because the appearance of a new topic affects the distribution of keywords, and therefore the topic analysis of subsequent documents, the topic detection method adopted by the present invention dynamically adjusts the probability distribution of keywords within each topic as the number of processed documents grows. This yields both the evolution of a single topic and the process by which new topics evolve out of it.
Brief Description of the Drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
FIG. 1 shows a flowchart of a real-time detection method for a new topic according to an embodiment of the present invention;
FIG. 2 shows a flowchart of another real-time detection method for a new topic according to an embodiment of the present invention;
FIG. 3 shows a block diagram of a real-time detection device for a new topic according to an embodiment of the present invention;
FIG. 4 shows a block diagram of another real-time detection device for a new topic according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present invention, it should be understood that the invention may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the invention can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
An embodiment of the present invention provides a real-time detection method for a new topic. As shown in FIG. 1, the method is applied to real-time topic detection on online documents: it discovers new topics and automatically assigns documents to the new topic's category. To this end, the embodiment provides the following specific steps:
101. Acquire a vectorized document in real time according to the specified domain.
Because the content of online documents is vast and varied, topic detection and assignment cannot be computed across all domains. The embodiment therefore performs topic detection and assignment on documents in one specified domain. The acquired documents are thus corpus material collected for that domain, and they may come from multiple sources, such as forums, news portals, and WeChat official accounts. Documents on the same domain obtained from different sources approach it from different angles, so their topics differ more widely and the likelihood of a new topic appearing is correspondingly greater.
After a document is acquired, it must also be vectorized. At present the most common approach treats words as the basic features and vectorizes the document by word frequency, that is, it uses the TF-IDF (term frequency-inverse document frequency) model. The embodiment does not limit the specific vectorization method; the purpose of vectorizing is to facilitate the subsequent analysis of the document.
102. Calculate the topic of the document according to the distribution of keywords in the document.
From the vectorized document, the keywords of the document are filtered out. A keyword is a word that expresses the core idea of the document. In most cases such words occur frequently in the document, so a high-frequency word is very likely to be a keyword. However, high-frequency words without real meaning, such as "的" ("of") and "是" ("is"), must be excluded. The frequency of these words is often higher than that of the keywords, so in practice they can be avoided by setting a filtering interval, for example by keeping words whose inverse document frequency (IDF) value lies between 0.5 and 0.8.
After the keywords of the document are determined, the topic of the document is determined from the distribution of the keywords in the document. At present the mainstream way to extract document topics is the LDA (Latent Dirichlet Allocation) topic model. In this model a topic represents a concept or an aspect, or, in this embodiment, a subject, and it can be described by a series of related words. The word distribution of a topic is the probability with which these words occur; in other words, from the distribution of keywords in a document, the probability that the document belongs to each topic can be obtained. In this embodiment, the topics contained in a document are extracted by a created topic model whose main function is to determine the document's topic from the distribution of its keywords. During this determination the topic model automatically updates the correspondence between keyword distributions and topics according to the documents it has processed; that is, it applies the results from documents already processed to the analysis of subsequent documents, adjusting the probability of the corresponding topic according to the keyword distribution in the earlier documents.
It can be seen that the topic model in this embodiment is a dynamic model whose computation refers to the results of previously processed documents. Therefore, by recording the changes in the keyword distribution of a topic, the evolution of that topic across the acquired documents can be obtained. The embodiment does not limit the specific algorithm used by a topic model that satisfies the above functions.
103. Determine whether the topic of the document can be assigned to an existing topic category.
After the topic of the document is determined by the topic model, it is determined whether that topic is the same as an existing topic. If it is, the document is assigned to the existing topic category, and the topic's name is re-derived from the documents in that topic. If it differs from the existing topics, step 104 is performed and the topic is determined to be a new topic. One specific way of making this determination in the embodiment is to perform cluster analysis on the document and the documents in the existing topic categories, that is, to compare the keywords corresponding to the document's topic with those of each existing topic in order to judge the similarity between the two topics; when the similarity is below a certain value, it can be determined that a new topic has appeared.
104. Create a new topic and place the document in the category of the new topic.
Following the determination in step 103, when the topic of the document cannot be assigned to an existing topic category, a new topic is created on the basis of that document; the name of the new topic can be extracted from the document's keywords.
Once a new topic has been created, it is added to the existing topic categories and is used to decide whether subsequent documents can be assigned to it. Whenever a new document is assigned to the new topic's category, the name of the new topic is also re-extracted. The above steps are executed in a loop: for every document processed, the topic is detected and classified and the topic name is updated. This achieves the discovery and extraction of new topics while also keeping a tracking record of changes in existing topics.
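The loop over steps 101 to 104 can be sketched as follows. This is a minimal illustration under stated assumptions, not the filing's implementation: the class name `TopicRegistry`, its attributes, and the rule that a topic's name is its most frequent keywords are all hypothetical, and the similarity measure is passed in as a parameter because the choice of measure is described separately.

```python
from collections import Counter


class TopicRegistry:
    """Hypothetical container for the growing set of topic categories."""

    def __init__(self, similarity, threshold):
        self.similarity = similarity  # callable(doc_keywords, topic_keywords) -> float
        self.threshold = threshold    # empirical value, chosen by the operator
        self.keywords = []            # one Counter of keyword counts per topic
        self.documents = []           # ids of the documents assigned to each topic

    def name(self, topic_id, top_n=2):
        """Re-derive the topic name from its currently most frequent keywords."""
        top = self.keywords[topic_id].most_common(top_n)
        return "/".join(word for word, _ in top)

    def process(self, doc_id, doc_keywords):
        """Assign one document; create a new topic when nothing is similar enough."""
        scores = [self.similarity(doc_keywords, kw) for kw in self.keywords]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < self.threshold:
            # Step 104: no existing category fits, so open a new one.
            self.keywords.append(Counter())
            self.documents.append([])
            best = len(self.keywords) - 1
        # Updating the keyword counts is what lets the topic (and its name) drift.
        self.keywords[best].update(doc_keywords)
        self.documents[best].append(doc_id)
        return best
```

Because every assignment updates the stored keyword counts, calling `name()` after each document gives a simple record of how a topic's label changes over time.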
From the above implementation it can be seen that, when processing online documents, the real-time detection method for a new topic adopted by the embodiment can be realized as a document processing framework. The framework labels each processed document with the topic it belongs to, and it can also detect and add new topics. As the number of processed documents grows, the number of topics held by the framework grows as well, and the topic names are continually updated and evolve as documents are added. In use, the framework can switch between documents of different domains by saving its processing information, and for an entirely new domain only the model information in the framework needs to be initialized. Compared with existing approaches to topic detection and assignment, the processing framework adopted by the embodiment performs online real-time detection and makes fuller use of the system's processing resources. The framework also processes topics incrementally online: it adds new topics in real time and applies them to the processing of new documents, which amounts to dynamically adjusting the processing and makes the topic classification of documents more accurate. In addition, by recording changes in the keyword distribution of each topic and changes in topic names, the evolution of a topic can be tracked effectively.
To explain the proposed real-time detection method for a new topic in more detail, in particular the specific implementation of each step and the handling of the connections between steps, the method, as shown in FIG. 2, comprises the following steps:
201. Acquire a vectorized document in real time according to the specified domain.
In this step, a large number of documents in the specified domain are first acquired in real time from multiple information sources, and each document is vectorized by word frequency; that is, the TF-IDF model is used to calculate the term frequency (TF) and inverse document frequency (IDF) of the words in the document. Using the TF-IDF model presupposes word segmentation of the document. After segmentation, high-frequency words that are obviously not keywords, such as "的" ("of") and "是" ("is"), are filtered out, and only then is the inverse document frequency of each word in the document calculated with the TF-IDF model. In this embodiment, words whose IDF value is less than 0.8 are determined to be the keywords of the document. The TF-IDF model is applied in this embodiment as follows:
[Formula image PCTCN2017109840-appb-000001: term frequency (TF)]
Here, the numerator is the number of times word j occurs in document i, and the denominator is the total number of occurrences of all words in that document;
[Formula image PCTCN2017109840-appb-000002: inverse document frequency (IDF)]
Here, the numerator is the total number of documents in the corpus, and the denominator is the number of documents in which word i appears.
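The two formula images are not reproduced in this text. Assuming they show the standard TF-IDF definitions, which agree with the numerator and denominator descriptions given for each formula, they would read roughly as below. The symbols are assumptions chosen for this sketch: w is a word, d a document, D the corpus, and n_{w,d} the number of occurrences of w in d. The logarithm is also an assumption; the description names only the ratio, although a logarithm is needed for IDF values to fall in a range such as 0.5 to 0.8.

```latex
\mathrm{TF}(w, d) = \frac{n_{w,d}}{\sum_{w'} n_{w',d}},
\qquad
\mathrm{IDF}(w) = \log \frac{|D|}{\left|\{\, d \in D : w \in d \,\}\right|}
```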
Using the TF-IDF calculation above, the acquired vectorized document can be processed to determine the keywords in the document.
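As an illustration of the keyword selection in step 201, the sketch below computes TF and IDF for tokenized documents and keeps the words whose IDF is below 0.8. It is a minimal sketch under assumptions: the function names are hypothetical, the base-10 logarithm is assumed because the text does not state one, and the input is expected to be already segmented into words.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """TF of each word: its count divided by the total word count of the document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def inverse_document_frequencies(corpus):
    """IDF of each word: log10(total documents / documents containing the word).

    The base-10 logarithm is an assumption; the text only names the ratio.
    """
    n_docs = len(corpus)
    doc_freq = Counter(word for tokens in corpus for word in set(tokens))
    return {word: math.log10(n_docs / df) for word, df in doc_freq.items()}


def select_keywords(tokens, idf, stop_words=frozenset(), max_idf=0.8):
    """Drop stop words first, then keep words whose IDF is below max_idf.

    Returns a dict mapping each selected keyword to its TF weight.
    Words missing from the IDF table are treated as not selectable.
    """
    tf = term_frequencies([w for w in tokens if w not in stop_words])
    return {w: weight for w, weight in tf.items()
            if idf.get(w, float("inf")) < max_idf}
```

With a base-10 logarithm, the 0.8 cut-off keeps words that appear in more than about 16% of the corpus documents, which is why obvious function words have to be removed by the stop-word list rather than by the IDF test.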
202. Calculate the topic of the document according to the distribution of keywords in the document.
In this step, topic extraction and classification can be implemented with a custom topic model. The topic model is a dynamic model for calculating document topics, and it dynamically adjusts, according to the documents it has processed, the changes in topic probability caused by changes in the keyword distribution. In this embodiment the topic model is an LDA dynamic topic model based on variational Bayesian inference; that is, variational Bayesian inference is introduced on top of the LDA topic model in order to calculate the effect of each document on topic evolution. The purpose of variational Bayesian inference is to find the approximate distribution closest to the true model's posterior distribution over the latent variables and to use this approximation in place of that posterior. In this embodiment, variational Bayesian inference is introduced to help the LDA topic model find, for each document, the topic distribution closest to that of the preceding document. Because the LDA topic model is widely used in existing topic recognition technology, its principles and calculation are not described in detail here. With variational Bayesian inference added, the LDA calculation proceeds as follows:
1. For each topic index $k \in \{1, \dots, K\}$:
   draw topic distribution $\beta_k \sim \mathrm{Dir}(\eta_k)$
2. For each document $d \in \{1, \dots, M\}$:
   (a) Draw the document's topic distribution $\theta_d \sim \mathrm{Dir}(\alpha)$
   (b) For each word $n \in \{1, \dots, N_d\}$:
       I. Choose topic assignment $z_{d,n} \sim \mathrm{Mult}(\theta_d)$
       II. Choose word $w_{d,n} \sim \mathrm{Mult}(\beta_{z_{d,n}})$
Here Mult() is a multinomial distribution and Dir() is a Dirichlet distribution.
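The generative process listed above can be simulated directly, which may help in reading the notation. The sketch below only samples from the model; it does not perform the variational Bayesian inference that the embodiment uses to fit it. The function names, the use of symmetric scalar priors `alpha` and `eta` (the listing allows a separate η_k per topic), and their default values are assumptions made for illustration.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from Dir(alphas) by normalising independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]


def sample_categorical(probs, rng):
    """Draw one index from a single-trial multinomial over the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_topics, vocab_size, doc_lengths, alpha=0.5, eta=0.5, seed=0):
    """Sample word distributions and documents following steps 1 and 2 of the listing."""
    rng = random.Random(seed)
    # Step 1: one word distribution beta_k ~ Dir(eta) per topic.
    beta = [sample_dirichlet([eta] * vocab_size, rng) for _ in range(num_topics)]
    docs = []
    for length in doc_lengths:
        # Step 2(a): topic distribution theta_d ~ Dir(alpha) for this document.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(length):
            z = sample_categorical(theta, rng)              # step 2(b) I
            words.append(sample_categorical(beta[z], rng))  # step 2(b) II
        docs.append(words)
    return beta, docs
```

Words are returned as integer vocabulary indices; mapping them to actual keywords is left to the caller.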
The dynamic LDA topic model based on variational Bayesian inference is used to compute the topic words in the acquired vectorized document, and the probability of each corresponding topic is obtained from the distribution of those topic words. Because the model incorporates variational Bayesian inference, it can also trace how different topics evolve as it processes a large number of documents.
203. Determine whether the topic of the document can be assigned to an existing topic category.
As the topic model of step 202 processes more documents, the distribution of topic words associated with a topic changes, and this change can eventually cause a topic to evolve into a new topic. In this embodiment, each document is assumed to belong to exactly one topic, so the topic with the highest probability in the document is taken as the document's topic. As documents continue to be processed, the topic probabilities keep changing, and once the change reaches a certain threshold the topic-word distribution gives rise to a new topic. Specifically, cosine similarity can be used to compute the similarity between the topic words of the current document's topic and all topic words of the documents already assigned to that topic. A preset threshold then determines whether the current document's topic matches the existing topic: when the similarity value is below the threshold, the current document's topic is determined to be a new topic. The threshold is an empirical value and can be configured.
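The threshold test described in this step can be illustrated with a small sketch. It assumes that a document's topic and each existing topic are represented as sparse term-weight dictionaries; the function names and the threshold value 0.3 are illustrative assumptions, not values given in the source.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def assign_topic(doc_terms, known_topics, threshold=0.3):
    """Return (topic_name, similarity); topic_name is None when the document
    is not similar enough to any existing topic and so starts a new one."""
    best_name, best_sim = None, 0.0
    for name, topic_terms in known_topics.items():
        sim = cosine_similarity(doc_terms, topic_terms)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim < threshold:
        return None, best_sim  # below the threshold: treat as a new topic
    return best_name, best_sim
```

Because the threshold is empirical, a low value merges more documents into existing topics, while a high value creates new topics more readily.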
204. Create a new topic and add it to the existing topic categories.
Following the determination in step 203, each document is labeled and classified according to the topic it belongs to. When a document is assigned to an existing topic category, the topic now contains a new document, so the topic's name must be re-extracted. This is done as follows: extract the topic words of the topic and the titles of all documents assigned to it, compute a weight for each title from the frequency or distribution of the topic words and the title across those documents, and take the title with the largest weight as the topic name. That is, the weight of a title is the sum of the topic weights of all words in the title, multiplied by the number of mentions of the title and divided by the number of topic words mentioned.
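The title-weight rule can be sketched as follows. The source does not say exactly how "the number of topic words mentioned" is counted; this sketch assumes it is the number of distinct topic words that occur in the title. The function names and data layout are hypothetical.

```python
def title_weight(title_tokens, topic_word_weights, mention_count):
    """Weight = (sum of topic weights of the title's words) * mentions
    / (number of distinct topic words in the title, an assumed reading)."""
    hits = [t for t in title_tokens if t in topic_word_weights]
    if not hits:
        return 0.0  # a title with no topic words cannot name the topic
    weight_sum = sum(topic_word_weights[t] for t in hits)
    return weight_sum * mention_count / len(set(hits))

def pick_topic_name(titles, topic_word_weights):
    """titles maps each title string to (tokens, mention_count).
    Returns the title with the largest weight."""
    return max(
        titles,
        key=lambda t: title_weight(titles[t][0], topic_word_weights, titles[t][1]),
    )
```

For instance, with topic weights `{"rio": 0.5, "olympics": 0.3, "medal": 0.2}`, a title containing "rio" and "olympics" that is mentioned four times scores 0.8 * 4 / 2 = 1.6.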
For a new topic, once it has been created, its topic-word distribution is fed back into the topic model created in step 202 to update the topics' evolution results. Therefore, after a new topic emerges, the documents in the existing topic category from which it evolved are recomputed against the new topic's topic words to determine whether any of them belong to the new topic. The name of the new topic is determined in the manner described above: while the current document is the only one in the topic, its title is used as the name of the new topic; once further documents are assigned to the new topic, the name is determined by the procedure above.
In addition, the topic name may be determined by other means, such as keyword-based phrase mining or document summarization based on the Rank approach. The embodiments of the present invention do not limit the specific manner of name extraction.
205. Rank the existing topics by heat according to their mention volume, duration, and novelty.
This step is a further application of the results of the preceding steps. Those steps perform topic detection and classification on a large number of documents and identify new topics in the documents, thereby extending the set of topic categories. Building on the growing set of topics and on the topic labels attached to the documents, the embodiments of the present invention can also rank all current topics by heat. To this end, this embodiment provides a simple heat model for computing the ranking. The heat model takes into account a topic's mention volume, duration, and novelty to determine the final heat, and outputs results by time point. Heat is computed as follows:
Heat = a * continuity + b * mention volume + c * novelty + d * other factors
Continuity is intended to surface topics that keep appearing over a period of time. Such topics appear at a steady rate over a long period; their number of occurrences is often not very high and may be lower than the mention volume of recently emerged topics, but because they persist for a long time they are included in the heat calculation.
Mention volume is the number of times a topic appears within a given time period. In general, a topic that has appeared more frequently in the recent past has higher heat. For example, once a topic arises in the corpus, a large number of related reports appear across the Internet, and such a topic should have high heat. Topics such as "G20" and "Rio Olympics" had very high mention volume for a period shortly after they emerged.
Novelty addresses newly emerged topics. Because such a topic has only just appeared, it may not yet have accumulated a large mention volume at the current time, but it may be trending toward becoming a hot topic. Novelty is introduced as a heat parameter so that such topics are not overlooked and information is not lost.
Other factors cover effects such as an existing topic becoming less popular as time passes. A simple way to handle this case is Newton's law of cooling, which relates a topic's heat to elapsed time and thereby models the trend of its heat.
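The weighted heat formula and the cooling idea can be sketched together. The source does not specify the coefficients a, b, c, d, the cooling rate, or how each factor is normalised, so the defaults below are placeholders chosen only for illustration, and the function names are hypothetical.

```python
import math

def newton_cooling(previous_heat, hours_elapsed, cooling_rate=0.05):
    """Exponential decay of heat over time: H(t) = H0 * exp(-rate * t)."""
    return previous_heat * math.exp(-cooling_rate * hours_elapsed)

def heat_score(continuity, mentions, novelty, other=0.0, a=0.2, b=0.5, c=0.2, d=0.1):
    """Heat = a*continuity + b*mention volume + c*novelty + d*other factors.
    Inputs are assumed to be pre-normalised to comparable scales."""
    return a * continuity + b * mentions + c * novelty + d * other

def rank_topics(topics):
    """topics maps a topic name to its factor values (keyword arguments of
    heat_score). Returns the topic names sorted from hottest to coolest."""
    return sorted(topics, key=lambda name: heat_score(**topics[name]), reverse=True)
```

One way to use the cooling term is to pass a decayed earlier heat value as `other`, so that a topic that was hot yesterday fades gradually rather than dropping out of the ranking at once.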
Further, during real-time topic detection in this embodiment, the topic model adjusts itself automatically according to its output, and the effect on the model's computation is especially pronounced after a new topic is created. Accordingly, for every document processed, the computation result for the document is recorded together with the data that influenced the topic model, such as topic names and the correspondence between topics and topic words. These data can be used to analyze how topics evolve, providing a tracked record of the evolution process.
Further, as an implementation of the above method, an embodiment of the present invention provides an apparatus for real-time detection of new topics. The apparatus embodiment corresponds to the foregoing method embodiment. For ease of reading, the details of the method embodiment are not repeated one by one here, but it should be clear that the apparatus of this embodiment can implement everything in the foregoing method embodiment. The apparatus extracts topics from documents online in real time and tracks and records the evolution of topics. As shown in FIG. 3, the apparatus includes:
an acquisition unit 31, configured to acquire a vectorized document in real time according to a specified domain;
a computing unit 32, configured to compute the topic of the document from the distribution of topic words in the document acquired by the acquisition unit 31;
a judging unit 33, configured to determine whether the topic of the document obtained by the computing unit 32 can be assigned to an existing topic category;
a creating unit 34, configured to create a new topic when the judging unit 33 determines that the topic of the document cannot be assigned to an existing topic category, and to place the document in the category of the new topic.
Further, as shown in FIG. 4, the judging unit 33 includes:
a calculation module 331, configured to compute, using cosine similarity, a similarity value between the topic of the document and an existing topic;
a judging module 332, configured to evaluate the similarity value obtained by the calculation module 331 against a preset similarity threshold, and to determine that the topic of the document is a new topic when the similarity value is below the threshold.
Further, as shown in FIG. 4, the computing unit 32 is further configured to compute the topic of the document using a topic model. The topic model is a dynamic model for computing document topics, obtained by adding variational Bayesian inference to the LDA topic model; it can dynamically adjust the topic probabilities as the distribution of topic words changes across the documents processed.
Further, as shown in FIG. 4, the creating unit 34 includes:
a statistics module 341, configured to collect the documents assigned to the new topic;
a determining module 342, configured to determine the name of the new topic from the titles of the documents collected by the statistics module 341 and the distribution of topic words.
Further, as shown in FIG. 4, the acquisition unit 31 includes:
an acquisition module 311, configured to acquire multi-source documents according to the specified domain;
a vectorization module 312, configured to produce a term-frequency vector representation of the documents acquired by the acquisition module 311;
a filtering module 313, configured to filter the topic words in the documents vectorized by the vectorization module 312 using a TF-IDF model.
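The vectorization and filtering modules can be illustrated with a short sketch. It assumes documents are already tokenized, uses the plain IDF form log(N / df), and keeps only terms with a positive score; the source does not specify the TF-IDF variant or any cut-off, so these choices and the function names are assumptions.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """Relative term frequency of each token in one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()} if total else {}

def tfidf_keywords(docs, top_n=3):
    """docs is a list of token lists. For each document, return up to top_n
    terms ranked by TF-IDF, dropping terms whose score is zero."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    result = []
    for tokens in docs:
        tf = term_frequencies(tokens)
        scores = {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}
        # Sort by descending score, breaking ties alphabetically for stable output.
        ranked = sorted((t for t in scores if scores[t] > 0), key=lambda t: (-scores[t], t))
        result.append(ranked[:top_n])
    return result
```

A term that occurs in every document, such as a stop word, gets an IDF of zero and is filtered out, which is the effect the filtering module relies on to isolate topic words.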
Further, as shown in FIG. 4, the apparatus further includes:
a ranking unit 36, configured to rank by heat the existing topics and the topics created by the creating unit 34 according to their mention volume, duration, and novelty;
a recording unit 37, configured to record the data produced when the computing unit 32 processes the document, the data including the name of the new topic and the correspondence between topics and topic words.
Further, as shown in FIG. 4, the apparatus further includes:
an adding unit 35, configured to add the new topic to the existing topic categories after the creating unit 34 creates it, so as to extend the existing topic categories and use them when assigning the topics of subsequent documents.
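How the judging, creating, adding, and recording units cooperate can be sketched as one small class. This is an illustrative arrangement under simplifying assumptions, not the apparatus itself: each incoming document is assumed to arrive as a title plus a dictionary of topic-word weights (standing in for the output of the acquisition and computing units), and the class name, method names, and threshold are hypothetical.

```python
import math

class NewTopicDetector:
    """Toy pipeline: judge each document against known topics, create and
    add a new topic when none is similar enough, and log every decision."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.topics = {}  # topic name -> accumulated topic-word weights
        self.log = []     # recording unit: (title, topic name, is_new)

    @staticmethod
    def _cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def process(self, title, keywords):
        # Judging unit: find the most similar existing topic.
        best, best_sim = None, 0.0
        for name, terms in self.topics.items():
            sim = self._cosine(keywords, terms)
            if sim > best_sim:
                best, best_sim = name, sim
        is_new = best is None or best_sim < self.threshold
        if is_new:
            # Creating and adding units: a one-document topic is named after its title.
            best = title
            self.topics.setdefault(best, {})
        # Fold the document's topic words into the topic so later documents see them.
        topic_terms = self.topics[best]
        for term, weight in keywords.items():
            topic_terms[term] = topic_terms.get(term, 0.0) + weight
        self.log.append((title, best, is_new))
        return best, is_new
```

Because every processed document updates the stored topic-word weights, later documents are judged against topics that have already absorbed earlier ones, which mirrors the incremental behaviour the apparatus describes.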
In summary, the method and apparatus for real-time detection of new topics provided by the embodiments of the present invention can process, in real time, document data of a given domain acquired from the Internet. After a document is vectorized, its topic is computed from the distribution of topic words in the document and the category to which that topic belongs is determined; when the document's topic does not belong to any existing topic category, the document is assigned to a new topic. Compared with existing approaches to topic detection and assignment, the embodiments handle topics incrementally and use new topics in the analysis of subsequent documents, which improves the accuracy of assigning documents to topic categories. Moreover, because the emergence of a new topic changes the distribution of topic words and thus affects the topic analysis of subsequent documents, the topic detection method can dynamically adjust the distribution probabilities of topic words within topics as the number of processed documents grows, thereby capturing both the evolution of a single topic and the process by which new topics evolve out of it.
The apparatus for real-time detection of new topics includes a processor and a memory. The acquisition unit, computing unit, judging unit, creating unit, and so on are stored in the memory as program units, and the processor executes these program units to implement the corresponding functions.
The processor contains a kernel, which retrieves the corresponding program units from the memory. One or more kernels may be provided. By adjusting kernel parameters, newly appearing topics in text are detected in real time and made available as categories for classifying subsequent text, which improves the accuracy of text classification.
The memory may include non-persistent (volatile) memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: acquiring a vectorized document in real time according to a specified domain; computing the topic of the document from the distribution of topic words in the document; determining whether the topic of the document can be assigned to an existing topic category; and, if not, creating a new topic and placing the document in the category of the new topic.
Those skilled in the art will appreciate that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-persistent (volatile) memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
Those skilled in the art will appreciate that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The above are merely embodiments of the present application and are not intended to limit it. Those skilled in the art may make various modifications and changes to the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of its claims.

Claims (10)

  1. A method for real-time detection of a new topic, characterized in that the method comprises:
    acquiring a vectorized document in real time according to a specified domain;
    computing the topic of the document from the distribution of topic words in the document;
    determining whether the topic of the document can be assigned to an existing topic category;
    if not, creating a new topic and placing the document in the category of the new topic.
  2. The method according to claim 1, characterized in that determining whether the topic of the document can be assigned to an existing topic category comprises:
    computing, using cosine similarity, a similarity value between the topic of the document and an existing topic;
    evaluating the similarity value against a preset similarity threshold, and determining that the topic of the document is a new topic when the similarity value is below the threshold.
  3. The method according to claim 1, characterized in that computing the topic of the document from the distribution of topic words in the document comprises:
    computing the topic of the document using a topic model, the topic model being a dynamic model for computing document topics obtained by adding variational Bayesian inference to the LDA topic model, the topic model being able to dynamically adjust the topic probabilities as the distribution of topic words changes across the documents processed.
  4. The method according to claim 1, characterized in that creating a new topic and placing the document in the category of the new topic comprises:
    collecting the documents assigned to the new topic;
    determining the name of the new topic from the titles of the documents and the distribution of topic words.
  5. The method according to claim 1, characterized in that acquiring a vectorized document in real time according to a specified domain comprises:
    acquiring multi-source documents according to the specified domain;
    producing a term-frequency vector representation of the documents;
    filtering the topic words in the documents using a TF-IDF model.
  6. The method according to claim 1, characterized in that the method further comprises:
    ranking the existing topics by heat according to their mention volume, duration, and novelty;
    recording the data produced when processing the document, the data including the name of the new topic and the correspondence between topics and topic words.
  7. The method according to claim 1, characterized in that, after the new topic is created, the method further comprises:
    adding the new topic to the existing topic categories, so as to extend the existing topic categories and use them when assigning the topics of subsequent documents.
  8. An apparatus for real-time detection of a new topic, characterized in that the apparatus comprises:
    an acquisition unit, configured to acquire a vectorized document in real time according to a specified domain;
    a computing unit, configured to compute the topic of the document from the distribution of topic words in the document acquired by the acquisition unit;
    a judging unit, configured to determine whether the topic of the document obtained by the computing unit can be assigned to an existing topic category;
    a creating unit, configured to create a new topic when the judging unit determines that the topic of the document cannot be assigned to an existing topic category, and to place the document in the category of the new topic.
  9. The apparatus according to claim 8, characterized in that the judging unit comprises:
    a calculation module, configured to compute, using cosine similarity, a similarity value between the topic of the document and an existing topic;
    a judging module, configured to evaluate the similarity value obtained by the calculation module against a preset similarity threshold, and to determine that the topic of the document is a new topic when the similarity value is below the threshold.
  10. The apparatus according to claim 8, characterized in that the computing unit is further configured to compute the topic of the document using a topic model, the topic model being a dynamic model for computing document topics obtained by adding variational Bayesian inference to the LDA topic model, the topic model being able to dynamically adjust the topic probabilities as the distribution of topic words changes across the documents processed.
PCT/CN2017/109840 2016-11-08 2017-11-08 Method and device for real-time detection of new subject WO2018086518A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610980540.XA CN108062319A (en) 2016-11-08 2016-11-08 A kind of real-time detection method and device of new theme
CN201610980540.X 2016-11-08

Publications (1)

Publication Number Publication Date
WO2018086518A1 true WO2018086518A1 (en) 2018-05-17

Family

ID=62109378

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/109840 WO2018086518A1 (en) 2016-11-08 2017-11-08 Method and device for real-time detection of new subject

Country Status (2)

Country Link
CN (1) CN108062319A (en)
WO (1) WO2018086518A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110209813A (en) * 2019-05-14 2019-09-06 天津大学 A kind of incident detection and prediction technique based on autocoder

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875057B (en) * 2018-06-29 2021-08-27 北京百度网讯科技有限公司 Method, apparatus, device and computer readable medium for determining data topics

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101957826A (en) * 2009-07-15 2011-01-26 财团法人工业技术研究院 Method for automatically expanding teaching materials and system for expanding relevant learning teaching materials
CN102831193A (en) * 2012-08-03 2012-12-19 人民搜索网络股份公司 Topic detecting device and topic detecting method based on distributed multistage cluster
CN103164540A (en) * 2013-04-15 2013-06-19 武汉大学 Patent hotspot discovery and trend analysis method
CN104462041A (en) * 2014-11-28 2015-03-25 上海埃帕信息科技有限公司 Method for completely detecting hot event from beginning to end

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103116651A (en) * 2013-03-05 2013-05-22 南京理工大学常熟研究院有限公司 Public sentiment hot topic dynamic detection method
CN104462253B (en) * 2014-11-20 2018-05-18 武汉数为科技有限公司 A kind of topic detection or tracking of network-oriented text big data
CN104462286A (en) * 2014-11-27 2015-03-25 重庆邮电大学 Microblog topic finding method based on modified LDA
CN104715014B (en) * 2015-01-26 2017-10-10 中山大学 A kind of online topic detecting method of news
CN105956130B (en) * 2016-05-09 2019-04-09 浙江农林大学 The scientific documents motif discovery and tracking and its system of multi-information fusion

Also Published As

Publication number Publication date
CN108062319A (en) 2018-05-22

Similar Documents

Publication Publication Date Title
Sanden et al. Enhancing multi-label music genre classification through ensemble techniques
WO2017167067A1 (en) Method and device for webpage text classification, method and device for webpage text recognition
Macdonald et al. Blog track research at TREC
Zhang et al. Adapted textrank for term extraction: A generic method of improving automatic term extraction algorithms
TWI571756B (en) Methods and systems for analyzing reading log and documents corresponding thereof
Li et al. Dirichlet multinomial mixture with variational manifold regularization: Topic modeling over short texts
Chen et al. Improving music genre classification using collaborative tagging data
MA Basher et al. Analyzing topics and authors in chat logs for crime investigation
Hou et al. Classifying advertising video by topicalizing high-level semantic concepts
CN108228612B (en) Method and device for extracting network event keywords and emotional tendency
Bykau et al. Fine-grained controversy detection in Wikipedia
Rashid et al. Analysis of streaming data using big data and hybrid machine learning approach
WO2018086518A1 (en) Method and device for real-time detection of new subject
Peng et al. Trending sentiment-topic detection on twitter
JP4879775B2 (en) Dictionary creation method
CN116882414B (en) Automatic comment generation method and related device based on large-scale language model
Hsu et al. Hierarchical comments-based clustering
Yang et al. Time-sync video tag extraction using semantic association graph
Abinaya et al. Event identification in social media through latent dirichlet allocation and named entity recognition
Lioma et al. A study of factuality, objectivity and relevance: three desiderata in large-scale information retrieval?
Sharma et al. A trend analysis of significant topics over time in machine learning research
Nobari et al. User intent identification from online discussions using a joint aspect-action topic model
CN112287218B (en) Knowledge graph-based non-coal mine literature association recommendation method
Blair et al. Increasing topic coherence by aggregating topic models
Kumar et al. Music tagging and similarity analysis for recommendation system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17868827

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.08.2019)

122 Ep: pct application non-entry in european phase

Ref document number: 17868827

Country of ref document: EP

Kind code of ref document: A1