WO2018086518A1

WO2018086518A1 - Method and device for real-time detection of new subject

Info

Publication number: WO2018086518A1
Application number: PCT/CN2017/109840
Authority: WO
Inventors: 徐文斌
Original assignee: 北京国双科技有限公司
Priority date: 2016-11-08
Filing date: 2017-11-08
Publication date: 2018-05-17
Also published as: CN108062319A

Abstract

A method and device for real-time detection of a new subject, relating to the technical field of Internet, mainly aim at detecting a subject newly appeared in a text in real time and using the subject as an option to which subsequent text classification belongs so as to improve accuracy of text classification. The method comprises: obtaining a vectorized document according to a designated field in real time (101); calculating a subject of the document according to distribution of subject terms in the document (102); determining whether the subject of the document belongs to an existing subject classification or not (103); and if not, creating a new subject, and allocating the document in the classification of the new subject (104). The method and device are mainly used for online real-time detection of a new subject in a document.

Description

Real-time detection method and device for new theme

The present application is filed on the basis of, and claims priority to, a Chinese patent application filed on November 8, 2016.

Technical field

The present invention relates to the field of Internet technologies, and in particular, to a real-time detection method and apparatus for a new theme.

Background technique

Topic Detection and Tracking (TDT) is a highly practical technology in the fields of natural language processing and information retrieval, intended to discover and process the topics that appear in text. It is also a practical technique for effectively discovering and extracting useful information in the context of big data. Typically, hot topic discovery and tracking finds topics for a particular domain or specific event and tracks how each topic progresses.

At present, the main steps of hot topic detection at home and abroad include: text acquisition, that is, collecting news reports on the Internet; vectorization, that is, producing vectorized representations of the documents; document clustering, that is, clustering the vectorized documents and taking high-frequency words or the cluster-center document to represent a topic; and finally, sorting the topics found within a certain period by heat and outputting them in a specific order. However, this approach applies only to the discovery and tracking of specified topics. It does not detect emerging topics in real time, and it cannot effectively express the evolution of the same topic across different periods.

Summary of the invention

In view of this, the present invention provides a real-time detection method and apparatus for a new theme, the main purpose of which is to detect a new topic in a text in real time and use the topic as an option to classify subsequent texts, thereby improving the accuracy of text classification.

In order to achieve the above object, the present invention mainly provides the following technical solutions:

In one aspect, the present invention provides a real-time detection method for a new theme, the method comprising:

Obtaining, in real time, a document in vectorized representation according to a specified domain;

Calculating a topic of the document according to the distribution of keywords in the document;

Determining whether the topic of the document can be attributed to an existing topic classification;

If not, creating a new topic and assigning the document to the classification of the new topic.

Preferably, the determining whether the subject of the document can be attributed to an existing topic classification includes:

Calculating a similarity value between the subject of the document and an existing theme by using cosine similarity;

Comparing the similarity value against a set similarity threshold, and when the similarity value is less than the threshold, determining that the topic of the document is a new topic.

Preferably, the calculating the theme of the document according to the distribution of the keywords in the document comprises:

Creating a topic model, which is a dynamic model for calculating the topic of the document, obtained by adding variational Bayesian inference to an LDA topic model, the topic model being capable of dynamically adjusting, according to the documents already calculated, the change in topic probability caused by changes in the keyword distribution;

Inputting the document into the topic model to calculate the topic of the document.

Preferably, the creating a new topic and dividing the document into the classification of the new topic comprises:

Counting the documents belonging to the new topic;

Determining the name of the new topic according to the titles of the documents and the distribution of the keywords.

Preferably, the obtaining, in real time, of a document in vectorized representation according to the specified domain includes:

Obtaining multi-source documents according to the specified domain;

Using word frequency to obtain the vectorized representation of the documents;

Filtering the keywords in the documents by using the TF-IDF model.

Preferably, the method further includes:

Sorting existing topics by heat according to each topic's mention count, duration, and novelty;

Recording data information of the document, the data information including the name of the new topic and the correspondence between topics and keywords.

Preferably, after the new theme is created, the method further includes:

Adding the new topic to the existing topic classification, so as to extend the existing topic classification and to perform attribution determination on the topics of subsequent documents.

In another aspect, the present invention also provides a new subject real-time detecting device, the device comprising:

An obtaining unit, configured to acquire a vectorized representation of the document in real time according to the specified domain;

a calculating unit, configured to calculate a theme of the document according to a distribution of the keyword in the document acquired by the acquiring unit;

a determining unit, configured to determine whether the subject of the document obtained by the calculating unit can be attributed to an existing topic classification;

And a creating unit, configured to: when the determining unit determines that the topic of the document cannot be attributed to the existing topic classification, create a new topic, and divide the document into the classification of the new topic.

Preferably, the determining unit comprises:

a calculation module, configured to calculate, by using cosine similarity, a similarity value between a theme of the document and an existing theme;

The determining module is configured to compare the similarity value obtained by the calculation module against the set similarity threshold, and when the similarity value is less than the threshold, determine that the topic of the document is a new topic.

Preferably, the calculating unit is further configured to calculate the topic of the document by using a topic model, where the topic model is a dynamic model for calculating the topic of the document, obtained by adding variational Bayesian inference to an LDA topic model, and the topic model can dynamically adjust, according to the documents already calculated, the change in topic probability caused by changes in the keyword distribution.

Preferably, the creating unit comprises:

a statistics module for counting documents belonging to the new topic;

a determining module, configured to determine a name of the new topic according to a title name of the document obtained by the statistical module and a distribution of the keyword.

Preferably, the obtaining unit comprises:

An obtaining module, configured to obtain a multi-source document according to the specified domain;

a vectorization module, configured to use a word frequency to obtain a vectorized representation of the document acquired by the acquiring module;

And a screening module, configured to filter a keyword in the document represented by the vectorization module by using a TF-IDF model.

Preferably, the device further comprises:

a sorting unit for sorting existing topics according to the reference quantity, duration, and novelty of the theme;

a recording unit for recording data information of the document, the data information including the name of the new topic and the correspondence between topics and keywords.

Preferably, the device further comprises:

An adding unit, configured to add the new topic to the existing topic classification after the creating unit creates the new topic, so as to extend the existing topic classification and to perform attribution determination on the topics of subsequent documents.

According to the method and device for real-time detection of a new topic proposed by the present invention, document data of the same domain acquired from the Internet can be processed in real time. After a document is vectorized, the topic of the document is calculated according to the distribution of keywords in the document, and the topic attribution of the document is determined; when the topic of the document does not belong to any existing topic classification, the document is assigned to a new topic. Compared with existing topic detection and attribution approaches, the present invention processes topics incrementally during detection and uses each new topic in the analysis of subsequent documents, thereby improving the accuracy of the topic classification of documents. Because the generation of a new topic affects the distribution of keywords, and thereby the topic analysis of subsequent documents, the topic detection method adopted by the present invention can dynamically adjust the probability distribution of keywords within a topic as the number of detected documents increases. This reveals both the evolution of a single topic and the process by which new topics evolve out of it.

DRAWINGS

Various other advantages and benefits will become apparent to those skilled in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be construed as limiting. Throughout the drawings, the same reference numerals are used to refer to the same parts. In the drawings:

FIG. 1 is a flowchart of a real-time detection method for a new theme according to an embodiment of the present invention;

FIG. 2 is a flowchart of another real-time detection method for a new theme according to an embodiment of the present invention;

FIG. 3 is a block diagram showing the composition of a real-time detecting device of a new theme according to an embodiment of the present invention;

FIG. 4 is a block diagram showing the composition of a real-time detecting apparatus of another new theme proposed by the embodiment of the present invention.

Detailed description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments are shown in the drawings, it should be understood that the invention may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the invention will be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.

An embodiment of the present invention provides a real-time detection method for a new topic. As shown in FIG. 1, the method is applied to real-time topic detection of online documents: it discovers new topics and automatically assigns each document to the classification of its topic. The specific steps of this embodiment are as follows:

101. Acquire a vectorized representation of the document in real time according to the specified domain.

Because the content of online documents is complex, topic detection and attribution cannot be computed over all domains at once. The embodiment of the present invention therefore performs topic detection and attribution calculation on documents in a specified domain. The acquired documents are corpus information collected according to that domain, and they may be obtained from a plurality of sources, such as forums, news portals, WeChat public accounts, and the like. Same-domain documents obtained from different sources differ in their topics because they are written from different perspectives, so the likelihood of new topics arising is relatively large.

After the document is obtained, it needs to be vectorized. At present, the most commonly used document vectorization is based on word features: word frequency is used to vectorize the document, that is, the TF-IDF (term frequency-inverse document frequency) model provides a vectorized representation of the document. However, the specific manner of vectorization is not limited in the embodiment of the present invention; the purpose of vectorizing the document is to facilitate its subsequent detailed analysis.

102. Calculate the subject of the document according to the distribution of the keywords in the document.

The keywords in the document are filtered out of the obtained vectorized document. A keyword here is a word capable of expressing the core idea of the document, and in most cases such words occur with high frequency in the document, so a high-frequency word is very likely to be a keyword of the document. However, words that have no practical meaning, such as "yes", must be excluded from the high-frequency words, because the frequency of these words tends to be even higher than that of the keywords. In the specific screening operation, these words can therefore be avoided by setting a word frequency interval for the filter, for example by screening on the inverse document frequency, that is, keeping words whose IDF value lies between 0.5 and 0.8.

After the keywords of the document are determined, the topic of the document is determined based on the distribution of the keywords in the document. At present, the mainstream way of extracting a document topic is to use an LDA (Latent Dirichlet Allocation) topic model. In the embodiment of the present invention, a topic represents a concept, an aspect, or a subject, and it can be described by a series of related words. The distribution of words in a topic is the probability that these words occur; that is, the probability that the document belongs to each of the different topics is obtained from the distribution of the keywords in the document. In the embodiment of the present invention, the topics contained in the document are extracted by the created topic model, whose main function is to determine the topic to which the document belongs according to the distribution of keywords in the document. During this determination, the topic model can automatically update the correspondence between keyword distributions and topics according to the documents already processed. In other words, the topic model applies the results of previously processed documents to the analysis of subsequent documents, adjusting the probability of the attributed topic according to the keyword distributions of the earlier documents. The topic model in the embodiment of the present invention is therefore a dynamic model whose calculation refers to the processing results of previous documents, so the evolution of a topic across the acquired documents can be obtained by recording the changes in that topic's keyword distribution. Provided that a topic model satisfies the above functions, the embodiment of the present invention does not limit the specific algorithm it uses.

103. Determine whether the subject of the document can be classified into an existing topic classification.

After the topic of the document is determined by the topic model, it is determined whether the topic is the same as an existing topic. If it is, the document is assigned to the existing topic classification, and the name of that topic is re-determined according to the documents in the topic. If the topic differs from the existing topics, step 104 is performed and the topic is determined to be a new topic. The specific judgment manner in the embodiment of the present invention is to perform cluster analysis on the document and the documents in the existing topic classifications, that is, to compare the keywords corresponding to the topic with those of the existing topics in order to judge the similarity between the two topics. When the similarity is lower than a certain value, it can be determined that a new topic has been generated.

104. Create a new theme and divide the document into categories of the new topic.

According to the judgment of step 103, when the topic of the document cannot be attributed to the existing topic classification, a new topic is created based on the document, and the name of the new topic can be extracted and generated from the keyword of the document.

When a new topic is created, it is added to the existing topic classifications so that subsequent documents can be tested for attribution to the new topic. When a new document is assigned to the new topic, the name of the new topic is re-extracted. The above steps are thus executed cyclically: for each document processed, its topic is detected and classified and the name of the topic is updated, which realizes both the discovery and extraction of new topics and the tracking of existing topics.

It can be seen that the real-time detection method for a new topic adopted by the embodiment of the present invention can serve as a document processing framework when processing online documents. Each document processed by the framework is labeled with the topic it belongs to, and the framework can detect and add new topics. As the number of documents processed increases, the number of topics held by the framework also increases, and the names of the topics continue to evolve. When the framework is applied, it can switch between documents of different domains by saving its processing information, and for a new domain only the model information in the framework needs to be initialized. Compared with existing topic detection and attribution approaches, the processing framework adopted by the embodiment of the present invention implements online real-time detection, which makes fuller use of the processing resources of the system. The framework also processes topics incrementally online: a new topic is added in real time and applied to the processing of new documents, that is, the processing is adjusted dynamically, so the topic classification of documents becomes more accurate. In addition, the evolution of a topic can be effectively tracked by recording changes in its keyword distribution and in its name.

To explain the real-time detection method for a new topic proposed by the present invention in more detail, the specific implementation of each step and the connections between the steps are described below. As shown in FIG. 2, the steps involved in the method are:

201. Acquire a document of the vectorized representation in real time according to the specified domain.

In this step, a plurality of documents in a specified domain are first acquired from a plurality of information sources, and each document is vectorized according to word frequency; that is, the term frequency (TF) and inverse document frequency (IDF) of the words in the document are calculated by using the TF-IDF model. The TF-IDF model is applied after word segmentation of the document. After word segmentation, high-frequency words that are obviously not keywords, such as "yes", are filtered out. The TF-IDF model then calculates the inverse document frequency of each word in the document, and in the embodiment of the present invention a word whose IDF value is less than 0.8 is determined to be a keyword of the document. The TF-IDF model is calculated in the embodiment of the present invention as follows:

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{i,k}}$$

where the numerator $n_{i,j}$ is the number of occurrences of word $j$ in document $i$, and the denominator is the sum of the occurrences of all words in that document;

$$\mathrm{idf}_{j} = \log \frac{|D|}{|\{d \in D : j \in d\}|}$$

where the numerator $|D|$ is the total number of documents in the corpus, and the denominator is the number of documents in which word $j$ appears.

Through the calculation method of the TF-IDF model described above, the obtained vectorized document can be calculated to determine the keyword in the document.
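The TF and IDF calculation described in this step can be illustrated with a minimal Python sketch. The function names `tf_idf_tables` and `select_keywords` are hypothetical, the natural logarithm is an assumption (the text does not state a base), and the IDF interval is left as a parameter because the text mentions both "between 0.5 and 0.8" and "less than 0.8". Documents are assumed to be already segmented into word lists.

```python
import math
from collections import Counter


def tf_idf_tables(docs):
    """Return (tf, idf) for a corpus given as a list of token lists.

    tf[i][w] = occurrences of w in document i / total words in document i
    idf[w]   = log(number of documents / number of documents containing w)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

    tf = []
    for tokens in docs:
        counts = Counter(tokens)
        total = sum(counts.values())
        tf.append({word: count / total for word, count in counts.items()})
    return tf, idf


def select_keywords(tokens, idf, low=0.5, high=0.8):
    """Keep the words of one document whose IDF lies in [low, high] (assumed bounds)."""
    return sorted({w for w in tokens if low <= idf.get(w, 0.0) <= high})
```

In practice the interval would be tuned per corpus, since the absolute IDF values depend on the corpus size and on the logarithm base.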

202. Calculate a theme of the document according to a distribution of the keywords in the document.

In this step, topic extraction and categorization can be realized by a custom-built topic model. It is a dynamic model for calculating the topic of a document, and it can dynamically adjust the keyword distributions according to the documents already calculated. In the embodiment of the present invention, the topic model is an LDA dynamic topic model based on variational Bayesian inference; that is, variational Bayesian inference is introduced on top of the LDA topic model to calculate the impact of each document on the evolution of the topics. The purpose of variational Bayesian inference is to find the approximate distribution closest to the true posterior probability distribution of the hidden variables, and to use this approximate distribution in place of that posterior. Specifically, in the embodiment of the present invention, variational Bayesian inference helps the LDA topic model find, for each document, a topic distribution that is closer to that of the previous documents. Because the LDA topic model is widely used in existing topic recognition technology, its calculation principle and process are not described in detail here. After variational Bayesian inference is added, the calculation process of LDA is as follows:

1. For each topic index k ∈ {1, ..., K}:

   Draw the topic distribution β_k ~ Dir(η_k)

2. For each document d ∈ {1, ..., M}:

   (a) Draw the document's topic distribution θ_d ~ Dir(α)

   (b) For each word n ∈ {1, ..., N_d}:

       i. Choose the topic assignment z_{d,n} ~ Mult(θ_d)

       ii. Choose the word w_{d,n} ~ Mult(β_{z_{d,n}})

Where Mult() is a multinomial distribution and Dir() is a Dirichlet distribution.
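The generative process above can be simulated directly, which may help when checking an inference implementation against synthetic data. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and symmetric priors (a single `eta` shared by all topics and a single `alpha`) are an assumption made for brevity. It covers only the generative side, not the variational Bayesian inference itself.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from Dir(alphas) by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]


def sample_categorical(probs, rng):
    """Draw one index from a single-trial multinomial with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(n_topics, vocab_size, doc_lengths, alpha=0.5, eta=0.5, seed=0):
    """Simulate the LDA generative process; words are integer vocabulary ids."""
    rng = random.Random(seed)
    # Step 1: one word distribution beta_k per topic.
    beta = [sample_dirichlet([eta] * vocab_size, rng) for _ in range(n_topics)]
    corpus = []
    for n_d in doc_lengths:
        # Step 2(a): topic distribution theta_d for this document.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        doc = []
        for _ in range(n_d):
            z = sample_categorical(theta, rng)    # 2(b) i: topic assignment
            w = sample_categorical(beta[z], rng)  # 2(b) ii: word given the topic
            doc.append(w)
        corpus.append(doc)
    return beta, corpus
```

Inference runs in the opposite direction: given only the words, it estimates the topic distributions that most plausibly generated them.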

Using the LDA dynamic topic model based on variational Bayesian inference, the keywords in the obtained vectorized document are processed, and the probability of each corresponding topic is obtained according to the specific distribution of the keywords. Because variational Bayesian inference is introduced into the topic model, the model can also calculate the evolution of different topics as it processes a large number of documents.

203. Determine whether the subject of the document can be classified into an existing topic classification.

As the number of documents processed by the topic model in step 202 increases, the keyword distribution corresponding to a topic changes, and this change can eventually cause the topic to evolve into a new topic. In the embodiment of the present invention, each document is set to belong to only one topic, so when the topic of a document is determined, the topic with the highest probability value for that document is taken as its topic. As documents continue to be processed, the topic probabilities also change constantly, and when the change reaches a certain threshold, the keyword distribution gives rise to a new topic. Specifically, cosine similarity can be used to calculate a similarity value between the keywords in the topic of the current document and all the keywords in the documents belonging to an existing topic. Whether the topic of the current document is the same as the existing topic is then decided against a set threshold: when the similarity value is less than the threshold, the topic of the current document is determined to be a new topic. The threshold is an empirical value that can be customized.
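A minimal sketch of the cosine-similarity judgment follows. It assumes each topic and each document is represented as a sparse keyword-weight dictionary. The function names, the `topic-N` naming scheme, and the default threshold of 0.3 are illustrative assumptions, since the text leaves the threshold as a customizable empirical value.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {keyword: weight} dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_topic(doc_vec, topics, threshold=0.3):
    """Attribute a document to the most similar existing topic, or create a new one.

    topics maps topic id -> keyword vector and is extended in place when a new
    topic is created, so later documents are also compared against it.
    Returns (topic_id, is_new).
    """
    best_id, best_sim = None, -1.0
    for topic_id, topic_vec in topics.items():
        sim = cosine_similarity(doc_vec, topic_vec)
        if sim > best_sim:
            best_id, best_sim = topic_id, sim
    if best_id is None or best_sim < threshold:
        new_id = f"topic-{len(topics) + 1}"
        topics[new_id] = dict(doc_vec)
        return new_id, True
    return best_id, False
```

Because `topics` is updated in place, each newly created topic immediately becomes a candidate for the attribution of subsequent documents, which mirrors the incremental behavior described in the text.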

204. Create a new theme and add the new theme to the existing topic category.

Following the judgment of step 203, the document is labeled and classified according to the topic it belongs to. For a document belonging to an existing topic classification, the name of the topic needs to be re-extracted and re-determined because a new document has been added to the topic. The specific determination manner is: extract the keywords corresponding to the topic and the titles of all documents belonging to the topic, calculate the weight of each title according to the keywords and the frequency or distribution of the title in these documents, and take the title with the highest weight as the name of the topic. That is, the weight of a title is obtained as the sum of the weights of all keywords in the title * the number of mentions of the title / the number of keywords mentioned.
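The title weighting can be sketched as follows. This is one possible reading of the formula: "the number of keywords mentioned" is interpreted here as the number of topic keywords that occur in the title, which is an assumption. The function names and the `(title, tokens, mention_count)` tuple layout are likewise hypothetical.

```python
def title_weight(title_tokens, keyword_weights, mention_count):
    """Weight = sum of keyword weights in the title * mentions / keywords in the title."""
    hits = [word for word in title_tokens if word in keyword_weights]
    if not hits:
        return 0.0  # a title containing no topic keyword cannot name the topic
    total = sum(keyword_weights[word] for word in hits)
    return total * mention_count / len(hits)


def pick_topic_name(titles, keyword_weights):
    """Return the highest-weighted title from (title, tokens, mention_count) tuples."""
    best = max(titles, key=lambda t: title_weight(t[1], keyword_weights, t[2]))
    return best[0]
```

Dividing by the number of matched keywords keeps long titles that merely accumulate many keywords from dominating shorter, more focused ones.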

For a new topic, the keyword distribution corresponding to the topic is fed back into the topic model created in step 202 once the topic is created, and the evolution result of the topics is updated. Therefore, after a new topic is generated, its keywords are used to recalculate the documents in the existing topic classification from which the new topic evolved, to determine whether any of those documents belong to the new topic. The name of the new topic can be determined in the manner described above: when only the current document belongs to the new topic, the title of the current document is taken as the name of the new topic, and when further documents are assigned to the new topic, its name is determined by the above process.

In addition, the topic name may also be determined by other means, such as a keyword-based phrase mining method or a document summarization method based on the Rank idea. The specific manner of name extraction is not limited in the embodiment of the present invention.

205. Sort the existing topics according to the reference quantity, duration, and novelty of the theme.

This step is a further application of the results of the above steps. Through the above steps, topic detection and classification are performed on a large number of documents, and new topics in the documents are recognized so as to extend the topic classification of the documents. Building on these added topics and on the topic labels of the documents, the embodiment of the present invention can also rank all current topics by heat. For this purpose, a simple heat model is provided in this embodiment to calculate the heat ranking of each topic. The heat model takes into account the mention count of a topic, the duration of the topic, and the novelty of the topic to determine the final heat, which is output per time point. The heat is calculated as follows:

Heat = a * continuity + b * mention + c * novelty + d * other factors

Here, continuity is intended to discover topics that persist over a period of time. Such topics appear in a steady trend over a long period. Their number of occurrences is often not very high and may be lower than the mention count of recent topics, but because they last longer, continuity is taken into account in the heat calculation.

Mention is the number of times the topic appears within a certain period of time. In general, a recent topic that appears more frequently has a higher heat. For example, after a topic arises in the corpus, a large number of related reports appear across the Internet, and such a topic should have a high heat. Topics such as "G20" and "Rio Olympics" had a high mention count in the short period after they emerged.

Novelty applies to new topics. Because a new topic has only just appeared, it may not yet have many mentions at the current time, but it may be trending toward becoming a hot topic. To prevent such topics from being missed, novelty is introduced as a heat evaluation parameter.

Other factors cover further influences; for example, a topic may become less popular over time, and factors like this are added to the other factors. A simple calculation method for this situation is Newton's law of cooling, which establishes the relationship between the heat of a topic and time and thereby models how the heat of the topic evolves.
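The heat formula and the cooling idea can be sketched as below. The weights `a` to `d`, the cooling rate, the dictionary field names, and the choice to use a topic's cooled previous heat as its "other factors" term are all assumptions made for illustration, since the text does not specify how these quantities are measured or combined beyond the linear formula.

```python
import math


def newton_cooling(previous_heat, hours_elapsed, cooling_rate=0.05):
    """Exponential decay of an earlier heat value: H(t) = H0 * exp(-rate * t)."""
    return previous_heat * math.exp(-cooling_rate * hours_elapsed)


def heat(continuity, mention, novelty, other=0.0, a=1.0, b=1.0, c=1.0, d=1.0):
    """Heat = a * continuity + b * mention + c * novelty + d * other factors."""
    return a * continuity + b * mention + c * novelty + d * other


def rank_topics(topics, cooling_rate=0.05, **weights):
    """Return (name, heat) pairs sorted from hottest to coolest.

    Each topic is a dict with the assumed keys: name, continuity, mention,
    novelty, previous_heat, hours_since_update.
    """
    scored = []
    for t in topics:
        other = newton_cooling(t["previous_heat"], t["hours_since_update"], cooling_rate)
        score = heat(t["continuity"], t["mention"], t["novelty"], other, **weights)
        scored.append((t["name"], score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With this reading, a topic that stops receiving mentions keeps some heat from its past but loses it exponentially, while a fresh topic can still rank highly through its novelty term.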

Further, in the real-time topic detection process of the embodiment of the present invention, the topic model is automatically adjusted according to its output, and the influence on the calculation in the topic model is especially prominent after a new topic is generated. Therefore, for each document calculated, the calculation result of the document and the data information that affects the topic model, such as the name of the topic and the correspondence between topics and keywords, can be recorded. This record can be used to analyze the process of topic evolution and to trace the history of a topic.

Further, as an implementation of the foregoing method, an embodiment of the present invention provides a real-time detection device for a new topic. The device embodiment corresponds to the foregoing method embodiment. For ease of reading, the device embodiment does not repeat the details of the foregoing method embodiment one by one, but it should be clear that the device in this embodiment can implement everything in the foregoing method embodiments. The device is used for extracting the topics in documents in real time and for tracking and recording the evolution of the topics. As shown in FIG. 3, the device includes:

An obtaining unit 31, configured to acquire a document of the vectorized representation in real time according to the specified domain;

The calculating unit 32 is configured to calculate the topic of the document according to the distribution of the keywords in the document acquired by the obtaining unit 31;

The determining unit 33 is configured to determine whether the theme of the document obtained by the calculating unit 32 can be attributed to an existing topic classification;

The creating unit 34 is configured to create a new topic when the determining unit 33 determines that the topic of the document cannot be attributed to the existing topic classification, and divide the document into the classification of the new topic.

Further, as shown in FIG. 4, the determining unit 33 includes:

a calculation module 331, configured to calculate, by using cosine similarity, a similarity value between a theme of the document and an existing topic;

The determining module 332 is configured to compare the similarity value obtained by the calculation module 331 with a set similarity threshold and, when the similarity value is less than the threshold, determine that the topic of the document is a new topic.
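The similarity test performed by modules 331 and 332 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names, the default threshold of 0.5, and the choice to compare against the best-matching existing topic are assumptions, and topics are assumed to be equal-length vectors of topic-word weights.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def is_new_topic(doc_topic, existing_topics, threshold=0.5):
    """Return True when no existing topic reaches the similarity threshold.

    Assumption: the document topic is compared with every existing topic
    and only the highest similarity value is tested against the threshold.
    """
    best = max(
        (cosine_similarity(doc_topic, t) for t in existing_topics),
        default=0.0,
    )
    return best < threshold
```

With an empty list of existing topics the best similarity defaults to 0, so the first document always opens a new topic.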

Further, as shown in FIG. 4, the calculating unit 32 is further configured to calculate the topic of the document by using a topic model, where the topic model is a dynamic model for calculating the topic of the document, obtained by adding variational Bayesian inference to an LDA topic model, and is capable of dynamically adjusting, according to the documents already calculated, the change in topic probability caused by a change in the distribution of the topic words.

Further, as shown in FIG. 4, the creating unit 34 includes:

a statistics module 341, configured to gather statistics on the documents that belong to the new topic;

The determining module 342 is configured to determine the name of the new topic according to the title names of the documents gathered by the statistics module 341 and the distribution of the topic words.
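The text does not specify how the title names and the topic-word distribution are combined to produce a topic name. One possible reading, shown here purely as an assumption, is to pick the title whose words carry the most probability mass under the new topic. The function name `name_topic`, the whitespace tokenization, and the fallback to the top-weighted words are all hypothetical.

```python
def name_topic(titles, topic_word_probs, top_n=3):
    """Pick a name for a new topic.

    titles: titles of the documents assigned to the new topic.
    topic_word_probs: mapping from topic word to its probability.
    Assumption: the best name is the title whose distinct words have the
    highest total probability; ties keep the earliest title.
    """
    def score(title):
        words = set(title.lower().split())
        return sum(topic_word_probs.get(w, 0.0) for w in words)

    if titles:
        return max(titles, key=score)
    # Hypothetical fallback: no titles available, join the top words.
    top_words = sorted(topic_word_probs, key=lambda w: (-topic_word_probs[w], w))
    return " ".join(top_words[:top_n])
```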

Further, as shown in FIG. 4, the obtaining unit 31 includes:

The obtaining module 311 is configured to obtain a multi-source document according to the specified domain;

a vectorization module 312, configured to use a word frequency to obtain a vectorized representation of the document acquired by the obtaining module 311;

The screening module 313 is configured to filter the keywords in the document represented by the vectorization module 312 by using the TF-IDF model.
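The vectorization and screening steps of modules 312 and 313 can be illustrated with a small standard-library sketch. It assumes documents are already tokenized, uses the unsmoothed weighting tf × log(N / df) as one common TF-IDF variant, and keeps a hypothetical `top_k` keywords per document; none of these specifics come from the text above.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Word-frequency vector of one document as a term -> frequency dict."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def tfidf_keywords(docs, top_k=3):
    """Keep the top_k terms of each document ranked by TF-IDF.

    docs: list of token lists.
    Assumption: idf = log(N / df) without smoothing, so a term that
    appears in every document scores 0 and is ranked last.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    result = []
    for tokens in docs:
        tf = term_frequencies(tokens)
        scores = {t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()}
        # Sort by descending score, then alphabetically for stable ties.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_k])
    return result
```

In this sketch a word such as "the" that occurs in every document receives a score of zero and is filtered out first, which is the effect the screening step is after.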

Further, as shown in FIG. 4, the device further includes:

a sorting unit 36, configured to rank the existing topics and the topics created by the creating unit 34 by popularity according to the number of mentions, the duration, and the novelty of each topic;

The recording unit 37 is configured to record the data information produced when the calculating unit 32 processes the document, where the data information includes the name of the new topic and the correspondence between topics and topic words.
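The text names three ranking factors for the sorting unit 36 (number of mentions, duration, novelty) but gives no formula. The sketch below assumes a simple weighted sum. The `TopicStats` fields, the weights, and the definition of novelty as 1 / (1 + age in days) are hypothetical and serve only to show how the three factors could be combined.

```python
from dataclasses import dataclass


@dataclass
class TopicStats:
    name: str
    mentions: int          # number of documents attributed to the topic
    duration_days: float   # span between first and latest document
    age_days: float        # time since the topic was created


def hotness(t, w_mentions=1.0, w_duration=0.5, w_novelty=2.0):
    """Assumed score: weighted sum of mentions, duration, and novelty."""
    novelty = 1.0 / (1.0 + t.age_days)  # newer topics score closer to 1
    return w_mentions * t.mentions + w_duration * t.duration_days + w_novelty * novelty


def rank_topics(topics):
    """Return topics ordered from hottest to coldest."""
    return sorted(topics, key=hotness, reverse=True)
```

In practice the weights would need tuning so that no single factor, most likely the raw mention count, dominates the ranking.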

Further, as shown in FIG. 4, the device further includes:

The adding unit 35 is configured to add the new topic to the existing topic classification after the creating unit 34 creates the new topic, so as to extend the existing topic classification and use it in the attribution determination of the topics of subsequent documents.
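The incremental behaviour of the creating unit 34 and the adding unit 35 can be sketched as a registry that grows while documents are processed. The class name `TopicRegistry`, its threshold value, and the use of the first document's topic vector as the representative of a new topic are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


class TopicRegistry:
    """Existing topic classification that is extended as new topics appear."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.topics = []   # one representative vector per topic
        self.members = []  # members[i] lists the documents of topic i

    def assign(self, doc_id, doc_topic):
        """Return (topic_index, created) for one document."""
        best_idx, best_sim = -1, 0.0
        for i, topic in enumerate(self.topics):
            sim = cosine(doc_topic, topic)
            if sim > best_sim:
                best_idx, best_sim = i, sim
        if best_idx == -1 or best_sim < self.threshold:
            # No existing topic is close enough: create one and register it
            # so that subsequent documents can be attributed to it.
            self.topics.append(list(doc_topic))
            self.members.append([doc_id])
            return len(self.topics) - 1, True
        self.members[best_idx].append(doc_id)
        return best_idx, False
```

A short usage example: after `reg = TopicRegistry(threshold=0.8)`, calling `reg.assign("d1", [1.0, 0.0, 0.0])` creates topic 0, and a later, similar document is attributed to that topic rather than opening another one.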

In summary, the real-time detection method and device for a new topic adopted by the embodiments of the present invention can process same-domain document data acquired from the Internet in real time and, after vectorizing a document, calculate the topic of the document according to the distribution of topic words in the document and determine the topic attribution of the document. When the topic of the document does not belong to the existing topic classification, a new topic is created and the document is assigned to it. Compared with existing topic detection and attribution approaches, the embodiment of the present invention processes topics incrementally during topic detection and uses the new topic in the analysis of subsequent documents, thereby improving the accuracy of the topic classification of documents. At the same time, since the generation of a new topic affects the distribution of topic words and thus the topic analysis of subsequent documents, the topic detection method adopted by the embodiment of the present invention can dynamically adjust the probability distribution of the topic words within a topic as the number of detected documents increases, thereby revealing the evolution of the same topic and the process by which new topics evolve from it.

The real-time detection device for a new topic includes a processor and a memory. The obtaining unit, the calculating unit, the determining unit, the creating unit, and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.

The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided. By adjusting the kernel parameters, a topic newly appearing in a text is detected in real time and used as an option for subsequent text classification, thereby improving the accuracy of text classification.

The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.

The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: acquiring a vectorized representation of a document in real time according to a specified domain; calculating the topic of the document according to the distribution of topic words in the document; determining whether the topic of the document can be attributed to an existing topic classification; and if not, creating a new topic and assigning the document to the classification of the new topic.

Those skilled in the art will appreciate that embodiments of the present application can be provided as a method, system, or computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.

The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

The computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising instruction means, the instruction means implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory. Memory is an example of a computer readable medium.

Computer readable media include both permanent and non-persistent, removable and non-removable media, and information storage can be implemented by any method or technology. The information can be computer readable instructions, data structures, modules of programs, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media, such as modulated data signals and carrier waves.

It is also to be understood that the terms "comprises", "comprising", or any other variations thereof are intended to encompass a non-exclusive inclusion, such that a process, method, article, or device that comprises a series of elements includes not only those elements but also other elements not explicitly listed, or elements that are inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.

The above is only an embodiment of the present application and is not intended to limit the application. Various changes and modifications can be made to the present application by those skilled in the art. Any modifications, equivalents, improvements, etc. made within the spirit and scope of the present application are intended to be included within the scope of the appended claims.

Claims

A real-time detection method for a new subject, characterized in that the method comprises:

Obtaining a vectorized representation of a document in real time according to a specified domain;

Calculating the topic of the document according to the distribution of topic words in the document;

Determining whether the topic of the document can be attributed to an existing topic classification;

If not, creating a new topic and assigning the document to the classification of the new topic.
The method according to claim 1, wherein the determining whether the topic of the document can be attributed to an existing topic classification comprises:

Calculating a similarity value between the topic of the document and an existing topic by using cosine similarity;

Comparing the similarity value with a set similarity threshold and, when the similarity value is less than the threshold, determining that the topic of the document is a new topic.
The method according to claim 1, wherein the calculating the topic of the document according to the distribution of topic words in the document comprises:

Calculating the topic of the document by using a topic model, the topic model being a dynamic model for calculating the topic of the document, obtained by adding variational Bayesian inference to an LDA topic model, the topic model being capable of dynamically adjusting, according to the documents already calculated, the change in topic probability caused by a change in the distribution of the topic words.
The method of claim 1, wherein the creating a new topic and assigning the document to the classification of the new topic comprises:

Gathering statistics on the documents belonging to the new topic;

Determining the name of the new topic according to the title names of the documents and the distribution of the topic words.
The method according to claim 1, wherein the obtaining a vectorized representation of a document in real time according to a specified domain comprises:

Obtaining multi-source documents according to the specified domain;

Obtaining a vectorized representation of each document by using word frequency;

Filtering the keywords in the document by using the TF-IDF model.
The method of claim 1, further comprising:

Ranking the existing topics by popularity according to the number of mentions, the duration, and the novelty of each topic;

Recording data information produced when processing the document, the data information including the name of the new topic and the correspondence between topics and topic words.
The method of claim 1, wherein after the new topic is created, the method further comprises:

Adding the new topic to the existing topic classification, so as to extend the existing topic classification and use it in the attribution determination of the topics of subsequent documents.
A real-time detection device for a new topic, characterized in that the device comprises:

An obtaining unit, configured to acquire a vectorized representation of a document in real time according to a specified domain;

a calculating unit, configured to calculate the topic of the document according to the distribution of topic words in the document acquired by the obtaining unit;

a determining unit, configured to determine whether the topic of the document obtained by the calculating unit can be attributed to an existing topic classification;

and a creating unit, configured to, when the determining unit determines that the topic of the document cannot be attributed to the existing topic classification, create a new topic and assign the document to the classification of the new topic.
The device according to claim 8, wherein the determining unit comprises:

a calculation module, configured to calculate, by using cosine similarity, a similarity value between a theme of the document and an existing theme;

The determining module is configured to compare the similarity value obtained by the calculation module with a set similarity threshold and, when the similarity value is less than the threshold, determine that the topic of the document is a new topic.
The device according to claim 8, wherein the calculating unit is further configured to calculate the topic of the document by using a topic model, wherein the topic model is a dynamic model for calculating the topic of the document, obtained by adding variational Bayesian inference to an LDA topic model, the topic model being capable of dynamically adjusting, according to the documents already calculated, the change in topic probability caused by a change in the distribution of the topic words.