CN107944037A - Hot topic identification and tracking method and system - Google Patents

Hot topic identification and tracking method and system

Info

Publication number
CN107944037A
Authority
CN
China
Prior art keywords
topic
document
calculating
text
hot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711332908.2A
Other languages
Chinese (zh)
Inventor
任东英
朱瑾鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Digital Technology Co Ltd
Original Assignee
Beijing Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Digital Technology Co Ltd filed Critical Beijing Digital Technology Co Ltd
Priority to CN201711332908.2A priority Critical patent/CN107944037A/en
Publication of CN107944037A publication Critical patent/CN107944037A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hot topic identification and tracking method and system. An analytic hierarchy process (AHP) and an entropy method are employed to evaluate the identified topics, so that the indexes of a hot topic are quantified and scientific quantitative evaluation is achieved, giving high identification accuracy and effective screening of hot topics. Dependence on manual intervention is reduced: relying on the impartiality of computer processing, manual intervention is minimized when judging whether a topic is a hot topic. The hot topic discovery method described herein can effectively obtain the hot topics of various fields within a period of time. The attention-based hot topic ranking reflects relatively objectively the topics that drew the greater response during that period.

Description

Hot topic identification and tracking method and system
Technical Field
The disclosure relates to the technical field of data mining, in particular to a hot topic identification and tracking method and system.
Background
Topic identification and tracking research began in 1996, with emphasis on the ability to discover new information, to attend to information related to a particular topic rather than a broad subject category, and to process text information that changes over time. Its purpose is to help people cope with information overload: by mining and analyzing large amounts of text, topic reports are organized and presented to users in a friendly form. The basic flow is shown in FIG. 1.
In the prior art, corpus collection and topic identification mainly rely on crawler technology to fetch the required pages from the web; the page content is then processed and cleaned to obtain an analyzable text corpus, and a clustering algorithm performs topic identification. This technology is relatively mature, but hot topic identification lacks a scientific quantitative analysis method, so hot topics are not identified accurately enough.
Disclosure of Invention
In view of the above, the present disclosure is proposed to provide a hot topic identification tracking method and system that overcomes or at least partially solves the above problems.
According to an aspect of the present disclosure, there is provided a hot topic identification and tracking method, including:
crawling a webpage by using a web crawler, cleaning the crawled webpage to extract the text content of the webpage, and establishing a text corpus;
vectorizing the text in the text corpus to obtain text data for analysis, and clustering on the basis of a vector space model to obtain a topic list;
calculating and selecting a first-level index and a second-level index related to the hot topic by using an entropy method and an analytic hierarchy process, and establishing a comprehensive evaluation model; and respectively calculating the scores of all the topics according to the comprehensive evaluation model, and identifying hot-spot topics.
The method further comprises the following steps:
according to the topic development curve of each hot topic, acquiring, from the related documents in each of the occurrence, development and stable stages of the topic, the text document with the highest average vector similarity to the topic, as the related document and the topic description of the topic.
The establishing of the text corpus specifically includes:
processing the non-standard webpage markup crawled by the web crawler, and preliminarily filtering noise information by using regular expressions according to the format of the webpage;
extracting the text in the webpage based on tag density: when an end tag is read, the proportion of Chinese characters between the start tag and the end tag is checked, and if that proportion does not reach a threshold, all content between the start tag and the end tag, including the tags, is removed;
extracting webpage content based on the DOM (document object model), replacing the escape symbols in the HTML, and merging the leaf content to obtain a text file containing the extracted text content, thereby establishing the text corpus.
The vectorization processing of the text in the text corpus includes:
a document S is represented as a vector:
$S_i = (W_{1,i}, W_{2,i}, \ldots, W_{n,i})$
wherein $W_{k,i}$ represents the weight of the k-th index term in the i-th document, and n is the total number of index terms in the document.
The vector space model, comprising:
after vectorization of the text, computing on the text vectors, and calculating the similarity between documents by the general formula of the vector space model;
the document similarity calculation formula is as follows:
$sim(d_i, d_j) = D(d_i, d_j)$;
wherein $D(d_i, d_j)$ is the cosine similarity between the documents, calculated as follows:
$D(d_i, d_j) = \frac{\sum_{k=1}^{t} W_{k,i}\, W_{k,j}}{|d_i|\,|d_j|}$
wherein $|d_i|$ and $|d_j|$ represent the moduli of the vectors of document i and document j respectively, $W_{k,i}$ represents the weight of the k-th index term in the i-th document, and t is the total number of distinct index terms in the two documents.
The clustering algorithm for clustering on the basis of the vector space model to obtain the topic list comprises the following steps:
when a document $D_i$ arrives, representing $D_i$ as a vector and creating a class $G_j$ containing $D_i$;
for each micro-class $C_n$ in $G_j$: if the number m of documents in $C_n$ satisfies m ≤ 600, calculating the similarity between $D_i$ and all documents in $C_n$; if m > 600, calculating the similarity between $D_i$ and all documents in the document set $\{D_{m-600}, \ldots, D_{m-1}\}$ of $C_n$; and taking the average value θ of the k largest similarity values as the similarity between $D_i$ and the micro-class $C_n$;
obtaining the maximum θ; if θ is less than the set threshold, creating a new micro-class containing $D_i$; otherwise, adding $D_i$ to the corresponding micro-class $C_n$;
obtaining the n micro-classes of $G_j$ through the first clustering, each micro-class representing a sub-topic; each micro-class is represented by the average vector of the documents in the micro-class, and the correlation between sub-topics is described by the cosine between the average vectors, namely: $sim(C_i, C_j) = D(C_i, C_j) + R(C_i + C_j)$; wherein $D(C_i, C_j)$ is the cosine similarity between the micro-classes, and $R(C_i + C_j)$ is the co-occurrence rate of named entities in the micro-classes.
The calculating and selecting of the first-level index and the second-level index related to the hot topic by using the entropy method and the analytic hierarchy process, and the establishing of the comprehensive evaluation model, comprise:
calculating the weight of the primary index to the hot topic by using an analytic hierarchy process;
calculating the weight of the secondary index to the primary index by using an entropy method, and calculating the weight of the secondary index to the hot topic;
and establishing a comprehensive evaluation model.
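As an illustration of the analytic hierarchy process step, the following minimal sketch derives first-level index weights from a pairwise comparison matrix. The source text does not give the comparison matrix or the solution method; the matrix values and the row geometric-mean approximation used here are assumptions chosen for illustration only.

```python
import math

def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric-mean method.

    pairwise[a][b] states how much more important index a is than index b.
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]

# Hypothetical reciprocal comparison matrix for three first-level indexes.
pairwise = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
weights = ahp_weights(pairwise)
```

In practice a consistency check on the comparison matrix would normally accompany this step; it is omitted here for brevity.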
The establishing of the comprehensive evaluation model, the respective calculating of the score of each topic according to the comprehensive evaluation model, and the identifying of hot topics comprise:
calculating the entropy of the i-th observation under the j-th index;
calculating the difference coefficient and the weight of the j-th index;
calculating the weight of each secondary index with respect to its primary index;
solving the weight of each primary index with respect to the hot topic by using the analytic hierarchy process;
calculating the weight of each secondary index with respect to the hot topic;
calculating a composite score for each observation.
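The entropy-based steps above can be sketched as follows. The formulas are not reproduced in the source text, so this sketch assumes the commonly used form of the entropy weight method (proportion p, entropy e = -k Σ p ln p with k = 1/ln n, difference coefficient 1 - e, weights normalized to sum to 1). It also assumes that the index values are non-negative and already scaled to comparable ranges. The function names and the sample matrix are hypothetical.

```python
import math

def entropy_weights(matrix):
    """matrix[i][j] is the value of observation (topic) i under index j."""
    n = len(matrix)
    m = len(matrix[0])
    k = 1.0 / math.log(n)
    diffs = []
    for j in range(m):
        column = [matrix[i][j] for i in range(n)]
        total = sum(column)
        entropy = 0.0
        for x in column:
            p = x / total
            if p > 0:
                entropy -= k * p * math.log(p)
        diffs.append(1.0 - entropy)  # difference coefficient of index j
    s = sum(diffs)
    return [d / s for d in diffs]

def composite_scores(matrix, weights):
    """Weighted sum of index values for each observation."""
    return [sum(w * x for w, x in zip(weights, row)) for row in matrix]

# Hypothetical data: two topics, two secondary indexes.
matrix = [[1.0, 1.0], [1.0, 3.0]]
weights = entropy_weights(matrix)
scores = composite_scores(matrix, weights)
```

An index on which all topics have the same value carries no information and receives a weight of approximately zero, which is the intended behaviour of the entropy method.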
Obtaining the topic description specifically comprises the following steps:
obtaining the words with the highest weights from the documents and, after screening, taking them as the related word group of the topic; selecting words from the related word group and combining them with the titles and named entities of the related documents to obtain the title of the hot topic;
analyzing the related documents of the hot topic, and counting and comparing the effective information and key sentences of each document to establish a simple topic frame;
the topic frame comprises the time of occurrence, the place of occurrence, the related people, the related organizations, and the cause, course and result of the event;
obtaining a rough topic description of the hot topic by using the named entities and the related documents, combined with the changes of the named entities in different stages;
and refining the topic description by using multi-document automatic summarization to obtain the topic description of each hot topic.
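The topic frame listed above could be held in a simple record type. The following is only a sketch; the field names and the helper method are hypothetical, and the sample values are placeholders rather than data from the source.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TopicFrame:
    """Slots of the simple topic frame (field names are assumptions)."""
    title: str = ""
    time: str = ""
    place: str = ""
    people: List[str] = field(default_factory=list)
    organizations: List[str] = field(default_factory=list)
    cause: str = ""
    course: str = ""
    result: str = ""

    def missing_slots(self):
        """Names of slots that have not been filled yet."""
        return [name for name, value in vars(self).items() if not value]

# Placeholder values for illustration only.
frame = TopicFrame(time="2017-12", place="Beijing")
unfilled = frame.missing_slots()
```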
According to another aspect of the present disclosure, there is provided a hot topic identification and tracking system, including:
the text corpus unit is used for crawling the web pages by using a web crawler, cleaning the crawled web pages to extract the text contents of the web pages and establishing a text corpus;
the topic list unit is used for vectorizing the text in the text corpus to obtain text data for analysis, and clustering the text data on the basis of a vector space model to obtain a topic list;
the topic identification unit is used for calculating and selecting a first-level index and a second-level index related to the hot topic by using an entropy method and an analytic hierarchy process, and establishing a comprehensive evaluation model; respectively calculating the scores of all topics in the topics according to the comprehensive evaluation model, and identifying hot-spot topics;
and the topic description unit is used for acquiring, according to the topic development curve of each hot topic, the text document with the highest average vector similarity to the topic from the related documents in each of the occurrence, development and stable stages of the topic, as the related document and the topic description of the topic.
According to one or more technical schemes of the present disclosure, a hot topic identification and tracking scheme is provided in which an analytic hierarchy process and an entropy method are adopted to evaluate the identified topics, so that the indexes of a hot topic are quantified and scientific quantitative evaluation is achieved, giving high identification accuracy and effective screening of hot topics. Dependence on manual intervention is reduced: relying on the impartiality of computer processing, manual intervention is minimized when judging whether a topic is a hot topic. The hot topic discovery method can effectively obtain the hot topics of various fields within a period of time. The attention-based hot topic ranking reflects relatively objectively the topics that drew the greater response during that period.
Drawings
Various additional advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the disclosure. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 illustrates a basic flow diagram of topic identification in the prior art;
FIG. 2 illustrates a flow diagram of a hot topic identification tracking method according to one embodiment of the present disclosure;
FIG. 3 shows a schematic structural diagram of a hot topic identification and tracking system according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
According to the scheme of each embodiment of the present disclosure, the analytic hierarchy process and the entropy method are adopted to evaluate the identified topics, so that the indexes of hot topics are quantified and scientific quantitative evaluation is achieved, giving high identification accuracy and effective screening of hot topics.
According to the scheme of each embodiment of the present disclosure, a scientific quantitative analysis method is adopted: the factors influencing a hot topic are divided into primary indexes and secondary indexes so as to measure them better, a comprehensive evaluation model is then established by using the entropy method and the analytic hierarchy process, and the heat of each topic is evaluated objectively, so that hot topics are identified effectively.
Embodiment One
FIG. 2 shows a flowchart of the hot topic identification and tracking method of this embodiment. Referring to FIG. 2, the method may include:
Step 11, crawling web pages by using a web crawler, cleaning the crawled web pages to extract their text content, and establishing a text corpus.
In this embodiment, the corpus is first collected and preprocessed. Considering quality requirements on the corpus such as authority and timeliness, portal websites are adopted as the corpus source. Using the web crawler, news pages from specified sites within a set time range can be collected.
To keep hot topic discovery timely, the web crawler automatically collects pages once a week, and the collected web pages are stored by time and subject. A web crawler is a program that automatically downloads web pages from a network. Starting from one or more initial web pages, it downloads the specified pages, parses all the links in each page, filters them, adds them to a queue of pages to be downloaded, and repeats these steps until a stop condition of the system is met.
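The crawl loop described above (seed pages, link parsing, link filtering, download queue, stop condition) can be sketched as follows. This is a minimal illustration, not the crawler of the embodiment: the `fetch` callable, the host filter, the `max_pages` stop condition and the example site are all assumptions, and the download function is injected so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collects the href values of all <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, fetch, allowed_host, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    queue = deque(seeds)
    seen = set(seeds)
    pages = {}
    while queue and len(pages) < max_pages:  # stop condition
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            # Filter: stay on the specified site, skip already queued links.
            if urlparse(link).netloc == allowed_host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

# Hypothetical in-memory site standing in for a news portal.
site = {
    "http://news.example/": '<a href="/a.html">a</a><a href="http://other.example/x">x</a>',
    "http://news.example/a.html": '<a href="/">home</a>',
}
pages = crawl(["http://news.example/"], site.get, "news.example")
```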
All web pages crawled by the crawler will be stored in the form of files. The embodiment analyzes the text content of the web page, so the downloaded news web page needs to be subjected to text extraction. The content of a web page can be divided into two categories, one is the markup information provided to the browser and the other is the information provided to the user for reading. The information provided for the user to read not only includes news headlines and text content, but also has useless information such as navigation information, advertisement information and related links, so various marking information and useless information in the webpage are removed first, and the news headlines and the text content are reserved.
This embodiment extracts the text of a web page based on the page format and regular expressions, and stores it as plain text. First, non-standard webpage markup is processed; then noise information is preliminarily filtered by regular expressions according to the format of the webpage.
This embodiment then extracts the text in the webpage based on tag density. When an end tag is read, the proportion of Chinese characters between the start tag and the end tag is checked; if that proportion does not reach a threshold, all content between the start tag and the end tag, including the tags, is removed. Finally, the webpage content is extracted based on the DOM document model, the escape symbols in the HTML are replaced, and the leaf content is merged to obtain a text file containing the extracted text content.
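The Chinese-character proportion check can be illustrated with the following simplified sketch. It is an assumption-laden approximation: it uses regular expressions over flat (non-nested) `p`, `div` and `td` blocks instead of a DOM, and the threshold value of 0.5 and the function names are hypothetical.

```python
import re
from html import unescape

CJK = re.compile(r"[\u4e00-\u9fff]")
BLOCK = re.compile(r"<(p|div|td)\b[^>]*>(.*?)</\1>", re.S | re.I)
TAG = re.compile(r"<[^>]+>")

def chinese_ratio(text):
    """Share of Chinese characters among non-whitespace characters."""
    stripped = re.sub(r"\s+", "", text)
    if not stripped:
        return 0.0
    return len(CJK.findall(stripped)) / len(stripped)

def extract_body(page, threshold=0.5):
    """Keep only blocks whose Chinese-character ratio reaches the threshold."""
    kept = []
    for match in BLOCK.finditer(page):
        inner = unescape(TAG.sub("", match.group(2)))  # drop tags, replace escapes
        if chinese_ratio(inner) >= threshold:
            kept.append(inner.strip())
    return "\n".join(kept)

page = "<div>Home | Login | About</div><p>今天北京发布了新的政策。</p>"
body = extract_body(page)
```

Here the navigation block is discarded because it contains no Chinese characters, while the news paragraph is kept.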
And step 12, vectorizing the texts in the text corpus to obtain text data for analysis, and clustering on the basis of a vector space model to obtain a topic list.
The foregoing steps yield the text corpus to be processed, but raw text data cannot be processed directly, so the text needs to be represented as vectors. After the index terms of a document are determined, a document S can be represented as a vector:
$S_i = (w_{1,i}, w_{2,i}, \ldots, w_{n,i})$
where $w_{k,i}$ represents the weight of the k-th index term in the i-th document, and n is the total number of index terms in the document. The weighting scheme can be chosen freely; the simplest method uses the frequency of the term in the document as its weight, but more effective methods are usually adopted in practice. The important information for calculating term weights includes term frequency, document frequency and collection frequency, defined in Table 1.
TABLE 1
Measure               Symbol        Definition
Term frequency        $tf_{i,j}$    Number of occurrences of word $w_i$ in document $d_j$
Document frequency    $df_i$        Number of documents in which word $w_i$ occurs
Collection frequency  $cf_i$        Total number of occurrences of word $w_i$ in the collection
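The three measures of Table 1 can be computed from tokenized documents as in the following sketch. The function name and the sample documents are hypothetical, and word segmentation is assumed to have been done beforehand.

```python
from collections import Counter

def term_statistics(docs):
    """docs is a list of token lists; returns (tf, df, cf).

    tf[j][w] is the count of word w in document j,
    df[w] is the number of documents containing w,
    cf[w] is the total count of w over the whole collection.
    """
    tf = [Counter(doc) for doc in docs]
    df = Counter()
    cf = Counter()
    for counts in tf:
        df.update(counts.keys())  # each document counts once per word
        cf.update(counts)         # adds the per-document counts
    return tf, df, cf

docs = [["topic", "hot", "topic"], ["hot", "news"]]
tf, df, cf = term_statistics(docs)
```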
Term frequency describes the importance of a term within a given document: the larger the value, the better the term describes the document and the more accurately it reflects the content. However, although a higher term frequency indicates a closer relation between the word and the document content, a word occurring 3 times as often in one document as in another does not mean that its relevance to the content is 3 times as great; the relation between term frequency and document content also depends on the document length and on the average document length of the whole collection.
Document frequency indicates how informative a term is. If a semantically central word is confined to a single document, it may appear in that document many times. In contrast, a word that is not semantically central is distributed relatively evenly over all documents in the collection. That is, if a word is likely to appear in every document, its correlation with the content of any particular document is clearly small. The most obvious examples are function words such as "and". Hence, the higher the document frequency of a term, the less it contributes to characterizing the content of a document. In other words, the weight of a term is proportional to its term frequency and inversely proportional to its document frequency.
It is of course most desirable to combine term frequency and document frequency in one formula, and a combined weight calculation formula is usually used, in which the document frequency component is called the inverse document frequency (idf). After years of research, the weight formula has developed into a family of formulas, many of which apply normalization such as taking the square root, taking the logarithm, or incorporating the total number of documents, the document length and other information.
Based on the above considerations and the actual needs of topic discovery in this embodiment, the weighting scheme of this embodiment employs the inverted query tf·idf, where tf and idf are expressed as follows:
tf in the above formula is $tf_{raw} / (tf_{raw} + k)$, which performs smoothing. $tf_{raw}$ is the raw number of times that a term w appears in a document d, and the parameter k balances the influence of repeated occurrences of a term in a document. Here k is defined in terms of the average length of the documents in the corpus, which reduces the influence of document length on the similarity calculation. df in the above formula denotes the number of documents containing the term w, N is the total number of documents in the corpus, and idf is smoothed by normalization.
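A smoothed tf·idf weight of the kind described can be sketched as follows. The exact expressions for k and idf are not reproduced in the source text; the forms used here, k = 0.5 + 1.5 × (document length / average document length) and idf = log((N + 0.5) / df) / log(N + 1), are assumptions taken from a widely used variant of this weighting and may differ from the formulas of the embodiment.

```python
import math

def smoothed_tf(tf_raw, doc_len, avg_doc_len):
    """tf_raw / (tf_raw + k), with an assumed length-dependent k."""
    k = 0.5 + 1.5 * doc_len / avg_doc_len  # assumed form of k
    return tf_raw / (tf_raw + k)

def smoothed_idf(df, n_docs):
    """Assumed normalized idf; lies between 0 and 1 for 1 <= df <= n_docs."""
    return math.log((n_docs + 0.5) / df) / math.log(n_docs + 1.0)

def term_weight(tf_raw, doc_len, avg_doc_len, df, n_docs):
    return smoothed_tf(tf_raw, doc_len, avg_doc_len) * smoothed_idf(df, n_docs)

w = term_weight(tf_raw=2, doc_len=100, avg_doc_len=100, df=10, n_docs=1000)
```

With this form, repeated occurrences of a term raise tf with diminishing returns, and longer-than-average documents are damped through a larger k.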
A topic is a seed event or activity that occurs at a specific time and place, possibly with some inevitable consequences. Named entities identified in a document, such as dates, places, people and organizations, help to find and distinguish topics better, so the weights of the named entities in a document are increased appropriately when the document is represented as a vector. Because the system processes newly downloaded web pages dynamically, this embodiment employs the following incremental df calculation scheme.
At time t, df(w) is updated when a new document set $C_t$ is added to the model, where $df_{C_t}(w)$ denotes the document frequency of the term w in the document set $C_t$:
$df_t(w) = df_{t-1}(w) + df_{C_t}(w)$
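The incremental update can be sketched as a small class that keeps running document frequencies and adds the frequencies of each new batch. The class and method names are hypothetical.

```python
from collections import Counter

class IncrementalDF:
    """Keeps df_t(w) = df_{t-1}(w) + df_{C_t}(w) across document batches."""

    def __init__(self):
        self.df = Counter()
        self.n_docs = 0

    def add_batch(self, batch):
        """batch is the new document set C_t, given as a list of token lists."""
        batch_df = Counter()
        for tokens in batch:
            batch_df.update(set(tokens))  # one count per document per term
        self.df.update(batch_df)
        self.n_docs += len(batch)
        return batch_df

model = IncrementalDF()
model.add_batch([["a", "b", "a"], ["b"]])
model.add_batch([["a"]])
```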
After vectorization, the text vectors can be computed on, and the similarity between documents can be calculated by the general formula of the vector space model. The final document similarity calculation formula is:
$sim(d_i, d_j) = D(d_i, d_j)$
where $D(d_i, d_j)$ is the cosine similarity between the documents, calculated as follows:
$D(d_i, d_j) = \frac{\sum_{k=1}^{t} w_{k,i}\, w_{k,j}}{|d_i|\,|d_j|}$
where $|d_i|$ and $|d_j|$ denote the moduli of the vectors of document i and document j respectively, $w_{k,i}$ denotes the weight of the k-th index term in the i-th document, and t is the total number of distinct index terms in the two documents.
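Cosine similarity between two documents can be computed on sparse term-weight vectors as in the following sketch, where each document is assumed to be stored as a dictionary from term to weight (a representation chosen here for illustration).

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

d1 = {"topic": 3.0, "hot": 4.0}
d2 = {"topic": 3.0}
similarity = cosine(d1, d2)
```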
Comparing a topic with a document is somewhat more complex, because a topic typically contains multiple documents, so the comparison is between a set and an element. When clustering in this embodiment, the similarity between a new document and a certain number of documents in each topic is calculated, and the average of the k highest similarity values is taken as the similarity between the topic and the document. Whether the document is related to the topic is then decided by a threshold strategy. In this embodiment, the threshold strategy also controls the number of clustering iterations: a merging similarity threshold θ is set, and merging is performed only when the similarity is greater than θ.
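The top-k averaging and the threshold decision can be sketched as follows, given the similarity values between a new document and the documents of one topic. The default k = 3 and the function names are assumptions for illustration.

```python
import heapq

def topic_similarity(similarities, k=3):
    """Average of the k largest document-to-document similarity values."""
    top = heapq.nlargest(k, similarities)
    return sum(top) / len(top) if top else 0.0

def is_related(similarities, theta, k=3):
    """Threshold strategy: related only when the similarity exceeds theta."""
    return topic_similarity(similarities, k) > theta

sims = [0.9, 0.1, 0.8, 0.7]
score = topic_similarity(sims, k=3)
```

Averaging only the best k matches keeps one loosely related document in a large topic from dragging the topic-level similarity down.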
When topic discovery is performed on dynamic news corpora from the network, neither the number of topics nor the moment at which a new topic should be created can be predicted. The task is therefore equivalent to unsupervised clustering. Clustering is an unsupervised learning process that groups samples according to some distance between them. The goal of clustering is to divide a set of objects into several groups or categories, with similar elements in the same group and dissimilar elements in different groups. In hierarchical clustering, every node except the root is a subclass of its parent, so the clustering can be represented as a tree diagram in which the leaf nodes represent the initial samples, the internal nodes represent the clustered classes, and all children of a node are elements of its class. A hierarchical clustering algorithm can thus be described as creating a hierarchy that decomposes a given data set. It can operate in two modes: top-down decomposition and bottom-up merging.
The disadvantage of hierarchical clustering is its large amount of computation, which makes it suitable only for small samples. Because the bottom-up method is widely used and similar classes must be merged in each iteration, a variety of methods for calculating the distance between classes have appeared, including the single-link minimum distance method, the complete-link maximum distance method, the average-link class average distance method and the center-of-gravity (centroid) method.
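The four inter-class distance measures named above can be sketched for classes given as lists of points. Euclidean distance and the sample points are assumptions made for this illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link(A, B):
    """Minimum distance between any pair of members."""
    return min(euclidean(a, b) for a in A for b in B)

def complete_link(A, B):
    """Maximum distance between any pair of members."""
    return max(euclidean(a, b) for a in A for b in B)

def average_link(A, B):
    """Mean distance over all pairs of members."""
    return sum(euclidean(a, b) for a in A for b in B) / (len(A) * len(B))

def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def centroid_link(A, B):
    """Distance between the centers of gravity of the two classes."""
    return euclidean(centroid(A), centroid(B))

A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0), (3.0, 2.0)]
```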
In non-hierarchical clustering the category structure is simple, and the relationships between categories are less explicit than in hierarchical clustering. Most non-hierarchical clustering algorithms are iterative: the algorithm first forms an initial clustering and then reassigns the class of each sample over successive iterations. Non-hierarchical clustering generally starts from an assumed initial partition and needs several iterations, each of which reassigns the sample data, so a stopping criterion for the iteration must be defined. The basic principle of the stopping criterion is that each iteration should improve the clustering, and the iteration can stop once the improvement becomes small. The advantage of non-hierarchical clustering is its efficiency. The k-means clustering algorithm is conceptually simple and is widely adopted when tackling new problems.
The clustering result depends not only on the chosen clustering algorithm but also on the features chosen to represent the objects, the tuning of parameters, the presence of dirty data, and the input order of the data. When representing the objects to be clustered, removing or adding some features may change the result considerably. Therefore, before clustering, the features must be made explicit and meaningful features selected. Correlation between features also influences the clustering result, so principal component analysis or factor analysis can be used to compress many features into a few mutually independent indexes that retain most of the information before clustering is carried out.
When determining the weights of object features, the weights need to be adjusted if the features differ in importance. For example, the weights of a weighted Euclidean distance may be determined by experts or by machine learning. Many clustering algorithms require certain input parameters, such as the desired number of classes, and these adjustable parameters make clustering quality difficult to control, especially for large high-dimensional data without prior information. The purpose of clustering should be made explicit: to make the distance between classes as large as possible and the distance within each class as small as possible. The number of classes can be chosen according to the purpose of the study, but the clustering result should have a convincing interpretation. In practice, the number of classes can be determined by experience or by machine learning methods, and the clustering effect of different numbers can be tested.
After a comprehensive examination of the application ranges and characteristics of the various clustering methods, the requirements on the clustering algorithm used for topic discovery in this embodiment are summarized as follows: low complexity, high efficiency, the ability to handle noisy data and outliers, easily determined parameters, usable clustering results, and so on. The clustering algorithm is described as follows:
When the first document $D_1$ arrives, $D_1$ is represented as a vector by the method described above, and a class containing $D_1$ is created.
When a document $D_i$ arrives, the df of each term in $D_i$ is updated according to the formula, and $D_i$ is represented as a vector by the method described above. For each micro-class $C_n$ in $G_j$: if the number of documents $m$ in $C_n$ satisfies $m \le 600$, the similarity between $D_i$ and every document in $C_n$ is calculated; if $m > 600$, the similarity between $D_i$ and every document in the set $\{D_{m-600}, \ldots, D_{m-1}\}$ of $C_n$ is calculated. The average $\theta$ of the $k$ largest similarity values is taken as the similarity between $D_i$ and the micro-class $C_n$.
Find the largest $\theta$. If it is less than a given threshold, create a new micro-class containing $D_i$; otherwise, add $D_i$ to the corresponding micro-class $C_n$.
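The single-pass procedure above can be sketched as follows. This is a simplified illustration under stated assumptions: documents are already vectorized, the df update step is omitted, and the names `single_pass`, `class_similarity`, and the default `k=3` are hypothetical. Only the window of 600 documents and the top-k averaging come from the text.

```python
import numpy as np

WINDOW = 600  # from the text: compare only against the most recent 600 documents

def cosine(a, b):
    """Cosine similarity; defined as 0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def class_similarity(doc, members, k):
    """Average of the k largest similarities between doc and the recent members."""
    sims = sorted((cosine(doc, m) for m in members[-WINDOW:]), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)

def single_pass(docs, threshold, k=3):
    """Assign each incoming document to the best micro-class or open a new one."""
    classes = []  # each micro-class is a list of document vectors in arrival order
    for doc in docs:
        doc = np.asarray(doc, dtype=float)
        best_idx, best_theta = -1, -1.0
        for idx, members in enumerate(classes):
            theta = class_similarity(doc, members, k)
            if theta > best_theta:
                best_idx, best_theta = idx, theta
        if best_idx < 0 or best_theta < threshold:
            classes.append([doc])            # no micro-class is similar enough
        else:
            classes[best_idx].append(doc)
    return classes

micro = single_pass([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], threshold=0.8)
```

Because each document is compared against at most 600 members per micro-class, the cost per arriving document stays bounded as micro-classes grow.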
The first clustering yields the micro-classes of $G_j$. Each micro-class represents a sub-topic, and several sub-topics may discuss the same topic, so this embodiment merges the micro-classes by a further clustering. Each micro-class is represented by the average vector of its documents, and the correlation between sub-topics is described by the cosine between the average vectors; this is the centroid method used in hierarchical clustering to calculate the distance between classes. Because the named entities in a topic help to distinguish topics, the similarity used in this clustering also includes the co-occurrence rate of named entities in the micro-classes, that is: $\mathrm{sim}(C_i, C_j) = D(C_i, C_j) + R(C_i + C_j)$, where $D(C_i, C_j)$ is the cosine similarity between the micro-classes and $R(C_i + C_j)$ is the co-occurrence rate of named entities in the micro-classes.
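A sketch of the micro-class similarity used for merging is given below. The text does not define how the named-entity co-occurrence rate R is computed, so the Jaccard overlap of the two entity sets is used here purely as an assumption; all function names are hypothetical.

```python
import numpy as np

def centroid(members):
    """Average vector of the documents in a micro-class."""
    return np.mean(np.asarray(members, dtype=float), axis=0)

def entity_cooccurrence(entities_a, entities_b):
    """ASSUMPTION: co-occurrence rate taken as the Jaccard overlap of entity sets."""
    a, b = set(entities_a), set(entities_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def micro_class_similarity(members_a, entities_a, members_b, entities_b):
    """sim = cosine between centroids + named-entity co-occurrence rate."""
    ca, cb = centroid(members_a), centroid(members_b)
    denom = np.linalg.norm(ca) * np.linalg.norm(cb)
    cos = float(ca @ cb / denom) if denom else 0.0
    return cos + entity_cooccurrence(entities_a, entities_b)

s = micro_class_similarity([[1, 0], [1, 0]], {"Beijing", "WHO"},
                           [[2, 0]], {"Beijing"})
```

With this choice of R the combined similarity lies in [0, 2] for non-negative term weights, so any merge threshold would need to be set on that scale.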
After a month's news corpus has been processed by the method above, the topic list for that month is obtained. Some of these topics may turn out to be hot topics, but most are only general topics, so the heat of each topic must be calculated further.
Step 13: select the primary (first-level) and secondary (second-level) indexes related to hot topics, calculate their weights using the entropy method and the analytic hierarchy process, and establish a comprehensive evaluation model; then calculate the score of each topic according to the comprehensive evaluation model and identify the hot topics.
After new topics have been found by the previous steps, the hot topics among them must be identified. Compared with general topics, hot topics have a large number of documents, a long reporting duration, a wide spread, and a strong public response. The primary and secondary indexes selected according to these characteristics are shown in Table 2.
TABLE 2
To calculate the heat of a hot event, the weight of each primary index with respect to the hot topic is first calculated using the analytic hierarchy process, the weight of each secondary index with respect to its primary index is then calculated using the entropy method, and finally the weight of each secondary index with respect to the hot topic is obtained. A comprehensive evaluation model is thereby established, the heat score of each topic is calculated, and the hot topics are screened out according to the score. The specific steps are as follows:
(1) Calculating the ith observation entropy under the jth index:
wherein
(2) Calculating the difference coefficient and the weight of the jth index:
in the formula
(3) Obtaining the weight of the secondary index to the primary index:
(4) Solving, by the analytic hierarchy process, the weight of each primary index with respect to the hot topic.
(5) Calculating the weight of each secondary index with respect to the hot topic.
(6) Calculating a composite score for each observation.
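The formulas for these steps do not appear in this text, so the sketch below assumes the textbook entropy weight method (with $k = 1/\ln n$, which keeps each entropy in $[0, 1]$) and the principal-eigenvector form of the analytic hierarchy process. The function names, the input layout, and the choice of scoring each observation by its weighted shares are all assumptions for illustration. Every index column is assumed to have a positive sum and at least one column is assumed to be non-uniform.

```python
import numpy as np

def entropy_weights(X):
    """Weights of secondary indexes (columns) from observations (rows), X >= 0."""
    X = np.asarray(X, dtype=float)
    P = X / X.sum(axis=0)                    # share of observation i under index j
    k = 1.0 / np.log(X.shape[0])             # k > 0
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(P > 0, P * np.log(P), 0.0)
    e = -k * plogp.sum(axis=0)               # entropy of index j, e_j >= 0
    g = np.clip(1.0 - e, 0.0, None)          # difference coefficient, 0 <= g_j <= 1
    return g / g.sum()

def ahp_weights(pairwise):
    """Primary-index weights from a pairwise comparison matrix (principal eigenvector)."""
    A = np.asarray(pairwise, dtype=float)
    vals, vecs = np.linalg.eig(A)
    w = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    return w / w.sum()

def composite_scores(groups, pairwise):
    """groups[p]: observations x secondary indexes belonging to primary index p."""
    scores = 0.0
    for w_primary, X in zip(ahp_weights(pairwise), groups):
        X = np.asarray(X, dtype=float)
        P = X / X.sum(axis=0)
        # secondary-to-topic weight = primary weight * secondary-to-primary weight
        scores = scores + w_primary * (P @ entropy_weights(X))
    return scores

groups = [[[1, 5], [2, 5], [7, 5]],          # e.g. two indexes under one primary index
          [[3, 1], [1, 1], [1, 3]]]          # two indexes under another primary index
scores = composite_scores(groups, [[1, 2], [0.5, 1]])
```

An index whose values are identical across all observations has entropy 1 and therefore receives zero weight, which matches the intuition that it cannot help separate hot topics from general ones.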
The embodiment further includes a hot topic description process.
According to the development curve of each hot topic, the document with the highest similarity to the topic's average vector is selected from the documents of each stage of the topic (emergence, development, and stabilization) as a related document of the topic. During the earlier topic discovery, the documents already underwent named entity recognition, word segmentation, and stop-word removal, and the weight of each word was obtained; the invention therefore takes the highest-weighted words from these documents and, after manual screening, uses them as the topic's related word group. A word is then selected from the related word group and combined manually with the titles and named entities of the related documents to produce the title of the hot topic.
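Selecting the document closest to the topic's average vector can be sketched as below. The sketch assumes the documents of one stage are already vectorized and would be applied once per stage; the function name is hypothetical.

```python
import numpy as np

def representative_document(doc_vectors):
    """Return the index of the document most similar (cosine) to the average vector,
    together with all similarities."""
    X = np.asarray(doc_vectors, dtype=float)
    mean = X.mean(axis=0)
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(mean)
    # Documents with a zero vector get similarity 0 instead of a division error.
    sims = np.divide(X @ mean, norms, out=np.zeros(len(X)), where=norms > 0)
    return int(np.argmax(sims)), sims

best, sims = representative_document([[1, 0], [1, 1], [0, 1]])
```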
A topic description is hard to produce and has no objective standard; how to give users a concise, clear, and content-rich topic description interface is one of the difficulties currently faced. The topic template should therefore cover all aspects of a topic, with a simple and clear interface and filled-in content that is well organized, highly readable, low in redundancy, and able to cover the different development stages of the topic. However, the boundaries between the stages of a topic are not clear-cut and multi-document automatic summarization is still immature, so solving this problem is very challenging.
Information extraction uses a computer to automatically structure unstructured data, providing a data basis for further retrieval and mining. Structured information extraction can be realized in two ways: a template mode, and a web-page-independent mode that extracts structured information at the level of the web page library. The template mode configures a template in advance for a specific web page and extracts the information specified in the template, so it can accurately acquire information from a limited number of websites; it is simple, accurate, technically easy, and quick to deploy. Library-level extraction automatically extracts structured data by page structure analysis and intelligent node analysis and conversion; it can handle any normal web page, is fully automatic, and needs no template generated in advance for specific websites. This intelligent extraction is accurate and general, but it is technically difficult, has a high up-front research and development cost and a long development cycle, and is suited to high-end applications in structured data acquisition and search.
The topic description of this embodiment uses the template mode. The related documents of the hot topics are analyzed using statistical and manual analysis methods, and the effective information and key sentences of each document are counted and compared to establish a simple topic frame. The frame includes the time and place of occurrence, the people involved, the organizations involved, and the cause, course, and outcome of the event. The frame is filled mainly from the named entities and the related documents, combined with the changes in named entities across the stages, which yields a rough description of the hot topic. The description is then refined using existing multi-document automatic summarization techniques together with manual processing, finally giving a topic description for each hot topic.
Hot topic lists published by the media mostly contain ten topics. The top ten hot topics among the domestic topics of October 2017 found by the method of this embodiment are listed in Table 3.
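The topic frame can be pictured as a simple record that is partly filled from named entities. The sketch below is only an illustration: the field names, the entity type labels (`TIME`, `LOC`, `PER`, `ORG`), and the rule of picking the most frequent entities are assumptions, and the cause, course, and outcome fields are left empty for summarization and manual processing.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class TopicFrame:
    """Hypothetical template mirroring the frame items listed in the text."""
    time: str = ""
    place: str = ""
    people: list = field(default_factory=list)
    organizations: list = field(default_factory=list)
    cause: str = ""
    course: str = ""
    outcome: str = ""

def fill_frame(entities, top_n=3):
    """Fill a frame from (text, label) named-entity pairs by frequency."""
    by_label = {}
    for text, label in entities:
        by_label.setdefault(label, Counter())[text] += 1

    def top(label, n):
        return [t for t, _ in by_label.get(label, Counter()).most_common(n)]

    return TopicFrame(
        time=(top("TIME", 1) or [""])[0],
        place=(top("LOC", 1) or [""])[0],
        people=top("PER", top_n),
        organizations=top("ORG", top_n),
    )

frame = fill_frame([("2017-10-18", "TIME"), ("Beijing", "LOC"), ("Beijing", "LOC"),
                    ("Shanghai", "LOC"), ("Zhang San", "PER"), ("WHO", "ORG")])
```

Running the same function on the entities of each stage separately would expose the changes in named entities across stages that the text mentions.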
TABLE 3
Hot topic identification is concerned mainly with the public's reaction to a topic. The method relies on the impartiality of computer processing to minimize manual intervention when judging whether a topic is a hot topic. With the hot topic discovery method of this embodiment, the hot topics of various fields within a period of time can be obtained effectively. The attention-based ranking of hot topics objectively reflects the topics that drew the strongest response during that period.
Example two
As shown in Fig. 3, a hot topic identification and tracking system is disclosed, comprising:
a text corpus unit 21, configured to crawl a web page by using a web crawler, clean the crawled web page to extract text content of the web page, and establish a text corpus;
the topic list unit 22 is configured to perform vectorization processing on the text in the text corpus to obtain text data for analysis, and perform clustering on the basis of a vector space model to obtain a topic list;
the topic identification unit 23 is configured to calculate and select a first-level index and a second-level index related to the hot topic by using an entropy method and an analytic hierarchy process, and establish a comprehensive evaluation model; respectively calculating the scores of all topics in the topics according to the comprehensive evaluation model, and identifying hot-spot topics;
and the topic description unit 24 is configured to obtain, from relevant documents that occur, develop and are stable in each stage of a topic, a text document with the highest average vector similarity to the topic as a relevant document and a topic description of the topic according to a topic development curve of each hot topic.
According to the scheme of this embodiment, intelligent hot topic identification and tracking comprises the following steps:
Step 1: corpus collection and preprocessing. Web pages are crawled using crawler technology, and the crawled pages are cleaned.
Step 2: topic identification. The preprocessed text corpus undergoes word segmentation, stop-word removal, named entity recognition, and similar operations; the text is then vectorized to obtain data for analysis, and clustering based on the vector space model produces a topic list.
Step 3: hot topic identification. A comprehensive evaluation model is established from the selected primary and secondary indexes of hot topics using the entropy method and the analytic hierarchy process, the topic scores are calculated, and the hot topics are finally identified among the general topics.
Step 4: topic description. Information extraction, multi-document automatic summarization, and manual processing are combined to obtain a text description of each hot topic.
In step 3, a comprehensive evaluation model is first established from the selected primary and secondary indexes of hot topics, the topic scores are then calculated, and finally the hot topics are identified among the general topics.
According to one or more technical schemes of this disclosure, a hot topic identification and tracking scheme is provided. It uses the analytic hierarchy process and the entropy method to quantify hot topic indexes by evaluating the identified topics, achieving scientific quantitative evaluation, with high identification accuracy and effective screening of hot topics. Dependence on manual intervention is reduced: the impartiality of computer processing is used to minimize manual intervention when judging whether a topic is a hot topic. With this hot topic discovery method, the hot topics of various fields within a period of time can be obtained effectively. The attention-based ranking of hot topics reflects more objectively the topics that drew the strongest response during that period.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the disclosure may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that is, the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this disclosure.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Moreover, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the disclosure and form different embodiments. For example, any of the embodiments claimed in the claims can be used in any combination.
Various component embodiments of the disclosure may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. The present disclosure may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present disclosure may be stored on a computer-readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the disclosure, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The disclosure may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware.
While the foregoing is directed to embodiments of the present disclosure, it is noted that various improvements, modifications, and changes may be made by those skilled in the art without departing from the spirit of the present disclosure, and it is intended that such improvements, modifications, and changes fall within the scope of the present disclosure.

Claims (10)

1. A hot topic identification and tracking method is characterized by comprising the following steps:
crawling a webpage by using a web crawler, cleaning the crawled webpage to extract the text content of the webpage, and establishing a text corpus;
vectorizing the text in the text corpus to obtain text data for analysis, and clustering on the basis of a vector space model to obtain a topic list;
calculating and selecting a first-level index and a second-level index related to the hot topic by using an entropy method and an analytic hierarchy process, and establishing a comprehensive evaluation model; and respectively calculating the scores of all the topics according to the comprehensive evaluation model, and identifying hot-spot topics.
2. The method of claim 1, wherein the method further comprises:
and according to the topic development curve of each hot topic, acquiring a text document with the highest average vector similarity with the topic from related documents which occur, develop and are stable in each stage of the topic as a related document and a topic description of the topic.
3. The method of claim 1, wherein the establishing a text corpus comprises:
processing the irregular webpage tags crawled by the web crawler, and preliminarily filtering noise information with regular expressions according to the format of the webpage;
extracting the text of the webpage based on tag density: when an end tag is read, the proportion of Chinese characters between the end tag and the corresponding start tag is checked, and if the proportion does not reach a threshold, all content between the start tag and the end tag, including the tags, is removed;
extracting webpage content based on the DOM (Document Object Model), replacing the escape symbols in the HTML, and merging the leaf content to obtain a text file containing the extracted text content, thereby establishing a text corpus.
4. The method of claim 1, wherein vectorizing text in the corpus of text comprises:
a document S is represented as a vector:
$S_i = (w_{1,i}, w_{2,i}, \ldots, w_{n,i})$
wherein $w_{k,i}$ represents the weight of the kth index item in the ith document, and n is the total number of index items in the document.
5. The method of claim 4, wherein the vector space model comprises:
after vectorization of a text, calculating the text vector; calculating the similarity between the documents by adopting a general formula in a vector space model;
the document similarity calculation formula is as follows:
$\mathrm{sim}(d_i, d_j) = D(d_i, d_j)$;
wherein $D(d_i, d_j)$ is the cosine similarity between the documents, calculated as follows:
$D(d_i, d_j) = \dfrac{\sum_{k=1}^{t} w_{k,i}\, w_{k,j}}{|d_i|\,|d_j|}$
wherein $|d_i|$ and $|d_j|$ represent the moduli of the vectors of document i and document j, respectively, $w_{k,i}$ represents the weight of the kth index item in the ith document, and t is the total number of different index items in the two documents.
6. The method of claim 5, wherein the clustering algorithm based on the vector space model to cluster into a list of topics comprises:
when a document $D_i$ arrives, representing $D_i$ as a vector and creating a class $G_j$ containing $D_i$;
for each micro-class $C_n$ in $G_j$, if the number of documents $m$ in $C_n$ satisfies $m \le 600$, calculating the similarity between $D_i$ and all documents in $C_n$; if $m > 600$, calculating the similarity between $D_i$ and all documents in the set $\{D_{m-600}, \ldots, D_{m-1}\}$ of $C_n$; and calculating the average $\theta$ of the $k$ largest similarity values as the similarity between $D_i$ and the micro-class $C_n$;
obtaining the maximum $\theta$; if $\theta$ is less than the set threshold, creating a new micro-class containing $D_i$; otherwise, adding $D_i$ to the corresponding micro-class $C_n$;
obtaining N micro-classes of $G_j$ by the first clustering, each micro-class representing a sub-topic; each micro-class is represented by the average vector of the documents in the micro-class, and the correlation between sub-topics is described by the cosine between the average vectors, namely: $\mathrm{sim}(C_i, C_j) = D(C_i, C_j) + R(C_i + C_j)$; wherein $D(C_i, C_j)$ is the cosine similarity between the micro-classes and $R(C_i + C_j)$ is the co-occurrence rate of named entities in the micro-classes.
7. The method as claimed in claim 1, wherein the calculating and selecting the related primary index and secondary index of the hot topic by using an entropy method and an analytic hierarchy process to establish a comprehensive evaluation model comprises:
calculating the weight of the primary index to the hot topic by using an analytic hierarchy process;
calculating the weight of the secondary index to the primary index by using an entropy method, and calculating the weight of the secondary index to the hot topic;
and establishing a comprehensive evaluation model.
8. The method of claim 7, wherein establishing the comprehensive evaluation model, calculating the score of each topic according to the comprehensive evaluation model, and identifying hot topics comprises:
calculating the ith observation entropy under the jth index:
wherein $k > 0$ and $e_j \ge 0$;
calculating the difference coefficient and the weight of the jth index:
wherein $0 \le g_j \le 1$;
calculating the weight of the secondary index to the primary index:
solving the weight of the first-level index to the hot topic by using an analytic hierarchy process;
calculating the weight of the secondary indexes to the hot topics;
calculating a composite score for each observation.
9. The method according to claim 2, characterized in that it comprises in particular:
obtaining the word with the highest weight from the document, and obtaining the word as a related word group of the topic after screening; selecting a word from the related word group, and combining the title and the named entity of the related document to obtain the title of the hot topic;
analyzing relevant documents of hot topics, and carrying out statistics and comparison on effective information and key sentences of each document to establish a simple topic framework;
the topic frame comprises the time and place of occurrence of the topic, the people involved, the organizations involved, and the cause, course, and outcome of the event;
obtaining a rough topic description of a hot topic by using the named entities and the related documents and combining the change of the named entities at different stages;
and perfecting the topic description by using a multi-document automatic abstracting technology to obtain the topic description of each hot topic.
10. A hot topic identification and tracking system, comprising:
the text corpus unit is used for crawling the web pages by using a web crawler, cleaning the crawled web pages to extract the text content of the web pages and establishing a text corpus;
the topic list unit is used for vectorizing the text in the text corpus to obtain text data for analysis, and clustering the text data on the basis of a vector space model to obtain a topic list;
the topic identification unit is used for calculating and selecting a first-level index and a second-level index related to the hot topic by utilizing an entropy method and an analytic hierarchy process, and establishing a comprehensive evaluation model; respectively calculating scores of all topics in the topics according to the comprehensive evaluation model, and identifying hot topics;
and the topic description unit is used for acquiring, according to the topic development curve of each hot topic, a text document with the highest average vector similarity with the topic from the related documents which occur, develop and are stable in each stage of the topic, as the related document and the topic description of the topic.
CN201711332908.2A 2017-12-13 2017-12-13 A kind of much-talked-about topic identification method for tracing and system Pending CN107944037A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711332908.2A CN107944037A (en) 2017-12-13 2017-12-13 A kind of much-talked-about topic identification method for tracing and system

Publications (1)

Publication Number Publication Date
CN107944037A true CN107944037A (en) 2018-04-20

Family

ID=61943055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711332908.2A Pending CN107944037A (en) 2017-12-13 2017-12-13 A kind of much-talked-about topic identification method for tracing and system

Country Status (1)

Country Link
CN (1) CN107944037A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102937960A (en) * 2012-09-06 2013-02-20 北京邮电大学 Device and method for identifying and evaluating emergency hot topic
US20130151525A1 (en) * 2011-12-09 2013-06-13 International Business Machines Corporation Inferring emerging and evolving topics in streaming text
CN107220777A (en) * 2017-06-08 2017-09-29 合肥工业大学 The social capital measure of Network brand community users based on entropy model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
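Yao Shuzhi (姚书志): "Research on the Emergency Management Capability of Local Universities for Sudden Incidents" (地方高校突发事件应急管理能力研究), China Doctoral Dissertations Full-text Database, Social Sciences II (《中国博士学位论文全文数据库 社会科学II辑》) *
Xie Yijin (谢宜瑾): "Research and Implementation of Online Public Opinion Analysis and Management Technology" (网络舆情分析与管理技术的研究与实现), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) *
Gong Haijun (龚海军): "Research on Automatic Discovery of Online Hot Topics" (网络热点话题自动发现技术研究), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) *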

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180420