CN116738068A - Trending topic mining method, device, storage medium and equipment - Google Patents

Info

Publication number
CN116738068A
Authority
CN
China
Prior art keywords
news information
short
industry
news
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310432524.7A
Other languages
Chinese (zh)
Inventor
邱震宇
王玲
曾文秋
朱阿柯
姜聪聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huatai Securities Co ltd
Original Assignee
Huatai Securities Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huatai Securities Co ltd
Priority to CN202310432524.7A
Publication of CN116738068A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a trending topic mining method, device, storage medium and equipment. The method comprises: acquiring news information within a period of time; marking each acquired news item using an industry label system, and grouping the news items by industry label to obtain a plurality of news information groups; segmenting the title of each news item in each news information group to obtain a plurality of short words, and determining the node weight of each short word; determining the key short words of each news information group according to the node weights of the short words; associating the news items of each news information group with the key short words of that group to obtain a plurality of topic clusters; and calculating the industry heat of each topic cluster of each news information group, and determining the topic cluster with the highest industry heat as the hot topic of the industry of that news information group. The invention can mine hot topics for each industry.

Description

Trending topic mining method, device, storage medium and equipment
Technical Field
The invention relates to a hot topic mining method, a hot topic mining device, a storage medium and hot topic mining equipment, and belongs to the technical field of topic mining.
Background
Industry trending topic mining adds the analysis and aggregation of industry public opinion events, so that trending public opinion events are explored from the perspective of an industry. Unlike hot public opinion event aggregation, this task must aggregate news from a more abstract perspective: an aggregated topic is not necessarily tied to a single subject, but may be a public opinion event affecting an industry as a whole, or a change across the whole industry caused by a macro-level policy.
Current trending topic mining rapidly analyzes the latest news to obtain the topics that are currently trending worldwide and related to finance; topic content includes, but is not limited to, company financial events and domestic and foreign political and military events. Industry hot topics add industry attributes on top of this: besides company-related news topics, the resulting industry topics must also include topics at the industry level and the macro level, such as overall industry changes and policy releases.
Research shows that mining industry hot topics can be broken down into several sub-problems:
1. Topics must first be extracted from the news information.
Because the new topics that appear each day are unknown, the data in this scenario is generally unlabeled, so the problem cannot be solved by supervised learning. In contrast to the currently popular deep neural network methods in NLP, topic extraction is generally performed with topic models based on statistical methods, the classical example being Latent Dirichlet Allocation (LDA). A topic model typically extracts topics from the news items without supervision; its output includes a document-topic probability distribution, which can serve as a vector representation of each text.
Iterative training of an LDA topic model takes a long time, so it is inefficient when the volume of news information is large. LDA also performs poorly on short texts, mainly because it has difficulty handling their sparsity. In a long text, each news item has enough text to carry information about many different short words. For one news item, a document-topic distribution θ is sampled from a Dirichlet distribution; to generate a short word of the article, a latent topic variable z is generated according to θ, the topic-word distribution φ_z for z is sampled from another Dirichlet distribution, and finally the word is sampled from φ_z. For a short text, an article contains so few short words that this generative process has little statistical significance, and adding more articles does not alleviate the problem. Finally, the topic distribution output by the topic model is itself rich in information; if it is used directly, the clusters produced by subsequent topic clustering are not cohesive enough, many small clusters are generated, and these clusters must then be merged.
2. News information containing similar topics must be aggregated into the topic clusters of the current period, so that the heat of each topic cluster can be calculated and the topics of the hotter clusters can be pushed to users.
At present this is done with unsupervised clustering, most commonly the K-means method. Its core idea is that each clustering iteration searches for the spatial center point of each cluster, chosen so that the distances between the surrounding data points and the center point are as small as possible.
However, the usual K-means method cannot identify outlier data points, and it depends heavily on the preset hyperparameter for the number of topics. In addition, the resulting clusters differ in granularity, so post-processing is needed to merge duplicate topic clusters.
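The K-means behaviour described above can be illustrated with a minimal sketch. The data points, the initial centroids and the function name `kmeans` are hypothetical and chosen only to show how a single outlier can capture a centroid when the number of clusters is fixed in advance; this is not code from the application.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain K-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # index of the centroid closest to p (Euclidean distance)
            nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # new centroid = coordinate-wise mean; an empty cluster keeps its old centroid
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[idx]
            for idx, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Two natural groups plus one outlier, with k fixed to 2 (illustrative data)
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (100.0, 100.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

With k fixed to 2, the outlier ends up with a cluster of its own and the two natural groups are merged, which is the kind of sensitivity to outliers and to the preset cluster count noted above.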
3. For each topic cluster, one topic description text or topic key phrase needs to be generated for it, for fully representing the topic content of the topic cluster.
For the topic description text generation task, end-to-end text generation based on a deep neural network is generally adopted, mainly built on the seq2seq framework: source text data is fed into the model, and the required text is generated directly. In this scenario, however, it is difficult to accumulate parallel corpora for training the generator. The common current method instead segments the text with a word segmentation tool and splices the noun parts together to obtain candidate phrases. The most significant phrases are then selected by statistical features such as tf-idf (term frequency-inverse document frequency) and PMI (pointwise mutual information).
However, end-to-end text generation based on a deep neural network needs a certain amount of training corpus to train a usable network, and in real business scenarios it is difficult to build a training set of sufficient size. In addition, the key phrases obtained by current methods are limited to noun-level phrases, include many meaningless phrases, and are often not semantically fluent.
4. Industry attributes must be added to topics, so that topic heat can be counted within an industry and compared across industries.
A news item may involve multiple labels, so this is a multi-label classification task. The common method sets templates such as keyword rules for each industry label and detects labels with those rule templates. Alternatively, deep learning is used: a neural network structure such as the Transformer is trained end to end on the task, and the model identifies the industry labels contained in the text. The number of news items corresponding to the hot topics under each industry is then counted and used as the measure of industry heat.
Text classification based on rule templates or neural networks only considers industry relevance for news about company entities, and lacks analysis of news at the industry level and the macro level. In addition, computing industry heat from news volume alone does not put the heat of different industries on a unified, fair scale, because some industries have far less news in absolute terms than others. When a significant event occurs in such an industry, it still needs to be reflected in the heat level so that it can be pushed to or highlighted for industry researchers.
In summary, the application provides a method, a device, a storage medium and equipment for mining hot topics.
Disclosure of Invention
The application aims to overcome the defects in the prior art and provide a hot topic mining method, a device, a storage medium and equipment, which can mine hot topics of various industries.
In order to achieve the above purpose, the application is realized by adopting the following technical scheme:
In one aspect, the application provides a hot topic mining method, which comprises the following steps:
acquiring news information within a period of time;
marking each news information by using an industry label system, and grouping each news information according to the industry label to obtain a plurality of news information groups;
dividing news headlines of each news information group into words to obtain a plurality of short words, and determining node weights of the short words;
determining key short words of each news information group by using node weights of the short words;
associating each news information with each keyword of the news information group to obtain a plurality of topic clusters;
and calculating the industry heat of each topic cluster, and determining the topic cluster with the highest industry heat of each news information group as the hot topic of the industry.
Further, the marking each news information using the industry label system includes:
in response to the content of the news information including the name of an entity company, labeling the news information with the industry to which the entity company belongs;
in response to the content of the news information including an industry keyword, labeling the news information with the industry to which the industry keyword belongs;
the industry keywords comprise industry names, industry main products and industry terminology.
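The two labeling rules and the grouping step can be sketched as follows. The lookup tables `COMPANY_TO_INDUSTRY` and `KEYWORD_TO_INDUSTRY`, the sample headlines and the plain substring matching are all assumptions made for illustration; the application does not specify how the dictionaries are stored or matched.

```python
from collections import defaultdict

# Hypothetical lookup tables; a real system would load curated company and keyword lists.
COMPANY_TO_INDUSTRY = {"Acme Steel": "steel", "Sunny Pharma": "medicine"}
KEYWORD_TO_INDUSTRY = {"rebar": "steel", "vaccine": "medicine", "steel": "steel"}

def label_news(text):
    """Return the industry labels triggered by company names or industry keywords."""
    labels = set()
    lowered = text.lower()
    for company, industry in COMPANY_TO_INDUSTRY.items():
        if company.lower() in lowered:   # rule 1: entity company name appears
            labels.add(industry)
    for keyword, industry in KEYWORD_TO_INDUSTRY.items():
        if keyword in lowered:           # rule 2: industry keyword appears
            labels.add(industry)
    return labels

def group_by_industry(news_items):
    """Group news items by label; an item with several labels joins several groups."""
    groups = defaultdict(list)
    for item in news_items:
        for industry in label_news(item):
            groups[industry].append(item)
    return dict(groups)

news = [
    "Acme Steel raises rebar prices",
    "Sunny Pharma ships new vaccine",
    "Steel makers and vaccine cold chains",
]
groups = group_by_industry(news)
print(groups)
```

Because one news item may carry several industry labels, the sketch lets it appear in more than one news information group.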
Further, the word segmentation is performed on the news headlines of each news information group to obtain a plurality of short words, and the determining the node weight of each short word includes:
dividing news headlines of each news information group into words to obtain a plurality of short words, and extracting syntax structures between the short words;
and determining node weights of the short words according to the syntax structure between the short words, the co-occurrence times of the short words and other short words in the news information of the affiliated news information group and the occurrence times of the short words in the news information of other news information groups.
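The co-occurrence and cross-group counts named above can be combined in a tf-idf style. The sketch below is one possible reading, not the application's definitive computation: it assumes the frequency term counts the other distinct short words appearing alongside the target word in its own group's headlines, and the inverse term is the logarithm of the total headline count over the number of headlines containing the word plus one. The function name `tfidf_weight` and the sample data are hypothetical.

```python
import math

def tfidf_weight(word, group_headlines, all_headlines):
    """Headlines are lists of already-segmented short words."""
    # frequency term: co-occurrences of `word` with other short words inside the group's headlines
    cooccurrences = sum(len(set(h) - {word}) for h in group_headlines if word in h)
    # inverse term: total headlines over (headlines containing `word` + 1)
    containing = sum(1 for h in all_headlines if word in h)
    return cooccurrences * math.log(len(all_headlines) / (containing + 1))

group = [["steel", "price", "rise"], ["steel", "output", "cut"]]
others = [["vaccine", "approval"], ["bank", "rate", "cut"], ["chip", "export"], ["retail", "sales"]]
all_headlines = group + others

steel_weight = tfidf_weight("steel", group, all_headlines)
cut_weight = tfidf_weight("cut", group, all_headlines)
print(steel_weight, cut_weight)
```

A short word that co-occurs often within its group scores higher than one that is equally widespread across headlines but less connected inside the group.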
Further, the node weight of each short word is determined by the following formulas:
$$\mathrm{tfidf}(i) = \mathrm{cooc}(i) \times \log\frac{\text{total number of news headlines}}{\text{number of news information containing the } i\text{th short word} + 1}$$
$$\text{Node weight}(i) = WS(V_i) \times \mathrm{tfidf}(i) \times \text{hot value}(i)$$
Wherein cooc(i) is the number of times the ith short word co-occurs with other short words in the news information of its news information group; WS(V_i) is the TextRank weight of the ith short word; d is a balance coefficient that avoids the closed-loop problem of short word node paths; V_j is the jth short word, V_i is the ith short word and V_k is the kth short word; w_ji is the similarity between the jth short word and the ith short word; w_jk is the similarity between the jth short word and the kth short word, where V_k ranges over the short words that have a syntactic structure with the jth short word; V_j ∈ In(V_i) denotes that a syntactic structure exists between the jth short word and the ith short word; tfidf(i) is the tf-idf weight of the ith short word; and hot value(i) is the number of platforms publishing news information containing the ith short word;
a syntactic structure exists between the jth short word and the ith short word, in which the jth short word is the start point and the ith short word is the end point; a syntactic structure exists between the jth short word and the kth short word, in which the kth short word is the start point and the jth short word is the end point.
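For reference, the symbols defined above match the standard weighted TextRank recurrence, reproduced here in its textbook form. This is an assumption about the formula the application relies on, not a copy of it; in particular, the inner set Out(V_j) is the standard notation, and the application's own wording about edge direction between the jth and kth short words should take precedence where it differs.

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)
```

In this form, d plays the role of the damping factor: the constant term (1 - d) keeps every node's score positive even when it sits on a loop or has no incoming links.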
Further, the determining the keyword of each news information group by using the node weight of each keyword includes:
and ordering the short words in each news information group according to the order of the node weights from large to small, and determining the first N short words in each news information group as key short words corresponding to the affiliated news information group.
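The ranking step can be sketched as follows, assuming the three factors of the node weight are multiplied together. The function name `top_n_short_words` and all numeric values are hypothetical and serve only to show the sort-and-truncate logic.

```python
def top_n_short_words(textrank, tfidf, hot_value, n):
    """Rank short words by TextRank * tf-idf * hot value and keep the first n."""
    weights = {w: textrank[w] * tfidf[w] * hot_value[w] for w in textrank}
    ranked = sorted(weights, key=weights.get, reverse=True)  # largest weight first
    return ranked[:n], weights

# Illustrative per-word factors for one news information group
textrank = {"steel": 0.4, "price": 0.3, "rise": 0.2, "today": 0.1}
tfidf = {"steel": 2.0, "price": 1.5, "rise": 1.0, "today": 0.2}
hot_value = {"steel": 3, "price": 2, "rise": 4, "today": 1}

keywords, weights = top_n_short_words(textrank, tfidf, hot_value, n=2)
print(keywords)
```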
Further, associating each news information with each keyword of the news information group to obtain a plurality of topic clusters includes:
randomly combining the key short words of the news information group to obtain a plurality of phrases;
and writing the news information containing each phrase into the corresponding topic cluster to obtain a plurality of topic clusters.
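A minimal sketch of the association step follows. It assumes that phrases are formed from pairs of key short words and that a news item belongs to a phrase's topic cluster when its title contains every word of the phrase; the application only says the key short words are combined, so the pair size, the substring matching and the name `build_topic_clusters` are illustrative choices.

```python
from itertools import combinations

def build_topic_clusters(titles, keywords, phrase_size=2):
    """Map each keyword combination to the titles that contain every word in it."""
    clusters = {}
    for phrase in combinations(keywords, phrase_size):
        members = [t for t in titles if all(word in t for word in phrase)]
        if members:  # keep only phrases that actually match some news item
            clusters[phrase] = members
    return clusters

titles = [
    "steel price rise continues",
    "steel output cut announced",
    "price rise hits steel mills",
]
clusters = build_topic_clusters(titles, ["steel", "price", "output"])
print(clusters)
```

A news item can match more than one phrase, so topic clusters may overlap.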
Further, the industry heat is calculated by the following formula:
$$\text{industry heat} = \frac{(\text{number of news information in the topic cluster not repeated in other topic clusters}) \times (\text{number of entities contained in the news information of the topic cluster})}{\text{number of topic clusters in the news information group}}$$
The entities comprise enterprise names and person names.
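The heat calculation can be sketched as follows, reading the formula as the count of news items unique to a cluster, times the count of entities in the cluster's news, divided by the number of clusters in the group. Counting distinct entities rather than mentions is an assumption, and the identifiers and sample data are hypothetical.

```python
def industry_heat(clusters, entities_by_news):
    """clusters: {topic: set of news ids}; entities_by_news: {news id: set of entity names}."""
    heats = {}
    for topic, news_ids in clusters.items():
        # news ids that also appear in some other topic cluster
        others = set().union(*(ids for t, ids in clusters.items() if t != topic))
        unique_news = news_ids - others
        # distinct entities (enterprise or person names) mentioned in this cluster's news
        entities = set().union(*(entities_by_news.get(n, set()) for n in news_ids))
        heats[topic] = len(unique_news) * len(entities) / len(clusters)
    return heats

clusters = {"price rise": {1, 2, 3}, "output cut": {3, 4}}
entities_by_news = {1: {"Acme Steel"}, 2: {"Acme Steel", "Jane Doe"}, 3: {"Bao Mills"}, 4: set()}

heats = industry_heat(clusters, entities_by_news)
hottest = max(heats, key=heats.get)
print(heats, hottest)
```

Dividing by the number of clusters in the group normalizes the score, so an industry with little news in absolute terms can still surface a hot topic.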
On the other hand, the invention provides a hot topic mining device, comprising:
the acquisition module is used for acquiring news information in a period of time;
the marking module is used for marking each news information by utilizing an industry label system, and grouping each news information according to the industry label to obtain a plurality of groups of news information groups;
the weight determining module is used for word segmentation of the news headlines of each news information group to obtain a plurality of short words and determining node weights of the short words;
the keyword determining module is used for determining the keywords of each news information group by using the node weights of the keywords;
The association module is used for associating each news information with each keyword of the news information group to obtain a plurality of topic clusters;
the trending topic determination module is used for calculating the industry trending degree of each topic cluster and determining the topic cluster with the highest industry trending degree of each news information group as the trending topic of the industry.
In another aspect, the invention provides a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods described above.
In another aspect, the invention provides a computing device comprising one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods described above.
Compared with the prior art, the invention has the beneficial effects that:
by marking and grouping the news information by industry, the invention strengthens the industry attribute of each news item; topic clusters of each news information group are determined from the news headlines, and hot topics with industry attributes are determined from the industry heat of each topic cluster, so that the hot topics better meet the needs of application scenarios.
Drawings
FIG. 1 is a flow chart of one embodiment of a method of mining trending topics of the present invention;
FIG. 2 is a flow chart of one embodiment of an industry marking of the present invention;
FIG. 3 is a flow chart illustrating one embodiment of a first layer clustering module of the present invention;
FIG. 4 is a flow chart illustrating one embodiment of a second-level clustering module of the present invention;
FIG. 5 is a schematic diagram illustrating one embodiment of a syntax structure of the present invention;
FIG. 6 is a diagram illustrating one embodiment of blacklist filtering according to the present invention;
FIG. 7 is a schematic diagram of an embodiment of the node weighting algorithm of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", etc. may explicitly or implicitly include one or more such feature. In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art in a specific case.
Example 1
The invention belongs to the technical field of artificial intelligence natural language processing, and particularly relates to a method for automatically analyzing information news in the current market through software, automatically generating key phrase representations related to hot topics in different industries, and assisting an industrial researcher to research the current industry hot spot through pushing and statistical display modes.
Industry trending topic detection adds the analysis and aggregation of industry public opinion events, so that trending public opinion events are explored from the perspective of an industry. Unlike hot public opinion event aggregation, this task must aggregate news from a more abstract perspective: an aggregated topic is not necessarily tied to a single subject, but may be a public opinion event affecting an industry as a whole, or a change across the whole industry caused by a macro-level policy.
The embodiment introduces a hot topic mining method.
Referring to fig. 1, the mining method of the present embodiment includes the steps of:
s1, acquiring news information in a period of time;
s2, marking each news information by using an industry label system, and grouping each news information according to the industry label to obtain a plurality of groups of news information groups;
s3, dividing the news headlines of each news information group to obtain a plurality of short words, and determining node weights of the short words;
s4, determining key short words of each news information group by using node weights of the short words;
s5, associating each news information with each keyword of the news information group to obtain a plurality of topic clusters;
s6, calculating the industry heat of each topic cluster, and determining the topic cluster with the highest industry heat of each news information group as the hot topic of the industry.
By marking and grouping the news information by industry, the invention strengthens the industry attribute of each news item; topic clusters of each news information group are determined from the news headlines, and hot topics with industry attributes are determined from the industry heat of each topic cluster, so that the hot topics better meet the needs of application scenarios.
Example 2
On the basis of embodiment 1, this embodiment describes in detail a method for mining a trending topic.
S1, acquiring news information in a period of time.
In application, the news information acquired in this embodiment carries its view count and forward count.
S2, marking each news information by using an industry label system, and grouping each news information according to the industry labels to obtain a plurality of news information groups.
When the method is applied, the marking of the acquired news information by utilizing the industry label system comprises the following steps:
S21, in response to the content of the news information including the name of an entity company, labeling the news information with the industry to which the entity company belongs;
S22, in response to the content of the news information including an industry keyword, labeling the news information with the industry to which the industry keyword belongs;
the industry keywords comprise industry names, industry main products and industry terminology.
S3, dividing the news headlines of each news information group to obtain a plurality of short words, and determining node weights of the short words.
When in use, the step S3 comprises the following steps:
s31, dividing the titles of the news information in each news information group to obtain a plurality of short words, and extracting the syntax structures between the short words;
s32, determining node weights of the short words according to the syntax structure between the short words, the co-occurrence times of the short words and other short words in the news information of the affiliated news information group and the occurrence times of the short words in the news information of the other news information group.
The node weights are given by the following formulas:
$$\mathrm{tfidf}(i) = \mathrm{cooc}(i) \times \log\frac{\text{total number of news headlines}}{\text{number of news information containing the } i\text{th short word} + 1}$$
$$\text{Node weight}(i) = WS(V_i) \times \mathrm{tfidf}(i) \times \text{hot value}(i)$$
Wherein cooc(i) is the number of times the ith short word co-occurs with other short words in the news information of its news information group; WS(V_i) is the TextRank weight of the ith short word; d is a balance coefficient that avoids the closed-loop problem of short word node paths; V_j is the jth short word, V_i is the ith short word and V_k is the kth short word; w_ji is the similarity between the jth short word and the ith short word; w_jk is the similarity between the jth short word and the kth short word, where V_k ranges over the short words that have a syntactic structure with the jth short word; V_j ∈ In(V_i) denotes that a syntactic structure exists between the jth short word and the ith short word; tfidf(i) is the tf-idf weight of the ith short word; and hot value(i) is the number of platforms publishing news information containing the ith short word;
a syntactic structure exists between the jth short word and the ith short word, in which the jth short word is the start point and the ith short word is the end point; a syntactic structure exists between the jth short word and the kth short word, in which the kth short word is the start point and the jth short word is the end point.
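The TextRank weight WS(V_i) used in the node weight can be computed iteratively over the graph of syntactic links. The sketch below uses the standard weighted TextRank update as an assumed stand-in for the application's own calculation. The edge list, the unit similarity weights, the damping value 0.85 and the name `weighted_textrank` are illustrative; each edge runs from the start-point short word to the end-point short word.

```python
def weighted_textrank(edges, d=0.85, iterations=50):
    """edges: {(source, target): weight} for directed syntactic links between short words."""
    nodes = {n for edge in edges for n in edge}
    # total outgoing weight per node, used to normalise each contribution
    out_sum = {n: 0.0 for n in nodes}
    for (src, _), w in edges.items():
        out_sum[src] += w
    scores = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # standard update: (1 - d) + d * sum over incoming links of w / out_sum * score
        scores = {
            i: (1 - d) + d * sum(
                w / out_sum[j] * scores[j] for (j, tgt), w in edges.items() if tgt == i
            )
            for i in nodes
        }
    return scores

# "rise" modifies "price" and "steel"; "price" modifies "steel" (illustrative links)
edges = {("price", "steel"): 1.0, ("rise", "price"): 1.0, ("rise", "steel"): 1.0}
scores = weighted_textrank(edges)
print(sorted(scores, key=scores.get, reverse=True))
```

The short word that receives the most weighted incoming links ends up with the highest score, which is what makes it a candidate key short word once the tf-idf and hot value factors are applied.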
S4, determining the key short words of each news information group by using the node weights of the short words.
When in use, step S4 includes:
and ordering the short words in each news information group according to the order of the node weights from large to small, and determining the first N short words in each news information group as key short words corresponding to the affiliated news information group.
According to the invention, the short words are combined through the grammar structure, so that the key phrase with grammatical meaning is obtained, and the key phrase information content is higher and the readability is better.
S5, associating each news information with each keyword of the news information group to obtain a plurality of topic clusters.
When in use, step S5 includes:
s51, randomly combining the key short words of the news information group to obtain a plurality of phrases;
s52, writing the news information containing each phrase into the corresponding topic cluster to obtain a plurality of topic clusters.
S6, calculating the industry heat of each topic cluster, and determining the topic cluster with the highest industry heat of each news information group as the hot topic of the industry.
In application, the industry heat is calculated by the following formula:
$$\text{industry heat} = \frac{(\text{number of news information in the topic cluster not repeated in other topic clusters}) \times (\text{number of entities contained in the news information of the topic cluster})}{\text{number of topic clusters in the news information group}}$$
The entities comprise enterprise names and person names.
Example 3
On the basis of embodiment 1 or 2, this embodiment describes in detail a method of mining a trending topic.
Based on an analysis of news information spanning one year, this embodiment establishes a set of mining methods that differs completely from traditional hot event aggregation: given a list of news headlines within a specified time span, it performs aggregation analysis, short word extraction, news information association and industry attribute fusion on the news information.
3.1 industry marking
This embodiment provides an industry marking service module, which marks the news information with industry labels so that each news item carries labels describing its industry attributes.
The first-level industries in the Shenwan industry label system comprise urban investment, pharmaceuticals, food and beverage, agriculture, forestry, animal husbandry and fishery, real estate, nonferrous metals, steel, coal, transportation, communication, trade and retail, textiles and apparel, electronics, utilities, household appliances, computers, electrical equipment and new energy, construction, automobiles, conglomerates, finance, building materials, light industry, national defense and military industry, social services, media, machinery, chemicals and precious metals.
In addition to labeling news information with a universal industry label system, it is also necessary to analyze whether the news information itself is industry-level or macro-level news content.
In practice, the industry marking service module includes a subject-detection algorithm model for extracting subject entities, such as company names, organization names and person names, from the headline of the news information.
The aim is to capture news information that is influential, forward-looking and acts as a weather vane for an industry, for example major events at companies that play an important role in the industry and the influence of core products on the industry.
The labeling needs to follow these principles:
1. If the news information includes a macro, industry-level keyword of a certain industry, it is regarded as news information of that industry. The industry keywords can be collected and compiled from the main-product terms in industry-chain data and in companies' periodic financial reports.
2. News information about listed companies whose market capitalization ranks near the top of an industry, or whose revenue accounts for a large share of the whole industry, is regarded as news information of that industry, since events in such news generally have a larger influence on the whole industry.
3. When the news information mentions the name of a company belonging to a certain industry, it is regarded as news information of that industry.
4. Policy releases and emerging trends in a certain industry are regarded as news information of that industry.
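Principles 1 and 3 above can be illustrated with a minimal rule-based sketch. It is only a stand-in for intuition, not the BERT-based model of this embodiment; the lookup tables, the company names and the function name `label_industries` are hypothetical.

```python
from typing import Dict, Set

# Hypothetical lookup tables. In the text, keywords come from industry-chain
# data and company financial reports; these entries are invented examples.
INDUSTRY_KEYWORDS: Dict[str, Set[str]] = {
    "automobiles": {"new energy vehicle", "car sales"},
    "real estate": {"housing prices", "land auction"},
}
COMPANY_INDUSTRY: Dict[str, str] = {
    "ExampleMotor": "automobiles",
    "ExampleEstate": "real estate",
}


def label_industries(headline: str) -> Set[str]:
    """Return every industry whose keyword or company name occurs in the headline."""
    text = headline.lower()
    # Principle 1: an industry-level keyword appears in the headline.
    labels = {industry for industry, words in INDUSTRY_KEYWORDS.items()
              if any(word in text for word in words)}
    # Principle 3: a company belonging to the industry is named.
    labels |= {industry for name, industry in COMPANY_INDUSTRY.items()
               if name.lower() in text}
    return labels
```

A headline can receive several labels at once, which matches the later statement that one piece of news information may belong to multiple industries.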
Based on these principles, industry labels are assigned by an industry marking analysis model. The model is built on the BERT model described in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding, public website: https://arxiv.org/abs/1810.04805), which uses the Transformer as its backbone network and is obtained through large-scale language pre-training.
In the scenario of this embodiment, a financial corpus and a financial lexicon are introduced and the BERT model is further trained on the financial domain, yielding an industry marking analysis model better suited to the financial domain; see Fig. 2.
The coarsely clustered news information is then input into the second-layer clustering module, which analyzes and clusters it at the topic level and merges and filters the groups to obtain high-quality news information groups.
3.2 First-layer clustering module
All news information is first grouped by industry label. Because the same piece of news information may be associated with multiple industries, it may be assigned to multiple news information groups.
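The grouping step can be sketched as follows, assuming each news item has already been given a set of industry labels. The function name `group_by_industry` and the input shape are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def group_by_industry(labeled_news: Iterable[Tuple[str, Set[str]]]) -> Dict[str, List[str]]:
    """Group headlines by industry label.

    labeled_news yields (headline, set_of_industry_labels) pairs.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for headline, labels in labeled_news:
        for label in labels:
            # A multi-industry item is copied into every matching group.
            groups[label].append(headline)
    return dict(groups)
```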
The embodiment is provided with a first-layer clustering module, and each group of news information groups with industry attributes is input into the first-layer clustering module to perform coarse clustering.
First, the subject recorded in each news headline is identified, the corresponding subject label is attached to the news information, and the news information in a news information group is grouped by subject label to obtain a plurality of subject news groups. A vector representation of each news headline is then extracted with the BERT model; in this embodiment each news headline is represented by a 768-dimensional vector.
In application, the BERT model may assign excessive weight to certain words or characters in the headline text, so the resulting vector representations are anisotropic and unevenly distributed. If they were used directly for cluster analysis, different information could not be distinguished and information collapse would result. This embodiment therefore applies the whitening operation described in the paper "Whitening Sentence Representations for Better Semantics and Faster Retrieval" (Jianlin Su, Jiarun Cao, Weijie Liu, Yangyiwen Ou. Whitening Sentence Representations for Better Semantics and Faster Retrieval, public website: https://arxiv.org/abs/2103.15316): a linear transformation is applied to the sentence vectors to correct their covariance matrix, finally giving more uniformly distributed vector representations.
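The linear transformation can be written out as below. This follows the general form of the whitening operation in the cited paper; the symbols (row vectors x_i, mean μ, covariance Σ, matrices U, Λ, W) are notation assumed here and do not appear in this text.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\Sigma = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^{\top}(x_i - \mu)

\Sigma = U \Lambda U^{\top}, \qquad
W = U \Lambda^{-1/2}, \qquad
\tilde{x}_i = (x_i - \mu)\, W
```

With this choice the covariance of the transformed vectors is W^T Σ W = Λ^{-1/2} U^T U Λ U^T U Λ^{-1/2} = I, so every direction has unit variance and no single direction dominates the distance computations used in clustering.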
Clustering these vectors yields coarse-grained news clusters. Traditional methods cluster with K-means, whose biggest drawback is that it cannot handle abnormal data points.
In practice, some news headlines record no subject, and some subject news groups contain fewer news information items than a preset value and therefore cannot be clustered; the clustering model must also be able to handle such data.
This embodiment therefore performs cluster analysis on the vector representations with the DBSCAN clustering method.
DBSCAN is a density-based method: it finds all dense regions of the vector sample points and treats each dense region as one cluster. Outliers far from the dense regions are written together into a dedicated outlier set; see Fig. 3.
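The density-based idea can be sketched in plain Python as follows. This is a textbook-style DBSCAN written for illustration, not the implementation used in this embodiment; the parameter names `eps` and `min_pts` and the choice of Euclidean distance are assumptions.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1  # label used for the outlier set


def region_query(points: Sequence[Sequence[float]], idx: int, eps: float) -> List[int]:
    """Indices of all points within distance eps of points[idx], itself included."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points: Sequence[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Return one cluster id per point; NOISE marks outliers."""
    labels: List[Optional[int]] = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point of the current cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
        cluster += 1
    return labels
```

Unlike K-means, the number of clusters is not fixed in advance, and points that belong to no dense region are kept apart under the `NOISE` label rather than forced into a cluster.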
3.3 Second-layer clustering module
The clusters produced by the first-layer clustering module are still relatively coarse. To obtain more abstract news information containing the macro level of the industry, this embodiment provides a second-layer clustering module. It uses the biterm topic model (BTM) described in the paper "A biterm topic model for short texts" (Yan X, Guo J, Lan Y, et al. A biterm topic model for short texts, public website: https://dl.acm.org/doi/10.1145/2488388.2488514) to learn the subject-topic probability distribution of each subject news group.
In application, BTM adapts to short text better than latent Dirichlet allocation (LDA). The BTM learning process is broadly similar to that of LDA, but before learning the probability distributions it processes the words of a document differently, working on biterms. A biterm is a word pair formed by taking two words from the word set of a segmented document, and learning takes these word pairs as the basic element. In addition, unlike LDA, BTM uses one set of probability distribution parameters for all documents, which alleviates topic sparsity and makes BTM well suited to short texts with few words; moreover, a news headline in many cases contains only one subject topic.
BTM is trained iteratively on the center-point headlines of the coarse clusters obtained by the first-layer clustering module, finally giving the subject-topic probability distribution of each topic cluster and candidate industry keywords. K-means is then used to cluster the vectors of subject-topic probability distributions. K-means is used here because, after the first-layer clustering, the granularity of the clusters is already controlled and the second layer contains very few abnormal points, which can be discarded; the more time-consuming DBSCAN method is therefore unnecessary and the clustering time is shortened.
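The biterm construction described above can be sketched as follows. Only the word-pair extraction is shown, not BTM inference itself; the function names are hypothetical and the tokens are assumed to be already segmented.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Sequence, Tuple

Biterm = Tuple[str, str]


def extract_biterms(tokens: Sequence[str]) -> List[Biterm]:
    """All unordered word pairs from one segmented short text."""
    # Sorting each pair makes (a, b) and (b, a) the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterms(docs: Iterable[Sequence[str]]) -> Counter:
    """Count biterms over the whole corpus; BTM models this corpus-level set."""
    counts: Counter = Counter()
    for tokens in docs:
        counts.update(extract_biterms(tokens))
    return counts
```

Because the counts are pooled over all headlines, a pair that is rare within any single short headline can still accumulate enough evidence across the corpus.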
In practical application, the output of the second-layer clustering module needs to be refined; see Fig. 4.
Different news information receives different degrees of attention, measured for example by click-through rate, forward rate and return rate. The topic popularity value of each subject news group can therefore be determined by combining the attention received by each news information item with the number of news information items in the group.
In addition, news information whose degree of overlap exceeds a preset threshold is merged, and news information carrying little information, such as notices and advance announcements, is filtered out.
With the first-layer clustering module, the invention reduces the noise and volume of news information; with the second-layer clustering module, it fully mines industry attributes and improves the quality of the news information.
3.4 Key phrase extraction and news association module
Within each industry, the second-layer clustering module yields subject news groups of higher quality, but half of them are still strongly tied to the subject and lack industry attributes, and the industry keywords obtained from them are scattered and carry little information.
Compared with single words, phrases have a larger granularity, carry richer information and have more semantic structure; common verb-object phrases, subject-predicate phrases and the like can clearly express a complete subject. This embodiment uses LAC, described in the paper on a Chinese sequence labeling model based on bidirectional GRU-CNN-CRF (Liu D, Zou X. Sequence labeling of Chinese text based on bidirectional Gru-Cnn-Crf model, public website: https://ieeexplore.ieee.org/document/8632570), to perform word segmentation, part-of-speech tagging, entity recognition and similar operations on news headlines. It then uses the DDParser model described in the paper "A practical Chinese dependency parser based on a large-scale dataset" (Zhang S, Wang L, Sun K, et al. A practical Chinese dependency parser based on a large-scale dataset, public website: https://arxiv.org/abs/2009.00901) to analyze the short words and the syntactic structure between them, extracting the syntactic structures in each news headline as triples {span1, rel, span2}, where span1 is the starting short word of the syntactic structure, span2 is the ending short word, and rel is the type of syntactic structure, such as the subject-predicate structure (SBV), the verb-object structure (VOB) or the attributive-head relation (ATT); see Fig. 5. In addition, this embodiment sets a blacklist of sensitive short words and parts of speech, and short words matching the blacklist are filtered out. Referring to Fig. 6, for the original news headline "Ministry of Industry and Information Technology: new energy vehicle enterprises are encouraged to merge and reorganize to become bigger and stronger", the colon, "encourage" and "become stronger" are filtered out.
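The triple structure and the blacklist filter can be sketched as below. The triples are assumed to come from an upstream parser; the example triples, the part-of-speech tag `w` for punctuation and all names here are hypothetical and are not output of LAC or DDParser.

```python
from typing import Dict, Iterable, List, NamedTuple, Set


class Triple(NamedTuple):
    span1: str  # starting short word of the syntactic structure
    rel: str    # structure type, e.g. "SBV", "VOB", "ATT"
    span2: str  # ending short word of the syntactic structure


def filter_triples(triples: Iterable[Triple],
                   blacklist_words: Set[str],
                   blacklist_pos: Set[str],
                   pos_of: Dict[str, str]) -> List[Triple]:
    """Drop triples in which either short word is blacklisted by text or by part of speech."""
    def blocked(span: str) -> bool:
        return span in blacklist_words or pos_of.get(span) in blacklist_pos

    return [t for t in triples if not (blocked(t.span1) or blocked(t.span2))]
```

Filtering at the triple level removes a blacklisted short word together with every syntactic relation it takes part in, so it cannot later become a graph node.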
The node weight of each short word is determined from the number of times it co-occurs, with a syntactic-structure relation, with other short words in the news information of its own news information group, and from the number of times it appears in the news information of other news information groups.
This embodiment calculates the node weights of the short words with the TextRank graph algorithm described in the paper "TextRank: Bringing Order into Text" (Mihalcea R, Tarau P. TextRank: Bringing order into text, public website: https://aclanthology.org/W04-3252).
In application, following Google's PageRank algorithm, each short word is treated as a node and each relation between short words as an edge, and the node weight of each short word is calculated with a random-walk graph algorithm from the co-occurrence statistics of the short words in the news text. The more other short words a short word co-occurs with, the more information it carries and the larger the weight of its node. Referring to Fig. 7, A and B co-occur with a syntactic structure in which A is the starting short word and B the ending short word; B and C co-occur with a syntactic structure in which B is the starting short word and C the ending short word. The node weight of B must therefore be taken into account when calculating the node weight of C, and the node weight of A when calculating that of B.
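The A, B, C example can be reproduced with a small weighted TextRank iteration. This is a sketch under assumptions: edge weights stand for co-occurrence counts, the damping value 0.85 and the fixed iteration count are conventional choices rather than values given in this text, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def textrank(edges: Iterable[Tuple[str, str, float]],
             d: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Weighted TextRank on a directed graph.

    edges holds (start_word, end_word, weight) for each syntactic relation;
    weights are assumed to be positive.
    """
    out_weight: Dict[str, float] = defaultdict(float)  # total outgoing weight per node
    incoming: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    nodes = set()
    for src, dst, w in edges:
        out_weight[src] += w
        incoming[dst].append((src, w))
        nodes.update((src, dst))
    score = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Each node collects score from the nodes pointing at it, in proportion
        # to the share of the source's outgoing weight carried by that edge.
        score = {
            v: (1 - d) + d * sum(w / out_weight[src] * score[src]
                                 for src, w in incoming[v])
            for v in nodes
        }
    return score
```

For the chain A to B to C, the score of C depends on B, which in turn depends on A, matching the dependency described above.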
In practical application, the term frequency of the short word across all news headlines and its inverse document frequency also need to be considered when calculating the node weight. That is, if a short word appears many times in one news information group but fewer than a preset number of times in the other news information groups, it is distinctive of its own news information group and its node weight is larger.
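One way to write the combined weight is given below. The first line is the weighted TextRank score in the form given in the cited TextRank paper, including its Out(V_j) notation; the second and third lines follow the wording of claim 4, with multiplication as the assumed reading of the operators and with c_i, N and n_i as shorthand introduced here (co-occurrence count of the i-th short word within its group, total number of news headlines, and number of news items containing the i-th short word).

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)}
          \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)

\mathrm{tfidf}(i) = c_i \cdot \log \frac{N}{n_i + 1}

\mathrm{node\ weight}(i) = WS(V_i) \cdot \mathrm{tfidf}(i) \cdot \mathrm{hot\ value}(i)
```

The logarithmic factor shrinks as the short word appears in more news items overall, so a short word that is frequent everywhere contributes less than one concentrated in a single news information group.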
In this embodiment, the short words of each news information group are sorted by node weight in descending order, and the first 50 are taken as the key short words of that news information group. The key short words serve as topic aggregation seeds for the corresponding industry: each news item is analyzed for association with the key short words, a news item is associated with a key short word when its headline contains it, and the news items associated with similar combinations of short words naturally form a topic cluster. Topic clusters obtained in this way have industry attributes and are also of higher quality.
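Selecting the key short words and associating headlines with them can be sketched as follows. Plain substring matching is an assumption made for illustration, and both function names are hypothetical.

```python
from typing import Dict, List, Sequence


def top_key_short_words(node_weights: Dict[str, float], n: int = 50) -> List[str]:
    """Short words sorted by descending node weight, truncated to the first n."""
    ranked = sorted(node_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]


def associate(headlines: Sequence[str], key_words: Sequence[str]) -> Dict[str, List[str]]:
    """Map each key short word to the headlines that contain it."""
    return {k: [h for h in headlines if k in h] for k in key_words}
```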
In application, the main difficulty of association is that different key short words must be combined and analyzed so that each topic cluster is cohesive, that is, the news topics within one topic cluster are consistent, while each topic cluster also has industry attributes, that is, the news information within one topic cluster contains both subject-related and industry-related content. In this embodiment the key short words are therefore broken up and combined arbitrarily in different numbers, and for each combination the following association principles are applied to associate news and obtain a plurality of topic clusters:
1. The number of news information in the topic cluster cannot be too small, and a certain number threshold needs to be met;
2. If the proportion of news information shared by two topic clusters exceeds a preset proportion, the two topic clusters are merged;
3. If the proportion of key short words shared by two topic clusters exceeds a preset repetition value, the two topic clusters are merged.
In practical application, the association principles further include a phrase blacklist containing one or more blacklisted phrases: if the proportion of blacklisted phrases contained in a topic cluster exceeds a phrase threshold, the topic cluster is filtered out. This finally yields a plurality of topic clusters of better quality.
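The size threshold and the merge on shared news information can be sketched as below. The text does not define how the overlap proportion is measured; measuring it against the smaller cluster, and the default thresholds, are assumptions, as is the function name.

```python
from typing import Iterable, List, Set


def merge_topic_clusters(clusters: Iterable[Set[int]],
                         min_size: int = 2,
                         overlap_ratio: float = 0.5) -> List[Set[int]]:
    """Merge clusters of news ids that overlap strongly, then drop small ones."""
    merged: List[Set[int]] = []
    for cluster in clusters:
        cluster = set(cluster)
        if not cluster:
            continue
        changed = True
        while changed:  # keep absorbing until nothing overlaps enough
            changed = False
            for other in merged:
                # Assumed definition: shared items relative to the smaller cluster.
                overlap = len(cluster & other) / min(len(cluster), len(other))
                if overlap > overlap_ratio:
                    cluster |= other
                    merged.remove(other)
                    changed = True
                    break
        merged.append(cluster)
    # Principle 1: a topic cluster must contain enough news information.
    return [c for c in merged if len(c) >= min_size]
```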
3.5 Industry heat calculation
The industry heat of every topic cluster of each news information group is calculated, and the topic cluster with the highest industry heat is determined to be the hot topic of the industry of that news information group.
After the topic clusters of each industry, that is, the candidate hot topics, have been obtained, an industry heat for the whole industry also needs to be output so that different industries can be compared with one another.
Data analysis shows that relying simply on the number of news items or the number of topic clusters is not enough to objectively represent the real heat of an industry. The industry heat is therefore calculated from a combination of features, mainly the following:
1. The number of non-repeated news information items within an industry. The more news information there is, the hotter the industry is to a certain extent.
2. The number of all entities associated with the news information within an industry. If news events in an industry involve many subjects, the industry is relatively hot.
3. The number of topic clusters. This is a negatively correlated feature. Analysis of the data shows that many industries have a lot of news information and many topic clusters but very little news information in each topic cluster, whereas some industries, such as real estate, have a lot of news information concentrated in certain topic clusters, which indicates that those topic clusters are important. The number of topic clusters is therefore used to adjust the scale, so that the industry heat finally obtained is on a broadly consistent scale across industries.
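The three features can be combined as in the industry heat formula stated earlier in this document. The sketch below assumes that the first two features are multiplied, that entities are counted as distinct names, and that news items are identified by integer ids; the function name and input shapes are hypothetical.

```python
from typing import Dict, Set


def industry_heat(topic_clusters: Dict[str, Set[int]],
                  entities_of: Dict[int, Set[str]]) -> Dict[str, float]:
    """Heat per topic cluster within one news information group.

    topic_clusters maps a cluster name to its set of news ids.
    entities_of maps a news id to the entity names found in that news item.
    """
    n_clusters = len(topic_clusters)
    heats: Dict[str, float] = {}
    for name, news in topic_clusters.items():
        others = set().union(*(ids for key, ids in topic_clusters.items() if key != name))
        unique_news = news - others  # news not repeated in other clusters
        entities = set().union(*(entities_of.get(i, set()) for i in news))
        heats[name] = len(unique_news) * len(entities) / n_clusters
    return heats
```

The hot topic of the group is then the key with the largest value, for example `max(heats, key=heats.get)`.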
Example 4
The present embodiment provides a mining device for trending topics, including:
the acquisition module is used for acquiring news information in a period of time;
the marking module is used for marking each news information by utilizing an industry label system, and grouping each news information according to the industry label to obtain a plurality of groups of news information groups;
the weight determining module is used for segmenting the news headlines of each news information group into a plurality of short words and determining the node weight of each short word;
the key short word determining module is used for determining the key short words of each news information group by using the node weights of the short words;
the association module is used for associating each news information with each key short word of the news information group to obtain a plurality of topic clusters;
the trending topic determination module is used for calculating the industry trending degree of each topic cluster and determining the topic cluster with the highest industry trending degree of each news information group as the trending topic of the industry.
Implementation of specific functions of the above functional modules is according to the methods described in embodiments 1-3.
Example 5
This embodiment introduces a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform the method of any of embodiments 1-3.
Example 6
This embodiment describes a computing device comprising one or more processors, memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing the methods recited in any of embodiments 1-3.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present invention and the scope of the claims, which are all within the protection of the present invention.

Claims (10)

1. A trending topic mining method, characterized by comprising the following steps:
acquiring news information within a period of time;
marking each news information by using an industry label system, and grouping each news information according to the industry label to obtain a plurality of news information groups;
dividing news headlines of each news information group into words to obtain a plurality of short words, and determining node weights of the short words;
determining key short words of each news information group by using node weights of the short words;
associating each news information with each key short word of the news information group to obtain a plurality of topic clusters;
and calculating the industry heat of each topic cluster, and determining the topic cluster with the highest industry heat of each news information group as the hot topic of the industry.
2. The method of claim 1, wherein the marking each news information using an industry label system comprises:
responding to the news information content including entity company names, and labeling the news information with the industry to which the entity company belongs;
responding to the content of the news information including an industry keyword, and labeling the news information with an industry to which the industry keyword belongs;
The industry keywords comprise industry names, industry main products and industry terminology.
3. The method for mining trending topics of claim 1 wherein the word segmentation of news headlines for each news information set to obtain a plurality of short words and determining node weights for each short word comprises:
dividing news headlines of each news information group into words to obtain a plurality of short words, and extracting syntax structures between the short words;
and determining node weights of the short words according to the syntax structure between the short words, the co-occurrence times of the short words and other short words in the news information of the affiliated news information group and the occurrence times of the short words in the news information of other news information groups.
4. The method of mining trending topics of claim 3, wherein determining the node weight of each short word comprises the formulas:
tfidf(i) = (number of times the i-th short word co-occurs with other short words in the news information of its news information group) × log(total number of news headlines / (number of news information items containing the i-th short word + 1))
node weight = WS(V_i) × tfidf(i) × hot value(i)
wherein WS(V_i) is the TextRank weight of the i-th short word; d is a balance coefficient that avoids closed loops in the short-word node paths; V_j is the j-th short word, V_i is the i-th short word and V_k is the k-th short word; w_ji is the similarity between the j-th short word and the i-th short word; w_jk is the similarity between the j-th short word and the k-th short word; tfidf(i) is the TF-IDF weight of the i-th short word; and hot value(i) is the number of platforms publishing news information that contains the i-th short word;
V_j ∈ In(V_i) denotes that the j-th short word and the i-th short word have a syntactic structure in which the j-th short word is the starting short word and the i-th short word is the ending short word; the j-th short word and the k-th short word likewise have a syntactic structure, in which the k-th short word is the starting short word and the j-th short word is the ending short word.
5. The method of claim 1, wherein determining key words of each news information group using node weights of each word comprises:
and ordering the short words in each news information group according to the order of the node weights from large to small, and determining the first N short words in each news information group as key short words corresponding to the affiliated news information group.
6. The method of claim 1, wherein associating each news information with each key short word of the news information group to obtain a plurality of topic clusters comprises:
randomly combining the key short words of the news information group to obtain a plurality of phrases;
and writing the news information containing each phrase into the corresponding topic cluster to obtain a plurality of topic clusters.
7. The method of mining trending topics of claim 1, wherein the industry heat is calculated by the following formula:
industry heat = (number of news information items in the topic cluster that do not repeat those of other topic clusters × number of entities contained in the news information in the topic cluster) / number of topic clusters in the news information group
The entities comprise enterprise names and person names.
8. A mining device for trending topics, comprising:
the acquisition module is used for acquiring news information in a period of time;
the marking module is used for marking each news information by utilizing an industry label system, and grouping each news information according to the industry label to obtain a plurality of groups of news information groups;
the weight determining module is used for word segmentation of the news headlines of each news information group to obtain a plurality of short words and determining node weights of the short words;
the key short word determining module is used for determining the key short words of each news information group by using the node weights of the short words;
the association module is used for associating each news information with each key short word of the news information group to obtain a plurality of topic clusters;
the trending topic determination module is used for calculating the industry trending degree of each topic cluster and determining the topic cluster with the highest industry trending degree of each news information group as the trending topic of the industry.
9. A computer readable storage medium storing one or more programs, wherein the one or more programs comprise instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-7.
10. A computing device comprising one or more processors, memory, and one or more programs, wherein one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods of claims 1-7.
CN202310432524.7A 2023-04-21 2023-04-21 Trending topic mining method, device, storage medium and equipment Pending CN116738068A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310432524.7A CN116738068A (en) 2023-04-21 2023-04-21 Trending topic mining method, device, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310432524.7A CN116738068A (en) 2023-04-21 2023-04-21 Trending topic mining method, device, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN116738068A true CN116738068A (en) 2023-09-12

Family

ID=87908700

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310432524.7A Pending CN116738068A (en) 2023-04-21 2023-04-21 Trending topic mining method, device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN116738068A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117828152A (en) * 2023-11-30 2024-04-05 南京汇编交通科技有限公司 Hot word mining method and system based on big data

Similar Documents

Publication Publication Date Title
CN103049435B (en) Text fine granularity sentiment analysis method and device
Jotheeswaran et al. OPINION MINING USING DECISION TREE BASED FEATURE SELECTION THROUGH MANHATTAN HIERARCHICAL CLUSTER MEASURE.
CN111914087B (en) Public opinion analysis method
Karandikar Clustering short status messages: A topic model based approach
CN113962293B (en) LightGBM classification and representation learning-based name disambiguation method and system
CN106126605B (en) Short text classification method based on user portrait
Tang et al. Learning sentence representation for emotion classification on microblogs
US11886515B2 (en) Hierarchical clustering on graphs for taxonomy extraction and applications thereof
Ye et al. A web services classification method based on GCN
Sadr et al. Unified topic-based semantic models: a study in computing the semantic relatedness of geographic terms
CN114997288A (en) Design resource association method
CN116738068A (en) Trending topic mining method, device, storage medium and equipment
Dadhich et al. Social & juristic challenges of AI for opinion mining approaches on Amazon & flipkart product reviews using machine learning algorithms
Chen et al. Popular topic detection in Chinese micro-blog based on the modified LDA model
CN111061939B (en) Scientific research academic news keyword matching recommendation method based on deep learning
Xiao et al. Research on multimodal emotion analysis algorithm based on deep learning
Zhang et al. An overview on supervised semi-structured data classification
Hamdi et al. BERT and word embedding for interest mining of instagram users
Thilagavathi et al. Document clustering in forensic investigation by hybrid approach
Bhavani et al. An efficient clustering approach for fair semantic web content retrieval via tri-level ontology construction model with hybrid dragonfly algorithm
Wang et al. Multi-modal online review driven product improvement design based on scientific effects knowledge graph
Chen et al. Multi-modal multi-layered topic classification model for social event analysis
Hui et al. A weighted topical document embedding based clustering method for news text
Kalaiarasu et al. Sentiment analysis using improved novel convolutional neural network (SNCNN)
Wang et al. Information Classification and Extraction on Official Web Pages of Organizations.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination