CN109508386A

CN109508386A - A method for measuring the association between central words of stock news and related stocks

Info

Publication number: CN109508386A
Application number: CN201811318217.1A
Authority: CN
Inventors: 王家华; 薛醒思; 詹先银; 朱钟元; 范淑娟; 刘艳萍; 杨莹
Original assignee: Fujian University of Technology
Current assignee: Fujian University of Technology
Priority date: 2018-11-07
Filing date: 2018-11-07
Publication date: 2019-03-22

Abstract

The present invention provides a method for measuring the association between stock news keywords and related stocks, comprising: step S10, reading the data in a prepared stock news file and constructing a transaction database D; step S20, exhaustively enumerating all frequent itemsets from the transaction database D, and generating a frequent itemset database L and frequent itemset groups L_k; step S30, computing, from each frequent itemset F_k.m, several association rules α → β that have a co-occurrence relation, where itemset α is a non-empty proper subset of F_k.m and itemset β is the complement of α with respect to F_k.m, and adding the association rules α → β to a word co-occurrence database. Using a mining algorithm, the method counts how often stock news keywords and related stocks appear across different news items and measures their association with a relevance formula; the computation is efficient, fast and reliable.

Description

A method for measuring the association between central words of stock news and related stocks

Technical field

The present invention relates to the technical field of stock data mining, and in particular to a method for measuring the association between central words of stock news and related stocks.

Background art

A collection of items is called an itemset; an itemset containing k items is called a k-itemset; an itemset whose support is greater than the minimum support threshold is a frequent itemset. The occurrence frequency of an itemset is the number of transactions that contain it, referred to as the frequency, support count, or count of the itemset. An association rule is an implication of the form X → Y, where X and Y are called the antecedent and the consequent of the rule, respectively.

With the rapid development and spread of information technology, the volume of stock-related news on the Internet has grown rapidly. Quickly and accurately extracting the useful information from this mass of information has become a problem [1]. To organize and manage effectively the large volume of news data that is continuously accumulated and updated, its content needs to be annotated so that it can be structured [2]. Because manual annotation of news is time-consuming, accurate and convenient automatic annotation of massive news has become a market need. Automatic annotation of stock news in turn requires a "synonym" corpus that stores central words and related stocks that have a co-occurrence relation. In building this "synonym" corpus, measuring whether a stock news central word is associated with a related stock is therefore a key problem for automatic annotation of stock news.

In recent years, research at home and abroad on computing word association can be roughly divided into two categories: 1) computing word association from a semantic knowledge base; 2) computing word association from a large-scale corpus.

Rada et al. and J. H. Lee et al. computed the similarity between English words from the shortest path formed by hypernym-hyponym relations between word nodes in WordNet [3-4]. P. Resnik measured the semantic similarity of two English words by the maximum information content of their common ancestor nodes [5]. When using WordNet to compute the semantic similarity of English words, E. Agirre and G. Rigau considered, besides the path length between nodes, other factors such as the depth in the concept hierarchy and the density of the region of the hierarchy [6]. For Chinese word similarity, Wang Bin used the path between nodes in a tree diagram and computed the similarity between Chinese words with the "Chinese Thesaurus" [7]. Liu Qun et al. proposed a word similarity computation method based on HowNet [8]. Li Sujian et al. proposed a method that combines HowNet and the "Chinese Thesaurus" to compute Chinese word similarity [9]; in computing sememe similarity, it considers not only the hypernym-hyponym relations between sememes but also other relations between them.

Lillian Lee used relative entropy, and P. Brown et al. used average mutual information, to compute the similarity between words [10-11]. Dagan et al. used a more complex probabilistic model to compute the distance between words [12]. Hu Junfeng et al. used a context-based word vector space model to approximate word semantics and defined word similarity on that basis [13]. Liu Qun used a Hopfield neural network to associate words with one another, and stored the similarity values between words in a fuzzy matrix reflecting the degree of association between keywords [8].

The semantic-dictionary-based methods above all require a semantic dictionary for the field to be provided in advance. No semantic dictionary for the stock field currently exists, and building one is costly and slow, so dictionary-based methods are not suitable for measuring the association between stock news central words and related stocks. In addition, traditional statistics-based methods cannot handle both high and low co-occurrence frequencies of word pairs, and so cannot guarantee the quality of the measured association between stock news central words and related stocks.

References:

[1] Shi Aiping. Research on Web page keywords based on semantic distance [D]. Jiangsu University of Science and Technology, 2011.

[2] Yu Xiaojie. Automatic content annotation system for news programs [D]. Tsinghua University, 2011.

[3] Rada R, Mili H, Bicknell E, et al. Development and application of a metric on semantic nets [J]. IEEE Transactions on Systems, Man, and Cybernetics, 1989, 19(1): 17-30.

[4] Lee J H, Kim M H, Lee Y J. Information Retrieval Based on Conceptual Distance in a Is-a Hierarchy [J]. Journal of Documentation, 1993, 49(2): 188-207.

[5] Resnik P. Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language [M]. AI Access Foundation, 1999.

[6] Agirre E, Rigau G. A Proposal for Word Sense Disambiguation using Conceptual Distance [J]. Computer Science, 2009.

[7] Wang Bin. Research on automatic alignment of Chinese-English bilingual corpora [D]. Graduate School of the Chinese Academy of Sciences (Institute of Computing Technology), 1999.

[8] Liu Qun, Li Sujian. Word similarity computation based on HowNet [J]. Chinese Computational Linguistics, 2002.

[9] Li S, Zhang J, Huang X, et al. Semantic computation in a Chinese Question-answering system [J]. Journal of Computer Science and Technology (English edition), 2002, 17(6): 933-939.

[10] Brown P F, Pietra S A D, Pietra V J D, et al. Word-Sense Disambiguation Using Statistical Methods [C]// Meeting of the Association for Computational Linguistics. 1991: 264-270.

[11] Lee L J. Similarity-Based Approaches to Natural Language Processing [J]. Computer Science, 1997.

[12] Dagan I, Lee L, Pereira F C N. Similarity-Based Models of Word Cooccurrence Probabilities [J]. Machine Learning, 1999, 34(1-3): 43-69.

[13] Hu Junfeng, Yu Shiwen. Statistical analysis of word similarity in Tang and Song poetry and its applications [J]. Journal of Chinese Information Processing, 2002, 16(4): 40-45.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for measuring the association between stock news keywords and related stocks that is more robust to the effect of words of different frequencies on the degree of association and that improves the quality of the measured association between stock news keywords and related stocks.

The present invention is implemented as follows: a method for measuring the association between stock news keywords and related stocks, comprising:

Step S10: read the data in the prepared stock news file and construct a transaction database D, D = {T_1, T_2, T_3, …, T_i}, where transaction T_i denotes the itemset formed by the keywords of one stock news item, i ∈ [1, n], and n denotes the number of stock news records contained in the stock news file;

Step S20: exhaustively enumerate all frequent itemsets from the transaction database D, and generate a frequent itemset database L and frequent itemset groups L_k, L = {L_1, L_2, L_3, …, L_k}, L_k = {F_k.1, F_k.2, F_k.3, …, F_k.m}, where frequent itemset F_k.m denotes a frequent k-itemset composed of k stock news keywords, m denotes a serial number, and k and m are positive integers;

Step S30: from each frequent itemset F_k.m, compute several association rules α → β that have a co-occurrence relation, where itemset α is a non-empty proper subset of F_k.m and itemset β is the complement of α with respect to F_k.m, and add the association rules α → β to the word co-occurrence database.

Further, step S20 specifically comprises:

Step S21: scan the transaction database D and generate candidate itemset group C_1, C_1 = {E_1.1, E_1.2, E_1.3, …, E_1.w}, where candidate itemset E_1.w denotes a 1-itemset composed of one stock news keyword, and w denotes a serial number and is a positive integer;

Step S22: compute the support count count(E_1.w) of candidate itemset E_1.w in the transaction database D; if count(E_1.w) is greater than or equal to the preset minimum support count threshold, add candidate itemset E_1.w to frequent itemset group L_1; if count(E_1.w) is less than the preset minimum support count threshold, discard candidate itemset E_1.w;

Step S23: generate candidate itemsets E_h.s from the frequent itemsets F_(h-1).m in frequent itemset group L_(h-1), where candidate itemset E_h.s denotes an h-itemset composed of h stock news keywords, s denotes a serial number and is a positive integer, and h is an integer greater than or equal to 2;

Step S24: prune the candidate itemsets E_h.s that have infrequent subsets, then generate candidate itemset group C_h, C_h = {E_h.1, E_h.2, …, E_h.j}, where j denotes a serial number and is a positive integer;

Step S25: compute the support count count(E_h.j) of candidate itemset E_h.j in the transaction database D; if count(E_h.j) is greater than or equal to the minimum support count threshold, add candidate itemset E_h.j to frequent itemset group L_h; if count(E_h.j) is less than the minimum support count threshold, discard candidate itemset E_h.j;

Step S26: generate candidate itemset group C_(h+1) from frequent itemset group L_h; if candidate itemset group C_(h+1) is empty, frequent itemset group L_h is the maximal frequent itemset group and the frequent itemset database L is finally obtained; if candidate itemset group C_(h+1) is not empty, go to step S25.

Further, generating candidate itemsets E_h.s from the frequent itemsets F_(h-1).m in frequent itemset group L_(h-1) specifically comprises: traversing frequent itemset group L_(h-1) and selecting two itemsets, F_(h-1).p and F_(h-1).q, where p and q denote serial numbers and are positive integers; comparing the items of F_(h-1).p with the items of F_(h-1).q to determine whether they have exactly h-2 items in common; if so, joining F_(h-1).p and F_(h-1).q to generate candidate itemset E_h.s; if not, not joining them.

Further, pruning the candidate itemsets E_h.s that have infrequent subsets specifically comprises: traversing the subsets S_(h-1).u of candidate itemset E_h.s, where each subset S_(h-1).u is an (h-1)-itemset and u denotes a serial number and is a positive integer; if a subset S_(h-1).u does not belong to frequent itemset group L_(h-1), excluding candidate itemset E_h.s; if the subsets S_(h-1).u belong to frequent itemset group L_(h-1), adding candidate itemset E_h.s to candidate itemset group C_h.

Further, computing the support count count(E_1.w) of candidate itemset E_1.w in the transaction database D specifically comprises: traversing the transaction database D, and each time candidate itemset E_1.w is a subset of a transaction T_i, incrementing the occurrence count of E_1.w by one;

Computing the support count count(E_h.j) of candidate itemset E_h.j in the transaction database D specifically comprises: traversing the transaction database D, and each time candidate itemset E_h.j is a subset of a transaction T_i, incrementing the occurrence count of E_h.j by one.

Further, the calculating of the cooccurrence relation specifically: the unrelated degree for calculating correlation rule α → β, if the pass The unrelated degree for joining rule α → β is less than or equal to preset unrelated degree threshold value, then it represents that the item collection α and item collection β has described Cooccurrence relation；If the unrelated degree of the correlation rule α → β is greater than the preset unrelated degree threshold value, then it represents that the item collection α Do not have the cooccurrence relation with the item collection β.

Further, the calculation formula of the unrelated degree of the correlation rule α → β is as follows:

Wherein M indicates the size of the transaction database D；F (α) and f (β) respectively indicate the item collection α and the item collection β The number of appearance；F (α, β) indicates while occurring the number of the item collection α and the item collection β, and NDG (α, β) indicates the association The unrelated degree of regular α → β.

Further, the keyword includes centre word and mark word.

The present invention has the advantage that the relevance metric method of stock information news keyword of the invention and related stock It is counted, is used by the number that stock information news keyword and related stock occurs in different news in mining algorithm Unrelated degree formula carries out the relevance metric of stock information news keyword and related stock, and computational efficiency is high, quick reliable.

Brief description of the drawings

The present invention is further described below with reference to the accompanying drawings and embodiments.

Fig. 1 is a flowchart of generating frequent itemset group L_1 from the transaction database D in the method of the present invention.

Fig. 2 is a flowchart of generating candidate itemset group C_h from frequent itemset group L_(h-1) in the method of the present invention.

Fig. 3 is a flowchart of finding the maximal frequent itemset group from candidate itemset group C_h in the method of the present invention.

Fig. 4 is a flowchart of finding the association rules with a co-occurrence relation from frequent itemset groups L_k in the method of the present invention.

Detailed description of the embodiments

Referring to Figs. 1 to 4, a method for measuring the association between stock news keywords and related stocks comprises:

Step S10: read the data in the prepared stock news file and construct a transaction database D, D = {T_1, T_2, T_3, …, T_i}, where transaction T_i denotes the itemset formed by the keywords of one stock news item, i ∈ [1, n], and n denotes the number of stock news records contained in the stock news file. The keywords include central words and annotation words. The data in the stock news file consist of multiple rows of central words and annotation words; the central words and annotation words in each row all come from the same news item, and the words are separated by spaces.
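The file layout described in step S10 (one news item per row, keywords separated by spaces) maps directly onto a list of sets. The following is a minimal sketch under that reading; the function name `build_transaction_db` and the hyphenated sample keywords are illustrative assumptions, not part of the patent.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]

def build_transaction_db(lines: Iterable[str]) -> List[Transaction]:
    """Turn each non-empty row into one transaction T_i (a set of keywords)."""
    db = []
    for line in lines:
        words = line.split()           # keywords in a row are separated by spaces
        if words:                      # blank rows carry no news item
            db.append(frozenset(words))
    return db

# Hypothetical rows; a real file would be read with open(path, encoding="utf-8").
sample_rows = ["Amazon cross-border-e-commerce", "", "Alibaba Alibaba-concept Alibaba"]
D = build_transaction_db(sample_rows)
print(len(D), sorted(D[1]))
```

Because each transaction is a set, a keyword repeated within one row is stored only once.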

Step S20 specifically comprises:

Step S21: scan the transaction database D and generate candidate itemset group C_1, C_1 = {E_1.1, E_1.2, E_1.3, …, E_1.w}, where candidate itemset E_1.w denotes one stock news keyword and is a 1-itemset, and w denotes a serial number and is a positive integer. Candidate itemset group C_1 contains w candidate 1-itemsets, which means that there are w stock news keywords in total.

Step S22: compute the support count count(E_1.w) of candidate itemset E_1.w in the transaction database D; if count(E_1.w) is greater than or equal to the preset minimum support count threshold, add candidate itemset E_1.w to frequent itemset group L_1; if count(E_1.w) is less than the preset minimum support count threshold, discard candidate itemset E_1.w. This yields frequent itemset group L_1, L_1 = {F_1.1, F_1.2, F_1.3, …, F_1.m}. The minimum support count threshold is denoted min_sup and is a preset value; a frequent itemset F_1.m is a candidate itemset E_1.w that satisfies count(E_1.w) ≥ min_sup.

In step S22, computing the support count count(E_1.w) of candidate itemset E_1.w in the transaction database D specifically comprises: traversing the transaction database D, and each time candidate itemset E_1.w is a subset of a transaction T_i, incrementing the occurrence count of E_1.w by one.
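Steps S21 and S22 amount to a subset test per transaction followed by a threshold filter. A minimal sketch, assuming transactions are stored as Python frozensets; the names `support_count` and `frequent_1_itemsets` and the three sample transactions are illustrative, not taken from the patent.

```python
def support_count(itemset, db):
    """Number of transactions T_i that contain `itemset` as a subset."""
    return sum(1 for t in db if itemset <= t)

def frequent_1_itemsets(db, min_sup):
    """Build C_1 from every distinct keyword, keep those with count >= min_sup."""
    c1 = {frozenset([word]) for t in db for word in t}            # candidate group C_1
    return {e for e in c1 if support_count(e, db) >= min_sup}     # frequent group L_1

db = [frozenset(t) for t in (
    {"Amazon", "cross-border e-commerce"},
    {"Amazon", "cross-border e-commerce"},
    {"Vanke", "real estate development"},
)]
L1 = frequent_1_itemsets(db, min_sup=2)
print(sorted(next(iter(s)) for s in L1))
```

Here only the two keywords that occur in two transactions survive; the keywords of the third transaction occur once and are discarded.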

Step S23: use the function apriori_gen to generate candidate itemsets E_h.s from the frequent itemsets F_(h-1).m in frequent itemset group L_(h-1), where candidate itemset E_h.s denotes an h-itemset composed of h stock news keywords, s denotes a serial number and is a positive integer, and h is an integer greater than or equal to 2;

In step S23, the function apriori_gen works as follows: traverse frequent itemset group L_(h-1) and select two itemsets, F_(h-1).p and F_(h-1).q, where p and q denote serial numbers and are positive integers; compare the items of F_(h-1).p with the items of F_(h-1).q to determine whether they have exactly h-2 items in common; if so, join F_(h-1).p and F_(h-1).q to generate candidate itemset E_h.s; if not, do not join them.
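The join condition can be sketched in a few lines. The function name `apriori_gen` comes from the text above; its signature, the use of frozensets, and the single-letter sample items are assumptions made for illustration.

```python
from itertools import combinations

def apriori_gen(prev_frequent, h):
    """Join pairs of frequent (h-1)-itemsets that share exactly h-2 items."""
    candidates = set()
    for f_p, f_q in combinations(prev_frequent, 2):
        if len(f_p & f_q) == h - 2:
            candidates.add(f_p | f_q)   # the union then has exactly h items
    return candidates

# Hypothetical frequent 2-itemsets over items a, b, c, d.
L2 = {frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("bd")}
E3 = apriori_gen(L2, h=3)
print(sorted("".join(sorted(e)) for e in E3))
```

Storing the result in a set means that a candidate reached through several different pairs (here abc, from ab+ac, ab+bc and ac+bc) is kept only once.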

Step S24: use the function has_infrequent_sub to prune the candidate itemsets E_h.s that have infrequent subsets, then generate candidate itemset group C_h, C_h = {E_h.1, E_h.2, …, E_h.j}, where j denotes a serial number and is a positive integer;

In step S24, the function has_infrequent_sub works as follows: traverse the subsets S_(h-1).u of candidate itemset E_h.s, where each subset S_(h-1).u is an (h-1)-itemset and u denotes a serial number and is a positive integer; if a subset S_(h-1).u does not belong to frequent itemset group L_(h-1), exclude candidate itemset E_h.s, i.e. prune it; if the subsets S_(h-1).u belong to frequent itemset group L_(h-1), add candidate itemset E_h.s to candidate itemset group C_h, i.e. do not prune it. Candidate itemset E_h.s has u subsets. In this way, the candidate itemsets in candidate itemset group C_h are determined to be frequent candidates.

Step S25: compute the support count count(E_h.j) of candidate itemset E_h.j in the transaction database D; if count(E_h.j) is greater than or equal to the minimum support count threshold, add candidate itemset E_h.j to frequent itemset group L_h; if count(E_h.j) is less than the minimum support count threshold, discard candidate itemset E_h.j. This yields frequent itemset group L_h, L_h = {F_h.1, F_h.2, F_h.3, …, F_h.m}, where a frequent itemset F_h.m is a candidate itemset E_h.j that satisfies count(E_h.j) ≥ min_sup.

In step S25, computing the support count count(E_h.j) of candidate itemset E_h.j in the transaction database D specifically comprises: traversing the transaction database D, and each time candidate itemset E_h.j is a subset of a transaction T_i, incrementing the occurrence count of E_h.j by one.
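Steps S24 and S25 can be sketched as two filters applied in sequence. The name `has_infrequent_sub` comes from the text above; the helper names `prune` and `keep_frequent`, the signatures, and the sample data are illustrative assumptions.

```python
from itertools import combinations

def has_infrequent_sub(candidate, prev_frequent):
    """True if some (h-1)-subset of the h-itemset `candidate` is not in L_(h-1)."""
    h = len(candidate)
    return any(frozenset(s) not in prev_frequent
               for s in combinations(candidate, h - 1))

def prune(candidates, prev_frequent):
    """Step S24: candidate group C_h keeps only candidates whose subsets are all frequent."""
    return {c for c in candidates if not has_infrequent_sub(c, prev_frequent)}

def keep_frequent(candidates, db, min_sup):
    """Step S25: frequent group L_h keeps candidates with support count >= min_sup."""
    return {c for c in candidates if sum(1 for t in db if c <= t) >= min_sup}

L2 = {frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("bd")}
E3 = {frozenset("abc"), frozenset("abd"), frozenset("bcd")}
C3 = prune(E3, L2)                      # abd and bcd each have an infrequent 2-subset
db = [frozenset("abc"), frozenset("abc"), frozenset("bd"), frozenset("bd")]
L3 = keep_frequent(C3, db, min_sup=2)
print(C3 == {frozenset("abc")}, L3 == {frozenset("abc")})
```

Pruning is only an optimization: any candidate it removes could not have met the support threshold, because a superset never occurs more often than its subsets.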

Step S26: generate candidate itemset group C_(h+1) from frequent itemset group L_h; if candidate itemset group C_(h+1) is empty, frequent itemset group L_h is the maximal frequent itemset group, and the frequent itemset database L is finally obtained, L = {L_1, L_2, L_3, …, L_h}. Since h is an integer greater than or equal to 2 and k is a positive integer, the frequent itemset groups L_k in step S20 include frequent itemset group L_1 and the frequent itemset groups L_h;

If candidate itemset group C_(h+1) is not empty, go to step S25 and continue to generate frequent itemset group L_(h+1), so that L = {L_1, L_2, L_3, …, L_h, L_(h+1)}; then generate candidate itemset group C_(h+2) and determine whether it is empty, and so on until the maximal frequent itemset group is found.
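Taken together, steps S21 to S26 form a level-wise loop. The sketch below is one possible arrangement, not the patent's flowchart: the function name `mine_frequent_itemsets` is an assumption, and the loop stops when a level yields no frequent itemsets, which also covers the case where the candidate group C_(h+1) is empty.

```python
from itertools import combinations

def mine_frequent_itemsets(db, min_sup):
    """Return [L_1, L_2, ...], each a set of frozensets, built level by level."""
    def count(itemset):
        return sum(1 for t in db if itemset <= t)

    # Steps S21-S22: frequent 1-itemsets.
    level = {c for c in {frozenset([w]) for t in db for w in t}
             if count(c) >= min_sup}
    groups, h = [], 1
    while level:
        groups.append(level)
        h += 1
        # Step S23: join itemsets that share exactly h-2 items.
        joined = {a | b for a, b in combinations(level, 2) if len(a & b) == h - 2}
        # Step S24: drop candidates with an infrequent (h-1)-subset.
        pruned = {c for c in joined
                  if all(frozenset(s) in level for s in combinations(c, h - 1))}
        # Step S25: keep candidates that meet the support threshold.
        level = {c for c in pruned if count(c) >= min_sup}
    return groups

db = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("d")]
L = mine_frequent_itemsets(db, min_sup=2)
print([len(group) for group in L])
```

For this toy database the group sizes are 3, 3 and 1, so L_3 = {{a, b, c}} is the maximal frequent itemset group.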

Step S30: from each frequent itemset F_k.m, compute several association rules α → β that have a co-occurrence relation, where itemset α is a non-empty proper subset of F_k.m and itemset β is the complement of α with respect to F_k.m, and add the association rules α → β to the word co-occurrence database.

In step S30, the co-occurrence relation is computed as follows: compute the irrelevance degree of association rule α → β; if the irrelevance degree of α → β is less than or equal to a preset irrelevance degree threshold, itemset α and itemset β have the co-occurrence relation; if the irrelevance degree of α → β is greater than the preset irrelevance degree threshold, itemset α and itemset β do not have the co-occurrence relation.

The irrelevance degree of association rule α → β is calculated as follows:

where M denotes the size of the transaction database D, i.e. the number of stock news records contained in the stock news file; f(α) and f(β) denote the numbers of occurrences of itemset α and itemset β, respectively; f(α, β) denotes the number of times itemset α and itemset β occur together; and NDG(α, β) denotes the irrelevance degree of association rule α → β.
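The formula itself is not shown in the text above. The symbols it defines (M, f(α), f(β), f(α, β)) and the boundary behaviour stated for the irrelevance degree (infinite when the itemsets never co-occur, 0 when they always co-occur) are consistent with a normalized-Google-distance form, NDG(α, β) = (max{log f(α), log f(β)} − log f(α, β)) / (log M − min{log f(α), log f(β)}). Treating that form as an assumption rather than as the patent's confirmed formula, a minimal sketch is:

```python
import math

def ndg(f_alpha, f_beta, f_both, m):
    """Irrelevance degree under the ASSUMED normalized-Google-distance form."""
    if f_both == 0:
        return math.inf                       # the itemsets never co-occur
    log_a, log_b = math.log(f_alpha), math.log(f_beta)
    denom = math.log(m) - min(log_a, log_b)
    if denom == 0:
        return 0.0                            # both itemsets occur in every record
    return (max(log_a, log_b) - math.log(f_both)) / denom

print(ndg(2, 2, 2, 17), ndg(2, 2, 0, 17), round(ndg(2, 3, 2, 17), 3))
```

Under this assumption, two itemsets with counts 2 and 3 that co-occur twice among 17 records score about 0.189, which would fall below a threshold of 0.4.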

In Fig. 4, the preset irrelevance degree threshold is denoted max_Disc.

If the stock news keywords in itemset α never appear in the same news item as the stock news keywords in itemset β, and only appear separately, the irrelevance degree between them is infinite, indicating that there is no strong association; if the news words in itemset α always appear together with the news words in itemset β, the irrelevance degree between them is 0, indicating a very strong association.

The word co-occurrence database finally obtained records the stock news keywords and related stocks that have a co-occurrence relation, which ensures the quality of the association measurement between stock news keywords and related stocks; the method of the present invention is computationally efficient, fast and reliable.

For example, the stock news file contains 17 stock news records, so that T_1 = {Amazon, cross-border e-commerce}, T_2 = {Maotai Group, Alibaba, Guizhou Maotai, baijiu, Alibaba concept}, T_3 = {Vanke, real estate development}, T_4 = {Amazon, cross-border e-commerce}, T_5 = {Toutiao, culture media}, T_6 = {Meituan, Mobike, shared bicycle, Internet+}, T_7 = {Apple, Apple concept, artificial intelligence}, T_8 = {Ren Zhengfei, Huawei, Guangdong sector, electronics manufacturing}, T_9 = {Alibaba, Alibaba concept}, T_10 = {Bitcoin, blockchain}, T_11 = {Kuaishou, Huoshan, Internet media, culture media, influencer live streaming}, T_12 = {Toutiao, short video, Internet media, culture media, influencer live streaming}, T_13 = {Shenzhen, smart city, smart city, Guangdong sector}, T_14 = {Jia Yueting, LeTV, Hainan sector, LeTV}, T_15 = {LeTV, Jia Yueting, LeTV, new-energy vehicle}, T_16 = {pig grain, live pig, pig, feed}, T_17 = {new stock, new stock and sub-new stock}.

The generated candidate itemset group C_1 then contains 41 candidate itemsets: E_1.1 = {Amazon}, E_1.2 = {blockchain}, E_1.3 = {culture media}, E_1.4 = {pig grain}, E_1.5 = {real estate development}, E_1.6 = {Kuaishou}, E_1.7 = {electronics manufacturing}, E_1.8 = {cross-border e-commerce}, E_1.9 = {Toutiao}, E_1.10 = {Apple}, E_1.11 = {Guangdong sector}, E_1.12 = {Internet+}, E_1.13 = {Alibaba}, E_1.14 = {artificial intelligence}, E_1.15 = {Meituan}, E_1.16 = {Alibaba concept}, E_1.17 = {Mobike}, E_1.18 = {Bitcoin}, E_1.19 = {Ren Zhengfei}, E_1.20 = {influencer live streaming}, E_1.21 = {Vanke}, E_1.22 = {live pig}, E_1.23 = {Huawei}, E_1.24 = {new-energy vehicle}, E_1.25 = {pig}, E_1.26 = {Huoshan}, E_1.27 = {short video}, E_1.28 = {Jia Yueting}, E_1.29 = {smart city}, E_1.30 = {LeTV}, E_1.31 = {Apple concept}, E_1.32 = {Shenzhen}, E_1.33 = {Hainan sector}, E_1.34 = {Internet media}, E_1.35 = {Guizhou Maotai}, E_1.36 = {feed}, E_1.37 = {new stock and sub-new stock}, E_1.38 = {Maotai Group}, E_1.39 = {baijiu}, E_1.40 = {shared bicycle}, E_1.41 = {new stock}.

The preset minimum support count threshold min_sup is obtained by multiplying the number n of stock news records contained in the stock news file by a scale factor θ. In this example n = 17 and θ = 10%, so min_sup = 1.7. With a preset minimum support count threshold of 1.7, the frequent itemset group L_1 generated from candidate itemset group C_1 contains 11 frequent itemsets: F_1.1 = {Amazon}, F_1.2 = {culture media}, F_1.3 = {cross-border e-commerce}, F_1.4 = {Toutiao}, F_1.5 = {Guangdong sector}, F_1.6 = {Alibaba}, F_1.7 = {Alibaba concept}, F_1.8 = {influencer live streaming}, F_1.9 = {Jia Yueting}, F_1.10 = {LeTV}, F_1.11 = {Internet media}.

The candidate itemset group C_2 generated from frequent itemset group L_1 contains 55 candidate itemsets: E_2.1 = {culture media, Amazon}, E_2.2 = {cross-border e-commerce, Amazon}, E_2.3 = {Amazon, Toutiao}, E_2.4 = {Amazon, Guangdong sector}, E_2.5 = {Amazon, Alibaba}, E_2.6 = {Alibaba concept, Amazon}, E_2.7 = {influencer live streaming, Amazon}, E_2.8 = {Amazon, Jia Yueting}, E_2.9 = {LeTV, Amazon}, E_2.10 = {Internet media, Amazon}, E_2.11 = {cross-border e-commerce, culture media}, E_2.12 = {culture media, Toutiao}, E_2.13 = {culture media, Guangdong sector}, E_2.14 = {culture media, Alibaba}, E_2.15 = {Alibaba concept, culture media}, E_2.16 = {influencer live streaming, culture media}, E_2.17 = {culture media, Jia Yueting}, E_2.18 = {LeTV, culture media}, E_2.19 = {culture media, Internet media}, E_2.20 = {cross-border e-commerce, Toutiao}, E_2.21 = {cross-border e-commerce, Guangdong sector}, E_2.22 = {cross-border e-commerce, Alibaba}, E_2.23 = {cross-border e-commerce, Alibaba concept}, E_2.24 = {cross-border e-commerce, influencer live streaming}, E_2.25 = {cross-border e-commerce, Jia Yueting}, E_2.26 = {cross-border e-commerce, LeTV}, E_2.27 = {cross-border e-commerce, Internet media}, E_2.28 = {Toutiao, Guangdong sector}, E_2.29 = {Toutiao, Alibaba}, E_2.30 = {Alibaba concept, Toutiao}, E_2.31 = {influencer live streaming, Toutiao}, E_2.32 = {Toutiao, Jia Yueting}, E_2.33 = {LeTV, Toutiao}, E_2.34 = {Internet media, Toutiao}, E_2.35 = {Alibaba, Guangdong sector}, E_2.36 = {Alibaba concept, Guangdong sector}, E_2.37 = {influencer live streaming, Guangdong sector}, E_2.38 = {Guangdong sector, Jia Yueting}, E_2.39 = {LeTV, Guangdong sector}, E_2.40 = {Internet media, Guangdong sector}, E_2.41 = {Alibaba concept, Alibaba}, E_2.42 = {influencer live streaming, Alibaba}, E_2.43 = {Alibaba, Jia Yueting}, E_2.44 = {LeTV, Alibaba}, E_2.45 = {Internet media, Alibaba}, E_2.46 = {Alibaba concept, influencer live streaming}, E_2.47 = {Alibaba concept, Jia Yueting}, E_2.48 = {Alibaba concept, LeTV}, E_2.49 = {Alibaba concept, Internet media}, E_2.50 = {influencer live streaming, Jia Yueting}, E_2.51 = {influencer live streaming, LeTV}, E_2.52 = {influencer live streaming, Internet media}, E_2.53 = {LeTV, Jia Yueting}, E_2.54 = {Internet media, Jia Yueting}, E_2.55 = {LeTV, Internet media}.

The frequent itemset group L_2 generated from candidate itemset group C_2 contains 7 frequent itemsets: F_2.1 = {cross-border e-commerce, Amazon}, F_2.2 = {culture media, Toutiao}, F_2.3 = {influencer live streaming, culture media}, F_2.4 = {culture media, Internet media}, F_2.5 = {Alibaba concept, Alibaba}, F_2.6 = {influencer live streaming, Internet media}, F_2.7 = {LeTV, Jia Yueting}.
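The sizes quoted for C_1, L_1, C_2 and L_2 in this example can be checked mechanically. The sketch below re-enters the 17 transactions as Python sets; the English item names are translations chosen for this illustration, and a word repeated inside one transaction collapses to a single set element.

```python
from itertools import combinations

D = [frozenset(t) for t in [
    {"Amazon", "cross-border e-commerce"},
    {"Maotai Group", "Alibaba", "Guizhou Maotai", "baijiu", "Alibaba concept"},
    {"Vanke", "real estate development"},
    {"Amazon", "cross-border e-commerce"},
    {"Toutiao", "culture media"},
    {"Meituan", "Mobike", "shared bicycle", "Internet+"},
    {"Apple", "Apple concept", "artificial intelligence"},
    {"Ren Zhengfei", "Huawei", "Guangdong sector", "electronics manufacturing"},
    {"Alibaba", "Alibaba concept"},
    {"Bitcoin", "blockchain"},
    {"Kuaishou", "Huoshan", "Internet media", "culture media", "influencer live streaming"},
    {"Toutiao", "short video", "Internet media", "culture media", "influencer live streaming"},
    {"Shenzhen", "smart city", "Guangdong sector"},
    {"Jia Yueting", "LeTV", "Hainan sector"},
    {"LeTV", "Jia Yueting", "new-energy vehicle"},
    {"pig grain", "live pig", "pig", "feed"},
    {"new stock", "new stock and sub-new stock"},
]]

min_sup = len(D) * 0.10                 # n x theta, about 1.7

def count(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in D if itemset <= t)

C1 = {frozenset([w]) for t in D for w in t}
L1 = {c for c in C1 if count(c) >= min_sup}
C2 = {a | b for a, b in combinations(L1, 2)}   # 1-itemsets share h-2 = 0 items
L2 = {c for c in C2 if count(c) >= min_sup}
print(len(C1), len(L1), len(C2), len(L2))
```

Running this prints `41 11 55 7`, matching the group sizes stated in the example.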

The candidate itemset group C_3 generated from frequent itemset group L_2 contains 4 candidate itemsets: E_3.1 = {influencer live streaming, culture media, Toutiao}, E_3.2 = {culture media, Internet media, Toutiao}, E_3.3 = {influencer live streaming, culture media, Internet media}, E_3.4 = {influencer live streaming, Internet media, culture media}.

The frequent itemset group L_3 generated from candidate itemset group C_3 contains 2 frequent itemsets: F_3.1 = {influencer live streaming, culture media, Internet media}, F_3.2 = {influencer live streaming, Internet media, culture media}.

As required, the generated candidate itemset group C_4 is empty, so frequent itemset group L_3 is the maximal frequent itemset group.

Frequent itemsets are selected from frequent itemset groups L_1, L_2 and L_3, and the association rules with a co-occurrence relation are computed. In this example the preset irrelevance degree threshold is 0.4, and the results are:

The irrelevance degree of association rule {Amazon} → {cross-border e-commerce} is 0;

The irrelevance degree of association rule {Toutiao} → {culture media} is 0.18;

The irrelevance degree of association rule {Alibaba} → {Alibaba concept} is 0;

The irrelevance degree of association rule {Jia Yueting} → {LeTV} is 0;

Thus the keywords in {Amazon} → {cross-border e-commerce}, {Toutiao} → {culture media}, {Alibaba} → {Alibaba concept} and {Jia Yueting} → {LeTV} are considered to have co-occurrence relations.

Although specific embodiments of the present invention have been described above, those skilled in the art should understand that the described embodiments are merely illustrative and do not limit the scope of the present invention; equivalent modifications and variations made by those skilled in the art in accordance with the spirit of the present invention shall fall within the scope of protection of the claims of the present invention.

Claims

1. A relevance metric method for stock information news keywords and related stocks, characterized by comprising:

Step S10: reading the data in a prepared stock information news file and constructing a transaction database D, D = {T₁, T₂, T₃, …, T_i}, where transaction T_i denotes the item set formed from the keywords of a single piece of stock information news, i ∈ [1, n], and n denotes the number of related stock information news records included in the stock information news file;

Step S20: exhaustively enumerating all frequent item sets from the transaction database D, and generating a frequent item set database L and frequent item set groups L_k, L = {L₁, L₂, L₃, …, L_k}, L_k = {F_k.1, F_k.2, F_k.3, …, F_k.m}, where frequent item set F_k.m denotes a frequent k-item set composed of k stock information news keywords, m denotes a serial number, and k and m are positive integers;

Step S30: calculating, from the frequent item set F_k.m, several correlation rules α → β of the co-occurrence relation, where item set α is a non-empty proper subset of F_k.m and item set β is the complement of the item set α with respect to the frequent item set F_k.m, and including the correlation rule α → β in the term co-occurrence database.
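The enumeration of rules α → β described in step S30 can be illustrated with a short Python sketch. It lists every non-empty proper subset α of a frequent item set together with its complement β; the function name `rules_from_itemset` and the placeholder items are hypothetical, and the sketch does not evaluate whether a rule has the co-occurrence relation.

```python
from itertools import combinations


def rules_from_itemset(itemset):
    """Yield (alpha, beta): alpha is a non-empty proper subset, beta its complement."""
    itemset = frozenset(itemset)
    # Sizes 1 .. len-1 exclude both the empty set and the full item set.
    for size in range(1, len(itemset)):
        for alpha in combinations(sorted(itemset), size):
            alpha = frozenset(alpha)
            yield alpha, itemset - alpha


# Hypothetical 3-item frequent item set with placeholder keywords.
rules = list(rules_from_itemset({"x", "y", "z"}))
```

A frequent k-item set yields 2^k - 2 such rules, so a 1-item set yields none and the 3-item set above yields 6.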

2. The relevance metric method for stock information news keywords and related stocks according to claim 1, characterized in that the step S20 specifically comprises:

Step S21: scanning the transaction database D and generating candidate item set group C₁, C₁ = {E_1.1, E_1.2, E_1.3, …, E_1.w}, where candidate item set E_1.j denotes a 1-item set composed of 1 stock information news keyword, and w denotes a serial number and is a positive integer;

Step S22: calculating the support count count(E_1.w) of the candidate item set E_1.w in the transaction database D; if the support count count(E_1.w) is greater than or equal to a preset minimum support count threshold, including the candidate item set E_1.w in frequent item set group L₁; if the support count count(E_1.w) is less than the preset minimum support count threshold, removing the candidate item set E_1.w;

Step S23: generating candidate item sets E_h.s from the frequent item sets F_h-1.m in frequent item set group L_h-1, where the candidate item set E_h.s denotes an h-item set composed of h stock information news keywords, s denotes a serial number and is a positive integer, and h is an integer greater than or equal to 2;

Step S24: performing infrequent pruning on the candidate item sets E_h.s, and then generating candidate item set group C_h, C_h = {E_h.1, E_h.2, …, E_h.j}, where j denotes a serial number and is a positive integer;

Step S25: calculating the support count count(E_h.j) of the candidate item set E_h.j in the transaction database D; if the support count count(E_h.j) is greater than or equal to the minimum support count threshold, including the candidate item set E_h.j in the frequent item set group L_h; if the support count count(E_h.j) is less than the minimum support count threshold, removing the candidate item set E_h.j;

Step S26: generating candidate item set group C_h+1 from the frequent item set group L_h; if the candidate item set group C_h+1 is empty, the frequent item set group L_h is the maximal frequent item set group, and the frequent item set database L is finally obtained; if the candidate item set group C_h+1 is not empty, going to the step S25.

3. The relevance metric method for stock information news keywords and related stocks according to claim 2, characterized in that generating the candidate item sets E_h.s from the frequent item sets F_h-1.m in frequent item set group L_h-1 specifically comprises: traversing the frequent item set group L_h-1 and choosing two item sets, item set F_h-1.p and item set F_h-1.q, where p and q denote serial numbers and are positive integers; comparing the items in the item set F_h-1.p with the items in the item set F_h-1.q and judging whether they have exactly k-2 identical items; if yes, joining the item set F_h-1.p and the item set F_h-1.q to generate the candidate item set E_h.s; if no, not joining them.

4. The relevance metric method for stock information news keywords and related stocks according to claim 2, characterized in that performing infrequent pruning on the candidate item set E_h.s specifically comprises: traversing the subsets S_h-1.u of the candidate item set E_h.s, where the subset S_h-1.u is an (h-1)-item set, and u denotes a serial number and is a positive integer; if the subset S_h-1.u does not belong to the frequent item set group L_h-1, excluding the candidate item set E_h.s; if the subset S_h-1.u belongs to the frequent item set group L_h-1, including the candidate item set E_h.s in candidate item set group C_h.

5. The relevance metric method for stock information news keywords and related stocks according to claim 2, characterized in that calculating the support count count(E_1.w) of the candidate item set E_1.w in the transaction database D specifically comprises: traversing the transaction database D, and if the candidate item set E_1.w is a subset of the transaction T_i, adding one to the occurrence count of the candidate item set E_1.w;

calculating the support count count(E_h.j) of the candidate item set E_h.j in the transaction database D specifically comprises: traversing the transaction database D, and if the candidate item set E_h.j is a subset of the transaction T_i, adding one to the occurrence count of the candidate item set E_h.j.

6. The relevance metric method for stock information news keywords and related stocks according to claim 1, characterized in that the calculation of the co-occurrence relation specifically comprises: calculating the unrelated degree of the correlation rule α → β; if the unrelated degree of the correlation rule α → β is less than or equal to a preset unrelated degree threshold, the item set α and the item set β have the co-occurrence relation; if the unrelated degree of the correlation rule α → β is greater than the preset unrelated degree threshold, the item set α and the item set β do not have the co-occurrence relation.

7. The relevance metric method for stock information news keywords and related stocks according to claim 6, characterized in that the calculation formula of the unrelated degree of the correlation rule α → β is as follows:

where M denotes the size of the transaction database D; f(α) and f(β) denote the numbers of occurrences of the item set α and the item set β, respectively; f(α, β) denotes the number of times the item set α and the item set β occur together; and NDG(α, β) denotes the unrelated degree of the correlation rule α → β.

8. The relevance metric method for stock information news keywords and related stocks according to claim 1, characterized in that the keywords comprise center words and mark words.