CN109508386A - Method for measuring the association between stock information news central words and related stocks - Google Patents

Method for measuring the association between stock information news central words and related stocks

Info

Publication number
CN109508386A
Authority
CN
China
Prior art keywords
candidate
item
frequent
stock
itemset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811318217.1A
Other languages
Chinese (zh)
Inventor
王家华
薛醒思
詹先银
朱钟元
范淑娟
刘艳萍
杨莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian University of Technology
Original Assignee
Fujian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian University of Technology
Priority to CN201811318217.1A
Publication of CN109508386A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method for measuring the association between stock information news keywords and related stocks, comprising: step S10, reading the data in a prepared stock information news file and constructing a transaction database D; step S20, exhaustively enumerating all frequent itemsets from the transaction database D and generating a frequent itemset database L and frequent itemset groups Lk; step S30, computing from each frequent itemset Fk.m the association rules α → β that have a co-occurrence relationship, where itemset α is a non-empty proper subset of Fk.m and itemset β is the complement of α with respect to Fk.m, and storing these association rules α → β in a word co-occurrence database. The method uses a mining algorithm to count how many times stock information news keywords and related stocks appear in different news items, and then measures their association with a degree-of-association formula, so that the computation is efficient, fast, and reliable.

Description

Method for measuring the association between stock information news central words and related stocks
Technical field
The present invention relates to the technical field of stock data mining, and in particular to a method for measuring the association between stock information news central words and related stocks.
Background
A collection of items is called an itemset; an itemset containing k items is called a k-itemset; an itemset whose support is greater than the minimum support threshold is a frequent itemset. The occurrence frequency of an itemset is the number of transactions that contain it, also called the frequency, support count, or count of the itemset. An association rule is an implication of the form X → Y, where X and Y are called the antecedent and the consequent of the rule, respectively.
With the rapid development and spread of information technology, the volume of stock-related news on the Internet has grown rapidly. Obtaining the required useful information quickly and accurately from this mass of information has become a problem [1]. To organize and manage effectively the large amount of news data that accumulates and is updated continuously, its content must be annotated so that it becomes structured [2]. Because manual annotation of news is time-consuming, the market needs accurate and convenient automatic annotation of massive volumes of information news. Automatic annotation of stock information news requires a "synonym" corpus that stores central words and related stocks that have a co-occurrence relationship. When building this "synonym" corpus, measuring whether a stock information news central word is associated with a related stock is therefore a key problem for automatic annotation of stock information news.
In recent years, research at home and abroad on computing the degree of association between words falls roughly into two categories: 1) computation based on a semantic knowledge base; 2) computation based on a large-scale corpus.
Rada et al. and J. H. Lee et al. compute the similarity between English words from the shortest path formed by hypernym-hyponym relations between word nodes in WordNet [3-4]. P. Resnik measures the semantic similarity of two English words by the maximum information content of their common ancestor nodes [5]. When using WordNet to compute the semantic similarity of English words, E. Agirre and G. Rigau consider, in addition to the path length between nodes, other factors such as the depth of the concept hierarchy and the regional density of the concept hierarchy [6]. In research on Chinese word similarity, Wang Bin uses the paths between nodes of a tree diagram to compute the similarity between Chinese words based on the "Chinese thesaurus" [7]. Liu Qun et al. propose a word similarity computation method based on "HowNet" [8]. Li Sujian et al. propose a method that combines "HowNet" and the "Chinese thesaurus" to compute Chinese word similarity [9]; when computing sememe similarity, it considers not only the context relations between sememes but also other relations between them.
Lillian Lee uses joint entropy, and P. Brown et al. use average mutual information, to compute the similarity between words [10-11]. Dagan et al. use a more complex probabilistic model to compute the distance between words [12]. Hu Junfeng et al. approximate the semantics of a word with a context-based word vector space model and define word similarity on that basis [13]. Liu Qun associates words with one another using a Hopfield neural network [8], and stores the similarity values between words in a fuzzy introspective matrix that reflects the degree of association between keywords.
The word association computation methods based on a semantic dictionary described above all require a semantic dictionary for the field to be available in advance. No semantic dictionary for the stock field currently exists, and building one is costly and time-consuming, so these methods are not suitable for measuring the association between stock information news central words and related stocks. In addition, traditional statistics-based word association methods cannot handle word pairs with high and low co-occurrence frequencies at the same time, and so cannot guarantee the quality of the measured association between stock information news central words and related stocks.
References:
[1] Shi Aiping. Research on web page keywords based on semantic distance [D]. Jiangsu University of Science and Technology, 2011.
[2] Yu Xiaojie. Automatic content annotation system for news report programs [D]. Tsinghua University, 2011.
[3] Rada R, Mili H, Bicknell E, et al. Development and application of a metric on semantic nets [J]. IEEE Transactions on Systems, Man, and Cybernetics, 1989, 19(1): 17-30.
[4] Lee J H, Kim M H, Lee Y J. Information Retrieval Based on Conceptual Distance in a Is-a Hierarchy [J]. Journal of Documentation, 1993, 49(2): 188-207.
[5] Resnik P. Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language [M]. AI Access Foundation, 1999.
[6] Agirre E, Rigau G. A Proposal for Word Sense Disambiguation using Conceptual Distance [J]. Computer Science, 2009.
[7] Wang Bin. Research on automatic alignment of Chinese-English bilingual corpora [D]. Graduate School of the Chinese Academy of Sciences (Institute of Computing Technology), 1999.
[8] Liu Qun, Li Sujian. Word similarity computation based on "HowNet" [J]. Chinese Computational Linguistics, 2002.
[9] Li S, Zhang J, Huang X, et al. Semantic computation in a Chinese Question-answering system [J]. Journal of Computer Science and Technology (English edition), 2002, 17(6): 933-939.
[10] Brown P F, Pietra S A D, Pietra V J D, et al. Word-Sense Disambiguation Using Statistical Methods [C] // Meeting of the Association for Computational Linguistics. 1991: 264-270.
[11] Lee L J. Similarity-Based Approaches to Natural Language Processing [J]. Computer Science, 1997.
[12] Dagan I, Lee L, Pereira F C N. Similarity-Based Models of Word Cooccurrence Probabilities [J]. Machine Learning, 1999, 34(1-3): 43-69.
[13] Hu Junfeng, Yu Shiwen. Statistical analysis of word similarity in Tang and Song poetry and its application [J]. Journal of Chinese Information Processing, 2002, 16(4): 40-45.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for measuring the association between stock information news keywords and related stocks that makes the degree of association more robust to words of different frequencies and improves the quality of the measured association between stock information news keywords and related stocks.
The present invention is implemented as follows: a method for measuring the association between stock information news keywords and related stocks, comprising:
Step S10: read the data in the prepared stock information news file and construct a transaction database D, D = {T1, T2, T3, …, Ti}, where transaction Ti denotes the itemset formed by the keywords of one stock information news item, i ∈ [1, n], and n denotes the number of stock information news records contained in the stock information news file;
Step S20: exhaustively enumerate all frequent itemsets from the transaction database D, and generate a frequent itemset database L and frequent itemset groups Lk, L = {L1, L2, L3, …, Lk}, Lk = {Fk.1, Fk.2, Fk.3, …, Fk.m}, where frequent itemset Fk.m denotes a frequent k-itemset composed of k stock information news keywords, m denotes a serial number, and k and m are positive integers;
Step S30: from each frequent itemset Fk.m, compute the association rules α → β that have a co-occurrence relationship, where itemset α is a non-empty proper subset of Fk.m and itemset β is the complement of α with respect to Fk.m, and store the association rules α → β in a word co-occurrence database.
Further, step S20 specifically comprises:
Step S21: scan the transaction database D and generate candidate itemset group C1, C1 = {E1.1, E1.2, E1.3, …, E1.w}, where candidate itemset E1.w denotes a 1-itemset consisting of one stock information news keyword, and w denotes a serial number and is a positive integer;
Step S22: compute the support count count(E1.w) of candidate itemset E1.w in the transaction database D; if count(E1.w) is greater than or equal to the preset minimum support count threshold, put candidate itemset E1.w into frequent itemset group L1; if count(E1.w) is less than the preset minimum support count threshold, remove candidate itemset E1.w;
Step S23: generate candidate itemsets Eh.s from the frequent itemsets Fh-1.m in frequent itemset group Lh-1, where candidate itemset Eh.s denotes an h-itemset consisting of h stock information news keywords, s denotes a serial number and is a positive integer, and h is an integer greater than or equal to 2;
Step S24: apply infrequent-subset pruning to the candidate itemsets Eh.s, then generate candidate itemset group Ch, Ch = {Eh.1, Eh.2, …, Eh.j}, where j denotes a serial number and is a positive integer;
Step S25: compute the support count count(Eh.j) of candidate itemset Eh.j in the transaction database D; if count(Eh.j) is greater than or equal to the minimum support count threshold, put candidate itemset Eh.j into frequent itemset group Lh; if count(Eh.j) is less than the minimum support count threshold, remove candidate itemset Eh.j;
Step S26: generate candidate itemset group Ch+1 from frequent itemset group Lh; if candidate itemset group Ch+1 is empty, frequent itemset group Lh is the maximal frequent itemset group, and the frequent itemset database L is finally obtained; if candidate itemset group Ch+1 is not empty, go to step S25.
Further, generating candidate itemsets Eh.s from the frequent itemsets Fh-1.m in frequent itemset group Lh-1 is specifically: traverse frequent itemset group Lh-1 and select two itemsets, Fh-1.p and Fh-1.q, where p and q denote serial numbers and are positive integers; compare the items in Fh-1.p with the items in Fh-1.q and determine whether exactly h-2 items are identical; if so, join Fh-1.p and Fh-1.q to generate candidate itemset Eh.s; if not, do not join them.
Further, applying infrequent-subset pruning to candidate itemset Eh.s is specifically: traverse the subsets Sh-1.u of candidate itemset Eh.s, where each subset Sh-1.u is an (h-1)-itemset and u denotes a serial number and is a positive integer; if any subset Sh-1.u does not belong to frequent itemset group Lh-1, exclude candidate itemset Eh.s; if every subset Sh-1.u belongs to frequent itemset group Lh-1, put candidate itemset Eh.s into candidate itemset group Ch.
Further, computing the support count count(E1.w) of candidate itemset E1.w in the transaction database D is specifically: traverse the transaction database D, and each time candidate itemset E1.w is a subset of a transaction Ti, increase the occurrence count of E1.w by one;
Computing the support count count(Eh.j) of candidate itemset Eh.j in the transaction database D is specifically: traverse the transaction database D, and each time candidate itemset Eh.j is a subset of a transaction Ti, increase the occurrence count of Eh.j by one.
Further, the calculating of the cooccurrence relation specifically: the unrelated degree for calculating correlation rule α → β, if the pass The unrelated degree for joining rule α → β is less than or equal to preset unrelated degree threshold value, then it represents that the item collection α and item collection β has described Cooccurrence relation;If the unrelated degree of the correlation rule α → β is greater than the preset unrelated degree threshold value, then it represents that the item collection α Do not have the cooccurrence relation with the item collection β.
Further, the calculation formula of the unrelated degree of the correlation rule α → β is as follows:
Wherein M indicates the size of the transaction database D;F (α) and f (β) respectively indicate the item collection α and the item collection β The number of appearance;F (α, β) indicates while occurring the number of the item collection α and the item collection β, and NDG (α, β) indicates the association The unrelated degree of regular α → β.
Further, the keyword includes centre word and mark word.
The present invention has the advantage that the relevance metric method of stock information news keyword of the invention and related stock It is counted, is used by the number that stock information news keyword and related stock occurs in different news in mining algorithm Unrelated degree formula carries out the relevance metric of stock information news keyword and related stock, and computational efficiency is high, quick reliable.
Brief description of the drawings
The present invention is further described below with reference to the accompanying drawings and embodiments.
Fig. 1 is a flow chart of generating frequent itemset group L1 from the transaction database D in the method of the present invention.
Fig. 2 is a flow chart of generating candidate itemset group Ch from frequent itemset group Lh-1 in the method of the present invention.
Fig. 3 is a flow chart of finding the maximal frequent itemset group from candidate itemset group Ch in the method of the present invention.
Fig. 4 is a flow chart of finding the association rules with a co-occurrence relationship from frequent itemset group Lk in the method of the present invention.
Detailed description of the embodiments
Referring to Fig. 1 to Fig. 4, a method for measuring the association between stock information news keywords and related stocks comprises:
Step S10: read the data in the prepared stock information news file and construct a transaction database D, D = {T1, T2, T3, …, Ti}, where transaction Ti denotes the itemset formed by the keywords of one stock information news item, i ∈ [1, n], and n denotes the number of stock information news records contained in the stock information news file. The keywords include central words and label words. The data in the stock information news file consist of multiple lines of central words and label words; the central words and label words on each line all come from the same news item, and the words are separated by spaces.
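Step S10 can be sketched as follows. This is a minimal illustration only, assuming the file is available as an iterable of text lines and that no keyword contains an internal space; the function name `build_transactions` and the sample keywords are hypothetical and are not taken from the application.

```python
from typing import FrozenSet, Iterable, List


def build_transactions(lines: Iterable[str]) -> List[FrozenSet[str]]:
    """Turn each non-empty line of space-separated keywords into one transaction."""
    transactions = []
    for line in lines:
        words = line.split()  # keywords on a line are separated by spaces
        if words:             # skip blank lines
            transactions.append(frozenset(words))
    return transactions


# Hypothetical sample lines; each line stands for the keywords of one news item.
sample_lines = [
    "Amazon cross-border-e-commerce",
    "",
    "Alibaba Alibaba-concept",
    "LeTV Jia-Yueting LeTV",
]
D = build_transactions(sample_lines)
```

Storing each transaction as a `frozenset` means a keyword repeated on one line is kept once, which matches treating a transaction as an itemset.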
Step S20: exhaustively enumerate all frequent itemsets from the transaction database D, and generate a frequent itemset database L and frequent itemset groups Lk, L = {L1, L2, L3, …, Lk}, Lk = {Fk.1, Fk.2, Fk.3, …, Fk.m}, where frequent itemset Fk.m denotes a frequent k-itemset composed of k stock information news keywords, m denotes a serial number, and k and m are positive integers;
Step S20 specifically comprises:
Step S21: scan the transaction database D and generate candidate itemset group C1, C1 = {E1.1, E1.2, E1.3, …, E1.w}, where candidate itemset E1.w denotes one stock information news keyword and is a 1-itemset, and w denotes a serial number and is a positive integer. Candidate itemset group C1 contains w candidate 1-itemsets, which means that there are w stock information news keywords in total.
Step S22: compute the support count count(E1.w) of candidate itemset E1.w in the transaction database D; if count(E1.w) is greater than or equal to the preset minimum support count threshold, put candidate itemset E1.w into frequent itemset group L1; if count(E1.w) is less than the preset minimum support count threshold, remove candidate itemset E1.w. In this way frequent itemset group L1 is obtained, L1 = {F1.1, F1.2, F1.3, …, F1.m}. The minimum support count threshold is denoted min_sup and is a preset value; each frequent itemset F1.m is a candidate itemset E1.w that satisfies count(E1.w) ≥ min_sup.
In step S22, computing the support count count(E1.w) of candidate itemset E1.w in the transaction database D is specifically: traverse the transaction database D, and each time candidate itemset E1.w is a subset of a transaction Ti, increase the occurrence count of E1.w by one;
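Steps S21 and S22 can be sketched as a single counting pass over the transactions. This is an illustrative sketch under the assumption that each keyword is counted at most once per transaction; the function name `frequent_1_itemsets` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable


def frequent_1_itemsets(
    transactions: Iterable[Iterable[str]], min_sup: float
) -> Dict[FrozenSet[str], int]:
    """Return every 1-itemset whose support count is at least min_sup."""
    counts: Counter = Counter()
    for transaction in transactions:
        # set() ensures a keyword repeated inside one transaction counts once
        counts.update(set(transaction))
    return {frozenset([item]): c for item, c in counts.items() if c >= min_sup}


# Hypothetical transaction database with four news items.
D = [
    ["Amazon", "cross-border-e-commerce"],
    ["cross-border-e-commerce", "Amazon"],
    ["Vanke", "real-estate-development"],
    ["LeTV", "LeTV", "Jia-Yueting"],
]
L1 = frequent_1_itemsets(D, min_sup=2)
```

With `min_sup=2`, only the two keywords that appear in two different transactions survive; the repeated keyword in the last transaction still has a support count of 1.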
Step S23: use the function apriori_gen to generate candidate itemsets Eh.s from the frequent itemsets Fh-1.m in frequent itemset group Lh-1, where candidate itemset Eh.s denotes an h-itemset consisting of h stock information news keywords, s denotes a serial number and is a positive integer, and h is an integer greater than or equal to 2;
In step S23, the function apriori_gen works as follows: traverse frequent itemset group Lh-1 and select two itemsets, Fh-1.p and Fh-1.q, where p and q denote serial numbers and are positive integers; compare the items in Fh-1.p with the items in Fh-1.q and determine whether exactly h-2 items are identical; if so, join Fh-1.p and Fh-1.q to generate candidate itemset Eh.s; if not, do not join them.
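The join rule described for apriori_gen can be sketched as below. This is a hedged illustration rather than the application's own code: the name `join_candidates` is hypothetical, and collecting results in a set (so that the same union produced by different pairs is kept once) is an assumption, since the text does not say whether duplicate candidates are removed.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set


def join_candidates(prev_frequent: Iterable[FrozenSet[str]]) -> Set[FrozenSet[str]]:
    """Join pairs of (h-1)-itemsets that share exactly h-2 items into h-itemsets."""
    prev = list(prev_frequent)
    candidates: Set[FrozenSet[str]] = set()
    for p, q in combinations(prev, 2):
        # both itemsets have h-1 items, so len(p) - 1 equals h-2
        if len(p & q) == len(p) - 1:
            candidates.add(p | q)
    return candidates


# Hypothetical frequent 2-itemsets.
L2 = [frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("de")]
C3 = join_candidates(L2)

# Eleven frequent 1-itemsets share 0 items pairwise, so every pair is joined.
singletons = [frozenset([str(i)]) for i in range(11)]
C2 = join_candidates(singletons)
```

For h = 2 the condition reduces to "share no item", so 11 frequent 1-itemsets give 11 × 10 / 2 = 55 candidate 2-itemsets.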
Step S24: use the function has_infrequent_sub to apply infrequent-subset pruning to the candidate itemsets Eh.s, then generate candidate itemset group Ch, Ch = {Eh.1, Eh.2, …, Eh.j}, where j denotes a serial number and is a positive integer;
In step S24, the function has_infrequent_sub works as follows: traverse the subsets Sh-1.u of candidate itemset Eh.s, where each subset Sh-1.u is an (h-1)-itemset and u denotes a serial number and is a positive integer; if any subset Sh-1.u does not belong to frequent itemset group Lh-1, exclude candidate itemset Eh.s, that is, prune it; if every subset Sh-1.u belongs to frequent itemset group Lh-1, put candidate itemset Eh.s into candidate itemset group Ch, that is, do not prune it. Candidate itemset Eh.s has u such subsets. In this way, the candidate itemsets kept in candidate itemset group Ch are those whose (h-1)-subsets are all frequent.
Step S25: compute the support count count(Eh.j) of candidate itemset Eh.j in the transaction database D; if count(Eh.j) is greater than or equal to the minimum support count threshold, put candidate itemset Eh.j into frequent itemset group Lh; if count(Eh.j) is less than the minimum support count threshold, remove candidate itemset Eh.j. In this way frequent itemset group Lh is obtained, Lh = {Fh.1, Fh.2, Fh.3, …, Fh.m}, where each frequent itemset Fh.m is a candidate itemset Eh.j that satisfies count(Eh.j) ≥ min_sup.
In step S25, computing the support count count(Eh.j) of candidate itemset Eh.j in the transaction database D is specifically: traverse the transaction database D, and each time candidate itemset Eh.j is a subset of a transaction Ti, increase the occurrence count of Eh.j by one.
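The pruning test of step S24 and the support count of step S25 can be sketched as two small functions. The text names has_infrequent_sub but does not give its signature, so the signatures, the name `support_count`, and the sample data below are assumptions made for illustration.

```python
from itertools import combinations
from typing import Collection, FrozenSet, Iterable


def has_infrequent_subset(
    candidate: FrozenSet[str], prev_frequent: Collection[FrozenSet[str]]
) -> bool:
    """True if any (h-1)-subset of the h-item candidate is not frequent."""
    h = len(candidate)
    return any(
        frozenset(subset) not in prev_frequent
        for subset in combinations(candidate, h - 1)
    )


def support_count(candidate: FrozenSet[str], transactions: Iterable[FrozenSet[str]]) -> int:
    """Number of transactions that contain the candidate as a subset."""
    return sum(1 for t in transactions if candidate <= t)


# Hypothetical frequent 2-itemsets and transactions.
L2 = {frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("bd")}
D = [frozenset("abc"), frozenset("abcd"), frozenset("bd")]

keep_abc = not has_infrequent_subset(frozenset("abc"), L2)   # all 2-subsets frequent
prune_abd = has_infrequent_subset(frozenset("abd"), L2)      # {a, d} is missing
count_abc = support_count(frozenset("abc"), D)
```

Pruning first avoids scanning the transaction database for candidates that cannot be frequent, because every subset of a frequent itemset must itself be frequent.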
Step S26: generate candidate itemset group Ch+1 from frequent itemset group Lh; if candidate itemset group Ch+1 is empty, frequent itemset group Lh is the maximal frequent itemset group, and the frequent itemset database L is finally obtained, L = {L1, L2, L3, …, Lh}. Since h is an integer greater than or equal to 2 and k is a positive integer, the frequent itemset groups Lk in step S20 include frequent itemset group L1 and the frequent itemset groups Lh.
If candidate itemset group Ch+1 is not empty, go to step S25 and continue to generate frequent itemset group Lh+1, so that L = {L1, L2, L3, …, Lh, Lh+1}; then generate candidate itemset group Ch+2 and determine whether it is empty, and repeat until the maximal frequent itemset group is found.
Step S30: from each frequent itemset Fk.m, compute the association rules α → β that have a co-occurrence relationship, where itemset α is a non-empty proper subset of Fk.m and itemset β is the complement of α with respect to Fk.m, and store the association rules α → β in a word co-occurrence database.
In step S30, the co-occurrence relationship is computed as follows: compute the unrelatedness degree of association rule α → β; if the unrelatedness degree of α → β is less than or equal to a preset unrelatedness degree threshold, itemset α and itemset β have the co-occurrence relationship; if the unrelatedness degree of α → β is greater than the preset unrelatedness degree threshold, itemset α and itemset β do not have the co-occurrence relationship.
The formula for the unrelatedness degree of association rule α → β is as follows:
where M denotes the size of the transaction database D, that is, the number of stock information news records contained in the stock information news file; f(α) and f(β) denote the number of times itemset α and itemset β occur, respectively; f(α, β) denotes the number of times itemset α and itemset β occur together; and NDG(α, β) denotes the unrelatedness degree of association rule α → β.
In Fig. 4, the preset unrelatedness degree threshold is denoted max_Disc.
If the stock information news keywords in itemset α and those in itemset β never appear together in the same news item but only appear separately, the unrelatedness degree between them is infinite, which indicates that there is no strong association; if the news words in itemset α always appear together with the news words in itemset β, the unrelatedness degree of the two is 0, which indicates a very strong association.
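The formula itself is not reproduced in this text. The symbols defined for it (M, f(α), f(β), f(α, β)) and the two limiting cases just described, infinite when the itemsets never co-occur and 0 when they always co-occur, are consistent with a normalized-distance form such as the one below. It is given here only as an assumption for illustration and should not be read as the formula of the application.

```latex
\mathrm{NDG}(\alpha,\beta)
  = \frac{\max\{\log f(\alpha),\ \log f(\beta)\} - \log f(\alpha,\beta)}
         {\log M - \min\{\log f(\alpha),\ \log f(\beta)\}}
```

Under this assumed form, f(α, β) = 0 makes the numerator unbounded, and f(α) = f(β) = f(α, β) makes the numerator 0, which reproduces both limiting cases.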
The word co-occurrence database finally obtained records the stock information news keywords that have a co-occurrence relationship with related stocks, which ensures the quality of the association measurement between stock information news keywords and related stocks; the method of the present invention is computationally efficient, fast, and reliable.
For example, the stock information news file contains 17 stock information news items, so that T1 = {Amazon, cross-border e-commerce}, T2 = {Moutai Group, Alibaba, Kweichow Moutai, baijiu, Alibaba concept}, T3 = {Vanke, real estate development}, T4 = {cross-border e-commerce, Amazon}, T5 = {Toutiao, culture media}, T6 = {Meituan, Mobike, bike sharing, Internet+}, T7 = {Apple, Apple concept, artificial intelligence}, T8 = {Ren Zhengfei, Huawei, Guangdong sector, electronics manufacturing}, T9 = {Alibaba, Alibaba concept}, T10 = {Bitcoin, blockchain}, T11 = {Kuaishou, Huoshan, Internet media, culture media, influencer livestreaming}, T12 = {Toutiao, short video, Internet media, culture media, influencer livestreaming}, T13 = {Shenzhen, smart city, smart city, Guangdong sector}, T14 = {Jia Yueting, LeTV, Hainan sector, LeTV}, T15 = {LeTV, Jia Yueting, LeTV, new energy vehicles}, T16 = {pig grain, live hog, pig, feed}, T17 = {new stocks, sub-new stocks}.
Candidate itemset group C1 is then generated with 41 candidate itemsets: E1.1 = {Amazon}, E1.2 = {blockchain}, E1.3 = {culture media}, E1.4 = {pig grain}, E1.5 = {real estate development}, E1.6 = {Kuaishou}, E1.7 = {electronics manufacturing}, E1.8 = {cross-border e-commerce}, E1.9 = {Toutiao}, E1.10 = {Apple}, E1.11 = {Guangdong sector}, E1.12 = {Internet+}, E1.13 = {Alibaba}, E1.14 = {artificial intelligence}, E1.15 = {Meituan}, E1.16 = {Alibaba concept}, E1.17 = {Mobike}, E1.18 = {Bitcoin}, E1.19 = {Ren Zhengfei}, E1.20 = {influencer livestreaming}, E1.21 = {Vanke}, E1.22 = {live hog}, E1.23 = {Huawei}, E1.24 = {new energy vehicles}, E1.25 = {pig}, E1.26 = {Huoshan}, E1.27 = {short video}, E1.28 = {Jia Yueting}, E1.29 = {smart city}, E1.30 = {LeTV}, E1.31 = {Apple concept}, E1.32 = {Shenzhen}, E1.33 = {Hainan sector}, E1.34 = {Internet media}, E1.35 = {Kweichow Moutai}, E1.36 = {feed}, E1.37 = {sub-new stocks}, E1.38 = {Moutai Group}, E1.39 = {baijiu}, E1.40 = {bike sharing}, E1.41 = {new stocks}.
The preset minimum support count threshold min_sup is obtained by multiplying n, the number of stock information news records contained in the stock information news file, by a scale factor θ. In this example n = 17 and θ = 10%, so min_sup = 1.7. With a minimum support count threshold of 1.7, the frequent itemset group L1 generated from candidate itemset group C1 has 11 frequent itemsets: F1.1 = {Amazon}, F1.2 = {culture media}, F1.3 = {cross-border e-commerce}, F1.4 = {Toutiao}, F1.5 = {Guangdong sector}, F1.6 = {Alibaba}, F1.7 = {Alibaba concept}, F1.8 = {influencer livestreaming}, F1.9 = {Jia Yueting}, F1.10 = {LeTV}, F1.11 = {Internet media}.
The candidate itemset group C2 generated from frequent itemset group L1 has 55 candidate itemsets: E2.1 = {culture media, Amazon}, E2.2 = {cross-border e-commerce, Amazon}, E2.3 = {Amazon, Toutiao}, E2.4 = {Amazon, Guangdong sector}, E2.5 = {Amazon, Alibaba}, E2.6 = {Alibaba concept, Amazon}, E2.7 = {influencer livestreaming, Amazon}, E2.8 = {Amazon, Jia Yueting}, E2.9 = {LeTV, Amazon}, E2.10 = {Internet media, Amazon}, E2.11 = {cross-border e-commerce, culture media}, E2.12 = {culture media, Toutiao}, E2.13 = {culture media, Guangdong sector}, E2.14 = {culture media, Alibaba}, E2.15 = {Alibaba concept, culture media}, E2.16 = {influencer livestreaming, culture media}, E2.17 = {culture media, Jia Yueting}, E2.18 = {LeTV, culture media}, E2.19 = {culture media, Internet media}, E2.20 = {cross-border e-commerce, Toutiao}, E2.21 = {cross-border e-commerce, Guangdong sector}, E2.22 = {cross-border e-commerce, Alibaba}, E2.23 = {cross-border e-commerce, Alibaba concept}, E2.24 = {cross-border e-commerce, influencer livestreaming}, E2.25 = {cross-border e-commerce, Jia Yueting}, E2.26 = {cross-border e-commerce, LeTV}, E2.27 = {cross-border e-commerce, Internet media}, E2.28 = {Toutiao, Guangdong sector}, E2.29 = {Toutiao, Alibaba}, E2.30 = {Alibaba concept, Toutiao}, E2.31 = {influencer livestreaming, Toutiao}, E2.32 = {Toutiao, Jia Yueting}, E2.33 = {LeTV, Toutiao}, E2.34 = {Internet media, Toutiao}, E2.35 = {Alibaba, Guangdong sector}, E2.36 = {Alibaba concept, Guangdong sector}, E2.37 = {influencer livestreaming, Guangdong sector}, E2.38 = {Guangdong sector, Jia Yueting}, E2.39 = {LeTV, Guangdong sector}, E2.40 = {Internet media, Guangdong sector}, E2.41 = {Alibaba concept, Alibaba}, E2.42 = {influencer livestreaming, Alibaba}, E2.43 = {Alibaba, Jia Yueting}, E2.44 = {LeTV, Alibaba}, E2.45 = {Internet media, Alibaba}, E2.46 = {Alibaba concept, influencer livestreaming}, E2.47 = {Alibaba concept, Jia Yueting}, E2.48 = {Alibaba concept, LeTV}, E2.49 = {Alibaba concept, Internet media}, E2.50 = {influencer livestreaming, Jia Yueting}, E2.51 = {influencer livestreaming, LeTV}, E2.52 = {influencer livestreaming, Internet media}, E2.53 = {LeTV, Jia Yueting}, E2.54 = {Internet media, Jia Yueting}, E2.55 = {LeTV, Internet media}.
The frequent item set group L2 generated from the candidate item set group C2 has 7 frequent item sets: F2.1={cross-border electric business, Amazon}, F2.2={cultural medium, today's tops}, F2.3={net red live streaming, cultural medium}, F2.4={cultural medium, internet medium}, F2.5={Ali's concept, Alibaba}, F2.6={net red live streaming, internet medium}, F2.7={LeTV, Jia Yueting}.
The candidate item set group C3 generated from the frequent item set group L2 has 4 candidate item sets: E3.1={net red live streaming, cultural medium, today's tops}, E3.2={cultural medium, internet medium, today's tops}, E3.3={net red live streaming, cultural medium, internet medium}, E3.4={net red live streaming, internet medium, cultural medium}.
The frequent item set group L3 generated from the candidate item set group C3 has 2 frequent item sets: F3.1={net red live streaming, cultural medium, internet medium}, F3.2={net red live streaming, internet medium, cultural medium}.
As required, the generated candidate item set group C4 is empty, so the frequent item set group L3 is the maximal frequent item set group.
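The step from L2 to C3, and the empty C4, can be illustrated with a short Python sketch of the join described in the method: two (h-1)-item sets are joined when they share exactly h-2 items. This is a sketch under assumptions, not the patented implementation: the function name `join` is hypothetical, and item sets are stored as unordered frozensets, so joins that yield the same three keywords in a different order collapse into one candidate. The sketch therefore reports three distinct 3-item candidates where the text lists E3.3 and E3.4 separately.

```python
from itertools import combinations

# Frequent 2-item sets F2.1 ... F2.7 from the example above.
l2 = [
    frozenset({"cross-border electric business", "Amazon"}),
    frozenset({"cultural medium", "today's tops"}),
    frozenset({"net red live streaming", "cultural medium"}),
    frozenset({"cultural medium", "internet medium"}),
    frozenset({"Ali's concept", "Alibaba"}),
    frozenset({"net red live streaming", "internet medium"}),
    frozenset({"LeTV", "Jia Yueting"}),
]

def join(prev_level, h):
    """Join pairs of (h-1)-item sets that share exactly h-2 items."""
    candidates = set()
    for a, b in combinations(prev_level, 2):
        if len(a & b) == h - 2:
            candidates.add(a | b)
    return candidates

c3 = join(l2, 3)
for candidate in sorted(sorted(c) for c in c3):
    print(candidate)

# Only one distinct 3-item set is frequent in the example, so no pair
# exists to join and the next candidate group is empty.
l3 = [frozenset({"net red live streaming", "cultural medium", "internet medium"})]
c4 = join(l3, 4)
print(len(c4))  # 0
```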
Frequent item sets are chosen from the frequent item set groups L1, L2 and L3, and the correlation rules of the co-occurrence relation are calculated. In this example the preset unrelated degree threshold is 0.4, and the calculated results are:
The unrelated degree of correlation rule { Amazon } → { cross-border electric business } is 0;
The unrelated degree of correlation rule { today's tops } → { cultural medium } is 0.18;
The unrelated degree of correlation rule { Alibaba } → { Ali's concept } is 0;
The unrelated degree of correlation rule { Jia Yueting } → { LeTV } is 0;
Since each of these unrelated degrees does not exceed the threshold of 0.4, the rules {Amazon} → {cross-border electric business}, {today's tops} → {cultural medium}, {Alibaba} → {Ali's concept} and {Jia Yueting} → {LeTV} are accepted, and the keywords in each rule are considered to have a co-occurrence relation.
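The acceptance test used in this example can be sketched in a few lines of Python. The unrelated degrees are the four values reported above and the threshold is the 0.4 from the example; the names `unrelated_degree`, `THRESHOLD` and `has_cooccurrence` are illustrative assumptions, and the sketch does not compute the unrelated degree itself.

```python
# Unrelated degrees reported in the example above, keyed by (alpha, beta).
unrelated_degree = {
    ("Amazon", "cross-border electric business"): 0.0,
    ("today's tops", "cultural medium"): 0.18,
    ("Alibaba", "Ali's concept"): 0.0,
    ("Jia Yueting", "LeTV"): 0.0,
}
THRESHOLD = 0.4  # preset unrelated degree threshold from the example

def has_cooccurrence(ndg, threshold=THRESHOLD):
    """A rule is accepted when its unrelated degree does not exceed the threshold."""
    return ndg <= threshold

accepted = [rule for rule, ndg in unrelated_degree.items() if has_cooccurrence(ndg)]
for alpha, beta in accepted:
    print(f"{{{alpha}}} -> {{{beta}}}")
```

A rule whose unrelated degree were above 0.4 would be left out of `accepted`, which corresponds to the case in which the two item sets are treated as having no co-occurrence relation.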
Although specific embodiments of the present invention have been described above, those skilled in the art should understand that the described embodiments are merely illustrative and do not limit the scope of the present invention. Equivalent modifications and variations made by those skilled in the art in accordance with the spirit of the present invention shall fall within the scope of protection claimed by the present invention.

Claims (8)

1. A relevance metric method for stock information news keywords and related stocks, characterized by comprising:
Step S10: reading the data in a prepared stock information news file and constructing a transaction database D, D = {T_1, T_2, T_3, …, T_i}, where a transaction T_i denotes the item set formed by the keywords of one stock information news article, i ∈ [1, n], and n denotes the number of stock information news records contained in the stock information news file;
Step S20: exhaustively enumerating all frequent item sets from the transaction database D, and generating a frequent item set database L and frequent item set groups L_k, L = {L_1, L_2, L_3, …, L_k}, L_k = {F_{k.1}, F_{k.2}, F_{k.3}, …, F_{k.m}}, where a frequent item set F_{k.m} denotes a frequent k-item set composed of k stock information news keywords, m denotes a serial number, and k and m are positive integers;
Step S30: calculating, from the frequent item set F_{k.m}, several correlation rules α → β of the co-occurrence relation, where item set α is a non-empty proper subset of F_{k.m} and item set β is the complement of item set α with respect to the frequent item set F_{k.m}, and adding the correlation rules α → β to a term co-occurrence database.
2. The relevance metric method for stock information news keywords and related stocks according to claim 1, characterized in that step S20 specifically comprises:
Step S21: scanning the transaction database D and generating a candidate item set group C_1, C_1 = {E_{1.1}, E_{1.2}, E_{1.3}, …, E_{1.w}}, where a candidate item set E_{1.j} denotes a 1-item set composed of one stock information news keyword, and w denotes a serial number and is a positive integer;
Step S22: calculating the support count count(E_{1.w}) of the candidate item set E_{1.w} in the transaction database D; if the support count count(E_{1.w}) is greater than or equal to a preset minimum support count threshold, adding the candidate item set E_{1.w} to the frequent item set group L_1; if the support count count(E_{1.w}) is less than the preset minimum support count threshold, removing the candidate item set E_{1.w};
Step S23: generating candidate item sets E_{h.s} from the frequent item sets F_{h-1.m} in the frequent item set group L_{h-1}, where a candidate item set E_{h.s} denotes an h-item set composed of h stock information news keywords, s denotes a serial number and is a positive integer, and h is an integer greater than or equal to 2;
Step S24: performing non-frequent pruning on the candidate item sets E_{h.s}, and then generating a candidate item set group C_h, C_h = {E_{h.1}, E_{h.2}, …, E_{h.j}}, where j denotes a serial number and is a positive integer;
Step S25: calculating the support count count(E_{h.j}) of the candidate item set E_{h.j} in the transaction database D; if the support count count(E_{h.j}) is greater than or equal to the minimum support count threshold, adding the candidate item set E_{h.j} to the frequent item set group L_h; if the support count count(E_{h.j}) is less than the minimum support count threshold, removing the candidate item set E_{h.j};
Step S26: generating a candidate item set group C_{h+1} from the frequent item set group L_h; if the candidate item set group C_{h+1} is empty, the frequent item set group L_h is the maximal frequent item set group, and the frequent item set database L is finally obtained; if the candidate item set group C_{h+1} is not empty, going to step S25.
3. The relevance metric method for stock information news keywords and related stocks according to claim 2, characterized in that generating the candidate item sets E_{h.s} from the frequent item sets F_{h-1.m} in the frequent item set group L_{h-1} specifically comprises: traversing the frequent item set group L_{h-1} and choosing two item sets, item set F_{h-1.p} and item set F_{h-1.q}, where p and q denote serial numbers and are positive integers; comparing the items in item set F_{h-1.p} with the items in item set F_{h-1.q} and judging whether they have exactly k-2 identical items; if so, joining item set F_{h-1.p} and item set F_{h-1.q} to generate the candidate item set E_{h.s}; if not, not joining them.
4. The relevance metric method for stock information news keywords and related stocks according to claim 2, characterized in that performing non-frequent pruning on the candidate item sets E_{h.s} specifically comprises: traversing the subsets S_{h-1.u} of the candidate item set E_{h.s}, where each subset S_{h-1.u} is an (h-1)-item set, and u denotes a serial number and is a positive integer; if a subset S_{h-1.u} does not belong to the frequent item set group L_{h-1}, excluding the candidate item set E_{h.s}; if the subset S_{h-1.u} belongs to the frequent item set group L_{h-1}, adding the candidate item set E_{h.s} to the candidate item set group C_h.
5. The relevance metric method for stock information news keywords and related stocks according to claim 2, characterized in that calculating the support count count(E_{1.w}) of the candidate item set E_{1.w} in the transaction database D specifically comprises: traversing the transaction database D, and, whenever the candidate item set E_{1.w} is a subset of a transaction T_i, incrementing the occurrence count of the candidate item set E_{1.w} by one;
and calculating the support count count(E_{h.j}) of the candidate item set E_{h.j} in the transaction database D specifically comprises: traversing the transaction database D, and, whenever the candidate item set E_{h.j} is a subset of a transaction T_i, incrementing the occurrence count of the candidate item set E_{h.j} by one.
6. The relevance metric method for stock information news keywords and related stocks according to claim 1, characterized in that the co-occurrence relation is calculated as follows: the unrelated degree of the correlation rule α → β is calculated; if the unrelated degree of the correlation rule α → β is less than or equal to a preset unrelated degree threshold, item set α and item set β have the co-occurrence relation; if the unrelated degree of the correlation rule α → β is greater than the preset unrelated degree threshold, item set α and item set β do not have the co-occurrence relation.
7. The relevance metric method for stock information news keywords and related stocks according to claim 6, characterized in that the calculation formula of the unrelated degree of the correlation rule α → β is as follows:
where M denotes the size of the transaction database D; f(α) and f(β) denote the number of occurrences of item set α and item set β, respectively; f(α, β) denotes the number of times item set α and item set β occur together; and NDG(α, β) denotes the unrelated degree of the correlation rule α → β.
8. The relevance metric method for stock information news keywords and related stocks according to claim 1, characterized in that the keywords include center words and mark words.
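The level-wise procedure in claims 1 to 5 can be illustrated with a compact Python sketch. It is a hedged illustration rather than the claimed implementation: the function names, the toy transaction database `D` and the threshold `min_count=2` are hypothetical, the join condition is read as "exactly h-2 shared items" for h-item candidates, and the loop simply stops once a level yields no frequent item sets. The unrelated degree is not computed here.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions T_i that contain the item set (claim 5)."""
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_count):
    """Return [L_1, L_2, ...] following steps S21 to S26."""
    items = {item for t in transactions for item in t}
    # S21/S22: 1-item candidates, kept when the support count reaches the threshold.
    level = {frozenset({i}) for i in items
             if support_count(frozenset({i}), transactions) >= min_count}
    levels = []
    h = 2
    while level:
        levels.append(level)
        # S23: join (h-1)-item sets that share exactly h-2 items.
        joined = {a | b for a, b in combinations(level, 2) if len(a & b) == h - 2}
        # S24: drop candidates that have a non-frequent (h-1)-item subset.
        pruned = {c for c in joined
                  if all(frozenset(s) in level for s in combinations(c, h - 1))}
        # S25: keep candidates whose support count reaches the threshold.
        level = {c for c in pruned if support_count(c, transactions) >= min_count}
        h += 1
    return levels

def rules(itemset):
    """S30: every split of a frequent item set into alpha -> beta."""
    for r in range(1, len(itemset)):
        for alpha in combinations(sorted(itemset), r):
            yield frozenset(alpha), itemset - frozenset(alpha)

# Hypothetical toy transactions: one keyword set per news article.
D = [
    frozenset({"Amazon", "cross-border electric business"}),
    frozenset({"Amazon", "cross-border electric business", "Alibaba"}),
    frozenset({"LeTV", "Jia Yueting"}),
    frozenset({"LeTV", "Jia Yueting", "Alibaba"}),
]
L = apriori(D, min_count=2)
for alpha, beta in rules(frozenset({"LeTV", "Jia Yueting"})):
    print(sorted(alpha), "->", sorted(beta))
```

With this toy data every keyword occurs in two articles, so L_1 holds five 1-item sets, and only the two pairs that occur together twice survive into L_2. Each rule α → β produced by `rules` would then be scored with the unrelated degree and compared against the preset threshold.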
CN201811318217.1A 2018-11-07 2018-11-07 A kind of relevance metric method of stock information press center word and related stock Pending CN109508386A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811318217.1A CN109508386A (en) 2018-11-07 2018-11-07 A kind of relevance metric method of stock information press center word and related stock

Publications (1)

Publication Number Publication Date
CN109508386A (en) 2019-03-22

Family

ID=65747769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811318217.1A Pending CN109508386A (en) 2018-11-07 2018-11-07 A kind of relevance metric method of stock information press center word and related stock

Country Status (1)

Country Link
CN (1) CN109508386A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150006502A1 (en) * 2013-06-28 2015-01-01 International Business Machines Corporation Augmenting search results with interactive search matrix
CN105740466A (en) * 2016-03-04 2016-07-06 百度在线网络技术(北京)有限公司 Method and device for excavating incidence relation between hotspot concepts
CN106296401A (en) * 2016-11-24 2017-01-04 吴梅红 A kind of Strong association rule method for digging understood for stock market's operation logic

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
黄名选: "基于完全加权关联规则挖掘的查询扩展研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115305A (en) * 2019-06-21 2020-12-22 杭州海康威视数字技术股份有限公司 Group identification method and device and computer readable storage medium
CN112115305B (en) * 2019-06-21 2024-04-09 杭州海康威视数字技术股份有限公司 Group identification method apparatus and computer-readable storage medium
CN113722432A (en) * 2021-08-26 2021-11-30 杭州隆埠科技有限公司 Method and device for associating news with stocks
CN113722432B (en) * 2021-08-26 2024-01-09 杭州隆埠科技有限公司 Method and device for associating news with stocks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190322