CN109508386A - A method for measuring the association between stock information news center words and related stocks
- Publication number
- CN109508386A (application CN201811318217.1A)
- Authority
- CN
- China
- Prior art keywords
- candidate
- item
- frequent
- stock
- item collection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a method for measuring the association between stock information news keywords and related stocks, comprising: step S10, reading the data in a prepared stock information news file and constructing a transaction database D; step S20, exhaustively finding all frequent itemsets in the transaction database D, and generating a frequent itemset database L and frequent itemset groups L_k; step S30, from each frequent itemset F_{k.m}, computing the association rules α → β that have a co-occurrence relation, where itemset α is a non-empty proper subset of F_{k.m} and itemset β is the complement of α with respect to F_{k.m}, and adding those association rules α → β to a term co-occurrence database. The method uses a mining algorithm to count how often stock information news keywords and related stocks appear in different news items, and applies an association-degree formula to measure the association between stock information news keywords and related stocks; it is computationally efficient, fast, and reliable.
Description
Technical field
The present invention relates to the technical field of stock data mining, and in particular to a method for measuring the association between stock information news center words and related stocks.
Background technique
A collection of items is called an itemset; an itemset containing k items is called a k-itemset; an itemset whose support is greater than the minimum support threshold is a frequent itemset; the occurrence frequency of an itemset is the number of transactions that contain the itemset, also called the frequency, support count, or count of the itemset. An association rule is an implication of the form X → Y, where X and Y are called the antecedent and the consequent of the association rule, respectively.
With the rapid development and spread of information technology, stock-related news on the Internet has grown explosively. How to obtain the required useful information quickly and accurately from this mass of information has become a difficult problem [1]. To organize and manage effectively the large volume of news data that is constantly accumulating and being updated, its content must be annotated so that it becomes structured [2]. Because annotating news by hand is time-consuming, accurate and convenient automatic annotation of massive amounts of information news has become a market need. Automatic annotation of stock information news in turn requires building a "synonym" corpus that stores center words and related stocks having a co-occurrence relation. Therefore, in building the "synonym" corpus, measuring whether a stock information news center word is associated with a related stock is a key problem for realizing automatic annotation of stock information news.
In recent years, research at home and abroad on computing the degree of association between words can be divided roughly into two categories: 1) computing the degree of word association from a semantic knowledge base; 2) computing the degree of word association from a large-scale corpus.
Rada et al. and J. H. Lee et al. compute the similarity between English words from the shortest path between word nodes formed by hypernym-hyponym relations in WordNet [3-4]. P. Resnik measures the semantic similarity of two English words by the maximum information content of their common ancestor nodes [5]. When using WordNet to compute the semantic similarity of English words, E. Agirre and G. Rigau consider, besides the path length between nodes, some other factors, such as the depth of the concept hierarchy and the regional density of the concept hierarchy [6]. In research on Chinese word similarity, Wang Bin uses the path between nodes in a tree diagram [7] and computes the similarity between Chinese words with the "Chinese thesaurus". Liu Qun et al. propose a word similarity computation method based on HowNet [8]. Li Sujian et al. propose a method that combines HowNet and the "Chinese thesaurus" to compute Chinese word similarity [9]; in computing sememe similarity, it considers not only the hypernym-hyponym relation between sememes but also other relations between sememes.
Lillian Lee uses joint entropy, and P. Brown et al. use average mutual information, to compute the similarity between words [10-11]. Dagan et al. use a more complex probabilistic model to compute the distance between words [12]. Hu Junfeng et al. use a context-based word vector space model to describe the semantics of a word approximately, and define the similarity relation between words on that basis [13]. Liu Qun uses a Hopfield neural network to associate words with one another [8], and stores the similarity between words in a fuzzy reflexive matrix that reflects the degree of association between keywords.
The word association computation methods above that are based on a semantic dictionary all require a semantic dictionary for the field in question to be provided in advance. No semantic dictionary for stocks currently exists, and building one is costly and time-consuming, so dictionary-based methods are not suitable for measuring the association between stock information news center words and related stocks. In addition, traditional statistics-based word association computation methods cannot handle both high and low word co-occurrence frequencies at the same time, and therefore cannot guarantee the quality of the measured association between stock information news center words and related stocks.
References:
[1] Shi Aiping. Research on web page keywords based on semantic distance [D]. Jiangsu University of Science and Technology, 2011.
[2] Yu Xiaojie. Automatic content annotation system for news report programs [D]. Tsinghua University, 2011.
[3] Rada R, Mili H, Bicknell E, et al. Development and application of a metric on semantic nets [J]. IEEE Transactions on Systems, Man, and Cybernetics, 1989, 19(1): 17-30.
[4] Lee J H, Kim M H, Lee Y J. Information retrieval based on conceptual distance in IS-A hierarchies [J]. Journal of Documentation, 1993, 49(2): 188-207.
[5] Resnik P. Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language [M]. AI Access Foundation, 1999.
[6] Agirre E, Rigau G. A Proposal for Word Sense Disambiguation using Conceptual Distance [J]. Computer Science, 2009.
[7] Wang Bin. Research on automatic alignment of Chinese-English bilingual corpora [D]. Graduate School of the Chinese Academy of Sciences (Institute of Computing Technology), 1999.
[8] Liu Qun, Li Sujian. Word similarity computation based on HowNet [J]. Chinese Computational Linguistics, 2002.
[9] Li S, Zhang J, Huang X, et al. Semantic computation in a Chinese question-answering system [J]. Journal of Computer Science and Technology (English edition), 2002, 17(6): 933-939.
[10] Brown P F, Pietra S A D, Pietra V J D, et al. Word-Sense Disambiguation Using Statistical Methods [C] // Meeting of the Association for Computational Linguistics. 1991: 264-270.
[11] Lee L J. Similarity-Based Approaches to Natural Language Processing [J]. Computer Science, 1997.
[12] Dagan I, Lee L, Pereira F C N. Similarity-Based Models of Word Cooccurrence Probabilities [J]. Machine Learning, 1999, 34(1-3): 43-69.
[13] Hu Junfeng, Yu Shiwen. Statistical analysis of word similarity in Tang and Song poetry and its application [J]. Journal of Chinese Information Processing, 2002, 16(4): 40-45.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for measuring the association between stock information news keywords and related stocks that makes the degree of association more robust to words of different frequencies and improves the quality of the measured association between stock information news keywords and related stocks.
The present invention is implemented as follows: a method for measuring the association between stock information news keywords and related stocks, comprising:
Step S10: read the data in a prepared stock information news file and construct a transaction database D, D = {T_1, T_2, T_3, …, T_n}, where transaction T_i denotes the itemset formed by the keywords of one stock information news item, i ∈ [1, n], and n denotes the number of stock information news records contained in the stock information news file;
Step S20: exhaustively find all frequent itemsets in the transaction database D, and generate a frequent itemset database L and frequent itemset groups L_k, L = {L_1, L_2, L_3, …, L_k}, L_k = {F_{k.1}, F_{k.2}, F_{k.3}, …, F_{k.m}}, where frequent itemset F_{k.m} denotes a frequent k-itemset composed of k stock information news keywords, m is a serial number, and k and m are positive integers;
Step S30: from each frequent itemset F_{k.m}, compute the association rules α → β that have a co-occurrence relation, where itemset α is a non-empty proper subset of F_{k.m} and itemset β is the complement of α with respect to F_{k.m}, and add those association rules α → β to a term co-occurrence database.
Further, step S20 specifically comprises:
Step S21: scan the transaction database D and generate candidate itemset group C_1, C_1 = {E_{1.1}, E_{1.2}, E_{1.3}, …, E_{1.w}}, where candidate itemset E_{1.w} denotes a 1-itemset composed of one stock information news keyword, and w is a serial number and a positive integer;
Step S22: compute the support count count(E_{1.w}) of candidate itemset E_{1.w} in the transaction database D; if count(E_{1.w}) is greater than or equal to a preset minimum support count threshold, add E_{1.w} to frequent itemset group L_1; if count(E_{1.w}) is less than the preset minimum support count threshold, discard E_{1.w};
Step S23: generate candidate itemsets E_{h.s} from the frequent itemsets F_{h-1.m} in frequent itemset group L_{h-1}, where E_{h.s} denotes an h-itemset composed of h stock information news keywords, s is a serial number and a positive integer, and h is an integer greater than or equal to 2;
Step S24: apply infrequent-subset pruning to the candidate itemsets E_{h.s}, then generate candidate itemset group C_h, C_h = {E_{h.1}, E_{h.2}, …, E_{h.j}}, where j is a serial number and a positive integer;
Step S25: compute the support count count(E_{h.j}) of candidate itemset E_{h.j} in the transaction database D; if count(E_{h.j}) is greater than or equal to the minimum support count threshold, add E_{h.j} to frequent itemset group L_h; if count(E_{h.j}) is less than the minimum support count threshold, discard E_{h.j};
Step S26: generate candidate itemset group C_{h+1} from frequent itemset group L_h; if C_{h+1} is empty, L_h is the maximal frequent itemset group and the frequent itemset database L is finally obtained; if C_{h+1} is not empty, go to step S25.
Further, generating candidate itemsets E_{h.s} from the frequent itemsets F_{h-1.m} in frequent itemset group L_{h-1} is specifically: traverse frequent itemset group L_{h-1} and choose two itemsets, F_{h-1.p} and F_{h-1.q}, where p and q are serial numbers and positive integers; compare the items of F_{h-1.p} with the items of F_{h-1.q} and judge whether they have exactly h-2 items in common; if so, join F_{h-1.p} and F_{h-1.q} to generate candidate itemset E_{h.s}; if not, do not join them.
Further, applying infrequent-subset pruning to candidate itemset E_{h.s} is specifically: traverse the subsets S_{h-1.u} of E_{h.s}, where each S_{h-1.u} is an (h-1)-itemset and u is a serial number and a positive integer; if any subset S_{h-1.u} does not belong to frequent itemset group L_{h-1}, exclude E_{h.s}; if every subset S_{h-1.u} belongs to L_{h-1}, add E_{h.s} to candidate itemset group C_h.
Further, computing the support count count(E_{1.w}) of candidate itemset E_{1.w} in the transaction database D is specifically: traverse the transaction database D; whenever E_{1.w} is a subset of a transaction T_i, increase the occurrence count of E_{1.w} by one;
Computing the support count count(E_{h.j}) of candidate itemset E_{h.j} in the transaction database D is specifically: traverse the transaction database D; whenever E_{h.j} is a subset of a transaction T_i, increase the occurrence count of E_{h.j} by one.
Further, the co-occurrence relation is computed as follows: compute the unrelated degree of association rule α → β; if the unrelated degree of α → β is less than or equal to a preset unrelated degree threshold, itemset α and itemset β have the co-occurrence relation; if the unrelated degree of α → β is greater than the preset unrelated degree threshold, itemset α and itemset β do not have the co-occurrence relation.
Further, the unrelated degree of association rule α → β is calculated as follows:
[Formula image not reproduced in this text]
where M denotes the size of the transaction database D; f(α) and f(β) denote the number of times itemset α and itemset β occur, respectively; f(α, β) denotes the number of times itemset α and itemset β occur together; and NDG(α, β) denotes the unrelated degree of association rule α → β.
Further, the keywords include center words and label words.
The present invention has the following advantage: the method uses a mining algorithm to count how often stock information news keywords and related stocks appear in different news items, and applies the unrelated degree formula to measure the association between stock information news keywords and related stocks; it is computationally efficient, fast, and reliable.
Brief description of the drawings
The present invention is further described below with reference to the accompanying drawings and embodiments.
Fig. 1 is a flow chart of generating frequent itemset group L_1 from transaction database D in the method of the present invention.
Fig. 2 is a flow chart of generating candidate itemset group C_h from frequent itemset group L_{h-1} in the method of the present invention.
Fig. 3 is a flow chart of finding the maximal frequent itemset group from candidate itemset group C_h in the method of the present invention.
Fig. 4 is a flow chart of finding the association rules that have a co-occurrence relation from frequent itemset group L_k in the method of the present invention.
Specific embodiments
Referring to Fig. 1 to Fig. 4, a method for measuring the association between stock information news keywords and related stocks comprises:
Step S10: read the data in a prepared stock information news file and construct a transaction database D, D = {T_1, T_2, T_3, …, T_n}, where transaction T_i denotes the itemset formed by the keywords of one stock information news item, i ∈ [1, n], and n denotes the number of stock information news records contained in the stock information news file. The keywords include center words and label words. The data in the stock information news file consist of multiple lines of center words and label words; the center words and label words on one line all come from the same news item, and words are separated by spaces.
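Step S10 can be sketched as follows. This is a minimal illustration, assuming one news item per line with space-separated keywords as described above; the function name `build_transactions` and the sample lines are hypothetical and not part of the disclosure. Because spaces separate keywords, the sample writes multi-word keywords with hyphens.

```python
from typing import List, Set


def build_transactions(lines: List[str]) -> List[Set[str]]:
    """Turn each non-empty line into one transaction (a set of keywords)."""
    transactions = []
    for line in lines:
        keywords = line.split()  # keywords on a line are separated by spaces
        if keywords:  # skip blank lines
            transactions.append(set(keywords))
    return transactions


# Hypothetical file content: one news item per line.
sample_lines = [
    "Amazon cross-border-e-commerce",
    "",
    "Alibaba Alibaba-concept",
]
D = build_transactions(sample_lines)
n = len(D)  # number of news records in the transaction database
```

Using a set per transaction means a keyword repeated within one news item is counted once for that item.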
Step S20: exhaustively find all frequent itemsets in the transaction database D, and generate a frequent itemset database L and frequent itemset groups L_k, L = {L_1, L_2, L_3, …, L_k}, L_k = {F_{k.1}, F_{k.2}, F_{k.3}, …, F_{k.m}}, where frequent itemset F_{k.m} denotes a frequent k-itemset composed of k stock information news keywords, m is a serial number, and k and m are positive integers;
Step S20 specifically comprises:
Step S21: scan the transaction database D and generate candidate itemset group C_1, C_1 = {E_{1.1}, E_{1.2}, E_{1.3}, …, E_{1.w}}, where candidate itemset E_{1.w} denotes one stock information news keyword and is a 1-itemset, and w is a serial number and a positive integer. Candidate itemset group C_1 has w candidate 1-itemsets, which means there are w stock information news keywords in total.
Step S22: compute the support count count(E_{1.w}) of candidate itemset E_{1.w} in the transaction database D; if count(E_{1.w}) is greater than or equal to a preset minimum support count threshold, add E_{1.w} to frequent itemset group L_1; if count(E_{1.w}) is less than the preset minimum support count threshold, discard E_{1.w}. Frequent itemset group L_1 is thus obtained, L_1 = {F_{1.1}, F_{1.2}, F_{1.3}, …, F_{1.m}}, where the minimum support threshold is denoted min_sup and is a preset value, and frequent itemset F_{1.m} is a candidate itemset E_{1.w} that satisfies count(E_{1.w}) ≥ min_sup.
In step S22, computing the support count count(E_{1.w}) of candidate itemset E_{1.w} in the transaction database D is specifically: traverse the transaction database D; whenever E_{1.w} is a subset of a transaction T_i, increase the occurrence count of E_{1.w} by one;
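Steps S21 and S22 can be sketched as follows, assuming each transaction is a Python set of keywords. The function name `frequent_1_itemsets` and the sample data are illustrative only.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Set


def frequent_1_itemsets(
    transactions: List[Set[str]], min_sup: float
) -> Dict[FrozenSet[str], int]:
    """Count every keyword once per transaction and keep those with count >= min_sup."""
    counts = Counter(item for t in transactions for item in t)
    return {frozenset([item]): c for item, c in counts.items() if c >= min_sup}


# Illustrative transaction database with four news items.
D = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"d"}]
L1 = frequent_1_itemsets(D, min_sup=2)
```

Here the keys of `counts` play the role of candidate itemset group C_1, and the returned dictionary plays the role of frequent itemset group L_1 together with its support counts.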
Step S23: use the function apriori_gen to generate candidate itemsets E_{h.s} from the frequent itemsets F_{h-1.m} in frequent itemset group L_{h-1}, where E_{h.s} denotes an h-itemset composed of h stock information news keywords, s is a serial number and a positive integer, and h is an integer greater than or equal to 2;
In step S23, the function apriori_gen generates candidate itemsets E_{h.s} from the frequent itemsets F_{h-1.m} in frequent itemset group L_{h-1} as follows: traverse frequent itemset group L_{h-1} and choose two itemsets, F_{h-1.p} and F_{h-1.q}, where p and q are serial numbers and positive integers; compare the items of F_{h-1.p} with the items of F_{h-1.q} and judge whether they have exactly h-2 items in common; if so, join F_{h-1.p} and F_{h-1.q} to generate candidate itemset E_{h.s}; if not, do not join them.
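The join described for apriori_gen can be sketched as follows. This is an illustrative version, not the disclosed implementation: the name `apriori_gen_join` is hypothetical, itemsets are modelled as frozensets, and the result is a set, so a candidate reached through several joins is kept only once.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set


def apriori_gen_join(prev_frequent: Iterable[FrozenSet[str]]) -> Set[FrozenSet[str]]:
    """Join every pair of (h-1)-itemsets that share exactly h-2 items."""
    candidates = set()
    for a, b in combinations(list(prev_frequent), 2):
        # For (h-1)-itemsets, len(a) - 1 equals h - 2.
        if len(a & b) == len(a) - 1:
            candidates.add(a | b)  # the union is an h-itemset
    return candidates


# Illustrative frequent 2-itemsets (h - 1 = 2).
L2_example = {frozenset("AB"), frozenset("AC"), frozenset("BC"), frozenset("BD")}
C3_before_pruning = apriori_gen_join(L2_example)
```

For 1-itemsets the condition requires zero shared items, so every pair of distinct frequent keywords is joined into a candidate 2-itemset.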
Step S24: use the function has_infrequent_sub to apply infrequent-subset pruning to the candidate itemsets E_{h.s}, then generate candidate itemset group C_h, C_h = {E_{h.1}, E_{h.2}, …, E_{h.j}}, where j is a serial number and a positive integer;
In step S24, the function has_infrequent_sub applies infrequent-subset pruning to candidate itemset E_{h.s} as follows: traverse the subsets S_{h-1.u} of E_{h.s}, where each S_{h-1.u} is an (h-1)-itemset and u is a serial number and a positive integer; if any subset S_{h-1.u} does not belong to frequent itemset group L_{h-1}, exclude E_{h.s}, i.e. prune it; if every subset S_{h-1.u} belongs to L_{h-1}, add E_{h.s} to candidate itemset group C_h, i.e. do not prune it. Candidate itemset E_{h.s} has u such subsets. This ensures that the candidate itemsets in candidate itemset group C_h are potentially frequent candidates.
Step S25: compute the support count count(E_{h.j}) of candidate itemset E_{h.j} in the transaction database D; if count(E_{h.j}) is greater than or equal to the minimum support count threshold, add E_{h.j} to frequent itemset group L_h; if count(E_{h.j}) is less than the minimum support count threshold, discard E_{h.j}. Frequent itemset group L_h is thus obtained, L_h = {F_{h.1}, F_{h.2}, F_{h.3}, …, F_{h.m}}, where frequent itemset F_{h.m} is a candidate itemset E_{h.j} that satisfies count(E_{h.j}) ≥ min_sup.
In step S25, computing the support count count(E_{h.j}) of candidate itemset E_{h.j} in the transaction database D is specifically: traverse the transaction database D; whenever E_{h.j} is a subset of a transaction T_i, increase the occurrence count of E_{h.j} by one.
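The pruning of step S24 and the support filter of step S25 can be sketched together as follows. The helper names `has_infrequent_subset`, `support_count` and `filter_candidates` are illustrative assumptions, as is the sample data.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def has_infrequent_subset(candidate: FrozenSet[str], prev_frequent: Set[FrozenSet[str]]) -> bool:
    """True if some (h-1)-subset of the candidate is missing from L_{h-1}."""
    h = len(candidate)
    return any(frozenset(s) not in prev_frequent for s in combinations(candidate, h - 1))


def support_count(itemset: FrozenSet[str], transactions: List[Set[str]]) -> int:
    """Number of transactions that contain the itemset as a subset."""
    return sum(1 for t in transactions if itemset <= t)


def filter_candidates(
    candidates: Iterable[FrozenSet[str]],
    prev_frequent: Set[FrozenSet[str]],
    transactions: List[Set[str]],
    min_sup: float,
) -> Dict[FrozenSet[str], int]:
    """Prune candidates with an infrequent subset, then keep those with count >= min_sup."""
    frequent = {}
    for c in candidates:
        if has_infrequent_subset(c, prev_frequent):
            continue  # pruned in step S24, so no database scan is needed
        n = support_count(c, transactions)
        if n >= min_sup:
            frequent[c] = n
    return frequent


D = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"B", "D"}]
L2 = {frozenset("AB"), frozenset("AC"), frozenset("BC")}
L3 = filter_candidates([frozenset("ABC"), frozenset("ABD")], L2, D, min_sup=2)
```

Pruning before counting avoids scanning the transaction database for candidates that cannot be frequent.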
Step S26: generate candidate itemset group C_{h+1} from frequent itemset group L_h; if C_{h+1} is empty, L_h is the maximal frequent itemset group and the frequent itemset database L is finally obtained, L = {L_1, L_2, L_3, …, L_h}; because h is an integer greater than or equal to 2 and k is a positive integer, the frequent itemset groups L_k of step S20 comprise frequent itemset group L_1 and the frequent itemset groups L_h;
If C_{h+1} is not empty, go to step S25, which generates frequent itemset group L_{h+1}, so that L = {L_1, L_2, L_3, …, L_h, L_{h+1}}; then generate candidate itemset group C_{h+2} and judge whether C_{h+2} is empty, and so on until the maximal frequent itemset group is found.
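The level-wise loop of steps S21 to S26 can be sketched compactly as follows. This is an illustrative sketch under stated assumptions: the name `apriori` is hypothetical, itemsets are frozensets, and the loop stops when no frequent itemset is produced at a level, which also covers the case where the next candidate itemset group is empty.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def apriori(transactions: List[Set[str]], min_sup: float) -> List[Dict[FrozenSet[str], int]]:
    """Return [L_1, L_2, ...], each level mapping a frequent itemset to its support count."""
    counts = Counter(item for t in transactions for item in t)
    current = {frozenset([i]): c for i, c in counts.items() if c >= min_sup}
    levels = []
    while current:
        levels.append(current)
        prev = set(current)
        # Join: pairs of (h-1)-itemsets sharing exactly h-2 items.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a & b) == len(a) - 1}
        # Prune: every (h-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, len(c) - 1))
        }
        # Count: keep candidates whose support count reaches min_sup.
        current = {}
        for c in candidates:
            n = sum(1 for t in transactions if c <= t)
            if n >= min_sup:
                current[c] = n
    return levels


D = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"d"}]
L = apriori(D, min_sup=2)
```

The last element of the returned list corresponds to the maximal frequent itemset group.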
Step S30: from each frequent itemset F_{k.m}, compute the association rules α → β that have a co-occurrence relation, where itemset α is a non-empty proper subset of F_{k.m} and itemset β is the complement of α with respect to F_{k.m}, and add those association rules α → β to a term co-occurrence database.
In step S30, the co-occurrence relation is computed as follows: compute the unrelated degree of association rule α → β; if the unrelated degree of α → β is less than or equal to a preset unrelated degree threshold, itemset α and itemset β have the co-occurrence relation; if the unrelated degree of α → β is greater than the preset unrelated degree threshold, itemset α and itemset β do not have the co-occurrence relation.
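The enumeration of rules in step S30 can be sketched as follows. The names `candidate_rules` and `select_cooccurring` are illustrative, and the unrelated degree is passed in as a caller-supplied function, so this sketch makes no assumption about the formula itself.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str]]


def candidate_rules(frequent_itemset: FrozenSet[str]) -> List[Rule]:
    """All rules alpha -> beta with alpha a non-empty proper subset and beta its complement."""
    items = sorted(frequent_itemset)
    rules = []
    for size in range(1, len(items)):  # sizes 1 .. k-1 exclude the empty set and the full set
        for alpha in combinations(items, size):
            alpha_set = frozenset(alpha)
            rules.append((alpha_set, frequent_itemset - alpha_set))
    return rules


def select_cooccurring(
    rules: List[Rule],
    degree: Callable[[FrozenSet[str], FrozenSet[str]], float],
    max_disc: float,
) -> List[Rule]:
    """Keep the rules whose unrelated degree is at most the threshold."""
    return [(a, b) for a, b in rules if degree(a, b) <= max_disc]


rules = candidate_rules(frozenset({"x", "y", "z"}))
```

A frequent k-itemset yields 2^k - 2 rules, so a 1-itemset yields none.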
The unrelated degree of association rule α → β is calculated as follows:
[Formula image not reproduced in this text]
where M denotes the size of the transaction database D, i.e. the number of stock information news records contained in the stock information news file; f(α) and f(β) denote the number of times itemset α and itemset β occur, respectively; f(α, β) denotes the number of times itemset α and itemset β occur together; and NDG(α, β) denotes the unrelated degree of association rule α → β.
In Fig. 4, the preset unrelated degree threshold is denoted max_Disc.
If the stock information news keywords in itemset α never appear in the same news item as the stock information news keywords in itemset β, but only appear separately, the unrelated degree between them is infinite, indicating that there is no strong association; if the news words in itemset α always appear together with the news words in itemset β, the unrelated degree of the two is 0, indicating a very strong association.
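The formula itself is not available in this text. As an assumption only, one form that uses the same four quantities M, f(α), f(β) and f(α, β) and reproduces both limiting cases described above is the Normalized Google Distance form, NDG(α, β) = (max{log f(α), log f(β)} − log f(α, β)) / (log M − min{log f(α), log f(β)}). The sketch below implements that assumed form; the function name `unrelated_degree` is hypothetical, and the disclosed formula may differ.

```python
import math


def unrelated_degree(f_alpha: int, f_beta: int, f_both: int, m: int) -> float:
    """Assumed NGD-style unrelated degree; not confirmed as the disclosed formula."""
    if f_both == 0:
        return math.inf  # the itemsets never appear in the same news item
    log_a, log_b = math.log(f_alpha), math.log(f_beta)
    denominator = math.log(m) - min(log_a, log_b)
    if denominator == 0:
        return 0.0  # both itemsets occur in every transaction
    return (max(log_a, log_b) - math.log(f_both)) / denominator


always_together = unrelated_degree(2, 2, 2, 17)
never_together = unrelated_degree(2, 3, 0, 17)
partial = unrelated_degree(2, 3, 2, 17)
```

With f(α) = 2, f(β) = 3, f(α, β) = 2 and M = 17, this assumed form gives about 0.189, which is close to the value 0.18 listed in the worked example for a rule whose keywords have those counts.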
The term co-occurrence database finally obtained records the stock information news keywords and related stocks that have a co-occurrence relation, which ensures the quality of the association measurement between stock information news keywords and related stocks; the measurement of the present invention is computationally efficient, fast, and reliable.
For example, the stock information news file contains 17 stock information news items, so that T_1 = {Amazon, cross-border e-commerce}, T_2 = {Moutai Group, Alibaba, Kweichow Moutai, baijiu, Alibaba concept}, T_3 = {Vanke, real estate development}, T_4 = {Amazon, cross-border e-commerce}, T_5 = {Toutiao, culture media}, T_6 = {Meituan, Mobike, bike sharing, Internet+}, T_7 = {Apple, Apple concept, artificial intelligence}, T_8 = {Ren Zhengfei, Huawei, Guangdong sector, electronics manufacturing}, T_9 = {Alibaba, Alibaba concept}, T_10 = {Bitcoin, blockchain}, T_11 = {Kuaishou, Huoshan, internet media, culture media, influencer live streaming}, T_12 = {Toutiao, short video, internet media, culture media, influencer live streaming}, T_13 = {Shenzhen, smart city, smart city, Guangdong sector}, T_14 = {Jia Yueting, LeTV, Hainan sector, LeTV}, T_15 = {LeTV, Jia Yueting, LeTV, new-energy vehicles}, T_16 = {pig grain, live pig, pig, feed}, T_17 = {new stock, new and sub-new stocks}.
The generated candidate itemset group C_1 then has 41 candidate itemsets: E_{1.1} = {Amazon}, E_{1.2} = {blockchain}, E_{1.3} = {culture media}, E_{1.4} = {pig grain}, E_{1.5} = {real estate development}, E_{1.6} = {Kuaishou}, E_{1.7} = {electronics manufacturing}, E_{1.8} = {cross-border e-commerce}, E_{1.9} = {Toutiao}, E_{1.10} = {Apple}, E_{1.11} = {Guangdong sector}, E_{1.12} = {Internet+}, E_{1.13} = {Alibaba}, E_{1.14} = {artificial intelligence}, E_{1.15} = {Meituan}, E_{1.16} = {Alibaba concept}, E_{1.17} = {Mobike}, E_{1.18} = {Bitcoin}, E_{1.19} = {Ren Zhengfei}, E_{1.20} = {influencer live streaming}, E_{1.21} = {Vanke}, E_{1.22} = {live pig}, E_{1.23} = {Huawei}, E_{1.24} = {new-energy vehicles}, E_{1.25} = {pig}, E_{1.26} = {Huoshan}, E_{1.27} = {short video}, E_{1.28} = {Jia Yueting}, E_{1.29} = {smart city}, E_{1.30} = {LeTV}, E_{1.31} = {Apple concept}, E_{1.32} = {Shenzhen}, E_{1.33} = {Hainan sector}, E_{1.34} = {internet media}, E_{1.35} = {Kweichow Moutai}, E_{1.36} = {feed}, E_{1.37} = {new and sub-new stocks}, E_{1.38} = {Moutai Group}, E_{1.39} = {baijiu}, E_{1.40} = {bike sharing}, E_{1.41} = {new stock}.
The preset minimum support count threshold min_sup is obtained by multiplying the number n of stock information news records contained in the stock information news file by a scale factor θ. In this example n is 17 and θ is 10%, so min_sup is 1.7. Because the preset minimum support count threshold is 1.7, the frequent itemset group L_1 generated from candidate itemset group C_1 has 11 frequent itemsets: F_{1.1} = {Amazon}, F_{1.2} = {culture media}, F_{1.3} = {cross-border e-commerce}, F_{1.4} = {Toutiao}, F_{1.5} = {Guangdong sector}, F_{1.6} = {Alibaba}, F_{1.7} = {Alibaba concept}, F_{1.8} = {influencer live streaming}, F_{1.9} = {Jia Yueting}, F_{1.10} = {LeTV}, F_{1.11} = {internet media}.
The candidate itemset group C_2 generated from frequent itemset group L_1 has 55 candidate itemsets: E_{2.1} = {culture media, Amazon}, E_{2.2} = {cross-border e-commerce, Amazon}, E_{2.3} = {Amazon, Toutiao}, E_{2.4} = {Amazon, Guangdong sector}, E_{2.5} = {Amazon, Alibaba}, E_{2.6} = {Alibaba concept, Amazon}, E_{2.7} = {influencer live streaming, Amazon}, E_{2.8} = {Amazon, Jia Yueting}, E_{2.9} = {LeTV, Amazon}, E_{2.10} = {internet media, Amazon}, E_{2.11} = {cross-border e-commerce, culture media}, E_{2.12} = {culture media, Toutiao}, E_{2.13} = {culture media, Guangdong sector}, E_{2.14} = {culture media, Alibaba}, E_{2.15} = {Alibaba concept, culture media}, E_{2.16} = {influencer live streaming, culture media}, E_{2.17} = {culture media, Jia Yueting}, E_{2.18} = {LeTV, culture media}, E_{2.19} = {culture media, internet media}, E_{2.20} = {cross-border e-commerce, Toutiao}, E_{2.21} = {cross-border e-commerce, Guangdong sector}, E_{2.22} = {cross-border e-commerce, Alibaba}, E_{2.23} = {cross-border e-commerce, Alibaba concept}, E_{2.24} = {cross-border e-commerce, influencer live streaming}, E_{2.25} = {cross-border e-commerce, Jia Yueting}, E_{2.26} = {cross-border e-commerce, LeTV}, E_{2.27} = {cross-border e-commerce, internet media}, E_{2.28} = {Toutiao, Guangdong sector}, E_{2.29} = {Toutiao, Alibaba}, E_{2.30} = {Alibaba concept, Toutiao}, E_{2.31} = {influencer live streaming, Toutiao}, E_{2.32} = {Toutiao, Jia Yueting}, E_{2.33} = {LeTV, Toutiao}, E_{2.34} = {internet media, Toutiao}, E_{2.35} = {Alibaba, Guangdong sector}, E_{2.36} = {Alibaba concept, Guangdong sector}, E_{2.37} = {influencer live streaming, Guangdong sector}, E_{2.38} = {Guangdong sector, Jia Yueting}, E_{2.39} = {LeTV, Guangdong sector}, E_{2.40} = {internet media, Guangdong sector}, E_{2.41} = {Alibaba concept, Alibaba}, E_{2.42} = {influencer live streaming, Alibaba}, E_{2.43} = {Alibaba, Jia Yueting}, E_{2.44} = {LeTV, Alibaba}, E_{2.45} = {internet media, Alibaba}, E_{2.46} = {Alibaba concept, influencer live streaming}, E_{2.47} = {Alibaba concept, Jia Yueting}, E_{2.48} = {Alibaba concept, LeTV}, E_{2.49} = {Alibaba concept, internet media}, E_{2.50} = {influencer live streaming, Jia Yueting}, E_{2.51} = {influencer live streaming, LeTV}, E_{2.52} = {influencer live streaming, internet media}, E_{2.53} = {LeTV, Jia Yueting}, E_{2.54} = {internet media, Jia Yueting}, E_{2.55} = {LeTV, internet media}.
The frequent itemset group L_2 generated from candidate itemset group C_2 has 7 frequent itemsets: F_{2.1} = {cross-border e-commerce, Amazon}, F_{2.2} = {culture media, Toutiao}, F_{2.3} = {influencer live streaming, culture media}, F_{2.4} = {culture media, internet media}, F_{2.5} = {Alibaba concept, Alibaba}, F_{2.6} = {influencer live streaming, internet media}, F_{2.7} = {LeTV, Jia Yueting}.
The candidate itemset group C_3 generated from frequent itemset group L_2 has 4 candidate itemsets: E_{3.1} = {influencer live streaming, culture media, Toutiao}, E_{3.2} = {culture media, internet media, Toutiao}, E_{3.3} = {influencer live streaming, culture media, internet media}, E_{3.4} = {influencer live streaming, internet media, culture media}.
The frequent itemset group L_3 generated from candidate itemset group C_3 has 2 frequent itemsets: F_{3.1} = {influencer live streaming, culture media, internet media}, F_{3.2} = {influencer live streaming, internet media, culture media}.
As required, the generated candidate itemset group C_4 is empty, so frequent itemset group L_3 is the maximal frequent itemset group.
Frequent itemsets are chosen from frequent itemset groups L_1, L_2 and L_3, and the association rules that have a co-occurrence relation are computed. In this example the preset unrelated degree threshold is 0.4, and the results are:
The unrelated degree of association rule {Amazon} → {cross-border e-commerce} is 0;
The unrelated degree of association rule {Toutiao} → {culture media} is 0.18;
The unrelated degree of association rule {Alibaba} → {Alibaba concept} is 0;
The unrelated degree of association rule {Jia Yueting} → {LeTV} is 0;
The keywords in {Amazon} → {cross-border e-commerce}, {Toutiao} → {culture media}, {Alibaba} → {Alibaba concept} and {Jia Yueting} → {LeTV} are therefore regarded as having a co-occurrence relation.
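The counts of the worked example can be checked with a short script. This is an illustrative check, not part of the disclosure: the keyword labels are English renderings of the example's keywords, each transaction is modelled as a set (so a keyword repeated within one news item collapses to a single element), and the variable names are hypothetical.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"Amazon", "cross-border e-commerce"},
    {"Moutai Group", "Alibaba", "Kweichow Moutai", "baijiu", "Alibaba concept"},
    {"Vanke", "real estate development"},
    {"Amazon", "cross-border e-commerce"},
    {"Toutiao", "culture media"},
    {"Meituan", "Mobike", "bike sharing", "Internet+"},
    {"Apple", "Apple concept", "artificial intelligence"},
    {"Ren Zhengfei", "Huawei", "Guangdong sector", "electronics manufacturing"},
    {"Alibaba", "Alibaba concept"},
    {"Bitcoin", "blockchain"},
    {"Kuaishou", "Huoshan", "internet media", "culture media", "influencer live streaming"},
    {"Toutiao", "short video", "internet media", "culture media", "influencer live streaming"},
    {"Shenzhen", "smart city", "Guangdong sector"},
    {"Jia Yueting", "LeTV", "Hainan sector"},
    {"LeTV", "Jia Yueting", "new-energy vehicles"},
    {"pig grain", "live pig", "pig", "feed"},
    {"new stock", "new and sub-new stocks"},
]

n = len(transactions)   # number of news records
theta = 0.10            # scale factor from the example
min_sup = n * theta     # minimum support count threshold

# Support counts of single keywords (candidate 1-itemsets).
item_counts = Counter(item for t in transactions for item in t)
frequent_1 = sorted(item for item, c in item_counts.items() if c >= min_sup)

# Support counts of keyword pairs that co-occur in at least one news item.
pair_counts = Counter(
    frozenset(pair) for t in transactions for pair in combinations(sorted(t), 2)
)
frequent_2 = [pair for pair, c in pair_counts.items() if c >= min_sup]
```

Under these assumptions the script finds 41 distinct keywords, 11 frequent 1-itemsets and 7 frequent 2-itemsets, matching the sizes of C_1, L_1 and L_2 given in the example.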
Although specific embodiments of the present invention have been described above, those skilled in the art should understand that the described embodiments are merely illustrative and do not limit the scope of the present invention. Equivalent modifications and variations made by those skilled in the art in accordance with the spirit of the present invention shall fall within the scope of protection of the claims of the present invention.
Claims (8)
1. A method for measuring the relevance between stock information news keywords and related stocks, characterized by comprising:
Step S10: reading the data in a prepared stock information news file and constructing a transaction database D, D = {T1, T2, T3, …, Ti}, where a transaction Ti denotes the itemset formed by the keywords of one stock information news article, i ∈ [1, n], and n denotes the number of stock information news records contained in the stock information news file;
Step S20: exhaustively enumerating all frequent itemsets from the transaction database D, and generating a frequent itemset database L and frequent itemset groups Lk, L = {L1, L2, L3, …, Lk}, Lk = {Fk.1, Fk.2, Fk.3, …, Fk.m}, where a frequent itemset Fk.m denotes a frequent k-itemset composed of k stock information news keywords, m denotes a serial number, and k and m are positive integers;
Step S30: computing, from the frequent itemset Fk.m, several association rules α → β of the co-occurrence relationship, where the itemset α is a non-empty proper subset of Fk.m and the itemset β is the complement of the itemset α with respect to the frequent itemset Fk.m, and adding the association rules α → β to a term co-occurrence database.
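As an illustration of step S30, the following Python sketch enumerates every rule α → β obtainable from one frequent itemset, with α a non-empty proper subset and β its complement within the itemset. The function name `association_rules` and the sample itemset are assumptions made for this sketch; the claim does not prescribe an implementation.

```python
from itertools import combinations


def association_rules(frequent_itemset):
    """Yield (alpha, beta) pairs for one frequent itemset.

    alpha ranges over the non-empty proper subsets of the itemset and
    beta is the complement of alpha with respect to the itemset.
    """
    items = sorted(frequent_itemset)
    full = frozenset(items)
    # Subset sizes 1 .. k-1 guarantee alpha is non-empty and proper.
    for size in range(1, len(items)):
        for subset in combinations(items, size):
            alpha = frozenset(subset)
            yield alpha, full - alpha


# Hypothetical usage with a 2-itemset of keywords.
rules = list(association_rules({"LeTV", "Jia Yueting"}))
for alpha, beta in rules:
    print(set(alpha), "->", set(beta))
```

A frequent k-itemset yields 2^k - 2 candidate rules in this reading, so a 2-itemset gives two rules and a 3-itemset gives six.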
2. The method for measuring the relevance between stock information news keywords and related stocks according to claim 1, characterized in that step S20 specifically comprises:
Step S21: scanning the transaction database D and generating a candidate itemset group C1, C1 = {E1.1, E1.2, E1.3, …, E1.w}, where a candidate itemset E1.w denotes a 1-itemset composed of one stock information news keyword, and w denotes a serial number and is a positive integer;
Step S22: computing the support count count(E1.w) of the candidate itemset E1.w in the transaction database D; if the support count count(E1.w) is greater than or equal to a preset minimum support count threshold, adding the candidate itemset E1.w to the frequent itemset group L1; if the support count count(E1.w) is less than the preset minimum support count threshold, removing the candidate itemset E1.w;
Step S23: generating candidate itemsets Eh.s from the frequent itemsets Fh-1.m in the frequent itemset group Lh-1, where a candidate itemset Eh.s denotes an h-itemset composed of h stock information news keywords, s denotes a serial number and is a positive integer, and h is an integer greater than or equal to 2;
Step S24: performing non-frequent pruning on the candidate itemsets Eh.s and then generating the candidate itemset group Ch, Ch = {Eh.1, Eh.2, …, Eh.j}, where j denotes a serial number and is a positive integer;
Step S25: computing the support count count(Eh.j) of the candidate itemset Eh.j in the transaction database D; if the support count count(Eh.j) is greater than or equal to the minimum support count threshold, adding the candidate itemset Eh.j to the frequent itemset group Lh; if the support count count(Eh.j) is less than the minimum support count threshold, removing the candidate itemset Eh.j;
Step S26: generating the candidate itemset group Ch+1 from the frequent itemset group Lh; if the candidate itemset group Ch+1 is empty, the frequent itemset group Lh is the maximal frequent itemset group and the frequent itemset database L is finally obtained; if the candidate itemset group Ch+1 is not empty, going to step S25.
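Steps S21 to S26 resemble the well-known Apriori level-wise search. The Python sketch below follows that reading; it is not taken from the patent. The function names, the toy transactions and the threshold value are hypothetical, and the loop stops as soon as a level yields no frequent itemsets, which also covers the case of an empty candidate group in step S26.

```python
from itertools import combinations


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_count):
    """Return [L1, L2, ...], where each Lk is a set of frozensets of size k."""
    transactions = [frozenset(t) for t in transactions]
    # S21-S22: candidate 1-itemsets, filtered by the minimum support count.
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items
             if support_count(frozenset([i]), transactions) >= min_count}
    levels = []
    h = 2
    while level:
        levels.append(level)
        # S23: join two (h-1)-itemsets whose union has exactly h items,
        # which is equivalent to sharing exactly h-2 items.
        candidates = {a | b for a in level for b in level if len(a | b) == h}
        # S24: drop candidates that have an infrequent (h-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, h - 1))}
        # S25: keep the candidates that reach the minimum support count.
        level = {c for c in candidates
                 if support_count(c, transactions) >= min_count}
        h += 1
    return levels


# Toy data, invented for illustration only.
toy = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
result = frequent_itemsets(toy, min_count=2)
print([len(level) for level in result])
```

With the toy data every 1-, 2- and 3-itemset reaches the threshold of 2, so the sketch returns three levels; raising the threshold to 3 drops the single 3-itemset.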
3. The method for measuring the relevance between stock information news keywords and related stocks according to claim 2, characterized in that generating the candidate itemset Eh.s from the frequent itemsets Fh-1.m in the frequent itemset group Lh-1 specifically comprises: traversing the frequent itemset group Lh-1 and selecting two itemsets, an itemset Fh-1.p and an itemset Fh-1.q, where p and q denote serial numbers and are positive integers; comparing the items in the itemset Fh-1.p with the items in the itemset Fh-1.q and determining whether they have exactly k-2 identical items; if so, joining the itemset Fh-1.p and the itemset Fh-1.q to generate the candidate itemset Eh.s; if not, not joining them.
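The join described in this claim can be sketched in Python as below. The sketch assumes that the "k-2 identical items" condition is meant for the itemset size being generated, that is, two (h-1)-itemsets are joined when they share exactly h-2 items. The function name and the English keyword renderings are illustrative; the seven 2-itemsets are those listed for L2 in the embodiment.

```python
from itertools import combinations


def join_candidates(prev_level):
    """Join pairs of (h-1)-itemsets sharing exactly h-2 items into h-itemsets."""
    candidates = set()
    for p, q in combinations(list(prev_level), 2):
        h = len(p) + 1
        if len(p & q) == h - 2:
            candidates.add(p | q)  # the union then has exactly h items
    return candidates


# L2 from the embodiment, with keywords rendered in English for illustration.
L2 = [frozenset(s) for s in (
    {"cross-border e-commerce", "Amazon"},
    {"culture media", "Toutiao"},
    {"internet celebrity live streaming", "culture media"},
    {"culture media", "internet media"},
    {"Alibaba concept", "Alibaba"},
    {"internet celebrity live streaming", "internet media"},
    {"LeTV", "Jia Yueting"},
)]

C3 = join_candidates(L2)
print(len(C3))
```

Because the sketch stores candidates as sets, it returns three distinct 3-itemsets. The embodiment lists four candidates, of which E3.3 and E3.4 contain the same three keywords in a different order.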
4. The method for measuring the relevance between stock information news keywords and related stocks according to claim 2, characterized in that performing non-frequent pruning on the candidate itemset Eh.s specifically comprises: traversing the subsets Sh-1.u of the candidate itemset Eh.s, where each subset Sh-1.u is an (h-1)-itemset, and u denotes a serial number and is a positive integer; if a subset Sh-1.u does not belong to the frequent itemset group Lh-1, excluding the candidate itemset Eh.s; if the subsets Sh-1.u belong to the frequent itemset group Lh-1, adding the candidate itemset Eh.s to the candidate itemset group Ch.
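A minimal Python sketch of this pruning step is given below. It assumes the claim is read as "keep a candidate only if all of its (h-1)-item subsets are frequent"; the function name and the toy itemsets are hypothetical.

```python
from itertools import combinations


def prune(candidates, prev_level):
    """Keep only candidates whose (h-1)-item subsets all appear in prev_level."""
    prev_level = set(prev_level)
    kept = set()
    for cand in candidates:
        # Every subset obtained by dropping one item from the candidate.
        subsets = (frozenset(s) for s in combinations(cand, len(cand) - 1))
        if all(s in prev_level for s in subsets):
            kept.add(cand)
    return kept


# Toy data, invented for illustration only.
frequent_2 = {frozenset("AB"), frozenset("AC"), frozenset("BC"), frozenset("BD")}
candidates_3 = {frozenset("ABC"), frozenset("ABD")}
print(prune(candidates_3, frequent_2))
```

Here {A, B, C} survives because {A, B}, {A, C} and {B, C} are all frequent, while {A, B, D} is excluded because {A, D} is not.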
5. The method for measuring the relevance between stock information news keywords and related stocks according to claim 2, characterized in that computing the support count count(E1.w) of the candidate itemset E1.w in the transaction database D specifically comprises: traversing the transaction database D and, if the candidate itemset E1.w is a subset of a transaction Ti, incrementing the occurrence count of the candidate itemset E1.w by one;
and computing the support count count(Eh.j) of the candidate itemset Eh.j in the transaction database D specifically comprises: traversing the transaction database D and, if the candidate itemset Eh.j is a subset of a transaction Ti, incrementing the occurrence count of the candidate itemset Eh.j by one.
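The counting rule of this claim can be sketched in Python as a single pass over the transaction database that updates all candidates at once. The function name, the dictionary layout and the toy transactions are assumptions made for illustration.

```python
def support_counts(candidates, transactions):
    """Count, for each candidate itemset, the transactions that contain it."""
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:
        t = frozenset(t)
        for cand in counts:
            if cand <= t:          # candidate is a subset of the transaction
                counts[cand] += 1  # one more occurrence
    return counts


# Toy transaction database, invented for illustration only.
D = [{"A", "B", "C"}, {"A", "B"}, {"B", "C"}]
counts = support_counts([{"A"}, {"A", "B"}, {"A", "C"}], D)
print(counts[frozenset({"A", "B"})])
```

Scanning the database once for a whole candidate group is a design choice of this sketch; the claim only requires that each candidate be counted once per transaction that contains it.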
6. The method for measuring the relevance between stock information news keywords and related stocks according to claim 1, characterized in that the co-occurrence relationship is computed as follows: computing the unrelated degree of the association rule α → β; if the unrelated degree of the association rule α → β is less than or equal to a preset unrelated-degree threshold, the itemset α and the itemset β have the co-occurrence relationship; if the unrelated degree of the association rule α → β is greater than the preset unrelated-degree threshold, the itemset α and the itemset β do not have the co-occurrence relationship.
7. The method for measuring the relevance between stock information news keywords and related stocks according to claim 6, characterized in that the unrelated degree of the association rule α → β is calculated by the following formula:
[formula not reproduced in this text]
where M denotes the size of the transaction database D; f(α) and f(β) denote the number of occurrences of the itemset α and of the itemset β, respectively; f(α, β) denotes the number of times the itemset α and the itemset β occur together; and NDG(α, β) denotes the unrelated degree of the association rule α → β.
8. The method for measuring the relevance between stock information news keywords and related stocks according to claim 1, characterized in that the keywords include center words and marker words.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811318217.1A CN109508386A (en) | 2018-11-07 | 2018-11-07 | A kind of relevance metric method of stock information press center word and related stock |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109508386A true CN109508386A (en) | 2019-03-22 |
Family
ID=65747769
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811318217.1A Pending CN109508386A (en) | 2018-11-07 | 2018-11-07 | A kind of relevance metric method of stock information press center word and related stock |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109508386A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150006502A1 (en) * | 2013-06-28 | 2015-01-01 | International Business Machines Corporation | Augmenting search results with interactive search matrix |
CN105740466A (en) * | 2016-03-04 | 2016-07-06 | 百度在线网络技术(北京)有限公司 | Method and device for excavating incidence relation between hotspot concepts |
CN106296401A (en) * | 2016-11-24 | 2017-01-04 | 吴梅红 | A kind of Strong association rule method for digging understood for stock market's operation logic |
Non-Patent Citations (1)
Title |
---|
黄名选: "基于完全加权关联规则挖掘的查询扩展研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112115305A (en) * | 2019-06-21 | 2020-12-22 | 杭州海康威视数字技术股份有限公司 | Group identification method and device and computer readable storage medium |
CN112115305B (en) * | 2019-06-21 | 2024-04-09 | 杭州海康威视数字技术股份有限公司 | Group identification method apparatus and computer-readable storage medium |
CN113722432A (en) * | 2021-08-26 | 2021-11-30 | 杭州隆埠科技有限公司 | Method and device for associating news with stocks |
CN113722432B (en) * | 2021-08-26 | 2024-01-09 | 杭州隆埠科技有限公司 | Method and device for associating news with stocks |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Limaye et al. | Annotating and searching web tables using entities, types and relationships | |
CN108121829A (en) | The domain knowledge collection of illustrative plates automated construction method of software-oriented defect | |
CN105393263A (en) | Feature completion in computer-human interactive learning | |
Wang et al. | Automatic image annotation via local multi-label classification | |
Zhu et al. | A process for mining science & technology documents databases, illustrated for the case of" knowledge discovery and data mining" | |
Hu et al. | Block matching for ontologies | |
Chen et al. | Citation recommendation based on weighted heterogeneous information network containing semantic linking | |
Zhou et al. | Automatic image annotation by an iterative approach: incorporating keyword correlations and region matching | |
Han et al. | Clustering and retrieval of mechanical CAD assembly models based on multi-source attributes information | |
Wang et al. | Data-driven approach for bridging the cognitive gap in image retrieval | |
Hu et al. | EGC: A novel event-oriented graph clustering framework for social media text | |
CN109508386A (en) | A kind of relevance metric method of stock information press center word and related stock | |
Sun et al. | GEDIT: geographic-enhanced and dependency-guided tagging for joint POI and accessibility extraction at baidu maps | |
KR20090117110A (en) | Apparatus for generating ontology based on association and method thereof | |
Zhang et al. | Enhancing event-level sentiment analysis with structured arguments | |
Cong et al. | Pylon: Semantic Table Union Search in Data Lakes | |
Chen et al. | A deep learning based method benefiting from characteristics of patents for semantic relation classification | |
Mattas et al. | Comparing data mining techniques for mining patents | |
Tao et al. | From citation network to study map: a novel model to reorganize academic literatures | |
Khan et al. | NeuSyRE: Neuro-symbolic visual understanding and reasoning framework based on scene graph enrichment | |
Zhu | Financial data analysis application via multi-strategy text processing | |
Zhang et al. | Adaptive association rule mining for web video event classification | |
Furche et al. | Amber: Automatic supervision for multi-attribute extraction | |
Bhavani et al. | An efficient clustering approach for fair semantic web content retrieval via tri-level ontology construction model with hybrid dragonfly algorithm | |
Birjali et al. | Measuring documents similarity in large corpus using MapReduce algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20190322 |