CN108932229A

CN108932229A - A method for analyzing the sentiment orientation of financial articles

Info

Publication number: CN108932229A
Application number: CN201810605916.8A
Authority: CN
Inventors: 吕学强; 董志安
Original assignee: Beijing Information Science and Technology University
Current assignee: Beijing Information Science and Technology University
Priority date: 2018-06-13
Filing date: 2018-06-13
Publication date: 2018-12-04

Abstract

The present invention relates to a method for analyzing the sentiment orientation of financial articles, comprising: identifying company names, extracting a key sentence group, and classifying the key sentence group with an LSTM model. The method identifies company names using a dictionary of company short names combined with encyclopedia queries, which is effective and easily extensible. It extracts the key sentence group by combining several feature attributes with text-similarity matching based on the deep learning framework doc2vec, which gives good extraction quality with high precision and recall. The resulting orientation judgments are highly accurate, and the method meets the needs of practical applications well.

Description

A method for analyzing the sentiment orientation of financial articles

Technical field

The invention belongs to the field of text analysis technology, and in particular relates to a method for analyzing the sentiment orientation of financial articles.

Background art

Current methods for analyzing the sentiment orientation of text include the following: 1) attributes such as sentiment, position, and keywords are used as factors for extracting key sentences, and the key sentence group is then classified by supervised or semi-supervised sentiment classification; the key sentences extracted by this kind of method are not very accurate; 2) text features are extended by training on a sentiment vocabulary that contains negation words, orientation words, and degree words; this method ignores context and performs poorly. Text classification aimed specifically at financial articles has received relatively little study at home or abroad. Existing work mainly uses vocabulary and semantic rules to extract event-semantic annotation information from news text and uses that information as features for machine-learning classification, but this approach is overly complicated and its accuracy is low. A semantic-rule-based sentiment analysis method for Web financial news text extracts text attributes with the machine learning method Apriori, constructs a sentiment dictionary and semantic rules, and computes the sentiment orientation from them; this method is also fairly complex, its results are mediocre, and it does not meet the needs of practical applications well. Company name recognition is an important research point in extracting key sentence groups from financial articles. So far, however, research results in this area are relatively few; most prior-art methods still have low accuracy in recognizing company short names, and the complex rules and knowledge bases they require seriously limit their application.

Summary of the invention

In view of the above problems in the prior art, the purpose of the present invention is to provide a method for analyzing the sentiment orientation of financial articles that avoids the above technical defects.

To achieve this purpose, the technical solution provided by the invention is as follows:

A method for analyzing the sentiment orientation of financial articles, comprising: identifying company names, extracting a key sentence group, and classifying the key sentence group with an LSTM model.

Further, the step of identifying company names comprises:

Step (1): decomposing the news text to be processed into a set of N-grams that serve as candidate company names;

Step (2): adding 1 to the score of each N-gram that occurs in a sentence containing a six-digit company code and precedes that code;

Step (3): matching each N-gram in turn against the basic dictionary by similarity and updating its score;

Step (4): running a Baidu search and a Baidu Baike query for each candidate company name and updating its score; N-grams whose score is higher than a threshold are taken as company names.

Further, step (1) is specifically: the scores of the N-gram set are first initialized, and each N-gram in the set is then matched by similarity against the basic company name dictionary created above to obtain the candidate company name set. The similarity between an N-gram X and a company name Y is computed as

In the formula, α and β are weights, count is the number of characters that belong to both X and Y, start indicates whether the N-gram X begins with the beginning of company name Y, and end indicates whether X ends with the ending of Y.

Further, step (4) is specifically:

A Baidu search and a Baidu Baike query are run for each entry in the candidate company name set, and the scores of the set are updated. If "stock code", "company", "group", or "enterprise" appears in a Baidu search result, that result is counted as one effective query. If, in a single Baidu Baike query result, the title is not empty, or "stock code", "company", "group", or "enterprise" appears in the summary or basic information, the query is counted as effective. The candidate company name score is updated by combining the Baidu Baike query and the Baidu search; the Internet query update score is

$$\mathrm{Search}(X) = \eta \cdot \mathrm{count}(X \in \mathrm{search\_list}) + \gamma \cdot \mathrm{baike\_query}(X)$$

where η is the Baidu search weight, count is the number of effective Baidu search results, γ is the Baidu Baike query weight, and baike_query is the value returned by the Baidu Baike query.

The company name recognition score is computed as

$$\mathrm{Name} = \lambda \cdot \mathrm{Sim} + \mu \cdot \mathrm{Search}$$

where Name is the final score of the N-gram, λ and μ are weights, Sim is the similarity between the N-gram and the company name dictionary, and Search is the Internet query update score of the N-gram.

Further, the step of extracting the key sentence group comprises:

(1) adding the news headline to the key sentence group;

(2) computing the similarity between each sentence and the news headline and updating the sentence score;

(3) updating the score of each candidate sentence according to its position; checking whether the sentence contains domain word information, recorded as 1 if it does and 0 otherwise; checking whether the sentence contains a company name or company code, recorded as 1 if it does and 0 otherwise; and updating each sentence score again;

(4) sorting the sentences in descending order of score and taking the sentences whose score is greater than a threshold as the key sentence group of the news text; if no sentence in the candidate key sentence group scores above the threshold, the highest-scoring sentence is added to the key sentence group.

Further, the sentence position scoring formula used when updating candidate sentence scores by position is

where $S_i$ is the i-th sentence in the text, abs takes the absolute value, and n is the total number of sentences in the text.

The sentence score is

$$\mathrm{Score}(S_i) = \sum_j W_j \cdot \mathrm{Score}_j(S_i), \quad i = 1, 2, 3, \ldots, n$$

where $\mathrm{Score}(S_i)$ is the final score of sentence $S_i$, $S_i$ is the i-th sentence in a news text, j ranges over the set of sentence scoring features, $W_j$ is the score weight of feature j, and $\mathrm{Score}_j(S_i)$ is the score of sentence $S_i$ on feature j.

Further, the step of classifying the key sentence group with the LSTM model comprises:

(1) training the LSTM model on an annotated corpus until the parameter requirements are met;

(2) segmenting the key sentence group obtained in the preceding extraction step into words and removing stop words;

(3) training on the sentences with Word2vec and TFIDF to obtain sentence vectors;

(4) classifying the orientation of the sentence vectors with the trained LSTM model;

(5) applying the orientation judgment mechanism to the numbers of positive and negative sentences in the key sentence group of a news text to obtain the orientation of that news text.

Further, in the LSTM model, the value of $f_t$ is computed as

$$f_t = \sigma(w_f[h_{t-1}, x_t] + b_f)$$

where σ is the sigmoid function, $w_f$ is the forget gate weight, and $b_f$ is the forget gate bias.

$i_t$ is the update value; it determines which values are to be updated and controls the influence of the current input data on the state of the memory cell. A tanh layer generates a new candidate value vector $\tilde{C}_t$, which is added to the state. The update formulas for $i_t$ and $\tilde{C}_t$ are, respectively,

$$i_t = \sigma(w_i[h_{t-1}, x_t] + b_i)$$

$$\tilde{C}_t = \tanh(w_c[h_{t-1}, x_t] + b_c)$$

where $w_i$ is the update gate weight, $b_i$ is the update gate bias, tanh is the hyperbolic tangent function, $w_c$ is the weight of the updated candidate value, $b_c$ is the bias of the updated candidate value, and $\tilde{C}_t$ is the candidate value.

The update formula for $C_t$ is

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

A sigmoid layer decides which part of the current state is output. The state is passed through tanh to obtain a value between -1 and 1, and this value is multiplied by the sigmoid output $O_t$ to give the output at this time step. The update formulas for $O_t$ and $h_t$ are, respectively,

$$O_t = \sigma(w_o[h_{t-1}, x_t] + b_o)$$

$$h_t = O_t * \tanh(C_t)$$

where $w_o$ is the weight for updating the output value, $b_o$ is the bias for updating the output value, and $h_t$ is the final output value.

Further, combining Word2vec and TFIDF, the word vector of a word t in a document is expressed as

$$V(t) = \mathrm{word2vec}(t) \cdot \mathrm{tfidf}(t)$$

where V(t) is the word vector after weighting by the two models, word2vec(t) is the word vector of t trained by the word2vec model, and tfidf(t) is the weight of t in the document obtained from the TFIDF model.

Further, the formulas for TF and IDF are, respectively,

where f(t, d) is the number of times word t occurs in document d, $df_t$ is the number of documents containing word t, and N is the total number of documents. The weight of word t in a document is computed as

$$\mathrm{tfidf}_t = \mathrm{tf}(t, d) \cdot \mathrm{idf}_t$$

The CBOW training model is used; CBOW is expressed as

$$p(w_t \mid \tau(w_{t-k}, w_{t-k+1}, \ldots, w_{t+k} \mid w_t))$$

where $w_t$ is a word in the dictionary; the words within a window of size k on either side of $w_t$ are used to predict the probability that $w_t$ occurs, and τ is the operator that sums the vectors of the words in the window to the left and right.

The method for analyzing the sentiment orientation of financial articles provided by the invention identifies company names using a dictionary of company short names combined with encyclopedia queries, which is effective and easily extensible. It extracts the key sentence group by combining several feature attributes with text-similarity matching based on the deep learning framework doc2vec, which gives good extraction quality with high precision and recall. The resulting orientation judgments are highly accurate, and the method meets the needs of practical applications well.

Brief description of the drawings

Fig. 1 is a diagram of the LSTM gates.

Specific embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is further described below with reference to the accompanying drawing and specific embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the invention, without creative work, fall within the protection scope of the invention.

A method for analyzing the sentiment orientation of financial articles comprises the following steps: identifying company names, extracting a key sentence group, and classifying the key sentence group with an LSTM model.

The step of identifying company names comprises:

(1) decomposing the news text to be processed into a set of N-grams that serve as candidate company names;

(2) adding 1 to the score of each N-gram that occurs in a sentence containing a six-digit company code and precedes that code;

(3) matching each N-gram in turn against the basic dictionary by similarity and updating its score;

(4) running a Baidu search and a Baidu Baike query for each candidate company name and updating its score; N-grams whose score is higher than a threshold are taken as company names.

In this embodiment, company names are identified using a dictionary of company short names combined with encyclopedia queries: mappings between company short names and company codes are added to the short-name dictionary, and a Baidu Baike query factor is added. The method is easy to understand and convenient to implement, extends readily, and recognizes new company names well. First, the set of N-grams in each text to be processed is extracted as candidate company names. Similarity against the basic dictionary is then computed, each N-gram is checked for whether it occurs in a sentence containing a six-digit company code, and each N-gram is scored comprehensively with a Baidu Baike query and a Baidu search. Finally, the N-grams in the set whose score is higher than the threshold are taken as company names.
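Steps (1) and (2) can be illustrated with a minimal Python sketch. It assumes character-level N-grams of length 2 to 4, which the text does not specify; the function names `char_ngrams` and `candidate_scores` are hypothetical.

```python
import re
from collections import defaultdict

# A standalone six-digit company code (not part of a longer digit run).
CODE_RE = re.compile(r"(?<!\d)\d{6}(?!\d)")

def char_ngrams(text, n_min=2, n_max=4):
    """Return all character N-grams of length n_min..n_max in text."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def candidate_scores(sentences, n_min=2, n_max=4):
    """Initialise every N-gram to 0, then add 1 to N-grams preceding a company code."""
    scores = defaultdict(int)
    for sentence in sentences:
        for gram in char_ngrams(sentence, n_min, n_max):
            scores[gram] += 0  # register the candidate with an initial score of 0
        match = CODE_RE.search(sentence)
        if match:
            prefix = sentence[:match.start()]  # text before the company code
            for gram in set(char_ngrams(prefix, n_min, n_max)):
                scores[gram] += 1
    return dict(scores)
```

In this sketch every N-gram of a sentence is a candidate, including ones that span punctuation; a practical implementation would probably filter those out before scoring.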

In this embodiment, company codes and company short names are obtained from the three major domestic stock exchanges to create the basic dictionary, in which the two are mapped to each other. For example, '000027' and 'Shenzhen Energy' in the basic dictionary both represent Shenzhen Energy Group Co., Ltd.
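One way to hold such a two-way mapping is sketched below in Python. The row layout of `LISTINGS` and the helper names are assumptions for illustration; only the '000027' example comes from the text.

```python
# Hypothetical rows as they might be collected from exchange listings:
# (company code, short name, full name).
LISTINGS = [
    ("000027", "Shenzhen Energy", "Shenzhen Energy Group Co., Ltd."),
]

def build_base_dictionary(listings):
    """Build code -> short name, short name -> code, and code -> full name maps."""
    code_to_short, short_to_code, full_name = {}, {}, {}
    for code, short, full in listings:
        code_to_short[code] = short
        short_to_code[short] = code
        full_name[code] = full
    return code_to_short, short_to_code, full_name

def resolve(key, code_to_short, short_to_code, full_name):
    """Return the full company name for a code or a short name, else None."""
    code = key if key in code_to_short else short_to_code.get(key)
    return full_name.get(code)
```

With this layout, a code and its short name resolve to the same company record, which is the property the basic dictionary relies on.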

Specifically, the scores of the N-gram set are first initialized, and each N-gram in the set is matched by similarity against the basic company name dictionary created above to obtain the candidate company name set. The similarity between an N-gram X and a company name Y is computed as

In the formula, α and β are weights, count is the number of characters that belong to both X and Y, start indicates whether the N-gram X begins with the beginning of company name Y, and end indicates whether X ends with the ending of Y. After learning, the best results are obtained when the weights are set to 0.4 and 1.
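The similarity formula itself is not reproduced in this text, so the following Python sketch only illustrates how the named quantities (count, start, end, α, β) could be combined. The specific form, α times the shared-character ratio plus β times the averaged boundary match, is an assumption and not the formula of the invention; the defaults 0.4 and 1 are taken to be α and β.

```python
def similarity(x, y, alpha=0.4, beta=1.0):
    """Assumed form: alpha * shared-character ratio + beta * boundary match."""
    if not x or not y:
        return 0.0
    count = len(set(x) & set(y))              # characters belonging to both X and Y
    start = 1 if y.startswith(x[0]) else 0    # X begins with the beginning of Y
    end = 1 if y.endswith(x[-1]) else 0       # X ends with the ending of Y
    return alpha * count / len(set(y)) + beta * (start + end) / 2
```

Under this assumed form, an N-gram identical to the dictionary entry receives the maximum score α + β, and an N-gram sharing no characters with it receives 0.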

A Baidu search and a Baidu Baike query are run for each entry in the candidate company name set, and the scores of the set are updated. If "stock code", "company", "group", or "enterprise" appears in a Baidu search result, that result is counted as one effective query. If, in a single Baidu Baike query result, the title is not empty, or "stock code", "company", "group", or "enterprise" appears in the summary or basic information, the query is counted as effective. Tables 1 and 2 show the results returned by Baidu Baike and Baidu search, respectively, for the keyword "Baidu".

Table 1. Baidu Baike query result

Table 2. Baidu search results

Judging only by the results returned by the Baidu search in Table 2, just 2 of the 10 search results confirm that "Baidu" is a company; combined with the Baidu Baike query in Table 1, however, it can be shown that "Baidu" is very likely a company. The candidate company name score is therefore updated by combining the Baidu Baike query and the Baidu search; the Internet query update score is

$$\mathrm{Search}(X) = \eta \cdot \mathrm{count}(X \in \mathrm{search\_list}) + \gamma \cdot \mathrm{baike\_query}(X) \tag{2}$$

In the formula, η is the Baidu search weight, count is the number of effective Baidu search results, γ is the Baidu Baike query weight, and baike_query is the value returned by the Baidu Baike query. Learning from the data shows that the best result is obtained when the weights η and γ are set to 0.2 and 1.3.
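Formula (2) can be sketched in Python as follows. The search results and the Baidu Baike fields are passed in as plain strings, since no query interface is described here; treating baike_query as 1 for an effective query and 0 otherwise is an assumption, as are the function names.

```python
KEYWORDS = ("stock code", "company", "group", "enterprise")

def is_effective(text):
    """A result is effective if it mentions any of the company keywords."""
    return any(keyword in text for keyword in KEYWORDS)

def search_score(search_results, baike_title, baike_summary, eta=0.2, gamma=1.3):
    """Search(X) = eta * count(effective search results) + gamma * baike_query(X)."""
    count = sum(1 for result in search_results if is_effective(result))
    # Assumed return value: 1 when the Baike query is effective, 0 otherwise.
    baike_query = 1 if (baike_title or is_effective(baike_summary)) else 0
    return eta * count + gamma * baike_query
```

With the default weights, two effective search results plus an effective Baike query give 0.2 × 2 + 1.3 = 1.7, so the encyclopedia query dominates the score.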

The company name recognition score is computed as

$$\mathrm{Name} = \lambda \cdot \mathrm{Sim} + \mu \cdot \mathrm{Search} \tag{3}$$

In the formula, Name is the final score of the N-gram, λ and μ are weights, Sim is the similarity between the N-gram and the company name dictionary, and Search is the Internet query update score of the N-gram. After learning, the best results are obtained when λ and μ are set to 1 and 1.12.
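A short Python sketch of formula (3) and the final thresholding follows. The threshold value is not given in the text, so it is left as a parameter; `select_company_names` is a hypothetical helper.

```python
def name_score(sim, search, lam=1.0, mu=1.12):
    """Name = lambda * Sim + mu * Search, with the weights reported above."""
    return lam * sim + mu * search

def select_company_names(candidates, threshold):
    """candidates maps N-gram -> (sim, search); keep those scoring above threshold."""
    return sorted(gram for gram, (sim, search) in candidates.items()
                  if name_score(sim, search) > threshold)
```

For example, a candidate with Sim = 1.4 and Search = 1.7 scores 1.4 + 1.12 × 1.7 = 3.304 and is kept for any threshold below that value.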

The step of extracting the key sentence group comprises:

(1) adding the news headline to the key sentence group;

(2) computing the similarity between each sentence and the news headline with a trained doc2vec model and updating the sentence score;

(3) updating the score of each candidate sentence according to its position using formula (4); checking whether the sentence contains domain word information, recorded as 1 if it does and 0 otherwise; checking whether the sentence contains a company name or company code, recorded as 1 if it does and 0 otherwise; and updating each sentence score again;

(4) sorting the sentences in descending order of score and taking the sentences whose score is greater than the threshold Φ as the key sentence group of the news text; if no sentence in the candidate key sentence group scores above Φ, the highest-scoring sentence is added to the key sentence group.
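Steps (1) and (4) can be sketched in Python as follows, assuming the per-sentence scores from steps (2) and (3) have already been computed. The function name and the `(sentence, score)` input layout are illustrative only.

```python
def select_key_sentences(title, scored_sentences, phi):
    """Return the key sentence group for one news text.

    scored_sentences is a list of (sentence, score) pairs; phi is the threshold.
    """
    group = [title]  # step (1): the headline always joins the group
    ranked = sorted(scored_sentences, key=lambda pair: pair[1], reverse=True)
    above = [sentence for sentence, score in ranked if score > phi]
    if above:
        group.extend(above)            # step (4): sentences scoring above phi
    elif ranked:
        group.append(ranked[0][0])     # fallback: the single highest-scoring sentence
    return group
```

The fallback branch guarantees that the group contains at least one body sentence whenever the text has any, even if the threshold is set too high.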

The news headline carries the most important information of the text. The key sentences of a news item often appear at the beginning or the end of the text, so sentences at the beginning and end positions are given higher weights.

Doc2vec is a deep learning model based on word2vec. It represents a sentence as a real-valued vector, which is used to compute the similarity between sentences. This embodiment uses a key sentence group extraction method that combines several feature attributes with text-similarity matching based on the deep learning framework doc2vec. The news headline is first added to the key sentence group, and the doc2vec model is used to compute the similarity between each sentence in the text and the headline. The sentence scores are then updated by also taking into account the position of the sentence in the news text, whether it contains a company name or six-digit company code, and whether it contains domain word information. The sentences scoring higher than the threshold Φ form the news key sentence group; if no sentence scores above the threshold, the highest-scoring sentence is added to the key sentence group. The sentence position scoring formula (4) is

In the formula, $S_i$ is the i-th sentence in the text, abs takes the absolute value, and n is the total number of sentences in the text. Under this mechanism, sentences at the start and end of the text receive higher scores, which matches the pattern that the key points of a news text are concentrated at its beginning or end. The sentence score is

$$\mathrm{Score}(S_i) = \sum_j W_j \cdot \mathrm{Score}_j(S_i), \quad i = 1, 2, 3, \ldots, n \tag{5}$$

In the formula, $\mathrm{Score}(S_i)$ is the final score of sentence $S_i$, $S_i$ is the i-th sentence in a news text, and j ranges over the set of sentence scoring features, which comprises the sentence position (position), whether the sentence contains a company name (name), whether it contains a domain word (field), and the similarity between the sentence and the headline (similarity). $W_j$ is the score weight of feature j, and $\mathrm{Score}_j(S_i)$ is the score of sentence $S_i$ on feature j.
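The weighted sum in formula (5) can be sketched in Python as follows. Cosine similarity is assumed here as the similarity measure between the sentence vector and the headline vector, which the text does not state; the feature weights are left as inputs because their values are not given.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sentence_score(features, weights):
    """Score(S_i) = sum over features j of W_j * Score_j(S_i)."""
    return sum(weights[j] * features[j] for j in weights)
```

A sentence would be scored by filling `features` with its position score, the 0/1 `name` and `field` indicators, and `cosine(sentence_vec, title_vec)` for `similarity`, where both vectors are assumed to come from the trained doc2vec model.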

The result of key sentence group extraction plays a key role in the accuracy of orientation analysis of news text, and the quality of the extraction directly affects the quality of text classification. The method achieves good extraction results.

The step of classifying the key sentence group with the LSTM model comprises:

(1) training the LSTM model on an annotated corpus until the parameter requirements are met;

(2) segmenting the extracted key sentence group into words and removing stop words;

(3) training on the sentences with Word2vec and TFIDF to obtain sentence vectors;

(4) classifying the orientation of the sentence vectors with the trained LSTM model;

(5) applying the orientation judgment mechanism to the numbers of positive and negative sentences in the key sentence group of a news text to obtain the orientation of that news text.

Orientation analysis of a news text can be converted into judging the overall orientation of its key sentence group. The orientation judgment mechanism is as follows: the trained LSTM model judges the orientation of each key sentence separately. If the number of positive key sentences is greater than the number of negative key sentences, the news text is considered positive; if the number of negative key sentences is greater than the number of positive key sentences, the news text is considered negative; if the two numbers are equal, the orientation of the news text is determined by the orientation of the headline. When the orientation of the key sentences is analyzed, segmenting the sentences with jieba and removing stop words improves both the classification quality and the efficiency.
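The orientation judgment mechanism amounts to a majority vote with the headline as tie-breaker. A minimal Python sketch, assuming the per-sentence labels have already been produced by the classifier and are encoded as the strings "positive" and "negative":

```python
def document_orientation(sentence_labels, title_label):
    """Majority vote over key-sentence labels; ties fall back to the headline label."""
    positive = sum(1 for label in sentence_labels if label == "positive")
    negative = sum(1 for label in sentence_labels if label == "negative")
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return title_label
```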

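The LSTM network model can learn long-term dependency information. There are closed loops between the hidden layers of the model; the weights between hidden layers control the memory of the LSTM network and are responsible for scheduling the memory, and the model uses the current memory state of the hidden layer as part of the input for the computation at the next time step. The model embeds the input layer and hidden layer of a traditional RNN in memory cells and manages the cell state through gates. Fig. 1 shows the LSTM gates, where $f_t$, $i_t$, and $O_t$ are the forget gate, input gate, and output gate, respectively.

$x_t$ is the input to the LSTM unit at time t, $h_t$ is the output, and C is the value of the memory cell at different times. The forget gate $f_t$ determines how much information passes through. This gate takes $x_t$ and the previous output $h_{t-1}$ as input and outputs a value between 0 and 1 that describes how much of each component passes: 0 means discard completely and 1 means pass everything. The value of $f_t$ is computed as

$$f_t = \sigma(w_f[h_{t-1}, x_t] + b_f) \tag{6}$$

In the formula, σ is the sigmoid function, $w_f$ is the forget gate weight, and $b_f$ is the forget gate bias.

The input gate layer determines which values are to be updated: $i_t$ is the update value and controls the influence of the current input data on the memory cell state, while a tanh layer generates a new candidate value vector $\tilde{C}_t$ that is added to the state. The update formulas for $i_t$ and $\tilde{C}_t$ are, respectively,

$$i_t = \sigma(w_i[h_{t-1}, x_t] + b_i) \tag{7}$$

$$\tilde{C}_t = \tanh(w_c[h_{t-1}, x_t] + b_c) \tag{8}$$

In the formulas, σ is the sigmoid function, $w_i$ is the update gate weight, $b_i$ is the update gate bias, tanh is the hyperbolic tangent function, $w_c$ is the weight of the updated candidate value, $b_c$ is the bias of the updated candidate value, and $\tilde{C}_t$ is the candidate value.

Next the old cell state is updated from $C_{t-1}$ to $C_t$: the old state $C_{t-1}$ is multiplied by $f_t$ to discard the information to be dropped, and the new candidate value $i_t * \tilde{C}_t$ is added. The update formula for $C_t$ is

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \tag{9}$$

A sigmoid layer decides which part of the current state is output. The state is passed through tanh to obtain a value between -1 and 1, and this value is multiplied by the sigmoid output $O_t$ to give the output at this time step. The update formulas for $O_t$ and $h_t$ are, respectively,

$$O_t = \sigma(w_o[h_{t-1}, x_t] + b_o) \tag{10}$$

$$h_t = O_t * \tanh(C_t) \tag{11}$$

In the formulas, $w_o$ is the weight for updating the output value, $b_o$ is the bias for updating the output value, and $h_t$ is the final output value.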
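The gate equations can be traced with a single LSTM time step written in plain Python. This is a minimal sketch for following the data flow, not the trained model of the embodiment: the candidate value and cell-state update follow the standard LSTM form, vectors are Python lists, and the parameter dictionary `p` with keys `w_f`, `b_f`, and so on is a hypothetical layout.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(w, v, b):
    """Compute w @ v + b for a row-major matrix w, vector v, and bias vector b."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi for row, bi in zip(w, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; returns the new output h_t and cell state C_t."""
    z = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(p["w_f"], z, p["b_f"])]      # forget gate
    i_t = [sigmoid(a) for a in affine(p["w_i"], z, p["b_i"])]      # input gate
    c_hat = [math.tanh(a) for a in affine(p["w_c"], z, p["b_c"])]  # candidate value
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    o_t = [sigmoid(a) for a in affine(p["w_o"], z, p["b_o"])]      # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all weights and biases set to zero, every gate equals 0.5 and the candidate value is 0, so the cell state is simply halved at each step; this makes a convenient sanity check.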

Word2vec represents text in a distributed manner. This representation solves the high-dimensional sparsity problem of the traditional vector space model and also supplies semantic information that traditional models lack, which gives it a clear advantage in classifying short texts. TFIDF is a word frequency statistics method used to measure the importance of a character or word in a class of texts; introducing it solves the problem that Word2vec cannot distinguish the importance of words within a text. Combining Word2vec and TFIDF makes the text vector representation more accurate.

TFIDF is a statistical method whose main idea is: if a character or word occurs frequently in one class of texts and rarely in other texts, the character or word is considered to discriminate well between classes. TFIDF is TF × IDF, where TF is the probability of word t in document d and IDF is the class-discriminating power of word t: the fewer the documents that contain t, the larger the IDF value. The formulas for TF and IDF are, respectively,

In the formulas, f(t, d) is the number of times word t occurs in document d, $df_t$ is the number of documents containing word t, and N is the total number of documents. The weight of word t in a document is computed as

$$\mathrm{tfidf}_t = \mathrm{tf}(t, d) \cdot \mathrm{idf}_t \tag{14}$$
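Formula (14) can be sketched in Python using one common form of the two factors, assumed here because the TF and IDF formulas are not reproduced in this text: tf(t, d) = f(t, d) divided by the number of words in d, and idf_t = log(N / df_t).

```python
import math
from collections import Counter

def tfidf(documents):
    """documents: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(documents)
    # df[t] = number of documents that contain word t at least once
    df = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)   # f(t, d) for every word t in this document
        total = len(doc)
        weights.append({word: (count / total) * math.log(n_docs / df[word])
                        for word, count in counts.items()})
    return weights
```

Under this assumed form, a word that occurs in every document gets weight 0, which matches the statement that IDF grows as fewer documents contain the word.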

Word2vec is a deep neural network probabilistic model for computing word vectors; compared with traditional methods, the model can make full use of the semantic information of the context. Word2vec has two training models, CBOW and skip-gram. The CBOW training model is used in this embodiment; CBOW is expressed as

$$p(w_t \mid \tau(w_{t-k}, w_{t-k+1}, \ldots, w_{t+k} \mid w_t)) \tag{15}$$

In the formula, $w_t$ is a word in the dictionary; the words within a window of size k on either side of $w_t$ are used to predict the probability that $w_t$ occurs, and τ is the operator that sums the vectors of the words in the window to the left and right. Combining Word2vec and TFIDF, the word vector of word t in a document is expressed as

$$V(t) = \mathrm{word2vec}(t) \cdot \mathrm{tfidf}(t) \tag{16}$$

In the formula, V(t) is the word vector after weighting by the two models, word2vec(t) is the word vector of t trained by the word2vec model, and tfidf(t) is the weight of t in the document obtained from the TFIDF model. The sentence vector is obtained by summing the word vectors, computed by formula (16), of the words in the sentence.
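The sentence vector construction can be sketched in Python as follows. The word vectors and TFIDF weights are passed in as plain dictionaries, standing in for the outputs of the trained models; skipping words that have no vector is an assumption of this sketch.

```python
def sentence_vector(tokens, word_vectors, tfidf_weights, dim):
    """Sum V(t) = word2vec(t) * tfidf(t) over the words of one sentence."""
    total = [0.0] * dim
    for token in tokens:
        if token in word_vectors:  # words without a vector are skipped
            weight = tfidf_weights.get(token, 0.0)
            total = [acc + weight * x for acc, x in zip(total, word_vectors[token])]
    return total
```

Because each word vector is scaled by its TFIDF weight before summation, words that are frequent in the document but rare in the corpus contribute most to the sentence vector.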

The embodiments described above express only some implementations of the present invention. Their description is specific and detailed, but it should not therefore be understood as limiting the patent scope of the invention. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims

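1. A method for analyzing the sentiment orientation of financial articles, characterized by comprising: identifying company names, extracting a key sentence group, and classifying the key sentence group with an LSTM model.

2. The method for analyzing the sentiment orientation of financial articles according to claim 1, characterized in that the step of identifying company names comprises:

Step (1): decomposing the news text to be processed into a set of N-grams that serve as candidate company names;

Step (2): adding 1 to the score of each N-gram that occurs in a sentence containing a six-digit company code and precedes that code;

Step (3): matching each N-gram in turn against the basic dictionary by similarity and updating its score;

Step (4): running a Baidu search and a Baidu Baike query for each candidate company name and updating its score; N-grams whose score is higher than a threshold are taken as company names.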

3. The method for analyzing the sentiment orientation of financial articles according to claim 1, characterized in that step (1) is specifically: the scores of the N-gram set are first initialized, and each N-gram in the set is matched by similarity against the basic company name dictionary created above to obtain the candidate company name set; the similarity between an N-gram X and a company name Y is computed as

In the formula, α and β are weights, count is the number of characters that belong to both X and Y, start indicates whether the N-gram X begins with the beginning of company name Y, and end indicates whether X ends with the ending of Y.

4. The method for analyzing the sentiment orientation of financial articles according to claim 1, characterized in that step (4) is specifically:

A Baidu search and a Baidu Baike query are run for each entry in the candidate company name set, and the scores of the set are updated; if "stock code", "company", "group", or "enterprise" appears in a Baidu search result, that result is counted as one effective query; if, in a single Baidu Baike query result, the title is not empty, or "stock code", "company", "group", or "enterprise" appears in the summary or basic information, the query is counted as effective; the candidate company name score is updated by combining the Baidu Baike query and the Baidu search, and the Internet query update score is

$$\mathrm{Search}(X) = \eta \cdot \mathrm{count}(X \in \mathrm{search\_list}) + \gamma \cdot \mathrm{baike\_query}(X)$$

The company name recognition score is computed as

$$\mathrm{Name} = \lambda \cdot \mathrm{Sim} + \mu \cdot \mathrm{Search}$$

where Name is the final score of the N-gram, λ and μ are weights, Sim is the similarity between the N-gram and the company name dictionary, and Search is the Internet query update score of the N-gram.

5. The method for analyzing the sentiment orientation of financial articles according to claims 1-4, characterized in that the step of extracting the key sentence group comprises:

(1) adding the news headline to the key sentence group;

(2) computing the similarity between each sentence and the news headline and updating the sentence score;

(3) updating the score of each candidate sentence according to its position; checking whether the sentence contains domain word information, recorded as 1 if it does and 0 otherwise; checking whether the sentence contains a company name or company code, recorded as 1 if it does and 0 otherwise; and updating each sentence score again;

(4) sorting the sentences in descending order of score and taking the sentences whose score is greater than a threshold as the key sentence group of the news text; if no sentence in the candidate key sentence group scores above the threshold, the highest-scoring sentence is added to the key sentence group.

6. The method for analyzing the sentiment orientation of financial articles according to claims 1-5, characterized in that the sentence position scoring formula used when updating candidate sentence scores by position is

where $S_i$ is the i-th sentence in the text, abs takes the absolute value, and n is the total number of sentences in the text;

The sentence score is

$$\mathrm{Score}(S_i) = \sum_j W_j \cdot \mathrm{Score}_j(S_i), \quad i = 1, 2, 3, \ldots, n$$

where $\mathrm{Score}(S_i)$ is the final score of sentence $S_i$, $S_i$ is the i-th sentence in a news text, j ranges over the set of sentence scoring features, $W_j$ is the score weight of feature j, and $\mathrm{Score}_j(S_i)$ is the score of sentence $S_i$ on feature j.

7. The method for analyzing the sentiment orientation of financial articles according to claims 1-6, characterized in that the step of classifying the key sentence group with the LSTM model comprises:

(1) training the LSTM model on an annotated corpus until the parameter requirements are met;

(2) segmenting the extracted key sentence group into words and removing stop words;

(3) training on the sentences with Word2vec and TFIDF to obtain sentence vectors;

(4) classifying the orientation of the sentence vectors with the trained LSTM model;

(5) applying the orientation judgment mechanism to the numbers of positive and negative sentences in the key sentence group of a news text to obtain the orientation of that news text.

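8. The method for analyzing the sentiment orientation of financial articles according to claims 1-7, characterized in that, in the LSTM model, the value of $f_t$ is computed as

$$f_t = \sigma(w_f[h_{t-1}, x_t] + b_f)$$

where σ is the sigmoid function, $w_f$ is the forget gate weight, and $b_f$ is the forget gate bias;

$i_t$ is the update value; it controls the influence of the current input data on the state of the memory cell, and a tanh layer generates a new candidate value vector $\tilde{C}_t$, which is added to the state. The update formulas for $i_t$ and $\tilde{C}_t$ are, respectively,

$$i_t = \sigma(w_i[h_{t-1}, x_t] + b_i)$$

$$\tilde{C}_t = \tanh(w_c[h_{t-1}, x_t] + b_c)$$

where $w_i$ is the update gate weight, $b_i$ is the update gate bias, tanh is the hyperbolic tangent function, $w_c$ is the weight of the updated candidate value, $b_c$ is the bias of the updated candidate value, and $\tilde{C}_t$ is the candidate value;

The update formula for $C_t$ is

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

A sigmoid layer decides which part of the current state is output; the state is passed through tanh to obtain a value between -1 and 1, and this value is multiplied by the sigmoid output $O_t$ to give the output at this time step. The update formulas for $O_t$ and $h_t$ are, respectively,

$$O_t = \sigma(w_o[h_{t-1}, x_t] + b_o)$$

$$h_t = O_t * \tanh(C_t)$$

where $w_o$ is the weight for updating the output value, $b_o$ is the bias for updating the output value, and $h_t$ is the final output value.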

9. The method for analyzing the sentiment orientation of financial articles according to claims 1-7, characterized in that, combining Word2vec and TFIDF, the word vector of a word t in a document is expressed as

$$V(t) = \mathrm{word2vec}(t) \cdot \mathrm{tfidf}(t)$$

where V(t) is the word vector after weighting by the two models, word2vec(t) is the word vector of t trained by the word2vec model, and tfidf(t) is the weight of t in the document obtained from the TFIDF model.

10. The method for analyzing the sentiment orientation of financial articles according to claims 1-7, characterized in that the formulas for TF and IDF are, respectively,

where f(t, d) is the number of times word t occurs in document d, $df_t$ is the number of documents containing word t, and N is the total number of documents; the weight of word t in a document is computed as

$$\mathrm{tfidf}_t = \mathrm{tf}(t, d) \cdot \mathrm{idf}_t$$

The CBOW training model is used; CBOW is expressed as

$$p(w_t \mid \tau(w_{t-k}, w_{t-k+1}, \ldots, w_{t+k} \mid w_t))$$

where $w_t$ is a word in the dictionary; the words within a window of size k on either side of $w_t$ are used to predict the probability that $w_t$ occurs, and τ is the operator that sums the vectors of the words in the window to the left and right.