CN108932229A - Financial article tendency analysis method - Google Patents

Financial article tendency analysis method

Info

Publication number
CN108932229A
Authority
CN
China
Prior art keywords
sentence
score
formula
tuple
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810605916.8A
Other languages
Chinese (zh)
Inventor
吕学强
董志安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University filed Critical Beijing Information Science and Technology University
Priority to CN201810605916.8A priority Critical patent/CN108932229A/en
Publication of CN108932229A publication Critical patent/CN108932229A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a financial article tendency analysis method, comprising: identifying company names, extracting key sentence groups, and classifying the key sentence groups with an LSTM model. In the financial article tendency analysis method provided by the invention, company names are identified with a method based on a company-abbreviation dictionary and encyclopedia queries, which performs well and scales easily; key sentence groups are extracted with a comprehensive-feature method based on text-similarity matching with the deep-learning framework doc2vec, which extracts well with high precision and recall; and the text-tendency classification is accurate and effective, so the method can meet the needs of practical applications well.

Description

Financial article tendency analysis method
Technical field
The invention belongs to the field of text analysis, and in particular relates to a financial article tendency analysis method.
Background technique
Current methods for text sentiment-orientation analysis include the following: 1) use attributes such as sentiment, stance and keywords as factors for extracting key sentences, and then perform supervised or semi-supervised sentiment classification on the key sentence groups; the accuracy of the key sentences obtained by such methods is not high; 2) perform feature expansion using sentiment vocabularies that contain negation words, tendency words and degree words to train on text; this approach does not take context into account and performs poorly. Research specifically targeting financial article text classification is relatively scarce both at home and abroad; it mainly uses vocabularies and semantic rules to extract event semantic annotation information from news text and uses that information as features for machine-learning classification, but this approach is overly complex and not very accurate. A semantic-rule-based sentiment analysis method for Web financial article text extracts text attributes with the machine-learning method Apriori and builds a sentiment dictionary and semantic rules to compute sentiment orientation; this method is rather complex and its analysis results are mediocre, so it cannot meet the needs of practical applications well. Company name recognition is a very important research point for extracting key sentence groups from financial articles, yet research results in this area have so far been relatively scarce; most prior-art methods still have low recognition accuracy for company abbreviations, and the construction of complicated rules and knowledge bases severely limits their application.
Summary of the invention
In view of the above problems in the prior art, the purpose of the present invention is to provide a financial article tendency analysis method that avoids the above technical defects.
To achieve the above object of the invention, the technical solution provided by the invention is as follows:
A financial article tendency analysis method, comprising: identifying company names, extracting key sentence groups, and classifying the key sentence groups with an LSTM model.
Further, the step of identifying company names includes:
Step (1): decomposing the news text to be processed into a set of N-tuples as candidate company names;
Step (2): adding 1 to the score of every N-tuple that appears in a sentence containing a six-digit company code and that precedes the company code;
Step (3): matching each N-tuple in turn against the base dictionary by similarity and updating its score;
Step (4): querying each candidate company name with Baidu Search and Baidu Baike and updating its score; the N-tuples whose score exceeds a threshold are taken as company names.
Further, step (1) is specifically: first initializing the scores of the N-tuple set, and matching each N-tuple in the set against the previously built base company-name dictionary by similarity to obtain the candidate company-name set; the similarity between an N-tuple X and a company name Y is a weighted combination, with weights α and β, of count, the number of words belonging to both X and Y, and of start and end, which indicate whether X shares the same beginning and the same ending as company name Y.
Further, step (4) is specifically:
Querying the candidate company-name set with Baidu Search and Baidu Baike and updating the set scores: if "stock code", "company", "group" or "enterprise" appears in a Baidu Search result, it is counted as one valid query; if, in a single Baidu Baike query result, the title is not empty, or "stock code", "company", "group" or "enterprise" appears in the summary or the basic information, the query is counted as a valid query; the candidate company-name score is updated by combining the Baidu Baike query and Baidu Search, and the internet-query score update is
search(X) = η * count(X ∈ search_list) + γ * baike_query(X);
where η is the Baidu Search weight, count is the number of valid Baidu Search results, γ is the Baidu Baike query weight, and baike_query is the Baidu Baike return value;
the company-name recognition score is computed as
name = λ * Sim + μ * search;
where name is the final score of the N-tuple, λ and μ are weights, Sim is the similarity between the N-tuple and the company-name dictionary, and search is the internet-search update result of the N-tuple.
Further, the step of extracting the key sentence group includes:
(1) adding the news headline to the key sentence group;
(2) computing the similarity between each sentence and the headline and updating the sentence scores;
(3) updating the candidate sentences' position-information scores; recording 1 if a sentence contains domain-word information and 0 otherwise, and likewise recording 1 if the sentence contains a company name or company code and 0 otherwise; then updating each sentence's score again;
(4) sorting the sentences by score in descending order; the sentences whose score exceeds a threshold form the key sentence group of the news text; if no sentence in the candidate key sentence group exceeds the threshold, the highest-scoring sentence is added to the key sentence group.
Further, when updating the candidate sentences' position-information scores, the sentence-position score is computed from the position of the sentence in the text, where S_i is the i-th sentence in the text, abs denotes the absolute value, and n is the total number of sentences in the text;
the sentence score is
Score(S_i) = Σ_j W_j * Score_j(S_i), i = 1, 2, 3, …, n;
where Score(S_i) is the final score of sentence S_i, S_i is the i-th sentence of a news text, j ranges over the sentence-scoring feature set, W_j is the score weight of feature j, and Score_j(S_i) is the score of sentence S_i on feature j.
Further, the step of classifying the key sentence group with the LSTM model includes:
(1) training the LSTM model on the annotated corpus until the parameter requirements are met;
(2) segmenting the key sentence group obtained in the extraction step into words and removing stop words;
(3) training Word2vec and TFIDF on the sentences to obtain sentence vectors;
(4) classifying the sentence vectors for tendency with the trained LSTM model;
(5) using the tendency-judgment mechanism to compare the numbers of positive and negative sentences in the key sentence group of a news text and obtain the tendency of the news text.
Further, in the LSTM model, the value of f_t is computed as
f_t = σ(w_f[h_{t-1}, x_t] + b_f),
where σ is the sigmoid function, which determines the values to be updated, w_f is the forget-gate weight, and b_f is the forget-gate bias;
i_t is the update value and controls the influence of the current input on the memory-cell state; a tanh layer generates a new candidate-value vector C̃_t, which is added to the state; the update formulas for i_t and C̃_t are, respectively,
i_t = σ(w_i[h_{t-1}, x_t] + b_i);
C̃_t = tanh(w_c[h_{t-1}, x_t] + b_c);
where w_i is the update-gate weight, b_i is the update-gate bias, tanh is the hyperbolic tangent function, w_c is the candidate-value update weight, b_c is the candidate-value update bias, and C̃_t is the candidate value;
the update formula for C_t is
C_t = f_t * C_{t-1} + i_t * C̃_t;
a sigmoid layer decides which part of the current state is output, the state is passed through tanh to obtain a value in the interval (-1, 1), and this value is multiplied by the sigmoid output O_t to give the output at this moment; the update formulas for O_t and h_t are, respectively,
O_t = σ(w_o[h_{t-1}, x_t] + b_o); h_t = O_t * tanh(C_t);
where w_o is the weight of the output value, b_o is the bias of the output value, and h_t is the final output value.
Further, combining Word2vec and TFIDF, the vector of a word t in a document is expressed as
v(t) = word2vec(t) * tfidf(t);
where v(t) is the word vector after weighting by the two models, word2vec(t) is the vector of t trained by the word2vec model, and tfidf(t) is the weight of the word t in the document trained by the TFIDF model.
Further, the calculation formulas of TF and IDF are, respectively,
tf(t, d) = f(t, d) / Σ_{t'∈d} f(t', d); idf_t = log(N / df_t);
where f(t, d) is the number of occurrences of word t in document d, df_t is the number of documents containing word t, and N is the total number of documents; the weight of word t in a document is computed as
tfidf_t = tf(t, d) * idf_t.
The CBOW training model is used; CBOW is expressed as
p(w_t | τ(w_{t-k}, w_{t-k+1}, …, w_{t+k} | w_t));
where w_t is a word in the dictionary whose probability of occurrence is predicted from the words within a window of size k on either side of w_t, and τ denotes the operator that sums the vectors of the neighboring words in the window.
The financial article tendency analysis method provided by the invention identifies company names with a method based on a company-abbreviation dictionary and encyclopedia queries, which performs well and scales easily; its comprehensive-feature key-sentence-group extraction method, based on text-similarity matching with the deep-learning framework doc2vec, extracts well with high precision and recall; and its text-tendency classification is accurate and effective, so the method can meet the needs of practical applications well.
Description of the drawings
Fig. 1 is a diagram of the LSTM gates.
Specific embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is further described below with reference to the accompanying drawing and specific embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the invention.
A financial article tendency analysis method includes the following steps: identifying company names, extracting key sentence groups, and classifying the key sentence groups with an LSTM model.
The step of identifying company names includes:
(1) decomposing the news text to be processed into a set of N-tuples as candidate company names;
(2) adding 1 to the score of every N-tuple that appears in a sentence containing a six-digit company code and that precedes the company code;
(3) matching each N-tuple in turn against the base dictionary by similarity and updating its score;
(4) querying each candidate company name with Baidu Search and Baidu Baike and updating its score; the N-tuples whose score exceeds a threshold are taken as company names.
In this embodiment, company names are identified with a method based on a company-abbreviation dictionary and encyclopedia queries: company abbreviations and their mapping to company codes are added to the company-abbreviation dictionary, and a Baidu Baike query factor is introduced. The method is easy to understand, convenient to implement and highly scalable, and it recognizes new company names well. First, the set of N-tuples (N-grams) in each text to be processed is extracted as candidate company names; the similarity to the base dictionary is computed; it is checked whether each tuple appears in a sentence containing a six-digit company code; each tuple is then given a combined score from Baidu Baike and Baidu Search; finally, the N-tuples in the set whose score exceeds a threshold are taken as company names.
In this embodiment, company codes and abbreviations are obtained from the three major domestic stock exchanges to build the base dictionary, and the two are mapped to each other in the dictionary; for example, in the base dictionary '000027' and 'Shenzhen Energy' both stand for Shenzhen Energy Group Co., Ltd.
Specifically, the scores of the N-tuple set are initialized first, and each N-tuple in the set is matched against the previously built base company-name dictionary by similarity to obtain the candidate company-name set. The similarity between an N-tuple X and a company name Y is a weighted combination, with weights α and β, of count, the number of words belonging to both X and Y, and of start and end, which indicate whether X shares the same beginning and the same ending as company name Y; after fitting, setting the weights to 0.4 and 1 gives the best result.
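To make the candidate-generation and dictionary-matching step concrete, the following is a minimal Python sketch. The exact form of the similarity formula is not reproduced in the text, so dict_similarity assumes the weighted combination described above (α times the shared-word count plus β times the start/end match indicators); the function names and the sample text are illustrative, not part of the patent.

```python
# Minimal sketch of step (1): N-gram candidate extraction and base-dictionary matching.
# dict_similarity assumes the weighted combination described above (alpha * shared-word
# count + beta * (start match + end match)); names and the sample text are illustrative.

def ngram_candidates(text, max_n=6):
    """Decompose a text into the set of character N-grams used as candidate company names."""
    return {text[i:i + n]
            for n in range(2, max_n + 1)
            for i in range(len(text) - n + 1)}

def dict_similarity(x, y, alpha=0.4, beta=1.0):
    """Assumed similarity between a candidate x and a dictionary company name y."""
    count = len(set(x) & set(y))           # characters belonging to both x and y
    start = 1 if x[0] == y[0] else 0       # same beginning as the company name
    end = 1 if x[-1] == y[-1] else 0       # same ending as the company name
    return alpha * count + beta * (start + end)

base_dict = {"000027": "深圳能源"}           # company code mapped to company abbreviation
candidates = ngram_candidates("深圳能源集团今日发布公告")
scores = {c: max(dict_similarity(c, name) for name in base_dict.values())
          for c in candidates}
```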
The candidate company-name set is then queried with Baidu Search and Baidu Baike to update the set scores. If "stock code", "company", "group" or "enterprise" appears in a Baidu Search result, it is counted as one valid query. If, in a single Baidu Baike query result, the title is not empty, or "stock code", "company", "group" or "enterprise" appears in the summary or the basic information, the query is counted as a valid query. Tables 1 and 2 show the query results from Baidu Baike and from Baidu Search, respectively, for the keyword "Baidu".
Table 1. Baidu Baike query result
Table 2. Baidu Search result
The two tables above show that, judging only from the results returned by Baidu Search in Table 2, only 2 of the 10 search results confirm that "Baidu" is a company; combined with the Baidu Baike query in Table 1, it becomes very likely that "Baidu" is a company. The candidate company-name score is updated by combining the Baidu Baike query and Baidu Search, and the internet-query score update is
search(X) = η * count(X ∈ search_list) + γ * baike_query(X) (2)
where η is the Baidu Search weight, count is the number of valid Baidu Search results, γ is the Baidu Baike query weight, and baike_query is the Baidu Baike return value. By learning from the data, setting the weight parameters η and γ to 0.2 and 1.3 gives the best solution.
The company-name recognition score is computed as
name = λ * Sim + μ * search (3),
where name is the final score of the N-tuple, λ and μ are weights, Sim is the similarity between the N-tuple and the company-name dictionary, and search is the internet-search update result of the N-tuple. After fitting, setting λ and μ to 1 and 1.12 gives the best results.
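A minimal sketch of how formulas (2) and (3) combine, using the fitted weights from this embodiment; the actual Baidu Search and Baidu Baike calls are stubbed out, so valid_search_hits and baike_hit merely stand for the counts described in the text, and the decision threshold is illustrative.

```python
# Sketch of formulas (2) and (3): combine the dictionary similarity with the
# internet-query score. Real Baidu Search / Baidu Baike calls are stubbed out;
# valid_search_hits and baike_hit stand for the counts described in the text.

def internet_score(valid_search_hits, baike_hit, eta=0.2, gamma=1.3):
    """Formula (2): search(X) = eta * count(X in search_list) + gamma * baike_query(X)."""
    return eta * valid_search_hits + gamma * baike_hit

def name_score(sim, search, lam=1.0, mu=1.12):
    """Formula (3): name = lambda * Sim + mu * search."""
    return lam * sim + mu * search

# Example from the embodiment: 2 valid hits among 10 Baidu Search results plus a
# valid Baidu Baike entry for the keyword "百度".
search_x = internet_score(valid_search_hits=2, baike_hit=1)
final = name_score(sim=3.6, search=search_x)
is_company = final > 4.0   # the threshold value is not given in the text; illustrative
```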
The step of extracting the key sentence group includes:
(1) adding the news headline to the key sentence group;
(2) computing the similarity between each sentence and the headline with the trained doc2vec model and updating the sentence scores;
(3) updating the candidate sentences' position-information scores with formula (4); recording 1 if a sentence contains domain-word information and 0 otherwise, and likewise recording 1 if the sentence contains a company name or company code and 0 otherwise; then updating each sentence's score again;
(4) sorting the sentences by score in descending order; the sentences whose score exceeds the threshold Φ form the key sentence group of the news text; if no sentence in the candidate key sentence group exceeds Φ, the highest-scoring sentence is added to the key sentence group.
The news headline carries the more important information of the text. The key sentences of a news item often appear at the beginning or the end of the text, so sentences at the beginning and end of the text are given a higher weight.
Doc2vec is a deep-learning model based on word2vec; it represents sentences with real-valued vectors, so that similarity between sentences can be computed. In this embodiment a comprehensive-feature key-sentence-group extraction method based on doc2vec text-similarity matching is used: the news headline is first added to the key sentence group; the doc2vec model computes the similarity between each sentence of the text and the headline; at the same time, the position of the sentence in the news text, whether the sentence contains a company name or a six-digit company code, and whether it contains domain-verb information are combined to update the sentence-set scores again; the sentences whose score exceeds the threshold Φ become the news key sentence group, and if no sentence exceeds the threshold, the highest-scoring sentence is added to the key sentence group. The sentence-position score of formula (4) is computed from the position of the sentence in the text, where S_i is the i-th sentence in the text, abs denotes the absolute value, and n is the total number of sentences in the text; under this mechanism, sentences at the beginning and at the end of the text obtain higher scores, matching the rule that the key content of a news text is concentrated at its beginning or end. The sentence score is
Score(S_i) = Σ_j W_j * Score_j(S_i), i = 1, 2, 3, …, n (5);
where Score(S_i) is the final score of sentence S_i, S_i is the i-th sentence of a news text, j ranges over the sentence-scoring feature set, which includes the sentence position (position), whether the sentence contains a company name (name), whether it contains a domain word (field), and the similarity between the sentence and the headline (similarity); W_j is the score weight of feature j, and Score_j(S_i) is the score of sentence S_i on feature j.
The key-sentence-group extraction result plays a key role in the accuracy of news-text tendency analysis: the quality of the extraction directly affects the quality of the text classification, and the method achieves good extraction results.
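The scoring scheme of formula (5) can be sketched as follows, assuming gensim's Doc2Vec for the sentence-headline similarity; position_score only stands in for formula (4), whose exact form is not reproduced in the text, and the feature weights W and the threshold PHI are illustrative rather than the fitted values.

```python
# Sketch of the key-sentence scoring of formula (5). gensim's Doc2Vec supplies the
# sentence-headline similarity; position_score is only a stand-in for formula (4)
# (beginning/end sentences score higher), and the weights W and threshold PHI are
# illustrative, not the fitted values of the embodiment.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def position_score(i, n):
    # Stand-in for formula (4): sentences near the beginning or end score higher.
    mid = (n - 1) / 2.0
    return abs(i - mid) / mid if n > 1 else 1.0

def score_sentences(title_words, sentences_words, field_words, company_names):
    docs = [TaggedDocument(w, [k]) for k, w in enumerate([title_words] + sentences_words)]
    model = Doc2Vec(docs, vector_size=100, min_count=1, epochs=40)
    title_vec = model.infer_vector(title_words)
    W = {"similarity": 1.0, "position": 0.5, "field": 0.3, "name": 0.3}
    n = len(sentences_words)
    scores = []
    for i, words in enumerate(sentences_words):
        feats = {
            "similarity": cosine(title_vec, model.infer_vector(words)),
            "position": position_score(i, n),
            "field": 1.0 if set(words) & set(field_words) else 0.0,
            "name": 1.0 if set(words) & set(company_names) else 0.0,
        }
        scores.append(sum(W[j] * feats[j] for j in W))   # formula (5)
    return scores

PHI = 1.2   # score threshold; the fitted value is not given in the text
```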
The step of classifying the key sentence group with the LSTM model includes:
(1) training the LSTM model on the annotated corpus until the parameter requirements are met;
(2) segmenting the key sentence group obtained in the extraction step into words and removing stop words;
(3) training Word2vec and TFIDF on the sentences to obtain sentence vectors;
(4) classifying the sentence vectors for tendency with the trained LSTM model;
(5) using the tendency-judgment mechanism to compare the numbers of positive and negative sentences in the key sentence group of a news text and obtain the tendency of the news text.
The tendency analysis of a news text can be converted into judging the overall tendency of its key sentence group. The tendency-judgment mechanism is as follows: the trained LSTM model judges the tendency of each key sentence separately; if the number of positive key sentences exceeds the number of negative ones, the news text is considered positive; if the number of negative key sentences exceeds the number of positive ones, the news text is considered negative; and if the numbers of positive and negative key sentences are equal, the tendency of the news text follows the tendency of the headline. When the key sentences are analyzed for tendency, the sentences are segmented with jieba and stop words are removed, which improves the classification results while also improving efficiency. A sketch of this voting mechanism is given below.
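The following is a minimal sketch of the tendency-judgment mechanism, assuming a classify callable that wraps the trained LSTM and returns +1 for positive and -1 for negative; the stop-word file name is illustrative.

```python
# Sketch of the tendency-judgment mechanism: a majority vote over the key sentences,
# falling back to the headline tendency on a tie. classify stands for the trained
# LSTM classifier (+1 positive, -1 negative); the stop-word file name is illustrative.
import jieba

STOP_WORDS = set(open("stopwords.txt", encoding="utf-8").read().split())

def tokenize(sentence):
    return [w for w in jieba.cut(sentence) if w.strip() and w not in STOP_WORDS]

def text_tendency(headline, key_sentences, classify):
    votes = [classify(tokenize(s)) for s in key_sentences]
    pos, neg = votes.count(1), votes.count(-1)
    if pos > neg:
        return 1          # news text is positive
    if neg > pos:
        return -1         # news text is negative
    return classify(tokenize(headline))   # tie: follow the headline tendency
```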
The LSTM network model allows closed loops between hidden layers; the weights between hidden layers learn long-term dependency information, control the memory of the LSTM network and are responsible for scheduling that memory, and the model feeds the current memory state of the hidden layer into the computation as part of the input at the next moment. The model embeds the input layer and hidden layer of a traditional RNN into a memory cell and manages the cell state through gates, the LSTM gates shown in Fig. 1, where f_t, i_t and O_t are the forget gate, the input gate and the output gate, respectively.
x_t is the input of the LSTM unit at time t, h_t is its output, and C is the value of the memory cell at different moments. The forget gate f_t determines the throughput of information: the gate takes x_t and the output h_{t-1} of the previous moment as input and produces an output between 0 and 1 that describes how much of each component passes through, where 0 means discard completely and 1 means pass through entirely. The value of f_t is computed as
f_t = σ(w_f[h_{t-1}, x_t] + b_f) (6);
where σ is the sigmoid function; this layer determines the values to be updated, w_f is the forget-gate weight, and b_f is the forget-gate bias.
i_t is the update value and controls the influence of the current input on the memory-cell state; a tanh layer generates a new candidate-value vector C̃_t, which is added to the state. The update formulas for i_t and C̃_t are, respectively,
i_t = σ(w_i[h_{t-1}, x_t] + b_i) (7),
C̃_t = tanh(w_c[h_{t-1}, x_t] + b_c) (8),
where σ is the sigmoid function, w_i is the update-gate weight, b_i is the update-gate bias, tanh is the hyperbolic tangent function, w_c is the candidate-value update weight, b_c is the candidate-value update bias, and C̃_t is the candidate value.
Next the original cell state is updated from C_{t-1} to C_t: the old state C_{t-1} is multiplied by f_t, discarding the information to be shielded, and the value i_t * C̃_t is added. The update formula for C_t is
C_t = f_t * C_{t-1} + i_t * C̃_t (9).
A sigmoid layer decides which part of the current state is output; the state is passed through tanh to obtain a value in the interval (-1, 1), which is multiplied by the sigmoid output O_t to give the output at this moment. The update formulas for O_t and h_t are, respectively,
O_t = σ(w_o[h_{t-1}, x_t] + b_o) (10);
h_t = O_t * tanh(C_t) (11);
where w_o is the weight of the output value, b_o is the bias of the output value, and h_t is the final output value.
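Equations (6) to (11) can be checked with the following numpy sketch of a single LSTM time step; the weight shapes and random initialization are illustrative, since a real model learns w_f, w_i, w_c, w_o and the biases during training.

```python
# Sketch of one LSTM time step implementing equations (6)-(11) with numpy.
# Weight shapes and the random initialization are illustrative; a trained model
# would learn w_f, w_i, w_c, w_o and the corresponding biases.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    w_f, b_f, w_i, b_i, w_c, b_c, w_o, b_o = params
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f_t = sigmoid(w_f @ z + b_f)              # (6) forget gate
    i_t = sigmoid(w_i @ z + b_i)              # (7) input gate
    c_tilde = np.tanh(w_c @ z + b_c)          # (8) candidate value
    c_t = f_t * c_prev + i_t * c_tilde        # (9) cell-state update
    o_t = sigmoid(w_o @ z + b_o)              # (10) output gate
    h_t = o_t * np.tanh(c_t)                  # (11) hidden output
    return h_t, c_t

hidden, inputs = 64, 100
rng = np.random.default_rng(0)
shapes = [(hidden, hidden + inputs), (hidden,)] * 4   # w_f,b_f, w_i,b_i, w_c,b_c, w_o,b_o
params = tuple(rng.normal(scale=0.1, size=s) for s in shapes)
h, c = lstm_step(rng.normal(size=inputs), np.zeros(hidden), np.zeros(hidden), params)
```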
Word2vec represents text with a distributed representation; this representation both solves the high-dimensional sparsity problem of the traditional vector space model and adds semantic expressiveness that traditional models lack, which gives it an obvious advantage for short-text classification. TFIDF is a word-frequency statistic used to measure the importance of a word or term within a class of texts; introducing it solves the problem that Word2vec cannot distinguish how important a word is in a text. The combination of Word2vec and TFIDF makes the text-vector representation more accurate.
TFIDF is a statistical method whose main idea is: if a word or term occurs frequently in one class of texts while rarely occurring in other texts, the word or term is considered to discriminate well between classes. TFIDF = TF × IDF, where TF represents the probability of word t in document d and IDF measures the class-discrimination power of word t: the fewer documents contain t, the larger the IDF value. The calculation formulas of TF and IDF are, respectively,
tf(t, d) = f(t, d) / Σ_{t'∈d} f(t', d) (12); idf_t = log(N / df_t) (13);
where f(t, d) is the number of occurrences of word t in document d, df_t is the number of documents containing word t, and N is the total number of documents. The weight of word t in a document is computed as
tfidf_t = tf(t, d) * idf_t (14).
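A small sketch of formulas (12) to (14), computing TF, IDF and TFIDF directly over a tokenized corpus under the standard definitions assumed above; the two-document corpus is illustrative.

```python
# Sketch of formulas (12)-(14): TF, IDF and TFIDF over a tokenized corpus
# (a list of word lists), using the standard definitions assumed above.
import math
from collections import Counter

def tfidf_weights(docs):
    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))     # number of documents containing w
    weights = []
    for doc in docs:
        counts, total = Counter(doc), len(doc)
        weights.append({w: (c / total) * math.log(n_docs / df[w])   # tf(t, d) * idf_t
                        for w, c in counts.items()})
    return weights

corpus = [["股价", "大幅", "上涨"], ["公司", "股价", "下跌"]]   # illustrative two-document corpus
print(tfidf_weights(corpus))
```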
Word2vec is a deep-neural-network probabilistic model; compared with traditional methods of computing word vectors, it can make full use of the semantic information of the context. Word2vec has two training models, CBOW and skip-gram. The CBOW training model is used in this embodiment; CBOW is expressed as
p(w_t | τ(w_{t-k}, w_{t-k+1}, …, w_{t+k} | w_t)) (15);
where w_t is a word in the dictionary whose probability of occurrence is predicted from the words within a window of size k on either side of w_t, and τ denotes the operator that sums the vectors of the neighboring words in the window. Combining Word2vec and TFIDF, the vector of a word t in a document is expressed as
v(t) = word2vec(t) * tfidf(t) (16);
where v(t) is the word vector after weighting by the two models, word2vec(t) is the vector of t trained by the word2vec model, and tfidf(t) is the weight of the word t in the document trained by the TFIDF model. The sentence vector is obtained by summing the word vectors of the words in the sentence, weighted according to formula (16).
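A minimal sketch of formula (16), training gensim's Word2Vec in CBOW mode (sg=0) and summing the TFIDF-weighted word vectors into sentence vectors; it reuses the tfidf_weights helper sketched above, and the hyperparameters are illustrative rather than the values used in the embodiment.

```python
# Sketch of formula (16): sentence vectors as the sum of word2vec vectors weighted by
# their TFIDF weights. Word2Vec runs in CBOW mode (sg=0); tfidf_weights is the helper
# sketched above, and the hyperparameters are illustrative.
import numpy as np
from gensim.models import Word2Vec

def sentence_vectors(corpus, size=100):
    w2v = Word2Vec(sentences=corpus, vector_size=size, sg=0,   # sg=0 selects CBOW
                   window=5, min_count=1, epochs=20)
    vectors = []
    for doc, weights in zip(corpus, tfidf_weights(corpus)):
        vec = np.zeros(size)
        for word in doc:
            vec += w2v.wv[word] * weights.get(word, 0.0)   # v(t) = word2vec(t) * tfidf(t)
        vectors.append(vec)
    return vectors
```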
The financial article tendency analysis method provided by the invention identifies company names with a method based on a company-abbreviation dictionary and encyclopedia queries, which performs well and scales easily; its comprehensive-feature key-sentence-group extraction method, based on text-similarity matching with the deep-learning framework doc2vec, extracts well with high precision and recall; and its text-tendency classification is accurate and effective, so the method can meet the needs of practical applications well.
The above embodiments only express several implementations of the present invention, and their description is relatively specific and detailed, but they must not therefore be construed as limiting the scope of the patent. It should be pointed out that a person of ordinary skill in the art can make various modifications and improvements without departing from the concept of the invention, and these all fall within the protection scope of the invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A financial article tendency analysis method, characterized by comprising: identifying company names, extracting key sentence groups, and classifying the key sentence groups with an LSTM model.
2. The financial article tendency analysis method according to claim 1, characterized in that the step of identifying company names includes:
Step (1): decomposing the news text to be processed into a set of N-tuples as candidate company names;
Step (2): adding 1 to the score of every N-tuple that appears in a sentence containing a six-digit company code and that precedes the company code;
Step (3): matching each N-tuple in turn against the base dictionary by similarity and updating its score;
Step (4): querying each candidate company name with Baidu Search and Baidu Baike and updating its score, the N-tuples whose score exceeds a threshold being taken as company names.
3. The financial article tendency analysis method according to claim 1, characterized in that step (1) is specifically: first initializing the scores of the N-tuple set, and matching each N-tuple in the set against the previously built base company-name dictionary by similarity to obtain the candidate company-name set; the similarity between an N-tuple X and a company name Y being a weighted combination, with weights α and β, of count, the number of words belonging to both X and Y, and of start and end, which indicate whether X shares the same beginning and the same ending as company name Y.
4. The financial article tendency analysis method according to claim 1, characterized in that step (4) is specifically:
querying the candidate company-name set with Baidu Search and Baidu Baike and updating the set scores: if "stock code", "company", "group" or "enterprise" appears in a Baidu Search result, it is counted as one valid query; if, in a single Baidu Baike query result, the title is not empty, or "stock code", "company", "group" or "enterprise" appears in the summary or the basic information, the query is counted as a valid query; the candidate company-name score is updated by combining the Baidu Baike query and Baidu Search, and the internet-query score update is
search(X) = η * count(X ∈ search_list) + γ * baike_query(X);
where η is the Baidu Search weight, count is the number of valid Baidu Search results, γ is the Baidu Baike query weight, and baike_query is the Baidu Baike return value;
the company-name recognition score is computed as
name = λ * Sim + μ * search;
where name is the final score of the N-tuple, λ and μ are weights, Sim is the similarity between the N-tuple and the company-name dictionary, and search is the internet-search update result of the N-tuple.
5. The financial article tendency analysis method according to any one of claims 1 to 4, characterized in that the step of extracting the key sentence group includes:
(1) adding the news headline to the key sentence group;
(2) computing the similarity between each sentence and the headline and updating the sentence scores;
(3) updating the candidate sentences' position-information scores; recording 1 if a sentence contains domain-word information and 0 otherwise, and likewise recording 1 if the sentence contains a company name or company code and 0 otherwise; then updating each sentence's score again;
(4) sorting the sentences by score in descending order; the sentences whose score exceeds a threshold form the key sentence group of the news text; if no sentence in the candidate key sentence group exceeds the threshold, the highest-scoring sentence is added to the key sentence group.
6. The financial article tendency analysis method according to any one of claims 1 to 5, characterized in that, when updating the candidate sentences' position-information scores, the sentence-position score is computed from the position of the sentence in the text, where S_i is the i-th sentence in the text, abs denotes the absolute value, and n is the total number of sentences in the text;
the sentence score is
Score(S_i) = Σ_j W_j * Score_j(S_i), i = 1, 2, 3, …, n;
where Score(S_i) is the final score of sentence S_i, S_i is the i-th sentence of a news text, j ranges over the sentence-scoring feature set, W_j is the score weight of feature j, and Score_j(S_i) is the score of sentence S_i on feature j.
7. The financial article tendency analysis method according to any one of claims 1 to 6, characterized in that the step of classifying the key sentence group with the LSTM model includes:
(1) training the LSTM model on the annotated corpus until the parameter requirements are met;
(2) segmenting the key sentence group obtained in the extraction step into words and removing stop words;
(3) training Word2vec and TFIDF on the sentences to obtain sentence vectors;
(4) classifying the sentence vectors for tendency with the trained LSTM model;
(5) using the tendency-judgment mechanism to compare the numbers of positive and negative sentences in the key sentence group of a news text and obtain the tendency of the news text.
8. The financial article tendency analysis method according to any one of claims 1 to 7, characterized in that, in the LSTM model, the value of f_t is computed as
f_t = σ(w_f[h_{t-1}, x_t] + b_f),
where σ is the sigmoid function, which determines the values to be updated, w_f is the forget-gate weight, and b_f is the forget-gate bias;
i_t is the update value and controls the influence of the current input on the memory-cell state; a tanh layer generates a new candidate-value vector C̃_t, which is added to the state; the update formulas for i_t and C̃_t are, respectively,
i_t = σ(w_i[h_{t-1}, x_t] + b_i);
C̃_t = tanh(w_c[h_{t-1}, x_t] + b_c);
where w_i is the update-gate weight, b_i is the update-gate bias, tanh is the hyperbolic tangent function, w_c is the candidate-value update weight, b_c is the candidate-value update bias, and C̃_t is the candidate value;
the update formula for C_t is
C_t = f_t * C_{t-1} + i_t * C̃_t;
a sigmoid layer decides which part of the current state is output, the state is passed through tanh to obtain a value in the interval (-1, 1), and this value is multiplied by the sigmoid output O_t to give the output at this moment; the update formulas for O_t and h_t are, respectively,
O_t = σ(w_o[h_{t-1}, x_t] + b_o); h_t = O_t * tanh(C_t);
where w_o is the weight of the output value, b_o is the bias of the output value, and h_t is the final output value.
9. The financial article tendency analysis method according to any one of claims 1 to 7, characterized in that, combining Word2vec and TFIDF, the vector of a word t in a document is expressed as
v(t) = word2vec(t) * tfidf(t);
where v(t) is the word vector after weighting by the two models, word2vec(t) is the vector of t trained by the word2vec model, and tfidf(t) is the weight of the word t in the document trained by the TFIDF model.
10. The financial article tendency analysis method according to any one of claims 1 to 7, characterized in that the calculation formulas of TF and IDF are, respectively,
tf(t, d) = f(t, d) / Σ_{t'∈d} f(t', d); idf_t = log(N / df_t);
where f(t, d) is the number of occurrences of word t in document d, df_t is the number of documents containing word t, and N is the total number of documents; the weight of word t in a document is computed as
tfidf_t = tf(t, d) * idf_t;
the CBOW training model is used, and CBOW is expressed as
p(w_t | τ(w_{t-k}, w_{t-k+1}, …, w_{t+k} | w_t));
where w_t is a word in the dictionary whose probability of occurrence is predicted from the words within a window of size k on either side of w_t, and τ denotes the operator that sums the vectors of the neighboring words in the window.
CN201810605916.8A 2018-06-13 2018-06-13 Financial article tendency analysis method Pending CN108932229A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810605916.8A CN108932229A (en) 2018-06-13 2018-06-13 Financial article tendency analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810605916.8A CN108932229A (en) 2018-06-13 2018-06-13 Financial article tendency analysis method

Publications (1)

Publication Number Publication Date
CN108932229A true CN108932229A (en) 2018-12-04

Family

ID=64446501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810605916.8A Pending CN108932229A (en) Financial article tendency analysis method

Country Status (1)

Country Link
CN (1) CN108932229A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180121788A1 (en) * 2016-11-03 2018-05-03 Salesforce.Com, Inc. Deep Neural Network Model for Processing Data Through Mutliple Linguistic Task Hiearchies
CN106933800A (en) * 2016-11-29 2017-07-07 首都师范大学 A kind of event sentence abstracting method of financial field
CN107908614A (en) * 2017-10-12 2018-04-13 北京知道未来信息技术有限公司 A kind of name entity recognition method based on Bi LSTM

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李江龙等: "金融领域的事件句抽取" (Event sentence extraction in the financial field), 《计算机应用研究》 (Application Research of Computers) *
胡新辰: "基于LSTM的语义关系分类研究" (Research on LSTM-based semantic relation classification), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Masters' Theses Full-text Database, Information Science and Technology) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614490A (en) * 2018-12-21 2019-04-12 北京信息科技大学 Financial article tendency analysis method based on LSTM
CN111159223A (en) * 2019-12-31 2020-05-15 武汉大学 Interactive code searching method and device based on structured embedding
CN111159223B (en) * 2019-12-31 2021-09-03 武汉大学 Interactive code searching method and device based on structured embedding
CN111782907A (en) * 2020-07-01 2020-10-16 北京知因智慧科技有限公司 News classification method and device and electronic equipment
CN111782907B (en) * 2020-07-01 2024-03-01 北京知因智慧科技有限公司 News classification method and device and electronic equipment
CN112287687A (en) * 2020-09-17 2021-01-29 昆明理工大学 Case tendency extraction type summarization method based on case attribute perception
CN113064964A (en) * 2021-03-22 2021-07-02 广东博智林机器人有限公司 Text classification method, model training method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112100344B (en) Knowledge graph-based financial domain knowledge question-answering method
CN109858028B (en) Short text similarity calculation method based on probability model
CN106919673B (en) Text mood analysis system based on deep learning
CN108932229A (en) Financial article tendency analysis method
Karim et al. Classification benchmarks for under-resourced bengali language based on multichannel convolutional-lstm network
CN106997382B (en) Innovative creative tag automatic labeling method and system based on big data
CN108984526A (en) Document topic vector extraction method based on deep learning
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
CN109726745B (en) Target-based emotion classification method integrating description knowledge
CN107180026B (en) Event phrase learning method and device based on word embedding semantic mapping
CN114254653A (en) Scientific and technological project text semantic extraction and representation analysis method
CN111563143B (en) Method and device for determining new words
CN112069312B (en) Text classification method based on entity recognition and electronic device
CN109614490A (en) Financial article tendency analysis method based on LSTM
CN106933800A (en) Event sentence extraction method for the financial field
Wu et al. Exploring syntactic and semantic features for authorship attribution
CN113033183B (en) Network new word discovery method and system based on statistics and similarity
CN110750646B (en) Attribute description extracting method for hotel comment text
CN111444704B (en) Network safety keyword extraction method based on deep neural network
CN114416942A (en) Automatic question-answering method based on deep learning
CN112949713A (en) Text emotion classification method based on ensemble learning of complex network
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
Chen et al. Sentiment classification of tourism based on rules and LDA topic model
Ao et al. News keywords extraction algorithm based on TextRank and classified TF-IDF
CN109033087A (en) Method for calculating text semantic distance, deduplication method, clustering method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned
Effective date of abandoning: 20221206