CN108932229A - Financial news article tendency analysis method - Google Patents
Financial news article tendency analysis method
- Publication number
- CN108932229A (application number CN201810605916.8A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- score
- formula
- tuple
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a tendency analysis method for financial news articles, comprising: identifying company names, extracting a key sentence group, and classifying the key sentence group with an LSTM model. The method identifies company names using a dictionary of company short names combined with Baidu Baike queries, an approach that works well and is easy to extend. It extracts the key sentence group by combining several sentence features with doc2vec-based text similarity matching, which achieves high precision and recall. The resulting text tendency judgment is accurate, so the method meets the needs of practical applications.
Description
Technical field
The invention belongs to the technical field of text analysis, and in particular relates to a tendency analysis method for financial news articles.
Background art
Current methods for analyzing the sentiment tendency of text include the following: 1) using attributes such as sentiment, position and keywords as factors for extracting key sentences, and then performing supervised and semi-supervised sentiment classification on the key sentence group; the key sentences extracted this way are not very accurate; 2) training on the text with a sentiment vocabulary containing negation words, tendency words and degree words to extend its features; this method does not take context into account and performs poorly. Text classification aimed specifically at financial news has received relatively little study at home or abroad. Existing work mainly uses vocabulary and semantic rules to extract event semantic annotation information from news text and uses that information as features for machine-learning classification, but the method is overly complex and its accuracy is low. A semantic-rule-based sentiment analysis method for web financial news text extracts text attributes with the machine learning method Apriori, builds a sentiment dictionary and semantic rules, and from them computes the sentiment tendency; this method is also fairly complex, its results are mediocre, and it does not meet the needs of practical applications well. Company name recognition is an important research topic for extracting key sentence groups from financial news, yet research results in this area are still relatively few. Most prior-art methods recognize company short names with fairly low accuracy, and the complex rules and knowledge bases they require seriously hinder their application.
Summary of the invention
In view of the above problems in the prior art, the purpose of the present invention is to provide a tendency analysis method for financial news articles that avoids the technical defects described above.
To achieve this purpose, the invention provides the following technical solution:
A tendency analysis method for financial news articles, comprising: identifying company names, extracting a key sentence group, and classifying the key sentence group with an LSTM model.
Further, the step of identifying company names comprises:
Step (1): decomposing the news text to be processed into a set of N-grams that serve as candidate company names;
Step (2): adding 1 to the score of each N-gram that appears in a sentence containing a six-digit company code and precedes that code;
Step (3): matching each N-gram in turn against the basic dictionary by similarity and updating its score;
Step (4): running a Baidu search and a Baidu Baike query for each candidate company name and updating its score; N-grams whose score exceeds a threshold are taken as company names.
Further, step (1) is specifically: first initializing the scores of the N-gram set, then matching each N-gram in the set against the basic company name dictionary created above by similarity to obtain a set of candidate company names. The similarity between an N-gram X and a company name Y is computed by
[formula not preserved in the source text]
where α and β are weights, count is the number of characters that belong to both X and Y, start indicates that N-gram X begins with the beginning of company name Y, and end indicates that N-gram X ends with the ending of company name Y.
Further, step (4) is specifically:
A Baidu search and a Baidu Baike query are run for each name in the candidate company name set to update its score. If "stock code", "company", "group" or "enterprise" appears in a Baidu search result, that result counts as an effective query. If, in a single Baidu Baike query result, the title is not empty, or "stock code", "company", "group" or "enterprise" appears in the summary and basic information, that query counts as an effective query. The candidate company name score is updated by combining the Baidu Baike query and the Baidu search; the internet query update score is
$\mathrm{Search}(X) = \eta \cdot \mathrm{count}(X \in \mathrm{search\_list}) + \gamma \cdot \mathrm{baike\_query}(X)$;
where η is the Baidu search weight, count is the number of effective Baidu search results, γ is the Baidu Baike query weight, and baike_query is the value returned by the Baidu Baike query.
The company name identification score is computed as
$\mathrm{Name} = \lambda \cdot \mathrm{Sim} + \mu \cdot \mathrm{Search}$;
where Name is the final score of the N-gram, λ and μ are weights, Sim is the similarity between the N-gram and the company name dictionary, and Search is the internet query update score of the N-gram.
Further, the step of extracting the key sentence group comprises:
(1) adding the news headline to the key sentence group;
(2) computing the similarity between each sentence and the news headline and updating the sentence score;
(3) updating the score of each candidate sentence according to its position; judging whether the sentence contains domain-word information (recorded as 1 if it does, 0 otherwise) and whether it contains a company name or company code (recorded as 1 if it does, 0 otherwise); and updating each sentence score again;
(4) sorting the sentences in descending order of score and taking the sentences whose score exceeds a threshold as the key sentence group of the news text; if no sentence in the candidate key sentence group scores above the threshold, adding the highest-scoring sentence to the key sentence group.
Further, the position scoring formula used when updating the candidate sentence scores by position is
[formula not preserved in the source text]
where $S_i$ is the i-th sentence in the text, abs takes the absolute value, and n is the total number of sentences in the text.
The score of a sentence is
$\mathrm{Score}(S_i) = \sum_j W_j \cdot \mathrm{Score}_j(S_i), \quad i = 1, 2, 3, \ldots, n$;
where $\mathrm{Score}(S_i)$ is the final score of sentence $S_i$, $S_i$ is the i-th sentence in a news text, j ranges over the set of sentence scoring features, $W_j$ is the score weight of feature j, and $\mathrm{Score}_j(S_i)$ is the score of sentence $S_i$ on feature j.
Further, the step of classifying the key sentence group with the LSTM model comprises:
(1) training the LSTM model on the labeled corpus until the parameter requirements are met;
(2) segmenting the key sentence group obtained in the second part into words and removing stop words;
(3) training on the sentences with Word2vec and TFIDF to obtain sentence vectors;
(4) classifying the tendency of the sentence vectors with the trained LSTM model;
(5) using the tendency judgment mechanism to analyze the numbers of positive and negative sentences in the key sentence group of a news text, obtaining the tendency of that news text.
Further, in the LSTM model, $f_t$ is computed as
$f_t = \sigma(w_f[h_{t-1}, x_t] + b_f)$,
where σ is the sigmoid function, $w_f$ is the forget gate weight and $b_f$ is the forget gate bias.
$i_t$ is the update value; it controls the influence of the current input data on the memory cell state and determines which values are updated. A tanh layer generates a new candidate value vector $\tilde{C}_t$, which is added to the state. The update formulas for $i_t$ and $\tilde{C}_t$ are respectively
$i_t = \sigma(w_i[h_{t-1}, x_t] + b_i)$;
$\tilde{C}_t = \tanh(w_c[h_{t-1}, x_t] + b_c)$;
where $w_i$ is the update gate weight, $b_i$ is the update gate bias, tanh is the hyperbolic tangent function, $w_c$ is the weight for the updated candidate value, $b_c$ is the bias for the updated candidate value, and $\tilde{C}_t$ is the candidate value.
The update formula for $C_t$ is
$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t$.
A sigmoid layer decides which part of the current state is output. The state is passed through tanh to obtain a value between -1 and 1, which is multiplied by the sigmoid output $O_t$ to give the output at this time step. The update formulas for $O_t$ and $h_t$ are respectively
$O_t = \sigma(w_o[h_{t-1}, x_t] + b_o)$; $h_t = O_t \cdot \tanh(C_t)$;
where $w_o$ is the weight for updating the output value, $b_o$ is the bias for updating the output value, and $h_t$ is the final output value.
Further, combining Word2vec and TFIDF, the word vector of a word t in a document is expressed as
$V(t) = \mathrm{word2vec}(t) \cdot \mathrm{tfidf}(t)$;
where V(t) is the word vector after weighting by the two models, word2vec(t) is the word vector of t trained by the word2vec model, and tfidf(t) is the weight of t in the document obtained from the TFIDF model.
Further, the formulas for TF and IDF are respectively
[formulas not preserved in the source text]
where f(t, d) is the number of times word t occurs in document d, $df_t$ is the number of documents containing word t, and N is the total number of documents. The weight of word t in a document is computed as
$\mathrm{tfidf}_t = \mathrm{tf}(t, d) \cdot \mathrm{idf}_t$;
The CBOW training model is used, and CBOW is expressed as
$p(w_t \mid \tau(w_{t-k}, w_{t-k+1}, \ldots, w_{t+k} \mid w_t))$;
where $w_t$ is a word in the dictionary, the probability that $w_t$ occurs is predicted from the words within a window of size k on either side of $w_t$, and τ is the operator that sums the vectors of the neighboring words in the window.
The tendency analysis method for financial news articles provided by the invention identifies company names using a dictionary of company short names combined with Baidu Baike queries, an approach that works well and is easy to extend. It extracts the key sentence group by combining several sentence features with doc2vec-based text similarity matching, which achieves high precision and recall. The resulting text tendency judgment is accurate, so the method meets the needs of practical applications.
Brief description of the drawings
Fig. 1 is a diagram of the LSTM gates.
Detailed description of the embodiments
To make the purpose, technical solution and advantages of the invention clearer, the invention is further described below with reference to the accompanying drawing and specific embodiments. The specific embodiments described here serve only to explain the invention and do not limit it. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative work fall within the scope of protection of the invention.
A tendency analysis method for financial news articles comprises the following steps: identifying company names, extracting a key sentence group, and classifying the key sentence group with an LSTM model.
The step of identifying company names comprises:
(1) decomposing the news text to be processed into a set of N-grams that serve as candidate company names;
(2) adding 1 to the score of each N-gram that appears in a sentence containing a six-digit company code and precedes that code;
(3) matching each N-gram in turn against the basic dictionary by similarity and updating its score;
(4) running a Baidu search and a Baidu Baike query for each candidate company name and updating its score; N-grams whose score exceeds a threshold are taken as company names.
This embodiment identifies company names using a dictionary of company short names combined with Baidu Baike queries: the dictionary maps company short names to company codes, and a Baidu Baike query factor is added. The method is easy to understand and implement, is highly extensible, and recognizes new company names well. First, the N-gram set of each text to be processed is extracted as candidate company names. Each N-gram is then scored by combining its similarity to the basic dictionary, whether it appears in a sentence containing a six-digit company code, and the results of a Baidu Baike query and a Baidu search. Finally, the N-grams in the set whose score exceeds the threshold are taken as company names.
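Steps (1) and (2) can be sketched as follows. This is only a minimal illustration under stated assumptions: the patent does not give the N-gram lengths, so the range 2 to 4 is assumed, and the function names `char_ngrams` and `candidate_scores` and the sample sentences are hypothetical.

```python
import re
from collections import defaultdict

# A six-digit company code: exactly six digits not embedded in a longer number.
SIX_DIGIT_CODE = re.compile(r"(?<!\d)\d{6}(?!\d)")


def char_ngrams(text, n_min=2, n_max=4):
    """Return all contiguous character n-grams of length n_min..n_max (assumed range)."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def candidate_scores(sentences, n_min=2, n_max=4):
    """Step (1): every n-gram becomes a candidate with score 0.
    Step (2): n-grams located before a six-digit code in the same sentence get +1."""
    scores = defaultdict(float)
    for sentence in sentences:
        for gram in char_ngrams(sentence, n_min, n_max):
            scores[gram] += 0.0  # register the candidate with its initial score
        for match in SIX_DIGIT_CODE.finditer(sentence):
            prefix = sentence[:match.start()]
            for gram in set(char_ngrams(prefix, n_min, n_max)):
                scores[gram] += 1.0
    return dict(scores)


# Illustrative sentences (not taken from the patent).
scores = candidate_scores(["深圳能源(000027)上涨", "市场平稳"])
```

In this sketch an N-gram is rewarded once per code occurrence; steps (3) and (4) would then add the dictionary similarity and the search-based scores on top of these values.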
In this embodiment, company codes and company short names are obtained from the three major domestic stock exchanges to create the basic dictionary, in which the two are mapped to each other. For example, '000027' and 'Shenzhen energy' in the basic dictionary both represent Shenzhen Energy Group Co., Ltd.
Specifically, the scores of the N-gram set are first initialized, and each N-gram in the set is matched against the basic company name dictionary created above by similarity to obtain a set of candidate company names. The similarity between an N-gram X and a company name Y is computed by formula (1):
[formula (1) not preserved in the source text]
where α and β are weights, count is the number of characters that belong to both X and Y, start indicates that N-gram X begins with the beginning of company name Y, and end indicates that N-gram X ends with the ending of company name Y. After learning, the best results are obtained when α and β are set to 0.4 and 1 respectively.
A Baidu search and a Baidu Baike query are run for each name in the candidate company name set to update its score. If "stock code", "company", "group" or "enterprise" appears in a Baidu search result, that result counts as an effective query. If, in a single Baidu Baike query result, the title is not empty, or "stock code", "company", "group" or "enterprise" appears in the summary and basic information, that query counts as an effective query. Table 1 and Table 2 show the results of querying the keyword "Baidu" with Baidu Baike and Baidu search respectively.
Table 1: Baidu Baike query result
Table 2: Baidu search results
The two tables show that, judged only by the 10 results returned by the Baidu search in Table 2, just 2 search results confirm that "Baidu" is a company; combined with the Baidu Baike query in Table 1, it becomes very likely that "Baidu" is a company. The candidate company name score is therefore updated by combining the Baidu Baike query and the Baidu search; the internet query update score is
$\mathrm{Search}(X) = \eta \cdot \mathrm{count}(X \in \mathrm{search\_list}) + \gamma \cdot \mathrm{baike\_query}(X)$ (2)
where η is the Baidu search weight, count is the number of effective Baidu search results, γ is the Baidu Baike query weight, and baike_query is the value returned by the Baidu Baike query. After learning from data, the best result is obtained when the weight parameters η and γ are set to 0.2 and 1.3 respectively.
The company name identification score is computed as
$\mathrm{Name} = \lambda \cdot \mathrm{Sim} + \mu \cdot \mathrm{Search}$ (3),
where Name is the final score of the N-gram, λ and μ are weights, Sim is the similarity between the N-gram and the company name dictionary, and Search is the internet query update score of the N-gram. After learning, the best result is obtained when λ and μ are set to 1 and 1.12 respectively.
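Formulas (2) and (3) can be illustrated with a short sketch that uses the weights reported above. Several points are assumptions rather than part of the patent: the search results are passed in as plain text snippets instead of being fetched from Baidu, the Baidu Baike return value is taken to be 1 for an effective query and 0 otherwise, the keywords are given in English translation, and the function names are hypothetical. The similarity Sim is an input because formula (1) is not available.

```python
ETA, GAMMA = 0.2, 1.3    # weights of formula (2) as reported in the text
LAMBDA, MU = 1.0, 1.12   # weights of formula (3) as reported in the text

# English renderings of the keywords; a real system would match the Chinese terms.
KEYWORDS = ("stock code", "company", "group", "enterprise")


def is_effective(snippet):
    """A search result counts as effective if any keyword appears in it."""
    return any(keyword in snippet for keyword in KEYWORDS)


def search_score(search_snippets, baike_value, eta=ETA, gamma=GAMMA):
    """Formula (2): eta * (number of effective search results) + gamma * baike value."""
    effective = sum(1 for snippet in search_snippets if is_effective(snippet))
    return eta * effective + gamma * baike_value


def name_score(sim, search, lam=LAMBDA, mu=MU):
    """Formula (3): final N-gram score from dictionary similarity and search score."""
    return lam * sim + mu * search


# Illustrative inputs (not taken from the patent's tables).
snippets = ["Baidu company profile", "weather today", "Baidu stock code"]
search = search_score(snippets, baike_value=1.0)
final = name_score(sim=0.5, search=search)
```

With two effective snippets and an effective Baike query, the search score is 0.2 · 2 + 1.3 · 1 and the final score adds the similarity weighted by λ.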
The step of extracting the key sentence group comprises:
(1) adding the news headline to the key sentence group;
(2) computing the similarity between each sentence and the news headline with the trained doc2vec model and updating the sentence score;
(3) updating the score of each candidate sentence according to its position with formula (4); judging whether the sentence contains domain-word information (recorded as 1 if it does, 0 otherwise) and whether it contains a company name or company code (recorded as 1 if it does, 0 otherwise); and updating each sentence score again;
(4) sorting the sentences in descending order of score and taking the sentences whose score exceeds the threshold Φ as the key sentence group of the news text; if no sentence in the candidate key sentence group scores above Φ, adding the highest-scoring sentence to the key sentence group.
The news headline carries the key information of the text. The key sentences of a news item usually occur at the beginning or end of the text, so sentences at the beginning and end are given higher weight.
Doc2vec is a deep learning model based on word2vec; it represents sentences as real-valued vectors that can be used to compute similarity between sentences. This embodiment uses a key sentence group extraction method that combines several sentence features with doc2vec-based text similarity matching: the news headline is first added to the key sentence group, and the doc2vec model computes the similarity between each sentence in the text and the headline; the sentence scores are then updated by also considering the position of the sentence in the news text, whether it contains a company name or six-digit company code, and whether it contains domain-word information. Sentences whose score exceeds the threshold Φ form the news key sentence group; if no sentence scores above the threshold, the highest-scoring sentence is added to the key sentence group. The position scoring formula (4) is
[formula (4) not preserved in the source text]
where $S_i$ is the i-th sentence in the text, abs takes the absolute value, and n is the total number of sentences in the text. With this mechanism, sentences at the beginning and end of the text receive higher scores, which matches the tendency of news text to concentrate its key content at the beginning or end. The score of a sentence is
$\mathrm{Score}(S_i) = \sum_j W_j \cdot \mathrm{Score}_j(S_i), \quad i = 1, 2, 3, \ldots, n$ (5);
where $\mathrm{Score}(S_i)$ is the final score of sentence $S_i$, $S_i$ is the i-th sentence in a news text, and j ranges over the set of sentence scoring features, which comprises the sentence position (position), whether the sentence contains a company name (name), whether it contains a domain word (field), and the similarity between the sentence and the news headline (similarity). $W_j$ is the score weight of feature j, and $\mathrm{Score}_j(S_i)$ is the score of sentence $S_i$ on feature j.
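The weighted sum of formula (5) and the threshold rule of step (4) can be sketched as below. The per-feature scores are supplied as inputs, since the position formula (4) is not available and the doc2vec similarity would come from a trained model; the weights, threshold and function names here are hypothetical values chosen only for illustration.

```python
def sentence_score(features, weights):
    """Formula (5): weighted sum of the per-feature scores of one sentence."""
    return sum(weights[name] * features[name] for name in weights)


def select_key_sentences(title, sentences, feature_rows, weights, threshold):
    """Return the key sentence group: the headline plus every sentence scoring
    above the threshold, or the single best sentence if none qualifies."""
    scored = [(sentence_score(features, weights), sentence)
              for sentence, features in zip(sentences, feature_rows)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # descending by score
    selected = [sentence for score, sentence in scored if score > threshold]
    if not selected and scored:
        selected = [scored[0][1]]
    return [title] + selected


# Illustrative feature values; all weights are set to 1.0 as an assumption.
weights = {"position": 1.0, "name": 1.0, "field": 1.0, "similarity": 1.0}
sentences = ["a", "b", "c"]
feature_rows = [
    {"position": 0.9, "name": 1, "field": 0, "similarity": 0.5},
    {"position": 0.1, "name": 0, "field": 0, "similarity": 0.2},
    {"position": 0.8, "name": 0, "field": 1, "similarity": 0.1},
]
group = select_key_sentences("T", sentences, feature_rows, weights, threshold=1.5)
fallback = select_key_sentences("T", sentences, feature_rows, weights, threshold=5.0)
```

The fallback branch mirrors the rule that the highest-scoring sentence is kept when no sentence exceeds Φ, so the key sentence group never consists of the headline alone unless the text has no sentences.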
The result of key sentence group extraction is critical to the accuracy of news text tendency analysis, since the quality of the extraction directly affects the text classification. The method of this embodiment achieves a good extraction result.
The step of classifying the key sentence group with the LSTM model comprises:
(1) training the LSTM model on the labeled corpus until the parameter requirements are met;
(2) segmenting the key sentence group obtained in the second part into words and removing stop words;
(3) training on the sentences with Word2vec and TFIDF to obtain sentence vectors;
(4) classifying the tendency of the sentence vectors with the trained LSTM model;
(5) using the tendency judgment mechanism to analyze the numbers of positive and negative sentences in the key sentence group of a news text, obtaining the tendency of that news text.
Tendency analysis of a news text can be converted into judging the overall tendency of its key sentence group. The tendency judgment mechanism is as follows: the trained LSTM model judges the tendency of each key sentence separately. If the number of positive key sentences is greater than the number of negative key sentences, the news text is considered positive; if the number of negative key sentences is greater than the number of positive ones, the news text is considered negative; if the two numbers are equal, the tendency of the news text depends on the tendency of the news headline. When analyzing the tendency of key sentences, segmenting the sentences with jieba and removing stop words improves both the classification result and efficiency.
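The tendency judgment mechanism amounts to a majority vote with the headline as tie-breaker. A minimal sketch follows, assuming the per-sentence labels have already been produced by the trained classifier; the label strings and the function name `article_tendency` are hypothetical.

```python
def article_tendency(sentence_labels, title_label):
    """Majority vote over key-sentence labels; ties fall back to the headline label."""
    positive = sum(1 for label in sentence_labels if label == "positive")
    negative = sum(1 for label in sentence_labels if label == "negative")
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return title_label


# Illustrative labels for the key sentences of one news text.
result = article_tendency(["positive", "negative", "positive"], title_label="negative")
tie = article_tendency(["positive", "negative"], title_label="negative")
```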
The LSTM network model can learn long-term dependency information. Closed loops can exist between the hidden layers of the model; the weights between hidden layers control the memory of the LSTM network and are responsible for scheduling that memory, and the model uses the current memory state of the hidden layer as part of the input at the next time step. The model embeds the input layer and hidden layer of a traditional RNN in a memory cell and manages the cell state through gates. The LSTM gates are shown in Fig. 1, where $f_t$, $i_t$ and $o_t$ are the forget gate, input gate and output gate respectively.
$x_t$ is the input data of the LSTM cell at time t, $h_t$ is the output, and C is the value of the memory cell at different times. The forget gate $f_t$ determines how much information passes through. This gate takes $x_t$ and the previous output $h_{t-1}$ as input and outputs a value between 0 and 1 that describes how much of each component passes: 0 means discard completely and 1 means let everything through. $f_t$ is computed as
$f_t = \sigma(w_f[h_{t-1}, x_t] + b_f)$ (6);
where σ is the sigmoid function, $w_f$ is the forget gate weight and $b_f$ is the forget gate bias. The sigmoid layer known as the "input gate layer" determines which values are updated.
$i_t$ is the update value; it controls the influence of the current input data on the memory cell state. A tanh layer generates a new candidate value vector $\tilde{C}_t$, which is added to the state. The update formulas for $i_t$ and $\tilde{C}_t$ are respectively
$i_t = \sigma(w_i[h_{t-1}, x_t] + b_i)$ (7),
$\tilde{C}_t = \tanh(w_c[h_{t-1}, x_t] + b_c)$ (8);
where σ is the sigmoid function, $w_i$ is the update gate weight, $b_i$ is the update gate bias, tanh is the hyperbolic tangent function, $w_c$ is the weight for the updated candidate value, $b_c$ is the bias for the updated candidate value, and $\tilde{C}_t$ is the candidate value.
Next, the state of the old cell is updated from $C_{t-1}$ to $C_t$: the old state $C_{t-1}$ is multiplied by $f_t$ to discard the information to be masked, and $i_t \cdot \tilde{C}_t$ is added. The update formula for $C_t$ is
$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t$ (9).
A sigmoid layer decides which part of the current state is output. The state is passed through tanh to obtain a value between -1 and 1, which is multiplied by the sigmoid output $O_t$ to give the output at this time step. The update formulas for $O_t$ and $h_t$ are respectively
$O_t = \sigma(w_o[h_{t-1}, x_t] + b_o)$ (10);
$h_t = O_t \cdot \tanh(C_t)$ (11);
where $w_o$ is the weight for updating the output value, $b_o$ is the bias for updating the output value, and $h_t$ is the final output value.
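One time step of the gate equations (6) to (11) can be traced with a small pure-Python sketch. It assumes the standard LSTM candidate and cell-state updates for the two formulas whose images are missing, uses plain lists instead of a tensor library, and introduces hypothetical names (`affine`, `lstm_step`, the parameter dictionary keys). It illustrates the arithmetic only and is not a training implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(w, z, b):
    """Compute w @ z + b for a weight matrix given as a list of rows."""
    return [sum(w_rc * z_c for w_rc, z_c in zip(row, z)) + b_r
            for row, b_r in zip(w, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p holds the weights w_* and biases b_* of each gate."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(v) for v in affine(p["w_f"], z, p["b_f"])]        # forget gate (6)
    i_t = [sigmoid(v) for v in affine(p["w_i"], z, p["b_i"])]        # input gate (7)
    c_tilde = [math.tanh(v) for v in affine(p["w_c"], z, p["b_c"])]  # candidate (8)
    c_t = [f * c + i * g
           for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]         # cell state (9)
    o_t = [sigmoid(v) for v in affine(p["w_o"], z, p["b_o"])]        # output gate (10)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]               # output (11)
    return h_t, c_t


# Hidden size 2, input size 1, all parameters zero: every gate evaluates to 0.5.
zeros_w = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
zeros_b = [0.0, 0.0]
params = {"w_f": zeros_w, "b_f": zeros_b, "w_i": zeros_w, "b_i": zeros_b,
          "w_c": zeros_w, "b_c": zeros_b, "w_o": zeros_w, "b_o": zeros_b}
h_t, c_t = lstm_step([1.0], [0.0, 0.0], [1.0, -2.0], params)
```

With zero parameters the forget gate halves the previous cell state and the candidate contributes nothing, which makes the role of each term in formula (9) easy to check by hand.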
Word2vec represents text with a distributed method. This representation solves the high-dimensional sparsity problem of the traditional vector space model and also supplies semantic representation that conventional models lack, which gives it a clear advantage in classifying short text. TFIDF is a word frequency statistics method for measuring the importance of a character or word in a class of text; introducing it compensates for the fact that Word2vec cannot distinguish the importance of words in a text. Combining Word2vec and TFIDF makes the text vector representation more accurate.
TFIDF is a statistical method whose main idea is: if a character or word occurs frequently in one class of text and rarely in other texts, it is considered to discriminate well between classes. TFIDF is TF × IDF, where TF is the probability of word t in document d and IDF measures how well word t discriminates between classes: the fewer documents contain word t, the larger its IDF value. The formulas for TF and IDF are respectively
[formulas (12) and (13) not preserved in the source text]
where f(t, d) is the number of times word t occurs in document d, $df_t$ is the number of documents containing word t, and N is the total number of documents. The weight of word t in a document is computed as
$\mathrm{tfidf}_t = \mathrm{tf}(t, d) \cdot \mathrm{idf}_t$ (14).
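Because the exact TF and IDF formulas are not available in this text, the sketch below falls back on common textbook definitions as an assumption: TF is the count of t in d divided by the length of d, and IDF is log(N / df_t). The patent's own formulas may differ, for example in smoothing. Documents are represented as lists of already segmented words, and the function names are hypothetical.

```python
import math
from collections import Counter


def tf(term, doc):
    """Assumed TF: relative frequency of term in the document (a list of words)."""
    if not doc:
        return 0.0
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    """Assumed IDF: log(N / df_t); returns 0.0 if no document contains the term."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0


def tfidf(term, doc, docs):
    """Formula (14): product of the TF and IDF values."""
    return tf(term, doc) * idf(term, docs)


# Illustrative corpus of three segmented documents.
docs = [["profit", "rises", "profit"], ["loss", "rises"], ["market", "flat"]]
weight = tfidf("profit", docs[0], docs)
```

A word that appears in fewer documents ("profit", one document) receives a larger IDF than one that appears in more ("rises", two documents), which is the behavior the paragraph above describes.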
Word2vec is a deep neural network probabilistic model for computing word vectors. Compared with traditional methods, it can make full use of the semantic information of the context. Word2vec has two training models, CBOW and skip-gram. This embodiment uses the CBOW training model, and CBOW is expressed as
$p(w_t \mid \tau(w_{t-k}, w_{t-k+1}, \ldots, w_{t+k} \mid w_t))$ (15);
where $w_t$ is a word in the dictionary, the probability that $w_t$ occurs is predicted from the words within a window of size k on either side of $w_t$, and τ is the operator that sums the vectors of the neighboring words in the window. Combining Word2vec and TFIDF, the word vector of a word t in a document is expressed as
$V(t) = \mathrm{word2vec}(t) \cdot \mathrm{tfidf}(t)$ (16);
where V(t) is the word vector after weighting by the two models, word2vec(t) is the word vector of t trained by the word2vec model, and tfidf(t) is the weight of t in the document obtained from the TFIDF model. The sentence vector is obtained by summing the word vectors, computed with formula (16), of the words in the sentence.
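The sentence vector construction can be sketched as a TFIDF-weighted sum of word vectors. The word vectors and weights are passed in as plain dictionaries here, standing in for the outputs of trained Word2vec and TFIDF models; the function name, the handling of unknown words (skipped) and the sample values are assumptions for illustration.

```python
def sentence_vector(words, word_vectors, tfidf_weights):
    """Sum V(t) = word2vec(t) * tfidf(t) over the words of a sentence.
    Assumes word_vectors is non-empty; words without a vector are skipped."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in words:
        if word in word_vectors:
            weight = tfidf_weights.get(word, 0.0)
            for k, component in enumerate(word_vectors[word]):
                total[k] += weight * component
    return total


# Illustrative two-dimensional vectors and weights.
word_vectors = {"profit": [1.0, 0.0], "rises": [0.0, 2.0]}
tfidf_weights = {"profit": 0.5, "rises": 0.25}
vec = sentence_vector(["profit", "rises", "unknown"], word_vectors, tfidf_weights)
```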
The tendency analysis method for financial news articles provided by the invention identifies company names using a dictionary of company short names combined with Baidu Baike queries, an approach that works well and is easy to extend. It extracts the key sentence group by combining several sentence features with doc2vec-based text similarity matching, which achieves high precision and recall. The resulting text tendency judgment is accurate, so the method meets the needs of practical applications.
The embodiments above express only some implementations of the invention. Although their description is specific and detailed, it should not be construed as limiting the scope of the patent. A person of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the scope of protection of the invention. The scope of protection of the patent is therefore defined by the appended claims.
Claims (10)
1. A tendency analysis method for financial news articles, characterized by comprising: identifying company names, extracting a key sentence group, and classifying the key sentence group with an LSTM model.
2. The tendency analysis method for financial news articles according to claim 1, characterized in that the step of identifying company names comprises:
Step (1): decomposing the news text to be processed into a set of N-grams that serve as candidate company names;
Step (2): adding 1 to the score of each N-gram that appears in a sentence containing a six-digit company code and precedes that code;
Step (3): matching each N-gram in turn against the basic dictionary by similarity and updating its score;
Step (4): running a Baidu search and a Baidu Baike query for each candidate company name and updating its score; N-grams whose score exceeds a threshold are taken as company names.
3. The tendency analysis method for financial news articles according to claim 1, characterized in that step (1) is specifically: first initializing the scores of the N-gram set, then matching each N-gram in the set against the basic company name dictionary created above by similarity to obtain a set of candidate company names; the similarity between an N-gram X and a company name Y is computed by
[formula not preserved in the source text]
where α and β are weights, count is the number of characters that belong to both X and Y, start indicates that N-gram X begins with the beginning of company name Y, and end indicates that N-gram X ends with the ending of company name Y.
4. The money article proneness analysis method according to claim 1, characterized in that step (4) is specifically: run a Baidu search and a Baidu Baike query for each name in the candidate company name set and update the set's scores. If "stock code", "company", "group", or "enterprise" appears in a Baidu search result, that result counts as one effective query. If, in a single Baidu Baike query result, the title is not empty, or "stock code", "company", "group", or "enterprise" appears in the summary and basic information, that query counts as an effective query. The candidate company name score is updated by combining the Baidu Baike query with the Baidu search; the internet-check update score is

$$\mathrm{Search}(X) = \eta \cdot \mathrm{count}(X \in \mathrm{search\_list}) + \gamma \cdot \mathrm{baike\_query}(X);$$

where η is the Baidu search weight, count is the number of effective Baidu search results, γ is the Baidu Baike query weight, and baike_query is the Baidu Baike return value.
The company name identification formula is

$$\mathrm{Name} = \lambda \cdot \mathrm{Sim} + \mu \cdot \mathrm{Search};$$

where Name is the final score of the N-tuple, λ and μ are weights, Sim is the similarity between the N-tuple and the company name dictionary, and Search is the internet search update result for the N-tuple.
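The two scoring formulas in this claim can be illustrated with a short Python sketch. It performs no real Baidu or Baidu Baike requests: the effective-hit count and the Baike return value are passed in as plain numbers, and the function names, the English keyword tuple, and any weight values are hypothetical.

```python
# Assumption: English stand-ins for the keywords listed in the claim.
KEYWORDS = ("stock code", "company", "group", "enterprise")


def is_effective_result(text):
    """Return True if a search result snippet contains any listed keyword."""
    return any(keyword in text for keyword in KEYWORDS)


def search_score(effective_hits, baike_value, eta, gamma):
    """Search(X) = eta * count + gamma * baike_query(X)."""
    return eta * effective_hits + gamma * baike_value


def name_score(sim, search, lam, mu):
    """Name = lambda * Sim + mu * Search."""
    return lam * sim + mu * search


def is_company_name(sim, search, lam, mu, threshold):
    """An N-tuple is accepted when its final score exceeds the threshold."""
    return name_score(sim, search, lam, mu) > threshold
```

A caller would count the snippets for which `is_effective_result` is true, pass that count to `search_score`, and then combine the result with the dictionary similarity through `name_score`.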
5. The money article proneness analysis method according to claims 1-4, characterized in that the steps for extracting the critical sentence group include:
(1) add the news headline to the critical sentence group;
(2) compute the similarity between each sentence and the news headline and update the sentence score;
(3) update each candidate sentence's score with its position information; check whether the sentence contains domain words, recording 1 if it does and 0 otherwise; check whether the sentence contains a company name or company code, recording 1 if it does and 0 otherwise; then update each sentence's score again;
(4) sort the sentences in descending order of score and take the sentences whose score exceeds the threshold as the critical sentence group of the news text; if no sentence in the candidate critical sentence group scores above the threshold, add the highest-scoring sentence to the critical sentence group.
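The selection logic of steps (1) and (4) can be sketched as below, assuming the per-sentence scores have already been computed. The function name `select_key_sentences` and the `(sentence, score)` pair layout are illustrative choices, not part of the claim.

```python
def select_key_sentences(headline, scored_sentences, threshold):
    """Build the critical sentence group from (sentence, score) pairs.

    The headline always comes first. Sentences scoring above `threshold`
    follow in descending score order. If none qualifies, the single
    highest-scoring sentence is used instead.
    """
    ranked = sorted(scored_sentences, key=lambda pair: pair[1], reverse=True)
    selected = [sentence for sentence, score in ranked if score > threshold]
    if not selected and ranked:
        selected = [ranked[0][0]]
    return [headline] + selected
```

The fallback branch guarantees that every news text contributes at least one body sentence to the later classification step, provided the text has any sentences at all.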
6. The money article proneness analysis method according to claims 1-5, characterized in that, when the score is updated with the candidate sentence position information, a sentence position scoring formula is used (the formula itself is missing from this text) in which $S_i$ is the i-th sentence in the text, abs takes the absolute value, and n is the total number of sentences in the text.
The sentence score is

$$\mathrm{Score}(S_i) = \sum_j W_j \cdot \mathrm{Score}_j(S_i), \quad i = 1, 2, 3, \ldots, n;$$

where $\mathrm{Score}(S_i)$ is the final score of sentence $S_i$, $S_i$ is the i-th sentence in a news text, j ranges over the sentence scoring feature set, $W_j$ is the score weight of feature j, and $\mathrm{Score}_j(S_i)$ is the score of sentence $S_i$ on feature j.
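The weighted sum over scoring features can be written as a small Python sketch. The feature names used in the usage note (`title_sim`, `position`, `domain_word`, `company`) and all weight values are hypothetical, and the position score is taken as an input because its formula is not available in this text.

```python
def contains_any(sentence, terms):
    """Indicator feature: 1 if the sentence contains any term, else 0."""
    return 1 if any(term in sentence for term in terms) else 0


def sentence_score(feature_scores, weights):
    """Score(S_i) = sum over features j of W_j * Score_j(S_i).

    Both arguments are dicts keyed by feature name; every feature named
    in `weights` must be present in `feature_scores`.
    """
    return sum(weights[name] * feature_scores[name] for name in weights)
```

For instance, weights `{"title_sim": 0.4, "position": 0.2, "domain_word": 0.2, "company": 0.2}` applied to a sentence with title similarity 0.5, position score 1.0, a domain word present, and no company mention would combine the four feature scores into a single value.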
7. The money article proneness analysis method according to claims 1-6, characterized in that the steps for classifying the critical sentence group with the LSTM model include:
(1) train the LSTM model on the labeled corpus until the parameter requirements are met;
(2) segment the critical sentence group obtained in the second part into words and remove stop words;
(3) train on the sentences with Word2vec and TFIDF to obtain sentence vectors;
(4) classify the tendency of the sentence vectors with the trained LSTM model;
(5) use the tendency judgment mechanism to analyze the numbers of positive and negative sentences in the critical sentence group of a news text, obtaining the tendency of that news text.
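The claim does not spell out the tendency judgment mechanism of step (5). One simple reading, shown here purely as an assumption, is a majority vote over the per-sentence labels, with ties reported as neutral; the label strings and the function name are hypothetical.

```python
from collections import Counter


def article_tendency(sentence_labels):
    """Majority vote over per-sentence labels (assumed mechanism).

    Counts "positive" and "negative" labels in the critical sentence
    group and returns the more frequent one; a tie yields "neutral".
    """
    counts = Counter(sentence_labels)
    positive, negative = counts["positive"], counts["negative"]
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"
```

Other mechanisms, such as weighting the headline more heavily than body sentences, would fit the same interface.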
8. The money article proneness analysis method according to claims 1-7, characterized in that, in the LSTM model, $f_t$ is computed as

$$f_t = \sigma(w_f [h_{t-1}, x_t] + b_f),$$

where σ is the sigmoid function, which determines the values to be updated, $w_f$ is the forget gate weight, and $b_f$ is the forget gate bias.
$i_t$ is the update value, which controls the influence of the current input data on the memory cell state; the tanh layer generates a new candidate value vector $\tilde{C}_t$, which is added to the state. The update formulas for $i_t$ and $\tilde{C}_t$ are respectively

$$i_t = \sigma(w_i [h_{t-1}, x_t] + b_i); \qquad \tilde{C}_t = \tanh(w_c [h_{t-1}, x_t] + b_c);$$

where $w_i$ is the update gate weight, $b_i$ is the update gate bias, tanh is the hyperbolic tangent function, $w_c$ is the weight for updating the candidate value, $b_c$ is the bias for updating the candidate value, and $\tilde{C}_t$ is the candidate value.
The update formula for $C_t$ is

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t.$$

The sigmoid layer decides which part of the current state is output; the state is passed through tanh to obtain a value in the interval from -1 to 1, and this value is multiplied by the sigmoid output $O_t$ to give the output at this time step. The update formulas for $O_t$ and $h_t$ are respectively

$$O_t = \sigma(w_o [h_{t-1}, x_t] + b_o); \qquad h_t = O_t * \tanh(C_t);$$

where $w_o$ is the weight for updating the output value, $b_o$ is the bias for updating the output value, and $h_t$ is the final output value.
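A single LSTM time step following these gate equations can be sketched in plain Python. The cell-state update is written in its standard LSTM form, C_t = f_t * C_{t-1} + i_t * C̃_t, which is an assumption here, as are the `params` dictionary layout and the helper names.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vec):
    """Return W @ vec + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """Run one LSTM step and return (h_t, c_t) as lists of floats."""
    concat = h_prev + x_t  # the concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(params["w_f"], params["b_f"], concat)]
    i_t = [sigmoid(z) for z in affine(params["w_i"], params["b_i"], concat)]
    c_hat = [math.tanh(z) for z in affine(params["w_c"], params["b_c"], concat)]
    o_t = [sigmoid(z) for z in affine(params["w_o"], params["b_o"], concat)]
    # Standard cell-state update (assumed): forget old state, add candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

Each weight matrix has one row per hidden unit and one column per entry of the concatenated vector, so a hidden size of 1 with a one-dimensional input uses matrices of shape 1 x 2.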
9. The money article proneness analysis method according to claims 1-7, characterized in that, combining Word2vec and TFIDF, the word vector of a word t in a document is expressed as

$$V(t) = \mathrm{word2vec}(t) \cdot \mathrm{tfidf}(t);$$

where V(t) is the word vector representation after weighting by the two models, word2vec(t) is the word vector of t trained by the Word2vec model, and tfidf(t) is the weight of t in the document obtained from the TFIDF model.
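The weighting V(t) = word2vec(t) * tfidf(t) amounts to scaling each embedding by a scalar weight, as the sketch below shows. The embeddings and weights are passed in as plain dicts rather than produced by a trained model, and summing the weighted word vectors into a sentence vector is an assumption, since the claims do not state how word vectors are aggregated.

```python
def weighted_word_vector(embedding, tfidf_weight):
    """V(t): the Word2vec embedding of t scaled by its TF-IDF weight."""
    return [tfidf_weight * value for value in embedding]


def sentence_vector(tokens, embeddings, tfidf):
    """Sum the weighted vectors of all tokens that have an embedding.

    Returns an empty list when no token is known. Summation is an
    assumed aggregation, not taken from the claims.
    """
    total = []
    for token in tokens:
        if token not in embeddings:
            continue
        vec = weighted_word_vector(embeddings[token], tfidf.get(token, 0.0))
        total = vec if not total else [a + b for a, b in zip(total, vec)]
    return total
```

Averaging instead of summing would make the sentence vector independent of sentence length; either choice works with the same per-word weighting.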
10. The money article proneness analysis method according to claims 1-7, characterized in that, in the calculation formulas for TF and IDF (the formulas themselves are missing from this text), f(t, d) is the number of times the word t occurs in document d, $df_t$ is the number of documents that contain the word t, and N is the total number of documents. The weight of the word t in a document is computed as

$$\mathrm{tfidf}_t = \mathrm{tf}(t, d) \cdot \mathrm{idf}_t;$$

The CBOW training model is used, and CBOW is expressed as

$$p(w_t \mid \tau(w_{t-k}, w_{t-k+1}, \ldots, w_{t+k} \mid w_t));$$

where $w_t$ is a word in the dictionary, the words within a window of size k on either side of $w_t$ are used to predict the probability that $w_t$ occurs, and τ denotes the operator that sums the vectors of the adjacent words in the window.
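A small Python sketch can make the TF-IDF weight and the τ operator concrete. Because the TF and IDF formulas are not reproduced in this text, the sketch assumes two common forms, tf(t, d) = f(t, d) and idf_t = log(N / df_t); these forms and the function names are assumptions, not the patent's definitions.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Assumed form: raw count f(t, d) of the term in the document."""
    return Counter(doc_tokens)[term]


def idf(term, corpus):
    """Assumed form: log(N / df_t); 0.0 if the term is in no document."""
    n_docs = len(corpus)
    df_t = sum(1 for doc in corpus if term in doc)
    return math.log(n_docs / df_t) if df_t else 0.0


def tfidf(term, doc_tokens, corpus):
    """tfidf_t = tf(t, d) * idf_t."""
    return tf(term, doc_tokens) * idf(term, corpus)


def context_sum(vectors):
    """The tau operator: element-wise sum of the context word vectors."""
    return [sum(column) for column in zip(*vectors)]
```

In CBOW, the output of `context_sum` over the 2k neighbouring word vectors is what the model uses to predict the centre word.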
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810605916.8A CN108932229A (en) | 2018-06-13 | 2018-06-13 | A kind of money article proneness analysis method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108932229A true CN108932229A (en) | 2018-12-04 |
Family
ID=64446501
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810605916.8A Pending CN108932229A (en) | 2018-06-13 | 2018-06-13 | A kind of money article proneness analysis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108932229A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106933800A (en) * | 2016-11-29 | 2017-07-07 | 首都师范大学 | A kind of event sentence abstracting method of financial field |
CN107908614A (en) * | 2017-10-12 | 2018-04-13 | 北京知道未来信息技术有限公司 | A kind of name entity recognition method based on Bi LSTM |
US20180121788A1 (en) * | 2016-11-03 | 2018-05-03 | Salesforce.Com, Inc. | Deep Neural Network Model for Processing Data Through Mutliple Linguistic Task Hiearchies |
Non-Patent Citations (2)
Title |
---|
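李江龙等 (Li Jianglong et al.): "金融领域的事件句抽取" (Event sentence extraction in the financial domain), 《计算机应用研究》 (Application Research of Computers) * |
胡新辰 (Hu Xinchen): "基于LSTM的语义关系分类研究" (Research on semantic relation classification based on LSTM), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) * |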
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109614490A (en) * | 2018-12-21 | 2019-04-12 | 北京信息科技大学 | Money article proneness analysis method based on LSTM |
CN111159223A (en) * | 2019-12-31 | 2020-05-15 | 武汉大学 | Interactive code searching method and device based on structured embedding |
CN111159223B (en) * | 2019-12-31 | 2021-09-03 | 武汉大学 | Interactive code searching method and device based on structured embedding |
CN111782907A (en) * | 2020-07-01 | 2020-10-16 | 北京知因智慧科技有限公司 | News classification method and device and electronic equipment |
CN111782907B (en) * | 2020-07-01 | 2024-03-01 | 北京知因智慧科技有限公司 | News classification method and device and electronic equipment |
CN112287687A (en) * | 2020-09-17 | 2021-01-29 | 昆明理工大学 | Case tendency extraction type summarization method based on case attribute perception |
CN113064964A (en) * | 2021-03-22 | 2021-07-02 | 广东博智林机器人有限公司 | Text classification method, model training method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112100344B (en) | Knowledge graph-based financial domain knowledge question-answering method | |
CN109858028B (en) | Short text similarity calculation method based on probability model | |
CN106919673B (en) | Text mood analysis system based on deep learning | |
CN108932229A (en) | A kind of money article proneness analysis method | |
Karim et al. | Classification benchmarks for under-resourced bengali language based on multichannel convolutional-lstm network | |
CN106997382B (en) | Innovative creative tag automatic labeling method and system based on big data | |
CN108984526A (en) | A kind of document subject matter vector abstracting method based on deep learning | |
CN109753660B (en) | LSTM-based winning bid web page named entity extraction method | |
CN109726745B (en) | Target-based emotion classification method integrating description knowledge | |
CN107180026B (en) | Event phrase learning method and device based on word embedding semantic mapping | |
CN114254653A (en) | Scientific and technological project text semantic extraction and representation analysis method | |
CN111563143B (en) | Method and device for determining new words | |
CN112069312B (en) | Text classification method based on entity recognition and electronic device | |
CN109614490A (en) | Money article proneness analysis method based on LSTM | |
CN106933800A (en) | A kind of event sentence abstracting method of financial field | |
Wu et al. | Exploring syntactic and semantic features for authorship attribution | |
CN113033183B (en) | Network new word discovery method and system based on statistics and similarity | |
CN110750646B (en) | Attribute description extracting method for hotel comment text | |
CN111444704B (en) | Network safety keyword extraction method based on deep neural network | |
CN114416942A (en) | Automatic question-answering method based on deep learning | |
CN112949713A (en) | Text emotion classification method based on ensemble learning of complex network | |
Chang et al. | A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING. | |
Chen et al. | Sentiment classification of tourism based on rules and LDA topic model | |
Ao et al. | News keywords extraction algorithm based on TextRank and classified TF-IDF | |
CN109033087A (en) | Calculate method, De-weight method, clustering method and the device of text semantic distance |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
AD01 | Patent right deemed abandoned | Effective date of abandoning: 20221206 |