CN110334300A

CN110334300A - Assisted text reading method for public opinion analysis

Info

Publication number: CN110334300A
Application number: CN201910621253.3A
Authority: CN
Inventors: 赵铁军; 徐冰; 杨沐昀; 胡东瑶; 曹海龙; 朱聪慧; 郑德权
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2019-07-10
Filing date: 2019-07-10
Publication date: 2019-10-15

Abstract

The present invention provides an assisted text reading method for public opinion analysis and belongs to the technical field of natural language processing. The method first extracts the main text from web pages of various types using a unified approach; it then performs named entity recognition on the text for entities such as persons, places, and organizations, and highlights them; finally, it automatically summarizes the text, ranks the sentences by importance, and highlights them. The invention addresses the difficulty of extracting key points and the low reading efficiency that public opinion analysts face when reading large volumes of text. It can be used for assisted text reading in public opinion analysis, allowing readers to grasp the gist quickly.

Description

Assisted text reading method for public opinion analysis

Technical field

The present invention relates to an assisted text reading method and belongs to the technical field of natural language processing.

Background technique

Public opinion analysis is a technique that collects public opinion information in real time and analyzes it statistically to help decision makers make sound decisions. The information collected can be roughly divided into structured data (such as social networks and traffic indices) and unstructured data (such as user comments and news text), with unstructured data making up the majority. For a given public opinion event, in addition to statistical data analysis, analysts often need to read a large number of news reports before they can produce a reasonably comprehensive analysis and summary, which easily causes visual fatigue. Advances in natural language processing have made machine-assisted human reading possible. The assisted text reading system for public opinion analysis proposed by the present invention attempts to address the following difficulties:

1. Public opinion news usually comes from web pages, but the page structures and character encodings of different news websites are inconsistent, and page structures change over time, which makes content extraction difficult.

2. Public opinion news and commentaries generally involve important information such as persons, regions, and organizations. Because Chinese lacks the inter-word spaces and capitalized entity initials found in English, it is hard to focus on this information when reading large amounts of text.

3. The volume of public opinion news is large and reading it is inefficient, so the gist of an article is not easy to grasp.

Summary of the invention

To solve the problems of difficult content extraction and low efficiency in existing public opinion analysis technology, the present invention provides an assisted text reading method for public opinion analysis.

The assisted text reading method for public opinion analysis of the present invention is implemented through the following technical solution:

Step 1: extract the main text from web pages of various types;

Step 2: perform named entity recognition on the text for entities such as persons, places, and organizations, and highlight them;

Step 3: automatically summarize the text, then rank the sentences by importance and highlight them.

The most prominent features and significant beneficial effects of the present invention are:

In the assisted text reading method for public opinion analysis according to the present invention, a web page is processed in turn by main-text extraction, named entity recognition, and automatic summarization, after which the rendered original page and the "reading mode" page are refreshed and redisplayed. The method lets readers grasp the key points of an entire article by reading only the highlighted sentences, saving reading time and at least doubling the efficiency of public opinion analysis; readers who want to understand the relationships among the actors in a news event can do so quickly with the help of the highlighted entity words. Because all processing is built on the original text, users can still read the original quickly and conveniently.

Description of the drawings

Fig. 1 is a flowchart of the method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the named entity recognition model based on cascaded HMMs in the present invention;

Fig. 3 is a schematic diagram of the named entity recognition model based on BiLSTM+CRF in the present invention; "human" denotes a person, "space" a place, and "institute" an organization;

Fig. 4 is a schematic diagram of the CBOW principle; w(t) denotes the word at the current position;

Fig. 5 is a schematic diagram of the Skip-Gram principle;

Fig. 6 shows the original web page in the embodiment of the present invention;

Fig. 7 shows the effect of enabling the assisted text reading function on the original web page in the embodiment of the present invention;

Fig. 8 shows the effect of enabling the assisted text reading function on the "reading mode" web page in the embodiment of the present invention.

Specific embodiment

Specific embodiment 1: This embodiment is described with reference to Fig. 1. The assisted text reading method for public opinion analysis provided by this embodiment specifically comprises the following steps:

Step 1: extract the main text from web pages of various types using a unified approach;

In practice, the HTML page structures of major news websites are complex and inconsistent. Viewed from the outside, pages contain distractions such as layout, pictures, special effects, and advertisements; viewed from the underlying code, the tags are intricate and the character encodings are chaotic. The purpose of this step is to extract the effective text paragraphs from the web page, remove the distractions, and restore the page to a clean, concise "reading mode", which also facilitates the subsequent natural language processing.

Step 2: perform named entity recognition on the text for entities such as persons, places, and organizations, and highlight them;

The purpose of this step is to extract important entity elements from unstructured text and present them in a form that matches visual attention. News and commentaries usually revolve around an event, and an event has basic elements such as time, place, and persons (organizations). In English, proper nouns are written with capitalized initials and words are separated by spaces, which makes the boundaries of proper nouns clearer. Chinese has neither of these writing features, so readers facing large amounts of text find it hard to focus on the important sentence elements. Named entity recognition combined with appropriate highlighting can alleviate this problem.

Step 3: automatically summarize the text, then rank the sentences by importance and highlight them;

The purpose of this step is to mine important sentences from unstructured text and present them in visual form on the original text, to help the reader read quickly.

Specific embodiment 2: This embodiment differs from specific embodiment 1 in that, in step 1, the main text is extracted by DOM tree parsing, and the process comprises the following steps:

Step 1.1: obtain the original web page HTML (HyperText Markup Language) and detect its encoding; if the encoding is not UTF-8 (many Chinese web pages are encoded in GB2312), convert it to UTF-8;

Step 1.2: predefine several groups of regular expressions and use them to group the web page tags; the main groups are a "weight-up group" (tags likely to contain the main text, such as body, article, content), a "weight-down group" (tags unlikely to contain the main text, such as footnote, media, meta), and other groups;

Step 1.3: build a DOM tree from the HTML; DOM stands for Document Object Model;

Step 1.4: delete the elements in the DOM that hold non-text content;

Step 1.5: traverse all elements in the DOM; if an element is a <div> tag, recursively traverse all elements nested inside it; by adding or subtracting weights according to the groups, recombine and assemble the content of the page body.
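Steps 1.1 to 1.5 can be sketched in Python with the standard library only. This is a minimal illustration under stated assumptions, not the patented implementation: the regular expressions, the weight of 25, the use of text length as a base score, the GB18030 fallback (a superset of GB2312), and all function and class names (`decode_html`, `TreeBuilder`, `extract_main_text`) are hypothetical choices made for the example.

```python
import re
from html.parser import HTMLParser

# Step 1.2 (assumed patterns): tag groups that raise or lower a block's weight.
POSITIVE = re.compile(r"article|body|content|main|text", re.I)
NEGATIVE = re.compile(r"footnote|footer|media|meta|sidebar|advert", re.I)
SKIPPED = {"script", "style", "noscript", "iframe", "img", "video", "form"}  # Step 1.4
VOID = {"br", "hr", "img", "input", "link", "meta"}  # elements without a closing tag


def decode_html(raw):
    """Step 1.1: decode bytes to text, trying UTF-8 first and GB18030 second."""
    for encoding in ("utf-8", "gb18030"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


class Node:
    def __init__(self, tag, attrs, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.items = []  # child Nodes and text strings, in document order


class TreeBuilder(HTMLParser):
    """Step 1.3: build a minimal DOM-like tree."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.items.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:  # ignore stray closing tags
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.items.append(data.strip())


def own_text(node):
    """Text of a node, excluding nested <div> blocks and non-text elements."""
    parts = []
    for item in node.items:
        if isinstance(item, str):
            parts.append(item)
        elif item.tag != "div" and item.tag not in SKIPPED:
            parts.append(own_text(item))
    return " ".join(p for p in parts if p)


def score(node):
    """Step 1.5: text length plus or minus the group weight of class/id."""
    label = f'{node.attrs.get("class") or ""} {node.attrs.get("id") or ""}'
    weight = 25 * bool(POSITIVE.search(label)) - 25 * bool(NEGATIVE.search(label))
    return weight + len(own_text(node))


def extract_main_text(html):
    builder = TreeBuilder()
    builder.feed(html)
    best, best_score = None, float("-inf")
    stack = [builder.root]
    while stack:  # visit every element, descending into nested <div> blocks
        node = stack.pop()
        if node.tag in SKIPPED:
            continue
        if node.tag == "div" and score(node) > best_score:
            best, best_score = node, score(node)
        stack.extend(i for i in node.items if isinstance(i, Node))
    return own_text(best) if best is not None else own_text(builder.root)
```

Calling `extract_main_text(decode_html(raw_bytes))` returns the text of the highest-scoring `<div>` block. A production extractor would typically also propagate scores to parent blocks and keep paragraph boundaries, which this sketch omits for brevity.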

Other steps and parameters are the same as in specific embodiment 1.

Specific embodiment 3: This embodiment differs from specific embodiment 2 in that the named entity recognition of step 2 is generally treated as a sequence labeling task. Either a traditional machine learning algorithm, which is slightly less accurate but faster, or a deep learning algorithm, which is slightly more accurate but slower, can be used; the choice depends on actual needs. The traditional machine learning algorithms include a named entity recognition model based on cascaded HMMs and a named entity recognition model based on CRF, where HMM denotes hidden Markov model and CRF denotes conditional random field; the deep learning algorithms include a named entity recognition model based on BiLSTM and a named entity recognition model based on BiLSTM+CRF, where BiLSTM denotes bidirectional long short-term memory network.

Other steps and parameters are the same as in specific embodiment 2.

Specific embodiment 4: This embodiment differs from specific embodiment 3 in the following. Although the HMM is the simplest sequence labeling model, it is fast and highly extensible. Because production environments demand fine-grained named entity recognition, multi-layer cascaded HMMs remain one of the widely used methods.

In the cascaded HMMs of the named entity recognition model based on cascaded HMMs:

The 1st-layer HMM performs word segmentation; because of the training corpus, this layer tends to split apart obscure place names (for example county-level or township-level ones) and person names;

The 2nd-layer HMM roughly identifies place names and person names on the basis of the 1st layer; pattern matching is then applied to its output to produce a secondary automatically annotated corpus, which serves as the training corpus for the next layer;

The 3rd-layer HMM is trained on the secondary automatically annotated corpus to identify place names and person names finely;

The 4th-layer HMM identifies organization names. Organization names are somewhat more complex because they usually contain place names (and a few even contain person names); after place names and person names have been identified, pattern matching is applied again for a second round of automatic annotation, and the result is used as the training corpus for identifying organization names.

Therefore, identifying person names and place names fairly finely requires 3 layers of HMMs, and identifying organization names fairly finely requires 4 layers. The training process is shown in Fig. 2 (with the 2nd and 3rd layers merged into one). The labels for character position within an entity in the figure are: B for the beginning character of an entity (Begin), I for a middle character (Inside), E for the ending character (End), S for a single-character entity (Single), and O for a non-entity character (Outside). The labels for entity type in the figure are: Nh for a person-name entity, Ns for a place-name entity, and Ni for an organization-name entity. Position and type labels can be combined: for example, in "Jinan District Procuratorate", "Jinan District" is labeled "B-Ni", marking it as the prefix of an organization name, and "Procuratorate" is labeled "E-Ni", marking it as the suffix of an organization name.
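To make the combined position and type labels concrete, the following sketch turns a tagged token sequence into entity spans. The function name `decode_bioes`, the `sep` parameter, and the sample tokens are illustrative assumptions; only the B/I/E/S/O and Nh/Ns/Ni label scheme comes from the description above.

```python
def decode_bioes(tokens, tags, sep=""):
    """Collect (text, type) entities from BIOES tags such as "B-Ni" / "E-Ni".

    `sep` joins the tokens of one entity: "" suits Chinese, " " suits English.
    """
    entities, buffer, current = [], [], None
    for token, tag in zip(tokens, tags):
        position, _, etype = tag.partition("-")
        if position == "S":                      # single-token entity
            entities.append((token, etype))
            buffer, current = [], None
        elif position == "B":                    # entity starts here
            buffer, current = [token], etype
        elif position in ("I", "E") and buffer and current == etype:
            buffer.append(token)
            if position == "E":                  # entity ends here
                entities.append((sep.join(buffer), etype))
                buffer, current = [], None
        else:                                    # "O" or an inconsistent tag
            buffer, current = [], None
    return entities


tokens = ["Zhang San", "works", "at", "Jinan District", "Procuratorate"]
tags = ["S-Nh", "O", "O", "B-Ni", "E-Ni"]
print(decode_bioes(tokens, tags, sep=" "))
# [('Zhang San', 'Nh'), ('Jinan District Procuratorate', 'Ni')]
```

Partial entities that never reach an E tag are dropped here; a stricter or more lenient policy is equally possible.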

Training with a CRF is similar to the above: once the corresponding feature templates have been set, training can begin.

Other steps and parameters are the same as in specific embodiment 3.

Specific embodiment 5: This embodiment differs from specific embodiment 3 in that the named entity recognition model based on BiLSTM, shown in Fig. 3, mainly comprises the following 3 layers:

The 1st layer is the embedding layer, which converts the characters of a sentence into character vectors; the character vectors can be randomly initialized and then updated during training, or character vectors pre-trained and published online can be used;

The 2nd layer is the BiLSTM layer. The parameters of each LSTM (long short-term memory) cell are first randomly initialized, and the character vectors are then fed into the LSTM cells one time step at a time for recurrent computation; the BiLSTM runs the LSTM once in the forward direction (forward LSTM) and once in the backward direction (backward LSTM); the two results are concatenated, regularized, and sent to the output layer;

The 3rd layer is the output layer, a simple softmax (normalized exponential function) output layer. Note that place-name and person-name labels can be contained within organization-name labels; if this containment matters, two output branches can be used here and trained simultaneously; if it does not, the contained entities can be ignored and a single output used.

Other steps and parameters are the same as in specific embodiment 3 or 4.

Specific embodiment 6: This embodiment differs from specific embodiment 5 in that the named entity recognition model based on BiLSTM+CRF adds a CRF layer between the BiLSTM layer and the output layer of the BiLSTM-based model; the data passed on by the BiLSTM layer goes through the CRF layer, which computes the final labels and passes them to the output layer.
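The description does not say how the CRF layer computes the final labels. A common choice at inference time is Viterbi decoding over per-position scores and label-to-label transition scores, sketched below in plain Python. The function name `viterbi_decode`, the three-label set, and all numeric scores are hypothetical values chosen for illustration.

```python
def viterbi_decode(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][k]   : score of label k at position t (e.g. from the BiLSTM layer)
    transitions[j][k] : score of moving from label j to label k (CRF parameters)
    """
    n_labels = len(labels)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(n_labels):
            best_prev = max(range(n_labels), key=lambda j: score[j] + transitions[j][k])
            new_score.append(score[best_prev] + transitions[best_prev][k] + emissions[t][k])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    best = max(range(n_labels), key=lambda k: score[k])
    path = [best]
    for pointers in reversed(backpointers):   # follow the back-pointers to the start
        path.append(pointers[path[-1]])
    return [labels[k] for k in reversed(path)]


labels = ["O", "B", "E"]
NEG = -1e4  # effectively forbids a transition
transitions = [
    [0.0, 0.0, NEG],   # from O: O and B allowed, E forbidden
    [NEG, NEG, 0.0],   # from B: only E allowed
    [0.0, 0.0, NEG],   # from E: O and B allowed
]
emissions = [
    [1.0, 0.2, 0.0],
    [0.3, 1.0, 0.2],
    [0.9, 0.0, 0.8],   # per-position argmax would pick O, which is invalid after B
]
print(viterbi_decode(emissions, transitions, labels))  # ['O', 'B', 'E']
```

The example shows the benefit the CRF layer is meant to bring: a softmax over each position alone would output O, B, O, whereas the transition scores steer the result to the consistent sequence O, B, E.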

Other steps and parameters are the same as in specific embodiment 5.

Specific embodiment 7: This embodiment differs from specific embodiment 1 in that the highlighting of steps 2 and 3 can be performed either on the original web page or on the "reading mode" web page (the page produced by step 1, which contains only the main text and no distractions such as pictures, special effects, or advertisements); it is presented by adding effects to the corresponding words and sentences in the HTML.

Highlighting of person names and place names:

As noted above, recognizing named entities is harder in Chinese text than in English, mainly because Chinese lacks word separators and capitalized initials. In fact, Chinese punctuation includes a mark called the proper name mark ("_") that was designed to overcome this problem. It was specified in the 1919 "Proposal for the Promulgation of New-Style Punctuation Marks", and the "Use of Punctuation Marks" issued by the State Bureau of Technical Supervision in 1995 recommends its use in ancient books to mark person names, place names, and dynasty names.

The present invention adopts the proper name mark: named entities of the person-name and place-name classes are underlined with the proper name mark and additionally presented in bold and in color.

Highlighting of organization names:

Organization names could also be marked with the proper name mark in the same way as person and place names. However, an organization name may itself contain a place name, so using the proper name mark again would make the underlines run together. In addition, organization names occur in articles far less often than person and place names, and they are important for public opinion analysis, so they are better marked with a more prominent symbol. In this embodiment, organization names are presented with a surrounding frame.

Highlighting of the summary:

The sentences selected by automatic summarization are given a background color in the original text, and each sentence's importance score is mapped to the brightness of the background color to reflect how important the sentence is.
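The three highlighting rules can be illustrated with inline CSS generated in Python. This is a sketch under assumptions: the style strings, the colors, the HSL mapping (a higher score gives a lower lightness), and the names `ENTITY_STYLE`, `highlight_entities`, and `background_for` are all hypothetical, and entity spans are assumed not to overlap.

```python
from html import escape

# Assumed styles: underline + bold + color for person (Nh) and place (Ns) names,
# a surrounding frame for organization names (Ni).
ENTITY_STYLE = {
    "Nh": "text-decoration: underline; font-weight: bold; color: #c0392b;",
    "Ns": "text-decoration: underline; font-weight: bold; color: #2471a3;",
    "Ni": "border: 1px solid #333; padding: 0 2px;",
}


def highlight_entities(sentence, entities):
    """Wrap each (start, end, type) character span of `sentence` in a styled <span>."""
    out, pos = [], 0
    for start, end, etype in sorted(entities):
        out.append(escape(sentence[pos:start]))
        out.append(f'<span style="{ENTITY_STYLE[etype]}">{escape(sentence[start:end])}</span>')
        pos = end
    out.append(escape(sentence[pos:]))
    return "".join(out)


def background_for(score, min_score, max_score):
    """Map an importance score to a background color; more important means stronger."""
    ratio = 0.0 if max_score == min_score else (score - min_score) / (max_score - min_score)
    lightness = round(95 - 35 * ratio)  # 95% (faint) down to 60% (strong)
    return f"background-color: hsl(50, 100%, {lightness}%);"


inner = highlight_entities("Li Ming visited Harbin.", [(0, 7, "Nh"), (16, 22, "Ns")])
print(f'<span style="{background_for(0.8, 0.2, 1.0)}">{inner}</span>')
```

Because the markup is added around existing text nodes, the same function can be applied to the original page or to the "reading mode" page.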

Other steps and parameters are the same as in specific embodiment 1.

Specific embodiment 8: This embodiment differs from specific embodiment 1 in that, because the summarization result has to be shown on the original text (for example by highlighting), step 3 performs the automatic summarization with an unsupervised extractive summarization algorithm; unsupervised extractive summarization algorithms include graph-based mining algorithms (such as TextRank) and clustering-based algorithms.

Other steps and parameters are the same as in specific embodiment 1.

Specific embodiment 9: This embodiment differs from specific embodiments 1 to 8 in that step 3 performs the automatic summarization with the TextRank algorithm, a graph-based mining algorithm; TextRank builds a graph representation of the text and then uses graph mining to find the key nodes (that is, the important sentences); it specifically comprises the following steps:

First, split the document into sentences and represent each sentence as a vector;

Then, compute the sentence similarity matrix: for the vectors of any two sentences in the text, compute their similarity with the cosine formula and collect the results into a similarity matrix; the entire text can then be regarded as an undirected weighted connected graph G whose nodes are the sentences and whose edges are the similarities between sentences;

Finally, mine the important nodes of G using the PageRank algorithm; the calculation formula is as follows:

$$WS(V_t) = (1 - c) + c \sum_{V_j \in In(V_t)} \frac{w_{jt}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

where c is the damping coefficient, generally set to 0.85; V_t is the t-th node of graph G (a node being a sentence of the text); In(V_t) is the set of nodes pointing to node V_t; Out(V_j) is the set of nodes that node V_j points to; w_jt is the weight of the edge connecting node V_t and node V_j; WS(V_t) on the left-hand side is the weight sum (Weight Sum) of node V_t, and the summation term on the right-hand side is the contribution of each adjacent node to this node;

The above formula is applied iteratively to all nodes of the graph until all weights stabilize; finally, the N nodes with the highest weight sums are selected and their N corresponding sentences are output as the summary.
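The three TextRank steps can be sketched in plain Python as follows. The sketch assumes non-negative sentence vectors (so that cosine similarities are valid edge weights), an initial weight of 1.0 per node, and a convergence tolerance of 1e-6; the function names and the stopping rule are illustrative choices rather than details taken from the description.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def textrank(vectors, c=0.85, tol=1e-6, max_iter=200):
    """Return the weight sum WS of every sentence, given the sentence vectors."""
    n = len(vectors)
    if n == 0:
        return []
    # Similarity matrix: w[j][t] is the edge weight between sentences j and t.
    w = [[cosine(vectors[j], vectors[t]) if j != t else 0.0 for t in range(n)]
         for j in range(n)]
    out_sum = [sum(row) for row in w]  # total edge weight leaving each node
    ws = [1.0] * n
    for _ in range(max_iter):
        new_ws = [
            (1 - c) + c * sum(w[j][t] / out_sum[j] * ws[j]
                              for j in range(n) if out_sum[j] > 0)
            for t in range(n)
        ]
        converged = max(abs(a - b) for a, b in zip(new_ws, ws)) < tol
        ws = new_ws
        if converged:
            break
    return ws


def top_sentences(sentences, vectors, n_top):
    """Pick the n_top highest-ranked sentences, returned in document order."""
    ws = textrank(vectors)
    ranked = sorted(range(len(sentences)), key=lambda i: ws[i], reverse=True)[:n_top]
    return [sentences[i] for i in sorted(ranked)]
```

Returning the selected sentences in document order suits the highlighting use case, where the summary is shown on top of the original text rather than as a separate paragraph.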

Other steps and parameters are the same as in specific embodiments 1 to 8.

Specific embodiment 10: This embodiment differs from specific embodiment 9 in that there are several optional ways of representing a sentence as a vector; the BM25 (Best Match) algorithm or a learning algorithm based on distributed representations can be used;

The BM25 algorithm is commonly used to score search relevance. Its calculation formula is as follows:

$$\mathrm{Score}(Q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d)$$

where Q is the query string (Query); q_i is the i-th word of Q (for Chinese, each word obtained after word segmentation), i = 1, ..., n; n is the number of words in Q; d is a search result document; W_i is the weight of q_i; and R(q_i, d) is the relevance score between q_i and the search result document d. W_i and R(q_i, d) can be designed freely; W_i is typically set to W_i = IDF(q_i), where IDF(·) is the inverse document frequency, while R(q_i, d) is more flexible and reflects the relevance of word q_i to document d;
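Since the description leaves W_i and R(q_i, d) open, the sketch below fills them in with one widely used choice: a smoothed IDF for W_i and the term-frequency saturation form with parameters k1 and b for R(q_i, d). These formulas, the default values k1 = 1.5 and b = 0.75, and the function name `bm25_scores` are assumptions for illustration, not details from the source. In the summarization setting, each sentence can play the role of the query and every sentence the role of a document, so that the list of scores serves as the sentence's vector; that usage is also an assumption.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(word for d in docs for word in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        total = 0.0
        for q in query:
            if q not in tf:
                continue
            # W_i: smoothed inverse document frequency (always positive)
            idf = math.log((n_docs - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5) + 1)
            # R(q_i, d): saturating term frequency with length normalization
            r = tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(d) / avg_len))
            total += idf * r
        scores.append(total)
    return scores


sentences = [
    ["public", "opinion", "analysis"],
    ["news", "report"],
    ["opinion", "news", "analysis", "report"],
]
# One BM25 vector per sentence: its score against every sentence in the text.
vectors = [bm25_scores(s, sentences) for s in sentences]
```

For Chinese text the token lists would come from a word segmenter, as the description notes.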

Learning algorithms based on distributed representations generally first obtain word vectors by pre-training on a corpus and then average the word vectors of the words in a sentence to obtain the sentence vector. Methods for pre-training word vectors on a corpus include Skip-Gram and CBOW, whose principles are shown in Fig. 4 and Fig. 5:

CBOW (Fig. 4) predicts the center word from its context words, whereas Skip-Gram (Fig. 5) predicts the context words from the center word. The distributed word representations obtained by such training capture the semantic information of words. Finally, a sentence vector is built from the word vectors by removing stop words and taking a simple average.
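The averaging step is small enough to show directly. In this sketch `word_vectors` stands for any pre-trained lookup table (for example one produced by Skip-Gram or CBOW training); the tiny two-dimensional vectors, the stop-word set, and the function name `sentence_vector` are made up for illustration.

```python
def sentence_vector(words, word_vectors, stop_words=frozenset()):
    """Average the vectors of the non-stop words of a tokenized sentence."""
    dim = len(next(iter(word_vectors.values())))
    kept = [word_vectors[w] for w in words
            if w not in stop_words and w in word_vectors]
    if not kept:                      # nothing left after filtering
        return [0.0] * dim
    return [sum(column) / len(kept) for column in zip(*kept)]


word_vectors = {"public": [1.0, 0.0], "opinion": [0.0, 1.0], "the": [5.0, 5.0]}
print(sentence_vector(["the", "public", "opinion"], word_vectors, {"the"}))
# [0.5, 0.5]
```

Words missing from the lookup table are skipped here; replacing them with a shared "unknown" vector would be another reasonable policy.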

Other steps and parameters are the same as in specific embodiment 9.

Embodiment

The beneficial effects of the present invention are verified by the following embodiment:

This embodiment follows the process shown in Fig. 1. An assisted reading system for public opinion analysis is built from two modules, a front-end plug-in and a back-end algorithm module. The front-end module can (optionally) be installed as a browser plug-in; it is mainly responsible for downloading the original HTML of the current page and sending it to the back end, and for rendering the processing results that the back end returns. The back-end algorithm module mainly contains the natural language processing algorithms for main-text extraction, named entity recognition, and automatic summarization.

Open a news web page, as shown in Fig. 6; here a news commentary published on People's Daily Online on March 5 is taken as an example.

When the front-end plug-in detects the start command, it sends the original HTML of the current web page to the back end; the back end forwards the original page to the individual algorithms, which apply main-text extraction, named entity recognition, and automatic summarization in turn. The results are used to render visualizations on the original web page and the "reading mode" web page, and the rendered pages are returned to the front end to be refreshed and redisplayed.

The effect of enabling the assisted text reading function on the original web page is shown in Fig. 7. The original page contains distractions such as layout, pictures, special effects, and advertisements, and can be processed into a concise "reading mode", whose effect is shown in Fig. 8. Named entities of the person-name and place-name classes are underlined with the proper name mark and presented in bold and in different colors; organization names are highlighted with a surrounding frame; highlighted words indicate entity words, and the highlighted background indicates the importance of a sentence. Readers can grasp the key points of the whole article by reading only the sentences with a colored background, which improves the efficiency of public opinion analysis; those who want to understand the relationships among the actors in the news event are helped by the highlighted entity words. Because the visualization is built entirely on the original text, reading the full text in detail remains convenient.

The present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those skilled in the art can make various corresponding changes and modifications in accordance with the present invention, but all such changes and modifications shall fall within the protection scope of the appended claims of the present invention.

Claims

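1. An assisted text reading method for public opinion analysis, characterized in that it specifically comprises the following steps:

Step 1: extract the main text from web pages of various types;

Step 2: perform named entity recognition on the text for entities such as persons, places, and organizations, and highlight them;

Step 3: automatically summarize the text, then rank the sentences by importance and highlight them.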

2. The assisted text reading method for public opinion analysis according to claim 1, characterized in that, in step 1, the main text is extracted by DOM tree parsing, and the process comprises the following steps:

Step 1.1: obtain the original web page HTML and detect its encoding; if the encoding is not UTF-8, convert it to UTF-8;

Step 1.2: group the web page tags;

Step 1.3: build a DOM tree from the HTML; DOM stands for Document Object Model;

Step 1.4: delete the elements in the DOM that hold non-text content;

Step 1.5: traverse all elements in the DOM; if an element is a <div> tag, recursively traverse all elements nested inside it; by adding or subtracting weights according to the groups, recombine and assemble the content of the page body.

3. The assisted text reading method for public opinion analysis according to claim 2, characterized in that the named entity recognition of step 2 can use a traditional machine learning algorithm or a deep learning algorithm; the traditional machine learning algorithms include a named entity recognition model based on cascaded HMMs and a named entity recognition model based on CRF, where HMM denotes hidden Markov model and CRF denotes conditional random field; the deep learning algorithms include a named entity recognition model based on BiLSTM and a named entity recognition model based on BiLSTM+CRF, where BiLSTM denotes bidirectional long short-term memory network.

4. The assisted text reading method for public opinion analysis according to claim 3, characterized in that, in the cascaded HMMs of the named entity recognition model based on cascaded HMMs:

the 1st-layer HMM performs word segmentation;

the 2nd-layer HMM roughly identifies place names and person names on the basis of the 1st layer, and pattern matching is then applied to its output to produce a secondary automatically annotated corpus;

the 3rd-layer HMM is trained on the secondary automatically annotated corpus to identify place names and person names finely;

the 4th-layer HMM identifies organization names.

5. The assisted text reading method for public opinion analysis according to claim 3, characterized in that the named entity recognition model based on BiLSTM comprises the following 3 layers:

the 1st layer is the embedding layer, which converts the characters of a sentence into character vectors; the character vectors can be randomly initialized and then updated during training, or character vectors pre-trained and published online can be used;

the 2nd layer is the BiLSTM layer, in which the parameters of each LSTM cell are first randomly initialized and the character vectors are then fed into the LSTM cells one time step at a time for recurrent computation; the LSTM is run once in the forward direction and once in the backward direction; the results are concatenated, regularized, and sent to the output layer;

the 3rd layer is the output layer, which is a softmax output layer.

6. The assisted text reading method for public opinion analysis according to claim 5, characterized in that the named entity recognition model based on BiLSTM+CRF adds a CRF layer between the BiLSTM layer and the output layer of the named entity recognition model based on BiLSTM; the data passed on by the BiLSTM layer goes through the CRF layer, which computes the final labels and passes them to the output layer.

7. The assisted text reading method for public opinion analysis according to claim 1, characterized in that the highlighting of steps 2 and 3 can be performed either on the original web page or on the "reading mode" web page, and is presented by adding effects to the corresponding words and sentences in the HTML.

8. The assisted text reading method for public opinion analysis according to claim 1, characterized in that step 3 performs the automatic summarization with an unsupervised extractive summarization algorithm; unsupervised extractive summarization algorithms include graph-based mining algorithms and clustering-based algorithms.

9. The assisted text reading method for public opinion analysis according to any one of claims 1 to 8, characterized in that step 3 performs the automatic summarization with the TextRank algorithm, a graph-based mining algorithm; the TextRank algorithm builds a graph representation of the text and then uses graph mining to find the key nodes; it specifically comprises the following steps:

First, split the document into sentences and represent each sentence as a vector;

Then, compute the sentence similarity matrix: for the vectors of any two sentences in the text, compute their similarity with the cosine formula and collect the results into a similarity matrix; the entire text can be regarded as an undirected weighted connected graph G whose nodes are the sentences and whose edges are the similarities between sentences;

Finally, mine the important nodes of G using the PageRank algorithm; the calculation formula is as follows:

$$WS(V_t) = (1 - c) + c \sum_{V_j \in In(V_t)} \frac{w_{jt}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

where c is the damping coefficient; V_t is the t-th node of graph G; In(V_t) is the set of nodes pointing to node V_t; Out(V_j) is the set of nodes that node V_j points to; w_jt is the weight of the edge connecting node V_t and node V_j; WS(V_t) is the weight sum of node V_t, and the summation term on the right-hand side is the contribution of each adjacent node to this node;

The above formula is applied iteratively to all nodes of the graph until all weights stabilize; finally, the N nodes with the highest weight sums are selected and their N corresponding sentences are output as the summary.

10. The assisted text reading method for public opinion analysis according to claim 9, characterized in that the sentences can be represented as vectors using the BM25 algorithm or a learning algorithm based on distributed representations;

The calculation formula of the BM25 algorithm is as follows:

$$\mathrm{Score}(Q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d)$$

where Q is the query string; q_i is the i-th word of Q, i = 1, ..., n; n is the number of words in Q; d is a search result document; W_i is the weight of q_i; and R(q_i, d) is the relevance score between q_i and the search result document d;

The learning algorithm based on distributed representations first obtains word vectors by pre-training on a corpus and then averages the word vectors of the words in a sentence to obtain the sentence vector.