CN110334300A - Text-assisted reading method for public opinion analysis - Google Patents


Info

Publication number
CN110334300A
Authority
CN
China
Prior art keywords
text
algorithm
analysis
sentence
public opinion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910621253.3A
Other languages
Chinese (zh)
Inventor
赵铁军
徐冰
杨沐昀
胡东瑶
曹海龙
朱聪慧
郑德权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN201910621253.3A
Publication of CN110334300A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a text-assisted reading method for public opinion analysis and belongs to the technical field of natural language processing. The invention first extracts the body text from web pages of various types using a unified method; it then performs named entity recognition on the text for entities such as persons, places, and organizations, and highlights them; finally, it automatically summarizes the text, ranks the sentences by importance, and highlights them. The invention addresses the difficulty of extracting key points and the low reading efficiency that public opinion analysts face when reading large volumes of text. It can be used to assist reading in public opinion analysis, allowing readers to grasp the gist of a text quickly.

Description

Text-assisted reading method for public opinion analysis
Technical field
The present invention relates to text-assisted reading methods and belongs to the technical field of natural language processing.
Background art
Public opinion analysis is a technique that collects public opinion information in real time and analyzes it statistically in order to help decision makers reach sound decisions. The information collected can be roughly divided into structured data (such as social network data and emission indices) and unstructured data (such as user comments and news text), with unstructured data forming the majority. For a given public opinion event, in addition to statistical analysis of the data, analysts often need to read a large number of news reports before they can produce a reasonably comprehensive analysis and summary, which easily causes visual fatigue. Advances in natural language processing make it possible for machines to assist human reading. The text-assisted reading system for public opinion analysis proposed by the present invention attempts to resolve the following difficulties:
One: public opinion news usually comes from web pages, but the page structure and character encoding of different news websites are inconsistent, and page structures change over time, so extracting the content is difficult.
Two: public opinion news and editorials of all kinds generally involve important information such as persons, regions, and organizations. Because Chinese has neither the inter-word spaces nor the capitalized entity initials found in English, it is hard to focus on this information when reading large amounts of text.
Three: the volume of public opinion news is large, so reading is inefficient and the gist of an article is not easy to grasp.
Summary of the invention
To solve the problems of difficult content extraction and low efficiency in existing public opinion analysis techniques, the present invention provides a text-assisted reading method for public opinion analysis.
The text-assisted reading method for public opinion analysis of the present invention is realized through the following technical solution:
Step 1: extract the body text from web pages of various types;
Step 2: perform named entity recognition on the text for entities such as persons, places, and organizations, and highlight them;
Step 3: automatically summarize the text, then rank the sentences by importance and highlight them.
The most prominent features and significant beneficial effects of the present invention are as follows:
In the text-assisted reading method for public opinion analysis according to the present invention, a web page is processed in turn by body text extraction, named entity recognition, and automatic summarization, after which the original web page and the "reading mode" page are rendered and refreshed for display. The method lets readers understand the key points of an entire article simply by reading the highlighted sentences, which saves reading time and at least doubles the efficiency of public opinion analysis. Readers who want to understand the relationships among the parties in a news event can do so quickly with the help of the highlighted entity words. Because all processing is built directly on the original text, users can also read the original quickly and conveniently.
Brief description of the drawings
Fig. 1 is a flow chart of the method in an embodiment of the present invention;
Fig. 2 is a schematic diagram of the named entity recognition model based on stacked HMMs in the present invention;
Fig. 3 is a schematic diagram of the named entity recognition model based on BiLSTM+CRF in the present invention; human denotes a person, space denotes a place, and institute denotes an organization;
Fig. 4 is a schematic diagram of the CBOW principle; w(t) denotes the word at the current position;
Fig. 5 is a schematic diagram of the Skip-Gram principle;
Fig. 6 shows the original web page in the embodiment of the present invention;
Fig. 7 shows the effect of enabling the text-assisted reading function on the original web page in the embodiment of the present invention;
Fig. 8 shows the effect of enabling the text-assisted reading function on the "reading mode" page in the embodiment of the present invention.
Detailed description of the embodiments
Embodiment 1: this embodiment is described with reference to Fig. 1. The text-assisted reading method for public opinion analysis provided by this embodiment comprises the following steps:
Step 1: extract the body text from web pages of various types using a unified method;
In practice, the HTML structure of pages on the major news websites is complex and inconsistent. Seen from the outside, a page mainly contains distractions such as layout, images, special effects, and advertisements; seen from the underlying code, it mainly suffers from tangled tags and chaotic character encoding. The purpose of this step is to extract the valid text paragraphs from the web page, remove the distractions, and restore the page to a clean and concise "reading mode", which also facilitates the subsequent natural language processing.
Step 2: perform named entity recognition on the text for entities such as persons, places, and organizations, and highlight them;
The purpose of this step is to extract the important entity elements from unstructured text and present them in a form that suits visual attention. News reports and editorials usually revolve around an event, and an event has basic elements such as time, place, and persons (or organizations). In English, proper nouns are written with capitalized initials, and words are separated by spaces, which makes the boundaries of proper nouns clear. Chinese has neither of these writing features, so readers facing large amounts of text find it hard to focus on the important elements of a sentence. Named entity recognition combined with appropriate highlighting can alleviate this problem.
Step 3: automatically summarize the text, then rank the sentences by importance and highlight them;
The purpose of this step is to mine the important sentences from unstructured text and present them visually on the original text, helping the reader to read quickly.
Embodiment 2: this embodiment differs from Embodiment 1 in that the body text extraction in step 1 is performed by DOM tree parsing. The process comprises the following steps:
Step 1.1: obtain the HTML (Hyper Text Markup Language) of the original web page and detect its encoding; if the encoding is not UTF-8 (many Chinese web pages are encoded in GB2312), convert it to UTF-8;
Step 1.2: predefine several groups of regular expressions and sort the web page labels into groups. The main groups are the "up-weighted group" (labels likely to contain body text, such as body, article, content), the "down-weighted group" (labels unlikely to contain body text, such as footnote, media, meta), and other groups;
Step 1.3: build a DOM tree from the HTML; DOM stands for Document Object Model;
Step 1.4: delete the elements in the DOM that hold non-text content;
Step 1.5: traverse all elements in the DOM; if an element is a <div> tag, recursively traverse all elements nested inside it. By adding and subtracting the group weights, the content of the page body is reassembled and extracted.
The other steps and parameters are the same as in Embodiment 1.
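Steps 1.1 to 1.5 can be illustrated with a minimal Python sketch built on the standard library. The regular-expression groups, the weight of 25, the one-point-per-ten-characters rule, the GB18030 fallback, and the names `decode_html`, `BodyExtractor`, and `extract_body` are all assumptions made for illustration; the source names only a few example labels and gives no concrete weights.

```python
import re
from html.parser import HTMLParser

# Hypothetical label groups and weights; the source names only a few example labels.
UP_WEIGHTED = re.compile(r"body|article|content", re.I)
DOWN_WEIGHTED = re.compile(r"footnote|media|meta", re.I)
NON_TEXT_TAGS = {"script", "style", "noscript", "iframe"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


def decode_html(raw: bytes) -> str:
    """Step 1.1: try UTF-8 first, then fall back to GB18030 (a superset of GB2312)."""
    for encoding in ("utf-8", "gb18030"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


class BodyExtractor(HTMLParser):
    """Steps 1.2 to 1.5: score each <div> by its class/id hints and its own text."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open elements
        self.candidates = []  # (score, text) for every closed <div>
        self.skip_depth = 0   # > 0 while inside a non-text element

    def handle_starttag(self, tag, attrs):
        if tag in NON_TEXT_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        hint = " ".join(v for k, v in attrs if k in ("class", "id") and v)
        weight = 25 * bool(UP_WEIGHTED.search(hint)) - 25 * bool(DOWN_WEIGHTED.search(hint))
        self.stack.append({"tag": tag, "weight": weight, "parts": [], "own_chars": 0})

    def handle_data(self, data):
        if self.skip_depth == 0 and self.stack:
            self.stack[-1]["parts"].append(data)
            self.stack[-1]["own_chars"] += len(data.strip())

    def handle_endtag(self, tag):
        if tag in NON_TEXT_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or not self.stack or self.stack[-1]["tag"] != tag:
            return  # this sketch ignores void or mismatched closing tags
        node = self.stack.pop()
        text = " ".join("".join(node["parts"]).split())
        keep = True
        if tag == "div":
            score = node["weight"] + node["own_chars"] // 10
            self.candidates.append((score, text))
            keep = score >= 0  # down-weighted blocks do not flow into the parent
        elif self.stack:
            self.stack[-1]["own_chars"] += node["own_chars"]
        if keep and self.stack:
            self.stack[-1]["parts"].append(" " + text + " ")


def extract_body(html: str) -> str:
    """Return the text of the highest-scoring <div>, or "" if the page has none."""
    parser = BodyExtractor()
    parser.feed(html)
    parser.close()
    return max(parser.candidates, key=lambda c: c[0])[1] if parser.candidates else ""
```

A production extractor would need a more tolerant parser for malformed HTML; the sketch only shows how weights attached to label groups can separate the body from navigation and footer blocks.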
Embodiment 3: this embodiment differs from Embodiment 2 in that the named entity recognition in step 2, which is generally treated as a sequence labeling task, can use either a conventional machine learning algorithm, which is slightly less accurate but faster, or a deep learning algorithm, which is slightly more accurate but slower; the choice depends on actual needs. The conventional machine learning algorithms include the named entity recognition model algorithm based on stacked HMMs and the named entity recognition model algorithm based on CRF, where HMM denotes a hidden Markov model and CRF denotes a conditional random field. The deep learning algorithms include the named entity recognition model algorithm based on BiLSTM and the named entity recognition model algorithm based on BiLSTM+CRF, where BiLSTM denotes a bidirectional long short-term memory network.
The other steps and parameters are the same as in Embodiment 2.
Embodiment 4: this embodiment differs from Embodiment 3 as follows. Although the HMM is the simplest sequence labeling model, it is fast and highly extensible. Because production environments place high demands on fine-grained named entity recognition, multi-layer stacked HMMs remain one of the widely used methods today.
In the stacked HMM of the named entity recognition model based on stacked HMMs:
The 1st-layer HMM performs word segmentation. Owing to the training corpus, this layer tends to split apart obscure place names (for example at the county or township level) and personal names.
The 2nd-layer HMM coarsely recognizes place names and personal names on the basis of the 1st layer, then applies pattern matching to the output to re-annotate the corpus automatically; the re-annotated corpus serves as the training corpus for the next layer;
The 3rd-layer HMM is trained on the automatically re-annotated corpus to recognize place names and personal names at a finer granularity;
The 4th-layer HMM recognizes organization names. Organization names are somewhat more complex because they usually contain place names (and a few even contain personal names). After place names and personal names have been recognized, pattern matching is applied again for a second round of automatic annotation, and the result is used as the training corpus for recognizing organization names.
Therefore, recognizing personal names and place names at a finer granularity requires a 3-layer HMM, and recognizing organization names at a comparable granularity requires a 4-layer HMM. The training process is shown in Fig. 2 (where the 2nd and 3rd layers are merged into one). The labels for character position within an entity are: B for the beginning character of an entity (Begin), I for an intermediate character (Inside), E for the ending character (End), S for a single-character entity (Single), and O for a non-entity character (Outside). The labels for entity type are: Nh for a personal name entity, Ns for a place name entity, and Ni for an organization name entity. Position and type labels can be combined: in "Jinan District procuratorial organ", the label of "Jinan District" is "B-Ni", meaning the beginning of an organization name, and the label of "procuratorial organ" is "E-Ni", meaning the end of an organization name.
Training with CRF is similar to the above: once the corresponding feature templates have been set, training can begin.
The other steps and parameters are the same as in Embodiment 3.
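The combined position-and-type labels described above can be sketched in Python as follows. This is only an illustration of the labeling scheme, not of HMM training; the function names and the token-span input format are assumptions.

```python
def bioes_labels(n_tokens, entity_type):
    """Position-type labels for ONE entity that spans n_tokens tokens."""
    if n_tokens <= 0:
        return []
    if n_tokens == 1:
        return [f"S-{entity_type}"]
    middle = [f"I-{entity_type}"] * (n_tokens - 2)
    return [f"B-{entity_type}"] + middle + [f"E-{entity_type}"]


def label_sentence(tokens, entities):
    """entities: (start, end_exclusive, type) spans over token indices.

    Tokens outside every span are labeled "O" (non-entity).
    """
    labels = ["O"] * len(tokens)
    for start, end, entity_type in entities:
        labels[start:end] = bioes_labels(end - start, entity_type)
    return labels


tokens = ["Jinan District", "procuratorial organ", "prosecuted", "Zhang"]
print(label_sentence(tokens, [(0, 2, "Ni"), (3, 4, "Nh")]))
```

The example sentence and the person name "Zhang" are invented for illustration; only the "Jinan District procuratorial organ" labels come from the description.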
Embodiment 5: this embodiment differs from Embodiment 3 in that the named entity recognition model based on BiLSTM, shown in Fig. 3, mainly comprises the following 3 layers:
The 1st layer is the embedding layer, which converts the characters of a sentence into character vectors. The character vectors can be randomly initialized and then updated during training, or pre-trained vectors available online can be used;
The 2nd layer is the BiLSTM layer. The parameters in each LSTM (long short-term memory network) unit are first randomly initialized, and the character vectors are then fed into the LSTM units one time step at a time for recurrent computation. A BiLSTM runs the LSTM once in the forward direction (forward LSTM) and once in the backward direction (backward LSTM); the results are concatenated, regularized, and passed to the output layer;
The 3rd layer is the output layer, a simple softmax (normalized exponential function) output layer. Note that place name labels, personal name labels, and organization name labels can contain one another. If these containment relationships matter, two output branches can be used here and trained simultaneously; if they do not, the contained entities can be ignored and a single output branch used.
The other steps and parameters are the same as in Embodiment 3 or 4.
Embodiment 6: this embodiment differs from Embodiment 5 in that the named entity recognition model based on BiLSTM+CRF adds a CRF layer between the BiLSTM layer and the output layer of the BiLSTM-based model. The data passed on by the BiLSTM layer goes through the CRF layer, which computes the final labels and sends them to the output layer.
The other steps and parameters are the same as in Embodiment 5.
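How a CRF layer can turn per-character scores into a final label sequence is commonly done with Viterbi decoding. The sketch below assumes that approach, which the source does not spell out; it takes emission scores (standing in for BiLSTM outputs) and a tag-to-tag transition matrix as plain Python lists, and every number in the example is invented for illustration.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions[t][k]   : score of tag k at position t (e.g. from a BiLSTM)
    transitions[j][k] : score of moving from tag j to tag k (CRF parameters)
    """
    if not emissions:
        return []
    n_tags = len(tags)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for k in range(n_tags):
            # Best previous tag j for landing on tag k at this position.
            best_j = max(range(n_tags), key=lambda j: score[j] + transitions[j][k])
            new_score.append(score[best_j] + transitions[best_j][k] + emit[k])
            pointers.append(best_j)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final tag.
    best = max(range(n_tags), key=lambda k: score[k])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return [tags[k] for k in path]


tags = ["O", "B-Ni", "E-Ni"]
emissions = [[0.1, 2.0, 0.0], [1.0, 0.0, 0.9]]
transitions = [[0, 0, -10], [-10, -10, 0], [0, 0, -10]]  # B-Ni must be followed by E-Ni
print(viterbi_decode(emissions, transitions, tags))
```

Taken position by position, the second character would be labeled "O" because its emission score is highest; the transition scores are what let the CRF layer reject the invalid sequence "B-Ni, O".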
Embodiment 7: this embodiment differs from Embodiment 1 in that the highlighting in steps 2 and 3 can be applied either to the original web page or to the "reading mode" page (the page produced by step 1, which contains only the body text and none of the distractions such as images, special effects, and advertisements). It is presented by adding effects to the corresponding words and sentences in the HTML.
Highlighting of personal names and place names:
As mentioned above, recognizing named entities in Chinese text is harder than in English, mainly because Chinese has neither word separators nor capitalized initials. In fact, Chinese punctuation includes a mark called the proper-name mark ("_") that overcomes this problem. It was specified in the 1919 "Proposal Requesting the Promulgation of New-Style Punctuation Marks", and the "Punctuation Usage" standard issued by the State Bureau of Technical Supervision in 1995 recommends its use in ancient texts to mark personal names, place names, dynasty names, and the like.
The present invention adopts the proper-name mark: named entities of the personal name and place name classes are underlined with the proper-name mark and additionally presented in bold and in color.
Highlighting of organization names:
Organization names could be marked with the proper-name mark in the same way as personal names and place names. However, an organization name may contain a place name, and using the proper-name mark again would make the underlines run together. Organization names also occur in articles far less often than personal names and place names, and they matter for public opinion analysis, so they are better marked with a more prominent symbol. In this embodiment, organization names are presented inside a frame.
Highlighting of the summary:
A background color is set on the sentences of the original text that automatic summarization selects, and each sentence's importance score is mapped to the brightness of the background color to reflect how important the sentence is.
The other steps and parameters are the same as in Embodiment 1.
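The highlighting effects described in this embodiment could be produced by wrapping words and sentences in styled HTML spans, as in the following Python sketch. The CSS values, the class names, the yellow hue, and the 95% to 60% lightness range are assumptions chosen for illustration; the source specifies only underlining with bold and color for personal and place names, a frame for organization names, and a background brightness that reflects sentence importance.

```python
from html import escape

# Hypothetical styles: underline + bold + color for names, a frame for organizations.
ENTITY_STYLE = {
    "Nh": "text-decoration: underline; font-weight: bold; color: #c0392b",
    "Ns": "text-decoration: underline; font-weight: bold; color: #2471a3",
    "Ni": "border: 1px solid #333; padding: 0 2px",
}


def highlight_entity(text, entity_type):
    """Wrap one entity word in a span styled according to its type."""
    style = ENTITY_STYLE[entity_type]
    return f'<span class="ner-{entity_type}" style="{style}">{escape(text)}</span>'


def highlight_sentence(inner_html, score, min_score, max_score):
    """Map an importance score to background lightness: higher score, stronger color."""
    spread = max_score - min_score
    ratio = (score - min_score) / spread if spread > 0 else 1.0
    lightness = round(95 - 35 * ratio)  # 95% is faint, 60% is strong
    return f'<span style="background-color: hsl(50, 100%, {lightness}%)">{inner_html}</span>'
```

`highlight_sentence` takes already-built HTML so that entity spans can be nested inside a highlighted sentence; plain text should be passed through `html.escape` first.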
Embodiment 8: this embodiment differs from Embodiment 1 in that, because the summary must be displayed on the original text (for example by highlighting), the automatic summarization in step 3 uses an unsupervised extractive summarization algorithm. Unsupervised extractive summarization algorithms include graph-based mining algorithms (such as TextRank) and clustering-based algorithms.
The other steps and parameters are the same as in Embodiment 1.
Embodiment 9: this embodiment differs from Embodiments 1 to 8 in that the automatic summarization in step 3 uses the TextRank algorithm, one of the graph-based mining algorithms. TextRank builds a graph representation of the text and then mines the graph to find the key nodes (that is, the important sentences). It comprises the following steps:
First, the document is split into sentences, and each sentence is represented as a vector;
Then the sentence similarity matrix is computed: for the vectors of any two sentences in the text, the similarity is computed with the cosine formula, and the results are collected into a similarity matrix. The entire text can thus be regarded as an undirected weighted connected graph G whose nodes are sentences and whose edges are the similarities between sentences;
Finally, the PageRank algorithm is used to mine the important nodes of G. The formula is as follows:
$$WS(V_t) = (1 - c) + c \sum_{V_j \in In(V_t)} \frac{w_{jt}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$
where c denotes the damping coefficient, which can generally be set to 0.85; V_t denotes the t-th node of graph G (a node being a sentence of the text); In(V_t) denotes the set of nodes that point to node V_t; Out(V_j) denotes the set of nodes that node V_j points to; and w_{jt} denotes the weight of the edge between nodes V_j and V_t. WS(V_t) on the left-hand side denotes the weight sum (Weight Sum) of node V_t, and the summation on the right-hand side expresses how much each adjacent node contributes to this node;
The formula is applied iteratively to all nodes of the graph until all weights stabilize. Finally, the N nodes with the highest weight sums are selected, and their N corresponding sentences are output as the summary.
The other steps and parameters are the same as in Embodiments 1 to 8.
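The TextRank steps of this embodiment can be sketched in plain Python as follows. The sketch assumes that sentence vectors are already available and non-negative, so that cosine similarities can serve directly as edge weights; the convergence tolerance, the iteration cap, and the function names are illustrative choices, not taken from the source.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def textrank(vectors, c=0.85, tol=1e-6, max_iter=200):
    """Iterate WS(V_t) = (1 - c) + c * sum_j w_jt / sum_k w_jk * WS(V_j)."""
    n = len(vectors)
    if n == 0:
        return []
    # Similarity matrix of the undirected weighted sentence graph (no self-loops).
    w = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    out_sum = [sum(row) for row in w]
    ws = [1.0] * n
    for _ in range(max_iter):
        new_ws = []
        for t in range(n):
            rank = sum(w[j][t] / out_sum[j] * ws[j] for j in range(n) if out_sum[j] > 0)
            new_ws.append((1 - c) + c * rank)
        delta = max(abs(a - b) for a, b in zip(new_ws, ws))
        ws = new_ws
        if delta < tol:  # weights have stabilized
            break
    return ws


def summarize(sentences, vectors, n_top=1):
    """Return the n_top highest-scoring sentences in their original order."""
    scores = textrank(vectors)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_top]
    return [sentences[i] for i in sorted(top)]
```

Returning the selected sentences in document order keeps them aligned with the original text, which matters here because the summary is displayed by highlighting sentences in place.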
Embodiment 10: this embodiment differs from Embodiment 9 in that there are several optional methods for representing sentences as vectors: the BM25 algorithm (Best Match) or an algorithm based on distributed representation learning can be used;
The BM25 algorithm is commonly used to score relevance in search. Its formula is as follows:
$$\mathrm{Score}(Q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d)$$
where Q denotes the query string (Query); q_i denotes the i-th word of the query string Q (for Chinese, each word obtained after word segmentation), i = 1, ..., n; n is the number of words in Q; d denotes a search result document; W_i denotes the weight of q_i; and R(q_i, d) denotes the relevance score between q_i and the search result document d. Both W_i and R(q_i, d) can be designed freely. W_i is typically set to IDF(q_i), where IDF(·) is the inverse document frequency; R(q_i, d) is more flexible and reflects the relevance of the word q_i to the document d;
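A minimal Python sketch of the BM25 score follows. W_i is set to IDF(q_i) as the description suggests; for R(q_i, d), which the source leaves open, the sketch assumes the common Okapi form with parameters `k1` and `b`. Representing each sentence by its BM25 scores against every sentence of the document (`sentence_vectors`) is likewise an assumption about how the score is turned into a vector.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score(Q, d) = sum_i W_i * R(q_i, d) for every tokenized document d in docs."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for q in query:
            tf = term_freq[q]
            if tf == 0:
                continue
            # W_i: inverse document frequency (smoothed so it is never negative).
            idf = math.log(1 + (n_docs - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5))
            # R(q_i, d): Okapi term-frequency saturation with length normalization.
            relevance = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avg_len))
            score += idf * relevance
        scores.append(score)
    return scores


def sentence_vectors(tokenized_sentences):
    """Assumed vector form: each sentence's BM25 scores against all sentences."""
    return [bm25_scores(s, tokenized_sentences) for s in tokenized_sentences]
```

Each sentence acts as the query Q and every sentence of the document acts as a result document d, so a document with m sentences yields m vectors of length m that can then be compared with the cosine formula.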
Algorithms based on distributed representation learning generally first obtain word vectors by pre-training on a corpus, and then average the word vectors of the words in a sentence to obtain the sentence vector. Methods for obtaining word vectors by corpus pre-training include Skip-Gram and CBOW, whose principles are shown in Fig. 4 and Fig. 5:
CBOW (Fig. 4) predicts the center word from its context words, whereas Skip-Gram (Fig. 5) predicts the context words from the center word. The distributed word representations obtained by such training capture the semantic information of words. Finally, the sentence vector is built from the word vectors: after stop words are removed, it is obtained by simple averaging.
The other steps and parameters are the same as in Embodiment 9.
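The averaging step can be sketched in Python as follows, assuming the pre-trained word vectors are available as a dictionary from word to list of floats. The function name, the handling of out-of-vocabulary words, and the toy vectors are illustrative assumptions.

```python
def average_vector(tokens, word_vectors, stop_words=frozenset()):
    """Sentence vector = mean of the word vectors of its non-stop-word tokens.

    Tokens missing from word_vectors are skipped; an empty list is returned
    when no token contributes a vector.
    """
    kept = [word_vectors[t] for t in tokens if t not in stop_words and t in word_vectors]
    if not kept:
        return []
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


toy_vectors = {"flood": [1.0, 0.0], "city": [0.0, 1.0], "the": [5.0, 5.0]}
print(average_vector(["the", "flood", "city"], toy_vectors, stop_words={"the"}))
```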
Example
The beneficial effects of the present invention are verified with the following example:
This example follows the flow shown in Fig. 1. An assisted reading system for public opinion analysis is built from two modules, a front-end plug-in and a back-end algorithm module. The front-end module can optionally be installed as a browser plug-in; it is mainly responsible for downloading the original HTML of the current page, sending it to the back end, and rendering the result that the back end returns. The back-end algorithm module mainly contains the natural language processing algorithms for body text extraction, named entity recognition, and automatic summarization.
A news web page is opened, as shown in Fig. 6; a news commentary published on People's Daily Online ("People's Net") on March 5 is taken as the example here.
When the front-end plug-in detects the start command, it sends the original HTML of the current web page to the back end. The back end forwards the original page to each algorithm, and the page is processed in turn by body text extraction, named entity recognition, and automatic summarization. The results are used to render the original web page and the "reading mode" page visually, and the rendered pages are returned to the front end and refreshed for display.
The effect of enabling the text-assisted reading function on the original web page is shown in Fig. 7. The original page contains distractions such as layout, images, special effects, and advertisements; it can be processed into a concise "reading mode", whose effect is shown in Fig. 8. Named entities of the personal name and place name classes are underlined with the proper-name mark and presented in bold and in different colors; organization names are highlighted inside a frame. Highlighted words denote entity words, and the highlighted background denotes the importance of a sentence. Readers can understand the key points of an entire article simply by reading the sentences with a colored background, which improves the efficiency of public opinion analysis; readers who want to understand the relationships among the parties in a news event can do so quickly with the help of the highlighted entity words. Because the visualization is built directly on the original text, reading the full text in detail is also convenient.
The present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those skilled in the art can make corresponding changes and modifications in accordance with the present invention, and all such changes and modifications shall fall within the scope of protection of the appended claims of the present invention.

Claims (10)

1. A text-assisted reading method for public opinion analysis, characterized in that it comprises the following steps:
Step 1: extract the body text from web pages of various types;
Step 2: perform named entity recognition on the text for entities such as persons, places, and organizations, and highlight them;
Step 3: automatically summarize the text, then rank the sentences by importance and highlight them.
2. The text-assisted reading method for public opinion analysis according to claim 1, characterized in that the body text extraction in step 1 is performed by DOM tree parsing, the process comprising the following steps:
Step 1.1: obtain the HTML of the original web page and detect its encoding; if the encoding is not UTF-8, convert it to UTF-8;
Step 1.2: sort the web page labels into groups;
Step 1.3: build a DOM tree from the HTML; DOM stands for Document Object Model;
Step 1.4: delete the elements in the DOM that hold non-text content;
Step 1.5: traverse all elements in the DOM; if an element is a <div> tag, recursively traverse all elements nested inside it; by adding and subtracting the group weights, reassemble and extract the content of the page body.
3. The text-assisted reading method for public opinion analysis according to claim 2, characterized in that the named entity recognition in step 2 can use a conventional machine learning algorithm or a deep learning algorithm; the conventional machine learning algorithms include the named entity recognition model algorithm based on stacked HMMs and the named entity recognition model algorithm based on CRF, where HMM denotes a hidden Markov model and CRF denotes a conditional random field; the deep learning algorithms include the named entity recognition model algorithm based on BiLSTM and the named entity recognition model algorithm based on BiLSTM+CRF, where BiLSTM denotes a bidirectional long short-term memory network.
4. The text-assisted reading method for public opinion analysis according to claim 3, characterized in that, in the stacked HMM of the named entity recognition model based on stacked HMMs:
the 1st-layer HMM performs word segmentation;
the 2nd-layer HMM coarsely recognizes place names and personal names on the basis of the 1st layer, then applies pattern matching to the output to re-annotate the corpus automatically;
the 3rd-layer HMM is trained on the automatically re-annotated corpus to recognize place names and personal names at a finer granularity;
the 4th-layer HMM recognizes organization names.
5. The text-assisted reading method for public opinion analysis according to claim 3, characterized in that the named entity recognition model based on BiLSTM comprises the following 3 layers:
the 1st layer is the embedding layer, which converts the characters of a sentence into character vectors; the character vectors can be randomly initialized and then updated during training, or pre-trained vectors available online can be used;
the 2nd layer is the BiLSTM layer; the parameters in each LSTM unit are first randomly initialized, and the character vectors are then fed into the LSTM units one time step at a time for recurrent computation; the LSTM is run once in the forward direction and once in the backward direction; the results are concatenated, regularized, and passed to the output layer;
the 3rd layer is the output layer, which is a softmax output layer.
6. The text-assisted reading method for public opinion analysis according to claim 5, characterized in that the named entity recognition model based on BiLSTM+CRF adds a CRF layer between the BiLSTM layer and the output layer of the named entity recognition model based on BiLSTM; the data passed on by the BiLSTM layer goes through the CRF layer, which computes the final labels and sends them to the output layer.
7. The text-assisted reading method for public opinion analysis according to claim 1, characterized in that the highlighting in steps 2 and 3 can be applied either to the original web page or to the "reading mode" page, and is presented by adding effects to the corresponding words and sentences in the HTML.
8. The text-assisted reading method for public opinion analysis according to claim 1, characterized in that the automatic summarization in step 3 uses an unsupervised extractive summarization algorithm; unsupervised extractive summarization algorithms include graph-based mining algorithms and clustering-based algorithms.
9. The text aid reading method towards the analysis of public opinion according to any one of claims 1 to 8, characterized in that in Step 3 the automatic summarization is performed using the TextRank algorithm, which belongs to the graph-based mining algorithms; the TextRank algorithm builds a graph representation of the text and then uses graph mining to find the key nodes; it specifically comprises the following steps:
First, the document is split into sentences, and each sentence is represented as a vector;
Then, the sentence similarity matrix is computed: for the vectors of any two sentences in the text, the similarity is computed with the cosine formula, and the results are collected into a similarity matrix; the whole text can then be regarded as an undirected weighted connected graph G with sentences as nodes and inter-sentence similarities as edges;
Finally, important nodes are mined from G using the PageRank algorithm; the calculation formula is as follows:
$$WS(V_t) = (1 - c) + c \sum_{V_j \in In(V_t)} \frac{w_{jt}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)$$
Wherein, c denotes the damping coefficient; V_t denotes the t-th node in graph G; In(V_t) denotes the set of nodes pointing to node V_t; Out(V_j) denotes the set of nodes that node V_j points to, with V_k ranging over that set; w_{jt} denotes the weight of the edge between node V_j and node V_t; WS(V_t) denotes the weight score of node V_t, and the summation on the right-hand side represents the contribution of each adjacent node to this node;
The above formula is applied iteratively to update all nodes in the graph until all weights stabilize; finally, the N nodes with the highest weight scores are selected and their corresponding N sentences are output as the summary.
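The steps of claim 9 can be sketched in a few lines, assuming the sentence vectors are already available and non-negative (for example term counts). The function names, the damping value 0.85, the convergence tolerance, and the choice to return the selected sentences in their original order are assumptions not stated in the claim.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def textrank(vectors, c=0.85, tol=1e-6, max_iter=200):
    """Iterate the weight-score update until the scores stabilize."""
    n = len(vectors)
    # Similarity matrix: edge weights of the undirected graph (no self-loops).
    w = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    out_sum = [sum(row) for row in w]  # total edge weight leaving each node
    ws = [1.0] * n
    for _ in range(max_iter):
        new_ws = []
        for t in range(n):
            contribution = sum(w[j][t] / out_sum[j] * ws[j]
                               for j in range(n) if out_sum[j] > 0)
            new_ws.append((1 - c) + c * contribution)
        delta = max(abs(a - b) for a, b in zip(ws, new_ws))
        ws = new_ws
        if delta < tol:
            break
    return ws


def summarize(sentences, vectors, top_n=2):
    scores = textrank(vectors)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    # Assumption: keep the chosen sentences in document order for readability.
    return [sentences[i] for i in sorted(top)]
```

For three sentences with vectors `[1, 0]`, `[1, 1]` and `[0, 1]`, the middle sentence is similar to both others and therefore receives the highest score, so `summarize(..., top_n=1)` selects it.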
10. The text aid reading method towards the analysis of public opinion according to claim 9, characterized in that the representation of sentences as vectors can use the BM25 algorithm or an algorithm based on distributed representation learning;
The calculation formula of the BM25 algorithm is as follows:
$$Score(Q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d)$$
Wherein, Q denotes the query string; q_i denotes the i-th word in the query string Q, i = 1, ..., n; n is the number of words in the query string Q; d denotes a search result document; W_i denotes the weight of q_i; R(q_i, d) denotes the relevance score of q_i with respect to the search result document d;
The algorithm based on distributed representation learning first pre-trains on a corpus to obtain word vectors, and then averages the word vectors of the words in a sentence to obtain the vector of the sentence.
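Both options in claim 10 can be sketched briefly. The claim only gives the general form of the BM25 score as a weighted sum, so the concrete choices below are assumptions: the weight W_i is taken as a smoothed inverse document frequency, R(q_i, d) is the usual Okapi BM25 term with parameters `k1` and `b`, and the function names and toy word vectors are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenized document in docs against the tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query:
            if q not in tf:
                continue
            # W_i: smoothed inverse document frequency (assumed form).
            weight = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5) + 1.0)
            # R(q_i, d): term-frequency component with length normalization.
            relevance = tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(d) / avg_len))
            score += weight * relevance
        scores.append(score)
    return scores


def average_vector(words, word_vectors):
    """Average the pre-trained vectors of the words in a sentence."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    if not vecs:
        return []
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

With the averaging approach, a sentence made of two words whose vectors are `[1.0, 0.0]` and `[0.0, 1.0]` is represented by `[0.5, 0.5]`; in practice the word vectors would come from a model pre-trained on a large corpus.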
CN201910621253.3A 2019-07-10 2019-07-10 Text aid reading method towards the analysis of public opinion Pending CN110334300A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910621253.3A CN110334300A (en) 2019-07-10 2019-07-10 Text aid reading method towards the analysis of public opinion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910621253.3A CN110334300A (en) 2019-07-10 2019-07-10 Text aid reading method towards the analysis of public opinion

Publications (1)

Publication Number Publication Date
CN110334300A true CN110334300A (en) 2019-10-15

Family

ID=68145988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910621253.3A Pending CN110334300A (en) 2019-07-10 2019-07-10 Text aid reading method towards the analysis of public opinion

Country Status (1)

Country Link
CN (1) CN110334300A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541874A (en) * 2010-12-16 2012-07-04 中国移动通信集团公司 Webpage text content extracting method and device
CN109800386A (en) * 2017-11-17 2019-05-24 奥多比公司 Highlight the key component of text in document
CN109753660A (en) * 2019-01-07 2019-05-14 福州大学 A kind of acceptance of the bid webpage name entity abstracting method based on LSTM

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
裴大帅2021: "NLP机构名识别中的层叠式HMM架构", 《新浪博客》 *
郭正斌: "面向社会安全事件的知识图谱构建方法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
陈老师或波哥: "网页内容高亮的实现", 《简书》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160019A (en) * 2019-12-30 2020-05-15 中国联合网络通信集团有限公司 Public opinion monitoring method, device and system
CN111160019B (en) * 2019-12-30 2023-08-15 中国联合网络通信集团有限公司 Public opinion monitoring method, device and system
CN113297826A (en) * 2020-06-28 2021-08-24 上海交通大学 Method for marking on natural language text
CN112989811A (en) * 2021-03-01 2021-06-18 哈尔滨工业大学 BiLSTM-CRF-based historical book reading auxiliary system and control method thereof
CN112989811B (en) * 2021-03-01 2022-09-09 哈尔滨工业大学 History book reading auxiliary system based on BiLSTM-CRF and control method thereof

Similar Documents

Publication Publication Date Title
Zhang et al. Sentiment analysis of Chinese micro-blog text based on extended sentiment dictionary
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
WO2021114745A1 (en) Named entity recognition method employing affix perception for use in social media
CN103049435B (en) Text fine granularity sentiment analysis method and device
CN109857990A (en) A kind of financial class notice information abstracting method based on file structure and deep learning
CN110020189A (en) A kind of article recommended method based on Chinese Similarity measures
CN110334300A (en) Text aid reading method towards the analysis of public opinion
CN103678412A (en) Document retrieval method and device
CN111444704B (en) Network safety keyword extraction method based on deep neural network
CN109086355A (en) Hot spot association relationship analysis method and system based on theme of news word
Wang et al. Learning morpheme representation for mongolian named entity recognition
Tohidi et al. A Practice of Human-Machine Collaboration for Persian Text Summarization
Liu et al. A parallel computing-based deep attention model for named entity recognition
Xu et al. ALSEE: a framework for attribute-level sentiment element extraction towards product reviews
Le-Hong Diacritics generation and application in hate speech detection on Vietnamese social networks
CN114265936A (en) Method for realizing text mining of science and technology project
CN108595466B (en) Internet information filtering and internet user information and network card structure analysis method
Feifei et al. Bert-based Siamese network for semantic similarity
Nasim et al. Evaluation of clustering techniques on Urdu News head-lines: A case of short length text
Behere et al. Text summarization and classification of conversation data between service chatbot and customer
Mohnot et al. Hybrid approach for Part of Speech Tagger for Hindi language
Hua et al. A character-level method for text classification
CN116049437A (en) Element extraction method of document-level low-resource scene based on self-label and prompt
Jiang et al. A hierarchical bidirectional LSTM sequence model for extractive text summarization in electric power systems
CN112445887A (en) Method and device for realizing machine reading understanding system based on retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191015